BEACON: A Tool for Industry Self-Classification in the Economic Census

Dumbacher, Brian; Whitehead, Daniel; Jeong, Jiseok; Pfeiff, Sarah

doi:10.6339/25-JDS1180

Journal of Data Science

Volume 23, Issue 2 (2025): Special Issue: the 2024 Symposium on Data Science and Statistics (SDSS), pp. 429–448

Pub. online: 17 April 2025. Type: Data Science In Action. Open Access.

Received: 19 July 2024

Accepted: 20 March 2025

Published: 17 April 2025

Abstract

Business Establishment Automated Classification of NAICS (BEACON) is a text classification tool that helps respondents to the U.S. Census Bureau’s economic surveys self-classify their business activity in real time. The tool is based on rich training data, natural language processing, machine learning, and information retrieval. It is implemented using Python and an application programming interface. This paper describes BEACON’s methodology and successful application to the 2022 Economic Census, during which the tool was used over half a million times. BEACON has demonstrated that it recognizes a large vocabulary, quickly returns relevant results to respondents, and reduces clerical work associated with industry code assignment.

Supplementary material

The supplementary material consists of a Python program that implements a simplified version of BEACON. All of the methodological components are present, but the full text cleaning algorithm cannot be shared for confidentiality reasons. Likewise, the confidential data sources used by BEACON cannot be shared. The public data sources that are part of BEACON’s training data are available at the references cited. See https://github.com/uscensusbureau/BEACON for additional files and documentation.
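To give a feel for what a ranked short-text classifier of this general kind does, the following is a minimal sketch and not BEACON's actual implementation. It assumes a tiny invented training set, a simple regex-based text cleaner, and a word/code co-occurrence score; the names `TRAINING`, `clean`, `train`, and `rank` are hypothetical and do not come from the paper or its repository.

```python
import re
from collections import Counter, defaultdict

# Invented example descriptions paired with NAICS codes (illustrative only).
TRAINING = [
    ("full service restaurant", "722511"),
    ("family restaurant and bar", "722511"),
    ("pizza delivery and takeout", "722513"),
    ("fast food restaurant", "722513"),
    ("residential plumbing contractor", "238220"),
    ("heating and air conditioning repair", "238220"),
]

# Assumed stop-word list; a production cleaner would be far more thorough.
STOP_WORDS = {"and", "the", "of", "a"}


def clean(text):
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def train(examples):
    """Count, for each word, how many training descriptions of each code contain it."""
    word_code = defaultdict(Counter)
    for text, code in examples:
        for word in set(clean(text)):
            word_code[word][code] += 1
    return word_code


def rank(text, word_code, top_k=3):
    """Return up to top_k (code, score) pairs, best first, with scores summing to at most 1."""
    scores = Counter()
    for word in set(clean(text)):
        counts = word_code.get(word)
        if not counts:
            continue  # word not seen in training
        total = sum(counts.values())
        for code, n in counts.items():
            # Each known word distributes one unit of evidence across codes.
            scores[code] += n / total
    total_score = sum(scores.values())
    if total_score == 0:
        return []
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(code, s / total_score) for code, s in ranked[:top_k]]


model = train(TRAINING)
print(rank("pizza restaurant", model))
```

A respondent's write-in is cleaned, each recognized word contributes evidence toward the codes it co-occurred with in training, and the candidate codes come back as a ranked list rather than a single prediction. Unrecognized text yields an empty list.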

References

Aggarwal CC (2018). Machine Learning for Text. Springer International Publishing, Cham.

Baumgartner P, Smith A, Olmsted M, Ohse D (2021). A framework for using machine learning to support qualitative data coding. OSF Preprints. https://doi.org/10.31219/osf.io/fueyj

Bird S (2006). NLTK: The natural language toolkit. In: Proceedings of the COLING/ACL 2006 Interactive Presentation Sessions, 69–72.

Bishop CM (2013). Model-based machine learning. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 371(1984): 1–17. https://doi.org/10.1098/rsta.2012.0222

Bojanowski P, Grave E, Joulin A, Mikolov T (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5: 135–146. https://doi.org/10.1162/tacl_a_00051

Chu K, Poirier C (2015). Machine learning documentation initiative. United Nations Economic Commission for Europe. In: Conference of European Statisticians: Workshop on the Modernisation of Statistical Production Meeting. 15–17 April 2015, https://unece.org/fileadmin/DAM/stats/documents/ece/ces/ge.50/2015/Topic3_Canada_paper.pdf. [Online; accessed 15 March 2024].

Cuffe J, Bhattacharjee S, Etudo U, Smith JC, Basdeo N, Burbank N, et al. (2022). Using public data to generate industrial classification codes. In: Big Data for 21st Century Economic Statistics (K Abraham, R Jarmin, B Moyer, M Shapiro, eds.), volume 79 of National Bureau of Economic Research: Studies in Income and Wealth, chapter 8, 229–246. University of Chicago Press.

Dumbacher B, Russell A (2019). Using machine learning to assign North American industry classification system codes to establishments based on business description write-ins. In: 2019 Proceedings of the American Statistical Association, 1497–1514.

Dumbacher B, Whitehead D (2022). Industry self-classification in the Economic Census. In: 2022 Proceedings of the American Statistical Association, 1049–1064.

Dumbacher B, Whitehead D (2024). Ranked short text classification using co-occurrence features and score functions. U.S. Census Bureau ADEP Working Paper Series, (ADEP-WP-2024-06).

Džeroski S, Ženko B (2004). Is combining classifiers with stacking better than selecting the best one? Machine Learning, 54: 255–273. https://doi.org/10.1023/B:MACH.0000015881.36452.6e

Evans J, Oyarzun J (2021). Need for speed: Using fastText (machine learning) to code the Labour Force Survey. In: 2021 Proceedings of the Statistics Canada Symposium.

Figueiredo F, Rocha L, Couto T, Salles T, Gonçalves MA, Meira W Jr (2011). Word co-occurrence features for text classification. Information Systems, 36(5): 843–858. https://doi.org/10.1016/j.is.2011.02.002

Hastie T, Tibshirani R, Friedman J (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, New York.

Internal Revenue Service (2023). Form SS-4 application for Employer Identification Number. https://www.irs.gov/pub/irs-pdf/fss4.pdf. [Online; accessed 7 March 2024].

Jordan MI, Mitchell TM (2015). Machine learning: Trends, perspectives, and prospects. Science, 349(6245): 255–260. https://doi.org/10.1126/science.aaa8415

Jurafsky D, Martin JH (2009). Speech and Language Processing. Pearson Education, Inc., Upper Saddle River.

Kearney AT, Kornbau ME (2005). An automated industry coding application for new U.S. business establishments. In: 2005 Proceedings of the American Statistical Association, 867–874.

Kirkendall NK, White GD Jr, Citro CF, Abraham KG (2018). Reengineering the Census Bureau’s Annual Economic Surveys. National Academies Press, Washington, DC.

Kornbau ME (2016). Automating processes for the U.S. Census Bureau register. 25th Meeting of the Wiesbaden Group on Business Registers.

Mikolov T, Chen K, Corrado G, Dean J (2013). Efficient estimation of word representations in vector space. arXiv preprint: https://arxiv.org/abs/1301.3781

Mullainathan S, Spiess J (2017). Machine learning: An applied econometric approach. The Journal of Economic Perspectives, 31(2): 87–106. https://doi.org/10.1257/jep.31.2.87

Oehlert C, Schulz E, Parker A (2022). NAICS code prediction using supervised methods. Statistics and Public Policy, 9(1): 58–66. https://doi.org/10.1080/2330443X.2022.2033654

Oyarzun J (2018). The imitation game: An overview of a machine learning approach to code the industrial classification. In: 2018 Proceedings of the Statistics Canada Symposium.

Porter MF (2001). Snowball: A language for stemming algorithms. http://snowball.tartarus.org/texts/introduction.html. [Online; accessed 11 March 2024].

Rizinski M, Jankov A, Sankaradas V, Pinsky E, Mishkovski I, Trajanov D (2024). Comparative analysis of NLP-based models for company classification. Information, 15(77): 1–32. https://doi.org/10.3390/info15020077

Roberson A, Nguyen J (2018). Comparison of machine learning algorithms to build a predictive model for classification of survey write-in responses. In: Proceedings of the 2018 Federal Committee on Statistical Methodology (FCSM) Research and Policy Conference.

Roelands M, van Delden A, Windmeijer D (2018). Classifying Businesses by Economic Activity using Web-based Text Mining. Statistics Netherlands, Technical report.

Snijkers G, Haraldsen G, Jones J, Willimack DK (2013). Designing and Conducting Business Surveys. John Wiley & Sons, Inc., Hoboken.

Tan PN, Steinbach M, Karpatne A, Kumar V (2019). Introduction to Data Mining. Pearson Education, Inc., New York.

Tarnow-Mordi R (2017). The intelligent coder: Developing a machine-learning classification system. Methodological News. Australian Bureau of Statistics. https://www.abs.gov.au/ausstats/abs@.nsf/Previousproducts/1504.0Main%20Features5Sep%202017. [Online; accessed 8 March 2024].

Todorovski L, Džeroski S (2003). Combining classifiers with meta decision trees. Machine Learning, 50: 223–249. https://doi.org/10.1023/A:1021709817809

U.S. Census Bureau (2024a). Economic Census. https://www.census.gov/programs-surveys/economic-census.html. [Online; accessed 4 March 2024].

U.S. Census Bureau (2024b). Economic Census technical documentation. https://www.census.gov/programs-surveys/economic-census/technical-documentation.html. [Online; accessed 4 March 2024].

U.S. Census Bureau (2024c). Foreign trade reference codes. https://www.census.gov/foreign-trade/reference/codes/index.html. [Online; accessed 8 April 2024].

U.S. Census Bureau (2024d). North American Industry Classification System. https://www.census.gov/naics/. [Online; accessed 4 March 2024].

Whitehead D, Dumbacher B (2023). Ensemble modeling techniques for NAICS classification in the Economic Census. In: Proceedings of the 2023 Federal Committee on Statistical Methodology (FCSM) Research and Policy Conference.

2025 This is a U.S. Government work and not under copyright protection in the US; foreign copyright protection may apply. International copyright, 2025, U.S. Department of Commerce, U.S. Government. Published by the School of Statistics and the Center for Applied Statistics, Renmin University of China. Open access article.

Open access article under the CC BY license.

Keywords

Economic Census; machine learning; NAICS; ranked text classification; short text
