BEACON: A Tool for Industry Self-Classification in the Economic Census
Pub. online: 17 April 2025
Type: Data Science In Action
Open Access
Received: 19 July 2024
Accepted: 20 March 2025
Published: 17 April 2025
Abstract
Business Establishment Automated Classification of NAICS (BEACON) is a text classification tool that helps respondents to the U.S. Census Bureau’s economic surveys self-classify their business activity in real time. The tool is based on rich training data, natural language processing, machine learning, and information retrieval. It is implemented using Python and an application programming interface. This paper describes BEACON’s methodology and successful application to the 2022 Economic Census, during which the tool was used over half a million times. BEACON has demonstrated that it recognizes a large vocabulary, quickly returns relevant results to respondents, and reduces clerical work associated with industry code assignment.
Supplementary material
The supplementary material consists of a Python program that implements a simplified version of BEACON. All of the methodological components are present, but the full text cleaning algorithm cannot be shared for confidentiality reasons. Likewise, the confidential data sources used by BEACON cannot be shared. The public data sources that are part of BEACON’s training data are available at the references cited. See https://github.com/uscensusbureau/BEACON for additional files and documentation.
References
Baumgartner P, Smith A, Olmsted M, Ohse D (2021). A framework for using machine learning to support qualitative data coding. OSF Preprints. https://doi.org/10.31219/osf.io/fueyj
Bishop CM (2013). Model-based machine learning. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 371(1984): 1–17. https://doi.org/10.1098/rsta.2012.0222
Bojanowski P, Grave E, Joulin A, Mikolov T (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5: 135–146. https://doi.org/10.1162/tacl_a_00051
Chu K, Poirier C (2015). Machine learning documentation initiative. United Nations Economic Commission for Europe. In: Conference of European Statisticians: Workshop on the Modernisation of Statistical Production Meeting. 15–17 April 2015, https://unece.org/fileadmin/DAM/stats/documents/ece/ces/ge.50/2015/Topic3_Canada_paper.pdf. [Online; accessed 15 March 2024].
Cuffe J, Bhattacharjee S, Etudo U, Smith JC, Basdeo N, Burbank N, et al. (2022). Using public data to generate industrial classification codes. In: Big Data for 21st Century Economic Statistics (K Abraham, R Jarmin, B Moyer, M Shapiro, eds.), volume 79 of National Bureau of Economic Research: Studies in Income and Wealth, chapter 8, 229–246. University of Chicago Press.
Džeroski S, Ženko B (2004). Is combining classifiers with stacking better than selecting the best one? Machine Learning, 54: 255–273. https://doi.org/10.1023/B:MACH.0000015881.36452.6e
Figueiredo F, Rocha L, Couto T, Salles T, Gonçalves MA, Meira W Jr (2011). Word co-occurrence features for text classification. Information Systems, 36(5): 843–858. https://doi.org/10.1016/j.is.2011.02.002
Internal Revenue Service (2023). Form SS-4 application for Employer Identification Number. https://www.irs.gov/pub/irs-pdf/fss4.pdf. [Online; accessed 7 March 2024].
Jordan MI, Mitchell TM (2015). Machine learning: Trends, perspectives, and prospects. Science, 349(6245): 255–260. https://doi.org/10.1126/science.aaa8415
Mikolov T, Chen K, Corrado G, Dean J (2013). Efficient estimation of word representations in vector space. arXiv preprint: https://arxiv.org/abs/1301.3781
Mullainathan S, Spiess J (2017). Machine learning: An applied econometric approach. The Journal of Economic Perspectives, 31(2): 87–106. https://doi.org/10.1257/jep.31.2.87
Oehlert C, Schulz E, Parker A (2022). NAICS code prediction using supervised methods. Statistics and Public Policy, 9(1): 58–66. https://doi.org/10.1080/2330443X.2022.2033654
Porter MF (2001). Snowball: A language for stemming algorithms. http://snowball.tartarus.org/texts/introduction.html. [Online; accessed 11 March 2024].
Rizinski M, Jankov A, Sankaradas V, Pinsky E, Mishkovski I, Trajanov D (2024). Comparative analysis of NLP-based models for company classification. Information, 15(77): 1–32. https://doi.org/10.3390/info15020077
Tarnow-Mordi R (2017). The intelligent coder: Developing a machine-learning classification system. Methodological News. Australian Bureau of Statistics. https://www.abs.gov.au/ausstats/abs@.nsf/Previousproducts/1504.0Main%20Features5Sep%202017. [Online; accessed 8 March 2024].
Todorovski L, Džeroski S (2003). Combining classifiers with meta decision trees. Machine Learning, 50: 223–249. https://doi.org/10.1023/A:1021709817809
U.S. Census Bureau (2024a). Economic Census. https://www.census.gov/programs-surveys/economic-census.html. [Online; accessed 4 March 2024].
U.S. Census Bureau (2024b). Economic Census technical documentation. https://www.census.gov/programs-surveys/economic-census/technical-documentation.html. [Online; accessed 4 March 2024].
U.S. Census Bureau (2024c). Foreign trade reference codes. https://www.census.gov/foreign-trade/reference/codes/index.html. [Online; accessed 8 April 2024].
U.S. Census Bureau (2024d). North American Industry Classification System. https://www.census.gov/naics/. [Online; accessed 4 March 2024].