Journal of Data Science


BEACON: A Tool for Industry Self-Classification in the Economic Census
Brian Dumbacher, Daniel Whitehead, Jiseok Jeong, et al. (4 authors)

https://doi.org/10.6339/25-JDS1180
Pub. online: 17 April 2025      Type: Data Science In Action      Open Access

Received: 19 July 2024
Accepted: 20 March 2025
Published: 17 April 2025

Abstract

Business Establishment Automated Classification of NAICS (BEACON) is a text classification tool that helps respondents to the U.S. Census Bureau’s economic surveys self-classify their business activity in real time. The tool is based on rich training data, natural language processing, machine learning, and information retrieval. It is implemented using Python and an application programming interface. This paper describes BEACON’s methodology and successful application to the 2022 Economic Census, during which the tool was used over half a million times. BEACON has demonstrated that it recognizes a large vocabulary, quickly returns relevant results to respondents, and reduces clerical work associated with industry code assignment.

Supplementary material

The supplementary material consists of a Python program that implements a simplified version of BEACON. All of the methodological components are present, but the full text cleaning algorithm cannot be shared for confidentiality reasons. Likewise, the confidential data sources used by BEACON cannot be shared. The public data sources that are part of BEACON’s training data are available at the references cited. See https://github.com/uscensusbureau/BEACON for additional files and documentation.
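
As a rough illustration of what a ranked short-text classifier of this general kind involves, the toy sketch below cleans a business description, counts how often each word co-occurs with each industry code in a tiny training set, and returns codes ranked by score. It is not BEACON's code: the training examples, the stop-word list, the scoring rule, and the function names `clean`, `train`, and `rank` are all assumptions made for illustration, and the actual program is the one in the repository linked above.

```python
import re
from collections import Counter, defaultdict

# Hypothetical toy training data: (business description, NAICS code).
# These pairs are invented for illustration, not taken from BEACON.
TRAINING = [
    ("full service restaurant", "722511"),
    ("family restaurant and bar", "722511"),
    ("auto repair shop", "811111"),
    ("car repair and oil change", "811111"),
    ("grocery store", "445110"),
]

# Assumed minimal stop-word list; a real cleaner would be far richer.
STOP_WORDS = {"and", "the", "of", "a"}


def clean(text):
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def train(examples):
    """Count, for each word, how many descriptions of each code contain it."""
    counts = defaultdict(Counter)
    for text, code in examples:
        for word in set(clean(text)):
            counts[word][code] += 1
    return counts


def rank(counts, query, top_k=3):
    """Return up to top_k (code, score) pairs, highest score first.

    Each known query word adds to a code's score the share of that
    word's training occurrences that fall under the code.
    """
    scores = Counter()
    for word in set(clean(query)):
        word_counts = counts.get(word)
        if not word_counts:
            continue  # word not seen in training
        total = sum(word_counts.values())
        for code, n in word_counts.items():
            scores[code] += n / total
    # Sort by descending score, breaking ties by code for determinism.
    ordered = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ordered[:top_k]


if __name__ == "__main__":
    model = train(TRAINING)
    print(rank(model, "Car repair"))
```

A query made up only of words absent from the training data yields an empty list, which is one simple way to signal that the respondent should rephrase the description.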

References

 
Aggarwal CC (2018). Machine Learning for Text. Springer International Publishing, Cham.
 
Baumgartner P, Smith A, Olmsted M, Ohse D (2021). A framework for using machine learning to support qualitative data coding. OSF Preprints. https://doi.org/10.31219/osf.io/fueyj
 
Bird S (2006). NLTK: The natural language toolkit. In: Proceedings of the COLING/ACL 2006 Interactive Presentation Sessions, 69–72.
 
Bishop CM (2013). Model-based machine learning. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 371(1984): 1–17. https://doi.org/10.1098/rsta.2012.0222
 
Bojanowski P, Grave E, Joulin A, Mikolov T (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5: 135–146. https://doi.org/10.1162/tacl_a_00051
 
Chu K, Poirier C (2015). Machine learning documentation initiative. United Nations Economic Commission for Europe. In: Conference of European Statisticians: Workshop on the Modernisation of Statistical Production Meeting. 15–17 April 2015, https://unece.org/fileadmin/DAM/stats/documents/ece/ces/ge.50/2015/Topic3_Canada_paper.pdf. [Online; accessed 15 March 2024].
 
Cuffe J, Bhattacharjee S, Etudo U, Smith JC, Basdeo N, Burbank N, et al. (2022). Using public data to generate industrial classification codes. In: Big Data for 21st Century Economic Statistics (K Abraham, R Jarmin, B Moyer, M Shapiro, eds.), volume 79 of National Bureau of Economic Research: Studies in Income and Wealth, chapter 8, 229–246. University of Chicago Press.
 
Dumbacher B, Russell A (2019). Using machine learning to assign North American industry classification system codes to establishments based on business description write-ins. In: 2019 Proceedings of the American Statistical Association, 1497–1514.
 
Dumbacher B, Whitehead D (2022). Industry self-classification in the Economic Census. In: 2022 Proceedings of the American Statistical Association, 1049–1064.
 
Dumbacher B, Whitehead D (2024). Ranked short text classification using co-occurrence features and score functions. U.S. Census Bureau ADEP Working Paper Series, (ADEP-WP-2024-06).
 
Džeroski S, Ženko B (2004). Is combining classifiers with stacking better than selecting the best one? Machine Learning, 54: 255–273. https://doi.org/10.1023/B:MACH.0000015881.36452.6e
 
Evans J, Oyarzun J (2021). Need for speed: Using fastText (machine learning) to code the Labour Force Survey. In: 2021 Proceedings of the Statistics Canada Symposium.
 
Figueiredo F, Rocha L, Couto T, Salles T, Gonçalves MA, Meira W Jr (2011). Word co-occurrence features for text classification. Information Systems, 36(5): 843–858. https://doi.org/10.1016/j.is.2011.02.002
 
Hastie T, Tibshirani R, Friedman J (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, New York.
 
Internal Revenue Service (2023). Form SS-4 application for Employer Identification Number. https://www.irs.gov/pub/irs-pdf/fss4.pdf. [Online; accessed 7 March 2024].
 
Jordan MI, Mitchell TM (2015). Machine learning: Trends, perspectives, and prospects. Science, 349(6245): 255–260. https://doi.org/10.1126/science.aaa8415
 
Jurafsky D, Martin JH (2009). Speech and Language Processing. Pearson Education, Inc., Upper Saddle River.
 
Kearney AT, Kornbau ME (2005). An automated industry coding application for new U.S. business establishments. In: 2005 Proceedings of the American Statistical Association, 867–874.
 
Kirkendall NK, White GD Jr, Citro CF, Abraham KG (2018). Reengineering the Census Bureau’s Annual Economic Surveys. National Academies Press, Washington, DC.
 
Kornbau ME (2016). Automating processes for the U.S. Census Bureau register. 25th Meeting of the Wiesbaden Group on Business Registers.
 
Mikolov T, Chen K, Corrado G, Dean J (2013). Efficient estimation of word representations in vector space. arXiv preprint: https://arxiv.org/abs/1301.3781
 
Mullainathan S, Spiess J (2017). Machine learning: An applied econometric approach. The Journal of Economic Perspectives, 31(2): 87–106. https://doi.org/10.1257/jep.31.2.87
 
Oehlert C, Schulz E, Parker A (2022). NAICS code prediction using supervised methods. Statistics and Public Policy, 9(1): 58–66. https://doi.org/10.1080/2330443X.2022.2033654
 
Oyarzun J (2018). The imitation game: An overview of a machine learning approach to code the industrial classification. In: 2018 Proceedings of the Statistics Canada Symposium.
 
Porter MF (2001). Snowball: A language for stemming algorithms. http://snowball.tartarus.org/texts/introduction.html. [Online; accessed 11 March 2024].
 
Rizinski M, Jankov A, Sankaradas V, Pinsky E, Mishkovski I, Trajanov D (2024). Comparative analysis of NLP-based models for company classification. Information, 15(77): 1–32. https://doi.org/10.3390/info15020077
 
Roberson A, Nguyen J (2018). Comparison of machine learning algorithms to build a predictive model for classification of survey write-in responses. In: Proceedings of the 2018 Federal Committee on Statistical Methodology (FCSM) Research and Policy Conference.
 
Roelands M, van Delden A, Windmeijer D (2018). Classifying Businesses by Economic Activity using Web-based Text Mining. Statistics Netherlands, Technical report.
 
Snijkers G, Haraldsen G, Jones J, Willimack DK (2013). Designing and Conducting Business Surveys. John Wiley & Sons, Inc., Hoboken.
 
Tan PN, Steinbach M, Karpatne A, Kumar V (2019). Introduction to Data Mining. Pearson Education, Inc., New York.
 
Tarnow-Mordi R (2017). The intelligent coder: Developing a machine-learning classification system. Methodological News. Australian Bureau of Statistics. https://www.abs.gov.au/ausstats/abs@.nsf/Previousproducts/1504.0Main%20Features5Sep%202017. [Online; accessed 8 March 2024].
 
Todorovski L, Džeroski S (2003). Combining classifiers with meta decision trees. Machine Learning, 50: 223–249. https://doi.org/10.1023/A:1021709817809
 
U.S. Census Bureau (2024a). Economic Census. https://www.census.gov/programs-surveys/economic-census.html. [Online; accessed 4 March 2024].
 
U.S. Census Bureau (2024b). Economic Census technical documentation. https://www.census.gov/programs-surveys/economic-census/technical-documentation.html. [Online; accessed 4 March 2024].
 
U.S. Census Bureau (2024c). Foreign trade reference codes. https://www.census.gov/foreign-trade/reference/codes/index.html. [Online; accessed 8 April 2024].
 
U.S. Census Bureau (2024d). North American Industry Classification System. https://www.census.gov/naics/. [Online; accessed 4 March 2024].
 
Whitehead D, Dumbacher B (2023). Ensemble modeling techniques for NAICS classification in the Economic Census. In: Proceedings of the 2023 Federal Committee on Statistical Methodology (FCSM) Research and Policy Conference.


Copyright
2025 This is a U.S. Government work and not under copyright protection in the US; foreign copyright protection may apply. International copyright, 2025, U.S. Department of Commerce, U.S. Government. Published by the School of Statistics and the Center for Applied Statistics, Renmin University of China. Open access article.

Keywords
Economic Census; machine learning; NAICS; ranked text classification; short text

Journal of data science

  • Online ISSN: 1683-8602
  • Print ISSN: 1680-743X
