<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.0 20120330//EN" "JATS-journalpublishing1.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">JDS</journal-id>
<journal-title-group><journal-title>Journal of Data Science</journal-title></journal-title-group>
<issn pub-type="epub">1683-8602</issn><issn pub-type="ppub">1680-743X</issn><issn-l>1680-743X</issn-l>
<publisher>
<publisher-name>School of Statistics, Renmin University of China</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">JDS1147</article-id>
<article-id pub-id-type="doi">10.6339/24-JDS1147</article-id>
<article-categories><subj-group subj-group-type="heading">
<subject>Data Science in Action</subject></subj-group></article-categories>
<title-group>
<article-title>Bringing Search to the Economic Census – The NAPCS Classification Tool<xref ref-type="fn" rid="j_jds1147_fn_001"><sup>✩</sup></xref></article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name><surname>Knappenberger</surname><given-names>Clayton</given-names></name><email xlink:href="mailto:clayton.g.knappenberger@census.gov">clayton.g.knappenberger@census.gov</email><xref ref-type="aff" rid="j_jds1147_aff_001">1</xref>
</contrib>
<aff id="j_jds1147_aff_001"><label>1</label><institution>U.S. Census Bureau</institution>, 4600 Silver Hill Road, Washington, DC 20233, <country>USA</country></aff>
</contrib-group>
<author-notes>
<fn id="j_jds1147_fn_001"><label>✩</label>
<p>Any opinions and conclusions expressed herein are those of the author(s) and do not reflect the views of the U.S. Census Bureau. The Census Bureau has reviewed this data product to ensure appropriate access, use, and disclosure avoidance protection of the confidential source data (Project No. P-7504847), Disclosure Review Board (DRB) approval number: CBDRB-FY23-EWD001-002.</p></fn>
</author-notes>
<pub-date pub-type="ppub"><year>2024</year></pub-date><pub-date pub-type="epub"><day>7</day><month>8</month><year>2024</year></pub-date><volume>22</volume><issue>3</issue><fpage>409</fpage><lpage>422</lpage><supplementary-material id="S1" content-type="archive" xlink:href="jds1147_s001.zip" mimetype="application" mime-subtype="x-zip-compressed">
<caption>
<title>Supplementary Material</title>
<p>SINCT code.</p>
</caption>
</supplementary-material><history><date date-type="received"><day>30</day><month>11</month><year>2023</year></date><date date-type="accepted"><day>5</day><month>7</month><year>2024</year></date></history>
<permissions><copyright-statement>2024 The Author(s). Published by the School of Statistics and the Center for Applied Statistics, Renmin University of China.</copyright-statement><copyright-year>2024</copyright-year>
<license license-type="open-access" xlink:href="https://creativecommons.org/licenses/by/4.0/">
<license-p>Open access article under the <ext-link ext-link-type="uri" xlink:href="https://creativecommons.org/licenses/by/4.0/">CC BY</ext-link> license.</license-p></license></permissions>
<abstract>
<p>The North American Product Classification System (NAPCS) was first introduced in the 2017 Economic Census and provides greater detail on the range of products and services offered by businesses than was previously available from an industry code alone. In the 2022 Economic Census, NAPCS consisted of 7,234 codes, and respondents often found that they were unable to identify the correct NAPCS codes for their business, leaving written descriptions of their products and services instead. Over one million of these write-in descriptions had to be reviewed by Census analysts in the 2017 Economic Census. The Smart Instrument NAPCS Classification Tool (SINCT) offers respondents a low-latency search engine to find appropriate NAPCS codes based on a written description of their products and services. SINCT uses a neural network document embedding model (doc2vec) to embed respondent searches in a numerical space and then identifies NAPCS codes that are close to the search text. This paper shows one way in which machine learning can improve the survey respondent experience and reduce the amount of expensive manual processing that is necessary after data collection. We also show how relatively simple tools can achieve an estimated 72% top-ten accuracy with thousands of possible classes, limited training data, and strict latency requirements.</p>
</abstract>
<kwd-group>
<label>Keywords</label>
<kwd>natural language processing</kwd>
<kwd>neural networks</kwd>
<kwd>search</kwd>
<kwd>survey collection</kwd>
</kwd-group>
</article-meta>
</front>
<back>
<ref-list id="j_jds1147_reflist_001">
<title>References</title>
<ref id="j_jds1147_ref_001">
<mixed-citation publication-type="book"> <string-name><surname>Büttcher</surname> <given-names>S</given-names></string-name>, <string-name><surname>Clarke</surname> <given-names>C</given-names></string-name>, <string-name><surname>Cormack</surname> <given-names>G</given-names></string-name> (<year>2016</year>). <source><italic>Information Retrieval: Implementing and Evaluating Search Engines</italic></source>. <publisher-name>The MIT Press</publisher-name>, <publisher-loc>Cambridge, MA</publisher-loc>.</mixed-citation>
</ref>
<ref id="j_jds1147_ref_002">
<mixed-citation publication-type="journal"> <string-name><surname>Chen</surname> <given-names>B</given-names></string-name>, <string-name><surname>Creecy</surname> <given-names>R</given-names></string-name>, <string-name><surname>Appel</surname> <given-names>M</given-names></string-name> (<year>1993</year>). <article-title>Error control of automated industry and occupation coding</article-title>. <source><italic>Journal of Official Statistics</italic></source>, <volume>9</volume>(<issue>4</issue>): <fpage>729</fpage>–<lpage>745</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1147_ref_003">
<mixed-citation publication-type="chapter"> <string-name><surname>Devlin</surname> <given-names>J</given-names></string-name>, <string-name><surname>Chang</surname> <given-names>M</given-names></string-name>, <string-name><surname>Lee</surname> <given-names>K</given-names></string-name>, <string-name><surname>Toutanova</surname> <given-names>K</given-names></string-name> (<year>2019</year>). <chapter-title>BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding</chapter-title>. In: <source><italic>Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics</italic></source> (<string-name><given-names>J</given-names> <surname>Burstein</surname></string-name>, <string-name><given-names>C</given-names> <surname>Doran</surname></string-name>, <string-name><given-names>T</given-names> <surname>Solorio</surname></string-name>, eds.), <fpage>4171</fpage>–<lpage>4186</lpage>. <publisher-name>Association for Computational Linguistics</publisher-name>, <publisher-loc>Minneapolis, MN</publisher-loc>.</mixed-citation>
</ref>
<ref id="j_jds1147_ref_004">
<mixed-citation publication-type="other"> <string-name><surname>Dumbacher</surname> <given-names>B</given-names></string-name>, <string-name><surname>Whitehead</surname> <given-names>D</given-names></string-name> (<year>2024</year>). Industry self-classification in the economic census. Accessed: July 11, 2024.</mixed-citation>
</ref>
<ref id="j_jds1147_ref_005">
<mixed-citation publication-type="journal"> <string-name><surname>Graves</surname> <given-names>A</given-names></string-name>, <string-name><surname>Schmidhuber</surname> <given-names>J</given-names></string-name> (<year>2005</year>). <article-title>Framewise phoneme classification with bidirectional LSTM and other neural network architectures</article-title>. <source><italic>Neural Networks</italic></source>, <volume>18</volume>: <fpage>602</fpage>–<lpage>610</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1016/j.neunet.2005.06.042" xlink:type="simple">https://doi.org/10.1016/j.neunet.2005.06.042</ext-link></mixed-citation>
</ref>
<ref id="j_jds1147_ref_006">
<mixed-citation publication-type="book"> <string-name><surname>Hastie</surname> <given-names>T</given-names></string-name>, <string-name><surname>Friedman</surname> <given-names>J</given-names></string-name>, <string-name><surname>Tibshirani</surname> <given-names>R</given-names></string-name> (<year>2011</year>). <source><italic>The Elements of Statistical Learning: Data Mining, Inference, and Prediction</italic></source>. <publisher-name>Springer</publisher-name>, <publisher-loc>New York, NY</publisher-loc>.</mixed-citation>
</ref>
<ref id="j_jds1147_ref_007">
<mixed-citation publication-type="chapter"> <string-name><surname>Le</surname> <given-names>Q</given-names></string-name>, <string-name><surname>Mikolov</surname> <given-names>T</given-names></string-name> (<year>2014</year>). <chapter-title>Distributed representations of sentences and documents</chapter-title>. In: <source><italic>Proceedings of the 31st International Conference on Machine Learning</italic></source> (<string-name><given-names>E</given-names> <surname>Xing</surname></string-name>, <string-name><given-names>T</given-names> <surname>Jebara</surname></string-name>, eds.), volume <volume>32</volume>, <fpage>1188</fpage>–<lpage>1196</lpage>. <publisher-name>Proceedings of Machine Learning Research</publisher-name>, <publisher-loc>Beijing, China</publisher-loc>.</mixed-citation>
</ref>
<ref id="j_jds1147_ref_008">
<mixed-citation publication-type="chapter"> <string-name><surname>LeCun</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Boser</surname> <given-names>B</given-names></string-name>, <string-name><surname>Denker</surname> <given-names>J</given-names></string-name>, <string-name><surname>Henderson</surname> <given-names>D</given-names></string-name>, <string-name><surname>Howard</surname> <given-names>R</given-names></string-name>, <string-name><surname>Hubbard</surname> <given-names>W</given-names></string-name>, <etal>et al.</etal> (<year>1990</year>). <chapter-title>Handwritten digit recognition with a back-propagation network</chapter-title>. In: <source><italic>Advances in Neural Information Processing Systems 2, NIPS 1989</italic></source>, <fpage>396</fpage>–<lpage>404</lpage>. <publisher-name>Morgan Kaufmann Publishers</publisher-name>. <comment>1989</comment>.</mixed-citation>
</ref>
<ref id="j_jds1147_ref_009">
<mixed-citation publication-type="other"> <string-name><surname>Measure</surname> <given-names>A</given-names></string-name> (<year>2017</year>). Deep neural networks for worker injury autocoding. Accessed: April 3, 2023.</mixed-citation>
</ref>
<ref id="j_jds1147_ref_010">
<mixed-citation publication-type="other"> <string-name><surname>Mikolov</surname> <given-names>T</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>K</given-names></string-name>, <string-name><surname>Corrado</surname> <given-names>G</given-names></string-name>, <string-name><surname>Dean</surname> <given-names>J</given-names></string-name> (<year>2013</year>a). Efficient estimation of word representations in vector space. arXiv preprint: <uri>https://arxiv.org/abs/1301.3781</uri>.</mixed-citation>
</ref>
<ref id="j_jds1147_ref_011">
<mixed-citation publication-type="other"> <string-name><surname>Mikolov</surname> <given-names>T</given-names></string-name>, <string-name><surname>Sutskever</surname> <given-names>I</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>K</given-names></string-name>, <string-name><surname>Corrado</surname> <given-names>G</given-names></string-name>, <string-name><surname>Dean</surname> <given-names>J</given-names></string-name> (<year>2013</year>b). Distributed representations of words and phrases and their compositionality. Accessed: April 3, 2023.</mixed-citation>
</ref>
<ref id="j_jds1147_ref_012">
<mixed-citation publication-type="book"> <string-name><surname>Mitchell</surname> <given-names>T</given-names></string-name> (<year>1997</year>). <source><italic>Machine Learning</italic></source> (<string-name><given-names>E</given-names> <surname>Munson</surname></string-name>, ed.). <publisher-name>McGraw-Hill</publisher-name>, <publisher-loc>New York, NY</publisher-loc>.</mixed-citation>
</ref>
<ref id="j_jds1147_ref_013">
<mixed-citation publication-type="chapter"> <string-name><surname>Moscardi</surname> <given-names>C</given-names></string-name>, <string-name><surname>Schultz</surname> <given-names>B</given-names></string-name> (<year>2023</year>). <chapter-title>Using machine learning to classify products for the commodity flow survey</chapter-title>. In: <source><italic>Advances in Business Statistics, Methods and Data Collection: Introduction</italic></source> (<string-name><given-names>G</given-names> <surname>Snijkers</surname></string-name>, <string-name><given-names>M</given-names> <surname>Bavdaž</surname></string-name>, <string-name><given-names>S</given-names> <surname>Bender</surname></string-name>, <string-name><given-names>J</given-names> <surname>Jones</surname></string-name>, <string-name><given-names>S</given-names> <surname>MacFeely</surname></string-name>, <string-name><given-names>J</given-names> <surname>Sakshaug</surname></string-name>, <string-name><given-names>K</given-names> <surname>Thompson</surname></string-name>, <string-name><given-names>A</given-names> <surname>van Delden</surname></string-name>, eds.), <fpage>573</fpage>–<lpage>591</lpage>. <publisher-name>Wiley Online Library</publisher-name>.</mixed-citation>
</ref>
<ref id="j_jds1147_ref_014">
<mixed-citation publication-type="other"> <string-name><surname>Office for National Statistics</surname></string-name> (<year>2023</year>). Automated text coding: Census 2021. Accessed: May 16, 2024.</mixed-citation>
</ref>
<ref id="j_jds1147_ref_015">
<mixed-citation publication-type="journal"> <string-name><surname>O’Reagan</surname> <given-names>R</given-names></string-name> (<year>1972</year>). <article-title>Computer assigned codes from verbal responses</article-title>. <source><italic>Communications of the ACM</italic></source>, <volume>15</volume>: <fpage>455</fpage>–<lpage>459</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1145/361405.361419" xlink:type="simple">https://doi.org/10.1145/361405.361419</ext-link></mixed-citation>
</ref>
<ref id="j_jds1147_ref_016">
<mixed-citation publication-type="chapter"> <string-name><surname>Řehůřek</surname> <given-names>R</given-names></string-name>, <string-name><surname>Sojka</surname> <given-names>P</given-names></string-name> (<year>2010</year>). <chapter-title>Software framework for topic modeling with large corpora</chapter-title>. In: <source><italic>Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks</italic></source>, <fpage>45</fpage>–<lpage>50</lpage>. <publisher-name>ELRA</publisher-name>, <publisher-loc>Valletta, Malta</publisher-loc>.</mixed-citation>
</ref>
<ref id="j_jds1147_ref_017">
<mixed-citation publication-type="journal"> <string-name><surname>Roberson</surname> <given-names>A</given-names></string-name> (<year>2021</year>). <article-title>Applying machine learning for automatic product categorization</article-title>. <source><italic>Journal of Official Statistics</italic></source>, <volume>37</volume>(<issue>2</issue>): <fpage>395</fpage>–<lpage>410</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.2478/jos-2021-0017" xlink:type="simple">https://doi.org/10.2478/jos-2021-0017</ext-link></mixed-citation>
</ref>
<ref id="j_jds1147_ref_018">
<mixed-citation publication-type="chapter"> <string-name><surname>Roberson</surname> <given-names>A</given-names></string-name>, <string-name><surname>Nguyen</surname> <given-names>J</given-names></string-name> (<year>2018</year>). <chapter-title>Comparison of machine learning algorithms to build a predictive model for classification of survey write-in responses</chapter-title>. In: <source><italic>2018 Proceedings of the Federal Committee on Statistical Methodology (FCSM) Research Conference</italic></source>. <publisher-name>FCSM</publisher-name>, <publisher-loc>Washington, DC</publisher-loc>.</mixed-citation>
</ref>
<ref id="j_jds1147_ref_019">
<mixed-citation publication-type="other"> <string-name><surname>Srivastava</surname> <given-names>R</given-names></string-name>, <string-name><surname>Greff</surname> <given-names>K</given-names></string-name>, <string-name><surname>Schmidhuber</surname> <given-names>J</given-names></string-name> (<year>2015</year>). Highway networks. Accessed: May 16, 2024.</mixed-citation>
</ref>
<ref id="j_jds1147_ref_020">
<mixed-citation publication-type="other"> <string-name><surname>United States Bureau of Labor Statistics</surname></string-name> (<year>2023</year>). Automatic coding of injury and illness data. Accessed: April 10, 2022.</mixed-citation>
</ref>
<ref id="j_jds1147_ref_021">
<mixed-citation publication-type="other"> <string-name><surname>United States Census Bureau</surname></string-name> (<year>2022</year>). About the economic census. Accessed: April 10, 2022.</mixed-citation>
</ref>
<ref id="j_jds1147_ref_022">
<mixed-citation publication-type="chapter"> <string-name><surname>Vaswani</surname> <given-names>A</given-names></string-name>, <string-name><surname>Shazeer</surname> <given-names>N</given-names></string-name>, <string-name><surname>Parmar</surname> <given-names>N</given-names></string-name>, <string-name><surname>Uszkoreit</surname> <given-names>J</given-names></string-name>, <string-name><surname>Jones</surname> <given-names>L</given-names></string-name>, <string-name><surname>Gomez</surname> <given-names>A</given-names></string-name>, <etal>et al.</etal> (<year>2017</year>). <chapter-title>Attention is all you need</chapter-title>. In: <source><italic>Proceedings of the 31st Conference on Neural Information Processing Systems</italic></source> (<string-name><given-names>I</given-names> <surname>Guyon</surname></string-name>, <string-name><given-names>U</given-names> <surname>Von Luxburg</surname></string-name>, <string-name><given-names>S</given-names> <surname>Bengio</surname></string-name>, <string-name><given-names>H</given-names> <surname>Wallach</surname></string-name>, <string-name><given-names>R</given-names> <surname>Fergus</surname></string-name>, <string-name><given-names>S</given-names> <surname>Vishwanathan</surname></string-name>, <string-name><given-names>R</given-names> <surname>Garnett</surname></string-name>, eds.). <publisher-name>Curran Associates, Inc.</publisher-name>, <publisher-loc>Long Beach, CA</publisher-loc>.</mixed-citation>
</ref>
<ref id="j_jds1147_ref_023">
<mixed-citation publication-type="chapter"> <string-name><surname>Wiley</surname> <given-names>E</given-names></string-name>, <string-name><surname>Whitehead</surname> <given-names>D</given-names></string-name> (<year>2022</year>). <chapter-title>Implementing interactive classification tools in the 2022 economic census</chapter-title>. In: <source><italic>2022 Proceedings of the Federal Committee on Statistical Methodology Research and Policy Conference</italic></source>. <publisher-name>FCSM</publisher-name>, <publisher-loc>Washington, DC</publisher-loc>.</mixed-citation>
</ref>
</ref-list>
</back>
</article>
