<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.0 20120330//EN" "JATS-journalpublishing1.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">JDS</journal-id>
<journal-title-group><journal-title>Journal of Data Science</journal-title></journal-title-group>
<issn pub-type="epub">1683-8602</issn><issn pub-type="ppub">1680-743X</issn><issn-l>1680-743X</issn-l>
<publisher>
<publisher-name>School of Statistics, Renmin University of China</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">JDS1087</article-id>
<article-id pub-id-type="doi">10.6339/23-JDS1087</article-id>
<article-categories><subj-group subj-group-type="heading">
<subject>Data Science in Action</subject></subj-group></article-categories>
<title-group>
<article-title>Identifying Drone Web Sites in Multiple Countries and Languages with a Single Model</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<contrib-id contrib-id-type="orcid">https://orcid.org/0000-0002-1541-0315</contrib-id>
<name><surname>Daas</surname><given-names>Piet</given-names></name><email xlink:href="mailto:p.j.h.daas@tue.nl">p.j.h.daas@tue.nl</email><xref ref-type="aff" rid="j_jds1087_aff_001">1</xref><xref ref-type="aff" rid="j_jds1087_aff_002">2</xref><xref ref-type="corresp" rid="cor1">∗</xref><xref ref-type="fn" rid="j_jds1087_fn_001">†</xref>
</contrib>
<contrib contrib-type="author">
<contrib-id contrib-id-type="orcid">https://orcid.org/0000-0002-1267-6070</contrib-id>
<name><surname>de Miguel</surname><given-names>Blanca</given-names></name><xref ref-type="aff" rid="j_jds1087_aff_003">3</xref>
</contrib>
<contrib contrib-type="author">
<contrib-id contrib-id-type="orcid">https://orcid.org/0000-0003-4264-8000</contrib-id>
<name><surname>de Miguel</surname><given-names>Maria</given-names></name><xref ref-type="aff" rid="j_jds1087_aff_003">3</xref>
</contrib>
<aff id="j_jds1087_aff_001"><label>1</label>De Groene Loper 5, 5612AZ Eindhoven, <institution>Eindhoven University of Technology</institution>, <country>the Netherlands</country></aff>
<aff id="j_jds1087_aff_002"><label>2</label>CBS-weg 11, 6412EX, Heerlen, <institution>Statistics Netherlands</institution>, <country>the Netherlands</country></aff>
<aff id="j_jds1087_aff_003"><label>3</label>Camí de Vera, s/n 46022 Valencia, <institution>Universitat Politècnica de València</institution>, <country>Spain</country></aff>
</contrib-group>
<author-notes>
<corresp id="cor1"><label>∗</label>Corresponding author. Email: <ext-link ext-link-type="uri" xlink:href="mailto:p.j.h.daas@tue.nl">p.j.h.daas@tue.nl</ext-link>.</corresp><fn id="j_jds1087_fn_001"><label>†</label>
<p>The views expressed in this paper are those of the authors and do not necessarily reflect the policies of Statistics Netherlands.</p></fn>
</author-notes>
<pub-date pub-type="ppub"><year>2023</year></pub-date><pub-date pub-type="epub"><day>26</day><month>1</month><year>2023</year></pub-date><volume>21</volume><issue>2</issue><fpage>225</fpage><lpage>238</lpage><history><date date-type="received"><day>25</day><month>7</month><year>2022</year></date><date date-type="accepted"><day>17</day><month>1</month><year>2023</year></date></history>
<permissions><copyright-statement>2023 The Author(s). Published by the School of Statistics and the Center for Applied Statistics, Renmin University of China.</copyright-statement><copyright-year>2023</copyright-year>
<license license-type="open-access" xlink:href="https://creativecommons.org/licenses/by/4.0/">
<license-p>Open access article under the <ext-link ext-link-type="uri" xlink:href="https://creativecommons.org/licenses/by/4.0/">CC BY</ext-link> license.</license-p></license></permissions>
<abstract>
<p>A text-based bag-of-words model was developed to identify drone company websites in multiple European countries and languages. A collection of Spanish drone and non-drone websites was used for initial model development, and various classification methods were compared. Supervised logistic regression (L2-norm) performed best, with an accuracy of 87% on the unseen test set. The accuracy of the latter model improved to 88% when it was trained on texts in which all Spanish words were translated into English. Retraining the model on texts from which all typically Spanish words, such as names of cities and regions, and words indicative of specific periods in time, such as the months of the year and days of the week, had been removed did not affect the overall performance of the model and made it more generally applicable. Applying the cleaned, completely English word-based model to a collection of Irish and Italian drone and non-drone websites revealed, after manual inspection, that it was able to detect drone websites in those countries with an accuracy of 82% and 86%, respectively. The classification of Italian texts required the creation of a translation list in which all 1560 English word-based features in the model were translated into their Italian analogs. Because the model had a very high recall (93%, 100%, and 97% on Spanish, Irish, and Italian drone websites, respectively), it was particularly well suited to selecting potential drone websites from large collections of websites.</p>
</abstract>
<kwd-group>
<label>Keywords</label>
<kwd>bag of words</kwd>
<kwd>classification model</kwd>
<kwd>multiple languages</kwd>
<kwd>text</kwd>
</kwd-group>
<funding-group><award-group><funding-source xlink:href="https://doi.org/10.13039/501100013214">Eurostat</funding-source><award-id>2018.0086</award-id></award-group><funding-statement>This research was performed as part of the study “Web intelligence for measuring emerging economic trends: the drone industry” led by GOPA under the framework contract on Methodological Support (Ref. 2018.0086) for Eurostat. </funding-statement></funding-group>
</article-meta>
</front>
<back>
<ref-list id="j_jds1087_reflist_001">
<title>References</title>
<ref id="j_jds1087_ref_001">
<mixed-citation publication-type="book"> <string-name><surname>Aggarwal</surname> <given-names>C</given-names></string-name> (<year>2016</year>). <source><italic>Data Mining: The Textbook</italic></source>. <publisher-name>Springer</publisher-name>, <publisher-loc>New York</publisher-loc>.</mixed-citation>
</ref>
<ref id="j_jds1087_ref_002">
<mixed-citation publication-type="other"> <string-name><surname>Almeida</surname> <given-names>F</given-names></string-name>, <string-name><surname>Xexéo</surname> <given-names>G</given-names></string-name> (<year>2019</year>). Word embeddings: A survey. <italic>CoRR</italic>, arXiv preprint: <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/1901.09069">https://arxiv.org/abs/1901.09069</ext-link></mixed-citation>
</ref>
<ref id="j_jds1087_ref_003">
<mixed-citation publication-type="book"> <string-name><surname>Antonacopoulos</surname> <given-names>A</given-names></string-name>, <string-name><surname>Hu</surname> <given-names>J</given-names></string-name> (<year>2003</year>). <source><italic>Web document analysis: Challenges and opportunities</italic></source>. <publisher-name>World Scientific Publishing Co. Pte. Ltd.</publisher-name>, <publisher-loc>Singapore</publisher-loc>.</mixed-citation>
</ref>
<ref id="j_jds1087_ref_004">
<mixed-citation publication-type="other"> <string-name><surname>Apertium</surname></string-name> (2021). Website of Apertium, a free/open-source machine translation platform. <ext-link ext-link-type="uri" xlink:href="http://www.apertium.org">http://www.apertium.org</ext-link>.</mixed-citation>
</ref>
<ref id="j_jds1087_ref_005">
<mixed-citation publication-type="journal"> <string-name><surname>Aweisi</surname> <given-names>A</given-names></string-name>, <string-name><surname>Arora</surname> <given-names>D</given-names></string-name>, <string-name><surname>Emby</surname> <given-names>R</given-names></string-name>, <string-name><surname>Rehman</surname> <given-names>M</given-names></string-name>, <string-name><surname>Tanev</surname> <given-names>G</given-names></string-name>, <string-name><surname>Tanev</surname> <given-names>S</given-names></string-name> (<year>2021</year>). <article-title>Using web text analytics to categorize the business focus of innovative digital health companies</article-title>. <source><italic>Technology Innovation Management Review</italic></source>, <volume>11</volume>(<issue>7/8</issue>): <fpage>65</fpage>–<lpage>78</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1087_ref_006">
<mixed-citation publication-type="chapter"> <string-name><surname>Bergstra</surname> <given-names>J</given-names></string-name>, <string-name><surname>Bardenet</surname> <given-names>R</given-names></string-name>, <string-name><surname>Bengio</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Kégl</surname> <given-names>B</given-names></string-name> (<year>2011</year>). <chapter-title>Algorithms for hyper-parameter optimization</chapter-title>. <source><italic>Advances in Neural Information Processing Systems 24</italic></source>. <publisher-name>Curran Associates, Inc.</publisher-name>, <publisher-loc>New York</publisher-loc>.</mixed-citation>
</ref>
<ref id="j_jds1087_ref_007">
<mixed-citation publication-type="book"> <string-name><surname>Beręsewicz</surname> <given-names>M</given-names></string-name>, <string-name><surname>Pater</surname> <given-names>R</given-names></string-name> (<year>2021</year>). <source><italic>Inferring job vacancies from online job advertisements. Statistical Working papers</italic></source>. <publisher-name>Eurostat</publisher-name>, <publisher-loc>Luxembourg</publisher-loc>.</mixed-citation>
</ref>
<ref id="j_jds1087_ref_008">
<mixed-citation publication-type="chapter"> <string-name><surname>Daas</surname> <given-names>P</given-names></string-name>, <string-name><surname>de Wolf</surname> <given-names>N</given-names></string-name> (<year>2021</year>). <chapter-title>Identifying different types of companies via their website text</chapter-title>. In: <source><italic>Symposium on Data Science and Statistics (SDSS)</italic></source>. <conf-loc>Virtual, June 2-4, 2021.</conf-loc></mixed-citation>
</ref>
<ref id="j_jds1087_ref_009">
<mixed-citation publication-type="book"> <string-name><surname>Daas</surname> <given-names>P</given-names></string-name>, <string-name><surname>Tennekes</surname> <given-names>M</given-names></string-name>, <string-name><surname>De Miguel</surname> <given-names>B</given-names></string-name>, <string-name><surname>De Miguel</surname> <given-names>M</given-names></string-name>, <string-name><surname>Santamarina</surname> <given-names>V</given-names></string-name>, <string-name><surname>Carausu</surname> <given-names>F</given-names></string-name> (<year>2022</year>). <source><italic>Web intelligence for measuring emerging economic trends: The drone industry</italic></source> <series><italic>Statistical Working papers</italic></series>. <publisher-name>Eurostat</publisher-name>, <publisher-loc>Luxembourg</publisher-loc>.</mixed-citation>
</ref>
<ref id="j_jds1087_ref_010">
<mixed-citation publication-type="journal"> <string-name><surname>Daas</surname> <given-names>P</given-names></string-name>, <string-name><surname>van der Doef</surname> <given-names>S</given-names></string-name> (<year>2020</year>). <article-title>Detecting innovative companies via their website</article-title>. <source><italic>Statistical Journal of the IAOS</italic></source>, <volume>36</volume>(<issue>4</issue>): <fpage>1239</fpage>–<lpage>1251</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1087_ref_011">
<mixed-citation publication-type="other"> <string-name><surname>De Kunder</surname> <given-names>M</given-names></string-name> (2022). The size of the world wide web (the internet). <uri>https://www.worldwidewebsize.com/</uri>.</mixed-citation>
</ref>
<ref id="j_jds1087_ref_012">
<mixed-citation publication-type="other"> <string-name><surname>Devlin</surname> <given-names>J</given-names></string-name>, <string-name><surname>Chang</surname> <given-names>MW</given-names></string-name>, <string-name><surname>Lee</surname> <given-names>K</given-names></string-name>, <string-name><surname>Toutanova</surname> <given-names>K</given-names></string-name> (2018). BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint: <uri>https://arxiv.org/abs/1810.04805</uri>, 13 pages.</mixed-citation>
</ref>
<ref id="j_jds1087_ref_013">
<mixed-citation publication-type="chapter"> <string-name><surname>Elkan</surname> <given-names>C</given-names></string-name>, <string-name><surname>Noto</surname> <given-names>K</given-names></string-name> (<year>2008</year>). <chapter-title>Learning classifiers from only positive and unlabeled data</chapter-title>. In: <source><italic>Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</italic></source> (<string-name><given-names>Y</given-names> <surname>Li</surname></string-name>, <string-name><given-names>B</given-names> <surname>Liu</surname></string-name>, <string-name><given-names>S</given-names> <surname>Sarawagi</surname></string-name>, eds.). <conf-loc>Las Vegas, Nevada, USA</conf-loc>. <conf-date>August 24–27, 2008</conf-date>, <fpage>213</fpage>–<lpage>220</lpage>. <publisher-name>ACM</publisher-name>.</mixed-citation>
</ref>
<ref id="j_jds1087_ref_014">
<mixed-citation publication-type="other"> <string-name><surname>ESSnet</surname></string-name> (2020). Web page providing an overview of the experimental statistics developed in the context of ESSnet Big Data workpackage C on enterprise characteristics. <uri>https://ec.europa.eu/eurostat/cros/content/wpc-experimental-statistics_en</uri>.</mixed-citation>
</ref>
<ref id="j_jds1087_ref_015">
<mixed-citation publication-type="other"> <string-name><surname>Fasttext</surname></string-name> (2022). Webpage of fasttext language detect v1.0.3. <uri>https://pypi.org/project/fasttext-langdetect/</uri>.</mixed-citation>
</ref>
<ref id="j_jds1087_ref_016">
<mixed-citation publication-type="chapter"> <string-name><surname>Florescu</surname> <given-names>D</given-names></string-name>, <string-name><surname>Karlberg</surname> <given-names>M</given-names></string-name>, <string-name><surname>Reis</surname> <given-names>F</given-names></string-name>, <string-name><surname>Rey Del Castillo</surname> <given-names>P</given-names></string-name>, <string-name><surname>Skaliotis</surname> <given-names>M</given-names></string-name>, <string-name><surname>Wirthmann</surname> <given-names>A</given-names></string-name> (<year>2014</year>). <conf-name>Will ‘big data’ transform official statistics? Quality in Official Statistics Conference</conf-name>. Vienna, Austria. <comment>June 2-5, 2014</comment>.</mixed-citation>
</ref>
<ref id="j_jds1087_ref_017">
<mixed-citation publication-type="journal"> <string-name><surname>Gentzkow</surname> <given-names>M</given-names></string-name>, <string-name><surname>Kelly</surname> <given-names>B</given-names></string-name>, <string-name><surname>Taddy</surname> <given-names>M</given-names></string-name> (<year>2019</year>). <article-title>Text as data</article-title>. <source><italic>Journal of Economic Literature</italic></source>, <volume>57</volume>(<issue>3</issue>): <fpage>535</fpage>–<lpage>574</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1087_ref_018">
<mixed-citation publication-type="other"> <string-name><surname>GitHub WIH Drones</surname></string-name> (2022). Web intelligence hub drone companies. <uri>https://github.com/eurostat/wih_drones_companies</uri>.</mixed-citation>
</ref>
<ref id="j_jds1087_ref_019">
<mixed-citation publication-type="journal"> <string-name><surname>Gök</surname> <given-names>A</given-names></string-name>, <string-name><surname>Waterworth</surname> <given-names>A</given-names></string-name>, <string-name><surname>Shapira</surname> <given-names>P</given-names></string-name> (<year>2015</year>). <article-title>Use of web mining in studying innovation</article-title>. <source><italic>Scientometrics</italic></source>, <volume>102</volume>(<issue>1</issue>): <fpage>653</fpage>–<lpage>671</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1087_ref_020">
<mixed-citation publication-type="other"> <string-name><surname>GOPA</surname></string-name> (2021a). Data Retrieval, Deliverable 2. Report 2 of the project Web Intelligence for Measuring Emerging Economic Trends: The Drone Industry. Eurostat, Luxembourg.</mixed-citation>
</ref>
<ref id="j_jds1087_ref_021">
<mixed-citation publication-type="other"> <string-name><surname>GOPA</surname></string-name> (2021b). Deliverable 1. Report 1 of the project Web Intelligence for Measuring Emerging Economic Trends: The Drone Industry. Eurostat, Luxembourg.</mixed-citation>
</ref>
<ref id="j_jds1087_ref_022">
<mixed-citation publication-type="journal"> <string-name><surname>Höchtl</surname> <given-names>J</given-names></string-name>, <string-name><surname>Parycek</surname> <given-names>P</given-names></string-name>, <string-name><surname>Schöllhammer</surname> <given-names>R</given-names></string-name> (<year>2015</year>). <article-title>Big data in the policy cycle: Policy decision making in the digital era</article-title>. <source><italic>J. Org. Comp. Elec. Com.</italic></source>, <volume>26</volume>(<issue>1–2</issue>): <fpage>147</fpage>–<lpage>169</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1087_ref_023">
<mixed-citation publication-type="journal"> <string-name><surname>Kitchin</surname> <given-names>R</given-names></string-name> (<year>2015</year>). <article-title>The opportunities, challenges and risks of big data for official statistics</article-title>. <source><italic>Statistical Journal of the IAOS</italic></source>, <volume>31</volume>(<issue>3</issue>): <fpage>471</fpage>–<lpage>481</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1087_ref_024">
<mixed-citation publication-type="journal"> <string-name><surname>Kowsari</surname> <given-names>K</given-names></string-name>, <string-name><surname>Jafari Meimandi</surname> <given-names>K</given-names></string-name>, <string-name><surname>Heidarysafa</surname> <given-names>M</given-names></string-name>, <string-name><surname>Mendu</surname> <given-names>S</given-names></string-name>, <string-name><surname>Barnes</surname> <given-names>L</given-names></string-name>, <string-name><surname>Brown</surname> <given-names>D</given-names></string-name> (<year>2019</year>). <article-title>Text classification algorithms: A survey</article-title>. <source><italic>Information</italic></source>, <volume>10</volume>(<issue>4</issue>).</mixed-citation>
</ref>
<ref id="j_jds1087_ref_025">
<mixed-citation publication-type="journal"> <string-name><surname>Kühnemann</surname> <given-names>H</given-names></string-name>, <string-name><surname>van Delden</surname> <given-names>A</given-names></string-name>, <string-name><surname>Windmeijer</surname> <given-names>D</given-names></string-name> (<year>2020</year>). <article-title>Exploring a knowledge-based approach to predicting NACE codes of enterprises based on web page texts</article-title>. <source><italic>Statistical Journal of the IAOS</italic></source>, <volume>36</volume>(<issue>3</issue>): <fpage>807</fpage>–<lpage>821</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1087_ref_026">
<mixed-citation publication-type="book"> <string-name><surname>Larose</surname> <given-names>D</given-names></string-name>, <string-name><surname>Markov</surname> <given-names>Z</given-names></string-name> (<year>2007</year>). <source><italic>Data Mining the Web: Uncovering Patterns in Web Content, Structure, and Usage</italic></source>. <publisher-name>Wiley-Interscience</publisher-name>, <publisher-loc>Hoboken, NJ</publisher-loc>.</mixed-citation>
</ref>
<ref id="j_jds1087_ref_027">
<mixed-citation publication-type="journal"> <string-name><surname>Pedregosa</surname> <given-names>F</given-names></string-name>, <string-name><surname>Varoquaux</surname> <given-names>G</given-names></string-name>, <string-name><surname>Gramfort</surname> <given-names>A</given-names></string-name>, <string-name><surname>Michel</surname> <given-names>V</given-names></string-name>, <string-name><surname>Thirion</surname> <given-names>B</given-names></string-name>, <string-name><surname>Grisel</surname> <given-names>O</given-names></string-name>, <etal>et al.</etal> (<year>2011</year>). <article-title>Scikit-learn: Machine learning in Python</article-title>. <source><italic>Journal of Machine Learning Research</italic></source>, <volume>12</volume>: <fpage>2825</fpage>–<lpage>2830</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1087_ref_028">
<mixed-citation publication-type="chapter"> <string-name><surname>Pires</surname> <given-names>T</given-names></string-name>, <string-name><surname>Schlinger</surname> <given-names>E</given-names></string-name>, <string-name><surname>Garrette</surname> <given-names>D</given-names></string-name> (<year>2019</year>). <chapter-title>How multilingual is multilingual BERT?</chapter-title> In: <source><italic>Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics</italic></source>, <fpage>4996</fpage>–<lpage>5001</lpage>. <publisher-name>Association for Computational Linguistics</publisher-name>, <publisher-loc>Florence, Italy</publisher-loc>.</mixed-citation>
</ref>
<ref id="j_jds1087_ref_029">
<mixed-citation publication-type="journal"> <string-name><surname>Powell</surname> <given-names>B</given-names></string-name>, <string-name><surname>Nason</surname> <given-names>G</given-names></string-name>, <string-name><surname>Elliott</surname> <given-names>D</given-names></string-name>, <string-name><surname>Mayhew</surname> <given-names>M</given-names></string-name>, <string-name><surname>Davies</surname> <given-names>J</given-names></string-name>, <string-name><surname>Winton</surname> <given-names>J</given-names></string-name> (<year>2018</year>). <article-title>Tracking and modelling prices using web-scraped price microdata: Towards automated daily consumer price index forecasting</article-title>. <source><italic>Journal of the Royal Statistical Society: Series A</italic></source>, <volume>181</volume>(<issue>3</issue>): <fpage>737</fpage>–<lpage>756</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1087_ref_030">
<mixed-citation publication-type="other"> <string-name><surname>PUlearn</surname></string-name> (2021). Website of the pulearn python library v0.07. <uri>https://pypi.org/project/pulearn</uri>.</mixed-citation>
</ref>
<ref id="j_jds1087_ref_031">
<mixed-citation publication-type="book"> <string-name><surname>Rothaermel</surname> <given-names>F</given-names></string-name> (<year>2019</year>). <source><italic>Strategic Management</italic></source>. <publisher-name>McGraw-Hill Education</publisher-name>, <publisher-loc>New York</publisher-loc>.</mixed-citation>
</ref>
<ref id="j_jds1087_ref_032">
<mixed-citation publication-type="book"> <string-name><surname>Song</surname> <given-names>M</given-names></string-name>, <string-name><surname>Wu</surname> <given-names>YF</given-names></string-name> (<year>2008</year>). <source><italic>Handbook of Research on Text and Web Mining Technologies</italic></source>. <publisher-name>Information Science Reference</publisher-name>, <publisher-loc>Hershey, PA</publisher-loc>.</mixed-citation>
</ref>
<ref id="j_jds1087_ref_033">
<mixed-citation publication-type="other"> <string-name><surname>United Nations</surname></string-name> (2014). Fundamental Principles of Official Statistics. United Nations Statistics Division, New York.</mixed-citation>
</ref>
</ref-list>
</back>
</article>
