Identifying Drone Web Sites in Multiple Countries and Languages with a Single Model
Volume 21, Issue 2 (2023): Special Issue: Symposium Data Science and Statistics 2022, pp. 225–238
Pub. online: 26 January 2023
Type: Data Science In Action
Open Access
† The views expressed in this paper are those of the authors and do not necessarily reflect the policies of Statistics Netherlands.
Received: 25 July 2022
Accepted: 17 January 2023
Published: 26 January 2023
Abstract
A text-based bag-of-words model was developed to identify drone company websites in multiple European countries and languages. A collection of Spanish drone and non-drone websites was used for initial model development, and various classification methods were compared. Supervised logistic regression (L2-norm) performed best, with an accuracy of 87% on the unseen test set. The accuracy of this model improved to 88% when it was trained on texts in which all Spanish words were translated into English. Retraining the model on texts from which all typically Spanish words (such as names of cities and regions) and words indicative of specific periods in time (such as the months of the year and days of the week) had been removed did not affect the overall performance of the model and made it more generally applicable. Applying the cleaned, completely English word-based model to a collection of Irish and Italian drone and non-drone websites revealed, after manual inspection, that it was able to detect drone websites in those countries with an accuracy of 82% and 86%, respectively. The classification of Italian texts required the creation of a translation list in which all 1560 English word-based features in the model were translated to their Italian equivalents. Because the model had a very high recall (93%, 100%, and 97% on Spanish, Irish, and Italian drone websites, respectively), it was particularly well suited to selecting potential drone websites from large collections of websites.
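The translation-list idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the four feature weights, the bias, and the Italian-to-English entries below are invented for illustration, whereas the actual model has 1560 features with weights learned by L2-regularized logistic regression. The sketch only shows how words on a foreign-language page might be mapped onto English features, counted as a bag of words, and scored, and how recall could be computed from labeled predictions.

```python
import math
import re
from collections import Counter

# Hypothetical English feature weights and bias (assumed values, not from the paper).
WEIGHTS = {"drone": 2.0, "aerial": 1.2, "pilot": 0.8, "recipe": -1.5}
BIAS = -1.0

# Hypothetical fragment of an Italian -> English translation list.
IT_TO_EN = {"drone": "drone", "droni": "drone", "aeree": "aerial",
            "pilota": "pilot", "ricetta": "recipe"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[^\W\d_]+", text.lower())


def bag_of_words(text, translation=None):
    """Count model features in a text, optionally mapping words to English first."""
    counts = Counter()
    for token in tokenize(text):
        feature = translation.get(token) if translation else token
        if feature in WEIGHTS:
            counts[feature] += 1
    return counts


def drone_probability(counts):
    """Logistic regression score: sigmoid of the weighted feature counts."""
    z = BIAS + sum(WEIGHTS[f] * n for f, n in counts.items())
    return 1.0 / (1.0 + math.exp(-z))


def recall(y_true, y_pred):
    """Fraction of true drone sites (label 1) that were predicted as drone sites."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if tp + fn else 0.0
```

For example, `bag_of_words("Riprese aeree con droni e pilota certificato", IT_TO_EN)` counts one occurrence each of the English features `aerial`, `drone`, and `pilot`, and `drone_probability` turns those counts into a score between 0 and 1. Words absent from the translation list are simply ignored, so the English model itself needs no retraining for a new language under this scheme.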