Improving the Science of Annotation for Natural Language Processing: The Use of the Single-Case Study for Piloting Annotation Projects
Volume 20, Issue 3 (2022): Special Issue: Data Science Meets Social Sciences, pp. 339–357
Pub. online: 8 July 2022
Type: Data Science In Action
Open Access
Received: 11 December 2021
Accepted: 20 June 2022
Published: 8 July 2022
Abstract
Researchers need guidance on how to obtain maximum efficiency and accuracy when annotating training data for text classification applications. Further, given wide variability in the kinds of annotations researchers need to obtain, they would benefit from the ability to conduct low-cost experiments during the design phase of annotation projects. To this end, our study proposes the single-case study design as a feasible and causally valid experimental design for determining the best procedures for a given annotation task. The key strength of the design is its ability to generate causal evidence at the individual level, identifying the impact of competing annotation techniques and interfaces for the specific annotator(s) included in an annotation project. In this paper, we demonstrate the application of the single-case study in an applied experiment and argue that future researchers should incorporate the design into the pilot stage of annotation projects so that, over time, a causally valid body of knowledge regarding the best annotation techniques is built.
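To make the idea concrete, the sketch below illustrates one way a single-case pilot of this kind might be analyzed: a single annotator labels batches of documents while the annotation setup alternates between two conditions, and a randomization test asks whether the observed difference in annotation speed is larger than chance. This is a minimal illustration only, not the authors' procedure; the timing data, the interface names, and the use of a randomization test on median seconds-per-document are all assumptions introduced here.

    # Hypothetical single-case (alternating-treatments) pilot for one annotator.
    # All numbers and condition names below are invented for illustration.
    import random
    import statistics

    # Seconds per document recorded across alternating batches,
    # with the condition order randomized in advance (fabricated data).
    timings = {
        "interface_A": [41.2, 38.7, 44.0, 39.5, 36.8, 40.1],
        "interface_B": [31.4, 29.9, 33.2, 30.7, 28.5, 32.0],
    }

    def observed_effect(a, b):
        """Difference in median annotation time (condition A minus condition B)."""
        return statistics.median(a) - statistics.median(b)

    def randomization_test(a, b, n_permutations=10_000, seed=0):
        """Shuffle condition labels to estimate how often a difference at least
        as large as the observed one would arise by chance for this annotator."""
        rng = random.Random(seed)
        observed = observed_effect(a, b)
        pooled = a + b
        count = 0
        for _ in range(n_permutations):
            rng.shuffle(pooled)
            perm_a, perm_b = pooled[:len(a)], pooled[len(a):]
            if abs(observed_effect(perm_a, perm_b)) >= abs(observed):
                count += 1
        return observed, count / n_permutations

    effect, p_value = randomization_test(timings["interface_A"], timings["interface_B"])
    print(f"Median slowdown under interface_A: {effect:.1f} s/doc (randomization p = {p_value:.3f})")

Because the design randomizes conditions within a single annotator, the resulting inference applies to that annotator specifically, which is the individual-level causal evidence the abstract describes.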
Supplementary material
The Supplementary Material includes all of the scripts and data files necessary to reproduce the results of this paper. We also include the codebook used by our annotators.