Quantifying Gender Disparity in Pre-Modern English Literature using Natural Language Processing
Volume 22, Issue 1 (2024), pp. 77–96
Pub. online: 2 May 2023
Type: Data Science In Action
Open Access
Received
26 May 2022
Accepted
22 April 2023
Published
2 May 2023
Abstract
Research has continued to shed light on the extent and significance of gender disparity in social, cultural and economic spheres. More recently, computational tools from the data science and Natural Language Processing (NLP) communities have been proposed for measuring such disparity at scale using empirically rigorous methodologies. In this article, we contribute to this line of research by studying gender disparity in 2,443 copyright-expired literary texts published in the pre-modern period, defined in this work as the period ranging from the beginning of the nineteenth century through the early twentieth century. Using a replicable data science methodology that relies on publicly available and established NLP components, we extract three different gendered character prevalence measures from these texts. We use an extensive set of statistical tests to demonstrate a significant disparity between the prevalence of female characters and male characters in pre-modern literature. We also show that the proportion of female characters is significantly higher in female-authored texts than in male-authored texts. However, regression-based analysis shows that, over the 120-year period covered by the corpus, female character prevalence does not change significantly over time and remains below the parity level of 50%, regardless of the gender of the author. Qualitative analyses further show that the descriptions associated with female characters across the corpus differ markedly from the descriptions associated with male characters, and are stereotypical.
Supplementary material
The supplementary material contains details on: data preprocessing, character extraction and gender classification; additional quantitative details, including complete statistical significance results, for Hypotheses 1 and 2; quantitative linear regression results (including supporting statistics such as the analysis of variance); methodological details and results for the secondary analysis noted in Section 1.1, in which we use computational techniques from NLP to qualitatively assess the kinds of words associated with male and female character occurrences; and a detailed description of some limitations of the study that are briefly discussed in the main text. Code, data and workbooks for replicating the analyses in this paper are provided separately as supplementary material.