Quantifying Gender Disparity in Pre-Modern English Literature using Natural Language Processing

Kejriwal, Mayank; Nagaraj, Akarsh

doi:10.6339/23-JDS1100

Journal of Data Science

Volume 22, Issue 1 (2024), pp. 77–96

Type: Data Science In Action

Open Access

Received: 26 May 2022

Accepted: 22 April 2023

Published online: 2 May 2023

Abstract

Research has continued to shed light on the extent and significance of gender disparity in social, cultural and economic spheres. More recently, computational tools from the data science and Natural Language Processing (NLP) communities have been proposed for measuring such disparity at scale using empirically rigorous methodologies. In this article, we contribute to this line of research by studying gender disparity in 2,443 copyright-expired literary texts published in the pre-modern period, defined in this work as the period ranging from the beginning of the nineteenth century through the early twentieth century. Using a replicable data science methodology that relies on publicly available and established NLP components, we extract three different measures of gendered character prevalence from these texts. We use an extensive set of statistical tests to demonstrate a significant disparity between the prevalence of female characters and that of male characters in pre-modern literature. We also show that the proportion of female characters is significantly higher in female-authored texts than in male-authored texts. However, regression-based analysis shows that, over the 120-year period covered by the corpus, female character prevalence does not change significantly over time and remains below the parity level of 50%, regardless of the gender of the author. Qualitative analyses further show that the descriptions associated with female characters across the corpus are stereotypical and differ markedly from the descriptions associated with male characters.
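To make the notion of a prevalence measure and the 50% parity level concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each text has already been reduced to a pair of counts (female characters, male characters) by some upstream character extraction and gender classification step. The function names and the toy counts are hypothetical, and the proportion shown is only one plausible measure; the abstract does not define the three measures used in the paper.

```python
from statistics import mean

PARITY = 0.5  # parity level of 50% mentioned in the abstract


def female_proportion(n_female, n_male):
    """Share of gendered characters in one text that are female.

    Returns None when the text has no gendered characters at all,
    so that such texts can be skipped rather than counted as zero.
    """
    total = n_female + n_male
    if total == 0:
        return None
    return n_female / total


def corpus_summary(counts):
    """Summarise (female, male) count pairs across a corpus.

    `counts` is an iterable of (n_female, n_male) tuples, one per text.
    Texts with no gendered characters are ignored.
    """
    proportions = []
    for n_female, n_male in counts:
        p = female_proportion(n_female, n_male)
        if p is not None:
            proportions.append(p)
    if not proportions:
        raise ValueError("no text contains gendered characters")
    below = sum(1 for p in proportions if p < PARITY)
    return {
        "mean_proportion": mean(proportions),
        "share_below_parity": below / len(proportions),
    }


# Toy, made-up counts for three texts (not data from the paper).
toy_counts = [(1, 3), (2, 2), (0, 0)]
print(corpus_summary(toy_counts))
```

A summary of this kind only describes the data. Establishing that a disparity is statistically significant, as the article reports, requires formal hypothesis tests on top of such per-text proportions.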

Supplementary material

The supplementary material contains details on: data preprocessing, character extraction and gender classification; additional quantitative details, including complete statistical significance results, for Hypotheses 1 and 2; quantitative linear regression results (including supporting statistics such as the analysis of variance); methodological details and results for the secondary analysis noted in Section 1.1, in which we use computational techniques from NLP to qualitatively assess the kinds of words associated with male and female character occurrences; and a detailed description of some limitations of the study that are briefly discussed in the main text. Code, data and workbooks for replicating the analyses in this paper are provided separately as supplementary material.

© 2024 The Author(s). Published by the School of Statistics and the Center for Applied Statistics, Renmin University of China.

Open access article under the CC BY license.

Keywords

digital humanities; gender-specific character prevalence; named entity recognition; Project Gutenberg; word embedding
