Journal of Data Science


Quantifying Gender Disparity in Pre-Modern English Literature using Natural Language Processing
Volume 22, Issue 1 (2024), pp. 77–96
Mayank Kejriwal, Akarsh Nagaraj

https://doi.org/10.6339/23-JDS1100
Pub. online: 2 May 2023      Type: Data Science In Action      Open Access

Received: 26 May 2022
Accepted: 22 April 2023
Published: 2 May 2023

Abstract

Research has continued to shed light on the extent and significance of gender disparity in social, cultural and economic spheres. More recently, computational tools from the data science and Natural Language Processing (NLP) communities have been proposed for measuring such disparity at scale using empirically rigorous methodologies. In this article, we contribute to this line of research by studying gender disparity in 2,443 copyright-expired literary texts published in the pre-modern period, defined in this work as the period ranging from the beginning of the nineteenth century through the early twentieth century. Using a replicable data science methodology relying on publicly available and established NLP components, we extract three different gendered character prevalence measures within these texts. We use an extensive set of statistical tests to robustly demonstrate a significant disparity between the prevalence of female characters and male characters in pre-modern literature. We also show that the proportion of female characters is significantly higher in female-authored texts than in male-authored texts. However, regression-based analysis shows that, over the 120-year period covered by the corpus, female character prevalence does not change significantly over time, and remains below the parity level of 50%, regardless of the gender of the author. Qualitative analyses further show that descriptions associated with female characters across the corpus are stereotypical and markedly different from the descriptions associated with male characters.
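
The comparison against the 50% parity level can be illustrated with a small sketch. The abstract does not name the specific tests used, so the exact sign test below is an assumption chosen for simplicity, and the function names and character counts are hypothetical toy values, not data from the corpus.

```python
from math import comb


def female_proportion(n_female: int, n_male: int) -> float:
    """Share of gendered characters in one text that are female."""
    total = n_female + n_male
    if total == 0:
        raise ValueError("text has no gendered characters")
    return n_female / total


def sign_test_p_value(proportions, parity=0.5):
    """Two-sided exact sign test of the median proportion against parity.

    Assumption: this is an illustrative test, not necessarily one of the
    tests used in the article.
    """
    below = sum(p < parity for p in proportions)
    above = sum(p > parity for p in proportions)
    n = below + above  # texts exactly at parity (ties) are dropped
    if n == 0:
        return 1.0
    k = min(below, above)
    # Probability of a split at least this lopsided under a fair coin.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy (female, male) character counts per text; illustrative only.
toy_counts = [(2, 6), (1, 4), (3, 5), (4, 4), (2, 8), (1, 7), (3, 9), (2, 5)]
proportions = [female_proportion(f, m) for f, m in toy_counts]
p_value = sign_test_p_value(proportions)
print(f"mean female proportion: {sum(proportions) / len(proportions):.3f}")
print(f"sign test p-value vs. parity: {p_value:.4f}")
```

In this toy sample seven texts fall below parity, one sits exactly at parity and is dropped, and none exceed it, so the test reports a small p-value. A real analysis would apply the same idea to each of the three prevalence measures across the full corpus.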

Supplementary material

The supplementary material contains details on: data preprocessing, character extraction and gender classification; additional quantitative details, including complete statistical significance results, for Hypotheses 1 and 2; quantitative linear regression results (including supporting statistics such as the analysis of variance); methodological details and results for the secondary analysis noted in Section 1.1, which uses computational techniques from NLP to qualitatively assess the kinds of words associated with male and female character occurrences; and a detailed description of some limitations of the study that were briefly discussed in the main text. Code, data and workbooks for replicating the analyses in this paper are also provided separately as supplementary material.
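
As a rough illustration of the kind of linear regression analysis mentioned above, the following sketch fits an ordinary least squares line to female character proportion against publication year. The helper name `ols_fit` and the toy values are hypothetical and are not taken from the supplementary material.

```python
def ols_fit(xs, ys):
    """Return (slope, intercept) of the least squares line y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy data: publication year and female character proportion; illustrative only.
years = [1800, 1830, 1860, 1890, 1920]
female_share = [0.30, 0.32, 0.31, 0.33, 0.32]

slope, intercept = ols_fit(years, female_share)
print(f"estimated change per decade: {slope * 10:.4f}")
```

A slope close to zero, as in this toy series, corresponds to prevalence that barely moves over time. Judging whether a slope differs significantly from zero additionally requires its standard error, which this sketch omits.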



Copyright
2024 The Author(s). Published by the School of Statistics and the Center for Applied Statistics, Renmin University of China.
Open access article under the CC BY license.

Keywords
digital humanities; gender-specific character prevalence; named entity recognition; Project Gutenberg; word embedding

Journal of Data Science

  • Online ISSN: 1683-8602
  • Print ISSN: 1680-743X
