Dataset for studying gender disparity in English literary texts

Abstract

Recent discourse has highlighted significant gender disparity in many aspects of economic, social and cultural life. With the advent of advanced tools in Artificial Intelligence (AI) and Natural Language Processing (NLP), there is an opportunity to use computational and digital tools to analyze corpora such as copyright-expired literature from the pre-modern period (defined herein as books published approximately between 1800 and 1950) in the Project Gutenberg corpus. Nevertheless, using such tools poses challenges, especially in maintaining quality high enough to explore interesting hypotheses. We present a dataset and materials that illustrate how modern NLP methods can be applied to the raw text of more than 3,000 literary texts in Project Gutenberg to (i) extract characters and pronouns from the text with high quality, (ii) disambiguate characters so that they are not overcounted, (iii) detect the …

Date: April 1, 2022
Authors: Akarsh Nagaraj, Mayank Kejriwal
Journal: Data in Brief
Volume: 41
Pages: 107905
Publisher: Elsevier
