8 Corpus Data
The amount of data available for free on the Internet is astounding. Before you go through the trouble of running your own experiment or scraping data from sites, ask yourself: Has somebody else already done the work?
This chapter is a brief guide to corpus resources for language data, with links to all of our favorite sources.
8.1 Archived Experimental Data
Scientists who run large experiments often publish their data online for use in further research. For example, data from Sap et al. (2020) and Sap et al. (2022), summarized in Chapter 7, are available online as the Hippocorpus dataset, which we will explore in depth in Unit 2. Another good example is the Empathic Reactions dataset (Buechel et al., 2018), used in Section 4.2.1 and Chapter 5, in which participants read news stories, rated their own empathy and distress after reading them, and then described their thoughts about the stories in writing.
Open experimental data are often linked in published papers, especially since the founding of the Center for Open Science in 2013. Many psychology-related datasets can be browsed freely on osf.io, the Harvard Dataverse, and other locations.
8.2 Linguistics Corpora
The field of linguistics has a long tradition of corpus data. Linguistics corpora provide extensive records of spoken and written language in a wide range of contexts. These corpora are often very large and professionally curated, making them ideal for the techniques described in this book. On the other hand, they are generally curated with linguistics in mind, not psychology, so applying them to psychological questions requires some ingenuity.
One popular semi-experimental linguistics corpus is the HCRC Map Task Corpus (Anderson et al., 1991), in which pairs of participants collaborated in a communication game. In each pair, one partner could see a treasure map with a path through various landmarks, while the other partner had a similar map without a path. The first partner explained to the second how to draw the path. The partners’ communication accuracy can be measured as the distance between the drawn path and the original. Full dialogue transcriptions, as well as accuracy scores, are available online. The Map Task Corpus has been reproduced in many languages, including Hebrew, and is commonly used in psychology. For example, Dideriksen et al. (2023) used a Danish version of the Map Task Corpus, along with other dialogue corpora, to track the ways that speakers collaborate to achieve mutual understanding in different contexts.
- English-Corpora.org: A list of the most widely used corpora of naturalistic English speech and writing, with download links for each. Also includes preprocessed data, such as word frequency counts for nearly 100 genres, from the Corpus of Contemporary American English (Davies, 2009), used in Section 5.1.1.
- University of British Columbia Language Corpora List: Links to written and spoken language data in dozens of languages, including from bilingual and multilingual speakers.
- Wikipedia’s List of Text Corpora
- List of NLP Corpora: Links to useful corpora for NLP tasks like task-oriented dialogue, translation, and sentiment analysis.
- Convokit Datasets: Links to written and spoken dialogues from debates, news interviews, telephone conversations, video chats, legal trials, and more.
8.3 Data Gathered From the Internet
The Internet is full of text, and you are not the first one to want to use it for research. Many corpora of online text data are free to download.
Some sets of Internet data are professionally curated and well balanced. For example, the Blog Authorship Corpus (Schler et al., 2006) includes 681,288 blog posts annotated with the author's age group (binned into ages 13-17, 23-27, and 33-47) and gender, with an equal number of male and female bloggers in each age group. Similarly, the 20 Newsgroups dataset includes approximately 20,000 newsgroup documents, partitioned nearly evenly across 20 different newsgroups.
Some sets of Internet data are available only in processed form, without the raw text. For example, Eichstaedt et al. (2015) published Twitter n-gram (and LDA topic) frequencies by US county, along with corresponding measures of well-being (featured in Chapter 3).
Some sets of Internet data are very lightly curated. For example, the Reddit Top 2.5 Million dataset contains the top 1,000 all-time posts from the top 2,500 subreddits in August 2013, excluding NSFW subreddits.
Some sets of archived Internet data are not curated at all. These are sometimes referred to as data dumps. For example, Baumgartner et al. (2020) published all Reddit submissions and comments posted during April 2019. Even more extensive data dumps of Reddit, covering historical data back to Reddit’s inception, can be found in records of Pushshift Reddit. Similar archives exist for Twitter. Data dumps are usually in JSON format. A JSON file stores named fields and nested values, much like a nested list in R, but with different syntax. For a tutorial on processing JSON data in R, see the relevant chapter in R for Data Science.
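To make the JSON format concrete, here is a minimal sketch in Python (the same steps carry over to R) that reads a dump stored with one JSON record per line and keeps only a few fields. The two sample records, their field names, and the helper functions `read_json_lines` and `keep_fields` are made up for illustration; a real dump may be laid out differently and will usually need to be decompressed first.

```python
import io
import json

# Two made-up records standing in for lines of a data dump.
# The field names are illustrative assumptions, not a documented schema.
SAMPLE_DUMP = (
    '{"author": "user_a", "subreddit": "AskReddit", "body": "First comment", "score": 3}\n'
    '{"author": "user_b", "subreddit": "science", "body": "Second comment", "score": 12}\n'
)


def read_json_lines(handle):
    """Yield one dict per non-empty line of a newline-delimited JSON stream."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)


def keep_fields(records, fields):
    """Reduce each record to the listed fields, using None where one is missing."""
    return [{field: record.get(field) for field in fields} for record in records]


# io.StringIO stands in for an open file; with a real dump you would pass
# the file handle returned by open(...) instead.
records = list(read_json_lines(io.StringIO(SAMPLE_DUMP)))
rows = keep_fields(records, ["author", "subreddit", "body"])

for row in rows:
    print(row)
```

Reading line by line rather than loading the whole file at once matters for data dumps, which are often too large to fit in memory.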
Most research topics in psychology do not require up-to-date data. As such, historical archives can be an invaluable resource. Biester et al. (2022) is a great example:
An example of social media archives in psychology research: Biester et al. (2022) used patterns curated by Cohan et al. (2018) to search Pushshift Reddit for users who publicly shared a depression diagnosis (e.g., “I have been diagnosed with depression”). They then used dictionary-based methods (Chapter 14) to measure various emotional qualities in users’ posts during the weeks leading up to their declaration of the diagnosis and during the weeks that followed. They found that anxiety, sadness, and cognitive processing increased in the weeks leading up to the declaration and decreased afterwards.
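The first step of such a study, finding users who disclose a diagnosis and the time of their first disclosure, can be sketched as follows. This is only an illustration in Python: the regular expression is a hypothetical stand-in and not one of the curated patterns from Cohan et al. (2018), and the posts, field names, and the helper `first_disclosures` are invented for the example.

```python
import re

# Hypothetical pattern for illustration only. It is NOT one of the curated
# patterns from Cohan et al. (2018), which are more extensive.
DIAGNOSIS_PATTERN = re.compile(
    r"\bI\s+(?:have\s+been|was|am|got)\s+diagnosed\s+with\s+depression\b",
    re.IGNORECASE,
)

# Invented posts; created_utc is a made-up timestamp used only for ordering.
posts = [
    {"author": "user_a", "created_utc": 100, "body": "I have been diagnosed with depression."},
    {"author": "user_b", "created_utc": 150, "body": "My cat was diagnosed with arthritis."},
    {"author": "user_a", "created_utc": 300, "body": "I was diagnosed with depression last year."},
    {"author": "user_c", "created_utc": 200, "body": "Last week i got diagnosed with depression"},
]


def first_disclosures(posts):
    """Map each author to the timestamp of their earliest matching post."""
    earliest = {}
    for post in posts:
        if DIAGNOSIS_PATTERN.search(post["body"]):
            author, when = post["author"], post["created_utc"]
            if author not in earliest or when < earliest[author]:
                earliest[author] = when
    return earliest


disclosures = first_disclosures(posts)
print(disclosures)
```

With the disclosure time for each user in hand, that user's other posts can be split into those written before and after it, which is the comparison the study relies on. A simple pattern like this one will miss many phrasings and may match quoted or hypothetical statements, which is one reason carefully curated patterns are worth reusing.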
Since Reddit and Twitter restricted their API access in 2023, the legal status of large archival data dumps from those platforms (such as Pushshift Reddit) has been unclear. We are not qualified to give legal advice, but as long as you are not using the data for profit, you are unlikely to get in trouble.
8.4 Other Public Data Sources
- Kaggle: An online hub for data science, including many text- and psychology-related datasets
- HathiTrust: A digital library of 18+ million digitized books, including many curated collections
- Forbes list of 30 Amazing (And Free) Public Data Sources