8  Corpus Data

The amount of data available for free on the Internet is astounding. Before you go through the trouble of running your own experiment or scraping data from sites, ask yourself: Has somebody else already done the work?

This chapter is a brief guide to corpus resources for language data, with links to all of our favorite sources.

8.1 Archived Experimental Data

Scientists who run large experiments often publish their data online for use in further research. For example, data from Sap et al. (2020) and Sap et al. (2022), summarized in Chapter 7, are available online as the Hippocorpus dataset—a dataset we will be exploring in depth in Unit 2. Another good example is the Empathic Reactions dataset (Buechel et al., 2018), used in Section 4.2.1 and Chapter 5, in which participants read news stories, rated their own empathy and distress after reading them, and then described their thoughts about them verbally.

Open experimental data are often linked in published papers, especially since the founding of the Center for Open Science in 2013. Many psychology-related datasets can be browsed freely on osf.io, the Harvard Dataverse, and other locations.

Advantages of Archived Experimental Data
  • Professional: Experiments conducted by trained academics are generally well designed.
  • Well-Documented: Datasets used in published papers have extensive documentation of the methods used to produce them.
Disadvantages of Archived Experimental Data
  • Sometimes Not Well-Documented: Datasets not used in published papers often have poor documentation.
  • Small Sample Size: Experiments often result in relatively small datasets, which can pose problems for certain NLP methods.

8.2 Linguistics Corpora

The field of linguistics has a long tradition of corpus data. Linguistics corpora provide extensive records of spoken and written language in a wide range of contexts. These corpora are often very large and professionally curated, making them ideal for the techniques described in this book. On the other hand, they are generally curated with linguistics in mind, not psychology, so applying them to psychological questions requires some ingenuity.

One popular semi-experimental linguistics corpus is the HCRC Map Task Corpus (Anderson et al., 1991), in which pairs of participants collaborated in a communication game. In each pair, one partner could see a treasure map with a path through various landmarks, while the other partner had a similar map without a path. The first partner explained to the second how to draw the path. The partners’ communication accuracy can be measured as the distance between the drawn path and the original. Full dialogue transcriptions, as well as accuracy scores, are available online. The Map Task Corpus has been reproduced in many languages, including Hebrew, and is commonly used in psychology. For example, Dideriksen et al. (2023) used a Danish version of the Map Task Corpus, along with other dialogue corpora, to track the ways that speakers collaborate to achieve mutual understanding in different contexts.

Advantages of Linguistics Corpora
  • Professional: Linguistics corpora are generally well curated and well documented.
  • Ecological Validity: Corpora are often large and naturalistic—including for spoken dialogue, a domain that is otherwise out of reach for NLP.
Disadvantages of Linguistics Corpora
  • Domain-Specific: Linguistics corpora are generally created by linguists for linguists.

8.3 Data Gathered From the Internet

The Internet is full of text, and you are not the first one to want to use it for research. Many corpora of online text data are free to download.

Some sets of Internet data are professionally curated and well balanced. For example, the Blog Authorship Corpus (Schler et al., 2006) includes 681,288 blog posts annotated with age group (binned into ages 13-17, 23-27, and 33-47) and gender of author, with an equal number of male and female bloggers in each age group. Similarly, the 20 Newsgroups dataset includes approximately 20,000 newsgroup documents, partitioned nearly evenly across 20 different newsgroups.

Some sets of Internet data are available only in processed form. For example, Eichstaedt et al. (2015) published Twitter n-gram (and LDA topic) frequencies by US county, along with corresponding measures of well-being (featured in Chapter 3).

Some sets of Internet data are very lightly curated. For example, the Reddit Top 2.5 Million dataset contains the top 1,000 all-time posts from the top 2,500 subreddits in August 2013, excluding NSFW subreddits.

Some sets of archived Internet data are not curated at all. These are sometimes referred to as data dumps. For example, Baumgartner et al. (2020) published all Reddit submissions and comments posted during April 2019. Even more extensive data dumps of Reddit, covering historical data back to Reddit’s inception, can be found in records of Pushshift Reddit. Similar archives exist for Twitter. Data dumps are usually in JSON format. A JSON file stores data as nested key-value pairs and arrays, much like a nested list in R. For a tutorial on processing JSON data in R, see the relevant chapter in R for Data Science.
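To make the JSON format concrete, here is a minimal sketch in Python; the same logic carries over to R. It assumes the dump is stored as newline-delimited JSON, with one record per line. The field names (`author`, `subreddit`, `body`) and the sample records are invented for illustration, so check the documentation of any real dump before relying on them.

```python
import io
import json

# Hypothetical sample: three comment records, one JSON object per line.
SAMPLE_DUMP = """\
{"author": "user_a", "subreddit": "AskReddit", "body": "First comment"}
{"author": "user_b", "subreddit": "science", "body": "Second comment"}
{"author": "user_a", "subreddit": "science", "body": "Third comment"}
"""


def read_records(lines):
    """Yield one dict per non-empty line of newline-delimited JSON."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def filter_by_subreddit(records, subreddit):
    """Keep only the records posted in the given subreddit."""
    return [r for r in records if r.get("subreddit") == subreddit]


# For a real file, replace io.StringIO(...) with open(path, encoding="utf-8").
science = filter_by_subreddit(read_records(io.StringIO(SAMPLE_DUMP)), "science")
for record in science:
    print(record["author"], "-", record["body"])
```

Reading the file one line at a time, rather than loading it whole, keeps memory use low, which matters when a dump is larger than the available RAM.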

Most research topics in psychology do not require up-to-date data. As such, historical archives can be an invaluable resource. Biester et al. (2022) is a great example:

An example of social media archives in psychology research: Biester et al. (2022) used patterns curated by Cohan et al. (2018) to search Pushshift Reddit for users who publicly shared a depression diagnosis (e.g., “I have been diagnosed with depression”). They then used dictionary-based methods (Chapter 14) to measure various emotional qualities in users’ posts during the weeks leading up to their declaration of the depression diagnosis, and in the weeks following. They found that anxiety, sadness, and cognitive processing increase in the weeks leading up to the declaration, and decrease afterwards.
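The search-and-split logic of such a study can be sketched roughly as follows, again in Python. This is not the pipeline of Biester et al. (2022): the regular expression is a simplified stand-in for the curated patterns of Cohan et al. (2018), and the posts, user names, and function names are hypothetical.

```python
import re
from datetime import date

# Illustrative pattern only; NOT the curated patterns of Cohan et al. (2018).
DIAGNOSIS_PATTERN = re.compile(
    r"\bI (?:have been|was|am|got) diagnosed with depression\b",
    re.IGNORECASE,
)

# Hypothetical posts as (user, date, text) tuples.
POSTS = [
    ("user_a", date(2019, 3, 1), "Rough week at work again."),
    ("user_a", date(2019, 3, 20), "I have been diagnosed with depression."),
    ("user_a", date(2019, 4, 2), "Started therapy, feeling a bit better."),
    ("user_b", date(2019, 3, 5), "My cat was diagnosed with a cold."),
]


def first_disclosure_dates(posts):
    """Map each user to the date of their earliest post matching the pattern."""
    dates = {}
    for user, day, text in posts:
        if DIAGNOSIS_PATTERN.search(text):
            if user not in dates or day < dates[user]:
                dates[user] = day
    return dates


def label_periods(posts, disclosure_dates):
    """Label posts by disclosing users as 'before', 'disclosure', or 'after'."""
    labeled = []
    for user, day, text in posts:
        if user not in disclosure_dates:
            continue  # users without a matching post are dropped
        disclosed = disclosure_dates[user]
        if day < disclosed:
            period = "before"
        elif day == disclosed:
            period = "disclosure"
        else:
            period = "after"
        labeled.append((user, period, text))
    return labeled


dates = first_disclosure_dates(POSTS)
for user, period, text in label_periods(POSTS, dates):
    print(user, period, text)
```

Once posts are labeled this way, any text measure (such as the dictionary-based scores of Chapter 14) can be averaged within each period and compared.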

Advantages of Archival Internet Data
  • Easy: Pre-gathered datasets are low-cost and low-effort, often for very large sample sizes.
  • Unintrusive: With pre-gathered datasets, you don’t have to worry about API usage limits or web scraping etiquette.
Disadvantages of Archival Internet Data
  • Old: Archival data do not reflect current events or recent trends.
A Disclaimer on Social Media Data Dumps

Since Reddit and Twitter restricted their API access in 2023, the legal status of large archival data dumps from those platforms (such as Pushshift Reddit) has been unclear. We are not qualified to give legal advice, but as long as you are not using the data for profit, you are unlikely to get in trouble.

8.4 Other Public Data Sources


Anderson, A. H., Bader, M., Bard, E. G., Boyle, E., Doherty, G., Garrod, S., Isard, S., Kowtko, J., McAllister, J., Miller, J., Sotillo, C., Thompson, H. S., & Weinert, R. (1991). The HCRC Map Task Corpus. Language and Speech, 34(4), 351–366. https://doi.org/10.1177/002383099103400404
Baumgartner, J., Zannettou, S., Keegan, B., Squire, M., & Blackburn, J. (2020). The Pushshift Reddit dataset. Zenodo. https://doi.org/10.5281/zenodo.3608135
Biester, L., Pennebaker, J., & Mihalcea, R. (2022). Emotional and cognitive changes surrounding online depression identity claims. PLOS ONE, 17(12), 1–20. https://doi.org/10.1371/journal.pone.0278179
Buechel, S., Buffone, A., Slaff, B., Ungar, L. H., & Sedoc, J. (2018). Modeling empathy and distress in reaction to news stories. CoRR, abs/1808.10399. http://arxiv.org/abs/1808.10399
Cohan, A., Desmet, B., Yates, A., Soldaini, L., MacAvaney, S., & Goharian, N. (2018). SMHD: A large-scale resource for exploring online language usage for multiple mental health conditions. In E. M. Bender, L. Derczynski, & P. Isabelle (Eds.), Proceedings of the 27th international conference on computational linguistics (pp. 1485–1497). Association for Computational Linguistics. https://aclanthology.org/C18-1126
Davies, M. (2009). The 385+ million word Corpus of Contemporary American English (1990–2008+): Design, architecture, and linguistic insights. International Journal of Corpus Linguistics, 14, 159–190. https://www.english-corpora.org/coca/
Dideriksen, C., Christiansen, M. H., Tylén, K., Dingemanse, M., & Fusaroli, R. (2023). Quantifying the interplay of conversational devices in building mutual understanding. Journal of Experimental Psychology: General, 152, 864–889. https://doi.org/10.1037/xge0001301
Eichstaedt, J. C., Schwartz, H. A., Kern, M. L., Park, G., Labarthe, D. R., Merchant, R. M., Jha, S., Agrawal, M., Dziurzynski, L. A., Sap, M., Weeg, C., Larson, E. E., Ungar, L. H., & Seligman, M. E. P. (2015). Psychological language on Twitter predicts county-level heart disease mortality. Psychological Science, 26(2), 159–169. https://doi.org/10.1177/0956797614557867
Sap, M., Horvitz, E., Choi, Y., Smith, N. A., & Pennebaker, J. (2020). Recollection versus imagination: Exploring human memory and cognition via neural language models. In D. Jurafsky, J. Chai, N. Schluter, & J. Tetreault (Eds.), Proceedings of the 58th annual meeting of the association for computational linguistics (pp. 1970–1978). Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.178
Sap, M., Jafarpour, A., Choi, Y., Smith, N. A., Pennebaker, J. W., & Horvitz, E. (2022). Quantifying the narrative flow of imagined versus autobiographical stories. Proceedings of the National Academy of Sciences, 119(45), e2211715119. https://doi.org/10.1073/pnas.2211715119
Schler, J., Koppel, M., Argamon, S., & Pennebaker, J. (2006). Effects of age and gender on blogging. 199–205.