Quantifying Psychological Properties of Text

The average native English speaker has a functionally useful vocabulary of over 20,000 words (Hellman, 2011). For psychologists, 20,000 words means 20,000 potential variables to analyze. And that is not all: words can be combined in a practically unlimited number of sequences, and their order can drastically change their meaning in context. How do we capture all of this complexity in simple measurements that we can use for analysis?

In this unit, we introduce methods for quantifying psychological properties of text. We begin with fundamental skills for language processing in R, and move gradually to more advanced methods. In the final sections, we cover recent developments in the use of large language models for psychological research.

Throughout the unit, we will provide example analyses using the Hippocorpus dataset, which contains the data from Sap et al. (2020) and Sap et al. (2022). The dataset consists primarily of texts written by online participants, who were instructed to write either true stories about events that had recently happened to them or fictional stories on a comparable topic. Participants then retold the true stories several months later. The dataset also includes a variety of psychological and demographic characteristics of the participants, including gender and openness to experience. The dataset itself is not included in the book’s files, but you can easily download it from the website and follow along.

Hellman, A. B. (2011). Vocabulary size and depth of word knowledge in adult-onset second language acquisition. International Journal of Applied Linguistics, 21(2), 162–182.
Sap, M., Horvitz, E., Choi, Y., Smith, N. A., & Pennebaker, J. (2020). Recollection versus imagination: Exploring human memory and cognition via neural language models. In D. Jurafsky, J. Chai, N. Schluter, & J. Tetreault (Eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 1970–1978). Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.178
Sap, M., Jafarpour, A., Choi, Y., Smith, N. A., Pennebaker, J. W., & Horvitz, E. (2022). Quantifying the narrative flow of imagined versus autobiographical stories. Proceedings of the National Academy of Sciences, 119(45), e2211715119. https://doi.org/10.1073/pnas.2211715119