11  Look at Your Data

Before using any theory-driven methods, it is always worthwhile to get a sense of your data more generally. This is as true for text data as it is for any other format. The process of exploring your data without a particular theory or construct in mind is called Exploratory Data Analysis (EDA).

Many guides to EDA for text data suggest plotting histograms of text length, calculating standard metrics of valence, and generating word clouds. We love computational methods and will explore many of them in the coming chapters, but there is no denying it: the quickest, most foolproof way to explore your data is to look at it. Just open the raw data and spend a minute or two reading through it. This is especially true for language data, since you have spent your whole life developing an efficient, nuanced understanding of natural language. If your data are short stories, pick two or three at random and read them. If your data are Reddit comments, pick a dozen at random and read them. You will very likely notice interesting patterns, or identify quirks in the data that should be filtered out before further analysis.
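Picking texts "at random" is worth doing literally, since reading only the first few rows of a file can be misleading when the data are sorted by date, score, or author. The following is a minimal sketch in Python, using only the standard library, of one way to draw a random handful of texts and print them for reading. The function names `sample_texts` and `show` and the `comments` list are hypothetical and exist only for illustration. In practice you would replace the placeholder list with your own texts.

```python
import random
import textwrap


def sample_texts(texts, k=12, seed=None):
    """Return up to k texts drawn uniformly at random, without replacement."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    texts = list(texts)
    k = min(k, len(texts))     # never ask for more texts than exist
    return rng.sample(texts, k)


def show(texts, width=80):
    """Print each text under a numbered divider, wrapped for easy reading."""
    for i, text in enumerate(texts, start=1):
        print(f"--- {i} ---")
        print(textwrap.fill(text, width=width))
        print()


# Hypothetical stand-in data; replace with your own list of texts.
comments = [f"Placeholder comment number {n}." for n in range(1, 101)]

picked = sample_texts(comments, k=12, seed=0)
show(picked)
```

Setting `seed` lets you return to the same sample later. Leaving it as `None` gives a fresh draw on each run, which is handy if you want to read a second or third batch.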