In a recent project, I used the JSTOR Data for Research (DfR) service to conduct a systematic review of the interdisciplinary policing literature. To start, I requested from DfR the population of articles in peer-reviewed journals published from 1980 to 2018 that contain the terms ‘police’, ‘policing’, or ‘security agent’. This corpus comprises 65,285 texts. After receiving the articles, I read the data into R. Since I was primarily interested in what political science articles had to say about policing, I used the metadata provided by DfR to exclude all articles from non-political-science journals. This left me with a corpus of 14,309 articles published across 95 journals. I then used R to count the number of times that ‘police’ and its many related terms, such as ‘sheriff’ and ‘cop’, appear in each article. Next, I calculated the 90th percentile of this count (5 mentions) and removed from the corpus any articles with a lower count. The idea here is to isolate the articles that focus on policing from those that merely mention it. This process left me with 1,439 articles published in political science journals that make frequent mention of domestic security agents. This final corpus contains 10,689,081 words, with an average of 7,164 words per article (σ = 5,861).
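The counting and filtering step described above can be sketched in base R. This is a minimal illustration rather than my actual script: the `articles` data frame, its toy texts, and the `pattern` of search terms are all assumptions made up for the example, and the real term list was longer.

```r
# Toy stand-in for the corpus: one row per article (all values hypothetical).
articles <- data.frame(
  id   = c("a1", "a2", "a3", "a4", "a5"),
  text = c(
    "The police stopped the march. Police and sheriff deputies made arrests.",
    "Voters rewarded incumbents; a single cop appears in one anecdote.",
    "Trade policy and tariffs.",
    "Policing reform: police unions, police budgets, police oversight, cops.",
    "Courts and legislatures."
  ),
  stringsAsFactors = FALSE
)

# Illustrative search pattern (an assumption, not the full term list).
# The word boundaries keep 'policy' from matching 'police'.
pattern <- "\\b(police|policing|sheriffs?|cops?)\\b"

# Count every case-insensitive match of the pattern in each text.
count_matches <- function(text, pattern) {
  hits <- gregexpr(pattern, text, ignore.case = TRUE, perl = TRUE)
  lengths(regmatches(text, hits))
}

articles$n_police <- count_matches(articles$text, pattern)
articles$n_police
# 3 1 0 5 0

# 90th percentile of the per-article counts (default quantile type).
threshold <- quantile(articles$n_police, probs = 0.9)

# Keep articles at or above the threshold; drop those with a lower count.
focused <- articles[articles$n_police >= threshold, ]
focused$id
# "a4"
```

With a real corpus, `articles$text` would hold the full text of each article, and the threshold would be computed over all 14,309 political science articles.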
Once I had the corpus of texts I wanted, I transformed it into a document-term matrix (DTM) for analysis. A DTM is a data frame in which the rows are documents, the columns are words, and the cell entries contain counts of word occurrences. The guide for the R package 'quanteda' contains a good introduction to constructing DTMs and highlights their potential uses. With the DTM in hand, I set out to answer several questions about the literature: How often do political science journals publish articles on policing? Where has the discipline studied policing? And what are the central topics in the policing literature? I answered the first two questions in pretty rudimentary ways. For the first question, I used R to count the number of rows in my DTM by year and journal. For the second question, I used the 'grep' function in R to count the occurrences of country names in the raw text corpus.
Answering the third question was slightly more difficult and required that I estimate a structural topic model (STM) using a latent Dirichlet allocation algorithm. In short, the STM helped me identify four distinct topics in the policing literature. Topic 1 seems to deal with the domestic security apparatus in communist countries, such as the Soviet Union and the German Democratic Republic. Topic 2 seems to focus on policing in China and the Middle East. Topic 3 clusters articles on the political institutions that shape policing practices. We can perhaps think of this topic as having to do with ‘who polices the police’. Finally, Topic 4 seems to center on issues of race, ethnicity, and policing, particularly in the American context. You can read more about the results of my analysis here: http://comparativenewsletter.com/files/archived_newsletters/2018_spring.pdf.
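To make the DTM and country-count steps described above more concrete, here is a small sketch in base R. It is not the code I ran: the toy `docs`, the crude tokenizer, and the short `countries` list are assumptions for illustration. In practice, quanteda's `tokens()` and `dfm()` functions build the matrix for you and handle punctuation, stopwords, and sparse storage far better than this. The topic model itself is not sketched here.

```r
# Toy documents (hypothetical text, named by document id).
docs <- c(
  d1 = "Police reform in China and police oversight in Egypt",
  d2 = "The Soviet Union secret police",
  d3 = "Race and policing in the United States, with a comparison to China"
)

# Crude tokenizer (an assumption): lowercase, split on non-letters,
# and drop any empty strings left over from the split.
tokens <- lapply(strsplit(tolower(docs), "[^a-z]+"), function(tok) tok[tok != ""])

# Vocabulary: every distinct word in the corpus, sorted alphabetically.
vocab <- sort(unique(unlist(tokens)))

# Document-term matrix: rows are documents, columns are words,
# and each cell counts how often the word occurs in the document.
dtm <- t(vapply(
  tokens,
  function(tok) as.integer(table(factor(tok, levels = vocab))),
  integer(length(vocab))
))
colnames(dtm) <- vocab

dtm["d1", "police"]
# 2

# Country counts on the raw text: number of documents that mention each name.
# fixed = TRUE treats the name as a literal string, not a regular expression.
countries <- c("China", "Egypt", "Soviet Union", "United States")
country_counts <- vapply(
  countries,
  function(cn) sum(grepl(cn, docs, fixed = TRUE)),
  integer(1)
)
country_counts[["China"]]
# 2
```

Note that the country search runs on the raw text rather than the DTM, because multi-word names such as 'Soviet Union' are split into separate columns once the text is tokenized into single words.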
I hope that this post points readers to some of the text-as-data tools available in R and helps illustrate some of the potential uses of the DfR data!