Chris Bail
Duke University
www.chrisbail.net
This is the first in a series of tutorials I’ve created about collected data from web-based sources such as Facebook or Twitter and analyzing such data using a range of new techniques for automated text analysis. Before we proceed to the technical aspects of these techniques, I want to give you some sense of where the came from
The field of automated text analysis has been around for more than half a century, yet it has evolved very rapidly in recent years because of recent advances in the field of natural language processing. Perhaps the first person to propose the idea of quantifying patterns in text was Harold Laswell, who famously used this approach to study WWI propoganda. “We may classify references into categories,” wrote Laswell in 1938, " according to the understanding which prevails among those who are accustomed to the symbols. References used in interviews may be quantified by counting the number of references which fall into each category during a selected period of time (or per thousand words uttered)." Lasswell was ahead of his time. In 1935- and at the age of 21-Laswell was developing methods that tracked the association between word utterances and physiological reactions (e.g. pulse rate, electrical conductivity of the skin, and blood pressure).
Harold Laswell, Pioneer of Quantitative Text Analysis
But Laswell was but the first in a long line of pioneers in the field of quantitative text analysis that hailed from many different fields— from sociology to computer sience and linguistics. The table below provides an overview of some of the major milestones in the field.
Timeline of Quantitative Text Analysis
Time | Activity |
---|---|
1934 | Laswell Produces first Key-Word Count |
1934 | Vygotsky Produces first Quantitative Narrative Analaysis |
1950 | Gottschalk Uses Content Analysis to Track Freudian Themes |
1950 | Turing Applies AI to text |
1952 | Bereleson Publishes First Textbook on Content Analysis |
1954 | First Automatic Translation of Text (Georgetown Experiment) |
1963 | Msteller and Wallace analyze Federalist Papers |
1965 | Tomashevsky Further Formalizes Quantitative Narrative Analysis |
1966 | Stone and Bales use mainframe computer to measure psychometric properties of text at RAND |
1980 | Decline of Chomskyean Formalism/NLP is Born |
1980 | Machine Learning is Applied to NLP |
1981 | Weintraub counts parts of speech |
1985 | Schrodt Introduces Automated Event Coding |
1986 | Pennebaker develops LIWC |
1988 | First Latent Semantic Analysis Patent |
1989 | Franzosi brings Quantitative Narrative Analysis to Social Science |
1998 | Mohr’s Quantitative Analysis of Culture |
1999 | Bearman et al. apply Network Methods to Narratives |
2001 | Blei et al. develop LDA |
2002 | MALLET created |
2005 | Quinn et al. use analyze political speeches using topic models |
2009 | New Directions in Text Conference |
2010 | King/Hopkins Bring Supervised Learning into mainstream |
2010 | Tools for Text Workshop at Washington |
What I find particularly remarkable about this (probably incomplete) timeline is that it covers scholars in at least seven different fields. Though social scientists made some of the earliest contributions to the field, computer scientists and linguists have exerted considerable influence in recent decades. The intellectual diversification of the field also coincided with the tremendous outgrowth of text-based data via the internet and other sources.
For some other great introductions to the field of text as data, see:
Justin Grimmer & Brandon Stewart. Text as Data: The Promises and Pitfalls of Automated Content Analysis, Political Analysis.
James Evans & Pedro Aceves. Machine Translation: Mining Text for Social Theory. Annual Review of Sociology.