Chris Bail, PhD
Duke University
www.chrisbail.net
github.com/cbail
twitter.com/chris_bail
Among the most basic forms of quantitative text analysis are word-counting techniques and dictionary-based methods. This tutorial will cover both of these topics, as well as sentiment analysis, which is a form of dictionary-based text analysis. This tutorial assumes basic knowledge about R and other skills described in previous tutorials at the link above.
In the early days of quantitative text analysis, counting word frequencies was a common mode of analysis. In this section, we’ll learn a few basic techniques for counting word frequencies and visualizing them. We’re going to work within the tidytext framework, so if you need a refresher on that, see my previous tutorial entitled “Basic Text Analysis in R.”
Let’s begin by loading the Trump tweets we extracted in a previous tutorial and transforming them into tidytext format:
load(url("https://cbail.github.io/Trump_Tweets.Rdata"))
library(tidytext)
library(dplyr)
tidy_trump_tweets<- trumptweets %>%
select(created_at,text) %>%
unnest_tokens("word", text)
Next, let’s count the top words after removing stop words (frequent words such as “the” and “and”) as well as other uninformative tokens (e.g. https):
data("stop_words")
top_words<-
tidy_trump_tweets %>%
anti_join(stop_words) %>%
filter(!(word=="https"|
word=="rt"|
word=="t.co"|
word=="amp")) %>%
count(word) %>%
arrange(desc(n))
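To see what this pipeline does at each step, here is a minimal sketch in base R on a toy corpus. The three "tweets" and the five-word stop list are hypothetical stand-ins (they are not drawn from the Trump dataset or from tidytext's stop_words), and the tokenizer is a simplification of unnest_tokens.

```r
# Toy corpus: three hypothetical "tweets" (not from the Trump dataset)
toy_tweets <- c("The economy is great and the jobs are back",
                "Great trade deal for the economy",
                "Jobs, jobs, jobs!")

# A tiny hypothetical stop-word list standing in for tidytext's stop_words
toy_stop_words <- c("the", "is", "and", "are", "for")

# Tokenize: lowercase, then split on anything that is not a letter
tokens <- unlist(strsplit(tolower(toy_tweets), "[^a-z]+"))
tokens <- tokens[tokens != ""]

# Drop stop words (the base R analogue of anti_join(stop_words))
tokens <- tokens[!tokens %in% toy_stop_words]

# Count and sort (the analogue of count(word) %>% arrange(desc(n)))
toy_counts <- sort(table(tokens), decreasing = TRUE)
toy_counts
```

In this toy corpus "jobs" comes out on top with four occurrences, followed by "economy" and "great" with two each.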
Now let’s make a graph of the top 20 words:
library(ggplot2)
top_words %>%
slice(1:20) %>%
ggplot(aes(x=reorder(word, -n), y=n, fill=word))+
geom_bar(stat="identity")+
theme_minimal()+
theme(axis.text.x =
element_text(angle = 60, hjust = 1, size=13))+
theme(plot.title =
element_text(hjust = 0.5, size=18))+
ylab("Frequency")+
xlab("")+
ggtitle("Most Frequent Words in Trump Tweets")+
guides(fill="none")
Though we have already removed very common “stop words” from our analysis, it is common practice in quantitative text analysis to identify unusual words that might set one document apart from the others (this will become particularly important when we get to more advanced forms of pattern recognition in text later on). As the figure below shows, the metric most commonly used to identify these kinds of words is “Term Frequency Inverse Document Frequency” (tf-idf).
We can calculate the tf-idf for the Trump tweets database in tidytext as follows. Note that each tweet is treated as its own document here, identified by its created_at timestamp:
tidy_trump_tfidf<- trumptweets %>%
select(created_at,text) %>%
unnest_tokens("word", text) %>%
anti_join(stop_words) %>%
count(word, created_at) %>%
bind_tf_idf(word, created_at, n)
Now let’s see what the most unusual words are:
top_tfidf<-tidy_trump_tfidf %>%
arrange(desc(tf_idf))
top_tfidf$word[1]
## [1] "standforouranthem"
The tf-idf increases the more often a term appears in a document, but it is weighted down by how many documents in the dataset or corpus contain that term. In simpler terms, the tf-idf helps us capture which words are not only important within a given document but also distinctive vis-a-vis the broader corpus or tidytext dataset.
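To make the arithmetic concrete, here is a minimal sketch that computes tf-idf by hand in base R. The three documents and their word counts are hypothetical. The definitions used are term frequency as a word's share of its document's words, and inverse document frequency as the natural log of the number of documents divided by the number of documents containing the word, which, as far as I know, is the definition bind_tf_idf uses.

```r
# Toy document-term counts (hypothetical data): one row per word per document
toy <- data.frame(
  doc  = c("a", "a", "a", "b", "b", "c", "c"),
  word = c("economy", "jobs", "wall", "economy", "jobs", "economy", "anthem"),
  n    = c(2, 1, 1, 1, 3, 1, 1),
  stringsAsFactors = FALSE
)

# Term frequency: share of a document's words accounted for by each term
toy$tf <- toy$n / ave(toy$n, toy$doc, FUN = sum)

# Inverse document frequency: log(number of documents / documents containing the term)
n_docs   <- length(unique(toy$doc))
doc_freq <- table(toy$word)   # each word appears at most once per document here
toy$idf  <- log(n_docs / as.numeric(doc_freq[as.character(toy$word)]))

# tf-idf is simply the product of the two
toy$tf_idf <- toy$tf * toy$idf
toy[order(-toy$tf_idf), ]
```

Because "economy" appears in all three documents, its idf is log(3/3) = 0 and its tf-idf is zero everywhere, no matter how often it is used. The word "anthem", which appears in only one short document, receives the highest score.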
Though word frequency counts and tf-idf can be an informative way to examine text-based data, another very popular technique involves counting the number of words in each document that carry a particular meaning or value of interest to the researcher. There are numerous examples that we shall discuss below, some of which are more sophisticated than others.
Creating your own dictionary
To begin, let’s make our own dictionary of terms we want to examine from the Trump tweet dataset. Suppose we are doing a study of economic issues, and want to subset those tweets that contain words associated with the economy. To do this, we could first create a list or “dictionary” of terms that are associated with the economy.
economic_dictionary<-c("economy","unemployment","trade","tariffs")
Having created a very simple/primitive dictionary, we can now subset the parts of our tidytext dataframe that contain these words using the str_detect function within Hadley Wickham’s stringr package:
library(stringr)
economic_tweets<-trumptweets[str_detect(trumptweets$text, paste(economic_dictionary, collapse="|")),]
The example above was somewhat arbitrary and mostly designed to introduce you to the concept of dictionary-based text analysis. The list of economic terms that I came up with was very ad hoc, and though the tweets identified above each mention the economy, there are probably many more tweets in our dataset that reference economic issues but do not include the words I identified.
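A simple pattern like the one above also has two mechanical quirks worth knowing about: it is case-sensitive, and it matches dictionary terms inside longer words. Here is a minimal sketch of a stricter variant, using base R's grepl in place of str_detect. The four example tweets are hypothetical.

```r
economic_dictionary <- c("economy", "unemployment", "trade", "tariffs")

# Hypothetical example tweets
toy_tweets <- c("The ECONOMY is booming",
                "He was traded to another team",
                "New tariffs announced today",
                "Nothing to see here")

# Pattern as used above: matches the terms anywhere, including inside longer words
loose_pattern <- paste(economic_dictionary, collapse = "|")

# Stricter variant: \\b anchors each term at word boundaries
strict_pattern <- paste0("\\b(", loose_pattern, ")\\b")

loose_hits  <- grepl(loose_pattern, toy_tweets)
strict_hits <- grepl(strict_pattern, toy_tweets, ignore.case = TRUE, perl = TRUE)

data.frame(tweet = toy_tweets, loose_hits, strict_hits)
```

The loose pattern misses "ECONOMY" because of its capitalization and flags the sports tweet because "traded" contains "trade". The strict, case-insensitive pattern does the reverse. Neither is right for every project; which one you want depends on whether inflected forms such as "traded" should count.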
Dictionary-based approaches are often most useful when a high-quality dictionary is available that is of interest to the researcher or analyst. One popular type of dictionary is a sentiment dictionary, which can be used to assess the valence of a given text by searching for words that describe affect or opinion. Some of these dictionaries are created by comparing text-based evaluations of products in online forums to rating systems. Others are created via systematic observation of the writing of people who have been primed to write about different emotions.
Let’s begin by examining some of the sentiment dictionaries that are built into tidytext. These include afinn, which includes a list of sentiment-laden words that appeared in Twitter discussions of climate change; bing, which includes sentiment words identified on online forums; and nrc, which is a dictionary that was created by having workers on Amazon Mechanical Turk code the emotional valence of a long list of terms. These dictionaries can produce very different results since they were created using very different datasets (meaning they identify sentiment-laden words using different corpora). Each of these dictionaries only describes sentiment-laden words in the English language. They also have different scales. We can browse the content of each dictionary as follows:
library(tidytext)
head(get_sentiments("bing"))
## # A tibble: 6 x 2
## word sentiment
## <chr> <chr>
## 1 2-faces negative
## 2 abnormal negative
## 3 abolish negative
## 4 abominable negative
## 5 abominably negative
## 6 abominate negative
Let’s apply the bing sentiment dictionary to our database of tweets by Trump:
trump_tweet_sentiment <- tidy_trump_tweets %>%
inner_join(get_sentiments("bing")) %>%
count(created_at, sentiment)
head(trump_tweet_sentiment)
## # A tibble: 6 x 3
## created_at sentiment n
## <dttm> <chr> <int>
## 1 2017-02-05 22:49:42 positive 2
## 2 2017-02-06 03:36:54 positive 4
## 3 2017-02-06 12:01:53 negative 3
## 4 2017-02-06 12:01:53 positive 1
## 5 2017-02-06 12:07:55 negative 2
## 6 2017-02-06 16:32:24 negative 3
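Counts like these are often collapsed into a single net score per tweet (positive words minus negative words). Here is a minimal sketch of that idea in base R. Both the tokenized tweets and the five-word lexicon are hypothetical; the lexicon only mimics the two-column layout of get_sentiments("bing").

```r
# Hypothetical tokenized tweets, one row per word (mimicking tidy_trump_tweets)
toy_tokens <- data.frame(
  tweet_id = c(1, 1, 1, 2, 2, 3),
  word     = c("great", "win", "sad", "terrible", "fake", "table"),
  stringsAsFactors = FALSE
)

# Hypothetical mini-lexicon in the same layout as get_sentiments("bing")
toy_lexicon <- data.frame(
  word      = c("great", "win", "sad", "terrible", "fake"),
  sentiment = c("positive", "positive", "negative", "negative", "negative"),
  stringsAsFactors = FALSE
)

# merge() keeps only the words found in both tables, like inner_join()
matched <- merge(toy_tokens, toy_lexicon, by = "word")

# Cross-tabulate tweets by sentiment, then subtract negative from positive counts
counts <- table(matched$tweet_id, matched$sentiment)
net_sentiment <- counts[, "positive"] - counts[, "negative"]
net_sentiment
```

Tweet 1 scores +1 (two positive words, one negative) and tweet 2 scores -2. Tweet 3 drops out entirely because none of its words are in the lexicon, which is worth remembering: an inner join silently discards documents with no dictionary matches.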
Now let’s make a visual that compares the frequency of positive and negative tweets by day. To do this, we’ll need to work a bit with the created_at variable. More specifically, we will need to transform it into a “date” object that we can use to pull out the day during which each tweet was made:
tidy_trump_tweets$date<-as.Date(tidy_trump_tweets$created_at,
format="%Y-%m-%d %H:%M:%S")
The format argument here tells R how to read in the date character string, since dates can appear in a number of different formats, time zones, etc. For more information about how to work with dates in other formats, see ?as.Date
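Here is a minimal sketch of how as.Date behaves with different kinds of input, using made-up timestamps. The main things to notice are that a format string has to describe the entire character string, and that date-time (POSIXct) input needs no format at all but is converted in UTC unless you supply a time zone.

```r
# Character timestamp: the format string must describe the whole string
d1 <- as.Date("2017-02-05 22:49:42", format = "%Y-%m-%d %H:%M:%S")

# A different layout needs a different format string
d2 <- as.Date("02/05/2017", format = "%m/%d/%Y")

# Date-time (POSIXct) input: no format is needed, and the date is taken in UTC
# unless a time zone is supplied
stamp <- as.POSIXct("2017-02-06 03:36:54", tz = "UTC")
d3 <- as.Date(stamp)
d4 <- as.Date(stamp, tz = "America/New_York")

format(c(d1, d2, d3, d4))
```

The last two results differ by a day: 3:36 AM UTC on February 6 is still the evening of February 5 on the US East Coast. If you are aggregating tweets by day, it is worth deciding which time zone defines a "day" for your purposes.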
Now let’s aggregate negative sentiment by day:
trump_sentiment_plot <-
tidy_trump_tweets %>%
inner_join(get_sentiments("bing")) %>%
filter(sentiment=="negative") %>%
count(date, sentiment)
## Joining, by = "word"
Now, let’s plot it:
ggplot(trump_sentiment_plot, aes(x=date, y=n))+
geom_line(color="red", size=.5)+
theme_minimal()+
theme(axis.text.x =
element_text(angle = 60, hjust = 1, size=13))+
theme(plot.title =
element_text(hjust = 0.5, size=18))+
ylab("Number of Negative Words")+
xlab("")+
ggtitle("Negative Sentiment in Trump Tweets")+
theme(aspect.ratio=1/4)
There appears to be an upward trend. Is it possible that this increase is being shaped by Trump’s approval rating? Let’s take a look, downloading data from the polling aggregator FiveThirtyEight (538) for the same time period as our Twitter data above:
trump_approval<-read.csv("https://projects.fivethirtyeight.com/trump-approval-data/approval_topline.csv")
trump_approval$date<-as.Date(trump_approval$modeldate, format="%m/%d/%Y")
approval_plot<-
trump_approval %>%
filter(subgroup=="Adults") %>%
filter(date>min(trump_sentiment_plot$date)) %>%
group_by(date) %>%
summarise(approval=mean(approve_estimate))
#plot
ggplot(approval_plot, aes(x=date, y=approval))+
geom_line(group=1)+
theme_minimal()+
ylab("% of American Adults who Approve of Trump")+
xlab("Date")
Quite obviously we have some scaling issues we would need to address if we wanted to make a proper comparison, but for now, let’s move on.
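For readers who do want to attempt the comparison, one common way to handle the scaling problem is to align the two series by date and convert each to z-scores. Here is a minimal sketch of that approach. The numbers are made up for illustration and are not taken from the tweet or approval data above.

```r
# Hypothetical daily series on very different scales
toy_sentiment <- data.frame(
  date = as.Date("2017-02-06") + 0:4,
  n    = c(3, 8, 5, 12, 10)                     # negative words per day
)
toy_approval <- data.frame(
  date     = as.Date("2017-02-06") + 0:4,
  approval = c(45.1, 44.6, 44.9, 43.8, 44.0)    # percent approving
)

# Align the two series on the days they share
combined <- merge(toy_sentiment, toy_approval, by = "date")

# Convert each series to z-scores (mean 0, standard deviation 1)
combined$n_z        <- as.numeric(scale(combined$n))
combined$approval_z <- as.numeric(scale(combined$approval))

# With both series on a common scale, they can be plotted together or correlated
cor(combined$n_z, combined$approval_z)
```

Standardizing puts both series on the same axis but says nothing about causation, and daily time series like these are usually autocorrelated, so a raw correlation should be read with caution.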
There are many other types of sentiment analysis, which we do not have time to cover here. An important thing for you to know, however, is that different sentiment analysis tools work better for some corpora than others. Here is a figure from a recent paper that applies a variety of different sentiment dictionaries to different corpora:
This paper also gives some comparative perspective on how different sentiment analysis tools perform on a database of tweets about different issues.
Finally, other papers attempt to rank different sentiment classifiers. These analyses are helpful, but I think the most appropriate thing for you to consider when you are trying to choose which type of sentiment analysis to use is how the tool was created, and for what purpose. The tools that perform best will probably be those that were created for purposes that are most similar to your own desired purpose.
Now You Try It
Let’s try to bring together several of the skills you’ve learned in the course thus far: 1) Pick another politician’s Twitter account; 3) See if the politician’s approval rating tracks the sentiment of their tweets, or, better yet, create your own custom dictionary to track a variable or variables of interest over time.
Linguistic Inquiry and Word Count (LIWC)
Before I wrap up, I just want to highlight another class of dictionary-based approaches: those that attempt to classify not only sentiment but a range of different types of psychometric and substantive properties of a text. A popular example is Linguistic Inquiry and Word Count, which was developed by the social psychologist James Pennebaker. As the figure below shows, LIWC is a large dictionary that classifies words into dozens of categories:
Unlike some of the other approaches above, LIWC was built in a very systematic fashion, through both observation of natural language use in a variety of settings and empirical observation of people who were primed to write about different subjects. For these reasons, it has become one of the more popular dictionary-based approaches in the past decade. At the same time, like all dictionary approaches, it is ultimately limited insofar as it assumes that each word has an intrinsic meaning. As we will soon see, a more useful assumption is often that words assume different meanings based upon their appearance alongside other words.
The quality of dictionary-based methods depends heavily upon the match between the corpus used to build the dictionary and the one you want to code. Creating your own dictionary is often a good solution, but it is very time intensive. On the other hand, as we will see in future tutorials, dictionary-based approaches often perform better than more sophisticated techniques such as topic modeling, depending upon the task at hand.