Chris Bail, PhD
Duke University
Check out my Text as Data Course


Among the most basic forms of quantitative text analysis are word-counting techniques and dictionary-based methods. This tutorial will cover both of these topics, as well as sentiment analysis, which is a form of dictionary-based text analysis. This tutorial assumes basic knowledge about R and other skills described in previous tutorials at the link above.

Word Counting

In the early days of quantitative text analysis, word-frequency counting in texts was a common mode of analysis. In this section, we’ll learn a few basic techniques for counting word frequencies and visualizing them. We’re going to work within the tidytext framework, so if you need a refresher on that, see my previous tutorial entitled “Basic Text Analysis in R.”

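Let’s begin by loading the Trump tweets we extracted in a previous tutorial and transforming them into tidytext format:

library(dplyr)
library(tidytext)
# assumes the trumptweets data frame from the previous tutorial is already loaded
tidy_trump_tweets<- trumptweets %>%
    select(created_at,text) %>%
    unnest_tokens("word", text)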
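
To see what unnest_tokens does to a data frame, here is a minimal sketch on a made-up two-row dataset. The toy_tweets object and its text are invented for illustration and are not drawn from the Trump data:

```r
library(dplyr)
library(tidytext)

# toy_tweets is a hypothetical stand-in for trumptweets
toy_tweets <- tibble(
  created_at = c("2017-02-05", "2017-02-06"),
  text = c("Make the economy great", "Trade deals are GREAT")
)

# split the text column into one lowercased word per row
tidy_toy <- toy_tweets %>%
  unnest_tokens("word", text)

tidy_toy
```

Each row of the result holds a single word along with the created_at value of the tweet it came from, which is the shape every later step in this tutorial relies on.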

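Next, let’s count the top words after removing stop words (frequent words such as “the” and “and”) as well as other uninformative tokens (e.g. https):

data("stop_words")
trump_tweet_top_words <- tidy_trump_tweets %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE) %>%
  # drop URL fragments and Twitter artifacts that carry no meaning
  filter(!word %in% c("https", "t.co", "amp", "rt"))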

Now let’s make a graph of the top 20 words:

library(ggplot2)

#select only top words
top_20 <- head(trump_tweet_top_words[order(trump_tweet_top_words$n, decreasing=TRUE),], 20)

#create factor variable to sort by frequency
top_20$word <- factor(top_20$word, levels = top_20$word)

ggplot(top_20, aes(x=word, y=n))+
  geom_col()+
  theme(axis.text.x = element_text(angle = 90, hjust = 1))+
  ylab("Number of Times Word Appears in Trump's Tweets")+
  xlab("")


Though we have already removed very common “stop words” from our analysis, it is common practice in quantitative text analysis to identify unusual words that might set one document apart from the others (this will become particularly important when we get to more advanced forms of pattern recognition in text later on). The metric most commonly used to identify these unusual words is “term frequency-inverse document frequency” (tf-idf). We can calculate the tf-idf for the Trump tweets database in tidytext as follows:

tidy_trump_tfidf<- trumptweets %>%
    select(created_at,text) %>%
      unnest_tokens("word", text) %>%
        anti_join(stop_words) %>%
           count(word, created_at) %>%
              bind_tf_idf(word, created_at, n)

Now let’s see what the most unusual words are:

top_tfidf<-tidy_trump_tfidf %>%
  arrange(desc(tf_idf))
top_tfidf$word[1]
## [1] "standforouranthem"

The tf-idf of a term increases the more often the term appears in a document, but it is down-weighted by the number of documents in the dataset, or corpus, that contain the term. In the code above, each tweet (identified by its created_at timestamp) is treated as a document. In simpler terms, the tf-idf helps us capture which words are not only important within a given document but also distinctive vis-a-vis the broader corpus or tidytext dataset.
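
To make the weighting concrete, here is a minimal base R sketch that computes tf-idf by hand on a made-up two-document corpus. It assumes the usual definition (term frequency as a share of the document's words, multiplied by the natural log of the number of documents over the number of documents containing the term), which to my understanding is also what bind_tf_idf computes. The docs object and helper functions are hypothetical names used only for illustration:

```r
# Toy corpus: two hypothetical "documents" (not from the Trump data)
docs <- list(
  doc1 = c("great", "wall", "great"),
  doc2 = c("great", "economy")
)

n_docs <- length(docs)

# tf: share of a document's words accounted for by the term
tf <- function(term, doc) sum(doc == term) / length(doc)

# idf: natural log of (number of documents / number of documents containing the term)
idf <- function(term) {
  log(n_docs / sum(sapply(docs, function(d) term %in% d)))
}

tf_idf <- function(term, doc) tf(term, doc) * idf(term)

tf_idf("great", docs$doc1)    # 0, because "great" appears in every document
tf_idf("wall", docs$doc1)     # (1/3) * log(2)
tf_idf("economy", docs$doc2)  # (1/2) * log(2)
```

Note that “great” is the most frequent word in the first document yet scores zero, because a word that appears in every document does nothing to distinguish one from another.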

Dictionary-Based Quantitative Text Analysis

Though word frequency counts and tf-idf can be an informative way to examine text-based data, another very popular technique involves counting the number of words in each document that have been assigned a particular meaning or value by the researcher. There are numerous examples that we shall discuss below, some of which are more sophisticated than others.

Creating your own dictionary

To begin, let’s make our own dictionary of terms we want to examine from the Trump tweet dataset. Suppose we are doing a study of economic issues, and want to subset those tweets that contain words associated with the economy. To do this, we could first create a list or “dictionary” of terms that are associated with the economy.
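
The term list itself does not appear in this text, so the following is only a minimal sketch. The four terms are illustrative assumptions, and you should substitute whichever terms suit your own study; only the object name economic_dictionary is taken from the code further down:

```r
# Illustrative terms only; these are assumptions, not a validated dictionary
economic_dictionary <- c("economy", "unemployment", "trade", "tariffs")
```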


Having created a very simple/primitive dictionary, we can now subset the tweets in our original data frame that contain these words using the str_detect function within Hadley Wickham’s stringr package:

library(stringr)
economic_tweets<-trumptweets[str_detect(trumptweets$text, paste(economic_dictionary, collapse="|")),]
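
It can help to look at the pattern that paste(..., collapse="|") builds and at how str_detect treats it. The sketch below uses a made-up dictionary and made-up sentences (toy_dictionary and toy_text are hypothetical names) to show that the pattern matches substrings, and one possible way to restrict matches to whole words:

```r
library(stringr)

# hypothetical dictionary and texts, for illustration only
toy_dictionary <- c("economy", "trade")
toy_text <- c("The economy is booming",
              "He was traded to another team",
              "Great rally tonight")

# collapsing with "|" builds the regular expression "economy|trade"
pattern <- paste(toy_dictionary, collapse = "|")

# substring matching: "traded" counts as a hit for "trade"
loose_hits <- str_detect(toy_text, pattern)

# wrapping the alternatives in word boundaries restricts matches to whole words
strict_pattern <- paste0("\\b(", pattern, ")\\b")
strict_hits <- str_detect(toy_text, strict_pattern)

loose_hits
strict_hits
```

Whether the looser or stricter behavior is preferable depends on the study: substring matching also catches plurals and verb forms, but it can pull in unrelated text as well. Matching is also case sensitive by default.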

Sentiment Analysis

The example above was somewhat arbitrary and mostly designed to introduce you to the concept of dictionary-based text analysis. The list of economic terms that I came up with was very ad hoc, and though the tweets identified above each mention the economy, there are probably many more tweets in our dataset that reference economic issues without using any of the words I identified.

Dictionary-based approaches are often most useful when a high-quality dictionary is available that is of interest to the researcher or analyst. One popular type of dictionary is a sentiment dictionary, which can be used to assess the valence of a given text by searching for words that describe affect or opinion. Some of these dictionaries are created by comparing text-based evaluations of products in online forums to the accompanying ratings. Others are created via systematic observation of writing by people who have been primed to write about different emotions.

Let’s begin by examining some of the sentiment dictionaries that are built into tidytext. These include afinn, a list of sentiment-laden words that appeared in Twitter discussions of climate change; bing, which includes sentiment words identified on online forums; and nrc, a dictionary that was created by having workers on Amazon Mechanical Turk code the emotional valence of a long list of terms. These dictionaries often produce similar results, even though they were built from different corpora. Each of them only describes sentiment-laden words in the English language. They also have different scales: afinn assigns each word a numeric score, for example, while bing labels each word as either positive or negative. We can browse the content of each dictionary as follows:

head(get_sentiments("afinn"))
## # A tibble: 6 x 2
##   word       score
##   <chr>      <int>
## 1 abandon       -2
## 2 abandoned     -2
## 3 abandons      -2
## 4 abducted      -2
## 5 abduction     -2
## 6 abductions    -2

Let’s apply the bing sentiment dictionary to our database of tweets by Trump:

trump_tweet_sentiment <- tidy_trump_tweets %>%
  inner_join(get_sentiments("bing")) %>%
    count(created_at, sentiment) 

head(trump_tweet_sentiment)
## # A tibble: 6 x 3
##   created_at          sentiment     n
##   <dttm>              <chr>     <int>
## 1 2017-02-05 22:49:42 positive      2
## 2 2017-02-06 03:36:54 positive      4
## 3 2017-02-06 12:01:53 negative      3
## 4 2017-02-06 12:01:53 positive      1
## 5 2017-02-06 12:07:55 negative      2
## 6 2017-02-06 16:32:24 negative      3
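
The join-then-count logic is easier to follow on a small example. The sketch below swaps in a hand-made three-word lexicon (toy_lexicon, a hypothetical stand-in for get_sentiments("bing")) and a made-up tidy dataset, so it runs without downloading anything:

```r
library(dplyr)

# toy_lexicon is a hypothetical three-word stand-in for get_sentiments("bing")
toy_lexicon <- tibble(
  word = c("great", "win", "failing"),
  sentiment = c("positive", "positive", "negative")
)

# toy tidy data: one row per word, as unnest_tokens would produce
toy_tidy <- tibble(
  created_at = c("t1", "t1", "t1", "t2", "t2"),
  word = c("great", "big", "win", "failing", "media")
)

toy_sentiment <- toy_tidy %>%
  inner_join(toy_lexicon, by = "word") %>%  # drops words not in the lexicon
  count(created_at, sentiment)              # tallies matches per tweet and label

toy_sentiment
```

Because inner_join keeps only the words that appear in the lexicon, words such as “big” and “media” silently drop out. The counts therefore describe sentiment-laden words, not every word in the tweet.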

Now let’s make a visual that tracks sentiment by day. To do this, we’ll need to work a bit with the created_at variable. More specifically, we will need to transform it into a “date” object that we can use to pull out the day on which each tweet was posted:

tidy_trump_tweets$date <- as.Date(tidy_trump_tweets$created_at,
                                          format="%Y-%m-%d")

The format argument here tells R how to read in the date character string, since dates can appear in a number of different formats, time zones, etc. For more information about how to parse dates written in other formats, see ?as.Date
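
As a minimal sketch of how the format argument behaves, the following base R example parses the same calendar day from two made-up strings and shows what happens when the format does not match the input:

```r
# year-month-day followed by a time; the trailing time is ignored
d1 <- as.Date("2017-02-05 22:49:42", format = "%Y-%m-%d")

# month/day/year, a common US-style layout
d2 <- as.Date("02/05/2017", format = "%m/%d/%Y")

# a format that does not match the string yields NA rather than an error
d3 <- as.Date("02/05/2017", format = "%Y-%m-%d")

d1
d2
d3
```

Since a mismatched format returns NA silently, it is worth checking for missing values in the new date column before aggregating.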

Now let’s aggregate negative sentiment by day:

trump_sentiment_plot <-
  tidy_trump_tweets %>%
    inner_join(get_sentiments("bing")) %>% 
      filter(sentiment=="negative") %>%
          count(date, sentiment)
## Joining, by = "word"

ggplot(trump_sentiment_plot, aes(x=date, y=n))+
  geom_line()+
  ylab("Frequency of Negative Words in Trump's Tweets")+
  xlab("Date")