SICSS curriculum for text analysis covered a broad range of techniques for analyzing unstructured text. Today, we are going to have a group exercise designed to compare the strengths and weaknesses of different text analysis techniques when they are applied to a single dataset.

  1. Everyone will be randomized into groups of ~4 people.

  2. Load the dataframe of tweets by President Trump that we analyzed during the discussion of Dictionary-Based text methods:

load(url("https://cbail.github.io/Trump_Tweets.Rdata"))

Note that this data frame also includes counts of the number of times each of his tweets were retweeted or “liked.” NOTE: IF YOU PREFER TO USE A DATASET YOUR GROUP COLLECTED DURING YESTERDAY’S EXERCISE, IT IS OK TO USE THIS AS WELL.

  1. Use at least two of the techniques discussed in the videos to pull out features from the text of Trump’s tweets (e.g. substantive themes, topics, sentiment), or the text-based dataset you collected yesterday.

  2. Work together to identify whether any features of Trump’s Twitter language predict the number of retweets or likes his messages receive. If you are working with Reddit data, you can predict upvotes or the number of comments.

  3. Load the dataframe of daily approval ratings for President Trump from the five-thirty-eight website’s github:

trump_approval<-read.csv("https://projects.fivethirtyeight.com/trump-approval-data/approval_topline.csv")
  1. Work together to determine whether there are any features of Trump’s Twitter language that have an association with his overall approval ratings.

  2. Produce a visualization that describes the findings of your analysis and share it in our #sicss-hse slack channel with a brief description of what you did.