Prepared by Elizaveta Sivak, SICSS-HSE

Reddit is a discussion website, or a network of communities based on people’s interests. People can use Reddit as a newsfeed, reading posts in different communities, or topical subforums (called “subreddits”), or post something themselves and discuss it with other users in the comments. Reddit is very popular, mostly in the US but also in other countries (the UK, Australia, Canada, across Europe, etc.).

Reddit is also an interesting source of data for research in the social sciences. Its main strength as a data source is that it gives access to clusters of hard-to-reach populations and people with relatively uncommon characteristics, interests, or kinds of behavior (people with depression, stock market enthusiasts, QAnon casualties, etc.). These people are conveniently clustered in the related subreddits, which can be seen as sampling frames.

Reddit is often used for research. Some examples:

  1. Maria Antoniak et al. (2019) used r/BabyBumps (a subreddit with birth stories) to analyse narratives about births and the positions of power of different agents

  2. Munmun de Choudhury et al. analysed r/Depression

For methodological questions about using Reddit for research purposes, see Adams, Artigiani and Wish (2019), who discuss the comparative strengths of Twitter and Reddit for conducting social media drug research, and Shatz (2016), who describes the advantages and disadvantages of recruiting participants on Reddit.

In this tutorial we will give some introductory information about accessing Reddit data.

The first question is: how do you select subreddits for your research?

  1. Use Reddit’s subreddit search: http://www.reddit.com/reddits

  2. Consult with people who have knowledge of the platform

  3. Identify the subreddits that are most active in relation to your keyword (code is below)

Ways to get Reddit data:

  1. Use the Pushshift repository. Thanks to Jason Baumgartner, we actually have all Reddit data (posts and comments). Please consider donating if you download large amounts of data. The main drawback is that data are posted with a time lag, so you can’t access the most recent posts; for instance, the 2020 Reddit data were published on Pushshift in July 2021. When you need recent data instantly (or only a specific subreddit or time period), it is impossible or not very practical to download whole datasets from Pushshift.

  2. Use the Reddit API by building a URL of the form https://www.reddit.com/r/{subreddit}/{listing}.json?limit={count}&t={timeframe}, where timeframe is one of hour, day, week, month, year, all, and listing is one of controversial, best, hot, new, random, rising, top. Example: https://www.reddit.com/r/depression/top.json?limit=100&t=year (see the R sketch after this list).
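
As a minimal sketch of this direct approach (assuming the httr and jsonlite packages; the exact structure of the JSON that Reddit returns is not guaranteed and may change):

# install.packages(c("httr", "jsonlite"))
library(httr)
library(jsonlite)

# URL for the 100 top posts of the past year in r/depression
url <- "https://www.reddit.com/r/depression/top.json?limit=100&t=year"

# Reddit asks API clients to send a descriptive user agent
resp <- GET(url, user_agent("r:reddit-data-tutorial (educational use)"))
parsed <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))

# the posts sit in parsed$data$children$data; pick a few fields of interest
posts <- parsed$data$children$data
head(posts[, c("title", "score", "num_comments", "created_utc")])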

As the sketch above shows, you can construct the link for the subreddit and time frame you’d like and then parse the returned content yourself with standard web-scraping or JSON tools. However, there are already more convenient tools for getting data from Reddit via its API:

  1. If you are a Python user, there is the PRAW package. Limitation: it can’t extract submissions between specific dates. Given the Reddit API’s limit of 1000 posts per request, it is not possible to get all the data from a subreddit (only the 1000 most recent or 1000 most popular submissions).

  2. For R users there is the RedditExtractoR package (author Ivan Rivera). Although with this package you can easily get neat data frames, the limitation is the same: it can’t extract submissions between specific dates.

  3. The good news is that there is a solution! Again thanks to Jason Baumgartner, one can use the Pushshift API to access data between specific dates. All the instructions for constructing a request are on the Pushshift GitHub page. For instance, if you’d like to search all subreddits for comments mentioning the term “parenting” made between 2 and 4 days ago, you construct the link like this: https://api.pushshift.io/reddit/search/comment/?q=parenting&after=4d&before=2d&sort=asc (there is an analogous submission endpoint for posts, called “submissions”; see also the R sketch after this list). Another good thing about the Pushshift API is that it returns data in a more human-readable format than the standard Reddit API (open the link above and see for yourself).

  4. One can also use Pushshift not directly but via dedicated packages. We will use the pushshiftR package (author Nathan Cunningham) to access the Pushshift API from R.
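
As a quick illustration, here is a hedged sketch of calling the Pushshift endpoint above directly from R (assuming httr and jsonlite; the field names follow the Pushshift documentation, and the service’s availability and response format may vary):

library(httr)
library(jsonlite)

# comments mentioning "parenting" posted between 4 and 2 days ago, oldest first
ps_url <- "https://api.pushshift.io/reddit/search/comment/?q=parenting&after=4d&before=2d&sort=asc"
resp <- GET(ps_url)
comments <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))$data
head(comments[, c("subreddit", "author", "created_utc", "body")])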

Let’s try two ways to get Reddit data: via RedditExtractoR and pushshiftR.

WAY 1 (RedditExtractoR)

Installing and loading the package:

install.packages("RedditExtractoR", repos = "http://cran.us.r-project.org")
library(RedditExtractoR)

First, let’s find out which subreddits are most active in relation to our keyword (in this case “parenting”). The page_threshold parameter controls the number of pages that will be searched for a given search term (1 by default). You can either get the most recent submissions (sort_by = 'new') or the submissions with the most comments (sort_by = 'comments').

links <- reddit_urls(
  search_terms   = "parenting",
  page_threshold = 10,
  sort_by = 'new'
)

Then we can access the content of a submission using reddit_content():

content <- reddit_content(links$URL[3])

Let’s visualize the result:

#install.packages('dplyr')
#install.packages('ggplot2')
library(dplyr)
library(ggplot2)
# keep only subreddits that appear more than twice in the search results
links2 <- links %>% group_by(subreddit) %>% mutate(count = n()) %>% filter(count > 2)

# bar chart of the number of matching submissions per subreddit
g <- ggplot(links2, aes(x = reorder(subreddit, -count))) +
  geom_bar(stat = "count") +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
g

The resulting plot shows the most relevant communities (subreddits) if you are interested in getting Reddit data on parenting (where relevance = mentioning the term “parenting”). One of these subreddits is r/Parenting. Let’s extract some content from this subreddit with the get_reddit() function.

As before, we can get either the most recent posts (sort_by = 'new') or the posts with the most comments (sort_by = 'comments').

The dataset includes posts, the comments on each post, and also contextual information (the date of each comment, the number of upvotes, etc.). A nice feature is that we also get the structure of the comments. All with only one line of code (!)

prnt <- get_reddit(subreddit = "Parenting", page_threshold = 2, sort_by = 'new')
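
It can be useful to take a quick look at what get_reddit() returns (the exact column names may differ slightly across RedditExtractoR versions):

dim(prnt)     # number of rows (comments) and columns
names(prnt)   # available variables: post and comment text, dates, scores, etc.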

But, as mentioned before, with this function we can’t get the whole subreddit (except for very new subreddits). How do we get submissions between specific dates?

WAY 2 (pushshiftR)

First, let’s install the package. Installing pushshiftR takes a bit longer: you first need to install the devtools package.

install.packages("devtools", repos = "http://cran.us.r-project.org")
library(devtools)
devtools::install_github("https://github.com/nathancunn/pushshiftR",force = TRUE)
library(pushshiftR)
# you may also be asked to install Rtools35; in that case, install it from https://cran.r-project.org/bin/windows/Rtools/history.html

Now we can extract data using pushshiftR. We can get either post content (“submission”) or comments (“comment”). Using the “after” parameter (together with a loop, as sketched below) we can get older posts, not just the most recent ones.

prnt2 <- getPushshiftData(postType = "submission",
                     size = 10,
                     after = "1546300800",  # unix time https://www.unixtimestamp.com/index.php
                     subreddit = "Parenting",
                     nest_level = 1)
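
The call above returns a single batch of the 10 oldest submissions after the given timestamp. To page through older posts, you can wrap the call in a loop and move the “after” parameter forward after each batch. Below is a minimal sketch of this idea; it assumes the returned data frame has a created_utc column with Unix timestamps, so check the output of the call above and adjust the column name if needed.

all_batches <- list()
after <- "1546300800"                     # start of 2019, Unix time
for (i in 1:5) {                          # 5 batches as an illustration
  batch <- getPushshiftData(postType = "submission",
                            size = 100,
                            after = after,
                            subreddit = "Parenting",
                            nest_level = 1)
  if (is.null(batch) || nrow(batch) == 0) break   # stop when nothing is returned
  all_batches[[i]] <- batch
  after <- as.character(max(batch$created_utc))   # move the time window forward
  Sys.sleep(1)                                    # be polite to the API
}
prnt_all <- dplyr::bind_rows(all_batches)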