Chris Bail, PhD
Duke University
www.chrisbail.net
github.com/cbail
twitter.com/chris_bail

Introduction

Application Programming Interfaces, or APIs, have become one of the most important ways to access and transfer data online, and increasingly APIs can analyze your data as well. Compared to screen-scraping, which is often legally dubious, logistically difficult, or both, APIs let you make custom requests for data in a manner that is well structured and considerably easier to work with than the HTML or XML data described in my previous tutorials on screen-scraping. This tutorial assumes basic knowledge of R and the other skills described in previous tutorials at the link above.

What Is an Application Programming Interface?

APIs are tools for building apps or other software that give people access to certain parts of large databases. Software developers can combine these tools in various ways, or combine them with tools from other APIs, to build even more useful tools. Most of us use such apps every day. For example, if you install the Spotify app within your Facebook page to share music with your friends, that app extracts data from Spotify's API and then posts it to your Facebook page by communicating with Facebook's API. There are countless examples of this on the internet at present, thanks in large part to the advent of Web 2.0, the historical moment when websites became much more intertwined and interdependent.

The number of APIs that are publicly available has expanded dramatically over the past decade, as the figure below shows. At the time of this writing, the website Programmable Web lists 19,638 APIs from sites as diverse as Google, Amazon, YouTube, the New York Times, del.icio.us, LinkedIn, and many others. Though the core function of most APIs is to provide software developers with access to data, many APIs now analyze data as well: facial recognition APIs, voice-to-text APIs, APIs that produce data visualizations, and so on.


How Does an API Work?

To illustrate how an API works, it is useful to start with a very simple one. Suppose we want to use the Google Maps API to geocode a named entity, that is, to tag the name of a place with latitude and longitude coordinates. The way we do this is to write a URL that a) names the API and b) includes the text of the query we want to make. If we Googled “Google Maps API Geocode” we would eventually be pointed towards the documentation for that API and learn that the base URL for the Google Maps API is https://maps.googleapis.com. We want to use the geocoding function of this API, so we need a URL that points to this more specific part of the API: https://maps.googleapis.com/maps/api/geocode/json?address=. We can then add a named entity such as “Duke” to the end of the URL, which looks like this: https://maps.googleapis.com/maps/api/geocode/json?address=Duke. This link (with some additional text that I will describe below) produces the following output in a web browser:


What we are seeing is something called JSON data. Though it may look somewhat messy at first glance, with lots of brackets, colons, commas, and indentation patterns, it is in fact highly structured and capable of storing complex types of data. Here, we can see that our request to geocode “Duke” not only identified the city within which it is located (Durham), but also the county, country, and, towards the end of the page, the latitude and longitude data we were looking for. We will learn how to extract that piece of information later. The goal of the current discussion is simply to give you an idea of what an API is and how it works.

If we wanted to search for another geographic location, we could take the link above and replace “Duke” with the name of another place. Try it out to give yourself a very rudimentary sense of how an API works.
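If you would rather make this request from R, here is a minimal sketch using the httr and jsonlite packages. Note that Google requires an API key for geocoding requests, so the google_api_key value below is a placeholder that you would need to replace with your own key (the object names are simply illustrative):

# make the same geocoding request from R (sketch; requires your own Google API key)
library(httr)
library(jsonlite)

google_api_key <- "YOUR_GOOGLE_API_KEY"

response <- GET("https://maps.googleapis.com/maps/api/geocode/json",
                query = list(address = "Duke", key = google_api_key))

# parse the JSON returned by the API into R objects
parsed <- fromJSON(content(response, as = "text", encoding = "UTF-8"))

# the latitude and longitude sit inside the nested "geometry" element
parsed$results$geometry$location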

API Credentials

Though anyone can make a request to the Google Maps API, getting data from Facebook's API (which Facebook calls the “Graph” API) is considerably more difficult. This is because, with good reason, Facebook does not want a software developer to collect data about people with whom they have no connection on Facebook. To prevent this, Facebook's Graph API, like many other APIs, requires you to obtain “credentials”: codes or passwords that identify you and determine which types of data you are allowed to access. To illustrate this further, let's take a look at a tool Facebook built to help people learn about APIs. It's called the Graph API Explorer.


If you have a Facebook account and are logged in, Facebook will generate credentials for you automatically in the form of something called an “Access Token.” In the screenshot above, this appears in a bar towards the lower part of the screen. This code gives you temporary authorization to make requests from Facebook's Graph API, but ONLY for data that you are allowed to access from your own Facebook page. If you click the blue “Submit” button at the top right of the screen, you will see some output that contains your name and an ID that Facebook assigns to you. With some more effort, we could use this tool to make API calls to access our friend list, our likes, and so on, but for now, I'm simply trying to make the point that each person gets their own code that allows them to access some, but certainly not all, of the data on Facebook's API. If I were to write in “cocacola” instead of “me” to get access to data posted by this business, I would get an error message explaining that my current credentials do not give me access to that data.

Credentials may determine not only your access to people with whom you are connected on a social network, but also other privileges you may have vis-a-vis an API. For example, many APIs charge money for access to their data or services, and thus you will only receive your credentials after setting up an account. As we will see below, some sites also require multiple types of credentials, which go by a variety of names such as “tokens,” “keys,” or “secrets.”

Now YOU Try It!!!

The Graph API Explorer allows you to search for different fields. Take a moment and see whether you can make a call for other types of information about yourself. To do this, you'll have to request different levels of “permission” using the dropdown menu on the right-hand side of the Explorer page. You can populate the syntax for different fields using the “search for a field” box on the upper right-hand side of the page. Don't worry if you can't get the query language right for now: specifying the syntax of an API call requires mastering each API's documentation, which can take a very long time. This tool tries to help you, but it does not work perfectly, and constructing the right call may require spending quite a bit of time reading through the API documentation. More importantly, as you will soon see, there are a number of R packages with functions that make API calls for you, saving you the time and energy necessary to learn the syntax specific to each API.

Rate Limiting

Before we make any more calls to APIs, we need to become familiar with an important concept called “rate limiting.” The credentials described in the previous section not only define what type of information we are allowed to access, but also how often we are allowed to request it. These restrictions are known as “rate limits” (see figure below). If we make too many requests within too short a period of time, an API will temporarily block us from collecting data for a period that can range from 15 minutes to 24 hours or more, depending upon the API. Rate limiting is necessary so that APIs are not overwhelmed by too many simultaneous requests, which would slow down access to data for everyone. Rate limiting also enables large companies such as Google, Facebook, or Twitter to prevent developers from collecting large amounts of data that could either compromise their users' confidentiality or threaten their business model (since data has such immense value in today's economy).

The exact timing of rate limiting is not always public, since knowing such time increments could enable developers to “game” the system and make rapid requests as soon as rate limiting has ended. Some APIs, however, allow you to make an API call or query in order to learn how many more requests you can make within a given time period before you are rate limited. Even if you do not violate rate limits, you can also be “throttled” for making too many requests overall, as the image below shows:

An Example with Twitter’s API

To illustrate the process of obtaining credentials and to show how rate limiting works in practice, I will now present a worked example of how to obtain different types of data from the Twitter API. The first step is to obtain credentials from Twitter that will allow you to make API calls.

Twitter, like many other websites, requires you to create an account in order to receive credentials. If you do not already have a Twitter account, you will need to create one. Note that if you create an account just to try out this exercise, your application for credentials (or permission) to use the API may be rejected if the account is brand new and never used, since that can signal a malicious actor starting lots of Twitter accounts to build a bot army, spread misinformation, and so on. You may also be asked to confirm your email address or add a mobile phone number: two-factor authentication helps Twitter prevent people from obtaining many different credentials through multiple accounts, which could be used to collect large amounts of data without being rate limited.

Once you have a Twitter account, you must visit https://developer.twitter.com to create a developer account by clicking “Apply for a developer account.” This website is pictured below. Hover over your Twitter profile image on the top right hand side of the page and click “get started.”


Next, you will be asked a series of questions about how you want to use Twitter's API. Unfortunately, as far as I know, Twitter does not publish exact guidelines about who is allowed to use the API and why. That said, you can learn a lot by reading Twitter's terms of service. Obvious red flags include plans to build tools that harass Twitter users, hoard Twitter's data (particularly for business purposes), or serve other malicious purposes. The most important thing is to be as specific as possible about how you want to use the API. You could write something like “I am a student at X university trying to teach myself how to use Twitter's API in order to study Y.” You should also give more detail about what type of data you want to collect and what you plan to do with it.

Once you accept the terms, your developer request will go under review by Twitter. At the time of this writing, this process can be rather time intensive, and with good reason, since Twitter is most likely employing large numbers of people to vet everyone who is applying for credentials. I've seen people get credentials a) instantly; b) within 10 minutes; and c) after more than a week. I unfortunately also know of several cases where people made multiple applications without any of the red flags above (and using different wording) that were ultimately rejected. Hopefully yours will be approved (and if not, you might try mentioning your problem in a tweet to @TwitterAPI on Twitter; this seems to have worked for several people that I know).

Once your developer account is approved, you can log in once again and click the “Create New App” button at the top right of the screen. Our goal is not to create a fully fledged app at this point, but simply to obtain the credentials necessary to begin making some simple calls to the Twitter API. You can name your app whatever you want, describe it however you want, and put in the name of any website you like. One important thing you must do is to put the following text in the “Callback URL” text box: http://127.0.0.1:1410. This address describes where the API will return your authentication result; in this case it points to your own machine (localhost), but it could be another site where you want the results sent.

If you followed the steps above, the name of your application should now appear. Click on it, and then click on the “Keys and Access Tokens” tab in order to get your credentials. Twitter requires developers to obtain two different types of credentials, which are listed on that page. These are blurred out in the screenshot below because I do not want readers of this page to have access to my credentials, which they could then abuse in various ways:


The next step is to define your credentials as string variables in R, which we will then use to authenticate ourselves with the Twitter API. Make sure to select the entire string (triple-clicking helps), and make sure that you do not accidentally leave out the first or last character (or add spaces):

# load rtweet

library(rtweet)

# create credentials as objects (these are FAKE CREDENTIALS)
# you need to replace them with your own.

api_key <- "aafghaioeriokjasfkljhsa"
api_secret_key <- "234897234kasdflkjashk"

Next, we are going to install an R package called rtweet that helps us make calls to Twitter's API. More specifically, it provides a long list of functions that a) construct the API queries for different types of information and b) parse the resulting data into neat formats. In order to authenticate, you may also need to install the httpuv package (if so, you will receive an error message about this package).

install.packages("rtweet")

Now we are ready to authenticate ourselves vis-a-vis Twitter's API. To do this, we are going to use rtweet's create_token function, which makes an API call that passes the credentials we defined above and then opens a web browser with an authentication dialogue that you must authorize by clicking the blue “Authorize” button. You should then receive the message “Authentication complete. Please close this page and return to R.” Remember that you will need to substitute the name of your own app for the my_awesome_app listed in the code chunk below.

token <- create_token(
  app = "my_awesome_app",
  consumer_key = api_key,
  consumer_secret = api_secret_key)
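As a quick sanity check, you can ask rtweet which token it will use for subsequent calls; if authentication worked, this should print the details of the app you just authorized:

# print the token rtweet will use for API calls
get_token()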

Now we can take full advantage of the many useful functions within the rtweet package for collecting data from Twitter. Let's begin by extracting 4,000 tweets that mention “coronavirus.”

library(rtweet)
## 
## Attaching package: 'rtweet'
## The following object is masked from 'package:purrr':
## 
##     flatten
covid_19_tweets<-search_tweets("coronavirus", n=4000)

This code creates a dataframe called covid_19_tweets which we may then browse. Let’s take a look at the names of the variables in this dataframe:

names(covid_19_tweets)
##  [1] "user_id"                 "status_id"              
##  [3] "created_at"              "screen_name"            
##  [5] "text"                    "source"                 
##  [7] "display_text_width"      "reply_to_status_id"     
##  [9] "reply_to_user_id"        "reply_to_screen_name"   
## [11] "is_quote"                "is_retweet"             
## [13] "favorite_count"          "retweet_count"          
## [15] "quote_count"             "reply_count"            
## [17] "hashtags"                "symbols"                
## [19] "urls_url"                "urls_t.co"              
## [21] "urls_expanded_url"       "media_url"              
## [23] "media_t.co"              "media_expanded_url"     
## [25] "media_type"              "ext_media_url"          
## [27] "ext_media_t.co"          "ext_media_expanded_url" 
## [29] "ext_media_type"          "mentions_user_id"       
## [31] "mentions_screen_name"    "lang"                   
## [33] "quoted_status_id"        "quoted_text"            
## [35] "quoted_created_at"       "quoted_source"          
## [37] "quoted_favorite_count"   "quoted_retweet_count"   
## [39] "quoted_user_id"          "quoted_screen_name"     
## [41] "quoted_name"             "quoted_followers_count" 
## [43] "quoted_friends_count"    "quoted_statuses_count"  
## [45] "quoted_location"         "quoted_description"     
## [47] "quoted_verified"         "retweet_status_id"      
## [49] "retweet_text"            "retweet_created_at"     
## [51] "retweet_source"          "retweet_favorite_count" 
## [53] "retweet_retweet_count"   "retweet_user_id"        
## [55] "retweet_screen_name"     "retweet_name"           
## [57] "retweet_followers_count" "retweet_friends_count"  
## [59] "retweet_statuses_count"  "retweet_location"       
## [61] "retweet_description"     "retweet_verified"       
## [63] "place_url"               "place_name"             
## [65] "place_full_name"         "place_type"             
## [67] "country"                 "country_code"           
## [69] "geo_coords"              "coords_coords"          
## [71] "bbox_coords"             "status_url"             
## [73] "name"                    "location"               
## [75] "description"             "url"                    
## [77] "protected"               "followers_count"        
## [79] "friends_count"           "listed_count"           
## [81] "statuses_count"          "favourites_count"       
## [83] "account_created_at"      "verified"               
## [85] "profile_url"             "profile_expanded_url"   
## [87] "account_lang"            "profile_banner_url"     
## [89] "profile_background_url"  "profile_image_url"

As you can see, our API call returned a lot of interesting variables, including the name and screen name of the user, the text of their tweet, the time of their post, and a variety of other metrics, including links to media content and user profiles. A small number of users also enable geolocation of their tweets, and where that information is available it will appear in this dataset. The full list of variables collected by our API call appears above.

Next, let’s take a look at the text variable, which contains the contents of the tweets we collected:

head(covid_19_tweets$text)
## [1] "Funny, after nearly a month of not mentioning coronavirus, because of the riots, the Covid chicken littles are back chirping their heads off trying to fear monger again. \n\nWon’t work - people have had it with all the lies."                                                   
## [2] "Lo que  se observa es una catástrofe en la mala gestión de cadáveres... https://t.co/fFDlO1D3yM"                                                                                                                                                                                    
## [3] "Sin camas y desbordados los hospitales en Honduras por #coronavirus #covid19 ante el abandono de las autoridades de salud https://t.co/c432CLq94T"                                                                                                                                  
## [4] "Mnuchin told the Senate that he would not disclose who was getting $511 billion in coronavirus loans because that information is confidential. \n\nThat's interesting because the loan application says that information will be released automatically...\nhttps://t.co/CD3aUaAbw6"
## [5] "El titular de la cartera señaló que para poder optar a este beneficio, se debe pertenecer al Registro Social de Hogares. https://t.co/GkLOuJ6EUN"                                                                                                                                   
## [6] "@ArvindKejriwal @narendramodi Respected sir, we need an urgent intervention to control COVID19 situation. A short term lockdown will be immensely helpful to break the surge. Pls do consider, it's horrifying to see the current numbers. #coronavirus covid-19 #DelhiFightsCorona"

As a brief aside, the rtweet package also interfaces nicely with ggplot2 and other visualization libraries to produce nice plots of the results above. For instance, let's make a plot of the frequency of tweets about coronavirus in the sample we just collected:

ts_plot(covid_19_tweets, "secs") +
  ggplot2::theme_minimal() +
  ggplot2::theme(plot.title = ggplot2::element_text(face = "bold")) +
  ggplot2::labs(
    x = NULL, y = NULL,
    title = "Frequency of Tweets about Covid-19 Around 1pm, May 3, 2020",
    subtitle = "Tweet counts aggregated by second",
    caption = "\nSource: Data collected from Twitter's REST API via rtweet"
  )

The search_tweets function also has a number of useful options or arguments. For instance, we can restrict results to English-language tweets sent from the United States using the code below. The code also restricts the results to non-retweets and focuses on the most recent tweets, rather than a mixture of popular and recent tweets, which is the default setting.

covid_geo_tweets <- search_tweets("coronavirus lang:en",
  geocode = lookup_coords("usa"),
  n = 3000, type = "recent", include_rts = FALSE
  )

rtweet also enables one to geocode tweets for users who allow Twitter to track their location:

geocoded <- lat_lng(covid_geo_tweets)

We can then plot these results as follows (you may need to install the maps package to do this):

library(maps)
## 
## Attaching package: 'maps'
## The following object is masked from 'package:purrr':
## 
##     map
par(mar = c(0, 0, 0, 0))
maps::map("world", lwd = .25)
with(geocoded, points(lng, lat, pch = 20, cex = .50, col = rgb(0, .3, .7, .75)))

We don’t see all of our tweets in this diagram for an important reason— we are only looking at people who allow Twitter to track their location (and this is roughly 1 in 100 people at the time of this writing).

Twitter’s API is also very useful for collecting data about a given user. Let’s take a look at Bernie Sanders’s Twitter page: http://www.twitter.com/SenSanders. There we can see Senator Sanders’s description and profile, the full text of his tweets, and—if we click several links—the names of the people he follows, those who follow him, and the tweets which he has “liked.”

First, let’s get his 5 most recent tweets:

sanders_tweets <- get_timelines(c("sensanders"), n = 5)
head(sanders_tweets$text)
## [1] "\"Less than lethal\"? What a joke. These weapons are fracturing people's skulls, destroying their eyeballs, and harming their lungs.\n\nEnough. We must ban the use of rubber bullets, tear gas and pepper spray on protesters across the country. https://t.co/glYh47tTnE"                                     
## [2] "Ban them. https://t.co/t4qhtEa5KT"                                                                                                                                                                                                                                                                              
## [3] "This is what oligarchy looks like. During the pandemic, 630 billionaires have seen their wealth go up by $565 billion while almost everyone else has seen their wealth go down by $6.5 trillion. We cannot push this grotesque inequality under the rug. https://t.co/CVMjUJPvTl"                               
## [4] "The Supreme Court shamefully chose to protect the profits of the fossil fuel industry today over the future of our planet.\n \nOur job: Rapidly transition away from fossil fuels like fracked natural gas and create millions of jobs in sustainable energy. https://t.co/k5EmCSgPDC"                          
## [5] "Fantastic news. No one in America should face discrimination for being who they are or for who they love.\n\nTogether, we are going to defeat the hate and bigotry of this administration and stand with our LGBTQ+ family. Congratulations to everyone who fought to make this happen. https://t.co/QYYmquszbc"

Note that you are limited to requesting a user's most recent 3,200 tweets, so obtaining a complete archive for a person who tweets very often may not be feasible; you may need to purchase the data from Twitter itself.
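For example, to request as much of a single user's timeline as the API allows, you can simply set n to that maximum (the object name below is just illustrative):

# request the maximum number of timeline tweets the API allows (roughly 3,200)
sanders_full_timeline <- get_timeline("sensanders", n = 3200)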

Next, let’s get some broader information about Sanders using the lookup_users function:

sanders_twitter_profile <- lookup_users("sensanders")

This creates a dataframe with a variety of additional variables. For example:

sanders_twitter_profile$description
## [1] "U.S. Senator Bernie Sanders of Vermont is the longest-serving independent in congressional history."
sanders_twitter_profile$location
## [1] "Vermont/DC"
sanders_twitter_profile$followers_count
## [1] 9937261

We can also use the get_favorites() function to identify the Tweets Sanders has recently “liked.”

sanders_favorites<-get_favorites("sensanders", n=5)
sanders_favorites$text
## [1] "Law enforcement relies on qualified immunity to shield officers from accountability for instances of police brutality and excessive force.\n\nIn our culture of systemic racism, it is one of the foremost tools of oppression.\n\nQualified immunity must be immediately eliminated."                              
## [2] "To be clear, Georgetown Cupcakes in D.C. right now is delivery only.\n\nSo if that's your arbitrary standard Kellyanne, I think it's time that our country has national vote-by-mail. https://t.co/AyNttIxJTC"                                                                                                      
## [3] "Billions of dollars in defense spending increases won't solve this pandemic.\n\nIt's about time we see taxpayer dollars support the American public—not line the pockets of defense contractors.\n\nI led 29 Democrats who agree: we must decrease defense spending. https://t.co/2fvjBxMffh"                       
## [4] "Join me in a @LULAC &amp; @UnivisionNews virtual Town Hall to discuss the Impact of the #COVID19 pandemic in Latino and minority communities with @jorgeramosnews @senbooker @senkamalaharris and @sensanders.\n\U0001f4fa Tune in today at 1:00PM ET \U0001f449 https://t.co/R8pxysnD01"                           
## [5] "My legislation with @SenSanders, @MarkWarner, &amp; @SenDougJones is a guardrail at the edge of a precipice. Our plan gives workers the steady comfort of a consistent paycheck &amp; offers businesses the ability to hold onto workers, so they can start up again as easily as possible. https://t.co/LuVrEPdnna"

We can also get a list of the people who follow Sanders like this:

sanders_follows<-get_followers("sensanders")

This produces the user IDs of those followers, and we could get more information about them using the lookup_users function. If we were interested in building a larger social network dataset centered on Sanders, we could collect the followers of his followers within a loop, as the sketch below illustrates.
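Here is a minimal sketch of what such a loop might look like. The five-follower sample, the n = 100 cap, and the 60-second pause are arbitrary choices for illustration; a real crawl of this kind would need to handle rate limits much more carefully.

# take a small sample of Sanders's followers
follower_ids <- head(sanders_follows$user_id, 5)

# collect the followers of each of those followers
second_degree <- list()
for (id in follower_ids) {
  # get_followers accepts user IDs as well as screen names
  second_degree[[id]] <- get_followers(id, n = 100)
  # pause between calls to stay well under the rate limit
  Sys.sleep(60)
}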

Looping is an efficient way of collecting a large amount of data, but it will also trigger rate limiting. As I mentioned above, however, Twitter enables users to check their rate limits. The rate_limit() function in the rtweet package does this as follows:

rate_limits<-rate_limit()
head(rate_limits[,1:4])
## # A tibble: 6 x 4
##   query                  limit remaining reset        
##   <chr>                  <int>     <int> <drtn>       
## 1 lists/list                15        15 15.00606 mins
## 2 lists/memberships         75        75 15.00606 mins
## 3 lists/subscribers/show    15        15 15.00606 mins
## 4 lists/members            900       900 15.00606 mins
## 5 lists/subscriptions       15        15 15.00606 mins
## 6 lists/show                75        75 15.00606 mins

In the code above I created a dataframe that describes the total number of calls I can make before each limit resets (the reset column gives the time remaining in the current window; in this case, about 15 minutes). To prevent rate limiting within a large loop, it is common practice to use R's Sys.sleep function, which tells R to pause for a certain number of seconds before proceeding to the next iteration of the loop.
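Here is a minimal sketch of that pattern for the search endpoint: check how many calls remain, and if none do, sleep until the window resets before continuing.

# check the rate limit for the search endpoint specifically
search_limit <- rate_limit(query = "search/tweets")

# if no calls remain, pause until the limit resets
if (search_limit$remaining == 0) {
  Sys.sleep(as.numeric(search_limit$reset, units = "secs"))
}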

rtweet has a number of other useful functions which I will mention in case they might be useful to readers. get_trends() will identify the trending topics on Twitter in a particular area:

get_trends("New York")
## # A tibble: 46 x 9
##    trend url   promoted_content query tweet_volume place  woeid
##    <chr> <chr> <lgl>            <chr>        <int> <chr>  <int>
##  1 T-Mo… http… NA               T-Mo…       122966 New … 2.46e6
##  2 #Ins… http… NA               %23I…       254708 New … 2.46e6
##  3 Issa  http… NA               Issa        127157 New … 2.46e6
##  4 Molly http… NA               Molly        76972 New … 2.46e6
##  5 #90D… http… NA               %239…        15125 New … 2.46e6
##  6 #WWE… http… NA               %23W…        61419 New … 2.46e6
##  7 Shak… http… NA               %22S…           NA New … 2.46e6
##  8 #The… http… NA               %23T…           NA New … 2.46e6
##  9 Rob … http… NA               %22R…        42108 New … 2.46e6
## 10 Gundy http… NA               Gundy        62330 New … 2.46e6
## # … with 36 more rows, and 2 more variables: as_of <dttm>,
## #   created_at <dttm>

rtweet can even control your Twitter account. For example, you can post messages to your Twitter feed from R as follows:

post_tweet("I love APIs")

I have used this function in past work with bots (see, for example, this paper).

Now YOU Try It!!!

To reinforce the skills you’ve learned in this section, try the following: 1) collect the most recent 100 tweets from CNN; 2) determine how many people follow CNN on Twitter; and 3) determine whether CNN is currently tweeting about any subjects that are trending in your area.

Wrapping API Calls within a Loop

Very often, one may wish to wrap API calls such as those we have made thus far into a loop in order to collect data about a long list of users. To illustrate this, let’s open a list of the Twitter handles of elected officials in the U.S. that I posted on my GitHub site:

#load list of twitter handles for elected officials
elected_officials<-read.csv("https://cbail.github.io/Senators_Twitter_Data.csv", stringsAsFactors = FALSE)

head(elected_officials)
##   bioguide_id party gender title birthdate   firstname middlename lastname
## 1     C001095     R      M   Sen   5/13/77         Tom              Cotton
## 2     G000562     R      M   Sen   8/22/74        Cory             Gardner
## 3     M001169     D      M   Sen    8/3/73 Christopher         S.   Murphy
## 4     S001194     D      M   Sen  10/20/72       Brian    Emanuel   Schatz
## 5     Y000064     R      M   Sen   8/24/72        Todd         C.    Young
## 6     S001197     R      M   Sen   2/22/72    Benjamin       Eric    Sasse
##   name_suffix state    district senate_class
## 1                AR Junior Seat           II
## 2                CO Junior Seat           II
## 3                CT Junior Seat            I
## 4                HI Senior Seat          III
## 5                IN Junior Seat          III
## 6                NE Junior Seat           II
##                               website    fec_id      twitter_id
## 1       https://www.cotton.senate.gov H2AR04083    SenTomCotton
## 2      https://www.gardner.senate.gov H0CO04122  SenCoryGardner
## 3       https://www.murphy.senate.gov H6CT05124 senmurphyoffice
## 4       https://www.schatz.senate.gov S4HI00136  SenBrianSchatz
## 5        https://www.young.senate.gov H0IN09070    SenToddYoung
## 6 https://www.sasse.senate.gov/public S4NE00090        SenSasse

As you can see, the twitter_id column of this .csv file contains the Twitter “screen names,” or handles, that we need to make API requests about each elected official. Let’s grab each official’s most recent 100 tweets and combine them into a single large dataset of recent tweets by elected officials in the U.S.

#create empty container to store tweets for each elected official
elected_official_tweets<-as.data.frame(NULL)

for(i in 1:nrow(elected_officials)){

  #pull tweets
  tweets<-get_timeline(elected_officials$twitter_id[i], n=100)
  
  #populate dataframe
  elected_official_tweets<-rbind(elected_official_tweets, tweets)
  
  #pause for five seconds to further prevent rate limiting
  Sys.sleep(5)
  
  #print number/iteration for debugging/monitoring progress
  print(i)
}

This code will take some time to run, of course, since we are collecting 100 tweets from each official in the list. You may also get rate limited, depending upon your previous activity and your current rate limits. If so, increase the length of the pause in the Sys.sleep command above. You may also notice some error messages in your output; these can occur because a Senator has changed their Twitter handle, because an account exists but has no tweets, or for other such reasons.

Working with Timestamps

There is one more skill that will be useful for working with Twitter data. Very often, we want to track trends over time or subset our data according to different time periods. If we browse the variable that describes when each tweet was created, however, we see that it is not yet in a format that makes this easy in R:

head(elected_official_tweets$created_at)

To work with these timestamps, it is often useful to convert them into a variable of class “Date.” There are several ways to do this in R, but here is how to do it using the as.Date function in base R.

elected_official_tweets$date <- as.Date(elected_official_tweets$created_at, format = "%Y-%m-%d")
head(elected_official_tweets$date)

Now we can subset the data using conventional techniques. For example, if we wanted to look only at tweets from August 2018, we could do this:

august_tweets <- elected_official_tweets[elected_official_tweets$date > "2018-07-31" &
                                           elected_official_tweets$date < "2018-09-01", ]
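With the new date variable in place, counting tweets per day is also straightforward; for example, base R's table function gives a quick daily tally:

# count the number of tweets posted on each day
tweets_per_day <- table(elected_official_tweets$date)
head(tweets_per_day)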

Challenges of Working with APIs

By now it is hopefully clear that APIs are an invaluable resource for collecting data from the internet. At the same time, it should also be clear that obtaining credentials, avoiding rate limits, and learning the unique jargon employed by the creators of each API can mean many hours sifting through documentation, particularly where no well-functioning R package exists for the API in question. If you have to develop your own custom code to work with an API, or if you need information that is not obtainable using functions within an R package, you may find it useful to browse the source code of the R functions discussed above in order to see how they construct the API queries that produced the results we worked with.
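In R, typing a function's name without parentheses prints its source code, which is an easy way to see how a package constructs its API requests. For example:

# print the source of search_tweets to see how rtweet builds the underlying query
rtweet::search_tweets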

A List of APIs of Interest

There are numerous databases that describe popular APIs on the web, including the aforementioned Programmable Web, as well as a variety of crowd-sourced and user-generated lists:

https://docs.google.com/spreadsheets/d/1ZEr3okdlb0zctmX0MZKo-gZKPsq5WGn1nJOxPV7al-Q/edit?usp=sharing

https://github.com/toddmotto/public-apis

https://apilist.fun/

The R OpenSci site also has a list of R packages that work with APIs:

https://ropensci.org/packages/

Happy coding!