Chris Bail, PhD
Duke University
www.chrisbail.net
github.com/cbail
twitter.com/chris_bail
Check out my Text as Data Course
Screenscraping refers to the process of automatically extracting data from web pages, often from a long list of websites that would be impractical to mine by hand. As the figure below illustrates, a typical screenscraping program a) loads the address of a web page to be scraped from a list of web pages; b) downloads the page in a format such as HTML or XML; c) finds some piece of information desired by the author of the code; and d) places that information in a convenient format such as a “data frame” (which is R-speak for a dataset). Screenscraping can also be used to download other types of content, however, such as audio-visual content. This tutorial assumes basic knowledge of R and the other skills described in previous tutorials at the link above.
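The four steps above can be sketched in R roughly as follows. This is only a minimal illustration, not code from the tutorial: the URLs and the CSS selector "h1" are hypothetical placeholders, and it uses the rvest package introduced below.

```r
library(rvest)

# a) a list of web pages to be scraped (hypothetical examples)
urls <- c("https://en.wikipedia.org/wiki/Duke_University",
          "https://en.wikipedia.org/wiki/Sociology")

results <- data.frame()
for (url in urls) {
  page  <- read_html(url)                      # b) download the page as HTML
  title <- html_text(html_node(page, "h1"))    # c) extract one piece of information
  results <- rbind(results,                    # d) store it in a data frame
                   data.frame(url = url, title = title))
}
```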
In the early years of the internet, screenscraping was a very common practice because there were not yet widespread legal norms surrounding the protection of data on the internet. This has changed drastically in recent decades as the value of data on websites has become obvious, and as bots, or automated computer programs, can easily wreak havoc by collecting data from websites and repurposing it for nefarious purposes. The very first thing you should consider before scraping a website is whether you are allowed to do so. The easiest way to check is to visit the site’s “Terms of Service” (sometimes abbreviated as “Terms”), a link to which often appears at the bottom of a web page. These days, most websites also publish a “robots.txt” file that specifies rules about automated data collection on the site, and an increasing number of sites do not allow such practices (especially larger websites such as Facebook, the New York Times, or Instagram). When in doubt, consult professional legal advice to determine whether you have permission to scrape a website.
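As a quick first check (not a substitute for reading the Terms of Service, and certainly not for legal advice), you can inspect a site's robots.txt file directly from R using only base functions. The URL below is just an example; substitute the site you plan to scrape:

```r
# Download and print the first lines of a site's robots.txt file
robots <- readLines("https://en.wikipedia.org/robots.txt", warn = FALSE)
head(robots, 10)
```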
If you’ve identified a web page you’d like to scrape, the first step in writing a screenscraping program is to download the page’s source code into R. To do this, we are going to install the rvest package, authored by Hadley Wickham, which provides a number of very useful functions for screenscraping. (Note: you may also need to install the selectr package.)
install.packages("rvest")
install.packages("selectr")
We are going to begin by scraping this very simple web page from Wikipedia. I describe the page as “simple” because it does not have many interactive features that require sophisticated types of web programming such as JavaScript, which, as we will see in a later example, can be particularly difficult to work with.
This is what the web page linked above looks like to us when we visit it in a browser such as Internet Explorer or Chrome:
But this is not what the web page actually looks like to our browser. To view the “source code” of the page in Chrome, we can open the “Developer” submenu and click “View Source.” We then see the page in its most elemental form: an HTML file, a long document that contains both the text of the web page and a long list of instructions for how the text, images, and other components of the page should be rendered by the browser:
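Once rvest is installed, downloading a page's HTML source into R takes a single function call. Here is a hedged sketch; the Wikipedia URL is an illustrative stand-in for the page linked above:

```r
library(rvest)

# Download and parse the raw HTML of a web page into R
wikipedia_page <- read_html("https://en.wikipedia.org/wiki/World_Health_Organization")
wikipedia_page  # printing shows a summary of the parsed HTML document
```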