Chris Bail, PhD
Duke University
www.chrisbail.net
github.com/cbail
twitter.com/chris_bail
Check out my Text as Data Course

What is Screen-Scraping?

Screen-scraping refers to the process of automatically extracting data from web pages, often from a list of pages too long to mine by hand. As the figure below illustrates, a typical screen-scraping program a) loads the address (URL) of a web page to be scraped from a list of web pages; b) downloads the page in a format such as HTML or XML; c) finds some piece of information desired by the author of the code; and d) places that information in a convenient format such as a “data frame” (which is R speak for a dataset). Screen-scraping can also be used to download other types of content, such as audio-visual content. This tutorial assumes basic knowledge of R and the other skills described in previous tutorials at the link above.
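To make the four steps concrete, here is a minimal sketch in R. It assumes the rvest package is already installed (installation is covered in the next section). The two "pages" are made-up HTML strings standing in for downloaded web pages so that the sketch runs without an internet connection; in a real scraper, step (b) would pass a URL to read_html() instead. The names pages, results, and doc are illustrative only.

```r
library(rvest)  # assumes rvest is installed

# Hypothetical stand-ins for web pages, so the sketch runs offline.
# In a real scraper this would be a vector of URLs.
pages <- c(
  page_one = "<html><head><title>First page</title></head><body><p>Hello</p></body></html>",
  page_two = "<html><head><title>Second page</title></head><body><p>World</p></body></html>"
)

# Empty data frame that will hold one row per page
results <- data.frame(page = character(), title = character(),
                      stringsAsFactors = FALSE)

for (name in names(pages)) {                     # (a) take the next page from the list
  doc <- read_html(pages[[name]])                # (b) read the page's HTML into R
  title <- html_text(html_node(doc, "title"))    # (c) find the piece of information we want
  results <- rbind(results,                      # (d) store it in a data frame
                   data.frame(page = name, title = title,
                              stringsAsFactors = FALSE))
}

print(results)
```

Growing a data frame inside a loop is slow for long lists of pages, but it keeps each of the four steps visible here.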




Reading a Web-Page Into R

If you’ve identified a web page you’d like to scrape, the first step in writing a screen-scraping program is to download the page’s source code into R. To do this, we are going to install the rvest package, authored by Hadley Wickham, which provides a number of very useful functions for screen-scraping. (Note: You may also need to install the selectr package.)

install.packages("rvest")
install.packages("selectr")
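If you run this tutorial more than once, you may prefer to install the packages only when they are missing. The following is one possible sketch, not the only way to do it; the variable names needed, installed, and to_install are illustrative. Note that install.packages() needs an internet connection and, in a non-interactive session, a CRAN mirror to be set.

```r
# Packages used in this tutorial
needed <- c("rvest", "selectr")

# TRUE for each package that is already available on this machine
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)

# Install only the packages that are missing
to_install <- needed[!installed]
if (length(to_install) > 0) install.packages(to_install)

# Attach rvest so its functions can be called directly
library(rvest)
```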

We are going to begin by scraping this very simple web page from Wikipedia. I describe the page as “simple” because it does not have many interactive features that require more sophisticated web programming such as JavaScript, which, as we will see in a later example, can be particularly difficult to work with.

This is what the web page linked above looks like to us when we visit it via a browser such as Internet Explorer or Chrome:





But this is not what the web page actually looks like to our browser. To view the “source code” of the web page, we can open Chrome’s “Developer” dropdown menu and then click “View Source.” We then see the page in its most elemental form, called an HTML file, which is a long file that contains both the text of the web page and a long list of instructions about how the text, images, and other components of the web page should be rendered by the browser: