Screenscraping

Chris Bail
Duke University

Website: https://www.chrisbail.net
Twitter: https://www.twitter.com/chris_bail
Github: https://github.com/cbail

What is Screen-Scraping?

Is Screen-Scraping Legal?

Screen-Scraping is Frustrating

Reading a Web-Page into R

install.packages("rvest")
library(rvest)

Scraping a Wikipedia Page

We are going to begin by scraping this very simple web page from Wikipedia.

What a Web Page Looks like to a Computer

Downloading HTML

wikipedia_page <-
  read_html("https://en.wikipedia.org/wiki/World_Health_Organization_ranking_of_health_systems_in_2000")

Downloading HTML

wikipedia_page

{html_document}
<html class="client-nojs" lang="en" dir="ltr">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset= ...
[2] <body class="mediawiki ltr sitedir-ltr mw-hide-empty-elt ns-0 ns-sub ...
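The same structure can be explored without a network connection. The sketch below parses a small hand-written HTML string instead of a live URL; the string and the names `toy_html` and `toy_page` are invented for illustration. `read_html()` accepts literal HTML as well as a URL.

```r
library(rvest)

# A tiny, made-up HTML document standing in for a real web page
toy_html <- "<html><head><title>Toy page</title></head><body><h1>Hello</h1><p>Some text.</p></body></html>"

toy_page <- read_html(toy_html)

# The top-level children correspond to the [1] <head> and [2] <body> entries
# that R prints for a parsed document
top_level <- html_name(html_children(toy_page))
print(top_level)

# Pull the text of the <title> tag
page_title <- html_text(html_node(toy_page, "title"))
print(page_title)
```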

Parsing HTML

Right Click "Inspect"

The XPath (Right Click, copy XPath)

Using the XPath

section_of_wikipedia<-
  html_node(wikipedia_page, 
            xpath='//*[@id="mw-content-text"]/div/table')

head(section_of_wikipedia)

$node
<pointer: 0x7fac43e9c750>

$doc
<pointer: 0x7fac43e65030>
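Because the selected node is only a pair of pointers into the parsed document, `head()` is not very informative. A minimal sketch of inspecting a node with rvest helpers instead, using a made-up page (`toy_page`) that imitates the nesting the XPath expects:

```r
library(rvest)

# Made-up HTML that imitates the nesting div#mw-content-text > div > table
toy_page <- read_html('
<html><body>
  <div id="mw-content-text"><div>
    <table>
      <tr><th>Country</th><th>Rank</th></tr>
      <tr><td>Andorra</td><td>10</td></tr>
    </table>
  </div></div>
</body></html>')

# Same XPath as in the slide
toy_node <- html_node(toy_page, xpath = '//*[@id="mw-content-text"]/div/table')

# Inspect the node: which tag was selected, and how many rows it contains
node_name <- html_name(toy_node)
n_rows <- length(html_nodes(toy_node, xpath = ".//tr"))
print(node_name)
print(n_rows)
```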

Extracting the Table

health_rankings<-html_table(section_of_wikipedia)

head(health_rankings[,(1:2)])

              Country Attainment of goals / Health / Level (DALE)
1         Afghanistan                                         164
2             Albania                                         102
3             Algeria                                          44
4             Andorra                                          10
5              Angola                                         165
6 Antigua and Barbuda                                          48
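Once `html_table()` has returned a data frame, ordinary R tools apply. A small sketch on a made-up two-row table (the `Rank` column and the name `toy_rankings` are illustrative and not taken from the Wikipedia page):

```r
library(rvest)

# Made-up table; the first row of <th> cells becomes the column names
toy_page <- read_html('<table>
  <tr><th>Country</th><th>Rank</th></tr>
  <tr><td>Albania</td><td>102</td></tr>
  <tr><td>Andorra</td><td>10</td></tr>
</table>')

toy_rankings <- html_table(html_node(toy_page, "table"))

# Sort the rows by the illustrative Rank column
toy_sorted <- toy_rankings[order(toy_rankings$Rank), ]
print(toy_sorted)
```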

When the XPath Fails...
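One way an XPath can fail, offered here as an illustration rather than as the case on this slide: browsers insert a `<tbody>` element into tables when displaying them, so an XPath copied from the browser may name an element that is absent from the raw HTML that `read_html()` parses. The sketch below uses a made-up page and checks whether anything matched before going further.

```r
library(rvest)

# Made-up raw HTML: note there is no <tbody> tag in the source
toy_page <- read_html("<html><body><table><tr><td>Andorra</td><td>10</td></tr></table></body></html>")

browser_xpath <- "//table/tbody/tr"  # what a browser might report
raw_xpath <- "//table//tr"           # tolerant of a missing <tbody>

missed <- html_nodes(toy_page, xpath = browser_xpath)
found <- html_nodes(toy_page, xpath = raw_xpath)

# An empty node set is the signal that the XPath did not match
if (length(missed) == 0) {
  message("No nodes matched: ", browser_xpath)
}
print(length(found))
```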

A More Complicated Page

Selector Gadget

http://selectorgadget.com/

Parsing with the CSS Selector

duke_page<-
  read_html("https://www.duke.edu")
duke_events<-
  html_nodes(duke_page, css="li:nth-child(1) .epsilon")

html_text(duke_events)

[1] "Duke Experts: A Trusted Source for Policymakers\n\n\t\t\t\t\t\t\t"
[2] "Zoom: An Open Forum"                                              
[3] "Best Practices for Online Teaching"
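Scraped text often carries stray whitespace, like the trailing `\n` and `\t` characters in the first result. A minimal sketch on made-up HTML (the list items and the name `toy_events` are invented) that uses the same CSS selector and trims the text with the `trim` argument of `html_text()`:

```r
library(rvest)

# Made-up page: two lists whose items contain elements of class "epsilon"
toy_page <- read_html('
<ul>
  <li><span class="epsilon">First event\n\n\t\t</span></li>
</ul>
<ul>
  <li><span class="epsilon">Second event</span></li>
  <li><span class="epsilon">Not selected</span></li>
</ul>')

# li:nth-child(1) keeps only the first item of each list
toy_nodes <- html_nodes(toy_page, css = "li:nth-child(1) .epsilon")

# trim = TRUE strips leading and trailing whitespace
toy_events <- html_text(toy_nodes, trim = TRUE)
print(toy_events)
```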

Browser Automation

RSelenium

# Install the released version from CRAN (package names are case-sensitive)
install.packages("RSelenium")
library(RSelenium)

Note: you may need to install Java to get up and running (see this tutorial). You will also need to install Docker.

Installing RSelenium from Docker

system('docker run -d -p 4445:4444 selenium/standalone-chrome')

RSelenium

library(RSelenium)
rD <- rsDriver()
remDr <- rD$client
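`rsDriver()` starts its own local Selenium server. If the Docker container started with the `docker run` command is used instead, the connection might look like the sketch below. The helper name `connect_to_docker_selenium` is hypothetical and not part of RSelenium; port 4445 is the host port from the `docker run` command, and `localhost` is an assumption about where Docker is running.

```r
# Hypothetical helper (not part of RSelenium): connect to a Selenium server
# that is already running in a Docker container
connect_to_docker_selenium <- function(host = "localhost", port = 4445L) {
  remDr <- RSelenium::remoteDriver(
    remoteServerAddr = host,
    port = port,
    browserName = "chrome"
  )
  remDr$open()  # opens a browser session on the remote server
  remDr
}

# Usage, once the container is running:
# remDr <- connect_to_docker_selenium()
# remDr$navigate("https://www.duke.edu")
```

Defining the function does not contact the server; the connection is only attempted when it is called.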

Launch a Webpage

remDr$navigate("https://www.duke.edu")

Navigate to the Search Bar

search_box <- remDr$findElement(using = 'css selector', 'fieldset input')

Input a Search

# Type the query, then press Enter ("\uE007" is the WebDriver key code for Enter)
search_box$sendKeysToElement(list("data science", "\uE007"))

Screenscraping Within a Loop
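A minimal sketch of what scraping inside a loop might look like, under stated assumptions: the three "pages" are made-up HTML strings standing in for URLs, the target is the first `<h1>` on each page, and the names `toy_pages` and `headings` are invented. `tryCatch()` keeps one bad page from stopping the whole loop, and `Sys.sleep()` spaces out requests.

```r
library(rvest)

# Made-up pages; in practice each element would be a URL string
toy_pages <- list(
  page_a = "<html><body><h1>Alpha</h1></body></html>",
  page_b = "<html><body><h1>Beta</h1></body></html>",
  page_c = "<html><body><p>No heading here</p></body></html>"
)

# Pre-allocate a named character vector for the results
headings <- character(length(toy_pages))
names(headings) <- names(toy_pages)

for (page_name in names(toy_pages)) {
  headings[page_name] <- tryCatch({
    page <- read_html(toy_pages[[page_name]])
    # html_text() gives NA when the node is missing
    html_text(html_node(page, "h1"))
  }, error = function(e) NA_character_)
  Sys.sleep(0.1)  # short pause between pages; use a longer delay on real sites
}

print(headings)
```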

When Should I Use Screen-Scraping?