Basics of Text Analysis

Chris Bail
Duke University

WRANGLING TEXT

Character Encoding

 

UTF-8
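
To make the idea of an encoding concrete, here is a minimal sketch in base R. It assumes nothing beyond base R; the example string and variable names are illustrative and not from the original slides. It shows that in UTF-8 a single accented character can take more than one byte, which is why character counts and byte counts can differ.

```r
# Illustrative string: "café", written with a Unicode escape so the
# source file's own encoding does not matter.
x <- enc2utf8("caf\u00e9")

# Number of characters versus number of bytes in the UTF-8 representation.
n_chars <- nchar(x, type = "chars")   # 4 characters
n_bytes <- nchar(x, type = "bytes")   # 5 bytes: the accented e takes two

# The raw bytes that UTF-8 uses for the single character "é".
e_bytes <- charToRaw(enc2utf8("\u00e9"))  # c3 a9

print(n_chars)
print(n_bytes)
print(e_bytes)
```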

 

Character Encoding Over Time

 

GREP

Globally search for a Regular Expression and Print matching lines

Regular Expression

A pattern that describes a set of strings
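
As a small illustration of one pattern describing a set of strings, the sketch below uses a hypothetical pattern and word list (neither is from the original slides). The `?` makes the preceding `u` optional, so the same pattern matches both spellings, while `^` and `$` anchor it to the whole string.

```r
# Hypothetical example words, chosen only for illustration.
words <- c("color", "colour", "colors")

# "^colou?r$" describes exactly two strings: "color" and "colour".
matches <- grepl("^colou?r$", words)

print(matches)
# [1]  TRUE  TRUE FALSE
```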

It's GREP-tastic!

 

duke_web_scrape <- "Duke Experts: A Trusted Source for Policymakers\n\n\t\t\t\t\t\t\t"

grepl

 

grepl("Experts", duke_web_scrape)
[1] TRUE

gsub

 

gsub("\t", "", duke_web_scrape)
[1] "Duke Experts: A Trusted Source for Policymakers\n\n"

gsub (2 patterns)

 

gsub("\t|\n", "", duke_web_scrape)
[1] "Duke Experts: A Trusted Source for Policymakers"
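
Two alternative ways to get the same cleaned string, sketched here under the assumption that only base R is available. A bracketed character class `[\t\n]` matches any one of the listed characters, much like the `|` alternation, and `trimws()` strips whitespace from the ends of a string only. The variable names `cleaned_class` and `cleaned_trim` are illustrative.

```r
duke_web_scrape <- "Duke Experts: A Trusted Source for Policymakers\n\n\t\t\t\t\t\t\t"

# Character class: remove every tab or newline anywhere in the string.
cleaned_class <- gsub("[\t\n]", "", duke_web_scrape)

# trimws(): remove leading and trailing whitespace only,
# leaving any whitespace in the middle of the string untouched.
cleaned_trim <- trimws(duke_web_scrape)

print(cleaned_class)
print(cleaned_trim)
# Both print: "Duke Experts: A Trusted Source for Policymakers"
```

The two approaches agree here only because all of the unwanted whitespace sits at the end of the string; for tabs or newlines embedded in the middle of text, `gsub` is the one that removes them.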

More GREP

 

some_text <- c("Friends", "don't", "let", "friends", "make", "wordclouds")

some_text[grep("^[F]", some_text)]
[1] "Friends"
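
The pattern `^[F]` is case-sensitive, so it skips the lowercase "friends". A possible variation, sketched with base R's `grep` arguments `ignore.case` and `value`, matches both spellings and returns the matching elements directly instead of their positions. The names `f_words` and `f_index` are illustrative.

```r
some_text <- c("Friends", "don't", "let", "friends", "make", "wordclouds")

# value = TRUE returns the matching strings rather than their indices;
# ignore.case = TRUE lets "^f" match an initial "F" or "f".
f_words <- grep("^f", some_text, ignore.case = TRUE, value = TRUE)

# Without value = TRUE, grep returns the positions of the matches.
f_index <- grep("^f", some_text, ignore.case = TRUE)

print(f_words)
# [1] "Friends" "friends"
print(f_index)
# [1] 1 4
```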

Regex Cheat Sheet