So far I've just been scraping basic things to get the hang of it. I've tried scraping the number of YouTube views for a particular video, ratings from Rotten Tomatoes/Metacritic, and just about any site that has a real-time counter (e.g. http://www.nationaldebtclocks.org/debtclock/canada).
When you install rvest, be sure to run vignette("selectorgadget"). It shows you how to identify the exact elements you want to scrape from a web page.
Anyway, here is a toy example I did with Rotten Tomatoes. I just wanted to scrape the data for the movies opening this week (name and scores). If you head over to www.rottentomatoes.com, you can see them at the top left of the page.
Code:
library(rvest)
library(ggplot2)
# Load the site of interest
rt <- read_html("https://www.rottentomatoes.com/")
# Scrape the names of the movies opening this week
# (html_text() already returns a character vector, so no as.character() is needed)
allnames <- rt %>%
  html_nodes(".right a , .middle_col a") %>%
  html_text()
# Note: SelectorGadget identified ".right a , .middle_col a" as the selector I needed.
# The scrape also returns a lot of extras, so inspect the vector to find the elements of interest.
# The positions below matched the page when I ran this and will shift as the page changes.
top5_names <- allnames[c(18,20,22,24,26)]
# Scrape the corresponding scores for the movies opening this week
allscores <- rt %>%
  html_nodes(".tMeterScore") %>%
  html_text()
# Again, the scrape has a lot of extra info. Select just the scores for the movies opening this week,
# then remove "%" and convert to numeric so I can plot them
top5_scores <- as.numeric(gsub("%","",allscores[19:23]))
#store to a data frame for use in ggplot
top5_ratings <- data.frame(title=top5_names, ratings=top5_scores)
top5_ratings
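# Make a bar plot; geom_col() draws bar heights directly from the y values,
# and guides(fill="none") replaces the deprecated guides(fill=FALSE)
p <- ggplot(top5_ratings, aes(x=title, y=ratings, fill=title, label=ratings)) +
  geom_col() +
  geom_text(aes(y=ratings/2)) +
  guides(fill="none") +
  theme_bw()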
p
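
Since the example above depends on whatever the live page looks like on a given day, here is a minimal offline sketch of the same select, extract, and clean steps. The HTML snippet, the movie titles, and the scores are all made up for illustration; only the class names (.middle_col and .tMeterScore) are borrowed from the selectors above, and the real page structure is not guaranteed to look like this.

```r
library(rvest)

# Hypothetical HTML fragment standing in for a downloaded page.
# Titles and scores are invented for illustration only.
page <- read_html('
<html><body>
  <table>
    <tr><td class="middle_col"><a href="/m/a">Movie A</a></td>
        <td><span class="tMeterScore">87%</span></td></tr>
    <tr><td class="middle_col"><a href="/m/b">Movie B</a></td>
        <td><span class="tMeterScore">42%</span></td></tr>
  </table>
</body></html>')

# Select the link text inside the title cells
titles <- page %>%
  html_nodes(".middle_col a") %>%
  html_text(trim = TRUE)

# Select the score text, then strip "%" and convert to numeric
scores <- page %>%
  html_nodes(".tMeterScore") %>%
  html_text(trim = TRUE)
scores <- as.numeric(gsub("%", "", scores, fixed = TRUE))

# Combine into a data frame, as in the example above
ratings <- data.frame(title = titles, ratings = scores, stringsAsFactors = FALSE)
ratings
```

Running the pipeline on a small, fixed snippet like this is a handy way to check that a selector and the cleanup step do what you expect before pointing them at the real site.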