Today I Learned: ____

So far I've just been trying to scrape basic things to get a hang of it. I've tried scrapping the number of youtube views for a particular video, ratings from Rotten Tomatoes/Meta Critic and about any site that has a real time counter (e.g. http://www.nationaldebtclocks.org/debtclock/canada).


When you install rvest, be sure to to execute vignette("selectorgadget") . It will show you how to identify the exact elements you wish to scrape from a web page.

Anyway, here is a toy example I did with Rotten Tomatoes. I just wanted to scrape the data for the movies opening this week (name and scores). If you head over to www.rottentomatoes.com you can view them first on the top left.


Code:
library(rvest)
library(ggplot2)


#Load site of interest
rt <- read_html("https://www.rottentomatoes.com/")


#Scrape the names of the movies opening this week
allnames <- rt %>%
          html_nodes(".right a , .middle_col a") %>%
		  html_text() %>%
		  as.character()


#note the SelectorGadget helped identify the key ".right a , .middle_col a" was what I needed to scrape.

#The scrape actually extracts a lot of extras, must inspect the vector to identify the elements of interest
top5_names <- allnames[c(18,20,22,24,26)]


#Scrape the corresponding scores for the movies opening this week
allscores <- rt %>%
          html_nodes(".tMeterScore") %>%
		  html_text() %>%
		  as.character()

		  
 
#Again, the scrape has a lot of extra info. Identify and select just the scores for those movies opening
#remove "%" and make numeric so i can plot them
top5_scores <- as.numeric(gsub("%","",allscores[19:23]))




#store to a data frame for use in ggplot
top5_ratings <- data.frame(title=top5_names, ratings=top5_scores)
top5_ratings


#make a plot
p <- ggplot(top5_ratings, aes(weight=ratings, x=title, fill=title, label=ratings)) + 
geom_bar() +
geom_text(aes(y=ratings/2)) +
guides(fill=FALSE) +
theme_bw()

p
 
Last edited:
Here is the extension of the above to a self contained shiny application.

Every 10 seconds (10000 milliseconds):
-read the data in, store it
-record the system time/date for the observation
-append it to an existing data frame
-plot it using GGPLOT and facet_grid by the time variable
-this example won't be interesting (i.e. doesn't change at all since yesterday) and maybe every 12 - 24 hours might be a better interval to refresh for this particular site.

It's a bit rough, but it does in concept what I wanted to do.

Code:
library(shiny)
library(magrittr)
library(ggplot2)
library(rvest)


ui <- shinyServer(fluidPage(
  plotOutput("plot")
))

server <- shinyServer(function(input, output, session){
  # Function to get new observations
  get_new_data <- function(){

  #Load site of interest and store the time you retrieved the data
  rt <- read_html("https://www.rottentomatoes.com/")
  systime <- Sys.time()

#Scrape the names of the movies opening this week
  allnames <- rt %>%
          html_nodes(".right a , .middle_col a") %>%
		  html_text() %>%
		  as.character()


#The scrape actually extracts a lot of extras, must inspect the vector to identify the elements of interest
   top5_names <- allnames[c(18,20,22,24,26)]


#Scrape the corresponding scores for the movies opening this week
   allscores <- rt %>%
          html_nodes(".tMeterScore") %>%
		  html_text() %>%
		  as.character()

		  
 
#Again, the scrape has a lot of extra info. Identify and select just the scores for those movies opening
#remove "%" and make numeric so i can plot them
top5_scores <- as.numeric(gsub("%","",allscores[19:23]))




#store to a data frame for use in ggplot
data <- data.frame(title=top5_names, ratings=top5_scores,time=systime)
  
  
    return(data)
  }

  # Initialize my_data
  my_data <<- get_new_data()

  # Function to update my_data
  update_data <- function(){
    my_data <<- rbind(get_new_data(), my_data)
  }

  # Plot the 30 most recent values
  output$plot <- renderPlot({
    invalidateLater(10000, session)
    update_data()
   
   p <- ggplot(my_data, aes(weight=ratings, x=title, fill=title, label=ratings)) + 
        geom_bar() +
		facet_grid(.~time) +
        geom_text(aes(y=ratings/2)) +
        guides(fill=FALSE) +
        theme_bw() +
		theme(axis.text.x = element_text(angle = 45, hjust = 1))

   print(p)
   
   
  })
})

shinyApp(ui=ui,server=server)
 
Last edited:

bryangoodrich

Probably A Mammal
So uh ... yeah TIL there is a toString function in R (seriously, when did this get added?!)

Code:
toString(letters)  # "a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z"
Basically it's syntactic sugar for paste(x, collapse = ", "), though it says it's a helper function for format. It does have a nice width parameter so it will give a shorter representation (kind of goofy, though)

Code:
toString(letters, width = 5)  # "a,...."
In any case, I had a use-case of efficiently unpacking a vector into components of a list for a SQL statement and this makes it a lot easier

Code:
x <- sample(letters, 3)
sql <- "SELECT * FROM foobar WHERE foo IN (%s)"
sprintf(sql, toString(x))
Only problem here is it won't work for characters because it'd produce something like "IN (a, b, c)" when you really need "IN ('a', 'b', 'c')". So there is still some work to be done!

Another approach is to do something for an expression like

Code:
x <- sample(LETTERS, 3)
sql <- "INSERT INTO foobar VALUES (%s, %s, %s)" 
sprintf(sql, x[1], x[2], x[3])
It would be nice to have some way to just say sprintf(sql, x) and "unpack" x into the remaining arguments, but I'll probably just have to create my own wrapper that uses paste and collapse and quotes appropriately (I already have something like this). I just wanted a base R way. Maybe stringr has something? Would also be nice if one of the database libraries has a way to just say "convert this data.frame into a series of sql INSERT statements for each row" because in the insert case, that's ultimately what I'm trying to have a wrapper for.
 

Dason

Ambassador to the humans
Code:
> sQuote(letters)
 [1] "‘a’" "‘b’" "‘c’" "‘d’" "‘e’" "‘f’" "‘g’" "‘h’" "‘i’" "‘j’" "‘k’" "‘l’"
[13] "‘m’" "‘n’" "‘o’" "‘p’" "‘q’" "‘r’" "‘s’" "‘t’" "‘u’" "‘v’" "‘w’" "‘x’"
[25] "‘y’" "‘z’"
> toString(sQuote(letters))
[1] "‘a’, ‘b’, ‘c’, ‘d’, ‘e’, ‘f’, ‘g’, ‘h’, ‘i’, ‘j’, ‘k’, ‘l’, ‘m’, ‘n’, ‘o’, ‘p’, ‘q’, ‘r’, ‘s’, ‘t’, ‘u’, ‘v’, ‘w’, ‘x’, ‘y’, ‘z’"
> # Stupid 'fancy' quotes...
> options(useFancyQuotes=FALSE)
> toString(sQuote(letters))
[1] "'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z'"
>
 

bryangoodrich

Probably A Mammal
Found some interesting behavior

Code:
insert_values <- function(x, table = "tbl", quote = TRUE)
{
    sql <- "INSERT INTO %s VALUES (%s)"
    sprintf(sql, table, ifelse(quote, toString(sQuote(x)), toString(x)))
}
Had to do that because this doesn't work (just quotes the first element of the vector ('a')

Code:
insert_values <- function(x, table = "tbl", quote = TRUE)
{
    sql <- "INSERT INTO %s VALUES (%s)"
    sprintf(sql, table, toString(ifelse(quote, sQuote(x), x)))
}
Code:
x <- c('a', 'b', 'c')
insert_values(x)  # "INSERT INTO tbl VALUES ('a', 'b', 'c')"
insert_values(x, "foobar", FALSE)  # "INSERT INTO foobar VALUES (a, b, c)"
This works for simple all or none use cases, but a more complex solution will need to provide quoted mappings for each record data type--e.g., quote strings, factors, and dates, unquote numerics and be able to supply a complex row tuple into the values list. Fun exercise nonetheless.
 

Dason

Ambassador to the humans
ifelse will always produce a result that has the same length as the first input. So if you put in a single logical value you'll get something of length 1 as the output. It's kind of annoying but it makes sense for it's particular use case. You could just use if/else directly to avoid that issue.
 

bryangoodrich

Probably A Mammal
Good point. True, I can just use if else, but it's like nails on a chalkboard to see that you can return a value from a control flow block! I think the approach I used is sufficient and falls within the paradigm of what ifelse is doing (one return for one input). I was just trying to localizing the conditional logic to the exact part that is being tested.
 
TIL: Reordering of a factor within Shiny using the 'shinyjqui' package. This was motivated by a common question (I myself had earlier this year too) of how to change the order of say the x-axis categories on a graph.

I have a self-contained Shiny App that demonstrates that here.

Code:
library(shiny)
library(shinyjqui)
library(ggplot2)			 

### CREATE A DATA FRAME OF SAMPLE DATA
mydat <- data.frame(group=c("Apples","Bananas","Oranges","Peaches"),
                     value=c(3,2,5,1)
				 )
				 
### CREATE SERVER.R	
		 
server <- function(input, output) {

  #Text Output
  output$txt <- renderPrint({input$foo_order})
  
    
  #Plot Output
  output$graph <- renderPlot({
  
    mydat$newp <- factor(mydat$group, levels=input$foo_order)
	
	g<-ggplot(mydat, aes(y=value,value,x=newp,color=newp)) + geom_point() + theme_bw() 
  
	print(g)
  
  
							})
}


###CREATE UI.R

ui <- fluidPage(
  orderInput(inputId = 'foo', label = 'change ordering by moving the squares', items =   levels((mydat$group))),
  verbatimTextOutput('txt'),
  plotOutput('graph')
)

shinyApp(ui, server)
 

trinker

ggplot2orBust
TIL `cat` has a `sep` argument. All these years of using `cat(paste(x, collapse = '\n'))` could have been `cat(x, sep = '\n')`:

Code:
x <- c('I like it', 'Me too')
cat(x, sep = '\n')