Today I Learned: ____

bryangoodrich

Probably A Mammal
TIL Closures

Though, to me this was YIL, since I was reading Advanced R Programming at 2 am in between games of Call of Duty. Specifically, I was reading Hadley's articulation of functional programming in R: http://adv-r.had.co.nz/Functional-programming.html. To say the least, I am greatly impressed by this. I've only gotten my feet wet with regard to functional programming, but there are things R does behind the scenes that you don't really think about. Closures are one of them. The whole "names(x) <- c(...)" is another. You typically don't think about what is going on, because it all makes sense. However, programmatically, this is craziness!!
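(As a quick aside on that "names(x) <- c(...)" point, here's a minimal sketch, my own, of what R is doing behind the scenes: the assignment is sugar for a call to the replacement function `names<-`.)

Code:
x <- 1:3
names(x) <- c("a", "b", "c")          # what we write
x <- `names<-`(x, c("a", "b", "c"))   # what R effectively evaluates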

So what are closures? They're basically enclosing functions (this turns OOP on its head--objects enclose data with methods; closures enclose methods with data). Look at Hadley's example:

Code:
power <- function (e) {
    function (x) x^e
}
Very simple. It's a function that raises x to the specified exponent, right? Well, not exactly. It took me all morning while I was shopping--I spend a lot of time walking around on the weekends, which gives me time to think--but I finally framed this in functional terms instead of object terms. I couldn't quite understand how you could do something like

Code:
square <- power(2)
cube <- power(3)
square(5)  # 25
cube(2)  # 8
To frame it correctly, I thought about "foobar <- mean". This renames (makes foobar refer to) the mean function. With this I can foobar(...) some vector to get its mean. This is essentially what the power function is doing. It is encapsulating an anonymous function with a specific attribute--viz., the specified exponent value. When you do an assignment like "square <- power(2)", you are assigning square that specific anonymous function (with e = 2). Thus, when you do square(5), you are passing 5 to "function (x) x^2". The difficulty I was having was in thinking that somehow the 5 was being passed into power and then into the anonymous function, which is just stupid. Instead, the return type of power is that of an anonymous function. So we might properly document such a thing thusly,

Code:
#' Power Function Generator
#' 
#' Creates an anonymous function of the specified exponent
#'
#' @param e The power to which x will be raised.
#' @return An anonymous function that raises inputs to the power of e.
power <- function (e) {
  function (x) x^e
}
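You can even inspect the captured value (a small sketch of my own, not from Hadley's chapter): each generated function carries its own e in its enclosing environment.

Code:
square <- power(2)
environment(square)$e  # 2: the exponent lives in the closure's environment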
This is powerful stuff, because we can encode a family of similar function calls in one closure that generates the variations for us. Think how you would have to do this manually:

Code:
square <- function(x) x^2
cube   <- function(x) x^3
As soon as you had to create other varieties, you'd have to start manually coding a lot of details. That would be hard to maintain in the future and very inconvenient. The other nice thing is that since these functions are generated by a function, you can program their automatic generation. That's powerful stuff!
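For instance, here's a minimal sketch (the powers name is my own) of generating a whole family of these at once:

Code:
powers <- lapply(1:5, power)  # a list of closures: x^1 through x^5
powers[[3]](2)  # 8, i.e., 2^3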

Thanks to reading this, I'll be having this functional programming thought at the back of my head whenever I'm writing new functions. It'll be whispering "can you make a closure for this? Go ahead. Do it. You know you want to ..."
 

bryangoodrich

Probably A Mammal
TIL that Hadley has a lot to offer us. Today's lesson: subset assignment.


In particular, it turns out that if you have some data frame (x), you can operate on it efficiently as a list (lapply) and assign the results back into the data frame, as opposed to getting the list result you would expect and then coercing it back into a data frame afterwards.

Code:
x <- data.frame(A = rnorm(10e3), B = rnorm(10e3), C = runif(10e3))
x[] <- lapply(x, "*", 2)  # double the values; this is still not as fast as x*2 when x is a matrix
I ran benchmarks for this

Code:
library(rbenchmark)
mat <- matrix(0, ncol=10, nrow=10e3)
dat <- data.frame(mat)
dat2 <- dat
benchmark(dat[] <- lapply(dat, "*", 2), 
           mat <- apply(mat, 2, "*", 2),
           dat2 <- apply(dat2, 2, "*", 2))
The end result was that the dat[] approach is significantly faster, upwards of 3 to 4 times. Though at greater scale (10e5 by 10, say), this performance benefit starts to shrink regardless (to about 1.75x, I think it was; I'm not running it again, it took a while). I also threw in an sapply form. It was about 120x slower at all scales and returned a vector! I think I'll never use sapply again. For one, there's always a chance it doesn't interpret the result the way you want, and then there's the fact that vapply gives you better control over the resultant structure while being faster. With this new in-place approach of x[] <- lapply(...), I see no reason for sapply at all.
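For the curious, a minimal vapply sketch (my own, reusing the x data frame from above): you declare the shape of each result up front, so there is no guessing about the output.

Code:
vapply(x, mean, FUN.VALUE = numeric(1))  # one numeric per column, or an error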

Jake had mentioned that if the result of lapply has a different structure than your data frame, you can't use this approach. So instead I made another comparison.

Code:
mat <- matrix(0, ncol=5, nrow=10e3)
dat <- data.frame(mat)
res <- data.frame(matrix(0, ncol=5, nrow=6))  # Store summary results
benchmark(res[] <- lapply(dat, summary),
           resmat <- apply(mat, 2, summary),
           resdat <- apply(dat, 2, summary))
Here the res[] method was still fastest, but at scale the apply methods were only around 1.25x slower. There's also the fact that apply provides the row names corresponding to the summary types (min, max, etc.), which the res[] approach drops; if we include that sort of conditioning on the result, the performance benefits may disappear. Of course, this is just one alternative choice of function. I'm sure more exotic operations in lapply may have better performance, but as with vapply, it will require that you do some work defining the resultant structure. The key here is that res and dat must both be data frames. I was using lapply on a matrix and it wasn't working until I realized this folly.
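A minimal sketch of that conditioning (my own): copy the labels over from one column's summary.

Code:
rownames(res) <- names(summary(dat[[1]]))  # "Min.", "1st Qu.", ..., "Max."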

I'm sure I'll find more awesome stuff in Hadley's Advanced R Programming. I recommend you seasoned programmers check it out yourselves.
 

trinker

ggplot2orBust
I have often been intrigued by ggplot2's use of the + operator to do something besides addition. I decided to tear it apart today and minimally replicate what's going on. TIL (well, am starting to learn) how ggplot2 uses the binary + operator in a different way than adding.

Code:
V <- function(x) { # Function to create a V class
    class(x) <- c("V", class(x))
    x
}

"&.V" <- function(e1, e2) { # Method that pastes its two operands together
    paste(e1, e2)
}

"%&%" <- `&.V` # also assign the function to a custom binary operator

V(1:10) & V("A") & V(6) & V(rnorm(20))
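(Side note: as far as I can tell, this is the same trick ggplot2 itself uses--it defines an S3 "+" method for its "gg" class. A rough sketch of the shape with a made-up "myplot" class, not ggplot2's actual code:)

Code:
"+.myplot" <- function(e1, e2) {
    e1$layers <- c(e1$layers, list(e2))  # accumulate components instead of adding numbers
    e1
}
p <- structure(list(layers = list()), class = "myplot")
p <- p + "a layer"  # dispatches to "+.myplot"
length(p$layers)    # 1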
I tried to extend this but get infinite recursion and am not sure why:

Code:
"&.Q" <- function(e1, e2) {
    c(e1 - e2, e1 + e2)
}

Q <- function(x) {
    class(x) <- c("Q", class(x))
    x
}

"%&%" <- `&.Q`

Q(4) & Q(10)
 

Dason

Ambassador to the humans
& is already a binary generic function so all you need to do is define the appropriate S3 method.

Code:
V <- function(x) { #Function to create a V class
    class(x) <- c("V", class(x))
    x
}

"&.V" <- function(e1, e2) { #Function to take an object and paste a V class to it
    paste(e1, e2)
}

V(1:10) & V("A") 

Q <- function(x) {
    class(x) <- c("Q", class(x))
    x
}

"&.Q" <- function(e1, e2) {
    c(e1 - e2, e1 + e2)
}

Q(4) & Q(10)
 

trinker

ggplot2orBust
@Dason why the error on the Q function though? It works for the V. Is it because I'm trying to spit a vector of length 2 out? That doesn't make sense. Here is the error:

Code:
> Q <- function(x) {
+     class(x) <- c("Q", class(x))
+     x
+ }
> 
> "&.Q" <- function(e1, e2) {
+     c(e1 - e2, e1 + e2)
+ }
> 
> Q(4) & Q(10)
Error: evaluation nested too deeply: infinite recursion / options(expressions=)?
 

bryangoodrich

Probably A Mammal
TIL that you can create constants in R! Yeah, thank you Hadley.


It all comes down to binding, which is what happens when you assign values to variable name references. In this instance, you use lockBinding to indicate that this binding cannot be changed. This is especially important in package development when you want to create bindings that you don't want to let an end-user change. That can be very useful.
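A minimal sketch of lockBinding in action (my own example):

Code:
x <- 42
lockBinding("x", globalenv())
x <- 0                           # Error: cannot change value of locked binding for 'x'
unlockBinding("x", globalenv())  # and it can be undone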

The use of active and delayed bindings was also pretty neat. The interested reader should check it out!
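For a taste, a small sketch of both (again mine, not Hadley's):

Code:
makeActiveBinding("now", function() Sys.time(), globalenv())
now    # re-computed on every access
delayedAssign("later", { message("evaluating now"); 42 })
later  # the promise is only evaluated on first access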
 

Dason

Ambassador to the humans
This is especially important in package development when you want to create bindings that you don't want to let an end-user change. That can be very useful.
I'm not sure it's really *that* important in package development. Anything you create in a package will be in its own namespace, so if you have a function that depends on a "constant" you created, your function will still grab the correct value even if the user creates a variable with that name. You don't need to use bindings to do that.
 

bryangoodrich

Probably A Mammal
The point is to control your package, not to avoid accidental bindings. I'd have to go out of my way to change the default print function in the base environment, but there's nothing stopping me from going

Code:
assign('print.default', whatever, envir = baseenv())
When you do that, you can see the developers took this to heart

Code:
[COLOR="red"]cannot change value of locked binding for 'print.default'[/COLOR]
 

Dason

Ambassador to the humans
The point is to control your package, not to avoid accidental bindings. I'd have to go out of my way to change the default print function in the base environment, but there's nothing stopping me from going

Code:
assign('print.default', whatever, envir = baseenv())
When you do that, you can see the developers took this to heart

Code:
[COLOR="red"]cannot change value of locked binding for 'print.default'[/COLOR]
I still don't see the point of doing this in a user-built package. If I make a package with a function "myfun" and the user wants their own function called "myfun", why should I stop them? If I have other functions in my package that use myfun, the fact that the user defined their own myfun won't mess up my functions; they will still use the correct version. If somebody went out of their way to rebind mypackage::myfun then, you know what, good for them--that is not something a user will do by accident, and anybody doing it knows the risks involved.

Also note that things in a namespace are locked without you explicitly having to do anything

Code:
library(reshape2)
assign("melt.data.frame", mean, envir = asNamespace("reshape2"))
 

bryangoodrich

Probably A Mammal
TIL you can specify the local environment when sourcing a file

Code:
e <- new.env()
source(...some file, local = e)
Don't know when this would be of great use, but it certainly lets you execute a script without mucking up your current global environment. I made use of it because I have work to do before I get this stuff into a package, so I put all of it into one file. Then I made a load file that basically does

Code:
e <- new.env()
source(foobar, local=e)
attach(e)
rm(e)
This works as a sort of stand-in for a package for the time being. I rarely source files, though, so I've never looked into this stuff before.
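Here's a self-contained sketch of the pattern (my own, with a throwaway temp file standing in for the real script):

Code:
e <- new.env()
tmp <- tempfile(fileext = ".R")
writeLines("double_it <- function(x) x * 2", tmp)
source(tmp, local = e)
e$double_it(21)  # 42: call it through the environment, no attach needed
ls(e)            # "double_it"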
 

trinker

ggplot2orBust
TIL: for loops don't have to iterate over indices.

Here is an example:

Code:
[COLOR="gray"]## I used to think we need to do:[/COLOR]
for (i in seq_along(LETTERS)) {
    print(LETTERS[i])
}

[COLOR="gray"]## But we can do:[/COLOR]
for (i in LETTERS) {
    print(i)
}
I'm not sure what this means for speed. I thought this was an advantage unique to lapply and family--that you didn't have to use indices.
 

bryangoodrich

Probably A Mammal
Well, unlike other languages, R is simply looping over vectors, whether that is a vector of indices or the vector itself. You can also loop over the names of a vector (or data frame) and use each name to select (or subset) what you want. This gives you a little more flexibility than lapply. In other languages, you usually loop with indices and test against the position. However, newer constructs use what are called iterators, which by design can be iterated over, so you can write what are called "for each" loops. This is just an abstraction of a normal for loop.

R is really doing a for each loop because vectors are an iterable type. So even the loop on a vector of indices is a for each loop on a vector that logically is related to the vector you're interested in indexing.
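A quick sketch of the names version (my own example):

Code:
x <- c(a = 1, b = 2, c = 3)
for (nm in names(x)) {
    cat(nm, "=", x[[nm]], "\n")  # use the name itself to subset
}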
 

Dason

Ambassador to the humans
TIL: sort has a 'partial' parameter, which is useful if you only care about the top n largest or top n smallest values in a vector.

Code:
> library(microbenchmark)
> n <- 1000000
> x <- rnorm(n)
> 
> # Gives the same result
> all(sort(x)[1:5] == sort(x, partial = 1:5)[1:5])
[1] TRUE
> # But is faster
> microbenchmark(sort(x)[1:5], sort(x, partial = 1:5)[1:5])
Unit: milliseconds
                        expr       min        lq    median        uq       max neval
                sort(x)[1:5] 201.06586 202.77952 203.53633 204.60620 244.15758   100
 sort(x, partial = 1:5)[1:5]  33.35321  34.22701  35.03615  35.49524  74.88088   100
 

bryangoodrich

Probably A Mammal
TIL RStudio not only lets you skip around tabs from a drop-down list (the double arrows on the tab bar when you have more tabs loaded than can be shown). It also has, within a document, a drop-down list to jump to a function! It's next to the line:column number display. I was sitting here staring at my package script wondering "wtf is that?", clicked on it, and it dawned on me: "whoa, here's all the functions in this file!" More than that, however, I have a few "# ====== Section Title ======" comments, and it has those in the list, too! Considering I sometimes throw a bunch of stuff into these files, this is mightily useful for quick navigation. Maybe. We'll see. At least I know it is there now. It may have always been there. My ignorance has been obliterated, though!