Normalizing lists

merik

New Member
#1
I have a number of lists which look like this:

Code:
> list1
$var1
[1] "SomeValue"

$var2
[1] "OtherValue"

> list2
$var1
[1] "SomeOtherValue"

$var4
[1] "AnotherValue"

> list3
$var2
[1] "YetAnotherValue"

$var3
[1] "SomeDifferentValue"

$var4
[1] "YetOtherValue"

...

and so on. I also have a list of all possible variable names in a character vector:

Code:
> varnames
[1] "var1" "var2" "var3" ....
What I want to create in the end is a data frame that has all the var names as its columns, where each row represents one of those lists.

The first step, IMHO, is to normalize all those lists: that is, for list1, add variables var3, var4, etc. and set their values to NA, then do the same for list2, list3, and so forth.

How can I do that?
 

Dason

Ambassador to the humans
#2
Screw that. Let's just do your entire goal.

Code:
# Make some fake data
list1 <- list(v1 = 1, v2 = 2)
list2 <- list(v1 = 1, v3 = 3)
list3 <- list(v5 = 5)
list4 <- list(v1 = 1, v2 = 2, v3 = 3)

# Stuff all the stuff we care about into a list
mylist <- lapply(ls(pattern = "list[0-9]"), get)
# If you don't feel comfortable with that 
# you could manually make the list.  But this is
# cumbersome if you have a lot of lists.
mylist <- list(list1, list2, list3, list4)

# We're going to use merge later so we want
# a unique id for each element of the list
addID <- function(i){
  x <- as.data.frame(mylist[[i]])
  x$ID <- i
  x
}

newlist <- lapply(seq(mylist), addID)

# Merge two data frames and don't drop any values
# This is typically too trivial for me to make into its
# own function but Reduce doesn't allow passing extra
# parameters to the function of interest
mymerge <- function(x, y){
  merge(x, y, all = T)
}

out <- Reduce(mymerge, newlist)
out

# We can remove the extra id if we want
out$ID <- NULL
out
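
P.S. If you really did want the intermediate "fill in the missing names with NA" step you described, something along these lines should work (just a sketch; here varnames is a stand-in for your actual character vector of all possible names):

Code:
# Sketch of the normalization step: for each list, add any
# missing names with a value of NA and put them in a fixed order
varnames <- c("v1", "v2", "v3", "v5")  # stand-in for your real varnames

normalize <- function(x, allnames){
  missing <- setdiff(allnames, names(x))
  x[missing] <- NA
  x[allnames]
}

normlist <- lapply(mylist, normalize, allnames = varnames)
normlist[[1]]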
 

merik

New Member
#3
Alright. It works! But I don't understand how. I'll try to explain things I do understand, and ask what I do not understand. Please help me out here.

First, you create a list of all my lists. Then, you create an object "newlist", which converts each element of mylist (which was a list) into a data frame and adds an ID column to its end. There is no future reference to this ID column, which makes me wonder why you do that.

Then, you create a function "mymerge" which needs two parameters, and just merges them. You use Reduce to apply that function to the newlist. I am surprised that you can do that, because Reduce's "x" parameter should be a vector, but you are passing a list and it still works. But more importantly, I'm not following what parameters are being passed to the mymerge function exactly. It needs two parameters, right? I can imagine "newlist" to be sent as one of the parameters, but I can't follow what is being sent to it as "y".
 

Dason

Ambassador to the humans
#4
Alright. It works! But I don't understand how. I'll try to explain things I do understand, and ask what I do not understand. Please help me out here.

First, you create a list of all my lists. Then, you create an object "newlist", which converts each element of mylist (which was a list) into a data frame and adds an ID column to its end. There is no future reference to this ID column, which makes me wonder why you do that.
It's because I use merge to combine the data frames together. If somehow we have duplicate "rows" in our data frames then we don't get that duplication in the final result with merge. Here is an example of what I mean
Code:
> list1 <- list(v1 = 1, v2 = 2)
> list2 <- list(v1 = 1, v2 = 2)
> 
> mylist <- list(list1, list2)
> 
> addID <- function(i){
+     x <- as.data.frame(mylist[[i]])
+     #x$ID <- i
+     x
+ }
> 
> newlist <- lapply(seq(mylist), addID)
> 
> mymerge <- function(x, y){
+     merge(x, y, all = T)
+ }
> 
> out <- Reduce(mymerge, newlist)
> out
  v1 v2
1  1  2
Since the rows were identical in merge's eyes, it only gives one row in the output when we would hope for 2. By creating a unique ID for each of the data frames we guarantee that we don't drop any of these duplicates.
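
For comparison, keeping the x$ID <- i line in addID (as in my first reply) gives both rows back, something like this:
Code:
> addID <- function(i){
+     x <- as.data.frame(mylist[[i]])
+     x$ID <- i
+     x
+ }
> newlist <- lapply(seq(mylist), addID)
> out <- Reduce(mymerge, newlist)
> out
  v1 v2 ID
1  1  2  1
2  1  2  2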

Then, you create a function "mymerge" which need two parameters, and just merges them. You use Reduce to apply that function to the newlist. I am surprised that you can do that, because Reduce's "x" parameter should be a vector, but you are passing a list and it still works.
Fun fact: lists are vectors
Code:
> list1
$v1
[1] 1 1

$v2
[1] 2 2

> is.vector(list1)
[1] TRUE
?is.vector and ?vector have more info about this.

But more importantly, I'm not following what parameters are being passed to the mymerge function exactly. It needs two parameters, right? I can imagine "newlist" to be sent as one of the parameters, but I can't follow what is being sent to it as "y".
This is actually where Reduce comes in. Essentially Reduce allows us to sequentially pass the elements of a list into a binary function (any function that takes two parameters). I think giving the motivation and an example for Reduce really clears it up though.

A lot of times we want to apply a function to the first two elements of a vector (or list) and then apply the same function to the result and the third item in the vector, and then apply the function to the result of that and the fourth item in the vector, and so on until the end of the vector.

An easy example: let's say we have a function that adds two numbers together
Code:
myadd <- function(x, y){
  x + y
}
but what we really want is to add all the numbers in a vector together. We could first add the first and second elements, then add the result and the third element, ... and so on.

Code:
> mydata <- c(1, 2, 3, 4)
> # Doing this by hand
> myadd(myadd(myadd(mydata[1], mydata[2]), mydata[3]), mydata[4])
[1] 10
But we can avoid that since this is exactly the statement that Reduce will build for us
Code:
> Reduce(myadd, mydata)
[1] 10
Clearly using 'sum' would be the better alternative here, but I think this makes it easier to visualize what Reduce is doing. So really the following
Code:
out <- Reduce(mymerge, newlist)
is the same as
Code:
mymerge(mymerge(mymerge(newlist[[1]], newlist[[2]]), newlist[[3]]), newlist[[4]])
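You can check that equivalence yourself by building the nested call by hand and comparing it to the Reduce result (this assumes the four-element newlist from my first reply):
Code:
nested  <- mymerge(mymerge(mymerge(newlist[[1]], newlist[[2]]), newlist[[3]]), newlist[[4]])
reduced <- Reduce(mymerge, newlist)
identical(nested, reduced)  # should be TRUE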