melt function chooses wrong id variable with large datasets in R

#1
Hello all,

I'm using a large dataset consisting of 2 groups of data: 2 columns in Excel, each with a header (the group name) and 15,000 rows of data. I would like to compare these groups, so I transform my dataset with the melt function to get 1 column of data and 1 column of ID variables, after which I can apply different statistical tests. With small datasets this works great: the melt function automatically takes the name in row 1 as the ID variable and melts the data, giving me a data frame with all ID variables in column 1 and the corresponding data in column 2.
With this big dataset, however, it chooses the whole first column as ID variables instead of the first row. Is there a reason why this happens, and how can I make sure the first row is used as the ID variable and the rows below it as data?

If I specify that I want the first row to be the id variable, I also get an error.

melt(dataset,id.vars=dataset[1,], na.rm=TRUE)

Error: id variables not found in data: norm, jaar
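That error happens because id.vars expects column names (or column positions), not a row of values: dataset[1, ] passes melt the cell contents "norm" and "jaar", which are not names of any column. If those group names really sit in the first row of the sheet, re-reading the file with that row as the header and then melting both columns as measure variables should give the two-column result described above. A minimal sketch, assuming a csv export (the file name and read.csv call are placeholders for however the data is actually imported):

Code:
library(reshape2)

# Placeholder import: re-read so the group names become the column names.
dataset <- read.csv("mydata.csv", header = TRUE)

# No id variables needed: melt both columns as measures.
# Result: column 1 = variable (group name), column 2 = value (data).
long <- melt(dataset, measure.vars = c("norm", "jaar"), na.rm = TRUE)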

Are there alternative ways to create a good reshaped dataset?

Kind Regards
Joachim
 

TheEcologist

Global Moderator
#2
Here is one way to do this in base R, with reshape.

Code:
dd <- data.frame(g1 = rnorm(5), g2 = rnorm(5), g3 = rnorm(5))  # example wide data: 3 group columns

# wide -> long: "stat" holds the values, "g" identifies the original row
reshape(dd, idvar = "g", varying = names(dd), v.names = "stat", direction = "long")
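Applied to the two-column case in the question (the names norm and jaar below are taken from the error message and may need adjusting), the same pattern would look roughly like this; the times argument keeps the group names in the long result instead of 1 and 2:

Code:
# Sketch only: column names assumed from the error message in post #1.
reshape(dataset,
        varying   = c("norm", "jaar"),  # the two wide columns to stack
        v.names   = "stat",             # name of the value column
        timevar   = "group",            # name of the grouping column
        times     = c("norm", "jaar"),  # labels to store in "group"
        idvar     = "row",              # created automatically as 1:nrow
        direction = "long")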
 

trinker

ggplot2orBust
#3
Here's a dplyr + tidyr approach (faster than the reshape2 package). It's not as compact, code-wise, as TE's approach above, but to me the dplyr + tidyr approach is easier to remember because each function does one thing well.

Code:
if (!require("pacman")) install.packages("pacman")
p_load(dplyr, tidyr)

dd %>%
    mutate(time = 1:n()) %>%
    gather(g, stat, -time) %>%
    mutate(g = as.numeric(gsub("\\D", "", g))) %>%
    arrange(time)
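For the question's two-column data the regex step isn't needed, since the column names themselves are the group labels. A sketch, with the names norm and jaar assumed from the error message:

Code:
# Sketch for the original two-column case; na.rm = TRUE plays the same
# role as in the melt call from post #1.
dataset %>%
    gather(group, value, norm, jaar, na.rm = TRUE)  # wide -> long, dropping NAs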
 

TheEcologist

Global Moderator
#4
I personally find reshape easier to comprehend in this case, but note that the dplyr + tidyr approach, though more cumbersome, will be faster on large datasets. Those packages have been highly optimized, so it may be worthwhile to learn them if you foresee yourself working with big datasets.

Note: at this point of development, you can still achieve speeds well beyond the dplyr + tidyr approach with custom base code on very large datasets, but it will be MUCH more cumbersome.
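As a rough illustration of that last point (a sketch only, reusing the dd example from post #2): the long format can be built directly with rep() and unlist(), which skips most of the per-column bookkeeping but leaves everything else to you.

Code:
# Sketch: manual wide-to-long with vectorised base functions.
long <- data.frame(
    g    = rep(names(dd), each = nrow(dd)),  # group label repeated per row
    stat = unlist(dd, use.names = FALSE)     # values stacked column by column
)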