Create All Variable Value Combinations For Population

hlsmith

Less is more. Stay pure. Stay poor.
#1
There is a finite set of characteristics that are possible in a population. I want to create a data set with every combination of the variable values, so one row for each unique combination.

Here is a toy example:
Age: 0-110
Gender: Male / Female
Weight: 0-500
Height: 0-100
etc.

So the set would look like this:
0, M, 0, 0
...
110, F, 500, 100

What would be the best way to accomplish this, possibly using R, to create a data set with an observation for every possible combination of these variable values? In actuality, I will have way more variables. I figured that if I couldn't work it out, which I want to, I could fall back on a Monte Carlo simulation and remove duplicates. I get that this will be a huge dataset!
 

Dason

Ambassador to the humans
#2
Using base R, just make a vector of the possible values for each variable and then use expand.grid, so something like:
Code:
age <- seq(0, 110, by=5)
gender <- c("male", "female")
out <- expand.grid(age, gender)
# Although the following is probably nicer to use if you want to give columns names and want a standard data.frame
out <- expand.grid(age = age, gender = gender, KEEP.OUT.ATTRS = FALSE)

# So really you could avoid storing the vectors individually and just do...
out <- expand.grid(age = seq(0, 110, by = 5),
                   gender = c("male", "female"),
                   KEEP.OUT.ATTRS = FALSE)
 

hlsmith

Less is more. Stay pure. Stay poor.
#3
This is phenomenally simple and great. Though I got an "Error: cannot allocate vector of size 16.8 Gb" for my first example.

I had to look up that you just multiply the number of categories per variable together to get the total number of combinations:

(111 * 2 * 13 * 201 * 51 * 76) ≈ 2.2B
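A quick way to check that arithmetic in R, and to see roughly where the allocation error comes from. This is only a sketch: the six level counts are the ones quoted above, and the 8 bytes per value assumes a column stored as doubles.

```r
# Number of levels per variable, as quoted in the post above.
n_levels <- c(111, 2, 13, 201, 51, 76)

# Total number of rows expand.grid() would have to create.
n_rows <- prod(n_levels)
n_rows  # 2248413336

# Approximate size of ONE double-precision column of that length, in GiB
# (assumption: 8 bytes per value).
gib_per_column <- n_rows * 8 / 1024^3
gib_per_column  # about 16.75
```

That per-column figure lines up with the "16.8 Gb" in the error message, so R appears to be failing on a single column of the grid, before the full table is even assembled.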
 

Dason

Ambassador to the humans
#4
Right, which is why I used the by = 5 argument for age. You have to ask whether you really need *all* options. If you do, there are ways to get around this, but it's going to be tedious, and you have to ask how you're actually going to use the data.
 

hlsmith

Less is more. Stay pure. Stay poor.
#5
I left my purpose out of the initial explanation so as not to overload the reader. There is a prediction model (logistic) that is used for survival in medicine. This model kicks out coefficients that people can use to forecast survival probabilities. Good enough, but given the advancements in modeling, I thought a black-box model could do better; it just won't kick out usable coefficients. But if I build multiple models and score every point in the sample space, especially for the black boxes, I could have a lookup table of survival probabilities, and the user would only need their variable input values.

So, if I moved forward, I would need every value. What are the "ways to get around this?"
 

Dason

Ambassador to the humans
#6
Well it certainly sounds like you could do something with a loop where you generate a subspace of your options, score them, output them to a file, do garbage collection, and then move on. You'd probably need to spec it out to see how much you could get away with for each portion and how long it might take to do the entire thing. But a lookup table that large would probably be best stored in an actual database.
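A rough sketch of that loop, under loud assumptions: score_fun is a made-up stand-in for whatever model you actually fit, the grid is deliberately tiny, and the output goes to a temporary CSV rather than a database.

```r
# Hypothetical scoring function standing in for a real fitted model.
score_fun <- function(d) plogis(-2 + 0.03 * d$age - 0.5 * (d$gender == "male"))

age_values <- 0:110
out_file <- tempfile(fileext = ".csv")

# One chunk per age value: build the subspace, score it, append it to the file.
for (i in seq_along(age_values)) {
  chunk <- expand.grid(age = age_values[i],
                       gender = c("male", "female"),
                       weight = seq(0, 500, by = 50),
                       KEEP.OUT.ATTRS = FALSE,
                       stringsAsFactors = FALSE)
  chunk$prob <- score_fun(chunk)

  # Write the header only with the first chunk, then append.
  write.table(chunk, out_file, sep = ",", row.names = FALSE,
              col.names = (i == 1), append = (i > 1))

  # Drop the chunk and trigger garbage collection before the next pass.
  rm(chunk)
  invisible(gc())
}
```

Which variable to loop over, and how big each chunk can be, depends on how much memory you have; that is the part you would need to spec out.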

Another alternative, if you have access to a proper database that can interact with your model, is to do some proper memoization: store any values that have already been computed so you can return those instantly, and compute the rest on demand the first time they are requested. You could also have a background process evaluating values that haven't been computed yet if you have the processing power, storage capacity, and demand for faster results.
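To make the memoization idea concrete, here is a minimal in-memory sketch. It uses an R environment as the cache instead of a real database, and expensive_score, lookup_score, and counter are all hypothetical names invented for the illustration.

```r
# In-memory cache standing in for a database table (assumption).
cache <- new.env(hash = TRUE)
counter <- new.env()
counter$calls <- 0

# Hypothetical stand-in for an expensive model prediction.
expensive_score <- function(age, gender) {
  counter$calls <- counter$calls + 1
  plogis(-2 + 0.03 * age - 0.5 * (gender == "male"))
}

# Return the cached value if present; otherwise compute, store, and return it.
lookup_score <- function(age, gender) {
  key <- paste(age, gender, sep = "|")
  if (!exists(key, envir = cache, inherits = FALSE)) {
    assign(key, expensive_score(age, gender), envir = cache)
  }
  get(key, envir = cache, inherits = FALSE)
}

p1 <- lookup_score(40, "male")  # computed by the model
p2 <- lookup_score(40, "male")  # served from the cache
counter$calls                   # 1
```

With a real database the key would be the combination of input columns and the lookup would be a query, but the compute-once, reuse-afterwards logic is the same.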
 

hlsmith

Less is more. Stay pure. Stay poor.
#7
So is the error "cannot allocate vector of size 18.1 Gb" complaining about the output data set, which would only be 2.2B x 6, or about the resources needed to do the calculations? I was thinking it just didn't like the size of the generated dataset, but then you got me thinking that it could be a processing or storage issue.

I have access to Python on a server - but not sure about R.
 

Dason

Ambassador to the humans
#8
The error would be about allocating the data structures for the processing. So you could process parts at a time and then write them out to a file. Then you could clean up the session and process the next part. But I also don't know how you plan on delivering results to the user. Are you operating with a database of some sort? Or do you just need to deliver a file or something and let some devs deal with the technical stuff?
 

hlsmith

Less is more. Stay pure. Stay poor.
#9
This whole thing is just an idea I had in the shower this morning. That is where all of my best thinking happens. I will probably never actually do the latter part of creating and scoring the 'super' dataset. I was just thinking about doing the model comparison portion, but then thought people would say, "cool enough, how would I actually get predictions or use this?" So I was brainstorming ways of making the results accessible.

Though, I do see utility in knowing the above function for other processes in the future.
 

hlsmith

Less is more. Stay pure. Stay poor.
#10
How would the following be written if you did not want every value in this range?

ISS = seq(0, 75, by = 1),

Say you wanted (1-12, 15-36, 39-54, 56-75) or something choppy like that?
 

Dason

Ambassador to the humans
#11
You don't appear to have a regular pattern for the starting/ending values, so I'm not sure there is a super elegant way unless you explain your logic there a bit better.
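For what it's worth, if the ranges are taken literally as written in the question, one plain option is to concatenate integer sequences with c(). A minimal sketch; the gender column is only there to show the vector dropping into expand.grid:

```r
# The "choppy" ranges from the question, glued together with c().
ISS <- c(1:12, 15:36, 39:54, 56:75)
length(ISS)  # 70

out <- expand.grid(ISS = ISS,
                   gender = c("male", "female"),
                   KEEP.OUT.ATTRS = FALSE)
nrow(out)  # 140
```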
 

hlsmith

Less is more. Stay pure. Stay poor.
#12
Variable ISS can only take the following values:

0 1 2 3 4 5 6 8 9 10 11 12 13 14 16 17 18 19 20 21 22
24 25 26 27 29 30 32 33 34 35 36 38 41 42 43 45 48 50 51 54 57
59 66 75


Background: it is a composite score that takes the three highest individual scores (each on a scale of 0-5), squares them individually, and sums the three numbers: (1, 4, 3) -> squared (1, 16, 9) -> summed 1 + 16 + 9 = 26.

So if I shave off the impossible values, the file won't be as big: the ISS factor drops from 76 values to 45, so roughly 2.2B * 45 / 76 ≈ 1.3B rows.
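The attainable ISS values can also be generated instead of typed by hand. This sketch assumes exactly what is described above: three individual scores, each an integer from 0 to 5, squared and summed.

```r
# All triples of individual scores on the 0-5 scale.
scores <- 0:5
triples <- expand.grid(a = scores, b = scores, c = scores)

# Square each score, sum the three, and keep the distinct totals.
iss_values <- sort(unique(triples$a^2 + triples$b^2 + triples$c^2))

length(iss_values)  # 45
iss_values
```

The resulting vector can be passed straight to expand.grid as the ISS column.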
 

hlsmith

Less is more. Stay pure. Stay poor.
#13
I am too lazy at times; sorry for not giving that more effort before typing. This seems to work:

ISS = c(0, 1, 2, 3, 4, 5, 6, 8, 9, 10, 11, 12, 13, 14, 16, 17, 18, 19, 20, 21, 22,
24, 25, 26, 27, 29, 30, 32, 33, 34, 35, 36, 38, 41, 42, 43, 45, 48, 50, 51, 54, 57,
59, 66, 75),
 

Buckeye

Active Member
#15
I admit that I have only read the first two posts of this thread. But, maybe you can concatenate variables together using this:
Code:
library(dplyr)

col1<-c("male","female","male","female","female")
col2<-c(22,23,15,10,19)

dat<-data.frame(col1,col2)

dat$concat<-paste(dat$col1,dat$col2,sep = ", ")

distinct_combos<-dat %>%
  select(concat) %>%
  distinct()
paste() works with an "infinite" number of columns, but the resulting character string probably has some length limit.
 

Dason

Ambassador to the humans
#18
If all we were trying to do is remove duplicates then we could just use the "duplicated" function without going through the whole paste thing.
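Something like this, reusing the same kind of toy data with one repeated row added so there is actually a duplicate to drop:

```r
# Toy data; the last row repeats the first on purpose.
dat <- data.frame(col1 = c("male", "female", "male", "female", "female", "male"),
                  col2 = c(22, 23, 15, 10, 19, 22))

# duplicated() flags rows that repeat an earlier row across all columns.
distinct_combos <- dat[!duplicated(dat), ]
nrow(distinct_combos)  # 5
```

unique(dat) does the same job in one call.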