Hi all,
My particular approach may be completely wrong, so let me state the high-level goal first:
The data I'm working with consists of visitor demographic features plus a score for each visitor. Currently I am using 14 demographic features, each with up to 5 levels. I want to determine which partition of the demographic features yields the biggest difference in average score.
I don't want to get too granular, so I will likely omit partitions that contain less than 5% of the data. My end result would ideally be a partition of all visitors into 2-4 groups with different average scores.
***************
So! I wanted to start small and begin by calculating the average score based on every combination of 2 distinct features. I got stuck building the script to accomplish this. Here is a sample input/output:
Code for Input:
data=matrix(c("Blue","Blue","Brown","M","M","F","Rep","Dem","Rep",1,0,1), ncol=4)
dimnames(data)=list(c(1,2,3),c("Eyes","Gender","Politic","Score"))
Input:
Eyes Gender Politic Score
1 "Blue" "M" "Rep" "1"
2 "Blue" "M" "Dem" "0"
3 "Brown" "F" "Rep" "1"
Output:
Blue, M : 0.5
Brown, F : 1
Blue, Rep : 1
Blue, Dem : 0
Brown, Rep : 1
M, Rep : 1
M, Dem : 0
F, Rep : 1
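For reference, here is a minimal sketch of one way the pairwise averages could be computed, using the rows shown in the Input table. It assumes the data is held in a data frame (so that Score stays numeric) rather than a character matrix; `pair_means` and `score_col` are hypothetical names introduced only for this illustration.

```r
# Sample rows from the Input table, as a data frame so Score is numeric.
visitors <- data.frame(
  Eyes    = c("Blue", "Blue", "Brown"),
  Gender  = c("M", "M", "F"),
  Politic = c("Rep", "Dem", "Rep"),
  Score   = c(1, 0, 1),
  stringsAsFactors = FALSE
)

# pair_means() is a hypothetical helper name, not an existing function.
pair_means <- function(df, score_col = "Score") {
  features <- setdiff(names(df), score_col)
  # Every unordered pair of distinct feature columns.
  feature_pairs <- combn(features, 2, simplify = FALSE)
  results <- lapply(feature_pairs, function(p) {
    # Mean score for each observed combination of levels of the two features.
    agg <- aggregate(df[[score_col]], by = df[p], FUN = mean)
    data.frame(
      combo      = paste(agg[[1]], agg[[2]], sep = ", "),
      mean_score = agg$x,
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, results)
}

result <- pair_means(visitors)
print(result)
```

Only combinations that actually occur in the data are reported, which matches the eight lines of the desired output. With 14 features there would be 91 feature pairs, so the same loop should still be cheap.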
Right now I am getting stuck on the looping. To start, I am just trying to build a function that creates a matrix of all distinct (answer, question) pairs. When I loop through the questions and THEN through each answer to a question, I get various errors.
analyze = function(test_data) {
  x=matrix(ncol=2)
  categories = lapply(test_data, unique) #created list of all distinct categoric values
  category_names = names(categories)
  for (feature in category_names) {
    for (ans in categories$feature){
      x=rbind(x,c(ans,feature))
    }
  }
  return(x)
}
*I know this isn't representative of the whole problem described at the beginning; I just tried to simplify it down to an easier problem whose answer will help the most.
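A possible repair of the loop, offered as a sketch rather than a definitive fix. Two things in the function above look suspect: `categories$feature` looks up an element literally named `feature` instead of using the loop variable's value (so `categories[[feature]]` is probably what was intended), and `lapply` over a matrix visits individual cells rather than columns. The version below assumes the input is a data frame; the `score_col` argument and the `answer`/`question` column names are hypothetical additions.

```r
# Sample rows from the Input table, as a data frame so lapply() walks columns.
visitors <- data.frame(
  Eyes    = c("Blue", "Blue", "Brown"),
  Gender  = c("M", "M", "F"),
  Politic = c("Rep", "Dem", "Rep"),
  Score   = c(1, 0, 1),
  stringsAsFactors = FALSE
)

analyze <- function(test_data, score_col = "Score") {
  # Keep only the demographic columns; score_col is an assumed argument.
  features <- test_data[setdiff(names(test_data), score_col)]
  categories <- lapply(features, unique)  # distinct levels per feature
  rows <- list()
  for (feature in names(categories)) {
    # [[feature]] uses the value of the loop variable; $feature would not.
    for (ans in categories[[feature]]) {
      rows[[length(rows) + 1]] <- c(answer = ans, question = feature)
    }
  }
  # Character matrix with one row per (answer, question) pair.
  do.call(rbind, rows)
}

answer_pairs <- analyze(visitors)
print(answer_pairs)
```

Collecting rows in a list and binding once at the end also avoids the all-NA first row that `matrix(ncol=2)` would otherwise leave in the result.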