Question regarding grouping several variables

#1
Hello!
In the context of regression, in what situations am I allowed to group separate variables into a single categorical predictor? I'll use R code as an example since it's what I'm most familiar with.
For example I could run a model like this:
Code:
dat <- data.frame(ind=c(1,2,3,4,5,6), y=c(40,63,23,66,74,45), day1=c(4,6,3,6,1,3), day2=c(6,4,7,9,8,9))
dat
  ind  y day1 day2
1   1 40    4    6
2   2 63    6    4
3   3 23    3    7
4   4 66    6    9
5   5 74    1    8
6   6 45    3    9

mod <- lm(y ~ day1 + day2 + day1:day2, data=dat)
summary(mod)
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -133.873    230.464  -0.581    0.620
day1          33.121     41.781   0.793    0.511
day2          23.062     29.095   0.793    0.511
day1:day2     -4.046      5.289  -0.765    0.524
But I could also reshape dat this way:
Code:
library(reshape2)  # provides melt()
dat2 <- melt(dat, id.vars = c("ind","y"))
dat2
   ind  y variable value
1    1 40     day1     4
2    2 63     day1     6
3    3 23     day1     3
4    4 66     day1     6
5    5 74     day1     1
6    6 45     day1     3
7    1 40     day2     6
8    2 63     day2     4
9    3 23     day2     7
10   4 66     day2     9
11   5 74     day2     8
12   6 45     day2     9

mod2 <- lm(y ~ variable + value + variable:value, data=dat2)

summary(mod2)
                   Estimate Std. Error t value Pr(>|t|)
(Intercept)         47.7965    20.7469   2.304   0.0502 .
variableday2        -1.7345    41.7837  -0.042   0.9679
value                1.0531     4.9129   0.214   0.8356
variableday2:value  -0.2478     6.9479  -0.036   0.9724
The two approaches give very different results, and I'm not sure if one is considered more valid than the other.

Thanks!
 

obh

Active Member
#2
Can you give a simple non-R example? Or paste a small runnable R example inside a block that starts with [ CODE] and ends with [/CODE], and include the data.
 

hlsmith

Not a robit
#3
Two questions:

1.) What is the purpose of the model? What model are you trying to run? mod, fit on dat, has three terms: two main effects and their two-way interaction (day1:day2).

2.) Look at which coefficients each model generates; that will help you distinguish which model is being run. If they generate different output, then they are likely different models.
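
A minimal sketch of that check, using the data from post #1. The long-format data frame is rebuilt here with base R as a stand-in for the melt() call, so that the snippet runs without extra packages; that substitution is an assumption of this sketch, not something from the original post.

```r
# Data copied from post #1
dat <- data.frame(ind  = 1:6,
                  y    = c(40, 63, 23, 66, 74, 45),
                  day1 = c(4, 6, 3, 6, 1, 3),
                  day2 = c(6, 4, 7, 9, 8, 9))

# Long format built with base R (stands in for melt() from post #1)
dat2 <- data.frame(ind      = rep(dat$ind, 2),
                   y        = rep(dat$y, 2),
                   variable = factor(rep(c("day1", "day2"), each = 6)),
                   value    = c(dat$day1, dat$day2))

mod  <- lm(y ~ day1 + day2 + day1:day2, data = dat)
mod2 <- lm(y ~ variable + value + variable:value, data = dat2)

# Coefficient names show which terms each model estimates
names(coef(mod))
names(coef(mod2))

# Order of each term in mod: 1 = main effect, 2 = two-way interaction
attr(terms(mod), "order")
```

The two sets of coefficient names do not match, which is a quick sign that the two calls describe different models rather than two views of the same one.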
 

obh

Active Member
#5
Hi Stat20,

The second model is incorrect, even if you ignore the interaction X1X2.

I will take one row for example:

In the first model:

X1 X2 X1X2 Y
4 6 24 40

In the second model:

X1 X2 X1X2 Y
4 0 0 40
0 6 0 40


Why would you expect to get the same answer?
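
For anyone who wants to see this in R, here is a rough sketch that prints the design-matrix rows for individual 1 under both formulas. The long data frame is rebuilt with base R in place of melt(), which is an assumption of this sketch. R's default treatment coding lays the long-format rows out a little differently from the table above, but the point is the same: each individual contributes two rows, and neither row contains both day1 and day2.

```r
# Data copied from post #1
dat <- data.frame(ind  = 1:6,
                  y    = c(40, 63, 23, 66, 74, 45),
                  day1 = c(4, 6, 3, 6, 1, 3),
                  day2 = c(6, 4, 7, 9, 8, 9))

# Long format built with base R (stands in for melt() from post #1)
dat2 <- data.frame(ind      = rep(dat$ind, 2),
                   y        = rep(dat$y, 2),
                   variable = factor(rep(c("day1", "day2"), each = 6)),
                   value    = c(dat$day1, dat$day2))

# Design matrices that lm() would build for each formula
X_wide <- model.matrix(y ~ day1 + day2 + day1:day2, data = dat)
X_long <- model.matrix(y ~ variable + value + variable:value, data = dat2)

# Individual 1 in the wide model: a single row holding day1, day2 and their product
X_wide[1, ]

# Individual 1 in the long model: two rows, each seeing only one day's value
X_long[dat2$ind == 1, ]
```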
 
#6
Thank you obh for your answer.
I do not expect the models to give the same answer; I was just wondering whether the second model is valid. Why am I not allowed to have this structure?
X1 X2 X1X2 Y
4 0 0 40
0 6 0 40

If y represents performance on a math test and day1 and day2 are hours spent studying, I might want to test whether math scores can be predicted by a combination of time spent studying on day1 and day2. I realize that this is a repeated-measures design, so to be accurate I should use a mixed model approach instead, but my question applies to the mixed model case as well.
 

obh

Active Member
#7
Hi S,

The following row is incorrect:
X1 X2 X1X2 Y
4 0 0 40

This is incorrect because when X1=4 and Y=40, X2 is not equal to 0; it is equal to 6.

You may decide for example to use only X1, or only X2
Y=a0+a1X1 okay.
Y=a0+a1X2 okay.
Y=a0+a1X1+a2X2 okay.

But you can't combine data of Y=a0+a1X1 and Y=a0+a1X2 in the same model.
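
One way to see the problem in R, as a sketch only: stacking the two days copies every y value, so lm() is handed 12 rows as if they were 12 independent observations, although only 6 individuals were measured. The base-R reshaping below stands in for melt() and is an assumption of this sketch.

```r
# Data copied from post #1
dat <- data.frame(ind  = 1:6,
                  y    = c(40, 63, 23, 66, 74, 45),
                  day1 = c(4, 6, 3, 6, 1, 3),
                  day2 = c(6, 4, 7, 9, 8, 9))

# Long format built with base R (stands in for melt() from post #1)
dat2 <- data.frame(ind      = rep(dat$ind, 2),
                   y        = rep(dat$y, 2),
                   variable = factor(rep(c("day1", "day2"), each = 6)),
                   value    = c(dat$day1, dat$day2))

mod  <- lm(y ~ day1 + day2 + day1:day2, data = dat)
mod2 <- lm(y ~ variable + value + variable:value, data = dat2)

# Rows, distinct individuals, and residual degrees of freedom per model
wide_info <- c(rows = nrow(dat),  individuals = length(unique(dat$ind)),
               resid_df = df.residual(mod))   # 6 rows, 6 individuals, 2 df
long_info <- c(rows = nrow(dat2), individuals = length(unique(dat2$ind)),
               resid_df = df.residual(mod2))  # 12 rows, 6 individuals, 8 df
wide_info
long_info

# Every individual's y appears exactly twice in the stacked data
all(table(dat2$ind) == 2)
```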

"I realize that this is a repeated measures."
This is not a repeated measure, since you have only one DV (Y).
You have two predictors for one DV (the predictors may be dependent).
 
#8
Thanks obh, that was very helpful. Do you mind if I ask you another question?
If I go with
Y=a0+a1X1
Y=a0+a1X2

do I need to correct for multiple testing?

"This is not a repeated measure since you have only one DV (Y).
You have two predictors for one DV (the predictors may be dependent)."
Oh I see. I thought repeated measures meant that I have more than two IVs per individual. In my case each individual is measured on two occasions: day1 and day2.
I'm sure I'm wrong, I'm just trying to understand :)
 

obh

Active Member
#9
Hi S,

You should choose one model, the best regression model: Y=a0+a1X1 or Y=a0+a1X2 or Y=a0+a1X1+a2X2 or Y=a0+a1X1+a2X2+a3X1X2.
So there is no need to correct (unless you need to select predictors, which is a different story).
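
As a rough illustration, the four candidate models can be fit side by side on the data from post #1. Using AIC to compare them is an assumption of this sketch; the thread does not say which criterion to use, and with only six observations any comparison will be very unstable.

```r
# Data copied from post #1
dat <- data.frame(ind  = 1:6,
                  y    = c(40, 63, 23, 66, 74, 45),
                  day1 = c(4, 6, 3, 6, 1, 3),
                  day2 = c(6, 4, 7, 9, 8, 9))

# The four candidate models listed above
m1 <- lm(y ~ day1, data = dat)          # Y = a0 + a1*X1
m2 <- lm(y ~ day2, data = dat)          # Y = a0 + a1*X2
m3 <- lm(y ~ day1 + day2, data = dat)   # Y = a0 + a1*X1 + a2*X2
m4 <- lm(y ~ day1 * day2, data = dat)   # adds the X1*X2 interaction

# AIC is only one possible criterion (an assumption here); lower is better.
# The df column counts regression coefficients plus the residual variance.
cmp <- AIC(m1, m2, m3, m4)
cmp
```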

Repeated measures means that you measure, for example, the same subject more than once.
For example, blood pressure before and after taking the medicine (paired t-test).
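
A small sketch of that blood pressure case in R. The readings below are invented purely for illustration and are not from any real study.

```r
# Hypothetical blood pressure readings (mmHg) for five subjects,
# invented for illustration only
before <- c(150, 142, 160, 155, 148)
after  <- c(144, 140, 151, 150, 147)

# Paired t-test: the same outcome measured twice on each subject
res_paired <- t.test(before, after, paired = TRUE)
res_paired

# Equivalent one-sample t-test on the within-subject differences
res_diff <- t.test(before - after)
res_diff
```

Here the same quantity (blood pressure) is the outcome at both time points, which is what makes it a repeated measure; in the study-hours data, day1 and day2 are two predictors of a single outcome.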

When you have more than one predictor, this is multiple regression.
Again, there may be a dependency between day1 and day2.