Question regarding grouping several variables

#1
Hello!
In the context of regression, in what situations am I allowed to group separate variables into a single categorical predictor? I'll use R code as an example since it's what I'm most familiar with.
For example, I could run a model like this:
Code:
dat <- data.frame(ind=c(1,2,3,4,5,6), y=c(40,63,23,66,74,45), day1=c(4,6,3,6,1,3), day2=c(6,4,7,9,8,9))
dat
  ind  y day1 day2
1   1 40    4    6
2   2 63    6    4
3   3 23    3    7
4   4 66    6    9
5   5 74    1    8
6   6 45    3    9

mod <- lm(y ~ day1 + day2 + day1:day2, data=dat)
summary(mod)
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -133.873    230.464  -0.581    0.620
day1          33.121     41.781   0.793    0.511
day2          23.062     29.095   0.793    0.511
day1:day2     -4.046      5.289  -0.765    0.524
But I could also reshape dat this way:
Code:
library(reshape2)  # assumption: melt() here comes from the reshape2 package
dat2 <- melt(dat, id.vars = c("ind","y"))
dat2
   ind  y variable value
1    1 40     day1     4
2    2 63     day1     6
3    3 23     day1     3
4    4 66     day1     6
5    5 74     day1     1
6    6 45     day1     3
7    1 40     day2     6
8    2 63     day2     4
9    3 23     day2     7
10   4 66     day2     9
11   5 74     day2     8
12   6 45     day2     9

mod2 <- lm(y ~ variable + value + variable:value, data=dat2)

summary(mod2)
                   Estimate Std. Error t value Pr(>|t|)
(Intercept)         47.7965    20.7469   2.304   0.0502 .
variableday2        -1.7345    41.7837  -0.042   0.9679
value                1.0531     4.9129   0.214   0.8356
variableday2:value  -0.2478     6.9479  -0.036   0.9724
The two approaches give very different results, and I'm not sure if one is considered more valid than the other.

Thanks!

obh

Active Member
#2
Can you give a simple non-R example? Or paste a small, runnable R example, including the data, inside a block that starts with [CODE] and ends with [/CODE].
 

hlsmith

Not a robit
#3
Two questions:

1.) What is the purpose of the model? What model are you trying to run? Mod based on dat has two main effects (day1 and day2) plus their two-way interaction (day1:day2).

2.) Look at which coefficients each model generates; that will help you work out which model is actually being run. If the two fits produce different output, then they are likely different models.
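
One way to check which model is being run is to compare the design matrices R builds for the two formulas. This is only a sketch: the long table is rebuilt with base R (an assumption, to avoid depending on a reshaping package), but it has the same layout as dat2 in the first post.

```r
# Data from post #1
dat <- data.frame(ind  = 1:6,
                  y    = c(40, 63, 23, 66, 74, 45),
                  day1 = c(4, 6, 3, 6, 1, 3),
                  day2 = c(6, 4, 7, 9, 8, 9))

# Long format built by hand (same layout as the melted dat2)
dat2 <- data.frame(ind      = rep(dat$ind, 2),
                   y        = rep(dat$y, 2),
                   variable = factor(rep(c("day1", "day2"), each = 6)),
                   value    = c(dat$day1, dat$day2))

# Design matrices: the columns lm() actually regresses y on
X_wide <- model.matrix(y ~ day1 + day2 + day1:day2, data = dat)
X_long <- model.matrix(y ~ variable + value + variable:value, data = dat2)

dim(X_wide)       # 6 rows, 4 columns
dim(X_long)       # 12 rows, 4 columns
colnames(X_wide)  # "(Intercept)" "day1" "day2" "day1:day2"
colnames(X_long)  # "(Intercept)" "variableday2" "value" "variableday2:value"
```

The wide model has one row per individual and a day1 * day2 product column. The long model has two rows per individual, and its interaction column is a day2 indicator times value, so the two fits estimate different quantities.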
 
#5
Hi Stat20,

The second model is incorrect, even if you ignore the interaction X1X2.

I will take one row as an example:

In the first model:

X1  X2  X1X2  Y
 4   6    24  40

In the second model:

X1  X2  X1X2  Y
 4   0     0  40
 0   6     0  40


Why would you expect to get the same answer?
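
To make the point concrete, here is a rough sketch using the data from post #1. It suggests that the long-format fit is just two separate simple regressions (y on day1 and y on day2) stacked together, with each y value used twice. The names fit_day1, fit_day2, line_day1 and line_day2 are only illustrative, and the long table is rebuilt with base R instead of melt().

```r
dat <- data.frame(ind  = 1:6,
                  y    = c(40, 63, 23, 66, 74, 45),
                  day1 = c(4, 6, 3, 6, 1, 3),
                  day2 = c(6, 4, 7, 9, 8, 9))

# Long format: every individual (and every y) appears twice
dat2 <- data.frame(ind      = rep(dat$ind, 2),
                   y        = rep(dat$y, 2),
                   variable = factor(rep(c("day1", "day2"), each = 6)),
                   value    = c(dat$day1, dat$day2))

mod2     <- lm(y ~ variable + value + variable:value, data = dat2)
fit_day1 <- lm(y ~ day1, data = dat)   # y on day1 alone
fit_day2 <- lm(y ~ day2, data = dat)   # y on day2 alone

b <- coef(mod2)

# Intercept and slope implied by mod2 for each level of 'variable'
line_day1 <- c(b[["(Intercept)"]], b[["value"]])
line_day2 <- c(b[["(Intercept)"]] + b[["variableday2"]],
               b[["value"]] + b[["variableday2:value"]])

all.equal(line_day1, unname(coef(fit_day1)))  # TRUE
all.equal(line_day2, unname(coef(fit_day2)))  # TRUE

# Residual degrees of freedom: 12 rows - 4 coefficients
df.residual(mod2)  # 8
```

So mod2 never models day1 and day2 jointly. It also treats the 12 rows as independent observations even though there are only 6 individuals, which is why its standard errors are not comparable with those of the first model.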