Multiple Linear Regression: to split the data or not

#1
Hi all,

I'm currently modelling running performance using multiple linear regression. The inputs include GENDER and AGE, amongst others, and the target is RACE_TIME.
I've partitioned the data into training and test sets for cross-validation purposes. I've tried two approaches: 1) fitting one model on the entire data set, and 2) splitting the data by gender (i.e. males on one side and females on the other) and fitting two separate models. When comparing the sum of squared errors (SSE) on the test data, I'm observing a considerable improvement with the two-model approach over the one-model approach.
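
For anyone who wants to reproduce this kind of comparison, here is a minimal sketch on synthetic data. The data-generating numbers, the 0/1 gender coding, and the helper names `fit` and `predict` are all assumptions made up for illustration, not the actual data or code behind the post. The pooled model here uses GENDER and AGE as main effects only, which is one plausible reading of "one model".

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (hypothetical): the age slope differs between the two groups.
n = 400
gender = rng.integers(0, 2, size=n)            # 0/1 coding is arbitrary
age = rng.uniform(20, 60, size=n)
race_time = np.where(gender == 1, 40 + 0.2 * age, 45 + 0.5 * age)
race_time = race_time + rng.normal(0, 2, size=n)

# Random train/test partition (300 training rows, 100 test rows).
idx = rng.permutation(n)
train, test = idx[:300], idx[300:]

def fit(X, y):
    """Ordinary least squares with an intercept; returns the coefficients."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

def predict(beta, X):
    """Predictions for the rows of X from coefficients returned by fit()."""
    A = np.column_stack([np.ones(len(X)), X])
    return A @ beta

# Approach 1: one pooled model, GENDER and AGE as main effects.
X_all = np.column_stack([gender, age])
beta_pooled = fit(X_all[train], race_time[train])
sse_pooled = np.sum((race_time[test] - predict(beta_pooled, X_all[test])) ** 2)

# Approach 2: one model per gender; test SSE is summed over both groups.
sse_split = 0.0
for g in (0, 1):
    tr = train[gender[train] == g]
    te = test[gender[test] == g]
    beta_g = fit(age[tr, None], race_time[tr])
    sse_split += np.sum((race_time[te] - predict(beta_g, age[te, None])) ** 2)

print(f"pooled test SSE: {sse_pooled:.1f}, split test SSE: {sse_split:.1f}")
```

In this made-up setup the split models win because the pooled model forces a common age slope on both groups. Whether that is what drives the improvement in the real data is not something the sketch can tell you.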

I wondered what are your views in general on splitting the data into groups and modelling separately vs modelling it all in one model? Can you see any advantages or disadvantages? Are there any pitfalls that I should bear in mind?

Thanks in advance
Rob
 

Karabiner

TS Contributor
#3
I wondered what are your views in general on splitting the data into groups and modelling separately vs modelling it all in one model? Can you see any advantages or disadvantages? Are there any pitfalls that I should bear in mind?
I can see no clear reason for splitting the sample if your purpose is to analyse subgroup effects. Differences between the models (in SSE or any other measure) have to be tested before you can make valid inferences about group effects, and such a test is easier to perform within one model fitted to all the data (Dason already suggested what to do).
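
One common way to run such a test within a single model is to add gender-by-predictor interaction terms and compare that model against the main-effects model with an F-test for nested models. This is only a sketch of that general technique on synthetic data; it is an assumption about what is meant here, since the earlier suggestion is not shown in the thread, and the helper name `sse` and all the numbers are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data (hypothetical): the age slope depends on gender.
n = 300
gender = rng.integers(0, 2, size=n).astype(float)
age = rng.uniform(20, 60, size=n)
race_time = (45 + 0.5 * age - 5 * gender - 0.3 * gender * age
             + rng.normal(0, 2, size=n))

def sse(A, y):
    """Residual sum of squares of the least-squares fit of y on design matrix A."""
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

ones = np.ones(n)
reduced = np.column_stack([ones, gender, age])              # common age slope
full = np.column_stack([ones, gender, age, gender * age])   # slope varies by gender

sse_reduced = sse(reduced, race_time)
sse_full = sse(full, race_time)

# F-test for nested models: does the interaction term reduce SSE by more
# than chance would explain?
q = full.shape[1] - reduced.shape[1]     # number of extra parameters
df_resid = n - full.shape[1]             # residual degrees of freedom, full model
f_stat = ((sse_reduced - sse_full) / q) / (sse_full / df_resid)

print(f"F({q}, {df_resid}) = {f_stat:.1f}")
```

The statistic is compared against an F distribution with `q` and `df_resid` degrees of freedom; if SciPy is available, `scipy.stats.f.sf(f_stat, q, df_resid)` gives the p-value. A model in which gender interacts with every predictor reproduces the fitted values of the two separate per-gender regressions, so nothing is lost by keeping the data together.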

With kind regards

Karabiner
 

noetsi

Fortran must die
#4
In theory, it's valid to split the data into a training set, where you estimate the parameters, and a second set, where you check whether those estimates hold up. I have rarely seen that done, in part, I suspect, because many studies have limited data (and it's more work than most want to do). How you choose which observations go into which set is something I have not seen addressed.
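
On choosing which rows go where: one common default is simply to assign rows at random, so that neither set is systematically different from the other. A minimal sketch of that, where the function name `train_test_split_indices`, the 25% test fraction, and the seed are all illustrative assumptions:

```python
import numpy as np

def train_test_split_indices(n_rows, test_fraction=0.25, seed=0):
    """Randomly partition row indices 0..n_rows-1 into train and test sets."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_rows)            # random order of all row indices
    n_test = int(round(n_rows * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]   # (train, test)

train_idx, test_idx = train_test_split_indices(200, test_fraction=0.25, seed=42)
print(len(train_idx), len(test_idx))
```

Fixing the seed makes the partition reproducible. With limited data, repeating the split several times (or using k-fold cross-validation) makes the comparison less dependent on one lucky or unlucky partition.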