Effect between 2 variables

#1
I would like to measure (and test the significance of) the effect of one variable on another while controlling for various factors. I have three questions, if possible:

1) As I understand it, if the relationship between the dependent variable and the predictors is more or less linear, multiple regression would be the best and most powerful approach. Is that right?

2) What other models can I use besides regression? Machine learning? Which ones would be best?

Also, I need to run the model in different countries, which obviously have different cultures.
3) Should I use the same model across all countries (I am not sure a country comparison would be necessary) or a country-specific model? What are the advantages and drawbacks of each?

Many thanks in advance for your responses.
 

hlsmith

Less is more. Stay pure. Stay poor.
#2
Please describe your data in more detail, including sample size.

Sounds like multiple linear regression would be a good choice. You may be able to run one model for all countries if you use multi-level multiple linear regression.
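
To make "controlling for various factors" concrete, here is a minimal sketch on synthetic data. Everything in it is an assumption for illustration: the variables x (the variable of interest), z (a control) and y (the outcome), the coefficients used to simulate them, and the small hand-written least-squares solver, which stands in for whatever statistics package would normally be used. It only shows how the estimated slope on x changes once a correlated control enters the model.

```python
import random

def solve(a, b):
    """Solve the square system a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

# Synthetic data (assumed for illustration): z affects both x and y.
random.seed(0)
n = 2000
z = [random.gauss(0, 1) for _ in range(n)]
x = [0.8 * zi + random.gauss(0, 1) for zi in z]
y = [1.0 + 0.5 * xi + 2.0 * zi + random.gauss(0, 1) for xi, zi in zip(x, z)]

naive = ols([[1.0, xi] for xi in x], y)                      # intercept + x only
adjusted = ols([[1.0, xi, zi] for xi, zi in zip(x, z)], y)   # intercept + x + z

print("slope on x without the control:", round(naive[1], 2))
print("slope on x with the control:   ", round(adjusted[1], 2))
```

In this simulated setup the slope on x is built to be 0.5, and leaving z out pushes the estimate well above that because x and z are correlated. Significance testing would additionally need standard errors, which this sketch does not compute.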
 
#3


Sure, thanks hlsmith.

The question is to assess the drivers of pay among employees and to see whether there is a gender gap. So we have demographic data such as age, education, ... and work data such as salary, tenure, ...

What would be the advantages of multi-level linear regression?
 

hlsmith

#4
MLMs (multilevel models) allow you to account for groupings like country, so you can model within-group and between-group variability while also adjusting for your control variables. Of note, they can be tricky when you first learn them.
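
As a rough illustration of "within and between group" variability, the sketch below estimates what share of the total variance sits between groups, using the classic one-way ANOVA estimator of the intraclass correlation for equal-sized groups. This is not a fitted multilevel model, only the variance split that motivates one. The function name `anova_icc`, the simulated country effects, and the group sizes are all assumptions made up for the example.

```python
import random
from statistics import mean

def anova_icc(groups):
    """One-way ANOVA estimate of the intraclass correlation (equal group sizes assumed)."""
    k = len(groups)                        # number of groups, e.g. countries
    n = len(next(iter(groups.values())))   # observations per group
    grand = mean(v for g in groups.values() for v in g)
    means = {name: mean(g) for name, g in groups.items()}
    # Mean square between groups and mean square within groups.
    msb = n * sum((m - grand) ** 2 for m in means.values()) / (k - 1)
    msw = sum((v - means[name]) ** 2
              for name, g in groups.items() for v in g) / (k * (n - 1))
    between = max((msb - msw) / n, 0.0)    # between-group variance component
    return between / (between + msw)

# Synthetic data (assumed): each country shifts everyone in it by a shared amount.
random.seed(1)
countries = {}
for c in range(40):
    country_effect = random.gauss(0, 1)
    countries[f"country_{c}"] = [10 + country_effect + random.gauss(0, 1)
                                 for _ in range(20)]

icc = anova_icc(countries)
print("estimated share of variance between countries:", round(icc, 2))
```

A share near zero would suggest that country adds little and a single pooled regression may be enough; a large share is the situation where a multilevel model (or at least country indicators) earns its complexity.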
 

noetsi

Fortran must die
#5
If you think countries vary on some dimension related to this, you can use a dummy variable for that. That won't be easy to do. I suspect such an analysis is usually run within a given country, but even regions of a country could be different.
 
#6
Many thanks.
But how would you suggest handling a variable like salary across various countries, currencies and, most of all, different costs of living? Do you think this variability would be taken into account within the model?
 

noetsi

#7
You could standardize for these. You might find that such standardizations already exist for the OECD. They likely have something like the consumer price index that permits such comparisons. If you are looking for a percentage differential between genders, however, I am not sure it really matters whether these vary by country. Males and females would have the same cost of living, for example, unless you think their purchasing choices change this. If you are concerned with pay and not purchasing power, I am not sure that even matters.
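
The point about percentage differentials can be checked with a small sketch. It assumes, for illustration only, that the gap is measured on the log scale within each country and that any currency conversion or price-level adjustment is a single factor applied to everyone in that country. The salary figures, the country labels and the factor 0.0075 are invented.

```python
import math
from statistics import mean

# Hypothetical salaries in local currency, keyed by (country, gender).
salaries = {
    ("A", "F"): [30000, 32000, 35000],
    ("A", "M"): [33000, 36000, 38000],
    ("B", "F"): [3.0e6, 3.3e6, 3.6e6],
    ("B", "M"): [3.4e6, 3.5e6, 4.0e6],
}

def log_gap(data, country, rate=1.0):
    """Mean log-salary difference (M minus F) in one country after scaling by `rate`."""
    m = mean(math.log(s * rate) for s in data[(country, "M")])
    f = mean(math.log(s * rate) for s in data[(country, "F")])
    return m - f

for country in ("A", "B"):
    local = log_gap(salaries, country)
    converted = log_gap(salaries, country, rate=0.0075)  # arbitrary common factor
    pct = 100 * (math.exp(local) - 1)  # gap between geometric means, in percent
    print(country, round(local, 4), round(converted, 4), f"{pct:.1f}%")
```

Because log(rate * salary) = log(rate) + log(salary), the common factor cancels in the difference, so the within-country gap is the same before and after conversion. That cancellation no longer holds if the adjustment differs between the two groups, or if raw salaries from different countries are pooled without a country term.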
 
#8
Many thanks.
I would tend to agree, since the comparison would be conducted within the same context.
 
#9
Actually, I believe MLM is one of the advantages of statistical modelling over machine learning. But would you have a view on statistical models such as regression versus machine learning models such as decision trees or random forests, which are quite popular for explaining and predicting dependent variables?
 

noetsi

#10
I personally have only worked very briefly in machine learning, and then only in a university course (I am a data analyst, not a statistician, and largely self-taught at that, so I am cautious about giving such advice). They are so different that I am not sure you can compare them in any case. These are such massive fields that I doubt many people have the expertise to compare them, except to say that they tend to do very different things. The machine learning I have seen focuses on optimization, while statistics focuses more on how variables relate. From personal experience, I have found statistics has a hard time determining relative impact; it's very painful to try to do this.
 

hlsmith

#11
Machine learning algorithms are great at prediction, but don't always provide confidence intervals. Your description doesn't sound like a prediction problem. Also, trees use recursive partitioning, which translates into many interaction terms, and it does not seem like you want that either. You should read the book An Introduction to Statistical Learning if you are interested in a simple introduction to ML.
 
#13
Many thanks to all for your responses. I'm a data scientist and therefore use ML (machine learning) quite a lot.

Indeed, ML tends to be used more for prediction; however, significance, and therefore the level of impact, can still be assessed and tested, as with p-values.

Furthermore, models like decision trees can very easily capture non-linear relationships, where they seem to outperform regression models. I would also think the residuals can be tested in the same way as for regressions, using a t-test if they are normally distributed or a Kolmogorov-Smirnov test otherwise.
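
A toy sketch of the non-linearity point: on a step-shaped relationship, a single tree split (a "stump", which is one step of the recursive partitioning mentioned earlier in the thread) fits the data while a straight line cannot. The data are invented and deliberately favour the split, and the comparison is in-sample only, so this says nothing about which approach wins in general.

```python
from statistics import mean

def fit_line(x, y):
    """Simple least-squares line; returns a prediction function."""
    mx, my = mean(x), mean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    intercept = my - slope * mx
    return lambda v: intercept + slope * v

def fit_stump(x, y):
    """Depth-1 regression tree: one threshold, one mean on each side."""
    pairs = sorted(zip(x, y))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical x values
        left = [b for _, b in pairs[:i]]
        right = [b for _, b in pairs[i:]]
        ml, mr = mean(left), mean(right)
        sse = sum((b - ml) ** 2 for b in left) + sum((b - mr) ** 2 for b in right)
        if best is None or sse < best[0]:
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (sse, threshold, ml, mr)
    _, threshold, ml, mr = best
    return lambda v: ml if v <= threshold else mr

def mse(predict, x, y):
    """Mean squared error of a prediction function on the given data."""
    return mean((predict(a) - b) ** 2 for a, b in zip(x, y))

# Hypothetical step-shaped relationship: the outcome jumps once x reaches 10.
x = list(range(21))
y = [20.0 if a < 10 else 30.0 for a in x]

line_mse = mse(fit_line(x, y), x, y)
stump_mse = mse(fit_stump(x, y), x, y)
print("straight line MSE:", round(line_mse, 3))
print("single split MSE: ", round(stump_mse, 3))
```

On a relationship that really is close to linear the ranking would flip, which is consistent with the first question in this thread.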

Those would be my thoughts...
Thanks again, all. That was awesome!!!