HR Salary analysis


I need help analysing a dataset of individual basic salaries.
I will be using Excel.

Basically I have one line per individual, with their salary and a set of other variables such as age, gender and nationality, so different types of variable, but all measured at the same point in time.

What method should I use to understand which/if any of these variables explain having a higher than average salary?

Later, I was also thinking that it could be interesting to try and understand whether there is some correlation between some of the above variables and being promoted (binary) and having a higher than average salary increase (continuous).

Many thanks for any help you can provide me.



TS Contributor
You could have a look at multiple linear regression for the prediction of salary,
and you could have a look at multiple logistic regression for the prediction of
promotion. How well these methods will perform depends on your sample size
and how well you thought about your predictive model beforehand (e.g. whether
you reasonably expect interactions between predictors, or whether there could
be non-linear relationships, for example whether the influence of age changes
with higher age). I do not know whether any of these can be performed in Excel, or
whether you will need dedicated statistics software.
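
To make the logistic regression suggestion a bit more concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the data are simulated, the predictors (age in decades around 40, and a bonus flag) and the "true" coefficients are invented, and `solve` and `fit_logistic` are hypothetical helper names. A statistics package would normally do the fitting and would also report standard errors.

```python
import math
import random

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, iterations=10):
    """Maximum-likelihood logistic regression fitted with Newton-Raphson steps."""
    k = len(X[0])
    beta = [0.0] * k
    for _ in range(iterations):
        grad = [0.0] * k                      # gradient of the log-likelihood
        info = [[0.0] * k for _ in range(k)]  # X'WX, the negative Hessian
        for row, yi in zip(X, y):
            p = sigmoid(sum(b * x for b, x in zip(beta, row)))
            w = p * (1.0 - p)
            for i in range(k):
                grad[i] += (yi - p) * row[i]
                for j in range(k):
                    info[i][j] += w * row[i] * row[j]
        step = solve(info, grad)
        beta = [b + s for b, s in zip(beta, step)]
    return beta

# Simulated data (assumption): promoted yes/no depends on age and a bonus flag.
random.seed(1)
TRUE_BETA = [-1.0, 0.8, 0.5]  # intercept, age in decades around 40, bonus (0/1)
X, y = [], []
for _ in range(4000):
    age_decades = (random.uniform(20, 60) - 40) / 10
    bonus = 1.0 if random.random() < 0.3 else 0.0
    row = [1.0, age_decades, bonus]
    p_promoted = sigmoid(sum(b * x for b, x in zip(TRUE_BETA, row)))
    X.append(row)
    y.append(1.0 if random.random() < p_promoted else 0.0)

beta_hat = fit_logistic(X, y)
odds_ratios = [math.exp(b) for b in beta_hat[1:]]  # multiplicative effect on the odds
print("coefficients:", [round(b, 3) for b in beta_hat])
print("odds ratios (age per decade, bonus):", [round(o, 2) for o in odds_ratios])
```

The exponentiated coefficients are odds ratios: a value above 1 means the odds of promotion go up with that predictor, holding the others fixed.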

Depending on how strongly the salaries are skewed, it could be useful to
take the logarithm of "salary" first and use it as the dependent variable in linear
regression.
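
A minimal sketch of that idea in plain Python, under loud assumptions: the data are simulated, the coefficients are invented, and `solve`, `ols` and `r_squared` are hypothetical helper names. It regresses log(salary) on age, age squared (so that the influence of age can change with higher age) and a 0/1 gender variable.

```python
import math
import random

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) beta = X'y."""
    k = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)

def r_squared(X, y, beta):
    """Share of the variance of y reproduced by the fitted values."""
    mean_y = sum(y) / len(y)
    fitted = [sum(b * x for b, x in zip(beta, row)) for row in X]
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

# Simulated, right-skewed salaries (assumption): log-salary is linear in the predictors.
random.seed(0)
TRUE_BETA = [10.0, 0.2, -0.05, 0.1]  # intercept, age, age squared, gender
X, salary = [], []
for _ in range(2000):
    a = (random.uniform(20, 60) - 40) / 10      # age in decades, centred at 40
    gender = float(random.random() < 0.5)
    row = [1.0, a, a * a, gender]
    log_salary = sum(b * x for b, x in zip(TRUE_BETA, row)) + random.gauss(0.0, 0.2)
    X.append(row)
    salary.append(math.exp(log_salary))

y = [math.log(s) for s in salary]               # the log transform suggested above
beta_hat = ols(X, y)
r2 = r_squared(X, y, beta_hat)
print("coefficients:", [round(b, 3) for b in beta_hat])
print("R squared:", round(r2, 3))
```

With a log-transformed outcome, a coefficient b on a 0/1 predictor corresponds to a relative difference of exp(b) - 1 in salary, so a value near 0.1 means roughly 10% higher salary.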

With kind regards

Thanks Karabiner.

If you do not mind, let's do this one at a time as I am not very good at this yet.

So, let's start with the multiple linear regression for the prediction of salary, although, in fact, I meant to say % salary increase.
My Y will be the average salary increase.
My X will be the rest of my variables, in this case:
- Gender (0,1)
- Age
- Contract type (0,1)
- Nationality (1,2,3,4,5,6,7,etc...)
- Level (1,2,3,4,5)
- Department (1,2,3,4,5,6,etc...)
- Bonus (1,0)

I have tried running it with just gender, age, contract type and bonus, to better understand how to do it.
All their p-values are very low. Is this normal? Does it mean that they are all significant?
On the other hand, my R2 is around 12%, so only 12% of the variance is being explained by my variables. This seems very, very low?

Can I also add my nominal variables here? Will it make sense?

Btw, the sample size is about 9000; I am using 3 different points in time.



TS Contributor
The p value is used to decide whether to reject the hypothesis that the coefficient is exactly zero in the population from which your sample was drawn. With a huge sample size such as yours, this hypothesis will mostly be rejected very easily.
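
A small worked example of that point, as a sketch under stated assumptions: a 0/1 predictor splits the sample into two equal groups whose means differ by a tenth of a standard deviation, and the p-value uses the normal approximation. The function names are hypothetical.

```python
import math

def two_group_p_value(effect_size, n_per_group):
    """Two-sided p-value (normal approximation) for a standardised mean
    difference between two equal-sized groups."""
    standard_error = math.sqrt(2.0 / n_per_group)
    z = effect_size / standard_error
    return math.erfc(abs(z) / math.sqrt(2.0))

def variance_explained(effect_size):
    """Share of variance explained by a 0/1 predictor that splits the sample in half."""
    return effect_size ** 2 / (effect_size ** 2 + 4.0)

EFFECT = 0.1  # assumed difference between the groups, in standard deviations

p_small = two_group_p_value(EFFECT, n_per_group=45)     # 90 observations in total
p_large = two_group_p_value(EFFECT, n_per_group=4500)   # 9000 observations in total

print("p-value with n = 90:  ", round(p_small, 3))
print("p-value with n = 9000:", p_large)
print("variance explained:   ", round(variance_explained(EFFECT), 4))
```

The effect and the share of variance it explains are identical in both cases. Only the sample size changes, and with it the p-value, which is why a very low p-value and a low R2 can appear together.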

Whether 12% variance explained is a small value cannot be generally answered. It depends on what you expected and/or what you find useful. You could perform some diagnostics to check whether there is a non-linear relationship between what your model predicts and what was actually observed.
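
One simple diagnostic of this kind, sketched in plain Python on simulated data (the curved relationship, the predictor and the helper names are all assumptions for illustration): fit a straight line, sort the cases by predicted value, and look at the average residual within a few bins. If the model were adequate, the bin averages would hover around zero. A systematic pattern suggests a non-linear relationship.

```python
import random

def fit_line(x, y):
    """Simple least-squares line; returns (intercept, slope)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def binned_residual_means(predicted, observed, bins=5):
    """Average residual (observed minus predicted) in bins of increasing predicted value."""
    pairs = sorted(zip(predicted, observed))
    size = len(pairs) // bins
    means = []
    for b in range(bins):
        start = b * size
        end = len(pairs) if b == bins - 1 else start + size
        chunk = pairs[start:end]
        means.append(sum(o - p for p, o in chunk) / len(chunk))
    return means

# Simulated data (assumption): the true relationship is curved, the model is a line.
random.seed(2)
x = [random.uniform(0, 4) for _ in range(1000)]
y = [1.0 + 0.5 * xi ** 2 + random.gauss(0.0, 0.3) for xi in x]

intercept, slope = fit_line(x, y)
predicted = [intercept + slope * xi for xi in x]
means = binned_residual_means(predicted, y)
print("mean residual per bin:", [round(m, 2) for m in means])
```

Here the lowest and highest bins are under-predicted and the middle ones over-predicted, the typical signature of a missing squared term.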

I don't know what you mean by 3 different points in time. Do you actually have 3000 subjects, each measured 3 times?
With kind regards


TS Contributor
You should also consider interactions between variables, as well as other factors such as length of time with this organization. In many organizations, people who switch jobs, whether between organizations or within one, increase their earnings faster than people who stay in the same job at the same organization for a long period of time. People who take on challenging assignments get larger increases. Other factors can include where they are in their pay bracket.

Thank you both.

@Karabiner, indeed let’s say 3000 individuals over 3 different years. Does that matter?
I will try to look into the non linear relationships.

@Miner, good points on the position in the salary brackets and seniority. I will try to add them too. For the job changes, I won’t have that info.

For the nominal variables, should I create binary (dummy) variables from them?

Thanks again!


TS Contributor
Of course it matters. There is also a difference between 1000 subjects measured once and 1 subject measured 1000 times. You do not treat observations from the same subject as if they were independent of each other.
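
To illustrate why, here is a sketch on simulated data (the numbers, the 300 people by 3 years layout, and the variable names are assumptions). Each person has their own stable level, so their three yearly values resemble each other. Treating all rows as independent gives a standard error for the overall mean that is too small compared with one based on one average per person.

```python
import math
import random
import statistics

# Simulated data (assumption): yearly % salary increase with a stable person-level component.
random.seed(3)
N_PEOPLE, N_YEARS = 300, 3
records = []
for person in range(N_PEOPLE):
    person_effect = random.gauss(0.0, 1.0)        # shared by all rows of this person
    for _ in range(N_YEARS):
        records.append((person, 3.0 + person_effect + random.gauss(0.0, 0.5)))

# Naive approach: pretend all 900 rows are independent observations.
values = [v for _, v in records]
naive_se = statistics.stdev(values) / math.sqrt(len(values))

# Respecting the grouping: average within each person first, then use 300 units.
by_person = {}
for person, v in records:
    by_person.setdefault(person, []).append(v)
person_means = [statistics.mean(vs) for vs in by_person.values()]
cluster_se = statistics.stdev(person_means) / math.sqrt(len(person_means))

print("naive standard error:     ", round(naive_se, 4))
print("per-person standard error:", round(cluster_se, 4))
```

Averaging per person is only the crudest way to respect the dependence. Models for repeated measures (for example mixed models with a random effect per person) use all rows while still accounting for it.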

With kind regards