Verifying that this Regression Analysis Model is conceptually correct

#1
Hi,

I'm analyzing some data about computer proficiency (dependent variable, value between 100-400) having the following independent variables from a questionnaire:
- country
- age
- gender
- likes learning new things (score 1-5)
- parent education level (1-3)

I'm wondering whether I can just put this into a multiple linear regression model, as in the R example below:
Code:
lm(computerProficiency ~ AGE + GENDER + PARENTS_EDUCATION + COUNTRY + ENJOY_LEARNING, data = dataSet)
My doubt stems from struggling to interpret the following results:

Code:
Residuals:
    Min      1Q  Median      3Q     Max
-207.37  -24.30    1.21   25.25  177.07

Coefficients:
               Estimate Std. Error t value Pr(>|t|)  
(Intercept)    246.7276     1.7096  144.32  < 2e-16 ***
AGE             -0.5577     0.0182  -30.70  < 2e-16 ***
GENDER_Male      4.8323     0.4301   11.24  < 2e-16 ***
PARENTS_ED      18.4795     0.3721   49.67  < 2e-16 ***
ENJ_LEARNING     7.3347     0.2734   26.83  < 2e-16 ***
CNTRY_UK        -9.2826     0.7207  -12.88  < 2e-16 ***
CNTRY_JPN        3.9170     0.8987    4.36  1.3e-05 ***
CNTRY_POL      -56.4754     0.8575  -65.86  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 37 on 30115 degrees of freedom
  (15909 observations deleted due to missingness)
Multiple R-squared:  0.308,    Adjusted R-squared:  0.308
F-statistic: 1.34e+03 on 10 and 30115 DF,  p-value: <2e-16
Is the way this analysis is set up conceptually correct? Or is there something fundamentally wrong with it?
 

Miner

TS Contributor
#2
I cannot comment on the study design, as you did not provide details. However, I can comment on the implied results and ask relevant questions.

These results imply that computer proficiency:
  • declines with age (3rd strongest effect - ballpark estimate based on t value)
  • is moderately stronger with males
  • is stronger if the parents are educated (2nd strongest effect)
  • is stronger for ENJ learning (Engineering?)
  • is slightly stronger for Japan
  • is moderately weaker for the UK
  • is strongly weaker for Poland (strongest effect)
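The ranking in the list above can be reproduced from the reported t values. The sketch below copies the coefficient table from post #1 into a data frame (the names `coefs` and `ranked` are made up for illustration) and sorts by absolute t value. This is only a ballpark ordering: a t value mixes effect size with the precision of the estimate, and the predictors are on different scales.

```r
# Coefficient table copied from the summary output in post #1
coefs <- data.frame(
  term     = c("AGE", "GENDER_Male", "PARENTS_ED", "ENJ_LEARNING",
               "CNTRY_UK", "CNTRY_JPN", "CNTRY_POL"),
  estimate = c(-0.5577, 4.8323, 18.4795, 7.3347, -9.2826, 3.9170, -56.4754),
  t_value  = c(-30.70, 11.24, 49.67, 26.83, -12.88, 4.36, -65.86),
  stringsAsFactors = FALSE
)

# Rank terms by |t|, largest first
ranked <- coefs[order(-abs(coefs$t_value)), ]
print(ranked, row.names = FALSE)
```

The first three rows are CNTRY_POL, PARENTS_ED and AGE, which matches the "strongest", "2nd strongest" and "3rd strongest" labels in the list.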
The question that you must answer is whether these implications are logical. For example, are there curriculum, infrastructure, access, etc. differences between countries that might explain those differences?

What was your hypothesis when setting up this study? What did you expect to see?
 
#3
Thank you very much, this helps greatly. The hypothesis was pretty stereotypical: I expected young males from an educated family to have high computer proficiency/literacy, especially when compared to, e.g., women from 3rd world countries (I omitted some countries from this post to keep it shorter).
Based on these results, I think the initial assumption is not refuted, so it makes sense to continue investigating the data, especially with regard to the differences between countries, to look for reasons that could explain this disparity.
What still confuses me, since I could not find a definitive answer (maybe there is none?): are the values for multiple R-squared and adjusted R-squared sufficient to even include this regression model as noteworthy in my analysis?
 

Miner

TS Contributor
#4
"Are the values for multiple R-squared and adjusted R-squared sufficient to even include this regression model as noteworthy in my analysis?"
This depends on how you intend to use the results. If you are developing a model to be used in making predictions, probably not. If you are interested solely in directionality and approximate strength of the effects, probably yes. You have an extremely large sample size, so these are probably real effects. The reason that the R^2 (adj) is so low is that you have a lot of variation across individuals. In other words, on average people in Japan are more proficient than people in Poland. However, on an individual basis, you may find many individuals in Poland that are more proficient than many individuals in Japan.

See illustration for concept.
[Attached image: 1615404682473.png]
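The same idea can be sketched with simulated data. Everything below is hypothetical and is not the poster's data set: two groups whose means differ by 60 points, with an individual spread of 37 points, loosely echoing the JPN vs. POL gap and the residual standard error reported in post #1.

```r
set.seed(1)
n <- 5000  # hypothetical group size

# Simulated data only: group means 250 (JPN) and 190 (POL), sd 37
country <- factor(rep(c("JPN", "POL"), each = n))
score <- ifelse(country == "JPN", 250, 190) + rnorm(2 * n, sd = 37)

fit <- lm(score ~ country)
s <- summary(fit)

s$coefficients  # the countryPOL effect is estimated very precisely
s$r.squared     # yet much of the variance remains unexplained

# Share of simulated POL individuals who score above the JPN average
share_above <- mean(score[country == "POL"] > mean(score[country == "JPN"]))
share_above
```

In this sketch the group difference is unmistakable, while a noticeable share of individuals in the lower group still outscores the average of the higher group. That is the pattern a low R^2 with highly significant coefficients describes.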
 

noetsi

Fortran must die
#5
There is no agreement on this topic, but some feel that variables like this can generate nonsensical results (one of my professors pointed this out to me when I raised the issue during my Master's defense).

ENJOY_LEARNING is a Likert-scale predictor with fewer than 7 levels. It is not interval, and you did not choose to make it a set of dummy variables.
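A minimal sketch of the two options, using simulated data (the variable names `enjoy`, `score` and `step_means` are made up, and the unequal steps between levels are an assumption chosen to make the difference visible). Entering the score as a number fits one slope and assumes equal steps; wrapping it in `factor()` makes R create one dummy variable per non-baseline level.

```r
set.seed(2)
n <- 2000

# Simulated 1-5 Likert item with deliberately unequal steps between levels
enjoy <- sample(1:5, n, replace = TRUE)
step_means <- c(0, 2, 4, 15, 30)
score <- 250 + step_means[enjoy] + rnorm(n, sd = 10)
d <- data.frame(score = score, enjoy = enjoy)

fit_numeric <- lm(score ~ enjoy, data = d)          # one slope, equal steps assumed
fit_dummy   <- lm(score ~ factor(enjoy), data = d)  # four dummies, baseline = level 1

# The numeric model is nested in the dummy model, so an F test can
# check whether freeing the steps improves the fit
cmp <- anova(fit_numeric, fit_dummy)
cmp
```

If the F test shows no meaningful improvement on the real data, treating the item as numeric is easier to defend; if it does, the dummy coding is the safer choice.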
 
#6
"The reason that the R^2 (adj) is so low is that you have a lot of variation across individuals. In other words, on average people in Japan are more proficient than people in Poland. However, on an individual basis, you may find many individuals in Poland that are more proficient than many individuals in Japan."
@Miner
I see, that makes sense! Thanks for clarifying how this rather low R^2 came about. Also, this:
"However, on an individual basis, you may find many individuals in Poland that are more proficient than many individuals in Japan."
is true and can be seen in the data. Again, thank you! Yes, I won't use this for predictions, but rather to find some general tendencies which can then be investigated further.

@noetsi
Exactly the kind of input I was looking for, thank you! I will research this and either leave it as is and make an argument for it, or introduce dummy variables for ENJOY_LEARNING. I'm leaning towards the latter.

I have one last question. In the results you can see CNTRY_UK, CNTRY_POL and CNTRY_JPN. There is actually an additional country in the model (CNTRY_USA) which is not shown as a coefficient. This is totally expected, right, because it acts as a kind of "baseline"? So when interpreting the country results, a coefficient of -9.3 for the UK has to be read relative to a default where the country is the US? And can a comparison between UK and JPN then also be made?
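A sketch of how the baseline works in R, on simulated data (the data frame `d`, the `shift` values and the column `country_jpn` are hypothetical). By default `lm()` uses the first factor level as the reference and reports every other level relative to it. Under that reading, the coefficients in post #1 would put UK versus JPN at roughly -9.28 - 3.92 = -13.2 points; refitting with a different reference level gives the same difference together with its standard error.

```r
set.seed(3)
n <- 4000

# Simulated data only; the shifts loosely echo the coefficients in post #1
country <- factor(sample(c("USA", "UK", "JPN", "POL"), n, replace = TRUE))
shift <- c(USA = 0, UK = -9, JPN = 4, POL = -56)
score <- 250 + unname(shift[as.character(country)]) + rnorm(n, sd = 37)
d <- data.frame(score = score, country = country)

# Levels sort alphabetically (JPN first), so set the baseline explicitly
d$country <- relevel(d$country, ref = "USA")
fit_usa <- lm(score ~ country, data = d)
coef(fit_usa)  # countryJPN, countryPOL, countryUK are all relative to USA

# UK vs. JPN is the difference of the two coefficients
uk_minus_jpn <- unname(coef(fit_usa)["countryUK"] - coef(fit_usa)["countryJPN"])

# Refitting with JPN as baseline yields the same number plus a standard error
d$country_jpn <- relevel(d$country, ref = "JPN")
fit_jpn <- lm(score ~ country_jpn, data = d)
summary(fit_jpn)$coefficients["country_jpnUK", ]
```

Changing the reference level does not change the fitted model, only which comparisons appear directly in the coefficient table.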
 

noetsi

Fortran must die
#8
Note there is disagreement about whether you should use Likert data as a predictor. Essentially, I believe the regression treats it as interval when you do this, and it's questionable whether Likert data is interval, particularly with fewer than 7 distinct levels.
 

Miner

TS Contributor
#9
"Note there is disagreement about whether you should use likert data as a predictor. Essentially the regression treats it as interval I believe when you do this and its questionable, particularly with less than 7 distinct levels, if likert is interval."
I would be more concerned if the model were used to make predictions, but since it is concerned with directionality and possible areas for further research, I think it is a non-issue.