# Multiple linear regression when unsure if independent variables are truly independent

#### MedStudentUK

##### New Member
Hello,

I am undertaking a project analysing insulin pump data from patients with Type 1 diabetes. My dataset covers 2015-2019. For each patient I calculate the daily mean or variance of many variables, then average these over the whole period. I am trying to determine which behaviours are associated with improved HbA1c, the proportion of blood glucose readings "in range" (3.9-10 mmol/L) and the proportion of readings that are hypoglycaemic (<3.0 mmol/L). The sample size is 556 patients. I am using R.

I have 20 input variables. When I fit a single-predictor linear model for each one, most of them give statistically significant results.
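
As an illustration only, the single-predictor step could be sketched as below. The data are simulated and the names `dat`, `x1`, `x2`, `x3` and `y` are hypothetical placeholders, not the variables in the dataset. The Holm adjustment is an added assumption about how one might account for fitting many separate models.

```r
# Simulated stand-in for the real data; all names here are hypothetical.
set.seed(1)
n <- 556
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- 60 - 2 * dat$x1 + 1 * dat$x2 + rnorm(n, sd = 10)

predictors <- setdiff(names(dat), "y")

# Fit y ~ predictor once per predictor and keep the slope row
# of each coefficient table.
univariate <- do.call(rbind, lapply(predictors, function(p) {
  fit <- lm(reformulate(p, response = "y"), data = dat)
  s <- summary(fit)
  data.frame(
    predictor = p,
    estimate  = coef(s)[p, "Estimate"],
    p_value   = coef(s)[p, "Pr(>|t|)"],
    adj_r2    = s$adj.r.squared
  )
}))

# Adjust the p-values for the number of models fitted (Holm's method).
univariate$p_holm <- p.adjust(univariate$p_value, method = "holm")
print(univariate)
```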

I now want to see which of them are independent predictors of success. I ran a multiple-predictor linear model using all the variables, and it produced a model with an adjusted R-squared of 0.239. However, some of the variables are clearly not independent of each other. For example, total bolus insulin, total basal insulin and total insulin were all retained in the final minimal model chosen by the step() function.
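
One way to quantify how strongly predictors overlap is a correlation matrix plus variance inflation factors (VIFs). The sketch below is a hedged illustration on simulated data: `basal`, `bolus`, `total` and `other` are hypothetical names, and `total` is assumed to be almost exactly the sum of the other two, which may not match the real dataset. It uses base R only, relying on the identity that the VIFs are the diagonal of the inverse correlation matrix.

```r
# Simulated predictors; names and relationships are assumptions.
set.seed(2)
n <- 556
basal <- rnorm(n, mean = 20, sd = 5)
bolus <- rnorm(n, mean = 25, sd = 8)
# Nearly the exact sum; the small noise keeps the matrix invertible.
total <- basal + bolus + rnorm(n, sd = 0.5)
other <- rnorm(n)
X <- data.frame(basal, bolus, total, other)

# Pairwise correlations between predictors.
print(round(cor(X), 2))

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing predictor j
# on all the others. This equals the j-th diagonal element of the
# inverse of the predictor correlation matrix.
vif <- diag(solve(cor(X)))
print(round(vif, 1))
```

A very large VIF for a predictor means its coefficient is estimated with little independent information, so its sign and size can be unstable.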

Code:
```
Call:
lm(formula = meanHb ~ m_BG_num + m_BR_num + v_BR_num + m_Bol_mean + m_Carbs_num + v_Carbs_num + m_BR_tot + m_Bol_tot + m_insulin_tot,  data = meanHb)

Residuals:
    Min      1Q  Median      3Q     Max
-33.996  -7.427  -0.567   6.287  60.884

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)   79.95296    2.18495  36.593  < 2e-16 ***
m_BG_num      -1.53930    0.27118  -5.676 2.24e-08 ***
m_BR_num      -0.50174    0.16609  -3.021 0.002638 **
v_BR_num      -0.09066    0.05753  -1.576 0.115667
m_Bol_mean     1.35472    0.33284   4.070 5.39e-05 ***
m_Carbs_num   -1.21391    0.37794  -3.212 0.001396 **
v_Carbs_num    0.90840    0.22757   3.992 7.46e-05 ***
m_BR_tot      -2.42331    0.61669  -3.930 9.60e-05 ***
m_Bol_tot     -2.49543    0.61519  -4.056 5.71e-05 ***
m_insulin_tot  2.29962    0.60692   3.789 0.000168 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 11.9 on 546 degrees of freedom
F-statistic: 20.37 on 9 and 546 DF,  p-value: < 2.2e-16
```
I then used my knowledge of insulin pumps to select the variables that I believed were least likely to be related to each other. This model had a lower adjusted R-squared of 0.1937. See the summary() output below.

Code:
```
Call:
lm(formula = meanHb ~ m_BG_num + m_BR_num + m_BR_var + m_propBasal + m_Carbs_mean, data = meanHb)

Residuals:
    Min      1Q  Median      3Q     Max
-32.680  -7.396  -1.086   6.391  69.187

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  67.68001    3.83933  17.628  < 2e-16 ***
m_BG_num     -1.93774    0.25871  -7.490 2.76e-13 ***
m_BR_num     -0.47213    0.16403  -2.878  0.00415 **
m_BR_var     -3.51347    2.28792  -1.536  0.12520
m_propBasal  13.75078    4.75705   2.891  0.00400 **
m_Carbs_mean  0.10588    0.03445   3.073  0.00222 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 12.25 on 550 degrees of freedom
F-statistic: 27.67 on 5 and 550 DF,  p-value: < 2.2e-16
```
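
For reference, two candidate models fitted to the same data can be put side by side using adjusted R-squared, AIC and BIC. This is only a sketch on simulated data: `dat`, `a`, `b`, `c`, `y`, `fit_large` and `fit_small` are hypothetical names, and it does not reproduce the models above.

```r
# Simulated data; all names are hypothetical placeholders.
set.seed(3)
n <- 556
dat <- data.frame(a = rnorm(n), b = rnorm(n), c = rnorm(n))
dat$y <- 60 - 2 * dat$a + 1.5 * dat$b + rnorm(n, sd = 10)

# Two candidate models fitted to the same rows of the same data.
fit_large <- lm(y ~ a + b + c, data = dat)
fit_small <- lm(y ~ a + c, data = dat)

# Collect comparison statistics in one table.
# Higher adjusted R-squared and lower AIC/BIC favour a model.
comparison <- data.frame(
  model  = c("fit_large", "fit_small"),
  adj_r2 = c(summary(fit_large)$adj.r.squared,
             summary(fit_small)$adj.r.squared),
  AIC    = c(AIC(fit_large), AIC(fit_small)),
  BIC    = c(BIC(fit_large), BIC(fit_small))
)
print(comparison)
```

AIC and BIC are only comparable between models fitted to the same response and the same observations, so rows with missing values would need to be handled identically in both fits.
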
My questions are:
1) Which model should I be using?
2) Are variables that do not appear in the multiple linear model definitely not independently associated with improved outcome?
3) Should I reference both single and multiple predictor models when presenting the data?
4) How can I group the variables using a statistical method, so that I am not relying only on my knowledge of insulin pumps? For example, could I look at the associations between the independent variables, form (e.g.) 5 groups of related variables, and then run the model using those groups? That way other researchers could see that, for instance, m_Carbs_mean (the mean carbohydrate meal size) and the bolus size are related.
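
One possible statistical grouping, sketched here under assumptions, is hierarchical clustering of the predictors using 1 - |correlation| as the distance, then cutting the tree into a chosen number of groups. The data are simulated, the variable names are hypothetical, and the choice of average linkage and of `k = 3` groups is arbitrary for this toy example (the question mentions 5 groups for the real data).

```r
# Simulated predictors with two built-in related pairs; names are hypothetical.
set.seed(4)
n <- 556
carbs_mean <- rnorm(n, mean = 40, sd = 10)
bolus_mean <- 0.1 * carbs_mean + rnorm(n, sd = 0.3)  # tied to meal size
basal_tot  <- rnorm(n, mean = 20, sd = 5)
basal_num  <- 0.2 * basal_tot + rnorm(n, sd = 0.3)   # tied to basal total
bg_num     <- rnorm(n, mean = 5, sd = 2)             # unrelated to the rest
X <- data.frame(carbs_mean, bolus_mean, basal_tot, basal_num, bg_num)

# Distance between variables: strongly correlated pairs are close to 0.
d <- as.dist(1 - abs(cor(X)))

# Agglomerative clustering of the variables (not the patients).
hc <- hclust(d, method = "average")

# Cut the tree into k groups and list the members of each group.
groups <- cutree(hc, k = 3)
print(split(names(groups), groups))
```

Calling `plot(hc)` draws the dendrogram, which shows which variables cluster together and at what level of correlation.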