Hello,

I am undertaking a project analysing insulin pump data from people with Type 1 diabetes. My dataset covers 2015-2019. For each patient I calculate the daily mean or variance of many variables, then average those daily values over the whole 2015-2019 period. I am trying to determine which behaviours are associated with improved HbA1c, a higher proportion of blood glucose readings "in range" (3.9-10 mmol/L) and a lower proportion of hypoglycaemic readings (<3.0 mmol/L). The sample size is 556 patients. I am using R.
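A minimal sketch of what this two-step aggregation might look like, using simulated data. The column names (`patient`, `day`, `bolus`) and the values are hypothetical and are not taken from the actual dataset.

```r
set.seed(1)

# Simulated pump records (hypothetical): 3 patients, 10 days, 4 bolus events per day
records <- data.frame(
  patient = rep(1:3, each = 40),
  day     = rep(rep(1:10, each = 4), times = 3),
  bolus   = round(runif(120, 1, 10), 1)
)

# Step 1: per patient and day, compute the mean and variance of the variable
daily <- aggregate(bolus ~ patient + day, data = records,
                   FUN = function(x) c(mean = mean(x), var = var(x)))
daily <- do.call(data.frame, daily)  # flatten into columns bolus.mean, bolus.var

# Step 2: average the daily summaries over the whole period, one row per patient
per_patient <- aggregate(cbind(bolus.mean, bolus.var) ~ patient,
                         data = daily, FUN = mean)
print(per_patient)
```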

I have 20 input variables. When I fit a single-predictor linear model for each one, they tend to be statistically significant.
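As a rough illustration of the single-predictor step, the sketch below fits one `lm()` per predictor and collects the slope and p-value. The data are simulated, only three of the variable names from this post are used, and the effect sizes are made up.

```r
set.seed(42)
n <- 556

# Simulated stand-in for the per-patient table (values are invented)
dat <- data.frame(
  m_BG_num     = rnorm(n, 6, 2),
  m_Carbs_mean = rnorm(n, 40, 10),
  m_BR_num     = rnorm(n, 5, 2)
)
dat$meanHb <- 80 - 1.5 * dat$m_BG_num + 0.1 * dat$m_Carbs_mean + rnorm(n, sd = 12)

predictors <- setdiff(names(dat), "meanHb")

# Fit meanHb ~ <one predictor> for each predictor and keep its slope and p-value
uni <- do.call(rbind, lapply(predictors, function(v) {
  fit <- lm(reformulate(v, response = "meanHb"), data = dat)
  co  <- summary(fit)$coefficients[v, ]
  data.frame(predictor = v,
             estimate  = co[["Estimate"]],
             p_value   = co[["Pr(>|t|)"]])
}))
print(uni)
```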

I am now looking to see which variables are independent predictors of success. I fitted a multiple-predictor linear model using all variables and reduced it with the step() function, which produced a model with an adjusted R squared of 0.239. However, I believe that some of the variables are definitely not independent of each other. For example, total bolus insulin, total basal insulin and total insulin were all retained in the final minimal model.
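One way to quantify this kind of overlap is a variance inflation factor (VIF), computed here by hand as 1 / (1 - R²) from regressing each predictor on the others. This is only a sketch on simulated data: it assumes total insulin is almost, but not exactly, the sum of basal and bolus, which may not match the real data.

```r
set.seed(7)
n <- 556

basal <- rnorm(n, 20, 5)
bolus <- rnorm(n, 25, 8)
total <- basal + bolus + rnorm(n, sd = 0.5)  # assumption: near-exact sum

dat <- data.frame(m_BR_tot = basal, m_Bol_tot = bolus, m_insulin_tot = total)
dat$meanHb <- 70 - 0.2 * total + rnorm(n, sd = 12)  # invented outcome

preds <- c("m_BR_tot", "m_Bol_tot", "m_insulin_tot")

# VIF for each predictor: regress it on the remaining predictors
vif <- sapply(preds, function(v) {
  others <- setdiff(preds, v)
  r2 <- summary(lm(reformulate(others, response = v), data = dat))$r.squared
  1 / (1 - r2)
})
print(round(vif, 1))
```

Large values (a common rule of thumb is above 5 or 10) suggest that the coefficient of that variable cannot be interpreted separately from the others.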

I then used my knowledge of insulin pumps to select the variables I believed would be least likely to be related to one another. This model had a lower adjusted R squared of 0.1937. The summary() output for both models is below.

My questions are:

1) Which model should I be using?

2) Are variables which are not found in the multiple linear model definitely not independently associated with improved outcome?

3) Should I reference both single and multiple predictor models when presenting the data?

4) How can I group the variables using a statistical method, so that I am not just relying on my knowledge of insulin pumps? Can I look at the associations between the independent variables, form (e.g.) 5 groups and then run the model? That way, when researchers look at it, they can see that m_Carbs_mean (the mean carbohydrate meal size) and the bolus size are related.
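To make question 4 concrete, here is a minimal sketch of one possible approach: hierarchical clustering of the predictors using 1 - |correlation| as the distance. The data are simulated, the relationships between variables are invented for illustration, and the number of groups (3 here) is an arbitrary choice.

```r
set.seed(3)
n <- 556

carbs <- rnorm(n, 40, 10)
basal <- rnorm(n, 20, 5)

# Simulated predictors: two correlated pairs and one unrelated variable
dat <- data.frame(
  m_Carbs_mean = carbs,
  m_Bol_mean   = 0.1 * carbs + rnorm(n, sd = 0.5),
  m_BR_tot     = basal,
  m_BR_num     = 0.2 * basal + rnorm(n, sd = 0.5),
  m_BG_num     = rnorm(n, 6, 2)
)

# Strongly correlated variables (positive or negative) get a small distance
d  <- as.dist(1 - abs(cor(dat)))
hc <- hclust(d, method = "average")

groups <- cutree(hc, k = 3)        # cut the tree into 3 groups
print(split(names(groups), groups))
# plot(hc) would draw the dendrogram for presentation
```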

Any help would be appreciated - including links to further readings. Thank you in advance.

Model selected by step() (adjusted R squared 0.239):

```
Call:
lm(formula = meanHb ~ m_BG_num + m_BR_num + v_BR_num + m_Bol_mean + m_Carbs_num + v_Carbs_num + m_BR_tot + m_Bol_tot + m_insulin_tot, data = meanHb)

Residuals:
    Min      1Q  Median      3Q     Max
-33.996  -7.427  -0.567   6.287  60.884

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)   79.95296    2.18495  36.593  < 2e-16 ***
m_BG_num      -1.53930    0.27118  -5.676 2.24e-08 ***
m_BR_num      -0.50174    0.16609  -3.021 0.002638 **
v_BR_num      -0.09066    0.05753  -1.576 0.115667
m_Bol_mean     1.35472    0.33284   4.070 5.39e-05 ***
m_Carbs_num   -1.21391    0.37794  -3.212 0.001396 **
v_Carbs_num    0.90840    0.22757   3.992 7.46e-05 ***
m_BR_tot      -2.42331    0.61669  -3.930 9.60e-05 ***
m_Bol_tot     -2.49543    0.61519  -4.056 5.71e-05 ***
m_insulin_tot  2.29962    0.60692   3.789 0.000168 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 11.9 on 546 degrees of freedom
Multiple R-squared:  0.2513, Adjusted R-squared:  0.239
F-statistic: 20.37 on 9 and 546 DF,  p-value: < 2.2e-16
```

Model using the variables I selected by hand (adjusted R squared 0.1937):

```
Call:
lm(formula = meanHb ~ m_BG_num + m_BR_num + m_BR_var + m_propBasal + m_Carbs_mean, data = meanHb)

Residuals:
    Min      1Q  Median      3Q     Max
-32.680  -7.396  -1.086   6.391  69.187

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  67.68001    3.83933  17.628  < 2e-16 ***
m_BG_num     -1.93774    0.25871  -7.490 2.76e-13 ***
m_BR_num     -0.47213    0.16403  -2.878  0.00415 **
m_BR_var     -3.51347    2.28792  -1.536  0.12520
m_propBasal  13.75078    4.75705   2.891  0.00400 **
m_Carbs_mean  0.10588    0.03445   3.073  0.00222 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 12.25 on 550 degrees of freedom
Multiple R-squared:  0.201, Adjusted R-squared:  0.1937
F-statistic: 27.67 on 5 and 550 DF,  p-value: < 2.2e-16
```
