Advice on OLS analysis

Hello Everyone,

I’m writing a paper on the relationship between weather and daily public bike share system usage. I’m looking for a bit of advice regarding the quality of my statistical analysis. I’m not sure if this is an acceptable post. If not, I deeply apologize. I’m a student of statistics and plan to use stats in the transportation planning field.

I’d like to know whether my residual and Q-Q plots are correct and whether the OLS analysis is of good quality.

Thank you in advance.

Ordinary Least Squares Regression

Historical weather and daily bike hire counts were analysed using an ordinary least squares (OLS) regression model. The statistical language and environment R was used to count daily bike hires, extract weather data, and carry out the analysis. The OLS model was fitted using the following call in R.

# Fit the OLS model: daily hire count (Freq) regressed on the weather variables
a <- lm(formula = Freq ~ Dew_PointC + Max_Gust_SpeedKm_h + Max_Humidity +
          Max_TemperatureC + Max_Wind_SpeedKm_h + Mean_Humidity +
          Mean_TemperatureC + Mean_VisibilityKm + Mean_Wind_SpeedKm_h +
          MeanDew_PointC + Min_DewpointC + Min_Humidity + Min_TemperatureC +
          Min_VisibilitykM + Events,
        data = SSS1, x = TRUE)  # x = TRUE keeps the model matrix in the fit

An OLS linear regression model minimizes the sum of squared errors between the observed data and the model's fitted values. It extends simple, single-variable regression to several independent variables using the following equation.

y = a + b1x1 + b2x2 + … + bnxn
(where y is the dependent variable, a is the intercept, and b1 … bn are the slope coefficients of the independent variables x1 … xn)
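To make the "minimize the sum of squares" idea concrete, here is a minimal sketch on simulated data. The variables x1 and x2 and the true coefficients are made up for illustration and are not the bike share data. It compares the coefficients from lm() with those obtained by solving the normal equations directly.

```r
# Simulated data; x1, x2 and the true coefficients are illustrative assumptions.
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 5 + 2 * x1 - 3 * x2 + rnorm(n)

# Fit with lm(), the same function used for the weather model.
fit <- lm(y ~ x1 + x2)

# Solve the normal equations (X'X) b = X'y by hand.
X <- cbind(Intercept = 1, x1 = x1, x2 = x2)
b_manual <- drop(solve(crossprod(X), crossprod(X, y)))

# Residual sum of squares for any candidate coefficient vector b.
rss <- function(b) sum((y - X %*% b)^2)

# The two sets of coefficients should agree up to rounding error.
print(cbind(lm = coef(fit), manual = b_manual))
print(rss(coef(fit)))
```

Moving the coefficients away from the fitted values in any direction should only increase the residual sum of squares, which is what "least squares" refers to.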

An OLS regression model assumes residuals are normally distributed (Field, Miles, and Field, 2012). A normal Q-Q plot of the standardized residuals was used to check this assumption. Figure 4 suggests normally distributed residuals in the OLS analysis. The normal plot was generated using the following code in R.

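a2 <- rstandard(a)  # standardized residuals of the fitted model
qqnorm(a2, ylab = "Standardized Residuals", xlab = "Normal Scores",
       main = "Weather and Bike Hire Counts")
qqline(a2)  # reference line; points close to it indicate normality
grid()      # called on its own line; adding it with "+" raises an error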

Figure 4: Normal Plot of Residuals
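A Q-Q plot is a visual check, so it can help to pair it with a formal test. The sketch below runs a Shapiro-Wilk test on the standardized residuals of a model fitted to simulated data. The variables temp and hires are hypothetical stand-ins, not the Glasgow data. With several hundred observations the test can flag departures too small to matter, so it is best read alongside the plot rather than instead of it.

```r
# Simulated stand-in data; temp and hires are hypothetical, not the real series.
set.seed(2)
n     <- 300
temp  <- runif(n, 0, 25)                        # pretend daily max temperature
hires <- 900 + 20 * temp + rnorm(n, sd = 80)    # pretend daily hire counts

fit <- lm(hires ~ temp)
res <- rstandard(fit)                           # standardized residuals

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal.
# shapiro.test() accepts between 3 and 5000 observations.
sw <- shapiro.test(res)
print(sw)
```

A small p-value would be evidence against normality; a large one means the test found no evidence against it, which is weaker than proof that the residuals are normal.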

In a perfect regression model, there is no error between the observed data and the fitted values. Such a model can be pictured by plotting the standardized residuals: if every residual lies on the zero line, the model is perfect and there is no difference between the fitted and observed values. However, this is rarely the case. Figure 5 is a scatter plot of the residuals of the OLS model. Most residuals lie within 2 standard deviations of zero, while some lie between 2 and 4 or between -2 and -4. The latter suggests that the model may not be able to predict the number of bike hires in some instances (Field, Miles, and Field, 2012). The following code was used to generate a scatter plot of the standardized residuals of the OLS model.

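a2 <- rstandard(a)  # standardized residuals of the fitted model
# Plot against fitted values; plot(a2) alone would use row order on the x-axis
plot(fitted(a), a2, ylab = "Standardized Residuals",
     xlab = "Fitted Bike Hire Counts",
     main = "Plot of Standardized Residuals", col = "darkgreen")
abline(h = 0)  # horizontal reference line at zero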

Figure 5: Scatter Plot of Residuals
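The statement that most residuals fall within 2 standard deviations can be checked numerically. Below is a minimal sketch on simulated data, where x and y are illustrative and not the bike share variables. It computes the share of standardized residuals beyond 2 and 3 and compares the first with the roughly 5% expected under normality.

```r
# Simulated data for illustration only.
set.seed(3)
n <- 500
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)

fit <- lm(y ~ x)
res <- rstandard(fit)                    # standardized residuals

# Observed share of residuals outside +/-2 and +/-3.
share_beyond_2 <- mean(abs(res) > 2)
share_beyond_3 <- mean(abs(res) > 3)

# Share expected outside +/-2 if the residuals were exactly standard normal.
expected_beyond_2 <- 2 * pnorm(-2)

# Row indices of the most extreme observations, worth inspecting one by one.
flagged <- which(abs(res) > 3)

print(c(observed = share_beyond_2, expected = expected_beyond_2))
print(flagged)
```

If the observed share is far above the expected one, or if many residuals exceed 3 in absolute value, the flagged days may deserve a closer look (for example, holidays or events the weather variables cannot explain).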

These post hoc checks suggest the model has strong, but not perfect, predictive power. The following table highlights the findings from the multivariable regression model.

Table 10: OLS model weather and daily bike hire counts in Glasgow, UK (24 June 2014 – 3 August 2016)

The intercept suggests that if all independent variables were zero, 928 hires would be expected per day. This seems plausible, as the significant variables should act to bring the prediction closer to mean levels.
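An intercept describes the hypothetical case where every predictor equals zero, which may never occur in the data (zero humidity, for instance). One common way to get a more interpretable intercept is to centre the predictors at their means. The sketch below uses simulated data with hypothetical variables temp, humidity and hires. It is an illustration of the idea, not a re-analysis of Table 10.

```r
# Simulated stand-in data; all variable names and values are hypothetical.
set.seed(4)
n        <- 365
temp     <- runif(n, 0, 25)
humidity <- runif(n, 50, 100)
hires    <- 900 + 15 * temp - 5 * humidity + rnorm(n, sd = 60)

# Raw predictors: the intercept is the prediction at temp = 0, humidity = 0.
fit_raw <- lm(hires ~ temp + humidity)

# Centred predictors: the intercept is the prediction on an average day.
temp_c     <- temp - mean(temp)
humidity_c <- humidity - mean(humidity)
fit_centered <- lm(hires ~ temp_c + humidity_c)

print(coef(fit_raw))
print(coef(fit_centered))
print(mean(hires))
```

Centring leaves the slopes unchanged and only shifts the intercept, which then equals the mean of the dependent variable.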

Mean humidity had a negative effect on the number of bike hires. This suggests that people use other modes of transportation when humidity is higher than a comfortable level (when it’s muggy), or when it’s raining or foggy. Three indicator variables for rain, fog, and rain with fog were included, and each had a negative effect on daily bike hires.
Mean wind speed seems to have a negative (albeit small) effect on the number of bike hires. The daily maximum temperature had a positive influence on daily bike hires. The relationship between bike hires and temperature may account for the seasonal variations in usage.

Dew point is negatively associated with bike hires, which may align well with the notion that cooler temperatures, rain, fog, and higher humidity negatively influence bike hires. Interestingly, bike hire counts were lower when minimum visibility was greater. This inverse relationship is contrary to expectations: one would expect greater visibility to have a positive influence on bike hire usage.

Gebhart and Noland (2013) found similar results on the relationship between bike hire counts and rain, fog, humidity, and temperature in Washington DC. However, the authors found contrary results regarding wind speed. Dew point and minimum visibility were not included in their model. This research therefore suggests new findings on the relationship between weather and daily bike hire counts.