Curving qq plot: What does this mean?

noetsi

Fortran must die
#1
I have 22,000 cases, so I am not sure the distribution of the residuals even matters, but these residuals look strange and raised my concern, including about whether the relationship is linear. The predictor is a number of days, usually fewer than 60, although it can be as many as 560. I am predicting income, an interval measure.
 


noetsi

Fortran must die
#2
I limited the data by excluding the top 1 percent. The residuals look normal to me, but the QQ plot does not. I have not found any examples of QQ plots that look like this, so I don't know how to interpret it. As noted, I am most concerned about non-linearity.
 


Miner

TS Contributor
#3
If I were to see a plot such as this in a reliability analysis, I would conclude that I had a mixture of 4-5 different failure modes. Dog legs (straight segments with bends) indicate a mixture and each line segment a different failure mode.
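A rough way to see the effect is to simulate it. The sketch below is only an illustration under assumed numbers: the two subgroups, their means and spreads, and the 70/30 split are all hypothetical, not taken from the posted data. It builds normal QQ coordinates by hand and uses the correlation between theoretical and observed quantiles as a crude straightness score (a value near 1 means a nearly straight QQ line).

```python
import random
from statistics import NormalDist, mean

def qq_points(sample):
    """Return (theoretical, observed) coordinates of a normal QQ plot."""
    n = len(sample)
    observed = sorted(sample)
    # Plotting positions (i - 0.5) / n keep probabilities away from 0 and 1.
    std_normal = NormalDist()
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return theoretical, observed

def pearson(x, y):
    """Plain Pearson correlation, used here as a straightness score."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(0)

# One homogeneous population (hypothetical mean 30, sd 5).
single = [rng.gauss(30, 5) for _ in range(5000)]

# Hypothetical mixture: 70% "fast" subgroup, 30% "slow" subgroup.
mixture = [
    rng.gauss(20, 3) if rng.random() < 0.7 else rng.gauss(60, 8)
    for _ in range(5000)
]

r_single = pearson(*qq_points(single))
r_mixture = pearson(*qq_points(mixture))

print("QQ straightness, single population:", round(r_single, 4))
print("QQ straightness, mixture:          ", round(r_mixture, 4))
```

Plotting the two lists returned by `qq_points(mixture)` against each other should show straight stretches joined by a bend, one stretch per subgroup, whereas the single population stays close to one line.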

 

noetsi

Fortran must die
#5
The data is duration in a specific status. 0 means you spent, effectively, no days in that status. In practice, since the value is generated by a date difference, it means you spent up to 24 hours in that status. I did not think it mattered, since this is a predictor with roughly 178 distinct levels (values range from 0 to 178 days).

It is a population's duration, Miner. Can that have the equivalent of what you are talking about? (It would not be a failure mode, but maybe something similar.) Some of the comments I read suggested this could occur when you have a multimodal result (and I assume this might occur here).

Since I have somewhere between 16 and 20 thousand data points, I am not sure the normality matters. My real concern is that the relationship in the data might be non-linear. I am not sure if this plot suggests non-linearity (I posted the residuals as well, which don't look non-linear to me).
 

Miner

TS Contributor
#6
Reliability analysis is just analyzing the time to an event such as a change in status, so it would be appropriate in your situation. The "failure mode" means that there are distinct subgroups where the distribution of time to a status change is different. In survival analysis, this would equate to having a mixture of people that die from different causes (e.g., pancreatic cancer, lung cancer, etc.). In your case, I think you may have 4-5 sub-populations that progress through the status change at different rates.
 

noetsi

Fortran must die
#7
I am certain we have many subpopulations. What they are, however, I have no idea. We have never analyzed or thought of this. My practical concern is whether we have non-linearity. Any suggestions on this based on the residuals?
 

Miner

TS Contributor
#8
The bottom line is more a matter of whether your model predicts well enough to be useful. If it does, minor violations don't really matter. If it does not, you should investigate the sub-populations and include them in your model.

To illustrate, we use a virtual catapult to teach design of experiments. The actual function is nonlinear, but response surface methods will yield a quadratic model that approximates the actual function. Depending on the students' design space, the model usually fits well enough to make good predictions on the distance. However, sometimes it will predict well for short/long distances and completely fall apart in the middle range. We have had students fit three linear spline models for short/medium/long distances that worked extremely well. The point being that the models were all theoretically wrong, but some were very effective at prediction within the requirements.
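The idea can be sketched in a few lines. Everything here is an assumption for illustration: the "true" function is the ideal projectile range formula rather than the actual catapult simulator, the launch speed is made up, and the three sub-ranges are split by angle at arbitrary cut points. The sketch compares one straight line over the whole range with three separate straight lines.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b

def rmse(xs, ys, predict):
    """Root mean squared error of predict(x) against y."""
    return math.sqrt(sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / len(xs))

def true_distance(angle_deg, speed=10.0, g=9.81):
    """Hypothetical stand-in for the real catapult: ideal projectile range."""
    return speed ** 2 * math.sin(2 * math.radians(angle_deg)) / g

angles = [10 + 0.5 * i for i in range(141)]   # 10 to 80 degrees
dist = [true_distance(a) for a in angles]

# One straight line over the whole design space.
a_all, b_all = fit_line(angles, dist)
rmse_line = rmse(angles, dist, lambda x: a_all + b_all * x)

# Three separate straight lines on illustrative sub-ranges [lo, hi).
cuts = [(10, 30), (30, 60), (60, 80.5)]
segments = []
for lo, hi in cuts:
    xs = [x for x in angles if lo <= x < hi]
    ys = [true_distance(x) for x in xs]
    segments.append((lo, hi, fit_line(xs, ys)))

def piecewise(x):
    """Predict with whichever segment's line covers x."""
    for lo, hi, (a_seg, b_seg) in segments:
        if lo <= x < hi:
            return a_seg + b_seg * x
    raise ValueError("x is outside the fitted range")

rmse_piece = rmse(angles, dist, piecewise)

print("RMSE, single line:      ", round(rmse_line, 3))
print("RMSE, three-piece model:", round(rmse_piece, 3))
```

Neither model is the true function, but the three-piece version tracks it more closely over the fitted range. Whether either is close enough depends on the prediction requirement, not on the fit statistic alone.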

On the opposite extreme, I have seen models that had an R^2(predicted) of 0.999 that were not close enough for design purposes.
 

noetsi

Fortran must die
#9
I understand models being wrong but predicting correctly, Miner. I do time series where I have had models that made no logical sense work better than models that were logical. :) My original concern was whether a curve in a QQ plot suggests non-linearity (I had never seen one curve in my data or the literature). Apparently it does not; it just shows non-normality. Since I don't care about non-normality, only non-linearity, that is important. I did a Box-Tidwell test and looked at the studentized residuals; neither suggested non-linearity.
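To convince myself that the two issues are separate, here is a minimal simulation sketch. All numbers are made up (the intercept, the slope, the exponential error with mean 200), and the binned-residual check is a simple stand-in, not the Box-Tidwell test. The relationship is exactly linear by construction, but the errors are right-skewed, so the residuals are clearly non-normal while their average within each slice of the predictor stays near zero.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b

rng = random.Random(1)
n = 20000

# Hypothetical data: days in status (0 to 178) and an income that is
# exactly linear in days plus a right-skewed, mean-zero error.
days = [rng.uniform(0, 178) for _ in range(n)]
income = [1000 + 5 * d + rng.expovariate(1 / 200) - 200 for d in days]

intercept, slope = fit_line(days, income)
resid = [y - (intercept + slope * x) for x, y in zip(days, income)]

# Non-normality check: sample skewness of the residuals
# (OLS residuals with an intercept average to zero, so no centring needed).
m2 = sum(r ** 2 for r in resid) / n
m3 = sum(r ** 3 for r in resid) / n
resid_sd = m2 ** 0.5
skew = m3 / m2 ** 1.5

# Linearity check: mean residual within 10 equal-sized slices of days.
ordered = sorted(zip(days, resid))
size = n // 10
bin_means = [
    sum(r for _, r in ordered[i * size:(i + 1) * size]) / size
    for i in range(10)
]
max_bin = max(abs(m) for m in bin_means)

print("residual skewness:", round(skew, 2))
print("largest |mean residual| in a slice:", round(max_bin, 2),
      "vs residual sd", round(resid_sd, 2))
```

A curved QQ plot corresponds to the large skewness here. A non-linear relationship would instead show up as slice means that drift systematically away from zero as days increase.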

You raise a critical point, however. A model can violate the assumptions and be wrong. But how do you know how serious the violations have to be before you need to worry? I don't want to give the people I provide data to the wrong answer, so this always scares the heck out of me.