is regression analysis possible for this dataset?

GAL

New Member
#1
I am playing with a dataset (for SAS training purposes) and wanted to do some regression analysis, including the effects of several lifestyle factors on the prevalence of a condition. After some experimenting, I now suspect that regression analysis is impossible the way this dataset is built. But maybe I'm wrong?

The dataset itself is rather complex, but attached is an Excel screenshot with a simplified table demonstrating what I think is the problem. I would be grateful for a confirmation or refutation. 2019-02-16.png
 
#2
I would not do regression. It seems you have almost as many parameters as observations. A good rule of thumb is 10 observations per predictor.
 
#4
I agree. I was speaking from a standpoint of being able to fit the model. If I have 10 observations and 10 predictors I don't have any degrees of freedom to estimate error.
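The degrees-of-freedom point can be checked numerically. Below is a minimal sketch (Python/NumPy with made-up data, not the OP's SAS dataset) showing that with as many parameters as observations, least squares reproduces the data exactly and leaves zero residual degrees of freedom, so there is nothing left with which to estimate error:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 10, 9  # 9 predictors + intercept = 10 parameters for 10 observations
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = rng.normal(size=n)

beta, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta

print(np.allclose(fitted, y))  # the fit is exact when X is full rank
print(n - rank)                # residual degrees of freedom: 0
```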
 
#5
But if you have 12 observations and 11 predictors you can estimate each effect very well in a Plackett-Burman design, and that is far less than 10 times the number of predictors.

On the other hand, if you have a few highly collinear variables and the effect size is very small, then you will need much more than 10 times the number of variables. The 10-observations-per-predictor rule is just an internet rumor; it is not a hard truth.
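The collinearity point can be made concrete with a variance inflation factor (VIF) calculation. This is a hypothetical sketch in Python/NumPy with fabricated data; the VIF for a predictor is 1/(1 - R²), where R² comes from regressing that predictor on the others:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)  # nearly collinear with x1

# VIF for x1: regress x1 on x2, then compute 1 / (1 - R^2)
coefs = np.polyfit(x2, x1, 1)
resid = x1 - np.polyval(coefs, x2)
r2 = 1 - resid.var() / x1.var()
vif = 1 / (1 - r2)

print(vif)  # far above the common rule-of-thumb cutoff of 10
```

A VIF this large means the sampling variance of the coefficient is inflated by that factor, which is why collinear designs need far more observations to detect a small effect.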
 

noetsi

Fortran must die
#6
There are many authors, Agresti comes to mind, who argue that you need a minimum sample size to conduct regression. I think this has less to do with identification than with statistical power and detecting actual variation. In addition, very small datasets are unlikely to generalize well to the population you draw them from unless that population is very small. Moreover, many regression assumptions hold only asymptotically, so violating them with a small sample is much more serious than with a large one.

I would not run a regression with fewer than 30 cases (at which point the CLT kicks in), although I would prefer an analysis with hundreds of cases to be reasonably sure of the results. This is true whether you have multicollinearity or not.
 
#7
(Quoting noetsi's post #6 above.) I disagree.
 

Dason

Ambassador to the humans
#8
That's all well and good, but I think everybody is sidetracked by the (in my opinion) irrelevant discussion of sample size. They posted a sample/example of their data that wasn't intended to be the full dataset (at least for the data on the left), as far as I could tell.

@GAL Can you provide some clarification on what your data looks like? It's not clear to me what you actually have.
 

noetsi

Fortran must die
#11
Lol, good one. It is distressing, as often occurs, when you read an article saying that something taken for granted may in fact be totally wrong. For example, I once read a book arguing that virtually all linear methods are wrong (even with large samples) when you have extreme outliers. And more recently I have encountered authors who argue that nearly all regression in the presence of nonstationarity (which is pretty common in real-world data) may be spurious.

Time to go back to descriptive statistics :(
 

noetsi

Fortran must die
#13
An example of why I have trouble with Box's views is that if you misspecify a model, the results will be biased. How can you use a biased model correctly? And yet all models are likely to be misspecified, since none is going to include all pertinent variables (doing so would violate parsimony as well).
 

Dason

Ambassador to the humans
#14
Well how bad is the bias? What is your goal? If the goal is to get good predictions then a model being unbiased might actually hurt if it increases your variance enough.
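A small simulation makes this trade-off concrete. The sketch below (Python/NumPy, with arbitrary invented settings) compares ordinary least squares with ridge regression, a deliberately biased estimator, on coefficient mean squared error; the biased estimator can win by cutting variance more than the bias costs:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, sigma = 30, 20, 2.0      # many predictors relative to n: OLS is noisy
beta_true = np.full(p, 0.3)
X = rng.normal(size=(n, p))    # one fixed design, many simulated responses

def coef_mse(lmbda, reps=200):
    """Average squared error of the coefficient estimates; lmbda=0 is OLS."""
    errs = []
    for _ in range(reps):
        y = X @ beta_true + sigma * rng.normal(size=n)
        b = np.linalg.solve(X.T @ X + lmbda * np.eye(p), X.T @ y)
        errs.append(np.mean((b - beta_true) ** 2))
    return np.mean(errs)

print(coef_mse(10.0) < coef_mse(0.0))  # biased ridge beats unbiased OLS here
```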
 

noetsi

Fortran must die
#15
From what I have read, if the model is biased there is usually no way to know how biased; it depends on relationships you will rarely know.

When I predict (which is exclusively in time series) I use univariate models such as ESM or state-space models. I think it is generally agreed that they are better at predicting time series than multivariate models, and they are robust to assumption violations (they make almost no assumptions at all if you only need point estimates). They are of course much easier to do as well.

I use multivariate models for one thing only, to see the relationship of variables to each other.
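For readers unfamiliar with ESM: simple exponential smoothing, the most basic exponential smoothing model, produces a point forecast from a single recursion and needs no distributional assumptions for that point estimate. A minimal sketch (Python, hypothetical data and smoothing weight):

```python
import numpy as np

def ses_forecast(y, alpha=0.3):
    """Simple exponential smoothing.
    level_t = alpha * y_t + (1 - alpha) * level_{t-1};
    the one-step-ahead point forecast is the final level."""
    level = y[0]
    for obs in y[1:]:
        level = alpha * obs + (1 - alpha) * level
    return level

y = np.array([10.0, 12.0, 11.0, 13.0, 12.5])
print(round(ses_forecast(y), 3))
```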
 
#18
Yes, 'all' models are wrong unless you are doing a data simulation or a fixed experiment and know the true underlying data-generating function. This is compounded because most processes are probabilistic rather than deterministic, because of imprecision in data collection, and so on. Furthermore, if a realization is small, results may mislead the researcher due to sampling variability even when the proper model is identified. There are many things to think about when attempting to make inferences, including and going beyond model specification.

Side note: I don't see any issues with the metadata presented by the OP. As @Dason noted, the person used ellipses to signify that this is a partial presentation of the data. The biggest issue for me was that we weren't shown the percentages and sample size, so we could truly evaluate sparseness concerns. @GAL also did not tell us what the modeling issue was.
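The sampling-variability point above is easy to demonstrate: even with a correctly specified model, slope estimates from small samples scatter far more widely than those from large ones. A quick simulation sketch (Python/NumPy, hypothetical numbers):

```python
import numpy as np

rng = np.random.default_rng(3)
beta = 0.5  # the true, correctly specified slope

def fit_once(n):
    """Fit the true model y = beta*x + noise to one sample of size n."""
    x = rng.normal(size=n)
    y = beta * x + rng.normal(size=n)
    return np.polyfit(x, y, 1)[0]  # estimated slope

small = [fit_once(15) for _ in range(1000)]
large = [fit_once(500) for _ in range(1000)]

# Small-sample slope estimates are far noisier (roughly sqrt(500/15) ~ 5.8x)
print(np.std(small) > 2 * np.std(large))
```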