Backward logistic regression

#1
:wave: Hi ... This is my first post. After reviewing the talkstats archives for information on regression analyses, I am turning to talkstats subscribers for help. The last stats class I took was about 5 years ago. I am currently working on a research study using a secondary data set (non-random sample), and I am looking for variables that predict program outcomes. My planned method of analysis is: (1) examine the association between all variables with a Pearson chi-square, and use a t-test to compare sample means between the two ethnic groups in the study; (2) clean up the data (missing data and outliers among the IVs, all of which may affect the regression analysis); (3) enter all variables at one time in a backward logistic regression.
:confused: Questions: Am I on the right track? Should I clean up the data before comparing variables? I have been running the data set in SPSS (Pearson chi-square and t-tests). What criteria does an independent variable have to meet (e.g. strength of association) to be entered into the regression analysis?
Any help will be greatly appreciated,
Thanks,
dataquest
 
#2
First of all, I would like to know what the program outcomes are. You say you want to use a logistic regression; that is fine if your outcome variable is binary, but if your outcome variables are continuous, then you cannot use a logistic regression. I am not sure what you are doing with the missing data. If you are doing imputation, that is OK, but make sure it is done correctly, because it can be tricky. I am not sure what else you can do with the missing data. Also, be careful with the outliers. If you decide to remove outliers, you must have a very good reason to suspect that the value was entered wrongly rather than that it's just an outlier. If you have continuous variables with outliers, you might want to do a transformation, such as taking a log of the values to normalize the data. Or you can put continuous variables into categories and break those down into dichotomous variables (yes/no) for each category. SPSS will skip the observations that have missing values, so you don't have to remove them from your sample yourself.
You should examine the association between each outcome variable and all covariates, as well as among the covariates themselves. For example, education and salary might be highly correlated, in which case you should use only one of them. Also, when you run the chi-square statistic on each individual covariate versus the outcome variable, some will not be statistically significant. You might consider leaving those variables out of the model. On the other hand, there might be some variables that are not statistically significant but that you think are important enough to be in the model anyway. When doing backward elimination in a logistic regression, you let the computer decide what needs to stay and what needs to go. Sometimes human intuition does a better job, so you should try it both ways.
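To make the backward elimination idea concrete, here is a minimal pure-Python sketch. It is not a reproduction of the SPSS procedure: it assumes removal is based on Wald p-values with a cutoff of 0.05 (the `alpha` parameter is an assumption), and the predictor names `family_participation` and `coin_flip` and all of the data are made up for illustration.

```python
import math

def solve(A, b):
    """Solve the linear system A x = b by Gauss-Jordan elimination."""
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * q for a, q in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def information(X, beta):
    """Fitted probabilities and the information matrix X' W X."""
    probs = [1 / (1 + math.exp(-sum(b * x for b, x in zip(beta, row))))
             for row in X]
    k = len(beta)
    H = [[sum(p * (1 - p) * row[i] * row[j] for row, p in zip(X, probs))
          for j in range(k)] for i in range(k)]
    return probs, H

def fit_logit(X, y, iters=25):
    """Newton-Raphson logistic fit; returns coefficients and Wald p-values."""
    k = len(X[0])
    beta = [0.0] * k
    for _ in range(iters):
        probs, H = information(X, beta)
        grad = [sum((yi - p) * row[j] for row, yi, p in zip(X, y, probs))
                for j in range(k)]
        beta = [b + s for b, s in zip(beta, solve(H, grad))]
    _, H = information(X, beta)
    pvals = []
    for j in range(k):
        unit = [1.0 if i == j else 0.0 for i in range(k)]
        variance = solve(H, unit)[j]          # j-th diagonal of H^-1
        z = beta[j] / math.sqrt(variance)
        pvals.append(math.erfc(abs(z) / math.sqrt(2)))  # two-sided p-value
    return beta, pvals

def backward_eliminate(rows, y, names, alpha=0.05):
    """Repeatedly drop the predictor with the largest p-value above alpha."""
    keep = list(names)
    while keep:
        X = [[1.0] + [row[name] for name in keep] for row in rows]
        _, pvals = fit_logit(X, y)
        worst = max(range(len(keep)), key=lambda i: pvals[i + 1])  # skip intercept
        if pvals[worst + 1] <= alpha:
            break
        keep.pop(worst)
    return keep

# Made-up data: family_participation is related to graduating (1) versus
# terminating (0); coin_flip is balanced within every cell, so it is unrelated.
rows, y = [], []
for fp, outcome, count in [(1, 1, 40), (1, 0, 10), (0, 1, 10), (0, 0, 40)]:
    for i in range(count):
        rows.append({"family_participation": fp, "coin_flip": i % 2})
        y.append(outcome)

kept = backward_eliminate(rows, y, ["family_participation", "coin_flip"])
print(kept)
```

Trying it "both ways", as suggested above, would mean comparing the variables kept by a procedure like this with the set you would have chosen on substantive grounds.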

Jenny Kotlerman
www.statisticalconsultingnetwork.com
 
#3
Thanks for the response, Jenny! VARIABLES - The program outcomes are participants who either (1) Graduated or (2) Terminated from the program. MISSING DATA - The sample size is 260, and I have eleven variables with sets of scaled scores for 217 individuals (the data is secondary, so I have to work with what I have). It appears that during data entry a section of scores was not entered, either by mistake or because the scores were not available. Anyway, I am left with a hole in my data set. I have to keep the 260 count because I have other IVs (e.g. family participation, ethnicity, etc.) that are included in my study and which I am grouping together to do another run. Do I still need to be concerned about the missing data?
OUTLIERS - I re-examined the outliers and found that they are not data entry mistakes. Each value comes from a different person and reflects individual differences (the variable is Age, with entries ranging from 12 years (one case) to 18 years (one case), and a mode of 16). Do I need to be concerned about transforming this continuous variable if the only outliers are one twelve-year-old and one eighteen-year-old? I'm not sure whether two extreme outliers would make a difference.
HIGH CORRELATIONS - Correlations were first run on a group of scores taken from a clinical inventory. Almost all of the variables were significant at the .01 level, some even reaching .803 Sig. (2-tailed) (impulsive propensity was highly correlated with unruly), which may mean using only one of them? I have not yet examined the association between the outcome variable (DV) and ALL IVs. So far I have examined the association among the eleven variables (scaled scores) taken from the one measure mentioned above. I was hoping to remove some variables that are highly correlated before combining all variables (demographics, scaled scores, parent participation, etc.). I want to make sure my data is clean before entering all variables into the backward regression.
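As a rough illustration of screening predictors for high intercorrelation, the sketch below computes Pearson's r for every pair of scales and flags pairs whose absolute correlation reaches a cutoff. The cutoff of 0.8, the scale names, and the five scores per scale are all assumptions made up for the example; they are not values from the study.

```python
import math
from itertools import combinations

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def flag_correlated_pairs(scores, threshold=0.8):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold."""
    flagged = []
    for a, b in combinations(scores, 2):
        r = pearson_r(scores[a], scores[b])
        if abs(r) >= threshold:
            flagged.append((a, b, round(r, 3)))
    return flagged

# Hypothetical scaled scores for five participants on three scales.
scores = {
    "impulsive": [10, 20, 30, 40, 50],
    "unruly": [12, 19, 33, 41, 48],
    "anxious": [30, 10, 50, 20, 40],
}

flagged = flag_correlated_pairs(scores)
print(flagged)
```

Note that the flag is based on the size of r itself, not on its significance level: with a sample in the hundreds, even a fairly small correlation can be significant at the .01 level.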
Oh, I saw that you included a web address; should I be contacting you through it? I posted this on the forum because I thought someone else might have a similar problem and could benefit from the response (especially since talkstats has archives, which is where I first went for help). Thanks again for your help; your input has been very beneficial. I hope to hear from you again,
dataquest
 
#4
So you only have one outcome variable: whether or not a person graduated from the program. In this case it is probably practical to use a logistic regression.
For the missing data, what are these sets of scaled scores that you are talking about? Do they somehow define an outcome, or are they a set of scores that define your predictor variables? In either case, if you cannot get these scores from anywhere, then whatever analyses you run will be for the 217 people, not 260. There is nothing that you can do about it, but you will have to note it in your report.
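The effect of dropping cases with missing scores (listwise deletion) can be sketched as follows. The record layout and the field name `scaled_score` are hypothetical; only the counts of 260 cases and 217 cases with scores come from the thread.

```python
def complete_cases(records, fields):
    """Keep only the records that have a value for every listed field."""
    return [r for r in records if all(r.get(f) is not None for f in fields)]

# Hypothetical data set: 260 participants, the last 43 without a scaled score.
records = [
    {"id": i, "scaled_score": 50.0 if i < 217 else None}
    for i in range(260)
]

kept = complete_cases(records, ["scaled_score"])
dropped = len(records) - len(kept)
share_missing = dropped / len(records)

print(len(kept), dropped, round(share_missing, 3))  # prints: 217 43 0.165
```

Any model that includes the scaled scores is therefore fitted to the 217 complete cases, which is the number to report.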
As far as the outliers are concerned, I am not sure what you mean when you say you have outliers for one 12-year-old and one 18-year-old. As long as the histogram shows a normal distribution, you are OK to put the variable in the model as is. It does not matter that there is only one value for 12 and one value for 18. Perhaps you need to clarify your problem with this variable for me. Also, you should make sure that all of your scaled scores are normally distributed. What are the values for the scales? If it's a Likert score of 1-4 or 1-5, it is considered an ordinal variable, not a continuous variable, although many people use these as pseudo-continuous ones.
For correlations, without knowing more about these clinical variables, I cannot give you any more advice on how to properly put them into the model. Sometimes a set of scores that are all highly correlated is combined into one score for use in the final model. Other times, they are left as separate scores. Remember, the more variables you put into your model, the lower your power. As I mentioned before, start by performing a simple t-test or a chi-square test for each individual variable in order to decide which ones should go into the model. For example, if ethnicity is not significantly different between those who graduated and those who did not, you might decide to exclude it from the full model, or you might decide that it is an important enough variable to keep in.
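A screening test of this kind, for one categorical variable against the graduated/terminated outcome, might look like the sketch below. It computes the Pearson chi-square for a 2x2 table without a continuity correction and gets the p-value from the 1-degree-of-freedom chi-square distribution. The counts are invented for illustration and are not from the study.

```python
import math

def chi_square_2x2(table):
    """Pearson chi-square (no continuity correction) for a 2x2 count table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    stat = 0.0
    for i, observed_row in enumerate(table):
        for j, observed in enumerate(observed_row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    # With 1 degree of freedom, P(chi2 > stat) = erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Hypothetical counts: rows are two ethnic groups,
# columns are graduated / terminated.
table = [[60, 40],
         [45, 55]]

stat, p_value = chi_square_2x2(table)
print(round(stat, 3), round(p_value, 4))
```

With these made-up counts the statistic is about 4.51 on 1 degree of freedom, so the variable would pass a .05 screening cutoff; whether to keep a variable that fails the cutoff remains a judgment call, as described above.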
As far as the website that I listed in the posting is concerned, it is the website for my private consulting business, where you can go to get more in-depth answers, as well as to have more complicated analyses done for you for a reasonable price. I cannot perform your analysis for you here on talkstats, nor check your data for other inconsistencies that you might not have noticed. My personal website has nothing to do with talkstats.

Jenny Kotlerman
www.statisticalconsultingnetwork.com
 
#5
Backward logistic regression, continued dialogue

Hi Jenny,
:tup: Thanks for replying and confirming my use of a logistic regression analysis in my study. Missing data: the scaled scores I am using are continuous, they range from 0-115, and they define the predictor variables. After researching how to deal with missing data, I had to drop the 43 cases with missing data :shakehead , which was over the suggested rule of thumb of 15% missing values in a particular case or variable. Outliers: I will have to check whether the distribution is normal. Age is the variable, ranging from 12-18 years. Correlations: I will have to read a little more on the topic. If all variables are significant, I will enter them into the regression model; if not, I will have to decide whether they are important predictor variables. Thanks for the input, it always helps to discuss the process. I have been working with SPSS for a while; it is just a matter of remembering all the extra procedures to include in any given analysis. I will post my process once I have figured it out completely.
:wave: Bye for now!