# cell size for multifactorial ANOVA

#### Mary_Anne

##### New Member
Hello, I wonder if I could get help with this.

I am analyzing data on scores from a psychosocial development assessment on young children, trying to see if there are any differences by their age, sex, location, and length of participation in a program. I’m using multifactorial ANOVA (using general linear model in SPSS). I enter all main effects and 2 expected interactions (of theoretical interest).

I am using all variables as categorical. I decided to go with this approach rather than using some variables (age, length of time in program) as continuous covariates or using multiple regression because of the audience I’m targeting (I thought it’ll be easier for them to see the results by means by each age group, by each year in program etc. rather than seeing regression coefficients, etc.).

We have N of 330 for this analysis, with age (4 groups, n=30, 94, 122, 84), sex (2 groups, n=177, 152), location (3 groups, n=214, 81, 35), time in program (3 groups, n=139, 115, 73).

Now the question is about sample size in each of the cells when all these factors are crossed. I have 12 cells with n of 1, 10 cells with n of 2, 7 cells with n of 3, 5 cells with n of 4, 2 cells with n of 5, etc. Maximum is n of 37.
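To make the cell-size problem concrete: crossing 4 age groups, 2 sexes, 3 locations and 3 time-in-program groups gives 72 cells, so 330 children average fewer than five per cell. The sketch below shows one way such a tally could be produced. The level labels, the record layout and the function name `cell_size_summary` are assumptions for illustration, and the toy records are invented.

```python
from collections import Counter
from itertools import product

# Levels per factor as described in the post (labels are placeholders).
levels = {
    "age": [1, 2, 3, 4],
    "sex": [1, 2],
    "location": [1, 2, 3],
    "time_in_program": [1, 2, 3],
}

def cell_size_summary(records, levels):
    """Return (number of design cells, {cell size: how many cells have it})."""
    factors = list(levels)
    # Count how many records fall into each combination of factor levels.
    counts = Counter(tuple(r[f] for f in factors) for r in records)
    # Enumerate every cell of the full crossing, including empty ones.
    all_cells = list(product(*(levels[f] for f in factors)))
    sizes = Counter(counts.get(cell, 0) for cell in all_cells)
    return len(all_cells), dict(sizes)

# Made-up records, only to show the shape of the input.
toy = [
    {"age": 1, "sex": 1, "location": 1, "time_in_program": 1},
    {"age": 1, "sex": 1, "location": 1, "time_in_program": 1},
    {"age": 2, "sex": 2, "location": 3, "time_in_program": 2},
]
n_cells, size_table = cell_size_summary(toy, levels)
print(n_cells)  # 72 cells in a 4 x 2 x 3 x 3 design
```

Run on the real data, `size_table` would reproduce the "12 cells with n of 1, 10 cells with n of 2" summary and would also show how many cells are empty.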

I also have another analysis with 3 categorical predictors (all above excluding time in program) with which I test all 2-way interactions but not 3-way.

I’m seeing interesting and expected statistically significant results. So if no one gives me trouble with cell level sample size, I’m fine with the results, but I’m thinking of submitting this to a journal and getting into trouble because of cell n.

I tried googling this but did not find much information, and what I found gives mixed messages (don’t worry if there’s enough power, can even have zero as the cell size, must have at least n of 2, etc.), so I thought I’d ask this forum for feedback.

#### spunky

##### Doesn't actually exist
I’m seeing interesting and expected statistically significant results. So if no one gives me trouble with cell level sample size, I’m fine with the results, but I’m thinking of submitting this to a journal and getting into trouble because of cell n.
i think you're almost guaranteed to get into trouble with such extreme disparity in cell sizes... i mean, 12 cells with only 1 participant there? although i would have to look at your data/research design a little bit closer to make more precise comments, i would probably challenge the significance of all of your results just based on that (and well, if we factor in that you have cells with only 2 participants, 3 participants, etc, versus one with... 37? well, then the thing just keeps on getting more and more complicated).

are you using type II or type III sums of squares? i guess type III because it's the default in SPSS although you'd need to make the case as for why you're using that one... it doesn't always fix everything as most people would like to believe (i did research on types of sums of squares about a year ago or so so i can point you in a few directions, especially for interactions in unequal sample size designs on which type II sums of squares makes more sense....)

all in all, you're kind of stuck between a rock and a hard place here. unequal sample sizes are not as bad as most people think, but they do require a little bit more care in the analysis of your data. for instance... god, i dont remember how it goes but i think if your largest cell also has the largest variance your type I error rate goes up... i think that's in bradley's robustness article from the british journal of mathematical and statistical psychology (from the 70's? i dunno. i can find it for you though).

can you get more people in your sample? for a school project of sorts you may be able to get away with this but if you submit this to a journal as a manuscript, i do get a strong feeling that someone's gonna red-flag these extreme sample disparities...

by the way, where did you find that thing of cells with a count of 0? i definitely want to have a look at that...

#### Mary_Anne

##### New Member
Hi, Spunky, thank you so much for your comment! It's good to know that these cell sizes are problematic. I guess I knew they would be but wanted to ignore the issue. Now that I have an objective opinion that I should not ignore it, I have to work to address it. Unfortunately the data collection for the project wrapped up a few years ago so there is no chance to add more to the sample.

Yes, I'm using Type III sum of squares, and did not think much of it but should study up on it further. Please let me know if you think I should use Type II.

The place where I read a comment saying that an n of zero may be OK was the page below, from the second commenter, named admin.
http://www.statistical-consulting.info/minimum-cell-size-anova

What I am thinking of now is maybe I'll keep and present the table of means using categorical age and time in program just as descriptives, but for statistical testing, use ANCOVA, keeping sex and location as categorical variables and use age and time in program as continuous covariate. With sex x location, the cell sizes will be

| | Location 1 | Location 2 | Location 3 |
|---|---|---|---|
| Sex 1 | 117 | 43 | 17 |
| Sex 2 | 98 | 37 | 18 |

Do you think this will still be a problem? I understand that it's not just the sample size but the inequality of the sample sizes that's the problem. So with a minimum of 17 and a maximum of 117, it may still be a problem. What do you think?

Also, not trying to invent a way to ignore the original problem, but this is simply a question. I wonder, if the cell size crossing all 4 factors would matter in the case presented in the original question because I don't test 3-way or 4-way interactions?

I test 4 main effects, with n's:
age (4 groups, n=30, 94, 122, 84), sex (2 groups, n=177, 152), location (3 groups, n=214, 81, 35), time in program (3 groups, n=139, 115, 73),

and

two 2-way interactions which will be
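sex by time in program (cell sizes below; rows are time-in-program groups, columns are sex)

| Time in program | Sex 1 | Sex 2 |
|---|---|---|
| Group 1 | 78 | 61 |
| Group 2 | 59 | 56 |
| Group 3 | 40 | 36 |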

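location by time in program (cell sizes below; rows are time-in-program groups, columns are location)

| Time in program | Location 1 | Location 2 | Location 3 |
|---|---|---|---|
| Group 1 | 81 | 41 | 17 |
| Group 2 | 76 | 32 | 7 |
| Group 3 | 58 | 7 | 11 |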

No n of 1s or 2s. Do you think this is not as bad? Is it a correct way to think about it (=don't have to think about the cell sizes crossing all factors if not testing higher order interactions)? Curious - please let me know.
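As a quick check on the two-way counts quoted above, the sketch below sums the rows and columns of both tables and reports how unequal the cells are. It assumes the rows are the three time-in-program groups and the columns are location or sex; that reading is inferred from the totals and is not stated in the post. The function names are made up for illustration.

```python
# Rows: time in program (3 groups); columns: location (3 groups).
location_by_time = [
    [81, 41, 17],
    [76, 32, 7],
    [58, 7, 11],
]
# Rows: time in program (3 groups); columns: sex (2 groups).
sex_by_time = [
    [78, 61],
    [59, 56],
    [40, 36],
]

def margins(table):
    """Return (row totals, column totals, grand total) of a count table."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    return rows, cols, sum(rows)

def imbalance_ratio(table):
    """Largest cell count divided by the smallest."""
    cells = [n for row in table for n in row]
    return max(cells) / min(cells)

print(margins(location_by_time))  # ([139, 115, 76], [215, 80, 35], 330)
print(margins(sex_by_time))       # ([139, 115, 76], [177, 153], 330)
print(round(imbalance_ratio(location_by_time), 1))  # 11.6
```

The third row total comes out as 76, while 73 is listed for that group earlier, so the counts are worth rechecking against the raw data. The smallest location-by-time cells hold only 7 children, against 81 in the largest.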

Thank you! Even if you're not Spunky, please chime in as well!

#### spunky

##### Doesn't actually exist
ok, a few comments here and there to clear up some things.

i once read what i think is a very good, persuasive article as for why type II sums of squares is better than type III sums of squares, especially if you're interested in exploring interactions. it is:

Langsrud, Ø. (2003). ANOVA for unbalanced data: Use Type II instead of Type III sums of squares. Statistics and Computing, 13, 163-167.

so i guess at the very least you should try both type 2 and type 3 sums of squares and see what comes out. most people in the literature ignore this issue, however. type-3 sums of squares won the battle of sums of squares (there're actually 6 types of sums of squares) when SPSS was programmed because (a) type-3 is their default method and (more importantly) (b) it is the default method because it's the most likely type to give you statistically significant results. and we both know how people just luuuuuuv those p-values below .05 rite?
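To see what "try both" amounts to, here is a minimal sketch for one factor in an unbalanced 2 x 2 design, written as a comparison of nested regression models. It is not taken from Langsrud's paper or from SPSS. The scores and cell sizes (4, 2, 1, 3) are invented, the factors are effect-coded (+1/-1), and `rss` and `ss_for_a` are hypothetical helper names. Type II here means A adjusted for B only; Type III means A adjusted for B and the A x B interaction.

```python
def rss(columns, y):
    """Residual sum of squares from OLS of y on an intercept plus `columns`."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(X[0])
    # Normal equations (X'X) b = X'y, stored as an augmented matrix.
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(p)]
         + [sum(X[i][r] * y[i] for i in range(n))] for r in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for k in range(p):
        piv = max(range(k, p), key=lambda r: abs(A[r][k]))
        A[k], A[piv] = A[piv], A[k]
        for r in range(p):
            if r != k:
                f = A[r][k] / A[k][k]
                A[r] = [a - f * b for a, b in zip(A[r], A[k])]
    beta = [A[r][p] / A[r][r] for r in range(p)]
    return sum((y[i] - sum(b * x for b, x in zip(beta, X[i]))) ** 2
               for i in range(n))

def ss_for_a(a, b, y):
    """Type II and Type III sums of squares for factor A (effect-coded)."""
    ab = [ai * bi for ai, bi in zip(a, b)]
    type2 = rss([b], y) - rss([a, b], y)          # A adjusted for B only
    type3 = rss([b, ab], y) - rss([a, b, ab], y)  # A adjusted for B and A x B
    return type2, type3

# Invented unbalanced data: cell sizes 4, 2, 1 and 3.
a = [1, 1, 1, 1, 1, 1, -1, -1, -1, -1]
b = [1, 1, 1, 1, -1, -1, 1, -1, -1, -1]
y = [9.0, 11.0, 10.0, 10.0, 11.0, 13.0, 14.0, 10.0, 11.0, 12.0]

type2, type3 = ss_for_a(a, b, y)
print(round(type2, 2), round(type3, 2))  # 2.0 4.32
```

With equal cell sizes the two numbers coincide; with unequal cells they generally differ, which is why the choice matters for these data.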

i went to that webpage and reviewed what they said. please, be very careful with the way in which they use the language. when they say that "technically" ANOVA needs at least 2 measurements per cell, the point they're trying to make is that you need at least two points to get a mean and deviations from that mean. it is kind of hard, as you can imagine, to deviate from yourself when you're the only score there. however, they were just trying to make a technical point with that. such small cell sizes should never be advisable.

with regards to the 0-cell count that also took me by surprise. they are right though, but i think the author of that post was a little bit careless with regards on not citing anyone (i mean, 0-count cells? that's a HUGE claim to make). i found a good article that addresses the issue in:

Searle, S. R., Speed, F. M., & Henderson, H. V. (1981). Some computational and model equivalences in analyses of variance of unequal-subclass-numbers data. The American Statistician, 35, 16-33.

the mathematics in this article may be well outside of your area of interest, but the conclusion is a huge IT DEPENDS. for instance with your particular case, if you had 0-cells in the boy level of the gender factor but none on the girl-level and wanted to make claims about your location factor (crossed with gender), you can do it. you cannot claim much about any gender effect because of that 0-cell there. it just gets too complicated.

now, on to your case... with regards to using ANCOVA i guess the only recommendation would be to make sure that you're not missing too much data in certain sections of the covariate. for instance say you have quite a few people on the group of ages from, i dunno, 15-20 years old, then not so many on the 20-30 group and then once again more on the 30 and up, ANCOVA is going to overfit and you're very likely to get significance as a statistical artifact rather than because of a real effect.

missing data is going to be a concern of yours regardless of the interactions or main effects being tested. if it's missing, it's missing and it's going to screw up something somewhere. now, i like those new numbers i see in your tables a little bit better because they are not as unbalanced as those n's of 1 and 2 that you mentioned on your previous post... HOWEVER, just as you said, the issue is where the inequalities lie. if you're missing data where the groups are the most different, your ANCOVA is not going to catch it... or if you have a lot of data in one group and not a lot on another and those two groups are really not all that different from each other, you're almost guaranteed to find a statistically significant difference where there is none. it all comes down to paying a lot of attention to your data:

if you were to look at their variances, how different are they among groups?
what about running a missing value analysis? perhaps you dont have too much of a problem because you have mcar (missing completely at random) versus mar (missing at random) situation...(just for the record, we like mcar situations and we dont like mar situations)
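For the first of those two checks, a rough sketch of comparing group variances could look like this. The scores are invented, the cell labels are placeholders, and `variance_ratio` is a hypothetical helper. A large ratio of largest to smallest variance, combined with very unequal cell sizes, is the situation the robustness literature warns about.

```python
from statistics import variance

def variance_ratio(groups):
    """Largest sample variance divided by the smallest across groups."""
    variances = {name: variance(scores) for name, scores in groups.items()}
    return max(variances.values()) / min(variances.values()), variances

# Invented scores for three hypothetical cells of unequal size.
groups = {
    "cell_a": [10.0, 12.0, 11.0, 13.0, 14.0, 12.0, 11.0, 13.0],
    "cell_b": [9.0, 15.0, 12.0, 18.0],
    "cell_c": [11.0, 12.0, 13.0],
}
ratio, variances = variance_ratio(groups)
print(round(ratio, 1))  # 15.0
```

Cells with a single observation have no sample variance at all, so they would have to be dropped or merged before a check like this can run.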

anyways, as i keep on writing this stuff, i guess you can maybe just do your best with an ANCOVA and write up on your discussion section that a potential draw-back is unequal sample sizes. as a quantitative data analyst, i have a fascination with modeling data and doing the best (and usually most complicated, when required) analysis i can do so that the people who ask for my help in these situations are absolutely sure, beyond reasonable doubt, that their conclusions are substantiated by the analysis. i forget sometimes that not everyone is like that and, sometimes, reviewers dont know too much about stats beyond what they learnt at their master's level. people i've worked with have had their manuscripts rejected on the basis that the analysis was "too complicated"(<--WTF!?)

anyways, my 2 cents for your post rite there.

#### Karabiner

##### TS Contributor
What I am thinking of now is maybe I'll keep and present the table of means using categorical age and time in program just as descriptives, but for statistical testing, use ANCOVA
Since the categorization of interval data is a loss of information, I would have
tried the same strategy. But why ANCOVA, if multiple regression could do the
job?

Regards

K.

#### Mary_Anne

##### New Member
Spunky, Thank you so much for your response. I tried running the analyses using ANCOVA, and although some of the results changed, two main effects that I wanted to see are still statistically significant. Yeah!

You asked about the distribution of variables, and I know that age is relatively normally distributed. Time in program may be a bit skewed in the negative direction. The thing is, I’m dealing with a very young population here (2-5 year olds) and the length of time in the program is correlated with age (r= .5’s) (i.e., if you’re a 2 year old, there are only so many months/years you could have been in the program so far). Would this be a problem?

Also, now I have a new question (may start a new thread but let me ask first here).

As I described in my earlier posts, I have two, very similar analysis. In the first, I analyze the scores by age, sex, and location, with all 2-way interactions.

The second analyzes the same scores by age, sex, location, and time in program, with sex by time in program and location by time in program interactions only.

The first analysis was sort of a supplement to the general descriptive table showing the scores in each category. The second purpose is to show “validity” of the scores (the assessment checklist was developed for the purpose of this study) that it shows the expected increase by age. Other variables are in there just in case these are confounders.

The second analysis tests the hypothesis that the time in the program helped children increase their scores. So time in the program is the focal variable here. The results will be attached to tables separate from 1st analysis showing the results by length of time in the program.

The question is, do you think it would be better to do one multivariate test including all variables and interactions? I’m afraid it’ll get messy with too many interactions, but it should be possible. Also, when I analyzed the data, one analysis showed a sig result by age, whereas the other didn’t, although age is in both analyses and the results should be the same (I understand that the results are affected by the other variables in the analysis and that’s why this happened, but this doesn’t look real good).

What do you think?

Also, Karabiner, I guess I’m not sure why ANCOVA would be better than multiple regression. I originally selected ANOVA because I thought that was better suited to the audience. However, I decided to use age and time in program as continuous (because of the cell size issue) – so I guess I naturally selected ANCOVA because it’s closer to ANOVA. I’m not sure what I would have done if I hadn’t started with ANOVA initially. One advantage would be that I’ll get adjusted means with ANCOVA but I don’t plan to present adjusted means for my audience.

Would I get the same results anyways either way?

Thanks!

#### spunky

##### Doesn't actually exist
You asked about the distribution of variables, and I know that age is relatively normally distributed. Time in program may be a bit skewed in the negative direction.

The thing is, I’m dealing with a very young population here (2-5 year olds) and the length of time in the program is correlated with age (r= .5’s) (i.e., if you’re a 2 year old, there are only so many months/years you could have been in the program so far). Would this be a problem?
depends. if you have a lot of kids entering the program very early and a lot entering very late but with not many in between, then you're in trouble because if you use that as a covariate i can guarantee you it is going to overfit. if you've got a few kids at the beginning, a few in the middle and a few at the end, then it shouldn't matter that much.

now, if you have kids entering the same program, at different points in time and through different periods (i dunno, like "when little Johnny was 2 he entered the program for 6 months, then when he was 3 he entered the program again for 2 weeks, then when he turned 4 he just didnt participate in the program...") then prepare to do some pretty hardcore stats if you plan to untangle the effects of that...

As I described in my earlier posts, I have two, very similar analysis. In the first, I analyze the scores by age, sex, and location, with all 2-way interactions.

The second analyzes the same scores by age, sex, location, and time in program, with sex by time in program and location by time in program interactions only.

The first analysis was sort of a supplement to the general descriptive table showing the scores in each category. The second purpose is to show “validity” of the scores (the assessment checklist was developed for the purpose of this study) that it shows the expected increase by age. Other variables are in there just in case these are confounders.

The second analysis tests the hypothesis that the time in the program helped children increase their scores. So time in the program is the focal variable here. The results will be attached to tables separate from 1st analysis showing the results by length of time in the program.

The question is, do you think it would be better to do one multivariate test including all variables and interactions?
i'm not sure you can do that. you said you analyzed the same scores twice, right? so unless you're willing to do profile analysis, you really only have 1 dependent variable and, hence, you can only use univariate methods. i'm starting to get the feeling that you're not gonna be able to use ANOVA/ANCOVA to answer what you're looking for... for instance, if you run the same analysis on the same sample twice, you'll need to adjust for type-1 error. if i'm understanding you correctly, you're doing two ANCOVAs and you're adding more variables on the 2nd analysis... but it is the same questionnaire scores, right? so... what happens when we do the same analysis on the same data more than once?

Also, when I analyzed the data, one analysis showed a sig result by age, whereas the other didn’t, although age is in both analyses and the results should be the same (I understand that the results are affected by the other variables in the analysis and that’s why this happened, but this doesn’t look real good).
well, you have the sample size issue there as well rite? but, in general, i guess you have a pretty good grasp of what happened here. perhaps age's variance is redundant when paired with other variables and, hence, you missed that effect.
What do you think?

Also, Karabiner, I guess I’m not sure why ANCOVA would be better than multiple regression. I originally selected ANOVA because I thought that was better suited to the audience. However, I decided to use age and time in program as continuous (because of the cell size issue) – so I guess I naturally selected ANCOVA because it’s closer to ANOVA. I’m not sure what I would have done if I hadn’t started with ANOVA initially. One advantage would be that I’ll get adjusted means with ANCOVA but I don’t plan to present adjusted means for my audience.

Would I get the same results anyways either way?
you will and, if you ask me, you'd even get better results. multiple regression is much more flexible when it comes to handling unequal samples and interactions. i didnt suggest it for you because you specifically said that you did not want to use it because of your audience. however, i can see that you're asking some pretty complex questions which, of course, will require more complex analysis. i dont think you'll be able to answer the questions you want only through ANOVA/ANCOVA (besides, let's face it... using ANCOVA just to get out the small cell sizes is not a very good reason, lolz.)

i would strongly suggest to you to either follow Karabiner's advice and frame this as a multiple regression problem or simplify your research question so that ANOVA can handle it better.... besides, ANOVA is regression (a restricted type with categorical instead of continuous predictors) but they do exactly the same thing, with the advantage that regression is far more flexible. i dunno, i think you could use this as an opportunity to educate your intended audience in the benefits of regression analysis...
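The "ANOVA is regression" point can be seen in the smallest possible case, a two-group comparison. In the sketch below the scores are invented and `simple_regression` is a hypothetical helper. Regressing the score on a 0/1 dummy gives an intercept equal to the mean of the group coded 0 and a slope equal to the difference between the two group means.

```python
from statistics import mean

def simple_regression(x, y):
    """Least-squares intercept and slope of y on a single predictor x."""
    mx, my = mean(x), mean(y)
    slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
             / sum((xi - mx) ** 2 for xi in x))
    return my - slope * mx, slope

# Invented scores: dummy = 0 for one group, 1 for the other.
dummy = [0, 0, 0, 0, 1, 1, 1]
score = [10.0, 12.0, 11.0, 13.0, 15.0, 14.0, 16.0]

intercept, slope = simple_regression(dummy, score)
mean_0 = mean(s for d, s in zip(dummy, score) if d == 0)
mean_1 = mean(s for d, s in zip(dummy, score) if d == 1)

print(round(intercept, 2), round(mean_0, 2))       # 11.5 11.5
print(round(slope, 2), round(mean_1 - mean_0, 2))  # 3.5 3.5
```

With more groups or more factors the same idea holds, using one dummy per non-reference level, which is why a table of group means and a table of regression coefficients can describe the same fitted model.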

#### Mary_Anne

##### New Member
i'm not sure you can do that. you said you analyzed the same scores twice, right? so unless you're willing to do profile analysis, you really only have 1 dependent variable and, hence, you can only use univariate methods. i'm starting to get the feeling that you're not gonna be able to use ANOVA/ANCOVA to answer what you're looking for... for instance, if you run the same analysis on the same sample twice, you'll need to adjust for type-1 error. if i'm understanding you correctly, you're doing two ANCOVAs and you're adding more variables on the 2nd analysis... but it is the same questionnaire scores, right? so... what happens when we do the same analysis on the same data more than once?
Hi, Spunky, thank you again for your informative response!

One part was confusing, probably because I wrote my original comment in a confusing way – in my use of the words univariate and multivariate. I was trying to write fast and wasn’t thinking. I didn’t mean ANOVA vs MANOVA etc. Yes, I do have only one dependent variable (actually I have two, but don’t intend to use them in a MANOVA kind of way). I do understand ANOVA is kind of the same as multiple regression and I guess I was using the word “multivariate” (incorrectly) to mean “analysis with multiple predictors like in multiple regression” --it just meant putting all predictors in at the same time.

So my question was: is it OK to do two analyses (or would that raise a red flag):
1) Dep v predicted by sex, age (continuous), location, all 2 way interactions
2) Dep v predicted by sex, age (continuous), location, time in program (continuous), time in program x sex, time in program x location
Or should I combine them into 1 analysis:
Dep v predicted by sex, age (continuous), location, time in program (continuous), 2 way interactions included in 1) above, time in program x sex, time in program x location

I haven’t decided whether to use ANCOVA or multiple regression, but the N will be the following.
| | Location 1 | Location 2 | Location 3 |
|---|---|---|---|
| Sex 1 | 117 | 43 | 17 |
| Sex 2 | 98 | 37 | 18 |

I think though, from your quote above that you don’t think it’s a good idea to run analyses on one dep variable twice – due to type I error, etc. So that’s why I thought of combining it into one analysis…however, you mention…

i'm starting to get the feeling that you're not gonna be able to use ANOVA/ANCOVA to answer what you're looking for...
Could you tell me why you think so?

Thank you so much for your help!

#### spunky

##### Doesn't actually exist
well you are running the same analysis on the same data twice, with the only exception that one has one more variable and different interaction terms... but still, same DV and same IV's (plus a few others)... hence my concern about type-1 error and why, in such case, you'd need to correct for experimentwise error rate.
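The size of the experimentwise problem for two analyses can be put in numbers. This is a back-of-the-envelope sketch with hypothetical function names. The first formula assumes independent tests, which two models fitted to the same scores are not, so it is only a rough guide. The Bonferroni threshold holds whatever the dependence between the tests.

```python
def familywise_error(alpha, n_tests):
    """Chance of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** n_tests

def bonferroni_alpha(alpha, n_tests):
    """Per-test threshold that keeps the familywise rate at or below alpha."""
    return alpha / n_tests

print(round(familywise_error(0.05, 2), 4))  # 0.0975
print(bonferroni_alpha(0.05, 2))            # 0.025
```

So testing the same effect in both analyses at .05 roughly doubles the chance of a false positive, and a Bonferroni correction would ask for p below .025 in each.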

now, why are you running analysis twice? i know you said:

"The first analysis was sort of a supplement to the general descriptive table showing the scores in each category. The second purpose is to show “validity” of the scores (the assessment checklist was developed for the purpose of this study) that it shows the expected increase by age. Other variables are in there just in case these are confounders"

what's this "supplement to the general descriptive table"? now, if i get what you're doing, you kind of want to say that there are differences predicted by sex, age and location and then that there are still differences if you add time in the program after all that? how are you going to tell that from an ANCOVA? if i remember correctly, somewhere there on your previous posts (i believe the 2nd one) you said:

"What I am thinking of now is maybe I'll keep and present the table of means using categorical age and time in program just as descriptives, but for statistical testing, use ANCOVA, keeping sex and location as categorical variables and use age and time in program as continuous covariate"

covariates are not predictors. you'll only be able to make claims about the sex and location variables... however, on your 3rd post you say:

"The second analysis tests the hypothesis that the time in the program helped children increase their scores. So time in the program is the focal variable here. The results will be attached to tables separate from 1st analysis showing the results by length of time in the program."

and from what i'm reading on your last post, so time in the program is both covariate and predictor of changes in children scores? i'm very confused now.

if you were willing to use multiple regression as karabiner suggested, however, you could always just do a hierarchical linear regression on SPSS, adjust for age and location, and see whether time in the program still predicts some of the variance for scores (and its respective interactions).

#### Mary_Anne

##### New Member
Hi, Spunky

Thank you for the info! I have things that I haven't been sure of in stats and I see that these things are relevant here and are putting me in trouble.

I thought covariate was just a way of calling continuous (vs. categorical) predictors (I knew that it had the meaning of variable that you want to control for, but I guess I thought it had more general meaning).

Since my focal variable (or variables that I want to see significant results in) are age and time in program, then I'm doomed using ANCOVA!

However, another question - in this thread it was mentioned that ANOVA was pretty much the same as multiple regression (using dummy variables etc.). Then ANCOVA is not equivalent to multiple regression (using dummy and continuous predictors)? With multiple regression, you would be able to draw conclusions about continuous predictors, correct? Or is ANCOVA not equivalent to multiple regression in the same sense ANOVA is?

Now I have to rethink my whole approach to this study. I really don't think the audience and the journal that I have in mind are up for hierarchical multiple regression... This was supposed to be a very simple, descriptive study. It's just that I have too many things that I feel I should examine the contribution of (sex, location - although they haven't shown up sig in a meaningful way in any of the analyses...).

Again, thank you for your input, I'll have to do some more thinking...

#### spunky

##### Doesn't actually exist
Hi, Spunky

Thank you for the info! I have things that I haven't been sure of in stats and I see that these things are relevant here and are putting me in trouble.

I thought covariate was just a way of calling continuous (vs. categorical) predictors (I knew that it had the meaning of variable that you want to control for, but I guess I thought it had more general meaning).

Since my focal variable (or variables that I want to see significant results in) are age and time in program, then I'm doomed using ANCOVA!
people sometimes call predictors covariates in multiple regression... but in ANCOVA it has a very specific meaning: it's the variable you adjust for when you estimate the effect of the predictors you're interested in

However, another question - in this thread it was mentioned that ANOVA was pretty much the same as multiple regression (using dummy variables etc.). Then ANCOVA is not equivalent to multiple regression (using dummy and continuous predictors)? With multiple regression, you would be able to draw conclusions about continuous predictors, correct? Or is ANCOVA not equivalent to multiple regression in the same sense ANOVA is?
let's say that if ANOVA is one step from multiple regression, ANCOVA is already knocking at its door. why do you think that one of the assumptions in ANCOVA is that covariates are continuous and the homogeneity of regression slopes? ANCOVA uses multiple regression to give you the adjusted marginal means. it's basically doing regression on the covariates so it can get the means that will later be used for analysis in ANOVA
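The "regression on the covariates to get the means" step can be sketched for the textbook case of one factor and one covariate. Everything in the example is invented for illustration: the two groups, the ages, the scores and the helper name `ancova_adjusted_means`. It uses the usual pooled within-group slope, which is where the homogeneity-of-slopes assumption comes in, and shifts each group mean to the grand mean of the covariate.

```python
from statistics import mean

def ancova_adjusted_means(groups):
    """groups maps a label to (covariate values, outcome values)."""
    grand_x = mean(x for xs, _ in groups.values() for x in xs)
    sxy = sxx = 0.0
    for xs, ys in groups.values():
        mx, my = mean(xs), mean(ys)
        sxy += sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sxx += sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx  # pooled within-group regression slope
    adjusted = {g: mean(ys) - slope * (mean(xs) - grand_x)
                for g, (xs, ys) in groups.items()}
    return slope, adjusted

# Invented data: covariate is age in years, outcome is the score.
groups = {
    "site_a": ([2.0, 3.0, 4.0], [10.0, 12.0, 14.0]),
    "site_b": ([4.0, 5.0, 6.0], [13.0, 15.0, 17.0]),
}
slope, adjusted = ancova_adjusted_means(groups)
print(slope)     # 2.0
print(adjusted)  # {'site_a': 14.0, 'site_b': 13.0}
```

The raw means are 12 and 15, but once the one-year age gap between the sites is taken out the order flips, which is the kind of adjustment ANCOVA reports.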

Now I have to rethink my whole approach to this study. I really don't think the audience and the journal that I have in mind are up for hierarchical multiple regression... This was supposed to be a very simple, descriptive study. It's just that I have too many things that I feel I should examine the contribution of (sex, location - although they haven't shown up sig in a meaningful way in any of the analyses...).

if that's too much for your intended audience then i dunno what else you can do to both do the analysis right and still please your audience...

#### Mary_Anne

##### New Member
Thank you Spunky, I appreciate all the learning that I'm getting from you!

I was thinking of hierarchical regression, not the hierarchical linear model....unfortunately, I think hierarchical multiple regression is too complex for the audience, but I'll try running it.

Since location has 3 levels, I will create 2 dummy variables. For interaction of time in program (continuous variable) and location, do I center the time in program (subtract the mean) and multiply with the two dummies (creating 2 variables representing the interaction)?

If I have interaction variables using centered variables, do I enter the centered versions of the variables for main effects as well?
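The construction being asked about can be written out as follows. This only shows the mechanics: two dummies for a three-level location, time in program centered at its mean, and two product columns. The function name, the choice of location 1 as the reference level and the months are all made up. Product terms of this kind are a common way to let the slope for a continuous predictor differ by group, and the centered version of the variable is also the one entered as the main effect.

```python
from statistics import mean

def build_design(location, time_in_program, reference=1):
    """Dummy-code a 3-level location, center time, and form product terms.

    Returns one row per child: [loc2, loc3, time_c, loc2*time_c, loc3*time_c].
    """
    others = sorted(set(location) - {reference})
    t_mean = mean(time_in_program)
    rows = []
    for loc, t in zip(location, time_in_program):
        dummies = [1.0 if loc == lvl else 0.0 for lvl in others]
        t_c = t - t_mean  # centered time in program
        rows.append(dummies + [t_c] + [d * t_c for d in dummies])
    return rows

# Invented values: location codes and months in the program.
location = [1, 1, 2, 2, 3, 3]
months = [6.0, 12.0, 18.0, 24.0, 30.0, 36.0]
design = build_design(location, months)
print(design[2])  # [1.0, 0.0, -3.0, -3.0, 0.0]
```

Each row holds the predictors for one child; the dependent variable is left as it is.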

Thank you so much for your mentoring.

Also, I have a question. The second (last) step of the hierarchical regression seems to give me the same coefficients/sig levels etc. for each predictor as if I didn't use hierarchical regression but just entered all variables at the same time.

Does this mean the main benefit of using hierarchical regression is that it gives me R squared change? If I'm not interested in R squared change, it'll be the same if I use just simple multiple regression?

Thank you!

#### spunky

##### Doesn't actually exist
Since location has 3 levels, I will create 2 dummy variables. For interaction of time in program (continuous variable) and location, do I center the time in program (subtract the mean) and multiply with the two dummies (creating 2 variables representing the interaction)?
uhmm... i dont think you can do that. the centering-and-cross-multiplying trick only works when you're doing continuous X continuous interactions. you have continuous X categorical, in which case (if i'm correct) what you need to do is contrast coding and not dummy coding. unfortunately, i'm gonna have to tip-toe around this question because it's been years since the last time i did something like this. nonetheless, you can always look into the bible of all bibles for multiple regression in the social/behavioural/health sciences: Cohen et al.'s "Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences" for guidance. Chapter 7 and Chapter 8 of the new edition, to be more specific.

If I have interaction variables using centered variables, do I enter the centered versions of the variables for main effects as well?
yep, everything gets centered except for the dependent variable. although see my previous answer with regards to centering and continuous X categorical interactions because things are going to change.

Also, I have a question. The second (last) step of the hierarchical regression seems to give me the same coefficients/sig levels etc. for each predictor as if I didn't use hierarchical regression but just entered all variables at the same time.
Does this mean the main benefit of using hierarchical regression is that it gives me R squared change? If I'm not interested in R squared change, it'll be the same if I use just simple multiple regression?
couple of things here. first, if you're using centered variables and doing the interactions the way you mentioned, the regression coefficients might be wrong which could explain why you're not getting any change in them because it is very strange that they dont change at all... so i do think this has to do with the fact that you're using dummy coding when you should use contrast coding.
with regards to R^2-change, you're really not so much after the regression coefficients as you're after that R^2 change. you mentioned previously that sex and location were being treated as potential confounders and what you were really after was an effect for time in the program, right? well, if after you adjust for sex and location, the increase in R^2 is not statistically significant when you add the time in the program variable then you can't claim that time in the program had any effect "above and beyond" so to speak, from sex and location. meaning that you'd be just as good explaining change in scores just with sex and location as you would be if you added time in the program (i.e. there really is no effect of time in the program).
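The test on the R^2 change is the standard F test for comparing nested regression models. A small sketch, with a hypothetical function name: n = 330 is the sample size mentioned in the thread, while the two R^2 values are invented. Step 1 is assumed to hold three predictors (one sex dummy, two location dummies) and step 2 adds time in program.

```python
def r2_change_f(r2_reduced, r2_full, n, k_reduced, k_full):
    """F statistic and degrees of freedom for the R-squared change."""
    df1 = k_full - k_reduced  # number of predictors added in the new step
    df2 = n - k_full - 1      # residual df of the larger model
    f = ((r2_full - r2_reduced) / df1) / ((1 - r2_full) / df2)
    return f, df1, df2

# Hypothetical values: only n comes from the thread.
f, df1, df2 = r2_change_f(r2_reduced=0.20, r2_full=0.25, n=330,
                          k_reduced=3, k_full=4)
print(round(f, 2), df1, df2)  # 21.67 1 325
```

The resulting F is compared against an F distribution with (df1, df2) degrees of freedom. When a step adds several terms at once, a significant change says the block as a whole matters but not which term is responsible.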

now, one other thing that i'm very curious about is whether the same children entered the program at different times throughout, i dunno, the months or years that it lasted. i just want to make sure you don't have nesting factors, in which case you'd be in trouble. my example would be: "when little Johnny was 2 years old he entered the program for 3 months, then the following year he was on the program for just 2 weeks, then the following year he was in for one week, then got out, and then was back in for two months".

second, and this refers to your previous post where you said this was "supposed to be a very simple, descriptive study": if you're feeling a little bit overwhelmed by all the analysis and all the issues you have to address, for the sake of simplicity and efficiency of effort, just do the best you can and mention somewhere the potential drawbacks in your analysis (like the unequal sample size situation). if you wanna work this into a publishable manuscript, then maybe you'll have to dwell a little bit more on the stats, but i don't want the number crunching to take the main stage as opposed to the conclusions you derive from such analysis. yeah, i know i'm promoting bad science here but hey, we all need to start learning somewhere, right?

#### Mary_Anne

##### New Member
Hi, Spunky

Thank you again for your response!

Good thing that I asked about how to handle the interactions - I was about to use the wrong coding! I will look into this a bit, although upon thinking and reading your comment at the end, I'm leaning towards going back to an original, simpler (although more flawed) approach.

I also would like to make sure that I was clear that when using hierarchical multiple regression, coefficients for the independent variables that were in step 1 did change in step 2 - it's just that all coefficients in step 2 did not seem to differ from when I entered all independent variables at the same time (not using hierarchical regression). However, I do understand the importance of R squared - although in my model, the 2nd step had multiple variables (time in program, age) and interactions (multiple including time in program), so even if R squared changed significantly, it's not clear which independent variable contributed - so I felt that I needed to go back to the coefficients.

FYI - it seems like there were no participants who had multiple episodes in this program -- at least ones included in the data collection, and none that staff recorded on my form.

Like you say, the best way to analyze may be a bit over the head for my audience and for myself - even though flawed, I may revert back to my initial (pre-this forum) approach and simplify my question and the model (ignore some of the predictors I thought I should try to control, but not crucial). I'll still try to submit it to be published, but I won't be so shocked if it's rejected or requires significant revision.

Again, Spunky, thank you so much for your guidance!

#### spunky

##### Doesn't actually exist
Like you say, the best way to analyze may be a bit over the head for my audience and for myself - even though flawed, I may revert back to my initial (pre-this forum) approach and simplify my question and the model (ignore some of the predictors I thought I should try to control, but not crucial). I'll still try to submit it to be published, but I won't be so shocked if it's rejected or requires significant revision.
yah, i think that's the way to go. just do remember to mention a few of the drawbacks we explored here (part of being a good researcher is always accepting one's experimental design flaws) and, heck, good thing about sending stuff for publishing is that they always tell you what you did wrong so you can go back and try again, right? my advisor sits on quite a few editorial boards as the research methods/statistics expert so if he hands me a paper that looks remotely similar to yours, i'll work extra hard on the recommendations, lol.

good luck!