I started off writing a long lecture on why the analysis you have in mind--which I readily acknowledge is the traditional and straightforward way to handle this type of data--is demonstrably flawed, and why the general strategy needs to finally die a quiet death... but the post was starting to go far outside the scope of this thread, so I'll try to be more succinct and to the point. (If the post still seems a little long, well, you should have seen the first draft.)

By calculating proportion correct for each participant and then testing for group differences in these proportions, you ignore question-to-question variability in difficulty (e.g., there are a large number of distinct ways for a participant to score 0.60 correct) and thereby implicitly treat questions as a fixed factor nested under participants. Not only are questions *not* nested under participants here--every participant responds to the same set of 40 questions, so questions are clearly *crossed* with participants--but depending on the context you're working in, you probably wouldn't be too happy about assuming questions are fixed either. Can the questions in your study reasonably be considered a sample from a larger (theoretical) population of potential questions? Might the data look a little different if you had used some other sample of questions? If so, then your analysis needs to explicitly account for question-to-question variability if you are to avoid systematic bias in your results.
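To put a number on "a large number of distinct ways": with 40 questions, a score of 0.60 means 24 correct out of 40, and the count of response patterns producing that proportion is a simple binomial coefficient. A quick check (the 40/24 figures come straight from the setup above):

```python
from math import comb

# Number of distinct correct/incorrect patterns that all yield
# exactly 24/40 = 0.60 proportion correct
n_patterns = comb(40, 24)
print(n_patterns)  # tens of billions of patterns, all collapsed to "0.60"
```

Every one of those patterns is scored identically by the proportion-correct summary, even though they can involve very different mixes of easy and hard questions.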

The appropriate analysis here is a logit or probit mixed model in which correct responses are predicted from the online vs. face-to-face factor, with crossed random effects for participants and test questions. This is very similar to an item response theory (IRT) style of analysis (the two may in fact be equivalent here, but I haven't studied enough IRT to know whether that is actually true). Further reading and some instruction on how to conduct such an analysis can be found in the papers cited below.
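As a rough sketch of what fitting such a model looks like: in R this would typically be done with lme4's glmer; in Python, one option is statsmodels' variational Bayes implementation of a binomial mixed GLM. Everything below--the simulated data, the group coding, the effect sizes--is made up purely for illustration; only the model structure (a group fixed effect plus crossed random intercepts for participants and questions) is the point.

```python
import numpy as np
import pandas as pd
from statsmodels.genmod.bayes_mixed_glm import BinomialBayesMixedGLM

rng = np.random.default_rng(1)
n_part, n_items = 40, 40

# Long-format data: one row per participant x question response
df = pd.DataFrame({
    "participant": np.repeat(np.arange(n_part), n_items),
    "question": np.tile(np.arange(n_items), n_part),
})
df["group"] = (df["participant"] >= n_part // 2).astype(float)  # 0 vs 1, e.g. face-to-face vs online

# Simulated participant ability and question difficulty (illustrative values)
u = rng.normal(0, 0.5, n_part)
v = rng.normal(0, 1.0, n_items)
logit = (0.5 * df["group"].to_numpy()
         + u[df["participant"].to_numpy()]
         + v[df["question"].to_numpy()])
df["correct"] = (rng.random(len(df)) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

# Group fixed effect; crossed random intercepts for participants and questions
model = BinomialBayesMixedGLM.from_formula(
    "correct ~ group",
    {"participant": "0 + C(participant)", "question": "0 + C(question)"},
    df)
result = model.fit_vb()
print(result.summary())
```

The key feature is that the two variance-component formulas are specified side by side rather than nested, which is exactly the crossed structure the design calls for.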

I have to stress that this is not an esoteric piece of statistical pedantry--the amount of positive bias you may be introducing by ignoring random effects can be *quite substantial*, depending on various details of your study. In various simulations I have conducted (albeit with continuous rather than categorical responses), it has not been at all uncommon to see empirical Type I error rates exceed the nominal .05 rate by more than an order of magnitude. So doing this right matters.
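The mechanism behind that inflation is easy to demonstrate. Below is a small simulation of my own devising (all parameter values are invented for illustration, not taken from any real study): data are generated under the null of no group effect, but questions vary in how they respond to the group manipulation (a random item-by-group interaction). Aggregating to per-participant proportions and running an ordinary t-test then rejects far more often than the nominal 5%:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def naive_type1_rate(n_sims=1000, n_per_group=20, n_items=40,
                     sd_participant=0.5, sd_item=1.0, sd_item_slope=1.5):
    """Generate null data (true group effect = 0) with crossed random
    effects, then run the naive t-test on per-participant proportions."""
    group = np.repeat([0.0, 1.0], n_per_group)  # between-subjects factor
    false_positives = 0
    for _ in range(n_sims):
        u = rng.normal(0, sd_participant, 2 * n_per_group)  # participant intercepts
        v = rng.normal(0, sd_item, n_items)                 # question difficulty
        w = rng.normal(0, sd_item_slope, n_items)           # item-by-group interaction
        # logit of P(correct) for each participant x question cell
        logit = u[:, None] + v[None, :] + w[None, :] * group[:, None]
        y = rng.random(logit.shape) < sigmoid(logit)
        props = y.mean(axis=1)                              # proportion correct
        _, p = ttest_ind(props[:n_per_group], props[n_per_group:])
        false_positives += p < 0.05
    return false_positives / n_sims

print(f"empirical Type I error rate: {naive_type1_rate():.3f}")
```

With these (arbitrary) settings the empirical rate lands well above the nominal .05; setting `sd_item_slope=0` brings it back down to roughly .05, which is precisely the point--the bias comes from treating a question-level source of variability as if it were fixed.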

Baayen, R. H., Davidson, D. J., & Bates, D. M. (2008). Mixed-effects modeling with crossed random effects for subjects and items. *Journal of Memory and Language, 59*, 390-412.

Jaeger, T. F. (2008). Categorical data analysis: Away from ANOVAs (transformation or not) and towards logit mixed models. *Journal of Memory and Language, 59*, 434-446.