[TSBC] The R-Book, chapter 11, Analysis of Variance.

Hi Fellow TS'rs,

Welcome to our second book club session from our first book, Michael Crawley's The R Book. We continue our statistical journey with chapter 11, which will bring us one step closer to everyone's favorite chapter: GLMs.

The rules:

Discussion will start within this thread now. You may post remarks, improvements on Crawley, problems you are having, or deeper philosophical questions, but please keep on topic and don't let your posts trail off. The discussion will continue until Tuesday the 4th of August, ending on the international date line, after which the discussion thread will be closed (officially, also for mods). My advice to lazybums (bryangoodrich's words ;) ) like myself is to start the chapter early, as leaving it all for the last day does not really work.

Happy reading everyone!

note: if you want to discuss things like the book club rules, or to suggest other books, use this thread. The thread below is for the chapter only.

EDIT: Deadline adjusted following complaints that it fell within the summer holiday.


Probably A Mammal
I was wondering if the stuff in Box 11.1 (pages 452-3) can be put into terms of matrix operations. I hate looking at summand notation; I have a feeling it would be much simpler and more intuitive in terms of matrix operations, but I don't have the familiarity to see that connection if it exists.


Probably A Mammal
I was also wondering about the Effect sizes section. I never knew about plot.design or how to interpret the summary output on an aov object. Nevertheless, I understand the general idea of an effect size, especially with the simpler interpretations of things like, say, \(R^2\), but what does the plot.design graphic show? I don't understand how to interpret this as an effect size. It shows that the means of soils 1 and 3 are far more distant from the overall mean than the mean of soil 2 is. So .... ? Maybe using the more complicated examples on page 178+ would be more fruitful?


Probably A Mammal
Okay, I have a confession to make: I'm a total n00b when it comes to ANOVA. I mean, I picked up the basics, and am currently working on learning more about it (at least from an applied perspective--i.e., the 2nd half of Applied Linear Statistical Models). So, I have some questions for you guys about what was said or not said in this section.

(1) From what I understood on pages 468-9, the reason agrimore and supersupp were combined (why as "best"?) was that supersupp was not significantly different from agrimore. Likewise, supergain and control were grouped (why as "worst"?) because they were significantly different from agrimore (and supersupp), but not from each other. Thus, he generates the second model. Is this correct? On my first reading it wasn't stated explicitly; Crawley merely stated the information and then stated his adjustments for a new model.

(2) What is the anova(model, model2) presenting? I'm familiar with the F-test in ANOVA/regression comparing a model with the model lacking the explanatory terms and how we can compare a model with nested models (e.g., anova(lm(y ~ a + b + c), lm(y ~ a + b)) ). However, model and model2 are not nested. So, what exactly is the test and conclusion?
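To make the question concrete, here's a toy version of the simplification with made-up data (not Crawley's numbers). Having written it out, I suspect the collapsed model actually is nested in the full one -- every fit the 2-level factor can produce is a special case of the 4-level fit -- which would make the F-test valid, but I'd welcome confirmation.

```r
# Hypothetical data: 4 treatments, two of which share a higher mean
set.seed(1)
treatment <- factor(rep(c("agrimore", "supersupp", "supergain", "control"),
                        each = 8))
y <- rnorm(32, mean = ifelse(treatment %in% c("agrimore", "supersupp"), 12, 9))

# Full model: one parameter per treatment level
model <- lm(y ~ treatment)

# Simplified model: collapse the 4 levels into "best"/"worst",
# mirroring Crawley's aggregation
grouped <- factor(ifelse(treatment %in% c("agrimore", "supersupp"),
                         "best", "worst"))
model2 <- lm(y ~ grouped)

# F-test: does the extra detail in the 4-level model buy anything?
anova(model2, model)
```

If that reading is right, the anova() output is testing whether distinguishing within the "best" pair and within the "worst" pair explains significantly more variation than the 2-level grouping alone.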


Super Moderator
Hmm. I've really struggled to find time for this and have ended up having to return the Crawley book to the library (someone else requested it). I might have to bail on this club for the moment, but maybe I'll rejoin sometime in the future :)

Best wishes and happy R-ing to you all...


Probably A Mammal
Sad that nobody participated in this. I'm going to revitalize some of these book clubs in another format I've mentioned in the chatbox before (at-will participation). I'm posting here because, hey, I was going over The R Book in my free time today to cover various ANOVA topics.

(a) Still wondering if anyone has matrix versions to shorthand the stuff in Box 11.1 (p 452). Any slides or books I have on it use summation notation, which I understand and recognize better now than I did when I asked the question, but I'm weak on my matrix stuff. Any resources, people? I'll have to scour some ANOVA class notes somewhere on the interwebs then!

(b) Still wondering how plot.design shows effect sizes. I guess I'm just used to numerical interpretations. How would you interpret the graphic linguistically ("in words")? Clay (2) has a mean close to the grand mean, but below it. Do we say it has little effect? But loam (3) is much farther away and above it, so it has a large effect? What about sand (1)?
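To have something concrete to point at, here's a toy example with made-up numbers (chosen so clay sits exactly on the grand mean); the way I currently read the plot, each level's "effect" is just the vertical distance between its tick (the level mean) and the grand-mean line:

```r
# Hypothetical soil data: sand low, clay middling, loam high
yields <- data.frame(
  soil = factor(rep(c("sand", "clay", "loam"), each = 4)),
  y = c(6, 8, 7, 9,   10, 12, 11, 13,   14, 16, 15, 17)
)

# One vertical axis per factor; a tick at each level mean and a
# longer tick at the grand mean
plot.design(y ~ soil, data = yields)

# The same information numerically: level mean minus grand mean
eff <- with(yields, tapply(y, soil, mean) - mean(y))
eff
# here sand is -4 (large negative effect), clay is 0 (no effect),
# loam is +4 (large positive effect)
```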

(c) I understand the simplification of the model and the choice of 'worst' and 'best' aggregation of the 4 prior factor levels. I still don't understand the use of anova. I thought the results were meaningless if the two models weren't nested, but it also applies to this simplification? So is simplification a pseudo-nesting, then? Or is the norm I've read repeatedly, about only using anova on nested models, wrong? Can you simply do an F-test for lack of fit between any two models in search of one that performs 'better'?

(d) I'll add that I find the MANOVA section very lacking (I read it today). I've never covered it before, and while it is very easy to execute, I have no idea how to interpret the results or whether there is more to it. At least in ANOVA the author goes over how the coefficient on a factor level is its mean's displacement from the reference level's mean (p 459-460). That was important insight into interpreting summary.aov. There is no such detail (if it exists) about MANOVA. Do I just review the relevant tests the summary can utilize? Anybody have further resources on this topic? I'm more interested in interpretation than in the MANOVA model per se (I've read about when it has advantages and why).
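For anyone else poking at this, here's a minimal manova() run on R's built-in iris data (not Crawley's example), showing the pieces of output I'm trying to make sense of:

```r
# Two response variables modeled jointly against one factor
fit <- manova(cbind(Sepal.Length, Petal.Length) ~ Species, data = iris)

# Default multivariate test is Pillai's trace; a small p-value says
# the group mean vectors differ on the responses taken jointly
summary(fit)

# The other tests summary can utilize
summary(fit, test = "Wilks")

# Univariate follow-up: one ordinary ANOVA per response
summary.aov(fit)
```

As far as I can tell, coef(fit) still reads the same way as in ANOVA -- displacement from the reference level's mean -- just with one column per response; corrections welcome.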

Anybody having trouble obtaining/keeping the book to participate?


Ambassador to the humans
(a) Still wondering if anyone has matrix versions to shorthand the stuff in box 11.1 (p 452). Any slides or books I have on it use summation notation, which I understand and recognize better now than I did when I asked the question, but I'm weak on my matrix stuff. Any resources people? I'll have to scour some ANOVA class notes somewhere on the interwebs then!
Are you still interested in the matrix forms? I can whip those up pretty quickly if you want.


Ambassador to the humans
Too lazy to write it up in nice math notation so you're just getting some R code.

# We'll just use a simple two sample t-test to illustrate this
# size of group 1
n1 <- 10
# size of group 2
n2 <- 12
# mean of group 1
m1 <- 0
# mean of group 2
m2 <- 3
# shared standard deviation
s <- 2
# generate data for both groups
y1 <- rnorm(n1, m1, s)
y2 <- rnorm(n2, m2, s)
# combine it
y <- c(y1, y2)

# Getting the quantities 'by hand'
SSA <- n1*(mean(y1) - mean(y))^2  + n2*(mean(y2) - mean(y))^2
SSE <- sum((y1 - mean(y1))^2) + sum((y2 - mean(y2))^2)
SSY <- sum((y - mean(y))^2)

# Let's make some matrices to help us out
# I will be an identity matrix
I <- diag(n1+n2)
# X0 will be a column of 1s (intercept only model)
X0 <- matrix(rep(1, n1+n2), ncol = 1)
# X1 will have a separate column for each group
X1 <- matrix(c(rep(1, n1), rep(0, n2), rep(0, n1), rep(1, n2)), ncol = 2)

# Create the projection matrices for X0 and X1
P0 <- X0 %*% solve(t(X0) %*% X0) %*% t(X0)
P1 <- X1 %*% solve(t(X1) %*% X1) %*% t(X1)

# SSA is the difference in the sums of squares between
# intercept only and our treatments
SSA_matrix <- t(y) %*% (P1 - P0) %*% y
# SSE is the difference between the response and our
# actual predictions (given by P1%*%y)
SSE_matrix <- t(y) %*% (I - P1) %*% y
# SSY is more often called SS_total and is the 'total sum of squares'
SSY_matrix <- t(y) %*% (I - P0) %*% y

# make an object to see that we get the same things either way
out <- data.frame(byhand = c(SSA, SSE, SSY),
                  matrix = c(SSA_matrix, SSE_matrix, SSY_matrix))
rownames(out) <- c("SSA", "SSE", "SSY")
out
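And a quick sanity check against R's built-ins (regenerating the data here so this chunk runs on its own): the sums of squares above are exactly what anova() reports for the one-way fit.

```r
# Same setup as above, regenerated so this snippet is self-contained
set.seed(42)
n1 <- 10; n2 <- 12
y <- c(rnorm(n1, 0, 2), rnorm(n2, 3, 2))
g <- factor(rep(c("g1", "g2"), c(n1, n2)))

# The "g" row's Sum Sq is SSA, the "Residuals" row is SSE,
# and their total is SSY
anova(lm(y ~ g))
```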