Normality assumption for PCA?

#1
I know that the classical Pearson correlation coefficient is only valid when
data are normally distributed. For this, I generally use the Shapiro–Wilk
normality test.

I was recently wondering if the data also need to have a normal distribution to use a PCA. I didn't find a clear answer to this in the literature but I read
that PCA assumes a multivariate normality of the data. I was wondering (1) if
you agree with this, (2) what this actually means, and (3) if there is a test
to check this.

Thank you very much!
 
#2
That is not a very strict requirement. If you have multivariate normality, then great, but if you don't, results can still be interpreted. PCA is not a p-value driven technique.

Checking that assumption is difficult. I would just check normality for each variable separately along with skewness and kurtosis stats.
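A minimal sketch of that per-variable check, assuming Python with NumPy and SciPy is available. The helper name `univariate_normality_report` and the simulated columns are hypothetical, chosen only for illustration:

```python
import numpy as np
from scipy import stats


def univariate_normality_report(X, names=None):
    """Per-column Shapiro-Wilk test, skewness and excess kurtosis.

    Hypothetical helper for illustration; X is an (n_samples, n_variables) array.
    """
    X = np.asarray(X, dtype=float)
    if names is None:
        names = [f"var{j}" for j in range(X.shape[1])]
    report = {}
    for j, name in enumerate(names):
        col = X[:, j]
        w_stat, p_value = stats.shapiro(col)
        report[name] = {
            "shapiro_W": float(w_stat),
            "shapiro_p": float(p_value),
            "skewness": float(stats.skew(col)),
            # Fisher definition: 0 for a normal distribution
            "excess_kurtosis": float(stats.kurtosis(col)),
        }
    return report


# Simulated data (an assumption, not from the thread)
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(size=200),       # roughly normal column
    rng.exponential(size=200),  # strongly right-skewed column
])
report = univariate_normality_report(X, ["normal", "skewed"])
for name, row in report.items():
    print(name, row)
```

Note that this only looks at the margins: every variable can look normal on its own while the joint distribution is not multivariate normal.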
 
#3
"I know that the classical Pearson correlation coefficient is only valid when data are normally distributed."
That is not quite true. You are free to compute Pearson's correlation for data with any distribution. Maybe not always a smart thing to do, but there is no law against it. It's when you start to compute p-values that things become more strict.

Similar thing with PCA. You are free to PCA any data you wish, but it may work better for multivariate normal data.
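To illustrate the point, here is a small sketch (assuming Python with NumPy) that runs PCA on deliberately non-normal data. The function name `pca` and the simulated data are hypothetical; the decomposition itself is just an eigendecomposition of the sample covariance matrix, and nothing in it requires normality:

```python
import numpy as np


def pca(X):
    """PCA via eigendecomposition of the sample covariance matrix.

    Illustrative helper; returns eigenvalues (descending), eigenvectors
    (as columns) and the component scores.
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                 # center each variable
    cov = np.cov(Xc, rowvar=False)          # sample covariance (ddof=1)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]       # re-sort to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = Xc @ eigvecs                   # project onto the components
    return eigvals, eigvecs, scores


# Simulated, clearly non-normal data (an assumption for this example)
rng = np.random.default_rng(1)
base = rng.exponential(size=(500, 1))
X = np.hstack([
    base + 0.1 * rng.exponential(size=(500, 1)),
    2 * base + 0.1 * rng.uniform(size=(500, 1)),
    rng.uniform(size=(500, 1)),
])
eigvals, eigvecs, scores = pca(X)
print("explained variance ratio:", eigvals / eigvals.sum())
```

The scores come out uncorrelated whatever the distribution. What normality adds is that uncorrelated components are then also independent, and that the usual inferential results about the eigenvalues apply.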
 

bugman

Super Moderator
#4
Like ohammer said, but just an additional note:

If you are using PCA for modelling purposes (either subsequent gradient analyses or regression), then normality would be ideal. If it's for data reduction or exploratory purposes, then normality (as previous posters have mentioned) is not a strict requirement.
 
#5
Thank you for your answers. This clarifies my concerns.

I asked the same question to several statisticians in parallel and I got quite different answers. In the end, I guess it all depends on what I want to do with the data (as bugman says, if it's for data reduction or exploratory purposes, then normality is not a strict requirement).

Here are the other answers that I got:

(1) PCA is a purely geometrical technique - there is no need for a statistical hypothesis

(2) Multivariate normality is an assumption of PCA, but not a critical assumption. You can test for multivariate normality with a version of Shapiro-Wilk for multivariate normality.

(3) For PCA, there are assumptions about the data - that it is continuous and normally distributed - but this can be overlooked if the purpose of the test is to generate further hypotheses
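
As a rough illustration of what a multivariate normality check can look like, the sketch below implements Mardia's skewness and kurtosis tests with NumPy and SciPy. This is a different test from the multivariate Shapiro-Wilk mentioned in (2); it is swapped in here only because it can be written in a few lines. The function name `mardia_test` and the simulated data are hypothetical, and the p-values rely on large-sample approximations:

```python
import numpy as np
from scipy import stats


def mardia_test(X):
    """Mardia's multivariate skewness and kurtosis tests (asymptotic).

    Illustrative helper; X is an (n_samples, n_variables) array.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    S = (Xc.T @ Xc) / n                    # maximum-likelihood covariance
    D = Xc @ np.linalg.solve(S, Xc.T)      # d_ij = (x_i - m)' S^-1 (x_j - m)

    b1 = (D ** 3).sum() / n ** 2           # multivariate skewness
    b2 = (np.diag(D) ** 2).sum() / n       # multivariate kurtosis

    # Skewness: n * b1 / 6 is approximately chi-square under normality
    skew_stat = n * b1 / 6.0
    skew_df = p * (p + 1) * (p + 2) / 6.0
    skew_p = stats.chi2.sf(skew_stat, skew_df)

    # Kurtosis: standardized b2 is approximately standard normal
    kurt_z = (b2 - p * (p + 2)) / np.sqrt(8.0 * p * (p + 2) / n)
    kurt_p = 2.0 * stats.norm.sf(abs(kurt_z))

    return {"b1": float(b1), "b2": float(b2),
            "skew_p": float(skew_p), "kurt_p": float(kurt_p)}


# Simulated data (an assumption for this example)
rng = np.random.default_rng(2)
normal_data = rng.normal(size=(300, 3))
skewed_data = rng.exponential(size=(300, 3))
print("normal:", mardia_test(normal_data))
print("skewed:", mardia_test(skewed_data))
```

Small p-values suggest a departure from multivariate normality. With large samples such tests flag even mild departures, so the result is better read alongside the purpose of the PCA than as a pass/fail gate.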

Thanks!

Sebastien
 
#8
This book gives a perfect answer to your question:
Principal Component Analysis, 2nd edition (2002), by I. T. Jolliffe

http://www.amazon.com/Principal-Component-Analysis-I-T-Jolliffe/dp/0387954422