Correspondence Analysis and cars' environmental friendliness

gianmarco

TS Contributor
#1
A few days ago, Trinker pointed me to a blog post dealing with Correspondence Analysis.

I am taking the opportunity to elaborate very briefly on the dataset used in that blog, just to give an idea of how the CA scatterplot used there can be 'reworked' to help assess the cars' relative 'environmental friendliness' as perceived by customers.

As you can see from the attached picture, it suffices to draw a segment (BLACK) passing through the 'environmentally friendly' category point and the origo of the scatterplot. Then, one draws a segment (RED) from each 'car' category point, perpendicular to the first segment.

The more a red segment intersects the black one beyond the origo, the less than expected the 'environmentally friendly' category has an 'impact' on a given car category, and vice versa.
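
This construction can be sketched in a few lines of Python (NumPy). The 2-D coordinates below are invented, not the blog's actual CA map; only the geometry matters here. Projecting each car point perpendicularly onto the line through the origin and the attribute point reduces the ranking to a signed scalar:

```python
import numpy as np

# Hypothetical 2-D CA coordinates (invented numbers, purely illustrative).
env_friendly = np.array([0.55, -0.20])           # 'environmentally friendly' point
cars = {
    "Toyota Prius": np.array([0.60, 0.10]),
    "Fiat 500":     np.array([0.45, -0.35]),
    "BMW X5":       np.array([-0.70, 0.25]),
}

# BLACK segment: the line through the origin and the attribute point.
# RED segments: perpendiculars from each car point onto that line; the foot
# of each perpendicular is the orthogonal projection of the car point.
u = env_friendly / np.linalg.norm(env_friendly)  # unit vector along the black line
scores = {name: float(p @ u) for name, p in cars.items()}

# A positive score places the projection on the attribute's side of the
# origin; a negative score places it beyond the origin, i.e. a
# lower-than-expected association with 'environmentally friendly'.
for name, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {s:+.3f}")
```

Sorting the signed projections then reproduces the ordering one reads off the plot.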

So, listing from more to less 'environmentally friendly': Opel Corsa, Fiat 500, Toyota Prius, Citroen Picasso, Ford Focus, Volksw. Golf, Mini Cooper and Renault Espace are the cars perceived as 'environmentally friendly', in that specific order (as revealed by the relative positions of the intersections).

By the same token, the other cars are perceived as not 'environmentally friendly', with the BMW X5 scoring the lowest.


These are my 2 cents on the topic :)


Side notes:
-there is no model building, hypothesis testing, residuals, and the like. Just an exploratory approach. Sorry Guys (i.e., Dason) :)
-it is unfortunate that no R package to date allows one to get that type of extra (visual) info out of CA
 
#2
-there is no model building, hypothesis testing, residuals, and the like. Just an exploratory approach. Sorry Guys
Well, OK. (Maybe the guys are sorry. I am not.) But there is no contradiction between exploratory analysis and a more hypothesis-testing/confirmatory approach. They lie on a continuum from one end to the other.

But hey, wait a minute! Aren't there any "residuals"? No "model building"? You have 20 to 25 attributes (like "environmental" etc.) that are projected down to a two-dimensional space (in a way that I personally am not familiar with). Isn't that procedure using an implicit model? Aren't the perpendicular lines you are drawing some kind of "residuals"?

If you exclude a few attributes, will that not change the plot? So there is an implicit model! The question is, does this implicit model describe reality in a good way?

-it is unfortunate that no R package to date allows one to get that type of extra (visual) info out of CA
But the point you are making is a good one. The interpretation can be greatly improved just by drawing a line.
 

gianmarco

TS Contributor
#3
Ahhhhh....no one gives a **** about this thread....I knew it.

In order to gain attention, I should have titled it:
'R is good for nothing' (to attract mainly Dason and Trinker)
'Normality of variables IS an assumption of Linear Regression' (Dason, CowBoyBear, Terzi, Greta)
'SAS is good for nothing' (noetsi)
'Bugs, snakes and spiders suck' (Bugman, TheEcologist)
'I have a severe toothache' (Victor)
'I got free cakes for you' (Jake)
'Canada is a lesser version of US' (Spunky)
 

gianmarco

TS Contributor
#4
Greta:
1) my second (joking) post was written while you were replying. So, sorry for including you in my list....
2) Indeed, as always, you raise many sensible points!!!!! I was just kidding...

Thanks for your feedback
Gm
 
#5
Ahhhhh....no one gives a **** about this thread....I knew it.
I must thank you for the above useful post!

'R is good for nothing' (to attract mainly Dason and Trinker)
Since you have said:
-it is unfortunate that no R package to date allows one to get that type of extra (visual) info out of CA
...I guess that you are right about that one!

'Normality of variables IS an assumption of Linear Regression' (Dason, CowBoyBear, Terzi, Greta)
No, it isn't! Even in ordinary linear regression, normality is assumed for the errors, not the variables themselves; and in a generalized linear model the assumption is just that the response distribution belongs to the exponential family, of which the normal is just a special case.
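
A toy illustration of that point (made-up data, plain NumPy rather than any particular R routine): the sketch below fits a Poisson GLM with a log link by Newton/Fisher scoring. The response is count data, nowhere near normal, yet the fit is perfectly legitimate because the Poisson is in the exponential family:

```python
import numpy as np

# Made-up count data: y ~ Poisson(exp(0.5 + 1.0 * x)) -- clearly non-normal.
rng = np.random.default_rng(42)
x = np.linspace(0.0, 2.0, 200)
X = np.column_stack([np.ones_like(x), x])   # design matrix: intercept + slope
y = rng.poisson(np.exp(0.5 + 1.0 * x))

# Rough starting values from a least-squares fit on the log scale.
beta = np.linalg.lstsq(X, np.log(y + 1.0), rcond=None)[0]

for _ in range(25):                         # Newton-Raphson / Fisher scoring
    mu = np.exp(X @ beta)                   # log link: E[y] = exp(X @ beta)
    grad = X.T @ (y - mu)                   # score vector
    hess = X.T @ (X * mu[:, None])          # Fisher information
    beta = beta + np.linalg.solve(hess, grad)

print(beta)  # estimates of the true (0.5, 1.0) on the log scale
```

In R this is just glm(y ~ x, family = poisson); the point is that the machinery never assumes a normal response.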


'SAS is good to nothing' (noetsi)
For this I thank you for this useful post! Besides, SAS is just an aeroplane company anyway!

'Bugs, snakes and spiders suck' (Bugman, TheEcologist)
Although they might be interesting, I must agree about this one.

'I have a severe toothache' (Victor)
Has he said so?

'I got free cakes for you' (Jake)
Fine, send them to the lounge on Wednesday! They have been missing for a long time.

'Canada is a lesser version of US' (Spunky)
Isn't it actually much larger?

@Gianmarco, as you put it: thread deterioration is ON!

:)
 
#7
Greta:
2) Indeed, as always, you raise many sensible points!!!!! I was just kidding...
I realized that you were kidding. But I, "as always", took it deadly serious! :)

Isn't one issue here when a correspondence analysis (CA) is good and when it is not? If the data really can be described as a "two-dimensional pattern", then I guess CA is good. But if the pattern is three-dimensional or higher, can CA lead us astray? Suppose one attribute is an outlier in a three-dimensional space but is displayed in a two-dimensional space?

How can we check that a CA model is a good one? There is a model, isn't there?
 

gianmarco

TS Contributor
#8
Greta:
I was not kidding when I said that you often raise interesting points. I was kidding when saying those things about no model, no testing and so on....

Now, like many dimension-reduction techniques, CA tries to reduce the dimensionality of the data to enhance interpretability. By using CA one seeks a balance between reduction of dimensions (which implies some kind of distortion) and ease of interpreting the data structure.

In essence, CA tries to graphically chart the deviations from independence in a contingency table.

Contingency tables can be high dimensional. For instance, if you were to plot a table with 20 rows and 40 columns, you would need a 19-dimensional space to represent the data points exactly (the dimensionality of a CA solution is min(rows, columns) − 1).

Instead, CA decomposes the inertia (i.e., the data variability, i.e. the deviation from independence) into dimensions, each capturing a decreasing amount of variability (i.e., inertia).

Sometimes, the first 2 dimensions manage to capture a great part of the inertia (say, 90%). In this way (keeping with the above example), by sacrificing 17 dimensions you manage to retain a great deal of the data variability.
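
The core of this decomposition fits in a few lines of Python; the contingency table below is invented, purely to show the mechanics (in R, packages such as ca or FactoMineR compute all of this and much more):

```python
import numpy as np

# Toy contingency table (invented counts), rows x columns.
N = np.array([[30., 10.,  5.],
              [10., 25., 10.],
              [ 5., 10., 30.],
              [20., 15., 10.]])

P = N / N.sum()                        # correspondence matrix
r = P.sum(axis=1)                      # row masses
c = P.sum(axis=0)                      # column masses
E = np.outer(r, c)                     # expected proportions under independence
S = (P - E) / np.sqrt(E)               # standardized residuals
U, sv, Vt = np.linalg.svd(S, full_matrices=False)

inertia = sv**2                        # principal inertia of each dimension
print("share of inertia per dimension:", inertia / inertia.sum())
```

The total inertia equals the chi-square statistic divided by the table total, and the shares printed above are exactly the "percentage of inertia explained" figures one reads off a CA screeplot.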

There would be a lot more to say. Maybe you could jump to my site for further info.

Cheers
Gm
 
#11
Hi Gianmarco,

Can you clarify on the following sentence?

The more a red segment intersects the black one beyond the origo, the less than expected the 'environmentally friendly' category has an 'impact' on a given car category, and vice versa.
It doesn't make sense to me. By "segment", I assume you mean the line, and by "origo" you mean the origin. But what do you mean by "the less than expected"?
 

gianmarco

TS Contributor
#12
I was wondering what the theoretical basis for this method is? I have never seen anything like it....
Noetsi: do you mean the basis of CA in general, or of that 'geometrical' interpretation?
As for CA in general, I elaborate quite a bit more here.


@Injektilo:
Yes, I meant "line" and "origin".
As for my terminology ('less/more than expected'), it derives from the close underlying relation between CA and chi-square. Bear in mind that the underlying data on which CA works is a contingency table, and that (in a very short summary) CA charts the deviation from independence in a contingency table. So, the cars I was referring to (i.e., the ones whose red line intersects the black line beyond the origin) have lower than expected counts relative to the category in question (i.e., 'environmental friendliness').
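
The 'less than expected' wording maps directly onto standardized residuals of the contingency table. A minimal illustration in Python (invented counts, not the blog's data):

```python
import numpy as np

# Toy table: rows = car categories, column 0 = 'environmentally friendly'
# mentions, column 1 = other attribute mentions (invented counts).
observed = np.array([[12., 30.],
                     [25., 10.],
                     [18., 20.]])

n = observed.sum()
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n

# Chi-square-style standardized residuals: a negative value in column 0
# means that car co-occurred with 'environmentally friendly' LESS often
# than independence predicts -- the cars whose red segment crosses the
# black line beyond the origin.
residuals = (observed - expected) / np.sqrt(expected)
print(np.round(residuals, 2))
```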

gm