Analysis of elections data?


New Member
So, Happy Birthday it is then!

Np, take the time you need. This is a long-running project of mine and I feel no rush. However, if it turns out that CA is suitable for this kind of analysis, it would also solve some other matters related to my dissertation.

Yes, you are correct. That was also my point: the analysis would be very interesting if it makes it possible to see which variables group around a specific party.

I have tried PAST and gotten some results out of it, but I am unsure how solid my results are. Judging by what I know of the empirics, they seem to be OK.

Looking forward to seeing what results you get later when you have the time.



TS Contributor
Good news and bad news

Sorry for the delay, but I had to think a bit about the dataset.

Do not be scared by the title of this reply!!

Bad news first.
As I wrote in my previous post, the dataset you posted is made up of several variables, whereas Correspondence Analysis analyses a two-way contingency table, in which two categorical variables are taken into account.
So, as long as we want to understand the nature of the relationship between Parties and, e.g., nationality, we can use CA. We can also use it when we have three variables. There are some "tricks that allow us to have more than two variables in a contingency table" (quoting the post of Terzi).

In order to achieve this, the three (or more) variables have to be recoded and the table rebuilt in one of several ways (cleverly summarized by Greenacre; see the reference quoted in my previous post).
In essence, the table has to capture the relation between, e.g., nationality and gender or age_class.

Good news.
I tried to analyse the table as far as nationality is concerned. See attached PDF (I attach the "worked" Excel file as well).

As you can see (page 1; output from Minitab), the CA provides a good representation of the "variation" (inertia) present in your data.
The first 2 axes account for about 92% of the total inertia, with the first axis accounting for most of it (about 73%). This means that the majority of the inertia is explained by the first (horizontal) axis.
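As a minimal sketch of how such percentages are obtained: each axis has a principal inertia (eigenvalue), and its share is that value divided by the total inertia. The values in `example_inertias` below are hypothetical, not the ones from the attached output.

```python
def inertia_shares(principal_inertias):
    """Return (share, cumulative share) per axis, both as percentages."""
    total = sum(principal_inertias)
    shares, cumulative, running = [], [], 0.0
    for value in principal_inertias:
        share = 100.0 * value / total
        running += share
        shares.append(share)
        cumulative.append(running)
    return shares, cumulative

# Hypothetical principal inertias (eigenvalues), one per axis.
example_inertias = [0.30, 0.10, 0.06, 0.04]
shares, cumulative = inertia_shares(example_inertias)
for axis, (s, c) in enumerate(zip(shares, cumulative), start=1):
    print(f"axis {axis}: {s:.1f}% of inertia (cumulative {c:.1f}%)")
```

The cumulative share of the first two axes is the figure to look at when judging how well a two-dimensional plot represents the table.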

On page 2 you can see the dataset, followed by two tables providing the row and column profiles. The highlighted cells contain the values (percentages) that are greater than the corresponding row/column average.
For example, as far as row profiles are concerned, the first Party (recorded as P_1) has a proportion of Gagauzian voters (15%) that turns out to be greater than the average (3.4%).
The same applies to the table of column profiles.

These tables tell you which profiles (row and/or column) are above, below, or equal to the average.
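For anyone who wants to rebuild such profile tables by hand, here is a minimal sketch in plain Python. The counts and the labels (`P_1`, `N_a`, and so on) are hypothetical placeholders, not the posted dataset.

```python
def row_profiles(table):
    """Divide each row by its total, so every row profile sums to 1."""
    return [[cell / sum(row) for cell in row] for row in table]

def average_row_profile(table):
    """Column totals divided by the grand total."""
    grand_total = sum(sum(row) for row in table)
    return [sum(col) / grand_total for col in zip(*table)]

# Hypothetical counts: rows = parties, columns = nationalities.
parties = ["P_1", "P_2", "P_3"]
nationalities = ["N_a", "N_b", "N_c"]
counts = [
    [60, 30, 10],
    [20, 50, 30],
    [40, 40, 20],
]
profiles = row_profiles(counts)
average = average_row_profile(counts)
for party, profile in zip(parties, profiles):
    # Cells above the average profile are the ones highlighted in the tables.
    above = [nat for nat, p, avg in zip(nationalities, profile, average) if p > avg]
    print(party, [round(p, 3) for p in profile], "above average:", above)
```

Column profiles can be obtained the same way after transposing the table, e.g. with `list(zip(*counts))`.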

Page 3 contains the scatterplots of row and column profiles. The two graphs are separately displayed. These are from PAST.

As for the second graph, it is clear that CA is essentially contrasting two broad groups along the first axis: one to the right, one to the left.
It is also clear that the group to the left can be further divided into two sub-groups: one lying towards the top of the graph, the other towards the bottom.

As you can see, the Moldova profile-point is closer to the centre than all the other points. This is because the Moldova profile is the closest to the average (the centre of the graphs indicates, in fact, the average row/column profile). You can easily see this from the tables with the profile values.
In other words, the Parties differ little as far as the Moldovan votes are concerned; by the same token, the Moldovan votes are nearly "evenly" (roughly speaking) distributed across the parties.
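The closeness to the centre can also be checked numerically. In the full CA space, the distance of a profile point from the origin is its chi-square distance to the average profile (a two-dimensional plot shows a projection of it). A minimal sketch, with hypothetical counts in which `N_a` plays the role of a large group:

```python
import math

def column_profiles(table):
    """Divide each column by its total, so every column profile sums to 1."""
    return [[cell / sum(col) for cell in col] for col in zip(*table)]

def average_column_profile(table):
    """Row totals divided by the grand total."""
    grand_total = sum(sum(row) for row in table)
    return [sum(row) / grand_total for row in table]

def chi_square_distance(profile, average):
    """Chi-square distance between a profile and the average profile."""
    return math.sqrt(sum((p - a) ** 2 / a for p, a in zip(profile, average)))

# Hypothetical counts: rows = parties, columns = nationalities.
nationalities = ["N_a", "N_b", "N_c"]
counts = [
    [500, 10, 40],
    [300, 40, 10],
    [200, 50, 50],
]
average = average_column_profile(counts)
distances = {
    nat: chi_square_distance(profile, average)
    for nat, profile in zip(nationalities, column_profiles(counts))
}
for nat, dist in distances.items():
    print(f"{nat}: distance from the centre = {dist:.3f}")
```

In this made-up table the largest group ends up nearest the centre, which is expected arithmetically: the average profile is a weighted mean of the individual profiles, so a group that contributes most of the counts pulls the average towards itself.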

As for the relation between parties and nationality stemming from the CA, the further to the right a party lies, the higher its relative proportion of Moldovan and Romanian voters.
The further to the left a party lies, the more votes it has from the nationalities lying in the same space. But remember that, as far as this second group of parties is concerned, the higher a party lies in the plot, the more votes it has from Gagauzian and Bulgarian voters; the lower it lies, the more votes it has from Ukrainian, Russian and other nationalities.

Page 4 contains a "worked" table: the original table has been sorted on the basis of the CA scores and two broad groups have been devised (A-B). Group A has been further divided into two sub-groups (A1, A2).
In essence, the table tries to reflect the groupings devised by CA.

The little table at the bottom of the same page summarizes the data of the main table.
As you can see from the bottom table, the nationalities shown in different colours are those characterizing the different groups of parties.

For ease of reading, page 5 provides two tree diagrams (from a cluster analysis on the CA scores) indicating the groupings in a more rigid way.
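A rough idea of what such a cluster analysis does with the CA scores, as a minimal sketch: the scores below are hypothetical, and complete linkage with Euclidean distance is an assumption made for illustration (the method used for the attached tree diagrams is not stated here).

```python
import math

def agglomerate(points, n_clusters):
    """Merge the two closest clusters (complete linkage) until n_clusters remain."""
    clusters = [[name] for name in points]

    def linkage(a, b):
        # Complete linkage: distance between the two farthest members.
        return max(math.dist(points[x], points[y]) for x in a for y in b)

    while len(clusters) > n_clusters:
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters

# Hypothetical scores on the first two axes, one point per party.
scores = {
    "P_1": (0.9, 0.1), "P_2": (0.8, -0.1),
    "P_3": (-0.7, 0.6), "P_4": (-0.6, 0.5),
    "P_5": (-0.8, -0.5), "P_6": (-0.7, -0.6),
}
two_groups = agglomerate(scores, 2)
three_groups = agglomerate(scores, 3)
print("2 groups:", two_groups)
print("3 groups:", three_groups)
```

Cutting the tree at two clusters gives the broad right/left split, and cutting at three separates the upper and lower sub-groups on the left.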

All this is meant in an exploratory perspective. You may add whatever graphical representation you find easiest to read.

As far as testing the statistical validity of the groupings is concerned, this is a question on which I would like to have comments from other members of the forum.

I am going to read a chapter of Greenacre's book dealing with this topic. I will let you know if any idea comes up.

I hope this helps.

If you want and if you manage to recode the data in the way I said before, we could go further with this analysis.

Let me know.



New Member
Np, your title did not scare me :) Thanks again for the time you are putting in; the explanations are especially helpful.

The results you got match my understanding of Moldovan politics very well. Your results on the Moldovan group are also more interesting since they nuance the picture you get by simply running percentages. But grouping the nationalities as CA can do also makes it easier to grasp and understand.

Greenacre's book seems very interesting. At the moment I'm residing in Bucharest and while I did bring some stat books with me, none of them deals with CA (or any of the other names it goes by). I was able to find Greenacre on Google Books but could only read parts of it. I'll try to see whether it's possible to pick it up somewhere around the city.

Would the recoding you are proposing not just be a table that organises the variables according to the scheme below?

urban                 rural
men/women             men/women
ethnic group 1...x    ethnic group 1...x

Something like that?

To make this exercise useful I do need to go further and also check the other variables. I could of course run new CAs, but that would only show how the variables are related and not how strongly they are attached. Any ideas how such an operation could be carried out?



TS Contributor


I am happy to know that my tentative analysis turned out to be interesting to you :).

I have not read Greenacre's chapter yet. I will let you know later if any idea comes up (i.e., about testing the significance of clusterings in CA results).

As for the recoding in the case of three or more variables, you could read on Google Books the pages of Greenacre's chapter on "stacked tables".

In any case, I attach 2 pictures from Google.

Pict.1 shows an example of recoding when three variables are taken into account. I think it is rather self-explanatory.

Pict.2 shows an example of recoding when one is dealing with more than three variables. This is an example of a stacked table.

Try to figure out how to adapt your data in the light of these examples.
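A minimal sketch of the first kind of recoding, in which two variables are merged into a single combined one before cross-tabulating against parties. It assumes respondent-level records are available; the field names (`party`, `gender`, `age`) and the records are hypothetical.

```python
from collections import Counter

def interactive_table(records, row_var, col_vars):
    """Cross-tabulate row_var against the combination of col_vars."""
    counts = Counter()
    for rec in records:
        combined = "_".join(rec[v] for v in col_vars)  # e.g. "male_18_29"
        counts[(rec[row_var], combined)] += 1
    rows = sorted({r for r, _ in counts})
    cols = sorted({c for _, c in counts})
    table = [[counts[(r, c)] for c in cols] for r in rows]
    return rows, cols, table

# Hypothetical respondent-level records.
records = [
    {"party": "P_1", "gender": "male", "age": "18_29"},
    {"party": "P_1", "gender": "female", "age": "18_29"},
    {"party": "P_2", "gender": "male", "age": "30_44"},
    {"party": "P_1", "gender": "male", "age": "18_29"},
]
rows, cols, table = interactive_table(records, "party", ["gender", "age"])
print(cols)
for label, row in zip(rows, table):
    print(label, row)
```

The resulting table is an ordinary two-way table (parties by gender-age combinations), so it can be passed to CA as usual.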

It has to be noted, however, that as the number of variables increases, the interpretation of the CA results becomes a little trickier. Nonetheless, Greenacre provides several guidelines for this kind of situation and for the interpretation of its results.

Hope this helps.



New Member

Yes, that is pretty much as I pictured it. The Gender-Age table is clear. It combines variables and groups them.

Regarding the stacked table, it would seem that the variables are run separately, but are they combined in the CA or just analysed separately?

It would of course be possible for me to simply run the variables separately but then I would also have to be able to tell something about the strength of the relationships.



TS Contributor

As for the coded age/nationality table, it is analysed by means of CA, but the interpretation has to be done from a slightly different perspective.

As for the stacked table, it is analysed by CA as well, taking into account all the variables at the same time. In this case as well, the interpretation differs a little.

Let me know if you want help in running CA and interpreting the results, or if you need general help for some reason.


I will report back on the inferential aspects of CA (I am studying the issue at the moment).


New Member
Hi again,

But is the stacked table then any different from what I provided before and you worked with? How does the stacked table take into account that the variables may also be related?


TS Contributor

The table you provided me (and on which I worked) does not tell us, e.g., how many people aged 18_29 were Moldovan, how many Russian, etc.

By the same token, we do not know how many Moldovans were male and how many were female, etc.

So we can:

1) use one table for each group of three variables: let's say, one exploring the relation between Parties, Age and Nationality; one exploring the relation between Parties, Gender and Nationality; and so forth.
In this case we only need one recoded table (of the type shown in Pict.1 of my preceding post) for each type of analysis.

2) use a stacked table (of the type shown in Pict.2 of my preceding post).
You are right in your doubt. The analysis with the type of stacked table attached here will reveal the interaction between Nationality and each of Parties, Age classes and Gender, BUT NOT among Parties, Age classes and Gender themselves. Alternatively, this table could be reorganized in order to analyse the relation between Parties (putting them in columns) and each of Nationality (moving it to the rows), Age classes, Gender, etc.
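A minimal sketch of how a stacked table of the second kind could be assembled: several two-way tables that share the same rows (here parties) are placed side by side. The labels and counts are hypothetical, and putting parties in the rows is just one of the two layouts discussed above.

```python
def stack_side_by_side(row_labels, blocks):
    """Concatenate two-way tables that share the same rows, block by block.

    blocks is a list of (column_labels, table) pairs; every table must have
    one row per entry of row_labels, in the same order.
    """
    stacked_cols = []
    stacked = [[] for _ in row_labels]
    for col_labels, table in blocks:
        if len(table) != len(row_labels):
            raise ValueError("every block needs one row per row label")
        stacked_cols.extend(col_labels)
        for target, row in zip(stacked, table):
            target.extend(row)
    return stacked_cols, stacked

# Hypothetical counts: parties by nationality, and parties by gender.
parties = ["P_1", "P_2"]
by_nationality = (["N_a", "N_b"], [[60, 40], [30, 70]])
by_gender = (["male", "female"], [[55, 45], [48, 52]])

columns, stacked = stack_side_by_side(parties, [by_nationality, by_gender])
print(columns)
for party, row in zip(parties, stacked):
    print(party, row)
```

Each block relates parties to one variable only, which is why a table of this kind says nothing about how, e.g., nationality and gender relate to each other.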

I think these are the best options for your type of analysis.

Let me know. I would be happy to help you.



New Member
Happy New Year!


I think the second alternative will also be the best one, i.e. a stacked table where each variable is run against the parties. I'm of course grateful for all future help you can provide, but I would also need to run the analysis myself so that I can repeat it later if I need to :)

I was also thinking of swapping rows and columns so that the table is transposed (both in Excel and Past).

I picture the process in the following way:

1) Running CA on the stacked table in order to see if there are other variables that also correspond to parties. I would presume locality and age to have some effect, gender less so.

3) Is there any possibility to see whether a relation is stronger or weaker? I have understood it as a no, but at the same time ending up at the origin, as the Moldovan group tended to in your previous run, indicates a result close to the average. Could such a result also be caused by a larger sample, or does the size of the group not matter?

4) Tables and diagrams to illustrate the results, but let's come back to that later.

The question is, then, at which end to begin. Would Past be enough? On my computer I have SPSS, Excel, and now also Past. If you just "push" me in the right direction here, I will know where to start my next investigations.

I do have rather good hopes that the analysis will turn out pretty interesting and that it should be possible to publish it somewhere. Be sure that I will acknowledge your invaluable help somewhere in the text :)

Happy New Year!



TS Contributor
Nice to hear from you again and to know that you are managing to find the right path for your analysis.

I hope that you will manage to get interesting results. I am happy to have helped a bit :)

As for your points (I repeat your numeration):

1) I think it is good. I assume that you have read the warnings about the interpretation of CA on stacked tables; from what you write, it seems so.

3) I wrote in an earlier post that assessing the statistical significance of CA clusterings (or, generally speaking, of CA results) is a complex issue.
I have read Greenacre's Chapter 15, where he gives some hints: some are easy to perform, others are difficult and require specific software. I have written to Prof. Greenacre himself to ask for some advice, and I am waiting for his reply. I wrote to the Past user forum as well as to this forum, but I have not received any reply so far.

As for the statistical significance of the division, e.g., of the Parties into two broad groups, I would proceed as follows (but please note that I am not so sure about it): I would perform a non-parametric test (e.g., Mann-Whitney) to test the significance of the difference in median votes between the two broad groups. The same (I guess) could be done for the Nationality (to stay with the dataset I worked on).
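A minimal sketch of such a Mann-Whitney comparison in plain Python. The two samples are hypothetical vote shares (in %) for the parties of two groups; the p-value uses the normal approximation without tie or continuity correction, which is only a rough guide for samples this small, so a statistics package would be preferable in practice.

```python
import math

def midranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of ranks start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks

def mann_whitney_u(sample_a, sample_b):
    """Return (U, two-sided p-value) using the normal approximation."""
    n_a, n_b = len(sample_a), len(sample_b)
    ranks = midranks(list(sample_a) + list(sample_b))
    u_a = sum(ranks[:n_a]) - n_a * (n_a + 1) / 2
    u = min(u_a, n_a * n_b - u_a)
    mean_u = n_a * n_b / 2
    sd_u = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    z = (u - mean_u) / sd_u
    return u, math.erfc(abs(z) / math.sqrt(2))

# Hypothetical vote shares (%) from one nationality, per party in each group.
group_a = [2.1, 3.4, 1.8, 2.9]
group_b = [9.5, 12.0, 8.7, 15.2]
u_stat, p_value = mann_whitney_u(group_a, group_b)
print(f"U = {u_stat}, two-sided p (normal approximation) = {p_value:.4f}")
```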

More "orthodox" ways (I found them in Greenacre's book) are:

A) to perform a chi-square test on the contingency table, to see if there is a significant association between rows and columns.
This can easily be done with the CA results and Excel.
-take the CA results and get the total inertia (it is given in the analysis output, or you can just sum up the inertia of the various axes as provided by the Past output window)
-multiply this total inertia by the sample size (in our case, the table's grand total)
-this gives you the chi-square value for your table
-go to Excel and use the function DISTRIB.CHI() (this is its name in Italian-language Excel; in English-language versions it is CHIDIST()); inside the parentheses put the chi-square value, then ";" (or ",", depending on your regional settings) and then the degrees of freedom of your table. The latter equal (number of rows-1)*(number of columns-1).
-you get the probability associated with your chi-square value.
NOTE: Instead of using Excel, you may be able to use any statistics package to analyse the table.
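The steps in A) can also be sketched without Excel. The block below multiplies the total inertia by the grand total and computes the upper-tail chi-square probability with a hand-written incomplete gamma routine (series and continued-fraction forms); the inertia, grand total and table size in the example are hypothetical.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) for a chi-square variable with df degrees of freedom."""
    if x <= 0:
        return 1.0
    a, half_x = df / 2.0, x / 2.0
    log_prefactor = -half_x + a * math.log(half_x) - math.lgamma(a)
    if half_x < a + 1.0:
        # Series expansion of the regularized lower incomplete gamma function.
        term = total = 1.0 / a
        denom = a
        for _ in range(10000):
            denom += 1.0
            term *= half_x / denom
            total += term
            if abs(term) < abs(total) * 1e-14:
                break
        return 1.0 - total * math.exp(log_prefactor)
    # Continued fraction (modified Lentz method) for the upper incomplete gamma function.
    tiny = 1e-300
    b = half_x + 1.0 - a
    c = 1.0 / tiny
    d = 1.0 / b
    h = d
    for i in range(1, 10000):
        an = -i * (i - a)
        b += 2.0
        d = an * d + b
        if abs(d) < tiny:
            d = tiny
        c = b + an / c
        if abs(c) < tiny:
            c = tiny
        d = 1.0 / d
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < 1e-14:
            break
    return h * math.exp(log_prefactor)

def chi_square_from_inertia(total_inertia, grand_total, n_rows, n_cols):
    """Chi-square = total inertia * grand total; df = (rows - 1) * (columns - 1)."""
    chi_square = total_inertia * grand_total
    df = (n_rows - 1) * (n_cols - 1)
    return chi_square, df, chi2_sf(chi_square, df)

# Hypothetical inputs: total inertia 0.05, grand total 1200, a 3 x 3 table.
chi_square, df, p_value = chi_square_from_inertia(0.05, 1200, 3, 3)
print(f"chi-square = {chi_square:.2f}, df = {df}, p = {p_value:.3g}")
```

With a statistics library at hand, the hand-written `chi2_sf` would normally be replaced by the library's chi-square survival function.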

B) to perform a similar analysis on the relevant axis:
-take the inertia explained by the first axis
-multiply it by the table's grand total in order to get the chi-square contribution of this axis
-compare this value with the table attached here
-if your value is greater than the corresponding value in the table, then that dimension is significant (assuming that your data are statistically valid [e.g., from random sampling]); that is, there is less than a 5% probability that it has arisen by chance.
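For the first two steps of B), the inertia of the first axis is normally read straight from the CA output. As a cross-check, the sketch below recomputes it from a table of counts using power iteration on the standardized residuals (a swapped-in numerical technique, not what Past or Minitab necessarily do internally). The counts are hypothetical, and the critical value still has to be looked up in the attached table, so the sketch stops at the chi-square contribution.

```python
def first_axis_inertia(table, iterations=500):
    """Return (first principal inertia, total inertia, grand total) for a table of counts."""
    n = sum(sum(row) for row in table)
    row_mass = [sum(row) / n for row in table]
    col_mass = [sum(col) / n for col in zip(*table)]
    # Standardized residuals: (p_ij - r_i * c_j) / sqrt(r_i * c_j).
    resid = [
        [(cell / n - r * c) / (r * c) ** 0.5 for cell, c in zip(row, col_mass)]
        for row, r in zip(table, row_mass)
    ]
    k = len(col_mass)
    # Cross-product matrix of the residuals; its largest eigenvalue is the
    # first principal inertia and its trace is the total inertia.
    cross = [[sum(row[a] * row[b] for row in resid) for b in range(k)] for a in range(k)]
    total_inertia = sum(cross[a][a] for a in range(k))
    vector = [1.0 + a for a in range(k)]  # arbitrary starting vector
    eigenvalue = 0.0
    for _ in range(iterations):
        product = [sum(cross[a][b] * vector[b] for b in range(k)) for a in range(k)]
        norm = sum(v * v for v in product) ** 0.5
        if norm == 0.0:
            break  # no association at all in the table
        vector = [v / norm for v in product]
        eigenvalue = sum(
            vector[a] * sum(cross[a][b] * vector[b] for b in range(k)) for a in range(k)
        )
    return eigenvalue, total_inertia, n

# Hypothetical counts: rows = parties, columns = nationalities.
counts = [
    [500, 10, 40],
    [300, 40, 10],
    [200, 50, 50],
]
first, total, grand_total = first_axis_inertia(counts)
print(f"first axis: {100 * first / total:.1f}% of total inertia")
print(f"chi-square contribution of the first axis: {first * grand_total:.2f}")
```

Power iteration converges slowly when the first two inertias are nearly equal, so the value reported by the CA software should be preferred whenever it is available.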

To be honest, I am unsure about the effect of different sample sizes on CA results. What I can say is that the Moldovan profile is near the average, that is, its "distribution" differs little from the average profile.

4) When you present your analysis, I think you could start from the original table, and then perform the CA and provide the scatterplot. Then you may wish to sort the table(s) according to the CA results and provide some descriptive graphs (histograms?) of the groupings you devise.
It could be nice to facilitate (alongside the scatterplot) the eyeballing of groupings by means of dendrograms from a cluster analysis of the row/column scores on the relevant axes.
On this latter topic, I attach an interesting article found on the web. I also attach a PDF that explains cluster analysis (it is from the Minitab Guide, but I think it can be useful anyway).

As for programs, I use various ones, since each has its own strong points (SPSS, but mainly PAST, MiniTab, SigmaPlot11).

So, I think that is all.
I hope this can help and that this quite long reply does not confuse you.

I look forward to hearing about your results, and I hope that you will manage to do it all by yourself. In any case, if you have any problem do not hesitate to contact me (here or privately [you can find my mail on my website]).

Good luck and happy new year,
Kind regards