Are students prepared for college? PCA analysis?


New Member
I have an assignment for my online sociology course that requires that I complete a quantitative analysis on a social issue.
For my assignment, I have decided to investigate the level of preparedness of my fellow classmates for university study.

The motivation for this research arose when I recently encountered media reports describing what is close to a mental health crisis occurring on campuses across North America and elsewhere. I contacted my university's counseling department about this and apparently they are gearing up for a large scale roll-out of enhanced mental well-being services to address the demand. Being a student can be a very stressful experience. Recently, it would seem that it has become even more stressful.

However, I have known people who were both very high achieving and very unstressed while advancing through their academic and professional lives. The secret to their success seems to be that they knew from a very early age their life trajectory and they had the resources necessary to achieve their goals. In short, they were well-prepared.

Today many students are not well-prepared for university. Students now often attend university from a family context lacking in academic role models. They also typically lack a family environment orientated to learning (such as books, high level verbal skills etc.). It is, thus, hardly surprising that these students might experience extreme stress, academic disappointment and life setbacks. In short, they are ill-prepared. My research could benefit what appears to be nearly an entire generation of students who have no conception of what university life entails by highlighting the connection between preparation and a stress-free successful life.

In order to research this question, I would like to ask my classmates (might receive 10-20 responses) roughly 20-30 questions. Perhaps 20-25 questions would ask about preparedness for university study. For example, possibly 15 questions using a 1-5 scale (very prepared/engaged ..., very unprepared/unengaged ...), such as:

How prepared were you for this course?
Have you actively "participated" in the sociological topics covered in the course through political activism or life circumstances (family conflict, divorce, gender/race etc.)?

I would also like to ask roughly 10 questions that are not on a scale, though would have yes/no answers, other numerical responses etc.
For example: Do you know anyone who has taken this course? Yes/No
How many years did you know that you wanted to take this course before you enrolled in it?

Furthermore, I would also like to ask a few questions about stress and success. For example, I could ask on a 5 point scale for students to rate their stress levels and also what they expect their course grade will be and how they have done on other uni courses.

I have watched videos on Principal Component Analysis and it looks like it could provide useful insights into my survey.
I could run a PCA analysis on the 20 preparedness questions and see if all of these questions load onto a single component.
I would tend to think that they would, though it is possible that different aspects of preparedness might form their own factors (such as preparedness for this course, general university preparedness etc.). I would also like to construct an overall preparedness composite score. I could simply add up all the scale values, yet perhaps some feature of PCA might allow me to construct an "optimal" score. I am not sure how this might be done.

By summing up all the items I might gain statistical strength for the preparedness dimension just as Spearman's g could separate out the noise from the signal by increasing the number of items/tests.

Lastly, I would like to take an exploratory look at how preparedness might relate to stress and success. I could create composite scores for stress and success and plot the values against preparedness. I am not entirely sure whether one simply puts all the responses from the students into one big matrix and then sees what happens with a PCA, or whether one should add in each group separately (for example, preparation, stress and then success). It would not be entirely surprising if highly prepared students had less stress and more success (this is what I actually want the numbers to tell me), though it would equally not be that surprising if more stress and less success in some way caused students to be less prepared. Given this, the one big matrix PCA approach might be best.

The course textbook tried, though did not entirely succeed, to define what is meant by the mean (the definition the text used was averaging the range, which I do not believe is accurate). So, there are no great expectations that my assignment include a rigorous statistical analysis. However, including a PCA or other statistical analysis with my assignment would greatly impress my tutor; in fact, it might be grade levels beyond the assignment quality that my classmates would likely submit. No statistical assistance is provided with the course and I have not conducted a PCA analysis before, thus any assistance that might be provided would be very much appreciated. Any suggestions on how I might go about accessing the needed software, and perhaps any pointers to existing code that I could simply lift over, would be of great help. From what I can see, running a PCA analysis might be done quite painlessly without having to delve deeply into the theoretical foundations, while still producing highly relevant insights into my study question. The videos that I have seen online show that software programs such as SPSS can run various analyses within seconds. I do not have access to SPSS, so perhaps I could just upload my dataset to the thread and someone with the software might crank it through.

Thank you in advance to those who might offer me guidance.
Last edited:
From my experience, I had to go through human subjects training before I was able to work with human subjects data. As far as PCA, you can put each set of questions that pertain to the same latent variable through PCA. This can show you if the items load onto the same component. I would also consider calculating Cronbach's alpha for the related items.


New Member
Buckeye, thank you very much for your reply!

I do not have a firm grasp of principal component analysis. Ordinarily, this would be a showstopper for me. However, the sociology course that I am currently enrolled in is not intended to be a deep statistical theory course. The assignment is merely intended as a basic introduction to quantitative/qualitative methods in sociology. When I came across the PCA literature, though, it seemed to me to be a great fit for the research that I want to do. I will use the PCA software largely as a tool without full comprehension of the background theory and derivation of formulae. The videos that I have seen online that interpret the results of running a PCA analysis, scree plot etc. appear to be very straightforward. I will simply use what I have seen as a template for my own analysis.

I am very excited about moving forward with my research because I am almost certain that the report that I will submit will be considered, in comparison to that of my classmates, a work of pure genius. No one else likely will have done this. At most other students probably would simply stop with means and standard deviations of a simple sum of scores composite scale. My assignment will be assessed as pure brilliance!

I would, nonetheless, like to have a better understanding of some of the background. In particular, the question of finding an optimal preparedness predictor. I will have roughly 20 answers from each respondent concerning their course/university preparedness. I could simply add up all the scale scores (scale from 1-5). With 20 items this would give me a score with values from 20-100. This is likely the approach that many students would take under similar circumstances. However, given the theory behind PCA analysis, I would think that I would be able to do better than this. I could use the factor loadings from the PCA as weights in constructing my composite score. I would have to think that this would be an improvement over an unweighted approach. Could anyone enlighten me as to whether this would in fact be the optimal such score? Might there be some other function that would yield an even more accurate score? After I construct such optimal scores for preparedness, stress and success and have transformed values on each of these dimensions for each of the students, I would then want to see how the plots relate to each other.
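To make the weighting question concrete, here is a minimal sketch in plain Python (no packages) that compares a simple sum score with a score weighted by the first principal component. The responses are made up for illustration, and the helper names (`standardize`, `first_component`, and so on) are hypothetical, not from any library. The first component is found by power iteration on the item correlation matrix, which is one simple way to get the dominant eigenvector; R's `prcomp` would normally do this for you.

```python
from statistics import mean, stdev

# Made-up data: 6 respondents (rows) answering 4 preparedness items on a 1-5 scale.
responses = [
    [5, 4, 5, 4],
    [4, 4, 4, 3],
    [3, 3, 2, 3],
    [2, 2, 3, 2],
    [1, 2, 1, 1],
    [4, 5, 4, 5],
]

def standardize(rows):
    """Convert each item (column) to z-scores."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [stdev(c) for c in cols]
    return [[(x - m) / s for x, m, s in zip(row, mus, sds)] for row in rows]

def correlation_matrix(z):
    """Correlation matrix of already standardized data."""
    n, k = len(z), len(z[0])
    return [[sum(r[i] * r[j] for r in z) / (n - 1) for j in range(k)]
            for i in range(k)]

def first_component(matrix, iterations=500):
    """Dominant eigenvector by power iteration, scaled to unit length."""
    k = len(matrix)
    v = [1.0] * k
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(k)) for i in range(k)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v

def pearson(a, b):
    """Pearson correlation between two equal-length lists."""
    ma, mb = mean(a), mean(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = (sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b)) ** 0.5
    return num / den

z = standardize(responses)
weights = first_component(correlation_matrix(z))

# PCA-weighted composite: weighted sum of standardized items.
pca_score = [sum(w * x for w, x in zip(weights, row)) for row in z]
# Unit-weighted composite: plain sum of the raw 1-5 answers.
sum_score = [sum(row) for row in responses]

print("PC1 weights:", [round(w, 3) for w in weights])
print("correlation between the two composites:", round(pearson(pca_score, sum_score), 3))
```

The PC1 score is "optimal" only in one specific sense: among all unit-length weightings of the standardized items, it has the largest variance. That is not the same as being the best predictor of stress or success. When the items are all positively correlated, as in this toy data, the two composites tend to rank respondents almost identically, and with only 10-20 respondents the estimated weights are likely to be unstable, so the simple sum may be just as defensible.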

Another question that I am somewhat uncertain about is how the noise will be replaced by signal with the various questions that I will ask. Specifically, I am thinking about Spearman's insight that with g, the more tests/questions that one had on hand, the more accurately one would be able to measure intelligence (or, in this instance, preparedness). Basically: more data, less noise, more signal after the analysis. Is a good way to detect this effect in a PCA analysis to look at the % variance explained by PC1, PC2 etc.? For example, I could run the analysis with only 5, 10, 15 or all 20 questions (not sure if one would have a result if one only used 1 question in a PCA). My expectation is that by including more questions the principal components will explain more and more of the variance. This should mean that more and more of the noise is being removed from the data. Comments that could help crystallize this point are welcome.
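One way to check this intuition is with two textbook formulas rather than real data. The sketch below assumes an idealized case in which every pair of items has the same correlation r (the value 0.3 is an arbitrary assumption, and the function names are made up for illustration). Under that assumption, the Spearman-Brown formula gives the reliability of a k-item sum score, and the largest eigenvalue of the correlation matrix is 1 + (k - 1) r, so PC1's share of the variance is (1 + (k - 1) r) / k.

```python
def spearman_brown(r, k):
    """Reliability of a sum of k items whose average inter-item correlation is r."""
    return k * r / (1 + (k - 1) * r)

def pc1_share_equicorrelated(r, k):
    """Share of total variance explained by PC1 when all k items correlate r."""
    return (1 + (k - 1) * r) / k

r = 0.3  # assumed inter-item correlation, purely illustrative
for k in (1, 5, 10, 15, 20):
    print(k,
          "reliability:", round(spearman_brown(r, k), 3),
          "PC1 share:", round(pc1_share_equicorrelated(r, k), 3))
```

In this idealized case the reliability of the composite climbs toward 1 as items are added, which is the Spearman effect described above. The percentage of variance explained by PC1, however, falls toward r, because each new item also brings its own noise into the total variance. So the % variance explained is probably not the right place to look for the "more items, less noise" effect; a reliability measure is a better fit.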

Could you tell me more about Cronbach's alpha and how this would be helpful for my analysis? What is Cronbach's alpha intended to describe?
Might other statistical measures also be relevant in my analysis?
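For reference, Cronbach's alpha is usually computed as k / (k - 1) * (1 - sum of item variances / variance of the total score), where k is the number of items. It describes the internal consistency of a set of items, that is, how well they hang together as a single scale. A minimal sketch in plain Python, using made-up responses and a hypothetical function name:

```python
from statistics import variance

def cronbach_alpha(responses):
    """Cronbach's alpha for a list of rows (one row per respondent, one column per item)."""
    k = len(responses[0])
    item_variances = [variance(col) for col in zip(*responses)]
    total_variance = variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)

# Made-up data: 6 respondents answering 4 items on a 1-5 scale.
responses = [
    [5, 4, 5, 4],
    [4, 4, 4, 3],
    [3, 3, 2, 3],
    [2, 2, 3, 2],
    [1, 2, 1, 1],
    [4, 5, 4, 5],
]

alpha = cronbach_alpha(responses)
print("Cronbach's alpha:", round(alpha, 3))
```

Values near 1 suggest the items move together closely; a low value suggests that they do not form one coherent scale, which would also show up in a PCA as items failing to load on a single component.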

Any additional comments about my basic setup or on theory that would help me to have a deeper understanding of the procedure that I intend to use would be greatly appreciated.


New Member
Very awesome news!
While looking around on some online videos on PCA, I came across one that went through an example in R.
I have downloaded R and I am following along with the code!

This is great!
Perhaps I can simply lift over the template from the video and I will be all done.
I am starting to wonder whether Principal component regression might make sense with my data set.

Only slight snag that I am up against now is the video installs a package to make the analysis work.
Problem is that they run a different version of R. I am trying to figure out how to install the package into
my version. When I tried to use one of the mirror sites, they truncated the list of packages and did not
include the one I need.


New Member
Very exciting!
I have been able to get over the hurdle of installing the packages and I am now able to see the
analysis of the data! Excellent!

If I can follow the template in my assignment, then I will have no worries!


New Member
Very startling how much can be learned on the internet in a few days!
Web learning is way way better than a classroom.

I have watched about 20 videos on PCA and I am developing a feel for it. Great having 20 or so people explaining the same thing to you. Each slightly different perspective helps build up one's mastery.

I have been able to download a great number of R packages for data analysis. I had a little trouble at first with installing the packages: the UK mirror did not appear able to download them, though when I added the repo command and another mirror, the problems went away. Adding devtools also helped, and copying and pasting the packages from the temp folder to the C --> Program Files --> R --> library path solved a remaining glitch.

I have developed a sense for the mechanics of PCA, though I want to crystallize my understanding of some of the background theory involving symmetric properties related to eigenvalues, eigenvectors (e.g., when A^T = A then eigenvalues are real etc.).

After that I will start to think some more about the overall model that I should use. I don't suppose that obtaining data and then massaging the data with a bunch of different models until one hits the jackpot is highly regarded. The problem that I think that I could face, though, is that clear PCs might not emerge from the data. The questions that I have made for "preparedness" really might not be all that correlated. For example, some of my questions are:

Did you read the course textbook before the start of the course? Yes/No,
Rate how experiences in your life have helped prepare you for the course content (e.g., issues related to gender, deviance, racial identity, social inequality, social/political activism, etc.)? [Scale 1-5, 1- A great deal, 5- almost none]

It has occurred to me that students could be highly prepared for the course, yet have gone through widely different pathways to such preparation.
The latent variable Preparedness thus might have the same value, and yet the answers for the different questions could be highly different. The different survey questions might act as naturally orthogonal dimensions. Above, one dimension might be labeled academic preparation and the other life preparation.

I will post my survey soon. After I receive the results I could attach the data to the forum and then discuss the analysis.


New Member
GretaGarbo, Thank you for suggesting RStudio to me!

I downloaded the software and have found that it does add a layer of polish to the R interface.
If RStudio can do everything in the background for me related to packages, this alone would be a great help to me.
Without it I needed to keep copying and pasting the packages from one folder to another.

I am not sure why the packages in R simply do not go directly to the C --> Program Files --> R --> library folder.
I suppose I could set the path, though I am not entirely sure how this is done.
RStudio has a nice GUI, so if I did have a problem it would likely be easy to work around it.

I had been receiving a whole bunch of the errors below.
When I set repo = another mirror, the problem seemed to disappear.

Warning: unable to access index for repository
cannot open URL ''
Warning message:
package ‘vioplot’ is not available (for R version 3.6.0)

I continue to be pleasantly surprised how helpful virtual environments can be for learning.
I have been able to rapidly make progress over the last 2 or 3 days in developing competency in PCA analysis.
It is very impressive actually. There are so many videos and web sites that provide very clear explanations.
Hopefully, others who come across this thread will draw inspiration from the general discovery that
statistics is learnable as a solo learner with online resources.

I also have located a range of very relevant software tools that would reduce the frustration of having to do
a great amount of hand calculations (for example: matrix multiplications, finding eigenvectors, eigenvalues etc).
{If I chose to do these outside of R.} Such resources would allow me to concentrate on the higher level understanding
of concepts. This is such a fantastic new era of rapid learning. Without all of these resources I probably would have given
up in frustration well before now.

One of the last lingering issues I have with the PCA workflow is finding the transformed data points and then the residuals.
I do not believe that I have seen this analysis performed on the iris dataset. One video talked about using the matrix multiplication

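Transformed data = V^T X, where V is the reduced eigenvector matrix (i.e., the matrix formed from only those eigenvectors that the scree plot suggests are worth keeping). Apparently, if the transformed data is then multiplied again by V, the original non-transformed data (though centered) should be recovered. That is, V V^T X = I X = X, because V V^T = I. As far as I can tell, though, this identity is exact only when V keeps all of the eigenvectors; with a reduced V, the product V V^T X is only an approximation of X, and the difference between the two is the residual.

So, this should allow us to go back from the transformed data (compressed data) to the original data (recover the initial dataset X, exactly with all components or approximately with a reduced set).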
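The round trip can be checked by hand on a tiny example. The sketch below uses plain Python with made-up two-item data, so the eigenvectors of the 2 x 2 covariance matrix can be written in closed form from a rotation angle; all names are hypothetical. It projects each centered point onto the eigenvectors (V^T x), rebuilds it (V V^T x), and compares keeping both components with keeping only the first.

```python
from math import atan2, cos, sin
from statistics import mean

# Made-up data: each pair is one respondent's scores on two items.
data = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0), (5.0, 5.0)]

# Center the data (PCA works on centered values).
mx = mean(p[0] for p in data)
my = mean(p[1] for p in data)
centered = [(x - mx, y - my) for x, y in data]

# Sample covariance matrix [[a, b], [b, c]].
n = len(centered)
a = sum(x * x for x, _ in centered) / (n - 1)
b = sum(x * y for x, y in centered) / (n - 1)
c = sum(y * y for _, y in centered) / (n - 1)

# For a symmetric 2 x 2 matrix the eigenvectors are a rotation by theta.
theta = 0.5 * atan2(2 * b, a - c)
v1 = (cos(theta), sin(theta))    # direction of largest variance (PC1)
v2 = (-sin(theta), cos(theta))   # orthogonal direction (PC2)

def project_and_rebuild(point, vectors):
    """Return (scores, rebuilt point): scores = V^T x, rebuilt = V V^T x."""
    scores = [v[0] * point[0] + v[1] * point[1] for v in vectors]
    rebuilt = (sum(s * v[0] for s, v in zip(scores, vectors)),
               sum(s * v[1] for s, v in zip(scores, vectors)))
    return scores, rebuilt

# Keeping both eigenvectors: V V^T = I, so the centered data comes back.
full = [project_and_rebuild(p, [v1, v2])[1] for p in centered]
full_error = max(abs(p[i] - f[i]) for p, f in zip(centered, full) for i in (0, 1))

# Keeping only PC1: the rebuilt points are approximations; what is lost is the residual.
reduced = [project_and_rebuild(p, [v1])[1] for p in centered]
residuals = [(p[0] - r[0], p[1] - r[1]) for p, r in zip(centered, reduced)]
residual_ss = sum(dx * dx + dy * dy for dx, dy in residuals)

print("largest error with both components:", full_error)
print("residual sum of squares with PC1 only:", round(residual_ss, 6))
```

With both components the error is zero up to floating-point rounding. With only PC1 the residual sum of squares equals the variance carried by the discarded component times (n - 1). To get back to the original, uncentered data, the column means have to be added back to the rebuilt points.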

There are a great many other interesting data informatic techniques that I could investigate, though it is probably best that I move now to getting to work on my actual data collection/analysis with PCA.
Last edited:
Congratulations! Now you are more interested in statistics than in sociology! Well done! Maybe you will end up as a sociometrician.

But please note that real data sets are often more difficult than the demo data sets that you can see on YouTube. They often have outliers, some missing values and different variances, they are definitely not normally distributed, and sometimes the values are censored below a detection limit. And often there is just not any clear signal. Nothing is significant. Just be mentally prepared that it will not be so clear cut.


New Member
Thank you again for your reply, GretaGarbo. Going through the process of writing down one's ideas helps to crystallize them.

I had to take a vacation from the course for a while because I found it too disturbing. Sociology has to be the craziest subject that I have ever taken-- by far. In the course materials, students are required to watch a video in which there is a highly graphic depiction of a brutal murder. I found this to be psychologically challenging to cope with. On the course forum, some of the students swore at the course professor; the course professor swore right back at them. The course texts suggest that a viable solution to current global economic inequalities is for a violent socialist uprising. It truly is startling! By comparison, PCA is a walk in the park.

I do not see sociology and statistics as being mutually exclusive. In fact, many subjects taught at the high school level such as social studies, economics etc. transition into essentially applied math courses at university. Earlier in the course I took another excursion and read through Asimov's Foundation and Robot series. His idea of a psychohistorian (sociohistorian) was brilliant. Applying mathematical/statistical concepts to understanding and predicting society's evolution is a tremendously powerful bridge to help those from a science/math perspective to port over to the social sciences. PCA appears to be one statistical approach that could actually offer an approximate glimpse at what psychohistory might be like. It is such a great feature of online courses that I can go on these detours that I find interesting. This is what a real education is all about! In a classroom setting, when I have felt disengaged from course content, I would just begin drifting which does not make for a good educational experience or create a sense of well being. Online learning is the future, and the future is now.

I am continuing to develop deeper levels of understanding of what is involved in PCA. My current epiphany is that PCA often should not be thought of as providing THE answer: it is not like a mathematical answer. Instead PCA gives you the chance of taking verbal information --> quantifying it in a survey and then seeing how the numbers relate to the words. With my survey on university preparation, I might be able to see how certain types of preparation correlate with other types of preparation, how certain preparation features might predict certain outcomes etc. I need to move to a state of consciousness in which I can accept that no right or wrong answer will be reported in the PCA analysis. PCA will provide me insights about the nature of the dataset, and I should just be happy with these insights. PCA is such a powerful and readily understandable technique that every person who aspires to be considered educated should learn it. With all the resources available, I think it is reasonable that, in the proper learning context, this could be accomplished within 1 hour. For the payoffs involved, investing an hour to learn PCA has to make sense.

Finally I have reengaged with my sociology course and have applied myself to the survey. It required more effort to create the survey than I had expected. I have attached my rough draft to this post. It is still somewhat sketchy; comments/criticisms etc. would be welcome. One problem that I am having is trying to phrase the statements so that I can have a scale. My answers are often: No, Slightly, Moderately, Strongly. The problem here is that this does not represent a linear scale. Many would probably interpret Slightly as a 2, Moderately as a 7 and Strongly as a 10. One probably wants to aim for a more linear scale such as 0, 2.5, 5, 7.5, 10. Any substitution suggestions would also be welcome. I think I was able to capture most of the relevant forms of what is meant by being prepared for college, though others might have other ideas. Possibly one important addition would be to consider being financially prepared for college. Many students are not financially prepared, and then basically are never actually college students but instead are part-time or even full-time workers on the side.


Wow! I am not sure if Stack Exchange -- Cross Validated (CV) is somehow connected to this forum, though I have found CV to be an incredibly useful resource to deepen my understanding of PCA. It is brilliant! Let people ask questions and then let the best answers rise to the top. My impression is that the statistical experts on that forum are close to the leading authorities on subjects such as PCA. The explanations along with accompanying graphics are immensely helpful in extending one's intellectual grasp. I am beginning to wonder now what there is about PCA that still eludes me.
Perhaps there is an advanced description somewhere that puts everything on the table; I would be interested in reading such a document.

I am glad that I finally posted my survey. It is not that great, though at least I have something that can be critiqued and improved. My updated version is switching the questions to statements. Some of the questions in my rough draft were worded awkwardly. With statements, all I have to do is add a 5 point scale [Strongly Disagree --> Neutral --> Strongly Agree] and I am done. One demographic question that I am considering asking relates to previous high school grades or university grades. This would be one of the more obvious factors in preparing a student for university study. However, I do not want to pry too much nor scare off potential respondents.

It surprised me that I had neglected a very important aspect of my online university: It is Open Access! There are ZERO entry requirements. You can live anywhere in the world, be from virtually any economic, social or racial background, and enrol in my university! As long as you have paid your academic fees, you're good to go. I am not sure how this slight quirk escaped my attention. Open online universities are, in fact, a dramatic and radical departure from university cultural practices probably spanning upwards of thousands of years. I love my online learning experience so much(!), though it does introduce certain issues with respect to student preparation for university.

In a typical bricks and mortar school, there will be an overwhelming force homogenizing the student body. The students will be clones! At a top Ivy League school, perhaps all of my returned preparation surveys would be nearly identical: 18-24 year olds in lock step with their year of birth cohort, almost identical course completion counts, both parents are university graduates (check), pre-read course textbooks (check), highly restricted measured intelligence distribution (check) -- check, check, check, ... . The expression "restriction of range" would seem appropriate. What could I possibly learn from surveys that were so homogeneous? Except that they were so homogeneous. How would one PCA it?

Yet, with an Open online university, there might possibly be almost no correlation at all for the different data points. More to the point, a sizable subset of those surveyed might have completed absolutely none of the suggested preparatory steps. {Including Other as a choice for the Demographic Category: sex is gold. In sociology, it is not easy to avoid heteronormative this and patriarchal construct that. Realizing that Other can break out of the binary knot has to be worth bonus marks.} I will have no restriction of range issue with my dataset! In fact, my tutors have from time to time confided in me that my virtual fellow classmates can be excessively unprepared for university study. Memorably, one tutor noted that they had received assignments sent by snail mail in crayon. I am not sure if that is entirely believable, though a very wide variation in level of preparation in the responses would greatly simplify the analysis. Perhaps I will be able to easily break the data into a few clearly disjoint clusters: the kindergarteners, the high schoolers, the return to school demographic etc.

I started reading up on the literature on university preparedness. I was quite surprised that a highly comprehensive meta-analysis that included over 7,000 studies conducted over the last 30 years did not seem to mention preparation as one of the factors related to university success. I suppose that it could be disguised as another variable (e.g., procrastination), though it did seem odd to me that it wasn't mentioned by name. Having a specific plan and preparation strategy would seem almost essential to achieve success.
Last edited:
So awesome!
This is THE age of learning.
There are opportunities to grow and learn in ways that simply were not reasonably available to people in the past.

Clearly, I am just not engaging with my online sociology course.
The course textbook's 2 page description of the author's struggle to choose earring on the left or earring on the right rightfully could be described as oversharing. While I can certainly understand that such decisions are of monumental personal significance, one might reasonably ask how any such sociological dimension (including age, gender, race, class, and yes earring on the left or earring on the right and many others) has any particular relevance in the era of the virtual community. I have never met any of my virtual classmates, nor do I have any conception of their sociological reality. At a profound level, many of the undeniable truths of pre-internet sociology no longer matter.

In a typical physical bricks and mortar environment I would have politely sat out the full semester, pretended that I truly cared about the social significance of facial hair, etc. etc. -- and would have totally wasted my time. With virtual learning, I can change my learning itinerary on a dime as I see fit.

Ergo, I have gone online and searched out Regression MOOCs!
There are a ton of them starting within the next few days.
It's amazing!

I can study those subjects that I am passionate about and can easily bail if the courses are not what I want.
The MOOCs are typically only 4-7 weeks long, though they would offer a great starting point to go deep with a full course.

Nevertheless, I am still very glad that I did stretch out my intellectual horizons into sociology. Sociology does offer profoundly important perspectives into human society and specifically into problems that I have faced in my life and possible ways of solving such problems. Everyone probably would benefit from taking a sociology course.

I had been considering enrolling in a non-statistical psychometric course offered by my online university.
I am somewhat stuck with this "psychometrics for those who don't like math" approach, though with the background in stats perhaps I can use these skills in the course project.

Considering the truly colossal amount of government financial resources that are devoted to preparing college students to study at the college level once they arrive on campus, perhaps some of my insights could be used in order to save a vast amount of money and to energize students to be passionate about what they are studying and align their academic work to their future work.