Association between non-normal continuous variable and dichotomous variable

#1
Hello,

I'm trying to examine the association between a non-normally distributed continuous variable and a dichotomous (yes/no) variable. At first I thought I should use point-biserial correlations, but then I realised that the point-biserial correlation assumes the continuous variable is normally distributed. For this reason, I ran a Mann-Whitney U test instead, since it is the non-parametric alternative to the independent-samples t-test. Was that the correct choice?
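To make the test described above concrete, here is a minimal standard-library sketch of the Mann-Whitney U statistic with a normal-approximation p-value. The scores are made up for illustration, and the p-value uses no tie or continuity correction; in practice one would more likely call a library routine such as `scipy.stats.mannwhitneyu`.

```python
import math

def average_ranks(values):
    """Return 1-based ranks; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of sorted positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def mann_whitney_u(x, y):
    """U statistic for x plus a two-sided normal-approximation p-value."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    rank_sum_x = sum(ranks[:n1])
    u = rank_sum_x - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)  # no tie correction
    z = (u - mu) / sigma
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of the normal
    return u, p

# Hypothetical scores for illustration only (not the poster's data)
yes_scores = [12.0, 15.5, 9.0, 30.0, 22.5, 18.0]
no_scores = [5.0, 7.5, 11.0, 6.0, 8.0, 14.0, 4.5]

u, p = mann_whitney_u(yes_scores, no_scores)
print(f"U = {u:.1f}, approximate p = {p:.3f}")
```

With groups this small, an exact p-value would be preferable to the normal approximation used here.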
 

hlsmith

Less is more. Stay pure. Stay poor.
#2
It works. Is the continuous variable your dependent variable? There are several options for the Wilcoxon test (e.g., exact, medians). Depending on skewness, a data transformation to normalize the variable is always an option if it is actually needed, as is quantile regression. What is your sample size?
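As a rough illustration of the transformation idea, the sketch below computes the sample skewness of a positively skewed variable before and after a log transform. The values are invented, and the log transform is only one possible choice; it assumes all values are strictly positive.

```python
import math

def sample_skewness(values):
    """Moment-based skewness: third central moment over variance^1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical positively skewed scores (all > 0, as log requires)
scores = [1.2, 1.5, 1.9, 2.3, 2.8, 3.5, 4.1, 5.6, 9.8, 21.0]
logged = [math.log(v) for v in scores]

skew_raw = sample_skewness(scores)
skew_log = sample_skewness(logged)
print(f"skewness raw: {skew_raw:.2f}, after log: {skew_log:.2f}")
```

Whether a transformation is worthwhile depends on the analysis; results are then interpreted on the transformed scale.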

What is the purpose of comparing the two groups?
 
#3
Yes, the continuous variable is my dependent variable. My sample size is n=52 but I excluded one participant because he was an extreme outlier. So, for my analysis I use n=51. Even with the exclusion of the outlier, the dataset remains positively skewed.

I am looking into students' participation in different activities: YES, i.e. the person participated in the activity and NO, i.e. they didn't participate in the activity. And I want to see the relationship between taking part in an activity and my dependent variable.

You mentioned the Wilcoxon test. In what ways is it different from the Mann Whitney U?
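On this question: the Wilcoxon rank-sum test and the Mann-Whitney U test are the same test under two names (the Wilcoxon signed-rank test, for paired data, is a different procedure). Their statistics differ only by a constant, U = W - n1(n1 + 1)/2, where W is the rank sum of the first group. A small sketch with made-up, tie-free data checks this:

```python
def rank_sum_w(x, y):
    """Wilcoxon rank-sum W: sum of x's ranks in the pooled sample (no ties)."""
    pooled = sorted(list(x) + list(y))
    return sum(pooled.index(v) + 1 for v in x)

def pair_count_u(x, y):
    """Mann-Whitney U: number of (x, y) pairs with x > y (no ties)."""
    return sum(1 for a in x for b in y if a > b)

# Hypothetical scores for illustration only
yes_scores = [12.0, 15.5, 9.0, 30.0, 22.5, 18.0]
no_scores = [5.0, 7.5, 11.0, 6.0, 8.0, 14.0, 4.5]

n1 = len(yes_scores)
w = rank_sum_w(yes_scores, no_scores)
u = pair_count_u(yes_scores, no_scores)
print(w, u, w - n1 * (n1 + 1) // 2)
```

Both helpers assume no tied values; with ties, average ranks would be needed.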
 

Karabiner

TS Contributor
#4
alternative to the independent t-test, since my continuous variable is not normally distributed.
An independent-samples t-test does NOT require that the dependent variable
as a whole is normally distributed. Instead, it assumes that each group is
sampled from a normally distributed population (the assumption concerns the
population, not the sample). And even this can be ignored if the total sample
size is large enough; n > 50 should be more than enough (see the central
limit theorem).
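The central limit theorem argument can be checked with a small simulation. This is only a sketch under assumptions that are not from the thread: the population is exponential (skewness 2) and a single group of n = 51 is drawn each time.

```python
import random

def skewness(values):
    """Moment-based sample skewness."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

random.seed(0)  # fixed seed so the run is reproducible
n, reps = 51, 5000

# Means of repeated samples of size n from a skewed (exponential) population
sample_means = [
    sum(random.expovariate(1.0) for _ in range(n)) / n for _ in range(reps)
]

skew_of_means = skewness(sample_means)
print(f"population skewness: 2.0, skewness of sample means: {skew_of_means:.2f}")
```

The sample means are far less skewed than the population itself, which is why the t-test tolerates non-normal data at moderate sample sizes. Note that this concerns the size of each group being averaged, so it says little about a group of only a few observations.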

With kind regards

Karabiner
 
#5
each group is sampled from a normally distributed population
I'm not sure this applies to my data because it is one group of students. It's just that some of them said YES to having participated in certain activities and others said NO. So, the two groups that emerge (YES students and NO students) don't represent any specific population... Also, for some activities I've had, for example, 3 participants saying YES to an activity and the remaining 48 saying NO. Would either an independent-samples t-test or a Mann-Whitney U test still be the correct choice to measure the association between participation in an activity and a non-normally distributed dependent variable?

Thank you very much for your help.
 

Karabiner

TS Contributor
#6
(YES students and NO students) don't represent any specific population...
They are a sample from a population anyway.
3 participants having said YES to an activity and the remainder 48 having said NO. Would either an independent sample t-test or a Mann Whitney U test still be the correct choice to measure the association between participation to an activity and non-normally distributed dependent variable?
n(total)=51 would make a t-test possible. If the groups have unequal sizes, then it is important to correct for unequal variances (the Welch correction of the t-test; it is implemented in most software packages, I suppose, and often carried out automatically). Personally, I would perhaps not be much interested in a statistical inference where one of the 2 groups consists of only 3 subjects.

The advantage of the U-test, though, would be that you would not have to bother about variances, normality, outliers etc. But it does not compare means, if that is what you are interested in.
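For reference, the Welch correction mentioned above can be sketched with the standard library alone. The data are hypothetical (3 YES versus 10 NO, shortened for readability), and only the t statistic and the Welch-Satterthwaite degrees of freedom are computed; a library routine such as `scipy.stats.ttest_ind` with `equal_var=False` would also give the p-value.

```python
import math
from statistics import mean, variance

def welch_t(x, y):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    n1, n2 = len(x), len(y)
    v1 = variance(x) / n1  # squared standard error of mean(x)
    v2 = variance(y) / n2  # squared standard error of mean(y)
    t = (mean(x) - mean(y)) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

# Hypothetical scores for illustration only
yes_scores = [14.0, 22.0, 30.0]
no_scores = [5.0, 6.5, 7.0, 8.0, 9.5, 10.0, 11.0, 12.5, 13.0, 15.0]

t, df = welch_t(yes_scores, no_scores)
print(f"t = {t:.2f}, df = {df:.2f}")
```

In this example the degrees of freedom come out close to 2, that is, close to the size of the smaller group minus one, which illustrates why an inference resting on a group of 3 carries little weight.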

With kind regards

Karabiner
 
#7
But it does not compare means, if it's that what you are interested in.
Thank you very much for all the advice.

Basically, what I'm trying to do is see whether there is an association between variables in order to prepare my data for regression. This is straightforward with my continuous variables, where I have collected all the statistically significant and strong Spearman correlation coefficients, but I find it hard to understand how to do the same with my nominal dichotomous variables. That's why I've used Mann-Whitney U tests. Ideally, I would then like to take all the variables with strong correlations, add them to a model and find the predictor. I'm not sure if that's the right way to do it though...
 
#10
The core point is whether your data come from a random sample or not. If they do, a t-test or a simple regression model can work, or a non-parametric test if the distribution is not normal. If your data are not random, you need to add control variables so that your treatment resembles a randomized experiment; then you can estimate the real relationship between the continuous variable and the dichotomous variable, or even a causal relationship.
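A small sketch of what adding a control variable does in a regression. Everything here is constructed for illustration: `hours` is a hypothetical confounder that fully drives the outcome and happens to be higher among participants, and the data contain no noise. Adjustment of this kind only helps for confounders that were actually measured.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def ols(columns, y):
    """Least-squares coefficients [intercept, b1, b2, ...] via normal equations."""
    rows = [[1.0] + list(r) for r in zip(*columns)]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

# Hypothetical data: participation has no effect of its own on the outcome
participated = [0, 0, 0, 0, 1, 1, 1, 1]
hours = [1.0, 2.0, 3.0, 4.0, 3.0, 4.0, 5.0, 6.0]
outcome = [2.0 + 3.0 * h for h in hours]

naive = ols([participated], outcome)            # outcome ~ participated
adjusted = ols([participated, hours], outcome)  # outcome ~ participated + hours
print(f"naive effect: {naive[1]:.2f}, adjusted effect: {adjusted[1]:.2f}")
```

Without the control, participation appears to raise the outcome; once `hours` is in the model, the apparent effect of participation disappears.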
 
#11
Thank you. What do you mean when you say "random"? My sample is a convenience sample, unfortunately.
Also, what do you mean by "control variables"? How can I add those?
 
#12
In statistical terms, you need a random sample to study the relationship; a convenience sample can give a biased result. For example, in a simple random sample every member of the population has the same probability of being selected. A convenience sample is not a good choice; sometimes it can serve as a reference, but it does not show the real relationship. Control variables matter when other variables are correlated with both the dependent variable and the independent variable.