Calculating t-test and correlation/importance of feature with aggregated dataset

cercig

New Member
Hi,

Important note: I will do the steps below with ready-made functions/packages, and I am not a professional statistician, so I would really appreciate a simplified explanation.

I have a dataset with two features, "Product Risk Level" and "Customer Risk Score", and an outcome, "Customer Default", as below.
I want to calculate the significance or correlation of "Product Risk Level" and "Customer Risk Score" with respect to the outcome "Customer Default".

The sample size is 1300 aggregated rows in total (600 rows for the "Yes" outcome and 700 rows for the "No" outcome). The example I shared below uses made-up numbers.

I am planning to first do a t-test to check the independence between the features and the outcome. Then I will either calculate the correlation between each feature and the outcome, or calculate the importance of each feature by using a classification model.

Unfortunately, my dataset is not "per-event". It contains aggregated values: the number of samples in each feature-outcome pair.

I believe that I need to take the "Number of samples" into consideration when I do the t-test and the correlation and/or "importance of feature" calculation. The question is: how?


Karabiner

TS Contributor
For a t-test, you need the variability of the scores (standard deviation in each pair-group).
As you do not have this, you'll have to analyse the data on the aggregate level.
But since your sample size for such an analysis is small (n=7 groups), a t-test for the
risk score, or a U-test for the risk level (which clearly is ordinal scaled and would not
permit a t-test) does not seem very useful. They would have extremely low statistical power
to detect any effect.

Of course, you can do descriptive statistics, for example calculate the weighted mean
(or median, respectively) of "outcome: yes" versus "outcome: no".
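As a minimal sketch of that descriptive step, the snippet below computes a weighted mean and a weighted median of the risk score per outcome, using the "Number of samples" column as the weight. The rows, the numbers, and the helper names (`weighted_mean`, `weighted_median`, `summarise`) are made up for illustration and are not taken from the real dataset.

```python
# Hypothetical aggregated rows: (risk_score, outcome, number_of_samples).
# All values are made up for illustration only.
rows = [
    (10.0, "Yes", 5),
    (20.0, "Yes", 15),
    (10.0, "No", 30),
    (30.0, "No", 10),
]


def weighted_mean(values, weights):
    """Mean of the values, each counted as often as its weight says."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def weighted_median(values, weights):
    """Smallest value at which the cumulative weight reaches half the total."""
    half = sum(weights) / 2.0
    cumulative = 0
    for value, weight in sorted(zip(values, weights)):
        cumulative += weight
        if cumulative >= half:
            return value


def summarise(rows, outcome):
    """Weighted mean and median of the risk score for one outcome."""
    values = [r[0] for r in rows if r[1] == outcome]
    weights = [r[2] for r in rows if r[1] == outcome]
    return weighted_mean(values, weights), weighted_median(values, weights)


mean_yes, median_yes = summarise(rows, "Yes")
mean_no, median_no = summarise(rows, "No")
print("Yes:", mean_yes, median_yes)
print("No: ", mean_no, median_no)
```

The weighted median would be the natural choice for the ordinal risk level, once its categories are put in order.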

Just my 2pence

Karabiner

cercig

New Member
Sorry @Karabiner, I forgot to mention that my sample size is actually 1300 aggregated rows in total (600 with Yes and 700 with No); what I shared here was just a made-up sample. In that case, is there a way of calculating the "importance of feature" and doing a test such as a t-test?


Karabiner

TS Contributor
So, as far as I can see, you can perform a t-test on the group-level data with risk score, and a U-test or a Chi² test with risk level.
I do not know whether it would be useful or possible to take group size into account; maybe someone else does.
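One possible way to bring group size into a Chi² test, offered only as a sketch: sum the "Number of samples" into a risk level by outcome contingency table and compute the Pearson Chi² statistic from those counts. This assumes that every counted sample is an independent customer, which may not hold for the real data. The rows and the helper names (`contingency`, `chi_square`) are hypothetical.

```python
from collections import defaultdict

# Hypothetical aggregated rows: (risk_level, outcome, number_of_samples).
# All values are made up for illustration only.
rows = [
    ("Low", "Yes", 10),
    ("Low", "No", 40),
    ("High", "Yes", 30),
    ("High", "No", 20),
]


def contingency(rows):
    """Sum the sample counts into a level-by-outcome table."""
    table = defaultdict(lambda: defaultdict(int))
    for level, outcome, n in rows:
        table[level][outcome] += n
    return table


def chi_square(table):
    """Pearson Chi² statistic and degrees of freedom for the table."""
    levels = sorted(table)
    outcomes = sorted({o for cells in table.values() for o in cells})
    row_total = {l: sum(table[l][o] for o in outcomes) for l in levels}
    col_total = {o: sum(table[l][o] for l in levels) for o in outcomes}
    total = sum(row_total.values())
    stat = 0.0
    for l in levels:
        for o in outcomes:
            # Count expected in this cell if level and outcome were independent.
            expected = row_total[l] * col_total[o] / total
            stat += (table[l][o] - expected) ** 2 / expected
    dof = (len(levels) - 1) * (len(outcomes) - 1)
    return stat, dof


stat, dof = chi_square(contingency(rows))
print("chi2 =", round(stat, 3), "dof =", dof)
```

The p-value is left out to keep the sketch dependency-free. If SciPy is available, the same table of counts could be passed to `scipy.stats.chi2_contingency`, which also reports the p-value.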

It would be possible to use both measurements to jointly predict the outcome, but that would be a bit more complicated
(binary logistic regression).
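A rough sketch of how a binary logistic regression could be fitted directly on aggregated data: each cell contributes to the log-likelihood once per sample it represents, so the counts act as frequency weights. It assumes the counted samples are independent and, purely for illustration, codes the risk level as 1, 2, 3 and puts the risk score on a 0 to 1 scale. The data and the plain gradient-ascent fitter are made up; a statistics package that accepts frequency weights would be the more usual route.

```python
import math

# Hypothetical cells: (risk_level, risk_score, n_default_yes, n_default_no).
# All values are made up for illustration only.
cells = [
    (1, 0.2, 5, 45),
    (1, 0.6, 10, 40),
    (2, 0.3, 15, 35),
    (2, 0.7, 25, 25),
    (3, 0.4, 30, 20),
    (3, 0.9, 40, 10),
]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_likelihood(beta, cells):
    """Log-likelihood in which each cell counts once per sample it holds."""
    ll = 0.0
    for level, score, n_yes, n_no in cells:
        p = sigmoid(beta[0] + beta[1] * level + beta[2] * score)
        ll += n_yes * math.log(p) + n_no * math.log(1.0 - p)
    return ll


def fit(cells, lr=0.2, n_iter=5000):
    """Gradient ascent on the count-weighted log-likelihood."""
    total = sum(n_yes + n_no for _, _, n_yes, n_no in cells)
    beta = [0.0, 0.0, 0.0]  # intercept, risk level, risk score
    for _ in range(n_iter):
        grad = [0.0, 0.0, 0.0]
        for level, score, n_yes, n_no in cells:
            x = (1.0, level, score)
            p = sigmoid(sum(b * xi for b, xi in zip(beta, x)))
            # Observed defaults minus expected defaults in this cell.
            resid = n_yes - (n_yes + n_no) * p
            for j in range(3):
                grad[j] += resid * x[j]
        beta = [b + lr * g / total for b, g in zip(beta, grad)]
    return beta


beta = fit(cells)
print("intercept, level, score:", [round(b, 2) for b in beta])
```

The coefficients are on the log-odds scale, so a positive value means that higher values of that feature go with a higher chance of default in this toy data.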

With kind regards

Karabiner