Hi,

Important note: I will do the steps below with ready-made functions/packages, and I am not a professional statistician, so I would really appreciate a simplified explanation.

I have a dataset with a couple of features, "Product Risk Level" and "Customer Risk Score", and an outcome, "Customer Default", as below.

I want to measure how significant "Product Risk Level" and "Customer Risk Score" are for the outcome "Customer Default", or how strongly they correlate with it.

The dataset has 1300 aggregated rows in total (600 rows for the "Yes" outcome and 700 rows for the "No" outcome). The example I shared below uses made-up numbers.

I am planning to first run a t-test to check whether the features and the outcome are independent. Then I will either calculate the correlation between each feature and the outcome, or calculate the importance of each feature by using a "classification model".

However, unfortunately my dataset is not "per-event". It contains aggregated values: each row gives the number of samples for one combination of feature values and outcome.

I believe that I need to take the "Number of samples" into consideration when I do the t-test, the correlation, and/or the feature importance. The question is: how?
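
To make the question concrete, here is a minimal sketch of the two approaches I can think of, using plain Python and made-up numbers. The column coding (1 for "Yes", 0 for "No"), the variable names, and the helper `weighted_corr` are my own assumptions for illustration, not something taken from a specific package. The idea is either to use "Number of samples" as a weight, or to expand each aggregated row back into that many per-event rows.

```python
import math

# Made-up aggregated rows: (product_risk_level, customer_default, n_samples).
# customer_default is coded 1 for "Yes" and 0 for "No" (an assumed coding).
agg_rows = [
    (1, 0, 50), (1, 1, 5),
    (2, 0, 30), (2, 1, 15),
    (3, 0, 10), (3, 1, 40),
]

def weighted_corr(x, y, w):
    """Pearson correlation of x and y, with each (x, y) pair counted w times."""
    total = sum(w)
    mean_x = sum(wi * xi for xi, wi in zip(x, w)) / total
    mean_y = sum(wi * yi for yi, wi in zip(y, w)) / total
    cov = sum(wi * (xi - mean_x) * (yi - mean_y) for xi, yi, wi in zip(x, y, w))
    var_x = sum(wi * (xi - mean_x) ** 2 for xi, wi in zip(x, w))
    var_y = sum(wi * (yi - mean_y) ** 2 for yi, wi in zip(y, w))
    return cov / math.sqrt(var_x * var_y)

levels = [row[0] for row in agg_rows]
defaults = [row[1] for row in agg_rows]
counts = [row[2] for row in agg_rows]

# Option A: keep the data aggregated and use the counts as weights.
r_weighted = weighted_corr(levels, defaults, counts)

# Option B: expand every aggregated row into n_samples per-event rows,
# then compute the ordinary (unweighted) correlation.
expanded = [(lvl, d) for lvl, d, n in agg_rows for _ in range(n)]
r_expanded = weighted_corr(
    [e[0] for e in expanded],
    [e[1] for e in expanded],
    [1] * len(expanded),
)

print(round(r_weighted, 4), round(r_expanded, 4))
```

In this sketch both options give the same correlation, so I assume weighting by the counts is equivalent to working on per-event data. What I do not know is whether the same reasoning carries over to the t-test and to feature importance from a classification model.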
