How to determine the abnormality of a specific variable by taking into account all the other variables in the data?

AdrienC

New Member
#1
Hello,

I have a machine learning/anomaly detection problem. I have a variable Y and several other variables X. The goal is to quantify how abnormal each observation's Y value is, while taking into account its values on the other variables (that is, the relationship between Y and X).

Normally, an anomaly detection algorithm would find anomalies in the whole data (Y + X), but in my case I want to zoom in on Y because it is a very important variable. If I quantified the abnormality over all my variables (Y + X), Y would be lost among all the other variables.

This is not a strange request: when you fit a linear regression Y ~ X, you can compute Cook's distance, which is a kind of "abnormality score" that takes into account the relationship between Y and X.

I hope it is clear!

Thank you

Have a nice day
Adrien
 

Miner

TS Contributor
#3
Can you provide additional context? What type of data are you dealing with? Is it time series? Data on individuals or transactions, etc.? Process data?
 

AdrienC

New Member
#4
Hello, thank you for your answers. It is a dataset with n = 100 000 individuals (rows) and p = 50 columns (the first one is Y and the other 49 are the X variables). All the variables are quantitative and they are not time series.

I can't go into details, but the X variables are just measurements that we took on our 100 000 patients, and Y is a very important measurement correlated with X.

We would like to know if there is a way (ideally a paper, but I haven't found one yet) to measure the degree of abnormality of Y (univariate data) while taking X into account.

If I had to write it mathematically, it would be something like D(Y | X), where D is a function that measures the degree of abnormality (like a conditional probability).

I am well aware of the papers on novelty detection, but they all try to find anomalies in just Y (univariate anomaly detection) or just X (multivariate detection), without any conditioning.
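One possible way to make D(Y | X) concrete, sketched here under a strong assumption that is not stated in the thread: if (Y, X) is treated as jointly Gaussian, then Y given X is itself Gaussian, and each row can be scored by how far its Y lies from the conditional mean, in conditional standard deviations. The function name `conditional_abnormality` and the synthetic data are hypothetical and only for illustration.

```python
import math
import numpy as np

def conditional_abnormality(y, X):
    """Score each y_i given x_i, ASSUMING (Y, X) is jointly Gaussian."""
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    n = len(y)
    mu_y, mu_x = y.mean(), X.mean(axis=0)
    yc, Xc = y - mu_y, X - mu_x
    S_xx = Xc.T @ Xc / (n - 1)           # covariance of X, shape (p, p)
    S_xy = Xc.T @ yc / (n - 1)           # covariance between X and Y, shape (p,)
    s_yy = yc @ yc / (n - 1)             # variance of Y
    beta = np.linalg.solve(S_xx, S_xy)   # S_xx^{-1} S_xy
    cond_mean = mu_y + Xc @ beta         # estimate of E[Y | X = x_i]
    cond_var = s_yy - S_xy @ beta        # estimate of Var[Y | X], same for all rows
    z = (y - cond_mean) / math.sqrt(cond_var)
    # Two-sided tail probability under N(0, 1): a small value means "abnormal".
    p = np.array([math.erfc(abs(v) / math.sqrt(2.0)) for v in z])
    return z, p

# Synthetic illustration (made-up data): row 10 has a Y that does not fit its X.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=500)
y[10] += 5.0
z, p = conditional_abnormality(y, X)
print(int(np.argmax(np.abs(z))))
```

The score only looks at Y after removing what X explains, so a row with unusual X but a Y consistent with it is not flagged. If the Gaussian or linear assumption is doubtful for the real data, the same idea would need a more flexible model of Y given X.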

Thank you so much
 

Miner

TS Contributor
#5
Do you have a multiple regression model with which you can work? If you do, I would suggest using Cook's distance for detecting influential observations and the standardized residuals for anomalous observations.

If you do not, you might look into some of the multivariate approaches mentioned here. I cannot help you with any of these; my specialty lies more in the process and time series realm.
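As a rough illustration of the regression-based suggestion above, the following sketch computes standardized (internally studentized) residuals and Cook's distance for an ordinary least squares fit using only NumPy. It assumes a plain linear model with an intercept; the function name `regression_diagnostics` and the synthetic data are hypothetical.

```python
import numpy as np

def regression_diagnostics(y, X):
    """Standardized residuals and Cook's distance for an OLS fit of y on X."""
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    n = len(y)
    A = np.column_stack([np.ones(n), X])        # design matrix with intercept
    k = A.shape[1]                              # number of fitted coefficients
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    # Leverage h_ii = diagonal of the hat matrix, via a reduced QR of A
    # (assumes A has full column rank).
    Q, _ = np.linalg.qr(A)
    h = np.sum(Q**2, axis=1)
    s2 = resid @ resid / (n - k)                # residual variance estimate
    std_resid = resid / np.sqrt(s2 * (1.0 - h)) # internally studentized residuals
    cooks_d = std_resid**2 * h / (k * (1.0 - h))
    return std_resid, cooks_d

# Synthetic illustration (made-up data): row 7 has an unusual Y for its X.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = X @ np.array([2.0, 0.0, -1.0, 0.5]) + rng.normal(scale=1.0, size=300)
y[7] += 12.0
std_resid, cooks_d = regression_diagnostics(y, X)
print(int(np.argmax(np.abs(std_resid))))
```

Note the difference between the two scores: the standardized residual measures how unusual Y is given X, while Cook's distance also weights by leverage, so a row with unusual X values can get a large Cook's distance even when its residual is moderate.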
 

AdrienC

New Member
#7
Thank you for your answers. Indeed, anomaly detection is a big field. I am doing my PhD on this: I work on Isolation Forest, Local Outlier Factor, ... but all those methods, like the Mahalanobis distance, only measure an abnormality score on a data frame X with p columns and n rows.

My problem is a little different: I want to measure the abnormality only on the variable Y (univariate), but taking into account the other set of variables X.

Basically, the goal is to look for anomalies in the response variable, which is related to X.

Thank you!

And have a nice day :)
 
#8
Can you analyze the residuals? (I am not sure that takes X into account in the sense you mean.) I am no expert in this, and if you are doing a PhD in it you likely know more than I can dream of. :p
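For what it is worth, a residual is Y minus a prediction built from X, so in that sense it does condition on X. A minimal sketch of one way to analyze residuals follows, with two choices that are assumptions of this example rather than anything from the thread: each row's prediction comes from a linear model fitted on the other folds, so an abnormal Y cannot pull its own prediction toward itself, and the residuals are scaled by the median absolute deviation (MAD), so a few extreme values do not inflate the scale. The name `robust_residual_scores` is hypothetical.

```python
import numpy as np

def robust_residual_scores(y, X, n_folds=5, seed=0):
    """Out-of-fold OLS residuals of y on X, scaled to robust z-scores."""
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    n = len(y)
    A = np.column_stack([np.ones(n), X])            # design matrix with intercept
    # Random, balanced assignment of rows to folds.
    folds = np.random.default_rng(seed).permutation(n) % n_folds
    resid = np.empty(n)
    for f in range(n_folds):
        held_out = folds == f
        coef, *_ = np.linalg.lstsq(A[~held_out], y[~held_out], rcond=None)
        resid[held_out] = y[held_out] - A[held_out] @ coef
    med = np.median(resid)
    mad = np.median(np.abs(resid - med))            # assumed non-zero (continuous data)
    # 1.4826 makes the MAD comparable to a standard deviation for Gaussian data.
    return (resid - med) / (1.4826 * mad)

# Synthetic illustration (made-up data): row 42 has an unusual Y for its X.
rng = np.random.default_rng(2)
X = rng.normal(size=(400, 5))
y = X @ np.array([1.0, 0.5, -0.5, 2.0, 0.0]) + rng.normal(scale=1.0, size=400)
y[42] += 8.0
scores = robust_residual_scores(y, X)
print(int(np.argmax(np.abs(scores))))
```

Rows with a large absolute score are those whose Y is far from what their X values predict. The linear model here is only a placeholder; any regression model of Y on X could supply the out-of-fold predictions.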