Hello everyone!
I have a dataset with n = 100,000 rows and p = 2 variables, X and Y.
There is a trend between these two variables, but the scatter plot is very blurry and we can't see anything (too many points).
My strategy is to run a clustering algorithm (K-Means, for example) on the 100,000 rows and group them into 1,000 clusters (the purpose is to capture the dispersion of the whole dataset). As you know, I can then compute the "center" of each cluster.
Then I plot only the 1,000 centers on a graph (in the X and Y dimensions). The relationship looks very linear, so I fit a linear model on these 1,000 points (the centers). The 1,000 centers represent the 100,000 points.
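To be concrete, here is roughly what I mean (a minimal sketch with scikit-learn; the simulated data and the exact parameter choices are just placeholders for my real dataset):

```
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

# Placeholder data: replace with the real 100,000 x 2 dataset (columns X and Y)
rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
y = 2.0 * x + rng.normal(scale=5.0, size=100_000)   # noisy linear trend
data = np.column_stack([x, y])

# Step 1: group the 100,000 points into 1,000 clusters and keep the centers
kmeans = KMeans(n_clusters=1000, n_init=10, random_state=0).fit(data)
centers = kmeans.cluster_centers_                    # shape (1000, 2)

# Step 2: fit the linear model on the 1,000 centers only
lm = LinearRegression().fit(centers[:, [0]], centers[:, 1])
print(lm.coef_[0], lm.intercept_)
```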
In the end, this model will be useful for biologists (and a future publication).
Is it correct to do that? In a way, the goal is to reduce the noise in my data.
I did this because I know the same can be done with a Self-Organizing Map: reduce the data to 1,000 neurons and then work only with the neurons (see the sketch below).
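For comparison, this is the kind of SOM reduction I have in mind (a rough sketch using the minisom package, not my actual analysis; the 25 x 40 grid and training settings are just illustrative choices that give 1,000 neurons):

```
import numpy as np
from minisom import MiniSom

# Same kind of placeholder data as in the K-Means sketch above
rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
y = 2.0 * x + rng.normal(scale=5.0, size=100_000)
data = np.column_stack([x, y])

# A 25 x 40 grid gives 1,000 neurons, playing the same role as the 1,000 K-Means centers
som = MiniSom(25, 40, input_len=2, sigma=1.0, learning_rate=0.5, random_seed=0)
som.random_weights_init(data)
som.train_random(data, num_iteration=10_000)

neurons = som.get_weights().reshape(-1, 2)   # shape (1000, 2), one row per neuron
```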
Thank you from France