I have a dataset with n = 100 000 rows and p = 2 variables, X and Y.

There is a trend between these two variables, but it is very blurry: with so many points, nothing is visible on a scatter plot.

My strategy is to run a clustering algorithm (K-Means, for example) on the 100 000 rows and group them into 1000 clusters (the purpose is to capture the dispersion of the whole dataset). I can then compute the center (centroid) of each cluster.

Afterwards, I plot only the 1000 centers on a graph (X against Y). The relationship looks clearly linear, so I fit a linear model to these 1000 points (the centers). The 1000 centers are meant to represent the 100 000 points.
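To make the procedure concrete, here is a minimal sketch in Python on synthetic data. Everything in it is an assumption for illustration only: the simulated relationship y = 1 + 2x with Gaussian noise, the reduced sizes (2000 rows and 40 clusters instead of 100 000 and 1000, so that it runs quickly), and the helper names `kmeans` and `fit_line`, which are hypothetical. It uses a plain Lloyd's iteration rather than a library K-Means, and it also fits the same line on all rows so the two slopes can be compared.

```python
import random

def kmeans(points, k, n_iter=10, seed=0):
    """Plain Lloyd's algorithm on 2-D points; returns the k cluster centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k rows of the data
    for _ in range(n_iter):
        sums = [[0.0, 0.0, 0] for _ in range(k)]  # sum of x, sum of y, count
        for x, y in points:
            # assign the point to its nearest center (squared Euclidean distance)
            j = min(range(k),
                    key=lambda c: (x - centers[c][0]) ** 2 + (y - centers[c][1]) ** 2)
            sums[j][0] += x
            sums[j][1] += y
            sums[j][2] += 1
        # move each center to the mean of its points; keep it if the cluster is empty
        centers = [(s[0] / s[2], s[1] / s[2]) if s[2] else c
                   for s, c in zip(sums, centers)]
    return centers

def fit_line(points):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x, _ in points)
    b = sxy / sxx
    return my - b * mx, b

# Synthetic stand-in for the real data (assumed: y = 1 + 2x + noise)
rng = random.Random(42)
data = []
for _ in range(2000):
    x = rng.uniform(0, 10)
    data.append((x, 1.0 + 2.0 * x + rng.gauss(0, 3)))

centers = kmeans(data, k=40)
a_all, b_all = fit_line(data)    # linear model on every row
a_c, b_c = fit_line(centers)     # linear model on the centers only

print(f"all rows: intercept={a_all:.3f}, slope={b_all:.3f}")
print(f"centers : intercept={a_c:.3f}, slope={b_c:.3f}")
```

Note that in this sketch every center counts equally in the second fit, whatever the number of rows in its cluster.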

In the end, this model will be useful for biologists (and for a future publication).

Is this a valid approach? In a way, the goal is to reduce the noise in my data.

I did this because I know the same can be done with a Self-Organizing Map: reduce the data to 1000 neurons and then work only with the neurons.

Thank you from France