How to normalize LC-MS data with minimum bias (proteomics)

Hello all,

I'm working on quantifying modified peptides enriched from the serum of sick and healthy people. In the raw data (integrated LC-MS peak areas) the modification appears to be significantly more abundant in the sick people's serum, which is the hypothesis and is also consistent with the literature.

Now, I'm not sure how to perform normalization without losing the apparent differences. If I normalize each peak list (derived from each LC-MS run) using the average or median area, the differences are lost. This happens because the "sick" set has larger areas for almost all peaks, while the "healthy" set has smaller peaks overall, which is exactly what it should have if the hypothesis is true, since the enrichment captures the modified peptides rather than other peptides.

Using the square root of the average area as the normalization factor has a less radical effect on the values: it reduces the standard deviation between technical replicates but preserves most of the differences between sick and healthy. I'm not sure, though, whether that's an appropriate normalization method.
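To make the two factors concrete, here is a minimal sketch of both variants, assuming Python/NumPy; the peak areas are invented for illustration:

```python
import numpy as np

def normalize(run, factor):
    """Divide every peak area in one LC-MS run by a scalar factor."""
    return run / factor

# Hypothetical integrated peak areas for one run (arbitrary units).
run = np.array([1200.0, 950.0, 3100.0, 780.0])

by_mean = normalize(run, run.mean())                # classic mean normalization
by_sqrt_mean = normalize(run, np.sqrt(run.mean()))  # milder: square root of the mean
```

Because sqrt(mean) is a much smaller divisor than the mean itself, the square-root variant shrinks run-to-run scale differences less aggressively, which matches the behavior described above.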

Any suggestions?
Why do you want to “normalize” the data?

Is it because you want it to look more like a normal distribution, so that you can do a statistical significance test?

Can you please explain what LC-MS is? It is obvious to you, but not to me (I have a clue, though). Every day I see people using abbreviations that are obvious to them, or to Americans, but not to me or the rest of the world, so their questions are left unanswered simply because they are not understood. How about: "use LS or ML or preferably REML"? Maybe not obvious to the reader.

What do you mean by normalizing? Is it: "subtract the mean and divide by the standard deviation"? That is also called standardisation.

Do you also have values under the “detection limit”?
LC-MS stands for liquid chromatography–mass spectrometry. It's not very relevant here, just the method I used to collect the data. It should be an obvious abbreviation to anyone involved in bioanalytics, but of course not to many others; my apologies.

I want to normalize to minimize the random errors arising from sample treatment and analysis. Errors from the LC-MS analysis can, for example, be related to injection volume and ion suppression. I would like the different samples to be comparable. And yes, I also want to do statistical significance tests.

What do you mean by normalizing? Is it: "subtract the mean and divide by the standard deviation"? That is also called standardisation.
Yes, that's also one normalization method I've tried (taking the Z-score), but it definitely introduces bias into the data. Common normalization methods in proteomics divide each data set by its mean or median: I calculate the mean peak area of all peaks in one LC-MS run and divide all the peaks in that run by this normalization factor. In most cases this is an appropriate method, assuming that most peaks are the same size across all runs and only a few are different. If that holds, the few genuine differences are preserved after normalization. But in my case most peaks differ in size between the two groups, and the differences are faded out by this type of normalization.
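The fading-out effect is easy to reproduce. A minimal sketch, with made-up numbers, assuming Python/NumPy: if nearly every peak is larger in the "sick" run by the same factor, that factor also inflates the run's mean, so mean normalization cancels exactly the difference you care about.

```python
import numpy as np

# Hypothetical runs: the "sick" peaks are uniformly ~2x larger.
healthy = np.array([100.0, 120.0, 90.0, 110.0])
sick = healthy * 2.0

# Mean normalization: divide each run by its own mean peak area.
healthy_n = healthy / healthy.mean()
sick_n = sick / sick.mean()

# The global 2x difference is erased, because it scales the factor too.
print(np.allclose(healthy_n, sick_n))  # prints True
```

With real data the cancellation is not this exact, but the same mechanism shrinks any group-wide difference toward zero.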

No, I don’t have values under the detection limit; I’ve only chosen intense peaks for my data.
It seems like your question is not about the most usual song here: “my-data-is-not-normally-distributed-what-am-I-going-to-do?”. So, if I understand you correctly, it is not primarily a significance-testing question.

But rather about getting precise estimates, with low bias and good discriminatory power, when comparing healthy versus sick persons.

The square-root thing was a transformation to try to achieve constant variance and maybe normality.

Since the area of a peak cannot be negative, it can't really be normally distributed (though maybe approximately). Other candidate distributions are the gamma distribution or the log-normal distribution. Nowadays there is standard software that can estimate things like that.
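As a small illustration of fitting such a positive-valued distribution, here is a sketch assuming Python/NumPy; the simulated "peak areas" and their parameters are invented. For the log-normal, the maximum-likelihood fit is simply the mean and standard deviation of the log-transformed values:

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated positive peak areas; log-normal is a common model for such data.
areas = rng.lognormal(mean=5.0, sigma=0.4, size=500)

# Maximum-likelihood log-normal fit: mean and std of the log values.
mu_hat = np.log(areas).mean()
sigma_hat = np.log(areas).std()
```

Dedicated statistics packages can fit gamma and other candidates the same way and compare their goodness of fit.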

But maybe your concern is more directed towards the estimation.

I think I'll stop here for the day.
While trying to read, write and rewrite a reply to hkontro, I saw this post on Statblogs, from the blog “simplystatistics”:

“Mindlessly normalizing genomics data is bad – but ignoring unwanted variability can be worse”

It is about genomics, which is different but in statistical terms maybe similar to LC-MS. There are several links there that can be useful, although not all of it is easy to read.
Thanks, there are some interesting links in that blog, like the article about "subset quantile normalization". I'll dig into it and see what it does to my data.

I believe the same principles apply to normalization of genomic and proteomic data, although LC-MS might be a more precise method than microarrays. Still, even if the analysis itself is relatively stable, the sample preparation will inevitably introduce some error.

Do you have technical replicates? Meaning, the exact same sample run several times (usually 3), so that you can calculate a mean for each peak. It helps reduce the "noise variation" of the peaks.

If you don't have that possibility, can you use a relativization? I used this technique for T-RFLP data (terminal restriction fragment length polymorphism, a genomic profiling method) with good results. The Hellinger transformation, x' = (x / sum of the sample)^0.5, reduces the impact of multiple empty (= 0) cells in your matrix, and also reduces the impact of having different total values for replicate samples (which seems to be the case in your experiment).
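A minimal sketch of the Hellinger transformation, assuming Python/NumPy; the sample values are made up:

```python
import numpy as np

def hellinger(run):
    """Hellinger transformation: square root of each peak's share
    of the run total. Softens the effect of zero cells and of
    differing totals between samples."""
    run = np.asarray(run, dtype=float)
    return np.sqrt(run / run.sum())

# Hypothetical row of a peak matrix, including empty cells.
sample = [0.0, 4.0, 12.0, 0.0, 84.0]
transformed = hellinger(sample)
```

Note that the transformed values of any sample have squares summing to 1, so runs with very different totals end up on a common scale.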

As mentioned previously, maybe your difference in total values is a biological effect that you need to consider.

Lots of possibilities here... Good luck ;)
Yes, I do have 3 technical replicates for each sample. The standard deviations are quite small, 1–6% for most peaks, although there are some with huge SDs (mainly caused by one of the replicates deviating, so I guess I can just eliminate those values when the other two replicates have similar values).
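That outlier rule can be sketched as a small helper, assuming Python/NumPy; the 10% agreement cutoff and the example values are hypothetical choices, not anything from the thread:

```python
import numpy as np

def replicate_mean(reps, rel_tol=0.10):
    """Mean of 3 technical replicates. If two replicates agree within
    rel_tol but the third deviates by more than rel_tol from them,
    drop the deviating one (hypothetical 10% cutoff)."""
    reps = np.asarray(reps, dtype=float)
    for i in range(3):
        keep = np.delete(reps, i)
        spread = abs(keep[0] - keep[1]) / keep.mean()
        deviation = abs(reps[i] - keep.mean()) / keep.mean()
        if spread <= rel_tol and deviation > rel_tol:
            return keep.mean()   # discard the outlying replicate
    return reps.mean()           # no clear outlier: keep all three

print(replicate_mean([100.0, 102.0, 180.0]))  # prints 101.0
```

In practice the cutoff should be chosen to match the 1–6% SDs seen in the well-behaved peaks.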

I tried the Hellinger transformation, and it has a less radical effect on the observed differences. On the other hand, it greatly decreases the number of significant (P < 0.05) changes... But I will consider it, thanks.

Yes, you can ignore one peak if the other two are very similar. Same thing if one of the three peaks is missing.

The Hellinger transformation indeed reduces the number of significant changes observed. But consider that the significant changes you saw before the transformation might be false positives.

Each step has to be considered very carefully ;)