Applying Statistical Methods to Biophysical Chemistry Research, IDP Ensembles

katherynpenrod

New Member
Hello! I am a second-year chemistry graduate student studying intrinsically disordered proteins (IDPs). IDPs do not adopt a stable folded state at equilibrium; instead, they sample an ensemble of conformations (structures). We have recently used a software program to generate conformational ensembles (sets of structures) of IDPs using experimental data as constraints. The output of this program is a .pdb file containing 100 structures that are representative of the experimentally constrained ensemble. We have six ensembles in total, which differ in the order in which the experimental data were considered.

We wish to choose structural metrics and compare them, both between ensembles and between structures within an ensemble, in order to quantify how similar the ensembles and structures are. So far we have focused on one such metric, the radius of gyration. We eventually wish to extend our statistical analysis to other structural metrics as well (e.g. persistence length, solvent-accessible surface area, end-to-end distance).
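For concreteness, here is a minimal sketch of how two of these per-structure metrics could be computed. It assumes each structure has already been read into an (N, 3) array of coordinates (for example, one row per C-alpha atom, in chain order); the function names are hypothetical and are not taken from the ensemble-generation program.

```python
import numpy as np

def radius_of_gyration(coords, masses=None):
    """Mass-weighted radius of gyration of one structure.

    coords: (N, 3) array of positions (assumed input format).
    masses: optional length-N weights; equal weights if omitted.
    """
    coords = np.asarray(coords, dtype=float)
    if masses is None:
        masses = np.ones(len(coords))
    masses = np.asarray(masses, dtype=float)
    # Weighted center of the structure.
    center = np.average(coords, axis=0, weights=masses)
    # Squared distance of every point from that center.
    sq_dist = np.sum((coords - center) ** 2, axis=1)
    return float(np.sqrt(np.average(sq_dist, weights=masses)))

def end_to_end_distance(coords):
    """Distance between the first and last points of the chain."""
    coords = np.asarray(coords, dtype=float)
    return float(np.linalg.norm(coords[-1] - coords[0]))
```

Applying `radius_of_gyration` to each of the 100 structures in an ensemble gives the sample of 100 values whose distribution is compared between ensembles.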

I am having trouble knowing where to begin with choosing statistical analyses/metrics to apply to these data. I would like to begin by comparing every ensemble with every other ensemble, in other words, generating a 6 x 6 matrix that compares each data set with every other data set and with itself (to confirm that the metrics have been implemented correctly). As a first pass, I have computed the Pearson correlation coefficient, Pearson's chi-squared statistic (I am not sure the terminology is correct here, as it was taken from Wikipedia; I can provide the equation for clarification), and the Kullback-Leibler divergence (KLD) for the distributions of radius of gyration for each ensemble. I am assuming that the Pearson correlation coefficient and Pearson's chi-squared are not good metrics to use for this type of data, but the KLD was quite intriguing. Even more intriguing was the Jensen-Shannon divergence (JSD), which is symmetric and bounded, and whose square root is a true metric. Mutual information (MI) is also potentially of interest.
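To make the 6 x 6 comparison concrete, here is a minimal sketch of a pairwise Jensen-Shannon divergence matrix. It assumes each ensemble is represented by a 1-D array of radius of gyration values and that the distributions are estimated with histograms on shared bin edges; the function names and the choice of 20 bins are hypothetical, not part of our actual analysis.

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2, so in [0, 1]) of two histograms."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()  # normalize counts to probabilities
    q = q / q.sum()
    m = 0.5 * (p + q)  # mixture distribution

    def kl(a, b):
        # Empty bins of `a` contribute zero; b > 0 wherever a > 0 here.
        mask = a > 0
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return float(0.5 * kl(p, m) + 0.5 * kl(q, m))

def pairwise_jsd(samples, bins=20):
    """Symmetric matrix of JSD values between all pairs of samples.

    samples: list of 1-D arrays, one per ensemble (assumed input format).
    bins: number of histogram bins shared by all ensembles (arbitrary choice).
    """
    samples = [np.asarray(s, dtype=float) for s in samples]
    lo = min(s.min() for s in samples)
    hi = max(s.max() for s in samples)
    edges = np.linspace(lo, hi, bins + 1)  # same edges for every ensemble
    hists = [np.histogram(s, bins=edges)[0] for s in samples]
    n = len(hists)
    mat = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            mat[i, j] = mat[j, i] = js_divergence(hists[i], hists[j])
    return mat
```

The diagonal of the resulting matrix should be exactly zero, which serves as the self-comparison check described above. With only 100 values per ensemble, the off-diagonal entries will depend noticeably on the number of bins, so it is worth repeating the calculation with a few different bin counts.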

My question for you experts out there is as follows: Knowing what you know about the nature of the data that we are working with, what statistical metrics/analyses would you recommend? Why (aside from my own, entirely unfounded hunch) would Pearson Correlation Coefficient and Pearson Chi Squared be bad for these data? Why might KLD or JSD or MI be good? What are the assumptions that go into these types of tests? I am very much a beginner with all things statistics, having never taken a single statistics course, so any and all basic insights that you may be able to offer would be greatly appreciated!