Bayesian: Defining priors from the *same* data set you're modeling

#1
I believe it is inappropriate to define prior distributions from the same data you will be analyzing with those priors. In other words, the priors should not come from the **same** data set (i.e., from its descriptive statistics, etc.) that you are about to fit a model to.

Does anyone know of any sources that discuss this type of data-driven prior elicitation (from the **same** data set you are analyzing) as being an inappropriate method of defining priors?
 

Dason

Ambassador to the humans
#2
Well, from a purely Bayesian point of view it's not appropriate. The prior should reflect your prior information. However, not everybody holds a pure Bayesian viewpoint, and some are willing to take an Empirical Bayes approach. One example of where this is actually quite helpful is microarray testing. Typically there aren't a lot of experimental units in a microarray experiment, but there are A LOT of genes being tested. If you're doing something like a t-test, then you're estimating a variance for each gene. You might think that these variances follow some sort of common distribution, and in that case taking an Empirical Bayes approach can help you.

Essentially you do what you're opposed to doing (using the data to choose the parameters of your prior), but it helps by giving you a better estimate of the variance for each t-test. Genes that appear to have really small variances (maybe due to chance) get their variance estimate pulled up a little bit, and genes with a big variance estimate get it pulled down a little bit. You also gain degrees of freedom, which can help out when you don't have many dfs in the first place.

Smyth (2004, Statistical Applications in Genetics and Molecular Biology 3(1), Article 3) describes this approach, and it's pretty widely used. It's built into the limma package in R (available through Bioconductor). The theory checks out, and if you want more information you can consult the paper. Philosophically I agree that data shouldn't be used in the construction of the prior, but sometimes it can help, especially when we don't have much data to go on in the first place.
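
To make the shrinkage concrete, here is a minimal Python sketch of the moderated variance, a weighted average of each gene's sample variance and a prior variance with the degrees of freedom as weights. It is an illustration only, not the limma implementation. The function name `moderate_variances`, the five sample variances, and the prior df `d0 = 4` are made up, and using the plain mean of the sample variances as the prior variance is a crude stand-in for the proper estimation step, in which both prior parameters are estimated from the full set of gene variances.

```python
from statistics import mean


def moderate_variances(s2, d, d0, s0_sq):
    """Shrink per-gene sample variances toward a prior variance.

    s2    : list of per-gene sample variances, each on d residual df
    d0    : prior degrees of freedom (assumed here, not estimated)
    s0_sq : prior variance (assumed here, not estimated)
    Returns the shrunken variances and the total df (d0 + d).
    """
    shrunk = [(d0 * s0_sq + d * v) / (d0 + d) for v in s2]
    return shrunk, d0 + d


# Made-up sample variances for five genes, each on d = 4 residual df.
s2 = [0.02, 0.5, 1.0, 1.5, 6.0]

# Crude stand-in for the prior variance: the average over all genes.
s0_sq = mean(s2)

shrunk, df_total = moderate_variances(s2, d=4, d0=4, s0_sq=s0_sq)

for raw, mod in zip(s2, shrunk):
    print(f"raw variance {raw:5.2f} -> moderated {mod:5.3f}")
print("df per test:", df_total)
```

The suspiciously small variance is pulled up, the large one is pulled down, and each test is then referred to a t distribution with `d0 + d` degrees of freedom instead of `d`.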