# Regression of probability distributions

#### JeffTheGreen

##### New Member
I have an [x, y] dataset where the y values are multiple points representing a probability distribution rather than a single discrete point. I'm trying to figure out how to run a regression on this to determine whether there's a correlation between x and y.

Ideally I'd like to use SPSS, but I'm also familiar enough with R that I could use it if I need to. I'd really appreciate it if anyone could help me.

#### ichbin

##### New Member
Let me make sure I understand this. You have some parameterized probability distribution over y. The form of the distribution depends on x, plus some other parameters. You have, for various values of x, collected samples of y values. You now want to do a regression to find the best-fit values of the other parameters that determine how the distribution changes as x is varied.

Is that right? If it is, I can tell you how to solve your problem.

#### JeffTheGreen

##### New Member
Yes, essentially. The probability distribution of y for any given value of x is approximately normal, so the shape of the probability distribution doesn't change much, but the mean certainly does. (I'd just use the mean, but the variance/error matters.) Essentially, I want to treat each group of 100 data points as a single point for the purpose of calculating p-values, etc.

#### Dason

How exactly is y representing a pdf? Is it a collection of parameters? Or is it just a random sample and you're considering the empirical distribution to be the pdf?

#### ichbin

##### New Member
Consider each x,y as an individual data point. Not each x,{y}, which was your original idea, but each x,y. So if you have 10 different x-values, each with 100 y-values, you will have 1000 data points, not 10.
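As a small illustration of that reshaping, here is a minimal Python sketch. The `grouped` dictionary and its values are made up for the example, not taken from your data; the same idea applies in R or SPSS by putting the data in "long" format.

```python
# Hypothetical grouped data: each x-value maps to its sample of y-values.
grouped = {
    1.0: [0.9, 1.2, 1.1],
    2.0: [2.3, 1.8, 2.1],
}

# Flatten to one (x, y) pair per observation, so 2 x-values with
# 3 y-values each become 6 data points rather than 2.
pairs = [(x, y) for x, ys in grouped.items() for y in ys]

print(len(pairs))  # 6
```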

Suppose your parameterized distribution is $$p(x,\theta; y)$$, where x is the x-value, y is the y-value, and $$\theta$$ represents the unknown regression parameter(s). Then the probability (density) of each data point is $$p(x_i, \theta; y_i)$$. The log-likelihood function for your whole data set is then

$$\log L = \sum_{i} \ln p(x_i, \theta; y_i)$$

Regression consists of finding $$\theta$$ to maximize this function.

Here is a concrete example. Suppose your model is that your data is normally distributed, with the variance proportional to the mean. For various values of the mean (x), you have taken different samples (y), and you want to do a regression to determine the proportionality constant (a).

$$p(x,a;y) = \frac{1}{\sqrt{2\pi a x}} \exp \left\{ -\frac{1}{2} \left( \frac{y - x}{\sqrt{a x}} \right)^2 \right\}$$
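
Taking the logarithm of this density and summing over the data points gives the log-likelihood for this example explicitly. This is just a worked step, and it assumes a and every $$x_i$$ are positive so that the variance $$a x_i$$ is valid:

$$\log L(a) = \sum_{i} \left[ -\frac{1}{2} \ln(2\pi a x_i) - \frac{(y_i - x_i)^2}{2 a x_i} \right]$$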

What I am suggesting you do is: (i) for a given assumed value of a, compute p for each x,y data point; (ii) construct a log-likelihood function by summing ln(p) over all data points; (iii) adjust a to maximize the function.
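
Steps (i) to (iii) can be sketched in plain Python as follows. Treat this as a rough illustration under stated assumptions, not a finished analysis: the data are simulated with a made-up true value `a_true = 0.5`, the function names `log_likelihood` and `maximize` are my own, and the golden-section search is just one simple way to do the maximization (in R you could use `optimize` instead). For this particular model, setting the derivative of the log-likelihood to zero gives a closed form, a = mean of (y - x)^2 / x, which the sketch uses as a check.

```python
import math
import random

def log_likelihood(a, pairs):
    """Sum of ln p(x, a; y) over all (x, y) pairs, for y ~ Normal(mean=x, var=a*x)."""
    total = 0.0
    for x, y in pairs:
        var = a * x  # variance proportional to the mean; requires a > 0 and x > 0
        total += -0.5 * math.log(2.0 * math.pi * var) - (y - x) ** 2 / (2.0 * var)
    return total

def maximize(f, lo, hi, tol=1e-6):
    """Golden-section search for the maximum of a single-peaked f on [lo, hi]."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    while hi - lo > tol:
        c = hi - inv_phi * (hi - lo)
        d = lo + inv_phi * (hi - lo)
        if f(c) > f(d):
            hi = d  # the peak lies in [lo, d]
        else:
            lo = c  # the peak lies in [c, hi]
    return (lo + hi) / 2.0

# Simulated stand-in data (an assumption): 10 x-values, 100 y-values each.
random.seed(0)
a_true = 0.5
pairs = [
    (float(x), random.gauss(x, math.sqrt(a_true * x)))
    for x in range(1, 11)
    for _ in range(100)
]

# Steps (i)-(iii): evaluate the summed ln(p) for trial values of a, keep the best.
a_hat = maximize(lambda a: log_likelihood(a, pairs), 1e-6, 10.0)

# Closed-form maximizer for this specific model, used only as a cross-check.
a_closed = sum((y - x) ** 2 / x for x, y in pairs) / len(pairs)

print(a_hat, a_closed)
```

The two printed numbers should agree closely, and both should land near the simulated `a_true`. With more than one unknown parameter, the same log-likelihood idea applies, but you would need a multi-parameter optimizer rather than a one-dimensional search.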

#### JeffTheGreen

##### New Member
> How exactly is y representing a pdf? Is it a collection of parameters? Or is it just a random sample and you're considering the empirical distribution to be the pdf?

It's a random sample, generated by an MCMC simulation.

ichbin, you'll have to excuse my ignorance, but I'm not sure I understand what you're saying. Most of my experience with statistics is just with data analysis and experiment design, not with the mathematics.

It seems like what you're suggesting is that I could do a least-squares regression, as I might in Excel or SPSS or R, but calculate a log-likelihood score instead of using the p-value the program calculates. Is that correct? How would I then turn that into a p-value? (Or is that even possible?) What if the model is y = mx + b, rather than just y = mx?