Normal distribution probability density function returning values > 1

#1
i am trying to draw a normal curve over a histogram of some data of mine and i just can't get it to come out right!

Mean =1.002327404

Standard Deviation = 0.020919449

so, obviously the problem is that as x approaches mu, the e^(-.5*((x-μ)/σ)^2) approaches 1, and i get probabilities approaching 1/(σ*√(2π)), which is approximately equal to 19. naturally probabilities of 1900% are not possible.

what am i doing wrong? please help!
 

Dragan

Super Moderator
#2
Those are not probabilities. Rather, those values are the height (ordinate) of the normal distribution.
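
To make the height-versus-probability distinction concrete, here is a minimal Python sketch (not from the thread; the helper names `normal_pdf` and `normal_cdf` and the interval half-width of 0.001 are assumptions chosen for illustration). It uses the mean and standard deviation quoted in post #1 and shows that the peak height is about 19, while the probability of landing in a narrow interval around the mean is an area that stays well below 1.

```python
import math

def normal_pdf(x, mu, sigma):
    """Height of the normal density at x (a density, not a probability)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def normal_cdf(x, mu, sigma):
    """P(X <= x) for a normal variable, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

mu = 1.002327404      # mean from post #1
sigma = 0.020919449   # standard deviation from post #1

# Height of the curve at the mean: 1 / (sigma * sqrt(2*pi)), roughly 19
peak = normal_pdf(mu, mu, sigma)

# A probability is an area: here, over a narrow interval around the mean
half_width = 0.001    # hypothetical interval half-width
prob = normal_cdf(mu + half_width, mu, sigma) - normal_cdf(mu - half_width, mu, sigma)

# For a narrow interval, area is approximately height * width
approx = peak * 2.0 * half_width

print(peak, prob, approx)
```

The height can exceed 1 because it is measured per unit of x; only height times width (an area) has to stay at or below 1.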
 
#3
hi dragan, thanks for your reply, but i don't understand what you're getting at. perhaps i abused the term 'probability' instead of saying 'probability density'.

regardless, should not the maximum possible value of the normal distribution be 1? what am i doing wrong? do i need to integrate from point to point to obtain the 'normal curve'?
 

Dragan

Super Moderator
#4
No, No....This has nothing to do with computing probabilities. You're trying to superimpose the normal curve on a histogram (of some empirical data set). Right? If so, then there is no need for any integration.

Look, what I do is first take the data and create a histogram using the min and max data points as the start and stop points.

What you can do next is compute the mean (mu) and standard deviation (s) of the data and then scale z as s*z + mu.

Plot it parametrically as:

On the x axis you would have

x=s*z + mu.

And, on the y axis you would have

y=f(z)/s

where f(z) is the unit normal distribution.

In the Mathematica code I use it would be:

ParametricPlot[{x,y},{z,-3,3}]

The full Mathematica source code is:


Show[
  Histogram[data, HistogramCategories -> IntegerPart[Sqrt[Length[data]]], Frame -> False, BarEdges -> False, HistogramScale -> 1, HistogramRange -> {Min[data], Max[data]}, DisplayFunction -> Identity],
  ParametricPlot[{x, y}, {z, -3, 3}, DisplayFunction -> Identity],
  DisplayFunction -> $DisplayFunction]

That will superimpose the normal curve on the data. I just tried it and it works.
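
For readers not using Mathematica, the same recipe can be sketched in plain Python. This is an illustration under stated assumptions rather than a port of the code above: the data are synthetic stand-ins generated with `random.gauss`, the function names are made up here, and the histogram is assumed to be scaled so that the total bar area is 1. It only computes the numbers to draw (bar heights and curve points); the actual plotting is left to whatever graphics layer is in use.

```python
import math
import random

def unit_normal_pdf(z):
    """f(z): the standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def curve_points(mu, s, n_points=121, z_max=3.0):
    """Parametric points (x, y) = (s*z + mu, f(z)/s) for z in [-z_max, z_max]."""
    pts = []
    for i in range(n_points):
        z = -z_max + 2.0 * z_max * i / (n_points - 1)
        pts.append((s * z + mu, unit_normal_pdf(z) / s))
    return pts

def density_histogram(data, n_bins):
    """Equal-width bins from min to max; heights scaled so total bar area is 1."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in data:
        k = min(int((v - lo) / width), n_bins - 1)  # the max value goes in the last bin
        counts[k] += 1
    heights = [c / (len(data) * width) for c in counts]
    return lo, width, heights

random.seed(0)  # synthetic stand-in data, NOT the poster's data set
data = [random.gauss(1.0023, 0.0209) for _ in range(2000)]

# Sample mean and sample standard deviation of the data
mu = sum(data) / len(data)
s = math.sqrt(sum((v - mu) ** 2 for v in data) / (len(data) - 1))

# Bin count follows the Sqrt[Length[data]] rule used above
lo, width, heights = density_histogram(data, int(math.sqrt(len(data))))
pts = curve_points(mu, s)
```

With the bars scaled to unit area, the curve points and the bar heights share the same y axis, so both can legitimately rise above 1 when s is small.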
 
#5
whew, ok i'm not as lost as i thought! thanks, i think that gives me the info i need.

if i'm reading you correctly, my problem is i'm expressing the domain in raw data rather than in z statistics?

z = (x-mu)/(sqrt(s^2/N)), yes?
 
#6
ugh. i'm doing this all wrong. i did well at this in college but apparently that was too long ago. i should go get a textbook.


can you elaborate on what you mean by "scale z as s*z+mu" ?

i'm reading it as:

calculate z as: (x-mu)/(sqrt(s^2/N))*s+mu, then compute y as f(z)/s
 

Dragan

Super Moderator
#7

Basically, what you want is to have the standard normal curve f(z) "drive" everything. Thus, x(z) and y(z) are both functions of z.

Now, on the x axis you want x = s*z + mu where mu and s are the mean and standard deviation of the data, respectively.

For example, if z=0 (the mean of the unit normal curve) then you're at the mean of the data (mu)...and so on for other values of z.

On the y axis you want the height of the unit normal curve divided by the rate of change of x with respect to z.

Technically speaking this is:

y = f(z) / D[x,z]
y = f(z) /s

where D[x,z] is the derivative of x with respect to z

i.e. dx/dz = d(s*z + mu)/dz = s.

Thus, for any particular value of z (from the unit normal curve) it's going to send one value to the x-axis (x=s*z+mu) and one value to the y-axis (f(z)/s). Hence, this is the reason for a parametric plot (2 space).

For example, suppose I sample data with mean mu=100 and standard deviation s=16. Then if z=0 then on the x-axis you have x=100. And, the height of the normal pdf would be

(1/16)*1/Sqrt[2*Pi] *Exp[-z^2/2]=0.02493...

If z=1 then x=116 and y=

(1/16)*1/Sqrt[2*Pi] *Exp[-z^2/2]=0.015123...
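
The two values in this example can be checked with a few lines of Python. This is only a sketch; the function name `scaled_height` is made up for illustration and stands for y = f(z)/s.

```python
import math

def scaled_height(z, s):
    """y = f(z) / s, where f is the unit normal density."""
    return math.exp(-0.5 * z * z) / (s * math.sqrt(2.0 * math.pi))

mu, s = 100.0, 16.0  # values from the example above

rows = []
for z in (0.0, 1.0):
    x = s * z + mu            # position on the x axis
    y = scaled_height(z, s)   # height on the y axis
    rows.append((z, x, y))
    print(f"z={z:.0f}  x={x:.0f}  y={y:.6f}")
# prints:
# z=0  x=100  y=0.024934
# z=1  x=116  y=0.015123
```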
 
#8
this puts me back where i started, getting f(z) ~= 19

dividing by N seems to give me what i'm after, but that seems at odds with my understanding of what the NDF is supposed to mean.
 

Dragan

Super Moderator
#9

Yes, it works perfectly. I just ran an example using your mean and standard deviation of:


Mean =1.002327404

Standard Deviation = 0.020919449

with an experimental set of data.

And, yes, at z=0 the height of the normal pdf, f(0)/s, will be approx. 19 as you say - this is correct.

Thus, I don't understand where you're going wrong.

What program are you using for graphing??
I have a feeling the problem might be with how you're graphing your pdf over the data.
 
#10
i see. well my problem was that i thought the normal pdf should only return values less than 1.

my problem was that i'm doing a Cpk and the y axis is supposed to be expressed in frequency, and not raw counts. if i were expressing the histogram in counts, the scaling would have come out correctly in the first place, but i was expecting the scaling of the pdf to come out on the same scale as the frequency and not the counts.

i went back and forth thinking whether or not it made sense that the pdf would return values > 1. now i see that it absolutely does.

oh, and the software i'm using is my own :) that's why i'm the one trying to draw the curve! i do control and software engineering for semiconductor manufacturing tools, and this is part of a log file analysis package.
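
One common way to reconcile the curve with the histogram's y axis is sketched below. It is an assumption-laden illustration, not the poster's actual code: it assumes equal-width bins, takes "frequency" to mean the fraction of samples per bin, and uses a hypothetical sample size of 500 and bin width of 0.005. The function name `overlay_height` is made up here.

```python
import math

def normal_pdf(x, mu, sigma):
    """Height of the normal density at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def overlay_height(x, mu, sigma, n, bin_width, y_axis):
    """Normal-curve height matched to the histogram's y axis.

    y_axis is one of:
      "density"   - bars scaled to unit total area (use the pdf as is)
      "frequency" - fraction of samples per bin (pdf * bin_width)
      "count"     - raw samples per bin (pdf * bin_width * n)
    """
    d = normal_pdf(x, mu, sigma)
    if y_axis == "density":
        return d
    if y_axis == "frequency":
        return d * bin_width
    if y_axis == "count":
        return d * bin_width * n
    raise ValueError("unknown y_axis: " + y_axis)

mu, sigma = 1.002327404, 0.020919449  # values from post #1
n, bin_width = 500, 0.005             # hypothetical sample size and bin width

density_peak = overlay_height(mu, mu, sigma, n, bin_width, "density")
frequency_peak = overlay_height(mu, mu, sigma, n, bin_width, "frequency")
count_peak = overlay_height(mu, mu, sigma, n, bin_width, "count")
print(density_peak, frequency_peak, count_peak)
```

Under these assumptions the density peak is about 19, while the frequency-scaled peak is a fraction below 1, which matches the intuition that a per-bin frequency cannot exceed 100%.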
 
#12
yes i see that now, it makes complete sense.

for some reason i was thinking that in the limit as sigma approaches zero, npdf = [1 | x=mu, 0 | x!= mu]

instead of, of course:

lim[sigma->0] npdf = [infinity | x=mu, 0 | x!= mu]

which it must be for the pdf to integrate to 1 from -inf -> inf
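
The limiting behaviour can be checked numerically with a small Python sketch (not from the thread; the midpoint-rule integrator `integrate_pdf`, its step count, and the choice of sigma values are assumptions for illustration). As sigma shrinks, the peak grows without bound while the area under the curve stays at 1.

```python
import math

def normal_pdf(x, mu, sigma):
    """Height of the normal density at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def integrate_pdf(mu, sigma, n_steps=20000, span=8.0):
    """Midpoint-rule integral of the pdf over mu +/- span*sigma."""
    a = mu - span * sigma
    h = 2.0 * span * sigma / n_steps
    total = sum(normal_pdf(a + (i + 0.5) * h, mu, sigma) for i in range(n_steps))
    return total * h

results = []
for sigma in (1.0, 0.1, 0.01, 0.001):
    peak = normal_pdf(0.0, 0.0, sigma)  # grows like 1/sigma
    area = integrate_pdf(0.0, sigma)    # stays very close to 1
    results.append((sigma, peak, area))
    print(sigma, peak, area)
```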