Empirical distribution function

#1
Hello everyone, cheers for browsing by.

We're handling various data in our statistics courses at the moment and I have gotten back to using R....

Anyway, as the title says, I've run into a bit of a comprehension problem while preparing for the data description part of the exam.

From what I've gathered, the ECDF at a value x gives the cumulative relative frequency of all observations up to and including x.
Meaning if I have a dataset:

1 1 3 3 4 4 5 5 6 6 total:12
2/12 4/12 6/12 ... 1 (cumulative relative frequency, ending at 1)
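
For anyone following along, here is a minimal sketch of that idea in R, using the values exactly as typed above. Note that the list as typed has ten values, so the steps come out in tenths here; the variable names `x`, `Fn` and `distinct` are only illustrative.

```r
# The ten values exactly as listed above
x <- c(1, 1, 3, 3, 4, 4, 5, 5, 6, 6)

Fn <- ecdf(x)          # ecdf() returns a function
distinct <- unique(x)  # 1 3 4 5 6

# Share of observations that are <= each distinct value
Fn(distinct)

# The same numbers by hand: cumulative counts divided by n
cumsum(table(x)) / length(x)
```

Both calls should give the same cumulative relative frequencies, which is all the ECDF is.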


So, for the CSV we use, these are the initial commands to read it in:
> poll.data <- read.csv2("sample.csv") # German CSV (decimal comma, semicolon separator)
> attach(poll.data) # I've already read that attach() is not well received and that people avoid it using with() and so on, but that is still beyond me.
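
As a rough sketch of the alternatives to attach() mentioned in the comment: the data frame below is a hypothetical stand-in for what read.csv2("sample.csv") returns, with made-up values for q4.length, so only the access pattern matters.

```r
# Hypothetical stand-in for the data frame read from sample.csv
poll.data <- data.frame(q4.length = c(3, 7, 8, 12, 12, 18, 21, 26))

# Option 1: pick the column explicitly with $
mean(poll.data$q4.length)

# Option 2: evaluate an expression "inside" the data frame
with(poll.data, mean(q4.length))
```

Either way the column is looked up in poll.data directly, so nothing has to be attached to the search path.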


Anyway, for that exercise I went through the following steps as my attempt to solve it, but my solution diverges from the expected one midway and I don't see why, as I thought I had completed the required task with my ECDF:

q4.cut<-cut(q4.length, breaks=seq(0,30,5))
q4.cut.table<-table(q4.cut)
q4.cut.table #to "draw the frequency table"
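
A small sketch of how the frequency table could be extended with relative and cumulative frequencies, assuming made-up q4.length values between 0 and 30 (the real ones live in the CSV). The names abs.freq, rel.freq and cum.freq are only illustrative.

```r
q4.length <- c(3, 7, 8, 12, 12, 18, 21, 26)  # hypothetical values

q4.cut <- cut(q4.length, breaks = seq(0, 30, 5))  # classes (0,5], (5,10], ...
abs.freq <- table(q4.cut)          # absolute frequencies
rel.freq <- prop.table(abs.freq)   # relative frequencies
cum.freq <- cumsum(rel.freq)       # cumulative relative frequencies

cbind(abs.freq, rel.freq, cum.freq)
```

The last column is what an ECDF of the classified data steps through, class by class.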

Now I went ahead and plotted the ECDF of just that, rather than following the steps shown in the solution:

plot(ecdf(q4.cut), do.points=TRUE, verticals=FALSE)

Expected solution->
the professor is suggesting the following:
> Fn.q4 <- ecdf(q4.length)
> plot(Fn.q4)
> Fn.q4.cut <- ecdf(q4.cut)
> points(sort(Fn.q4.cut(q4.cut)))


I understand that this computes two ECDFs (correct me if I'm wrong) and uses the points() command to draw the second one into the already existing plot?
points() just isn't covered in the course units, and this is the first time I've come across it.
I also don't see why sort() is needed inside points(), or how it affects the plot. Does points() always draw onto the currently displayed plot?
In that case, is sort() used to place the values of the "cut" ECDF, evaluated at q4.cut, on the plot of ecdf(q4.length)?
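
For what it's worth, points() does add to whatever plot is currently open. The sketch below is not the professor's code but a different route to a similar picture, under the assumption of made-up q4.length values: it computes the cumulative relative frequency per class and draws it at the upper class boundaries on top of the raw ECDF.

```r
q4.length <- c(3, 7, 8, 12, 12, 18, 21, 26)   # hypothetical values
breaks <- seq(0, 30, 5)
q4.cut <- cut(q4.length, breaks = breaks)

Fn.q4 <- ecdf(q4.length)
plot(Fn.q4)                                   # step function of the raw data

# Cumulative relative frequency of each class
cum.freq <- cumsum(table(q4.cut)) / length(q4.length)

# Draw them at the upper class boundaries, on top of the existing plot
points(breaks[-1], cum.freq, col = "red", pch = 19)

# With right-closed classes these values match the raw ECDF at the boundaries
all.equal(as.vector(cum.freq), Fn.q4(breaks[-1]))
```

Giving points() explicit x and y values makes it clear where each dot lands, which is harder to see when only a single sorted vector is passed.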

I'm sorry if this is partly hard to follow; I tried my best to explain how far I've gotten. If anyone can give me a push towards understanding the ECDF steps here, it will be greatly appreciated.

Regards
Eo

LINK to CSV: http://www.sharecsv.com/s/e3b1d804d7b452839945f9c4c41aed0d/sample.csv (read.csv2, due to germany intensifying)
 

#2
I went ahead and skimmed through textbooks and got a lot further in understanding why the points() command is executed.
What I still haven't understood is why points(sort(Fn.q4.cut)) isn't enough, and why we have to pass the classified q4.cut to Fn.q4.cut in the brackets at the end.

Another thing I came across was quantile calculation:

To find the median, the following call is used (which I do understand):
> quantile(DATASET, probs = 0.5, type = 1)
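
A quick sketch of why the spelling of the argument matters. R is case-sensitive, so the argument is `type`; as far as I can tell, a capitalised `Type` is passed along and ignored without a warning, leaving the default type 7 in effect. The toy vector below is made up so that the two types give different answers.

```r
x <- c(1, 2, 3, 4)   # hypothetical toy data

quantile(x, probs = 0.5, type = 1)   # type 1: inverse of the ECDF, gives 2
quantile(x, probs = 0.5)             # default type 7 interpolates, gives 2.5
median(x)                            # also 2.5
```

With type = 1 the result is always one of the observed values, which is what makes it line up with the steps of the ECDF plot.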

So in the case of the presented CSV, I had to calculate the median of the variable "q4.length" and then add it to an ECDF plot->

> plot(ecdf(q4.length))

then we're supposed to draw the median as a line on that plot, for which I tried->

> Median.q4.length <- quantile(q4.length, probs = 0.5, type = 1)
then, to draw the median in there, I attempted two things:
> lines(Median.q4.length, 0, col = "red"), which runs without error but doesn't draw a line at all, and
> points(Median.q4.length, col = "red"), which also runs without error but doesn't show me a visible point anywhere.


I assume I just added an invisible dot under the already existing graph and not a line as intended.
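
A possible explanation, sketched with xy.coords(), which is the helper that points() and lines() use to turn their arguments into coordinates. The value of `m` is a made-up stand-in for Median.q4.length.

```r
m <- 12   # hypothetical median, stands in for Median.q4.length

# What points(m) would use as coordinates: x is the index 1, y is the value
xy <- xy.coords(m)
c(xy$x, xy$y)

# lines(m, 0) has only the single point (m, 0), so there is nothing to connect
length(xy.coords(m, 0)$x)
```

So points(m) would try to draw at x = 1, y = m, which lies far above an ECDF plot whose y axis only runs from 0 to 1, and lines() with a single point has no second point to draw to.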

The example solution (bear in mind that rep() is explained nowhere) looks like this:
> lines(c(0, Median.q4.length), rep(0.5, 2), lty = 2)
> lines(rep(Median.q4.length, 2), c(0, 0.5), lty = 2)

So the way I understand it, rep() repeats a vector, in this case the median, but I don't understand how the combined call works here.
The call does produce the desired lines, but just learning it by heart without understanding it feels bad....


Why rep(0.5, 2)? Does half a vector get repeated twice? Why?
The second line turns it around. I do get why one end of each line sits at 0, so that the lines run from the axes to the median, but what does rep() essentially do?

I typed it in without the rep() and it basically marked off the quantiles of X vertically, so again, if anyone can shed some light on this last part of the ECDF problem and on rep() in this context, I'd appreciate it.


Regards
Eo


EDIT: I found a workaround that seems to work for this case, but it doesn't help me understand the above.

abline(v = Median.q4.length)
abline(h = 0.5)
This worked, but is there any explanation for the above?
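
A sketch of the difference, assuming made-up q4.length values: abline() draws across the whole plotting region, while segments(), which is not part of the example solution, takes the two end points directly and stops there.

```r
q4.length <- c(3, 7, 8, 12, 12, 18, 21, 26)          # hypothetical values
Median.q4.length <- quantile(q4.length, probs = 0.5, type = 1)

plot(ecdf(q4.length))

# abline() spans the whole plotting region
abline(v = Median.q4.length, h = 0.5, col = "grey")

# segments(x0, y0, x1, y1) stops at the given end points instead
segments(0, 0.5, Median.q4.length, 0.5, lty = 2)              # horizontal part
segments(Median.q4.length, 0, Median.q4.length, 0.5, lty = 2) # vertical part
```

The two segments() calls should draw the same dashed pieces as the two lines() calls in the example solution, just with the coordinates written out one by one.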
 

Dason

Ambassador to the humans
#3
Do you know how to access the help pages?

Code:
?rep
will open up the help file for rep which will explain what it does, the parameters it takes, the output...

You can also just try a few things out to explore and see what they do. For example if you want to know what rep is doing...

Code:
> rep(0.5, 2)
[1] 0.5 0.5
In this case it's repeating the value 0.5 twice.

Playing around with it should help you understand what it does
Code:
> rep(0.5, 10)
 [1] 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
> rep(2, 10)
 [1] 2 2 2 2 2 2 2 2 2 2
> rep(c(1,2,3), 5)
 [1] 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3
 
#4
First of all, thanks for the reply, but what is unclear to me isn't the basic functions themselves but rather how the two are combined and how that works.
I'd really like to understand this so that I can combine functions better myself.


I do in fact use the help pages, but find them cryptic at best. I'm only taking this as a "statistics 101" course and the requirements are rather high.

I try my best to learn as much as possible, but as I'm not studying statistics directly, I've kind of been thrown in at the deep end here.
As mentioned in my post:
"So the way I understand this, rep() repeats a vector, so in this case the Median, but I don't understand the combined formula working here"

I really do understand what it does, but I simply can't see how they work in combination. It might be a simple misunderstanding of the "programming language".

rep(WHAT-TO-REPEAT, HOW-OFTEN)
or rep(c(FirstToRep, SecondToRep), HOW-OFTEN)

That is all clear, but why does > lines(c(0, Median.q4.length)) not just draw the desired line along the y, or respectively
> lines(rep(Median.q4.length, 2)) along the x?
Does the lines() command not actually draw any lines by itself, unlike abline(), and is that why it needs the rep()?

Does rep() add the 'line' bit to the otherwise 'median' point? Meaning, did those commands just drop a dot into my plot, and with rep(0.5, 2) I repeat it along the "0.5" level on y?
 
#5
I think I just understood it, but would like someone to confirm this. Apparently it was in fact a misunderstanding of the language at work.

So the lines() function seems to work roughly as described, and the call I wrote before was missing arguments.

Unlike abline(), which just drops a line, this seems to require more "settings".

> lines(Median.q4.length, 0) was faulty in that sense, as it didn't specify two x values (and two y values) to draw the line between.
lines() is meant to be used as lines(c(x1, x2), c(y1, y2), lty/col/whatnot = "whatever")


The
> lines(c(0, Median.q4.length), rep(0.5, 2), lty = 2)
command does the following (again, please confirm/deny):

draw a line from x1 = 0 to x2 = Median.q4.length, with the y height repeated for both points (y1 = 0.5, y2 = 0.5), and make it a fancy dashed line. Respectively,

> lines(rep(Median.q4.length, 2), c(0, 0.5), lty = 2)
sets x1 = Median.q4.length and x2 = Median.q4.length, while c(0, 0.5) makes the line run from y = 0 to y = 0.5 on the chart.


Correct?
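
That reading matches how lines(x, y) pairs things up: the i-th x goes with the i-th y, and consecutive points get connected. A small sketch that just prints the end points, with `m` as a made-up stand-in for Median.q4.length and `h` and `v` as illustrative names:

```r
m <- 12   # hypothetical median, stands in for Median.q4.length

# Horizontal piece: x runs from 0 to m, y stays at 0.5
h <- cbind(x = c(0, m), y = rep(0.5, 2))
h

# Vertical piece: x stays at m, y runs from 0 to 0.5
v <- cbind(x = rep(m, 2), y = c(0, 0.5))
v
```

Each row is one end point, so rep() is only there to supply the coordinate that stays the same at both ends.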

That was a penny the size of a 16-wheeler dropping.