Kaplan-Meier Estimator Formula

#1
I am trying to figure out which approach I should use to plot the Kaplan-Meier (KM) survival estimator. I am referring to two resources: the Survival Analysis course on Coursera and a tutorial provided via KDD. https://www.researchgate.net/publication/319151424_Machine_Learning_for_Survival_Analysis_A_Survey

I have attached the file capture with the data I am working on, and I would like to understand whether I am doing it correctly. I am trying to plot the survival probability for the population, using an approach similar to the one in the Coursera module on Life Tables. But I noticed that the formula used there to calculate the proportion of patients surviving past time t seems to give the same results as the S(t) formula given in the KDD tutorial. So I am no longer sure whether this S(t) formula gives the proportion of subjects surviving past time t or the probability of survival, because the formula for the latter seems to be different in the approach taught in the Survival Analysis course. My data include subjects that left the study partway through, and at two time points new subjects entered the study.

I also wish to know whether the KM estimator really has to be displayed as a step chart, or whether it is acceptable to plot it the way I have done in the attached file.

Thank you, and I look forward to getting some advice on this.
 
#6
I tried the link you shared. It helped me plot the KM chart, but it does not account for subjects that enter in the middle of the study; it assumes that all subjects began the study at the same time.
 

obh

Active Member
#7
The timeline isn't necessarily a common one; it is measured from each specific subject's start.

If a subject enters at period 2 and is still alive at the end, let's say period 10, it is exactly as if it had started at period 0 and been censored at period 8.
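
A minimal sketch of that conversion in Python, assuming each subject is recorded with an entry period and an exit period on the study calendar (the function name `to_duration` and its arguments are hypothetical, not taken from the calculator):

```python
def to_duration(entry, exit_, event):
    """Convert calendar-time entry/exit into (duration, event) on the
    subject's own clock. event: 1 = died/failed, 0 = censored."""
    if exit_ < entry:
        raise ValueError("exit must not precede entry")
    return exit_ - entry, event

# Enters at period 2, still alive (censored) at period 10.
late = to_duration(2, 10, 0)
# Enters at period 0, censored at period 8.
early = to_duration(0, 8, 0)

print(late, early)  # both subjects contribute the same (duration, event) pair
```

Both subjects end up as the same row for the estimator: 8 periods of follow-up, censored.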
 
#8
Okay. In that sense, I will have to change my raw data to show all subjects starting from the same day. When I do it manually in Excel, my definition of t is the time elapsed since the start of the study, at which point a follow-up is done to determine how many subjects are still alive, censored, or newly added. But I guess if the purpose of the KM is simply to look at the survival distribution, then I shouldn't be too concerned about placing the subjects in the raw data according to when they entered the study. Correct?

But when I tried that just now, shifting one subject's raw data to begin on the same date as every other subject, the number of days survived for this subject changed.
 

obh

Active Member
#9
Think of an engine in a car: you want to know how many days a specific model of engine survives.
You don't really care on what date, counted from the beginning of the research, the engine broke down, only how long it ran from the time it started working.

You didn't put commas in the text file, so it isn't very clear.
I think that in the calculator you should enter one row per subject (like one row per engine).

example, 4 engines:
1 broke after 122 months
1 broke after 166 months
1 still working after 300 months
1 still working after 222 months

Months, Event (1) / censored (0)
122, 1
166, 1
222, 0
300, 0
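
As a rough illustration of what a calculator would do with these four rows, here is a minimal Kaplan-Meier sketch in plain Python. The function name `kaplan_meier` and the output layout are assumptions made for this example, not the calculator's actual implementation:

```python
from collections import Counter

def kaplan_meier(rows):
    """rows: list of (duration, event), event 1 = failure, 0 = censored.
    Returns (time, n_at_risk, deaths, survival) for each time with a failure."""
    deaths = Counter(t for t, e in rows if e == 1)
    removed = Counter(t for t, _ in rows)  # failures and censorings together
    n = len(rows)   # number at risk just before the first time point
    s = 1.0         # running survival estimate
    table = []
    for t in sorted(removed):
        d = deaths.get(t, 0)
        if d:
            s *= 1 - d / n
            table.append((t, n, d, s))
        n -= removed[t]  # everyone leaving at t is no longer at risk afterwards
    return table

engines = [(122, 1), (166, 1), (222, 0), (300, 0)]
table = kaplan_meier(engines)
for t, n, d, s in table:
    print(t, n, d, round(s, 4))
```

With this data the estimate drops to 3/4 at 122 months and to 3/4 * 2/3 = 1/2 at 166 months; the two censored engines do not lower the curve, they only reduce the number at risk.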
 
#10
Thanks. The text file needs to be opened in Excel, which will give an option to separate the columns by delimiter.
I think I am looking at 2 ways of plotting this curve. One way is to list each machine and how long it lasted in the study until it failed, dropped out, or was still working at the end of the study (censored).

The other approach is what is referred to as the Life Table, where at each follow-up point (time elapsed since the study started) we record the number of machines still in working order at time t (n), the number of machines that died at time t (di), and the number of machines that were censored (ci). The formula used for S(t) seems to give the proportion of subjects that have survived past time t, but to get the probability of surviving I have to multiply the proportion surviving at time t by the probability of surviving at time (t-1).
 

obh

Active Member
#11
Okay, I succeeded in pasting it into Excel.
In your data I can't tell whether the "died/censored" counts come from the "old" subjects or from those "new in the study" ...
 
#12
Great. I am not very sure if I have done it correctly. So, for example, at time t = 100, 1 machine dies, so the total number of machines still alive at time t = 100 becomes 109.

I should probably remove the rows where nothing happens during that particular follow-up time.
 

obh

Active Member
#13
Hi Red,

I attached a zip file of excel with a very simple example (not your data)

Generally, in each step, you calculate the one-step survival probability based only on the events (d) in that step: 1-d/n
For the next step, you calculate the remaining subjects (n) by subtracting those removed from the experiment in the previous step,
either by an event (d) or by censoring (l).

S(t) is the product of all the "one-step" probabilities from each step up to "now".

n - remained
d - event
l - censored

Code:
n(i)=n(i-1)-d(i-1)-l(i-1)
St(i)=St(i-1)*( 1-d(i)/n(i) )
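
A minimal Python sketch of these two recurrences, assuming St starts at 1 before the first step. The function name `km_recurrence` and the numbers below are made up for illustration (they are not from the attached Excel file):

```python
def km_recurrence(n0, d, l):
    """Apply n(i) = n(i-1) - d(i-1) - l(i-1) and
    St(i) = St(i-1) * (1 - d(i)/n(i)), with St = 1 before the first step.
    n0: number at risk in the first step; d, l: events and censored per step."""
    if len(d) != len(l):
        raise ValueError("d and l must have the same length")
    n_list, s_list = [], []
    n, s = n0, 1.0
    for i in range(len(d)):
        if i > 0:
            n = n - d[i - 1] - l[i - 1]  # remove the previous step's events and censored
        if n <= 0:
            break  # nobody left at risk, so the estimate stops here
        s *= 1 - d[i] / n  # one-step survival uses events only
        n_list.append(n)
        s_list.append(s)
    return n_list, s_list

# Hypothetical data: 10 subjects, three follow-up steps.
n_list, s_list = km_recurrence(10, d=[1, 0, 2], l=[0, 1, 1])
print(n_list)
print([round(s, 4) for s in s_list])
```

In this made-up example the censored subject in step 2 leaves St unchanged at 0.9 but lowers n to 8 for step 3, where the two events give 0.9 * (1 - 2/8) = 0.675.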
 


obh

Active Member
#15
Yes, it is the same formula. The only difference is that what I wrote for S(t) is based on S(t-1), but S(t-1) is in turn based on S(t-2), and so on, so it is actually the same.
The capital pi symbol (Π) denotes multiplication (a product over all the steps).
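
Written out as a sketch, assuming S(t_0) = 1 and using t_i as an assumed label for the i-th follow-up time (with d_i events and n_i at risk at that time), unrolling the recursion gives the product form:

```latex
\begin{aligned}
S(t_i) &= S(t_{i-1})\left(1 - \frac{d_i}{n_i}\right) \\
       &= S(t_{i-2})\left(1 - \frac{d_{i-1}}{n_{i-1}}\right)\left(1 - \frac{d_i}{n_i}\right) \\
       &= \prod_{j=1}^{i}\left(1 - \frac{d_j}{n_j}\right), \qquad S(t_0) = 1 .
\end{aligned}
```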