Help with decision tree when continuous data

#1
[Solved Myself, Worthless Community] Help with decision tree when continuous data

Hello,

I have a dataset (y = true/false, x's = a combination of binary and continuous data columns, n = 5000) that I am attempting to create a decision tree for. I used an entropy calculation to determine the 10 columns with the highest information gain, but it turns out they are all continuous data.

The root of the tree (based on max information gain) contains about 3000 different values (out of a total of 5000 rows), so I am unsure of how to create the branches. There does not appear to be a basic rule (such as a cutoff value) for identifying how the column values are related to the true/false column. Logistic regression failed miserably. Are there any relatively easy clustering techniques (or any stat methods) I can use to go from this root node to the next nodes in this situation? If it helps, I'm doing this in R and Excel.

Thanks!
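
For reference, one common way to branch on a continuous column is to sort its values, treat the midpoint between each pair of adjacent distinct values as a candidate cutoff, and keep the cutoff with the highest information gain. The following is only a minimal Python sketch of that idea, not the method used later in this thread; the function names and the toy data are hypothetical.

```python
import math

def entropy(labels):
    """Shannon entropy (in bits) of a list of True/False labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def best_threshold(values, labels):
    """Return (threshold, information_gain) for the best split 'x <= threshold'."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    base = entropy([y for _, y in pairs])
    best_t, best_gain = None, 0.0
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # identical values cannot be separated by a cutoff
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        gain = base - (i / n) * entropy(left) - ((n - i) / n) * entropy(right)
        if gain > best_gain:
            best_t, best_gain = (lo + hi) / 2, gain
    return best_t, best_gain

# Hypothetical toy column: larger values tend to be True.
x = [0.1, 0.4, 0.35, 0.8, 0.9, 0.75, 0.2, 0.95]
y = [False, False, False, True, True, True, False, True]
threshold, gain = best_threshold(x, y)
print(threshold, gain)
```

The loop recomputes the entropy of both sides at every candidate, which is quadratic in the number of rows. That is probably tolerable for 5000 rows, and running class counts would make it linear if needed.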
 
#2
Thanks for being worthless.

I normalized the data to force it between 0 and 1, compared the distributions of the true values in each column to the false values, then was able to identify appropriate cutoff points for each column (and each layer of the tree).
 

hlsmith

Not a robit
#3
Yeah, a decision tree on whether we were worthless would have been much easier with the dichotomous classifications (replied to you / did not reply to you). Glad to see you found an approach you could use!
 
#6
I like how no one actually replied until I brought up how worthless you all are, and then the responses are just acknowledging your true worthlessness by posting worthless comments. Do you guys even understand statistics?

You do know "Talk Stats" isn't about your World of ******** stats, right?
 

Dason

Ambassador to the humans
#11
I like how no one actually replied until I brought up how worthless you all are, and then the responses are just acknowledging your true worthlessness by posting worthless comments. Do you guys even understand statistics?
Yes. But there aren't an unlimited number of contributors and those of us that are regulars don't have an unlimited amount of time. So not every thread is going to get a reply - and when you insult us you are almost definitely not going to make matters any better for yourself.

If you would have posted what you did and just not insulted us in your second post I probably would have spent a little time explaining my take on your problem. But you decided to be an ass so I was an ass back - imagine that.
 

trinker

ggplot2orBust
#12
Dason said:
If you would have posted what you did and just not insulted us in your second post I probably would have spent a little time explaining my take on your problem. But you decided to be an ass so I was an ass back - imagine that.
In JRadcliffe's defense you are usually an ass though.
 
#13
Yes. But there aren't an unlimited number of contributors and those of us that are regulars don't have an unlimited amount of time. So not every thread is going to get a reply
Noticed that when I joined. Over half the threads never got a response. Apparently you're veryyyyy busy people.
 

Dason

Ambassador to the humans
#14
Noticed that when I joined. Over half the threads never got a response. Apparently you're veryyyyy busy people.
Yes. Most of us are. We don't get paid for this and most threads aren't interesting and/or the OP didn't show enough work to warrant helping them out. I skipped over your thread the first time around because there were a lot of new threads and it didn't seem to jump out at me as particularly exciting and I didn't have a lot of time.

You're not really helping out your case. You're not making yourself seem like somebody that we want to help. You're not being understanding or appreciative. I can understand why you aren't appreciative, because we haven't helped with the problem you've posted, but remember we aren't paid and you've been nothing but an ass since your second post. We've been poking fun at you since your second post because of this, which understandably has made you more agitated, but seriously, do you think you're doing anything to make us want to help you more by acting this way?
 

trinker

ggplot2orBust
#16
I think we can all have a do over on this. Let's put it behind us because I see JRadcliffe has answered at least 2 questions for other people today. I think he's part of the solution. JRadcliffe, can we move forward? The more people that share in the load of answering people's questions, the more questions get answered.
 
#17
“The soap of Mr Radcliffe”

This has happened:

Post #1
Mr Radcliffe wants to create a decision tree, but has not told us the estimator, algorithm or software (or package or function). A not very interesting post. Use of...


doing this in R and Excel.
...seems contradictory.

Post #2

Mr Radcliffe writes:

Thanks for being worthless.
On first reading I believed that he was writing ironically, felt bad and thought that he himself was worthless, because he had not received any replies. That turned out to be a misunderstanding on my part.

However, Mr Radcliffe also gives some extra information, presumably to persuade someone to reply! Somewhat surprising!

Post #3

Hlsmith gives a great suggestion for a simpler data set to start with.

Post #4

For once Dason gives his appreciation for one of the many thanks he has received.

Post #5

The Ecologist gives an example of a haiku poem.
Very informative! I have always wondered what a haiku poem is.

Post #6
Mr Radcliffe informs

I like how no one actually replied
He likes that! Problem solved!

(but the soap continues....)


Post #7

Trinker gives a statement that goes for me too, and informs us that the cakes are on TUESDAYS at 10.00, and not on Wednesdays as I have been told (by someone!!). Very informative post!

Post #8

The Ecologist tells us (that someone is playing with pac-man) and that there will be Kool-Aid. I am curious!


Post #9

Trinker informs us why he is on the PhD program.

Post #10

Victor shows us that he can swear in English (without being hit by **************.)

Post #11

Dason informs us that he is actually doing research on statistical genetics and only has a limited amount of time for writing 10,451 posts.

(But I believe that he secretly reveals that the solution is: 001100010010011110100001101101110011 )

Post #12

Trinker restarts the raptor-bot alliance's usual raptor-versus-bot war. Everything is back to normal. (You might believe that this is the end... Cliffhanger... But...)

Post #13 (that's not a lucky number.)

Mr Radcliffe surprisingly returns and emphasizes everybody's duty to answer every post. That surprises me because I can't remember that he has answered any thread created by me!

Post #14

Dason replies that there is actually work that is paid and that there are threads that are interesting. I conclude that we are all (in a mysterious way) paid by Mr Radcliffe, since we have responded.

Post #15

Ted00 replies. This gives more power!

Post #16
You think that this is The End?

“No, this is not The End. But it is, perhaps, the end of the beginning!”
 

Dason

Ambassador to the humans
#18
Are you trying to build your decision tree by hand? Are you opposed to having one of the many R packages that build trees for you fit the model? Have you checked out the package 'rpart'?
 
#20
Are you trying to build your decision tree by hand? Are you opposed to having one of the many R packages that build trees for you fit the model? Have you checked out the package 'rpart'?
Yes, trying to build it by hand. I'm doing this as an independent study course and the instructor - who knows nothing about the topic - doesn't want me to use pre-made packages.

I identified the top 5 variables based on entropy / information gain, normalized the variables, then split each variable into TRUE and FALSE groups. I compared the distributions of the TRUEs / FALSEs to identify the best cutoff points.

On my training set, this resulted in 126 predictions for TRUE: 34 correct and 92 incorrect, for a net score of -58 (+1 point for correct, -1 for incorrect). The training set actually contains 104 TRUE and 4896 FALSE rows (5000 total).
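
As a quick check of those figures, here is a small Python sketch that recomputes the net score from the counts above and adds precision and recall for context. The variable names are just illustrative.

```python
# Counts taken from the post above.
predicted_true = 126
correct = 34
incorrect = predicted_true - correct  # 92 false alarms
actual_true = 104
total = 5000

net_score = correct - incorrect        # +1 per correct TRUE, -1 per incorrect TRUE
precision = correct / predicted_true   # share of TRUE predictions that were right
recall = correct / actual_true         # share of actual TRUEs that were found
base_rate = actual_true / total        # share of TRUE rows in the training set

print(net_score)            # -58
print(round(precision, 3))
print(round(recall, 3))
print(round(base_rate, 4))
```

Under this +1 / -1 scoring, the net score is only positive when more than half of the TRUE predictions are correct, so a rule can pick out TRUEs far better than the roughly 2% base rate and still score negative.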

I then used an Excel plugin called Risk Solver (a far superior version of the Solver add-in) to optimize the cutoff points to maximize the net score. It was able to get a net score of 5, and it achieved this after the third variable. Pretty poor accuracy.
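
For a single variable, the cutoff that maximizes this net score can also be found by brute force. The Python sketch below is not what Risk Solver does internally; it just scans every distinct value as a cutoff. It assumes the rule is "predict TRUE when x >= cutoff", and the function name and toy data are hypothetical.

```python
def best_cutoff_by_net_score(values, labels):
    """Scan cutoffs for the rule 'predict TRUE when x >= cutoff'.

    Returns (cutoff, score). A cutoff of None means no cutoff beats
    predicting TRUE for nobody, which scores 0.
    """
    pairs = sorted(zip(values, labels), reverse=True)
    best_t, best_score = None, 0
    score = 0
    for i, (x, y) in enumerate(pairs):
        score += 1 if y else -1  # +1 for a correct TRUE, -1 for an incorrect one
        # Rows sharing a value fall on the same side of any cutoff,
        # so only evaluate after the last row of a tied group.
        if i + 1 < len(pairs) and pairs[i + 1][0] == x:
            continue
        if score > best_score:
            best_t, best_score = x, score
    return best_t, best_score

# Hypothetical toy column, already normalized to [0, 1].
x = [0.9, 0.85, 0.8, 0.6, 0.5, 0.3, 0.2]
y = [True, True, False, True, False, False, False]
cutoff, score = best_cutoff_by_net_score(x, y)
print(cutoff, score)
```

Note that predicting TRUE for no rows at all scores 0, and with 104 TRUE rows the best possible score is 104, so a net score of 5 is only slightly better than making no predictions.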