Modeling Guidance

#1
I am working on a side project/hobby where I have found a way to extract a large dataset centered on a binary outcome of the same situational occurrence over many years. I would like to apply some level of predictive modeling or trend analysis to help me understand whether this situation has a high probability of producing the same result in the future (assuming all externalities that could influence the situation remain untouched).

To be frank, there are just so many approaches to this out there that it becomes a bit overwhelming. I am looking for some help to point me in the right direction, allowing me to focus my research on the right modeling/analysis techniques so I don't waste time looking in the wrong places.

Appreciate everyone's expertise.
 
#2
Let's talk about the data first. How many binary observations do you have per year? How many years do you have? How many other variables do you have?

For simplicity, let's say your binary observation is 'does it rain or not' in a given city on a given day. If you have a continuous predictor to go along with it, let's say humidity, then you can perform a binary logistic regression. What this would be doing is trying to fit a probability of it raining based on the humidity. Based on the model diagnostics you can determine whether it's a good predictor or not. Also, you can use many factors simultaneously, like humidity, temperature, barometric pressure, etc.
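For instance, a minimal sketch in R (the data here is simulated just for illustration, not real weather):

# simulate a year of daily humidity readings and rain outcomes
set.seed(42)
n <- 365
humidity <- runif(n, 20, 100)                       # percent humidity
rain <- rbinom(n, 1, plogis(-6 + 0.08 * humidity))  # wetter days rain more often
df <- data.frame(rain, humidity)

# fit the binary logistic regression
fit <- glm(rain ~ humidity, data = df, family = binomial)
summary(fit)  # model diagnostics: is humidity a significant predictor?

# predicted probability of rain at 85% humidity
predict(fit, newdata = data.frame(humidity = 85), type = "response")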

If you have some categorical predictors then you can include them in a logistic regression too, very much like you can with linear regression. Depending on your software you might have to code the data differently to make it work, but examples would be using the month or the day of the week. Again, you need to be able to interpret the results of the model to understand whether the factors/predictors are significant or not.
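Same idea with a categorical predictor, again a simulated R sketch; the month enters the model as a factor:

# simulated data; rain here ignores month, so its coefficients should come out non-significant
set.seed(1)
n <- 365
humidity <- runif(n, 20, 100)
month <- factor(sample(month.abb, n, replace = TRUE), levels = month.abb)
rain <- rbinom(n, 1, plogis(-6 + 0.08 * humidity))

fit2 <- glm(rain ~ humidity + month, family = binomial)
summary(fit2)  # one coefficient per month, each relative to the baseline level (Jan)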

I only proposed this one because I've used it before. I am sure some others with more experience can offer a few more ideas.
 
#3
Thanks Arch, this is a great start - I'll dive in!

The short answer to your initial questions is that they will vary. I am reviewing and developing many (100s of) situations dating back to 2000. Each situation presents a different array of samples throughout the years, and each has a different number of variables associated with it.

To put a range on the sample sizes, some have as few as 60 over 18 years while others have around 500. I assume some of these models or analyses will take into consideration the low sample sizes of some situations and expose the risk in their predictability going forward? Maybe that's a different analysis altogether.

As far as software goes, I have only been operating out of Excel. If you think I will need something more robust, I am open to suggestions, although I wouldn't be interested in spending more than $150 on the toolset.

Appreciate the help!
 
#4
> I am open to suggestions, although I wouldn't be interested in spending more than $150 on the toolset.
R is a free programming language designed for statistical work. If you are good at programming and have the time, you could probably learn to do what you want in a matter of 4-6 hours. I am self-taught in R; I used YouTube and Google to figure things out. It helps if you have a programming background: I learned SAS in school, so R was an easy transition.
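To make that concrete, the whole workflow for one of your situations would look something like the sketch below (the file and column names are placeholders, of course):

# placeholder file: an 'outcome' column of 0/1 plus whatever predictors you extracted
df <- read.csv("situation_001.csv")
fit <- glm(outcome ~ var1 + var2, data = df, family = binomial)
summary(fit)  # coefficients and p-values for each predictor
confint(fit)  # wide intervals on the small-sample situations flag the predictability risk you mentioned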