Script Help Or Excel Add-in

#1
Note: please see my next post to get a better understanding of what I am looking for. Thanks -Ben

Hello,

I have been using R/S+ for years and love the software. Still a novice but I can do what i need to.

Anyway, I am working on a forecasting project and would like to create a script that automates the following steps:
order data by X
Create groupings 1...N
Separate test data set
run regression on each group
Determine result. (A mathematical formula is run on the forecasted output)
Repeat for test set
Review results

I have no idea how to do this script in S+. Working in excel would make this much easier so my question is twofold:

1) I am running windows 7, MS Office 2010 and i dont have an excel add-in for S+. Does one exist? Does one exist for R? An installation link would be very helpful.

2) How can i create the above script in R/S+ to automate my research.


Any help would be greatly appreciated!

Thank you,

Ben
 
Last edited:

trinker

ggplot2orBust
#2
I have no idea how to do this script in S+. Working in excel would make this much easier
I actually doubt this. Learning to do it in S+ would make it easy. R does have an add on for excel but I don't use it and don't have the link. Use http://www.rseek.org/ and do a search as this is an R specific search engine.

Your questions are conflicting so let's focus on the second one.

I started making a fake data set to do your problem and started trying to figure out what you're after but then I stopped as you don't appear to have done much leg work on your own on this question and have been very vague in your question. I'm familiar with R and in your list:


order data by X
Create groupings 1...N
Separate test data set
run regression on each group
Determine result. (A mathematical formula is run on the forecasted output)
Repeat for test set
Review results
I have a difficult time surmising what you want and in some places can't even attempt to guess what you're after. With this and any R help site two things should occur and possibly a 3rd if you think you've been vague in what you want:

1. Provide a minimal example to work with.
2. Provide the code of what you've done so far. (at least show me that you can read the data frame into S+)
3. [if you've been vague] show a final result of what you want to happen (for you that would be showing what you want at every stage of the steps you've given).

First, I personally will not respond back until those things occur. Secondly, I'd want to see at least a little bit of work on this on your own (see our Posting Guidlineson this). Many of the things you're looking for are very elementary and could be understood with a brief perusing of a guide like: http://cran.r-project.org/doc/contrib/Short-refcard.pdf

Your thread has some very important questions about work flow but you've asked how to do an entire project basically without very explicit outcomes. it shows very little work. Please clean your original thread up a bit and I (and I'm sure other members) will be more than willing to provide assistance to you,a s I really like R and try hard to help others understand it and get them excited about what it can do (though you use S+ which use similar but not identical).
 
#3
Dear trinker,

You having a bad day? Your condescending remarks don't foster a welcoming forum. Use that feedback however you wish but I'm compelled to let you know that.

Now to answering your questions:
Attached is a data set which should help. It's composed of 195 rows (excluding the header) and 17 columns.
The data is sorted by column 2, called "grouping", which has values 1,2,3,4,5.
There is also column 3, called test_in, which represents data to be used for out of sample testing.

My analysis approach today is:
step 1 - remove the testing data which leaves 115 records.

2 - run a regression on group 1. Save the model
note: my regression is a linear model with dependent variable: column4, GM_Total. Then cols 6-17 are the independent variables.

3 - run a regression on group 2. Save the model
4 - run a regression on group 3. Save the model
5 - run a regression on group 4. Save the model
6 - run a regression on group 5. Save the model
7 - Determine the result. So I am forecasting game_total. I now compare this versus column7, O.U. If the forecasted value is greater than OU AND the actual GM_Total is higher than O.U., record is given a 'success' rating. Conversely if both are below OU, success, otherwise failed marking.

8 - using the saved models, run the respective test cases.
9 - analyze the results
Note: i have separated the models because the regression did not differentiate the pre-defined cluster/grouping/portioning i did. Ie, this is column 2, grouping. This is an absolute critical piece, otherwise the data is meshed together and the regression cannot identify true trends.
Basically I get different results when adding this grouping variable and running all the data together.

So here's where my question comes into play. I have all of these steps, most of which are automated at this point, but the remaining piece is automating the grouping values. Say for example instead of the values 1-5, I use 1-10. I would love to simply enter this changed parameter and have the script run.
This little change has profound effects. Now model 1 no longer uses rows 1-40, it must use rows 1-25. Furthermore, the applicable test records are now rows 26-31 vs previous case. I hope this part is clear. This small differentiation in grouping values causes large impacts downstream.

So I need to create code that can determine what the groupings are and select the proper input records. And I have no idea how to do this. Any ideas?

Appreciate the help and please let me know nicely if you need any clarification. More than happy to provide.

Ben
 

trinker

ggplot2orBust
#4
Didn't mean to come off as a jerk I just want to make sure that you understand expectations. It makes it a lot more fruitful for everyone if we all agree to some parameters around questions. My intent was to be forward but not to discourage. Alright when I say I doubt that by the way I mean that generally R will be so much faster and easier if you know how to do something.

Ok the first thing I'd do is save your data as a .csv file which I did. You can read in excel with read.xsl from the gdata package but it can be a pain.

Now I start by reading in the data and then subset to eliminate the cases you wanted to exclude. Next I created a regression function to regress on the columns you specified. I chose an additive model but you can may interaction or what ever you choose. I told it to spit out the anova table and the summary (beta table). You can tell it to do more things but I am not sure exactly what you're after as I'm not familiar with your field.

Next I split the data frame into a bunch of data frames by the Grouping variable. (this guy is a nice function)

Now the beautiful lapply function to take that regression function I made (really just a wrapper for lm) and told lapply to run that regression on every group and spit out the results as a list. I applied the names of the groups so you know what's what.

Now you have a list of results that you can access using indexing which I demonstrate how to do at the end of the code.
Code:
##############################################################
# step 1 - remove the testing data which leaves 115 records. #
##############################################################
dat2 <- subset(dat, Test_IN==0)#subset the data
nrow(dat2) #115 observations as specified


#Create a function to analyze thecolumns you're interested in and 
#output anything you're interested in
modeling <- function(x) {
    mod <- lm(GM.Total~Line + O.U. + HM.off.Points + HM.def.Points + 
        AW.off.Points + AW.def.Points + HM.off.QB.rtng + HM.def.QB.rtng + AW.off.QB.rtng + 
        AW.def.QB.rtng + HM_OFF.QB.DIFF + HM_DEF.QB.DIFF, data=x)
    list("model"=mod, "anova"=anova(mod), "beta_table"=summary(mod))
}


RESULTS <- lapply(GROUPED.DAT, modeling)
names(RESULTS) <- paste("GROUP", names(GROUPED.DAT))
RESULTS


RESULTS[[1]] #access results for group one
RESULTS[[1]] [[1]]#access model for group one
RESULTS[[1]] [[2]]#access anova for group one
Again, I am sorry for making you feel as if though I was being condescending, but we get a number of people here wanting members here to do HW for them or analyze their data. We are here because we love stats and want to help others learn. Your initial post indicated you were interested in free consulting which we do not do (it was very broad and not explicit). Your second post was much more explicit (though some parts were field dependent and I couldn't follow exactly). If you have questions about how to tweak the results let us know.

Tyler
 
#5
Trinker,

Thank you again for the help. Your comments/suggestions have allowed me to quickly perform many operations that would have previously taken days to accomplish..
I have now built a script that cycles through several partitions which is awesome!

My next challenge is the Linear Regression Model. It currently includes every column of data and I would like to pear this down. My questions:

1a) Is there a function/automated tool that runs the LM, finds the least significant variable, removes it, runs the LM again and re-assesses? This way the "optimal" model is found.

1b) If this functionality exists, what model summary data does it look at to stop? Does it try to maximize the R-squared or AIC or other? Can this be customized?

1c) If this does not exist, is there any software I can use to manually create independent variable permutations? For example, we have variables y ~ a + b + c. scenarios are: a + b + c, a + b, a + c, b + c, etc.
Then if I have these output to a file, i could run each LM and identify performed best.


2) Is there a function that transforms the independent variables. For example say your LM model is y ~ a + b. Does any function exist such that it tries out all permutations of a^2 + b, a^2 + b^2, a + b^2, a*b + a + b, ln(a +b), etc..


Appreciate the help!

Ben