# Share your functions & code

#### trinker

##### ggplot2orBust
I decided to try to improve upon my search function to make it more versatile for searching through large data sets (similar to the search button in Microsoft Excel). I changed grep to agrep and added an ignore.case argument so the function is no longer case sensitive and takes approximate matches. I added a variation argument (agrep's max.distance) set at .02. Adjust this to 0 to narrow the results or higher to broaden.

Currently the function works on specific columns of a data frame. I wanted to make it work on a data frame and return all rows that contain any columns with the search term. I thought about using the apply function and then unique to eliminate duplicates. Unfortunately, I can't seem to get this to work. Any ideas?

Search Function Code
Code:
Search<-function(term,dataframe,column.name,variation=.02){
te<-substitute(term)
te<-as.character(te)
cn<-substitute(column.name)
cn<-as.character(cn)
HUNT<-agrep(te,dataframe[,cn],ignore.case =TRUE,max.distance=variation)
dataframe[c(HUNT),]
}
EXAMPLE
Code:
#CREATING A FAKE DATA SET
SampDF<-data.frame("islands"=names(islands),mtcars[1:48,])

#EXAMPLES
Search(cuba,SampDF,islands)
Search(New,SampDF,islands)
Search(ho,SampDF,islands)#Too much variation
Search(ho,SampDF,islands,var=0)
Search("Axel Hbeierg",SampDF,islands)#not enough variation
Search("Axel Hbeierg",SampDF,islands,var=2)
Search(19,SampDF,mpg,0)

Last edited:

##### Ninja say what!?!
Dason I'm proficient (just a smidge away from what a real statistician would describe as dangerous) with [R]. I tried to figure it out for some time; to no avail. I failed and peeked Don't judge.
LMAO. I was just plain lazy and skipped right to the spoiler. Sorry! Very neat though!

##### Ninja say what!?!
Thanks a lot trinker for posting this! I had been wanting to do something like this for a while...but was too busy (read lazy) to do it. Since it's started though...here's one of my favorites. I should probably edit it though to work for words with spaces as well (it currently doesn't work if there are spaces in the word).

Code:
#This function allows you to type in names without the need for " or '. Great when you have lots of text to type.
words <- function(...) paste(substitute(list(...)))[-1]

#CREATING A FAKE DATA SET
SampDF<-data.frame("islands"=names(islands),mtcars[1:48,])

#EXAMPLES
words(test, test2, test3)
subset(SampDF, select=words(am, gear, carb))
SampDF$islands %in% words(Cuba, Newfoundland, Africa) #### Dason ##### Ambassador to the humans I should probably edit it though to work for words with spaces as well (it currently doesn't work if there are spaces in the word). I don't think you could keep it the way you have it and modify it so that you could have spaces in your "words". But if you do have words with spaces in them you can just put quotes around that and it will work. Code: nullfun <- function(...){} words <- function(...) paste(substitute(list(...)))[-1] # works words(hey, you, guys) # Doesn't work words(hey, you, guys, specifically link) # Doesn't work either so I don't think you could #Just modify your function to make it work. nullfun(hey, you, guys, specifically link) #But this works. So you just have to do a little more work # for words with spaces. words(hey, you, guys, "specifically link") #### Link ##### Ninja say what!?! lol. Yeah...I guess that's one solution. That's actually what I currently do when I have words with spaces. #### Dason ##### Ambassador to the humans I think that's about the best you'll be able to do too without wrapping the entire input with quotes like so: Code: words <- function(input){ strsplit(input, ",")[[1]] } #example words("Hey,you,guys") Note that I was lazy and if you put spaces after the commas then the spaces are in the words. You could modify it to trim the beginning and ending spaces easily though. #### trinker ##### ggplot2orBust Search Function (Danger) In my previous post for the search() function I named an object within the function "t". This is dangerous because [R] has already assigned "t" to something else. This is what lead to my issues with the snow library not loading. Inside of the function I think you're ok, but if you run individual lines outside of the function it's risky. I updated the code in my previous post to represent this change. my apologies. Now any ideas on how to apply search to all columns of a data frame. My idea was to try apply and then unique but I can't get the apply to work properly. #### Dason ##### Ambassador to the humans Search Function (Danger) In my previous post for the search() function I named an object within the function "t". This is dangerous because [R] has already assigned "t" to something else. Imagine if you would have overwrote T. One of the meanest things you could do to somebody who leaves an R script open is put this at the top of the file and run it so they don't notice Code: T <- FALSE F <- TRUE which is partially why I try to always use TRUE or FALSE instead of T or F. Imagine trying to debug that. #### bryangoodrich ##### Probably A Mammal Since I want to contribute something here, I thought I would include an example I coded myself as part of my ALSM project. This comes from chapter 16 (not included on website, yet). The example considers taking all the permutations of the given (base) 9-sequence as assigned to three treatments: (1.1, 0.5, -2.1), (4.2, 3.7, 0.8), and (3.2, 2.8, 6.3). The example concluded by drawing a histogram of the Randomization (Permutation) distribution and overlaying the appropriate F-distribution that approximates it. Code: ################################################################################ ## TABLE 16.5 (p 715) # ## Randomization Samples and Test Statistics--Quality Control Example # ## FIGURE 16.8 # ## Randomization Distribution of F* and Corresponding # ## F Distribution--Quality Control Example # ## # ## Since there is no algorithm to compute this example we had to devise one. # ## It should come as rather straight-forward. The Xi's are as in the above # ## examples. The 'y' will hold the 1,680 cases of 9-sequences consisting of # ## the response variables. The 'ti' implies the treatment group. In this case # ## t1 is the first group (3-sequence) and t12 is the composite of t1 and t2. # ## The 'remainder' function is a wrapper for grabing a subset of 'set' based # ## on those values not in 'x'. The 'seq6' is the 6-sequence remainder after t1 # ## is defined. The whole process took less than 10 seconds on a 2.4 GHz # ## processor. As for the output, the columns are arbitrarily labeled 1-9. # ## Clearly they represent the three treatment groups based on groups of three. # ## The function 'f' uses the matrix algebra discussed in Ch. 5. It is possible # ## to get away with merely fitting an 'lm' object, and then extract the # ## f-statistic in a single call. However, this requires a lot of additional # ## work for each of the 1680 rows. It took somewhere between 30-60 seconds to # ## produce the same result. # ################################################################################ remainder <- function(x, set) set[!set %in% x] f <- function(Y, X) { Y <- matrix(Y) ## Turn row-vector into column p <- ncol(X); n <- nrow(X) J <- matrix(1, n, n) ## (5.18) H <- X %*% solve(t(X) %*% X) %*% t(X) ## (5.73a) SSE <- t(Y) %*% (diag(n) - H) %*% Y ## (5.89b) SSR <- t(Y) %*% (H - (1/n)*J) %*% Y ## (5.89c) fstar <- (SSR / (p - 1)) / (SSE / (n - p)) ## (6.39b) } base <- c(1.1, 0.5, -2.1, 4.2, 3.7, 0.8, 3.2, 2.8, 6.3) t2 <- t12 <- t123 <- list() y <- NULL X <- cbind( X1 = c(1, 1, 1, 0, 0, 0, 0, 0, 0), X2 = c(0, 0, 0, 1, 1, 1, 0, 0, 0), X3 = c(0, 0, 0, 0, 0, 0, 1, 1, 1) ); t1 <- t(combn(base, 3)) seq6 <- t(combn(base, 3, remainder, set = base)) for (i in 1:84) t2[[i]] <- t(combn(seq6[i, ], 3)) for (i in 1:84) t12[[i]] <- cbind(t1[i, 1], t1[i, 2], t1[i, 3], t2[[i]]) for (i in 1:84) t123[[i]] <- cbind(t12[[i]], t(apply(t12[[i]], 1, remainder, set = base))) for (i in 1:84) y <- rbind(y, t123[[i]]) fstar <- apply(y, 1, function(Y) f(Y, X)) cbind(y, data.frame(f = fstar)) hist(fstar, freq = FALSE, ylim = c(0, 1), col = "gray90", main = "") curve(df(x, 2, 6), add = TRUE, lwd = 2) rm(base, fstar, i, remainder, seq6, t1, t2, t12, t123, f, X, y) I originally coded this using the 'lm' summary function and appropriate indexing to grab the F-values. It took nearly a minute. I decided later to recode this algorithm by doing the calculations manually and the whole thing reduced to mere seconds. If anyone has any suggestions on better organization or more efficient coding, let me know. I like my code to be concise, especially in examples like these. I don't want to burden the reader with having to decipher a whole bunch of code to appreciate the example. And if anyone is wondering how we figure out there are 1680 possible permutations, consider the fact that we first have to choose the first treatment (3-sequence). There are 84 (= 9 choose 3) ways to do this. For each of those 84, there remains a 6-sequence for which we need to choose another 3-sequence for the second treatment. Upon that selection the last (remainder) treatment (3-sequence) is automatically determined. Thus, the number of permutations is $$\binom{9}{3} \times \binom{6}{3} \times \binom{3}{3} = 84 \times 20 \times 1 = 1680$$ Last edited: #### trinker ##### ggplot2orBust Random card, die and coin outcome generator I feel it is my duty to revive this thread every so often (mostly for selfish reasons so I can see everyone else's good ideas) Game Probabilities Function I tutor middle school students and am often in the need of a card, die, or coin generator (the web is rank with these but I didn't always get internet so [R] to the rescue). So… Description This sfunction is outcome generators for: die, coin & cards. It's very basic but I've used it many times tutoring (one of the first functions I wrote). Function note: the game argument does not need to be in quotes Code: game.prob(game=c(coin,die,cards),n,replace=T) Function Code Code: [FONT="Courier New"][SIZE="1"]game.prob<-function(game,n=10,replace=T){ g<-substitute(game) g<-as.character(g) switch(g, coin=paste(sample(c("heads","tails"), n, replace),sep=" "), die=paste(sample(c("one","two","three","four","five","six"), n, replace),sep=" "), cards=paste(sample(c("Ace of Hearts","Ace of Diamonds","Ace of Spades","Ace of Clubs", "2 of Hearts","2 of Diamonds","2 of Spades","2 of Clubs", "3 of Hearts","3 of Diamonds","3 of Spades","3 of Clubs","4 of Hearts","4 of Diamonds","4 of Spades","4 of Clubs", "5 of Hearts","5 of Diamonds","5 of Spades","5 of Clubs","6 of Hearts","6 of Diamonds","6 of Spades","6 of Clubs", "7 of Hearts","7 of Diamonds","7 of Spades","7 of Clubs","8 of Hearts","8 of Diamonds","8 of Spades","8 of Clubs", "9 of Hearts","9 of Diamonds","9 of Spades","9 of Clubs","10 of Hearts","10 of Diamonds","10 of Spades","10 of Clubs", "Jack of Hearts","Jack of Diamonds","Jack of Spades","Jack of Clubs","Queen of Hearts","Queen of Diamonds","Queen of Spades","Queen of Clubs", "King of Hearts","King of Diamonds","King of Spades","King of Clubs"), n, replace),sep=" ")) }[/SIZE][/FONT] Examples Code: game.prob(die,6) game.prob(die,40) game.prob(cards,40) game.prob(cards,40,replace=F) game.prob(coin,100) paste(game.prob(die,40), game.prob(die,40),sep=" & ") # 2 die paste(game.prob(coin,10), game.prob(die,10),sep=" & ") # a coin and a die #TABLING THE OCCURANCES table(game.prob(die,4000)) table(game.prob(coin,100)) table(game.prob(cards,40)) table(paste(game.prob(coin,1000), game.prob(die,1000),sep=" & "))# a coin and a die Last edited: #### Dason ##### Ambassador to the humans That doBy is a pretty big dependency just to recode the variables. #### trinker ##### ggplot2orBust Point taken Dason I eliminated that dependency and eliminated the need for quotations around the game argument. Still somewhat inefficient (could have pasted the number and suits of the cards together but I'm lazy right now). #### trinker ##### ggplot2orBust Get Package and Load it with One Function When I find a library I want I usually get it and load it right away. I use the install.packages() & library() functions. I've decided I'm too lazy for that. So I made one function to do both actions. Thanks to Dason for helping me work this one out. You do not need to put quotes around the library name when you use the code. Function Code: get.lib<-function(package){ pack1<-substitute(package) pack<-as.character(pack1) install.packages(pack) library(pack, character.only = TRUE) } Example Code: get.lib(car) #### trinker ##### ggplot2orBust MODE OF Vectors and Data frames Description [R] has a mean and a median function and mode means something other than you'd expect. At times I wish there was a mode function (may be one in a package). Anyway I decided to make a function for mode. It takes a vector or data frame argument and returns 1) the mode, or 2) a warning if there is no mode (frequency of everything is 1). I made it work on data frames but I feel like it's not at its fullest yet. I have addressed this concern with the change from apply to lapply noted at the bottom Please critique so we can improve this. Function MODE(x) Arguments x is a vector or dataframe Function Code: MODE <- function(dataframe){ DF <- as.data.frame(dataframe) MODE2 <- function(x){ if (is.numeric(x) == FALSE){ df <- as.data.frame(table(x)) df <- df[order(df$Freq), ]
m <- max(df$Freq) MODE1 <- as.vector(as.character(subset(df, Freq == m)[, 1])) if (sum(df$Freq)/length(df$Freq)==1){ warning("No Mode: Frequency of all values is 1", call. = FALSE) }else{ return(MODE1) } }else{ df <- as.data.frame(table(x)) df <- df[order(df$Freq), ]
m <- max(df$Freq) MODE1 <- as.vector(as.numeric(as.character(subset(df, Freq == m)[, 1]))) if (sum(df$Freq)/length(dfFreq)==1){ warning("No Mode: Frequency of all values is 1", call. = FALSE) }else{ return(MODE1) } } } return(as.vector(lapply(df, MODE2))) } Your indentation was really weird to me. I'm not sure if that was a result of mixing tabs/spaces (it's best to stick to one or the other - or use tabs to indent to the current indentation level and then if you want to align code use spaces). I fixed that up. You also had some parenthesis and curly braces that I couldn't figure out what the point of them was. I removed those and it didn't seem to break anything. I added a few spaces between operators. I'm not quite as big a fan of throwing in extra spaces as some of my professors are but I do think they are useful (I changed something like j==3 to j == 3) On a personal note I found it somewhat confusing to create a variable called df inside of MODE2 since there is already a df defined in the parent scope. On a side note - is there a reason you only accept numeric input? I typically only think of using the mode for non-numeric input so I found it interesting that you only find it for numeric input. #### trinker ##### ggplot2orBust Dason, Thanks for the time and feedback. That helps immensely. For me to have someone give critique on something you've made is much more meaningful than looking up critiques and tearing apart code. Hopefully people will see the progression of the code as well and will find the style useful in their own learning. Your indentation was really weird to me. I'm not sure if that was a result of mixing tabs/spaces (it's best to stick to one or the other - or use tabs to indent to the current indentation level and then if you want to align code use spaces). yes this was the problem. I was also using str to look at other functions and saw mixed methods. Your example provides very clear guidance. You also had some parenthesis and curly braces that I couldn't figure out what the point of them was. I removed those and it didn't seem to break anything. I thought I was supposed to do the curly braces to section off (not the right term) a function within a function. I'm not sure where I got this idea from. (I changed something like j==3 to j == 3) I wasn't sure what to do here. The style guides seemed to not be definitive with = signs but I think that was a single =. On a personal note I found it somewhat confusing to create a variable called df inside of MODE2 since there is already a df defined in the parent scope. Didn't know this was bad. I'll make changes to reflect that. Thank you for that awareness on this point. On a side note - is there a reason you only accept numeric input? I typically only think of using the mode for non-numeric input so I found it interesting that you only find it for numeric input. I gave this some thought and I think it goes with thinking of central tendency measures as numeric. Mean and median are applied to numeric data. But on investigating this I found it is fine and actual useful for non-numeric data as well. I made the change to the code to reflect that as well. Updated Code Code: MODE <- function(dataframe){ DF <- as.data.frame(dataframe) MODE2 <- function(x){ if (is.numeric(x) == FALSE){ df <- as.data.frame(table(x)) df <- df[order(dfFreq), ]
m <- max(df$Freq) MODE1 <- as.vector(as.character(subset(df, Freq == m)[, 1])) if (sum(df$Freq)/length(df$Freq)==1){ warning("No Mode: Frequency of all values is 1", call. = FALSE) }else{ return(MODE1) } }else{ df <- as.data.frame(table(x)) df <- df[order(df$Freq), ]
m <- max(df$Freq) MODE1 <- as.vector(as.numeric(as.character(subset(df, Freq == m)[, 1]))) if (sum(df$Freq)/length(df\$Freq)==1){
warning("No Mode: Frequency of all values is 1", call. = FALSE)
}else{
return(MODE1)
}
}
}

return(as.vector(lapply(DF, MODE2)))
}