Share your functions & code

trinker

ggplot2orBust
#21
I decided to try to improve upon my search function to make it more versatile for searching through large data sets (similar to the search button in Microsoft Excel). I changed grep to agrep and added an ignore.case argument so the function is no longer case sensitive and takes approximate matches. I added a variation argument (agrep's max.distance) set at .02. Adjust this to 0 to narrow the results or higher to broaden.

Currently the function works on specific columns of a data frame. I wanted to make it work on a data frame and return all rows that contain any columns with the search term. I thought about using the apply function and then unique to eliminate duplicates. Unfortunately, I can't seem to get this to work. Any ideas?

Search Function Code
Code:
Search<-function(term,dataframe,column.name,variation=.02){
    te<-substitute(term)
       te<-as.character(te)
   cn<-substitute(column.name)
      cn<-as.character(cn)
          HUNT<-agrep(te,dataframe[,cn],ignore.case =TRUE,max.distance=variation)
   dataframe[c(HUNT),]
}
EXAMPLE
Code:
#CREATING A FAKE DATA SET
SampDF<-data.frame("islands"=names(islands),mtcars[1:48,])

#EXAMPLES
Search(cuba,SampDF,islands)
Search(New,SampDF,islands)
Search(ho,SampDF,islands)#Too much variation
Search(ho,SampDF,islands,var=0)
Search("Axel Hbeierg",SampDF,islands)#not enough variation
Search("Axel Hbeierg",SampDF,islands,var=2)
Search(19,SampDF,mpg,0)
 
Last edited:

Link

Ninja say what!?!
#22
Dason I'm proficient (just a smidge away from what a real statistician would describe as dangerous) with [R]. I tried to figure it out for some time; to no avail. I failed and peeked :( Don't judge.
LMAO. I was just plain lazy and skipped right to the spoiler. Sorry! Very neat though!
 

Link

Ninja say what!?!
#23
Thanks a lot trinker for posting this! I had been wanting to do something like this for a while...but was too busy (read lazy) to do it. Since it's started though...here's one of my favorites. I should probably edit it though to work for words with spaces as well (it currently doesn't work if there are spaces in the word).

Code:
#This function allows you to type in names without the need for " or '. Great when you have lots of text to type.
words <- function(...) paste(substitute(list(...)))[-1]

#CREATING A FAKE DATA SET
SampDF<-data.frame("islands"=names(islands),mtcars[1:48,])

#EXAMPLES
words(test, test2, test3)
subset(SampDF, select=words(am, gear, carb))
SampDF$islands %in% words(Cuba, Newfoundland, Africa)
 

Dason

Ambassador to the humans
#24
I should probably edit it though to work for words with spaces as well (it currently doesn't work if there are spaces in the word).
I don't think you could keep it the way you have it and modify it so that you could have spaces in your "words". But if you do have words with spaces in them you can just put quotes around that and it will work.
Code:
nullfun <- function(...){}
words <- function(...) paste(substitute(list(...)))[-1]

# works
words(hey, you, guys) 
 # Doesn't work
words(hey, you, guys, specifically link)
# Doesn't work either so I don't think you could
#Just modify your function to make it work.
nullfun(hey, you, guys, specifically link) 

#But this works. So you just have to do a little more work
# for words with spaces.
words(hey, you, guys, "specifically link")
 

Dason

Ambassador to the humans
#26
I think that's about the best you'll be able to do too without wrapping the entire input with quotes like so:
Code:
words <- function(input){
  strsplit(input, ",")[[1]]
} 

#example
words("Hey,you,guys")
Note that I was lazy and if you put spaces after the commas then the spaces are in the words. You could modify it to trim the beginning and ending spaces easily though.
 

trinker

ggplot2orBust
#27
Search Function (Danger)

In my previous post for the search() function I named an object within the function "t". This is dangerous because [R] has already assigned "t" to something else. This is what lead to my issues with the snow library not loading. Inside of the function I think you're ok, but if you run individual lines outside of the function it's risky. I updated the code in my previous post to represent this change. my apologies.

Now any ideas on how to apply search to all columns of a data frame. My idea was to try apply and then unique but I can't get the apply to work properly.
 

Dason

Ambassador to the humans
#28
Search Function (Danger)

In my previous post for the search() function I named an object within the function "t". This is dangerous because [R] has already assigned "t" to something else.
Imagine if you would have overwrote T.

One of the meanest things you could do to somebody who leaves an R script open is put this at the top of the file and run it so they don't notice
Code:
T <- FALSE
F <- TRUE
which is partially why I try to always use TRUE or FALSE instead of T or F. Imagine trying to debug that.
 

bryangoodrich

Probably A Mammal
#29
Since I want to contribute something here, I thought I would include an example I coded myself as part of my ALSM project. This comes from chapter 16 (not included on website, yet). The example considers taking all the permutations of the given (base) 9-sequence as assigned to three treatments: (1.1, 0.5, -2.1), (4.2, 3.7, 0.8), and (3.2, 2.8, 6.3). The example concluded by drawing a histogram of the Randomization (Permutation) distribution and overlaying the appropriate F-distribution that approximates it.

Code:
################################################################################
## TABLE 16.5                                                          (p 715) #
## Randomization Samples and Test Statistics--Quality Control Example          #
## FIGURE 16.8                                                                 #
## Randomization Distribution of F* and Corresponding                          #
## F Distribution--Quality Control Example                                     #
##                                                                             #
## Since there is no algorithm to compute this example we had to devise one.   #
## It should come as rather straight-forward. The Xi's are as in the above     #
## examples. The 'y' will hold the 1,680 cases of 9-sequences consisting of    #
## the response variables. The 'ti' implies the treatment group. In this case  #
## t1 is the first group (3-sequence) and t12 is the composite of t1 and t2.   #
## The 'remainder' function is a wrapper for grabing a subset of 'set' based   #
## on those values not in 'x'. The 'seq6' is the 6-sequence remainder after t1 #
## is defined. The whole process took less than 10 seconds on a 2.4 GHz        #
## processor. As for the output, the columns are arbitrarily labeled 1-9.      #
## Clearly they represent the three treatment groups based on groups of three. #
## The function 'f' uses the matrix algebra discussed in Ch. 5. It is possible #
## to get away with merely fitting an 'lm' object, and then extract the        #
## f-statistic in a single call. However, this requires a lot of additional    #
## work for each of the 1680 rows. It took somewhere between 30-60 seconds to  #
## produce the same result.                                                    #
################################################################################
remainder <- function(x, set) set[!set %in% x]
f <- function(Y, X) {
  Y <- matrix(Y)                                ## Turn row-vector into column
  p <- ncol(X);     n <- nrow(X)
  J <- matrix(1, n, n)                          ## (5.18)
  H <- X %*% solve(t(X) %*% X) %*% t(X)         ## (5.73a)
  SSE <- t(Y) %*% (diag(n) - H) %*% Y           ## (5.89b)
  SSR <-  t(Y) %*% (H - (1/n)*J) %*% Y          ## (5.89c)
  fstar <- (SSR / (p - 1)) / (SSE / (n - p))    ## (6.39b)
}

base <- c(1.1, 0.5, -2.1, 4.2, 3.7, 0.8, 3.2, 2.8, 6.3)
t2   <- t12 <- t123 <- list()
y    <- NULL
X    <- cbind(
  X1 = c(1, 1, 1, 0, 0, 0, 0, 0, 0), 
  X2 = c(0, 0, 0, 1, 1, 1, 0, 0, 0),
  X3 = c(0, 0, 0, 0, 0, 0, 1, 1, 1)
);

t1   <- t(combn(base, 3)) 
seq6 <- t(combn(base, 3, remainder, set = base))

for (i in 1:84)  t2[[i]] <- t(combn(seq6[i, ], 3))
for (i in 1:84) t12[[i]] <- cbind(t1[i, 1], t1[i, 2], t1[i, 3], t2[[i]])
for (i in 1:84) 
  t123[[i]] <- cbind(t12[[i]], t(apply(t12[[i]], 1, remainder, set = base)))
for (i in 1:84) y <- rbind(y, t123[[i]])

fstar <- apply(y, 1, function(Y) f(Y, X))

cbind(y, data.frame(f = fstar)) 
hist(fstar, freq = FALSE, ylim = c(0, 1), col = "gray90", main = "")
curve(df(x, 2, 6), add = TRUE, lwd = 2)

rm(base, fstar, i, remainder, seq6, t1, t2, t12, t123, f, X, y)
I originally coded this using the 'lm' summary function and appropriate indexing to grab the F-values. It took nearly a minute. I decided later to recode this algorithm by doing the calculations manually and the whole thing reduced to mere seconds. If anyone has any suggestions on better organization or more efficient coding, let me know. I like my code to be concise, especially in examples like these. I don't want to burden the reader with having to decipher a whole bunch of code to appreciate the example.

And if anyone is wondering how we figure out there are 1680 possible permutations, consider the fact that we first have to choose the first treatment (3-sequence). There are 84 (= 9 choose 3) ways to do this. For each of those 84, there remains a 6-sequence for which we need to choose another 3-sequence for the second treatment. Upon that selection the last (remainder) treatment (3-sequence) is automatically determined. Thus, the number of permutations is

\(\binom{9}{3} \times \binom{6}{3} \times \binom{3}{3} = 84 \times 20 \times 1 = 1680\)
 
Last edited:

trinker

ggplot2orBust
#30
Random card, die and coin outcome generator
I feel it is my duty to revive this thread every so often (mostly for selfish reasons so I can see everyone else's good ideas) :D

Game Probabilities Function
I tutor middle school students and am often in the need of a card, die, or coin generator (the web is rank with these but I didn't always get internet so [R] to the rescue). So…

Description
This sfunction is outcome generators for: die, coin & cards. It's very basic but I've used it many times tutoring (one of the first functions I wrote).

Function note: the game argument does not need to be in quotes
Code:
game.prob(game=c(coin,die,cards),n,replace=T)
Function Code
Code:
[FONT="Courier New"][SIZE="1"]game.prob<-function(game,n=10,replace=T){
g<-substitute(game)
g<-as.character(g)
switch(g,
coin=paste(sample(c("heads","tails"), n, replace),sep="    "),
die=paste(sample(c("one","two","three","four","five","six"), n, replace),sep="    "),
cards=paste(sample(c("Ace of Hearts","Ace of Diamonds","Ace of Spades","Ace of Clubs",
"2 of Hearts","2 of Diamonds","2 of Spades","2 of Clubs",
"3 of Hearts","3 of Diamonds","3 of Spades","3 of Clubs","4 of Hearts","4 of Diamonds","4 of Spades","4 of Clubs",
"5 of Hearts","5 of Diamonds","5 of Spades","5 of Clubs","6 of Hearts","6 of Diamonds","6 of Spades","6 of Clubs",
"7 of Hearts","7 of Diamonds","7 of Spades","7 of Clubs","8 of Hearts","8 of Diamonds","8 of Spades","8 of Clubs",
"9 of Hearts","9 of Diamonds","9 of Spades","9 of Clubs","10 of Hearts","10 of Diamonds","10 of Spades","10 of Clubs",
"Jack of Hearts","Jack of Diamonds","Jack of Spades","Jack of Clubs","Queen of Hearts","Queen of Diamonds","Queen of Spades","Queen of Clubs",
"King of Hearts","King of Diamonds","King of Spades","King of Clubs"), n, replace),sep="    "))
}[/SIZE][/FONT]
Examples
Code:
game.prob(die,6)
game.prob(die,40)
game.prob(cards,40)
game.prob(cards,40,replace=F)
game.prob(coin,100)
paste(game.prob(die,40), game.prob(die,40),sep=" & ")             # 2 die
paste(game.prob(coin,10), game.prob(die,10),sep=" & ")           # a coin and a die

#TABLING THE OCCURANCES
table(game.prob(die,4000))
table(game.prob(coin,100))
table(game.prob(cards,40))
table(paste(game.prob(coin,1000), game.prob(die,1000),sep=" & "))# a coin and a die
 
Last edited:

trinker

ggplot2orBust
#32
Point taken Dason :)

I eliminated that dependency and eliminated the need for quotations around the game argument. Still somewhat inefficient (could have pasted the number and suits of the cards together but I'm lazy right now).
 

trinker

ggplot2orBust
#33
Get Package and Load it with One Function
When I find a library I want I usually get it and load it right away. I use the install.packages() & library() functions. I've decided I'm too lazy for that. So I made one function to do both actions. Thanks to Dason for helping me work this one out.

You do not need to put quotes around the library name when you use the code.

Function
Code:
get.lib<-function(package){
    pack1<-substitute(package)
       pack<-as.character(pack1)
install.packages(pack)
library(pack, character.only = TRUE)
}
Example
Code:
get.lib(car)
 

trinker

ggplot2orBust
#34
MODE OF Vectors and Data frames

Description
[R] has a mean and a median function and mode means something other than you'd expect. At times I wish there was a mode function (may be one in a package). Anyway I decided to make a function for mode. It takes a vector or data frame argument and returns 1) the mode, or 2) a warning if there is no mode (frequency of everything is 1). I made it work on data frames but I feel like it's not at its fullest yet. I have addressed this concern with the change from apply to lapply noted at the bottom Please critique so we can improve this.

Function
MODE(x)
Arguments
x is a vector or dataframe

Function
Code:
MODE <- function(dataframe){
    DF <- as.data.frame(dataframe)

    MODE2 <- function(x){      
        if (is.numeric(x) == FALSE){
            df <- as.data.frame(table(x))  
            df <- df[order(df$Freq), ]         
            m <- max(df$Freq)        
            MODE1 <- as.vector(as.character(subset(df, Freq == m)[, 1]))

            if (sum(df$Freq)/length(df$Freq)==1){
                warning("No Mode: Frequency of all values is 1", call. = FALSE)
            }else{
                return(MODE1)
            }

        }else{ 
            df <- as.data.frame(table(x))  
            df <- df[order(df$Freq), ]         
            m <- max(df$Freq)        
            MODE1 <- as.vector(as.numeric(as.character(subset(df, Freq == m)[, 1])))

            if (sum(df$Freq)/length(df$Freq)==1){
                warning("No Mode: Frequency of all values is 1", call. = FALSE)
            }else{
                return(MODE1)
            }
        }
    }

    return(as.vector(lapply(DF, MODE2)))
}
Example
Code:
x1<-c(1:10)
x2<-c(2,3,4,4,5,5,6,6,8,10)
x3<-c(2,3,4,4,4,200)
x4<-factor(c("yes","no","yes"))

MODE(x1)
MODE(x2)
MODE(x3)
MODE(x4)
MODE(mtcars)
MODE(CO2)
I made it work on data frames but I feel like it's not at its fullest yet.
I figured out what I was not doing correctly. I was using apply rather than lapply. This caused the data frame to be converted to a matrix. Meaning in a dataframe like CO2 (in base package) the numerics were converted to character. The function returned non-numeric warnings for all columns. The switch to lapply (what should be used for dataframes) corrected this error.
 
Last edited:

Dason

Ambassador to the humans
#35
I've wondered this for a while - if you post code without indentation does that mean that you didn't indent in your original source code or just that the indentation got stripped when you posted it? Because I can't read unindented code. Well I can but I don't enjoy it.
 

trinker

ggplot2orBust
#36
Dason,

Thanks for your question.
I've wondered this for a while - if you post code without indentation does that mean that you didn't indent in your original source code or just that the indentation got stripped when you posted it?
I don't know the answer to the first question (does the indenting get stripped away; I doubt it) but the answer to the second is yes I formatted it unindented. I'm not a programmer and [R] is my first language. What I've looked up this far about grammar/structure has not been helpful to a nonprogrammer. I'd appreciate if you have a resource to guide a nonprogrammer as to how to style/format code and functions? I don't even have the proper vocabulary around programming to know what terms to search for to get a style guide. Once I have that figured out I'll go back through and reformat the code I put in to be more readable. I want to become a stronger programmer and knowledge around this would help immensely.

EDIT: Added links below
Google Style Guide

Hadley Wickam’s guide
 
Last edited:

Dason

Ambassador to the humans
#37
Looks like I got beat to it. I like those style guides. I was just going to provide this link and mention that I prefer something similar to K&R - except I never have an opening brace on a separate line.
 

trinker

ggplot2orBust
#38
I've read through the style guides and have attempted to follow them (this is my first attempt at writing in a more readable style. I really want to open this up for critique. I won't get better and people can't learn from this if we're worried about hurting someone's feelings. I want this thread to be open to critique in the name of improvement. So please give me the feedback I need to improve.

I have attached the code here but edited in the original post as well (so future searchers will have quality code accessible right away). I actually debated whether or not to leave the original as a learning thread but decided in favor of editing my original post. I will not edit this post though, so others may learn from your feedback.

Code:
MODE <- function(df){

df <- as.data.frame(df)

{MODE2 <- function(x){      
  if (is.numeric(x)==FALSE)
    {warning("MODE not meaningful for non-numeric vectors", call. = FALSE)
      }
  else { 
    df <- as.data.frame(table(x))  
       df <- df[order(df$Freq), ]         
          m <- max(df$Freq)        
(MODE1 <- as.vector(as.numeric(as.character(subset(df,Freq==m)[, 1]))))

  if (sum(df$Freq)/length(df$Freq)==1){
    warning("No Mode: Frequency of all values is 1", call. = FALSE)
      } 
  else {
    MODE1
      }
    }
  }
}

as.vector(lapply(df, MODE2))

}
 

Dason

Ambassador to the humans
#39
Code:
MODE <- function(df){
    df <- as.data.frame(df)

    MODE2 <- function(x){      
        if (is.numeric(x) == FALSE){
            warning("MODE not meaningful for non-numeric vectors", call. = FALSE)
        }else{ 
            df <- as.data.frame(table(x))  
            df <- df[order(df$Freq), ]         
            m <- max(df$Freq)        
            MODE1 <- as.vector(as.numeric(as.character(subset(df, Freq == m)[, 1])))

            if (sum(df$Freq)/length(df$Freq)==1){
                warning("No Mode: Frequency of all values is 1", call. = FALSE)
            }else{
                return(MODE1)
            }
        }
    }

    return(as.vector(lapply(df, MODE2)))
}
Your indentation was really weird to me. I'm not sure if that was a result of mixing tabs/spaces (it's best to stick to one or the other - or use tabs to indent to the current indentation level and then if you want to align code use spaces). I fixed that up. You also had some parenthesis and curly braces that I couldn't figure out what the point of them was. I removed those and it didn't seem to break anything.

I added a few spaces between operators. I'm not quite as big a fan of throwing in extra spaces as some of my professors are but I do think they are useful (I changed something like j==3 to j == 3)

On a personal note I found it somewhat confusing to create a variable called df inside of MODE2 since there is already a df defined in the parent scope.

On a side note - is there a reason you only accept numeric input? I typically only think of using the mode for non-numeric input so I found it interesting that you only find it for numeric input.
 

trinker

ggplot2orBust
#40
Dason,

Thanks for the time and feedback. That helps immensely. For me to have someone give critique on something you've made is much more meaningful than looking up critiques and tearing apart code. Hopefully people will see the progression of the code as well and will find the style useful in their own learning.

Your indentation was really weird to me. I'm not sure if that was a result of mixing tabs/spaces (it's best to stick to one or the other - or use tabs to indent to the current indentation level and then if you want to align code use spaces).
yes this was the problem. I was also using str to look at other functions and saw mixed methods. Your example provides very clear guidance.

You also had some parenthesis and curly braces that I couldn't figure out what the point of them was. I removed those and it didn't seem to break anything.
I thought I was supposed to do the curly braces to section off (not the right term) a function within a function. I'm not sure where I got this idea from.

(I changed something like j==3 to j == 3)
I wasn't sure what to do here. The style guides seemed to not be definitive with = signs but I think that was a single =.

On a personal note I found it somewhat confusing to create a variable called df inside of MODE2 since there is already a df defined in the parent scope.
Didn't know this was bad. I'll make changes to reflect that. Thank you for that awareness on this point.

On a side note - is there a reason you only accept numeric input? I typically only think of using the mode for non-numeric input so I found it interesting that you only find it for numeric input.
I gave this some thought and I think it goes with thinking of central tendency measures as numeric. Mean and median are applied to numeric data. But on investigating this I found it is fine and actual useful for non-numeric data as well. I made the change to the code to reflect that as well.

Updated Code
Code:
MODE <- function(dataframe){
    DF <- as.data.frame(dataframe)

    MODE2 <- function(x){      
        if (is.numeric(x) == FALSE){
            df <- as.data.frame(table(x))  
            df <- df[order(df$Freq), ]         
            m <- max(df$Freq)        
            MODE1 <- as.vector(as.character(subset(df, Freq == m)[, 1]))

            if (sum(df$Freq)/length(df$Freq)==1){
                warning("No Mode: Frequency of all values is 1", call. = FALSE)
            }else{
                return(MODE1)
            }

        }else{ 
            df <- as.data.frame(table(x))  
            df <- df[order(df$Freq), ]         
            m <- max(df$Freq)        
            MODE1 <- as.vector(as.numeric(as.character(subset(df, Freq == m)[, 1])))

            if (sum(df$Freq)/length(df$Freq)==1){
                warning("No Mode: Frequency of all values is 1", call. = FALSE)
            }else{
                return(MODE1)
            }
        }
    }

    return(as.vector(lapply(DF, MODE2)))
}