Share your functions & code

Re: R Tips and Tricks and Graphics

Here's an example of how to label outliers in boxplots:
Code:
set.seed (123)
x <- matrix (runif (50), 10)
colnames (x) <- LETTERS [1:5]
b <- boxplot (x, names=LETTERS[1:5], col=cm.colors (5))
r <- round (b$out, 2)
text (b$group+.3, r, r)
 

Dason

Ambassador to the humans
Re: R Tips and Tricks and Graphics

Win I don't want to rain on your parade but almost all of this would be just as appropriate in one of the threads that trinker linked to. I might just merge this thread into the "Share your functions and code" thread.
 
Re: R Tips and Tricks and Graphics

I didn't mean to be "parading". Just sharing tips and tricks which helped me as a learner. How do I merge this thread into the "Share your functions and code"? Please advise.
 

trinker

ggplot2orBust
Re: R Tips and Tricks and Graphics

@Win a mod has to do that. Luckily Dason is a mod. But by keeping the code in one location it makes it easier for future searchers to find related material.
 

Dason

Ambassador to the humans
Re: R Tips and Tricks and Graphics

I didn't mean to be "parading". Just sharing tips and tricks which helped me as a learner. How do I merge this thread into the "Share your functions and code"? Please advise.
You weren't parading - it's just an expression. Basically I don't want to ruin your fun or slow you down or anything like that - just wanted to point out that there are other threads that would be more appropriate since they were originally designed pretty much just for this type of thing. You can't do the merge - but if you want I'll do it for you.
 
Re: R Tips and Tricks and Graphics

Trinker and Dason, you two know your stuff. It's good I stumbled on this forum. Now I know where to go when I have R questions. I'm an educator, so I like to pass on whatever little I know to others. Let me know if I step on anybody's toes.
 

trinker

ggplot2orBust
I liked the pryr package's partial function https://github.com/hadley/pryr/blob/master/R/partial.r

But it's limited for sure: http://stackoverflow.com/q/25355310/1000343

I used the formals function to whip up a function that hijacks a function and allows you to change its default parameters. Here I demo it on data.frame that has 2 defaults I wish I could change (particularly the strings to factor one; as the second is actually pretty sensible).
Code:
hijack <- function (FUN, ...) {
    .FUN <- FUN
    args <- list(...)
    invisible(lapply(seq_along(args), function(i) {
        formals(.FUN)[[names(args)[i]]] <<- args[[i]]
    }))
    .FUN
}

.data.frame <- hijack(data.frame, check.names = FALSE, stringsAsFactors = FALSE) 

(dat <- data.frame(`bad name` = 1:3, x2 = c("a", "b", "c"))); str(dat)

[COLOR="silver"]##   bad.name x2
## 1        1  a
## 2        2  b
## 3        3  c

## 'data.frame':   3 obs. of  2 variables:
##  $ bad.name: int  1 2 3
##  $ x2      : Factor w/ 3 levels "a","b","c": 1 2 3[/COLOR]

(dat2 <- .data.frame(`bad name` = 1:3, x2 = c("a", "b", "c"))); str(dat2)

[COLOR="silver"]##   bad name x2
## 1        1  a
## 2        2  b
## 3        3  c

## 'data.frame':   3 obs. of  2 variables:
##  $ bad name: int  1 2 3
##  $ x2      : chr  "a" "b" "c"[/COLOR]
 

trinker

ggplot2orBust
Make HTML ransom notes. I was bored...

Code:
ransom <- function(text, open=TRUE) {
    root <- "http://contactsheet.org/cgi-bin/ransom.pl?thedata="

    text <- gsub("\\s+", "+", text)
    theurl <- paste0(root, text)
    
    require(XML)
    out <- capture.output(htmlTreeParse(theurl, useInternalNodes = TRUE) )
    
    temp <- tempdir()
    outfile <- file.path(temp, "out.html")
    cat(out, file=outfile)
    if (open) browseURL(outfile)
    return(invisible(outfile))
}

x <- "Give us your hard drives if you ever want to see your children again.  Signed, The Robots."
ransom(x)
 

trinker

ggplot2orBust
I have the need to send paths (usually partial paths) to colleagues to go grab documents. When I paste them into emails they aren't very readable and these folks typically are the type that like to use the gui to find the directory rather than paste the path in directly. I wrote a function to make paths more readable:

It is mostly Windows friendly but other people could use and/or modify to their own purposes:

Code:
path_ <- 
function(x=NULL, copy2clip = TRUE){
    if (is.null(x)) x <- reports::WP()
    m <- strsplit(x, "(?<=[/|\\\\])", perl=TRUE)[[1]]
    o <- paste0(lapply(2*0:c(length(m)-1), function(i) paste(rep(" ", i), collapse="")), "-> ", m)
    if (copy2clip) try(cat(paste(o, collapse="\n"), file="clipboard"))
    invisible(paste(o, collapse="\n"))
}

So if the following was on your clipboard:

Code:
Copy\CLaRI_Engineering_Literacy_Project_2014-15\META_PROJECT\IRB\Documents
After using path_(), this would replace the info on your clipboard:

Code:
-> Copy/
  -> CLaRI_Engineering_Literacy_Project_2014-15/
    -> META_PROJECT/
      -> IRB/
        -> Documents
EDIT: Just realized you'd need the reports package installed as I make use of WP()
 

trinker

ggplot2orBust
Kmeans from scratch.

I wanted to learn how kmeans works (and believe I have). I created a crude 2 k model with visual plotter to understand the process. Basically, you start with 2 random spots and calculate the distance of each point to the center spots. Which ever spot is closer the point get's classified into. You then recalculate 2 new centers based on the last assignment and repeat until there is no change in assignment. Critiques, rebukes, and improvements welcomed. It is designed more for me to learn so each iteration requires a key press.

I kept everything in base install R. Also...Not optimized or generalizable to k clusters (these improvements are welcomed). Also relies on global variables.

Code:
#Fake data
x1_p <-rnorm(100)
x2_p <-rnorm(100)

x1 <- c(x1_p, x1_p + rnorm(100, 3, .2))
x2 <- c(x2_p, x2_p + rnorm(100, 3, .1))

# plot data
plot(x1, x2)


k <- 2

# random center finder
random_center <- function(){
    start_x1 <- sample(seq(min(x1), max(x1), by=.02), k)
    start_x2 <- sample(seq(min(x2), max(x2), by=.02), k)
    split(data.frame(cbind(start_x1, start_x2)), 1:2)
}

# k-colorizing plotting function
replotter <- function(){
    plot(x1, x2)
    points(centers[[1]][1], centers[[1]][2], pch=19, cex=2, col="red")
    points(centers[[2]][1], centers[[2]][2], pch=19, cex=2, col="blue")
    cents <- do.call(rbind, centers)
    text(cents[, 1], cents[, 2]+.4, labels=c("1", "2"))
}



converged <- FALSE
centers <- random_center()
assignment2 <- NULL
replotter()

# kmeans loop
while(!isTRUE(converged)){

     replotter()
     assignment <- unlist(Map(function(a, b) {
         p<-which.min(c(
             dist(cbind(c(centers[[1]][1], a), c(centers[[1]][2], b))),
             dist(cbind(c(centers[[2]][1], a), c(centers[[2]][2], b)))
         ))
         points(a, b, col = ifelse(p==1, "red", "blue"), pch=19)
         p
     }, x1, x2))

     invisible(readline(prompt="Press [enter] to continue"))

      if (length(rle(assignment)$lengths) == 1){
          centers <- random_center()
      } else {

          centers <- lapply(split(data.frame(cbind(x1, x2)), assignment), colMeans)
          if (!is.null(assignment2)) converged <- all.equal(assignment2, assignment)
          assignment2 <- assignment
      }
}

assignment
 

bryangoodrich

Probably A Mammal
Nice!

One approach I would recommend is to think of this like Expectation-Maximization. Thus, break it into 2 functions E and M. The E function will take the random centers (initially) and do the assignments based on your distance metric (who is closest to each center). The M function will then recalculate the centers based on the new assignments. Thus, the body of k-means becomes a simple repeat and E and M until you get convergence.

Code:
centers <- random_centers(k)  # initialize or select random points or other start methods
while (test_convergence(...))
{
    assignments <- E(centers, d)  # d is the distance function to use
    visualize(assignments, centers)
    centers <- M(assignments, d)  # optimize the centers again
    visualize(assignments, centers)
}
Something like that. Then you can start moving toward the actual EM algorithm by giving soft assignments based on some probability distributions
 

bryangoodrich

Probably A Mammal
Ha, wish I had known I posted this code before because I had to reinvent it (not that hard) for putting together a primer on k-means with the Iris data set for a presentation tomorrow.

Anyway, I don't know if there's a function that will create binary dummy variables for you from a vector--and some QA would need to handle NA values--but here is my dummy function. The only things I found online were to use model.matrix, so this basically wraps that technique.

Code:
dummy <- function(x)
{
    nm <- sort(unique(x))
    x <- data.frame(f = factor(x, nm, nm))
    x <- model.matrix(~ 0 + f, x)
    colnames(x) <- nm
    x
}
I should probably add some other things from my bgmisc package.
 

Dason

Ambassador to the humans
Using model.matrix is the best way that I can think of to do that particular task. I'm sure there is a function in some package somewhere that has this already but it's not bad to do manually either as you see.
 

Jake

Cookie Scientist
I think the model.matrix approach is just fine, but for kicks (and procrastination) I came up with another way of doing it, which is shorter, but doesn't label the columns (although this could be easily added):
Code:
diag(length(unique(x)))[rep.int(unique(x),table(x)),]
Edit: Actually this isn't right, it works in the case where x is numeric, but not in general, here's a revision (which might be better anyway):
Code:
diag(length(unique(x)))[match(x,unique(x)),]