Today I Learned: ____

trinker

ggplot2orBust
#61
Indentation in function bodies is starting to make more sense to me, but with plain lines of code like this I don't know what to indent or when. I realize that if a statement takes two lines you indent the second line, but beyond that?
 

bryangoodrich

Probably A Mammal
#62
Just follow Google's R Style Guide. Though I disagree with some of its recommendations, e.g., variable names as "variable.name". It violates the class.variable or class.method() convention of object-oriented languages. Just because R allows it doesn't mean it is good practice, especially since R itself makes use of that notation! E.g., print.default versus print.lm. In Python you'll usually see variable_name, which isn't too bad, but I just stick with variableName, not to be confused with functionName(). I think some R functions have even been changed to match this. I discuss this in detail below.
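To see what I mean about R itself using the dot for S3 method dispatch, here is a minimal sketch (the class name is made up for illustration):
Code:
print.myclass <- function(x, ...) cat("custom print for myclass\n")
obj <- structure(list(), class = "myclass")  # hypothetical class
print(obj)  # dispatches to print.myclass, just like print.lm for lm objects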

Frankly, I use 2-space indents, but 4-space isn't too bad. It just depends on how much clarity you like to give. For instance, some people like to include a lot of double-spaced lines; I don't. I usually keep things 'tight', using quad-spaced lines when I'm separating major blocks and double-spaced lines when separating processing blocks. You always indent the block under a process:

Code:
f <- function(x) {
  # some code
  # goes here
}

for (i in seq(10)) {
  # indented
  # stuff here
}
You can pretty much look at any Python code for examples, since Python doesn't use curly brackets to segment these logical blocks; it uses indentation to identify them (any amount of indent, as long as it's uniform).

The real part that people disagree on is multi-line statements, and it really is a matter of taste. For instance,

Code:
df <- read.delim(infile, header = FALSE, colClasses = c("factor", rep("numeric", 3), "character", "factor"),
  col.names = c("blah", "boo", "foo", "bar", "joe", "bob"))

df <- read.delim(infile, header = FALSE, colClasses = c("factor", rep("numeric", 3), "character", "factor"),
  col.names = c("blah", "boo", "foo", "bar", "joe", "bob")
)

df <- read.delim(infile, header = FALSE, colClasses = c("factor", rep("numeric", 3), "character", "factor"),
                 col.names = c("blah", "boo", "foo", "bar", "joe", "bob"))
Take your pick. I usually do one of the first two. In other cases, it makes sense to group things together:

Code:
f <- function(x, y) (x - y) /
                    (x + y)
The problem with long multi-line statements, and why I just use normal indentation with a closing bracket, is that you can run out of space.

Code:
someVeryLongNameThatNeedsToGo <- read.delim("ThisFileIsDisgustinglyLongDude.txt", header = FALSE, na.strings = " - ",
                                            colClasses = c("factor", "character", "numeric", "factor", "numeric", "NULL", "numeric"),
                                            col.names = c("bob", "joe", "chris", "frank", "bar", "foo", "job")
);
While it keeps things logically grouped by the parameter list, some of those parameter statements are long, too! You're pushing everything up against the right-hand side for no reason, and I don't find it adds any more clarity than

Code:
someVeryLongNameThatNeedsToGo <- read.delim("ThisFileIsDisgustinglyLongDude.txt", header = FALSE, na.strings = " - ",
    colClasses = c("factor", "character", "numeric", "factor", "numeric", "NULL", "numeric"),
    col.names  = c("bob", "joe", "chris", "frank", "bar", "foo", "job")
);
These limitations also make more sense when you consider that your programming window should be at most 80 characters wide (some go 60!). We're going way past that in these examples! In fact, 80 characters wouldn't even fit the first parameter, requiring you, to keep to that limit, to drop your starting line down:

Code:
someVeryLongNameThatNeedsToGo <-
read.delim("ThisFileIsDisgustinglyLongDude.txt", header = FALSE, na.strings = " - ",
    colClasses = c("factor", "character", "numeric", "factor", "numeric", "NULL", "numeric"),
    col.names  = c("bob", "joe", "chris", "frank", "bar", "foo", "job")
);
 

Jake

Cookie Scientist
#63
One thing that I like to do for the sake of easy interpretation, but that I haven't seen much in others' code, is, when working with long, multiply nested function calls, to group the arguments for each function that takes multiple arguments onto the same line. That's probably not a very intuitive description, but here's a quick example of what I mean (copied/pasted from some random code I had in my RStudio editor):
Code:
picFrame <- cbind.data.frame(subject = rep(subjectFrame$subject,
                                           each=36),
                             pic = unlist(lapply(strsplit(picVec,
                                                          split='[()]'),
                                                 "[",
                                                 2)))
Personally, this helps me to quickly break down the hierarchy of function calls happening here and figure out which argument goes with which function.

When I'm working with a long function call that isn't nested in this way, but is simply full of long variable names and such, I go with the more conventional two-space indentation.
 

bryangoodrich

Probably A Mammal
#64
Ugh, I'd probably try and find a way to put some of those calls into variables beforehand, or do something like I showed above:

Code:
picFrame <- cbind.data.frame(
  subject = rep(subjectFrame$subject, each=36),
  pic     = unlist(lapply(strsplit(picVec, split='[()]'), "[",  2))
);
The longest line is like 68 characters. We know that each line is a parameter, and they're logically grouped (the equal signs line up). One could still easily break up that long unlist statement, though, and it wouldn't be too bad--e.g., move the contents of lapply to its own line, with the closing brackets on their own line lined up with unlist to signify the end of that block. I use brackets like I use curly ones, to designate other types of named blocks.
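For example, a sketch of that break-up, using the same hypothetical objects as above:
Code:
picFrame <- cbind.data.frame(
  subject = rep(subjectFrame$subject, each=36),
  pic     = unlist(
              lapply(strsplit(picVec, split='[()]'), "[", 2)
            )
);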
 

Jake

Cookie Scientist
#65
A matter of taste, I guess; I personally don't care too much for the piecemeal approach of doing everything in small steps on separate lines, and I try to avoid it when feasible. Granted, doing that can save you the confusion of later having to unravel a potentially long hierarchy of function calls (my indentation scheme is an attempt to alleviate this), but the trade-off for me is that, as I read through each small step, I get a slight feeling of "where are you going with this?" It's harder for me to maintain the more abstract notion of what the code is trying to do at any given time and how it relates to the overall goal.
 

Dason

Ambassador to the humans
#66
If you don't know where it's going, then the programmer probably didn't do a good enough job writing/commenting their code to tell you what they're trying to do. This happens a lot, though, when people don't actually expect anybody else to look at their code.

I understand how you feel, and it can be a little tedious to break everything up if you immediately think, "I'll just use a do.call on this, and for the lapply I'll do the other thing using this function I'll just create, and BAM, I'm done." But that's also a lot harder to debug later. If you use good variable names and write comments as they're needed, then debugging becomes a lot easier.

I'm guilty of writing code like you - one big line - and sometimes it's hard to read and not clear to an outsider what exactly is being done (reference: a lot of my answers to ledzep's questions). But I also know that if I come back to that kind of code even a week later, sometimes I have to rerun it line by line to remember what I was doing.
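To make the contrast concrete, here's a made-up sketch (files is a hypothetical vector of file paths):
Code:
# One-liner: compact, but opaque until you unravel it
res <- do.call(rbind, lapply(files, read.csv))

# Broken up: each step gets a name that says what it is
tables <- lapply(files, read.csv)  # read each file into a data frame
res    <- do.call(rbind, tables)   # stack them into one data frame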
 

bryangoodrich

Probably A Mammal
#67
That's one reason why I try to leave a bit of space to the right of my code. I try to fit code within 60 characters, leaving another 18 or so for comments. I should start doing more multi-line comments. They don't hurt the code, and they give you more room to explain what's going on. I'd write an example, but I'm too tired now.

EDIT: Here's an example:
Code:
picFrame <- cbind.data.frame(                                        # This is a comment about this function
                                                                     # It does neat stuff. Like,  you know.
                                                                     # Like, stuff. 
  subject = rep(subjectFrame$subject, each=36),
  pic     = unlist(lapply(strsplit(picVec, split='[()]'), "[",  2))
);
 

bryangoodrich

Probably A Mammal
#69
Usually you have your working source code and your distributed source code. The distributed source is what is used to build your program, and it is usually stripped of commentary. For instance, I believe if you use Roxygen, it will help you convert your properly formatted comments into the help documentation and strip your source of comments. That way, you get your packaged source code and your properly made help documentation. This is the way it works with Javadoc in Java, I believe (which is sort of what Roxygen mimics?).
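For reference, a minimal sketch of roxygen-formatted comments on a toy function (the tags become the .Rd help file):
Code:
#' Add two numbers.
#'
#' @param x A number.
#' @param y A number.
#' @return The sum of x and y.
add <- function(x, y) x + y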
 

Dason

Ambassador to the humans
#70
But if you're submitting to CRAN, you don't need to worry about that. You keep your source files as they are, and when you do 'R CMD build whatever' it takes care of those details for you.
 

bugman

Super Moderator
#71
I found a great R book for biometrics through JSTOR. Maybe it's common knowledge in the forum, but I thought I'd share it:

Biostatistical Design and Analysis Using R, by Murray Logan.

For me, this is a perfect walk-through. It's 500+ pages, and at page 52 I have already picked up useful functions that I never knew existed, such as:

the fix() function, which allows simple editing of data frames;
library(foreign), which allows the import of files created by SPSS, SYSTAT, etc.;
random coordinate generation for random, stratified, and systematic sampling designs;
and grep() for pattern searching (see the quick sketch below).

Like I said, it's probably basic, but I'm loving it.
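A quick sketch of a few of those in action (the file name here is made up):
Code:
library(foreign)
dat <- read.spss("survey.sav", to.data.frame = TRUE)  # import an SPSS file
fix(dat)                    # open a simple spreadsheet-style editor on it
grep("age", names(dat))     # which column names match a pattern?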
 

bryangoodrich

Probably A Mammal
#72
TIL about GML. You're all probably aware of KML, which is apparently a variation on GML. KML is like XML for storing geographic information, catering to Google Earth (or Maps). GML is the more general cousin, encompassing the storage of geographic data in an XML-like format that includes more than KML is designed for. As the Wikipedia page details,

KML instances may be transformed losslessly to GML, however roughly 90% of GML's structures (such as, to name a few, metadata, coordinate reference systems, horizontal and vertical datums, etc.) cannot be transformed to KML.
Why am I posting this in the R forum? Because R users may run across spatial data. As far as I was aware, being still new to the GIS field, KML was the only transport format for geographic data. In my mind, if I wanted to process some stuff in R and export it into an easily viewable format, I would have looked at KML. For most purposes, that may still be the case. Nevertheless, with GML you can take data from your GIS into R and from R back to a GIS directly--assuming R has facilities for parsing GML and writing GML. If it does not, then just as there are ways to write XML and KML from R (see Omegahat), we might want to investigate how to write GML, too. Note, there are also ways--if you've paid attention to my JSON thread--to store spatial information in JSON. In fact, it is probably a better standard to use, because a complex JSON document looks a lot less hectic to me than the same thing in XML. JSON is the future, my friends! Nonetheless, being aware of these things is important, and I thought I'd share what I learned today.
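I haven't tried it myself, but since the rgdal package wraps GDAL/OGR, whose drivers include GML, a sketch like this ought to work (the file and layer names are made up):
Code:
library(rgdal)                   # R bindings to GDAL/OGR
ogrListLayers("parcels.gml")     # list the layers the file exposes
shp <- readOGR("parcels.gml", layer = "parcels")                 # GML into R
writeOGR(shp, "parcels.kml", layer = "parcels", driver = "KML")  # R to KML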
 

bryangoodrich

Probably A Mammal
#73
While not entirely new or amazing, today I learned (or realized) how to index by name. No, not in the sense of "wow, did you know you can do df['somevariable']?!" I knew that. What occurred to me was that I had no good way of accessing an element by its position without making my code very convoluted.

To illustrate, usually I'll do a looping structure on the numeric index of an object:

Code:
for (i in seq_along(aVector)) { ... aVector[i] ... }
Then whenever I require an element of this vector, I just index the object as shown above. However, there are times when you don't want to have to index the object. You'd rather iterate through the vector itself:

Code:
for (point in aVector) { ... point ... }
However, it turns out I might need to know the position of this point to grab an associated element from another vector (e.g., a color for the element being plotted). This isn't hard to do, but it clutters the code. Instead, I did two things. First, take the associated vector and give it names. I actually didn't realize this for quite a long time, but even ordinary vectors can be given names and accessed as such:

Code:
x <- c(10, 13, 22)
names(x) <- LETTERS[1:3]
x['B']
#  B
# 13
In this way, I can accomplish my task:

Code:
associated_vector <- somePoints
names(associated_vector) <- levels(aVector)
for (pt in aVector) { ... associated_vector[pt] ... }
This simple adjustment changed my perspective on how I approach my loops and how I index and associate the vectors I'm using. One would be keen to keep these facets of R in mind!
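Here's a runnable toy version of the idea (all the names and values are made up):
Code:
aVector <- c("A", "B", "B", "C")                  # points identified by name
colors  <- c(A = "red", B = "blue", C = "green")  # associated vector, named
for (pt in aVector) print(colors[pt])  # look up the associate by name, not position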
 

Dason

Ambassador to the humans
#74
TI(remembered): The 'rle' function. I don't use it too often so I somehow always manage to forget about it when it's useful. I really need to remember that it exists.
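For anyone who hasn't met it, rle compresses a vector into run lengths and values:
Code:
rle(c(1, 1, 0, 0, 0, 1))
# Run Length Encoding
#   lengths: int [1:3] 2 3 1
#   values : num [1:3] 1 0 1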
 

Dason

Ambassador to the humans
#76
Well, I threw together a quick simulation for the thread where the user was asking about 7 consecutive doubles in Parcheesi. I used rbinom to generate a sequence of 0s and 1s to represent non-doubles and doubles (with probability 1/6 for doubles). If you generate a sequence of length N a couple hundred thousand times and look at the max run length of 1s, you get an idea of whether observing 7 doubles in a row is actually unusual if you happened to roll N times over the night. I essentially rolled my own function to get the run lengths when it would have been much easier to just use rle.

Ultimately, I ended up doing exactly what rle does, but with less error checking, and it was more specialized to return only what I cared about.
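For the record, a minimal sketch of the rle version (the sizes here are made up):
Code:
n_rolls <- 100   # rolls over the night (N)
n_sims  <- 2e5   # simulated nights
maxRun <- replicate(n_sims, {
    rolls <- rbinom(n_rolls, 1, 1/6)           # 1 = doubles
    runs  <- rle(rolls)
    max(c(0, runs$lengths[runs$values == 1]))  # longest run of doubles
})
mean(maxRun >= 7)  # estimated chance of 7+ doubles in a row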
 

Lazar

Phineas Packard
#77
TIL: The bibtex package is potentially useful for elements of meta-analysis. I used it to pull all the keywords from articles I am looking at for a potential meta-analysis and to draw a wordcloud. This gave me a quick visual of some of the major subject areas within the relevant domain.
 

bryangoodrich

Probably A Mammal
#78
What can the bibtex package do in this regard, Lazar? I assume the bibtex you have is the metadata about an article, for instance, and not its content. So you're just looking at the keywords attached to the article? Maybe a wordcloud of its abstract might reveal more specific common themes? The problem with keywords is that sometimes they are too liberal (the article is scarcely related), sometimes not liberal enough, or just too vague (more abstract words, not specific enough). Nevertheless, I think it could be interesting, and I might want to play around with this on my Zotero library!
 

Lazar

Phineas Packard
#79
Hi Bryan,

Yes, I agree that it is not ideal. But it is interesting. Here is the code I am using:
Code:
require(bibtex)
require(tm)
require(wordcloud)
require(RColorBrewer)
require(Cairo)
test <- read.bib("biblibrary")

dataPrep <- function(text) {
    x <- c(do.call(cbind, text))
    x <- gsub('[&,;:()]', '', x)  # I suppose I could use removePunctuation here
    x <- tolower(x)
    x <- Corpus(DataframeSource(data.frame(x)))
    x <- TermDocumentMatrix(x)
    m <- as.matrix(x)
    v <- sort(rowSums(m), decreasing = TRUE)
    d <- data.frame(word = names(v), freq = v)
    return(list(wordMatrix = v, plotData = d))
}

parts <- dataPrep(text = test$keywords)
pal2  <- brewer.pal(8, "Dark2")

png("fileName.png", width=1280,height=800)
wordcloud(parts$plotData$word,parts$plotData$freq, scale=c(8,.2),min.freq=1,
max.words=Inf, random.order=FALSE, rot.per=.15, colors=pal2)
dev.off()
code based on http://onertipaday.blogspot.com/2011/07/word-cloud-in-r.html

Here is a trial I ran on my boss's last 50 publications. Seems to work ok:
http://imageshack.us/photo/my-images/593/herbtest2.png
 

bryangoodrich

Probably A Mammal
#80
TIL: You can give lists dim and dimnames attributes to give them different forms. I haven't explored all the creative things I can do with this, but it is new to me! The example is from my JSON/XML examples.

Code:
library(XML)
xmltree <- xmlTreeParse("http://www.bryangoodrich.com/api/get.php?format=xml", isURL = TRUE)
xmlToList(xmltree)
The results?
Code:
            person         person             person         
name        "Trinker"      "Dason"            "bryangoodrich"
awesomeness "lacking"      "Does Not Compute" "mind blowing!"
profession  "velociraptor" "Robot"            "Data Master"  
status      "taken"        "D-Bot"            "Single" 

> str(xmlToList(xmltree))
List of 12
 $ : chr "Trinker"
 $ : chr "lacking"
 $ : chr "velociraptor"
 $ : chr "taken"
 $ : chr "Dason"
 $ : chr "Does Not Compute"
 $ : chr "Robot"
 $ : chr "D-Bot"
 $ : chr "bryangoodrich"
 $ : chr "mind blowing!"
 $ : chr "Data Master"
 $ : chr "Single"
 - attr(*, "dim")= int [1:2] 4 3
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:4] "name" "awesomeness" "profession" "status"
  ..$ : chr [1:3] "person" "person" "person"
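And a self-contained sketch of the same idea with made-up values, in case that URL stops resolving:
Code:
x <- as.list(1:6)                                   # an ordinary list
dim(x) <- c(2, 3)                                   # now it prints like a matrix
dimnames(x) <- list(c("a", "b"), c("p", "q", "r"))
x["a", "q"]                                         # index the list like a matrix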