Reading csv file. How to escape character string delimiter inside a field

#1
Hi,

I'm trying to read a csv file with 7 fields separated with tab char. String fields are delimited by double quote char '"'. The problem is that those fields can hold double quote chars into. I try escape double quotes with '\' character, but hasn't worked.

Here an example:

field1<tab>field2<tab>"Example of string with \"double quotes\" inside"<tab>field4<tab>field5<tab>field6<tab>field7<EOL>

Here the command I'm using:

mydata <- read.table("file.csv", quote="\"", dec=".", header=FALSE, allowEscapes=TRUE, sep='\t')

(allowEscapes=FALSE don't work)

Any ideas?

Thx in advance!
 

bryangoodrich

Probably A Mammal
#2
I had some issues with escaping characters before. You might have to get crafty and start doing things more manually. Check out readLines and grep. You can read everything in as a line of text, split it yourself (strsplit) and then deal with each field clean up yourself.

Code:
x <- readLines(somedata)
x <- strsplit(x, "\t")  # Returns a list for each row but they *should* all be equal length vectors
x <- do.call('rbind', x)  # put into a tabular form. You might also try data.frame instead of rbind
From here you'd probably want to have a function for each column that cleans it up how you want it cleaned, even if it's just "as.numeric(...)". Then you can do something like lapply the table or lapply a vector of functions that correspond to each column or lapply a numeric vector of indices for each column. In any case, the point is for each i have something like

Code:
x[, i] <- column_cleaners[i](x[, i])
The end result will be to have tidy data. If you're brave (or want good advanced coding skills in R), I'd check out some of the tricks Hadley talks about on his ebook: http://adv-r.had.co.nz/ (check out subsetting)
 

bryangoodrich

Probably A Mammal
#3
Here's a test case

Code:
x <- c(c("Some\t3\ttext", "Another\t44\tline", "this\t14\tsucks"))
x <- do.call(rbind.data.frame, strsplit(x, '\t')) # data.frame alone transposes the results
names(x) <- c("X1", "X2", "X3")
colfunctions <- c(as.factor, as.numeric, I)  # Create list of functions for each column
fix_my_column <- function(col, data, FUNS) FUNS[[col]](data[, col]) 
x[] <- lapply(seq(colfunctions), fix_my_column, data = x, FUNS = colfunctions)

str(x)
# 'data.frame':	3 obs. of  3 variables:
#  $ X1: Factor w/ 3 levels "Another","Some",..: 2 1 3
#  $ X2: num  3 44 14
#  $ X3:Class 'AsIs'  chr [1:3] "text" "line" "sucks"

x
#        X1 X2    X3
# 1    Some  3  text
# 2 Another 44  line
# 3    this 14 sucks
As the Advanced-R book can demonstrate, there are much better ways to approach what I did in that lapply statement to make it a *lot* cleaner to read, but it shows a rather brute force way to apply functions to each column. Though, I was unaware that the I returns an AsIs class object. I figured it was just the identify function lol Maybe return will work? (i.e., return whatever is put into it)

In any case, you might try setting the allowEscape parameter to TRUE and altering the quote parameter in read.table. I replicated data to what you had as a big string and tested it with read.csv(text = "...") with the default parameters and your parameters. It worked just fine. Don't know why a file connection would be different, though.