read.csv adds weird character to my first column header

ondansetron

TS Contributor
#1
New one-off question: I am trying to learn importing data with R.

I am following Mike Marin's YouTube lectures. I saved his Excel sheet as .csv and tried the read.csv command, and everything seems to go well. I am totally new to R, essentially. When I try to view the data, the console shows a weird character and ".." in front of the first column name before showing the data normally. Any idea what this means? It's the part right before "LungCap". Thanks!

Code:
> data1<-read.csv(file.choose(), header=T)
> data1
   ï..LungCap Age Height Smoke Gender Caesarean
1       6.475   6   62.1    no   male        no
2      10.125  18   74.7   yes female        no
3       9.550  16   69.7    no female       yes
4      11.125  14   71.0    no   male        no
5       4.800   5   56.9    no   male        no
6       6.225  11   58.7    no female        no
7       4.950   8   63.3    no   male       yes
8       7.325  11   70.4    no   male        no
9       8.875  15   70.5    no   male        no
10      6.800  11   59.2    no   male        no
> data2<-read.table(file.choose(), header=T, sep=",")
> data2
   ï..LungCap Age Height Smoke Gender Caesarean
1       6.475   6   62.1    no   male        no
2      10.125  18   74.7   yes female        no
3       9.550  16   69.7    no female       yes
4      11.125  14   71.0    no   male        no
5       4.800   5   56.9    no   male        no
6       6.225  11   58.7    no female        no
7       4.950   8   63.3    no   male       yes
8       7.325  11   70.4    no   male        no
9       8.875  15   70.5    no   male        no
10      6.800  11   59.2    no   male        no
 

Dason

Ambassador to the humans
#2
This looks like an encoding issue to me. Upon playing around with creating different files with different encodings and messing with the headers I think I've found the issue.

I'm guessing your file is UTF-8 encoded.

The solution is fairly simple. We need to tell R the encoding when reading the file in.

Code:
dat <- read.csv("C:/path/to/your/file/because/I/dont/like/file.choose", header = TRUE, fileEncoding = "UTF-8-BOM")
It seems that your file is UTF-8 encoding and against recommendation whatever is making the file is adding a BOM at the beginning to identify itself as UTF-8.
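
If you want to check this for yourself, here is a minimal sketch. It assumes the stray characters come from the three BOM bytes (EF BB BF) at the very start of the file. The temporary file and its two columns are made up for illustration and only stand in for your real .csv.

```r
# Write a small CSV that starts with a UTF-8 BOM (bytes EF BB BF),
# imitating what the exporting program appears to have done.
tmp <- tempfile(fileext = ".csv")
con <- file(tmp, open = "wb")
writeBin(as.raw(c(0xEF, 0xBB, 0xBF)), con)
writeBin(charToRaw("LungCap,Age\n6.475,6\n10.125,18\n"), con)
close(con)

# Inspect the first three bytes of the file to see whether a BOM is present.
first_bytes <- readBin(tmp, what = "raw", n = 3)
has_bom <- identical(first_bytes, as.raw(c(0xEF, 0xBB, 0xBF)))

# "UTF-8-BOM" asks R to drop the BOM before the header is parsed.
dat <- read.csv(tmp, header = TRUE, fileEncoding = "UTF-8-BOM")
names(dat)
```

Running the readBin() line on your own file (with its real path in place of tmp) should tell you whether the BOM guess is right.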

Some relevant readings if you want more info
https://stackoverflow.com/questions/19936699/why-is-r-reading-utf-8-header-as-text
https://stackoverflow.com/questions/2223882/whats-different-between-utf-8-and-utf-8-without-bom
 

ondansetron

TS Contributor
#3
Thank you!

1) Why don't you like the "file.choose()" option?
2) What do you mean by "file is UTF-8 encoding and against recommendation whatever..."? What does the emboldened text mean? Is it against recommendation to use UTF-8?
 

Dason

Ambassador to the humans
#4
A script should be reproducible. You have a specific file in mind for your code - name it explicitly. If your script could apply to more than one file then you can list those or make the rest of the script into a function and call the function on the files of interest.
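
As a rough sketch of the "make it a function and call it on the files of interest" idea: the function name summarise_lungcap, the file names, and the small data frames below are all hypothetical and only there so the example runs on its own. In a real script the paths would be the actual files you care about.

```r
# Hypothetical helper: read one CSV with a LungCap column and
# return the mean lung capacity.
summarise_lungcap <- function(path) {
  dat <- read.csv(path, header = TRUE, fileEncoding = "UTF-8-BOM")
  mean(dat$LungCap)
}

# Stand-in files so the sketch is self-contained; replace these
# with explicit paths to your own data.
files <- file.path(tempdir(), c("site_a.csv", "site_b.csv"))
write.csv(data.frame(LungCap = c(6, 8)), files[1], row.names = FALSE)
write.csv(data.frame(LungCap = c(10, 12)), files[2], row.names = FALSE)

# The file names are written down in the script, so anyone rerunning it
# gets the same inputs without clicking through a file dialog.
results <- sapply(files, summarise_lungcap)
results
```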

Take my quote to mean more "and against recommendation whatever_is_making_the_file is adding a BOM". There is a recommendation to not add the BOM at the beginning of the file. Whatever program is creating your file is adding the BOM which goes against that recommendation.
 

ondansetron

TS Contributor
#5
That makes sense. Thanks. I'm not very well versed in coding. For anything I've done in SAS, I just keep a log of my code and the output it generates and store that in a Word file for reference if needed. Probably not the best.

I'll try to think about that in the future.