Compute the intersection of two csv files

winecoding

New Member
I have two csv files sharing some common columns. They are referred to as A.csv and B.csv. Right now, I need to generate two new files. C.csv is the intersection of A.csv and B.csv; D.csv consists of the remaining columns after subtracting C.csv from B.csv. Are there any approaches to do the above work in R?
A.csv
Code:
atheism    sport    baseball    alt
1    0    0    1
1    0    0    1
0    1    1    0
1    0    0    1
0    1    1    0
0    1    1    0
0    1    1    0
B.csv
Code:
sport    baseball    alt    rec
0    0    1    0
0    0    1    0
1    1    0    1
0    0    1    0
1    1    0    1
1    1    0    1
1    1    0    1
C.csv should be

Code:
sport    baseball    alt
0    0    1
0    0    1
1    1    0
0    0    1
1    1    0
1    1    0
1    1    0

bryangoodrich

Probably A Mammal
Use match(names(A), names(B)) to find the fields in B that share a name with a field in A. I assume you know how to use read.csv to create the variables A and B in that example. If not, read the help file: help("read.csv"). How do you use the vector returned by match? Use it to subset B: with "idx" defined as the match(...) result above, B[, idx] gives you the columns of B at those positions, and B[, -idx] gives you everything except those columns.

winecoding

New Member
Hi, thanks for your response. Here is what I did:

Code:
>Amatrix <-read.table("A.csv", sep=",",header=T)
>Bmatrix <-read.table("B.csv", sep=",",header=T)
> name.intersect<-match(names(Amatrix),names(Bmatrix))
> name.intersect
[1]  2  1 NA  4 NA  6  7  8  9
But when I do the following
Code:
> Bmatrix[,name.intersect]
Error in `[.data.frame`(Bmatrix, , name.intersect) :
undefined columns selected
Looks like the way I used the vector "name.intersect" is not correct. Can you let me know how to do that correctly?

bryangoodrich

Probably A Mammal
First, use read.csv, since it already sets the parameters you're specifying. Second, you need to remove the NA values. Something like

Code:
B[, na.omit(idx)]
On another note, add more spaces. It makes your code more readable (e.g., sep = ",", header = TRUE). There's no point in naming something Amatrix. First, it's a data frame, not a matrix. Second, the name doesn't need to reflect the data type. It should describe the content. For example, when I fit a regression, I use "fm1" to mean "fitted model number 1." If you're naming something arbitrary, there's nothing wrong with names like "X" as long as you keep track of them, they're short-lived, and there are no name conflicts with other things in your environment (e.g., "C" conflicts with the function C() that's already loaded).
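To make the match approach concrete, here is a minimal sketch. It assumes small inline data frames standing in for read.csv("A.csv") and read.csv("B.csv"); the names "shared" and "rest" are made up for illustration and are not from the thread.

Code:
```r
# Hypothetical stand-ins for read.csv("A.csv") and read.csv("B.csv")
A <- data.frame(atheism = c(1, 1, 0), sport = c(0, 0, 1),
                baseball = c(0, 0, 1), alt = c(1, 1, 0))
B <- data.frame(sport = c(0, 0, 1), baseball = c(0, 0, 1),
                alt = c(1, 1, 0), rec = c(0, 0, 1))

# Position in B of each column name of A; NA where A's name is absent from B
idx <- match(names(A), names(B))

# Drop the NAs before subsetting (same effect as na.omit(idx))
idx <- idx[!is.na(idx)]

# drop = FALSE keeps a data frame even when only one column is selected
shared <- B[, idx, drop = FALSE]   # columns of B that are also named in A
rest   <- B[, -idx, drop = FALSE]  # the remaining columns of B
```

One caveat with this sketch: if the two files share no column names, idx is empty and B[, -idx] selects no columns rather than all of them, so that case would need its own check.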

derksheng

New Member
To intersect any two vectors P and Q you can go:

Code:
Reduce(intersect,list(P,Q))

Dason

Ambassador to the humans
Although this form is useful when you want to intersect many vectors... for two vectors it's probably just easier to do
Code:
intersect(P, Q)
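Applied to the original question, intersect can work on the column names directly, with setdiff picking up the leftover columns. The following is only a sketch: the inline data frames stand in for the two csv files, and the variable names and the temporary output directory are assumptions for illustration.

Code:
```r
# Hypothetical stand-ins for read.csv("A.csv") and read.csv("B.csv")
A <- data.frame(atheism = c(1, 1, 0), sport = c(0, 0, 1),
                baseball = c(0, 0, 1), alt = c(1, 1, 0))
B <- data.frame(sport = c(0, 0, 1), baseball = c(0, 0, 1),
                alt = c(1, 1, 0), rec = c(0, 0, 1))

# Column names present in both files, in the order they appear in B
common <- intersect(names(B), names(A))

# Column names of B that are not shared with A
others <- setdiff(names(B), common)

shared <- B[, common, drop = FALSE]
rest   <- B[, others, drop = FALSE]

# Written to a temporary directory here; point these at C.csv and D.csv
# in your working directory for real use
out_dir <- tempdir()
write.csv(shared, file.path(out_dir, "C.csv"), row.names = FALSE)
write.csv(rest, file.path(out_dir, "D.csv"), row.names = FALSE)
```

Selecting by name avoids the NA handling that match needs, and it still behaves sensibly when there are no shared columns.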