Delete duplicates based on various variables

#1
I have a dataset of articles found for a literature review. I have their TITLE, YEAR of PUBLICATION, AUTHOR, and other variables. I know that there are duplicates in my dataset. I would like to delete duplicates based on TITLE, YEAR, AUTHOR.

My TITLE is a string variable and it's a little bit messy. The same article can have slightly different titles, e.g., "Breast cancer and pregnancy" vs. " breast cancer and Pregnancy" (with a space at the beginning). Same thing with my AUTHOR variable.
My variable YEAR is numeric so that's fine.

How can I find my duplicates and delete them?

Please I really need help on this one!

Thank you in advance. I am attaching a subset of my dataset.
Marvin
 
#2
Marvin,

Your Excel file was empty, but I'll create a toy data set.

If the titles may have slight differences, then the best you can do
is to list possible duplicates and then make decisions about which
records to delete.

Here is an example of how you could produce a list of
potential duplicates based on same year and same author.


Code:
clear
input str40 TITLE YEAR str40 AUTHOR
  "Some Title" 2011 "Smith, Green, & Jones"
  "Some Title" 2011 "Smith, Green, & Jones"
  "Another Title" 2010 "Smith"
  "Yet Another Title but Not Complete" 2006 "Weinstein & Roberts"
  "A Great Article" 2006 "Marks & Coe"
  "Yet Another Title but More Complete" 2006 "Weinstein & Roberts"
end

list, noobs separator(0)

* Sort by Year and Author within Year
sort YEAR AUTHOR

list, noobs separator(0)

display _n(2) "{red:{hline 72}}{c 10}{dup 16:{c 160}}Check for duplicated records below!{c 10}{red:{hline 72}}{c 10}"
list if AUTHOR[_n-1] == AUTHOR[_n] | AUTHOR[_n+1] == AUTHOR[_n], noobs sepby(AUTHOR)
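
If some of the records match exactly on all three variables, Stata's built-in -duplicates- command can handle those cases directly, before you review the fuzzier near-matches. A minimal sketch (run after the -input- block above), not part of the original code:

Code:
* Report and tag records that are exact duplicates on all three variables
duplicates report TITLE YEAR AUTHOR
duplicates tag TITLE YEAR AUTHOR, gen(dup)
list if dup > 0, noobs

* Keep one copy of each exact duplicate; -force- is required with a varlist
duplicates drop TITLE YEAR AUTHOR, force

This only removes records that are identical on the listed variables, so the slightly-different titles and authors still need the review approach shown above.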
 
#3
Wow, I don't know what you did, but this is magic. Which criteria did you use to find the duplicates? What about the titles? Did you match them on the whole title or only part of it (let's say if 2 titles are similar in their first 5 words)?

My dataset is very big. Is there a way to delete the duplicates automatically?

Thank you very much!
 
#4
I did not match on titles, because you said there might be different titles for the same articles.
So I matched on year and author(s) and then displayed a list of titles that have the same year
and author(s).

You would have to review that list of potential duplicates and make decisions about which
potential duplicates are actual duplicates. Then you would delete the duplicates manually.

I modified my code above to add an article number variable. You can then delete articles
by article number using -inlist()- as shown in the following code.

Code:
clear
input str40 TITLE YEAR str40 AUTHOR
  "Some Title" 2011 "Smith, Green, & Jones"
  "Some Title" 2011 "Smith, Green, & Jones"
  "Another Title" 2010 "Smith"
  "Yet Another Title but Not Complete" 2006 "Weinstein & Roberts"
  "A Great Article" 2006 "Marks & Coe"
  "Yet Another Title but More Complete" 2006 "Weinstein & Roberts"
end

* Generate article number variable
gen articlenum = _n

list, noobs separator(0)

* Sort by Year and Author within Year
sort YEAR AUTHOR

list, noobs separator(0)

display _n(2) "{red:{hline 72}}{c 10}{dup 16:{c 160}}Check for duplicated records below!{c 10}{red:{hline 72}}{c 10}"
list if AUTHOR[_n-1] == AUTHOR[_n] | AUTHOR[_n+1] == AUTHOR[_n], noobs sepby(AUTHOR)

* Based on a review of the list, I decide to delete articles 2 and 4.
* The -inlist- argument can handle from 2 to 255 article number elements.
* See -help inlist- .
drop if inlist(articlenum,2,4)

* List cleaned data set
list, noobs separator(0)
 
#5
Perfect, I will try this syntax. One more thing: do the authors have to match exactly? The author of the same article may be written slightly differently, e.g., one may have a comma and the other doesn't, or one starts with a capital letter and the other doesn't.

Thanks,
Marvin

PS: I am learning a lot here.
 
#6
This should give you an idea of how to approach this problem, but
please understand that in the end you are going to have to do some
visual inspection of a list of suspected duplicates and make
human decisions about which cases to drop.

The following expanded code is annotated, but basically it tries to clean up
as much as possible in terms of titles and authors names.

Just study the code to see how it is done. This could be refined
further, but not without a lot more thought and code. (You could also
use regex for the string searches and manipulations, but that's more
complicated if you are not experienced with regex.)
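
For instance, the leading-article removal in the code below could be written with -regexr()- instead of paired subinstr()/substr() calls. A sketch of that alternative, with an extra trim step to handle the stray leading spaces mentioned earlier; this is not part of the original code:

Code:
* Remove a leading "a " or "the " with one regular expression each
* ("^" anchors the match to the start of the string)
replace TESTTITLE = regexr(TESTTITLE, "^a ", "")
replace TESTTITLE = regexr(TESTTITLE, "^the ", "")

* Collapse internal runs of spaces and strip leading/trailing blanks
replace TESTTITLE = trim(itrim(TESTTITLE))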

Code:
set more off
clear
input str40 TITLE YEAR str40 AUTHOR
  "Some Title" 2011 "Smith, Green, & Jones"
  "Some Title" 2011 "Smith, Green, & Jones"
  "A Title" 2010 "Smith"
  "Yet Another Title but Not Complete" 2006 "Weinstein and Roberts"
  "The Great Article" 2006 "Marks & Coe"
  "Yet Another Title but More Complete" 2006 "Weinstein & Roberts"
end

* Generate article number variable
gen articlenum = _n

* Create a test variable for title
clonevar TESTTITLE = TITLE

* Create a test variable for AUTHOR
clonevar TESTAUTHOR = AUTHOR

* Convert TESTTITLE to all lowercase
replace TESTTITLE = lower(TESTTITLE)

* Convert TESTAUTHOR to all lowercase
replace TESTAUTHOR = lower(TESTAUTHOR)

* Eliminate any instance of the indefinite article "a" from TESTTITLE
replace TESTTITLE = subinstr(TESTTITLE," a "," ", .)
replace TESTTITLE = subinstr(TESTTITLE,"a ","", 1) if substr(TESTTITLE,1,2) == "a "

* Eliminate any instance of the definite article "the" from TESTTITLE
replace TESTTITLE = subinstr(TESTTITLE," the "," ", .)
replace TESTTITLE = subinstr(TESTTITLE,"the ","", 1) if substr(TESTTITLE,1,4) == "the "

* Eliminate "and" and "&" and "," and "." in TESTAUTHOR
replace TESTAUTHOR = subinstr(TESTAUTHOR," and "," ", .)
replace TESTAUTHOR = subinstr(TESTAUTHOR," & "," ", .)
replace TESTAUTHOR = subinstr(TESTAUTHOR,",","", .)
replace TESTAUTHOR = subinstr(TESTAUTHOR,".","", .)

list, noobs separator(0)

* Sort by Year and Author within Year
sort YEAR TESTAUTHOR

list, noobs separator(0)

* Display suspected duplicates for visual review
display _n(2) "{red:{hline 72}}{c 10}{dup 16:{c 160}}Check for duplicated records below!{c 10}{red:{hline 72}}{c 10}"
list articlenum YEAR TESTAUTHOR TESTTITLE if TESTAUTHOR[_n-1] == TESTAUTHOR[_n] | TESTAUTHOR[_n+1] == TESTAUTHOR[_n], noobs sepby(TESTAUTHOR)

* Based on a review of the list, I decide to delete articles 2 and 4.
* The -inlist- argument can handle from 2 to 255 article number elements.
* See -help inlist- .
drop if inlist(articlenum,2,4)

* Sort by TESTAUTHOR and then by YEAR
sort TESTAUTHOR YEAR

* List cleaned data set (first with test vars and then with original vars)
list articlenum YEAR TESTAUTHOR TESTTITLE, noobs separator(0) abbrev(10)
list articlenum YEAR AUTHOR TITLE, noobs separator(0) abbrev(10)
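
Once TESTTITLE and TESTAUTHOR have been normalized this way, records that are now identical on the cleaned variables could also be dropped automatically rather than by article number. A sketch, assuming you have reviewed the list and are satisfied that exact matches on the cleaned variables really are duplicates:

Code:
* Sort so the first occurrence of each article is kept, then drop
* records identical on the cleaned (lowercased, de-articled) variables
sort TESTAUTHOR YEAR TESTTITLE
duplicates drop TESTTITLE YEAR TESTAUTHOR, force

Near-duplicates that still differ after cleaning (like the "Not Complete"/"More Complete" pair in the toy data) will survive this step, so the visual review remains necessary for those.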
 