Check strings for identical letters and cut off non-identical letters

#1
Dear all,

I am playing around with R and trying to achieve the following:

The task at hand is to clean up manual, non-standardized inputs in a database. The manual input is the field "Modell", which holds the models of different car producers. Example: Ford Focus (where Ford is the value of "Producer" and Focus the value of "Modell").

Often the input for the model is not standardized, so one might write "Ford Focus1.6" or "Ford Focus 1.6". This is the same car, but R will treat the two entries as different ones.

All I want to achieve is:
1. Create a distance matrix that gives information about the similarity of the "Modell" strings (partly successful, see code)
2. Write similar "Modell" strings into a vector or list (partly successful, not automated yet)
3. Standardize similar "Modell" strings (example: the entries "116 i", "116i" and "116d" all become "116"). This will be a new column in the data.frame, so I can use a standardized Modell for regression analyses.

I understand that this is a big task, and I think telling me how to solve number 3 of my goals would really help me.

Here is the code I was able to produce. I know it lacks streamlining and is amateurish, but it's all I've got.

Code:
library(stringdist)
library(stringr)
library(compare)
## 1. Find standardized Modell
# producing example data.frame
df.1<-data.frame(Modell=as.character(c("116i", "116d", "Focus", "323d","525", "323 d", "Fiesta", "Focus 200", "Fiesta 12")),Producer=as.character(c("bmw","bmw","ford","bmw","bmw","bmw","ford","ford","ford")))
# Recode to character (I could surely do this when creating the example data.frame df.1)
df.2<-data.frame(lapply(df.1, as.character), stringsAsFactors=FALSE) 
# make a list of Producers and corresponding Modells
tester<-by(df.2,df.2$Producer,function(x)(subset(x)))
## 1.1 Compute the Levenshtein distance matrix
# I needed to use unlist, as the rest of the code would not work without it
# This of course needs to be automated via a loop or the apply family
bmw<-stringdistmatrix(unlist(tester[[1]][1]),unlist(tester[[1]][1]),method="lv")
ford<-stringdistmatrix(unlist(tester[[2]][1]),unlist(tester[[2]][1]),method="lv")

bmw<-cbind(tester[[1]][1],stringdistmatrix(unlist(tester[[1]][1]),unlist(tester[[1]][1]),method="lv"))
colnames(bmw)[2:6]<-c(as.character(bmw[,1]))
ford<-cbind(tester[[2]][1],stringdistmatrix(unlist(tester[[2]][1]),unlist(tester[[2]][1]),method="lv"))
colnames(ford)[2:5]<-c(as.character(ford[,1]))
rownames(bmw) <- NULL # drop the row names
rownames(ford) <- NULL # same for ford
## 2. Find standardized Modell categories
# If the distance in the row is not bigger than 2, write the column names into a vector
# (colnames() is indexed directly so that a single match still returns a name)
bmw.modell.116<-colnames(bmw)[which(bmw[1,2:6]<=2)+1]
bmw.modell.323<-colnames(bmw)[which(bmw[3,2:6]<=2)+1]
# Here I'm stuck. First, I have to define the classes manually; this should be automated. Second, I need a function that checks, for example, colnames(bmw)[which(bmw[1,2:6]<=2)+1] and tells me: 116 i, 116i, 116d = 116. This 116 should then be written into a new column in the original df.1.
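The per-producer steps above, which the comments mark as needing a loop or an apply-family call, could be automated roughly as follows. This is only a sketch: it swaps in base R's adist() for the Levenshtein distance so that it runs without extra packages, and the names models_by_producer, dist_by_producer and near_116i are made up for illustration.

```r
# Example data from the post, with character columns.
df.1 <- data.frame(
  Modell   = c("116i", "116d", "Focus", "323d", "525", "323 d",
               "Fiesta", "Focus 200", "Fiesta 12"),
  Producer = c("bmw", "bmw", "ford", "bmw", "bmw", "bmw",
               "ford", "ford", "ford"),
  stringsAsFactors = FALSE
)

# One character vector of models per producer.
models_by_producer <- split(df.1$Modell, df.1$Producer)

# One labelled Levenshtein distance matrix per producer.
# Assumption: utils::adist() stands in for
# stringdistmatrix(m, m, method = "lv") from the stringdist package.
dist_by_producer <- lapply(models_by_producer, function(m) {
  d <- adist(m)
  dimnames(d) <- list(m, m)
  d
})

# Models within distance 2 of a given model, here "116i".
near_116i <- names(which(dist_by_producer$bmw["116i", ] <= 2))
print(near_116i)
```

Because the matrices carry the model names as row and column names, rows can be looked up by name instead of by position, which avoids the hard-coded 2:6 and 2:5 column ranges.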
The target should look like this:
Code:
df.1<-data.frame(Modell=as.character(c("116i", "116d", "Focus", "323d","525", "323 d", "Fiesta", "Focus 200", "Fiesta 12")),Producer=as.character(c("bmw","bmw","ford","bmw","bmw","bmw","ford","ford","ford")),Modell_Standard=c("116","116","Focus","323","525","323","Fiesta","Focus","Fiesta"))
If you could point me in the right direction or give alternative solutions, that would be great!
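
As one possible alternative for this particular example, the target column can be reproduced without any distance computation by keeping only the leading run of digits or letters of each entry. This is a plain regular-expression heuristic, not the Levenshtein-based grouping described above. The function name leading_token is hypothetical, and the rule would break for models whose names mix letters and digits at the start (an entry such as "X5" would be cut to "X").

```r
# Hypothetical helper: keep the leading run of digits OR letters.
leading_token <- function(x) {
  x <- trimws(x)
  sub("^([0-9]+|[A-Za-z]+).*$", "\\1", x)
}

# Example models from the post.
modell <- c("116i", "116d", "Focus", "323d", "525", "323 d",
            "Fiesta", "Focus 200", "Fiesta 12")

modell_standard <- leading_token(modell)
print(data.frame(Modell = modell, Modell_Standard = modell_standard))
```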

Thanks

Tester
 

Lazar

Phineas Packard
#2
I admit I did not read your post in detail, but have you looked at the adist function (Levenshtein distance) and the agrep function?
 
#3
Hi Lazar,

thank you for your comment and hints.

Lazar wrote: "I admit I did not read your post in detail, but have you looked at the adist function (Levenshtein distance) and the agrep function?"
I was under the impression that I already use the Levenshtein distance via "stringdistmatrix" (from the stringdist package). See this piece of code I'm using to get the distance matrix:
Code:
stringdistmatrix(unlist(tester[[1]][1]),unlist(tester[[1]][1]),method="lv")
But you are of course right to recommend Levenshtein as it seems to be the right way to find similarities.

I also saw the agrep function, but it needs a list of standardized words to compare against a list or vector of words to be checked. My understanding is that it is just another function that does the same as adist or stringdist(matrix).

What I'm trying to achieve is this: once I have classified all similar car models, I want to create a summary model name that contains only the identical letters or numbers of the car models in each class (116i, 116 i, 116 d == 116).
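
The "identical letters per class" idea can be read as taking the longest common prefix of the strings in a class after removing whitespace. Here is a minimal sketch in base R, assuming the classes are already known; common_prefix is a hypothetical helper name, not a function from any package.

```r
# Hypothetical helper: longest common prefix of a character vector,
# ignoring whitespace so that "116 i" and "116i" are treated alike.
common_prefix <- function(x) {
  x <- gsub("\\s+", "", x)
  n <- min(nchar(x))
  k <- 0
  # Extend the prefix while all strings agree on the next character.
  while (k < n && length(unique(substr(x, k + 1, k + 1))) == 1) {
    k <- k + 1
  }
  substr(x[1], 1, k)
}

print(common_prefix(c("116i", "116 i", "116d")))
print(common_prefix(c("Focus", "Focus 200")))
print(common_prefix(c("323d", "323 d")))
```

Note that a class containing only "323d" and "323 d" yields "323d" rather than "323", because the trailing "d" is shared by both strings. A common prefix alone therefore does not strip a suffix that every member of the class happens to carry.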

If this does not work, I could also use agrep with a list of predefined car model types, but this would be very work-intensive, as I have approx. 100 different producers in my database.
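
The fallback with predefined model types might look roughly like this. It is a sketch only: the lookup list standards and the function standardize are invented for illustration, the lookup values are taken from the example data, and max.distance = 1 is an arbitrary tolerance that would need tuning (numeric names such as "116" and "316" are only one edit apart).

```r
# Hypothetical lookup of standard model names per producer.
standards <- list(
  bmw  = c("116", "323", "525"),
  ford = c("Focus", "Fiesta")
)

# Return the single standard name found (approximately) inside `modell`,
# or NA if there is no match or the match is ambiguous.
standardize <- function(modell, producer, max.distance = 1) {
  cand <- standards[[producer]]
  if (is.null(cand)) return(NA_character_)
  hit <- vapply(
    cand,
    function(s) agrepl(s, modell, max.distance = max.distance),
    logical(1)
  )
  if (sum(hit) == 1) cand[hit] else NA_character_
}

df.1 <- data.frame(
  Modell   = c("116i", "116d", "Focus", "323d", "525", "323 d",
               "Fiesta", "Focus 200", "Fiesta 12"),
  Producer = c("bmw", "bmw", "ford", "bmw", "bmw", "bmw",
               "ford", "ford", "ford"),
  stringsAsFactors = FALSE
)

df.1$Modell_Standard <- mapply(standardize, df.1$Modell, df.1$Producer,
                               USE.NAMES = FALSE)
print(df.1)
```

Returning NA for ambiguous or missing matches keeps doubtful rows visible for manual review instead of silently assigning them to the wrong model.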