Dear all,
I am playing around with R and trying to achieve the following:
The task at hand is to clean up manual, non-standardized inputs to a database. The manual input is the value "Modell", which holds the models of different car producers. Example: Ford Focus (where Ford is the value "Producer" and Focus is the value "Modell").
Often the input for the model is not standardized, so one might write "Ford Focus1.6" or "Ford Focus 1.6", which is the same car, but R will treat the two as different.
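To illustrate the problem, here is a minimal sketch using base R only. The helper name `normalize_modell` is made up for this example, and lowercasing plus removing all whitespace is just one assumed normalization, not a complete fix:

```r
# Hypothetical helper: lowercase and drop all whitespace
normalize_modell <- function(x) gsub("\\s+", "", tolower(x))

raw <- c("Ford Focus1.6", "Ford Focus 1.6")

raw[1] == raw[2]                                      # the raw strings differ
normalize_modell(raw)[1] == normalize_modell(raw)[2]  # the normalized strings match
```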
All I want to achieve is:
1. Create a distance matrix which gives information about the similarity of the "Modell" strings (kind of successful, see code).
2. Write similar "Modell" strings into a vector, a list or something similar (kind of successful, not automated yet).
3. Standardize similar "Modell" strings (example: the "Modells" "116 i", "116i" and "116d" all become "116"). This will be a new column in the data.frame, so I can use a standardized Modell for regression analyses.
I understand that this is a big task, and I think telling me how to solve number 3 of my goals would really help me.
Here is the code I was able to produce. I know it lacks streamlining and is amateurish, but it's all I've got.
Code:
library(stringdist)
library(stringr)
library(compare)
## 1. Find standardized Modell
# Produce an example data.frame (stringsAsFactors = FALSE keeps both columns as character)
df.1 <- data.frame(
  Modell   = c("116i", "116d", "Focus", "323d", "525", "323 d", "Fiesta", "Focus 200", "Fiesta 12"),
  Producer = c("bmw", "bmw", "ford", "bmw", "bmw", "bmw", "ford", "ford", "ford"),
  stringsAsFactors = FALSE
)
# Recode to character (redundant once df.1 is built with stringsAsFactors = FALSE; kept because the code below uses df.2)
df.2<-data.frame(lapply(df.1, as.character), stringsAsFactors=FALSE)
# make a list of Producers and corresponding Modells
tester<-by(df.2,df.2$Producer,function(x)(subset(x)))
## 1.1 Compute the Levenshtein distance matrix
# I needed to use unlist(), as the rest of the code would not work without it
# This of course needs to be automated via a loop or the apply family
bmw<-stringdistmatrix(unlist(tester[[1]][1]),unlist(tester[[1]][1]),method="lv")
ford<-stringdistmatrix(unlist(tester[[2]][1]),unlist(tester[[2]][1]),method="lv")
bmw<-cbind(tester[[1]][1],stringdistmatrix(unlist(tester[[1]][1]),unlist(tester[[1]][1]),method="lv"))
colnames(bmw)[2:6]<-c(as.character(bmw[,1]))
ford<-cbind(tester[[2]][1],stringdistmatrix(unlist(tester[[2]][1]),unlist(tester[[2]][1]),method="lv"))
colnames(ford)[2:5]<-c(as.character(ford[,1]))
rownames(bmw) <- NULL # reset the row names
rownames(ford) <- NULL # same for ford
## 2. Find standardized Modell categories
# If the distance in the row is not bigger than 2, write the column names into a vector
bmw.modell.116<-colnames(bmw[,which(bmw[1,2:6]<=2)+1])
bmw.modell.323<-colnames(bmw[,which(bmw[3,2:6]<=2)+1])
# Here I am stuck. First of all, I have to define the classes manually; this should be automated.
# Second, I need a function which checks, for example, colnames(bmw[,which(bmw[1,2:6]<=2)+1])
# and tells me: "116 i", "116i", "116d" = "116".
# This "116" shall then be written into a new column in the original df.1.
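One possible way to automate steps 1 and 2 is sketched below. It is only a sketch under assumptions: it uses base R's adist() (generalized Levenshtein distance) in place of stringdistmatrix(), the single-linkage clustering via hclust() is a choice made for this example, and the cut-off of 2 is simply taken from the manual code above. The names `models`, `dist_by_producer` and `groups_by_producer` are made up for the example.

```r
# Example data, same values as df.1 above
models <- data.frame(
  Modell   = c("116i", "116d", "Focus", "323d", "525", "323 d",
               "Fiesta", "Focus 200", "Fiesta 12"),
  Producer = c("bmw", "bmw", "ford", "bmw", "bmw", "bmw",
               "ford", "ford", "ford"),
  stringsAsFactors = FALSE
)

# Step 1: one labelled Levenshtein distance matrix per producer
dist_by_producer <- lapply(split(models$Modell, models$Producer), function(m) {
  d <- adist(m)
  dimnames(d) <- list(m, m)
  d
})

# Step 2: group strings whose distance is at most 2
# (single linkage: a string joins a group if it is close to any member)
groups_by_producer <- lapply(dist_by_producer, function(d) {
  cutree(hclust(as.dist(d), method = "single"), h = 2)
})

# Named integer vector: strings with the same number belong to one group
groups_by_producer$bmw
```

With this cut-off, "Focus" and "Focus 200" still end up in different groups, because their distance is 4. So the threshold, or a normalization step before computing the distances, would probably need tuning.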
The target should look like this:
Code:
df.1 <- data.frame(
  Modell          = c("116i", "116d", "Focus", "323d", "525", "323 d", "Fiesta", "Focus 200", "Fiesta 12"),
  Producer        = c("bmw", "bmw", "ford", "bmw", "bmw", "bmw", "ford", "ford", "ford"),
  Modell_Standard = c("116", "116", "Focus", "323", "525", "323", "Fiesta", "Focus", "Fiesta")
)
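For step 3, one simple rule that reproduces the target column for this example data is to keep only the leading run of digits or of letters in each string. This is a sketch under that assumption, not a general solution: the function name `standardize_modell` is invented here, and real model names (for example ones that begin with a letter followed by digits) would need a more careful rule.

```r
# Hypothetical helper: keep the leading block of digits or of letters
standardize_modell <- function(x) {
  sub("^\\s*([0-9]+|[[:alpha:]]+).*$", "\\1", x, perl = TRUE)
}

modell <- c("116i", "116d", "Focus", "323d", "525", "323 d",
            "Fiesta", "Focus 200", "Fiesta 12")

# New column with the standardized model name
result <- data.frame(
  Modell          = modell,
  Modell_Standard = standardize_modell(modell),
  stringsAsFactors = FALSE
)
result
```

A rule like this needs no distance matrix at all. The distance-based grouping could still be useful to spot typos that a fixed pattern would miss.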
If you could point me in the right direction or give alternative solutions, that would be great!
Thanks
Tester