# I need help on R basics; particularly list manipulation

#### student@UW

##### New Member

I am reading in over 1,000 data entries from a ".csv" file that look like the following list:

Code:
can't drive after midnight and before 5am
cannot drive from 12:30am-5am
Be off the road by 12:30am
I don't know
12:30 curfew
curfew-12:30
idk
I'm attempting to simplify the plethora of inputs. For example, replace all inputs that contain "curfew" or "12:30am" with a uniform word like "curfew". I aim to generate a more uniform list of inputs that looks like the following:

Code:
curfew
curfew
curfew
idk
curfew
curfew
idk


#### bryangoodrich

##### Probably A Mammal
This comment no longer applies! See next one.


#### Ventures

##### New Member
You may want to have a look at Google Refine for cleaning up messy data, and also at regular expressions, which you can use in R.

#### bryangoodrich

##### Probably A Mammal
holy crap that is a completely different looking question from your first attempt! lol

First off, I wonder why it's a comma-delimited file if you're just reading in a bunch of strings, unless there are other columns of data that might be useful. Otherwise, I'd probably just use scan or readLines to read in the statements. For each statement I'd run some sort of fuzzy match; I believe R has some grep utilities for that. There are algorithms out there, too. As Ventures pointed out, give Google a look.

If you expect certain words to show up, you'll have to deal with at least two situations: you get one hit, or you get more than one. In the former case, it's easy: if you get a fuzzy match to a keyword, return the keyword for that input. In the latter case, you'll have to get creative. Maybe you'll want to keep all matches: when you get a fuzzy match, continue searching for other matches and return them all.

You'll definitely want to do some QA/QC on this. Say, take a sample of 100 entries and check (1) that it is doing what you want it to do, and (2) its error rate, in terms of both hitting matches when none exist and missing matches when they do exist. I'd start with something as simple as possible, because search algorithms can be daunting, especially a fuzzy match like this.
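A rough sketch of the "keep all matches" idea, assuming base R's `agrepl` is an acceptable fuzzy matcher here (the post does not name a specific function). The vectors `inputs` and `keywords`, the misspelled entry, and the one-edit tolerance are illustrative assumptions, not part of the original data.

```r
# Illustrative inputs: two from the question, one invented typo, one non-match.
inputs <- c("cannot drive from 12:30am-5am", "12:30 curfew",
            "curfw at 12:30", "idk")

# Assumed keyword list; adjust to whatever words you expect to see.
keywords <- c("curfew", "12:30")

# agrepl() does approximate matching; max.distance = 1 tolerates one edit.
# The result is a logical matrix: one row per input, one column per keyword.
hits <- sapply(keywords, function(k) {
  agrepl(k, inputs, max.distance = 1, fixed = TRUE)
})

# For each input, keep every keyword that matched (possibly none).
matched <- lapply(seq_along(inputs), function(i) keywords[hits[i, ]])
names(matched) <- inputs

# QA/QC: eyeball a random sample (up to 100 entries) of input/match pairs.
check <- matched[sample(length(matched), min(100, length(matched)))]
print(check)
```

Inputs with more than one hit come back with several keywords, so deciding which one wins is still a separate step.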

#### student@UW

##### New Member
Okay, thanks very much to both of you. I'm on the right track, but I still can't accomplish my goal.

My goal: search a string; if a key word exists anywhere in the string, replace the whole string with just that key word.

Example String: "I don't like my curfew"
key word: "curfew"
New String: "curfew"

Is this possible in R? I think I can do it in Excel but I'm trying to learn/practice R skills

Thanks for the help thus far!

#### TheEcologist

##### Global Moderator
Hi student@UW,

This code will help you eliminate elements that don't match your keyword. All you need to do is figure out how to use it for your purpose. Post back if you need help.

Code:
A <- c("I don't like my curfew", "curfews rule!", "statsrule", "curfew sucks!")
# grep() returns the positions of the matches; indexing keeps only those elements
A[grep("curfew", A)]
Note, though, that by working with keywords you lose the context: in my example, one respondent actually likes the curfew, and that information is lost.

Best,

TE
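For reference, a small sketch of how the base R matching functions differ, using the same vector `A` as above. The variable names `positions`, `values`, and `flags` are just illustrative.

```r
A <- c("I don't like my curfew", "curfews rule!", "statsrule", "curfew sucks!")

# Integer positions of the elements that contain the pattern.
positions <- grep("curfew", A)

# The matching elements themselves (same result as A[grep("curfew", A)]).
values <- grep("curfew", A, value = TRUE)

# One TRUE/FALSE per element, so the result lines up with A.
flags <- grepl("curfew", A)

print(positions)
print(values)
print(flags)
```

Because `grepl` returns a vector the same length as `A`, it is the more convenient form when the non-matching elements have to be kept.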

#### student@UW

##### New Member
Thanks TE

However, your code only removes Strings that do NOT contain the key word.

I want to clean the String up by replacing the whole string with my keyword if it contains my keyword. I also need to preserve all other Strings that do not contain my keyword.

Example String: "I don't like my curfew"
key word: "curfew"
New String: "curfew"

Thanks for the help thus far everybody
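A minimal sketch of one way to do this in base R, assuming plain (non-fuzzy) regular-expression matching is good enough. The `rules` vector and its patterns are hypothetical; the names are the replacement labels and the values are the patterns to look for.

```r
responses <- c("I don't like my curfew", "12:30 curfew",
               "Be off the road by 12:30am", "I don't know", "idk")

# Hypothetical rule table: name = replacement label, value = regex to search for.
rules <- c(curfew = "curfew|12:30", idk = "don't know|idk")

# Start from a copy so strings that match no rule are preserved unchanged.
cleaned <- responses

for (label in names(rules)) {
  # grepl() flags every original string containing the pattern;
  # logical indexing then overwrites the whole string with the label.
  cleaned[grepl(rules[[label]], responses, ignore.case = TRUE)] <- label
}

print(cleaned)
```

For a single keyword, `sub(".*curfew.*", "curfew", responses)` gives the same effect in one line, since non-matching strings pass through `sub` untouched.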