How to extract titles from a full name using gsub

semidevil

Member
Using the Titanic data set, I know you can extract a salutation from a name with the following code:
Code:
name <- as.character("Braund, Mr. Owen Harris")
gsub('(.*,)|(\\..*)', '', name)
This successfully extracts "Mr" from the name, but I don't know how or why it works. Can someone explain what gsub is doing here?

consuli

Member
Testing the code on a few inputs
Code:
> name <- as.character("Braund, Mr. Owen Harris")
> gsub('(.*,)|(\\..*)', '', name)
[1] " Mr"
>
> name <- as.character("Braund, Owen Harris")
> gsub('(.*,)|(\\..*)', '', name)
[1] " Owen Harris"
>
> name <- as.character("Braund, Owen. Harris")
> gsub('(.*,)|(\\..*)', '', name)
[1] " Owen"
leads to the conclusion that the code keeps whatever lies between the comma and the period, including the leading space.

The code does its work through a regular expression, though.
If you want to understand it properly, you will have to learn regular expressions.

trinker

ggplot2orBust
Match any character (.) zero or more times (*) up to a comma (,) OR (|) a period (\\.) followed by any character (.) zero or more times (*).

Essentially, (.*,) eats up the string up to and including the comma, while (\\..*) eats up the period and everything after it. gsub replaces both matches with an empty string, so only the part in between is left.
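To see the two halves of the pattern at work, here is a minimal sketch that applies each alternative on its own with sub(). The variable names are only for illustration. The result keeps the space that followed the comma, so trimws() (base R 3.2.0 or later) is used at the end to drop it.

```r
name <- "Braund, Mr. Owen Harris"

# First alternative: greedy '.*,' removes everything up to and including the comma
after_comma <- sub('.*,', '', name)
after_comma
## [1] " Mr. Owen Harris"

# Second alternative: '\\..*' removes the first period and everything after it
before_period <- sub('\\..*', '', after_comma)
before_period
## [1] " Mr"

# gsub() with both alternatives does the same in one call;
# trimws() then strips the leading space
title <- trimws(gsub('(.*,)|(\\..*)', '', name))
title
## [1] "Mr"
```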

I would say there are more robust ways to extract the title. I'd use an extraction rather than a substitution approach. Base R can do extraction, but it's more complicated than the stringi package, which has the stri_extract_all_regex function.

Code:
library(stringi)
name <- as.character("Braund, Mr. Owen Harris")
stri_extract_all_regex(name, '\\b[DrMm]r?s?\\.')

## [[1]]
## [1] "Mr."
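For comparison, a rough base R equivalent can be sketched with regexpr() and regmatches(). It reuses the same pattern and assumes you only want the first match in each name.

```r
name <- "Braund, Mr. Owen Harris"
pattern <- '\\b[DrMm]r?s?\\.'

# regexpr() locates the first match; regmatches() pulls out the matched text
m <- regexpr(pattern, name, perl = TRUE)
title <- regmatches(name, m)
title
## [1] "Mr."
```

One thing to watch with this base R route: on a vector of names, regmatches() silently drops the elements that have no match, so the result can be shorter than the input.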

rogojel

TS Contributor
That would miss a lot of possibilities, like Miss, Rev., ... etc.

regards

trinker

ggplot2orBust
Yeah, definitely, but I'd think it safer to add the ones you want manually. Relying on .* is almost always going to be wrecked by edge cases.
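As a sketch of the manual approach, the titles you care about can be listed explicitly and joined into one alternation. The `titles` vector and the sample names below are illustrative assumptions, not a complete list for the Titanic data, and base R is used here only to keep the example free of dependencies.

```r
# Hypothetical list of titles to recognise; extend as needed
titles <- c("Mr", "Mrs", "Miss", "Ms", "Master", "Dr", "Rev")

# Build '\\b(Mr|Mrs|...)\\.' : a listed title at a word boundary, followed by a period
pattern <- paste0("\\b(", paste(titles, collapse = "|"), ")\\.")

passengers <- c("Braund, Mr. Owen Harris",
                "Heikkinen, Miss. Laina",
                "Braund, Owen Harris")

# regexpr() returns -1 where nothing matches, so start with NA everywhere
# and fill in only the names that contain a listed title
m <- regexpr(pattern, passengers, perl = TRUE)
found <- rep(NA_character_, length(passengers))
found[m > 0] <- regmatches(passengers, m)
found
# "Mr." "Miss." NA
```

Names without a listed title come back as NA instead of a stray chunk of the name, which makes the misses easy to spot and add to the list.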