Extract numbers from a string

#1
I have this data of duration in the following format:pT2H4M37S
and I want to extract the hours, minutes and seconds separately. I tried to use split command but it didn't work because the format is not uniform. They can be:
PT2H
PT2H4M
PT2H4M37S
PT4M
PT4M37S
PT37S
(not to mention if number of seconds is less than 10, it is displayed PT9).
In Excel, I can do an IFERROR(SUBSTRACT(LEFT(RIGHT(FIND...)))) to find the correct separation but I have trouble in Stata. Any help / suggestion? Thanks
 
#2
You can do this using regex functions

Code:
clear 
input str9 sandbox 
PT2H 
PT2H4M 
PT2H4M37S
PT4M 
PT4M37S 
PT37S 
end

gen hours = regexs(1) if regexm(sandbox,"([0-9]+)H")
gen mins = regexs(1) if regexm(sandbox,"([0-9]+)M")
gen secs = regexs(1) if regexm(sandbox,"([0-9]+)S")

list
 
#3
Thank you Robert Picard it works very well. But do I have to put each of the element in the sandbox variable? Suppose I have 10,000 observations, each of them is different from one another due to duration, is there anyway I can import a variable in the input? Thanks
 
#4
Sorry but I don't understand what you are saying. If your substrings are part of a larger string variable, then the commands I posted should still work, just use the appropriate variable name instead of the one I used (sandbox). If you are talking about something else, can you post a small representative data example, preferably generated using dataex (from SSC). You can install it using

Code:
ssc install dataex
 
#5
Hi Robert thanks for your reply. My data looks like this. How do you think I should put all of that after the input command?
duration
"PT2M8S"
"PT3M13S"
"PT1M24S"
"PT37S"
"PT20S"
"PT55S"
"PT1M2S"
"PT40S"
"PT50S"
"PT3M17S"
"PT55S"
"PT32S"
"PT18S"
"PT37S"
"PT1M40S"
"PT39S"
"PT3M16S"
"PT1M34S"
"PT1M18S"
"PT1M26S"
"PT1M40S"
"PT1M33S"
"PT19S"
"PT21S"
"PT25S"
"PT37S"
"PT29S"
"PT38S"
"PT2M31S"
"PT1M8S"
"PT2M19S"
"PT44S"
"PT50S"
"PT40S"
"PT52S"
"PT51S"
"PT1H2M49S"
"PT2H49M3S"
"PT3M14S"
"PT50S"
"PT2M39S"
"PT58S"
"PT25S"
"PT34S"
"PT11S"
"PT12S"
"PT9S"
"PT2M25S"
"PT1M51S"
"PT13S"
"PT13S"
"PT16S"
"PT16S"
"PT22S"
"PT16S"
"PT16S"
"PT16S"
"PT16S"
"PT16S"
"PT15S"
"PT16S"
"PT15S"
"PT16S"
"PT16S"
"PT15S"
"PT3H58M12S"
"PT16S"
"PT38S"
"PT1M33S"
"PT46S"
"PT53S"
"PT51S"
"PT48S"
"PT2M30S"
"PT1M53S"
"PT1M13S"
"PT1M40S"
"PT2M10S"
"PT1M30S"
"PT1M25S"
"PT1M57S"
"PT1M22S"
"PT1M24S"
"PT2M2S"
"PT1M35S"
"PT53S"
"PT1M39S"
"PT3M18S"
"PT2M16S"
"PT10S"
"PT1M2S"
"PT1M36S"
"PT44S"
"PT1M18S"
"PT1M16S"
"PT1H1M2S"
"PT1M"
"PT1M5S"
"PT2M31S"
"PT1M54S"
"PT1M24S"
"PT1M33S"
"PT1M59S"
"PT1M33S"
"PT1M25S"
"PT56S"
"PT49S"
"PT57S"
"PT2M43S"
"PT10S"
"PT29S"
"PT2M31S"
"PT1M56S"
"PT30S"
"PT1M2S"
"PT1H43S"
"PT43S"
"PT28S"
"PT46S"
"PT28S"
"PT45S"
"PT46S"
"PT1M36S"
"PT4M51S"
"PT3M31S"
"PT20S"
"PT18S"
"PT1M47S"
"PT21S"
"PT58S"
"PT1M24S"
"PT1M42S"
"PT1M8S"
"PT23S"
"PT25S"
"PT4M35S"
"PT38S"
"PT1M53S"
"PT2M25S"
"PT20S"
"PT1M37S"
"PT1M55S"
"PT19S"
"PT1M3S"
"PT1M6S"
"PT2M9S"
"PT1M51S"
"PT1M54S"
"PT16S"
"PT1M37S"
"PT1M48S"
"PT1M41S"
"PT1M39S"
"PT5M4S"
"PT1M"
"PT2M42S"
"PT1M39S"
"PT1M20S"
"PT2M42S"
"PT1M39S"
"PT1M22S"
"PT1M13S"
"PT1M49S"
"PT1M49S"
"PT14S"
"PT1M8S"
"PT1M42S"
"PT54S"
"PT1M17S"
"PT1M26S"
"PT1M23S"
"PT57S"
"PT1M24S"
"PT45S"
"PT48S"
"PT50S"
"PT29S"
"PT24S"
"PT34S"
"PT42S"
"PT44S"
"PT52S"
"PT22S"
"PT23S"
"PT1M15S"
"PT1M3S"
"PT17S"
"PT52S"
"PT20S"
"PT58S"
"PT1M2S"
"PT47S"
"PT1M31S"
"PT1M5S"
"PT58S"
"PT1M16S"
"PT1M3S"
"PT1M2S"
"PT58S"
"PT1M7S"
"PT58S"
"PT1M5S"
"PT1M"
"PT1M6S"
"PT1M38S"
"PT1M13S"
"PT1M41S"
"PT46S"
"PT2M7S"
"PT51S"
"PT1M37S"
"PT3M34S"
"PT59S"
"PT52S"
"PT54S"
"PT1M6S"
"PT3M13S"
"PT58S"
"PT11M9S"
"PT51S"
"PT49S"
"PT20S"
"PT54S"
"PT1M36S"
"PT51S"
"PT1M25S"
"PT59S"
"PT7S"
"PT20S"
"PT36S"
"PT39S"
"PT40S"
"PT46S"
"PT47S"
"PT31S"
"PT1M32S"
"PT1M34S"
"PT1M34S"
"PT1M44S"
"PT3M1S"
"PT41S"
"PT5M36S"
"PT1M31S"
"PT1M31S"
"PT1M31S"
"PT32S"
"PT48S"
"PT1M1S"
"PT4M1S"
"PT2M3S"
"PT1M24S"
"PT37S"
"PT2M3S"
"PT59S"
"PT2M4S"
"PT2M15S"
"PT1M27S"
"PT2M50S"
"PT3M32S"
"PT4M46S"
"PT1M13S"
"PT5M17S"
 
#6
I think you misunderstood the purpose of the section of code that creates a data example. It's there so that you can test and see the results using a fully functional example of a solution. In your case, if your variable is called duration, then all you need is to load your data in memory and type

Code:
gen hours = regexs(1) if regexm(duration,"([0-9]+)H")
gen mins = regexs(1) if regexm(duration,"([0-9]+)M")
gen secs = regexs(1) if regexm(duration,"([0-9]+)S")