Summarize by group the maximum amount of time a range of values continuously occurred in R

#1
Say I have a dataframe that looks like this (I wasn't sure how best to reproduce the data here, given its size):

Code:
> head(df)
  yr    mo    day    time       sal    site
2021    8    1    0000    26.614    14
2021    8    1    0015    25.724    14
2021    8    1    0030    25.739    14
2021    8    1    0045    25.831    14
2021    8    1    0100    25.798    14
2021    8    1    0115    25.667    14
The dataframe has 800k+ rows of continuously measured data in 15-minute increments over the past couple of years at "n" sites.
I'm attempting to recreate in R a matrix/table like the image attached (originally created in Excel). Color is not important; I only want to reproduce the table itself. Each value in that table is the maximum number of days (converted from consecutive minutes) during which a range of values (sal) was measured without interruption (e.g., for each site, season, and year, show the longest continuous run of time in which sal stayed between, say, 40 and 50).

To belabor the point: readings only count toward a run if their times, days, months, and years follow one another with no gaps.
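For instance (made-up numbers, just to show what I mean by an unbroken chain):

Code:
sal  <- c(41, 43, 47, 44, 45, 46)   # every reading is within 40-50
mins <- c(0, 15, 30, 60, 75, 90)    # minutes since midnight; the 0045 reading is missing
diff(mins)                          # 15 15 30 15 15
# The 30-minute jump breaks the chain, so rows 1-3 and rows 4-6 are two separate
# runs, even though sal never leaves the range.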

# My attempt:
I found this code somewhat useful, but it doesn't account for the fact that consecutive values need to form an unbroken chain in time.

Code:
df %>%
  group_by(site, mo, yr) %>%
  mutate(high_salinity = between(sal, 40, 50),
         # cumulative count of in-range rows; it keeps counting across breaks
         # in the time series, so it isn't a true run length
         high_salinity_duration = cumsum(high_salinity) * high_salinity) %>%
  summarise(longest_high_salinity = max(high_salinity_duration))
 

[Attached image: example output table (longest in-range duration, in days, by site and year/season), originally made in Excel]


Buckeye

#2
The goal isn't entirely clear to me from the code snippet and image file. How are the dry and wet columns defined? I don't see any reference to them in the code. Is there a way to write out the calculation by hand, or as a "code sketch"? In particular, what data inputs do you have, and what transformations turn them into the output numbers? Then maybe we can write code for the calculation.
 
#3
Yes, sorry, I don't know how to reproduce datasets well enough. The "dry" season is November through April and the "wet" season is May through October.
 
#4
One column I believe I'm missing is one that has both date and time (a POSIX datetime built from the year, month, etc. columns). Using that column, I want to see how long the values in the "sal" column remain within my (arbitrary) range of 40-50. The longest length of time the values stay in that range is then put in a table (output) next to the corresponding year/season and site.
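Something like this is what I'm picturing for building that column (a rough sketch; it assumes the time column really is the zero-padded "HHMM" text shown in head(df)):

Code:
library(dplyr)
library(stringr)
library(lubridate)

df <- df %>%
  mutate(time     = str_pad(as.character(time), 4, side = "left", pad = "0"),  # "15" -> "0015"
         hour     = as.numeric(substr(time, 1, 2)),
         minute   = as.numeric(substr(time, 3, 4)),
         DateTime = make_datetime(yr, mo, day, hour, minute))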
 
#5
For example, for each year: how many minutes (converted to a proportion of a day, i.e. 15 min = 15/1440 ≈ 0.01 days) did the values in "sal" stay between 40 and 50 at each site? The input is "sal", the group_by is year, season, and site, and the output is a table in R similar to the one in the image attached (the longest/maximum lengths of time the condition was met).
 
#6
Or, said another way, I need a function that filters for the desired sal values and then calculates the time between neighboring rows. If that difference is 15 minutes, the rows are part of a continuous run. The rle function looked promising for characterizing the run lengths. I'm looking for the longest run by year, season, and site in a table/data.frame.
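For example, on a toy vector of time differences (made-up numbers):

Code:
diffs <- c(15, 15, 15, 45, 15, 15)    # minutes between consecutive in-range rows
runs  <- rle(diffs)
runs$lengths[runs$values == 15]       # 3 2 -> lengths of the unbroken stretches
max(runs$lengths[runs$values == 15]) * 15 / 1440   # longest stretch, in days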
 

Buckeye

#7
I am thinking in terms of a code sketch. I would create a 0/1 indicator for whether sal is in the desired range.
Then arrange the data based on the datetime.
Finally, take the difference between the first and last times that the indicator is consecutively 1.
I'm still working through the last part as I don't have code written for it.
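Something along these lines might work (untested sketch; it assumes the POSIXct DateTime column you described, groups only by site for brevity, and doesn't yet guard against gaps in the timestamps, only against sal leaving the range):

Code:
library(dplyr)

df %>%
  arrange(site, DateTime) %>%
  group_by(site) %>%
  mutate(in_range = as.integer(between(sal, 40, 50)),
         # new run id every time the indicator flips
         run_id   = cumsum(in_range != lag(in_range, default = first(in_range)))) %>%
  group_by(site, run_id) %>%
  summarise(in_range      = first(in_range),
            duration_days = as.numeric(difftime(max(DateTime), min(DateTime),
                                                units = "days")),
            .groups = "drop") %>%
  filter(in_range == 1) %>%
  group_by(site) %>%
  summarise(longest_run_days = max(duration_days))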
 
#8
This is as close as I've gotten, but I don't know how to add a "group_by" part in the code (without it I only get a single value - I think it's the longest sustained time for the whole df):

Code:
library(dplyr)

FindRun <- function(df){
  # Keep only in-range readings, then measure the gap (in minutes) to the
  # previous kept reading
  DF40_50 <- df %>%
    filter(between(sal, 40, 50)) %>%
    mutate(Lag  = lag(DateTime),
           Diff = as.numeric(difftime(DateTime, Lag, units = "mins")))

  # Runs of 15-minute gaps are unbroken stretches of in-range data
  RUNs    <- rle(DF40_50$Diff)
  Lengths <- RUNs$lengths[RUNs$values == 15]
  MaxRun  <- max(Lengths, na.rm = TRUE)
  MaxRun * 15 / 1440   # 1440 = minutes in a day
}
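
Maybe something like this would work for the per-group part, but I haven't tested it (it assumes the DateTime column exists; season would become another grouping variable once I have it):

Code:
library(dplyr)

# group_modify() calls FindRun() once per site/year and binds the results
df %>%
  group_by(site, yr) %>%
  group_modify(~ tibble(max_run_days = FindRun(.x))) %>%
  ungroup()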
 

Buckeye

#9
How about creating a season variable (wet, dry) based on the months, then grouping on it?

df %>%
  group_by(season) %>%

Or just create 2, 3, or 4 dataframes, one per group, and run the function on each.
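For example (untested; this assumes the FindRun() function from the previous post and the month column mo):

Code:
library(dplyr)

df <- df %>%
  mutate(season = if_else(mo >= 5 & mo <= 10, "wet", "dry"))

df %>%
  group_by(yr, season, site) %>%
  group_modify(~ tibble(max_run_days = FindRun(.x)))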
 
#11
Got it:

Code:
library(dplyr)
library(stringr)
library(lubridate)
library(tidyr)

range.min <- 40
range.max <- 50

# Wet season = May-October, dry season = November-April
z <- df$mo
df$seas <- ifelse(z <= 10 & z >= 5, "wet", "dry")

# Convert the POSIX time column to a military-style "HHMM" string by splitting
# the printed datetime on the space and the colons
df$miltime <- sapply(strsplit(as.character(df$time), " "), function(x) {
  s <- x[[2]]
  s <- strsplit(s, ":")
  sapply(s, function(x) paste0(x[1], x[2]))
})

df <- df %>% mutate(time   = str_pad(miltime, 4, side = "left", pad = "0"),
                    hour   = substring(miltime, 1, 2),
                    minute = substring(miltime, 3, 4))

df$hour   <- as.numeric(df$hour)
df$minute <- as.numeric(df$minute)

result <- df %>%
  mutate(stamp = make_datetime(yr, mo, day, hour, minute),
         high_salinity = between(sal, range.min, range.max)) %>%
  filter(!is.na(stamp), !is.na(high_salinity)) %>%
  arrange(yr, mo, day, `wq#`, stamp) %>%
  group_by(yr, mo, day, `wq#`) %>%
  # "days" = time to the next row (in days); "shift" increments every time
  # high_salinity flips, so each unbroken run gets its own id
  mutate(days  = ifelse(row_number() != n(), -as.numeric(stamp - lead(stamp), units = "days"), 0),
         shift = ifelse(high_salinity == lag(high_salinity) | row_number() == 1, 0, 1) %>% cumsum) %>%
  group_by(yr, mo, day, `wq#`, shift) %>%
  # accumulate elapsed time within each high-salinity run
  mutate(time_high = cumsum(days * high_salinity)) %>%
  group_by(yr, mo, day, `wq#`) %>%
  summarise(max_high = max(time_high)) %>%
  pivot_wider(names_from = yr, values_from = max_high)