Vectorized comparison of column values

#1
Hi all,

I have a large dataset in which each observation contains a time stamp. Now I'm trying to calculate a new variable x that counts, for each observation, how many other observations happened within +/- 5 minute of that observation. I tried doing this with a for-loop (see below), but for some reason the loop uses 0.7 s per cycle, which makes it way to slow for the amount of observations I have. My hunch is that there is an easy solution to this using lapply, but I have no experience using the apply family of function. Thanks in advance for any help!

Klemens

for (i in 1 : nrow(data1)) {
data1$x <- nrow(subset(data1, difftime(data1$DateTime, data1$DateTime, unit = "min") < 5
& difftime(data1$DateTime, data1$DateTime, unit = "min") > -5))
}
 
Last edited:
#2
It's useful if you could provide a sample of your data but you could try something like:

Code:
sapply(data1$DateTime, function(x) sum(abs(difftime(x, data1$DateTime, units = "secs")) <= 300) - 1)
 
#3
Thanks! I tried to use the code you posted, but the function just freezes (?). Apparently it still needs a lot of time? I'm also trying to make the time comparisons dependant on an ID variable, such that only observations with the same ID as the current x get compared and counted... Below please find the code and dataset:

test <- read.csv("https://dl.dropboxusercontent.com/u/4136460/sampledata.csv")
test$exactfreq <- sapply(test$DateTime, function(x) sum(abs(difftime(x, test$DateTime, units = "min")) <= 5) - 1)

Thanks again,
Klemens
 

trinker

ggplot2orBust
#4
On my machine this takes ~30 seconds to run testing each obs against all others within that ID variable group:

Code:
test <- read.csv("https://dl.dropboxusercontent.com/u/4136460/sampledata.csv")

if (!require("pacman")) install.packages("pacman"); library(pacman)
p_load(data.table)

# convert to data time
setDT(test)[, DateTime := as.POSIXct(as.character(DateTime))]

# difference function
differ <- function(x) unlist(lapply(x, function(y) sum(abs(y - x) < 301 )))

test[, count := differ(DateTime), by = "ID"][, count := count - 1]

Giving:

Code:
           X            DateTime ID count
    1:     1 2014-05-20 15:00:11  3     3
    2:     2 2014-05-20 15:02:25  3     4
    3:     3 2014-05-20 15:03:02  3     4
    4:     4 2014-05-20 15:04:44  3     5
    5:     5 2014-05-20 15:07:24  3     5
   ---                                   
43680: 43680 2014-07-13 18:58:05  5    21
43681: 43681 2014-07-13 18:58:21  5    19
43682: 43682 2014-07-13 18:58:51  5    17
43683: 43683 2014-07-13 18:59:13  5    16
43684: 43684 2014-07-13 18:59:29  5    14