# Help writing what I'm doing mathematically

#### trinker

##### ggplot2orBust
I am not a mathematician and thus need major help representing my actions mathematically. First I'll explain what I'm doing and then a crack at writing a formula for it.

I have time stamps for two codes that look like this (call them code x and y):

Code:
#code x
duration start  end
5          1   184  184
55         1   905  905
92         3  1811 1813

#code y
duration start  end
4          2   107  108
6          2   116  117
8          2   131  132
10         1   145  145
12         4   166  169
16         2   212  213
22         5   293  296
58         2   704  705
70         2   877  878
72         2   941  942
109        2  1787 1788
121        1  1982 1982
And here's the structure of those two codes:

Code:
x <- structure(list(duration = c(1L, 1L, 3L), start = c(184L, 905L,
1811L), end = c(184L, 905L, 1813L)), .Names = c("duration", "start",
"end"), class = "data.frame", row.names = c("5", "55", "92"))

y <- structure(list(duration = c(2L, 2L, 2L, 1L, 4L, 2L, 5L, 2L, 2L,
2L, 2L, 1L), start = c(107L, 116L, 131L, 145L, 166L, 212L, 293L,
704L, 877L, 941L, 1787L, 1982L), end = c(108L, 117L, 132L, 145L,
169L, 213L, 296L, 705L, 878L, 942L, 1788L, 1982L)), .Names = c("duration",
"start", "end"), class = "data.frame", row.names = c("4", "6",
"8", "10", "12", "16", "22", "58", "70", "72", "109", "121"))

• I take the first observation of x and calculate the distance to the closest occurrence of y from either end of x. (relational) Repeat for all x_i
• I take the first observation of x and calculate the distance to the closest occurrence of y from the front end of x. (causal; you believe x causes y). repeat for all x_i.

These will output a vector of distances.

Mathematically here's a whack representing it with numbers:
$$min(|y - x_i|)$$

As I write it out the 2 things I am struggling with is how to:
1. Say take the minimum distance
2. For the second calculation method I described above (take only from y values occurring after x_i) how to say only take the positive values.

I can write the code and have it for R if this is useful in understanding what I'm doing though it's not really reproducible as it relies on my package I'm creating:

Code:
cm_describe <- function(code, grouping.var = NULL) {
if (!is.null(grouping.var)) {
if (is.list(grouping.var)) {
m <- unlist(as.character(substitute(grouping.var))[-1])
m <- sapply(strsplit(m, "$", fixed = TRUE), function(x) x[length(x)]) NAME <- paste(m, collapse = "&") } else { G <- as.character(substitute(grouping.var)) NAME <- G[length(G)] } cname <- strsplit(as.character(substitute(code)), "&") NAME <- paste0(cname[[length(cname)]], "&", NAME) group.var <- if (is.list(grouping.var) & length(grouping.var)>1) { apply(data.frame(grouping.var), 1, function(x){ if (any(is.na(x))){ NA } else { paste(x, collapse = ".") } } ) } else { grouping.var } v <- do.call(data.frame, rle(paste2(list(code, group.var)))) } else { v <- do.call(data.frame, rle(code)) } v$end<- cumsum(v[, 1])
colnames(v) <- "duration"
v$start <- c(0, c(v$end+1)[-c(length(v$end))]) v$center <- (v$start + v$end)/2
v2 <- v[, c("values", "center", "duration", "start", "end")]
if (!is.null(grouping.var)) {
nl <- if (is.list(grouping.var)) {
grouping.var
} else {
list(grouping.var)
}
L2 <- lapply(1:(length(nl) + 1), function(i) {
x <- strsplit(as.character(v2[, "values"]), "\\.")
sapply(1:length(x), function(j)x[[j]][i])
}
)
v3 <- data.frame(do.call(cbind, L2))
colnames(v3) <- unlist(strsplit(NAME, "\\&"))
v2 <- data.frame(v3, v2[, -1, drop=FALSE])
} else {
cname <- strsplit(as.character(substitute(code)), "&")
colnames(v2) <- cname[[length(cname)]]
}
return(v2)
}

cm_bidist <- function(code_x, code_y, grouping.var = NULL) {
x <- cm_describe(code_x, grouping.var)
x <- x[as.numeric(as.character(x[, "code_x"])) > 0, ]
y <- cm_describe(code_y, grouping.var)
y <- y[as.numeric(as.character(y[, "code_y"])) > 0, ]
Dnc <- sapply(1:nrow(x), function(i) min(abs(c(y[, "start"],
y[, "end"]) - c(x[i, "start"], x[i, "end"]))))
Dc <- sapply(1:nrow(x), function(i) {
vals <- c(y[, "start"], y[, "end"]) - c(x[i, "start"], x[i, "end"])
if (sum(vals[vals >= 0]) == 0) {
return(NA)  #there should be a penalty for this
} else {
min(vals[vals >= 0])
}
}
)
list(associated_distance = Dnc, mean.sd_assoc_dist = c(mean(Dnc), sd(Dnc)),
causal_distance = Dc, mean.sd_causal_dist = c(mean(na.omit(Dc)), sd(na.omit(Dc))))
}

#### BGM

##### TS Contributor
Let $$\mathcal{Y}$$ be the set of collection of the all occurrences of $$y$$ and artificially define a subset $$Y_1 = \{y: ...\}$$ such that

$$\min_{y \in \mathcal{Y}} |y - x_i|$$

and $$\min_{y \in \mathcal{Y}_1} |y - x_i|$$

correspond to the two thing you want to write?

#### SiBorg

##### New Member
I really wish I understood set theory...

#### Dason

##### Ambassador to the humans
trinker can you give a small input example and an example of the quantities you calculate?

#### trinker

##### ggplot2orBust
Thanks for the feedback thus far:

So let's put it into context. Let's say I want to observe someone and record the duration in seconds of their farts and burps. I want to know if there's a relationship between farting and burping. SO this is where the idea of measuring the distance between events comes from. Let's have a visual of this data: Now the data frame that created it:
Code:
    obs center duration start end
[COLOR="red"]1  fart    2.0        2     1   3
2  fart    7.5        1     7   8
3  fart   11.5        1    11  12
4  fart   18.0        2    17  19
5  fart   21.5        1    21  22
6  fart   23.5        1    23  24
7  fart   26.5        1    26  27
8  fart   29.5        1    29  30
9  fart   34.5        7    31  38
10 fart   41.5        3    40  43
11 fart   44.5        1    44  45
12 fart   49.5        1    49  50[/COLOR]
[COLOR="#4169e1"]13 burp    5.5        1     5   6
14 burp   10.0        2     9  11
15 burp   12.5        1    12  13
16 burp   17.5        1    17  18
17 burp   19.5        1    19  20
18 burp   21.5        1    21  22
19 burp   26.5        3    25  28
20 burp   29.5        1    29  30
21 burp   31.5        1    31  32
22 burp   34.5        1    34  35
23 burp   37.5        1    37  38
24 burp   46.5        1    46  47[/COLOR]
And here's the output I created:

Code:
#============================================================
#  Looking at the distance of nearest burp to each fart
#============================================================
> cm_bidist(fart, burp)
$associated_distance  2 1 0 0 0 1 1 0 0 2 1 2$mean.sd_assoc_dist
 0.8333333 0.8348471

$causal_distance  2 1 0 0 0 1 1 0 0 3 1 NA$mean.sd_causal_dist
 0.8181818 0.9816498
#==============================================================
#  Now looking at the distance of nearest fart to each burp
#==============================================================
> cm_bidist(burp, fart)
$associated_distance  1 0 0 0 0 0 1 0 0 3 0 1$mean.sd_assoc_dist
 0.500000 0.904534

$causal_distance  1 0 0 0 0 0 1 0 0 3 0 2$mean.sd_causal_dist
 0.5833333 0.9962049
I think calling it causal is wrong but right now I'm just trying to get the math for it so I can express the distance measures in a serious way when I talk to statisticians on campus. I think BGM may have it or at least really close as a mathematician friend of mine on campus quickly listened to the problem and looked at the formula and he said that sounds about right. But I want to make sure it's right. So as Dason has suggested I provided a sampled data set and the outcomes of that. If you need more information let me know.

EDIT: I made substantial edits to this post per Dason's comments below. The code has been changed to derive start end points but the mathematical means of finding the distance has not changed.

#### Dason

##### Ambassador to the humans
Can you actually explain how you calculated those values though. It's just that my understanding from the first post doesn't match up with the output you provided.

#### trinker

##### ggplot2orBust
Mistake in programming. Didn't see it until Dason pointed it out as I was working with a more complex problem until farts and burps.

I was attempting to do c(y_all) - c(x_s, x_e) and it wasn't working as expected. I'm pretty sure it is now. I edited the above response to show the new output (I think it's behaving like I explained in the original post).

#### Dason

##### Ambassador to the humans
Can you still work out an example? For instance I don't see why the causal distance between the burp that starts at 27 is 1. It seems to me that if we're looking for causal then wouldn't we want to only look at farts that started after the burp started?

If you work out an example for both distances it might be easier (at least for me) to help you express it mathematically. Right now I only have what you said in the first post (which isn't entirely too clear to me) and my guessing as to how things are done based on the sample input and output.

#### trinker

##### ggplot2orBust
The graphic is wrong too, better fix that #### Dason

##### Ambassador to the humans
Another question...

You have a fart that starts at 26, ends at 27. This gave a causal distance of 1. The only logical value I can find for this is the burp that started at 25 and ended at 28. However, it seems to me that since the burp actually starts before the fart this shouldn't happen. It also seems that if we were to calculate the distance between these intuitively we should get 0 since they occurred at the same time. Can you provide some clarification for this situation?

#### trinker

##### ggplot2orBust
Per Dason's points on the function's faulty logic (or the creator of the function ) here is a revised version. I haven't looked at it myself yet but I think it adresses Dason's concerns. Now the data frame that created it:
Code:
    obs center duration start end
[COLOR="red"]1  fart    2.0        2     1   3
2  fart    7.5        1     7   8
3  fart   11.5        1    11  12
4  fart   18.0        2    17  19
5  fart   21.5        1    21  22
6  fart   23.5        1    23  24
7  fart   26.5        1    26  27
8  fart   29.5        1    29  30
9  fart   34.5        7    31  38
10 fart   41.5        3    40  43
11 fart   44.5        1    44  45
12 fart   49.5        1    49  50[/COLOR]
[COLOR="#4169e1"]13 burp    5.5        1     5   6
14 burp   10.0        2     9  11
15 burp   12.5        1    12  13
16 burp   17.5        1    17  18
17 burp   19.5        1    19  20
18 burp   21.5        1    21  22
19 burp   26.5        3    25  28
20 burp   29.5        1    29  30
21 burp   31.5        1    31  32
22 burp   34.5        1    34  35
23 burp   37.5        1    37  38
24 burp   46.5        1    46  47[/COLOR]
And here's the output I created:

Code:
#============================================================
#  Looking at the distance of nearest burp to each fart
#============================================================
> cm_bidist(fart, burp)
$associated_distance  2 1 0 0 0 1 1 0 0 2 1 2$mean.sd_assoc_dist
 0.8333333 0.8348471

$causal_distance  2 1 0 0 3 1 2 1 8 3 1 NA$mean.sd_causal_dist
 2.000000 2.236068
#==============================================================
#  Now looking at the distance of nearest fart to each burp
#==============================================================
> cm_bidist(burp, fart)
$associated_distance  1 0 0 0 0 0 1 0 0 3 0 1$mean.sd_assoc_dist
 0.500000 0.904534

$causal_distance  1 0 4 3 1 1 1 1 8 5 2 2$mean.sd_causal_dist
 2.416667 2.274696
EDIT: DAson says...
trinker - you have a fart that starts at 31, ends at 38, and the distance is 8. I think I see how you're getting it but I think the distance should be 0 because there is a burp from 34 to 35.

#### trinker

##### ggplot2orBust
Try 4...

First the piece of the function that calculates the causal distance:

Code:
       Dc <- sapply(1:nrow(x), function(i) {
if (sum(y[, "start"] > x[i, "start"] & y[, "start"] <= x[i, "end"]) > 0) {
return(0)
}
inds <- y[, "start"] > x[i, "end"]
vals <- y[inds, "start"] - x[i, "end"]
if (sum(vals[vals >= 0]) == 0) {
return(NA)  #there should be a
} else {
min(vals[vals > 0])
}
}
)
This calculates the associative distance:
Code:
    Dnc <- sapply(1:nrow(x), function(i) {
yind <- 1:nrow(y)
if (sum(y[, "start"] >= x[i, "start"] & y[, "start"] <= x[i, "end"]) > 0 |
sum(y[, "end"] >= x[i, "start"] & y[, "end"] <= x[i, "start"]) > 0 |
sum(sapply(yind, function(j) {
y[j, "start"] < x[i, "start"] & y[j, "end"] > x[i, "end"]
}
)) > 0) {
return(0)
}
sdif <- c(y[, "start"], y[, "end"]) - x[i, "start"]
edif <- c(y[, "start"], y[, "end"]) - x[i, "end"]
min(abs(c(sdif, edif)))
}
)
Now the results:

Code:
#============================================================
#  Looking at the distance of nearest burp to each fart
#============================================================
> cm_bidist(fart, burp)
$associated_distance  2 1 0 0 0 1 0 0 0 2 1 2$mean.sd_assoc_dist
 0.7500000 0.8660254

$causal_distance  2 1 0 0 3 1 2 1 0 3 1 NA$mean.sd_causal_dist
 1.272727 1.103713
#==============================================================
#  Now looking at the distance of nearest fart to each burp
#==============================================================
> cm_bidist(burp, fart)
$associated_distance  1 0 0 0 0 0 0 0 0 0 0 1$mean.sd_assoc_dist
 0.1666667 0.3892495

$causal_distance  1 0 4 3 1 1 0 1 8 5 2 2$mean.sd_causal_dist
 2.333333 2.348436
For causal:
I made it that if the beginning of a y code occurs after the start (not at the same time otherwise we can't say x caused y if they co-occur) of an x code but before the end it's a distance of 0.

For associative:
If an x code starts or ends in the middle of a y code it's 0; and vise versa if a y code starts or ends in the middle of an x code 0; and if an x code is contained within a y code 0.

EDIT: Dason says...
f <- function(xstart, xend, ystart){max(0, min(ystart[ystart > xstart] - xend))}

#### trinker

##### ggplot2orBust
I've been attempting to move forward on this idea for some time but have been stalled by two impasses:

1. I want to distill Dnc in the way Dason distilled Dc (though not necessary)
2. I still have no idea how to represent what I'm doing mathematically. It seems more complicated than when i first proposed the idea per the sticking point Dason showed

After that I need to do some serious thought about how to apply a test of significance but I'm far from there.

Here's what we have so far:

Data set:
Code:
x <- structure(list(duration = c(1L, 1L, 3L), start = c(184L, 905L,
1811L), end = c(184L, 905L, 1813L)), .Names = c("duration", "start",
"end"), class = "data.frame", row.names = c("5", "55", "92"))

y <- structure(list(duration = c(2L, 2L, 2L, 1L, 4L, 2L, 5L, 2L, 2L,
2L, 2L, 1L), start = c(107L, 116L, 131L, 145L, 166L, 212L, 293L,
704L, 877L, 941L, 1787L, 1982L), end = c(108L, 117L, 132L, 145L,
169L, 213L, 296L, 705L, 878L, 942L, 1788L, 1982L)), .Names = c("duration",
"start", "end"), class = "data.frame", row.names = c("4", "6",
"8", "10", "12", "16", "22", "58", "70", "72", "109", "121"))
Image of Codes Data view
Code:
    obs center duration start end
[COLOR="red"]1  fart    2.0        2     1   3
2  fart    7.5        1     7   8
3  fart   11.5        1    11  12
4  fart   18.0        2    17  19
5  fart   21.5        1    21  22
6  fart   23.5        1    23  24
7  fart   26.5        1    26  27
8  fart   29.5        1    29  30
9  fart   34.5        7    31  38
10 fart   41.5        3    40  43
11 fart   44.5        1    44  45
12 fart   49.5        1    49  50[/COLOR]
[COLOR="#4169e1"]13 burp    5.5        1     5   6
14 burp   10.0        2     9  11
15 burp   12.5        1    12  13
16 burp   17.5        1    17  18
17 burp   19.5        1    19  20
18 burp   21.5        1    21  22
19 burp   26.5        3    25  28
20 burp   29.5        1    29  30
21 burp   31.5        1    31  32
22 burp   34.5        1    34  35
23 burp   37.5        1    37  38
24 burp   46.5        1    46  47[/COLOR]
Code:
cm_bidist <- function(code_x, code_y, grouping.var = NULL) {
x <- cm_describe(code_x, grouping.var)
x <- x[as.numeric(as.character(x[, "code_x"])) > 0, ]
y <- cm_describe(code_y, grouping.var)
y <- y[as.numeric(as.character(y[, "code_y"])) > 0, ]
Dnc <- sapply(1:nrow(x), function(i) {
yind <- 1:nrow(y)
if (sum(y[, "start"] >= x[i, "start"] & y[, "start"] <= x[i, "end"]) > 0 |
sum(y[, "end"] >= x[i, "start"] & y[, "end"] <= x[i, "start"]) > 0 |
sum(sapply(yind, function(j) {
y[j, "start"] < x[i, "start"] & y[j, "end"] > x[i, "end"]
}
)) > 0) {
return(0)
}
sdif <- c(y[, "start"], y[, "end"]) - x[i, "start"]
edif <- c(y[, "start"], y[, "end"]) - x[i, "end"]
min(abs(c(sdif, edif)))
}
)
Dc <- sapply(1:nrow(x), function(i) {
FUN <- function(xstart, xend, ystart){
max(0, min(ystart[ystart > xstart] - xend))
}
suppressWarnings(FUN(x[i, "start"], x[i, "end"], y[, "start"]))
}
)
Dc[is.infinite(Dc)] <- NA
list(associated_distance = Dnc, mean.sd_assoc_dist = c(mean(Dnc), sd(Dnc)),
causal_distance = Dc, mean.sd_causal_dist = c(mean(na.omit(Dc)), sd(na.omit(Dc))))
}