Thanks, brain trust

- Thread starter: greg.neyman
- Tags: cycle transformation


So I grabbed Google Flu data, which is seasonal, from http://www.google.org/flutrends. I performed a traditional dummy variable analysis, using aggregate United States Flu values:

You can see that JAN (the intercept) and FEB are positively correlated with flu, and APR-OCT are negatively correlated with flu. Somewhat common sense, although I thought I'd see more in Nov, Dec, and Mar. Whatevs.

Then, I performed the following transformation on the date: COS((2*pi)*([day of year]/365)), and regressed the transformed value against Flu:

I specifically chose cosine because this makes it peak toward the end/beginning of the year and hit its nadir in the summer, giving me the expected flu curve. You can see the result I got, and it is significant, but how do I compare the two results?
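The transform is simple enough to sketch directly (pure Python, same formula as above):

```python
import math

def cos_transform(day_of_year):
    """COS((2*pi) * (day_of_year / 365)): near +1 around Jan 1,
    near -1 around early July, matching the expected flu seasonality."""
    return math.cos(2 * math.pi * day_of_year / 365)

print(round(cos_transform(1), 4))    # close to +1: winter peak
print(round(cos_transform(183), 4))  # close to -1: summer nadir
```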

Thanks

Call:
lm(formula = Flu ~ FEB + MAR + APR + MAY + JUN + JUL + AUG +
    SEP + OCT + NOV + DEC, data = data)

Residuals:
    Min      1Q  Median      3Q     Max
-1987.8  -406.1  -146.7    38.0  5544.0

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   2431.3      151.4  16.063  < 2e-16 ***
FEB            545.5      218.9   2.492   0.0131 *
MAR           -413.0      217.2  -1.902   0.0580 .
APR          -1226.3      217.2  -5.646 3.16e-08 ***
MAY          -1419.0      212.6  -6.675 8.56e-11 ***
JUN          -1562.6      226.6  -6.896 2.17e-11 ***
JUL           -1667.2     222.5  -7.493 4.59e-13 ***
AUG           -1543.6     220.6  -6.996 1.16e-11 ***
SEP            -978.7     224.5  -4.360 1.67e-05 ***
OCT            -459.2     214.1  -2.145   0.0325 *
NOV            -284.1     215.6  -1.318   0.1883
DEC             220.7     217.2   1.016   0.3102
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 908.1 on 389 degrees of freedom
Multiple R-squared: 0.3988, Adjusted R-squared: 0.3818
F-statistic: 23.45 on 11 and 389 DF, p-value: < 2.2e-16


Then, I performed the following transformation on the date: COS((2*pi)*([day of year]/365)), and regressed the transformed value against Flu:

Call:
lm(formula = Flu ~ Transform, data = data)

Residuals:
    Min      1Q  Median      3Q     Max
-1315.1  -467.2  -190.4    38.8  5520.6

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  1689.10      45.81   36.87   <2e-16 ***
Transform    1001.09      65.13   15.37   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 916.5 on 399 degrees of freedom
Multiple R-squared: 0.3719, Adjusted R-squared: 0.3703
F-statistic: 236.3 on 1 and 399 DF, p-value: < 2.2e-16
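On the "how do I compare the two results?" question: the two models aren't nested (neither is a restriction of the other), so a straight F-test doesn't apply. One common approach, which is my suggestion rather than something from the thread, is an information criterion such as AIC, which for a Gaussian linear model can be computed from the RSS. A sketch using the numbers from the two summaries above (RSS recovered as residual SE squared times residual df):

```python
import math

def aic_from_rss(rss, n, k):
    """Gaussian AIC up to an additive constant: n*log(RSS/n) + 2*(k + 1),
    where k is the number of regression coefficients and the +1 counts
    the estimated error variance."""
    return n * math.log(rss / n) + 2 * (k + 1)

n = 401                             # 389 + 12 = 399 + 2 observations
rss_dummy = 908.1 ** 2 * 389        # dummy model: 12 coefficients
rss_transform = 916.5 ** 2 * 399    # cosine model: 2 coefficients
print(round(aic_from_rss(rss_dummy, n, 12), 1))
print(round(aic_from_rss(rss_transform, n, 2), 1))
# Lower AIC = preferred; AIC rewards the cosine model for spending
# far fewer degrees of freedom on nearly the same fit.
```

In R itself, `AIC(model1, model2)` on the two `lm` fits does this directly.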


Thanks


So I used a random 60% sample of the flu data as a training set for both of my models above, re-ran the regressions, and then generated predictions from each model on the 40% test set. I then did a paired t-test of the actual flu numbers against the predictions from each model:

Test            Mean     CI95 lo  CI95 hi  Median   25th%    75th%    p (paired t)
Observed        1631.08  1485.97  1776.19  1508     902      2021     (ref)
Dummy           1785.18  1663.13  1907.24  1933.6   993.6    2515.4   0.0108
Transformation  1805.18  1681.28  1929.09  1889.56  998.37   2540.19  0.0031

It seems that the transformation model works better at predicting than the dummy model.

What say you, brain trust? Am I reaching?
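One caveat worth flagging (my comment, not from the thread): a paired t-test of predictions against observations measures systematic bias, i.e. whether the predictions run high or low on average, not predictive accuracy. Read that way, the smaller p-value for the transformation model is stronger evidence that its predictions differ from the observations, which would actually favor the dummy model under this metric. A direct accuracy measure such as RMSE on the holdout set may be more informative. A pure-Python sketch with made-up numbers showing how the two measures can disagree:

```python
import math

def paired_t_stat(pred, obs):
    """Paired-sample t statistic: mean of the differences divided by the
    standard error of the differences. Detects systematic bias, not
    closeness of individual predictions."""
    d = [p - o for p, o in zip(pred, obs)]
    n = len(d)
    mean_d = sum(d) / n
    var_d = sum((x - mean_d) ** 2 for x in d) / (n - 1)
    return mean_d / math.sqrt(var_d / n)

def rmse(pred, obs):
    """Root-mean-square error: a direct measure of predictive accuracy."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs))

# Toy illustration: unbiased-but-noisy vs biased-but-tight predictions.
obs    = [10.0, 20.0, 30.0, 40.0, 50.0]
noisy  = [15.0, 14.0, 36.0, 34.0, 56.0]   # errors mostly cancel on average
biased = [12.1, 21.9, 32.1, 41.9, 52.1]   # consistently ~2 too high
print(round(paired_t_stat(noisy, obs), 3), round(rmse(noisy, obs), 3))
print(round(paired_t_stat(biased, obs), 3), round(rmse(biased, obs), 3))
```

The biased predictions produce an enormous t statistic (tiny p) yet a much smaller RMSE than the noisy ones, so "smaller paired-t p-value" does not mean "predicts better".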



The problem with this transformation is that if you don't know whether there is a cyclical relationship, you have to try both a sine and a cosine transformation, regressed independently, and pick the better result, which REEKS of post hoc analysis. Also, by picking the better of the two transformations, it is conceptually possible that an artifactual cyclicality will appear that has a nice p-value but doesn't actually occur in real life.

That's why I want to know if someone has blazed this trail already. Not necessarily with sine & cosine, but something that'll turn cyclical into continuous data.
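For what it's worth, one standard way around the sine-versus-cosine dilemma (often called harmonic regression; a suggestion on my part, not something from the thread) is to include both terms in a single model. Since a*sin(ωt) + b*cos(ωt) equals one cosine with amplitude sqrt(a² + b²) and a fitted phase, the data choose where the peak lands and there is no post hoc pick between the two transforms. A self-contained sketch on synthetic data, solving the least-squares problem by hand:

```python
import math

def fit_harmonic(days, y, period=365.0):
    """Fit y ~ c0 + a*sin(2*pi*t/period) + b*cos(2*pi*t/period) by OLS
    (3x3 normal equations via Gaussian elimination with pivoting).
    Returns (level, amplitude, peak day of the fitted cycle)."""
    rows = [[1.0,
             math.sin(2 * math.pi * t / period),
             math.cos(2 * math.pi * t / period)] for t in days]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(3)]
    m = [xtx[i] + [xty[i]] for i in range(3)]        # augmented matrix
    for col in range(3):                             # forward elimination
        piv = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, 3):
            f = m[r][col] / m[col][col]
            for c in range(col, 4):
                m[r][c] -= f * m[col][c]
    beta = [0.0] * 3                                 # back substitution
    for i in (2, 1, 0):
        beta[i] = (m[i][3] - sum(m[i][j] * beta[j]
                                 for j in range(i + 1, 3))) / m[i][i]
    c0, a, b = beta
    amplitude = math.hypot(a, b)
    peak_day = (math.atan2(a, b) * period / (2 * math.pi)) % period
    return c0, amplitude, peak_day

# Synthetic flu-like weekly series peaking around day 15 (mid-January):
days = list(range(0, 365, 7))
y = [1700 + 1000 * math.cos(2 * math.pi * (t - 15) / 365) for t in days]
c0, amp, peak = fit_harmonic(days, y)
print(round(c0), round(amp), round(peak))  # recovers level 1700, amplitude 1000, peak day 15
```

In R this is just `lm(Flu ~ sin(2*pi*doy/365) + cos(2*pi*doy/365))`, and the pair of terms can be tested jointly against the null model.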

Thanks

Every function Y(x) (not just cyclical ones) can be represented in the form you see in the paper. The sin/cos series you see is an expansion in an 'orthogonal basis' (excuse spelling). How many sines and cosines will you include in your model: infinity? 2? 3? Maybe 22?

So I did a Fourier transform on the flu data with R, and I think I got complex numbers back. Here are some samples:

Fraction of year = Transformed
0.742465753 = 199.54794521 + 0.00000000i
0.761643836 = -2.24882026 + 0.80043764i
0.780821918 = -0.55672289 + 0.75627255i
0.800000000 = -1.81340395 + 0.95325734i

What on earth do I do with this??
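Those complex numbers are expected: each Fourier coefficient encodes one frequency's amplitude (its modulus) and phase (its argument). For a real-valued series you normally inspect the modulus of the first half of the coefficients; a genuine annual cycle shows up as a spike at the frequency of one cycle per year. A sketch with a plain DFT on synthetic weekly data (R's `fft` output would be post-processed the same way; the amplitude scaling below is one common convention):

```python
import cmath
import math

def dft(x):
    """Plain discrete Fourier transform (O(n^2); fine for a sketch)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

# Synthetic weekly series: 4 years (208 weeks) with one cycle per 52 weeks.
n = 208
x = [10 + 3 * math.cos(2 * math.pi * t / 52) for t in range(n)]
coefs = dft(x)

# Amplitude of each component: the modulus, scaled by 2/n
# (the k=0 term is the series mean and is scaled by 1/n instead).
amps = [abs(c) * (2 / n) for c in coefs[:n // 2]]
amps[0] = abs(coefs[0]) / n
peak_k = max(range(1, n // 2), key=lambda k: amps[k])
print(peak_k, round(amps[peak_k], 3))  # 4 cycles in 208 weeks = annual, amplitude 3
```

So the move is: take `Mod()` of the coefficients (in R), plot them against frequency, and look for a spike at one cycle per year; the imaginary parts only carry phase information.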

