Dason

Can you explain the x-axis a little more? I don't know what the units of measurement are. And in the faceted plots I would have thought it was just a plot essentially of when we were chatting over time but all of the plots start at the left and most don't take up the full plotting region - what causes that?

Edit: I think I understand it now but I'd still like to hear your explanation. I was confused before because you said the y-axis was words but that doesn't make sense.

trinker

ggplot2orBust
Yeah the x axis is time but the unit of measure is actually words. So time is measured in words. All days start with 0 words. I suppose I have functions that could plot it in time but I was looking to demonstrate something easier. The restriction ("all of the plots start at the left and most don't take up the full plotting region") on the scales is that it is unfair (IMO) to compare facets when scales are allowed to be free. I may relax this in the future though.

I'm currently working on some functions to deal with time measures rather than words as the units, but I didn't foresee this need until some of my recent work as an RA, so I didn't include this functionality in qdap initially.
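A rough sketch of what "time measured in words" means, using base R on made-up data. This is only an illustration of the idea, not qdap's actual implementation; the toy rows and the helper columns `n_words`, `start` and `end` are assumptions.

```r
# Toy chat log; the column names mirror the thread, the rows are made up.
chat <- data.frame(
  date     = c("d1", "d1", "d1", "d2", "d2"),
  person   = c("Dason", "trinker", "Dason", "vinux", "trinker"),
  dialogue = c("hello there", "hi", "how are you today",
               "good morning all", "morning"),
  stringsAsFactors = FALSE
)

# Words per turn: split on runs of whitespace and count the pieces.
chat$n_words <- lengths(strsplit(trimws(chat$dialogue), "\\s+"))

# Within each day, a turn ends at the running word total and starts
# where the previous turn ended, so every day begins at 0 words.
chat$end   <- ave(chat$n_words, chat$date, FUN = cumsum)
chat$start <- chat$end - chat$n_words

chat[, c("date", "person", "start", "end")]
```

Because each day restarts at 0 and days differ in total words, a shared x scale leaves short days filling only the left part of their facet.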

Dason

No - it's fine that they don't take up the full plotting region. It was just that my intuition was that the x-axis was time and it didn't logically make sense the way the plots were laid out... which is why I was asking for clarification. But I think I get it now.

trinker

ggplot2orBust
It didn't help that the default for this was duration.default. An oversight. If you try it now it will say duration (words).

trinker

ggplot2orBust
Dason challenged the scales free idea and I decided it shouldn't be up to me if you use it or not. I plotted three different versions playing with scales and colors. Click here to see:

https://dl.dropbox.com/u/61803503/by_date4.pdf

Code:
library(qdap); library(talkstats)

dat <- ts_chatbox()
with(dat, gantt_plot(dialogue, person, bar.color="black"))
with(dat, gantt_plot(dialogue, person, date, ncol = 3, scale = "free_x"))
with(dat, gantt_plot(dialogue, person, date, ncol = 3, bar.color="black"))

vinux

Dark Knight
I liked the above graph. This is more comprehensible than your first graph.

bryangoodrich

Probably A Mammal
One of the things in the energy industry that is important is looking at smart meter data (meters with wifi giving interval data--15, 60 min data maybe). For instance, we want to be able to tease out from data certain phenomena happening at regular intervals. For instance, one contract's algorithms could take a year's worth of data for a household and find their baseline by looking at the hourly data.

I bring that up because I looked at that last graph and it's entirely incomprehensible looking at daily graphs what the outcome is over all those days. For instance, when do I tend to talk the most? I'm thinking you could generate a single graph for each chatter that is sort of like a heat map where it's brightest when they talk the most and cold where they're most absent. Make sense? Implementing it, not so easy lol

The first plot here sort of does that but it has a larger wave or whatever for when someone talks the most on a given day. I think the idea is to aggregate that information over multiple days at a given time location. That way, you end up with a composite or aggregate time value, but that plot is actually just as informative as the heat map idea I had. In fact, it visually does a good job at showing you where someone is very active, especially if it's with respect to other chatters.

trinker

ggplot2orBust
@BG I think that wouldn't be useful for what we're trying to show, which is relationships. Time was pretty well shown already by Vinux. Secondly, the unit I used is words, so you couldn't really tell when you talk. I didn't use times. The gantt is better in this case for relationships, which is what we're after, in that you can see clusters.

Incomprehensible means that something can't be comprehended. That really is not an accurate assessment. Which graphic you use depends on what you're attempting to show. If you're looking for when you're most active, then a line graph of hourly intervals would be better, or perhaps a heat map as you suggest, but I think the line graph would be better suited. But this idea would convey nothing about the relationship between chatters. One more thing that throws a monkey wrench into time is that there's a universal time zone being used. So when it says I'm really active at 6 am, that's not true. For me it's probably 12 am, but time is a relative concept with world chatters.

As far as implementing the heat map, it would be pretty easy: create a new variable that turns times into hours and then use geom_tile. I've done this as a calendar heat map with relative ease.
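On the time zone point, here is a minimal base R sketch of shifting recorded hours into a chatter's local time. It assumes the chatbox timestamps are stored in UTC; the timestamps are made up and "America/Denver" is just an arbitrary zone picked for illustration.

```r
# Made-up timestamps, assumed to be recorded in UTC.
stamp_utc <- as.POSIXct(c("2012-10-01 06:15:00", "2012-10-01 23:40:00"),
                        tz = "UTC")

# Hour of day as recorded vs. hour of day on the chatter's own clock.
hour_utc   <- as.integer(format(stamp_utc, "%H", tz = "UTC"))
hour_local <- as.integer(format(stamp_utc, "%H", tz = "America/Denver"))

data.frame(hour_utc, hour_local)
```

Each chatter would need their own zone for this to be meaningful, which the chat data itself doesn't provide.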

trinker

ggplot2orBust
Here's the line and heat plot:

Code:
library(ggplot2); library(talkstats); library(plyr); library(reshape2)
library(scales)  # provides rescale(), used for the heat plot below
dat <- ts_chatbox()
dat$hour <- sapply(strsplit(as.character(dat$time), ":"),
function(x) x[c(T, F, F)])
dat2 <- data.frame(with(dat, table(person, hour)))

#line plot
ggplot(dat2, aes(hour, Freq, group=person)) +
geom_line(aes(colour=person), size=1) +
facet_wrap(~person, ncol=3)+
theme(axis.text.x=element_text(angle=270), legend.position="none")

#heat plot
x2 <- melt(dat2)
x2 <- ddply(x2, .(person), transform,
rescale = rescale(value))

ggplot(x2, aes(person, hour, group=person)) + geom_tile(aes(fill = rescale),
colour = "white") + scale_fill_gradient(low = "white",
high = "red") + theme_grey() + labs(x = "",
y = "") + scale_x_discrete(expand = c(0, 0)) +
scale_y_discrete(expand = c(0, 0)) + theme(legend.position = "none",
axis.ticks = element_blank(), axis.text.x = element_text(angle = -90,
hjust = 0, colour = "grey50"))

bugman

Super Moderator
haha, I'm one of the "others". Good to know where you stand in this world! ;-)

But Vinux, seriously, this is awesome.

bryangoodrich

Probably A Mammal
Tr, Vinux's plot was an intraday graphic. I'm saying take a composite over multiple days so that an hour slice represents some function of the aggregate at that hour on all days. Then any people that have clustering should align closely and all be in one graph. Your multi-day plot appears to have different time scales on the x-axis. The only way to see how people tend to cluster over days would be if the plots were stacked on top of each other with the same scale. Then you can essentially take a slice down the plots to see how people cluster at that time. Trying to evaluate something like that for a month or a year would basically be incomprehensible. Instead, I was proposing an aggregation of the multi-day information into one plot. Each person's outcome is a function of the distribution of the phenomena over multiple days.

vinux

Dark Knight
Plot1: Clustering based on the time, using 24 variables (0-1, 1-2, ...).
Plot2: I have seen the season package in the latest R Journal. We could achieve the same with stars.
Plot3: Further drill-down of plot2.
View attachment 2783

Code:
ts <- read.csv(insert_url_for_dropbox_csv_here, stringsAsFactors=FALSE)

ts$dt <- strptime(ts$Date, "%m/%d %H:%M")
ts$dttime <- as.POSIXlt(ts$dt, "IST")

ts$hourgroup <- cut(ts$dttime$hour, breaks=c(-1, 6, 12, 18, 24), labels=c("0-6", "6-12", "12-18", "18-24"))
ts$wdays <- factor(weekdays(ts$dttime, abbreviate=TRUE), levels=c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"))
chat.wtable <- table(ts$wdays)

## PLOT 1
par(mfrow=c(1, 2))
plot(hclust(dist(as.data.frame.matrix(table(ts$User.Name, ts$dttime$hour))), "ave"), main="Cluster Based on Time", xlab="Users", frame.plot=T, yaxt="n")

## PLOT 2
library(season)
plotCircular(chat.wtable, labels=names(chat.wtable), lines=T, pieces.col="brown", main="Weekly data")

## PLOT 3
stars(as.data.frame.matrix(table(ts$wdays, ts$hourgroup)), key.loc= c(6, 4.5), key.labels="Clock", draw.segments=TRUE, col.segments=gray(c(0.1, .9, .8, .2)), nrow=4, ncol=3, main="Chats by Time and Weekday", frame.plot=TRUE)

The clustering brings out timezone matches. Even if I change the methods, the clusters look similar. The second graph shows the seasonality. It seems the chatbox is very active in the last three working days. I am thinking of exploring grid graphics. It would be fun adding "SVG + Javascript" to make it interactive. If anyone can suggest some interesting data, we could all analyze it.

trinker

ggplot2orBust
@BG still not sure if I know what you mean. Is this what you're thinking? Plus, should the scaling be done by person or just overall? I did it by person.

Code:
library(ggplot2); library(talkstats); library(plyr); library(reshape2)
library(scales)  # provides rescale()
dat <- ts_chatbox()
dat$hour <- sapply(strsplit(as.character(dat$time), ":"),
function(x) x[c(T, F, F)])

dat3 <- data.frame(with(dat, table(person, date, hour)))
x3 <- melt(dat3)
x3 <- ddply(x3, .(person), transform, rescale = rescale(value))

ggplot(x3, aes(hour, date, group=person)) + geom_tile(aes(fill = rescale),
colour = "white") + scale_fill_gradient(low = "gold",
high = "darkviolet") + theme_grey() + labs(x = "",
y = "") + scale_x_discrete(expand = c(0, 0)) +
scale_y_discrete(expand = c(0, 0)) + theme(legend.position = "none",
axis.ticks = element_blank(), axis.text.x = element_text(angle = -90,
hjust = 0, colour = "grey50")) + facet_wrap(~person, ncol=3)

Code:
ggplot(x3, aes(hour, person, group=person)) + geom_tile(aes(fill = rescale),
colour = "white") + scale_fill_gradient(low = "gold",
high = "darkviolet") + theme_grey() + labs(x = "",
y = "") + scale_x_discrete(expand = c(0, 0)) +
scale_y_discrete(expand = c(0, 0)) + theme(legend.position = "none",
axis.ticks = element_blank(), axis.text.x = element_text(angle = -90,
hjust = 0, colour = "grey50")) + facet_wrap(~date, ncol=3)

bryangoodrich

Probably A Mammal
Vinux, that star graph is nice! It clearly shows that people talk the most Wednesday through Friday and aren't that active on the weekends, which I know, but I would always think people would be more active! Mainly because that's when I'm more active since I work all day and can't use TS lol

Trinker, that heat map looks good. The scales are all the same now so when I look at a certain point on one graph then I know it's the same time point on the other graph, but a nice transformation of the data now would be to look at a time point (say Dason at 8 PM) across all the days. Maybe just take the median value. Do that for all time points. Then we'd have a composite single graph representing what hour people are the busiest. With Vinux's star graph, this composite can also be done by day, so we can see which hour people are most active on a given day given all the days data we have.

trinker

ggplot2orBust
@BG I tried median and it wasn't informative except for three people so I used mean instead.

Code:
ggplot(x3, aes(hour, person, group=person)) + geom_tile(aes(fill = rescale),
colour = "white") + scale_fill_gradient(low = "white",
high = "darkblue") + theme_grey() + labs(x = "",
y = "") + scale_x_discrete(expand = c(0, 0)) +
scale_y_discrete(expand = c(0, 0)) + theme(legend.position = "none",
axis.ticks = element_blank(), axis.text.x = element_text(angle = -90,
vjust = .1, colour = "grey50"))
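
For reference, the composite BG describes (one value per person per hour, collapsed over all days) can be sketched in base R with aggregate. The counts are made up and the names `counts`, `hourly_mean` and `hourly_median` are hypothetical; this is not the code behind the plot above.

```r
# Made-up counts of chat entries per person, date and hour.
counts <- data.frame(
  person = rep(c("Dason", "trinker"), each = 6),
  date   = rep(rep(c("d1", "d2", "d3"), each = 2), times = 2),
  hour   = rep(c("08", "20"), times = 6),
  Freq   = c(0, 5, 0, 7, 1, 9,
             2, 0, 4, 0, 6, 3)
)

# Collapse the date dimension: one summary value per person/hour cell.
hourly_mean   <- aggregate(Freq ~ person + hour, data = counts, FUN = mean)
hourly_median <- aggregate(Freq ~ person + hour, data = counts, FUN = median)

hourly_mean
hourly_median
```

With sparse counts many person/hour cells have a median of 0 even when the mean is not, which may be one reason the median version looked uninformative.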

spunky

Can't make spagetti
lol @ trinker & Dason being close together... the eternal battle of the bots VS the raptors will continue forever...

ps- so... why am i associated with jimmy brooks?

trinker

ggplot2orBust
If you're talking about Vinux's it's because you both speak during the same time intervals. However we don't know if it's on the same day. Also Jimmy Brooks has almost no words spoken (see below):

This gives some basic word statistics (ignore sentence because I didn't break it up by sentence; that's more our turns of talk or chat inputs).
Code:
library(qdap); library(talkstats)
dat <- ts_chatbox()
with(dat, word_stats(dialogue, person))
Here's a link to a txt data frame of the stats as it's pretty long: LINK

vinux

Dark Knight
Trinker, I used the old data for my last graphs, where Jimmy was one of the top ten. For clustering, my idea was to identify the time zone (just because we have the reference, so it is easily comparable). That is why I created the variables that way.