Data visualization poll

jpkelley

TS Contributor
#1
Hi,
Originally, I thought this might be appropriate for the R thread, but I think this may have more general relevance.

I'm trying to decide on a data visualization option, and the right and left sides of my brain are at war. In brief, I've modeled predator densities at two sites (in a zero-inflated GLMM framework). I then simulated 10000 random draws from the distributions defined by the parameter estimates and their standard errors. Two options at this point:

1) Boxplot with the >1.5 IQR values excluded (since these data come from a simulation). Note that 10000 random draws stabilized the whisker lengths.
2) Simple plot of median +/- 95% "confidence" intervals.

I like 95% CIs better, but I also prefer the boxplot visualization because it shows the 25/75 percentiles. Any preference? You can see they almost align. Also, I'd rather not deviate too much from a classical Tukey-style boxplot, if that's the recommendation. I get twitchy around "modified" boxplots, since I've run across people who have a difficult time interpreting them.

So, again, any preference...or is this a case of splitting hairs on a frog?

Best,
Patrick

P.S. In case any of you want to play with the code used to produce these plots, here it is:
Code:
library(ggplot2)

## simulate data
siteA<-exp(rnorm(10000, mean=-0.950701217, sd=0.698376574))
siteB<-exp(rnorm(10000, mean=0.169387235 , sd=0.424403903))
df<-data.frame(site=rep(c("siteA", "siteB"), each=10000), counts=c(siteA, siteB))

## Calculate 95% "confidence" intervals
siteA.qnt<-quantile(siteA, probs=c(0.025, 0.50, 0.975))
siteB.qnt<-quantile(siteB, probs=c(0.025, 0.50, 0.975))
qnt<-rbind(siteA.qnt, siteB.qnt)

df2<-data.frame(site=c("siteA", "siteB"), qnt)
colnames(df2)<-c("site", "lower", "median", "upper")

## boxplot with outliers (>1.5 IQR) suppressed, since this is a visualization from a simulation
p<-ggplot(data=df, aes(x=site, y=counts))
p+geom_boxplot(outlier.colour="NA")
last_plot()+scale_y_continuous(limits=c(0,3))  ## warning message expected

## median with 95% interval (2.5% and 97.5% quantiles) as error bars
p <- ggplot(df2, aes(x=site, y=median))
p+geom_point(aes(x=site, y=median))
last_plot()+geom_errorbar(aes(ymax=upper, ymin=lower), width=0.2)
last_plot()+scale_y_continuous(limits=c(0,3))  ## warning message expected
 

trinker

ggplot2orBust
#2
For me personally, I think the box plot conveys the more interesting data. Visually, it is also more grabbing than the CIs and medians. Now I ask: why stop at one graphic? The scales appear to be the same. Why not plot one on top of the other and have both pieces of information? R is extremely good at doing this. I don't know if you want the CI completely superimposed or next to the boxes. I also remember a library function that gives box plots with CIs. I think it was the psych package and the function was boxplot.
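A minimal sketch of the "both pieces of information in one graphic" idea, re-using the simulated draws from the first post. It uses base graphics (boxplot, arrows, points) rather than ggplot2 only to stay dependency-free; the seed, the 0.35 horizontal offset, and the red colour are illustrative assumptions, not anything from the thread.

```r
set.seed(42)  # assumed seed, only so the sketch is reproducible

## simulate data (same parameters as the first post)
siteA <- exp(rnorm(10000, mean = -0.950701217, sd = 0.698376574))
siteB <- exp(rnorm(10000, mean = 0.169387235, sd = 0.424403903))
counts <- list(siteA = siteA, siteB = siteB)

## 2.5%, 50% and 97.5% quantiles, one column per site
qnt <- sapply(counts, quantile, probs = c(0.025, 0.5, 0.975))

## Tukey boxplot with the >1.5 IQR points suppressed
bp <- boxplot(counts, outline = FALSE, ylim = c(0, 3), boxwex = 0.4,
              ylab = "counts")

## 95% interval and median drawn just to the right of each box
xpos <- seq_along(counts) + 0.35
arrows(xpos, qnt[1, ], xpos, qnt[3, ], angle = 90, code = 3,
       length = 0.05, col = "red")
points(xpos, qnt[2, ], pch = 19, col = "red")
```

Setting the offset to 0 would superimpose the interval on the box instead of placing it alongside.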
 

jpkelley

TS Contributor
#3
I'm leaning towards the boxplots as well...more interesting, for sure, without adding too much "chart junk." I was intending to avoid mixing the info, but I agree that it (meaning a boxplot with CIs, or an adjacent CI) would allow for better, and perhaps more intuitive, visualization. As it stands, the ends of the boxplot whiskers and the 95% CIs are generally the same.

Thanks, trinker.
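One way to check how closely the whisker ends and the 95% quantile limits agree is to compute both from the same draws. The sketch below is only illustrative: compare_limits is a hypothetical helper name and the seed is an assumption. Note that with right-skewed draws like these, the lower whisker can simply stop at the smallest value, so agreement at the lower end may be looser than at the upper end.

```r
set.seed(42)  # assumed seed, only so the sketch is reproducible

siteA <- exp(rnorm(10000, mean = -0.950701217, sd = 0.698376574))
siteB <- exp(rnorm(10000, mean = 0.169387235, sd = 0.424403903))

## hypothetical helper: whisker ends vs. 2.5%/97.5% quantiles for one sample
compare_limits <- function(x) {
  whiskers <- boxplot.stats(x)$stats[c(1, 5)]  # lower and upper whisker ends
  q95 <- unname(quantile(x, probs = c(0.025, 0.975)))
  out <- rbind(whisker = whiskers, quantile95 = q95)
  colnames(out) <- c("lower", "upper")
  out
}

limitsA <- compare_limits(siteA)
limitsB <- compare_limits(siteB)
print(limitsA)
print(limitsB)
```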
 

Jake

Cookie Scientist
#4
I find the box plots more visually appealing but also would prefer my visualizations to be based on the data I actually analyzed, not quantiles. A nice compromise I have used in the past is to use what a teacher of mine called "oval plots." They are basically differently-shaped box plots--the advantage here being that it conveniently circumvents the box plot convention of being based on somewhat arbitrary quantiles. In the example below I plot some distributions at their means +/- 1 and 2 standard deviations, but you can use whatever intervals you like best.
Code:
# generate data
A <- rnorm(50,mean=30,sd=15)
B <- rnorm(50,mean=20,sd=10)

# get some descriptive stats
means <- c(mean(A),mean(B))
sds <- c(sd(A),sd(B))
depvar <- "Values"
name1 <- "Group A"
name2 <- "Group B"
data <- c(A,B)
x <- c(rep(-.5,length(A)), rep(.5,length(B))) 

# build oval plot
z <- c(-.5,.5)
plot(data ~ x, ylab=depvar, ylim=c(-10,70), yaxp=c(-10,70,8), xlim=c(-1,1), xlab="", xaxt='n',type="n",family="serif",ps=26)
axis(side=1,at=z, labels=c(name1,name2))
segments(z,means-2*sds,z,means+2*sds,lwd=8.5,col="dark grey")
segments(z,means-sds,z,means+sds,lwd=25,col='grey')
points(z,means,col='black',pch=19,lwd=3)
 

trinker

ggplot2orBust
#6
Jake, thanks for sharing that. That's the first I'd seen of those oval plots. Is this the prof's convention or a more well-documented plot? A Google search doesn't turn up much on "oval plot". I tend to think of means and SDs in terms of a normal distribution (density-type plot), but the oval plot brought a new way for me to visualize the data. Interesting.

To some extent you lose a sense of where the majority of your distribution lies; the box plots give you more of a sense of this (though it is common knowledge that roughly 95% of a normal distribution is between +/- 2 SDs).
 

jpkelley

TS Contributor
#7
I just played with a modification of Jake's oval plot (and made it a boxplot), and it looks great. When I make it look a bit better, I'll post.
 

Jake

Cookie Scientist
#8
I've only ever heard of oval plots from the professor here, and as far as I know he must have made them up. I was going to use them in a manuscript once but we ended up scrapping that figure before the final submitted version.
 

gianmarco

TS Contributor
#9
Hi!

Just a few additions to this interesting discussion. In an article dealing with archaeological issues, I saw the use of "bullet graphs". This kind of graph is somewhat similar to the violin or oval chart discussed in this thread, but differs in the following: a bullet graph displays the mean or median and, at the same time, the 68-88-99% confidence intervals. In this way, you can get an idea of the central values of the groups being compared, and at the same time it is possible to eyeball the significance of the difference. Of course, as already noted in this discussion, you lose the sense of the shape of the distributions. It all depends on what you are interested in (shape vs significance of the difference between means/medians).

Alternatively, you could use notched boxplots that, if I am not mistaken, should also display confidence intervals.
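For reference, a minimal sketch of a notched boxplot in base R, using data generated the same way as in Jake's example (the seed is an assumption). In R's boxplot, the notch extends to the median +/- 1.58 * IQR / sqrt(n), which the R documentation describes as roughly a 95% interval for comparing medians. Because the notch width depends on n, it would be very narrow for 10000 simulated draws and would mostly reflect the number of draws.

```r
set.seed(1)  # assumed seed, only so the sketch is reproducible

## same setup as the oval plot example
A <- rnorm(50, mean = 30, sd = 15)
B <- rnorm(50, mean = 20, sd = 10)

## notched Tukey boxplot; non-overlapping notches suggest different medians
bp <- boxplot(list("Group A" = A, "Group B" = B), notch = TRUE,
              ylab = "Values")

## lower and upper notch limits, one column per group
print(bp$conf)
```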

As for bullet graphs, I am not aware of any stats package that has implemented them. I have implemented them in a free Excel template available from my website. Search for my template in the "Other software" section of this forum.

Hope this helps
Best Regards
Gm
 

jpkelley

TS Contributor
#11
I think this type of "bullet graph" would be pretty easy to implement in R. Layers of shaded regions to illustrate the qualitative cut-off points, target values, etc. I'm not sure if it would capture my data's asymmetrical distribution, but I see that there's some consensus about making the standard deviation (-2, -1, 0, +1, +2 standard dev) very clear in a plot. Perhaps we should name this a "standard percentile boxplot" or something...uh...clever.
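A rough base R sketch of such a plot, following gianmarco's description (a central value plus nested confidence intervals) rather than his Excel template, which is not shown in the thread. Everything here is an assumption made for illustration: the function name bullet_ci, the t-based intervals for the mean, the default 68/95/99% levels, the grey shades, and the seed. Because the intervals are symmetric around the mean, this would not capture an asymmetrical distribution.

```r
## hypothetical helper: mean with nested t-based confidence intervals
## groups must be a named list of numeric vectors
bullet_ci <- function(groups, levels = c(0.68, 0.95, 0.99), ylab = "Values") {
  levels <- sort(levels, decreasing = TRUE)  # draw the widest interval first
  n  <- sapply(groups, length)
  m  <- sapply(groups, mean)
  se <- sapply(groups, sd) / sqrt(n)

  ## half-widths of the intervals: one row per group, one column per level
  half <- vapply(levels,
                 function(l) qt(1 - (1 - l) / 2, df = n - 1) * se,
                 numeric(length(groups)))
  half <- matrix(half, nrow = length(groups),
                 dimnames = list(names(groups), paste0(100 * levels, "%")))

  x <- seq_along(groups)
  plot(x, m, type = "n", xaxt = "n", xlab = "", ylab = ylab,
       xlim = c(0.5, length(groups) + 0.5),
       ylim = range(m - half[, 1], m + half[, 1]))
  axis(side = 1, at = x, labels = names(groups))

  ## lighter shade for wider intervals, darker for narrower ones
  shades <- grey(seq(0.85, 0.45, length.out = length(levels)))
  for (j in seq_along(levels)) {
    rect(x - 0.15, m - half[, j], x + 0.15, m + half[, j],
         col = shades[j], border = NA)
  }
  points(x, m, pch = 19)
  invisible(list(mean = m, half_width = half))
}

## usage with data generated as in the oval plot example
set.seed(1)
A <- rnorm(50, mean = 30, sd = 15)
B <- rnorm(50, mean = 20, sd = 10)
res <- bullet_ci(list("Group A" = A, "Group B" = B))
```

Swapping mean for median and the t-based half-widths for quantiles of the simulated draws would be one way to adapt this to skewed data.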
 

gianmarco

TS Contributor
#12
Gm,

Any possibility of creating a function or lines of code to do this in R?

Here is a link to bullet graphs on Wikipedia for anyone interested.

Gm, I assume this is what you're talking about, but I have not tried your template in Excel. I'm not a real proficient Excel user.
Hi!
Please find attached a picture of "my" bullet graph (as provided by my Excel template). As you can see, you can quickly eyeball the means of the two samples, as well as the confidence intervals. It is quickly evident that the difference in mean values is significant (p = 0.02).

Of course, its use makes sense when the analyst's interest is in conveying the difference in mean values between samples, and not in providing information on the shape of the distribution. In the latter case, I believe that a modification of the boxplot is better (notched boxplot: see this pdf by Kristin Potter).

As for R, it would be interesting to implement bullet graphs in R, but I am not an R user.

As for my Excel template, you just have to copy and paste your dataset; that's all.


Best Regards
Gm
 

gianmarco

TS Contributor
#13
p.s.
I am having trouble uploading the picture. I do not understand why...I tried different formats, but nothing happens....
I will keep trying.....

Gm
 

trinker

ggplot2orBust
#15
Another way to visualize the data is the ehplot from the plotrix library. For large data sets this is probably not appropriate but it works for yours.

Code:
## simulate data
siteA<-exp(rnorm(10000, mean=-0.950701217, sd=0.698376574))
siteB<-exp(rnorm(10000, mean=0.169387235 , sd=0.424403903))
df<-data.frame(site=rep(c("siteA", "siteB"), each=10000), counts=c(siteA, siteB))

library(plotrix)
with(df,ehplot(counts,site,box=TRUE,col=as.numeric(factor(site))+2 ))  ## factor() needed: site is character in R >= 4.0
tab.title("Ehplot Site Counts",tab.col="gray")
Or to add just too much :)...
Code:
meanA <- mean(siteA)
meanB <- mean(siteB)
meanoverall <- mean(df$counts)

with(df,ehplot(counts,site,box=TRUE,boxborder="black",col=as.numeric(factor(site))+2 ))
tab.title("Ehplot Site Counts",tab.col="gray")

abline(h=meanoverall,lwd=2,lty = 1,col="black") 
abline(h=meanA,lwd=2,lty = 3,col="black")
abline(h=meanB,lwd=2,lty = 2,col="black")  

legend(.5,7.5,c("Pooled Mean","Mean GroupA","Mean GroupB"),
lty=c(1,3,2))
 
#16
Hej

this is indeed rather interesting ... I'd only like to add that it depends a lot on whom you want to show this to.

We often have to present to people who have little or no knowledge about statistics, and they LOVE boxplots. Not only do they LOVE them but they really understand what they mean. So it is a wonderful way for us to convey some variance information in addition to the means and medians, for a change! We mainly use them for income data, though.

But I'm sure these people would have a hard time getting the idea of a confidence interval and end up confusing them with boxplots anyhow ;)
 

jpkelley

TS Contributor
#18
Same for me. Google Docs didn't work. Don't stress about it, GM. Have you tried taking a screen capture or something, or is it just a problem with the upload? Thanks so much for trying. But, seriously, I don't want anyone using too much of their free time doing stuff like this. I'll feel badly about it.

Best,
Patrick