Controlling Automated Labels in Cut

bryangoodrich

Probably A Mammal
#1
So if you're following the R Mapping Thread, you may have noticed a simple R API I created to access and map BLS unemployment data through their FTP site.

I created a classification routine that applies a very quick-and-dirty quantile classification scheme (others can be added later, such as equal-interval and possibly natural breaks, though I believe package sp or maptools may already have a routine for this).

Code:
classification <- function(x, n, method = "quantile")
# Currently, method is useless, but later it may provide options for 
# alternative classification methods
{
    cutoffs <- quantile(x, seq(0, 1, length = n), na.rm = TRUE)
    cut(x, cutoffs)
}  # end function
The problem is that I have no control over how cut creates the levels. For instance, employment numbers with the quantiles shown below result in the following levels:

Code:
# Levels
[1] "(46,3.87e+03]"       "(3.87e+03,8.02e+03]" "(8.02e+03,1.58e+04]"
[4] "(1.58e+04,3.88e+04]" "(3.88e+04,4.26e+06]"

# Quantiles
       0%       20%       40%       60%       80%      100% 
     46.0    3874.8    8020.4   15848.4   38823.4 4262266.0
I'd like to be able to control this to, say, a certain number of significant digits, resulting in levels like

Code:
(46, 3870], (3870, 8020], (8020, 15800], (15800, 38800], (38800, 4260000]
I tried using round and signif on "cutoffs", but that has no effect, because cut builds the labels itself and automatically switches to scientific notation. In fact, I'd like more control over this automation, including comma separators on large numbers. Does anyone know how I can control the "pretty text" of the levels that cut produces? I'm trying to keep this simple and automated, though I might turn to other spatial packages that may already have solved this problem. (I'm trying to keep my dependencies down.)
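
For what it's worth, cut does expose one knob for the automatic labels: the dig.lab argument (default 3) sets how many digits are used when the break values are formatted. The sketch below reuses the quantiles quoted above as breaks; the emp vector is made up purely for illustration. Raising dig.lab avoids the scientific notation, but it does not add comma separators, so custom labels are still needed for that.

```r
# Breaks copied from the quantiles quoted above
breaks <- c(46, 3874.8, 8020.4, 15848.4, 38823.4, 4262266)

# Made-up employment values, one per class (illustration only)
emp <- c(46, 5000, 12000, 20000, 4262266)

# Default dig.lab = 3: large breaks are shown in scientific notation
default_levels <- levels(cut(emp, breaks, include.lowest = TRUE))

# dig.lab = 7: enough digits to print every break in fixed notation
wide_levels <- levels(cut(emp, breaks, include.lowest = TRUE, dig.lab = 7))

print(default_levels)
print(wide_levels)
```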
 

bryangoodrich

Probably A Mammal
#2
In working out the result myself, I came up with

Code:
z <- format(signif(quantile(x$emp, seq(0,1, length=6)), 4), trim = TRUE, nsmall = 0, big.mark = ",")
sapply(seq(z)[-1], function(n) paste("(", z[n-1], ", ", z[n], "]", sep = ""))
# [1] "(46, 3,875]"         "(3,875, 8,020]"      "(8,020, 15,850]"    
# [4] "(15,850, 38,820]"    "(38,820, 4,262,000]"
What I'm still missing is how to make the first element "[46, 3,875]" so that it matches the cut argument include.lowest = TRUE that I'm going to use. The best idea I can come up with is to temporarily store the result of the sapply call (say, back into 'z') and then do something like

Code:
z[1] <- sub("\\(", "[", z[1])
 

bryangoodrich

Probably A Mammal
#3
Turns out I can resolve this by applying the formatting within the paste call for each of the components individually

Code:
classification <- function(x, n, sig = 4, method = "quantile") {
    prettify  <- function(x, s = sig) format(signif(x, s), trim = TRUE, nsmall = 0, big.mark = ",")
    cutoffs   <- quantile(x, seq(0, 1, length = n+1), na.rm = TRUE)
    labels    <- sapply(seq(cutoffs)[-1], function(n) {paste("(", prettify(cutoffs[n-1]), ", ", prettify(cutoffs[n]), "]", sep = "")})
    labels[1] <- sub("\\(", "[", labels[1])
    cut(x, cutoffs, labels = labels, include.lowest = TRUE)
}  # end function
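
One possible reason formatting each cutoff on its own is the safer route: format() applied to a whole vector gives every element a common number of decimals, so a single fractional cutoff changes how all the others print. A minimal sketch with made-up cutoffs (not the BLS data):

```r
# Made-up cutoffs, one of them fractional (illustration only)
cutoffs <- c(0.5, 1234, 4262266)

# Formatting the whole vector at once: elements share a common format
all_at_once <- format(cutoffs, trim = TRUE, big.mark = ",")

# Formatting one value at a time: each keeps its own minimal representation
one_by_one <- vapply(cutoffs, format, character(1),
                     trim = TRUE, big.mark = ",")

print(all_at_once)
print(one_by_one)
```

With whole-number cutoffs like the ones in this thread the two approaches agree, so this only matters once fractional break points show up.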
 

Dason

Ambassador to the humans
#4
I'm too tired to piece together your code snippets. But grabbing each snippet and pasting it into R isn't giving me any workable code. Can you provide one workable chunk with all of your code combined?
 

bryangoodrich

Probably A Mammal
#5
It's okay. I solved it, at least from one perspective. I changed how n is used, so it may not match some earlier snippets (I want to be able to call classification(x, 5) and get 5 classes back). To fix my problem I needed to format the classification values as I paste them; otherwise the formatting was lost from the pasted string. I also messed up my code because I had one copy I was working from interactively and another I was keeping in my script (and pasting here), and I mixed the two up a bit, especially when I altered the process flow of the above snippet: I had been looping over labels but needed to loop over cutoffs instead. I got strange values for that reason, but that's fixed. It all works now.

Code:
x <- c(22199, 74134, 8586, 7539, 23697, 3114, 7902, 47548, 12129, 
10319, 17276, 4419, 8472, 4612, 5762, 19753, 22651, 3831, 3763, 
14760, 5926, 34485, 18230, 12427, 25880, 32793, 12793, 40186, 
5853, 11757, 9872, 2679, 5949, 6272, 41230, 23533, 272461, 4564, 
39132, 13683, 59572, 34780, 3952, 7747, 156373, 6887, 9929, 37024, 
167286, 6615, 93647, 50472, 2971, 6907, 14422, 7916, 19079, 32579, 
91495, 3940, 32067, 15308, 80839, 24769, 5921, 2615, 7783, 977, 
2720, 143815, 6002, 1011, 1254, 1895, 43268, 1271, 931, 17477, 
24510, 7589, 6193, 989, 38717, 3565, 5101, 2640, 2090, 4390, 
574, 3203, 4777, 2141, 2517, 275, 2511, 19463, 59051, 69780, 
21128, 13207, 3635, 7001, 1816882, 82066, 35455, 446962, 113027, 
15842, 87930, 68541, 10200, 8434, 15596, 101256, 15789, 4977, 
2348, 12825, 4149, 9847, 5766, 10935, 3660, 9496, 9467, 43835, 
25567, 19250, 7559, 3422, 4864, 7291, 52423, 7534, 4802, 38746, 
7903, 16653, 9709, 14209, 5508, 15683, 4775, 6730, 31655, 10516, 
2654, 6604, 3129, 4428, 5742, 9082, 30285, 6929, 5807, 18726, 
18932, 3313, 3822, 3812, 3200, 11086, 4347, 8016, 4490, 9685, 
7875, 27434, 3779, 177072, 6919, 9124, 46266, 4424, 3249, 55534, 
7175, 5674, 4160, 16153, 6322, 93678, 31518, 3025, 9801, 670008, 
371, 15136, 89993, 16957, 9497, 463689, 10136, 79363, 365005, 
10658, 53882, 54239, 8604, 309992, 51224, 20718, 11802, 4262266, 
56540, 120566, 8373, 38166, 87090, 3564, 7797, 191629, 67116, 
44787, 1429690, 156769, 8346, 779484, 588623, 21243, 733771, 
1393866, 414410, 247151, 122335, 339289, 199838, 776917, 128999, 
70465, 1342, 15627, 189090, 227565, 196054, 33778, 21143, 4122, 
172767, 22032, 384317, 85332, 22756, 203876, 8348, 282724, 5510, 
2313, 2381, 160965, 28018, 7650, 1250, 4912, 3451, 1277, 1734, 
1889, 14709, 290731, 835, 147684, 26582, 11734, 269407, 17589, 
29354, 3368, 7850, 8455, 557, 2877, 1128, 275345, 883, 4476, 
3344, 27949, 162765, 7396, 2849, 10606, 70523, 428, 7770, 11888, 
18338, 13798, 8275, 2556, 8474, 2341, 9810, 6214, 66981, 3995, 
6347, 13090, 3107, 496, 4724, 1521, 14933, 10896, 2664, 107501, 
6273, 434159, 420738, 96314, 87843, 408942, 137683, 79804, 58541, 
66392, 239258, 83932, 300663, 120484, 10850, 80966, 11412, 237219, 
885915, 5615, 61285, 50657, 85701, 127264, 28042, 13505, 5080, 
393844, 124599, 28312, 4845, 19041, 6987, 4951, 5658, 4148, 11173, 
14695, 53701, 36396, 530189, 8184, 53721, 20567, 6008, 2813, 
121121, 242042, 136204, 15117, 3668, 6324, 124599, 116334, 56444, 
1114824, 42549, 32359, 88561, 16570, 535509, 123591, 546255, 
170889, 390562, 240903, 29039, 88159, 107901, 64423, 141981, 
213151, 30770, 16459, 8138, 4924, 223067, 15483, 28529, 8876, 
8323, 2565, 4025, 1393, 15879, 9367, 30795, 39922, 5993, 6981, 
65735, 4724, 6374, 7103, 15356, 28579, 8713, 8728, 2032, 17834, 
3773, 45890, 31609, 3785, 117559, 2209, 9182, 98281, 58283, 1285, 
114904, 2378, 333223, 12704, 17988, 56378, 5820, 54092, 5332, 
8039, 7276, 9914, 9569, 335662, 7761, 4110, 36573, 56870, 4856, 
1981, 25750, 8151, 8773, 4330, 9507, 45758, 43153, 79057, 8818, 
428224, 11782, 976, 35080, 22151, 10287, 6332, 371826, 17595, 
80899, 2351, 10899, 15057, 8913, 4162, 85972, 65774, 3368, 24182, 
5528, 4493, 5797, 1981, 2994, 12774, 6878, 3646, 18926, 16334, 
23732, 3309, 6203, 47784, 11427, 9511, 4495, 4344, 14183, 2956, 
7970, 3160, 8840, 12240, 3941, 8005, 16919, 77465, 40434, 17154, 
7040, 59737, 10544, 13204, 7742, 7036, 18176, 4019, 8593, 837, 
5854, 2329, 81603, 34904, 1535, 5944, 3497, 24258, 11955, 2000, 
11571, 2760, 670, 8321, 2988, 3506, 3857, 19216, 16112, 12101, 
5071, 2500, 27811, 3703, 4038, 9877, 10000, 28611, 36357, 13069, 
2076, 6459, 10353, 1015, 2519, 10954, 37935, 2644, 3835, 3887, 
8844, 75146, 414515, 29064, 68682, 177562, 1655, 36555, 3018, 
3631, 21317, 11407, 3057, 17813, 47011, 4015, 1400, 499, 75081, 
3479, 10183, 508, 2848, 2368, 10330, 5512, 5159, 6122, 8326, 
6473, 10726, 9445, 63582, 15823, 3407, 1654, 2255, 15368, 9413, 
17503, 2180, 4104, 9990, 3203, 5163, 4741, 35318, 3886, 4228, 
35525, 2631, 7766, 22258, 3391, 17333, 2250, 7470, 7186, 96826, 
16447, 7219, 5826, 17178, 25265, 2331864, 8855, 5042, 54245, 
8239, 9326, 481005, 9238, 2894, 16923, 9346, 6412, 15875, 16586, 
2417, 6305, 23459, 3715, 8480, 1603, 3431, 24512, 15418, 30248, 
4564, 18766, 10526, 12071, 4656, 243387, 49711, 54287, 23708, 
327288, 52501, 7437, 16462, 17210, 12376, 15604, 163497, 85966, 
48772, 21598, 125119, 16119, 6449, 6752, 6490, 6538, 7819, 17086, 
11979, 16090, 7353, 24137, 87984, 8379, 8199, 8004, 1709, 2548, 
2874, 14015, 6646, 71241, 111192, 11651, 101986, 3953, 2490, 
10120, 2574, 21755, 66179, 7340, 32947, 5503, 8627, 7685, 7401, 
7144, 26966, 329287, 32274, 124134, 19735, 12941, 156834, 34161, 
3771, 5490, 25620, 6590, 8753, 16568, 49992, 11351, 14745, 4526, 
13812, 22892, 11081, 17542, 48040, 20141, 78029, 8283, 33953, 
7358, 10098, 8748, 15261, 29082, 14269, 133109, 32428, 17334, 
66815, 19330, 30743, 16765, 18829, 13686, 10108, 13952, 11715, 
67122, 18419, 36495, 13055, 196174, 44912, 18758, 54503, 407364, 
19571, 4816, 13925, 64372, 16883, 32659, 5895, 18387, 2680, 8883, 
10193, 7245, 8246, 5221, 74153, 12091, 6127, 15063, 11164, 12234, 
8164, 112837, 9467, 20797, 9396, 9158, 13948, 7839, 5177, 74872, 
6357, 3158, 83080, 6893, 45048, 14253, 4140, 28667, 12084, 27265, 
12491, 10902, 15345, 3991, 1971, 7099, 5646, 3054, 13761, 70084, 
14586, 13078, 10286, 10307, 8003, 4890, 11744, 7095, 10766, 23388, 
6444, 6183, 4463, 9055, 9181, 25931, 9089, 32931, 3642, 3836, 
10098, 19199, 8744, 50396, 5341, 10075, 7746, 5531, 3561, 4680, 
6521, 5318, 7176, 5252, 8493, 7272, 8680, 4607, 4811, 3679, 7885, 
10214, 15973, 7270, 76816, 9965, 5155, 8500, 15503, 115273, 5433, 
4483, 6690, 7641, 10762, 15962, 18930, 7547, 5245, 4169, 3978, 
4661, 21584, 7458, 3052, 7028, 4846, 13823, 3818, 226722, 45594, 
9726, 2329, 5515, 83743, 7214, 19020, 46637, 8246, 3134, 6645, 
3494, 17088, 24147, 11216, 3083, 17723, 5233, 11586, 52042, 4094, 
6439, 7019, 4137, 7743, 2689, 15078, 7987, 5346, 29478, 1442, 
1715, 10439, 1535, 1207, 5166, 5593, 5226, 1029, 17168, 18651, 
1632, 10295, 4060, 59394, 1682, 1450, 18090, 3739, 19074, 18946, 
12217, 13371, 1586, 1401, 4018, 3401, 709, 3172, 1247, 3280, 
16510, 2265, 954, 6550, 8990, 1835, 275853, 2128, 4284, 1527, 
10135, 1124, 29640, 1851, 3965, 1646, 17228, 15759, 6411, 5830, 
2488, 14594, 3453, 16459, 2858, 1655, 5796, 7938, 1779, 2697, 
7878, 2231, 3108, 3726, 2962, 10707, 5674, 1277, 32640, 2741, 
5628, 36697, 2541, 1635, 3389, 28998, 2802, 231064, 10498, 88457, 
1603, 3706, 2166, 2153, 1181, 2258, 10364, 4070, 1827, 3496, 
877, 3242, 1183, 4522, 1609, 62036, 8452, 7568, 9973, 3687, 16884, 
4333, 8082, 58835, 8521)

classification <- function(x, n, sig = 4, method = "quantile") {
    prettify  <- function(x, s = sig) format(signif(x, s), trim = TRUE, nsmall = 0, big.mark = ",")
    cutoffs   <- quantile(x, seq(0, 1, length = n+1), na.rm = TRUE)
    labels    <- sapply(seq(cutoffs)[-1], function(n) {paste("(", prettify(cutoffs[n-1]), ", ", prettify(cutoffs[n]), "]", sep = "")})
    labels[1] <- sub("\\(", "[", labels[1])
    cut(x, cutoffs, labels = labels, include.lowest = TRUE)
}  # end function

levels(classification(x, 5))
That should give you something like this:

Code:
levels(classification(x, 5))
# [1] "[275, 3,953]"        "(3,953, 7,755]"      "(7,755, 14,590]"    
# [4] "(14,590, 43,180]"    "(43,180, 4,262,000]"

quantile(x, seq(0,1, length = 6))
#        0%       20%       40%       60%       80%      100% 
#     275.0    3952.8    7755.4   14589.2   43176.0 4262266.0
 

trinker

ggplot2orBust
#6
Bryan, I'm not quite following. I think this relates to the problem of the maps being keyed differently. I was thinking of overcoming this by supplying the cut points to cut instead of telling it how many cuts to make. Are you saying I could use this classification function instead? If so, what are the advantages of this function over using cut and supplying the cut points? From what I can see, the advantage is that it uses integers instead of the fractional forms (post 1).
 

bryangoodrich

Probably A Mammal
#7
The points at which you define your classes define your classification. In GIS there are a number of approaches to this. I often find myself using quantiles because they are based on the data and how the values are distributed. Equal interval is another option, but I'd more likely use that if I'm creating a standard set of classes to be used across a number of maps for comparison (say, comparing the past decade and using equal intervals, because the quantiles in each year differ, as my examples show). There's also natural breaks, which historically were done by hand and eye, but Jenks provides an algorithm to do it. The basic idea is that where you see values clustering, you put them in one class.

I recommend reading about it online, or even in ESRI's ArcGIS 10 help pages. My function only does quantile classification right now, but if I wanted it to offer, say, equal interval, I'd take the range of the data and divide it by the number of classes to get the class width. Easy algorithm. But per my discussion above, that's for one data set. I should still provide the ability to pass break points for classes, as one should always be able to define the classes manually. I'm thinking of automated solutions, and using quantiles works for now.
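
As a rough sketch of the equal-interval idea (not part of my function yet): take the range of the data and split it into n classes of equal width. The helper name equal_interval_breaks and both data vectors are made up for illustration. The same break points can then be passed to cut for a second data set, which is the point of a shared standard across maps.

```r
# Hypothetical helper: n + 1 equally spaced break points over the data range
equal_interval_breaks <- function(x, n) {
    rng <- range(x, na.rm = TRUE)
    seq(rng[1], rng[2], length.out = n + 1)
}

# Made-up values for two years (illustration only)
emp_year1 <- c(275, 3953, 7755, 14590, 43180, 100275)
emp_year2 <- c(300, 4100, 8000, 25000, 61000, 99000)

# Derive the breaks once, from the first year
breaks <- equal_interval_breaks(emp_year1, 5)

# Apply the same manually supplied breaks to both years so the classes match
class_year1 <- cut(emp_year1, breaks, include.lowest = TRUE, dig.lab = 7)
class_year2 <- cut(emp_year2, breaks, include.lowest = TRUE, dig.lab = 7)

print(breaks)
print(table(class_year1))
print(table(class_year2))
```

One caveat of reusing breaks: values in the second data set that fall outside the first data set's range come back as NA, so the breaks would need to cover every year being compared.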