pretty() ain't always so pretty
As part of his continuing campaign to destroy my productivity, Aleks sporadically sends me emails such as the following: the subject line was “algorithm for graph labels” and the message read, in its entirety:
http://books.google.com/books?id=fvA7zLEFWZgC&pg=PA61&lpg=PA61#v=onepage&q=&f=false
I had to look, and what I found was a three-page article by someone named Paul Heckbert, published in a 1990 book called Graphics Gems. The article was called “Nice numbers for graph labels” and, through the magic of Google Books + Print-Screen + Paint + Movable Type, I’m able to share it with you:



Aleks knows this would interest me because he’s always hearing me complain about the R graphics defaults: The tick marks are too long, the axis labels are too far from the axes, there are typically too many tick marks (for my taste) on the graphs, blah blah blah. A bunch of unfair complaints, given that I get R for free, but complaints nonetheless. I see right away that Heckbert (at least, as of 1990) also had the problem of long tick marks, but of course that’s trivial. What’s more relevant is the rule for setting up where the numbers go on the axis.
R does this using the pretty() function. What exactly does pretty() do, I wondered? Back from the days when R was S, I learned that the way to learn what a function does is to type its name in the console. So I’ll give that a try:
> pretty
function (x, n = 5, min.n = n%/%3, shrink.sml = 0.75, high.u.bias = 1.5,
    u5.bias = 0.5 + 1.5 * high.u.bias, eps.correct = 0)
{
    x <- as.numeric(x)
    if (length(x) == 0L)
        return(x)
    x <- x[is.finite(x)]
    if (is.na(n <- as.integer(n[1L])) || n < 0L)
        stop("invalid 'n' value")
    if (!is.numeric(shrink.sml) || shrink.sml <= 0)
        stop("'shrink.sml' must be numeric > 0")
    if ((min.n <- as.integer(min.n)) < 0 || min.n > n)
        stop("'min.n' must be non-negative integer <= n")
    if (!is.numeric(high.u.bias) || high.u.bias < 0)
        stop("'high.u.bias' must be non-negative numeric")
    if (!is.numeric(u5.bias) || u5.bias < 0)
        stop("'u5.bias' must be non-negative numeric")
    if ((eps.correct <- as.integer(eps.correct)) < 0L || eps.correct >
        2L)
        stop("'eps.correct' must be 0, 1, or 2")
    z <- .C("R_pretty", l = as.double(min(x)), u = as.double(max(x)),
        n = n, min.n, shrink = as.double(shrink.sml), high.u.fact = as.double(c(high.u.bias,
            u5.bias)), eps.correct, DUP = FALSE, PACKAGE = "base")
    s <- seq.int(z$l, z$u, length.out = z$n + 1)
    if (!eps.correct && z$n) {
        delta <- diff(range(z$l, z$u))/z$n
        if (any(small <- abs(s) < 1e-14 * delta))
            s[small] <- 0
    }
    s
}
Damn. That didn’t work. I’d briefly forgotten that a modern R function looks like this:
1. Lots and lots of exception-handling, handshaking, data-frame-handling, and general paperwork.
2. A call to the C or Fortran program that does the real work.
But I have other recourses. Let’s try Googling “R pretty.” No, that doesn’t work (you can try it yourself and see). Neither does “cran pretty,” “cran pretty(),” or anything else I can think of.
But wait! There’s the online help function! Just type “?pretty” from the console and you get a nice man page (as we used to say). Here it is:
Let d <- max(x) - min(x) >= 0. If d is not (very close) to 0, we let c <- d/n, otherwise more or less c <- max(abs(range(x)))*shrink.sml / min.n. Then, the 10 base b is 10^(floor(log10(c))) such that b <= c < 10b.
Now determine the basic unit u as one of {1,2,5,10} b, depending on c/b in [1,10) and the two 'bias' coefficients, h = high.u.bias and f = u5.bias.
I'm too lazy to read this and Heckbert's pseudocode and compare, but they certainly seem to be doing the same thing. We can try it on Heckbert's example:
> pretty (c(105,543))
[1] 100 200 300 400 500 600
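For anyone who wants something to poke at, here is a minimal R sketch of a loose-labeling rule in the spirit of Heckbert's nice numbers. The names nice_num and loose_labels, the rounding thresholds, and the default of 5 ticks are assumptions made for illustration; this is not a transcription of his pseudocode and not what R_pretty does internally. It also assumes hi > lo.

```r
# Round x to a "nice" value: 1, 2, 5, or 10 times a power of ten.
# round = TRUE picks a nearby nice value (the cutoffs are assumed);
# round = FALSE picks the smallest nice value >= x.
nice_num <- function(x, round = TRUE) {
  expo <- floor(log10(x))
  f <- x / 10^expo               # leading fraction, in [1, 10)
  nf <- if (round) {
    if (f < 1.5) 1 else if (f < 3) 2 else if (f < 7) 5 else 10
  } else {
    if (f <= 1) 1 else if (f <= 2) 2 else if (f <= 5) 5 else 10
  }
  nf * 10^expo
}

# Labels that loosely enclose [lo, hi], aiming for about nticks ticks.
loose_labels <- function(lo, hi, nticks = 5) {
  span <- nice_num(hi - lo, round = FALSE)
  step <- nice_num(span / (nticks - 1), round = TRUE)
  seq(floor(lo / step) * step, ceiling(hi / step) * step, by = step)
}

loose_labels(105, 543)
```

On the 105-to-543 example this gives a step of 100 and labels running from 100 to 600, the same as the pretty() call above.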
Hey, it works! And, indeed, pretty() has an argument called "n"--the desired number of intervals--so I could just set n=3 or 4 and probably be happy. I'm not quite sure how to alter R's plotting functions to work with a modified default parameter for pretty(), but there's probably a way--maybe in ggplot2?
Let's try it out:
> pretty (c(105,543), n=3)
[1] 0 200 400 600
>
Much better. I wasn't happy with that axis starting at 100. If the data range from 105 to 543, I'd rather take that axis all the way down to 0. I'm not a Darrell Huff-style fanatic on taking the axis down to 0, but I'd prefer to include 0 (or other natural boundaries or reference points, for example 100 if you're plotting numbers on a percentage scale, or 1.0 if you're plotting odds ratios).
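As for getting those ticks onto an actual plot, one approach in base graphics is to suppress the default axis and draw your own with axis(), passing the pretty() output as the tick locations. A minimal sketch, with made-up data spanning the same range (the vector y is hypothetical):

```r
# Hypothetical data with the same range as the example above
y <- c(105, 230, 310, 470, 543)

# Ask pretty() for about 3 intervals instead of the default 5
ticks <- pretty(y, n = 3)

# Suppress the default y-axis, stretch the limits to the outer ticks,
# then draw the axis at the chosen locations
plot(seq_along(y), y, ylim = range(ticks), yaxt = "n")
axis(2, at = ticks)
```

This handles one plot at a time; it does not change the default that plotting functions use globally.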
I'm pretty sure that the next generation of pretty() or its equivalent should have a slightly more elaborate objective function to allow a preference for inclusion of special points such as 0.
More generally, I suspect it's helpful to think of this sort of "AI"-like task as a statistical inference problem, or as a minimization problem, rather than to frame it as a search for an algorithm. I mean, sure, it all comes down to an algorithm at some point, but the inference and minimization frameworks seem better to me--more flexible and more direct--than the approach of going straight for an algorithm.
I suspect the above point is very well understood in computer science, but I also suspect it bears repeating. I say this because statisticians are certainly aware of the benefits of framing decision problems as inference problems, but we still sometimes slip into a lazy algorithmic mode of thinking when we're not careful. Almost always, it's better to ask "what are we estimating?" or, at the very least, "what are we trying to minimize?", rather than jumping to "what do we want our answer to look like?" I think there's more to be said on this point, but rather than try to come up with it all myself from scratch, I'll let you all fill me in on the relevant literature.
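To make the minimization framing concrete, here is a toy sketch in R. Everything about it is an assumption for illustration: the function name choose_ticks, the candidate steps, the three penalty terms, and their weights. It is not how pretty() works; it only shows how a preference for including 0 can enter as one term in an objective. It assumes the data have a nonzero range.

```r
# Toy objective (all terms and weights are illustrative assumptions):
#   |intervals - target| + w_zero * (0 not on axis) + w_space * (extra axis length)
choose_ticks <- function(x, target = 4, w_zero = 1, w_space = 1) {
  lo <- min(x)
  hi <- max(x)
  k <- floor(log10(hi - lo))
  # Candidate tick spacings: 1, 2, 5 times two neighboring powers of ten
  steps <- as.vector(outer(c(1, 2, 5), 10^c(k - 1, k)))
  best <- NULL
  best_score <- Inf
  for (step in steps) {
    for (stretch in c(FALSE, TRUE)) {   # TRUE: extend the axis to reach 0
      a <- floor(lo / step) * step
      b <- ceiling(hi / step) * step
      if (stretch) {
        a <- min(a, 0)
        b <- max(b, 0)
      }
      no_zero <- !(a <= 0 && b >= 0)
      score <- abs(round((b - a) / step) - target) +
        w_zero * no_zero +
        w_space * ((b - a) / (hi - lo) - 1)
      if (score < best_score) {
        best_score <- score
        best <- seq(a, b, by = step)
      }
    }
  }
  best
}

choose_ticks(c(105, 543))              # 0 is rewarded
choose_ticks(c(105, 543), w_zero = 0)  # no preference for 0
```

With these default weights the 105-to-543 example comes out as 0, 200, 400, 600. Setting w_zero = 0 drops the preference for 0, and the same objective picks 100, 200, 300, 400, 500, 600 instead. The point is that the preference is a knob in the objective rather than a special case bolted onto an algorithm.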
There's also some other weird thing that goes on in R, where it will put tick marks at, say, 10,12,14,16,18 rather than 10,15,20. The problem here, I think, is a combination of (a) too many intervals as a default setting, and (b) an idea that numbers divisible by 2 are as clean to interpret as numbers divisible by 5. I don't think the latter assumption is correct. To me when reading a graph, 10,15,20 is much much easier to scan than 10,12,14,16,18.
P.S. OK, since we're on the topic of R defaults, howzabout this one:
> a <- 1:5
> hist (a)
Which produces the following:

You notice anything funny about this? Oh, yeah:
1. The data are uniformly distributed but the histogram isn't.
2. The histogram is virtually impossible to read because all of the data fall between the histogram bars.
OK, sure, histograms can't be perfect. But this isn't an isolated case. We deal with integer data all the time, and it's not good to have a default that fails in these settings. There's always a question of how complicated you want a histogram function to be, but I'd think that with integer data it would be a good idea to use integers as the centers of the bars rather than as their boundaries.
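For what it's worth, hist() will center the bars on the integers if you hand it half-integer breaks. A minimal sketch, assuming the data really are integers:

```r
a <- 1:5

# Breaks at 0.5, 1.5, ..., 5.5 put each integer in the middle of its own bar
h <- hist(a, breaks = seq(min(a) - 0.5, max(a) + 0.5, by = 1))

h$counts  # observations per bar
h$mids    # bar centers
```

Each value lands in its own bar, so the uniform data finally look uniform. This is a per-call workaround, not a change to the default.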
P.P.S. Usually I'd put most of this sort of long, technical, code-filled entry under the fold, but there was something about the pointlessness of all of this . . . I couldn't resist splashing it all over the front page.