Who says poor arithmetic skills can’t have social utility? Professor Anne Morrison Piehl put bad math to good use in this week’s Law and Economics Workshop with her paper, Are Criminal Sentencing Guidelines Binding? Quasi-Experimental Evidence from Human Calculation Errors, available at http://www.law.uchicago.edu/files/files/Piehl%20paper.pdf.
What It’s About
After discovering that about 10 percent of guidelines-based sentence calculations in Maryland contained errors, Professor Piehl and her co-authors, Professors Shawn Bushway and Emily Owens, compared the sentences judges imposed in correctly calculated cases with the sentences judges imposed where the parties had misidentified the guideline level through clerical or arithmetic errors. The goal was to measure how much the guideline numbers themselves, all other factors being equal, affect a judge's final sentencing decision.
The problem with earlier empirical studies was that it was hard to isolate the influence of the guidelines system alone. If a researcher compared sentences before and after a guideline revision, the differences might reflect other changes that happened alongside the revision. For example, suppose the legislature increased the recommended incarceration period in response to a crime wave. Then suppose that a judge, facing the same crime wave, started handing down longer sentences. The longer sentences would have coincided with the increase in the guidelines, but the guidelines would not have caused them. This study is different in that arithmetic and clerical errors are presumably uncorrelated with judicial preferences. Because the study uses inadvertent errors, rather than time or some intentional policy change, to separate the two samples, we can be relatively confident that the cases with errors are indistinguishable from the cases without, aside from the error itself.
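To make the identification strategy concrete, here is a stylized simulation of the idea. This is my own sketch, not the authors' actual specification: the sample size, the error distribution, and the 0.3 "anchoring weight" judges place on the worksheet number are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# True guideline level (in months) each case would get if the
# worksheet were computed correctly.
correct = rng.integers(12, 120, size=n).astype(float)

# Clerical errors hit roughly 10% of worksheets and, crucially, are
# unrelated to anything about the case (the identifying assumption).
has_error = rng.random(n) < 0.10
error = np.where(has_error, rng.normal(0, 6, size=n), 0.0)
recommended = correct + error

# Suppose judges put weight 0.3 on the (possibly wrong) number they see
# and weight 0.7 on the case's true severity, plus idiosyncratic noise.
sentence = 0.3 * recommended + 0.7 * correct + rng.normal(0, 3, size=n)

# Because the error is independent of the case, regressing the sentence
# on the error alone recovers the causal weight judges place on the
# recommended number.
beta = np.cov(sentence, error)[0, 1] / np.var(error, ddof=1)
print(f"estimated weight on the guideline number: {beta:.2f}")  # ~0.30
```

The paper's actual estimates are expressed in days of sentence per erroneous month, but the logic is the same: any systematic relationship between the final sentence and the error must run through the number on the worksheet.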
The results are surprising: for each month erroneously added to the recommendation, sentences increase by an average of four days, but for each month erroneously subtracted, sentences decrease by an average of thirteen days. And when the crime in question is especially uncommon or complicated, judges deviate less from the calculated guideline range than when the crime is commonplace.
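To see how lopsided that is, apply the paper's per-month averages to a hypothetical one-year worksheet error (the 12-month error size is my own example; the per-month figures are the paper's):

```python
# Per-month effects reported in the paper: roughly +4 days of sentence per
# erroneously added month, and -13 days per erroneously subtracted month.
DAYS_UP_PER_MONTH = 4
DAYS_DOWN_PER_MONTH = 13

error_months = 12  # hypothetical one-year miscalculation
print(f"+{error_months} months on the worksheet: about +{error_months * DAYS_UP_PER_MONTH} days of sentence")
print(f"-{error_months} months on the worksheet: about -{error_months * DAYS_DOWN_PER_MONTH} days of sentence")
# The same-sized downward error moves the final sentence more than three
# times as far as the upward one.
```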
The non-randomness of these seemingly inadvertent calculation errors was also interesting. The study reports that the error rate varied with the experience of whoever prepared the final guideline worksheet: state's attorneys, who prepared the finalized worksheets more often than public defenders or private attorneys, were much less likely to commit an error than their defense counterparts.
What Was Discussed
As Professor Piehl presented the paper, participants began to wonder about the asymmetric error rates. If errors are random, as one might expect copying and arithmetic errors to be, then why do they occur more frequently in plea-bargained cases? Professor Piehl offered a few possibilities. Perhaps parties pay less attention when the sentence comes out of plea bargaining rather than a trial. She also considered the possibility that parties intentionally agree to introduce errors in order to manipulate sentence recommendations.
Other participants were puzzled by the asymmetric distribution of high versus low errors. If these were mere copying or summing errors, there would be no reason for mistakes higher than the correct result to be more common than mistakes lower than it (or vice versa). Yet, for example, in drug crime sentences, inaccurately high calculations outnumbered inaccurately low ones by about 3 to 2, while in violent crime sentences the ratio ran the other way, with inaccurately low calculations outnumbering inaccurately high ones by about 3 to 2. One participant suggested that some errors may only be able to go in one direction: if a crime already sits in the highest category on some guideline metric, any miscategorization of that crime can only push the calculation downward.
Still others speculated about why the error rate should differ when state's attorneys fill out the worksheets rather than defense attorneys, given that both sides have to agree on the final result. Perhaps it reflects differences in workload, one participant offered. Professor Piehl acknowledged this possibility and suggested that it may also be useful to compare types of crimes that happen more frequently (such as drug crimes) to those that are less common. She thought it was possible that a "fatigue factor," or the sloppiness that comes with familiarity, might make errors more common in drug crimes than in violent crimes. To that point, a participant added that the professor might look for time-cyclical effects (e.g., whether fewer mistakes happen during times of the year when dockets periodically shrink). Another participant suggested that prosecutors may be intentionally introducing errors to exploit strategic anchoring effects, and asked whether Professor Piehl had checked whether the direction of a sentencing error was correlated with the party preparing the final worksheet. Professor Piehl thought this would be an interesting avenue for future analysis, and acknowledged that in the current data, defense counsel are more likely than prosecutors to err on the low side.
The discussion also focused on how judges react to errors. Are they unaware of the errors? Or are they aware of them and simply leaving them uncorrected, finding that an error cuts in the direction they were inclined to sentence anyway? One participant suggested judges may stick closer to the guidelines for violent crimes precisely because they are aware that parties tend to make fewer errors when filling out sentencing worksheets for violent crimes. This may be a difficult hypothesis to sustain, since Professor Piehl later mentioned that people in the justice system had not been aware of the 10 percent error rate in sentencing guideline calculations (let alone crime-specific error rates for violent crimes) before her study revealed the inconsistencies. Still, Professor Piehl agreed that she couldn't discount the possibility. Some judges may be intuitively correcting worksheet errors as they calibrate their sentences to match their previous sentences for similar crimes. This would be more likely for common crimes where there are many points of comparison (e.g., drug crimes) than for unfamiliar crimes (e.g., violent crimes), and it might explain why judges are less willing to deviate from the guidelines for unfamiliar crimes. Another participant, in the same vein, suggested that upward and downward errors might be asymmetric because judges might only check for errors going in one direction (for example, they may assume that prosecutors wouldn't err upwards). On a closing note, the discussion shifted to how judges might react depending on how they expect parole boards to act. If parole boards typically release prisoners after about X years, regardless of how much longer their actual sentences are, one might expect judges to be more concerned about downward errors (which matter) than upward errors (which the parole boards can effectively "veto" anyway).
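The parole point is easy to see with a toy release rule. The flat five-year release point below is hypothetical; nothing in the paper specifies how Maryland parole boards actually behave:

```python
def time_served(sentence_months: int, parole_release_at: int = 60) -> int:
    """If the parole board releases everyone at about `parole_release_at`
    months no matter the sentence, time served is capped from above."""
    return min(sentence_months, parole_release_at)

# An upward worksheet error (70 -> 80 months) changes nothing in practice,
# while a downward error (50 -> 40 months) passes straight through.
print(time_served(70), time_served(80))  # 60 60  (upward error "vetoed")
print(time_served(50), time_served(40))  # 50 40  (downward error binds)
```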
This paper sets out to measure how much influence sentencing guidelines have on judges. But one of the most shocking revelations to come out of it is the 10 percent error rate itself. Why such a large error rate? If judges do pay attention to guideline recommendations, as this paper proposes, then the magnitude and the asymmetry of the errors seem especially disturbing: the magnitude suggests the breadth of the inaccuracy problem, and worse, the asymmetry makes the errors look intentional.
As several participants pointed out, we can't tell from the data alone whether judges are aware of the errors. There could be a hypothetical world (however unlikely) where judges meticulously check every worksheet with a sentence already in mind, catch every error, but correct inaccuracies only when the errors cut against the sentence they intend. They could leave inaccuracies uncorrected when the errors go the same direction as their preconceived sentence, just so their sentences don't appear radical. (As mentioned in the workshop, there are plenty of reasons why judges may want to appear faithful to the guidelines, even when the guidelines are entirely voluntary: there may be reputational costs to deviating too often from the norm, and judges may want to maintain an appearance of consistency.) Such a world might yield similar correlations, where judges' sentences appear to skew in the same direction as the sentence recommendation, but it would mean that the judges are influencing the sentence recommendations rather than vice versa. And since the study only had access to the final worksheets submitted, we don't know whether judges are sending some worksheets back for correction (or what kinds of errors those contain).
The more it looks like parties or judges are intentionally manipulating the sentence recommendation "errors," the harder it becomes to trust the hypothesis that the numbers in the recommendations, in and of themselves, are what affect the judge's sentencing decision. And the more bizarre the asymmetries in the error rates, the harder it becomes to accept that no artificial manipulation is going on. As such, before the paper can draw any conclusions about the causal direction of the correlation (whether (1) recommendations are affecting judges, (2) judges are affecting recommendations, or (3) parties are manipulating recommendations under certain conditions and judges are reacting to the presence of those conditions rather than to the recommended numbers themselves), it needs to rule out the possibility that the errors are intentional.
To rule that possibility out, there should be more research into the actual process leading up to sentencing. Is the system designed in a way that incentivizes parties to use worksheet errors to get their way? For example, suppose the prosecution wants to reward a defendant for cooperating. In the federal system, the prosecution could recommend an appropriate range with deductions for various factors such as the defendant's cooperation, level of involvement in the crime, and past history. But the court is not bound to accept the recommendation, and the defendant has no guarantee that his cooperation will result in a lower sentence. In such a system, the prosecution and defense might find it less of a hassle to hedge their bets by understating the offense levels or categories, in addition to going before the judge to argue for the sentence reduction. Yet this is precisely the sort of situation where a judge, independent of the inaccurately low recommendation, may be swayed to give a lower sentence. The judge may read a pre-sentence report that speaks glowingly of the defendant's cooperation with the prosecution and penitent attitude, and then decide to give a lower sentence anyway. In that scenario, it would be the surrounding circumstances, rather than the recommendation itself, that led to the lower sentence.
As with any good study, Piehl's answers one narrow question well but raises several disturbing new ones. Even if one chalks the errors up to fatigue, familiarity, or complexity, why are they asymmetric? It may turn out that the asymmetries reveal something more invidious: some sort of extra-legal manipulation by individuals. Or it may turn out to be a mere side effect of the design of the criminal system. In an e-mail response to the blog, Professor Piehl expressed her belief that, at this point, well-intentioned, inadvertent error is just as credible an explanation for the data as intentional manipulation. She noted that it would be odd for three people with professional reputations and some level of personal commitment to the system to intentionally sign off on an inaccurate statement, rather than channel their energies into legally sanctioned means of adjusting the sentence.
But we won’t know unless there’s more specific analysis done to figure out what’s causing what.