Goodhart’s Law is a famous adage that goes “when a measure becomes a target, it ceases to be a good measure.” If you’re not familiar with the adage, you can go read all about its history on Wikipedia, and perhaps also read the related entry on the ‘cobra effect’ (which includes a litany of entertaining perverse incentive stories, of which the eponymous cobra anecdote is merely one):
The British government, concerned about the number of venomous cobras in Delhi, offered a bounty for every dead cobra. Initially, this was a successful strategy; large numbers of snakes were killed for the reward. Eventually, however, enterprising people began to breed cobras for the income. When the government became aware of this, the reward program was scrapped. When cobra breeders set their now-worthless snakes free, the wild cobra population further increased.
But I’m here to tell you that Goodhart’s Law is not as useful as you might think.
At some level, this is self-evident. Goodhart’s Law is about as pithy and about as practicable as “the only certainty in life is death and taxes” and “hell is other people.” It is descriptive; it tells you of the existence of a phenomenon, but it doesn’t tell you what to do about it or how to solve it.
Thankfully, it turns out that there has been a fair amount of work on solving for Goodhart’s Law at the organisational level. Note that there is a variant of Goodhart’s Law that concerns broad social policy; the ideas here probably won’t work for that. But if you’re an operator, like I am, and you’re interested in solutions at the company level, this is going to be right up your alley.
A brief note so you know the source of these principles: many of these ideas were worked out by W. Edwards Deming and his colleagues in the 70s and 80s, as part of a body of work known today as ‘Statistical Process Control’. I’ve talked a little about how I fell into this rabbit hole in the recent past; the short version is that I did some work for Colin Bryar to explicate Amazon’s Weekly Business Review process for his company’s clients, and during that project I discovered that many of the ideas in the WBR were actually taken from the field of Statistical Process Control. As a result, I started digging into SPC to see what other principles or ideas might be applicable to business.
One of the more interesting things about the WBR is that the folks at Amazon have developed a number of ways to solve for Goodhart’s Law. We’ll use the set of practices around the WBR as an example in a bit. But the main idea that I want to highlight here is that the WBR’s practices came from a body of work; that body of work offers us a bunch of principles to use in our own contexts.
I’ll start with the principles, and then articulate one instantiation of those principles with the Amazon WBR as an example.
A More Solvable Version of Goodhart’s Law
To get at the principles, it’s useful to talk about formulations of Goodhart’s Law that are more useful than the original form.
There’s a fairly interesting paper by David Manheim and Scott Garrabrant titled Categorizing Variants of Goodhart’s Law that lays out four categories of the phenomenon. I summarised the paper a number of years back, and in that summary I talked about some of their proposed solutions for each of the categories. I do recommend the paper if you’d like a more general take on various forms of Goodhart’s Law, which is useful if you’re into, say, AI alignment research. But I did not think highly of the solutions; they seemed too academic, too theoretical, for my taste.
Thankfully, real-world organisational solutions are much simpler. The first step is to turn Goodhart’s Law into a narrower, more actionable formulation. The one that I like the most is from Deming contemporary Donald Wheeler, who writes, in Understanding Variation:
When people are pressured to meet a target value there are three ways they can proceed:
1) They can work to improve the system
2) They can distort the system
3) Or they can distort the data
Let’s demonstrate this by example. Say that you’re working in a widget factory, and management has decided you’re supposed to produce 10,000 widgets per month. If production this month is above the target, you may be tempted to stockpile it and use it against next month’s quota (distorting the system). If production is below target, you may be tempted to bring back skids of finished product from the warehouse, unpack it, load it back onto the conveyor belt, and have the automatic counting machine at the end of the production line count the product as finished units (thus distorting the data). Of course, at the end of the year this deception would show up as inventory shortage, and as plant manager, you’re likely to be fired. But if there is high, unyielding pressure to meet production targets in the short term, and no time for process improvement, the common response is to resort to trickery. Wheeler writes:
Notice how the emphasis upon meeting the production target was the origin of all the turmoil in this case. People were fired and hired, money was spent, all because the production foreman did not like to explain, month after month, why they had not met the production quota.
This list of possible responses to quantitative targets is attributed to Brian Joiner, who ‘came up with this list several years ago’ — likely in the 80s. I immediately glommed onto this list as a more useful formulation than Goodhart’s Law. Joiner’s list suggests a number of solutions:
- Make it difficult to distort the system.
- Make it difficult to distort the data, and
- Give people the slack necessary to improve the system (a tricky thing to do, which we’ve covered elsewhere).
The third point is really important. Preventing distortions is just one half of the solution. Avoiding Goodhart’s Law requires you to also give people the space to improve the system. Which raises the question: how do you encourage people to do just that?
There’s a nuanced point that Wheeler makes immediately after giving us this list. He writes (emphasis mine):
Before you can improve any system you must listen to the voice of the system (the Voice of the Process). Then you must understand how the inputs affect the outputs of the system. Finally, you must be able to change the inputs (and possibly the system) in order to achieve the desired results. This will require sustained effort, constancy of purpose, and an environment where continual improvement is the operating philosophy.
Comparing numbers to specifications will not lead to the improvement of the process. Specifications are the Voice of the Customer, not the Voice of the Process. The specification approach does not reveal any insights into how the process works.
So if you only compare the data to the specifications, then you will be unable to improve the system, and will therefore be left with only the last two ways of meeting your goal (i.e. distorting the system, or distorting the data). When a current value is compared to an arbitrary numerical target (... it) will always create a temptation to make the data look favourable. And distortion is always easier than working to improve the system.
‘Voice of the Customer’ and ‘Voice of the Process’ are fancy ways to say something simple. A target, goal, or budget usually represents some kind of ‘specification’ — some form of demand from the customer or from company management. This is the so-called ‘Voice of the Customer’. The ‘Voice of the Process’, on the other hand, is how the process actually works.
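To make the distinction concrete, here is a minimal sketch of one common way to estimate the Voice of the Process: the natural process limits of an XmR (individuals and moving range) chart, the chart type associated with Wheeler’s work. The monthly figures are invented for the widget factory example, and the function and variable names are hypothetical. The only borrowed fact is the standard XmR scaling constant of 2.66 applied to the average moving range.

```python
from statistics import mean


def natural_process_limits(values):
    """Return (lower, centre, upper) natural process limits for an XmR chart."""
    if len(values) < 2:
        raise ValueError("need at least two observations")
    centre = mean(values)
    # Moving range: absolute difference between consecutive observations.
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    # 2.66 is the standard XmR scaling constant for individual values.
    spread = 2.66 * mean(moving_ranges)
    return centre - spread, centre, centre + spread


# Hypothetical monthly widget output (illustrative numbers only).
monthly_output = [9300, 9400, 9250, 9450, 9350, 9500, 9300, 9450]
target = 10_000  # Voice of the Customer: what management asked for.

# Voice of the Process: what the system routinely delivers.
lower, centre, upper = natural_process_limits(monthly_output)
target_within_reach = lower <= target <= upper

print(f"process runs between {lower:.0f} and {upper:.0f} (centre {centre:.0f})")
print(f"target of {target} inside natural limits: {target_within_reach}")
```

With these made-up numbers the target sits above the upper limit, which is the situation the essay warns about: the process as it stands will not routinely produce 10,000 units, so pressure alone leaves only distortion as a way to hit the number.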
What Wheeler is getting at is something deceptively simple, but actually quite profound. Most of us, when faced with a goal, will fixate on the difference between the current output and our desired target. In other words, we think the way to hit our goals is to obsess over the goal. A common approach is to work backwards to figure out targets at specific checkpoints, and then say (either to ourselves or to our team) “HIT THOSE TARGETS, COME HELL OR HIGH WATER!” Think of your personal life: say that you want to lose 6kg in three months. The natural thing to do is to set a 2kg weight reduction goal each month, weigh yourself every morning, and then pat yourself on the back if you’re on track to hitting your target reduction for the month, or exercise more / eat less if you’re not. A similar thing might occur in business: “We need to get to 100 new deals by the end of the quarter, which means 33 new deals per month, now GET ON IT!” This is a naive view of process improvement, and while it may work for something as simple as weight control, it is not going to work for the kind of complex processes that you would find in a typical business.
Well, let’s think about the weight control example. Losing weight is a process with two well known inputs: calories in (what you eat), and calories out (what you burn through exercise). This means that the primary difficulty of hitting a weight loss goal is to figure out how your body responds to different types of exercise or different types of foods, and how these new habits might fit into your daily life (this assumes you’re disciplined enough to stick to those habits in the first place, which, well, you know). By contrast, business processes are often processes where you don’t know the inputs to your desired output. So the first step is to figure out what those inputs are, and then figure out what subset of those you can influence, and then, finally, figure out the causal relationships between the input metrics and output metrics. A causal relationship looks something like: “an X% lift in input metric A usually leads to a Y% lift in output metric B. Oh, and output metric B is affected by five different input metrics, of which A is merely one”. It is not an accident that the best growth teams are able to say things like “a 30% increase in newsletter performance should drive a 5% improvement in new account signups” — if you ever hear something like this, you should assume that they’ve listened to the Voice of the Process very, very carefully.
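As a rough illustration of what a statement like “an X% lift in input metric A usually leads to a Y% lift in output metric B” might be built on, the sketch below fits a least-squares line through week-over-week percentage changes. All names and numbers are hypothetical. Note the big assumption: a fitted slope only describes association in past data, and it ignores the other inputs that affect the same output, so it is a starting point for a hypothesis rather than evidence of a causal relationship.

```python
def pct_changes(series):
    """Week-over-week percentage change for each consecutive pair."""
    return [(b - a) / a * 100 for a, b in zip(series, series[1:])]


def least_squares_slope(xs, ys):
    """Slope of the ordinary least-squares line of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    if variance == 0:
        raise ValueError("input metric never changed; nothing to estimate")
    return covariance / variance


# Hypothetical weekly figures (illustrative numbers only).
newsletter_clicks = [1000, 1100, 1050, 1200, 1300, 1250]  # input metric A
new_signups = [400, 408, 404, 415, 422, 419]              # output metric B

slope = least_squares_slope(pct_changes(newsletter_clicks), pct_changes(new_signups))
print(f"a 1% lift in clicks went with roughly a {slope:.2f}% lift in signups")
```

A team would still need to change the input deliberately and watch the output over several weeks before trusting a number like this.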
This is a long-winded way of saying that if you want to improve some process, you have to ignore the goal first, in favour of examining the process itself. On the face of it, this is common sense: you cannot expect your team to improve some metric if that metric isn’t directly controllable. No, you must first work out what set of controllable input metrics leads to the output metric outcomes you desire, before you can even begin to talk about hitting targets. You’ll need to figure out what levers to pull in order to hit 10,000 units a month; you’ll need to figure out what drivers exist before you push for 100 new deals a quarter. The way you get to this state is nothing at all like obsessively watching your target and measuring how far off you are from it and yelling at your team about the underperformance; down that path lies Goodhart’s Law.
Again, in Wheeler’s words: “You cannot improve a process by listening to the Voice of the Customer. You can only improve a process by listening to the Voice of the Process.”
Now let’s take a look at how Amazon does it.
How the WBR Accomplishes This
In 2000, Amazon suffered from a legendary (read: terrible) holiday season. Bryar and Carr recount this period in Working Backwards:
During the fourth quarter (of 2000)—in which our net sales ended up increasing by 44 percent over Q4 of the previous year—there was a daily “war room” meeting where the senior Amazon leaders would analyze a three-page metrics deck and figure out what actions we’d have to take to successfully respond to the demands of what was shaping up to be a record-breaking holiday season. A key component of the deck was the backlog, which was a tally of the orders we had taken minus the shipments we had made. The backlog indicated the amount of work we’d need to do to make sure our customers received their gifts before the holidays. It would take a massive, concentrated effort. Many corporate employees were conscripted for work in the fulfillment centers and customer service. Colin worked the night shift from 7 p.m. to 5:30 a.m. in the Campbellsville, Kentucky, fulfillment center and telecommuted from the Best Western hotel to stay on top of his day job. Bill stayed in Seattle to keep the Video store running smoothly during the day and traveled south 2.5 miles each night to work in the Seattle fulfillment center.
It was touch and go for a while. If we overpromised, we’d ruin a customer’s holiday. If we underpromised and stopped accepting orders, we were basically telling our customers to go elsewhere for their holiday needs.
It was close, but we made it. Shortly after that holiday season we held a postmortem, out of which was born the Weekly Business Review (WBR). The purpose of the WBR was to provide a more comprehensive lens through which to see the business.
My personal belief is that Amazon’s adoption of the WBR may be traced back to this period of crisis, and the format of the meeting was influenced by folk with strong Operations Research backgrounds. How else to explain the uncanny implementation of just about every principle in Donald Wheeler’s Understanding Variation?
Let’s go over the basics. The Amazon WBR is a weekly operational metrics review meeting in which Amazon’s leadership team gathers and reviews 400-500 metrics within 60-90 minutes. It occurs — or so I’m told — every Wednesday morning. I should note that a) a more detailed description of the WBR may be found in Chapter 6 of Working Backwards, and that b) the authors note that there is no one standard playbook to use and review metrics across Amazon; the details of the WBR here are based off of their experiences, as well as from the recollections of various Amazon execs they’ve talked to whilst writing the book.
The way that a WBR deck is constructed is instructive. Broadly speaking, Amazon divides its metrics into ‘controllable input metrics’ and ‘output metrics’. Output metrics are not typically discussed in detail, because there is no way of directly influencing them. (Yes, Amazon leaders understand that they are evaluated based on their output metrics, but they recognise these are lagging indicators and are not directly actionable). Instead, the majority of discussions during WBR meetings focus on exceptions and trends in controllable input metrics. In other words, a metrics owner is expected to explain abnormal variation or a worrying trend (slowing growth rate, say, or if a metric is lagging behind target) — and is expected to announce “nothing to see here” if the metric is within normal variance and on track to hit target. In the latter case, the entire room glances at the metric for a second, and then moves on to the next graph.
(Note that they do not skip over the metric entirely; glancing at the metric is quite important. You’ll see why in a minute.)
How do you come up with the right set of controllable input metrics? The short answer is that you do so by trial and error. Let’s pretend that you want to influence ‘Marketing Qualified Leads’ (or MQLs) and you hypothesise that ‘percentage of newsletters sent that is promotional’, ‘number of webinars conducted per week’ and ‘number of YouTube videos produced’ are controllable input metrics that affect this particular output metric. You include these three metrics in your WBR metrics deck, and charge the various metrics owners with driving up those numbers. Over the period of a few months (and recall, the WBR is conducted every week) your leadership team will soon say things like “Hmm, we’ve been driving up promotional newsletters for a while now but there doesn’t seem to be a big difference in MQLs; maybe we should stop doing that” or “Number of webinars seems pretty predictive of a bump in MQLs, but why is the bump in numbers this week so large? You say it’s because of the joint webinar we did with Tableau? Well, should we track ‘number of webinars executed with a partner’ as a new controllable input metric and see if we can drive that up?”
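The weekly conversation described above can be supported by a very simple screen. The sketch below computes a Pearson correlation between each candidate input metric and MQLs over a run of weeks. The data, the metric keys, and the 0.5 cut-off are all assumptions made up for illustration; the essay describes this judgment being made through discussion, not through a fixed threshold.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat series carries no signal either way
    return cov / sqrt(var_x * var_y)


# Hypothetical weekly figures (illustrative numbers only).
mqls = [120, 125, 150, 130, 170, 128, 160, 135]
candidate_inputs = {
    "promotional_newsletter_pct": [20, 25, 30, 35, 40, 45, 50, 55],
    "webinars_per_week": [1, 1, 2, 1, 3, 1, 2, 1],
    "youtube_videos_produced": [2, 3, 2, 3, 2, 3, 2, 3],
}

review = {name: pearson(series, mqls) for name, series in candidate_inputs.items()}

# Strongest relationship first; the cut-off is an arbitrary illustration.
for name, r in sorted(review.items(), key=lambda kv: kv[1], reverse=True):
    verdict = "keep watching" if r >= 0.5 else "question it"
    print(f"{name}: r = {r:+.2f} -> {verdict}")
```

In this invented data set, webinars track MQLs closely while the other two candidates do not, which mirrors the kind of observation the leadership team makes in the example. Correlation over eight weeks is weak evidence on its own, which is why the essay stresses months of repeated review.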
Bryar and Carr include the following compelling anecdote in Working Backwards (emphasis mine):
One mistake we made at Amazon as we started expanding from books into other categories was choosing input metrics focused around selection, that is, how many items Amazon offered for sale. Each item is described on a “detail page” that includes a description of the item, images, customer reviews, availability (e.g., ships in 24 hours), price, and the “buy” box or button. One of the metrics we initially chose for selection was the number of new detail pages created, on the assumption that more pages meant better selection.
Once we identified this metric, it had an immediate effect on the actions of the retail teams. They became excessively focused on adding new detail pages—each team added tens, hundreds, even thousands of items to their categories that had not previously been available on Amazon. For some items, the teams had to establish relationships with new manufacturers and would often buy inventory that had to be housed in the fulfillment centers.
We soon saw that an increase in the number of detail pages, while seeming to improve selection, did not produce a rise in sales, the output metric. Analysis showed that the teams, while chasing an increase in the number of items, had sometimes purchased products that were not in high demand. This activity did cause a bump in a different output metric—the cost of holding inventory—and the low-demand items took up valuable space in fulfillment centers that should have been reserved for items that were in high demand.
When we realized that the teams had chosen the wrong input metric—which was revealed via the WBR process—we changed the metric to reflect consumer demand instead. Over multiple WBR meetings, we asked ourselves, “If we work to change this selection metric, as currently defined, will it result in the desired output?” As we gathered more data and observed the business, this particular selection metric evolved over time from
- number of detail pages, which we refined to
- number of detail page views (you don’t get credit for a new detail page if customers don’t view it), which then became
- the percentage of detail page views where the products were in stock (you don’t get credit if you add items but can’t keep them in stock), which was ultimately finalized as
- the percentage of detail page views where the products were in stock and immediately ready for two-day shipping, which ended up being called Fast Track In Stock.
You’ll notice a pattern of trial and error with metrics in the points above, and this is an essential part of the process. The key is to persistently test and debate as you go. For example, Jeff (Bezos) was concerned that the Fast Track In Stock metric was too narrow. Jeff Wilke argued that the metric would yield broad systematic improvements across the retail business. They agreed to stick with it for a while, and it worked out just as Jeff Wilke had anticipated.
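The four stages of that selection metric can be written down side by side, which makes it easier to see what each refinement stops rewarding. This is a minimal sketch under invented assumptions: the `PageView` record, its field names, and the sample data are hypothetical and are not Amazon’s actual schema or definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageView:
    """One customer view of a detail page (hypothetical record shape)."""
    item_id: str
    in_stock: bool
    two_day_ready: bool


def selection_metrics(catalogue_item_ids, page_views):
    """Compute the four successive versions of the selection metric."""
    total_views = len(page_views)
    in_stock_views = sum(1 for v in page_views if v.in_stock)
    fast_track_views = sum(1 for v in page_views if v.in_stock and v.two_day_ready)

    def share(part):
        # Percentage of all detail page views; 0.0 when there are no views.
        return 100 * part / total_views if total_views else 0.0

    return {
        "detail_pages": len(set(catalogue_item_ids)),
        "detail_page_views": total_views,
        "in_stock_view_pct": share(in_stock_views),
        "fast_track_in_stock_pct": share(fast_track_views),
    }


# Illustrative data: "cable-9" has a detail page that nobody views.
catalogue = ["book-1", "toy-7", "lamp-3", "cable-9"]
views = [
    PageView("book-1", in_stock=True, two_day_ready=True),
    PageView("book-1", in_stock=True, two_day_ready=True),
    PageView("toy-7", in_stock=True, two_day_ready=False),
    PageView("lamp-3", in_stock=False, two_day_ready=False),
]

metrics = selection_metrics(catalogue, views)
for name, value in metrics.items():
    print(f"{name}: {value}")
```

Adding the unviewed item raises the first metric and nothing else, and each later version moves only when customers can actually see, buy, and quickly receive the product.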
You can see how picking the wrong controllable input metric temporarily created a Goodhart’s Law type of situation within Amazon. But the nature of the WBR prevented the situation from persisting. Implicit in the WBR process is the understanding that the initial controllable input metrics you pick might be the wrong ones. As a result, the WBR acts as a safety net — a weekly checkpoint to examine the relationships between controllable input metrics (which are set up as targets for operational teams) and corresponding output metrics (which represent the fundamental business outcomes that Amazon desires). If the relationship is non-existent or negative, Amazon’s leadership knows to kill that particular input metric. Said differently, the WBR assumes that controllable input metrics are only important if they drive desirable outcomes — if the metric is wrong, or the metric stops driving output metrics at some point in the future, the metric is simply dropped.
This is the third solution in Joiner’s list (“give people enough slack to improve the system so that they do so”). The WBR simply functions as a mechanism to let that happen.
A larger point I want to make is that the WBR becomes a weekly synchronisation mechanism for the company’s entire leadership team. This is more important than it might first seem. Colin told me that the week-in, week-out cadence explains how Amazon’s leadership is eventually able to go through 400-500 metrics in a single hour. An external observer might be overwhelmed. But an insider who has been engaged in the WBR process over a period of months won’t see 500 different metrics; instead, their felt experience of a WBR deck is that they are looking at clusters of controllable input metrics that map to other clusters of output metrics. In other words, the WBR process forces them to build a causal model of the business in their heads. Repeated viewings of the same set of metrics will eventually turn into a ‘fingertip-feel’ of the business (execs will be able to say things like “this feels wrong, this dip is more than expected seasonal variation, what’s up?”), which can really only happen if you’re looking at numbers and trends every week. (This is why glancing at metrics is important, even when the metric owner announces “nothing to see here.”) Most importantly, though, the entire leadership team shares in the same causal model, since they would have been present for the laborious trial and error process to identify, control, and then manipulate each and every metric that mattered.
(Incidentally, this also explains why there are so many goddamn metrics in a typical WBR deck. Any output metric of importance in a business will typically be influenced by multiple input metrics. Pretending otherwise is to deny the multi-faceted nature of business. This means that if you track 50 output metrics, the number of controllable input metrics you’ll need to examine will greatly exceed that number — and may change from week to week! However you cut it, your WBR deck will expand to become much larger than an outside observer might expect. This is why it is often a mistake to ‘present a small, simple set of metrics’ or to anchor on a single ‘North Star metric’ for a business. Read more about this argument here.)
Back to the topic at hand. We’ve just taken a look at one aspect of the WBR: how Amazon enables its staff to ‘improve the system’ in order to drive up output metrics that matter. But there are two other practices that Amazon uses to augment the WBR in order to prevent Goodhart’s Law-type situations.
The first is that the WBR is administered by a completely autonomous group: the Finance department. Each WBR meeting is opened by an individual from the Finance team, who takes note of all questions during the meeting and follows up on unfinished threads from the previous week’s WBR. Finance is responsible for certifying the data presented in the WBR; they sit in on the meeting, and speak up to correct misrepresentations (and also, I suspect, to serve as a physical reminder of their role). Elsewhere in the book Bryar and Carr note that Finance generates and monitors targets that are produced as part of the annual OP1 and OP2 planning process, the same targets that are overlaid onto WBR graphs. But the most important function that Finance plays in the WBR process is that they are empowered (and incentivised!) to investigate and dive deep into any of the metrics that are presented during the WBR.
Bryar and Carr write:
In the early 2000s, Jeff and CFO Warren Jenson—who was succeeded in 2002 by Tom Szkutak—stated explicitly how critical it was for the finance team to uncover and report the unbiased truth. Jeff, Warren, and Tom all insisted that, regardless of whether the business was going well or poorly, the finance team should “have no skin in the game other than to call it like they see it,” based on what the data revealed (emphasis added). This truth-seeking mentality permeated the entire finance team and was critical because it ensured that company leaders would have unvarnished, unbiased information available to them as they made important decisions. Having an independent person or team involved with measurement can help you seek out and eliminate biases in your data.
In addition, metrics owners and finance team members alike are expected to run separate auditing processes for metrics that matter. On the topic of auditing, Bryar and Carr have this to say (emphasis mine):
One often-overlooked piece of the puzzle is determining how to audit metrics. Unless you have a regular process to independently validate the metric, assume that over time something will cause it to drift and skew the numbers. If the metric is important, find out a way to do a separate measurement or gather customer anecdotes and see if the information trues up with the metric you’re looking at. So, a recent example would be testing for COVID-19 by region. It is not enough to look at the number of positive tests in your region as compared to another region with a population of a similar size. You must also look at the number of tests per capita performed in each region. Since both the number of positive tests and the number of tests per capita in each location will keep changing, you will need to keep updating your audit of the measurements.
The net result of this is that it makes it more difficult to distort the system or distort the data when it comes to driving metrics in the WBR.
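The COVID-19 example in the quote is really an arithmetic point: a raw count can rank two regions one way while the audited, normalised figures rank them the other way. Here is a small sketch of that check. The regions, populations, and counts are invented, and the function and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegionReport:
    """Testing figures reported by one region (hypothetical shape)."""
    name: str
    population: int
    tests_performed: int
    positive_tests: int


def audit(report):
    """Normalise the headline count so regions can be compared fairly."""
    return {
        "positives_per_100k": 100_000 * report.positive_tests / report.population,
        "tests_per_100k": 100_000 * report.tests_performed / report.population,
        "positivity_pct": 100 * report.positive_tests / report.tests_performed,
    }


# Two made-up regions with the same population.
region_a = RegionReport("A", population=1_000_000, tests_performed=5_000, positive_tests=500)
region_b = RegionReport("B", population=1_000_000, tests_performed=1_000, positive_tests=200)

audit_a = audit(region_a)
audit_b = audit(region_b)

# The headline metric says A is worse; the audit shows B tests far less
# and finds positives at twice the rate.
print("raw positives:", region_a.positive_tests, "vs", region_b.positive_tests)
print("tests per 100k:", audit_a["tests_per_100k"], "vs", audit_b["tests_per_100k"])
print("positivity %:", audit_a["positivity_pct"], "vs", audit_b["positivity_pct"])
```

The same pattern applies to business metrics: pair the number being reported with an independent measurement of how it was gathered, and rerun the comparison as both keep changing.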
One of the things that I find perpetually irritating about using data in operations is that even proposing to measure outcomes often sets off resistance along the lines of “oh, but Goodhart’s Law mumble mumble blah blah.” Which, yes, Goodhart’s Law is a thing, and really bad things happen when it occurs. But then what’s your solution for preventing it? It can’t be that you want your org to run without numbers. And it can’t be that you eschew quantitative goals forever!
I think the biggest lesson of this essay is just how difficult it is to be data driven — and to do it properly. A friend of mine likens data-driven decision making to nuclear power: extremely powerful if used correctly, but so very easy to get wrong, and when things go wrong the whole thing blows up in your face. I think about this analogy a lot.
So what have we covered today? I opened this piece by introducing the common sense idea that Goodhart’s Law isn’t particularly useful. Instead, I argued that Donald Wheeler’s Understanding Variation gives us a more actionable formulation of Goodhart’s Law (quoting an observation from Brian Joiner, from the 80s, and drawing more broadly from the field of Statistical Process Control). Joiner points out that when you’re incentivising organisational behaviour with metrics, there are really only three ways people might respond: 1) they might improve the system, 2) they might distort the system, or 3) they might distort the data.
Joiner’s list immediately suggests three solutions: a) you make it hard to distort the system, b) you make it hard to distort the data, and finally c) you make it possible for people to change the inputs to the system, or the system itself. That last solution isn’t as easy as you might think — but at least we know it’s necessary to avoid Goodhart’s Law.
Finally, to make things concrete, we looked at how Amazon’s WBR process implements all three solutions. I should note that this isn’t a complete accounting of how the WBR deals with Goodhart’s Law — there are elements of the operating culture and aspects of Amazon’s data maturity that I can’t get into for the sake of brevity. This is a long enough essay as it is. But the main point I want to make is, I hope, clear: Goodhart’s Law is solvable at an organisational level. And now you have a taste of how.
If you enjoyed this essay, you might also enjoy Putting Amazon's PR/FAQ to Practice.