Amazon’s Weekly Business Review (WBR) is a metrics meeting that occurs every Wednesday morning. It is administered by finance and attended by the company’s entire leadership team, along with various managers and metric owners.
A huge part of Amazon’s data culture is to separate metrics into two broad buckets:
Controllable input metrics — metrics that operators can move directly through their own actions.
Output metrics — metrics that matter to a business, but cannot be directly manipulated over the long term (at least not in a sustainable manner!).
At Amazon, controllable input metrics typically track things like selection, price, or convenience. These factors may be controlled by adding items to the product catalog, reducing costs so that prices can be lowered, or positioning inventory for faster delivery to customers. Output metrics, on the other hand, are things like orders, revenues, and profit.
Outside of Amazon, controllable input metrics are more commonly known as ‘leading indicators’, whilst output metrics are more commonly known as ‘lagging indicators’. Amazon prefers the terms ‘controllable input metric’ and ‘output metric’ because they better reflect the nature of each bucket. Every controllable input metric should measure things that, done right, bring about the desired results in your output metrics.
The tricky thing about input and output metrics, though, is that it often takes a trial-and-error process to figure out the right controllable input metrics for the output metrics you care about.
In their book Working Backwards, authors Colin Bryar and Bill Carr tell the following story as an example of iterating towards a good controllable input metric (all emphasis ours):
One mistake we made at Amazon as we started expanding from books into other categories was choosing input metrics focused around selection, that is, how many items Amazon offered for sale. Each item is described on a “detail page” that includes a description of the item, images, customer reviews, availability (e.g., ships in 24 hours), price, and the “buy” box or button. One of the metrics we initially chose for selection was the number of new detail pages created, on the assumption that more pages meant better selection.
Once we identified this metric, it had an immediate effect on the actions of the retail teams. They became excessively focused on adding new detail pages—each team added tens, hundreds, even thousands of items to their categories that had not previously been available on Amazon. For some items, the teams had to establish relationships with new manufacturers and would often buy inventory that had to be housed in the fulfillment centers.
We soon saw that an increase in the number of detail pages, while seeming to improve selection, did not produce a rise in sales, the output metric. Analysis showed that the teams, while chasing an increase in the number of items, had sometimes purchased products that were not in high demand. This activity did cause a bump in a different output metric—the cost of holding inventory—and the low-demand items took up valuable space in fulfillment centers that should have been reserved for items that were in high demand.
When we realized that the teams had chosen the wrong input metric—which was revealed via the WBR process—we changed the metric to reflect consumer demand instead. Over multiple WBR meetings, we asked ourselves, “If we work to change this selection metric, as currently defined, will it result in the desired output?” As we gathered more data and observed the business, this particular selection metric evolved over time from
- number of detail pages, which we refined to
- number of detail page views (you don’t get credit for a new detail page if customers don’t view it), which then became
- the percentage of detail page views where the products were in stock (you don’t get credit if you add items but can’t keep them in stock), which was ultimately finalized as
- the percentage of detail page views where the products were in stock and immediately ready for two-day shipping, which ended up being called Fast Track In Stock.
You’ll notice a pattern of trial and error with metrics in the points above, and this is an essential part of the process. The key is to persistently test and debate as you go. For example, Jeff (Bezos) was concerned that the Fast Track In Stock metric was too narrow. Jeff Wilke argued that the metric would yield broad systematic improvements across the retail business. They agreed to stick with it for a while, and it worked out just as Jeff Wilke had anticipated.
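To make the final definition concrete, here’s a minimal sketch of how a Fast-Track-In-Stock-style metric could be computed from a log of detail page views. This is our illustration rather than Amazon’s implementation, and the event fields (`product_id`, `in_stock`, `two_day_eligible`) are assumed names:

```python
from dataclasses import dataclass

@dataclass
class DetailPageView:
    """One customer view of a product detail page (illustrative fields)."""
    product_id: str
    in_stock: bool          # was the item in stock at the moment of the view?
    two_day_eligible: bool  # was it ready for two-day shipping at that moment?

def fast_track_in_stock(views: list[DetailPageView]) -> float:
    """Percentage of detail page views where the product was in stock AND
    immediately ready for two-day shipping. Every view counts in the
    denominator, so the metric is weighted by customer demand, not by sales."""
    if not views:
        return 0.0
    qualifying = sum(1 for v in views if v.in_stock and v.two_day_eligible)
    return 100.0 * qualifying / len(views)

# Toy example: three views, two of which qualify -> 66.7%
views = [
    DetailPageView("B0001", in_stock=True, two_day_eligible=True),
    DetailPageView("B0002", in_stock=True, two_day_eligible=False),
    DetailPageView("B0003", in_stock=True, two_day_eligible=True),
]
print(f"Fast Track In Stock: {fast_track_in_stock(views):.1f}%")
```

Notice that every page view adds one to the denominator, and only in-stock, two-day-shippable views add one to the numerator. This is exactly the ‘demand weighted’ numerator/denominator structure Bryar describes in the interview below.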
You can see how picking the wrong controllable input metric temporarily created a Goodhart’s Law type of situation within Amazon (“when a measure becomes a target, it ceases to be a good measure”).
In an interview, Bryar says:
“One thing you need to remember is that if you ask people to drive a metric in the right direction, they usually will take it to heart and drive it in the right direction, or [at least] the direction you're asking them to go for that specific metric.
So we asked the category teams, “hey, add more products to our catalog!” So they did a couple of things. First of all they said, “What’s the best way to get as many products in our catalog as possible?” Well, it’s to sign up with a distributor, who is a middleman between a manufacturer and the retailer. You pay for that, so it’s a bit more expensive.
The second one is ... who has the most products? And [they did that] not really looking at “are these products going to drive sales?” [They thought]: “I was asked to just fill up the warehouses with products for sale.” So again: simplistic, single metric.
And so we ended up with warehouses that were full of stuff that wasn't selling, with inventory that we would have to write down, mark down, or write off. And we realised, well, it wasn't the category manager's fault. They did what we asked them to do! So we needed another metric.
We still believed that selection was important: we should have things in our fulfilment centres that customers are looking for. And it took several iterations, but really the metric (and it was a fairly complex one) worked like this: any time a customer went to a product page, the denominator went up by one, and any time that product was available to ship via what we’ll just call Amazon Prime (to simplify it), the numerator went up by one.
If you didn’t have it, the numerator got zero. So there are a couple of things you’ll notice here. One is that it’s demand weighted, not sales weighted; it’s based on what the customers are looking at. One problem with a sales-weighted metric is that if you’re not doing well and the sales go down, the item carries less weight in the metric because, well, there aren’t many sales for that item.
But if you are the one who caused the drop in sales yourself, that’s a problem. So we wanted a customer view instead of a sales-weighted one: what are customers looking at today on the website, and how many of those individual products do we have available to ship via Amazon Prime?
So ‘demand weighted in stock’ is what it eventually came to be called. Complicated metric, not a cool name or anything like that. And in terms of how you measure it: it’s quite complicated to collect all that data when you have a hundred million plus items in the catalog, [so it’s] not a trivial project either.
But it turned out that, one, that’s what customers cared about. Two, it drove the right behaviour for the category teams, because they would get inventory and products into the catalog that customers were looking at, and so it helped their inventory position. And then the third thing, very importantly, is that when we drove that metric in the right direction, we did notice an increase in sales in the output metric (emphasis added).
So that’s an example of a journey [to find a controllable input metric]. It didn’t happen overnight. It probably took a little over a year to get from ‘let’s just throw products in our fulfilment centres’ to having that instrumented metric in the WBR. So that’s an example of what I mean when we talk about [how] you’ve got to start somewhere.
So pick something and look at the correlation. If there isn’t one, ask yourself: is this really a metric or an area we should be measuring? And if it is, are we measuring it the right way? And then, does it have the resulting impact on your output metrics?”
It’s important to note that the nature of the WBR prevented the bad situation from persisting. Implicit in the WBR process is the understanding that the initial controllable input metrics you pick might be the wrong ones. As a result, the WBR acts as a safety net — a weekly checkpoint to examine the relationships between controllable input metrics (which are set up as targets for operational teams) and corresponding output metrics (which represent the fundamental business outcomes that Amazon desires). If the relationship is non-existent or negative, Amazon’s leadership knows to kill that particular input metric. Said differently, the WBR assumes that controllable input metrics are only important if they drive desirable outcomes — if the metric is wrong, or the metric stops driving output metrics at some point in the future, the metric is simply dropped.
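To make that loop explicit, here’s a rough, hypothetical sketch of such a weekly check. The real WBR is a human review of trend charts rather than a script, and correlation is a crude stand-in for the causal relationship leadership actually debates; the function name, thresholds, and data below are our assumptions:

```python
from statistics import correlation  # Pearson's r; Python 3.10+

def review_input_metric(input_weekly: list[float],
                        output_weekly: list[float],
                        weak_threshold: float = 0.3) -> str:
    """Naive check of whether an input metric still tracks its output metric.
    Correlation is not causation; this only flags candidates for debate."""
    r = correlation(input_weekly, output_weekly)
    if r <= 0:
        return f"r = {r:+.2f}: no (or negative) relationship -- drop or redefine the input metric"
    if r < weak_threshold:
        return f"r = {r:+.2f}: weak relationship -- debate whether to keep it"
    return f"r = {r:+.2f}: input metric still appears to drive the output"

# Hypothetical weekly data: Fast Track In Stock (%) vs. orders (thousands)
fast_track_in_stock_pct = [81.0, 82.5, 84.0, 85.5, 87.0, 88.0]
orders_thousands        = [410.0, 418.0, 431.0, 440.0, 452.0, 460.0]
print(review_input_metric(fast_track_in_stock_pct, orders_thousands))
```

In practice, a team would look at many weeks of data, deseasonalised trends, and business context before killing a metric; the sketch only illustrates the ‘check the relationship, drop the metric if it’s gone’ discipline the WBR enforces.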