Last week, we talked about how some of the leaders of the cognitive biases and heuristics research tradition have increasingly turned to noise reduction as a method for improving decisions. We also discussed why this shift exists: it turns out that reducing cognitive bias in real world decision making is too damned hard to do.
This week, let’s talk about what we currently know about noise reduction.
Table of Contents
- Noise Audits
- Reasoned Rules
- Aggregate Judgments
- Use Algorithms, But Make Them Tolerable
- The Mediating Assessments Protocol
1) Noise Audits
Kahneman et al introduced two actionable ideas in their 2016 HBR article on noise: first, the idea of a noise audit, and second, the idea of a reasoned rule.
A noise audit is a method for identifying noise in an organisation’s decision making. Kahneman and his collaborators observed that bias is rather difficult to detect: you may have to wait years before you know the full outcomes of certain decisions. But they then observed that noise is much easier to measure: you can simply look at the variance in decision outputs across the organisation, without ever knowing what the ‘right’ outcome might be.
The noise audit is the tool they developed to do exactly this.
The intuition behind the noise audit is really simple: pick an existing decision-making process that you want to measure, come up with a few sample cases, and then run it through a large-ish selection of your employees. Then measure the variance in the outputs. How you measure is a little bit tricky — it helps, of course, if the expected output of the process is a numerical figure, but it’s possible to come up with a scoring rubric for more qualitative processes.
The researchers ran the following process in two financial services organisations:
- They asked the managers of the professional teams involved in their experiment to construct several realistic case files for evaluation. To prevent information from leaking, the entire exercise was to be conducted on the same day.
- Employees were asked to spend about half a day analysing two to four cases. They would decide on a dollar amount for each case file, as per normal.
- To avoid collusion, participants were not told that the study was concerned with reliability. (“We want to understand your professional thinking”, the researchers told the employees of one organisation.)
- The researchers then collected the evaluations, and calculated something called a ‘noise index’. This was defined as the difference between the assessments of two randomly chosen employees, expressed as a percentage of the average of those two assessments. For instance, say that one employee evaluates a case at $600, and another at $1000. The average of their assessments is $800, and the difference between them is $400, so the noise index is 400/800 x 100 = 50% for this pair. The researchers then did the same computation for all pairs of employees, before calculating an overall average for each case file. They also calculated an overall average for the organisation across all case files.
- As it turned out, the overall average for the first financial services organisation was 48%, and the overall average for the second was 60%. This came as a shock to the executives in both organisations, who were expecting differences to be at the level of 5-10%.
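The pairwise computation described above is simple enough to sketch in a few lines of Python (the function name and structure here are mine, not the researchers’):

```python
from itertools import combinations

def noise_index(assessments):
    """Average pairwise difference between assessors, expressed as a
    percentage of each pair's mean -- the 'noise index' from the
    Kahneman et al HBR noise audit."""
    pair_indices = []
    for a, b in combinations(assessments, 2):
        pair_mean = (a + b) / 2
        pair_indices.append(abs(a - b) / pair_mean * 100)
    return sum(pair_indices) / len(pair_indices)

# The worked example from the text: one assessor says $600, another $1000
print(noise_index([600, 1000]))  # → 50.0
```

With more than two assessors, the function simply averages the index over every pair, which is the per-case-file average the researchers computed.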
2) Reasoned Rules
So what do you do if your organisation has noisy decision making? In the second half of their HBR article, Kahneman and his collaborators suggest something called a ‘reasoned rule’ — which is a fancy name for an algorithm that employees can apply to their decision processes.
Why call it a ‘reasoned rule’? Why not just call it an ‘algorithm’? The authors argue that ‘algorithm’ conjures up certain expectations — for instance, people may believe that developing an algorithm demands a rigorous statistical analysis of, say, loan application outcomes, in order to develop a predictive formula for defaults. But a ‘reasoned rule’ is a lot simpler — you come up with a handful of common sense guidelines, and then you turn that into a formula that any employee can use.
The authors describe the process like so:
- First, pick six to eight variables that are related to the outcome being predicted. You may use common sense reasoning to pick these variables. For instance, in a loan application, you'll pick things like assets, revenues, and liabilities of the applicant in question.
- Take a representative handful of historical cases, and then compute the mean and standard deviation of each variable in that set.
- For every case in your historical set, come up with a ‘standard score’ for each variable. A standard score is the difference between the variable’s value in this case and its mean across the whole set, divided by the standard deviation. This gives you a number that can be compared and averaged across all the other variables.
- Compute a ‘summary score’ for this case. This is the average of all the standard scores for each variable of the case. This will be the output of your reasoned rule. You’ll be repeating this computation of the summary score for new cases, while using the mean and the standard deviation of the original set.
- Now, the final step: order the cases in your set from high to low summary scores, and then determine the cut-off points. For instance, you might want to say “the top 10% of applicants will receive a discount” and “we’ll reject the bottom 30%, which is anyone with a summary score below Y”.
Voila! You now have a formula for loan evaluations, one that tamps down on random variability.
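Here is a minimal Python sketch of the steps above, with made-up loan figures for illustration. One detail worth noting (my addition, not spelled out in the summary above): for variables where a higher value argues *against* the applicant, such as liabilities, you would flip the sign of the standard score before averaging.

```python
import statistics

# Direction of each variable: +1 if higher is better for the applicant,
# -1 if higher is worse. The sign choices are my assumption; pick them
# with the same common-sense reasoning used to pick the variables.
VARIABLES = {"assets": +1, "revenues": +1, "liabilities": -1}

def fit_reasoned_rule(historical_cases):
    """Freeze each variable's mean and standard deviation over a
    representative set of historical cases; reuse them for new cases."""
    stats = {}
    for var in VARIABLES:
        values = [case[var] for case in historical_cases]
        stats[var] = (statistics.mean(values), statistics.stdev(values))
    return stats

def summary_score(case, stats):
    """Average of the case's sign-adjusted standard scores."""
    z_scores = []
    for var, sign in VARIABLES.items():
        mean, sd = stats[var]
        z_scores.append(sign * (case[var] - mean) / sd)
    return sum(z_scores) / len(z_scores)

# Hypothetical historical loan applications (figures are invented)
history = [
    {"assets": 120, "revenues": 40, "liabilities": 80},
    {"assets": 300, "revenues": 90, "liabilities": 150},
    {"assets": 60,  "revenues": 25, "liabilities": 40},
]
stats = fit_reasoned_rule(history)
print(summary_score({"assets": 200, "revenues": 70, "liabilities": 60}, stats))
```

Rank your historical cases by this score to pick the cut-off points; new applications are then scored against the *frozen* means and standard deviations of the original set.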
3) Aggregate Judgments
One other, common response to noise is to aggregate judgments. If you aggregate enough judgments, the logic goes, the random variability in those judgments should cancel out. This was originally one of the goals of the ACE program (it wasn’t called the Aggregate Contingent Estimation Program for no reason) — and indeed there was some work done to see if the Good Judgment Project could aggregate the judgments of superforecasters.
Aggregating judgments is commonly taught to MBA students as an analytical technique. A friend described two exercises from his INSEAD MBA experience:
- In one assignment we were asked to assess the accuracy of a linear regression model using data from three analysts, and then asked which analysts we would choose if we could only have two. It turns out that you don’t want the two “good” analysts, but the “worst” analyst plus the “best” analyst. This is because the “worst” analyst brings novel information into the analysis, while the two “good” analysts add little to each other’s.
- As a class we estimated the number of beans in a jar. But instead of taking everyone’s first guess we plotted out everyone’s guesses and asked everyone to update their guess. The second guess turns out to be more accurate, as a result of the discussion and updating process.
The problem with aggregating judgments, of course, is that it is expensive to do. In theory you can poll a group of skilled equity analysts and aggregate their judgments for everything you do. In practice this will cost you $2000 per hour per analyst, and you’d likely exhaust your research budget before you got good results. Michael Mauboussin summarises the approach: “If you are dealing with noisy forecasts and have a cost-effective way to combine them, do so (emphasis mine). The forecast is very likely to be more accurate than a single forecast.”
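A quick simulation makes the cancelling-out logic concrete. Assuming the forecasts are independent and unbiased (the idealised case; all figures below are invented), the average of many noisy forecasts lands much closer to the truth than any single one:

```python
import random

random.seed(42)
TRUE_VALUE = 100.0

def forecast():
    """A single noisy but unbiased forecast of the true value."""
    return random.gauss(TRUE_VALUE, 20)

def avg_abs_error(n_judges, trials=2000):
    """Average absolute error of the mean of n_judges forecasts."""
    errors = []
    for _ in range(trials):
        judgments = [forecast() for _ in range(n_judges)]
        aggregate = sum(judgments) / n_judges
        errors.append(abs(aggregate - TRUE_VALUE))
    return sum(errors) / trials

for n in (1, 5, 25):
    print(n, round(avg_abs_error(n), 2))
```

The error shrinks roughly with the square root of the number of judges, which is also why aggregation gets expensive: halving the remaining error means quadrupling the number of analysts you poll.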
4) Use Algorithms, But Make Them Tolerable
One other approach, of course, is to use actual algorithms. There is a large body of research on how algorithms outperform humans in all sorts of judgment and decision making tasks, but there is an equally large body of research on how humans distrust algorithms, are slow to adopt them, and are quick to blame them when things go wrong. (Dietvorst et al, 2015)
I think we all know that there are certain domains where it is ethically (or at least socially) unacceptable to use algorithmic judgment: for instance, it wouldn’t sit well with most of us if a newspaper reports “person is sentenced to lethal injection thanks to algorithm”; we also wouldn’t look too kindly on “sorry, your dad died on the operating table because the computer screwed up.” Social intuitions sometimes trump mechanistic goals.
So the question of algorithms becomes a more pragmatic one: how do we use algorithms in a way that is acceptable to human beings? One way is to simply adopt more algorithm-like processes (see the ‘reasoned rule’ above), instead of entrusting judgment completely to a computer. For instance, Mauboussin recommends that investors look at their existing processes and ask if there are parts of it that can be systematised; he also recommends that investors think about adopting checklists for certain aspects of the process.
But there is also this fascinating result, from Dietvorst et al, 2016, that people will use imperfect algorithms if they can (even slightly!) modify them. Kahneman calls this ‘disciplined intuition’, and Mauboussin argues that you can probably get a team of humans to use a systematic decision process if you let them apply their judgment to the final result.
“The evidence”, he writes, “shows that allowing that wiggle room improves the overall quality of the decision.”
5) The Mediating Assessments Protocol
The final technique we’ll cover today was articulated by Kahneman and his collaborators in MIT’s Sloan Management Review in 2019. They named it the ‘Mediating Assessments Protocol’, or ‘MAP’, but this is really a fancy name for a process that many of us should be familiar with.
In 1956, a young Daniel Kahneman created a structured interviewing process for the Israeli army, one that is still used (in modified form) today. As Kahneman argued in Thinking, Fast and Slow, the significance of that approach was that ‘intuitive judgment is much improved by delaying a global evaluation until the very end of a structured process’.
In a structured job interview:
- The interviewer has a pre-determined set of criteria to assess.
- During the interview, the interviewer scores the candidate on each criterion they are assigned to assess, before making an overall judgment at the very end.
- Interviewers are not allowed to discuss the candidate until they have all done their evaluations. Only after all interviews are done may they come together to pool their judgments.
Structured interviewing is now considered a best practice, and companies like Google, Facebook, Amazon, and McKinsey implement forms of it. The idea is so widespread, in fact, that when I designed the hiring process at my previous company, I accidentally made it a structured one, working off publicly available best practice guides.
MAP is simply Kahneman asking the question: “so structured interviews work really well … why not apply that idea to everything?” The article itself documents the history of the protocol, the ideas that motivated the design, and the experience of applying MAP to an unnamed venture capital firm’s investment decisions.
The process of creating a MAP is as follows:
- Define the assessments in advance. The decision maker must identify a handful of mediating assessments, which are key attributes that are critical to the evaluation. For instance, if you were hiring someone, you would put down things like “personality assessment” and “skill fit”. If you were investing in an early-stage company, you might list things like “strength of the founding team” and “potential size of the addressable market”.
- Use fact-based, independently made assessments. When performing the analysis, each assessment should be made independently of the others. To put this another way, decision makers are expected to answer the question “leaving aside how important this topic is to the overall decision, how strongly does the evidence for this particular assessment argue for or against it?” They are expected to record this as a score.
- If a deal-breaker fact is uncovered, stop the assessment. This isn’t so much a step as an obvious caveat: if you uncover something unacceptably bad, just stop the assessment and throw it out. You’ll save time for everyone involved.
- Make the final evaluation when the mediating assessments are complete. You are only allowed to think about the final decision when all key attributes have been scored and a complete profile of assessments is available. The entire process also applies when decision makers come together to make a joint decision: the group goes through each of the mediating assessments in order, and settles on a consensus score for each assessment. (They are not allowed to make general comments during this period.) When all of the mediating assessments are done, the final scores for every assessment may be displayed on some board or projector, and a discussion may turn to a holistic evaluation of the overall decision. At the end, a vote is taken — e.g. hire/no hire; acquire/no acquire; invest/no invest.
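The ordering constraint at the heart of the protocol — score every assessment independently, bail early on deal-breakers, and look at the whole profile only at the end — can be sketched as a small Python class. The assessment names are illustrative, and MAP itself is of course a human process; this merely encodes the discipline:

```python
class MediatingAssessments:
    """Sketch of the MAP discipline: the final evaluation is blocked
    until every pre-defined assessment has an independent score."""

    def __init__(self, assessment_names):
        self.scores = {name: None for name in assessment_names}
        self.deal_breaker = False

    def record(self, name, score, deal_breaker=False):
        """Score one assessment, independently of the others."""
        self.scores[name] = score
        if deal_breaker:
            self.deal_breaker = True  # stop: throw the case out

    def final_evaluation(self):
        if self.deal_breaker:
            return "reject"
        missing = [n for n, s in self.scores.items() if s is None]
        if missing:
            # holistic discussion is not allowed yet
            raise RuntimeError(f"score these assessments first: {missing}")
        # only now may decision makers see the complete profile
        return self.scores

profile = MediatingAssessments(["founding team", "market size"])
profile.record("founding team", 85)
profile.record("market size", 60)
print(profile.final_evaluation())  # → {'founding team': 85, 'market size': 60}
```

Calling `final_evaluation` before all assessments are scored raises an error — the programmatic equivalent of “no general comments until every mediating assessment has a consensus score”.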
Kahneman and his collaborators write the following on the underlying theory of the protocol:
The use of mediating assessments reduces variability in decision-making because it seeks to address the limitations of mental model formation, even though it cannot eliminate them entirely. By delineating the assessments clearly and in a fact-based, independent manner, and delaying final judgment until all assessments are finished, MAP tempers the effects of bias and increases the transparency of the process, as all the assessments are presented at one time to all decision makers. For example, because salient or recent pieces of information are not given undue weight, the process preempts the availability bias. MAP also reduces the risk that a solution will be judged by its similarity with known categories or stereotypes (an error arising from the representativeness bias). When differentiated, independent facts are clearly laid out, logical errors are less likely.
This raises one final question: how should you record the scores of each mediating assessment? Kahneman et al argue that verbal scores like “very good” and “bad” aren’t great; even numerical scores like 3 (on a scale of 10), or letter scores like ‘A’, ‘B’ and ‘C’, may be subject to differing standards amongst decision makers. Instead, the authors argue for percentile scores, like “this particular company has a founding team that is in the top 10% of other companies we’ve seen” and “the size of the opportunity for this company is in the top 60% of other companies we’ve invested in.”
Percentile scores do three things:
- First, they require the decision maker to think of comparable cases, and to think of the case at hand as one particular instance of a broader category. This approach is also known as ‘taking the outside view’, and is a ‘powerful debiasing technique by itself’.
- Second, percentile scores allow individual miscalibration to be detected and corrected. For instance, a decision maker who rates 40% of cases as being in the top 10% may be taken aside and coached (or, you know, excluded from further decisions). The interesting thing here is that people who have been taught to use a percentile scale well are said to be ‘well-calibrated’ — and calibration is a major step in reducing noise in judgments.
- Third, and finally, percentile ratings may be easily translated into policy. If, for instance, insurance underwriters rate risks in percentiles, then premiums can be priced based on those ratings, and an insurance company can decide at what percentile in the distribution of risks it is willing to underwrite. (You can imagine how an equivalent process might work for bet sizing or loan underwriting.)
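The first two bullets are easy to make concrete: a percentile rank of the case at hand against a ‘slate’ of past cases, plus a crude miscalibration check. (The helper names and the 15% tolerance threshold are my own illustrative choices.)

```python
def percentile_rank(value, reference_set):
    """Percentage of past cases the current case scores above --
    the 'slate' comparison from the MAP article."""
    below = sum(1 for v in reference_set if v < value)
    return 100 * below / len(reference_set)

def flags_miscalibration(rater_percentiles, top_decile=90, tolerance=0.15):
    """A rater who puts far more than 10% of their cases in the top
    decile is probably miscalibrated and may need coaching."""
    top = sum(1 for p in rater_percentiles if p >= top_decile)
    return top / len(rater_percentiles) > tolerance

# Hypothetical quality scores for founders seen previously
slate = [55, 60, 62, 70, 71, 75, 80, 88, 90, 95]
print(percentile_rank(85, slate))  # → 70.0
```

A new founder scoring 85 here sits above 7 of the 10 past founders, i.e. in the top 30% of the slate — exactly the kind of “one instance of a broader category” judgment the outside view asks for.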
In their article, the authors describe an implementation of MAP in a venture capital firm, noting that:
Assigning a numerical rating to each assessment safeguards against the tendency to form overly coherent mental models. For instance, when VC Co. formulated its assessment of a target company’s top team qualitatively, the wording was sometimes ambiguous: Depending on the mental model that had emerged from prior assessments, a word such as “strong” could be interpreted as either adding to a positive impression or expressing doubt. With a numerical rating, this risk is reduced.
Using percentile scores to make qualitative assessments puts recurrent decisions into context. To evaluate the caliber of a target company’s founders, for instance, VC Co. now uses a “slate” showing the names of the dozens of founders to whom it has had exposure. Instead of asking whether a founder is “an A, a B, or a C,” investment committee members ask: “On this particular subassessment — say, technical skills — is this person in the same league as Anna and the other five founders we regard as the very best on this dimension? Or is she comparable to Bob and the other entrepreneurs we view as belonging to the second quartile?” VC Co. applies the same comparative approach to other qualitative assessments, such as the potential defensibility of a company’s competitive advantage.
I like this approach a lot.
The main caveat I have with this entire collection of techniques is that I’ve not tried any of them in practice. I have implemented a structured interviewing process in the past, but I didn’t do percentile scores — and Kahneman and his collaborators say that getting your org used to percentile scores is a bit troublesome, at least at first.
There are likely other implementation gotchas that I do not know.
I think the core idea at the heart of this post, if there is one, is the following: if you want consistency in your org’s decision-making, you should learn to distrust free-form judgments. Expert intuition is a good thing, but — especially in an organisation — it’s probably a good idea to introduce some structure. Having such structure allows you to deploy your intuition in a slightly more disciplined manner, which in turn makes your judgments less variable over time.
- Kahneman et al’s Noise article in HBR is worth a read, and their Mediating Assessments Protocol piece in MIT’s Sloan Management Review is wonderful.
- Michael Mauboussin’s BIN There, Done That paper introduces three of the methods I have covered here as actionable implications of the BIN model. His paper was what led me to MAP as a possible intervention.
- Berkeley Dietvorst, Joseph Simmons and Cade Massey’s Overcoming Algorithm Aversion: People Will Use Imperfect Algorithms If They Can (Even Slightly) Modify Them paper is a pretty fascinating read. It begs the question: what algorithms might you use, if you could use them as inputs to a final, intuition-driven decision?