In our previous post on forecasting we talked about two things: first, we examined the rules and scoring criteria the Good Judgment Project used to evaluate its forecasters, and then we looked at a personal example of how one might take these ideas and apply them to a real-world domain.
This week, we’re going to look at a slightly more interesting question raised by the GJP: how, exactly, are these superforecasters getting their results?
Those of you who have read the Superforecasting book are probably thinking: “ahh, they got those results by overcoming bias — that is, by tamping down on System 1 heuristics through System 2 analytical techniques.” At the time of the book’s publication in 2015, this was the prevailing theory — that is, a bunch of forecasters were observed to systematically outperform actual intelligence analysts (with access to classified data) in various three-lettered agencies, and therefore Tetlock, Mellers, Moore and co theorised that they did so by overcoming cognitive biases and by thinking probabilistically about the world.
This is partially correct. But it turns out that the second part of that sentence was mistaken. The GJP was a research project that sought to identify two things: first, what superforecasters looked like, and second, what methods existed with which forecasting performance might be improved. Very recent research into the data generated by the GJP suggests that cognitive bias avoidance explained only around 25% of performance improvement after interventions. Put differently, superforecasters might be better at overcoming some of their biases, but the bulk of GJP’s performance-enhancing interventions were explained by … something else instead.
The Three Interventions
Before we discuss the actual techniques that superforecasters use, or the techniques that the GJP used to enhance their performance, we need to talk about whether copying these techniques would even work for us. What if the superforecasters were genetic marvels? Or, to quote Daniel Kahneman: “are superforecasters different people, or are they people who do different things?”
This requires a short discussion about the interventions staged during the GJP.
The GJP ran as an annual forecasting tournament for about four years, between 2011 and 2015. After the first year of the tournament, Tetlock et al. identified the best-performing forecasters and grouped them into elite teams (each equipped with a private online discussion board) and continued to do so for the remaining seasons. They also began to introduce interventions into various sub-groups of the participants, to see if education on bias reduction techniques, for instance, would result in improved forecasting performance. All of these interventions were introduced via randomised controlled trials.
All told, the GJP implemented three interventions: training, teaming, and tracking. I’m talking about this now because all sorts of interesting questions emerge when you place these three interventions in the context of the broader judgment and decision making literature. For instance, one thing you might reasonably ask is “can people really overcome their biases?” … and the researchers working on the GJP saw their project as an opportunity to explore that very question, at least within the narrow confines of forecasting geopolitical events.
We’ll go through GJP’s three interventions quickly, because this is necessary background but not super useful to the practitioner:
- Training was the activity of giving forecasters an extremely short training module at the start of the tournament. As Mellers explains in this interview: they taught people how to think about probabilities, where to get probabilistic information, how to average professional forecasts, and then they presented participants with a list of common biases in human judgment. At the outset of the intervention, none of the researchers thought this would amount to very much.
- Teaming was the act of lumping forecasters together into teams, each with a private, online discussion board. The researchers discovered that forecasters would encourage each other, correct each other, and push back against ill-judged predictions. They also discovered that teaming resulted in better predictions overall.
- Tracking was the practice of lumping higher-performing forecasters with each other. In the book, Tetlock mentions that putting superforecasters together with each other ‘juiced’ their performance compared to when they were forecasting individually; they continued adding new superforecasters to elite groups as the years went on. About 70% of superforecasters remained superforecasters year-on-year; 30% churned out after a year at the top.
It’s at this point that you might ask: how much did these interventions contribute to eventual performance improvement? The answer here is: a little. In the Superforecasting book, for instance, Tetlock describes his surprise at discovering that their 60-minute training intervention resulted in a 10% improvement against control groups over the course of a year. He quotes Aaron Brown, then-chief risk manager of AQR Capital Management: “It’s so hard to see (10%) because it’s not dramatic, but if it is sustained it’s the difference between a consistent winner who’s making a living, or the guy who’s going broke all the time.”
A more important question is: “how likely am I to improve if I adopt the techniques of the superforecasters?” The answer here seems to be: ‘a little’. In her summary paper about the project, Barbara Mellers notes that they managed to improve the Brier scores of intelligence analysts by 50-60% over the course of the GJP. 50-60% is a hell of an improvement, but I’d warn against celebrating this. We must remember that intelligence analysts are a highly selected bunch: they are recruited from difficult courses in good schools and are likely to have higher-than-average (ahem) intelligences.
The general question that we’re trying to answer here, the one that’s sort of hanging out in the background over everything is: ‘is this nature or nurture?’ And Tetlock believes that it’s both. Superforecasters have higher-than-average fluid intelligence. They score higher on tests of open-mindedness. They possess an above-average level of tested general knowledge. But all three of GJP’s interventions have resulted in sustained performance improvements: over time, the correlation between intelligence and forecasting results dropped (which Tetlock took to mean that continued practice was having an effect, even on average forecasters).
So there’s good reason to believe that these techniques are useful, and that if we used them we should see small Brier-score improvements of at least 10%. The GJP’s results go further and tell us that if we happened to be of above-average intelligence, if we were open-minded by nature, and if we were the type of person who naturally distrusts the supernatural … well then, the odds are good that our improvements would be higher.
How They Actually Do It
Superforecasters perform so well because they think in a very particular way. This method of thinking is learnable. I’ll admit that exposure to this style of thinking has had an unforeseen side-effect in the years since I read Superforecasting: I find myself comparing the rigour of any analytical argument against the ideal examples presented by Tetlock and Gardner. As you’ll soon see, the superforecasters of the GJP set a high bar for analysis indeed.
Break the Problem Into Smaller Questions
Superforecasters break stated forecasting problems into smaller subproblems for investigation. The term that Tetlock uses is to ‘Fermi-ize’ a problem (that is, to ‘do a Fermi estimation’), a fancy name for the method physicist Enrico Fermi used to make educated guesses.
The canonical example for ‘Fermi estimation’ is the question: “how many piano tuners are there in Chicago?” — a brainteaser Fermi reportedly enjoyed giving to his students. In order to answer this question, a good Fermi-estimation goes roughly as follows:
- The number of piano tuners must surely be limited by the amount of work available for them in Chicago.
- What determines the amount of piano tuning work available? This probably depends on how many pianos there are in Chicago, and how often a piano needs to be tuned each year. We’ve now decomposed this question into two sub-questions: How many pianos are there in Chicago? How often must a piano be tuned in a year?
- If we can guess the length of time needed to tune a single piano, and multiply that over all the pianos in Chicago and the number of times they must be tuned each year, then we can get at the total amount of piano tuning work available in Chicago per year.
- Finally, we can guess how many hours a year the average piano tuner works.
- Our final answer should be the total amount of piano tuning work divided by the average number of work hours a piano tuner spends in a year, and we’ll have a number that’s roughly the number of piano tuners in Chicago.
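The steps above can be sketched in a few lines of Python. This is only an illustration of the decomposition: every number below is a placeholder guess of mine, not data, and the names (`PianoTunerGuess`, `estimate_tuners`) are made up for this sketch.

```python
from dataclasses import dataclass


@dataclass
class PianoTunerGuess:
    """Inputs to the estimate. Every field is a guess, not a measurement."""
    pianos_in_city: float
    tunings_per_piano_per_year: float
    hours_per_tuning: float        # time to tune one piano, travel included
    tuner_hours_per_year: float    # hours one tuner works in a year


def estimate_tuners(g: PianoTunerGuess) -> float:
    # Total piano tuning work available in the city per year, in hours.
    work_hours = g.pianos_in_city * g.tunings_per_piano_per_year * g.hours_per_tuning
    # Divide by the hours a single tuner can supply in a year.
    return work_hours / g.tuner_hours_per_year


# Placeholder guesses, chosen only to show the arithmetic.
guess = PianoTunerGuess(
    pianos_in_city=50_000,
    tunings_per_piano_per_year=1,
    hours_per_tuning=2,
    tuner_hours_per_year=2_000,
)
print(estimate_tuners(guess))  # 50.0
```

The point is not the final number but that each input is now a separate, smaller question you can research or argue about on its own.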
Great, you might say, this example is fine for a brainteaser. How does it apply to an actual forecast? We’ll pick a real-world case from the Superforecasting book:
On October 12, 2004, Yasser Arafat, the seventy-five-year-old leader of the Palestine Liberation Organization, became severely ill with vomiting and abdominal pain. Over the next three weeks, his condition worsened. On October 29, he was flown to a hospital in France. He fell into a coma. Decades earlier, before adopting the role of statesman, Arafat had directed bombings and shootings and survived many Israeli attempts on his life, but on November 11, 2004, the man who was once a seemingly indestructible enemy of Israel was pronounced dead. What killed him was uncertain. But even before he died there was speculation that he had been poisoned.
In July 2012 researchers at Switzerland’s Lausanne University Institute of Radiation Physics announced that they had tested some of Arafat’s belongings and discovered unnaturally high levels of polonium-210. That was ominous. Polonium-210 is a radioactive element that can be deadly if ingested. In 2006 Alexander Litvinenko—a former Russian spy living in London and a prominent critic of Vladimir Putin—was murdered with polonium-210.
That August, Arafat’s widow gave permission for his body to be exhumed and tested by two separate agencies, in Switzerland and France. So IARPA asked tournament forecasters the following question: “Will either the French or Swiss inquiries find elevated levels of polonium in the remains of Yasser Arafat’s body?”
Will either the French or Swiss inquiries find elevated levels of polonium in the remains of Yasser Arafat’s body? Boy, what a question. How would you tackle it?
If you were like most readers (myself included), you’d probably start by reading up on Yasser Arafat’s life and times. You might find old news pieces about the various attempts on his life. You might even dig into the geopolitics of the region. That’s what your instincts would tell you, wouldn’t it? ‘Either Israel poisoned Arafat, or they didn’t. Which is it?’
But Tetlock points out two things: first, your impulsive dive into the life-and-times of Yasser Arafat is really an answer to a substitute question: ‘Did Israel poison Arafat?’ This isn’t the question you’re supposed to answer, but your brain did a bait-and-switch anyway. Second, the superforecasters of the GJP show us that a proper Fermi-ization of this question is simpler: ignore the surrounding geopolitics. Ignore Arafat’s history. Focus on the question itself:
- Polonium decays quickly. For the answer to this question to be a yes, the scientists would have to be able to detect polonium on a dead man who’s been buried for years. Could this be done? What is the likelihood of success? What’s the error rate?
- Assuming this can be done, how easy would it be to contaminate Arafat’s remains with enough polonium to trigger a positive result? (Realise that ‘Israel poisoned Arafat’ is merely one root cause, another might be ‘contaminate Arafat’s remains to frame Israel, so Israel looks bad’).
- For each plausible way that polonium could have ended up in the remains, raise the probability that a positive test emerges.
- Recognise that there are two labs, and that a positive of either would result in a yes for the overall question. This, too, elevates the overall probability of a positive result.
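To see why the last point pushes the number up, here is a small sketch of the ‘either lab’ arithmetic. It assumes the two labs’ results are independent, which is my simplifying assumption and almost certainly too generous (both labs test the same remains), and the 40% figure is a placeholder, not anything from the book.

```python
def prob_any(probabilities):
    """P(at least one event happens), ASSUMING the events are independent."""
    p_none = 1.0
    for p in probabilities:
        p_none *= 1.0 - p   # probability that this event does not happen
    return 1.0 - p_none


p_single_lab = 0.4  # hypothetical chance that one lab reports elevated polonium
p_either_lab = prob_any([p_single_lab, p_single_lab])
print(round(p_either_lab, 2))  # 0.64
```

Because the labs are not really independent, the honest answer sits somewhere between the single-lab figure and this upper estimate. Either way, a question worded ‘either A or B’ deserves a higher probability than A alone.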
To let this style of analysis sink in, I’ll give you a second, personal example. In January, earlier this year, John Gruber wrote a blog post titled ‘The R-Word’. The piece opens with analysis of Mark Gurman’s “Apple Plans to Cut Back on Hiring Due to iPhone Sales Struggles” Bloomberg piece, but goes on to speculate that China was then going through a recession, and that Tim Cook was spooked.
I read this in the morning, on a weekend, and became irrationally frightened. I sent the piece to all of my friends, along with a “holy shit if China is facing a recession, what are the odds that it infects the rest of the world and we have another global recession like in 2009?” You can see what happened in those few seconds: a narrative unfurled itself in my head, fully-formed, and out it went, through my fingers and into the ether of chat.
Almost immediately, friends began to push back. “So?” one asked. “What does this mean?” another said. “How do you know?” asked a third.
I realised I’d done two substitutions. I went from “China might be having a recession” to “China is having a recession and this might trigger a global recession” to “A global recession would affect me and everyone I care for.” So I took a moment and decomposed this opinion into two separate statements: “what are the odds that Singapore is affected by a Chinese recession?” and “what are the odds that a Chinese recession might cause a global recession?”
I won’t bore you with my investigation into the first statement, but here’s what I discovered for the second one: the IMF defines a global recession as a ‘decline in annual per-capita real world GDP, weighted for purchasing power parity’. The IMF recognises four global recessions in total, roughly one per decade: 1975, 1982, 1991, and 2009.
Each recession has a number of complex causes, but to provide a gross oversimplification (this is all pieced together from a combination of IMF reports and the great arbiter of truth, Wikipedia):
- The 1975 recession was triggered when Nixon left the gold standard (amongst other things, because this coincided with an oil crisis due to the Yom Kippur war).
- The 1982 recession was triggered by an oil crisis thanks to the Iranian revolution, but was compounded by a mismanagement of the money supply by the US Fed.
- The 1991 recession was caused by a 1990 oil shock, coupled with a weak economy thanks to (again) restrictive money supply due to the US Fed.
- And, finally, the 2009 recession was caused by the collapse of the US real estate market.
What did this mean? It meant that four of the previous four global recessions started with a recession in the US. Three of those four recessions were started by an oil shock. And more importantly, when Japan fell into recession in the 90s, it did not trigger a global recession, despite being the third largest consumer in the global economy.
The upshot here was that my fear was overblown. The probability of a world recession being triggered by a Chinese recession was a lot lower than my internal narrative indicated. Not impossible, for sure — history is not mean-reverting, so the future may well look different from the past. But it did mean that I had to calibrate my fear downwards.
Take the Outside View First
This leads us to a second habit. Superforecasters take the ‘outside view’ first, before they consider the ‘inside view’.
Imagine that you’re working on a project, and you are asked to predict the amount of time it would take to complete it. The most intuitive approach to this task would be to think about the project from the inside: that is, to think about the various tasks left in the project, to evaluate the relative merits of the team and the budget you have left, and to try and extrapolate from the tasks you have already completed.
This is known as the ‘inside view’: you analyse according to your knowledge of the internal details of your situation.
The ‘outside view’ would be to ask yourself: how long do most teams of this size, taking on a project of this scope, working in roughly similar circumstances, take to accomplish a similar project? The important insight here is that by looking at the external environment for similar situations, you’d have better calibration for your estimate. This calibration can differ wildly from your internal forecast.
The ‘inside’ and ‘outside’ views were described by Daniel Kahneman in his seminal book Thinking, Fast and Slow. In the book, he tells the story of working on a textbook project for the Israeli school system. When initially asked for an estimate, his team guessed two to three years based on their internal evaluations. But when pressed for an outside view, they quickly realised that most textbook teams took seven years on average, with a large number of them never finishing the textbook at the end of the seven years.
Indeed, when Kahneman left Israel, the textbook had taken eight years and was never used to teach decision-making skills to students. Kahneman called it "one of the most instructive experiences of his professional life".
With my ‘China recession’ example, the outside view would’ve been to consider: how many global recessions were caused by Chinese ones? Or, to put it another way: what is the base rate for global recessions caused by non-US countries?
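Here is that base-rate question as a few lines of Python. The table is transcribed from my own gross oversimplification above, so the flags are a reading of that list rather than an independent dataset, and `base_rate` is just a helper name for this sketch.

```python
# One row per IMF-recognised global recession, flagged according to the
# simplified summary in this post (not an authoritative classification).
RECESSIONS = [
    {"year": 1975, "started_in_us": True, "oil_shock": True},
    {"year": 1982, "started_in_us": True, "oil_shock": True},
    {"year": 1991, "started_in_us": True, "oil_shock": True},
    {"year": 2009, "started_in_us": True, "oil_shock": False},
]


def base_rate(cases, predicate):
    """Fraction of past cases for which predicate(case) is true."""
    matches = sum(1 for case in cases if predicate(case))
    return matches / len(cases)


non_us_origin = base_rate(RECESSIONS, lambda r: not r["started_in_us"])
oil_shock = base_rate(RECESSIONS, lambda r: r["oil_shock"])
print(non_us_origin, oil_shock)  # 0.0 0.75
```

Four cases is a tiny sample, so a count of zero should not be read as a probability of zero. It is a starting anchor, nothing more.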
Base rates are easy if there are clear historical examples to draw from. But what of our Yasser Arafat example? There’s no set of dead Middle Eastern leaders to examine, no base rate of ‘% of bodies found to be poisoned when exhumed under suspicious conditions’. How do we take an outside view of this forecast?
Tetlock walks us through a Fermi-like examination:
Let’s think about this Fermi-style. Here we have a famous person who is dead. Major investigative bodies think there is enough reason for suspicion that they are exhuming the body. Under those circumstances, how often would the investigation turn up evidence of poisoning? I don’t know and there is no way to find out. But I do know there is at least a prima facie case that persuades courts and medical investigators that this is worth looking into. It has to be considerably above zero. So let’s say it’s at least 20%. But the probability can’t be 100% because if it were that clear and certain the evidence would have been uncovered before burial. So let’s say the probability cannot be higher than 80%. That’s a big range. The midpoint is 50%. So that outside view can serve as our starting point.
Superforecasters begin with an outside view so that they can anchor on a well-calibrated first guess. Then they switch to an inside view.
In our Arafat example, Tetlock walks us through a superforecaster’s thinking process once it’s time to switch to the inside view:
When Bill Flack Fermi-ized the Arafat-polonium question, he realised there were several pathways to a ‘yes’ answer: Israel could have poisoned Arafat; Arafat’s Palestinian enemies could have poisoned him; or Arafat’s remains could have been contaminated after his death to make it look like a poisoning. Hypotheses like these are the ideal framework for investigating the inside view.
Start with the first hypothesis: Israel poisoned Yasser Arafat with polonium. What would it take for that to be true?
1. Israel had, or could obtain, polonium.
2. Israel wanted Arafat dead badly enough to take a big risk.
3. Israel had the ability to poison Arafat with polonium.
Each of these elements could then be researched—looking for evidence pro and con—to get a sense of how likely they are to be true, and therefore how likely the hypothesis is to be true. Then it’s on to the next hypothesis. And the next.
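One way to keep yourself honest about a checklist like this is to multiply it out. The sketch below treats a hypothesis as true only if every one of its conditions holds, and assumes the conditions are independent. Both the independence assumption and the numbers are mine, purely for illustration; the book does not give figures.

```python
from math import prod


def prob_all(condition_probabilities):
    """P(every condition holds), ASSUMING the conditions are independent."""
    return prod(condition_probabilities.values())


# Placeholder judgments for one hypothesis; replace each with your own
# estimate after researching the evidence pro and con.
conditions = {
    "had, or could obtain, polonium": 0.9,
    "wanted the target dead badly enough to take a big risk": 0.3,
    "had the ability to carry out the poisoning": 0.5,
}
print(round(prob_all(conditions), 3))  # 0.135
```

Notice how quickly a chain of individually plausible conditions shrinks: the hypothesis can be no more likely than its weakest link. You would repeat this for each pathway to a ‘yes’ before combining them.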
This is the sort of work that goes into a forecast. As Tetlock notes: it’s methodical, slow, and demanding. But it works far better than ‘wandering aimlessly in a forest of information’.
Get Other Perspectives And Synthesise
Decomposing the question and exploring inside and outside views can only get you so far. The next step is to synthesise multiple analyses.
Superforecasters do this in a couple of ways. They actively solicit feedback. They welcome alternative viewpoints. They generate as many narratives as possible.
From their second year onwards, Tetlock and his fellow researchers put forecasters into teams. Teaming is an easy way to get additional perspectives: you get pushback and debate whenever you’re dealing with other people. Synthesis of new viewpoints is unavoidable. But in other ways it is also difficult: you need the group dynamics to gel if you want to do well (forecasters are ranked within teams, but the teams’ overall Brier scores are also ranked against other teams).
Tetlock notes that in the end, however, a forecaster must synthesise multiple analyses into a final evaluation. This is fundamentally an individual judgment. Good forecasters question themselves, attack their own ideas, and strain for alternative perspectives, even when left alone. Bill Flack writes his thinking down, so he can step back and scrutinise it: “Do I agree with this? Are there holes in this? Should I be looking for something else to fill this in? Would I be convinced by this if I were somebody else?”
Another technique that several superforecasters use is to take a break from a forecast, sleep on it, and return with a fresh set of eyes. This approach is known as using ‘the crowd within’, a reference to the ‘wisdom of crowds’, which isn’t a bad reference, really. Tetlock writes:
Researchers have found that merely asking people to assume their initial judgment is wrong, to seriously consider why that might be, and then make another judgment, produces a second estimate which, when combined with the first, improves accuracy almost as much as getting a second estimate from another person.
Here’s an example of synthesis from a superforecaster:
“On the one hand, Saudi Arabia runs few risks in letting oil prices remain low because it has large financial reserves,” wrote a superforecaster trying to decide if the Saudis would agree to OPEC production cuts in November 2014. “On the other hand, Saudi Arabia needs higher prices to support higher social spending to buy obedience to the monarchy. Yet on the third hand, the Saudis may believe they can’t control the drivers of the price dive, like the drilling frenzy in North America and falling global demand. So they may see production cuts as futile. Net answer: Feels no-ish, 80%.” (As it turned out, the Saudis did not support production cuts—much to the shock of many experts.)
Like economists, superforecasters turn out to have lots of hands.
Start With a Probability Estimate and Then Update
The ‘start from the outside view/base rate’ idea is useful not only because it anchors you to a realistic judgment at the beginning of an analysis, but also because it is a useful starting point for a Bayesian updating process.
‘Bayesian updating’ here means something simple: you start out with an initial probability judgment (called your prior), and then you make small updates to that over time, as new information comes in. Some superforecasters do use Bayes’ theorem itself (Tetlock gives one example in his book), but many of them simply adopt the underlying principles of the theorem: they read the news, and then make updates to their forecast commensurate with the importance of said news.
Tetlock notes that updates can be small (superforecasters are known to debate differences in a single percentage point, as we'll see in a bit) but they may also be large (if an initial forecast is found to be wildly off the mark). The remarkable thing is that the best forecasters are never beholden to their past opinions. They view beliefs as ‘hypotheses to be tested, not treasures to be protected.’
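For the curious, this is what a single update with Bayes’ theorem looks like in code. It is a minimal sketch: the function name and the example numbers are mine, and in practice the hard part is judging how much more likely the news is under ‘yes’ than under ‘no’.

```python
def bayes_update(prior, p_news_if_yes, p_news_if_no):
    """Posterior P(yes | news) from a prior and two likelihoods (Bayes' theorem)."""
    numerator = prior * p_news_if_yes
    denominator = numerator + (1.0 - prior) * p_news_if_no
    return numerator / denominator


# Placeholder numbers: start at a 50% prior, then see a piece of news you
# judge to be twice as likely if the answer is 'yes' as if it is 'no'.
posterior = bayes_update(prior=0.5, p_news_if_yes=0.6, p_news_if_no=0.3)
print(round(posterior, 3))  # 0.667
```

News that is equally likely either way leaves the forecast where it was, which is why most headlines should move you very little.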
We return to our Arafat-polonium example:
Long after Bill Flack made his initial forecast on the Arafat-polonium question, the Swiss research team announced that it would be late releasing its results because it needed to do more (unspecified) testing. What did that mean? It could be irrelevant. Maybe a technician had celebrated his birthday a little too heartily and missed work the next day. There was no way to find out. But by then Bill knew a lot about polonium, including the fact that it can be found in a body as a result of it being introduced in that form, or it can be produced in the body when naturally occurring lead decays. To identify the true source, analysts remove all the polonium, wait long enough for more of the lead (if present) to decay into polonium, then test again. The Swiss team’s delay suggested it had detected polonium and was now testing to rule out lead as its source. But that was only one possible explanation, so Bill cautiously raised his forecast to 65% yes. That’s smart updating. Bill spotted subtly diagnostic information and moved the forecast in the right direction before everyone else, as the Swiss team did, in fact, find polonium in Arafat’s remains.
As a reminder: in the GJP, multiple updates are allowed for each question. At the end of the forecast, all of the updates made by a forecaster would be rolled up into a single, final, Brier score for the one question. Flack’s score for the Arafat forecast was 0.36, which looks bad as an absolute number, but is in reality great when compared to the vast majority of forecasters who worked on it. (And not bad at all for an outcome that shocked so many experts).
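To make the roll-up concrete, here is a rough sketch of how a series of updates could turn into one score. I’m assuming the two-outcome Brier formulation that runs from 0 (perfect) to 2 (perfectly wrong), and a plain average over the days the question was open. The daily forecasts below are invented for illustration and are not Flack’s actual record.

```python
def brier(p_yes, outcome_yes):
    """Two-outcome Brier score: 0 is perfect, 2 is confidently wrong."""
    o = 1.0 if outcome_yes else 0.0
    # Squared error on the 'yes' side plus squared error on the 'no' side.
    return (p_yes - o) ** 2 + ((1.0 - p_yes) - (1.0 - o)) ** 2


def question_score(daily_forecasts, outcome_yes):
    """Average the daily Brier scores (ASSUMED roll-up rule for this sketch)."""
    scores = [brier(p, outcome_yes) for p in daily_forecasts]
    return sum(scores) / len(scores)


# Hypothetical history: two days at 50%, then an update to 65%; outcome 'yes'.
hypothetical_history = [0.5, 0.5, 0.65, 0.65]
print(round(question_score(hypothetical_history, outcome_yes=True), 4))  # 0.3725
```

A forecaster who sat at 50% throughout would score 0.5 on this scale, so anything below that means the updates moved in the right direction.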
Be Oddly Specific About Numbers
It turns out that superforecasters get into arguments over the smallest of percentage points. This isn't an accident: Barbara Mellers ran the numbers and discovered that granularity of probability estimates correlated in the GJP sample with accuracy: the forecaster who sticks with tens—20%, 30%, 40%—is less accurate than the forecaster who uses fives—20%, 25%, 30%—and less accurate still than the finer-grained forecaster who uses ones—20%, 21%, 22%. Mellers then took the sample and rounded forecasts to make them less granular — ones to fives, fives to tens, and so on — and discovered that superforecasters lost accuracy in response to even the smallest-scale rounding (nearest 0.05), whereas regular forecasters lost little even with rounding four times as large.
Precision of probability estimate matters. But why is this the case? You might think this is an affectation, perhaps because people do use granularity as bafflegab. (One can imagine the Wall Street analyst who says, confidently, “there is a 67% chance that Apple's stock will finish the year 16% above where it started!”)
But Tetlock relays the following anecdote:
I once asked Brian Labatte, a superforecaster from Montreal, what he liked to read. Both fiction and nonfiction, he said. How much of each, I asked? “I would say 70%…”—a long pause—“no, 65/35 nonfiction to fiction.”
Labatte and other superforecasters like him are precise by nature, not precise for affectation's sake. Tetlock thinks this orientation matters a great deal. He believes that it tells us about something fundamental to superforecasters: that they cultivate precision in their heads, and have to fight against their nature for it to happen.
Left to themselves, most people have only three notches in the probability meter in their heads: “gonna happen, not gonna happen, and maybe.” The CIA recommends five notches; the National Intelligence Council recommends seven. But superforecasters fight over single percentage points. They occasionally argue about rounding up or rounding down fractions. And they do this not because they want to show off, but because they’re crazy in that particular way.
When coming up with probability estimates, granularity matters.
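If you keep a record of your own forecasts, you can run a rough version of the rounding test on yourself. The sketch below is not Mellers’s actual procedure; it simply coarsens whole-percent forecasts to the nearest five or ten and gives you a scoring function to compare the two versions. The function names and the sample forecasts are made up.

```python
def coarsen(percent: int, step: int) -> int:
    """Round a whole-percent forecast to the nearest multiple of step (ties go up)."""
    return (percent + step // 2) // step * step


def mean_brier(forecasts_percent, outcomes):
    """Mean two-outcome Brier score (0 to 2 scale) over resolved questions."""
    scores = [2 * (p / 100 - o) ** 2 for p, o in zip(forecasts_percent, outcomes)]
    return sum(scores) / len(scores)


# Hypothetical forecasts, in percent, to show what coarsening does.
fine_grained = [3, 12, 48, 87, 96]
print([coarsen(p, 5) for p in fine_grained])   # [5, 10, 50, 85, 95]
print([coarsen(p, 10) for p in fine_grained])  # [0, 10, 50, 90, 100]
```

To run the comparison, call `mean_brier` on your original forecasts and on the coarsened ones, using the same list of 0/1 outcomes. With only a handful of questions the difference is mostly noise, so this is only meaningful over a long record.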
Evaluate Performance and Reflect on Cognitive Biases
The best forecasters regularly reflect on past failures and successes for feedback. Readers familiar with the literature on deliberate practice might understand why this matters: you can’t get better without a clear feedback cycle.
Tetlock notes, with some pride, that the Brier scores and the team discussion boards forced forecasters to be honest with themselves. Naturally, the best ones go back to review the mistakes they made:
“We agreed as a team that the half-life of polonium would make detection virtually impossible. We didn’t do nearly enough to question that assumption, for example by considering whether the decay products would be a way to detect polonium, or by asking someone with expertise in the area.” That’s a message Devyn Duffy posted for his teammates after he and his team took a drubbing on the question about whether Yasser Arafat’s remains would test positive for polonium. The lesson he drew: “Be careful about making assumptions of expertise, ask experts if you can find them, reexamine your assumptions from time to time.”
Sometimes they share lengthy postmortems with teammates. These online discussions can go on for pages. And there’s a lot more introspection when superforecasters have quiet moments to themselves. “I do that while I’m in the shower or during my morning commute to school or work,” Jean-Pierre Beugoms says, “or at random moments during the day when I’m bored or distracted.” In the first two seasons of the tournament, Jean-Pierre would often look at his old forecasts and be frustrated that his comments were so sparse that “I often could not figure out why I made a certain forecast” and thus couldn’t reconstruct his thought process. So he started leaving more and longer comments knowing it would help him critically examine his thinking. In effect, Beugoms prepares for the postmortem from the moment the tournament question is announced.
And they did so even when they got it right. On a question about elections in Guinea — a question his team aced — Duffy says: “I think that with Guinea, we were more inclined to believe that the protests wouldn’t prevent the elections from taking place. But then they nearly did! So we lucked out, too.”
It’s also worth noting that the best forecasters remind each other of cognitive biases to guard against. This was especially prominent when Daniel Kahneman ran an experiment on the GJP forecasters, designed to test for a bias with particular relevance to forecasting called ‘scope insensitivity’.
Scope insensitivity is the effect where a person substitutes the question asked of them with another, easier question, tossing aside the scope within the question originally asked. In the original experiment that he ran, Kahneman asked a group of randomly selected individuals how much they would be willing to pay to clean up the lakes in a small region of the province of Ontario. They answered about $10. He then asked another randomly selected group how much they would pay to clean up the lakes in the whole of Ontario. This group also answered about $10.
Later experiments replicated this result. One particularly arresting replication asked people how much they would pay to stop migratory birds from drowning in oil ponds; one group was told 2,000 birds drowned a year, the other was told 20,000 birds died each year. Both groups said they would pay around $80. (Kahneman & Frederick, 2002)
Kahneman concluded that people were really responding to a prototypical image that sprung up in their heads, not to the scope of the problem. He wrote: “The prototype automatically evokes an affective response, and the intensity of that emotion is then mapped onto the dollar scale.” This effect matters to forecasting because the scope of a question matters to the confidence of the forecast. “How likely is it that the Assad regime will fall in the next three months?” should result in a different probability rating from “How likely is it that the Assad regime will fall in the next six months?” Kahneman predicted that there would be widespread scope insensitivity — that most forecasters would fail to see the difference between the two.
Mellers ran several studies within the GJP cohort and found that — exactly as Kahneman predicted — the vast majority of forecasters were scope insensitive. They said ~40% to both questions. But the superforecasters did much better: they rated the first question (three months) as 15%, and the second question (six months) as 24%. This was a large enough difference to catch Kahneman by surprise.
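One rough way to sanity-check a pair of answers like these is to ask what the shorter-window forecast implies for the longer window. The sketch below assumes a constant, independent chance of the event in each three-month window. That is a simplifying assumption made purely for illustration, not something the GJP studies are described as using, and `implied_probability` is a hypothetical helper name.

```python
def implied_probability(p_window: float, n_windows: float) -> float:
    """Chance of at least one occurrence across n_windows equal windows.

    Assumes (hypothetically) the same independent chance p_window per window.
    """
    if not 0.0 <= p_window <= 1.0:
        raise ValueError("p_window must be between 0 and 1")
    if n_windows < 0:
        raise ValueError("n_windows must be non-negative")
    # P(at least once) = 1 - P(never) = 1 - (1 - p)^n
    return 1.0 - (1.0 - p_window) ** n_windows


# Superforecasters: 15% at three months implies this at six months.
print(round(implied_probability(0.15, 2), 4))  # 0.2775
# Typical forecasters: 40% at three months implies this at six months.
print(round(implied_probability(0.40, 2), 4))  # 0.64
```

Under that assumption, the superforecasters’ 24% sits fairly close to the implied 28%, while answering ~40% to both questions is off by a wide margin.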
It turned out that superforecasters were discussing the problems of scope insensitivity well before Tetlock, Mellers and team thought to warn them of it. They didn’t call it by that name, of course. But they reminded each other to be wary of question scopes in their forecasts, and to react accordingly.
(Keep this anecdote in mind; we’ll return to it in a second.)
The real winner here is the notion of ‘perpetual beta’ that all superforecasters seem to share. Tetlock defines this as ‘the degree to which one is committed to belief updating and self-improvement’. Another way of saying this is that superforecasters have both grit and a growth mindset in abundance, which isn’t really a surprise: we see such attributes at the top of many realms of human performance, and we shouldn’t expect differently in the domain of geopolitical forecasting.
So let’s summarise what superforecasters do to get the results they do:
- They ‘Fermi-ize’ the question by deconstructing it into smaller component questions.
- They start with the outside view, before looking at the inside view.
- They generate or collect as many alternative perspectives as possible, before synthesising them into one overall judgment.
- They start with an initial probabilistic estimate (usually from the outside view) and then update rapidly in response to new information.
- They are oddly granular in their probability estimations.
- They evaluate both failures and successes regularly.
- They watch out for their own cognitive biases.
- They remain in ‘perpetual beta’: that is, they are committed to belief updating and self-improvement.
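The fourth item, starting from a base rate and then updating, can be made concrete with Bayes’ rule in odds form. This is only a sketch: nothing above says superforecasters compute this explicitly, and the function name, the 20% base rate and the likelihood ratios are made-up numbers for illustration.

```python
def update(prior: float, likelihood_ratio: float) -> float:
    """Bayes' rule in odds form: posterior odds = prior odds * likelihood ratio.

    likelihood_ratio is P(evidence | event) / P(evidence | no event).
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must be strictly between 0 and 1")
    if likelihood_ratio < 0:
        raise ValueError("likelihood_ratio must be non-negative")
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)


# Hypothetical outside-view base rate of 20%.
estimate = 0.20
# New evidence judged three times likelier if the event is coming.
estimate = update(estimate, 3.0)
print(f"{estimate:.2f}")  # 0.43
# Later evidence judged half as likely if the event is coming.
estimate = update(estimate, 0.5)
print(f"{estimate:.2f}")  # 0.27
```

The point is less the arithmetic than the habit: each piece of news nudges the number, rather than leaving it untouched or flipping it wholesale.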
We’ll end this piece with a return to the discovery I mentioned near the beginning. In the years since the GJP concluded, Barbara Mellers has continued to dig into the data generated by the forecasting tournaments, this time with an eye to explaining why the interventions worked as well as they did.
The original intention of all three interventions was to reduce cognitive biases, in line with the Kahneman and Tversky research project of the time. But that isn’t quite how they turned out to work. In a recent interview, Mellers and INSEAD professor Ville Satopää explained that they went back and applied a statistical model to the entire 2011-2015 GJP dataset, one designed to tease out the effects of bias, information and noise (BIN) respectively. Satopää writes:
How does the BIN model work? Simply put, it analyses the entire “signal universe” around a given question. Signals are pieces of information that the forecasters may take into account when trying to guess whether something will happen. In formulating predictions, one can rely upon either meaningful signals (i.e. information extraction) or irrelevant signals (i.e. noise). One can also organise information along erroneous lines (i.e. bias). Comparing GJP groups that experienced one or more of the three interventions to those that did not, the BIN model was able to disaggregate the respective contributions of noise, information and bias to overall improvements in prediction accuracy.
This explanation isn’t satisfying to me, but apart from an interview with Mellers and Satopää, there’s not much else to go on: the paper is still in progress, and hasn’t been published.
But the conclusions are intriguing. Here’s Satopää again:
Our experiments with the BIN model have also produced results that were more unexpected. Recall that teaming, tracking and training were deployed for the express purpose of reducing bias. Yet it seems that only teaming actually did so. Two of the three — teaming and tracking — increased information. Surprisingly, all three interventions reduced noise. In light of our current study, it appears the GJP’s forecasting improvements were overwhelmingly the result of noise reduction (emphasis added). As a rule of thumb, about 50 percent of the accuracy improvements can be attributed to noise reduction, 25 percent to tamping down bias, and 25 percent to increased information.
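To see why noise and bias are separate sources of error, here is a toy calculation. It is not the BIN model, whose details aren’t given in the interview; it just uses the textbook identity that expected squared error equals bias squared plus noise variance, and assumes that averaging k forecasts with independent noise and a shared bias divides the noise variance by k. All the numbers are invented.

```python
def expected_squared_error(bias: float, noise_sd: float, k: int = 1) -> float:
    """Expected squared error of the average of k forecasts.

    Assumes (hypothetically) every forecast shares the same bias and has
    independent noise with standard deviation noise_sd.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    # Averaging shrinks the noise variance by k but leaves the shared bias intact.
    return bias ** 2 + (noise_sd ** 2) / k


# One forecaster: bias of 0.05, noise standard deviation of 0.20.
print(round(expected_squared_error(0.05, 0.20, 1), 4))   # 0.0425
# Average of ten such forecasters.
print(round(expected_squared_error(0.05, 0.20, 10), 4))  # 0.0065
```

In this toy setup most of the error disappears without the bias term changing at all, which at least makes it plausible that noise reduction could account for a large share of an accuracy gain.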
The authors have little to offer by way of actionable insight. In their interview, they suggest that using algorithms would improve noise reduction, but they also note that machines aren’t great at the multi-perspective synthesis that human superforecasters are so great at. It’s unclear to me if this is actionable. I’ll wait for the paper and report back to you once that’s out.
To me, the most useful takeaway here is the one about teaming: if Mellers and Satopää are to be believed, teaming was the only intervention that actually reduced cognitive bias. This matches something that I’ve heard over and over, but didn’t want to believe:
- Kahneman believes that we cannot individually overcome our cognitive biases.
- Psychologist Jonathan Haidt has argued that the magic of academia is really the set of norms that helps cancel out our self-serving biases, in the pursuit of a ‘truth-seeking game’. He then adds that we’re just not good at this ‘truth-seeking game’ on our own.
- Writer, programmer, and startup operator Kevin Simler has proposed that the only way to be ‘more rational’ is to cultivate and then become a member of a ‘rationalist community’; we really have no hope of doing so as individuals.
I’ve resisted these conclusions because they don’t seem very promising for the purpose of self-improvement: the argument is that either you’re predisposed to resisting cognitive biases or you’re not; either way, the only way to improve is to create or join a team.
Mellers and Satopää’s work with the GJP seems to me to be the strongest evidence that you can’t fight your own cognitive biases alone. This matters in the fiercely competitive environment of the GJP: Tetlock says that every forecaster, superforecasters included, is ‘always just a System 2 slipup away from a blown forecast and a nasty tumble down the rankings.’ And so they’re better off in a team.
I’ll be honest here: I didn’t go into this essay with the expectation that I would learn something new. I thought I’d internalised most of what Tetlock had to offer. But Mellers’s recent results have made me reconsider my original stance on cognitive biases.
When it comes to overcoming biases, it seems, it’s dangerous to go alone.
This post is part of The Forecasting Series. Read the rest of the posts here.