§04 · Almanac · No. 001

Tetlock's Good Judgment Project: how forecasting becomes measurable.

An introductory note on the long programme that turned political prediction from a televised opinion into a scored, falsifiable craft, and the standard of evidence Runaric tries to honour in /prediction.

Filed 2026 · 01 · 09 · Axis §03 · Prediction · Kind: Intro · external research

Between 1984 and 2003 Philip Tetlock, then at Berkeley, collected approximately twenty-eight thousand probabilistic forecasts from two hundred and eighty-four experts on political and economic questions of the day. Each forecast was solicited as a numeric probability rather than a narrative, scored against the eventual outcome using the Brier score, and aggregated across experts and topics. The result, published in 2005 as Expert Political Judgment, was sobering: experts as a group performed barely better than crude statistical baselines, and the most televised, most confident voices were typically the worst calibrated.
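As a concrete illustration of the scoring rule behind those numbers, here is a minimal Brier score in Python for a binary question; this is a sketch of the standard binary convention, not Tetlock's own tooling:

```python
def brier_score(forecast_p: float, outcome: int) -> float:
    """Squared error between a probability forecast and a 0/1 outcome.

    0.0 is a perfect forecast, 0.25 is what a permanent fifty-percent
    hedge earns, and 1.0 is full confidence in the wrong answer.
    """
    return (forecast_p - outcome) ** 2


# Confident misses are punished far more heavily than honest hedges:
print(brier_score(0.9, 1))  # 0.01  confident and right
print(brier_score(0.5, 1))  # 0.25  the fifty-fifty baseline
print(brier_score(0.9, 0))  # 0.81  confident and wrong
```

Averaged over every question a forecaster answered, this is the kind of number the study compared against its statistical baselines.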

Tetlock's lasting summary was the borrowed distinction between hedgehogs, who knew one big thing and bent every question to it, and foxes, who held many small models loosely and updated frequently. Hedgehogs lost. Foxes did better than chance, though not dramatically. The book's most important contribution was not the result but the methodology: predictions had to be specified in advance, scored mechanically, and published in the aggregate, including the misses.

The hedgehog knows one big thing; the fox knows many little things. In our data, foxes were modestly better forecasters; hedgehogs were modestly worse; and confident television hedgehogs were the worst of all.

Summary of Expert Political Judgment, Tetlock 2005

The follow-up was the Good Judgment Project, an IARPA-funded forecasting tournament that ran from 2011 to 2015 under the Aggregative Contingent Estimation programme. Several university teams competed against each other, against statistical aggregates, and (the more interesting comparison) against the United States intelligence community on hundreds of geopolitical questions with verifiable outcomes. The questions were ordinary forecasting fare: will country X hold an election by date Y, will commodity Z trade above a given price by a given month, will a named conflict be active on a named date.

The headline result of the tournament: a small subset of trained, properly aggregated volunteers, the so-called superforecasters, beat the intelligence community's internal estimates by significant and reproducible margins, sometimes by as much as thirty percent on the team's preferred Brier metric. The superforecasters were not specialists; they were ordinary people, mostly hobbyists, identified through the first year's tournament data and then trained in basic probabilistic reasoning and reference-class forecasting. Forecasting, it turned out, was a learnable craft, and the craft could be measured.

The wider methodological consequence has been quieter and arguably more important. The Good Judgment Project demonstrated that any prediction can be scored if its terms are written down precisely in advance, that aggregation across diverse forecasters routinely beats individual experts, and that calibration training, the discipline of assigning seventy percent only to events that turn out to occur about seventy percent of the time, transfers across domains. Variants of this discipline now underlie modern preregistration practice, the registered-report format in journals, and the internal forecasting tournaments used by some intelligence services and corporate strategy teams.
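A minimal sketch of the two mechanical pieces named above, an unweighted pool and a calibration check; the Good Judgment Project's production aggregator weighted forecasters by track record and extremized the pooled probability, which this sketch deliberately omits:

```python
from collections import defaultdict


def aggregate(forecasts: list[float]) -> float:
    """Unweighted mean of individual probability forecasts for one question."""
    return sum(forecasts) / len(forecasts)


def calibration_table(records: list[tuple[float, int]],
                      width: float = 0.1) -> dict[float, float]:
    """Bin (forecast, outcome) pairs by stated probability and report the
    observed frequency in each bin; a calibrated forecaster's 0.7 bin
    resolves to roughly 0.7 over many questions."""
    bins: dict[float, list[int]] = defaultdict(list)
    for p, outcome in records:
        bins[round(round(p / width) * width, 2)].append(outcome)
    return {b: sum(v) / len(v) for b, v in sorted(bins.items())}


print(aggregate([0.6, 0.75, 0.55]))            # 0.6333...
history = [(0.7, 1), (0.7, 1), (0.7, 0), (0.3, 0), (0.3, 1), (0.3, 0)]
print(calibration_table(history))              # {0.3: 0.333..., 0.7: 0.666...}
```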

For Runaric, the Good Judgment Project sets the standard of evidence for the prediction axis. Every trial in /prediction is preregistered, with numeric priors, mechanical scoring, and a sealed prediction filed before the outcome time. The instruments we use are unusual: a symbolic draw protocol, a hardware random-number generator, a ganzfeld session. The methodology around them is not. If any of those instruments deviate from the null, the deviation has to survive the same scoring discipline a Good Judgment forecast survives, or it does not count as a finding.
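A hypothetical sketch of what "sealed, then mechanically scored" can mean in practice. The field names and the hash commitment here are illustrative assumptions, not the actual /prediction file format:

```python
import hashlib
import json


def seal(prediction: dict) -> str:
    """Commit to a prediction before the outcome time by filing only the
    SHA-256 digest of its canonical JSON; the full record is revealed later."""
    canonical = json.dumps(prediction, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()


def score(prediction: dict, outcome: int) -> float:
    """Mechanical Brier scoring of the revealed prediction against a 0/1 outcome."""
    return (prediction["probability"] - outcome) ** 2


# Illustrative trial record: question, numeric prior, and outcome time are
# written down first; the digest is filed before the outcome is known.
trial = {"question": "symbolic draw matches the target",
         "probability": 0.25,
         "outcome_time": "2026-02-01T00:00:00Z"}
digest = seal(trial)    # filed in advance
print(score(trial, 0))  # 0.0625 once the outcome is recorded
```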

§04.001 · Sources

Primary references.

The note above is an introduction to existing research, not a Runaric finding. The references below are the primary sources a reader can consult directly.

  1. Ref · 01

    Tetlock, P. (2005). Expert Political Judgment: How Good Is It? How Can We Know? Princeton University Press.

  2. Ref · 02

    Tetlock, P. & Gardner, D. (2015). Superforecasting: The Art and Science of Prediction. Crown.

  3. Ref · 03

    Mellers, B., Stone, E., Atanasov, P., Rohrbaugh, N., et al. (2015). The psychology of intelligence analysis: drivers of prediction accuracy in world politics. Journal of Experimental Psychology: Applied 21, 1 to 14.

  4. Ref · 04

    Brier, G. W. (1950). Verification of forecasts expressed in terms of probability. Monthly Weather Review 78, 1 to 3.