Bayesian probability calibration is a statistical technique that corrects overconfident predictions from sports betting models by adjusting raw probabilities to match observed outcomes. When a model predicts a 70% win probability but historically those predictions win only 64% of the time, calibration methods like Platt scaling and isotonic regression learn this mapping and adjust future predictions downward. This matters because Kelly Criterion bet sizing amplifies probability errors — a 6-percentage-point overestimate can cause bets to be sized 30% or more too large, turning a winning edge into systematic losses. Olympus Bets applies 15% Bayesian shrinkage (blending model probability with a 50% prior) plus Platt scaling trained on thousands of resolved predictions, achieving roughly 6-19% Brier score improvement across its leagues.
The Overconfidence Problem
Every sports prediction model is overconfident. This is not a criticism. It is a mathematical inevitability.
When a Monte Carlo simulation runs 10,000 iterations and finds that Team A wins 70% of them, it is reporting a precise answer to the question: "Given the input data and model structure, how often does Team A win in simulation?" But that is not the same question as: "How often will Team A actually win this game?"
The gap between those two questions is model uncertainty, and it consistently leaves true probabilities closer to 50% than the raw output suggests (that is, the model is overconfident) for several reasons:
- Input data is noisy. Team statistics are estimated from finite samples. A team's "true" offensive rating is not exactly 112.5; it is somewhere in a range around 112.5. The model uses the point estimate, which overstates how much we actually know.
- Models are incomplete. No model captures every factor that influences a game. Referee assignments, travel fatigue, locker room dynamics, in-game adjustments by coaches, and random bounces are all real factors that introduce variance the model cannot predict.
- Variance is underestimated. Models often underestimate the tail probabilities of extreme outcomes. The "cold shooting night" or "lucky bounces" scenario happens more often than most models predict, which compresses actual win probabilities toward 50% relative to model estimates.
This overconfidence is not academic. It directly impacts your bottom line through Kelly Criterion bet sizing. If your model says 70% and the true probability is 64%, Kelly will recommend a bet that is oversized by 30% or more (over 50% at typical -110 odds). Over hundreds of bets, that systematic oversizing accumulates into significant losses even when the model correctly identifies the winner most of the time.
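The oversizing is easy to quantify with the full-Kelly formula f* = p − (1 − p)/b, where b is the decimal odds minus one. A minimal sketch (the 70%/64% probabilities and -110 odds are illustrative):

```python
def kelly_fraction(p: float, decimal_odds: float) -> float:
    """Full-Kelly stake as a fraction of bankroll (0 when there is no edge)."""
    b = decimal_odds - 1.0
    return max(0.0, p - (1.0 - p) / b)

f_raw = kelly_fraction(0.70, 1.909)    # sized from the raw model probability
f_cal = kelly_fraction(0.64, 1.909)    # sized from the calibrated probability

print(f"raw 70% estimate: stake {f_raw:.1%} of bankroll")
print(f"calibrated 64%:   stake {f_cal:.1%} of bankroll")
print(f"oversized by:     {f_raw / f_cal - 1:.0%}")   # ~52% at these odds
```

Note that the size of the error depends on the odds: at longer odds the same 6-point probability overestimate produces a smaller relative oversizing, at short odds a larger one.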
What Is Probability Calibration?
A model is well-calibrated when its probability predictions match the observed frequency of outcomes. If the model assigns a 60% probability to 100 different events, approximately 60 of those events should actually occur. If 72 occur, the model is underconfident at that level. If only 53 occur, it is overconfident.
Calibration is typically visualized using a reliability diagram (also called a calibration curve). The x-axis shows the model's predicted probability in bins (50-55%, 55-60%, etc.), and the y-axis shows the actual win rate of predictions in that bin. A perfectly calibrated model produces a diagonal line from (0,0) to (1,1). Most uncalibrated models produce an S-shaped curve that is flatter than the diagonal, showing overconfidence at high probabilities and underconfidence at low probabilities.
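The binning behind a reliability diagram is straightforward to compute. A numpy sketch on synthetic data (the bin count and the simulated overconfidence pattern are illustrative):

```python
import numpy as np

def reliability_bins(probs, outcomes, n_bins=10):
    """Mean predicted probability vs. observed win rate per probability bin."""
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (probs >= lo) & (probs < hi)
        if mask.any():
            rows.append((lo, hi, mask.sum(), probs[mask].mean(), outcomes[mask].mean()))
    return rows  # (bin_lo, bin_hi, count, mean_predicted, observed_rate)

# Synthetic example: a model that overstates its distance from 50%.
rng = np.random.default_rng(7)
true_p = rng.uniform(0.4, 0.7, size=2000)   # true win probabilities
raw_p = 0.5 + 1.5 * (true_p - 0.5)          # model exaggerates the edge
wins = rng.random(2000) < true_p
for lo, hi, n, pred, obs in reliability_bins(raw_p, wins, n_bins=8):
    print(f"{lo:.2f}-{hi:.2f}: n={n:4d} predicted={pred:.3f} observed={obs:.3f}")
```

Plotting mean predicted against observed per bin gives the calibration curve; the gap from the diagonal is the miscalibration at that confidence level.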
A Concrete Example
Consider an NBA model evaluated over 500 games:
| Model Probability Bin | Games in Bin | Actual Win Rate | Calibration Error |
|---|---|---|---|
| 50-55% | 87 | 51.7% | Well calibrated |
| 55-60% | 112 | 56.3% | Slightly overconfident |
| 60-65% | 98 | 59.2% | Overconfident by 3.3% |
| 65-70% | 76 | 61.8% | Overconfident by 5.7% |
| 70-75% | 54 | 64.8% | Overconfident by 7.7% |
| 75-80% | 41 | 68.3% | Overconfident by 9.2% |
| 80%+ | 32 | 71.9% | Overconfident by 11%+ |
The pattern is clear and universal: the higher the model's stated confidence, the larger the calibration error. A model prediction of 75% actually corresponds to about 68% true probability. This is not a bug in any specific model. It is a structural feature of all prediction models operating in high-variance domains like sports.
Bayesian vs. Frequentist Calibration
There are two philosophical frameworks for approaching calibration:
Frequentist Calibration
The frequentist approach treats calibration as a purely empirical mapping: "When the model said X%, outcomes happened Y% of the time, so remap X to Y." This is straightforward and data-driven, but requires large sample sizes to produce stable mappings in each probability bin, and cannot incorporate prior knowledge about the expected direction or magnitude of miscalibration.
Bayesian Calibration
The Bayesian approach incorporates prior beliefs about how the model should be calibrated. The most basic prior is that models are overconfident: predictions should be "shrunk" toward 50%. More sophisticated Bayesian approaches use hierarchical models that share calibration information across similar leagues, account for sample size uncertainty in each bin, and update the calibration function as new data arrives.
The practical advantage of Bayesian calibration is that it works well with limited data. A new league or a new season might only have 50-100 resolved bets, far too few for reliable frequentist binning. Bayesian shrinkage provides a principled default (pull toward 50%) that gradually defers to the data as more results accumulate. This is particularly important for sports betting, where each new season partially resets the data available for calibration.
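One common way to make shrinkage defer to the data is to weight the 50% prior by an effective prior sample size, in the spirit of a Beta prior. This is a hypothetical illustration (the prior_strength value is not Olympus Bets' actual parameter):

```python
def bayesian_shrink(model_prob: float, n_resolved: int,
                    prior_strength: float = 200.0) -> float:
    """Blend the model probability with a 50% prior; the prior's weight
    fades as resolved bets accumulate (w = k / (k + n))."""
    w = prior_strength / (prior_strength + n_resolved)
    return (1.0 - w) * model_prob + w * 0.5

# With no history the output is the prior; with a full season it nears the model.
for n in (0, 50, 500, 5000):
    print(f"n={n:5d}: 70% model prob -> {bayesian_shrink(0.70, n):.3f}")
```

Early in a season the pull toward 50% dominates; by the time thousands of bets have resolved, the shrinkage is nearly a no-op and the empirical calibration layers carry the load.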
Platt Scaling Explained
Platt scaling is the most widely used calibration technique in machine learning, and it adapts naturally to sports betting models. Named after John Platt, who introduced it in 1999 for calibrating support vector machine outputs, Platt scaling fits a logistic regression to the model's raw outputs and the actual outcomes:
Platt Scaling Formula
calibrated_prob = 1 / (1 + exp(-(A × raw_prob + B)))
Where A and B are parameters fitted on historical data (model predictions vs. actual outcomes).
The sigmoid function naturally compresses extreme probabilities toward 50%, which is exactly the correction overconfident models need. The fitted parameters A and B determine how much compression occurs: a model that is severely overconfident will have parameter values that aggressively flatten the probability curve, while a model that is only mildly overconfident will have parameters closer to A=1, B=0 (the identity function).
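Fitting A and B amounts to a two-parameter logistic regression on (raw probability, outcome) pairs. A self-contained numpy sketch on synthetic data (the simulated overconfidence, learning rate, and iteration count are illustrative; production code would normally use a library solver such as scikit-learn's LogisticRegression):

```python
import numpy as np

def fit_platt(raw_probs, outcomes, lr=0.5, n_iter=5000):
    """Fit A, B in sigmoid(A*raw + B) by gradient descent on log loss."""
    x = np.asarray(raw_probs, dtype=float)
    y = np.asarray(outcomes, dtype=float)
    A, B = 1.0, 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(A * x + B)))
        grad = p - y                      # derivative of log loss w.r.t. the logit
        A -= lr * float(np.mean(grad * x))
        B -= lr * float(np.mean(grad))
    return A, B

def platt(raw_prob, A, B):
    """Apply the fitted calibration map."""
    return 1.0 / (1.0 + np.exp(-(A * raw_prob + B)))

# Synthetic history: the model's raw outputs overstate distance from 50%.
rng = np.random.default_rng(0)
raw = rng.uniform(0.50, 0.85, 4000)
true = 0.5 + 0.6 * (raw - 0.5)            # true win prob is compressed toward 50%
wins = (rng.random(4000) < true).astype(float)

A, B = fit_platt(raw, wins)
print(f"A={A:.2f}, B={B:.2f}; raw 75% -> calibrated {platt(0.75, A, B):.3f}")
```

On this data the fitted curve pulls a 75% raw prediction down toward the mid-60s while leaving predictions near 50% essentially untouched, which is the compression pattern described above.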
Platt scaling has several advantages for sports betting:
- Monotonic: If the raw model says Team A is more likely to win than Team B, the calibrated output preserves that ordering. Calibration changes the magnitudes, not the rankings.
- Smooth: Unlike binned calibration tables, Platt scaling produces a continuous function that does not suffer from bin-boundary artifacts.
- Low-parameter: Only two parameters (A and B) need to be estimated, which means Platt scaling works well even with modest sample sizes (100+ games).
- Self-correcting: As new game results arrive, the parameters can be re-estimated, allowing the calibration to adapt to changes in model performance over time.
How Calibration Prevents Overbetting
The connection between calibration and bet sizing is direct and quantifiable. Consider a bet at -110 odds (decimal 1.909) where the model estimates a 70% win probability:
Without calibration, Kelly recommends wagering 19.2% of bankroll, which maps to the maximum 3.0 units. With calibration, the recommendation drops to 12.6%, which maps to 2.5 units. That 0.5-unit difference might seem modest on a single bet, but compounded over hundreds of bets across a season, it represents the difference between sustainable profitability and the slow bleed of systematic oversizing.
More importantly, calibration correctly sizes the difference between a 60% edge and a 70% edge. Without calibration, a model whose outputs anywhere from 58% to 72% all correspond to a true probability of roughly 60-65% will dramatically oversize the high-confidence bets (which are the most overconfident) while only slightly oversizing the moderate-confidence bets.
The Overconfidence Inversion: Why High-Confidence Picks Perform Worst
One of the most counterintuitive findings in sports betting analytics is the overconfidence inversion: the picks a model is most confident about often have the worst actual performance, not the best.
This seems paradoxical. If the model says 75%, should those picks not win more often than picks rated at 60%? In absolute terms, yes: 75% picks do win more often than 60% picks. But relative to their predicted probability, they perform much worse. A pick predicted at 60% that wins 58% of the time is close to calibrated. A pick predicted at 75% that wins 65% of the time is losing significant money because the Kelly Criterion sized the bet based on a 75% expectation.
The practical result, validated across thousands of real bets, is that the highest-confidence picks often produce the worst return on investment (ROI). The model's perceived edge at 75% (roughly 20+ percentage points above the market) attracts the maximum unit sizing, but the realized edge is only about 10-12 percentage points, meaning the position is oversized by nearly 2x.
Calibration fixes this by compressing the 75% prediction down to approximately 65-68%, which produces appropriate Kelly sizing for the actual edge. The picks still get above-average unit sizing (they are genuinely above-average plays), but not the maximum sizing that their uncalibrated probability would demand.
Ensemble Methods: Beyond Single-Model Calibration
Platt scaling calibrates a single model's output. Ensemble methods go further by combining multiple models and calibration approaches into a final probability estimate. The most effective approach for sports betting is gradient boosting stacking:
- Generate base predictions from multiple sources: Monte Carlo simulation, market-implied probability, historical base rates, matchup-specific models
- Treat each base prediction as a feature in a gradient boosting model (XGBoost or LightGBM)
- Train the stacker on historical outcomes to learn the optimal weighting and non-linear combination of base predictions
- The stacker's output is the final calibrated probability
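The steps above can be sketched with scikit-learn's GradientBoostingClassifier standing in for XGBoost/LightGBM. Everything below is a synthetic stand-in: the signal sources, data-generating process, and hyperparameters are illustrative, not the production stacker:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(1)
n = 3000
# Synthetic stand-ins for the base prediction sources:
mc_prob = rng.uniform(0.35, 0.80, n)                               # Monte Carlo output
mkt_prob = np.clip(mc_prob + rng.normal(0, 0.04, n), 0.01, 0.99)   # market-implied
base_rate = np.full(n, 0.55)                                       # league base rate
true_p = 0.5 + 0.6 * (mkt_prob - 0.5)      # here the market is closest to the truth
y = (rng.random(n) < true_p).astype(int)

# Each base prediction becomes a feature; the stacker learns the weighting.
X = np.column_stack([mc_prob, mkt_prob, base_rate])
stacker = GradientBoostingClassifier(n_estimators=200, max_depth=2, learning_rate=0.05)
stacker.fit(X, y)
final_prob = stacker.predict_proba(X)[:, 1]

print("raw Brier:    ", round(float(np.mean((mc_prob - y) ** 2)), 4))
print("stacked Brier:", round(float(np.mean((final_prob - y) ** 2)), 4))
```

In practice the stacker must be evaluated on held-out games (or with time-ordered cross-validation); the in-sample comparison above only illustrates the mechanics.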
The ensemble approach has two key advantages over single-model calibration. First, it naturally weights the most accurate signal sources more heavily. If market-implied probabilities are more accurate than the Monte Carlo model for a particular league, the stacker learns to weight the market signal higher. Second, it captures non-linear interactions: perhaps the Monte Carlo model is well-calibrated when it agrees with the market but overconfident when it diverges, a pattern that Platt scaling alone cannot capture.
Real-World Impact: Brier Score Improvement
The standard metric for evaluating probability calibration is the Brier score, defined as the mean squared error between predicted probabilities and actual outcomes (0 or 1). A Brier score of 0 represents perfect predictions, while 0.25 represents the score of always predicting 50%.
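In code, the Brier score is one line. A quick sketch:

```python
import numpy as np

def brier_score(probs, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    return float(np.mean((probs - outcomes) ** 2))

print(brier_score([0.7, 0.6, 0.8], [1, 0, 1]))   # (0.09 + 0.36 + 0.04) / 3 ≈ 0.1633
print(brier_score([0.5, 0.5], [1, 0]))           # always-50% baseline: 0.25
```

Lower is better, and improvement is usually reported as the percentage reduction relative to the raw model's score, as in the table below.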
Across our validated leagues, calibration produces Brier score improvements of roughly 6% to 19.4%:
| League | Raw Brier Score | Calibrated Brier Score | Improvement |
|---|---|---|---|
| NBA | 0.228 | 0.214 | 6.1% |
| NHL | 0.248 | 0.232 | 6.5% |
| CBB | 0.231 | 0.199 | 13.9% |
| Soccer | 0.252 | 0.203 | 19.4% |
These improvements translate directly into better Kelly sizing, fewer oversized bets, and smoother bankroll growth curves. The largest improvements come from leagues where the raw model was most overconfident (soccer, college basketball), while leagues where the model was already reasonably calibrated (NBA) see more modest but still meaningful gains.
How Olympus Bets Implements Bayesian Calibration
Calibration at Olympus Bets is not a single step but a multi-layer pipeline that operates daily:
- Bayesian shrinkage: All raw model probabilities are shrunk 15% toward 50% before any further processing. This provides an immediate floor of conservatism: shrunk_prob = model_prob × 0.85 + 0.50 × 0.15
- Platt scaling: League-specific logistic regression calibration fitted on the most recent 200+ resolved bets, re-estimated weekly
- Ensemble stacking: A gradient boosting model combines the calibrated Monte Carlo probability with market-implied probability, league base rates, and contextual features (home/away, rest days, division game) to produce the final probability estimate
- Regime-aware adjustment: The calibration parameters are adjusted based on the current "regime" (hot streak, cold streak, normal) detected by the regime calibrator, preventing the model from oversizing during temporary streaks
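The first two layers can be sketched directly. The A and B values below are hypothetical (a real deployment fits them per league on resolved bets), and the ensemble and regime layers are omitted:

```python
import math

def shrink(p: float, weight: float = 0.15) -> float:
    """Layer 1: shrink 15% toward the 50% prior."""
    return (1.0 - weight) * p + weight * 0.5

def platt_scale(p: float, A: float, B: float) -> float:
    """Layer 2: league-specific Platt scaling with fitted parameters A, B."""
    return 1.0 / (1.0 + math.exp(-(A * p + B)))

A, B = 2.4, -1.2             # hypothetical fitted values for an overconfident league
raw = 0.70
step1 = shrink(raw)          # 0.70 -> 0.670
step2 = platt_scale(step1, A, B)
print(f"raw={raw:.2f}  shrunk={step1:.3f}  platt-scaled={step2:.3f}")
```

Each layer nudges the probability toward a defensible estimate before it reaches the Kelly sizing step, which is why the final number can sit well below the raw simulation output.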
The result: when Olympus Bets reports a 62% probability on a pick, that number has been calibrated across four independent layers. It represents our best estimate of the actual probability, not the raw simulation output. This calibrated probability then flows into Kelly Criterion sizing to produce unit recommendations that reflect genuine, survivable edge sizes.
Further Reading
- Monte Carlo Simulation in Sports Betting — how the raw probability estimates are generated before calibration
- Kelly Criterion for Sports Betting — how calibrated probabilities drive optimal bet sizing
- Kelly Criterion Calculator — interactive tool to compute optimal bet size from calibrated probabilities
- Our Methodology — full technical overview of the Olympus Bets pipeline