induwara.lk

Expected Calibration Error (ECE) Calculator

Paste your model's prediction confidences and their 0/1 correctness to get the Expected Calibration Error and Maximum Calibration Error, a per-bin reliability table and diagram, and whether the model is over- or under-confident. Uses the equal-width binning of Guo et al. (2017), runs entirely in your browser, and needs no signup.

By Induwara Ashinsana · Updated Jun 19, 2026
Input format

Each row: predicted confidence in [0, 1], then 1 if that prediction was correct or 0 if not.

confidence, correct — e.g. 0.92, 1

Number of bins (M): equal-width bins, from 1 to 100. ECE depends on M; 10–15 is common.

ECE: 0.1640 (16.4% · weighted mean gap)
MCE: 0.4500 (45% · worst bin 3)
Mean conf vs acc: 0.7700 / 0.8000 (gap +0.0300, acc − conf)
Samples / bins: 10 / 5 (3 non-empty bins)
Verdict: Underconfident

ECE 0.1640 (16.4%) over 5 equal-width bins; MCE 0.4500 (45.0%). Lower is better — 0 means perfectly calibrated. Accuracy 0.800 exceeds mean confidence 0.770 by 0.030 — the model is underconfident overall.


Reliability diagram

[Diagram: accuracy and mean confidence plotted as a pair for each non-empty bin, with the bin midpoint (0.50, 0.70, 0.90) shown below each pair.]

Formulas

  • conf(Bₘ) = mean confidence in bin m
  • acc(Bₘ) = correct / count in bin m
  • ECE = Σₘ (|Bₘ|/N)·|acc − conf|
  • MCE = maxₘ |acc − conf|

Cross-check. The weighted mean gap gives ECE = 0.1640; the independent by-sums identity (1/N) Σ |Σy − Σc| per bin gives 0.1640. They reconcile, as they must — the result is verified.
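Why the two expressions in the cross-check must agree can be shown in two lines. This is a sketch using the page's own symbols: yᵢ is the 0/1 correctness and cᵢ the confidence of sample i, and the index set notation i ∈ Bₘ is an assumption of this derivation. For any non-empty bin, the bin size cancels:

```latex
\begin{aligned}
\frac{|B_m|}{N}\,\bigl|\operatorname{acc}(B_m) - \operatorname{conf}(B_m)\bigr|
  &= \frac{|B_m|}{N}\,\left|\frac{1}{|B_m|}\sum_{i \in B_m} y_i
     - \frac{1}{|B_m|}\sum_{i \in B_m} c_i\right| \\
  &= \frac{1}{N}\,\left|\sum_{i \in B_m} y_i - \sum_{i \in B_m} c_i\right|
\end{aligned}
```

Summing over the bins gives ECE = (1/N) Σₘ |Σy − Σc|. Empty bins have both sums equal to 0, so they add nothing under either form.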

Per-bin reliability table

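| Bin | Range | Count | Conf | Acc | Gap | Weight |
|---|---|---|---|---|---|---|
| 1 | [0.00, 0.20) | 0 | | | | 0.0000 |
| 2 | [0.20, 0.40) | 0 | | | | 0.0000 |
| 3 | [0.40, 0.60) | 1 | 0.5500 | 1.0000 | +0.4500 | 0.1000 |
| 4 | [0.60, 0.80) | 4 | 0.6675 | 0.5000 | -0.1675 | 0.4000 |
| 5 | [0.80, 1.00] | 5 | 0.8960 | 1.0000 | +0.1040 | 0.5000 |

ECE = Σ weight × |gap| = 0.1640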

Empty bins (dimmed) contribute 0 to ECE and are excluded from MCE. The bin driving the MCE is highlighted.

Method: ECE = Σ (|Bₘ|/N)·|acc(Bₘ) − conf(Bₘ)| and MCE = maxₘ |acc(Bₘ) − conf(Bₘ)| with equal-width bins (Guo et al. 2017; Naeini et al. 2015). Sources cited below the calculator. No data leaves this page.

How it works

Calibration asks a simple question: when a classifier says it is 80% sure, is it right about 80% of the time? The Expected Calibration Error turns that into a single number by comparing stated confidence with observed accuracy across confidence bands. The binned estimator used here was introduced by Naeini et al. (2015) and popularised by Guo et al. (2017).

Each prediction contributes a confidence cᵢ ∈ [0, 1] (the probability of the predicted class) and a correctness yᵢ ∈ {0, 1}. The interval [0, 1] is split into M equal-width bins of width 1/M, and a confidence c lands in the zero-indexed bin min(floor(c·M), M−1), so that c = 1.0 falls in the last bin. The per-bin reliability table numbers the same bins from 1.
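The bin assignment rule can be written as a one-line function. This is a minimal sketch, not the tool's actual source; the name `bin_index` and its parameters are hypothetical.

```python
import math


def bin_index(confidence: float, num_bins: int) -> int:
    """Zero-based index of the equal-width bin that holds `confidence`."""
    if num_bins < 1:
        raise ValueError("num_bins must be at least 1")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence {confidence!r} is outside [0, 1]")
    # floor(c * M) selects the bin; min(..., M - 1) folds c = 1.0 into the last bin.
    return min(math.floor(confidence * num_bins), num_bins - 1)
```

With M = 5, a confidence of 0.55 maps to index 2 (the third bin, [0.4, 0.6)) and 1.0 maps to index 4 rather than an out-of-range 5. Values that sit exactly on a bin edge depend on how the product c·M rounds in floating point.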

ECE = Σₘ (|Bₘ| / N) · |acc(Bₘ) − conf(Bₘ)|

  1. Validate. Every confidence must lie in [0, 1] and every label must be 0 or 1. Rows that fail are listed with the line number and reason — never a silent NaN or a dropped sample.
  2. Bin. For each bin Bₘ compute the mean confidence conf(Bₘ) and the accuracy acc(Bₘ) (fraction correct).
  3. Weight and sum. ECE is the sample-weighted average of the absolute bin gaps; empty bins contribute 0. The worst single gap is the Maximum Calibration Error:

    MCE = maxₘ |acc(Bₘ) − conf(Bₘ)|

  4. Verdict. Overall mean confidence and accuracy are computed directly from every sample (not from the bins), so the over- vs under-confidence verdict is exact and independent of M. Mean confidence above accuracy means overconfident; below means underconfident.
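The four steps above can be sketched in plain Python. This is an illustrative reading of the described procedure, not the tool's own code: the function name `calibration_report`, the returned dictionary keys, and the tolerance used for the "well matched" case are all assumptions.

```python
import math
from typing import Sequence, Tuple


def calibration_report(rows: Sequence[Tuple[float, int]], num_bins: int = 10) -> dict:
    """Binned ECE and MCE plus the overall verdict for (confidence, correct) rows."""
    if not rows:
        raise ValueError("no rows supplied")

    # Step 1: validate. Collect every problem with its row number; drop nothing.
    errors = []
    for line_no, (conf, correct) in enumerate(rows, start=1):
        if not (isinstance(conf, (int, float)) and 0.0 <= conf <= 1.0):
            errors.append(f"row {line_no}: confidence {conf!r} not in [0, 1]")
        if correct not in (0, 1):
            errors.append(f"row {line_no}: label {correct!r} is not 0 or 1")
    if errors:
        raise ValueError("; ".join(errors))

    # Step 2: bin. Accumulate per-bin counts, confidence sums and correct counts.
    count = [0] * num_bins
    conf_sum = [0.0] * num_bins
    hits = [0] * num_bins
    for conf, correct in rows:
        m = min(math.floor(conf * num_bins), num_bins - 1)
        count[m] += 1
        conf_sum[m] += conf
        hits[m] += correct

    # Step 3: weight and sum. Empty bins are skipped, so they add 0 to ECE
    # and never become the MCE.
    n = len(rows)
    ece, mce = 0.0, 0.0
    for m in range(num_bins):
        if count[m] == 0:
            continue
        gap = abs(hits[m] / count[m] - conf_sum[m] / count[m])
        ece += (count[m] / n) * gap
        mce = max(mce, gap)

    # Step 4: verdict from all samples directly, independent of num_bins.
    mean_conf = sum(conf for conf, _ in rows) / n
    accuracy = sum(correct for _, correct in rows) / n
    if math.isclose(mean_conf, accuracy, abs_tol=1e-12):  # tolerance is an assumption
        verdict = "well matched"
    elif mean_conf > accuracy:
        verdict = "overconfident"
    else:
        verdict = "underconfident"

    return {"ece": ece, "mce": mce, "mean_conf": mean_conf,
            "accuracy": accuracy, "verdict": verdict}
```

Called on the ten Demo preset rows with `num_bins=5`, this sketch should reproduce the figures shown in the calculator: ECE 0.164, MCE 0.450 and an underconfident verdict.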

In binary-probability mode the tool reduces a single positive-class probability p to a confidence the standard way: predicted class = (p ≥ 0.5), confidence = max(p, 1 − p), and correct = (predicted == true label). As an internal correctness gate, ECE is also recomputed by the algebraic identity (1/N) Σₘ |Σyᵢ − Σcᵢ| over each bin's raw sums, and the two values are asserted equal to floating-point precision before you see a result. Because ECE depends on the bin count, the chosen M is shown beside every number.
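The correctness gate described above can be illustrated by computing ECE both ways and comparing them. This is a sketch under stated assumptions: the function names `ece_weighted` and `ece_by_sums` and the tolerance of 1e-12 are hypothetical, and the tool's actual check may differ.

```python
import math


def _bin(conf: float, num_bins: int) -> int:
    # Same rule as in the text: min(floor(c * M), M - 1), zero-indexed.
    return min(math.floor(conf * num_bins), num_bins - 1)


def ece_weighted(rows, num_bins):
    """ECE as the sample-weighted mean of the per-bin |acc - conf| gaps."""
    bins = [[] for _ in range(num_bins)]
    for conf, correct in rows:
        bins[_bin(conf, num_bins)].append((conf, correct))
    n = len(rows)
    total = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute 0
        acc = sum(y for _, y in members) / len(members)
        mean_conf = sum(c for c, _ in members) / len(members)
        total += (len(members) / n) * abs(acc - mean_conf)
    return total


def ece_by_sums(rows, num_bins):
    """ECE via the identity (1/N) * sum over bins of |sum(y) - sum(c)|."""
    label_sum = [0.0] * num_bins
    conf_sum = [0.0] * num_bins
    for conf, correct in rows:
        m = _bin(conf, num_bins)
        label_sum[m] += correct
        conf_sum[m] += conf
    return sum(abs(y - c) for y, c in zip(label_sum, conf_sum)) / len(rows)


demo = [(0.55, 1), (0.60, 0), (0.62, 1), (0.70, 1), (0.75, 0),
        (0.80, 1), (0.85, 1), (0.90, 1), (0.95, 1), (0.98, 1)]
weighted = ece_weighted(demo, 5)
by_sums = ece_by_sums(demo, 5)
# The two routes differ only by floating-point rounding.
assert math.isclose(weighted, by_sums, abs_tol=1e-12)
```

Because the two functions accumulate in a different order, they are compared with a tolerance rather than with `==`.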

Worked examples

Overconfident-looking set — ECE 0.164, MCE 0.450 (the Demo preset, M = 5)

  1. Confidences/correct: (0.55,1)(0.60,0)(0.62,1)(0.70,1)(0.75,0)(0.80,1)(0.85,1)(0.90,1)(0.95,1)(0.98,1). N = 10
  2. Bin [0.4,0.6): {0.55✓} → count 1, conf 0.550, acc 1.000, gap 0.450
  3. Bin [0.6,0.8): {0.60✗,0.62✓,0.70✓,0.75✗} → count 4, conf 0.6675, acc 0.500, gap 0.1675
  4. Bin [0.8,1.0]: {0.80,0.85,0.90,0.95,0.98 all ✓} → count 5, conf 0.896, acc 1.000, gap 0.104
  5. ECE = (1/10)(0.450) + (4/10)(0.1675) + (5/10)(0.104) = 0.045 + 0.067 + 0.052 = 0.164
  6. MCE = max(0.450, 0.1675, 0.104) = 0.450 (driven by the [0.4,0.6) bin)
  7. Mean conf 0.770 < accuracy 0.800 → underconfident overall, even though the [0.6,0.8) bin is overconfident (conf 0.6675 > acc 0.500)

Perfectly calibrated set — ECE 0.000, MCE 0.000 (M = 10)

  1. Ten predictions all at confidence 0.70, exactly 7 correct: (0.70,1)×7 and (0.70,0)×3
  2. All fall in bin [0.7,0.8): count 10, conf 0.700, acc 7/10 = 0.700, gap 0.000
  3. ECE = (10/10)(0.000) = 0.000; MCE = 0.000
  4. Mean confidence 0.700 = accuracy 0.700 → well matched. This is the zero-error baseline

Binary-probability mode — the p ≥ 0.5 reduction (M = 2)

  1. Rows (probability, true label): (0.9,1)(0.8,1)(0.2,0)(0.6,0)
  2. Reduce each: (0.9,1)→conf 0.9, correct 1; (0.8,1)→conf 0.8, correct 1
  3. (0.2,0)→pred 0, conf max(0.2,0.8)=0.8, correct 1; (0.6,0)→pred 1, conf 0.6, correct 0
  4. Confidence/correct = (0.9,1)(0.8,1)(0.8,1)(0.6,0); all in bin [0.5,1.0]
  5. count 4, conf 3.1/4 = 0.775, acc 3/4 = 0.750, gap 0.025 → ECE = 0.025, MCE = 0.025
  6. Mean conf 0.775 > accuracy 0.750 → overconfident (just barely)
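The reduction used in this example can be sketched as a small helper applied to the four rows above. The name `reduce_binary` is hypothetical; the rule itself (p ≥ 0.5, max(p, 1 − p)) is the one stated on this page.

```python
def reduce_binary(probability: float, true_label: int):
    """Turn a positive-class probability into a (confidence, correct) pair."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError(f"probability {probability!r} is outside [0, 1]")
    if true_label not in (0, 1):
        raise ValueError(f"label {true_label!r} is not 0 or 1")
    predicted = 1 if probability >= 0.5 else 0           # predicted class = (p >= 0.5)
    confidence = max(probability, 1.0 - probability)      # probability of the predicted class
    correct = 1 if predicted == true_label else 0
    return confidence, correct


rows = [(0.9, 1), (0.8, 1), (0.2, 0), (0.6, 0)]
reduced = [reduce_binary(p, y) for p, y in rows]

mean_conf = sum(conf for conf, _ in reduced) / len(reduced)
accuracy = sum(correct for _, correct in reduced) / len(reduced)
# With M = 2 every reduced confidence is >= 0.5, so all rows share one bin
# and ECE equals |accuracy - mean_conf|.
ece = abs(accuracy - mean_conf)
```

Note that a reduced confidence is never below 0.5, which is why the lower bin stays empty in this mode when M = 2.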


Sources & references

The formulas on this page were last cross-checked against these sources on 2026-06-19. ECE and MCE are stable mathematical definitions, so this tool needs no rate or schedule updates — only the worked examples are periodically re-reconciled against the binned estimator.
