What is Expected Calibration Error (ECE)?

Expected Calibration Error measures how well a classifier's stated confidences match its real accuracy. Predictions are grouped into equal-width confidence bins; for each bin you compare the average confidence with the actual fraction correct. ECE is the sample-weighted average of those gaps: ECE = Σ (|Bₘ|/N)·|acc(Bₘ) − conf(Bₘ)|. Zero means perfect calibration; higher means the probabilities are less trustworthy.

How do you calculate ECE for a neural network?

Take the softmax confidence of each prediction (the probability of the predicted class) and whether that prediction was correct. Split [0, 1] into M bins, typically 10–15. In each bin compute mean confidence and accuracy, take the absolute difference, weight it by the bin's share of samples, and sum. This tool does all of that — paste the confidence and a 0/1 correctness for each example and pick M.

What is the difference between ECE and MCE?

Both use the same bins. ECE is the sample-weighted average gap across all bins, so it reflects typical calibration. MCE (Maximum Calibration Error) is the single largest bin gap, so it reflects the worst case. A model can have low ECE but high MCE if one sparsely populated confidence band is badly miscalibrated. This tool reports both and highlights the bin that drives the MCE.
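The low-ECE, high-MCE case can be sketched in a few lines of Python. This is a minimal illustration, not the tool's own code: the function name `ece_and_mce` and the data set (100 predictions at 0.90 with 90 correct, plus one wrong prediction at 0.55) are hypothetical and chosen only to make the effect visible.

```python
import math

def ece_and_mce(confidences, correct, num_bins):
    """Return (ECE, MCE) using equal-width bins; empty bins are skipped."""
    n = len(confidences)
    conf_sums = [0.0] * num_bins
    hit_sums = [0] * num_bins
    counts = [0] * num_bins
    for c, y in zip(confidences, correct):
        b = min(math.floor(c * num_bins), num_bins - 1)  # c = 1.0 goes in the last bin
        conf_sums[b] += c
        hit_sums[b] += y
        counts[b] += 1
    ece, mce = 0.0, 0.0
    for b in range(num_bins):
        if counts[b] == 0:
            continue  # empty bins add nothing to ECE and are ignored by MCE
        gap = abs(hit_sums[b] / counts[b] - conf_sums[b] / counts[b])
        ece += counts[b] / n * gap
        mce = max(mce, gap)
    return ece, mce

# Hypothetical data: one large well-calibrated bin and one sparse, badly calibrated bin
confidences = [0.90] * 100 + [0.55]
correct = [1] * 90 + [0] * 10 + [0]
ece, mce = ece_and_mce(confidences, correct, num_bins=10)
print(f"ECE = {ece:.4f}, MCE = {mce:.4f}")  # ECE = 0.0054, MCE = 0.5500
```

The single sample in the [0.5,0.6) bin carries a weight of only 1/101, so it barely moves ECE, yet its gap of 0.55 sets the MCE.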

What is a good ECE value?

Lower is better and 0 is perfect, but there is no universal threshold because ECE depends on the bin count and dataset. In the calibration literature well-calibrated models often report ECE below about 0.01–0.05 (1–5%) with 10–15 bins, while uncalibrated modern deep nets can exceed 0.10. Always report the M you used, since ECE changes with binning — this tool shows M next to every result.

How is ECE different from the Brier score?

The Brier score is the mean squared error of probabilities over every individual prediction, so it mixes calibration and sharpness into one number. ECE isolates calibration only: it asks whether, among all predictions made at confidence p, about p of them are correct. ECE needs binning and the Brier score does not. They answer related but different questions — many model cards report both.
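A small sketch can make the distinction concrete. It assumes the Brier score is computed on the same (confidence, correct) pairs this page uses; the multiclass Brier score is normally defined over the full probability vector, so treat this as a simplified stand-in. The data are the ten predictions at confidence 0.70 with 7 correct, and `brier_score` is a hypothetical helper name.

```python
def brier_score(confidences, correct):
    """Mean squared difference between confidence and 0/1 correctness."""
    return sum((c - y) ** 2 for c, y in zip(confidences, correct)) / len(confidences)

confidences = [0.70] * 10
correct = [1] * 7 + [0] * 3

# All ten predictions share one confidence, so they land in a single bin
# and ECE reduces to |accuracy - mean confidence| for that bin.
accuracy = sum(correct) / len(correct)
mean_conf = sum(confidences) / len(confidences)
ece = abs(accuracy - mean_conf)

brier = brier_score(confidences, correct)
print(f"ECE = {ece:.3f}, Brier = {brier:.3f}")  # ECE = 0.000, Brier = 0.210
```

The set is perfectly calibrated, so ECE is 0, but the Brier score is still 0.21 because a confidence of 0.70 is far from both 0 and 1 on every individual prediction.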

Does the bin count change the ECE?

Yes. ECE is binning-dependent: more bins resolve finer miscalibration but put fewer samples in each bin, which adds noise; fewer bins smooth the estimate but can hide miscalibration, because over- and under-confident samples that share a wide bin cancel each other out. Guo et al. (2017) use M = 15. Because the number is not comparable across different M, always state the bin count. This tool keeps M visible beside ECE and MCE.
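The effect of M can be checked on the ten-row Demo set from the worked examples. The sketch below is an illustration under stated assumptions: `ece` is a hypothetical helper, not the tool's JavaScript, and it uses the same equal-width binning rule described on this page.

```python
import math

def ece(pairs, num_bins):
    """Binned ECE over (confidence, correct) pairs with equal-width bins."""
    n = len(pairs)
    bins = [[] for _ in range(num_bins)]
    for c, y in pairs:
        bins[min(math.floor(c * num_bins), num_bins - 1)].append((c, y))
    total = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute 0
        conf = sum(c for c, _ in members) / len(members)
        acc = sum(y for _, y in members) / len(members)
        total += len(members) / n * abs(acc - conf)
    return total

demo = [(0.55, 1), (0.60, 0), (0.62, 1), (0.70, 1), (0.75, 0),
        (0.80, 1), (0.85, 1), (0.90, 1), (0.95, 1), (0.98, 1)]

for m in (1, 5):
    print(f"M = {m}: ECE = {ece(demo, m):.4f}")
# M = 1: ECE = 0.0300
# M = 5: ECE = 0.1640
```

With a single bin the overconfident and underconfident rows cancel and only the overall gap of 0.03 remains; with five bins the per-band gaps no longer cancel and ECE rises to 0.164.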

What does over- and under-confident mean here?

The tool compares overall mean confidence with overall accuracy. If mean confidence is higher than accuracy the model is overconfident (it claims more certainty than it earns); if accuracy is higher it is underconfident. Modern neural networks are usually overconfident. The verdict uses the exact sample means, not the binned values, so it is independent of M.

Does this calculator send my predictions anywhere?

No. Parsing your rows, binning the confidences, computing ECE and MCE, and drawing the reliability diagram all run in your browser with plain JavaScript. Nothing is uploaded, logged, or stored, and the page keeps working offline once loaded. You can paste validation-set confidences with no privacy concern.


Expected Calibration Error (ECE) Calculator

Paste your model's prediction confidences and their 0/1 correctness to get the Expected Calibration Error and Maximum Calibration Error, a per-bin reliability table and diagram, and whether the model is over- or under-confident. Uses the equal-width binning of Guo et al. (2017), runs entirely in your browser, and needs no signup.

By Induwara Ashinsana, Executive Director, Ryzera Technologies · Updated Jun 19, 2026

Expected Calibration Error

Input format

Each row: predicted confidence in [0, 1], then 1 if that prediction was correct or 0 if not.

Predictions (one pair per line)

confidence, correct — e.g. 0.92, 1

Number of bins (M)

Equal-width bins from 1 to 100. ECE depends on M; 10–15 is common.


ECE

0.1640

16.4% · weighted mean gap

MCE

0.4500

45% · worst bin 3

Mean conf vs acc

0.7700 / 0.8000

Gap +0.0300 (acc − conf)

Samples / bins

10 / 5

3 non-empty bins

Underconfident

ECE 0.1640 (16.4%) over 5 equal-width bins; MCE 0.4500 (45.0%). Lower is better — 0 means perfectly calibrated. Accuracy 0.800 exceeds mean confidence 0.770 by 0.030 — the model is underconfident overall.


Reliability diagram

Legend: Accuracy · Mean confidence. Bin midpoint shown below each pair (0.50, 0.70, 0.90).

Formulas

conf(Bₘ) = mean confidence in bin m
acc(Bₘ) = correct / count in bin m
ECE = Σₘ (|Bₘ|/N)·|acc − conf|
MCE = maxₘ |acc − conf|

Cross-check. The weighted mean gap gives ECE = 0.1640; the independent by-sums identity (1/N) Σ |Σy − Σc| per bin gives 0.1640. They reconcile, as they must — the result is verified.

Per-bin reliability table

Bin	Range	Count	Conf	Acc	Gap	Weight
1	[0.00–0.20)	0	—	—	—	0.0000
2	[0.20–0.40)	0	—	—	—	0.0000
3	[0.40–0.60)	1	0.5500	1.0000	+0.4500	0.1000
4	[0.60–0.80)	4	0.6675	0.5000	−0.1675	0.4000
5	[0.80–1.00]	5	0.8960	1.0000	+0.1040	0.5000
ECE = Σ weight × |gap|					0.1640

Empty bins (dimmed) contribute 0 to ECE and are excluded from MCE. The bin driving the MCE is highlighted.

How it works

Calibration asks a simple question: when a classifier says it is 80% sure, is it right about 80% of the time? The Expected Calibration Error turns that into a single number by comparing stated confidence with observed accuracy across confidence bands. The binned estimator used here is the one popularised by Guo et al. (2017) and introduced by Naeini et al. (2015).

Each prediction contributes a confidence cᵢ ∈ [0, 1] — the probability of the predicted class — and a correctness yᵢ ∈ {0, 1}. The interval [0, 1] is split into M equal-width bins of width 1/M, and a confidence c lands in bin min(floor(c·M), M−1) so that c = 1.0 falls in the last bin.
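The bin-assignment rule can be sketched directly in Python. The function name `bin_index` is an assumption for illustration, and the bins are numbered from 0 here, whereas the reliability table on this page numbers them from 1.

```python
import math

def bin_index(confidence, num_bins):
    """Zero-based index of the equal-width bin that holds `confidence`.

    Bins are half-open [k/M, (k+1)/M), except the last, which also
    includes 1.0 thanks to the min(..., M - 1) clamp.
    """
    return min(math.floor(confidence * num_bins), num_bins - 1)

print(bin_index(0.55, 5))  # 2, i.e. the [0.4, 0.6) bin
print(bin_index(0.80, 5))  # 4, i.e. the [0.8, 1.0] bin
print(bin_index(1.00, 5))  # 4, clamped into the last bin
```

One caveat of this sketch: the product `confidence * num_bins` is computed in binary floating point, so a value that sits exactly on a bin edge in decimal may occasionally round into the neighbouring bin.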

ECE = Σₘ (|Bₘ| / N) · |acc(Bₘ) − conf(Bₘ)|

Validate. Every confidence must lie in [0, 1] and every label must be 0 or 1. Rows that fail are listed with the line number and reason — never a silent NaN or a dropped sample.
Bin. For each bin Bₘ compute the mean confidence conf(Bₘ) and the accuracy acc(Bₘ) (fraction correct).
Weight and sum. ECE is the sample-weighted average of the absolute bin gaps; empty bins contribute 0. The worst single gap is the Maximum Calibration Error:
MCE = maxₘ |acc(Bₘ) − conf(Bₘ)|
Verdict. Overall mean confidence and accuracy are computed directly from every sample (not from the bins), so the over- vs under-confidence verdict is exact and independent of M. Mean confidence above accuracy means overconfident; below means underconfident.
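The four steps above can be sketched end to end as follows. This is a minimal Python illustration under stated assumptions, not the page's JavaScript: the function name `calibration_report`, the returned dictionary keys, and the "matched" label for an exact tie are all hypothetical.

```python
import math

def calibration_report(rows, num_bins):
    """rows: list of (confidence, correct) pairs. Returns ECE, MCE and a verdict."""
    if not rows:
        raise ValueError("no rows supplied")
    # Validate: name the offending line rather than dropping it silently.
    for line_no, (c, y) in enumerate(rows, start=1):
        if not (isinstance(c, (int, float)) and 0.0 <= c <= 1.0):
            raise ValueError(f"line {line_no}: confidence {c!r} is not in [0, 1]")
        if y not in (0, 1):
            raise ValueError(f"line {line_no}: label {y!r} is not 0 or 1")

    # Bin: equal-width bins, with c = 1.0 clamped into the last one.
    n = len(rows)
    bins = [[] for _ in range(num_bins)]
    for c, y in rows:
        bins[min(math.floor(c * num_bins), num_bins - 1)].append((c, y))

    # Weight and sum: ECE is the weighted mean gap, MCE the largest gap.
    ece, mce, worst_bin = 0.0, 0.0, None
    for m, members in enumerate(bins, start=1):  # 1-based, as in the table
        if not members:
            continue  # empty bins add 0 to ECE and are excluded from MCE
        conf = sum(c for c, _ in members) / len(members)
        acc = sum(y for _, y in members) / len(members)
        gap = abs(acc - conf)
        ece += len(members) / n * gap
        if worst_bin is None or gap > mce:
            mce, worst_bin = gap, m

    # Verdict: computed from every sample, so it does not depend on num_bins.
    mean_conf = sum(c for c, _ in rows) / n
    accuracy = sum(y for _, y in rows) / n
    if mean_conf > accuracy:
        verdict = "overconfident"
    elif mean_conf < accuracy:
        verdict = "underconfident"
    else:
        verdict = "matched"
    return {"ece": ece, "mce": mce, "worst_bin": worst_bin,
            "mean_conf": mean_conf, "accuracy": accuracy, "verdict": verdict}

demo = [(0.55, 1), (0.60, 0), (0.62, 1), (0.70, 1), (0.75, 0),
        (0.80, 1), (0.85, 1), (0.90, 1), (0.95, 1), (0.98, 1)]
report = calibration_report(demo, num_bins=5)
print(f"ECE {report['ece']:.4f}, MCE {report['mce']:.4f}, "
      f"worst bin {report['worst_bin']}, {report['verdict']}")
# ECE 0.1640, MCE 0.4500, worst bin 3, underconfident
```

Run on the Demo preset rows with M = 5, the sketch reproduces the figures shown in the result panel on this page.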

In binary-probability mode the tool reduces a single positive-class probability p to a confidence the standard way: predicted class = (p ≥ 0.5), confidence = max(p, 1 − p), and correct = (predicted == true label). As an internal correctness gate, ECE is also recomputed by the algebraic identity (1/N) Σₘ |Σyᵢ − Σcᵢ| over each bin's raw sums, and the two values are asserted equal to floating-point precision before you see a result. Because ECE depends on the bin count, the chosen M is shown beside every number.
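Both ideas in this paragraph can be sketched in Python. The identity follows because, for each bin, (|Bₘ|/N)·|Σy/|Bₘ| − Σc/|Bₘ|| simplifies to (1/N)·|Σy − Σc|. The names `reduce_binary` and `ece_by_sums` are hypothetical, and the rows are the four from the binary-probability worked example with M = 2.

```python
import math

def reduce_binary(p, true_label):
    """Turn a positive-class probability into (confidence, correct)."""
    predicted = 1 if p >= 0.5 else 0
    confidence = max(p, 1.0 - p)
    return confidence, int(predicted == true_label)

def ece_by_sums(pairs, num_bins):
    """ECE from raw per-bin sums: (1/N) * sum over bins of |sum(y) - sum(c)|."""
    sum_conf = [0.0] * num_bins
    sum_hit = [0] * num_bins
    for c, y in pairs:
        b = min(math.floor(c * num_bins), num_bins - 1)
        sum_conf[b] += c
        sum_hit[b] += y
    return sum(abs(h - c) for h, c in zip(sum_hit, sum_conf)) / len(pairs)

rows = [(0.9, 1), (0.8, 1), (0.2, 0), (0.6, 0)]  # (probability, true label)
pairs = [reduce_binary(p, t) for p, t in rows]
print(pairs)
print(f"ECE = {ece_by_sums(pairs, num_bins=2):.3f}")  # ECE = 0.025
```

All four reduced confidences fall in the [0.5,1.0] bin, where the raw sums are Σy = 3 and Σc = 3.1, so the identity gives |3 − 3.1| / 4 = 0.025.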

Worked examples

Overconfident-looking set — ECE 0.164, MCE 0.450 (the Demo preset, M = 5)

Confidences/correct: (0.55,1)(0.60,0)(0.62,1)(0.70,1)(0.75,0)(0.80,1)(0.85,1)(0.90,1)(0.95,1)(0.98,1). N = 10
Bin [0.4,0.6): {0.55✓} → count 1, conf 0.550, acc 1.000, gap 0.450
Bin [0.6,0.8): {0.60✗,0.62✓,0.70✓,0.75✗} → count 4, conf 0.6675, acc 0.500, gap 0.1675
Bin [0.8,1.0]: {0.80,0.85,0.90,0.95,0.98 all ✓} → count 5, conf 0.896, acc 1.000, gap 0.104
ECE = (1/10)(0.450) + (4/10)(0.1675) + (5/10)(0.104) = 0.045 + 0.067 + 0.052 = 0.164
MCE = max(0.450, 0.1675, 0.104) = 0.450 (driven by the [0.4,0.6) bin)
Mean conf 0.770 < accuracy 0.800 → underconfident overall, even though the [0.6,0.8) bin is overconfident (conf 0.6675 > acc 0.500)

Perfectly calibrated set — ECE 0.000, MCE 0.000 (M = 10)

Ten predictions all at confidence 0.70, exactly 7 correct: (0.70,1)×7 and (0.70,0)×3
All fall in bin [0.7,0.8): count 10, conf 0.700, acc 7/10 = 0.700, gap 0.000
ECE = (10/10)(0.000) = 0.000; MCE = 0.000
Mean confidence 0.700 = accuracy 0.700 → well matched. This is the zero-error baseline

Binary-probability mode — the p ≥ 0.5 reduction (M = 2)

Rows (probability, true label): (0.9,1)(0.8,1)(0.2,0)(0.6,0)
Reduce each: (0.9,1)→conf 0.9, correct 1; (0.8,1)→conf 0.8, correct 1
(0.2,0)→pred 0, conf max(0.2,0.8)=0.8, correct 1; (0.6,0)→pred 1, conf 0.6, correct 0
Confidence/correct = (0.9,1)(0.8,1)(0.8,1)(0.6,0); all in bin [0.5,1.0]
count 4, conf 3.1/4 = 0.775, acc 3/4 = 0.750, gap 0.025 → ECE = 0.025, MCE = 0.025
Mean conf 0.775 > accuracy 0.750 → overconfident (just barely)

Frequently asked questions

Sources & references

The formulas on this page were last cross-checked against these sources on 2026-06-19. ECE and MCE are stable mathematical definitions, so this tool needs no rate or schedule updates — only the worked examples are periodically re-reconciled against the binned estimator.

