What is a good Cohen's kappa value?

There is no universal cut-off, but the most-cited guide is Landis & Koch (1977): below 0 is poor, 0.00–0.20 slight, 0.21–0.40 fair, 0.41–0.60 moderate, 0.61–0.80 substantial, and 0.81–1.00 almost perfect. Many journals expect at least 0.61 (substantial) for an annotation task to be considered reliable. Always report the value with its confidence interval, not just the band.
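The band lookup can be sketched in a few lines of Python. This is an illustrative helper, not the tool's own code: the name `landis_koch_band` is hypothetical, and because the published bands leave small gaps (for example between 0.20 and 0.21), the sketch assumes a value in a gap belongs to the higher band.

```python
def landis_koch_band(kappa: float) -> str:
    """Map a kappa value to its Landis & Koch (1977) label.

    Assumption: values that fall between the printed bounds
    (e.g. 0.205) are assigned to the higher band.
    """
    if kappa < 0.0:
        return "Poor"
    # Upper bound of each band, checked in ascending order.
    bands = [
        (0.20, "Slight"),
        (0.40, "Fair"),
        (0.60, "Moderate"),
        (0.80, "Substantial"),
    ]
    for upper, label in bands:
        if kappa <= upper:
            return label
    return "Almost perfect"


print(landis_koch_band(0.40))   # Fair
print(landis_koch_band(0.70))   # Substantial
```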

How do you calculate Cohen's kappa by hand?

Build the k×k agreement matrix, then compute observed agreement pₒ = (sum of the diagonal) / N. Next get chance agreement pₑ = Σ (rowᵢ/N)(colᵢ/N) from the marginal totals. Finally κ = (pₒ − pₑ) / (1 − pₑ). For a 2×2 table with diagonal 35 of 50 and pₑ = 0.50, κ = (0.70 − 0.50) / (1 − 0.50) = 0.40.

What is the difference between Cohen's kappa and percent agreement?

Percent agreement is just pₒ — the share of items two raters labelled the same. It ignores that some agreement happens by luck, so it looks high even when both raters guess. Kappa subtracts the chance agreement pₑ and rescales, so κ = 0 means 'no better than chance' and κ = 1 means perfect. Two raters can agree 80% of the time yet score a kappa of only 0.4 if one category dominates.
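To make the gap between the two measures concrete, the sketch below compares two made-up 2×2 matrices that both show 80% raw agreement. The matrices and the function name are illustrative assumptions, not data from the tool.

```python
def percent_agreement_and_kappa(matrix):
    """Return (p_o, p_e, kappa) for a square agreement matrix of counts."""
    k = len(matrix)
    n = sum(sum(row) for row in matrix)
    p_o = sum(matrix[i][i] for i in range(k)) / n
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    # Chance agreement from the marginal totals.
    p_e = sum(r * c for r, c in zip(row_totals, col_totals)) / n**2
    return p_o, p_e, (p_o - p_e) / (1 - p_e)


# Hypothetical data: both matrices have 80 of 100 items on the diagonal.
balanced = [[40, 10], [10, 40]]   # categories used equally often
lopsided = [[75, 10], [10, 5]]    # first category dominates

for name, m in [("balanced", balanced), ("lopsided", lopsided)]:
    p_o, p_e, kappa = percent_agreement_and_kappa(m)
    print(f"{name}: p_o={p_o:.2f}, p_e={p_e:.3f}, kappa={kappa:.3f}")
```

With the balanced matrix, chance agreement is 0.50 and kappa comes out at 0.60; with the lopsided one, chance agreement rises to 0.745 and kappa drops to about 0.22, even though raw agreement is identical.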

When should I use weighted kappa instead of Cohen's kappa?

Use weighted kappa when the categories are ordered — for example Low / Medium / High or a 1–5 severity scale — so that a near-miss (Medium vs High) should count as partial agreement rather than total disagreement. Linear weights penalise by category distance; quadratic weights penalise far-apart disagreements much more. For unordered, nominal labels (Cat / Dog / Bird) keep weighting on None.

What is the difference between Cohen's kappa and Fleiss' kappa?

Cohen's kappa measures agreement between exactly two raters on the same items. Fleiss' kappa generalises this to three or more raters, and it allows a different set of raters per item. They use different chance-agreement estimators, so the numbers are not directly comparable. This calculator covers the two-rater Cohen's case; a Fleiss' kappa tool for multi-rater data is planned.

Why does my kappa show as undefined?

Kappa is undefined when chance agreement pₑ equals 1, which happens when every rating falls into a single category. With nothing left to correct for, the denominator 1 − pₑ becomes 0. Spread the ratings across at least two categories and a value will appear. The tool reports 'undefined' honestly rather than printing a misleading 0 or NaN.
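One way to guard against the zero denominator is sketched below. It is an assumption about how such a check could look, not the tool's actual implementation: the function name is hypothetical, it expects integer counts, and it also treats an empty matrix as undefined. Working with integer totals keeps the pₑ = 1 comparison exact.

```python
from typing import List, Optional


def cohen_kappa_or_none(matrix: List[List[int]]) -> Optional[float]:
    """Return Cohen's kappa, or None when it is undefined.

    Uses the algebraically equivalent integer form
    kappa = (N * diag - S) / (N^2 - S), where S = sum(row_i * col_i),
    so the undefined case (p_e == 1, i.e. S == N^2) is tested exactly.
    """
    k = len(matrix)
    n = sum(sum(row) for row in matrix)
    if n == 0:
        return None  # assumption: no ratings at all is also undefined
    diag = sum(matrix[i][i] for i in range(k))
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    s = sum(r * c for r, c in zip(row_totals, col_totals))
    if s == n * n:
        return None  # every rating fell into one category: 1 - p_e == 0
    return (n * diag - s) / (n * n - s)


print(cohen_kappa_or_none([[10, 0], [0, 0]]))    # None
print(cohen_kappa_or_none([[20, 5], [10, 15]]))  # 0.4
```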

How is the 95% confidence interval calculated?

The tool uses Cohen's simplified large-sample standard error, SE = √(pₒ(1 − pₒ) / (N·(1 − pₑ)²)), then κ ± 1.96·SE for the 95% interval, clamped to the valid −1…1 range. This normal approximation is reliable for N ≥ 30; below that the tool flags the interval as indicative only. For the full asymptotic variance see Fleiss, Cohen & Everitt (1969).

Can kappa be negative?

Yes. A negative kappa means the two raters agree less often than random labelling would predict — they are systematically disagreeing. It is uncommon in real annotation work and usually points to a swapped label mapping, a misunderstanding of the coding scheme, or a data-entry error. The matrix [[1,9],[9,1]] gives κ = −0.80, deep in the 'poor' band.

Does this calculator send my data anywhere?

No. Every step runs in your browser with plain arithmetic — there is no model, no API call and no upload. Your agreement matrix never leaves your device, so it is safe for confidential annotation data from a labelling contract or an unpublished study.


Cohen's Kappa Calculator (Inter-Rater Reliability)

Paste your two-rater agreement matrix and get Cohen's kappa (κ) — the chance-corrected measure of how much two annotators really agree — with the observed and chance agreement, a 95% confidence interval, optional weighted kappa for ordinal scales, and the Landis & Koch band. Free, no signup, runs in your browser.

By Induwara Ashinsana, Executive Director, Ryzera Technologies · Updated Jun 10, 2026

Cohen's kappa (κ)

Number of categories (k)

The shared rating scale, 2–10 categories.

Agreement matrix

A \ B	Positive	Neutral	Negative	Row Σ
Positive	25	3	2	30
Neutral	4	28	3	35
Negative	3	5	27	35
Col Σ	32	36	32	100

Rows = Rater A, columns = Rater B. The shaded diagonal is where the two raters agreed. Grand total N = 100.

Weighting

Nominal categories — any disagreement counts fully.

95% confidence interval

Large-sample interval, κ ± 1.96·SE. Reliable when N ≥ 30.

Cohen's κ

Substantial

0.700

Substantial agreement beyond chance.

95% CI [0.58, 0.82]

Observed agreement (pₒ)

80%

Share the raters actually agreed on.

Chance agreement (pₑ)

33.4%

Agreement expected from random labelling.

Standard error

0.0601

Asymptotic SE.

Landis & Koch (1977) bands

< 0.00 Poor · 0.00–0.20 Slight · 0.21–0.40 Fair · 0.41–0.60 Moderate · 0.61–0.80 Substantial · 0.81–1.00 Almost Perfect

Computed entirely in your browser — nothing is uploaded. Formulas per Cohen (1960), Cohen (1968) and Landis & Koch (1977); last verified 2026-06-10.

How it works

Cohen's kappa measures how much two raters agree when each independently sorts the same items into the same set of categories — and crucially, it corrects for the agreement you would expect from pure chance. You give the tool a k×k agreement matrix where nᵢⱼ is the number of items Rater A put in category i and Rater B put in category j. The diagonal holds the agreements; everything off the diagonal is a disagreement.
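If your data is still a pair of per-item label lists rather than a matrix, the k×k table can be tallied as in the sketch below. The function name and the sample labels are hypothetical; rows are Rater A and columns are Rater B, matching the layout described above.

```python
def agreement_matrix(labels_a, labels_b, categories):
    """Tally a k x k agreement matrix from two raters' per-item labels.

    matrix[i][j] counts items Rater A put in categories[i]
    and Rater B put in categories[j].
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("Both raters must label the same items")
    index = {category: i for i, category in enumerate(categories)}
    k = len(categories)
    matrix = [[0] * k for _ in range(k)]
    for a, b in zip(labels_a, labels_b):
        matrix[index[a]][index[b]] += 1
    return matrix


# Hypothetical labels for five items.
rater_a = ["Positive", "Positive", "Negative", "Neutral", "Negative"]
rater_b = ["Positive", "Neutral", "Negative", "Neutral", "Positive"]
print(agreement_matrix(rater_a, rater_b, ["Positive", "Neutral", "Negative"]))
```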

From the grand total N, the row marginals rᵢ and the column marginals cⱼ, the calculation follows Cohen (1960):

Observed agreement: pₒ = (Σ nᵢᵢ) / N
Chance agreement: pₑ = Σ (rᵢ / N)(cᵢ / N)
Cohen's kappa: κ = (pₒ − pₑ) / (1 − pₑ)

κ = 1 is perfect agreement, κ = 0 is exactly what chance predicts, and κ < 0 means the raters disagree more than random labelling would. Because pₑ depends on the marginal totals, a lopsided task — where one category dominates — has a high chance agreement, which is why two raters can match on 80% of items yet earn only a modest kappa.
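The three formulas above translate directly into code. The following is a minimal sketch under the assumption that the matrix is a square list of counts with pₑ below 1; the function name is illustrative and this is not the tool's source.

```python
def cohen_kappa(matrix):
    """Compute p_o, p_e and kappa from a k x k agreement matrix of counts."""
    k = len(matrix)
    n = sum(sum(row) for row in matrix)
    row_totals = [sum(row) for row in matrix]        # r_i (Rater A)
    col_totals = [sum(col) for col in zip(*matrix)]  # c_i (Rater B)
    # Observed agreement: share of items on the diagonal.
    p_o = sum(matrix[i][i] for i in range(k)) / n
    # Chance agreement: sum over categories of (r_i / N)(c_i / N).
    p_e = sum(r * c for r, c in zip(row_totals, col_totals)) / n**2
    kappa = (p_o - p_e) / (1 - p_e)
    return {"p_o": p_o, "p_e": p_e, "kappa": kappa}


# The three-category sentiment matrix used as Example 2 on this page.
result = cohen_kappa([[25, 3, 2], [4, 28, 3], [3, 5, 27]])
print({name: round(value, 3) for name, value in result.items()})
```

For that matrix the sketch reproduces the page's figures: pₒ = 0.80, pₑ = 0.334 and κ ≈ 0.700.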

When the categories are ordered (say Low / Medium / High), a Medium-vs-High mix-up is a smaller error than Low-vs-High. Weighted kappa, from Cohen (1968), captures that with a weight matrix built from the category distance |i − j|: linear weights are wᵢⱼ = 1 − |i−j|/(k−1) and quadratic weights are wᵢⱼ = 1 − (i−j)²/(k−1)². The same formula then runs on the weighted proportions: κ_w = (pₒ(w) − pₑ(w)) / (1 − pₑ(w)). For a 2×2 table every weighting scheme collapses to plain kappa, since there is only one disagreement distance.
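The weighting schemes can be sketched as follows, using the linear and quadratic weights exactly as defined above. The function name and the `scheme` parameter are assumptions made for illustration, and the sketch assumes at least two categories so that k − 1 is non-zero.

```python
def weighted_kappa(matrix, scheme="quadratic"):
    """Weighted kappa (Cohen, 1968) with linear or quadratic agreement weights."""
    k = len(matrix)
    n = sum(sum(row) for row in matrix)
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]

    def weight(i, j):
        distance = abs(i - j)
        if scheme == "linear":
            return 1 - distance / (k - 1)
        if scheme == "quadratic":
            return 1 - distance**2 / (k - 1) ** 2
        raise ValueError(f"Unknown scheme: {scheme}")

    cells = [(i, j) for i in range(k) for j in range(k)]
    # Weighted observed and chance agreement over every cell.
    p_o_w = sum(weight(i, j) * matrix[i][j] for i, j in cells) / n
    p_e_w = sum(weight(i, j) * row_totals[i] * col_totals[j] for i, j in cells) / n**2
    return (p_o_w - p_e_w) / (1 - p_e_w)


sentiment = [[25, 3, 2], [4, 28, 3], [3, 5, 27]]
print(round(weighted_kappa(sentiment, "quadratic"), 3))
print(round(weighted_kappa(sentiment, "linear"), 3))
```

On the sentiment matrix from Example 2 the quadratic version gives about 0.729, matching the worked example; on any 2×2 table both schemes return the same value as plain kappa.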

The 95% confidence interval uses Cohen's simplified large-sample standard error, SE = √(pₒ(1 − pₒ) / (N·(1 − pₑ)²)), giving κ ± 1.96·SE clamped to the valid −1…1 range. This normal approximation is dependable for N ≥ 30 and is flagged as indicative below that; the full asymptotic variance is given by Fleiss, Cohen & Everitt (1969). Finally the headline kappa is mapped to the Landis & Koch (1977) band — < 0.00 poor, 0.00–0.20 slight, 0.21–0.40 fair, 0.41–0.60 moderate, 0.61–0.80 substantial, 0.81–1.00 almost perfect — for a one-line verdict. All arithmetic is exact and runs in your browser; nothing is uploaded.
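The interval calculation described above can be sketched like this. The function name and the returned keys are hypothetical, and the sketch assumes pₑ is below 1 so the denominator is non-zero.

```python
import math


def kappa_confidence_interval(p_o, p_e, n, z=1.96):
    """Large-sample interval using the simplified SE quoted on this page.

    SE = sqrt(p_o * (1 - p_o) / (N * (1 - p_e)^2)); the interval is
    kappa +/- z * SE, clamped to [-1, 1]. Assumes p_e < 1.
    """
    kappa = (p_o - p_e) / (1 - p_e)
    se = math.sqrt(p_o * (1 - p_o) / (n * (1 - p_e) ** 2))
    return {
        "kappa": kappa,
        "se": se,
        "lower": max(-1.0, kappa - z * se),
        "upper": min(1.0, kappa + z * se),
        # Below N = 30 the normal approximation is only indicative.
        "indicative_only": n < 30,
    }


# Example 1 on this page: p_o = 0.70, p_e = 0.50, N = 50.
ci = kappa_confidence_interval(0.70, 0.50, 50)
print(round(ci["se"], 4), round(ci["lower"], 2), round(ci["upper"], 2))
```

For Example 1 this gives SE ≈ 0.1296 and an interval of roughly [0.15, 0.65], in line with the worked example.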

Worked examples

Example 1 — binary relevance, two reviewers, N = 50

Matrix [[20, 5], [10, 15]]: both said Relevant 20×, both said Not 15×.
pₒ = (20 + 15) / 50 = 0.70 (raw agreement is 70%)
pₑ = (25/50)(30/50) + (25/50)(20/50) = 0.30 + 0.20 = 0.50
κ = (0.70 − 0.50) / (1 − 0.50) = 0.20 / 0.50 = 0.40 → Fair
SE = √(0.70·0.30 / (50·0.50²)) = 0.1296; 95% CI [0.15, 0.65]

Example 2 — sentiment labelling, three categories, N = 100

Matrix [[25,3,2],[4,28,3],[3,5,27]]: diagonal = 25 + 28 + 27 = 80.
pₒ = 80 / 100 = 0.80 (raw agreement is 80%)
pₑ = (30/100)(32/100) + (35/100)(36/100) + (35/100)(32/100) = 0.334
κ = (0.80 − 0.334) / (1 − 0.334) = 0.466 / 0.666 = 0.700 → Substantial
Quadratic weighted κ_w = (0.9125 − 0.6775) / (1 − 0.6775) = 0.729 (near-misses credited)

Example 3 — edge case, worse than chance, N = 20

Matrix [[1, 9], [9, 1]]: the raters mostly contradict each other.
pₒ = (1 + 1) / 20 = 0.10 (only 10% raw agreement)
pₑ = (10/20)(10/20) + (10/20)(10/20) = 0.50
κ = (0.10 − 0.50) / (1 − 0.50) = −0.40 / 0.50 = −0.80 → Poor
Negative kappa flags a likely swapped label mapping or coding error.
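A quick way to test the swapped-mapping explanation is to reverse one rater's category order and recompute. The sketch below does that for the matrix above; the helper names are hypothetical, and reversing the columns is only a full remapping check for a two-category scale.

```python
def cohen_kappa(matrix):
    """Cohen's kappa via the integer form (N*diag - S) / (N^2 - S)."""
    k = len(matrix)
    n = sum(sum(row) for row in matrix)
    diag = sum(matrix[i][i] for i in range(k))
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    s = sum(r * c for r, c in zip(row_totals, col_totals))
    return (n * diag - s) / (n * n - s)


def reverse_rater_b_categories(matrix):
    """Reverse the column order, i.e. flip Rater B's label mapping."""
    return [row[::-1] for row in matrix]


original = [[1, 9], [9, 1]]
remapped = reverse_rater_b_categories(original)

print(cohen_kappa(original))   # -0.8
print(cohen_kappa(remapped))   # 0.8
```

If flipping the mapping turns a strongly negative kappa into a strongly positive one, as it does here, a swapped label mapping is the likely cause.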

Frequently asked questions

Sources & references

Every formula on this page was cross-checked against these sources on 2026-06-10, and the unweighted result is verified against the direct one-line formula inside the tool. Your agreement matrix never leaves your browser.

