induwara.lk

Cohen's Kappa Calculator (Inter-Rater Reliability)

Paste your two-rater agreement matrix and get Cohen's kappa (κ) — the chance-corrected measure of how much two annotators really agree — with the observed and chance agreement, a 95% confidence interval, optional weighted kappa for ordinal scales, and the Landis & Koch band. Free, no signup, runs in your browser.

By Induwara Ashinsana · Updated Jun 10, 2026
Cohen's kappa (κ)

The shared rating scale, 2 to 10 categories.

Agreement matrix

A \ B      Positive  Neutral  Negative  Row Σ
Positive         25        3         2     30
Neutral           4       28         3     35
Negative          3        5        27     35
Col Σ            32       36        32    100

Rows = Rater A, columns = Rater B. The shaded diagonal is where the two raters agreed. Grand total N = 100.

Weighting

Nominal categories — any disagreement counts fully.

95% confidence interval

Large-sample interval, κ ± 1.96·SE. Reliable when N ≥ 30.

Cohen's κ: 0.700 (Substantial)

Substantial agreement beyond chance.

95% CI [0.58, 0.82]

Observed agreement (pₒ): 80%. Share the raters actually agreed on.
Chance agreement (pₑ): 33.4%. Agreement expected from random labelling.
Standard error: 0.0601. Asymptotic SE.
Landis & Koch (1977) bands
< 0.00 Poor · 0.00–0.20 Slight · 0.21–0.40 Fair · 0.41–0.60 Moderate · 0.61–0.80 Substantial · 0.81–1.00 Almost Perfect

Computed entirely in your browser — nothing is uploaded. Formulas per Cohen (1960), Cohen (1968) and Landis & Koch (1977); last verified 2026-06-10.

How it works

Cohen's kappa measures how much two raters agree when each independently sorts the same items into the same set of categories — and crucially, it corrects for the agreement you would expect from pure chance. You give the tool a k×k agreement matrix where nᵢⱼ is the number of items Rater A put in category i and Rater B put in category j. The diagonal holds the agreements; everything off the diagonal is a disagreement.

From the grand total N, the row marginals rᵢ and the column marginals cⱼ, the calculation follows Cohen (1960):

  • Observed agreement: pₒ = (Σ nᵢᵢ) / N
  • Chance agreement: pₑ = Σ (rᵢ / N)(cᵢ / N)
  • Cohen's kappa: κ = (pₒ − pₑ) / (1 − pₑ)
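The three formulas above can be sketched in a few lines of Python. This is a minimal illustration, not the calculator's own code; the function name `cohen_kappa` and the list-of-lists input format are assumptions made for the example.

```python
def cohen_kappa(matrix):
    """Return (p_o, p_e, kappa) for a square k x k matrix of counts.

    matrix[i][j] = items Rater A put in category i and Rater B in category j.
    Illustrative helper; the name and input format are assumptions.
    """
    k = len(matrix)
    n = sum(sum(row) for row in matrix)                   # grand total N
    row_tot = [sum(row) for row in matrix]                # r_i (Rater A)
    col_tot = [sum(matrix[i][j] for i in range(k))        # c_j (Rater B)
               for j in range(k)]
    p_o = sum(matrix[i][i] for i in range(k)) / n         # observed agreement
    p_e = sum(row_tot[i] * col_tot[i] for i in range(k)) / n ** 2  # chance agreement
    # Undefined when p_e == 1 (both raters used a single, identical category).
    kappa = (p_o - p_e) / (1 - p_e)
    return p_o, p_e, kappa


p_o, p_e, kappa = cohen_kappa([[20, 5], [10, 15]])
print(round(p_o, 3), round(p_e, 3), round(kappa, 3))  # 0.7 0.5 0.4
```

The 2×2 matrix used here is the one from Example 1, so the printed values can be checked against the hand calculation.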

κ = 1 is perfect agreement, κ = 0 is exactly what chance predicts, and κ < 0 means the raters disagree more than random labelling would. Because pₑ depends on the marginal totals, a lopsided task — where one category dominates — has a high chance agreement, which is why two raters can match on 80% of items yet earn only a modest kappa.
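To see the effect of lopsided marginals, the sketch below compares two made-up 2×2 tables that both have 80% raw agreement. The matrices and the helper name `kappa_from_counts` are hypothetical, chosen only to illustrate the point.

```python
def kappa_from_counts(matrix):
    """Plain Cohen's kappa from a square matrix of counts (illustrative helper)."""
    k = len(matrix)
    n = sum(sum(row) for row in matrix)
    row_tot = [sum(row) for row in matrix]
    col_tot = [sum(matrix[i][j] for i in range(k)) for j in range(k)]
    p_o = sum(matrix[i][i] for i in range(k)) / n
    p_e = sum(row_tot[i] * col_tot[i] for i in range(k)) / n ** 2
    return (p_o - p_e) / (1 - p_e)


# Both tables are invented for this example; each has 80 agreements out of 100.
balanced = [[40, 10], [10, 40]]   # categories used equally often
skewed = [[78, 10], [10, 2]]      # one category dominates

print(round(kappa_from_counts(balanced), 3))  # 0.6
print(round(kappa_from_counts(skewed), 3))    # 0.053
```

In the skewed table the chance agreement is 0.7888, so an 80% match is barely better than what random labelling with those marginals would produce.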

When the categories are ordered (say Low / Medium / High), a Medium-vs-High mix-up is a smaller error than Low-vs-High. Weighted kappa, from Cohen (1968), captures that with a weight matrix built from the category distance |i − j|: linear weights are wᵢⱼ = 1 − |i−j|/(k−1) and quadratic weights are wᵢⱼ = 1 − (i−j)²/(k−1)². The same formula then runs on the weighted proportions: κ_w = (pₒ(w) − pₑ(w)) / (1 − pₑ(w)). For a 2×2 table every weighting scheme collapses to plain kappa, since there is only one disagreement distance.
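A minimal sketch of weighted kappa, following the linear and quadratic weight definitions in the paragraph above. The function name `weighted_kappa` and the `scheme` parameter are assumptions for illustration; categories are assumed to be listed in their natural order.

```python
def weighted_kappa(matrix, scheme="linear"):
    """Weighted Cohen's kappa for an ordered k x k matrix of counts (k >= 2).

    scheme: "linear" or "quadratic" (illustrative parameter name).
    """
    if scheme not in ("linear", "quadratic"):
        raise ValueError("scheme must be 'linear' or 'quadratic'")
    k = len(matrix)
    n = sum(sum(row) for row in matrix)
    row_tot = [sum(row) for row in matrix]
    col_tot = [sum(matrix[i][j] for i in range(k)) for j in range(k)]

    def weight(i, j):
        d = abs(i - j) / (k - 1)          # normalised category distance
        return 1 - d if scheme == "linear" else 1 - d * d

    # Weighted observed and chance agreement over every cell, not just the diagonal.
    p_o_w = sum(weight(i, j) * matrix[i][j]
                for i in range(k) for j in range(k)) / n
    p_e_w = sum(weight(i, j) * row_tot[i] * col_tot[j]
                for i in range(k) for j in range(k)) / n ** 2
    return (p_o_w - p_e_w) / (1 - p_e_w)


sentiment = [[25, 3, 2], [4, 28, 3], [3, 5, 27]]
print(round(weighted_kappa(sentiment, "quadratic"), 3))  # 0.729
```

With a 2×2 matrix, both schemes give off-diagonal weights of 0 and the function returns the same value as unweighted kappa.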

The 95% confidence interval uses Cohen's simplified large-sample standard error, SE = √(pₒ(1 − pₒ) / (N·(1 − pₑ)²)), giving κ ± 1.96·SE clamped to the valid −1…1 range. This normal approximation is dependable for N ≥ 30 and is flagged as indicative below that; the full asymptotic variance is given by Fleiss, Cohen & Everitt (1969). Finally the headline kappa is mapped to the Landis & Koch (1977) band — < 0.00 poor, 0.00–0.20 slight, 0.21–0.40 fair, 0.41–0.60 moderate, 0.61–0.80 substantial, 0.81–1.00 almost perfect — for a one-line verdict. All arithmetic is exact and runs in your browser; nothing is uploaded.
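The interval and the verbal band can be sketched as follows. The function names are hypothetical, and the handling of values that fall between printed band edges (for example 0.205) is an assumption: each band here is treated as ending at its upper bound inclusive.

```python
import math


def kappa_ci(p_o, p_e, n, z=1.96):
    """Kappa, simplified large-sample SE, and a clamped confidence interval."""
    kappa = (p_o - p_e) / (1 - p_e)
    se = math.sqrt(p_o * (1 - p_o) / (n * (1 - p_e) ** 2))
    lower = max(-1.0, kappa - z * se)   # clamp to the valid -1..1 range
    upper = min(1.0, kappa + z * se)
    return kappa, se, lower, upper


def landis_koch(kappa):
    """Map kappa to a Landis & Koch (1977) label.

    Assumption: upper bounds are inclusive, so 0.20 is Slight and 0.205 is Fair.
    """
    if kappa < 0:
        return "Poor"
    for upper, label in [(0.20, "Slight"), (0.40, "Fair"),
                         (0.60, "Moderate"), (0.80, "Substantial")]:
        if kappa <= upper:
            return label
    return "Almost Perfect"


kappa, se, lower, upper = kappa_ci(p_o=0.80, p_e=0.334, n=100)
print(round(se, 4), round(lower, 2), round(upper, 2), landis_koch(kappa))
# 0.0601 0.58 0.82 Substantial
```

The inputs are the observed and chance agreement from the three-category sentiment example, so the output matches the standard error and interval shown for that matrix.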

Worked examples

Example 1 — binary relevance, two reviewers, N = 50

  1. Matrix [[20, 5], [10, 15]]: both said Relevant 20×, both said Not 15×.
  2. pₒ = (20 + 15) / 50 = 0.70 (raw agreement is 70%)
  3. pₑ = (25/50)(30/50) + (25/50)(20/50) = 0.30 + 0.20 = 0.50
  4. κ = (0.70 − 0.50) / (1 − 0.50) = 0.20 / 0.50 = 0.40 → Fair
  5. SE = √(0.70·0.30 / (50·0.50²)) = 0.1296; 95% CI [0.15, 0.65]

Example 2 — sentiment labelling, three categories, N = 100

  1. Matrix [[25,3,2],[4,28,3],[3,5,27]]: diagonal = 25 + 28 + 27 = 80.
  2. pₒ = 80 / 100 = 0.80 (raw agreement is 80%)
  3. pₑ = (30/100)(32/100) + (35/100)(36/100) + (35/100)(32/100) = 0.334
  4. κ = (0.80 − 0.334) / (1 − 0.334) = 0.466 / 0.666 = 0.700 → Substantial
  5. Quadratic weighted κ_w = (0.9125 − 0.6775) / (1 − 0.6775) = 0.729 (near-misses credited)

Example 3 — edge case, worse than chance, N = 20

  1. Matrix [[1, 9], [9, 1]]: the raters mostly contradict each other.
  2. pₒ = (1 + 1) / 20 = 0.10 (only 10% raw agreement)
  3. pₑ = (10/20)(10/20) + (10/20)(10/20) = 0.50
  4. κ = (0.10 − 0.50) / (1 − 0.50) = −0.40 / 0.50 = −0.80 → Poor
  5. Negative kappa flags a likely swapped label mapping or coding error.
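One way to test the swapped-label hypothesis is to permute Rater B's columns and recompute kappa. The sketch below does this for the matrix in Example 3; the helper names `kappa_from_counts` and `relabel_rater_b` are illustrative assumptions, not part of the calculator.

```python
def kappa_from_counts(matrix):
    """Plain Cohen's kappa from a square matrix of counts (illustrative helper)."""
    k = len(matrix)
    n = sum(sum(row) for row in matrix)
    row_tot = [sum(row) for row in matrix]
    col_tot = [sum(matrix[i][j] for i in range(k)) for j in range(k)]
    p_o = sum(matrix[i][i] for i in range(k)) / n
    p_e = sum(row_tot[i] * col_tot[i] for i in range(k)) / n ** 2
    return (p_o - p_e) / (1 - p_e)


def relabel_rater_b(matrix, perm):
    """Reorder Rater B's columns: new column j is old column perm[j]."""
    return [[row[j] for j in perm] for row in matrix]


original = [[1, 9], [9, 1]]
swapped = relabel_rater_b(original, [1, 0])   # exchange B's two labels

print(round(kappa_from_counts(original), 2))  # -0.8
print(round(kappa_from_counts(swapped), 2))   # 0.8
```

If kappa flips from strongly negative to strongly positive after relabelling, a mismatched label mapping is a plausible explanation, though the raw annotations should still be checked before accepting the corrected value.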
