induwara.lk

Fleiss' Kappa Calculator

Measure how well three or more raters agree when they sort items into categories. Enter your subject-by-category counts and read off Fleiss' κ, observed versus chance agreement, and the strength-of-agreement band — instantly, in your browser, with the formulas cited below.

By Induwara Ashinsana · Updated Jun 23, 2026
Fleiss' kappa (κ)

The shared label set, 2–12 categories.

Rating count matrix

Subject · Raters (m)
#1 · 4
#2 · 4
#3 · 4
#4 · 4
Col Σ · 7 · 5 · 4 · 16

Rows = subjects, columns = categories. Each cell is the number of raters who chose that category for that subject; every row should total the same m raters.

4 subjects · 3 categories · m = 4 raters
Fleiss' κ: 0.807 (Substantial)

Substantial agreement beyond chance.

4 subjects · 4 raters each · 3 categories (16 ratings).

Observed agreement (P̄): 87.5%. Mean share of agreeing rater pairs across subjects.
Chance agreement (Pₑ): 35.16%. Agreement expected from random labelling.
Gain over chance: 52.34%. P̄ − Pₑ, the raw agreement above random labelling.
Category · Assignments · Proportion pⱼ · Per-category κⱼ
Helpful · 7 · 43.75% · 0.746
Neutral · 5 · 31.25% · 0.709
Harmful · 4 · 25% · 1.000

Per-category κⱼ shows which labels raters agree or disagree on. A category used by no one or by everyone has an undefined κⱼ.

Landis & Koch (1977) bands
< 0.00 Poor · 0.00–0.20 Slight · 0.21–0.40 Fair · 0.41–0.60 Moderate · 0.61–0.80 Substantial · 0.81–1.00 Almost Perfect
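
The band lookup can be sketched in a few lines of Python. This is only an illustration: the function name `landis_koch_band` is hypothetical, and the handling of values that fall between the printed bands (such as 0.807) is an assumption. Treating each band's lower bound as the cutoff reproduces the "Substantial" label shown above for 0.807, but the page does not state the rule it actually uses.

```python
def landis_koch_band(kappa: float) -> str:
    """Map a kappa value to a Landis & Koch (1977) label."""
    # Assumption: each band's lower bound acts as the cutoff, so a value in a
    # printed gap (e.g. 0.807, between 0.80 and 0.81) stays in the lower band.
    if kappa < 0.0:
        return "Poor"
    cutoffs = [
        (0.21, "Slight"),
        (0.41, "Fair"),
        (0.61, "Moderate"),
        (0.81, "Substantial"),
    ]
    for upper, label in cutoffs:
        if kappa < upper:
            return label
    return "Almost Perfect"


print(landis_koch_band(0.807))
```

Rounding κ to two decimals before the lookup would be an equally plausible rule, and it would move 0.807 into the "Almost Perfect" band, so the choice is worth stating wherever the label is reported.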

Computed entirely in your browser — nothing is uploaded. Formulas per Fleiss (1971) and Landis & Koch (1977); last verified 2026-06-23.

How it works

Fleiss' kappa (Fleiss, 1971) measures agreement among a fixed number of raters who each place every subject into one nominal category. Let N be the number of subjects, m the number of ratings per subject, k the number of categories, and nᵢⱼ the number of raters who assigned subject i to category j. Every row of the matrix sums to m by construction. The calculation is pure arithmetic over that matrix, computed client-side with no rounding until display:

  1. Category proportions. For each category, pⱼ = (1 / (N·m)) · Σᵢ nᵢⱼ — the share of all assignments that landed in category j.
  2. Chance agreement. Pₑ = Σⱼ pⱼ² — the agreement you would expect if raters labelled at random in proportion to how often each category is used.
  3. Per-subject agreement. Pᵢ = (1 / (m·(m−1))) · ( Σⱼ nᵢⱼ² − m ) — the proportion of rater pairs who agreed on subject i.
  4. Mean observed agreement. P̄ = (1 / N) · Σᵢ Pᵢ — average agreement across all subjects.
  5. Fleiss' kappa. κ = (P̄ − Pₑ) / (1 − Pₑ). The numerator is agreement beyond chance; the denominator is the maximum agreement still available above chance. κ = 1 means perfect agreement, κ = 0 means agreement no better than chance, and κ < 0 means worse than chance.
  6. Per-category kappa. κⱼ = 1 − ( Σᵢ nᵢⱼ·(m − nᵢⱼ) ) / ( N·m·(m−1)·pⱼ·(1 − pⱼ) ) (Fleiss, Levin & Paik, 2003), which measures how consistently raters apply one specific label.
  7. Interpretation. κ is mapped onto the Landis & Koch (1977) bands: < 0.00 poor, 0.00–0.20 slight, 0.21–0.40 fair, 0.41–0.60 moderate, 0.61–0.80 substantial, 0.81–1.00 almost perfect.
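
Steps 1 to 5 above can be sketched in Python as follows. This is a minimal illustration rather than the calculator's own code: the function name `fleiss_kappa`, its return shape, and the use of `None` for an undefined κ are assumptions made for the example.

```python
from typing import List, Optional, Tuple


def fleiss_kappa(counts: List[List[int]]) -> Tuple[float, float, Optional[float]]:
    """Return (P_bar, P_e, kappa) for a subjects-by-categories count matrix."""
    if not counts:
        raise ValueError("need at least one subject")
    n_subjects = len(counts)
    m = sum(counts[0])  # ratings per subject, read from the first row
    if m < 2:
        raise ValueError("need at least two ratings per subject")
    if any(sum(row) != m for row in counts):
        raise ValueError("every row must total the same number of raters")
    n_categories = len(counts[0])

    # Step 1: category proportions p_j
    total = n_subjects * m
    p = [sum(row[j] for row in counts) / total for j in range(n_categories)]
    # Step 2: chance agreement P_e
    p_e = sum(pj * pj for pj in p)
    # Step 3: per-subject agreement P_i
    p_i = [(sum(n * n for n in row) - m) / (m * (m - 1)) for row in counts]
    # Step 4: mean observed agreement P_bar
    p_bar = sum(p_i) / n_subjects
    # Step 5: kappa, left undefined (None) when P_e = 1
    kappa = None if p_e == 1.0 else (p_bar - p_e) / (1.0 - p_e)
    return p_bar, p_e, kappa


# Matrix from worked Example 1 on this page (4 subjects, 3 raters, Yes/No).
p_bar, p_e, kappa = fleiss_kappa([[3, 0], [0, 3], [2, 1], [3, 0]])
print(round(p_bar, 4), round(p_e, 4), round(kappa, 3))
```

Rounding happens only in the final `print`, mirroring the "no rounding until display" behaviour described above.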

Two values are deliberately reported as undefined rather than as a misleading zero. The overall κ is undefined when Pₑ = 1 (every rating fell in a single category, so 1 − Pₑ = 0). A per-category κⱼ is undefined when its category was used by nobody or by everybody, because pⱼ·(1 − pⱼ) is then zero. The tool also requires the same number of raters on every subject and flags any row that breaks that rule, since averaging over unequal rater counts would silently bias κ.
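
A hedged sketch of the per-category calculation with these guards is shown below. The name `per_category_kappa`, the `None` marker for an undefined κⱼ, and the error message are illustrative assumptions, not the tool's actual interface.

```python
from typing import List, Optional


def per_category_kappa(counts: List[List[int]]) -> List[Optional[float]]:
    """Return kappa_j per category, or None where kappa_j is undefined."""
    n_subjects = len(counts)
    m = sum(counts[0])
    # Flag every row whose total differs from the first row's rater count.
    bad_rows = [i + 1 for i, row in enumerate(counts) if sum(row) != m]
    if bad_rows:
        raise ValueError(f"rows with a different rater count: {bad_rows}")

    result: List[Optional[float]] = []
    for j in range(len(counts[0])):
        column = [row[j] for row in counts]
        p_j = sum(column) / (n_subjects * m)
        denom = n_subjects * m * (m - 1) * p_j * (1 - p_j)
        if denom == 0:
            # Category used by nobody or by everybody: p_j * (1 - p_j) = 0.
            result.append(None)
        else:
            disagreement = sum(n * (m - n) for n in column)
            result.append(1 - disagreement / denom)
    return result


# Matrix from worked Example 2 on this page (3 subjects, 4 raters, A/B/C).
print([round(k, 2) for k in per_category_kappa([[4, 0, 0], [1, 2, 1], [0, 1, 3]])])
```

Returning `None` instead of 0 keeps an undefined category from being read as "agreement no better than chance".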

As a self-check, the calculator computes the headline κ a second way — through a pooled closed form derived directly from Σ nᵢⱼ² — and confirms the two methods agree before showing the “Formulas verified” badge.
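
One way such a cross-check could look is sketched below. The exact closed form the calculator uses is not given on the page, so the pooled expression here is an assumption: it comes from substituting P̄ = (Σᵢⱼ nᵢⱼ² − N·m) / (N·m·(m−1)) into κ = (P̄ − Pₑ) / (1 − Pₑ), which gives κ = (Σᵢⱼ nᵢⱼ² − N·m·(1 + (m−1)·Pₑ)) / (N·m·(m−1)·(1 − Pₑ)). The function names are hypothetical.

```python
import math
from typing import List


def kappa_stepwise(counts: List[List[int]]) -> float:
    """Kappa via per-subject agreement P_i, averaged into P_bar."""
    n_subjects, m, k = len(counts), sum(counts[0]), len(counts[0])
    p = [sum(row[j] for row in counts) / (n_subjects * m) for j in range(k)]
    p_e = sum(pj * pj for pj in p)
    p_i = [(sum(n * n for n in row) - m) / (m * (m - 1)) for row in counts]
    p_bar = sum(p_i) / n_subjects
    return (p_bar - p_e) / (1 - p_e)


def kappa_pooled(counts: List[List[int]]) -> float:
    """Kappa via the pooled sum of squared counts (assumed closed form)."""
    n_subjects, m, k = len(counts), sum(counts[0]), len(counts[0])
    total = n_subjects * m
    sum_sq = sum(n * n for row in counts for n in row)
    p_e = sum((sum(row[j] for row in counts) / total) ** 2 for j in range(k))
    return (sum_sq - total * (1 + (m - 1) * p_e)) / (total * (m - 1) * (1 - p_e))


# Matrix from worked Example 2 on this page.
example_2 = [[4, 0, 0], [1, 2, 1], [0, 1, 3]]
a, b = kappa_stepwise(example_2), kappa_pooled(example_2)
verified = math.isclose(a, b, rel_tol=1e-12, abs_tol=1e-12)
print(round(a, 3), round(b, 3), verified)
```

The comparison uses a tolerance rather than `==` because the two routes accumulate floating-point error in a different order. Both functions assume Pₑ < 1 and equal row totals.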

Worked examples

Example 1 — 4 subjects, 3 raters, Yes/No

κ = 0.625 → substantial agreement

  1. Matrix nᵢⱼ: [3,0] [0,3] [2,1] [3,0]
  2. Column totals: Yes = 8, No = 4; N·m = 4 × 3 = 12
  3. p_Yes = 8/12 = 0.6667, p_No = 4/12 = 0.3333
  4. Pₑ = 0.6667² + 0.3333² = 0.4444 + 0.1111 = 0.5556
  5. Pᵢ = (Σnᵢⱼ² − 3)/6 → 1.0, 1.0, (5−3)/6 = 0.3333, 1.0
  6. P̄ = (1 + 1 + 0.3333 + 1)/4 = 0.8333
  7. κ = (0.8333 − 0.5556)/(1 − 0.5556) = 0.2778/0.4444 = 0.625
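
The four-decimal figures in Example 1 are display roundings. As an illustrative check (not the calculator's code), the same arithmetic can be done with Python's `fractions.Fraction`, which shows that κ here is exactly 5/8:

```python
from fractions import Fraction

# Matrix from Example 1: 4 subjects, 3 raters, categories Yes/No.
counts = [[3, 0], [0, 3], [2, 1], [3, 0]]
n_subjects = len(counts)
m = sum(counts[0])

# Category proportions as exact fractions: 2/3 and 1/3.
p = [Fraction(sum(row[j] for row in counts), n_subjects * m) for j in range(2)]
p_e = sum(pj ** 2 for pj in p)

# Per-subject agreement, averaged over subjects.
p_i = [Fraction(sum(n * n for n in row) - m, m * (m - 1)) for row in counts]
p_bar = sum(p_i) / n_subjects

kappa = (p_bar - p_e) / (1 - p_e)
print(p_e, p_bar, kappa)  # 5/9 5/6 5/8
```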

Example 2 — 3 subjects, 4 raters, A/B/C

κ = 0.319 → fair agreement

  1. Matrix nᵢⱼ: [4,0,0] [1,2,1] [0,1,3]
  2. Column totals: A = 5, B = 3, C = 4; N·m = 3 × 4 = 12
  3. p = (0.4167, 0.25, 0.3333)
  4. Pₑ = 0.1736 + 0.0625 + 0.1111 = 0.3472
  5. Pᵢ = (Σnᵢⱼ² − 4)/12 → 1.0, (6−4)/12 = 0.1667, (10−4)/12 = 0.5
  6. P̄ = (1 + 0.1667 + 0.5)/3 = 0.5556
  7. κ = (0.5556 − 0.3472)/(1 − 0.3472) = 0.2083/0.6528 = 0.319
  8. Per-category: κ_A = 0.66, κ_B = −0.04, κ_C = 0.25 — label B drags κ down

Example 3 — edge case: worse than chance

κ = −0.333 → poor agreement

  1. Matrix nᵢⱼ: [2,2] [2,2] [2,2] — 3 subjects, m = 4, split evenly
  2. p = (0.5, 0.5); Pₑ = 0.5² + 0.5² = 0.5
  3. Pᵢ = (Σnᵢⱼ² − 4)/(4·3) = (8 − 4)/12 = 0.3333 for every subject
  4. P̄ = 0.3333
  5. κ = (0.3333 − 0.5)/(1 − 0.5) = −0.1667/0.5 = −0.333
  6. Observed agreement is below chance — a clear sign the scheme is failing

