Fleiss' Kappa Calculator
Measure how well three or more raters agree when they sort items into categories. Enter your subject-by-category counts and read off Fleiss' κ, observed versus chance agreement, and the strength-of-agreement band — instantly, in your browser, with the formulas cited below.
How it works
Fleiss' kappa (Fleiss, 1971) measures agreement among a fixed number of raters who each place every subject into one nominal category. Let N be the number of subjects, m the number of ratings per subject, k the number of categories, and nij the number of raters who assigned subject i to category j. Every row of the matrix sums to m by construction. The calculation is pure arithmetic over that matrix, computed client-side with no rounding until display:
- Category proportions. For each category, pⱼ = (1 / (N·m)) · Σᵢ nᵢⱼ — the share of all assignments that landed in category j.
- Chance agreement. Pₑ = Σⱼ pⱼ² — the agreement you would expect if raters labelled at random in proportion to how often each category is used.
- Per-subject agreement. Pᵢ = (1 / (m·(m−1))) · ( Σⱼ nᵢⱼ² − m ) — the proportion of rater pairs who agreed on subject i.
- Mean observed agreement. P̄ = (1 / N) · Σᵢ Pᵢ — average agreement across all subjects.
- Fleiss' kappa. κ = (P̄ − Pₑ) / (1 − Pₑ). The numerator is agreement beyond chance; the denominator is the maximum agreement still available above chance. κ = 1 means perfect agreement, κ = 0 means agreement no better than chance, and κ < 0 means worse than chance.
- Per-category kappa. κⱼ = 1 − ( Σᵢ nᵢⱼ·(m − nᵢⱼ) ) / ( N·m·(m−1)·pⱼ·(1 − pⱼ) ) (Fleiss, Levin & Paik, 2003), which measures how consistently raters apply one specific label.
- Interpretation. κ is mapped onto the Landis & Koch (1977) bands: < 0 poor, 0.00–0.20 slight, 0.21–0.40 fair, 0.41–0.60 moderate, 0.61–0.80 substantial, 0.81–1.00 almost perfect.
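The steps in the list above can be sketched in a few lines of Python. This is an illustrative sketch, not the calculator's own code: the function names `fleiss_kappa` and `landis_koch_band` are hypothetical, the input is assumed to be a list of rows that all sum to the same m ≥ 2, and every category is assumed to be used by some but not all ratings. How values falling between the printed band edges (for example 0.205) are assigned is also an assumption here: each band is treated as extending up to and including its upper edge.

```python
def fleiss_kappa(counts):
    """Return (kappa, P_bar, P_e, per_category) for counts[i][j] = n_ij.

    Assumes every row sums to the same m >= 2 and that each category
    received some, but not all, of the ratings.
    """
    N = len(counts)          # subjects
    k = len(counts[0])       # categories
    m = sum(counts[0])       # ratings per subject

    # Category proportions p_j and chance agreement P_e.
    col_totals = [sum(row[j] for row in counts) for j in range(k)]
    p = [t / (N * m) for t in col_totals]
    P_e = sum(pj * pj for pj in p)

    # Per-subject agreement P_i and its mean P_bar.
    P_i = [(sum(n * n for n in row) - m) / (m * (m - 1)) for row in counts]
    P_bar = sum(P_i) / N

    kappa = (P_bar - P_e) / (1 - P_e)

    # Per-category kappa_j.
    per_category = []
    for j in range(k):
        disagreement = sum(row[j] * (m - row[j]) for row in counts)
        denom = N * m * (m - 1) * p[j] * (1 - p[j])
        per_category.append(1 - disagreement / denom)

    return kappa, P_bar, P_e, per_category


def landis_koch_band(kappa):
    """Map kappa onto the Landis & Koch (1977) labels (edge handling assumed)."""
    if kappa < 0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"


# Made-up data: 4 subjects, 3 raters, 2 categories.
example = [[3, 0], [2, 1], [1, 2], [0, 3]]
kappa, P_bar, P_e, per_category = fleiss_kappa(example)
print(round(kappa, 4), round(P_bar, 4), P_e, landis_koch_band(kappa))
```

For this made-up matrix, P̄ = 2/3 and Pₑ = 1/2, so κ = 1/3, which falls in the "fair" band. With only two categories, both per-category values equal the overall κ.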
Two values are deliberately reported as undefined rather than as a misleading zero. The overall κ is undefined when Pₑ = 1 (every rating fell in a single category, so 1 − Pₑ = 0). A per-category κⱼ is undefined when its category was used by nobody or by everybody, because pⱼ·(1 − pⱼ) is then zero. The tool also requires the same number of raters on every subject and flags any row that breaks that rule, since averaging over unequal rater counts would silently bias κ.
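One way to handle these edge cases is sketched below in Python. It is a sketch under stated assumptions rather than the tool's actual implementation: the names `check_rows` and `kappa_or_none` are hypothetical, `None` stands in for "undefined", and the first row is assumed to set the reference rater count. The degenerate cases are detected on the integer column totals, which avoids comparing floating-point values with 0 or 1.

```python
def check_rows(counts):
    """Return the indices of rows whose total differs from the first row's."""
    m = sum(counts[0])
    return [i for i, row in enumerate(counts) if sum(row) != m]


def kappa_or_none(counts):
    """Return (overall_kappa, per_category), using None for undefined values."""
    bad_rows = check_rows(counts)
    if bad_rows:
        raise ValueError(f"rows with a different number of ratings: {bad_rows}")

    N, k, m = len(counts), len(counts[0]), sum(counts[0])
    if m < 2:
        raise ValueError("need at least two ratings per subject")

    total = N * m
    col_totals = [sum(row[j] for row in counts) for j in range(k)]

    # Overall kappa: undefined when one category holds every rating (P_e = 1).
    if max(col_totals) == total:
        overall = None
    else:
        P_e = sum((t / total) ** 2 for t in col_totals)
        P_bar = sum(
            (sum(n * n for n in row) - m) / (m * (m - 1)) for row in counts
        ) / N
        overall = (P_bar - P_e) / (1 - P_e)

    # Per-category kappa: undefined when p_j is 0 or 1.
    per_category = []
    for j, t in enumerate(col_totals):
        if t == 0 or t == total:
            per_category.append(None)
            continue
        p_j = t / total
        disagreement = sum(row[j] * (m - row[j]) for row in counts)
        per_category.append(
            1 - disagreement / (N * m * (m - 1) * p_j * (1 - p_j))
        )

    return overall, per_category


# Every rating in one category: everything is undefined.
print(kappa_or_none([[3, 0], [3, 0]]))
```

Returning `None` instead of 0 keeps an undefined result from being read as "agreement no better than chance".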
As a self-check, the calculator computes the headline κ a second way — through a pooled closed form derived directly from Σ nᵢⱼ² — and confirms the two methods agree before showing the “Formulas verified” badge.
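The page does not spell out the pooled closed form, so the Python sketch below assumes one natural version: P̄ = (Σᵢ Σⱼ nᵢⱼ² − N·m) / (N·m·(m−1)), which is algebraically the same as averaging the per-subject Pᵢ. The function names and the tolerance are hypothetical, and the input is assumed to be valid (equal row sums, m ≥ 2, ratings in at least two categories).

```python
import math


def kappa_per_subject(counts):
    """Kappa via the per-subject agreements P_i, averaged over subjects."""
    N, m = len(counts), sum(counts[0])
    col_totals = [sum(col) for col in zip(*counts)]
    P_e = sum((t / (N * m)) ** 2 for t in col_totals)
    P_bar = sum(
        (sum(n * n for n in row) - m) / (m * (m - 1)) for row in counts
    ) / N
    return (P_bar - P_e) / (1 - P_e)


def kappa_pooled(counts):
    """Kappa via one pooled sum of squares over the whole matrix (assumed form)."""
    N, m = len(counts), sum(counts[0])
    total = N * m
    sum_sq = sum(n * n for row in counts for n in row)
    P_bar = (sum_sq - total) / (total * (m - 1))
    col_totals = [sum(col) for col in zip(*counts)]
    P_e = sum(t * t for t in col_totals) / (total * total)
    return (P_bar - P_e) / (1 - P_e)


def formulas_verified(counts, tol=1e-12):
    """True when both routes give the same kappa up to floating-point noise."""
    return math.isclose(
        kappa_per_subject(counts), kappa_pooled(counts),
        rel_tol=1e-9, abs_tol=tol,
    )


# Made-up data: 4 subjects, 3 raters, 2 categories.
print(formulas_verified([[3, 0], [2, 1], [1, 2], [0, 3]]))
```

The two routes differ only in the order of the arithmetic, so a mismatch beyond rounding noise would point to a coding error rather than a statistical disagreement.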
Worked examples
Frequently asked questions
Sources & references
- Fleiss, J. L. (1971). Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5), 378–382.
- Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159–174.
- Fleiss, J. L., Levin, B., & Paik, M. C. (2003). Statistical Methods for Rates and Proportions (3rd ed.), Wiley — per-category kappa.
The formulas on this page were last cross-checked against these sources on 2026-06-23. Both on-page worked examples reconcile to the cited definitions (κ = 0.625 and κ = 0.319).