What is a good Fleiss' kappa value?

Most researchers follow Landis & Koch (1977): below 0.00 is poor, 0.00–0.20 slight, 0.21–0.40 fair, 0.41–0.60 moderate, 0.61–0.80 substantial, and 0.81–1.00 almost perfect. For published research or annotated training data, aim for 0.61 or higher. Anything below 0.40 usually means your rating guidelines need tightening.
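The bands are stated to two decimals, so a value such as 0.807 falls between two of them. A minimal sketch of a lookup, assuming each band's lower bound is the cutoff (which is consistent with this page labelling 0.807 as Substantial); the function name `landis_koch_band` is hypothetical and not taken from the calculator:

```python
def landis_koch_band(kappa: float) -> str:
    """Map a kappa value to a Landis & Koch (1977) label."""
    # (lower bound, label) pairs, checked from the top band down.
    # Assumption: the lower bound of each band acts as the cutoff.
    bands = [
        (0.81, "almost perfect"),
        (0.61, "substantial"),
        (0.41, "moderate"),
        (0.21, "fair"),
        (0.00, "slight"),
    ]
    for lower, label in bands:
        if kappa >= lower:
            return label
    return "poor"  # anything below zero


for value in (0.807, 0.625, 0.319, -0.333):
    print(value, landis_koch_band(value))
```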

What is the difference between Cohen's kappa and Fleiss' kappa?

Cohen's kappa measures agreement between exactly two raters. Fleiss' kappa extends the idea to three or more raters (strictly, it generalises Scott's pi rather than Cohen's kappa) and, unlike Cohen's, does not require the same raters to judge every subject, only the same number of raters per subject. For a two-rater study use our Cohen's kappa calculator instead.

How do you calculate Fleiss' kappa by hand?

Build a matrix of subjects (rows) by categories (columns) holding rater counts. Find each category's proportion pⱼ, then chance agreement Pₑ = Σ pⱼ². Compute each subject's agreement Pᵢ = (Σ nᵢⱼ² − m) / (m(m−1)), average them into P̄, then κ = (P̄ − Pₑ) / (1 − Pₑ). The worked examples on this page show every step.

Can Fleiss' kappa be negative?

Yes. κ is negative when observed agreement is below what chance alone would produce, meaning raters disagree more than random labelling would. With m ratings per subject, κ cannot fall below −1/(m−1): −0.5 for three raters, −0.33 for four, and −1 only in the two-rater limit. A negative κ is a strong signal that the rating scheme or rater training has a problem.

How many raters do you need for Fleiss' kappa?

At least three; with two raters use Cohen's kappa. Each subject must be judged by the same number of raters (m), though they need not be the same people. There is no fixed maximum — crowd-labelling and LLM-evaluation panels often use dozens. More raters per subject give a more stable estimate.

Do all subjects need the same number of raters?

Yes — classic Fleiss' kappa assumes a fixed number of ratings per subject, m. This calculator derives m from your row sums and flags any row that does not match rather than silently averaging. If your rater counts genuinely vary per subject, Krippendorff's alpha is the appropriate statistic instead.
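A minimal sketch of that row check in Python. How the calculator itself picks m is not stated on this page, so taking the most common row sum is an assumption, and `check_row_sums` is a hypothetical name:

```python
from collections import Counter


def check_row_sums(matrix: list[list[int]]) -> tuple[int, list[int]]:
    """Return (m, bad_rows): the modal row sum and the 1-based rows that differ."""
    sums = [sum(row) for row in matrix]
    # Assumption: the most common row sum is the intended m.
    m = Counter(sums).most_common(1)[0][0]
    bad_rows = [i + 1 for i, s in enumerate(sums) if s != m]
    return m, bad_rows


m, bad_rows = check_row_sums([[3, 0], [0, 3], [2, 1], [3, 1]])
print(m, bad_rows)  # 3 [4]
```

An empty matrix would raise an IndexError here; a real input handler would reject it before this point.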

What does the per-category kappa tell me?

Per-category κⱼ measures how consistently raters apply one specific label. A low or negative κⱼ for a single category, while the others are high, pinpoints exactly which label is ambiguous — useful for revising a codebook or annotation guideline before re-running the study.

Is my data uploaded anywhere?

No. The entire calculation runs in your browser with plain JavaScript. Nothing is sent to a server, logged, or stored, so it is safe for unpublished study data and confidential annotations.


Fleiss' Kappa Calculator

Measure how well three or more raters agree when they sort items into categories. Enter your subject-by-category counts and read off Fleiss' κ, observed versus chance agreement, and the strength-of-agreement band — instantly, in your browser, with the formulas cited below.

By Induwara Ashinsana, Executive Director, Ryzera Technologies. Updated Jun 23, 2026

Fleiss' kappa (κ)

Number of categories (k)

The shared label set, 2–12 categories.

Rating count matrix

Subject	Helpful	Neutral	Harmful	Raters (m)
#1	[count]	[count]	[count]	4
#2	[count]	[count]	[count]	4
#3	[count]	[count]	[count]	4
#4	[count]	[count]	[count]	4
Col Σ	7	5	4	16

Rows = subjects, columns = categories. Each cell is the number of raters who chose that category for that subject; every row should total the same m raters.
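If your data is a list of labels per subject rather than counts, it can be tallied into this matrix first. A minimal sketch, assuming hypothetical raw labels and a hypothetical helper `to_count_matrix`; labels that are not in `categories` are silently ignored here, which a real pipeline would probably want to flag:

```python
from collections import Counter

categories = ["Helpful", "Neutral", "Harmful"]

# Hypothetical raw data: one list of rater labels per subject.
raw_labels = [
    ["Helpful", "Helpful", "Helpful", "Neutral"],
    ["Harmful", "Harmful", "Harmful", "Harmful"],
    ["Neutral", "Helpful", "Neutral", "Neutral"],
]


def to_count_matrix(labels_per_subject, categories):
    """Count, per subject, how many raters chose each category."""
    matrix = []
    for labels in labels_per_subject:
        counts = Counter(labels)  # missing categories count as 0
        matrix.append([counts[c] for c in categories])
    return matrix


matrix = to_count_matrix(raw_labels, categories)
print(matrix)  # [[3, 1, 0], [0, 0, 4], [1, 3, 0]]
```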

4 subjects · 3 categories · m = 4 raters


Metric	Value	Meaning
Fleiss' κ	0.807 (Substantial)	Substantial agreement beyond chance.
Observed agreement (P̄)	87.5%	Mean share of agreeing rater pairs across subjects.
Chance agreement (Pₑ)	35.16%	Agreement expected from random labelling.
Gain over chance	52.34%	P̄ − Pₑ, the raw agreement above random labelling.

4 subjects · 4 raters each · 3 categories (16 ratings).

Category	Assignments	Proportion pⱼ	Per-category κⱼ
Helpful	7	43.75%	0.746
Neutral	5	31.25%	0.709
Harmful	4	25%	1.000

Per-category κⱼ shows which labels raters agree or disagree on. A category used by no one or by everyone has an undefined κⱼ.

Landis & Koch (1977) bands

< 0.00 Poor · 0.00–0.20 Slight · 0.21–0.40 Fair · 0.41–0.60 Moderate · 0.61–0.80 Substantial · 0.81–1.00 Almost Perfect

Computed entirely in your browser — nothing is uploaded. Formulas per Fleiss (1971) and Landis & Koch (1977); last verified 2026-06-23.

How it works

Fleiss' kappa (Fleiss, 1971) measures agreement among a fixed number of raters who each place every subject into one nominal category. Let N be the number of subjects, m the number of ratings per subject, k the number of categories, and nᵢⱼ the number of raters who assigned subject i to category j. Every row of the matrix sums to m by construction. The calculation is pure arithmetic over that matrix, computed client-side with no rounding until display:

Category proportions. For each category, pⱼ = (1 / (N·m)) · Σᵢ nᵢⱼ — the share of all assignments that landed in category j.
Chance agreement. Pₑ = Σⱼ pⱼ² — the agreement you would expect if raters labelled at random in proportion to how often each category is used.
Per-subject agreement. Pᵢ = (1 / (m·(m−1))) · ( Σⱼ nᵢⱼ² − m ) — the proportion of rater pairs who agreed on subject i.
Mean observed agreement. P̄ = (1 / N) · Σᵢ Pᵢ — average agreement across all subjects.
Fleiss' kappa. κ = (P̄ − Pₑ) / (1 − Pₑ). The numerator is agreement beyond chance; the denominator is the maximum agreement still available above chance. κ = 1 means perfect agreement, κ = 0 means agreement no better than chance, and κ < 0 means worse than chance.
Per-category kappa. κⱼ = 1 − ( Σᵢ nᵢⱼ·(m − nᵢⱼ) ) / ( N·m·(m−1)·pⱼ·(1 − pⱼ) ) (Fleiss, Levin & Paik, 2003), which measures how consistently raters apply one specific label.
Interpretation. κ is mapped onto the Landis & Koch (1977) bands: < 0.00 poor, 0.00–0.20 slight, 0.21–0.40 fair, 0.41–0.60 moderate, 0.61–0.80 substantial, 0.81–1.00 almost perfect.
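The headline calculation (category proportions through to κ) can be sketched in a few lines of Python. This is an illustration of the formulas above, not the calculator's own JavaScript; the name `fleiss_kappa` is hypothetical, and the sketch assumes every row has the same total m and that Pₑ < 1:

```python
def fleiss_kappa(matrix):
    """Fleiss' kappa for a subjects-by-categories matrix of rater counts."""
    N = len(matrix)       # subjects
    m = sum(matrix[0])    # ratings per subject (assumed equal on every row)
    k = len(matrix[0])    # categories

    # Category proportions p_j.
    p = [sum(row[j] for row in matrix) / (N * m) for j in range(k)]
    # Chance agreement P_e.
    p_e = sum(pj ** 2 for pj in p)
    # Per-subject agreement P_i, then the mean observed agreement P-bar.
    p_i = [(sum(n * n for n in row) - m) / (m * (m - 1)) for row in matrix]
    p_bar = sum(p_i) / N

    return (p_bar - p_e) / (1 - p_e)


print(round(fleiss_kappa([[3, 0], [0, 3], [2, 1], [3, 0]]), 3))    # 0.625
print(round(fleiss_kappa([[4, 0, 0], [1, 2, 1], [0, 1, 3]]), 3))  # 0.319
```

The two calls use the matrices from worked examples 1 and 2 on this page.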

Two values are deliberately reported as undefined rather than as a misleading zero. The overall κ is undefined when Pₑ = 1 (every rating fell in a single category, so 1 − Pₑ = 0). A per-category κⱼ is undefined when its category was used by nobody or by everybody, because pⱼ·(1 − pⱼ) is then zero. The tool also requires the same number of raters on every subject and flags any row that breaks that rule, since averaging over unequal rater counts would silently bias κ.
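One way to express the two undefined cases is to return `None` instead of a number. The sketch below is an assumption about how such guards could look, not the tool's actual code; both function names are hypothetical, and equal row totals are assumed:

```python
def per_category_kappa(matrix):
    """Per-category kappa_j; None where the category is unused or universal."""
    N = len(matrix)
    m = sum(matrix[0])  # assumes every row has the same total
    kappas = []
    for column in zip(*matrix):
        total = sum(column)
        if total == 0 or total == N * m:
            kappas.append(None)  # p_j * (1 - p_j) is zero: undefined, not 0
            continue
        p_j = total / (N * m)
        disagreement = sum(n * (m - n) for n in column)
        kappas.append(1 - disagreement / (N * m * (m - 1) * p_j * (1 - p_j)))
    return kappas


def fleiss_kappa_or_none(matrix):
    """Overall kappa, or None when every rating fell in one category."""
    N, m = len(matrix), sum(matrix[0])
    totals = [sum(col) for col in zip(*matrix)]
    if max(totals) == N * m:
        return None  # P_e = 1, so 1 - P_e = 0
    p_e = sum((t / (N * m)) ** 2 for t in totals)
    p_bar = sum((sum(n * n for n in row) - m) / (m * (m - 1)) for row in matrix) / N
    return (p_bar - p_e) / (1 - p_e)


example_2 = [[4, 0, 0], [1, 2, 1], [0, 1, 3]]
print([None if k is None else round(k, 2) for k in per_category_kappa(example_2)])
# [0.66, -0.04, 0.25]
print(per_category_kappa([[3, 0], [3, 0]]), fleiss_kappa_or_none([[3, 0], [3, 0]]))
# [None, None] None
```

Testing the integer column total rather than the floating-point product pⱼ·(1 − pⱼ) avoids relying on exact float comparison.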

As a self-check, the calculator computes the headline κ a second way — through a pooled closed form derived directly from Σ nᵢⱼ² — and confirms the two methods agree before showing the “Formulas verified” badge.
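The page does not spell out the pooled form. One natural candidate, assumed here, pools all squared counts at once: P̄ = (Σᵢ Σⱼ nᵢⱼ² − N·m) / (N·m·(m−1)). A minimal sketch comparing it with the step-by-step route (function names are hypothetical):

```python
import math


def kappa_stepwise(matrix):
    """Kappa via per-subject P_i values averaged into P-bar."""
    N, m = len(matrix), sum(matrix[0])
    totals = [sum(col) for col in zip(*matrix)]
    p_e = sum((t / (N * m)) ** 2 for t in totals)
    p_bar = sum((sum(n * n for n in row) - m) / (m * (m - 1)) for row in matrix) / N
    return (p_bar - p_e) / (1 - p_e)


def kappa_pooled(matrix):
    """Kappa via a single pooled sum of squared counts (assumed closed form)."""
    N, m = len(matrix), sum(matrix[0])
    totals = [sum(col) for col in zip(*matrix)]
    sum_sq = sum(n * n for row in matrix for n in row)  # sum of n_ij^2 over all cells
    p_bar = (sum_sq - N * m) / (N * m * (m - 1))
    p_e = sum(t * t for t in totals) / (N * m) ** 2
    return (p_bar - p_e) / (1 - p_e)


for matrix in ([[3, 0], [0, 3], [2, 1], [3, 0]], [[4, 0, 0], [1, 2, 1], [0, 1, 3]]):
    a, b = kappa_stepwise(matrix), kappa_pooled(matrix)
    print(round(a, 3), round(b, 3), math.isclose(a, b))
```

The comparison uses `math.isclose` rather than `==` because the two routes accumulate floating-point error differently.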

Worked examples

Example 1 — 4 subjects, 3 raters, Yes/No

κ = 0.625 → substantial agreement

Matrix nᵢⱼ: [3,0] [0,3] [2,1] [3,0]
Column totals: Yes = 8, No = 4; N·m = 4 × 3 = 12
p_Yes = 8/12 = 0.6667, p_No = 4/12 = 0.3333
Pₑ = 0.6667² + 0.3333² = 0.4444 + 0.1111 = 0.5556
Pᵢ = (Σnᵢⱼ² − 3)/6 → 1.0, 1.0, (5−3)/6 = 0.3333, 1.0
P̄ = (1 + 1 + 0.3333 + 1)/4 = 0.8333
κ = (0.8333 − 0.5556)/(1 − 0.5556) = 0.2778/0.4444 = 0.625
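The same arithmetic can be replayed with exact fractions, which shows that the rounded intermediate values above do not affect the result. A minimal sketch using Python's standard `fractions` module (variable names are illustrative):

```python
from fractions import Fraction

matrix = [[3, 0], [0, 3], [2, 1], [3, 0]]
N, m = len(matrix), sum(matrix[0])

# Category proportions as exact fractions: 2/3 and 1/3.
p = [Fraction(sum(col), N * m) for col in zip(*matrix)]
p_e = sum(pj ** 2 for pj in p)
# Per-subject agreement: 1, 1, 1/3, 1.
p_i = [Fraction(sum(n * n for n in row) - m, m * (m - 1)) for row in matrix]
p_bar = sum(p_i) / N
kappa = (p_bar - p_e) / (1 - p_e)

print(p_e, p_bar, kappa)  # 5/9 5/6 5/8
```

Here 5/8 is exactly 0.625, matching the hand calculation.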

Example 2 — 3 subjects, 4 raters, A/B/C

κ = 0.319 → fair agreement

Matrix nᵢⱼ: [4,0,0] [1,2,1] [0,1,3]
Column totals: A = 5, B = 3, C = 4; N·m = 3 × 4 = 12
p = (0.4167, 0.25, 0.3333)
Pₑ = 0.1736 + 0.0625 + 0.1111 = 0.3472
Pᵢ = (Σnᵢⱼ² − 4)/12 → 1.0, (6−4)/12 = 0.1667, (10−4)/12 = 0.5
P̄ = (1 + 0.1667 + 0.5)/3 = 0.5556
κ = (0.5556 − 0.3472)/(1 − 0.3472) = 0.2083/0.6528 = 0.319
Per-category: κ_A = 0.66, κ_B = −0.04, κ_C = 0.25 — label B drags κ down

Example 3 — edge case: worse than chance

κ = −0.33 → poor agreement

Matrix nᵢⱼ: [2,2] [2,2] [2,2] — 3 subjects, m = 4, split evenly
p = (0.5, 0.5); Pₑ = 0.5² + 0.5² = 0.5
Pᵢ = (Σnᵢⱼ² − 4)/(4·3) = (8 − 4)/12 = 0.3333 for every subject
P̄ = 0.3333
κ = (0.3333 − 0.5)/(1 − 0.5) = −0.1667/0.5 = −0.333
Observed agreement is below chance — a clear sign the scheme is failing
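The value −1/3 here is also the lowest κ possible with four ratings per subject: from the formulas above, κ cannot fall below −1/(m−1), and evenly split rows reach that floor. A minimal sketch that checks this with exact fractions for a few even splits (the function name is hypothetical):

```python
from fractions import Fraction


def fleiss_kappa_exact(matrix):
    """Fleiss' kappa as an exact Fraction; assumes equal row totals and P_e < 1."""
    N, m = len(matrix), sum(matrix[0])
    p_e = sum(Fraction(sum(col), N * m) ** 2 for col in zip(*matrix))
    p_bar = sum(
        Fraction(sum(n * n for n in row) - m, m * (m - 1)) for row in matrix
    ) / N
    return (p_bar - p_e) / (1 - p_e)


# Three identical, evenly split subjects for several values of m.
for row in ([2, 2], [1, 1, 1], [3, 3], [2, 2, 2]):
    m = sum(row)
    kappa = fleiss_kappa_exact([row] * 3)
    print(row, kappa, kappa == Fraction(-1, m - 1))
```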

Frequently asked questions

Sources & references

The formulas on this page were last cross-checked against these sources on 2026-06-23. Worked examples 1 and 2 reconcile to the cited definitions (κ = 0.625 and κ = 0.319).

