induwara.lk

pass@k Calculator

Compute the unbiased pass@k metric for code-generation LLMs the way HumanEval does. Enter the samples you generated (n), how many passed the tests (c), and the k you want, for a single problem or a whole benchmark. The result matches OpenAI's human-eval estimator and is computed entirely in your browser.

By Induwara Ashinsana · Updated Jun 30, 2026
pass@k calculator inputs:

  n: Completions per problem.
  c: How many passed the tests.
  k: Samples you draw (1 ≤ k ≤ n).

Example result (n = 10, c = 3, k = 1):

  pass@1: 30.00% (fraction 0.3000; higher is better)
  pass@1 (= c / n): 30.00%, the single-sample success rate
  Cross-check (binomial): 30.00%, reconciles with the product form

With 10 samples and 3 correct, there is a 30.00% chance that a single completion drawn at random from those samples passes the tests.

pass@k for the same n / c

| Metric   | pass@k (%) | Fraction |
|----------|------------|----------|
| pass@1   | 30.00%     | 0.3000   |
| pass@5   | 91.67%     | 0.9167   |
| pass@10  | 100.00%    | 1.0000   |
| pass@100 | n < k      | -        |

Rows where k exceeds n are not defined — you can't draw more attempts than the 10 samples you generated.

Method: unbiased pass@k = 1 − ∏(i = n−c+1 … n) (1 − k / i), Chen et al. 2021 (arXiv:2107.03374, §2.1); reference code in OpenAI's human-eval. No data leaves this page.

How it works

pass@k is the probability that at least one of k code samples passes a problem's unit tests. It is the headline metric for code-generation models on benchmarks such as HumanEval, MBPP and LiveCodeBench, and it was defined in Chen et al. 2021, Evaluating Large Language Models Trained on Code (arXiv:2107.03374). Rather than score how a completion looks, pass@k scores whether it works.

The naive approach — generate k samples, mark the problem solved if any passes — is noisy, and the closed form 1 − (1 − c/n)^k is biased because it treats samples as drawn with replacement. The unbiased estimator instead generates n ≥ k samples per problem, counts the correct ones c, and computes the exact probability of drawing k samples (without replacement) with at least one correct:

pass@k = 1 − C(n−c, k) / C(n, k)
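
To make the difference between the two formulas concrete, here is a minimal Python sketch. The function names are hypothetical and are not taken from human-eval or from this tool. It uses `math.comb`, which works on arbitrary-precision integers, so it is fine for illustration even though the same expression would overflow in fixed-width arithmetic.

```python
from math import comb


def pass_at_k_closed_form(n: int, c: int, k: int) -> float:
    """Unbiased estimate 1 - C(n-c, k) / C(n, k), for 1 <= k <= n."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when k > n - c, which yields pass@k = 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


def plug_in_estimate(n: int, c: int, k: int) -> float:
    """Biased form 1 - (1 - c/n)^k, shown only for comparison."""
    return 1.0 - (1.0 - c / n) ** k


unbiased = pass_at_k_closed_form(10, 3, 5)
biased = plug_in_estimate(10, 3, 5)
print(f"{unbiased:.4f} {biased:.4f}")  # prints "0.9167 0.8319"
```

For n = 10, c = 3, k = 5 the plug-in form comes out roughly eight percentage points below the unbiased value.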

Evaluating the binomial coefficients directly overflows for large n, so the paper's reference code (mirrored in OpenAI's estimate_pass_at_k) uses a numerically stable product instead, which is exactly what this tool runs:

  1. If n − c < k there are fewer than k failing samples, so any k you draw must contain a passing one — pass@k = 1.
  2. Otherwise pass@k = 1 − ∏(i = n−c+1 … n) (1 − k / i). Every factor lies in [0, 1], so the product never overflows or underflows for the supported input range.
  3. For a benchmark of P problems, the dataset pass@k is the arithmetic mean of each problem's pass@k, computed with that problem's own n_i and c_i. This is the same aggregation HumanEval uses.
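
The three steps above can be sketched in a few lines of Python. This is an illustrative version only: the names `pass_at_k` and `dataset_pass_at_k` and the `(n_i, c_i)` tuple layout are assumptions made for this example, not the tool's actual source or the human-eval API.

```python
from math import prod


def pass_at_k(n: int, c: int, k: int) -> float:
    """Per-problem unbiased pass@k via the stable product form."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # Step 1: fewer than k failing samples, so every draw of k contains a pass.
    if n - c < k:
        return 1.0
    # Step 2: 1 - prod_{i = n-c+1 .. n} (1 - k / i); an empty product is 1.
    return 1.0 - prod(1.0 - k / i for i in range(n - c + 1, n + 1))


def dataset_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Step 3: arithmetic mean over problems, each with its own (n_i, c_i)."""
    scores = [pass_at_k(n, c, k) for n, c in results]
    return sum(scores) / len(scores)


print(f"{pass_at_k(10, 3, 5):.4f}")                          # prints "0.9167"
print(f"{dataset_pass_at_k([(10, 3), (10, 0)], k=5):.4f}")   # prints "0.4583"
```

Averaging per-problem scores, rather than pooling all samples, lets problems with different n_i contribute equally to the dataset value.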

Two identities make the result self-checking: pass@1 always equals c / n (the product telescopes to (n−c)/n), and pass@k is 0 when c = 0 and 1 when c > n − k. The tool also recomputes pass@k through the independent binomial closed form 1 − C(n−c,k)/C(n,k) evaluated as a stable falling ratio, and confirms the two methods agree to ~12 decimal places. All of it is plain double-precision arithmetic in your browser — no model, no API, nothing uploaded.
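
One way to reproduce this kind of cross-check is sketched below. It assumes that "falling ratio" means expanding C(n−c, k) / C(n, k) as ∏(j = 0 … k−1) (n−c−j) / (n−j); how the tool implements it internally is not stated on this page, and both function names are hypothetical.

```python
def product_form(n: int, c: int, k: int) -> float:
    """1 - prod_{i = n-c+1 .. n} (1 - k / i)."""
    if n - c < k:
        return 1.0
    failing = 1.0
    for i in range(n - c + 1, n + 1):
        failing *= 1.0 - k / i
    return 1.0 - failing


def falling_ratio_form(n: int, c: int, k: int) -> float:
    """1 - C(n-c, k) / C(n, k), expanded factor by factor."""
    if n - c < k:
        return 1.0
    ratio = 1.0
    for j in range(k):
        # Each factor is (failing samples left) / (samples left) after j draws.
        ratio *= (n - c - j) / (n - j)
    return 1.0 - ratio


# Sweep small inputs and confirm both forms and the identities agree.
for n in range(1, 31):
    for c in range(n + 1):
        for k in range(1, n + 1):
            a, b = product_form(n, c, k), falling_ratio_form(n, c, k)
            assert abs(a - b) < 1e-12
            if k == 1:
                assert abs(a - c / n) < 1e-12   # pass@1 = c / n
            if c == 0:
                assert a == 0.0                 # no passes, so pass@k = 0
            if c > n - k:
                assert a == 1.0                 # too few failures to fill k draws
print("all checks passed")
```

The two loops multiply different sets of factors (c of them in the product form, k in the falling ratio), so agreement between them is a meaningful check rather than the same computation run twice.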

Worked examples

pass@1 sanity check — n = 10, c = 3, k = 1

  1. n − c = 7 ≥ 1, so use the product form
  2. ∏(i = 8,9,10) (1 − 1/i) = (7/8)(8/9)(9/10) = 7/10 = 0.7
  3. pass@1 = 1 − 0.7 = 0.300000 = 30.0%
  4. Identity check: pass@1 = c/n = 3/10 = 30% ✓

pass@5 from the same samples — n = 10, c = 3, k = 5

  1. n − c = 7 ≥ 5, so use the product form
  2. ∏(i = 8,9,10) (1 − 5/i) = (3/8)(4/9)(5/10) = 60/720 = 1/12
  3. pass@5 = 1 − 1/12 = 11/12 = 0.916667 = 91.67%
  4. Closed form: 1 − C(7,5)/C(10,5) = 1 − 21/252 = 11/12 ✓

No correct samples (edge) — n = 5, c = 0, k = 1

  1. i runs from n−c+1 = 6 to n = 5 — an empty product = 1
  2. pass@k = 1 − 1 = 0.000000 = 0%
  3. Matches the c = 0 ⇒ pass@k = 0 identity ✓

Benchmark aggregation — two problems, both n = 10, k = 5

  1. Problem 1: c = 3 → pass@5 = 91.67% (example above)
  2. Problem 2: c = 0 → pass@5 = 0%
  3. Dataset pass@5 = (91.67% + 0%) / 2 = 45.83% ✓
  4. Each problem uses its own n and c; the dataset value is their mean
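
The fractions in the worked examples above can be reproduced exactly with Python's `fractions.Fraction`, which avoids floating-point rounding altogether. This is a verification sketch with a hypothetical helper name, `pass_at_k_exact`; it is not how the tool itself computes results.

```python
from fractions import Fraction


def pass_at_k_exact(n: int, c: int, k: int) -> Fraction:
    """Product form of pass@k in exact rational arithmetic."""
    if n - c < k:
        return Fraction(1)
    failing = Fraction(1)
    for i in range(n - c + 1, n + 1):
        failing *= Fraction(i - k, i)  # same factor as 1 - k / i
    return 1 - failing


print(pass_at_k_exact(10, 3, 1))  # prints "3/10"
print(pass_at_k_exact(10, 3, 5))  # prints "11/12"
print(pass_at_k_exact(5, 0, 1))   # prints "0"

# Benchmark aggregation: mean of the two per-problem values for k = 5.
mean = (pass_at_k_exact(10, 3, 5) + pass_at_k_exact(10, 0, 5)) / 2
print(mean, f"{float(mean):.2%}")  # prints "11/24 45.83%"
```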

Sources & references

The formula and reference implementation on this page were last cross-checked against these sources on 2026-06-30. pass@k is a fixed mathematical definition, so this tool needs no rate updates — only the worked examples are periodically re-reconciled.
