pass@k Calculator
Compute the unbiased pass@k metric for code-generation LLMs the way HumanEval does. Enter the number of samples you generated (n), how many passed the tests (c), and the k you want, for a single problem or a whole benchmark. The result matches OpenAI's human-eval estimator and is computed entirely in your browser.
How it works
pass@k is the probability that at least one of k code samples passes a problem's unit tests. It is the headline metric for code-generation models on benchmarks such as HumanEval, MBPP and LiveCodeBench, and it was defined in Chen et al. 2021, Evaluating Large Language Models Trained on Code (arXiv:2107.03374). Rather than score how a completion looks, pass@k scores whether it works.
The naive approach — generate k samples, mark the problem solved if any passes — is noisy, and the closed form 1 − (1 − c/n)^k is biased because it treats samples as drawn with replacement. The unbiased estimator instead generates n ≥ k samples per problem, counts the correct ones c, and computes the exact probability of drawing k samples (without replacement) with at least one correct:
pass@k = 1 − C(n−c, k) / C(n, k)
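To make the gap between the two formulas concrete, here is a minimal sketch that evaluates both for a single problem. The function names pass_at_k_exact and pass_at_k_plugin are invented for this illustration and are not part of human-eval. Python's math.comb works on arbitrary-precision integers, so the direct binomial form is safe here even though it would overflow in fixed-width arithmetic.

```python
from math import comb


def pass_at_k_exact(n: int, c: int, k: int) -> float:
    """Unbiased estimator: 1 - C(n-c, k) / C(n, k)."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, which yields pass@k = 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_at_k_plugin(n: int, c: int, k: int) -> float:
    """Biased plug-in estimate: 1 - (1 - c/n)^k."""
    if not (0 <= c <= n and n >= 1 and k >= 1):
        raise ValueError("need 0 <= c <= n, n >= 1 and k >= 1")
    return 1.0 - (1.0 - c / n) ** k


if __name__ == "__main__":
    n, c, k = 20, 3, 10  # illustrative numbers, not from a real benchmark
    print(f"exact  : {pass_at_k_exact(n, c, k):.4f}")
    print(f"plug-in: {pass_at_k_plugin(n, c, k):.4f}")
```

With n = 20, c = 3 and k = 10 the exact form gives 17/19 ≈ 0.8947 while the plug-in form gives about 0.8031, so the plug-in value is noticeably lower for the same data.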
Evaluating the binomial coefficients directly overflows for large n, so the paper's reference code (mirrored in OpenAI's estimate_pass_at_k) uses a numerically stable product instead, which is exactly what this tool runs:
- If n − c < k there are fewer than k failing samples, so any k you draw must contain a passing one, and pass@k = 1.
- Otherwise pass@k = 1 − ∏(i = n−c+1 … n) (1 − k / i). Every factor lies in [0, 1], so the product never overflows or underflows for the supported input range.
- For a benchmark of P problems, the dataset pass@k is the arithmetic mean of each problem's pass@k, computed with that problem's own n_i and c_i, which is the same aggregation HumanEval uses.
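The three rules above can be sketched in a few lines of plain Python. This is an illustrative re-implementation of the stable product, not the tool's actual source and not a copy of human-eval's code (which uses NumPy); the names pass_at_k and dataset_pass_at_k, and the (n, c) tuple layout, are assumptions made for the example.

```python
def pass_at_k(n: int, c: int, k: int) -> float:
    """Stable product form of 1 - C(n-c, k) / C(n, k) for one problem."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every draw of k contains a pass.
        return 1.0
    prod = 1.0
    for i in range(n - c + 1, n + 1):
        prod *= 1.0 - k / i  # each factor is in (0, 1) on this branch
    return 1.0 - prod


def dataset_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Arithmetic mean of per-problem pass@k over (n_i, c_i) pairs."""
    if not results:
        raise ValueError("need at least one problem")
    scores = [pass_at_k(n, c, k) for n, c in results]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Three hypothetical problems with 10 samples each.
    results = [(10, 0), (10, 3), (10, 10)]
    print(dataset_pass_at_k(results, k=1))
```

Each problem keeps its own n and c, so problems with different sample counts can be mixed in one benchmark as long as every n_i is at least k.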
Two identities make the result self-checking: pass@1 always equals c / n (the product telescopes to (n−c)/n), and pass@k is 0 when c = 0 and 1 when c > n − k. The tool also recomputes pass@k through the independent binomial closed form 1 − C(n−c,k)/C(n,k) evaluated as a stable falling ratio, and confirms the two methods agree to ~12 decimal places. All of it is plain double-precision arithmetic in your browser — no model, no API, nothing uploaded.
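A self-check along these lines might look like the sketch below. It assumes the "falling ratio" means C(n−c, k)/C(n, k) = ∏(j = 0 … k−1) (n−c−j)/(n−j); the function names and the 1e-12 tolerance are choices made for this example, not details taken from the tool's source.

```python
def pass_at_k_product(n: int, c: int, k: int) -> float:
    """Stable product: 1 - prod_{i=n-c+1}^{n} (1 - k/i)."""
    if n - c < k:
        return 1.0
    prod = 1.0
    for i in range(n - c + 1, n + 1):
        prod *= 1.0 - k / i
    return 1.0 - prod


def pass_at_k_falling(n: int, c: int, k: int) -> float:
    """Closed form via the falling ratio prod_{j=0}^{k-1} (n-c-j)/(n-j)."""
    if n - c < k:
        return 1.0
    ratio = 1.0
    for j in range(k):
        ratio *= (n - c - j) / (n - j)
    return 1.0 - ratio


def self_check(max_n: int, tol: float = 1e-12) -> int:
    """Verify the identities for all 1 <= k <= n <= max_n, 0 <= c <= n.

    Returns the number of (n, c, k) triples checked.
    """
    checked = 0
    for n in range(1, max_n + 1):
        for c in range(n + 1):
            # pass@1 must equal c / n.
            assert abs(pass_at_k_product(n, c, 1) - c / n) < tol
            for k in range(1, n + 1):
                p = pass_at_k_product(n, c, k)
                q = pass_at_k_falling(n, c, k)
                assert abs(p - q) < tol  # the two methods agree
                if c == 0:
                    assert p == 0.0
                if c > n - k:
                    assert p == 1.0
                checked += 1
    return checked


if __name__ == "__main__":
    print(self_check(40), "triples checked")
```

Because the two functions multiply different sequences of factors, agreement between them is a reasonable guard against an indexing slip in either loop.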
Worked examples
Frequently asked questions
Sources & references
- Chen et al. 2021 — Evaluating Large Language Models Trained on Code (arXiv:2107.03374), §2.1: the unbiased pass@k estimator
- OpenAI human-eval — estimate_pass_at_k reference implementation
- Kulal et al. 2019 — SPoC: Search-based Pseudocode to Code (arXiv:1906.04908): origin of the pass@k framing
The formula and reference implementation on this page were last cross-checked against these sources on 2026-06-30. pass@k is a fixed mathematical definition, so this tool needs no periodic data updates; only the worked examples are re-checked from time to time.