What is pass@k in LLM code evaluation?

pass@k is the probability that at least one of k generated code samples passes the unit tests for a problem. It is the standard functional-correctness metric for code models on benchmarks like HumanEval and MBPP. Instead of grading text similarity, it runs each completion and checks whether it actually works, then estimates the chance that sampling k attempts would yield a working one.

How is pass@1 calculated from multiple samples?

Generate n samples per problem and count how many pass (c); that problem's pass@1 is c / n, the single-sample success rate, and the dataset pass@1 is the mean of those rates across problems. Estimating pass@1 from many samples (rather than generating just one) gives a far lower-variance number. This is exactly the k = 1 case of the unbiased estimator pass@k = 1 − C(n−c, k) / C(n, k).
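As a minimal sketch of that calculation, the Python below averages per-problem c / n over a small, made-up set of results. The results list and the function name pass_at_1 are illustrative assumptions, not part of this tool or of any evaluation harness.

```python
# Hypothetical per-problem results: (n samples generated, c samples that passed).
results = [(10, 3), (10, 0), (10, 10), (10, 5)]


def pass_at_1(n: int, c: int) -> float:
    """Single-sample success rate for one problem."""
    if n <= 0 or not 0 <= c <= n:
        raise ValueError("need n > 0 and 0 <= c <= n")
    return c / n


# Per-problem rates first, then the arithmetic mean over problems.
per_problem = [pass_at_1(n, c) for n, c in results]
dataset_pass_at_1 = sum(per_problem) / len(per_problem)

print(per_problem)
print(f"dataset pass@1 = {dataset_pass_at_1:.4f}")
```

Averaging per-problem rates (rather than pooling all samples) keeps each problem's weight equal even when problems have different n.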

Why is the naive pass@k estimator biased?

The tempting shortcut (generate k samples and report 1 if any passes, averaged over problems) is a high-variance estimate. The plug-in form 1 − (1 − c/n)^k, which treats the k draws as independent (with replacement) at the observed rate c/n, is biased: it understates pass@k. For example, with n = 10, c = 3 and k = 5 it gives 1 − 0.7^5 ≈ 83.2%, against 91.67% from the unbiased estimator. Chen et al. 2021 instead draw k samples without replacement from the n generated, giving the unbiased combinatorial estimator this tool uses.
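The bias can be checked numerically. The sketch below assumes, purely for illustration, that each sample passes independently with a fixed true probability p, so c follows a Binomial(n, p) distribution; it then computes the exact expected value of both estimators. The function names and the values of n, k and p are hypothetical.

```python
from math import comb


def plug_in(n: int, c: int, k: int) -> float:
    """Biased plug-in form: substitutes the observed rate c/n for the true rate."""
    return 1.0 - (1.0 - c / n) ** k


def unbiased(n: int, c: int, k: int) -> float:
    """Combinatorial estimator; math.comb returns 0 when k > n - c."""
    return 1.0 - comb(n - c, k) / comb(n, k)


def expected(estimator, n: int, k: int, p: float) -> float:
    """Exact expectation of an estimator when c ~ Binomial(n, p) (an assumption)."""
    return sum(
        comb(n, c) * p**c * (1 - p) ** (n - c) * estimator(n, c, k)
        for c in range(n + 1)
    )


n, k, p = 10, 5, 0.3                      # illustrative values only
truth = 1.0 - (1.0 - p) ** k              # true pass@k under the assumed model
mean_plug_in = expected(plug_in, n, k, p)
mean_unbiased = expected(unbiased, n, k, p)

print(f"true pass@{k}       = {truth:.6f}")
print(f"E[plug-in]        = {mean_plug_in:.6f}")
print(f"E[unbiased]       = {mean_unbiased:.6f}")
```

Under this model the unbiased estimator's expectation matches the true value up to rounding, while the plug-in form's expectation falls below it.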

What's the difference between pass@1, pass@10 and pass@100?

They report success when you allow 1, 10, or 100 attempts at a problem. pass@100 is always at least pass@10, which is at least pass@1, because more attempts can only help. The gap shows how much a model benefits from sampling: a model with low pass@1 but high pass@100 often knows the answer but needs many tries to surface it.
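To see the ordering on concrete numbers, here is a small sketch that evaluates the estimator at k = 1, 10 and 100 for one hypothetical problem (5 of 200 samples passing). The helper name pass_at_k and the inputs are assumptions chosen for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator 1 - C(n-c, k) / C(n, k); requires 1 <= k <= n."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)


n, c = 200, 5   # hypothetical problem: 5 of 200 samples passed
scores = {k: pass_at_k(n, c, k) for k in (1, 10, 100)}

for k, score in scores.items():
    print(f"pass@{k} = {score:.4f}")
```

A low pass@1 alongside a much higher pass@100, as in this made-up case, is the pattern described above: correct solutions exist among the samples but are rare.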

How many samples (n) do I need to estimate pass@100 reliably?

You need n ≥ k for the estimator to be defined, and the original HumanEval setup uses n = 200 to estimate pass@100 with low variance. A common rule is n ≈ 2k–4k. With n = k the estimate degenerates (it can only be 0 or 1 per problem), so generate comfortably more samples than the largest k you plan to report.
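The effect of n on the estimate's spread can be sketched as follows. The code assumes, for illustration only, that each sample passes independently with probability p, and computes the exact standard deviation of the per-problem estimator for several n. The function names and the values of k and p are hypothetical.

```python
from math import comb, sqrt


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator 1 - C(n-c, k) / C(n, k); requires 1 <= k <= n."""
    return 1.0 - comb(n - c, k) / comb(n, k)


def estimator_std(n: int, k: int, p: float) -> float:
    """Exact std of the estimator when c ~ Binomial(n, p) (an assumption)."""
    probs = [comb(n, c) * p**c * (1 - p) ** (n - c) for c in range(n + 1)]
    values = [pass_at_k(n, c, k) for c in range(n + 1)]
    mean = sum(w * v for w, v in zip(probs, values))
    var = sum(w * (v - mean) ** 2 for w, v in zip(probs, values))
    return sqrt(max(var, 0.0))


k, p = 10, 0.05   # illustrative values only
stds = {n: estimator_std(n, k, p) for n in (k, 2 * k, 4 * k, 20 * k)}
for n, s in stds.items():
    print(f"n = {n:3d}  std of pass@{k} estimate = {s:.4f}")

# With n = k the estimator can only take the values 0 and 1.
values_at_n_equals_k = sorted({pass_at_k(k, c, k) for c in range(k + 1)})
print(values_at_n_equals_k)
```

Under this model the spread shrinks as n grows, which is the reason to generate comfortably more samples than the largest k reported.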

How is dataset pass@k aggregated across problems?

Compute pass@k for each problem with its own n and c, then take the arithmetic mean across all problems. That mean is the dataset-level pass@k quoted on leaderboards. In benchmark mode this tool averages over the problems where n ≥ k and shows how many problems contributed, so any with too few samples are visible rather than silently dropped.
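A minimal sketch of that aggregation, including the n ≥ k filter and the count of contributing problems, might look like the following. It is not this tool's source code; the function names and the sample data are assumptions.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator 1 - C(n-c, k) / C(n, k); requires 1 <= k <= n."""
    return 1.0 - comb(n - c, k) / comb(n, k)


def dataset_pass_at_k(problems, k: int):
    """Return (mean pass@k, number of problems that contributed).

    Problems with n < k are skipped because the estimator is undefined there.
    Returns (None, 0) when no problem has enough samples.
    """
    eligible = [(n, c) for n, c in problems if n >= k]
    if not eligible:
        return None, 0
    scores = [pass_at_k(n, c, k) for n, c in eligible]
    return sum(scores) / len(scores), len(eligible)


# Hypothetical benchmark: (n, c) per problem; the third has too few samples for k = 5.
problems = [(10, 3), (10, 0), (4, 2)]
mean, used = dataset_pass_at_k(problems, k=5)
print(f"dataset pass@5 = {mean:.4f} over {used} of {len(problems)} problems")
```

Returning the contributing count alongside the mean makes it obvious when a reported figure covers only part of the benchmark.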

Does pass@k run my code or call a model?

No. This tool only does the arithmetic — you supply n and c from your own evaluation harness, which is what actually executes the completions against the hidden tests. Nothing is uploaded, no model is called, and no code runs in a sandbox here. That keeps it free, keyless, fully client-side, and safe to use on private benchmark results.

Why does pass@k jump to 100% in some cases?

When the number of failing samples (n − c) is smaller than k, it is impossible to draw k samples that all fail, so at least one of the k must pass and pass@k = 1. For example, with n = 8 and c = 8 every sample is correct, so pass@k = 100% for any k. The estimator returns exactly 1.0 in that regime by definition.


pass@k Calculator

Compute the unbiased pass@k metric for code-generation LLMs the way HumanEval does. Enter the samples you generated (n), how many passed the tests (c), and the k you want, for a single problem or a whole benchmark. Matches OpenAI's human-eval estimator, entirely in your browser.

By Induwara Ashinsana, Executive Director, Ryzera Technologies · Updated Jun 30, 2026

pass@k calculator

n — samples generated

Completions per problem.

c — correct samples

How many passed the tests.

k — attempts

Samples you draw (1 ≤ k ≤ n).

pass@1

30.00%

Fraction 0.3000 · higher is better

pass@1 (= c / n)

30.00%

Single-sample success rate

Cross-check (binomial)

30.00%

Reconciles with the product form

With 10 samples and 3 correct, there is a 30.00% chance that a single sampled completion passes the tests.

pass@k for the same n / c

Metric	pass@k (%)	Fraction
pass@1	30.00%	0.3000
pass@5	91.67%	0.9167
pass@10	100.00%	1.0000
pass@100	n < k	—

Rows where k exceeds n are not defined — you can't draw more attempts than the 10 samples you generated.

Method: unbiased pass@k = 1 − ∏(i = n−c+1 … n) (1 − k / i), Chen et al. 2021 (arXiv:2107.03374, §2.1); reference code in OpenAI's human-eval. No data leaves this page.

How it works

pass@k is the probability that at least one of k code samples passes a problem's unit tests. It is the headline metric for code-generation models on benchmarks such as HumanEval, MBPP and LiveCodeBench, and it was defined in Chen et al. 2021, Evaluating Large Language Models Trained on Code (arXiv:2107.03374). Rather than score how a completion looks, pass@k scores whether it works.

The naive approach (generate k samples, mark the problem solved if any passes) is noisy, and the plug-in form 1 − (1 − c/n)^k is biased low because it treats the k draws as independent, with-replacement trials at the observed rate c/n. The unbiased estimator instead generates n ≥ k samples per problem, counts the correct ones c, and computes the exact probability of drawing k samples (without replacement) with at least one correct:

pass@k = 1 − C(n−c, k) / C(n, k)

Evaluating the binomial coefficients directly overflows for large n, so the paper's reference code (mirrored in OpenAI's estimate_pass_at_k) uses a numerically stable product instead, which is exactly what this tool runs:

If n − c < k there are fewer than k failing samples, so any k you draw must contain a passing one — pass@k = 1.
Otherwise pass@k = 1 − ∏(i = n−c+1 … n) (1 − k / i). Every factor lies in [0, 1], so the product never overflows or underflows for the supported input range.
For a benchmark of P problems, the dataset pass@k is the arithmetic mean of each problem's pass@k, computed with that problem's own n_i and c_i, which is the same aggregation HumanEval uses.
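The first two rules above can be sketched in a few lines of Python. This is written from the description on this page rather than copied from the reference code, and the function name pass_at_k and its input checks are assumptions.

```python
def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k via the numerically stable product form.

    n: samples generated, c: samples that passed, k: attempts (1 <= k <= n).
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # Fewer than k failing samples: every draw of k must include a pass.
    if n - c < k:
        return 1.0
    # Product over i = n-c+1 .. n of (1 - k/i); empty when c = 0, giving 1.
    prod = 1.0
    for i in range(n - c + 1, n + 1):
        prod *= 1.0 - k / i
    return 1.0 - prod


print(f"{pass_at_k(10, 3, 1):.6f}")   # n = 10, c = 3, k = 1
print(f"{pass_at_k(10, 3, 5):.6f}")   # n = 10, c = 3, k = 5
print(f"{pass_at_k(10, 8, 3):.6f}")   # only 2 failing samples, fewer than k = 3
```

Working with a running product of factors between 0 and 1 avoids forming the large binomial coefficients at all.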

Two identities make the result self-checking: pass@1 always equals c / n (the product telescopes to (n−c)/n), and pass@k is 0 when c = 0 and 1 when c > n − k. The tool also recomputes pass@k through the independent binomial closed form 1 − C(n−c,k)/C(n,k) evaluated as a stable falling ratio, and confirms the two methods agree to ~12 decimal places. All of it is plain double-precision arithmetic in your browser — no model, no API, nothing uploaded.
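A cross-check along these lines can be sketched as below: the product form is compared against the closed form written as a falling ratio, C(n−c, k) / C(n, k) = ∏(j = 0 … k−1) (n−c−j) / (n−j), over a grid of small inputs. The function names, the grid size and the tolerance are assumptions, not this tool's actual code.

```python
def product_form(n: int, c: int, k: int) -> float:
    """pass@k as 1 - prod_{i=n-c+1..n} (1 - k/i), with the n - c < k shortcut."""
    if n - c < k:
        return 1.0
    prod = 1.0
    for i in range(n - c + 1, n + 1):
        prod *= 1.0 - k / i
    return 1.0 - prod


def falling_ratio_form(n: int, c: int, k: int) -> float:
    """pass@k as 1 - C(n-c, k)/C(n, k), evaluated factor by factor."""
    ratio = 1.0
    for j in range(k):
        ratio *= (n - c - j) / (n - j)
        if ratio == 0.0:          # ran out of failing samples: C(n-c, k) = 0
            break
    return 1.0 - ratio


# Largest disagreement between the two forms over all valid (n, c, k) up to n = 40.
worst = 0.0
for n in range(1, 41):
    for c in range(n + 1):
        for k in range(1, n + 1):
            diff = abs(product_form(n, c, k) - falling_ratio_form(n, c, k))
            worst = max(worst, diff)

print(f"largest disagreement: {worst:.3e}")
```

The same loop is a convenient place to assert the two identities, pass@1 = c / n and pass@k = 0 when c = 0.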

Worked examples

pass@1 sanity check — n = 10, c = 3, k = 1

n − c = 7 ≥ 1, so use the product form
∏(i = 8,9,10) (1 − 1/i) = (7/8)(8/9)(9/10) = 7/10 = 0.7
pass@1 = 1 − 0.7 = 0.300000 = 30.0%
Identity check: pass@1 = c/n = 3/10 = 30% ✓

pass@5 from the same samples — n = 10, c = 3, k = 5

n − c = 7 ≥ 5, so use the product form
∏(i = 8,9,10) (1 − 5/i) = (3/8)(4/9)(5/10) = 60/720 = 1/12
pass@5 = 1 − 1/12 = 11/12 = 0.916667 = 91.67%
Closed form: 1 − C(7,5)/C(10,5) = 1 − 21/252 = 11/12 ✓

No correct samples (edge) — n = 5, c = 0, k = 1

i runs from n−c+1 = 6 to n = 5 — an empty product = 1
pass@k = 1 − 1 = 0.000000 = 0%
Matches the c = 0 ⇒ pass@k = 0 identity ✓

Benchmark aggregation — two problems, both n = 10, k = 5

Problem 1: c = 3 → pass@5 = 91.67% (example above)
Problem 2: c = 0 → pass@5 = 0%
Dataset pass@5 = (91.67% + 0%) / 2 = 45.83% ✓
Each problem uses its own n and c; the dataset value is their mean
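The four worked examples can be reproduced exactly with rational arithmetic. The sketch below uses Python's fractions module and the closed form 1 − C(n−c, k) / C(n, k); the helper name pass_at_k_exact is an assumption made for this illustration.

```python
from fractions import Fraction
from math import comb


def pass_at_k_exact(n: int, c: int, k: int) -> Fraction:
    """Exact pass@k = 1 - C(n-c, k) / C(n, k) as a rational number."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    return 1 - Fraction(comb(n - c, k), comb(n, k))


sanity = pass_at_k_exact(10, 3, 1)      # pass@1 sanity check
five = pass_at_k_exact(10, 3, 5)        # pass@5 from the same samples
edge = pass_at_k_exact(5, 0, 1)         # no correct samples
dataset = (pass_at_k_exact(10, 3, 5) + pass_at_k_exact(10, 0, 5)) / 2

for label, value in [("pass@1", sanity), ("pass@5", five),
                     ("edge", edge), ("dataset pass@5", dataset)]:
    print(f"{label}: {value} = {float(value):.6f}")
```

Exact fractions are only practical for small n; they serve here as a check on the hand arithmetic, not as a replacement for the stable product.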

Frequently asked questions

Sources & references

The formula and reference implementation on this page were last cross-checked against these sources on 2026-06-30. pass@k is a fixed mathematical definition, so this tool needs no rate updates — only the worked examples are periodically re-reconciled.

