How do you calculate perplexity of a language model?

Take the model's probability pᵢ for each of the N tokens in the test set, sum the natural logs, average and negate them to get the cross-entropy H = −(1/N)·Σ ln pᵢ, then exponentiate: PP = exp(H). Equivalently PP = (∏ pᵢ)^(−1/N). Lower perplexity means the model assigned higher probability to the real text.
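A minimal Python sketch of that recipe, assuming the per-token probabilities are already available as a plain list. The function name perplexity_from_probs is illustrative and is not taken from this calculator's own code.

```python
import math

def perplexity_from_probs(probs):
    """Return (cross-entropy in nats, perplexity) for per-token probabilities."""
    if not probs:
        raise ValueError("need at least one token probability")
    # Cross-entropy H: negated average of the natural logs.
    h_nats = -sum(math.log(p) for p in probs) / len(probs)
    # Perplexity is the exponential of H.
    return h_nats, math.exp(h_nats)

h, pp = perplexity_from_probs([0.5, 0.25, 0.25, 0.5])
print(f"H = {h:.4f} nats, PP = {pp:.4f}")  # H = 1.0397 nats, PP = 2.8284
```

The input is the same four-token example used in the worked examples on this page.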

What is the relationship between perplexity and cross-entropy loss?

Perplexity is the exponential of the cross-entropy. If the average cross-entropy (or negative log-likelihood) loss is H in nats, then PP = exp(H). If H is measured in bits, PP = 2^H. So a model with a loss of 2.3 nats has a perplexity of e^2.3 ≈ 9.97. The two numbers carry the same information on different scales.

How do you convert cross-entropy loss to perplexity?

If your loss is in nats — the default for PyTorch CrossEntropyLoss and most frameworks — raise e to the loss: PP = exp(loss). If the loss is in bits-per-token, use PP = 2^loss. For example, a bits-per-token of 1.5 gives PP = 2^1.5 ≈ 2.83. Switch the unit toggle in the 'From loss' tab to match your framework.
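As a rough sketch, the conversion is one line per unit. The helper name perplexity_from_loss and its unit argument are assumptions made for illustration; they mirror the unit toggle but are not this tool's code.

```python
import math

def perplexity_from_loss(loss, unit="nats"):
    """Convert an average per-token loss to perplexity."""
    if unit == "nats":
        return math.exp(loss)   # PP = e^loss
    if unit == "bits":
        return 2.0 ** loss      # PP = 2^loss
    raise ValueError(f"unknown unit: {unit!r}")

print(round(perplexity_from_loss(2.3), 2))          # 9.97
print(round(perplexity_from_loss(1.5, "bits"), 2))  # 2.83
```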

What is a good perplexity score for a language model?

It depends entirely on the dataset and tokenizer, so there is no universal threshold. On the WikiText-2 word-level benchmark, classic LSTMs scored around 60–80 and strong transformers reach the low 20s; modern large models report single-digit-to-teens perplexity on their own test data. Only compare perplexities computed on the same corpus with the same tokenizer.

Is lower or higher perplexity better?

Lower is better. Perplexity is the model's average uncertainty — roughly the number of equally likely choices it is deciding between per token. A perfect model that always assigned probability 1 to the correct token has a perplexity of 1. A uniform model over a vocabulary of V tokens has perplexity V. So smaller perplexity means a more confident, better-fitting model.
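The two boundary cases can be checked numerically. This is a small sketch with a hypothetical vocabulary size V = 1000 and an arbitrary sequence length of 50 tokens; neither number comes from this page.

```python
import math

def perplexity(probs):
    """Perplexity as exp of the negated mean log-probability."""
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))

V = 1000  # hypothetical vocabulary size
n = 50    # arbitrary sequence length

uniform = perplexity([1.0 / V] * n)  # every observed token gets probability 1/V
perfect = perplexity([1.0] * n)      # every observed token gets probability 1

print(uniform)  # approximately V, up to floating-point rounding
print(perfect)  # 1.0
```

The sequence length does not affect either result, because perplexity is a per-token average.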

Why must every token probability be greater than 0?

Perplexity uses ln pᵢ, and ln 0 is −∞, which makes the cross-entropy infinite and the perplexity blow up. A probability of exactly 0 means the model ruled out a token that actually occurred, which is the worst possible prediction. Real models avoid this with smoothing or a softmax that never outputs a hard 0. This calculator rejects 0 and negative probabilities with a clear message.
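A sketch of that kind of input check in Python, assuming the valid range is (0, 1] as stated in the input field's description. The function name and the error wording are made up for illustration and are not the calculator's actual message.

```python
import math

def checked_log_probs(probs):
    """Return ln p for each p, rejecting any value outside (0, 1]."""
    logs = []
    for i, p in enumerate(probs, start=1):
        # The negated chained comparison also rejects NaN.
        if not (0.0 < p <= 1.0):
            raise ValueError(f"token #{i}: probability {p} is not in (0, 1]")
        logs.append(math.log(p))
    return logs

try:
    checked_log_probs([0.5, 0.0, 0.25])
except ValueError as err:
    print(err)  # token #2: probability 0.0 is not in (0, 1]
```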

Can I compare perplexity between two different models?

Only if both numbers were computed on the same test set with the same tokenization. Perplexity is per-token, so a model that splits text into more, smaller subword tokens can look artificially lower than one using larger tokens, even on identical text. For cross-tokenizer comparison, convert to bits-per-character first; this tool reports single-sequence perplexity, not BPC.
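The tokenizer effect can be illustrated with made-up numbers. The sketch below assumes one 1,000-character text for which two hypothetical models happen to have the same total negative log-likelihood (900 nats) but different token counts; all of these figures are invented for the example.

```python
import math

def perplexity(total_nll_nats, num_tokens):
    """Per-token perplexity from a total negative log-likelihood in nats."""
    return math.exp(total_nll_nats / num_tokens)

def bits_per_character(total_nll_nats, num_chars):
    """Total NLL converted to bits, then divided by the character count."""
    return total_nll_nats / (num_chars * math.log(2))

total_nll = 900.0  # hypothetical total NLL in nats, identical for both models
num_chars = 1000   # length of the shared text in characters

pp_coarse = perplexity(total_nll, 200)  # tokenizer A: 200 larger tokens
pp_fine = perplexity(total_nll, 300)    # tokenizer B: 300 smaller tokens
bpc = bits_per_character(total_nll, num_chars)

print(f"{pp_coarse:.2f} {pp_fine:.2f} {bpc:.4f}")  # 90.02 20.09 1.2984
```

Both models fit the text equally well, yet the finer tokenizer reports a much lower perplexity. Bits-per-character is the same for both because it normalizes by characters, which do not depend on the tokenizer.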

Does this calculator send my data anywhere?

No. Every calculation — parsing your probabilities, summing the logs, exponentiating the cross-entropy — runs in your browser with plain JavaScript. Nothing is uploaded, logged, or stored. You can paste a validation-set log-likelihood with no privacy concern, and the page works offline once loaded.

Perplexity Calculator

Compute language-model perplexity in your browser — from a list of token probabilities, a cross-entropy / NLL loss, or a total log-likelihood. See the cross-entropy in nats and bits-per-token, the average token probability, and the exact formula behind every result.

By Induwara Ashinsana, Executive Director, Ryzera Technologies. Updated Jun 10, 2026

Perplexity calculator

Per-token probabilities (pᵢ)

The model's probability for each observed token, each in (0, 1]. Separate with commas, spaces, or new lines.

Values are log-probabilities (ln pᵢ)

Perplexity (PP)

2.8284

Lower is better — 1 is a perfect model

Cross-entropy (nats)

1.0397

ln PP — matches PyTorch loss

Bits per token

1.5000

log₂ PP

Avg token probability

0.3536

1 / PP

Formula used

PP = exp( −(1/N) · Σ ln pᵢ )

On average the model is as uncertain as choosing uniformly among 2.83 equally likely tokens, over N = 4 tokens.

Cross-check. The exponential form gives PP = 2.8284; the independent product form (∏pᵢ)^(−1/N) gives 2.8284. They reconcile, as they must. (Shown for up to 50 tokens, where the raw product doesn't underflow.)
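Why the product form is limited to short inputs can be seen in a few lines of Python. This is a sketch under the assumption of ordinary double-precision floats; the 200-token sequence is a made-up example, and the function names are illustrative.

```python
import math

def pp_product_form(probs):
    """(prod p_i)^(-1/N): exact in principle, fragile for long sequences."""
    return math.prod(probs) ** (-1.0 / len(probs))

def pp_log_form(probs):
    """exp(-(1/N) * sum ln p_i): works on logs, so it does not underflow."""
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))

short = [0.5, 0.25, 0.25, 0.5]
long_seq = [0.001] * 200  # true product is 1e-600, far below the double range

print(pp_product_form(short), pp_log_form(short))  # both about 2.8284
print(math.prod(long_seq))    # 0.0: the raw product has underflowed
print(pp_log_form(long_seq))  # still about 1000
```

Calling pp_product_form(long_seq) would raise ZeroDivisionError in Python, because the underflowed product 0.0 is raised to a negative power.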

Per-token contributions

Token	Probability pᵢ	ln pᵢ
#1	0.5000	-0.6931
#2	0.2500	-1.3863
#3	0.2500	-1.3863
#4	0.5000	-0.6931
Σ ln pᵢ		-4.1589

Method: PP = exp(−(1/N)·Σ ln pᵢ) = exp(H_nats) = 2^(H_bits), with H_bits = H_nats / ln 2. Sources: Jurafsky & Martin, Speech and Language Processing (3rd ed.), Ch. 3; Hugging Face perplexity guide; PyTorch CrossEntropyLoss. No data leaves this page.

How it works

Perplexity measures how well a probability model predicts a sample of text: it is the model's average uncertainty per token, read as the number of equally likely options it is effectively choosing between. Lower is better. The definition comes from Jurafsky & Martin's Speech and Language Processing, Chapter 3.

For a test sequence of N tokens, where the model assigns probability pᵢ to the i-th observed token in context, perplexity is the inverse geometric mean of those probabilities:

PP = ( ∏ pᵢ )^(−1/N) = exp( −(1/N) · Σ ln pᵢ )

The exponent −(1/N)·Σ ln pᵢ is the average cross-entropy (equivalently, the mean negative log-likelihood) H, in nats. So perplexity is simply the exponential of the cross-entropy, and the two are interchangeable. Each of the calculator's three input modes reduces to finding H:

From probabilities. Sum the natural logs of the per-token probabilities, average and negate to get H = −(1/N)·Σ ln pᵢ, then PP = exp(H). If you enter log-probabilities directly, the logs are already taken.
From loss. An average cross-entropy / NLL loss already is H. In nats — PyTorch CrossEntropyLoss, TensorFlow — PP = exp(loss). In bits, PP = 2^loss.
From log-likelihood. Given a total Σ log P and token count N, the per-token cross-entropy is H = −(Σ log P)/N in the chosen unit, and PP is its exponential (base e for nats, base 2 for bits).
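The log-likelihood mode can be sketched as follows, assuming a total log-probability, a token count, and a unit flag as inputs. The name from_log_likelihood and the returned tuple layout are illustrative choices, not the tool's implementation.

```python
import math

LN2 = math.log(2)

def from_log_likelihood(total_log_p, n_tokens, unit="nats"):
    """Return (PP, H in nats, H in bits) from a total log-likelihood."""
    if n_tokens <= 0:
        raise ValueError("token count must be positive")
    h = -total_log_p / n_tokens  # per-token cross-entropy in the given unit
    if unit == "nats":
        h_nats = h
    elif unit == "bits":
        h_nats = h * LN2         # bits to nats
    else:
        raise ValueError(f"unknown unit: {unit!r}")
    return math.exp(h_nats), h_nats, h_nats / LN2

pp, h_nats, h_bits = from_log_likelihood(-9000, 1000, unit="bits")
print(f"PP = {pp:.4f}, H = {h_nats:.6f} nats = {h_bits:.4f} bits")
# PP = 512.0000, H = 6.238325 nats = 9.0000 bits
```

Converting everything to nats first keeps a single exponentiation path, so the bits and nats inputs cannot drift apart.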

Units convert with H_bits = H_nats / ln 2, so log₂ PP is the bits-per-token figure and 1/PP = exp(−H_nats) is the average per-token probability. All three input modes converge on the same (PP, nats, bits) triple, which is why the tool can cross-check a probabilities-mode result against the independent product form (∏ pᵢ)^(−1/N) and have them agree to floating-point precision. Probabilities of 0 or below are rejected, because ln 0 = −∞ would send perplexity to infinity. Everything is plain double-precision arithmetic in your browser.

Worked examples

From token probabilities — p = [0.5, 0.25, 0.25, 0.5], N = 4

Σ ln p = −0.693147 − 1.386294 − 1.386294 − 0.693147 = −4.158883
H (nats) = −(−4.158883) / 4 = 1.039721
PP = exp(1.039721) = 2.828427
Cross-check: ∏p = 0.015625, 0.015625^(−1/4) = 64^(1/4) = 2.828427 ✓
Bits/token = 1.039721 / ln 2 = 1.5; avg token prob = 1/2.828427 = 0.353553

From cross-entropy loss — PyTorch CrossEntropyLoss = 2.3 (nats)

Loss is already the per-token cross-entropy H = 2.3 nats
PP = e^2.3 = 9.974182
Bits/token = 2.3 / ln 2 = 3.318199
Avg token prob = 1 / 9.974182 = 0.100259
Same input as bits: 3.318199 bits → 2^3.318199 = 9.974182 ✓

From log-likelihood in bits — total log₂P = −9000 over N = 1000

H (bits) = −(−9000) / 1000 = 9 bits per token
H (nats) = 9 × ln 2 = 6.238325
PP = 2^9 = exp(6.238325) = 512
Equivalent to a uniform model over a 512-token vocabulary
Avg token prob = 1 / 512 = 0.001953 ✓

Frequently asked questions

Sources & references

The formulas on this page were last cross-checked against these sources on 2026-06-10. Perplexity is a stable mathematical definition, so this tool needs no rate or schedule updates — only the worked examples are periodically re-reconciled.
