How does GPT split text into tokens?

GPT uses byte-pair encoding (BPE). It starts from raw UTF-8 bytes and repeatedly merges the most frequent adjacent pairs into a fixed vocabulary of sub-word tokens. Common words become a single token; rarer words, code, and non-English scripts split into several. This tool runs the exact OpenAI tiktoken merge rules, so the split you see is the split the model sees.

What is BPE (byte pair encoding)?

Byte-pair encoding builds a vocabulary by merging frequent byte pairs. Instead of one token per character or one per word, BPE finds a middle ground: about 100,000 sub-word pieces for cl100k_base and ~200,000 for o200k_base. That keeps common text short while still being able to encode any byte sequence, including emoji and Sinhala.
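
To make the merge idea concrete, here is a toy sketch of how a BPE vocabulary could be learned from a tiny string. It is an illustration only: the function name train_toy_bpe and the sample text are made up for this example, and real tokenizers such as tiktoken apply a fixed, pre-trained merge table (plus a pre-splitting step) rather than learning merges from your input.

```python
from collections import Counter


def train_toy_bpe(text: str, num_merges: int):
    """Learn up to num_merges merges, most frequent adjacent pair first."""
    # Start from raw UTF-8 bytes: one single-byte token per byte.
    tokens = [bytes([b]) for b in text.encode("utf-8")]
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (left, right), count = pairs.most_common(1)[0]
        if count < 2:
            break  # nothing repeats, so no merge is worth learning
        merges.append((left, right))
        # Rewrite the sequence, replacing each occurrence of the pair.
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens, merges


text = "low lower lowest"
tokens, merges = train_toy_bpe(text, num_merges=2)
print(tokens)   # the frequent run "low" ends up as a single token
print(merges)
```

After two merges the repeated run "low" is one token while the rarer suffixes stay as single bytes, which is the same effect that makes common words cheap and rare words expensive.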

Why does one word sometimes count as several tokens?

Only frequent words earn their own token. A long or rare word like "internationalization" is split into two pieces on cl100k_base ("international" + "ization"). Misspellings, code identifiers, and non-English words split even further. Try the "Long word" sample to see a single English word become two tokens.

How many tokens is one word on average?

For typical English, one token is roughly 0.75 words, which works out to about 1.3 tokens per word or about 4 characters per token. The tool shows a live tokens-per-word ratio for your exact text, plus a labelled "by the 4-chars rule" estimate, so you can compare your prompt to that baseline.
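
The arithmetic behind those rules of thumb can be sketched as follows. The function names are hypothetical, and rounding up is an assumption; the page does not state how the tool rounds its estimate.

```python
import math


def estimate_tokens_by_chars(text: str, chars_per_token: float = 4.0) -> int:
    # About 4 characters per token for English. Rounding up is an
    # assumption made for this sketch, not the tool's documented behaviour.
    return math.ceil(len(text) / chars_per_token)


def estimate_tokens_by_words(word_count: int, words_per_token: float = 0.75) -> int:
    # One token is roughly 0.75 words, i.e. about 1.3 tokens per word.
    return round(word_count / words_per_token)


sample = "a" * 179  # stand-in for any 179-character input
print(estimate_tokens_by_chars(sample))   # 179 / 4 = 44.75, rounded up
print(estimate_tokens_by_words(750))
```

Both numbers are baselines for English prose only; code and non-English text can land far from them.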

Do GPT-4o and GPT-4 tokenize text differently?

Yes. GPT-4o, GPT-4.1 and the GPT-5 / o-series use o200k_base; GPT-4 and GPT-3.5-turbo use cl100k_base. Token IDs are specific to each encoding, so the same text usually maps to different IDs (a few, such as the IDs for "," and "!", happen to coincide). Counts can differ too: o200k_base is usually more compact, especially for non-English text. Turn on Compare encodings to see both counts and the difference side by side.

Why do Sinhala or Tamil characters cost so many tokens?

Every script is tokenized from its UTF-8 bytes, but scripts that were rare in the tokenizer's training data have few learned merges, so a byte (or short byte run) often becomes its own token. On cl100k_base a single Sinhala character can cost two tokens, so an eight-character word can reach 16 tokens. o200k_base handles these scripts far more efficiently, which is a good reason to pick a newer model for multilingual prompts.
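
A quick way to see why these scripts start at a disadvantage is to count UTF-8 bytes per character. The sketch below uses only the Python standard library; the escaped code points are an assumed spelling of the Sinhala greeting used in the worked examples, written as escapes so the example does not depend on font or editor support.

```python
# Assumed code points for the eight-character Sinhala greeting.
word = "\u0d86\u0dba\u0dd4\u0db6\u0ddd\u0dc0\u0db1\u0dca"

code_points = len(word)                       # characters as Python counts them
utf8_bytes = len(word.encode("utf-8"))        # bytes the tokenizer starts from
bytes_per_char = [len(ch.encode("utf-8")) for ch in word]

ascii_bytes = len("hello".encode("utf-8"))    # ASCII letters are 1 byte each

print(code_points, utf8_bytes, bytes_per_char)
print(ascii_bytes)
```

Each Sinhala character occupies three bytes, so a tokenizer with no useful merges for the script has three times as much raw material to cover per character as it does for ASCII text.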

Is my text sent anywhere?

No. The tokenizer runs entirely in your browser using gpt-tokenizer, a JavaScript port of OpenAI's tiktoken. Your text is never uploaded, logged, or sent to any server or API, and the tool needs no API key. You can disconnect from the internet after the page loads and it still works.

Can it tokenize Claude or Gemini exactly?

No, and it does not pretend to. Anthropic (Claude) and Google (Gemini) do not publish an exact browser tokenizer, so for those this tool only offers a clearly-labelled character-based estimate. The per-token chips and IDs are exact for OpenAI models only.

What is a token ID?

Each token is an index into the model's vocabulary: a single integer. The model never sees letters; it sees this sequence of IDs. Turn on Token IDs to show the integer on every chip, and use Copy token IDs (JSON) to grab the full array for debugging an API call.
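
As a small illustration of IDs as vocabulary indices, the sketch below hard-codes the four pieces and cl100k_base IDs from the "Hello, world!" worked example on this page. The lookup table and the decode helper are hypothetical stand-ins for a real vocabulary, which holds on the order of 100,000 entries.

```python
import json

# Pieces and IDs copied from the cl100k_base "Hello, world!" worked example.
pieces = ["Hello", ",", " world", "!"]
ids = [9906, 11, 1917, 0]

# A four-entry stand-in for the vocabulary: ID -> text piece.
id_to_piece = dict(zip(ids, pieces))


def decode(token_ids):
    """Join the pieces the IDs point to (toy lookup, not a real tokenizer)."""
    return "".join(id_to_piece[i] for i in token_ids)


payload = json.dumps(ids)  # the kind of array "Copy token IDs (JSON)" produces

print(decode(ids))
print(payload)
```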

AI Tokenizer Visualizer — See How Text Becomes Tokens

Paste any text and watch it split into the exact tokens an LLM reads — each token a colour chip with its ID and decoded bytes. Runs OpenAI's real BPE tokenizer (o200k_base and cl100k_base) entirely in your browser. No signup, no API key, no upload.

By Induwara Ashinsana, Executive Director, Ryzera Technologies. Updated Jun 9, 2026

Token splits and IDs come from gpt-tokenizer, a browser port of OpenAI's tiktoken. Counts are exact for OpenAI models only — see the methodology below.

How it works

A language model never sees letters — it sees tokens, the sub-word pieces its tokenizer produces. This tool runs the same byte-pair-encoding (BPE) algorithm and vocabularies that OpenAI ships in its open-source tiktoken library, via the browser port gpt-tokenizer. Because it uses the published vocabularies, the split shown is byte-for-byte the split the model receives.

Your text and the selected encoding are passed to encode(), which applies BPE: starting from UTF-8 bytes, it repeatedly merges the highest-priority adjacent pair until no merge in the vocabulary applies.
encode() returns an ordered array of integer token IDs. The token count is simply ids.length — not characters and not words.
For each ID, decode([id]) recovers the text that token represents, which is what each chip shows. When a multi-byte character (emoji, Sinhala, Tamil) is split across several byte-level tokens, the individual tokens carry no complete glyph of their own; the tool fades and labels those chips rather than hiding them.
The tokens-per-word ratio is tokenCount / max(1, wordCount) where word count is the number of whitespace-separated runs. It is guidance, not a billing figure.
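
The ratio in the last step can be sketched in a few lines. The function name is hypothetical, and the token count is passed in as a plain number because this sketch does not run a tokenizer.

```python
def tokens_per_word(token_count: int, text: str) -> float:
    # Words are whitespace-separated runs, as described above.
    word_count = len(text.split())
    # max(1, ...) avoids dividing by zero for empty or all-whitespace input.
    return token_count / max(1, word_count)


print(tokens_per_word(4, "Hello, world!"))   # 4 tokens over 2 words
print(tokens_per_word(0, ""))                # empty input stays at 0.0
```

Because the divisor never drops below 1, a single unbroken run of text (for example one long Sinhala word) reports its full token count as the ratio.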

Encoding → model map (from tiktoken's model.py): o200k_base → GPT-4o, GPT-4.1, GPT-5 family, o1, o3, o4-mini; cl100k_base → GPT-4, GPT-3.5-turbo, text-embedding-3-small, text-embedding-3-large, text-embedding-ada-002. The ≈4-characters-per-token figure used in the estimate is OpenAI's published English rule of thumb, not an exact count.
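
The map above can be expressed as a simple lookup table. This is a sketch: the dictionary and the encoding_name_for helper are hypothetical names, the keys are lower-cased forms of the models listed above, and "gpt-5" stands in for the whole GPT-5 family. It matches exact names only and does not handle dated or suffixed model variants.

```python
from typing import Optional

ENCODING_BY_MODEL = {
    # o200k_base models
    "gpt-4o": "o200k_base",
    "gpt-4.1": "o200k_base",
    "gpt-5": "o200k_base",
    "o1": "o200k_base",
    "o3": "o200k_base",
    "o4-mini": "o200k_base",
    # cl100k_base models
    "gpt-4": "cl100k_base",
    "gpt-3.5-turbo": "cl100k_base",
    "text-embedding-3-small": "cl100k_base",
    "text-embedding-3-large": "cl100k_base",
    "text-embedding-ada-002": "cl100k_base",
}


def encoding_name_for(model: str) -> Optional[str]:
    """Return the encoding name for a listed model, or None if unknown."""
    return ENCODING_BY_MODEL.get(model.strip().lower())


print(encoding_name_for("GPT-4o"))
print(encoding_name_for("gpt-4"))
print(encoding_name_for("claude"))  # not an OpenAI model, so no exact encoding
```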

Everything runs client-side. The tokenizer is lazy-loaded after the page paints so it never blocks the first render, and inputs are capped at 100,000 characters to keep the tab responsive. Each worked example below was verified by re-encoding its pinned token-ID array with gpt-tokenizer, so the IDs shown match the library's output exactly.

Worked examples

“Hello, world!”

cl100k_base

Splits into ["Hello", ",", " world", "!"]. The space attaches to " world"; punctuation is its own token.

Tokens: 4 · Characters: 13

IDs: [9906, 11, 1917, 0]

“Hello, world!”

o200k_base

Same four-token split. The IDs for "Hello" and " world" differ because IDs are encoding-specific, while "," and "!" happen to keep the same IDs (11 and 0); the counts here happen to match.

Tokens: 4 · Characters: 13

IDs: [13225, 11, 2375, 0]
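
Comparing the two "Hello, world!" arrays position by position shows which pieces share an ID across encodings. The sketch below only uses the ID arrays printed on this page; the variable names are illustrative.

```python
pieces = ["Hello", ",", " world", "!"]
cl100k_ids = [9906, 11, 1917, 0]
o200k_ids = [13225, 11, 2375, 0]

shared = [p for p, a, b in zip(pieces, cl100k_ids, o200k_ids) if a == b]
differing = [p for p, a, b in zip(pieces, cl100k_ids, o200k_ids) if a != b]

print("same ID in both encodings:", shared)
print("different IDs:", differing)
```

A matching ID in two encodings is a coincidence of how each vocabulary was laid out, so an ID array is only meaningful together with the name of the encoding that produced it.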

“internationalization”

cl100k_base

One word → ["international", "ization"] = 2 tokens. Answers "why is my word more than one token?"

Tokens: 2 · Characters: 20

IDs: [98697, 2065]

“ආයුබෝවන්”

cl100k_base

8 characters → 16 tokens. Each Sinhala character is three UTF-8 bytes, and cl100k_base splits every one of them into two byte-level tokens.

Tokens: 16 · Characters: 8

IDs: [55742, 228, 55742, 118, 49849, 242, 55742, 114, 49849, 251, 49849, 222, 55742, 109, 49849, 232]

“ආයුබෝවන්”

o200k_base

The same 8 characters cost only 8 tokens on o200k_base — its larger vocabulary handles Sinhala better.

Tokens: 8 · Characters: 8

IDs: [1456, 228, 7664, 8809, 28256, 38739, 8600, 11804]

The Sinhala examples are the edge case worth dwelling on: the same eight-character word “ආයුබෝවන්” costs 16 tokens on cl100k_base but only 8 on o200k_base. If your prompts mix English and Sinhala, the encoding you target changes the bill.

Sources & references

Token splits and IDs are deterministic from the tiktoken vocabularies. The worked-example arrays on this page were last regenerated and reconciled against gpt-tokenizer on 2026-06-09.
