AI Tokenizer Visualizer — See How Text Becomes Tokens
Paste any text and watch it split into the exact tokens an LLM reads — each token a colour chip with its ID and decoded bytes. Runs OpenAI's real BPE tokenizer (o200k_base and cl100k_base) entirely in your browser. No signup, no API key, no upload.
How it works
A language model never sees letters — it sees tokens, the sub-word pieces its tokenizer produces. This tool runs the same byte-pair-encoding (BPE) algorithm and vocabularies that OpenAI ships in its open-source tiktoken library, via the browser port gpt-tokenizer. Because it uses the published vocabularies, the split shown is byte-for-byte the split the model receives.
- Your text and the selected encoding are passed to encode(), which applies BPE: starting from UTF-8 bytes, it repeatedly merges the highest-priority adjacent pair until no merge in the vocabulary applies. encode() returns an ordered array of integer token IDs. The token count is simply ids.length, not the number of characters or words.
- For each ID, decode([id]) recovers the exact substring that token represents, and that is what each chip shows. Multi-byte characters (emoji, Sinhala, Tamil) are split across several byte-tokens; the continuation bytes carry no glyph of their own, so the tool fades and labels them rather than hiding them.
- The tokens-per-word ratio is tokenCount / max(1, wordCount), where wordCount is the number of whitespace-separated runs. It is guidance, not a billing figure.
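The merge loop and the ratio described above can be sketched in a few lines. This is a toy illustration, not the tool's code: the merge table TOY_RANKS is invented for the example and is not part of cl100k_base or o200k_base, the helpers toyBpe and tokensPerWord are hypothetical names, and the loop works on the characters of an ASCII string for readability where the real encoder works on UTF-8 bytes.

```javascript
// Build an unambiguous lookup key for an adjacent pair of symbols.
const pairKey = (a, b) => JSON.stringify([a, b]);

// Made-up merge table for illustration only. Lower rank = higher priority.
const TOY_RANKS = new Map([
  [pairKey("l", "o"), 0],
  [pairKey("lo", "w"), 1],
  [pairKey("e", "r"), 2],
]);

// Repeatedly merge the best-ranked adjacent pair until no known pair remains.
function toyBpe(text, ranks = TOY_RANKS) {
  const parts = Array.from(text);
  while (parts.length > 1) {
    let bestIndex = -1;
    let bestRank = Infinity;
    for (let i = 0; i < parts.length - 1; i++) {
      const rank = ranks.get(pairKey(parts[i], parts[i + 1]));
      if (rank !== undefined && rank < bestRank) {
        bestRank = rank;
        bestIndex = i;
      }
    }
    if (bestIndex === -1) break; // no merge in the table applies
    parts.splice(bestIndex, 2, parts[bestIndex] + parts[bestIndex + 1]);
  }
  return parts;
}

// tokenCount / max(1, wordCount), with words as whitespace-separated runs.
function tokensPerWord(tokenCount, text) {
  const wordCount = text.split(/\s+/).filter(Boolean).length;
  return tokenCount / Math.max(1, wordCount);
}

const pieces = toyBpe("lower");
console.log(pieces, tokensPerWord(pieces.length, "lower"));
```

With this toy table, "lower" merges l+o, then lo+w, then e+r, leaving two pieces. The max(1, wordCount) guard keeps the ratio defined for empty input.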
Encoding → model map (from tiktoken's model.py): o200k_base → GPT-4o, GPT-4.1, GPT-5 family, o1, o3, o4-mini; cl100k_base → GPT-4, GPT-3.5-turbo, text-embedding-3-small, text-embedding-3-large, text-embedding-ada-002. The ≈4-characters-per-token figure used in the estimate is OpenAI's published English rule of thumb, not an exact count.
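As a rough sketch, the map and the rule of thumb above could be expressed as a lookup table plus a one-line estimate. The lowercase model identifiers, the helper names encodingForModel and estimateTokens, the "exact name or name followed by a hyphen" matching rule, and rounding up are all assumptions made for this example; they are not taken from tiktoken or gpt-tokenizer.

```javascript
// Encoding -> model families, transcribed from the map in the text above.
// The lowercase spellings are an assumption of this sketch.
const MODELS_BY_ENCODING = {
  o200k_base: ["gpt-4o", "gpt-4.1", "gpt-5", "o1", "o3", "o4-mini"],
  cl100k_base: [
    "gpt-4",
    "gpt-3.5-turbo",
    "text-embedding-3-small",
    "text-embedding-3-large",
    "text-embedding-ada-002",
  ],
};

// Hypothetical helper: match the family name exactly, or as a prefix
// followed by "-", so "gpt-4o-mini" never falls through to "gpt-4".
function encodingForModel(model) {
  for (const [encoding, families] of Object.entries(MODELS_BY_ENCODING)) {
    for (const family of families) {
      if (model === family || model.startsWith(family + "-")) return encoding;
    }
  }
  return null; // unknown model: let the caller decide
}

// Rule-of-thumb estimate for English text: about 4 characters per token.
// text.length counts UTF-16 code units; rounding up is a choice of this sketch.
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

console.log(encodingForModel("gpt-4o-mini"), estimateTokens("Hello, world!"));
```

The estimate is only a sanity check for English prose; the exact count still comes from running the tokenizer.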
Everything runs client-side. The tokenizer is lazy-loaded after the page paints so it never blocks the first render, and inputs are capped at 100,000 characters to keep the tab responsive. Each worked example below was verified by re-encoding its text with gpt-tokenizer and comparing the result against the pinned token-ID array, so the IDs shown match the library's output exactly.
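One way to get the two behaviours described above, a load-once tokenizer and a hard input cap, is sketched below. How the page actually implements them is not stated, so createLazyLoader, capInput, and MAX_INPUT_CHARS are hypothetical names, and the loader function is passed in rather than hard-coded.

```javascript
// Cap taken from the text above: 100,000 characters.
const MAX_INPUT_CHARS = 100000;

// Truncate oversized input. slice() counts UTF-16 code units, so a cut can
// land inside a surrogate pair; a production version might trim that case.
function capInput(text, limit = MAX_INPUT_CHARS) {
  return text.length > limit ? text.slice(0, limit) : text;
}

// Memoise an expensive load so it runs at most once, on first use.
// `load` may return a value or a promise; errors surface as a rejection.
function createLazyLoader(load) {
  let pending = null;
  return function get() {
    if (pending === null) {
      pending = new Promise((resolve) => resolve(load()));
    }
    return pending;
  };
}

// Usage sketch: in a browser, `load` could be a dynamic import of the
// tokenizer bundle, triggered after the first paint (an assumption here).
let loadCalls = 0;
const getTokenizer = createLazyLoader(() => {
  loadCalls += 1;
  return { name: "stand-in tokenizer" };
});
getTokenizer();
getTokenizer(); // second call reuses the same promise
console.log(loadCalls, capInput("x".repeat(100001)).length);
```

Returning the same promise on every call means concurrent callers share one load instead of starting several.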
Worked examples
The Sinhala examples are the edge case worth dwelling on: the same eight-character word “ආයුබෝවන්” costs 16 tokens on cl100k_base but only 8 on o200k_base. If your prompts mix English and Sinhala, the encoding you target changes the bill.
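A quick byte count helps explain the gap. The sketch below assumes the word consists of exactly the eight code points written as escapes in the code; the token counts 16 and 8 are the figures quoted above and are not computed here, and utf8Profile is a hypothetical helper name. It uses only the standard TextEncoder and TextDecoder APIs.

```javascript
// The word spelled as escapes so the code points are unambiguous.
// Assumption: the word on this page uses exactly these eight code points.
const AYUBOWAN = "\u0D86\u0DBA\u0DD4\u0DB6\u0DDD\u0DC0\u0DB1\u0DCA";

// Count code points and UTF-8 bytes, overall and per code point.
function utf8Profile(text) {
  const encoder = new TextEncoder();
  const codePoints = Array.from(text);
  return {
    codePoints: codePoints.length,
    utf8Bytes: encoder.encode(text).length,
    bytesPerCodePoint: codePoints.map((cp) => encoder.encode(cp).length),
  };
}

const profile = utf8Profile(AYUBOWAN);

// Average bytes covered by one token, using the counts quoted in the text.
const bytesPerToken = {
  cl100k_base: profile.utf8Bytes / 16,
  o200k_base: profile.utf8Bytes / 8,
};

// A lone lead byte is not valid UTF-8 by itself, so decoding it yields the
// replacement character U+FFFD instead of a glyph.
const firstByteOnly = new TextEncoder().encode(AYUBOWAN).slice(0, 1);
const loneByteText = new TextDecoder().decode(firstByteOnly);

console.log(profile, bytesPerToken, loneByteText === "\uFFFD");
```

Under that assumption every code point takes 3 bytes, 24 in total, so the quoted counts work out to about 1.5 bytes per token on cl100k_base and 3 bytes per token on o200k_base, which is consistent with o200k_base covering whole Sinhala code points where cl100k_base falls back to partial-byte tokens.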
Sources & references
- OpenAI tiktoken — official BPE tokenizer, cl100k_base / o200k_base vocab + model.py mapping
- gpt-tokenizer — browser-ready JavaScript port of tiktoken (ISC licence)
- OpenAI Help Center — What are tokens and how to count them
Token splits and IDs are deterministic from the tiktoken vocabularies. The worked-example arrays on this page were last regenerated and reconciled against gpt-tokenizer on 2026-06-09.
Found a token split that looks wrong, or want another encoding added?
Email me at [email protected] — most fixes ship within 24 hours.