induwara.lk

AI Tokenizer Visualizer — See How Text Becomes Tokens

Paste any text and watch it split into the exact tokens an LLM reads — each token a colour chip with its ID and decoded bytes. Runs OpenAI's real BPE tokenizer (o200k_base and cl100k_base) entirely in your browser. No signup, no API key, no upload.

By Induwara Ashinsana · Updated Jun 9, 2026

o200k_base is the newer, ~200k-token vocabulary. It is more compact for non-English text.


Token splits and IDs come from gpt-tokenizer, a browser port of OpenAI's tiktoken. Counts are exact for OpenAI models only — see the methodology below.

How it works

A language model never sees letters; it sees tokens, the sub-word pieces its tokenizer produces. This tool runs the same byte-pair-encoding (BPE) algorithm and vocabularies that OpenAI ships in its open-source tiktoken library, via the browser port gpt-tokenizer. Because it uses the published vocabularies, the split shown is byte-for-byte the split an OpenAI model receives.

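  1. Your text and the selected encoding are passed to encode(), which first splits the text into chunks with the encoding's regular-expression pattern and then applies BPE to each chunk: starting from the chunk's UTF-8 bytes, it repeatedly merges the highest-priority adjacent pair until no merge in the vocabulary applies.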
  2. encode() returns an ordered array of integer token IDs. The token count is simply ids.length — not characters and not words.
  3. For each ID, decode([id]) recovers the text that token represents, and that is what each chip shows. Characters that take several UTF-8 bytes (emoji, Sinhala, Tamil) can be split across several byte-level tokens; a token that holds only part of a character has no glyph of its own, so the tool fades and labels it rather than hiding it.
  4. The tokens-per-word ratio is tokenCount / max(1, wordCount) where word count is the number of whitespace-separated runs. It is guidance, not a billing figure.
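The merge loop in step 1 can be sketched on a toy scale. This is a minimal illustration, not the tiktoken or gpt-tokenizer implementation: the merge table MERGES and the helpers rankOf and toyBpe are hypothetical, the table holds four made-up merges rather than a real vocabulary, and the loop starts from single characters instead of UTF-8 bytes.

```typescript
// Hypothetical merge table: earlier entries have higher priority.
// These four merges are invented for illustration only.
const MERGES: [string, string][] = [
  ["l", "l"],
  ["h", "e"],
  ["he", "ll"],
  ["hell", "o"],
];

// Rank of a pair in the table, or -1 when the pair cannot be merged.
function rankOf(left: string, right: string): number {
  for (let i = 0; i < MERGES.length; i++) {
    if (MERGES[i][0] === left && MERGES[i][1] === right) return i;
  }
  return -1;
}

function toyBpe(text: string): string[] {
  // Simplification: start from UTF-16 code units, not UTF-8 bytes.
  let parts: string[] = text.split("");
  while (true) {
    // Find the adjacent pair with the highest priority (lowest rank).
    let bestIndex = -1;
    let bestRank = -1;
    for (let i = 0; i + 1 < parts.length; i++) {
      const rank = rankOf(parts[i], parts[i + 1]);
      if (rank !== -1 && (bestRank === -1 || rank < bestRank)) {
        bestRank = rank;
        bestIndex = i;
      }
    }
    if (bestIndex === -1) break; // no merge in the table applies
    parts = parts
      .slice(0, bestIndex)
      .concat([parts[bestIndex] + parts[bestIndex + 1]])
      .concat(parts.slice(bestIndex + 2));
  }
  return parts;
}

const merged = toyBpe("hellos"); // ["hello", "s"]
const tokenCount = merged.length; // 2, the analogue of ids.length in step 2
const unmerged = toyBpe("yell"); // ["y", "e", "ll"]
```

With this table, "hellos" collapses to two pieces because every merge on the way to "hello" is listed, while "yell" stops at three pieces because no listed pair involves "y".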

Encoding → model map (from tiktoken's model.py):

  o200k_base: GPT-4o, GPT-4.1, GPT-5 family, o1, o3, o4-mini
  cl100k_base: GPT-4, GPT-3.5-turbo, text-embedding-3-small, text-embedding-3-large, text-embedding-ada-002

The ≈4-characters-per-token figure used in the estimate is OpenAI's published English rule of thumb, not an exact count.
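A small sketch of how the encoding map, the tokens-per-word ratio, and the 4-characters estimate might look in code. The names encodingFor, countWords, tokensPerWord, and fourCharEstimate are hypothetical. The lowercase model keys, the rounding to the nearest integer, and counting characters with text.length (UTF-16 code units) are assumptions; the page does not say how the tool does any of these.

```typescript
type Encoding = "o200k_base" | "cl100k_base";

// Transcribed from the map above. Lowercase API-style keys are an assumption.
const MODEL_TO_ENCODING: Record<string, Encoding> = {
  "gpt-4o": "o200k_base",
  "gpt-4.1": "o200k_base",
  "gpt-5": "o200k_base",
  "o1": "o200k_base",
  "o3": "o200k_base",
  "o4-mini": "o200k_base",
  "gpt-4": "cl100k_base",
  "gpt-3.5-turbo": "cl100k_base",
  "text-embedding-3-small": "cl100k_base",
  "text-embedding-3-large": "cl100k_base",
  "text-embedding-ada-002": "cl100k_base",
};

// Exact-name lookup only; unknown models return undefined.
function encodingFor(model: string): Encoding | undefined {
  return Object.prototype.hasOwnProperty.call(MODEL_TO_ENCODING, model)
    ? MODEL_TO_ENCODING[model]
    : undefined;
}

// Word count as the number of whitespace-separated runs.
function countWords(text: string): number {
  return text.split(/\s+/).filter((run) => run.length > 0).length;
}

// tokenCount / max(1, wordCount): guidance, not a billing figure.
function tokensPerWord(tokenCount: number, text: string): number {
  return tokenCount / Math.max(1, countWords(text));
}

// English rule of thumb: about 4 characters per token.
// Rounding to the nearest integer is an assumption.
function fourCharEstimate(text: string): number {
  return Math.round(text.length / 4);
}

const sample = "Hello, world!";
const ratio = tokensPerWord(4, sample); // 4 tokens over 2 words = 2
const estimate = fourCharEstimate(sample); // 13 / 4 = 3.25, rounds to 3
```

The estimate gives 3 for a string that actually encodes to 4 tokens, which is the kind of gap the rule of thumb leaves.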

Everything runs client-side. The tokenizer is lazy-loaded after the page paints so it never blocks the first render, and inputs are capped at 100,000 characters to keep the tab responsive. Each worked example below was verified by re-encoding its pinned token-ID array with gpt-tokenizer, so the IDs shown match the library's output exactly.
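One way to get the load-once behaviour and the input cap is sketched below. This is an illustration, not the page's actual code: once, capInput, and loadTokenizer are hypothetical names, truncating over-long input (rather than rejecting it) is an assumption, and a stand-in loader replaces the real dynamic import so the sketch stays self-contained.

```typescript
const MAX_CHARS = 100000; // the cap stated above

// Assumption: over-long input is truncated. Slicing by UTF-16 code units
// can cut an astral character in half, which a real tool would need to handle.
function capInput(text: string): string {
  return text.length > MAX_CHARS ? text.slice(0, MAX_CHARS) : text;
}

// Run the loader on first use only and reuse its result afterwards.
function once<T>(loader: () => T): () => T {
  let loaded = false;
  let value: T | undefined;
  return () => {
    if (!loaded) {
      value = loader();
      loaded = true;
    }
    return value as T;
  };
}

// In a browser the loader would typically be a dynamic import, for example
//   once(() => import("gpt-tokenizer"))
// called after first paint. A stand-in object is used here instead.
let loadCalls = 0;
const loadTokenizer = once(() => {
  loadCalls++;
  return { name: "stand-in tokenizer" };
});

const first = loadTokenizer();
const second = loadTokenizer(); // same object, loader not run again
```

Because once stores whatever the loader returns, passing a loader that returns a promise means every caller shares the same in-flight promise.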

Worked examples

Hello, world!

cl100k_base

Splits into ["Hello", ",", " world", "!"]. The space attaches to " world"; punctuation is its own token.

Tokens: 4 · Characters: 13
IDs: [9906, 11, 1917, 0]

Hello, world!

o200k_base

Same four-token split, and the counts happen to match, but IDs are encoding-specific: "Hello" and " world" get different IDs here, while "," and "!" keep IDs 11 and 0 in both encodings.

Tokens: 4 · Characters: 13
IDs: [13225, 11, 2375, 0]
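The two "Hello, world!" examples can be compared with a few lines of code. The sketch below uses only the IDs and pieces pinned on this page; it does not call a tokenizer, and the names CL100K_PINNED, O200K_PINNED, joinPieces, and sharedIds are hypothetical.

```typescript
// [token ID, decoded piece] pairs copied from the two examples above.
const CL100K_PINNED: [number, string][] = [
  [9906, "Hello"],
  [11, ","],
  [1917, " world"],
  [0, "!"],
];
const O200K_PINNED: [number, string][] = [
  [13225, "Hello"],
  [11, ","],
  [2375, " world"],
  [0, "!"],
];

// Concatenating the per-token pieces should give back the input text.
function joinPieces(pinned: [number, string][]): string {
  return pinned.map((pair) => pair[1]).join("");
}

// IDs that are identical at the same position in both arrays.
function sharedIds(a: [number, string][], b: [number, string][]): number[] {
  const shared: number[] = [];
  for (let i = 0; i < Math.min(a.length, b.length); i++) {
    if (a[i][0] === b[i][0]) shared.push(a[i][0]);
  }
  return shared;
}

const roundTrip = joinPieces(CL100K_PINNED); // "Hello, world!"
const common = sharedIds(CL100K_PINNED, O200K_PINNED); // [11, 0]
```

Only the punctuation IDs coincide, so an ID array is meaningful only together with the encoding that produced it.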

internationalization

cl100k_base

One word → ["international", "ization"] = 2 tokens. Answers "why is my word more than one token?"

Tokens: 2 · Characters: 20
IDs: [98697, 2065]

ආයුබෝවන්

cl100k_base

8 characters → 16 tokens. Each Sinhala character is 3 bytes in UTF-8, and cl100k_base has no single token for any of these characters, so every character here falls back to two byte-level tokens.

Tokens: 16 · Characters: 8
IDs: [55742, 228, 55742, 118, 49849, 242, 55742, 114, 49849, 251, 49849, 222, 55742, 109, 49849, 232]

ආයුබෝවන්

o200k_base

The same 8 characters cost only 8 tokens on o200k_base — its larger vocabulary handles Sinhala better.

Tokens: 8 · Characters: 8
IDs: [1456, 228, 7664, 8809, 28256, 38739, 8600, 11804]

The Sinhala examples are the edge case worth dwelling on: the same eight-character word “ආයුබෝවන්” costs 16 tokens on cl100k_base but only 8 on o200k_base. If your prompts mix English and Sinhala, the encoding you target changes the bill.
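The byte arithmetic behind this comparison can be checked directly. The sketch below hand-rolls UTF-8 encoding so that it has no dependencies (in a browser, TextEncoder does the same job); utf8Bytes is a hypothetical helper that does not validate lone surrogates, and the token counts 16 and 8 are taken from the examples above rather than computed.

```typescript
// Encode a string to UTF-8 bytes, one code point at a time.
function utf8Bytes(text: string): number[] {
  const bytes: number[] = [];
  for (const ch of text) {
    const cp = ch.codePointAt(0) as number;
    if (cp < 0x80) {
      bytes.push(cp);
    } else if (cp < 0x800) {
      bytes.push(0xc0 | (cp >> 6), 0x80 | (cp & 0x3f));
    } else if (cp < 0x10000) {
      bytes.push(0xe0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f));
    } else {
      bytes.push(
        0xf0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3f),
        0x80 | ((cp >> 6) & 0x3f),
        0x80 | (cp & 0x3f)
      );
    }
  }
  return bytes;
}

// "ආයුබෝවන්" written as escapes so the code points are unambiguous.
const word = "\u0D86\u0DBA\u0DD4\u0DB6\u0DDD\u0DC0\u0DB1\u0DCA";
const charCount = Array.from(word).length; // 8 code points
const byteCount = utf8Bytes(word).length; // 24 bytes, 3 per character

// Token counts from the worked examples above.
const bytesPerTokenCl100k = byteCount / 16; // 1.5
const bytesPerTokenO200k = byteCount / 8; // 3
const costRatio = 16 / 8; // cl100k_base uses 2x the tokens for this word
```

The same 24 bytes are carried by 16 tokens in one encoding and 8 in the other, which is where the factor of two in cost comes from for this word.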


Sources & references

Token splits and IDs are deterministic from the tiktoken vocabularies. The worked-example arrays on this page were last regenerated and reconciled against gpt-tokenizer on 2026-06-09.
