AI Text Chunker — Split Text for RAG & Embeddings
Split long text into token-sized, optionally overlapping chunks for retrieval-augmented generation and embedding pipelines. Runs OpenAI's real tiktoken counter in your browser, shows each chunk's exact token count, and exports JSON or JSONL. No upload, no signup.
How it works
Before you can embed a document for semantic search, you have to split it into chunks small enough to fit the embedding model's input window and focused enough to match a single question. This tool does that entirely in your browser, with four strategies and exact token counts.
- Tokenize. The text is encoded with OpenAI's tiktoken byte-pair encoding: cl100k_base (GPT-3.5/4 and the text-embedding-3 models) or o200k_base (GPT-4o). Counts are exact, not the common characters ÷ 4 estimate. You can also measure in characters or words.
- Fixed-window strategy. A window of chunk_size units slides with a stride of stride = chunk_size − overlap. Chunk k spans units [k·stride, k·stride + chunk_size), clamped to the end. The number of chunks for N > chunk_size is ceil((N − overlap) / stride), and duplicated units total (chunks − 1) × overlap. The source module verifies the sliding-window count against this closed-form formula so an off-by-one at a boundary cannot slip through.
- Recursive strategy (default). This follows LangChain's RecursiveCharacterTextSplitter. The text is split on the largest separator in the hierarchy ["\n\n", "\n", " ", ""]; any piece still over the size is split again on the next-smaller separator, down to single characters. Adjacent pieces are then merged greedily until adding the next would exceed the size, and the trailing overlap units of each chunk are carried into the front of the next.
- Sentence and paragraph strategies. The text is segmented on sentence boundaries (the browser's Intl.Segmenter) or on blank-line paragraph breaks, then those segments are packed into chunks without ever splitting a sentence or paragraph in half. A single segment larger than the chunk size becomes its own chunk and is flagged.
- Flagging. Any chunk over 8,191 tokens is marked in red, because that is the per-input limit for OpenAI's text-embedding-3 models. The summary also reports how many tokens overlap duplicates and what percentage extra you will pay to embed them.
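The fixed-window arithmetic above can be sketched in a few lines. This is a minimal illustration, not the tool's actual source: the function names fixedWindowChunks and expectedChunkCount are hypothetical, and the units are plain array elements standing in for token ids, characters, or words.

```typescript
// Fixed-window chunking over any array of units (token ids, characters, or words).
// All names here are illustrative assumptions, not taken from the tool's source.
function fixedWindowChunks<T>(units: T[], chunkSize: number, overlap: number): T[][] {
  if (!Number.isInteger(chunkSize) || chunkSize <= 0) {
    throw new RangeError("chunkSize must be a positive integer");
  }
  if (!Number.isInteger(overlap) || overlap < 0 || overlap >= chunkSize) {
    throw new RangeError("overlap must be an integer in [0, chunkSize)");
  }
  const stride = chunkSize - overlap;
  const chunks: T[][] = [];
  for (let start = 0; start < units.length; start += stride) {
    // slice() clamps the end index to units.length, which gives the "clamped to the end" behavior
    chunks.push(units.slice(start, start + chunkSize));
    if (start + chunkSize >= units.length) break; // this window already reached the end
  }
  return chunks;
}

// Closed-form chunk count: ceil((N - overlap) / stride) for N > chunkSize.
// The N = 0 and N <= chunkSize cases are handled separately.
function expectedChunkCount(n: number, chunkSize: number, overlap: number): number {
  if (n === 0) return 0;
  if (n <= chunkSize) return 1;
  return Math.ceil((n - overlap) / (chunkSize - overlap));
}

const demoUnits = Array.from({ length: 10 }, (_, i) => i);
const demoChunks = fixedWindowChunks(demoUnits, 4, 1);
// stride = 3, so windows start at 0, 3, and 6: [0..3], [3..6], [6..9]
console.log(demoChunks.length, expectedChunkCount(demoUnits.length, 4, 1)); // prints: 3 3
```

With 10 units, a chunk size of 4, and an overlap of 1, the three chunks hold 12 units in total: the 10 originals plus (3 − 1) × 1 = 2 duplicates. Comparing the loop's count against the closed-form count over many inputs is one way to catch boundary off-by-one errors.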
Everything is integer arithmetic over token, character, or word indices, so the same input always produces exactly the same chunks. Up to 500,000 characters can be processed at once, all on your device.
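The segment packing used by the sentence and paragraph strategies can be sketched the same way. This is a rough sketch under stated assumptions: packSegments, PackedChunk, and countWords are hypothetical names, segments are assumed to be already split (for example by Intl.Segmenter or on blank lines), and size is measured in whitespace-delimited words to keep the example dependency-free. A token counter could be passed in instead, though token counts of joined text may not equal the sum of per-segment counts.

```typescript
interface PackedChunk {
  text: string;
  size: number;      // size in the chosen unit (words here)
  oversize: boolean; // true when a single segment alone exceeds chunkSize
}

// Hypothetical measure: whitespace-delimited words.
const countWords = (s: string): number => {
  const t = s.trim();
  return t === "" ? 0 : t.split(/\s+/).length;
};

// Greedily pack whole segments into chunks; a segment is never split.
function packSegments(
  segments: string[],
  chunkSize: number,
  measure: (s: string) => number = countWords,
): PackedChunk[] {
  const out: PackedChunk[] = [];
  let current: string[] = [];
  let currentSize = 0;

  const flush = (): void => {
    if (current.length === 0) return;
    out.push({ text: current.join(" "), size: currentSize, oversize: currentSize > chunkSize });
    current = [];
    currentSize = 0;
  };

  for (const seg of segments) {
    const segSize = measure(seg);
    if (segSize === 0) continue; // skip empty segments
    // Close the current chunk if this segment would push it over the limit.
    if (currentSize + segSize > chunkSize) flush();
    current.push(seg);
    currentSize += segSize;
  }
  flush();
  return out;
}

const demoPacked = packSegments(
  ["One two three.", "Four five.", "Six seven eight nine ten eleven.", "Twelve."],
  5,
);
for (const c of demoPacked) console.log(c.size, c.oversize, c.text);
```

In this example the first two sentences fit together in one 5-word chunk, the 6-word sentence becomes its own chunk with oversize set, and the last sentence forms a final chunk. Because the loop only adds integers and compares them, the same input yields the same chunks on every run.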
Sources & references
- OpenAI tiktoken — cl100k_base & o200k_base byte-pair encodings
- OpenAI Tokenizer — interactive reference
- LangChain — Text splitters (RecursiveCharacterTextSplitter)
- OpenAI — Embeddings guide (8,191-token input limit)
The encodings, splitting algorithm, and embedding limit were last cross-checked against the sources above on 2026-06-06. The fixed-window chunk-count math is reconciled against a closed-form formula in lib/data/ai-text-chunker.ts.