What is the best chunk size for RAG?

There is no single best size, but 256–512 tokens with 10–20% overlap is a strong default for prose. Dense reference text tolerates larger chunks (768–1024); chatty transcripts and FAQs retrieve better when smaller (128–256). Test retrieval on real questions and adjust — smaller chunks sharpen the embedding but raise the vector count you store and search.

How much overlap should I use between chunks?

Roughly 10–20% of the chunk size. For 512-token chunks that is about 50–100 tokens. Overlap carries context across the boundary so a sentence split between two chunks is still recoverable at retrieval time. More overlap improves recall but inflates the tokens you embed and store, shown here as the overlap-tokens and percentage-extra stats.
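The 10–20% rule of thumb is simple arithmetic. A minimal sketch follows; the function name `overlap_range` and its default fractions are illustrative choices, not part of the tool.

```python
from typing import Tuple


def overlap_range(chunk_size: int, low: float = 0.10, high: float = 0.20) -> Tuple[int, int]:
    """Return the (min, max) overlap, in units, for a chunk size.

    `low` and `high` are the 10% and 20% fractions from the rule of thumb.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return round(chunk_size * low), round(chunk_size * high)


# 512-token chunks: roughly 50 to 100 tokens of overlap.
low_overlap, high_overlap = overlap_range(512)
```

For 512-token chunks this returns 51 and 102, which matches the "about 50–100 tokens" figure.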

How do I split text by tokens instead of characters?

Set the measure to Tokens and pick the encoding your model uses — cl100k_base for GPT-3.5/4 and the text-embedding-3 models, or o200k_base for GPT-4o. The tool runs OpenAI's tiktoken byte-pair encoder in your browser, so each chunk's token count is exact, not a characters-divided-by-four estimate.
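Outside the browser, the same idea can be sketched as a sliding window over token ids. The helper below is hypothetical and takes the tokenizer as a pair of callables, so it stays independent of any one library. With the `tiktoken` package installed you could pass `enc.encode` and `enc.decode` from `tiktoken.get_encoding("cl100k_base")`. The whitespace tokenizer used here is only a stand-in.

```python
from typing import Callable, List, Sequence


def chunk_by_tokens(
    text: str,
    encode: Callable[[str], Sequence],
    decode: Callable[[Sequence], str],
    chunk_size: int,
    overlap: int = 0,
) -> List[str]:
    """Split text into windows of `chunk_size` tokens that share `overlap` tokens."""
    if chunk_size < 1 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size >= 1 and 0 <= overlap < chunk_size")
    ids = encode(text)
    stride = chunk_size - overlap
    chunks: List[str] = []
    start = 0
    while start < len(ids):
        chunks.append(decode(ids[start:start + chunk_size]))
        if start + chunk_size >= len(ids):
            break  # this window already reached the end of the text
        start += stride
    return chunks


# Stand-in tokenizer (assumption): one token per whitespace-separated word.
demo = chunk_by_tokens("a b c d e f g", str.split, " ".join, chunk_size=3, overlap=1)
```

With a real byte-pair tokenizer, a window boundary can fall inside a multi-byte character, so decoded chunk edges may need care.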

What is recursive character text splitting?

It is LangChain's RecursiveCharacterTextSplitter approach. The text is split on the largest separator first (paragraph breaks), and any piece still over the chunk size is split again on the next-smaller separator (line breaks, then spaces, then characters). Adjacent pieces are merged greedily up to the size limit. It keeps related text together far better than a blind fixed window.
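The idea can be sketched in a few lines. This is a simplified illustration that measures in characters, keeps separators attached to the preceding piece, and omits overlap. It is not LangChain's actual implementation, and the function names are hypothetical.

```python
from typing import List, Sequence

SEPARATORS = ("\n\n", "\n", " ", "")  # largest to smallest


def split_keep(text: str, sep: str) -> List[str]:
    """Split on `sep`, leaving the separator at the end of each piece."""
    if sep == "":
        return list(text)  # last resort: single characters
    parts = text.split(sep)
    pieces = [p + sep for p in parts[:-1]] + [parts[-1]]
    return [p for p in pieces if p]


def recursive_split(text: str, chunk_size: int,
                    separators: Sequence[str] = SEPARATORS) -> List[str]:
    """Split on the largest separator, recurse into oversized pieces, then merge."""
    if len(text) <= chunk_size:
        return [text] if text else []
    sep, rest = separators[0], separators[1:]
    pieces: List[str] = []
    for piece in split_keep(text, sep):
        if len(piece) > chunk_size and rest:
            pieces.extend(recursive_split(piece, chunk_size, rest))
        else:
            pieces.append(piece)
    # Greedy merge: close the current chunk when the next piece would overflow.
    chunks: List[str] = []
    current = ""
    for piece in pieces:
        if current and len(current) + len(piece) > chunk_size:
            chunks.append(current)
            current = piece
        else:
            current += piece
    if current:
        chunks.append(current)
    return chunks


sample = "Alpha beta gamma.\n\nDelta epsilon zeta eta theta.\n\nIota."
parts = recursive_split(sample, chunk_size=20)
```

Because separators are kept, joining the chunks reproduces the input exactly, which makes a convenient sanity check.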

How many tokens should an embedding chunk be?

Stay well under the model's input limit. OpenAI's text-embedding-3-small and -large accept up to 8,191 tokens per input, but embedding a single 8,191-token block is rarely useful — the vector becomes too diffuse to match a specific query. Most pipelines sit between 200 and 800 tokens. Any chunk over the limit is flagged here in red.
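A check like the red flag can be sketched as follows, assuming you already have a token count per chunk. The constant and function names are illustrative.

```python
from typing import List, Sequence

EMBEDDING_LIMIT = 8191  # per-input limit quoted above for text-embedding-3


def flag_oversized(token_counts: Sequence[int], limit: int = EMBEDDING_LIMIT) -> List[int]:
    """Return the indices of chunks whose token count exceeds the limit."""
    return [i for i, count in enumerate(token_counts) if count > limit]


flagged = flag_oversized([512, 8191, 8192, 9000])
```

A chunk of exactly 8,191 tokens is still within the limit, so only the last two entries are flagged.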

Is my text uploaded anywhere?

No. The tokenizer and every splitting strategy run entirely in your browser. Your text is never sent to a server, never logged, and never stored. The only optional network request is the one-time lazy download of the tokenizer vocabulary the first time you press Chunk text.

Does this work for Sinhala or Tamil text?

Yes. The tiktoken encoders handle any Unicode text, so token counts for Sinhala and Tamil are exact. Non-Latin scripts cost more tokens per character because the byte-pair vocabulary is English-weighted, so at the same token-based chunk size the same passage produces more chunks, each covering less text. The character and word measures are script-agnostic.
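One contributing factor is visible without any tokenizer. Byte-pair encoders work on UTF-8 bytes, and Sinhala and Tamil code points take three bytes each, where ASCII letters take one. The sketch below only measures bytes per character as a rough proxy. Actual token counts depend on the vocabulary's learned merges, and the function name is illustrative.

```python
def utf8_bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes per Unicode code point."""
    if not text:
        raise ValueError("text must not be empty")
    return len(text.encode("utf-8")) / len(text)


english = utf8_bytes_per_char("hello")    # ASCII: one byte per character
sinhala = utf8_bytes_per_char("ආයුබෝවන්")   # Sinhala block: three bytes per code point
tamil = utf8_bytes_per_char("வணக்கம்")      # Tamil block: three bytes per code point
```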

Which models do cl100k_base and o200k_base cover?

cl100k_base is the encoding for GPT-3.5 Turbo, GPT-4, and the text-embedding-3-small, text-embedding-3-large, and ada-002 embedding models. o200k_base is the newer encoding for GPT-4o and GPT-4o-mini. Anthropic and Gemini use their own tokenizers that are not openly published, so counts for those models are approximate.
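If you need this mapping in code, a plain lookup table is enough. The model identifier strings below are assumed API-style spellings of the models named above, and `encoding_for` is a hypothetical helper, not a library function.

```python
from typing import Optional

# Assumed API-style model names mapped to the encodings listed above.
ENCODING_BY_MODEL = {
    "gpt-3.5-turbo": "cl100k_base",
    "gpt-4": "cl100k_base",
    "text-embedding-3-small": "cl100k_base",
    "text-embedding-3-large": "cl100k_base",
    "text-embedding-ada-002": "cl100k_base",
    "gpt-4o": "o200k_base",
    "gpt-4o-mini": "o200k_base",
}


def encoding_for(model: str) -> Optional[str]:
    """Return the encoding name for a model, or None when it is not covered."""
    return ENCODING_BY_MODEL.get(model)
```

Returning `None` for unknown models makes the "counts are approximate" case explicit to the caller.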

When were the algorithms and sources last verified?

The tiktoken encodings, the LangChain separator hierarchy, and the 8,191-token embedding limit were last cross-checked against their official documentation on 2026-06-06. The fixed-window chunk-count math is reconciled against a closed-form formula in the source module.


AI Text Chunker — Split Text for RAG & Embeddings

Split long text into token-sized, optionally overlapping chunks for retrieval-augmented generation and embedding pipelines. Runs OpenAI's real tiktoken counter in your browser, shows each chunk's exact token count, and exports JSON or JSONL. No upload, no signup.

By Induwara Ashinsana, Executive Director, Ryzera Technologies. Updated Jun 6, 2026.


Chunk size. Target units per chunk (1–8,000).
Overlap. Units carried into the next chunk (0 to size − 1).
Measure in. Tokens, characters, or words.
Tokenizer model. GPT-3.5 / GPT-4 · text-embedding-3-small/large · ada-002.
Split strategy. Fixed window, recursive, sentence, or paragraph.
Keep separators. Applies to the recursive, sentence, and paragraph strategies.

What this does

Splits long text into token-sized, optionally overlapping chunks for retrieval-augmented generation and embeddings. Pick a size, overlap, and strategy, then press Chunk text to see each chunk's exact token count and export the result as JSON or JSONL, ready to feed an embeddings pipeline.

Token counts use OpenAI's tiktoken BPE encodings; the recursive strategy follows LangChain's separator hierarchy. Chunks over 8,191 tokens are flagged as too large for text-embedding-3. Sources linked below; last verified 2026-06-06.

How it works

Before you can embed a document for semantic search, you have to split it into chunks small enough to fit the embedding model's input window and focused enough to match a single question. This tool does that entirely in your browser, with four strategies and exact token counts.

Tokenize. The text is encoded with OpenAI's tiktoken byte-pair encoding: cl100k_base (GPT-3.5/4 and the text-embedding-3 models) or o200k_base (GPT-4o). Counts are exact, not the common characters ÷ 4 estimate. You can also measure in characters or words.
Fixed-window strategy. A window of chunk_size units slides with stride = chunk_size − overlap. Chunk k spans units [k·stride, k·stride + chunk_size), clamped to the end of the text. The number of chunks for N > chunk_size is ceil((N − overlap) / stride), and duplicated units total (chunks − 1) × overlap. The source module verifies the sliding-window count against this closed-form formula so an off-by-one at a boundary cannot slip through.
Recursive strategy (default). This follows LangChain's RecursiveCharacterTextSplitter. The text is split on the largest separator in the hierarchy ["\n\n", "\n", " ", ""]; any piece still over the size is split again on the next-smaller separator, down to single characters. Adjacent pieces are then merged greedily until adding the next would exceed the size, and the trailing overlap units of each chunk are carried into the front of the next.
Sentence and paragraph strategies. The text is segmented on sentence boundaries (the browser's Intl.Segmenter) or on blank-line paragraph breaks, then those segments are packed into chunks without ever splitting a sentence or paragraph in half. A single segment larger than the chunk size becomes its own chunk and is flagged.
Flagging. Any chunk over 8,191 tokens is marked in red, because that is the per-input limit for OpenAI's text-embedding-3 models. The summary also reports how many tokens the overlap duplicates and what percentage extra you will pay to embed them.

Everything is integer arithmetic over token, character, or word indices, so the same input always produces exactly the same chunks. Up to 500,000 characters can be processed at once, all on your device.
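As a rough illustration of the paragraph strategy described above, the sketch below packs whole paragraphs into chunks and flags any single paragraph that exceeds the size. It measures in characters, splits on blank lines with a regular expression, and rejoins paragraphs with "\n\n". All of these are assumptions of the sketch rather than details confirmed by the tool.

```python
import re
from typing import List, Tuple


def pack_paragraphs(text: str, chunk_size: int) -> List[Tuple[str, bool]]:
    """Pack whole paragraphs into chunks of at most `chunk_size` characters.

    Returns (chunk, oversized) pairs; `oversized` is True when a single
    paragraph was larger than the chunk size and had to stand alone.
    """
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: List[str] = []
    current: List[str] = []
    for paragraph in paragraphs:
        candidate = "\n\n".join(current + [paragraph])
        if current and len(candidate) > chunk_size:
            chunks.append("\n\n".join(current))  # close before overflowing
            current = [paragraph]
        else:
            current.append(paragraph)
    if current:
        chunks.append("\n\n".join(current))
    return [(chunk, len(chunk) > chunk_size) for chunk in chunks]


doc = "aaaa\n\nbbbb\n\n" + "c" * 30 + "\n\ndd"
packed = pack_paragraphs(doc, chunk_size=12)
```

The same packing loop would work for sentences if the segments came from a sentence segmenter instead of the blank-line split.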

Worked examples

Fixed window, tokens, with overlap

Input: 1,000 tokens. chunk_size = 400, overlap = 50.
stride = 400 − 50 = 350.
Window starts: 0, 350, 700.
Chunk 0 → [0, 400) = 400 tokens; Chunk 1 → [350, 750) = 400; Chunk 2 → [700, 1000) = 300.
Result: 3 chunks. Duplicated = 400 + 400 + 300 − 1000 = 100 = (3 − 1) × 50. ✓
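The example above can be reproduced with a short sliding-window sketch. The function name is illustrative, and the code is not taken from the tool's source module.

```python
from typing import List, Tuple


def fixed_window_spans(n: int, chunk_size: int, overlap: int) -> List[Tuple[int, int]]:
    """Return half-open [start, end) spans for a sliding window over n units."""
    if chunk_size < 1 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size >= 1 and 0 <= overlap < chunk_size")
    if n <= 0:
        return []
    stride = chunk_size - overlap
    spans: List[Tuple[int, int]] = []
    start = 0
    while True:
        end = min(start + chunk_size, n)  # clamp the last window to the end
        spans.append((start, end))
        if end >= n:
            break
        start += stride
    return spans


spans = fixed_window_spans(1000, chunk_size=400, overlap=50)
duplicated = sum(end - start for start, end in spans) - 1000
```

For these inputs the spans are [0, 400), [350, 750), and [700, 1000), and `duplicated` is 100, matching the hand calculation.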

Fixed window, large document

Input: 16,000 tokens. chunk_size = 512, overlap = 64.
stride = 512 − 64 = 448.
chunks = ceil((16000 − 64) / 448) = ceil(15936 / 448) = ceil(35.57) = 36.
Last window: start = 35 × 448 = 15680 → [15680, 16000) = 320 tokens.
Result: 36 chunks. Duplicated = 35 × 64 = 2,240 tokens. ✓
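The closed-form count can be written with integer arithmetic only, which avoids any floating-point rounding in the ceiling. This is a sketch of the formula as stated on this page, with hypothetical function names.

```python
def chunk_count(n: int, chunk_size: int, overlap: int) -> int:
    """Closed-form number of fixed-window chunks for n units."""
    if chunk_size < 1 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size >= 1 and 0 <= overlap < chunk_size")
    if n <= 0:
        return 0
    if n <= chunk_size:
        return 1
    stride = chunk_size - overlap
    return -(-(n - overlap) // stride)  # integer ceil((n - overlap) / stride)


def duplicated_units(n: int, chunk_size: int, overlap: int) -> int:
    """Units embedded more than once because of overlap."""
    return max(chunk_count(n, chunk_size, overlap) - 1, 0) * overlap


def percent_extra(n: int, chunk_size: int, overlap: int) -> float:
    """Extra embedding volume from overlap, as a percentage of n."""
    return 100 * duplicated_units(n, chunk_size, overlap) / n


count = chunk_count(16000, 512, 64)
extra = percent_extra(16000, 512, 64)
```

For the 16,000-token example this gives 36 chunks and 2,240 duplicated tokens, which is 14% extra to embed.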

Recursive packing (edge case — exact, by characters)

Input: "one two three four" (18 chars). chunk_size = 8 characters, overlap = 0, recursive.
Split on " ": segments "one " (4), "two " (4), "three " (6), "four" (4).
Pack greedily ≤ 8: "one " + "two " = 8 → close chunk; "three " = 6, and adding "four" (4) would make 10 > 8 → close chunk; "four" = 4 → own chunk.
Result: 3 chunks → "one two " (8), "three " (6), "four" (4).
Duplicated = 8 + 6 + 4 − 18 = 0 (no overlap requested). ✓
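The packing step in this example can be checked with a small sketch. It covers only the space-level split and the greedy merge, measured in characters and with no overlap. The helper names are illustrative.

```python
from typing import List


def split_on_space(text: str) -> List[str]:
    """Split on single spaces, keeping each space at the end of its segment."""
    parts = text.split(" ")
    pieces = [p + " " for p in parts[:-1]] + [parts[-1]]
    return [p for p in pieces if p]


def pack_greedy(pieces: List[str], chunk_size: int) -> List[str]:
    """Merge adjacent pieces until adding the next one would exceed chunk_size."""
    chunks: List[str] = []
    current = ""
    for piece in pieces:
        if current and len(current) + len(piece) > chunk_size:
            chunks.append(current)
            current = piece
        else:
            current += piece
    if current:
        chunks.append(current)
    return chunks


segments = split_on_space("one two three four")
chunks = pack_greedy(segments, chunk_size=8)
```

Here `chunks` comes out as "one two ", "three ", and "four", the same three chunks as in the hand-worked result.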

Frequently asked questions

Sources & references

The encodings, splitting algorithm, and embedding limit were last cross-checked against the sources above on 2026-06-06. The fixed-window chunk-count math is reconciled against a closed-form formula in lib/data/ai-text-chunker.ts.

