induwara.lk · AI · Developer · Privacy-first

AI Text Chunker — Split Text for RAG & Embeddings

Split long text into token-sized, optionally overlapping chunks for retrieval-augmented generation and embedding pipelines. Runs OpenAI's real tiktoken counter in your browser, shows each chunk's exact token count, and exports JSON or JSONL. No upload, no signup.

By Induwara Ashinsana · Updated Jun 6, 2026

Chunk size: target units per chunk (1–8,000).

Overlap: units carried into the next chunk (0 to size − 1).


What this does

Splits long text into token-sized, optionally overlapping chunks for retrieval-augmented generation and embeddings. Pick a size, overlap, and strategy, then press Chunk text to see each chunk's exact token count and export the result as JSON or JSONL, ready to feed an embeddings pipeline.

Token counts use OpenAI's tiktoken BPE encodings; the recursive strategy follows LangChain's separator hierarchy. Chunks over 8,191 tokens are flagged as too large for text-embedding-3. Sources linked below; last verified 2026-06-06.

How it works

Before you can embed a document for semantic search, you have to split it into chunks small enough to fit the embedding model's input window and focused enough to match a single question. This tool does that entirely in your browser, with four strategies and exact token counts.

  1. Tokenize. The text is encoded with OpenAI's tiktoken byte-pair encoding: cl100k_base (GPT-3.5/4 and the text-embedding-3 models) or o200k_base (GPT-4o). Counts are exact, not the common characters ÷ 4 estimate. You can also measure in characters or words.
  2. Fixed-window strategy. A window of chunk_size units slides forward by stride = chunk_size − overlap. Chunk k spans units [k·stride, k·stride + chunk_size), clamped to the end of the text. For N units with N > chunk_size, the number of chunks is ceil((N − overlap) / stride), and the duplicated units total (chunks − 1) × overlap. The source module verifies the sliding-window count against this closed-form formula so an off-by-one at a boundary cannot slip through.
  3. Recursive strategy (default). This follows LangChain's RecursiveCharacterTextSplitter. The text is split on the largest separator in the hierarchy ["\n\n", "\n", " ", ""]; any piece still over the size is split again on the next-smaller separator, down to single characters. Adjacent pieces are then merged greedily until adding the next would exceed the size, and the trailing overlap units of each chunk are carried into the front of the next.
  4. Sentence and paragraph strategies. The text is segmented on sentence boundaries (the browser's Intl.Segmenter) or on blank-line paragraph breaks, then those segments are packed into chunks without ever splitting a sentence or paragraph in half. A single segment larger than the chunk size becomes its own chunk and is flagged.
  5. Flagging. Any chunk over 8,191 tokens is marked in red, because that is the per-input limit for OpenAI's text-embedding-3 models. The summary also reports how many tokens the overlap duplicates and what percentage extra you will pay to embed them.
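
The flagging step can be sketched roughly as follows. This is an illustration, not the tool's code: summarizeChunks and the ChunkSummary fields are hypothetical names, and the "percentage extra" is assumed here to mean duplicated tokens divided by the source token count.

```typescript
// Per-input token limit stated above for OpenAI's text-embedding-3 models.
const EMBEDDING_TOKEN_LIMIT = 8191;

interface ChunkSummary {
  oversized: number[];      // indices of chunks above the limit
  duplicatedTokens: number; // tokens embedded more than once because of overlap
  extraCostPercent: number; // assumption: duplicated tokens relative to the source length
}

// Hypothetical helper: chunkTokenCounts holds one token count per chunk,
// sourceTokens is the token count of the un-chunked text.
function summarizeChunks(chunkTokenCounts: number[], sourceTokens: number): ChunkSummary {
  const oversized: number[] = [];
  let total = 0;
  chunkTokenCounts.forEach((count, index) => {
    total += count;
    if (count > EMBEDDING_TOKEN_LIMIT) oversized.push(index);
  });
  // Everything embedded beyond the source length is overlap duplication.
  const duplicatedTokens = Math.max(0, total - sourceTokens);
  const extraCostPercent = sourceTokens > 0 ? (duplicatedTokens / sourceTokens) * 100 : 0;
  return { oversized, duplicatedTokens, extraCostPercent };
}
```

For three chunks of 400, 400, and 300 tokens cut from a 1,000-token source, this reports 100 duplicated tokens, or 10% extra.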

Everything is integer arithmetic over token, character, or word indices, so the same input always produces exactly the same chunks. Up to 500,000 characters can be processed at once, all on your device.
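
As an illustration of that deterministic packing, here is a minimal sketch of the paragraph strategy. It assumes units are whitespace-separated words and ignores overlap; packParagraphs, countWords, and PackedChunk are hypothetical names, not the tool's code.

```typescript
interface PackedChunk {
  text: string;
  words: number;
  oversized: boolean; // true when a single paragraph exceeds the chunk size
}

// Assumption for this sketch: a "word" is any run of non-whitespace characters.
const countWords = (s: string): number => s.split(/\s+/).filter(Boolean).length;

function packParagraphs(text: string, chunkSize: number): PackedChunk[] {
  // Paragraphs are separated by blank lines.
  const paragraphs = text
    .split(/\n\s*\n/)
    .map((p) => p.trim())
    .filter((p) => p.length > 0);

  const chunks: PackedChunk[] = [];
  let current: string[] = [];
  let currentWords = 0;

  const flush = (): void => {
    if (current.length === 0) return;
    chunks.push({
      text: current.join("\n\n"),
      words: currentWords,
      oversized: currentWords > chunkSize,
    });
    current = [];
    currentWords = 0;
  };

  for (const paragraph of paragraphs) {
    const n = countWords(paragraph);
    // Close the open chunk if this paragraph would push it over the size.
    if (current.length > 0 && currentWords + n > chunkSize) flush();
    // A paragraph is never split, so an oversized one ends up alone and flagged.
    current.push(paragraph);
    currentWords += n;
  }
  flush();
  return chunks;
}
```

Because the loop only adds and compares integer word counts, running it twice on the same input yields identical chunks.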

Worked examples

Fixed window, tokens, with overlap

  1. Input: 1,000 tokens. chunk_size = 400, overlap = 50.
  2. stride = 400 − 50 = 350.
  3. Window starts: 0, 350, 700.
  4. Chunk 0 → [0, 400) = 400 tokens; Chunk 1 → [350, 750) = 400; Chunk 2 → [700, 1000) = 300.
  5. Result: 3 chunks. Duplicated = 400 + 400 + 300 − 1000 = 100 = (3 − 1) × 50. ✓
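
The same windows can be reproduced with a short sketch. fixedWindows is a hypothetical name; it works on unit indices only and assumes tokenization has already happened.

```typescript
// Half-open index range [start, end) over tokens, characters, or words.
type WindowRange = [number, number];

function fixedWindows(n: number, chunkSize: number, overlap: number): WindowRange[] {
  if (!Number.isInteger(chunkSize) || chunkSize < 1) {
    throw new RangeError("chunkSize must be a positive integer");
  }
  if (!Number.isInteger(overlap) || overlap < 0 || overlap >= chunkSize) {
    throw new RangeError("overlap must be an integer in [0, chunkSize - 1]");
  }
  const stride = chunkSize - overlap;
  const ranges: WindowRange[] = [];
  for (let start = 0; start < n; start += stride) {
    const end = Math.min(start + chunkSize, n); // clamp the last window
    ranges.push([start, end]);
    if (end === n) break; // the window reached the end of the input
  }
  return ranges;
}

console.log(fixedWindows(1000, 400, 50));
// → [ [ 0, 400 ], [ 350, 750 ], [ 700, 1000 ] ]
```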

Fixed window, large document

  1. Input: 16,000 tokens. chunk_size = 512, overlap = 64.
  2. stride = 512 − 64 = 448.
  3. chunks = ceil((16000 − 64) / 448) = ceil(15936 / 448) = ceil(35.57) = 36.
  4. Last window: start = 35 × 448 = 15680 → [15680, 16000) = 320 tokens.
  5. Result: 36 chunks. Duplicated = 35 × 64 = 2,240 tokens. ✓
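
A minimal sketch of how a closed-form count might be reconciled against a sliding-window loop. The function names are hypothetical, and this is not the check in the tool's source module.

```typescript
function assertValid(chunkSize: number, overlap: number): void {
  if (chunkSize < 1 || overlap < 0 || overlap >= chunkSize) {
    throw new RangeError("require chunkSize >= 1 and 0 <= overlap < chunkSize");
  }
}

// Closed form: ceil((N − overlap) / stride) when N > chunkSize.
function closedFormChunkCount(n: number, chunkSize: number, overlap: number): number {
  assertValid(chunkSize, overlap);
  if (n <= 0) return 0;
  if (n <= chunkSize) return 1;
  return Math.ceil((n - overlap) / (chunkSize - overlap));
}

// Brute force: slide the window and count until it reaches the end.
function slidingChunkCount(n: number, chunkSize: number, overlap: number): number {
  assertValid(chunkSize, overlap);
  const stride = chunkSize - overlap;
  let count = 0;
  for (let start = 0; start < n; start += stride) {
    count++;
    if (start + chunkSize >= n) break;
  }
  return count;
}

// Duplicated units: every chunk after the first repeats `overlap` units.
function duplicatedUnits(n: number, chunkSize: number, overlap: number): number {
  return Math.max(0, closedFormChunkCount(n, chunkSize, overlap) - 1) * overlap;
}

console.log(closedFormChunkCount(16000, 512, 64)); // → 36
console.log(slidingChunkCount(16000, 512, 64));    // → 36
console.log(duplicatedUnits(16000, 512, 64));      // → 2240
```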

Recursive packing (edge case — exact, by characters)

  1. Input: "one two three four" (18 chars). chunk_size = 8 characters, overlap = 0, recursive.
  2. Split on " ": segments "one " (4), "two " (4), "three " (6), "four" (4).
  3. Pack greedily ≤ 8: "one " + "two " = 8 → close chunk; "three " = 6, and adding "four" (4) would overflow → close chunk; "four" = 4 → own chunk.
  4. Result: 3 chunks → "one two " (8), "three " (6), "four" (4).
  5. Duplicated = 8 + 6 + 4 − 18 = 0 (no overlap requested). ✓
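
The split-then-pack steps of this example can be sketched as below. It is a simplified illustration, not LangChain's implementation or the tool's code: it measures size with JavaScript's string length, keeps each separator attached to the piece before it (as the example does), and leaves overlap out entirely. All function names are hypothetical.

```typescript
const SEPARATORS = ["\n\n", "\n", " ", ""];

// Split on `sep`, keeping the separator at the end of each piece.
function splitKeeping(text: string, sep: string): string[] {
  const parts = text.split(sep);
  return parts
    .map((part, i) => (i < parts.length - 1 ? part + sep : part))
    .filter((part) => part.length > 0);
}

// Break the text into pieces no longer than chunkSize, trying larger separators first.
function splitRecursive(text: string, chunkSize: number, level: number = 0): string[] {
  if (text.length <= chunkSize) return text.length > 0 ? [text] : [];
  const sep = SEPARATORS[level];
  // Last resort: fall back to single characters.
  if (sep === undefined || sep === "") return Array.from(text);
  const pieces: string[] = [];
  for (const part of splitKeeping(text, sep)) {
    pieces.push(...splitRecursive(part, chunkSize, level + 1));
  }
  return pieces;
}

// Merge adjacent pieces until adding the next one would exceed chunkSize.
function packGreedy(pieces: string[], chunkSize: number): string[] {
  const chunks: string[] = [];
  let current = "";
  for (const piece of pieces) {
    if (current.length > 0 && current.length + piece.length > chunkSize) {
      chunks.push(current);
      current = "";
    }
    current += piece;
  }
  if (current.length > 0) chunks.push(current);
  return chunks;
}

function chunkRecursive(text: string, chunkSize: number): string[] {
  return packGreedy(splitRecursive(text, chunkSize), chunkSize);
}

console.log(chunkRecursive("one two three four", 8));
// → [ 'one two ', 'three ', 'four' ]
```

With no overlap, the chunks concatenate back to the original 18 characters, which is why the duplicated count is 0.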


Sources & references

The encodings, splitting algorithm, and embedding limit were last cross-checked against the sources above on 2026-06-06. The fixed-window chunk-count math is reconciled against a closed-form formula in lib/data/ai-text-chunker.ts.

