AI Fine-Tuning Dataset Validator (JSONL)
Paste or upload an OpenAI chat fine-tuning .jsonl dataset and instantly catch the format errors that get an upload rejected — malformed lines, bad roles, missing assistant replies, unknown keys — then see your example count and an estimated training-token total. Runs entirely in your browser; the file never leaves your device.
How it works
The validator mirrors the checks in OpenAI's reference Cookbook script chat_finetuning_data_prep, which is the same logic the platform applies when you upload a training file. It runs as a single deterministic pass over your input — identical input and settings always produce the same report — and nothing is sent over the network.
- Line parsing. A JSONL file is one JSON object per line. Each non-blank line is parsed with
JSON.parse; a failure is reported as invalid JSON for that line. A whole-file JSON array (a common paste mistake) is caught and explained rather than silently mis-parsed. - Top-level structure. In chat mode each object must hold a non-empty
messagesarray. Keys other thanmessages,tools,parallel_tool_calls, orfunctionsare flagged as warnings, exactly as the Cookbook flags unexpected keys. - Message checks. Every message needs a
rolein{system, user, assistant, tool}and, for ordinary messages, a non-emptycontent. Assistant messages may carry afunction_callortool_callsinstead of content. Unrecognized message keys become warnings. - Assistant-presence. Each example must contain at least one
assistantmessage — without a target reply the model has nothing to learn. When an example already has another error (say a misspelled role), the missing-assistant flag is suppressed so the root cause is reported once, not twice. - Token estimation. Following the Cookbook's
num_tokens_from_messages, each example's tokens are the sum of its message text plus a fixed overhead — 3 tokens per message and 3 priming tokens per example. The text itself is estimated (≈4 characters per token, or a closer per-word approximation), since bundling a full tokenizer would bloat the page; for exact counts, the page links to the dedicated token counter. - Readiness rules. The example count is checked against the API floor of 10 and the recommended 50. Examples longer than your per-example token cap are flagged as truncated from the end, per OpenAI's best-practices guidance.
As a credibility cross-check, every token total is also computed a second, simpler way — a flat characters-÷-4 over the same counted text with no overhead — so you can see the two estimates bracket the real figure. The exact value from OpenAI's tokenizer sits between them.
Worked examples
Frequently asked questions
Sources & references
- OpenAI Cookbook — Data preparation and analysis for chat model fine-tuning
- OpenAI Docs — Supervised fine-tuning (chat JSONL format & roles)
- OpenAI Docs — Fine-tuning best practices (example counts, truncation)
The format rules and token overhead on this page were last cross-checked against the OpenAI sources above on 2026-06-24. The token text estimate is approximate — for an exact cl100k/o200k count, use the AI token counter.
Related tools
Comments & feedback
Spotted a bug or want an improvement? Tell us — our team reviews every comment, and good ideas get built. Comments are public and anonymous.
Found a format edge case the validator misses, or want another provider's schema added?
Email me at [email protected] — most fixes ship within 24 hours.