induwara.lk

AI Fine-Tuning Dataset Validator (JSONL)

Paste or upload an OpenAI chat fine-tuning .jsonl dataset and instantly catch the format errors that get an upload rejected — malformed lines, bad roles, missing assistant replies, unknown keys — then see your example count and an estimated training-token total. Runs entirely in your browser; the file never leaves your device.

By Induwara Ashinsana · Updated Jun 24, 2026
Validate your JSONL

5 non-blank lines. Nothing is uploaded — validation runs on this device.

2 errors found across 5 lines

3 of 5 lines are valid examples. Fix the rows below before uploading.

  • Valid examples: 3 (5 non-blank lines parsed)
  • Messages: 10 (1 example with a system msg)
  • Est. training tokens: 106 (max/example: 33 · median: 17)
  • Over token cap: 0 (none exceed the per-example cap)

Readiness checklist

  • At least 10 examples

    5 examples — below the 10 the API requires.

  • Every example has an assistant reply

    At least one example has no assistant message.

  • No fatal format errors

    2 errors the API would reject.

Per-line issues

Line  Severity  Message
4     error     Line 4: no assistant message; the model has no target to learn from. Every example needs at least one assistant reply.
5     error     Line 5: invalid role "asistant"; must be one of system, user, assistant, tool.

Sources: OpenAI Cookbook (chat fine-tuning data prep) · OpenAI Docs (Supervised fine-tuning & best practices). Both are linked under “Sources & references” below. Token figures are estimates; for exact cl100k/o200k counts use the dedicated token counter.

How it works

The validator mirrors the checks in OpenAI's reference Cookbook script chat_finetuning_data_prep, which is the same logic the platform applies when you upload a training file. It runs as a single deterministic pass over your input — identical input and settings always produce the same report — and nothing is sent over the network.

  1. Line parsing. A JSONL file is one JSON object per line. Each non-blank line is parsed with JSON.parse; a failure is reported as invalid JSON for that line. A whole-file JSON array (a common paste mistake) is caught and explained rather than silently mis-parsed.
  2. Top-level structure. In chat mode each object must hold a non-empty messages array. Keys other than messages, tools, parallel_tool_calls, or functions are flagged as warnings, exactly as the Cookbook flags unexpected keys.
  3. Message checks. Every message needs a role in {system, user, assistant, tool} and, for ordinary messages, a non-empty content. Assistant messages may carry a function_call or tool_calls instead of content. Unrecognized message keys become warnings.
  4. Assistant-presence. Each example must contain at least one assistant message — without a target reply the model has nothing to learn. When an example already has another error (say a misspelled role), the missing-assistant flag is suppressed so the root cause is reported once, not twice.
  5. Token estimation. Following the Cookbook's num_tokens_from_messages, each example's tokens are the sum of its message text plus a fixed overhead — 3 tokens per message and 3 priming tokens per example. The text itself is estimated (≈4 characters per token, or a closer per-word approximation), since bundling a full tokenizer would bloat the page; for exact counts, the page links to the dedicated token counter.
  6. Readiness rules. The example count is checked against the API floor of 10 and the recommended 50. Examples longer than your per-example token cap are flagged as truncated from the end, per OpenAI's best-practices guidance.
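
Steps 1 to 4 above can be sketched in a few dozen lines. The tool itself runs in the browser with JSON.parse; the Python below is only an illustration of the same pass, not the validator's actual source. The function names (validate_line, validate), the issue-tuple shape, and the message wording are hypothetical, and the warning for unrecognized message keys is left out to keep the sketch short. Line numbers here count physical lines, which is also an assumption.

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant", "tool"}
ALLOWED_TOP_KEYS = {"messages", "tools", "parallel_tool_calls", "functions"}


def validate_line(line_no, text):
    """Return (line_no, severity, message) tuples for one non-blank line."""
    # Step 1: each line must parse as a JSON object.
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return [(line_no, "error", "invalid JSON")]
    if not isinstance(obj, dict):
        return [(line_no, "error", "line is not a JSON object")]

    # Step 2: unexpected top-level keys are warnings, not errors.
    issues = [(line_no, "warning", f"unexpected key {key!r}")
              for key in obj if key not in ALLOWED_TOP_KEYS]
    messages = obj.get("messages")
    if not isinstance(messages, list) or not messages:
        issues.append((line_no, "error", "messages must be a non-empty array"))
        return issues

    # Step 3: role and content checks for every message.
    has_assistant = False
    for msg in messages:
        if not isinstance(msg, dict):
            issues.append((line_no, "error", "message is not an object"))
            continue
        role = msg.get("role")
        if not isinstance(role, str) or role not in ALLOWED_ROLES:
            issues.append((line_no, "error", f"invalid role {role!r}"))
            continue
        if role == "assistant":
            has_assistant = True
        carries_call = role == "assistant" and (
            "function_call" in msg or "tool_calls" in msg)
        if not carries_call and not msg.get("content"):
            issues.append((line_no, "error", "missing or empty content"))

    # Step 4: report a missing assistant only when nothing else is wrong.
    has_error = any(severity == "error" for _, severity, _ in issues)
    if not has_assistant and not has_error:
        issues.append((line_no, "error", "no assistant message"))
    return issues


def validate(jsonl_text):
    """Validate a whole JSONL string; return (issues, valid_example_count)."""
    stripped = jsonl_text.strip()
    if stripped.startswith("["):
        try:
            if isinstance(json.loads(stripped), list):
                return [(1, "error",
                         "whole-file JSON array, expected one object per line")], 0
        except json.JSONDecodeError:
            pass  # not a parseable array; fall through to per-line checks
    issues, valid = [], 0
    for line_no, raw in enumerate(jsonl_text.splitlines(), start=1):
        if not raw.strip():
            continue  # blank lines are skipped, not counted
        found = validate_line(line_no, raw)
        issues.extend(found)
        if not any(severity == "error" for _, severity, _ in found):
            valid += 1
    return issues, valid
```

Because the missing-assistant check runs last and is skipped when the example already has an error, a misspelled role produces one issue for that line rather than two.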

As a credibility cross-check, every token total is also computed a second, simpler way (a flat characters-÷-4 over the same counted text with no overhead) so you can compare the two estimates. The exact value from OpenAI's tokenizer typically falls between or close to them, though neither heuristic guarantees a bound.
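
A rough sketch of the two estimates side by side, assuming the counted text is each message's role plus its content (the per-message breakdown in worked example C suggests this, but it is an assumption). The function names are hypothetical.

```python
import math

TOKENS_PER_MESSAGE = 3  # fixed per-message overhead from step 5
PRIMING_TOKENS = 3      # fixed per-example overhead from step 5


def overhead_estimate(messages):
    """Heuristic with overhead: ceil(chars / 4) per text field plus fixed costs."""
    total = PRIMING_TOKENS
    for msg in messages:
        total += TOKENS_PER_MESSAGE
        total += math.ceil(len(msg["role"]) / 4)
        total += math.ceil(len(msg.get("content") or "") / 4)
    return total


def flat_estimate(messages):
    """Cross-check: all counted characters divided by 4, no overhead."""
    chars = sum(len(m["role"]) + len(m.get("content") or "") for m in messages)
    return math.ceil(chars / 4)


# First line of worked example A.
example = [
    {"role": "system", "content": "You are terse."},
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello."},
]
print(flat_estimate(example), overhead_estimate(example))  # prints "11 25"
```

The gap between the two numbers comes mostly from the fixed overhead, so it is proportionally largest for short messages like these.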

Worked examples

A — a clean 3-example dataset

0 errors · 3 valid examples

Input JSONL

{"messages":[{"role":"system","content":"You are terse."},{"role":"user","content":"Hi"},{"role":"assistant","content":"Hello."}]}
{"messages":[{"role":"user","content":"2+2?"},{"role":"assistant","content":"4"}]}
{"messages":[{"role":"user","content":"Capital of Sri Lanka?"},{"role":"assistant","content":"Sri Jayawardenepura Kotte (commercial: Colombo)."}]}
  1. All three lines are valid JSON objects, each with a non-empty messages array.
  2. Every example has at least one assistant message, and every role is valid → 0 errors, 3 valid examples.
  3. One example (line 1) includes a system message; the stats card reports that.
  4. Readiness: example-count FAILS (3 < 10), assistant-presence PASSES, no-errors PASSES — so the format is correct but the dataset is too small to train on yet.
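
The readiness outcome for this dataset can be reproduced with a small checklist function. This is a sketch under stated assumptions: the function name, the dictionary keys, the token counts, and the 4096 cap are all illustrative values, not the tool's real internals or defaults.

```python
API_MINIMUM = 10  # floor stated in the readiness rules
RECOMMENDED = 50  # recommended count stated in the readiness rules


def readiness(valid_examples, error_count, missing_assistant,
              token_counts, token_cap):
    """Return the checklist as a dict of check name to result."""
    return {
        "at_least_10_examples": valid_examples >= API_MINIMUM,
        "at_least_50_recommended": valid_examples >= RECOMMENDED,
        "every_example_has_assistant": missing_assistant == 0,
        "no_fatal_errors": error_count == 0,
        # 1-based positions of examples whose estimate exceeds the cap.
        "over_token_cap": [i for i, n in enumerate(token_counts, start=1)
                           if n > token_cap],
    }


# Example A: 3 valid examples, no errors; token counts and cap are made up.
report = readiness(valid_examples=3, error_count=0, missing_assistant=0,
                   token_counts=[25, 15, 30], token_cap=4096)
for name, result in report.items():
    print(name, result)
```

Only the example-count checks fail here, which matches the walkthrough: the format is fine but the dataset is too small.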

B — a broken dataset (edge case)

3 errors across 4 lines · 1 valid example

Input JSONL

{"messages":[{"role":"user","content":"Hi"},{"role":"assistant","content":"Hey"}]}
{"messages":[{"role":"user","content":"No answer here"}]}
{"messages":[{"role":"user","content":"Typo role"},{"role":"asistant","content":"oops"}]}
{"messages":[{"role":"user","content":"Bad json"]}
  1. Line 1 is valid.
  2. Line 2 has no assistant message → error.
  3. Line 3 misspells the role as "asistant" → invalid-role error. Because that example already has an error, the missing-assistant flag is suppressed — you fix the typo once.
  4. Line 4 is missing a closing brace → invalid-JSON error.
  5. Total: 3 errors, 1 valid example — exactly what the OpenAI API would reject on upload.

C — token estimate for one example

Heuristic basis, system message counted

Input JSONL

{"messages":[{"role":"user","content":"2+2?"},{"role":"assistant","content":"4"}]}
  1. User message: 3 (per-message) + ceil(4/4)=1 for the role text + ceil(4/4)=1 for "2+2?" = 5 tokens.
  2. Assistant message: 3 + ceil(9/4)=3 for "assistant" + ceil(1/4)=1 for "4" = 7 tokens.
  3. Plus 3 priming tokens per example: 5 + 7 + 3 = 15 estimated training tokens.
  4. Multiply the dataset total by your number of epochs to gauge training cost; switch the basis to Approx cl100k for a tighter estimate.
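
The arithmetic above can be checked with a short script. It uses the 4-characters-per-token heuristic from the steps; the function names and the epoch count of 3 are hypothetical, chosen only for illustration.

```python
import math

TOKENS_PER_MESSAGE = 3  # per-message overhead used in steps 1 and 2
PRIMING_TOKENS = 3      # per-example overhead used in step 3


def estimate_text_tokens(text):
    """Heuristic basis: about 4 characters per token, rounded up."""
    return math.ceil(len(text) / 4)


def estimate_example_tokens(messages):
    """Sum role and content estimates plus the fixed overheads."""
    total = PRIMING_TOKENS
    for msg in messages:
        total += TOKENS_PER_MESSAGE
        total += estimate_text_tokens(msg["role"])
        total += estimate_text_tokens(msg.get("content") or "")
    return total


example = [
    {"role": "user", "content": "2+2?"},
    {"role": "assistant", "content": "4"},
]
per_example = estimate_example_tokens(example)
epochs = 3  # hypothetical epoch count
print(per_example)           # prints 15
print(per_example * epochs)  # prints 45
```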

Sources & references

The format rules and token overhead on this page were last cross-checked against the OpenAI sources above on 2026-06-24. The token text estimate is approximate — for an exact cl100k/o200k count, use the AI token counter.
