AI Content Moderation Checker — Free Toxicity Checker
Paste any comment, review, or message and check it for toxicity, profanity, threats, insults, and hate speech across the six Jigsaw categories. Offending words are highlighted, sources are cited, and nothing is stored. No signup.
How it works
The checker runs two independent layers and combines them into one verdict — the same pattern as a profanity filter sitting next to a machine-learning classifier. Each layer is transparent, and either can flag a message on its own.
Layer 1 — deterministic profanity scan. The text is lowercased, split into word tokens, and each token is checked for an exact match against a curated 72-term subset of the LDNOOBW list (“List of Dirty, Naughty, Obscene and Otherwise Bad Words”), the open profanity list used by Shutterstock. This layer always runs in your browser, needs no network, and highlights every matched word. It also computes a transparent density figure:
severity = min(1, (matches ÷ words) × 5)
The 5× multiplier means a profanity density of 20% or more saturates to 100%, so a single bad word in a short message still registers while one in a long, otherwise clean paragraph scores low. This is a stated heuristic, not a vendor figure.
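The scan and the density formula above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's own source: the two-word BLOCKLIST is a hypothetical stand-in for the curated 72-term subset, and the tokenizer pattern (runs of letters, digits, or apostrophes) is an assumption about how "word tokens" are split.

```python
import re

# Hypothetical stand-ins for the curated list; the real 72-term
# LDNOOBW subset is not reproduced here.
BLOCKLIST = {"darn", "heck"}

# Assumption: a word token is a run of letters, digits, or apostrophes.
TOKEN_RE = re.compile(r"[a-z0-9']+")


def profanity_scan(text, blocklist=BLOCKLIST):
    """Return (matched tokens, word count, severity) for one message."""
    tokens = TOKEN_RE.findall(text.lower())
    # Exact match only: no stemming, substring, or fuzzy matching.
    matches = [t for t in tokens if t in blocklist]
    if not tokens:
        return matches, 0, 0.0
    # severity = min(1, (matches / words) * 5)
    severity = min(1.0, (len(matches) / len(tokens)) * 5)
    return matches, len(tokens), severity
```

With this sketch, one listed word in a four-word message gives a density of 25%, which the 5× multiplier pushes past the cap, so severity is 1.0. The same word in a 50-word message gives a density of 2% and a severity of about 0.1.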
Layer 2 — toxicity model. When configured, the text is sent once to the unitary/toxic-bert classifier through the Hugging Face Inference API on the server — no model weights are ever downloaded to your browser. It returns an independent sigmoid probability between 0 and 1 for each of the six categories. Because the head is multi-label rather than softmax, the six scores do not sum to 1; a message can be high on several categories at once. A category counts as flagged when its score is at or above your chosen threshold (Strict 0.3, Balanced 0.5, Lenient 0.7).
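To illustrate why independent sigmoid scores need not sum to 1, and how the threshold presets change which categories are flagged, here is a small sketch. The logits are made-up numbers chosen for illustration, not output from unitary/toxic-bert, and the label strings for toxic, obscene, and insult are assumed to follow the same naming as severe_toxic, threat, and identity_hate.

```python
import math

CATEGORIES = ["toxic", "severe_toxic", "obscene",
              "threat", "insult", "identity_hate"]
THRESHOLDS = {"strict": 0.3, "balanced": 0.5, "lenient": 0.7}


def sigmoid(x):
    """Map one logit to a probability, independently of the other labels."""
    return 1.0 / (1.0 + math.exp(-x))


def flagged_categories(scores, preset="balanced"):
    """List the categories whose score is at or above the preset threshold."""
    threshold = THRESHOLDS[preset]
    return [c for c in CATEGORIES if scores[c] >= threshold]


# Hypothetical logits, for illustration only.
logits = {"toxic": 2.0, "severe_toxic": -3.0, "obscene": 1.0,
          "threat": -4.0, "insult": 0.0, "identity_hate": -5.0}
scores = {c: sigmoid(z) for c, z in logits.items()}

print(round(sum(scores.values()), 3))      # well above 1: labels are independent
print(flagged_categories(scores, "balanced"))
print(flagged_categories(scores, "lenient"))
```

In this example the insult score is exactly 0.5, so it is flagged under Balanced ("at or above") but not under Lenient.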
Combined verdict. The text is flagged when any model category crosses the threshold or any profanity word is matched. It escalates to strongly flagged when the model's top score reaches 0.85, when a high-harm category (severe_toxic, threat, or identity_hate) is flagged, or when the profanity density reaches 60%. The verdict maps to a plain action: Clean → “Likely safe to publish”, Flagged → “Review before publishing”, Strongly flagged → “Recommend removing”. No score is invented: model probabilities are shown verbatim, and the only computed numbers are the profanity ratio and the threshold comparisons.
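The verdict rules can be written as one small function. This is a sketch under stated assumptions rather than the tool's actual code: the function name combined_verdict is hypothetical, "profanity density" is read as the raw ratio matches ÷ words (not the 5× severity figure), and an empty score dictionary stands for the case where the model layer is not configured.

```python
HIGH_HARM = {"severe_toxic", "threat", "identity_hate"}

ACTIONS = {
    "Clean": "Likely safe to publish",
    "Flagged": "Review before publishing",
    "Strongly flagged": "Recommend removing",
}


def combined_verdict(scores, matches, words, threshold=0.5):
    """Combine model scores and the profanity count into (verdict, action).

    scores  -- category -> probability; empty if the model did not run
    matches -- number of profanity tokens matched
    words   -- total number of word tokens
    """
    flagged = {c for c, p in scores.items() if p >= threshold}
    # Assumption: density is the raw ratio, before the 5x multiplier.
    density = matches / words if words else 0.0
    top = max(scores.values(), default=0.0)

    if top >= 0.85 or flagged & HIGH_HARM or density >= 0.6:
        verdict = "Strongly flagged"
    elif flagged or matches > 0:
        verdict = "Flagged"
    else:
        verdict = "Clean"
    return verdict, ACTIONS[verdict]


print(combined_verdict({"toxic": 0.62}, 0, 12))
print(combined_verdict({"threat": 0.55}, 0, 12))
```

Note that a high-harm category only escalates the verdict once it is flagged, so the same threat score of 0.55 escalates under Balanced (0.5) but leaves the message Clean under Lenient (0.7).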
The six categories
Toxic
Rude, disrespectful, or unreasonable language likely to make someone leave a discussion.
Severe toxic
Very hateful, aggressive, or disrespectful content — toxicity at its most extreme.
Obscene
Vulgar, sexually explicit, or profane language.
Threat
A statement of intent to inflict physical or other harm on a person or group.
Insult
An inflammatory or negative comment directed at a person (a personal attack).
Identity hate
Hateful content targeting a person's race, religion, gender, sexual orientation, disability, or other identity.
Worked examples
The profanity layer is fully hand-checkable. These three reconcile exactly with the formula above and with the tool's built-in verifyWorkedExamples() check. (The neural scores are not hand-computable, so only the deterministic numbers are shown.)
Sources & references
- Jigsaw / Conversation AI — Toxic Comment Classification Challenge (the six-label taxonomy)
- unitary/toxic-bert — model card (multi-label sigmoid toxicity classifier)
- LDNOOBW — List of Dirty, Naughty, Obscene and Otherwise Bad Words (profanity list)
The taxonomy, model, and profanity list were last cross-checked on 2026-06-22. This v1 is English-only; image, audio, and Sinhala/Tamil moderation are out of scope. A high score is a prompt for human review, not a final ruling.