AI Alt-Text Generator — WCAG-friendly captions for any image
Drop an image and get a concise, WCAG 2.1-friendly alt-text string plus two longer caption variants. BLIP captioning runs server-side through the Hugging Face Inference API; a built-in WebAIM/W3C linter catches common alt-text mistakes. No signup, no ads.
Captions are generated by the BLIP model on the Hugging Face Inference API. Image bytes are sent once for inference and not stored. AI output is a draft — review it before publishing per W3C guidance.
How it works
The page combines three independent layers: the W3C decision tree (drives the role switch), the BLIP vision-language model (writes the caption), and the WebAIM/W3C linter (scores the caption against accessibility heuristics). Each layer is documented below so you can see why a given result was chosen and override it when the model gets it wrong.
1. W3C decision tree — what role is this image?
Before the model runs, the page asks the same question the W3C WAI alt-text decision tree asks: is this image informative (adds information beyond the surrounding text), decorative (a divider, ornament, or visual filler with no meaning), functional (used inside a link or a button), or complex (a chart, diagram, or infographic that needs a long description). Each branch has a different rule. Decorative images take an empty alt="" — the page skips the model entirely. Functional images get described by their purpose (Search, Open menu, Buy now), not their picture. Informative and complex images get a caption from the model.
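As a rough illustration of the role switch described above, the sketch below maps each of the four W3C roles to a plan. The names planAltText and AltPlan are hypothetical, and the choice to skip the model for functional images is an assumption drawn from the wording above, not the page's actual code.

```typescript
// Hypothetical sketch of the role switch; the four roles come from the
// W3C WAI alt decision tree, everything else is illustrative.
type ImageRole = "informative" | "decorative" | "functional" | "complex";

interface AltPlan {
  runModel: boolean;   // whether the captioning model is called at all
  alt: string | null;  // fixed alt text, or null when the model supplies it
  note: string;        // short explanation shown to the author
}

function planAltText(role: ImageRole, purpose?: string): AltPlan {
  switch (role) {
    case "decorative":
      // Decorative images take alt="" and never reach the model.
      return { runModel: false, alt: "", note: "Empty alt; model skipped." };
    case "functional":
      // Assumption: the author supplies the purpose (Search, Open menu, Buy now).
      return { runModel: false, alt: purpose ?? null, note: "Describe the action, not the picture." };
    case "complex":
      return { runModel: true, alt: null, note: "Model caption; a long description is still needed." };
    case "informative":
      return { runModel: true, alt: null, note: "Model caption." };
  }
}
```

For example, planAltText("functional", "Search") yields the alt text "Search" without calling the model.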
2. BLIP captioning — what does the image show?
The primary captioner is Salesforce/blip-image-captioning-base — a ViT-B/16 vision encoder paired with a BERT-base text decoder, trained on COCO + Conceptual Captions 3M + 12M + SBU + filtered LAION 115M (Li, Li, Xiong & Hoi, ICML 2022). The model card reports CIDEr 136.7 / SPICE 26.0 on the COCO Karpathy test split. We call it via the Hugging Face Inference API from a server-only Next.js route handler — your image bytes travel from your browser to this server to Hugging Face and back. No third-party SaaS, no model download to your device.
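A server-side call of this kind might be assembled roughly as follows. This is a sketch under stated assumptions: the endpoint URL shape, the Bearer-token header, and the [{ generated_text }] response shape reflect the hosted Inference API as commonly documented and should be checked against the current Hugging Face documentation. The helper names buildCaptionRequest and parseCaptions are hypothetical.

```typescript
// Model id is taken from the text above.
const MODEL_ID = "Salesforce/blip-image-captioning-base";
// Assumed endpoint shape for the hosted Inference API; verify before use.
const API_BASE = "https://api-inference.huggingface.co/models";

interface CaptionRequest {
  url: string;
  init: { method: "POST"; headers: Record<string, string>; body: Uint8Array };
}

// Build the request on the server so the API token never reaches the browser.
function buildCaptionRequest(
  image: Uint8Array,
  token: string,
  contentType: string = "application/octet-stream"
): CaptionRequest {
  return {
    url: `${API_BASE}/${MODEL_ID}`,
    init: {
      method: "POST",
      headers: { Authorization: `Bearer ${token}`, "Content-Type": contentType },
      body: image, // raw image bytes, sent once for inference
    },
  };
}

// Pull caption strings out of the JSON payload, ignoring anything unexpected.
function parseCaptions(payload: unknown): string[] {
  if (!Array.isArray(payload)) return [];
  const captions: string[] = [];
  for (const item of payload) {
    if (typeof item === "object" && item !== null) {
      const text = (item as { generated_text?: unknown }).generated_text;
      if (typeof text === "string" && text.trim() !== "") captions.push(text.trim());
    }
  }
  return captions;
}
```

Inside a route handler, the url and init pair would be passed to fetch and the decoded JSON body handed to parseCaptions; an error object instead of an array simply produces an empty list.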
Beam search is enabled by default with width 1–5 (slider above). Wider beams explore more wording at the cost of one or two extra seconds of latency. The token budget is set from your length preset — 20 tokens for Brief, 30 for Standard, 60 for Detailed — which keeps the model close to the WebAIM-recommended ~125-character cap without an over-aggressive trim.
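The mapping from preset and slider to generation settings can be sketched as below. The token budgets and the 1 to 5 beam range come from the text above; the parameter names num_beams and max_new_tokens follow common text-generation conventions and are an assumption about what the page actually sends.

```typescript
type LengthPreset = "brief" | "standard" | "detailed";

// Token budgets quoted in the text above.
const TOKEN_BUDGET: Record<LengthPreset, number> = {
  brief: 20,
  standard: 30,
  detailed: 60,
};

interface GenerationParams {
  num_beams: number;      // assumed parameter name
  max_new_tokens: number; // assumed parameter name
}

function generationParams(preset: LengthPreset, beamWidth: number): GenerationParams {
  // Clamp the slider value to the supported 1..5 range; fall back to 1 on bad input.
  const beams = Number.isFinite(beamWidth)
    ? Math.min(5, Math.max(1, Math.round(beamWidth)))
    : 1;
  return { num_beams: beams, max_new_tokens: TOKEN_BUDGET[preset] };
}
```

A beam width of 1 is equivalent to greedy decoding, which is the cheapest setting on the slider.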
The optional context hint field is wired to BLIP's conditional-captioning entry point (paper §3.3). Anything you type is prepended to the decoder input as "a photo of <hint>" so the model continues from there. This is useful for proper nouns the model cannot guess: place names, brand names, named dishes.
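A minimal sketch of the prefix construction, assuming the hint is whitespace-normalised first and that an empty hint means unconditional captioning. The function name buildPrompt is hypothetical.

```typescript
// Turn the optional hint into the decoder prefix described above.
// Returns undefined when there is no usable hint, meaning "caption unconditionally".
function buildPrompt(hint: string): string | undefined {
  const cleaned = hint.replace(/\s+/g, " ").trim();
  return cleaned === "" ? undefined : `a photo of ${cleaned}`;
}
```

With the hint "Lake Bled" the decoder would start from "a photo of Lake Bled" and continue the sentence from there.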
3. Post-processing — clip, lint, score, rank
Each candidate is run through the same pipeline:
- Length cap. If a candidate already fits, it is returned verbatim. Otherwise the clipper walks back to the previous word boundary inside the [60 %, 100 %] window of the cap and appends an ellipsis (U+2026) so the truncation is visible.
- WebAIM/W3C linter. Five heuristics run against the clipped string: drop "image of / picture of / photo of" lead phrases; warn on filename suffixes (.jpg, .png, …); flag over-limit lengths; warn on "click here" in functional roles; warn when alt duplicates the visible page caption.
- Quality score. Each candidate scores 100, minus 20 per warning, minus 5 per info, plus a 5-point bonus when its length sits in WebAIM's 25–125-character sweet spot. The cross-check function alternateScoreCaption() reproduces the same number from a separate formulation, so you can verify the ranking is deterministic and not a black-box re-rank.
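The three steps in the list above can be sketched as follows. This is an illustration, not the page's source: the function names, the regular expressions, the choice to reserve one character of the cap for the ellipsis, and the assignment of "info" to the lead-phrase rule (with "warning" for the other four) are all assumptions.

```typescript
type Severity = "warning" | "info";
interface Finding { rule: string; severity: Severity; }
interface LintContext {
  role: "informative" | "functional" | "complex";
  pageCaption?: string; // visible caption next to the image, if any
}

const CAP = 125; // WebAIM-recommended character cap cited in the text

// Clip to the cap at a word boundary found in the [60 %, 100 %] window.
function clipToCap(text: string, cap: number = CAP): string {
  if (text.length <= cap) return text;  // already fits: return verbatim
  const limit = cap - 1;                // assumption: reserve one char for the ellipsis
  const floor = Math.ceil(cap * 0.6);
  let cut = limit;                      // hard cut if no boundary is in the window
  for (let i = limit; i >= floor; i--) {
    if (/\s/.test(text.charAt(i))) { cut = i; break; }
  }
  return text.slice(0, cut).replace(/\s+$/, "") + "\u2026";
}

const normalise = (s: string): string => s.toLowerCase().replace(/\s+/g, " ").trim();

// Five heuristics, one finding each; severities are assumed.
function lintAlt(alt: string, ctx: LintContext): Finding[] {
  const findings: Finding[] = [];
  if (/^\s*(an?\s+)?(image|picture|photo)\s+of\b/i.test(alt)) {
    findings.push({ rule: "lead-phrase", severity: "info" });
  }
  if (/\.(jpe?g|png|gif|webp|svg)\b/i.test(alt)) {
    findings.push({ rule: "filename-suffix", severity: "warning" });
  }
  if (alt.length > CAP) {
    findings.push({ rule: "over-limit", severity: "warning" });
  }
  if (ctx.role === "functional" && /\bclick here\b/i.test(alt)) {
    findings.push({ rule: "click-here", severity: "warning" });
  }
  if (ctx.pageCaption !== undefined && normalise(alt) === normalise(ctx.pageCaption)) {
    findings.push({ rule: "duplicates-caption", severity: "warning" });
  }
  return findings;
}

// 100, minus 20 per warning, minus 5 per info, plus 5 inside the 25..125 window.
function scoreAlt(alt: string, findings: Finding[]): number {
  const warnings = findings.filter((f) => f.severity === "warning").length;
  const infos = findings.filter((f) => f.severity === "info").length;
  const bonus = alt.length >= 25 && alt.length <= CAP ? 5 : 0;
  return 100 - 20 * warnings - 5 * infos + bonus;
}
```

Because the score is a pure function of the string and its findings, running it twice on the same candidate always gives the same number, which is the property the cross-check relies on.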
The highest-scoring candidate becomes the primary alt text shown first. The next two are surfaced as alternates so you can compare phrasings. The model's own beam-search order is preserved in the API response — the score-based re-rank only swaps a candidate forward when the linter says the underlying beam-top has a clearly fixable problem.
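One way to get this behaviour is a score-descending sort that breaks ties by beam rank, so the beam-top candidate stays first unless another candidate scores strictly higher. The sketch below assumes that tie-break rule; the Scored type and rankCandidates name are hypothetical.

```typescript
interface Scored {
  text: string;
  beamRank: number; // 0 = the model's top beam
  score: number;    // quality score from the linter step
}

interface Ranking {
  primary: Scored | undefined;
  alternates: Scored[]; // the next two by score
  beamOrder: Scored[];  // untouched model order, as returned in the API response
}

function rankCandidates(candidates: Scored[]): Ranking {
  // Copy before sorting so the original beam order is preserved.
  const byScore = candidates.slice().sort((a, b) =>
    b.score !== a.score ? b.score - a.score : a.beamRank - b.beamRank
  );
  return {
    primary: byScore[0],
    alternates: byScore.slice(1, 3),
    beamOrder: candidates,
  };
}
```

Keeping beamOrder alongside the re-ranked list lets a reviewer see exactly when the linter overrode the model's own preference.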
Hard limits
Images larger than 8.0 MB or 4096×4096 px are rejected with a specific error rather than silently re-scaled on the server. The captioner only ever sees a 384×384 view of your image after Hugging Face's preprocessing, so larger files just waste bandwidth.
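A pre-upload check along these lines could be sketched as below. It assumes "8.0 MB" means 8 × 1024 × 1024 bytes and that the pixel limit applies to each side independently; the function name and error wording are hypothetical.

```typescript
const MAX_BYTES = 8 * 1024 * 1024; // assumption: binary megabytes
const MAX_SIDE = 4096;             // assumption: limit applies per side

// Returns a specific error message, or null when the image is acceptable.
function checkLimits(byteLength: number, width: number, height: number): string | null {
  if (byteLength > MAX_BYTES) {
    return `Image is ${(byteLength / (1024 * 1024)).toFixed(1)} MB; the limit is 8.0 MB.`;
  }
  if (width > MAX_SIDE || height > MAX_SIDE) {
    return `Image is ${width}x${height} px; the limit is ${MAX_SIDE}x${MAX_SIDE} px.`;
  }
  return null;
}
```

Rejecting early with a concrete message avoids uploading bytes the captioner would only downscale anyway.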
Sources & references
- W3C WCAG 2.1 Success Criterion 1.1.1 — Non-text Content (Level A)
- W3C WAI — An alt Decision Tree
- WebAIM — Alternative Text (Institute for Disability Research, Utah State University)
- WHATWG HTML living standard — alt attribute
- Li, Li, Xiong & Hoi (2022) — BLIP: Bootstrapping Language-Image Pre-training (ICML 2022)
- Hugging Face model card — Salesforce/blip-image-captioning-base
- Hugging Face model card — nlpconnect/vit-gpt2-image-captioning
- Hugging Face Inference API documentation
Sources, model cards, and accessibility guidance were last cross-checked on 2026-05-12. The captioning service is the public Hugging Face Inference API; the role switch and linter on this page implement the W3C WAI decision tree and WebAIM alt-text guidance verbatim.
Spot a caption the linter should be catching, or an image the model keeps mis-reading?
Email me at [email protected] — most fixes ship within 24 hours.