AI Object Detector — labelled bounding boxes for any image
Drop an image and get a labelled bounding box around every object the COCO-trained DETR or YOLOS transformer can find — class name, confidence, pixel coordinates, and exports in PNG, COCO JSON, and YOLO TXT. Server-side inference via the Hugging Face Inference API; no signup.
Detections come from DETR or YOLOS served via the Hugging Face Inference API. Image bytes are sent once for inference and not stored. The 80-class COCO label space does not cover tuk-tuks, sarongs, or local dishes — review the result before publishing it as feature data.
How it works
Object detection answers a different question than image classification or captioning. Where a classifier outputs one label for the whole picture, and a captioner writes one sentence, a detector returns a list of (class, confidence, bounding box) tuples — one per recognised object. This page wraps three layers: file validation + preprocessing in the browser, transformer inference on a Hugging Face endpoint, and a deterministic post-processing pipeline that re-applies threshold, class filter, and (for YOLOS) non-maximum suppression on the client without a second network call.
1. Validate and preprocess the image
The browser rejects any file that is not JPG, PNG, WebP, GIF, or AVIF, that is larger than 8.0 MB on disk, or that exceeds 4096 px on either side. Files that pass the gate are decoded once via createImageBitmap (which applies EXIF orientation in modern browsers) and posted as multipart form data to /api/tools/detect-objects. The route forwards the raw bytes to the Hugging Face Inference endpoint for the chosen backbone (DETR or YOLOS) and returns the parsed detections plus the inference time in milliseconds.
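The gate described above can be sketched as a pure function. This is a minimal illustration, not the page's actual code: the names validateImage, ImageCandidate, and GateResult are hypothetical, and it assumes "8.0 MB" means 8 × 1024 × 1024 bytes and that the decoded width and height are already known.

```typescript
// Hypothetical names; only the limits themselves come from the text above.
const ALLOWED_TYPES: readonly string[] = [
  "image/jpeg",
  "image/png",
  "image/webp",
  "image/gif",
  "image/avif",
];
const MAX_BYTES = 8 * 1024 * 1024; // assumption: 8.0 MB read as 8 MiB
const MAX_SIDE_PX = 4096;

interface ImageCandidate {
  mimeType: string;
  byteLength: number;
  width: number;
  height: number;
}

type GateResult = { ok: true } | { ok: false; reason: string };

function validateImage(img: ImageCandidate): GateResult {
  if (ALLOWED_TYPES.indexOf(img.mimeType) === -1) {
    return { ok: false, reason: "unsupported type: " + img.mimeType };
  }
  if (img.byteLength > MAX_BYTES) {
    return { ok: false, reason: "file exceeds 8.0 MB" };
  }
  if (img.width > MAX_SIDE_PX || img.height > MAX_SIDE_PX) {
    return { ok: false, reason: "image exceeds 4096 px on one side" };
  }
  return { ok: true };
}
```

Keeping the check free of DOM calls means the same function could run in the browser before upload and again on the server route as a second line of defence.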
2. Transformer detection (DETR or YOLOS)
The Quality backbone is facebook/detr-resnet-50. The ResNet-50 backbone produces a 2048-channel feature map at stride 32, a 1×1 conv squeezes it to 256 channels, a 6-layer transformer encoder attends across spatial positions, and a 6-layer decoder cross-attends against 100 learned object queries. Each query output goes through two heads: an 81-way classifier (80 COCO classes + a "no object" class) and a 4-way bbox regressor predicting centre/size form (c_x, c_y, w, h) in normalised image coordinates. To draw the box in pixels we convert with x_min = (c_x − w/2) × W, y_min = (c_y − h/2) × H, x_max = (c_x + w/2) × W, y_max = (c_y + h/2) × H, where W and H are the image width and height in pixels. DETR's bipartite-matching training loss assigns each ground-truth object to exactly one of the 100 queries, so the model learns not to emit duplicate boxes and no NMS pass is needed (paper §3).
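As a small illustration of the centre/size to corner conversion, the sketch below maps a normalised (c_x, c_y, w, h) box to pixel corners. The names NormalisedBox, PixelBox, and toPixelBox are hypothetical, and clamping to the image bounds is an added assumption rather than something the page states.

```typescript
// Hypothetical types for illustration only.
interface NormalisedBox { cx: number; cy: number; w: number; h: number }
interface PixelBox { xMin: number; yMin: number; xMax: number; yMax: number }

function toPixelBox(b: NormalisedBox, imgW: number, imgH: number): PixelBox {
  // Assumption: clamp so a box that slightly overshoots stays drawable.
  const clamp = (v: number, hi: number): number => Math.min(Math.max(v, 0), hi);
  return {
    xMin: clamp((b.cx - b.w / 2) * imgW, imgW),
    yMin: clamp((b.cy - b.h / 2) * imgH, imgH),
    xMax: clamp((b.cx + b.w / 2) * imgW, imgW),
    yMax: clamp((b.cy + b.h / 2) * imgH, imgH),
  };
}

// A centred box covering half the width and a quarter of the height
// of a 640×480 image.
const example = toPixelBox({ cx: 0.5, cy: 0.5, w: 0.5, h: 0.25 }, 640, 480);
// example is { xMin: 160, yMin: 180, xMax: 480, yMax: 300 }
```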
The Fast and Balanced backbones are hustvl/yolos-tiny (6.5 M params, COCO AP 28.7) and hustvl/yolos-small (30.7 M params, COCO AP 36.1). YOLOS treats the image as a sequence of 16×16 patches plus 100 learned [DET] tokens; each token output is classified and regressed exactly like DETR. Although YOLOS is trained with the same bipartite-matching loss as DETR, the page additionally applies a per-class greedy non-maximum suppression at IoU 0.50 on the client to drop any near-duplicate boxes.
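Per-class greedy NMS can be sketched in a few lines. This is an illustrative version under stated assumptions, not the page's own code: the Detection shape and the name nmsPerClass are hypothetical, and whether a box exactly at the threshold is suppressed (strict > here) is a guess.

```typescript
// Hypothetical shapes for illustration.
interface Box { xMin: number; yMin: number; xMax: number; yMax: number }
interface Detection { label: string; score: number; box: Box }

function iou(a: Box, b: Box): number {
  const w = Math.max(0, Math.min(a.xMax, b.xMax) - Math.max(a.xMin, b.xMin));
  const h = Math.max(0, Math.min(a.yMax, b.yMax) - Math.max(a.yMin, b.yMin));
  const inter = w * h;
  const union =
    (a.xMax - a.xMin) * (a.yMax - a.yMin) +
    (b.xMax - b.xMin) * (b.yMax - b.yMin) -
    inter;
  return union > 0 ? inter / union : 0;
}

function nmsPerClass(dets: Detection[], iouThreshold: number = 0.5): Detection[] {
  // Visit boxes from most to least confident without mutating the input.
  const sorted = dets.slice().sort((a, b) => b.score - a.score);
  const kept: Detection[] = [];
  for (const d of sorted) {
    // Suppress only when an already-kept box of the SAME class overlaps heavily.
    const duplicate = kept.some(
      (k) => k.label === d.label && iou(k.box, d.box) > iouThreshold
    );
    if (!duplicate) kept.push(d);
  }
  return kept;
}
```

Because suppression is per class, a dog box that coincides with a person box survives; only a second, lower-scoring person box on the same region would be dropped.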
The standard IoU formula inter / (|A| + |B| − inter) is implemented twice in lib/data/ai-object-detector.ts — iou() uses Math.max and Math.min, and iouCrossCheck() walks corners explicitly. Both produce identical values to within floating-point ε, so a divergence between them would be a clear bug signal — the cross-check is the safety net behind every detection you see on this page.
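To make the double implementation concrete, here is one way the two functions might look. Only the names iou() and iouCrossCheck() come from the text; the bodies, the Box type, and the iouAgrees helper are assumptions written for illustration.

```typescript
// Hypothetical box shape; the real file may differ.
interface Box { xMin: number; yMin: number; xMax: number; yMax: number }

const area = (r: Box): number => (r.xMax - r.xMin) * (r.yMax - r.yMin);

// Version 1: clamp the overlap on each axis with Math.max / Math.min.
function iou(a: Box, b: Box): number {
  const w = Math.max(0, Math.min(a.xMax, b.xMax) - Math.max(a.xMin, b.xMin));
  const h = Math.max(0, Math.min(a.yMax, b.yMax) - Math.max(a.yMin, b.yMin));
  const inter = w * h;
  const union = area(a) + area(b) - inter;
  return union > 0 ? inter / union : 0;
}

// Version 2: test for disjoint boxes first, then pick each corner explicitly.
function iouCrossCheck(a: Box, b: Box): number {
  if (a.xMax <= b.xMin || b.xMax <= a.xMin || a.yMax <= b.yMin || b.yMax <= a.yMin) {
    return 0;
  }
  const left = a.xMin > b.xMin ? a.xMin : b.xMin;
  const right = a.xMax < b.xMax ? a.xMax : b.xMax;
  const top = a.yMin > b.yMin ? a.yMin : b.yMin;
  const bottom = a.yMax < b.yMax ? a.yMax : b.yMax;
  const inter = (right - left) * (bottom - top);
  const union = area(a) + area(b) - inter;
  return union > 0 ? inter / union : 0;
}

// Hypothetical helper: true when both versions agree within a tolerance.
function iouAgrees(a: Box, b: Box, eps: number = 1e-9): boolean {
  return Math.abs(iou(a, b) - iouCrossCheck(a, b)) <= eps;
}
```

For two 100×100 boxes offset by 10 px on each axis, the intersection is 90 × 90 = 8100 and the union is 10000 + 10000 − 8100 = 11900, so both functions should return about 0.68.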
3. Threshold, filter, and draw
The slider above defaults to 50%: anything below it is hidden from the canvas and table, but the raw detection list is kept in memory so dragging the slider re-renders instantly. The class filter accepts any subset of the 80 COCO categories and is applied before the max-detections cap, so a request for "only person and dog" still surfaces up to N person/dog boxes rather than wasting the cap on a low-scoring background detection. The per-class palette spaces hues by hueForClass(idx) = (idx × 137) mod 360, a rotation close to the golden angle (≈137.5°) that maps adjacent COCO IDs to widely separated hues. Label text colour is picked by a YIQ luminance test so it stays legible at WCAG AA on any background hue.
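The ordering of threshold, class filter, and cap, together with the palette rule, could look roughly like the sketch below. Only hueForClass and its formula come from the text; selectDetections, labelTextColour, the Detection shape, and the YIQ cut-off of 128 (a common convention) are assumptions.

```typescript
// Hypothetical detection shape for illustration.
interface Detection { classId: number; label: string; score: number }

// Threshold, then class filter, then cap: the filter runs BEFORE the cap,
// so the cap is spent only on classes the user asked for.
function selectDetections(
  raw: Detection[],
  threshold: number,
  allowedLabels: string[] | null, // null means "all classes"
  maxDetections: number
): Detection[] {
  return raw
    .filter((d) => d.score >= threshold)
    .filter((d) => allowedLabels === null || allowedLabels.indexOf(d.label) !== -1)
    .sort((a, b) => b.score - a.score)
    .slice(0, maxDetections);
}

// Formula quoted in the text: consecutive IDs land about 137° apart.
function hueForClass(idx: number): number {
  return (idx * 137) % 360;
}

// YIQ luma test on an 8-bit RGB background; 128 is an assumed cut-off.
function labelTextColour(r: number, g: number, b: number): "#000000" | "#ffffff" {
  const yiq = (r * 299 + g * 587 + b * 114) / 1000;
  return yiq >= 128 ? "#000000" : "#ffffff";
}
```

Because filter() returns a new array, the raw list is never mutated, which is what lets a slider drag re-run selectDetections on the in-memory results without another network call.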
Hard limits and privacy
The free Hugging Face Inference tier rejects payloads above ~10 MB, so the client caps file uploads at 8.0 MB. The 4096×4096 px dimension cap protects low-RAM devices during canvas rendering and keeps the export PNG under a sensible disk size. Image bytes leave the device exactly once — the single POST to this server's route. The route does not write to disk, log the file, or persist the response; nothing is kept after the response is sent back to your browser. The COCO 80-class space is fixed; tuk-tuks, sarongs, kottu, king coconut, and other locally relevant items fall outside it. The FAQ explains this honestly rather than mis-labelling them.
Worked examples
Frequently asked questions
Sources & references
- Carion et al. (2020) — End-to-End Object Detection with Transformers (ECCV 2020)
- Fang et al. (2021) — You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection (NeurIPS 2021)
- Hugging Face model card — facebook/detr-resnet-50
- Hugging Face model card — hustvl/yolos-tiny
- Hugging Face model card — hustvl/yolos-small
- Lin et al. (2014) — Microsoft COCO: Common Objects in Context
- Official COCO 2017 class list mirror
- Hugging Face Inference API — Object detection task
- Neubeck & Van Gool (2006) — Efficient Non-Maximum Suppression
- COCO Detection evaluation metrics (mAP / AP50 / AP75)
Model cards, papers, dataset and class list, and the inference API documentation were last cross-checked on 2026-05-12. The COCO 80-class set is documented in lib/data/ai-object-detector.ts (80 entries) and matches the model cards verbatim.