---
title: LabelPlayground
app_file: app.py
sdk: gradio
sdk_version: 6.8.0
---
# autolabel: OWLv2 + SAM2 labeling pipeline
Auto-label images using OWLv2 (open-vocabulary object detection) and optionally SAM2 (instance segmentation), then export a COCO dataset ready for model fine-tuning.
## Quickstart
```bash
# 1. Install
uv sync

# 2. Copy env file (sets PYTORCH_ENABLE_MPS_FALLBACK=1 for Apple Silicon)
cp .env.example .env

# 3. Launch
make app
```
Models download automatically on first use and are cached in `~/.cache/huggingface`. Nothing else is written to the project directory.
| Model | Size | Purpose |
|---|---|---|
| `owlv2-large-patch14-finetuned` | ~700 MB | Text → bounding boxes |
| `sam2-hiera-tiny` | ~160 MB | Box prompts → pixel masks |
## How the app works
### Mode selector
Both tabs have a Detection / Segmentation radio button:
| Mode | What runs | COCO output |
|---|---|---|
| Detection | OWLv2 only | `bbox` + empty `segmentation: []` |
| Segmentation | OWLv2 → SAM2 | `bbox` + `segmentation` polygon list |
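The two modes differ only in the `segmentation` field of each COCO annotation. A minimal sketch of what the exporter might emit (the helper name is illustrative; the field names follow the COCO spec):

```python
def make_annotation(ann_id, image_id, category_id, bbox_xywh, polygons=None):
    """Build one COCO annotation dict.

    Detection mode passes polygons=None, so segmentation stays [].
    Segmentation mode passes a list of flat [x1, y1, x2, y2, ...] polygons.
    """
    x, y, w, h = bbox_xywh
    return {
        "id": ann_id,
        "image_id": image_id,
        "category_id": category_id,
        "bbox": [x, y, w, h],           # COCO boxes are [x, y, width, height]
        "area": w * h,                  # box area; polygon area would be tighter in seg mode
        "segmentation": polygons or [],
        "iscrowd": 0,
    }

# Detection mode: empty segmentation list
det = make_annotation(1, 1, 1, [10, 20, 30, 40])
# Segmentation mode: one triangle polygon
seg = make_annotation(2, 1, 1, [10, 20, 30, 40], polygons=[[10, 20, 40, 20, 10, 60]])
```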
### How Detection and Segmentation work
Detection uses OWLv2, an open-vocabulary object detector. You give it a text prompt ("cup, bottle") and it returns bounding boxes with confidence scores. No fixed class list, no retraining needed.
Segmentation uses the Grounded SAM2 pattern: two models chained together.

```
Text prompts ("cup, bottle")
        │
        ▼
OWLv2 – understands text, produces bounding boxes
        │
        ▼
Bounding boxes
        │
        ▼
SAM2 – understands spatial prompts, produces pixel masks
        │
        ▼
Masks + COCO polygons
```
SAM2 (`sam2-hiera-tiny`) is a prompt-based segmenter: it accepts box, point, or mask prompts but has no concept of text or class names. It can't answer "find me a cup"; it can only answer "segment the object inside this box." OWLv2 is the grounding step that translates your words into coordinates SAM2 can act on.
Both models run in Segmentation mode. Detection mode skips SAM2 entirely.
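The glue between the stages is mostly coordinate plumbing: OWLv2 emits corner-format boxes (x0, y0, x1, y1), SAM2 accepts that same format as a box prompt, but the COCO file needs (x, y, width, height). A conversion helper of the kind such a pipeline plausibly uses (illustrative, not the app's actual code), including the clamping needed when a detector overshoots the image edge:

```python
def xyxy_to_coco(box, img_w, img_h):
    """Convert a corner-format (x0, y0, x1, y1) box to COCO [x, y, w, h],
    clipped to the image bounds."""
    x0, y0, x1, y1 = box
    # Clamp: detectors can overshoot the frame by a few pixels.
    x0, y0 = max(0.0, x0), max(0.0, y0)
    x1, y1 = min(float(img_w), x1), min(float(img_h), y1)
    return [x0, y0, max(0.0, x1 - x0), max(0.0, y1 - y0)]

print(xyxy_to_coco((-5.0, 10.0, 120.0, 90.0), img_w=100, img_h=100))
# [0.0, 10.0, 100.0, 80.0]
```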
## 🧪 Test tab
Upload a single image, pick a mode, and type comma-separated object prompts. Hit Detect to see an annotated preview alongside a results table (label, confidence, bounding box). In Segmentation mode, pixel mask overlays are drawn on top of the bounding boxes. Use this tab to dial in prompts and the confidence threshold before a batch run; nothing is saved to disk.
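The prompt box and threshold slider amount to two small preprocessing steps; a hedged sketch of how the tab plausibly handles them (`parse_prompts` and `filter_detections` are hypothetical names):

```python
def parse_prompts(text):
    """Split 'cup, bottle' into ['cup', 'bottle'], dropping blank entries."""
    return [p.strip() for p in text.split(",") if p.strip()]

def filter_detections(dets, threshold):
    """Keep only (label, score, box) tuples at or above the confidence threshold."""
    return [d for d in dets if d[1] >= threshold]

prompts = parse_prompts("cup, bottle, ")
dets = [("cup", 0.42, (1, 2, 3, 4)), ("bottle", 0.08, (5, 6, 7, 8))]
kept = filter_detections(dets, threshold=0.1)
```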
## 📦 Batch tab
Upload multiple images and run the chosen mode on all of them at once. You get:
- An annotated gallery showing every image
- A **Download ZIP** button containing:
  - `coco_export.json`: COCO-format annotations ready for fine-tuning
  - `images/`: all images resized to your chosen training size
The size dropdown offers common YOLOX training resolutions (416 to 1024) plus **As is** to keep the original dimensions. Coordinates in the COCO file match the resized images exactly.
All artifacts live in a system temp directory; nothing is written to the project.
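"Coordinates match the resized images" implies every box is rescaled by the same factors as the pixels. A sketch of that bookkeeping (the function name is illustrative):

```python
def scale_bbox(bbox_xywh, orig_size, new_size):
    """Rescale a COCO [x, y, w, h] box from an orig (w, h) image to a new (w, h) image."""
    sx = new_size[0] / orig_size[0]
    sy = new_size[1] / orig_size[1]
    x, y, w, h = bbox_xywh
    return [x * sx, y * sy, w * sx, h * sy]

# A box on a 1280x960 original, exported at 640x640:
# x and width halve (640/1280); y and height scale by 640/960.
box = scale_bbox([100, 50, 200, 300], orig_size=(1280, 960), new_size=(640, 640))
```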
## Project layout
```
autolabel/
├── config.py          # Pydantic settings, auto device detection (CUDA → MPS → CPU)
├── detect.py          # OWLv2 inference; infer() shared by app + CLI
├── segment.py         # SAM2 integration; box prompts → masks + COCO polygons
├── export.py          # COCO JSON builder (no pycocotools); bbox + segmentation
├── finetune.py        # Fine-tuning loop (future use)
└── utils.py           # Shared helpers
scripts/
├── run_detection.py   # CLI: batch detect → data/detections/
├── export_coco.py     # CLI: build coco_export.json from data/labeled/
└── finetune_owlv2.py  # CLI: fine-tune OWLv2 (future use)
app.py                 # Gradio web UI
```
## CLI workflow
Detection and export can be driven from the command line without the UI:
```bash
# Detect all images in data/raw/ → data/detections/
make detect

# Custom prompts
uv run python scripts/run_detection.py --prompts "cup,mug,bottle"

# Force re-run on already-processed images
uv run python scripts/run_detection.py --force

# Build COCO JSON from data/labeled/
make export
```
## Fine-tuning (future)
The fine-tuning infrastructure is already in place. Once you have a `coco_export.json` from a Batch run:
```bash
make finetune
# or:
uv run python scripts/finetune_owlv2.py \
  --coco-json data/labeled/coco_export.json \
  --image-dir data/raw \
  --epochs 10
```
### Key hyperparameters
| Parameter | Default | Notes |
|---|---|---|
| Epochs | 10 | More epochs → higher overfit risk on small datasets |
| Learning rate | 1e-4 | Applied to the detection head |
| Gradient accumulation | 4 | Effective batch-size multiplier |
| Unfreeze backbone | off | Also trains the vision encoder; needs more data |
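Gradient accumulation multiplies the effective batch size without extra memory: gradients from N micro-batches are averaged before each optimizer step. A framework-free numeric check that accumulating per-sample gradients of a mean-squared-error loss reproduces the full-batch gradient (toy data, illustrative only):

```python
def grad_mse(w, xs, ys):
    """d/dw of mean((w*x - y)^2) over a batch."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n

w = 0.5
xs, ys = [1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 5.0, 7.0]

# One step on the full batch of 4:
full = grad_mse(w, xs, ys)

# Same data as 4 micro-batches of 1, gradients averaged before the "step":
accum = sum(grad_mse(w, [x], [y]) for x, y in zip(xs, ys)) / 4
```

The two gradients agree, which is why accumulation over 4 steps behaves like a 4x larger batch.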
### Tips
- Start with 50β100 annotated images per class minimum; 200β500 is better.
- Fine-tuned models are more confident; raise the threshold to 0.2–0.4.
- Leave the backbone frozen unless you have 500+ images per class.
## Prerequisites
| Tool | Version | Notes |
|---|---|---|
| Python | 3.11.x | Managed by uv |
| uv | latest | `curl -LsSf https://astral.sh/uv/install.sh \| sh` |
| CUDA toolkit | 11.8+ | Windows/Linux GPU users only |
**Apple Silicon:** `PYTORCH_ENABLE_MPS_FALLBACK=1` is pre-set in `.env.example`.

**Windows/CUDA:** remove `PYTORCH_ENABLE_MPS_FALLBACK` from `.env`. For a specific CUDA build:
```bash
uv pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118
uv sync
```
## Makefile targets
| Target | Description |
|---|---|
| `make setup` | Install dependencies, copy `.env.example` |
| `make app` | Launch the Gradio UI |
| `make detect` | Batch detect via CLI → `data/detections/` |
| `make export` | Build COCO JSON via CLI |
| `make finetune` | Fine-tune OWLv2 via CLI |
| `make clean` | Delete generated JSONs (raw images untouched) |
## License
MIT