--- title: LabelPlayground app_file: app.py sdk: gradio sdk_version: 6.8.0 --- # autolabel — OWLv2 + SAM2 labeling pipeline Auto-label images using **OWLv2** (open-vocabulary object detection) and optionally **SAM2** (instance segmentation), then export a COCO dataset ready for model fine-tuning. --- ## Quickstart ```bash # 1. Install uv sync # 2. Copy env file (sets PYTORCH_ENABLE_MPS_FALLBACK=1 for Apple Silicon) cp .env.example .env # 3. Launch make app ``` Models download automatically on first use and are cached in `~/.cache/huggingface`. Nothing else is written to the project directory. | Model | Size | Purpose | |-------|------|---------| | `owlv2-large-patch14-finetuned` | ~700 MB | Text → bounding boxes | | `sam2-hiera-tiny` | ~160 MB | Box prompts → pixel masks | --- ## How the app works ### Mode selector Both tabs have a **Detection / Segmentation** radio button: | Mode | What runs | COCO output | |------|-----------|-------------| | **Detection** | OWLv2 only | `bbox` + empty `segmentation: []` | | **Segmentation** | OWLv2 → SAM2 | `bbox` + `segmentation` polygon list | ### How Detection and Segmentation work **Detection** uses [OWLv2](https://huggingface.co/google/owlv2-large-patch14-finetuned) — an open-vocabulary object detector. You give it a text prompt ("cup, bottle") and it returns bounding boxes with confidence scores. No fixed class list, no retraining needed. **Segmentation** uses the **Grounded SAM2** pattern — two models chained together: ``` Text prompts ("cup, bottle") │ ▼ OWLv2 ← understands text, produces bounding boxes │ ▼ Bounding boxes │ ▼ SAM2 ← understands spatial prompts, produces pixel masks │ ▼ Masks + COCO polygons ``` SAM2 (`sam2-hiera-tiny`) is a *prompt-based* segmenter — it accepts box, point, or mask prompts but has no concept of text or class names. It can't answer "find me a cup"; it can only answer "segment the object inside this box." 
OWLv2 is the **grounding** step that translates your words into coordinates SAM2 can act on. Both models run in Segmentation mode. Detection mode skips SAM2 entirely. ### 🧪 Test tab Upload a single image, pick a mode, and type comma-separated object prompts. Hit **Detect** to see an annotated preview alongside a results table (label, confidence, bounding box). In Segmentation mode, pixel mask overlays are drawn on top of the bounding boxes. Use this tab to dial in prompts and threshold before a batch run — nothing is saved to disk. ### 📂 Batch tab Upload multiple images and run the chosen mode on all of them at once. You get: - An annotated **gallery** showing every image - A **Download ZIP** button containing: - `coco_export.json` — COCO-format annotations ready for fine-tuning - `images/` — all images resized to your chosen training size The size dropdown offers common YOLOX training resolutions (416 → 1024) plus **As is** to keep the original dimensions. Coordinates in the COCO file match the resized images exactly. All artifacts live in a system temp directory — nothing is written to the project. 
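Because the exported coordinates must match the resized images, every bbox and polygon is scaled by the same ratios as the image. A minimal sketch of that mapping (the helper name is hypothetical; the repo's `export.py` may implement this differently):

```python
def rescale_annotation(bbox, polygon, orig_size, target_size):
    """Scale a COCO bbox [x, y, w, h] and a flat polygon [x1, y1, x2, y2, ...]
    from the original image size to the resized training size.

    Hypothetical helper -- not the repo's actual export code.
    """
    ow, oh = orig_size
    tw, th = target_size
    sx, sy = tw / ow, th / oh
    x, y, w, h = bbox
    new_bbox = [x * sx, y * sy, w * sx, h * sy]
    # Polygon coordinates alternate x, y -- scale each axis independently.
    new_poly = [v * (sx if i % 2 == 0 else sy) for i, v in enumerate(polygon)]
    return new_bbox, new_poly

# Example: a box on a 1280x960 original, remapped for a 640x640 training size.
bbox, poly = rescale_annotation(
    [100, 200, 300, 400], [100, 200, 400, 600], (1280, 960), (640, 640)
)
# bbox -> [50.0, 133.33..., 150.0, 266.66...]
```

Note that non-square targets change the aspect ratio, which is why the x and y scale factors are kept separate.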
--- ## Project layout ``` autolabel/ ├── config.py # Pydantic settings, auto device detection (CUDA → MPS → CPU) ├── detect.py # OWLv2 inference — infer() shared by app + CLI ├── segment.py # SAM2 integration — box prompts → masks + COCO polygons ├── export.py # COCO JSON builder (no pycocotools); bbox + segmentation ├── finetune.py # Fine-tuning loop (future use) └── utils.py # Shared helpers scripts/ ├── run_detection.py # CLI: batch detect → data/detections/ ├── export_coco.py # CLI: build coco_export.json from data/labeled/ └── finetune_owlv2.py # CLI: fine-tune OWLv2 (future use) app.py # Gradio web UI ``` --- ## CLI workflow Detection and export can be driven from the command line without the UI: ```bash # Detect all images in data/raw/ → data/detections/ make detect # Custom prompts uv run python scripts/run_detection.py --prompts "cup,mug,bottle" # Force re-run on already-processed images uv run python scripts/run_detection.py --force # Build COCO JSON from data/labeled/ make export ``` --- ## Fine-tuning (future) The fine-tuning infrastructure is already in place. Once you have a `coco_export.json` from a Batch run: ```bash make finetune # or: uv run python scripts/finetune_owlv2.py \ --coco-json data/labeled/coco_export.json \ --image-dir data/raw \ --epochs 10 ``` ### Key hyperparameters | Parameter | Default | Notes | |-----------|---------|-------| | Epochs | 10 | More epochs → higher overfit risk on small datasets | | Learning rate | 1e-4 | Applied to the detection head | | Gradient accumulation | 4 | Effective batch size multiplier | | Unfreeze backbone | off | Also trains the vision encoder — needs more data | ### Tips - Start with **50–100 annotated images per class** minimum; 200–500 is better. - Fine-tuned models are more confident — raise the threshold to 0.2–0.4. - Leave the backbone frozen unless you have 500+ images per class. 
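`make finetune` consumes the `coco_export.json` produced by a Batch run. For reference, a minimal sketch of that file's structure using the standard COCO fields (the concrete values below are illustrative, not taken from the repo):

```python
import json

# Standard COCO layout: images, categories, annotations.
# Illustrative values only -- the repo's export.py builds the real file.
coco = {
    "images": [
        {"id": 1, "file_name": "img_0001.jpg", "width": 640, "height": 640},
    ],
    "categories": [
        {"id": 1, "name": "cup"},
        {"id": 2, "name": "bottle"},
    ],
    "annotations": [
        {
            "id": 1,
            "image_id": 1,
            "category_id": 1,
            "bbox": [120.0, 85.0, 210.0, 240.0],  # [x, y, width, height]
            "area": 210.0 * 240.0,
            "iscrowd": 0,
            # Detection mode exports []; Segmentation mode, flat polygon lists.
            "segmentation": [[120.0, 85.0, 330.0, 85.0, 330.0, 325.0, 120.0, 325.0]],
        },
    ],
}

with open("coco_export.json", "w") as f:
    json.dump(coco, f, indent=2)
```

Category ids in `annotations` index into `categories`, and `image_id` indexes into `images` — the same linkage the fine-tuning script relies on when pairing boxes with files.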
--- ## Prerequisites | Tool | Version | Notes | |------|---------|-------| | Python | **3.11.x** | Managed by uv | | [uv](https://docs.astral.sh/uv/) | latest | `curl -LsSf https://astral.sh/uv/install.sh \| sh` | | CUDA toolkit | 11.8+ | Windows/Linux GPU users only | **Apple Silicon:** `PYTORCH_ENABLE_MPS_FALLBACK=1` is pre-set in `.env.example`. **Windows/CUDA:** remove `PYTORCH_ENABLE_MPS_FALLBACK` from `.env`. For a specific CUDA build: ```powershell uv pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118 uv sync ``` --- ## Makefile targets | Target | Description | |--------|-------------| | `make setup` | Install dependencies, copy `.env.example` | | `make app` | Launch the Gradio UI | | `make detect` | Batch detect via CLI → `data/detections/` | | `make export` | Build COCO JSON via CLI | | `make finetune` | Fine-tune OWLv2 via CLI | | `make clean` | Delete generated JSONs (raw images untouched) | --- ## License MIT