--- title: LabelPlayground app_file: app.py sdk: gradio sdk_version: 6.8.0 --- # autolabel — OWLv2 + SAM2 labeling pipeline Auto-label images using **OWLv2** (open-vocabulary object detection) and optionally **SAM2** (instance segmentation), then export a COCO dataset ready for model fine-tuning. --- ## Quickstart ```bash # 1. Install uv sync # 2. Copy env file (sets PYTORCH_ENABLE_MPS_FALLBACK=1 for Apple Silicon) cp .env.example .env # 3. Launch make app ``` Models download automatically on first use and are cached in `~/.cache/huggingface`. Nothing else is written to the project directory. | Model | Size | Purpose | |-------|------|---------| | `owlv2-large-patch14-finetuned` | ~700 MB | Text → bounding boxes | | `sam2-hiera-tiny` | ~160 MB | Box prompts → pixel masks | --- ## How the app works ### Mode selector Both tabs have a **Detection / Segmentation** radio button: | Mode | What runs | COCO output | |------|-----------|-------------| | **Detection** | OWLv2 only | `bbox` + empty `segmentation: []` | | **Segmentation** | OWLv2 → SAM2 | `bbox` + `segmentation` polygon list | ### How Detection and Segmentation work **Detection** uses [OWLv2](https://huggingface.co/google/owlv2-large-patch14-finetuned) — an open-vocabulary object detector. You give it a text prompt ("cup, bottle") and it returns bounding boxes with confidence scores. No fixed class list, no retraining needed. **Segmentation** uses the **Grounded SAM2** pattern — two models chained together: ``` Text prompts ("cup, bottle") │ ▼ OWLv2 ← understands text, produces bounding boxes │ ▼ Bounding boxes │ ▼ SAM2 ← understands spatial prompts, produces pixel masks │ ▼ Masks + COCO polygons ``` SAM2 (`sam2-hiera-tiny`) is a *prompt-based* segmenter — it accepts box, point, or mask prompts but has no concept of text or class names. It can't answer "find me a cup"; it can only answer "segment the object inside this box." 
OWLv2 is the **grounding** step that translates your words into coordinates SAM2 can act on. Both models run in Segmentation mode. Detection mode skips SAM2 entirely. ### 🧪 Test tab Upload a single image, pick a mode, and type comma-separated object prompts. Hit **Detect** to see an annotated preview alongside a results table (label, confidence, bounding box). In Segmentation mode, pixel mask overlays are drawn on top of the bounding boxes. Use this tab to dial in prompts and threshold before a batch run — nothing is saved to disk. ### 📂 Batch tab Upload multiple images and run the chosen mode on all of them at once. You get: - An annotated **gallery** showing every image - A **Download ZIP** button containing: - `coco_export.json` — COCO-format annotations ready for fine-tuning - `images/` — all images resized to your chosen training size The size dropdown offers common YOLOX training resolutions (416 → 1024) plus **As is** to keep the original dimensions. Coordinates in the COCO file match the resized images exactly. All artifacts live in a system temp directory — nothing is written to the project. 
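Because the exported coordinates must match the resized images, every bbox and polygon is scaled by the same ratios as the image. A minimal sketch of that mapping (the helper name is hypothetical; the repo's `export.py` may implement this differently):

```python
def rescale_annotation(bbox, polygon, orig_size, target_size):
    """Scale a COCO bbox [x, y, w, h] and a flat polygon [x1, y1, x2, y2, ...]
    from the original image size to the resized training size.

    Hypothetical helper -- not the repo's actual export code.
    """
    ow, oh = orig_size
    tw, th = target_size
    sx, sy = tw / ow, th / oh
    x, y, w, h = bbox
    new_bbox = [x * sx, y * sy, w * sx, h * sy]
    # Polygon coordinates alternate x, y -- scale each axis independently.
    new_poly = [v * (sx if i % 2 == 0 else sy) for i, v in enumerate(polygon)]
    return new_bbox, new_poly

# Example: a box on a 1280x960 original, remapped for a 640x640 training size.
bbox, poly = rescale_annotation(
    [100, 200, 300, 400], [100, 200, 400, 600], (1280, 960), (640, 640)
)
# bbox -> [50.0, 133.33..., 150.0, 266.66...]
```

Note that non-square targets change the aspect ratio, which is why the x and y scale factors are kept separate.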
--- ## Project layout ``` autolabel/ ├── config.py # Pydantic settings, auto device detection (CUDA → MPS → CPU) ├── detect.py # OWLv2 inference — infer() shared by app + CLI ├── segment.py # SAM2 integration — box prompts → masks + COCO polygons ├── export.py # COCO JSON builder (no pycocotools); bbox + segmentation ├── finetune.py # Fine-tuning loop (future use) └── utils.py # Shared helpers scripts/ ├── run_detection.py # CLI: batch detect → data/detections/ ├── export_coco.py # CLI: build coco_export.json from data/labeled/ └── finetune_owlv2.py # CLI: fine-tune OWLv2 (future use) app.py # Gradio web UI ``` --- ## CLI workflow Detection and export can be driven from the command line without the UI: ```bash # Detect all images in data/raw/ → data/detections/ make detect # Custom prompts uv run python scripts/run_detection.py --prompts "cup,mug,bottle" # Force re-run on already-processed images uv run python scripts/run_detection.py --force # Build COCO JSON from data/labeled/ make export ``` --- ## Fine-tuning (future) The fine-tuning infrastructure is already in place. Once you have a `coco_export.json` from a Batch run: ```bash make finetune # or: uv run python scripts/finetune_owlv2.py \ --coco-json data/labeled/coco_export.json \ --image-dir data/raw \ --epochs 10 ``` ### Key hyperparameters | Parameter | Default | Notes | |-----------|---------|-------| | Epochs | 10 | More epochs → higher overfit risk on small datasets | | Learning rate | 1e-4 | Applied to the detection head | | Gradient accumulation | 4 | Effective batch size multiplier | | Unfreeze backbone | off | Also trains the vision encoder — needs more data | ### Tips - Start with **50–100 annotated images per class** minimum; 200–500 is better. - Fine-tuned models are more confident — raise the threshold to 0.2–0.4. - Leave the backbone frozen unless you have 500+ images per class. 
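`make finetune` consumes the `coco_export.json` produced by a Batch run. For reference, a minimal sketch of that file's structure using the standard COCO fields (the concrete values below are illustrative, not taken from the repo):

```python
import json

# Standard COCO layout: images, categories, annotations.
# Illustrative values only -- the repo's export.py builds the real file.
coco = {
    "images": [
        {"id": 1, "file_name": "img_0001.jpg", "width": 640, "height": 640},
    ],
    "categories": [
        {"id": 1, "name": "cup"},
        {"id": 2, "name": "bottle"},
    ],
    "annotations": [
        {
            "id": 1,
            "image_id": 1,
            "category_id": 1,
            "bbox": [120.0, 85.0, 210.0, 240.0],  # [x, y, width, height]
            "area": 210.0 * 240.0,
            "iscrowd": 0,
            # Detection mode exports []; Segmentation mode, flat polygon lists.
            "segmentation": [[120.0, 85.0, 330.0, 85.0, 330.0, 325.0, 120.0, 325.0]],
        },
    ],
}

with open("coco_export.json", "w") as f:
    json.dump(coco, f, indent=2)
```

Category ids in `annotations` index into `categories`, and `image_id` indexes into `images` — the same linkage the fine-tuning script relies on when pairing boxes with files.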
--- ## Prerequisites | Tool | Version | Notes | |------|---------|-------| | Python | **3.11.x** | Managed by uv | | [uv](https://docs.astral.sh/uv/) | latest | `curl -LsSf https://astral.sh/uv/install.sh \| sh` | | CUDA toolkit | 11.8+ | Windows/Linux GPU users only | **Apple Silicon:** `PYTORCH_ENABLE_MPS_FALLBACK=1` is pre-set in `.env.example`. **Windows/CUDA:** remove `PYTORCH_ENABLE_MPS_FALLBACK` from `.env`. For a specific CUDA build: ```powershell uv pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118 uv sync ``` --- ## Makefile targets | Target | Description | |--------|-------------| | `make setup` | Install dependencies, copy `.env.example` | | `make app` | Launch the Gradio UI | | `make detect` | Batch detect via CLI → `data/detections/` | | `make export` | Build COCO JSON via CLI | | `make finetune` | Fine-tune OWLv2 via CLI | | `make clean` | Delete generated JSONs (raw images untouched) | --- ## License MIT