---
title: LabelPlayground
app_file: app.py
sdk: gradio
sdk_version: 6.8.0
---
# autolabel — OWLv2 + SAM2 labeling pipeline

Auto-label images using **OWLv2** (open-vocabulary object detection) and
optionally **SAM2** (instance segmentation), then export a COCO dataset ready
for model fine-tuning.

---

## Quickstart

```bash
# 1. Install
uv sync

# 2. Copy env file (sets PYTORCH_ENABLE_MPS_FALLBACK=1 for Apple Silicon)
cp .env.example .env

# 3. Launch
make app
```

Models download automatically on first use and are cached in
`~/.cache/huggingface`. Nothing else is written to the project directory.

| Model | Size | Purpose |
|-------|------|---------|
| `owlv2-large-patch14-finetuned` | ~700 MB | Text → bounding boxes |
| `sam2-hiera-tiny` | ~160 MB | Box prompts → pixel masks |

---

## How the app works

### Mode selector

Both tabs have a **Detection / Segmentation** radio button:

| Mode | What runs | COCO output |
|------|-----------|-------------|
| **Detection** | OWLv2 only | `bbox` + empty `segmentation: []` |
| **Segmentation** | OWLv2 → SAM2 | `bbox` + `segmentation` polygon list |
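
The per-object record in each mode can be sketched as follows, assuming the standard COCO annotation fields (`bbox` is `[x, y, width, height]` in pixels); the helper name is illustrative, not the app's actual builder:

```python
# Sketch: the per-object record each mode writes into coco_export.json.
# COCO bbox convention is [x, y, width, height] in pixels.

def make_annotation(ann_id, image_id, category_id, bbox, polygons=None):
    """Build one COCO annotation; polygons=None models Detection mode."""
    x, y, w, h = bbox
    return {
        "id": ann_id,
        "image_id": image_id,
        "category_id": category_id,
        "bbox": [x, y, w, h],
        "area": w * h,
        "iscrowd": 0,
        # Detection mode: empty list. Segmentation mode: list of flat
        # [x1, y1, x2, y2, ...] polygon rings from the SAM2 mask.
        "segmentation": polygons if polygons is not None else [],
    }

detection = make_annotation(1, 1, 1, [10, 20, 30, 40])
segmentation = make_annotation(2, 1, 1, [10, 20, 30, 40],
                               polygons=[[10, 20, 40, 20, 40, 60, 10, 60]])
print(detection["segmentation"])  # []
print(detection["area"])          # 1200
```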

### How Detection and Segmentation work

**Detection** uses [OWLv2](https://huggingface.co/google/owlv2-large-patch14-finetuned) — an
open-vocabulary object detector. You give it a text prompt ("cup, bottle") and it returns
bounding boxes with confidence scores. No fixed class list, no retraining needed.

**Segmentation** uses the **Grounded SAM2** pattern — two models chained together:

```
Text prompts ("cup, bottle")
        │
        ▼
     OWLv2          ← understands text, produces bounding boxes
        │
        ▼
  Bounding boxes
        │
        ▼
     SAM2           ← understands spatial prompts, produces pixel masks
        │
        ▼
  Masks + COCO polygons
```

SAM2 (`sam2-hiera-tiny`) is a *prompt-based* segmenter — it accepts box, point, or mask
prompts but has no concept of text or class names. It can't answer "find me a cup"; it
can only answer "segment the object inside this box." OWLv2 is the **grounding** step
that translates your words into coordinates SAM2 can act on.

Both models run in Segmentation mode. Detection mode skips SAM2 entirely.
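
The chain itself is simple to express. Here is a minimal sketch of the grounding pattern with the two models stubbed out as plain callables — the real pipeline would call OWLv2 via `transformers` and SAM2 via its predictor API; all names below are illustrative:

```python
# Sketch of the Grounded SAM2 chain: a text-conditioned detector produces
# boxes, and a prompt-based segmenter turns each box into a mask.
# `detector` and `segmenter` stand in for OWLv2 and SAM2.

def grounded_segment(image, prompts, detector, segmenter, threshold=0.1):
    """Run the detector on text prompts, then segment each surviving box."""
    detections = [d for d in detector(image, prompts) if d["score"] >= threshold]
    results = []
    for det in detections:
        # SAM2 never sees the text prompt -- only box coordinates.
        mask = segmenter(image, box=det["box"])
        results.append({"label": det["label"], "box": det["box"], "mask": mask})
    return results

# Stub models to show the data flow (the real models replace these).
def fake_detector(image, prompts):
    return [{"label": prompts[0], "box": [10, 10, 50, 50], "score": 0.9},
            {"label": prompts[1], "box": [0, 0, 5, 5], "score": 0.05}]

def fake_segmenter(image, box):
    return f"mask-for-{box}"

out = grounded_segment(None, ["cup", "bottle"], fake_detector, fake_segmenter)
print([r["label"] for r in out])  # the low-score detection is filtered out
```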

### 🧪 Test tab

Upload a single image, pick a mode, and type comma-separated object prompts.
Hit **Detect** to see an annotated preview alongside a results table (label,
confidence, bounding box). In Segmentation mode, pixel mask overlays are drawn
on top of the bounding boxes. Use this tab to dial in prompts and threshold
before a batch run — nothing is saved to disk.

### 📂 Batch tab

Upload multiple images and run the chosen mode on all of them at once. You get:

- An annotated **gallery** showing every image
- A **Download ZIP** button containing:
  - `coco_export.json` — COCO-format annotations ready for fine-tuning
  - `images/` — all images resized to your chosen training size

The size dropdown offers common YOLOX training resolutions (416 → 1024) plus
**As is** to keep the original dimensions. Coordinates in the COCO file match
the resized images exactly.
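
The coordinate bookkeeping behind that guarantee can be sketched in a few lines (the function name is illustrative, not the app's actual helper):

```python
# Sketch: scale a COCO [x, y, w, h] bbox when an image is resized, so the
# exported annotations line up with the resized pixels.

def scale_bbox(bbox, orig_size, new_size):
    """bbox is [x, y, w, h]; sizes are (width, height) tuples."""
    sx = new_size[0] / orig_size[0]
    sy = new_size[1] / orig_size[1]
    x, y, w, h = bbox
    return [x * sx, y * sy, w * sx, h * sy]

# A box in a 1280x960 original, exported at 640x480:
print(scale_bbox([100, 50, 200, 100], (1280, 960), (640, 480)))
# [50.0, 25.0, 100.0, 50.0]
```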

All artifacts live in a system temp directory — nothing is written to the project.

---

## Project layout

```
autolabel/
├── config.py       # Pydantic settings, auto device detection (CUDA → MPS → CPU)
├── detect.py       # OWLv2 inference — infer() shared by app + CLI
├── segment.py      # SAM2 integration — box prompts → masks + COCO polygons
├── export.py       # COCO JSON builder (no pycocotools); bbox + segmentation
├── finetune.py     # Fine-tuning loop (future use)
└── utils.py        # Shared helpers
scripts/
├── run_detection.py   # CLI: batch detect → data/detections/
├── export_coco.py     # CLI: build coco_export.json from data/labeled/
└── finetune_owlv2.py  # CLI: fine-tune OWLv2 (future use)
app.py              # Gradio web UI
```

---

## CLI workflow

Detection and export can be driven from the command line without the UI:

```bash
# Detect all images in data/raw/ → data/detections/
make detect

# Custom prompts
uv run python scripts/run_detection.py --prompts "cup,mug,bottle"

# Force re-run on already-processed images
uv run python scripts/run_detection.py --force

# Build COCO JSON from data/labeled/
make export
```

---

## Fine-tuning (future)

The fine-tuning infrastructure is already in place. Once you have a
`coco_export.json` from a Batch run:

```bash
make finetune
# or:
uv run python scripts/finetune_owlv2.py \
  --coco-json data/labeled/coco_export.json \
  --image-dir data/raw \
  --epochs 10
```

### Key hyperparameters

| Parameter | Default | Notes |
|-----------|---------|-------|
| Epochs | 10 | More epochs → higher overfit risk on small datasets |
| Learning rate | 1e-4 | Applied to the detection head |
| Gradient accumulation | 4 | Effective batch size multiplier |
| Unfreeze backbone | off | Also trains the vision encoder β€” needs more data |
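
Gradient accumulation trades memory for effective batch size: gradients from several micro-batches are combined before a single optimizer step. A framework-free toy illustration of why this is equivalent to one larger batch (the loss and gradient here are stand-ins, not the real training loop):

```python
# Toy illustration of gradient accumulation: averaging per-sample gradients
# over 4 micro-batches matches the gradient of one 4x-larger batch.

def grad(w, x, y):
    """Gradient of the squared error (w*x - y)**2 with respect to w."""
    return 2 * (w * x - y) * x

def accumulated_grad(w, batches):
    """Average each micro-batch's mean gradient over all micro-batches."""
    total = 0.0
    for batch in batches:
        total += sum(grad(w, x, y) for x, y in batch) / len(batch)
    return total / len(batches)

w = 0.5
micro = [[(1.0, 2.0)], [(2.0, 1.0)], [(0.5, 0.0)], [(3.0, 3.0)]]
big = [sample for batch in micro for sample in batch]

acc = accumulated_grad(w, micro)
full = sum(grad(w, x, y) for x, y in big) / len(big)
print(abs(acc - full) < 1e-12)  # True
```

Note the equivalence holds when the micro-batches are the same size, which is the usual setup.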

### Tips

- Start with **50–100 annotated images per class** minimum; 200–500 is better.
- Fine-tuned models are more confident — raise the threshold to 0.2–0.4.
- Leave the backbone frozen unless you have 500+ images per class.

---

## Prerequisites

| Tool | Version | Notes |
|------|---------|-------|
| Python | **3.11.x** | Managed by uv |
| [uv](https://docs.astral.sh/uv/) | latest | `curl -LsSf https://astral.sh/uv/install.sh \| sh` |
| CUDA toolkit | 11.8+ | Windows/Linux GPU users only |

**Apple Silicon:** `PYTORCH_ENABLE_MPS_FALLBACK=1` is pre-set in `.env.example`.

**Windows/CUDA:** remove `PYTORCH_ENABLE_MPS_FALLBACK` from `.env`. For a
specific CUDA build:

```powershell
uv pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118
uv sync
```

---

## Makefile targets

| Target | Description |
|--------|-------------|
| `make setup` | Install dependencies, copy `.env.example` |
| `make app` | Launch the Gradio UI |
| `make detect` | Batch detect via CLI → `data/detections/` |
| `make export` | Build COCO JSON via CLI |
| `make finetune` | Fine-tune OWLv2 via CLI |
| `make clean` | Delete generated JSONs (raw images untouched) |

---

## License

MIT