--- license: apache-2.0 language: - en library_name: data-label-factory tags: - vision - dataset-labeling - object-detection - apple-silicon - mlx - qwen - gemma - falcon-perception pipeline_tag: image-feature-extraction --- # data-label-factory A generic auto-labeling pipeline for vision datasets. Pick any object class in a YAML file, run one command, and end up with a clean COCO dataset reviewed in a browser. Designed to run entirely on a 16 GB Apple Silicon Mac. ``` gather → filter → label → verify → review (DDG/ (VLM YES/ (Falcon (VLM per- (canvas yt) NO) bbox) bbox) UI) ``` Two interchangeable VLM backends: | Backend | Model | Server | Pick when | |---|---|---|---| | `qwen` | Qwen 2.5-VL-3B 4-bit | `mlx_vlm.server` | You want fast YES/NO classification (~3.5s/img on M4) | | `gemma` | Gemma 4-26B-A4B 4-bit | `mac_tensor` (Expert Sniper) | You want richer reasoning + grounded segmentation in one server | The `label` stage always uses **Falcon Perception** for bbox grounding, served out of `mac_tensor` alongside Gemma. Falcon doesn't depend on the VLM choice — it's a separate ~600 MB model. --- ## What you get when this finishes For our reference run on a fiber-optic-drone detector: - **1421 source images** gathered from DuckDuckGo + Wikimedia + Openverse - **15,355 Falcon Perception bboxes** generated via the `label` stage - **11,928 / 15,355 (78%)** approved by Qwen 2.5-VL in the `verify` stage - **Reviewed in a browser** via the canvas web UI (`web/`) Per-query agreement between Falcon and Qwen on this dataset: `cable spool` 88%, `quadcopter` 81%, `drone` 80%, `fiber optic spool` 57%. You can reproduce all of this from this repo by following the steps below. --- ## 1. Install ```bash # Clone git clone https://github.com/walter-grace/data-label-factory.git cd data-label-factory # Install the CLI (registers `data_label_factory` on your $PATH) pip install -e . # (Optional) Add image-search dependencies for the `gather` stage pip install -e ".[gather]" # (Optional) Web UI deps — only if you want to review labels in a browser cd web && npm install && cd .. ``` Or install straight from GitHub without cloning first: ```bash pip install git+https://github.com/walter-grace/data-label-factory ``` The repo is also mirrored on Hugging Face at [`waltgrace/data-label-factory`](https://huggingface.co/waltgrace/data-label-factory). HF git serving doesn't play well with pip's partial-clone, so to install from HF use a regular clone: ```bash git clone https://huggingface.co/waltgrace/data-label-factory cd data-label-factory && pip install -e . ``` The factory CLI needs Python 3.10+. The backend servers (Qwen and/or Gemma) are installed separately — you only need the one(s) you plan to use. --- ## 2. Pick a backend and start it ### Option A — Qwen 2.5-VL (recommended for filter/verify) ```bash # Install mlx-vlm (Apple Silicon) pip install mlx-vlm # Start the OpenAI-compatible server python3 -m mlx_vlm.server \ --model mlx-community/Qwen2.5-VL-3B-Instruct-4bit \ --port 8291 ``` Verify it's alive: ```bash QWEN_URL=http://localhost:8291 data_label_factory status ``` ### Option B — Gemma 4 + Falcon (recommended for `label`) This is the [MLX Expert Sniper](https://github.com/walter-grace/mac-code) deploy package. It serves Gemma 4-26B-A4B (chat / `--vision`) **and** Falcon Perception (`--falcon`) from the same process at port 8500. Total ~5 GB resident on a 16 GB Mac via SSD-streamed experts. ```bash # Install + download model (one-time, ~13 GB) git clone https://github.com/walter-grace/mac-code cd mac-code/research/expert-sniper/distributed pip install -e . mlx mlx-vlm fastapi uvicorn pillow huggingface_hub python-multipart huggingface-cli download mlx-community/gemma-4-26b-a4b-it-4bit \ --local-dir ~/models/gemma4-source python3 split_gemma4.py \ --input ~/models/gemma4-source \ --output ~/models/gemma4-stream # Launch python3 -m mac_tensor ui --vision --falcon \ --stream-dir ~/models/gemma4-stream \ --source-dir ~/models/gemma4-source \ --port 8500 ``` Verify: ```bash GEMMA_URL=http://localhost:8500 data_label_factory status ``` You can run **both** servers at the same time. The factory CLI will use whichever backend you select per command via `--backend qwen|gemma`. --- ## 3. Define a project A project YAML is the *only* thing you need to write to onboard a new object class. Two examples ship in `projects/`: - [`projects/drones.yaml`](projects/drones.yaml) — fiber-optic drone detection (the original use case) - [`projects/stop-signs.yaml`](projects/stop-signs.yaml) — minimal smoke test Copy one and edit the four important fields: ```yaml project_name: fire-hydrants target_object: "fire hydrant" # templated into all prompts as {target_object} data_root: ~/data-label-factory/fire-hydrants buckets: positive/clear_view: queries: ["red fire hydrant", "yellow fire hydrant", "fire hydrant on sidewalk"] negative/other_street_objects: queries: ["mailbox", "parking meter", "trash can"] background/empty_streets: queries: ["empty city street", "suburban sidewalk"] falcon_queries: # what Falcon will look for during `label` - "fire hydrant" - "red metal post" backends: filter: qwen # default per stage; CLI --backend overrides label: gemma verify: qwen ``` Inspect a project before running anything: ```bash data_label_factory project --project projects/fire-hydrants.yaml ``` --- ## 4. Run the pipeline The four stages can be run individually or chained: ```bash PROJECT=projects/stop-signs.yaml # 4a. Gather — image search across buckets data_label_factory gather --project $PROJECT --max-per-query 30 # 4b. Filter — image-level YES/NO via your chosen VLM data_label_factory filter --project $PROJECT --backend qwen # 4c. Label — Falcon Perception bbox grounding (needs Gemma server up) data_label_factory label --project $PROJECT # 4d. Verify — per-bbox YES/NO via your chosen VLM # (verify is a TODO in the generic CLI today; runpod_falcon/verify_vlm.py # is the original drone-specific impl that the generic version will wrap.) # OR run gather → filter end-to-end: data_label_factory pipeline --project $PROJECT --backend qwen ``` Every command writes a timestamped folder under `experiments/` (relative to your current working directory) with the config, prompts, raw model answers, and JSON outputs. List them with: ```bash data_label_factory list ``` --- ## 5. Review the labels in a browser The `web/` directory is a Next.js + HTML5 Canvas review tool. It reads your labeled JSON straight from R2 (or local — see `web/app/api/labels/route.ts`) and renders the bboxes over each image with hover, click-to-select, scroll-zoom, and keyboard navigation. ```bash cd web PORT=3030 npm run dev # open http://localhost:3030/canvas ``` Features: - **Drag** to pan, **scroll** to zoom around the cursor, **double-click** to reset - **←/→** to navigate images, **click** a bbox to select it - **Color coding**: per-query color, dashed red for VLM rejections, white outline for active - **Bucket tabs** to filter by source bucket - **Per-image query summary** with YES/NO counts The grid view at `http://localhost:3030/` is the older shadcn-based browser with thumbnail-grid + per-bbox approve/reject buttons. --- ## Optional: GPU path via RunPod For larger runs (tens of thousands of images), there's an opt-in GPU path that orchestrates a RunPod pod, runs the same pipeline on it, and publishes the result straight to Hugging Face: ```bash pip install -e ".[runpod]" export RUNPOD_API_KEY=rpa_xxxxxxxxxx python3 -m data_label_factory.runpod pipeline \ --project projects/drones.yaml --gpu L40S \ --publish-to / ``` See [`data_label_factory/runpod/README.md`](data_label_factory/runpod/README.md) for the full architecture, costs (~$0.06 for the canonical 2,260-image run), and pod-vs-serverless trade-offs. Local Mac execution is still the default — runpod is just an option. --- ## Optional: open-set image identification The base pipeline produces COCO labels for training a closed-set **detector**. The opt-in `data_label_factory.identify` subpackage produces a CLIP retrieval **index** for open-set identification — given a known set of N reference images, identify which one a webcam frame is showing. **Use it when you have 1 image per class and want zero training time.** ```bash pip install -e ".[identify]" # Build an index from a folder of references python3 -m data_label_factory.identify index --refs ~/my-cards/ --out my.npz # Optional: contrastive fine-tune for fine-grained accuracy (~5 min on M4 MPS) python3 -m data_label_factory.identify train --refs ~/my-cards/ --out my-proj.pt python3 -m data_label_factory.identify index --refs ~/my-cards/ --out my.npz --projection my-proj.pt # Self-test the index python3 -m data_label_factory.identify verify --index my.npz # Serve as a mac_tensor-shaped /api/falcon endpoint python3 -m data_label_factory.identify serve --index my.npz --refs ~/my-cards/ # → web/canvas/live can hit it with FALCON_URL=http://localhost:8500/api/falcon ``` Built-in **rarity / variant detection** for free — if your filenames encode a suffix like `_pscr`, `_scr`, `_ur`, the matched filename's suffix becomes a separate `rarity` field on the response. See [`data_label_factory/identify/README.md`](data_label_factory/identify/README.md) for the full blueprint and concrete examples (trading cards, album covers, industrial parts, plant species, …). --- ## Configuration reference ### Environment variables | Var | Default | What | |---|---|---| | `QWEN_URL` | `http://localhost:8291` | Where the `mlx_vlm.server` lives | | `QWEN_MODEL_PATH` | `mlx-community/Qwen2.5-VL-3B-Instruct-4bit` | Model id sent in the OpenAI request | | `GEMMA_URL` | `http://localhost:8500` | Where `mac_tensor` lives (also serves Falcon) | Set them inline for one command, or `export` them in your shell. ### CLI flags ``` data_label_factory [flags] Commands: status Check both backends are alive project --project P Print a project YAML for inspection gather --project P Search the web for images across buckets filter --project P Image-level YES/NO via Qwen or Gemma label --project P Falcon Perception bbox grounding pipeline --project P gather → filter list Show experiments Common flags: --backend qwen|gemma Pick the VLM (filter, pipeline). Overrides project YAML. --limit N Process at most N images (smoke testing) --experiment NAME Reuse an existing experiment dir ``` ### Project YAML reference See [`projects/drones.yaml`](projects/drones.yaml) for the canonical, fully commented example. Required fields: `project_name`, `target_object`, `buckets`, `falcon_queries`. Everything else has defaults. --- ## How big is this thing? | Component | Disk | RAM (resident) | |---|---|---| | Factory CLI + Python deps | < 50 MB | negligible | | Qwen 2.5-VL-3B 4-bit | ~2.2 GB | ~2.5 GB | | Gemma 4-26B-A4B (Expert Sniper streaming) | ~13 GB on disk | ~3 GB | | Falcon Perception 0.6B | ~1.5 GB | ~1.5 GB | | Web UI dev server | ~300 MB node_modules | ~150 MB | | **Total (Gemma + Falcon path)** | **~17 GB** | **~5 GB** | Fits comfortably on a 16 GB Apple Silicon Mac. --- ## Known issues 1. **Gemma `/api/chat_vision` is unreliable for batch YES/NO prompts.** When the chained agent doesn't see a clear reason to call Falcon, it can stall. For the `filter` and `verify` stages, prefer `--backend qwen`. Gemma is rock solid for the `label` stage (which uses `/api/falcon` directly). 2. **The generic `verify` command is a TODO** — the original drone-specific `runpod_falcon/verify_vlm.py` works today, the generic wrapper is a small refactor still pending. 3. **Image search hits DDG rate limits** if you run with too high `--max-per-query`. 30-50 per query is comfortable; beyond ~100 you'll see throttling. --- ## Credits - **Falcon Perception** by TII — Apache 2.0 - **Gemma 4** by Google DeepMind — Apache 2.0 - **Qwen 2.5-VL** by Alibaba — Apache 2.0 - **MLX** by Apple Machine Learning Research — MIT - **mlx-vlm** by Prince Canuma — MIT - **MLX Expert Sniper** streaming engine by [walter-grace](https://github.com/walter-grace/mac-code)