data-label-factory

A generic auto-labeling pipeline for vision datasets. Pick any object class in a YAML file, run one command, and end up with a clean COCO dataset reviewed in a browser. Designed to run entirely on a 16 GB Apple Silicon Mac.

gather  →  filter  →  label  →  verify  →  review
 (DDG/   (VLM YES/   (Falcon  (VLM per-   (canvas
  yt)     NO)         bbox)    bbox)       UI)

Two interchangeable VLM backends:

Backend  Model                  Server                      Pick when
qwen     Qwen 2.5-VL-3B 4-bit   mlx_vlm.server              You want fast YES/NO classification (~3.5 s/img on M4)
gemma    Gemma 4-26B-A4B 4-bit  mac_tensor (Expert Sniper)  You want richer reasoning + grounded segmentation in one server

The label stage always uses Falcon Perception for bbox grounding, served out of mac_tensor alongside Gemma. Falcon doesn't depend on the VLM choice; it's a separate ~600 MB model.
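The pipeline's end product is a COCO dataset, so every grounded box ultimately lands as a COCO annotation. As an illustration of that target format only (the Falcon-side output shape here is an assumption, not the actual API), a corner-format box can be converted like this:

```python
# Hypothetical sketch: converting corner-format bboxes (x1, y1, x2, y2),
# as grounding models commonly emit, into COCO annotation dicts.
# The Falcon output shape is assumed for illustration.

def to_coco_annotations(boxes, image_id, category_id):
    """boxes: list of (x1, y1, x2, y2) tuples in pixel coordinates."""
    anns = []
    for i, (x1, y1, x2, y2) in enumerate(boxes):
        w, h = x2 - x1, y2 - y1
        anns.append({
            "id": i,
            "image_id": image_id,
            "category_id": category_id,
            "bbox": [x1, y1, w, h],   # COCO uses [x, y, width, height]
            "area": w * h,
            "iscrowd": 0,
        })
    return anns

anns = to_coco_annotations([(10, 20, 110, 220)], image_id=1, category_id=1)
print(anns[0]["bbox"])  # [10, 20, 100, 200]
```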


What you get when this finishes

For our reference run on a fiber-optic-drone detector:

  • 1,421 source images gathered from DuckDuckGo + Wikimedia + Openverse
  • 15,355 Falcon Perception bboxes generated via the label stage
  • 11,928 / 15,355 (78%) approved by Qwen 2.5-VL in the verify stage
  • Reviewed in a browser via the canvas web UI (web/)

Per-query agreement between Falcon and Qwen on this dataset: cable spool 88%, quadcopter 81%, drone 80%, fiber optic spool 57%.
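The percentages above are simply approved / generated, rounded to a whole percent. As a sanity check against the counts from the reference run:

```python
# Reproduce the 78% verify-stage approval rate from the raw counts above.
def agreement(approved: int, total: int) -> int:
    return round(100 * approved / total)

print(agreement(11_928, 15_355))  # 78
```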

You can reproduce all of this from this repo by following the steps below.


1. Install

# Clone
git clone https://github.com/walter-grace/data-label-factory.git
cd data-label-factory

# Install the CLI (registers `data_label_factory` on your $PATH)
pip install -e .

# (Optional) Add image-search dependencies for the `gather` stage
pip install -e ".[gather]"

# (Optional) Web UI deps - only if you want to review labels in a browser
cd web && npm install && cd ..

Or install straight from GitHub without cloning first:

pip install git+https://github.com/walter-grace/data-label-factory

The repo is also mirrored on Hugging Face at waltgrace/data-label-factory. HF git serving doesn't play well with pip's partial-clone, so to install from HF use a regular clone:

git clone https://huggingface.co/waltgrace/data-label-factory
cd data-label-factory && pip install -e .

The factory CLI needs Python 3.10+. The backend servers (Qwen and/or Gemma) are installed separately β€” you only need the one(s) you plan to use.


2. Pick a backend and start it

Option A: Qwen 2.5-VL (recommended for filter/verify)

# Install mlx-vlm (Apple Silicon)
pip install mlx-vlm

# Start the OpenAI-compatible server
python3 -m mlx_vlm.server \
  --model mlx-community/Qwen2.5-VL-3B-Instruct-4bit \
  --port 8291

Verify it's alive:

QWEN_URL=http://localhost:8291 data_label_factory status

Option B: Gemma 4 + Falcon (recommended for label)

This is the MLX Expert Sniper deploy package. It serves Gemma 4-26B-A4B (chat / --vision) and Falcon Perception (--falcon) from the same process at port 8500. Total ~5 GB resident on a 16 GB Mac via SSD-streamed experts.

# Install + download model (one-time, ~13 GB)
git clone https://github.com/walter-grace/mac-code
cd mac-code/research/expert-sniper/distributed
pip install -e . mlx mlx-vlm fastapi uvicorn pillow huggingface_hub python-multipart

huggingface-cli download mlx-community/gemma-4-26b-a4b-it-4bit \
  --local-dir ~/models/gemma4-source
python3 split_gemma4.py \
  --input  ~/models/gemma4-source \
  --output ~/models/gemma4-stream

# Launch
python3 -m mac_tensor ui --vision --falcon \
  --stream-dir ~/models/gemma4-stream \
  --source-dir ~/models/gemma4-source \
  --port 8500

Verify:

GEMMA_URL=http://localhost:8500 data_label_factory status

You can run both servers at the same time. The factory CLI will use whichever backend you select per command via --backend qwen|gemma.
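For the Qwen path, the filter and verify stages talk to the OpenAI-compatible endpoint that mlx_vlm.server exposes. A hedged sketch of what such a request payload might look like; the prompt wording and the helper name are assumptions, only the overall chat-completions shape with an inline base64 image is standard:

```python
import base64

def build_filter_payload(image_bytes, target_object,
                         model="mlx-community/Qwen2.5-VL-3B-Instruct-4bit"):
    """Build an OpenAI-style chat.completions payload with an inline image.

    The YES/NO prompt text is illustrative; the factory's real prompts
    are templated from the project YAML.
    """
    b64 = base64.b64encode(image_bytes).decode()
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Does this image contain a {target_object}? Answer YES or NO."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
        "max_tokens": 4,
    }

payload = build_filter_payload(b"\xff\xd8...", "fire hydrant")
# POST this as JSON to f"{QWEN_URL}/v1/chat/completions"
```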


3. Define a project

A project YAML is the only thing you need to write to onboard a new object class. Two examples ship in projects/.

Copy one and edit the four important fields:

project_name:  fire-hydrants
target_object: "fire hydrant"            # templated into all prompts as {target_object}
data_root:     ~/data-label-factory/fire-hydrants

buckets:
  positive/clear_view:
    queries: ["red fire hydrant", "yellow fire hydrant", "fire hydrant on sidewalk"]
  negative/other_street_objects:
    queries: ["mailbox", "parking meter", "trash can"]
  background/empty_streets:
    queries: ["empty city street", "suburban sidewalk"]

falcon_queries:                          # what Falcon will look for during `label`
  - "fire hydrant"
  - "red metal post"

backends:
  filter: qwen                           # default per stage; CLI --backend overrides
  label:  gemma
  verify: qwen

Inspect a project before running anything:

data_label_factory project --project projects/fire-hydrants.yaml
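At minimum, inspection boils down to checking the required fields (project_name, target_object, buckets, falcon_queries) and that each bucket has queries. A minimal validation sketch in plain Python, assuming the YAML has already been parsed into a dict; the real loader may check more:

```python
# Minimal project-config validation sketch (the real loader may differ).
REQUIRED = ("project_name", "target_object", "buckets", "falcon_queries")

def validate_project(cfg: dict) -> bool:
    missing = [k for k in REQUIRED if k not in cfg]
    if missing:
        raise ValueError(f"project YAML missing fields: {missing}")
    # every bucket needs at least one search query
    for bucket, spec in cfg["buckets"].items():
        if not spec.get("queries"):
            raise ValueError(f"bucket {bucket!r} has no queries")
    return True

cfg = {
    "project_name": "fire-hydrants",
    "target_object": "fire hydrant",
    "buckets": {"positive/clear_view": {"queries": ["red fire hydrant"]}},
    "falcon_queries": ["fire hydrant"],
}
print(validate_project(cfg))  # True
```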

4. Run the pipeline

The four stages can be run individually or chained:

PROJECT=projects/stop-signs.yaml

# 4a. Gather - image search across buckets
data_label_factory gather  --project $PROJECT --max-per-query 30

# 4b. Filter - image-level YES/NO via your chosen VLM
data_label_factory filter  --project $PROJECT --backend qwen

# 4c. Label - Falcon Perception bbox grounding (needs Gemma server up)
data_label_factory label   --project $PROJECT

# 4d. Verify - per-bbox YES/NO via your chosen VLM
#     (verify is a TODO in the generic CLI today; runpod_falcon/verify_vlm.py
#      is the original drone-specific impl that the generic version will wrap.)

# OR run gather → filter end-to-end:
data_label_factory pipeline --project $PROJECT --backend qwen

Every command writes a timestamped folder under experiments/ (relative to your current working directory) with the config, prompts, raw model answers, and JSON outputs. List them with:

data_label_factory list
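The listing itself is just a directory walk. A hedged sketch with the stdlib, assuming timestamped directory names directly under experiments/ (the exact naming scheme is an assumption):

```python
# Sketch of listing experiment folders the way `data_label_factory list`
# might; the real command likely also prints per-run metadata.
from pathlib import Path

def list_experiments(root: str = "experiments") -> list[str]:
    p = Path(root)
    if not p.is_dir():
        return []
    return sorted(d.name for d in p.iterdir() if d.is_dir())

for name in list_experiments():
    print(name)
```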

5. Review the labels in a browser

The web/ directory is a Next.js + HTML5 Canvas review tool. It reads your labeled JSON straight from R2 (or locally; see web/app/api/labels/route.ts) and renders the bboxes over each image with hover, click-to-select, scroll-zoom, and keyboard navigation.

cd web
PORT=3030 npm run dev
# open http://localhost:3030/canvas

Features:

  • Drag to pan, scroll to zoom around the cursor, double-click to reset
  • ←/→ to navigate images, click a bbox to select it
  • Color coding: per-query color, dashed red for VLM rejections, white outline for active
  • Bucket tabs to filter by source bucket
  • Per-image query summary with YES/NO counts

The grid view at http://localhost:3030/ is the older shadcn-based browser with thumbnail-grid + per-bbox approve/reject buttons.


Optional: GPU path via RunPod

For larger runs (tens of thousands of images), there's an opt-in GPU path that orchestrates a RunPod pod, runs the same pipeline on it, and publishes the result straight to Hugging Face:

pip install -e ".[runpod]"
export RUNPOD_API_KEY=rpa_xxxxxxxxxx
python3 -m data_label_factory.runpod pipeline \
    --project projects/drones.yaml --gpu L40S \
    --publish-to <you>/<dataset>

See data_label_factory/runpod/README.md for the full architecture, costs (~$0.06 for the canonical 2,260-image run), and pod-vs-serverless trade-offs. Local Mac execution is still the default; runpod is just an option.


Optional: open-set image identification

The base pipeline produces COCO labels for training a closed-set detector. The opt-in data_label_factory.identify subpackage produces a CLIP retrieval index for open-set identification: given a known set of N reference images, identify which one a webcam frame is showing. Use it when you have one image per class and want zero training time.

pip install -e ".[identify]"

# Build an index from a folder of references
python3 -m data_label_factory.identify index --refs ~/my-cards/ --out my.npz

# Optional: contrastive fine-tune for fine-grained accuracy (~5 min on M4 MPS)
python3 -m data_label_factory.identify train --refs ~/my-cards/ --out my-proj.pt
python3 -m data_label_factory.identify index --refs ~/my-cards/ --out my.npz --projection my-proj.pt

# Self-test the index
python3 -m data_label_factory.identify verify --index my.npz

# Serve as a mac_tensor-shaped /api/falcon endpoint
python3 -m data_label_factory.identify serve --index my.npz --refs ~/my-cards/
# → web/canvas/live can hit it with FALCON_URL=http://localhost:8500/api/falcon

Built-in rarity / variant detection comes for free: if your filenames encode a suffix like _pscr, _scr, or _ur, the matched filename's suffix becomes a separate rarity field on the response. See data_label_factory/identify/README.md for the full blueprint and concrete examples (trading cards, album covers, industrial parts, plant species, …).
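The suffix convention can be sketched in a few lines (helper name and exact suffix set are illustrative; the shipped implementation may differ):

```python
# Sketch of the filename-suffix rarity convention described above:
# a trailing _pscr / _scr / _ur token becomes a separate rarity field.
RARITY_SUFFIXES = {"pscr", "scr", "ur"}

def split_rarity(stem: str):
    """'charizard_ur' -> ('charizard', 'ur'); unknown suffix -> (stem, None)."""
    base, _, suffix = stem.rpartition("_")
    if base and suffix in RARITY_SUFFIXES:
        return base, suffix
    return stem, None

print(split_rarity("charizard_ur"))  # ('charizard', 'ur')
print(split_rarity("plain_photo"))   # ('plain_photo', None)
```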


Configuration reference

Environment variables

Var              Default                                     What
QWEN_URL         http://localhost:8291                       Where the mlx_vlm.server lives
QWEN_MODEL_PATH  mlx-community/Qwen2.5-VL-3B-Instruct-4bit   Model id sent in the OpenAI request
GEMMA_URL        http://localhost:8500                       Where mac_tensor lives (also serves Falcon)

Set them inline for one command, or export them in your shell.

CLI flags

data_label_factory <command> [flags]

Commands:
  status                      Check both backends are alive
  project --project P         Print a project YAML for inspection
  gather  --project P         Search the web for images across buckets
  filter  --project P         Image-level YES/NO via Qwen or Gemma
  label   --project P         Falcon Perception bbox grounding
  pipeline --project P        gather → filter
  list                        Show experiments

Common flags:
  --backend qwen|gemma        Pick the VLM (filter, pipeline). Overrides project YAML.
  --limit N                   Process at most N images (smoke testing)
  --experiment NAME           Reuse an existing experiment dir

Project YAML reference

See projects/drones.yaml for the canonical, fully commented example. Required fields: project_name, target_object, buckets, falcon_queries. Everything else has defaults.


How big is this thing?

Component                                  Disk                   RAM (resident)
Factory CLI + Python deps                  < 50 MB                negligible
Qwen 2.5-VL-3B 4-bit                       ~2.2 GB                ~2.5 GB
Gemma 4-26B-A4B (Expert Sniper streaming)  ~13 GB                 ~3 GB
Falcon Perception 0.6B                     ~1.5 GB                ~1.5 GB
Web UI dev server                          ~300 MB node_modules   ~150 MB
Total (Gemma + Falcon path)                ~17 GB                 ~5 GB

Fits comfortably on a 16 GB Apple Silicon Mac.


Known issues

  1. Gemma /api/chat_vision is unreliable for batch YES/NO prompts. When the chained agent doesn't see a clear reason to call Falcon, it can stall. For the filter and verify stages, prefer --backend qwen. Gemma is rock solid for the label stage (which uses /api/falcon directly).
  2. The generic verify command is a TODO. The original drone-specific runpod_falcon/verify_vlm.py works today; the generic wrapper is a small refactor still pending.
  3. Image search hits DDG rate limits if --max-per-query is set too high. 30-50 per query is comfortable; beyond ~100 you'll see throttling.

Credits

  • Falcon Perception by TII (Apache 2.0)
  • Gemma 4 by Google DeepMind (Apache 2.0)
  • Qwen 2.5-VL by Alibaba (Apache 2.0)
  • MLX by Apple Machine Learning Research (MIT)
  • mlx-vlm by Prince Canuma (MIT)
  • MLX Expert Sniper streaming engine by walter-grace