
RUNBOOK: set up, use, deploy, record

Microfactory Node: 3D Printer, the 3D-printing node of the Microfactory. A small, fully local Gemma that learns 3D-printing expertise job-by-job and tells you where a print will fail before it runs. Two personas keep it honest: Chief Engineer O'Brien proposes settings; La Forge (a separate QA Inspector) grades. O'Brien never grades his own work. Model: gemma4:e4b local · google/gemma-4-E4B-it on the Space (ZeroGPU).

This is the one operational doc: how to run it, use it, deploy it, and record it. History, findings, and the phase log live in docs/reference/RUNBOOK-FINDINGS.md; day-by-day plan in docs/plan/; the recording plan + published demo link live in docs/writeup/02-VIDEO.md.


1 · Run it locally

cd chief-engineer
make setup                      # uv sync (locked env) + generate sample meshes
ollama serve &                  # its own terminal; leave running (for live Gemma)
ollama pull gemma4:e4b          # = gemma4:latest, ~9.6GB
make test                       # offline core tests (~1s, no Ollama); expect ALL PASSED
make run                        # → http://localhost:7860  (status bar shows the live model)

The app falls back to a deterministic advisor if Ollama is unreachable, so it never crashes; the UI always shows which model ran. Set CHIEF_ENGINEER_MODEL=gemma4:e2b for roughly 2× faster responses on CPU.
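The fallback behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the project's code: `advise`, `ask_ollama`, and `deterministic_advisor` are hypothetical names, and the endpoint is assumed to be Ollama's default local address.

```python
import json
import os
import urllib.request
from typing import Callable, Tuple

# Assumption: Ollama's default local endpoint; the app may configure this differently.
OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = os.environ.get("CHIEF_ENGINEER_MODEL", "gemma4:e4b")


def ask_ollama(prompt: str, model: str = MODEL, timeout: float = 60.0) -> str:
    """Send one non-streaming generate request to a local Ollama server."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]


def deterministic_advisor(prompt: str) -> str:
    """Hypothetical stand-in for the rule-based fallback advisor."""
    return "[fallback] conservative default settings"


def advise(prompt: str, live: Callable[[str], str] = ask_ollama) -> Tuple[str, str]:
    """Return (answer, source) so the caller can always show which model ran."""
    try:
        return live(prompt), MODEL
    except (OSError, ValueError, KeyError):
        # Connection errors and timeouts are OSError subclasses;
        # a malformed JSON reply raises ValueError or KeyError.
        return deterministic_advisor(prompt), "deterministic-advisor"
```

Returning the source label alongside the answer is what lets a status bar stay honest about whether the live model or the fallback produced the text.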

Want the fine-tuned Chief Engineer instead of stock Gemma? Four LoRA-fine-tuned variants are pre-merged + quantized and live on the public Ollama registry at ollama.com/kylebrodeur:

ollama pull kylebrodeur/microfactory-node-v3-qat                    # recommended (q4_k_m, 5.3GB)
CHIEF_ENGINEER_MODEL=kylebrodeur/microfactory-node-v3-qat make run

The other tags (-v2, :q4_0, microfactory-node v1) plus the canonical HF Hub copies and the full publishing runbook are documented in learn/finetune/OLLAMA_PUBLISHING.md and the GGUF model card.


2 · Use the tool (the guided tour; also the judge's tour)

Live at node.microfactory.space (custom domain; fallback build-small-hackathon-microfactory-lab.hf.space). Four tabs, left to right. This is exactly what a judge should see. The top header carries the model switcher (LoRA v3 QAT / LoRA v2 / Base / Modal API), a small info icon tooltip, the Warm up Model button, and the live model status and clock. Every tab has its small primary action and a persistent Reset in the same top-right spot. The UI uses custom icons (no emojis) and one consolidated loader that reveals content once the work finishes.

  1. Build: define the job. Quick-load 3DBenchy (or generate a primitive / drop a mesh; drop one of your own printed STLs to demo on real parts). You don't pick the part type; the engineer infers it from the mesh ("reads this as overhang-dominant"). The simulated environment (temp / humidity), build-plate position (center / edge / corner), and material (PLA / PETG / ABS / TPU) share the top control row, with RANDOMIZE / OVERRIDE / RESET / SLICE on the right. The Job Log up top shows what is stored and where. Hit SLICE (top right).
  2. Slice: the pre-flight read (the load-bearing moment). The slicer image and motion preview render immediately; the horizontal LAYER scrubber below the image steps through real cross-sections of this part. The segmented toggle labeled THE READ flips between Engineer's Read and Second Opinion, one panel at a time. O'Brien recalls the closest prior jobs, says what transfers, flags the failure regions before anything prints, and the Spine vetoes unsafe values (validation + g-code fold into his read). Flip to Second Opinion and La Forge critiques the plan; a dispute holds the → PRINT action until you acknowledge it. Clicking Second Opinion a second time does not re-run it for the same build.
  3. Print: run it and watch it compound. THE PLAN card up top frames the run: what we're testing (the job + conditions + the question), the Engineer's Spine-validated proposed settings, and what La Forge expects. Press PRINT (top right); the compact ITERATIONS slider sets how many runs (1 = a single print). Results stream in live as each iteration finishes: the OUTCOME · WHAT HAPPENED block (simulated result + La Forge run verdict) appears first, followed by the quality curve, the iteration log, the learned policy cell, and a compact LOG A REAL PRINT strip to feed a real-machine outcome back into the ledger. OVERRIDE PLAN (same popup component as OVERRIDE ENVIRONMENT on LOAD) lets you print against your own settings instead of the Engineer's. (Pick a genuinely hard job, because the sim only fails prints that should fail: PETG overhang @ ~30 °C / 65 % climbs 0.55→0.70; Benchy+PETG @ 30/68 climbs 0.68→0.81.)
  4. Review: the whole job in one place. The SESSION RECORD assembles the full story: the inputs, O'Brien's read, La Forge's pre-print second opinion, the simulated run (curve + iteration log), the outcome + run verdict, and next steps. Below it the lesson ledger grows (seed → earned → sim) and the capability mesh is a collapsible outlook view. La Forge's run verdict also sits up top. RESET TO BASELINE (top right, present on every tab) starts a fresh demo (clears this session's runs + learned policy; keeps seed + ingested).
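
The Second Opinion behaviour in the Slice step (one critique per build, and a dispute that holds PRINT until acknowledged) could be modelled along these lines. This is a sketch under assumptions: `SecondOpinionGate`, its methods, and the `{"dispute": ..., "note": ...}` shape are hypothetical and are not taken from the app.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Set


@dataclass
class SecondOpinionGate:
    """Caches one critique per build and blocks PRINT while a dispute is open."""

    # Hypothetical critic: takes a build id, returns {"dispute": bool, "note": str}.
    critique: Callable[[str], dict]
    _cache: Dict[str, dict] = field(default_factory=dict)
    _acknowledged: Set[str] = field(default_factory=set)

    def second_opinion(self, build_id: str) -> dict:
        # A second click for the same build reuses the cached critique.
        if build_id not in self._cache:
            self._cache[build_id] = self.critique(build_id)
        return self._cache[build_id]

    def acknowledge(self, build_id: str) -> None:
        self._acknowledged.add(build_id)

    def can_print(self, build_id: str) -> bool:
        opinion = self._cache.get(build_id)
        if opinion is None or not opinion["dispute"]:
            return True
        # An open dispute holds PRINT until the operator acknowledges it.
        return build_id in self._acknowledged
```

Keying the cache on the build means a changed mesh or environment would naturally get a fresh critique, while repeated clicks on the same build cost nothing.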

The honesty spine (say this out loud): the Engineer proposes, a deterministic Spine disposes, a deterministic world produces the outcome, and a separate Inspector grades it; the model never marks its own homework. Real: compounding retrieval + learned policy, proactive risk flags, local Gemma, the QA Inspector. Simulated (the one boundary): print outcomes (sim/outcome.py). Frontier (named, not faked): weight-level fine-tuning, multi-node execution, physical sensors/camera. We calibrated the sim against 178 real failure prints, measured 32.6%, found the gap was structural, and documented it instead of faking a tuned number (sim/calibration/CALIBRATION-REPORT.md).
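
To make "the Engineer proposes, a deterministic Spine disposes" concrete, here is a minimal sketch of a rule-based validator that clamps a proposed value and records the veto. The function name, the setting key, and the temperature ranges are illustrative assumptions, not the project's real limits.

```python
from typing import Dict, List, Tuple

# Hypothetical nozzle-temperature ranges in °C, for illustration only.
NOZZLE_LIMITS: Dict[str, Tuple[int, int]] = {
    "PLA": (190, 220),
    "PETG": (220, 250),
    "ABS": (230, 260),
    "TPU": (210, 240),
}


def spine_validate(material: str, proposed: Dict[str, float]) -> Tuple[Dict[str, float], List[str]]:
    """Clamp the proposed nozzle temperature into range and list every veto.

    Pure and deterministic: the same proposal always yields the same result,
    so the model cannot talk its way past the limits.
    """
    if material not in NOZZLE_LIMITS:
        raise ValueError(f"unknown material: {material}")
    low, high = NOZZLE_LIMITS[material]
    settings = dict(proposed)  # never mutate the caller's proposal
    vetoes: List[str] = []
    temp = settings.get("nozzle_temp")
    if temp is not None and not (low <= temp <= high):
        clamped = min(max(temp, low), high)
        vetoes.append(f"nozzle_temp {temp} -> {clamped} ({material} range {low}-{high})")
        settings["nozzle_temp"] = clamped
    return settings, vetoes
```

Returning the veto list next to the corrected settings keeps the override visible, which is what lets a separate grader judge the validated plan rather than the raw proposal.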

If the live model is slow or falls back: the Space runs Gemma on ZeroGPU (first BUILD loads it, ~30s). If GPU quota is momentarily exhausted, it falls back to the deterministic advisor (clearly labeled) rather than erroring.


3 · Deploy to the Space

Final submission pass? Work the docs/plan/SUBMISSION-PUNCHDOWN.md list, then follow docs/plan/FINAL-SEED-AND-DEPLOY.md, which holds the ordered seed/deploy commands. (Both live in the private working repo; they're operator checklists, not part of the public artifact.)

Space: build-small-hackathon/microfactory-lab · SDK gradio, app at root, hardware ZeroGPU, HF Pro active (ample quota → live Gemma on screen). The Space installs from requirements.txt; the chief-engineer/ contents go to the Space root. The local agent has hf CLI access and can run the auth/repo checks and the push directly.

make deploy-check      # offline GO/NO-GO gates (D1–D10). Run any time; nothing is pushed.
make deploy            # gates → if green + authenticated, upload_folder to the Space + factory reboot

make deploy (= scripts/deploy_preflight.py --push) uploads everything except docs/, spike/, caches, secrets, and runtime files, so learn/, assets/, and data/*.jsonl go too (the app needs them). The gates: build imports + core tests, all Space files present, README frontmatter valid (short_description ≤60), lean reference block, clean ledger baseline, data well-formed, credentials (D8), live Space state (D9), and the field-log dataset set + logging (D10).
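
As an illustration of what one offline gate might look like, the sketch below checks the short_description length in README frontmatter. It is not the code in scripts/deploy_preflight.py; the function name and the simple line-based parsing are assumptions made to keep the example dependency-free.

```python
def short_description_ok(readme_text: str, limit: int = 60) -> bool:
    """Return True when the frontmatter has a short_description within the limit.

    Assumes frontmatter is delimited by '---' lines and that the value sits on
    a single 'short_description: ...' line (no multi-line YAML scalars).
    """
    lines = readme_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return False  # no frontmatter block at all
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of frontmatter reached without finding the key
        key, sep, value = line.partition(":")
        if sep and key.strip() == "short_description":
            # Strip whitespace and optional surrounding quotes before measuring.
            return len(value.strip().strip("\"'")) <= limit
    return False
```

A gate like this returns a plain boolean, so a preflight script can collect all gate results first and only push when every one is green.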

Credentials: an HF write token (member of build-small-hackathon), via hf auth login or export HF_TOKEN=…. Check with hf auth whoami. The Claude-on-the-web session carries no token by default (deploy-check warns D8); deploy from an authenticated machine or set HF_TOKEN.

Space variables (set once; a change triggers rebuild):

hf spaces variables add build-small-hackathon/microfactory-lab \
  -e GRADIO_SSR_MODE=False -e CHIEF_ENGINEER_BACKEND=zerogpu \
  -e CHIEF_ENGINEER_HF_MODEL=google/gemma-4-E4B-it

After deploy, smoke-test the live UI: LOAD a part, then SLICE shows O'Brien reasoning (NOT "Error"); La Forge second opinion + the dispute-gate work; LAYER scrubber slides; Print loop runs; Review shows the ledger + verdict + ↺ RESET; wide layout, no empty right gutter.

Field log: "all runs → a shared dataset" (Sharing is Caring)

Every interaction (build / second-opinion / simulate / print / record) logs one flat row to a HF Dataset via core/field_log.py (CommitScheduler, flushes ~5 min). Logging is automatic once the token is set and silently no-ops without it (local/offline unaffected; config + outcomes only, never PII or files; rows are candidates, never auto-promoted to the curated ledger).

  1. Dataset repo. Check first (it likely exists): hf datasets info build-small-hackathon/chief-engineer-field-log. Create only if missing (the first hf upload to a non-existent dataset creates it; add --private for a private repo); never recreate. Must match FIELD_LOG_REPO in core/field_log.py.
  2. HF_TOKEN as a Space secret (write, org member): Space → Settings → Variables and secrets → New secret. Reboot.
  3. Verify: do one SLICE on the Space, wait ≤5 min, confirm a new row in interactions.jsonl (or make deploy-check D10). The schema is one flat 26-column table, so it renders cleanly in the HF dataset viewer.
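
The token-gated, never-break logging contract described above can be sketched with huggingface_hub's CommitScheduler. This is an assumption-laden outline, not the contents of core/field_log.py: the `FieldLog` class, the local folder name, and the `ts` / `kind` columns are hypothetical (the real schema has 26 columns).

```python
import json
import os
import time
from pathlib import Path
from typing import Optional

FIELD_LOG_REPO = "build-small-hackathon/chief-engineer-field-log"


class FieldLog:
    """Appends flat rows to a JSONL file that a CommitScheduler pushes to the Hub."""

    def __init__(self, folder: str = "field_log", token: Optional[str] = None):
        self.path = Path(folder) / "interactions.jsonl"
        self.scheduler = None
        token = token or os.environ.get("HF_TOKEN")
        if not token:
            return  # no token: stay a silent no-op (local/offline unaffected)
        try:
            from huggingface_hub import CommitScheduler  # imported lazily on purpose

            self.path.parent.mkdir(parents=True, exist_ok=True)
            self.scheduler = CommitScheduler(
                repo_id=FIELD_LOG_REPO,
                repo_type="dataset",
                folder_path=self.path.parent,
                every=5,  # minutes between background commits
                token=token,
            )
        except Exception:
            self.scheduler = None  # best-effort: logging must never break the app

    def log(self, kind: str, **fields) -> bool:
        """Append one row; return False when logging is disabled or fails."""
        if self.scheduler is None:
            return False
        try:
            row = {"ts": time.time(), "kind": kind, **fields}
            with self.scheduler.lock:  # avoid writing while a commit is in progress
                with self.path.open("a", encoding="utf-8") as fh:
                    fh.write(json.dumps(row) + "\n")
            return True
        except Exception:
            return False
```

Importing huggingface_hub only when a token is present keeps a lean local environment working without the dependency installed.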

Deliberation traces: "how the agent reasons" (Sharing is Caring)

A second open dataset that captures the turn-by-turn argument between the personas (O'Brien proposes → Spine vetoes → La Forge second opinion/dispute → operator override → World simulates → La Forge grades → run verdict). One row per turn; both sources share one schema:

  • Static, reproducible export: make deliberation → dist/deliberation/ (JSONL + card). Offline-safe; run with ollama serve up first to capture O'Brien's real reasoning rather than the [fallback] text.
  • Live, every run: core/deliberation_log.py (CommitScheduler, same HF_TOKEN gate + best-effort/never-break contract as the field log) appends turns on each Space run.

Schema: session_id, track, turn, agent, role, act, stance, content, material, geometry, bed_position, env_temp, env_humidity, ts. It renders cleanly in the dataset viewer.
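
The one-row-per-turn schema above maps naturally onto a small record type. The sketch below uses the listed column names in order; the Python types and the `to_jsonl` helper are assumptions, since the source only names the columns.

```python
import json
from dataclasses import asdict, dataclass
from typing import Iterable


@dataclass
class Turn:
    """One deliberation turn. Field names follow the schema; types are assumed."""

    session_id: str
    track: str
    turn: int
    agent: str
    role: str
    act: str
    stance: str
    content: str
    material: str
    geometry: str
    bed_position: str
    env_temp: float
    env_humidity: float
    ts: str


def to_jsonl(turns: Iterable[Turn]) -> str:
    """Serialise turns as JSON Lines, one flat object per line."""
    return "".join(json.dumps(asdict(t)) + "\n" for t in turns)
```

Keeping every row flat, with no nested objects, is what allows a tabular dataset viewer to show each column directly.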

  1. Dataset repo. Check first: hf datasets info kylebrodeur/chief-engineer-deliberation. Create only if missing (the first hf upload creates it; add --private for private); must match DELIB_LOG_REPO in core/deliberation_log.py and HF_REPO in scripts/export_deliberation.py.
  2. Seed it (optional but nicer): ollama serve + make deliberation, then hf upload kylebrodeur/chief-engineer-deliberation dist/deliberation . --repo-type dataset (the card ships as the repo README).
  3. Live capture: the same HF_TOKEN Space secret that powers the field log also powers this, with no extra setup. Do one LOAD → SLICE → PRINT on the Space, wait ≤5 min, confirm new rows in deliberations.jsonl.

4 · Record the demo

Full beat sheet, shot list, and what to say for each beat: docs/writeup/02-VIDEO.md.

# clean curve first (keeps seed+ingested), then record
git checkout -- data/lessons.jsonl && rm -f data/policy.json   # or click ↺ RESET in Review
make record-check        # recording preflight (cap-cli + Space + playwright gates)
make record-beat BEAT=load   # record the LOAD beat (one .cap project per beat)
make record-beat BEAT=slice  # record the SLICE beat
./scripts/export-beat.sh /path/to/beat.cap recordings/beats/<beat>.mp4

Record from the live Space (HF Pro = live Gemma reasoning on screen). Use a climbing job for the compounding beat (above). Per-beat recording is handled by scripts/record-beat.sh. See recordings/EDITING.md for the full beat list and which beats need Cap Studio polish. One-time deps: uv pip install playwright && uv run playwright install chromium.

Cleaner capture (Kyle's setup):

  • Install the Chrome app Gradio offers (the "Install" / PWA prompt in the address bar on the Space). It opens the app in its own chromeless window (no tabs, no URL bar), so the screen recording is a clean interface with just the console.
  • Cap Studio with a "Desktop Background" for the final cut: export a 1080p version so the screen capture sits cleanly alongside the regular (camera) footage when the two tracks are cut together. Match the 1080p export to the camera end-cap resolution.

5 · Quick reference

| Command | Purpose |
| --- | --- |
| make setup | uv sync (locked env) + generate meshes |
| make test | offline core tests (no Ollama, ~1s) |
| make run | launch the app (status bar shows the live model) |
| make preflight | live-stack GO/NO-GO (Ollama, latency, JSON, reasoning, Spine, assets) |
| make deploy-check | deploy/record readiness gates (D1–D10, offline) |
| make deploy | gates → push the Space (upload_folder) + factory reboot (needs HF_TOKEN) |
| make record-check | recording preflight (cap-cli + Space + playwright gates) |
| make record-beat BEAT=load | record one beat to its own .cap project |
| ./scripts/export-beat.sh <cap> <mp4> | export a .cap project to MP4 |
| uv run python scripts/assemble-video.py recordings/manifest.json | assemble camera + beats + VO |
| make trace | export the ledger as a Hub-ready dataset → kylebrodeur/chief-engineer-ledger |
| make deliberation | export the multi-persona deliberation as a Hub-ready dataset → kylebrodeur/chief-engineer-deliberation |
| core/field_log.py | live Space interactions → build-small-hackathon/chief-engineer-field-log |
| docs/reference/ACTIVITY.jsonl | build activity trace → kylebrodeur/chief-engineer-build-activity |
| learn/finetune/activity.jsonl | fine-tune pipeline activity trace → kylebrodeur/chief-engineer-finetune-activity |
| git checkout -- data/lessons.jsonl && rm -f data/policy.json | reset demo curve (keeps seed + ingested) |
| CHIEF_ENGINEER_MODEL=gemma4:e2b | faster model (env wins over .env) |

6 · Troubleshoot (durable gotchas)

  • Dev Mode must be OFF on the Space: an openvscode build log means it runs the dev shell, not app.py (you'll see a default greet template). Disable + reboot.
  • short_description ≤ 60 chars in README frontmatter, or hf upload rejects it.
  • Only requirements.txt is mounted at build, so the ZeroGPU deps (spaces/torch/transformers/accelerate) are inlined there. Local dev stays lean (uv base, 4 deps); app.py shims spaces to a no-op when absent.
  • @spaces.GPU is on the inference function only (core/llm_zerogpu._generate), NOT build_job, so a quota-out falls back to the advisor instead of erroring the whole handler. app.py imports core.llm_zerogpu at startup so ZeroGPU still detects the GPU function.
  • data/lessons.jsonl is the durable ledger (tracked): don't rm it; use the reset command (keeps seed + ingested).
  • Use make targets / uv run python -m scripts.<name>; bare python scripts/x.py fails.
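
The two ZeroGPU gotchas above (shimming spaces to a no-op when absent, and decorating only the inference function) can be sketched as follows. This is an illustrative outline, not the code in app.py or core/llm_zerogpu: `make_gpu_shim` is a hypothetical helper, and the bodies of `_generate` and `build_job` are stand-ins.

```python
from typing import Callable, Tuple


def make_gpu_shim() -> Callable:
    """Build a no-op replacement for spaces.GPU that supports both call styles."""

    def gpu(*args, **kwargs):
        if len(args) == 1 and callable(args[0]) and not kwargs:
            return args[0]  # used bare: @GPU
        return lambda fn: fn  # used with options: @GPU(duration=60)

    return gpu


try:
    import spaces  # present on the Space; typically absent in lean local dev

    GPU = spaces.GPU
except ImportError:
    GPU = make_gpu_shim()


@GPU
def _generate(prompt: str) -> str:
    """Stand-in for the GPU-backed inference call (the only decorated function)."""
    return f"echo: {prompt}"


def build_job(prompt: str, generate: Callable[[str], str] = _generate) -> Tuple[str, str]:
    """The handler is deliberately NOT decorated, so a GPU failure stays catchable here."""
    try:
        return generate(prompt), "zerogpu"
    except Exception:
        # e.g. quota exhausted: degrade to the labeled fallback instead of erroring.
        return "[fallback] conservative default settings", "deterministic-advisor"
```

Because the decorator wraps only the inner call, an exception raised when acquiring the GPU surfaces inside the handler's try block rather than aborting the handler itself.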

History, findings, and the phase log: docs/reference/RUNBOOK-FINDINGS.md. Deploy deep-dive: docs/reference/DEPLOYMENT.md. Open items: docs/plan/ISSUES.md.