Spaces:

build-small-hackathon
/

microfactory-lab

Runtime error

App Files Files Community

microfactory-lab / docs /RUNBOOK.md

kylebrodeur

deploy: update docs/RUNBOOK.md

b320b0c verified 17 days ago

preview code

Raw

History Blame Contribute Delete

16.3 kB

	# RUNBOOK — set up, use, deploy, record

	Microfactory Node: 3D Printer — the 3D-printing node of the Microfactory. A small,
	fully local Gemma that learns 3D-printing expertise job-by-job and tells you where a
	print will fail before it runs. Two personas keep it honest: Chief Engineer O'Brien
	proposes settings; La Forge (a separate QA Inspector) grades — O'Brien never grades his
	own work. Model: `gemma4:e4b` local · `google/gemma-4-E4B-it` on the Space (ZeroGPU).

	This is the one operational doc: how to run it, use it, deploy it, and record it. History,
	findings, and the phase log live in `docs/reference/RUNBOOK-FINDINGS.md`; day-by-day plan in
	`docs/plan/`; the recording plan + published demo link live in `docs/writeup/02-VIDEO.md`.

	---

	## 1 · Run it locally

	```bash
	cd chief-engineer
	make setup # uv sync (locked env) + generate sample meshes
	ollama serve & # its own terminal; leave running (for live Gemma)
	ollama pull gemma4:e4b # = gemma4:latest, ~9.6GB
	make test # offline core tests (~1s, no Ollama) — expect ALL PASSED
	make run # → http://localhost:7860 (status bar shows the live model)
	```

	Falls back to a deterministic advisor if Ollama is unreachable, so it never crashes; the UI
	always shows which model ran. `CHIEF_ENGINEER_MODEL=gemma4:e2b` for ~2× faster CPU latency.

	Want the fine-tuned Chief Engineer instead of stock Gemma? Four LoRA-fine-tuned variants are
	pre-merged + quantized and live on the public Ollama registry at
	[`ollama.com/kylebrodeur`](https://ollama.com/kylebrodeur):
	```bash
	ollama pull kylebrodeur/microfactory-node-v3-qat # recommended (q4_k_m, 5.3GB)
	CHIEF_ENGINEER_MODEL=kylebrodeur/microfactory-node-v3-qat make run
	```
	The other tags (`-v2`, `:q4_0`, `microfactory-node` v1) plus the canonical HF Hub copies
	and the full publishing runbook are documented in
	[`learn/finetune/OLLAMA_PUBLISHING.md`](../learn/finetune/OLLAMA_PUBLISHING.md) and the
	[GGUF model card](https://huggingface.co/kylebrodeur/microfactory-node-gguf).

	---

	## 2 · Use the tool (the guided tour — also the judge's tour)

	Live at [node.microfactory.space](https://node.microfactory.space) (custom domain; fallback
	`build-small-hackathon-microfactory-lab.hf.space`). Four tabs, left to right. This is exactly what
	a judge should see. The top header carries the model switcher (LoRA v3 QAT / LoRA v2 / Base /
	Modal API), a small info icon tooltip, the Warm up Model button, and the live model status
	+ clock. Every tab has its small primary action and a persistent Reset in the same top-right
	spot. The UI uses custom icons (no emojis) and one consolidated loader that reveals content once
	the work finishes.

	1. Build — define the job. Quick-load 3DBenchy (or generate a primitive / drop a
	mesh; drop one of your own printed STLs to demo on real parts). You don't pick the part
	type — the engineer infers it from the mesh ("reads this as overhang-dominant").
	The simulated environment (temp / humidity), build-plate position (center / edge /
	corner), and material (PLA / PETG / ABS / TPU) share the top control row, with
	RANDOMIZE / OVERRIDE / RESET / SLICE on the right. The Job Log up top shows what
	is stored and where. Hit SLICE (top right).
	2. Slice — the pre-flight read (the load-bearing moment). Slicer image + motion preview
	render immediately; the horizontal LAYER scrubber below the image steps through real
	cross-sections of this part. The THE READ segmented toggle flips between
	Engineer's Read and Second Opinion, one panel at a time. O'Brien recalls the
	closest prior jobs, says what transfers, flags the failure regions **before anything
	prints, and the Spine** vetoes unsafe values (validation + g-code fold into his read).
	Flip to Second Opinion → La Forge critiques the plan; a dispute holds → PRINT
	until you acknowledge. Clicking Second Opinion a second time does not re-run it for
	the same build.
	3. Print — run it and watch it compound. THE PLAN card up top frames the run: *what
	we're testing* (the job + conditions + the question), the Engineer's Spine-validated
	proposed settings, and what La Forge expects. Press PRINT (top right); the compact
	ITERATIONS slider sets how many runs (1 = a single print). Results stream in live as
	each iteration finishes: the OUTCOME · WHAT HAPPENED block (simulated result + La Forge
	run verdict) appears first, followed by the quality curve, the iteration log, the learned
	policy cell, and a compact LOG A REAL PRINT strip to feed a real-machine outcome back into
	the ledger. OVERRIDE PLAN (same popup component as OVERRIDE ENVIRONMENT on LOAD) lets you
	print against your own settings instead of the Engineer's. *(Pick a genuinely hard job — the
	sim only fails prints that should: PETG overhang @ ~30 °C/65 % climbs 0.55→0.70; Benchy+PETG
	@30/68 climbs 0.68→0.81.)*
	4. Review — the whole job in one place. The SESSION RECORD assembles the full story:
	the inputs, O'Brien's read, La Forge's pre-print second opinion, the simulated run (curve +
	iteration log), the outcome + run verdict, and next steps. Below it the lesson ledger grows
	(seed → earned → sim) and the capability mesh is a collapsible outlook view. La Forge's
	run verdict also sits up top. RESET TO BASELINE (top right, present on every tab) starts a
	fresh demo (clears this session's runs + learned policy; keeps seed + ingested).

	The honesty spine (say this out loud): the Engineer proposes, a deterministic Spine
	disposes, a deterministic world produces the outcome, and a separate Inspector grades it —
	the model never marks its own homework. Real: compounding retrieval + learned policy,
	proactive risk flags, local Gemma, the QA Inspector. Simulated (the one boundary): print
	outcomes (`sim/outcome.py`). Frontier (named, not faked): weight-level fine-tuning,
	multi-node execution, physical sensors/camera. We calibrated the sim against 178 real failure
	prints, measured 32.6%, found the gap was structural, and **documented it instead of
	faking a tuned number** (`sim/calibration/CALIBRATION-REPORT.md`).

	If the live model is slow/falls back: the Space runs Gemma on ZeroGPU (first BUILD loads
	it, ~30s). If GPU quota is momentarily out it falls back to the deterministic
	advisor (clearly labeled) rather than erroring.

	---

	## 3 · Deploy to the Space

	> Final submission pass? Work the `docs/plan/SUBMISSION-PUNCHDOWN.md` list, then
	> follow `docs/plan/FINAL-SEED-AND-DEPLOY.md` — the ordered seed/deploy commands. (Both live in
	> the private working repo; they're operator checklists, not part of the public artifact.)

	Space: `build-small-hackathon/microfactory-lab` · SDK gradio, app at root, hardware
	ZeroGPU, HF Pro active (ample quota → live Gemma on screen). The Space installs from
	`requirements.txt`; the `chief-engineer/` contents go to the Space root. The local agent
	has `hf` CLI access and can run the auth/repo checks and the push directly.

	```bash
	make deploy-check # offline GO/NO-GO gates (D1–D10). Run any time; nothing is pushed.
	make deploy # gates → if green + authenticated, upload_folder to the Space + factory reboot
	```

	`make deploy` (= `scripts/deploy_preflight.py --push`) uploads everything except
	`docs/`, `spike/`, caches, secrets, and runtime files — so `learn/`, `assets/`, and
	`data/*.jsonl` go too (the app needs them). The gates: build imports + core tests, all Space
	files present, README frontmatter valid (`short_description` ≤60), lean reference block, clean
	ledger baseline, data well-formed, credentials (D8), live Space state (D9), and the **field-log
	dataset set + logging (D10)**.

	Credentials: an HF write token (member of `build-small-hackathon`) — `hf auth login`
	or `export HF_TOKEN=…`. Check with `hf auth whoami`. The Claude-on-the-web session carries no
	token by default (deploy-check warns D8); deploy from an authenticated machine or set `HF_TOKEN`.

	Space variables (set once; a change triggers rebuild):
	```bash
	hf spaces variables add build-small-hackathon/microfactory-lab \
	-e GRADIO_SSR_MODE=False -e CHIEF_ENGINEER_BACKEND=zerogpu \
	-e CHIEF_ENGINEER_HF_MODEL=google/gemma-4-E4B-it
	```

	After deploy, smoke-test the live UI: LOAD a part, then SLICE shows O'Brien
	reasoning (NOT "Error"); La Forge second opinion + the dispute-gate work; LAYER
	scrubber slides; Print loop runs; Review shows the ledger + verdict + ↺ RESET;
	wide layout, no empty right gutter.

	### Field log — "all runs → a shared dataset" (Sharing is Caring)

	Every interaction (build / second-opinion / simulate / print / record) logs one flat row to a
	HF Dataset via `core/field_log.py` (`CommitScheduler`, flushes ~5 min) — **automatic once the
	token is set, silently no-ops without it** (local/offline unaffected; config + outcomes only,
	never PII or files; rows are candidates, never auto-promoted to the curated ledger).

	1. Dataset repo — check first (it likely exists): `hf datasets info build-small-hackathon/chief-engineer-field-log`. Create only if missing (the first `hf upload` to a non-existent dataset creates it; add `--private` for a private repo); never recreate. Must match `FIELD_LOG_REPO` in `core/field_log.py`.
	2. *`HF_TOKEN` as a Space secret*** (write, org member): Space → Settings → Variables and secrets → New secret. Reboot.
	3. Verify: do one SLICE on the Space, wait ≤5 min, confirm a new row in `interactions.jsonl` (or `make deploy-check` D10). The schema is one flat 26-column table → renders cleanly in the HF dataset viewer.

	### Deliberation traces — "how the agent reasons" (Sharing is Caring)

	A second open dataset that captures the turn-by-turn argument between the personas
	(O'Brien proposes → Spine vetoes → La Forge second opinion/dispute → operator override →
	World simulates → La Forge grades → run verdict). One row per turn; shares one schema across
	two sources:

	- Static, reproducible export: `make deliberation` → `dist/deliberation/` (JSONL + card).
	Offline-safe; run with `ollama serve` up first to capture O'Brien's real reasoning rather
	than the `[fallback]` text.
	- Live, every run: `core/deliberation_log.py` (`CommitScheduler`, same `HF_TOKEN` gate +
	best-effort/never-break contract as the field log) appends turns on each Space run.

	Schema: `session_id, track, turn, agent, role, act, stance, content, material, geometry,
	bed_position, env_temp, env_humidity, ts` — renders cleanly in the dataset viewer.

	1. Dataset repo — check first: `hf datasets info kylebrodeur/chief-engineer-deliberation`. Create only if missing (the first `hf upload` creates it; add `--private` for private); must match `DELIB_LOG_REPO` in `core/deliberation_log.py` and `HF_REPO` in `scripts/export_deliberation.py`.
	2. Seed it (optional but nicer): `ollama serve` + `make deliberation`, then `hf upload kylebrodeur/chief-engineer-deliberation dist/deliberation . --repo-type dataset` (the card ships as the repo README).
	3. Live capture: the same `HF_TOKEN` Space secret that powers the field log also powers this — no extra setup. Do one LOAD → SLICE → PRINT on the Space, wait ≤5 min, confirm new rows in `deliberations.jsonl`.

	---

	## 4 · Record the demo

	Full beat sheet, shot list, and what to say for each beat: `docs/writeup/02-VIDEO.md`.

	```bash
	# clean curve first (keeps seed+ingested), then record
	git checkout -- data/lessons.jsonl && rm -f data/policy.json # or click ↺ RESET in Review
	make record-check # recording preflight (cap-cli + Space + playwright gates)
	make record-beat BEAT=load # record the LOAD beat (one .cap project per beat)
	make record-beat BEAT=slice # record the SLICE beat
	./scripts/export-beat.sh /path/to/beat.cap recordings/beats/<beat>.mp4
	```

	Record from the live Space (HF Pro = live Gemma reasoning on screen). Use a **climbing
	job** for the compounding beat (above). Per-beat recording is handled by `scripts/record-beat.sh`. See `recordings/EDITING.md` for the
	full beat list and which beats need Cap Studio polish. One-time deps: `uv pip install playwright && uv run playwright install chromium`.

	Cleaner capture (Kyle's setup):
	- Install the Chrome app Gradio offers (the "Install" / PWA prompt in the address bar on the
	Space). It opens the app in its own chromeless window — no tabs, no URL bar — so the screen
	recording is a clean interface with just the console.
	- Cap Studio with a "Desktop Background" for the final cut: export a 1080p version so the
	screen capture sits cleanly alongside the regular (camera) footage when the two tracks are cut
	together. Match the 1080p export to the camera end-cap resolution.

	---

	## 5 · Quick reference

	\| Command \| Purpose \|
	\|---\|---\|
	\| `make setup` \| uv sync (locked env) + generate meshes \|
	\| `make test` \| offline core tests (no Ollama, ~1s) \|
	\| `make run` \| launch the app (status bar shows the live model) \|
	\| `make preflight` \| live-stack GO/NO-GO (Ollama, latency, JSON, reasoning, Spine, assets) \|
	\| `make deploy-check` \| deploy/record readiness gates (D1–D10, offline) \|
	\| `make deploy` \| gates → push the Space (`upload_folder`) + factory reboot (needs `HF_TOKEN`) \|
	\| `make record-check` \| recording preflight (cap-cli + Space + playwright gates) \|
	\| `make record-beat BEAT=load` \| record one beat to its own `.cap` project \|
	\| `./scripts/export-beat.sh <cap> <mp4>` \| export a `.cap` project to MP4 \|
	\| `uv run python scripts/assemble-video.py recordings/manifest.json` \| assemble camera + beats + VO \|
	\| `make trace` \| export the ledger as a Hub-ready dataset → [kylebrodeur/chief-engineer-ledger](https://huggingface.co/datasets/kylebrodeur/chief-engineer-ledger) \|
	\| `make deliberation` \| export the multi-persona deliberation as a Hub-ready dataset → [kylebrodeur/chief-engineer-deliberation](https://huggingface.co/datasets/kylebrodeur/chief-engineer-deliberation) \|
	\| `core/field_log.py` \| live Space interactions → [build-small-hackathon/chief-engineer-field-log](https://huggingface.co/datasets/build-small-hackathon/chief-engineer-field-log) \|
	\| `docs/reference/ACTIVITY.jsonl` \| build activity trace → [kylebrodeur/chief-engineer-build-activity](https://huggingface.co/datasets/kylebrodeur/chief-engineer-build-activity) \|
	\| `learn/finetune/activity.jsonl` \| fine-tune pipeline activity trace → [kylebrodeur/chief-engineer-finetune-activity](https://huggingface.co/datasets/kylebrodeur/chief-engineer-finetune-activity) \|
	\| `git checkout -- data/lessons.jsonl && rm -f data/policy.json` \| reset demo curve (keeps seed + ingested) \|
	\| `CHIEF_ENGINEER_MODEL=gemma4:e2b` \| faster model (env wins over `.env`) \|

	---

	## 6 · Troubleshoot (durable gotchas)

	- Dev Mode must be OFF on the Space — an `openvscode` build log means it runs the dev
	shell, not `app.py` (you'll see a default greet template). Disable + reboot.
	- `short_description` ≤ 60 chars in README frontmatter or `hf upload` rejects it.
	- `requirements.txt` only is mounted at build — ZeroGPU deps (`spaces`/`torch`/
	`transformers`/`accelerate`) are inlined there. Local dev stays lean (uv base, 4 deps);
	`app.py` shims `spaces` to a no-op when absent.
	- `@spaces.GPU` is on the inference function only (`core/llm_zerogpu._generate`), NOT
	`build_job` — so a quota-out falls back to the advisor instead of erroring the whole handler.
	`app.py` imports `core.llm_zerogpu` at startup so ZeroGPU still detects the GPU function.
	- `data/lessons.jsonl` is the durable ledger (tracked) — don't `rm` it; use the reset
	command (keeps seed + ingested).
	- Use `make` targets / `uv run python -m scripts.<name>` — bare `python scripts/x.py` fails.

	History, findings, and the phase log: `docs/reference/RUNBOOK-FINDINGS.md`. Deploy
	deep-dive: `docs/reference/DEPLOYMENT.md`. Open items: `docs/plan/ISSUES.md`.