# AGENTS.md

Operating manual for coding agents working in this repo.

---

## What this is

**Hackathon Advisor** is a Gradio `gradio.Server` (FastAPI subclass) Space for the
[Build Small Hackathon](https://huggingface.co/build-small-hackathon). It is a small-model (**≤32B**, largest single
model **≤4B**) originality coach: it crawls the public `build-small-hackathon` org into a live project atlas, then lets a
builder search the field and open **The Unwritten Almanac** advisor to test an idea against existing work.

The engine in `hackathon_advisor/` is **UI-agnostic**; `app.py` and `static/` are one possible front door.

**Model stack (all open-weight, all local):**

| Role | Model | Runtime |
| --- | --- | --- |
| Advisor brain (tool planning) | `openbmb/MiniCPM5-1B` + advisor LoRA | Transformers + PEFT, ZeroGPU |
| Quest classifier | `openbmb/MiniCPM5-1B` + quest LoRA | Transformers + PEFT, ZeroGPU |
| Retrieval / atlas | `ggml-org/embeddinggemma-300m-qat-q8_0-GGUF` | llama.cpp (llama-cpp-python) |
| Voice input (ASR) | `nvidia/nemotron-speech-streaming-en-0.6b` | NVIDIA NeMo |

---

## Setup & commands

- **Python** `>=3.11,<3.13`. Dependency manager is **uv** (`uv.lock` is the source of truth).
- **System packages** (`packages.txt`): `ffmpeg`, `libsndfile1`.

```bash
uv sync                       # or: pip install -r requirements.txt
uv run pytest                 # run the test suite (fast, NO GPU/weights needed; heavy models are mocked)
uvx ruff check .              # lint   (config: pyproject.toml [tool.ruff], line-length 100, py311; ruff is not a pinned dep)
uvx ruff format .             # format
```

Run the app locally (greedy CPU/MPS path, no ZeroGPU):

```bash
mkdir -p .cache/advisor-dashboard
ADVISOR_CACHE_DIR=.cache/advisor-dashboard \
ADVISOR_MODEL_BACKEND=minicpm-transformers \
ADVISOR_QUEST_ANALYZER_BACKEND=minicpm-transformers \
python app.py                 # → http://127.0.0.1:7860
```

`ADVISOR_MODEL_BACKEND=rules` swaps the LLM for a deterministic planner; use it for UI/plumbing work without loading
MiniCPM.
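
As a rough illustration of how an environment variable can pick the planner, the sketch below puts a `rules` fallback behind a factory. The class and function names echo `model_runtime.py`, but the `plan()` method, its keyword rules, and the error handling are assumptions made for this example, not the repo's actual code.

```python
import os
from typing import Optional, Protocol


class ToolPlanner(Protocol):
    """Hypothetical interface: map a user message to one tool name."""

    def plan(self, message: str) -> str: ...


class RuleBasedPlanner:
    """Deterministic stand-in that picks a tool from keywords, with no model load."""

    def plan(self, message: str) -> str:
        text = message.lower()
        if "score" in text:
            return "score"
        return "search"


def create_tool_planner(backend: Optional[str] = None) -> ToolPlanner:
    # Assumed selection logic; the real factory lives in model_runtime.py.
    backend = backend or os.environ.get("ADVISOR_MODEL_BACKEND", "minicpm-transformers")
    if backend == "rules":
        return RuleBasedPlanner()
    if backend == "minicpm-transformers":
        # Model loading is deliberately left out so the sketch stays dependency-free.
        raise NotImplementedError("model loading is omitted from this sketch")
    raise ValueError(f"unknown ADVISOR_MODEL_BACKEND: {backend!r}")
```

Because the `rules` branch returns before any model code is reached, UI and plumbing work can exercise the full turn path without weights.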

`pytest` config lives in `pyproject.toml` (`testpaths=["tests"]`, `pythonpath=["."]`). **Always run it before
committing**: there are 26 test files, and they are the contract.

---

## Repo map

```
app.py                  gr.Server entry: static UI + FastAPI /api/* + @app.api() client endpoints + refresh scheduler
hackathon_advisor/      the engine package (UI-agnostic; keep it that way)
static/                 bespoke frontend (index.html / app.js / styles.css), the Off-Brand custom UI
scripts/                offline pipelines (crawl, Modal index/LoRA build, Hub publish); NOT runtime
data/                   checked-in snapshots: projects.json, project_index.json, sample_trace.jsonl, quest dataset
artifacts/quest-lora/   local quest-LoRA training output (gitignored; loaded from the Hub repo at runtime)
docs/                   build reports (e.g. quest-classification-lora.md)
tests/                  pytest suite (mirrors module names: test_<module>.py)
```

### Engine package (`hackathon_advisor/`)

| Module | Responsibility |
| --- | --- |
| `agent.py` | `AdvisorEngine.turn()` / `turn_stream()`. **One** LLM tool-pick per turn, then deterministic Python orchestration (`search → whitespace → score → plan`). Advisor prose is built from **f-string templates** here, not by the model. |
| `model_runtime.py` | `ToolPlanner` backends. `create_tool_planner()` selects via `ADVISOR_MODEL_BACKEND`: `minicpm-transformers` (MiniCPM5-1B + advisor LoRA, device ladder `auto/CUDA → MPS → CPU`) or `rules` (`RuleBasedPlanner`). |
| `tool_contracts.py` | `TOOL_SPECS` typed schema; `parse_xml_tool_call()`; `resolve_tool_call()` returns `valid` or a `defaulted` call (the tool-call **degradation ladder**). |
| `tools.py` | Tool implementations over `ProjectIndex` (search, whitespace, score, plan, profile, …). Heavy logic lives here, not in the model. |
| `aliases.py` | Jargon normalization (fuzzy-maps "neutron" → Nemotron, "mini cpm" → MiniCPM5, …) applied **before** tool routing. |
| `data.py` | `ProjectIndex`: loads the snapshot + embedding index, `_embed_query()` via llama.cpp, cosine search. |
| `llama_embedding.py` | `LlamaCppEmbedder`: EmbeddingGemma GGUF through llama-cpp-python (the Llama Champion path). |
| `dashboard.py` / `dashboard_storage.py` / `dashboard_search.py` | Atlas payload (t-SNE / KMeans / nearest links), BM25 search, and the refresh **lease + heartbeat + atomic `latest.json` swap**. |
| `quest_analysis.py` / `quest_taxonomy.py` / `quest_cache.py` | MiniCPM quest LoRA → strict quest JSON; the taxonomy; per-project cache keyed on prompt/taxonomy/model/adapter hashes. |
| `scoring.py` | Deterministic idea rubric (the model only triggers + verbalizes it). |
| `wood_map.py` / `png_export.py` | PCA projection + Pillow render of the shareable page PNG. |
| `field_notes.py` / `chapter.py` / `trace_export.py` / `submission_packet.py` / `artifact_bundle.py` / `demo_rehearsal.py` | Export surfaces (notes, chapter, agent trace, submission packet, demo bundle). |
| `prize_ledger.py` | Model stack + parameter budget + badge ledger reported at `/api/prize-ledger`. |
| `zerogpu.py` | `gpu_task()` decorator (no-op unless `ADVISOR_ZERO_GPU=1`) + GPU-quota error detection for the CPU fallback. |
| `runtime_hooks.py` / `profiling.py` | Process/runtime helpers and turn profiling. |
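
To make the `tool_contracts.py` degradation ladder concrete, here is a minimal sketch of parsing one XML tool call and falling back to a default when it does not validate. The XML shape (`<tool name="...">`), the two-tool schema, the `ResolvedCall` dataclass, and the choice of `search` as the fallback are all assumptions for illustration; only the function names and the `valid` / `defaulted` outcomes come from the table above.

```python
import re
from dataclasses import dataclass

# Hypothetical two-tool schema; the real TOOL_SPECS lives in tool_contracts.py.
TOOL_SPECS = {
    "search": {"query": str},
    "score": {"idea": str},
}
DEFAULT_TOOL = "search"  # assumed fallback tool

_CALL_RE = re.compile(r'<tool\s+name="(\w+)">(.*?)</tool>', re.DOTALL)
_ARG_RE = re.compile(r"<(\w+)>(.*?)</\1>", re.DOTALL)


@dataclass
class ResolvedCall:
    name: str
    args: dict
    status: str  # "valid" or "defaulted"


def parse_xml_tool_call(text):
    """Return (name, args) for the first tool call found in text, or None."""
    match = _CALL_RE.search(text)
    if match is None:
        return None
    args = {key: value.strip() for key, value in _ARG_RE.findall(match.group(2))}
    return match.group(1), args


def resolve_tool_call(model_output, user_message):
    """Accept a well-formed call; otherwise degrade to a default search."""
    parsed = parse_xml_tool_call(model_output)
    if parsed is not None:
        name, args = parsed
        spec = TOOL_SPECS.get(name)
        # Valid only if the tool exists and the argument names match the schema.
        if spec is not None and set(args) == set(spec):
            return ResolvedCall(name, args, "valid")
    return ResolvedCall(DEFAULT_TOOL, {"query": user_message}, "defaulted")
```

The point of the ladder is that a malformed or unknown call never raises: the turn still proceeds with a safe default, and the `status` field records that it was defaulted.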

### Routes (`app.py`)

First-party FastAPI routes power the visible app; `@app.api()` endpoints stay available for Gradio/Python clients.

| Route | Purpose |
| --- | --- |
| `GET /`, `GET /static/{path}` | Serve the bespoke `static/` frontend |
| `POST /api/agent-turn` | The advisor turn as an **NDJSON stream**; this is the `@spaces.GPU` boundary |
| `POST /api/transcribe` | Voice note → transcript (NeMo, see ASR gotcha) |
| `GET /api/dashboard` · `GET /api/dashboard/search` | Atlas payload · BM25 search |
| `POST/GET /api/dashboard/refresh` | Start / poll one background refresh job |
| `GET /api/bootstrap` · `GET /api/runtime` · `GET /api/prize-ledger` · `GET /api/tool-contracts` | Frontend bootstrap, runtime status, prize ledger, tool schema |
| `GET /api/demo-bundle.zip` · `GET /api/lora-training-kit.zip` · `POST /api/artifact.png` · `POST /api/field-notes` · `POST /api/chapter` | Exports |
| `GET /health` | Liveness |
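
Since `/api/agent-turn` streams NDJSON, a client has to reassemble lines from arbitrary byte chunks before decoding them. The generator below is a sketch of that step; the event fields used in the example (`type`, `name`, `text`) are hypothetical, and the actual event schema is whatever `app.py` emits.

```python
import json
from typing import Iterable, Iterator


def iter_ndjson(chunks: Iterable[bytes]) -> Iterator[dict]:
    """Yield one decoded JSON object per newline-delimited line."""
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        # A chunk may end mid-line, so only complete lines are decoded here.
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            if line.strip():
                yield json.loads(line)
    if buffer.strip():
        yield json.loads(buffer)  # final line without a trailing newline


# Example: two events split across three network chunks (made-up payloads).
chunks = [
    b'{"type": "tool", "na',
    b'me": "search"}\n{"type":',
    b' "final", "text": "ok"}\n',
]
events = list(iter_ndjson(chunks))
```

With an HTTP client that exposes a byte iterator (for example `requests.Response.iter_content()` on a streamed request), the same generator could be fed the response chunks directly.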

---

## Gotchas (the things that bite agents here)

1. **The 1B model emits only ONE XML tool call per turn.** All user-facing prose is templated Python (`agent.py`
   `_*_response`), and multi-step flows are orchestrated in code, not by a model-driven ReAct loop. Do **not** "make the
   model write the response" or add multi-hop tool loops; route through `tool_contracts.py` instead.
2. **Off the Grid is a hard constraint.** No proprietary cloud inference API may touch the runtime path. All three
   engines run locally from open weights. Don't add `InferenceClient`, `openai`, etc. to runtime code.
3. **Parameter budget.** Total ≤32B, largest single model ≤4B (Tiny Titan). Don't introduce a larger model;
   `prize_ledger.py` documents the ~1.98B stack.
4. **MiniCPM (PyTorch) and llama.cpp clash on OpenMP.** Query embedding runs in a **worker subprocess** on macOS, and
   dashboard refresh builds the GGUF index in a subprocess before returning to the MiniCPM process. Keep these isolated;
   don't import both heavy runtimes into the same hot path.
5. **Decoding is greedy.** `enable_thinking=False`, `temperature=0` for tool calls and strict quest JSON. Keep tool
   schemas small and single-hop (1B discipline).
6. **Never write `latest.json` directly.** Refreshes write `runs/{run_id}/…`, then do an **atomic swap** under
   `$ADVISOR_CACHE_DIR/refresh.lock` with a heartbeat; a failed run leaves the last validated dashboard in place.
7. **Tests must stay GPU-free.** The suite mocks torch/transformers/llama.cpp, so `pytest` runs with no GPU and no model
   weights. Don't add heavy module-level imports that break CPU-only test collection.
8. **ASR backend.** `asr_runtime.py` requires NVIDIA NeMo ASR for `nvidia/nemotron-speech-streaming-en-0.6b`; missing
   NeMo is a hard runtime error, locally and on the deployed Space. `status()` reports the configured Nemotron backend.
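
The atomic swap in gotcha 6 can be sketched as "write to a temp file in the same directory, then `os.replace`". This example is an assumption-laden illustration: the `publish_latest` helper and the `dashboard.json` file name are hypothetical, and the `refresh.lock` lease and heartbeat that the real refresh uses are left out entirely.

```python
import json
import os
import tempfile
from pathlib import Path


def publish_latest(cache_dir: Path, run_id: str, payload: dict) -> Path:
    """Write the run output, then atomically replace latest.json."""
    run_dir = cache_dir / "runs" / run_id
    run_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical per-run file name; the real layout under runs/{run_id}/ may differ.
    (run_dir / "dashboard.json").write_text(json.dumps(payload))

    # Stage in the same directory so os.replace stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=cache_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(payload, handle)
            handle.flush()
            os.fsync(handle.fileno())
        # Readers see either the old file or the new one, never a partial write.
        os.replace(tmp_name, cache_dir / "latest.json")
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return cache_dir / "latest.json"
```

If anything fails before `os.replace`, the previous `latest.json` is untouched, which matches the "failed run leaves the last validated dashboard in place" behavior described above.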

---

## Offline pipelines (`scripts/`, build-time only)

Runtime never calls these; they run ahead of time and keep the Space self-contained.

```bash
python scripts/crawl_hf_spaces.py --org build-small-hackathon --out data/projects.json   # crawl the field
python scripts/build_project_index.py --projects data/projects.json --out data/project_index.json   # local llama.cpp index
python scripts/build_project_index.py --location modal ...   # same build, on Modal (one CLI, --location switches where it runs)
modal run scripts/modal_train_quest_lora.py ...           # train the quest LoRA on Modal
python scripts/publish_quest_adapter.py ...               # push the quest adapter to the Hub
python scripts/publish_quest_dataset.py ...               # push the quest dataset to the Hub
```

---

## Commits & reviews

- **Conventional commits**, one concern per commit. Observed history: `feat:`, `fix:`, `refactor:`, `chore:`, `docs:`.
- **Gate before committing:** `uv run pytest` green, `uvx ruff check .` clean, and the README updated if behavior
  changed.
- Keep the engine package UI-agnostic; if you touch a runtime model path, re-check gotchas 2–4 (Off the Grid, param
  budget, OpenMP isolation).

---

## Key environment variables

| Variable | Default | Use |
| --- | --- | --- |
| `ADVISOR_CACHE_DIR` | (unset) | Artifact store (mounted bucket on Spaces); enables the refresh scheduler when set |
| `ADVISOR_MODEL_BACKEND` | `minicpm-transformers` | Advisor planner: `minicpm-transformers` or `rules` |
| `ADVISOR_MODEL_ID` / `ADVISOR_ADAPTER_ID` / `ADVISOR_ADAPTER_REVISION` | MiniCPM5-1B + advisor LoRA | Advisor model + pinned LoRA |
| `ADVISOR_QUEST_ANALYZER_BACKEND` / `ADVISOR_QUEST_ADAPTER_ID` | `minicpm-transformers` / `build-small-hackathon/hackathon-advisor-quest-minicpm5-lora` | Quest classifier |
| `ADVISOR_ZERO_GPU` / `ADVISOR_ZERO_GPU_DURATION` | off / `120` | Wrap the engine turn in `@spaces.GPU` on the deployed Space |
| `ADVISOR_ASR_MODEL_ID` | Nemotron | Voice ASR model |
| `ADVISOR_EMBEDDING_MODEL_REPO` / `ADVISOR_EMBEDDING_MODEL_FILE` | EmbeddingGemma GGUF | llama.cpp retrieval model |
| `ADVISOR_REFRESH_COMPUTE` / `ADVISOR_REFRESH_INTERVAL_SECONDS` | `cpu` / `3600` | Scheduled refresh compute + cadence |
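
For orientation, an env-gated GPU decorator like the `gpu_task()` described for `zerogpu.py` might look roughly like the sketch below. The signature and the `duration` default are assumptions drawn from the table, not the module's actual code; `spaces.GPU(duration=...)` is the decorator provided by the Hugging Face `spaces` package on ZeroGPU Spaces.

```python
import os


def gpu_task(duration: int = 120):
    """Decorator factory: wrap with spaces.GPU only when ADVISOR_ZERO_GPU=1."""

    def decorate(func):
        if os.environ.get("ADVISOR_ZERO_GPU") != "1":
            return func  # no-op: local runs and tests never import `spaces`

        import spaces  # deferred import; only expected on a ZeroGPU Space

        return spaces.GPU(duration=duration)(func)

    return decorate
```

Deferring the `spaces` import keeps local runs and the CPU-only test suite free of the dependency.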

See `## Runtime Backend` in `README.md` for the full deployed configuration.