# AGENTS.md
Operating manual for coding agents working in this repo.
---
## What this is
**Hackathon Advisor** is a Gradio `gradio.Server` (FastAPI subclass) Space for the
[Build Small Hackathon](https://huggingface.co/build-small-hackathon). It is a small-model (**≤32B**, largest single
model **≤4B**) originality coach: it crawls the public `build-small-hackathon` org into a live project atlas, then lets a
builder search the field and open **The Unwritten Almanac** advisor to test an idea against existing work.
The engine in `hackathon_advisor/` is **UI-agnostic**; `app.py` and `static/` are one possible front door.
**Model stack (all open-weight, all local):**
| Role | Model | Runtime |
| --- | --- | --- |
| Advisor brain (tool planning) | `openbmb/MiniCPM5-1B` + advisor LoRA | Transformers + PEFT, ZeroGPU |
| Quest classifier | `openbmb/MiniCPM5-1B` + quest LoRA | Transformers + PEFT, ZeroGPU |
| Retrieval / atlas | `ggml-org/embeddinggemma-300m-qat-q8_0-GGUF` | llama.cpp (llama-cpp-python) |
| Voice input (ASR) | `nvidia/nemotron-speech-streaming-en-0.6b` | NVIDIA NeMo |
---
## Setup & commands
- **Python** `>=3.11,<3.13`. Dependency manager is **uv** (`uv.lock` is the source of truth).
- **System packages** (`packages.txt`): `ffmpeg`, `libsndfile1`.
```bash
uv sync # or: pip install -r requirements.txt
uv run pytest # run the test suite (fast, NO GPU/weights needed; heavy models are mocked)
uvx ruff check . # lint (config: pyproject.toml [tool.ruff], line-length 100, py311; ruff is not a pinned dep)
uvx ruff format . # format
```
Run the app locally (greedy CPU/MPS path, no ZeroGPU):
```bash
mkdir -p .cache/advisor-dashboard
ADVISOR_CACHE_DIR=.cache/advisor-dashboard \
ADVISOR_MODEL_BACKEND=minicpm-transformers \
ADVISOR_QUEST_ANALYZER_BACKEND=minicpm-transformers \
python app.py # → http://127.0.0.1:7860
```
`ADVISOR_MODEL_BACKEND=rules` swaps the LLM for a deterministic planner; use it for UI/plumbing work without loading
MiniCPM.
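A minimal sketch of how an environment-driven backend switch like this can work. The names below (`KeywordPlanner`, `make_planner`) are hypothetical stand-ins, not the repo's `RuleBasedPlanner` or `create_tool_planner()`, and the keyword rule is invented for illustration. What it shows is that the heavy import sits inside the LLM branch, so the `rules` path never loads model code.

```python
import os
from typing import Mapping, Optional, Protocol


class Planner(Protocol):
    def pick_tool(self, message: str) -> str: ...


class KeywordPlanner:
    """Hypothetical deterministic planner: keyword match, no model needed."""

    def pick_tool(self, message: str) -> str:
        if "score" in message.lower():
            return "score"
        return "search"  # assumed safe default tool


def make_planner(env: Optional[Mapping[str, str]] = None) -> Planner:
    """Select a planner from ADVISOR_MODEL_BACKEND (default as listed in the env table)."""
    env = os.environ if env is None else env
    backend = env.get("ADVISOR_MODEL_BACKEND", "minicpm-transformers")
    if backend == "rules":
        return KeywordPlanner()
    if backend == "minicpm-transformers":
        # Heavy import deferred to this branch so the rules path stays light.
        import transformers  # noqa: F401

        raise NotImplementedError("model loading is out of scope for this sketch")
    raise ValueError(f"unknown ADVISOR_MODEL_BACKEND: {backend!r}")
```

Passing a plain dict as `env` keeps the selection logic testable without touching the real process environment.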
`pytest` config lives in `pyproject.toml` (`testpaths=["tests"]`, `pythonpath=["."]`). **Always run it before
committing**: there are 26 test files and they are the contract.
---
## Repo map
```
app.py                 gr.Server entry: static UI + FastAPI /api/* + @app.api() client endpoints + refresh scheduler
hackathon_advisor/     the engine package (UI-agnostic; keep it that way)
static/                bespoke frontend (index.html / app.js / styles.css), the Off-Brand custom UI
scripts/               offline pipelines (crawl, Modal index/LoRA build, Hub publish); NOT runtime
data/                  checked-in snapshots: projects.json, project_index.json, sample_trace.jsonl, quest dataset
artifacts/quest-lora/  local quest-LoRA training output (gitignored; loaded from the Hub repo at runtime)
docs/                  build reports (e.g. quest-classification-lora.md)
tests/                 pytest suite (mirrors module names: test_<module>.py)
```
### Engine package (`hackathon_advisor/`)
| Module | Responsibility |
| --- | --- |
| `agent.py` | `AdvisorEngine.turn()` / `turn_stream()`. **One** LLM tool-pick per turn, then deterministic Python orchestration (`search → whitespace → score → plan`). Advisor prose is built from **f-string templates** here, not by the model. |
| `model_runtime.py` | `ToolPlanner` backends. `create_tool_planner()` selects via `ADVISOR_MODEL_BACKEND`: `minicpm-transformers` (MiniCPM5-1B + advisor LoRA, device ladder `auto/CUDA → MPS → CPU`) or `rules` (`RuleBasedPlanner`). |
| `tool_contracts.py` | `TOOL_SPECS` typed schema; `parse_xml_tool_call()`; `resolve_tool_call()` returns `valid` or a `defaulted` call (the tool-call **degradation ladder**). |
| `tools.py` | Tool implementations over `ProjectIndex` (search, whitespace, score, plan, profile, …). Heavy logic lives here, not in the model. |
| `aliases.py` | Jargon normalization (fuzzy-maps "neutron" → Nemotron, "mini cpm" → MiniCPM5, …) applied **before** tool routing. |
| `data.py` | `ProjectIndex`: loads the snapshot + embedding index, `_embed_query()` via llama.cpp, cosine search. |
| `llama_embedding.py` | `LlamaCppEmbedder`: EmbeddingGemma GGUF through llama-cpp-python (the Llama Champion path). |
| `dashboard.py` / `dashboard_storage.py` / `dashboard_search.py` | Atlas payload (t-SNE / KMeans / nearest links), BM25 search, and the refresh **lease + heartbeat + atomic `latest.json` swap**. |
| `quest_analysis.py` / `quest_taxonomy.py` / `quest_cache.py` | MiniCPM quest LoRA → strict quest JSON; the taxonomy; per-project cache keyed on prompt/taxonomy/model/adapter hashes. |
| `scoring.py` | Deterministic idea rubric (the model only triggers + verbalizes it). |
| `wood_map.py` / `png_export.py` | PCA projection + Pillow render of the shareable page PNG. |
| `field_notes.py` / `chapter.py` / `trace_export.py` / `submission_packet.py` / `artifact_bundle.py` / `demo_rehearsal.py` | Export surfaces (notes, chapter, agent trace, submission packet, demo bundle). |
| `prize_ledger.py` | Model stack + parameter budget + badge ledger reported at `/api/prize-ledger`. |
| `zerogpu.py` | `gpu_task()` decorator (no-op unless `ADVISOR_ZERO_GPU=1`) + GPU-quota error detection for the CPU fallback. |
| `runtime_hooks.py` / `profiling.py` | Process/runtime helpers and turn profiling. |
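The tool-call degradation ladder described for `tool_contracts.py` can be sketched as below. This is an illustration under stated assumptions, not the repo's code: the XML shape (`<tool_call name="…"><arg name="…">`), the `SPECS` table, the fallback tool, and the `resolve` / `ResolvedCall` names are all hypothetical. The idea it shows is that any unparseable or invalid model output collapses to one safe default call instead of raising.

```python
import re
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field

# Hypothetical tool table: tool name -> required argument names.
SPECS = {"search": {"query"}, "score": {"idea"}}
FALLBACK_TOOL = "search"  # assumed default; the real ladder may differ


@dataclass
class ResolvedCall:
    tool: str
    args: dict = field(default_factory=dict)
    status: str = "valid"  # "valid" or "defaulted"


def resolve(raw_output: str, user_message: str) -> ResolvedCall:
    """Return the model's call if it validates, else a safe default call."""
    fallback = ResolvedCall(FALLBACK_TOOL, {"query": user_message}, "defaulted")
    # Take the first tool-call element and ignore any surrounding text.
    match = re.search(r"<tool_call\b.*?</tool_call>", raw_output, re.DOTALL)
    if match is None:
        return fallback
    try:
        node = ET.fromstring(match.group(0))
    except ET.ParseError:
        return fallback
    tool = node.get("name", "")
    args = {child.get("name"): (child.text or "").strip() for child in node.findall("arg")}
    required = SPECS.get(tool)
    # Unknown tool or missing required arguments both fall back.
    if required is None or not required.issubset(args):
        return fallback
    return ResolvedCall(tool, args)
```

Returning a status alongside the call lets downstream code record when a turn was defaulted, without branching on exceptions.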
### Routes (`app.py`)
First-party FastAPI routes power the visible app; `@app.api()` endpoints stay available for Gradio/Python clients.
| Route | Purpose |
| --- | --- |
| `GET /` , `GET /static/{path}` | Serve the bespoke `static/` frontend |
| `POST /api/agent-turn` | The advisor turn as an **NDJSON stream**; this is the `@spaces.GPU` boundary |
| `POST /api/transcribe` | Voice note → transcript (NeMo, see ASR gotcha) |
| `GET /api/dashboard` · `GET /api/dashboard/search` | Atlas payload · BM25 search |
| `POST/GET /api/dashboard/refresh` | Start / poll one background refresh job |
| `GET /api/bootstrap` · `GET /api/runtime` · `GET /api/prize-ledger` · `GET /api/tool-contracts` | Frontend bootstrap, runtime status, prize ledger, tool schema |
| `GET /api/demo-bundle.zip` · `GET /api/lora-training-kit.zip` · `POST /api/artifact.png` · `POST /api/field-notes` · `POST /api/chapter` | Exports |
| `GET /health` | Liveness |
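Since `/api/agent-turn` streams NDJSON, a client has to reassemble lines that may be split across network chunks. The helper below is a minimal sketch of that parsing step only. `iter_ndjson` is a hypothetical name, and the event fields in the usage example are invented because the route's actual event schema is not documented here.

```python
import json
from typing import Iterable, Iterator


def iter_ndjson(chunks: Iterable[bytes]) -> Iterator[dict]:
    """Yield one JSON object per newline-terminated line, across chunk boundaries."""
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            if line.strip():  # skip blank keep-alive lines
                yield json.loads(line)
    if buffer.strip():  # final line without a trailing newline
        yield json.loads(buffer)


# Usage with made-up events split at awkward byte positions:
chunks = [b'{"type": "tool", "na', b'me": "search"}\n{"type"', b': "done"}\n']
for event in iter_ndjson(chunks):
    print(event["type"])
```

In practice `chunks` would be the streamed response body from whichever HTTP client is in use.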
---
## Gotchas (the things that bite agents here)
1. **The 1B model only emits ONE XML tool call per turn.** All user-facing prose is templated Python (`agent.py`
`_*_response`), and multi-step flows are orchestrated in code, not in a model-driven ReAct loop. Do **not** "make the
model write the response" or add multi-hop tool loops; route through `tool_contracts.py` instead.
2. **Off the Grid is a hard constraint.** No proprietary cloud inference API may touch the runtime path. All three
engines run locally from open weights. Don't add `InferenceClient`, `openai`, etc. to runtime code.
3. **Parameter budget.** Total ≤32B, largest single model ≤4B (Tiny Titan). Don't introduce a larger model;
`prize_ledger.py` documents the ~1.98B stack.
4. **MiniCPM (PyTorch) and llama.cpp clash on OpenMP.** Query embedding runs in a **worker subprocess** on macOS, and
dashboard refresh builds the GGUF index in a subprocess before returning to the MiniCPM process. Keep these isolated;
don't import both heavy runtimes into the same hot path.
5. **Decoding is greedy.** `enable_thinking=False`, `temperature=0` for tool calls and strict quest JSON. Keep tool
schemas small and single-hop (1B discipline).
6. **Never write `latest.json` directly.** Refreshes write `runs/{run_id}/β¦` then do an **atomic swap** under
`$ADVISOR_CACHE_DIR/refresh.lock` with a heartbeat; a failed run leaves the last validated dashboard in place.
7. **Tests must stay GPU-free.** The suite mocks torch/transformers/llama.cpp, so `pytest` runs with no GPU and no model
weights. Don't add module-top heavy imports that break CPU-only test collection.
8. **ASR backend.** `asr_runtime.py` requires NVIDIA NeMo ASR for `nvidia/nemotron-speech-streaming-en-0.6b`; missing
NeMo is a hard runtime error, locally and on the deployed Space. `status()` reports the configured Nemotron backend.
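The write-then-swap rule in gotcha 6 can be sketched as follows. This is an assumption-laden illustration, not the code in `dashboard_storage.py`: `publish_latest`, the `dashboard.json` file name, and the `projects` validation key are hypothetical, and the lock file and heartbeat are left out. It shows the core pattern: write the run under `runs/{run_id}/`, validate it, then replace `latest.json` in a single `os.replace` call so readers never see a partial file.

```python
import json
import os
import tempfile
from pathlib import Path


def publish_latest(cache_dir: Path, run_id: str, payload: dict) -> Path:
    """Write the run under runs/<run_id>/, validate it, then swap latest.json atomically."""
    run_dir = cache_dir / "runs" / run_id
    run_dir.mkdir(parents=True, exist_ok=True)
    run_file = run_dir / "dashboard.json"  # hypothetical file name
    run_file.write_text(json.dumps(payload), encoding="utf-8")

    # Validation stand-in: re-read and require a non-empty "projects" list (assumed key).
    loaded = json.loads(run_file.read_text(encoding="utf-8"))
    if not loaded.get("projects"):
        raise ValueError("run failed validation; latest.json left untouched")

    # Stage in the same directory so os.replace stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=cache_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(loaded, handle)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_name, cache_dir / "latest.json")
    except BaseException:
        # Clean up the staging file if anything failed before the swap.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return cache_dir / "latest.json"
```

A failed validation raises before the swap, which matches the stated behavior that a failed run leaves the last validated dashboard in place.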
---
## Offline pipelines (`scripts/`, build-time only)
Runtime never calls these; they keep the Space self-contained.
```bash
python scripts/crawl_hf_spaces.py --org build-small-hackathon --out data/projects.json # crawl the field
python scripts/build_project_index.py --projects data/projects.json --out data/project_index.json # local llama.cpp index
python scripts/build_project_index.py --location modal ... # same build, on Modal (one CLI, --location switches where it runs)
modal run scripts/modal_train_quest_lora.py ... # train the quest LoRA on Modal
python scripts/publish_quest_adapter.py ... # push the quest adapter to the Hub
python scripts/publish_quest_dataset.py ... # push the quest dataset to the Hub
```
---
## Commits & reviews
- **Conventional commits**, one concern per commit. Observed history: `feat:`, `fix:`, `refactor:`, `chore:`, `docs:`.
- **Gate before committing:** `uv run pytest` green, `uvx ruff check .` clean, and the README updated if behavior
changed.
- Keep the engine package UI-agnostic; if you touch a runtime model path, re-check gotchas 2–4 (Off the Grid, param
budget, OpenMP isolation).
---
## Key environment variables
| Variable | Default | Use |
| --- | --- | --- |
| `ADVISOR_CACHE_DIR` | (none) | Artifact store (mounted bucket on Spaces); enables the refresh scheduler when set |
| `ADVISOR_MODEL_BACKEND` | `minicpm-transformers` | Advisor planner: `minicpm-transformers` or `rules` |
| `ADVISOR_MODEL_ID` / `ADVISOR_ADAPTER_ID` / `ADVISOR_ADAPTER_REVISION` | MiniCPM5-1B + advisor LoRA | Advisor model + pinned LoRA |
| `ADVISOR_QUEST_ANALYZER_BACKEND` / `ADVISOR_QUEST_ADAPTER_ID` | `minicpm-transformers` / `build-small-hackathon/hackathon-advisor-quest-minicpm5-lora` | Quest classifier |
| `ADVISOR_ZERO_GPU` / `ADVISOR_ZERO_GPU_DURATION` | off / `120` | Wrap the engine turn in `@spaces.GPU` on the deployed Space |
| `ADVISOR_ASR_MODEL_ID` | Nemotron | Voice ASR model |
| `ADVISOR_EMBEDDING_MODEL_REPO` / `ADVISOR_EMBEDDING_MODEL_FILE` | EmbeddingGemma GGUF | llama.cpp retrieval model |
| `ADVISOR_REFRESH_COMPUTE` / `ADVISOR_REFRESH_INTERVAL_SECONDS` | `cpu` / `3600` | Scheduled refresh compute + cadence |
See `## Runtime Backend` in `README.md` for the full deployed configuration.
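As an illustration of how the `ADVISOR_ZERO_GPU` switch can gate a decorator, here is a minimal sketch. `gpu_task_sketch` is a hypothetical name and this is not the implementation in `zerogpu.py`; it assumes only the documented defaults (off, `120`) and that the Hugging Face `spaces` package is installed when the flag is on.

```python
import os
from typing import Callable, TypeVar

F = TypeVar("F", bound=Callable)


def gpu_task_sketch(fn: F) -> F:
    """Return fn unchanged unless ADVISOR_ZERO_GPU=1, else wrap it with spaces.GPU."""
    if os.environ.get("ADVISOR_ZERO_GPU") != "1":
        return fn  # no-op path: local runs and tests never import spaces
    import spaces  # Hugging Face ZeroGPU helper; imported only when enabled

    duration = int(os.environ.get("ADVISOR_ZERO_GPU_DURATION", "120"))
    return spaces.GPU(duration=duration)(fn)


@gpu_task_sketch
def run_turn(message: str) -> str:
    # Placeholder body; a real turn would call the engine here.
    return message.upper()
```

Reading the flag at decoration time keeps the CPU/MPS path free of any `spaces` dependency.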