# AGENTS.md
Operating manual for coding agents working in this repo.
---
## What this is
**Hackathon Advisor** is a Gradio `gradio.Server` (FastAPI subclass) Space for the
[Build Small Hackathon](https://huggingface.co/build-small-hackathon). It is a small-model (**≤32B**, largest single
model **≤4B**) originality coach: it crawls the public `build-small-hackathon` org into a live project atlas, then lets a
builder search the field and open **The Unwritten Almanac** advisor to test an idea against existing work.
The engine in `hackathon_advisor/` is **UI-agnostic**; `app.py` and `static/` are one possible front door.
**Model stack (all open-weight, all local):**
| Role | Model | Runtime |
| --- | --- | --- |
| Advisor brain (tool planning) | `openbmb/MiniCPM5-1B` + advisor LoRA | Transformers + PEFT, ZeroGPU |
| Quest classifier | `openbmb/MiniCPM5-1B` + quest LoRA | Transformers + PEFT, ZeroGPU |
| Retrieval / atlas | `ggml-org/embeddinggemma-300m-qat-q8_0-GGUF` | llama.cpp (llama-cpp-python) |
| Voice input (ASR) | `nvidia/nemotron-speech-streaming-en-0.6b` | NVIDIA NeMo |
---
## Setup & commands
- **Python** `>=3.11,<3.13`. Dependency manager is **uv** (`uv.lock` is the source of truth).
- **System packages** (`packages.txt`): `ffmpeg`, `libsndfile1`.
```bash
uv sync # or: pip install -r requirements.txt
uv run pytest # run the test suite (fast, NO GPU/weights needed; heavy models are mocked)
uvx ruff check . # lint (config: pyproject.toml [tool.ruff], line-length 100, py311; ruff is not a pinned dep)
uvx ruff format . # format
```
Run the app locally (greedy CPU/MPS path, no ZeroGPU):
```bash
mkdir -p .cache/advisor-dashboard
ADVISOR_CACHE_DIR=.cache/advisor-dashboard \
ADVISOR_MODEL_BACKEND=minicpm-transformers \
ADVISOR_QUEST_ANALYZER_BACKEND=minicpm-transformers \
python app.py # → http://127.0.0.1:7860
```
`ADVISOR_MODEL_BACKEND=rules` swaps the LLM for a deterministic planner; use it for UI/plumbing work without loading
MiniCPM.
`pytest` config lives in `pyproject.toml` (`testpaths=["tests"]`, `pythonpath=["."]`). **Always run it before
committing**: there are 26 test files and they are the contract.
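The suite can run without a GPU or model weights only if heavy runtimes are imported lazily rather than at module top. The following is a minimal sketch of that pattern, not code from this repo: `LazyPlanner` and `load_heavy_module` are hypothetical names, and the repo's own mocking approach may differ.

```python
import importlib
from typing import Any


def load_heavy_module(name: str) -> Any:
    """Import a heavy dependency only at the moment it is needed."""
    return importlib.import_module(name)


class LazyPlanner:
    """Hypothetical planner that defers its heavy import until first use."""

    def __init__(self, runtime_module: str = "torch") -> None:
        self.runtime_module = runtime_module
        self._runtime = None  # nothing heavy is imported at construction time

    def runtime(self) -> Any:
        # The import happens here, so test collection never triggers it.
        if self._runtime is None:
            self._runtime = load_heavy_module(self.runtime_module)
        return self._runtime
```

Because the import happens inside `runtime()`, a test can construct the object freely and substitute a stub module before the first call, so collection stays CPU-only.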
---
## Repo map
```
app.py                 gr.Server entry: static UI + FastAPI /api/* + @app.api() client endpoints + refresh scheduler
hackathon_advisor/     the engine package (UI-agnostic; keep it that way)
static/                bespoke frontend (index.html / app.js / styles.css), the Off-Brand custom UI
scripts/               offline pipelines (crawl, Modal index/LoRA build, Hub publish); NOT runtime
data/                  checked-in snapshots: projects.json, project_index.json, sample_trace.jsonl, quest dataset
artifacts/quest-lora/  local quest-LoRA training output (gitignored; loaded from the Hub repo at runtime)
docs/                  build reports (e.g. quest-classification-lora.md)
tests/                 pytest suite (mirrors module names: test_<module>.py)
```
### Engine package (`hackathon_advisor/`)
| Module | Responsibility |
| --- | --- |
| `agent.py` | `AdvisorEngine.turn()` / `turn_stream()`. **One** LLM tool-pick per turn, then deterministic Python orchestration (`search → whitespace → score → plan`). Advisor prose is built from **f-string templates** here, not by the model. |
| `model_runtime.py` | `ToolPlanner` backends. `create_tool_planner()` selects via `ADVISOR_MODEL_BACKEND`: `minicpm-transformers` (MiniCPM5-1B + advisor LoRA, device ladder `auto/CUDA → MPS → CPU`) or `rules` (`RuleBasedPlanner`). |
| `tool_contracts.py` | `TOOL_SPECS` typed schema; `parse_xml_tool_call()`; `resolve_tool_call()` returns `valid` or a `defaulted` call (the tool-call **degradation ladder**). |
| `tools.py` | Tool implementations over `ProjectIndex` (search, whitespace, score, plan, profile, …). Heavy logic lives here, not in the model. |
| `aliases.py` | Jargon normalization (fuzzy-maps "neutron" → Nemotron, "mini cpm" → MiniCPM5, …) applied **before** tool routing. |
| `data.py` | `ProjectIndex`: loads the snapshot + embedding index, `_embed_query()` via llama.cpp, cosine search. |
| `llama_embedding.py` | `LlamaCppEmbedder`: EmbeddingGemma GGUF through llama-cpp-python (the Llama Champion path). |
| `dashboard.py` / `dashboard_storage.py` / `dashboard_search.py` | Atlas payload (t-SNE / KMeans / nearest links), BM25 search, and the refresh **lease + heartbeat + atomic `latest.json` swap**. |
| `quest_analysis.py` / `quest_taxonomy.py` / `quest_cache.py` | MiniCPM quest LoRA → strict quest JSON; the taxonomy; per-project cache keyed on prompt/taxonomy/model/adapter hashes. |
| `scoring.py` | Deterministic idea rubric (the model only triggers + verbalizes it). |
| `wood_map.py` / `png_export.py` | PCA projection + Pillow render of the shareable page PNG. |
| `field_notes.py` / `chapter.py` / `trace_export.py` / `submission_packet.py` / `artifact_bundle.py` / `demo_rehearsal.py` | Export surfaces (notes, chapter, agent trace, submission packet, demo bundle). |
| `prize_ledger.py` | Model stack + parameter budget + badge ledger reported at `/api/prize-ledger`. |
| `zerogpu.py` | `gpu_task()` decorator (no-op unless `ADVISOR_ZERO_GPU=1`) + GPU-quota error detection for the CPU fallback. |
| `runtime_hooks.py` / `profiling.py` | Process/runtime helpers and turn profiling. |
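The tool-call degradation ladder described for `tool_contracts.py` can be illustrated roughly as follows. This is a sketch under stated assumptions, not the repo's implementation: the XML shape (`<tool name="..."><arg name="...">`), the `SKETCH_TOOL_SPECS` schema, the `search` fallback, and the `sketch_*` function names are all hypothetical.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET
from dataclasses import dataclass, field

# Hypothetical schema: tool name -> required argument names.
SKETCH_TOOL_SPECS: dict[str, tuple[str, ...]] = {
    "search": ("query",),
    "score": ("idea",),
}
DEFAULT_TOOL = "search"  # assumed fallback tool, not taken from the repo


@dataclass
class ResolvedCall:
    tool: str
    args: dict[str, str] = field(default_factory=dict)
    status: str = "valid"  # "valid" or "defaulted"


def sketch_parse_tool_call(text: str) -> tuple[str, dict[str, str]] | None:
    """Parse an assumed <tool name="..."><arg name="...">value</arg></tool> shape."""
    start, end = text.find("<tool"), text.rfind("</tool>")
    if start == -1 or end == -1:
        return None
    try:
        root = ET.fromstring(text[start : end + len("</tool>")])
    except ET.ParseError:
        return None
    name = root.get("name")
    if not name:
        return None
    args = {a.get("name", ""): (a.text or "").strip() for a in root.findall("arg")}
    return name, args


def sketch_resolve(text: str, user_message: str) -> ResolvedCall:
    """Return a valid call, or degrade to a defaulted call built from the user message."""
    parsed = sketch_parse_tool_call(text)
    if parsed is not None:
        name, args = parsed
        required = SKETCH_TOOL_SPECS.get(name)
        if required is not None and all(args.get(key) for key in required):
            return ResolvedCall(name, {key: args[key] for key in required}, "valid")
    # Unparseable output, unknown tool, or missing arguments all land here.
    return ResolvedCall(DEFAULT_TOOL, {"query": user_message}, "defaulted")
```

The point of the ladder is that a malformed or unknown tool call from a 1B model never aborts the turn; it degrades to a safe default that the deterministic orchestration can still run.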
### Routes (`app.py`)
First-party FastAPI routes power the visible app; `@app.api()` endpoints stay available for Gradio/Python clients.
| Route | Purpose |
| --- | --- |
| `GET /`, `GET /static/{path}` | Serve the bespoke `static/` frontend |
| `POST /api/agent-turn` | The advisor turn, streamed as **NDJSON**; this is the `@spaces.GPU` boundary |
| `POST /api/transcribe` | Voice note → transcript (NeMo, see ASR gotcha) |
| `GET /api/dashboard` · `GET /api/dashboard/search` | Atlas payload · BM25 search |
| `POST/GET /api/dashboard/refresh` | Start / poll one background refresh job |
| `GET /api/bootstrap` · `GET /api/runtime` · `GET /api/prize-ledger` · `GET /api/tool-contracts` | Frontend bootstrap, runtime status, prize ledger, tool schema |
| `GET /api/demo-bundle.zip` · `GET /api/lora-training-kit.zip` · `POST /api/artifact.png` · `POST /api/field-notes` · `POST /api/chapter` | Exports |
| `GET /health` | Liveness |
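Since `POST /api/agent-turn` streams NDJSON, a client reads one JSON object per line. The helper below is a minimal sketch of that decoding step only; `iter_ndjson` is a hypothetical name, and the request payload and event fields are not documented here, so the `type` key in the usage note is illustrative.

```python
from __future__ import annotations

import json
from collections.abc import Iterable, Iterator


def iter_ndjson(lines: Iterable[str | bytes]) -> Iterator[dict]:
    """Yield one decoded JSON object per non-empty line of an NDJSON stream."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between events
        yield json.loads(line)
```

Any line iterator works as input, for example `Response.iter_lines()` from `requests` on a request made with `stream=True`. With sample input `[b'{"type": "tool"}\n', '{"type": "final"}\n']` the helper yields two dicts in order.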
---
## Gotchas (the things that bite agents here)
1. **The 1B model only emits ONE XML tool call per turn.** All user-facing prose is templated Python (`agent.py`
`_*_response`), and multi-step flows are orchestrated in code, not by a model-driven ReAct loop. Do **not** "make the
model write the response" or add multi-hop tool loops; route through `tool_contracts.py` instead.
2. **Off the Grid is a hard constraint.** No proprietary cloud inference API may touch the runtime path. All three
engines run locally from open weights. Don't add `InferenceClient`, `openai`, etc. to runtime code.
3. **Parameter budget.** Total ≤32B, largest single model ≤4B (Tiny Titan). Don't introduce a larger model;
`prize_ledger.py` documents the ~1.98B stack.
4. **MiniCPM (PyTorch) and llama.cpp clash on OpenMP.** Query embedding runs in a **worker subprocess** on macOS, and
dashboard refresh builds the GGUF index in a subprocess before returning to the MiniCPM process. Keep these isolated;
don't import both heavy runtimes into the same hot path.
5. **Decoding is greedy.** `enable_thinking=False`, `temperature=0` for tool calls and strict quest JSON. Keep tool
schemas small and single-hop (1B discipline).
6. **Never write `latest.json` directly.** Refreshes write `runs/{run_id}/…` then do an **atomic swap** under
`$ADVISOR_CACHE_DIR/refresh.lock` with a heartbeat; a failed run leaves the last validated dashboard in place.
7. **Tests must stay GPU-free.** The suite mocks torch/transformers/llama.cpp, so `pytest` runs with no GPU and no model
weights. Don't add module-top heavy imports that break CPU-only test collection.
8. **ASR backend.** `asr_runtime.py` requires NVIDIA NeMo ASR for `nvidia/nemotron-speech-streaming-en-0.6b`; missing
NeMo is a hard runtime error, locally and on the deployed Space. `status()` reports the configured Nemotron backend.
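The atomic swap in gotcha 6 follows a common pattern: write the new payload to a temporary file in the same directory, then rename it over the target. The sketch below shows only that step and assumes nothing about the repo's actual code; `atomic_write_json` is a hypothetical helper, and the refresh lock and heartbeat are deliberately left out.

```python
import json
import os
import tempfile
from pathlib import Path


def atomic_write_json(target: Path, payload: dict) -> None:
    """Write payload to a temp file beside target, then swap it into place."""
    target.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, prefix=target.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle)
            handle.flush()
            os.fsync(handle.fileno())  # make sure bytes hit disk before the rename
        os.replace(tmp_name, target)  # atomic when both paths share a filesystem
    except BaseException:
        # A failed write leaves the previous target untouched; drop the temp file.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

Readers of the target file see either the old content or the new content, never a partial write, which is why a failed run can leave the last validated dashboard in place.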
---
## Offline pipelines (`scripts/`, build-time only)
Runtime never calls these; they keep the Space self-contained.
```bash
python scripts/crawl_hf_spaces.py --org build-small-hackathon --out data/projects.json # crawl the field
python scripts/build_project_index.py --projects data/projects.json --out data/project_index.json # local llama.cpp index
python scripts/build_project_index.py --location modal ... # same build, on Modal (one CLI, --location switches where it runs)
modal run scripts/modal_train_quest_lora.py ... # train the quest LoRA on Modal
python scripts/publish_quest_adapter.py ... # push the quest adapter to the Hub
python scripts/publish_quest_dataset.py ... # push the quest dataset to the Hub
```
---
## Commits & reviews
- **Conventional commits**, one concern per commit. Observed history: `feat:`, `fix:`, `refactor:`, `chore:`, `docs:`.
- **Gate before committing:** `uv run pytest` green, `uvx ruff check .` clean, and the README updated if behavior
changed.
- Keep the engine package UI-agnostic; if you touch a runtime model path, re-check gotchas 2–4 (Off the Grid, param
budget, OpenMP isolation).
---
## Key environment variables
| Variable | Default | Use |
| --- | --- | --- |
| `ADVISOR_CACHE_DIR` | (none) | Artifact store (mounted bucket on Spaces); enables the refresh scheduler when set |
| `ADVISOR_MODEL_BACKEND` | `minicpm-transformers` | Advisor planner: `minicpm-transformers` or `rules` |
| `ADVISOR_MODEL_ID` / `ADVISOR_ADAPTER_ID` / `ADVISOR_ADAPTER_REVISION` | MiniCPM5-1B + advisor LoRA | Advisor model + pinned LoRA |
| `ADVISOR_QUEST_ANALYZER_BACKEND` / `ADVISOR_QUEST_ADAPTER_ID` | `minicpm-transformers` / `build-small-hackathon/hackathon-advisor-quest-minicpm5-lora` | Quest classifier |
| `ADVISOR_ZERO_GPU` / `ADVISOR_ZERO_GPU_DURATION` | off / `120` | Wrap the engine turn in `@spaces.GPU` on the deployed Space |
| `ADVISOR_ASR_MODEL_ID` | Nemotron | Voice ASR model |
| `ADVISOR_EMBEDDING_MODEL_REPO` / `ADVISOR_EMBEDDING_MODEL_FILE` | EmbeddingGemma GGUF | llama.cpp retrieval model |
| `ADVISOR_REFRESH_COMPUTE` / `ADVISOR_REFRESH_INTERVAL_SECONDS` | `cpu` / `3600` | Scheduled refresh compute + cadence |
See `## Runtime Backend` in `README.md` for the full deployed configuration.
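As an illustration of how `ADVISOR_ZERO_GPU` and `ADVISOR_ZERO_GPU_DURATION` could gate the `@spaces.GPU` wrapper, here is a minimal sketch of a decorator that is a no-op unless the flag is `1`. It is an assumption about the shape of `gpu_task()` in `zerogpu.py`, not a copy of it; `sketch_gpu_task` is a hypothetical name.

```python
import os
from collections.abc import Callable
from typing import Any


def sketch_gpu_task(func: Callable[..., Any]) -> Callable[..., Any]:
    """Return func unchanged unless ADVISOR_ZERO_GPU=1, else wrap it with spaces.GPU."""
    if os.environ.get("ADVISOR_ZERO_GPU") != "1":
        return func  # local / CPU path: no wrapper and no `spaces` import
    import spaces  # imported lazily so local runs do not need the package

    duration = int(os.environ.get("ADVISOR_ZERO_GPU_DURATION", "120"))
    return spaces.GPU(duration=duration)(func)
```

In this sketch the flag is read once, at decoration time, so changing the variable after import has no effect on functions that are already decorated.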