WitnessBox / PRD.md

Farseen0

Deploy WitnessBox

c519923 verified 13 days ago

	# ⚖️ WitnessBox — PRD

	> Cross-examine a hostile AI witness. A courtroom interrogation game where the witness reacts
	> to how you deliver, the AI is the irreplaceable mechanic, and a **Modal Sandbox executing
	> model-written code** is the game's referee.
	>
	> Track: 🍄 Thousand Token Wood · Primary prize: Best Use of Modal (1st-caliber, Axis A:
	> Sandbox-runs-model-generated-code) · Status: built, compiles clean (see existing `hf-hackathon/witnessbox/`).

	## 1. Vision & why it wins
	Interrogate Marcus Reid, CFO of Halcyon Dynamics. He's evasive and reads your **delivery
	stance** (vocal confidence) — sound confident and he clams up; sound hesitant and he gets cocky
	and overshares. Catch him in 3 contradictions and his voice cracks as he breaks.

	Three independent win mechanisms, three judge pools:
	1. Best Use of Modal (#1 target): the core mechanic IS Modal's documented flagship pattern —
	an LLM writes code, a Sandbox safely executes it. Modal's own GRPO example: the *"Best Use of
	Modal prize showcased the use of sandboxes for securely evaluating model-generated code."* No
	rival in the field centers on this; most use Modal as plain inference hosting.
	2. OpenBMB Best MiniCPM Build (Wood): MiniCPM-o is the character, VoxCPM2's style-tags are the
	game state — "model is the product," which beats "model is a component."
	3. Wood track podium (4 paid slots): delight + load-bearing AI + originality + polish; a voiced,
	interactive game with a win condition and an audiovisual climax stands out vs watch-only demos.

	## 2. Target prizes
	Primary: Best Use of Modal (1st). Secondary (awards stack): OpenBMB-Wood · Wood podium ·
	Community Choice (Wood) · Nemotron Hardware (ASR) · Best Agent · Best Demo · Off-Brand *(only if a
	real `gr.Server` custom UI is built — not earned by CSS alone)*.

	## 3. Users & core experience
	Player = anyone who wants the fantasy of breaking a witness on the stand. Turn-based push-to-talk:
	```
	player records a question (mic)
	→ Nemotron ASR transcribes + librosa reads DELIVERY STANCE (perceived confidence; NOT lie detection)
	→ stance steers the witness system prompt (Hesitant → he overshares a thread toward an uncaught lie)
	→ ONE MiniCPM-o call returns {in-character reply, contradiction-check Python}
	→ modal.Sandbox executes the MODEL-WRITTEN code; its JSON verdict DECIDES the catch
	(keyword matching is only a silent fallback; on Sandbox error, the model self-corrects its code)
	→ VoxCPM2 voices the reply; style escalates with pressure
	catch #3 → win; the witness's voice cracks (pre-generated best take)
	```
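The turn flow above can be sketched as one function with every stage injected as a callable. This is a minimal sketch, not the project's code: `TurnResult`, `play_turn`, and all parameter names are hypothetical, and `run_check` is assumed to wrap the Sandbox call (including any self-correction) and to return a dict with a `contradiction` key.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class TurnResult:
    transcript: str
    stance: str
    reply: str
    caught: bool
    catches: int
    won: bool


def play_turn(
    audio: bytes,
    catches: int,
    transcribe: Callable[[bytes], str],                  # ASR stage
    read_stance: Callable[[bytes], str],                 # delivery-stance stage
    ask_witness: Callable[[str, str], Tuple[str, str]],  # returns (reply, check code)
    run_check: Callable[[str], dict],                    # executes the model-written code
    keyword_fallback: Callable[[str, str], bool],        # silent fallback only
    catches_to_win: int = 3,
) -> TurnResult:
    transcript = transcribe(audio)
    stance = read_stance(audio)
    reply, check_code = ask_witness(transcript, stance)
    try:
        # The executed code's JSON verdict decides the catch.
        caught = bool(run_check(check_code)["contradiction"])
    except Exception:
        # Any failure (execution error, missing key) drops to keyword matching.
        caught = keyword_fallback(transcript, reply)
    catches += int(caught)
    return TurnResult(transcript, stance, reply, caught, catches, catches >= catches_to_win)
```

Injecting the stages keeps the flow testable with stubs before any model or Sandbox is deployed.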

	## 4. Functional requirements
	- 3 planted lies injected into the system prompt (timeline, authorization, relationship), each
	with a concrete contradiction cue the player must surface. Detection fires against THESE, not on
	emergent model inconsistency (reliable > magical).
	- Delivery stance from a parallel librosa pass (pause-rate + speaking-rate dominant per the
	prosody literature; pitch minor). Framed as perceived delivery, never "lie detector."
	- Stance is load-bearing: Hesitant delivery makes the witness leak a cue toward one uncaught lie.
	- Win at 3 catches, ≤ ~12 turns; the climactic break line is pre-generated and cached.
	- The model-written code + Sandbox verdict are shown live in an open panel (the Modal evidence).
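As an illustration of the delivery-stance requirement, the sketch below maps a pause ratio and a speaking rate to a tier. It is an assumption-heavy example: the thresholds, the `neutral` tier, and the function names are hypothetical, and it works on precomputed per-frame RMS values (which a librosa pass could supply) so that it stays dependency-free. Pitch is left out because the PRD treats it as minor.

```python
from typing import Sequence


def pause_ratio(frame_rms: Sequence[float], silence_threshold: float = 0.02) -> float:
    """Fraction of frames whose energy falls below a (hypothetical) silence threshold."""
    if not frame_rms:
        return 0.0
    silent = sum(1 for energy in frame_rms if energy < silence_threshold)
    return silent / len(frame_rms)


def stance_tier(pause: float, words_per_second: float) -> str:
    """Combine the two dominant cues into a perceived-delivery tier.

    All cut-offs are illustrative placeholders, not tuned values.
    """
    score = 0
    if pause > 0.35:
        score -= 1          # many pauses read as hesitant
    elif pause < 0.15:
        score += 1          # few pauses read as confident
    if words_per_second < 2.0:
        score -= 1          # slow delivery reads as hesitant
    elif words_per_second > 3.0:
        score += 1          # brisk delivery reads as confident
    if score > 0:
        return "confident"
    if score < 0:
        return "hesitant"
    return "neutral"
```

For example, `stance_tier(pause_ratio(rms), word_count / duration_seconds)` would yield the tier that steers the witness prompt, where `word_count` comes from the ASR transcript.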

	## 5. Technical architecture (all ≤32B; ≈12B combined)
	| Component | Model / lib | Notes (verified) |
	|---|---|---|
	| Witness brain | `openbmb/MiniCPM-o-4_5` (9.4B) | `AutoModel`, `trust_remote_code`; `chat(msgs=, use_tts_template=False, enable_thinking=False, generate_audio=False)`; `init_vision/audio/tts=False` (text-only). |
	| Witness voice | `openbmb/VoxCPM2` (2B) | `from_pretrained(load_denoiser=False)`; Voice-Design CFO once → Controllable-Clone per line `generate(text="(style)...", reference_wav_path=ref)`; 48kHz; torch≥2.5.0. |
	| Player ASR | `nvidia/nemotron-speech-streaming-en-0.6b` (or `-3.5-asr-streaming-`) | whisper-small local fallback. |
	| Delivery stance | `librosa` | parallel waveform pass; pause/rate → tier. |
	| Contradiction engine | MiniCPM-o generates networkx code → `modal.Sandbox` | the verdict authority. |
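To make the contradiction-engine row concrete, here is a sketch of the kind of script the model might emit for the Sandbox to run. It is illustrative only: the PRD's generated code targets networkx, while this version uses plain dicts and `itertools` to stay dependency-free, and the claim data, the `check_contradiction` name, and the verdict keys (`contradiction`, `conflicts`, `score`) are all hypothetical.

```python
import json
from itertools import combinations


def check_contradiction(claims):
    """Two claims conflict when they assert different values for the same
    (subject, attribute) pair. Returns a JSON-serialisable verdict."""
    conflicts = [
        (a["id"], b["id"])
        for a, b in combinations(claims, 2)
        if a["subject"] == b["subject"]
        and a["attribute"] == b["attribute"]
        and a["value"] != b["value"]
    ]
    return {
        "contradiction": bool(conflicts),
        "conflicts": conflicts,
        "score": 1.0 if conflicts else 0.0,
    }


# Hypothetical claims for the "timeline" lie: testimony versus surfaced evidence.
claims = [
    {"id": "c1", "subject": "reid", "attribute": "left_office", "value": "18:00"},
    {"id": "c2", "subject": "reid", "attribute": "left_office", "value": "21:40"},
    {"id": "c3", "subject": "reid", "attribute": "role", "value": "CFO"},
]

verdict = check_contradiction(claims)
print(json.dumps(verdict))  # the caller reads this JSON from stdout
```

Printing a single JSON object to stdout gives the host process one unambiguous value to parse as the verdict.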

	## 6. Best Use of Modal — five load-bearing primitives (the #1-prize section)
	The core mechanic is Modal's flagship Sandbox pattern (`docs/examples/agent`, `safe_code_execution`).
	1. ⭐ Sandbox executes model-written code — the game's referee (network-blocked; its JSON decides catches).
	2. 🔧 Agentic self-correction — on Sandbox error, the error feeds back to MiniCPM-o, which repairs its own code and reruns (max 2) — Modal's `devlooper` generate→execute→fix loop.
	3. GPU inference via `@app.cls`, scale-to-zero — MiniCPM-o (A100) + VoxCPM2 (A10G) + Nemotron ASR (A10G), idle → $0.
	4. Parallel `.map()` — pre-generates the scripted voice beats (incl. the voice-crack) at load.
	5. Memory snapshot + Volume — snapshot cuts cold start (measured); a Volume persists the designed CFO voice clip + model cache.
	Measured cost: quote real container-seconds → "$0.0X / match" (read from the Modal dashboard).
	Map this verbatim into the README's "Best Use of Modal" section (REQ-06 requires noting Modal).
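Primitives 1 and 2 can be sketched together: a network-blocked Sandbox run, wrapped in a bounded repair loop. This is a hedged sketch rather than the project's `run_in_sandbox()`: the function names, the app name `witnessbox-sandbox`, and the image contents are assumptions, and the Modal calls (`App.lookup`, `Sandbox.create` with `block_network=True`, `wait`, `returncode`, `stdout.read`) should be checked against the current Modal reference before use. The executor and the repair step are injected so the loop can be exercised without Modal.

```python
import json
from typing import Callable, Optional, Tuple

ExecResult = Tuple[int, str, str]  # (returncode, stdout, stderr)


def run_in_modal_sandbox(code: str, timeout: int = 30) -> ExecResult:
    """Run model-written code in a network-blocked Modal Sandbox (assumed API usage)."""
    import modal  # imported lazily so this module loads without Modal installed

    app = modal.App.lookup("witnessbox-sandbox", create_if_missing=True)  # hypothetical name
    sandbox = modal.Sandbox.create(
        "python", "-c", code,
        app=app,
        image=modal.Image.debian_slim().pip_install("networkx"),
        timeout=timeout,
        block_network=True,
    )
    sandbox.wait()
    return sandbox.returncode, sandbox.stdout.read(), sandbox.stderr.read()


def verdict_with_repair(
    code: str,
    execute: Callable[[str], ExecResult],
    repair: Callable[[str, str], str],  # (broken code, error text) -> new code
    max_repairs: int = 2,
) -> Optional[dict]:
    """Execute, and on failure feed the error back for up to `max_repairs` rewrites.

    Returns the parsed JSON verdict, or None if every attempt failed
    (the caller then uses the keyword fallback).
    """
    for attempt in range(max_repairs + 1):
        returncode, stdout, stderr = execute(code)
        if returncode == 0:
            try:
                return json.loads(stdout)
            except json.JSONDecodeError as exc:
                stderr = f"stdout was not valid JSON: {exc}"
        if attempt < max_repairs:
            code = repair(code, stderr)
    return None
```

In the game, `execute` would be `run_in_modal_sandbox` and `repair` would re-prompt MiniCPM-o with the failing code and the error text.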

	## 7. UX / UI requirements
	Courtroom aesthetic (parchment, serif). CFO portrait. "Delivery Stance" bar (labeled *not a lie
	detector*). X/3 contradiction counter. Autoplay witness audio. **Contradiction Engine accordion
	defaults OPEN** (the #1-prize evidence must be on camera). Latency (~20–35s warm) masked diegetically
	("the witness considers…"). For Off-Brand, a real `gr.Server` custom courtroom UI would be required.

	## 8. Demo video (the judged artifact)
	60–90s, controlled, ~20 dry runs first: stance steers witness → ask hesitantly, he overshares →
	catch #1 → the Sandbox panel shows model-written code + verdict → catch #3 → voice cracks →
	cost readout. Show the Sandbox executing the model's code as the dramatic beat.

	## 9. Success metrics
	Five consecutive clean end-to-end turns from the deployed Space · win-at-3 is reliable · Sandbox
	verdict is authoritative (codegen fails on fewer than ~30% of turns; self-correction covers those failures) · voice-crack
	lands · measured Modal cost and snapshot seconds are captured.

	## 10. Risks & mitigations
	- End-to-end turn never run (highest risk) → deploy + prove 5 turns before anything downstream.
	- Modal secrets unset → Space boots (lookup is lazy/try-excepted) but the Sandbox is dead; set `MODAL_TOKEN_ID`/`MODAL_TOKEN_SECRET` as Space secrets.
	- Codegen unreliable → self-correction loop + a networkx skeleton in the prompt; never show repeated `score=0.00`.
	- Voice-crack variance → pre-generate ≥30 takes of the win line, cache the best.
	- Nemotron ASR install friction → bounded attempt, else pivot to parakeet or whisper fallback (never blocks the critical path).
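The unset-secrets risk is easy to surface at startup. The sketch below reports which of the two Modal variables are missing so the UI can warn instead of silently running without a Sandbox. `missing_modal_secrets` is a hypothetical helper, not part of the scaffold.

```python
import os
from typing import List, Mapping, Optional

REQUIRED_SECRETS = ("MODAL_TOKEN_ID", "MODAL_TOKEN_SECRET")


def missing_modal_secrets(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the names of required secrets that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_SECRETS if not env.get(name)]
```

An empty list means both secrets are present. Anything else can be logged once at boot while the Space still starts.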

	## 11. Build plan (by dependency — no calendar)
	1. Set Space secrets · generate CFO portrait · (done in scaffold: lazy lookup, warmup sandbox prebuild, accordion open, torch≥2.5, generate_audio/init_audio).
	2. Deploy + smoke-test `run_in_sandbox()` and the voxcpm image standalone.
	3. Five consecutive end-to-end turns from the deployed Space + measured latencies/cost (the gate).
	4. ≥30 win-line takes cached · codegen reliability hardened.
	5. Nemotron ASR pivot-gate (stop-loss) · optional real `gr.Server` UI for Off-Brand.
	6. Demo video (after dry runs) → README measured numbers → social → submit.

	## 12. Integrity rules
	Claims follow code — no "only entry that…" claims about a moving field; cost/latency are measured,
	never fabricated. Pre-submit grep: `TODO \| YOUR_HF_USER \| NotImplementedError \| <!--`.
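
The pre-submit check can also be run as a small script. This is a sketch under assumptions: the marker list is taken from the line above, while the function name, the scanned file suffixes, and the output shape are hypothetical.

```python
from pathlib import Path

MARKERS = ("TODO", "YOUR_HF_USER", "NotImplementedError", "<!--")


def find_markers(root, suffixes=(".py", ".md")):
    """Return (relative path, line number, marker) for every leftover marker."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), 1):
            for marker in MARKERS:
                if marker in line:
                    hits.append((str(path.relative_to(root)), lineno, marker))
    return hits
```

Calling `find_markers(".")` from the repository root and refusing to submit on a non-empty result gives the same gate as the grep.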