# ⚖️ WitnessBox: PRD

> **Cross-examine a hostile AI witness.** A courtroom interrogation game where the witness reacts
> to *how you deliver*, the AI is the irreplaceable mechanic, and a **Modal Sandbox executing
> model-written code** is the game's referee.
>
> **Track:** 🍄 Thousand Token Wood · **Primary prize:** Best Use of Modal (1st-caliber, Axis A:
> Sandbox-runs-model-generated-code) · **Status:** built, compiles clean (see existing `hf-hackathon/witnessbox/`).

## 1. Vision & why it wins

Interrogate **Marcus Reid, CFO of Halcyon Dynamics**. He's evasive and reads your **delivery stance** (vocal confidence): sound confident and he clams up; sound hesitant and he gets cocky and overshares. Catch him in **3 contradictions** and his voice **cracks** as he breaks.

Three independent win mechanisms, three judge pools:

1. **Best Use of Modal (#1 target):** the core mechanic IS Modal's documented flagship pattern: an LLM writes code, a Sandbox safely executes it. Modal's own GRPO example: the *"Best Use of Modal prize showcased the use of sandboxes for securely evaluating model-generated code."* No rival in the field centers on this; most use Modal as plain inference hosting.
2. **OpenBMB Best MiniCPM Build (Wood):** MiniCPM-o is the *character*, VoxCPM2's style-tags are the *game state*: "model is the product," which beats "model is a component."
3. **Wood track podium (4 paid slots):** delight + load-bearing AI + originality + polish; a voiced, interactive game with a win condition and an audiovisual climax stands out vs watch-only demos.

## 2. Target prizes

Primary: **Best Use of Modal (1st)**. Secondary (awards stack): OpenBMB-Wood · Wood podium · Community Choice (Wood) · Nemotron Hardware (ASR) · Best Agent · Best Demo · Off-Brand *(only if a real `gr.Server` custom UI is built; not earned by CSS alone)*.

## 3. Users & core experience

Player = anyone who wants the fantasy of breaking a witness on the stand.
Turn-based push-to-talk:

```
player records a question (mic)
  → Nemotron ASR transcribes + librosa reads DELIVERY STANCE (perceived confidence; NOT lie detection)
  → stance steers the witness system prompt (Hesitant → he overshares a thread toward an uncaught lie)
  → ONE MiniCPM-o call returns {in-character reply, contradiction-check Python}
  → modal.Sandbox executes the MODEL-WRITTEN code; its JSON verdict DECIDES the catch
    (keyword matching is only a silent fallback; on Sandbox error, the model self-corrects its code)
  → VoxCPM2 voices the reply; style escalates with pressure
catch #3 → win; the witness's voice cracks (pre-generated best take)
```

## 4. Functional requirements

- **3 planted lies** injected into the system prompt (timeline, authorization, relationship), each with a concrete contradiction cue the player must surface. Detection fires against THESE, not on emergent model inconsistency (reliable > magical).
- **Delivery stance** from a parallel librosa pass (pause-rate + speaking-rate dominant per the prosody literature; pitch minor). Framed as *perceived delivery*, **never** "lie detector."
- **Stance is load-bearing:** Hesitant delivery makes the witness leak a cue toward one uncaught lie.
- **Win at 3 catches**, ≤ ~12 turns; the climactic break line is pre-generated and cached.
- The model-written code + Sandbox verdict are shown **live** in an open panel (the Modal evidence).

## 5. Technical architecture (all ≤32B; ≈12B combined)

| Component | Model / lib | Notes (verified) |
|---|---|---|
| Witness brain | `openbmb/MiniCPM-o-4_5` (9.4B) | `AutoModel`, `trust_remote_code`; `chat(msgs=, use_tts_template=False, enable_thinking=False, generate_audio=False)`; `init_vision/audio/tts=False` (text-only). |
| Witness voice | `openbmb/VoxCPM2` (2B) | `from_pretrained(load_denoiser=False)`; Voice-Design CFO once → Controllable-Clone per line `generate(text="(style)...", reference_wav_path=ref)`; 48kHz; **torch≥2.5.0**. |
| Player ASR | `nvidia/nemotron-speech-streaming-en-0.6b` (or `-3.5-asr-streaming-`) | whisper-small local fallback. |
| Delivery stance | `librosa` | parallel waveform pass; pause/rate → tier. |
| Contradiction engine | MiniCPM-o **generates** networkx code → `modal.Sandbox` | the verdict authority. |

## 6. Best Use of Modal: five load-bearing primitives (the #1-prize section)

The core mechanic is Modal's flagship Sandbox pattern (`docs/examples/agent`, `safe_code_execution`).

1. **⭐ Sandbox executes model-written code:** the game's referee (network-blocked; its JSON decides catches).
2. **🔧 Agentic self-correction:** on Sandbox error, the error feeds back to MiniCPM-o, which repairs its own code and reruns (max 2). This is Modal's `devlooper` generate→execute→fix loop.
3. **GPU inference via `@app.cls`, scale-to-zero:** MiniCPM-o (A100) + VoxCPM2 (A10G) + Nemotron ASR (A10G), idle → $0.
4. **Parallel `.map()`:** pre-generates the scripted voice beats (incl. the voice-crack) at load.
5. **Memory snapshot + Volume:** snapshot cuts cold start (measured); a Volume persists the designed CFO voice clip + model cache.

**Measured cost:** quote real container-seconds → "$0.0X / match" (read from the Modal dashboard).

Map this verbatim into the README's "Best Use of Modal" section (REQ-06 requires noting Modal).

## 7. UX / UI requirements

Courtroom aesthetic (parchment, serif). CFO portrait. "Delivery Stance" bar (labeled *not a lie detector*). X/3 contradiction counter. Autoplay witness audio. **Contradiction Engine accordion defaults OPEN** (the #1-prize evidence must be on camera). Latency (~20–35s warm) masked diegetically ("the witness considers…"). For Off-Brand, a real `gr.Server` custom courtroom UI would be required.

## 8. Demo video (the judged artifact)

60–90s, controlled, ~20 dry runs first: stance steers witness → ask hesitantly, he overshares → catch #1 → the Sandbox panel shows model-written code + verdict → catch #3 → **voice cracks** → cost readout.
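
The Sandbox beat in this sequence is the generate → execute → fix loop from section 6. The control flow might look roughly like the sketch below, which is an illustration under stated assumptions and not the project's actual code. `judge_with_repair`, `generate_code`, `execute`, and the `caught`/`lie` verdict keys are hypothetical names. `generate_code` stands in for the MiniCPM-o call and `execute` for the `modal.Sandbox` run. The default `run_locally` executor is only an unisolated local stand-in so the loop can be exercised without Modal. Returning `None` instead of raising is one way to keep the keyword-matching fallback silent.

```python
import json
import subprocess
import sys
from typing import Callable, Optional, Tuple

MAX_REPAIRS = 2  # the PRD caps self-correction at two reruns

# (returncode, stdout, stderr) from one execution attempt
ExecResult = Tuple[int, str, str]


def run_locally(code: str, timeout: float = 10.0) -> ExecResult:
    """Local stand-in for the Sandbox: run the code in a fresh interpreter.

    This is NOT isolated. The real game would execute the same code inside
    modal.Sandbox with networking blocked.
    """
    proc = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return proc.returncode, proc.stdout, proc.stderr


def judge_with_repair(
    generate_code: Callable[[Optional[str]], str],
    execute: Callable[[str], ExecResult] = run_locally,
) -> Optional[dict]:
    """Return the JSON verdict printed by model-written code, or None.

    generate_code(error) is called with None on the first attempt and with
    the previous error text on each repair attempt.
    """
    error: Optional[str] = None
    for _ in range(1 + MAX_REPAIRS):
        code = generate_code(error)
        try:
            returncode, stdout, stderr = execute(code)
        except subprocess.TimeoutExpired:
            error = "execution timed out"
            continue
        if returncode != 0:
            # Feed the failure back so the model can repair its own code.
            error = stderr.strip() or f"exit code {returncode}"
            continue
        try:
            verdict = json.loads(stdout)
        except json.JSONDecodeError as exc:
            error = f"stdout was not valid JSON: {exc}"
            continue
        if isinstance(verdict, dict):
            return verdict
        error = "verdict must be a JSON object"
    return None  # caller falls back to keyword matching


def fake_model(error: Optional[str]) -> str:
    """Hypothetical generator: buggy first draft, fixed after feedback."""
    if error is None:
        return "import json\nprint(json.dumps({'caught': undefined_name}))"
    return "import json\nprint(json.dumps({'caught': True, 'lie': 'timeline'}))"


if __name__ == "__main__":
    print(judge_with_repair(fake_model))
```
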
Show the Sandbox executing the model's code as the dramatic beat.

## 9. Success metrics

Five consecutive clean end-to-end turns from the deployed Space · win-at-3 reliable · Sandbox verdict authoritative (codegen broken <~30% of turns, self-correction covers the rest) · voice-crack lands · measured Modal cost + snapshot seconds captured.

## 10. Risks & mitigations

- **End-to-end turn never run** (highest risk) → deploy + prove 5 turns before anything downstream.
- **Modal secrets unset** → Space boots (lookup is lazy/try-excepted) but the Sandbox is dead; set `MODAL_TOKEN_ID`/`MODAL_TOKEN_SECRET` as Space secrets.
- **Codegen unreliable** → self-correction loop + a networkx skeleton in the prompt; never show repeated `score=0.00`.
- **Voice-crack variance** → pre-generate ≥30 takes of the win line, cache the best.
- **Nemotron ASR install friction** → bounded attempt, else pivot to parakeet or whisper fallback (never blocks the critical path).

## 11. Build plan (by dependency; no calendar)

1. Set Space secrets · generate CFO portrait · (done in scaffold: lazy lookup, warmup sandbox prebuild, accordion open, torch≥2.5, generate_audio/init_audio).
2. Deploy + smoke-test `run_in_sandbox()` and the voxcpm image standalone.
3. **Five consecutive end-to-end turns** from the deployed Space + measured latencies/cost (the gate).
4. ≥30 win-line takes cached · codegen reliability hardened.
5. Nemotron ASR pivot-gate (stop-loss) · optional real `gr.Server` UI for Off-Brand.
6. Demo video (after dry runs) → README measured numbers → social → submit.

## 12. Integrity rules

Claims follow code: no "only entry that…" claims about a moving field; cost/latency are measured, never fabricated. Pre-submit grep: `TODO | YOUR_HF_USER | NotImplementedError |