# ⚖️ WitnessBox: PRD
> **Cross-examine a hostile AI witness.** A courtroom interrogation game where the witness reacts
> to *how you deliver*, the AI is the irreplaceable mechanic, and a **Modal Sandbox executing
> model-written code** is the game's referee.
>
> **Track:** 🍄 Thousand Token Wood · **Primary prize:** Best Use of Modal (1st-caliber, Axis A:
> Sandbox-runs-model-generated-code) · **Status:** built, compiles clean (see existing `hf-hackathon/witnessbox/`).
## 1. Vision & why it wins
Interrogate **Marcus Reid, CFO of Halcyon Dynamics**. He is evasive and reads your **delivery
stance** (vocal confidence): sound confident and he clams up; sound hesitant and he gets cocky
and overshares. Catch him in **3 contradictions** and his voice **cracks** as he breaks.
Three independent win mechanisms, three judge pools:
1. **Best Use of Modal (#1 target):** the core mechanic IS Modal's documented flagship pattern:
   an LLM writes code, a Sandbox safely executes it. Modal's own GRPO example: the *"Best Use of
   Modal prize showcased the use of sandboxes for securely evaluating model-generated code."*
   Most entries seen so far use Modal as plain inference hosting rather than centering on this
   pattern (an observation about a moving field, not an "only entry" claim).
2. **OpenBMB Best MiniCPM Build (Wood):** MiniCPM-o is the *character* and VoxCPM2's style tags are the
   *game state*: "model is the product," which beats "model is a component."
3. **Wood track podium (4 paid slots):** delight + load-bearing AI + originality + polish; a voiced,
   interactive game with a win condition and an audiovisual climax stands out vs. watch-only demos.
## 2. Target prizes
Primary: **Best Use of Modal (1st)**. Secondary (awards stack): OpenBMB-Wood · Wood podium ·
Community Choice (Wood) · Nemotron Hardware (ASR) · Best Agent · Best Demo · Off-Brand *(only if a
real `gr.Server` custom UI is built; not earned by CSS alone)*.
## 3. Users & core experience
The player is anyone who wants the fantasy of breaking a witness on the stand. Play is turn-based push-to-talk:
| ``` | |
| player records a question (mic) | |
| → Nemotron ASR transcribes + librosa reads DELIVERY STANCE (perceived confidence; NOT lie detection) | |
| → stance steers the witness system prompt (Hesitant → he overshares a thread toward an uncaught lie) | |
| → ONE MiniCPM-o call returns {in-character reply, contradiction-check Python} | |
| → modal.Sandbox executes the MODEL-WRITTEN code; its JSON verdict DECIDES the catch | |
| (keyword matching is only a silent fallback; on Sandbox error, the model self-corrects its code) | |
| → VoxCPM2 voices the reply; style escalates with pressure | |
| catch #3 → win; the witness's voice cracks (pre-generated best take) | |
| ``` | |
## 4. Functional requirements
- **3 planted lies** injected into the system prompt (timeline, authorization, relationship), each
  with a concrete contradiction cue the player must surface. Detection fires against THESE, not on
  emergent model inconsistency (reliable > magical).
- **Delivery stance** from a parallel librosa pass (pause rate and speaking rate dominate, per the
  prosody literature; pitch is minor). Framed as *perceived delivery*, **never** "lie detector."
- **Stance is load-bearing:** Hesitant delivery makes the witness leak a cue toward one uncaught lie.
- **Win at 3 catches**, ≤ ~12 turns; the climactic break line is pre-generated and cached.
- The model-written code + Sandbox verdict are shown **live** in an open panel (the Modal evidence).
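
The catch and win rules above could be tracked with a small state object like the one below. It is a sketch under stated assumptions: the class and field names are hypothetical, the cue strings are placeholders (the real cues live in the system prompt), and the "≤ ~12 turns" budget is treated as a hard cap of 12.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional

WIN_CATCHES = 3   # from the requirements: win at 3 catches
MAX_TURNS = 12    # ASSUMPTION: the "~12 turns" budget as a hard cap


@dataclass
class PlantedLie:
    lie_id: str          # "timeline", "authorization", or "relationship"
    cue: str             # placeholder for the contradiction cue text
    caught: bool = False


@dataclass
class GameState:
    lies: list[PlantedLie]
    turn: int = 0

    @property
    def catches(self) -> int:
        return sum(lie.caught for lie in self.lies)

    def uncaught(self) -> list[PlantedLie]:
        """Lies a Hesitant delivery may still leak a cue toward."""
        return [lie for lie in self.lies if not lie.caught]

    def register_turn(self, caught_lie_id: Optional[str]) -> str:
        """Advance one turn; return "win", "out_of_turns", or "continue"."""
        self.turn += 1
        for lie in self.lies:
            # A lie only counts once, so repeating a catch adds nothing.
            if lie.lie_id == caught_lie_id and not lie.caught:
                lie.caught = True
        if self.catches >= WIN_CATCHES:
            return "win"
        if self.turn >= MAX_TURNS:
            return "out_of_turns"
        return "continue"
```

Keying catches to the planted lie ids, rather than counting verdicts, matches the requirement that detection fires against the three planted lies only.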
## 5. Technical architecture (all ≤32B; ≈12B combined)
| Component | Model / lib | Notes (verified) |
|---|---|---|
| Witness brain | `openbmb/MiniCPM-o-4_5` (9.4B) | `AutoModel`, `trust_remote_code`; `chat(msgs=, use_tts_template=False, enable_thinking=False, generate_audio=False)`; `init_vision/audio/tts=False` (text-only). |
| Witness voice | `openbmb/VoxCPM2` (2B) | `from_pretrained(load_denoiser=False)`; Voice-Design CFO once → Controllable-Clone per line `generate(text="(style)...", reference_wav_path=ref)`; 48 kHz; **torch≥2.5.0**. |
| Player ASR | `nvidia/nemotron-speech-streaming-en-0.6b` (or `-3.5-asr-streaming-`) | whisper-small local fallback. |
| Delivery stance | `librosa` | parallel waveform pass; pause/rate → tier. |
| Contradiction engine | MiniCPM-o **generates** networkx code → `modal.Sandbox` | the verdict authority. |
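
The "pause/rate → tier" mapping in the Delivery stance row might look something like this. It is a rough sketch only: it assumes the non-silent intervals have already been found (for example with `librosa.effects.split`), uses a pause *ratio* as a stand-in for pause rate, ignores pitch, and every threshold and the "Neutral" tier name are made-up placeholders that would need tuning on real recordings.

```python
from __future__ import annotations


def delivery_features(
    voiced_intervals: list[tuple[int, int]],  # (start, end) sample indices of speech
    n_samples: int,
    sr: int,
    n_words: int,                              # word count from the ASR transcript
) -> tuple[float, float]:
    """Return (pause_ratio, words_per_second) for one recorded question."""
    if n_samples <= 0 or sr <= 0:
        raise ValueError("empty recording")
    voiced = sum(end - start for start, end in voiced_intervals)
    pause_ratio = 1.0 - voiced / n_samples
    words_per_second = n_words / (n_samples / sr)
    return pause_ratio, words_per_second


def stance_tier(
    pause_ratio: float,
    words_per_second: float,
    pause_hi: float = 0.35,   # PLACEHOLDER thresholds, not measured values
    pause_lo: float = 0.15,
    rate_lo: float = 2.0,
    rate_hi: float = 3.0,
) -> str:
    """Map perceived delivery to a tier. This is NOT lie detection."""
    if pause_ratio >= pause_hi or words_per_second <= rate_lo:
        return "Hesitant"
    if pause_ratio <= pause_lo and words_per_second >= rate_hi:
        return "Confident"
    return "Neutral"
```

Keeping feature extraction separate from the tier rule means the thresholds can be retuned without touching the audio code.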
## 6. Best Use of Modal: five load-bearing primitives (the #1-prize section)
The core mechanic is Modal's flagship Sandbox pattern (`docs/examples/agent`, `safe_code_execution`).
1. **⭐ Sandbox executes model-written code:** the game's referee (network-blocked; its JSON decides catches).
2. **🔧 Agentic self-correction:** on Sandbox error, the error feeds back to MiniCPM-o, which repairs its own code and reruns (max 2), following Modal's `devlooper` generate→execute→fix loop.
3. **GPU inference via `@app.cls`, scale-to-zero:** MiniCPM-o (A100) + VoxCPM2 (A10G) + Nemotron ASR (A10G), idle → $0.
4. **Parallel `.map()`:** pre-generates the scripted voice beats (incl. the voice-crack) at load.
5. **Memory snapshot + Volume:** snapshot cuts cold start (measured); a Volume persists the designed CFO voice clip + model cache.
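
The generate→execute→fix loop from primitive 2 can be sketched independently of Modal by injecting the executor and the repair step as callables. This is an illustrative sketch: `SandboxResult`, `run_with_self_correction`, and both callables are hypothetical names, and "max 2" is read here as at most two repair attempts after the first run. In the real game `execute` would wrap a `modal.Sandbox` run and `repair` would call MiniCPM-o with the error text.

```python
from __future__ import annotations

from typing import Callable, NamedTuple

MAX_REPAIRS = 2  # ASSUMPTION: "max 2" means two repairs after the initial run


class SandboxResult(NamedTuple):
    returncode: int
    stdout: str
    stderr: str


def run_with_self_correction(
    code: str,
    execute: Callable[[str], SandboxResult],   # hypothetical: runs code in the Sandbox
    repair: Callable[[str, str], str],         # hypothetical: (code, stderr) -> fixed code
    max_repairs: int = MAX_REPAIRS,
) -> tuple[SandboxResult, list[str]]:
    """Run model-written code, feeding errors back for repair until it succeeds.

    Returns the final result plus every code version tried, so the UI panel
    can show the repair history.
    """
    attempts = [code]
    result = execute(code)
    for _ in range(max_repairs):
        if result.returncode == 0:
            break
        code = repair(code, result.stderr)
        attempts.append(code)
        result = execute(code)
    return result, attempts
```

If the final result still has a non-zero return code, the caller falls through to the keyword fallback rather than showing a failed verdict.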
**Measured cost:** quote real container-seconds → "$0.0X / match" (read from the Modal dashboard).
Map this section verbatim into the README's "Best Use of Modal" section (REQ-06 requires noting Modal).
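
The per-match figure is simple arithmetic once the container-seconds are measured. The helper below is a sketch with hypothetical names; it deliberately contains no rates, since both the seconds and the per-second prices are meant to be read from the Modal dashboard and pricing page rather than guessed.

```python
from __future__ import annotations


def match_cost(container_seconds: dict[str, float], usd_per_second: dict[str, float]) -> float:
    """Sum seconds * rate over each container type used in one match.

    Both dicts are keyed by a label such as "A100" or "A10G". All values
    must be MEASURED or taken from current pricing; none are supplied here.
    """
    missing = set(container_seconds) - set(usd_per_second)
    if missing:
        raise KeyError(f"no rate for: {sorted(missing)}")
    return sum(seconds * usd_per_second[label] for label, seconds in container_seconds.items())


def format_cost(cost_usd: float) -> str:
    """Render in the "$0.0X / match" style used in the README."""
    return f"${cost_usd:.2f} / match"
```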
## 7. UX / UI requirements
Courtroom aesthetic (parchment, serif). CFO portrait. "Delivery Stance" bar (labeled *not a lie
detector*). X/3 contradiction counter. Autoplay witness audio. **Contradiction Engine accordion
defaults OPEN** (the #1-prize evidence must be on camera). Latency (~20–35 s warm) masked diegetically
("the witness considers…"). For Off-Brand, a real `gr.Server` custom courtroom UI would be required.
## 8. Demo video (the judged artifact)
60–90 s, controlled, ~20 dry runs first: stance steers witness → ask hesitantly, he overshares →
catch #1 → the Sandbox panel shows model-written code + verdict → catch #3 → **voice cracks** →
cost readout. Show the Sandbox executing the model's code as the dramatic beat.
## 9. Success metrics
Five consecutive clean end-to-end turns from the deployed Space · win-at-3 reliable · Sandbox
verdict authoritative (first-pass codegen fails on fewer than ~30% of turns, and self-correction covers the rest) · voice-crack
lands · measured Modal cost + snapshot seconds captured.
## 10. Risks & mitigations
- **End-to-end turn never run** (highest risk) → deploy + prove 5 turns before anything downstream.
- **Modal secrets unset** → Space boots (lookup is lazy/try-excepted) but the Sandbox is dead; set `MODAL_TOKEN_ID`/`MODAL_TOKEN_SECRET` as Space secrets.
- **Codegen unreliable** → self-correction loop + a networkx skeleton in the prompt; never show repeated `score=0.00`.
- **Voice-crack variance** → pre-generate ≥30 takes of the win line, cache the best.
- **Nemotron ASR install friction** → bounded attempt, else pivot to parakeet or whisper fallback (never blocks the critical path).
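
Because the Space still boots when the Modal secrets are unset, a startup check that surfaces the problem is cheap insurance. The snippet below is a minimal sketch with hypothetical function names; it only inspects the two environment variables named above and does not contact Modal.

```python
from __future__ import annotations

import os
from typing import Mapping, Optional

REQUIRED_SECRETS = ("MODAL_TOKEN_ID", "MODAL_TOKEN_SECRET")


def missing_modal_secrets(env: Optional[Mapping[str, str]] = None) -> list[str]:
    """Return the names of required secrets that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_SECRETS if not env.get(name)]


def sandbox_status(env: Optional[Mapping[str, str]] = None) -> str:
    """One-line status suitable for a startup log or a UI banner."""
    missing = missing_modal_secrets(env)
    if missing:
        return "Sandbox disabled: missing " + ", ".join(missing)
    return "Sandbox credentials present"
```

Logging this at boot turns a silently dead Sandbox into a visible one before any end-to-end turn is attempted.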
## 11. Build plan (by dependency, no calendar)
1. Set Space secrets · generate CFO portrait · (done in scaffold: lazy lookup, warmup sandbox prebuild, accordion open, torch≥2.5, generate_audio/init_audio).
2. Deploy + smoke-test `run_in_sandbox()` and the voxcpm image standalone.
3. **Five consecutive end-to-end turns** from the deployed Space + measured latencies/cost (the gate).
4. ≥30 win-line takes cached · codegen reliability hardened.
5. Nemotron ASR pivot-gate (stop-loss) · optional real `gr.Server` UI for Off-Brand.
6. Demo video (after dry runs) → README measured numbers → social → submit.
## 12. Integrity rules
Claims follow code: no "only entry that…" claims about a moving field; cost/latency are measured,
never fabricated. Pre-submit grep for leftover placeholders: `TODO|YOUR_HF_USER|NotImplementedError|<!--`.
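
The pre-submit grep could also be run as a small script, for instance from a pre-push hook. This is a sketch with hypothetical function names and an assumed set of file suffixes; the four markers are the ones listed above.

```python
from __future__ import annotations

import re
from pathlib import Path

# The four markers from the integrity rule, as one alternation.
PLACEHOLDER = re.compile(r"TODO|YOUR_HF_USER|NotImplementedError|<!--")


def scan_text(text: str, name: str = "<text>") -> list[tuple[str, int, str]]:
    """Return (name, line number, marker) for the first marker on each matching line."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        match = PLACEHOLDER.search(line)
        if match:
            hits.append((name, lineno, match.group(0)))
    return hits


def scan_tree(root: str, suffixes: tuple[str, ...] = (".py", ".md")) -> list[tuple[str, int, str]]:
    """Scan files under root; the suffix list is an assumption, adjust as needed."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            text = path.read_text(encoding="utf-8", errors="replace")
            hits.extend(scan_text(text, str(path)))
    return hits
```

Note that a file containing this script would flag itself, since the pattern line includes the markers; keep it outside the scanned tree or exclude it by name.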