⚖️ WitnessBox — PRD
Cross-examine a hostile AI witness. A courtroom interrogation game where the witness reacts to how you deliver, the AI is the irreplaceable mechanic, and a Modal Sandbox executing model-written code is the game's referee.
Track: 🍄 Thousand Token Wood · Primary prize: Best Use of Modal (1st-caliber, Axis A: Sandbox-runs-model-generated-code) · Status: built, compiles clean (see existing hf-hackathon/witnessbox/).
1. Vision & why it wins
Interrogate Marcus Reid, CFO of Halcyon Dynamics. He's evasive and reads your delivery stance (vocal confidence) — sound confident and he clams up; sound hesitant and he gets cocky and overshares. Catch him in 3 contradictions and his voice cracks as he breaks.
Three independent win mechanisms, three judge pools:
- Best Use of Modal (#1 target): the core mechanic IS Modal's documented flagship pattern — an LLM writes code, a Sandbox safely executes it. Modal's own GRPO example: the "Best Use of Modal prize showcased the use of sandboxes for securely evaluating model-generated code." No rival in the field centers on this; most use Modal as plain inference hosting.
- OpenBMB Best MiniCPM Build (Wood): MiniCPM-o is the character, VoxCPM2's style-tags are the game state — "model is the product," which beats "model is a component."
- Wood track podium (4 paid slots): delight + load-bearing AI + originality + polish; a voiced, interactive game with a win condition and an audiovisual climax stands out vs watch-only demos.
2. Target prizes
Primary: Best Use of Modal (1st). Secondary (awards stack): OpenBMB-Wood · Wood podium ·
Community Choice (Wood) · Nemotron Hardware (ASR) · Best Agent · Best Demo · Off-Brand (only if a
real gr.Server custom UI is built — not earned by CSS alone).
3. Users & core experience
Player = anyone who wants the fantasy of breaking a witness on the stand. Turn-based push-to-talk:
player records a question (mic)
→ Nemotron ASR transcribes + librosa reads DELIVERY STANCE (perceived confidence; NOT lie detection)
→ stance steers the witness system prompt (Hesitant → he overshares a thread toward an uncaught lie)
→ ONE MiniCPM-o call returns {in-character reply, contradiction-check Python}
→ modal.Sandbox executes the MODEL-WRITTEN code; its JSON verdict DECIDES the catch
(keyword matching is only a silent fallback; on Sandbox error, the model self-corrects its code)
→ VoxCPM2 voices the reply; style escalates with pressure
catch #3 → win; the witness's voice cracks (pre-generated best take)
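The verdict step in the flow above can be sketched as follows. This is a minimal sketch, not the project's code: the verdict shape (`{"caught": bool, "lie_id": str}`), the name `decide_catch`, and the cue keywords are all assumptions made for illustration. It shows the Sandbox's JSON verdict taking precedence, with keyword matching used only when the verdict is missing or malformed.

```python
import json
from typing import Optional

# Hypothetical contradiction cues, one list per planted lie (illustrative only).
CUES = {
    "timeline": ["march 3", "board meeting"],
    "authorization": ["signed off", "wire transfer"],
    "relationship": ["never met", "dinner"],
}


def keyword_fallback(question: str, reply: str) -> Optional[str]:
    """Silent fallback: return a lie id if all of its cues appear in the exchange."""
    text = f"{question} {reply}".lower()
    for lie_id, cues in CUES.items():
        if all(cue in text for cue in cues):
            return lie_id
    return None


def decide_catch(sandbox_stdout: Optional[str], question: str, reply: str) -> Optional[str]:
    """Return the id of the caught lie, or None.

    The Sandbox verdict is authoritative whenever it parses; the keyword
    matcher runs only when the verdict is missing or malformed.
    """
    if sandbox_stdout:
        try:
            verdict = json.loads(sandbox_stdout)
            if isinstance(verdict, dict) and isinstance(verdict.get("caught"), bool):
                lie_id = verdict.get("lie_id")
                return lie_id if verdict["caught"] and lie_id in CUES else None
        except json.JSONDecodeError:
            pass  # fall through to the keyword fallback
    return keyword_fallback(question, reply)
```

A well-formed verdict of `"caught": false` suppresses the keyword matcher entirely, which keeps the Sandbox as the single authority whenever it responds.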
4. Functional requirements
- 3 planted lies injected into the system prompt (timeline, authorization, relationship), each with a concrete contradiction cue the player must surface. Detection fires against THESE, not on emergent model inconsistency (reliable > magical).
- Delivery stance from a parallel librosa pass (pause-rate + speaking-rate dominant per the prosody literature; pitch minor). Framed as perceived delivery, never "lie detector."
- Stance is load-bearing: Hesitant delivery makes the witness leak a cue toward one uncaught lie.
- Win at 3 catches, ≤ ~12 turns; the climactic break line is pre-generated and cached.
- The model-written code + Sandbox verdict are shown live in an open panel (the Modal evidence).
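The stance and win rules above could be wired together roughly like this. It is a sketch under stated assumptions: the feature normalization to [0, 1], the weights, the thresholds, the "Neutral" tier name, and the "lose" outcome at the turn cap are placeholders, not values from the build. Only the three lie names, the Hesitant leak rule, the 3-catch win, and the ~12-turn cap come from the requirements.

```python
from dataclasses import dataclass, field
from typing import Optional

LIES = ("timeline", "authorization", "relationship")  # the three planted lies
MAX_TURNS = 12   # turn cap from the requirements (treated as a hard cap here)
WIN_CATCHES = 3


def stance_tier(pause_rate: float, speaking_rate: float, pitch_var: float) -> str:
    """Map normalized prosody features (each in [0, 1]) to a delivery tier.

    Pause rate and speaking rate dominate; pitch is a minor term.
    The weights and thresholds are placeholders, not tuned values.
    """
    confidence = 0.45 * (1.0 - pause_rate) + 0.45 * speaking_rate + 0.10 * pitch_var
    if confidence < 0.4:
        return "Hesitant"
    if confidence < 0.7:
        return "Neutral"
    return "Confident"


@dataclass
class GameState:
    caught: set = field(default_factory=set)
    turns: int = 0

    def leak_target(self, tier: str) -> Optional[str]:
        """On Hesitant delivery, pick one uncaught lie for the witness to leak toward."""
        if tier != "Hesitant":
            return None
        remaining = [lie for lie in LIES if lie not in self.caught]
        return remaining[0] if remaining else None

    def end_turn(self, caught_lie: Optional[str]) -> str:
        """Record the turn and return 'win', 'lose', or 'continue'."""
        self.turns += 1
        if caught_lie in LIES:
            self.caught.add(caught_lie)
        if len(self.caught) >= WIN_CATCHES:
            return "win"
        return "lose" if self.turns >= MAX_TURNS else "continue"
```

Keeping the tier a plain string makes it easy to drop into the witness system prompt and to show on the "Delivery Stance" bar.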
5. Technical architecture (all ≤32B; ≈12B combined)
| Component | Model / lib | Notes (verified) |
|---|---|---|
| Witness brain | openbmb/MiniCPM-o-4_5 (9.4B) | AutoModel, trust_remote_code; chat(msgs=, use_tts_template=False, enable_thinking=False, generate_audio=False); init_vision/audio/tts=False (text-only). |
| Witness voice | openbmb/VoxCPM2 (2B) | from_pretrained(load_denoiser=False); Voice-Design CFO once → Controllable-Clone per line generate(text="(style)...", reference_wav_path=ref); 48kHz; torch≥2.5.0. |
| Player ASR | nvidia/nemotron-speech-streaming-en-0.6b (or -3.5-asr-streaming-) | whisper-small local fallback. |
| Delivery stance | librosa | parallel waveform pass; pause/rate → tier. |
| Contradiction engine | MiniCPM-o generates networkx code → modal.Sandbox | the verdict authority. |
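For a sense of what the contradiction engine runs, here is a sketch of the kind of script the witness model might emit for the timeline lie. It is illustrative only: the event names and the verdict shape are invented, and the sketch swaps networkx for the standard-library `graphlib` so that it stays self-contained (with networkx, the same check would be a directed-graph cycle test such as `nx.is_directed_acyclic_graph`). The idea is that "happened before" claims form a directed graph, and a new claim that closes a cycle is a contradiction.

```python
import json
from graphlib import CycleError, TopologicalSorter

# Hypothetical testimony: each pair (a, b) asserts "a happened before b".
# The event names are invented for illustration.
ESTABLISHED = [
    ("audit_flagged", "board_meeting"),
    ("board_meeting", "transfer_approved"),
]
NEW_CLAIM = ("transfer_approved", "audit_flagged")  # the witness's latest statement


def check_timeline(established, new_claim):
    """Return a verdict dict: caught is True when the new claim creates a cycle."""
    predecessors = {}
    for before, after in [*established, new_claim]:
        predecessors.setdefault(after, set()).add(before)
        predecessors.setdefault(before, set())
    try:
        # static_order() raises CycleError if no consistent ordering exists.
        list(TopologicalSorter(predecessors).static_order())
    except CycleError as exc:
        return {"caught": True, "lie_id": "timeline", "cycle": list(exc.args[1])}
    return {"caught": False, "lie_id": None, "cycle": []}


if __name__ == "__main__":
    # The Sandbox caller would read this single JSON line from stdout.
    print(json.dumps(check_timeline(ESTABLISHED, NEW_CLAIM)))
```

Printing exactly one JSON object keeps the contract with the caller simple: anything else on stdout can be treated as a codegen failure.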
6. Best Use of Modal — five load-bearing primitives (the #1-prize section)
The core mechanic is Modal's flagship Sandbox pattern (docs/examples/agent, safe_code_execution).
- ⭐ Sandbox executes model-written code: the game's referee (network-blocked; its JSON decides catches).
- 🔧 Agentic self-correction: on Sandbox error, the error feeds back to MiniCPM-o, which repairs its own code and reruns (max 2). This is Modal's devlooper generate→execute→fix loop.
- GPU inference via @app.cls, scale-to-zero: MiniCPM-o (A100) + VoxCPM2 (A10G) + Nemotron ASR (A10G), idle → $0.
- Parallel .map(): pre-generates the scripted voice beats (incl. the voice-crack) at load.
- Memory snapshot + Volume: snapshot cuts cold start (measured); a Volume persists the designed CFO voice clip + model cache.

Measured cost: quote real container-seconds → "$0.0X / match" (read from the Modal dashboard). Map this verbatim into the README's "Best Use of Modal" section (REQ-06 requires noting Modal).
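The generate→execute→fix loop described above can be sketched without any Modal dependency. Everything here is an assumption for illustration: `generate` stands in for the MiniCPM-o call, and `run_locally` is a local child-process stand-in for the Sandbox that provides none of its isolation. In the deployed app, `execute` would wrap the Sandbox call instead.

```python
import json
import subprocess
import sys
from typing import Callable, Optional, Tuple

MAX_REPAIRS = 2  # matches the "max 2" rerun budget above


def run_locally(code: str, timeout: float = 10.0) -> Tuple[bool, str]:
    """Local stand-in for the Sandbox: run code in a child interpreter.

    Returns (ok, output): stdout on success, stderr on failure.
    This offers none of the isolation a real Sandbox provides.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False, "timeout"
    return (True, proc.stdout) if proc.returncode == 0 else (False, proc.stderr)


def verdict_with_repair(
    generate: Callable[[Optional[str], Optional[str]], str],
    execute: Callable[[str], Tuple[bool, str]] = run_locally,
) -> Optional[dict]:
    """Generate, execute, and on failure feed the error back for a repair.

    generate(previous_code, error) is a hypothetical wrapper around the model
    call; it receives (None, None) on the first attempt.
    """
    code, error = None, None
    for _ in range(1 + MAX_REPAIRS):
        code = generate(code, error)
        ok, output = execute(code)
        if ok:
            try:
                verdict = json.loads(output)
                if isinstance(verdict, dict):
                    return verdict
                error = "stdout must be a single JSON object"
            except json.JSONDecodeError as exc:
                error = f"stdout was not valid JSON: {exc}"
        else:
            error = output
    return None  # caller falls back to keyword matching
```

Passing `execute` as a parameter means the loop can be unit-tested with fakes, while the deployed path swaps in the real Sandbox runner.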
7. UX / UI requirements
Courtroom aesthetic (parchment, serif). CFO portrait. "Delivery Stance" bar, labeled "not a lie detector". X/3 contradiction counter. Autoplay witness audio. The Contradiction Engine accordion defaults to OPEN (the #1-prize evidence must be on camera). Latency (~20–35s warm) is masked diegetically ("the witness considers…"). For Off-Brand, a real gr.Server custom courtroom UI would be required.
8. Demo video (the judged artifact)
60–90s, controlled, ~20 dry runs first: stance steers witness → ask hesitantly, he overshares → catch #1 → the Sandbox panel shows model-written code + verdict → catch #3 → voice cracks → cost readout. Show the Sandbox executing the model's code as the dramatic beat.
9. Success metrics
Five consecutive clean end-to-end turns from the deployed Space · win-at-3 reliable · Sandbox verdict authoritative (codegen broken <~30% of turns, self-correction covers the rest) · voice-crack lands · measured Modal cost + snapshot seconds captured.
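The cost metric reduces to simple arithmetic once the measurements exist. A minimal sketch, assuming container-seconds and per-second rates are both read from the Modal dashboard and keyed by container name; the function name and container keys are hypothetical, and no rates are hard-coded.

```python
from typing import Mapping


def cost_per_match(container_seconds: Mapping[str, float],
                   usd_per_second: Mapping[str, float],
                   matches: int) -> float:
    """Average USD per match = sum(seconds * rate) / matches.

    Both mappings are keyed by container name. All inputs must come from
    measurements; nothing is assumed about actual prices.
    """
    if matches <= 0:
        raise ValueError("matches must be positive")
    missing = set(container_seconds) - set(usd_per_second)
    if missing:
        raise KeyError(f"no rate for: {sorted(missing)}")
    total = sum(sec * usd_per_second[name] for name, sec in container_seconds.items())
    return total / matches
```

Raising on a missing rate, rather than treating it as zero, avoids quoting a cost that silently omits one of the GPU containers.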
10. Risks & mitigations
- End-to-end turn never run (highest risk) → deploy + prove 5 turns before anything downstream.
- Modal secrets unset → Space boots (lookup is lazy/try-excepted) but the Sandbox is dead; set MODAL_TOKEN_ID/MODAL_TOKEN_SECRET as Space secrets.
- Codegen unreliable → self-correction loop + a networkx skeleton in the prompt; never show repeated score=0.00.
- Voice-crack variance → pre-generate ≥30 takes of the win line, cache the best.
- Nemotron ASR install friction → bounded attempt, else pivot to parakeet or whisper fallback (never blocks the critical path).
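The unset-secrets risk above is cheap to surface at boot. A minimal sketch, assuming the two secrets are exposed to the Space as environment variables; the function names and the status wording are hypothetical, not taken from the scaffold.

```python
import os
from typing import Mapping, Optional

REQUIRED = ("MODAL_TOKEN_ID", "MODAL_TOKEN_SECRET")


def missing_modal_secrets(env: Optional[Mapping[str, str]] = None) -> list:
    """Return the names of required Modal secrets that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED if not env.get(name, "").strip()]


def sandbox_status(env: Optional[Mapping[str, str]] = None) -> str:
    """One-line status suitable for the Contradiction Engine panel."""
    missing = missing_modal_secrets(env)
    if missing:
        return "Sandbox offline (keyword fallback only): set " + ", ".join(missing)
    return "Sandbox ready"
```

Showing this line in the UI would make a dead Sandbox visible during dry runs, where it would otherwise be masked by the silent keyword fallback.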
11. Build plan (by dependency — no calendar)
- Set Space secrets · generate CFO portrait · (done in scaffold: lazy lookup, warmup sandbox prebuild, accordion open, torch≥2.5, generate_audio/init_audio).
- Deploy + smoke-test run_in_sandbox() and the voxcpm image standalone.
- Five consecutive end-to-end turns from the deployed Space + measured latencies/cost (the gate).
- ≥30 win-line takes cached · codegen reliability hardened.
- Nemotron ASR pivot-gate (stop-loss) · optional real gr.Server UI for Off-Brand.
- Demo video (after dry runs) → README measured numbers → social → submit.
12. Integrity rules
Claims follow code — no "only entry that…" claims about a moving field; cost/latency are measured,
never fabricated. Pre-submit grep: TODO | YOUR_HF_USER | NotImplementedError | <!--.
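The pre-submit grep can also be run as a small script. This is a sketch rather than the project's tooling: the four patterns come from the rule above, while the function name and the set of file suffixes scanned are assumptions to adjust for the repository.

```python
import re
from pathlib import Path

PATTERN = re.compile(r"TODO|YOUR_HF_USER|NotImplementedError|<!--")
# File suffixes to scan are an assumption; adjust to the repository's contents.
SUFFIXES = {".py", ".md", ".txt", ".toml", ".yaml", ".yml"}


def find_leftovers(root: str) -> list:
    """Return (path, line_number, line) for every placeholder hit under root."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in SUFFIXES:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if PATTERN.search(line):
                hits.append((str(path), number, line.strip()))
    return hits
```

Calling `find_leftovers(".")` from the repository root and refusing to submit while the list is non-empty gives the same gate as the grep, with line numbers attached.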