⚖️ WitnessBox — PRD

Cross-examine a hostile AI witness. A courtroom interrogation game where the witness reacts to how you deliver, the AI is the irreplaceable mechanic, and a Modal Sandbox executing model-written code is the game's referee.

Track: 🍄 Thousand Token Wood · Primary prize: Best Use of Modal (1st-caliber, Axis A: Sandbox-runs-model-generated-code) · Status: built, compiles clean (see existing hf-hackathon/witnessbox/).

1. Vision & why it wins

Interrogate Marcus Reid, CFO of Halcyon Dynamics. He's evasive and reads your delivery stance (vocal confidence) — sound confident and he clams up; sound hesitant and he gets cocky and overshares. Catch him in 3 contradictions and his voice cracks as he breaks.

Three independent win mechanisms, three judge pools:

  1. Best Use of Modal (#1 target): the core mechanic IS Modal's documented flagship pattern — an LLM writes code, a Sandbox safely executes it. Modal's own GRPO example: the "Best Use of Modal prize showcased the use of sandboxes for securely evaluating model-generated code." No rival in the field centers on this; most use Modal as plain inference hosting.
  2. OpenBMB Best MiniCPM Build (Wood): MiniCPM-o is the character, VoxCPM2's style-tags are the game state — "model is the product," which beats "model is a component."
  3. Wood track podium (4 paid slots): delight + load-bearing AI + originality + polish; a voiced, interactive game with a win condition and an audiovisual climax stands out vs watch-only demos.

2. Target prizes

Primary: Best Use of Modal (1st). Secondary (awards stack): OpenBMB-Wood · Wood podium · Community Choice (Wood) · Nemotron Hardware (ASR) · Best Agent · Best Demo · Off-Brand (only if a real gr.Server custom UI is built — not earned by CSS alone).

3. Users & core experience

Player = anyone who wants the fantasy of breaking a witness on the stand. Turn-based push-to-talk:

player records a question (mic)
  → Nemotron ASR transcribes  +  librosa reads DELIVERY STANCE (perceived confidence; NOT lie detection)
  → stance steers the witness system prompt (Hesitant → he overshares a thread toward an uncaught lie)
  → ONE MiniCPM-o call returns {in-character reply, contradiction-check Python}
  → modal.Sandbox executes the MODEL-WRITTEN code; its JSON verdict DECIDES the catch
    (keyword matching is only a silent fallback; on Sandbox error, the model self-corrects its code)
  → VoxCPM2 voices the reply; style escalates with pressure
catch #3 → win; the witness's voice cracks (pre-generated best take)
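
The verdict step above can be sketched as follows. This is a minimal illustration, not the project's code: the function name `decide_catch`, the verdict key `"contradiction"`, and the cue list are assumptions. It treats the Sandbox's JSON as authoritative and only falls back to keyword matching on the player's transcribed question when no usable verdict came back.

```python
import json


def decide_catch(sandbox_stdout, question_text, cue_keywords):
    """Return (caught, source); source is 'sandbox' or 'keyword-fallback'.

    Assumed verdict shape (hypothetical): {"contradiction": <bool>, ...}.
    """
    try:
        verdict = json.loads(sandbox_stdout)
        if isinstance(verdict, dict) and isinstance(verdict.get("contradiction"), bool):
            # The Sandbox produced a well-formed verdict: it decides the catch.
            return verdict["contradiction"], "sandbox"
    except (json.JSONDecodeError, TypeError):
        # Empty, missing, or non-JSON output: fall through to the fallback.
        pass

    # Silent fallback: does the question mention any planted-lie cue?
    text = question_text.lower()
    hit = any(cue.lower() in text for cue in cue_keywords)
    return hit, "keyword-fallback"
```

Returning the source alongside the boolean lets the UI panel show whether a catch came from the Sandbox or from the fallback.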

4. Functional requirements

  • 3 planted lies injected into the system prompt (timeline, authorization, relationship), each with a concrete contradiction cue the player must surface. Detection fires against THESE, not on emergent model inconsistency (reliable > magical).
  • Delivery stance from a parallel librosa pass (pause-rate + speaking-rate dominant per the prosody literature; pitch minor). Framed as perceived delivery, never "lie detector."
  • Stance is load-bearing: Hesitant delivery makes the witness leak a cue toward one uncaught lie.
  • Win at 3 catches, ≤ ~12 turns; the climactic break line is pre-generated and cached.
  • The model-written code + Sandbox verdict are shown live in an open panel (the Modal evidence).
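
The delivery-stance requirement can be illustrated with a plain-Python sketch. It assumes per-frame RMS energies are already available (for example from `librosa.feature.rms`) and that the word count comes from the ASR transcript. The function name, the "Neutral" tier, and every threshold below are hypothetical placeholders, not tuned values from the project.

```python
def delivery_stance(frame_rms, frame_seconds, word_count,
                    silence_rms=0.01, pause_hi=0.35, pause_lo=0.15,
                    wps_slow=1.8, wps_fast=2.8):
    """Map pause ratio and speaking rate to a perceived-delivery tier.

    All thresholds are illustrative assumptions. This measures how the
    question was delivered; it is not a lie detector.
    """
    if not frame_rms or frame_seconds <= 0:
        return "Neutral"

    # Pause ratio: share of frames whose energy is below the silence floor.
    silent_frames = sum(1 for energy in frame_rms if energy < silence_rms)
    pause_ratio = silent_frames / len(frame_rms)

    # Speaking rate: words per second over the whole recording.
    total_seconds = len(frame_rms) * frame_seconds
    words_per_second = word_count / total_seconds

    if pause_ratio >= pause_hi or words_per_second <= wps_slow:
        return "Hesitant"
    if pause_ratio <= pause_lo and words_per_second >= wps_fast:
        return "Confident"
    return "Neutral"
```

Pitch is left out on purpose, since the requirement treats it as a minor signal next to pause rate and speaking rate.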

5. Technical architecture (all ≤32B; ≈12B combined)

| Component | Model / lib | Notes (verified) |
| --- | --- | --- |
| Witness brain | openbmb/MiniCPM-o-4_5 (9.4B) | AutoModel, trust_remote_code; chat(msgs=, use_tts_template=False, enable_thinking=False, generate_audio=False); init_vision/audio/tts=False (text-only). |
| Witness voice | openbmb/VoxCPM2 (2B) | from_pretrained(load_denoiser=False); Voice-Design CFO once → Controllable-Clone per line generate(text="(style)...", reference_wav_path=ref); 48kHz; torch≥2.5.0. |
| Player ASR | nvidia/nemotron-speech-streaming-en-0.6b (or -3.5-asr-streaming-) | whisper-small local fallback. |
| Delivery stance | librosa | parallel waveform pass; pause/rate → tier. |
| Contradiction engine | MiniCPM-o generates networkx code → modal.Sandbox | the verdict authority. |

6. Best Use of Modal — five load-bearing primitives (the #1-prize section)

The core mechanic is Modal's flagship Sandbox pattern (docs/examples/agent, safe_code_execution).

  1. ⭐ Sandbox executes model-written code — the game's referee (network-blocked; its JSON decides catches).
  2. 🔧 Agentic self-correction — on Sandbox error, the error feeds back to MiniCPM-o, which repairs its own code and reruns (max 2) — Modal's devlooper generate→execute→fix loop.
  3. GPU inference via @app.cls, scale-to-zero — MiniCPM-o (A100) + VoxCPM2 (A10G) + Nemotron ASR (A10G), idle → $0.
  4. Parallel .map() — pre-generates the scripted voice beats (incl. the voice-crack) at load.
  5. Memory snapshot + Volume — snapshot cuts cold start (measured); a Volume persists the designed CFO voice clip + model cache.

Measured cost: quote real container-seconds → "$0.0X / match" (read from the Modal dashboard). Map this verbatim into the README's "Best Use of Modal" section (REQ-06 requires noting Modal).
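
The generate→execute→fix loop from primitive 2 can be sketched as below. This is an illustration under stated assumptions: `verdict_with_repair` and the `repair` callable (standing in for the MiniCPM-o call that rewrites its own code) are hypothetical names, and `run_locally` is a local subprocess stand-in for the Sandbox call. It provides no isolation and should not be used on untrusted code.

```python
import json
import subprocess
import sys


def run_locally(code, timeout=10):
    """Local stand-in for the Sandbox call. NOT isolated; illustration only."""
    proc = subprocess.run([sys.executable, "-c", code],
                          capture_output=True, text=True, timeout=timeout)
    return proc.returncode, proc.stdout, proc.stderr


def verdict_with_repair(code, repair, run=run_locally, max_repairs=2):
    """Run model-written code; on failure feed the error back and rerun.

    `repair(code, error_text)` is a hypothetical callable that returns
    corrected code. Returns (verdict_dict_or_None, repairs_used).
    """
    repairs_used = 0
    while True:
        returncode, stdout, stderr = run(code)
        if returncode == 0:
            try:
                return json.loads(stdout), repairs_used
            except json.JSONDecodeError as exc:
                # Clean exit but unusable output also counts as a failure.
                stderr = f"stdout was not valid JSON: {exc}"
        if repairs_used >= max_repairs:
            # Out of retries: the caller decides what to do (e.g. fallback).
            return None, repairs_used
        code = repair(code, stderr)
        repairs_used += 1
```

Passing the runner in as an argument keeps the loop identical whether the code executes locally during development or in the Sandbox in the deployed Space.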

7. UX / UI requirements

Courtroom aesthetic (parchment, serif). CFO portrait. "Delivery Stance" bar (labeled "not a lie detector"). X/3 contradiction counter. Autoplay witness audio. Contradiction Engine accordion defaults OPEN (the #1-prize evidence must be on camera). Latency (~20–35s warm) masked diegetically ("the witness considers…"). For Off-Brand, a real gr.Server custom courtroom UI would be required.

8. Demo video (the judged artifact)

60–90s, controlled, ~20 dry runs first: stance steers witness → ask hesitantly, he overshares → catch #1 → the Sandbox panel shows model-written code + verdict → catch #3 → voice cracks → cost readout. Show the Sandbox executing the model's code as the dramatic beat.

9. Success metrics

Five consecutive clean end-to-end turns from the deployed Space · win-at-3 reliable · Sandbox verdict authoritative (codegen fails on fewer than ~30% of turns; self-correction covers the rest) · voice-crack lands · measured Modal cost + snapshot seconds captured.
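
The measured-cost metric reduces to simple arithmetic: container-seconds per component multiplied by that component's per-second rate, summed over one match. The sketch below assumes hypothetical component names, and its example numbers are placeholders only. They are not Modal's pricing and not measurements from this project.

```python
def cost_per_match(container_seconds, usd_per_second):
    """Sum seconds x rate over every container that ran during one match."""
    missing = set(container_seconds) - set(usd_per_second)
    if missing:
        raise KeyError(f"no rate for: {sorted(missing)}")
    return sum(seconds * usd_per_second[name]
               for name, seconds in container_seconds.items())


# Placeholder inputs only: replace both dicts with values read from the
# Modal dashboard before quoting any figure.
example_seconds = {"witness_brain": 40.0, "witness_voice": 20.0}
example_rates = {"witness_brain": 0.5, "witness_voice": 0.25}
example_cost = cost_per_match(example_seconds, example_rates)
```

Raising on a missing rate keeps a forgotten component from silently lowering the quoted figure.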

10. Risks & mitigations

  • End-to-end turn never run (highest risk) → deploy + prove 5 turns before anything downstream.
  • Modal secrets unset → Space boots (lookup is lazy/try-excepted) but the Sandbox is dead; set MODAL_TOKEN_ID/MODAL_TOKEN_SECRET as Space secrets.
  • Codegen unreliable → self-correction loop + a networkx skeleton in the prompt; never show repeated score=0.00.
  • Voice-crack variance → pre-generate ≥30 takes of the win line, cache the best.
  • Nemotron ASR install friction → bounded attempt, else pivot to parakeet or whisper fallback (never blocks the critical path).
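
The "Modal secrets unset" mitigation can be checked at startup without blocking the boot. A minimal sketch, assuming a hypothetical helper name; the two variable names are the ones listed above.

```python
import os

REQUIRED_SECRETS = ("MODAL_TOKEN_ID", "MODAL_TOKEN_SECRET")


def missing_modal_secrets(env=None):
    """Return the names of required secrets that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_SECRETS if not env.get(name)]


def sandbox_status(env=None):
    """Human-readable status line for the UI; never raises."""
    missing = missing_modal_secrets(env)
    if missing:
        return "Sandbox disabled: missing " + ", ".join(missing)
    return "Sandbox ready"
```

Surfacing the status string in the Contradiction Engine panel would make a dead Sandbox visible during dry runs rather than in the recorded demo.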

11. Build plan (by dependency — no calendar)

  1. Set Space secrets · generate CFO portrait · (done in scaffold: lazy lookup, warmup sandbox prebuild, accordion open, torch≥2.5, generate_audio/init_audio).
  2. Deploy + smoke-test run_in_sandbox() and the voxcpm image standalone.
  3. Five consecutive end-to-end turns from the deployed Space + measured latencies/cost (the gate).
  4. ≥30 win-line takes cached · codegen reliability hardened.
  5. Nemotron ASR pivot-gate (stop-loss) · optional real gr.Server UI for Off-Brand.
  6. Demo video (after dry runs) → README measured numbers → social → submit.

12. Integrity rules

Claims follow code — no "only entry that…" claims about a moving field; cost/latency are measured, never fabricated. Pre-submit grep: TODO | YOUR_HF_USER | NotImplementedError | <!--.
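
The pre-submit grep can also be run as a small Python check, which is handy where `grep` is unavailable. This is a sketch: the function name and the list of scanned file suffixes are assumptions, while the pattern is the one given above.

```python
import re
from pathlib import Path

# Same alternation as the pre-submit grep.
PATTERN = re.compile(r"TODO|YOUR_HF_USER|NotImplementedError|<!--")


def find_placeholders(root, suffixes=(".py", ".md", ".txt", ".toml", ".yaml", ".yml")):
    """Return (path, line_number, stripped_line) for every matching line."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        # Skip directories and file types outside the (assumed) suffix list.
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if PATTERN.search(line):
                hits.append((str(path), lineno, line.strip()))
    return hits
```

An empty result means the tree is clean. Note that a script containing this pattern would flag itself if it lives inside the scanned directory.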