---
title: WitnessBox
emoji: ⚖️
colorFrom: yellow
colorTo: red
sdk: gradio
sdk_version: 5.50.0
app_file: app.py
pinned: false
license: mit
tags:
- track:wood
- sponsor:modal
- sponsor:openbmb
- achievement:offbrand
- build-small-hackathon
- gradio
- minicpm
- voxcpm
- modal
- voice
- game
---
# ⚖️ WitnessBox — cross-examine a hostile AI witness with your *voice*
> Interrogate **Marcus Reid, CFO of Halcyon Dynamics**. He reads *how you deliver*
> — sound confident and he clams up; sound hesitant and he gets cocky and
> overshares. Surface **three contradictions** and his voice **cracks** as he breaks.
>
> **Track:** 🍄 An Adventure in Thousand Token Wood · **Targeting:** Best Use of Modal + Best MiniCPM Build
---
## Why it's different
The typical "interrogate a witness" build is text-and-logic. In WitnessBox,
**your vocal delivery is the input**: a `librosa` pass reads
your *perceived* confidence (pauses + pace) and steers the witness in real time,
and the witness answers back in a **voice that escalates** from composed to
cracking. The moat is the audio loop, not the puzzle.
> **The delivery meter is *perceived delivery*, never a lie detector.** It reads
> how you sound (pauses, pace, pitch steadiness) — not whether anything is true.
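
A minimal sketch of how a stance label could be derived from delivery features. It assumes the `librosa` pass has already produced two numbers (a pause ratio and a speaking pace); the `DeliveryFeatures` type, the `classify_stance` function, and every threshold are hypothetical and not taken from the project. Pitch steadiness is left out to keep the example short.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeliveryFeatures:
    """Features assumed to be extracted upstream (for example by a librosa pass)."""
    pause_ratio: float       # fraction of the clip that is silence, 0.0 to 1.0
    words_per_second: float  # word count from the transcript divided by clip length


# Hypothetical thresholds, chosen only for illustration.
MAX_CONFIDENT_PAUSE = 0.15
MIN_CONFIDENT_PACE = 2.5
MIN_HESITANT_PAUSE = 0.35
MAX_HESITANT_PACE = 1.5


def classify_stance(features: DeliveryFeatures) -> str:
    """Map perceived delivery to CONFIDENT, NEUTRAL, or HESITANT."""
    # Few pauses and a brisk pace read as confident.
    if (features.pause_ratio <= MAX_CONFIDENT_PAUSE
            and features.words_per_second >= MIN_CONFIDENT_PACE):
        return "CONFIDENT"
    # Long pauses or a very slow pace read as hesitant.
    if (features.pause_ratio >= MIN_HESITANT_PAUSE
            or features.words_per_second <= MAX_HESITANT_PACE):
        return "HESITANT"
    return "NEUTRAL"
```

The label describes only how the clip sounds. Nothing in it inspects the content of the question.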
## How a turn works
```
you speak ─┬─► Whisper ASR ───────────────► your question
           └─► librosa stance ─► CONFIDENT / NEUTRAL / HESITANT (steers the witness)
your question ─► deterministic Contradiction Engine ─► catch? (reproducible verdict)
persona + stance + tier + leak ─► MiniCPM4.1-8B ─► witness's line
state ─► VoxCPM2 (voice style = game state) ─► audio (cached voice-crack on the win)
```
Hesitant delivery makes Reid leak a thread toward an uncaught lie. Confident
delivery shuts him down. Catch all three (timeline · authorization · relationship)
and he breaks; whiff too many and the bench excuses him — you lose.
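
One way to picture the deterministic catch step is a plain cue lookup with no model in the loop, so the same question always yields the same verdict. This is a sketch under stated assumptions: the cue phrases, the `MAX_MISSES` limit, and the `MatchState` / `check_question` names are all hypothetical, since the README does not list the planted lies' actual cues or the miss budget.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical cue phrases for the three planted lies (illustration only).
CUES = {
    "timeline": ("march 3", "before the audit"),
    "authorization": ("who signed", "approved the transfer"),
    "relationship": ("brother-in-law", "knew the vendor"),
}
MAX_MISSES = 5  # hypothetical: misses allowed before the bench excuses the witness


@dataclass
class MatchState:
    caught: set = field(default_factory=set)
    misses: int = 0

    @property
    def outcome(self) -> str:
        if len(self.caught) == len(CUES):
            return "win"
        if self.misses >= MAX_MISSES:
            return "lose"
        return "ongoing"


def check_question(state: MatchState, question: str) -> Optional[str]:
    """Return the lie this question catches, or None and record a miss."""
    text = question.lower()
    # Dicts keep insertion order, so the scan order (and the verdict) is fixed.
    for lie, cues in CUES.items():
        if lie not in state.caught and any(cue in text for cue in cues):
            state.caught.add(lie)
            return lie
    state.misses += 1
    return None
```

Because the verdict comes from string matching rather than from the language model, replaying a transcript reproduces the same catches.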
## Models — all <32B, ~11B combined
| Role | Model | Size |
|---|---|---|
| Witness brain | `openbmb/MiniCPM4.1-8B` | 8.2B |
| Witness voice | `openbmb/VoxCPM2` (style tag = game state) | 2.3B |
| Player ASR | `openai/whisper-small` (deployed) — `nvidia/nemotron-…-0.6b` is a one-image-swap upgrade (NeMo-only) | 0.24B |
| Delivery stance | `librosa` (no model) | — |
## ⚙️ Best Use of Modal
Modal is the **runtime** for all three GPU models and the beat pre-generator —
used as a *platform*, not just a host (the prize counts "inference… all"):
1. **GPU inference behind `@app.cls`, scale-to-zero.** Three models on three
right-sized GPUs (A100 + 2×A10G); idle → `$0` via `scaledown_window`.
2. **Opt-in keep-warm.** `min_containers` defaults to `0` — genuinely `$0`
between examinations — and flips to `1` (`WITNESSBOX_KEEP_WARM=1`) for a live
demo so turns don't eat a cold start. Scale-to-zero is the default; warmth is
a deliberate, costed choice, not an always-on bill.
3. **Parallel `.map()`** pre-generates every scripted beat at deploy time, fanning
the **32 voice-crack takes across containers at once** and keeping the best.
4. **Volume** persists the designed CFO reference voice + model cache + chosen beats.
5. **Right-sized GPUs** — an A100 only for the 8B witness brain; the 2B voice and
the ASR ride cheaper A10Gs.
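
The keep-warm toggle and the best-of-32 selection can be sketched without Modal installed. This example is an assumption-laden stand-in: `ThreadPoolExecutor.map` replaces Modal's `.map()` so the fan-out runs locally, `render_take` is a placeholder for a real `VoxCPM2` call, and its score is a made-up deterministic number. Only `WITNESSBOX_KEEP_WARM`, the `min_containers` values, and the count of 32 takes come from the list above. Keeping the environment lookup in its own function lets the same value feed a deploy-time setting and be checked in isolation.

```python
import os
from concurrent.futures import ThreadPoolExecutor

N_TAKES = 32  # number of voice-crack takes, as stated above


def resolve_min_containers(env=None) -> int:
    """Return 1 only when WITNESSBOX_KEEP_WARM=1 is set; otherwise scale to zero."""
    env = os.environ if env is None else env
    return 1 if env.get("WITNESSBOX_KEEP_WARM") == "1" else 0


def render_take(seed: int) -> dict:
    """Placeholder for one synthesized take; the score is a hypothetical stand-in."""
    score = (seed * 37) % 101 / 100
    return {"seed": seed, "score": score}


def pick_best_take(n_takes: int = N_TAKES) -> dict:
    """Render all takes in parallel and keep the highest-scoring one."""
    # Local stand-in for a Modal .map() fan-out across containers.
    with ThreadPoolExecutor(max_workers=8) as pool:
        takes = list(pool.map(render_take, range(n_takes)))
    return max(takes, key=lambda take: take["score"])
```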
**Measured (warm, this deploy).** A live dynamic turn is `MiniCPM4.1-8B` **→ 5.3s**
for the witness's reply, then `VoxCPM2` **→ 8.6s** for ~4.5s of 48 kHz speech
(RTF ≈ 1.9) — the line lands as **text first**, the voice follows. The five
**scripted beats** (intro · opening · the voice-crack · win · lose) are pre-rendered
by the parallel `.map()` pass and served straight from the Volume, so every
*dramatic* moment plays **instantly**, off the per-turn path. Idle containers cost
`$0` via `scaledown_window`. (Container-seconds and $-per-match are read live from the
Modal dashboard, not fabricated.)
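
The real-time factor and the per-turn waits follow directly from the three measured numbers. The short calculation below assumes the voice is synthesized only after the reply text is complete, which matches the "text first, voice follows" description but is not confirmed in detail here; the variable names are illustrative.

```python
# Measured values quoted above (warm containers).
llm_seconds = 5.3    # MiniCPM4.1-8B reply latency
tts_seconds = 8.6    # VoxCPM2 synthesis latency
audio_seconds = 4.5  # approximate length of the synthesized speech

# Real-time factor: synthesis time per second of audio produced.
rtf = tts_seconds / audio_seconds

# Assumption: synthesis starts once the reply text is available.
time_to_text = llm_seconds
time_to_voice = llm_seconds + tts_seconds

print(f"RTF ≈ {rtf:.1f}")
print(f"text after {time_to_text:.1f}s, voice after {time_to_voice:.1f}s")
```

An RTF above 1 means synthesis is slower than playback, which is why the text is shown first rather than waiting for the audio.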
## 🧠 Best MiniCPM Build
The witness *is* a MiniCPM model. `openbmb/MiniCPM4.1-8B` runs the entire persona —
it reads the delivery stance, decides what Reid admits or hides, and leaks a thread
toward an uncaught lie when you sound unsure — and `openbmb/VoxCPM2` gives him the
voice that cracks on the break. The 8B brain is the **core of the experience, not a
bolt-on**: every line Reid speaks is MiniCPM under a stance- and tier-conditioned
system prompt, so the drama lives or dies on how well a small model holds a character
under pressure.
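
A stance- and tier-conditioned system prompt might be assembled along these lines. The prompt wording, the tier names, and the `build_system_prompt` function are hypothetical; the README states only that the prompt depends on persona, stance, tier, and leak.

```python
from typing import Optional

# Hypothetical instruction text per stance (illustration only).
STANCE_RULES = {
    "CONFIDENT": "The examiner sounds sure of themselves. Be guarded and keep answers short.",
    "NEUTRAL": "The examiner sounds measured. Stay composed and cooperative on the surface.",
    "HESITANT": "The examiner sounds unsure. Get cocky and say more than you should.",
}

# Hypothetical composure tiers, indexed by how many contradictions are caught.
TIER_STYLES = ("composed", "rattled", "cracking")


def build_system_prompt(stance: str, caught: int, leak: Optional[str] = None) -> str:
    """Combine persona, stance rule, composure tier, and an optional leak hint."""
    tier = TIER_STYLES[min(caught, len(TIER_STYLES) - 1)]
    parts = [
        "You are Marcus Reid, CFO of Halcyon Dynamics, testifying as a hostile witness.",
        STANCE_RULES[stance],
        f"Your composure is {tier}.",
    ]
    # Only a hesitant examiner gets a thread toward an uncaught lie.
    if stance == "HESITANT" and leak:
        parts.append(f"Let slip a hint about: {leak}.")
    return "\n".join(parts)
```

For example, `build_system_prompt("HESITANT", 0, "the timeline")` includes the leak line, while the same call with `"CONFIDENT"` omits it.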
## Run it
**Offline (no GPU, no Modal — boots anywhere):**
```bash
pip install -r requirements.txt
python app.py # WITNESSBOX_BACKEND defaults to "mock"; type your questions
```
The full game loop — stance, the catch engine, state, win/lose, audio autoplay —
runs locally against a rule-based mock witness, so the end-to-end flow is provable
without a single GPU.
**Live (real models):**
```bash
modal deploy modal_app.py # serves MiniCPM4.1-8B, VoxCPM2, Whisper ASR
modal run modal_app.py # pre-generate the scripted beats (.map)
WITNESSBOX_BACKEND=modal python app.py
```
On a Space, set `MODAL_TOKEN_ID` / `MODAL_TOKEN_SECRET` as secrets. Lookups are
lazy and fall back to mock if Modal is unreachable, so the Space always boots.
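
The lazy lookup with a mock fallback could look roughly like this. It is a sketch, not the app's actual code: `LazyBackend`, `MockWitness`, and `connect_modal` are hypothetical names, and `connect_modal` always fails here so the fallback path is visible without a Modal account. Only the `WITNESSBOX_BACKEND` variable and its `"mock"` default come from the README.

```python
import os


class MockWitness:
    """Stand-in for the rule-based mock witness."""
    name = "mock"

    def reply(self, question: str) -> str:
        return "I don't recall."


def connect_modal():
    """Hypothetical placeholder for the real Modal lookup; raises when unreachable."""
    raise ConnectionError("Modal is unreachable in this sketch")


class LazyBackend:
    """Resolve the backend on first use and fall back to the mock on any failure."""

    def __init__(self, env=None, connect=connect_modal):
        self._env = os.environ if env is None else env
        self._connect = connect
        self._backend = None

    def get(self):
        if self._backend is None:
            wants_modal = self._env.get("WITNESSBOX_BACKEND", "mock") == "modal"
            try:
                self._backend = self._connect() if wants_modal else MockWitness()
            except Exception:
                # Never block startup on a remote dependency.
                self._backend = MockWitness()
        return self._backend
```

Deferring the connection to the first `get()` call means importing the app never touches the network, which is what lets the Space boot even when the remote side is down.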
## Integrity
Detection fires against three **planted** lies with concrete cues — reliable, not
"magical." The model never grades itself. Cost/latency numbers are measured. No
"only entry that…" claims about a moving field.