---
title: WitnessBox
emoji: ⚖️
colorFrom: yellow
colorTo: red
sdk: gradio
sdk_version: 5.50.0
app_file: app.py
pinned: false
license: mit
tags:
  - track:wood
  - sponsor:modal
  - sponsor:openbmb
  - achievement:offbrand
  - build-small-hackathon
  - gradio
  - minicpm
  - voxcpm
  - modal
  - voice
  - game
---

# ⚖️ WitnessBox — cross-examine a hostile AI witness with your *voice*

> Interrogate **Marcus Reid, CFO of Halcyon Dynamics**. He reads *how you deliver*
> — sound confident and he clams up; sound hesitant and he gets cocky and
> overshares. Surface **three contradictions** and his voice **cracks** as he breaks.
>
> **Track:** 🍄 An Adventure in Thousand Token Wood · **Targeting:** Best Use of Modal + Best MiniCPM Build

---

## Why it's different
A typical "interrogate a witness" build runs on text and logic. WitnessBox makes
**your vocal delivery the input**: a `librosa` pass reads your *perceived*
confidence (pauses + pace) and steers the witness in real time, and the witness
answers back in a **voice that escalates** from composed to cracking. The moat is
the audio loop, not the puzzle.

> **The delivery meter is *perceived delivery*, never a lie detector.** It reads
> how you sound (pauses, pace, pitch steadiness) — not whether anything is true.

## How a turn works
```
you speak ─┬─► Whisper ASR ───────────────► your question
           └─► librosa stance ─► CONFIDENT / NEUTRAL / HESITANT  (steers the witness)
your question ─► deterministic Contradiction Engine ─► catch?  (reproducible verdict)
persona + stance + tier + leak ─► MiniCPM4.1-8B ─► witness's line
state ─► VoxCPM2 (voice style = game state) ─► audio   (cached voice-crack on the win)
```
Hesitant delivery makes Reid leak a thread toward an uncaught lie. Confident
delivery shuts him down. Catch all three (timeline · authorization · relationship)
and he breaks; whiff too many and the bench excuses him — you lose.
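
The stance step and the catch step above can be sketched in plain Python. This is a minimal illustration, not the project's code: the `Delivery` features, every threshold, the cue phrases in `CUES`, and `MAX_MISSES` are all hypothetical stand-ins (the README does not state the real values), and the features are assumed to have been extracted from the audio already.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical thresholds: the real librosa pass may weigh features differently.
PAUSE_HESITANT, PAUSE_CONFIDENT = 0.35, 0.15   # fraction of the clip that is silence
RATE_HESITANT, RATE_CONFIDENT = 1.8, 2.6       # words per second

# Hypothetical cue phrases: the real planted cues are not listed in this README.
CUES = {
    "timeline": ("what time", "that night"),
    "authorization": ("who approved", "signed off"),
    "relationship": ("how do you know", "met before"),
}
MAX_MISSES = 5  # hypothetical; the README only says "too many"


@dataclass
class Delivery:
    pause_ratio: float        # silence time / total time, in [0, 1]
    words_per_second: float


def classify_stance(d: Delivery) -> str:
    """Map perceived delivery to CONFIDENT / NEUTRAL / HESITANT."""
    if d.pause_ratio >= PAUSE_HESITANT or d.words_per_second <= RATE_HESITANT:
        return "HESITANT"
    if d.pause_ratio <= PAUSE_CONFIDENT and d.words_per_second >= RATE_CONFIDENT:
        return "CONFIDENT"
    return "NEUTRAL"


@dataclass
class MatchState:
    caught: set = field(default_factory=set)
    misses: int = 0

    def verdict(self) -> str:
        if len(self.caught) == len(CUES):
            return "win"
        if self.misses >= MAX_MISSES:
            return "lose"
        return "ongoing"


def check_question(question: str, state: MatchState) -> Optional[str]:
    """Deterministic catch: same question and state always give the same result."""
    text = question.lower()
    for lie, cues in CUES.items():
        if lie not in state.caught and any(cue in text for cue in cues):
            state.caught.add(lie)
            return lie
    state.misses += 1  # no new contradiction surfaced: counts as a whiff
    return None
```

Because the catch is plain substring matching over fixed cues, the verdict never depends on the language model, which is what makes it reproducible. In this sketch, repeating a question about an already-caught lie counts as a miss; that is a design choice of the sketch, not something the README specifies.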

## Models — all <32B, ~11B combined
| Role | Model | Size |
|---|---|---|
| Witness brain | `openbmb/MiniCPM4.1-8B` | 8.2B |
| Witness voice | `openbmb/VoxCPM2` (style tag = game state) | 2.3B |
| Player ASR | `openai/whisper-small` (deployed) — `nvidia/nemotron-…-0.6b` is a one-image-swap upgrade (NeMo-only) | 0.24B |
| Delivery stance | `librosa` (no model) | — |

## ⚙️ Best Use of Modal
Modal is the **runtime** for all three GPU models and the beat pre-generator —
used as a *platform*, not just a host (the prize counts "inference… all"):

1. **GPU inference behind `@app.cls`, scale-to-zero.** Three models on three
   right-sized GPUs (A100 + 2×A10G); idle → `$0` via `scaledown_window`.
2. **Opt-in keep-warm.** `min_containers` defaults to `0` — genuinely `$0`
   between examinations — and flips to `1` (`WITNESSBOX_KEEP_WARM=1`) for a live
   demo so turns don't eat a cold start. Scale-to-zero is the default; warmth is
   a deliberate, costed choice, not an always-on bill.
3. **Parallel `.map()`** pre-generates every scripted beat at deploy time, fanning
   the **32 voice-crack takes across containers at once** and keeping the best.
4. **Volume** persists the designed CFO reference voice + model cache + chosen beats.
5. **Right-sized GPUs** — an A100 only for the 8B witness brain; the 2B voice and
   the ASR ride cheaper A10Gs.
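
Two of the ideas in this list, the opt-in keep-warm switch and the best-of-32 fan-out, can be sketched without Modal installed. This is a hedged illustration only: `ThreadPoolExecutor.map` stands in for Modal's parallel `.map()`, and `render_take` with its score is a hypothetical placeholder for a real voice render and whatever quality measure the project actually uses.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def min_containers(env=None) -> int:
    """0 by default (scale to zero); 1 only when WITNESSBOX_KEEP_WARM=1."""
    env = os.environ if env is None else env
    return 1 if env.get("WITNESSBOX_KEEP_WARM") == "1" else 0


def render_take(seed: int) -> dict:
    """Hypothetical stand-in for one remote voice-crack render."""
    score = ((seed * 37) % 101) / 100  # deterministic fake quality score
    return {"seed": seed, "score": score}


def best_take(n_takes: int = 32) -> dict:
    """Render all takes in parallel and keep the highest-scoring one."""
    # In the deployed app this fan-out would run across Modal containers.
    with ThreadPoolExecutor(max_workers=8) as pool:
        takes = list(pool.map(render_take, range(n_takes)))
    return max(takes, key=lambda take: take["score"])
```

The value returned by `min_containers()` is what would be passed to the class decorator's `min_containers` setting, so a warm container is only paid for when the environment variable is set explicitly.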

**Measured (warm, this deploy).** A live dynamic turn is `MiniCPM4.1-8B` **→ 5.3s**
for the witness's reply, then `VoxCPM2` **→ 8.6s** for ~4.5s of 48 kHz speech
(RTF ≈ 1.9) — the line lands as **text first**, the voice follows. The five
**scripted beats** (intro · opening · the voice-crack · win · lose) are pre-rendered
by the parallel `.map()` pass and served straight from the Volume, so every
*dramatic* moment plays **instantly**, with no per-turn inference. Idle containers →
`$0` via `scaledown_window`. (Container-seconds and $-per-match are read live from
the Modal dashboard, not fabricated.)
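
The real-time factor quoted above follows directly from the measured figures. A short worked check, assuming the two stages run back to back (the text arrives first, then the voice) and taking the approximate 4.5 s clip length at face value:

```python
llm_seconds = 5.3      # MiniCPM4.1-8B reply time (measured, warm)
tts_seconds = 8.6      # VoxCPM2 synthesis time (measured, warm)
audio_seconds = 4.5    # approximate length of the generated speech

# Real-time factor: synthesis time per second of audio produced.
rtf = tts_seconds / audio_seconds

# Assumption: TTS starts only after the LLM reply is complete.
time_to_text = llm_seconds
time_to_voice = llm_seconds + tts_seconds

print(f"RTF ~ {rtf:.2f}, text after {time_to_text:.1f}s, voice after {time_to_voice:.1f}s")
```

An RTF above 1 means the voice takes longer to synthesize than to play, which is why the line is shown as text first.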

## 🧠 Best MiniCPM Build
The witness *is* a MiniCPM model. `openbmb/MiniCPM4.1-8B` runs the entire persona —
it reads the delivery stance, decides what Reid admits or hides, and leaks a thread
toward an uncaught lie when you sound unsure — and `openbmb/VoxCPM2` gives him the
voice that cracks on the break. The 8B brain is the **core of the experience, not a
bolt-on**: every line Reid speaks is MiniCPM under a stance- and tier-conditioned
system prompt, so the drama lives or dies on how well a small model holds a character
under pressure.

## Run it
**Offline (no GPU, no Modal — boots anywhere):**
```bash
pip install -r requirements.txt
python app.py            # WITNESSBOX_BACKEND defaults to "mock"; type your questions
```
The full game loop — stance, the catch engine, state, win/lose, audio autoplay —
runs locally against a rule-based mock witness, so the end-to-end flow is provable
without a single GPU.

**Live (real models):**
```bash
modal deploy modal_app.py            # serves MiniCPM4.1-8B, VoxCPM2, Whisper ASR
modal run modal_app.py               # pre-generate the scripted beats (.map)
WITNESSBOX_BACKEND=modal python app.py
```
On a Space, set `MODAL_TOKEN_ID` / `MODAL_TOKEN_SECRET` as secrets. Lookups are
lazy and fall back to mock if Modal is unreachable, so the Space always boots.
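
A minimal sketch of the lazy lookup with mock fallback described above. Only `WITNESSBOX_BACKEND` comes from this README; `MockWitness`, `connect_modal`, and `get_witness` are hypothetical names, and `connect_modal` simply raises here to simulate Modal being unreachable.

```python
import os


class MockWitness:
    """Rule-based stand-in used offline and as the fallback."""
    name = "mock"

    def reply(self, question: str) -> str:
        return "I don't recall."


def connect_modal():
    """Hypothetical placeholder for the real Modal lookup."""
    raise ConnectionError("Modal unreachable")


def get_witness(env=None, connect=connect_modal):
    """Resolve the backend at first use, never at import time."""
    env = os.environ if env is None else env
    if env.get("WITNESSBOX_BACKEND", "mock") != "modal":
        return MockWitness()
    try:
        return connect()
    except Exception:
        # Any lookup failure degrades to the mock so the app still boots.
        return MockWitness()
```

Calling `get_witness()` inside the request handler rather than at module import is what keeps a missing token or network error from stopping the Space at startup.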

## Integrity
Detection fires against three **planted** lies with concrete cues — reliable, not
"magical." The model never grades itself. Cost/latency numbers are measured. No
"only entry that…" claims about a moving field.