# audio.md — DriftCall Audio Pipeline (Kokoro-82M TTS + faster-whisper-small ASR)
**Owner:** Person C (Training & Data), secondary: Person A (integration glue)
**Implements:** DESIGN.md §9 (Audio Pipeline, 9.1–9.4), §3.3 (Deployed Env Topology), §3.4 (Demo Topology)
**Status:** DRAFT — pending ≥ 2 fresh critic rounds
---
## 1. Purpose
`driftcall/audio/` houses the two model wrappers that convert between text and speech at the **env boundary**: `tts_kokoro.py` (text → 16 kHz mono WAV bytes) and `asr_whisper.py` (WAV/PCM bytes → transcript + detected language + confidence). They exist so the deployed env and demo Space can honestly claim "voice-first" while the training loop stays text-in/text-out for throughput.
This module is the **single place** where audio-model state lives. Both engines are heavy (~82M and ~244M params respectively) and slow to initialize on CPU, so each exposes a module-level singleton constructed lazily on first call and reused across all sessions in the process. The FastAPI env (`app.py`) calls the factory once at startup; the Gradio demo (`demo/app_gradio.py`) does the same. The training loop (`training/train_grpo.py`) **never imports these modules** — not even the factory — because `import kokoro` / `import faster_whisper` pulls in torchaudio and a 50 MB tokenizer per process, and we do not want that weight in the GRPO rollout worker.
The guiding constraints from DESIGN.md §9:
1. **CPU-only.** Both models must run on the free-tier HF Space (basic CPU). No `cuda` fall-through, no `torch.compile`, no GPU-dependent kernels. Kokoro-82M synthesizes 3–11× faster than real time on CPU; faster-whisper-small (int8) decodes at ~1× real time. Each fits in < 1.2 GB RAM.
2. **Deterministic where possible.** TTS takes a `seed: int = 0` argument forwarded to torch's generator so synthesized clips are byte-reproducible given the same (text, voice, seed) triple. ASR uses `beam_size=1` (greedy) for reproducibility; with `vad_filter=True`, outputs are stable across runs on the same input.
3. **Latency budgets.** TTS < 500 ms for a 1-sentence utterance. ASR ≈ 1× real-time (a 4-second clip decodes in ≈ 4 seconds on CPU basic). Env `/step` endpoint budgets 2 seconds total per turn — the audio path must not dominate.
4. **Indic support.** Hindi, Tamil, Kannada, English, and Hinglish (code-mixed). Voice-pack selection per language is defined in §4.3; ASR language hint is passed per-episode from `GoalSpec.language`.
The module is **not** called on every training rollout — DESIGN.md §9.4 is emphatic about this, and §3 ("Behavior spec") documents the runtime split.
### 1.1 Whisper size trade-off + migration path
`faster-whisper-small` (~244M params, ~120 MB int8) was chosen to hit the ~1× real-time decode budget on the free-tier CPU Space. We explicitly acknowledge this comes at a cost: `small` has **measurably degraded Word Error Rate on Hindi / Tamil / Kannada** compared to `large-v3`. Published faster-whisper benchmarks show roughly a 5–10 percentage point WER gap on Indic audio depending on noise and code-mix. `large-v3` is not a free-tier option: ~3 GB weights on disk, >3 GB resident RAM during decode, and a decode time of at least 3× the clip duration on CPU basic. It would bust both the memory budget (the 16 GB tier is shared across app, sim-caller, TTS, and observation builder) and the §3 latency budget.
**Migration path (explicit, not aspirational):**
1. If Batch C2 baseline R1 on Hindi episodes is **< 0.4**, bump to `Systran/faster-whisper-medium` (~700 MB int8). This is a one-line `model_id=` change; all behaviour in this doc still holds. Move the **env Space only** (not demo) to HF CPU Pro (+$5/mo, fits in the ≤ $30/mo deployment budget per DESIGN.md §13).
2. If `medium` is still insufficient on Hindi/Tamil/Kannada (R1 < 0.4 after Stage-1 training), escalate to `large-v3` **on the demo Space only** (ZeroGPU), keeping the env Space on `small`/`medium` on CPU. This means the demo plays the more impressive transcript while the env used for reward grading stays on the deterministic CPU config — an acceptable asymmetry because demo ASR is never used for reward attribution (see §6.3 — rewards do not re-transcribe).
The chosen default for hackathon ship is `small` + int8 on CPU. Any escalation above requires orchestrator approval and a DESIGN.md §9.2 update.
---
## 2. Interface
Every declaration below is the *exact* target signature. `env.py` / `app.py` / `demo/app_gradio.py` depend on these signatures; no addition or rename is allowed without a DESIGN.md update first.
### 2.1 `driftcall/audio/tts_kokoro.py`
```python
from __future__ import annotations

from dataclasses import dataclass
from typing import TYPE_CHECKING, Callable, Literal

if TYPE_CHECKING:  # annotation-only imports; nothing heavy is loaded at runtime
    import numpy as np

    from driftcall.audio.trace import AudioTrace

LanguageCode = Literal["hi", "ta", "kn", "en", "hinglish"]

VoicePack = Literal[
    "hi_female_1",
    "hi_male_1",
    "ta_female_1",
    "kn_male_1",
    "en_indian_female_1",
]


@dataclass(frozen=True)
class VoicePackMapping:
    """Per-language default + allowed voice packs for Kokoro.

    DESIGN.md §9.1 lists the five packs. The mapping is frozen at module
    load and never mutated.
    """

    language: LanguageCode
    default: VoicePack
    allowed: tuple[VoicePack, ...]

# Module-level constant. Frozen at import time; see §4.3 for the authoritative
# per-row rationale. The literal below IS the full contents — five entries, one
# per LanguageCode. No runtime mutation.
VOICE_PACKS: dict[LanguageCode, VoicePackMapping] = {
    "hi": VoicePackMapping(
        language="hi",
        default="hi_female_1",
        allowed=("hi_female_1", "hi_male_1"),
    ),
    "ta": VoicePackMapping(
        language="ta",
        default="ta_female_1",
        allowed=("ta_female_1",),
    ),
    "kn": VoicePackMapping(
        language="kn",
        default="kn_male_1",
        allowed=("kn_male_1",),
    ),
    "en": VoicePackMapping(
        language="en",
        default="en_indian_female_1",
        allowed=("en_indian_female_1",),
    ),
    "hinglish": VoicePackMapping(
        language="hinglish",
        default="en_indian_female_1",
        allowed=("en_indian_female_1", "hi_female_1"),
    ),
}


class TTSEngine:
    """Kokoro-82M wrapper. One instance per process.

    Constructed via `get_tts_engine()`; do NOT instantiate directly in
    consumer code, because the singleton guarantees the model is loaded once.
    """

    def __init__(
        self,
        *,
        model_id: str = "hexgrad/Kokoro-82M",
        trace_sink: "Callable[[AudioTrace], None] | None" = None,
    ) -> None: ...

    def synthesize(
        self,
        text: str,
        language_code: LanguageCode,
        voice_pack: VoicePack | None = None,
        *,
        seed: int = 0,
        sample_rate_hz: int = 16000,
    ) -> bytes:
        """Return 16-bit PCM mono WAV bytes.

        - `voice_pack=None` → use `VOICE_PACKS[language_code].default`.
        - `voice_pack` outside `VOICE_PACKS[language_code].allowed` → `UnsupportedVoicePackError`.
        - Deterministic given (text, voice_pack, seed, sample_rate_hz).
        - Cached in LRU (see §3.4).
        - Returns the full WAV (RIFF header + PCM), ready to write to disk
          or send as `Response(content=..., media_type="audio/wav")`.
        """

    def synthesize_to_gradio(
        self,
        text: str,
        language_hint: LanguageCode,
        voice_pack: VoicePack | None = None,
        *,
        seed: int = 0,
    ) -> tuple[int, "np.ndarray"]:
        """Gradio-friendly sibling of `synthesize`.

        Returns `(sample_rate, float32 np.ndarray)` with shape `(n_samples,)`
        (mono). This matches Gradio's `gr.Audio(type="numpy")` expected output.
        Internally calls the same Kokoro path as `synthesize()`, skipping the
        WAV encoding step and returning the float32 tensor-as-numpy directly.
        The LRU cache from §3.4 is NOT shared: Gradio-path outputs are
        cached separately under a key that includes a `fmt="numpy"` discriminator,
        so byte-cache and numpy-cache never collide.

        - `voice_pack=None` → use `VOICE_PACKS[language_hint].default`.
        - Sample rate is fixed at 16000 to match the `synthesize()` contract.
        - Deterministic given (text, voice_pack, seed).
        """

    def warmup(self) -> None:
        """Run one synthesize() with a canonical string to force model load.

        Called by `app.py` startup hook so the first real request is fast.
        """


def get_tts_engine() -> TTSEngine:
    """Return the process-wide TTSEngine singleton (lazy-constructed)."""
```
**Which caller uses which helper (binding contract):**
| Caller | Helper | Return type | Framing |
|---|---|---|---|
| FastAPI `/synthesize` endpoint in `app.py` | `TTSEngine.synthesize` | `bytes` (RIFF WAV) | `Response(content=wav_bytes, media_type="audio/wav")` |
| FastAPI `/step` audio field in `app.py` | `TTSEngine.synthesize` | `bytes` | Embedded as base64 inside the JSON step response. |
| Gradio demo in `demo/app_gradio.py` | `TTSEngine.synthesize_to_gradio` | `tuple[int, np.ndarray]` | Direct return to `gr.Audio(type="numpy")` output component. |
| Tests | Either, per-case | — | WAV-bytes tests use `synthesize`; spectral / numpy-domain tests use `synthesize_to_gradio`. |
Rationale for two helpers rather than `synthesize` + a numpy-wrapper: re-decoding WAV bytes back into float32 numpy inside the Gradio path wastes ~3 ms and doubles the memory briefly (encoded bytes + re-decoded tensor). Keeping a numpy-native return avoids that round-trip for the demo-critical path.
### 2.2 `driftcall/audio/asr_whisper.py`
```python
from __future__ import annotations

from dataclasses import dataclass
from typing import TYPE_CHECKING, Callable, Literal

if TYPE_CHECKING:  # annotation-only import
    from driftcall.audio.trace import AudioTrace

LanguageCode = Literal["hi", "ta", "kn", "en", "hinglish"]


@dataclass(frozen=True)
class TranscriptResult:
    """ASR output surfaced to the env observation builder.

    - `text` is NFC-normalized Unicode; empty string on silence.
    - `language_detected` is the Whisper-reported language code; may disagree
      with the hint (e.g., hint="hi", detected="en" for code-mixed utterances).
    - `confidence` is the mean token log-prob mapped to [0.0, 1.0] via
      exp-normalize (see §3.5). 1.0 = perfect, 0.0 = pathological.
    - `duration_s` is the decoded clip length in seconds (float, rounded to 3dp).
    """

    text: str
    language_detected: LanguageCode | Literal["unknown"]
    confidence: float
    duration_s: float


class ASREngine:
    """faster-whisper-small (int8) wrapper. One instance per process.

    Constructed via `get_asr_engine()`.
    """

    def __init__(
        self,
        *,
        model_id: str = "Systran/faster-whisper-small",
        compute_type: Literal["int8", "int8_float16"] = "int8",
        trace_sink: "Callable[[AudioTrace], None] | None" = None,
    ) -> None: ...

    def transcribe(
        self,
        audio_bytes: bytes,
        language_hint: LanguageCode | None,
        *,
        beam_size: int = 1,
        vad_filter: bool = True,
        max_duration_s: float = 30.0,
    ) -> TranscriptResult:
        """Decode a WAV/PCM clip into a TranscriptResult.

        - `audio_bytes` must be a RIFF WAV with mono 16-bit PCM at 16 kHz OR
          raw float32 PCM at 16 kHz (told apart by the presence or absence of
          the RIFF magic bytes). Other formats → `AudioDecodeError`.
        - `language_hint="hinglish"` is translated to `language="hi"` at the
          Whisper call site (Whisper has no Hinglish code); detected language
          may come back as "hi" or "en".
        - `language_hint=None` → autodetect (slower on first pass).
        - Truncates to `max_duration_s` silently and sets
          `result.duration_s = max_duration_s` (see edge case §7.3).
        - Returns a populated `TranscriptResult`; never raises on a merely
          low-confidence decode, since that is a policy decision for the caller.
        """

    def warmup(self) -> None:
        """Run one transcribe() on 0.5s of silence to load weights + VAD."""


def get_asr_engine() -> ASREngine:
    """Return the process-wide ASREngine singleton (lazy-constructed)."""
```
### 2.2a `driftcall/audio/trace.py` (shared between TTS + ASR)
```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Literal


@dataclass(frozen=True)
class AudioTrace:
    """Per-call diagnostic record for synthesize() and transcribe().

    Emitted via the `trace_sink` callback passed to each engine's __init__.
    Consumed by the `/audio/trace` FastAPI endpoint and the demo UI live overlay.
    Never mutated after construction (frozen).
    """

    op: Literal["synthesize", "transcribe"]
    input_hash: str           # blake2b hex digest of text (for TTS) or audio bytes (for ASR)
    language: str             # requested language code or "unknown"
    duration_s: float         # clip duration in seconds (output for TTS, input for ASR)
    latency_ms: int           # wall-clock call latency
    confidence: float | None  # ASR: TranscriptResult.confidence; TTS: None
    cache_hit: bool           # TTS: LRU hit? ASR: always False
    degraded: bool            # True on voice-pack fallback (TTS) OR coerced-empty (ASR)
    ts_ist: str               # ISO-8601 timestamp in Asia/Kolkata tz


TraceSink = Callable[[AudioTrace], None]
```
### 2.3 Custom exceptions
Defined in `driftcall/audio/errors.py` (tiny module, shared):
```python
class AudioError(Exception): ...
class ModelLoadError(AudioError): ...
class UnsupportedLanguageError(AudioError): ...
class UnsupportedVoicePackError(AudioError): ...
class AudioDecodeError(AudioError): ...
class AudioTooLongError(AudioError): ... # only raised if caller passes strict=True
class TTSOutOfMemoryError(AudioError): ...
```
`env.py` catches `AudioError` at the boundary and either degrades (see §5) or 500s the HTTP response.
### 2.4 `__all__`
```python
# tts_kokoro.py
__all__ = [
    "LanguageCode",
    "VoicePack",
    "VoicePackMapping",
    "VOICE_PACKS",
    "TTSEngine",
    "get_tts_engine",
]

# asr_whisper.py
__all__ = [
    "LanguageCode",
    "TranscriptResult",
    "ASREngine",
    "get_asr_engine",
]

# trace.py
__all__ = [
    "AudioTrace",
    "TraceSink",
]
```
---
## 3. Behavior spec
### 3.1 Training-vs-deploy split (DESIGN.md §9.4 — load-bearing)
| Runtime | Imports `driftcall.audio`? | TTS in loop? | ASR in loop? | Why |
|---|---|---|---|---|
| **Training** (`training/train_grpo.py`, local V100) | **No.** Explicit negative contract. | No | No | Speed. Pre-authored text transcripts go straight into `DriftCallObservation.last_transcript`. `last_confidence=1.0` (treated as perfect ASR). ~10× faster rollouts. |
| **Deployed env** (HF Space CPU basic, `app.py`) | Yes, via `get_tts_engine()` + `get_asr_engine()` at startup | Yes (on `SPEAK` actions) | Yes (on every inbound `/step` that carries audio bytes) | DESIGN.md §9.4: "env is genuinely voice-driven for realism". Sim-caller in §3.1 synthesizes user utterances; ASR at env boundary transcribes before embedding into observation. |
| **Demo Space** (Gradio, ZeroGPU / A10G) | Yes | Yes | Yes + live mic input | Judge interaction. |
`env.py` toggles between modes via a single flag: `DriftCallEnv(audio_boundary_enabled: bool = False)`. Default `False` means the training path; `True` is set only inside `app.py` (FastAPI) and `demo/app_gradio.py`. The flag is checked once in `__init__`; it does not change per-step. **Tests:** `tests/test_env.py` must verify that `DriftCallEnv(audio_boundary_enabled=False)` does not import `driftcall.audio.*` at all (use `sys.modules` assertion before/after reset).
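
The negative import contract can be checked with a small helper around `sys.modules`. The following is a minimal sketch, not the project's actual test: `loaded_under` is a hypothetical helper name, and the commented-out usage assumes the `DriftCallEnv` constructor described above.

```python
import sys


def loaded_under(prefix: str) -> list[str]:
    """Return names of already-imported modules equal to or nested under `prefix`."""
    return sorted(
        name
        for name in sys.modules
        if name == prefix or name.startswith(prefix + ".")
    )


# Intended use inside tests/test_env.py (assumes the DriftCallEnv API above):
#   env = DriftCallEnv(audio_boundary_enabled=False)
#   env.reset()
#   assert loaded_under("driftcall.audio") == []
```

Matching on `prefix + "."` rather than a bare `startswith(prefix)` avoids false positives from unrelated packages that merely share a name prefix.
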
### 3.2 Model load lifecycle
- **Lazy singleton.** `get_tts_engine()` and `get_asr_engine()` wrap a `_tts: TTSEngine | None = None` / `_asr: ASREngine | None = None` module-global. First call constructs and caches; subsequent calls return the cache. Thread-safe via `threading.Lock` rather than an asyncio lock: FastAPI runs synchronous (`def`) endpoints in a worker thread pool, and even on async workers the lock is uncontended after warmup.
- **Download.** Kokoro-82M and faster-whisper-small are pulled from HF Hub on first load. The Dockerfile for the env Space (`deploy_env_space.md`) pre-pulls both into `/root/.cache/huggingface/` at image-build time so cold start on Space does not re-download (re-pulling several hundred megabytes of weights on every start risks exceeding the free-tier startup timeout).
- **Warmup.** `app.py` lifespan hook calls `get_tts_engine().warmup()` and `get_asr_engine().warmup()` serially before the server binds its port. This burns ~8 seconds but ensures the first user request does not face a 5+ second first-inference penalty. The demo Space does the same in its Gradio `demo.load` event.
- **Unload.** Never. The engines live for the process lifetime. Sessions come and go; the models stay hot. This is safe because both are stateless between calls (no session-private buffers).
### 3.3 Determinism
- **TTS.** Kokoro exposes a `torch.Generator` seed. `synthesize(..., seed=N)` calls `torch.manual_seed(N)` inside a `torch.random.fork_rng()` context so the global RNG is unaffected (critical: do not pollute the trainer's RNG). Given identical `(text, voice_pack, seed, sample_rate_hz)`, the output is byte-for-byte identical. Floating-point non-determinism across CPU architectures is theoretical but not observed on x86_64 AVX2, which is the only target (Docker image pins to `python:3.11-slim` on amd64).
- **ASR.** `beam_size=1` disables beam search (greedy decoding, deterministic given weights + input). `vad_filter=True` uses a deterministic silero-VAD pass that is stable across runs. faster-whisper's default `temperature` is a fallback schedule that starts at `0.0` and retries at higher, sampled temperatures when a segment fails its quality thresholds, so `transcribe()` passes `temperature=0.0` explicitly to keep decoding greedy on every segment.
- **Training implication.** Neither engine is called in training, so RNG safety there is moot, but `fork_rng()` is kept for hygiene in case future eval scripts run TTS after a seeded rollout.
### 3.4 LRU caching for TTS
- **Key.** `(text_hash, voice_pack, seed, sample_rate_hz)` where `text_hash = blake2b(text.encode("utf-8"), digest_size=16).hexdigest()`. Using the hash (not the raw string) bounds key size and keeps the LRU memory footprint predictable. **Key-extension rationale:** `seed` and `sample_rate_hz` are in the key because `synthesize` accepts them as arguments and they change output bytes; omitting them would cause silent cache-hit corruption when a caller changes either parameter. This is why the key is richer than the DESIGN.md-level sketch `(text, voice_pack)`.
- **Value.** The WAV bytes (typically 30–80 KB for a 1-sentence Hindi utterance at 16 kHz; up to ~180 KB for a 4–6 s Hindi utterance).
- **Capacity.** 256 entries, with a byte budget of **64 MB**. `functools.lru_cache(maxsize=256)` is NOT used because it doesn't handle the hash-first indirection cleanly. Implementation note: `cachetools.LRUCache(maxsize, getsizeof)` enforces a single limit. When `getsizeof` is supplied, `maxsize` bounds the *sum* of `getsizeof(value)` rather than the entry count, so `LRUCache(maxsize=256, getsizeof=len)` would cap the cache at 256 bytes and reject every WAV as too large. We therefore construct `cachetools.LRUCache(maxsize=64 * 1024 * 1024, getsizeof=len)` behind an explicit lock, which bounds worst-case memory by the byte cap, and enforce the 256-entry limit separately in the wrapper (evict the least-recently-used entry while `len(cache) > 256`).
- **Memory envelope (worst-case vs typical).**
  - Typical: 256 × ~60 KB ≈ **15 MB** (average 1-sentence utterances).
  - Worst-case: 256 × ~180 KB ≈ **46 MB** (4–6 s Hindi utterances at 16 kHz 16-bit post-resample).
  - Upper-bounded by the byte cap at **64 MB**. Above 64 MB, oldest entries evict by LRU order regardless of the 256 entry count.
  - Caching pre-resample (24 kHz, Kokoro native) audio would cost ~69 MB worst-case; we do NOT do that. The resample in §4.4 happens inside `synthesize()` before WAV encoding, so the cache stores 16 kHz bytes only. This is why the cache key includes `sample_rate_hz`: if a future caller ever requests 24 kHz output, it will cache under a separate key rather than colliding with 16 kHz entries.
- **Cache scope.** **Process-wide singleton, GLOBAL — not per-session.** All concurrent sessions (up to 10 per DESIGN.md §3.3) share ONE cache. This is intentional: the TTS output for `(text, voice_pack, seed, sample_rate_hz)` is deterministic and carries no session-private data, so sharing is safe and maximises hit rate (sim-caller re-synthesizing the same `seed_utterance` across sessions benefits from the shared cache).
- **Invalidation.** None — (text, voice, seed, sample_rate_hz) tuples deterministically produce the same bytes, so cache entries are eternal. Model change invalidates everything by process restart.
- **Why cache.** Demo Space replays the same goal utterance across multiple toggle switches (base ⇄ trained LoRA), and the env's sim-caller re-synthesizes the same `seed_utterance` each time the user re-runs an episode. Hit rate is >90% in the demo setting, turning a 300 ms synth into a 1 ms memcpy.
- **No ASR caching.** ASR inputs are already-variable WAV bytes; repeat rate is low, and keying on audio-byte hashes is O(audio length). Not worth it.
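
The key construction and the two eviction limits described in this section can be sketched with the standard library alone. This is a hedged illustration, not the shipped cache: `ByteBudgetLRU` and `make_key` are hypothetical names, and an `OrderedDict` is used here in place of `cachetools` so the example has no third-party dependency.

```python
import hashlib
import threading
from collections import OrderedDict
from typing import Optional

CacheKey = tuple  # (text_hash, voice_pack, seed, sample_rate_hz)


def make_key(text: str, voice_pack: str, seed: int, sample_rate_hz: int) -> CacheKey:
    """Hash the text so key size stays bounded regardless of utterance length."""
    text_hash = hashlib.blake2b(text.encode("utf-8"), digest_size=16).hexdigest()
    return (text_hash, voice_pack, seed, sample_rate_hz)


class ByteBudgetLRU:
    """LRU over WAV bytes, bounded by entry count AND by total payload bytes."""

    def __init__(self, max_entries: int = 256, max_bytes: int = 64 * 1024 * 1024) -> None:
        self._max_entries = max_entries
        self._max_bytes = max_bytes
        self._data = OrderedDict()  # CacheKey -> bytes, oldest entry first
        self._bytes = 0
        self._lock = threading.Lock()

    def get(self, key: CacheKey) -> Optional[bytes]:
        with self._lock:
            wav = self._data.get(key)
            if wav is not None:
                self._data.move_to_end(key)  # mark as most recently used
            return wav

    def put(self, key: CacheKey, wav: bytes) -> None:
        if len(wav) > self._max_bytes:
            return  # a single oversized clip is simply not cached
        with self._lock:
            old = self._data.pop(key, None)
            if old is not None:
                self._bytes -= len(old)
            self._data[key] = wav
            self._bytes += len(wav)
            while len(self._data) > self._max_entries or self._bytes > self._max_bytes:
                _, evicted = self._data.popitem(last=False)  # least recently used
                self._bytes -= len(evicted)
```

Because `seed` and `sample_rate_hz` are part of the key, changing either one produces a cache miss rather than returning bytes synthesized under different parameters.
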
### 3.5 Confidence mapping (ASR)
faster-whisper exposes per-segment `avg_logprob` (mean log-probability over tokens, in the range roughly `[-1.5, 0.0]`). We map it to a [0, 1] confidence via:
```python
import math


def _logprob_to_confidence(avg_logprob: float) -> float:
    # avg_logprob ∈ [-1.5, 0.0] approx. Clamp then exp-normalize.
    clamped = max(-1.5, min(0.0, avg_logprob))
    return round(math.exp(clamped), 3)
```
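This matches the DESIGN.md §4.1 `DriftCallObservation.last_confidence` semantics (`0.0 ≤ c ≤ 1.0`, 1.0 in training). Because of the clamp, the mapped value never falls below `exp(-1.5) ≈ 0.223`; a confidence of `0.0` arises only from the empty-text coercion described below. When the clip has multiple segments, we take the duration-weighted mean of per-segment confidences.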
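
As a worked example of the multi-segment case, the duration-weighted mean could be computed as below. This is a sketch under assumptions: `clip_confidence` is a hypothetical helper, segments are modelled as `(duration_s, avg_logprob)` pairs, and rounding is applied once at the end rather than per segment.

```python
import math


def logprob_to_confidence(avg_logprob: float) -> float:
    """Clamp to [-1.5, 0.0], then exponentiate (unrounded variant)."""
    clamped = max(-1.5, min(0.0, avg_logprob))
    return math.exp(clamped)


def clip_confidence(segments: list) -> float:
    """Duration-weighted mean confidence over (duration_s, avg_logprob) pairs."""
    total = sum(duration for duration, _ in segments)
    if total <= 0.0:
        return 0.0  # no decoded speech: treated like the empty-text case
    weighted = sum(
        duration * logprob_to_confidence(logprob) for duration, logprob in segments
    )
    return round(weighted / total, 3)


# 3 s at logprob 0.0 (confidence 1.0) plus 1 s at -1.5 (confidence ~0.223):
# (3 * 1.0 + 1 * 0.223) / 4 ≈ 0.806
print(clip_confidence([(3.0, 0.0), (1.0, -1.5)]))
```

Weighting by duration keeps a short, noisy trailing segment from dragging down the confidence of an otherwise clean clip.
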
**Empty-text-with-nonzero-confidence branching.** faster-whisper can decode to `text=""` while still reporting `avg_logprob > -1.5` (i.e., `confidence > 0`) on short non-silent clips where the acoustic model produces only whitespace / punctuation tokens that get stripped in post-processing. This is distinct from the VAD-silent case (§7.4) where VAD drops every segment before decode. Branching logic inside `transcribe()`:
```python
# Fragment from inside transcribe(), after decode and post-processing.
if text == "":
    if vad_dropped_all_segments:
        # §7.4 silent-audio path
        return TranscriptResult(
            text="",
            language_detected="unknown",
            confidence=0.0,
            duration_s=clip_duration,
        )
    else:
        # Decoded to empty but audio was not VAD-silent.
        # Coerce confidence to 0.0 (we cannot trust a confident empty decode)
        # and flag as low-confidence decode so callers can treat it like the
        # silent path without losing the language hint that whisper provided.
        # degraded=True is recorded via the trace sink (§3.8); no exception raised.
        return TranscriptResult(
            text="",
            language_detected=mapped_language,  # placeholder: whisper-reported language, mapped
            confidence=0.0,
            duration_s=clip_duration,
        )
```
The env treats `text == ""` as "no intelligible speech" regardless of which branch produced it. This matches DESIGN.md §4.1's implicit contract: `last_transcript=""` means the agent should `CLARIFY` rather than assume intent.
### 3.6 Language detection & Hinglish handling
- `language_hint="hinglish"` is translated to Whisper's `language="hi"` at the call site. Whisper has no Hinglish token, but Hindi decoding on code-mixed audio produces readable transliteration + English words in Latin script roughly 85% of the time. Noise is expected and documented as Risk 3 in DESIGN.md §14.
- `TranscriptResult.language_detected` reports what Whisper says, not the hint. The one exception: if the hint is `"hinglish"` and Whisper reports `"hi"`, we relabel the result as `"hinglish"`, but only when the decoded text contains ≥ 2 ASCII-letter words intermixed with Devanagari (heuristic; documented in tests).
- If Whisper returns a language code not in our 5-value Literal (e.g., `"ur"` for Urdu, `"mr"` for Marathi), `language_detected="unknown"` is surfaced; `env.py` logs a warning and falls back to `language_hint` for R4 reward attribution.
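
The three rules above can be sketched as two small functions. This is illustrative only: `whisper_language_arg` and `map_detected` are hypothetical names, and the regular expressions are one plausible reading of "ASCII-letter words intermixed with Devanagari", not the tested heuristic itself.

```python
import re
from typing import Optional

KEPT_CODES = {"hi", "ta", "kn", "en"}          # Whisper codes we surface as-is
_ASCII_WORD = re.compile(r"[A-Za-z]+")         # a run of ASCII letters
_DEVANAGARI = re.compile(r"[\u0900-\u097F]")   # any Devanagari code point


def whisper_language_arg(hint: Optional[str]) -> Optional[str]:
    """Map our hint to Whisper's `language` argument; None means autodetect."""
    return "hi" if hint == "hinglish" else hint


def map_detected(hint: Optional[str], whisper_lang: str, text: str) -> str:
    """Map Whisper's reported language onto our codes, per the rules above."""
    if whisper_lang not in KEPT_CODES:
        return "unknown"  # e.g. "ur" or "mr"
    code_mixed = (
        len(_ASCII_WORD.findall(text)) >= 2
        and _DEVANAGARI.search(text) is not None
    )
    if hint == "hinglish" and whisper_lang == "hi" and code_mixed:
        return "hinglish"
    return whisper_lang
```

Under this sketch, a hint of `"hinglish"` with the transcript `"मुझे flight book करनी है"` is reported as `"hinglish"`, while a purely Devanagari transcript stays `"hi"`.
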
### 3.7 Concurrency
- Both engines are CPU-bound Python calls into C extensions (Kokoro via torch, faster-whisper via CTranslate2). They **release the GIL** during inference, so threaded FastAPI workers can process N concurrent transcribes at a small RAM cost. Max concurrency is governed by the env-space session cap (10 concurrent sessions per DESIGN.md §3.3). RAM usage: 10 concurrent transcribes × ~150 MB peak = 1.5 GB — fits the free CPU tier's 16 GB with margin.
- No per-session model state means two sessions can share an engine instance without lock contention beyond what CTranslate2 internally serializes.
### 3.8 Diagnostic tracing hook
Both engines accept an optional `trace_sink: Callable[[AudioTrace], None] | None = None` kwarg in `__init__`. When provided, **every call** to `synthesize()`, `synthesize_to_gradio()`, or `transcribe()` emits exactly one `AudioTrace` record (schema in §2.2a) to the sink **after** the core work completes but **before** the return statement. Emissions are wrapped in `try/except Exception: pass` so a broken sink never crashes the audio path — telemetry must never break production.
**Default.** `trace_sink=None` means no emission, zero overhead.
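
The emission guard can be sketched as a single helper. This is a minimal illustration, not the engines' code: `emit_trace` is a hypothetical name, and the record is typed as `Any` so the example does not depend on `AudioTrace`.

```python
from typing import Any, Callable, Optional


def emit_trace(sink: Optional[Callable[[Any], None]], record: Any) -> None:
    """Deliver one trace record; never let a sink failure reach the caller."""
    if sink is None:
        return  # default: no emission
    try:
        sink(record)
    except Exception:
        pass  # telemetry must never break the audio path
```

Catching `Exception` rather than `BaseException` leaves `KeyboardInterrupt` and `SystemExit` free to propagate, so a broken sink is swallowed without masking process shutdown.
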
**Wiring in `app.py`.** The FastAPI startup hook constructs a module-global ring buffer of the most recent **100** traces (`collections.deque(maxlen=100)`) and passes its `.append` method as the sink to both engines at `get_tts_engine()` / `get_asr_engine()` construction:
```python
from collections import deque

_trace_buffer: deque[AudioTrace] = deque(maxlen=100)
tts = get_tts_engine(trace_sink=_trace_buffer.append)
asr = get_asr_engine(trace_sink=_trace_buffer.append)
```
(Note: `get_tts_engine` / `get_asr_engine` are updated to accept and forward `trace_sink` through to the first-call `__init__`; subsequent calls after the singleton is constructed ignore the kwarg — warn in logs if a different sink is passed after construction.)
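**Endpoint.** `GET /audio/trace` returns `{"traces": [dataclasses.asdict(trace), ...]}` with the most recent 100 records, newest-first. (Dataclasses have no `.asdict()` method, so serialization goes through `dataclasses.asdict`; the buffer is append-ordered, so the handler iterates it in reverse.) No auth (demo-only; the env Space is behind judge tokens anyway per DESIGN.md §3.3). This endpoint is defined in `app.py`, not here.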
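
The response body for this endpoint could be assembled as follows. This is a sketch under assumptions: `Trace` is a reduced stand-in for `AudioTrace` with only two of its fields, and `trace_payload` is a hypothetical helper that an `app.py` handler might call.

```python
import dataclasses
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Trace:
    """Reduced stand-in for AudioTrace (two of its fields)."""

    op: str
    latency_ms: int


def trace_payload(buffer) -> dict:
    """Build the /audio/trace response body, newest record first."""
    snapshot = list(buffer)  # work from a copy of the ring buffer
    return {"traces": [dataclasses.asdict(t) for t in reversed(snapshot)]}


ring = deque(maxlen=2)  # the real buffer uses maxlen=100
for i, op in enumerate(["synthesize", "transcribe", "synthesize"]):
    ring.append(Trace(op=op, latency_ms=100 + i))
print(trace_payload(ring))
```

With `maxlen=2`, the first record has already been dropped by the time the payload is built, which mirrors how the 100-entry buffer discards its oldest traces.
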
**Demo UI.** `demo/app_gradio.py` polls `/audio/trace` every 2 s and overlays a sparkline of `latency_ms` per op and a counter of `degraded=True` events. This is how judges see the trace health live.
**Privacy.** `input_hash` is a blake2b digest — raw text and raw audio bytes never leave the process via the trace. This is a hard invariant.
---
## 4. Data structures
### 4.1 `TranscriptResult`
| Field | Type | Semantic | Constraint | Writer |
|---|---|---|---|---|
| `text` | `str` | Decoded transcript, NFC-normalized Unicode | Non-None; may be empty on silence; no trailing whitespace | `ASREngine.transcribe` |
| `language_detected` | `LanguageCode \| "unknown"` | Whisper-reported language, mapped to our 5 codes or `"unknown"` | One of `{"hi","ta","kn","en","hinglish","unknown"}` | `ASREngine.transcribe` |
| `confidence` | `float` | Duration-weighted exp-normalized mean log-prob | `0.0 ≤ c ≤ 1.0`; `0.0` whenever `text == ""` (both VAD-silent per §7.4 AND decoded-empty-despite-audio per §3.5 — the latter coerces any nonzero whisper-reported confidence to `0.0`); `1.0` only by convention in training when ASR is bypassed entirely (see §3.1) | `ASREngine.transcribe` |
| `duration_s` | `float` | Clip length in seconds | `0.0 ≤ d ≤ max_duration_s`; rounded to 3dp | `ASREngine.transcribe` |
Frozen dataclass; immutable by project convention (CLAUDE.md §4.2).
### 4.2 `VoicePackMapping`
Frozen dataclass. Five instances live in the `VOICE_PACKS` module-level dict — one per `LanguageCode`. Never re-assigned after module load.
### 4.2a `AudioTrace`
Frozen dataclass, defined in `driftcall/audio/trace.py` (schema in §2.2a, emission semantics in §3.8). Fields: `op`, `input_hash`, `language`, `duration_s`, `latency_ms`, `confidence`, `cache_hit`, `degraded`, `ts_ist`. All fields are immutable; `AudioTrace` instances are produced at the tail of each synth/transcribe call and fed to the configured `trace_sink`. Consumed by `app.py`'s `/audio/trace` endpoint and the demo UI live overlay. Never serialized to disk by this module (app-level concern).
### 4.3 Voice pack table (DESIGN.md §9.1)
| `language` | `default` | `allowed` | Notes |
|---|---|---|---|
| `"hi"` | `"hi_female_1"` | `("hi_female_1", "hi_male_1")` | Kokoro Hindi voices. Female default matches most task-brief personas. |
| `"ta"` | `"ta_female_1"` | `("ta_female_1",)` | Only one Tamil pack available at Kokoro-82M size. |
| `"kn"` | `"kn_male_1"` | `("kn_male_1",)` | Only one Kannada pack. |
| `"en"` | `"en_indian_female_1"` | `("en_indian_female_1",)` | Indian-accented English per DESIGN.md §9.1. |
| `"hinglish"` | `"en_indian_female_1"` | `("en_indian_female_1", "hi_female_1")` | Hinglish utterances transliterate English lexis into Devanagari poorly, and Hindi-voice delivery of Latin script is poorer still. `en_indian_female_1` delivers code-mixed ASCII text most naturally; `hi_female_1` is retained as an A/B fallback for utterances that are ≥ 80% Devanagari. **Choice documented here per task brief.** |
Total: **5 language codes mapped, 5 distinct voice packs used across the table.**
#### 4.3.1 Shipped voice packs at pinned version
At the `kokoro>=0.3,<0.4` pin (DESIGN.md §9.1, this doc §6.1), Kokoro-82M's **actually-shipped** voice packs at the time of pinning are: `hi_female_1`, `hi_male_1`, `en_indian_female_1`, and a best-effort set of Indic packs (`ta_female_1`, `kn_male_1`) whose bundling with the HF-distributed weights is **not guaranteed** across minor releases. The Kokoro project ships voice packs as separate `.pt` files inside the model repo; some Indic packs have been reshuffled between `0.3.x` minor versions. This module must behave sanely when an Indic pack is missing from the installed bundle.
**Missing-voice-pack fallback chain (evaluated at `synthesize()` call time, not at warmup, so fallbacks can be per-call telemetry rather than fatal startup errors):**
| Requested pack | If missing from bundle, fall back to | Emitted metadata |
|---|---|---|
| `ta_female_1` | `hi_female_1` | `degraded=True`, `fallback_from="ta_female_1"` in the audio trace (§3.8) |
| `kn_male_1` | `hi_female_1` | `degraded=True`, `fallback_from="kn_male_1"` |
| `hi_male_1` | `hi_female_1` | `degraded=True`, `fallback_from="hi_male_1"` |
| `hi_female_1` | `en_indian_female_1` (last resort for Hindi text) | `degraded=True`, `fallback_from="hi_female_1"` |
| `en_indian_female_1` | — (catastrophic if also missing; see below) | — |
**Warmup policy.** `TTSEngine.warmup()` probes each pack in `VOICE_PACKS` values by attempting a 1-word synthesis. Missing Indic packs (`ta_female_1`, `kn_male_1`, `hi_male_1`) are logged at `WARN` and the fallback chain is activated for subsequent calls — **warmup does not abort the Space**. The ONE condition that DOES abort the Space at warmup is: **both `en_indian_female_1` AND `hi_female_1` missing** — this is catastrophic because there is no voice at all for Hindi or English, which are the ≥ 95% traffic languages. In that case `ModelLoadError("no usable voice pack for hi or en")` is raised and the Space fails to bind its port.
**Downstream visibility.** Whenever a fallback is used, the `degraded=True` flag travels with the response. For TTS, this lives in the `AudioTrace` (§3.8) attached to the ring buffer; for ASR, there is an analogous mechanism in §3.5's empty-string edge case. `env.py` does not yet surface the flag: it would reach `DriftCallObservation` through a `last_audio_degraded: bool` field, which will exist only if the rewards/models doc adds it. Until then the flag is telemetry-only and does not influence reward.
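The fallback table and warmup policy above can be sketched as a small resolver. This is an illustration under assumptions, not the module's implementation: `FALLBACKS`, `ResolvedPack`, `check_bundle`, and `resolve_pack` are hypothetical names, walking the chain transitively is an assumption, and raising when `en_indian_female_1` alone is requested and missing is an assumption too (the table leaves that case open).

```python
from typing import FrozenSet, NamedTuple, Optional

# Mirrors the fallback table: requested pack -> next pack to try.
FALLBACKS = {
    "ta_female_1": "hi_female_1",
    "kn_male_1": "hi_female_1",
    "hi_male_1": "hi_female_1",
    "hi_female_1": "en_indian_female_1",
}


class ModelLoadError(RuntimeError):
    """Stand-in for the module's exception of the same name."""


class ResolvedPack(NamedTuple):  # hypothetical return type
    pack: str
    degraded: bool
    fallback_from: Optional[str]


def check_bundle(installed: FrozenSet[str]) -> None:
    """Warmup-time check: abort only if both hi and en packs are missing."""
    if "en_indian_female_1" not in installed and "hi_female_1" not in installed:
        raise ModelLoadError("no usable voice pack for hi or en")


def resolve_pack(requested: str, installed: FrozenSet[str]) -> ResolvedPack:
    """Call-time resolution: walk the chain until an installed pack is found."""
    current = requested
    seen = set()
    while current not in installed:
        seen.add(current)
        nxt = FALLBACKS.get(current)
        if nxt is None or nxt in seen:
            # ASSUMPTION: nothing left to fall back to is treated as fatal.
            raise ModelLoadError(f"no usable voice pack for {requested!r}")
        current = nxt
    degraded = current != requested
    return ResolvedPack(current, degraded, requested if degraded else None)
```

Resolving per call keeps a missing Indic pack visible in each trace (`degraded`, `fallback_from`) without turning it into a startup failure.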
### 4.4 Audio byte format (WAV contract)
- **TTS output:** RIFF WAV, mono, 16-bit PCM, 16 kHz. Produced via `torchaudio.save(..., format="wav", bits_per_sample=16, sample_rate=16000)` into an in-memory `io.BytesIO`, then `.getvalue()`. Header + data.
- **Resampling call site (canonical).** Kokoro-82M synthesizes at **24 kHz** natively. Resampling to the env's 16 kHz target happens **inside `TTSEngine.synthesize`, BEFORE WAV encoding**, via:
```python
import torchaudio.functional as F
pcm_16k = F.resample(pcm_24k, orig_freq=24000, new_freq=16000, lowpass_filter_width=64)
# ...then torchaudio.save(buf, pcm_16k, sample_rate=16000, bits_per_sample=16, format="wav")
```
`torchaudio.save(..., sample_rate=16000)` is called **after** the resample: it is an encoder, not a resampler. The `sample_rate` kwarg on `save` only writes the RIFF header value; it does not change the tensor's sample rate. Consequence: LRU-cached bytes are always 16 kHz (see §3.4). The `sample_rate_hz` argument is validated at the top of `synthesize()`; only `16000` is supported in the v1 contract, and any other value raises an `UnsupportedLanguageError`-style error. A 24 kHz output path is future work: it was formerly Open Question 9.4 and is now resolved (see §9).
- **ASR input resampling policy.** ASR does **NOT** auto-resample. If `transcribe()` receives audio whose header sample-rate is not 16 kHz (detected via `soundfile.info` before full read, or via the RIFF `nSamplesPerSec` field at bytes 24–27), it raises `AudioDecodeError("input must be 16 kHz mono; caller must pre-resample")`. Rationale: silently resampling at the ASR boundary hides caller bugs and costs 20–40 ms per call; since TTS already produces 16 kHz and the Gradio mic component is configured to deliver 16 kHz, any non-16kHz input indicates a mis-wired caller that must be fixed, not papered over.
- **ASR input:** same format required. The `transcribe` method sniffs magic bytes (`RIFF....WAVE` at offset 0–11) and dispatches to `soundfile.read(BytesIO(audio_bytes))`; raw float32 PCM at 16 kHz is accepted as a second path for the demo mic input pipeline (which delivers float32 by default from Gradio's `type="numpy"` component, configured with `sample_rate=16000` in §6.2 wiring). Other formats (mp3, ogg, flac) are **rejected** with `AudioDecodeError` — we do not ship ffmpeg in the CPU Space image.
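The header checks described in this section can be sketched with the standard library alone. This is a simplified illustration: it assumes the canonical layout in which the `fmt ` chunk directly follows `WAVE`, so the sample rate sits at bytes 24–27; files with extra chunks before `fmt ` would need a real chunk walk or `soundfile.info`. `sniff_wav_sample_rate`, `require_16k`, and `make_silent_wav` are hypothetical helpers.

```python
import io
import struct
import wave


class AudioDecodeError(ValueError):
    """Stand-in for the module's exception of the same name."""


def sniff_wav_sample_rate(audio_bytes: bytes) -> int:
    """Read nSamplesPerSec from a canonical RIFF/WAVE header."""
    if (
        len(audio_bytes) < 28
        or audio_bytes[:4] != b"RIFF"
        or audio_bytes[8:12] != b"WAVE"
    ):
        raise AudioDecodeError("not a RIFF/WAVE payload")
    # ASSUMPTION: 'fmt ' is the first chunk, so the rate is at offset 24.
    (rate,) = struct.unpack_from("<I", audio_bytes, 24)
    return rate


def require_16k(audio_bytes: bytes) -> None:
    """Reject anything that is not 16 kHz instead of resampling silently."""
    if sniff_wav_sample_rate(audio_bytes) != 16000:
        raise AudioDecodeError("input must be 16 kHz mono; caller must pre-resample")


def make_silent_wav(sample_rate: int, duration_s: float = 0.1) -> bytes:
    """Build a mono 16-bit PCM WAV of silence, for exercising the checks."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(sample_rate)
        w.writeframes(b"\x00\x00" * int(sample_rate * duration_s))
    return buf.getvalue()


require_16k(make_silent_wav(16000))  # passes silently
```

Reading four bytes from the header is far cheaper than decoding the payload, which is why the check can run before any full read.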
---
## 5. Error modes
| Situation | Exception | Handled by |
|---|---|---|
| Kokoro-82M weights cannot be pulled from HF Hub (network / rate-limit / disk full) at `get_tts_engine()` first-call | `ModelLoadError` wrapping original `huggingface_hub` / `OSError` | `app.py` startup hook fails fast → HF Space log shows error; server does not bind. Retried on next container boot. |
| faster-whisper-small weights cannot be pulled at `get_asr_engine()` first-call | `ModelLoadError` | Same as above. |
| `synthesize(..., language_code=X)` with `X` not in `VOICE_PACKS` keys | `UnsupportedLanguageError` | `env.py` catches → logs, retries with `language_code="en"` (voice pack `en_indian_female_1`), and sets the R4 penalty flag for language mismatch (enforced by rewards, not here). |
| `synthesize(..., voice_pack=X)` where X not in `VOICE_PACKS[language_code].allowed` | `UnsupportedVoicePackError` | Caller error — 400 at HTTP boundary. |
| `transcribe()` receives bytes with no valid WAV header and no float32-PCM magic | `AudioDecodeError` | `env.py` returns an `UNKNOWN_AUDIO` status in observation; `last_transcript=""`, `last_confidence=0.0`. |
| `transcribe()` low-confidence decode (`confidence < 0.3`) | **Not** an exception. Returned normally. | Caller (`env.py`) sets `DriftCallObservation.last_confidence` honestly; downstream the agent may `CLARIFY` to re-prompt. R4 does not penalize low ASR confidence — it is a natural observation feature. |
| `transcribe()` returns `text=""` with whisper-reported `confidence > 0` (decoded-empty-despite-audio, not VAD-silent — see §3.5) | **Not** an exception. `confidence` is coerced to `0.0`, `degraded=True` in trace, result returned with whisper-reported `language_detected`. | Env treats identically to the silent case: "no intelligible speech"; agent should `CLARIFY`. |
| Audio duration > `max_duration_s` (default 30 s) | None by default: the audio is silently truncated. `AudioTooLongError` is raised only if the caller passes `strict=True` (not in the default signature). | Documented in §7.3. `env.py` always uses the default (silent truncation). |
| TTS OOM mid-synthesis on a pathologically long string (> 4 KB of text) | `TTSOutOfMemoryError` wrapping the originating `MemoryError` or `RuntimeError` (CPU-only deployment per §1, §3.1, §6.1 — CUDA OOM cannot occur; torch on CPU raises `RuntimeError` or Python's built-in `MemoryError` on large tensor allocation failure) | `env.py` catches → agent's `SPEAK` is dropped with a warning; the turn still counts. R4 penalty for format non-compliance does not apply (env-side failure, not agent fault). |
| Indic voice pack (`ta_female_1`, `kn_male_1`, `hi_male_1`) missing from Kokoro bundle | **No exception** — fallback chain per §4.3.1 activated. `degraded=True` attached to trace. | Warmup logs WARN. Startup continues. |
| BOTH `en_indian_female_1` AND `hi_female_1` missing from Kokoro bundle (catastrophic — no Hindi/English voice) | `ModelLoadError("no usable voice pack for hi or en")` | `app.py` warmup catches and aborts startup — the Space will not bind its port. Operator must re-pull weights or downgrade `kokoro` pin. |
| `voice_pack` argument not in `VOICE_PACKS[language_code].allowed` | `UnsupportedVoicePackError` (caller bug, distinct from bundle missing) | Caller error — 400 at HTTP boundary. |
| `language_hint=None` with silent/empty audio | Returns `TranscriptResult(text="", language_detected="unknown", confidence=0.0, duration_s=<duration>)`. No exception. | Normal flow. |
| Concurrent `warmup()` calls from two threads | Second call is a no-op (singleton guard); first blocks until ready. | Tested. |
**Partial-result policy:** ASR never returns a partial `TranscriptResult`. Either the decode completes (even if `text=""`) or an `AudioError` subclass propagates. No `None` fields.
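The decoded-empty rule and the no-partial-result policy can be sketched as a single normalization step. The field names of `TranscriptResult` come from this document; `finalize` is a hypothetical helper, and returning the `degraded` flag alongside the result (rather than writing it to a trace) is a simplification.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class TranscriptResult:
    text: str
    language_detected: str
    confidence: float
    duration_s: float


def finalize(raw: TranscriptResult) -> Tuple[TranscriptResult, bool]:
    """Return (result, degraded); every field is always populated."""
    if raw.text == "" and raw.confidence > 0.0:
        # Decoded-empty-despite-audio: coerce confidence, keep the language.
        return replace(raw, confidence=0.0), True
    return raw, False


result, degraded = finalize(
    TranscriptResult(text="", language_detected="hi", confidence=0.42, duration_s=2.0)
)
```

Because the dataclass is frozen and has no optional fields, a caller can never observe a half-built result; it gets either a complete object or an exception.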
---
## 6. Dependencies
### 6.1 Upstream (what this module imports)
| Dependency | Version pin | License | Why |
|---|---|---|---|
| `kokoro` (Kokoro-82M official SDK wrapping the HF model `hexgrad/Kokoro-82M`) | `>=0.3, <0.4` | Apache 2.0 | TTS synthesis. Pure-CPU path. |
| `faster-whisper` | `>=1.0, <2.0` | MIT | ASR via CTranslate2 int8 runtime. |
| `ctranslate2` | (transitive of faster-whisper) | MIT | CTranslate2 runtime, CPU-only wheel. |
| `torchaudio` | `>=2.1, <3.0` | BSD-2 | 24 kHz → 16 kHz resampling and WAV encoding from raw Kokoro PCM tensors. Pulled in by Kokoro anyway. |
| `soundfile` | `>=0.12` | BSD-3 | WAV decoding for ASR input; works without ffmpeg. |
| `cachetools` | `>=5.3` | MIT | `LRUCache` for TTS bytes. |
| Python stdlib | — | — | `math`, `io`, `hashlib`, `threading`, `dataclasses`, `enum`, `typing`. |
Not depended on: `ffmpeg-python`, `librosa`, `pydub`, `gradio` (demo-only), `fastapi` (app-only).
### 6.2 Downstream (who imports `driftcall/audio/`)
| Consumer | Imports |
|---|---|
| `app.py` (FastAPI env entrypoint) | `get_tts_engine`, `get_asr_engine`, `TranscriptResult`, all exceptions. Uses `TTSEngine.synthesize` (WAV bytes path) for HTTP responses via `Response(content=..., media_type="audio/wav")` or base64-embedded inside `/step`. Called at startup hook + on every `/step` that carries audio. |
| `demo/app_gradio.py` | `get_tts_engine`, `get_asr_engine`, `TranscriptResult`, all exceptions. Uses `TTSEngine.synthesize_to_gradio` (`tuple[int, np.ndarray]`) as the direct return value for `gr.Audio(type="numpy")` output components. Mic component feeds `transcribe()` via `audio_bytes` obtained from Gradio's float32 PCM at 16 kHz (configured on the component, not converted at the audio-module layer). Never calls `synthesize` (bytes) — that is only for FastAPI. |
| `driftcall/env.py` | **Only when `audio_boundary_enabled=True`.** Calls `get_tts_engine()` / `get_asr_engine()` lazily inside `_maybe_synthesize()` / `_maybe_transcribe()` helpers. Never imports at module top. |
| `tests/test_audio.py` | All public symbols for unit tests. |
| `tests/test_e2e.py` | `TranscriptResult` for constructing deploy-mode integration fixtures. |
### 6.3 Explicit non-consumers (load-bearing)
- `training/train_grpo.py` — **MUST NOT** import any symbol from `driftcall.audio`. Enforced by a linter rule in `pyproject.toml`:
```toml
[tool.ruff.lint.flake8-tidy-imports.banned-api]
"driftcall.audio".msg = "Training loop is text-only (DESIGN.md §9.4). Do not import audio in training/."
```
Ruff reports violations of this rule as `TID251`. The rule is scoped to `training/**/*.py` by a `per-file-ignores` override that ignores `TID251` for every path outside `training/`.
- `training/eval_baseline.py`, `training/eval_final.py` — same rule. Eval runs on text transcripts; if live-audio eval is needed later, it becomes a separate `eval_audio.py` script.
- `rewards.py` — does not import audio. Rewards read `DriftCallObservation.last_confidence` (a float) and `last_lang` (a string) which the env boundary has already set. Rewards do not re-transcribe.
### 6.4 Model assets
Licenses below cover **model weights** (distinct from the Python package licenses in §6.1).
| Model repo | Params / size | License (weights) | Notes |
|---|---|---|---|
| `hexgrad/Kokoro-82M` | 82M params, ~330 MB fp32 (~160 MB int8, unused) | Apache 2.0 | Kokoro fp32 is fast enough on CPU; int8 path not exercised. |
| `Systran/faster-whisper-small` | ~244M params, ~470 MB fp32 / ~120 MB int8 | MIT | We use int8 on CPU. See §1.1 for WER trade-off vs `medium` / `large-v3` and the migration path. |
Total cache-on-disk footprint: ~450 MB. Dockerfile pre-pulls both into `/root/.cache/huggingface/`; image size budget per DESIGN.md Risk 10: < 2 GB total. Audio weights take ~25% of that. If §1.1's migration is triggered and we swap to `Systran/faster-whisper-medium` (~700 MB int8), total weights rise to ~1 GB and the image size budget still holds.
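The footprint figures above are plain addition, shown here as a check. The sizes are the ones quoted in this section; reading "< 2 GB" as 2000 MB is an assumption.

```python
KOKORO_FP32_MB = 330
WHISPER_SMALL_INT8_MB = 120
WHISPER_MEDIUM_INT8_MB = 700
IMAGE_BUDGET_MB = 2000  # ASSUMPTION: "< 2 GB" read as 2000 MB

# Current bundle: Kokoro fp32 + faster-whisper-small int8.
current_total_mb = KOKORO_FP32_MB + WHISPER_SMALL_INT8_MB
current_share = current_total_mb / IMAGE_BUDGET_MB

# After the §1.1 migration: Kokoro fp32 + faster-whisper-medium int8.
migrated_total_mb = KOKORO_FP32_MB + WHISPER_MEDIUM_INT8_MB

print(f"current: {current_total_mb} MB ({current_share:.1%} of budget)")
print(f"after migration: {migrated_total_mb} MB")
```

Under these figures the current share is 22.5% of the budget, which the text rounds to about a quarter, and the post-migration total of 1030 MB still fits.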
---
## 7. Edge cases
Ten cases that the test plan (`docs/tests/audio_tests.md`) must cover. Each case is the minimum test that would catch regressions.
### 7.1 Hinglish code-mix Whisper noise
`transcribe(wav_of("Bhai Friday ko Bangalore jaana hai"), language_hint="hinglish")` — Whisper-Hindi decoding on code-mixed audio returns mixed Devanagari+Latin output. Test asserts: (a) `text` is non-empty, (b) `confidence` is finite in [0, 1], (c) `language_detected` is one of `{"hi", "hinglish"}`. Text-equality is NOT asserted (Risk 3, semantic match downstream). If this test becomes flaky on a new faster-whisper release, we pin the version tighter — do not loosen the assertions.
### 7.2 Kannada voice pack quality
`synthesize("Namaskara, saha haridu", language_code="kn")` — Kokoro's Kannada pack is known to produce occasional glitches on loanwords. Test asserts: (a) returns non-empty WAV bytes, (b) the WAV parses with `soundfile.read` and has `>= 1.5 s` duration for this phrase, (c) duration is within 30% of expected (2.0 s). Audio-quality assertions beyond this are out of scope — DESIGN.md Risk 8 accepts "pre-generate demo audio with careful voice-pack selection" as mitigation.
### 7.3 Long utterance truncation
`transcribe()` receives a 45-second WAV when `max_duration_s=30.0`. Default path: silent truncation; `result.duration_s == 30.0`; `text` contains only the first 30 s of content. Test: feed a synthesized 45-s clip of counted numbers 1–45, assert the decoded text does NOT contain "40" or "45". No exception raised.
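A sketch of the truncation rule on WAV bytes, using only the standard-library `wave` module. `truncate_wav` is a hypothetical helper and the `strict` flag follows the error table's description; the real engine may truncate decoded samples rather than re-encode a WAV.

```python
import io
import wave
from typing import Tuple


class AudioTooLongError(ValueError):
    """Stand-in for the module's exception of the same name."""


def truncate_wav(
    wav_bytes: bytes, max_duration_s: float = 30.0, strict: bool = False
) -> Tuple[bytes, float]:
    """Return (wav_bytes, duration_s) clipped to max_duration_s."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        total_frames = src.getnframes()
        max_frames = int(max_duration_s * rate)
        if total_frames > max_frames and strict:
            raise AudioTooLongError(
                f"{total_frames / rate:.1f} s exceeds {max_duration_s} s"
            )
        kept = min(total_frames, max_frames)
        frames = src.readframes(kept)

    out = io.BytesIO()
    with wave.open(out, "wb") as dst:
        dst.setparams(params)    # header frame count is patched on close
        dst.writeframes(frames)
    return out.getvalue(), kept / rate
```

With the default `strict=False`, a 45-second clip comes back as exactly 30.0 seconds and no exception is raised, matching the default path described above.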
### 7.4 Silent audio
`transcribe(wav_of_silence(duration_s=3.0), language_hint="hi")` — VAD filter drops all segments. `TranscriptResult(text="", language_detected="unknown", confidence=0.0, duration_s=3.0)` is returned. Explicitly NO exception. `env.py` interprets `text==""` as "user did not speak" and the agent observation reflects that.
### 7.5 Wrong-language hint
`transcribe(wav_of("The flight leaves at six"), language_hint="ta")` — Whisper is forced into Tamil decoding on English audio. Result typically garbled. Test asserts: (a) no exception, (b) `language_detected` may disagree with hint, (c) `confidence` is likely low (< 0.5 expected, not strictly asserted to avoid flakes). `env.py` logs a WARN but does not retry with autodetect — retry is the agent's job (via `CLARIFY`).
### 7.6 Concurrent sessions sharing engine
Spawn 5 threads each calling `transcribe()` on distinct 2-second clips simultaneously. Assert: (a) all 5 return `TranscriptResult`, (b) concurrent wall-clock is less than the total wall-clock of the same 5 calls run sequentially (thanks to GIL release in CTranslate2), (c) no exceptions. Same test for TTS, but parallelism benefit is smaller (torch on CPU serializes heavily).
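The shape of this test can be sketched without loading a model. `fake_transcribe` is a hypothetical stand-in that sleeps, which releases the GIL the way the text says CTranslate2 does while decoding; the timings are illustrative, not measurements of the real engine.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple


def fake_transcribe(clip_id: int, latency_s: float = 0.05) -> str:
    """Stand-in for transcribe(): blocks without holding the GIL."""
    time.sleep(latency_s)
    return f"clip-{clip_id}"


def timed(fn: Callable[[], List[str]]) -> Tuple[List[str], float]:
    """Run fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    out = fn()
    return out, time.perf_counter() - start


def run_sequential(n: int) -> List[str]:
    return [fake_transcribe(i) for i in range(n)]


def run_parallel(n: int) -> List[str]:
    # One worker per clip, mirroring "5 threads on 5 distinct clips".
    with ThreadPoolExecutor(max_workers=n) as pool:
        return list(pool.map(fake_transcribe, range(n)))


seq_out, seq_s = timed(lambda: run_sequential(5))
par_out, par_s = timed(lambda: run_parallel(5))
```

Comparing against the measured sequential total, rather than a fixed number of milliseconds, keeps the assertion meaningful on slow CI runners.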
### 7.7 TTS LRU hit
Call `synthesize(text="नमस्ते", language_code="hi", seed=0)` twice back-to-back. First call p50 ≈ 250 ms, second call p50 < 5 ms (LRU hit). Assert second call returns byte-identical WAV and is ≥ 10× faster. This guards against accidental cache-key drift.
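A sketch of the cache behavior this test guards. The module is stated to use `cachetools.LRUCache`; the `OrderedDict`-based `TinyLRU` below is a standard-library stand-in, and the exact fields in `cache_key` are an assumption about what the real key contains.

```python
import hashlib
from collections import OrderedDict
from typing import Optional


def cache_key(
    text: str,
    language_code: str,
    voice_pack: str,
    seed: int,
    sample_rate_hz: int = 16000,
) -> str:
    """Hash every input that can change the audio (ASSUMED field set)."""
    parts = [text, language_code, voice_pack, str(seed), str(sample_rate_hz)]
    # The unit-separator keeps ("ab", "c") distinct from ("a", "bc").
    return hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()


class TinyLRU:
    """Minimal LRU over bytes values; stand-in for cachetools.LRUCache."""

    def __init__(self, maxsize: int) -> None:
        self._maxsize = maxsize
        self._data: "OrderedDict[str, bytes]" = OrderedDict()

    def get(self, key: str) -> Optional[bytes]:
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key: str, value: bytes) -> None:
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self._maxsize:
            self._data.popitem(last=False)  # evict least recently used
```

If a field that affects the waveform is left out of the key, two different requests collide and the byte-identity assertion fails, which is the "cache-key drift" this case is meant to catch.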
### 7.8 TTS seed determinism
`synthesize(text="कल मिलते हैं", language_code="hi", seed=7)` called from two separate fresh processes (subprocess fixture) produces byte-identical WAV. Guards against RNG leak from outer training/eval code. Uses `fork_rng` internally; the test validates this by snapshotting the global RNG state (`torch.get_rng_state()` and `random.getstate()`) before and after the call and asserting that it is unchanged.
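The isolation idea can be illustrated with the standard-library `random` module. `torch.random.fork_rng` is the real context manager for torch generators; the `isolated_rng` helper below is a hypothetical analogue for Python's global RNG, and `fake_synthesize` is a stand-in that only draws random numbers.

```python
import random
from contextlib import contextmanager
from typing import Iterator, List


@contextmanager
def isolated_rng(seed: int) -> Iterator[None]:
    """Seed the global RNG for the block, then restore the caller's state."""
    saved = random.getstate()
    random.seed(seed)
    try:
        yield
    finally:
        random.setstate(saved)


def fake_synthesize(seed: int, n: int = 4) -> List[float]:
    """Stand-in for synthesize(): output depends only on the seed."""
    with isolated_rng(seed):
        return [random.random() for _ in range(n)]


random.seed(123)                 # the caller's own RNG usage
before = random.getstate()
first = fake_synthesize(seed=7)
after = random.getstate()
```

Comparing `before` and `after` states checks the "undisturbed" property directly, while comparing two calls with the same seed checks determinism.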
### 7.9 Training-loop import firewall
Import `training.train_grpo` in a subprocess. After import, assert `"driftcall.audio.tts_kokoro" not in sys.modules` and `"driftcall.audio.asr_whisper" not in sys.modules`. This guards DESIGN.md §9.4 at the structural level. The ruff banned-api rule should fire in CI; this test belts-and-braces it.
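A sketch of the subprocess check, written generically so it runs against standard-library modules. `forbidden_modules_loaded` is a hypothetical helper; in the real test the imported module would be `training.train_grpo` and the forbidden names the two `driftcall.audio` submodules.

```python
import json
import subprocess
import sys
from typing import List, Sequence


def forbidden_modules_loaded(
    module_to_import: str, forbidden: Sequence[str]
) -> List[str]:
    """Import a module in a fresh interpreter and report forbidden imports."""
    probe = (
        "import importlib, json, sys\n"
        f"importlib.import_module({module_to_import!r})\n"
        f"hits = [m for m in {list(forbidden)!r} if m in sys.modules]\n"
        "print(json.dumps(hits))\n"
    )
    proc = subprocess.run(
        [sys.executable, "-c", probe],
        check=True,
        capture_output=True,
        text=True,
    )
    # Take the last line in case the imported module prints on import.
    return json.loads(proc.stdout.strip().splitlines()[-1])


# Illustrative run against stdlib modules instead of the training package.
clean = forbidden_modules_loaded("json", ["colorsys"])
dirty = forbidden_modules_loaded("colorsys", ["colorsys"])
```

A fresh interpreter matters here: in the test process itself, other tests may already have imported the audio modules, so `sys.modules` would give a false positive.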
### 7.10 Model-load failure at startup
Monkeypatch `kokoro.KPipeline` to raise `OSError("no network")`. Call `get_tts_engine()`. Assert `ModelLoadError` is raised with the original `OSError` in `__cause__`. A second call re-attempts the load (the singleton state did NOT cache the failure); this is intentional, so that a transient HF Hub outage does not permanently break the process. Repeat the test for `get_asr_engine()` with a mocked ASR loader.
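The "do not cache the failure" behavior can be sketched as a lock-guarded lazy singleton. `LazySingleton` and `flaky_loader` are hypothetical; the real `get_tts_engine()` may be structured differently, and only the `ModelLoadError`-wraps-`OSError` contract comes from this document.

```python
import threading
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class ModelLoadError(RuntimeError):
    """Stand-in for the module's exception of the same name."""


class LazySingleton(Generic[T]):
    """Load once on first use; a failed load leaves the slot empty."""

    def __init__(self, loader: Callable[[], T]) -> None:
        self._loader = loader
        self._lock = threading.Lock()
        self._instance: Optional[T] = None

    def get(self) -> T:
        with self._lock:  # concurrent callers wait for the first load
            if self._instance is None:
                try:
                    self._instance = self._loader()
                except OSError as exc:
                    # Nothing is stored, so the next call retries the load.
                    raise ModelLoadError("model load failed") from exc
            return self._instance


calls = {"n": 0}


def flaky_loader() -> object:
    """Fails on the first call (simulated outage), succeeds afterwards."""
    calls["n"] += 1
    if calls["n"] == 1:
        raise OSError("no network")
    return object()


engine_slot = LazySingleton(flaky_loader)
```

Using `raise ... from exc` is what places the original `OSError` in `__cause__`, which is exactly what the test asserts.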
---
## 8. Examples
### 8.1 Hindi TTS round-trip (deployed env sim-caller path)
```python
from __future__ import annotations

import io

import soundfile

from driftcall.audio.tts_kokoro import get_tts_engine

tts = get_tts_engine()
wav_bytes = tts.synthesize(
    text="नमस्ते, कल दिल्ली की फ्लाइट बुक करनी है, सात हज़ार के अंदर।",
    language_code="hi",
    voice_pack="hi_female_1",
    seed=0,
)

# Assertions typical of the test and the sim-caller:
assert isinstance(wav_bytes, bytes)
assert wav_bytes[:4] == b"RIFF"
assert wav_bytes[8:12] == b"WAVE"

# Write to disk for debugging:
# pathlib.Path("goal_hi.wav").write_bytes(wav_bytes)

# File size for a ~4 s clip at 16 kHz 16-bit mono ≈ 128 KB.
# Bounds match the 3-6 s duration window below (about 96-192 KB).
assert 90_000 < len(wav_bytes) < 200_000

# Duration can be extracted cheaply via soundfile:
info = soundfile.info(io.BytesIO(wav_bytes))
assert 3.0 < info.duration < 6.0
assert info.samplerate == 16_000
assert info.channels == 1
```
### 8.2 Hinglish ASR (env boundary transcribing user audio)
```python
from __future__ import annotations

from pathlib import Path

from driftcall.audio.asr_whisper import get_asr_engine, TranscriptResult

asr = get_asr_engine()
wav_bytes = Path("user_hinglish_bangalore.wav").read_bytes()
result: TranscriptResult = asr.transcribe(
    audio_bytes=wav_bytes,
    language_hint="hinglish",
    beam_size=1,
    vad_filter=True,
)

# Expected shape:
assert isinstance(result, TranscriptResult)
assert "bangalore" in result.text.lower() or "बैंगलोर" in result.text
assert result.language_detected in {"hi", "hinglish"}
assert 0.0 <= result.confidence <= 1.0
assert result.duration_s > 0.0

# Embedding into env observation (happens in env.py, not here):
# obs = replace(obs,
#     last_transcript=result.text,
#     last_lang=result.language_detected if result.language_detected != "unknown" else goal.language,
#     last_confidence=result.confidence,
# )
```
### 8.3 Kannada round-trip TTS → ASR (demo Space self-test)
```python
from __future__ import annotations
from driftcall.audio.tts_kokoro import get_tts_engine
from driftcall.audio.asr_whisper import get_asr_engine
tts = get_tts_engine()
asr = get_asr_engine()
original_text = "Kempegowda airport ge taxi beku"
wav = tts.synthesize(text=original_text, language_code="kn", seed=42)
result = asr.transcribe(audio_bytes=wav, language_hint="kn")
# Round-trip fidelity (semantic, not exact — Kannada ASR has noise):
assert result.text != ""
assert result.language_detected in {"kn", "unknown"}
# Soft assertion: at least one keyword survives the round-trip.
assert any(tok in result.text.lower() for tok in ("kempegowda", "airport", "taxi"))
# Confidence floor for demo playback:
assert result.confidence > 0.3, f"Kannada round-trip confidence too low: {result.confidence}"
```
Additional integration-level flow (not a unit test, for orientation):
```
┌─ sim-caller ─┐    TTS    ┌── env boundary ──┐    ASR    ┌─ env core ──┐
│ goal text    │──────────▶│ 16 kHz WAV bytes │──────────▶│ observation │
│ (GoalSpec    │           │ (bytes over HTTP │           │ (text,lang, │
│  .seed_utt)  │           │  in /step body)  │           │  conf)      │
└──────────────┘           └──────────────────┘           └─────────────┘
```
---
## 9. Open questions
1. **VAD filter confidence on Hinglish code-mix.** `vad_filter=True` uses silero-VAD trained primarily on European languages. Early smoke tests suggest it sometimes clips Hindi sonorants (the lateral "ल" and the nasal "न"). If this materially hurts R1 on Hinglish episodes during Phase C baseline eval, we may flip to `vad_filter=False` at the cost of ~10% slower decoding. Escalate to orchestrator after baseline runs in Batch C2.
2. **Kokoro voice pack A/B for Hinglish.** §4.3 documents `en_indian_female_1` as default and `hi_female_1` as fallback. We have no empirical data yet on which produces better judge perception in the demo. Decision deferred to demo rehearsal in Batch C5 — Person D to record both variants and pick by ear.
3. **Should the env return raw WAV bytes to the agent, or just the transcript?** Current design: transcript only (via `DriftCallObservation.last_transcript`). An argument for also returning WAV: the agent could self-re-transcribe with a different model. Counter: we want to lock the ASR-as-oracle contract for reward reproducibility. **Recommendation:** keep transcript-only. If overturned in review, `DriftCallObservation` gets a new optional `last_audio_b64: str | None` field and this doc + `models.md` both update.
4. ~~**Sample rate upgrade path.** 16 kHz is the minimum for Whisper-small; 24 kHz would sound better for TTS playback in the demo. Kokoro natively produces 24 kHz; we currently resample down. If Space CPU budget permits, we may expose 24 kHz for TTS output while ASR continues at 16 kHz — this costs 50% more bandwidth over HTTP. Deferred; do not implement until demo-polish sprint.~~ **RESOLVED (see §4.4).** v1 contract pins TTS output to 16 kHz and resamples inside `synthesize()` before WAV encoding via `torchaudio.functional.resample(tensor, orig_freq=24000, new_freq=16000, lowpass_filter_width=64)`. ASR never auto-resamples; non-16 kHz input raises `AudioDecodeError`. 24 kHz playback path is out of scope for hackathon ship and will not be added without a DESIGN.md §9 update.