audio.md — DriftCall Audio Pipeline (Kokoro-82M TTS + faster-whisper-small ASR)

Owner: Person C (Training & Data), secondary: Person A (integration glue)
Implements: DESIGN.md §9 (Audio Pipeline, 9.1–9.4), §3.3 (Deployed Env Topology), §3.4 (Demo Topology)
Status: DRAFT, pending ≥ 2 fresh critic rounds

1. Purpose

driftcall/audio/ houses the two model wrappers that convert between text and speech at the env boundary: tts_kokoro.py (text → 16 kHz mono WAV bytes) and asr_whisper.py (WAV/PCM bytes → transcript + detected language + confidence). They exist so the deployed env and demo Space can honestly claim "voice-first" while the training loop stays text-in/text-out for throughput.

This module is the single place where audio-model state lives. Both engines are heavy (~82M and ~244M params respectively) and slow to initialize on CPU, so each exposes a module-level singleton constructed lazily on first call and reused across all sessions in the process. The FastAPI env (app.py) calls the factory once at startup; the Gradio demo (demo/app_gradio.py) does the same. The training loop (training/train_grpo.py) never imports these modules — not even the factory — because import kokoro / import faster_whisper pulls in torchaudio and a 50 MB tokenizer per process, and we do not want that weight in the GRPO rollout worker.

The guiding constraints from DESIGN.md §9:

CPU-only. Both models must run on the free-tier HF Space (basic CPU). No cuda fall-through, no torch.compile, no GPU-dependent kernels. Kokoro-82M synthesizes 3–11× faster than real time on CPU; faster-whisper-small (int8) decodes at ~1× real time. Each fits in <1.2 GB RAM.
Deterministic where possible. TTS takes a seed: int = 0 argument forwarded to torch's generator so synthesized clips are byte-reproducible given the same (text, voice, seed) triple. ASR uses beam_size=1 (greedy) for reproducibility; with vad_filter=True, outputs are stable across runs on the same input.
Latency budgets. TTS < 500 ms for a 1-sentence utterance. ASR ≈ 1× real-time (a 4-second clip decodes in ≈ 4 seconds on CPU basic). Env /step endpoint budgets 2 seconds total per turn — the audio path must not dominate.
Indic support. Hindi, Tamil, Kannada, English, and Hinglish (code-mixed). Voice-pack selection per language is defined in §4.3; ASR language hint is passed per-episode from GoalSpec.language.

The module is never called during training rollouts: DESIGN.md §9.4 is emphatic about this, and §3.1 ("Training-vs-deploy split") documents the runtime split.

1.1 Whisper size trade-off + migration path

faster-whisper-small (~244M params, ~120 MB int8) was chosen to hit the ~1× real-time decode budget on the free-tier CPU Space. This comes at a cost: small has a measurably worse Word Error Rate on Hindi / Tamil / Kannada than large-v3; published faster-whisper benchmarks show roughly a 5–10 percentage point WER gap on Indic audio depending on noise and code-mix. large-v3 is not a free-tier option: ~3 GB weights on disk, >3 GB resident RAM during decode, and decode times of ≥ 3× the clip duration on CPU basic. It would bust both memory (16 GB tier shared across app, sim-caller, TTS, observation builder) and the §3 latency budget.

Migration path (explicit, not aspirational):

If Batch C2 baseline R1 on Hindi episodes is < 0.4, bump to Systran/faster-whisper-medium (~700 MB int8). This is a one-line model_id= change; all behaviour in this doc still holds. Move the env Space only (not demo) to HF CPU Pro (+$5/mo, fits in the ≤ $30/mo deployment budget per DESIGN.md §13).
If medium is still insufficient on Hindi/Tamil/Kannada (R1 < 0.4 after Stage-1 training), escalate to large-v3 on the demo Space only (ZeroGPU), keeping the env Space on small/medium on CPU. This means the demo plays the more impressive transcript while the env used for reward grading stays on the deterministic CPU config — an acceptable asymmetry because demo ASR is never used for reward attribution (see §6.3 — rewards do not re-transcribe).

The chosen default for hackathon ship is small + int8 on CPU. Any escalation above requires orchestrator approval and a DESIGN.md §9.2 update.

2. Interface

Every declaration below is the exact target signature. env.py / app.py / demo/app_gradio.py depend on these signatures; no addition or rename is allowed without a DESIGN.md update first.

2.1 `driftcall/audio/tts_kokoro.py`

from __future__ import annotations

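from dataclasses import dataclass
from typing import TYPE_CHECKING, Callable, Literal

if TYPE_CHECKING:
    # Needed only by the quoted annotations below; skipped at runtime.
    import numpy as np

    from driftcall.audio.trace import AudioTrace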

LanguageCode = Literal["hi", "ta", "kn", "en", "hinglish"]
VoicePack = Literal[
    "hi_female_1",
    "hi_male_1",
    "ta_female_1",
    "kn_male_1",
    "en_indian_female_1",
]


@dataclass(frozen=True)
class VoicePackMapping:
    """Per-language default + allowed voice packs for Kokoro.

    DESIGN.md §9.1 lists the five packs. The mapping is frozen at module
    load and never mutated.
    """

    language: LanguageCode
    default: VoicePack
    allowed: tuple[VoicePack, ...]


# Module-level constant. Frozen at import time; see §4.3 for the authoritative
# per-row rationale. The literal below IS the full contents — five entries, one
# per LanguageCode. No runtime mutation.
VOICE_PACKS: dict[LanguageCode, VoicePackMapping] = {
    "hi": VoicePackMapping(
        language="hi",
        default="hi_female_1",
        allowed=("hi_female_1", "hi_male_1"),
    ),
    "ta": VoicePackMapping(
        language="ta",
        default="ta_female_1",
        allowed=("ta_female_1",),
    ),
    "kn": VoicePackMapping(
        language="kn",
        default="kn_male_1",
        allowed=("kn_male_1",),
    ),
    "en": VoicePackMapping(
        language="en",
        default="en_indian_female_1",
        allowed=("en_indian_female_1",),
    ),
    "hinglish": VoicePackMapping(
        language="hinglish",
        default="en_indian_female_1",
        allowed=("en_indian_female_1", "hi_female_1"),
    ),
}


class TTSEngine:
    """Kokoro-82M wrapper. One instance per process.

    Constructed via `get_tts_engine()`; do NOT instantiate directly in
    consumer code — the singleton guarantees the model is loaded once.
    """

    def __init__(
        self,
        *,
        model_id: str = "hexgrad/Kokoro-82M",
        trace_sink: "Callable[[AudioTrace], None] | None" = None,
    ) -> None: ...

    def synthesize(
        self,
        text: str,
        language_code: LanguageCode,
        voice_pack: VoicePack | None = None,
        *,
        seed: int = 0,
        sample_rate_hz: int = 16000,
    ) -> bytes:
        """Return 16-bit PCM mono WAV bytes.

        - `voice_pack=None` → use `VOICE_PACKS[language_code].default`.
        - `voice_pack` outside `VOICE_PACKS[language_code].allowed` → `UnsupportedVoicePackError`.
        - Deterministic given (text, voice_pack, seed, sample_rate_hz).
        - Cached in LRU (see §3.4).
        - Returns the full WAV (RIFF header + PCM), ready to write to disk
          or send as `Response(content=..., media_type="audio/wav")`.
        """

    def synthesize_to_gradio(
        self,
        text: str,
        language_hint: LanguageCode,
        voice_pack: VoicePack | None = None,
        *,
        seed: int = 0,
    ) -> tuple[int, "np.ndarray"]:
        """Gradio-friendly sibling of `synthesize`.

        Returns `(sample_rate, float32 np.ndarray)` with shape `(n_samples,)`
        (mono). This matches Gradio's `gr.Audio(type="numpy")` expected output.
        Internally calls the same Kokoro path as `synthesize()`, skipping the
        WAV encoding step and returning the float32 tensor-as-numpy directly.
        The LRU cache from §3.4 is NOT shared — Gradio-path outputs are
        cached separately under a key that includes a `fmt="numpy"` discriminator,
        so byte-cache and numpy-cache never collide.

        - `voice_pack=None` → use `VOICE_PACKS[language_hint].default`.
        - Sample rate is fixed at 16000 to match the `synthesize()` contract.
        - Deterministic given (text, voice_pack, seed).
        """

    def warmup(self) -> None:
        """Run one synthesize() with a canonical string to force model load.
        Called by `app.py` startup hook so the first real request is fast.
        """


def get_tts_engine() -> TTSEngine:
    """Return the process-wide TTSEngine singleton (lazy-constructed)."""

Which caller uses which helper (binding contract):

| Caller | Helper | Return type | Framing |
| --- | --- | --- | --- |
| FastAPI `/synthesize` endpoint in `app.py` | `TTSEngine.synthesize` | `bytes` (RIFF WAV) | `Response(content=wav_bytes, media_type="audio/wav")` |
| FastAPI `/step` audio field in `app.py` | `TTSEngine.synthesize` | `bytes` | Embedded as base64 inside the JSON step response. |
| Gradio demo in `demo/app_gradio.py` | `TTSEngine.synthesize_to_gradio` | `tuple[int, np.ndarray]` | Direct return to `gr.Audio(type="numpy")` output component. |
| Tests | Either, per-case | n/a | WAV-bytes tests use `synthesize`; spectral / numpy-domain tests use `synthesize_to_gradio`. |

Rationale for two helpers rather than synthesize + a numpy-wrapper: re-decoding WAV bytes back into float32 numpy inside the Gradio path wastes ~3 ms and doubles the memory briefly (encoded bytes + re-decoded tensor). Keeping a numpy-native return avoids that round-trip for the demo-critical path.
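The `voice_pack` resolution rules stated in the `synthesize` docstring can be sketched as a small pure function. This is a minimal sketch under stated assumptions, not the module's implementation: `resolve_voice_pack` is a hypothetical helper name, the table is trimmed to two languages, and the two exception classes are local stand-ins for the ones listed in §2.3.

```python
from __future__ import annotations

# Trimmed stand-in for VOICE_PACKS: language -> (default, allowed).
VOICE_PACKS = {
    "hi": ("hi_female_1", ("hi_female_1", "hi_male_1")),
    "hinglish": ("en_indian_female_1", ("en_indian_female_1", "hi_female_1")),
}


class UnsupportedLanguageError(Exception):
    """Local stand-in for driftcall.audio.errors.UnsupportedLanguageError."""


class UnsupportedVoicePackError(Exception):
    """Local stand-in for driftcall.audio.errors.UnsupportedVoicePackError."""


def resolve_voice_pack(language_code: str, voice_pack: str | None = None) -> str:
    """Hypothetical helper: apply the default/allowed rules from the docstring."""
    if language_code not in VOICE_PACKS:
        raise UnsupportedLanguageError(language_code)
    default, allowed = VOICE_PACKS[language_code]
    if voice_pack is None:
        return default  # voice_pack=None -> per-language default
    if voice_pack not in allowed:
        raise UnsupportedVoicePackError(
            f"{voice_pack!r} is not allowed for language {language_code!r}"
        )
    return voice_pack
```

Keeping this check separate from synthesis means a bad pack is rejected before any model work starts, which fits the "caller error, 400 at HTTP boundary" handling in §5.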

2.2 `driftcall/audio/asr_whisper.py`

from __future__ import annotations

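from dataclasses import dataclass
from typing import TYPE_CHECKING, Callable, Literal

if TYPE_CHECKING:
    # Needed only by the quoted `trace_sink` annotation below; skipped at runtime.
    from driftcall.audio.trace import AudioTrace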

LanguageCode = Literal["hi", "ta", "kn", "en", "hinglish"]


@dataclass(frozen=True)
class TranscriptResult:
    """ASR output surfaced to the env observation builder.

    - `text` is NFC-normalized Unicode; empty string on silence.
    - `language_detected` is the Whisper-reported language code; may disagree
      with the hint (e.g., hint="hi", detected="en" for code-mixed utterances).
    - `confidence` is the mean token log-prob mapped to [0.0, 1.0] via
      exp-normalize (see §3.5). 1.0 = perfect, 0.0 = pathological.
    - `duration_s` is the decoded clip length in seconds (float, rounded to 3dp).
    """

    text: str
    language_detected: LanguageCode | Literal["unknown"]
    confidence: float
    duration_s: float


class ASREngine:
    """faster-whisper-small (int8) wrapper. One instance per process.

    Constructed via `get_asr_engine()`.
    """

    def __init__(
        self,
        *,
        model_id: str = "Systran/faster-whisper-small",
        compute_type: Literal["int8", "int8_float16"] = "int8",
        trace_sink: "Callable[[AudioTrace], None] | None" = None,
    ) -> None: ...

    def transcribe(
        self,
        audio_bytes: bytes,
        language_hint: LanguageCode | None,
        *,
        beam_size: int = 1,
        vad_filter: bool = True,
        max_duration_s: float = 30.0,
    ) -> TranscriptResult:
        """Decode a WAV/PCM clip into a TranscriptResult.

        - `audio_bytes` must be a RIFF WAV with mono 16-bit PCM at 16 kHz OR
          raw float32 PCM at 16 kHz (detected by magic bytes). Other formats
          → `AudioDecodeError`.
        - `language_hint="hinglish"` is translated to `language="hi"` at the
          Whisper call site (Whisper has no Hinglish code); detected language
          may come back as "hi" or "en".
        - `language_hint=None` → autodetect (slower on first pass).
        - Truncates to `max_duration_s` silently and sets
          `result.duration_s = max_duration_s` (see edge case §7.3).
        - Returns a populated `TranscriptResult`; never raises on a merely
          low-confidence decode — that is a policy decision for the caller.
        """

    def warmup(self) -> None:
        """Run one transcribe() on 0.5s of silence to load weights + VAD."""


def get_asr_engine() -> ASREngine:
    """Return the process-wide ASREngine singleton (lazy-constructed)."""

2.2a `driftcall/audio/trace.py` (shared between TTS + ASR)

from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Literal


@dataclass(frozen=True)
class AudioTrace:
    """Per-call diagnostic record for synthesize() and transcribe().

    Emitted via the `trace_sink` callback passed to each engine's __init__.
    Consumed by the `/audio/trace` FastAPI endpoint and the demo UI live overlay.
    Never mutated after construction (frozen).
    """

    op: Literal["synthesize", "transcribe"]
    input_hash: str       # blake2b hex digest of text (for TTS) or audio bytes (for ASR)
    language: str         # requested language code or "unknown"
    duration_s: float     # clip duration in seconds (output for TTS, input for ASR)
    latency_ms: int       # wall-clock call latency
    confidence: float | None   # ASR: TranscriptResult.confidence; TTS: None
    cache_hit: bool       # TTS: LRU hit? ASR: always False
    degraded: bool        # True on voice-pack fallback (TTS) OR coerced-empty (ASR)
    ts_ist: str           # ISO-8601 timestamp in Asia/Kolkata tz


TraceSink = Callable[[AudioTrace], None]
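To make the field semantics concrete, here is a sketch of how a TTS call might populate one record. It is illustrative only: `make_tts_trace` is a hypothetical helper, `digest_size=16` is borrowed from the cache key in §3.4 (the schema itself only says "blake2b hex digest"), and a fixed UTC+05:30 offset stands in for the Asia/Kolkata zone so the example needs no tz database.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Literal

# Assumption: fixed +05:30 offset used in place of the Asia/Kolkata zone.
IST = timezone(timedelta(hours=5, minutes=30))


@dataclass(frozen=True)
class AudioTrace:
    """Copy of the schema above so the example is self-contained."""

    op: Literal["synthesize", "transcribe"]
    input_hash: str
    language: str
    duration_s: float
    latency_ms: int
    confidence: float | None
    cache_hit: bool
    degraded: bool
    ts_ist: str


def make_tts_trace(
    text: str,
    language: str,
    duration_s: float,
    latency_ms: int,
    cache_hit: bool,
    degraded: bool = False,
    now: datetime | None = None,
) -> AudioTrace:
    """Hypothetical helper: build the record a synthesize() call would emit."""
    now = now or datetime.now(timezone.utc)
    return AudioTrace(
        op="synthesize",
        # Only the digest is stored; the raw text never enters the trace.
        input_hash=hashlib.blake2b(text.encode("utf-8"), digest_size=16).hexdigest(),
        language=language,
        duration_s=round(duration_s, 3),
        latency_ms=latency_ms,
        confidence=None,  # TTS has no confidence
        cache_hit=cache_hit,
        degraded=degraded,
        ts_ist=now.astimezone(IST).isoformat(timespec="seconds"),
    )
```

Passing `now` explicitly keeps the helper testable; production code would leave it unset.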

2.3 Custom exceptions

Defined in driftcall/audio/errors.py (tiny module, shared):

class AudioError(Exception): ...
class ModelLoadError(AudioError): ...
class UnsupportedLanguageError(AudioError): ...
class UnsupportedVoicePackError(AudioError): ...
class AudioDecodeError(AudioError): ...
class AudioTooLongError(AudioError): ...   # only raised if caller passes strict=True
class TTSOutOfMemoryError(AudioError): ...

env.py catches AudioError at the boundary and either degrades (see §5) or 500s the HTTP response.

2.4 `__all__`

# tts_kokoro.py
__all__ = [
    "LanguageCode",
    "VoicePack",
    "VoicePackMapping",
    "VOICE_PACKS",
    "TTSEngine",
    "get_tts_engine",
]

# asr_whisper.py
__all__ = [
    "LanguageCode",
    "TranscriptResult",
    "ASREngine",
    "get_asr_engine",
]

# trace.py
__all__ = [
    "AudioTrace",
    "TraceSink",
]

3. Behavior spec

3.1 Training-vs-deploy split (DESIGN.md §9.4 — load-bearing)

| Runtime | Imports `driftcall.audio`? | TTS in loop? | ASR in loop? | Why |
| --- | --- | --- | --- | --- |
| Training (`training/train_grpo.py`, local V100) | No. Explicit negative contract. | No | No | Speed. Pre-authored text transcripts go straight into `DriftCallObservation.last_transcript`. `last_confidence=1.0` (treated as perfect ASR). ~10× faster rollouts. |
| Deployed env (HF Space CPU basic, `app.py`) | Yes, via `get_tts_engine()` + `get_asr_engine()` at startup | Yes (on `SPEAK` actions) | Yes (on every inbound `/step` that carries audio bytes) | DESIGN.md §9.4: "env is genuinely voice-driven for realism". Sim-caller in §3.1 synthesizes user utterances; ASR at env boundary transcribes before embedding into observation. |
| Demo Space (Gradio, ZeroGPU / A10G) | Yes | Yes | Yes + live mic input | Judge interaction. |

env.py toggles between modes via a single flag: DriftCallEnv(audio_boundary_enabled: bool = False). Default False means the training path; True is set only inside app.py (FastAPI) and demo/app_gradio.py. The flag is checked once in __init__; it does not change per-step. Tests: tests/test_env.py must verify that DriftCallEnv(audio_boundary_enabled=False) does not import driftcall.audio.* at all (use sys.modules assertion before/after reset).
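The `sys.modules` assertion mentioned above could look like the following. This is a sketch: `audio_modules_loaded` is a hypothetical helper name, and the commented usage assumes `DriftCallEnv` exposes a `reset()` method as the text implies.

```python
import sys


def audio_modules_loaded(prefix: str = "driftcall.audio") -> list[str]:
    """Return the names in sys.modules that are `prefix` or live under it."""
    return sorted(
        name
        for name in sys.modules
        if name == prefix or name.startswith(prefix + ".")
    )


# Intended use inside tests/test_env.py (not executed here):
#
#     assert audio_modules_loaded() == []          # nothing imported yet
#     env = DriftCallEnv(audio_boundary_enabled=False)
#     env.reset()
#     assert audio_modules_loaded() == []          # still nothing after reset
```

Checking both before and after matters: an "after" check alone cannot tell whether an earlier test in the same process already imported the package.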

3.2 Model load lifecycle

Lazy singleton. get_tts_engine() and get_asr_engine() wrap a _tts: TTSEngine | None = None / _asr: ASREngine | None = None module-global. First call constructs and caches; subsequent calls return the cache. Thread-safe via threading.Lock (not asyncio — FastAPI workers are thread-per-request under the default gunicorn/uvicorn sync path, and even on async workers the lock is uncontended after warmup).
Download. Kokoro-82M and faster-whisper-small are pulled from HF Hub on first load. The Dockerfile for the env Space (deploy_env_space.md) pre-pulls both into /root/.cache/huggingface/ at image-build time so cold start on Space does not re-download (multi-gigabyte pull would exceed the free-tier timeout).
Warmup. app.py lifespan hook calls get_tts_engine().warmup() and get_asr_engine().warmup() serially before the server binds its port. This burns ~8 seconds but ensures the first user request does not face a 5+ second first-inference penalty. The demo Space does the same in its Gradio demo.load event.
Unload. Never. The engines live for the process lifetime. Sessions come and go; the models stay hot. This is safe because both are stateless between calls (no session-private buffers).

3.3 Determinism

TTS. Kokoro exposes a torch.Generator seed. synthesize(..., seed=N) calls torch.manual_seed(N) inside a torch.random.fork_rng() context so the global RNG is unaffected (critical: do not pollute the trainer's RNG). Identical (text, voice_pack, seed, sample_rate_hz) inputs yield byte-for-byte identical output. Floating-point non-determinism across CPU architectures is theoretical but not observed on x86_64 AVX2, which is the only target (Docker image pins to python:3.11-slim on amd64).
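ASR. beam_size=1 disables beam search (greedy decoding, deterministic given weights + input). vad_filter=True uses a deterministic silero-VAD pass that is stable across runs. Temperature: faster-whisper's default is a fallback schedule that starts at 0.0 and escalates to higher, sampling temperatures when a decode fails its quality thresholds, so the first pass is greedy but a fallback pass is not. To keep the reproducibility guarantee, transcribe() should pass temperature=0.0 explicitly rather than rely on the default.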
Training implication. Neither engine is called in training, so RNG safety there is moot, but fork_rng() is kept for hygiene in case future eval scripts run TTS after a seeded rollout.

3.4 LRU caching for TTS

Key. (text_hash, voice_pack, seed, sample_rate_hz) where text_hash = blake2b(text.encode("utf-8"), digest_size=16).hexdigest(). Using the hash (not the raw string) bounds key size and keeps the LRU memory footprint predictable. Key-extension rationale: seed and sample_rate_hz are in the key because synthesize accepts them as arguments and they change output bytes; omitting them would cause silent cache-hit corruption when a caller changes either parameter. This is why the key is richer than the DESIGN.md-level sketch (text, voice_pack).
Value. The WAV bytes (typically 30–80 KB for a 1-sentence Hindi utterance at 16 kHz; up to ~180 KB for a 4–6 s Hindi utterance).
Capacity. 256 entries, with a byte budget of 64 MB. functools.lru_cache(maxsize=256) is NOT used because it doesn't handle the hash-first indirection cleanly; we use cachetools.LRUCache with an explicit lock. Implementation note: cachetools has a single maxsize, which bounds the sum of getsizeof(value) over all entries, and getsizeof defaults to 1 per entry (giving an entry count). Passing getsizeof=len therefore turns maxsize into a byte budget, so the constructor is cachetools.LRUCache(maxsize=64 * 1024 * 1024, getsizeof=len), not maxsize=256 (which would cap the cache at 256 bytes). Worst-case memory is bounded by the byte cap; the 256-entry limit has to be enforced separately by the wrapper that owns the lock.
Memory envelope (worst-case vs typical).
- Typical: 256 × ~60 KB ≈ 15 MB (average 1-sentence utterances).
- Worst-case: 256 × ~180 KB = 46 MB (4–6 s Hindi utterances at 16 kHz 16-bit post-resample).
- Upper-bounded by the byte cap at 64 MB. Above 64 MB, oldest entries evict by LRU order regardless of the 256 entry count.
- Pre-resample (24 kHz, Kokoro native) bytes would be ~69 MB worst-case if we cached pre-resample; we do NOT — the resample in §4.4 happens inside synthesize() before WAV encoding, so the cache stores 16 kHz bytes only. This is why the cache key includes sample_rate_hz: if a future caller ever requests 24 kHz output, it will cache under a separate key rather than colliding with 16 kHz entries.
Cache scope. Process-wide singleton, GLOBAL — not per-session. All concurrent sessions (up to 10 per DESIGN.md §3.3) share ONE cache. This is intentional: the TTS output for (text, voice_pack, seed, sample_rate_hz) is deterministic and carries no session-private data, so sharing is safe and maximises hit rate (sim-caller re-synthesizing the same seed_utterance across sessions benefits from the shared cache).
Invalidation. None — (text, voice, seed, sample_rate_hz) tuples deterministically produce the same bytes, so cache entries are eternal. Model change invalidates everything by process restart.
Why cache. Demo Space replays the same goal utterance across multiple toggle switches (base ⇄ trained LoRA), and the env's sim-caller re-synthesizes the same seed_utterance each time the user re-runs an episode. Hit rate is >90% in the demo setting, turning a 300 ms synth into a 1 ms memcpy.
No ASR caching. ASR inputs are already-variable WAV bytes; repeat rate is low, and keying on audio-byte hashes is O(audio length). Not worth it.
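The key construction and the two bounds (entry count and byte budget) can be sketched with the standard library alone. This is an illustration, not the cachetools-based implementation the text describes: `ByteBoundedLRU` and `make_key` are hypothetical names, and `collections.OrderedDict` is swapped in for `cachetools.LRUCache` so the example has no third-party dependency.

```python
from __future__ import annotations

import hashlib
import threading
from collections import OrderedDict

CacheKey = tuple[str, str, int, int]  # (text_hash, voice_pack, seed, sample_rate_hz)


def make_key(text: str, voice_pack: str, seed: int = 0, sample_rate_hz: int = 16000) -> CacheKey:
    """Hash the text so the key size is bounded regardless of utterance length."""
    text_hash = hashlib.blake2b(text.encode("utf-8"), digest_size=16).hexdigest()
    return (text_hash, voice_pack, seed, sample_rate_hz)


class ByteBoundedLRU:
    """LRU over WAV bytes, bounded by entry count AND total bytes."""

    def __init__(self, max_entries: int = 256, max_bytes: int = 64 * 1024 * 1024) -> None:
        self._data: OrderedDict[CacheKey, bytes] = OrderedDict()
        self._lock = threading.Lock()
        self._max_entries = max_entries
        self._max_bytes = max_bytes
        self._bytes = 0

    def __len__(self) -> int:
        with self._lock:
            return len(self._data)

    def get(self, key: CacheKey) -> bytes | None:
        with self._lock:
            value = self._data.get(key)
            if value is not None:
                self._data.move_to_end(key)  # mark as most recently used
            return value

    def put(self, key: CacheKey, value: bytes) -> None:
        with self._lock:
            old = self._data.pop(key, None)
            if old is not None:
                self._bytes -= len(old)
            self._data[key] = value
            self._bytes += len(value)
            # Evict least-recently-used entries until both bounds hold.
            while self._data and (
                len(self._data) > self._max_entries or self._bytes > self._max_bytes
            ):
                _, evicted = self._data.popitem(last=False)
                self._bytes -= len(evicted)
```

Because `seed` and `sample_rate_hz` are part of the key, changing either one produces a cache miss rather than returning bytes synthesized under different parameters.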

3.5 Confidence mapping (ASR)

faster-whisper exposes per-segment avg_logprob (mean log-probability over tokens, in the range roughly [-1.5, 0.0]). We map it to a [0, 1] confidence via:

import math


def _logprob_to_confidence(avg_logprob: float) -> float:
    # avg_logprob ∈ [-1.5, 0.0] approx. Clamp then exp-normalize.
    clamped = max(-1.5, min(0.0, avg_logprob))
    return round(math.exp(clamped), 3)

This matches the DESIGN.md §4.1 DriftCallObservation.last_confidence semantics (0.0 ≤ c ≤ 1.0, 1.0 in training). When the clip has multiple segments, we take the duration-weighted mean of per-segment confidences.
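The duration-weighted mean over segments can be sketched as follows. Note one consequence of the clamp: the mapping by itself never returns less than exp(-1.5) ≈ 0.223, so a confidence of 0.0 can only come from the empty-text handling, not from the formula. `clip_confidence` is a hypothetical helper name, and returning 0.0 for a clip with no segments is an assumption chosen to agree with the silent-audio path.

```python
import math


def logprob_to_confidence(avg_logprob: float) -> float:
    """Same mapping as above: clamp to [-1.5, 0.0], then exponentiate."""
    clamped = max(-1.5, min(0.0, avg_logprob))
    return round(math.exp(clamped), 3)


def clip_confidence(segments: list[tuple[float, float]]) -> float:
    """Duration-weighted mean of per-segment confidences.

    `segments` holds (duration_s, avg_logprob) pairs, one per decoded segment.
    """
    total = sum(duration for duration, _ in segments)
    if total <= 0.0:
        return 0.0  # assumption: no decodable audio maps to zero confidence
    weighted = sum(
        duration * logprob_to_confidence(avg_logprob)
        for duration, avg_logprob in segments
    )
    return round(weighted / total, 3)
```

Weighting by duration keeps one short, poorly decoded segment from dragging down the score of an otherwise clean clip.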

Empty-text-with-nonzero-confidence branching. faster-whisper can decode to text="" while still reporting avg_logprob > -1.5 (i.e., confidence > 0) on short non-silent clips where the acoustic model produces only whitespace / punctuation tokens that get stripped in post-processing. This is distinct from the VAD-silent case (§7.4) where VAD drops every segment before decode. Branching logic inside transcribe():

if text == "":
    if vad_dropped_all_segments:
        # §7.4 silent-audio path: VAD removed every segment before decode.
        return TranscriptResult(text="", language_detected="unknown",
                                confidence=0.0, duration_s=clip_duration)
    # Decoded to empty but audio was not VAD-silent.
    # Coerce confidence to 0.0 (we cannot trust a confident empty decode)
    # and flag as low-confidence decode so callers can treat it like the
    # silent path without losing the language that Whisper reported.
    # No exception is raised; degraded=True is reported via the trace sink (§3.8).
    return TranscriptResult(
        text="",
        language_detected=mapped_language,  # Whisper-reported language, mapped per §3.6
        confidence=0.0,
        duration_s=clip_duration,
    )

The env treats text == "" as "no intelligible speech" regardless of which branch produced it. This matches DESIGN.md §4.1's implicit contract: last_transcript="" means the agent should CLARIFY rather than assume intent.

3.6 Language detection & Hinglish handling

language_hint="hinglish" is translated to Whisper's language="hi" at the call site. Whisper has no Hinglish token, but Hindi decoding on code-mixed audio produces readable transliteration + English words in Latin script roughly 85% of the time. Noise is expected and documented as Risk 3 in DESIGN.md §14.
TranscriptResult.language_detected reports what Whisper says, not the hint. The one exception: if the hint is "hinglish" and Whisper reports "hi", we relabel the result as "hinglish", but only when the decoded text contains ≥ 2 ASCII-letter words intermixed with Devanagari (heuristic; documented in tests).
If Whisper returns a language code not in our 5-value Literal (e.g., "ur" for Urdu, "mr" for Marathi), language_detected="unknown" is surfaced; env.py logs a warning and falls back to language_hint for R4 reward attribution.

3.7 Concurrency

Both engines are CPU-bound Python calls into C extensions (Kokoro via torch, faster-whisper via CTranslate2). They release the GIL during inference, so threaded FastAPI workers can process N concurrent transcribes at a small RAM cost. Max concurrency is governed by the env-space session cap (10 concurrent sessions per DESIGN.md §3.3). RAM usage: 10 concurrent transcribes × ~150 MB peak = 1.5 GB — fits the free CPU tier's 16 GB with margin.
No per-session model state means two sessions can share an engine instance without lock contention beyond what CTranslate2 internally serializes.

3.8 Diagnostic tracing hook

Both engines accept an optional trace_sink: Callable[[AudioTrace], None] | None = None kwarg in __init__. When provided, every call to synthesize(), synthesize_to_gradio(), or transcribe() emits exactly one AudioTrace record (schema in §2.2a) to the sink after the core work completes but before the return statement. Emissions are wrapped in try/except Exception: pass so a broken sink never crashes the audio path — telemetry must never break production.

Default. trace_sink=None means no emission, zero overhead.
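The "a broken sink never crashes the audio path" rule amounts to a guarded call. A minimal sketch, with `emit_trace` as a hypothetical helper name and plain dicts standing in for `AudioTrace` records:

```python
from __future__ import annotations

from collections import deque
from typing import Any, Callable


def emit_trace(sink: Callable[[Any], None] | None, trace: Any) -> None:
    """Deliver one trace record; swallow any error raised by the sink."""
    if sink is None:  # default: no emission
        return
    try:
        sink(trace)
    except Exception:
        pass  # telemetry must never break the audio path


def broken_sink(trace: Any) -> None:
    """A sink that always fails, to show the guard working."""
    raise RuntimeError("sink is down")


buffer: deque = deque(maxlen=100)
emit_trace(buffer.append, {"op": "synthesize"})  # recorded
emit_trace(broken_sink, {"op": "transcribe"})    # swallowed, no exception
emit_trace(None, {"op": "transcribe"})           # no-op
```

Catching `Exception` rather than using a bare `except` leaves `KeyboardInterrupt` and `SystemExit` untouched, so shutdown signals still propagate.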

Wiring in app.py. The FastAPI startup hook constructs a module-global ring buffer of the most recent 100 traces (collections.deque(maxlen=100)) and passes its .append method as the sink to both engines at get_tts_engine() / get_asr_engine() construction:

from collections import deque

_trace_buffer: deque[AudioTrace] = deque(maxlen=100)
tts = get_tts_engine(trace_sink=_trace_buffer.append)
asr = get_asr_engine(trace_sink=_trace_buffer.append)

(Note: get_tts_engine / get_asr_engine are updated to accept and forward trace_sink through to the first-call __init__; subsequent calls after the singleton is constructed ignore the kwarg — warn in logs if a different sink is passed after construction.)

Endpoint. GET /audio/trace returns {"traces": [dataclasses.asdict(trace), ...]} with the most recent 100 records, newest-first (the deque is append-ordered, so the handler iterates it in reverse). No auth (demo-only; the env Space is behind judge tokens anyway per DESIGN.md §3.3). This endpoint is defined in app.py, not here.
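The newest-first payload can be produced from the ring buffer with `dataclasses.asdict` and `reversed`. A sketch, assuming a trimmed two-field `MiniTrace` in place of the full `AudioTrace` and a hypothetical `trace_payload` helper; the FastAPI route itself is omitted.

```python
from collections import deque
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class MiniTrace:
    """Trimmed stand-in for AudioTrace (two fields only)."""

    op: str
    latency_ms: int


def trace_payload(buffer: "deque[MiniTrace]") -> dict:
    """Serialize the ring buffer newest-first, as plain dicts."""
    return {"traces": [asdict(trace) for trace in reversed(buffer)]}


ring = deque(maxlen=2)  # tiny capacity so the oldest record is evicted
for i, op in enumerate(["synthesize", "transcribe", "synthesize"]):
    ring.append(MiniTrace(op=op, latency_ms=100 + i))
payload = trace_payload(ring)
```

With `maxlen` set, the deque drops the oldest record on overflow, so the handler never has to trim the buffer.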

Demo UI. demo/app_gradio.py polls /audio/trace every 2 s and overlays a sparkline of latency_ms per op and a counter of degraded=True events. This is how judges see the trace health live.

Privacy. input_hash is a blake2b digest — raw text and raw audio bytes never leave the process via the trace. This is a hard invariant.

4. Data structures

4.1 `TranscriptResult`

| Field | Type | Semantic | Constraint | Writer |
| --- | --- | --- | --- | --- |
| `text` | `str` | Decoded transcript, NFC-normalized Unicode | Non-None; may be empty on silence; no trailing whitespace | `ASREngine.transcribe` |
| `language_detected` | `LanguageCode \| "unknown"` | Whisper-reported language, mapped to our 5 codes or `"unknown"` | One of `{"hi","ta","kn","en","hinglish","unknown"}` | `ASREngine.transcribe` |
| `confidence` | `float` | Duration-weighted exp-normalized mean log-prob | `0.0 ≤ c ≤ 1.0`; `0.0` whenever `text == ""` (both VAD-silent per §7.4 AND decoded-empty-despite-audio per §3.5; the latter coerces any nonzero whisper-reported confidence to `0.0`); `1.0` only by convention in training when ASR is bypassed entirely (see §3.1) | `ASREngine.transcribe` |
| `duration_s` | `float` | Clip length in seconds | `0.0 ≤ d ≤ max_duration_s`; rounded to 3dp | `ASREngine.transcribe` |

Frozen dataclass; immutable by project convention (CLAUDE.md §4.2).
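The `text` constraints in the table (NFC-normalized, no trailing whitespace) can be met with the standard library. A small sketch; `normalize_transcript` is a hypothetical helper name, and stripping leading whitespace as well is an assumption beyond what the table requires.

```python
import unicodedata


def normalize_transcript(raw: str) -> str:
    """NFC-normalize a decoded transcript and trim surrounding whitespace."""
    return unicodedata.normalize("NFC", raw).strip()
```

NFC matters for comparisons: the same visible string can arrive as precomposed or decomposed code points, and only the normalized forms compare equal.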

4.2 `VoicePackMapping`

Frozen dataclass. Five instances live in the VOICE_PACKS module-level dict — one per LanguageCode. Never re-assigned after module load.

4.2a `AudioTrace`

Frozen dataclass, defined in driftcall/audio/trace.py (schema in §2.2a, emission semantics in §3.8). Fields: op, input_hash, language, duration_s, latency_ms, confidence, cache_hit, degraded, ts_ist. All fields are immutable; AudioTrace instances are produced at the tail of each synth/transcribe call and fed to the configured trace_sink. Consumed by app.py's /audio/trace endpoint and the demo UI live overlay. Never serialized to disk by this module (app-level concern).

4.3 Voice pack table (DESIGN.md §9.1)

| `language` | `default` | `allowed` | Notes |
| --- | --- | --- | --- |
| `"hi"` | `"hi_female_1"` | `("hi_female_1", "hi_male_1")` | Kokoro Hindi voices. Female default matches most task-brief personas. |
| `"ta"` | `"ta_female_1"` | `("ta_female_1",)` | Only one Tamil pack available at Kokoro-82M size. |
| `"kn"` | `"kn_male_1"` | `("kn_male_1",)` | Only one Kannada pack. |
| `"en"` | `"en_indian_female_1"` | `("en_indian_female_1",)` | Indian-accented English per DESIGN.md §9.1. |
| `"hinglish"` | `"en_indian_female_1"` | `("en_indian_female_1", "hi_female_1")` | Hinglish utterances transliterate English lexis into Devanagari poorly, and Hindi-voice delivery of Latin script is poorer still. `en_indian_female_1` delivers code-mixed ASCII text most naturally; `hi_female_1` is retained as an A/B fallback for utterances that are ≥ 80% Devanagari. Choice documented here per task brief. |

Total: 5 language codes mapped, 5 distinct voice packs used across the table.

4.3.1 Shipped voice packs at pinned version

At the kokoro>=0.3,<0.4 pin (DESIGN.md §9.1, this doc §6.1), Kokoro-82M's actually-shipped voice packs at the time of pinning are: hi_female_1, hi_male_1, en_indian_female_1, and a best-effort set of Indic packs (ta_female_1, kn_male_1) whose bundling with the HF-distributed weights is not guaranteed across minor releases. The Kokoro project ships voice packs as separate .pt files inside the model repo; some Indic packs have been reshuffled between 0.3.x minor versions. This module must behave sanely when an Indic pack is missing from the installed bundle.

Missing-voice-pack fallback chain (evaluated at synthesize() call time, not at warmup, so fallbacks can be per-call telemetry rather than fatal startup errors):

| Requested pack | If missing from bundle, fall back to | Emitted metadata |
| --- | --- | --- |
| `ta_female_1` | `hi_female_1` | `degraded=True`, `fallback_from="ta_female_1"` in the audio trace (§3.8) |
| `kn_male_1` | `hi_female_1` | `degraded=True`, `fallback_from="kn_male_1"` |
| `hi_male_1` | `hi_female_1` | `degraded=True`, `fallback_from="hi_male_1"` |
| `hi_female_1` | `en_indian_female_1` (last resort for Hindi text) | `degraded=True`, `fallback_from="hi_female_1"` |
| `en_indian_female_1` | none (catastrophic if also missing; see below) | n/a |
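The fallback table above can be read as a lookup that is followed until an installed pack is found. A sketch under stated assumptions: `resolve_installed_pack` is a hypothetical helper, following the table transitively (for example `ta_female_1` to `hi_female_1` to `en_indian_female_1`) is an interpretation rather than something the table states, and `LookupError` is a placeholder for whatever error the module raises when no pack is usable.

```python
from __future__ import annotations

# One-step fallbacks copied from the table; None means "no further fallback".
FALLBACKS: dict[str, str | None] = {
    "ta_female_1": "hi_female_1",
    "kn_male_1": "hi_female_1",
    "hi_male_1": "hi_female_1",
    "hi_female_1": "en_indian_female_1",
    "en_indian_female_1": None,
}


def resolve_installed_pack(
    requested: str, installed: set[str]
) -> tuple[str, bool, str | None]:
    """Return (pack_to_use, degraded, fallback_from) for one synthesize() call."""
    pack = requested
    while pack not in installed:
        next_pack = FALLBACKS.get(pack)
        if next_pack is None:
            raise LookupError(f"no usable voice pack for {requested!r}")
        pack = next_pack
    degraded = pack != requested
    return pack, degraded, (requested if degraded else None)
```

Returning the metadata alongside the pack lets the caller copy `degraded` and `fallback_from` straight into its telemetry.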

Warmup policy. TTSEngine.warmup() probes each pack in VOICE_PACKS values by attempting a 1-word synthesis. Missing Indic packs (ta_female_1, kn_male_1, hi_male_1) are logged at WARN and the fallback chain is activated for subsequent calls — warmup does not abort the Space. The ONE condition that DOES abort the Space at warmup is: both en_indian_female_1 AND hi_female_1 missing — this is catastrophic because there is no voice at all for Hindi or English, which are the ≥ 95% traffic languages. In that case ModelLoadError("no usable voice pack for hi or en") is raised and the Space fails to bind its port.

Downstream visibility. Whenever a fallback is used, the degraded=True flag travels with the response. For TTS, this lives in the AudioTrace (§3.8) attached to the ring buffer; for ASR, there is an analogous mechanism in §3.5's empty-string edge case. env.py surfaces degraded=True into DriftCallObservation via a future last_audio_degraded: bool field if the rewards/models doc adds it; until then the flag is telemetry-only and does not influence reward.

4.4 Audio byte format (WAV contract)

TTS output: RIFF WAV, mono, 16-bit PCM, 16 kHz. Produced via torchaudio.save(..., format="wav", bits_per_sample=16, sample_rate=16000) into an in-memory io.BytesIO, then .getvalue(). Header + data.
Resampling call site (canonical). Kokoro-82M synthesizes at 24 kHz natively. Resampling to the env's 16 kHz target happens inside TTSEngine.synthesize, BEFORE WAV encoding, via:
```
import torchaudio.functional as F
pcm_16k = F.resample(pcm_24k, orig_freq=24000, new_freq=16000, lowpass_filter_width=64)
# ...then torchaudio.save(buf, pcm_16k, sample_rate=16000, bits_per_sample=16, format="wav")
```
torchaudio.save(..., sample_rate=16000) is called after the resample; it is an encoder, not a resampler. The sample_rate kwarg on save only writes the RIFF header value; it does not change the tensor's sample rate. Consequence: LRU-cached bytes are always 16 kHz (see §3.4). The sample_rate_hz argument is validated at the top of synthesize(): only 16000 is supported in the v1 contract, and any other value raises an error in the style of UnsupportedLanguageError (the exact exception class is not pinned in §2.3). A 24 kHz output path is future work; the corresponding open question (formerly 9.4) is now resolved, see §9.
ASR input resampling policy. ASR does NOT auto-resample. If transcribe() receives audio whose header sample-rate is not 16 kHz (detected via soundfile.info before full read, or via the RIFF nSamplesPerSec field at bytes 24–27), it raises AudioDecodeError("input must be 16 kHz mono; caller must pre-resample"). Rationale: silently resampling at the ASR boundary hides caller bugs and costs 20–40 ms per call; since TTS already produces 16 kHz and the Gradio mic component is configured to deliver 16 kHz, any non-16kHz input indicates a mis-wired caller that must be fixed, not papered over.
ASR input: same format required. The transcribe method sniffs magic bytes (RIFF....WAVE at offset 0–11) and dispatches to soundfile.read(BytesIO(audio_bytes)); raw float32 PCM at 16 kHz is accepted as a second path for the demo mic input pipeline (which delivers float32 by default from Gradio's type="numpy" component, configured with sample_rate=16000 in §6.2 wiring). Other formats (mp3, ogg, flac) are rejected with AudioDecodeError — we do not ship ffmpeg in the CPU Space image.
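
The header check described above can be sketched with the standard library alone. This is a minimal illustration, assuming the canonical PCM layout in which the `fmt ` chunk directly follows the RIFF preamble (files with extra chunks need a full parser such as soundfile.info). `AudioDecodeError` is redefined locally as a stand-in, and `check_wav_sample_rate` and `make_silent_wav` are hypothetical helper names, not part of the module's API.

```python
import io
import struct
import wave


class AudioDecodeError(Exception):
    """Local stand-in for the module's AudioDecodeError."""


def check_wav_sample_rate(audio_bytes: bytes, expected_hz: int = 16_000) -> int:
    """Validate RIFF/WAVE magic and the header sample rate without decoding audio.

    Assumes the `fmt ` chunk starts at byte 12, which places the channel
    count at bytes 22-23 and nSamplesPerSec at bytes 24-27.
    """
    if len(audio_bytes) < 28 or audio_bytes[:4] != b"RIFF" or audio_bytes[8:12] != b"WAVE":
        raise AudioDecodeError("not a RIFF/WAVE payload")
    if audio_bytes[12:16] != b"fmt ":
        raise AudioDecodeError("unexpected chunk layout; use a full WAV parser")
    channels, sample_rate = struct.unpack_from("<HI", audio_bytes, 22)
    if sample_rate != expected_hz or channels != 1:
        raise AudioDecodeError("input must be 16 kHz mono; caller must pre-resample")
    return sample_rate


def make_silent_wav(sample_rate_hz: int, duration_s: float = 0.1) -> bytes:
    """Build a small mono 16-bit WAV in memory (fixture for the usage below)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(sample_rate_hz)
        w.writeframes(b"\x00\x00" * int(sample_rate_hz * duration_s))
    return buf.getvalue()
```

Calling `check_wav_sample_rate(make_silent_wav(24_000))` raises the error with the message quoted in the policy, while a 16 kHz payload passes through and returns the header rate.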

5. Error modes

| Situation | Exception | Handled by |
|---|---|---|
| Kokoro-82M weights cannot be pulled from HF Hub (network / rate-limit / disk full) at `get_tts_engine()` first call | `ModelLoadError` wrapping the original `huggingface_hub` error or `OSError` | `app.py` startup hook fails fast: the HF Space log shows the error and the server does not bind. Retried on next container boot. |
| faster-whisper-small weights cannot be pulled at `get_asr_engine()` first call | `ModelLoadError` | Same as above. |
| `synthesize(..., language_code=X)` with `X` not in `VOICE_PACKS` keys | `UnsupportedLanguageError` | `env.py` catches, logs, falls back to `en_indian_female_1` at `en`, and sets the R4 penalty flag for language mismatch (enforced by rewards, not here). |
| `synthesize(..., voice_pack=X)` where `X` is not in `VOICE_PACKS[language_code].allowed` | `UnsupportedVoicePackError` (caller bug, distinct from a pack missing from the bundle) | Caller error: 400 at HTTP boundary. |
| `transcribe()` receives bytes with no valid WAV header and no float32-PCM magic | `AudioDecodeError` | `env.py` returns an `UNKNOWN_AUDIO` status in observation; `last_transcript=""`, `last_confidence=0.0`. |
| `transcribe()` low-confidence decode (`confidence < 0.3`) | Not an exception. Returned normally. | Caller (`env.py`) sets `DriftCallObservation.last_confidence` honestly; downstream the agent may `CLARIFY` to re-prompt. R4 does not penalize low ASR confidence, which is a natural observation feature. |
| `transcribe()` returns `text=""` with whisper-reported `confidence > 0` (decoded-empty-despite-audio, not VAD-silent; see §3.5) | Not an exception. `confidence` is coerced to `0.0`, `degraded=True` in trace, result returned with whisper-reported `language_detected`. | Env treats this identically to the silent case: "no intelligible speech"; agent should `CLARIFY`. |
| Audio duration > `max_duration_s` (default 30 s) | Not raised by default: the audio is silently truncated. If the caller passes `strict=True` (not in the default signature), `AudioTooLongError` is raised. | Documented in §7.3. `env.py` always uses the default (silent truncation). |
| TTS OOM mid-synthesis on a pathologically long string (> 4 KB of text) | `TTSOutOfMemoryError` wrapping the originating `MemoryError` or `RuntimeError` (CPU-only deployment per §1, §3.1, §6.1, so CUDA OOM cannot occur; torch on CPU raises `RuntimeError` or Python's built-in `MemoryError` on large tensor allocation failure) | `env.py` catches; the agent's `SPEAK` is dropped with a warning and the turn still counts. R4 penalty for format non-compliance does not apply (env-side failure, not agent fault). |
| Indic voice pack (`ta_female_1`, `kn_male_1`, `hi_male_1`) missing from Kokoro bundle | No exception. Fallback chain per §4.3.1 is activated and `degraded=True` is attached to the trace. | Warmup logs WARN. Startup continues. |
| BOTH `en_indian_female_1` AND `hi_female_1` missing from Kokoro bundle (catastrophic: no Hindi/English voice) | `ModelLoadError("no usable voice pack for hi or en")` | `app.py` warmup catches and aborts startup; the Space will not bind its port. Operator must re-pull weights or downgrade the `kokoro` pin. |
| `language_hint=None` with silent/empty audio | Returns `TranscriptResult(text="", language_detected="unknown", confidence=0.0, duration_s=<duration>)`. No exception. | Normal flow. |
| Concurrent `warmup()` calls from two threads | Second call is a no-op (singleton guard); first blocks until ready. | Tested. |

Partial-result policy: ASR never returns a partial TranscriptResult. Either the decode completes (even if text="") or an AudioError subclass propagates. No None fields.
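
The empty-text row and the partial-result policy can be illustrated with a small sketch. The field names follow the `TranscriptResult(...)` literal shown in the table; `finalize_result` is a hypothetical helper, and returning the degraded flag as a second tuple element is an assumption made for illustration (the document attaches it to a trace instead).

```python
from __future__ import annotations

from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TranscriptResult:
    """Mirror of the four fields named in this document; no field is ever None."""

    text: str
    language_detected: str
    confidence: float
    duration_s: float


def finalize_result(raw: TranscriptResult) -> tuple[TranscriptResult, bool]:
    """Return (result, degraded) for a completed decode.

    Empty text always ends up with confidence 0.0. The degraded flag is set
    only when the decoder reported confidence > 0 despite producing no text.
    """
    if raw.text.strip() == "":
        degraded = raw.confidence > 0.0
        return replace(raw, text="", confidence=0.0), degraded
    return raw, False
```

A decode that completes with empty text therefore still yields a fully populated result, which is what lets callers branch on `text == ""` without None checks.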

6. Dependencies

6.1 Upstream (what this module imports)

| Dependency | Version pin | License | Why |
|---|---|---|---|
| `kokoro` (Kokoro-82M official SDK wrapping the HF model `hexgrad/Kokoro-82M`) | `>=0.3, <0.4` | Apache 2.0 | TTS synthesis. Pure-CPU path. |
| `faster-whisper` | `>=1.0, <2.0` | MIT | ASR via CTranslate2 int8 runtime. |
| `ctranslate2` | (transitive of faster-whisper) | MIT | CTranslate2 runtime, CPU-only wheel. |
| `torchaudio` | `>=2.1, <3.0` | BSD-3 | WAV encoding from raw Kokoro PCM tensors. Pulled in by Kokoro anyway. |
| `soundfile` | `>=0.12` | BSD-3 | WAV decoding for ASR input; works without ffmpeg. |
| `cachetools` | `>=5.3` | MIT | `LRUCache` for TTS bytes. |
| Python stdlib | n/a | n/a | `math`, `io`, `hashlib`, `threading`, `dataclasses`, `enum`, `typing`. |

Not depended on: ffmpeg-python, librosa, pydub, gradio (demo-only), fastapi (app-only).

6.2 Downstream (who imports `driftcall/audio/`)

| Consumer | Imports |
|---|---|
| `app.py` (FastAPI env entrypoint) | `get_tts_engine`, `get_asr_engine`, `TranscriptResult`, all exceptions. Uses `TTSEngine.synthesize` (WAV bytes path) for HTTP responses via `Response(content=..., media_type="audio/wav")` or base64-embedded inside `/step`. Called at startup hook + on every `/step` that carries audio. |
| `demo/app_gradio.py` | `get_tts_engine`, `get_asr_engine`, `TranscriptResult`, all exceptions. Uses `TTSEngine.synthesize_to_gradio` (`tuple[int, np.ndarray]`) as the direct return value for `gr.Audio(type="numpy")` output components. Mic component feeds `transcribe()` via `audio_bytes` obtained from Gradio's float32 PCM at 16 kHz (configured on the component, not converted at the audio-module layer). Never calls `synthesize` (bytes), which is only for FastAPI. |
| `driftcall/env.py` | Only when `audio_boundary_enabled=True`. Calls `get_tts_engine()` / `get_asr_engine()` lazily inside `_maybe_synthesize()` / `_maybe_transcribe()` helpers. Never imports at module top. |
| `tests/test_audio.py` | All public symbols for unit tests. |
| `tests/test_e2e.py` | `TranscriptResult` for constructing deploy-mode integration fixtures. |

6.3 Explicit non-consumers (load-bearing)

training/train_grpo.py — MUST NOT import any symbol from driftcall.audio. Enforced by a linter rule in pyproject.toml:
```
[tool.ruff.lint.flake8-tidy-imports.banned-api]
"driftcall.audio".msg = "Training loop is text-only (DESIGN.md §9.4). Do not import audio in training/."
```
The rule is scoped to training/**/*.py via a per-file-ignores override pattern.
training/eval_baseline.py, training/eval_final.py — same rule. Eval runs on text transcripts; if live-audio eval is needed later, it becomes a separate eval_audio.py script.
rewards.py — does not import audio. Rewards read DriftCallObservation.last_confidence (a float) and last_lang (a string) which the env boundary has already set. Rewards do not re-transcribe.

6.4 Model assets

Licenses below cover model weights (distinct from the Python package licenses in §6.1).

| Model repo | Params / size | License (weights) | Notes |
|---|---|---|---|
| `hexgrad/Kokoro-82M` | 82M params, ~330 MB fp32 (~160 MB int8, unused) | Apache 2.0 | Kokoro fp32 is fast enough on CPU; int8 path not exercised. |
| `Systran/faster-whisper-small` | ~244M params, ~470 MB fp32 / ~120 MB int8 | Apache 2.0 | We use int8 on CPU. See §1.1 for WER trade-off vs `medium` / `large-v3` and the migration path. |

Total cache-on-disk footprint: 450 MB. Dockerfile pre-pulls both into /root/.cache/huggingface/; image size budget per DESIGN.md Risk 10: < 2 GB total. Audio weights take ~25% of that. If §1.1's migration is triggered and we swap to Systran/faster-whisper-medium (700 MB int8), total weights rise to ~1 GB and the image size budget still holds.

7. Edge cases

Ten cases that the test plan (docs/tests/audio_tests.md) must cover. Each case is the minimum test that would catch regressions.

7.1 Hinglish code-mix Whisper noise

transcribe(wav_of("Bhai Friday ko Bangalore jaana hai"), language_hint="hinglish") — Whisper-Hindi decoding on code-mixed audio returns mixed Devanagari+Latin output. Test asserts: (a) text is non-empty, (b) confidence is finite in [0, 1], (c) language_detected is one of {"hi", "hinglish"}. Text-equality is NOT asserted (Risk 3, semantic match downstream). If this test becomes flaky on a new faster-whisper release, we pin the version tighter — do not loosen the assertions.

7.2 Kannada voice pack quality

synthesize("Namaskara, saha haridu", language_code="kn") — Kokoro's Kannada pack is known to produce occasional glitches on loanwords. Test asserts: (a) returns non-empty WAV bytes, (b) the WAV parses with soundfile.read and has >= 1.5 s duration for this phrase, (c) duration is within 30% of expected (2.0 s). Audio-quality assertions beyond this are out of scope — DESIGN.md Risk 8 accepts "pre-generate demo audio with careful voice-pack selection" as mitigation.

7.3 Long utterance truncation

transcribe() receives a 45-second WAV when max_duration_s=30.0. Default path: silent truncation; result.duration_s == 30.0; text contains only the first 30 s of content. Test: feed a synthesized 45-s clip of counted numbers 1–45, assert the decoded text does NOT contain "40" or "45". No exception raised.
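
A minimal sketch of the truncation rule, assuming decoded PCM is available as a flat sequence of samples. `truncate_pcm` is a hypothetical helper and `AudioTooLongError` is redefined locally; the `strict` flag mirrors the opt-in behaviour described in §5.

```python
from typing import Sequence, Tuple


class AudioTooLongError(Exception):
    """Local stand-in for the module's AudioTooLongError."""


def truncate_pcm(
    samples: Sequence[float],
    sample_rate_hz: int = 16_000,
    max_duration_s: float = 30.0,
    strict: bool = False,
) -> Tuple[Sequence[float], float]:
    """Return (samples, duration_s), cut to max_duration_s when too long.

    Default path truncates silently; strict=True raises instead.
    """
    max_samples = int(max_duration_s * sample_rate_hz)
    if len(samples) <= max_samples:
        return samples, len(samples) / sample_rate_hz
    if strict:
        raise AudioTooLongError(
            f"audio is {len(samples) / sample_rate_hz:.1f} s, limit is {max_duration_s} s"
        )
    return samples[:max_samples], max_samples / sample_rate_hz


# A 45 s clip at 16 kHz is cut to exactly 30 s worth of samples.
clip_45s = [0.0] * (45 * 16_000)
kept, duration_s = truncate_pcm(clip_45s)
```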

7.4 Silent audio

transcribe(wav_of_silence(duration_s=3.0), language_hint="hi") — VAD filter drops all segments. TranscriptResult(text="", language_detected="unknown", confidence=0.0, duration_s=3.0) is returned. Explicitly NO exception. env.py interprets text=="" as "user did not speak" and the agent observation reflects that.

7.5 Wrong-language hint

transcribe(wav_of("The flight leaves at six"), language_hint="ta") — Whisper is forced into Tamil decoding on English audio. Result typically garbled. Test asserts: (a) no exception, (b) language_detected may disagree with hint, (c) confidence is likely low (< 0.5 expected, not strictly asserted to avoid flakes). env.py logs a WARN but does not retry with autodetect — retry is the agent's job (via CLARIFY).

7.6 Concurrent sessions sharing engine

Spawn 5 threads, each calling transcribe() on a distinct 2-second clip simultaneously. Assert: (a) all 5 return a TranscriptResult, (b) total wall-clock is less than the sequential total, i.e. under 5× the single-call latency (CTranslate2 releases the GIL during decoding), (c) no exceptions. The same test runs for TTS, but the parallelism benefit is smaller (torch on CPU serializes heavily).
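
The shape of this test can be sketched without loading a model. Here `fake_transcribe` is a hypothetical stand-in for the real engine call: it sleeps, which releases the GIL in the same way a native decode call would, so the timing assertion is meaningful even though no audio is processed.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple


def fake_transcribe(clip_id: int, latency_s: float) -> str:
    """Stand-in for a GIL-releasing decode; sleeps instead of running a model."""
    time.sleep(latency_s)
    return f"clip-{clip_id}"


def run_concurrently(n_threads: int = 5, latency_s: float = 0.1) -> Tuple[List[str], float]:
    """Run n_threads calls at once and return (results, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        results = list(pool.map(lambda i: fake_transcribe(i, latency_s), range(n_threads)))
    return results, time.perf_counter() - start


results, elapsed_s = run_concurrently(n_threads=5, latency_s=0.1)
sequential_total_s = 5 * 0.1
```

In the real test the comparison baseline would be measured by running the same five clips one after another rather than computed from a fixed latency.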

7.7 TTS LRU hit

Call synthesize(text="नमस्ते", language_code="hi", seed=0) twice back-to-back. First call p50 ≈ 250 ms, second call p50 < 5 ms (LRU hit). Assert second call returns byte-identical WAV and is ≥ 10× faster. This guards against accidental cache-key drift.
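
Cache-key drift is easiest to see in a sketch. The version below is an assumption-laden illustration: it uses `collections.OrderedDict` in place of `cachetools.LRUCache` to stay dependency-free, and `cache_key`, `TinyLRU`, and `synthesize_cached` are hypothetical names. The key covers every argument that affects the output bytes, so changing any one of them forces a miss.

```python
import hashlib
from collections import OrderedDict
from typing import Callable, Optional


def cache_key(text: str, language_code: str, voice_pack: str, seed: int) -> str:
    """Hash all synthesis inputs; the unit separator prevents field-boundary collisions."""
    payload = "\x1f".join((text, language_code, voice_pack, str(seed)))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class TinyLRU:
    """Minimal LRU map standing in for cachetools.LRUCache."""

    def __init__(self, maxsize: int = 128) -> None:
        self._maxsize = maxsize
        self._data = OrderedDict()

    def get(self, key: str) -> Optional[bytes]:
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key: str, value: bytes) -> None:
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self._maxsize:
            self._data.popitem(last=False)  # evict least recently used


def synthesize_cached(
    cache: TinyLRU,
    synth_fn: Callable[[str, str, str, int], bytes],
    text: str,
    language_code: str,
    voice_pack: str,
    seed: int = 0,
) -> bytes:
    """Return cached bytes on a hit; otherwise synthesize, store, and return."""
    key = cache_key(text, language_code, voice_pack, seed)
    hit = cache.get(key)
    if hit is not None:
        return hit
    wav = synth_fn(text, language_code, voice_pack, seed)
    cache.put(key, wav)
    return wav
```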

7.8 TTS seed determinism

synthesize(text="कल मिलते हैं", language_code="hi", seed=7) called from two separate fresh processes (subprocess fixture) produces byte-identical WAV. Guards against RNG leak from outer training/eval code. synthesize() uses torch.random.fork_rng internally; the test validates this by comparing torch.get_rng_state() before and after the call to confirm the global RNG is undisturbed.

7.9 Training-loop import firewall

Import training.train_grpo in a subprocess. After import, assert "driftcall.audio.tts_kokoro" not in sys.modules and "driftcall.audio.asr_whisper" not in sys.modules. This guards DESIGN.md §9.4 at the structural level. The ruff banned-api rule should fire in CI; this test is a belt-and-braces backstop.
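
One way to structure the subprocess check is sketched below. `forbidden_modules_loaded` is a hypothetical helper, and the demonstration at the bottom uses standard-library modules as stand-ins because the project modules are not importable outside the repository.

```python
import json
import subprocess
import sys
from typing import List, Tuple


def forbidden_modules_loaded(target: str, forbidden: Tuple[str, ...]) -> List[str]:
    """Import `target` in a fresh interpreter and list the forbidden modules it pulled in."""
    probe = (
        "import importlib, json, sys\n"
        f"importlib.import_module({target!r})\n"
        f"loaded = [m for m in {list(forbidden)!r} if m in sys.modules]\n"
        "print(json.dumps(loaded))\n"
    )
    proc = subprocess.run(
        [sys.executable, "-c", probe], check=True, capture_output=True, text=True
    )
    # Read only the last line so import-time prints from the target are ignored.
    return json.loads(proc.stdout.strip().splitlines()[-1])


# Stand-in demonstration: importing colorsys does not load wave.
leaked = forbidden_modules_loaded("colorsys", ("wave",))
```

In the project test, the call would be `forbidden_modules_loaded("training.train_grpo", ("driftcall.audio.tts_kokoro", "driftcall.audio.asr_whisper"))` with an assertion that the result is empty.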

7.10 Model-load failure at startup

Monkeypatch kokoro.KPipeline to raise OSError("no network"). Call get_tts_engine(). Assert ModelLoadError is raised with the original OSError in __cause__. A second call re-attempts the load, because the singleton state does not cache the failure; this is intentional so that a transient HF Hub outage does not permanently break the process. Repeat the test against an ASR mock.
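
The retry-after-failure behaviour follows naturally from a lazy singleton that only stores a successful load. The sketch below is an illustration under that assumption: `LazySingleton` is a hypothetical name, `ModelLoadError` is redefined locally, and the loader is a plain callable rather than the real Kokoro pipeline.

```python
import threading
from typing import Any, Callable, Optional


class ModelLoadError(Exception):
    """Local stand-in for the module's ModelLoadError."""


class LazySingleton:
    """Thread-safe lazy holder that caches success but never caches failure."""

    def __init__(self, loader: Callable[[], Any]) -> None:
        self._loader = loader
        self._lock = threading.Lock()
        self._instance: Optional[Any] = None

    def get(self) -> Any:
        if self._instance is not None:  # fast path once loaded
            return self._instance
        with self._lock:
            if self._instance is None:
                try:
                    self._instance = self._loader()
                except OSError as exc:
                    # _instance stays None, so the next call retries the load.
                    raise ModelLoadError(f"model load failed: {exc}") from exc
            return self._instance
```

Because the lock is held for the whole load, a second thread calling `get()` during startup blocks until the first load finishes and then receives the same instance, which matches the concurrent `warmup()` row in §5.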

8. Examples

8.1 Hindi TTS round-trip (deployed env sim-caller path)

```
from __future__ import annotations

import io

import soundfile

from driftcall.audio.tts_kokoro import get_tts_engine

tts = get_tts_engine()
wav_bytes = tts.synthesize(
    text="नमस्ते, कल दिल्ली की फ्लाइट बुक करनी है, सात हज़ार के अंदर।",
    language_code="hi",
    voice_pack="hi_female_1",
    seed=0,
)

# Assertions typical of the test and the sim-caller:
assert isinstance(wav_bytes, bytes)
assert wav_bytes[:4] == b"RIFF"
assert wav_bytes[8:12] == b"WAVE"
# Write to disk for debugging:
# pathlib.Path("goal_hi.wav").write_bytes(wav_bytes)

# File size for a ~4 s clip at 16 kHz 16-bit mono ≈ 128 KB.
assert 60_000 < len(wav_bytes) < 180_000

# Duration can be extracted cheaply via soundfile:
info = soundfile.info(io.BytesIO(wav_bytes))
assert 3.0 < info.duration < 6.0
assert info.samplerate == 16_000
assert info.channels == 1
```

8.2 Hinglish ASR (env boundary transcribing user audio)

```
from __future__ import annotations

from pathlib import Path

from driftcall.audio.asr_whisper import get_asr_engine, TranscriptResult

asr = get_asr_engine()
wav_bytes = Path("user_hinglish_bangalore.wav").read_bytes()

result: TranscriptResult = asr.transcribe(
    audio_bytes=wav_bytes,
    language_hint="hinglish",
    beam_size=1,
    vad_filter=True,
)

# Expected shape:
assert isinstance(result, TranscriptResult)
assert "bangalore" in result.text.lower() or "बैंगलोर" in result.text
assert result.language_detected in {"hi", "hinglish"}
assert 0.0 <= result.confidence <= 1.0
assert result.duration_s > 0.0

# Embedding into env observation (happens in env.py, not here):
# obs = replace(obs,
#     last_transcript=result.text,
#     last_lang=result.language_detected if result.language_detected != "unknown" else goal.language,
#     last_confidence=result.confidence,
# )
```

8.3 Kannada round-trip TTS → ASR (demo Space self-test)

```
from __future__ import annotations

from driftcall.audio.tts_kokoro import get_tts_engine
from driftcall.audio.asr_whisper import get_asr_engine

tts = get_tts_engine()
asr = get_asr_engine()

original_text = "Kempegowda airport ge taxi beku"
wav = tts.synthesize(text=original_text, language_code="kn", seed=42)
result = asr.transcribe(audio_bytes=wav, language_hint="kn")

# Round-trip fidelity (semantic, not exact: Kannada ASR has noise):
assert result.text != ""
assert result.language_detected in {"kn", "unknown"}
# Soft assertion: at least one keyword survives the round-trip.
assert any(tok in result.text.lower() for tok in ("kempegowda", "airport", "taxi"))
# Confidence floor for demo playback:
assert result.confidence > 0.3, f"Kannada round-trip confidence too low: {result.confidence}"
```

Additional integration-level flow (not a unit test, for orientation):

```
┌─ sim-caller ─┐   TTS     ┌─ env boundary ───┐   ASR     ┌─ env core ──┐
│ goal text    │──────────▶│ 16 kHz WAV bytes │──────────▶│ observation │
│ (GoalSpec    │           │ (bytes over HTTP │           │ (text,lang, │
│  .seed_utt)  │           │  in /step body)  │           │  conf)      │
└──────────────┘           └──────────────────┘           └─────────────┘
```

9. Open questions

VAD filter confidence on Hinglish code-mix. vad_filter=True uses silero-VAD trained primarily on European languages. Early smoke tests suggest it sometimes clips Hindi laterals and nasals ("ल", "न"). If this materially hurts R1 on Hinglish episodes during Phase C baseline eval, we may flip to vad_filter=False at the cost of ~10% slower decoding. Escalate to orchestrator after baseline runs in Batch C2.
Kokoro voice pack A/B for Hinglish. §4.3 documents en_indian_female_1 as default and hi_female_1 as fallback. We have no empirical data yet on which produces better judge perception in the demo. Decision deferred to demo rehearsal in Batch C5 — Person D to record both variants and pick by ear.
Should the env return raw WAV bytes to the agent, or just the transcript? Current design: transcript only (via DriftCallObservation.last_transcript). An argument for also returning WAV: the agent could self-re-transcribe with a different model. Counter: we want to lock the ASR-as-oracle contract for reward reproducibility. Recommendation: keep transcript-only. If overturned in review, DriftCallObservation gets a new optional last_audio_b64: str | None field and this doc + models.md both update.
Sample rate upgrade path. RESOLVED (see §4.4). Original question: 16 kHz is the minimum for Whisper-small, but 24 kHz would sound better for TTS playback in the demo. Kokoro natively produces 24 kHz and we currently resample down; exposing 24 kHz for TTS output while ASR stays at 16 kHz would cost 50% more bandwidth over HTTP. Resolution: the v1 contract pins TTS output to 16 kHz and resamples inside synthesize() before WAV encoding via torchaudio.functional.resample(tensor, orig_freq=24000, new_freq=16000, lowpass_filter_width=64). ASR never auto-resamples; non-16 kHz input raises AudioDecodeError. A 24 kHz playback path is out of scope for the hackathon ship and will not be added without a DESIGN.md §9 update.