
Phase 3: ASR & Voice Input

Status: ✅ Complete | Tests: 45 passed, 2 skipped (VAD tests; soundfile not installed) | Files: 2 modules


What Was Built

Phase 3 adds the voice input pipeline: raw audio → Whisper transcription → cleaned, classified query.

Two modules were implemented:

Module                                   Responsibility
voicevault/asr/query_preprocessor.py     Cleans Whisper transcripts; classifies intent
voicevault/asr/whisper_transcriber.py    GPU/CPU Whisper transcription with VAD

QueryPreprocessor

File: voicevault/asr/query_preprocessor.py

Pipeline (in order)

raw_query
  → _normalize()           # lowercase + collapse whitespace
  → _remove_fillers()      # strip spoken filler words
  → _repair_punctuation()  # clean leading/trailing commas, dots
  → _detect_language()     # langdetect ISO 639-1
  → _classify_intent()     # factual | summary | compare
→ PreprocessedQuery
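
As a rough illustration of the first and third stages, the sketch below lowercases the text, collapses whitespace, and trims stray commas and dots. The function names mirror the private methods in the diagram, but the bodies are assumptions rather than the module's actual code.

```python
import re


def normalize(text: str) -> str:
    """Lowercase the text and collapse every whitespace run to one space."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def repair_punctuation(text: str) -> str:
    """Trim commas, dots, and spaces left dangling at either end."""
    return text.strip(" ,.")
```

Question marks are left untouched here, since the diagram only mentions commas and dots.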

Filler Word Removal

Multi-word fillers are removed first (longer patterns take priority) to prevent partial matches:

_FILLER_WORDS = {
    "um", "uh", "er", "ah", "eh", "like", "you know",
    "i mean", "basically", "literally", "actually", "right",
    "so", "well", "okay", "ok",
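}

Multi-word fillers such as "you know" and "i mean" are matched as whole phrases before any single-word filler. Single-word fillers use \b word boundaries, and multi-word fillers use (?<!\w)...(?!\w) lookarounds, so substrings of legitimate words are never stripped (for example, the "um" in "umbrella").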

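The ordering and boundary rules described above can be sketched with a single compiled pattern. This is a minimal sketch, not the module's implementation: it assumes the input is already lowercased, and it applies the lookaround form to every filler for simplicity. The names FILLER_RE and remove_fillers are hypothetical.

```python
import re

# Filler list copied from the section above.
FILLER_WORDS = {
    "um", "uh", "er", "ah", "eh", "like", "you know",
    "i mean", "basically", "literally", "actually", "right",
    "so", "well", "okay", "ok",
}

# Longest fillers first, so multi-word phrases are tried before short words.
_ordered = sorted(FILLER_WORDS, key=len, reverse=True)

# Lookarounds require a non-word character (or string edge) on both sides,
# which keeps "um" from matching inside "umbrella".
FILLER_RE = re.compile(
    r"(?<!\w)(?:" + "|".join(re.escape(f) for f in _ordered) + r")(?!\w)"
)


def remove_fillers(text: str) -> str:
    """Strip fillers from lowercased text, then collapse leftover spaces."""
    without_fillers = FILLER_RE.sub(" ", text)
    return re.sub(r"\s+", " ", without_fillers).strip()
```

Calling remove_fillers("um what is you know the accuracy") yields "what is the accuracy", while "umbrella" and "likelihood" pass through unchanged.
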
Intent Classification

Priority: compare > summary > factual (most specific wins):

Type     Trigger patterns                                                 Example
compare  \bcompar(e|ing)\b, \bdifferen(ce|t)\b, \bversus\b, \bvs\.?\b    "compare BERT and GPT"
summary  ^summar(ise|ize), describe, explain, overview                    "summarize the paper"
factual  ^what is, ^who, ^when, ^where, ^how many                         "what is accuracy?"

Unknown queries default to factual, the safest retrieval strategy.
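
The priority rule can be expressed as an ordered series of checks. The sketch below reuses the compare and summary patterns from the table; the \b boundaries around describe, explain, and overview are an assumption, and classify_intent is a hypothetical stand-in for the private _classify_intent() method.

```python
import re

# Patterns taken from the table above; checked in priority order.
COMPARE_PATTERNS = [
    r"\bcompar(e|ing)\b",
    r"\bdifferen(ce|t)\b",
    r"\bversus\b",
    r"\bvs\.?\b",
]
SUMMARY_PATTERNS = [
    r"^summar(ise|ize)",
    r"\bdescribe\b",   # word boundaries here are an assumption
    r"\bexplain\b",
    r"\boverview\b",
]


def classify_intent(query: str) -> str:
    """Return 'compare', 'summary', or 'factual' (the default)."""
    q = query.lower().strip()
    if any(re.search(p, q) for p in COMPARE_PATTERNS):
        return "compare"
    if any(re.search(p, q) for p in SUMMARY_PATTERNS):
        return "summary"
    # Factual is both an explicit category and the fallback,
    # so its trigger patterns do not need to be checked separately.
    return "factual"
```

Because compare is checked first, a query such as "explain the difference between BERT and GPT" is classified as compare even though it also contains a summary trigger.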

Language Detection

Uses langdetect (offline, lightweight). Falls back to "en" when:

  • text has < 3 words (too short to detect reliably)
  • langdetect raises any exception
  • langdetect is not installed
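
The three fallback conditions map onto three early returns. This is a sketch under the assumption that langdetect's detect() function is the entry point used; detect_language is a hypothetical name for the private _detect_language() method.

```python
def detect_language(text: str, default: str = "en") -> str:
    """Return a language code, falling back to `default` when unsure."""
    # Fallback 1: fewer than 3 words is too short to detect reliably.
    if len(text.split()) < 3:
        return default
    # Fallback 2: langdetect is an optional dependency.
    try:
        from langdetect import detect
    except ImportError:
        return default
    # Fallback 3: any detection error is swallowed.
    try:
        return detect(text)
    except Exception:
        return default
```

Importing inside the function keeps the module importable on machines without langdetect.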

PreprocessedQuery Dataclass

from dataclasses import dataclass

@dataclass
class PreprocessedQuery:
    raw_query: str        # Original, unmodified input
    processed_query: str  # Cleaned for retrieval
    query_type: str       # factual | summary | compare
    language: str         # ISO 639-1

WhisperTranscriber

File: voicevault/asr/whisper_transcriber.py

Model Selection Strategy

Has GPU?
  YES → openai/whisper-large-v3         (3 GB VRAM, best quality, 99 languages)
        If load fails → fall back to the distil model below
  NO  → distil-whisper/distil-large-v3  (CPU-friendly, 6× faster, within 1% WER)

Both model IDs are configurable via cfg.whisper_model and cfg.distil_whisper_model.
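
The cascade can be sketched independently of torch and transformers by injecting the loader. Everything here is illustrative: load_with_fallback and the load(model_id, device) callable are hypothetical, and the assumption that the fallback model runs on "cpu" is inferred from the "GPU → CPU cascade" wording rather than confirmed by the source.

```python
from typing import Any, Callable, Tuple

GPU_MODEL = "openai/whisper-large-v3"
CPU_MODEL = "distil-whisper/distil-large-v3"


def load_with_fallback(
    has_gpu: bool,
    load: Callable[[str, str], Any],
) -> Tuple[Any, str]:
    """Return (pipeline, model_id), preferring the large model on GPU."""
    if has_gpu:
        try:
            return load(GPU_MODEL, "cuda"), GPU_MODEL
        except Exception:
            # Any load failure (for example, out of memory) falls
            # through to the distilled model.
            pass
    return load(CPU_MODEL, "cpu"), CPU_MODEL
```

Returning the model ID alongside the pipeline makes it straightforward to populate a field such as model_used later on.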

Transcription Flow

transcribe(audio_path: Path) -> TranscriptResult:
    1. audio_path.exists()                   # raise WhisperTranscriberError if missing
    2. _vad_check(audio_path)                # reject silent/too-short audio
    3. _get_pipeline()                       # lazy load (GPU → CPU cascade)
    4. pipeline(str(audio_path), return_timestamps=False,
                generate_kwargs={"language": "english"})
    5. preprocessor.process(raw_transcript)  # clean + classify
    6. return TranscriptResult

Voice Activity Detection (VAD)

Prevents Whisper from hallucinating text on silent input, a known failure mode.

_vad_check(audio_path):
    if soundfile available:
        data, sample_rate = soundfile.read(audio_path)
        duration_s = len(data) / sample_rate
        if duration_s < 1.0:  raise WhisperTranscriberError("Audio too short")
        rms = sqrt(mean(data**2))
        if rms < 0.001:  raise WhisperTranscriberError("No speech detected")
    else:
        if file_size < 1000 bytes:  raise WhisperTranscriberError("File appears empty")

Unexpected exceptions inside the VAD check are treated as non-critical: they are logged and transcription proceeds. This keeps the transcriber resilient to unusual audio formats.
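
The duration and energy checks can be tried out without soundfile by working on already-decoded samples. The sketch below uses only the standard library; vad_check and AudioRejected are hypothetical stand-ins for _vad_check() and WhisperTranscriberError, and the thresholds are the ones quoted above.

```python
import math
from typing import Sequence

MIN_DURATION_S = 1.0   # reject clips shorter than one second
MIN_RMS = 0.001        # reject clips whose energy is effectively zero


class AudioRejected(ValueError):
    """Hypothetical stand-in for WhisperTranscriberError."""


def vad_check(samples: Sequence[float], sample_rate: int) -> float:
    """Validate mono float samples in [-1, 1]; return their RMS level."""
    duration_s = len(samples) / sample_rate
    if duration_s < MIN_DURATION_S:
        raise AudioRejected("Audio too short")
    # Root mean square: square each sample, average, take the square root.
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms < MIN_RMS:
        raise AudioRejected("No speech detected")
    return rms
```

The duration test runs first, so an empty sample list is rejected as too short before the RMS division is reached.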

Lazy Pipeline Loading

The HuggingFace pipeline is loaded only on the first transcribe() call. This:

  • Avoids slow startup on CPU-only machines
  • Prevents import-time failures when torch/transformers are unavailable
  • Allows is_ready() to accurately report whether the model is loaded

def is_ready(self) -> bool:
    return self._pipeline is not None
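
A minimal sketch of the lazy-loading pattern, with the expensive loader injected so it can be exercised without any model download. The LazyPipeline class and its get() method are hypothetical; the source only confirms the _pipeline attribute, _get_pipeline(), and is_ready().

```python
from typing import Any, Callable, Optional


class LazyPipeline:
    """Defers an expensive load until the first call to get()."""

    def __init__(self, loader: Callable[[], Any]) -> None:
        self._loader = loader
        self._pipeline: Optional[Any] = None  # nothing loaded at construction

    def is_ready(self) -> bool:
        return self._pipeline is not None

    def get(self) -> Any:
        # Load once, then reuse the cached pipeline on later calls.
        if self._pipeline is None:
            self._pipeline = self._loader()
        return self._pipeline
```

Constructing the object costs nothing; the loader runs exactly once, on the first get().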

TranscriptResult Schema

from pydantic import BaseModel

class TranscriptResult(BaseModel):
    transcript: str       # Cleaned, normalized text (ready for retrieval)
    raw_transcript: str   # Original Whisper output (for debugging/audit)
    language: str         # ISO 639-1
    confidence: float     # Always 1.0 (Whisper has no per-utterance score)
    model_used: str       # "openai/whisper-large-v3" or distil variant
    latency_ms: int       # Wall-clock transcription time
    query_type: str       # factual | summary | compare

Security Decisions

No Audio Storage

Audio files are processed in-memory and never persisted beyond the request lifecycle. Only the SHA-256 hash of the query text is stored in SQLite (for PII protection).
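
For illustration, the stored value could be derived with the standard library as shown below. The hash_query helper is hypothetical, and whether the project hashes the raw or the processed query is not stated here.

```python
import hashlib


def hash_query(query: str) -> str:
    """Return the hex-encoded SHA-256 digest of the query text."""
    return hashlib.sha256(query.encode("utf-8")).hexdigest()
```

The digest is a fixed 64-character hex string, so identical queries can still be counted or deduplicated without keeping the text itself.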

VAD as DoS Protection

Rejecting silent audio prevents Whisper from consuming GPU compute on empty input, which matters when running on shared HuggingFace Spaces hardware.

Fixed Language Parameter

generate_kwargs={"language": "english"} is passed to Whisper. This improves WER on English technical vocabulary and prevents the model from guessing the language from noise.


Test Coverage

File: tests/test_phase3.py | 47 collected, 45 passed, 2 skipped

Class                                      Tests  Coverage
TestQueryPreprocessorNormalization         11     Lowercase, whitespace, filler removal (um/uh/like/so/you know), edge cases
TestQueryPreprocessorIntentClassification  14     All three types, priority ordering, default fallback
TestQueryPreprocessorLanguageDetection     3      English detection, short query fallback, return type
TestWhisperTranscriberVAD                  3      Valid audio passes, silent audio raises, missing file raises
TestWhisperTranscriberMocked               10     Return type, transcript cleaning, query type classification, model_used, latency, raw_transcript preservation, is_ready(), pipeline error handling
TestTranscriptResultModel                  3      Default confidence=1.0, default query_type="factual", default language="en"

Skipped tests: test_valid_audio_passes_vad and test_silent_audio_raises. Both require soundfile, which is not installed in the test environment. They pass in production (HuggingFace Spaces has soundfile).

Key Mock Pattern

Tests inject a pre-built MagicMock pipeline directly into transcriber._pipeline, bypassing the 3 GB model download entirely:

from unittest.mock import MagicMock
from voicevault.asr.whisper_transcriber import WhisperTranscriber

transcriber = WhisperTranscriber(force_cpu=True)
transcriber._pipeline = MagicMock(return_value={"text": "what is machine learning"})
transcriber._model_used = "mock-whisper"

Integration with Phase 4

WhisperTranscriber.transcribe() returns a TranscriptResult where:

  • transcript → passed as query to HybridRetriever and AnswerChain
  • query_type → drives retrieval strategy (factual/summary/compare) and LLM token budget
  • language → stored in QuerySession for analytics

The preprocessed query flows through the entire pipeline without further transformation.