Spaces:

NinjainPJs
/

VoiceVault

Running

App Files Files Community

VoiceVault / DOCS /phase3_asr.md

NinjainPJs

Initial release: VoiceVault v1.0.0 — Voice-First RAG Knowledge Agent

85f900d 3 months ago

preview code

raw

history blame contribute delete

7.56 kB

Phase 3 — ASR & Voice Input

Status: ✅ Complete | Tests: 45 passed, 2 skipped (VAD — soundfile not installed) | Files: 2 modules

What Was Built

Phase 3 adds the voice input pipeline: raw audio → Whisper transcription → cleaned, classified query.

Two modules were implemented:

Module	Responsibility
`voicevault/asr/query_preprocessor.py`	Cleans Whisper transcripts; classifies intent
`voicevault/asr/whisper_transcriber.py`	GPU/CPU Whisper transcription with VAD

QueryPreprocessor

File: voicevault/asr/query_preprocessor.py

Pipeline (in order)

raw_query
  → _normalize()         # lowercase + collapse whitespace
  → _remove_fillers()    # strip spoken filler words
  → _repair_punctuation()# clean leading/trailing commas, dots
  → _detect_language()   # langdetect ISO 639-1
  → _classify_intent()   # factual | summary | compare
→ PreprocessedQuery

Filler Word Removal

Multi-word fillers are removed first (longer patterns take priority) to prevent partial matches:

_FILLER_WORDS = {
    "um", "uh", "er", "ah", "eh", "like", "you know",
    "i mean", "basically", "literally", "actually", "right",
    "so", "well", "okay", "ok",
}

"you know" is matched before "you" or "know" individually. All fillers use \b word boundaries (or (?<!\w)..(?!\w) for multi-word) to prevent stripping substrings of legitimate words.

Intent Classification

Priority: compare > summary > factual (most specific wins):

Type	Trigger patterns	Example
`compare`	`\bcompar(e\|ing)\b`, `\bdifferen(ce\|t)\b`, `\bversus\b`, `\bvs\.?\b`	"compare BERT and GPT"
`summary`	`^summar(ise\|ize)`, `describe`, `explain`, `overview`	"summarize the paper"
`factual`	`^what is`, `^who`, `^when`, `^where`, `^how many`	"what is accuracy?"

Unknown queries default to factual — the safest retrieval strategy.

Language Detection

Uses langdetect (offline, lightweight). Falls back to "en" when:

text has < 3 words (too short to detect reliably)
langdetect raises any exception
langdetect is not installed

PreprocessedQuery Dataclass

@dataclass
class PreprocessedQuery:
    raw_query: str        # Original, unmodified input
    processed_query: str  # Cleaned for retrieval
    query_type: str       # factual | summary | compare
    language: str         # ISO 639-1

WhisperTranscriber

File: voicevault/asr/whisper_transcriber.py

Model Selection Strategy

Has GPU?
  YES → openai/whisper-large-v3     (3GB VRAM, best quality, 99 languages)
        If load fails →
  NO  → distil-whisper/distil-large-v3  (CPU-friendly, 6× faster, within 1% WER)

Both model IDs are configurable via cfg.whisper_model and cfg.distil_whisper_model.

Transcription Flow

transcribe(audio_path: Path) → TranscriptResult:
    1. audio_path.exists()  — raise WhisperTranscriberError if missing
    2. _vad_check(audio_path)  — reject silent/too-short audio
    3. _get_pipeline()  — lazy load (GPU → CPU cascade)
    4. pipeline(str(audio_path), return_timestamps=False,
                generate_kwargs={"language": "english"})
    5. preprocessor.process(raw_transcript)  — clean + classify
    6. return TranscriptResult

Voice Activity Detection (VAD)

Prevents Whisper from hallucinating text on silent input — a known failure mode.

_vad_check(audio_path):
    if soundfile available:
        duration_s = len(data) / sample_rate
        if duration_s < 1.0:  raise WhisperTranscriberError("Audio too short")
        rms = sqrt(mean(data**2))
        if rms < 0.001:  raise WhisperTranscriberError("No speech detected")
    else:
        if file_size < 1000 bytes:  raise WhisperTranscriberError("File appears empty")

Non-critical VAD failures (unexpected exceptions) are silently logged — transcription proceeds. This ensures resilience on unusual audio formats.

Lazy Pipeline Loading

The HuggingFace pipeline is loaded only on the first transcribe() call. This:

Avoids slow startup on CPU-only machines
Prevents import-time failures when torch/transformers are unavailable
Allows is_ready() to accurately report whether the model is loaded

def is_ready(self) -> bool:
    return self._pipeline is not None

TranscriptResult Schema

class TranscriptResult(BaseModel):
    transcript: str       # Cleaned, normalized text (ready for retrieval)
    raw_transcript: str   # Original Whisper output (for debugging/audit)
    language: str         # ISO 639-1
    confidence: float     # Always 1.0 (Whisper has no per-utterance score)
    model_used: str       # "openai/whisper-large-v3" or distil variant
    latency_ms: int       # Wall-clock transcription time
    query_type: str       # factual | summary | compare

Security Decisions

No Audio Storage

Audio files are processed in-memory and never persisted beyond the request lifecycle. Only the SHA-256 of the query text is stored in SQLite (for PII protection).

VAD as DoS Protection

Rejecting silent audio prevents Whisper from consuming GPU compute on empty input — important when running on shared HuggingFace Spaces hardware.

Fixed Language Parameter

generate_kwargs={"language": "english"} is passed to Whisper. This improves WER on English technical vocabulary and prevents the model from guessing the language from noise.

Test Coverage

File: tests/test_phase3.py | 47 collected, 45 passed, 2 skipped

Class	Tests	Coverage
`TestQueryPreprocessorNormalization`	11	Lowercase, whitespace, filler removal (um/uh/like/so/you know), edge cases
`TestQueryPreprocessorIntentClassification`	14	All three types, priority ordering, default fallback
`TestQueryPreprocessorLanguageDetection`	3	English detection, short query fallback, return type
`TestWhisperTranscriberVAD`	3	Valid audio passes, silent audio raises, missing file raises
`TestWhisperTranscriberMocked`	10	Return type, transcript cleaning, query type classification, model_used, latency, raw_transcript preservation, is_ready(), pipeline error handling
`TestTranscriptResultModel`	3	Default confidence=1.0, default query_type="factual", default language="en"

Skipped tests: test_valid_audio_passes_vad and test_silent_audio_raises — these require soundfile which is not installed in the test environment. They pass in production (HuggingFace Spaces has soundfile).

Key Mock Pattern

Tests inject a pre-built MagicMock pipeline directly into transcriber._pipeline, bypassing the 3GB model download entirely:

transcriber = WhisperTranscriber(force_cpu=True)
transcriber._pipeline = MagicMock(return_value={"text": "what is machine learning"})
transcriber._model_used = "mock-whisper"

Integration with Phase 4

WhisperTranscriber.transcribe() returns a TranscriptResult where:

transcript → passed as query to HybridRetriever and AnswerChain
query_type → drives retrieval strategy (factual/summary/compare) and LLM token budget
language → stored in QuerySession for analytics

The preprocessed query flows through the entire pipeline without further transformation.