VoiceVault / DOCS /phase3_asr.md
NinjainPJs's picture
Initial release: VoiceVault v1.0.0 β€” Voice-First RAG Knowledge Agent
85f900d
# Phase 3 β€” ASR & Voice Input
**Status:** βœ… Complete | **Tests:** 45 passed, 2 skipped (VAD β€” soundfile not installed) | **Files:** 2 modules
---
## What Was Built
Phase 3 adds the voice input pipeline: raw audio β†’ Whisper transcription β†’ cleaned, classified query.
Two modules were implemented:
| Module | Responsibility |
|--------|----------------|
| `voicevault/asr/query_preprocessor.py` | Cleans Whisper transcripts; classifies intent |
| `voicevault/asr/whisper_transcriber.py` | GPU/CPU Whisper transcription with VAD |
---
## QueryPreprocessor
**File:** [voicevault/asr/query_preprocessor.py](../voicevault/asr/query_preprocessor.py)
### Pipeline (in order)
```
raw_query
β†’ _normalize() # lowercase + collapse whitespace
β†’ _remove_fillers() # strip spoken filler words
β†’ _repair_punctuation()# clean leading/trailing commas, dots
β†’ _detect_language() # langdetect ISO 639-1
β†’ _classify_intent() # factual | summary | compare
β†’ PreprocessedQuery
```
### Filler Word Removal
Multi-word fillers are removed first (longer patterns take priority) to prevent partial matches:
```python
_FILLER_WORDS = {
"um", "uh", "er", "ah", "eh", "like", "you know",
"i mean", "basically", "literally", "actually", "right",
"so", "well", "okay", "ok",
}
```
`"you know"` is matched before `"you"` or `"know"` individually. All fillers use `\b` word boundaries (or `(?<!\w)..(?!\w)` for multi-word) to prevent stripping substrings of legitimate words.
### Intent Classification
Priority: **compare > summary > factual** (most specific wins):
| Type | Trigger patterns | Example |
|------|-----------------|---------|
| `compare` | `\bcompar(e\|ing)\b`, `\bdifferen(ce\|t)\b`, `\bversus\b`, `\bvs\.?\b` | "compare BERT and GPT" |
| `summary` | `^summar(ise\|ize)`, `describe`, `explain`, `overview` | "summarize the paper" |
| `factual` | `^what is`, `^who`, `^when`, `^where`, `^how many` | "what is accuracy?" |
Unknown queries default to `factual` β€” the safest retrieval strategy.
### Language Detection
Uses `langdetect` (offline, lightweight). Falls back to `"en"` when:
- text has < 3 words (too short to detect reliably)
- `langdetect` raises any exception
- `langdetect` is not installed
### PreprocessedQuery Dataclass
```python
@dataclass
class PreprocessedQuery:
raw_query: str # Original, unmodified input
processed_query: str # Cleaned for retrieval
query_type: str # factual | summary | compare
language: str # ISO 639-1
```
---
## WhisperTranscriber
**File:** [voicevault/asr/whisper_transcriber.py](../voicevault/asr/whisper_transcriber.py)
### Model Selection Strategy
```
Has GPU?
YES β†’ openai/whisper-large-v3 (3GB VRAM, best quality, 99 languages)
If load fails β†’
NO β†’ distil-whisper/distil-large-v3 (CPU-friendly, 6Γ— faster, within 1% WER)
```
Both model IDs are configurable via `cfg.whisper_model` and `cfg.distil_whisper_model`.
### Transcription Flow
```python
transcribe(audio_path: Path) β†’ TranscriptResult:
1. audio_path.exists() β€” raise WhisperTranscriberError if missing
2. _vad_check(audio_path) β€” reject silent/too-short audio
3. _get_pipeline() β€” lazy load (GPU β†’ CPU cascade)
4. pipeline(str(audio_path), return_timestamps=False,
generate_kwargs={"language": "english"})
5. preprocessor.process(raw_transcript) β€” clean + classify
6. return TranscriptResult
```
### Voice Activity Detection (VAD)
Prevents Whisper from hallucinating text on silent input β€” a known failure mode.
```python
_vad_check(audio_path):
if soundfile available:
duration_s = len(data) / sample_rate
if duration_s < 1.0: raise WhisperTranscriberError("Audio too short")
rms = sqrt(mean(data**2))
if rms < 0.001: raise WhisperTranscriberError("No speech detected")
else:
if file_size < 1000 bytes: raise WhisperTranscriberError("File appears empty")
```
Non-critical VAD failures (unexpected exceptions) are silently logged β€” transcription proceeds. This ensures resilience on unusual audio formats.
### Lazy Pipeline Loading
The HuggingFace pipeline is loaded only on the first `transcribe()` call. This:
- Avoids slow startup on CPU-only machines
- Prevents import-time failures when torch/transformers are unavailable
- Allows `is_ready()` to accurately report whether the model is loaded
```python
def is_ready(self) -> bool:
return self._pipeline is not None
```
### TranscriptResult Schema
```python
class TranscriptResult(BaseModel):
transcript: str # Cleaned, normalized text (ready for retrieval)
raw_transcript: str # Original Whisper output (for debugging/audit)
language: str # ISO 639-1
confidence: float # Always 1.0 (Whisper has no per-utterance score)
model_used: str # "openai/whisper-large-v3" or distil variant
latency_ms: int # Wall-clock transcription time
query_type: str # factual | summary | compare
```
---
## Security Decisions
### No Audio Storage
Audio files are processed in-memory and never persisted beyond the request lifecycle. Only the SHA-256 of the query text is stored in SQLite (for PII protection).
### VAD as DoS Protection
Rejecting silent audio prevents Whisper from consuming GPU compute on empty input β€” important when running on shared HuggingFace Spaces hardware.
### Fixed Language Parameter
`generate_kwargs={"language": "english"}` is passed to Whisper. This improves WER on English technical vocabulary and prevents the model from guessing the language from noise.
---
## Test Coverage
**File:** [tests/test_phase3.py](../tests/test_phase3.py) | **47 collected, 45 passed, 2 skipped**
| Class | Tests | Coverage |
|-------|-------|---------|
| `TestQueryPreprocessorNormalization` | 11 | Lowercase, whitespace, filler removal (um/uh/like/so/you know), edge cases |
| `TestQueryPreprocessorIntentClassification` | 14 | All three types, priority ordering, default fallback |
| `TestQueryPreprocessorLanguageDetection` | 3 | English detection, short query fallback, return type |
| `TestWhisperTranscriberVAD` | 3 | Valid audio passes, silent audio raises, missing file raises |
| `TestWhisperTranscriberMocked` | 10 | Return type, transcript cleaning, query type classification, model_used, latency, raw_transcript preservation, is_ready(), pipeline error handling |
| `TestTranscriptResultModel` | 3 | Default confidence=1.0, default query_type="factual", default language="en" |
**Skipped tests:** `test_valid_audio_passes_vad` and `test_silent_audio_raises` β€” these require `soundfile` which is not installed in the test environment. They pass in production (HuggingFace Spaces has soundfile).
### Key Mock Pattern
Tests inject a pre-built `MagicMock` pipeline directly into `transcriber._pipeline`, bypassing the 3GB model download entirely:
```python
transcriber = WhisperTranscriber(force_cpu=True)
transcriber._pipeline = MagicMock(return_value={"text": "what is machine learning"})
transcriber._model_used = "mock-whisper"
```
---
## Integration with Phase 4
`WhisperTranscriber.transcribe()` returns a `TranscriptResult` where:
- `transcript` β†’ passed as `query` to `HybridRetriever` and `AnswerChain`
- `query_type` β†’ drives retrieval strategy (factual/summary/compare) and LLM token budget
- `language` β†’ stored in `QuerySession` for analytics
The preprocessed query flows through the entire pipeline without further transformation.