Spaces:

NinjainPJs
/

VoiceVault

Running

App Files Files Community

VoiceVault / DOCS /phase3_asr.md

NinjainPJs

Initial release: VoiceVault v1.0.0 — Voice-First RAG Knowledge Agent

85f900d 3 months ago

preview code

raw

history blame contribute delete

7.56 kB

	# Phase 3 — ASR & Voice Input

	Status: ✅ Complete \| Tests: 45 passed, 2 skipped (VAD — soundfile not installed) \| Files: 2 modules

	---

	## What Was Built

	Phase 3 adds the voice input pipeline: raw audio → Whisper transcription → cleaned, classified query.

	Two modules were implemented:

	\| Module \| Responsibility \|
	\|--------\|----------------\|
	\| `voicevault/asr/query_preprocessor.py` \| Cleans Whisper transcripts; classifies intent \|
	\| `voicevault/asr/whisper_transcriber.py` \| GPU/CPU Whisper transcription with VAD \|

	---

	## QueryPreprocessor

	File: [voicevault/asr/query_preprocessor.py](../voicevault/asr/query_preprocessor.py)

	### Pipeline (in order)

	```
	raw_query
	→ _normalize() # lowercase + collapse whitespace
	→ _remove_fillers() # strip spoken filler words
	→ _repair_punctuation()# clean leading/trailing commas, dots
	→ _detect_language() # langdetect ISO 639-1
	→ _classify_intent() # factual \| summary \| compare
	→ PreprocessedQuery
	```

	### Filler Word Removal

	Multi-word fillers are removed first (longer patterns take priority) to prevent partial matches:

	```python
	_FILLER_WORDS = {
	"um", "uh", "er", "ah", "eh", "like", "you know",
	"i mean", "basically", "literally", "actually", "right",
	"so", "well", "okay", "ok",
	}
	```

	`"you know"` is matched before `"you"` or `"know"` individually. All fillers use `\b` word boundaries (or `(?<!\w)..(?!\w)` for multi-word) to prevent stripping substrings of legitimate words.

	### Intent Classification

	Priority: compare > summary > factual (most specific wins):

	\| Type \| Trigger patterns \| Example \|
	\|------\|-----------------\|---------\|
	\| `compare` \| `\bcompar(e\\|ing)\b`, `\bdifferen(ce\\|t)\b`, `\bversus\b`, `\bvs\.?\b` \| "compare BERT and GPT" \|
	\| `summary` \| `^summar(ise\\|ize)`, `describe`, `explain`, `overview` \| "summarize the paper" \|
	\| `factual` \| `^what is`, `^who`, `^when`, `^where`, `^how many` \| "what is accuracy?" \|

	Unknown queries default to `factual` — the safest retrieval strategy.

	### Language Detection

	Uses `langdetect` (offline, lightweight). Falls back to `"en"` when:
	- text has < 3 words (too short to detect reliably)
	- `langdetect` raises any exception
	- `langdetect` is not installed

	### PreprocessedQuery Dataclass

	```python
	@dataclass
	class PreprocessedQuery:
	raw_query: str # Original, unmodified input
	processed_query: str # Cleaned for retrieval
	query_type: str # factual \| summary \| compare
	language: str # ISO 639-1
	```

	---

	## WhisperTranscriber

	File: [voicevault/asr/whisper_transcriber.py](../voicevault/asr/whisper_transcriber.py)

	### Model Selection Strategy

	```
	Has GPU?
	YES → openai/whisper-large-v3 (3GB VRAM, best quality, 99 languages)
	If load fails →
	NO → distil-whisper/distil-large-v3 (CPU-friendly, 6× faster, within 1% WER)
	```

	Both model IDs are configurable via `cfg.whisper_model` and `cfg.distil_whisper_model`.

	### Transcription Flow

	```python
	transcribe(audio_path: Path) → TranscriptResult:
	1. audio_path.exists() — raise WhisperTranscriberError if missing
	2. _vad_check(audio_path) — reject silent/too-short audio
	3. _get_pipeline() — lazy load (GPU → CPU cascade)
	4. pipeline(str(audio_path), return_timestamps=False,
	generate_kwargs={"language": "english"})
	5. preprocessor.process(raw_transcript) — clean + classify
	6. return TranscriptResult
	```

	### Voice Activity Detection (VAD)

	Prevents Whisper from hallucinating text on silent input — a known failure mode.

	```python
	_vad_check(audio_path):
	if soundfile available:
	duration_s = len(data) / sample_rate
	if duration_s < 1.0: raise WhisperTranscriberError("Audio too short")
	rms = sqrt(mean(data**2))
	if rms < 0.001: raise WhisperTranscriberError("No speech detected")
	else:
	if file_size < 1000 bytes: raise WhisperTranscriberError("File appears empty")
	```

	Non-critical VAD failures (unexpected exceptions) are silently logged — transcription proceeds. This ensures resilience on unusual audio formats.

	### Lazy Pipeline Loading

	The HuggingFace pipeline is loaded only on the first `transcribe()` call. This:
	- Avoids slow startup on CPU-only machines
	- Prevents import-time failures when torch/transformers are unavailable
	- Allows `is_ready()` to accurately report whether the model is loaded

	```python
	def is_ready(self) -> bool:
	return self._pipeline is not None
	```

	### TranscriptResult Schema

	```python
	class TranscriptResult(BaseModel):
	transcript: str # Cleaned, normalized text (ready for retrieval)
	raw_transcript: str # Original Whisper output (for debugging/audit)
	language: str # ISO 639-1
	confidence: float # Always 1.0 (Whisper has no per-utterance score)
	model_used: str # "openai/whisper-large-v3" or distil variant
	latency_ms: int # Wall-clock transcription time
	query_type: str # factual \| summary \| compare
	```

	---

	## Security Decisions

	### No Audio Storage
	Audio files are processed in-memory and never persisted beyond the request lifecycle. Only the SHA-256 of the query text is stored in SQLite (for PII protection).

	### VAD as DoS Protection
	Rejecting silent audio prevents Whisper from consuming GPU compute on empty input — important when running on shared HuggingFace Spaces hardware.

	### Fixed Language Parameter
	`generate_kwargs={"language": "english"}` is passed to Whisper. This improves WER on English technical vocabulary and prevents the model from guessing the language from noise.

	---

	## Test Coverage

	File: [tests/test_phase3.py](../tests/test_phase3.py) \| 47 collected, 45 passed, 2 skipped

	\| Class \| Tests \| Coverage \|
	\|-------\|-------\|---------\|
	\| `TestQueryPreprocessorNormalization` \| 11 \| Lowercase, whitespace, filler removal (um/uh/like/so/you know), edge cases \|
	\| `TestQueryPreprocessorIntentClassification` \| 14 \| All three types, priority ordering, default fallback \|
	\| `TestQueryPreprocessorLanguageDetection` \| 3 \| English detection, short query fallback, return type \|
	\| `TestWhisperTranscriberVAD` \| 3 \| Valid audio passes, silent audio raises, missing file raises \|
	\| `TestWhisperTranscriberMocked` \| 10 \| Return type, transcript cleaning, query type classification, model_used, latency, raw_transcript preservation, is_ready(), pipeline error handling \|
	\| `TestTranscriptResultModel` \| 3 \| Default confidence=1.0, default query_type="factual", default language="en" \|

	Skipped tests: `test_valid_audio_passes_vad` and `test_silent_audio_raises` — these require `soundfile` which is not installed in the test environment. They pass in production (HuggingFace Spaces has soundfile).

	### Key Mock Pattern

	Tests inject a pre-built `MagicMock` pipeline directly into `transcriber._pipeline`, bypassing the 3GB model download entirely:

	```python
	transcriber = WhisperTranscriber(force_cpu=True)
	transcriber._pipeline = MagicMock(return_value={"text": "what is machine learning"})
	transcriber._model_used = "mock-whisper"
	```

	---

	## Integration with Phase 4

	`WhisperTranscriber.transcribe()` returns a `TranscriptResult` where:
	- `transcript` → passed as `query` to `HybridRetriever` and `AnswerChain`
	- `query_type` → drives retrieval strategy (factual/summary/compare) and LLM token budget
	- `language` → stored in `QuerySession` for analytics

	The preprocessed query flows through the entire pipeline without further transformation.