# Phase 3 — ASR & Voice Input

**Status:** ✅ Complete | **Tests:** 45 passed, 2 skipped (VAD — soundfile not installed) | **Files:** 2 modules

---

## What Was Built

Phase 3 adds the voice input pipeline: raw audio → Whisper transcription → cleaned, classified query.

Two modules were implemented:

| Module | Responsibility |
|--------|----------------|
| `voicevault/asr/query_preprocessor.py` | Cleans Whisper transcripts; classifies intent |
| `voicevault/asr/whisper_transcriber.py` | GPU/CPU Whisper transcription with VAD |

---

## QueryPreprocessor

**File:** [voicevault/asr/query_preprocessor.py](../voicevault/asr/query_preprocessor.py)

### Pipeline (in order)

```
raw_query
  → _normalize()         # lowercase + collapse whitespace
  → _remove_fillers()    # strip spoken filler words
  → _repair_punctuation()# clean leading/trailing commas, dots
  → _detect_language()   # langdetect ISO 639-1
  → _classify_intent()   # factual | summary | compare
→ PreprocessedQuery
```

### Filler Word Removal

Multi-word fillers are removed first (longer patterns take priority) to prevent partial matches:

```python
_FILLER_WORDS = {
    "um", "uh", "er", "ah", "eh", "like", "you know",
    "i mean", "basically", "literally", "actually", "right",
    "so", "well", "okay", "ok",
}
```

`"you know"` is matched before `"you"` or `"know"` individually. All fillers use `\b` word boundaries (or `(?<!\w)..(?!\w)` for multi-word) to prevent stripping substrings of legitimate words.

### Intent Classification

Priority: **compare > summary > factual** (most specific wins):

| Type | Trigger patterns | Example |
|------|-----------------|---------|
| `compare` | `\bcompar(e\|ing)\b`, `\bdifferen(ce\|t)\b`, `\bversus\b`, `\bvs\.?\b` | "compare BERT and GPT" |
| `summary` | `^summar(ise\|ize)`, `describe`, `explain`, `overview` | "summarize the paper" |
| `factual` | `^what is`, `^who`, `^when`, `^where`, `^how many` | "what is accuracy?" |

Unknown queries default to `factual` — the safest retrieval strategy.

### Language Detection

Uses `langdetect` (offline, lightweight). Falls back to `"en"` when:
- text has < 3 words (too short to detect reliably)
- `langdetect` raises any exception
- `langdetect` is not installed

### PreprocessedQuery Dataclass

```python
@dataclass
class PreprocessedQuery:
    raw_query: str        # Original, unmodified input
    processed_query: str  # Cleaned for retrieval
    query_type: str       # factual | summary | compare
    language: str         # ISO 639-1
```

---

## WhisperTranscriber

**File:** [voicevault/asr/whisper_transcriber.py](../voicevault/asr/whisper_transcriber.py)

### Model Selection Strategy

```
Has GPU?
  YES → openai/whisper-large-v3     (3GB VRAM, best quality, 99 languages)
        If load fails →
  NO  → distil-whisper/distil-large-v3  (CPU-friendly, 6× faster, within 1% WER)
```

Both model IDs are configurable via `cfg.whisper_model` and `cfg.distil_whisper_model`.

### Transcription Flow

```python
transcribe(audio_path: Path) → TranscriptResult:
    1. audio_path.exists()  — raise WhisperTranscriberError if missing
    2. _vad_check(audio_path)  — reject silent/too-short audio
    3. _get_pipeline()  — lazy load (GPU → CPU cascade)
    4. pipeline(str(audio_path), return_timestamps=False,
                generate_kwargs={"language": "english"})
    5. preprocessor.process(raw_transcript)  — clean + classify
    6. return TranscriptResult
```

### Voice Activity Detection (VAD)

Prevents Whisper from hallucinating text on silent input — a known failure mode.

```python
_vad_check(audio_path):
    if soundfile available:
        duration_s = len(data) / sample_rate
        if duration_s < 1.0:  raise WhisperTranscriberError("Audio too short")
        rms = sqrt(mean(data**2))
        if rms < 0.001:  raise WhisperTranscriberError("No speech detected")
    else:
        if file_size < 1000 bytes:  raise WhisperTranscriberError("File appears empty")
```

Non-critical VAD failures (unexpected exceptions) are silently logged — transcription proceeds. This ensures resilience on unusual audio formats.

### Lazy Pipeline Loading

The HuggingFace pipeline is loaded only on the first `transcribe()` call. This:
- Avoids slow startup on CPU-only machines
- Prevents import-time failures when torch/transformers are unavailable
- Allows `is_ready()` to accurately report whether the model is loaded

```python
def is_ready(self) -> bool:
    return self._pipeline is not None
```

### TranscriptResult Schema

```python
class TranscriptResult(BaseModel):
    transcript: str       # Cleaned, normalized text (ready for retrieval)
    raw_transcript: str   # Original Whisper output (for debugging/audit)
    language: str         # ISO 639-1
    confidence: float     # Always 1.0 (Whisper has no per-utterance score)
    model_used: str       # "openai/whisper-large-v3" or distil variant
    latency_ms: int       # Wall-clock transcription time
    query_type: str       # factual | summary | compare
```

---

## Security Decisions

### No Audio Storage
Audio files are processed in-memory and never persisted beyond the request lifecycle. Only the SHA-256 of the query text is stored in SQLite (for PII protection).

### VAD as DoS Protection
Rejecting silent audio prevents Whisper from consuming GPU compute on empty input — important when running on shared HuggingFace Spaces hardware.

### Fixed Language Parameter
`generate_kwargs={"language": "english"}` is passed to Whisper. This improves WER on English technical vocabulary and prevents the model from guessing the language from noise.

---

## Test Coverage

**File:** [tests/test_phase3.py](../tests/test_phase3.py) | **47 collected, 45 passed, 2 skipped**

| Class | Tests | Coverage |
|-------|-------|---------|
| `TestQueryPreprocessorNormalization` | 11 | Lowercase, whitespace, filler removal (um/uh/like/so/you know), edge cases |
| `TestQueryPreprocessorIntentClassification` | 14 | All three types, priority ordering, default fallback |
| `TestQueryPreprocessorLanguageDetection` | 3 | English detection, short query fallback, return type |
| `TestWhisperTranscriberVAD` | 3 | Valid audio passes, silent audio raises, missing file raises |
| `TestWhisperTranscriberMocked` | 10 | Return type, transcript cleaning, query type classification, model_used, latency, raw_transcript preservation, is_ready(), pipeline error handling |
| `TestTranscriptResultModel` | 3 | Default confidence=1.0, default query_type="factual", default language="en" |

**Skipped tests:** `test_valid_audio_passes_vad` and `test_silent_audio_raises` — these require `soundfile` which is not installed in the test environment. They pass in production (HuggingFace Spaces has soundfile).

### Key Mock Pattern

Tests inject a pre-built `MagicMock` pipeline directly into `transcriber._pipeline`, bypassing the 3GB model download entirely:

```python
transcriber = WhisperTranscriber(force_cpu=True)
transcriber._pipeline = MagicMock(return_value={"text": "what is machine learning"})
transcriber._model_used = "mock-whisper"
```

---

## Integration with Phase 4

`WhisperTranscriber.transcribe()` returns a `TranscriptResult` where:
- `transcript` → passed as `query` to `HybridRetriever` and `AnswerChain`
- `query_type` → drives retrieval strategy (factual/summary/compare) and LLM token budget
- `language` → stored in `QuerySession` for analytics

The preprocessed query flows through the entire pipeline without further transformation.