# Phase 4 — LLM Generation Chain

**Status:** ✅ Complete | **Tests:** 72/72 passed | **Files:** 3 modules

---

## What Was Built

Phase 4 implements the generation layer: context + query → grounded, cited answer.

| Module | Responsibility |
|--------|----------------|
| `voicevault/generation/answer_chain.py` | LangChain LCEL chain, Groq → Gemini fallback |
| `voicevault/generation/citation_injector.py` | Parse + resolve `[Source: ...]` markers |
| `voicevault/generation/faithfulness_guard.py` | Refusal detection + confidence scoring |

---

## FaithfulnessGuard

**File:** [voicevault/generation/faithfulness_guard.py](../voicevault/generation/faithfulness_guard.py)

Two-layer hallucination prevention:

### Layer 1 — System Prompt Instruction

The LLM is instructed to use a fixed refusal phrase when the answer is not in context:

```python
REFUSAL_PHRASE = "I could not find this in your documents."
```

This phrase is embedded in the system prompt via `build_system_prompt()`. The instruction is unambiguous: use *exactly this phrase and nothing else*.

### Layer 2 — Post-Generation Check

After the LLM responds, `is_refusal()` verifies the instruction was followed:

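```python
import re

# Escape the phrase and drop the trailing period so both forms match.
_REFUSAL_PATTERN = re.compile(
    re.escape(REFUSAL_PHRASE.lower().rstrip(".")),
    re.IGNORECASE,
)

# Method of FaithfulnessGuard (class body omitted).
def is_refusal(self, answer: str) -> bool:
    # An empty or missing answer is treated as a refusal.
    if not answer:
        return True
    return bool(_REFUSAL_PATTERN.search(answer))
```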

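The trailing period is stripped from the phrase before the pattern is compiled, so both `"...documents."` and `"...documents"` match. Matching is case-insensitive, and because `search()` is used, the phrase is detected anywhere in the answer, not only when it is the entire answer.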

### Confidence Scoring

Based on the top retrieval score across all retrieved chunks:

```
top_score > 0.5  → "high"   (strong retrieval signal)
top_score > 0.2  → "medium" (moderate signal — answer may miss nuance)
top_score ≤ 0.2  → "low"    (weak signal — treat answer with caution)
```

The score used is `rerank_score` (the cross-encoder score) when it is greater than 0; otherwise it falls back to `rrf_score` (reciprocal rank fusion). Empty results yield `"low"`.
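The thresholds and fallback above can be sketched as follows. This is a minimal illustration, not the module's code: `RetrievedChunk`, `effective_score`, and `confidence_level` are hypothetical names, and applying the `rerank_score`/`rrf_score` fallback per chunk (rather than once for the whole result set) is an assumption.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class RetrievedChunk:
    """Hypothetical stand-in for a retrieval result; only the score fields matter here."""
    rerank_score: float = 0.0
    rrf_score: float = 0.0


def effective_score(chunk: RetrievedChunk) -> float:
    # Prefer the cross-encoder score when present (> 0), else fall back to RRF.
    return chunk.rerank_score if chunk.rerank_score > 0 else chunk.rrf_score


def confidence_level(results: Sequence[RetrievedChunk]) -> str:
    # No retrieval results means there is no signal at all.
    if not results:
        return "low"
    top_score = max(effective_score(chunk) for chunk in results)
    if top_score > 0.5:
        return "high"
    if top_score > 0.2:
        return "medium"
    return "low"
```

Note that both comparisons are strict, so a top score of exactly 0.5 maps to `"medium"` and exactly 0.2 maps to `"low"`.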

### System Prompt

`build_system_prompt()` combines two concern areas:

```
CITATION RULES:
  - Cite every factual claim with [Source: filename, p.N] inline.
  - Use exact source names and page numbers from context headers.
  - Do not cite general knowledge.

FAITHFULNESS RULES:
  - If answer not in context: respond with REFUSAL_PHRASE only.
  - Keep factual answers under 150 words.
  - Keep summary answers under 300 words.
```

Word limits prevent verbose answers that dilute citations and increase hallucination risk.
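As a rough sketch of how the two rule groups and `REFUSAL_PHRASE` might be assembled into one string: the rule wording below is paraphrased from the outline above, and the assembly logic is an assumption, not the actual body of `build_system_prompt()`.

```python
REFUSAL_PHRASE = "I could not find this in your documents."

CITATION_RULES = (
    "Cite every factual claim with [Source: filename, p.N] inline.",
    "Use exact source names and page numbers from context headers.",
    "Do not cite general knowledge.",
)


def build_system_prompt() -> str:
    """Assemble both rule groups into one prompt string (wording is illustrative)."""
    faithfulness_rules = (
        f'If the answer is not in the context, respond with exactly "{REFUSAL_PHRASE}" and nothing else.',
        "Keep factual answers under 150 words.",
        "Keep summary answers under 300 words.",
    )
    sections = (
        ("CITATION RULES", CITATION_RULES),
        ("FAITHFULNESS RULES", faithfulness_rules),
    )
    lines = []
    for title, rules in sections:
        lines.append(f"{title}:")
        lines.extend(f"  - {rule}" for rule in rules)
        lines.append("")  # blank line between sections
    return "\n".join(lines).strip()
```

Interpolating the constant, rather than retyping the phrase, keeps the prompt and the `is_refusal()` pattern in sync if the phrase ever changes.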

---

## CitationInjector

**File:** [voicevault/generation/citation_injector.py](../voicevault/generation/citation_injector.py)

Post-processes LLM output to resolve inline markers into structured `Citation` objects.

### Marker Format

The LLM is instructed to use: `[Source: filename, p.N]`

The regex also handles abbreviated forms the LLM might produce:
```python
_CITATION_PATTERN = re.compile(
    r"\[(?:Source:\s*)?([^,\]]+?)(?:,\s*p\.?\s*(\d+))?\]",
    re.IGNORECASE,
)
```
Matches: `[Source: report.pdf, p.3]`, `[report.pdf, p.3]`, `[Source: report]`.
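To see what the pattern captures, the following sketch runs it over a sample answer. The regex is copied from above; the helper `parse_markers` is a hypothetical name used only for illustration.

```python
import re
from typing import Optional

_CITATION_PATTERN = re.compile(
    r"\[(?:Source:\s*)?([^,\]]+?)(?:,\s*p\.?\s*(\d+))?\]",
    re.IGNORECASE,
)


def parse_markers(answer: str) -> list[tuple[str, Optional[int]]]:
    """Return (raw_name, page_num) per marker; page_num is None when omitted."""
    markers = []
    for match in _CITATION_PATTERN.finditer(answer):
        raw_name = match.group(1).strip()
        page_num = int(match.group(2)) if match.group(2) else None
        markers.append((raw_name, page_num))
    return markers


sample = (
    "Revenue rose [Source: report.pdf, p.3] and costs fell "
    "[report.pdf, p.4]; see also [Source: report]."
)
print(parse_markers(sample))
```

Because the `Source:` prefix is optional, the pattern also matches other bracketed text such as `[1]`, which is one reason the resolution step needs a tolerant fallback.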

### Resolution Strategy (4-level cascade)

For each parsed marker `(raw_name, page_num)`:

| Priority | Strategy | Condition |
|----------|----------|-----------|
| 1 | Exact filename + page | `source_file.lower() == raw_name.lower() and page_number == page_num` |
| 2 | Substring filename + page | `raw_name in source_file.lower() and page_number == page_num` |
| 3 | Page number only | `page_number == page_num` |
| 4 | Filename substring (no page) | `raw_name in source_file.lower()` |
| 5 (last resort) | First citation in map | Always |

This cascade handles real-world LLM output variability — models sometimes abbreviate filenames or omit page numbers.
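The cascade in the table can be sketched as an ordered list of predicates, tried from strictest to loosest. This is an illustration under assumptions: the `Citation` class here is a hypothetical two-field stand-in, the citation map is modelled as a plain list, `raw_name` is lowercased once up front, and returning `None` for an empty map is a guess at the real behaviour.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Citation:
    """Hypothetical minimal Citation; the real class likely carries more fields."""
    source_file: str
    page_number: int


def resolve(raw_name: str, page_num: Optional[int],
            citations: list[Citation]) -> Optional[Citation]:
    """Walk the cascade and return the first citation that matches."""
    if not citations:
        return None  # assumption: nothing to resolve against
    name = raw_name.lower()
    strategies = [
        # 1: exact filename + page
        lambda c: c.source_file.lower() == name and c.page_number == page_num,
        # 2: substring filename + page
        lambda c: name in c.source_file.lower() and c.page_number == page_num,
        # 3: page number only
        lambda c: page_num is not None and c.page_number == page_num,
        # 4: filename substring, page ignored
        lambda c: name in c.source_file.lower(),
    ]
    for matches in strategies:
        for candidate in citations:
            if matches(candidate):
                return candidate
    return citations[0]  # 5: last resort
```

Ordering the predicates in a list keeps the priority explicit and makes it cheap to add or reorder strategies.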

### Deduplication

A `seen_keys: set[tuple[str, int]]` tracks `(source_file, page_number)` pairs. The same source/page cited multiple times resolves to one `Citation` in the output list.
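A minimal sketch of that deduplication step, operating directly on `(source_file, page_number)` keys. The function name `dedupe` is hypothetical, and preserving first-seen order is an assumption.

```python
def dedupe(keys: list[tuple[str, int]]) -> list[tuple[str, int]]:
    """Drop repeated (source_file, page_number) pairs, keeping first-seen order."""
    seen_keys: set[tuple[str, int]] = set()
    unique = []
    for key in keys:
        if key in seen_keys:
            continue  # this source/page was already cited
        seen_keys.add(key)
        unique.append(key)
    return unique
```

The same file cited on a different page counts as a separate citation, since the key includes the page number.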

### Output Contract

```python
inject(answer, citation_map) → (answer_text, resolved_citations)
```

The answer text is **preserved with markers** — they are not stripped. The UI displays both the inline `[Source: ...]` text and the structured citation panel below the answer.

---

## AnswerChain

**File:** [voicevault/generation/answer_chain.py](../voicevault/generation/answer_chain.py)

### LLM Selection

```
GROQ_API_KEY set?
  YES → ChatGroq(model=llama-3.1-70b-versatile)
        If invoke() raises → fall back to Gemini
  NO  → ChatGoogleGenerativeAI(model=gemini-1.5-flash)
        If invoke() raises (or no Gemini key is set) →
        Return REFUSAL_PHRASE with model_used="none" (no crash)
```

Both LLMs are constructed fresh per call (not cached) — `max_tokens` varies by `query_type` and LangChain model instances are lightweight.
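The selection and fallback logic can be sketched without any provider SDK. In this illustration each provider is represented by a hypothetical "builder" callable that returns a ready-to-call model, or `None` when its key is absent; the real chain uses `_build_groq()`, `_build_gemini()`, and LangChain's `invoke()` instead.

```python
from typing import Callable, Optional

REFUSAL_PHRASE = "I could not find this in your documents."

# Hypothetical abstractions: a model is a callable prompt -> answer,
# and a builder returns a model or None when no API key is configured.
Invoke = Callable[[str], str]
Builder = Callable[[], Optional[Invoke]]


def invoke_with_fallback(prompt: str,
                         builders: list[tuple[str, Builder]]) -> tuple[str, str]:
    """Try each (model_id, builder) in order; return (answer, model_used)."""
    for model_id, build in builders:
        try:
            llm = build()
            if llm is None:
                continue  # key not configured: try the next provider
            return llm(prompt), model_id
        except Exception:
            continue  # provider error: fall through to the next option
    # Nothing was available or everything failed: refuse instead of raising.
    return REFUSAL_PHRASE, "none"
```

Building the model inside the loop mirrors the per-call construction described above, so a per-request `max_tokens` could be passed to the builder.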

### Message Layout

```
[SystemMessage]  ← FaithfulnessGuard.build_system_prompt()
[HumanMessage]   ← history turn 1 (oldest within window)
[AIMessage]      ← history turn 1 response
...              ← up to cfg.conversation_window pairs
[HumanMessage]   ← "Context:\n{context}\n\nQuestion: {query}"
```

History is capped at `cfg.conversation_window` (default 5) to keep prompt size predictable.
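The layout and window cap can be sketched with plain `(role, content)` tuples standing in for `SystemMessage`, `HumanMessage`, and `AIMessage`. The function name `build_messages` and the shape of `history` (a list of question/answer pairs, oldest first) are assumptions for illustration.

```python
def build_messages(system_prompt: str,
                   history: list[tuple[str, str]],
                   context: str,
                   query: str,
                   window: int = 5) -> list[tuple[str, str]]:
    """Return messages in the order: system, recent history pairs, final human turn."""
    messages = [("system", system_prompt)]
    # history[-0:] would return the whole list, so guard the zero case explicitly.
    recent = history[-window:] if window > 0 else []
    for user_turn, ai_turn in recent:
        messages.append(("human", user_turn))
        messages.append(("ai", ai_turn))
    # Context travels in the final user message, never in the system prompt.
    messages.append(("human", f"Context:\n{context}\n\nQuestion: {query}"))
    return messages
```

With a window of 5, the prompt never holds more than 12 messages: one system message, ten history messages, and the final question.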

### Token Budget by Query Type

```python
factual  → cfg.max_answer_tokens        (default 500 tokens)
summary  → cfg.max_answer_tokens × 2    (default 1000 tokens)
compare  → cfg.max_answer_tokens        (default 500 tokens)
```

Summaries need more room for comprehensive coverage.
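Expressed as a function, the mapping is a single branch. The name `max_tokens_for` is hypothetical, and treating any unrecognised query type like `factual` is an assumption.

```python
def max_tokens_for(query_type: str, max_answer_tokens: int = 500) -> int:
    """Return the output-token budget for a query type."""
    if query_type == "summary":
        return max_answer_tokens * 2  # summaries get double the budget
    # "factual", "compare", and (by assumption) anything else use the base budget.
    return max_answer_tokens
```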

### generate() Flow

```python
generate(query, context, citation_map, history, query_type) → GenerationResult:
    1. _build_messages()          → LangChain message list
    2. _invoke_with_fallback()    → raw_answer, model_used, tokens_used
    3. CitationInjector.inject()  → answer_text (markers kept), citations
    4. FaithfulnessGuard.is_refusal() → is_refusal flag
    5. _confidence_from_citations()   → "high" | "medium" | "low"
    6. return GenerationResult
```

### stream_generate() Flow

```python
stream_generate(...) → Generator[str, None, None]:
    1. _build_messages()
    2. _build_groq() or _build_gemini()  ← first available
    3. for chunk in llm.stream(messages): yield chunk.content
    4. On error: yield error message (never raises)
```

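Streaming is used by the Gradio UI to show tokens as they arrive. Citation injection and the faithfulness check are not applied to streamed chunks; call `generate()` once streaming completes to get the structured result. Because `generate()` runs its own `_invoke_with_fallback()` step, that call goes to the LLM again rather than reusing the streamed text.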
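The "never raises" contract of the streaming path can be sketched as a generator that converts exceptions into a final yielded message. `stream_answer` and the error text are hypothetical; the chunks only need a `.content` attribute, as the chunks from `llm.stream()` do in the flow above.

```python
from types import SimpleNamespace
from typing import Iterable, Iterator


def stream_answer(chunks: Iterable) -> Iterator[str]:
    """Yield non-empty chunk text; on failure, yield an error message instead of raising."""
    try:
        for chunk in chunks:
            if chunk.content:  # skip empty chunks
                yield chunk.content
    except Exception as exc:
        # Hypothetical error format; the point is that the caller never sees an exception.
        yield f"[generation error: {exc}]"


fake_chunks = [
    SimpleNamespace(content="Hel"),
    SimpleNamespace(content=""),
    SimpleNamespace(content="lo"),
]
print("".join(stream_answer(fake_chunks)))  # prints "Hello"
```

Yielding the error as text lets a UI loop append it to the partial answer without extra exception handling.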

### GenerationResult

```python
from dataclasses import dataclass

@dataclass
class GenerationResult:
    answer: str                # Final answer with inline [Source: ...] markers
    citations: list[Citation]  # Resolved, deduplicated citations
    confidence_level: str      # "high" | "medium" | "low"
    is_refusal: bool           # True if the answer is empty or contains REFUSAL_PHRASE
    model_used: str            # "llama-3.1-70b-versatile" | "gemini-1.5-flash" | "none"
    tokens_used: int           # Total tokens (input + output); 0 if unavailable
    latency_ms: int            # Wall-clock LLM call time in ms
```

### Token Extraction

```python
def _extract_tokens(response) -> int:
    try:
        return int(response.usage_metadata.get("total_tokens", 0))
    except (AttributeError, TypeError):
        return 0
```

`usage_metadata` is the LangChain standard; both Groq and Gemini backends populate it. Returns 0 gracefully when unavailable (e.g., during streaming or with older SDK versions).

---

## Security Decisions

### No Prompt Injection Through Context

Context is injected as a plain string inside the user message, not in the system prompt. The system prompt is hardcoded and never receives user-controlled input. This limits the prompt injection attack surface: a malicious document cannot rewrite the faithfulness or citation instructions themselves, although text in the context can still try to steer the model from the user turn.

### Refusal as Default

When no LLM is configured (both keys absent) or both calls fail, the chain returns `REFUSAL_PHRASE` with `model_used="none"`. The application continues running — it never crashes due to missing API keys.

### No PII in LLM Calls

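The query text passed to the LLM is the preprocessed version from `QueryPreprocessor` (fillers stripped, lowercased). The raw Whisper transcript is never sent to the LLM. This normalization limits what leaves the application, but it does not by itself redact personal data spoken as part of the query.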

---

## Test Coverage

**File:** [tests/test_phase4.py](../tests/test_phase4.py) | **72/72 passed**

| Class | Tests | What's verified |
|-------|-------|----------------|
| `TestCitationInjectorBasic` | 8 | Empty input, no markers, exact match, multiple markers, dedup, text preservation, order, empty map |
| `TestCitationInjectorMatchingStrategies` | 5 | All 4 strategies + last resort |
| `TestFaithfulnessGuardRefusal` | 7 | Exact phrase, case insensitive, embedded, normal answer, empty, partial, no-period |
| `TestFaithfulnessGuardConfidence` | 10 | Empty results, high/medium/low thresholds, max across results, rrf fallback, all 4 boundary conditions |
| `TestFaithfulnessGuardSystemPrompt` | 6 | Refusal phrase present, citation rules, faithfulness rules, length |
| `TestAnswerChainMessageBuilding` | 7 | SystemMessage first, HumanMessage last, context in body, history pairs, window cap, no-history length |
| `TestAnswerChainMaxTokens` | 3 | factual/summary/compare budgets |
| `TestAnswerChainTokenExtraction` | 4 | Valid metadata, None metadata, missing attribute, type error |
| `TestAnswerChainConfidenceFromCitations` | 5 | Empty, high/medium/low thresholds, max across citations |
| `TestAnswerChainGenerateMocked` | 7 | Returns correct type, answer content, latency, tokens, refusal detection, non-refusal, citation resolution |
| `TestAnswerChainFallback` | 4 | Gemini fallback on Groq failure, refusal when both fail, refusal when no keys, Groq preferred |
| `TestAnswerChainStreaming` | 4 | Yields chunks, skips empty chunks, refusal when no LLM, error on exception |
| `TestGenerationResult` | 2 | Instantiation, mutable citations list |

### Mocking Strategy

No real API keys are needed. Tests patch `_build_groq` and `_build_gemini` at the instance level to return `MagicMock` LLMs with controlled responses:

```python
mock_llm = MagicMock()
mock_llm.invoke.return_value = mock_response  # or .side_effect = RuntimeError(...)
with patch.object(chain, "_build_groq", return_value=mock_llm):
    result = chain.generate(...)
```

---

## Integration Points

### Called by (Phase 5 orchestrator)

```python
# In the query handler:
results = retriever.search(query, kb_names)
context, citation_map = builder.build(results)
generation = chain.generate(
    query=transcript.transcript,
    context=context,
    citation_map=citation_map,
    history=session.history,
    query_type=transcript.query_type,
)
# generation.answer → display + TTS
# generation.citations → citation panel
# generation.is_refusal → skip TTS if True
# generation.tokens_used → store in QuerySession
```

### Dependencies

| Dep | Purpose |
|-----|---------|
| `langchain-core` | `HumanMessage`, `AIMessage`, `SystemMessage` |
| `langchain-groq` | `ChatGroq` client |
| `langchain-google-genai` | `ChatGoogleGenerativeAI` client |
| `FaithfulnessGuard` | System prompt + refusal detection |
| `CitationInjector` | Marker parsing + resolution |