# Phase 4: LLM Generation Chain

**Status:** ✅ Complete | **Tests:** 72/72 passed | **Files:** 3 modules

---

## What Was Built

Phase 4 implements the generation layer: context + query → grounded, cited answer.

| Module | Responsibility |
|--------|----------------|
| `voicevault/generation/answer_chain.py` | LangChain LCEL chain, Groq → Gemini fallback |
| `voicevault/generation/citation_injector.py` | Parse + resolve `[Source: ...]` markers |
| `voicevault/generation/faithfulness_guard.py` | Refusal detection + confidence scoring |

---

## FaithfulnessGuard

**File:** [voicevault/generation/faithfulness_guard.py](../voicevault/generation/faithfulness_guard.py)

Two-layer hallucination prevention:

### Layer 1: System Prompt Instruction

The LLM is instructed to use a fixed refusal phrase when the answer is not in context:

```
REFUSAL_PHRASE = "I could not find this in your documents."
```

This phrase is embedded in the system prompt via `build_system_prompt()`. The instruction is unambiguous: use *exactly this phrase and nothing else*.

### Layer 2: Post-Generation Check

After the LLM responds, `is_refusal()` verifies the instruction was followed:

```python
import re

_REFUSAL_PATTERN = re.compile(
    re.escape(REFUSAL_PHRASE.lower().rstrip(".")),
    re.IGNORECASE,
)

# Method on FaithfulnessGuard
def is_refusal(self, answer: str) -> bool:
    if not answer:
        return True  # an empty answer is treated as a refusal
    return bool(_REFUSAL_PATTERN.search(answer))
```

The pattern drops the trailing period before matching, so it covers both the `"...documents."` and `"...documents"` forms. Matching is case-insensitive for robustness, and because it uses `search()`, the phrase is also detected when it is embedded in a longer answer.

### Confidence Scoring

Based on the top retrieval score across all retrieved chunks:

```
top_score > 0.5  → "high"    (strong retrieval signal)
top_score > 0.2  → "medium"  (moderate signal; answer may miss nuance)
top_score ≤ 0.2  → "low"     (weak signal; treat answer with caution)
```

Uses `rerank_score` (the cross-encoder score) if it is > 0, and falls back to `rrf_score` (RRF) otherwise. Empty results → `"low"`.
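The thresholds above can be sketched as follows. This is a minimal illustration, not the module's actual code: `RetrievedChunk`, `effective_score`, and `confidence_level` are hypothetical names, and applying the `rerank_score` → `rrf_score` fallback per chunk (rather than once over the whole result set) is an assumption.

```python
from dataclasses import dataclass


@dataclass
class RetrievedChunk:
    """Hypothetical stand-in for a retrieval result; only the two score fields matter here."""
    rerank_score: float = 0.0
    rrf_score: float = 0.0


def effective_score(chunk: RetrievedChunk) -> float:
    # Prefer the cross-encoder score when it is present (> 0), otherwise use the RRF score.
    return chunk.rerank_score if chunk.rerank_score > 0 else chunk.rrf_score


def confidence_level(results: list[RetrievedChunk]) -> str:
    # Empty results carry no retrieval signal at all.
    if not results:
        return "low"
    top_score = max(effective_score(chunk) for chunk in results)
    if top_score > 0.5:
        return "high"
    if top_score > 0.2:
        return "medium"
    return "low"
```

Both comparisons are strict, so a score of exactly 0.5 maps to `"medium"` and exactly 0.2 maps to `"low"`, matching the table above.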
### System Prompt

`build_system_prompt()` combines two concern areas:

```
CITATION RULES:
- Cite every factual claim with [Source: filename, p.N] inline.
- Use exact source names and page numbers from context headers.
- Do not cite general knowledge.

FAITHFULNESS RULES:
- If answer not in context: respond with REFUSAL_PHRASE only.
- Keep factual answers under 150 words.
- Keep summary answers under 300 words.
```

Word limits discourage verbose answers, which dilute citations and increase hallucination risk.

---

## CitationInjector

**File:** [voicevault/generation/citation_injector.py](../voicevault/generation/citation_injector.py)

Post-processes LLM output to resolve inline markers into structured `Citation` objects.

### Marker Format

The LLM is instructed to use: `[Source: filename, p.N]`

The regex also handles abbreviated forms the LLM might produce:

```python
_CITATION_PATTERN = re.compile(
    r"\[(?:Source:\s*)?([^,\]]+?)(?:,\s*p\.?\s*(\d+))?\]",
    re.IGNORECASE,
)
```

Matches: `[Source: report.pdf, p.3]`, `[report.pdf, p.3]`, `[Source: report]`.

### Resolution Strategy (4-level cascade plus last resort)

For each parsed marker `(raw_name, page_num)`:

| Priority | Strategy | Condition |
|----------|----------|-----------|
| 1 | Exact filename + page | `source_file.lower() == raw_name.lower() and page_number == page_num` |
| 2 | Substring filename + page | `raw_name in source_file.lower() and page_number == page_num` |
| 3 | Page number only | `page_number == page_num` |
| 4 | Filename substring (no page) | `raw_name in source_file.lower()` |
| 5 (last resort) | First citation in map | Always |

This cascade handles real-world variability in LLM output: models sometimes abbreviate filenames or omit page numbers.

### Deduplication

A `seen_keys: set[tuple[str, int]]` tracks `(source_file, page_number)` pairs. The same source/page cited multiple times resolves to one `Citation` in the output list.
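The cascade and deduplication described above could look roughly like this. It is a sketch under stated assumptions: the `Citation` shape, the `resolve` and `resolve_all` helpers, and passing candidates as a plain list (instead of the real `citation_map`) are all hypothetical; only the regex and the priority order come from this document.

```python
import re
from dataclasses import dataclass
from typing import Optional

_CITATION_PATTERN = re.compile(
    r"\[(?:Source:\s*)?([^,\]]+?)(?:,\s*p\.?\s*(\d+))?\]",
    re.IGNORECASE,
)


@dataclass(frozen=True)
class Citation:
    """Hypothetical minimal shape; the real Citation class may carry more fields."""
    source_file: str
    page_number: int


def resolve(raw_name: str, page_num: Optional[int], candidates: list) -> Optional[Citation]:
    name = raw_name.strip().lower()
    has_page = page_num is not None
    # Strategies are tried in priority order; the first candidate that matches wins.
    strategies = [
        lambda c: has_page and c.source_file.lower() == name and c.page_number == page_num,
        lambda c: has_page and name in c.source_file.lower() and c.page_number == page_num,
        lambda c: has_page and c.page_number == page_num,
        lambda c: name in c.source_file.lower(),
    ]
    for matches in strategies:
        for candidate in candidates:
            if matches(candidate):
                return candidate
    # Last resort: fall back to the first known citation, if there is one.
    return candidates[0] if candidates else None


def resolve_all(answer: str, candidates: list) -> list:
    seen_keys: set = set()  # (source_file, page_number) pairs already emitted
    resolved = []
    for raw_name, page in _CITATION_PATTERN.findall(answer):
        citation = resolve(raw_name, int(page) if page else None, candidates)
        if citation is None:
            continue
        key = (citation.source_file, citation.page_number)
        if key not in seen_keys:
            seen_keys.add(key)
            resolved.append(citation)
    return resolved
```

Iterating strategies in the outer loop and candidates in the inner loop means a weaker match is never chosen while a stronger one exists anywhere in the candidate list.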
### Output Contract

```python
inject(answer, citation_map) → (answer_text, resolved_citations)
```

The answer text is **preserved with markers**; they are not stripped. The UI displays both the inline `[Source: ...]` text and the structured citation panel below the answer.

---

## AnswerChain

**File:** [voicevault/generation/answer_chain.py](../voicevault/generation/answer_chain.py)

### LLM Selection

```
GROQ_API_KEY set?
  YES → ChatGroq(model=llama-3.1-70b-versatile)
          If invoke() raises →
  NO  → ChatGoogleGenerativeAI(model=gemini-1.5-flash)
          If invoke() raises →
        Return REFUSAL_PHRASE (no crash)
```

Both LLMs are constructed fresh per call rather than cached, because `max_tokens` varies by `query_type` and LangChain model instances are lightweight.

### Message Layout

```
[SystemMessage]   ← FaithfulnessGuard.build_system_prompt()
[HumanMessage]    ← history turn 1 (oldest within window)
[AIMessage]       ← history turn 1 response
...               ← up to cfg.conversation_window pairs
[HumanMessage]    ← "Context:\n{context}\n\nQuestion: {query}"
```

History is capped at `cfg.conversation_window` pairs (default 5) to keep prompt size predictable.

### Token Budget by Query Type

```python
factual  → cfg.max_answer_tokens       (default 500 tokens)
summary  → cfg.max_answer_tokens × 2   (default 1000 tokens)
compare  → cfg.max_answer_tokens       (default 500 tokens)
```

Summaries need more room for comprehensive coverage.

### generate() Flow

```python
generate(query, context, citation_map, history, query_type) → GenerationResult:
  1. _build_messages()               → LangChain message list
  2. _invoke_with_fallback()         → raw_answer, model_used, tokens_used
  3. CitationInjector.inject()       → clean_answer, citations
  4. FaithfulnessGuard.is_refusal()  → is_refusal flag
  5. _confidence_from_citations()    → "high" | "medium" | "low"
  6. return GenerationResult
```

### stream_generate() Flow

```python
stream_generate(...) → Generator[str, None, None]:
  1. _build_messages()
  2. _build_groq() or _build_gemini()   ← first available
  3. for chunk in llm.stream(messages): yield chunk.content
  4. On error: yield error message (never raises)
```

Streaming is used by the Gradio UI to show tokens as they arrive. Citation injection and the faithfulness check are not applied to streamed chunks; call `generate()` once streaming completes for the structured result.

### GenerationResult

```python
from dataclasses import dataclass

@dataclass
class GenerationResult:
    answer: str                # Final answer with inline [Source: ...] markers
    citations: list[Citation]  # Resolved, deduplicated citations
    confidence_level: str      # "high" | "medium" | "low"
    is_refusal: bool           # True if LLM correctly refused
    model_used: str            # Model ID ("llama-3.1-70b-versatile" / "gemini-1.5-flash" / "none")
    tokens_used: int           # Total tokens (input + output); 0 if unavailable
    latency_ms: int            # Wall-clock LLM call time in ms
```

### Token Extraction

```python
def _extract_tokens(response) -> int:
    try:
        return int(response.usage_metadata.get("total_tokens", 0))
    except (AttributeError, TypeError):
        # usage_metadata is missing or None, or the value is not int-convertible
        return 0
```

`usage_metadata` is the LangChain standard; both Groq and Gemini backends populate it. The function returns 0 gracefully when it is unavailable (e.g., during streaming or with older SDK versions).

---

## Security Decisions

### No Prompt Injection Through Context

Context is injected as a plain string inside the user message, not in the system prompt. The system prompt is hardcoded and never receives user-controlled input. This limits the prompt injection attack surface: a malicious document cannot alter the system prompt that carries the faithfulness and citation instructions.

### Refusal as Default

When no LLM is configured (both keys absent) or both calls fail, the chain returns `REFUSAL_PHRASE` with `model_used="none"`. The application keeps running; missing API keys never crash it.

### No PII in LLM Calls

The query text passed to the LLM is the preprocessed version from `QueryPreprocessor` (fillers stripped, lowercased). The raw Whisper transcript is never sent to the LLM.
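To close out this section, the refusal-as-default behaviour described under "Refusal as Default" can be sketched in isolation. This is an illustrative reduction, not the actual `_invoke_with_fallback()`: the `invoke_with_fallback` name, the builder callables, and modelling an LLM as a plain function from messages to text are all assumptions made to keep the example free of API clients.

```python
from typing import Callable, Optional, Sequence, Tuple

REFUSAL_PHRASE = "I could not find this in your documents."

# Hypothetical simplification: an "LLM" is a function from messages to answer text,
# and a builder returns one, or None when its API key is not configured.
LLMCall = Callable[[Sequence[str]], str]
Builder = Callable[[], Optional[LLMCall]]


def invoke_with_fallback(
    messages: Sequence[str],
    builders: Sequence[Tuple[str, Builder]],
) -> Tuple[str, str]:
    """Try each (model_name, builder) in order and return (answer, model_used)."""
    for model_name, build in builders:
        try:
            llm = build()
            if llm is None:  # provider not configured, try the next one
                continue
            return llm(messages), model_name
        except Exception:
            # Any provider failure falls through to the next candidate.
            continue
    # Nothing was configured or everything failed: refuse instead of raising.
    return REFUSAL_PHRASE, "none"
```

The key property is that the function has no code path that raises: every failure mode ends in either the next provider or the fixed refusal phrase.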
---

## Test Coverage

**File:** [tests/test_phase4.py](../tests/test_phase4.py) | **72/72 passed**

| Class | Tests | What's verified |
|-------|-------|----------------|
| `TestCitationInjectorBasic` | 8 | Empty input, no markers, exact match, multiple markers, dedup, text preservation, order, empty map |
| `TestCitationInjectorMatchingStrategies` | 5 | All 4 strategies + last resort |
| `TestFaithfulnessGuardRefusal` | 7 | Exact phrase, case insensitive, embedded, normal answer, empty, partial, no-period |
| `TestFaithfulnessGuardConfidence` | 10 | Empty results, high/medium/low thresholds, max across results, rrf fallback, all 4 boundary conditions |
| `TestFaithfulnessGuardSystemPrompt` | 6 | Refusal phrase present, citation rules, faithfulness rules, length |
| `TestAnswerChainMessageBuilding` | 7 | SystemMessage first, HumanMessage last, context in body, history pairs, window cap, no-history length |
| `TestAnswerChainMaxTokens` | 3 | factual/summary/compare budgets |
| `TestAnswerChainTokenExtraction` | 4 | Valid metadata, None metadata, missing attribute, type error |
| `TestAnswerChainConfidenceFromCitations` | 5 | Empty, high/medium/low thresholds, max across citations |
| `TestAnswerChainGenerateMocked` | 7 | Returns correct type, answer content, latency, tokens, refusal detection, non-refusal, citation resolution |
| `TestAnswerChainFallback` | 4 | Gemini fallback on Groq failure, refusal when both fail, refusal when no keys, Groq preferred |
| `TestAnswerChainStreaming` | 4 | Yields chunks, skips empty chunks, refusal when no LLM, error on exception |
| `TestGenerationResult` | 2 | Instantiation, mutable citations list |

### Mocking Strategy

No real API keys are needed. Tests patch `_build_groq` and `_build_gemini` at the instance level to return `MagicMock` LLMs with controlled responses:

```python
from unittest.mock import MagicMock, patch

mock_llm = MagicMock()
mock_llm.invoke.return_value = mock_response  # or .side_effect = RuntimeError(...)

with patch.object(chain, "_build_groq", return_value=mock_llm):
    result = chain.generate(...)
```

---

## Integration Points

### Called by (Phase 5 orchestrator)

```python
# In the query handler:
results = retriever.search(query, kb_names)
context, citation_map = builder.build(results)
generation = chain.generate(
    query=transcript.transcript,
    context=context,
    citation_map=citation_map,
    history=session.history,
    query_type=transcript.query_type,
)
# generation.answer       → display + TTS
# generation.citations    → citation panel
# generation.is_refusal   → skip TTS if True
# generation.tokens_used  → store in QuerySession
```

### Dependencies

| Dep | Purpose |
|-----|---------|
| `langchain-core` | `HumanMessage`, `AIMessage`, `SystemMessage` |
| `langchain-groq` | `ChatGroq` client |
| `langchain-google-genai` | `ChatGoogleGenerativeAI` client |
| `FaithfulnessGuard` | System prompt + refusal detection |
| `CitationInjector` | Marker parsing + resolution |