# Phase 4: LLM Generation Chain
**Status:** Complete | **Tests:** 72/72 passed | **Files:** 3 modules

---

## What Was Built
Phase 4 implements the generation layer: context + query → grounded, cited answer.

| Module | Responsibility |
|--------|----------------|
| `voicevault/generation/answer_chain.py` | LangChain LCEL chain, Groq → Gemini fallback |
| `voicevault/generation/citation_injector.py` | Parse + resolve `[Source: ...]` markers |
| `voicevault/generation/faithfulness_guard.py` | Refusal detection + confidence scoring |

---
## FaithfulnessGuard
**File:** [voicevault/generation/faithfulness_guard.py](../voicevault/generation/faithfulness_guard.py)

Two-layer hallucination prevention:

### Layer 1: System Prompt Instruction
The LLM is instructed to use a fixed refusal phrase when the answer is not in context:

```
REFUSAL_PHRASE = "I could not find this in your documents."
```

This phrase is embedded in the system prompt via `build_system_prompt()`. The instruction is unambiguous: use *exactly this phrase and nothing else*.

### Layer 2: Post-Generation Check
After the LLM responds, `is_refusal()` verifies the instruction was followed:
```python
import re

# Match the refusal phrase without its trailing period, ignoring case.
_REFUSAL_PATTERN = re.compile(
    re.escape(REFUSAL_PHRASE.lower().rstrip(".")),
    re.IGNORECASE,
)

def is_refusal(self, answer: str) -> bool:
    # An empty answer is treated as a refusal.
    if not answer:
        return True
    return bool(_REFUSAL_PATTERN.search(answer))
```

The pattern strips the trailing period before matching, so it covers both the `"...documents."` and `"...documents"` forms. Matching is case-insensitive for robustness, and because it uses `search()`, the phrase is also detected when it is embedded in a longer answer.
### Confidence Scoring
Based on the top retrieval score across all retrieved chunks:

```
top_score > 0.5  → "high"   (strong retrieval signal)
top_score > 0.2  → "medium" (moderate signal; answer may miss nuance)
top_score ≤ 0.2  → "low"    (weak signal; treat answer with caution)
```

Uses `rerank_score` (the cross-encoder score) when it is > 0 and falls back to `rrf_score` (reciprocal rank fusion) otherwise. Empty results → `"low"`.
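
As an illustration, the thresholds above can be sketched as follows. `RetrievedChunk`, `effective_score`, and `confidence_level` are hypothetical names, and applying the `rerank_score` fallback per chunk is an assumption; the real `FaithfulnessGuard` may differ.

```python
from dataclasses import dataclass


@dataclass
class RetrievedChunk:
    """Hypothetical stand-in for one retrieval result."""
    rerank_score: float = 0.0
    rrf_score: float = 0.0


def effective_score(chunk):
    # Assumption: prefer the cross-encoder score when it is positive.
    return chunk.rerank_score if chunk.rerank_score > 0 else chunk.rrf_score


def confidence_level(chunks):
    # No retrieved chunks means there is no evidence at all.
    if not chunks:
        return "low"
    top_score = max(effective_score(c) for c in chunks)
    if top_score > 0.5:
        return "high"
    if top_score > 0.2:
        return "medium"
    return "low"
```

Both comparisons are strict, so a score of exactly 0.5 maps to `"medium"` and exactly 0.2 maps to `"low"`.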
### System Prompt
`build_system_prompt()` combines two concern areas:

```
CITATION RULES:
- Cite every factual claim with [Source: filename, p.N] inline.
- Use exact source names and page numbers from context headers.
- Do not cite general knowledge.
FAITHFULNESS RULES:
- If answer not in context: respond with REFUSAL_PHRASE only.
- Keep factual answers under 150 words.
- Keep summary answers under 300 words.
```

Word limits prevent verbose answers that dilute citations and increase hallucination risk.

---
## CitationInjector
**File:** [voicevault/generation/citation_injector.py](../voicevault/generation/citation_injector.py)

Post-processes LLM output to resolve inline markers into structured `Citation` objects.

### Marker Format
The LLM is instructed to use: `[Source: filename, p.N]`

The regex also handles abbreviated forms the LLM might produce:

```python
import re

# Group 1: source name; group 2 (optional): page number.
_CITATION_PATTERN = re.compile(
    r"\[(?:Source:\s*)?([^,\]]+?)(?:,\s*p\.?\s*(\d+))?\]",
    re.IGNORECASE,
)
```

Matches: `[Source: report.pdf, p.3]`, `[report.pdf, p.3]`, `[Source: report]`.
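
To make the matching behaviour concrete, here is a small sketch that applies the pattern above to a sample answer. `parse_markers` and the sample text are hypothetical; only the regex comes from this document.

```python
import re

_CITATION_PATTERN = re.compile(
    r"\[(?:Source:\s*)?([^,\]]+?)(?:,\s*p\.?\s*(\d+))?\]",
    re.IGNORECASE,
)


def parse_markers(answer):
    """Return (raw_name, page_num) pairs; page_num is None when omitted."""
    markers = []
    for match in _CITATION_PATTERN.finditer(answer):
        raw_name = match.group(1).strip()
        page = match.group(2)
        markers.append((raw_name, int(page) if page else None))
    return markers


sample = (
    "Revenue grew 12% [Source: report.pdf, p.3] while costs fell "
    "[notes.txt, p.12]. See the overview [Source: summary]."
)
print(parse_markers(sample))
# [('report.pdf', 3), ('notes.txt', 12), ('summary', None)]
```

As written, the pattern also matches any other comma-free bracketed text, such as `[1]`, so the resolution step has to cope with names that are not real sources.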
### Resolution Strategy (4-level cascade)
For each parsed marker `(raw_name, page_num)`:

| Priority | Strategy | Condition |
|----------|----------|-----------|
| 1 | Exact filename + page | `source_file.lower() == raw_name.lower() and page_number == page_num` |
| 2 | Substring filename + page | `raw_name in source_file.lower() and page_number == page_num` |
| 3 | Page number only | `page_number == page_num` |
| 4 | Filename substring (no page) | `raw_name in source_file.lower()` |
| 5 (last resort) | First citation in map | Always |

This cascade handles real-world LLM output variability: models sometimes abbreviate filenames or omit page numbers.
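
The cascade in the table can be sketched as an ordered list of predicates. This is a minimal illustration under stated assumptions: `Citation` here is a simplified stand-in, `resolve` is a hypothetical name, the citation map is treated as an ordered list, and `raw_name` is lowercased before the substring checks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    """Hypothetical minimal stand-in for the project's Citation type."""
    source_file: str
    page_number: int


def resolve(raw_name, page_num, candidates):
    """Return the best-matching candidate, or None if there are none."""
    if not candidates:
        return None
    name = raw_name.strip().lower()  # assumption: compare case-insensitively
    rules = [
        # 1. exact filename + page
        lambda c: c.source_file.lower() == name and c.page_number == page_num,
        # 2. substring filename + page
        lambda c: name in c.source_file.lower() and c.page_number == page_num,
        # 3. page number only
        lambda c: c.page_number == page_num,
        # 4. filename substring, page ignored
        lambda c: name in c.source_file.lower(),
    ]
    for rule in rules:
        for candidate in candidates:
            if rule(candidate):
                return candidate
    # 5. last resort: first citation in the map
    return candidates[0]
```

When the marker has no page (`page_num` is `None`), rules 1 to 3 cannot match, so resolution falls through to the filename substring rule.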
### Deduplication
A `seen_keys: set[tuple[str, int]]` tracks `(source_file, page_number)` pairs. The same source/page cited multiple times resolves to one `Citation` in the output list.
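
A short sketch of this deduplication, using plain `(source_file, page_number)` tuples in place of `Citation` objects. `dedupe` is a hypothetical name, and keeping first-seen order is an assumption.

```python
def dedupe(resolved):
    """Drop repeated (source_file, page_number) pairs, keeping first-seen order."""
    seen_keys = set()
    unique = []
    for source_file, page_number in resolved:
        key = (source_file, page_number)
        if key in seen_keys:
            continue  # already cited this source/page
        seen_keys.add(key)
        unique.append(key)
    return unique


print(dedupe([("report.pdf", 3), ("notes.txt", 7), ("report.pdf", 3)]))
# [('report.pdf', 3), ('notes.txt', 7)]
```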
### Output Contract

```
inject(answer, citation_map) → (answer_text, resolved_citations)
```

The answer text is **preserved with markers**; they are not stripped. The UI displays both the inline `[Source: ...]` text and the structured citation panel below the answer.

---
## AnswerChain
**File:** [voicevault/generation/answer_chain.py](../voicevault/generation/answer_chain.py)

### LLM Selection

```
GROQ_API_KEY set?
  YES → ChatGroq(model=llama-3.1-70b-versatile)
          If invoke() raises ↓
  NO  → ChatGoogleGenerativeAI(model=gemini-1.5-flash)
          If invoke() raises ↓
        Return REFUSAL_PHRASE (no crash)
```

Both LLMs are constructed fresh per call (not cached), because `max_tokens` varies by `query_type` and LangChain model instances are lightweight.
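
The fallback order can be sketched without any real SDK. Everything below is illustrative: `StubLLM` and `invoke_with_fallback` are hypothetical names, and the stubs return plain strings rather than LangChain response objects.

```python
REFUSAL_PHRASE = "I could not find this in your documents."


class StubLLM:
    """Hypothetical stand-in for a chat model; not a LangChain class."""

    def __init__(self, reply=None, error=None):
        self.reply = reply
        self.error = error

    def invoke(self, messages):
        if self.error is not None:
            raise self.error
        return self.reply


def invoke_with_fallback(builders, messages):
    """Try each (model_id, factory) in order; refuse if none succeeds."""
    for model_id, build in builders:
        llm = build()  # fresh instance per call; None means "not configured"
        if llm is None:
            continue
        try:
            return llm.invoke(messages), model_id
        except Exception:
            continue  # fall through to the next provider
    return REFUSAL_PHRASE, "none"


builders = [
    ("llama-3.1-70b-versatile", lambda: StubLLM(error=RuntimeError("rate limited"))),
    ("gemini-1.5-flash", lambda: StubLLM(reply="Paris [Source: geo.pdf, p.2]")),
]
answer, model_used = invoke_with_fallback(builders, messages=[])
print(model_used)  # gemini-1.5-flash
```

Returning the refusal phrase instead of raising keeps the caller's control flow identical whether a provider failed or the answer was simply not in the documents.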
### Message Layout

```
[SystemMessage] ← FaithfulnessGuard.build_system_prompt()
[HumanMessage]  ← history turn 1 (oldest within window)
[AIMessage]     ← history turn 1 response
...             ← up to cfg.conversation_window pairs
[HumanMessage]  ← "Context:\n{context}\n\nQuestion: {query}"
```

History is capped at `cfg.conversation_window` (default 5) to keep prompt size predictable.
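
The layout above can be sketched with plain `(role, content)` tuples standing in for LangChain message objects. `build_messages` is a hypothetical name, and representing history as a list of `(user_text, ai_text)` pairs is an assumption.

```python
def build_messages(system_prompt, history, context, query, window=5):
    """Assemble system prompt, recent history pairs, then the final question."""
    messages = [("system", system_prompt)]
    # Keep only the most recent `window` pairs; guard against window == 0,
    # because history[-0:] would return the whole list.
    recent = history[-window:] if window > 0 else []
    for user_text, ai_text in recent:
        messages.append(("human", user_text))
        messages.append(("ai", ai_text))
    messages.append(("human", f"Context:\n{context}\n\nQuestion: {query}"))
    return messages


history = [(f"q{i}", f"a{i}") for i in range(7)]
messages = build_messages("rules", history, "ctx", "What changed?")
print(len(messages))  # 12: 1 system + 5 pairs + 1 final question
```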
### Token Budget by Query Type

```
factual → cfg.max_answer_tokens      (default 500 tokens)
summary → cfg.max_answer_tokens × 2  (default 1000 tokens)
compare → cfg.max_answer_tokens      (default 500 tokens)
```

Summaries need more room for comprehensive coverage.
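
As a sketch, the budget rule reduces to a single branch. `max_tokens_for` is a hypothetical name, and giving unknown query types the base budget is an assumption.

```python
def max_tokens_for(query_type, max_answer_tokens=500):
    """Return the output token budget for a query type."""
    if query_type == "summary":
        return max_answer_tokens * 2  # summaries get double the base budget
    return max_answer_tokens  # factual, compare, and (assumed) anything else
```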
### generate() Flow

```
generate(query, context, citation_map, history, query_type) → GenerationResult:
  1. _build_messages()              → LangChain message list
  2. _invoke_with_fallback()        → raw_answer, model_used, tokens_used
  3. CitationInjector.inject()      → clean_answer, citations
  4. FaithfulnessGuard.is_refusal() → is_refusal flag
  5. _confidence_from_citations()   → "high" | "medium" | "low"
  6. return GenerationResult
```

### stream_generate() Flow

```
stream_generate(...) → Generator[str, None, None]:
  1. _build_messages()
  2. _build_groq() or _build_gemini() (first available)
  3. for chunk in llm.stream(messages): yield chunk.content
  4. On error: yield error message (never raises)
```

Streaming is used by the Gradio UI to show tokens as they arrive. Citation injection and the faithfulness check are not applied to streamed chunks; call `generate()` once streaming completes for the structured result.
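
A minimal sketch of a generator that streams without ever raising, using stub objects in place of a real model. `stream_answer` and `FakeLLM` are hypothetical, and the error message format is an assumption.

```python
from types import SimpleNamespace

REFUSAL_PHRASE = "I could not find this in your documents."


class FakeLLM:
    """Hypothetical stub whose stream() yields objects with a .content field."""

    def __init__(self, pieces, fail=False):
        self.pieces = pieces
        self.fail = fail

    def stream(self, messages):
        for piece in self.pieces:
            yield SimpleNamespace(content=piece)
        if self.fail:
            raise RuntimeError("boom")


def stream_answer(llm, messages):
    """Yield text chunks; yield a message instead of raising on failure."""
    if llm is None:
        yield REFUSAL_PHRASE  # no model configured
        return
    try:
        for chunk in llm.stream(messages):
            if chunk.content:  # skip empty chunks
                yield chunk.content
    except Exception as exc:
        yield f"[generation error: {exc}]"


print(list(stream_answer(FakeLLM(["Hel", "", "lo"]), [])))  # ['Hel', 'lo']
```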
### GenerationResult

```python
from dataclasses import dataclass

# `Citation` is the project's structured citation type.
@dataclass
class GenerationResult:
    answer: str               # Final answer with inline [Source: ...] markers
    citations: list[Citation] # Resolved, deduplicated citations
    confidence_level: str     # "high" | "medium" | "low"
    is_refusal: bool          # True if LLM correctly refused
    model_used: str           # Model ID ("llama-3.1-70b-versatile" / "gemini-1.5-flash" / "none")
    tokens_used: int          # Total tokens (input + output); 0 if unavailable
    latency_ms: int           # Wall-clock LLM call time in ms
```
### Token Extraction

```python
def _extract_tokens(response) -> int:
    try:
        return int(response.usage_metadata.get("total_tokens", 0))
    except (AttributeError, TypeError):
        return 0
```

`usage_metadata` is the LangChain standard; both Groq and Gemini backends populate it. Returns 0 gracefully when unavailable (e.g., during streaming or with older SDK versions).

---

## Security Decisions
### No Prompt Injection Through Context
Context is injected as a plain string inside the user message, not in the system prompt. The system prompt is hardcoded and never receives user-controlled input. This limits the prompt injection attack surface: text from a malicious document never enters the system prompt, so it cannot rewrite the faithfulness or citation instructions there.

### Refusal as Default
When no LLM is configured (both keys absent) or both calls fail, the chain returns `REFUSAL_PHRASE` with `model_used="none"`. The application continues running; it never crashes due to missing API keys.

### No PII in LLM Calls
The query text passed to the LLM is the preprocessed version from `QueryPreprocessor` (fillers stripped, lowercased). The raw Whisper transcript is never sent to the LLM.

---
## Test Coverage
**File:** [tests/test_phase4.py](../tests/test_phase4.py) | **72/72 passed**

| Class | Tests | What's verified |
|-------|-------|----------------|
| `TestCitationInjectorBasic` | 8 | Empty input, no markers, exact match, multiple markers, dedup, text preservation, order, empty map |
| `TestCitationInjectorMatchingStrategies` | 5 | All 4 strategies + last resort |
| `TestFaithfulnessGuardRefusal` | 7 | Exact phrase, case insensitive, embedded, normal answer, empty, partial, no-period |
| `TestFaithfulnessGuardConfidence` | 10 | Empty results, high/medium/low thresholds, max across results, rrf fallback, all 4 boundary conditions |
| `TestFaithfulnessGuardSystemPrompt` | 6 | Refusal phrase present, citation rules, faithfulness rules, length |
| `TestAnswerChainMessageBuilding` | 7 | SystemMessage first, HumanMessage last, context in body, history pairs, window cap, no-history length |
| `TestAnswerChainMaxTokens` | 3 | factual/summary/compare budgets |
| `TestAnswerChainTokenExtraction` | 4 | Valid metadata, None metadata, missing attribute, type error |
| `TestAnswerChainConfidenceFromCitations` | 5 | Empty, high/medium/low thresholds, max across citations |
| `TestAnswerChainGenerateMocked` | 7 | Returns correct type, answer content, latency, tokens, refusal detection, non-refusal, citation resolution |
| `TestAnswerChainFallback` | 4 | Gemini fallback on Groq failure, refusal when both fail, refusal when no keys, Groq preferred |
| `TestAnswerChainStreaming` | 4 | Yields chunks, skips empty chunks, refusal when no LLM, error on exception |
| `TestGenerationResult` | 2 | Instantiation, mutable citations list |

### Mocking Strategy
No real API keys are needed. Tests patch `_build_groq` and `_build_gemini` at the instance level to return `MagicMock` LLMs with controlled responses:

```python
from unittest.mock import MagicMock, patch

mock_llm = MagicMock()
mock_llm.invoke.return_value = mock_response  # or .side_effect = RuntimeError(...)
with patch.object(chain, "_build_groq", return_value=mock_llm):
    result = chain.generate(...)
```

---
## Integration Points
### Called by (Phase 5 orchestrator)

```python
# In the query handler:
results = retriever.search(query, kb_names)
context, citation_map = builder.build(results)
generation = chain.generate(
    query=transcript.transcript,
    context=context,
    citation_map=citation_map,
    history=session.history,
    query_type=transcript.query_type,
)
# generation.answer      → display + TTS
# generation.citations   → citation panel
# generation.is_refusal  → skip TTS if True
# generation.tokens_used → store in QuerySession
```

### Dependencies

| Dep | Purpose |
|-----|---------|
| `langchain-core` | `HumanMessage`, `AIMessage`, `SystemMessage` |
| `langchain-groq` | `ChatGroq` client |
| `langchain-google-genai` | `ChatGoogleGenerativeAI` client |
| `FaithfulnessGuard` | System prompt + refusal detection |
| `CitationInjector` | Marker parsing + resolution |