# Phase 4: LLM Generation Chain
**Status:** ✅ Complete | **Tests:** 72/72 passed | **Files:** 3 modules
---
## What Was Built
Phase 4 implements the generation layer: context + query → grounded, cited answer.
| Module | Responsibility |
|--------|----------------|
| `voicevault/generation/answer_chain.py` | LangChain LCEL chain, Groq → Gemini fallback |
| `voicevault/generation/citation_injector.py` | Parse + resolve `[Source: ...]` markers |
| `voicevault/generation/faithfulness_guard.py` | Refusal detection + confidence scoring |
---
## FaithfulnessGuard
**File:** [voicevault/generation/faithfulness_guard.py](../voicevault/generation/faithfulness_guard.py)
Two-layer hallucination prevention:
### Layer 1: System Prompt Instruction
The LLM is instructed to use a fixed refusal phrase when the answer is not in context:
```
REFUSAL_PHRASE = "I could not find this in your documents."
```
This phrase is embedded in the system prompt via `build_system_prompt()`. The instruction is unambiguous: use *exactly this phrase and nothing else*.
### Layer 2: Post-Generation Check
After the LLM responds, `is_refusal()` verifies the instruction was followed:
```python
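import re

# Compiled once at module level: the refusal phrase without its trailing period.
_REFUSAL_PATTERN = re.compile(
    re.escape(REFUSAL_PHRASE.lower().rstrip(".")),
    re.IGNORECASE,
)


class FaithfulnessGuard:
    def is_refusal(self, answer: str) -> bool:
        # An empty answer counts as a refusal.
        if not answer:
            return True
        return bool(_REFUSAL_PATTERN.search(answer))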
```
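The pattern strips the trailing period before matching, so it covers both the `"...documents."` and `"...documents"` forms. Matching is case-insensitive for robustness, and because the check uses `search()`, the phrase is detected even when it is embedded in a longer answer.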
### Confidence Scoring
Based on the top retrieval score across all retrieved chunks:
```
top_score > 0.5  → "high"    (strong retrieval signal)
top_score > 0.2  → "medium"  (moderate signal; answer may miss nuance)
top_score ≤ 0.2  → "low"     (weak signal; treat answer with caution)
```
The score used is `rerank_score` (the cross-encoder score) when it is greater than 0; otherwise scoring falls back to `rrf_score` (reciprocal rank fusion). Empty results yield `"low"`.
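The thresholds above can be sketched as a small function. This is a minimal illustration rather than the module's actual code: `RetrievedChunk`, `effective_score`, and `confidence_level` are hypothetical names, and applying the `rerank_score` fallback per chunk is an assumption. Only the field names and thresholds come from this document.

```python
from dataclasses import dataclass


@dataclass
class RetrievedChunk:
    """Hypothetical stand-in for one retrieval result (class name assumed)."""
    rerank_score: float = 0.0  # cross-encoder score; 0 when reranking did not run
    rrf_score: float = 0.0     # reciprocal rank fusion score


def effective_score(chunk: RetrievedChunk) -> float:
    # Prefer the rerank score when it is positive, otherwise use the RRF score.
    return chunk.rerank_score if chunk.rerank_score > 0 else chunk.rrf_score


def confidence_level(chunks: list[RetrievedChunk]) -> str:
    # No retrieval results means there is nothing to ground the answer on.
    if not chunks:
        return "low"
    top_score = max(effective_score(c) for c in chunks)
    if top_score > 0.5:
        return "high"
    if top_score > 0.2:
        return "medium"
    return "low"
```

Both comparisons are strict, so a top score of exactly 0.5 maps to `"medium"` and exactly 0.2 maps to `"low"`.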
### System Prompt
`build_system_prompt()` combines two concern areas:
```
CITATION RULES:
- Cite every factual claim with [Source: filename, p.N] inline.
- Use exact source names and page numbers from context headers.
- Do not cite general knowledge.
FAITHFULNESS RULES:
- If answer not in context: respond with REFUSAL_PHRASE only.
- Keep factual answers under 150 words.
- Keep summary answers under 300 words.
```
Word limits prevent verbose answers that dilute citations and increase hallucination risk.
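One way to assemble such a prompt is sketched below. The rule wording and the list constants are assumptions made for illustration; only `REFUSAL_PHRASE`, the function name `build_system_prompt()`, and the two rule groups come from this document.

```python
REFUSAL_PHRASE = "I could not find this in your documents."

# Rule wording below is illustrative, not the module's actual prompt text.
CITATION_RULES = [
    "Cite every factual claim with [Source: filename, p.N] inline.",
    "Use exact source names and page numbers from context headers.",
    "Do not cite general knowledge.",
]
FAITHFULNESS_RULES = [
    f'If the answer is not in the context, respond with exactly "{REFUSAL_PHRASE}" and nothing else.',
    "Keep factual answers under 150 words.",
    "Keep summary answers under 300 words.",
]


def build_system_prompt() -> str:
    """Join both rule groups into a single system prompt string."""
    lines = ["CITATION RULES:"]
    lines += [f"- {rule}" for rule in CITATION_RULES]
    lines.append("FAITHFULNESS RULES:")
    lines += [f"- {rule}" for rule in FAITHFULNESS_RULES]
    return "\n".join(lines)
```

Building the prompt from constants keeps the refusal phrase in one place, so the instruction given to the model and the post-generation check cannot drift apart.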
---
## CitationInjector
**File:** [voicevault/generation/citation_injector.py](../voicevault/generation/citation_injector.py)
Post-processes LLM output to resolve inline markers into structured `Citation` objects.
### Marker Format
The LLM is instructed to use: `[Source: filename, p.N]`
The regex also handles abbreviated forms the LLM might produce:
```python
_CITATION_PATTERN = re.compile(
    r"\[(?:Source:\s*)?([^,\]]+?)(?:,\s*p\.?\s*(\d+))?\]",
    re.IGNORECASE,
)
```
Matches: `[Source: report.pdf, p.3]`, `[report.pdf, p.3]`, `[Source: report]`.
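To see what the pattern captures, the sketch below runs it over a sample answer. `parse_markers` is a hypothetical helper name and the sample sentence is invented; the regex itself is copied from above.

```python
import re
from typing import Optional

_CITATION_PATTERN = re.compile(
    r"\[(?:Source:\s*)?([^,\]]+?)(?:,\s*p\.?\s*(\d+))?\]",
    re.IGNORECASE,
)


def parse_markers(answer: str) -> list[tuple[str, Optional[int]]]:
    """Return (raw_name, page_num) pairs; page_num is None when no page is given."""
    markers = []
    for match in _CITATION_PATTERN.finditer(answer):
        raw_name = match.group(1).strip()
        page = match.group(2)
        markers.append((raw_name, int(page) if page is not None else None))
    return markers


sample = "Revenue grew [Source: report.pdf, p.3], as noted in [report.pdf, p.3] and [Source: report]."
print(parse_markers(sample))
```

Because the `Source:` prefix is optional, the pattern also matches other bracketed text such as `[note]`, so a caller may want to discard markers that resolve to nothing.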
### Resolution Strategy (4-level cascade plus last resort)
For each parsed marker `(raw_name, page_num)`:
| Priority | Strategy | Condition |
|----------|----------|-----------|
| 1 | Exact filename + page | `source_file.lower() == raw_name.lower() and page_number == page_num` |
| 2 | Substring filename + page | `raw_name in source_file.lower() and page_number == page_num` |
| 3 | Page number only | `page_number == page_num` |
| 4 | Filename substring (no page) | `raw_name in source_file.lower()` |
| 5 (last resort) | First citation in map | Always |
This cascade handles real-world LLM output variability: models sometimes abbreviate filenames or omit page numbers.
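The cascade in the table can be sketched as an ordered list of predicates tried in turn. This is an illustration under stated assumptions, not the module's code: the `Citation` fields shown, the `resolve` function name, treating `citation_map` as a list, and lowercasing `raw_name` for every strategy are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Citation:
    """Hypothetical minimal citation record; the real class may carry more fields."""
    source_file: str
    page_number: int


def resolve(raw_name: str, page_num: Optional[int],
            citation_map: list[Citation]) -> Optional[Citation]:
    name = raw_name.lower()
    # Strategies in priority order, matching rows 1 to 4 of the table.
    strategies = [
        lambda c: c.source_file.lower() == name and c.page_number == page_num,
        lambda c: name in c.source_file.lower() and c.page_number == page_num,
        lambda c: c.page_number == page_num,
        lambda c: name in c.source_file.lower(),
    ]
    for matches in strategies:
        for citation in citation_map:
            if matches(citation):
                return citation
    # Last resort: fall back to the first citation, if there is one.
    return citation_map[0] if citation_map else None
```

Trying every candidate against one strategy before moving to the next keeps a precise match from being shadowed by a looser match earlier in the map.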
### Deduplication
A `seen_keys: set[tuple[str, int]]` tracks `(source_file, page_number)` pairs. The same source/page cited multiple times resolves to one `Citation` in the output list.
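A minimal sketch of that deduplication step follows. It works on bare `(source_file, page_number)` tuples instead of `Citation` objects to stay self-contained; `dedupe` is a hypothetical name, and keeping first-seen order is an assumption.

```python
def dedupe(resolved: list[tuple[str, int]]) -> list[tuple[str, int]]:
    """Drop repeated (source_file, page_number) pairs, keeping first-seen order."""
    seen_keys: set[tuple[str, int]] = set()
    unique: list[tuple[str, int]] = []
    for key in resolved:
        if key in seen_keys:
            continue  # same source and page already recorded
        seen_keys.add(key)
        unique.append(key)
    return unique
```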
### Output Contract
```python
inject(answer, citation_map) → (answer_text, resolved_citations)
```
The answer text is **preserved with markers**; they are not stripped. The UI displays both the inline `[Source: ...]` text and the structured citation panel below the answer.
---
## AnswerChain
**File:** [voicevault/generation/answer_chain.py](../voicevault/generation/answer_chain.py)
### LLM Selection
```
GROQ_API_KEY set?
  YES → ChatGroq(model=llama-3.1-70b-versatile)
        If invoke() raises →
  NO  → ChatGoogleGenerativeAI(model=gemini-1.5-flash)
        If invoke() raises →
          Return REFUSAL_PHRASE (no crash)
```
Both LLMs are constructed fresh per call (not cached), because `max_tokens` varies by `query_type` and LangChain model instances are lightweight.
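The selection and fallback logic can be sketched without any LangChain imports by treating each model as a plain callable. Everything here except `REFUSAL_PHRASE`, the model IDs, and the `"none"` marker is hypothetical: `invoke_with_fallback`, the builder list, and the convention that a builder returns `None` when its API key is absent are assumptions for illustration.

```python
from typing import Callable, Optional

REFUSAL_PHRASE = "I could not find this in your documents."

LLMCall = Callable[[str], str]                 # stand-in for llm.invoke(messages)
Builder = Callable[[], Optional[LLMCall]]      # returns None when the key is missing


def invoke_with_fallback(prompt: str,
                         builders: list[tuple[str, Builder]]) -> tuple[str, str]:
    """Try each model in order; return (answer, model_used)."""
    for model_name, build in builders:
        try:
            llm = build()
            if llm is None:
                continue  # no API key for this provider
            return llm(prompt), model_name
        except Exception:
            continue  # construction or invocation failed: try the next provider
    # Nothing configured or every call failed: refuse instead of raising.
    return REFUSAL_PHRASE, "none"


def failing_groq(prompt: str) -> str:
    raise RuntimeError("simulated provider error")


answer, model_used = invoke_with_fallback(
    "What is in the report?",
    [
        ("llama-3.1-70b-versatile", lambda: failing_groq),
        ("gemini-1.5-flash", lambda: (lambda p: "Stub answer")),
    ],
)
```

Returning the refusal phrase on total failure means the caller handles one result shape in every case.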
### Message Layout
```
[SystemMessage]   ← FaithfulnessGuard.build_system_prompt()
[HumanMessage]    ← history turn 1 (oldest within window)
[AIMessage]       ← history turn 1 response
...               ← up to cfg.conversation_window pairs
[HumanMessage]    ← "Context:\n{context}\n\nQuestion: {query}"
```
History is capped at `cfg.conversation_window` (default 5) to keep prompt size predictable.
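The layout above can be sketched with plain `(role, content)` tuples standing in for the LangChain message classes. The `build_messages` signature and the shape of `history` (a list of `(user_text, assistant_text)` pairs, oldest first) are assumptions for illustration.

```python
def build_messages(system_prompt: str,
                   history: list[tuple[str, str]],
                   context: str,
                   query: str,
                   conversation_window: int = 5) -> list[tuple[str, str]]:
    """Return messages in the order: system, recent history pairs, final question."""
    messages = [("system", system_prompt)]
    # Keep only the most recent pairs; a window of 0 drops history entirely.
    recent = history[-conversation_window:] if conversation_window > 0 else []
    for user_text, assistant_text in recent:
        messages.append(("human", user_text))
        messages.append(("ai", assistant_text))
    # Context and question travel together in the final user message.
    messages.append(("human", f"Context:\n{context}\n\nQuestion: {query}"))
    return messages
```

With the default window, a session of seven prior turns produces twelve messages: one system message, five history pairs, and the final question.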
### Token Budget by Query Type
```python
factual  → cfg.max_answer_tokens      (default 500 tokens)
summary  → cfg.max_answer_tokens × 2  (default 1000 tokens)
compare  → cfg.max_answer_tokens      (default 500 tokens)
```
Summaries need more room for comprehensive coverage.
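The budget rule reduces to a one-branch function. `max_tokens_for` is a hypothetical name, and giving unknown query types the base budget is an assumption.

```python
def max_tokens_for(query_type: str, max_answer_tokens: int = 500) -> int:
    """Return the output token budget for a query type."""
    if query_type == "summary":
        return max_answer_tokens * 2  # summaries get double the base budget
    # "factual", "compare", and (by assumption) anything else use the base budget.
    return max_answer_tokens
```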
### generate() Flow
```python
generate(query, context, citation_map, history, query_type) → GenerationResult:
    1. _build_messages()                 → LangChain message list
    2. _invoke_with_fallback()           → raw_answer, model_used, tokens_used
    3. CitationInjector.inject()         → clean_answer, citations
    4. FaithfulnessGuard.is_refusal()    → is_refusal flag
    5. _confidence_from_citations()      → "high" | "medium" | "low"
    6. return GenerationResult
```
### stream_generate() Flow
```python
stream_generate(...) → Generator[str, None, None]:
    1. _build_messages()
    2. _build_groq() or _build_gemini()   ← first available
    3. for chunk in llm.stream(messages): yield chunk.content
    4. On error: yield error message (never raises)
```
Streaming is used by the Gradio UI to show tokens as they arrive. Citation injection and the faithfulness check are not applied to streamed chunks; call `generate()` once streaming completes for the structured result.
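The "never raises" behavior can be sketched as a generator that converts a mid-stream exception into a final yielded message. `stream_answer`, the plain-string chunks, and the error message format are assumptions; the real method iterates `llm.stream(messages)` and yields `chunk.content`.

```python
from typing import Generator, Iterable


def stream_answer(chunks: Iterable[str]) -> Generator[str, None, None]:
    """Yield non-empty text chunks; on failure, yield an error message instead of raising."""
    try:
        for text in chunks:
            if text:  # skip empty chunks
                yield text
    except Exception as exc:
        # Error wording is illustrative only.
        yield f"[Generation error: {exc}]"


def flaky_stream():
    # Simulated provider stream that fails after two useful chunks.
    yield "Hel"
    yield ""
    yield "lo"
    raise RuntimeError("connection dropped")


collected = list(stream_answer(flaky_stream()))
```

Text that was already yielded stays on screen, so the UI shows the partial answer followed by the error message.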
### GenerationResult
```python
from dataclasses import dataclass

# Citation is the structured citation type resolved by CitationInjector (import omitted).


@dataclass
class GenerationResult:
    answer: str                # Final answer with inline [Source: ...] markers
    citations: list[Citation]  # Resolved, deduplicated citations
    confidence_level: str      # "high" | "medium" | "low"
    is_refusal: bool           # True if LLM correctly refused
    model_used: str            # Model ID ("llama-3.1-70b-versatile" / "gemini-1.5-flash" / "none")
    tokens_used: int           # Total tokens (input + output); 0 if unavailable
    latency_ms: int            # Wall-clock LLM call time in ms
```
### Token Extraction
```python
def _extract_tokens(response) -> int:
    try:
        return int(response.usage_metadata.get("total_tokens", 0))
    except (AttributeError, TypeError):
        return 0
```
`usage_metadata` is the LangChain standard; both Groq and Gemini backends populate it. Returns 0 gracefully when unavailable (e.g., during streaming or with older SDK versions).
---
## Security Decisions
### No Prompt Injection Through Context
Context is injected as a plain string inside the user message, not in the system prompt. The system prompt is hardcoded and never receives user-controlled input. This limits the prompt injection attack surface: text from a malicious document arrives in the user turn and cannot modify the system prompt that carries the faithfulness and citation instructions.
### Refusal as Default
When no LLM is configured (both keys absent) or both calls fail, the chain returns `REFUSAL_PHRASE` with `model_used="none"`. The application continues running and never crashes due to missing API keys.
### No PII in LLM Calls
The query text passed to the LLM is the preprocessed version from `QueryPreprocessor`, with fillers stripped and text lowercased. The raw Whisper transcript is never sent to the LLM.
---
## Test Coverage
**File:** [tests/test_phase4.py](../tests/test_phase4.py) | **72/72 passed**
| Class | Tests | What's verified |
|-------|-------|----------------|
| `TestCitationInjectorBasic` | 8 | Empty input, no markers, exact match, multiple markers, dedup, text preservation, order, empty map |
| `TestCitationInjectorMatchingStrategies` | 5 | All 4 strategies + last resort |
| `TestFaithfulnessGuardRefusal` | 7 | Exact phrase, case insensitive, embedded, normal answer, empty, partial, no-period |
| `TestFaithfulnessGuardConfidence` | 10 | Empty results, high/medium/low thresholds, max across results, rrf fallback, all 4 boundary conditions |
| `TestFaithfulnessGuardSystemPrompt` | 6 | Refusal phrase present, citation rules, faithfulness rules, length |
| `TestAnswerChainMessageBuilding` | 7 | SystemMessage first, HumanMessage last, context in body, history pairs, window cap, no-history length |
| `TestAnswerChainMaxTokens` | 3 | factual/summary/compare budgets |
| `TestAnswerChainTokenExtraction` | 4 | Valid metadata, None metadata, missing attribute, type error |
| `TestAnswerChainConfidenceFromCitations` | 5 | Empty, high/medium/low thresholds, max across citations |
| `TestAnswerChainGenerateMocked` | 7 | Returns correct type, answer content, latency, tokens, refusal detection, non-refusal, citation resolution |
| `TestAnswerChainFallback` | 4 | Gemini fallback on Groq failure, refusal when both fail, refusal when no keys, Groq preferred |
| `TestAnswerChainStreaming` | 4 | Yields chunks, skips empty chunks, refusal when no LLM, error on exception |
| `TestGenerationResult` | 2 | Instantiation, mutable citations list |
### Mocking Strategy
No real API keys are needed. Tests patch `_build_groq` and `_build_gemini` at the instance level to return `MagicMock` LLMs with controlled responses:
```python
from unittest.mock import MagicMock, patch

mock_llm = MagicMock()
mock_llm.invoke.return_value = mock_response  # or .side_effect = RuntimeError(...)
with patch.object(chain, "_build_groq", return_value=mock_llm):
    result = chain.generate(...)
```
---
## Integration Points
### Called by (Phase 5 orchestrator)
```python
# In the query handler:
results = retriever.search(query, kb_names)
context, citation_map = builder.build(results)
generation = chain.generate(
    query=transcript.transcript,
    context=context,
    citation_map=citation_map,
    history=session.history,
    query_type=transcript.query_type,
)
# generation.answer      → display + TTS
# generation.citations   → citation panel
# generation.is_refusal  → skip TTS if True
# generation.tokens_used → store in QuerySession
```
### Dependencies
| Dep | Purpose |
|-----|---------|
| `langchain-core` | `HumanMessage`, `AIMessage`, `SystemMessage` |
| `langchain-groq` | `ChatGroq` client |
| `langchain-google-genai` | `ChatGoogleGenerativeAI` client |
| `FaithfulnessGuard` | System prompt + refusal detection |
| `CitationInjector` | Marker parsing + resolution |