
	# Phase 4 — LLM Generation Chain

	Status: ✅ Complete | Tests: 72/72 passed | Files: 3 modules

	---

	## What Was Built

	Phase 4 implements the generation layer: context + query → grounded, cited answer.

	| Module | Responsibility |
	|--------|----------------|
	| `voicevault/generation/answer_chain.py` | LangChain LCEL chain, Groq → Gemini fallback |
	| `voicevault/generation/citation_injector.py` | Parse + resolve `[Source: ...]` markers |
	| `voicevault/generation/faithfulness_guard.py` | Refusal detection + confidence scoring |

	---

	## FaithfulnessGuard

	File: [voicevault/generation/faithfulness_guard.py](../voicevault/generation/faithfulness_guard.py)

	Two-layer hallucination prevention:

	### Layer 1 — System Prompt Instruction

	The LLM is instructed to use a fixed refusal phrase when the answer is not in context:

	```
	REFUSAL_PHRASE = "I could not find this in your documents."
	```

	This phrase is embedded in the system prompt via `build_system_prompt()`. The instruction is unambiguous: use exactly this phrase and nothing else.

	### Layer 2 — Post-Generation Check

	After the LLM responds, `is_refusal()` verifies the instruction was followed:

	```python
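	import re

	# Compiled without the trailing period so both punctuated and bare forms match.
	_REFUSAL_PATTERN = re.compile(
	    re.escape(REFUSAL_PHRASE.lower().rstrip(".")),
	    re.IGNORECASE,
	)

	# Method on FaithfulnessGuard
	def is_refusal(self, answer: str) -> bool:
	    if not answer:
	        return True  # an empty answer is treated as a refusal
	    return bool(_REFUSAL_PATTERN.search(answer))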
	```

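	The phrase is compiled without its trailing period, so the pattern matches both `"...documents."` and `"...documents"`. Matching is case-insensitive, and because the pattern uses `search()`, the phrase is also detected when it is embedded in a longer answer.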
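The check can be exercised on its own. The sketch below is a standalone function rather than the `FaithfulnessGuard` method, and the sample answers are made up for illustration:

```python
import re

REFUSAL_PHRASE = "I could not find this in your documents."

# Drop the trailing period before compiling so both forms match.
_REFUSAL_PATTERN = re.compile(
    re.escape(REFUSAL_PHRASE.rstrip(".")),
    re.IGNORECASE,
)


def is_refusal(answer: str) -> bool:
    """Return True for empty answers or answers containing the refusal phrase."""
    if not answer:
        return True
    return bool(_REFUSAL_PATTERN.search(answer))


# Illustrative inputs (not taken from the test suite).
samples = {
    "I could not find this in your documents.": True,
    "i could not find this in your documents": True,
    "Sorry. I could not find this in your documents. Try another file.": True,
    "": True,
    "Revenue grew 12% [Source: report.pdf, p.3]": False,
}
results = {text: is_refusal(text) for text in samples}
```

Each entry in `results` should equal the expected value in `samples`.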

	### Confidence Scoring

	Based on the top retrieval score across all retrieved chunks:

	```
	top_score > 0.5 → "high" (strong retrieval signal)
	top_score > 0.2 → "medium" (moderate signal — answer may miss nuance)
	top_score ≤ 0.2 → "low" (weak signal — treat answer with caution)
	```

	Each chunk is scored by its `rerank_score` (cross-encoder) when that value is greater than 0; otherwise the score falls back to `rrf_score` (reciprocal rank fusion). Empty results yield `"low"`.
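A minimal sketch of the thresholding described above. `RetrievedChunk`, `effective_score`, and `confidence_level` are hypothetical names used for illustration; only the `rerank_score` / `rrf_score` fields and the thresholds come from this document:

```python
from dataclasses import dataclass


@dataclass
class RetrievedChunk:
    """Hypothetical stand-in for a retrieval result."""
    rerank_score: float = 0.0  # cross-encoder score; 0 means "not reranked"
    rrf_score: float = 0.0     # reciprocal rank fusion score


def effective_score(chunk: RetrievedChunk) -> float:
    # Prefer the cross-encoder score when present, else fall back to RRF.
    return chunk.rerank_score if chunk.rerank_score > 0 else chunk.rrf_score


def confidence_level(chunks: list[RetrievedChunk]) -> str:
    if not chunks:
        return "low"
    top_score = max(effective_score(c) for c in chunks)
    if top_score > 0.5:
        return "high"
    if top_score > 0.2:
        return "medium"
    return "low"
```

Note that both thresholds are strict: a top score of exactly 0.5 maps to `"medium"` and exactly 0.2 maps to `"low"`.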

	### System Prompt

	`build_system_prompt()` combines two concern areas:

	```
	CITATION RULES:
	- Cite every factual claim with [Source: filename, p.N] inline.
	- Use exact source names and page numbers from context headers.
	- Do not cite general knowledge.

	FAITHFULNESS RULES:
	- If answer not in context: respond with REFUSAL_PHRASE only.
	- Keep factual answers under 150 words.
	- Keep summary answers under 300 words.
	```

	Word limits prevent verbose answers that dilute citations and increase hallucination risk.

	---

	## CitationInjector

	File: [voicevault/generation/citation_injector.py](../voicevault/generation/citation_injector.py)

	Post-processes LLM output to resolve inline markers into structured `Citation` objects.

	### Marker Format

	The LLM is instructed to use: `[Source: filename, p.N]`

	The regex also handles abbreviated forms the LLM might produce:
	```python
	_CITATION_PATTERN = re.compile(
	r"\[(?:Source:\s)?([^,\]]+?)(?:,\sp\.?\s*(\d+))?\]",
	re.IGNORECASE,
	)
	```
	Matches: `[Source: report.pdf, p.3]`, `[report.pdf, p.3]`, `[Source: report]`.
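To see what the pattern captures, the sketch below runs it over a sample answer. `parse_markers` is a hypothetical helper name, and the sample text is invented:

```python
import re
from typing import Optional

_CITATION_PATTERN = re.compile(
    r"\[(?:Source:\s)?([^,\]]+?)(?:,\sp\.?\s*(\d+))?\]",
    re.IGNORECASE,
)


def parse_markers(answer: str) -> list[tuple[str, Optional[int]]]:
    """Return (raw_name, page_num) for each marker; page_num is None if omitted."""
    markers = []
    for match in _CITATION_PATTERN.finditer(answer):
        raw_name = match.group(1).strip()
        page_num = int(match.group(2)) if match.group(2) else None
        markers.append((raw_name, page_num))
    return markers


sample = "Growth was 12% [Source: report.pdf, p.3], as noted [report.pdf, p.3] and in [Source: report]."
markers = parse_markers(sample)
```

Because the `Source:` prefix and the page suffix are both optional, the pattern will also match other short bracketed text, so downstream resolution has to tolerate names that do not correspond to any file.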

	### Resolution Strategy (4-level cascade + last resort)

	For each parsed marker `(raw_name, page_num)`:

	| Priority | Strategy | Condition |
	|----------|----------|-----------|
	| 1 | Exact filename + page | `source_file.lower() == raw_name.lower() and page_number == page_num` |
	| 2 | Substring filename + page | `raw_name in source_file.lower() and page_number == page_num` |
	| 3 | Page number only | `page_number == page_num` |
	| 4 | Filename substring (no page) | `raw_name in source_file.lower()` |
	| 5 (last resort) | First citation in map | Always |

	This cascade handles real-world LLM output variability — models sometimes abbreviate filenames or omit page numbers.
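The cascade can be sketched as an ordered list of predicates, tried one after another. This is an illustration under assumptions, not the module's code: `Citation` is reduced to two fields, the citation map is simplified to a list, and `raw_name` is lowercased once for every strategy:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Citation:
    """Hypothetical minimal shape; the real class may carry more fields."""
    source_file: str
    page_number: int


def resolve(raw_name: str, page_num: Optional[int],
            candidates: list[Citation]) -> Optional[Citation]:
    name = raw_name.lower()
    strategies = [
        # 1. Exact filename + page
        lambda c: c.source_file.lower() == name and c.page_number == page_num,
        # 2. Substring filename + page
        lambda c: name in c.source_file.lower() and c.page_number == page_num,
        # 3. Page number only
        lambda c: c.page_number == page_num,
        # 4. Filename substring, page ignored
        lambda c: name in c.source_file.lower(),
    ]
    for matches in strategies:
        for candidate in candidates:
            if matches(candidate):
                return candidate
    # 5. Last resort: first citation available, or None if there are none.
    return candidates[0] if candidates else None
```

Trying every candidate against strategy 1 before moving to strategy 2 means a precise match anywhere in the list wins over a looser match earlier in the list.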

	### Deduplication

	A `seen_keys: set[tuple[str, int]]` tracks `(source_file, page_number)` pairs. The same source/page cited multiple times resolves to one `Citation` in the output list.
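A minimal sketch of the deduplication step, operating on bare `(source_file, page_number)` tuples rather than `Citation` objects. `dedupe` is a hypothetical helper name:

```python
def dedupe(pairs: list[tuple[str, int]]) -> list[tuple[str, int]]:
    """Keep the first occurrence of each (source_file, page_number) pair, in order."""
    seen_keys: set[tuple[str, int]] = set()
    unique: list[tuple[str, int]] = []
    for key in pairs:
        if key in seen_keys:
            continue  # already cited once; skip the repeat
        seen_keys.add(key)
        unique.append(key)
    return unique
```

Using a set for membership and a list for output keeps first-citation order while making each lookup constant time.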

	### Output Contract

	```python
	inject(answer, citation_map) → (answer_text, resolved_citations)
	```

	The answer text is preserved with markers — they are not stripped. The UI displays both the inline `[Source: ...]` text and the structured citation panel below the answer.

	---

	## AnswerChain

	File: [voicevault/generation/answer_chain.py](../voicevault/generation/answer_chain.py)

	### LLM Selection

	```
	GROQ_API_KEY set?
	  YES → ChatGroq(model=llama-3.1-70b-versatile)
	        If invoke() raises → fall back to Gemini
	  NO  → ChatGoogleGenerativeAI(model=gemini-1.5-flash)
	        If invoke() raises →
	          Return REFUSAL_PHRASE (no crash)
	```

	Both LLMs are constructed fresh per call (not cached) — `max_tokens` varies by `query_type` and LangChain model instances are lightweight.
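The selection and fallback order can be sketched without any LLM SDK. Everything here is a stand-in: a "model" is any callable from messages to text, a builder returns `None` when its API key is missing, and `invoke_with_fallback` is a hypothetical name that only mirrors the behavior described above:

```python
from typing import Callable, Optional

REFUSAL_PHRASE = "I could not find this in your documents."

# A "model" here is any callable that maps a message list to answer text.
Model = Callable[[list], str]
Builder = Callable[[], Optional[Model]]


def invoke_with_fallback(messages: list,
                         providers: list[tuple[str, Builder]]) -> tuple[str, str]:
    """Try each provider in order; return (answer, model_used)."""
    for model_id, build in providers:
        try:
            model = build()  # built fresh per call; None means "not configured"
            if model is None:
                continue
            return model(messages), model_id
        except Exception:
            continue  # fall through to the next provider
    return REFUSAL_PHRASE, "none"


def failing_groq(messages: list) -> str:
    raise RuntimeError("simulated provider failure")


answer, model_used = invoke_with_fallback(
    ["hello"],
    [
        ("llama-3.1-70b-versatile", lambda: failing_groq),
        ("gemini-1.5-flash", lambda: (lambda msgs: "stub answer")),
    ],
)
```

Here the first provider raises, so `answer` comes from the second one; with no working provider the function returns the refusal phrase and `"none"`.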

	### Message Layout

	```
	[SystemMessage] ← FaithfulnessGuard.build_system_prompt()
	[HumanMessage] ← history turn 1 (oldest within window)
	[AIMessage] ← history turn 1 response
	... ← up to cfg.conversation_window pairs
	[HumanMessage] ← "Context:\n{context}\n\nQuestion: {query}"
	```

	History is capped at `cfg.conversation_window` (default 5) to keep prompt size predictable.
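A sketch of the layout and the window cap, using plain `(role, content)` tuples as stand-ins for `SystemMessage`, `HumanMessage`, and `AIMessage`. The function name and the shape of `history` (a list of question/answer pairs, oldest first) are assumptions:

```python
def build_messages(system_prompt: str,
                   history: list[tuple[str, str]],
                   context: str,
                   query: str,
                   window: int = 5) -> list[tuple[str, str]]:
    """Assemble system prompt, the most recent history pairs, then the final question."""
    messages = [("system", system_prompt)]
    recent = history[-window:] if window > 0 else []
    for user_text, assistant_text in recent:
        messages.append(("human", user_text))
        messages.append(("ai", assistant_text))
    messages.append(("human", f"Context:\n{context}\n\nQuestion: {query}"))
    return messages


history = [(f"q{i}", f"a{i}") for i in range(7)]
messages = build_messages("SYSTEM", history, "ctx", "what changed?")
```

With seven stored turns and the default window of 5, the two oldest pairs are dropped and the list holds 12 entries: one system message, ten history messages, and the final question.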

	### Token Budget by Query Type

	```
	factual → cfg.max_answer_tokens       (default 500 tokens)
	summary → cfg.max_answer_tokens × 2   (default 1000 tokens)
	compare → cfg.max_answer_tokens       (default 500 tokens)
	```

	Summaries need more room for comprehensive coverage.
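Expressed as code, the mapping might look like the sketch below. `max_tokens_for` is a hypothetical name, and giving unknown query types the base budget is an assumption:

```python
def max_tokens_for(query_type: str, max_answer_tokens: int = 500) -> int:
    """Return the output-token budget for a query type."""
    if query_type == "summary":
        return max_answer_tokens * 2  # summaries get double the base budget
    # "factual", "compare", and (by assumption) anything else use the base budget.
    return max_answer_tokens
```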

	### generate() Flow

	```
	generate(query, context, citation_map, history, query_type) → GenerationResult:
	  1. _build_messages()             → LangChain message list
	  2. _invoke_with_fallback()       → raw_answer, model_used, tokens_used
	  3. CitationInjector.inject()     → clean_answer, citations
	  4. FaithfulnessGuard.is_refusal() → is_refusal flag
	  5. _confidence_from_citations()  → "high" | "medium" | "low"
	  6. return GenerationResult
	```

	### stream_generate() Flow

	```
	stream_generate(...) → Generator[str, None, None]:
	  1. _build_messages()
	  2. _build_groq() or _build_gemini()   ← first available
	  3. for chunk in llm.stream(messages): yield chunk.content
	  4. On error: yield error message (never raises)
	```

	Streaming is used by the Gradio UI to show tokens as they arrive. Citation injection and faithfulness check are not applied to streamed chunks — call `generate()` once streaming completes for the structured result.
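The "never raises" contract can be sketched as a generator that wraps any chunk source. `stream_text` and the error message format are hypothetical; skipping empty chunks follows the behavior listed in the test table:

```python
from typing import Iterable, Iterator


def stream_text(chunks: Iterable[str]) -> Iterator[str]:
    """Yield non-empty chunks; on failure, yield an error message instead of raising."""
    try:
        for piece in chunks:
            if piece:  # skip empty chunks
                yield piece
    except Exception as exc:
        yield f"[generation error: {exc}]"


def broken_source() -> Iterator[str]:
    # Simulates a provider stream that fails partway through.
    yield "partial"
    raise RuntimeError("boom")


ok_output = list(stream_text(["Hello", "", " world"]))
failed_output = list(stream_text(broken_source()))
```

A UI consumer can iterate the generator without its own `try`/`except`, since a failure arrives as one final text chunk.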

	### GenerationResult

	```python
	from dataclasses import dataclass

	@dataclass
	class GenerationResult:
	    answer: str                # Final answer with inline [Source: ...] markers
	    citations: list[Citation]  # Resolved, deduplicated citations
	    confidence_level: str      # "high" | "medium" | "low"
	    is_refusal: bool           # True if the answer is empty or contains the refusal phrase
	    model_used: str            # Model ID ("llama-3.1-70b-versatile" / "gemini-1.5-flash" / "none")
	    tokens_used: int           # Total tokens (input + output); 0 if unavailable
	    latency_ms: int            # Wall-clock LLM call time in ms
	```

	### Token Extraction

	```python
	def _extract_tokens(response) -> int:
	    try:
	        return int(response.usage_metadata.get("total_tokens", 0))
	    except (AttributeError, TypeError):
	        # Missing attribute, None metadata, or a non-numeric value
	        return 0
	```

	`usage_metadata` is the LangChain standard; both Groq and Gemini backends populate it. Returns 0 gracefully when unavailable (e.g., during streaming or with older SDK versions).

	---

	## Security Decisions

	### No Prompt Injection Through Context

	Context is injected as a plain string inside the user message, not in the system prompt. The system prompt is hardcoded and never receives user-controlled input. This limits the prompt injection attack surface: a malicious document cannot rewrite the faithfulness or citation instructions themselves, although text inside the context can still try to steer the model and should be treated as untrusted.

	### Refusal as Default

	When no LLM is configured (both keys absent) or both calls fail, the chain returns `REFUSAL_PHRASE` with `model_used="none"`. The application continues running — it never crashes due to missing API keys.

	### No PII in LLM Calls

	The query text passed to the LLM is the preprocessed version from `QueryPreprocessor` — fillers stripped, lowercased. The raw Whisper transcript is never sent to the LLM.

	---

	## Test Coverage

	File: [tests/test_phase4.py](../tests/test_phase4.py) | 72/72 passed

	| Class | Tests | What's verified |
	|-------|-------|----------------|
	| `TestCitationInjectorBasic` | 8 | Empty input, no markers, exact match, multiple markers, dedup, text preservation, order, empty map |
	| `TestCitationInjectorMatchingStrategies` | 5 | All 4 strategies + last resort |
	| `TestFaithfulnessGuardRefusal` | 7 | Exact phrase, case insensitive, embedded, normal answer, empty, partial, no-period |
	| `TestFaithfulnessGuardConfidence` | 10 | Empty results, high/medium/low thresholds, max across results, rrf fallback, all 4 boundary conditions |
	| `TestFaithfulnessGuardSystemPrompt` | 6 | Refusal phrase present, citation rules, faithfulness rules, length |
	| `TestAnswerChainMessageBuilding` | 7 | SystemMessage first, HumanMessage last, context in body, history pairs, window cap, no-history length |
	| `TestAnswerChainMaxTokens` | 3 | factual/summary/compare budgets |
	| `TestAnswerChainTokenExtraction` | 4 | Valid metadata, None metadata, missing attribute, type error |
	| `TestAnswerChainConfidenceFromCitations` | 5 | Empty, high/medium/low thresholds, max across citations |
	| `TestAnswerChainGenerateMocked` | 7 | Returns correct type, answer content, latency, tokens, refusal detection, non-refusal, citation resolution |
	| `TestAnswerChainFallback` | 4 | Gemini fallback on Groq failure, refusal when both fail, refusal when no keys, Groq preferred |
	| `TestAnswerChainStreaming` | 4 | Yields chunks, skips empty chunks, refusal when no LLM, error on exception |
	| `TestGenerationResult` | 2 | Instantiation, mutable citations list |

	### Mocking Strategy

	No real API keys are needed. Tests patch `_build_groq` and `_build_gemini` at the instance level to return `MagicMock` LLMs with controlled responses:

	```python
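	from unittest.mock import MagicMock, patch

	mock_llm = MagicMock()
	mock_llm.invoke.return_value = mock_response  # or .side_effect = RuntimeError(...)
	# Patch the builder on the chain instance so no real client is constructed.
	with patch.object(chain, "_build_groq", return_value=mock_llm):
	    result = chain.generate(...)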
	```

	---

	## Integration Points

	### Called by (Phase 5 orchestrator)

	```python
	# In the query handler:
	results = retriever.search(query, kb_names)
	context, citation_map = builder.build(results)
	generation = chain.generate(
	    query=transcript.transcript,
	    context=context,
	    citation_map=citation_map,
	    history=session.history,
	    query_type=transcript.query_type,
	)
	# generation.answer → display + TTS
	# generation.citations → citation panel
	# generation.is_refusal → skip TTS if True
	# generation.tokens_used → store in QuerySession
	```

	### Dependencies

	| Dep | Purpose |
	|-----|---------|
	| `langchain-core` | `HumanMessage`, `AIMessage`, `SystemMessage` |
	| `langchain-groq` | `ChatGroq` client |
	| `langchain-google-genai` | `ChatGoogleGenerativeAI` client |
	| `FaithfulnessGuard` | System prompt + refusal detection |
	| `CitationInjector` | Marker parsing + resolution |