Phase 4: LLM Generation Chain
Status: Complete | Tests: 72/72 passed | Files: 3 modules
What Was Built
Phase 4 implements the generation layer: context + query → grounded, cited answer.
| Module | Responsibility |
|---|---|
| voicevault/generation/answer_chain.py | LangChain LCEL chain, Groq → Gemini fallback |
| voicevault/generation/citation_injector.py | Parse + resolve [Source: ...] markers |
| voicevault/generation/faithfulness_guard.py | Refusal detection + confidence scoring |
FaithfulnessGuard
File: voicevault/generation/faithfulness_guard.py
Two-layer hallucination prevention:
Layer 1: System Prompt Instruction
The LLM is instructed to use a fixed refusal phrase when the answer is not in context:
REFUSAL_PHRASE = "I could not find this in your documents."
This phrase is embedded in the system prompt via build_system_prompt(). The instruction is unambiguous: use exactly this phrase and nothing else.
Layer 2: Post-Generation Check
After the LLM responds, is_refusal() verifies the instruction was followed:
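import re

# Compiled once at module level: the refusal phrase, lowercased, minus its trailing period.
_REFUSAL_PATTERN = re.compile(
    re.escape(REFUSAL_PHRASE.lower().rstrip(".")),
    re.IGNORECASE,
)

def is_refusal(self, answer: str) -> bool:
    # An empty or missing answer is treated as a refusal.
    if not answer:
        return True
    return bool(_REFUSAL_PATTERN.search(answer))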
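The pattern strips the trailing period before matching, so it covers both the "...documents." and "...documents" forms. Matching is case-insensitive for robustness, and because search() is used, the phrase is also detected when it is embedded in a longer answer.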
Confidence Scoring
Based on the top retrieval score across all retrieved chunks:
top_score > 0.5 → "high" (strong retrieval signal)
top_score > 0.2 → "medium" (moderate signal; answer may miss nuance)
top_score ≤ 0.2 → "low" (weak signal; treat answer with caution)
Each chunk is scored by rerank_score (the cross-encoder score) when it is > 0, and by rrf_score (reciprocal rank fusion) otherwise. Empty results → "low".
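The thresholds above can be sketched as a small function. This is a minimal illustration, not the project's code: RetrievedChunk, effective_score, and confidence_level are hypothetical names, and only the rerank_score / rrf_score fields and the 0.5 / 0.2 cut-offs come from the description.

```python
from dataclasses import dataclass


@dataclass
class RetrievedChunk:
    """Hypothetical stand-in for a retrieval result."""
    rerank_score: float = 0.0  # cross-encoder score; 0 means "not reranked"
    rrf_score: float = 0.0     # reciprocal rank fusion score


def effective_score(chunk: RetrievedChunk) -> float:
    # Prefer the cross-encoder score when present, otherwise fall back to RRF.
    return chunk.rerank_score if chunk.rerank_score > 0 else chunk.rrf_score


def confidence_level(chunks: list) -> str:
    if not chunks:
        return "low"  # nothing retrieved
    top_score = max(effective_score(c) for c in chunks)
    if top_score > 0.5:
        return "high"
    if top_score > 0.2:
        return "medium"
    return "low"
```

The comparisons are strict, so a top score of exactly 0.5 lands in "medium" and a top score of exactly 0.2 lands in "low".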
System Prompt
build_system_prompt() combines two concern areas:
CITATION RULES:
- Cite every factual claim with [Source: filename, p.N] inline.
- Use exact source names and page numbers from context headers.
- Do not cite general knowledge.
FAITHFULNESS RULES:
- If answer not in context: respond with REFUSAL_PHRASE only.
- Keep factual answers under 150 words.
- Keep summary answers under 300 words.
Word limits prevent verbose answers that dilute citations and increase hallucination risk.
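One way such a prompt builder could be assembled is sketched below. The rule texts are taken from the list above, but the exact wording and layout of the real prompt are assumptions; the point is only that the function takes no arguments, so no user-controlled text can reach the system prompt.

```python
REFUSAL_PHRASE = "I could not find this in your documents."


def build_system_prompt() -> str:
    """Assemble a hardcoded system prompt (illustrative wording only)."""
    citation_rules = [
        "Cite every factual claim with [Source: filename, p.N] inline.",
        "Use exact source names and page numbers from context headers.",
        "Do not cite general knowledge.",
    ]
    faithfulness_rules = [
        # The refusal phrase is interpolated from the shared constant so that
        # the prompt and the post-generation check cannot drift apart.
        f'If the answer is not in the context, respond with exactly "{REFUSAL_PHRASE}" and nothing else.',
        "Keep factual answers under 150 words.",
        "Keep summary answers under 300 words.",
    ]
    lines = ["CITATION RULES:"]
    lines += [f"- {rule}" for rule in citation_rules]
    lines.append("FAITHFULNESS RULES:")
    lines += [f"- {rule}" for rule in faithfulness_rules]
    return "\n".join(lines)
```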
CitationInjector
File: voicevault/generation/citation_injector.py
Post-processes LLM output to resolve inline markers into structured Citation objects.
Marker Format
The LLM is instructed to use: [Source: filename, p.N]
The regex also handles abbreviated forms the LLM might produce:
_CITATION_PATTERN = re.compile(
    r"\[(?:Source:\s*)?([^,\]]+?)(?:,\s*p\.?\s*(\d+))?\]",
    re.IGNORECASE,
)
Matches: [Source: report.pdf, p.3], [report.pdf, p.3], [Source: report].
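To see what the pattern captures, the short sketch below runs it over a sample answer. parse_markers is a hypothetical helper written for this illustration; only the regex itself comes from the module.

```python
import re

_CITATION_PATTERN = re.compile(
    r"\[(?:Source:\s*)?([^,\]]+?)(?:,\s*p\.?\s*(\d+))?\]",
    re.IGNORECASE,
)


def parse_markers(text: str) -> list:
    """Return (raw_name, page_num) for each marker; page_num is None if omitted."""
    markers = []
    for match in _CITATION_PATTERN.finditer(text):
        raw_name = match.group(1).strip()
        page_num = int(match.group(2)) if match.group(2) else None
        markers.append((raw_name, page_num))
    return markers


sample = "Revenue grew [Source: report.pdf, p.3] and [report.pdf, p.3]; see [Source: report]."
print(parse_markers(sample))
```

Because both the Source: prefix and the page part are optional, the pattern also matches any other bracketed text that contains no comma, such as [1].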
Resolution Strategy (4-level cascade plus last resort)
For each parsed marker (raw_name, page_num):
| Priority | Strategy | Condition |
|---|---|---|
| 1 | Exact filename + page | source_file.lower() == raw_name.lower() and page_number == page_num |
| 2 | Substring filename + page | raw_name in source_file.lower() and page_number == page_num |
| 3 | Page number only | page_number == page_num |
| 4 | Filename substring (no page) | raw_name in source_file.lower() |
| 5 (last resort) | First citation in map | Always |
This cascade handles real-world LLM output variability: models sometimes abbreviate filenames or omit page numbers.
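The cascade in the table can be sketched as follows. This is an illustration under stated assumptions, not the module's implementation: Citation is a hypothetical two-field stand-in, the citation map is assumed to have been flattened into an ordered list, and raw_name is lowercased once up front for every strategy.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Citation:
    """Hypothetical stand-in with only the fields the cascade needs."""
    source_file: str
    page_number: int


def resolve(raw_name: str, page_num: Optional[int], candidates: list) -> Optional[Citation]:
    if not candidates:
        return None
    name = raw_name.strip().lower()  # assumption: normalise once for all strategies
    # 1. Exact filename + page
    for c in candidates:
        if c.source_file.lower() == name and c.page_number == page_num:
            return c
    # 2. Substring filename + page
    for c in candidates:
        if name in c.source_file.lower() and c.page_number == page_num:
            return c
    # 3. Page number only (skipped when the marker had no page)
    if page_num is not None:
        for c in candidates:
            if c.page_number == page_num:
                return c
    # 4. Filename substring, ignoring the page
    for c in candidates:
        if name in c.source_file.lower():
            return c
    # 5. Last resort: first citation in the map
    return candidates[0]
```

Each strategy scans the full candidate list before the next one is tried, so a weaker match earlier in the list never shadows a stronger match later in it.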
Deduplication
A seen_keys: set[tuple[str, int]] tracks (source_file, page_number) pairs. The same source/page cited multiple times resolves to one Citation in the output list.
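A minimal sketch of that deduplication step, assuming citations are reduced to (source_file, page_number) pairs and that first-cited order is kept; dedupe_citations is a hypothetical name.

```python
def dedupe_citations(resolved: list) -> list:
    """Keep the first occurrence of each (source_file, page_number) pair, in order."""
    seen_keys: set[tuple[str, int]] = set()
    unique = []
    for source_file, page_number in resolved:
        key = (source_file, page_number)
        if key in seen_keys:
            continue  # same source and page already cited
        seen_keys.add(key)
        unique.append(key)
    return unique
```

Two markers for the same file but different pages produce two entries, since the page number is part of the key.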
Output Contract
inject(answer, citation_map) → (answer_text, resolved_citations)
The answer text keeps its markers; they are not stripped. The UI displays both the inline [Source: ...] text and the structured citation panel below the answer.
AnswerChain
File: voicevault/generation/answer_chain.py
LLM Selection
GROQ_API_KEY set?
  YES → ChatGroq(model=llama-3.1-70b-versatile)
          If invoke() raises ↓
  NO  → ChatGoogleGenerativeAI(model=gemini-1.5-flash)
          If invoke() raises ↓
        Return REFUSAL_PHRASE (no crash)
Both LLMs are constructed fresh per call (not cached) because max_tokens varies by query_type and LangChain model instances are lightweight.
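The fallback order can be illustrated without any real API client. In the sketch below, invoke_with_fallback and FakeLLM are hypothetical names; the only behaviour taken from the description is "try Groq, then Gemini, then return REFUSAL_PHRASE with model_used set to none".

```python
from types import SimpleNamespace

REFUSAL_PHRASE = "I could not find this in your documents."


def invoke_with_fallback(messages, builders):
    """Try each (model_name, build_fn) in order; never raise.

    build_fn is a zero-argument callable returning an object with .invoke(),
    or None when that provider is not configured (e.g. its API key is absent).
    """
    for model_name, build_fn in builders:
        try:
            llm = build_fn()
            if llm is None:
                continue  # provider not configured, try the next one
            response = llm.invoke(messages)
            return response.content, model_name
        except Exception:
            continue  # construction or call failed, try the next provider
    return REFUSAL_PHRASE, "none"


class FakeLLM:
    """Test double standing in for a chat model client."""

    def __init__(self, reply=None, error=None):
        self.reply = reply
        self.error = error

    def invoke(self, messages):
        if self.error is not None:
            raise self.error
        return SimpleNamespace(content=self.reply)


builders = [
    ("llama-3.1-70b-versatile", lambda: FakeLLM(error=RuntimeError("rate limited"))),
    ("gemini-1.5-flash", lambda: FakeLLM(reply="Paris [Source: geo.pdf, p.2]")),
]
answer, model_used = invoke_with_fallback([], builders)
```

Passing builder callables rather than ready-made clients mirrors the "constructed fresh per call" choice: each call can build its client with the max_tokens value for the current query type.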
Message Layout
[SystemMessage]  ← FaithfulnessGuard.build_system_prompt()
[HumanMessage]   ← history turn 1 (oldest within window)
[AIMessage]      ← history turn 1 response
...              ← up to cfg.conversation_window pairs
[HumanMessage]   ← "Context:\n{context}\n\nQuestion: {query}"
History is capped at cfg.conversation_window (default 5) to keep prompt size predictable.
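The layout above can be sketched with plain (role, content) tuples standing in for SystemMessage, HumanMessage, and AIMessage. The function name, the shape of history (a list of (user_text, ai_text) pairs, oldest first), and the choice to keep the most recent pairs are assumptions made for this illustration.

```python
def build_messages(system_prompt, history, context, query, window=5):
    """Return the message list: system, windowed history pairs, then the final question."""
    messages = [("system", system_prompt)]
    # Keep only the most recent `window` pairs; history[-0:] would keep everything,
    # so a zero window is handled explicitly.
    recent = history[-window:] if window > 0 else []
    for user_text, ai_text in recent:
        messages.append(("human", user_text))
        messages.append(("ai", ai_text))
    messages.append(("human", f"Context:\n{context}\n\nQuestion: {query}"))
    return messages
```

With a window of 5, the list never exceeds 12 entries (1 system + 10 history + 1 final question), which is what keeps prompt size predictable.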
Token Budget by Query Type
factual → cfg.max_answer_tokens (default 500 tokens)
summary → cfg.max_answer_tokens × 2 (default 1000 tokens)
compare → cfg.max_answer_tokens (default 500 tokens)
Summaries need more room for comprehensive coverage.
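As a sketch, the budget rule reduces to a single conditional. max_tokens_for is a hypothetical name, and treating any unrecognised query type like factual is an assumption.

```python
def max_tokens_for(query_type: str, max_answer_tokens: int = 500) -> int:
    """Summaries get double the base budget; everything else gets the base budget."""
    if query_type == "summary":
        return max_answer_tokens * 2
    return max_answer_tokens
```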
generate() Flow
generate(query, context, citation_map, history, query_type) → GenerationResult:
  1. _build_messages() → LangChain message list
  2. _invoke_with_fallback() → raw_answer, model_used, tokens_used
  3. CitationInjector.inject() → clean_answer, citations
  4. FaithfulnessGuard.is_refusal() → is_refusal flag
  5. _confidence_from_citations() → "high" | "medium" | "low"
  6. return GenerationResult
stream_generate() Flow
stream_generate(...) → Generator[str, None, None]:
  1. _build_messages()
  2. _build_groq() or _build_gemini() → first available
  3. for chunk in llm.stream(messages): yield chunk.content
  4. On error: yield error message (never raises)
Streaming is used by the Gradio UI to show tokens as they arrive. Citation injection and the faithfulness check are not applied to streamed chunks; call generate() once streaming completes for the structured result.
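A generator with this "never raises" contract might look like the sketch below. stream_answer is a hypothetical name, the error message format is invented for illustration, and yielding the refusal phrase when no LLM is available follows the behaviour listed for the streaming tests.

```python
from typing import Generator

REFUSAL_PHRASE = "I could not find this in your documents."


def stream_answer(llm, messages) -> Generator[str, None, None]:
    """Yield text chunks as they arrive; on failure yield a message instead of raising."""
    if llm is None:
        yield REFUSAL_PHRASE  # no provider configured
        return
    try:
        for chunk in llm.stream(messages):
            if chunk.content:  # skip empty keep-alive chunks
                yield chunk.content
    except Exception as exc:
        # Chunks already yielded stay on screen; the error is appended as text.
        yield f"[Generation error: {exc}]"
```

Because the exception is caught inside the generator, the UI loop consuming it needs no try/except of its own.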
GenerationResult
@dataclass
class GenerationResult:
    answer: str                # Final answer with inline [Source: ...] markers
    citations: list[Citation]  # Resolved, deduplicated citations
    confidence_level: str      # "high" | "medium" | "low"
    is_refusal: bool           # True if LLM correctly refused
    model_used: str            # Model ID ("llama-3.1-70b-versatile" / "gemini-1.5-flash" / "none")
    tokens_used: int           # Total tokens (input + output); 0 if unavailable
    latency_ms: int            # Wall-clock LLM call time in ms
Token Extraction
def _extract_tokens(response) -> int:
    try:
        return int(response.usage_metadata.get("total_tokens", 0))
    except (AttributeError, TypeError):
        return 0
usage_metadata is the LangChain standard; both Groq and Gemini backends populate it. Returns 0 gracefully when unavailable (e.g., during streaming or with older SDK versions).
Security Decisions
No Prompt Injection Through Context
Context is injected as a plain string inside the user message, not in the system prompt. The system prompt is hardcoded and never receives user-controlled input. This limits the prompt-injection attack surface: text from a malicious document never reaches the system prompt, which makes it harder for that text to override the faithfulness or citation instructions.
Refusal as Default
When no LLM is configured (both keys absent) or both calls fail, the chain returns REFUSAL_PHRASE with model_used="none". The application keeps running; missing API keys never crash it.
No PII in LLM Calls
The query text passed to the LLM is the preprocessed version from QueryPreprocessor (fillers stripped, lowercased). The raw Whisper transcript is never sent to the LLM.
Test Coverage
File: tests/test_phase4.py | 72/72 passed
| Class | Tests | What's verified |
|---|---|---|
| TestCitationInjectorBasic | 8 | Empty input, no markers, exact match, multiple markers, dedup, text preservation, order, empty map |
| TestCitationInjectorMatchingStrategies | 5 | All 4 strategies + last resort |
| TestFaithfulnessGuardRefusal | 7 | Exact phrase, case insensitive, embedded, normal answer, empty, partial, no-period |
| TestFaithfulnessGuardConfidence | 10 | Empty results, high/medium/low thresholds, max across results, rrf fallback, all 4 boundary conditions |
| TestFaithfulnessGuardSystemPrompt | 6 | Refusal phrase present, citation rules, faithfulness rules, length |
| TestAnswerChainMessageBuilding | 7 | SystemMessage first, HumanMessage last, context in body, history pairs, window cap, no-history length |
| TestAnswerChainMaxTokens | 3 | factual/summary/compare budgets |
| TestAnswerChainTokenExtraction | 4 | Valid metadata, None metadata, missing attribute, type error |
| TestAnswerChainConfidenceFromCitations | 5 | Empty, high/medium/low thresholds, max across citations |
| TestAnswerChainGenerateMocked | 7 | Returns correct type, answer content, latency, tokens, refusal detection, non-refusal, citation resolution |
| TestAnswerChainFallback | 4 | Gemini fallback on Groq failure, refusal when both fail, refusal when no keys, Groq preferred |
| TestAnswerChainStreaming | 4 | Yields chunks, skips empty chunks, refusal when no LLM, error on exception |
| TestGenerationResult | 2 | Instantiation, mutable citations list |
Mocking Strategy
No real API keys are needed. Tests patch _build_groq and _build_gemini at the instance level to return MagicMock LLMs with controlled responses:
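from unittest.mock import MagicMock, patch

mock_llm = MagicMock()
mock_llm.invoke.return_value = mock_response  # or .side_effect = RuntimeError(...)
# Patch the builder on this chain instance so no real client is constructed.
with patch.object(chain, "_build_groq", return_value=mock_llm):
    result = chain.generate(...)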
Integration Points
Called by (Phase 5 orchestrator)
# In the query handler:
results = retriever.search(query, kb_names)
context, citation_map = builder.build(results)
generation = chain.generate(
    query=transcript.transcript,
    context=context,
    citation_map=citation_map,
    history=session.history,
    query_type=transcript.query_type,
)
# generation.answer      → display + TTS
# generation.citations   → citation panel
# generation.is_refusal  → skip TTS if True
# generation.tokens_used → store in QuerySession
Dependencies
| Dep | Purpose |
|---|---|
| langchain-core | HumanMessage, AIMessage, SystemMessage |
| langchain-groq | ChatGroq client |
| langchain-google-genai | ChatGoogleGenerativeAI client |
| FaithfulnessGuard | System prompt + refusal detection |
| CitationInjector | Marker parsing + resolution |