
Phase 4: LLM Generation Chain

Status: ✅ Complete | Tests: 72/72 passed | Files: 3 modules


What Was Built

Phase 4 implements the generation layer: context + query → grounded, cited answer.

| Module | Responsibility |
|---|---|
| voicevault/generation/answer_chain.py | LangChain LCEL chain, Groq → Gemini fallback |
| voicevault/generation/citation_injector.py | Parse + resolve [Source: ...] markers |
| voicevault/generation/faithfulness_guard.py | Refusal detection + confidence scoring |

FaithfulnessGuard

File: voicevault/generation/faithfulness_guard.py

Two-layer hallucination prevention:

Layer 1: System Prompt Instruction

The LLM is instructed to use a fixed refusal phrase when the answer is not in context:

REFUSAL_PHRASE = "I could not find this in your documents."

This phrase is embedded in the system prompt via build_system_prompt(). The instruction is unambiguous: use exactly this phrase and nothing else.

Layer 2: Post-Generation Check

After the LLM responds, is_refusal() verifies the instruction was followed:

_REFUSAL_PATTERN = re.compile(
    re.escape(REFUSAL_PHRASE.lower().rstrip(".")),
    re.IGNORECASE,
)

def is_refusal(self, answer: str) -> bool:
    if not answer:
        return True
    return bool(_REFUSAL_PATTERN.search(answer))

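The pattern strips the trailing period before matching, so it covers both the "...documents." and "...documents" forms. Matching is case-insensitive and uses search(), so an answer that merely contains the phrase also counts as a refusal, as does an empty answer.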
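As a quick illustration of the check described above, here is a minimal standalone sketch. It assumes a module-level function rather than the FaithfulnessGuard method, and the sample answers are invented for the example.

```python
import re

REFUSAL_PHRASE = "I could not find this in your documents."

# Drop the trailing period so punctuated and unpunctuated forms both match.
_REFUSAL_PATTERN = re.compile(
    re.escape(REFUSAL_PHRASE.rstrip(".")),
    re.IGNORECASE,
)


def is_refusal(answer: str) -> bool:
    """Return True for empty answers or answers containing the refusal phrase."""
    if not answer:
        return True
    return bool(_REFUSAL_PATTERN.search(answer))


examples = [
    "I could not find this in your documents.",
    "i could not find this in your documents",
    "Sorry. I could not find this in your documents. Try another query.",
    "The report covers Q3 revenue [Source: report.pdf, p.3].",
    "",
]
flags = [is_refusal(text) for text in examples]
```

Only the fourth example, a normal cited answer, is classified as a non-refusal.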

Confidence Scoring

Based on the top retrieval score across all retrieved chunks:

top_score > 0.5  → "high"   (strong retrieval signal)
top_score > 0.2  → "medium" (moderate signal; answer may miss nuance)
top_score ≤ 0.2  → "low"    (weak signal; treat answer with caution)

The score used is rerank_score (the cross-encoder score) when it is greater than 0; otherwise it falls back to rrf_score (reciprocal rank fusion). Empty results → "low".
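The thresholds and the score fallback can be sketched as follows. ScoredChunk is a hypothetical stand-in for whatever result type the retriever returns; only the rerank_score and rrf_score field names come from the description above.

```python
from dataclasses import dataclass


@dataclass
class ScoredChunk:
    """Hypothetical stand-in for a retrieval result."""
    rerank_score: float = 0.0
    rrf_score: float = 0.0


def effective_score(chunk: ScoredChunk) -> float:
    # Prefer the cross-encoder score when present, else fall back to RRF.
    return chunk.rerank_score if chunk.rerank_score > 0 else chunk.rrf_score


def confidence_level(chunks: list[ScoredChunk]) -> str:
    """Map the best score across all chunks to "high", "medium", or "low"."""
    if not chunks:
        return "low"
    top_score = max(effective_score(c) for c in chunks)
    if top_score > 0.5:
        return "high"
    if top_score > 0.2:
        return "medium"
    return "low"
```

Both comparisons are strict, so a score of exactly 0.5 maps to "medium" and exactly 0.2 maps to "low".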

System Prompt

build_system_prompt() combines two concern areas:

CITATION RULES:
  - Cite every factual claim with [Source: filename, p.N] inline.
  - Use exact source names and page numbers from context headers.
  - Do not cite general knowledge.

FAITHFULNESS RULES:
  - If answer not in context: respond with REFUSAL_PHRASE only.
  - Keep factual answers under 150 words.
  - Keep summary answers under 300 words.

The word limits discourage verbose answers, which dilute citations and increase hallucination risk.
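One way such a prompt could be assembled is sketched below. The rule wording is copied from the outline above, but the constant names and the joining logic are assumptions, not the project's actual implementation.

```python
REFUSAL_PHRASE = "I could not find this in your documents."

# Hypothetical constants; the real module may store its rules differently.
CITATION_RULES = (
    "Cite every factual claim with [Source: filename, p.N] inline.",
    "Use exact source names and page numbers from context headers.",
    "Do not cite general knowledge.",
)

FAITHFULNESS_RULES = (
    f'If the answer is not in the context, respond with exactly: "{REFUSAL_PHRASE}"',
    "Keep factual answers under 150 words.",
    "Keep summary answers under 300 words.",
)


def build_system_prompt() -> str:
    """Join both rule groups into one static prompt (no user input involved)."""
    sections = (
        ("CITATION RULES", CITATION_RULES),
        ("FAITHFULNESS RULES", FAITHFULNESS_RULES),
    )
    lines = []
    for title, rules in sections:
        lines.append(f"{title}:")
        lines.extend(f"  - {rule}" for rule in rules)
        lines.append("")
    return "\n".join(lines).strip()


prompt = build_system_prompt()
```

Because the function takes no arguments, nothing user-controlled can reach the system prompt.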


CitationInjector

File: voicevault/generation/citation_injector.py

Post-processes LLM output to resolve inline markers into structured Citation objects.

Marker Format

The LLM is instructed to use: [Source: filename, p.N]

The regex also handles abbreviated forms the LLM might produce:

_CITATION_PATTERN = re.compile(
    r"\[(?:Source:\s*)?([^,\]]+?)(?:,\s*p\.?\s*(\d+))?\]",
    re.IGNORECASE,
)

Matches: [Source: report.pdf, p.3], [report.pdf, p.3], [Source: report].
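To see what the two capture groups hold for each marker form, the pattern can be run over a sample answer. parse_markers is a hypothetical helper written for this illustration, and the sample text is invented.

```python
import re
from typing import Optional

_CITATION_PATTERN = re.compile(
    r"\[(?:Source:\s*)?([^,\]]+?)(?:,\s*p\.?\s*(\d+))?\]",
    re.IGNORECASE,
)


def parse_markers(text: str) -> list[tuple[str, Optional[int]]]:
    """Return (raw_name, page_num) per marker; page_num is None if omitted."""
    markers = []
    for match in _CITATION_PATTERN.finditer(text):
        raw_name = match.group(1).strip()
        page = match.group(2)
        markers.append((raw_name, int(page) if page is not None else None))
    return markers


answer = (
    "Revenue rose [Source: report.pdf, p.3] while costs fell "
    "[report.pdf, p. 4]; see also [Source: summary]."
)
markers = parse_markers(answer)
```

Since the "Source:" prefix is optional, any other bracketed text without a comma, such as [note], also matches and is passed on to the resolution step.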

Resolution Strategy (4-level cascade plus last resort)

For each parsed marker (raw_name, page_num):

| Priority | Strategy | Condition |
|---|---|---|
| 1 | Exact filename + page | source_file.lower() == raw_name.lower() and page_number == page_num |
| 2 | Substring filename + page | raw_name in source_file.lower() and page_number == page_num |
| 3 | Page number only | page_number == page_num |
| 4 | Filename substring (no page) | raw_name in source_file.lower() |
| 5 (last resort) | First citation in map | Always |

This cascade handles real-world LLM output variability: models sometimes abbreviate filenames or omit page numbers.
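A minimal sketch of the cascade follows. It assumes the citation map can be treated as an ordered list of candidates and uses a hypothetical two-field Citation, so the real types will differ. The name is lowercased once up front.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Citation:
    """Hypothetical minimal shape; the real Citation likely has more fields."""
    source_file: str
    page_number: int


def resolve(
    raw_name: str, page_num: Optional[int], citations: list[Citation]
) -> Optional[Citation]:
    """Return the first candidate accepted by the highest-priority strategy."""
    name = raw_name.strip().lower()
    strategies = [
        # 1. Exact filename + page
        lambda c: c.source_file.lower() == name and c.page_number == page_num,
        # 2. Substring filename + page
        lambda c: name in c.source_file.lower() and c.page_number == page_num,
        # 3. Page number only
        lambda c: page_num is not None and c.page_number == page_num,
        # 4. Filename substring, page ignored
        lambda c: name in c.source_file.lower(),
    ]
    for matches in strategies:
        for candidate in citations:
            if matches(candidate):
                return candidate
    # 5. Last resort: first citation, or None when there is nothing to cite.
    return citations[0] if citations else None


known = [
    Citation("annual_report.pdf", 3),
    Citation("notes.pdf", 7),
    Citation("annual_report.pdf", 9),
]
```

Each strategy is tried against every candidate before the next strategy is considered, so a weak match never wins over a stronger one further down the list.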

Deduplication

A seen_keys: set[tuple[str, int]] tracks (source_file, page_number) pairs. The same source/page cited multiple times resolves to one Citation in the output list.
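The deduplication step can be sketched with plain tuples. The function name and the use of bare (source_file, page_number) pairs instead of Citation objects are assumptions made to keep the example short.

```python
def dedupe(pairs: list[tuple[str, int]]) -> list[tuple[str, int]]:
    """Keep the first occurrence of each (source_file, page_number) pair."""
    seen_keys: set[tuple[str, int]] = set()
    unique = []
    for key in pairs:
        if key in seen_keys:
            continue  # already cited; do not add a second entry
        seen_keys.add(key)
        unique.append(key)
    return unique


cited = [("report.pdf", 3), ("notes.pdf", 1), ("report.pdf", 3), ("report.pdf", 4)]
unique_citations = dedupe(cited)
```

Using a set alongside the output list keeps the order of first appearance while making each membership check constant time.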

Output Contract

inject(answer, citation_map) → (answer_text, resolved_citations)

The answer text is returned with its markers intact; they are not stripped. The UI displays both the inline [Source: ...] text and the structured citation panel below the answer.


AnswerChain

File: voicevault/generation/answer_chain.py

LLM Selection

GROQ_API_KEY set?
  YES → ChatGroq(model=llama-3.1-70b-versatile)
        If invoke() raises → fall back to Gemini
  NO  → ChatGoogleGenerativeAI(model=gemini-1.5-flash)
        If invoke() raises → return REFUSAL_PHRASE (no crash)

Both LLMs are constructed fresh on each call rather than cached: max_tokens varies by query_type, and LangChain model instances are lightweight.
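The selection and fallback order can be sketched without any LLM dependency. In this illustration each backend is a hypothetical "builder" callable that returns a (model_id, invoke) pair, or None when its API key is missing. The real chain builds LangChain chat models instead.

```python
from typing import Callable, Optional

REFUSAL_PHRASE = "I could not find this in your documents."

# Hypothetical types for this sketch only.
Invoke = Callable[[list], str]
Builder = Callable[[], Optional[tuple[str, Invoke]]]


def invoke_with_fallback(messages: list, builders: list[Builder]) -> tuple[str, str]:
    """Try each backend in order and return (answer, model_used)."""
    for build in builders:
        backend = build()
        if backend is None:  # key not configured, skip this backend
            continue
        model_id, invoke = backend
        try:
            return invoke(messages), model_id
        except Exception:
            continue  # fall through to the next backend
    return REFUSAL_PHRASE, "none"


def failing_invoke(messages: list) -> str:
    raise RuntimeError("simulated outage")


def groq_down():
    return ("llama-3.1-70b-versatile", failing_invoke)


def gemini_ok():
    return ("gemini-1.5-flash", lambda messages: "grounded answer")


def unconfigured():
    return None


fallback_result = invoke_with_fallback([], [groq_down, gemini_ok])
no_llm_result = invoke_with_fallback([], [unconfigured, unconfigured])
```

Every failure path ends in the same refusal tuple, which is what lets the caller treat "no LLM available" like any other refusal.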

Message Layout

[SystemMessage]  ← FaithfulnessGuard.build_system_prompt()
[HumanMessage]   ← history turn 1 (oldest within window)
[AIMessage]      ← history turn 1 response
...              ← up to cfg.conversation_window pairs
[HumanMessage]   ← "Context:\n{context}\n\nQuestion: {query}"

History is capped at cfg.conversation_window turn pairs (default 5) to keep prompt size predictable.
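The layout above can be sketched with plain (role, content) tuples standing in for the LangChain message classes. The history format, a list of (user_text, assistant_text) pairs ordered oldest first, is an assumption for this example.

```python
def build_messages(
    system_prompt: str,
    context: str,
    query: str,
    history: list[tuple[str, str]],
    window: int = 5,
) -> list[tuple[str, str]]:
    """Build system + windowed history + final human message, in that order."""
    messages = [("system", system_prompt)]
    # Guard window == 0 explicitly: history[-0:] would return the whole list.
    recent = history[-window:] if window > 0 else []
    for user_text, assistant_text in recent:
        messages.append(("human", user_text))
        messages.append(("ai", assistant_text))
    messages.append(("human", f"Context:\n{context}\n\nQuestion: {query}"))
    return messages


history = [(f"q{i}", f"a{i}") for i in range(7)]
messages = build_messages("rules", "ctx", "what changed?", history, window=5)
```

With seven stored turns and a window of five, the two oldest turns are dropped and the list holds 12 messages: one system message, ten history messages, and the final question.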

Token Budget by Query Type

factual  → cfg.max_answer_tokens        (default 500 tokens)
summary  → cfg.max_answer_tokens × 2    (default 1000 tokens)
compare  → cfg.max_answer_tokens        (default 500 tokens)

Summaries need more room for comprehensive coverage.
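As a small sketch of the budget rule, assuming a plain function in place of the cfg object and assuming unknown query types get the base budget:

```python
def max_tokens_for(query_type: str, max_answer_tokens: int = 500) -> int:
    """Double the budget for summaries; every other type gets the base budget."""
    if query_type == "summary":
        return max_answer_tokens * 2
    return max_answer_tokens


budgets = {kind: max_tokens_for(kind) for kind in ("factual", "summary", "compare")}
```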

generate() Flow

generate(query, context, citation_map, history, query_type) → GenerationResult:
    1. _build_messages()          → LangChain message list
    2. _invoke_with_fallback()    → raw_answer, model_used, tokens_used
    3. CitationInjector.inject()  → clean_answer, citations
    4. FaithfulnessGuard.is_refusal() → is_refusal flag
    5. _confidence_from_citations()   → "high" | "medium" | "low"
    6. return GenerationResult

stream_generate() Flow

stream_generate(...) → Generator[str, None, None]:
    1. _build_messages()
    2. _build_groq() or _build_gemini()  ← first available
    3. for chunk in llm.stream(messages): yield chunk.content
    4. On error: yield error message (never raises)

Streaming is used by the Gradio UI to show tokens as they arrive. Citation injection and the faithfulness check are not applied to streamed chunks; call generate() once streaming completes to obtain the structured result.
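The "never raises" behaviour can be sketched as a generator wrapper. Here the input is any iterable of text chunks, and the error message format is invented for the illustration. The real method iterates llm.stream(messages) and reads chunk.content.

```python
from typing import Generator, Iterable


def stream_answer(chunks: Iterable[str]) -> Generator[str, None, None]:
    """Yield non-empty chunks; turn any failure into a final error string."""
    try:
        for text in chunks:
            if text:  # skip empty chunks
                yield text
    except Exception as exc:
        yield f"[generation error: {exc}]"


def flaky_stream():
    """Simulated backend that fails partway through a response."""
    yield "Hel"
    yield ""
    yield "lo"
    raise RuntimeError("boom")


received = list(stream_answer(flaky_stream()))
```

The consumer sees the partial text followed by an error string rather than an exception, so a UI loop does not need its own try/except.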

GenerationResult

@dataclass
class GenerationResult:
    answer: str           # Final answer with inline [Source: ...] markers
    citations: list[Citation]  # Resolved, deduplicated citations
    confidence_level: str # "high" | "medium" | "low"
    is_refusal: bool      # True if the answer contains the refusal phrase or is empty
    model_used: str       # Model ID ("llama-3.1-70b-versatile" / "gemini-1.5-flash" / "none")
    tokens_used: int      # Total tokens (input + output); 0 if unavailable
    latency_ms: int       # Wall-clock LLM call time in ms

Token Extraction

def _extract_tokens(response) -> int:
    try:
        return int(response.usage_metadata.get("total_tokens", 0))
    except (AttributeError, TypeError):
        return 0

usage_metadata is LangChain's standard usage field on response messages; both the Groq and Gemini backends populate it. The function returns 0 when the field is unavailable (e.g., during streaming or with older SDK versions).


Security Decisions

No Prompt Injection Through Context

Context is injected as a plain string inside the user message, not in the system prompt. The system prompt is hardcoded and never receives user-controlled input. This limits the prompt injection attack surface: a malicious document cannot rewrite the faithfulness or citation instructions themselves, although it can still try to countermand them from within the user message.

Refusal as Default

When no LLM is configured (both keys absent) or both calls fail, the chain returns REFUSAL_PHRASE with model_used="none". The application continues running; it never crashes due to missing API keys.

No PII in LLM Calls

The query text passed to the LLM is the preprocessed version from QueryPreprocessor, with fillers stripped and text lowercased. The raw Whisper transcript is never sent to the LLM.


Test Coverage

File: tests/test_phase4.py | 72/72 passed

| Class | Tests | What's verified |
|---|---|---|
| TestCitationInjectorBasic | 8 | Empty input, no markers, exact match, multiple markers, dedup, text preservation, order, empty map |
| TestCitationInjectorMatchingStrategies | 5 | All 4 strategies + last resort |
| TestFaithfulnessGuardRefusal | 7 | Exact phrase, case insensitive, embedded, normal answer, empty, partial, no-period |
| TestFaithfulnessGuardConfidence | 10 | Empty results, high/medium/low thresholds, max across results, rrf fallback, all 4 boundary conditions |
| TestFaithfulnessGuardSystemPrompt | 6 | Refusal phrase present, citation rules, faithfulness rules, length |
| TestAnswerChainMessageBuilding | 7 | SystemMessage first, HumanMessage last, context in body, history pairs, window cap, no-history length |
| TestAnswerChainMaxTokens | 3 | factual/summary/compare budgets |
| TestAnswerChainTokenExtraction | 4 | Valid metadata, None metadata, missing attribute, type error |
| TestAnswerChainConfidenceFromCitations | 5 | Empty, high/medium/low thresholds, max across citations |
| TestAnswerChainGenerateMocked | 7 | Returns correct type, answer content, latency, tokens, refusal detection, non-refusal, citation resolution |
| TestAnswerChainFallback | 4 | Gemini fallback on Groq failure, refusal when both fail, refusal when no keys, Groq preferred |
| TestAnswerChainStreaming | 4 | Yields chunks, skips empty chunks, refusal when no LLM, error on exception |
| TestGenerationResult | 2 | Instantiation, mutable citations list |

Mocking Strategy

No real API keys are needed. Tests patch _build_groq and _build_gemini at the instance level to return MagicMock LLMs with controlled responses:

from unittest.mock import MagicMock, patch

mock_llm = MagicMock()
mock_llm.invoke.return_value = mock_response  # or: mock_llm.invoke.side_effect = RuntimeError(...)
with patch.object(chain, "_build_groq", return_value=mock_llm):
    result = chain.generate(...)
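A self-contained version of the same pattern is sketched below. TinyChain is a hypothetical stand-in for AnswerChain with just enough behaviour to patch, and the question and answer text are invented.

```python
from unittest.mock import MagicMock, patch


class TinyChain:
    """Hypothetical stand-in for AnswerChain, just enough to patch."""

    def _build_groq(self):
        raise RuntimeError("no API key in tests")

    def generate(self, query: str) -> str:
        llm = self._build_groq()
        return llm.invoke(query).content


chain = TinyChain()

# The mocked response only needs the attributes generate() reads.
mock_response = MagicMock()
mock_response.content = "Paris [Source: geo.pdf, p.1]"

mock_llm = MagicMock()
mock_llm.invoke.return_value = mock_response

# Patch the builder on this instance only; the class itself is untouched.
with patch.object(chain, "_build_groq", return_value=mock_llm):
    answer = chain.generate("capital of France?")
```

Patching the builder rather than the LLM client keeps the test independent of any SDK import or network access.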

Integration Points

Called by (Phase 5 orchestrator)

# In the query handler:
results = retriever.search(query, kb_names)
context, citation_map = builder.build(results)
generation = chain.generate(
    query=transcript.transcript,
    context=context,
    citation_map=citation_map,
    history=session.history,
    query_type=transcript.query_type,
)
# generation.answer → display + TTS
# generation.citations → citation panel
# generation.is_refusal → skip TTS if True
# generation.tokens_used → store in QuerySession

Dependencies

| Dep | Purpose |
|---|---|
| langchain-core | HumanMessage, AIMessage, SystemMessage |
| langchain-groq | ChatGroq client |
| langchain-google-genai | ChatGoogleGenerativeAI client |
| FaithfulnessGuard | System prompt + refusal detection |
| CitationInjector | Marker parsing + resolution |