VoiceVault / DOCS /phase4_generation.md

NinjainPJs

Initial release: VoiceVault v1.0.0 — Voice-First RAG Knowledge Agent

85f900d 3 months ago

Phase 4 — LLM Generation Chain

Status: ✅ Complete | Tests: 72/72 passed | Files: 3 modules

What Was Built

Phase 4 implements the generation layer: context + query → grounded, cited answer.

| Module | Responsibility |
| --- | --- |
| `voicevault/generation/answer_chain.py` | LangChain LCEL chain, Groq → Gemini fallback |
| `voicevault/generation/citation_injector.py` | Parse + resolve `[Source: ...]` markers |
| `voicevault/generation/faithfulness_guard.py` | Refusal detection + confidence scoring |

FaithfulnessGuard

File: voicevault/generation/faithfulness_guard.py

Two-layer hallucination prevention:

Layer 1 — System Prompt Instruction

The LLM is instructed to use a fixed refusal phrase when the answer is not in context:

REFUSAL_PHRASE = "I could not find this in your documents."

This phrase is embedded in the system prompt via build_system_prompt(). The instruction is unambiguous: use exactly this phrase and nothing else.

Layer 2 — Post-Generation Check

After the LLM responds, is_refusal() verifies the instruction was followed:

_REFUSAL_PATTERN = re.compile(
    re.escape(REFUSAL_PHRASE.lower().rstrip(".")),
    re.IGNORECASE,
)

def is_refusal(self, answer: str) -> bool:
    if not answer:
        return True
    return bool(_REFUSAL_PATTERN.search(answer))

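The trailing period is stripped from REFUSAL_PHRASE before the pattern is compiled, so it matches both the "...documents." and "...documents" forms. Matching is case-insensitive, and because search() is used, the phrase is also detected when it is embedded in a longer answer.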
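
The cases described above can be checked with a small standalone script. This is a minimal sketch: is_refusal() is written here as a module-level function rather than a FaithfulnessGuard method, and the sample answers are made up for illustration.

```python
import re

REFUSAL_PHRASE = "I could not find this in your documents."

# Same construction as in the guard: the trailing period is dropped so that
# both the punctuated and unpunctuated variants match.
_REFUSAL_PATTERN = re.compile(
    re.escape(REFUSAL_PHRASE.rstrip(".")),
    re.IGNORECASE,
)


def is_refusal(answer: str) -> bool:
    """Module-level stand-in for FaithfulnessGuard.is_refusal()."""
    if not answer:
        return True  # an empty answer is treated as a refusal
    return bool(_REFUSAL_PATTERN.search(answer))


# Hypothetical sample answers covering the documented cases.
samples = [
    "I could not find this in your documents.",              # exact phrase
    "i could not find this in your documents",               # lowercase, no period
    "Sorry. I could not find this in your documents. Bye.",  # embedded
    "",                                                       # empty
    "Revenue rose 4% [Source: report.pdf, p.3]",             # normal answer
]
flags = [is_refusal(s) for s in samples]
```

Only the last sample is a normal answer; the other four are flagged as refusals.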

Confidence Scoring

The confidence level is derived from the highest retrieval score across all retrieved chunks:

top_score > 0.5         → "high"   (strong retrieval signal)
0.2 < top_score ≤ 0.5   → "medium" (moderate signal; answer may miss nuance)
top_score ≤ 0.2         → "low"    (weak signal; treat answer with caution)

The score used is rerank_score (the cross-encoder score) when it is greater than 0, with rrf_score (reciprocal rank fusion) as the fallback. Empty results yield "low".
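
A minimal sketch of the thresholding described above. RetrievalResult, effective_score() and confidence_level() are hypothetical names, and applying the rerank/RRF fallback per chunk is an assumption; the real guard may structure this differently.

```python
from dataclasses import dataclass


@dataclass
class RetrievalResult:
    """Hypothetical stand-in for a retrieved chunk; only the scores matter here."""
    rerank_score: float = 0.0
    rrf_score: float = 0.0


def effective_score(result: RetrievalResult) -> float:
    # Prefer the cross-encoder score when present, otherwise fall back to RRF.
    # (Applying the fallback per chunk is an assumption of this sketch.)
    return result.rerank_score if result.rerank_score > 0 else result.rrf_score


def confidence_level(results: list[RetrievalResult]) -> str:
    if not results:
        return "low"  # nothing retrieved
    top_score = max(effective_score(r) for r in results)
    if top_score > 0.5:
        return "high"
    if top_score > 0.2:
        return "medium"
    return "low"
```

Both boundaries are exclusive on the upper band: a top score of exactly 0.5 maps to "medium" and exactly 0.2 maps to "low".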

System Prompt

build_system_prompt() combines two concern areas:

CITATION RULES:
  - Cite every factual claim with [Source: filename, p.N] inline.
  - Use exact source names and page numbers from context headers.
  - Do not cite general knowledge.

FAITHFULNESS RULES:
  - If answer not in context: respond with REFUSAL_PHRASE only.
  - Keep factual answers under 150 words.
  - Keep summary answers under 300 words.

Word limits prevent verbose answers that dilute citations and increase hallucination risk.
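
One way to assemble such a prompt is sketched below. The rule wording is copied from the outline above, but the exact text and layout of the real build_system_prompt() are not shown in this document, so treat the output as illustrative only.

```python
REFUSAL_PHRASE = "I could not find this in your documents."


def build_system_prompt() -> str:
    """Assemble a static system prompt; wording and layout are illustrative."""
    citation_rules = [
        "Cite every factual claim with [Source: filename, p.N] inline.",
        "Use exact source names and page numbers from context headers.",
        "Do not cite general knowledge.",
    ]
    faithfulness_rules = [
        f'If the answer is not in the context, respond with exactly: "{REFUSAL_PHRASE}"',
        "Keep factual answers under 150 words.",
        "Keep summary answers under 300 words.",
    ]
    lines = ["CITATION RULES:"]
    lines += [f"  - {rule}" for rule in citation_rules]
    lines += ["", "FAITHFULNESS RULES:"]
    lines += [f"  - {rule}" for rule in faithfulness_rules]
    return "\n".join(lines)


prompt = build_system_prompt()
```

The function takes no arguments, which keeps user-controlled text out of the system prompt by construction.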

CitationInjector

File: voicevault/generation/citation_injector.py

Post-processes LLM output to resolve inline markers into structured Citation objects.

Marker Format

The LLM is instructed to use: [Source: filename, p.N]

The regex also handles abbreviated forms the LLM might produce:

_CITATION_PATTERN = re.compile(
    r"\[(?:Source:\s*)?([^,\]]+?)(?:,\s*p\.?\s*(\d+))?\]",
    re.IGNORECASE,
)

Matches: [Source: report.pdf, p.3], [report.pdf, p.3], [Source: report].
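
The pattern can be exercised directly to see what it extracts. parse_markers() is a hypothetical helper name used only for this sketch; the regex is the one shown above.

```python
import re
from typing import Optional

_CITATION_PATTERN = re.compile(
    r"\[(?:Source:\s*)?([^,\]]+?)(?:,\s*p\.?\s*(\d+))?\]",
    re.IGNORECASE,
)


def parse_markers(answer: str) -> list[tuple[str, Optional[int]]]:
    """Return (raw_name, page_num) for every marker; page_num is None if omitted."""
    markers = []
    for match in _CITATION_PATTERN.finditer(answer):
        raw_name = match.group(1).strip()
        page_num = int(match.group(2)) if match.group(2) else None
        markers.append((raw_name, page_num))
    return markers


text = (
    "Revenue rose [Source: report.pdf, p.3] while costs fell "
    "[report.pdf, p.3]; see also [Source: report]."
)
markers = parse_markers(text)
```

Because the `Source:` prefix is optional, the pattern also matches other bracketed text such as `[1]`, which parses as `("1", None)`. The resolution cascade is what decides whether such a marker maps to a real citation.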

Resolution Strategy (4-level cascade)

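For each parsed marker (raw_name, page_num), strategies are tried in priority order and the first match wins. If none of the four strategies match, the first citation in the map is used as a last resort:

| Priority | Strategy | Condition |
| --- | --- | --- |
| 1 | Exact filename + page | `source_file.lower() == raw_name.lower() and page_number == page_num` |
| 2 | Substring filename + page | `raw_name.lower() in source_file.lower() and page_number == page_num` |
| 3 | Page number only | `page_number == page_num` |
| 4 | Filename substring (no page) | `raw_name.lower() in source_file.lower()` |
| 5 (last resort) | First citation in map | Always |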

This cascade handles real-world LLM output variability — models sometimes abbreviate filenames or omit page numbers.
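
The cascade can be sketched as an ordered list of predicates. The Citation fields, the resolve() name, and treating citation_map as a list are all assumptions made for this example; the real classes likely carry more fields.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Citation:
    """Hypothetical minimal Citation with just the fields the cascade needs."""
    source_file: str
    page_number: int


def resolve(raw_name: str, page_num: Optional[int],
            citation_map: list[Citation]) -> Optional[Citation]:
    if not citation_map:
        return None  # nothing to resolve against
    name = raw_name.strip().lower()
    strategies = [
        # 1. exact filename + page
        lambda c: c.source_file.lower() == name and c.page_number == page_num,
        # 2. substring filename + page
        lambda c: name in c.source_file.lower() and c.page_number == page_num,
        # 3. page number only (skipped when the marker has no page)
        lambda c: page_num is not None and c.page_number == page_num,
        # 4. filename substring, page ignored
        lambda c: name in c.source_file.lower(),
    ]
    for matches in strategies:
        for candidate in citation_map:
            if matches(candidate):
                return candidate
    return citation_map[0]  # 5. last resort: first citation in the map


cmap = [
    Citation("annual_report.pdf", 3),
    Citation("notes.pdf", 7),
    Citation("annual_report.pdf", 9),
]
```

Each strategy scans the whole map before the next one is tried, so a weak match on an early entry never shadows a stronger match on a later one.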

Deduplication

A seen_keys: set[tuple[str, int]] tracks (source_file, page_number) pairs. The same source/page cited multiple times resolves to one Citation in the output list.
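
A minimal sketch of the deduplication step, using plain (source_file, page_number) tuples in place of Citation objects. The dedupe() name is hypothetical, and preserving first-seen order is an assumption of this sketch.

```python
def dedupe(resolved: list[tuple[str, int]]) -> list[tuple[str, int]]:
    """Keep the first occurrence of each (source_file, page_number) pair."""
    seen_keys: set[tuple[str, int]] = set()
    unique: list[tuple[str, int]] = []
    for source_file, page_number in resolved:
        key = (source_file, page_number)
        if key in seen_keys:
            continue  # already cited; skip the repeat
        seen_keys.add(key)
        unique.append(key)
    return unique


cited = [("a.pdf", 1), ("b.pdf", 2), ("a.pdf", 1), ("a.pdf", 2)]
unique_citations = dedupe(cited)
```

The same file on a different page counts as a separate citation, since the key includes the page number.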

Output Contract

inject(answer, citation_map) → (answer_text, resolved_citations)

The answer text is preserved with markers — they are not stripped. The UI displays both the inline [Source: ...] text and the structured citation panel below the answer.

AnswerChain

File: voicevault/generation/answer_chain.py

LLM Selection

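1. GROQ_API_KEY set?
     YES → ChatGroq(model=llama-3.1-70b-versatile)
           If invoke() raises → continue to step 2
     NO  → continue to step 2
2. Gemini key set?
     YES → ChatGoogleGenerativeAI(model=gemini-1.5-flash)
           If invoke() raises → continue to step 3
     NO  → continue to step 3
3. Return REFUSAL_PHRASE with model_used="none" (no crash)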

Both LLMs are constructed fresh on every call rather than cached, because max_tokens varies by query_type and LangChain model instances are lightweight to construct.
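
The selection order above can be sketched without any provider SDK. FakeLLM, Builder and invoke_with_fallback() are hypothetical stand-ins: the real chain builds ChatGroq / ChatGoogleGenerativeAI instances and reads the answer from the response message, whereas this sketch returns plain strings.

```python
from typing import Callable, Optional

REFUSAL_PHRASE = "I could not find this in your documents."


class FakeLLM:
    """Test double exposing only invoke(); not a real LangChain model."""

    def __init__(self, reply: str = "", error: Optional[Exception] = None):
        self.reply = reply
        self.error = error

    def invoke(self, messages: list) -> str:
        if self.error is not None:
            raise self.error
        return self.reply


Builder = Callable[[], Optional[FakeLLM]]


def invoke_with_fallback(messages: list,
                         builders: list[tuple[str, Builder]]) -> tuple[str, str]:
    """Try each provider in order and return (answer, model_used)."""
    for model_name, build in builders:
        llm = build()  # built fresh per call; None means no API key configured
        if llm is None:
            continue
        try:
            return llm.invoke(messages), model_name
        except Exception:
            continue  # provider failed; move on to the next one
    return REFUSAL_PHRASE, "none"


groq_down = [
    ("llama-3.1-70b-versatile", lambda: FakeLLM(error=RuntimeError("rate limited"))),
    ("gemini-1.5-flash", lambda: FakeLLM(reply="Answer from Gemini")),
]
answer, model_used = invoke_with_fallback([], groq_down)
```

Passing builders rather than instances mirrors the "constructed fresh per call" decision: nothing is built for a provider until it is actually tried.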

Message Layout

[SystemMessage]  ← FaithfulnessGuard.build_system_prompt()
[HumanMessage]   ← history turn 1 (oldest within window)
[AIMessage]      ← history turn 1 response
...              ← up to cfg.conversation_window pairs
[HumanMessage]   ← "Context:\n{context}\n\nQuestion: {query}"

History is capped at cfg.conversation_window (default 5) to keep prompt size predictable.
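
A minimal sketch of this layout, using (role, content) tuples as stand-ins for SystemMessage, HumanMessage and AIMessage. The build_messages() signature and the representation of history as (question, answer) pairs are assumptions.

```python
def build_messages(system_prompt: str,
                   history: list[tuple[str, str]],
                   context: str,
                   query: str,
                   window: int = 5) -> list[tuple[str, str]]:
    """Return the prompt as an ordered list of (role, content) pairs."""
    messages = [("system", system_prompt)]
    # Keep only the most recent `window` turns, oldest first.
    recent = history[-window:] if window > 0 else []
    for user_turn, ai_turn in recent:
        messages.append(("human", user_turn))
        messages.append(("ai", ai_turn))
    # Retrieved context always travels in the final user message.
    messages.append(("human", f"Context:\n{context}\n\nQuestion: {query}"))
    return messages


history = [(f"q{i}", f"a{i}") for i in range(7)]
msgs = build_messages("SYS", history, "ctx", "what changed?")
```

With seven turns of history and the default window of 5, the two oldest turns are dropped and the message list has 12 entries: one system message, ten history messages, and the final question.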

Token Budget by Query Type

factual  → cfg.max_answer_tokens        (default 500 tokens)
summary  → cfg.max_answer_tokens × 2    (default 1000 tokens)
compare  → cfg.max_answer_tokens        (default 500 tokens)

Summaries need more room for comprehensive coverage.
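
The budget rule is small enough to express as one function. max_tokens_for() is a hypothetical name, and falling back to the base budget for unrecognized query types is an assumption.

```python
def max_tokens_for(query_type: str, max_answer_tokens: int = 500) -> int:
    """Return the output-token budget for a query type."""
    if query_type == "summary":
        return max_answer_tokens * 2  # summaries get double the base budget
    # "factual", "compare", and (by assumption) anything unrecognized
    return max_answer_tokens


budgets = {qt: max_tokens_for(qt) for qt in ("factual", "summary", "compare")}
```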

generate() Flow

generate(query, context, citation_map, history, query_type) → GenerationResult:
    1. _build_messages()          → LangChain message list
    2. _invoke_with_fallback()    → raw_answer, model_used, tokens_used
    3. CitationInjector.inject()  → answer_text, citations
    4. FaithfulnessGuard.is_refusal() → is_refusal flag
    5. _confidence_from_citations()   → "high" | "medium" | "low"
    6. return GenerationResult

stream_generate() Flow

stream_generate(...) → Generator[str, None, None]:
    1. _build_messages()
    2. _build_groq() or _build_gemini()  ← first available
    3. for chunk in llm.stream(messages): yield chunk.content
    4. On error: yield error message (never raises)

Streaming is used by the Gradio UI to show tokens as they arrive. Citation injection and faithfulness check are not applied to streamed chunks — call generate() once streaming completes for the structured result.
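
The never-raises behavior can be sketched as a generator that wraps the chunk iterator. stream_answer() is a hypothetical name, the chunks are SimpleNamespace stand-ins for LangChain message chunks, and the error message format is an assumption.

```python
from types import SimpleNamespace
from typing import Generator, Iterable


def stream_answer(chunks: Iterable) -> Generator[str, None, None]:
    """Yield non-empty chunk text; turn any mid-stream error into a final message."""
    try:
        for chunk in chunks:
            text = getattr(chunk, "content", "")
            if text:  # skip empty chunks
                yield text
    except Exception as exc:
        # Surface the failure to the UI instead of raising.
        yield f"[generation error: {exc}]"


def failing_stream():
    """Hypothetical provider stream that drops after one chunk."""
    yield SimpleNamespace(content="partial ")
    raise RuntimeError("connection dropped")


ok_chunks = [SimpleNamespace(content=c) for c in ("Hel", "", "lo")]
ok_output = list(stream_answer(ok_chunks))
failed_output = list(stream_answer(failing_stream()))
```

Text already yielded before a failure stays on screen; the error message is appended as the last item.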

GenerationResult

@dataclass
class GenerationResult:
    answer: str           # Final answer with inline [Source: ...] markers
    citations: list[Citation]  # Resolved, deduplicated citations
    confidence_level: str # "high" | "medium" | "low"
    is_refusal: bool      # True if the answer is a refusal (includes empty answers and the no-LLM fallback)
    model_used: str       # Model ID ("llama-3.1-70b-versatile" / "gemini-1.5-flash" / "none")
    tokens_used: int      # Total tokens (input + output); 0 if unavailable
    latency_ms: int       # Wall-clock LLM call time in ms

Token Extraction

def _extract_tokens(response) -> int:
    try:
        return int(response.usage_metadata.get("total_tokens", 0))
    except (AttributeError, TypeError):
        return 0

usage_metadata is LangChain's standard attribute for token usage on a response message; both the Groq and Gemini backends populate it. The function returns 0 when it is unavailable (e.g., during streaming or with older SDK versions).

Security Decisions

No Prompt Injection Through Context

Context is injected as a plain string inside the final user message, not in the system prompt. The system prompt is hardcoded and never receives user-controlled input, so a malicious document cannot rewrite the faithfulness or citation instructions themselves. This reduces the prompt injection attack surface but does not eliminate it: instructions embedded in retrieved text still reach the model as part of the user message.

Refusal as Default

When no LLM is configured (both keys absent) or both calls fail, the chain returns REFUSAL_PHRASE with model_used="none". The application continues running — it never crashes due to missing API keys.

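No Raw Transcript in LLM Calls

The query text passed to the LLM is the preprocessed version from QueryPreprocessor (fillers stripped, lowercased). The raw Whisper transcript is never sent to the LLM. This preprocessing normalizes the query; it does not by itself redact names or other personal data contained in what the user said.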

Test Coverage

File: tests/test_phase4.py | 72/72 passed

| Class | Tests | What's verified |
| --- | --- | --- |
| `TestCitationInjectorBasic` | 8 | Empty input, no markers, exact match, multiple markers, dedup, text preservation, order, empty map |
| `TestCitationInjectorMatchingStrategies` | 5 | All 4 strategies + last resort |
| `TestFaithfulnessGuardRefusal` | 7 | Exact phrase, case insensitive, embedded, normal answer, empty, partial, no-period |
| `TestFaithfulnessGuardConfidence` | 10 | Empty results, high/medium/low thresholds, max across results, rrf fallback, all 4 boundary conditions |
| `TestFaithfulnessGuardSystemPrompt` | 6 | Refusal phrase present, citation rules, faithfulness rules, length |
| `TestAnswerChainMessageBuilding` | 7 | SystemMessage first, HumanMessage last, context in body, history pairs, window cap, no-history length |
| `TestAnswerChainMaxTokens` | 3 | factual/summary/compare budgets |
| `TestAnswerChainTokenExtraction` | 4 | Valid metadata, None metadata, missing attribute, type error |
| `TestAnswerChainConfidenceFromCitations` | 5 | Empty, high/medium/low thresholds, max across citations |
| `TestAnswerChainGenerateMocked` | 7 | Returns correct type, answer content, latency, tokens, refusal detection, non-refusal, citation resolution |
| `TestAnswerChainFallback` | 4 | Gemini fallback on Groq failure, refusal when both fail, refusal when no keys, Groq preferred |
| `TestAnswerChainStreaming` | 4 | Yields chunks, skips empty chunks, refusal when no LLM, error on exception |
| `TestGenerationResult` | 2 | Instantiation, mutable citations list |

Mocking Strategy

No real API keys are needed. Tests patch _build_groq and _build_gemini at the instance level to return MagicMock LLMs with controlled responses:

mock_llm = MagicMock()
mock_llm.invoke.return_value = mock_response  # or .side_effect = RuntimeError(...)
with patch.object(chain, "_build_groq", return_value=mock_llm):
    result = chain.generate(...)
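
The same pattern, made self-contained so it runs without the VoiceVault package. TinyChain is a hypothetical stand-in for AnswerChain with just enough surface to patch; the response text is made up.

```python
from unittest.mock import MagicMock, patch


class TinyChain:
    """Hypothetical stand-in for AnswerChain; not the real class."""

    def _build_groq(self):
        # The real builder would need an API key; the test never reaches this.
        raise RuntimeError("would need GROQ_API_KEY")

    def generate(self, query: str) -> str:
        llm = self._build_groq()
        return llm.invoke(query).content


chain = TinyChain()

mock_response = MagicMock()
mock_response.content = "Paris [Source: geo.pdf, p.1]"
mock_llm = MagicMock()
mock_llm.invoke.return_value = mock_response

# Patch the builder on this one instance; the class itself is untouched.
with patch.object(chain, "_build_groq", return_value=mock_llm):
    answer = chain.generate("capital of France?")
```

Patching at the instance level keeps the substitution scoped to one chain object, and the original builder is restored when the with block exits.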

Integration Points

Called by (Phase 5 orchestrator)

# In the query handler:
results = retriever.search(query, kb_names)
context, citation_map = builder.build(results)
generation = chain.generate(
    query=transcript.transcript,
    context=context,
    citation_map=citation_map,
    history=session.history,
    query_type=transcript.query_type,
)
# generation.answer → display + TTS
# generation.citations → citation panel
# generation.is_refusal → skip TTS if True
# generation.tokens_used → store in QuerySession

Dependencies

| Dep | Purpose |
| --- | --- |
| `langchain-core` | `HumanMessage`, `AIMessage`, `SystemMessage` |
| `langchain-groq` | `ChatGroq` client |
| `langchain-google-genai` | `ChatGoogleGenerativeAI` client |
| `FaithfulnessGuard` | System prompt + refusal detection |
| `CitationInjector` | Marker parsing + resolution |