# Phase 4: LLM Generation Chain

**Status:** ✅ Complete | **Tests:** 72/72 passed | **Files:** 3 modules

---

## What Was Built

Phase 4 implements the generation layer: context + query → grounded, cited answer.

| Module | Responsibility |
|--------|----------------|
| `voicevault/generation/answer_chain.py` | LangChain LCEL chain, Groq → Gemini fallback |
| `voicevault/generation/citation_injector.py` | Parse + resolve `[Source: ...]` markers |
| `voicevault/generation/faithfulness_guard.py` | Refusal detection + confidence scoring |

---

## FaithfulnessGuard

**File:** [voicevault/generation/faithfulness_guard.py](../voicevault/generation/faithfulness_guard.py)

Two-layer hallucination prevention:

### Layer 1: System Prompt Instruction

The LLM is instructed to use a fixed refusal phrase when the answer is not in context:

```
REFUSAL_PHRASE = "I could not find this in your documents."
```

This phrase is embedded in the system prompt via `build_system_prompt()`. The instruction is unambiguous: use *exactly this phrase and nothing else*.

### Layer 2: Post-Generation Check

After the LLM responds, `is_refusal()` verifies the instruction was followed:

```python
_REFUSAL_PATTERN = re.compile(
    re.escape(REFUSAL_PHRASE.lower().rstrip(".")),
    re.IGNORECASE,
)

def is_refusal(self, answer: str) -> bool:
    if not answer:
        return True
    return bool(_REFUSAL_PATTERN.search(answer))
```

The pattern strips the trailing period before matching, so it covers both the `"...documents."` and `"...documents"` forms. Matching is case-insensitive for robustness, and an empty answer is also treated as a refusal.
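
The check can be exercised in isolation. The following is a minimal sketch that restates it as a free function (the original is a method); the sample inputs are illustrative and are not taken from the test suite.

```python
import re

REFUSAL_PHRASE = "I could not find this in your documents."

# Same construction as the snippet above: drop the trailing period,
# then match case-insensitively anywhere in the answer.
_REFUSAL_PATTERN = re.compile(
    re.escape(REFUSAL_PHRASE.lower().rstrip(".")),
    re.IGNORECASE,
)


def is_refusal(answer: str) -> bool:
    """Return True for an empty answer or one containing the refusal phrase."""
    if not answer:
        return True
    return bool(_REFUSAL_PATTERN.search(answer))


# Illustrative inputs (assumed, not from the real tests).
embedded = "Sorry. I could not find this in your documents. Try rephrasing."
grounded = "Revenue grew 12% [Source: report.pdf, p.3]"
```

Because the pattern uses `search()`, a refusal phrase embedded in a longer answer such as `embedded` is still detected, while a normal cited answer such as `grounded` is not.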

### Confidence Scoring

Based on the top retrieval score across all retrieved chunks:

```
top_score > 0.5  → "high"   (strong retrieval signal)
top_score > 0.2  → "medium" (moderate signal; answer may miss nuance)
top_score ≤ 0.2  → "low"    (weak signal; treat answer with caution)
```

The score for each chunk is `rerank_score` (the cross-encoder score) when it is greater than 0; otherwise it falls back to `rrf_score` (reciprocal rank fusion). Empty results → `"low"`.
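
As a rough sketch of the thresholds and the score fallback described above: `RetrievedChunk`, `effective_score`, and `confidence_level` are hypothetical names introduced for illustration, and the real result type may carry more fields.

```python
from dataclasses import dataclass


@dataclass
class RetrievedChunk:
    """Hypothetical stand-in for a retrieval result; only the scores matter here."""
    rerank_score: float = 0.0
    rrf_score: float = 0.0


def effective_score(chunk: RetrievedChunk) -> float:
    # Prefer the cross-encoder score when present (> 0), else fall back to RRF.
    return chunk.rerank_score if chunk.rerank_score > 0 else chunk.rrf_score


def confidence_level(chunks: list) -> str:
    """Map the top score across all chunks to "high", "medium", or "low"."""
    if not chunks:
        return "low"
    top_score = max(effective_score(c) for c in chunks)
    if top_score > 0.5:
        return "high"
    if top_score > 0.2:
        return "medium"
    return "low"
```

Both comparisons are strict, so a top score of exactly 0.5 lands in `"medium"` and exactly 0.2 lands in `"low"`.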

### System Prompt

`build_system_prompt()` combines two concern areas:

```
CITATION RULES:
  - Cite every factual claim with [Source: filename, p.N] inline.
  - Use exact source names and page numbers from context headers.
  - Do not cite general knowledge.

FAITHFULNESS RULES:
  - If answer not in context: respond with REFUSAL_PHRASE only.
  - Keep factual answers under 150 words.
  - Keep summary answers under 300 words.
```

Word limits prevent verbose answers that dilute citations and increase hallucination risk.

---

## CitationInjector

**File:** [voicevault/generation/citation_injector.py](../voicevault/generation/citation_injector.py)

Post-processes LLM output to resolve inline markers into structured `Citation` objects.

### Marker Format

The LLM is instructed to use: `[Source: filename, p.N]`

The regex also handles abbreviated forms the LLM might produce:
```python
_CITATION_PATTERN = re.compile(
    r"\[(?:Source:\s*)?([^,\]]+?)(?:,\s*p\.?\s*(\d+))?\]",
    re.IGNORECASE,
)
```
Matches: `[Source: report.pdf, p.3]`, `[report.pdf, p.3]`, `[Source: report]`.
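
To make the capture groups concrete, here is a small sketch that runs the same pattern over sample text. `parse_markers` is a hypothetical helper name, not necessarily what the module exposes.

```python
import re
from typing import Optional

_CITATION_PATTERN = re.compile(
    r"\[(?:Source:\s*)?([^,\]]+?)(?:,\s*p\.?\s*(\d+))?\]",
    re.IGNORECASE,
)


def parse_markers(text: str) -> list:
    """Return (raw_name, page_num) pairs; page_num is None when no page is given."""
    markers = []
    for match in _CITATION_PATTERN.finditer(text):
        raw_name = match.group(1).strip()
        page: Optional[str] = match.group(2)
        markers.append((raw_name, int(page) if page else None))
    return markers


sample = "Growth was 12% [Source: report.pdf, p.3], as noted in [Source: report]."
markers = parse_markers(sample)
```

Because the `Source:` prefix is optional, any bracketed text without a comma (for example `[1]`) also matches; the resolution step is what decides whether a match maps to a real source.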

### Resolution Strategy (4-level cascade)

For each parsed marker `(raw_name, page_num)`:

| Priority | Strategy | Condition |
|----------|----------|-----------|
| 1 | Exact filename + page | `source_file.lower() == raw_name.lower() and page_number == page_num` |
| 2 | Substring filename + page | `raw_name in source_file.lower() and page_number == page_num` |
| 3 | Page number only | `page_number == page_num` |
| 4 | Filename substring (no page) | `raw_name in source_file.lower()` |
| 5 (last resort) | First citation in map | Always |

This cascade handles real-world LLM output variability: models sometimes abbreviate filenames or omit page numbers.
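
The cascade in the table can be sketched as an ordered list of predicates. This is an illustration under assumptions: `Citation` here is a minimal hypothetical stand-in, the citation map is treated as a plain list, and `raw_name` is lowercased before comparison.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Citation:
    """Hypothetical minimal citation; the real class likely has more fields."""
    source_file: str
    page_number: int


def resolve(raw_name: str, page_num: Optional[int], candidates: list) -> Optional[Citation]:
    """Return the first candidate matched by the highest-priority strategy."""
    name = raw_name.strip().lower()  # assumption: compare in lowercase
    strategies = [
        # 1. Exact filename + page
        lambda c: c.source_file.lower() == name and c.page_number == page_num,
        # 2. Substring filename + page
        lambda c: name in c.source_file.lower() and c.page_number == page_num,
        # 3. Page number only
        lambda c: c.page_number == page_num,
        # 4. Filename substring, page ignored
        lambda c: name in c.source_file.lower(),
    ]
    for matches in strategies:
        for candidate in candidates:
            if matches(candidate):
                return candidate
    # 5. Last resort: first citation in the map, if there is one.
    return candidates[0] if candidates else None


candidates = [
    Citation("annual_report.pdf", 3),
    Citation("notes.md", 7),
    Citation("annual_report.pdf", 9),
]
```

Each strategy is tried against every candidate before the next strategy is considered, so a weaker match never pre-empts a stronger one.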

### Deduplication

A `seen_keys: set[tuple[str, int]]` tracks `(source_file, page_number)` pairs. The same source/page cited multiple times resolves to one `Citation` in the output list.
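
A minimal sketch of that deduplication, assuming first-occurrence order is kept and using a hypothetical two-field `Citation`:

```python
from dataclasses import dataclass


@dataclass
class Citation:
    """Hypothetical minimal citation used only for this illustration."""
    source_file: str
    page_number: int


def dedupe(resolved: list) -> list:
    """Keep the first Citation for each (source_file, page_number) pair."""
    seen_keys = set()  # holds (source_file, page_number) tuples
    unique = []
    for citation in resolved:
        key = (citation.source_file, citation.page_number)
        if key in seen_keys:
            continue
        seen_keys.add(key)
        unique.append(citation)
    return unique


resolved = [
    Citation("a.pdf", 1),
    Citation("b.pdf", 2),
    Citation("a.pdf", 1),  # repeat of the first entry, dropped
    Citation("a.pdf", 2),  # same file, different page, kept
]
unique = dedupe(resolved)
```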

### Output Contract

```python
inject(answer, citation_map) β†’ (answer_text, resolved_citations)
```

The answer text is **preserved with markers**; they are not stripped. The UI displays both the inline `[Source: ...]` text and the structured citation panel below the answer.

---

## AnswerChain

**File:** [voicevault/generation/answer_chain.py](../voicevault/generation/answer_chain.py)

### LLM Selection

```
GROQ_API_KEY set?
  YES → ChatGroq(model=llama-3.1-70b-versatile)
        If invoke() raises → fall through to Gemini
  NO  → ChatGoogleGenerativeAI(model=gemini-1.5-flash)
        If invoke() raises →
        Return REFUSAL_PHRASE (no crash)
```

Both LLMs are constructed fresh per call rather than cached, because `max_tokens` varies by `query_type` and LangChain model instances are lightweight.

### Message Layout

```
[SystemMessage]  ← FaithfulnessGuard.build_system_prompt()
[HumanMessage]   ← history turn 1 (oldest within window)
[AIMessage]      ← history turn 1 response
...              ← up to cfg.conversation_window pairs
[HumanMessage]   ← "Context:\n{context}\n\nQuestion: {query}"
```

History is capped at `cfg.conversation_window` (default 5) to keep prompt size predictable.
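
The layout and the window cap can be sketched with plain `(role, content)` tuples standing in for the LangChain message classes. The history shape (a list of `(user_text, ai_text)` pairs, oldest first) is an assumption, as is the `build_messages` name.

```python
def build_messages(system_prompt: str, history: list, context: str,
                   query: str, window: int = 5) -> list:
    """Return messages in the order: system, windowed history, final question."""
    messages = [("system", system_prompt)]
    # Keep only the most recent `window` turns; guard window == 0,
    # because history[-0:] would return the whole list.
    recent = history[-window:] if window > 0 else []
    for user_text, ai_text in recent:
        messages.append(("human", user_text))
        messages.append(("ai", ai_text))
    messages.append(("human", f"Context:\n{context}\n\nQuestion: {query}"))
    return messages


history = [(f"q{i}", f"a{i}") for i in range(8)]  # 8 turns, more than the window
messages = build_messages("SYSTEM PROMPT", history, "ctx", "why?")
```

With eight turns of history and the default window of 5, the result holds 12 messages: one system message, five human/AI pairs starting at `q3`, and the final context-plus-question message.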

### Token Budget by Query Type

```python
factual  → cfg.max_answer_tokens        (default 500 tokens)
summary  → cfg.max_answer_tokens × 2    (default 1000 tokens)
compare  → cfg.max_answer_tokens        (default 500 tokens)
```

Summaries need more room for comprehensive coverage.
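
Expressed as code, the budget rule might look like the sketch below. The function name is hypothetical, and treating unknown query types like `factual` is an assumption rather than documented behavior.

```python
def max_tokens_for(query_type: str, max_answer_tokens: int = 500) -> int:
    """Return the output-token budget for a query type."""
    if query_type == "summary":
        return max_answer_tokens * 2  # summaries get double the base budget
    # "factual", "compare", and (by assumption) anything else use the base budget.
    return max_answer_tokens
```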

### generate() Flow

```python
generate(query, context, citation_map, history, query_type) → GenerationResult:
    1. _build_messages()          → LangChain message list
    2. _invoke_with_fallback()    → raw_answer, model_used, tokens_used
    3. CitationInjector.inject()  → clean_answer, citations
    4. FaithfulnessGuard.is_refusal() → is_refusal flag
    5. _confidence_from_citations()   → "high" | "medium" | "low"
    6. return GenerationResult
```

### stream_generate() Flow

```python
stream_generate(...) → Generator[str, None, None]:
    1. _build_messages()
    2. _build_groq() or _build_gemini()  ← first available
    3. for chunk in llm.stream(messages): yield chunk.content
    4. On error: yield error message (never raises)
```

Streaming is used by the Gradio UI to show tokens as they arrive. Citation injection and the faithfulness check are not applied to streamed chunks; call `generate()` once streaming completes to obtain the structured result.
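
A minimal sketch of the "never raises" streaming contract follows. The `stream_answer` name, the error message wording, and the `FakeLLM` stub are assumptions; the loop itself relies only on `llm.stream(messages)` yielding objects with a `.content` attribute.

```python
from types import SimpleNamespace
from typing import Generator

REFUSAL_PHRASE = "I could not find this in your documents."


def stream_answer(llm, messages: list) -> Generator[str, None, None]:
    """Yield text chunks; yield a message instead of raising on failure."""
    if llm is None:
        yield REFUSAL_PHRASE  # no LLM configured
        return
    try:
        for chunk in llm.stream(messages):
            if chunk.content:  # skip empty chunks
                yield chunk.content
    except Exception as exc:
        # Hypothetical wording; the point is that the generator does not raise.
        yield f"[Generation error: {exc}]"


class FakeLLM:
    """Stub that streams fixed pieces and can fail partway through."""

    def __init__(self, pieces: list, fail: bool = False) -> None:
        self.pieces = pieces
        self.fail = fail

    def stream(self, messages):
        for piece in self.pieces:
            yield SimpleNamespace(content=piece)
        if self.fail:
            raise RuntimeError("connection lost")


chunks = list(stream_answer(FakeLLM(["Hel", "", "lo"]), []))
```

Chunks already yielded before a failure stay with the consumer; the error text simply arrives as the final item.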

### GenerationResult

```python
@dataclass
class GenerationResult:
    answer: str           # Final answer with inline [Source: ...] markers
    citations: list[Citation]  # Resolved, deduplicated citations
    confidence_level: str # "high" | "medium" | "low"
    is_refusal: bool      # True if LLM correctly refused
    model_used: str       # Model ID ("llama-3.1-70b-versatile" / "gemini-1.5-flash" / "none")
    tokens_used: int      # Total tokens (input + output); 0 if unavailable
    latency_ms: int       # Wall-clock LLM call time in ms
```

### Token Extraction

```python
def _extract_tokens(response) -> int:
    try:
        return int(response.usage_metadata.get("total_tokens", 0))
    except (AttributeError, TypeError):
        return 0
```

`usage_metadata` is the LangChain standard; both Groq and Gemini backends populate it. Returns 0 gracefully when unavailable (e.g., during streaming or with older SDK versions).

---

## Security Decisions

### No Prompt Injection Through Context

Context is injected as a plain string inside the user message, not in the system prompt. The system prompt is hardcoded and never receives user-controlled input. This limits the prompt injection attack surface: a malicious document cannot modify the system prompt that carries the faithfulness and citation instructions.

### Refusal as Default

When no LLM is configured (both keys absent) or both calls fail, the chain returns `REFUSAL_PHRASE` with `model_used="none"`. The application continues running; it never crashes due to missing API keys.

### No PII in LLM Calls

The query text passed to the LLM is the preprocessed version from `QueryPreprocessor` (fillers stripped, lowercased). The raw Whisper transcript is never sent to the LLM.

---

## Test Coverage

**File:** [tests/test_phase4.py](../tests/test_phase4.py) | **72/72 passed**

| Class | Tests | What's verified |
|-------|-------|----------------|
| `TestCitationInjectorBasic` | 8 | Empty input, no markers, exact match, multiple markers, dedup, text preservation, order, empty map |
| `TestCitationInjectorMatchingStrategies` | 5 | All 4 strategies + last resort |
| `TestFaithfulnessGuardRefusal` | 7 | Exact phrase, case insensitive, embedded, normal answer, empty, partial, no-period |
| `TestFaithfulnessGuardConfidence` | 10 | Empty results, high/medium/low thresholds, max across results, rrf fallback, all 4 boundary conditions |
| `TestFaithfulnessGuardSystemPrompt` | 6 | Refusal phrase present, citation rules, faithfulness rules, length |
| `TestAnswerChainMessageBuilding` | 7 | SystemMessage first, HumanMessage last, context in body, history pairs, window cap, no-history length |
| `TestAnswerChainMaxTokens` | 3 | factual/summary/compare budgets |
| `TestAnswerChainTokenExtraction` | 4 | Valid metadata, None metadata, missing attribute, type error |
| `TestAnswerChainConfidenceFromCitations` | 5 | Empty, high/medium/low thresholds, max across citations |
| `TestAnswerChainGenerateMocked` | 7 | Returns correct type, answer content, latency, tokens, refusal detection, non-refusal, citation resolution |
| `TestAnswerChainFallback` | 4 | Gemini fallback on Groq failure, refusal when both fail, refusal when no keys, Groq preferred |
| `TestAnswerChainStreaming` | 4 | Yields chunks, skips empty chunks, refusal when no LLM, error on exception |
| `TestGenerationResult` | 2 | Instantiation, mutable citations list |

### Mocking Strategy

No real API keys are needed. Tests patch `_build_groq` and `_build_gemini` at the instance level to return `MagicMock` LLMs with controlled responses:

```python
mock_llm = MagicMock()
mock_llm.invoke.return_value = mock_response  # or .side_effect = RuntimeError(...)
with patch.object(chain, "_build_groq", return_value=mock_llm):
    result = chain.generate(...)
```

---

## Integration Points

### Called by (Phase 5 orchestrator)

```python
# In the query handler:
results = retriever.search(query, kb_names)
context, citation_map = builder.build(results)
generation = chain.generate(
    query=transcript.transcript,
    context=context,
    citation_map=citation_map,
    history=session.history,
    query_type=transcript.query_type,
)
# generation.answer → display + TTS
# generation.citations → citation panel
# generation.is_refusal → skip TTS if True
# generation.tokens_used → store in QuerySession
```

### Dependencies

| Dep | Purpose |
|-----|---------|
| `langchain-core` | `HumanMessage`, `AIMessage`, `SystemMessage` |
| `langchain-groq` | `ChatGroq` client |
| `langchain-google-genai` | `ChatGoogleGenerativeAI` client |
| `FaithfulnessGuard` | System prompt + refusal detection |
| `CitationInjector` | Marker parsing + resolution |