# RAG Retrieval Strategy Comparison

## Implemented strategies

1. **Similarity (cosine)**  
   Standard top-k retrieval by embedding cosine similarity. Chroma returns the k nearest chunks to the query embedding. No diversity adjustment.

2. **MMR (Maximal Marginal Relevance)**  
   Balances relevance and diversity: after selecting the top chunk, we iteratively choose the next chunk that maximizes  
   `λ * relevance(chunk, query) - (1 - λ) * max_similarity(chunk, already_selected)`.  
   This implementation is a simplified MMR: it approximates inter-chunk similarity from metadata (same source vs. different source) rather than from embeddings, and derives relevance from the distance Chroma returns.  
   `MMR_LAMBDA` (default 0.7) controls the trade-off: 1.0 ≈ pure similarity, 0 ≈ maximum diversity.
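
The selection loop described above can be sketched roughly as follows. This is a minimal illustration, not the app's actual code: the candidate dict shape (`id`, `source`, `distance`), the function name `simplified_mmr`, and the mapping `relevance = 1 - distance` are all assumptions made for the example. Inter-chunk similarity is approximated as 1.0 for chunks from the same source and 0.0 otherwise.

```python
from typing import Dict, List

MMR_LAMBDA = 0.7  # default trade-off value named in the text


def simplified_mmr(candidates: List[Dict], k: int, lam: float = MMR_LAMBDA) -> List[Dict]:
    """Greedily pick k chunks using a metadata-based MMR score.

    Each candidate is a dict with "id", "source" and "distance" keys
    (hypothetical shape; a smaller distance means closer to the query).
    """
    # Sort by distance so ties in the MMR score go to the more relevant chunk.
    remaining = sorted(candidates, key=lambda c: c["distance"])
    selected: List[Dict] = []

    while remaining and len(selected) < k:
        seen_sources = {c["source"] for c in selected}

        def score(c: Dict) -> float:
            # Assumption: relevance is derived as 1 - distance.
            relevance = 1.0 - c["distance"]
            # Assumption: same source counts as fully redundant, other sources as not.
            redundancy = 1.0 if c["source"] in seen_sources else 0.0
            return lam * relevance - (1.0 - lam) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)

    return selected


if __name__ == "__main__":
    candidates = [
        {"id": "a1", "source": "a.md", "distance": 0.10},
        {"id": "a2", "source": "a.md", "distance": 0.12},
        {"id": "b1", "source": "b.md", "distance": 0.30},
        {"id": "c1", "source": "c.md", "distance": 0.35},
    ]
    print([c["id"] for c in simplified_mmr(candidates, k=3)])           # ['a1', 'b1', 'c1']
    print([c["id"] for c in simplified_mmr(candidates, k=3, lam=1.0)])  # ['a1', 'a2', 'b1']
```

With `lam=1.0` the redundancy term vanishes and the result matches plain similarity ordering; at the default 0.7 the second chunk from `a.md` is passed over in favor of chunks from other sources.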

## What changed in retrieved docs

- **Similarity**: Tends to return several very similar chunks (e.g. same section or same source repeated), which can make the answer repetitive and over-anchored to one part of the corpus.
- **MMR**: Favors spreading across different sources/sections, so the model sees more varied context and is less likely to over-cite one document. You may see more distinct source names in citations.
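
One way to make this difference measurable, rather than judging it by eye, is to count how the retrieved chunks are distributed across sources. The sketch below assumes each retrieved chunk is a dict with a `source` key; that shape and the helper name `source_spread` are hypothetical and not taken from the app.

```python
from collections import Counter
from typing import Dict, List


def source_spread(chunks: List[Dict]) -> Dict[str, object]:
    """Summarize how concentrated a retrieved set is on a single source."""
    counts = Counter(c["source"] for c in chunks)
    # Share of chunks that come from the most frequent source (0.0 for an empty set).
    top_share = max(counts.values()) / len(chunks) if chunks else 0.0
    return {
        "distinct_sources": len(counts),
        "top_source_share": top_share,
        "counts": dict(counts),
    }


if __name__ == "__main__":
    retrieved = [
        {"source": "a.md"},
        {"source": "a.md"},
        {"source": "a.md"},
        {"source": "b.md"},
    ]
    print(source_spread(retrieved))
```

A higher `distinct_sources` and a lower `top_source_share` for the same query would be consistent with the MMR behavior described above.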

## Speed differences

- **Retrieval time**: Similarity is a single Chroma query. MMR also issues one Chroma query, with a larger `n_results`, then runs a small Python loop to select k chunks. The extra work is on the order of milliseconds, so retrieval time is effectively the same.
- **Generation time**: Not directly affected by strategy; it depends on context length and model. Because MMR can reduce redundancy in the context, it may marginally change context length and therefore generation time.

## Timing in the app

After each assistant reply, the UI shows:

- **Retrieval: X.XXs** – time from query to having the list of chunks.
- **Generation: X.XXs** – time for the LLM to produce the answer.

These values are logged (e.g. in `/data/logs/app.log`) for comparison across runs.
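
A minimal sketch of how the two timings could be captured is shown below. The `retrieve` and `generate` callables, the logger name `rag.timing`, and the log line format are assumptions for illustration; routing the logger to a file such as the one mentioned above is left to the app's logging configuration.

```python
import logging
import time
from typing import Callable, List, Tuple

logger = logging.getLogger("rag.timing")  # hypothetical logger name


def timed_answer(
    query: str,
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str, List[str]], str],
    strategy: str,
) -> Tuple[str, float, float]:
    """Run retrieval and generation, returning the answer and both durations in seconds."""
    t0 = time.perf_counter()
    chunks = retrieve(query)          # query -> list of chunks
    t1 = time.perf_counter()
    answer = generate(query, chunks)  # LLM call
    t2 = time.perf_counter()

    retrieval_s = t1 - t0
    generation_s = t2 - t1
    logger.info(
        "strategy=%s retrieval=%.2fs generation=%.2fs",
        strategy, retrieval_s, generation_s,
    )
    return answer, retrieval_s, generation_s


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    # Stub callables stand in for the real retriever and model.
    answer, r, g = timed_answer(
        "What is MMR?",
        retrieve=lambda q: ["chunk one", "chunk two"],
        generate=lambda q, chunks: f"Answer based on {len(chunks)} chunks",
        strategy="mmr",
    )
    print(f"Retrieval: {r:.2f}s, Generation: {g:.2f}s")
```

`time.perf_counter()` is used because it is a monotonic, high-resolution clock intended for measuring short intervals.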

## Which performed best and why

- **Similarity** is best when the user’s question is very focused and the answer is expected to come from one or two passages. It is also marginally the fastest (one query, no post-processing), though the difference is negligible in practice.
- **MMR** is best when the question is broad (“summarize all sources”, “compare X and Y”) or when you want to avoid over-citing one document. It improves perceived quality when the corpus has multiple relevant documents that should all contribute.

Recommendation: use **similarity** for short, factual questions; use **MMR** for synthesis and multi-document comparison. The timing approach (logging retrieval + generation and displaying them in the UI) lets you compare strategies on your own data and choose accordingly.