| # RAG Retrieval Strategy Comparison |
|
|
| ## Implemented strategies |
|
|
| 1. **Similarity (cosine)** |
| Standard top-k retrieval by embedding cosine similarity. Chroma returns the k nearest chunks to the query embedding. No diversity adjustment. |
|
|
| 2. **MMR (Maximal Marginal Relevance)** |
| Balances relevance and diversity: after selecting the top chunk, we iteratively choose the next chunk that maximizes |
| `λ * relevance(chunk, query) - (1 - λ) * max_similarity(chunk, already_selected)`. |
| This implementation is a simplified MMR: it approximates inter-chunk similarity from metadata (same source vs. different source) instead of comparing chunk embeddings, and it derives relevance from the distance Chroma returns. |
| `MMR_LAMBDA` (default 0.7) controls the trade-off: 1.0 ≈ pure similarity, 0 ≈ maximum diversity. |
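
The selection loop described above can be sketched roughly as follows. This is a minimal illustration, not the app's actual code. It assumes candidates arrive as dicts with hypothetical `id`, `source`, and `distance` keys, that the distance is a cosine distance (so relevance can be taken as `1 - distance`), and that the metadata proxy scores same-source pairs as 1.0 and different-source pairs as 0.0. The names `mmr_select` and `source_similarity` are made up for this example.

```python
from typing import Dict, List

MMR_LAMBDA = 0.7  # default trade-off value named in the text above


def source_similarity(a: Dict, b: Dict) -> float:
    """Metadata proxy for inter-chunk similarity.

    Assumption for this sketch: 1.0 for the same source, 0.0 otherwise.
    """
    return 1.0 if a["source"] == b["source"] else 0.0


def mmr_select(candidates: List[Dict], k: int, lam: float = MMR_LAMBDA) -> List[Dict]:
    """Greedily pick up to k chunks by the MMR score.

    score = lam * relevance - (1 - lam) * max similarity to already-selected chunks
    """
    # Sort by distance so that ties in score go to the more relevant chunk.
    remaining = sorted(candidates, key=lambda c: c["distance"])
    selected: List[Dict] = []

    def score(chunk: Dict) -> float:
        relevance = 1.0 - chunk["distance"]  # assumes cosine distance
        redundancy = max(
            (source_similarity(chunk, s) for s in selected), default=0.0
        )
        return lam * relevance - (1.0 - lam) * redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Illustrative candidates (invented values), as if fetched with a larger n_results.
candidates = [
    {"id": "a1", "source": "a.pdf", "distance": 0.10},
    {"id": "a2", "source": "a.pdf", "distance": 0.12},
    {"id": "a3", "source": "a.pdf", "distance": 0.15},
    {"id": "b1", "source": "b.pdf", "distance": 0.30},
    {"id": "c1", "source": "c.pdf", "distance": 0.40},
]

diverse = [c["id"] for c in mmr_select(candidates, k=3)]            # ['a1', 'b1', 'c1']
plain = [c["id"] for c in mmr_select(candidates, k=3, lam=1.0)]     # ['a1', 'a2', 'a3']
```

With `lam=1.0` the redundancy term vanishes and the loop reduces to plain top-k by relevance, which matches the "1.0 ≈ pure similarity" behavior described above.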
|
|
| ## What changed in retrieved docs |
|
|
| - **Similarity**: Tends to return several very similar chunks (e.g. same section or same source repeated), which can make the answer repetitive and over-anchored to one part of the corpus. |
| - **MMR**: Favors spreading across different sources/sections, so the model sees more varied context and is less likely to over-cite one document. You may see more distinct source names in citations. |
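
One way to put a number on this difference is to count how many distinct sources appear in the retrieved chunks and how much of the result set the most frequent source occupies. The sketch below assumes each chunk is a dict with a hypothetical `source` key; `source_spread` is an illustrative helper, not something the app is known to provide.

```python
from collections import Counter
from typing import Dict, List


def source_spread(chunks: List[Dict]) -> Dict[str, float]:
    """Summarize how concentrated a result set is on a single source."""
    counts = Counter(c["source"] for c in chunks)
    # Share of the result set taken by the most frequent source (0.0 if empty).
    top_share = max(counts.values()) / len(chunks) if chunks else 0.0
    return {"distinct_sources": len(counts), "top_source_share": top_share}


# Invented result sets for illustration only.
similarity_hits = [{"source": "a.pdf"}, {"source": "a.pdf"}, {"source": "a.pdf"}]
mmr_hits = [{"source": "a.pdf"}, {"source": "b.pdf"}, {"source": "c.pdf"}]

similarity_stats = source_spread(similarity_hits)
mmr_stats = source_spread(mmr_hits)
```

A higher `distinct_sources` and a lower `top_source_share` would indicate the wider spread described for MMR.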
|
|
| ## Speed differences |
|
|
| - **Retrieval time**: Similarity is a single Chroma query. MMR also issues one Chroma query (with a larger `n_results`), then runs a small Python loop to select k chunks. The loop adds negligible overhead (milliseconds), so retrieval time is effectively the same for both strategies. |
| - **Generation time**: Not directly affected by the strategy; it depends on context length and model. Because MMR can slightly reduce redundancy in the context, it may marginally change the answer length and therefore generation time. |
| |
| ## Timing in the app |
| |
| After each assistant reply, the UI shows: |
| |
| - **Retrieval: X.XXs** – time from query to having the list of chunks. |
| - **Generation: X.XXs** – time for the LLM to produce the answer. |
| |
| These values are logged (e.g. in `/data/logs/app.log`) for comparison across runs. |
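
The two measurements could be captured along these lines. This is a sketch under assumptions: `retrieve` and `generate` stand in for whatever callables the app actually uses, the logger name `rag.timing` is invented, and routing the logger to a file is left to the app's own logging configuration.

```python
import logging
import time
from typing import Callable, List, Tuple

logger = logging.getLogger("rag.timing")  # hypothetical logger name


def timed_answer(
    query: str,
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str, List[str]], str],
    strategy: str,
) -> Tuple[str, float, float]:
    """Run retrieval and generation, timing each phase separately."""
    start = time.perf_counter()
    chunks = retrieve(query)
    retrieval_s = time.perf_counter() - start

    start = time.perf_counter()
    answer = generate(query, chunks)
    generation_s = time.perf_counter() - start

    # One log line per reply so runs can be compared across strategies.
    logger.info(
        "strategy=%s retrieval=%.2fs generation=%.2fs",
        strategy, retrieval_s, generation_s,
    )
    return answer, retrieval_s, generation_s


def format_timing(retrieval_s: float, generation_s: float) -> str:
    """Render the two lines shown under each assistant reply."""
    return f"Retrieval: {retrieval_s:.2f}s\nGeneration: {generation_s:.2f}s"
```

`time.perf_counter()` is used rather than `time.time()` because it is a monotonic, high-resolution clock intended for measuring short durations.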
| |
| ## Which performed best and why |
| |
| - **Similarity** is best when the user’s question is very focused and the answer is expected to come from one or two passages. It is also the fastest in principle (one query, no post-processing), although the measured difference in retrieval time is negligible. |
| - **MMR** is best when the question is broad (“summarize all sources”, “compare X and Y”) or when you want to avoid over-citing one document. It improves perceived quality when the corpus has multiple relevant documents that should all contribute. |
| |
| Recommendation: use **similarity** for short, factual questions; use **MMR** for synthesis and multi-document comparison. The timing approach (logging retrieval + generation and displaying them in the UI) lets you compare strategies on your own data and choose accordingly. |
| |