# RAG Retrieval Strategy Comparison
## Implemented strategies
1. **Similarity (cosine)**
Standard top-k retrieval by embedding cosine similarity. Chroma returns the k nearest chunks to the query embedding. No diversity adjustment.
2. **MMR (Maximal Marginal Relevance)**
Balances relevance and diversity: after selecting the most relevant chunk, we iteratively choose the next chunk that maximizes
`λ * relevance(chunk, query) - (1 - λ) * max_similarity(chunk, already_selected)`.
Here we use a simplified MMR: relevance comes from Chroma’s distance to the query, and inter-chunk similarity is approximated from metadata (same source vs. different source) rather than computed from embeddings.
`MMR_LAMBDA` (default 0.7) sets `λ` and controls the trade-off: 1.0 ≈ pure similarity ranking, 0 ≈ maximum diversity.
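The simplified MMR described above can be sketched roughly as follows. This is an illustration, not the app's actual code: the `Candidate` container and the `relevance`, `metadata_similarity`, and `mmr_select` helpers are hypothetical names, and mapping distance to relevance as `1 - distance` is an assumption about how a cosine distance might be converted.

```python
from dataclasses import dataclass

MMR_LAMBDA = 0.7  # default trade-off value mentioned in the text


@dataclass
class Candidate:
    """One retrieved chunk (hypothetical container, not the app's real type)."""
    text: str
    source: str
    distance: float  # distance to the query; smaller means closer


def relevance(c: Candidate) -> float:
    # Assumption: turn a cosine distance into a "higher is better" score.
    return 1.0 - c.distance


def metadata_similarity(a: Candidate, b: Candidate) -> float:
    # Simplification: chunks from the same source count as fully similar,
    # chunks from different sources as not similar at all.
    return 1.0 if a.source == b.source else 0.0


def mmr_select(candidates, k, lam=MMR_LAMBDA):
    """Greedily pick up to k chunks, trading relevance against redundancy."""
    remaining = sorted(candidates, key=lambda c: c.distance)
    if not remaining or k <= 0:
        return []
    selected = [remaining.pop(0)]  # start with the most relevant chunk

    def score(c):
        redundancy = max(metadata_similarity(c, s) for s in selected)
        return lam * relevance(c) - (1 - lam) * redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        remaining.remove(best)
        selected.append(best)
    return selected
```

In this sketch the candidate list would come from a single over-fetched query (more results than `k`), and calling `mmr_select(candidates, k, lam=1.0)` falls back to plain distance ordering.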
## What changed in retrieved docs
- **Similarity**: Tends to return several very similar chunks (e.g. same section or same source repeated), which can make the answer repetitive and over-anchored to one part of the corpus.
- **MMR**: Favors spreading across different sources/sections, so the model sees more varied context and is less likely to over-cite one document. You may see more distinct source names in citations.
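One way to make the difference above measurable is to count how many distinct sources appear in a retrieved set and how much of it the dominant source takes up. The helper below is a hypothetical diagnostic, not something the app is stated to include; it assumes each retrieved chunk carries a source name.

```python
from collections import Counter


def source_spread(sources):
    """Return (number of distinct sources, share held by the most common one)."""
    counts = Counter(sources)
    distinct = len(counts)
    # Guard against an empty retrieval result to avoid dividing by zero.
    top_share = max(counts.values()) / len(sources) if sources else 0.0
    return distinct, top_share
```

A similarity run that returns mostly one document would show a low distinct count and a high top share; an MMR run on the same query would be expected to move both numbers the other way.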
## Speed differences
- **Retrieval time**: Similarity is a single Chroma query. MMR also issues a single Chroma query, but with a larger `n_results`, and then runs a small Python loop to select k chunks from that candidate pool. The extra work is negligible (milliseconds), so retrieval time is effectively the same for both strategies.
- **Generation time**: Unchanged by strategy; it depends on context length and model. MMR can slightly reduce redundancy in the context, which might marginally affect length and thus generation time.
## Timing in the app
After each assistant reply, the UI shows:
- **Retrieval: X.XXs** – time from query to having the list of chunks.
- **Generation: X.XXs** – time for the LLM to produce the answer.
These values are logged (e.g. in `/data/logs/app.log`) for comparison across runs.
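The two measurements above could be collected with a pattern like the following. It is a minimal sketch under stated assumptions: `retrieve` and `generate` stand in for whatever callables the app actually uses, and the logger name, log line format, and `answer_with_timing` helper are all hypothetical.

```python
import logging
import time

logger = logging.getLogger("rag_timing")  # hypothetical logger name


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def answer_with_timing(query, retrieve, generate, strategy="similarity"):
    """Run retrieval then generation, timing each stage separately."""
    chunks, t_retrieval = timed(retrieve, query)
    answer, t_generation = timed(generate, query, chunks)
    # Log both timings with the strategy name so runs can be compared later.
    logger.info(
        "strategy=%s retrieval=%.2fs generation=%.2fs",
        strategy, t_retrieval, t_generation,
    )
    footer = f"Retrieval: {t_retrieval:.2f}s | Generation: {t_generation:.2f}s"
    return answer, footer
```

`time.perf_counter` is used here because it is a monotonic clock intended for measuring short intervals; the returned `footer` string is what a UI could append under the assistant reply.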
## Which performed best and why
- **Similarity** is best when the user’s question is narrowly focused and the answer is expected to come from one or two passages. It is also the faster of the two in principle (one query, no post-processing), although the difference is negligible in practice.
- **MMR** is best when the question is broad (“summarize all sources”, “compare X and Y”) or when you want to avoid over-citing one document. It improves perceived quality when the corpus has multiple relevant documents that should all contribute.
Recommendation: use **similarity** for short, factual questions; use **MMR** for synthesis and multi-document comparison. The timing approach (logging retrieval + generation and displaying them in the UI) lets you compare strategies on your own data and choose accordingly.