RAG Retrieval Strategy Comparison
Implemented strategies
Similarity (cosine)
Standard top-k retrieval by embedding cosine similarity. Chroma returns the k nearest chunks to the query embedding. No diversity adjustment is applied.
MMR (Maximal Marginal Relevance)
Balances relevance and diversity: after selecting the top chunk, we iteratively choose the next chunk that maximizes
λ * relevance(chunk, query) - (1 - λ) * max_similarity(chunk, already_selected).
Here we use a simplified MMR that approximates inter-chunk similarity from metadata (same source vs. different source) and uses Chroma’s distance for relevance. MMR_LAMBDA (default 0.7) controls the trade-off: 1.0 ≈ pure similarity, 0 ≈ maximum diversity.
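The selection rule above can be sketched in plain Python. This is a minimal illustration, not the app's actual code: the `Candidate` class, the `relevance = 1 - distance` mapping, and the 1.0/0.0 same-source proxy are all assumptions made for the example, and the candidates are assumed to have been fetched already (for instance from one Chroma query with a larger n_results).

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    source: str      # assumed metadata field naming the source document
    distance: float  # cosine distance reported by the vector store


def relevance(c: Candidate) -> float:
    # Assumption: cosine distance d is converted to similarity as 1 - d.
    return 1.0 - c.distance


def source_similarity(a: Candidate, b: Candidate) -> float:
    # Assumption: metadata-only proxy, 1.0 for same source, 0.0 otherwise.
    return 1.0 if a.source == b.source else 0.0


def top_k(candidates, k):
    """Plain similarity baseline: the k most relevant candidates."""
    return sorted(candidates, key=relevance, reverse=True)[:k]


def mmr_select(candidates, k, lam=0.7):
    """Greedy MMR: repeatedly pick the candidate with the best
    lam * relevance - (1 - lam) * max similarity to already selected."""
    remaining = sorted(candidates, key=relevance, reverse=True)
    selected = []
    while remaining and len(selected) < k:
        def score(c):
            redundancy = max(
                (source_similarity(c, s) for s in selected), default=0.0
            )
            return lam * relevance(c) - (1.0 - lam) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Illustrative pool: three near-duplicate chunks from one file, two others.
pool = [
    Candidate("chunk A1", "a.pdf", 0.10),
    Candidate("chunk A2", "a.pdf", 0.12),
    Candidate("chunk A3", "a.pdf", 0.15),
    Candidate("chunk B1", "b.pdf", 0.30),
    Candidate("chunk C1", "c.pdf", 0.40),
]

print([c.source for c in top_k(pool, 3)])       # ['a.pdf', 'a.pdf', 'a.pdf']
print([c.source for c in mmr_select(pool, 3)])  # ['a.pdf', 'b.pdf', 'c.pdf']
```

With `lam=1.0` the redundancy term vanishes and `mmr_select` returns the same chunks as `top_k`, which matches the "1.0 ≈ pure similarity" end of the trade-off.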
What changed in retrieved docs
- Similarity: Tends to return several very similar chunks (e.g. same section or same source repeated), which can make the answer repetitive and over-anchored to one part of the corpus.
- MMR: Favors spreading across different sources/sections, so the model sees more varied context and is less likely to over-cite one document. You may see more distinct source names in citations.
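One way to put a number on this difference is to count distinct sources among the retrieved chunks and measure how much of the context comes from the single most frequent source. The helper below is a hypothetical sketch (the name `source_stats` and the example file names are made up for illustration); it takes only a list of source names, however the app happens to store them.

```python
from collections import Counter


def source_stats(sources):
    """Return (number of distinct sources, share of the most frequent source)."""
    if not sources:
        return 0, 0.0
    counts = Counter(sources)
    top_share = counts.most_common(1)[0][1] / len(sources)
    return len(counts), top_share


# Illustrative inputs only, not measured results.
print(source_stats(["a.pdf", "a.pdf", "a.pdf", "b.pdf"]))  # (2, 0.75)
print(source_stats(["a.pdf", "b.pdf", "c.pdf", "d.pdf"]))  # (4, 0.25)
```

A higher distinct count and a lower top share indicate that the retrieved context is spread across more of the corpus.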
Speed differences
- Retrieval time: Similarity is a single Chroma query. MMR does one Chroma query (with a larger n_results) then a small Python loop to select k chunks; the extra work is negligible (milliseconds). So retrieval time is effectively the same.
- Generation time: Not directly affected by the strategy; it depends on context length and model. Because MMR can reduce redundancy in the context, it may change the answer length slightly, and generation time with it.
Timing in the app
After each assistant reply, the UI shows:
- Retrieval: X.XXs – time from query to having the list of chunks.
- Generation: X.XXs – time for the LLM to produce the answer.
These values are logged (e.g. in /data/logs/app.log) for comparison across runs.
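The measurement itself can be sketched as follows. This is an assumed shape, not the app's actual code: `timed_answer`, the logger name, and the `retrieve` / `generate` callables are placeholders for whatever the app uses, and logging configuration (file path, handlers) is left out.

```python
import logging
import time

logger = logging.getLogger("rag_timing")  # hypothetical logger name


def timed_answer(query, retrieve, generate, strategy="similarity"):
    """Run retrieval and generation, timing each stage separately.

    `retrieve(query)` returns a list of chunks and
    `generate(query, chunks)` returns the answer text; both are
    caller-supplied placeholders.
    """
    t0 = time.perf_counter()
    chunks = retrieve(query)
    t1 = time.perf_counter()
    answer = generate(query, chunks)
    t2 = time.perf_counter()

    retrieval_s = t1 - t0
    generation_s = t2 - t1
    logger.info(
        "strategy=%s retrieval=%.2fs generation=%.2fs",
        strategy, retrieval_s, generation_s,
    )
    # Footer text in the same X.XXs format the UI shows.
    footer = f"Retrieval: {retrieval_s:.2f}s | Generation: {generation_s:.2f}s"
    return answer, footer
```

`time.perf_counter()` is used rather than `time.time()` because it is a monotonic, high-resolution clock intended for measuring short intervals.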
Which performed best and why
- Similarity is best when the user’s question is very focused and the answer is expected to come from one or two passages. It’s also the fastest in principle (one query, no post-processing).
- MMR is best when the question is broad (“summarize all sources”, “compare X and Y”) or when you want to avoid over-citing one document. It improves perceived quality when the corpus has multiple relevant documents that should all contribute.
Recommendation: use similarity for short, factual questions; use MMR for synthesis and multi-document comparison. The timing approach (logging retrieval + generation and displaying them in the UI) lets you compare strategies on your own data and choose accordingly.