# RAG Retrieval Strategy Comparison

## Implemented strategies

1. **Similarity (cosine)**

   Standard top-k retrieval by embedding cosine similarity. Chroma returns the k nearest chunks to the query embedding. No diversity adjustment.

2. **MMR (Max Marginal Relevance)**

   Balances relevance and diversity: after selecting the top chunk, we iteratively choose the next chunk that maximizes `λ * relevance(chunk, query) - (1 - λ) * max_similarity(chunk, already_selected)`. Here we use a simplified MMR that approximates inter-chunk similarity from metadata (same source vs different source) and uses Chroma’s distance for relevance. `MMR_LAMBDA` (default 0.7) controls the trade-off: 1.0 ≈ pure similarity, 0 ≈ maximum diversity.

## What changed in retrieved docs

- **Similarity**: Tends to return several very similar chunks (e.g. same section or same source repeated), which can make the answer repetitive and over-anchored to one part of the corpus.
- **MMR**: Favors spreading across different sources/sections, so the model sees more varied context and is less likely to over-cite one document. You may see more distinct source names in citations.

## Speed differences

- **Retrieval time**: Similarity is a single Chroma query. MMR does one Chroma query (with a larger n_results) then a small Python loop to select k chunks; the extra work is negligible (milliseconds). So retrieval time is effectively the same.
- **Generation time**: Unchanged by strategy; it depends on context length and model. MMR can slightly reduce redundancy in the context, which might marginally affect length and thus generation time.

## Timing in the app

After each assistant reply, the UI shows:

- **Retrieval: X.XXs** – time from query to having the list of chunks.
- **Generation: X.XXs** – time for the LLM to produce the answer.

These values are logged (e.g. in `/data/logs/app.log`) for comparison across runs.
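The two timings could be captured roughly as in the sketch below. It is a minimal illustration, not the app's actual code: `timed_answer`, `format_timing`, the logger name `rag.timing`, and the `retrieve` / `generate` callables are all hypothetical stand-ins for whatever the app uses. Log handler configuration (for example, writing to a file) is left out.

```python
import logging
import time
from typing import Callable, List, Tuple

# Logger name is an assumption; attach a file handler elsewhere if needed.
logger = logging.getLogger("rag.timing")


def format_timing(retrieval_s: float, generation_s: float) -> str:
    """Build the label shown under each assistant reply."""
    return f"Retrieval: {retrieval_s:.2f}s | Generation: {generation_s:.2f}s"


def timed_answer(
    query: str,
    strategy: str,
    retrieve: Callable[[str, str], List[str]],   # hypothetical retriever
    generate: Callable[[str, List[str]], str],   # hypothetical LLM call
) -> Tuple[str, float, float]:
    """Run retrieval and generation, measuring each phase separately."""
    start = time.perf_counter()
    chunks = retrieve(query, strategy)
    retrieval_s = time.perf_counter() - start

    start = time.perf_counter()
    answer = generate(query, chunks)
    generation_s = time.perf_counter() - start

    # One log line per reply so runs can be compared across strategies.
    logger.info(
        "strategy=%s retrieval=%.2fs generation=%.2fs",
        strategy, retrieval_s, generation_s,
    )
    return answer, retrieval_s, generation_s
```

`time.perf_counter()` is used instead of `time.time()` because it is monotonic and intended for measuring short intervals.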
## Which performed best and why

- **Similarity** is best when the user’s question is very focused and the answer is expected to come from one or two passages. It’s also the fastest in principle (one query, no post-processing).
- **MMR** is best when the question is broad (“summarize all sources”, “compare X and Y”) or when you want to avoid over-citing one document. It improves perceived quality when the corpus has multiple relevant documents that should all contribute.

Recommendation: use **similarity** for short, factual questions; use **MMR** for synthesis and multi-document comparison. The timing approach (logging retrieval + generation and displaying them in the UI) lets you compare strategies on your own data and choose accordingly.
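To make the difference between the two strategies concrete, here is a minimal sketch of the simplified MMR selection described under "Implemented strategies". It is an illustration under stated assumptions, not the project's actual implementation: the `Candidate` class, the `source` metadata key, and the conversion `relevance = 1 - distance` (which assumes Chroma is configured for cosine distance) are hypothetical. Inter-chunk similarity is approximated as 1.0 for the same source and 0.0 otherwise.

```python
from dataclasses import dataclass
from typing import List

MMR_LAMBDA = 0.7  # default trade-off from the text above


@dataclass
class Candidate:
    text: str
    source: str      # metadata field; the key name is an assumption
    distance: float  # distance returned by the vector store (assumed cosine)


def relevance(c: Candidate) -> float:
    # Assumption: cosine distance, so a smaller distance means higher relevance.
    return 1.0 - c.distance


def metadata_similarity(a: Candidate, b: Candidate) -> float:
    # Simplified proxy: chunks from the same source count as fully similar.
    return 1.0 if a.source == b.source else 0.0


def mmr_select(
    candidates: List[Candidate], k: int, lam: float = MMR_LAMBDA
) -> List[Candidate]:
    """Greedily pick k chunks, trading relevance against redundancy."""
    remaining = sorted(candidates, key=relevance, reverse=True)
    selected: List[Candidate] = []
    while remaining and len(selected) < k:
        def score(c: Candidate) -> float:
            redundancy = max(
                (metadata_similarity(c, s) for s in selected), default=0.0
            )
            return lam * relevance(c) - (1.0 - lam) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Candidates as they might come back from one larger vector-store query.
pool = [
    Candidate("intro, part 1", "a.pdf", 0.10),
    Candidate("intro, part 2", "a.pdf", 0.12),
    Candidate("related notes", "b.pdf", 0.30),
    Candidate("background", "c.pdf", 0.50),
]
diverse = mmr_select(pool, k=2)           # picks one chunk from a.pdf, one from b.pdf
similar = mmr_select(pool, k=2, lam=1.0)  # picks both chunks from a.pdf
```

With `lam=1.0` the redundancy term vanishes and the result matches plain top-k similarity; at the default 0.7, the second a.pdf chunk is penalized enough that the b.pdf chunk is chosen instead.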