# Multi-Document Research Assistant: RAG with Retrieval Quality Evaluation

> Upload multiple PDFs. Ask questions. Get cited answers grounded in your documents, with automatic retrieval quality scoring.

[![Live Demo](https://img.shields.io/badge/Live%20Demo-neurotic--evidence--defendant.ngrok--free.dev-38BDF8?style=flat-square)](https://neurotic-evidence-defendant.ngrok-free.dev/)
[![Python](https://img.shields.io/badge/Python-3.10+-blue?style=flat-square)](https://python.org)
[![LangChain](https://img.shields.io/badge/LangChain-0.2-green?style=flat-square)](https://langchain.com)
[![ChromaDB](https://img.shields.io/badge/ChromaDB-0.5-orange?style=flat-square)](https://trychroma.com)

---

## 🎯 Problem

Most LLMs hallucinate when asked about private or domain-specific documents. They generate confident answers from training data, not from your actual content.

Standard RAG systems fix this but introduce a new problem: **you can't tell when retrieval fails.** Wrong chunks get retrieved. The LLM answers confidently from bad context. No warning.

This project solves both problems:

1. Grounds all answers in your uploaded documents
2. Scores retrieval quality on every query so you know when to trust the answer

---

## 🏗️ Architecture

```
  PDF Documents
         │
         ▼
┌─────────────────┐
│   PyPDFLoader   │  Extract text page by page
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  RecursiveText  │  Split into 512-token chunks
│    Splitter     │  with 50-token overlap
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  all-MiniLM-L6  │  Embed each chunk → 384-dim vector
│   Embeddings    │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│    ChromaDB     │  Store vectors + text + metadata
│ (Vector Store)  │  (source filename + page number)
└────────┬────────┘
         │
    User Query
         │
         ▼
┌─────────────────┐
│   Similarity    │  Cosine search → Top-k chunks
│     Search      │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Gemini Flash   │  Generate answer from chunks only
│      (LLM)      │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Custom Evaluator│  Score grounding, relevance,
│                 │  completeness
└─────────────────┘
         │
         ▼
Answer + Sources + Quality Scores
```

---

## 📊 Evaluation Metrics

Instead of using RAGAS (which, out of the box, relies on the paid OpenAI API), this project implements a custom lightweight evaluator that is free, fast, and interpretable.

| Metric | What It Measures | How |
|---|---|---|
| **Grounding Score** | Is the answer based on retrieved chunks? | Word overlap: answer ∩ context / answer |
| **Retrieval Relevance** | Did we retrieve the right chunks? | Cosine similarity: query vector vs chunk vectors |
| **Answer Completeness** | Did the LLM use the retrieved context? | Word overlap: answer ∩ context / context |

### Sample Results (Attention Is All You Need paper)

| Query | Grounding | Relevance | Completeness |
|---|---|---|---|
| "What is the attention mechanism?" | 0.97 | 0.61 | 0.16 |
| "Who are the authors?" | 0.95 | 0.58 | 0.12 |
| "What is the conclusion?" | 0.91 | 0.55 | 0.14 |

**Key insight:** High grounding (0.97) but low completeness (0.16) means the LLM extracted a precise answer from a small portion of the retrieved context, which is expected behavior for factual queries.

---

## 🚀 Quick Start

### Prerequisites

- Python 3.10+
- Google AI Studio API key (free at [aistudio.google.com](https://aistudio.google.com))

### Installation

```bash
git clone https://github.com/aneebnaqvi15/rag-research-assistant
cd rag-research-assistant
python -m venv venv
venv\Scripts\activate        # Windows
# source venv/bin/activate   # Mac/Linux
pip install -r requirements.txt
```

### Configuration

Create a `.env` file:

```
GOOGLE_API_KEY=your_key_here
```

### Run

```bash
streamlit run app.py
```

Then open `http://localhost:8501` in your browser.

---

## 🛠️ Tech Stack

| Component | Tool | Why |
|---|---|---|
| PDF Loading | `pypdf` + LangChain | Page-level metadata for citations |
| Chunking | `RecursiveCharacterTextSplitter` | Tries paragraph, line, then word boundaries before cutting mid-word |
| Embeddings | `all-MiniLM-L6-v2` | Fast, free, 384-dim, runs on CPU |
| Vector Store | `ChromaDB` | Local, zero-config, stores metadata |
| LLM | `Gemini 2.5 Flash` | Free tier, strong reasoning |
| Orchestration | `LangChain` | Connects all RAG components |
| UI | `Streamlit` | Rapid prototyping, real-time logs |
| Evaluation | Custom (NumPy + sklearn) | Free, interpretable, no OpenAI dependency |

---

## ✨ Features

- **Multi-document support**: Upload and query across multiple PDFs simultaneously
- **Source citations**: Every answer shows the exact filename and page number
- **Real-time processing logs**: Watch the pipeline run: load → chunk → embed → index
- **Retrieval quality scores**: Three metrics scored on every query
- **Bring your own API key**: Toggle in the sidebar to use your own Gemini key
- **Custom model selection**: Enter any Gemini model string from AI Studio
- **Adjustable retrieval**: Control chunk size and top-k via sidebar sliders

---

## 📁 Project Structure

```
rag-research-assistant/
├── app.py              # Streamlit UI + full RAG pipeline
├── requirements.txt    # Pinned dependencies
├── .env                # API keys (not committed)
├── .gitignore
└── README.md
```

---

## 💡 Key Learnings

**1. Dependency hell is real.** ChromaDB, NumPy 2.0, and OpenTelemetry are at war inside Colab's environment. The fix: use `EphemeralClient()` instead of file-based persistence, and never pin NumPy manually.

**2. RAGAS is not free.** RAGAS v0.4+ requires the OpenAI API for its `InstructorLLM`. Building a custom evaluator is not a compromise; it's better engineering. You understand every metric you ship.

**3. Chunk size is a hyperparameter.** 512 tokens with 50-token overlap is a starting point, not a truth. Smaller chunks give sharper vectors but less context. Tune based on your document type.

**4. Retrieval can fail silently.** A grounding score of 1.0 does not mean a correct answer. It means every word in the answer exists somewhere in the retrieved chunks. You need multiple metrics to trust a RAG system.

**5. LLM quality matters more than RAG architecture.** `flan-t5-base` produced gibberish from perfect retrieval. `Gemini 2.5 Flash` produced accurate answers from the same chunks. The retrieval pipeline is only as useful as the model that reads it.

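
To make learning 4 concrete, here is a minimal sketch of the three metrics from the Evaluation Metrics table. It is not the code in `app.py`. It assumes words are compared as lowercase unique tokens and that relevance is the mean cosine similarity across retrieved chunks, and all function names are hypothetical. The example answer reuses only words from the context, so it scores a grounding of 1.0 even though it swaps who proposed what.

```python
import math
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its unique word tokens (assumed tokenization)."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def grounding_score(answer: str, context: str) -> float:
    """Share of answer words that also occur in the retrieved context."""
    answer_words = tokenize(answer)
    if not answer_words:
        return 0.0
    return len(answer_words & tokenize(context)) / len(answer_words)


def completeness_score(answer: str, context: str) -> float:
    """Share of context words that the answer reuses."""
    context_words = tokenize(context)
    if not context_words:
        return 0.0
    return len(tokenize(answer) & context_words) / len(context_words)


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieval_relevance(query_vec: list[float], chunk_vecs: list[list[float]]) -> float:
    """Mean cosine similarity between the query and each chunk (assumed aggregation)."""
    if not chunk_vecs:
        return 0.0
    return sum(cosine_similarity(query_vec, c) for c in chunk_vecs) / len(chunk_vecs)


context = "The Transformer was proposed by Vaswani et al. in 2017 and relies on attention."
# Every word below appears in the context, but the claim itself is wrong.
wrong_answer = "Attention was proposed by the Transformer in 2017."

print("grounding:", grounding_score(wrong_answer, context))  # 1.0
print("completeness:", round(completeness_score(wrong_answer, context), 2))
```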

---

## 🔗 Related Projects

This project is **Part 2** of a connected AI/ML portfolio:

| Project | Focus | Link |
|---|---|---|
| **Banking77 Intent Classifier** | NLP fine-tuning with DistilBERT + LoRA | [GitHub](https://github.com/aneebnaqvi15/banking77-intent-classifier) |
| **RAG Research Assistant** | Retrieval-Augmented Generation + Evaluation | This repo |

Together these demonstrate fine-tuning (shaping model behavior) vs RAG (shaping model knowledge): two complementary approaches to applied NLP.

---

## 📄 References

- [Attention Is All You Need](https://arxiv.org/abs/1706.03762), Vaswani et al. (2017)
- [REALM: Retrieval-Augmented Language Model Pre-Training](https://arxiv.org/abs/2002.08909)
- [LangChain Documentation](https://docs.langchain.com)
- [ChromaDB Documentation](https://docs.trychroma.com)
- [Sentence Transformers: all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)

---

## 👤 Author

**Syed Muhammad Aneeb**
CS Graduate · Full-Stack Developer · AI Engineer

[![GitHub](https://img.shields.io/badge/GitHub-aneebnaqvi15-black?style=flat-square)](https://github.com/aneebnaqvi15)

---

*Built with zero GPU budget on Google Colab free tier. Constraints breed creativity.*