# Multi-Document Research Assistant: RAG with Retrieval Quality Evaluation

> Upload multiple PDFs. Ask questions. Get cited answers grounded in your documents, with automatic retrieval quality scoring.

---

## Problem

Most LLMs hallucinate when asked about private or domain-specific documents. They generate confident answers from training data, not from your actual content.

Standard RAG systems fix this but introduce a new problem: **you can't tell when retrieval fails.** Wrong chunks get retrieved, the LLM answers confidently from bad context, and nothing warns you.

This project solves both problems:

1. Grounds all answers in your uploaded documents
2. Scores retrieval quality on every query so you know when to trust the answer

---

## Architecture

```
   PDF Documents
         │
         ▼
┌─────────────────┐
│   PyPDFLoader   │  Extract text page by page
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  RecursiveText  │  Split into 512-token chunks
│    Splitter     │  with 50-token overlap
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  all-MiniLM-L6  │  Embed each chunk → 384-dim vector
│   Embeddings    │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│    ChromaDB     │  Store vectors + text + metadata
│ (Vector Store)  │  (source filename + page number)
└────────┬────────┘
         │
     User Query
         │
         ▼
┌─────────────────┐
│   Similarity    │  Cosine search → Top-k chunks
│     Search      │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Gemini Flash   │  Generate answer from chunks only
│      (LLM)      │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Custom Evaluator│  Score grounding, relevance,
│                 │  completeness
└─────────────────┘
         │
         ▼
Answer + Sources + Quality Scores
```
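
The similarity-search step in the diagram can be sketched in plain Python. This is a minimal illustration, not the code in `app.py`: the names `cosine_similarity` and `top_k_chunks` are hypothetical, and toy 3-dimensional vectors stand in for the 384-dimensional embeddings. In the actual pipeline ChromaDB does this ranking.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def top_k_chunks(query_vec, chunk_vecs, k=3):
    """Return (index, score) pairs for the k most similar chunks."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors (hypothetical data) standing in for real embeddings
chunks = [[1.0, 0.0, 0.0], [0.7, 0.7, 0.0], [0.0, 0.0, 1.0]]
query = [1.0, 0.1, 0.0]
results = top_k_chunks(query, chunks, k=2)
print(results)  # chunk 0 ranks first, chunk 1 second
```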

---

## Evaluation Metrics

Instead of using RAGAS (which by default calls the paid OpenAI API), this project implements a custom lightweight evaluator that is free, fast, and interpretable.

| Metric | What It Measures | How |
|---|---|---|
| **Grounding Score** | Is the answer based on retrieved chunks? | Word overlap: (answer ∩ context) / answer |
| **Retrieval Relevance** | Did we retrieve the right chunks? | Cosine similarity: query vector vs chunk vectors |
| **Answer Completeness** | Did the LLM use the retrieved context? | Word overlap: (answer ∩ context) / context |
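
The three formulas in the table can be sketched as follows. This is an illustration under stated assumptions rather than the evaluator in `app.py`: the function names are hypothetical, words are compared as lowercase unique tokens, and relevance is taken as the mean cosine similarity over the retrieved chunks (the table does not say how per-chunk scores are combined).

```python
import math
import re


def tokenize(text):
    """Lowercase the text and return its set of unique word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def grounding_score(answer, context):
    """Share of answer words that also appear in the context."""
    answer_words = tokenize(answer)
    if not answer_words:
        return 0.0
    return len(answer_words & tokenize(context)) / len(answer_words)


def completeness_score(answer, context):
    """Share of context words that the answer reuses."""
    context_words = tokenize(context)
    if not context_words:
        return 0.0
    return len(tokenize(answer) & context_words) / len(context_words)


def retrieval_relevance(query_vec, chunk_vecs):
    """Mean cosine similarity between the query and each chunk (assumed aggregation)."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norms if norms else 0.0
    if not chunk_vecs:
        return 0.0
    return sum(cosine(query_vec, v) for v in chunk_vecs) / len(chunk_vecs)


# Hypothetical example text
context = "attention maps a query and a set of key value pairs to an output"
answer = "Attention maps a query to an output"
print(grounding_score(answer, context))     # every answer word is in the context
print(completeness_score(answer, context))  # only part of the context is reused
```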

### Sample Results (Attention Is All You Need paper)

| Query | Grounding | Relevance | Completeness |
|---|---|---|---|
| "What is the attention mechanism?" | 0.97 | 0.61 | 0.16 |
| "Who are the authors?" | 0.95 | 0.58 | 0.12 |
| "What is the conclusion?" | 0.91 | 0.55 | 0.14 |

**Key insight:** High grounding (0.97) but low completeness (0.16) means the LLM extracted a precise answer from a small portion of retrieved context, which is expected behavior for factual queries.

---

## Quick Start

### Prerequisites

- Python 3.10+
- Google AI Studio API key (free at [aistudio.google.com](https://aistudio.google.com))

### Installation

```bash
git clone https://github.com/aneebnaqvi15/rag-research-assistant
cd rag-research-assistant
python -m venv venv
venv\Scripts\activate          # Windows
# source venv/bin/activate     # Mac/Linux
pip install -r requirements.txt
```

### Configuration

Create a `.env` file:

```
GOOGLE_API_KEY=your_key_here
```
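
A fail-fast check for the key might look like the sketch below. It assumes the `.env` values have already been loaded into the process environment (for example by a loader such as `python-dotenv`, which is an assumption about this project's dependencies), and `get_api_key` is a hypothetical helper, not a function from `app.py`.

```python
import os


def get_api_key(env=os.environ):
    """Return the Gemini key, or raise a clear error if it is missing."""
    key = env.get("GOOGLE_API_KEY", "").strip()
    # Reject both an unset variable and the placeholder from the example file
    if not key or key == "your_key_here":
        raise RuntimeError("GOOGLE_API_KEY is not set; add it to your .env file")
    return key


# Usage (hypothetical): call once at startup, before building the LLM client
# api_key = get_api_key()
```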

### Run

```bash
streamlit run app.py
```

Open `http://localhost:8501` in your browser.

---

## Tech Stack

| Component | Tool | Why |
|---|---|---|
| PDF Loading | `pypdf` + LangChain | Page-level metadata for citations |
| Chunking | `RecursiveCharacterTextSplitter` | Prefers paragraph and line breaks before splitting mid-text |
| Embeddings | `all-MiniLM-L6-v2` | Fast, free, 384-dim, runs on CPU |
| Vector Store | `ChromaDB` | Local, zero-config, stores metadata |
| LLM | `Gemini 2.5 Flash` | Free tier, strong reasoning |
| Orchestration | `LangChain` | Connects all RAG components |
| UI | `Streamlit` | Rapid prototyping, real-time logs |
| Evaluation | Custom (NumPy + sklearn) | Free, interpretable, no OpenAI dependency |

---

## Features

- **Multi-document support**: Upload and query across multiple PDFs simultaneously
- **Source citations**: Every answer shows exact filename and page number
- **Real-time processing logs**: Watch the pipeline run: load → chunk → embed → index
- **Retrieval quality scores**: Three metrics scored on every query
- **Bring your own API key**: Toggle in sidebar to use your own Gemini key
- **Custom model selection**: Enter any Gemini model string from AI Studio
- **Adjustable retrieval**: Control chunk size and top-k via sidebar sliders
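
As an illustration of the source-citation feature, the sketch below turns chunk metadata into display strings. It is not the code in `app.py`: `format_citations` is a hypothetical helper, the metadata keys `source` and `page` follow the "source filename + page number" described in the architecture, and the page value is assumed to be 0-indexed.

```python
def format_citations(chunks_metadata):
    """Deduplicate (source, page) pairs and render them as citation strings."""
    seen = set()
    citations = []
    for meta in chunks_metadata:
        key = (meta["source"], meta["page"])
        if key in seen:
            continue  # several chunks can come from the same page
        seen.add(key)
        # Assumption: stored pages are 0-indexed, so add 1 for display
        citations.append(f"{meta['source']}, p. {meta['page'] + 1}")
    return citations


# Hypothetical metadata for three retrieved chunks
retrieved = [
    {"source": "attention.pdf", "page": 0},
    {"source": "attention.pdf", "page": 0},
    {"source": "attention.pdf", "page": 4},
]
print(format_citations(retrieved))
```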

---

## Project Structure

```
rag-research-assistant/
├── app.py              # Streamlit UI + full RAG pipeline
├── requirements.txt    # Pinned dependencies
├── .env                # API keys (not committed)
├── .gitignore
└── README.md
```

---

## Key Learnings

**1. Dependency hell is real.**
ChromaDB, NumPy 2.0, and OpenTelemetry conflict with one another inside Colab's environment. The fix: use `EphemeralClient()` instead of file-based persistence, and never pin NumPy manually.

**2. RAGAS is not free.**
RAGAS v0.4+ requires OpenAI API for their `InstructorLLM`. Building a custom evaluator is not a compromise; it's better engineering. You understand every metric you ship.

**3. Chunk size is a hyperparameter.**
512 tokens with 50-token overlap is a starting point, not a truth. Smaller chunks = sharper vectors but less context. Tune based on your document type.
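
To see how chunk size changes the index, here is a simplified sliding-window chunker. It is a sketch, not `RecursiveCharacterTextSplitter`: `chunk_tokens` is a hypothetical helper that works on an already-tokenized list and ignores paragraph or line boundaries.

```python
def chunk_tokens(tokens, chunk_size=512, overlap=50):
    """Slide a window of chunk_size tokens, stepping by chunk_size - overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


# Hypothetical 1200-token document
tokens = [f"w{i}" for i in range(1200)]
for size in (256, 512):
    # Smaller chunks produce more, shorter pieces to embed and index
    print(size, len(chunk_tokens(tokens, chunk_size=size, overlap=50)))
```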

**4. Retrieval can fail silently.**
A grounding score of 1.0 does not mean a correct answer. It means every word in the answer exists somewhere in retrieved chunks. You need multiple metrics to trust a RAG system.
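
A small example makes the point concrete. The sketch below uses a hypothetical word-overlap `grounding_score` (an assumed form of the metric, not the code in `app.py`) on made-up text, and shows a factually wrong answer that still scores 1.0 because every word appears somewhere in the context.

```python
import re


def grounding_score(answer, context):
    """Share of unique answer words that also appear in the context."""
    def words(text):
        return set(re.findall(r"[a-z0-9]+", text.lower()))
    answer_words = words(answer)
    if not answer_words:
        return 0.0
    return len(answer_words & words(context)) / len(answer_words)


# Hypothetical retrieved context and a wrong answer stitched from its words
context = "Berlin is the capital of Germany. Paris is the capital of France."
wrong_answer = "Paris is the capital of Germany."
score = grounding_score(wrong_answer, context)
print(score)  # fully grounded by word overlap, yet factually wrong
```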

**5. LLM quality matters more than RAG architecture.**
`flan-t5-base` produced gibberish from perfect retrieval. `Gemini 2.5 Flash` produced accurate answers from the same chunks. The retrieval pipeline is only as useful as the model that reads it.

---

## Related Projects

This project is **Part 2** of a connected AI/ML portfolio:

| Project | Focus | Link |
|---|---|---|
| **Banking77 Intent Classifier** | NLP fine-tuning with DistilBERT + LoRA | [GitHub](https://github.com/aneebnaqvi15/banking77-intent-classifier) |
| **RAG Research Assistant** | Retrieval-Augmented Generation + Evaluation | This repo |

Together these demonstrate fine-tuning (shaping model behavior) vs RAG (shaping model knowledge), two complementary approaches to applied NLP.

---

## References

- [Attention Is All You Need](https://arxiv.org/abs/1706.03762), Vaswani et al. (2017)
- [REALM: Retrieval-Augmented Language Model Pre-Training](https://arxiv.org/abs/2002.08909), Guu et al. (2020)
- [LangChain Documentation](https://docs.langchain.com)
- [ChromaDB Documentation](https://docs.trychroma.com)
- [Sentence Transformers: all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)

---

## Author

**Syed Muhammad Aneeb**

CS Graduate · Full-Stack Developer · AI Engineer

[GitHub](https://github.com/aneebnaqvi15)

---

*Built with zero GPU budget on Google Colab free tier. Constraints breed creativity.*