# Multi-Document Research Assistant: RAG with Retrieval Quality Evaluation
> Upload multiple PDFs. Ask questions. Get cited answers grounded in your documents, with automatic retrieval quality scoring.
[![Live Demo](https://img.shields.io/badge/Live%20Demo-neurotic--evidence--defendant.ngrok--free.dev-38BDF8?style=flat-square)](https://neurotic-evidence-defendant.ngrok-free.dev/)
[![Python](https://img.shields.io/badge/Python-3.10+-blue?style=flat-square)](https://python.org)
[![LangChain](https://img.shields.io/badge/LangChain-0.2-green?style=flat-square)](https://langchain.com)
[![ChromaDB](https://img.shields.io/badge/ChromaDB-0.5-orange?style=flat-square)](https://trychroma.com)
---
## 🎯 Problem
Most LLMs hallucinate when asked about private or domain-specific documents. They generate confident answers from training data β€” not from your actual content.
Standard RAG systems fix this but introduce a new problem: **you can't tell when retrieval fails.** Wrong chunks get retrieved. The LLM answers confidently from bad context. No warning.
This project solves both problems:
1. Grounds all answers in your uploaded documents
2. Scores retrieval quality on every query so you know when to trust the answer
---
## 🏗️ Architecture
```
   PDF Documents
         │
         ▼
┌─────────────────┐
│   PyPDFLoader   │  Extract text page by page
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  RecursiveText  │  Split into 512-token chunks
│    Splitter     │  with 50-token overlap
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  all-MiniLM-L6  │  Embed each chunk → 384-dim vector
│   Embeddings    │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│    ChromaDB     │  Store vectors + text + metadata
│ (Vector Store)  │  (source filename + page number)
└────────┬────────┘
         │
    User Query
         │
         ▼
┌─────────────────┐
│   Similarity    │  Cosine search → Top-k chunks
│     Search      │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Gemini Flash   │  Generate answer from chunks only
│      (LLM)      │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Custom Evaluator│  Score grounding, relevance,
│                 │  completeness
└─────────────────┘
         │
         ▼
Answer + Sources + Quality Scores
```
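The chunking and retrieval steps in the diagram can be sketched without any of the real libraries. The snippet below is a toy stand-in, not the code in `app.py`: it splits on words rather than tokens, and it replaces the MiniLM embeddings and ChromaDB with bag-of-words counts and a linear scan. The function names (`chunk_words`, `embed`, `cosine`, `top_k`) are hypothetical.

```python
import math
from collections import Counter


def chunk_words(text, size=512, overlap=50):
    """Split text into windows of `size` words that share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks


def embed(text):
    """Toy embedding: lowercase word counts (stand-in for a dense vector)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query, chunks, k=3):
    """Return the k (score, chunk) pairs most similar to the query."""
    q = embed(query)
    scored = [(cosine(q, embed(c)), c) for c in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


# Ten placeholder words, windows of 4 words with 1 word of overlap.
text = " ".join(f"w{i}" for i in range(10))
chunks = chunk_words(text, size=4, overlap=1)
print(chunks)
print(top_k("w4 w5", chunks, k=1))
```

With an overlap of one word, each window starts three words after the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.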
---
## 📊 Evaluation Metrics
Instead of using RAGAS (which requires a paid OpenAI API key), this project implements a custom lightweight evaluator that is free, fast, and interpretable.
| Metric | What It Measures | How |
|---|---|---|
| **Grounding Score** | Is the answer based on retrieved chunks? | Word overlap: words in (answer ∩ context) / words in answer |
| **Retrieval Relevance** | Did we retrieve the right chunks? | Cosine similarity: query vector vs. chunk vectors |
| **Answer Completeness** | Did the LLM use the retrieved context? | Word overlap: words in (answer ∩ context) / words in context |
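
The three formulas in the table can be sketched in plain Python. This is an illustration under stated assumptions, not the project's evaluator: tokenization is a simple lowercase regex, overlap is computed over unique words, relevance is taken as the mean cosine similarity across chunks, and all function names are hypothetical.

```python
import math
import re


def words(text):
    """Lowercase set of unique words; a deliberately simple tokenizer."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def grounding_score(answer, context):
    """Share of answer words that also appear in the retrieved context."""
    a, c = words(answer), words(context)
    return len(a & c) / len(a) if a else 0.0


def completeness_score(answer, context):
    """Share of context words that the answer actually uses."""
    a, c = words(answer), words(context)
    return len(a & c) / len(c) if c else 0.0


def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def retrieval_relevance(query_vec, chunk_vecs):
    """Mean cosine similarity between the query and each retrieved chunk."""
    if not chunk_vecs:
        return 0.0
    return sum(cosine(query_vec, v) for v in chunk_vecs) / len(chunk_vecs)


# Made-up context and answer, for illustration only.
context = "Attention maps a query and a set of key value pairs to an output"
answer = "Attention maps a query to an output"
print(grounding_score(answer, context))     # every answer word is in the context
print(completeness_score(answer, context))  # the answer uses 7 of 13 context words
```

Grounding divides by the answer size and completeness by the context size, so a short, precise answer drawn from a long context scores high on the first and low on the second.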
### Sample Results (Attention Is All You Need paper)
| Query | Grounding | Relevance | Completeness |
|---|---|---|---|
| "What is the attention mechanism?" | 0.97 | 0.61 | 0.16 |
| "Who are the authors?" | 0.95 | 0.58 | 0.12 |
| "What is the conclusion?" | 0.91 | 0.55 | 0.14 |
**Key insight:** High grounding (0.97) but low completeness (0.16) means the LLM extracted a precise answer from a small portion of the retrieved context, which is expected behavior for factual queries.
---
## 🚀 Quick Start
### Prerequisites
- Python 3.10+
- Google AI Studio API key (free at [aistudio.google.com](https://aistudio.google.com))
### Installation
```bash
git clone https://github.com/aneebnaqvi15/rag-research-assistant
cd rag-research-assistant
python -m venv venv
venv\Scripts\activate # Windows
# source venv/bin/activate # Mac/Linux
pip install -r requirements.txt
```
### Configuration
Create a `.env` file:
```
GOOGLE_API_KEY=your_key_here
```
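How `app.py` loads this file is not shown in this README. As a dependency-free sketch, assuming simple `KEY=value` lines, the key could be resolved as below. Both `parse_env` and `resolve_api_key` are hypothetical helpers, and letting an already-exported environment variable win over the file is an assumption, not documented behavior.

```python
import os


def parse_env(text):
    """Parse KEY=value lines into a dict, skipping blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def resolve_api_key(env_text, environ=None):
    """Prefer a variable set in the environment, then fall back to the file text."""
    environ = os.environ if environ is None else environ
    key = environ.get("GOOGLE_API_KEY") or parse_env(env_text).get("GOOGLE_API_KEY")
    if not key:
        raise RuntimeError("GOOGLE_API_KEY is not set")
    return key


# Example with inline text instead of reading the real .env file.
print(resolve_api_key("GOOGLE_API_KEY=your_key_here", environ={}))
```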
### Run
```bash
streamlit run app.py
```
Open `http://localhost:8501` in your browser.
---
## 🛠️ Tech Stack
| Component | Tool | Why |
|---|---|---|
| PDF Loading | `pypdf` + LangChain | Page-level metadata for citations |
| Chunking | `RecursiveCharacterTextSplitter` | Respects sentence boundaries |
| Embeddings | `all-MiniLM-L6-v2` | Fast, free, 384-dim, runs on CPU |
| Vector Store | `ChromaDB` | Local, zero-config, stores metadata |
| LLM | `Gemini 2.5 Flash` | Free tier, strong reasoning |
| Orchestration | `LangChain` | Connects all RAG components |
| UI | `Streamlit` | Rapid prototyping, real-time logs |
| Evaluation | Custom (NumPy + sklearn) | Free, interpretable, no OpenAI dependency |
---
## ✨ Features
- **Multi-document support**: Upload and query across multiple PDFs simultaneously
- **Source citations**: Every answer shows the exact filename and page number
- **Real-time processing logs**: Watch the pipeline run: load → chunk → embed → index
- **Retrieval quality scores**: Three metrics scored on every query
- **Bring your own API key**: Toggle in sidebar to use your own Gemini key
- **Custom model selection**: Enter any Gemini model string from AI Studio
- **Adjustable retrieval**: Control chunk size and top-k via sidebar sliders
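
The source citations feature depends on each chunk carrying its filename and page number through retrieval. A minimal sketch, assuming each retrieved chunk is a dict with `source` and `page` keys and that pages are stored 0-based (both assumptions; the filenames below are made up and `format_citations` is a hypothetical helper):

```python
def format_citations(chunks):
    """Return unique 'file, p. N' strings in first-seen order."""
    seen = set()
    citations = []
    for chunk in chunks:
        key = (chunk["source"], chunk["page"])
        if key in seen:
            continue  # several chunks can come from the same page
        seen.add(key)
        # Pages are assumed to be stored 0-based, so add 1 for display.
        citations.append(f"{chunk['source']}, p. {chunk['page'] + 1}")
    return citations


retrieved = [
    {"source": "attention.pdf", "page": 2, "text": "..."},
    {"source": "attention.pdf", "page": 2, "text": "..."},
    {"source": "realm.pdf", "page": 0, "text": "..."},
]
print(format_citations(retrieved))
```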
---
## 📁 Project Structure
```
rag-research-assistant/
├── app.py              # Streamlit UI + full RAG pipeline
├── requirements.txt    # Pinned dependencies
├── .env                # API keys (not committed)
├── .gitignore
└── README.md
```
---
## 💡 Key Learnings
**1. Dependency hell is real.**
ChromaDB, NumPy 2.0, and OpenTelemetry conflict with one another inside Colab's environment. The fix: use `EphemeralClient()` instead of file-based persistence, and never pin NumPy manually.
**2. RAGAS is not free.**
RAGAS v0.4+ requires the OpenAI API for its `InstructorLLM`. Building a custom evaluator is not a compromise; it's better engineering. You understand every metric you ship.
**3. Chunk size is a hyperparameter.**
512 tokens with 50-token overlap is a starting point, not a truth. Smaller chunks = sharper vectors but less context. Tune based on your document type.
**4. Retrieval can fail silently.**
A grounding score of 1.0 does not mean a correct answer. It means every word in the answer exists somewhere in retrieved chunks. You need multiple metrics to trust a RAG system.
**5. LLM quality matters more than RAG architecture.**
`flan-t5-base` produced gibberish from perfect retrieval. `Gemini 2.5 Flash` produced accurate answers from the same chunks. The retrieval pipeline is only as useful as the model that reads it.
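
Learning 4 is easy to reproduce. In this sketch (a hypothetical helper using unique-word overlap, with a made-up context), a nonsensical answer still gets a grounding score of 1.0 because it only reuses words that occur in the context:

```python
import re


def tokenize(text):
    """Lowercase set of unique words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def grounding_score(answer, context):
    """Share of unique answer words that also occur in the context."""
    a, c = tokenize(answer), tokenize(context)
    return len(a & c) / len(a) if a else 0.0


context = "The encoder has six layers. The decoder also has six layers."
correct = "The encoder has six layers."
wrong = "The decoder has the encoder."  # nonsense, yet every word is in the context

print(grounding_score(correct, context))
print(grounding_score(wrong, context))
```

Both calls return the same score, which is why grounding needs to be read alongside relevance and completeness rather than on its own.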
---
## 🔗 Related Projects
This project is **Part 2** of a connected AI/ML portfolio:
| Project | Focus | Link |
|---|---|---|
| **Banking77 Intent Classifier** | NLP fine-tuning with DistilBERT + LoRA | [GitHub](https://github.com/aneebnaqvi15/banking77-intent-classifier) |
| **RAG Research Assistant** | Retrieval-Augmented Generation + Evaluation | This repo |
Together these demonstrate fine-tuning (shaping model behavior) versus RAG (shaping model knowledge), two complementary approaches to applied NLP.
---
## 📄 References
- [Attention Is All You Need](https://arxiv.org/abs/1706.03762), Vaswani et al. (2017)
- [REALM: Retrieval-Augmented Language Model Pre-Training](https://arxiv.org/abs/2002.08909), Guu et al. (2020)
- [LangChain Documentation](https://docs.langchain.com)
- [ChromaDB Documentation](https://docs.trychroma.com)
- [Sentence Transformers: all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)
---
## 👤 Author
**Syed Muhammad Aneeb**
CS Graduate · Full-Stack Developer · AI Engineer
[![GitHub](https://img.shields.io/badge/GitHub-aneebnaqvi15-black?style=flat-square)](https://github.com/aneebnaqvi15)
---
*Built with zero GPU budget on Google Colab free tier. Constraints breed creativity.*