# Multi-Document Research Assistant — RAG with Retrieval Quality Evaluation

> Upload multiple PDFs. Ask questions. Get cited answers grounded in your documents — with automatic retrieval quality scoring.

[![Live Demo](https://img.shields.io/badge/Live%20Demo-neurotic--evidence--defendant.ngrok--free.dev-38BDF8?style=flat-square)](https://neurotic-evidence-defendant.ngrok-free.dev/)
[![Python](https://img.shields.io/badge/Python-3.10+-blue?style=flat-square)](https://python.org)
[![LangChain](https://img.shields.io/badge/LangChain-0.2-green?style=flat-square)](https://langchain.com)
[![ChromaDB](https://img.shields.io/badge/ChromaDB-0.5-orange?style=flat-square)](https://trychroma.com)

---

## 🎯 Problem

Most LLMs hallucinate when asked about private or domain-specific documents. They generate confident answers from training data — not from your actual content.

Standard RAG systems fix this but introduce a new problem: **you can't tell when retrieval fails.** Wrong chunks get retrieved. The LLM answers confidently from bad context. No warning.

This project solves both problems:
1. Grounds all answers in your uploaded documents
2. Scores retrieval quality on every query so you know when to trust the answer

---

## 🏗️ Architecture

```
PDF Documents
      │
      ▼
┌─────────────────┐
│  PyPDFLoader    │  Extract text page by page
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ RecursiveText   │  Split into 512-token chunks
│   Splitter      │  with 50-token overlap
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ all-MiniLM-L6   │  Embed each chunk → 384-dim vector
│   Embeddings    │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│    ChromaDB     │  Store vectors + text + metadata
│  (Vector Store) │  (source filename + page number)
└────────┬────────┘
         │
    User Query
         │
         ▼
┌─────────────────┐
│ Similarity      │  Cosine search → Top-k chunks
│   Search        │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Gemini Flash   │  Generate answer from chunks only
│     (LLM)       │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Custom Evaluator│  Score grounding, relevance,
│                 │  completeness
└─────────────────┘
         │
         ▼
   Answer + Sources + Quality Scores
```
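
The similarity-search step in the diagram can be sketched in plain Python. This is a minimal illustration, not the app's code: the real pipeline delegates this to ChromaDB, and the `store` layout, the `top_k_chunks` helper, and the toy 3-dim vectors (standing in for 384-dim embeddings) are all assumptions made for the example.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k_chunks(query_vec, store, k=3):
    """Return (score, item) pairs for the k most similar chunks, best first.

    `store` is a hypothetical list of dicts with the keys
    "vector", "text", "source", and "page".
    """
    scored = [(cosine_similarity(query_vec, item["vector"]), item) for item in store]
    # Sort on the score only, so the dicts are never compared.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Toy data: hypothetical chunks with made-up vectors and metadata.
store = [
    {"vector": [1.0, 0.0, 0.0], "text": "chunk about attention", "source": "paper.pdf", "page": 3},
    {"vector": [0.0, 1.0, 0.0], "text": "chunk about training", "source": "paper.pdf", "page": 7},
    {"vector": [0.9, 0.1, 0.0], "text": "chunk about dot products", "source": "paper.pdf", "page": 4},
]

results = top_k_chunks([1.0, 0.0, 0.0], store, k=2)
for score, item in results:
    print(f"{score:.2f}  {item['source']} p.{item['page']}  {item['text']}")
```

Keeping the source filename and page number next to each vector is what makes the citation step possible: the retrieved items already carry everything the answer needs to cite.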

---

## 📊 Evaluation Metrics

Instead of using RAGAS (which requires a paid OpenAI API key), this project implements a custom lightweight evaluator that is free, fast, and interpretable.

| Metric | What It Measures | How |
|---|---|---|
| **Grounding Score** | Is the answer based on retrieved chunks? | Word overlap: (answer ∩ context) / answer, counted in words |
| **Retrieval Relevance** | Did we retrieve the right chunks? | Cosine similarity: query vector vs chunk vectors |
| **Answer Completeness** | How much of the retrieved context did the LLM use? | Word overlap: (answer ∩ context) / context, counted in words |
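
The three metrics in the table can be sketched as follows. This is an illustration under stated assumptions, not the evaluator shipped in `app.py`: it uses only the standard library (the project lists NumPy + sklearn), it treats words as a lowercase set of unique tokens, and it averages cosine similarity across chunks. The function names and the tokenization rule are hypothetical.

```python
import math
import re

def tokenize(text):
    """Lowercase the text and return its unique word tokens (assumed rule)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def grounding_score(answer, context):
    """Share of answer words that also appear in the retrieved context."""
    answer_words = tokenize(answer)
    if not answer_words:
        return 0.0
    return len(answer_words & tokenize(context)) / len(answer_words)

def completeness_score(answer, context):
    """Share of context words that the answer reuses."""
    context_words = tokenize(context)
    if not context_words:
        return 0.0
    return len(tokenize(answer) & context_words) / len(context_words)

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieval_relevance(query_vec, chunk_vecs):
    """Mean cosine similarity between the query and each retrieved chunk.

    Averaging is an assumption; a max or a weighted mean would also fit.
    """
    if not chunk_vecs:
        return 0.0
    return sum(cosine(query_vec, v) for v in chunk_vecs) / len(chunk_vecs)

context = ("the transformer relies entirely on attention mechanisms "
           "and dispenses with recurrence and convolutions")
answer = "the transformer relies on attention"

print("grounding   ", round(grounding_score(answer, context), 2))
print("completeness", round(completeness_score(answer, context), 2))
print("relevance   ", round(retrieval_relevance([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]), 2))
```

Grounding and completeness share a numerator and differ only in the denominator, which is why a short, precise answer drawn from a long context scores high on the first and low on the second.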

### Sample Results (Attention Is All You Need paper)

| Query | Grounding | Relevance | Completeness |
|---|---|---|---|
| "What is the attention mechanism?" | 0.97 | 0.61 | 0.16 |
| "Who are the authors?" | 0.95 | 0.58 | 0.12 |
| "What is the conclusion?" | 0.91 | 0.55 | 0.14 |

**Key insight:** High grounding (0.97) but low completeness (0.16) means the LLM extracted a precise answer from a small portion of retrieved context — expected behavior for factual queries.

---

## 🚀 Quick Start

### Prerequisites
- Python 3.10+
- Google AI Studio API key (free at [aistudio.google.com](https://aistudio.google.com))

### Installation

```bash
git clone https://github.com/aneebnaqvi15/rag-research-assistant
cd rag-research-assistant

python -m venv venv
venv\Scripts\activate  # Windows
# source venv/bin/activate  # Mac/Linux

pip install -r requirements.txt
```

### Configuration

Create a `.env` file:
```
GOOGLE_API_KEY=your_key_here
```
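
For readers curious how a `.env` file turns into a usable key, here is a standard-library sketch. The app may well rely on a package such as `python-dotenv` instead; `parse_env` and `require_key` are hypothetical helpers written for this example only.

```python
def parse_env(text):
    """Parse KEY=value lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values

def require_key(values, name="GOOGLE_API_KEY"):
    """Return the key, or raise if it is missing or still the placeholder."""
    key = values.get(name, "")
    if not key or key == "your_key_here":
        raise RuntimeError(f"{name} is missing; set it in .env")
    return key

# Sample file contents with a dummy key (hypothetical value).
sample = "# local secrets\nGOOGLE_API_KEY=abc123\n"
print(require_key(parse_env(sample)))
```

Failing fast on the `your_key_here` placeholder gives a clearer error than a rejected API call later in the pipeline.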

### Run

```bash
streamlit run app.py
```

Then open `http://localhost:8501` in your browser.

---

## 🛠️ Tech Stack

| Component | Tool | Why |
|---|---|---|
| PDF Loading | `pypdf` + LangChain | Page-level metadata for citations |
| Chunking | `RecursiveCharacterTextSplitter` | Prefers paragraph, line, and word breaks over mid-word cuts |
| Embeddings | `all-MiniLM-L6-v2` | Fast, free, 384-dim, runs on CPU |
| Vector Store | `ChromaDB` | Local, zero-config, stores metadata |
| LLM | `Gemini 2.5 Flash` | Free tier, strong reasoning |
| Orchestration | `LangChain` | Connects all RAG components |
| UI | `Streamlit` | Rapid prototyping, real-time logs |
| Evaluation | Custom (NumPy + sklearn) | Free, interpretable, no OpenAI dependency |

---

## ✨ Features

- **Multi-document support** — Upload and query across multiple PDFs simultaneously
- **Source citations** — Every answer shows exact filename and page number
- **Real-time processing logs** — Watch the pipeline run: load → chunk → embed → index
- **Retrieval quality scores** — Three metrics scored on every query
- **Bring your own API key** — Toggle in sidebar to use your own Gemini key
- **Custom model selection** — Enter any Gemini model string from AI Studio
- **Adjustable retrieval** — Control chunk size and top-k via sidebar sliders

---

## 📁 Project Structure

```
rag-research-assistant/
├── app.py                  # Streamlit UI + full RAG pipeline
├── requirements.txt        # Pinned dependencies
├── .env                    # API keys (not committed)
├── .gitignore
└── README.md
```

---

## 💡 Key Learnings

**1. Dependency hell is real.**
ChromaDB, NumPy 2.0, and OpenTelemetry are at war inside Colab's environment. The fix: use `EphemeralClient()` instead of file-based persistence, and never pin NumPy manually.

**2. RAGAS is not free.**
RAGAS v0.4+ requires an OpenAI API key for its `InstructorLLM`. Building a custom evaluator is not a compromise; it's better engineering, because you understand every metric you ship.

**3. Chunk size is a hyperparameter.**
512 tokens with 50-token overlap is a starting point, not a truth. Smaller chunks = sharper vectors but less context. Tune based on your document type.
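
To get a feel for the trade-off, the sketch below chunks a dummy document at several sizes with a fixed 50-unit overlap. It is a simplification under stated assumptions: it slides a fixed window over a word list, whereas `RecursiveCharacterTextSplitter` splits on separators, and `chunk_words` is a hypothetical helper, not part of the project.

```python
def chunk_words(words, chunk_size, overlap):
    """Split a word list into windows of `chunk_size` that share `overlap` words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        # Stop once a window has reached the end of the document.
        if start + chunk_size >= len(words):
            break
    return chunks

# Stand-in for a tokenized document of 2000 units.
words = [f"w{i}" for i in range(2000)]

for size in (128, 256, 512):
    chunks = chunk_words(words, chunk_size=size, overlap=50)
    print(f"chunk_size={size:>3}  chunks={len(chunks)}")
```

Smaller windows produce many more chunks to embed and search, and with a fixed overlap a larger share of each chunk is duplicated text, so the overlap is worth tuning alongside the chunk size.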

**4. Retrieval can fail silently.**
A grounding score of 1.0 does not mean a correct answer. It only means every word in the answer appears somewhere in the retrieved chunks, whether or not those words are combined into a true statement. You need multiple metrics to trust a RAG system.
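
A small example makes the limitation concrete. The context and answers below are made-up toy data, and `grounding_score` is a hypothetical word-overlap implementation (unique lowercase tokens), not necessarily the one in `app.py`.

```python
import re

def tokenize(text):
    """Lowercase the text and return its unique word tokens (assumed rule)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def grounding_score(answer, context):
    """Share of answer words that also appear in the context."""
    answer_words = tokenize(answer)
    if not answer_words:
        return 0.0
    return len(answer_words & tokenize(context)) / len(answer_words)

# Toy context: two facts that share most of their vocabulary.
context = "Alice wrote the parser. Bob wrote the tokenizer."

correct = "Alice wrote the parser."
wrong = "Bob wrote the parser."  # recombines context words into a false claim

print("correct:", grounding_score(correct, context))
print("wrong:  ", grounding_score(wrong, context))
```

Both answers are fully "grounded" by word overlap, yet only one is true. Word overlap checks vocabulary, not meaning, which is why the score is best read next to retrieval relevance and the cited sources.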

**5. LLM quality matters more than RAG architecture.**
`flan-t5-base` produced gibberish from perfect retrieval. `Gemini 2.5 Flash` produced accurate answers from the same chunks. The retrieval pipeline is only as useful as the model that reads it.

---

## 🔗 Related Projects

This project is **Part 2** of a connected AI/ML portfolio:

| Project | Focus | Link |
|---|---|---|
| **Banking77 Intent Classifier** | NLP fine-tuning with DistilBERT + LoRA | [GitHub](https://github.com/aneebnaqvi15/banking77-intent-classifier) |
| **RAG Research Assistant** | Retrieval-Augmented Generation + Evaluation | This repo |

Together these demonstrate: fine-tuning (shaping model behavior) vs RAG (shaping model knowledge) — two complementary approaches to applied NLP.

---

## 📄 References

- [Attention Is All You Need](https://arxiv.org/abs/1706.03762) — Vaswani et al. (2017)
- [REALM: Retrieval-Augmented Language Model Pre-Training](https://arxiv.org/abs/2002.08909)
- [LangChain Documentation](https://docs.langchain.com)
- [ChromaDB Documentation](https://docs.trychroma.com)
- [Sentence Transformers — all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)

---

## 👤 Author

**Syed Muhammad Aneeb**
CS Graduate · Full-Stack Developer · AI Engineer

[![GitHub](https://img.shields.io/badge/GitHub-aneebnaqvi15-black?style=flat-square)](https://github.com/aneebnaqvi15)

---

*Built with zero GPU budget on Google Colab free tier. Constraints breed creativity.*