	# Multi-Document Research Assistant — RAG with Retrieval Quality Evaluation

	> Upload multiple PDFs. Ask questions. Get cited answers grounded in your documents — with automatic retrieval quality scoring.

	[![Live Demo](https://img.shields.io/badge/Live%20Demo-neurotic--evidence--defendant.ngrok--free.dev-38BDF8?style=flat-square)](https://neurotic-evidence-defendant.ngrok-free.dev/)
	[![Python](https://img.shields.io/badge/Python-3.10+-blue?style=flat-square)](https://python.org)
	[![LangChain](https://img.shields.io/badge/LangChain-0.2-green?style=flat-square)](https://langchain.com)
	[![ChromaDB](https://img.shields.io/badge/ChromaDB-0.5-orange?style=flat-square)](https://trychroma.com)

	---

	## 🎯 Problem

	Most LLMs hallucinate when asked about private or domain-specific documents. They generate confident answers from training data — not from your actual content.

	Standard RAG systems fix this but introduce a new problem: you can't tell when retrieval fails. Wrong chunks get retrieved. The LLM answers confidently from bad context. No warning.

	This project solves both problems:
	1. Grounds all answers in your uploaded documents
	2. Scores retrieval quality on every query so you know when to trust the answer

	---

	## 🏗️ Architecture

	```
	   PDF Documents
	         │
	         ▼
	┌─────────────────┐
	│   PyPDFLoader   │  Extract text page by page
	└────────┬────────┘
	         │
	         ▼
	┌─────────────────┐
	│  RecursiveText  │  Split into 512-token chunks
	│    Splitter     │  with 50-token overlap
	└────────┬────────┘
	         │
	         ▼
	┌─────────────────┐
	│  all-MiniLM-L6  │  Embed each chunk → 384-dim vector
	│   Embeddings    │
	└────────┬────────┘
	         │
	         ▼
	┌─────────────────┐
	│    ChromaDB     │  Store vectors + text + metadata
	│ (Vector Store)  │  (source filename + page number)
	└────────┬────────┘
	         │
	     User Query
	         │
	         ▼
	┌─────────────────┐
	│   Similarity    │  Cosine search → Top-k chunks
	│     Search      │
	└────────┬────────┘
	         │
	         ▼
	┌─────────────────┐
	│  Gemini Flash   │  Generate answer from chunks only
	│      (LLM)      │
	└────────┬────────┘
	         │
	         ▼
	┌─────────────────┐
	│ Custom Evaluator│  Score grounding, relevance,
	│                 │  completeness
	└────────┬────────┘
	         │
	         ▼
	Answer + Sources + Quality Scores
	```
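The chunking and similarity-search stages of the diagram can be sketched in plain Python. This is a simplified illustration and not the code in `app.py`: whitespace-separated words stand in for tokens, the vectors are toy lists rather than MiniLM embeddings, and the function names `chunk_words`, `cosine`, and `top_k` are hypothetical. The project itself delegates these steps to `RecursiveCharacterTextSplitter` and ChromaDB.

```python
import math


def chunk_words(text, source, page, chunk_size=512, overlap=50):
    """Split text into overlapping windows of whitespace-separated words.

    Assumption: words approximate tokens. Each chunk keeps the source
    filename and page number so answers can cite where they came from.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_size]
        chunks.append({"text": " ".join(window), "source": source, "page": page})
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, chunk_vecs, k=3):
    """Return (index, score) pairs for the k chunk vectors most similar to the query."""
    scored = [(i, cosine(query_vec, vec)) for i, vec in enumerate(chunk_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]
```

With `chunk_size=4` and `overlap=1`, a ten-word page yields three chunks, and each chunk repeats the last word of the previous one. That repetition is what keeps a sentence that straddles a boundary retrievable from either side.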

	---

	## 📊 Evaluation Metrics

	Instead of using RAGAS (whose LLM-judged metrics rely on the paid OpenAI API by default), this project implements a custom lightweight evaluator that is free, fast, and interpretable.

	| Metric | What It Measures | How |
	|---|---|---|
	| Grounding Score | Is the answer based on retrieved chunks? | Word overlap: (answer ∩ context) / answer |
	| Retrieval Relevance | Did we retrieve the right chunks? | Cosine similarity: query vector vs chunk vectors |
	| Answer Completeness | Did the LLM use the retrieved context? | Word overlap: (answer ∩ context) / context |
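A minimal sketch of how the three metrics in the table could be computed. Several details are assumptions, since the table does not specify them: words are compared as lowercase sets (not counts), relevance is the mean cosine similarity over the retrieved chunks, and the function names are hypothetical. The project's evaluator uses NumPy and sklearn; this version sticks to the standard library.

```python
import math
import re


def words(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def grounding_score(answer, context):
    """Share of answer words that also appear in the retrieved context."""
    a, c = words(answer), words(context)
    return len(a & c) / len(a) if a else 0.0


def completeness_score(answer, context):
    """Share of context words that the answer actually used."""
    a, c = words(answer), words(context)
    return len(a & c) / len(c) if c else 0.0


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieval_relevance(query_vec, chunk_vecs):
    """Mean cosine similarity between the query and each retrieved chunk (assumed aggregation)."""
    if not chunk_vecs:
        return 0.0
    return sum(cosine(query_vec, vec) for vec in chunk_vecs) / len(chunk_vecs)
```

Both overlap metrics share a numerator and differ only in the denominator, so a short answer drawn from a long context scores high on grounding and low on completeness.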

	### Sample Results (Attention Is All You Need paper)

	| Query | Grounding | Relevance | Completeness |
	|---|---|---|---|
	| "What is the attention mechanism?" | 0.97 | 0.61 | 0.16 |
	| "Who are the authors?" | 0.95 | 0.58 | 0.12 |
	| "What is the conclusion?" | 0.91 | 0.55 | 0.14 |

	Key insight: High grounding (0.97) but low completeness (0.16) means the LLM extracted a precise answer from a small portion of retrieved context — expected behavior for factual queries.

	---

	## 🚀 Quick Start

	### Prerequisites
	- Python 3.10+
	- Google AI Studio API key (free at [aistudio.google.com](https://aistudio.google.com))

	### Installation

	```bash
	git clone https://github.com/aneebnaqvi15/rag-research-assistant
	cd rag-research-assistant

	python -m venv venv
	venv\Scripts\activate # Windows
	# source venv/bin/activate # Mac/Linux

	pip install -r requirements.txt
	```

	### Configuration

	Create a `.env` file:
	```
	GOOGLE_API_KEY=your_key_here
	```
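How `app.py` reads this file is not shown in this README. As a rough, standard-library-only sketch, `KEY=value` lines could be loaded and validated as follows. The helper names `parse_env`, `load_env_file`, and `get_api_key` are hypothetical, and a library such as `python-dotenv` would be the more common choice.

```python
import os


def parse_env(lines):
    """Turn KEY=value lines into a dict, skipping blanks and # comments."""
    values = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_env_file(path=".env"):
    """Copy values from the file into os.environ without overriding existing variables."""
    if os.path.exists(path):
        with open(path, encoding="utf-8") as fh:
            for key, value in parse_env(fh).items():
                os.environ.setdefault(key, value)


def get_api_key(name="GOOGLE_API_KEY"):
    """Return the key, failing early if it is missing or still the placeholder."""
    key = os.environ.get(name, "")
    if not key or key == "your_key_here":
        raise RuntimeError(f"{name} is not set; add it to .env or the environment")
    return key
```

Failing at startup when the placeholder is still in place gives a clearer error than an authentication failure on the first query.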

	### Run

	```bash
	streamlit run app.py
	```

	Open `http://localhost:8501`

	---

	## 🛠️ Tech Stack

	| Component | Tool | Why |
	|---|---|---|
	| PDF Loading | `pypdf` + LangChain | Page-level metadata for citations |
	| Chunking | `RecursiveCharacterTextSplitter` | Respects sentence boundaries |
	| Embeddings | `all-MiniLM-L6-v2` | Fast, free, 384-dim, runs on CPU |
	| Vector Store | `ChromaDB` | Local, zero-config, stores metadata |
	| LLM | `Gemini 2.5 Flash` | Free tier, strong reasoning |
	| Orchestration | `LangChain` | Connects all RAG components |
	| UI | `Streamlit` | Rapid prototyping, real-time logs |
	| Evaluation | Custom (NumPy + sklearn) | Free, interpretable, no OpenAI dependency |

	---

	## ✨ Features

	- Multi-document support — Upload and query across multiple PDFs simultaneously
	- Source citations — Every answer shows exact filename and page number
	- Real-time processing logs — Watch the pipeline run: load → chunk → embed → index
	- Retrieval quality scores — Three metrics scored on every query
	- Bring your own API key — Toggle in sidebar to use your own Gemini key
	- Custom model selection — Enter any Gemini model string from AI Studio
	- Adjustable retrieval — Control chunk size and top-k via sidebar sliders

	---

	## 📁 Project Structure

	```
	rag-research-assistant/
	├── app.py              # Streamlit UI + full RAG pipeline
	├── requirements.txt    # Pinned dependencies
	├── .env                # API keys (not committed)
	├── .gitignore
	└── README.md
	```

	---

	## 💡 Key Learnings

	1. Dependency hell is real.
	ChromaDB, NumPy 2.0, and OpenTelemetry conflict with one another inside Colab's environment. The fix: use `EphemeralClient()` instead of file-based persistence, and never pin NumPy manually.

	2. RAGAS is not free.
	RAGAS v0.4+ relies on the OpenAI API by default for its `InstructorLLM`. Building a custom evaluator is not a compromise; it's better engineering. You understand every metric you ship.

	3. Chunk size is a hyperparameter.
	512 tokens with 50-token overlap is a starting point, not a truth. Smaller chunks give more focused embeddings but carry less context. Tune based on your document type.

	4. Retrieval can fail silently.
	A grounding score of 1.0 does not mean the answer is correct. It only means every word in the answer exists somewhere in the retrieved chunks. You need multiple metrics to trust a RAG system.

	5. LLM quality matters more than RAG architecture.
	`flan-t5-base` produced gibberish from perfect retrieval. `Gemini 2.5 Flash` produced accurate answers from the same chunks. The retrieval pipeline is only as useful as the model that reads it.
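Learning 4 can be made concrete with a small worked example. The sketch below assumes a set-based word-overlap grounding score and uses a made-up context and answer. The `needs_review` helper and its thresholds are hypothetical illustrations, not values taken from this project.

```python
import re


def word_set(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def grounding(answer, context):
    """Share of answer words that also appear in the context."""
    a = word_set(answer)
    return len(a & word_set(context)) / len(a) if a else 0.0


# Invented example: the answer recombines words from two different sentences.
context = ("The Transformer relies on attention. "
           "Recurrent networks process tokens sequentially.")
wrong_answer = "The Transformer networks process tokens sequentially."

# Every answer word occurs in the context, so grounding is perfect
# even though the statement contradicts the context.
score = grounding(wrong_answer, context)


def needs_review(grounding_score, relevance_score,
                 min_grounding=0.8, min_relevance=0.5):
    """Flag an answer when either metric falls below its (hypothetical) threshold."""
    return grounding_score < min_grounding or relevance_score < min_relevance
```

Word overlap ignores word order and meaning, which is why it cannot catch this case on its own. Pairing it with retrieval relevance catches queries where the chunks themselves were off-topic, although a recombined answer built from relevant chunks would still need a semantic check.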

	---

	## 🔗 Related Projects

	This project is Part 2 of a connected AI/ML portfolio:

	| Project | Focus | Link |
	|---|---|---|
	| Banking77 Intent Classifier | NLP fine-tuning with DistilBERT + LoRA | [GitHub](https://github.com/aneebnaqvi15/banking77-intent-classifier) |
	| RAG Research Assistant | Retrieval-Augmented Generation + Evaluation | This repo |

	Together these demonstrate: fine-tuning (shaping model behavior) vs RAG (shaping model knowledge) — two complementary approaches to applied NLP.

	---

	## 📄 References

	- [Attention Is All You Need](https://arxiv.org/abs/1706.03762) — Vaswani et al. (2017)
	- [REALM: Retrieval-Augmented Language Model Pre-Training](https://arxiv.org/abs/2002.08909)
	- [LangChain Documentation](https://docs.langchain.com)
	- [ChromaDB Documentation](https://docs.trychroma.com)
	- [Sentence Transformers — all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)

	---

	## 👤 Author

	Syed Muhammad Aneeb
	CS Graduate · Full-Stack Developer · AI Engineer

	[![GitHub](https://img.shields.io/badge/GitHub-aneebnaqvi15-black?style=flat-square)](https://github.com/aneebnaqvi15)

	---

	Built with zero GPU budget on Google Colab free tier. Constraints breed creativity.