Multi-Document Research Assistant — RAG with Retrieval Quality Evaluation

Upload multiple PDFs. Ask questions. Get cited answers grounded in your documents — with automatic retrieval quality scoring.

🎯 Problem

LLMs often hallucinate when asked about private or domain-specific documents they were never trained on. They generate confident answers from training data, not from your actual content.

Standard RAG systems reduce this but introduce a new problem: you can't tell when retrieval fails. If the wrong chunks are retrieved, the LLM still answers confidently from bad context, with no warning.

This project solves both problems:

Grounds all answers in your uploaded documents
Scores retrieval quality on every query so you know when to trust the answer

🏗️ Architecture

PDF Documents
      │
      ▼
┌─────────────────┐
│  PyPDFLoader    │  Extract text page by page
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ RecursiveText   │  Split into 512-token chunks
│   Splitter      │  with 50-token overlap
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ all-MiniLM-L6   │  Embed each chunk → 384-dim vector
│   Embeddings    │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│    ChromaDB     │  Store vectors + text + metadata
│  (Vector Store) │  (source filename + page number)
└────────┬────────┘
         │
    User Query
         │
         ▼
┌─────────────────┐
│ Similarity      │  Cosine search → Top-k chunks
│   Search        │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Gemini Flash   │  Generate answer from chunks only
│     (LLM)       │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Custom Evaluator│  Score grounding, relevance,
│                 │  completeness
└─────────────────┘
         │
         ▼
   Answer + Sources + Quality Scores
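
The chunking and similarity-search stages in the diagram can be sketched in plain Python. This is a minimal illustration, not the code in app.py: `chunk_tokens`, `embed`, `cosine`, and `top_k` are hypothetical names, whitespace-separated words stand in for tokens, and a bag-of-words counter stands in for the all-MiniLM-L6-v2 embedding so the example runs with no dependencies.

```python
import math
from collections import Counter


def chunk_tokens(text, chunk_size=512, overlap=50):
    """Split text into windows of `chunk_size` words that share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    tokens = text.split()  # assumption: whitespace words approximate tokens
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


def embed(text):
    """Toy stand-in for a real embedding model: sparse word-count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query, chunks, k=3):
    """Return the k chunks most similar to the query as (score, chunk), best first."""
    q = embed(query)
    scored = [(cosine(q, embed(c)), c) for c in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
```

With a 1,000-word text, `chunk_tokens(text, 512, 50)` yields three chunks, each starting 462 words after the previous one, so neighbouring chunks share 50 words. In the real pipeline the splitter, embedding model, and vector store replace these toy functions.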

📊 Evaluation Metrics

Instead of using RAGAS (which defaults to OpenAI models and so typically needs a paid API key), this project implements a custom lightweight evaluator that is free, fast, and interpretable.

Metric	What It Measures	How
Grounding Score	Is the answer based on retrieved chunks?	Word overlap: |answer ∩ context| / |answer|
Retrieval Relevance	Did we retrieve the right chunks?	Cosine similarity: query vector vs. chunk vectors
Answer Completeness	How much of the retrieved context did the LLM use?	Word overlap: |answer ∩ context| / |context|
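
The three formulas in the table can be sketched as follows. This is an illustrative reading of the table, not the evaluator shipped in app.py: the function names are hypothetical, overlap is computed on sets of unique lowercase words, and relevance is taken as the mean cosine similarity over the retrieved chunks. The source does not specify the tokenizer or the aggregation, so both are assumptions.

```python
import math
import re


def words(text):
    """Set of unique lowercase words (assumed tokenizer for this sketch)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def grounding_score(answer, context):
    """|answer ∩ context| / |answer|: share of answer words found in the context."""
    a, c = words(answer), words(context)
    return len(a & c) / len(a) if a else 0.0


def completeness_score(answer, context):
    """|answer ∩ context| / |context|: share of context words used by the answer."""
    a, c = words(answer), words(context)
    return len(a & c) / len(c) if c else 0.0


def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def retrieval_relevance(query_vec, chunk_vecs):
    """Mean cosine similarity between the query and each retrieved chunk (assumed aggregation)."""
    if not chunk_vecs:
        return 0.0
    return sum(cosine(query_vec, v) for v in chunk_vecs) / len(chunk_vecs)
```

A short answer drawn from a long context scores high on grounding and low on completeness, because the denominator for completeness is the full context vocabulary.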

Sample Results (Attention Is All You Need paper)

Query	Grounding	Relevance	Completeness
"What is the attention mechanism?"	0.97	0.61	0.16
"Who are the authors?"	0.95	0.58	0.12
"What is the conclusion?"	0.91	0.55	0.14

Key insight: High grounding (0.97) but low completeness (0.16) means the LLM extracted a precise answer from a small portion of retrieved context — expected behavior for factual queries.

🚀 Quick Start

Prerequisites

Python 3.10+
Google AI Studio API key (free at aistudio.google.com)

Installation

git clone https://github.com/aneebnaqvi15/rag-research-assistant
cd rag-research-assistant

python -m venv venv
venv\Scripts\activate  # Windows
# source venv/bin/activate  # Mac/Linux

pip install -r requirements.txt

Configuration

Create a .env file:

GOOGLE_API_KEY=your_key_here
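
For reference, here is one way an app could read that key at startup. This is a dependency-free sketch under assumptions: `load_env_file` and `get_api_key` are hypothetical helpers, and the actual app may load the file differently (for example with a dotenv library).

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Copy KEY=value lines from a .env file into os.environ without overwriting existing values."""
    env_path = Path(path)
    if not env_path.exists():
        return
    for line in env_path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and malformed lines
        key, _, value = line.partition("=")
        os.environ.setdefault(key.strip(), value.strip().strip('"').strip("'"))


def get_api_key(path=".env"):
    """Return GOOGLE_API_KEY, failing early with a clear message if it is missing."""
    load_env_file(path)
    key = os.environ.get("GOOGLE_API_KEY")
    if not key:
        raise RuntimeError("GOOGLE_API_KEY is not set; add it to .env or the environment")
    return key
```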

Run

streamlit run app.py

Open http://localhost:8501

🛠️ Tech Stack

Component	Tool	Why
PDF Loading	`pypdf` + LangChain	Page-level metadata for citations
Chunking	`RecursiveCharacterTextSplitter`	Prefers paragraph, line, then word boundaries when splitting
Embeddings	`all-MiniLM-L6-v2`	Fast, free, 384-dim, runs on CPU
Vector Store	`ChromaDB`	Local, zero-config, stores metadata
LLM	`Gemini 2.5 Flash`	Free tier, strong reasoning
Orchestration	`LangChain`	Connects all RAG components
UI	`Streamlit`	Rapid prototyping, real-time logs
Evaluation	Custom (NumPy + sklearn)	Free, interpretable, no OpenAI dependency

✨ Features

Multi-document support — Upload and query across multiple PDFs simultaneously
Source citations — Every answer shows exact filename and page number
Real-time processing logs — Watch the pipeline run: load → chunk → embed → index
Retrieval quality scores — Three metrics scored on every query
Bring your own API key — Toggle in sidebar to use your own Gemini key
Custom model selection — Enter any Gemini model string from AI Studio
Adjustable retrieval — Control chunk size and top-k via sidebar sliders

📁 Project Structure

rag-research-assistant/
├── app.py                  # Streamlit UI + full RAG pipeline
├── requirements.txt        # Pinned dependencies
├── .env                    # API keys (not committed)
├── .gitignore
└── README.md

💡 Key Learnings

1. Dependency hell is real. ChromaDB, NumPy 2.0, and OpenTelemetry conflict with one another inside Colab's environment. The fix: use EphemeralClient() instead of file-based persistence, and never pin NumPy manually.

2. RAGAS is not free by default. RAGAS v0.4+ expects an OpenAI API key for its InstructorLLM. Building a custom evaluator is not a compromise: it's better engineering, because you understand every metric you ship.

3. Chunk size is a hyperparameter. 512 tokens with 50-token overlap is a starting point, not a truth. Smaller chunks = sharper vectors but less context. Tune based on your document type.

4. Retrieval can fail silently. A grounding score of 1.0 does not mean the answer is correct. It only means every word in the answer appears somewhere in the retrieved chunks. You need multiple metrics to trust a RAG system.

5. LLM quality matters more than RAG architecture. flan-t5-base produced gibberish from perfect retrieval. Gemini 2.5 Flash produced accurate answers from the same chunks. The retrieval pipeline is only as useful as the model that reads it.
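
Learning 4 can be made concrete with a toy case. The sketch below assumes grounding is computed as set-based word overlap (a hypothetical `grounding_score`, not necessarily the project's exact implementation) and shows a factually wrong answer that still scores a perfect 1.0, because every word it uses appears somewhere in the context.

```python
import re


def words(text):
    """Set of unique lowercase words (assumed tokenizer for this sketch)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def grounding_score(answer, context):
    """Share of answer words that also appear in the context."""
    answer_words = words(answer)
    if not answer_words:
        return 0.0
    return len(answer_words & words(context)) / len(answer_words)


context = "Paris is the capital of France. Berlin is the capital of Germany."
wrong_answer = "Berlin is the capital of France."

# Every word of the wrong answer occurs in the context, so overlap is total.
score = grounding_score(wrong_answer, context)
print(score)  # 1.0
```

Word overlap ignores word order and which facts the words belong to, which is why grounding alone cannot certify correctness.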

🔗 Related Projects

This project is Part 2 of a connected AI/ML portfolio:

Project	Focus	Link
Banking77 Intent Classifier	NLP fine-tuning with DistilBERT + LoRA	GitHub
RAG Research Assistant	Retrieval-Augmented Generation + Evaluation	This repo

Together these contrast fine-tuning (shaping model behavior) with RAG (shaping model knowledge), two complementary approaches to applied NLP.

📄 References

Attention Is All You Need, Vaswani et al. (2017)
REALM: Retrieval-Augmented Language Model Pre-Training, Guu et al. (2020)
LangChain Documentation
ChromaDB Documentation
Sentence Transformers — all-MiniLM-L6-v2

👤 Author

Syed Muhammad Aneeb
CS Graduate · Full-Stack Developer · AI Engineer

Built with zero GPU budget on Google Colab free tier. Constraints breed creativity.