Multi-Document Research Assistant: RAG with Retrieval Quality Evaluation
Upload multiple PDFs. Ask questions. Get cited answers grounded in your documents, with automatic retrieval quality scoring.
Problem
Most LLMs hallucinate when asked about private or domain-specific documents. They generate confident answers from training data, not from your actual content.
Standard RAG systems fix this but introduce a new problem: you can't tell when retrieval fails. Wrong chunks get retrieved. The LLM answers confidently from bad context. No warning.
This project solves both problems:
- Grounds all answers in your uploaded documents
- Scores retrieval quality on every query so you know when to trust the answer
Architecture
   PDF Documents
         │
         ▼
┌─────────────────┐
│   PyPDFLoader   │  Extract text page by page
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  RecursiveText  │  Split into 512-token chunks
│    Splitter     │  with 50-token overlap
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  all-MiniLM-L6  │  Embed each chunk → 384-dim vector
│   Embeddings    │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│    ChromaDB     │  Store vectors + text + metadata
│ (Vector Store)  │  (source filename + page number)
└────────┬────────┘
         │
    User Query
         │
         ▼
┌─────────────────┐
│   Similarity    │  Cosine search → Top-k chunks
│     Search      │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Gemini Flash   │  Generate answer from chunks only
│      (LLM)      │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Custom Evaluator│  Score grounding, relevance,
│                 │  completeness
└────────┬────────┘
         │
         ▼
Answer + Sources + Quality Scores
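The chunking and similarity-search stages in the diagram can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code in app.py: whitespace-separated words stand in for tokens, the vectors are toy values rather than MiniLM embeddings, and `chunk_words`, `cosine`, and `top_k` are hypothetical helper names. The actual pipeline delegates these steps to LangChain and ChromaDB.

```python
import math


def chunk_words(text, chunk_size=512, overlap=50):
    """Split text into windows of `chunk_size` words sharing `overlap` words.

    Assumption: words stand in for tokens to keep the sketch dependency-free.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, chunk_vecs, k=3):
    """Return (chunk_index, score) pairs for the k most similar chunk vectors."""
    scored = [(i, cosine(query_vec, v)) for i, v in enumerate(chunk_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]
```

With `chunk_size=4` and `overlap=1`, a ten-word text yields three chunks, and each chunk starts with the last word of the previous one. That shared word is what the 50-token overlap provides at full scale: a sentence cut at a chunk boundary still appears intact in at least one chunk.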
Evaluation Metrics
Instead of using RAGAS (which relies on a paid OpenAI API by default), this project implements a custom lightweight evaluator that is free, fast, and interpretable.
| Metric | What It Measures | How |
|---|---|---|
| Grounding Score | Is the answer based on retrieved chunks? | Word overlap: words in (answer ∩ context) / words in answer |
| Retrieval Relevance | Did we retrieve the right chunks? | Cosine similarity: query vector vs. chunk vectors |
| Answer Completeness | Did the LLM use the retrieved context? | Word overlap: words in (answer ∩ context) / words in context |
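The three formulas in the table can be written out as a short sketch. Several details here are assumptions, since the table does not specify them: words are compared as lowercase sets (not counts), tokenization uses a simple regex, and relevance is taken as the mean cosine similarity over the retrieved chunks. The function names are illustrative and may not match those in app.py.

```python
import math
import re


def _words(text):
    """Lowercase set of alphanumeric words (the regex tokenizer is an assumption)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def grounding_score(answer, context):
    """Share of answer words that also occur in the retrieved context."""
    a, c = _words(answer), _words(context)
    return len(a & c) / len(a) if a else 0.0


def completeness_score(answer, context):
    """Share of context words that the answer actually uses."""
    a, c = _words(answer), _words(context)
    return len(a & c) / len(c) if c else 0.0


def _cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieval_relevance(query_vec, chunk_vecs):
    """Mean cosine similarity between the query and each chunk (mean is an assumption)."""
    if not chunk_vecs:
        return 0.0
    return sum(_cosine(query_vec, v) for v in chunk_vecs) / len(chunk_vecs)
```

Grounding and completeness share a numerator and differ only in the denominator, so a short answer drawn from a long context scores high on the first and low on the second.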
Sample Results (Attention Is All You Need paper)
| Query | Grounding | Relevance | Completeness |
|---|---|---|---|
| "What is the attention mechanism?" | 0.97 | 0.61 | 0.16 |
| "Who are the authors?" | 0.95 | 0.58 | 0.12 |
| "What is the conclusion?" | 0.91 | 0.55 | 0.14 |
Key insight: High grounding (0.97) but low completeness (0.16) means the LLM extracted a precise answer from a small portion of the retrieved context, which is expected behavior for factual queries.
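One way to read the three scores together is a small rule-based helper. The sketch below is purely illustrative: `interpret_scores` is a hypothetical function, and every threshold in it is an assumption chosen to match the sample table, not a value taken from the project.

```python
def interpret_scores(grounding, relevance, completeness,
                     min_grounding=0.8, min_relevance=0.5, focus_cutoff=0.3):
    """Return a short reading of a score triple.

    All thresholds are illustrative assumptions; tune them on your own documents.
    """
    if relevance < min_relevance:
        return "weak retrieval: the chunks may not match the question"
    if grounding < min_grounding:
        return "weak grounding: the answer may draw on content outside the chunks"
    if completeness < focus_cutoff:
        return "grounded, focused answer: uses a small part of the context"
    return "grounded, broad answer: uses much of the context"
```

Applied to the first sample row (0.97, 0.61, 0.16), this returns the "grounded, focused answer" reading, which matches the interpretation above.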
Quick Start
Prerequisites
- Python 3.10+
- Google AI Studio API key (free at aistudio.google.com)
Installation
git clone https://github.com/aneebnaqvi15/rag-research-assistant
cd rag-research-assistant
python -m venv venv
venv\Scripts\activate # Windows
# source venv/bin/activate # Mac/Linux
pip install -r requirements.txt
Configuration
Create a .env file:
GOOGLE_API_KEY=your_key_here
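For reference, reading that key at startup could look like the sketch below. It uses only the standard library and is an assumption about how such loading might work; app.py may instead rely on a package such as python-dotenv. `parse_env_text`, `load_env_file`, and `require_api_key` are hypothetical names.

```python
import os
from pathlib import Path


def parse_env_text(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    pairs = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        pairs[key.strip()] = value.strip().strip('"').strip("'")
    return pairs


def load_env_file(path=".env"):
    """Copy pairs from a .env file into os.environ without overriding existing values."""
    env_path = Path(path)
    if not env_path.exists():
        return
    for key, value in parse_env_text(env_path.read_text(encoding="utf-8")).items():
        os.environ.setdefault(key, value)


def require_api_key(name="GOOGLE_API_KEY"):
    """Return the key, or fail early if it is missing or still the placeholder."""
    key = os.environ.get(name, "")
    if not key or key == "your_key_here":
        raise RuntimeError(f"{name} is not set; add it to .env or the environment")
    return key
```

Calling `load_env_file()` and then `require_api_key()` before the first model call surfaces a missing key immediately, rather than as an authentication error in the middle of a query.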
Run
streamlit run app.py
Open http://localhost:8501
Tech Stack
| Component | Tool | Why |
|---|---|---|
| PDF Loading | pypdf + LangChain | Page-level metadata for citations |
| Chunking | RecursiveCharacterTextSplitter | Splits on paragraph and line breaks first, then on spaces |
| Embeddings | all-MiniLM-L6-v2 | Fast, free, 384-dim, runs on CPU |
| Vector Store | ChromaDB | Local, zero-config, stores metadata |
| LLM | Gemini 2.5 Flash | Free tier, strong reasoning |
| Orchestration | LangChain | Connects all RAG components |
| UI | Streamlit | Rapid prototyping, real-time logs |
| Evaluation | Custom (NumPy + sklearn) | Free, interpretable, no OpenAI dependency |
Features
- Multi-document support: Upload and query across multiple PDFs simultaneously
- Source citations: Every answer shows the exact filename and page number
- Real-time processing logs: Watch the pipeline run: load → chunk → embed → index
- Retrieval quality scores: Three metrics scored on every query
- Bring your own API key: Toggle in the sidebar to use your own Gemini key
- Custom model selection: Enter any Gemini model string from AI Studio
- Adjustable retrieval: Control chunk size and top-k via sidebar sliders
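The source-citation feature can be illustrated with a small formatter over retrieved-chunk metadata. This sketch assumes each chunk carries a `source` path and a zero-based `page` number, which is the metadata LangChain's PyPDFLoader attaches; `format_citations` itself is a hypothetical helper, not a function from app.py.

```python
from pathlib import Path


def format_citations(chunk_metadata):
    """Turn chunk metadata into unique 'file, p. N' strings, in retrieval order.

    Assumption: `page` is zero-based, so 1 is added for display.
    """
    seen, citations = set(), []
    for meta in chunk_metadata:
        name = Path(meta.get("source", "unknown")).name
        page = meta.get("page")
        label = f"{name}, p. {page + 1}" if isinstance(page, int) else name
        if label not in seen:
            seen.add(label)
            citations.append(label)
    return citations
```

Deduplicating by label keeps the source list short when several retrieved chunks come from the same page.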
Project Structure
rag-research-assistant/
├── app.py            # Streamlit UI + full RAG pipeline
├── requirements.txt  # Pinned dependencies
├── .env              # API keys (not committed)
├── .gitignore
└── README.md
Key Learnings
1. Dependency hell is real. ChromaDB, NumPy 2.0, and OpenTelemetry conflict with each other inside Colab's environment. The fix: use ChromaDB's EphemeralClient() instead of file-based persistence, and never pin NumPy manually.
2. RAGAS is not free. RAGAS v0.4+ requires an OpenAI API key for its InstructorLLM. Building a custom evaluator is not a compromise; it is better engineering, because you understand every metric you ship.
3. Chunk size is a hyperparameter. 512 tokens with 50-token overlap is a starting point, not a truth. Smaller chunks = sharper vectors but less context. Tune based on your document type.
4. Retrieval can fail silently. A grounding score of 1.0 does not mean a correct answer. It means every word in the answer exists somewhere in retrieved chunks. You need multiple metrics to trust a RAG system.
5. LLM quality matters more than RAG architecture. flan-t5-base produced gibberish from perfect retrieval. Gemini 2.5 Flash produced accurate answers from the same chunks. The retrieval pipeline is only as useful as the model that reads it.
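Learning 4 can be checked with a toy example. The sketch below re-implements the word-overlap grounding score under the assumption that words are compared as lowercase sets, then scores one correct and one incorrect answer against the same invented context.

```python
import re


def grounding_score(answer, context):
    """Share of answer words that also occur in the context (set-based; an assumption)."""
    def words(text):
        return set(re.findall(r"[a-z0-9]+", text.lower()))

    a, c = words(answer), words(context)
    return len(a & c) / len(a) if a else 0.0


# Toy context and answers, invented for illustration.
context = "Berlin is the capital of Germany. Paris is the capital of France."
correct = "Paris is the capital of France."
wrong = "Paris is the capital of Germany."

correct_score = grounding_score(correct, context)
wrong_score = grounding_score(wrong, context)
print(correct_score, wrong_score)
```

Both answers score 1.0, because every word of the wrong answer also appears somewhere in the context. Word overlap measures vocabulary, not facts, which is why grounding has to be read alongside the other metrics.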
Related Projects
This project is Part 2 of a connected AI/ML portfolio:
| Project | Focus | Link |
|---|---|---|
| Banking77 Intent Classifier | NLP fine-tuning with DistilBERT + LoRA | GitHub |
| RAG Research Assistant | Retrieval-Augmented Generation + Evaluation | This repo |
Together these demonstrate fine-tuning (shaping model behavior) versus RAG (shaping model knowledge), two complementary approaches to applied NLP.
References
- Attention Is All You Need, Vaswani et al. (2017)
- REALM: Retrieval-Augmented Language Model Pre-Training, Guu et al. (2020)
- LangChain Documentation
- ChromaDB Documentation
- Sentence Transformers: all-MiniLM-L6-v2
Author
Syed Muhammad Aneeb
CS Graduate · Full-Stack Developer · AI Engineer
Built with zero GPU budget on Google Colab free tier. Constraints breed creativity.