
Multi-Document Research Assistant: RAG with Retrieval Quality Evaluation

Upload multiple PDFs. Ask questions. Get cited answers grounded in your documents, with automatic retrieval quality scoring.



🎯 Problem

Most LLMs hallucinate when asked about private or domain-specific documents. They generate confident answers from training data, not from your actual content.

Standard RAG systems fix this but introduce a new problem: you can't tell when retrieval fails. Wrong chunks get retrieved. The LLM answers confidently from bad context. No warning.

This project solves both problems:

  1. Grounds all answers in your uploaded documents
  2. Scores retrieval quality on every query so you know when to trust the answer

🏗️ Architecture

PDF Documents
      │
      ▼
┌─────────────────┐
│  PyPDFLoader    │  Extract text page by page
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ RecursiveText   │  Split into 512-token chunks
│   Splitter      │  with 50-token overlap
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ all-MiniLM-L6   │  Embed each chunk → 384-dim vector
│   Embeddings    │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│    ChromaDB     │  Store vectors + text + metadata
│  (Vector Store) │  (source filename + page number)
└────────┬────────┘
         │
    User Query
         │
         ▼
┌─────────────────┐
│ Similarity      │  Cosine search → Top-k chunks
│   Search        │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Gemini Flash   │  Generate answer from chunks only
│     (LLM)       │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Custom Evaluator│  Score grounding, relevance,
│                 │  completeness
└─────────────────┘
         │
         ▼
   Answer + Sources + Quality Scores
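
The similarity-search step in the diagram can be illustrated with a minimal, dependency-free sketch. This is not the project's code (the real pipeline delegates the search to ChromaDB); the function names `cosine_similarity` and `top_k`, the chunk dictionary layout, and the toy 3-dimensional vectors are all assumptions made for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as having no similarity
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunks, k=3):
    """Return the k (score, chunk) pairs most similar to the query."""
    scored = [(cosine_similarity(query_vec, c["vector"]), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Toy 3-dim vectors stand in for the 384-dim embeddings (hypothetical data).
chunks = [
    {"text": "Attention maps a query to an output.",
     "source": "paper.pdf", "page": 3, "vector": [0.9, 0.1, 0.0]},
    {"text": "We trained on 8 GPUs.",
     "source": "paper.pdf", "page": 7, "vector": [0.0, 0.2, 0.9]},
    {"text": "Self-attention relates positions.",
     "source": "paper.pdf", "page": 2, "vector": [0.8, 0.3, 0.1]},
]

results = top_k([1.0, 0.0, 0.0], chunks, k=2)
for score, chunk in results:
    print(f"{score:.2f}  {chunk['source']} p.{chunk['page']}  {chunk['text']}")
```

Keeping the source filename and page number next to each vector is what lets the final answer cite where each retrieved chunk came from.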

📊 Evaluation Metrics

Instead of using RAGAS (which requires a paid OpenAI API key), this project implements a custom lightweight evaluator that is free, fast, and interpretable.

| Metric | What It Measures | How |
| --- | --- | --- |
| Grounding Score | Is the answer based on retrieved chunks? | Word overlap: (answer ∩ context) / answer |
| Retrieval Relevance | Did we retrieve the right chunks? | Cosine similarity: query vector vs chunk vectors |
| Answer Completeness | Did the LLM use the retrieved context? | Word overlap: (answer ∩ context) / context |
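
The three metrics above can be sketched in plain Python as follows. This is an illustrative reading of the table, not the project's evaluator (which uses NumPy and sklearn): treating each text as a set of unique lowercase words, the regex tokenizer, averaging the per-chunk cosine similarities, and all function names are assumptions.

```python
import math
import re

def tokenize(text):
    """Lowercase the text and return its set of unique word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def grounding_score(answer, context):
    """Share of answer words that also appear in the retrieved context."""
    a, c = tokenize(answer), tokenize(context)
    return len(a & c) / len(a) if a else 0.0

def completeness_score(answer, context):
    """Share of context words that the answer actually uses."""
    a, c = tokenize(answer), tokenize(context)
    return len(a & c) / len(c) if c else 0.0

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieval_relevance(query_vec, chunk_vecs):
    """Mean cosine similarity between the query and each retrieved chunk."""
    if not chunk_vecs:
        return 0.0
    return sum(cosine(query_vec, v) for v in chunk_vecs) / len(chunk_vecs)

# Hypothetical example: a short answer drawn from a longer context.
context = ("Attention maps a query and a set of key value pairs to an output. "
           "The output is a weighted sum of the values.")
answer = "Attention maps a query to an output."
print(grounding_score(answer, context), completeness_score(answer, context))
```

Because the denominators differ (answer words versus context words), a short answer taken from a long context scores high on grounding and low on completeness at the same time.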

Sample Results (Attention Is All You Need paper)

| Query | Grounding | Relevance | Completeness |
| --- | --- | --- | --- |
| "What is the attention mechanism?" | 0.97 | 0.61 | 0.16 |
| "Who are the authors?" | 0.95 | 0.58 | 0.12 |
| "What is the conclusion?" | 0.91 | 0.55 | 0.14 |

Key insight: High grounding (0.97) but low completeness (0.16) means the LLM extracted a precise answer from a small portion of the retrieved context, which is expected behavior for factual queries.


🚀 Quick Start

Prerequisites

Installation

git clone https://github.com/aneebnaqvi15/rag-research-assistant
cd rag-research-assistant

python -m venv venv
venv\Scripts\activate  # Windows
# source venv/bin/activate  # Mac/Linux

pip install -r requirements.txt

Configuration

Create a .env file:

GOOGLE_API_KEY=your_key_here
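
One way the app could pick up this key at startup is sketched below. It is only an assumption about how the `.env` file is consumed (projects commonly use the `python-dotenv` package instead); the helper names `load_env_file` and `get_api_key` are hypothetical and use only the standard library.

```python
import os

def load_env_file(path=".env"):
    """Copy KEY=VALUE lines from a file into os.environ.

    Values already present in the environment are left untouched.
    """
    if not os.path.exists(path):
        return
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            # Skip blanks, comments, and anything that is not KEY=VALUE.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, value = line.split("=", 1)
            value = value.strip().strip('"').strip("'")
            os.environ.setdefault(key.strip(), value)

def get_api_key(path=".env"):
    """Return GOOGLE_API_KEY, failing early if it is missing or a placeholder."""
    load_env_file(path)
    key = os.environ.get("GOOGLE_API_KEY", "")
    if not key or key == "your_key_here":
        raise RuntimeError("GOOGLE_API_KEY is not set; add it to .env")
    return key
```

Failing at startup with a clear message is usually easier to debug than an authentication error on the first query.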

Run

streamlit run app.py

Open http://localhost:8501


🛠️ Tech Stack

| Component | Tool | Why |
| --- | --- | --- |
| PDF Loading | pypdf + LangChain | Page-level metadata for citations |
| Chunking | RecursiveCharacterTextSplitter | Respects sentence boundaries |
| Embeddings | all-MiniLM-L6-v2 | Fast, free, 384-dim, runs on CPU |
| Vector Store | ChromaDB | Local, zero-config, stores metadata |
| LLM | Gemini 1.5 Flash | Free tier, strong reasoning |
| Orchestration | LangChain | Connects all RAG components |
| UI | Streamlit | Rapid prototyping, real-time logs |
| Evaluation | Custom (NumPy + sklearn) | Free, interpretable, no OpenAI dependency |

✨ Features

  • Multi-document support: Upload and query across multiple PDFs simultaneously
  • Source citations: Every answer shows the exact filename and page number
  • Real-time processing logs: Watch the pipeline run (load → chunk → embed → index)
  • Retrieval quality scores: Three metrics scored on every query
  • Bring your own API key: Toggle in sidebar to use your own Gemini key
  • Custom model selection: Enter any Gemini model string from AI Studio
  • Adjustable retrieval: Control chunk size and top-k via sidebar sliders
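
As an illustration of the source-citation feature, the sketch below turns retrieved-chunk metadata into display strings. The chunk layout, the `format_citations` name, and the assumption that `page` is 0-based (as page indices from LangChain's PyPDFLoader typically are) are illustrative, not taken from the project's code.

```python
def format_citations(chunks):
    """Build a deduplicated, order-preserving list of 'filename, p. N' strings."""
    seen = set()
    citations = []
    for chunk in chunks:
        meta = chunk["metadata"]
        key = (meta["source"], meta["page"])
        if key in seen:
            continue  # several chunks can come from the same page
        seen.add(key)
        # Assumed 0-based page index, so add 1 for human-readable output.
        citations.append(f"{meta['source']}, p. {meta['page'] + 1}")
    return citations

# Hypothetical retrieved chunks from two uploaded PDFs.
retrieved = [
    {"text": "first chunk", "metadata": {"source": "a.pdf", "page": 0}},
    {"text": "second chunk", "metadata": {"source": "a.pdf", "page": 0}},
    {"text": "third chunk", "metadata": {"source": "b.pdf", "page": 4}},
]
print(format_citations(retrieved))
```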

💡 Key Learnings

1. Dependency hell is real. ChromaDB, NumPy 2.0, and OpenTelemetry have a war inside Colab's environment. The fix: use EphemeralClient() instead of file-based persistence, and never pin NumPy manually.

2. RAGAS is not free. RAGAS v0.4+ requires the OpenAI API for its InstructorLLM. Building a custom evaluator is not a compromise; it's better engineering. You understand every metric you ship.

3. Chunk size is a hyperparameter. 512 tokens with 50-token overlap is a starting point, not a truth. Smaller chunks = sharper vectors but less context. Tune based on your document type.
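
To make the chunk-size trade-off concrete, here is a minimal sliding-window sketch. It is a simplification, not the project's splitter: LangChain's RecursiveCharacterTextSplitter tries paragraph and sentence separators first and, by default, measures length in characters rather than tokens. The `chunk_tokens` name and whitespace tokenization are assumptions.

```python
def chunk_tokens(tokens, chunk_size=512, overlap=50):
    """Split a token list into windows of chunk_size sharing `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks

# Hypothetical document of 1000 whitespace-separated tokens.
tokens = ("word " * 1000).split()
for size in (128, 256, 512):
    pieces = chunk_tokens(tokens, chunk_size=size, overlap=50)
    print(size, len(pieces))
```

Running the loop with different sizes shows how quickly the number of chunks (and therefore vectors to embed and search) grows as the window shrinks.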

4. Retrieval can fail silently. A grounding score of 1.0 does not mean a correct answer. It means every word in the answer exists somewhere in retrieved chunks. You need multiple metrics to trust a RAG system.
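
A small worked example shows why grounding alone is not enough. The sketch assumes grounding is computed as unique-word overlap (an illustrative reading of the metric, with hypothetical helper names): a factually wrong answer that recombines words from the context still scores 1.0.

```python
import re

def words(text):
    """Set of unique lowercase word tokens in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def grounding(answer, context):
    """Share of answer words that appear anywhere in the context."""
    a = words(answer)
    return len(a & words(context)) / len(a) if a else 0.0

context = ("The Transformer was proposed by Vaswani. "
           "The LSTM was proposed by Hochreiter.")
correct = "The Transformer was proposed by Vaswani."
wrong = "The LSTM was proposed by Vaswani."  # mixes up the two sentences

# Both answers use only words found in the context.
print(grounding(correct, context), grounding(wrong, context))
```

Word overlap ignores how the words are combined, which is why relevance and completeness are reported alongside it.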

5. LLM quality matters more than RAG architecture. flan-t5-base produced gibberish from perfect retrieval. Gemini 1.5 Flash produced accurate answers from the same chunks. The retrieval pipeline is only as useful as the model that reads it.


🔗 Related Projects

This project is Part 2 of a connected AI/ML portfolio:

| Project | Focus | Link |
| --- | --- | --- |
| Banking77 Intent Classifier | NLP fine-tuning with DistilBERT + LoRA | GitHub |
| RAG Research Assistant | Retrieval-Augmented Generation + Evaluation | This repo |

Together these demonstrate fine-tuning (shaping model behavior) versus RAG (shaping model knowledge), two complementary approaches to applied NLP.



👤 Author

Syed Muhammad Aneeb
CS Graduate · Full-Stack Developer · AI Engineer



Built with zero GPU budget on Google Colab free tier. Constraints breed creativity.