---
license: mit
language:
  - en
tags:
  - RAG
  - retrieval-augmented-generation
  - document-qa
  - pdf-processing
  - hybrid-retrieval
  - cross-encoder
  - langchain
  - chromadb
  - bm25
  - semantic-chunking
  - multi-document
  - question-answering
library_name: langchain
pipeline_tag: question-answering
datasets: []
metrics:
  - accuracy
base_model:
  - BAAI/bge-large-en-v1.5
  - BAAI/bge-reranker-v2-m3
  - sentence-transformers/all-MiniLM-L6-v2
---

# Multi-Document RAG System

A production-ready **Retrieval-Augmented Generation (RAG)** system for intelligent question-answering over multiple PDF documents. Features hybrid retrieval (vector + keyword search), cross-encoder re-ranking, semantic chunking, and a Gradio web interface.

![Architecture](https://img.shields.io/badge/Architecture-Hybrid%20RAG-blue)
![Python](https://img.shields.io/badge/Python-3.10%2B-green)
![LLM](https://img.shields.io/badge/LLM-Llama%203.3%2070B-orange)

## Model Description

This system implements an advanced RAG pipeline that combines several complementary techniques to improve retrieval precision and answer quality:

### Core Models Used

| Component | Model | Purpose |
|-----------|-------|---------|
| **Embeddings** | `BAAI/bge-large-en-v1.5` | 1024-dim normalized embeddings for semantic search |
| **Re-ranker** | `BAAI/bge-reranker-v2-m3` | Cross-encoder neural re-ranking for precision |
| **Chunker** | `sentence-transformers/all-MiniLM-L6-v2` | Semantic similarity for intelligent chunking |
| **LLM** | Llama 3.3 70B (via Groq API) | Generation with inline citations |

### Architecture

```
User Query
    │
    ├── Query Classification (factoid/summary/comparison/extraction/reasoning)
    ├── Multi-Query Expansion (3 alternative phrasings)
    └── HyDE Generation (hypothetical answer document)
                       │
                       ▼
    ┌──────────────────────────────────────┐
    │         Hybrid Retrieval             │
    │  ┌─────────────┐  ┌─────────────┐    │
    │  │ ChromaDB    │  │ BM25        │    │
    │  │ (Vector)    │  │ (Keyword)   │    │
    │  └─────────────┘  └─────────────┘    │
    │           │              │           │
    │           └──────┬───────┘           │
    │                  ▼                   │
    │         RRF Fusion + Deduplication   │
    └──────────────────────────────────────┘
                       │
                       ▼
          Cross-Encoder Re-ranking
          (BAAI/bge-reranker-v2-m3)
                       │
                       ▼
          LLM Generation (Llama 3.3 70B)
          with inline source citations
                       │
                       ▼
          Answer Verification (for complex queries)
```

## Key Features

### Hybrid Retrieval
- **Vector Search (MMR)**: Semantic similarity with diversity via ChromaDB
- **Keyword Search (BM25)**: Exact term matching for rare words
- **Reciprocal Rank Fusion**: Merges the vector and keyword rankings into a single deduplicated list
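
The fusion step can be sketched in a few lines of pure Python. The constant `k=60` is the value from the original RRF formulation and is assumed here; the constant this system actually uses is not documented.

```python
from collections import defaultdict

def rrf_fuse(ranked_lists, k=60):
    """Merge several ranked lists of document IDs with Reciprocal Rank Fusion.

    Each document scores 1 / (k + rank) in every list it appears in, so
    documents ranked highly by both retrievers rise to the top."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sorting the score map also deduplicates documents returned
    # by both the vector and BM25 retrievers.
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc_a", "doc_b", "doc_c"]   # from ChromaDB (MMR)
bm25_hits = ["doc_b", "doc_d", "doc_a"]     # from BM25
fused = rrf_fuse([vector_hits, bm25_hits])
```

Note that `doc_b` wins here despite being ranked first by only one retriever: a consistent mid-to-high rank in both lists beats a single first place.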

### Semantic Chunking
Documents are split based on sentence embedding similarity rather than fixed character counts, preserving coherent ideas within chunks.
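
The idea can be sketched as a greedy pass over sentence embeddings. The `toy_embed` function below is a stand-in so the sketch runs without downloading a model; the real system embeds sentences with `sentence-transformers/all-MiniLM-L6-v2`, and the defaults mirror the `similarity_threshold` and `max_chunk_size` settings from the configuration table.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def semantic_chunks(sentences, embed, similarity_threshold=0.5, max_chunk_size=1000):
    """Greedy semantic chunker: a sentence joins the current chunk only if
    it stays similar to the previous sentence and the chunk stays under
    max_chunk_size characters; otherwise a new chunk begins."""
    chunks, current = [], [sentences[0]]
    prev_vec = embed(sentences[0])
    for sent in sentences[1:]:
        vec = embed(sent)
        size = sum(len(s) for s in current) + len(sent)
        if cosine(prev_vec, vec) >= similarity_threshold and size <= max_chunk_size:
            current.append(sent)
        else:
            chunks.append(" ".join(current))
            current = [sent]
        prev_vec = vec
    chunks.append(" ".join(current))
    return chunks

# Toy bag-of-words embedding, for illustration only.
def toy_embed(text):
    vocab = ["cat", "stock", "market", "purred"]
    return [text.lower().count(w) + 1e-6 for w in vocab]

chunks = semantic_chunks(
    ["The cat sat.", "The cat purred.", "Stock prices rose.", "The market fell."],
    toy_embed,
)
```

The topic shift from cats to markets drops the similarity below the threshold, so the chunker keeps the two cat sentences together and starts a new chunk at the shift.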

### Intelligent Query Classification
Automatically classifies queries into 5 types with adaptive retrieval:

| Query Type | Retrieval Depth (k) | Answer Style |
|------------|---------------------|--------------|
| Factoid | 6 | Direct |
| Summary | 10 | Bullets |
| Comparison | 12 | Bullets |
| Extraction | 8 | Direct |
| Reasoning | 10 | Steps |
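
The table above maps directly to a lookup of per-type retrieval settings. The keyword heuristic in `classify_query` below is purely illustrative; the actual system classifies queries with the LLM, which is more robust.

```python
# Retrieval depth and answer style per query type (values from the table above).
QUERY_PROFILES = {
    "factoid":    {"k": 6,  "style": "direct"},
    "summary":    {"k": 10, "style": "bullets"},
    "comparison": {"k": 12, "style": "bullets"},
    "extraction": {"k": 8,  "style": "direct"},
    "reasoning":  {"k": 10, "style": "steps"},
}

def classify_query(query):
    """Crude keyword heuristic standing in for the LLM classifier."""
    q = query.lower()
    if any(w in q for w in ("compare", "difference", "versus", " vs")):
        return "comparison"
    if any(w in q for w in ("summarize", "summary", "overview")):
        return "summary"
    if any(w in q for w in ("why", "how does", "explain")):
        return "reasoning"
    if any(w in q for w in ("extract", "list the", "which sections")):
        return "extraction"
    return "factoid"

profile = QUERY_PROFILES[classify_query("Compare the two methodologies")]
```

A comparison query then retrieves a deeper pool (`k=12`) and answers in bullets, while a factoid query stays shallow and direct.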

### Multi-Document Support
- Upload multiple PDFs to build a combined knowledge base
- Automatic PDF diversity enforcement for cross-document queries
- Clear source attribution with document name and page number

### Query Enhancement
- **HyDE**: Generates hypothetical answer documents for better retrieval
- **Multi-Query Expansion**: Creates 3 alternative phrasings for broader coverage
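
Both enhancements reduce to prompt construction: HyDE embeds the LLM's hypothetical passage instead of the raw query, and expansion retrieves with each rewrite. The prompt wording below is a plausible sketch, not the notebook's actual prompts.

```python
def hyde_prompt(query):
    """Prompt for a hypothetical answer passage; the passage (not the
    original query) is then embedded and used for retrieval."""
    return (
        "Write a short passage that could plausibly answer the question "
        "below, as if excerpted from a relevant document.\n\n"
        f"Question: {query}\n\nPassage:"
    )

def multi_query_prompt(query, n=3):
    """Prompt for n alternative phrasings; each rewrite is retrieved
    separately and the results are fused."""
    return (
        f"Rewrite the following question in {n} different ways, one per "
        f"line, preserving its meaning.\n\nQuestion: {query}"
    )

hyde = hyde_prompt("What is hybrid retrieval?")
variants = multi_query_prompt("What is hybrid retrieval?")
```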

### Answer Verification
A self-verification step on complex queries checks that answers are direct, well structured, and grounded in the retrieved sources.

## Intended Uses

### Primary Use Cases
- **Academic Research**: Analyze and compare research papers
- **Document Q&A**: Answer questions over technical documentation
- **Literature Review**: Synthesize information across multiple sources
- **Knowledge Extraction**: Extract specific facts, methodologies, or findings

### Out-of-Scope Uses
- Real-time streaming applications (latency-sensitive)
- Non-English documents (optimized for English)
- Image/table-heavy PDFs (text extraction only)

## How to Use

### Requirements
- Python 3.10+
- Groq API key (free at [console.groq.com](https://console.groq.com))
- GPU recommended but not required

### Installation

```bash
pip install numpy==1.26.4 pandas==2.2.2 scipy==1.13.1
pip install langchain-core==0.2.40 langchain-community==0.2.16 langchain==0.2.16
pip install langchain-groq==0.1.9 langchain-text-splitters==0.2.4
pip install chromadb==0.5.5 sentence-transformers==3.0.1
pip install pypdf==4.3.1 rank-bm25==0.2.2 gradio torch
```

### Quick Start

1. Open `rag.ipynb` in Jupyter Notebook or Google Colab
2. Run all cells sequentially
3. Enter your Groq API key in the Setup tab
4. Upload PDF documents
5. Ask questions in the Chat tab

### Example Queries

```python
# Single Document Analysis
"What is the main contribution of this paper?"
"Explain the methodology in detail"
"What are the limitations mentioned by the authors?"

# Multi-Document Comparison
"Compare the approaches discussed in these papers"
"What are the key differences between the methodologies?"
```

## Technical Specifications

### Performance Benchmarks

| Operation | Typical Duration |
|-----------|------------------|
| Model initialization | 30-60 seconds |
| PDF ingestion (per doc) | 10-30 seconds |
| Simple queries | 5-8 seconds |
| Complex queries | 10-15 seconds |
| Full document summary | 30-90 seconds |

### Configuration Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `max_chunk_size` | 1000 | Maximum characters per semantic chunk |
| `similarity_threshold` | 0.5 | Cosine similarity for chunk grouping |
| `chunk_size` | 800 | Fallback text splitter chunk size |
| `chunk_overlap` | 150 | Character overlap between chunks |
| `fetch_factor` | 2 | Multiplier for initial retrieval pool |
| `lambda_mult` | 0.6 | MMR diversity parameter |
| `cache_max_size` | 100 | Maximum cached query responses |

## Limitations

- Requires active internet connection for Groq API calls
- PDF quality affects text extraction accuracy
- Large documents may take longer to process
- Query cache does not persist between sessions
- Optimized for English language documents

## Training Details

This is a **retrieval system**, not a trained model. It orchestrates pre-trained models:

- **Embeddings**: Uses pre-trained `BAAI/bge-large-en-v1.5` without fine-tuning
- **Re-ranker**: Uses pre-trained `BAAI/bge-reranker-v2-m3` without fine-tuning
- **LLM**: Uses Llama 3.3 70B via Groq API with zero-shot prompting

## Evaluation

The system was evaluated qualitatively on academic papers and technical documents for:
- Answer relevance and accuracy
- Source attribution correctness
- Cross-document comparison quality
- Response structure and readability

## Environmental Impact

- **Hardware**: Developed and tested on Google Colab (NVIDIA T4 GPU)
- **Inference**: Primary compute via Groq API (cloud-hosted)
- **Local models**: ~2 GB VRAM for the embedding and re-ranker models

## Citation

```bibtex
@software{multi_doc_rag_system,
  title = {Multi-Document RAG System},
  year = {2024},
  note = {Production-ready RAG system with hybrid retrieval and cross-encoder re-ranking},
  url = {https://huggingface.co/your-username/your-repo}
}
```

## Acknowledgements

This project builds upon:
- [LangChain](https://github.com/langchain-ai/langchain) for RAG orchestration
- [ChromaDB](https://github.com/chroma-core/chroma) for vector storage
- [Sentence Transformers](https://www.sbert.net/) for embeddings
- [BAAI](https://huggingface.co/BAAI) for BGE models
- [Groq](https://groq.com/) for fast LLM inference

## Contact

For questions or feedback, please open an issue on the repository.