---
license: mit
language:
  - en
tags:
  - RAG
  - retrieval-augmented-generation
  - document-qa
  - pdf-processing
  - hybrid-retrieval
  - cross-encoder
  - langchain
  - chromadb
  - bm25
  - semantic-chunking
  - multi-document
  - question-answering
library_name: langchain
pipeline_tag: question-answering
datasets: []
metrics:
  - accuracy
base_model:
  - BAAI/bge-large-en-v1.5
  - BAAI/bge-reranker-v2-m3
  - sentence-transformers/all-MiniLM-L6-v2
---

# Multi-Document RAG System

A production-ready **Retrieval-Augmented Generation (RAG)** system for intelligent question-answering over multiple PDF documents. Features hybrid retrieval (vector + keyword search), cross-encoder re-ranking, semantic chunking, and a Gradio web interface.

![Architecture](https://img.shields.io/badge/Architecture-Hybrid%20RAG-blue)
![Python](https://img.shields.io/badge/Python-3.10%2B-green)
![LLM](https://img.shields.io/badge/LLM-Llama%203.3%2070B-orange)

## Model Description

This system implements an advanced RAG pipeline that combines several complementary techniques to improve retrieval precision and answer quality:

### Core Models Used

| Component | Model | Purpose |
|-----------|-------|---------|
| **Embeddings** | `BAAI/bge-large-en-v1.5` | 1024-dim normalized embeddings for semantic search |
| **Re-ranker** | `BAAI/bge-reranker-v2-m3` | Cross-encoder neural re-ranking for precision |
| **Chunker** | `sentence-transformers/all-MiniLM-L6-v2` | Semantic similarity for intelligent chunking |
| **LLM** | Llama 3.3 70B (via Groq API) | Generation with inline citations |

### Architecture

```
User Query
    │
    ├── Query Classification (factoid/summary/comparison/extraction/reasoning)
    ├── Multi-Query Expansion (3 alternative phrasings)
    └── HyDE Generation (hypothetical answer document)
                       │
                       ▼
    ┌──────────────────────────────────────┐
    │         Hybrid Retrieval             │
    │  ┌─────────────┐  ┌─────────────┐    │
    │  │ ChromaDB    │  │ BM25        │    │
    │  │ (Vector)    │  │ (Keyword)   │    │
    │  └─────────────┘  └─────────────┘    │
    │           │              │           │
    │           └──────┬───────┘           │
    │                  ▼                   │
    │         RRF Fusion + Deduplication   │
    └──────────────────────────────────────┘
                       │
                       ▼
          Cross-Encoder Re-ranking
          (BAAI/bge-reranker-v2-m3)
                       │
                       ▼
          LLM Generation (Llama 3.3 70B)
          with inline source citations
                       │
                       ▼
          Answer Verification (for complex queries)
```

## Key Features

### Hybrid Retrieval
- **Vector Search (MMR)**: Semantic similarity with diversity via ChromaDB
- **Keyword Search (BM25)**: Exact term matching for rare words
- **Reciprocal Rank Fusion**: Merges the vector and keyword rankings into a single deduplicated list
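
The fusion step can be sketched in a few lines of pure Python. The constant `k=60` is the value from the original RRF formulation and is assumed here; the constant this system actually uses is not documented.

```python
from collections import defaultdict

def rrf_fuse(ranked_lists, k=60):
    """Merge several ranked lists of document IDs with Reciprocal Rank Fusion.

    Each document scores 1 / (k + rank) in every list it appears in, so
    documents ranked highly by both retrievers rise to the top."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sorting the score map also deduplicates documents returned
    # by both the vector and BM25 retrievers.
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc_a", "doc_b", "doc_c"]   # from ChromaDB (MMR)
bm25_hits = ["doc_b", "doc_d", "doc_a"]     # from BM25
fused = rrf_fuse([vector_hits, bm25_hits])
```

Note that `doc_b` wins here despite being ranked first by only one retriever: a consistent mid-to-high rank in both lists beats a single first place.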

### Semantic Chunking
Documents are split based on sentence embedding similarity rather than fixed character counts, preserving coherent ideas within chunks.
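
The idea can be sketched as a greedy pass over sentence embeddings. The `toy_embed` function below is a stand-in so the sketch runs without downloading a model; the real system embeds sentences with `sentence-transformers/all-MiniLM-L6-v2`, and the defaults mirror the `similarity_threshold` and `max_chunk_size` settings from the configuration table.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def semantic_chunks(sentences, embed, similarity_threshold=0.5, max_chunk_size=1000):
    """Greedy semantic chunker: a sentence joins the current chunk only if
    it stays similar to the previous sentence and the chunk stays under
    max_chunk_size characters; otherwise a new chunk begins."""
    chunks, current = [], [sentences[0]]
    prev_vec = embed(sentences[0])
    for sent in sentences[1:]:
        vec = embed(sent)
        size = sum(len(s) for s in current) + len(sent)
        if cosine(prev_vec, vec) >= similarity_threshold and size <= max_chunk_size:
            current.append(sent)
        else:
            chunks.append(" ".join(current))
            current = [sent]
        prev_vec = vec
    chunks.append(" ".join(current))
    return chunks

# Toy bag-of-words embedding, for illustration only.
def toy_embed(text):
    vocab = ["cat", "stock", "market", "purred"]
    return [text.lower().count(w) + 1e-6 for w in vocab]

chunks = semantic_chunks(
    ["The cat sat.", "The cat purred.", "Stock prices rose.", "The market fell."],
    toy_embed,
)
```

The topic shift from cats to markets drops the similarity below the threshold, so the chunker keeps the two cat sentences together and starts a new chunk at the shift.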

### Intelligent Query Classification
Automatically classifies queries into 5 types with adaptive retrieval:

| Query Type | Retrieval Depth (k) | Answer Style |
|------------|---------------------|--------------|
| Factoid | 6 | Direct |
| Summary | 10 | Bullets |
| Comparison | 12 | Bullets |
| Extraction | 8 | Direct |
| Reasoning | 10 | Steps |
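
The table above maps directly to a lookup of per-type retrieval settings. The keyword heuristic in `classify_query` below is purely illustrative; the actual system classifies queries with the LLM, which is more robust.

```python
# Retrieval depth and answer style per query type (values from the table above).
QUERY_PROFILES = {
    "factoid":    {"k": 6,  "style": "direct"},
    "summary":    {"k": 10, "style": "bullets"},
    "comparison": {"k": 12, "style": "bullets"},
    "extraction": {"k": 8,  "style": "direct"},
    "reasoning":  {"k": 10, "style": "steps"},
}

def classify_query(query):
    """Crude keyword heuristic standing in for the LLM classifier."""
    q = query.lower()
    if any(w in q for w in ("compare", "difference", "versus", " vs")):
        return "comparison"
    if any(w in q for w in ("summarize", "summary", "overview")):
        return "summary"
    if any(w in q for w in ("why", "how does", "explain")):
        return "reasoning"
    if any(w in q for w in ("extract", "list the", "which sections")):
        return "extraction"
    return "factoid"

profile = QUERY_PROFILES[classify_query("Compare the two methodologies")]
```

A comparison query then retrieves a deeper pool (`k=12`) and answers in bullets, while a factoid query stays shallow and direct.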

### Multi-Document Support
- Upload multiple PDFs to build a combined knowledge base
- Automatic PDF diversity enforcement for cross-document queries
- Clear source attribution with document name and page number

### Query Enhancement
- **HyDE**: Generates hypothetical answer documents for better retrieval
- **Multi-Query Expansion**: Creates 3 alternative phrasings for broader coverage
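
Both enhancements reduce to prompt construction: HyDE embeds the LLM's hypothetical passage instead of the raw query, and expansion retrieves with each rewrite. The prompt wording below is a plausible sketch, not the notebook's actual prompts.

```python
def hyde_prompt(query):
    """Prompt for a hypothetical answer passage; the passage (not the
    original query) is then embedded and used for retrieval."""
    return (
        "Write a short passage that could plausibly answer the question "
        "below, as if excerpted from a relevant document.\n\n"
        f"Question: {query}\n\nPassage:"
    )

def multi_query_prompt(query, n=3):
    """Prompt for n alternative phrasings; each rewrite is retrieved
    separately and the results are fused."""
    return (
        f"Rewrite the following question in {n} different ways, one per "
        f"line, preserving its meaning.\n\nQuestion: {query}"
    )

hyde = hyde_prompt("What is hybrid retrieval?")
variants = multi_query_prompt("What is hybrid retrieval?")
```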

### Answer Verification
A self-verification step on complex queries checks that answers are direct, well structured, and grounded in the retrieved sources.

## Intended Uses

### Primary Use Cases
- **Academic Research**: Analyze and compare research papers
- **Document Q&A**: Answer questions over technical documentation
- **Literature Review**: Synthesize information across multiple sources
- **Knowledge Extraction**: Extract specific facts, methodologies, or findings

### Out-of-Scope Uses
- Real-time streaming applications (latency-sensitive)
- Non-English documents (optimized for English)
- Image/table-heavy PDFs (text extraction only)

## How to Use

### Requirements
- Python 3.10+
- Groq API key (free at [console.groq.com](https://console.groq.com))
- GPU recommended but not required

### Installation

```bash
pip install numpy==1.26.4 pandas==2.2.2 scipy==1.13.1
pip install langchain-core==0.2.40 langchain-community==0.2.16 langchain==0.2.16
pip install langchain-groq==0.1.9 langchain-text-splitters==0.2.4
pip install chromadb==0.5.5 sentence-transformers==3.0.1
pip install pypdf==4.3.1 rank-bm25==0.2.2 gradio torch
```

### Quick Start

1. Open `rag.ipynb` in Jupyter Notebook or Google Colab
2. Run all cells sequentially
3. Enter your Groq API key in the Setup tab
4. Upload PDF documents
5. Ask questions in the Chat tab

### Example Queries

```python
# Single Document Analysis
"What is the main contribution of this paper?"
"Explain the methodology in detail"
"What are the limitations mentioned by the authors?"

# Multi-Document Comparison
"Compare the approaches discussed in these papers"
"What are the key differences between the methodologies?"
```

## Technical Specifications

### Performance Benchmarks

| Operation | Typical Duration |
|-----------|------------------|
| Model initialization | 30-60 seconds |
| PDF ingestion (per doc) | 10-30 seconds |
| Simple queries | 5-8 seconds |
| Complex queries | 10-15 seconds |
| Full document summary | 30-90 seconds |

### Configuration Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `max_chunk_size` | 1000 | Maximum characters per semantic chunk |
| `similarity_threshold` | 0.5 | Cosine similarity for chunk grouping |
| `chunk_size` | 800 | Fallback text splitter chunk size |
| `chunk_overlap` | 150 | Character overlap between chunks |
| `fetch_factor` | 2 | Multiplier for initial retrieval pool |
| `lambda_mult` | 0.6 | MMR diversity parameter |
| `cache_max_size` | 100 | Maximum cached query responses |

## Limitations

- Requires active internet connection for Groq API calls
- PDF quality affects text extraction accuracy
- Large documents may take longer to process
- Query cache does not persist between sessions
- Optimized for English language documents

## Training Details

This is a **retrieval system**, not a trained model. It orchestrates pre-trained models:

- **Embeddings**: Uses pre-trained `BAAI/bge-large-en-v1.5` without fine-tuning
- **Re-ranker**: Uses pre-trained `BAAI/bge-reranker-v2-m3` without fine-tuning
- **LLM**: Uses Llama 3.3 70B via Groq API with zero-shot prompting

## Evaluation

The system was evaluated qualitatively on academic papers and technical documents for:
- Answer relevance and accuracy
- Source attribution correctness
- Cross-document comparison quality
- Response structure and readability

## Environmental Impact

- **Hardware**: Developed and tested on Google Colab (NVIDIA T4 GPU)
- **Inference**: Primary compute via Groq API (cloud-hosted)
- **Local models**: ~2 GB VRAM for the embedding and re-ranker models

## Citation

```bibtex
@software{multi_doc_rag_system,
  title = {Multi-Document RAG System},
  year = {2024},
  note = {Production-ready RAG system with hybrid retrieval and cross-encoder re-ranking},
  url = {https://huggingface.co/your-username/your-repo}
}
```

## Acknowledgements

This project builds upon:
- [LangChain](https://github.com/langchain-ai/langchain) for RAG orchestration
- [ChromaDB](https://github.com/chroma-core/chroma) for vector storage
- [Sentence Transformers](https://www.sbert.net/) for embeddings
- [BAAI](https://huggingface.co/BAAI) for BGE models
- [Groq](https://groq.com/) for fast LLM inference

## Contact

For questions or feedback, please open an issue on the repository.