Spaces:

Mohamed2210
/

PDF-Rag-System

Configuration error

File size: 2,216 Bytes

ff46f8d

# 📚 PDF Q&A with Hybrid Search + LLM

## 🚀 Overview
This project is a **Question Answering (QA) system** that allows users to:
1. Upload a **PDF document**.  
2. Automatically process and chunk the text.  
3. Store embeddings in **Qdrant Vector Database** and build a **hybrid retriever** (BM25 + Qdrant).  
4. Ask **natural language questions**, and the model will retrieve the relevant context from the PDF and generate an answer using a **Large Language Model (LLM)**.  

It combines **semantic search (dense)** + **keyword search (BM25)** for better retrieval accuracy.

---

## 🛠️ Tech Stack
- **LangChain** → Orchestration of retrievers and chains.  
- **HuggingFace + Together API** → LLM endpoint (`Qwen3-235B-A22B-Instruct-2507`).  
- **Qdrant** → Vector database for storing embeddings.  
- **BM25** → Keyword-based retriever.  
- **Docling** → Loader to extract text from PDF into Markdown.  
- **Transformers** → Tokenizer for chunking text.  
- **Gradio** → Web interface.  
- **dotenv** → Secure API key management.  

---

## ⚙️ Workflow
1. **Upload PDF**  
   - The file is loaded with `DoclingLoader`.  
   - Text is split into **chunks** using HuggingFace tokenizer.  

2. **Build Hybrid Search**  
   - Embeddings are created using `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2`.  
   - Chunks are stored in **Qdrant**.  
   - **Dense retriever** (embeddings) + **BM25 retriever** (keywords) are combined with weights `0.6` (dense) and `0.4` (BM25).  

3. **Ask Questions**  
   - User writes a question.  
   - Relevant chunks are retrieved.  
   - A **prompt** is built with context + question.  
   - The **LLM** generates the answer (max 3 sentences).  

---

## 📋 Features
- Upload any **PDF document**.  
- Hybrid search ensures **more accurate retrieval** than only embeddings or BM25.  
- Context-aware **Q&A** answers.  
- **Caching retriever** so you only upload once (no need to re-process for every question).  
- Simple **Gradio UI** with upload + question box.  

---

## 🔑 Requirements
- Python 3.10+  
- Install dependencies:  
  ```bash

  pip install -r requirements.txt