# PDF Q&A with Hybrid Search + LLM

## Overview
This project is a **question answering (QA) system** that lets users:
1. Upload a **PDF document**.
2. Automatically process and chunk the text.
3. Store embeddings in a **Qdrant vector database** and build a **hybrid retriever** (BM25 + Qdrant).
4. Ask **natural-language questions**; the system retrieves the relevant context from the PDF and generates an answer using a **Large Language Model (LLM)**.

It combines **semantic search (dense)** with **keyword search (BM25)** for better retrieval accuracy than either method alone.

---
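The dense + BM25 combination described in the Overview can be illustrated with a small, library-free sketch. It assumes the `0.6`/`0.4` weights listed under Workflow feed a weighted reciprocal rank fusion, which is how LangChain's `EnsembleRetriever` merges ranked lists; this README does not state which fusion method the project actually uses. The function name `weighted_rrf`, the constant `c=60`, and the chunk ids are illustrative assumptions.

```python
from collections import defaultdict


def weighted_rrf(ranked_lists, weights, c=60):
    """Fuse best-first ranked lists of chunk ids with weighted reciprocal rank fusion.

    ranked_lists: one list of chunk ids per retriever, best hit first.
    weights: one weight per ranked list.
    c: smoothing constant (60 is a commonly used value; an assumption here).
    """
    if len(ranked_lists) != len(weights):
        raise ValueError("need exactly one weight per ranked list")
    scores = defaultdict(float)
    for ranking, weight in zip(ranked_lists, weights):
        for rank, chunk_id in enumerate(ranking, start=1):
            # A chunk earns more score the higher it ranks in each list
            scores[chunk_id] += weight / (rank + c)
    # Highest fused score first; ties are broken by id so the order is deterministic
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical chunk ids returned by each retriever, best first
dense_hits = ["c3", "c1", "c7"]
bm25_hits = ["c1", "c9", "c3"]
fused = weighted_rrf([dense_hits, bm25_hits], weights=[0.6, 0.4])
```

In this toy input, `c1` ends up first because both retrievers rank it near the top, while `c7` and `c9` each appear in only one list and fall to the bottom.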
## Tech Stack
- **LangChain**: orchestration of retrievers and chains.
- **Hugging Face + Together API**: LLM endpoint (`Qwen3-235B-A22B-Instruct-2507`).
- **Qdrant**: vector database for storing embeddings.
- **BM25**: keyword-based retriever.
- **Docling**: loader that extracts text from the PDF into Markdown.
- **Transformers**: tokenizer used for chunking text.
- **Gradio**: web interface.
- **dotenv**: API key management via environment variables, so keys stay out of the source code.

---
## Workflow
1. **Upload PDF**
   - The file is loaded with `DoclingLoader`.
   - Text is split into **chunks** using a Hugging Face tokenizer.

2. **Build Hybrid Search**
   - Embeddings are created using `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2`.
   - Chunks are stored in **Qdrant**.
   - The **dense retriever** (embeddings) and the **BM25 retriever** (keywords) are combined with weights `0.6` (dense) and `0.4` (BM25).

3. **Ask Questions**
   - The user writes a question.
   - Relevant chunks are retrieved.
   - A **prompt** is built from the retrieved context and the question.
   - The **LLM** generates the answer (at most 3 sentences).

---
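The chunking and prompt-building steps of the workflow can be sketched roughly as follows. This is a minimal illustration, not the project's code: whitespace splitting stands in for the Hugging Face tokenizer, and the names `chunk_tokens` and `build_prompt`, the window sizes, and the prompt wording are all assumptions. Only the three-sentence limit comes from the workflow description.

```python
def chunk_tokens(tokens, max_tokens=256, overlap=32):
    """Split a token sequence into overlapping windows of at most max_tokens."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be in [0, max_tokens)")
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks


def build_prompt(question, context_chunks, max_sentences=3):
    """Assemble a context + question prompt with an answer-length limit."""
    context = "\n\n".join(context_chunks)
    return (
        "Answer the question using only the context below. "
        f"Use at most {max_sentences} sentences.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Whitespace tokens stand in for real tokenizer output (assumption)
words = "Qdrant stores the dense vectors while BM25 indexes the raw terms".split()
chunks = [" ".join(c) for c in chunk_tokens(words, max_tokens=6, overlap=2)]
prompt = build_prompt("Where are the dense vectors stored?", chunks)
```

The overlap keeps a sentence that straddles a chunk boundary retrievable from at least one chunk. In the real pipeline the retrieved chunks, not all chunks, would be passed to `build_prompt`.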
## Features
- Upload any **PDF document**.
- Hybrid search for **more accurate retrieval** than embeddings or BM25 alone.
- Context-aware **Q&A** answers.
- **Retriever caching**: upload a PDF once, and it is not re-processed for every question.
- Simple **Gradio UI** with an upload field and a question box.

---
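The retriever caching listed under Features could work along these lines. The README does not say how the cache is keyed, so this sketch assumes one plausible option: keying on a SHA-256 hash of the file contents. `get_retriever` and `fake_build_retriever` are hypothetical names, and the fake builder only stands in for the expensive load, chunk, and index step.

```python
import hashlib

_retriever_cache = {}  # content hash -> retriever object


def get_retriever(pdf_bytes, build_retriever):
    """Return the cached retriever for this PDF, building it on first use."""
    key = hashlib.sha256(pdf_bytes).hexdigest()
    if key not in _retriever_cache:
        _retriever_cache[key] = build_retriever(pdf_bytes)
    return _retriever_cache[key]


# Stand-in builder that records how often the expensive step runs
build_calls = []


def fake_build_retriever(pdf_bytes):
    build_calls.append(len(pdf_bytes))
    return {"n_bytes": len(pdf_bytes)}


first = get_retriever(b"%PDF-1.7 example", fake_build_retriever)
second = get_retriever(b"%PDF-1.7 example", fake_build_retriever)
```

Here the second call returns the same object without invoking the builder again. Hashing the contents rather than the filename means a re-upload of the same file under a different name still hits the cache.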
## Requirements
- Python 3.10+
- Install dependencies:
```bash
pip install -r requirements.txt
```