Delete README.md
README.md
DELETED
@@ -1,57 +0,0 @@
# PDF Q&A with Hybrid Search + LLM

## Overview
This project is a **Question Answering (QA) system** that allows users to:
1. Upload a **PDF document**.
2. Automatically process and chunk the text.
3. Store embeddings in the **Qdrant Vector Database** and build a **hybrid retriever** (BM25 + Qdrant).
4. Ask **natural language questions**; the system retrieves the relevant context from the PDF and generates an answer using a **Large Language Model (LLM)**.

It combines **semantic search (dense)** + **keyword search (BM25)** for better retrieval accuracy.
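
A minimal sketch of how a dense ranking and a BM25 ranking could be merged, assuming weighted reciprocal rank fusion as the combination rule. The README does not say which fusion method the project uses, so the function name `weighted_rrf`, the constant `k=60`, and the chunk IDs are illustrative assumptions; the `0.6`/`0.4` weights are the ones listed in the Workflow section.

```python
from collections import defaultdict


def weighted_rrf(rankings, weights, k=60):
    """Fuse ranked lists of chunk IDs with weighted reciprocal rank fusion.

    rankings: list of lists, each ordered best-first.
    weights:  one weight per ranking.
    k:        smoothing constant (assumed value, not taken from the README).
    """
    if len(rankings) != len(weights):
        raise ValueError("need exactly one weight per ranking")
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes more for items it ranks higher.
            scores[doc_id] += weight / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical retrieval results, best match first.
dense_hits = ["c3", "c1", "c7"]
bm25_hits = ["c1", "c9", "c3"]

fused = weighted_rrf([dense_hits, bm25_hits], weights=[0.6, 0.4])
print(fused)
```

With these inputs the fused order is `c1`, `c3`, `c7`, `c9`: `c1` ranks well in both lists, so it edges out `c3`, which is first only in the dense list.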

---

## Tech Stack
- **LangChain**: Orchestration of retrievers and chains.
- **HuggingFace + Together API**: LLM endpoint (`Qwen3-235B-A22B-Instruct-2507`).
- **Qdrant**: Vector database for storing embeddings.
- **BM25**: Keyword-based retriever.
- **Docling**: Loader to extract text from PDF into Markdown.
- **Transformers**: Tokenizer for chunking text.
- **Gradio**: Web interface.
- **dotenv**: Secure API key management.
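
As a rough illustration of the API key handling mentioned in the last bullet, the sketch below reads a key from the environment and fails early if it is missing. It uses only the standard library; in the real project, python-dotenv's `load_dotenv()` would typically populate `os.environ` from a `.env` file first. The helper `require_env` and the variable name `TOGETHER_API_KEY` are hypothetical and do not come from the README.

```python
import os


def require_env(name, environ=None):
    """Return a required environment variable, or raise if it is unset or empty."""
    env = os.environ if environ is None else environ
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value


# Demo with an in-memory mapping so the example runs without a real key.
# "TOGETHER_API_KEY" is an assumed name, not one confirmed by the project.
demo_env = {"TOGETHER_API_KEY": "example-key"}
api_key = require_env("TOGETHER_API_KEY", demo_env)
print(len(api_key))
```

Failing at startup keeps a missing key from surfacing later as an opaque authentication error from the LLM endpoint.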

---

## Workflow
1. **Upload PDF**
   - The file is loaded with `DoclingLoader`.
   - Text is split into **chunks** using a HuggingFace tokenizer.

2. **Build Hybrid Search**
   - Embeddings are created using `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2`.
   - Chunks are stored in **Qdrant**.
   - A **dense retriever** (embeddings) and a **BM25 retriever** (keywords) are combined with weights `0.6` (dense) and `0.4` (BM25).

3. **Ask Questions**
   - The user writes a question.
   - Relevant chunks are retrieved.
   - A **prompt** is built from the context and the question.
   - The **LLM** generates the answer (max 3 sentences).
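
Steps 1 and 3 above can be sketched without external libraries. This is only an approximation: a whitespace split stands in for the HuggingFace tokenizer, and the chunk size, overlap, prompt wording, and the names `chunk_tokens` and `build_prompt` are assumptions rather than values taken from the project.

```python
def chunk_tokens(tokens, max_tokens=200, overlap=20):
    """Split a token list into windows of at most max_tokens, sharing `overlap` tokens."""
    if max_tokens <= 0 or not 0 <= overlap < max_tokens:
        raise ValueError("need max_tokens > 0 and 0 <= overlap < max_tokens")
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


def build_prompt(context_chunks, question):
    """Assemble a prompt from retrieved chunks and the user's question."""
    context = "\n\n".join(context_chunks)
    return (
        "Answer the question using only the context below. "
        "Use at most three sentences.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Whitespace tokens stand in for real tokenizer output.
words = "one two three four five six seven".split()
chunks = chunk_tokens(words, max_tokens=4, overlap=1)
prompt = build_prompt([" ".join(c) for c in chunks], "What comes after three?")
print(prompt)
```

The overlap keeps a sentence that straddles a chunk boundary retrievable from at least one chunk, at the cost of storing some tokens twice.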

---

## Features
- Upload any **PDF document**.
- Hybrid search aims for **more accurate retrieval** than embeddings or BM25 alone.
- Context-aware **Q&A** answers.
- **Cached retriever**: upload once, with no need to re-process the PDF for every question.
- Simple **Gradio UI** with upload + question box.
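
One possible way to get the caching behaviour listed above is to key the built retriever on a hash of the uploaded file, so the expensive build runs once per distinct PDF. The README does not describe how the project caches, so `get_retriever`, the module-level dictionary, and the stand-in `fake_build` function are all hypothetical.

```python
import hashlib

# Maps a SHA-256 digest of the PDF bytes to the retriever built for that file.
_retriever_cache = {}


def get_retriever(pdf_bytes, build_fn):
    """Return a cached retriever for this PDF, building it only on first use."""
    key = hashlib.sha256(pdf_bytes).hexdigest()
    if key not in _retriever_cache:
        _retriever_cache[key] = build_fn(pdf_bytes)
    return _retriever_cache[key]


build_calls = []


def fake_build(data):
    """Stand-in for the real load, chunk, embed, and index pipeline."""
    build_calls.append(len(data))
    return {"n_bytes": len(data)}


first = get_retriever(b"%PDF-1.7 demo", fake_build)
second = get_retriever(b"%PDF-1.7 demo", fake_build)
print(first is second, len(build_calls))
```

Hashing the content rather than the file name means re-uploading the same document under a different name still hits the cache.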

---

## Requirements
- Python 3.10+
- Install dependencies:
  ```bash
  pip install -r requirements.txt
  ```