# PDF Q&A with Hybrid Search + LLM

## Overview

This project is a **Question Answering (QA) system** that allows users to:

1. Upload a **PDF document**.
2. Automatically process and chunk the text.
3. Store embeddings in **Qdrant Vector Database** and build a **hybrid retriever** (BM25 + Qdrant).
4. Ask **natural language questions**, and the model will retrieve the relevant context from the PDF and generate an answer using a **Large Language Model (LLM)**.

It combines **semantic search (dense)** + **keyword search (BM25)** for better retrieval accuracy.
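
The README does not say how the two result lists are merged, so the following is only a minimal sketch of one common approach, weighted reciprocal rank fusion. The function name `weighted_rrf`, the constant `c=60`, and the chunk ids are illustrative assumptions; the weights `0.6` and `0.4` are the ones listed in the Workflow section.

```python
from collections import defaultdict


def weighted_rrf(rankings, weights, c=60):
    """Fuse ranked lists of document ids with weighted reciprocal rank fusion.

    Assumption: this is one plausible fusion rule, not necessarily the one
    used by the project.
    """
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            # A document ranked higher (smaller rank) contributes a larger score.
            scores[doc_id] += weight / (rank + c)
    # Highest fused score first; ties are broken by id so the order is stable.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical rankings returned by each retriever for the same question.
dense_hits = ["chunk_3", "chunk_1", "chunk_7"]
bm25_hits = ["chunk_1", "chunk_9", "chunk_3"]

fused = weighted_rrf([dense_hits, bm25_hits], weights=[0.6, 0.4])
print(fused)
```

In this example `chunk_1` ends up first because both retrievers rank it highly, even though neither the dense list alone nor the weights alone would put it on top.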

---

## Tech Stack

- **LangChain**: Orchestration of retrievers and chains.
- **HuggingFace + Together API**: LLM endpoint (`Qwen3-235B-A22B-Instruct-2507`).
- **Qdrant**: Vector database for storing embeddings.
- **BM25**: Keyword-based retriever.
- **Docling**: Loader to extract text from PDF into Markdown.
- **Transformers**: Tokenizer for chunking text.
- **Gradio**: Web interface.
- **dotenv**: Secure API key management.

---

## Workflow

1. **Upload PDF**
   - The file is loaded with `DoclingLoader`.
   - The text is split into **chunks** using a Hugging Face tokenizer.
2. **Build Hybrid Search**
   - Embeddings are created using `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2`.
   - Chunks are stored in **Qdrant**.
   - The **dense retriever** (embeddings) and the **BM25 retriever** (keywords) are combined with weights `0.6` (dense) and `0.4` (BM25).
3. **Ask Questions**
   - The user writes a question.
   - Relevant chunks are retrieved.
   - A **prompt** is built from the retrieved context and the question.
   - The **LLM** generates the answer, limited to at most three sentences.
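
The chunking and prompt-building steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's code: whitespace splitting stands in for the Hugging Face tokenizer, and the names `chunk_by_tokens` and `build_prompt`, the window sizes, and the prompt wording are all hypothetical.

```python
def chunk_by_tokens(text, max_tokens=200, overlap=20):
    """Split text into overlapping windows of at most max_tokens tokens.

    Assumption: text.split() is a stand-in for a real tokenizer.
    """
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be non-negative and smaller than max_tokens")
    tokens = text.split()
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + max_tokens]))
        # Stop once the current window already reaches the end of the text.
        if start + max_tokens >= len(tokens):
            break
    return chunks


def build_prompt(context_chunks, question):
    """Assemble a prompt from retrieved chunks and the user's question."""
    context = "\n\n".join(context_chunks)
    return (
        "Answer the question using only the context below. "
        "Use at most three sentences.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Tiny demonstration with ten placeholder tokens.
sample = " ".join(f"w{i}" for i in range(10))
chunks = chunk_by_tokens(sample, max_tokens=4, overlap=1)
prompt = build_prompt(chunks[:2], "What is w3?")
print(chunks)
print(prompt)
```

The overlap keeps a few tokens shared between neighbouring chunks, so a sentence that straddles a boundary is still visible in full in at least one window when the overlap is large enough.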

---

## Features

- Upload any **PDF document**.
- Hybrid search aims for **more accurate retrieval** than embeddings or BM25 alone.
- Context-aware **Q&A** answers.
- **Cached retriever**: a PDF is processed once on upload and reused for every later question.
- Simple **Gradio UI** with an upload field and a question box.
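
How the retriever is cached is not described here. One possible approach, shown purely as a sketch, is to key the cache on a hash of the uploaded file so that the expensive build step runs once per distinct PDF. The names `get_retriever`, `_retriever_cache`, and `fake_build` are hypothetical, and `fake_build` is a placeholder for the real load, chunk, and index pipeline.

```python
import hashlib

# Maps a content hash of the uploaded PDF to its already-built retriever.
_retriever_cache = {}


def get_retriever(pdf_bytes, build_fn):
    """Return a cached retriever for this PDF, building it only on first use."""
    key = hashlib.sha256(pdf_bytes).hexdigest()
    if key not in _retriever_cache:
        _retriever_cache[key] = build_fn(pdf_bytes)
    return _retriever_cache[key]


build_calls = []


def fake_build(pdf_bytes):
    """Placeholder for the real pipeline; records each call for the demo."""
    build_calls.append(len(pdf_bytes))
    return {"n_bytes": len(pdf_bytes)}


first = get_retriever(b"%PDF-1.7 demo", fake_build)
second = get_retriever(b"%PDF-1.7 demo", fake_build)
print(first is second, len(build_calls))
```

Hashing the content rather than the file name means that re-uploading the same document under a different name still hits the cache, while an edited file with the same name does not.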

---

## Requirements

- Python 3.10+
- Install dependencies:
| ```bash | |
| pip install -r requirements.txt | |