Mohamed2210 committed · Commit ff46f8d · verified · 1 Parent(s): 8dfaea0

Upload README.md

Files changed (1)
  1. README.md +57 -0
README.md ADDED
@@ -0,0 +1,57 @@
+ # 📚 PDF Q&A with Hybrid Search + LLM
+
+ ## 🚀 Overview
+ This project is a **Question Answering (QA) system** that allows users to:
+ 1. Upload a **PDF document**.
+ 2. Automatically process and chunk the text.
+ 3. Store embeddings in the **Qdrant vector database** and build a **hybrid retriever** (BM25 + Qdrant).
+ 4. Ask **natural language questions**; the system retrieves the relevant context from the PDF and generates an answer using a **Large Language Model (LLM)**.
+
+ It combines **semantic search (dense embeddings)** with **keyword search (BM25)** to improve retrieval accuracy.
+
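To make the keyword half of the hybrid search concrete, here is a minimal pure-Python sketch of Okapi BM25 scoring. It is an illustration only, not the project's retriever: the whitespace tokenizer, the function name `bm25_scores`, and the parameter values `k1=1.5` and `b=0.75` are assumptions chosen for this example.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document in `docs` against `query` with Okapi BM25.

    Tokenization is plain lowercase whitespace splitting (an assumption
    made to keep the sketch dependency-free). `docs` must be non-empty.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in tokenized for term in set(d))

    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Smoothed IDF that stays non-negative for very common terms.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term frequency saturates via k1 and is normalized by length via b.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

Documents that share no term with the query score exactly zero, which is why a dense retriever is a useful complement for paraphrased questions.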
+ ---
+
+ ## 🛠️ Tech Stack
+ - **LangChain** → Orchestration of retrievers and chains.
+ - **HuggingFace + Together API** → LLM endpoint (`Qwen3-235B-A22B-Instruct-2507`).
+ - **Qdrant** → Vector database for storing embeddings.
+ - **BM25** → Keyword-based retriever.
+ - **Docling** → Loader that extracts text from the PDF as Markdown.
+ - **Transformers** → Tokenizer used for chunking text.
+ - **Gradio** → Web interface.
+ - **dotenv** → Loads API keys from a local `.env` file instead of hard-coding them.
+
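A minimal sketch of how an API key might be read with dotenv. The variable name `TOGETHER_API_KEY` and the helper `get_api_key` are hypothetical; the README does not state which names the project uses. The import is guarded so the sketch still runs when `python-dotenv` is not installed.

```python
import os

try:
    from dotenv import load_dotenv  # provided by the python-dotenv package
except ImportError:
    load_dotenv = None


def get_api_key(name="TOGETHER_API_KEY"):
    """Return the API key stored in the environment variable `name`.

    `TOGETHER_API_KEY` is a hypothetical name chosen for this sketch.
    """
    if load_dotenv is not None:
        # Reads KEY=value pairs from ./.env if the file exists; variables
        # that are already set in the environment are left unchanged.
        load_dotenv(".env")
    key = os.getenv(name)
    if not key:
        raise RuntimeError(f"{name} is not set; add it to .env or export it")
    return key
```

Keeping the key in `.env` (and out of version control) means the code never contains the secret itself.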
+ ---
+
+ ## ⚙️ Workflow
+ 1. **Upload PDF**
+    - The file is loaded with `DoclingLoader`.
+    - The text is split into **chunks** using a HuggingFace tokenizer.
+
+ 2. **Build Hybrid Search**
+    - Embeddings are created using `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2`.
+    - Chunks are stored in **Qdrant**.
+    - A **dense retriever** (embeddings) and a **BM25 retriever** (keywords) are combined with weights `0.6` (dense) and `0.4` (BM25).
+
+ 3. **Ask Questions**
+    - The user enters a question.
+    - Relevant chunks are retrieved.
+    - A **prompt** is built from the retrieved context and the question.
+    - The **LLM** generates the answer (at most 3 sentences).
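The chunking in step 1 can be sketched as a sliding window over tokens. This is an illustration under stated assumptions: the README gives no chunk size or overlap, so `max_tokens=256` and `overlap=32` are placeholders, and `str.split` stands in for the HuggingFace tokenizer so the example runs without downloading a model.

```python
def chunk_by_tokens(text, tokenize, max_tokens=256, overlap=32):
    """Split `text` into overlapping chunks of at most `max_tokens` tokens.

    `tokenize` is any callable mapping a string to a list of string tokens.
    Chunks are rebuilt by joining tokens with spaces, which is only exact
    for a whitespace tokenizer; a subword tokenizer would need its own
    decode step instead.
    """
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be >= 0 and smaller than max_tokens")
    tokens = tokenize(text)
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + max_tokens]))
        # Stop once the current window has reached the end of the text.
        if start + max_tokens >= len(tokens):
            break
    return chunks
```

The overlap keeps a sentence that straddles a chunk boundary retrievable from at least one chunk.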
+
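One common way to apply weights such as `0.6` (dense) and `0.4` (BM25) is weighted Reciprocal Rank Fusion, which merges ranked lists by position rather than by raw score. The README does not say which fusion method the project uses, so treat the sketch below as an assumption; the function name `weighted_rrf` and the constant `c=60` are illustrative choices.

```python
from collections import defaultdict


def weighted_rrf(rankings, weights, c=60):
    """Fuse ranked lists of document ids with weighted Reciprocal Rank Fusion.

    rankings: list of lists, each ordered best-first.
    weights:  one weight per ranking, e.g. [0.6, 0.4] for dense and BM25.
    c:        smoothing constant that dampens the influence of top ranks.
    """
    if len(rankings) != len(weights):
        raise ValueError("need exactly one weight per ranking")
    fused = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += weight / (rank + c)
    # Highest fused score first.
    return sorted(fused, key=fused.get, reverse=True)


dense_hits = ["a", "b", "c"]   # hypothetical dense-retriever order
bm25_hits = ["c", "a", "d"]    # hypothetical BM25 order
print(weighted_rrf([dense_hits, bm25_hits], [0.6, 0.4]))
# → ['a', 'c', 'b', 'd']
```

Because only ranks are used, the dense similarity scores and BM25 scores never need to be put on a common scale.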
+ ---
+
+ ## 📋 Features
+ - Upload any **PDF document**.
+ - Hybrid search aims for **more accurate retrieval** than embeddings or BM25 alone.
+ - Context-aware **Q&A** answers grounded in the uploaded PDF.
+ - **Cached retriever**: the PDF is processed once on upload and reused for every later question.
+ - Simple **Gradio UI** with an upload field and a question box.
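Context-aware answers depend on a prompt that pairs the retrieved chunks with the user's question. A minimal sketch follows; the template wording and the names `PROMPT_TEMPLATE` and `build_prompt` are assumptions, since the README only states that the prompt contains context plus question and that answers are limited to three sentences.

```python
# Hypothetical template; the project's actual wording is not given in the README.
PROMPT_TEMPLATE = (
    "Answer the question using only the context below. "
    "Use at most three sentences. "
    "If the context does not contain the answer, say you don't know.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)


def build_prompt(chunks, question):
    """Join retrieved chunks into one context block and fill the template."""
    context = "\n\n".join(chunk.strip() for chunk in chunks)
    return PROMPT_TEMPLATE.format(context=context, question=question.strip())
```

The returned string would then be sent to the LLM endpoint as a single request.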
+
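The caching feature can be sketched as a dictionary keyed by a hash of the uploaded file, so the expensive build step runs once per distinct PDF. The names `get_retriever` and `build_retriever` are hypothetical, and the README does not describe how the project keys its cache.

```python
import hashlib

# Maps the SHA-256 digest of a PDF's bytes to its already-built retriever.
_retriever_cache = {}


def get_retriever(pdf_bytes, build_retriever):
    """Return a cached retriever for `pdf_bytes`, building it on first use.

    `build_retriever` is a caller-supplied function (hypothetical here) that
    performs loading, chunking, embedding, and indexing for one PDF.
    """
    key = hashlib.sha256(pdf_bytes).hexdigest()
    if key not in _retriever_cache:
        _retriever_cache[key] = build_retriever(pdf_bytes)
    return _retriever_cache[key]
```

Hashing the content rather than the filename means re-uploading the same document under a different name still hits the cache.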
+ ---
+
+ ## 🔑 Requirements
+ - Python 3.10+
+ - Install dependencies:
+   ```bash
+   pip install -r requirements.txt
+   ```