Mohamed2210 committed
Commit 471f9ee · verified · 1 Parent(s): 8731139

Upload 3 files

Files changed (3)
  1. README.md +57 -13
  2. app.py +140 -0
  3. requirements.txt +10 -0
README.md CHANGED
@@ -1,13 +1,57 @@
- ---
- title: PDF Rag System
- emoji: 🏆
- colorFrom: yellow
- colorTo: purple
- sdk: gradio
- sdk_version: 6.0.2
- app_file: app.py
- pinned: false
- license: apache-2.0
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # 📚 PDF Q&A with Hybrid Search + LLM
+
+ ## 🚀 Overview
+ This project is a **Question Answering (QA) system** that allows users to:
+ 1. Upload a **PDF document**.
+ 2. Automatically process and chunk the text.
+ 3. Store embeddings in the **Qdrant vector database** and build a **hybrid retriever** (BM25 + Qdrant).
+ 4. Ask **natural language questions**; the system retrieves the relevant context from the PDF and generates an answer using a **Large Language Model (LLM)**.
+
+ It combines **semantic search (dense)** + **keyword search (BM25)** for better retrieval accuracy.
+
+ ---
+
+ ## 🛠️ Tech Stack
+ - **LangChain** → Orchestration of retrievers and chains.
+ - **HuggingFace + Together API** → LLM endpoint (`Qwen3-235B-A22B-Instruct-2507`).
+ - **Qdrant** → Vector database for storing embeddings.
+ - **BM25** → Keyword-based retriever.
+ - **Docling** → Loader to extract text from PDF into Markdown.
+ - **Transformers** → Tokenizer for chunking text.
+ - **Gradio** → Web interface.
+ - **dotenv** → Loads API keys from environment variables.
+
+ ---
+
+ ## ⚙️ Workflow
+ 1. **Upload PDF**
+    - The file is loaded with `DoclingLoader`.
+    - Text is split into **chunks** using a HuggingFace tokenizer.
+
+ 2. **Build Hybrid Search**
+    - Embeddings are created using `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2`.
+    - Chunks are stored in **Qdrant**.
+    - **Dense retriever** (embeddings) + **BM25 retriever** (keywords) are combined with weights `0.6` (dense) and `0.4` (BM25).
+
+ 3. **Ask Questions**
+    - The user writes a question.
+    - Relevant chunks are retrieved.
+    - A **prompt** is built with context + question.
+    - The **LLM** generates the answer (max 3 sentences).
+
+ ---
+
+ ## 📋 Features
+ - Upload any **PDF document**.
+ - Hybrid search aims for **more accurate retrieval** than embeddings or BM25 alone.
+ - Context-aware **Q&A** answers.
+ - **Cached retriever**, so you only upload once (no need to re-process for every question).
+ - Simple **Gradio UI** with upload + question box.
+
+ ---
+
+ ## 🔑 Requirements
+ - Python 3.10+
+ - Install dependencies:
+   ```bash
+   pip install -r requirements.txt
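The README's "Build Hybrid Search" step combines a BM25 ranking and a dense ranking with weights `0.4` and `0.6`. As a rough illustration of how such a weighted combination can work, here is a minimal sketch of weighted Reciprocal Rank Fusion in plain Python. The function name `weighted_rrf`, the chunk IDs, and the smoothing constant `c=60` are assumptions made for this example; the sketch is not taken from the commit and is not the library's code.

```python
from collections import defaultdict


def weighted_rrf(rankings, weights, c=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document earns weight / (rank + c) from every list it appears in.
    `c` is a smoothing constant (60 is an assumed, commonly used value).
    """
    if len(rankings) != len(weights):
        raise ValueError("one weight per ranking is required")

    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (rank + c)

    # Highest fused score first; ties broken by ID so the output is stable.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical rankings returned by the two retrievers for one question.
bm25_ranking = ["chunk_a", "chunk_b", "chunk_c"]
dense_ranking = ["chunk_b", "chunk_d", "chunk_a"]

fused = weighted_rrf([bm25_ranking, dense_ranking], weights=[0.4, 0.6])
print(fused)
```

Because the dense list carries the larger weight, a chunk it ranks first can outrank a chunk that BM25 ranks first, while chunks found by both retrievers still accumulate score from each list.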
app.py ADDED
@@ -0,0 +1,140 @@
+ import os
+ import gradio as gr
+ from dotenv import load_dotenv
+ from langchain_huggingface import HuggingFaceEndpoint, ChatHuggingFace
+ from langchain.prompts import PromptTemplate
+ from langchain.chains.combine_documents import create_stuff_documents_chain
+ from langchain_community.retrievers import BM25Retriever
+ from langchain.retrievers import EnsembleRetriever
+ from langchain_huggingface import HuggingFaceEmbeddings
+ from langchain_community.vectorstores import Qdrant
+ from langchain_docling import DoclingLoader
+ from langchain_docling.loader import ExportType
+ from transformers import AutoTokenizer
+
+ # ========== Load API KEYS ==========
+ load_dotenv()
+ huggingfacehub_api_token = os.getenv("HF_TOKEN")
+ Qdrant_api_key = os.getenv("QDRANT_API_KEY")
+
+ # ========== LLM ==========
+ llm = ChatHuggingFace(
+     llm=HuggingFaceEndpoint(
+         repo_id="Qwen/Qwen3-235B-A22B-Instruct-2507",
+         provider="together",
+         huggingfacehub_api_token=huggingfacehub_api_token,
+         task="conversational"
+     )
+ )
+
+ MODEL_NAME = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
+ tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
+
+
+ retriever_cache = {}
+
+ # ========== Prepare Data ==========
+ def prepare_data(filepath):
+     loader = DoclingLoader(file_path=filepath, export_type=ExportType.MARKDOWN).load()
+     from langchain.text_splitter import CharacterTextSplitter
+     text_splitter = CharacterTextSplitter.from_huggingface_tokenizer(
+         tokenizer, chunk_size=300, chunk_overlap=20
+     )
+     normal_chunks = text_splitter.create_documents(
+         [loader[0].model_dump()['page_content']],
+         metadatas=[loader[0].model_dump()['metadata']]
+     )
+     return normal_chunks
+
+ # ========== Hybrid Search ==========
+ def Hybrid_search(normal_chunks):
+     embedding_llm = HuggingFaceEmbeddings(model_name=MODEL_NAME)
+
+     qdrant_store = Qdrant.from_documents(
+         documents=normal_chunks,
+         embedding=embedding_llm,
+         url="https://3464a78e-425b-4e6b-bc10-5b0333dc9ad1.us-east4-0.gcp.cloud.qdrant.io:6333",
+         api_key=Qdrant_api_key,
+         collection_name="my_collection",
+         force_recreate=True
+     )
+
+     dense_retriever = qdrant_store.as_retriever(
+         search_kwargs={"k": 8, "score_threshold": 0.25}
+     )
+     bm25_retriever = BM25Retriever.from_documents(normal_chunks)
+     bm25_retriever.k = 8
+
+     hybrid_retriever = EnsembleRetriever(
+         retrievers=[bm25_retriever, dense_retriever],
+         weights=[0.4, 0.6]
+     )
+     return hybrid_retriever
+
+ # ========== Call Model ==========
+ def call_model(question, retriever):
+     qna_template = """
+     You are an assistant for question-answering tasks.
+     Use the following pieces of retrieved context to answer the question.
+     If you don't know the answer, just say that you don't know.
+     Use three sentences maximum and keep the answer concise.
+     Question: {question}
+     Context: {context}
+     Answer:
+     """
+     from langchain.prompts import PromptTemplate
+     qna_prompt = PromptTemplate(
+         template=qna_template,
+         input_variables=['context', 'question']
+     )
+
+     stuff_chain = create_stuff_documents_chain(llm, prompt=qna_prompt)
+     retrieved_docs = retriever.get_relevant_documents(question)
+
+     answer = stuff_chain.invoke(
+         {
+             "context": retrieved_docs,
+             "question": question
+         }
+     )
+     return answer
+
+ # ========== Gradio App ==========
+ def upload_pdf(file_path, progress=gr.Progress()):
+     progress(0, desc="Preparing data...")
+     chunks = prepare_data(file_path)
+     progress(0.5, desc="Building retrievers...")
+     retriever_cache["retriever"] = Hybrid_search(chunks)
+     progress(1.0, desc="Done ✅")
+     return "✅ PDF uploaded successfully! Now ask your questions."
+
+ def qa_interface(question):
+     if "retriever" not in retriever_cache:
+         return "❌ Please upload a PDF first."
+     return call_model(question, retriever_cache["retriever"])
+
+ with gr.Blocks() as demo:
+     gr.Markdown("## 📚 PDF Q&A with Hybrid Search + LLM")
+
+     with gr.Row():
+         file_input = gr.File(label="Upload PDF", type="filepath")
+         upload_output = gr.Textbox(label="Upload Status")
+
+     upload_btn = gr.Button("Upload PDF")
+     upload_btn.click(
+         fn=upload_pdf,
+         inputs=[file_input],
+         outputs=[upload_output]
+     )
+
+     question_input = gr.Textbox(label="Ask a question")
+     output = gr.Markdown()
+     submit_btn = gr.Button("Get Answer")
+
+     submit_btn.click(
+         fn=qa_interface,
+         inputs=[question_input],
+         outputs=output
+     )
+
+ demo.launch(share=True)
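In `app.py`, `retriever_cache` holds a single entry under the fixed key `"retriever"`, so every upload replaces the previous retriever, and uploading the same PDF twice rebuilds it. One possible variation, sketched below under stated assumptions, keys the cache by a hash of the file contents so identical uploads reuse the existing entry. The names `file_digest` and `get_or_build`, and the `build` callback (standing in for `prepare_data` followed by `Hybrid_search`), are hypothetical and not part of the commit.

```python
import hashlib
from typing import Callable, Dict

# Maps a SHA-256 digest of the uploaded file to whatever `build` returned for it.
_retriever_cache: Dict[str, object] = {}


def file_digest(path: str) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB blocks."""
    hasher = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(1 << 20), b""):
            hasher.update(block)
    return hasher.hexdigest()


def get_or_build(path: str, build: Callable[[str], object]) -> object:
    """Return the cached object for this file's contents, building it once.

    `build` is a hypothetical callback that turns a file path into a retriever.
    """
    key = file_digest(path)
    if key not in _retriever_cache:
        _retriever_cache[key] = build(path)
    return _retriever_cache[key]
```

Note that the commit also writes every document into the same Qdrant collection with `force_recreate=True`, so a per-file cache like this would only be safe if each file also had its own collection; the sketch leaves that part out.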
requirements.txt ADDED
@@ -0,0 +1,10 @@
+ gradio
+ langchain
+ langchain_huggingface
+ langchain_community
+ qdrant-client
+ transformers
+ pydantic
+ sentence-transformers
+ langchain-docling
+ rank_bm25
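`app.py` imports `dotenv`, which is provided by the `python-dotenv` package, yet `requirements.txt` above does not list it. Whether it is installed anyway, as a dependency of another package or by the hosting environment, is not something this diff shows. A small check like the following sketch could be run in the target environment to see which top-level modules are missing. The `REQUIRED_MODULES` list and the `missing_modules` helper are assumptions written for this example, based on the imports in `app.py` and the packages its retrievers rely on.

```python
import importlib.util

# Top-level import names (not pip package names) that app.py needs.
# Assumed from its import statements, plus the backends used indirectly
# by BM25Retriever (rank_bm25) and the Qdrant vector store (qdrant_client).
REQUIRED_MODULES = [
    "gradio",
    "dotenv",
    "langchain",
    "langchain_huggingface",
    "langchain_community",
    "langchain_docling",
    "transformers",
    "sentence_transformers",
    "qdrant_client",
    "rank_bm25",
]


def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```

If `dotenv` shows up as missing, adding `python-dotenv` to `requirements.txt` would be the usual fix.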