Spaces: Mohamed2210/PDF-Rag-System

Configuration error

Mohamed2210 committed on Dec 5, 2025

Commit 471f9ee · verified · 1 Parent(s): 8731139

Upload 3 files

Files changed (3)

README.md +57 -13
app.py +140 -0
requirements.txt +10 -0

README.md CHANGED

@@ -1,13 +1,57 @@
----
-title: PDF Rag System
-emoji: 🏆
-colorFrom: yellow
-colorTo: purple
-sdk: gradio
-sdk_version: 6.0.2
-app_file: app.py
-pinned: false
-license: apache-2.0
----
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

+# 📚 PDF Q&A with Hybrid Search + LLM
+## 🚀 Overview
+This project is a **Question Answering (QA) system** that allows users to:
+1. Upload a **PDF document**.
+2. Automatically process and chunk the text.
+3. Store embeddings in **Qdrant Vector Database** and build a **hybrid retriever** (BM25 + Qdrant).
+4. Ask **natural language questions**; the hybrid retriever pulls the relevant context from the PDF and a **Large Language Model (LLM)** generates the answer.
+It combines **semantic search (dense)** + **keyword search (BM25)** for better retrieval accuracy.
+---
+## 🛠️ Tech Stack
+- **LangChain** → Orchestration of retrievers and chains.
+- **HuggingFace + Together API** → LLM endpoint (`Qwen3-235B-A22B-Instruct-2507`).
+- **Qdrant** → Vector database for storing embeddings.
+- **BM25** → Keyword-based retriever.
+- **Docling** → Loader to extract text from PDF into Markdown.
+- **Transformers** → Tokenizer for chunking text.
+- **Gradio** → Web interface.
+- **dotenv** → Loads the API keys (`HF_TOKEN`, `QDRANT_API_KEY`) from a `.env` file.
+---
+## ⚙️ Workflow
+1. **Upload PDF**
+   - The file is loaded with `DoclingLoader`.
+   - Text is split into **chunks** of about 300 tokens (20-token overlap), counted with the embedding model's Hugging Face tokenizer.
+2. **Build Hybrid Search**
+   - Embeddings are created using `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2`.
+   - Chunks are stored in **Qdrant**.
+   - **Dense retriever** (embeddings) + **BM25 retriever** (keywords) are combined with weights `0.6` (dense) and `0.4` (BM25).
+3. **Ask Questions**
+   - User writes a question.
+   - Relevant chunks are retrieved.
+   - A **prompt** is built with context + question.
+   - The **LLM** generates the answer (the prompt asks for at most three sentences).
+---
+## 📋 Features
+- Upload any **PDF document**.
+- Hybrid search combines dense and keyword signals, which typically gives **more accurate retrieval** than embeddings or BM25 alone.
+- Context-aware **Q&A** answers.
+- **Cached retriever**: the retriever is kept in memory after upload, so the PDF is processed once rather than for every question.
+- Simple **Gradio UI** with upload + question box.
+---
+## 🔑 Requirements
+- Python 3.10+
+- Install dependencies:
+  ```bash
+  pip install -r requirements.txt
+  ```
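
Note on the `0.6` / `0.4` weights described in the README workflow: LangChain documents `EnsembleRetriever` as fusing results with weighted Reciprocal Rank Fusion. The sketch below illustrates that idea in plain Python. It is not the library's code; the function name `weighted_rrf`, the chunk IDs, and the constant `c=60` are assumptions made for the example.

```python
from collections import defaultdict


def weighted_rrf(rankings, weights, c=60):
    """Fuse several ranked lists of IDs with weighted Reciprocal Rank Fusion.

    rankings: list of ranked ID lists, best hit first (one list per retriever).
    weights:  one weight per retriever, in the same order as `rankings`.
    c:        smoothing constant (assumed value) that damps the top ranks.
    """
    if len(rankings) != len(weights):
        raise ValueError("rankings and weights must have the same length")

    scores = defaultdict(float)
    for ranked_ids, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranked_ids, start=1):
            # Each retriever contributes weight / (rank + c) to the fused score.
            scores[doc_id] += weight / (rank + c)

    # Highest fused score first.
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical chunk IDs returned by each retriever.
bm25_hits = ["c3", "c1", "c7"]
dense_hits = ["c1", "c4", "c3"]

# Same order and weights as in app.py: [bm25, dense] with [0.4, 0.6].
fused = weighted_rrf([bm25_hits, dense_hits], weights=[0.4, 0.6])
print(fused)  # ['c1', 'c3', 'c4', 'c7']
```

In this example `c1` and `c3` lead because both retrievers return them, and `c1` wins because the more heavily weighted dense retriever ranks it first.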

app.py ADDED

@@ -0,0 +1,140 @@

+import os
+import gradio as gr
+from dotenv import load_dotenv
+from langchain_huggingface import HuggingFaceEndpoint, ChatHuggingFace
+from langchain.prompts import PromptTemplate
+from langchain.chains.combine_documents import create_stuff_documents_chain
+from langchain_community.retrievers import BM25Retriever
+from langchain.retrievers import EnsembleRetriever
+from langchain_huggingface import HuggingFaceEmbeddings
+from langchain_community.vectorstores import Qdrant
+from langchain_docling import DoclingLoader
+from langchain_docling.loader import ExportType
+from transformers import AutoTokenizer
+# ========== Load API KEYS ==========
+load_dotenv()
+huggingfacehub_api_token = os.getenv("HF_TOKEN")
+Qdrant_api_key = os.getenv("QDRANT_API_KEY")
+# ========== LLM ==========
+llm = ChatHuggingFace(
+    llm=HuggingFaceEndpoint(
+        repo_id="Qwen/Qwen3-235B-A22B-Instruct-2507",
+        provider="together",
+        huggingfacehub_api_token=huggingfacehub_api_token,
+        task="conversational"
+    )
+)
+MODEL_NAME = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
+tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
+retriever_cache = {}
+# ========== Prepare Data ==========
+def prepare_data(filepath):
+    # ExportType.MARKDOWN yields one Document per input file, so docs[0] holds the whole PDF.
+    docs = DoclingLoader(file_path=filepath, export_type=ExportType.MARKDOWN).load()
+    from langchain.text_splitter import CharacterTextSplitter
+    text_splitter = CharacterTextSplitter.from_huggingface_tokenizer(
+        tokenizer, chunk_size=300, chunk_overlap=20  # sizes are counted in tokens
+    )
+    normal_chunks = text_splitter.create_documents(
+        [docs[0].page_content], metadatas=[docs[0].metadata]
+    )
+    return normal_chunks
+# ========== Hybrid Search ==========
+def Hybrid_search(normal_chunks):
+    embedding_llm = HuggingFaceEmbeddings(model_name=MODEL_NAME)
+    qdrant_store = Qdrant.from_documents(
+        documents=normal_chunks,
+        embedding=embedding_llm,
+        url="https://3464a78e-425b-4e6b-bc10-5b0333dc9ad1.us-east4-0.gcp.cloud.qdrant.io:6333",
+        api_key=Qdrant_api_key,
+        collection_name="my_collection",
+        force_recreate=True
+    )
+    dense_retriever = qdrant_store.as_retriever(
+        search_kwargs={"k": 8, "score_threshold": 0.25}
+    )
+    bm25_retriever = BM25Retriever.from_documents(normal_chunks)
+    bm25_retriever.k = 8
+    hybrid_retriever = EnsembleRetriever(
+        retrievers=[bm25_retriever, dense_retriever],
+        weights=[0.4, 0.6]
+    )
+    return hybrid_retriever
+# ========== Call Model ==========
+def call_model(question, retriever):
+    qna_template = """
+    You are an assistant for question-answering tasks.
+    Use the following pieces of retrieved context to answer the question.
+    If you don't know the answer, just say that you don't know.
+    Use three sentences maximum and keep the answer concise.
+    Question: {question}
+    Context: {context}
+    Answer:
+    """
+    # PromptTemplate is imported at module level; {context} and {question} are filled on invoke.
+    qna_prompt = PromptTemplate(
+        template=qna_template,
+        input_variables=['context', 'question']
+    )
+    stuff_chain = create_stuff_documents_chain(llm, prompt=qna_prompt)
+    retrieved_docs = retriever.invoke(question)  # get_relevant_documents() is deprecated
+    answer = stuff_chain.invoke(
+        {
+            "context": retrieved_docs,
+            "question": question
+        }
+    )
+    return answer
+# ========== Gradio App ==========
+def upload_pdf(file_path, progress=gr.Progress()):
+    progress(0, desc="Preparing data...")
+    chunks = prepare_data(file_path)
+    progress(0.5, desc="Building retrievers...")
+    retriever_cache["retriever"] = Hybrid_search(chunks)
+    progress(1.0, desc="Done ✅")
+    return "✅ PDF uploaded successfully! Now ask your questions."
+def qa_interface(question):
+    if "retriever" not in retriever_cache:
+        return "❌ Please upload a PDF first."
+    return call_model(question, retriever_cache["retriever"])
+with gr.Blocks() as demo:
+    gr.Markdown("## 📚 PDF Q&A with Hybrid Search + LLM")
+    with gr.Row():
+        file_input = gr.File(label="Upload PDF", type="filepath")
+        upload_output = gr.Textbox(label="Upload Status")
+    upload_btn = gr.Button("Upload PDF")
+    upload_btn.click(
+        fn=upload_pdf,
+        inputs=[file_input],
+        outputs=[upload_output]
+    )
+    question_input = gr.Textbox(label="Ask a question")
+    output = gr.Markdown()
+    submit_btn = gr.Button("Get Answer")
+    submit_btn.click(
+        fn=qa_interface,
+        inputs=[question_input],
+        outputs=output
+    )
+demo.launch()  # share=True is not needed when the app runs on Hugging Face Spaces
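
Note on `retriever_cache`: the committed code stores a single retriever under the key `"retriever"`, so every upload replaces the previous one. A possible variation, sketched here with the standard library only, keys the cache by a hash of the file contents so that re-uploading the same PDF skips re-processing. The class `RetrieverCache` and its `build_fn` argument are hypothetical names. The sketch also assumes each build writes to its own vector collection, which is not the case in app.py, where `my_collection` is recreated on every upload.

```python
import hashlib


class RetrieverCache:
    """Cache one built retriever per distinct file content (hypothetical helper)."""

    def __init__(self, build_fn):
        # build_fn: callable taking a file path and returning a retriever,
        # e.g. lambda p: Hybrid_search(prepare_data(p)) in app.py.
        self._build_fn = build_fn
        self._by_digest = {}

    @staticmethod
    def file_digest(path):
        """Return the SHA-256 hex digest of the file, read in 1 MiB blocks."""
        digest = hashlib.sha256()
        with open(path, "rb") as handle:
            for block in iter(lambda: handle.read(1 << 20), b""):
                digest.update(block)
        return digest.hexdigest()

    def get_or_build(self, path):
        """Build the retriever on first sight of this content, then reuse it."""
        key = self.file_digest(path)
        if key not in self._by_digest:
            self._by_digest[key] = self._build_fn(path)
        return self._by_digest[key]
```

With such a helper, `upload_pdf` could call `cache.get_or_build(file_path)` and `qa_interface` would use whichever retriever was returned last; how to track that per user session is left open here.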

requirements.txt ADDED

@@ -0,0 +1,10 @@

+gradio
+langchain
+langchain_huggingface
+langchain_community
+qdrant-client
+transformers
+pydantic
+sentence-transformers
+langchain-docling
+rank_bm25
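
Note on dependencies: app.py imports `dotenv`, which is not listed explicitly above, so it presumably arrives as a transitive dependency (an assumption, not verified here). A quick way to confirm that an environment satisfies the imports is a small check like the following sketch. The helper name `missing_modules` is hypothetical, and the list holds the top-level import names used by app.py plus the packages it relies on indirectly.

```python
from importlib.util import find_spec


def missing_modules(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names if find_spec(name) is None]


# Top-level import names (not pip package names) needed by app.py.
REQUIRED = [
    "gradio",
    "dotenv",
    "langchain",
    "langchain_huggingface",
    "langchain_community",
    "langchain_docling",
    "qdrant_client",
    "transformers",
    "sentence_transformers",
    "rank_bm25",
]

missing = missing_modules(REQUIRED)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```

Running this after `pip install -r requirements.txt` shows whether any import in app.py would fail before the app is launched.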