PDF-Rag-System / README.md
Mohamed2210's picture
Upload README.md
ff46f8d verified

πŸ“š PDF Q&A with Hybrid Search + LLM

πŸš€ Overview

This project is a Question Answering (QA) system that allows users to:

  1. Upload a PDF document.
  2. Automatically process and chunk the text.
  3. Store embeddings in Qdrant Vector Database and build a hybrid retriever (BM25 + Qdrant).
  4. Ask natural language questions, and the model will retrieve the relevant context from the PDF and generate an answer using a Large Language Model (LLM).

It combines semantic search (dense) + keyword search (BM25) for better retrieval accuracy.


πŸ› οΈ Tech Stack

  • LangChain β†’ Orchestration of retrievers and chains.
  • HuggingFace + Together API β†’ LLM endpoint (Qwen3-235B-A22B-Instruct-2507).
  • Qdrant β†’ Vector database for storing embeddings.
  • BM25 β†’ Keyword-based retriever.
  • Docling β†’ Loader to extract text from PDF into Markdown.
  • Transformers β†’ Tokenizer for chunking text.
  • Gradio β†’ Web interface.
  • dotenv β†’ Secure API key management.

βš™οΈ Workflow

  1. Upload PDF

    • The file is loaded with DoclingLoader.
    • Text is split into chunks using HuggingFace tokenizer.
  2. Build Hybrid Search

    • Embeddings are created using sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2.
    • Chunks are stored in Qdrant.
    • Dense retriever (embeddings) + BM25 retriever (keywords) are combined with weights 0.6 (dense) and 0.4 (BM25).
  3. Ask Questions

    • User writes a question.
    • Relevant chunks are retrieved.
    • A prompt is built with context + question.
    • The LLM generates the answer (max 3 sentences).

πŸ“‹ Features

  • Upload any PDF document.
  • Hybrid search ensures more accurate retrieval than only embeddings or BM25.
  • Context-aware Q&A answers.
  • Caching retriever so you only upload once (no need to re-process for every question).
  • Simple Gradio UI with upload + question box.

πŸ”‘ Requirements

  • Python 3.10+
  • Install dependencies:
    pip install -r requirements.txt