Vivek Vaddina
---
title: Hyde Rag
emoji: 📊
colorFrom: indigo
colorTo: gray
sdk: gradio
sdk_version: 5.47.0
app_file: app.py
pinned: false
short_description: answer based on input documents
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

# RAG HyDE

This challenge is to build a minimal but powerful Retrieval-Augmented Generation (RAG) workflow inspired by the techniques in the articles already shared.

Your solution should:

- Ingest and chunk local PDFs efficiently.
- Use a small, fast embedding model for retrieval.
- Apply HyDE (Hypothetical Document Embeddings) to improve query relevance.
- Fuse multiple query variants with Reciprocal Rank Fusion for higher accuracy.
- Generate concise, context-grounded answers with clear source citations.

Goal: Deliver a working RAG example that is fast, lightweight, and high-quality in both retrieval and final answers.
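Reciprocal Rank Fusion itself is only a few lines. A minimal sketch (the `k=60` constant is the value suggested in the original RRF paper):

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each list contributes 1 / (k + rank) per document, so items that
    appear near the top of many lists accumulate the largest scores.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

fused = reciprocal_rank_fusion([["a", "b", "c"], ["b", "d", "a"]])
# "b" ranks first: it appears near the top of both lists.
```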

## Description

### Approach

- Get the folder containing the data
- Build the corpus & index
- Get the user query
- Transform it to separate the search terms from the intent (Query Rewrite)
- Generate hypothetical documents from the (short) user query
- Get their corresponding embeddings
- For each of those embeddings, retrieve relevant results from the corpus
- Fuse them all together using Reciprocal Rank Fusion
- Extract the top relevant results
- Pre-process those into a context and send it to the LLM one last time together with the user query
- Receive the final answer based on the user's query & the provided context
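The steps above can be sketched end to end. Here `gen_hypotheticals`, `embed`, `vector_search`, and `gen_answer` are placeholders for the real model calls, not the app's actual function names:

```python
def hyde_answer(query, gen_hypotheticals, embed, vector_search, gen_answer,
                top_k=5, rrf_k=60):
    # Generate hypothetical documents from the (short) user query.
    hypo_docs = gen_hypotheticals(query)
    # Embed each one and retrieve a ranked list of chunks per embedding.
    rankings = [vector_search(embed(doc)) for doc in hypo_docs]
    # Fuse all per-embedding rankings with Reciprocal Rank Fusion.
    scores = {}
    for ranking in rankings:
        for rank, chunk in enumerate(ranking, start=1):
            scores[chunk] = scores.get(chunk, 0.0) + 1.0 / (rrf_k + rank)
    top = sorted(scores, key=scores.get, reverse=True)[:top_k]
    # Build a context from the top chunks and ask the LLM one last time.
    context = "\n\n".join(top)
    return gen_answer(context, query)
```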

## Run

`MODEL_COMBOS` in `config.py` provides multiple embedding & generative LLM model combinations, chosen with the host system's limitations & capabilities in mind.
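For illustration, such a mapping could look like the following. The keys and model names here are assumptions for the sketch, not necessarily the entries actually present in `config.py`:

```python
# Hypothetical shape of MODEL_COMBOS; the real entries in the repo may differ.
MODEL_COMBOS = {
    "light": {  # CPU / low-memory hosts
        "embedding": "sentence-transformers/all-MiniLM-L6-v2",
        "llm": "Qwen/Qwen3-0.6B",
    },
    "full": {  # consumer-grade laptop GPU
        "embedding": "BAAI/bge-m3",
        "llm": "Qwen/Qwen3-4B-AWQ",
    },
}
```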

## Tips / Observations

- Extracting the text as Markdown greatly preserved the structure and continuity of the text. This resulted in better logical chunking, which in turn led to better embeddings and, consequently, better search results.
- Reading documents via docling extracted more (and more accurate) text than pymupdf4llm, at a slight cost in speed. It is enabled by default to prioritise accuracy.
  - This proved especially useful for documents with lots of tables spread over multiple pages.
  - You can pass `--fast-extract` on the CLI or tick a box in the Gradio UI to use pymupdf instead.
- Increasing the model size (coupled with correct Markdown text extraction) greatly improved performance. The Qwen3 models adhered closely to instructions; the smaller variants, instead of hallucinating, simply fell back to saying "I don't know" (as instructed). The 4B variant understood the user's intent, even when it was vague, and still managed to give relevant results. The base variant is huge and would not fit or run fast enough on a consumer-grade laptop GPU; loading its AWQ variant helped, as it occupies substantially less memory than the original without much loss in quality.
  - This model also showed strong multilingual capabilities. Users can upload a document in one language and ask questions in another, or upload multilingual documents and ask multilingual queries. For the demo, I tested mostly in English & German.
- The data is now stored in the `datasets` format, which allows for better storage & scaling (Arrow) along with indexing (FAISS) for querying.


## Limitations / Known Issues

- Even though docling with mostly default options proved better than pymupdf4llm at extracting text, it is not perfect every time. There are instances where pymupdf extracted text from an image embedded inside a PDF better than docling did. However, docling is highly configurable and allows deep customisation via "pipelines", and it also comes with a far more permissive license for commercial use than PyMuPDF.
  - docling ships with easyocr by default for text OCR, which is not as powerful as tesseract or similar engines. Since installing the latter and linking it with docling involves touching system configuration, this was not pursued.
- When the user uploads multiple PDFs, load times could be improved by reading them asynchronously. Attempts to do this with docling sometimes produced pages in a different order than the original, so it was dropped for the demo; more investigation is needed.
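One way to keep output order deterministic while still loading documents concurrently is to rely on `asyncio.gather`, which returns results in the order its awaitables were passed. Here `parse_pdf` is a stand-in for the real blocking docling conversion call:

```python
import asyncio

def parse_pdf(path):
    # Stand-in for the blocking docling conversion of one PDF.
    return f"text of {path}"

async def load_all(paths):
    # gather() preserves argument order, so the output list matches the
    # input path order even though the parses run concurrently in
    # worker threads.
    return await asyncio.gather(
        *(asyncio.to_thread(parse_pdf, p) for p in paths)
    )

texts = asyncio.run(load_all(["a.pdf", "b.pdf", "c.pdf"]))
```

This guarantees ordering across documents; whether page order *within* a document survives concurrent docling conversion would still need the investigation mentioned above.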

## Next Steps

- Check out EmbeddingGemma for embeddings
- Check out fastembed to generate embeddings faster
- Improve text extraction via the docling pipeline
- Check out GGUF models for CPU inference