Vivek Vaddina
---
title: Hyde Rag
emoji: 📊
colorFrom: indigo
colorTo: gray
sdk: gradio
sdk_version: 5.47.0
app_file: app.py
pinned: false
short_description: answer based on input documents
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

# RAG HyDE

This challenge is to build a minimal but powerful Retrieval-Augmented Generation (RAG) workflow inspired by the techniques in the articles already shared.

Your solution should:

- Ingest and chunk local PDFs efficiently.
- Use a small, fast embedding model for retrieval.
- Apply HyDE (Hypothetical Document Embeddings) to improve query relevance.
- Fuse multiple query variants with Reciprocal Rank Fusion for higher accuracy.
- Generate concise, context-grounded answers with clear source citations.

Goal: Deliver a working RAG example that is fast, lightweight, and high-quality in both retrieval and final answers.
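Reciprocal Rank Fusion itself is only a few lines. A minimal sketch (the `k=60` constant is the value suggested in the original RRF paper):

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each list contributes 1 / (k + rank) per document, so items that
    appear near the top of many lists accumulate the largest scores.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

fused = reciprocal_rank_fusion([["a", "b", "c"], ["b", "d", "a"]])
# "b" ranks first: it appears near the top of both lists.
```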

## Description

### Approach

- Get the folder containing the data
- Build the corpus & index
- Get the user query
- Transform it to separate the search terms from the intent (Query Rewrite)
- Generate hypothetical documents from the (short) user query
- Get their corresponding embeddings
- For each of those embeddings, retrieve relevant results from the corpus
- Fuse them all together using Reciprocal Rank Fusion
- Extract the top relevant results
- Pre-process those into a context and send it to the LLM one last time together with the user query
- Receive the final answer based on the user's query & the provided context
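The steps above can be sketched end to end. Here `gen_hypotheticals`, `embed`, `vector_search`, and `gen_answer` are placeholders for the real model calls, not the app's actual function names:

```python
def hyde_answer(query, gen_hypotheticals, embed, vector_search, gen_answer,
                top_k=5, rrf_k=60):
    # Generate hypothetical documents from the (short) user query.
    hypo_docs = gen_hypotheticals(query)
    # Embed each one and retrieve a ranked list of chunks per embedding.
    rankings = [vector_search(embed(doc)) for doc in hypo_docs]
    # Fuse all per-embedding rankings with Reciprocal Rank Fusion.
    scores = {}
    for ranking in rankings:
        for rank, chunk in enumerate(ranking, start=1):
            scores[chunk] = scores.get(chunk, 0.0) + 1.0 / (rrf_k + rank)
    top = sorted(scores, key=scores.get, reverse=True)[:top_k]
    # Build a context from the top chunks and ask the LLM one last time.
    context = "\n\n".join(top)
    return gen_answer(context, query)
```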

## Run

`MODEL_COMBOS` in `config.py` provides multiple embedding & generative LLM model combinations, chosen with the host system's limitations & capabilities in mind.
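For illustration, such a mapping could look like the following. The keys and model names here are assumptions for the sketch, not necessarily the entries actually present in `config.py`:

```python
# Hypothetical shape of MODEL_COMBOS; the real entries in the repo may differ.
MODEL_COMBOS = {
    "light": {  # CPU / low-memory hosts
        "embedding": "sentence-transformers/all-MiniLM-L6-v2",
        "llm": "Qwen/Qwen3-0.6B",
    },
    "full": {  # consumer-grade laptop GPU
        "embedding": "BAAI/bge-m3",
        "llm": "Qwen/Qwen3-4B-AWQ",
    },
}
```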

## Tips / Observations

- Extracting the text as Markdown greatly preserved the structure and continuity of the text. This resulted in better logical chunking, which in turn led to better embeddings and, consequently, better search results.
- Reading documents via docling extracted more (and more accurate) text than pymupdf4llm, at a slight cost in speed. It is enabled by default to prioritise accuracy.
  - This proved especially useful for documents with lots of tables spread over multiple pages.
  - You can pass `--fast-extract` on the CLI or tick a box in the Gradio UI to use pymupdf instead.
- Increasing the model size (coupled with correct Markdown text extraction) greatly improved performance. The Qwen3 models adhered closely to instructions; the smaller variants, instead of hallucinating, simply fell back to saying "I don't know" (as instructed). The 4B variant understood the user's intent, even when it was vague, and still managed to give relevant results. The base variant is huge and would not fit or run fast enough on a consumer-grade laptop GPU; loading its AWQ variant helped, as it occupies substantially less memory than the original without much loss in quality.
  - This model also showed strong multilingual capabilities. Users can upload a document in one language and ask questions in another, or upload multilingual documents and ask multilingual queries. For the demo, I tested mostly in English & German.
- The data is now stored in the `datasets` format, which allows for better storage & scaling (Arrow) along with indexing (FAISS) for querying.


## Limitations / Known Issues

- Even though docling with mostly default options proved better than pymupdf4llm at extracting text, it is not perfect every time. There are instances where pymupdf extracted text from an image embedded inside a PDF better than docling did. However, docling is highly configurable and allows deep customisation via "pipelines", and it also comes with a far more permissive license for commercial use than PyMuPDF.
  - docling ships with easyocr by default for text OCR, which is not as powerful as tesseract or similar engines. Since installing the latter and linking it with docling involves touching system configuration, this was not pursued.
- When the user uploads multiple PDFs, load times could be improved by reading them asynchronously. Attempts to do this with docling sometimes produced pages in a different order than the original, so it was dropped for the demo; more investigation is needed.
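One way to keep output order deterministic while still loading documents concurrently is to rely on `asyncio.gather`, which returns results in the order its awaitables were passed. Here `parse_pdf` is a stand-in for the real blocking docling conversion call:

```python
import asyncio

def parse_pdf(path):
    # Stand-in for the blocking docling conversion of one PDF.
    return f"text of {path}"

async def load_all(paths):
    # gather() preserves argument order, so the output list matches the
    # input path order even though the parses run concurrently in
    # worker threads.
    return await asyncio.gather(
        *(asyncio.to_thread(parse_pdf, p) for p in paths)
    )

texts = asyncio.run(load_all(["a.pdf", "b.pdf", "c.pdf"]))
```

This guarantees ordering across documents; whether page order *within* a document survives concurrent docling conversion would still need the investigation mentioned above.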

## Next Steps

- Check out EmbeddingGemma for embeddings
- Check out fastembed to generate embeddings faster
- Improve text extraction via the docling pipeline
- Check out GGUF models for CPU inference