---
title: RAG Document Q&A Assistant
emoji: 📄
colorFrom: green
colorTo: purple
sdk: gradio
sdk_version: 6.5.1
app_file: app.py
pinned: false
license: mit
---

# 📄 RAG Document Q&A Assistant

A Retrieval-Augmented Generation (RAG) system that answers questions about uploaded documents with source citations and chunking strategy comparison.

## 🎯 What This Does

  1. Upload a PDF or TXT document
  2. Choose a chunking strategy (Fixed-size or Paragraph-based)
  3. Process the document (chunks it and creates embeddings)
  4. Ask questions about the document
  5. Get accurate answers with relevance scores and source citations
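
Step 1 can be sketched as a simple dispatch on file extension. This is an illustrative sketch, not the app's actual code: `load_document` and `extract_pdf_text` are hypothetical names, and the PDF branch is stubbed (the app uses PyMuPDF, as described under Development Challenges below).

```python
from pathlib import Path

def extract_pdf_text(path: str) -> str:
    # Stand-in: the app would use PyMuPDF (fitz) here, e.g.
    #   import fitz; return "".join(p.get_text() for p in fitz.open(path))
    raise NotImplementedError

def load_document(path: str) -> str:
    """Route an uploaded file to the right text extractor by extension."""
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        return extract_pdf_text(path)
    if suffix == ".txt":
        return Path(path).read_text(encoding="utf-8")
    raise ValueError(f"Unsupported file type: {suffix}")
```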

πŸ—οΈ Architecture

```
Document → Chunking → Embedding → Vector Store (ChromaDB)
                                        ↓
User Question → Embedding → Semantic Search → Retrieved Chunks → GPT-4o-mini → Answer
```
| Component | Technology |
|---|---|
| Embeddings | sentence-transformers/all-MiniLM-L6-v2 (384 dimensions) |
| Vector Store | ChromaDB (in-memory) |
| Chunking | Fixed-size (500 chars, 100 overlap) or Paragraph-based |
| LLM | OpenAI GPT-4o-mini |
| Framework | Gradio |

## 🔬 Chunking Strategies Compared

This app lets you compare two chunking approaches:

| Strategy | How It Works | Best For |
|---|---|---|
| Fixed-size | Splits text into 500-char chunks with 100-char overlap | Uniform documents, consistent retrieval |
| Paragraph-based | Splits on double newlines, preserving natural boundaries | Structured documents, better context |

**Key Insight:** Fixed-size chunking may cut text mid-sentence, but it produces more chunks and therefore finer retrieval granularity. Paragraph-based chunking preserves context but can produce uneven chunk sizes.
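
The two strategies can be sketched in a few lines each. This is a minimal illustration of the approach described above, not the app's exact code; the function names and parameter defaults are assumptions (the 500/100 values match the table):

```python
def chunk_fixed(text: str, size: int = 500, overlap: int = 100) -> list[str]:
    """Fixed-size chunks with overlap; may cut mid-sentence."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

def chunk_paragraphs(text: str) -> list[str]:
    """Split on blank lines, preserving natural paragraph boundaries."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]
```

The overlap means each fixed-size chunk repeats the last 100 characters of its predecessor, so a sentence cut at a boundary still appears whole in one of the two neighboring chunks.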

πŸ› οΈ Technical Implementation

### Vector Search Flow

```python
# 1. Document Processing
chunks = chunk_by_strategy(document_text)
collection.add(documents=chunks, ids=chunk_ids)

# 2. Query Processing
results = collection.query(query_texts=[question], n_results=3)

# 3. Answer Generation
context = format_retrieved_chunks(results)
answer = gpt4o_mini.generate(context + question)
```
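
The `format_retrieved_chunks` step above can be sketched as follows. This is a hypothetical implementation, not the app's actual code; it assumes the shape of ChromaDB's `query()` result (per-query nested lists under `"documents"` and `"distances"`):

```python
def format_retrieved_chunks(results: dict) -> str:
    """Build a numbered, cited context block from a ChromaDB query result."""
    docs = results["documents"][0]       # chunks for the first (only) query
    distances = results["distances"][0]  # smaller distance = closer match
    lines = []
    for i, (doc, dist) in enumerate(zip(docs, distances), start=1):
        relevance = 1 / (1 + dist)  # distance-to-relevance conversion
        lines.append(f"[Source {i}] (relevance {relevance:.0%})\n{doc}")
    return "\n\n".join(lines)
```

Numbering the sources in the prompt is what lets the LLM cite them back in its answer.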

### Similarity Scoring

Distances from ChromaDB are converted to intuitive relevance percentages:

```python
similarity = 1 / (1 + distance)  # Higher = more relevant
```
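
A worked example of the conversion, scaled to percentages (the sample distances are illustrative, not real query output):

```python
def relevance_pct(distance: float) -> float:
    """Map a ChromaDB distance (0 = identical) to a 0-100% relevance score."""
    return 100 / (1 + distance)

scores = [round(relevance_pct(d), 1) for d in (0.0, 0.25, 1.0, 3.0)]
print(scores)  # → [100.0, 80.0, 50.0, 25.0]
```

A distance of 0 maps to 100%, and relevance falls off smoothly as distance grows, so scores stay in an intuitive range no matter how large the raw distance is.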

## 🧪 Development Challenges

### Challenge 1: Collection Already Exists Error

**Problem:** `chromadb.errors.InternalError: Collection [documents] already exists`
**Cause:** Re-uploading a document without clearing the previous collection.
**Solution:** Delete the existing collection before creating a new one:

```python
try:
    chroma_client.delete_collection(name="documents")
except Exception:  # collection may not exist yet on first upload
    pass
collection = chroma_client.create_collection(name="documents", ...)
```

### Challenge 2: PDF Text Extraction

**Problem:** Some PDFs have unusual formatting, resulting in very few chunks.
**Solution:** PyMuPDF (`fitz`) handles most PDF formats reliably. For problematic PDFs, fixed-size chunking gives more consistent results than paragraph-based chunking.
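
One way to apply that fallback automatically is a chunk-count guard. This sketch is hypothetical (the `MIN_CHUNKS` threshold and function name are assumptions, not the app's code): it switches to fixed-size chunking whenever paragraph splitting yields too few chunks, e.g. for a PDF whose extracted text contains no blank lines:

```python
MIN_CHUNKS = 3  # illustrative threshold, not from the app

def chunk_with_fallback(text: str) -> list[str]:
    """Prefer paragraph chunks; fall back to fixed-size for flat text."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    if len(paragraphs) >= MIN_CHUNKS:
        return paragraphs
    # Fixed-size fallback: 500-char chunks with 100-char overlap
    return [text[i:i + 500] for i in range(0, len(text), 400)]
```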

## 📊 Features

- ✅ PDF and TXT file support
- ✅ Two chunking strategies for comparison
- ✅ Source citations with relevance scores
- ✅ Real-time document statistics
- ✅ Clean, intuitive UI

πŸ“ Limitations

- Requires an OpenAI API key (uses GPT-4o-mini)
- In-memory vector store (resets each session)
- Optimized for English-language documents
- Maximum file size limited by HF Spaces

## 👤 Author

Nav772 - Built as part of AI/ML Engineering portfolio
