---
title: RAG Document Q&A Assistant
emoji: 📄
colorFrom: green
colorTo: purple
sdk: gradio
sdk_version: 6.5.1
app_file: app.py
pinned: false
license: mit
---
# 📄 RAG Document Q&A Assistant

A Retrieval-Augmented Generation (RAG) system that answers questions about uploaded documents, with source citations and a comparison of chunking strategies.
## 🎯 What This Does

- Upload a PDF or TXT document
- Choose a chunking strategy (Fixed-size or Paragraph-based)
- Process the document (chunks it and creates embeddings)
- Ask questions about the document
- Get accurate answers with relevance scores and source citations
## 🏗️ Architecture

```
Document → Chunking → Embedding → Vector Store (ChromaDB)
                                        ↑
User Question → Embedding → Semantic Search → Retrieved Chunks → GPT-4o-mini → Answer
```
| Component | Technology |
|---|---|
| Embeddings | sentence-transformers/all-MiniLM-L6-v2 (384 dimensions) |
| Vector Store | ChromaDB (in-memory) |
| Chunking | Fixed-size (500 chars, 100 overlap) or Paragraph-based |
| LLM | OpenAI GPT-4o-mini |
| Framework | Gradio |
## 🔬 Chunking Strategies Compared

This app lets you compare two chunking approaches:
| Strategy | How It Works | Best For |
|---|---|---|
| Fixed-size | Splits text into 500-char chunks with 100-char overlap | Uniform documents, consistent retrieval |
| Paragraph-based | Splits on double newlines, preserves natural boundaries | Structured documents, better context |
**Key Insight:** Fixed-size chunking may cut mid-sentence but creates more chunks, giving finer retrieval granularity. Paragraph-based chunking preserves context but can produce uneven chunk sizes.
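The two strategies can be sketched in a few lines (function names and exact slicing are illustrative, not necessarily the app's actual code):

```python
def chunk_fixed(text, size=500, overlap=100):
    """Fixed-size chunking: uniform windows that overlap so context at
    chunk boundaries is not lost."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def chunk_paragraphs(text):
    """Paragraph-based chunking: split on blank lines, preserving the
    document's natural boundaries."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]
```

With the default parameters, each fixed-size chunk shares its last 100 characters with the start of the next chunk, which is why fixed-size chunking yields more (and more uniform) chunks than paragraph splitting on the same text.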
## 🛠️ Technical Implementation

### Vector Search Flow
```python
# 1. Document processing
chunks = chunk_by_strategy(document_text)
collection.add(documents=chunks, ids=chunk_ids)

# 2. Query processing
results = collection.query(query_texts=[question], n_results=3)

# 3. Answer generation
context = format_retrieved_chunks(results)
answer = gpt4o_mini.generate(context + question)
```
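Step 3 hinges on how retrieved chunks are packed into the prompt. A minimal sketch of the formatting helper (the name `format_retrieved_chunks` follows the pseudocode above; the exact output format is an assumption):

```python
def format_retrieved_chunks(results):
    """Turn a ChromaDB query result into a numbered context block.

    `results` mirrors ChromaDB's return shape:
    {"documents": [[...]], "distances": [[...]]}.
    """
    docs = results["documents"][0]
    dists = results["distances"][0]
    lines = []
    for i, (doc, dist) in enumerate(zip(docs, dists), start=1):
        relevance = 1 / (1 + dist)  # same scoring the app uses
        lines.append(f"[Source {i} | relevance {relevance:.0%}]\n{doc}")
    return "\n\n".join(lines)
```

Labeling each chunk with its source number is what lets the LLM produce the citations shown in the answer.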
### Similarity Scoring

Distances returned by ChromaDB are converted to intuitive relevance percentages:

```python
similarity = 1 / (1 + distance)  # higher = more relevant
```
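A few worked values show how this mapping compresses non-negative distances into a (0, 1] range:

```python
def relevance(distance):
    # Map a non-negative distance to (0, 1]; distance 0 -> 100% relevance.
    return 1 / (1 + distance)

print(f"{relevance(0.0):.0%}")  # 100%
print(f"{relevance(1.0):.0%}")  # 50%
print(f"{relevance(3.0):.0%}")  # 25%
```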
## 🧪 Development Challenges

### Challenge 1: Collection Already Exists Error

**Problem:** `chromadb.errors.InternalError: Collection [documents] already exists`

**Cause:** Re-uploading documents without clearing the previous collection.

**Solution:** Delete the existing collection before creating a new one:
```python
try:
    chroma_client.delete_collection(name="documents")
except Exception:
    # Collection doesn't exist yet on first upload; nothing to delete.
    pass
collection = chroma_client.create_collection(name="documents", ...)
```

(ChromaDB's `get_or_create_collection` offers another way to avoid this error when the collection's contents should be kept.)
### Challenge 2: PDF Text Extraction

**Problem:** Some PDFs have unusual formatting that yields very few chunks.

**Solution:** PyMuPDF (`fitz`) handles most PDF formats reliably. For problematic PDFs, fixed-size chunking gives more consistent results than paragraph-based chunking.
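The fallback described above can be expressed as a simple heuristic (the threshold of 3 paragraph chunks is an illustrative choice, not the app's actual value):

```python
def chunk_with_fallback(text, min_paragraph_chunks=3):
    """Prefer paragraph chunks, but fall back to fixed-size windows when a
    PDF's extracted text has too few paragraph breaks to chunk usefully."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    if len(paragraphs) >= min_paragraph_chunks:
        return paragraphs
    # Fixed-size fallback: 500-char windows with 100-char overlap.
    size, overlap = 500, 100
    return [text[i:i + size]
            for i in range(0, max(len(text) - overlap, 1), size - overlap)]
```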
## 📋 Features

- ✅ PDF and TXT file support
- ✅ Two chunking strategies for comparison
- ✅ Source citations with relevance scores
- ✅ Real-time document statistics
- ✅ Clean, intuitive UI
## 🚫 Limitations

- Requires an OpenAI API key (uses GPT-4o-mini)
- In-memory vector store (resets on each session)
- Optimized for English-language documents
- Maximum file size limited by HF Spaces
## 📚 Research References

- RAG Original Paper (Lewis et al., 2020) - introduced Retrieval-Augmented Generation
- RAG Survey (Gao et al., 2023) - comprehensive survey of RAG techniques
- Chunking Strategies for RAG (Merola & Singh, 2025) - analysis of chunking approaches
## 👤 Author

**Nav772** - Built as part of an AI/ML Engineering portfolio