---
title: RAG Document Q&A Assistant
emoji: 📄
colorFrom: green
colorTo: purple
sdk: gradio
sdk_version: 6.5.1
app_file: app.py
pinned: false
license: mit
---

# 📄 RAG Document Q&A Assistant

A Retrieval-Augmented Generation (RAG) system that answers questions about uploaded documents, cites its sources, and lets you compare chunking strategies.

## 🎯 What This Does

1. **Upload** a PDF or TXT document
2. **Choose** a chunking strategy (Fixed-size or Paragraph-based)
3. **Process** the document (chunks it and creates embeddings)
4. **Ask** questions about the document
5. **Get** answers with relevance scores and source citations

## 🏗️ Architecture

```
Document → Chunking → Embedding → Vector Store (ChromaDB)
                                          ↓
User Question → Embedding → Semantic Search → Retrieved Chunks → GPT-4o-mini → Answer
```

| Component | Technology |
|-----------|------------|
| Embeddings | sentence-transformers/all-MiniLM-L6-v2 (384 dimensions) |
| Vector Store | ChromaDB (in-memory) |
| Chunking | Fixed-size (500 chars, 100 overlap) or Paragraph-based |
| LLM | OpenAI GPT-4o-mini |
| Framework | Gradio |

## 🔬 Chunking Strategies Compared

This app lets you compare two chunking approaches:

| Strategy | How It Works | Best For |
|----------|--------------|----------|
| **Fixed-size** | Splits text into 500-char chunks with 100-char overlap | Uniform documents, consistent retrieval |
| **Paragraph-based** | Splits on double newlines, preserves natural boundaries | Structured documents, better context |

**Key Insight:** Fixed-size chunking may cut mid-sentence, but it produces more chunks and therefore finer retrieval granularity. Paragraph-based chunking preserves context but may produce uneven chunk sizes.

## 🛠️ Technical Implementation

### Vector Search Flow

```python
# Simplified outline of the pipeline; helper names stand in for the app's own functions.

# 1. Document Processing
chunks = chunk_by_strategy(document_text)
chunk_ids = [f"chunk_{i}" for i in range(len(chunks))]  # one unique ID per chunk
collection.add(documents=chunks, ids=chunk_ids)

# 2. Query Processing
results = collection.query(query_texts=[question], n_results=3)  # top 3 chunks

# 3. Answer Generation
context = format_retrieved_chunks(results)
answer = gpt4o_mini.generate(context + question)  # placeholder for the GPT-4o-mini call
```

### Similarity Scoring

ChromaDB returns a distance for each retrieved chunk (lower means closer). The app converts it to a relevance score between 0 and 1, which is displayed as a percentage:

```python
similarity = 1 / (1 + distance)  # distance 0 -> 1.0 (100%); higher = more relevant
```

## 🧪 Development Challenges

### Challenge 1: Collection Already Exists Error

**Problem:** `chromadb.errors.InternalError: Collection [documents] already exists`

**Cause:** Re-uploading a document without clearing the previous collection.

**Solution:** Delete the existing collection before creating a new one:

```python
try:
    chroma_client.delete_collection(name="documents")
except Exception:
    # No previous collection to delete (first upload in this session)
    pass
collection = chroma_client.create_collection(name="documents", ...)
```

### Challenge 2: PDF Text Extraction

**Problem:** Some PDFs have unusual formatting, which results in very few chunks.

**Solution:** PyMuPDF (fitz) handles most PDF formats reliably. For problematic PDFs, fixed-size chunking gives more consistent results than paragraph-based chunking.

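The two chunking strategies compared in this README can be sketched as follows. This is a minimal illustration using the documented parameters (500-character chunks, 100-character overlap, split on double newlines); the function names `chunk_fixed_size` and `chunk_by_paragraph` are hypothetical and are not taken from `app.py`, whose actual implementation may differ.

```python
def chunk_fixed_size(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into chunk_size-character windows that overlap by `overlap` characters.

    Hypothetical helper: defaults mirror the values stated in this README.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size].strip()
        if chunk:
            chunks.append(chunk)
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


def chunk_by_paragraph(text: str) -> list[str]:
    """Split on double newlines and drop empty paragraphs (hypothetical helper)."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]
```

When a PDF extracts as one long run of text with no blank lines, `chunk_by_paragraph` returns a single oversized chunk, while `chunk_fixed_size` still yields evenly sized pieces. That is one plausible reading of why fixed-size chunking behaves more consistently on problematic PDFs.
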
## 📊 Features

- ✅ PDF and TXT file support
- ✅ Two chunking strategies for comparison
- ✅ Source citations with relevance scores
- ✅ Real-time document statistics
- ✅ Clean, intuitive UI

## 📝 Limitations

- Requires an OpenAI API key (uses GPT-4o-mini)
- In-memory vector store (resets on each session)
- Optimized for English-language documents
- Maximum file size is limited by Hugging Face Spaces

## 📚 Research References

- [RAG Original Paper (Lewis et al., 2020)](https://arxiv.org/abs/2005.11401) - Introduced Retrieval-Augmented Generation
- [RAG Survey (Gao et al., 2023)](https://arxiv.org/pdf/2312.10997) - Comprehensive survey of RAG techniques
- [Chunking Strategies for RAG (Merola & Singh, 2025)](https://arxiv.org/abs/2504.19754) - Analysis of chunking approaches

## 👤 Author

[Nav772](https://huggingface.co/Nav772) - Built as part of an AI/ML Engineering portfolio

## 📚 Related Projects

- [RAG Document Q&A (LangChain version)](https://huggingface.co/spaces/Nav772/rag-document-qa)
- [Movie Sentiment Analyzer](https://huggingface.co/spaces/Nav772/movie-sentiment-analyzer)
- [Amazon Review Rating Predictor](https://huggingface.co/spaces/Nav772/amazon-review-rating-predictor)
- [Food Image Classifier](https://huggingface.co/spaces/Nav772/food-image-classifier)