---
title: RAG Document Q&A Assistant
emoji: π
colorFrom: green
colorTo: purple
sdk: gradio
sdk_version: 6.5.1
app_file: app.py
pinned: false
license: mit
---

# RAG Document Q&A Assistant

A Retrieval-Augmented Generation (RAG) system that answers questions about uploaded documents, cites the source chunks it used, and lets you compare chunking strategies.

## 🎯 What This Does

1. **Upload** a PDF or TXT document
2. **Choose** a chunking strategy (Fixed-size or Paragraph-based)
3. **Process** the document (the app chunks it and creates embeddings)
4. **Ask** questions about the document
5. **Get** answers with relevance scores and source citations

## 🏗️ Architecture

```
Document → Chunking → Embedding → Vector Store (ChromaDB)
                                        ↓
User Question → Embedding → Semantic Search → Retrieved Chunks → GPT-4o-mini → Answer
```

| Component | Technology |
|-----------|------------|
| Embeddings | sentence-transformers/all-MiniLM-L6-v2 (384 dimensions) |
| Vector Store | ChromaDB (in-memory) |
| Chunking | Fixed-size (500 chars, 100 overlap) or Paragraph-based |
| LLM | OpenAI GPT-4o-mini |
| Framework | Gradio |

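The semantic search step in the diagram can be illustrated with a toy example. This is only a sketch: the real app uses 384-dimensional embeddings stored in ChromaDB, while the snippet below uses hand-written 2-dimensional vectors and plain Euclidean distance. The function name `top_k` and the choice of metric are assumptions made for illustration.

```python
import math


def top_k(query_vec, chunk_vecs, k=3):
    """Return (chunk_index, distance) pairs for the k closest chunk vectors.

    Assumption: Euclidean distance stands in for whatever metric the
    vector store is configured with.
    """
    scored = [(math.dist(query_vec, vec), idx) for idx, vec in enumerate(chunk_vecs)]
    scored.sort()  # smallest distance first
    return [(idx, dist) for dist, idx in scored[:k]]


# Toy "embeddings": chunks 0 and 2 point in nearly the same direction as the query.
chunk_vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
query_vector = [1.0, 0.0]

for idx, dist in top_k(query_vector, chunk_vectors, k=2):
    print(f"chunk {idx}: distance {dist:.3f}")
```

The chunks with the smallest distances are the ones passed to the LLM as context.
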
## 🔬 Chunking Strategies Compared

This app lets you compare two chunking approaches:

| Strategy | How It Works | Best For |
|----------|--------------|----------|
| **Fixed-size** | Splits text into 500-char chunks with 100-char overlap | Uniform documents, consistent retrieval |
| **Paragraph-based** | Splits on double newlines, preserves natural boundaries | Structured documents, better context |

**Key Insight:** Fixed-size chunking can cut a sentence in half, but it produces more chunks of similar size, which gives finer retrieval granularity. Paragraph-based chunking keeps related sentences together, but chunk sizes can vary widely.

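The two strategies in the table can be sketched in a few lines of Python. This is a minimal illustration, not the app's actual code: the function names are hypothetical, and the defaults (500 characters, 100 overlap, split on blank lines) are taken from the table above.

```python
import re


def chunk_fixed_size(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into fixed-size character windows that overlap."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # each window starts `step` chars after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk.strip():
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


def chunk_by_paragraph(text: str) -> list[str]:
    """Split on blank lines and drop empty paragraphs."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


sample = "Intro paragraph.\n\nSecond paragraph with more detail.\n\nClosing paragraph."
print(len(chunk_fixed_size(sample)), "fixed-size chunk(s)")
print(len(chunk_by_paragraph(sample)), "paragraph chunk(s)")
```

With these defaults, consecutive fixed-size chunks share their last and first 100 characters, so a sentence cut at one boundary still appears whole in a neighbouring chunk whenever it is shorter than the overlap.
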
## 🛠️ Technical Implementation

### Vector Search Flow

```python
# 1. Document Processing
chunks = chunk_by_strategy(document_text)
collection.add(documents=chunks, ids=chunk_ids)

# 2. Query Processing
results = collection.query(query_texts=[question], n_results=3)

# 3. Answer Generation
context = format_retrieved_chunks(results)
answer = gpt4o_mini.generate(context + question)
```

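Step 3 of the flow above relies on a `format_retrieved_chunks` helper whose body is not shown. The sketch below is one plausible version, written as an assumption: it expects the dictionary shape that `collection.query` returns (one inner list per query under `"documents"` and `"distances"`), and the source labels and the `build_prompt` helper are hypothetical.

```python
def format_retrieved_chunks(results: dict) -> str:
    """Turn a ChromaDB-style query result into a numbered context string."""
    documents = results["documents"][0]  # results for the first (only) query
    distances = results["distances"][0]
    blocks = []
    for rank, (doc, dist) in enumerate(zip(documents, distances), start=1):
        blocks.append(f"[Source {rank} | distance {dist:.3f}]\n{doc}")
    return "\n\n".join(blocks)


def build_prompt(context: str, question: str) -> str:
    """Hypothetical prompt template; the app's real wording is not shown."""
    return (
        "Answer the question using only the context below. "
        "Cite sources as [Source N].\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


fake_results = {
    "documents": [["Alpha text", "Beta text"]],
    "distances": [[0.25, 0.5]],
}
print(build_prompt(format_retrieved_chunks(fake_results), "What is alpha?"))
```

Numbering the chunks in the context is what lets the model point back to a specific source in its answer.
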
### Similarity Scoring

ChromaDB returns a distance for each retrieved chunk, where a lower value means a closer match. The app converts each distance to a relevance score between 0 and 1, which is displayed as a percentage:

```python
similarity = 1 / (1 + distance)  # Higher = more relevant
```

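A few worked values make the mapping concrete. The snippet below is a sketch that wraps the formula above in a hypothetical helper, `distance_to_relevance`, and assumes the percentage is simply the score multiplied by 100.

```python
def distance_to_relevance(distance: float) -> float:
    """Map a non-negative distance to a relevance percentage in (0, 100]."""
    if distance < 0:
        raise ValueError("distance must be non-negative")
    return 100.0 / (1.0 + distance)


for d in (0.0, 1.0, 3.0):
    print(f"distance {d} -> {distance_to_relevance(d):.1f}%")
# distance 0.0 -> 100.0%
# distance 1.0 -> 50.0%
# distance 3.0 -> 25.0%
```

The score is a monotonic rescaling of the distance, not a probability: it preserves the ranking of chunks but the absolute percentage depends on the distance metric in use.
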
## 🧪 Development Challenges

### Challenge 1: Collection Already Exists Error

**Problem:** `chromadb.errors.InternalError: Collection [documents] already exists`

**Cause:** Re-uploading a document without clearing the previous collection.

**Solution:** Delete the existing collection before creating a new one:

```python
try:
    chroma_client.delete_collection(name="documents")
except Exception:
    # No collection yet (e.g. first upload), so there is nothing to delete
    pass

collection = chroma_client.create_collection(name="documents", ...)
```

### Challenge 2: PDF Text Extraction

**Problem:** Some PDFs have unusual formatting, which results in very few chunks.

**Solution:** PyMuPDF (fitz) handles most PDF formats reliably. For problematic PDFs, fixed-size chunking gives more consistent results than paragraph-based chunking.

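One way to act on this advice automatically is to check how many paragraph breaks the extracted text actually contains and fall back to fixed-size chunking when there are too few. The helper below is a hypothetical sketch, not part of the app: the name `pick_chunking_strategy` and the threshold of three paragraphs are assumptions.

```python
import re


def pick_chunking_strategy(text: str, min_paragraphs: int = 3) -> str:
    """Suggest 'paragraph' only if the text has enough blank-line breaks."""
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    return "paragraph" if len(paragraphs) >= min_paragraphs else "fixed"


# Extracted text with only single newlines has no paragraph boundaries.
print(pick_chunking_strategy("line one\nline two\nline three"))
print(pick_chunking_strategy("Intro.\n\nBody.\n\nConclusion."))
```
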
## Features

- ✅ PDF and TXT file support
- ✅ Two chunking strategies for comparison
- ✅ Source citations with relevance scores
- ✅ Real-time document statistics
- ✅ Clean, intuitive UI

## Limitations

- Requires an OpenAI API key (uses GPT-4o-mini)
- The vector store is in-memory, so it resets with each session
- Optimized for English-language documents
- Maximum file size is limited by Hugging Face Spaces

## Research References

- [RAG Original Paper (Lewis et al., 2020)](https://arxiv.org/abs/2005.11401) - Introduced Retrieval-Augmented Generation
- [RAG Survey (Gao et al., 2023)](https://arxiv.org/pdf/2312.10997) - Comprehensive survey of RAG techniques
- [Chunking Strategies for RAG (Merola & Singh, 2025)](https://arxiv.org/abs/2504.19754) - Analysis of chunking approaches

## Author

[Nav772](https://huggingface.co/Nav772) - Built as part of AI/ML Engineering portfolio

## Related Projects

- [RAG Document Q&A (LangChain version)](https://huggingface.co/spaces/Nav772/rag-document-qa)
- [Movie Sentiment Analyzer](https://huggingface.co/spaces/Nav772/movie-sentiment-analyzer)
- [Amazon Review Rating Predictor](https://huggingface.co/spaces/Nav772/amazon-review-rating-predictor)
- [Food Image Classifier](https://huggingface.co/spaces/Nav772/food-image-classifier)