---
title: RAG Document Q&A Assistant
emoji: π
colorFrom: green
colorTo: purple
sdk: gradio
sdk_version: 6.5.1
app_file: app.py
pinned: false
license: mit
---
# RAG Document Q&A Assistant
A Retrieval-Augmented Generation (RAG) system that answers questions about uploaded documents with source citations and chunking strategy comparison.
## What This Does
1. **Upload** a PDF or TXT document
2. **Choose** a chunking strategy (Fixed-size or Paragraph-based)
3. **Process** the document (chunks it and creates embeddings)
4. **Ask** questions about the document
5. **Get** accurate answers with relevance scores and source citations
## Architecture
```
Document → Chunking → Embedding → Vector Store (ChromaDB)
                                        ↓
User Question → Embedding → Semantic Search → Retrieved Chunks → GPT-4o-mini → Answer
```
| Component | Technology |
|-----------|------------|
| Embeddings | sentence-transformers/all-MiniLM-L6-v2 (384 dimensions) |
| Vector Store | ChromaDB (in-memory) |
| Chunking | Fixed-size (500 chars, 100 overlap) or Paragraph-based |
| LLM | OpenAI GPT-4o-mini |
| Framework | Gradio |
## Chunking Strategies Compared
This app lets you compare two chunking approaches:
| Strategy | How It Works | Best For |
|----------|--------------|----------|
| **Fixed-size** | Splits text into 500-char chunks with 100-char overlap | Uniform documents, consistent retrieval |
| **Paragraph-based** | Splits on double newlines, preserves natural boundaries | Structured documents, better context |
**Key Insight:** Fixed-size chunking may cut mid-sentence, but it produces more chunks, which gives finer retrieval granularity. Paragraph-based chunking preserves context but can produce uneven chunk sizes.
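The two strategies can be sketched in a few lines of plain Python. This is a minimal illustration, not the app's actual code: the function names `chunk_fixed_size` and `chunk_by_paragraph` are hypothetical, and the defaults simply mirror the 500-character size and 100-character overlap stated in the table.

```python
import re


def chunk_fixed_size(text: str, size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into windows of `size` characters that overlap by `overlap`."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    step = size - overlap  # each window starts `step` characters after the last
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reached the end of the text
    return chunks


def chunk_by_paragraph(text: str) -> list[str]:
    """Split on blank lines (double newlines) and drop empty pieces."""
    parts = re.split(r"\n\s*\n", text)
    return [p.strip() for p in parts if p.strip()]
```

With these defaults, a 1,200-character document yields three fixed-size chunks of 500, 500, and 400 characters, while the paragraph splitter returns one chunk per paragraph regardless of length.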
## Technical Implementation
### Vector Search Flow
```python
# 1. Document Processing
chunks = chunk_by_strategy(document_text)
collection.add(documents=chunks, ids=chunk_ids)
# 2. Query Processing
results = collection.query(query_texts=[question], n_results=3)
# 3. Answer Generation
context = format_retrieved_chunks(results)
answer = gpt4o_mini.generate(context + question)
```
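Step 3 depends on how the retrieved chunks are turned into a prompt. As a rough sketch, assuming `results` has the nested-list shape that ChromaDB's `collection.query` returns (one inner list per query text), a formatter might look like the following. The body of `format_retrieved_chunks`, the `build_prompt` helper, and the prompt wording are illustrative assumptions, not the app's actual implementation.

```python
def format_retrieved_chunks(results: dict) -> str:
    """Join the retrieved chunks for the first query into a numbered context block."""
    documents = results["documents"][0]  # chunks for the first (only) query
    lines = [f"[Source {i}] {doc}" for i, doc in enumerate(documents, start=1)]
    return "\n\n".join(lines)


def build_prompt(context: str, question: str) -> str:
    """Combine context and question into one prompt string (wording is hypothetical)."""
    return (
        "Answer the question using only the context below. "
        "Cite sources as [Source N].\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


# Example with a hand-written result in the ChromaDB query shape
sample = {"documents": [["Chunk A text.", "Chunk B text."]], "distances": [[0.2, 0.9]]}
prompt = build_prompt(format_retrieved_chunks(sample), "What does chunk A say?")
```

Numbering the chunks gives the model stable labels to cite, which is one simple way to support source citations in the answer.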
### Similarity Scoring
Distances from ChromaDB (lower means closer) are converted to a similarity score between 0 and 1, which is shown as a relevance percentage:
```python
similarity = 1 / (1 + distance) # Higher = more relevant
```
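To make the conversion concrete, here is a small worked example. It assumes non-negative distances, which holds for L2 and cosine distance; the helper name `relevance_percent` is hypothetical.

```python
def relevance_percent(distance: float) -> float:
    """Map a non-negative distance to a relevance score in (0, 100]."""
    if distance < 0:
        raise ValueError("distance must be non-negative")
    similarity = 1 / (1 + distance)  # distance 0 -> 1.0; larger distances approach 0
    return round(similarity * 100, 1)


for d in (0.0, 0.25, 1.0, 3.0):
    print(d, relevance_percent(d))
```

A distance of 0 maps to 100%, a distance of 1 to 50%, and a distance of 3 to 25%. The score never reaches 0, so it is best read as a monotone rescaling for display rather than a calibrated probability.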
## Development Challenges
### Challenge 1: Collection Already Exists Error
**Problem:** `chromadb.errors.InternalError: Collection [documents] already exists`
**Cause:** Re-uploading documents without clearing the previous collection.
**Solution:** Delete the existing collection before creating a new one:
```python
try:
    chroma_client.delete_collection(name="documents")
except Exception:
    # No previous collection to delete (e.g. on the first upload)
    pass
collection = chroma_client.create_collection(name="documents", ...)
```
### Challenge 2: PDF Text Extraction
**Problem:** Some PDFs have unusual formatting, which results in very few chunks.
**Solution:** PyMuPDF (fitz) handles most PDF formats reliably. For problematic PDFs, fixed-size chunking provides more consistent results than paragraph-based.
## Features
- ✅ PDF and TXT file support
- ✅ Two chunking strategies for comparison
- ✅ Source citations with relevance scores
- ✅ Real-time document statistics
- ✅ Clean, intuitive UI
## Limitations
- Requires OpenAI API key (uses GPT-4o-mini)
- In-memory vector store (resets on each session)
- Optimized for English-language documents
- Maximum file size limited by HF Spaces
## Research References
- [RAG Original Paper (Lewis et al., 2020)](https://arxiv.org/abs/2005.11401) - Introduced Retrieval-Augmented Generation
- [RAG Survey (Gao et al., 2023)](https://arxiv.org/pdf/2312.10997) - Comprehensive survey of RAG techniques
- [Chunking Strategies for RAG (Merola & Singh, 2025)](https://arxiv.org/abs/2504.19754) - Analysis of chunking approaches
## Author
[Nav772](https://huggingface.co/Nav772) - Built as part of an AI/ML Engineering portfolio
## Related Projects
- [RAG Document Q&A (LangChain version)](https://huggingface.co/spaces/Nav772/rag-document-qa)
- [Movie Sentiment Analyzer](https://huggingface.co/spaces/Nav772/movie-sentiment-analyzer)
- [Amazon Review Rating Predictor](https://huggingface.co/spaces/Nav772/amazon-review-rating-predictor)
- [Food Image Classifier](https://huggingface.co/spaces/Nav772/food-image-classifier)