---
title: RAG Document Q&A Assistant
emoji: 📄
colorFrom: green
colorTo: purple
sdk: gradio
sdk_version: 6.5.1
app_file: app.py
pinned: false
license: mit
---
# 📄 RAG Document Q&A Assistant
A Retrieval-Augmented Generation (RAG) system that answers questions about uploaded documents, cites the source chunks behind each answer, and lets you compare chunking strategies.
## 🎯 What This Does
1. **Upload** a PDF or TXT document
2. **Choose** a chunking strategy (Fixed-size or Paragraph-based)
3. **Process** the document (chunks it and creates embeddings)
4. **Ask** questions about the document
5. **Get** accurate answers with relevance scores and source citations
## 🏗️ Architecture
```
Document → Chunking → Embedding → Vector Store (ChromaDB)
                                    ↓
User Question → Embedding → Semantic Search → Retrieved Chunks → GPT-4o-mini → Answer
```
| Component | Technology |
|-----------|------------|
| Embeddings | sentence-transformers/all-MiniLM-L6-v2 (384 dimensions) |
| Vector Store | ChromaDB (in-memory) |
| Chunking | Fixed-size (500 chars, 100 overlap) or Paragraph-based |
| LLM | OpenAI GPT-4o-mini |
| Framework | Gradio |
## 🔬 Chunking Strategies Compared
This app lets you compare two chunking approaches:
| Strategy | How It Works | Best For |
|----------|--------------|----------|
| **Fixed-size** | Splits text into 500-char chunks with 100-char overlap | Uniform documents, consistent retrieval |
| **Paragraph-based** | Splits on double newlines, preserves natural boundaries | Structured documents, better context |
**Key Insight:** Fixed-size chunking may cut sentences in the middle, but it produces more chunks, which gives finer retrieval granularity. Paragraph-based chunking preserves context but can produce chunks of very uneven size.
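The two strategies in the table can be sketched in a few lines. This is a minimal illustration, not the app's actual code: the function names `chunk_fixed_size` and `chunk_by_paragraph` are hypothetical, and only the 500/100 defaults and the double-newline split come from the description above.

```python
def chunk_fixed_size(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into chunk_size-character windows that overlap by `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # each new chunk starts this many characters later
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk.strip():  # skip whitespace-only windows
            chunks.append(chunk)
        if start + chunk_size >= len(text):  # this window already reached the end
            break
    return chunks


def chunk_by_paragraph(text: str) -> list[str]:
    """Split on double newlines and drop empty paragraphs."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


if __name__ == "__main__":
    sample = "First paragraph.\n\nSecond paragraph, a bit longer.\n\nThird."
    print(len(chunk_fixed_size(sample)), "fixed-size chunk(s)")
    print(len(chunk_by_paragraph(sample)), "paragraph chunk(s)")
```

With the defaults, each fixed-size chunk starts 400 characters after the previous one, so a 1,200-character document yields three chunks of 500, 500, and 400 characters.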
## 🛠️ Technical Implementation
### Vector Search Flow
```python
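# Simplified pseudocode of the pipeline; gpt4o_mini.generate stands in for the GPT-4o-mini call.

# 1. Document processing: chunk the text and index it in ChromaDB
chunks = chunk_by_strategy(document_text)
chunk_ids = [f"chunk_{i}" for i in range(len(chunks))]  # illustrative ID scheme
collection.add(documents=chunks, ids=chunk_ids)

# 2. Query processing: retrieve the 3 chunks closest to the question
results = collection.query(query_texts=[question], n_results=3)

# 3. Answer generation: answer from the retrieved context
context = format_retrieved_chunks(results)
answer = gpt4o_mini.generate(context + question)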
```
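The body of `format_retrieved_chunks` is not shown in this README. The sketch below is one possible shape for it, assuming the ChromaDB query result layout (a dict whose `ids` and `documents` entries hold one inner list per query text). The `[Source N | id]` label format and the `build_prompt` helper are assumptions made for illustration.

```python
def format_retrieved_chunks(results: dict) -> str:
    """Turn a ChromaDB-style query result into a numbered context block."""
    # query() returns one inner list per query text; this sketch assumes a single query.
    ids = results["ids"][0]
    documents = results["documents"][0]
    blocks = []
    for rank, (chunk_id, text) in enumerate(zip(ids, documents), start=1):
        # Label each chunk so the model can cite it in the answer
        blocks.append(f"[Source {rank} | {chunk_id}]\n{text.strip()}")
    return "\n\n".join(blocks)


def build_prompt(context: str, question: str) -> str:
    """Hypothetical prompt template; the app's real wording may differ."""
    return (
        "Answer the question using only the context below and cite the sources.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


if __name__ == "__main__":
    example_results = {
        "ids": [["chunk_0", "chunk_3"]],
        "documents": [["RAG combines retrieval with generation.", "ChromaDB stores the embeddings."]],
        "distances": [[0.42, 0.87]],
    }
    print(build_prompt(format_retrieved_chunks(example_results), "What does RAG combine?"))
```

Keeping the source labels inside the context is one simple way to make citations possible: the model sees which chunk each passage came from.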
### Similarity Scoring
ChromaDB returns distances, where lower means closer. These are mapped to a score between 0 and 1 and shown as an intuitive relevance percentage:
```python
similarity = 1 / (1 + distance) # Higher = more relevant
```
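As a worked example of this formula, the sketch below wraps it in a small function and scales the result to a percentage. The name `distance_to_relevance` is hypothetical, and the multiplication by 100 is an assumption about how the score is displayed. The mapping is a monotone rescaling rather than a calibrated probability, and the range of input distances depends on the collection's distance metric.

```python
def distance_to_relevance(distance: float) -> float:
    """Map a non-negative distance to a relevance percentage in (0, 100]."""
    if distance < 0:
        raise ValueError("distance must be non-negative")
    return 100.0 / (1.0 + distance)  # same as 1 / (1 + distance), scaled to percent


if __name__ == "__main__":
    for d in (0.0, 0.5, 1.0, 3.0):
        print(f"distance={d:.1f} -> relevance={distance_to_relevance(d):.1f}%")
```

A distance of 0 maps to 100%, a distance of 1 to 50%, and a distance of 3 to 25%, so the score falls off quickly at first and then flattens.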
## 🧪 Development Challenges
### Challenge 1: Collection Already Exists Error
**Problem:** `chromadb.errors.InternalError: Collection [documents] already exists`
**Cause:** Re-uploading documents without clearing the previous collection.
**Solution:** Delete the existing collection before creating a new one:
```python
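try:
    # Drop the collection left over from a previous upload, if any
    chroma_client.delete_collection(name="documents")
except Exception:
    # No collection exists yet (first upload), so there is nothing to delete
    pass
collection = chroma_client.create_collection(name="documents", ...)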
```
### Challenge 2: PDF Text Extraction
**Problem:** Some PDFs have unusual formatting, which results in very few chunks.
**Solution:** PyMuPDF (fitz) handles most PDF formats reliably. For problematic PDFs, fixed-size chunking provides more consistent results than paragraph-based.
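A minimal sketch of the extraction step, assuming PyMuPDF's `fitz.open` and `page.get_text()` calls. The function name `extract_text` and the choice to join pages with a blank line are assumptions, not necessarily what the app does.

```python
from pathlib import Path


def extract_text(path: str) -> str:
    """Return the text of a .pdf (via PyMuPDF) or of a plain-text file."""
    file_path = Path(path)
    if file_path.suffix.lower() == ".pdf":
        import fitz  # PyMuPDF; imported lazily so TXT-only use needs no extra dependency

        with fitz.open(str(file_path)) as doc:
            # Join pages with a blank line so paragraph-based chunking sees a boundary
            return "\n\n".join(page.get_text() for page in doc)
    # Treat everything else as text; undecodable bytes are replaced, not fatal
    return file_path.read_text(encoding="utf-8", errors="replace")
```

If a PDF comes back with few or no double newlines, paragraph-based chunking has little to split on, which is consistent with fixed-size chunking being the more predictable choice for such files.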
## 📊 Features
- ✅ PDF and TXT file support
- ✅ Two chunking strategies for comparison
- ✅ Source citations with relevance scores
- ✅ Real-time document statistics
- ✅ Clean, intuitive UI
## 📝 Limitations
- Requires an OpenAI API key (answers are generated with GPT-4o-mini)
- In-memory vector store (the index resets with each session)
- Optimized for English-language documents
- Maximum file size is limited by Hugging Face Spaces
## 📚 Research References
- [RAG Original Paper (Lewis et al., 2020)](https://arxiv.org/abs/2005.11401) - Introduced Retrieval-Augmented Generation
- [RAG Survey (Gao et al., 2023)](https://arxiv.org/pdf/2312.10997) - Comprehensive survey of RAG techniques
- [Chunking Strategies for RAG (Merola & Singh, 2025)](https://arxiv.org/abs/2504.19754) - Analysis of chunking approaches
## 👤 Author
[Nav772](https://huggingface.co/Nav772) - Built as part of an AI/ML engineering portfolio
## 📚 Related Projects
- [RAG Document Q&A (LangChain version)](https://huggingface.co/spaces/Nav772/rag-document-qa)
- [Movie Sentiment Analyzer](https://huggingface.co/spaces/Nav772/movie-sentiment-analyzer)
- [Amazon Review Rating Predictor](https://huggingface.co/spaces/Nav772/amazon-review-rating-predictor)
- [Food Image Classifier](https://huggingface.co/spaces/Nav772/food-image-classifier)