---
title: RAG Document Q&A Assistant
emoji: 📄
colorFrom: green
colorTo: purple
sdk: gradio
sdk_version: 6.5.1
app_file: app.py
pinned: false
license: mit
---

# 📄 RAG Document Q&A Assistant

A Retrieval-Augmented Generation (RAG) system that answers questions about uploaded documents, cites the source chunks it used, and lets you compare two chunking strategies.

## 🎯 What This Does

1. **Upload** a PDF or TXT document
2. **Choose** a chunking strategy (Fixed-size or Paragraph-based)
3. **Process** the document (chunks it and creates embeddings)
4. **Ask** questions about the document
5. **Get** answers grounded in the document, with relevance scores and source citations

## 🏗️ Architecture
```
Document → Chunking → Embedding → Vector Store (ChromaDB)
                                        ↓
User Question → Embedding → Semantic Search → Retrieved Chunks → GPT-4o-mini → Answer
```

| Component | Technology |
|-----------|------------|
| Embeddings | sentence-transformers/all-MiniLM-L6-v2 (384 dimensions) |
| Vector Store | ChromaDB (in-memory) |
| Chunking | Fixed-size (500 chars, 100 overlap) or Paragraph-based |
| LLM | OpenAI GPT-4o-mini |
| Framework | Gradio |
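
To make the retrieval half of the diagram concrete, here is a dependency-free sketch. It is not the app's code: `embed` is a toy bag-of-words stand-in for the all-MiniLM-L6-v2 embeddings, and a plain Python list stands in for ChromaDB. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (NOT a real sentence embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, chunks: list[str], k: int = 3) -> list[tuple[float, str]]:
    """Score every chunk against the question and return the k best."""
    q = embed(question)
    scored = [(cosine(q, embed(c)), c) for c in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


chunks = [
    "ChromaDB stores the chunk embeddings in memory.",
    "Gradio provides the upload form and chat box.",
    "The retrieved chunks are passed to the language model as context.",
]
top = retrieve("Where are embeddings stored?", chunks, k=1)
print(top[0][1])
```

In the real pipeline the retrieved chunks would then be placed into the prompt sent to GPT-4o-mini.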

## 🔬 Chunking Strategies Compared

This app lets you compare two chunking approaches:

| Strategy | How It Works | Best For |
|----------|--------------|----------|
| **Fixed-size** | Splits text into 500-char chunks with 100-char overlap | Uniform documents, consistent retrieval |
| **Paragraph-based** | Splits on double newlines, preserves natural boundaries | Structured documents, better context |

**Key Insight:** Fixed-size chunking may cut sentences in half, but it produces more, smaller chunks, which gives finer retrieval granularity. Paragraph-based chunking preserves context, but chunk sizes can vary widely.
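
The two strategies in the table can be sketched as follows. This is a minimal illustration using the sizes from the table (500 characters, 100 overlap), not the app's actual implementation. The function names `chunk_fixed` and `chunk_paragraphs` are hypothetical.

```python
import re


def chunk_fixed(text: str, size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into windows of `size` characters that overlap by `overlap`."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


def chunk_paragraphs(text: str) -> list[str]:
    """Split on blank lines and drop empty paragraphs."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


sample = "x" * 1200
print([len(c) for c in chunk_fixed(sample)])  # [500, 500, 400]
print(chunk_paragraphs("Intro.\n\nBody text.\n\n\nConclusion."))
```

With a 100-character overlap, the last 100 characters of each fixed-size chunk reappear at the start of the next one, so a sentence cut at a boundary still appears whole in at least one chunk, provided it is shorter than the overlap.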

## 🛠️ Technical Implementation

### Vector Search Flow
```python
# 1. Document Processing
chunks = chunk_by_strategy(document_text)
collection.add(documents=chunks, ids=chunk_ids)

# 2. Query Processing  
results = collection.query(query_texts=[question], n_results=3)

# 3. Answer Generation
context = format_retrieved_chunks(results)
answer = gpt4o_mini.generate(context + question)
```
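
Step 3 depends on how the retrieved chunks are turned into a prompt. The sketch below shows one plausible shape for `format_retrieved_chunks`. The body is an assumption, as are the `build_prompt` helper and the prompt wording. It relies only on the fact that a ChromaDB query result nests its `ids` and `documents` lists per query.

```python
def format_retrieved_chunks(results: dict) -> str:
    """Turn a ChromaDB-style query result into a numbered context string."""
    ids = results["ids"][0]              # results are nested per query; one query was sent
    documents = results["documents"][0]
    return "\n\n".join(
        f"[{i}] ({chunk_id}) {doc}"
        for i, (chunk_id, doc) in enumerate(zip(ids, documents), start=1)
    )


def build_prompt(context: str, question: str) -> str:
    """Combine context and question; the instruction wording is illustrative."""
    return (
        "Answer the question using only the context below. "
        "Cite sources by their bracketed number.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


# Hand-written result in the same shape collection.query() returns.
results = {
    "ids": [["chunk_0", "chunk_3"]],
    "documents": [["First chunk.", "Second chunk."]],
    "distances": [[0.2, 0.7]],
}
prompt = build_prompt(format_retrieved_chunks(results), "What does the first chunk say?")
print(prompt)
```

Numbering the chunks in the context gives the model something stable to cite, which is one way to produce the source citations shown in the UI.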

### Similarity Scoring

ChromaDB returns distances, where lower means closer. Each distance is mapped to a relevance score between 0 and 1, which is then shown as a percentage:
```python
similarity = 1 / (1 + distance)  # Higher = more relevant
```
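
As a worked example of the formula above, the helper below converts a distance to a percentage. The name `relevance_percent` and the rounding to one decimal place are assumptions made for illustration.

```python
def relevance_percent(distance: float) -> float:
    """Map a non-negative distance to a relevance percentage via 1 / (1 + d)."""
    if distance < 0:
        raise ValueError("distance must be non-negative")
    return round(100.0 / (1.0 + distance), 1)


for d in (0.0, 1.0, 3.0):
    print(d, relevance_percent(d))  # 100.0, 50.0, 25.0
```

The mapping is monotonic, so ranking by score matches ranking by distance. The percentages are a display convenience and should not be read as probabilities.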

## 🧪 Development Challenges

### Challenge 1: Collection Already Exists Error

**Problem:** `chromadb.errors.InternalError: Collection [documents] already exists`

**Cause:** Re-uploading a document without clearing the previous collection.

**Solution:** Delete the existing collection before creating a new one:
```python
try:
    chroma_client.delete_collection(name="documents")
except Exception:
    # Nothing to delete on the first upload; ignore the "not found" error.
    pass
collection = chroma_client.create_collection(name="documents", ...)
```

### Challenge 2: PDF Text Extraction

**Problem:** Some PDFs have unusual formatting, which results in very few chunks.

**Solution:** PyMuPDF (fitz) handles most PDF formats reliably. For problematic PDFs, fixed-size chunking gives more consistent results than paragraph-based chunking.
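
One way to act on this is to check how many paragraphs the extracted text actually contains and fall back to fixed-size chunking when there are too few. The heuristic below is a hypothetical sketch, not a feature of the app. Both `pick_strategy` and the `min_paragraphs` threshold are assumptions.

```python
import re


def pick_strategy(text: str, min_paragraphs: int = 3) -> str:
    """Return "paragraph" if the text has enough blank-line-separated blocks, else "fixed"."""
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    return "paragraph" if len(paragraphs) >= min_paragraphs else "fixed"


print(pick_strategy("One long run of extracted text with no blank lines."))
print(pick_strategy("Intro.\n\nMethods.\n\nResults."))
```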


## 📊 Features

- ✅ PDF and TXT file support
- ✅ Two chunking strategies for comparison
- ✅ Source citations with relevance scores
- ✅ Real-time document statistics
- ✅ Clean, intuitive UI

## 📝 Limitations

- Requires OpenAI API key (uses GPT-4o-mini)
- In-memory vector store (resets on each session)
- Optimized for English-language documents
- Maximum file size limited by HF Spaces

## 📚 Research References

- [RAG Original Paper (Lewis et al., 2020)](https://arxiv.org/abs/2005.11401) - Introduced Retrieval-Augmented Generation
- [RAG Survey (Gao et al., 2023)](https://arxiv.org/pdf/2312.10997) - Comprehensive survey of RAG techniques
- [Chunking Strategies for RAG (Merola & Singh, 2025)](https://arxiv.org/abs/2504.19754) - Analysis of chunking approaches

## 👤 Author

[Nav772](https://huggingface.co/Nav772) - Built as part of AI/ML Engineering portfolio

## 📚 Related Projects

- [RAG Document Q&A (LangChain version)](https://huggingface.co/spaces/Nav772/rag-document-qa)
- [Movie Sentiment Analyzer](https://huggingface.co/spaces/Nav772/movie-sentiment-analyzer)
- [Amazon Review Rating Predictor](https://huggingface.co/spaces/Nav772/amazon-review-rating-predictor)
- [Food Image Classifier](https://huggingface.co/spaces/Nav772/food-image-classifier)