Spaces:

Nav772
/

rag-qa-document

Sleeping

App Files Files Community

rag-qa-document / README.md

Nav772

Update README.md

d4eadaa verified 28 days ago

preview code

raw

history blame contribute delete

4.36 kB

	---
	title: RAG Document Q&A Assistant
	emoji: 📄
	colorFrom: green
	colorTo: purple
	sdk: gradio
	sdk_version: 6.5.1
	app_file: app.py
	pinned: false
	license: mit
	---

	# 📄 RAG Document Q&A Assistant

	A Retrieval-Augmented Generation (RAG) system that answers questions about uploaded documents with source citations and chunking strategy comparison.

	## 🎯 What This Does

	1. Upload a PDF or TXT document
	2. Choose a chunking strategy (Fixed-size or Paragraph-based)
	3. Process the document (chunks it and creates embeddings)
	4. Ask questions about the document
	5. Get accurate answers with relevance scores and source citations

	## 🏗️ Architecture
	```
	Document → Chunking → Embedding → Vector Store (ChromaDB)
	↓
	User Question → Embedding → Semantic Search → Retrieved Chunks → GPT-4o-mini → Answer
	```

	\| Component \| Technology \|
	\|-----------\|------------\|
	\| Embeddings \| sentence-transformers/all-MiniLM-L6-v2 (384 dimensions) \|
	\| Vector Store \| ChromaDB (in-memory) \|
	\| Chunking \| Fixed-size (500 chars, 100 overlap) or Paragraph-based \|
	\| LLM \| OpenAI GPT-4o-mini \|
	\| Framework \| Gradio \|

	## 🔬 Chunking Strategies Compared

	This app lets you compare two chunking approaches:

	\| Strategy \| How It Works \| Best For \|
	\|----------\|--------------\|----------\|
	\| Fixed-size \| Splits text into 500-char chunks with 100-char overlap \| Uniform documents, consistent retrieval \|
	\| Paragraph-based \| Splits on double newlines, preserves natural boundaries \| Structured documents, better context \|

	Key Insight: Fixed-size chunking may cut mid-sentence but creates more chunks for better retrieval granularity. Paragraph-based preserves context but may create uneven chunk sizes.

	## 🛠️ Technical Implementation

	### Vector Search Flow
	```python
	# 1. Document Processing
	chunks = chunk_by_strategy(document_text)
	collection.add(documents=chunks, ids=chunk_ids)

	# 2. Query Processing
	results = collection.query(query_texts=[question], n_results=3)

	# 3. Answer Generation
	context = format_retrieved_chunks(results)
	answer = gpt4o_mini.generate(context + question)
	```

	### Similarity Scoring

	Distances from ChromaDB are converted to intuitive relevance percentages:
	```python
	similarity = 1 / (1 + distance) # Higher = more relevant
	```

	## 🧪 Development Challenges

	### Challenge 1: Collection Already Exists Error

	Problem: `chromadb.errors.InternalError: Collection [documents] already exists`
	Cause: Re-uploading documents without clearing the previous collection.
	Solution: Delete existing collection before creating new one:
	```python
	try:
	chroma_client.delete_collection(name="documents")
	except:
	pass
	collection = chroma_client.create_collection(name="documents", ...)
	```

	### Challenge 2: PDF Text Extraction

	Problem: Some PDFs have unusual formatting resulting in few chunks.
	Solution: PyMuPDF (fitz) handles most PDF formats reliably. For problematic PDFs, fixed-size chunking provides more consistent results than paragraph-based.


	## 📊 Features

	- ✅ PDF and TXT file support
	- ✅ Two chunking strategies for comparison
	- ✅ Source citations with relevance scores
	- ✅ Real-time document statistics
	- ✅ Clean, intuitive UI

	## 📝 Limitations

	- Requires OpenAI API key (uses GPT-4o-mini)
	- In-memory vector store (resets on each session)
	- English language optimized
	- Maximum file size limited by HF Spaces

	## 📚 Research References

	- [RAG Original Paper (Lewis et al., 2020)](https://arxiv.org/abs/2005.11401) - Introduced Retrieval-Augmented Generation
	- [RAG Survey (Gao et al., 2023)](https://arxiv.org/pdf/2312.10997) - Comprehensive survey of RAG techniques
	- [Chunking Strategies for RAG (Merola & Singh, 2025)](https://arxiv.org/abs/2504.19754) - Analysis of chunking approaches

	## 👤 Author

	[Nav772](https://huggingface.co/Nav772) - Built as part of AI/ML Engineering portfolio

	## 📚 Related Projects

	- [RAG Document Q&A (LangChain version)](https://huggingface.co/spaces/Nav772/rag-document-qa)
	- [Movie Sentiment Analyzer](https://huggingface.co/spaces/Nav772/movie-sentiment-analyzer)
	- [Amazon Review Rating Predictor](https://huggingface.co/spaces/Nav772/amazon-review-rating-predictor)
	- [Food Image Classifier](https://huggingface.co/spaces/Nav772/food-image-classifier)