Spaces: Nav772/rag-qa-document

Navneet Sai committed on Feb 14

Commit

47dd5e4

1 Parent(s): 997fea2

Update Readme with detailed documentation

README.md +110 -18

@@ -7,32 +7,124 @@ sdk: gradio
 sdk_version: 4.44.1
 app_file: app.py
 pinned: false
 ---
-# RAG Document Q&A Assistant
-Upload a PDF or TXT document and ask questions about its content.
-## How It Works
-1. **Document Processing**: Your document is split into chunks using the selected strategy (fixed-size or paragraph-based)
-2. **Embedding**: Chunks are embedded using Sentence Transformers (all-MiniLM-L6-v2)
-3. **Retrieval**: When you ask a question, relevant chunks are retrieved using semantic search via ChromaDB
-4. **Generation**: GPT-4o-mini generates an answer based on the retrieved context
-## Features
-- PDF and TXT file support
-- Two chunking strategies for comparison
-- Source citations with relevance scores
-- Built with Gradio, ChromaDB, and OpenAI API
-## References
-- [RAG Original Paper (Lewis et al., 2020)](https://arxiv.org/abs/2005.11401)
-- [RAG Survey (Gao et al., 2023)](https://arxiv.org/pdf/2312.10997)
-- [Chunking Strategies for RAG (Merola & Singh, 2025)](https://arxiv.org/abs/2504.19754)
-## Author
-Built as part of an AI/ML Engineering portfolio project.

 sdk_version: 4.44.1
 app_file: app.py
 pinned: false
+license: mit
 ---
+# 📄 RAG Document Q&A Assistant
+A Retrieval-Augmented Generation (RAG) system that answers questions about uploaded documents with source citations and chunking strategy comparison.
+## 🎯 What This Does
+1. **Upload** a PDF or TXT document
+2. **Choose** a chunking strategy (Fixed-size or Paragraph-based)
+3. **Process** the document (chunks it and creates embeddings)
+4. **Ask** questions about the document
+5. **Get** answers grounded in the retrieved chunks, with relevance scores and source citations
+## 🏗️ Architecture
+```
+Document → Chunking → Embedding → Vector Store (ChromaDB)
+                                        ↓
+User Question → Embedding → Semantic Search → Retrieved Chunks → GPT-4o-mini → Answer
+```
+| Component | Technology |
+|-----------|------------|
+| Embeddings | sentence-transformers/all-MiniLM-L6-v2 (384 dimensions) |
+| Vector Store | ChromaDB (in-memory) |
+| Chunking | Fixed-size (500 chars, 100 overlap) or Paragraph-based |
+| LLM | OpenAI GPT-4o-mini |
+| Framework | Gradio |
+## 🔬 Chunking Strategies Compared
+This app lets you compare two chunking approaches:
+| Strategy | How It Works | Best For |
+|----------|--------------|----------|
+| **Fixed-size** | Splits text into 500-char chunks with 100-char overlap | Uniform documents, consistent retrieval |
+| **Paragraph-based** | Splits on double newlines, preserves natural boundaries | Structured documents, better context |
+**Key Insight:** Fixed-size chunking may cut mid-sentence, but it usually yields more, uniformly sized chunks, which gives finer retrieval granularity. Paragraph-based chunking preserves natural context but may produce uneven chunk sizes.
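The two strategies in the table can be sketched in a few lines of plain Python. This is a minimal illustration, not the code in `app.py`: the function names `chunk_fixed` and `chunk_paragraphs` are hypothetical, and the 500/100 defaults are taken from the table above.

```python
import re


def chunk_fixed(text: str, size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into fixed-size character windows that overlap."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap  # each window starts this far after the previous one
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the current window already reaches the end of the text
        start += step
    return chunks


def chunk_paragraphs(text: str) -> list[str]:
    """Split on blank lines and drop empty pieces."""
    parts = re.split(r"\n\s*\n", text)
    return [p.strip() for p in parts if p.strip()]


# Hypothetical sample inputs
digits = "".join(str(i % 10) for i in range(1000))
fixed = chunk_fixed(digits)
paragraphs = chunk_paragraphs("One.\n\nTwo.\n\n\nThree.")
print([len(c) for c in fixed], paragraphs)
```

Under these assumptions, a text with no blank lines comes back from `chunk_paragraphs` as a single large chunk, while `chunk_fixed` still produces evenly sized pieces.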
+## 🛠️ Technical Implementation
+### Vector Search Flow
+```python
+# 1. Document Processing
+chunks = chunk_by_strategy(document_text)
+collection.add(documents=chunks, ids=chunk_ids)
+# 2. Query Processing
+results = collection.query(query_texts=[question], n_results=3)
+# 3. Answer Generation
+context = format_retrieved_chunks(results)
+answer = gpt4o_mini.generate(context + question)
+```
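Step 3 depends on how the retrieved chunks are turned into a prompt. The sketch below is one hypothetical way to write `format_retrieved_chunks`, plus a `build_prompt` helper that is not in the source. It assumes the nested-list result shape that ChromaDB's `collection.query` returns for a single query text; the prompt wording and the sample data are illustrative only.

```python
def format_retrieved_chunks(results: dict) -> str:
    """Number each retrieved chunk so the answer can cite its sources."""
    documents = results["documents"][0]  # one inner list per query text
    return "\n\n".join(
        f"[Source {i}] {doc}" for i, doc in enumerate(documents, start=1)
    )


def build_prompt(context: str, question: str) -> str:
    """Combine the retrieved context and the question into one prompt string."""
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Hand-written stand-in for a query result with n_results=2 (assumed data)
sample = {
    "ids": [["chunk_0", "chunk_3"]],
    "documents": [[
        "RAG combines retrieval with generation.",
        "Chunks are embedded before search.",
    ]],
    "distances": [[0.42, 0.77]],
}
prompt = build_prompt(format_retrieved_chunks(sample), "What does RAG combine?")
print(prompt)
```

Keeping the source labels in the context string is one way to let the model refer back to specific chunks when the UI shows citations.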
+### Similarity Scoring
+Distances from ChromaDB are converted to a relevance score between 0 and 1, which is displayed as a percentage:
+```python
+similarity = 1 / (1 + distance)  # Higher = more relevant
+```
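As a worked example of the conversion, the sketch below maps a few distances to percentages. The helper name `to_relevance_percent` and the one-decimal rounding are assumptions for illustration; only the `1 / (1 + distance)` formula comes from the README. It also assumes a non-negative distance metric, such as ChromaDB's default squared L2.

```python
def to_relevance_percent(distance: float) -> float:
    """Map a non-negative distance to a relevance score on a 0-100 scale."""
    if distance < 0:
        raise ValueError("distance must be non-negative")
    return round(100 / (1 + distance), 1)


for d in (0.0, 0.5, 1.0, 3.0):
    print(d, to_relevance_percent(d))
```

A distance of 0 maps to 100%, and the score shrinks toward 0 as distance grows without ever reaching it. The value is a ranking aid, not a calibrated probability.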
+## 🧪 Development Challenges
+### Challenge 1: Collection Already Exists Error
+**Problem:** `chromadb.errors.InternalError: Collection [documents] already exists`
+**Cause:** Re-uploading documents without clearing the previous collection.
+**Solution:** Delete the existing collection before creating a new one:
+```python
+try:
+    chroma_client.delete_collection(name="documents")
+except Exception:  # nothing to delete on the first upload
+    pass
+collection = chroma_client.create_collection(name="documents", ...)
+```
+### Challenge 2: PDF Text Extraction
+**Problem:** Some PDFs have unusual formatting, so the extracted text splits into very few chunks.
+**Solution:** PyMuPDF (fitz) handles most PDF formats reliably. For problematic PDFs, fixed-size chunking gives more consistent results than paragraph-based chunking, since it does not depend on paragraph breaks surviving extraction.
+### Challenge 3: Hugging Face Dependency Conflicts
+**Problem:** `ImportError: cannot import name 'HfFolder' from 'huggingface_hub'`
+**Cause:** Version mismatch between gradio and huggingface-hub.
+**Solution:** Pin specific compatible versions in requirements.txt.
+## 📊 Features
+- ✅ PDF and TXT file support
+- ✅ Two chunking strategies for comparison
+- ✅ Source citations with relevance scores
+- ✅ Real-time document statistics
+- ✅ Clean, intuitive UI
+## 📝 Limitations
+- Requires OpenAI API key (uses GPT-4o-mini)
+- In-memory vector store (resets on each session)
+- Optimized for English-language text
+- Maximum file size limited by HF Spaces
+## 📚 Research References
+- [RAG Original Paper (Lewis et al., 2020)](https://arxiv.org/abs/2005.11401) - Introduced Retrieval-Augmented Generation
+- [RAG Survey (Gao et al., 2023)](https://arxiv.org/pdf/2312.10997) - Comprehensive survey of RAG techniques
+- [Chunking Strategies for RAG (Merola & Singh, 2025)](https://arxiv.org/abs/2504.19754) - Analysis of chunking approaches
+## 👤 Author
+[Nav772](https://huggingface.co/Nav772) - Built as part of an AI/ML Engineering portfolio
+## 📚 Related Projects
+- [RAG Document Q&A (LangChain version)](https://huggingface.co/spaces/Nav772/rag-document-qa)
+- [Movie Sentiment Analyzer](https://huggingface.co/spaces/Nav772/movie-sentiment-analyzer)
+- [Amazon Review Rating Predictor](https://huggingface.co/spaces/Nav772/amazon-review-rating-predictor)
+- [Food Image Classifier](https://huggingface.co/spaces/Nav772/food-image-classifier)