Navneet Sai committed on
Commit 47dd5e4 · 1 Parent(s): 997fea2

Update Readme with detailed documentation

Files changed (1)
  1. README.md +110 -18
README.md CHANGED
@@ -7,32 +7,124 @@ sdk: gradio
  sdk_version: 4.44.1
  app_file: app.py
  pinned: false
  ---
 
- # RAG Document Q&A Assistant

- Upload a PDF or TXT document and ask questions about its content.

- ## How It Works

- 1. **Document Processing**: Your document is split into chunks using the selected strategy (fixed-size or paragraph-based)
- 2. **Embedding**: Chunks are embedded using Sentence Transformers (all-MiniLM-L6-v2)
- 3. **Retrieval**: When you ask a question, relevant chunks are retrieved using semantic search via ChromaDB
- 4. **Generation**: GPT-4o-mini generates an answer based on the retrieved context
 
- ## Features

- - PDF and TXT file support
- - Two chunking strategies for comparison
- - Source citations with relevance scores
- - Built with Gradio, ChromaDB, and OpenAI API
 
- ## References

- - [RAG Original Paper (Lewis et al., 2020)](https://arxiv.org/abs/2005.11401)
- - [RAG Survey (Gao et al., 2023)](https://arxiv.org/pdf/2312.10997)
- - [Chunking Strategies for RAG (Merola & Singh, 2025)](https://arxiv.org/abs/2504.19754)

- ## Author

- Built as part of an AI/ML Engineering portfolio project.
  sdk_version: 4.44.1
  app_file: app.py
  pinned: false
+ license: mit
  ---
 
+ # 📄 RAG Document Q&A Assistant

+ A Retrieval-Augmented Generation (RAG) system that answers questions about uploaded documents, cites its sources, and lets you compare chunking strategies.

+ ## 🎯 What This Does

+ 1. **Upload** a PDF or TXT document
+ 2. **Choose** a chunking strategy (Fixed-size or Paragraph-based)
+ 3. **Process** the document (split it into chunks and create embeddings)
+ 4. **Ask** questions about the document
+ 5. **Get** answers with relevance scores and source citations
 
+ ## 🏗️ Architecture
+ ```
+ Document → Chunking → Embedding → Vector Store (ChromaDB)
+ ↓
+ User Question → Embedding → Semantic Search → Retrieved Chunks → GPT-4o-mini → Answer
+ ```

+ | Component | Technology |
+ |-----------|------------|
+ | Embeddings | sentence-transformers/all-MiniLM-L6-v2 (384 dimensions) |
+ | Vector Store | ChromaDB (in-memory) |
+ | Chunking | Fixed-size (500 chars, 100 overlap) or Paragraph-based |
+ | LLM | OpenAI GPT-4o-mini |
+ | Framework | Gradio |
 
+ ## 🔬 Chunking Strategies Compared

+ This app lets you compare two chunking approaches:

+ | Strategy | How It Works | Best For |
+ |----------|--------------|----------|
+ | **Fixed-size** | Splits text into 500-char chunks with 100-char overlap | Uniform documents, consistent retrieval |
+ | **Paragraph-based** | Splits on double newlines, preserves natural boundaries | Structured documents, better context |

+ **Key Insight:** Fixed-size chunking may cut mid-sentence, but it typically creates more chunks, which gives finer retrieval granularity. Paragraph-based chunking preserves context but may produce chunks of uneven size.
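
The two strategies in the table can be sketched as follows. This is a minimal illustration that assumes the 500/100 character settings listed above; the function names `chunk_fixed_size` and `chunk_by_paragraph` are hypothetical and are not taken from `app.py`.

```python
def chunk_fixed_size(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into fixed-size character windows that overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # each window starts this many characters after the last
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


def chunk_by_paragraph(text: str) -> list[str]:
    """Split on double newlines and drop empty pieces."""
    return [part.strip() for part in text.split("\n\n") if part.strip()]
```

With these defaults, a 1,200-character text yields three fixed-size chunks (500, 500, and 400 characters), and the last 100 characters of each chunk repeat at the start of the next one.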
+
+ ## 🛠️ Technical Implementation
+
+ ### Vector Search Flow
+ ```python
+ # 1. Document Processing
+ chunks = chunk_by_strategy(document_text)
+ collection.add(documents=chunks, ids=chunk_ids)
+
+ # 2. Query Processing
+ results = collection.query(query_texts=[question], n_results=3)
+
+ # 3. Answer Generation
+ context = format_retrieved_chunks(results)
+ answer = gpt4o_mini.generate(context + question)
+ ```
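
Step 3 above can be sketched in plain Python. This is an illustration only: it assumes `results` has the nested shape ChromaDB returns from `collection.query` (one inner list per query text), the body of `format_retrieved_chunks` is a guess at what the helper named in the README might do, and `build_prompt` and the prompt wording are hypothetical.

```python
def format_retrieved_chunks(results: dict) -> str:
    """Number the retrieved chunks so an answer can cite them."""
    documents = results["documents"][0]  # chunks retrieved for the first (only) query
    return "\n\n".join(
        f"[Source {i}] {doc}" for i, doc in enumerate(documents, start=1)
    )


def build_prompt(question: str, results: dict) -> str:
    """Combine the retrieved context and the user question into one prompt."""
    context = format_retrieved_chunks(results)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```

Keeping prompt assembly separate from the LLM call makes the retrieval side easy to check without an API key.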
+
+ ### Similarity Scoring
+
+ Distances from ChromaDB are converted to a relevance score between 0 and 1, which is displayed as a percentage:
+ ```python
+ similarity = 1 / (1 + distance) # Higher = more relevant
+ ```
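
A short worked example of this conversion, assuming non-negative distances; the helper names `distance_to_relevance` and `as_percentage` are hypothetical and not from the app:

```python
def distance_to_relevance(distance: float) -> float:
    """Map a non-negative distance to a score in (0, 1]; smaller distance scores higher."""
    if distance < 0:
        raise ValueError("distance must be non-negative")
    return 1.0 / (1.0 + distance)


def as_percentage(distance: float) -> str:
    """Format the relevance score as a whole-number percentage."""
    return f"{distance_to_relevance(distance):.0%}"


for d in (0.0, 1.0, 3.0):
    print(d, as_percentage(d))  # 0.0 -> 100%, 1.0 -> 50%, 3.0 -> 25%
```

The mapping is monotonic, so ranking chunks by this score gives the same order as ranking by ascending distance.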
+
+ ## 🧪 Development Challenges
+
+ ### Challenge 1: Collection Already Exists Error
+
+ **Problem:** `chromadb.errors.InternalError: Collection [documents] already exists`
+ **Cause:** Re-uploading documents without clearing the previous collection.
+ **Solution:** Delete the existing collection before creating a new one:
+ ```python
+ try:
+     chroma_client.delete_collection(name="documents")
+ except Exception:
+     pass  # No previous collection to delete
+ collection = chroma_client.create_collection(name="documents", ...)
+ ```
+
+ ### Challenge 2: PDF Text Extraction
+
+ **Problem:** Some PDFs have unusual formatting, which results in very few chunks.
+ **Solution:** PyMuPDF (fitz) handles most PDF formats reliably. For problematic PDFs, fixed-size chunking provides more consistent results than paragraph-based chunking.
+
+ ### Challenge 3: Hugging Face Dependency Conflicts
+
+ **Problem:** `ImportError: cannot import name 'HfFolder' from 'huggingface_hub'`
+ **Cause:** Version mismatch between gradio and huggingface-hub.
+ **Solution:** Pin compatible versions of both packages in `requirements.txt`.
+
+ ## 📊 Features
+
+ - ✅ PDF and TXT file support
+ - ✅ Two chunking strategies for comparison
+ - ✅ Source citations with relevance scores
+ - ✅ Real-time document statistics
+ - ✅ Clean, intuitive UI
+
+ ## 📝 Limitations
+
+ - Requires an OpenAI API key (uses GPT-4o-mini)
+ - In-memory vector store (resets on each session)
+ - Optimized for English-language text
+ - Maximum file size is limited by Hugging Face Spaces
+
+ ## 📚 Research References
+
+ - [RAG Original Paper (Lewis et al., 2020)](https://arxiv.org/abs/2005.11401) - Introduced Retrieval-Augmented Generation
+ - [RAG Survey (Gao et al., 2023)](https://arxiv.org/pdf/2312.10997) - Comprehensive survey of RAG techniques
+ - [Chunking Strategies for RAG (Merola & Singh, 2025)](https://arxiv.org/abs/2504.19754) - Analysis of chunking approaches
+
+ ## 👤 Author
+
+ [Nav772](https://huggingface.co/Nav772) - Built as part of an AI/ML Engineering portfolio
+
+ ## 📚 Related Projects
+
+ - [RAG Document Q&A (LangChain version)](https://huggingface.co/spaces/Nav772/rag-document-qa)
+ - [Movie Sentiment Analyzer](https://huggingface.co/spaces/Nav772/movie-sentiment-analyzer)
+ - [Amazon Review Rating Predictor](https://huggingface.co/spaces/Nav772/amazon-review-rating-predictor)
+ - [Food Image Classifier](https://huggingface.co/spaces/Nav772/food-image-classifier)