Nav772 commited on
Commit
e80959b
Β·
verified Β·
1 Parent(s): 8e54f61

Upload README.md with huggingface_hub

Browse files
Files changed (1) hide show
  1. README.md +77 -6
README.md CHANGED
@@ -1,12 +1,83 @@
1
  ---
2
- title: Rag Document Qa
3
- emoji: πŸ“ˆ
4
- colorFrom: gray
5
- colorTo: gray
6
  sdk: gradio
7
- sdk_version: 6.3.0
8
  app_file: app.py
9
  pinned: false
 
10
  ---
11
 
12
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
  ---
2
+ title: RAG Document Q&A System
3
+ emoji: πŸ“š
4
+ colorFrom: blue
5
+ colorTo: purple
6
  sdk: gradio
7
+ sdk_version: 4.44.0
8
  app_file: app.py
9
  pinned: false
10
+ license: mit
11
  ---
12
 
13
+ # πŸ“š RAG Document Q&A System
14
+
15
+ A Retrieval-Augmented Generation (RAG) system that answers questions about uploaded PDF documents.
16
+
17
+ ## 🎯 What This Does
18
+
19
+ 1. **Upload** a PDF document
20
+ 2. **Process** the document (chunks it and creates embeddings)
21
+ 3. **Ask** questions about the document
22
+ 4. **Get** accurate answers with source citations
23
+
24
+ ## πŸ—οΈ Architecture
25
+ ```
26
+ User Question β†’ Embedding β†’ Vector Search β†’ Retrieved Chunks β†’ LLM β†’ Answer
27
+ ```
28
+
29
+ | Component | Technology |
30
+ |-----------|------------|
31
+ | Embeddings | sentence-transformers/all-MiniLM-L6-v2 (384 dimensions) |
32
+ | Vector Store | FAISS (Facebook AI Similarity Search) |
33
+ | Text Splitter | RecursiveCharacterTextSplitter (1000 chars, 200 overlap) |
34
+ | LLM | HuggingFaceH4/zephyr-7b-beta via Inference API |
35
+ | Framework | LangChain + Gradio |
36
+
37
+ ## πŸ› οΈ Development Challenges
38
+
39
+ This project encountered several technical challenges during development:
40
+
41
+ ### Challenge 1: LangChain API Changes
42
+ **Problem:** Import errors due to LangChain's package restructuring.
43
+ ```python
44
+ # Old (broken)
45
+ from langchain.document_loaders import PyPDFLoader
46
+ from langchain.chains import RetrievalQA
47
+
48
+ # New (working)
49
+ from langchain_community.document_loaders import PyPDFLoader
50
+ # RetrievalQA deprecated β†’ use LCEL chains instead
51
+ ```
52
+ **Lesson:** Fast-evolving libraries require checking current documentation.
53
+
54
+ ### Challenge 2: PDF Download Issues
55
+ **Problem:** `PdfStreamError: Stream has ended unexpectedly`
56
+ **Cause:** Incomplete download due to missing User-Agent header.
57
+ **Solution:** Added proper headers to HTTP request.
58
+
59
+ ### Challenge 3: LLM Response Quality
60
+ **Problem:** FLAN-T5-Large produced fragment-like responses instead of complete answers.
61
+ **Attempted Solutions:**
62
+ 1. Adjusted generation parameters β€” minimal improvement
63
+ 2. Modified prompt format β€” slight improvement
64
+ 3. Switched to FLAN-T5-XL β€” OOM error
65
+
66
+ **Final Solution:** Switched to Zephyr-7B-beta, which produces comprehensive answers.
67
+
68
+ ## πŸ“ Limitations
69
+
70
+ - Only processes PDF documents
71
+ - English language only
72
+ - Free Inference API has rate limits
73
+
74
+ ## πŸ‘€ Author
75
+
76
+ [Nav772](https://huggingface.co/Nav772) - Built as part of AI Engineering portfolio
77
+
78
+ ## πŸ“š Related Projects
79
+
80
+ - [Movie Sentiment Analyzer](https://huggingface.co/spaces/Nav772/movie-sentiment-analyzer)
81
+ - [Amazon Review Rating Predictor](https://huggingface.co/spaces/Nav772/amazon-review-rating-predictor)
82
+ - [Food Image Classifier](https://huggingface.co/spaces/Nav772/food-image-classifier)
83
+ - [Sentiment Model Comparison](https://huggingface.co/spaces/Nav772/sentiment-model-comparison)