mrshibly committed on
Commit d6d3be9 · verified · 1 Parent(s): 06d8674

Update README.md

Files changed (1)
  1. README.md +10 -236
README.md CHANGED
@@ -1,236 +1,10 @@
- # 🎓 Bangladesh University Academic Regulation Q&A using RAG
-
- A production-ready Retrieval-Augmented Generation (RAG) system for answering questions about Bangladesh university academic regulations, examination rules, and grading policies.
-
- ![RAG System](https://img.shields.io/badge/RAG-System-blue)
- ![Python](https://img.shields.io/badge/Python-3.9+-green)
- ![Streamlit](https://img.shields.io/badge/Streamlit-App-red)
-
- ## 🎯 Problem Statement
-
- Students and faculty often need quick access to specific information from lengthy university regulation documents. Traditional keyword search fails to understand context and intent. This RAG system provides:
-
- - **Semantic search** - Understands question intent, not just keywords
- - **Context-bounded answers** - Generates answers strictly from source documents
- - **Source citation** - Shows which documents were used
- - **No hallucination** - Says "I don't know" when answer isn't in the corpus
-
- ## 🏗️ Architecture
-
- ```mermaid
- graph TD
- A[User Question] --> B[E5-base-v2 Embedding]
- B --> C[FAISS Vector Search]
- C --> D[Top-3 Relevant Chunks]
- D --> E[Flan-T5 Generation]
- E --> F[Answer + Sources]
-
- G[PDF Documents] --> H[Text Extraction]
- H --> I[Chunking 500 words, 80 overlap]
- I --> J[E5 Embeddings]
- J --> K[FAISS Index]
- K --> C
-
- style A fill:#e1f5ff
- style F fill:#e1f5ff
- style C fill:#fff4e1
- style E fill:#ffe1e1
- ```
-
- ### System Components
-
- | Component | Technology | Purpose |
- |-----------|-----------|---------|
- | **Embedding** | E5-base-v2 | Convert text to semantic vectors |
- | **Indexing** | FAISS (IndexFlatL2) | Fast similarity search |
- | **Generation** | Flan-T5-base | Context-bounded answer generation |
- | **Frontend** | Streamlit | User interface |
- | **Deployment** | Hugging Face Spaces | Free CPU hosting |
-
- ## 🔬 Technical Decisions
-
- ### Why E5-base-v2?
-
- - **State-of-the-art**: Outperforms SBERT and other embedding models on retrieval tasks
- - **Query/Passage distinction**: Separate prefixes for questions vs documents
- - **Multilingual capable**: Foundation for future Bangla support
- - **Efficient**: 768-dim embeddings, good balance of speed and quality
-
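The query/passage distinction noted above refers to the text prefixes that E5 models expect: `query: ` for questions and `passage: ` for indexed chunks. A minimal sketch of applying them is shown here; the helper names `format_query` and `format_passages` are hypothetical and are not taken from the project's `app.py`.

```python
# Prefixes expected by E5-family embedding models.
QUERY_PREFIX = "query: "
PASSAGE_PREFIX = "passage: "


def format_query(question: str) -> str:
    """Prefix a user question before embedding it (hypothetical helper)."""
    return QUERY_PREFIX + question.strip()


def format_passages(chunks: list[str]) -> list[str]:
    """Prefix document chunks before embedding them (hypothetical helper)."""
    return [PASSAGE_PREFIX + chunk.strip() for chunk in chunks]


if __name__ == "__main__":
    print(format_query("What is the grading system?"))
    print(format_passages(["Grades are awarded on a letter scale."]))
```

The prefixed strings would then be passed to the embedding model, for example through `SentenceTransformer.encode` in sentence-transformers.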
- ### Why FAISS?
-
- - **Industry standard**: Used by production systems at scale
- - **CPU efficient**: Works well on free-tier hosting
- - **Exact search**: IndexFlatL2 guarantees best matches
- - **Scalable**: Can upgrade to approximate search (IVF) for larger datasets
-
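"Exact search" here means a brute-force comparison of the query against every stored vector, ranked by squared L2 distance. The following is a pure-Python illustration of that computation, not FAISS itself; the function names are hypothetical and the toy 2-dimensional vectors stand in for the 768-dimensional embeddings.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def exact_search(query, vectors, k=3):
    """Return the k nearest vectors as (index, squared distance) pairs.

    Every stored vector is scored, so the result is exact rather than
    approximate; cost grows linearly with the number of vectors.
    """
    scored = [(i, squared_l2(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


if __name__ == "__main__":
    stored = [[0, 0], [1, 0], [3, 4]]
    print(exact_search([1, 0], stored, k=2))
```

Because every vector is scored, this approach stays practical for a few hundred chunks but becomes the bottleneck on much larger corpora, which is where an approximate index such as IVF would come in.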
- ### Why Flan-T5?
-
- - **Instruction-tuned**: Follows prompts better than base T5
- - **CPU compatible**: Runs on Hugging Face free tier
- - **Context-bounded**: Good at answering from provided context
- - **No API costs**: Self-hosted, no OpenAI/Anthropic fees
-
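Context-bounded generation depends on a prompt that puts the retrieved chunks in front of the question and tells the model to decline when the answer is absent. The sketch below shows one way such a prompt could be assembled; the template wording and the `build_prompt` name are assumptions, not the project's actual prompt.

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Assemble an instruction prompt from retrieved chunks (hypothetical template)."""
    context = "\n\n".join(chunk.strip() for chunk in chunks)
    return (
        "Answer the question using only the context below. "
        "If the answer is not in the context, say \"I don't know\".\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question.strip()}\n"
        "Answer:"
    )


if __name__ == "__main__":
    print(build_prompt("What is the pass mark?", ["The pass mark is 40%."]))
```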
- ## 📊 Dataset
-
- The system is trained on 4 university regulation PDFs:
-
- 1. **credit.pdf** - Credit and grading policies
- 2. **exam guideline.pdf** - Examination procedures
- 3. **notice.pdf** - Academic notices and regulations
- 4. **rules.pdf** - General academic rules
-
- **Processing Pipeline:**
- - Text extraction: PyMuPDF (fitz)
- - Chunking: 500 words with 80-word overlap
- - Total chunks: ~150-200 (varies by dataset)
-
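The chunking step in the pipeline above (500 words per chunk, 80 words shared between neighbours) can be sketched as a sliding window over whitespace-separated words. This is an illustrative version under that assumption; the `chunk_words` name is hypothetical, and the notebook's actual splitting logic may differ.

```python
def chunk_words(text: str, size: int = 500, overlap: int = 80) -> list[str]:
    """Split text into word windows of `size` words sharing `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap  # each new chunk starts this many words later
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the final words are already covered by this chunk
    return chunks


if __name__ == "__main__":
    sample = " ".join(f"w{i}" for i in range(1000))
    pieces = chunk_words(sample)
    print(len(pieces), [len(p.split()) for p in pieces])
```

With these defaults each chunk starts 420 words after the previous one, so a sentence falling on a chunk boundary still appears whole in at least one chunk as long as it is shorter than the overlap.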
- ## 🚀 Usage
-
- ### Run Locally
-
- ```bash
- # Clone repository
- git clone https://github.com/yourusername/QNARag.git
- cd QNARag
-
- # Install dependencies
- pip install -r requirements.txt
-
- # Run Streamlit app
- streamlit run app.py
- ```
-
- **Note**: You need `faiss.index` and `metadata.pkl` files (generated from Colab notebook).
-
- ### Deploy to Hugging Face
-
- 1. Create a new Space on Hugging Face
- 2. Select "Streamlit" as the SDK
- 3. Upload files:
- - `app.py`
- - `requirements.txt`
- - `faiss.index`
- - `metadata.pkl`
- 4. Space will auto-deploy
-
- ## 🔧 Development Workflow
-
- ### 1. Data Preparation (Google Colab)
-
- Run `RAG_Embedding_Indexing.ipynb` to:
- - Extract text from PDFs
- - Generate chunks
- - Create embeddings
- - Build FAISS index
- - Export `faiss.index` and `metadata.pkl`
-
- ### 2. Local Testing
-
- ```bash
- streamlit run app.py
- ```
-
- Test with various questions:
- - "What is the grading system?"
- - "How many credits are required for graduation?"
- - "What are the examination rules?"
-
- ### 3. Deployment
-
- Upload to Hugging Face Spaces for public access.
-
- ## 📈 Performance Characteristics
-
- | Metric | Value |
- |--------|-------|
- | **Retrieval time** | ~100-200ms (CPU) |
- | **Generation time** | ~2-4s (CPU, Flan-T5-base) |
- | **Total latency** | ~2-5s per query |
- | **Index size** | ~5-10 MB (depends on chunks) |
- | **Model size** | ~900 MB (E5 + Flan-T5) |
-
- ## ⚠️ Limitations
-
- 1. **CPU Latency**: Runs on free-tier CPU, slower than GPU (2-5s per query)
- 2. **Static Index**: No real-time updates; requires re-indexing for new documents
- 3. **English Only**: Current dataset is English; no Bangla support yet
- 4. **Context Window**: Limited to top-3 chunks (~1500 words)
- 5. **No Reranking**: Simple similarity search without reranking
-
- ## 🔮 Future Work
-
- ### Short-term
- - [ ] Add more university PDFs (expand to 10-15 documents)
- - [ ] Implement reranking (cross-encoder) for better retrieval
- - [ ] Add conversation history (multi-turn dialogue)
- - [ ] Improve chunking strategy (semantic chunking)
-
- ### Medium-term
- - [ ] **Bangla support**: Use BanglaBERT or multilingual models
- - [ ] Hybrid search: Combine keyword (BM25) + semantic search
- - [ ] Query expansion: Generate multiple query variations
- - [ ] GPU deployment: Faster inference on paid tier
-
- ### Long-term
- - [ ] Fine-tune E5 on university domain
- - [ ] Custom Bangla LLM for generation
- - [ ] Multi-modal: Extract tables and images from PDFs
- - [ ] User feedback loop: Improve based on user ratings
-
- ## 🛠️ Tech Stack Summary
-
- ```
- Frontend: Streamlit
- Backend: Python 3.9+
- Embedding: sentence-transformers (E5-base-v2)
- Indexing: FAISS (faiss-cpu)
- LLM: Hugging Face Transformers (Flan-T5-base)
- Hosting: Hugging Face Spaces (free tier)
- ```
-
- ## 📝 Project Structure
-
- ```
- QNARag/
- ├── app.py # Streamlit application
- ├── requirements.txt # Python dependencies
- ├── RAG_Embedding_Indexing.ipynb # Colab notebook for indexing
- ├── faiss.index # FAISS vector index (generated)
- ├── metadata.pkl # Document metadata (generated)
- ├── pdfs/ # Source PDFs
- │ ├── credit.pdf
- │ ├── exam guideline.pdf
- │ ├── notice.pdf
- │ └── rules.pdf
- └── README.md # This file
- ```
-
- ## 🎓 Learning Outcomes
-
- This project demonstrates:
-
- 1. **Information Retrieval**: Semantic search with embeddings
- 2. **Vector Databases**: FAISS indexing and similarity search
- 3. **LLM Integration**: Prompt engineering and context-bounded generation
- 4. **Production Deployment**: Handling CPU constraints, model caching
- 5. **RAG Architecture**: End-to-end retrieval-augmented generation
-
- ## 📄 License
-
- MIT License - feel free to use for your own projects!
-
- ## 🤝 Contributing
-
- Contributions welcome! Areas for improvement:
- - Better chunking strategies
- - Bangla language support
- - UI/UX enhancements
- - Performance optimizations
-
- ## 📧 Contact
-
- Built by [Your Name] | [GitHub](https://github.com/yourusername) | [LinkedIn](https://linkedin.com/in/yourprofile)
-
- ---
-
- **Note for Recruiters**: This project showcases practical ML engineering skills including embedding models, vector search, LLM integration, and production deployment under resource constraints. The focus is on building a working, deployable system rather than achieving state-of-the-art metrics.
 
+ ---
+ title: QNARag
+ emoji: 📚
+ colorFrom: purple
+ colorTo: indigo
+ sdk: docker
+ pinned: false
+ ---
+
+ Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference