Spaces: mrshibly/QNARag

mrshibly committed on Jan 10

Commit d6d3be9 (verified) · 1 Parent(s): 06d8674

Update README.md

Files changed (1)

README.md +10 -236

@@ -1,236 +1,10 @@
-# 🎓 Bangladesh University Academic Regulation Q&A using RAG
-A production-ready Retrieval-Augmented Generation (RAG) system for answering questions about Bangladesh university academic regulations, examination rules, and grading policies.
-![RAG System](https://img.shields.io/badge/RAG-System-blue)
-![Python](https://img.shields.io/badge/Python-3.9+-green)
-![Streamlit](https://img.shields.io/badge/Streamlit-App-red)
-## 🎯 Problem Statement
-Students and faculty often need quick access to specific information from lengthy university regulation documents. Traditional keyword search fails to understand context and intent. This RAG system provides:
-- **Semantic search** - Understands question intent, not just keywords
-- **Context-bounded answers** - Generates answers strictly from source documents
-- **Source citation** - Shows which documents were used
-- **No hallucination** - Says "I don't know" when answer isn't in the corpus
-## 🏗️ Architecture
-```mermaid
-graph TD
-    A[User Question] --> B[E5-base-v2 Embedding]
-    B --> C[FAISS Vector Search]
-    C --> D[Top-3 Relevant Chunks]
-    D --> E[Flan-T5 Generation]
-    E --> F[Answer + Sources]
-    G[PDF Documents] --> H[Text Extraction]
-    H --> I[Chunking 500 words, 80 overlap]
-    I --> J[E5 Embeddings]
-    J --> K[FAISS Index]
-    K --> C
-    style A fill:#e1f5ff
-    style F fill:#e1f5ff
-    style C fill:#fff4e1
-    style E fill:#ffe1e1
-```
-### System Components
-| Component | Technology | Purpose |
-|-----------|-----------|---------|
-| **Embedding** | E5-base-v2 | Convert text to semantic vectors |
-| **Indexing** | FAISS (IndexFlatL2) | Fast similarity search |
-| **Generation** | Flan-T5-base | Context-bounded answer generation |
-| **Frontend** | Streamlit | User interface |
-| **Deployment** | Hugging Face Spaces | Free CPU hosting |
-## 🔬 Technical Decisions
-### Why E5-base-v2?
-- **State-of-the-art**: Outperforms SBERT and other embedding models on retrieval tasks
-- **Query/Passage distinction**: Separate prefixes for questions vs documents
-- **Multilingual capable**: Foundation for future Bangla support
-- **Efficient**: 768-dim embeddings, good balance of speed and quality
-### Why FAISS?
-- **Industry standard**: Used by production systems at scale
-- **CPU efficient**: Works well on free-tier hosting
-- **Exact search**: IndexFlatL2 guarantees best matches
-- **Scalable**: Can upgrade to approximate search (IVF) for larger datasets
-### Why Flan-T5?
-- **Instruction-tuned**: Follows prompts better than base T5
-- **CPU compatible**: Runs on Hugging Face free tier
-- **Context-bounded**: Good at answering from provided context
-- **No API costs**: Self-hosted, no OpenAI/Anthropic fees
-## 📊 Dataset
-The system is trained on 4 university regulation PDFs:
-1. **credit.pdf** - Credit and grading policies
-2. **exam guideline.pdf** - Examination procedures
-3. **notice.pdf** - Academic notices and regulations
-4. **rules.pdf** - General academic rules
-**Processing Pipeline:**
-- Text extraction: PyMuPDF (fitz)
-- Chunking: 500 words with 80-word overlap
-- Total chunks: ~150-200 (varies by dataset)
-## 🚀 Usage
-### Run Locally
-```bash
-# Clone repository
-git clone https://github.com/yourusername/QNARag.git
-cd QNARag
-# Install dependencies
-pip install -r requirements.txt
-# Run Streamlit app
-streamlit run app.py
-```
-**Note**: You need `faiss.index` and `metadata.pkl` files (generated from Colab notebook).
-### Deploy to Hugging Face
-1. Create a new Space on Hugging Face
-2. Select "Streamlit" as the SDK
-3. Upload files:
-   - `app.py`
-   - `requirements.txt`
-   - `faiss.index`
-   - `metadata.pkl`
-4. Space will auto-deploy
-## 🔧 Development Workflow
-### 1. Data Preparation (Google Colab)
-Run `RAG_Embedding_Indexing.ipynb` to:
-- Extract text from PDFs
-- Generate chunks
-- Create embeddings
-- Build FAISS index
-- Export `faiss.index` and `metadata.pkl`
-### 2. Local Testing
-```bash
-streamlit run app.py
-```
-Test with various questions:
-- "What is the grading system?"
-- "How many credits are required for graduation?"
-- "What are the examination rules?"
-### 3. Deployment
-Upload to Hugging Face Spaces for public access.
-## 📈 Performance Characteristics
-| Metric | Value |
-|--------|-------|
-| **Retrieval time** | ~100-200ms (CPU) |
-| **Generation time** | ~2-4s (CPU, Flan-T5-base) |
-| **Total latency** | ~2-5s per query |
-| **Index size** | ~5-10 MB (depends on chunks) |
-| **Model size** | ~900 MB (E5 + Flan-T5) |
-## ⚠️ Limitations
-1. **CPU Latency**: Runs on free-tier CPU, slower than GPU (2-5s per query)
-2. **Static Index**: No real-time updates; requires re-indexing for new documents
-3. **English Only**: Current dataset is English; no Bangla support yet
-4. **Context Window**: Limited to top-3 chunks (~1500 words)
-5. **No Reranking**: Simple similarity search without reranking
-## 🔮 Future Work
-### Short-term
-- [ ] Add more university PDFs (expand to 10-15 documents)
-- [ ] Implement reranking (cross-encoder) for better retrieval
-- [ ] Add conversation history (multi-turn dialogue)
-- [ ] Improve chunking strategy (semantic chunking)
-### Medium-term
-- [ ] **Bangla support**: Use BanglaBERT or multilingual models
-- [ ] Hybrid search: Combine keyword (BM25) + semantic search
-- [ ] Query expansion: Generate multiple query variations
-- [ ] GPU deployment: Faster inference on paid tier
-### Long-term
-- [ ] Fine-tune E5 on university domain
-- [ ] Custom Bangla LLM for generation
-- [ ] Multi-modal: Extract tables and images from PDFs
-- [ ] User feedback loop: Improve based on user ratings
-## 🛠️ Tech Stack Summary
-```
-Frontend:  Streamlit
-Backend:   Python 3.9+
-Embedding: sentence-transformers (E5-base-v2)
-Indexing:  FAISS (faiss-cpu)
-LLM:       Hugging Face Transformers (Flan-T5-base)
-Hosting:   Hugging Face Spaces (free tier)
-```
-## 📝 Project Structure
-```
-QNARag/
-├── app.py                          # Streamlit application
-├── requirements.txt                # Python dependencies
-├── RAG_Embedding_Indexing.ipynb   # Colab notebook for indexing
-├── faiss.index                     # FAISS vector index (generated)
-├── metadata.pkl                    # Document metadata (generated)
-├── pdfs/                          # Source PDFs
-│   ├── credit.pdf
-│   ├── exam guideline.pdf
-│   ├── notice.pdf
-│   └── rules.pdf
-└── README.md                      # This file
-```
-## 🎓 Learning Outcomes
-This project demonstrates:
-1. **Information Retrieval**: Semantic search with embeddings
-2. **Vector Databases**: FAISS indexing and similarity search
-3. **LLM Integration**: Prompt engineering and context-bounded generation
-4. **Production Deployment**: Handling CPU constraints, model caching
-5. **RAG Architecture**: End-to-end retrieval-augmented generation
-## 📄 License
-MIT License - feel free to use for your own projects!
-## 🤝 Contributing
-Contributions welcome! Areas for improvement:
-- Better chunking strategies
-- Bangla language support
-- UI/UX enhancements
-- Performance optimizations
-## 📧 Contact
-Built by [Your Name] | [GitHub](https://github.com/yourusername) | [LinkedIn](https://linkedin.com/in/yourprofile)
----
-**Note for Recruiters**: This project showcases practical ML engineering skills including embedding models, vector search, LLM integration, and production deployment under resource constraints. The focus is on building a working, deployable system rather than achieving state-of-the-art metrics.

+---
+title: QNARag
+emoji: 📚
+colorFrom: purple
+colorTo: indigo
+sdk: docker
+pinned: false
+---
+Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
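
For reference, the removed README describes a chunking step of 500 words with an 80-word overlap, and notes that E5 uses separate prefixes for questions and documents. The indexing notebook (`RAG_Embedding_Indexing.ipynb`) is not part of this diff, so the following is only a minimal sketch of how that step might look. The function names `chunk_words`, `as_passage`, and `as_query` are hypothetical, and the `"passage: "` / `"query: "` strings follow the convention documented for E5 models rather than anything shown in this commit.

```python
def chunk_words(text, chunk_size=500, overlap=80):
    """Split text into word windows; consecutive windows share `overlap` words.

    Hypothetical helper: the defaults mirror the README's
    "500 words, 80 overlap", but the real notebook may differ.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once a window has reached the end of the text,
        # so no trailing chunk is a pure subset of the previous one.
        if start + chunk_size >= len(words):
            break
    return chunks


def as_passage(chunk):
    """Prefix a document chunk the way E5 expects passages to be marked."""
    return "passage: " + chunk


def as_query(question):
    """Prefix a user question the way E5 expects queries to be marked."""
    return "query: " + question


if __name__ == "__main__":
    sample = " ".join(f"w{i}" for i in range(1000))
    pieces = chunk_words(sample)
    print(len(pieces), [len(p.split()) for p in pieces])
```

With these defaults each window advances by 420 words, so a larger overlap yields more chunks and a larger index for the same corpus.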