# VGEC RAG Chatbot: Software Documentation Plan
> Based on IEEE/Industry Standard | Updated: 2026-03-25
> Reference: `CODEBASE_DOCUMENTATION.md` covers most of Phase 5 already; reuse it.
---
## DIAGRAMS FIRST: Priority Order
> Do all diagrams before writing any prose. Diagrams take the most time and are referenced throughout.
| # | Diagram | Phase Used In | Tool | Status |
|---|---|---|---|---|
| 1 | High-Level Architecture (Component Diagram) | Phase 5 | Draw.io / Mermaid | [ ] |
| 2 | Data Flow: Query Path | Phase 5 | Draw.io (DFD Level 2) | [ ] |
| 3 | Data Flow: Ingestion Path | Phase 5 | Draw.io (DFD Level 2) | [ ] |
| 4 | Hierarchical Taxonomy Tree (Type→Category→Topic→Intent) | Phase 5 | Tree diagram / Mermaid | [ ] |
| 5 | Filter Decision Flowchart (Strict→Partial→Fallback) | Phase 5 | Flowchart / Draw.io | [ ] |
| 6 | Hybrid Retrieval Sequence (Vector→BM25→RRF→Boost) | Phase 5 | Sequence diagram / Flow | [ ] |
| 7 | Use Case Diagram (Student, Faculty, Admin actors) | Phase 4 | Draw.io / PlantUML | [ ] |
| 8 | System Context Diagram / Level 0 DFD | Phase 2 | Draw.io | [ ] |
| 9 | Class Diagram (simplified: RAGService + helpers) | Phase 6 | Draw.io / UML | [ ] |
| 10 | Activity Diagram β€” Chunking Process | Phase 6 | Activity flow / Draw.io | [ ] |
| 11 | MRR Bar Chart: Your RAG vs Traditional | Phase 7 | matplotlib / Excel | [ ] |
| 12 | Noise Rate Bar Chart: Comparison | Phase 7 | matplotlib / Excel | [ ] |
| 13 | Classifier Confusion Matrix (per field) | Phase 7 | Seaborn heatmap | [ ] |
| 14 | Deployment Diagram (Express → FastAPI → ChromaDB) | Phase 8 | Draw.io | [ ] |
| 15 | Future Roadmap / Gantt-style Timeline | Phase 9 | Draw.io / simple table | [ ] |
---
## Phase 1: Front Matter
**Est. time: 1–2 hrs | No diagrams needed**
- [ ] Title Page
- Project: VGEC RAG Chatbot
- Subtitle: Retrieval-Augmented Generation System for Academic Queries
- Name, Roll No., Department, Submission Date
- Guide name, College name
- [ ] Abstract (150–200 words)
- Problem: Students struggle to find accurate VGEC info scattered across website
- Solution: RAG-based chatbot with hierarchical classification + hybrid retrieval
- Key results: MRR, noise reduction *(fill placeholders after deployment)*
- Tech: FastAPI, ChromaDB, Gemini, Logistic Regression classifier
- [ ] Table of Contents *(auto-generate at end; structure it now)*
- [ ] List of Figures *(auto-generate at end)*
- [ ] List of Abbreviations
- RAG, BM25, RRF, LLM, MRR, API, VGEC, HOD, etc.
---
## Phase 2: Introduction
**Est. time: 2–3 hrs | Diagrams needed: System Context Diagram (Diagram #8)**
- [ ] 2.1 Background
- Current state: Static website, PDFs, manual queries to admin office
- Pain points: Information scattered, no natural language interface
- [ ] 2.2 Problem Statement
- Lack of intelligent query system for institutional data
- Need for domain-specific (VGEC) accurate retrieval
- [ ] 2.3 Objectives
- Build RAG pipeline with MRR > 0.75
- Implement metadata classification for pre-filtering
- Provide REST API for frontend integration
- Deploy with a secure Express gateway
- [ ] 2.4 Scope
- **In scope:** Department data (faculty, labs, syllabus, HOD, intake), REST API, classification, evaluation
- **Out of scope:** Real-time website scraping, admissions processing, multimedia
> **Reuse from:** `CODEBASE_DOCUMENTATION.md` Section 1 (Project Overview)
---
## Phase 3: Literature Review / Related Work
**Est. time: 2–3 hrs | Diagrams needed: Evolution timeline (simple horizontal flow)**
- [ ] 3.1 Traditional Chatbots
- Rule-based (ALICE, ELIZA): rigid, no context
- Keyword-matching chatbots: no semantic understanding
- [ ] 3.2 Modern RAG Systems
- OpenAI GPT-4 + vector DB (generic, not domain-specific)
- LlamaIndex / LangChain baseline RAG: no metadata filtering
- [ ] 3.3 Hybrid Search Systems
- Elasticsearch (BM25 lexical search), Cohere (embedding-based vector search)
- RRF as the standard fusion method (reference paper)
- [ ] 3.4 Your Differentiation
- Hierarchical classifier (Type→Category→Topic→Intent) for pre-filtering
- Hybrid retrieval (BM25 + Vector + RRF) vs pure semantic search
- Domain-specific ingestion strategy (intent-aware JSON chunking)
---
## Phase 4: System Analysis & Requirements
**Est. time: 3–4 hrs | Diagrams needed: Use Case Diagram (#7), Level 1 DFD**
- [ ] 4.1 Functional Requirements
- FR1: Ingest structured JSON and unstructured documents (PDF, MD, TXT)
- FR2: Classify queries into metadata filters (type, category, topic, intent)
- FR3: Retrieve relevant chunks with configurable similarity threshold
- FR4: Generate contextual answers using Gemini or local LLM
- FR5: Provide CRUD operations on vector store via REST API
- FR6: Rate-limit and authenticate requests via Express gateway
- [ ] 4.2 Non-Functional Requirements
- Performance: <5s response (cloud), <30s (local LLM)
- Accuracy: MRR >0.75
- Security: Admin routes protected by JWT, Python API never publicly exposed
- Scalability: Support 10,000+ chunks in ChromaDB
- [ ] 4.3 Use Case Diagram *(Diagram #7)*
- Actors: Student, Faculty, Admin
- Student use cases: Submit query, View answer, View references
- Admin use cases: Ingest document, Delete document, Run evaluation, Change settings
- [ ] 4.4 Level 1 DFD
- Major processes: Ingest, Classify, Retrieve, Generate, Evaluate
---
## Phase 5: System Design
**Est. time: 4–6 hrs | MOST MARKS, MOST DIAGRAMS**
**Diagrams needed: #1, #2, #3, #4, #5, #6**
> **Reuse heavily from:** `CODEBASE_DOCUMENTATION.md` Sections 2, 3, 4
- [ ] 5.1 Architecture Design
- [ ] High-Level Component Diagram *(Diagram #1)*
- [ ] Data Flow: Ingestion Path *(Diagram #3)*
- [ ] Data Flow: Query Path *(Diagram #2)*
- [ ] Technology Stack Table (already in CODEBASE_DOCUMENTATION.md Section 1)
- [ ] 5.2 Database Design
- [ ] Vector DB Metadata Schema (field table; already in CODEBASE_DOCUMENTATION.md Section 3)
- [ ] Source JSON Schema (already documented)
- [ ] File Tracking Registry Schema (FileService JSON records)
- [ ] 5.3 Algorithm Design
- [ ] Hierarchical Taxonomy Tree *(Diagram #4)* (Type → Category → Topic → Intent)
- [ ] Filter Decision Flowchart *(Diagram #5)* (confidence thresholds → Strict/Partial/Fallback)
- [ ] Hybrid Retrieval Sequence *(Diagram #6)* (Vector → BM25 → RRF formula → Boost → Threshold)
- [ ] Chunking Strategy (JSON intent-aware vs RecursiveCharacterTextSplitter)
- [ ] RRF Formula (document with the actual equation):
```
score(d) = bm25_weight * 1/(rrf_k + rank_bm25)
+ vector_weight * 1/(rrf_k + rank_vec)
```
- [ ] 5.4 Interface Design
- [ ] API Endpoint Table: `/rag` and `/vector` routes (already in CODEBASE_DOCUMENTATION.md Section 5)
- [ ] Request/Response JSON examples (sample curl or Postman output)
- [ ] Express Gateway design (rate limit + auth + concurrency queue)
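
For the RRF formula item in 5.3, a minimal Python sketch of the weighted fusion step could accompany the equation. It assumes each retriever returns chunk ids ordered best-first with ranks starting at 1; the function name `weighted_rrf`, the default weights of 0.5, and `rrf_k=60` are illustrative assumptions rather than values taken from the codebase, and a chunk missing from one list is assumed to contribute nothing for that list.

```python
from collections import defaultdict


def weighted_rrf(bm25_ranked, vector_ranked,
                 bm25_weight=0.5, vector_weight=0.5, rrf_k=60):
    """Fuse two best-first lists of chunk ids with weighted RRF.

    Hypothetical helper: names and defaults are assumptions for illustration.
    """
    scores = defaultdict(float)
    # Each list adds weight / (rrf_k + rank) for every chunk it contains.
    for rank, chunk_id in enumerate(bm25_ranked, start=1):
        scores[chunk_id] += bm25_weight / (rrf_k + rank)
    for rank, chunk_id in enumerate(vector_ranked, start=1):
        scores[chunk_id] += vector_weight / (rrf_k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


fused = weighted_rrf(["c2", "c1", "c3"], ["c1", "c4", "c2"])
for chunk_id, score in fused:
    print(f"{chunk_id}: {score:.5f}")
```

In this toy input, `c1` ends up ahead of `c2` because it sits near the top of both lists, which is the behaviour the formula is meant to reward.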
---
## Phase 6: Implementation
**Est. time: 2–3 hrs | Diagrams needed: Class Diagram (#9), Activity Diagram (#10), directory tree**
> **Reuse heavily from:** `CODEBASE_DOCUMENTATION.md` Section 5 and Section 8
- [ ] 6.1 Directory Structure (already in CODEBASE_DOCUMENTATION.md Section 8)
- [ ] 6.2 Module Descriptions (already in CODEBASE_DOCUMENTATION.md Section 5)
- [ ] 6.3 Key Code Snippets *(do NOT paste full files; only algorithm excerpts)*
- [ ] Filter construction logic (`_build_filter` method)
- [ ] RRF scoring loop
- [ ] Intent-aware JSON chunking (`handle_json_docs`)
- [ ] Classifier prediction + threshold gating
- [ ] 6.4 Configuration
- [ ] `.env` variables table (already in CODEBASE_DOCUMENTATION.md Section 5)
- [ ] Hyperparameter table (BM25 weights, thresholds, chunk size)
- [ ] 6.5 Express Gateway Implementation
- [ ] Rate limiting configuration
- [ ] JWT auth middleware snippet
- [ ] Concurrency queue (`p-limit`) snippet
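
As a placeholder for the "classifier prediction + threshold gating" excerpt in 6.3, the sketch below shows one way the Strict/Partial/Fallback decision could be expressed. It is not the project's `_build_filter` method: the function names, the single `threshold=0.7`, the rule of stopping at the first low-confidence level, and the example labels are all assumptions made for illustration. The `to_where` helper follows the `$and` form that ChromaDB's `where` filters are understood to accept; check it against the installed version.

```python
FIELDS = ("type", "category", "topic", "intent")  # hierarchy order from the plan


def choose_filter_mode(prediction, threshold=0.7):
    """Return (mode, kept_fields) from {field: (label, confidence)}.

    Assumed semantics: walk the hierarchy top-down and stop at the first
    field whose confidence is below the threshold.
    """
    kept = {}
    for field in FIELDS:
        label, confidence = prediction[field]
        if confidence < threshold:
            break
        kept[field] = label
    if len(kept) == len(FIELDS):
        return "strict", kept      # every level is trusted
    if kept:
        return "partial", kept     # only the upper levels are trusted
    return "fallback", kept        # no metadata filter at all


def to_where(kept):
    """Turn the kept fields into a metadata filter dict (None = unfiltered)."""
    conditions = [{field: label} for field, label in kept.items()]
    if not conditions:
        return None
    if len(conditions) == 1:
        return conditions[0]
    return {"$and": conditions}


# Hypothetical classifier output; labels and confidences are made up.
example = {
    "type": ("department", 0.95),
    "category": ("faculty", 0.82),
    "topic": ("hod", 0.40),
    "intent": ("contact", 0.90),
}
mode, kept = choose_filter_mode(example)
print(mode, to_where(kept))
```

Stopping at the first weak level keeps a confident but orphaned `intent` from filtering on its own; if the real logic treats fields independently, the loop would use `continue` instead of `break`.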
---
## Phase 7: Testing & Evaluation
**Est. time: 3–4 hrs | Diagrams needed: #11 (MRR bar chart), #12 (noise chart), #13 (confusion matrix)**
> ⚠️ PLACEHOLDER: fill in real numbers and screenshots AFTER deployment
- [ ] 7.1 Test Plan
- [ ] Unit tests: Classifier accuracy per field (run `/test_classifier_dataset`)
- [ ] Integration tests: End-to-end hybrid query
- [ ] Performance: Measure average latency (cloud vs local)
- [ ] 7.2 Results
- [ ] Comparison Table: Traditional pure-vector RAG vs Your Hybrid RAG
- Metrics: MRR, Hit Rate, Top-1 Hit Rate, Noise Rate, Latency
- [ ] MRR Bar Chart by query intent type *(Diagram #11)*
- [ ] Noise Rate comparison *(Diagram #12)*
- [ ] Classifier Confusion Matrix per field *(Diagram #13)*
- [ ] 7.3 Sample Query Demonstrations
- Choose 3–5 representative queries, show:
- Input question
- Classifier output (type, category, topic, intent + confidences)
- Retrieved chunks with scores
- Final LLM answer
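
To make the 7.2 comparison table reproducible, the metrics could be computed with a small script along these lines. This is a sketch under stated assumptions: the authoritative metric definitions live in `CODEBASE_DOCUMENTATION.md` Section 6, the noise-rate definition used here (share of retrieved chunks in the top k that are not relevant, averaged over queries) is an assumption, and the function names and toy ids are hypothetical, not real evaluation data.

```python
def reciprocal_rank(retrieved, relevant):
    """1/rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(runs, k=5):
    """runs: list of (retrieved_ids, relevant_id_set) pairs, one per query."""
    n = len(runs)
    mrr = hits = top1 = noise = 0.0
    for retrieved, relevant in runs:
        top_k = retrieved[:k]
        mrr += reciprocal_rank(top_k, relevant)
        hits += any(chunk_id in relevant for chunk_id in top_k)
        top1 += bool(top_k) and top_k[0] in relevant
        if top_k:
            # Assumed definition: fraction of retrieved chunks that are irrelevant.
            noise += sum(c not in relevant for c in top_k) / len(top_k)
    return {
        "mrr": mrr / n,
        "hit_rate": hits / n,
        "top1_hit_rate": top1 / n,
        "noise_rate": noise / n,
    }


# Toy runs with made-up chunk ids, only to show the call shape.
toy_runs = [
    (["a", "b", "c"], {"a"}),
    (["x", "y", "z"], {"y"}),
    (["p", "q"], {"r"}),
]
metrics = evaluate(toy_runs)
print(metrics)
```

Running the same function over the pure-vector baseline and the hybrid pipeline on an identical query set gives the two columns of the comparison table.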
---
## Phase 8: Deployment
**Est. time: 1–2 hrs | Diagrams needed: Deployment diagram (#14)**
> ⚠️ PLACEHOLDER: fill in AFTER actual deployment
- [ ] 8.1 System Requirements
- Hardware: 8 GB RAM, 4-core CPU (for the local LLM); alternatively, a Google API key (for Gemini)
- Software: Python 3.9+, Node.js 18+, ChromaDB
- [ ] 8.2 Deployment Architecture *(Diagram #14)*
- Frontend → Express Gateway → FastAPI → ChromaDB
- [ ] 8.3 Installation Steps
- Clone → `pip install -r requirements.txt` → Set `.env` → Run ingestion → Start API
- Express: `npm install` → Set `.env` → `node server.js`
- [ ] 8.4 Screenshots *(fill after deployment)*
- [ ] Swagger UI (`/docs`)
- [ ] Sample chatbot interaction
- [ ] Admin panel
- [ ] Classification test panel
---
## Phase 9: Future Scope & Conclusion
**Est. time: 1–2 hrs | Diagrams needed: Roadmap (#15)**
- [ ] 9.1 Future Enhancements
- Dynamic LLM switching via admin UI (ModelManager architecture)
- Cross-encoder re-ranking step (once compute resources become available)
- Query result caching layer
- Automated metadata prediction during ingestion (classifier-assisted)
- Website scraping for real-time data updates
- [ ] 9.2 Known Limitations (already in CODEBASE_DOCUMENTATION.md Section 7)
- Local LLM latency (CPU-bound, no GPU)
- BM25 corpus rebuilt per request
- No real-time data (static knowledge base)
- [ ] 9.3 Conclusion
- Successfully built domain-specific RAG with hybrid retrieval
- Hierarchical classification reduces noise and improves precision
- Secure deployment with Express gateway protects the inference server
---
## Phase 10: References & Appendices
**Est. time: 1–2 hrs | No diagrams needed**
- [ ] 10.1 References
- LangChain documentation
- ChromaDB documentation
- Original RRF paper (Cormack et al., 2009)
- Gemini API documentation
- VGEC official website (data source)
- BM25 (Robertson & Zaragoza, 2009)
- Sentence Transformers (Reimers & Gurevych, 2019)
- [ ] 10.2 Appendix A: MASTER_INDEX full taxonomy
- [ ] 10.3 Appendix B: Full API documentation (export from Swagger `/docs`)
- [ ] 10.4 Appendix C: Sample classifier training data
- [ ] 10.5 Appendix D: Sample department JSON format
---
## Execution Timeline
| Phase | When | Priority |
|---|---|---|
| **All Diagrams** | Start NOW (before writing prose) | 🔴 Critical |
| Phase 1–3 (Intro, Lit Review) | Day 1 | Must have |
| Phase 4–5 (Design) | Day 2–3 | 🔴 Critical (most marks) |
| Phase 6 (Implementation) | Day 4 | Must have |
| Phase 7 (Testing) | After deployment, Day 5 | 🔴 Critical (proof) |
| Phase 8 (Deployment) | After deployment | Must have |
| Phase 9–10 (Future, Refs) | Day 6 | Finish strong |
| Final PDF export + proofread | Last | Required |
---
## Reuse Map: What's Already Written
| Documentation Section | Already in |
|---|---|
| System Architecture (components, data flow) | `CODEBASE_DOCUMENTATION.md` Section 2 |
| Tech Stack Table | `CODEBASE_DOCUMENTATION.md` Section 1 |
| Metadata Schema / Taxonomy | `CODEBASE_DOCUMENTATION.md` Section 3 |
| Retrieval Pipeline steps | `CODEBASE_DOCUMENTATION.md` Section 4 |
| All class/method descriptions | `CODEBASE_DOCUMENTATION.md` Section 5 |
| Metrics definitions | `CODEBASE_DOCUMENTATION.md` Section 6 |
| Known Limitations | `CODEBASE_DOCUMENTATION.md` Section 7 |
| File Structure Tree | `CODEBASE_DOCUMENTATION.md` Section 8 |