# VGEC RAG Chatbot — Software Documentation Plan

> Based on IEEE/Industry Standard | Updated: 2026-03-25
> Reference: `CODEBASE_DOCUMENTATION.md` covers most of Phase 5 already — reuse it.

---

## DIAGRAMS FIRST — Priority Order

> Do all diagrams before writing any prose. Diagrams take the most time and are referenced throughout.

| # | Diagram | Phase Used In | Tool | Status |
|---|---|---|---|---|
| 1 | High-Level Architecture (Component Diagram) | Phase 5 | Draw.io / Mermaid | [ ] |
| 2 | Data Flow — Query Path | Phase 5 | Draw.io (DFD Level 2) | [ ] |
| 3 | Data Flow — Ingestion Path | Phase 5 | Draw.io (DFD Level 2) | [ ] |
| 4 | Hierarchical Taxonomy Tree (Type→Category→Topic) | Phase 5 | Tree diagram / Mermaid | [ ] |
| 5 | Filter Decision Flowchart (Strict→Partial→Fallback) | Phase 5 | Flowchart / Draw.io | [ ] |
| 6 | Hybrid Retrieval Sequence (Vector→BM25→RRF→Boost) | Phase 5 | Sequence diagram / Flow | [ ] |
| 7 | Use Case Diagram (Student, Faculty, Admin actors) | Phase 4 | Draw.io / PlantUML | [ ] |
| 8 | System Context Diagram / Level 0 DFD | Phase 2 | Draw.io | [ ] |
| 9 | Class Diagram (simplified — RAGService + helpers) | Phase 6 | Draw.io / UML | [ ] |
| 10 | Activity Diagram — Chunking Process | Phase 6 | Activity flow / Draw.io | [ ] |
| 11 | MRR Bar Chart — Your RAG vs Traditional | Phase 7 | matplotlib / Excel | [ ] |
| 12 | Noise Rate Bar Chart — Comparison | Phase 7 | matplotlib / Excel | [ ] |
| 13 | Classifier Confusion Matrix (per field) | Phase 7 | Seaborn heatmap | [ ] |
| 14 | Deployment Diagram (Express → FastAPI → ChromaDB) | Phase 8 | Draw.io | [ ] |
| 15 | Future Roadmap / Gantt-style Timeline | Phase 9 | Draw.io / simple table | [ ] |

---

## Phase 1 — Front Matter

**Est.
time: 1–2 hrs | No diagrams needed**

- [ ] Title Page
  - Project: VGEC RAG Chatbot
  - Subtitle: Retrieval-Augmented Generation System for Academic Queries
  - Name, Roll No., Department, Submission Date
  - Guide name, College name
- [ ] Abstract (150–200 words)
  - Problem: Students struggle to find accurate VGEC info scattered across the website
  - Solution: RAG-based chatbot with hierarchical classification + hybrid retrieval
  - Key results: MRR, noise reduction *(fill placeholders after deployment)*
  - Tech: FastAPI, ChromaDB, Gemini, Logistic Regression classifier
- [ ] Table of Contents *(auto-generate at end — structure it now)*
- [ ] List of Figures *(auto-generate at end)*
- [ ] List of Abbreviations
  - RAG, BM25, RRF, LLM, MRR, API, VGEC, HOD, etc.

---

## Phase 2 — Introduction

**Est. time: 2–3 hrs | Diagrams needed: System Context Diagram (Diagram #8)**

- [ ] 2.1 Background
  - Current state: Static website, PDFs, manual queries to admin office
  - Pain points: Information scattered, no natural language interface
- [ ] 2.2 Problem Statement
  - Lack of an intelligent query system for institutional data
  - Need for accurate, domain-specific (VGEC) retrieval
- [ ] 2.3 Objectives
  - Build a RAG pipeline with MRR > 0.75
  - Implement metadata classification for pre-filtering
  - Provide REST API for frontend integration
  - Deploy with a secure Express gateway
- [ ] 2.4 Scope
  - **In scope:** Department data (faculty, labs, syllabus, HOD, intake), REST API, classification, evaluation
  - **Out of scope:** Real-time website scraping, admissions processing, multimedia

> **Reuse from:** `CODEBASE_DOCUMENTATION.md` Section 1 (Project Overview)

---

## Phase 3 — Literature Review / Related Work

**Est.
time: 2–3 hrs | Diagrams needed: Evolution timeline (simple horizontal flow)**

- [ ] 3.1 Traditional Chatbots
  - Rule-based (ALICE, ELIZA) — rigid, no context
  - Keyword matching chatbots — no semantic understanding
- [ ] 3.2 Modern RAG Systems
  - OpenAI GPT-4 + vector DB (generic, not domain-specific)
  - LlamaIndex / LangChain baseline RAG — no metadata filtering
- [ ] 3.3 Hybrid Search Systems
  - Elasticsearch (BM25 only), Cohere (vector only)
  - RRF as the standard fusion method (reference paper)
- [ ] 3.4 Your Differentiation
  - Hierarchical classifier (Type→Category→Topic→Intent) for pre-filtering
  - Hybrid retrieval (BM25 + Vector + RRF) vs pure semantic search
  - Domain-specific ingestion strategy (intent-aware JSON chunking)

---

## Phase 4 — System Analysis & Requirements

**Est. time: 3–4 hrs | Diagrams needed: Use Case Diagram (#7), Level 1 DFD**

- [ ] 4.1 Functional Requirements
  - FR1: Ingest structured JSON and unstructured documents (PDF, MD, TXT)
  - FR2: Classify queries into metadata filters (type, category, topic, intent)
  - FR3: Retrieve relevant chunks with configurable similarity threshold
  - FR4: Generate contextual answers using Gemini or local LLM
  - FR5: Provide CRUD operations on vector store via REST API
  - FR6: Rate-limit and authenticate requests via Express gateway
- [ ] 4.2 Non-Functional Requirements
  - Performance: <5s response (cloud), <30s (local LLM)
  - Accuracy: MRR >0.75
  - Security: Admin routes protected by JWT, Python API never publicly exposed
  - Scalability: Support 10,000+ chunks in ChromaDB
- [ ] 4.3 Use Case Diagram *(Diagram #7)*
  - Actors: Student, Faculty, Admin
  - Student use cases: Submit query, View answer, View references
  - Admin use cases: Ingest document, Delete document, Run evaluation, Change settings
- [ ] 4.4 Level 1 DFD
  - Major processes: Ingest, Classify, Retrieve, Generate, Evaluate

---

## Phase 5 — System Design

**Est.
time: 4–6 hrs | MOST MARKS, MOST DIAGRAMS**

**Diagrams needed: #1, #2, #3, #4, #5, #6**

> **Reuse heavily from:** `CODEBASE_DOCUMENTATION.md` Sections 2, 3, 4

- [ ] 5.1 Architecture Design
  - [ ] High-Level Component Diagram *(Diagram #1)*
  - [ ] Data Flow — Ingestion Path *(Diagram #3)*
  - [ ] Data Flow — Query Path *(Diagram #2)*
  - [ ] Technology Stack Table (already in CODEBASE_DOCUMENTATION.md Section 1)
- [ ] 5.2 Database Design
  - [ ] Vector DB Metadata Schema (field table — already in CODEBASE_DOCUMENTATION.md Section 3)
  - [ ] Source JSON Schema (already documented)
  - [ ] File Tracking Registry Schema (FileService JSON records)
- [ ] 5.3 Algorithm Design
  - [ ] Hierarchical Taxonomy Tree *(Diagram #4)* (Type → Category → Topic → Intent)
  - [ ] Filter Decision Flowchart *(Diagram #5)* (confidence thresholds → Strict/Partial/Fallback)
  - [ ] Hybrid Retrieval Sequence *(Diagram #6)* (Vector → BM25 → RRF formula → Boost → Threshold)
  - [ ] Chunking Strategy (JSON intent-aware vs RecursiveCharacterTextSplitter)
  - [ ] RRF Formula — document with the actual equation:
    ```
    score(d) = bm25_weight * 1/(rrf_k + rank_bm25) + vector_weight * 1/(rrf_k + rank_vec)
    ```
- [ ] 5.4 Interface Design
  - [ ] API Endpoint Table — /rag and /vector routes (already in CODEBASE_DOCUMENTATION.md Section 5)
  - [ ] Request/Response JSON examples (sample curl or Postman output)
  - [ ] Express Gateway design (rate limit + auth + concurrency queue)

---

## Phase 6 — Implementation

**Est.
time: 2–3 hrs | Diagrams needed: Directory tree, Class Diagram (#9), Activity Diagram (#10)**

> **Reuse heavily from:** `CODEBASE_DOCUMENTATION.md` Section 5 and Section 8

- [ ] 6.1 Directory Structure (already in CODEBASE_DOCUMENTATION.md Section 8)
- [ ] 6.2 Module Descriptions (already in CODEBASE_DOCUMENTATION.md Section 5)
- [ ] 6.3 Key Code Snippets *(do NOT paste full files — only algorithm excerpts)*
  - [ ] Filter construction logic (`_build_filter` method)
  - [ ] RRF scoring loop
  - [ ] Intent-aware JSON chunking (`handle_json_docs`)
  - [ ] Classifier prediction + threshold gating
- [ ] 6.4 Configuration
  - [ ] `.env` variables table (already in CODEBASE_DOCUMENTATION.md Section 5)
  - [ ] Hyperparameter table (BM25 weights, thresholds, chunk size)
- [ ] 6.5 Express Gateway Implementation
  - [ ] Rate limiting configuration
  - [ ] JWT auth middleware snippet
  - [ ] Concurrency queue (`p-limit`) snippet

---

## Phase 7 — Testing & Evaluation

**Est. time: 3–4 hrs | Diagrams needed: #11 (MRR bar chart), #12 (noise chart), #13 (confusion matrix)**

> ⚠️ PLACEHOLDER — fill real numbers and screenshots AFTER deployment

- [ ] 7.1 Test Plan
  - [ ] Unit tests: Classifier accuracy per field (run `/test_classifier_dataset`)
  - [ ] Integration tests: End-to-end hybrid query
  - [ ] Performance: Measure average latency (cloud vs local)
- [ ] 7.2 Results
  - [ ] Comparison Table: Traditional pure-vector RAG vs Your Hybrid RAG
    - Metrics: MRR, Hit Rate, Top-1 Hit Rate, Noise Rate, Latency
  - [ ] MRR Bar Chart by query intent type *(Diagram #11)*
  - [ ] Noise Rate comparison *(Diagram #12)*
  - [ ] Classifier Confusion Matrix per field *(Diagram #13)*
- [ ] 7.3 Sample Query Demonstrations
  - Choose 3–5 representative queries, show:
    - Input question
    - Classifier output (type, category, topic, intent + confidences)
    - Retrieved chunks with scores
    - Final LLM answer

---

## Phase 8 — Deployment

**Est.
time: 1–2 hrs | Diagrams needed: Deployment diagram (#14)**

> ⚠️ PLACEHOLDER — fill AFTER actual deployment

- [ ] 8.1 System Requirements
  - Hardware: 8 GB RAM, 4-core CPU (local LLM) OR Google API key (Gemini)
  - Software: Python 3.9+, Node.js 18+, ChromaDB
- [ ] 8.2 Deployment Architecture *(Diagram #14)*
  - Frontend → Express Gateway → FastAPI → ChromaDB
- [ ] 8.3 Installation Steps
  - Clone → `pip install -r requirements.txt` → Set `.env` → Run ingestion → Start API
  - Express: `npm install` → Set `.env` → `node server.js`
- [ ] 8.4 Screenshots *(fill after deployment)*
  - [ ] Swagger UI (`/docs`)
  - [ ] Sample chatbot interaction
  - [ ] Admin panel
  - [ ] Classification test panel

---

## Phase 9 — Future Scope & Conclusion

**Est. time: 1–2 hrs | Diagrams needed: Roadmap (#15)**

- [ ] 9.1 Future Enhancements
  - Dynamic LLM switching via admin UI (ModelManager architecture)
  - Cross-encoder re-ranking step (once resources become available)
  - Query result caching layer
  - Automated metadata prediction during ingestion (classifier-assisted)
  - Website scraping for real-time data updates
- [ ] 9.2 Known Limitations (already in CODEBASE_DOCUMENTATION.md Section 7)
  - Local LLM latency (CPU-bound, no GPU)
  - BM25 corpus rebuilt per request
  - No real-time data — static knowledge base
- [ ] 9.3 Conclusion
  - Successfully built domain-specific RAG with hybrid retrieval
  - Hierarchical classification reduces noise and improves precision
  - Secure deployment with Express gateway protects the inference server

---

## Phase 10 — References & Appendices

**Est.
time: 1–2 hrs | No diagrams needed**

- [ ] 10.1 References
  - LangChain documentation
  - ChromaDB documentation
  - Original RRF paper (Cormack et al., 2009)
  - Gemini API documentation
  - VGEC official website (data source)
  - BM25 (Robertson & Zaragoza, 2009)
  - Sentence Transformers (Reimers & Gurevych, 2019)
- [ ] 10.2 Appendix A — MASTER_INDEX full taxonomy
- [ ] 10.3 Appendix B — Full API documentation (export from Swagger `/docs`)
- [ ] 10.4 Appendix C — Sample classifier training data
- [ ] 10.5 Appendix D — Sample department JSON format

---

## Execution Timeline

| Phase | When | Priority |
|---|---|---|
| **All Diagrams** | Start NOW (before writing prose) | 🔴 Critical |
| Phase 1–3 (Intro, Lit Review) | Day 1 | Must have |
| Phase 4–5 (Design) | Day 2–3 | 🔴 Critical — most marks |
| Phase 6 (Implementation) | Day 4 | Must have |
| Phase 7 (Testing) | After deployment — Day 5 | 🔴 Critical — proof |
| Phase 8 (Deployment) | After deployment | Must have |
| Phase 9–10 (Future, Refs) | Day 6 | Finish strong |
| Final PDF export + proofread | Last | Required |

---

## Reuse Map — What's Already Written

| Documentation Section | Already in |
|---|---|
| System Architecture (components, data flow) | `CODEBASE_DOCUMENTATION.md` Section 2 |
| Tech Stack Table | `CODEBASE_DOCUMENTATION.md` Section 1 |
| Metadata Schema / Taxonomy | `CODEBASE_DOCUMENTATION.md` Section 3 |
| Retrieval Pipeline steps | `CODEBASE_DOCUMENTATION.md` Section 4 |
| All class/method descriptions | `CODEBASE_DOCUMENTATION.md` Section 5 |
| Metrics definitions | `CODEBASE_DOCUMENTATION.md` Section 6 |
| Known Limitations | `CODEBASE_DOCUMENTATION.md` Section 7 |
| File Structure Tree | `CODEBASE_DOCUMENTATION.md` Section 8 |
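
---

When drafting the RRF items in Phases 5.3 and 6.3, a small worked example of the weighted formula `score(d) = bm25_weight * 1/(rrf_k + rank_bm25) + vector_weight * 1/(rrf_k + rank_vec)` may be useful to sanity-check the write-up. The sketch below is illustrative only: the function name `weighted_rrf`, the equal default weights, and `rrf_k=60` are assumptions and are not taken from the codebase. It also assumes 1-based ranks and that a document missing from one list contributes nothing for that list.

```python
from collections import defaultdict


def weighted_rrf(bm25_ranking, vector_ranking,
                 bm25_weight=0.5, vector_weight=0.5, rrf_k=60):
    """Fuse two ranked lists of document IDs with weighted Reciprocal Rank Fusion.

    Hypothetical helper for illustration; defaults are assumed values.
    """
    scores = defaultdict(float)
    # BM25 contribution: bm25_weight * 1 / (rrf_k + rank_bm25), ranks start at 1
    for rank, doc_id in enumerate(bm25_ranking, start=1):
        scores[doc_id] += bm25_weight * 1.0 / (rrf_k + rank)
    # Vector contribution: vector_weight * 1 / (rrf_k + rank_vec)
    for rank, doc_id in enumerate(vector_ranking, start=1):
        scores[doc_id] += vector_weight * 1.0 / (rrf_k + rank)
    # Highest fused score first; ties broken by document ID for determinism
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Toy rankings: "b" and "c" appear in both lists, so they should rise to the top
fused = weighted_rrf(["a", "b", "c"], ["b", "c", "d"])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.6f}")
```

Documents found by both retrievers accumulate two terms and outrank those found by only one, which is the behaviour the Hybrid Retrieval Sequence diagram (#6) is meant to show before the Boost and Threshold steps.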