# VGEC RAG Chatbot — Software Documentation Plan
> Based on IEEE/Industry Standard | Updated: 2026-03-25
> Reference: `CODEBASE_DOCUMENTATION.md` covers most of Phase 5 already — reuse it.

---

## DIAGRAMS FIRST — Priority Order

> Do all diagrams before writing any prose. Diagrams take the most time and are referenced throughout.

| # | Diagram | Phase Used In | Tool | Status |
|---|---|---|---|---|
| 1 | High-Level Architecture (Component Diagram) | Phase 5 | Draw.io / Mermaid | [ ] |
| 2 | Data Flow — Query Path | Phase 5 | Draw.io (DFD Level 2) | [ ] |
| 3 | Data Flow — Ingestion Path | Phase 5 | Draw.io (DFD Level 2) | [ ] |
| 4 | Hierarchical Taxonomy Tree (Type→Category→Topic→Intent) | Phase 5 | Tree diagram / Mermaid | [ ] |
| 5 | Filter Decision Flowchart (Strict→Partial→Fallback) | Phase 5 | Flowchart / Draw.io | [ ] |
| 6 | Hybrid Retrieval Sequence (Vector→BM25→RRF→Boost) | Phase 5 | Sequence diagram / Flow | [ ] |
| 7 | Use Case Diagram (Student, Faculty, Admin actors) | Phase 4 | Draw.io / PlantUML | [ ] |
| 8 | System Context Diagram / Level 0 DFD | Phase 2 | Draw.io | [ ] |
| 9 | Class Diagram (simplified — RAGService + helpers) | Phase 6 | Draw.io / UML | [ ] |
| 10 | Activity Diagram — Chunking Process | Phase 6 | Activity flow / Draw.io | [ ] |
| 11 | MRR Bar Chart — Your RAG vs Traditional | Phase 7 | matplotlib / Excel | [ ] |
| 12 | Noise Rate Bar Chart — Comparison | Phase 7 | matplotlib / Excel | [ ] |
| 13 | Classifier Confusion Matrix (per field) | Phase 7 | Seaborn heatmap | [ ] |
| 14 | Deployment Diagram (Express → FastAPI → ChromaDB) | Phase 8 | Draw.io | [ ] |
| 15 | Future Roadmap / Gantt-style Timeline | Phase 9 | Draw.io / simple table | [ ] |

---

## Phase 1 — Front Matter
**Est. time: 1–2 hrs | No diagrams needed**

- [ ] Title Page
  - Project: VGEC RAG Chatbot
  - Subtitle: Retrieval-Augmented Generation System for Academic Queries
  - Name, Roll No., Department, Submission Date
  - Guide name, College name
- [ ] Abstract (150–200 words)
  - Problem: Students struggle to find accurate VGEC information scattered across the college website
  - Solution: RAG-based chatbot with hierarchical classification + hybrid retrieval
  - Key results: MRR, noise reduction *(fill placeholders after deployment)*
  - Tech: FastAPI, ChromaDB, Gemini, Logistic Regression classifier
- [ ] Table of Contents *(auto-generate at end — structure it now)*
- [ ] List of Figures *(auto-generate at end)*
- [ ] List of Abbreviations
  - RAG, BM25, RRF, LLM, MRR, API, VGEC, HOD, etc.

---

## Phase 2 — Introduction
**Est. time: 2–3 hrs | Diagrams needed: System Context Diagram (Diagram #8)**

- [ ] 2.1 Background
  - Current state: Static website, PDFs, manual queries to admin office
  - Pain points: Information scattered, no natural language interface
- [ ] 2.2 Problem Statement
  - Lack of intelligent query system for institutional data
  - Need for domain-specific (VGEC) accurate retrieval
- [ ] 2.3 Objectives
  - Build a RAG pipeline that reaches MRR > 0.75 (same target as the accuracy requirement in 4.2)
  - Implement metadata classification for pre-filtering
  - Provide REST API for frontend integration
  - Deploy with a secure Express gateway
- [ ] 2.4 Scope
  - **In scope:** Department data (faculty, labs, syllabus, HOD, intake), REST API, classification, evaluation
  - **Out of scope:** Real-time website scraping, admissions processing, multimedia

> **Reuse from:** `CODEBASE_DOCUMENTATION.md` Section 1 (Project Overview)

---

## Phase 3 — Literature Review / Related Work
**Est. time: 2–3 hrs | Diagrams needed: Evolution timeline (simple horizontal flow)**

- [ ] 3.1 Traditional Chatbots
  - Rule-based (ALICE, ELIZA) — rigid, no context
  - Keyword matching chatbots — no semantic understanding
- [ ] 3.2 Modern RAG Systems
  - OpenAI GPT-4 + vector DB (generic, not domain-specific)
  - LlamaIndex / LangChain baseline RAG — no metadata filtering
- [ ] 3.3 Hybrid Search Systems
  - Lexical-only search (e.g. Elasticsearch's default BM25 scoring) vs vector-only search (e.g. Cohere embeddings in a vector index)
  - RRF as the standard fusion method (reference paper)
- [ ] 3.4 Your Differentiation
  - Hierarchical classifier (Type→Category→Topic→Intent) for pre-filtering
  - Hybrid retrieval (BM25 + Vector + RRF) vs pure semantic search
  - Domain-specific ingestion strategy (intent-aware JSON chunking)

---

## Phase 4 — System Analysis & Requirements
**Est. time: 3–4 hrs | Diagrams needed: Use Case Diagram (#7), Level 1 DFD**

- [ ] 4.1 Functional Requirements
  - FR1: Ingest structured JSON and unstructured documents (PDF, MD, TXT)
  - FR2: Classify queries into metadata filters (type, category, topic, intent)
  - FR3: Retrieve relevant chunks with configurable similarity threshold
  - FR4: Generate contextual answers using Gemini or local LLM
  - FR5: Provide CRUD operations on vector store via REST API
  - FR6: Rate-limit and authenticate requests via Express gateway
- [ ] 4.2 Non-Functional Requirements
  - Performance: <5s response (cloud), <30s (local LLM)
  - Accuracy: MRR >0.75
  - Security: Admin routes protected by JWT, Python API never publicly exposed
  - Scalability: Support 10,000+ chunks in ChromaDB
- [ ] 4.3 Use Case Diagram *(Diagram #7)*
  - Actors: Student, Faculty, Admin
  - Student use cases: Submit query, View answer, View references
  - Admin use cases: Ingest document, Delete document, Run evaluation, Change settings
- [ ] 4.4 Level 1 DFD
  - Major processes: Ingest, Classify, Retrieve, Generate, Evaluate

---

## Phase 5 — System Design
**Est. time: 4–6 hrs | MOST MARKS, MOST DIAGRAMS**
**Diagrams needed: #1, #2, #3, #4, #5, #6**

> **Reuse heavily from:** `CODEBASE_DOCUMENTATION.md` Sections 2, 3, 4

- [ ] 5.1 Architecture Design
  - [ ] High-Level Component Diagram *(Diagram #1)*
  - [ ] Data Flow — Ingestion Path *(Diagram #3)*
  - [ ] Data Flow — Query Path *(Diagram #2)*
  - [ ] Technology Stack Table (already in CODEBASE_DOCUMENTATION.md Section 1)
- [ ] 5.2 Database Design
  - [ ] Vector DB Metadata Schema (field table — already in CODEBASE_DOCUMENTATION.md Section 3)
  - [ ] Source JSON Schema (already documented)
  - [ ] File Tracking Registry Schema (FileService JSON records)
- [ ] 5.3 Algorithm Design
  - [ ] Hierarchical Taxonomy Tree *(Diagram #4)* (Type → Category → Topic → Intent)
  - [ ] Filter Decision Flowchart *(Diagram #5)* (confidence thresholds → Strict/Partial/Fallback)
  - [ ] Hybrid Retrieval Sequence *(Diagram #6)* (Vector → BM25 → RRF formula → Boost → Threshold)
  - [ ] Chunking Strategy (JSON intent-aware vs RecursiveCharacterTextSplitter)
  - [ ] RRF Formula — document with the actual equation:
    ```
    score(d) = bm25_weight * 1/(rrf_k + rank_bm25)
             + vector_weight * 1/(rrf_k + rank_vec)
    ```
- [ ] 5.4 Interface Design
  - [ ] API Endpoint Table — /rag and /vector routes (already in CODEBASE_DOCUMENTATION.md Section 5)
  - [ ] Request/Response JSON examples (sample curl or Postman output)
  - [ ] Express Gateway design (rate limit + auth + concurrency queue)
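
The weighted RRF equation in 5.3 can be illustrated with a short sketch. This is a minimal version under stated assumptions, not the project's actual scoring loop: the function name `weighted_rrf`, the default weights of 0.5, and `rrf_k = 60` are all placeholders chosen for illustration. Ranks start at 1, and a chunk missing from one list simply receives no contribution from that list.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def weighted_rrf(
    bm25_ranked: List[str],
    vector_ranked: List[str],
    bm25_weight: float = 0.5,    # assumed default, tune per the hyperparameter table
    vector_weight: float = 0.5,  # assumed default
    rrf_k: int = 60,             # assumed default smoothing constant
) -> List[Tuple[str, float]]:
    """Fuse two ranked lists of chunk IDs using weighted Reciprocal Rank Fusion."""
    scores: Dict[str, float] = defaultdict(float)

    # score(d) += bm25_weight * 1 / (rrf_k + rank_bm25)
    for rank, chunk_id in enumerate(bm25_ranked, start=1):
        scores[chunk_id] += bm25_weight / (rrf_k + rank)

    # score(d) += vector_weight * 1 / (rrf_k + rank_vec)
    for rank, chunk_id in enumerate(vector_ranked, start=1):
        scores[chunk_id] += vector_weight / (rrf_k + rank)

    # Highest fused score first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy chunk IDs, purely illustrative
fused = weighted_rrf(["c1", "c2", "c3"], ["c3", "c1", "c4"])
for chunk_id, score in fused:
    print(f"{chunk_id}: {score:.5f}")
```

In this toy input, `c1` and `c3` appear in both lists, so they outrank `c2` and `c4`, which each appear in only one. That is the behaviour the sequence diagram (#6) should make visible before the boost and threshold steps.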

---

## Phase 6 — Implementation
**Est. time: 2–3 hrs | Diagrams needed: Class Diagram (#9), Activity Diagram (#10), directory tree**

> **Reuse heavily from:** `CODEBASE_DOCUMENTATION.md` Section 5 and Section 8

- [ ] 6.1 Directory Structure (already in CODEBASE_DOCUMENTATION.md Section 8)
- [ ] 6.2 Module Descriptions (already in CODEBASE_DOCUMENTATION.md Section 5)
- [ ] 6.3 Key Code Snippets *(do NOT paste full files — only algorithm excerpts)*
  - [ ] Filter construction logic (`_build_filter` method)
  - [ ] RRF scoring loop
  - [ ] Intent-aware JSON chunking (`handle_json_docs`)
  - [ ] Classifier prediction + threshold gating
- [ ] 6.4 Configuration
  - [ ] `.env` variables table (already in CODEBASE_DOCUMENTATION.md Section 5)
  - [ ] Hyperparameter table (BM25 weights, thresholds, chunk size)
- [ ] 6.5 Express Gateway Implementation
  - [ ] Rate limiting configuration
  - [ ] JWT auth middleware snippet
  - [ ] Concurrency queue (`p-limit`) snippet
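
For the filter construction and threshold gating excerpts listed in 6.3, a hedged sketch of the Strict → Partial → Fallback decision may help when drafting the section. It is not the real `_build_filter` method: the function name `build_filter`, the single `threshold` of 0.6, the rule of stopping at the first low-confidence level, and the example labels are all assumptions. The returned dictionary uses ChromaDB's `where` syntax (plain equality, combined with `$and` when there is more than one condition).

```python
from typing import Dict, Optional, Tuple

# Taxonomy order taken from the plan: Type -> Category -> Topic -> Intent
FIELDS = ("type", "category", "topic", "intent")


def build_filter(
    predictions: Dict[str, Tuple[str, float]],
    threshold: float = 0.6,  # assumed value, not taken from the codebase
) -> Tuple[str, Optional[dict]]:
    """Return (mode, where); mode is 'strict', 'partial' or 'fallback'."""
    confident = []
    for field in FIELDS:
        label, confidence = predictions.get(field, ("", 0.0))
        if confidence < threshold:
            break  # stop at the first uncertain level of the hierarchy
        confident.append({field: label})

    if not confident:
        return "fallback", None  # no metadata filter, search the whole store

    mode = "strict" if len(confident) == len(FIELDS) else "partial"
    # ChromaDB expects a bare condition for one field and $and for several
    where = confident[0] if len(confident) == 1 else {"$and": confident}
    return mode, where


# Hypothetical classifier output: (label, confidence) per field
mode, where = build_filter({
    "type": ("department", 0.92),
    "category": ("faculty", 0.81),
    "topic": ("hod", 0.35),
    "intent": ("lookup", 0.90),
})
print(mode, where)
```

Here the low topic confidence cuts the filter off after `category`, so the query runs in partial mode even though the intent prediction is confident. Whether the real system drops only the weak field or everything below it is a design detail to confirm against the code before documenting it.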

---

## Phase 7 — Testing & Evaluation
**Est. time: 3–4 hrs | Diagrams needed: #11 (MRR bar chart), #12 (noise chart), #13 (confusion matrix)**
> ⚠️ PLACEHOLDER — fill real numbers and screenshots AFTER deployment

- [ ] 7.1 Test Plan
  - [ ] Unit tests: Classifier accuracy per field (run `/test_classifier_dataset`)
  - [ ] Integration tests: End-to-end hybrid query
  - [ ] Performance: Measure average latency (cloud vs local)
- [ ] 7.2 Results
  - [ ] Comparison Table: Traditional pure-vector RAG vs Your Hybrid RAG
    - Metrics: MRR, Hit Rate, Top-1 Hit Rate, Noise Rate, Latency
  - [ ] MRR Bar Chart by query intent type *(Diagram #11)*
  - [ ] Noise Rate comparison *(Diagram #12)*
  - [ ] Classifier Confusion Matrix per field *(Diagram #13)*
- [ ] 7.3 Sample Query Demonstrations
  - Choose 3–5 representative queries and, for each one, show:
    - Input question
    - Classifier output (type, category, topic, intent + confidences)
    - Retrieved chunks with scores
    - Final LLM answer
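
The ranking metrics in the 7.2 comparison table can be computed with a few lines once each test query has a set of known-relevant chunk IDs. The sketch below uses the standard definitions of MRR, Hit Rate, and Top-1 Hit Rate. The function names, the cutoff `k = 5`, and the toy IDs are assumptions for illustration, and Noise Rate is left out because its definition lives in `CODEBASE_DOCUMENTATION.md` Section 6.

```python
from typing import Dict, List, Sequence, Set, Tuple


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(runs: List[Tuple[Sequence[str], Set[str]]], k: int = 5) -> Dict[str, float]:
    """Aggregate metrics over (retrieved_ids, relevant_ids) pairs, cut off at top-k."""
    if not runs:
        raise ValueError("at least one evaluated query is required")
    rr = [reciprocal_rank(retrieved[:k], relevant) for retrieved, relevant in runs]
    n = len(rr)
    return {
        "mrr": sum(rr) / n,
        "hit_rate": sum(1 for r in rr if r > 0) / n,          # relevant chunk in top-k
        "top1_hit_rate": sum(1 for r in rr if r == 1.0) / n,  # relevant chunk at rank 1
    }


# Toy data only: three queries with made-up chunk IDs
metrics = evaluate([
    (["a", "b", "c"], {"a"}),  # hit at rank 1
    (["x", "y", "z"], {"y"}),  # hit at rank 2
    (["p", "q"], {"r"}),       # miss
])
print(metrics)
```

Running the same function over the pure-vector baseline and the hybrid pipeline on an identical query set gives directly comparable rows for the table and the bar chart (#11).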

---

## Phase 8 — Deployment
**Est. time: 1–2 hrs | Diagrams needed: Deployment diagram (#14)**
> ⚠️ PLACEHOLDER — fill AFTER actual deployment

- [ ] 8.1 System Requirements
  - Hardware: 8 GB RAM and a 4-core CPU for the local LLM; the Gemini path needs a Google API key instead
  - Software: Python 3.9+, Node.js 18+, ChromaDB
- [ ] 8.2 Deployment Architecture *(Diagram #14)*
  - Frontend → Express Gateway → FastAPI → ChromaDB
- [ ] 8.3 Installation Steps
  - Clone → `pip install -r requirements.txt` → Set `.env` → Run ingestion → Start API
  - Express: `npm install` → Set `.env` → `node server.js`
- [ ] 8.4 Screenshots *(fill after deployment)*
  - [ ] Swagger UI (`/docs`)
  - [ ] Sample chatbot interaction
  - [ ] Admin panel
  - [ ] Classification test panel

---

## Phase 9 — Future Scope & Conclusion
**Est. time: 1–2 hrs | Diagrams needed: Roadmap (#15)**

- [ ] 9.1 Future Enhancements
  - Dynamic LLM switching via admin UI (ModelManager architecture)
  - Cross-encoder re-ranking step (once compute resources become available)
  - Query result caching layer
  - Automated metadata prediction during ingestion (classifier-assisted)
  - Website scraping for real-time data updates
- [ ] 9.2 Known Limitations (already in CODEBASE_DOCUMENTATION.md Section 7)
  - Local LLM latency (CPU-bound, no GPU)
  - BM25 corpus rebuilt per request
  - No real-time data — static knowledge base
- [ ] 9.3 Conclusion
  - Successfully built domain-specific RAG with hybrid retrieval
  - Hierarchical classification reduces noise and improves precision
  - Secure deployment with Express gateway protects the inference server

---

## Phase 10 — References & Appendices
**Est. time: 1–2 hrs | No diagrams needed**

- [ ] 10.1 References
  - LangChain documentation
  - ChromaDB documentation
  - Original RRF paper (Cormack et al., 2009)
  - Gemini API documentation
  - VGEC official website (data source)
  - BM25 (Robertson & Zaragoza, 2009)
  - Sentence Transformers (Reimers & Gurevych, 2019)
- [ ] 10.2 Appendix A — MASTER_INDEX full taxonomy
- [ ] 10.3 Appendix B — Full API documentation (export from Swagger `/docs`)
- [ ] 10.4 Appendix C — Sample classifier training data
- [ ] 10.5 Appendix D — Sample department JSON format

---

## Execution Timeline

| Phase | When | Priority |
|---|---|---|
| **All Diagrams** | Start NOW (before writing prose) | 🔴 Critical |
| Phase 1–3 (Front Matter, Intro, Lit Review) | Day 1 | Must have |
| Phase 4–5 (Design) | Day 2–3 | 🔴 Critical — most marks |
| Phase 6 (Implementation) | Day 4 | Must have |
| Phase 7 (Testing) | After deployment — Day 5 | 🔴 Critical — proof |
| Phase 8 (Deployment) | After deployment | Must have |
| Phase 9–10 (Future, Refs) | Day 6 | Finish strong |
| Final PDF export + proofread | Last | Required |

---

## Reuse Map — What's Already Written

| Documentation Section | Already in |
|---|---|
| System Architecture (components, data flow) | `CODEBASE_DOCUMENTATION.md` Section 2 |
| Tech Stack Table | `CODEBASE_DOCUMENTATION.md` Section 1 |
| Metadata Schema / Taxonomy | `CODEBASE_DOCUMENTATION.md` Section 3 |
| Retrieval Pipeline steps | `CODEBASE_DOCUMENTATION.md` Section 4 |
| All class/method descriptions | `CODEBASE_DOCUMENTATION.md` Section 5 |
| Metrics definitions | `CODEBASE_DOCUMENTATION.md` Section 6 |
| Known Limitations | `CODEBASE_DOCUMENTATION.md` Section 7 |
| File Structure Tree | `CODEBASE_DOCUMENTATION.md` Section 8 |