
VGEC RAG Chatbot: Software Documentation Plan

Based on IEEE/Industry Standard | Updated: 2026-03-25 | Reference: CODEBASE_DOCUMENTATION.md already covers most of Phase 5; reuse it.


DIAGRAMS FIRST: Priority Order

Do all diagrams before writing any prose. Diagrams take the most time and are referenced throughout.

| # | Diagram | Phase Used In | Tool | Status |
|---|---------|---------------|------|--------|
| 1 | High-Level Architecture (Component Diagram) | Phase 5 | Draw.io / Mermaid | [ ] |
| 2 | Data Flow: Query Path | Phase 5 | Draw.io (DFD Level 2) | [ ] |
| 3 | Data Flow: Ingestion Path | Phase 5 | Draw.io (DFD Level 2) | [ ] |
| 4 | Hierarchical Taxonomy Tree (Type→Category→Topic) | Phase 5 | Tree diagram / Mermaid | [ ] |
| 5 | Filter Decision Flowchart (Strict→Partial→Fallback) | Phase 5 | Flowchart / Draw.io | [ ] |
| 6 | Hybrid Retrieval Sequence (Vector→BM25→RRF→Boost) | Phase 5 | Sequence diagram / Flow | [ ] |
| 7 | Use Case Diagram (Student, Faculty, Admin actors) | Phase 4 | Draw.io / PlantUML | [ ] |
| 8 | System Context Diagram / Level 0 DFD | Phase 2 | Draw.io | [ ] |
| 9 | Class Diagram (simplified: RAGService + helpers) | Phase 6 | Draw.io / UML | [ ] |
| 10 | Activity Diagram: Chunking Process | Phase 6 | Activity flow / Draw.io | [ ] |
| 11 | MRR Bar Chart: Your RAG vs Traditional | Phase 7 | matplotlib / Excel | [ ] |
| 12 | Noise Rate Bar Chart: Comparison | Phase 7 | matplotlib / Excel | [ ] |
| 13 | Classifier Confusion Matrix (per field) | Phase 7 | Seaborn heatmap | [ ] |
| 14 | Deployment Diagram (Express → FastAPI → ChromaDB) | Phase 8 | Draw.io | [ ] |
| 15 | Future Roadmap / Gantt-style Timeline | Phase 9 | Draw.io / simple table | [ ] |

Phase 1 β€” Front Matter

Est. time: 1–2 hrs | No diagrams needed

  • Title Page
    • Project: VGEC RAG Chatbot
    • Subtitle: Retrieval-Augmented Generation System for Academic Queries
    • Name, Roll No., Department, Submission Date
    • Guide name, College name
  • Abstract (150–200 words)
    • Problem: Students struggle to find accurate VGEC information scattered across the college website
    • Solution: RAG-based chatbot with hierarchical classification + hybrid retrieval
    • Key results: MRR, noise reduction (fill placeholders after deployment)
    • Tech: FastAPI, ChromaDB, Gemini, Logistic Regression classifier
  • Table of Contents (auto-generate at the end; structure it now)
  • List of Figures (auto-generate at end)
  • List of Abbreviations
    • RAG, BM25, RRF, LLM, MRR, API, VGEC, HOD, etc.

Phase 2 β€” Introduction

Est. time: 2–3 hrs | Diagrams needed: System Context Diagram (Diagram #8)

  • 2.1 Background
    • Current state: Static website, PDFs, manual queries to admin office
    • Pain points: Information scattered, no natural language interface
  • 2.2 Problem Statement
    • Lack of intelligent query system for institutional data
    • Need for domain-specific (VGEC) accurate retrieval
  • 2.3 Objectives
    • Build RAG pipeline with MRR > 0.75
    • Implement metadata classification for pre-filtering
    • Provide REST API for frontend integration
    • Deploy with a secure Express gateway
  • 2.4 Scope
    • In scope: Department data (faculty, labs, syllabus, HOD, intake), REST API, classification, evaluation
    • Out of scope: Real-time website scraping, admissions processing, multimedia

Reuse from: CODEBASE_DOCUMENTATION.md Section 1 (Project Overview)


Phase 3 β€” Literature Review / Related Work

Est. time: 2–3 hrs | Diagrams needed: Evolution timeline (simple horizontal flow)

  • 3.1 Traditional Chatbots
    • Rule-based (ALICE, ELIZA): rigid, no context
    • Keyword matching chatbots: no semantic understanding
  • 3.2 Modern RAG Systems
    • OpenAI GPT-4 + vector DB (generic, not domain-specific)
    • LlamaIndex / LangChain baseline RAG: no metadata filtering
  • 3.3 Hybrid Search Systems
    • Elasticsearch (BM25 only), Cohere (vector only)
    • RRF as the standard fusion method (reference paper)
  • 3.4 Your Differentiation
    • Hierarchical classifier (Type→Category→Topic→Intent) for pre-filtering
    • Hybrid retrieval (BM25 + Vector + RRF) vs pure semantic search
    • Domain-specific ingestion strategy (intent-aware JSON chunking)

Phase 4 β€” System Analysis & Requirements

Est. time: 3–4 hrs | Diagrams needed: Use Case Diagram (#7), Level 1 DFD

  • 4.1 Functional Requirements
    • FR1: Ingest structured JSON and unstructured documents (PDF, MD, TXT)
    • FR2: Classify queries into metadata filters (type, category, topic, intent)
    • FR3: Retrieve relevant chunks with configurable similarity threshold
    • FR4: Generate contextual answers using Gemini or local LLM
    • FR5: Provide CRUD operations on vector store via REST API
    • FR6: Rate-limit and authenticate requests via Express gateway
  • 4.2 Non-Functional Requirements
    • Performance: <5s response (cloud), <30s (local LLM)
    • Accuracy: MRR >0.75
    • Security: Admin routes protected by JWT, Python API never publicly exposed
    • Scalability: Support 10,000+ chunks in ChromaDB
  • 4.3 Use Case Diagram (Diagram #7)
    • Actors: Student, Faculty, Admin
    • Student use cases: Submit query, View answer, View references
    • Admin use cases: Ingest document, Delete document, Run evaluation, Change settings
  • 4.4 Level 1 DFD
    • Major processes: Ingest, Classify, Retrieve, Generate, Evaluate

Phase 5 β€” System Design

Est. time: 4–6 hrs | MOST MARKS, MOST DIAGRAMS | Diagrams needed: #1, #2, #3, #4, #5, #6

Reuse heavily from: CODEBASE_DOCUMENTATION.md Sections 2, 3, 4

  • 5.1 Architecture Design
    • High-Level Component Diagram (Diagram #1)
    • Data Flow: Ingestion Path (Diagram #3)
    • Data Flow: Query Path (Diagram #2)
    • Technology Stack Table (already in CODEBASE_DOCUMENTATION.md Section 1)
  • 5.2 Database Design
    • Vector DB Metadata Schema (field table β€” already in CODEBASE_DOCUMENTATION.md Section 3)
    • Source JSON Schema (already documented)
    • File Tracking Registry Schema (FileService JSON records)
  • 5.3 Algorithm Design
    • Hierarchical Taxonomy Tree (Diagram #4) (Type → Category → Topic → Intent)
    • Filter Decision Flowchart (Diagram #5) (confidence thresholds → Strict/Partial/Fallback)
    • Hybrid Retrieval Sequence (Diagram #6) (Vector → BM25 → RRF formula → Boost → Threshold)
    • Chunking Strategy (JSON intent-aware vs RecursiveCharacterTextSplitter)
    • RRF Formula: document it with the actual equation:
      score(d) = bm25_weight * 1/(rrf_k + rank_bm25)
               + vector_weight * 1/(rrf_k + rank_vec)
      
  • 5.4 Interface Design
    • API Endpoint Table β€” /rag and /vector routes (already in CODEBASE_DOCUMENTATION.md Section 5)
    • Request/Response JSON examples (sample curl or Postman output)
    • Express Gateway design (rate limit + auth + concurrency queue)
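
The weighted RRF formula listed under 5.3 can be sketched in Python as follows. This is a minimal illustration and not the project's actual scoring loop: the function name weighted_rrf, the equal 0.5 weights, and rrf_k = 60 (a constant commonly used with RRF) are assumptions, and the Boost and Threshold steps from Diagram #6 are left out.

```python
from collections import defaultdict


def weighted_rrf(bm25_ranking, vector_ranking,
                 bm25_weight=0.5, vector_weight=0.5, rrf_k=60):
    """Fuse two ranked lists of chunk IDs with weighted Reciprocal Rank Fusion.

    Ranks are 1-based. A chunk missing from one list simply gets no
    contribution from that list (an assumption for this sketch).
    """
    scores = defaultdict(float)
    for rank, doc_id in enumerate(bm25_ranking, start=1):
        scores[doc_id] += bm25_weight / (rrf_k + rank)
    for rank, doc_id in enumerate(vector_ranking, start=1):
        scores[doc_id] += vector_weight / (rrf_k + rank)
    # Highest fused score first; ties broken by ID so the order is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical chunk IDs, used only to show the call shape.
fused = weighted_rrf(["a", "b", "c"], ["b", "c", "d"])
print([doc_id for doc_id, _ in fused])
```

Chunks that appear in both rankings ("b" and "c" here) accumulate two terms and rise above chunks found by only one retriever, which is the behaviour the Hybrid Retrieval Sequence diagram should make visible.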

Phase 6 β€” Implementation

Est. time: 2–3 hrs | Diagrams needed: Directory tree, Class Diagram (#9), Activity Diagram (#10)

Reuse heavily from: CODEBASE_DOCUMENTATION.md Section 5 and Section 8

  • 6.1 Directory Structure (already in CODEBASE_DOCUMENTATION.md Section 8)
  • 6.2 Module Descriptions (already in CODEBASE_DOCUMENTATION.md Section 5)
  • 6.3 Key Code Snippets (do NOT paste full files; only algorithm excerpts)
    • Filter construction logic (_build_filter method)
    • RRF scoring loop
    • Intent-aware JSON chunking (handle_json_docs)
    • Classifier prediction + threshold gating
  • 6.4 Configuration
    • .env variables table (already in CODEBASE_DOCUMENTATION.md Section 5)
    • Hyperparameter table (BM25 weights, thresholds, chunk size)
  • 6.5 Express Gateway Implementation
    • Rate limiting configuration
    • JWT auth middleware snippet
    • Concurrency queue (p-limit) snippet
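
For the "Classifier prediction + threshold gating" excerpt in 6.3, a sketch along the following lines may help when drafting the Strict/Partial/Fallback explanation. It is not the project's _build_filter method: the function name build_filter, the two threshold values, the sample labels, and the rule that a weak parent field drops its children are all assumptions. The returned dictionary follows ChromaDB's `where` filter syntax ($and, $eq).

```python
HIERARCHY = ("type", "category", "topic")  # order taken from the taxonomy in 5.3


def build_filter(predictions, strict_threshold=0.8, partial_threshold=0.5):
    """Turn classifier output into a (mode, where) pair.

    predictions maps a field name to (label, confidence).
    mode is "strict", "partial", or "fallback"; where is None for fallback.
    """
    kept = {}
    for field in HIERARCHY:
        label, confidence = predictions.get(field, (None, 0.0))
        if label is None or confidence < partial_threshold:
            break  # assumed rule: a weak parent makes its children unreliable
        kept[field] = (label, confidence)

    if not kept:
        return "fallback", None  # no metadata filter, plain hybrid search

    all_strict = len(kept) == len(HIERARCHY) and all(
        confidence >= strict_threshold for _, confidence in kept.values()
    )
    mode = "strict" if all_strict else "partial"

    conditions = [{field: {"$eq": label}} for field, (label, _) in kept.items()]
    # ChromaDB expects a single condition as-is and several wrapped in $and.
    where = conditions[0] if len(conditions) == 1 else {"$and": conditions}
    return mode, where


# Hypothetical labels and confidences, for illustration only.
strict = build_filter({"type": ("department", 0.95),
                       "category": ("faculty", 0.90),
                       "topic": ("hod", 0.85)})
partial = build_filter({"type": ("department", 0.95),
                        "category": ("faculty", 0.60),
                        "topic": ("hod", 0.30)})
fallback = build_filter({"type": ("department", 0.20)})
print(strict[0], partial[0], fallback[0])
```

Whatever the real thresholds are, they belong in the hyperparameter table in 6.4 so the snippet and the table stay in agreement.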

Phase 7 β€” Testing & Evaluation

Est. time: 3–4 hrs | Diagrams needed: #11 (MRR bar chart), #12 (noise chart), #13 (confusion matrix)

⚠️ PLACEHOLDER: fill in real numbers and screenshots AFTER deployment

  • 7.1 Test Plan
    • Unit tests: Classifier accuracy per field (run /test_classifier_dataset)
    • Integration tests: End-to-end hybrid query
    • Performance: Measure average latency (cloud vs local)
  • 7.2 Results
    • Comparison Table: Traditional pure-vector RAG vs Your Hybrid RAG
      • Metrics: MRR, Hit Rate, Top-1 Hit Rate, Noise Rate, Latency
    • MRR Bar Chart by query intent type (Diagram #11)
    • Noise Rate comparison (Diagram #12)
    • Classifier Confusion Matrix per field (Diagram #13)
  • 7.3 Sample Query Demonstrations
    • Choose 3–5 representative queries, show:
      • Input question
      • Classifier output (type, category, topic, intent + confidences)
      • Retrieved chunks with scores
      • Final LLM answer
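
The ranking metrics in the 7.2 comparison table can be computed with a short script once labelled test queries exist. The sketch below uses the standard definitions of MRR, Hit Rate, and Top-1 Hit Rate; the function names, the cutoff k = 5, and the toy chunk IDs are assumptions, and the numbers it prints are not project results. Noise Rate is omitted because its definition lives in CODEBASE_DOCUMENTATION.md Section 6.

```python
def reciprocal_rank(retrieved_ids, relevant_ids):
    """Return 1/rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def evaluate(runs, k=5):
    """runs: list of (retrieved_ids, relevant_ids) pairs, one per test query."""
    n = len(runs)
    if n == 0:
        raise ValueError("no evaluation queries supplied")
    mrr = sum(reciprocal_rank(ret[:k], rel) for ret, rel in runs) / n
    hit_rate = sum(any(d in rel for d in ret[:k]) for ret, rel in runs) / n
    top1 = sum(bool(ret) and ret[0] in rel for ret, rel in runs) / n
    return {"mrr": mrr, "hit_rate": hit_rate, "top1_hit_rate": top1}


# Toy data: three queries with made-up chunk IDs.
toy_runs = [
    (["c1", "c2", "c3"], {"c1"}),  # relevant chunk at rank 1
    (["c4", "c5", "c6"], {"c5"}),  # relevant chunk at rank 2
    (["c7", "c8", "c9"], {"c0"}),  # relevant chunk never retrieved
]
metrics = evaluate(toy_runs)
print(metrics)
```

Running the same script against the pure-vector baseline and the hybrid pipeline, with identical test queries, gives the two columns of the comparison table and the inputs for Diagram #11.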

Phase 8 β€” Deployment

Est. time: 1–2 hrs | Diagrams needed: Deployment diagram (#14)

⚠️ PLACEHOLDER: fill in AFTER actual deployment

  • 8.1 System Requirements
    • Hardware: 8 GB RAM and a 4-core CPU when running the local LLM; when using Gemini instead, a Google API key is required
    • Software: Python 3.9+, Node.js 18+, ChromaDB
  • 8.2 Deployment Architecture (Diagram #14)
    • Frontend → Express Gateway → FastAPI → ChromaDB
  • 8.3 Installation Steps
    • Clone → pip install -r requirements.txt → Set .env → Run ingestion → Start API
    • Express: npm install → Set .env → node server.js
  • 8.4 Screenshots (fill after deployment)
    • Swagger UI (/docs)
    • Sample chatbot interaction
    • Admin panel
    • Classification test panel

Phase 9 β€” Future Scope & Conclusion

Est. time: 1–2 hrs | Diagrams needed: Roadmap (#15)

  • 9.1 Future Enhancements
    • Dynamic LLM switching via admin UI (ModelManager architecture)
    • Cross-encoder re-ranking step (once compute resources become available)
    • Query result caching layer
    • Automated metadata prediction during ingestion (classifier-assisted)
    • Website scraping for real-time data updates
  • 9.2 Known Limitations (already in CODEBASE_DOCUMENTATION.md Section 7)
    • Local LLM latency (CPU-bound, no GPU)
    • BM25 corpus rebuilt per request
    • No real-time data: static knowledge base
  • 9.3 Conclusion
    • Successfully built domain-specific RAG with hybrid retrieval
    • Hierarchical classification reduces noise and improves precision
    • Secure deployment with Express gateway protects the inference server

Phase 10 β€” References & Appendices

Est. time: 1–2 hrs | No diagrams needed

  • 10.1 References
    • LangChain documentation
    • ChromaDB documentation
    • Original RRF paper (Cormack et al., 2009)
    • Gemini API documentation
    • VGEC official website (data source)
    • BM25 (Robertson & Zaragoza, 2009)
    • Sentence Transformers (Reimers & Gurevych, 2019)
  • 10.2 Appendix A: MASTER_INDEX full taxonomy
  • 10.3 Appendix B: Full API documentation (export from Swagger /docs)
  • 10.4 Appendix C: Sample classifier training data
  • 10.5 Appendix D: Sample department JSON format

Execution Timeline

| Phase | When | Priority |
|-------|------|----------|
| All Diagrams | Start NOW (before writing prose) | 🔴 Critical |
| Phase 1–3 (Intro, Lit Review) | Day 1 | Must have |
| Phase 4–5 (Design) | Day 2–3 | 🔴 Critical: most marks |
| Phase 6 (Implementation) | Day 4 | Must have |
| Phase 7 (Testing) | After deployment, Day 5 | 🔴 Critical: proof |
| Phase 8 (Deployment) | After deployment | Must have |
| Phase 9–10 (Future, Refs) | Day 6 | Finish strong |
| Final PDF export + proofread | Last | Required |

Reuse Map: What's Already Written

| Documentation Section | Already in |
|-----------------------|------------|
| System Architecture (components, data flow) | CODEBASE_DOCUMENTATION.md Section 2 |
| Tech Stack Table | CODEBASE_DOCUMENTATION.md Section 1 |
| Metadata Schema / Taxonomy | CODEBASE_DOCUMENTATION.md Section 3 |
| Retrieval Pipeline steps | CODEBASE_DOCUMENTATION.md Section 4 |
| All class/method descriptions | CODEBASE_DOCUMENTATION.md Section 5 |
| Metrics definitions | CODEBASE_DOCUMENTATION.md Section 6 |
| Known Limitations | CODEBASE_DOCUMENTATION.md Section 7 |
| File Structure Tree | CODEBASE_DOCUMENTATION.md Section 8 |