VGEC RAG Chatbot — Software Documentation Plan

Based on IEEE/Industry Standard | Updated: 2026-03-25

Reference: CODEBASE_DOCUMENTATION.md already covers most of Phase 5; reuse it.

DIAGRAMS FIRST — Priority Order

Do all diagrams before writing any prose. Diagrams take the most time and are referenced throughout.

| # | Diagram | Phase Used In | Tool | Status |
|---|---------|---------------|------|--------|
| 1 | High-Level Architecture (Component Diagram) | Phase 5 | Draw.io / Mermaid | [ ] |
| 2 | Data Flow: Query Path | Phase 5 | Draw.io (DFD Level 2) | [ ] |
| 3 | Data Flow: Ingestion Path | Phase 5 | Draw.io (DFD Level 2) | [ ] |
| 4 | Hierarchical Taxonomy Tree (Type→Category→Topic) | Phase 5 | Tree diagram / Mermaid | [ ] |
| 5 | Filter Decision Flowchart (Strict→Partial→Fallback) | Phase 5 | Flowchart / Draw.io | [ ] |
| 6 | Hybrid Retrieval Sequence (Vector→BM25→RRF→Boost) | Phase 5 | Sequence diagram / Flow | [ ] |
| 7 | Use Case Diagram (Student, Faculty, Admin actors) | Phase 4 | Draw.io / PlantUML | [ ] |
| 8 | System Context Diagram / Level 0 DFD | Phase 2 | Draw.io | [ ] |
| 9 | Class Diagram (simplified: RAGService + helpers) | Phase 6 | Draw.io / UML | [ ] |
| 10 | Activity Diagram: Chunking Process | Phase 6 | Activity flow / Draw.io | [ ] |
| 11 | MRR Bar Chart: Your RAG vs Traditional | Phase 7 | matplotlib / Excel | [ ] |
| 12 | Noise Rate Bar Chart: Comparison | Phase 7 | matplotlib / Excel | [ ] |
| 13 | Classifier Confusion Matrix (per field) | Phase 7 | Seaborn heatmap | [ ] |
| 14 | Deployment Diagram (Express → FastAPI → ChromaDB) | Phase 8 | Draw.io | [ ] |
| 15 | Future Roadmap / Gantt-style Timeline | Phase 9 | Draw.io / simple table | [ ] |

Phase 1 — Front Matter

Est. time: 1–2 hrs | No diagrams needed

Title Page
- Project: VGEC RAG Chatbot
- Subtitle: Retrieval-Augmented Generation System for Academic Queries
- Name, Roll No., Department, Submission Date
- Guide name, College name
Abstract (150–200 words)
- Problem: Students struggle to find accurate VGEC information, which is scattered across the website
- Solution: RAG-based chatbot with hierarchical classification + hybrid retrieval
- Key results: MRR, noise reduction (fill placeholders after deployment)
- Tech: FastAPI, ChromaDB, Gemini, Logistic Regression classifier
Table of Contents (auto-generate at end — structure it now)
List of Figures (auto-generate at end)
List of Abbreviations
- RAG, BM25, RRF, LLM, MRR, API, VGEC, HOD, etc.

Phase 2 — Introduction

Est. time: 2–3 hrs | Diagrams needed: System Context Diagram (Diagram #8)

2.1 Background
- Current state: Static website, PDFs, manual queries to admin office
- Pain points: Information scattered, no natural language interface
2.2 Problem Statement
- Lack of intelligent query system for institutional data
- Need for domain-specific (VGEC) accurate retrieval
2.3 Objectives
- Build RAG pipeline with MRR > 0.75
- Implement metadata classification for pre-filtering
- Provide REST API for frontend integration
- Deploy with a secure Express gateway
2.4 Scope
- In scope: Department data (faculty, labs, syllabus, HOD, intake), REST API, classification, evaluation
- Out of scope: Real-time website scraping, admissions processing, multimedia

Reuse from: CODEBASE_DOCUMENTATION.md Section 1 (Project Overview)

Phase 3 — Literature Review / Related Work

Est. time: 2–3 hrs | Diagrams needed: Evolution timeline (simple horizontal flow)

3.1 Traditional Chatbots
- Rule-based (ALICE, ELIZA) — rigid, no context
- Keyword matching chatbots — no semantic understanding
3.2 Modern RAG Systems
- OpenAI GPT-4 + vector DB (generic, not domain-specific)
- LlamaIndex / LangChain baseline RAG — no metadata filtering
3.3 Hybrid Search Systems
- Elasticsearch (BM25 lexical scoring by default), Cohere (embedding-based semantic search)
- RRF as a widely used rank-fusion method (cite Cormack et al., 2009)
3.4 Your Differentiation
- Hierarchical classifier (Type→Category→Topic→Intent) for pre-filtering
- Hybrid retrieval (BM25 + Vector + RRF) vs pure semantic search
- Domain-specific ingestion strategy (intent-aware JSON chunking)

Phase 4 — System Analysis & Requirements

Est. time: 3–4 hrs | Diagrams needed: Use Case Diagram (#7), Level 1 DFD

4.1 Functional Requirements
- FR1: Ingest structured JSON and unstructured documents (PDF, MD, TXT)
- FR2: Classify queries into metadata filters (type, category, topic, intent)
- FR3: Retrieve relevant chunks with configurable similarity threshold
- FR4: Generate contextual answers using Gemini or local LLM
- FR5: Provide CRUD operations on vector store via REST API
- FR6: Rate-limit and authenticate requests via Express gateway
4.2 Non-Functional Requirements
- Performance: <5s response (cloud), <30s (local LLM)
- Accuracy: MRR >0.75
- Security: Admin routes protected by JWT, Python API never publicly exposed
- Scalability: Support 10,000+ chunks in ChromaDB
4.3 Use Case Diagram (Diagram #7)
- Actors: Student, Faculty, Admin
- Student use cases: Submit query, View answer, View references
- Admin use cases: Ingest document, Delete document, Run evaluation, Change settings
4.4 Level 1 DFD
- Major processes: Ingest, Classify, Retrieve, Generate, Evaluate

Phase 5 — System Design

Est. time: 4–6 hrs | MOST MARKS, MOST DIAGRAMS | Diagrams needed: #1, #2, #3, #4, #5, #6

Reuse heavily from: CODEBASE_DOCUMENTATION.md Sections 2, 3, 4

5.1 Architecture Design
- High-Level Component Diagram (Diagram #1)
- Data Flow — Ingestion Path (Diagram #3)
- Data Flow — Query Path (Diagram #2)
- Technology Stack Table (already in CODEBASE_DOCUMENTATION.md Section 1)
5.2 Database Design
- Vector DB Metadata Schema (field table — already in CODEBASE_DOCUMENTATION.md Section 3)
- Source JSON Schema (already documented)
- File Tracking Registry Schema (FileService JSON records)
5.3 Algorithm Design
- Hierarchical Taxonomy Tree (Diagram #4) (Type → Category → Topic → Intent)
- Filter Decision Flowchart (Diagram #5) (confidence thresholds → Strict/Partial/Fallback)
- Hybrid Retrieval Sequence (Diagram #6) (Vector → BM25 → RRF formula → Boost → Threshold)
- Chunking Strategy (JSON intent-aware vs RecursiveCharacterTextSplitter)
- RRF Formula — document with the actual equation:
```
score(d) = bm25_weight * 1/(rrf_k + rank_bm25)
         + vector_weight * 1/(rrf_k + rank_vec)
```
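A minimal Python sketch of the weighted RRF formula above, for use as a starting point when writing the algorithm excerpt. The function name `weighted_rrf`, the equal default weights, and `rrf_k=60` (the constant suggested in Cormack et al., 2009) are assumptions for illustration, not the project's actual values.

```python
from collections import defaultdict


def weighted_rrf(bm25_ranking, vector_ranking,
                 bm25_weight=0.5, vector_weight=0.5, rrf_k=60):
    """Fuse two ranked lists of document IDs with weighted Reciprocal Rank Fusion."""
    scores = defaultdict(float)
    # Ranks are 1-based; a document missing from one list gets no contribution from it.
    for rank, doc_id in enumerate(bm25_ranking, start=1):
        scores[doc_id] += bm25_weight / (rrf_k + rank)
    for rank, doc_id in enumerate(vector_ranking, start=1):
        scores[doc_id] += vector_weight / (rrf_k + rank)
    # Highest fused score first; ties are broken by document ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical chunk IDs: "b" ranks near the top of both lists, so it wins.
fused = weighted_rrf(["a", "b", "c"], ["b", "d", "a"])
print([doc_id for doc_id, _ in fused])
```

Documents that appear in only one ranking still receive a score, so the fused list is the union of both inputs. Boosting and threshold filtering would be applied to this list afterwards.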
5.4 Interface Design
- API Endpoint Table — /rag and /vector routes (already in CODEBASE_DOCUMENTATION.md Section 5)
- Request/Response JSON examples (sample curl or Postman output)
- Express Gateway design (rate limit + auth + concurrency queue)

Phase 6 — Implementation

Est. time: 2–3 hrs | Diagrams needed: Directory tree, Class Diagram (#9), Activity Diagram (#10)

Reuse heavily from: CODEBASE_DOCUMENTATION.md Section 5 and Section 8

6.1 Directory Structure (already in CODEBASE_DOCUMENTATION.md Section 8)
6.2 Module Descriptions (already in CODEBASE_DOCUMENTATION.md Section 5)
6.3 Key Code Snippets (do NOT paste full files — only algorithm excerpts)
- Filter construction logic (_build_filter method)
- RRF scoring loop
- Intent-aware JSON chunking (handle_json_docs)
- Classifier prediction + threshold gating
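A hedged sketch of how threshold gating could feed filter construction, useful as a template for the excerpt. It is not the project's `_build_filter`: the function name `build_filter`, the single shared threshold of 0.6, the label values, and the rule "keep every field whose confidence clears the threshold" are all assumptions. The `$and` wrapper follows ChromaDB's metadata filter syntax.

```python
FIELDS = ("type", "category", "topic", "intent")


def build_filter(predictions, threshold=0.6):
    """Return (mode, where) from per-field (label, confidence) predictions."""
    # Keep only the fields the classifier is confident about.
    confident = [
        {field: predictions[field][0]}
        for field in FIELDS
        if field in predictions and predictions[field][1] >= threshold
    ]
    if not confident:
        # Fallback: no metadata filter, search the whole collection.
        return "fallback", None
    mode = "strict" if len(confident) == len(FIELDS) else "partial"
    # A single condition is passed as-is; several are combined with $and.
    where = confident[0] if len(confident) == 1 else {"$and": confident}
    return mode, where


# Hypothetical classifier output: only type and category clear the threshold.
mode, where = build_filter({
    "type": ("department", 0.92),
    "category": ("computer", 0.81),
    "topic": ("faculty", 0.40),
    "intent": ("list", 0.55),
})
print(mode, where)
```

If the real logic uses per-field thresholds or stops at the first low-confidence level of the hierarchy, the excerpt in the report should show that instead.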
6.4 Configuration
- .env variables table (already in CODEBASE_DOCUMENTATION.md Section 5)
- Hyperparameter table (BM25 weights, thresholds, chunk size)
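One possible way to present the hyperparameters alongside the .env table is a small settings object. The sketch below is illustrative only: the class name, the environment variable names (`BM25_WEIGHT`, `VECTOR_WEIGHT`, `RRF_K`, `SIMILARITY_THRESHOLD`, `CHUNK_SIZE`), and every default value are hypothetical placeholders, to be replaced with the real names and tuned values from the codebase.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrievalSettings:
    # All defaults are placeholders, not the project's tuned values.
    bm25_weight: float = 0.5
    vector_weight: float = 0.5
    rrf_k: int = 60
    similarity_threshold: float = 0.3
    chunk_size: int = 500


def load_settings(env=None):
    """Read hypothetical variables from the environment, falling back to defaults."""
    env = os.environ if env is None else env
    d = RetrievalSettings()
    return RetrievalSettings(
        bm25_weight=float(env.get("BM25_WEIGHT", d.bm25_weight)),
        vector_weight=float(env.get("VECTOR_WEIGHT", d.vector_weight)),
        rrf_k=int(env.get("RRF_K", d.rrf_k)),
        similarity_threshold=float(env.get("SIMILARITY_THRESHOLD", d.similarity_threshold)),
        chunk_size=int(env.get("CHUNK_SIZE", d.chunk_size)),
    )


# A plain dict stands in for os.environ so the example is reproducible.
settings = load_settings({"BM25_WEIGHT": "0.7", "CHUNK_SIZE": "800"})
print(settings)
```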
6.5 Express Gateway Implementation
- Rate limiting configuration
- JWT auth middleware snippet
- Concurrency queue (p-limit) snippet

Phase 7 — Testing & Evaluation

Est. time: 3–4 hrs | Diagrams needed: #11 (MRR bar chart), #12 (noise chart), #13 (confusion matrix)

⚠️ PLACEHOLDER — fill real numbers and screenshots AFTER deployment

7.1 Test Plan
- Unit tests: Classifier accuracy per field (run /test_classifier_dataset)
- Integration tests: End-to-end hybrid query
- Performance: Measure average latency (cloud vs local)
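For the latency measurement, a small timing harness along these lines may be enough. `measure_latency` and the `ask` callable are hypothetical names; in practice `ask` would wrap the real request to the chatbot (cloud or local LLM), which is replaced here by a stand-in so the sketch runs on its own.

```python
import statistics
import time


def measure_latency(ask, questions, repeats=3):
    """Time ask(question) for every question and summarise latency in seconds."""
    samples = []
    for question in questions:
        for _ in range(repeats):
            start = time.perf_counter()
            ask(question)
            samples.append(time.perf_counter() - start)
    return {
        "count": len(samples),
        "mean_s": statistics.mean(samples),
        "max_s": max(samples),
    }


# Stand-in for a real call to the chatbot endpoint.
report = measure_latency(lambda q: q.upper(), ["Who is the HOD?", "List the labs"])
print(report)
```

Running the same question set against both backends gives the cloud vs local comparison for the results table.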
7.2 Results
- Comparison Table: Traditional pure-vector RAG vs Your Hybrid RAG
  - Metrics: MRR, Hit Rate, Top-1 Hit Rate, Noise Rate, Latency
- MRR Bar Chart by query intent type (Diagram #11)
- Noise Rate comparison (Diagram #12)
- Classifier Confusion Matrix per field (Diagram #13)
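The ranking metrics in the comparison table can be computed with a few lines of Python. This is a sketch under stated assumptions: it uses the standard definitions of MRR, Hit Rate, and Top-1 Hit Rate, the cut-off `k=5` and the chunk IDs are made up, and Noise Rate is left out because its definition lives in CODEBASE_DOCUMENTATION.md Section 6.

```python
def reciprocal_rank(retrieved_ids, relevant_ids):
    """Return 1/rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def evaluate(runs, k=5):
    """runs: list of (retrieved_ids, relevant_ids) pairs, one per query."""
    rr = [reciprocal_rank(retrieved[:k], relevant) for retrieved, relevant in runs]
    n = len(rr)
    return {
        "mrr": sum(rr) / n,
        "hit_rate": sum(r > 0 for r in rr) / n,
        "top1_hit_rate": sum(r == 1.0 for r in rr) / n,
    }


# Three hypothetical queries: hit at rank 1, hit at rank 2, and a miss.
runs = [
    (["c1", "c2", "c3"], {"c1"}),
    (["c4", "c5", "c6"], {"c5"}),
    (["c7", "c8", "c9"], {"c0"}),
]
metrics = evaluate(runs)
print(metrics)
```

Evaluating the pure-vector baseline and the hybrid pipeline on the same `runs` structure keeps the two columns of the comparison table directly comparable.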
7.3 Sample Query Demonstrations
- Choose 3–5 representative queries, show:
  - Input question
  - Classifier output (type, category, topic, intent + confidences)
  - Retrieved chunks with scores
  - Final LLM answer

Phase 8 — Deployment

Est. time: 1–2 hrs | Diagrams needed: Deployment diagram (#14)

⚠️ PLACEHOLDER — fill AFTER actual deployment

8.1 System Requirements
- Hardware: 8 GB RAM, 4-core CPU (for the local LLM); alternatively, a Google API key (for Gemini)
- Software: Python 3.9+, Node.js 18+, ChromaDB
8.2 Deployment Architecture (Diagram #14)
- Frontend → Express Gateway → FastAPI → ChromaDB
8.3 Installation Steps
- Clone → pip install -r requirements.txt → Set .env → Run ingestion → Start API
- Express: npm install → Set .env → node server.js
8.4 Screenshots (fill after deployment)
- Swagger UI (/docs)
- Sample chatbot interaction
- Admin panel
- Classification test panel

Phase 9 — Future Scope & Conclusion

Est. time: 1–2 hrs | Diagrams needed: Roadmap (#15)

9.1 Future Enhancements
- Dynamic LLM switching via admin UI (ModelManager architecture)
- Cross-encoder re-ranking step (once compute resources become available)
- Query result caching layer
- Automated metadata prediction during ingestion (classifier-assisted)
- Website scraping for real-time data updates
9.2 Known Limitations (already in CODEBASE_DOCUMENTATION.md Section 7)
- Local LLM latency (CPU-bound, no GPU)
- BM25 corpus rebuilt per request
- No real-time data — static knowledge base
9.3 Conclusion
- Successfully built domain-specific RAG with hybrid retrieval
- Hierarchical classification reduces noise and improves precision
- Secure deployment with Express gateway protects the inference server

Phase 10 — References & Appendices

Est. time: 1–2 hrs | No diagrams needed

10.1 References
- LangChain documentation
- ChromaDB documentation
- Original RRF paper (Cormack et al., 2009)
- Gemini API documentation
- VGEC official website (data source)
- BM25 (Robertson & Zaragoza, 2009)
- Sentence Transformers (Reimers & Gurevych, 2019)
10.2 Appendix A — MASTER_INDEX full taxonomy
10.3 Appendix B — Full API documentation (export from Swagger /docs)
10.4 Appendix C — Sample classifier training data
10.5 Appendix D — Sample department JSON format

Execution Timeline

| Phase | When | Priority |
|-------|------|----------|
| All Diagrams | Start NOW (before writing prose) | 🔴 Critical |
| Phase 1–3 (Intro, Lit Review) | Day 1 | Must have |
| Phase 4–5 (Design) | Day 2–3 | 🔴 Critical (most marks) |
| Phase 6 (Implementation) | Day 4 | Must have |
| Phase 7 (Testing) | After deployment (Day 5) | 🔴 Critical (proof) |
| Phase 8 (Deployment) | After deployment | Must have |
| Phase 9–10 (Future, Refs) | Day 6 | Finish strong |
| Final PDF export + proofread | Last | Required |

Reuse Map — What's Already Written

| Documentation Section | Already in |
|-----------------------|------------|
| System Architecture (components, data flow) | `CODEBASE_DOCUMENTATION.md` Section 2 |
| Tech Stack Table | `CODEBASE_DOCUMENTATION.md` Section 1 |
| Metadata Schema / Taxonomy | `CODEBASE_DOCUMENTATION.md` Section 3 |
| Retrieval Pipeline steps | `CODEBASE_DOCUMENTATION.md` Section 4 |
| All class/method descriptions | `CODEBASE_DOCUMENTATION.md` Section 5 |
| Metrics definitions | `CODEBASE_DOCUMENTATION.md` Section 6 |
| Known Limitations | `CODEBASE_DOCUMENTATION.md` Section 7 |
| File Structure Tree | `CODEBASE_DOCUMENTATION.md` Section 8 |