	# VGEC RAG Chatbot — Software Documentation Plan
	> Based on IEEE/Industry Standard | Updated: 2026-03-25
	> Reference: `CODEBASE_DOCUMENTATION.md` covers most of Phase 5 already — reuse it.

	---

	## DIAGRAMS FIRST — Priority Order

	> Do all diagrams before writing any prose. Diagrams take the most time and are referenced throughout.

	| # | Diagram | Phase Used In | Tool | Status |
	|---|---|---|---|---|
	| 1 | High-Level Architecture (Component Diagram) | Phase 5 | Draw.io / Mermaid | [ ] |
	| 2 | Data Flow — Query Path | Phase 5 | Draw.io (DFD Level 2) | [ ] |
	| 3 | Data Flow — Ingestion Path | Phase 5 | Draw.io (DFD Level 2) | [ ] |
	| 4 | Hierarchical Taxonomy Tree (Type→Category→Topic→Intent) | Phase 5 | Tree diagram / Mermaid | [ ] |
	| 5 | Filter Decision Flowchart (Strict→Partial→Fallback) | Phase 5 | Flowchart / Draw.io | [ ] |
	| 6 | Hybrid Retrieval Sequence (Vector→BM25→RRF→Boost) | Phase 5 | Sequence diagram / Flow | [ ] |
	| 7 | Use Case Diagram (Student, Faculty, Admin actors) | Phase 4 | Draw.io / PlantUML | [ ] |
	| 8 | System Context Diagram / Level 0 DFD | Phase 2 | Draw.io | [ ] |
	| 9 | Class Diagram (simplified — RAGService + helpers) | Phase 6 | Draw.io / UML | [ ] |
	| 10 | Activity Diagram — Chunking Process | Phase 6 | Activity flow / Draw.io | [ ] |
	| 11 | MRR Bar Chart — Your RAG vs Traditional | Phase 7 | matplotlib / Excel | [ ] |
	| 12 | Noise Rate Bar Chart — Comparison | Phase 7 | matplotlib / Excel | [ ] |
	| 13 | Classifier Confusion Matrix (per field) | Phase 7 | Seaborn heatmap | [ ] |
	| 14 | Deployment Diagram (Express → FastAPI → ChromaDB) | Phase 8 | Draw.io | [ ] |
	| 15 | Future Roadmap / Gantt-style Timeline | Phase 9 | Draw.io / simple table | [ ] |

	---

	## Phase 1 — Front Matter
	Est. time: 1–2 hrs | No diagrams needed

	- [ ] Title Page
	  - Project: VGEC RAG Chatbot
	  - Subtitle: Retrieval-Augmented Generation System for Academic Queries
	  - Name, Roll No., Department, Submission Date
	  - Guide name, College name
	- [ ] Abstract (150–200 words)
	  - Problem: Students struggle to find accurate VGEC info scattered across website
	  - Solution: RAG-based chatbot with hierarchical classification + hybrid retrieval
	  - Key results: MRR, noise reduction (fill placeholders after deployment)
	  - Tech: FastAPI, ChromaDB, Gemini, Logistic Regression classifier
	- [ ] Table of Contents (auto-generate at end — structure it now)
	- [ ] List of Figures (auto-generate at end)
	- [ ] List of Abbreviations
	  - RAG, BM25, RRF, LLM, MRR, API, VGEC, HOD, etc.

	---

	## Phase 2 — Introduction
	Est. time: 2–3 hrs | Diagrams needed: System Context Diagram (Diagram #8)

	- [ ] 2.1 Background
	  - Current state: Static website, PDFs, manual queries to admin office
	  - Pain points: Information scattered, no natural language interface
	- [ ] 2.2 Problem Statement
	  - Lack of intelligent query system for institutional data
	  - Need for domain-specific (VGEC) accurate retrieval
	- [ ] 2.3 Objectives
	  - Build RAG pipeline with MRR > 0.75
	  - Implement metadata classification for pre-filtering
	  - Provide REST API for frontend integration
	  - Deploy with a secure Express gateway
	- [ ] 2.4 Scope
	  - In scope: Department data (faculty, labs, syllabus, HOD, intake), REST API, classification, evaluation
	  - Out of scope: Real-time website scraping, admissions processing, multimedia

	> Reuse from: `CODEBASE_DOCUMENTATION.md` Section 1 (Project Overview)

	---

	## Phase 3 — Literature Review / Related Work
	Est. time: 2–3 hrs | Diagrams needed: Evolution timeline (simple horizontal flow)

	- [ ] 3.1 Traditional Chatbots
	  - Rule-based (ALICE, ELIZA) — rigid, no context
	  - Keyword matching chatbots — no semantic understanding
	- [ ] 3.2 Modern RAG Systems
	  - OpenAI GPT-4 + vector DB (generic, not domain-specific)
	  - LlamaIndex / LangChain baseline RAG — no metadata filtering
	- [ ] 3.3 Hybrid Search Systems
	  - Elasticsearch (BM25 only), Cohere (vector only)
	  - RRF as the standard fusion method (reference paper)
	- [ ] 3.4 Your Differentiation
	  - Hierarchical classifier (Type→Category→Topic→Intent) for pre-filtering
	  - Hybrid retrieval (BM25 + Vector + RRF) vs pure semantic search
	  - Domain-specific ingestion strategy (intent-aware JSON chunking)

	---

	## Phase 4 — System Analysis & Requirements
	Est. time: 3–4 hrs | Diagrams needed: Use Case Diagram (#7), Level 1 DFD

	- [ ] 4.1 Functional Requirements
	  - FR1: Ingest structured JSON and unstructured documents (PDF, MD, TXT)
	  - FR2: Classify queries into metadata filters (type, category, topic, intent)
	  - FR3: Retrieve relevant chunks with configurable similarity threshold
	  - FR4: Generate contextual answers using Gemini or local LLM
	  - FR5: Provide CRUD operations on vector store via REST API
	  - FR6: Rate-limit and authenticate requests via Express gateway
	- [ ] 4.2 Non-Functional Requirements
	  - Performance: <5s response (cloud), <30s (local LLM)
	  - Accuracy: MRR >0.75
	  - Security: Admin routes protected by JWT, Python API never publicly exposed
	  - Scalability: Support 10,000+ chunks in ChromaDB
	- [ ] 4.3 Use Case Diagram (Diagram #7)
	  - Actors: Student, Faculty, Admin
	  - Student use cases: Submit query, View answer, View references
	  - Admin use cases: Ingest document, Delete document, Run evaluation, Change settings
	- [ ] 4.4 Level 1 DFD
	  - Major processes: Ingest, Classify, Retrieve, Generate, Evaluate

	---

	## Phase 5 — System Design
	Est. time: 4–6 hrs | MOST MARKS, MOST DIAGRAMS
	Diagrams needed: #1, #2, #3, #4, #5, #6

	> Reuse heavily from: `CODEBASE_DOCUMENTATION.md` Sections 2, 3, 4

	- [ ] 5.1 Architecture Design
	  - [ ] High-Level Component Diagram (Diagram #1)
	  - [ ] Data Flow — Ingestion Path (Diagram #3)
	  - [ ] Data Flow — Query Path (Diagram #2)
	  - [ ] Technology Stack Table (already in CODEBASE_DOCUMENTATION.md Section 1)
	- [ ] 5.2 Database Design
	  - [ ] Vector DB Metadata Schema (field table — already in CODEBASE_DOCUMENTATION.md Section 3)
	  - [ ] Source JSON Schema (already documented)
	  - [ ] File Tracking Registry Schema (FileService JSON records)
	- [ ] 5.3 Algorithm Design
	  - [ ] Hierarchical Taxonomy Tree (Diagram #4) (Type → Category → Topic → Intent)
	  - [ ] Filter Decision Flowchart (Diagram #5) (confidence thresholds → Strict/Partial/Fallback)
	  - [ ] Hybrid Retrieval Sequence (Diagram #6) (Vector → BM25 → RRF formula → Boost → Threshold)
	  - [ ] Chunking Strategy (JSON intent-aware vs RecursiveCharacterTextSplitter)
	  - [ ] RRF Formula — document with the actual equation:
	    ```
	    score(d) = bm25_weight   * 1/(rrf_k + rank_bm25)
	             + vector_weight * 1/(rrf_k + rank_vec)
	    ```
	- [ ] 5.4 Interface Design
	  - [ ] API Endpoint Table — /rag and /vector routes (already in CODEBASE_DOCUMENTATION.md Section 5)
	  - [ ] Request/Response JSON examples (sample curl or Postman output)
	  - [ ] Express Gateway design (rate limit + auth + concurrency queue)
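
The weighted RRF equation listed under 5.3 can be illustrated with a short Python sketch for the design chapter. This is not the project's actual scoring loop: the function names `weighted_rrf` and `top_k`, the default weights of 0.5, and `rrf_k = 60` are assumptions chosen for illustration, and a document missing from one ranking is assumed to contribute nothing from that ranking.

```python
from typing import Dict, List


def weighted_rrf(
    bm25_ranking: List[str],
    vector_ranking: List[str],
    bm25_weight: float = 0.5,    # assumed default, not the project's value
    vector_weight: float = 0.5,  # assumed default, not the project's value
    rrf_k: int = 60,             # assumed default, not the project's value
) -> Dict[str, float]:
    """Fuse two ranked lists of document IDs (best first) into one score map."""
    scores: Dict[str, float] = {}
    for weight, ranking in ((bm25_weight, bm25_ranking), (vector_weight, vector_ranking)):
        # Ranks are 1-based, matching rank_bm25 and rank_vec in the equation.
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (rrf_k + rank)
    return scores


def top_k(scores: Dict[str, float], k: int = 3) -> List[str]:
    """Return the k highest-scoring IDs, breaking ties alphabetically."""
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))[:k]


fused = weighted_rrf(["a", "b", "c"], ["b", "c", "d"])
ranked = top_k(fused)
print(ranked)  # ['b', 'c', 'a']
```

Documents that appear in both rankings ("b" and "c") outrank "a", which is first in BM25 but absent from the vector list. That is the behaviour the Hybrid Retrieval Sequence diagram (#6) should make visible.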

	---

	## Phase 6 — Implementation
	Est. time: 2–3 hrs | Diagrams needed: Directory tree, Class Diagram (#9), Activity Diagram (#10)

	> Reuse heavily from: `CODEBASE_DOCUMENTATION.md` Section 5 and Section 8

	- [ ] 6.1 Directory Structure (already in CODEBASE_DOCUMENTATION.md Section 8)
	- [ ] 6.2 Module Descriptions (already in CODEBASE_DOCUMENTATION.md Section 5)
	- [ ] 6.3 Key Code Snippets (do NOT paste full files — only algorithm excerpts)
	  - [ ] Filter construction logic (`_build_filter` method)
	  - [ ] RRF scoring loop
	  - [ ] Intent-aware JSON chunking (`handle_json_docs`)
	  - [ ] Classifier prediction + threshold gating
	- [ ] 6.4 Configuration
	  - [ ] `.env` variables table (already in CODEBASE_DOCUMENTATION.md Section 5)
	  - [ ] Hyperparameter table (BM25 weights, thresholds, chunk size)
	- [ ] 6.5 Express Gateway Implementation
	  - [ ] Rate limiting configuration
	  - [ ] JWT auth middleware snippet
	  - [ ] Concurrency queue (`p-limit`) snippet
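
When writing up the threshold-gating and filter-construction excerpts for 6.3, a simplified sketch like the one below may help frame the Strict/Partial/Fallback explanation. It is not the real `_build_filter` method. The threshold values, the labels in the example, and the rule that the hierarchy is cut at the first low-confidence level are all assumptions. The returned dictionary follows the ChromaDB `where` style, where multiple conditions are combined with `$and`.

```python
from typing import Dict, Optional, Tuple

# Hypothetical thresholds; the real values belong in the hyperparameter table.
STRICT_THRESHOLD = 0.80
PARTIAL_THRESHOLD = 0.50

# Hierarchy order taken from the plan: Type -> Category -> Topic -> Intent.
FIELDS = ("type", "category", "topic", "intent")

Prediction = Dict[str, Tuple[str, float]]  # field -> (label, confidence)


def build_filter(prediction: Prediction) -> Tuple[str, Optional[dict]]:
    """Return (mode, where_filter) for one classifier prediction."""
    kept: Dict[str, Tuple[str, float]] = {}
    for field in FIELDS:
        label, confidence = prediction.get(field, (None, 0.0))
        # Assumption: stop at the first level the classifier is unsure about.
        if label is None or confidence < PARTIAL_THRESHOLD:
            break
        kept[field] = (label, confidence)

    if not kept:
        return "fallback", None  # no metadata filter, search the whole store

    all_strict = len(kept) == len(FIELDS) and all(
        confidence >= STRICT_THRESHOLD for _, confidence in kept.values()
    )
    mode = "strict" if all_strict else "partial"

    clauses = [{field: label} for field, (label, _) in kept.items()]
    where = clauses[0] if len(clauses) == 1 else {"$and": clauses}
    return mode, where


# Made-up labels, for illustration only.
example = {
    "type": ("department", 0.92),
    "category": ("computer", 0.61),
    "topic": ("faculty", 0.30),
    "intent": ("list", 0.88),
}
mode, where = build_filter(example)
print(mode, where)
```

In this example the topic confidence falls below the partial threshold, so only `type` and `category` survive and the mode is `partial`. The intent label is dropped even though its own confidence is high, because it sits below the cut in the hierarchy.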

	---

	## Phase 7 — Testing & Evaluation
	Est. time: 3–4 hrs | Diagrams needed: #11 (MRR bar chart), #12 (noise chart), #13 (confusion matrix)
	> ⚠️ PLACEHOLDER — fill real numbers and screenshots AFTER deployment

	- [ ] 7.1 Test Plan
	  - [ ] Unit tests: Classifier accuracy per field (run `/test_classifier_dataset`)
	  - [ ] Integration tests: End-to-end hybrid query
	  - [ ] Performance: Measure average latency (cloud vs local)
	- [ ] 7.2 Results
	  - [ ] Comparison Table: Traditional pure-vector RAG vs Your Hybrid RAG
	    - Metrics: MRR, Hit Rate, Top-1 Hit Rate, Noise Rate, Latency
	  - [ ] MRR Bar Chart by query intent type (Diagram #11)
	  - [ ] Noise Rate comparison (Diagram #12)
	  - [ ] Classifier Confusion Matrix per field (Diagram #13)
	- [ ] 7.3 Sample Query Demonstrations
	  - Choose 3–5 representative queries, show:
	    - Input question
	    - Classifier output (type, category, topic, intent + confidences)
	    - Retrieved chunks with scores
	    - Final LLM answer
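
For the comparison table in 7.2, the ranking metrics can be computed with a small helper such as the sketch below. It uses the standard definitions of MRR, Hit Rate, and Top-1 Hit Rate. The function names and the input layout (a list of retrieved IDs paired with a set of relevant IDs per query) are assumptions, not the project's evaluation code. Noise Rate is left out because this plan does not define it; use the definition in `CODEBASE_DOCUMENTATION.md` Section 6.

```python
from typing import Dict, List, Sequence, Set, Tuple


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """Return 1/rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(runs: List[Tuple[Sequence[str], Set[str]]]) -> Dict[str, float]:
    """Aggregate ranking metrics over (retrieved_ids, relevant_ids) pairs."""
    if not runs:
        raise ValueError("at least one evaluated query is required")
    ranks = [reciprocal_rank(retrieved, relevant) for retrieved, relevant in runs]
    total = len(ranks)
    return {
        "mrr": sum(ranks) / total,
        "hit_rate": sum(1 for r in ranks if r > 0.0) / total,
        "top1_hit_rate": sum(1 for r in ranks if r == 1.0) / total,
    }


# Three toy queries: relevant chunk at rank 1, at rank 2, and not retrieved.
runs = [
    (["a", "b", "c"], {"a"}),
    (["a", "b", "c"], {"b"}),
    (["a", "b", "c"], {"z"}),
]
metrics = evaluate(runs)
print(metrics["mrr"])  # 0.5
```

Running the same helper over the pure-vector baseline and the hybrid pipeline on an identical query set yields directly comparable rows for the table and for Diagram #11.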

	---

	## Phase 8 — Deployment
	Est. time: 1–2 hrs | Diagrams needed: Deployment diagram (#14)
	> ⚠️ PLACEHOLDER — fill AFTER actual deployment

	- [ ] 8.1 System Requirements
	  - Hardware: 8 GB RAM, 4-core CPU (local LLM) OR Google API key (Gemini)
	  - Software: Python 3.9+, Node.js 18+, ChromaDB
	- [ ] 8.2 Deployment Architecture (Diagram #14)
	  - Frontend → Express Gateway → FastAPI → ChromaDB
	- [ ] 8.3 Installation Steps
	  - Clone → `pip install -r requirements.txt` → Set `.env` → Run ingestion → Start API
	  - Express: `npm install` → Set `.env` → `node server.js`
	- [ ] 8.4 Screenshots (fill after deployment)
	  - [ ] Swagger UI (`/docs`)
	  - [ ] Sample chatbot interaction
	  - [ ] Admin panel
	  - [ ] Classification test panel

	---

	## Phase 9 — Future Scope & Conclusion
	Est. time: 1–2 hrs | Diagrams needed: Roadmap (#15)

	- [ ] 9.1 Future Enhancements
	  - Dynamic LLM switching via admin UI (ModelManager architecture)
	  - Cross-encoder re-ranking step (once compute resources become available)
	  - Query result caching layer
	  - Automated metadata prediction during ingestion (classifier-assisted)
	  - Website scraping for real-time data updates
	- [ ] 9.2 Known Limitations (already in CODEBASE_DOCUMENTATION.md Section 7)
	  - Local LLM latency (CPU-bound, no GPU)
	  - BM25 corpus rebuilt per request
	  - No real-time data — static knowledge base
	- [ ] 9.3 Conclusion
	  - Successfully built domain-specific RAG with hybrid retrieval
	  - Hierarchical classification reduces noise and improves precision
	  - Secure deployment with Express gateway protects the inference server

	---

	## Phase 10 — References & Appendices
	Est. time: 1–2 hrs | No diagrams needed

	- [ ] 10.1 References
	  - LangChain documentation
	  - ChromaDB documentation
	  - Original RRF paper (Cormack et al., 2009)
	  - Gemini API documentation
	  - VGEC official website (data source)
	  - BM25 (Robertson & Zaragoza, 2009)
	  - Sentence Transformers (Reimers & Gurevych, 2019)
	- [ ] 10.2 Appendix A — MASTER_INDEX full taxonomy
	- [ ] 10.3 Appendix B — Full API documentation (export from Swagger `/docs`)
	- [ ] 10.4 Appendix C — Sample classifier training data
	- [ ] 10.5 Appendix D — Sample department JSON format

	---

	## Execution Timeline

	| Phase | When | Priority |
	|---|---|---|
	| All Diagrams | Start NOW (before writing prose) | 🔴 Critical |
	| Phase 1–3 (Intro, Lit Review) | Day 1 | Must have |
	| Phase 4–5 (Design) | Day 2–3 | 🔴 Critical — most marks |
	| Phase 6 (Implementation) | Day 4 | Must have |
	| Phase 7 (Testing) | After deployment — Day 5 | 🔴 Critical — proof |
	| Phase 8 (Deployment) | After deployment | Must have |
	| Phase 9–10 (Future, Refs) | Day 6 | Finish strong |
	| Final PDF export + proofread | Last | Required |

	---

	## Reuse Map — What's Already Written

	| Documentation Section | Already in |
	|---|---|
	| System Architecture (components, data flow) | `CODEBASE_DOCUMENTATION.md` Section 2 |
	| Tech Stack Table | `CODEBASE_DOCUMENTATION.md` Section 1 |
	| Metadata Schema / Taxonomy | `CODEBASE_DOCUMENTATION.md` Section 3 |
	| Retrieval Pipeline steps | `CODEBASE_DOCUMENTATION.md` Section 4 |
	| All class/method descriptions | `CODEBASE_DOCUMENTATION.md` Section 5 |
	| Metrics definitions | `CODEBASE_DOCUMENTATION.md` Section 6 |
	| Known Limitations | `CODEBASE_DOCUMENTATION.md` Section 7 |
	| File Structure Tree | `CODEBASE_DOCUMENTATION.md` Section 8 |