VGEC RAG Chatbot: Software Documentation Plan
Based on IEEE/Industry Standard | Updated: 2026-03-25
Reference: CODEBASE_DOCUMENTATION.md covers most of Phase 5 already; reuse it.
DIAGRAMS FIRST: Priority Order
Do all diagrams before writing any prose. Diagrams take the most time and are referenced throughout.
| # | Diagram | Phase Used In | Tool | Status |
|---|---|---|---|---|
| 1 | High-Level Architecture (Component Diagram) | Phase 5 | Draw.io / Mermaid | [ ] |
| 2 | Data Flow: Query Path | Phase 5 | Draw.io (DFD Level 2) | [ ] |
| 3 | Data Flow: Ingestion Path | Phase 5 | Draw.io (DFD Level 2) | [ ] |
| 4 | Hierarchical Taxonomy Tree (Type→Category→Topic) | Phase 5 | Tree diagram / Mermaid | [ ] |
| 5 | Filter Decision Flowchart (Strict→Partial→Fallback) | Phase 5 | Flowchart / Draw.io | [ ] |
| 6 | Hybrid Retrieval Sequence (Vector→BM25→RRF→Boost) | Phase 5 | Sequence diagram / Flow | [ ] |
| 7 | Use Case Diagram (Student, Faculty, Admin actors) | Phase 4 | Draw.io / PlantUML | [ ] |
| 8 | System Context Diagram / Level 0 DFD | Phase 2 | Draw.io | [ ] |
| 9 | Class Diagram (simplified: RAGService + helpers) | Phase 6 | Draw.io / UML | [ ] |
| 10 | Activity Diagram: Chunking Process | Phase 6 | Activity flow / Draw.io | [ ] |
| 11 | MRR Bar Chart: Your RAG vs Traditional | Phase 7 | matplotlib / Excel | [ ] |
| 12 | Noise Rate Bar Chart: Comparison | Phase 7 | matplotlib / Excel | [ ] |
| 13 | Classifier Confusion Matrix (per field) | Phase 7 | Seaborn heatmap | [ ] |
| 14 | Deployment Diagram (Express → FastAPI → ChromaDB) | Phase 8 | Draw.io | [ ] |
| 15 | Future Roadmap / Gantt-style Timeline | Phase 9 | Draw.io / simple table | [ ] |
Phase 1: Front Matter
Est. time: 1-2 hrs | No diagrams needed
- Title Page
- Project: VGEC RAG Chatbot
- Subtitle: Retrieval-Augmented Generation System for Academic Queries
- Name, Roll No., Department, Submission Date
- Guide name, College name
- Abstract (150-200 words)
- Problem: Students struggle to find accurate VGEC information scattered across the college website
- Solution: RAG-based chatbot with hierarchical classification + hybrid retrieval
- Key results: MRR, noise reduction (fill placeholders after deployment)
- Tech: FastAPI, ChromaDB, Gemini, Logistic Regression classifier
- Table of Contents (auto-generate at end; structure it now)
- List of Figures (auto-generate at end)
- List of Abbreviations
- RAG, BM25, RRF, LLM, MRR, API, VGEC, HOD, etc.
Phase 2: Introduction
Est. time: 2-3 hrs | Diagrams needed: System Context Diagram (Diagram #8)
- 2.1 Background
- Current state: Static website, PDFs, manual queries to admin office
- Pain points: Information scattered, no natural language interface
- 2.2 Problem Statement
- Lack of intelligent query system for institutional data
- Need for domain-specific (VGEC) accurate retrieval
- 2.3 Objectives
- Build RAG pipeline with MRR > 0.75
- Implement metadata classification for pre-filtering
- Provide REST API for frontend integration
- Deploy with a secure Express gateway
- 2.4 Scope
- In scope: Department data (faculty, labs, syllabus, HOD, intake), REST API, classification, evaluation
- Out of scope: Real-time website scraping, admissions processing, multimedia
Reuse from: CODEBASE_DOCUMENTATION.md Section 1 (Project Overview)
Phase 3: Literature Review / Related Work
Est. time: 2-3 hrs | Diagrams needed: Evolution timeline (simple horizontal flow)
- 3.1 Traditional Chatbots
- Rule-based (ALICE, ELIZA): rigid, no context
- Keyword matching chatbots: no semantic understanding
- 3.2 Modern RAG Systems
- OpenAI GPT-4 + vector DB (generic, not domain-specific)
- LlamaIndex / LangChain baseline RAG: no metadata filtering
- 3.3 Hybrid Search Systems
- Elasticsearch (BM25 lexical ranking by default), Cohere embeddings (vector-only search)
- RRF as the standard fusion method (reference paper)
- 3.4 Your Differentiation
- Hierarchical classifier (Type→Category→Topic→Intent) for pre-filtering
- Hybrid retrieval (BM25 + Vector + RRF) vs pure semantic search
- Domain-specific ingestion strategy (intent-aware JSON chunking)
Phase 4: System Analysis & Requirements
Est. time: 3-4 hrs | Diagrams needed: Use Case Diagram (#7), Level 1 DFD
- 4.1 Functional Requirements
- FR1: Ingest structured JSON and unstructured documents (PDF, MD, TXT)
- FR2: Classify queries into metadata filters (type, category, topic, intent)
- FR3: Retrieve relevant chunks with configurable similarity threshold
- FR4: Generate contextual answers using Gemini or local LLM
- FR5: Provide CRUD operations on vector store via REST API
- FR6: Rate-limit and authenticate requests via Express gateway
- 4.2 Non-Functional Requirements
- Performance: <5s response (cloud), <30s (local LLM)
- Accuracy: MRR >0.75
- Security: Admin routes protected by JWT, Python API never publicly exposed
- Scalability: Support 10,000+ chunks in ChromaDB
- 4.3 Use Case Diagram (Diagram #7)
- Actors: Student, Faculty, Admin
- Student use cases: Submit query, View answer, View references
- Admin use cases: Ingest document, Delete document, Run evaluation, Change settings
- 4.4 Level 1 DFD
- Major processes: Ingest, Classify, Retrieve, Generate, Evaluate
Phase 5: System Design
Est. time: 4-6 hrs | MOST MARKS, MOST DIAGRAMS | Diagrams needed: #1, #2, #3, #4, #5, #6
Reuse heavily from: CODEBASE_DOCUMENTATION.md Sections 2, 3, 4
- 5.1 Architecture Design
- High-Level Component Diagram (Diagram #1)
- Data Flow β Ingestion Path (Diagram #3)
- Data Flow β Query Path (Diagram #2)
- Technology Stack Table (already in CODEBASE_DOCUMENTATION.md Section 1)
- 5.2 Database Design
- Vector DB Metadata Schema (field table β already in CODEBASE_DOCUMENTATION.md Section 3)
- Source JSON Schema (already documented)
- File Tracking Registry Schema (FileService JSON records)
- 5.3 Algorithm Design
- Hierarchical Taxonomy Tree (Diagram #4) (Type → Category → Topic → Intent)
- Filter Decision Flowchart (Diagram #5) (confidence thresholds → Strict/Partial/Fallback)
- Hybrid Retrieval Sequence (Diagram #6) (Vector → BM25 → RRF formula → Boost → Threshold)
- Chunking Strategy (JSON intent-aware vs RecursiveCharacterTextSplitter)
- RRF Formula: document with the actual equation:
score(d) = bm25_weight * 1/(rrf_k + rank_bm25) + vector_weight * 1/(rrf_k + rank_vec)
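The weighted RRF equation above can be sketched in a few lines of Python for use in the algorithm write-up. This is a minimal illustration, not the project's actual code: the function name `weighted_rrf`, the equal default weights, and the choice to give a document no contribution from a list it does not appear in are all assumptions. The default `rrf_k=60` follows the constant used in the original RRF paper.

```python
from collections import defaultdict


def weighted_rrf(bm25_ranked, vector_ranked,
                 bm25_weight=0.5, vector_weight=0.5, rrf_k=60):
    """Fuse two ranked lists of chunk ids with weighted Reciprocal Rank Fusion.

    Hypothetical helper: ranks are 1-based, and a chunk missing from one
    list simply gets no contribution from that list (an assumption).
    """
    scores = defaultdict(float)
    for rank, chunk_id in enumerate(bm25_ranked, start=1):
        scores[chunk_id] += bm25_weight / (rrf_k + rank)
    for rank, chunk_id in enumerate(vector_ranked, start=1):
        scores[chunk_id] += vector_weight / (rrf_k + rank)
    # Highest fused score first; ties are broken by id for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    fused = weighted_rrf(["a", "b", "c"], ["a", "c", "d"])
    for chunk_id, score in fused:
        print(chunk_id, round(score, 5))
```

Because only ranks enter the formula, the BM25 and vector scores never need to be normalised against each other, which is the main reason to document RRF rather than a raw score sum.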
- 5.4 Interface Design
- API Endpoint Table: /rag and /vector routes (already in CODEBASE_DOCUMENTATION.md Section 5)
- Request/Response JSON examples (sample curl or Postman output)
- Express Gateway design (rate limit + auth + concurrency queue)
Phase 6: Implementation
Est. time: 2-3 hrs | Diagrams needed: directory tree, #9 (class diagram), #10 (activity diagram)
Reuse heavily from: CODEBASE_DOCUMENTATION.md Section 5 and Section 8
- 6.1 Directory Structure (already in CODEBASE_DOCUMENTATION.md Section 8)
- 6.2 Module Descriptions (already in CODEBASE_DOCUMENTATION.md Section 5)
- 6.3 Key Code Snippets (do NOT paste full files; only algorithm excerpts)
- Filter construction logic (_build_filter method)
- RRF scoring loop
- Intent-aware JSON chunking (handle_json_docs)
- Classifier prediction + threshold gating
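As a reference point for the filter-construction and threshold-gating excerpts, the Strict/Partial/Fallback decision could look roughly like the sketch below. It is not the project's `_build_filter` method: the function name `build_filter`, the single `threshold=0.6` value, and the rule of walking the hierarchy top-down and stopping at the first low-confidence field are all assumptions made for illustration. The returned dictionary uses the `$and` / `$eq` operator style of ChromaDB metadata filters.

```python
# Hierarchy order taken from the taxonomy: Type -> Category -> Topic -> Intent.
FIELDS = ("type", "category", "topic", "intent")


def build_filter(predictions, threshold=0.6):
    """Turn classifier output into a (mode, where) pair.

    predictions: {field: (label, confidence)} -- hypothetical shape.
    mode is "strict" (all four fields kept), "partial" (a confident
    prefix of the hierarchy kept), or "fallback" (no metadata filter).
    """
    confident = []
    for field in FIELDS:
        label, confidence = predictions.get(field, (None, 0.0))
        if label is None or confidence < threshold:
            break  # assumption: a weak level invalidates the levels below it
        confident.append((field, label))

    if not confident:
        return "fallback", None

    mode = "strict" if len(confident) == len(FIELDS) else "partial"
    clauses = [{field: {"$eq": label}} for field, label in confident]
    # A single condition is passed as-is; several are combined with $and.
    where = clauses[0] if len(clauses) == 1 else {"$and": clauses}
    return mode, where


if __name__ == "__main__":
    example = {
        "type": ("department", 0.93),
        "category": ("computer", 0.81),
        "topic": ("faculty", 0.42),
        "intent": ("list", 0.77),
    }
    print(build_filter(example))
```

In the report, the excerpt from the real method should state the actual thresholds per field, since those are the hyperparameters the evaluation in Phase 7 depends on.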
- 6.4 Configuration
- .env variables table (already in CODEBASE_DOCUMENTATION.md Section 5)
- Hyperparameter table (BM25 weights, thresholds, chunk size)
- 6.5 Express Gateway Implementation
- Rate limiting configuration
- JWT auth middleware snippet
- Concurrency queue (p-limit) snippet
Phase 7: Testing & Evaluation
Est. time: 3-4 hrs | Diagrams needed: #11 (MRR bar chart), #12 (noise chart), #13 (confusion matrix)
⚠️ PLACEHOLDER: fill real numbers and screenshots AFTER deployment
- 7.1 Test Plan
- Unit tests: Classifier accuracy per field (run /test_classifier_dataset)
- Integration tests: End-to-end hybrid query
- Performance: Measure average latency (cloud vs local)
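For the latency item in the test plan, a small timing harness is usually enough. The sketch below is an assumption-laden illustration: `measure_latency` and its `ask` parameter are hypothetical names, and `ask` stands in for whatever function sends one question to the cloud or local pipeline.

```python
import statistics
import time


def measure_latency(ask, questions, repeats=3):
    """Time ask(question) for every question and summarise in seconds.

    `ask` is any callable taking one question string (hypothetical; in
    practice it would wrap an HTTP call to the chatbot API).
    """
    if not questions or repeats < 1:
        raise ValueError("need at least one question and one repeat")
    samples = []
    for question in questions:
        for _ in range(repeats):
            start = time.perf_counter()
            ask(question)
            samples.append(time.perf_counter() - start)
    return {
        "runs": len(samples),
        "mean_s": statistics.mean(samples),
        "median_s": statistics.median(samples),
        "max_s": max(samples),
    }


if __name__ == "__main__":
    # Stand-in for a real client call; replace with the cloud or local path.
    print(measure_latency(lambda q: q.upper(), ["Who is the HOD?"], repeats=5))
```

Running the same question set once against the cloud path and once against the local LLM gives the two columns needed for the latency comparison.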
- 7.2 Results
- Comparison Table: Traditional pure-vector RAG vs Your Hybrid RAG
- Metrics: MRR, Hit Rate, Top-1 Hit Rate, Noise Rate, Latency
- MRR Bar Chart by query intent type (Diagram #11)
- Noise Rate comparison (Diagram #12)
- Classifier Confusion Matrix per field (Diagram #13)
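The ranking metrics in the results table can be computed with a short script once the evaluation queries and their relevant chunk ids are available. The sketch below uses the standard definitions of MRR, Hit Rate and Top-1 Hit Rate; the function names, the `(retrieved_ids, relevant_ids)` input shape and the cutoff `k=5` are assumptions, and the definitions in CODEBASE_DOCUMENTATION.md Section 6 take precedence if they differ.

```python
def reciprocal_rank(retrieved_ids, relevant_ids):
    """Return 1/rank of the first relevant id, or 0.0 if none is retrieved."""
    for rank, chunk_id in enumerate(retrieved_ids, start=1):
        if chunk_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def evaluate(runs, k=5):
    """Aggregate ranking metrics over a list of test queries.

    runs: list of (retrieved_ids, relevant_ids) pairs, one per query
    (hypothetical shape). Only the top-k retrieved ids are scored.
    """
    if not runs:
        raise ValueError("no evaluation runs supplied")
    ranks = [reciprocal_rank(retrieved[:k], relevant)
             for retrieved, relevant in runs]
    total = len(ranks)
    return {
        "mrr": sum(ranks) / total,
        "hit_rate": sum(1 for r in ranks if r > 0) / total,
        "top1_hit_rate": sum(1 for r in ranks if r == 1.0) / total,
    }


if __name__ == "__main__":
    demo_runs = [
        (["c1", "c2"], {"c1"}),        # relevant chunk at rank 1
        (["c7", "c8", "c9"], {"c9"}),  # relevant chunk at rank 3
        (["c4"], {"c5"}),              # miss
    ]
    print(evaluate(demo_runs))
```

Running the same query set through the pure-vector baseline and the hybrid pipeline yields the two rows of the comparison table.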
- 7.3 Sample Query Demonstrations
- Choose 3-5 representative queries, show:
- Input question
- Classifier output (type, category, topic, intent + confidences)
- Retrieved chunks with scores
- Final LLM answer
Phase 8: Deployment
Est. time: 1-2 hrs | Diagrams needed: Deployment diagram (#14)
⚠️ PLACEHOLDER: fill AFTER actual deployment
- 8.1 System Requirements
- Hardware: 8GB RAM, 4-core CPU (local LLM) OR Google API key (Gemini)
- Software: Python 3.9+, Node.js 18+, ChromaDB
- 8.2 Deployment Architecture (Diagram #14)
- Frontend → Express Gateway → FastAPI → ChromaDB
- 8.3 Installation Steps
- Clone → pip install -r requirements.txt → Set .env → Run ingestion → Start API
- Express: npm install → Set .env → node server.js
- 8.4 Screenshots (fill after deployment)
- Swagger UI (/docs)
- Sample chatbot interaction
- Admin panel
- Classification test panel
Phase 9: Future Scope & Conclusion
Est. time: 1-2 hrs | Diagrams needed: Roadmap (#15)
- 9.1 Future Enhancements
- Dynamic LLM switching via admin UI (ModelManager architecture)
- Cross-encoder re-ranking step (once compute resources become available)
- Query result caching layer
- Automated metadata prediction during ingestion (classifier-assisted)
- Website scraping for real-time data updates
- 9.2 Known Limitations (already in CODEBASE_DOCUMENTATION.md Section 7)
- Local LLM latency (CPU-bound, no GPU)
- BM25 corpus rebuilt per request
- No real-time data: static knowledge base
- 9.3 Conclusion
- Successfully built domain-specific RAG with hybrid retrieval
- Hierarchical classification reduces noise and improves precision
- Secure deployment with Express gateway protects the inference server
Phase 10: References & Appendices
Est. time: 1-2 hrs | No diagrams needed
- 10.1 References
- LangChain documentation
- ChromaDB documentation
- Original RRF paper (Cormack et al., 2009)
- Gemini API documentation
- VGEC official website (data source)
- BM25 (Robertson & Zaragoza, 2009)
- Sentence Transformers (Reimers & Gurevych, 2019)
- 10.2 Appendix A: MASTER_INDEX full taxonomy
- 10.3 Appendix B: Full API documentation (export from Swagger /docs)
- 10.4 Appendix C: Sample classifier training data
- 10.5 Appendix D: Sample department JSON format
Execution Timeline
| Phase | When | Priority |
|---|---|---|
| All Diagrams | Start NOW (before writing prose) | 🔴 Critical |
| Phase 1-3 (Intro, Lit Review) | Day 1 | Must have |
| Phase 4-5 (Design) | Day 2-3 | 🔴 Critical (most marks) |
| Phase 6 (Implementation) | Day 4 | Must have |
| Phase 7 (Testing) | After deployment (Day 5) | 🔴 Critical (proof) |
| Phase 8 (Deployment) | After deployment | Must have |
| Phase 9-10 (Future, Refs) | Day 6 | Finish strong |
| Final PDF export + proofread | Last | Required |
Reuse Map β What's Already Written
| Documentation Section | Already in |
|---|---|
| System Architecture (components, data flow) | CODEBASE_DOCUMENTATION.md Section 2 |
| Tech Stack Table | CODEBASE_DOCUMENTATION.md Section 1 |
| Metadata Schema / Taxonomy | CODEBASE_DOCUMENTATION.md Section 3 |
| Retrieval Pipeline steps | CODEBASE_DOCUMENTATION.md Section 4 |
| All class/method descriptions | CODEBASE_DOCUMENTATION.md Section 5 |
| Metrics definitions | CODEBASE_DOCUMENTATION.md Section 6 |
| Known Limitations | CODEBASE_DOCUMENTATION.md Section 7 |
| File Structure Tree | CODEBASE_DOCUMENTATION.md Section 8 |