| # VGEC RAG Chatbot: Software Documentation Plan | |
| > Based on IEEE/Industry Standard | Updated: 2026-03-25 | |
| > Reference: `CODEBASE_DOCUMENTATION.md` already covers most of Phase 5; reuse it. | |
| --- | |
| ## DIAGRAMS FIRST: Priority Order | |
| > Do all diagrams before writing any prose. Diagrams take the most time and are referenced throughout. | |
| | # | Diagram | Phase Used In | Tool | Status | | |
| |---|---|---|---|---| | |
| | 1 | High-Level Architecture (Component Diagram) | Phase 5 | Draw.io / Mermaid | [ ] | | |
| | 2 | Data Flow: Query Path | Phase 5 | Draw.io (DFD Level 2) | [ ] | |
| | 3 | Data Flow: Ingestion Path | Phase 5 | Draw.io (DFD Level 2) | [ ] | |
| | 4 | Hierarchical Taxonomy Tree (Type→Category→Topic) | Phase 5 | Tree diagram / Mermaid | [ ] | |
| | 5 | Filter Decision Flowchart (Strict→Partial→Fallback) | Phase 5 | Flowchart / Draw.io | [ ] | |
| | 6 | Hybrid Retrieval Sequence (Vector→BM25→RRF→Boost) | Phase 5 | Sequence diagram / Flow | [ ] | |
| | 7 | Use Case Diagram (Student, Faculty, Admin actors) | Phase 4 | Draw.io / PlantUML | [ ] | |
| | 8 | System Context Diagram / Level 0 DFD | Phase 2 | Draw.io | [ ] | |
| | 9 | Class Diagram (simplified: RAGService + helpers) | Phase 6 | Draw.io / UML | [ ] | |
| | 10 | Activity Diagram: Chunking Process | Phase 6 | Activity flow / Draw.io | [ ] | |
| | 11 | MRR Bar Chart: Your RAG vs Traditional | Phase 7 | matplotlib / Excel | [ ] | |
| | 12 | Noise Rate Bar Chart: Comparison | Phase 7 | matplotlib / Excel | [ ] | |
| | 13 | Classifier Confusion Matrix (per field) | Phase 7 | Seaborn heatmap | [ ] | |
| | 14 | Deployment Diagram (Express → FastAPI → ChromaDB) | Phase 8 | Draw.io | [ ] | |
| | 15 | Future Roadmap / Gantt-style Timeline | Phase 9 | Draw.io / simple table | [ ] | | |
| --- | |
| ## Phase 1: Front Matter | |
| **Est. time: 1–2 hrs | No diagrams needed** | |
| - [ ] Title Page | |
| - Project: VGEC RAG Chatbot | |
| - Subtitle: Retrieval-Augmented Generation System for Academic Queries | |
| - Name, Roll No., Department, Submission Date | |
| - Guide name, College name | |
| - [ ] Abstract (150–200 words) | |
| - Problem: Students struggle to find accurate VGEC information scattered across the website | |
| - Solution: RAG-based chatbot with hierarchical classification + hybrid retrieval | |
| - Key results: MRR, noise reduction *(fill placeholders after deployment)* | |
| - Tech: FastAPI, ChromaDB, Gemini, Logistic Regression classifier | |
| - [ ] Table of Contents *(auto-generate at end; structure it now)* | |
| - [ ] List of Figures *(auto-generate at end)* | |
| - [ ] List of Abbreviations | |
| - RAG, BM25, RRF, LLM, MRR, API, VGEC, HOD, etc. | |
| --- | |
| ## Phase 2: Introduction | |
| **Est. time: 2–3 hrs | Diagrams needed: System Context Diagram (Diagram #8)** | |
| - [ ] 2.1 Background | |
| - Current state: Static website, PDFs, manual queries to admin office | |
| - Pain points: Information scattered, no natural language interface | |
| - [ ] 2.2 Problem Statement | |
| - Lack of intelligent query system for institutional data | |
| - Need for domain-specific (VGEC) accurate retrieval | |
| - [ ] 2.3 Objectives | |
| - Build RAG pipeline with MRR > 0.75 | |
| - Implement metadata classification for pre-filtering | |
| - Provide REST API for frontend integration | |
| - Deploy with a secure Express gateway | |
| - [ ] 2.4 Scope | |
| - **In scope:** Department data (faculty, labs, syllabus, HOD, intake), REST API, classification, evaluation | |
| - **Out of scope:** Real-time website scraping, admissions processing, multimedia | |
| > **Reuse from:** `CODEBASE_DOCUMENTATION.md` Section 1 (Project Overview) | |
| --- | |
| ## Phase 3: Literature Review / Related Work | |
| **Est. time: 2–3 hrs | Diagrams needed: Evolution timeline (simple horizontal flow)** | |
| - [ ] 3.1 Traditional Chatbots | |
| - Rule-based (ALICE, ELIZA): rigid, no context | |
| - Keyword matching chatbots: no semantic understanding | |
| - [ ] 3.2 Modern RAG Systems | |
| - OpenAI GPT-4 + vector DB (generic, not domain-specific) | |
| - LlamaIndex / LangChain baseline RAG without metadata filtering | |
| - [ ] 3.3 Hybrid Search Systems | |
| - Elasticsearch (lexical BM25 ranking by default), Cohere (dense vector embeddings) | |
| - RRF as the standard fusion method (reference paper) | |
| - [ ] 3.4 Your Differentiation | |
| - Hierarchical classifier (Type→Category→Topic→Intent) for pre-filtering | |
| - Hybrid retrieval (BM25 + Vector + RRF) vs pure semantic search | |
| - Domain-specific ingestion strategy (intent-aware JSON chunking) | |
| --- | |
| ## Phase 4: System Analysis & Requirements | |
| **Est. time: 3–4 hrs | Diagrams needed: Use Case Diagram (#7), Level 1 DFD** | |
| - [ ] 4.1 Functional Requirements | |
| - FR1: Ingest structured JSON and unstructured documents (PDF, MD, TXT) | |
| - FR2: Classify queries into metadata filters (type, category, topic, intent) | |
| - FR3: Retrieve relevant chunks with configurable similarity threshold | |
| - FR4: Generate contextual answers using Gemini or local LLM | |
| - FR5: Provide CRUD operations on vector store via REST API | |
| - FR6: Rate-limit and authenticate requests via Express gateway | |
| - [ ] 4.2 Non-Functional Requirements | |
| - Performance: <5s response (cloud), <30s (local LLM) | |
| - Accuracy: MRR >0.75 | |
| - Security: Admin routes protected by JWT, Python API never publicly exposed | |
| - Scalability: Support 10,000+ chunks in ChromaDB | |
| - [ ] 4.3 Use Case Diagram *(Diagram #7)* | |
| - Actors: Student, Faculty, Admin | |
| - Student use cases: Submit query, View answer, View references | |
| - Admin use cases: Ingest document, Delete document, Run evaluation, Change settings | |
| - [ ] 4.4 Level 1 DFD | |
| - Major processes: Ingest, Classify, Retrieve, Generate, Evaluate | |
| --- | |
| ## Phase 5: System Design | |
| **Est. time: 4–6 hrs | MOST MARKS, MOST DIAGRAMS** | |
| **Diagrams needed: #1, #2, #3, #4, #5, #6** | |
| > **Reuse heavily from:** `CODEBASE_DOCUMENTATION.md` Sections 2, 3, 4 | |
| - [ ] 5.1 Architecture Design | |
| - [ ] High-Level Component Diagram *(Diagram #1)* | |
| - [ ] Data Flow: Ingestion Path *(Diagram #3)* | |
| - [ ] Data Flow: Query Path *(Diagram #2)* | |
| - [ ] Technology Stack Table (already in CODEBASE_DOCUMENTATION.md Section 1) | |
| - [ ] 5.2 Database Design | |
| - [ ] Vector DB Metadata Schema (field table; already in CODEBASE_DOCUMENTATION.md Section 3) | |
| - [ ] Source JSON Schema (already documented) | |
| - [ ] File Tracking Registry Schema (FileService JSON records) | |
| - [ ] 5.3 Algorithm Design | |
| - [ ] Hierarchical Taxonomy Tree *(Diagram #4)* (Type → Category → Topic → Intent) | |
| - [ ] Filter Decision Flowchart *(Diagram #5)* (confidence thresholds → Strict/Partial/Fallback) | |
| - [ ] Hybrid Retrieval Sequence *(Diagram #6)* (Vector → BM25 → RRF formula → Boost → Threshold) | |
| - [ ] Chunking Strategy (JSON intent-aware vs RecursiveCharacterTextSplitter) | |
| - [ ] RRF Formula: document with the actual equation: | |
| ``` | |
| score(d) = bm25_weight * 1/(rrf_k + rank_bm25) | |
| + vector_weight * 1/(rrf_k + rank_vec) | |
| ``` | |
| - [ ] 5.4 Interface Design | |
| - [ ] API Endpoint Table: /rag and /vector routes (already in CODEBASE_DOCUMENTATION.md Section 5) | |
| - [ ] Request/Response JSON examples (sample curl or Postman output) | |
| - [ ] Express Gateway design (rate limit + auth + concurrency queue) | |
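For the RRF formula item in 5.3, a small worked example may help readers check the equation by hand. The following is a minimal sketch, not the project's actual scoring loop: the function name `weighted_rrf`, the default weights of 0.5, and `rrf_k=60` are illustrative assumptions. It also assumes that a document missing from one ranked list simply receives no contribution from that list.

```python
from collections import defaultdict


def weighted_rrf(bm25_ranked, vector_ranked,
                 bm25_weight=0.5, vector_weight=0.5, rrf_k=60):
    """Fuse two ranked lists of document ids with weighted RRF.

    Implements score(d) = bm25_weight / (rrf_k + rank_bm25)
                        + vector_weight / (rrf_k + rank_vec),
    with ranks starting at 1. All default values are assumptions.
    """
    scores = defaultdict(float)
    for rank, doc_id in enumerate(bm25_ranked, start=1):
        scores[doc_id] += bm25_weight / (rrf_k + rank)
    for rank, doc_id in enumerate(vector_ranked, start=1):
        scores[doc_id] += vector_weight / (rrf_k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical document ids, ordered best-first by each retriever.
fused = weighted_rrf(["a", "b", "c"], ["b", "d", "a"])
for doc_id, score in fused:
    print(doc_id, round(score, 6))
```

With these inputs, `b` ranks first because it sits near the top of both lists, while `c` and `d` each appear in only one list and fall to the bottom.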
| --- | |
| ## Phase 6: Implementation | |
| **Est. time: 2–3 hrs | Diagrams needed: Directory tree, Class Diagram (#9), Activity Diagram (#10)** | |
| > **Reuse heavily from:** `CODEBASE_DOCUMENTATION.md` Section 5 and Section 8 | |
| - [ ] 6.1 Directory Structure (already in CODEBASE_DOCUMENTATION.md Section 8) | |
| - [ ] 6.2 Module Descriptions (already in CODEBASE_DOCUMENTATION.md Section 5) | |
| - [ ] 6.3 Key Code Snippets *(do NOT paste full files; only algorithm excerpts)* | |
| - [ ] Filter construction logic (`_build_filter` method) | |
| - [ ] RRF scoring loop | |
| - [ ] Intent-aware JSON chunking (`handle_json_docs`) | |
| - [ ] Classifier prediction + threshold gating | |
| - [ ] 6.4 Configuration | |
| - [ ] `.env` variables table (already in CODEBASE_DOCUMENTATION.md Section 5) | |
| - [ ] Hyperparameter table (BM25 weights, thresholds, chunk size) | |
| - [ ] 6.5 Express Gateway Implementation | |
| - [ ] Rate limiting configuration | |
| - [ ] JWT auth middleware snippet | |
| - [ ] Concurrency queue (`p-limit`) snippet | |
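As a guide to the size of excerpt that suits 6.3, the threshold gating behind the Strict/Partial/Fallback filter decision could be illustrated roughly as follows. This is a hedged sketch, not the real `_build_filter` method: the name `build_filter_sketch`, the single `min_confidence` cutoff of 0.6, and the `(label, confidence)` input shape are all assumptions made for illustration. Translating the result into the vector store's own filter syntax is left out.

```python
def build_filter_sketch(prediction, min_confidence=0.6):
    """Decide which predicted metadata fields are safe to filter on.

    prediction maps a field name to a (label, confidence) pair.
    Returns (mode, filter_fields):
      - "strict":   every field is confident, filter on all of them
      - "partial":  only some fields are confident, filter on those
      - "fallback": nothing is confident, search without a filter
    The cutoff value and the three-way rule are assumptions.
    """
    confident = {
        field: label
        for field, (label, conf) in prediction.items()
        if conf >= min_confidence
    }
    if prediction and len(confident) == len(prediction):
        return "strict", confident
    if confident:
        return "partial", confident
    return "fallback", {}


# Hypothetical classifier output for one query.
example = {
    "type": ("department", 0.92),
    "category": ("faculty", 0.88),
    "topic": ("hod", 0.41),
}
print(build_filter_sketch(example))
```

Here the low-confidence `topic` prediction is dropped, so the query would be filtered on `type` and `category` only.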
| --- | |
| ## Phase 7: Testing & Evaluation | |
| **Est. time: 3–4 hrs | Diagrams needed: #11 (MRR bar chart), #12 (noise chart), #13 (confusion matrix)** | |
| > ⚠️ PLACEHOLDER: fill in real numbers and screenshots AFTER deployment | |
| - [ ] 7.1 Test Plan | |
| - [ ] Unit tests: Classifier accuracy per field (run `/test_classifier_dataset`) | |
| - [ ] Integration tests: End-to-end hybrid query | |
| - [ ] Performance: Measure average latency (cloud vs local) | |
| - [ ] 7.2 Results | |
| - [ ] Comparison Table: Traditional pure-vector RAG vs Your Hybrid RAG | |
| - Metrics: MRR, Hit Rate, Top-1 Hit Rate, Noise Rate, Latency | |
| - [ ] MRR Bar Chart by query intent type *(Diagram #11)* | |
| - [ ] Noise Rate comparison *(Diagram #12)* | |
| - [ ] Classifier Confusion Matrix per field *(Diagram #13)* | |
| - [ ] 7.3 Sample Query Demonstrations | |
| - Choose 3–5 representative queries, show: | |
| - Input question | |
| - Classifier output (type, category, topic, intent + confidences) | |
| - Retrieved chunks with scores | |
| - Final LLM answer | |
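To keep the comparison table in 7.2 reproducible, the ranking metrics can be computed from per-query results with a few lines of code. The sketch below uses the standard definitions of MRR, Hit Rate, and Top-1 Hit Rate; the function names, the `k=5` cutoff, and the sample data are assumptions. Noise Rate is omitted because its definition lives in `CODEBASE_DOCUMENTATION.md` Section 6 and is not restated in this plan.

```python
def reciprocal_rank(retrieved_ids, relevant_ids):
    """Return 1/rank of the first relevant result, or 0.0 if none is found."""
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def evaluate_runs(runs, k=5):
    """Aggregate ranking metrics over (retrieved_ids, relevant_ids) pairs.

    Only the top-k retrieved ids of each query are considered.
    """
    n = len(runs)
    mrr = sum(reciprocal_rank(ret[:k], rel) for ret, rel in runs) / n
    hit_rate = sum(1 for ret, rel in runs
                   if any(d in rel for d in ret[:k])) / n
    top1 = sum(1 for ret, rel in runs if ret and ret[0] in rel) / n
    return {"mrr": mrr, "hit_rate": hit_rate, "top1_hit_rate": top1}


# Made-up results for four queries (not real evaluation data).
sample_runs = [
    (["a", "b", "c"], {"a"}),   # relevant at rank 1
    (["x", "y", "z"], {"y"}),   # relevant at rank 2
    (["p", "q"], {"m"}),        # no relevant result retrieved
    (["d", "e", "f"], {"f"}),   # relevant at rank 3
]
metrics = evaluate_runs(sample_runs)
print(metrics)
```

Running the same function over both the pure-vector baseline and the hybrid pipeline would give directly comparable rows for the results table.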
| --- | |
| ## Phase 8: Deployment | |
| **Est. time: 1–2 hrs | Diagrams needed: Deployment diagram (#14)** | |
| > ⚠️ PLACEHOLDER: fill in AFTER actual deployment | |
| - [ ] 8.1 System Requirements | |
| - Hardware: 8 GB RAM, 4-core CPU (local LLM) OR Google API key (Gemini) | |
| - Software: Python 3.9+, Node.js 18+, ChromaDB | |
| - [ ] 8.2 Deployment Architecture *(Diagram #14)* | |
| - Frontend → Express Gateway → FastAPI → ChromaDB | |
| - [ ] 8.3 Installation Steps | |
| - Clone → `pip install -r requirements.txt` → Set `.env` → Run ingestion → Start API | |
| - Express: `npm install` → Set `.env` → `node server.js` | |
| - [ ] 8.4 Screenshots *(fill after deployment)* | |
| - [ ] Swagger UI (`/docs`) | |
| - [ ] Sample chatbot interaction | |
| - [ ] Admin panel | |
| - [ ] Classification test panel | |
| --- | |
| ## Phase 9: Future Scope & Conclusion | |
| **Est. time: 1–2 hrs | Diagrams needed: Roadmap (#15)** | |
| - [ ] 9.1 Future Enhancements | |
| - Dynamic LLM switching via admin UI (ModelManager architecture) | |
| - Cross-encoder re-ranking step (once compute resources become available) | |
| - Query result caching layer | |
| - Automated metadata prediction during ingestion (classifier-assisted) | |
| - Website scraping for real-time data updates | |
| - [ ] 9.2 Known Limitations (already in CODEBASE_DOCUMENTATION.md Section 7) | |
| - Local LLM latency (CPU-bound, no GPU) | |
| - BM25 corpus rebuilt per request | |
| - No real-time data: static knowledge base | |
| - [ ] 9.3 Conclusion | |
| - Successfully built domain-specific RAG with hybrid retrieval | |
| - Hierarchical classification reduces noise and improves precision | |
| - Secure deployment with Express gateway protects the inference server | |
| --- | |
| ## Phase 10: References & Appendices | |
| **Est. time: 1–2 hrs | No diagrams needed** | |
| - [ ] 10.1 References | |
| - LangChain documentation | |
| - ChromaDB documentation | |
| - Original RRF paper (Cormack et al., 2009) | |
| - Gemini API documentation | |
| - VGEC official website (data source) | |
| - BM25 (Robertson & Zaragoza, 2009) | |
| - Sentence Transformers (Reimers & Gurevych, 2019) | |
| - [ ] 10.2 Appendix A β MASTER_INDEX full taxonomy | |
| - [ ] 10.3 Appendix B β Full API documentation (export from Swagger `/docs`) | |
| - [ ] 10.4 Appendix C β Sample classifier training data | |
| - [ ] 10.5 Appendix D β Sample department JSON format | |
| --- | |
| ## Execution Timeline | |
| | Phase | When | Priority | | |
| |---|---|---| | |
| | **All Diagrams** | Start NOW (before writing prose) | 🔴 Critical | |
| | Phase 1–3 (Intro, Lit Review) | Day 1 | Must have | |
| | Phase 4–5 (Design) | Day 2–3 | 🔴 Critical (most marks) | |
| | Phase 6 (Implementation) | Day 4 | Must have | |
| | Phase 7 (Testing) | After deployment (Day 5) | 🔴 Critical (proof) | |
| | Phase 8 (Deployment) | After deployment | Must have | |
| | Phase 9–10 (Future, Refs) | Day 6 | Finish strong | |
| | Final PDF export + proofread | Last | Required | | |
| --- | |
| ## Reuse Map: What's Already Written | |
| | Documentation Section | Already in | | |
| |---|---| | |
| | System Architecture (components, data flow) | `CODEBASE_DOCUMENTATION.md` Section 2 | | |
| | Tech Stack Table | `CODEBASE_DOCUMENTATION.md` Section 1 | | |
| | Metadata Schema / Taxonomy | `CODEBASE_DOCUMENTATION.md` Section 3 | | |
| | Retrieval Pipeline steps | `CODEBASE_DOCUMENTATION.md` Section 4 | | |
| | All class/method descriptions | `CODEBASE_DOCUMENTATION.md` Section 5 | | |
| | Metrics definitions | `CODEBASE_DOCUMENTATION.md` Section 6 | | |
| | Known Limitations | `CODEBASE_DOCUMENTATION.md` Section 7 | | |
| | File Structure Tree | `CODEBASE_DOCUMENTATION.md` Section 8 | | |