────────────────────────────────────────────────────────────────────────────────
🚀 RAGBOT 4-MONTH IMPLEMENTATION ROADMAP - ALL 34 SKILLS
Systematic, Phased Approach to Enterprise-Grade AI
────────────────────────────────────────────────────────────────────────────────
IMPLEMENTATION PHILOSOPHY
────────────────────────────────────────────────────────────────────────────────
• Fix critical issues first (security, state management, schema)
• Build tests concurrently (every feature gets tests immediately)
• Deploy incrementally (working code at each phase)
• Measure continuously (metrics drive priorities)
• Document along the way (knowledge preservation)
PROJECT BASELINE
────────────────────────────────────────────────────────────────────────────────
Current Status:
• 83+ passing tests (~70% coverage)
• 6 specialist agents (Biomarker Analyzer, Disease Explainer, etc.)
• FastAPI REST API + CLI interface
• FAISS vector store (750+ pages medical knowledge)
• 2,861 medical knowledge chunks
Critical Issues to Fix:
1. biomarker_flags & safety_alerts not propagating through workflow
2. Schema mismatch between workflow output & API formatter
3. Prediction confidence forced to 0.5 (dangerous for medical domain)
4. Different biomarker naming (API vs CLI)
5. JSON parsing breaks on malformed LLM output
6. No citation enforcement in RAG outputs
Success Metrics:
• Test coverage: 70% → 90%+
• Response latency: 25s → 15-20s
• Prediction accuracy: +15-20%
• API costs: -40% (Groq free tier optimization)
• Security: OWASP compliant, HIPAA aligned
────────────────────────────────────────────────────────────────────────────────
PHASE 1: FOUNDATION & CRITICAL FIXES (Week 1-2)
────────────────────────────────────────────────────────────────────────────────
GOAL: Security baseline + fix state propagation + unify schemas
Week 1: Days 1-5
SKILL #18: OWASP Security Check
├─ Duration: 2-3 hours
├─ Task: Run comprehensive security audit
├─ Deliverable: Security issues list, prioritized fixes
├─ Actions:
│  1. Read SKILL.md documentation
│  2. Run vulnerability scanner on /api and /src
│  3. Document findings in SECURITY_AUDIT.md
│  4. Create tickets for each finding
└─ Outcome: Clear understanding of security gaps
SKILL #17: API Security Hardening
├─ Duration: 4-6 hours
├─ Task: Implement authentication & hardening
├─ Deliverable: JWT auth on /api/v1/analyze endpoint
├─ Actions:
│  1. Read SKILL.md (auth patterns, CORS, headers)
│  2. Add JWT middleware to api/main.py
│  3. Update routes with @require_auth decorator
│  4. Add security headers (HSTS, CSP, X-Frame-Options)
│  5. Write tests for auth (SKILL #22: Python Testing Patterns)
│  6. Update docs with API key requirement
└─ Code Location: api/app/middleware/auth.py (NEW)
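The issue/verify flow behind token auth can be sketched with a stdlib HMAC-signed token. This is a simplified stand-in for a real JWT library such as PyJWT, not the project's actual middleware; `SECRET`, `issue_token`, and `verify_token` are illustrative names.

```python
import base64
import hashlib
import hmac
import json
import time

SECRET = b"change-me"  # in a real deployment, load from env/secret manager

def _b64(data: bytes) -> str:
    # URL-safe base64 without padding, as JWTs use
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def issue_token(user_id: str, ttl_s: int = 3600) -> str:
    # Payload carries the subject and expiry, like a JWT's `sub`/`exp` claims
    payload = _b64(json.dumps({"sub": user_id, "exp": time.time() + ttl_s}).encode())
    sig = _b64(hmac.new(SECRET, payload.encode(), hashlib.sha256).digest())
    return f"{payload}.{sig}"

def verify_token(token: str):
    try:
        payload, sig = token.split(".")
    except ValueError:
        return None  # malformed token
    expected = _b64(hmac.new(SECRET, payload.encode(), hashlib.sha256).digest())
    if not hmac.compare_digest(sig, expected):
        return None  # signature mismatch: reject
    claims = json.loads(base64.urlsafe_b64decode(payload + "=" * (-len(payload) % 4)))
    if claims["exp"] < time.time():
        return None  # token expired
    return claims
```

In FastAPI this check would live in a dependency that reads the `Authorization: Bearer` header and returns 401 on a `None` result.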
SKILL #22: Python Testing Patterns (First Use)
├─ Duration: 2-3 hours
├─ Task: Create testing infrastructure & auth tests
├─ Deliverable: tests/test_api_auth.py with 10+ tests
├─ Actions:
│  1. Read SKILL.md (fixtures, mocking, parametrization)
│  2. Create conftest.py with auth fixtures
│  3. Write tests for JWT generation, validation, failure cases
│  4. Implement pytest fixtures for authenticated client
│  5. Run: pytest tests/test_api_auth.py -v
└─ Outcome: 80% test coverage on auth module
SKILL #2: Workflow Orchestration Patterns
├─ Duration: 4-6 hours
├─ Task: Fix state propagation in LangGraph workflow
├─ Deliverable: biomarker_flags & safety_alerts propagate end-to-end
├─ Actions:
│  1. Read SKILL.md (LangGraph state management, parallel execution)
│  2. Review src/state.py current structure
│  3. Identify missing state fields in GuildState
│  4. Refactor agents to return complete state:
│     - src/agents/biomarker_analyzer.py → return biomarker_flags
│     - src/agents/biomarker_analyzer.py → return safety_alerts
│     - src/agents/confidence_assessor.py → update state
│  5. Test with: python -c "from src.workflow import create_guild..."
│  6. Write integration tests (SKILL #22)
└─ Code Changes: src/state.py, src/agents/*.py
SKILL #16: AI Wrapper/Structured Output
├─ Duration: 3-5 hours
├─ Task: Unify workflow → API response schema
├─ Deliverable: Single canonical response format (Pydantic model)
├─ Actions:
│  1. Read SKILL.md (structured outputs, Pydantic, validation)
│  2. Create api/app/models/response.py with unified schema
│  3. Define BaseAnalysisResponse with all required fields
│  4. Update api/app/services/ragbot.py to use unified schema
│  5. Ensure ResponseSynthesizerAgent outputs match schema
│  6. Add Pydantic validation in all endpoints
│  7. Run: pytest tests/test_response_schema.py -v
└─ Code Location: api/app/models/response.py (REFACTORED)
Week 2: Days 6-10
SKILL #3: Multi-Agent Orchestration
├─ Duration: 3-4 hours
├─ Task: Fix deterministic execution of parallel agents
├─ Deliverable: Agents execute without race conditions
├─ Actions:
│  1. Read SKILL.md (agent coordination, deterministic scheduling)
│  2. Review src/workflow.py parallel execution
│  3. Ensure explicit state passing between agents:
│     - Biomarker Analyzer outputs → Disease Explainer inputs
│     - Sequential where needed (Analyzer before Linker)
│     - Parallel where safe (Explainer & Guidelines)
│  4. Add logging to track execution order
│  5. Run 10 times: python scripts/test_chat_demo.py (same output each time)
└─ Outcome: Deterministic workflow execution
SKILL #19: LLM Security
├─ Duration: 3-4 hours
├─ Task: Prevent LLM-specific attacks
├─ Deliverable: Input validation against prompt injection
├─ Actions:
│  1. Read SKILL.md (prompt injection, token limit attacks)
│  2. Add input sanitization in api/app/services/extraction.py
│  3. Implement prompt injection detection:
│     - Check for "ignore instructions" patterns
│     - Limit biomarker input length
│     - Escape special characters
│  4. Add rate limiting per user (SKILL #20)
│  5. Write security tests
└─ Code Location: api/app/middleware/input_validation.py (NEW)
SKILL #20: API Rate Limiting
├─ Duration: 2-3 hours
├─ Task: Implement tiered rate limiting
├─ Deliverable: /api/v1/analyze limited to 10/min free, 1000/min pro
├─ Actions:
│  1. Read SKILL.md (token bucket, sliding window algorithms)
│  2. Import python-ratelimit library
│  3. Add rate limiter middleware to api/main.py
│  4. Implement tiered limits (free/pro based on API key)
│  5. Return 429 with retry-after headers
│  6. Test rate limiting behavior
└─ Code Location: api/app/middleware/rate_limiter.py (NEW)
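A token bucket, one of the two algorithms named above, can be sketched in a few lines. This is a minimal in-memory sketch of the tiered limits (free 10/min, pro 1000/min); the real middleware would keep one bucket per API key and likely use a library.

```python
import time

class TokenBucket:
    """Allow bursts up to `capacity`; refill at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.updated = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller should return HTTP 429 with a Retry-After header

def bucket_for(tier: str) -> TokenBucket:
    # Tier decides the limits: free 10/min, pro 1000/min (from the deliverable above)
    if tier == "free":
        return TokenBucket(rate=10 / 60, capacity=10)
    return TokenBucket(rate=1000 / 60, capacity=1000)
```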
END OF PHASE 1 OUTCOMES:
✅ Security audit complete with fixes prioritized
✅ JWT authentication on REST API
✅ biomarker_flags & safety_alerts propagating through workflow
✅ Unified response schema (API & CLI use same format)
✅ LLM prompt injection protection
✅ Rate limiting in place
✅ Auth + security tests written (15+ new tests)
✅ Coverage increased to ~75%
────────────────────────────────────────────────────────────────────────────────
PHASE 2: TEST EXPANSION & AGENT OPTIMIZATION (Week 3-5)
────────────────────────────────────────────────────────────────────────────────
GOAL: 90%+ test coverage + improved agent decision logic + prompt optimization
Week 3: Days 11-15
SKILL #22: Python Testing Patterns (Advanced Use)
├─ Duration: 8-10 hours (this is the main focus)
├─ Task: Parametrized testing for biomarker combinations
├─ Deliverable: 50+ new parametrized tests
├─ Actions:
│  1. Read SKILL.md sections on parametrization & fixtures
│  2. Create tests/fixtures/biomarkers.py with test data:
│     - Normal values tuple
│     - Diabetes indicators tuple
│     - Mixed abnormal values tuple
│     - Edge cases tuple
│  3. Write parametrized test for each biomarker combination:
│     @pytest.mark.parametrize("biomarkers,expected_disease", [...])
│     def test_disease_prediction(biomarkers, expected_disease):
│         assert predict_disease(biomarkers) == expected_disease
│  4. Create mocking fixtures for LLM calls:
│     @pytest.fixture
│     def mock_groq_client(monkeypatch):
│         # Mock all LLM interactions
│  5. Test agent outputs:
│     - Biomarker Analyzer with 10 scenarios
│     - Disease Explainer with 5 diseases
│     - Confidence Assessor with low/medium/high confidence cases
│  6. Run: pytest tests/ -v --cov=src --cov-report=html
│  7. Goal: 90%+ coverage on agents/
└─ Code Location: tests/test_parametrized_*.py
SKILL #26: Python Design Patterns
├─ Duration: 4-5 hours
├─ Task: Refactor agent implementations with design patterns
├─ Deliverable: Cleaner, more maintainable agent code
├─ Actions:
│  1. Read SKILL.md (SOLID, composition, factory patterns)
│  2. Identify code smells in src/agents/
│  3. Extract common agent logic to BaseAgent class:
│     class BaseAgent:
│         def invoke(self, input_data) -> AgentOutput
│         def validate_inputs(self)
│         def log_execution(self)
│  4. Use composition over inheritance:
│     - Each agent has optional retriever, validator, cache
│     - Reduce coupling between agents
│  5. Implement Factory pattern for agent creation:
│     AgentFactory.create("biomarker_analyzer")
│  6. Refactor tests to use new pattern
└─ Code Location: src/agents/base_agent.py (NEW)
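The `BaseAgent` + factory combination above can be sketched as follows. The `run` body of the example subclass is placeholder logic (the real agent calls the LLM and retriever); only the class and method names in the bullet list come from the plan.

```python
from abc import ABC, abstractmethod

class BaseAgent(ABC):
    """Shared lifecycle: validate -> run -> log. Subclasses implement run()."""

    name = "base"

    def invoke(self, input_data: dict) -> dict:
        self.validate_inputs(input_data)
        result = self.run(input_data)
        self.log_execution(input_data, result)
        return result

    def validate_inputs(self, input_data: dict) -> None:
        if not isinstance(input_data, dict):
            raise TypeError(f"{self.name}: expected dict input")

    def log_execution(self, input_data: dict, result: dict) -> None:
        # Real code would use the structured logger from SKILL #27
        print(f"[{self.name}] in={sorted(input_data)} out={sorted(result)}")

    @abstractmethod
    def run(self, input_data: dict) -> dict: ...

class BiomarkerAnalyzerAgent(BaseAgent):
    name = "biomarker_analyzer"

    def run(self, input_data: dict) -> dict:
        # Placeholder rule: flag values over an arbitrary threshold
        flags = {k: "high" for k, v in input_data.get("biomarkers", {}).items() if v > 100}
        return {"biomarker_flags": flags}

class AgentFactory:
    _registry = {"biomarker_analyzer": BiomarkerAnalyzerAgent}

    @classmethod
    def create(cls, name: str) -> BaseAgent:
        return cls._registry[name]()
```

Composition (optional retriever/validator/cache) would then be constructor arguments on the subclasses rather than more inheritance levels.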
SKILL #4: Agentic Development
├─ Duration: 3-4 hours
├─ Task: Improve agent decision logic
├─ Deliverable: Better biomarker analysis confidence scores
├─ Actions:
│  1. Read SKILL.md (planning, reasoning, decision making)
│  2. Add confidence threshold in BiomarkerAnalyzerAgent
│  3. Instead of returning all results:
│     - Only return HIGH confidence matches
│     - Flag LOW confidence for manual review
│     - Add reasoning trace (why this conclusion)
│  4. Update response format with:
│     - confidence_score (0-1)
│     - evidence_count (# sources)
│     - alternative_hypotheses (if low confidence)
│  5. Update tests
└─ Code Location: src/agents/biomarker_analyzer.py (MODIFIED)
SKILL #13: Senior Prompt Engineer (First Use)
├─ Duration: 5-6 hours
├─ Task: Optimize prompts for medical accuracy
├─ Deliverable: Updated agent prompts with better accuracy
├─ Actions:
│  1. Read SKILL.md (prompt patterns, few-shot, CoT)
│  2. Audit current agent prompts in src/agents/*.py
│  3. Apply few-shot learning to extraction agent:
│     - Add 3 examples of correct biomarker extraction
│     - Show format expected
│     - Show handling of ambiguous inputs
│  4. Add chain-of-thought reasoning:
│     "First identify the biomarkers mentioned. Then look up their ranges.
│      Then determine if abnormal. Then assess severity."
│  5. Add role prompting:
│     "You are an expert medical lab analyst with 20 years experience..."
│  6. Implement structured output prompts:
│     "Return JSON with these exact fields: biomarkers, disease, confidence"
│  7. Benchmark against baseline accuracy
│  8. Run: python scripts/test_evaluation_system.py (SKILL #14)
└─ Code Location: src/agents/*/invoke() prompts
Week 4: Days 16-20
SKILL #14: LLM Evaluation
├─ Duration: 4-5 hours
├─ Task: Benchmark LLM quality improvements
├─ Deliverable: Metrics dashboard quantifying the improvements
├─ Actions:
│  1. Read SKILL.md (evaluation metrics, benchmarking)
│  2. Create tests/evaluation_metrics.py with metrics:
│     - Accuracy (correct disease prediction)
│     - Precision (of biomarker extraction)
│     - Recall (of clinical recommendations)
│     - F1 score (biomarker identification)
│  3. Create test dataset with 20 patient scenarios:
│     tests/fixtures/evaluation_patients.py
│  4. Benchmark Groq vs Gemini on accuracy, latency, cost
│  5. Create evaluation report:
│     "Before optimization: 65% accuracy, 25s latency
│      After optimization: 80% accuracy, 18s latency"
│  6. Generate graphs/charts of improvements
└─ Code Location: tests/evaluation_metrics.py
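The precision/recall/F1 metrics listed above reduce to simple set arithmetic when extraction output and gold labels are treated as sets of biomarker names; a minimal sketch (function name is illustrative):

```python
def precision_recall_f1(predicted: set, expected: set):
    """Set-based extraction metrics: how many predicted items are right
    (precision), how many expected items were found (recall)."""
    tp = len(predicted & expected)  # true positives
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(expected) if expected else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1
```

Averaging these over the 20 patient scenarios gives the numbers for the before/after report.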
SKILL #5: Tool/Function Calling Patterns
├─ Duration: 3-4 hours
├─ Task: Use function calling for reliable LLM outputs
├─ Deliverable: Structured output via function calling (not prompting)
├─ Actions:
│  1. Read SKILL.md (tool definition, structured returns)
│  2. Define tools for extraction agent:
│     - extract_biomarkers(text: str) -> dict
│     - classify_severity(value: float, range: tuple) -> str
│     - assess_disease_risk(biomarkers: dict) -> dict
│  3. Modify extraction service to use function calling:
│     instead of parsing JSON out of free text, the LLM calls typed functions
│  4. Check Groq free tier support (it may not support function calling);
│     alternative: use strict Pydantic output validation
│  5. Test: parsing should never fail; always return valid output
│  6. Error handling: if LLM output is in the wrong format, retry with function calling
└─ Code Location: api/app/services/extraction.py (MODIFIED)
SKILL #21: Python Error Handling
├─ Duration: 3-4 hours
├─ Task: Comprehensive error handling for production
├─ Deliverable: Custom exception hierarchy, graceful degradation
├─ Actions:
│  1. Read SKILL.md (exception patterns, logging, recovery)
│  2. Create src/exceptions.py with hierarchy:
│     - RagBotException (base)
│     - BiomarkerValidationError
│     - LLMTimeoutError (with retry logic)
│     - VectorStoreError
│     - SchemaValidationError
│  3. Wrap agent calls with try-except:
│     try:
│         result = agent.invoke(input)
│     except LLMTimeoutError:
│         retry_with_smaller_context()
│     except BiomarkerValidationError:
│         return low_confidence_response()
│  4. Add telemetry: which exceptions are most common?
│  5. Write exception tests (10+ scenarios)
└─ Code Location: src/exceptions.py (NEW)
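The hierarchy and the wrap-and-degrade pattern above might look like this; `safe_invoke` and its return shape are illustrative, not existing project code:

```python
class RagBotException(Exception):
    """Base class: lets callers catch all RagBot errors in one clause."""

class BiomarkerValidationError(RagBotException): pass
class LLMTimeoutError(RagBotException): pass
class VectorStoreError(RagBotException): pass
class SchemaValidationError(RagBotException): pass

def safe_invoke(agent_fn, payload, fallback=None):
    """Run an agent call, degrading gracefully instead of crashing the request."""
    try:
        return {"status": "ok", "result": agent_fn(payload)}
    except LLMTimeoutError:
        # Real code would first retry with a smaller context (step 3 above)
        return {"status": "degraded", "result": fallback, "reason": "llm_timeout"}
    except BiomarkerValidationError as exc:
        return {"status": "low_confidence", "result": fallback, "reason": str(exc)}
```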
Week 5: Days 21-25
SKILL #27: Python Observability (First Use)
├─ Duration: 4-5 hours
├─ Task: Structured logging for debugging & monitoring
├─ Deliverable: JSON-formatted logs with context
├─ Actions:
│  1. Read SKILL.md (structured logging, correlation IDs)
│  2. Replace print() with logger calls:
│     logger.info("analyzing biomarkers", extra={
│         "biomarkers": {"glucose": 140},
│         "user_id": "user123",
│         "correlation_id": "req-abc123"
│     })
│  3. Add correlation IDs to track requests through agents
│  4. Structure logs as JSON (not text):
│     - timestamp
│     - level
│     - message
│     - context (user, request, agent)
│     - metrics (latency, tokens used)
│  5. Implement in all agents (src/agents/*)
│  6. Test: review logs.jsonl output
└─ Code Location: src/observability.py (NEW)
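One stdlib-only way to get the JSON shape above is a custom `logging.Formatter` that merges the fields passed via `extra=` (the field whitelist here is illustrative; a production formatter would carry latency/token metrics too):

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line, merging structured context."""

    CONTEXT_KEYS = ("biomarkers", "user_id", "correlation_id")

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Fields passed via `extra=` become attributes on the record
        for key in self.CONTEXT_KEYS:
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)

logger = logging.getLogger("ragbot")
handler = logging.StreamHandler(sys.stdout)  # a FileHandler would write logs.jsonl
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("analyzing biomarkers", extra={
    "biomarkers": {"glucose": 140},
    "user_id": "user123",
    "correlation_id": "req-abc123",
})
```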
SKILL #24: GitHub Actions Templates
├─ Duration: 2-3 hours
├─ Task: Set up CI/CD pipeline
├─ Deliverable: .github/workflows/test.yml (auto-run tests on PR)
├─ Actions:
│  1. Read SKILL.md (GitHub Actions workflow syntax)
│  2. Create .github/workflows/test.yml:
│     name: Run Tests
│     on: [push, pull_request]
│     jobs:
│       test:
│         runs-on: ubuntu-latest
│         steps:
│           - uses: actions/checkout@v3
│           - uses: actions/setup-python@v4
│           - run: pip install -r requirements.txt
│           - run: pytest tests/ -v --cov=src --cov-report=xml
│           - run: coverage report --fail-under=90
│  3. Create .github/workflows/security.yml:
│     - Run OWASP checks
│     - Lint code
│     - Check dependencies for CVEs
│  4. Create .github/workflows/docker.yml:
│     - Build Docker image
│     - Push to registry (optional)
│  5. Test: create a PR, verify workflows run
└─ Location: .github/workflows/
END OF PHASE 2 OUTCOMES:
✅ 90%+ test coverage achieved
✅ 50+ parametrized tests added
✅ Agent code refactored with design patterns
✅ LLM prompts optimized for medical accuracy
✅ Evaluation metrics show +15% accuracy improvement
✅ Function calling prevents JSON parsing failures
✅ Comprehensive error handling in place
✅ Structured JSON logging implemented
✅ CI/CD pipeline automated
────────────────────────────────────────────────────────────────────────────────
PHASE 3: RETRIEVAL OPTIMIZATION & KNOWLEDGE GRAPHS (Week 6-8)
────────────────────────────────────────────────────────────────────────────────
GOAL: Better medical knowledge retrieval + citations + knowledge graphs
Week 6: Days 26-30
SKILL #8: Hybrid Search Implementation
├─ Duration: 4-6 hours
├─ Task: Combine semantic + keyword search for better recall
├─ Deliverable: Hybrid retriever for RagBot (BM25 + FAISS)
├─ Actions:
│  1. Read SKILL.md (hybrid search architecture, reciprocal rank fusion)
│  2. Current state: only FAISS semantic search (misses rare diseases)
│  3. Add BM25 keyword search:
│     pip install rank-bm25
│  4. Create src/retrievers/hybrid_retriever.py:
│     class HybridRetriever:
│         def semantic_search(query, k=5)   # FAISS
│         def keyword_search(query, k=5)    # BM25
│         def hybrid_search(query)          # Combine + rerank
│  5. Reranking (Reciprocal Rank Fusion):
│     score = 1/(k + rank_semantic) + 1/(k + rank_keyword)
│  6. Replace old retriever in disease_explainer agent:
│     old: retriever = faiss_retriever
│     new: retriever = hybrid_retriever
│  7. Benchmark: test retrieval quality on 10 disease cases
│  8. Test rare disease retrieval (uncommon biomarker combinations)
└─ Code Location: src/retrievers/hybrid_retriever.py (NEW)
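The reciprocal rank fusion formula in step 5 generalizes to any number of ranked lists; a minimal sketch (the doc ids and `k=60`, a commonly used constant, are illustrative):

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc ids: each list contributes 1/(k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest combined score first
    return sorted(scores, key=scores.get, reverse=True)

semantic = ["d3", "d1", "d7"]  # FAISS order
keyword = ["d1", "d9", "d3"]   # BM25 order
fused = reciprocal_rank_fusion([semantic, keyword])
```

Documents ranked well by both retrievers (here `d1` and `d3`) float to the top, which is exactly the behavior that rescues rare-disease queries where one retriever alone misfires.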
SKILL #9: Chunking Strategy
├─ Duration: 4-5 hours
├─ Task: Optimize medical document chunking
├─ Deliverable: Improved chunks for better context
├─ Actions:
│  1. Read SKILL.md (chunking strategies, semantic boundaries)
│  2. Current: fixed 1000-char chunks (may split mid-sentence)
│  3. Implement intelligent chunking:
│     - Split by medical sections (diagnosis, treatment, etc.)
│     - Keep related content together
│     - Maintain minimum 500 chars (context), maximum 2000 chars (context window)
│  4. Preserve medical structure:
│     - Disease headers stay with symptoms
│     - Labs stay with reference ranges
│     - Treatment options stay together
│  5. Create src/chunking_strategy.py:
│     def chunk_medical_pdf(pdf_text) -> List[Chunk]:
│         # Split by disease headers, maintain structure
│  6. Re-chunk medical_knowledge.faiss (2,861 chunks → how many?)
│  7. Re-embed with new chunks
│  8. Benchmark: has document retrieval precision improved?
└─ Code Location: src/chunking_strategy.py (REFACTORED)
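The core of the section-aware split can be sketched with a regex that breaks on header lines like "Diagnosis:" or "Treatment:" so a header always stays with its body; the function name and header pattern are simplifications of what `chunk_medical_pdf` would need for real PDFs:

```python
import re

def chunk_medical_text(text: str, max_chars: int = 2000):
    """Split on section headers (e.g. 'Diagnosis:') so related content
    stays together; hard-split only sections that exceed max_chars."""
    # Zero-width lookahead keeps the header at the start of its chunk
    sections = re.split(r"\n(?=[A-Z][A-Za-z ]+:\n)", text)
    chunks = []
    for section in sections:
        section = section.strip()
        while len(section) > max_chars:
            chunks.append(section[:max_chars])
            section = section[max_chars:]
        if section:
            chunks.append(section)
    return chunks

sample = ("Diagnosis:\nType 2 diabetes is indicated by elevated HbA1c.\n"
          "Treatment:\nMetformin is first-line therapy.")
chunks = chunk_medical_text(sample)
```

A production version would also merge undersized sections up to the 500-char minimum named above.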
SKILL #10: Embedding Pipeline Builder
├─ Duration: 3-4 hours
├─ Task: Optimize embeddings for medical terminology
├─ Deliverable: Better semantic search for medical terms
├─ Actions:
│  1. Read SKILL.md (embedding models, fine-tuning considerations)
│  2. Current: sentence-transformers/all-MiniLM-L6-v2 (generic)
│  3. Options for medical embeddings:
│     - all-MiniLM-L6-v2 (~23M params, fast, baseline)
│     - all-mpnet-base-v2 (~110M params, better quality)
│     - Medical-specific: SciBERT or BioSentenceTransformer (if available)
│  4. Benchmark embeddings on medical queries:
│     Query: "High glucose and elevated HbA1c"
│     Expected top result: diabetes diagnosis section
│  5. If using a different model:
│     pip install [new-model]
│     Re-embed all medical documents
│     Save new FAISS index
│  6. Measure: mean reciprocal rank (MRR) of the correct document
│  7. Update src/pdf_processor.py with better embeddings
└─ Code Location: src/llm_config.py (MODIFIED)
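MRR, the metric named in step 6, is the average over queries of 1/rank of the first relevant document; a minimal sketch:

```python
def mean_reciprocal_rank(results_per_query, relevant_per_query):
    """MRR: mean of 1/rank of the first relevant doc for each query.
    A query whose relevant doc never appears contributes 0."""
    total = 0.0
    for retrieved, relevant in zip(results_per_query, relevant_per_query):
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(results_per_query)
```

Comparing MRR before and after swapping the embedding model gives a single number to justify (or reject) the re-embedding cost.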
SKILL #11: RAG Implementation
├─ Duration: 3-4 hours
├─ Task: Enforce citations in responses
├─ Deliverable: All claims backed by retrieved documents
├─ Actions:
│  1. Read SKILL.md (citation tracking, source attribution)
│  2. Modify disease_explainer agent to track sources:
│     result = retriever.hybrid_search(query)
│     sources = [doc.metadata['source'] for doc in result]
│     # Keep track of which statements came from which docs
│  3. Update ResponseSynthesizerAgent to require citations:
│     every claim must be followed by [source: page N]
│  4. Add validation:
│     if not has_citations(response):
│         return "Insufficient evidence for this conclusion"
│  5. Modify API response to include citations:
│     {
│       "disease": "Diabetes",
│       "evidence": [
│         {"claim": "High glucose", "source": "Clinical_Guidelines.pdf:p45"}
│       ]
│     }
│  6. Test: every response should have citations
└─ Code Location: src/agents/disease_explainer.py (MODIFIED)
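The `has_citations` check in step 4 can be a simple regex over the `[source: ...]` marker format; a sketch assuming that marker syntax (the function names mirror the pseudocode above):

```python
import re

# Matches markers like [source: Clinical_Guidelines.pdf:p45] or [source: page 12]
CITATION = re.compile(r"\[source:\s*[^\]]+\]")

def has_citations(response: str) -> bool:
    return bool(CITATION.search(response))

def enforce_citations(response: str) -> str:
    # Refuse to surface uncited medical claims
    if not has_citations(response):
        return "Insufficient evidence for this conclusion"
    return response
```

A stricter version would check per sentence rather than per response, so one cited claim cannot smuggle in several uncited ones.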
Week 7: Days 31-35
SKILL #12: Knowledge Graph Builder
├─ Duration: 6-8 hours
├─ Task: Extract and use knowledge graphs for relationships
├─ Deliverable: Biomarker → Disease → Treatment graph
├─ Actions:
│  1. Read SKILL.md (knowledge graphs, entity extraction, relationships)
│  2. Design graph structure:
│     Nodes: biomarkers, diseases, treatments, symptoms
│     Edges: "elevated_glucose" -[indicates]-> "diabetes"
│            "diabetes" -[treated_by]-> "metformin"
│  3. Extract entities from medical PDFs:
│     use an LLM to identify (biomarker, disease, treatment) triples;
│     store in a graph database (networkx for simplicity)
│  4. Build src/knowledge_graph.py:
│     class MedicalKnowledgeGraph:
│         def find_diseases_for_biomarker(biomarker) -> List[Disease]
│         def find_treatments_for_disease(disease) -> List[Treatment]
│         def shortest_path(biomarker, disease) -> List[Node]
│  5. Integrate with biomarker_analyzer:
│     use knowledge-graph paths instead of rule-based disease prediction
│  6. Test: graph should have >100 nodes, >500 edges
│  7. Visualize: create graph.html (D3.js visualization)
└─ Code Location: src/knowledge_graph.py (NEW)
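The edge-labelled structure in step 2 can be sketched with a plain adjacency dict (the plan suggests networkx; this stdlib version shows the same idea without the dependency, using the example triples from the plan):

```python
from collections import defaultdict

class MedicalKnowledgeGraph:
    """Directed graph over (biomarker, disease, treatment) triples,
    with labelled edges like -[indicates]-> and -[treated_by]->."""

    def __init__(self):
        self.edges = defaultdict(list)  # node -> [(relation, neighbor)]

    def add(self, src: str, relation: str, dst: str) -> None:
        self.edges[src].append((relation, dst))

    def find_diseases_for_biomarker(self, biomarker: str):
        return [dst for rel, dst in self.edges[biomarker] if rel == "indicates"]

    def find_treatments_for_disease(self, disease: str):
        return [dst for rel, dst in self.edges[disease] if rel == "treated_by"]

kg = MedicalKnowledgeGraph()
kg.add("elevated_glucose", "indicates", "diabetes")
kg.add("diabetes", "treated_by", "metformin")
```

With networkx, `shortest_path` comes for free (`nx.shortest_path`), which is the main reason to prefer it once the graph grows past toy size.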
SKILL #1: LangChain Architecture (Deep Dive)
├─ Duration: 3-4 hours
├─ Task: Advanced LangChain patterns for RAG
├─ Deliverable: More sophisticated agent chain design
├─ Actions:
│  1. Read SKILL.md (advanced chains, custom tools)
│  2. Add custom tools to agents:
│     @tool
│     def lookup_reference_range(biomarker: str) -> dict:
│         """Get normal range for biomarker"""
│         return config.biomarker_references[biomarker]
│  3. Create composite chains:
│     chain = (lookup_range_tool | linter | analyzer)
│  4. Implement memory for conversation context:
│     buffer = ConversationBufferMemory()
│     chain = RunnableWithMessageHistory(agent, buffer)
│  5. Add callbacks for observability:
│     .with_config(callbacks=[logger_callback])
│  6. Test chain composition & memory
└─ Code Location: src/agents/tools/ (NEW)
SKILL #28: Memory Management
├─ Duration: 3-4 hours
├─ Task: Optimize context window usage
├─ Deliverable: Fit more patient history without exceeding token limits
├─ Actions:
│  1. Read SKILL.md (context compression, memory hierarchies)
│  2. Implement sliding window memory:
│     keep the last 5 messages (pruned conversation);
│     summarize older messages into facts
│  3. Add context compression:
│     "User mentioned: glucose 140, HbA1c 10" (compressed)
│     instead of the full raw conversation
│  4. Monitor token usage:
│     - Groq free tier: ~500 requests/month
│     - Each request: ~1-2K tokens average
│  5. Optimize prompts to use fewer tokens:
│     remove verbose preamble; use shorthand for common terms
│  6. Test: save 20-30% on token usage
└─ Code Location: src/memory_manager.py (NEW)
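The sliding-window-plus-summary scheme in step 2 might look like this; the truncation stands in for real summarization (which the plan would delegate to an LLM), and the class name is illustrative:

```python
class SlidingWindowMemory:
    """Keep the last `window` messages verbatim; compress older ones to facts."""

    def __init__(self, window: int = 5):
        self.window = window
        self.messages = []  # recent turns, kept verbatim
        self.facts = []     # compressed remnants of evicted turns

    def add(self, message: str) -> None:
        self.messages.append(message)
        while len(self.messages) > self.window:
            evicted = self.messages.pop(0)
            # Crude compression stand-in; the real system would LLM-summarize
            self.facts.append(evicted[:40])

    def context(self) -> str:
        summary = "; ".join(self.facts)
        prefix = f"Known facts: {summary}\n" if summary else ""
        return prefix + "\n".join(self.messages)
```

Because the summary line grows far more slowly than raw turns, the prompt stays bounded no matter how long the conversation runs.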
Week 8: Days 36-40
SKILL #15: Cost-Aware LLM Pipeline
├─ Duration: 4-5 hours
├─ Task: Optimize API costs (reduce Groq/Gemini usage)
├─ Deliverable: Model routing by task complexity
├─ Actions:
│  1. Read SKILL.md (cost estimation, model selection, caching)
│  2. Analyze current costs:
│     - Groq llama-3.3-70B: expensive for simple tasks
│     - Gemini free tier: rate-limited
│  3. Implement model routing:
│     simple task: route to a smaller model (if available) or cache;
│     complex task: use llama-3.3-70B
│  4. Example routing:
│     if task == "extract_biomarkers" and has_cache:
│         return cached_result
│     elif task == "complex_reasoning":
│         use_groq_70b()
│     else:
│         use_gemini_free()
│  5. Implement caching:
│     hash(query) -> check cache -> LLM -> store result
│  6. Track costs:
│     log every API call with its cost; generate a monthly cost report
│  7. Target: -40% cost reduction
└─ Code Location: src/llm_config.py (MODIFIED)
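Steps 4 and 5 combine into a small cache-then-route function. This sketch uses stand-in callables for the two model tiers (the real code would wrap the Groq and Gemini clients); names and the in-memory cache are illustrative:

```python
import hashlib

_cache = {}

def _key(task: str, payload: str) -> str:
    # Step 5: hash(query) is the cache key
    return hashlib.sha256(f"{task}:{payload}".encode()).hexdigest()

def route(task: str, payload: str, call_cheap, call_strong):
    """Cache first; strong model only for complex reasoning; cheap otherwise."""
    key = _key(task, payload)
    if key in _cache:
        return _cache[key]
    if task == "complex_reasoning":
        result = call_strong(payload)   # e.g. Groq llama-3.3-70B
    else:
        result = call_cheap(payload)    # e.g. Gemini free tier
    _cache[key] = result
    return result

calls = {"cheap": 0, "strong": 0}
def cheap(p):
    calls["cheap"] += 1
    return f"cheap:{p}"
def strong(p):
    calls["strong"] += 1
    return f"strong:{p}"

first = route("extract_biomarkers", "glucose 140", cheap, strong)
second = route("extract_biomarkers", "glucose 140", cheap, strong)  # cache hit
```

Logging a per-call cost alongside each cache miss is enough to generate the monthly report in step 6.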
END OF PHASE 3 OUTCOMES:
✅ Hybrid search implemented (semantic + keyword)
✅ Medical chunking improves knowledge quality
✅ Embeddings optimized for medical terminology
✅ Citation enforcement in all RAG outputs
✅ Knowledge graph built from medical PDFs
✅ LangChain advanced patterns implemented
✅ Context window optimization reduces token waste
✅ Model routing saves ~40% on API costs
✅ Better disease prediction via knowledge graphs
────────────────────────────────────────────────────────────────────────────────
PHASE 4: DEPLOYMENT, MONITORING & SCALING (Week 9-12)
────────────────────────────────────────────────────────────────────────────────
GOAL: Production-ready system with monitoring, docs, and deployment
Week 9: Days 41-45
SKILL #25: FastAPI Templates
├─ Duration: 3-4 hours
├─ Task: Production-grade FastAPI configuration
├─ Deliverable: Optimized FastAPI settings, middleware
├─ Actions:
│  1. Read SKILL.md (async patterns, dependency injection, middleware)
│  2. Apply async best practices:
│     - All endpoints async def
│     - Use asyncio for parallel agent calls
│     - Remove any sync blocking calls
│  3. Add middleware chain:
│     - CORS middleware (for web frontend)
│     - Request logging (correlation IDs)
│     - Error handling
│     - Rate limiting
│     - Auth
│  4. Optimize configuration:
│     - Connection pooling for databases
│     - Caching headers (HTTP)
│     - Compression (gzip)
│  5. Add health checks:
│     /health - basic healthcheck
│     /health/deep - check dependencies (FAISS, LLM)
│  6. Test: load testing with async clients
└─ Code Location: api/app/main.py (REFACTORED)
SKILL #29: API Docs Generator
├─ Duration: 2-3 hours
├─ Task: Auto-generate OpenAPI spec + interactive docs
├─ Deliverable: /docs (Swagger UI) + /redoc (ReDoc)
├─ Actions:
│  1. Read SKILL.md (OpenAPI, Swagger UI, ReDoc)
│  2. FastAPI auto-generates OpenAPI from endpoints
│  3. Enhance documentation:
│     add detailed descriptions, example responses, and error codes per endpoint
│  4. Example:
│     @app.post("/api/v1/analyze/structured")
│     async def analyze_structured(request: AnalysisRequest):
│         """
│         Analyze biomarkers (structured input)
│
│         - **biomarkers**: Dict of biomarker names → values
│         - **response**: Full analysis with disease prediction
│
│         Example:
│         {"biomarkers": {"glucose": 140, "HbA1c": 10}}
│         """
│  5. Auto-docs available at:
│     http://localhost:8000/docs
│     http://localhost:8000/redoc
│  6. Generate OpenAPI JSON:
│     http://localhost:8000/openapi.json
│  7. Create client SDK (optional):
│     OpenAPI Generator → Python, JS, Go clients
└─ Docs auto-generated from code
SKILL #30: GitHub PR Review Workflow
├─ Duration: 2-3 hours
├─ Task: Establish code review standards
├─ Deliverable: CODEOWNERS, PR templates, branch protection
├─ Actions:
│  1. Read SKILL.md (PR templates, CODEOWNERS, review process)
│  2. Create .github/CODEOWNERS:
│     # Security reviews required for:
│     /api/app/middleware/ @security-team
│     # Testing reviews required for:
│     /tests/ @qa-team
│  3. Create .github/pull_request_template.md:
│     ## Description
│     ## Type of change
│     ## Tests added
│     ## Checklist
│     ## Related issues
│  4. Configure branch protection:
│     - Require 1 approval before merge
│     - Require status checks to pass (tests, lint)
│     - Require up-to-date branch
│  5. Create CONTRIBUTING.md with guidelines
└─ Location: .github/
Week 10: Days 46-50
SKILL #27: Python Observability (Advanced)
├─ Duration: 4-5 hours
├─ Task: Metrics collection + monitoring dashboard
├─ Deliverable: Key metrics tracked (latency, accuracy, errors)
├─ Actions:
│  1. Read SKILL.md (metrics, histograms, summaries)
│  2. Add Prometheus metrics:
│     pip install prometheus-client
│  3. Track key metrics:
│     - request_latency_ms (histogram)
│     - disease_prediction_accuracy (gauge)
│     - llm_api_calls_total (counter)
│     - error_rate (gauge)
│     - citations_found_rate (gauge)
│  4. Add to all agents:
│     with timer("biomarker_analyzer"):
│         result = analyzer.invoke(input)
│  5. Expose metrics at /metrics
│  6. Integrate with monitoring (optional):
│     send to Prometheus → Grafana dashboard
│  7. Alerts:
│     if latency > 25s: alert
│     if accuracy < 75%: alert
│     if error rate > 5%: alert
└─ Code Location: src/monitoring/ (NEW)
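The `timer(...)` context manager from step 4 can be prototyped with the stdlib before wiring in prometheus-client (whose `Histogram.time()` would replace this in production); the percentile helper is a naive illustration, not a Prometheus API:

```python
import time
from collections import defaultdict
from contextlib import contextmanager

latencies_ms = defaultdict(list)  # metric name -> observed samples

@contextmanager
def timer(name: str):
    """Record wall-clock duration of the wrapped block, in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        latencies_ms[name].append((time.perf_counter() - start) * 1000)

def percentile(samples, p):
    """Naive nearest-rank percentile over a list of samples."""
    ordered = sorted(samples)
    return ordered[min(len(ordered) - 1, int(p / 100 * len(ordered)))]

with timer("biomarker_analyzer"):
    time.sleep(0.01)  # stand-in for analyzer.invoke(input)
```

The alert rules in step 7 then reduce to checks like `percentile(latencies_ms["request"], 95) > 25_000`.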
SKILL #23: Code Review Excellence
├─ Duration: 2-3 hours
├─ Task: Review and improve code quality
├─ Deliverable: Code quality assessment report
├─ Actions:
│  1. Read SKILL.md (code review patterns, common issues)
│  2. Self-review all Phase 1-3 changes:
│     - Are functions <20 lines? (if not, break up)
│     - Are variable names clear? (rename if not)
│     - Are error cases handled? (if not, add)
│     - Are tests present? (required: >90% coverage)
│  3. Common medical code patterns to enforce:
│     - Never assume biomarker values are valid
│     - Always include units (mg/dL, etc.)
│     - Always cite medical literature
│     - Never hardcode disease thresholds
│  4. Create REVIEW_GUIDELINES.md
│  5. Review agent implementations:
│     check for typos, unclear logic, missing docstrings
└─ Code Location: docs/REVIEW_GUIDELINES.md (NEW)
SKILL #31: CI-CD Best Practices
├─ Duration: 3-4 hours
├─ Task: Enhance CI/CD with deployment
├─ Deliverable: Automated deployment pipeline
├─ Actions:
│  1. Read SKILL.md (deployment strategies, environments)
│  2. Add deployment workflow (.github/workflows/deploy.yml):
│     - Build Docker image
│     - Push to registry
│     - Deploy to staging
│     - Run smoke tests
│     - Manual approval for production
│     - Deploy to production
│  3. Environment management:
│     - .env.development (localhost)
│     - .env.staging (staging server)
│     - .env.production (prod server)
│  4. Deployment strategy:
│     canary: deploy to 10% of traffic first, monitor for errors;
│     if OK, deploy to 100%; if errors, roll back
│  5. Docker configuration:
│     multi-stage build for smaller images;
│     security: non-root user, minimal base image
│  6. Test deployment locally:
│     docker build -t ragbot .
│     docker run -p 8000:8000 ragbot
└─ Location: .github/workflows/deploy.yml (NEW)
SKILL #32: Frontend Accessibility (if building a web frontend)
├── Duration: 2-3 hours (optional; skip if CLI only)
├── Task: Accessibility standards for the web interface
├── Deliverable: WCAG 2.1 AA compliant UI
├── Actions:
│   1. Read SKILL.md (a11y, screen readers, keyboard navigation)
│   2. If building a React frontend for medical results:
│      - All buttons keyboard accessible
│      - Screen-reader labels on medical data
│      - High contrast for readability
│      - Clear error messages
│   3. Test with a screen reader (NVDA or JAWS)
└── Code Location: examples/web_interface/ (if needed)
Week 11: Days 51-55
SKILL #6: LLM Application Dev with LangChain
├── Duration: 4-5 hours
├── Task: Production LangChain patterns
├── Deliverable: Robust, maintainable agent code
├── Actions:
│   1. Read SKILL.md (production patterns, error handling, logging)
│   2. Implement the agent lifecycle:
│      - Setup (load models, prepare context)
│      - Execution (with retries)
│      - Cleanup (save state, log metrics)
│   3. Add retry logic for LLM calls:
│      @retry(max_attempts=3, backoff=exponential)
│      def invoke_agent(self, input):
│          return self.llm.predict(...)
│   4. Add graceful degradation:
│      If the LLM fails, return a cached result
│      If the vector store fails, fall back to a rule-based result
│   5. Implement agent composition:
│      Multi-step workflows where agents call other agents
│   6. Test: 99.99% uptime in staging
└── Code Location: src/agents/base_agent.py (REFINED)
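The `@retry` decorator in step 3 is pseudocode; a minimal hand-rolled version with exponential backoff (names and defaults here are illustrative — a library such as tenacity provides the same pattern) might look like:

```python
import functools
import time

def retry(max_attempts: int = 3, base_delay: float = 0.5):
    """Retry a flaky call, doubling the delay between attempts."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts:
                        raise  # out of attempts: surface the real error
                    # Exponential backoff: base, 2*base, 4*base, ...
                    time.sleep(base_delay * 2 ** (attempt - 1))
        return wrapper
    return decorator
```

Graceful degradation (step 4) then lives in the `except` path of the caller: catch the final exception and fall back to the cached or rule-based result instead of propagating it.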
SKILL #33: Webhook Receiver Hardener
├── Duration: 2-3 hours
├── Task: Secure webhook handling (for integrations)
├── Deliverable: Webhook endpoint with signature verification
├── Actions:
│   1. Read SKILL.md (signature verification, replay protection)
│   2. If accepting webhooks from external systems:
│      - Verify the HMAC signature
│      - Check the timestamp (prevent replay attacks)
│      - Handle idempotency keys
│   3. Example: an EHR system sends patient updates
│      POST /webhooks/patient-update
│      Verify: X-Webhook-Signature header
│      Prevent: the same update being processed twice
│   4. Create api/app/webhooks/ (NEW if needed)
│   5. Test: webhook security scenarios
└── Code Location: api/app/webhooks/ (OPTIONAL)
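The three checks in step 2 can be sketched in a few lines of stdlib Python; the function name, parameter shape, and 5-minute skew window below are assumptions, not the actual api/app/webhooks/ implementation:

```python
import hashlib
import hmac
import time

MAX_SKEW_SECONDS = 300  # reject payloads older than 5 minutes (replay guard)

def verify_webhook(secret: bytes, body: bytes, signature: str,
                   timestamp: float, seen_keys: set, idempotency_key: str) -> bool:
    """Accept a webhook only if the HMAC matches, the timestamp is fresh,
    and the idempotency key has not been processed before."""
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, signature):
        return False  # forged or corrupted payload
    if abs(time.time() - timestamp) > MAX_SKEW_SECONDS:
        return False  # stale: possible replay attack
    if idempotency_key in seen_keys:
        return False  # duplicate delivery: already processed
    seen_keys.add(idempotency_key)
    return True
```

`hmac.compare_digest` is used instead of `==` to avoid timing side channels; in production the seen-keys set would live in a shared store (e.g. Redis with a TTL) rather than process memory.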
Week 12: Days 56-60
SKILL #7: RAG Agent Builder
├── Duration: 4-5 hours
├── Task: Full RAG agent architecture review
├── Deliverable: Production-ready RAG agents
├── Actions:
│   1. Read SKILL.md (RAG agent design, retrieval QA chains)
│   2. Comprehensive RAG review:
│      - Retriever quality (hybrid search, ranking)
│      - Prompt quality (citations, evidence)
│      - Response quality (accurate, safe)
│   3. Disease Explainer Agent refactor:
│      Step 1: Retrieve relevant medical documents
│      Step 2: Extract key evidence from the documents
│      Step 3: Synthesize an explanation with citations
│      Step 4: Assess confidence (high/medium/low)
│   4. Test: all responses have citations
│   5. Test: no medical hallucinations
│   6. Benchmark: accuracy, latency, cost
└── Code Location: src/agents/ (FINAL REVIEW)
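The four-step refactor can be sketched as plain functions; the retriever is stubbed and every name here is hypothetical, but it shows the citation-enforcement and confidence-grading shape the tests in steps 4-5 check for:

```python
from dataclasses import dataclass

@dataclass
class Evidence:
    source: str  # citation identifier, e.g. "PMID:12345"
    text: str

def retrieve(query: str) -> list[Evidence]:
    # Step 1: hybrid search over the medical corpus (stubbed here).
    return [Evidence("PMID:12345", "HbA1c >= 6.5% is diagnostic for diabetes.")]

def synthesize(query: str, docs: list[Evidence]) -> dict:
    # Steps 2-4: extract evidence, build the explanation, attach citations,
    # and grade confidence by how much evidence supports the answer.
    citations = [d.source for d in docs]
    confidence = "high" if len(docs) >= 3 else "medium" if docs else "low"
    return {"answer": f"Explanation for: {query}",
            "citations": citations, "confidence": confidence}

def explain(query: str) -> dict:
    docs = retrieve(query)
    result = synthesize(query, docs)
    # Citation enforcement: never return an answer without sources.
    assert result["citations"], "no citations found"
    return result
```

Keeping each step a separate function makes "all responses have citations" a one-line invariant rather than something buried inside a prompt.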
Final Week Integration (Days 56-60):
SKILL #2: Workflow Orchestration (Refinement)
├── Final review of the entire workflow
├── Ensure all agents work together
└── Test end-to-end: CLI and API
Comprehensive Testing:
├── Functional tests: all features work
├── Security tests: no vulnerabilities
├── Performance tests: <20s latency
└── Load tests: handle 10 concurrent requests
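The load-test requirement can be smoke-tested with a thread pool; `handle_request` below is a stand-in for a real HTTP call to the API (the endpoint name is hypothetical), and the budget mirrors the <20s target above:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def handle_request(i: int) -> dict:
    # Stand-in for a real HTTP call (e.g. POST to the prediction endpoint).
    time.sleep(0.05)  # simulate server-side work
    return {"id": i, "status": 200}

def load_test(n: int = 10, max_total_s: float = 20.0) -> list[dict]:
    """Fire n concurrent requests and fail if the batch exceeds the budget."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n) as pool:
        results = list(pool.map(handle_request, range(n)))
    assert time.perf_counter() - start < max_total_s, "latency budget exceeded"
    return results
```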
Documentation:
├── Update README with new features
├── Document the API at /docs
├── Create a deployment guide
└── Create a troubleshooting guide
Production Deployment:
├── Stage: test against a real environment
├── Canary: 10% of traffic
├── Monitor: errors, latency, accuracy
└── Full deployment: 100% of traffic
END OF PHASE 4 OUTCOMES:
✅ FastAPI optimized for production
✅ API documentation auto-generated
✅ Code review standards established
✅ Full observability (logging, metrics)
✅ CI/CD with automated deployment
✅ Security best practices implemented
✅ Production-ready RAG agents
✅ System deployed and monitored
────────────────────────────────────────────────────────────────────────────────
IMPLEMENTATION SUMMARY
────────────────────────────────────────────────────────────────────────────────
SKILLS USED IN ORDER:
Phase 1 (Security + Fixes): 2, 3, 4, 16, 17, 18, 19, 20, 22
Phase 2 (Testing + Agents): 22, 26, 4, 13, 14, 5, 21, 27, 24
Phase 3 (Retrieval + Graphs): 8, 9, 10, 11, 12, 1, 28, 15
Phase 4 (Production): 25, 29, 30, 27, 23, 31, 32(*), 6, 33(*), 7
(*) Optional based on needs
TOTAL IMPLEMENTATION TIME:
Phase 1: ~30-40 hours
Phase 2: ~35-45 hours
Phase 3: ~30-40 hours
Phase 4: ~30-40 hours
─────────────────────
TOTAL: ~130-160 hours over 12 weeks (~10-12 hours/week)
EXPECTED OUTCOMES:
Metrics:
Test Coverage: 70% → 90%+
Response Latency: 25s → 15-20s (-30%)
Accuracy: 65% → 80% (+15-20%)
API Costs: -40% via optimization
Citations: 0% → 100%
Quality:
✅ OWASP compliant
✅ HIPAA aligned
✅ Production-ready
✅ Enterprise monitoring
✅ Automated deployments
System Capabilities:
✅ Hybrid semantic + keyword search
✅ Knowledge graphs for reasoning
✅ Cost-optimized LLM routing
✅ Full citation enforcement
✅ Advanced observability
────────────────────────────────────────────────────────────────────────────────
WEEKLY CHECKLIST
────────────────────────────────────────────────────────────────────────────────
Each week, verify:
☐ Code committed with clear commit messages
☐ Tests pass locally: pytest -v --cov
☐ Coverage >85% on any new code
☐ PR created with documentation
☐ Code reviewed (self or team)
☐ No security warnings
☐ Documentation updated
☐ Metrics tracked (custom dashboard)
☐ No breaking changes to API
────────────────────────────────────────────────────────────────────────────────
DONE! Your 4-month implementation plan is ready.
Start with Phase 1 Week 1.
Execute systematically.
Measure progress weekly.
Celebrate wins!
Your RagBot will be enterprise-grade. 🚀