────────────────────────────────────────────────────────────────────────────────
  RAGBOT 4-MONTH IMPLEMENTATION ROADMAP - ALL 34 SKILLS
  Systematic, Phased Approach to Enterprise-Grade AI
────────────────────────────────────────────────────────────────────────────────
IMPLEMENTATION PHILOSOPHY
────────────────────────────────────────────────────────────────────────────────
• Fix critical issues first (security, state management, schema)
• Build tests concurrently (every feature gets tests immediately)
• Deploy incrementally (working code at each phase)
• Measure continuously (metrics drive priorities)
• Document along the way (knowledge preservation)
PROJECT BASELINE
────────────────────────────────────────────────────────────────────────────────
Current Status:
• 83+ passing tests (~70% coverage)
• 6 specialist agents (Biomarker Analyzer, Disease Explainer, etc.)
• FastAPI REST API + CLI interface
• FAISS vector store (750+ pages of medical knowledge)
• 2,861 medical knowledge chunks
Critical Issues to Fix:
- biomarker_flags & safety_alerts not propagating through workflow
- Schema mismatch between workflow output & API formatter
- Prediction confidence forced to 0.5 (dangerous for medical domain)
- Different biomarker naming (API vs CLI)
- JSON parsing breaks on malformed LLM output
- No citation enforcement in RAG outputs
Success Metrics:
• Test coverage: 70% → 90%+
• Response latency: 25s → 15-20s
• Prediction accuracy: +15-20%
• API costs: -40% (Groq free tier optimization)
• Security: OWASP compliant, HIPAA aligned
────────────────────────────────────────────────────────────────────────────────
PHASE 1: FOUNDATION & CRITICAL FIXES (Weeks 1-2)
────────────────────────────────────────────────────────────────────────────────
GOAL: Security baseline + fix state propagation + unify schemas
Week 1: Days 1-5
SKILL #18: OWASP Security Check
├─ Duration: 2-3 hours
├─ Task: Run a comprehensive security audit
├─ Deliverable: Prioritized list of security issues and fixes
├─ Actions:
│  1. Read SKILL.md documentation
│  2. Run vulnerability scanner on /api and /src
│  3. Document findings in SECURITY_AUDIT.md
│  4. Create tickets for each finding
└─ Outcome: Clear understanding of security gaps
SKILL #17: API Security Hardening
├─ Duration: 4-6 hours
├─ Task: Implement authentication & hardening
├─ Deliverable: JWT auth on the /api/v1/analyze endpoint
├─ Actions:
│  1. Read SKILL.md (auth patterns, CORS, headers)
│  2. Add JWT middleware to api/main.py
│  3. Update routes with the @require_auth decorator
│  4. Add security headers (HSTS, CSP, X-Frame-Options)
│  5. Write tests for auth (SKILL #22: Python Testing Patterns)
│  6. Update docs with the API key requirement
└─ Code Location: api/app/middleware/auth.py (NEW)
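The token flow behind this skill can be sketched with the standard library alone. This is a minimal illustration of signed, expiring bearer tokens — a production implementation would use a real JWT library (e.g. python-jose) and load the secret from the environment; the function names and the `SECRET` value here are hypothetical.

```python
import base64
import hashlib
import hmac
import json
import time

SECRET = b"change-me"  # assumption: loaded from env/config in production


def _sign(payload: bytes) -> str:
    # HMAC-SHA256 over the raw payload bytes
    return hmac.new(SECRET, payload, hashlib.sha256).hexdigest()


def issue_token(user_id: str, ttl_s: int = 3600) -> str:
    """Create a signed token of the form base64(payload).signature."""
    payload = json.dumps({"sub": user_id, "exp": time.time() + ttl_s}).encode()
    body = base64.urlsafe_b64encode(payload).decode()
    return f"{body}.{_sign(payload)}"


def verify_token(token: str):
    """Return the claims dict if signature and expiry check out, else None."""
    try:
        body, sig = token.rsplit(".", 1)
        payload = base64.urlsafe_b64decode(body)
    except Exception:
        return None
    if not hmac.compare_digest(sig, _sign(payload)):
        return None  # tampered token
    claims = json.loads(payload)
    if claims["exp"] < time.time():
        return None  # expired token
    return claims
```

A middleware would call `verify_token` on the `Authorization: Bearer ...` header and reject the request with 401 when it returns `None`.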
SKILL #22: Python Testing Patterns (First Use)
├─ Duration: 2-3 hours
├─ Task: Create testing infrastructure & auth tests
├─ Deliverable: tests/test_api_auth.py with 10+ tests
├─ Actions:
│  1. Read SKILL.md (fixtures, mocking, parametrization)
│  2. Create conftest.py with auth fixtures
│  3. Write tests for JWT generation, validation, and failure cases
│  4. Implement pytest fixtures for an authenticated client
│  5. Run: pytest tests/test_api_auth.py -v
└─ Outcome: 80% test coverage on the auth module
SKILL #2: Workflow Orchestration Patterns
├─ Duration: 4-6 hours
├─ Task: Fix state propagation in the LangGraph workflow
├─ Deliverable: biomarker_flags & safety_alerts propagate end-to-end
├─ Actions:
│  1. Read SKILL.md (LangGraph state management, parallel execution)
│  2. Review the current structure of src/state.py
│  3. Identify missing state fields in GuildState
│  4. Refactor agents to return complete state:
│     - src/agents/biomarker_analyzer.py → return biomarker_flags
│     - src/agents/biomarker_analyzer.py → return safety_alerts
│     - src/agents/confidence_assessor.py → update state
│  5. Test with: python -c "from src.workflow import create_guild..."
│  6. Write integration tests (SKILL #22)
└─ Code Changes: src/state.py, src/agents/*.py
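The fix above hinges on each node returning the fields it produces so the graph runtime can merge them into the shared state. A minimal sketch of that pattern, without the LangGraph dependency — the field names match the roadmap, but the threshold values and node logic are hypothetical:

```python
from typing import List, TypedDict


class GuildState(TypedDict, total=False):
    biomarkers: dict
    biomarker_flags: List[str]   # previously dropped mid-workflow
    safety_alerts: List[str]     # previously dropped mid-workflow
    prediction: str


def biomarker_analyzer_node(state: GuildState) -> GuildState:
    """Return a partial state update containing every field this node
    produces; the workflow merges it so downstream nodes (and the API
    formatter) can read biomarker_flags and safety_alerts."""
    flags: List[str] = []
    alerts: List[str] = []
    for name, value in state.get("biomarkers", {}).items():
        if name == "glucose" and value > 125:   # illustrative threshold
            flags.append("glucose_high")
        if name == "glucose" and value > 300:   # illustrative threshold
            alerts.append("glucose_critical")
    return {"biomarker_flags": flags, "safety_alerts": alerts}
```

The bug class being fixed is a node that computes `flags` locally but returns only `{"prediction": ...}`, silently dropping the rest.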
SKILL #16: AI Wrapper/Structured Output
├─ Duration: 3-5 hours
├─ Task: Unify the workflow → API response schema
├─ Deliverable: Single canonical response format (Pydantic model)
├─ Actions:
│  1. Read SKILL.md (structured outputs, Pydantic, validation)
│  2. Create api/app/models/response.py with a unified schema
│  3. Define BaseAnalysisResponse with all required fields
│  4. Update api/app/services/ragbot.py to use the unified schema
│  5. Ensure ResponseSynthesizerAgent output matches the schema
│  6. Add Pydantic validation in all endpoints
│  7. Run: pytest tests/test_response_schema.py -v
└─ Code Location: api/app/models/response.py (REFACTORED)
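The shape of the canonical schema might look as follows. The roadmap calls for a Pydantic model; this sketch uses a stdlib dataclass for portability, and the exact field list is an assumption beyond `biomarker_flags`/`safety_alerts`, which the roadmap names:

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class BaseAnalysisResponse:
    """Single canonical shape returned by both the API and the CLI."""
    disease: str
    confidence: float
    biomarker_flags: List[str] = field(default_factory=list)
    safety_alerts: List[str] = field(default_factory=list)
    citations: List[Dict[str, str]] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Validation that Pydantic would otherwise provide
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be in [0, 1]")
```

With one model at the boundary, the workflow-output/API-formatter mismatch from the issues list disappears: both sides construct and consume the same type.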
Week 2: Days 6-10
SKILL #3: Multi-Agent Orchestration
├─ Duration: 3-4 hours
├─ Task: Fix deterministic execution of parallel agents
├─ Deliverable: Agents execute without race conditions
├─ Actions:
│  1. Read SKILL.md (agent coordination, deterministic scheduling)
│  2. Review parallel execution in src/workflow.py
│  3. Ensure explicit state passing between agents:
│     - Biomarker Analyzer outputs → Disease Explainer inputs
│     - Sequential where needed (Analyzer before Linker)
│     - Parallel where safe (Explainer & Guidelines)
│  4. Add logging to track execution order
│  5. Run 10 times: python scripts/test_chat_demo.py (same output each time)
└─ Outcome: Deterministic workflow execution
SKILL #19: LLM Security
├─ Duration: 3-4 hours
├─ Task: Prevent LLM-specific attacks
├─ Deliverable: Input validation against prompt injection
├─ Actions:
│  1. Read SKILL.md (prompt injection, token limit attacks)
│  2. Add input sanitization in api/app/services/extraction.py
│  3. Implement prompt injection detection:
│     - Check for "ignore instructions" patterns
│     - Limit biomarker input length
│     - Escape special characters
│  4. Add per-user rate limiting (SKILL #20)
│  5. Write security tests
└─ Code Location: api/app/middleware/input_validation.py (NEW)
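The three detection steps above can be sketched as a single sanitizer. The specific regex patterns and the length cap are illustrative assumptions — a real deployment would tune them against observed attacks:

```python
import re

# Illustrative patterns; a production list would be broader and tested
INJECTION_PATTERNS = [
    r"ignore (all |previous |prior )?(instructions|prompts)",
    r"disregard .*(instructions|system prompt)",
    r"you are now",
]
MAX_INPUT_CHARS = 2000  # assumption: biomarker text should be short


def sanitize_biomarker_input(text: str) -> str:
    """Reject over-long or injection-like input; escape braces that
    could corrupt str.format-style prompt templates."""
    if len(text) > MAX_INPUT_CHARS:
        raise ValueError("input too long")
    lowered = text.lower()
    for pattern in INJECTION_PATTERNS:
        if re.search(pattern, lowered):
            raise ValueError("possible prompt injection detected")
    return text.replace("{", "{{").replace("}", "}}")
```

Pattern matching alone is not a complete defense; it complements, rather than replaces, output validation and least-privilege tool access.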
SKILL #20: API Rate Limiting
├─ Duration: 2-3 hours
├─ Task: Implement tiered rate limiting
├─ Deliverable: /api/v1/analyze limited to 10/min free, 1000/min pro
├─ Actions:
│  1. Read SKILL.md (token bucket, sliding window algorithms)
│  2. Import the python-ratelimit library
│  3. Add rate limiter middleware to api/main.py
│  4. Implement tiered limits (free/pro based on API key)
│  5. Return 429 with Retry-After headers
│  6. Test rate limiting behavior
└─ Code Location: api/app/middleware/rate_limiter.py (NEW)
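The token-bucket algorithm named in step 1 is short enough to sketch directly; the library in step 2 would replace this in practice. The tier limits match the roadmap's 10/min and 1000/min figures:

```python
import time


class TokenBucket:
    """Token-bucket limiter: refills at `rate_per_min` tokens per
    minute, allows bursts up to `capacity`."""

    def __init__(self, rate_per_min: float, capacity: int):
        self.rate = rate_per_min / 60.0       # tokens per second
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Tiered limits keyed by plan, as in the roadmap
LIMITS = {"free": TokenBucket(10, 10), "pro": TokenBucket(1000, 1000)}
```

A middleware would look up the caller's tier from the API key, call `allow()`, and return 429 with a Retry-After header when it is False.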
END OF PHASE 1 OUTCOMES:
✓ Security audit complete with fixes prioritized
✓ JWT authentication on the REST API
✓ biomarker_flags & safety_alerts propagating through the workflow
✓ Unified response schema (API & CLI use the same format)
✓ LLM prompt injection protection
✓ Rate limiting in place
✓ Auth + security tests written (15+ new tests)
✓ Coverage increased to ~75%
────────────────────────────────────────────────────────────────────────────────
PHASE 2: TEST EXPANSION & AGENT OPTIMIZATION (Weeks 3-5)
────────────────────────────────────────────────────────────────────────────────
GOAL: 90%+ test coverage + improved agent decision logic + prompt optimization
Week 3: Days 11-15
SKILL #22: Python Testing Patterns (Advanced Use)
├─ Duration: 8-10 hours (this is the main focus)
├─ Task: Parametrized testing for biomarker combinations
├─ Deliverable: 50+ new parametrized tests
├─ Actions:
│  1. Read the SKILL.md sections on parametrization & fixtures
│  2. Create tests/fixtures/biomarkers.py with test data:
│     - Normal values tuple
│     - Diabetes indicators tuple
│     - Mixed abnormal values tuple
│     - Edge cases tuple
│  3. Write a parametrized test for each biomarker combination:
│     @pytest.mark.parametrize("biomarkers,expected_disease", [...])
│     def test_disease_prediction(biomarkers, expected_disease):
│         assert predict_disease(biomarkers) == expected_disease
│  4. Create mocking fixtures for LLM calls:
│     @pytest.fixture
│     def mock_groq_client(monkeypatch):
│         # Mock all LLM interactions
│  5. Test agent outputs:
│     - Biomarker Analyzer with 10 scenarios
│     - Disease Explainer with 5 diseases
│     - Confidence Assessor with low/medium/high confidence cases
│  6. Run: pytest tests/ -v --cov src --cov-report=html
│  7. Goal: 90%+ coverage on agents/
└─ Code Location: tests/test_parametrized_*.py
SKILL #26: Python Design Patterns
├─ Duration: 4-5 hours
├─ Task: Refactor agent implementations with design patterns
├─ Deliverable: Cleaner, more maintainable agent code
├─ Actions:
│  1. Read SKILL.md (SOLID, composition, factory patterns)
│  2. Identify code smells in src/agents/
│  3. Extract common agent logic into a BaseAgent class:
│     class BaseAgent:
│         def invoke(self, input_data) -> AgentOutput
│         def validate_inputs(self)
│         def log_execution(self)
│  4. Use composition over inheritance:
│     - Each agent has an optional retriever, validator, cache
│     - Reduce coupling between agents
│  5. Implement the Factory pattern for agent creation:
│     AgentFactory.create("biomarker_analyzer")
│  6. Refactor tests to use the new pattern
└─ Code Location: src/agents/base_agent.py (NEW)
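A runnable sketch of the `BaseAgent` / `AgentFactory` shape outlined in steps 3 and 5. The `_run` body here is a placeholder stub, not the real analyzer logic:

```python
from abc import ABC, abstractmethod


class BaseAgent(ABC):
    """Shared lifecycle: validate inputs, run, log. Subclasses only
    implement _run, keeping the template method in one place."""
    name = "base"

    def invoke(self, input_data: dict) -> dict:
        self.validate_inputs(input_data)
        result = self._run(input_data)
        self.log_execution(input_data, result)
        return result

    def validate_inputs(self, input_data: dict) -> None:
        if not isinstance(input_data, dict):
            raise TypeError("agent input must be a dict")

    def log_execution(self, input_data: dict, result: dict) -> None:
        pass  # hook for the structured logging added later in the roadmap

    @abstractmethod
    def _run(self, input_data: dict) -> dict:
        ...


class BiomarkerAnalyzerAgent(BaseAgent):
    name = "biomarker_analyzer"

    def _run(self, input_data: dict) -> dict:
        # Placeholder logic: flag any value above an arbitrary cutoff
        return {"flags": [k for k, v in input_data.items() if v > 100]}


class AgentFactory:
    """Factory pattern: create agents by name."""
    _registry = {"biomarker_analyzer": BiomarkerAnalyzerAgent}

    @classmethod
    def create(cls, name: str) -> BaseAgent:
        return cls._registry[name]()
```

The factory registry makes it easy to add agents without touching call sites, and the base class guarantees every agent validates and logs consistently.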
SKILL #4: Agentic Development
├─ Duration: 3-4 hours
├─ Task: Improve agent decision logic
├─ Deliverable: Better biomarker analysis confidence scores
├─ Actions:
│  1. Read SKILL.md (planning, reasoning, decision making)
│  2. Add a confidence threshold in BiomarkerAnalyzerAgent
│  3. Instead of returning all results:
│     - Only return HIGH confidence matches
│     - Flag LOW confidence results for manual review
│     - Add a reasoning trace (why this conclusion)
│  4. Update the response format with:
│     - confidence_score (0-1)
│     - evidence_count (# sources)
│     - alternative_hypotheses (if low confidence)
│  5. Update tests
└─ Code Location: src/agents/biomarker_analyzer.py (MODIFIED)
SKILL #13: Senior Prompt Engineer (First Use)
├─ Duration: 5-6 hours
├─ Task: Optimize prompts for medical accuracy
├─ Deliverable: Updated agent prompts with better accuracy
├─ Actions:
│  1. Read SKILL.md (prompt patterns, few-shot, CoT)
│  2. Audit current agent prompts in src/agents/*.py
│  3. Apply few-shot learning to the extraction agent:
│     - Add 3 examples of correct biomarker extraction
│     - Show the expected format
│     - Show handling of ambiguous inputs
│  4. Add chain-of-thought reasoning:
│     "First identify the biomarkers mentioned. Then look up their ranges.
│      Then determine if abnormal. Then assess severity."
│  5. Add role prompting:
│     "You are an expert medical lab analyst with 20 years of experience..."
│  6. Implement structured output prompts:
│     "Return JSON with these exact fields: biomarkers, disease, confidence"
│  7. Benchmark against baseline accuracy
│  8. Run: python scripts/test_evaluation_system.py (SKILL #14)
└─ Code Location: prompts in the invoke() methods of src/agents/*.py
Week 4: Days 16-20
SKILL #14: LLM Evaluation
├─ Duration: 4-5 hours
├─ Task: Benchmark LLM quality improvements
├─ Deliverable: Metrics dashboard showing the improvements
├─ Actions:
│  1. Read SKILL.md (evaluation metrics, benchmarking)
│  2. Create tests/evaluation_metrics.py with metrics:
│     - Accuracy (correct disease prediction)
│     - Precision (of biomarker extraction)
│     - Recall (of clinical recommendations)
│     - F1 score (biomarker identification)
│  3. Create a test dataset with 20 patient scenarios:
│     tests/fixtures/evaluation_patients.py
│  4. Benchmark Groq vs Gemini on accuracy, latency, cost
│  5. Create an evaluation report:
│     "Before optimization: 65% accuracy, 25s latency
│      After optimization: 80% accuracy, 18s latency"
│  6. Generate graphs/charts of improvements
└─ Code Location: tests/evaluation_metrics.py
SKILL #5: Tool/Function Calling Patterns
├─ Duration: 3-4 hours
├─ Task: Use function calling for reliable LLM outputs
├─ Deliverable: Structured output via function calling (not prompting)
├─ Actions:
│  1. Read SKILL.md (tool definition, structured returns)
│  2. Define tools for the extraction agent:
│     - extract_biomarkers(text: str) -> dict
│     - classify_severity(value: float, range: tuple) -> str
│     - assess_disease_risk(biomarkers: dict) -> dict
│  3. Modify the extraction service to use function calling:
│     instead of parsing JSON from text, call literal functions
│  4. Check the Groq free tier (it may not support function calling)
│     Alternative: use strict Pydantic output validation
│  5. Test: parsing should never fail; always return valid output
│  6. Error handling: if the LLM output has the wrong format, retry with function calling
└─ Code Location: api/app/services/extraction.py (MODIFIED)
SKILL #21: Python Error Handling
├─ Duration: 3-4 hours
├─ Task: Comprehensive error handling for production
├─ Deliverable: Custom exception hierarchy, graceful degradation
├─ Actions:
│  1. Read SKILL.md (exception patterns, logging, recovery)
│  2. Create src/exceptions.py with a hierarchy:
│     - RagBotException (base)
│     - BiomarkerValidationError
│     - LLMTimeoutError (with retry logic)
│     - VectorStoreError
│     - SchemaValidationError
│  3. Wrap agent calls with try-except:
│     try:
│         result = agent.invoke(input)
│     except LLMTimeoutError:
│         retry_with_smaller_context()
│     except BiomarkerValidationError:
│         return low_confidence_response()
│  4. Add telemetry: which exceptions are most common?
│  5. Write exception tests (10+ scenarios)
└─ Code Location: src/exceptions.py (NEW)
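The hierarchy from step 2 plus a small degradation wrapper can be sketched directly; the `run_with_fallback` helper is an illustrative name, not something from the codebase:

```python
class RagBotException(Exception):
    """Base class for all RagBot errors."""


class BiomarkerValidationError(RagBotException):
    pass


class LLMTimeoutError(RagBotException):
    pass


class VectorStoreError(RagBotException):
    pass


class SchemaValidationError(RagBotException):
    pass


def run_with_fallback(agent_call, fallback):
    """Degrade gracefully: expected failures return the fallback value
    instead of crashing the whole workflow. Unexpected exceptions still
    propagate, which is usually what you want in production."""
    try:
        return agent_call()
    except (LLMTimeoutError, BiomarkerValidationError) as exc:
        return fallback(exc)
```

Because every custom error derives from `RagBotException`, API-level handlers can catch the base class once and map it to a clean HTTP error.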
Week 5: Days 21-25
SKILL #27: Python Observability (First Use)
├─ Duration: 4-5 hours
├─ Task: Structured logging for debugging & monitoring
├─ Deliverable: JSON-formatted logs with context
├─ Actions:
│  1. Read SKILL.md (structured logging, correlation IDs)
│  2. Replace print() with logger calls:
│     logger.info("analyzing biomarkers", extra={
│         "biomarkers": {"glucose": 140},
│         "user_id": "user123",
│         "correlation_id": "req-abc123"
│     })
│  3. Add correlation IDs to track requests through agents
│  4. Structure logs as JSON (not text):
│     - timestamp
│     - level
│     - message
│     - context (user, request, agent)
│     - metrics (latency, tokens used)
│  5. Implement in all agents (src/agents/*)
│  6. Test: review the logs.jsonl output
└─ Code Location: src/observability.py (NEW)
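A minimal stdlib-only version of the JSON formatter described in step 4 — real setups often use a library such as structlog, and the exact field set here is an assumption:

```python
import json
import logging
import uuid


class JsonFormatter(logging.Formatter):
    """Emit each log record as one JSON line with optional context."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
            # `extra={...}` kwargs become attributes on the record
            "correlation_id": getattr(record, "correlation_id", None),
        }
        return json.dumps(entry)


logger = logging.getLogger("ragbot")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# One correlation ID per incoming request, threaded through every agent
correlation_id = f"req-{uuid.uuid4().hex[:8]}"
logger.info("analyzing biomarkers", extra={"correlation_id": correlation_id})
```

Because every line is valid JSON, the resulting logs.jsonl can be filtered by correlation ID with standard tooling (e.g. `jq`).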
SKILL #24: GitHub Actions Templates
├─ Duration: 2-3 hours
├─ Task: Set up a CI/CD pipeline
├─ Deliverable: .github/workflows/test.yml (auto-run tests on PR)
├─ Actions:
│  1. Read SKILL.md (GitHub Actions workflow syntax)
│  2. Create .github/workflows/test.yml:
│     name: Run Tests
│     on: [push, pull_request]
│     jobs:
│       test:
│         runs-on: ubuntu-latest
│         steps:
│           - uses: actions/checkout@v3
│           - uses: actions/setup-python@v4
│           - run: pip install -r requirements.txt
│           - run: pytest tests/ -v --cov src --cov-report=xml
│           - run: coverage report (fail if <90%)
│  3. Create .github/workflows/security.yml:
│     - Run OWASP checks
│     - Lint code
│     - Check dependencies for CVEs
│  4. Create .github/workflows/docker.yml:
│     - Build Docker image
│     - Push to registry (optional)
│  5. Test: create a PR and verify the workflows run
└─ Location: .github/workflows/
END OF PHASE 2 OUTCOMES:
✓ 90%+ test coverage achieved
✓ 50+ parametrized tests added
✓ Agent code refactored with design patterns
✓ LLM prompts optimized for medical accuracy
✓ Evaluation metrics show +15% accuracy improvement
✓ Function calling prevents JSON parsing failures
✓ Comprehensive error handling in place
✓ Structured JSON logging implemented
✓ CI/CD pipeline automated
────────────────────────────────────────────────────────────────────────────────
PHASE 3: RETRIEVAL OPTIMIZATION & KNOWLEDGE GRAPHS (Weeks 6-8)
────────────────────────────────────────────────────────────────────────────────
GOAL: Better medical knowledge retrieval + citations + knowledge graphs
Week 6: Days 26-30
SKILL #8: Hybrid Search Implementation
├─ Duration: 4-6 hours
├─ Task: Combine semantic + keyword search for better recall
├─ Deliverable: Hybrid retriever for RagBot (BM25 + FAISS)
├─ Actions:
│  1. Read SKILL.md (hybrid search architecture, reciprocal rank fusion)
│  2. Current state: only FAISS semantic search (misses rare diseases)
│  3. Add BM25 keyword search:
│     pip install rank-bm25
│  4. Create src/retrievers/hybrid_retriever.py:
│     class HybridRetriever:
│         def semantic_search(query, k=5)  # FAISS
│         def keyword_search(query, k=5)   # BM25
│         def hybrid_search(query)         # combine + rerank
│  5. Reranking (Reciprocal Rank Fusion):
│     score = 1/(k + rank_semantic) + 1/(k + rank_keyword)
│  6. Replace the old retriever in the disease_explainer agent:
│     old: retriever = faiss_retriever
│     new: retriever = hybrid_retriever
│  7. Benchmark: test retrieval quality on 10 disease cases
│  8. Test rare disease retrieval (uncommon biomarker combinations)
└─ Code Location: src/retrievers/hybrid_retriever.py (NEW)
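The Reciprocal Rank Fusion formula in step 5 generalizes to any number of rankers and is small enough to show in full. `k=60` is the constant commonly used in the RRF literature; the function name is illustrative:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs.

    Each document scores sum(1 / (k + rank_i)) over the rankings it
    appears in; documents ranked well by multiple retrievers win.
    Returns doc IDs sorted by fused score, best first.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

In the `HybridRetriever` sketch above, `hybrid_search` would call this with the FAISS and BM25 result lists and return the top fused IDs.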
SKILL #9: Chunking Strategy
├─ Duration: 4-5 hours
├─ Task: Optimize medical document chunking
├─ Deliverable: Improved chunks for better context
├─ Actions:
│  1. Read SKILL.md (chunking strategies, semantic boundaries)
│  2. Current: fixed 1000-char chunks (may split mid-sentence)
│  3. Implement intelligent chunking:
│     - Split by medical sections (diagnosis, treatment, etc.)
│     - Keep related content together
│     - Maintain a minimum of 500 chars (context) and a maximum of 2000 chars (context window)
│  4. Preserve medical structure:
│     - Disease headers stay with symptoms
│     - Labs stay with reference ranges
│     - Treatment options stay together
│  5. Create src/chunking_strategy.py:
│     def chunk_medical_pdf(pdf_text) -> List[Chunk]:
│         # Split by disease headers, maintain structure
│  6. Re-chunk medical_knowledge.faiss (2,861 chunks → how many?)
│  7. Re-embed with the new chunks
│  8. Benchmark: did document retrieval precision improve?
└─ Code Location: src/chunking_strategy.py (REFACTORED)
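One way to realize the header-aware, size-bounded splitting described above. The header list is a placeholder; real section names would come from the actual PDFs:

```python
import re

MIN_CHARS, MAX_CHARS = 500, 2000  # bounds from the roadmap


def chunk_medical_text(text, headers=("Diagnosis", "Treatment", "Symptoms")):
    """Split on section headers so related content stays together,
    then merge tiny sections and split oversized ones at sentence ends."""
    pattern = "|".join(re.escape(h) for h in headers)
    # Zero-width lookahead keeps the header with its section body
    sections = re.split(rf"(?=^(?:{pattern}))", text, flags=re.MULTILINE)
    chunks, buffer = [], ""
    for section in sections:
        if not section.strip():
            continue
        buffer += section
        if len(buffer) >= MIN_CHARS:
            while len(buffer) > MAX_CHARS:
                # Prefer cutting at a sentence boundary
                cut = buffer.rfind(". ", 0, MAX_CHARS) + 1 or MAX_CHARS
                chunks.append(buffer[:cut])
                buffer = buffer[cut:]
            chunks.append(buffer)
            buffer = ""
    if buffer:  # trailing undersized section
        chunks.append(buffer)
    return chunks
```

Compared with fixed 1000-char windows, this keeps a disease header attached to its symptoms and avoids mid-sentence cuts, at the cost of variable chunk sizes.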
SKILL #10: Embedding Pipeline Builder
├─ Duration: 3-4 hours
├─ Task: Optimize embeddings for medical terminology
├─ Deliverable: Better semantic search for medical terms
├─ Actions:
│  1. Read SKILL.md (embedding models, fine-tuning considerations)
│  2. Current: sentence-transformers/all-MiniLM-L6-v2 (generic)
│  3. Options for medical embeddings:
│     - all-MiniLM-L6-v2 (small, fast, baseline)
│     - all-mpnet-base-v2 (larger, better quality)
│     - Medical-specific: SciBERT or a biomedical sentence transformer (if available)
│  4. Benchmark embeddings on medical queries:
│     Query: "High glucose and elevated HbA1c"
│     Expected top result: diabetes diagnosis section
│  5. If switching models:
│     pip install the new model
│     re-embed all medical documents
│     save the new FAISS index
│  6. Measure: mean reciprocal rank (MRR) of the correct document
│  7. Update src/pdf_processor.py with the better embeddings
└─ Code Location: src/llm_config.py (MODIFIED)
SKILL #11: RAG Implementation
├─ Duration: 3-4 hours
├─ Task: Enforce citations in responses
├─ Deliverable: All claims backed by retrieved documents
├─ Actions:
│  1. Read SKILL.md (citation tracking, source attribution)
│  2. Modify the disease_explainer agent to track sources:
│     result = retriever.hybrid_search(query)
│     sources = [doc.metadata['source'] for doc in result]
│     # Keep track of which statements came from which docs
│  3. Update ResponseSynthesizerAgent to require citations:
│     every claim must be followed by [source: page N]
│  4. Add validation:
│     if not has_citations(response):
│         return "Insufficient evidence for this conclusion"
│  5. Modify the API response to include citations:
│     {
│       "disease": "Diabetes",
│       "evidence": [
│         {"claim": "High glucose", "source": "Clinical_Guidelines.pdf:p45"}
│       ]
│     }
│  6. Test: every response should have citations
└─ Code Location: src/agents/disease_explainer.py (MODIFIED)
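The `has_citations` check from step 4 can be implemented with a simple pattern match on the `[source: ...]` tag format shown in step 3. This sketch only requires one citation per response; a stricter version would check per sentence:

```python
import re

# Matches citation tags like [source: Clinical_Guidelines.pdf:p45]
CITATION_RE = re.compile(r"\[source:\s*[^\]]+\]")


def has_citations(response_text: str) -> bool:
    """True if the response carries at least one [source: ...] tag."""
    return bool(CITATION_RE.search(response_text))


def enforce_citations(response_text: str) -> str:
    """Refuse to surface uncited claims, per the validation rule above."""
    if not has_citations(response_text):
        return "Insufficient evidence for this conclusion"
    return response_text
```

This is the last line of defense; the primary mechanism is still prompting the synthesizer to attach sources as it writes.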
Week 7: Days 31-35
SKILL #12: Knowledge Graph Builder
├─ Duration: 6-8 hours
├─ Task: Extract and use knowledge graphs for relationships
├─ Deliverable: Biomarker → Disease → Treatment graph
├─ Actions:
│  1. Read SKILL.md (knowledge graphs, entity extraction, relationships)
│  2. Design the graph structure:
│     Nodes: biomarkers, diseases, treatments, symptoms
│     Edges: "elevated_glucose" -[indicates]-> "diabetes"
│            "diabetes" -[treated_by]-> "metformin"
│  3. Extract entities from medical PDFs:
│     use the LLM to identify (biomarker, disease, treatment) triples
│     store them in a graph structure (networkx for simplicity)
│  4. Build src/knowledge_graph.py:
│     class MedicalKnowledgeGraph:
│         def find_diseases_for_biomarker(biomarker) -> List[Disease]
│         def find_treatments_for_disease(disease) -> List[Treatment]
│         def shortest_path(biomarker, disease) -> List[Node]
│  5. Integrate with biomarker_analyzer:
│     instead of rule-based disease prediction, use knowledge graph paths
│  6. Test: the graph should have >100 nodes, >500 edges
│  7. Visualize: create graph.html (D3.js visualization)
└─ Code Location: src/knowledge_graph.py (NEW)
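The triple store and lookup methods from step 4 can be sketched with a plain adjacency list; the roadmap suggests networkx, which would add shortest-path queries for free. The sample triples match the roadmap's examples:

```python
from collections import defaultdict


class MedicalKnowledgeGraph:
    """Adjacency-list graph of (subject, relation, object) triples."""

    def __init__(self):
        self.edges = defaultdict(list)  # subject -> [(relation, object)]

    def add_triple(self, subject: str, relation: str, obj: str) -> None:
        self.edges[subject].append((relation, obj))

    def neighbors(self, node: str, relation: str):
        return [o for r, o in self.edges[node] if r == relation]

    def find_diseases_for_biomarker(self, biomarker: str):
        return self.neighbors(biomarker, "indicates")

    def find_treatments_for_disease(self, disease: str):
        return self.neighbors(disease, "treated_by")


# Triples from the roadmap's example; real ones come from LLM extraction
graph = MedicalKnowledgeGraph()
graph.add_triple("elevated_glucose", "indicates", "diabetes")
graph.add_triple("diabetes", "treated_by", "metformin")
```

Swapping this for `networkx.DiGraph` keeps the same triple interface while unlocking `shortest_path` and the D3.js export mentioned in step 7.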
SKILL #1: LangChain Architecture (Deep Dive)
├─ Duration: 3-4 hours
├─ Task: Advanced LangChain patterns for RAG
├─ Deliverable: More sophisticated agent chain design
├─ Actions:
│  1. Read SKILL.md (advanced chains, custom tools)
│  2. Add custom tools to agents:
│     @tool
│     def lookup_reference_range(biomarker: str) -> dict:
│         """Get the normal range for a biomarker"""
│         return config.biomarker_references[biomarker]
│  3. Create composite chains:
│     chain = (lookup_range_tool | linter | analyzer)
│  4. Implement memory for conversation context:
│     buffer = ConversationBufferMemory()
│     chain = RunnableWithMessageHistory(agent, buffer)
│  5. Add callbacks for observability:
│     .with_config(callbacks=[logger_callback])
│  6. Test chain composition & memory
└─ Code Location: src/agents/tools/ (NEW)
SKILL #28: Memory Management
├─ Duration: 3-4 hours
├─ Task: Optimize context window usage
├─ Deliverable: Fit more patient history without exceeding token limits
├─ Actions:
│  1. Read SKILL.md (context compression, memory hierarchies)
│  2. Implement sliding-window memory:
│     keep the last 5 messages (pruned conversation)
│     summarize older messages into facts
│  3. Add context compression:
│     "User mentioned: glucose 140, HbA1c 10" (compressed)
│     instead of the full raw conversation
│  4. Monitor token usage:
│     - Groq free tier: ~500 requests/month
│     - Each request: ~1-2K tokens on average
│  5. Optimize prompts to use fewer tokens:
│     remove verbose preamble
│     use shorthand for common terms
│  6. Goal: save 20-30% on token usage
└─ Code Location: src/memory_manager.py (NEW)
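The sliding-window-plus-summary scheme from steps 2-3 in sketch form. Real summarization would use an LLM call; here the "summary" is naive truncation purely to show where compression slots in:

```python
class SlidingWindowMemory:
    """Keep the last `window` messages verbatim; compress older ones
    into short fact strings so the prompt context stays small."""

    def __init__(self, window: int = 5):
        self.window = window
        self.messages = []   # recent, verbatim
        self.facts = []      # older, compressed

    def add(self, message: str) -> None:
        self.messages.append(message)
        while len(self.messages) > self.window:
            old = self.messages.pop(0)
            # Placeholder compression; a real system summarizes via LLM
            self.facts.append(old[:60])

    def context(self) -> str:
        """Prompt-ready context: compressed facts, then recent turns."""
        summary = "; ".join(self.facts)
        recent = "\n".join(self.messages)
        return (f"Known facts: {summary}\n" if summary else "") + recent
```

With a window of 5 and one-line fact summaries, a long consultation compresses to a bounded context regardless of conversation length.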
Week 8: Days 36-40
SKILL #15: Cost-Aware LLM Pipeline
├─ Duration: 4-5 hours
├─ Task: Optimize API costs (reduce Groq/Gemini usage)
├─ Deliverable: Model routing by task complexity
├─ Actions:
│  1. Read SKILL.md (cost estimation, model selection, caching)
│  2. Analyze current costs:
│     - Groq llama-3.3-70B: expensive for simple tasks
│     - Gemini free tier: rate-limited
│  3. Implement model routing:
│     simple task: route to a smaller model (if available) or the cache
│     complex task: use llama-3.3-70B
│  4. Example routing:
│     if task == "extract_biomarkers" and has_cache:
│         return cached_result
│     elif task == "complex_reasoning":
│         use_groq_70b()
│     else:
│         use_gemini_free()
│  5. Implement caching:
│     hash(query) → check cache → LLM → store result
│  6. Track costs:
│     log every API call with its cost
│     generate a monthly cost report
│  7. Target: 40% cost reduction
└─ Code Location: src/llm_config.py (MODIFIED)
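Steps 3-5 combine into one small router: hash the query, hit the cache first, otherwise pick a model by task complexity. The model names are placeholders echoing the roadmap, and `call_llm` stands in for the real client:

```python
import hashlib


class CachedRouter:
    """Route by task complexity and cache by query hash."""

    def __init__(self):
        self.cache = {}

    def _key(self, query: str) -> str:
        return hashlib.sha256(query.encode()).hexdigest()

    def route(self, task: str, query: str, call_llm):
        key = self._key(query)
        if key in self.cache:               # cache hit: zero API cost
            return self.cache[key]
        # Placeholder model names; real routing reads from config
        model = "llama-3.3-70b" if task == "complex_reasoning" else "small-model"
        result = call_llm(model, query)
        self.cache[key] = result            # store for next time
        return result
```

Repeated biomarker-extraction queries (the common case) then cost one LLM call total, which is where most of the targeted ~40% saving comes from.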
END OF PHASE 3 OUTCOMES:
✓ Hybrid search implemented (semantic + keyword)
✓ Medical chunking improves knowledge quality
✓ Embeddings optimized for medical terminology
✓ Citation enforcement in all RAG outputs
✓ Knowledge graph built from medical PDFs
✓ LangChain advanced patterns implemented
✓ Context window optimization reduces token waste
✓ Model routing cuts API costs by ~40%
✓ Better disease prediction via knowledge graphs
────────────────────────────────────────────────────────────────────────────────
PHASE 4: DEPLOYMENT, MONITORING & SCALING (Weeks 9-12)
────────────────────────────────────────────────────────────────────────────────
GOAL: Production-ready system with monitoring, docs, and deployment
Week 9: Days 41-45
SKILL #25: FastAPI Templates
├─ Duration: 3-4 hours
├─ Task: Production-grade FastAPI configuration
├─ Deliverable: Optimized FastAPI settings and middleware
├─ Actions:
│  1. Read SKILL.md (async patterns, dependency injection, middleware)
│  2. Apply async best practices:
│     - All endpoints async def
│     - Use asyncio for parallel agent calls
│     - Remove any sync blocking calls
│  3. Add a middleware chain:
│     - CORS middleware (for a web frontend)
│     - Request logging (correlation IDs)
│     - Error handling
│     - Rate limiting
│     - Auth
│  4. Optimize configuration:
│     - Connection pooling for databases
│     - Caching headers (HTTP)
│     - Compression (gzip)
│  5. Add health checks:
│     /health - basic health check
│     /health/deep - check dependencies (FAISS, LLM)
│  6. Test: load testing with async clients
└─ Code Location: api/app/main.py (REFACTORED)
SKILL #29: API Docs Generator
├─ Duration: 2-3 hours
├─ Task: Auto-generate OpenAPI spec + interactive docs
├─ Deliverable: /docs (Swagger UI) + /redoc (ReDoc)
├─ Actions:
│  1. Read SKILL.md (OpenAPI, Swagger UI, ReDoc)
│  2. FastAPI auto-generates OpenAPI from endpoints
│  3. Enhance the documentation:
│     add detailed descriptions to each endpoint
│     add example responses
│     add error codes
│  4. Example:
│     @app.post("/api/v1/analyze/structured")
│     async def analyze_structured(request: AnalysisRequest):
│         """
│         Analyze biomarkers (structured input)
│
│         - biomarkers: dict of biomarker names → values
│         - response: full analysis with disease prediction
│
│         Example:
│         {"biomarkers": {"glucose": 140, "HbA1c": 10}}
│         """
│  5. Auto-docs available at:
│     http://localhost:8000/docs
│     http://localhost:8000/redoc
│  6. Generate the OpenAPI JSON:
│     http://localhost:8000/openapi.json
│  7. Create client SDKs (optional):
│     OpenAPI Generator → Python, JS, Go clients
└─ Docs auto-generated from code
SKILL #30: GitHub PR Review Workflow
├─ Duration: 2-3 hours
├─ Task: Establish code review standards
├─ Deliverable: CODEOWNERS, PR templates, branch protection
├─ Actions:
│  1. Read SKILL.md (PR templates, CODEOWNERS, review process)
│  2. Create .github/CODEOWNERS:
│     # Security reviews required for:
│     /api/app/middleware/ @security-team
│     # Testing reviews required for:
│     /tests/ @qa-team
│  3. Create .github/pull_request_template.md:
│     ## Description
│     ## Type of change
│     ## Tests added
│     ## Checklist
│     ## Related issues
│  4. Configure branch protection:
│     - Require 1 approval before merge
│     - Require status checks to pass (tests, lint)
│     - Require an up-to-date branch
│  5. Create CONTRIBUTING.md with guidelines
└─ Location: .github/
Week 10: Days 46-50
SKILL #27: Python Observability (Advanced)
├─ Duration: 4-5 hours
├─ Task: Metrics collection + monitoring dashboard
├─ Deliverable: Key metrics tracked (latency, accuracy, errors)
├─ Actions:
│  1. Read SKILL.md (metrics, histograms, summaries)
│  2. Add Prometheus metrics:
│     pip install prometheus-client
│  3. Track key metrics:
│     - request_latency_ms (histogram)
│     - disease_prediction_accuracy (gauge)
│     - llm_api_calls_total (counter)
│     - error_rate (gauge)
│     - citations_found_rate (gauge)
│  4. Add timing to all agents:
│     with timer("biomarker_analyzer"):
│         result = analyzer.invoke(input)
│  5. Expose metrics at /metrics
│  6. Integrate with monitoring (optional):
│     send to Prometheus → Grafana dashboard
│  7. Alerts:
│     if latency > 25s: alert
│     if accuracy < 75%: alert
│     if error rate > 5%: alert
└─ Code Location: src/monitoring/ (NEW)
SKILL #23: Code Review Excellence
├─ Duration: 2-3 hours
├─ Task: Review and improve code quality
├─ Deliverable: Code quality assessment report
├─ Actions:
│  1. Read SKILL.md (code review patterns, common issues)
│  2. Self-review all Phase 1-3 changes:
│     - Are functions <20 lines? (if not, break them up)
│     - Are variable names clear? (rename if not)
│     - Are error cases handled? (if not, add handling)
│     - Are tests present? (required: >90% coverage)
│  3. Medical code patterns to enforce:
│     - Never assume biomarker values are valid
│     - Always include units (mg/dL, etc.)
│     - Always cite medical literature
│     - Never hardcode disease thresholds
│  4. Create REVIEW_GUIDELINES.md
│  5. Review agent implementations:
│     check for typos, unclear logic, missing docstrings
└─ Code Location: docs/REVIEW_GUIDELINES.md (NEW)
SKILL #31: CI-CD Best Practices
├─ Duration: 3-4 hours
├─ Task: Enhance CI/CD with deployment
├─ Deliverable: Automated deployment pipeline
├─ Actions:
│  1. Read SKILL.md (deployment strategies, environments)
│  2. Add a deployment workflow, .github/workflows/deploy.yml:
│     - Build Docker image
│     - Push to registry
│     - Deploy to staging
│     - Run smoke tests
│     - Manual approval for production
│     - Deploy to production
│  3. Environment management:
│     - .env.development (localhost)
│     - .env.staging (staging server)
│     - .env.production (prod server)
│  4. Deployment strategy:
│     canary: deploy to 10% of traffic first
│     monitor for errors
│     if OK, deploy to 100%
│     if errors, roll back
│  5. Docker configuration:
│     multi-stage build for smaller images
│     security: non-root user, minimal base image
│  6. Test the deployment locally:
│     docker build -t ragbot .
│     docker run -p 8000:8000 ragbot
└─ Location: .github/workflows/deploy.yml (NEW)
SKILL #32: Frontend Accessibility (if building a web frontend)
├─ Duration: 2-3 hours (optional; skip if CLI only)
├─ Task: Accessibility standards for the web interface
├─ Deliverable: WCAG 2.1 AA compliant UI
├─ Actions:
│  1. Read SKILL.md (a11y, screen readers, keyboard nav)
│  2. If building a React frontend for medical results:
│     - All buttons keyboard accessible
│     - Screen reader labels on medical data
│     - High contrast for readability
│     - Clear error messages
│  3. Test with a screen reader (NVDA or JAWS)
└─ Code Location: examples/web_interface/ (if needed)
Week 11: Days 51-55
SKILL #6: LLM Application Dev with LangChain
├─ Duration: 4-5 hours
├─ Task: Production LangChain patterns
├─ Deliverable: Robust, maintainable agent code
├─ Actions:
│  1. Read SKILL.md (production patterns, error handling, logging)
│  2. Implement the agent lifecycle:
│     - Setup (load models, prepare context)
│     - Execution (with retries)
│     - Cleanup (save state, log metrics)
│  3. Add retry logic for LLM calls:
│     @retry(max_attempts=3, backoff=exponential)
│     def invoke_agent(self, input):
│         return self.llm.predict(...)
│  4. Add graceful degradation:
│     if the LLM fails, return a cached result
│     if the vector store fails, return a rule-based result
│  5. Implement agent composition:
│     multi-step workflows where agents call other agents
│  6. Goal: 99.99% uptime in staging
└─ Code Location: src/agents/base_agent.py (REFINED)
SKILL #33: Webhook Receiver Hardener
├─ Duration: 2-3 hours
├─ Task: Secure webhook handling (for integrations)
├─ Deliverable: Webhook endpoint with signature verification
├─ Actions:
│  1. Read SKILL.md (signature verification, replay protection)
│  2. If accepting webhooks from external systems:
│     - Verify the HMAC signature
│     - Check the timestamp (prevent replay attacks)
│     - Handle idempotency keys
│  3. Example: an EHR system sends patient updates
│     POST /webhooks/patient-update
│     verify: X-Webhook-Signature header
│     prevent: the same update being processed twice
│  4. Create api/app/webhooks/ (NEW if needed)
│  5. Test webhook security scenarios
└─ Code Location: api/app/webhooks/ (OPTIONAL)
Week 12: Days 56-60
SKILL #7: RAG Agent Builder
├─ Duration: 4-5 hours
├─ Task: Full RAG agent architecture review
├─ Deliverable: Production-ready RAG agents
├─ Actions:
│  1. Read SKILL.md (RAG agent design, retrieval QA chains)
│  2. Comprehensive RAG review:
│     - Retriever quality (hybrid search, ranking)
│     - Prompt quality (citations, evidence)
│     - Response quality (accurate, safe)
│  3. Disease Explainer Agent refactor:
│     Step 1: retrieve relevant medical documents
│     Step 2: extract key evidence from the docs
│     Step 3: synthesize an explanation with citations
│     Step 4: assess confidence (high/medium/low)
│  4. Test: all responses have citations
│  5. Test: no medical hallucinations
│  6. Benchmark: accuracy, latency, cost
└─ Code Location: src/agents/ (FINAL REVIEW)
Final Week Integration (Days 56-60):
SKILL #2: Workflow Orchestration (Refinement)
├─ Final review of the entire workflow
├─ Ensure all agents work together
└─ Test end-to-end: CLI and API
Comprehensive Testing:
├─ Functional tests: all features work
├─ Security tests: no vulnerabilities
├─ Performance tests: <20s latency
└─ Load tests: handle 10 concurrent requests
Documentation:
├─ Update the README with new features
├─ Document the API at /docs
├─ Create a deployment guide
└─ Create a troubleshooting guide
Production Deployment:
├─ Stage: test in a real environment
├─ Canary: 10% of traffic
├─ Monitor: errors, latency, accuracy
└─ Full deployment: 100% of traffic
END OF PHASE 4 OUTCOMES:
✓ FastAPI optimized for production
✓ API documentation auto-generated
✓ Code review standards established
✓ Full observability (logging, metrics)
✓ CI/CD with automated deployment
✓ Security best practices implemented
✓ Production-ready RAG agents
✓ System deployed and monitored
────────────────────────────────────────────────────────────────────────────────
IMPLEMENTATION SUMMARY
────────────────────────────────────────────────────────────────────────────────
SKILLS USED IN ORDER:
Phase 1 (Security + Fixes):   2, 3, 4, 16, 17, 18, 19, 20, 22
Phase 2 (Testing + Agents):   22, 26, 4, 13, 14, 5, 21, 27, 24
Phase 3 (Retrieval + Graphs): 8, 9, 10, 11, 12, 1, 28, 15
Phase 4 (Production):         25, 29, 30, 27, 23, 31, 32(*), 6, 33(*), 7
(*) Optional based on needs
TOTAL IMPLEMENTATION TIME:
Phase 1: ~30-40 hours
Phase 2: ~35-45 hours
Phase 3: ~30-40 hours
Phase 4: ~30-40 hours
─────────────────────
TOTAL: ~130-160 hours over 12 weeks (~10-12 hours/week)
EXPECTED OUTCOMES:
Metrics:
Test Coverage: 70% → 90%+
Response Latency: 25s → 15-20s (-30%)
Accuracy: 65% → 80% (+15-20%)
API Costs: -40% via optimization
Citations: 0% → 100%
Quality:
✓ OWASP compliant
✓ HIPAA aligned
✓ Production-ready
✓ Enterprise monitoring
✓ Automated deployments
System Capabilities:
✓ Hybrid semantic + keyword search
✓ Knowledge graphs for reasoning
✓ Cost-optimized LLM routing
✓ Full citation enforcement
✓ Advanced observability
────────────────────────────────────────────────────────────────────────────────
WEEKLY CHECKLIST
────────────────────────────────────────────────────────────────────────────────
Each week, verify:
☐ Code committed with clear commit messages
☐ Tests pass locally: pytest -v --cov
☐ Coverage >85% on any new code
☐ PR created with documentation
☐ Code reviewed (self or team)
☐ No security warnings
☐ Documentation updated
☐ Metrics tracked (custom dashboard)
☐ No breaking changes to API
────────────────────────────────────────────────────────────────────────────────
DONE! Your 4-month implementation plan is ready.
Start with Phase 1 Week 1. Execute systematically. Measure progress weekly. Celebrate wins!
Your RagBot will be enterprise-grade.