════════════════════════════════════════════════════════════════════════════════
RAGBOT 4-MONTH IMPLEMENTATION ROADMAP - ALL 34 SKILLS
Systematic, Phased Approach to Enterprise-Grade AI
════════════════════════════════════════════════════════════════════════════════

IMPLEMENTATION PHILOSOPHY
────────────────────────────────────────────────────────────────────────────────
• Fix critical issues first (security, state management, schema)
• Build tests concurrently (every feature gets tests immediately)
• Deploy incrementally (working code at each phase)
• Measure continuously (metrics drive priorities)
• Document along the way (knowledge preservation)

PROJECT BASELINE
────────────────────────────────────────────────────────────────────────────────
Current Status:
• 83+ passing tests (~70% coverage)
• 6 specialist agents (Biomarker Analyzer, Disease Explainer, etc.)
• FastAPI REST API + CLI interface
• FAISS vector store (750+ pages of medical knowledge)
• 2,861 medical knowledge chunks

Critical Issues to Fix:
1. biomarker_flags & safety_alerts not propagating through the workflow
2. Schema mismatch between workflow output & API formatter
3. Prediction confidence forced to 0.5 (dangerous for the medical domain)
4. Inconsistent biomarker naming (API vs CLI)
5. JSON parsing breaks on malformed LLM output
6. No citation enforcement in RAG outputs

Success Metrics:
• Test coverage: 70% → 90%+
• Response latency: 25s → 15-20s
• Prediction accuracy: +15-20%
• API costs: -40% (Groq free tier optimization)
• Security: OWASP compliant, HIPAA aligned
────────────────────────────────────────────────────────────────────────────────
PHASE 1: FOUNDATION & CRITICAL FIXES (Weeks 1-2)
────────────────────────────────────────────────────────────────────────────────
GOAL: Security baseline + fix state propagation + unify schemas

Week 1: Days 1-5

SKILL #18: OWASP Security Check
├─ Duration: 2-3 hours
├─ Task: Run comprehensive security audit
├─ Deliverable: Security issues list, prioritized fixes
├─ Actions:
│   1. Read SKILL.md documentation
│   2. Run vulnerability scanner on /api and /src
│   3. Document findings in SECURITY_AUDIT.md
│   4. Create tickets for each finding
└─ Outcome: Clear understanding of security gaps

SKILL #17: API Security Hardening
├─ Duration: 4-6 hours
├─ Task: Implement authentication & hardening
├─ Deliverable: JWT auth on the /api/v1/analyze endpoint
├─ Actions:
│   1. Read SKILL.md (auth patterns, CORS, headers)
│   2. Add JWT middleware to api/main.py
│   3. Update routes with a @require_auth decorator
│   4. Add security headers (HSTS, CSP, X-Frame-Options)
│   5. Write tests for auth (SKILL #22: Python Testing Patterns)
│   6. Update docs with the API key requirement
└─ Code Location: api/app/middleware/auth.py (NEW)
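The middleware itself should use a real JWT library (e.g. PyJWT) behind FastAPI; as a dependency-free illustration of the sign/verify flow it builds on, here is a minimal HMAC-signed-token sketch. All names (`issue_token`, `verify_token`, `SECRET`) are hypothetical, not part of the existing codebase:

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional

SECRET = b"change-me"  # in production, load from the environment, never hardcode


def _b64(data: bytes) -> str:
    # URL-safe base64 without padding, as used in compact token formats
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def issue_token(user_id: str, ttl_seconds: int = 3600) -> str:
    """Create a compact HMAC-signed token of the form payload.signature."""
    payload = _b64(json.dumps({"sub": user_id, "exp": time.time() + ttl_seconds}).encode())
    sig = _b64(hmac.new(SECRET, payload.encode(), hashlib.sha256).digest())
    return f"{payload}.{sig}"


def verify_token(token: str) -> Optional[dict]:
    """Return the claims if the signature is valid and unexpired, else None."""
    try:
        payload, sig = token.split(".")
    except ValueError:
        return None
    expected = _b64(hmac.new(SECRET, payload.encode(), hashlib.sha256).digest())
    if not hmac.compare_digest(sig, expected):
        return None
    claims = json.loads(base64.urlsafe_b64decode(payload + "=" * (-len(payload) % 4)))
    return claims if claims["exp"] > time.time() else None
```

A `@require_auth` decorator would then just call `verify_token` on the `Authorization` header and raise a 401 on `None`.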
SKILL #22: Python Testing Patterns (First Use)
├─ Duration: 2-3 hours
├─ Task: Create testing infrastructure & auth tests
├─ Deliverable: tests/test_api_auth.py with 10+ tests
├─ Actions:
│   1. Read SKILL.md (fixtures, mocking, parametrization)
│   2. Create conftest.py with auth fixtures
│   3. Write tests for JWT generation, validation, failure cases
│   4. Implement pytest fixtures for an authenticated client
│   5. Run: pytest tests/test_api_auth.py -v
└─ Outcome: 80% test coverage on the auth module

SKILL #2: Workflow Orchestration Patterns
├─ Duration: 4-6 hours
├─ Task: Fix state propagation in the LangGraph workflow
├─ Deliverable: biomarker_flags & safety_alerts propagate end-to-end
├─ Actions:
│   1. Read SKILL.md (LangGraph state management, parallel execution)
│   2. Review src/state.py current structure
│   3. Identify missing state fields in GuildState
│   4. Refactor agents to return complete state:
│      - src/agents/biomarker_analyzer.py → return biomarker_flags
│      - src/agents/biomarker_analyzer.py → return safety_alerts
│      - src/agents/confidence_assessor.py → update state
│   5. Test with: python -c "from src.workflow import create_guild..."
│   6. Write integration tests (SKILL #22)
└─ Code Changes: src/state.py, src/agents/*.py

SKILL #16: AI Wrapper/Structured Output
├─ Duration: 3-5 hours
├─ Task: Unify workflow → API response schema
├─ Deliverable: Single canonical response format (Pydantic model)
├─ Actions:
│   1. Read SKILL.md (structured outputs, Pydantic, validation)
│   2. Create api/app/models/response.py with the unified schema
│   3. Define BaseAnalysisResponse with all required fields
│   4. Update api/app/services/ragbot.py to use the unified schema
│   5. Ensure ResponseSynthesizerAgent outputs match the schema
│   6. Add Pydantic validation in all endpoints
│   7. Run: pytest tests/test_response_schema.py -v
└─ Code Location: api/app/models/response.py (REFACTORED)
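The roadmap specifies a Pydantic model; the same shape can be sketched with stdlib dataclasses to show the fields the unified schema needs to carry. The names `BaseAnalysisResponse` and `Evidence` follow the plan above, but the exact field set here is an assumption:

```python
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class Evidence:
    claim: str
    source: str  # e.g. "Clinical_Guidelines.pdf:p45"


@dataclass
class BaseAnalysisResponse:
    """One canonical response shape shared by API, CLI, and workflow output."""
    disease: str
    confidence_score: float  # 0.0-1.0, computed, never a hardcoded default
    biomarker_flags: Dict[str, str] = field(default_factory=dict)
    safety_alerts: List[str] = field(default_factory=list)
    evidence: List[Evidence] = field(default_factory=list)

    def __post_init__(self) -> None:
        if not 0.0 <= self.confidence_score <= 1.0:
            raise ValueError("confidence_score must be in [0, 1]")

    def to_dict(self) -> dict:
        # asdict recurses into nested dataclasses, so Evidence serializes too
        return asdict(self)
```

In the real codebase the Pydantic version additionally gives free request/response validation in every FastAPI endpoint.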
Week 2: Days 6-10

SKILL #3: Multi-Agent Orchestration
├─ Duration: 3-4 hours
├─ Task: Fix deterministic execution of parallel agents
├─ Deliverable: Agents execute without race conditions
├─ Actions:
│   1. Read SKILL.md (agent coordination, deterministic scheduling)
│   2. Review src/workflow.py parallel execution
│   3. Ensure explicit state passing between agents:
│      - Biomarker Analyzer outputs → Disease Explainer inputs
│      - Sequential where needed (Analyzer before Linker)
│      - Parallel where safe (Explainer & Guidelines)
│   4. Add logging to track execution order
│   5. Run 10 times: python scripts/test_chat_demo.py (same output each time)
└─ Outcome: Deterministic workflow execution

SKILL #19: LLM Security
├─ Duration: 3-4 hours
├─ Task: Prevent LLM-specific attacks
├─ Deliverable: Input validation against prompt injection
├─ Actions:
│   1. Read SKILL.md (prompt injection, token limit attacks)
│   2. Add input sanitization in api/app/services/extraction.py
│   3. Implement prompt injection detection:
│      - Check for "ignore instructions" patterns
│      - Limit biomarker input length
│      - Escape special characters
│   4. Add rate limiting per user (SKILL #20)
│   5. Write security tests
└─ Code Location: api/app/middleware/input_validation.py (NEW)
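A minimal sketch of what the validation step could look like. The patterns and the 2,000-character limit are illustrative placeholders, not a complete defense; a real filter would be broader and regularly updated:

```python
import re

# Hypothetical deny-list; real deployments need a maintained pattern set.
INJECTION_PATTERNS = [
    re.compile(r"ignore\s+(all\s+)?(previous|prior|above)\s+instructions", re.I),
    re.compile(r"you\s+are\s+now\s+", re.I),
    re.compile(r"system\s*prompt", re.I),
]
MAX_INPUT_CHARS = 2000  # placeholder limit for biomarker input


def validate_user_input(text: str) -> str:
    """Reject oversized or suspicious input; escape template braces otherwise."""
    if len(text) > MAX_INPUT_CHARS:
        raise ValueError("input too long")
    for pattern in INJECTION_PATTERNS:
        if pattern.search(text):
            raise ValueError("possible prompt injection detected")
    # Escape braces so user text cannot alter str.format-style prompt templates
    return text.replace("{", "{{").replace("}", "}}")
```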
SKILL #20: API Rate Limiting
├─ Duration: 2-3 hours
├─ Task: Implement tiered rate limiting
├─ Deliverable: /api/v1/analyze limited to 10/min free, 1000/min pro
├─ Actions:
│   1. Read SKILL.md (token bucket, sliding window algorithms)
│   2. Add the python-ratelimit library
│   3. Add rate limiter middleware to api/main.py
│   4. Implement tiered limits (free/pro based on API key)
│   5. Return 429 with a Retry-After header
│   6. Test rate limiting behavior
└─ Code Location: api/app/middleware/rate_limiter.py (NEW)
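The tiered limiter can be sketched as a per-API-key token bucket. The tier limits follow the deliverable above; class and function names are hypothetical:

```python
import time

TIER_LIMITS = {"free": 10, "pro": 1000}  # requests per minute, per the deliverable


class TokenBucket:
    def __init__(self, rate_per_minute: int):
        self.capacity = rate_per_minute
        self.tokens = float(rate_per_minute)
        self.refill_rate = rate_per_minute / 60.0  # tokens per second
        self.last = time.monotonic()

    def allow(self) -> bool:
        # Refill proportionally to elapsed time, capped at capacity
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


_buckets: dict = {}


def check_rate_limit(api_key: str, tier: str) -> bool:
    """True if the request may proceed; False means respond 429 with Retry-After."""
    bucket = _buckets.setdefault(api_key, TokenBucket(TIER_LIMITS[tier]))
    return bucket.allow()
```

Middleware would call `check_rate_limit` per request and, on `False`, return 429 with a `Retry-After` derived from the refill rate.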
END OF PHASE 1 OUTCOMES:
✓ Security audit complete with fixes prioritized
✓ JWT authentication on REST API
✓ biomarker_flags & safety_alerts propagating through workflow
✓ Unified response schema (API & CLI use same format)
✓ LLM prompt injection protection
✓ Rate limiting in place
✓ Auth + security tests written (15+ new tests)
✓ Coverage increased to ~75%
────────────────────────────────────────────────────────────────────────────────
PHASE 2: TEST EXPANSION & AGENT OPTIMIZATION (Weeks 3-5)
────────────────────────────────────────────────────────────────────────────────
GOAL: 90%+ test coverage + improved agent decision logic + prompt optimization

Week 3: Days 11-15

SKILL #22: Python Testing Patterns (Advanced Use)
├─ Duration: 8-10 hours (this is the main focus)
├─ Task: Parametrized testing for biomarker combinations
├─ Deliverable: 50+ new parametrized tests
├─ Actions:
│   1. Read SKILL.md sections on parametrization & fixtures
│   2. Create tests/fixtures/biomarkers.py with test data:
│      - Normal values tuple
│      - Diabetes indicators tuple
│      - Mixed abnormal values tuple
│      - Edge cases tuple
│   3. Write a parametrized test for each biomarker combination:
│      @pytest.mark.parametrize("biomarkers,expected_disease", [...])
│      def test_disease_prediction(biomarkers, expected_disease):
│          assert predict_disease(biomarkers) == expected_disease
│   4. Create mocking fixtures for LLM calls:
│      @pytest.fixture
│      def mock_groq_client(monkeypatch):
│          # Mock all LLM interactions
│   5. Test agent outputs:
│      - Biomarker Analyzer with 10 scenarios
│      - Disease Explainer with 5 diseases
│      - Confidence Assessor with low/medium/high confidence cases
│   6. Run: pytest tests/ -v --cov=src --cov-report=html
│   7. Goal: 90%+ coverage on agents/
└─ Code Location: tests/test_parametrized_*.py
SKILL #26: Python Design Patterns
├─ Duration: 4-5 hours
├─ Task: Refactor agent implementations with design patterns
├─ Deliverable: Cleaner, more maintainable agent code
├─ Actions:
│   1. Read SKILL.md (SOLID, composition, factory patterns)
│   2. Identify code smells in src/agents/
│   3. Extract common agent logic to a BaseAgent class:
│      class BaseAgent:
│          def invoke(self, input_data) -> AgentOutput
│          def validate_inputs(self)
│          def log_execution(self)
│   4. Use composition over inheritance:
│      - Each agent has an optional retriever, validator, cache
│      - Reduce coupling between agents
│   5. Implement the Factory pattern for agent creation:
│      AgentFactory.create("biomarker_analyzer")
│   6. Refactor tests to use the new pattern
└─ Code Location: src/agents/base_agent.py (NEW)
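One possible shape for the base class and factory from steps 3-5. The `BiomarkerAnalyzerAgent` body and its `> 100` cutoff are placeholders for illustration only; real thresholds must come from configuration, never be hardcoded:

```python
from abc import ABC, abstractmethod


class BaseAgent(ABC):
    """Shared lifecycle: validate, execute, log. Subclasses implement _run()."""

    name = "base"

    def invoke(self, input_data: dict) -> dict:
        self.validate_inputs(input_data)
        result = self._run(input_data)
        self.log_execution(input_data, result)
        return result

    def validate_inputs(self, input_data: dict) -> None:
        if not isinstance(input_data, dict):
            raise TypeError("agent input must be a dict")

    def log_execution(self, input_data: dict, result: dict) -> None:
        pass  # hook for structured logging (see SKILL #27)

    @abstractmethod
    def _run(self, input_data: dict) -> dict: ...


class BiomarkerAnalyzerAgent(BaseAgent):
    name = "biomarker_analyzer"

    def _run(self, input_data: dict) -> dict:
        # Placeholder logic; the real agent uses configured reference ranges
        flags = {k: "HIGH" for k, v in input_data.get("biomarkers", {}).items()
                 if v > 100}
        return {"biomarker_flags": flags}


class AgentFactory:
    _registry = {"biomarker_analyzer": BiomarkerAnalyzerAgent}

    @classmethod
    def create(cls, name: str) -> BaseAgent:
        return cls._registry[name]()
```

The registry keeps construction in one place, so swapping an agent implementation never touches the workflow code.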
SKILL #4: Agentic Development
├─ Duration: 3-4 hours
├─ Task: Improve agent decision logic
├─ Deliverable: Better biomarker analysis confidence scores
├─ Actions:
│   1. Read SKILL.md (planning, reasoning, decision making)
│   2. Add a confidence threshold in BiomarkerAnalyzerAgent
│   3. Instead of returning all results:
│      - Only return HIGH confidence matches
│      - Flag LOW confidence for manual review
│      - Add a reasoning trace (why this conclusion)
│   4. Update the response format with:
│      - confidence_score (0-1)
│      - evidence_count (# of sources)
│      - alternative_hypotheses (if low confidence)
│   5. Update tests
└─ Code Location: src/agents/biomarker_analyzer.py (MODIFIED)

SKILL #13: Senior Prompt Engineer (First Use)
├─ Duration: 5-6 hours
├─ Task: Optimize prompts for medical accuracy
├─ Deliverable: Updated agent prompts with better accuracy
├─ Actions:
│   1. Read SKILL.md (prompt patterns, few-shot, CoT)
│   2. Audit current agent prompts in src/agents/*.py
│   3. Apply few-shot learning to the extraction agent:
│      - Add 3 examples of correct biomarker extraction
│      - Show the expected format
│      - Show handling of ambiguous inputs
│   4. Add chain-of-thought reasoning:
│      "First identify the biomarkers mentioned. Then look up their ranges.
│       Then determine if abnormal. Then assess severity."
│   5. Add role prompting:
│      "You are an expert medical lab analyst with 20 years of experience..."
│   6. Implement structured output prompts:
│      "Return JSON with these exact fields: biomarkers, disease, confidence"
│   7. Benchmark against baseline accuracy
│   8. Run: python scripts/test_evaluation_system.py (SKILL #14)
└─ Code Location: src/agents/*/invoke() prompts
Week 4: Days 16-20

SKILL #14: LLM Evaluation
├─ Duration: 4-5 hours
├─ Task: Benchmark LLM quality improvements
├─ Deliverable: Metrics dashboard quantifying the improvements
├─ Actions:
│   1. Read SKILL.md (evaluation metrics, benchmarking)
│   2. Create tests/evaluation_metrics.py with metrics:
│      - Accuracy (correct disease prediction)
│      - Precision (of biomarker extraction)
│      - Recall (of clinical recommendations)
│      - F1 score (biomarker identification)
│   3. Create a test dataset with 20 patient scenarios:
│      tests/fixtures/evaluation_patients.py
│   4. Benchmark Groq vs Gemini on accuracy, latency, cost
│   5. Create an evaluation report:
│      "Before optimization: 65% accuracy, 25s latency
│       After optimization: 80% accuracy, 18s latency"
│   6. Generate graphs/charts of improvements
└─ Code Location: tests/evaluation_metrics.py
SKILL #5: Tool/Function Calling Patterns
├─ Duration: 3-4 hours
├─ Task: Use function calling for reliable LLM outputs
├─ Deliverable: Structured output via function calling (not prompting)
├─ Actions:
│   1. Read SKILL.md (tool definition, structured returns)
│   2. Define tools for the extraction agent:
│      - extract_biomarkers(text: str) -> dict
│      - classify_severity(value: float, range: tuple) -> str
│      - assess_disease_risk(biomarkers: dict) -> dict
│   3. Modify the extraction service to use function calling:
│      instead of parsing JSON out of free text, call declared functions
│   4. Groq free tier check (may not support function calling)
│      Alternative: use strict Pydantic output validation
│   5. Test: parsing should never fail, always return valid output
│   6. Error handling: if the LLM output has the wrong format, retry with
│      function calling
└─ Code Location: api/app/services/extraction.py (MODIFIED)
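The fallback in step 4, strict validation plus a retry, might look like the sketch below. `call_llm` stands in for whatever client the extraction service actually uses, and the required-field set matches the structured-output prompt from SKILL #13:

```python
import json

REQUIRED_FIELDS = {"biomarkers", "disease", "confidence"}


def parse_llm_output(raw: str) -> dict:
    """Parse and validate LLM JSON; raise ValueError on any deviation."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object found")
    data = json.loads(raw[start:end + 1])
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return data


def extract_with_retry(call_llm, prompt: str, max_attempts: int = 3) -> dict:
    """Re-prompt with an explicit format reminder whenever validation fails."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return parse_llm_output(call_llm(prompt))
        except (ValueError, json.JSONDecodeError) as exc:
            last_error = exc
            prompt += ("\nReturn ONLY valid JSON with fields: "
                       "biomarkers, disease, confidence.")
    raise RuntimeError(f"extraction failed after {max_attempts} attempts: {last_error}")
```

With native function calling the validation moves into the API layer; this sketch is the strict-validation alternative for providers without it.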
SKILL #21: Python Error Handling
├─ Duration: 3-4 hours
├─ Task: Comprehensive error handling for production
├─ Deliverable: Custom exception hierarchy, graceful degradation
├─ Actions:
│   1. Read SKILL.md (exception patterns, logging, recovery)
│   2. Create src/exceptions.py with a hierarchy:
│      - RagBotException (base)
│      - BiomarkerValidationError
│      - LLMTimeoutError (with retry logic)
│      - VectorStoreError
│      - SchemaValidationError
│   3. Wrap agent calls with try-except:
│      try:
│          result = agent.invoke(input)
│      except LLMTimeoutError:
│          retry_with_smaller_context()
│      except BiomarkerValidationError:
│          return low_confidence_response()
│   4. Add telemetry: which exceptions are most common?
│   5. Write exception tests (10+ scenarios)
└─ Code Location: src/exceptions.py (NEW)
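The hierarchy and the graceful-degradation wrapper from step 3 can be sketched as follows; the fallback behaviors here are illustrative, not the project's actual recovery policy:

```python
class RagBotException(Exception):
    """Base for all RagBot errors."""


class BiomarkerValidationError(RagBotException): ...
class LLMTimeoutError(RagBotException): ...
class VectorStoreError(RagBotException): ...
class SchemaValidationError(RagBotException): ...


def safe_invoke(agent_fn, input_data: dict) -> dict:
    """Degrade gracefully instead of surfacing raw exceptions to callers."""
    try:
        return agent_fn(input_data)
    except LLMTimeoutError:
        # Illustrative fallback: retry once with a reduced context
        return agent_fn({**input_data, "context_size": "small"})
    except BiomarkerValidationError as exc:
        # In a medical domain, surface uncertainty rather than guess
        return {"confidence_score": 0.0, "error": str(exc), "needs_review": True}
```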
Week 5: Days 21-25

SKILL #27: Python Observability (First Use)
├─ Duration: 4-5 hours
├─ Task: Structured logging for debugging & monitoring
├─ Deliverable: JSON-formatted logs with context
├─ Actions:
│   1. Read SKILL.md (structured logging, correlation IDs)
│   2. Replace print() with logger calls:
│      logger.info("analyzing biomarkers", extra={
│          "biomarkers": {"glucose": 140},
│          "user_id": "user123",
│          "correlation_id": "req-abc123"
│      })
│   3. Add correlation IDs to track requests through agents
│   4. Structure logs as JSON (not text):
│      - timestamp
│      - level
│      - message
│      - context (user, request, agent)
│      - metrics (latency, tokens used)
│   5. Implement in all agents (src/agents/*)
│   6. Test: review logs.jsonl output
└─ Code Location: src/observability.py (NEW)
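A stdlib-only sketch of a JSON formatter with correlation IDs; the real src/observability.py would add the context and metrics fields listed above:

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log record instead of free text."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
            # keys passed via extra={} become attributes on the record
            "correlation_id": getattr(record, "correlation_id", None),
        }
        return json.dumps(entry)


# Demo wiring: in production the handler would write to logs.jsonl
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("ragbot")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("analyzing biomarkers", extra={"correlation_id": "req-abc123"})
```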
SKILL #24: GitHub Actions Templates
├─ Duration: 2-3 hours
├─ Task: Set up the CI/CD pipeline
├─ Deliverable: .github/workflows/test.yml (auto-run tests on PR)
├─ Actions:
│   1. Read SKILL.md (GitHub Actions workflow syntax)
│   2. Create .github/workflows/test.yml:
│      name: Run Tests
│      on: [push, pull_request]
│      jobs:
│        test:
│          runs-on: ubuntu-latest
│          steps:
│            - uses: actions/checkout@v3
│            - uses: actions/setup-python@v4
│            - run: pip install -r requirements.txt
│            - run: pytest tests/ -v --cov=src --cov-report=xml
│            - run: coverage report --fail-under=90
│   3. Create .github/workflows/security.yml:
│      - Run OWASP checks
│      - Lint code
│      - Check dependencies for CVEs
│   4. Create .github/workflows/docker.yml:
│      - Build Docker image
│      - Push to registry (optional)
│   5. Test: create a PR, verify the workflows run
└─ Location: .github/workflows/
END OF PHASE 2 OUTCOMES:
✓ 90%+ test coverage achieved
✓ 50+ parametrized tests added
✓ Agent code refactored with design patterns
✓ LLM prompts optimized for medical accuracy
✓ Evaluation metrics show +15% accuracy improvement
✓ Function calling prevents JSON parsing failures
✓ Comprehensive error handling in place
✓ Structured JSON logging implemented
✓ CI/CD pipeline automated
────────────────────────────────────────────────────────────────────────────────
PHASE 3: RETRIEVAL OPTIMIZATION & KNOWLEDGE GRAPHS (Weeks 6-8)
────────────────────────────────────────────────────────────────────────────────
GOAL: Better medical knowledge retrieval + citations + knowledge graphs

Week 6: Days 26-30

SKILL #8: Hybrid Search Implementation
├─ Duration: 4-6 hours
├─ Task: Combine semantic + keyword search for better recall
├─ Deliverable: Hybrid retriever for RagBot (BM25 + FAISS)
├─ Actions:
│   1. Read SKILL.md (hybrid search architecture, reciprocal rank fusion)
│   2. Current state: only FAISS semantic search (misses rare diseases)
│   3. Add BM25 keyword search:
│      pip install rank-bm25
│   4. Create src/retrievers/hybrid_retriever.py:
│      class HybridRetriever:
│          def semantic_search(query, k=5)   # FAISS
│          def keyword_search(query, k=5)    # BM25
│          def hybrid_search(query)          # combine + rerank
│   5. Rerank with Reciprocal Rank Fusion:
│      score = 1/(k + rank_semantic) + 1/(k + rank_keyword)
│   6. Replace the old retriever in the disease_explainer agent:
│      old: retriever = faiss_retriever
│      new: retriever = hybrid_retriever
│   7. Benchmark: test retrieval quality on 10 disease cases
│   8. Test rare disease retrieval (uncommon biomarker combinations)
└─ Code Location: src/retrievers/hybrid_retriever.py (NEW)
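The reciprocal rank fusion formula in step 5 is small enough to show in full; k=60 is the constant from the original RRF paper. Document IDs below are illustrative:

```python
def reciprocal_rank_fusion(ranked_lists, k: int = 60):
    """Merge several ranked result lists into one list ordered by RRF score.

    Each input list is ordered best-first; a document's score is the sum of
    1 / (k + rank) over every list it appears in, so documents ranked well
    by both retrievers rise to the top.
    """
    scores = {}
    for results in ranked_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

`HybridRetriever.hybrid_search` would call this with the FAISS and BM25 result lists and return the top-k fused documents.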
SKILL #9: Chunking Strategy
├─ Duration: 4-5 hours
├─ Task: Optimize medical document chunking
├─ Deliverable: Improved chunks for better context
├─ Actions:
│   1. Read SKILL.md (chunking strategies, semantic boundaries)
│   2. Current: fixed 1000-char chunks (may split mid-sentence)
│   3. Implement intelligent chunking:
│      - Split by medical sections (diagnosis, treatment, etc.)
│      - Keep related content together
│      - Maintain minimum 500 chars (context), maximum 2000 chars (context window)
│   4. Preserve medical structure:
│      - Disease headers stay with symptoms
│      - Labs stay with reference ranges
│      - Treatment options stay together
│   5. Create src/chunking_strategy.py:
│      def chunk_medical_pdf(pdf_text) -> List[Chunk]:
│          # Split by disease headers, maintain structure
│   6. Re-chunk medical_knowledge.faiss (2,861 chunks → how many?)
│   7. Re-embed with the new chunks
│   8. Benchmark: did document retrieval precision improve?
└─ Code Location: src/chunking_strategy.py (REFACTORED)
SKILL #10: Embedding Pipeline Builder
├─ Duration: 3-4 hours
├─ Task: Optimize embeddings for medical terminology
├─ Deliverable: Better semantic search for medical terms
├─ Actions:
│   1. Read SKILL.md (embedding models, fine-tuning considerations)
│   2. Current: sentence-transformers/all-MiniLM-L6-v2 (generic)
│   3. Options for medical embeddings:
│      - all-MiniLM-L6-v2 (~22M params, fast, baseline)
│      - all-mpnet-base-v2 (~110M params, better quality)
│      - Medical-specific: SciBERT or BioSentenceTransformer (if available)
│   4. Benchmark embeddings on medical queries:
│      Query: "High glucose and elevated HbA1c"
│      Expected top result: diabetes diagnosis section
│   5. If switching models:
│      pip install [new-model]
│      Re-embed all medical documents
│      Save a new FAISS index
│   6. Measure: mean reciprocal rank (MRR) of the correct document
│   7. Update src/pdf_processor.py with the better embeddings
└─ Code Location: src/llm_config.py (MODIFIED)

SKILL #11: RAG Implementation
├─ Duration: 3-4 hours
├─ Task: Enforce citations in responses
├─ Deliverable: All claims backed by retrieved documents
├─ Actions:
│   1. Read SKILL.md (citation tracking, source attribution)
│   2. Modify the disease_explainer agent to track sources:
│      result = retriever.hybrid_search(query)
│      sources = [doc.metadata['source'] for doc in result]
│      # Keep track of which statements came from which docs
│   3. Update ResponseSynthesizerAgent to require citations:
│      every claim must be followed by [source: page N]
│   4. Add validation:
│      if not has_citations(response):
│          return "Insufficient evidence for this conclusion"
│   5. Modify the API response to include citations:
│      {
│        "disease": "Diabetes",
│        "evidence": [
│          {"claim": "High glucose", "source": "Clinical_Guidelines.pdf:p45"}
│        ]
│      }
│   6. Test: every response should have citations
└─ Code Location: src/agents/disease_explainer.py (MODIFIED)
Week 7: Days 31-35

SKILL #12: Knowledge Graph Builder
├─ Duration: 6-8 hours
├─ Task: Extract and use knowledge graphs for relationships
├─ Deliverable: Biomarker → Disease → Treatment graph
├─ Actions:
│   1. Read SKILL.md (knowledge graphs, entity extraction, relationships)
│   2. Design the graph structure:
│      Nodes: biomarkers, diseases, treatments, symptoms
│      Edges: "elevated_glucose" -[indicates]-> "diabetes"
│             "diabetes" -[treated_by]-> "metformin"
│   3. Extract entities from medical PDFs:
│      Use the LLM to identify (biomarker, disease, treatment) triples
│      Store in a graph database (networkx for simplicity)
│   4. Build src/knowledge_graph.py:
│      class MedicalKnowledgeGraph:
│          def find_diseases_for_biomarker(biomarker) -> List[Disease]
│          def find_treatments_for_disease(disease) -> List[Treatment]
│          def shortest_path(biomarker, disease) -> List[Node]
│   5. Integrate with biomarker_analyzer:
│      instead of rule-based disease prediction, use knowledge graph paths
│   6. Test: the graph should have >100 nodes, >500 edges
│   7. Visualize: create graph.html (D3.js visualization)
└─ Code Location: src/knowledge_graph.py (NEW)
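The roadmap suggests networkx; the same interface can be sketched with a plain adjacency map to make the three graph queries concrete. The edge data here is illustrative only:

```python
from collections import defaultdict, deque


class MedicalKnowledgeGraph:
    """Directed labeled graph; nodes are strings, edges carry a relation label."""

    def __init__(self):
        self.edges = defaultdict(list)  # node -> [(relation, target), ...]

    def add_edge(self, src: str, relation: str, dst: str) -> None:
        self.edges[src].append((relation, dst))

    def find_diseases_for_biomarker(self, biomarker: str):
        return [dst for rel, dst in self.edges[biomarker] if rel == "indicates"]

    def find_treatments_for_disease(self, disease: str):
        return [dst for rel, dst in self.edges[disease] if rel == "treated_by"]

    def shortest_path(self, start: str, goal: str):
        # Breadth-first search; returns the node list or None if unreachable
        queue, seen = deque([[start]]), {start}
        while queue:
            path = queue.popleft()
            if path[-1] == goal:
                return path
            for _, nxt in self.edges[path[-1]]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(path + [nxt])
        return None
```

networkx would replace the BFS and adjacency bookkeeping with `nx.DiGraph` and `nx.shortest_path`, but the query API stays the same.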
SKILL #1: LangChain Architecture (Deep Dive)
├─ Duration: 3-4 hours
├─ Task: Advanced LangChain patterns for RAG
├─ Deliverable: More sophisticated agent chain design
├─ Actions:
│   1. Read SKILL.md (advanced chains, custom tools)
│   2. Add custom tools to agents:
│      @tool
│      def lookup_reference_range(biomarker: str) -> dict:
│          """Get normal range for biomarker"""
│          return config.biomarker_references[biomarker]
│   3. Create composite chains:
│      chain = (lookup_range_tool | linker | analyzer)
│   4. Implement memory for conversation context:
│      buffer = ConversationBufferMemory()
│      chain = RunnableWithMessageHistory(agent, buffer)
│   5. Add callbacks for observability:
│      .with_config(callbacks=[logger_callback])
│   6. Test chain composition & memory
└─ Code Location: src/agents/tools/ (NEW)

SKILL #28: Memory Management
├─ Duration: 3-4 hours
├─ Task: Optimize context window usage
├─ Deliverable: Fit more patient history without exceeding token limits
├─ Actions:
│   1. Read SKILL.md (context compression, memory hierarchies)
│   2. Implement sliding window memory:
│      Keep the last 5 messages (pruned conversation)
│      Summarize older messages into facts
│   3. Add context compression:
│      "User mentioned: glucose 140, HbA1c 10" (compressed)
│      instead of the full raw conversation
│   4. Monitor token usage:
│      - Groq free tier: ~500 requests/month
│      - Each request: ~1-2K tokens on average
│   5. Optimize prompts to use fewer tokens:
│      Remove verbose preamble
│      Use shorthand for common terms
│   6. Test: save 20-30% on token usage
└─ Code Location: src/memory_manager.py (NEW)
Week 8: Days 36-40

SKILL #15: Cost-Aware LLM Pipeline
├─ Duration: 4-5 hours
├─ Task: Optimize API costs (reduce Groq/Gemini usage)
├─ Deliverable: Model routing by task complexity
├─ Actions:
│   1. Read SKILL.md (cost estimation, model selection, caching)
│   2. Analyze current costs:
│      - Groq llama-3.3-70B: expensive for simple tasks
│      - Gemini free tier: rate-limited
│   3. Implement model routing:
│      Simple task: route to a smaller model (if available) or the cache
│      Complex task: use llama-3.3-70B
│   4. Example routing:
│      if task == "extract_biomarkers" and has_cache:
│          return cached_result
│      elif task == "complex_reasoning":
│          use_groq_70b()
│      else:
│          use_gemini_free()
│   5. Implement caching:
│      hash(query) -> check cache -> LLM -> store result
│   6. Track costs:
│      Log every API call with its cost
│      Generate a monthly cost report
│   7. Target: 40% cost reduction
└─ Code Location: src/llm_config.py (MODIFIED)
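Steps 3-5 combined into one sketch: a hash-keyed cache in front of a task-based router. The `models` dict of callables is a stand-in for the real Groq/Gemini clients, and the routing rule mirrors the pseudocode in step 4:

```python
import hashlib

_cache: dict = {}


def _key(task: str, payload: str) -> str:
    # Stable cache key: same task + same input -> same entry
    return hashlib.sha256(f"{task}:{payload}".encode()).hexdigest()


def route_and_call(task: str, payload: str, models: dict) -> str:
    """Cache first; route complex reasoning to the large model, the rest to
    the free-tier model."""
    key = _key(task, payload)
    if key in _cache:
        return _cache[key]
    if task == "complex_reasoning":
        result = models["groq_70b"](payload)
    else:
        result = models["gemini_free"](payload)
    _cache[key] = result
    return result
```

A production version would also log per-call cost here, which is exactly where the monthly cost report from step 6 gets its data.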
END OF PHASE 3 OUTCOMES:
✓ Hybrid search implemented (semantic + keyword)
✓ Medical chunking improves knowledge quality
✓ Embeddings optimized for medical terminology
✓ Citation enforcement in all RAG outputs
✓ Knowledge graph built from medical PDFs
✓ LangChain advanced patterns implemented
✓ Context window optimization reduces token waste
✓ Model routing saves 40% on API costs
✓ Better disease prediction via knowledge graphs
────────────────────────────────────────────────────────────────────────────────
PHASE 4: DEPLOYMENT, MONITORING & SCALING (Weeks 9-12)
────────────────────────────────────────────────────────────────────────────────
GOAL: Production-ready system with monitoring, docs, and deployment

Week 9: Days 41-45

SKILL #25: FastAPI Templates
├─ Duration: 3-4 hours
├─ Task: Production-grade FastAPI configuration
├─ Deliverable: Optimized FastAPI settings, middleware
├─ Actions:
│   1. Read SKILL.md (async patterns, dependency injection, middleware)
│   2. Apply async best practices:
│      - All endpoints async def
│      - Use asyncio for parallel agent calls
│      - Remove any sync blocking calls
│   3. Add the middleware chain:
│      - CORS middleware (for the web frontend)
│      - Request logging (correlation IDs)
│      - Error handling
│      - Rate limiting
│      - Auth
│   4. Optimize configuration:
│      - Connection pooling for databases
│      - Caching headers (HTTP)
│      - Compression (gzip)
│   5. Add health checks:
│      /health - basic healthcheck
│      /health/deep - check dependencies (FAISS, LLM)
│   6. Test: load testing with async clients
└─ Code Location: api/app/main.py (REFACTORED)

SKILL #29: API Docs Generator
├─ Duration: 2-3 hours
├─ Task: Auto-generate OpenAPI spec + interactive docs
├─ Deliverable: /docs (Swagger UI) + /redoc (ReDoc)
├─ Actions:
│   1. Read SKILL.md (OpenAPI, Swagger UI, ReDoc)
│   2. FastAPI auto-generates OpenAPI from endpoints
│   3. Enhance documentation:
│      Add detailed descriptions to each endpoint
│      Add example responses
│      Add error codes
│   4. Example:
│      @app.post("/api/v1/analyze/structured")
│      async def analyze_structured(request: AnalysisRequest):
│          """
│          Analyze biomarkers (structured input)
│
│          - **biomarkers**: dict of biomarker names → values
│          - **response**: full analysis with disease prediction
│
│          Example:
│          {"biomarkers": {"glucose": 140, "HbA1c": 10}}
│          """
│   5. Auto-docs available at:
│      http://localhost:8000/docs
│      http://localhost:8000/redoc
│   6. Generate OpenAPI JSON:
│      http://localhost:8000/openapi.json
│   7. Create client SDKs (optional):
│      OpenAPI Generator → Python, JS, Go clients
└─ Docs auto-generated from code

SKILL #30: GitHub PR Review Workflow
├─ Duration: 2-3 hours
├─ Task: Establish code review standards
├─ Deliverable: CODEOWNERS, PR templates, branch protection
├─ Actions:
│   1. Read SKILL.md (PR templates, CODEOWNERS, review process)
│   2. Create .github/CODEOWNERS:
│      # Security reviews required for:
│      /api/app/middleware/ @security-team
│      # Testing reviews required for:
│      /tests/ @qa-team
│   3. Create .github/pull_request_template.md:
│      ## Description
│      ## Type of change
│      ## Tests added
│      ## Checklist
│      ## Related issues
│   4. Configure branch protection:
│      - Require 1 approval before merge
│      - Require status checks to pass (tests, lint)
│      - Require an up-to-date branch
│   5. Create CONTRIBUTING.md with guidelines
└─ Location: .github/
Week 10: Days 46-50

SKILL #27: Python Observability (Advanced)
├─ Duration: 4-5 hours
├─ Task: Metrics collection + monitoring dashboard
├─ Deliverable: Key metrics tracked (latency, accuracy, errors)
├─ Actions:
│   1. Read SKILL.md (metrics, histograms, summaries)
│   2. Add Prometheus metrics:
│      pip install prometheus-client
│   3. Track key metrics:
│      - request_latency_ms (histogram)
│      - disease_prediction_accuracy (gauge)
│      - llm_api_calls_total (counter)
│      - error_rate (gauge)
│      - citations_found_rate (gauge)
│   4. Add to all agents:
│      with timer("biomarker_analyzer"):
│          result = analyzer.invoke(input)
│   5. Expose metrics at /metrics
│   6. Integrate with monitoring (optional):
│      Send to Prometheus → Grafana dashboard
│   7. Alerts:
│      If latency > 25s: alert
│      If accuracy < 75%: alert
│      If error rate > 5%: alert
└─ Code Location: src/monitoring/ (NEW)
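Before wiring up prometheus-client, the `timer` context manager from step 4 can be prototyped with the stdlib; the counter and latency names below are illustrative:

```python
import time
from collections import defaultdict
from contextlib import contextmanager

counters = defaultdict(int)       # metric name -> count
latencies = defaultdict(list)     # metric name -> list of durations in ms


@contextmanager
def timer(name: str):
    """Record wall-clock latency and a call counter for the wrapped block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        latencies[name].append((time.perf_counter() - start) * 1000.0)
        counters[f"{name}_calls_total"] += 1


def p95(name: str) -> float:
    """Approximate 95th-percentile latency for a metric, 0.0 if no samples."""
    values = sorted(latencies[name])
    return values[int(0.95 * (len(values) - 1))] if values else 0.0
```

Swapping in prometheus-client later means replacing the dicts with `Counter` and `Histogram` objects while keeping the `with timer(...)` call sites unchanged.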
SKILL #23: Code Review Excellence
├─ Duration: 2-3 hours
├─ Task: Review and improve code quality
├─ Deliverable: Code quality assessment report
├─ Actions:
│   1. Read SKILL.md (code review patterns, common issues)
│   2. Self-review all Phase 1-3 changes:
│      - Are functions <20 lines? (if not, break them up)
│      - Are variable names clear? (rename if not)
│      - Are error cases handled? (if not, add handling)
│      - Are tests present? (required: >90% coverage)
│   3. Common medical code patterns to enforce:
│      - Never assume biomarker values are valid
│      - Always include units (mg/dL, etc.)
│      - Always cite medical literature
│      - Never hardcode disease thresholds
│   4. Create REVIEW_GUIDELINES.md
│   5. Review agent implementations:
│      Check for typos, unclear logic, missing docstrings
└─ Code Location: docs/REVIEW_GUIDELINES.md (NEW)
| SKILL #31: CI-CD Best Practices | |
| ββ Duration: 3-4 hours | |
| ββ Task: Enhance CI/CD with deployment | |
| ββ Deliverable: Automated deployment pipeline | |
| ββ Actions: | |
| β 1. Read SKILL.md (deployment strategies, environments) | |
| β 2. Add deployment workflow: | |
| β .github/workflows/deploy.yml: | |
| β - Build Docker image | |
| β - Push to registry | |
| β - Deploy to staging | |
| β - Run smoke tests | |
| β - Manual approval for production | |
| β - Deploy to production | |
| β 3. Environment management: | |
| β - .env.development (localhost) | |
| β - .env.staging (staging server) | |
| β - .env.production (prod server) | |
| β 4. Deployment strategy: | |
| β Canary: Deploy to 10% of traffic first | |
| β Monitor for errors | |
| β If OK, deploy to 100% | |
| β If errors, rollback | |
| β 5. Docker configuration: | |
| β Multi-stage build for smaller images | |
| β Security: Non-root user, minimal base image | |
| β 6. Test deployment locally: | |
| β docker build -t ragbot . | |
| β docker run -p 8000:8000 ragbot | |
| ββ Location: .github/workflows/deploy.yml (NEW) | |
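The "run smoke tests" step of the deploy workflow can be a small script the pipeline invokes against the freshly deployed staging URL. A sketch, assuming the API exposes `/docs` (FastAPI default), the `/metrics` endpoint from SKILL #30, and a hypothetical `/health` route:

```python
import sys
import urllib.request

# Endpoints are assumptions: /docs is the FastAPI default, /metrics comes from
# the monitoring work, and /health is a hypothetical liveness route.
CHECKS = ["/health", "/docs", "/metrics"]

def smoke_test(base_url: str) -> bool:
    """Hit read-only endpoints after a deploy; any non-200 fails the check."""
    ok = True
    for path in CHECKS:
        try:
            with urllib.request.urlopen(base_url + path, timeout=10) as resp:
                if resp.status != 200:
                    print(f"FAIL {path}: HTTP {resp.status}")
                    ok = False
        except OSError as exc:  # covers URLError/HTTPError and socket errors
            print(f"FAIL {path}: {exc}")
            ok = False
    return ok
```

The workflow would run it as, e.g., `python scripts/smoke_test.py` (hypothetical path) and gate promotion on a zero exit code via `sys.exit(0 if smoke_test(url) else 1)`.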
| SKILL #32: Frontend Accessibility (if building web frontend) | |
| ββ Duration: 2-3 hours (optional, skip if CLI only) | |
| ββ Task: Accessibility standards for web interface | |
| ββ Deliverable: WCAG 2.1 AA compliant UI | |
| ββ Actions: | |
| β 1. Read SKILL.md (a11y, screen readers, keyboard nav) | |
| β 2. If building React frontend for medical results: | |
| β - All buttons keyboard accessible | |
| β - Screen reader labels on medical data | |
| β - High contrast for readability | |
| β - Clear error messages | |
| β 3. Test with screen reader (NVDA or JAWS) | |
| ββ Code Location: examples/web_interface/ (if needed) | |
| Week 11: Days 51-55 | |
| SKILL #6: LLM Application Dev with LangChain | |
| ββ Duration: 4-5 hours | |
| ββ Task: Production LangChain patterns | |
| ββ Deliverable: Robust, maintainable agent code | |
| ββ Actions: | |
| β 1. Read SKILL.md (production patterns, error handling, logging) | |
| β 2. Implement agent lifecycle: | |
| β - Setup (load models, prepare context) | |
| β - Execution (with retries) | |
| β - Cleanup (save state, log metrics) | |
| β 3. Add retry logic for LLM calls: | |
| β @retry(max_attempts=3, backoff=exponential) | |
| β def invoke_agent(self, input): | |
| β return self.llm.predict(...) | |
| β 4. Add graceful degradation: | |
| β If LLM fails, return cached result | |
| β If vector store fails, return rule-based result | |
| β 5. Implement agent composition: | |
| β Multi-step workflows where agents call other agents | |
| β 6. Verify: 99.99% request success rate in a staging soak test | |
| ββ Code Location: src/agents/base_agent.py (REFINED) | |
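The retry and graceful-degradation steps above can be sketched with the standard library alone (a real implementation might use a library such as `tenacity` instead; `invoke_with_fallback` and its arguments are hypothetical names):

```python
import functools
import random
import time

def retry(max_attempts: int = 3, base_delay: float = 0.5):
    """Retry a flaky call with exponential backoff plus jitter."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts:
                        raise  # exhausted: surface the original error
                    # 0.5s, 1s, 2s, ... with jitter to avoid thundering herds
                    delay = base_delay * 2 ** (attempt - 1)
                    time.sleep(delay + random.uniform(0, 0.1))
        return wrapper
    return decorator

def invoke_with_fallback(agent_call, cache_lookup):
    """Graceful degradation: serve a cached answer if the live call fails."""
    try:
        return agent_call()
    except Exception:
        cached = cache_lookup()
        if cached is not None:
            return {"result": cached, "degraded": True}
        raise
```

Flagging degraded responses (`"degraded": True`) matters in the medical domain: downstream formatters can label cached or rule-based answers so they are never presented as fresh LLM output.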
| SKILL #33: Webhook Receiver Hardener | |
| ββ Duration: 2-3 hours | |
| ββ Task: Secure webhook handling (for integrations) | |
| ββ Deliverable: Webhook endpoint with signature verification | |
| ββ Actions: | |
| β 1. Read SKILL.md (signature verification, replay protection) | |
| β 2. If accepting webhooks from external systems: | |
| β - Verify HMAC signature | |
| β - Check timestamp (prevent replay attacks) | |
| β - Idempotency key handling | |
| β 3. Example: EHR system sends patient updates | |
| β POST /webhooks/patient-update | |
| β Verify: X-Webhook-Signature header | |
| β Prevent: Same update processed twice | |
| β 4. Create api/app/webhooks/ (NEW if needed) | |
| β 5. Test: Webhook security scenarios | |
| ββ Code Location: api/app/webhooks/ (OPTIONAL) | |
| Week 12: Days 56-60 | |
| SKILL #7: RAG Agent Builder | |
| ββ Duration: 4-5 hours | |
| ββ Task: Full RAG agent architecture review | |
| ββ Deliverable: Production-ready RAG agents | |
| ββ Actions: | |
| β 1. Read SKILL.md (RAG agent design, retrieval QA chains) | |
| β 2. Comprehensive RAG review: | |
| β - Retriever quality (hybrid search, ranking) | |
| β - Prompt quality (citations, evidence) | |
| β - Response quality (accurate, safe) | |
| β 3. Disease Explainer Agent refactor: | |
| β Step 1: Retrieve relevant medical documents | |
| β Step 2: Extract key evidence from docs | |
| β Step 3: Synthesize explanation with citations | |
| β Step 4: Assess confidence (high/medium/low) | |
| β 4. Test: All responses have citations | |
| β 5. Test: No medical hallucinations | |
| β 6. Benchmark: Accuracy, latency, cost | |
| ββ Code Location: src/agents/ (FINAL REVIEW) | |
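The four-step Disease Explainer refactor above can be sketched as a single pipeline. All interfaces here are hypothetical stand-ins: in the real agent, `retriever` would wrap the FAISS store and `llm` an LLM chain, and the evidence-count confidence heuristic is illustrative only.

```python
from dataclasses import dataclass

@dataclass
class Evidence:
    source: str   # citation, e.g. a guideline or textbook reference
    excerpt: str

def explain_disease(query, retriever, llm):
    """Hypothetical 4-step RAG pipeline: retrieve, extract, synthesize, assess."""
    # Step 1: retrieve relevant medical documents
    docs = retriever(query)
    # Step 2: extract key evidence (here: one truncated excerpt per doc)
    evidence = [Evidence(d["source"], d["text"][:200]) for d in docs]
    # Step 3: synthesize an explanation, carrying every source as a citation
    citations = sorted({e.source for e in evidence})
    answer = llm(query, evidence)
    # Step 4: assess confidence from how much evidence backs the answer
    confidence = ("high" if len(evidence) >= 3
                  else "medium" if evidence else "low")
    if not citations:
        # Refuse rather than hallucinate when nothing was retrieved
        answer = "Insufficient evidence in the knowledge base to answer safely."
    return {"answer": answer, "citations": citations, "confidence": confidence}
```

The "all responses have citations" test then reduces to asserting that `citations` is non-empty for every answered query, and that the refusal path fires when retrieval comes back empty.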
| Final Week Integration (Days 56-60): | |
| SKILL #2: Workflow Orchestration (Refinement) | |
| ββ Final review of entire workflow | |
| ββ Ensure all agents work together | |
| ββ Test end-to-end: CLI and API | |
| Comprehensive Testing: | |
| ββ Functional tests: All features work | |
| ββ Security tests: No vulnerabilities | |
| ββ Performance tests: <20s latency | |
| ββ Load tests: Handle 10 concurrent requests | |
| Documentation: | |
| ββ Update README with new features | |
| ββ Document API at /docs | |
| ββ Create deployment guide | |
| ββ Create troubleshooting guide | |
| Production Deployment: | |
| ββ Stage: Test with real environment | |
| ββ Canary: 10% of traffic | |
| ββ Monitor: Errors, latency, accuracy | |
| ββ Full deployment: 100% of traffic | |
| END OF PHASE 4 OUTCOMES: | |
| β FastAPI optimized for production | |
| β API documentation auto-generated | |
| β Code review standards established | |
| β Full observability (logging, metrics) | |
| β CI/CD with automated deployment | |
| β Security best practices implemented | |
| β Production-ready RAG agents | |
| β System deployed and monitored | |
| ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | |
| IMPLEMENTATION SUMMARY | |
| ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | |
| SKILLS USED IN ORDER: | |
| Phase 1 (Security + Fixes): 2, 3, 4, 16, 17, 18, 19, 20, 22 | |
| Phase 2 (Testing + Agents): 22, 26, 4, 13, 14, 5, 21, 27, 24 | |
| Phase 3 (Retrieval + Graphs): 8, 9, 10, 11, 12, 1, 28, 15 | |
| Phase 4 (Production): 25, 29, 30, 27, 23, 31, 32(*), 6, 33(*), 7 | |
| (*) Optional based on needs | |
| TOTAL IMPLEMENTATION TIME: | |
| Phase 1: ~30-40 hours | |
| Phase 2: ~35-45 hours | |
| Phase 3: ~30-40 hours | |
| Phase 4: ~30-40 hours | |
| βββββββββββββββββββββ | |
| TOTAL: ~125-165 hours over 12 weeks (~10-14 hours/week) | |
| EXPECTED OUTCOMES: | |
| Metrics: | |
| Test Coverage: 70% β 90%+ | |
| Response Latency: 25s β 15-20s (-30%) | |
| Accuracy: 65% β 80% (+15-20%) | |
| API Costs: -40% via optimization | |
| Citations: 0% β 100% | |
| Quality: | |
| β OWASP compliant | |
| β HIPAA aligned | |
| β Production-ready | |
| β Enterprise monitoring | |
| β Automated deployments | |
| System Capabilities: | |
| β Hybrid semantic + keyword search | |
| β Knowledge graphs for reasoning | |
| β Cost-optimized LLM routing | |
| β Full citation enforcement | |
| β Advanced observability | |
| ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | |
| WEEKLY CHECKLIST | |
| ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | |
| Each week, verify: | |
| β‘ Code committed with clear commit messages | |
| β‘ Tests pass locally: pytest -v --cov | |
| β‘ Coverage >85% on any new code | |
| β‘ PR created with documentation | |
| β‘ Code reviewed (self or team) | |
| β‘ No security warnings | |
| β‘ Documentation updated | |
| β‘ Metrics tracked (custom dashboard) | |
| β‘ No breaking changes to API | |
| ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | |
| DONE! Your 4-month implementation plan is ready. | |
| Start with Phase 1 Week 1. | |
| Execute systematically. | |
| Measure progress weekly. | |
| Celebrate wins! | |
| Your RagBot will be enterprise-grade. π | |