────────────────────────────────────────────────────────────────────────────────
🚀 RAGBOT 4-MONTH IMPLEMENTATION ROADMAP - ALL 34 SKILLS
Systematic, Phased Approach to Enterprise-Grade AI
────────────────────────────────────────────────────────────────────────────────
IMPLEMENTATION PHILOSOPHY
────────────────────────────────────────────────────────────────────────────────
• Fix critical issues first (security, state management, schema)
• Build tests concurrently (every feature gets tests immediately)
• Deploy incrementally (working code at each phase)
• Measure continuously (metrics drive priorities)
• Document along the way (knowledge preservation)
PROJECT BASELINE
────────────────────────────────────────────────────────────────────────────────
Current Status:
• 83+ passing tests (~70% coverage)
• 6 specialist agents (Biomarker Analyzer, Disease Explainer, etc.)
• FastAPI REST API + CLI interface
• FAISS vector store (750+ pages medical knowledge)
• 2,861 medical knowledge chunks
Critical Issues to Fix:
1. biomarker_flags & safety_alerts not propagating through workflow
2. Schema mismatch between workflow output & API formatter
3. Prediction confidence forced to 0.5 (dangerous for medical domain)
4. Different biomarker naming (API vs CLI)
5. JSON parsing breaks on malformed LLM output
6. No citation enforcement in RAG outputs
Success Metrics:
• Test coverage: 70% → 90%+
• Response latency: 25s → 15-20s
• Prediction accuracy: +15-20%
• API costs: -40% (Groq free tier optimization)
• Security: OWASP compliant, HIPAA aligned
────────────────────────────────────────────────────────────────────────────────
PHASE 1: FOUNDATION & CRITICAL FIXES (Week 1-2)
────────────────────────────────────────────────────────────────────────────────
GOAL: Security baseline + fix state propagation + unify schemas
Week 1: Days 1-5
SKILL #18: OWASP Security Check
├─ Duration: 2-3 hours
├─ Task: Run comprehensive security audit
├─ Deliverable: Security issues list, prioritized fixes
├─ Actions:
│  1. Read SKILL.md documentation
│  2. Run vulnerability scanner on /api and /src
│  3. Document findings in SECURITY_AUDIT.md
│  4. Create tickets for each finding
└─ Outcome: Clear understanding of security gaps
SKILL #17: API Security Hardening
├─ Duration: 4-6 hours
├─ Task: Implement authentication & hardening
├─ Deliverable: JWT auth on /api/v1/analyze endpoint
├─ Actions:
│  1. Read SKILL.md (auth patterns, CORS, headers)
│  2. Add JWT middleware to api/main.py
│  3. Update routes with @require_auth decorator
│  4. Add security headers (HSTS, CSP, X-Frame-Options)
│  5. Write tests for auth (SKILL #22: Python Testing Patterns)
│  6. Update docs with API key requirement
└─ Code Location: api/app/middleware/auth.py (NEW)
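The issue/verify flow behind token auth can be sketched with a stdlib HMAC-signed token. This is a simplified stand-in for a real JWT library such as PyJWT, not the project's actual middleware; `SECRET`, `issue_token`, and `verify_token` are illustrative names.

```python
import base64
import hashlib
import hmac
import json
import time

SECRET = b"change-me"  # in a real deployment, load from env/secret manager

def _b64(data: bytes) -> str:
    # URL-safe base64 without padding, as JWTs use
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def issue_token(user_id: str, ttl_s: int = 3600) -> str:
    # Payload carries the subject and expiry, like a JWT's `sub`/`exp` claims
    payload = _b64(json.dumps({"sub": user_id, "exp": time.time() + ttl_s}).encode())
    sig = _b64(hmac.new(SECRET, payload.encode(), hashlib.sha256).digest())
    return f"{payload}.{sig}"

def verify_token(token: str):
    try:
        payload, sig = token.split(".")
    except ValueError:
        return None  # malformed token
    expected = _b64(hmac.new(SECRET, payload.encode(), hashlib.sha256).digest())
    if not hmac.compare_digest(sig, expected):
        return None  # signature mismatch: reject
    claims = json.loads(base64.urlsafe_b64decode(payload + "=" * (-len(payload) % 4)))
    if claims["exp"] < time.time():
        return None  # token expired
    return claims
```

In FastAPI this check would live in a dependency that reads the `Authorization: Bearer` header and returns 401 on a `None` result.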
SKILL #22: Python Testing Patterns (First Use)
├─ Duration: 2-3 hours
├─ Task: Create testing infrastructure & auth tests
├─ Deliverable: tests/test_api_auth.py with 10+ tests
├─ Actions:
│  1. Read SKILL.md (fixtures, mocking, parametrization)
│  2. Create conftest.py with auth fixtures
│  3. Write tests for JWT generation, validation, failure cases
│  4. Implement pytest fixtures for authenticated client
│  5. Run: pytest tests/test_api_auth.py -v
└─ Outcome: 80% test coverage on auth module
SKILL #2: Workflow Orchestration Patterns
├─ Duration: 4-6 hours
├─ Task: Fix state propagation in LangGraph workflow
├─ Deliverable: biomarker_flags & safety_alerts propagate end-to-end
├─ Actions:
│  1. Read SKILL.md (LangGraph state management, parallel execution)
│  2. Review src/state.py current structure
│  3. Identify missing state fields in GuildState
│  4. Refactor agents to return complete state:
│     - src/agents/biomarker_analyzer.py → return biomarker_flags
│     - src/agents/biomarker_analyzer.py → return safety_alerts
│     - src/agents/confidence_assessor.py → update state
│  5. Test with: python -c "from src.workflow import create_guild..."
│  6. Write integration tests (SKILL #22)
└─ Code Changes: src/state.py, src/agents/*.py
SKILL #16: AI Wrapper/Structured Output
├─ Duration: 3-5 hours
├─ Task: Unify workflow → API response schema
├─ Deliverable: Single canonical response format (Pydantic model)
├─ Actions:
│  1. Read SKILL.md (structured outputs, Pydantic, validation)
│  2. Create api/app/models/response.py with unified schema
│  3. Define BaseAnalysisResponse with all required fields
│  4. Update api/app/services/ragbot.py to use unified schema
│  5. Ensure ResponseSynthesizerAgent outputs match schema
│  6. Add Pydantic validation in all endpoints
│  7. Run: pytest tests/test_response_schema.py -v
└─ Code Location: api/app/models/response.py (REFACTORED)
Week 2: Days 6-10
SKILL #3: Multi-Agent Orchestration
├─ Duration: 3-4 hours
├─ Task: Fix deterministic execution of parallel agents
├─ Deliverable: Agents execute without race conditions
├─ Actions:
│  1. Read SKILL.md (agent coordination, deterministic scheduling)
│  2. Review src/workflow.py parallel execution
│  3. Ensure explicit state passing between agents:
│     - Biomarker Analyzer outputs → Disease Explainer inputs
│     - Sequential where needed (Analyzer before Linker)
│     - Parallel where safe (Explainer & Guidelines)
│  4. Add logging to track execution order
│  5. Run 10 times: python scripts/test_chat_demo.py (same output each time)
└─ Outcome: Deterministic workflow execution
SKILL #19: LLM Security
├─ Duration: 3-4 hours
├─ Task: Prevent LLM-specific attacks
├─ Deliverable: Input validation against prompt injection
├─ Actions:
│  1. Read SKILL.md (prompt injection, token limit attacks)
│  2. Add input sanitization in api/app/services/extraction.py
│  3. Implement prompt injection detection:
│     - Check for "ignore instructions" patterns
│     - Limit biomarker input length
│     - Escape special characters
│  4. Add rate limiting per user (SKILL #20)
│  5. Write security tests
└─ Code Location: api/app/middleware/input_validation.py (NEW)
SKILL #20: API Rate Limiting
├─ Duration: 2-3 hours
├─ Task: Implement tiered rate limiting
├─ Deliverable: /api/v1/analyze limited to 10/min free, 1000/min pro
├─ Actions:
│  1. Read SKILL.md (token bucket, sliding window algorithms)
│  2. Import python-ratelimit library
│  3. Add rate limiter middleware to api/main.py
│  4. Implement tiered limits (free/pro based on API key)
│  5. Return 429 with retry-after headers
│  6. Test rate limiting behavior
└─ Code Location: api/app/middleware/rate_limiter.py (NEW)
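A token bucket, one of the two algorithms named above, can be sketched in a few lines. This is a minimal in-memory sketch of the tiered limits (free 10/min, pro 1000/min); the real middleware would keep one bucket per API key and likely use a library.

```python
import time

class TokenBucket:
    """Allow bursts up to `capacity`; refill at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.updated = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller should return HTTP 429 with a Retry-After header

def bucket_for(tier: str) -> TokenBucket:
    # Tier decides the limits: free 10/min, pro 1000/min (from the deliverable above)
    if tier == "free":
        return TokenBucket(rate=10 / 60, capacity=10)
    return TokenBucket(rate=1000 / 60, capacity=1000)
```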
END OF PHASE 1 OUTCOMES:
✅ Security audit complete with fixes prioritized
✅ JWT authentication on REST API
✅ biomarker_flags & safety_alerts propagating through workflow
✅ Unified response schema (API & CLI use same format)
✅ LLM prompt injection protection
✅ Rate limiting in place
✅ Auth + security tests written (15+ new tests)
✅ Coverage increased to ~75%
────────────────────────────────────────────────────────────────────────────────
PHASE 2: TEST EXPANSION & AGENT OPTIMIZATION (Week 3-5)
────────────────────────────────────────────────────────────────────────────────
GOAL: 90%+ test coverage + improved agent decision logic + prompt optimization
Week 3: Days 11-15
SKILL #22: Python Testing Patterns (Advanced Use)
├─ Duration: 8-10 hours (this is the main focus)
├─ Task: Parametrized testing for biomarker combinations
├─ Deliverable: 50+ new parametrized tests
├─ Actions:
│  1. Read SKILL.md sections on parametrization & fixtures
│  2. Create tests/fixtures/biomarkers.py with test data:
│     - Normal values tuple
│     - Diabetes indicators tuple
│     - Mixed abnormal values tuple
│     - Edge cases tuple
│  3. Write parametrized test for each biomarker combination:
│     @pytest.mark.parametrize("biomarkers,expected_disease", [...])
│     def test_disease_prediction(biomarkers, expected_disease):
│         assert predict_disease(biomarkers) == expected_disease
│  4. Create mocking fixtures for LLM calls:
│     @pytest.fixture
│     def mock_groq_client(monkeypatch):
│         # Mock all LLM interactions
│  5. Test agent outputs:
│     - Biomarker Analyzer with 10 scenarios
│     - Disease Explainer with 5 diseases
│     - Confidence Assessor with low/medium/high confidence cases
│  6. Run: pytest tests/ -v --cov=src --cov-report=html
│  7. Goal: 90%+ coverage on agents/
└─ Code Location: tests/test_parametrized_*.py
SKILL #26: Python Design Patterns
├─ Duration: 4-5 hours
├─ Task: Refactor agent implementations with design patterns
├─ Deliverable: Cleaner, more maintainable agent code
├─ Actions:
│  1. Read SKILL.md (SOLID, composition, factory patterns)
│  2. Identify code smells in src/agents/
│  3. Extract common agent logic to BaseAgent class:
│     class BaseAgent:
│         def invoke(self, input_data) -> AgentOutput
│         def validate_inputs(self)
│         def log_execution(self)
│  4. Use composition over inheritance:
│     - Each agent has optional retriever, validator, cache
│     - Reduce coupling between agents
│  5. Implement Factory pattern for agent creation:
│     AgentFactory.create("biomarker_analyzer")
│  6. Refactor tests to use new pattern
└─ Code Location: src/agents/base_agent.py (NEW)
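The `BaseAgent` + factory combination above can be sketched as follows. The `run` body of the example subclass is placeholder logic (the real agent calls the LLM and retriever); only the class and method names in the bullet list come from the plan.

```python
from abc import ABC, abstractmethod

class BaseAgent(ABC):
    """Shared lifecycle: validate -> run -> log. Subclasses implement run()."""

    name = "base"

    def invoke(self, input_data: dict) -> dict:
        self.validate_inputs(input_data)
        result = self.run(input_data)
        self.log_execution(input_data, result)
        return result

    def validate_inputs(self, input_data: dict) -> None:
        if not isinstance(input_data, dict):
            raise TypeError(f"{self.name}: expected dict input")

    def log_execution(self, input_data: dict, result: dict) -> None:
        # Real code would use the structured logger from SKILL #27
        print(f"[{self.name}] in={sorted(input_data)} out={sorted(result)}")

    @abstractmethod
    def run(self, input_data: dict) -> dict: ...

class BiomarkerAnalyzerAgent(BaseAgent):
    name = "biomarker_analyzer"

    def run(self, input_data: dict) -> dict:
        # Placeholder rule: flag values over an arbitrary threshold
        flags = {k: "high" for k, v in input_data.get("biomarkers", {}).items() if v > 100}
        return {"biomarker_flags": flags}

class AgentFactory:
    _registry = {"biomarker_analyzer": BiomarkerAnalyzerAgent}

    @classmethod
    def create(cls, name: str) -> BaseAgent:
        return cls._registry[name]()
```

Composition (optional retriever/validator/cache) would then be constructor arguments on the subclasses rather than more inheritance levels.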
SKILL #4: Agentic Development
├─ Duration: 3-4 hours
├─ Task: Improve agent decision logic
├─ Deliverable: Better biomarker analysis confidence scores
├─ Actions:
│  1. Read SKILL.md (planning, reasoning, decision making)
│  2. Add confidence threshold in BiomarkerAnalyzerAgent
│  3. Instead of returning all results:
│     - Only return HIGH confidence matches
│     - Flag LOW confidence for manual review
│     - Add reasoning trace (why this conclusion)
│  4. Update response format with:
│     - confidence_score (0-1)
│     - evidence_count (# sources)
│     - alternative_hypotheses (if low confidence)
│  5. Update tests
└─ Code Location: src/agents/biomarker_analyzer.py (MODIFIED)
SKILL #13: Senior Prompt Engineer (First Use)
├─ Duration: 5-6 hours
├─ Task: Optimize prompts for medical accuracy
├─ Deliverable: Updated agent prompts with better accuracy
├─ Actions:
│  1. Read SKILL.md (prompt patterns, few-shot, CoT)
│  2. Audit current agent prompts in src/agents/*.py
│  3. Apply few-shot learning to extraction agent:
│     - Add 3 examples of correct biomarker extraction
│     - Show format expected
│     - Show handling of ambiguous inputs
│  4. Add chain-of-thought reasoning:
│     "First identify the biomarkers mentioned. Then look up their ranges.
│      Then determine if abnormal. Then assess severity."
│  5. Add role prompting:
│     "You are an expert medical lab analyst with 20 years experience..."
│  6. Implement structured output prompts:
│     "Return JSON with these exact fields: biomarkers, disease, confidence"
│  7. Benchmark against baseline accuracy
│  8. Run: python scripts/test_evaluation_system.py (SKILL #14)
└─ Code Location: src/agents/*/invoke() prompts
Week 4: Days 16-20
SKILL #14: LLM Evaluation
├─ Duration: 4-5 hours
├─ Task: Benchmark LLM quality improvements
├─ Deliverable: Metrics dashboard quantifying the improvements
├─ Actions:
│  1. Read SKILL.md (evaluation metrics, benchmarking)
│  2. Create tests/evaluation_metrics.py with metrics:
│     - Accuracy (correct disease prediction)
│     - Precision (of biomarker extraction)
│     - Recall (of clinical recommendations)
│     - F1 score (biomarker identification)
│  3. Create test dataset with 20 patient scenarios:
│     tests/fixtures/evaluation_patients.py
│  4. Benchmark Groq vs Gemini on accuracy, latency, cost
│  5. Create evaluation report:
│     "Before optimization: 65% accuracy, 25s latency
│      After optimization: 80% accuracy, 18s latency"
│  6. Generate graphs/charts of improvements
└─ Code Location: tests/evaluation_metrics.py
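The precision/recall/F1 metrics listed above reduce to simple set arithmetic when extraction output and gold labels are treated as sets of biomarker names; a minimal sketch (function name is illustrative):

```python
def precision_recall_f1(predicted: set, expected: set):
    """Set-based extraction metrics: how many predicted items are right
    (precision), how many expected items were found (recall)."""
    tp = len(predicted & expected)  # true positives
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(expected) if expected else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1
```

Averaging these over the 20 patient scenarios gives the numbers for the before/after report.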
SKILL #5: Tool/Function Calling Patterns
├─ Duration: 3-4 hours
├─ Task: Use function calling for reliable LLM outputs
├─ Deliverable: Structured output via function calling (not prompting)
├─ Actions:
│  1. Read SKILL.md (tool definition, structured returns)
│  2. Define tools for extraction agent:
│     - extract_biomarkers(text: str) -> dict
│     - classify_severity(value: float, range: tuple) -> str
│     - assess_disease_risk(biomarkers: dict) -> dict
│  3. Modify extraction service to use function calling:
│     instead of parsing JSON out of free text, the LLM calls typed functions
│  4. Check Groq free tier support (it may not support function calling);
│     alternative: use strict Pydantic output validation
│  5. Test: parsing should never fail; always return valid output
│  6. Error handling: if LLM output is in the wrong format, retry with function calling
└─ Code Location: api/app/services/extraction.py (MODIFIED)
SKILL #21: Python Error Handling
├─ Duration: 3-4 hours
├─ Task: Comprehensive error handling for production
├─ Deliverable: Custom exception hierarchy, graceful degradation
├─ Actions:
│  1. Read SKILL.md (exception patterns, logging, recovery)
│  2. Create src/exceptions.py with hierarchy:
│     - RagBotException (base)
│     - BiomarkerValidationError
│     - LLMTimeoutError (with retry logic)
│     - VectorStoreError
│     - SchemaValidationError
│  3. Wrap agent calls with try-except:
│     try:
│         result = agent.invoke(input)
│     except LLMTimeoutError:
│         retry_with_smaller_context()
│     except BiomarkerValidationError:
│         return low_confidence_response()
│  4. Add telemetry: which exceptions are most common?
│  5. Write exception tests (10+ scenarios)
└─ Code Location: src/exceptions.py (NEW)
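The hierarchy and the wrap-and-degrade pattern above might look like this; `safe_invoke` and its return shape are illustrative, not existing project code:

```python
class RagBotException(Exception):
    """Base class: lets callers catch all RagBot errors in one clause."""

class BiomarkerValidationError(RagBotException): pass
class LLMTimeoutError(RagBotException): pass
class VectorStoreError(RagBotException): pass
class SchemaValidationError(RagBotException): pass

def safe_invoke(agent_fn, payload, fallback=None):
    """Run an agent call, degrading gracefully instead of crashing the request."""
    try:
        return {"status": "ok", "result": agent_fn(payload)}
    except LLMTimeoutError:
        # Real code would first retry with a smaller context (step 3 above)
        return {"status": "degraded", "result": fallback, "reason": "llm_timeout"}
    except BiomarkerValidationError as exc:
        return {"status": "low_confidence", "result": fallback, "reason": str(exc)}
```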
Week 5: Days 21-25
SKILL #27: Python Observability (First Use)
├─ Duration: 4-5 hours
├─ Task: Structured logging for debugging & monitoring
├─ Deliverable: JSON-formatted logs with context
├─ Actions:
│  1. Read SKILL.md (structured logging, correlation IDs)
│  2. Replace print() with logger calls:
│     logger.info("analyzing biomarkers", extra={
│         "biomarkers": {"glucose": 140},
│         "user_id": "user123",
│         "correlation_id": "req-abc123"
│     })
│  3. Add correlation IDs to track requests through agents
│  4. Structure logs as JSON (not text):
│     - timestamp
│     - level
│     - message
│     - context (user, request, agent)
│     - metrics (latency, tokens used)
│  5. Implement in all agents (src/agents/*)
│  6. Test: review logs.jsonl output
└─ Code Location: src/observability.py (NEW)
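One stdlib-only way to get the JSON shape above is a custom `logging.Formatter` that merges the fields passed via `extra=` (the field whitelist here is illustrative; a production formatter would carry latency/token metrics too):

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line, merging structured context."""

    CONTEXT_KEYS = ("biomarkers", "user_id", "correlation_id")

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Fields passed via `extra=` become attributes on the record
        for key in self.CONTEXT_KEYS:
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)

logger = logging.getLogger("ragbot")
handler = logging.StreamHandler(sys.stdout)  # a FileHandler would write logs.jsonl
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("analyzing biomarkers", extra={
    "biomarkers": {"glucose": 140},
    "user_id": "user123",
    "correlation_id": "req-abc123",
})
```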
SKILL #24: GitHub Actions Templates
├─ Duration: 2-3 hours
├─ Task: Set up CI/CD pipeline
├─ Deliverable: .github/workflows/test.yml (auto-run tests on PR)
├─ Actions:
│  1. Read SKILL.md (GitHub Actions workflow syntax)
│  2. Create .github/workflows/test.yml:
│     name: Run Tests
│     on: [push, pull_request]
│     jobs:
│       test:
│         runs-on: ubuntu-latest
│         steps:
│           - uses: actions/checkout@v3
│           - uses: actions/setup-python@v4
│           - run: pip install -r requirements.txt
│           - run: pytest tests/ -v --cov=src --cov-report=xml
│           - run: coverage report --fail-under=90
│  3. Create .github/workflows/security.yml:
│     - Run OWASP checks
│     - Lint code
│     - Check dependencies for CVEs
│  4. Create .github/workflows/docker.yml:
│     - Build Docker image
│     - Push to registry (optional)
│  5. Test: create a PR, verify workflows run
└─ Location: .github/workflows/
END OF PHASE 2 OUTCOMES:
✅ 90%+ test coverage achieved
✅ 50+ parametrized tests added
✅ Agent code refactored with design patterns
✅ LLM prompts optimized for medical accuracy
✅ Evaluation metrics show +15% accuracy improvement
✅ Function calling prevents JSON parsing failures
✅ Comprehensive error handling in place
✅ Structured JSON logging implemented
✅ CI/CD pipeline automated
────────────────────────────────────────────────────────────────────────────────
PHASE 3: RETRIEVAL OPTIMIZATION & KNOWLEDGE GRAPHS (Week 6-8)
────────────────────────────────────────────────────────────────────────────────
GOAL: Better medical knowledge retrieval + citations + knowledge graphs
Week 6: Days 26-30
SKILL #8: Hybrid Search Implementation
├─ Duration: 4-6 hours
├─ Task: Combine semantic + keyword search for better recall
├─ Deliverable: Hybrid retriever for RagBot (BM25 + FAISS)
├─ Actions:
│  1. Read SKILL.md (hybrid search architecture, reciprocal rank fusion)
│  2. Current state: only FAISS semantic search (misses rare diseases)
│  3. Add BM25 keyword search:
│     pip install rank-bm25
│  4. Create src/retrievers/hybrid_retriever.py:
│     class HybridRetriever:
│         def semantic_search(query, k=5)   # FAISS
│         def keyword_search(query, k=5)    # BM25
│         def hybrid_search(query)          # Combine + rerank
│  5. Reranking (Reciprocal Rank Fusion):
│     score = 1/(k + rank_semantic) + 1/(k + rank_keyword)
│  6. Replace old retriever in disease_explainer agent:
│     old: retriever = faiss_retriever
│     new: retriever = hybrid_retriever
│  7. Benchmark: test retrieval quality on 10 disease cases
│  8. Test rare disease retrieval (uncommon biomarker combinations)
└─ Code Location: src/retrievers/hybrid_retriever.py (NEW)
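The reciprocal rank fusion formula in step 5 generalizes to any number of ranked lists; a minimal sketch (the doc ids and `k=60`, a commonly used constant, are illustrative):

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc ids: each list contributes 1/(k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest combined score first
    return sorted(scores, key=scores.get, reverse=True)

semantic = ["d3", "d1", "d7"]  # FAISS order
keyword = ["d1", "d9", "d3"]   # BM25 order
fused = reciprocal_rank_fusion([semantic, keyword])
```

Documents ranked well by both retrievers (here `d1` and `d3`) float to the top, which is exactly the behavior that rescues rare-disease queries where one retriever alone misfires.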
SKILL #9: Chunking Strategy
├─ Duration: 4-5 hours
├─ Task: Optimize medical document chunking
├─ Deliverable: Improved chunks for better context
├─ Actions:
│  1. Read SKILL.md (chunking strategies, semantic boundaries)
│  2. Current: fixed 1000-char chunks (may split mid-sentence)
│  3. Implement intelligent chunking:
│     - Split by medical sections (diagnosis, treatment, etc.)
│     - Keep related content together
│     - Maintain minimum 500 chars (context), maximum 2000 chars (context window)
│  4. Preserve medical structure:
│     - Disease headers stay with symptoms
│     - Labs stay with reference ranges
│     - Treatment options stay together
│  5. Create src/chunking_strategy.py:
│     def chunk_medical_pdf(pdf_text) -> List[Chunk]:
│         # Split by disease headers, maintain structure
│  6. Re-chunk medical_knowledge.faiss (2,861 chunks → how many?)
│  7. Re-embed with new chunks
│  8. Benchmark: has document retrieval precision improved?
└─ Code Location: src/chunking_strategy.py (REFACTORED)
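The core of the section-aware split can be sketched with a regex that breaks on header lines like "Diagnosis:" or "Treatment:" so a header always stays with its body; the function name and header pattern are simplifications of what `chunk_medical_pdf` would need for real PDFs:

```python
import re

def chunk_medical_text(text: str, max_chars: int = 2000):
    """Split on section headers (e.g. 'Diagnosis:') so related content
    stays together; hard-split only sections that exceed max_chars."""
    # Zero-width lookahead keeps the header at the start of its chunk
    sections = re.split(r"\n(?=[A-Z][A-Za-z ]+:\n)", text)
    chunks = []
    for section in sections:
        section = section.strip()
        while len(section) > max_chars:
            chunks.append(section[:max_chars])
            section = section[max_chars:]
        if section:
            chunks.append(section)
    return chunks

sample = ("Diagnosis:\nType 2 diabetes is indicated by elevated HbA1c.\n"
          "Treatment:\nMetformin is first-line therapy.")
chunks = chunk_medical_text(sample)
```

A production version would also merge undersized sections up to the 500-char minimum named above.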
SKILL #10: Embedding Pipeline Builder
├─ Duration: 3-4 hours
├─ Task: Optimize embeddings for medical terminology
├─ Deliverable: Better semantic search for medical terms
├─ Actions:
│  1. Read SKILL.md (embedding models, fine-tuning considerations)
│  2. Current: sentence-transformers/all-MiniLM-L6-v2 (generic)
│  3. Options for medical embeddings:
│     - all-MiniLM-L6-v2 (~23M params, fast, baseline)
│     - all-mpnet-base-v2 (~110M params, better quality)
│     - Medical-specific: SciBERT or BioSentenceTransformer (if available)
│  4. Benchmark embeddings on medical queries:
│     Query: "High glucose and elevated HbA1c"
│     Expected top result: diabetes diagnosis section
│  5. If using a different model:
│     pip install [new-model]
│     Re-embed all medical documents
│     Save new FAISS index
│  6. Measure: mean reciprocal rank (MRR) of the correct document
│  7. Update src/pdf_processor.py with better embeddings
└─ Code Location: src/llm_config.py (MODIFIED)
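MRR, the metric named in step 6, is the average over queries of 1/rank of the first relevant document; a minimal sketch:

```python
def mean_reciprocal_rank(results_per_query, relevant_per_query):
    """MRR: mean of 1/rank of the first relevant doc for each query.
    A query whose relevant doc never appears contributes 0."""
    total = 0.0
    for retrieved, relevant in zip(results_per_query, relevant_per_query):
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(results_per_query)
```

Comparing MRR before and after swapping the embedding model gives a single number to justify (or reject) the re-embedding cost.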
SKILL #11: RAG Implementation
├─ Duration: 3-4 hours
├─ Task: Enforce citations in responses
├─ Deliverable: All claims backed by retrieved documents
├─ Actions:
│  1. Read SKILL.md (citation tracking, source attribution)
│  2. Modify disease_explainer agent to track sources:
│     result = retriever.hybrid_search(query)
│     sources = [doc.metadata['source'] for doc in result]
│     # Keep track of which statements came from which docs
│  3. Update ResponseSynthesizerAgent to require citations:
│     every claim must be followed by [source: page N]
│  4. Add validation:
│     if not has_citations(response):
│         return "Insufficient evidence for this conclusion"
│  5. Modify API response to include citations:
│     {
│       "disease": "Diabetes",
│       "evidence": [
│         {"claim": "High glucose", "source": "Clinical_Guidelines.pdf:p45"}
│       ]
│     }
│  6. Test: every response should have citations
└─ Code Location: src/agents/disease_explainer.py (MODIFIED)
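The `has_citations` check in step 4 can be a simple regex over the `[source: ...]` marker format; a sketch assuming that marker syntax (the function names mirror the pseudocode above):

```python
import re

# Matches markers like [source: Clinical_Guidelines.pdf:p45] or [source: page 12]
CITATION = re.compile(r"\[source:\s*[^\]]+\]")

def has_citations(response: str) -> bool:
    return bool(CITATION.search(response))

def enforce_citations(response: str) -> str:
    # Refuse to surface uncited medical claims
    if not has_citations(response):
        return "Insufficient evidence for this conclusion"
    return response
```

A stricter version would check per sentence rather than per response, so one cited claim cannot smuggle in several uncited ones.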
Week 7: Days 31-35
SKILL #12: Knowledge Graph Builder
├─ Duration: 6-8 hours
├─ Task: Extract and use knowledge graphs for relationships
├─ Deliverable: Biomarker → Disease → Treatment graph
├─ Actions:
│  1. Read SKILL.md (knowledge graphs, entity extraction, relationships)
│  2. Design graph structure:
│     Nodes: biomarkers, diseases, treatments, symptoms
│     Edges: "elevated_glucose" -[indicates]-> "diabetes"
│            "diabetes" -[treated_by]-> "metformin"
│  3. Extract entities from medical PDFs:
│     use an LLM to identify (biomarker, disease, treatment) triples;
│     store in a graph database (networkx for simplicity)
│  4. Build src/knowledge_graph.py:
│     class MedicalKnowledgeGraph:
│         def find_diseases_for_biomarker(biomarker) -> List[Disease]
│         def find_treatments_for_disease(disease) -> List[Treatment]
│         def shortest_path(biomarker, disease) -> List[Node]
│  5. Integrate with biomarker_analyzer:
│     use knowledge-graph paths instead of rule-based disease prediction
│  6. Test: graph should have >100 nodes, >500 edges
│  7. Visualize: create graph.html (D3.js visualization)
└─ Code Location: src/knowledge_graph.py (NEW)
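The edge-labelled structure in step 2 can be sketched with a plain adjacency dict (the plan suggests networkx; this stdlib version shows the same idea without the dependency, using the example triples from the plan):

```python
from collections import defaultdict

class MedicalKnowledgeGraph:
    """Directed graph over (biomarker, disease, treatment) triples,
    with labelled edges like -[indicates]-> and -[treated_by]->."""

    def __init__(self):
        self.edges = defaultdict(list)  # node -> [(relation, neighbor)]

    def add(self, src: str, relation: str, dst: str) -> None:
        self.edges[src].append((relation, dst))

    def find_diseases_for_biomarker(self, biomarker: str):
        return [dst for rel, dst in self.edges[biomarker] if rel == "indicates"]

    def find_treatments_for_disease(self, disease: str):
        return [dst for rel, dst in self.edges[disease] if rel == "treated_by"]

kg = MedicalKnowledgeGraph()
kg.add("elevated_glucose", "indicates", "diabetes")
kg.add("diabetes", "treated_by", "metformin")
```

With networkx, `shortest_path` comes for free (`nx.shortest_path`), which is the main reason to prefer it once the graph grows past toy size.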
SKILL #1: LangChain Architecture (Deep Dive)
├─ Duration: 3-4 hours
├─ Task: Advanced LangChain patterns for RAG
├─ Deliverable: More sophisticated agent chain design
├─ Actions:
│  1. Read SKILL.md (advanced chains, custom tools)
│  2. Add custom tools to agents:
│     @tool
│     def lookup_reference_range(biomarker: str) -> dict:
│         """Get normal range for biomarker"""
│         return config.biomarker_references[biomarker]
│  3. Create composite chains:
│     chain = (lookup_range_tool | linter | analyzer)
│  4. Implement memory for conversation context:
│     buffer = ConversationBufferMemory()
│     chain = RunnableWithMessageHistory(agent, buffer)
│  5. Add callbacks for observability:
│     .with_config(callbacks=[logger_callback])
│  6. Test chain composition & memory
└─ Code Location: src/agents/tools/ (NEW)
SKILL #28: Memory Management
├─ Duration: 3-4 hours
├─ Task: Optimize context window usage
├─ Deliverable: Fit more patient history without exceeding token limits
├─ Actions:
│  1. Read SKILL.md (context compression, memory hierarchies)
│  2. Implement sliding window memory:
│     keep the last 5 messages (pruned conversation);
│     summarize older messages into facts
│  3. Add context compression:
│     "User mentioned: glucose 140, HbA1c 10" (compressed)
│     instead of the full raw conversation
│  4. Monitor token usage:
│     - Groq free tier: ~500 requests/month
│     - Each request: ~1-2K tokens average
│  5. Optimize prompts to use fewer tokens:
│     remove verbose preamble; use shorthand for common terms
│  6. Test: save 20-30% on token usage
└─ Code Location: src/memory_manager.py (NEW)
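The sliding-window-plus-summary scheme in step 2 might look like this; the truncation stands in for real summarization (which the plan would delegate to an LLM), and the class name is illustrative:

```python
class SlidingWindowMemory:
    """Keep the last `window` messages verbatim; compress older ones to facts."""

    def __init__(self, window: int = 5):
        self.window = window
        self.messages = []  # recent turns, kept verbatim
        self.facts = []     # compressed remnants of evicted turns

    def add(self, message: str) -> None:
        self.messages.append(message)
        while len(self.messages) > self.window:
            evicted = self.messages.pop(0)
            # Crude compression stand-in; the real system would LLM-summarize
            self.facts.append(evicted[:40])

    def context(self) -> str:
        summary = "; ".join(self.facts)
        prefix = f"Known facts: {summary}\n" if summary else ""
        return prefix + "\n".join(self.messages)
```

Because the summary line grows far more slowly than raw turns, the prompt stays bounded no matter how long the conversation runs.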
Week 8: Days 36-40
SKILL #15: Cost-Aware LLM Pipeline
├─ Duration: 4-5 hours
├─ Task: Optimize API costs (reduce Groq/Gemini usage)
├─ Deliverable: Model routing by task complexity
├─ Actions:
│  1. Read SKILL.md (cost estimation, model selection, caching)
│  2. Analyze current costs:
│     - Groq llama-3.3-70B: expensive for simple tasks
│     - Gemini free tier: rate-limited
│  3. Implement model routing:
│     simple task: route to a smaller model (if available) or cache;
│     complex task: use llama-3.3-70B
│  4. Example routing:
│     if task == "extract_biomarkers" and has_cache:
│         return cached_result
│     elif task == "complex_reasoning":
│         use_groq_70b()
│     else:
│         use_gemini_free()
│  5. Implement caching:
│     hash(query) -> check cache -> LLM -> store result
│  6. Track costs:
│     log every API call with its cost; generate a monthly cost report
│  7. Target: -40% cost reduction
└─ Code Location: src/llm_config.py (MODIFIED)
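Steps 4 and 5 combine into a small cache-then-route function. This sketch uses stand-in callables for the two model tiers (the real code would wrap the Groq and Gemini clients); names and the in-memory cache are illustrative:

```python
import hashlib

_cache = {}

def _key(task: str, payload: str) -> str:
    # Step 5: hash(query) is the cache key
    return hashlib.sha256(f"{task}:{payload}".encode()).hexdigest()

def route(task: str, payload: str, call_cheap, call_strong):
    """Cache first; strong model only for complex reasoning; cheap otherwise."""
    key = _key(task, payload)
    if key in _cache:
        return _cache[key]
    if task == "complex_reasoning":
        result = call_strong(payload)   # e.g. Groq llama-3.3-70B
    else:
        result = call_cheap(payload)    # e.g. Gemini free tier
    _cache[key] = result
    return result

calls = {"cheap": 0, "strong": 0}
def cheap(p):
    calls["cheap"] += 1
    return f"cheap:{p}"
def strong(p):
    calls["strong"] += 1
    return f"strong:{p}"

first = route("extract_biomarkers", "glucose 140", cheap, strong)
second = route("extract_biomarkers", "glucose 140", cheap, strong)  # cache hit
```

Logging a per-call cost alongside each cache miss is enough to generate the monthly report in step 6.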
END OF PHASE 3 OUTCOMES:
✅ Hybrid search implemented (semantic + keyword)
✅ Medical chunking improves knowledge quality
✅ Embeddings optimized for medical terminology
✅ Citation enforcement in all RAG outputs
✅ Knowledge graph built from medical PDFs
✅ LangChain advanced patterns implemented
✅ Context window optimization reduces token waste
✅ Model routing saves ~40% on API costs
✅ Better disease prediction via knowledge graphs
────────────────────────────────────────────────────────────────────────────────
PHASE 4: DEPLOYMENT, MONITORING & SCALING (Week 9-12)
────────────────────────────────────────────────────────────────────────────────
GOAL: Production-ready system with monitoring, docs, and deployment
Week 9: Days 41-45
SKILL #25: FastAPI Templates
├─ Duration: 3-4 hours
├─ Task: Production-grade FastAPI configuration
├─ Deliverable: Optimized FastAPI settings, middleware
├─ Actions:
│  1. Read SKILL.md (async patterns, dependency injection, middleware)
│  2. Apply async best practices:
│     - All endpoints async def
│     - Use asyncio for parallel agent calls
│     - Remove any sync blocking calls
│  3. Add middleware chain:
│     - CORS middleware (for web frontend)
│     - Request logging (correlation IDs)
│     - Error handling
│     - Rate limiting
│     - Auth
│  4. Optimize configuration:
│     - Connection pooling for databases
│     - Caching headers (HTTP)
│     - Compression (gzip)
│  5. Add health checks:
│     /health - basic healthcheck
│     /health/deep - check dependencies (FAISS, LLM)
│  6. Test: load testing with async clients
└─ Code Location: api/app/main.py (REFACTORED)
SKILL #29: API Docs Generator
├─ Duration: 2-3 hours
├─ Task: Auto-generate OpenAPI spec + interactive docs
├─ Deliverable: /docs (Swagger UI) + /redoc (ReDoc)
├─ Actions:
│  1. Read SKILL.md (OpenAPI, Swagger UI, ReDoc)
│  2. FastAPI auto-generates OpenAPI from endpoints
│  3. Enhance documentation:
│     add detailed descriptions, example responses, and error codes per endpoint
│  4. Example:
│     @app.post("/api/v1/analyze/structured")
│     async def analyze_structured(request: AnalysisRequest):
│         """
│         Analyze biomarkers (structured input)
│
│         - **biomarkers**: Dict of biomarker names → values
│         - **response**: Full analysis with disease prediction
│
│         Example:
│         {"biomarkers": {"glucose": 140, "HbA1c": 10}}
│         """
│  5. Auto-docs available at:
│     http://localhost:8000/docs
│     http://localhost:8000/redoc
│  6. Generate OpenAPI JSON:
│     http://localhost:8000/openapi.json
│  7. Create client SDK (optional):
│     OpenAPI Generator → Python, JS, Go clients
└─ Docs auto-generated from code
SKILL #30: GitHub PR Review Workflow
├─ Duration: 2-3 hours
├─ Task: Establish code review standards
├─ Deliverable: CODEOWNERS, PR templates, branch protection
├─ Actions:
│  1. Read SKILL.md (PR templates, CODEOWNERS, review process)
│  2. Create .github/CODEOWNERS:
│     # Security reviews required for:
│     /api/app/middleware/ @security-team
│     # Testing reviews required for:
│     /tests/ @qa-team
│  3. Create .github/pull_request_template.md:
│     ## Description
│     ## Type of change
│     ## Tests added
│     ## Checklist
│     ## Related issues
│  4. Configure branch protection:
│     - Require 1 approval before merge
│     - Require status checks to pass (tests, lint)
│     - Require up-to-date branch
│  5. Create CONTRIBUTING.md with guidelines
└─ Location: .github/
Week 10: Days 46-50
SKILL #27: Python Observability (Advanced)
├─ Duration: 4-5 hours
├─ Task: Metrics collection + monitoring dashboard
├─ Deliverable: Key metrics tracked (latency, accuracy, errors)
├─ Actions:
│  1. Read SKILL.md (metrics, histograms, summaries)
│  2. Add Prometheus metrics:
│     pip install prometheus-client
│  3. Track key metrics:
│     - request_latency_ms (histogram)
│     - disease_prediction_accuracy (gauge)
│     - llm_api_calls_total (counter)
│     - error_rate (gauge)
│     - citations_found_rate (gauge)
│  4. Add to all agents:
│     with timer("biomarker_analyzer"):
│         result = analyzer.invoke(input)
│  5. Expose metrics at /metrics
│  6. Integrate with monitoring (optional):
│     send to Prometheus → Grafana dashboard
│  7. Alerts:
│     if latency > 25s: alert
│     if accuracy < 75%: alert
│     if error rate > 5%: alert
└─ Code Location: src/monitoring/ (NEW)
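The `timer(...)` context manager from step 4 can be prototyped with the stdlib before wiring in prometheus-client (whose `Histogram.time()` would replace this in production); the percentile helper is a naive illustration, not a Prometheus API:

```python
import time
from collections import defaultdict
from contextlib import contextmanager

latencies_ms = defaultdict(list)  # metric name -> observed samples

@contextmanager
def timer(name: str):
    """Record wall-clock duration of the wrapped block, in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        latencies_ms[name].append((time.perf_counter() - start) * 1000)

def percentile(samples, p):
    """Naive nearest-rank percentile over a list of samples."""
    ordered = sorted(samples)
    return ordered[min(len(ordered) - 1, int(p / 100 * len(ordered)))]

with timer("biomarker_analyzer"):
    time.sleep(0.01)  # stand-in for analyzer.invoke(input)
```

The alert rules in step 7 then reduce to checks like `percentile(latencies_ms["request"], 95) > 25_000`.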
SKILL #23: Code Review Excellence
├─ Duration: 2-3 hours
├─ Task: Review and improve code quality
├─ Deliverable: Code quality assessment report
├─ Actions:
│  1. Read SKILL.md (code review patterns, common issues)
│  2. Self-review all Phase 1-3 changes:
│     - Are functions <20 lines? (if not, break up)
│     - Are variable names clear? (rename if not)
│     - Are error cases handled? (if not, add)
│     - Are tests present? (required: >90% coverage)
│  3. Common medical code patterns to enforce:
│     - Never assume biomarker values are valid
│     - Always include units (mg/dL, etc.)
│     - Always cite medical literature
│     - Never hardcode disease thresholds
│  4. Create REVIEW_GUIDELINES.md
│  5. Review agent implementations:
│     check for typos, unclear logic, missing docstrings
└─ Code Location: docs/REVIEW_GUIDELINES.md (NEW)
SKILL #31: CI-CD Best Practices
├─ Duration: 3-4 hours
├─ Task: Enhance CI/CD with deployment
├─ Deliverable: Automated deployment pipeline
├─ Actions:
│  1. Read SKILL.md (deployment strategies, environments)
│  2. Add deployment workflow (.github/workflows/deploy.yml):
│     - Build Docker image
│     - Push to registry
│     - Deploy to staging
│     - Run smoke tests
│     - Manual approval for production
│     - Deploy to production
│  3. Environment management:
│     - .env.development (localhost)
│     - .env.staging (staging server)
│     - .env.production (prod server)
│  4. Deployment strategy:
│     canary: deploy to 10% of traffic first, monitor for errors;
│     if OK, deploy to 100%; if errors, roll back
│  5. Docker configuration:
│     multi-stage build for smaller images;
│     security: non-root user, minimal base image
│  6. Test deployment locally:
│     docker build -t ragbot .
│     docker run -p 8000:8000 ragbot
└─ Location: .github/workflows/deploy.yml (NEW)
SKILL #32: Frontend Accessibility (if building a web frontend)
├── Duration: 2-3 hours (optional; skip if CLI only)
├── Task: Accessibility standards for the web interface
├── Deliverable: WCAG 2.1 AA compliant UI
├── Actions:
│   1. Read SKILL.md (a11y, screen readers, keyboard navigation)
│   2. If building a React frontend for medical results:
│      - All buttons keyboard accessible
│      - Screen-reader labels on medical data
│      - High contrast for readability
│      - Clear error messages
│   3. Test with a screen reader (NVDA or JAWS)
└── Code Location: examples/web_interface/ (if needed)
Week 11: Days 51-55
SKILL #6: LLM Application Dev with LangChain
├── Duration: 4-5 hours
├── Task: Production LangChain patterns
├── Deliverable: Robust, maintainable agent code
├── Actions:
│   1. Read SKILL.md (production patterns, error handling, logging)
│   2. Implement the agent lifecycle:
│      - Setup (load models, prepare context)
│      - Execution (with retries)
│      - Cleanup (save state, log metrics)
│   3. Add retry logic for LLM calls:
│      @retry(max_attempts=3, backoff=exponential)
│      def invoke_agent(self, input):
│          return self.llm.predict(...)
│   4. Add graceful degradation:
│      If the LLM fails, return a cached result
│      If the vector store fails, fall back to a rule-based result
│   5. Implement agent composition:
│      Multi-step workflows where agents call other agents
│   6. Test: 99.99% uptime in staging
└── Code Location: src/agents/base_agent.py (REFINED)
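The `@retry` decorator in step 3 is pseudocode; a minimal hand-rolled version with exponential backoff (names and defaults here are illustrative — a library such as tenacity provides the same pattern) might look like:

```python
import functools
import time

def retry(max_attempts: int = 3, base_delay: float = 0.5):
    """Retry a flaky call, doubling the delay between attempts."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts:
                        raise  # out of attempts: surface the real error
                    # Exponential backoff: base, 2*base, 4*base, ...
                    time.sleep(base_delay * 2 ** (attempt - 1))
        return wrapper
    return decorator
```

Graceful degradation (step 4) then lives in the `except` path of the caller: catch the final exception and fall back to the cached or rule-based result instead of propagating it.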
SKILL #33: Webhook Receiver Hardener
├── Duration: 2-3 hours
├── Task: Secure webhook handling (for integrations)
├── Deliverable: Webhook endpoint with signature verification
├── Actions:
│   1. Read SKILL.md (signature verification, replay protection)
│   2. If accepting webhooks from external systems:
│      - Verify the HMAC signature
│      - Check the timestamp (prevent replay attacks)
│      - Handle idempotency keys
│   3. Example: an EHR system sends patient updates
│      POST /webhooks/patient-update
│      Verify: X-Webhook-Signature header
│      Prevent: the same update being processed twice
│   4. Create api/app/webhooks/ (NEW if needed)
│   5. Test: webhook security scenarios
└── Code Location: api/app/webhooks/ (OPTIONAL)
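The three checks in step 2 can be sketched in a few lines of stdlib Python; the function name, parameter shape, and 5-minute skew window below are assumptions, not the actual api/app/webhooks/ implementation:

```python
import hashlib
import hmac
import time

MAX_SKEW_SECONDS = 300  # reject payloads older than 5 minutes (replay guard)

def verify_webhook(secret: bytes, body: bytes, signature: str,
                   timestamp: float, seen_keys: set, idempotency_key: str) -> bool:
    """Accept a webhook only if the HMAC matches, the timestamp is fresh,
    and the idempotency key has not been processed before."""
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, signature):
        return False  # forged or corrupted payload
    if abs(time.time() - timestamp) > MAX_SKEW_SECONDS:
        return False  # stale: possible replay attack
    if idempotency_key in seen_keys:
        return False  # duplicate delivery: already processed
    seen_keys.add(idempotency_key)
    return True
```

`hmac.compare_digest` is used instead of `==` to avoid timing side channels; in production the seen-keys set would live in a shared store (e.g. Redis with a TTL) rather than process memory.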
Week 12: Days 56-60
SKILL #7: RAG Agent Builder
├── Duration: 4-5 hours
├── Task: Full RAG agent architecture review
├── Deliverable: Production-ready RAG agents
├── Actions:
│   1. Read SKILL.md (RAG agent design, retrieval QA chains)
│   2. Comprehensive RAG review:
│      - Retriever quality (hybrid search, ranking)
│      - Prompt quality (citations, evidence)
│      - Response quality (accurate, safe)
│   3. Disease Explainer Agent refactor:
│      Step 1: Retrieve relevant medical documents
│      Step 2: Extract key evidence from the documents
│      Step 3: Synthesize an explanation with citations
│      Step 4: Assess confidence (high/medium/low)
│   4. Test: all responses have citations
│   5. Test: no medical hallucinations
│   6. Benchmark: accuracy, latency, cost
└── Code Location: src/agents/ (FINAL REVIEW)
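The four-step refactor can be sketched as plain functions; the retriever is stubbed and every name here is hypothetical, but it shows the citation-enforcement and confidence-grading shape the tests in steps 4-5 check for:

```python
from dataclasses import dataclass

@dataclass
class Evidence:
    source: str  # citation identifier, e.g. "PMID:12345"
    text: str

def retrieve(query: str) -> list[Evidence]:
    # Step 1: hybrid search over the medical corpus (stubbed here).
    return [Evidence("PMID:12345", "HbA1c >= 6.5% is diagnostic for diabetes.")]

def synthesize(query: str, docs: list[Evidence]) -> dict:
    # Steps 2-4: extract evidence, build the explanation, attach citations,
    # and grade confidence by how much evidence supports the answer.
    citations = [d.source for d in docs]
    confidence = "high" if len(docs) >= 3 else "medium" if docs else "low"
    return {"answer": f"Explanation for: {query}",
            "citations": citations, "confidence": confidence}

def explain(query: str) -> dict:
    docs = retrieve(query)
    result = synthesize(query, docs)
    # Citation enforcement: never return an answer without sources.
    assert result["citations"], "no citations found"
    return result
```

Keeping each step a separate function makes "all responses have citations" a one-line invariant rather than something buried inside a prompt.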
Final Week Integration (Days 56-60):
SKILL #2: Workflow Orchestration (Refinement)
├── Final review of the entire workflow
├── Ensure all agents work together
└── Test end-to-end: CLI and API
Comprehensive Testing:
├── Functional tests: all features work
├── Security tests: no vulnerabilities
├── Performance tests: <20s latency
└── Load tests: handle 10 concurrent requests
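The load-test requirement can be smoke-tested with a thread pool; `handle_request` below is a stand-in for a real HTTP call to the API (the endpoint name is hypothetical), and the budget mirrors the <20s target above:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def handle_request(i: int) -> dict:
    # Stand-in for a real HTTP call (e.g. POST to the prediction endpoint).
    time.sleep(0.05)  # simulate server-side work
    return {"id": i, "status": 200}

def load_test(n: int = 10, max_total_s: float = 20.0) -> list[dict]:
    """Fire n concurrent requests and fail if the batch exceeds the budget."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n) as pool:
        results = list(pool.map(handle_request, range(n)))
    assert time.perf_counter() - start < max_total_s, "latency budget exceeded"
    return results
```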
Documentation:
├── Update README with new features
├── Document the API at /docs
├── Create a deployment guide
└── Create a troubleshooting guide
Production Deployment:
├── Stage: test against a real environment
├── Canary: 10% of traffic
├── Monitor: errors, latency, accuracy
└── Full deployment: 100% of traffic
END OF PHASE 4 OUTCOMES:
✅ FastAPI optimized for production
✅ API documentation auto-generated
✅ Code review standards established
✅ Full observability (logging, metrics)
✅ CI/CD with automated deployment
✅ Security best practices implemented
✅ Production-ready RAG agents
✅ System deployed and monitored
────────────────────────────────────────────────────────────────────────────────
IMPLEMENTATION SUMMARY
────────────────────────────────────────────────────────────────────────────────
SKILLS USED IN ORDER:
Phase 1 (Security + Fixes): 2, 3, 4, 16, 17, 18, 19, 20, 22
Phase 2 (Testing + Agents): 22, 26, 4, 13, 14, 5, 21, 27, 24
Phase 3 (Retrieval + Graphs): 8, 9, 10, 11, 12, 1, 28, 15
Phase 4 (Production): 25, 29, 30, 27, 23, 31, 32(*), 6, 33(*), 7
(*) Optional based on needs
TOTAL IMPLEMENTATION TIME:
Phase 1: ~30-40 hours
Phase 2: ~35-45 hours
Phase 3: ~30-40 hours
Phase 4: ~30-40 hours
─────────────────────
TOTAL: ~130-160 hours over 12 weeks (~10-12 hours/week)
EXPECTED OUTCOMES:
Metrics:
Test Coverage: 70% → 90%+
Response Latency: 25s → 15-20s (-30%)
Accuracy: 65% → 80% (+15-20%)
API Costs: -40% via optimization
Citations: 0% → 100%
Quality:
✅ OWASP compliant
✅ HIPAA aligned
✅ Production-ready
✅ Enterprise monitoring
✅ Automated deployments
System Capabilities:
✅ Hybrid semantic + keyword search
✅ Knowledge graphs for reasoning
✅ Cost-optimized LLM routing
✅ Full citation enforcement
✅ Advanced observability
────────────────────────────────────────────────────────────────────────────────
WEEKLY CHECKLIST
────────────────────────────────────────────────────────────────────────────────
Each week, verify:
☐ Code committed with clear commit messages
☐ Tests pass locally: pytest -v --cov
☐ Coverage >85% on any new code
☐ PR created with documentation
☐ Code reviewed (self or team)
☐ No security warnings
☐ Documentation updated
☐ Metrics tracked (custom dashboard)
☐ No breaking changes to API
────────────────────────────────────────────────────────────────────────────────
DONE! Your 4-month implementation plan is ready.
Start with Phase 1 Week 1.
Execute systematically.
Measure progress weekly.
Celebrate wins!
Your RagBot will be enterprise-grade. 🚀