# MediGuard AI / RagBot - Comprehensive Remediation Plan

> **Generated:** February 24, 2026
> **Status:** ✅ COMPLETED
> **Last Updated:** Session completion
> **Priority Levels:** P0 (Critical) → P3 (Nice-to-have)

---

## Implementation Status

| # | Issue | Status | Notes |
|---|-------|--------|-------|
| 1 | Dual Architecture | ✅ Complete | Consolidated to src/main.py |
| 2 | Fake ML Prediction | ✅ Complete | Renamed to rule-based heuristics |
| 3 | Vector Store Abstraction | ✅ Complete | Created unified retriever interface |
| 4 | Evolution System | ✅ Complete | Archived to archive/evolution/ |
| 5 | Evaluation System | ✅ Complete | Added deterministic mode |
| 6 | HuggingFace Duplication | ✅ Complete | Reduced from 1175 → 1086 lines |
| 7 | Test Coverage | ✅ Complete | Added tests/test_integration.py |
| 8 | Database Schema | ⏭️ Deferred | Not needed for HuggingFace |
| 9 | Documentation | ✅ Complete | README.md updated |
| 10 | Gradio Dependencies | ✅ Complete | Shared utils created |

---

## Table of Contents

1. [Executive Summary](#executive-summary)
2. [Issue 1: Dual Architecture Confusion](#issue-1-dual-architecture-confusion-p0)
3. [Issue 2: Fake ML Disease Prediction](#issue-2-fake-ml-disease-prediction-p1)
4. [Issue 3: Vector Store Abstraction](#issue-3-vector-store-abstraction-p1)
5. [Issue 4: Orphaned Evolution System](#issue-4-orphaned-evolution-system-p2)
6. [Issue 5: Unreliable Evaluation System](#issue-5-unreliable-evaluation-system-p2)
7. [Issue 6: HuggingFace Code Duplication](#issue-6-huggingface-code-duplication-p2)
8. [Issue 7: Inadequate Test Coverage](#issue-7-inadequate-test-coverage-p1)
9. [Issue 8: Database Schema Unused](#issue-8-database-schema-unused-p3)
10. [Issue 9: Documentation Misalignment](#issue-9-documentation-misalignment-p1)
11. [Issue 10: Gradio App Dependencies](#issue-10-gradio-app-dependencies-p2)
12. [Implementation Roadmap](#implementation-roadmap)

---

## Executive Summary

The RagBot codebase has **10 structural issues** that create confusion and maintenance burden, and that lead to misleading claims. The most critical issues are:

| Priority | Issue | Impact | Effort |
|----------|-------|--------|--------|
| P0 | Dual Architecture | High confusion, duplicated code paths | 3-5 days |
| P1 | Fake ML Prediction | Misleading users, false claims | 2-3 days |
| P1 | Vector Store Mess | Production vs local mismatch | 2 days |
| P1 | Missing Tests | Unreliable deployments | 3-4 days |
| P1 | Doc Misalignment | User confusion | 1 day |
| P2 | Orphaned Evolution | Dead code, wasted complexity | 1-2 days |
| P2 | Evaluation System | Unreliable quality metrics | 2 days |
| P2 | HuggingFace Duplication | 1175-line standalone app | 2-3 days |
| P2 | Gradio Dependencies | Can't run standalone | 0.5 days |
| P3 | Unused Database | Alembic setup with no migrations | 1 day |

---

## Issue 1: Dual Architecture Confusion (P0)

### Problem

Two competing LangGraph workflows exist:

| Component | Path | Purpose |
|-----------|------|---------|
| **ClinicalInsightGuild** | `src/workflow.py` | Original 6-agent biomarker analysis |
| **AgenticRAGService** | `src/services/agents/agentic_rag.py` | Newer Q&A RAG pipeline |

The API routes between them confusingly:

- `/analyze/*` → ClinicalInsightGuild via `api/app/services/ragbot.py`
- `/ask` → AgenticRAGService via `src/routers/ask.py`

**Evidence:**

- `src/main.py` initializes BOTH services at startup (lines 91-106)
- `api/app/main.py` is a SEPARATE FastAPI app from `src/main.py`
- Users don't know which one is "production"

### Solution

**Option A: Merge into Single Unified Pipeline (Recommended)**

```
┌────────────────────────────────────────────────────────────────┐
│                     Unified RAG Pipeline                       │
├────────────────────────────────────────────────────────────────┤
│  Input → Guardrail → Router → ┬→ Biomarker Analysis Path       │
│                               │    (6 specialist agents)       │
│                               └→ General Q&A Path              │
│                                    (retrieve → grade → gen)    │
│                     → Output Synthesizer → Response            │
└────────────────────────────────────────────────────────────────┘
```

**Implementation Steps:**

1. **Create unified graph** in `src/pipelines/unified_rag.py` (see the sketch after this list):

   ```python
   # Merge both workflows into one StateGraph
   # Use routing logic from guardrail_node to dispatch
   ```

2. **Delete redundant files:**
   - Move `api/app/` logic into `src/routers/`
   - Delete `api/app/main.py` (use `src/main.py` only)
   - Keep `api/app/services/ragbot.py` as a legacy adapter

3. **Single entry point:**
   - `src/main.py` becomes THE server
   - `uvicorn src.main:app` everywhere

4. **Update imports:**

   ```python
   # In src/main.py, replace:
   from api.app.services.ragbot import get_ragbot_service
   # With:
   from src.pipelines.unified_rag import UnifiedRAGService
   ```

**Files to Create:**

- `src/pipelines/__init__.py`
- `src/pipelines/unified_rag.py`
- `src/pipelines/nodes/__init__.py` (merge all nodes)

**Files to Delete/Archive:**

- `api/app/main.py` → Archive to `api/app/main_legacy.py`
- `api/app/routes/` → Merge into `src/routers/`
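Implementation step 1 is the crux of the merge. Below is a minimal sketch of how the two workflows could share one `StateGraph`, assuming the existing node functions can be wired in directly; the node names, state fields, and routing heuristic here are placeholders, not the repo's actual identifiers.

```python
# src/pipelines/unified_rag.py — sketch only; node bodies are stubs
# standing in for the real guild and Q&A implementations.
from typing import Literal, TypedDict

from langgraph.graph import END, StateGraph


class UnifiedState(TypedDict, total=False):
    query: str
    route: str          # "biomarker" or "qa", set by the guardrail
    final_response: dict


def guardrail_node(state: UnifiedState) -> UnifiedState:
    # Placeholder routing heuristic; the real logic lives in the
    # existing guardrail_node.
    is_biomarker = "glucose" in state["query"].lower()
    return {"route": "biomarker" if is_biomarker else "qa"}


def dispatch(state: UnifiedState) -> Literal["biomarker", "qa"]:
    return state["route"]  # decision read by the conditional edge


graph = StateGraph(UnifiedState)
graph.add_node("guardrail", guardrail_node)
graph.add_node("biomarker_analysis", lambda s: s)  # wraps the 6-agent guild
graph.add_node("qa_pipeline", lambda s: s)         # wraps retrieve → grade → gen
graph.add_node("synthesize", lambda s: s)          # shared output synthesizer

graph.set_entry_point("guardrail")
graph.add_conditional_edges(
    "guardrail", dispatch,
    {"biomarker": "biomarker_analysis", "qa": "qa_pipeline"},
)
graph.add_edge("biomarker_analysis", "synthesize")
graph.add_edge("qa_pipeline", "synthesize")
graph.add_edge("synthesize", END)

unified_app = graph.compile()
```

The key design point is the single conditional edge: both paths converge on one synthesizer, so callers invoke one graph regardless of query type.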
---

## Issue 2: Fake ML Disease Prediction (P1)

### Problem

The README claims "ML prediction", but `predict_disease_simple()` is pure if/else:

```python
# scripts/chat.py lines 151-216
if glucose > 126:
    scores["Diabetes"] += 0.4
if hba1c >= 6.5:
    scores["Diabetes"] += 0.5
```

There is also an LLM-based predictor (`predict_disease_llm()`) that simply asks an LLM to guess.

### Solution

**Option A: Be Honest (Quick Fix)**

Update all documentation to say "rule-based heuristics", not "ML prediction":

```markdown
# In README.md:
- **Disease Prediction** - Rule-based scoring on 5 conditions
  (Diabetes, Anemia, Heart Disease, Thrombocytopenia, Thalassemia)
```

**Option B: Implement Real ML (Longer)**

1. **Create a proper classifier:**

   ```python
   # src/models/disease_classifier.py
   import joblib

   class DiseaseClassifier:
       def __init__(self, model_path: str = "models/disease_rf.joblib"):
           self.model = joblib.load(model_path)
           self.feature_names = [...]  # 24 biomarkers

       def predict(self, biomarkers: dict) -> dict:
           features = self._to_feature_vector(biomarkers)
           proba = self.model.predict_proba([features])[0]
           return {
               "disease": self.model.classes_[proba.argmax()],
               "confidence": float(proba.max()),
               "probabilities": dict(zip(self.model.classes_, proba.tolist())),
           }
   ```

2. **Train on synthetic data** (see the sketch after this list):
   - Create `scripts/train_disease_model.py`
   - Generate synthetic patient data with known conditions
   - Train a RandomForest/XGBoost classifier
   - Save to `models/disease_rf.joblib`

3. **Replace predictor calls:**

   ```python
   # Instead of predict_disease_simple(biomarkers)
   from src.models.disease_classifier import get_classifier
   prediction = get_classifier().predict(biomarkers)
   ```

**Recommendation:** Do Option A immediately, Option B as a follow-up feature.
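For step 2, a minimal sketch of what `scripts/train_disease_model.py` could look like. The feature set and labeling rules are illustrative assumptions, not the repo's actual biomarker list; note also that labeling synthetic data with the same thresholds the heuristics use only re-learns those rules, so real labeled data would be needed for genuine ML value.

```python
# scripts/train_disease_model.py — illustrative sketch; features and
# label rules are assumptions, not the repo's real biomarker set.
import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

FEATURES = ["Glucose", "HbA1c", "Hemoglobin", "MCV", "Platelets"]
rng = np.random.default_rng(42)


def make_sample() -> tuple[list[float], str]:
    """Draw one synthetic patient and label it with a simple rule."""
    glucose = rng.normal(110, 40)
    hba1c = rng.normal(5.8, 1.5)
    hemoglobin = rng.normal(13.5, 2.5)
    mcv = rng.normal(88, 10)
    platelets = rng.normal(250, 80)
    if glucose > 126 or hba1c >= 6.5:
        label = "Diabetes"
    elif hemoglobin < 11:
        label = "Anemia"
    elif platelets < 150:
        label = "Thrombocytopenia"
    else:
        label = "Healthy"
    return [glucose, hba1c, hemoglobin, mcv, platelets], label


X, y = zip(*(make_sample() for _ in range(5000)))
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)
print(f"Held-out accuracy: {model.score(X_test, y_test):.3f}")

joblib.dump(model, "models/disease_rf.joblib")  # models/ must exist
```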
---

## Issue 3: Vector Store Abstraction (P1)

### Problem

Two different vector stores are used inconsistently:

| Context | Store | Configuration |
|---------|-------|---------------|
| Local dev | FAISS | `data/vector_stores/medical_knowledge.faiss` |
| Production | OpenSearch | `OPENSEARCH__HOST` env var |
| HuggingFace | FAISS | Bundled in `huggingface/` |

The code has:

- `src/pdf_processor.py` → FAISS
- `src/services/opensearch/client.py` → OpenSearch
- `src/services/agents/nodes/retrieve_node.py` → OpenSearch only

### Solution

**Create a unified retriever interface:**

```python
# src/services/retrieval/interface.py
from abc import ABC, abstractmethod
from typing import Any, Dict, List

class BaseRetriever(ABC):
    @abstractmethod
    def search(self, query: str, top_k: int = 10) -> List[Dict[str, Any]]:
        """Return list of {id, score, text, title, section, metadata}"""
        pass

    @abstractmethod
    def search_hybrid(self, query: str, embedding: List[float],
                      top_k: int = 10) -> List[Dict[str, Any]]:
        pass
```

```python
# src/services/retrieval/faiss_retriever.py
class FAISSRetriever(BaseRetriever):
    def __init__(self, vector_store_path: str, embedding_model):
        self.store = FAISS.load_local(vector_store_path, embedding_model, ...)

    def search(self, query: str, top_k: int = 10):
        # similarity_search returns no scores; switch to
        # similarity_search_with_score to populate "score" properly.
        docs = self.store.similarity_search(query, k=top_k)
        return [{"id": i, "score": 0, "text": d.page_content, ...}
                for i, d in enumerate(docs)]
```

```python
# src/services/retrieval/opensearch_retriever.py
class OpenSearchRetriever(BaseRetriever):
    def __init__(self, client: OpenSearchClient):
        self.client = client

    def search(self, query: str, top_k: int = 10):
        return self.client.search_bm25(query, top_k=top_k)
```

```python
# src/services/retrieval/__init__.py
def get_retriever() -> BaseRetriever:
    """Factory that returns the appropriate retriever based on config."""
    settings = get_settings()
    if settings.opensearch.host and _opensearch_available():
        return OpenSearchRetriever(make_opensearch_client())
    else:
        return FAISSRetriever("data/vector_stores", get_embedding_model())
```

**Update `retrieve_node.py`:**

```python
def retrieve_node(state: dict, *, context: Any) -> dict:
    retriever = context.retriever  # Now uses the unified interface
    results = retriever.search_hybrid(query, embedding, top_k=10)
    ...
```
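The factory above calls `_opensearch_available()`, which the plan leaves undefined. A minimal sketch, assuming the client returned by `make_opensearch_client()` exposes a `ping()` probe (as the underlying `opensearch-py` client does; whether the `OpenSearchClient` wrapper forwards it is an assumption):

```python
# src/services/retrieval/__init__.py — sketch; ping() on the wrapper
# is an assumption about OpenSearchClient's surface.
def _opensearch_available() -> bool:
    """Probe the cluster so get_retriever() can fall back to FAISS."""
    try:
        return bool(make_opensearch_client().ping())
    except Exception:
        return False  # fall back rather than fail at startup


# Callers never branch on the backend again:
retriever = get_retriever()
hits = retriever.search("elevated HbA1c management", top_k=5)
```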
---

## Issue 4: Orphaned Evolution System (P2)

### Problem

`src/evolution/` contains a complete SOP evolution system that:

- Has `SOPGenePool` for versioning
- Has `performance_diagnostician()` for diagnosis
- Has `sop_architect()` for mutations
- Has an Airflow DAG (`airflow/dags/sop_evolution.py`)

**But:**

- No Airflow deployment exists
- `run_evolution_cycle()` requires manual invocation
- No UI to trigger evolution
- No tracking of which SOP version is in use

### Solution

**Option A: Remove It (Quick)**

Delete or archive the unused code:

```
mkdir -p archive/evolution
mv src/evolution/* archive/evolution/
mv airflow/dags/sop_evolution.py archive/
```

Update imports to remove references.

**Option B: Wire It Up (If Actually Wanted)**

1. **Add a CLI command:**

   ```python
   # scripts/evolve_sop.py
   from src.evolution.director import run_evolution_cycle
   from src.workflow import create_guild

   if __name__ == "__main__":
       # SOPGenePool needs importing from its module in src/evolution/
       gene_pool = SOPGenePool()
       # Load baseline, run evolution, save results
   ```

2. **Add an API endpoint:**

   ```python
   # src/routers/admin.py
   @router.post("/admin/evolve")
   async def trigger_evolution(request: Request):
       # Requires admin auth
       result = run_evolution_cycle(...)
       return {"new_versions": len(result)}
   ```

3. **Persist to database:**
   - Use Alembic migrations to create a `sop_versions` table
   - Store evolved SOPs with evaluation scores

---

## Issue 5: Unreliable Evaluation System (P2)

### Problem

`src/evaluation/evaluators.py` uses LLM-as-judge for:

- `evaluate_clinical_accuracy()` - LLM grades medical correctness
- `evaluate_actionability()` - LLM grades recommendations

**Problems:**

1. LLMs are unreliable judges of medical accuracy
2. No ground-truth comparison
3. Scores can fluctuate between runs
4. Falls back to 0.5 on JSON parse errors (line 91)

### Solution

**Replace with deterministic metrics where possible:**

```python
# For clinical_accuracy: use BiomarkerValidator as ground truth
# (assumes GradedScore and BiomarkerValidator come from src/evaluation)
def evaluate_clinical_accuracy_v2(response: Dict, biomarkers: Dict) -> GradedScore:
    validator = BiomarkerValidator()
    # Check whether the flagged biomarkers match the validator
    expected_flags = validator.validate_all(biomarkers)[0]
    actual_flags = response.get("biomarker_flags", [])

    expected_abnormal = {f.name for f in expected_flags if f.status != "NORMAL"}
    actual_abnormal = {f["name"] for f in actual_flags if f["status"] != "NORMAL"}

    precision = len(expected_abnormal & actual_abnormal) / max(len(actual_abnormal), 1)
    recall = len(expected_abnormal & actual_abnormal) / max(len(expected_abnormal), 1)
    f1 = 2 * precision * recall / max(precision + recall, 0.001)

    return GradedScore(
        score=f1,
        reasoning=f"Precision: {precision:.2f}, Recall: {recall:.2f}"
    )
```

(A regression test pinning this determinism appears at the end of this issue.)

**Keep LLM-as-judge only for subjective metrics:**

- Clarity (readability) - already programmatic ✓
- Helpfulness of recommendations - needs human judgment

**Add human-in-the-loop:**

```python
# src/evaluation/human_eval.py
def collect_human_rating(response_id: str) -> Optional[float]:
    """Store human ratings for later analysis."""
    # Integrate with Langfuse or a custom feedback endpoint
```
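The point of the deterministic rewrite is that it can be pinned by an ordinary regression test, which an LLM judge never could. A sketch, assuming `GradedScore` exposes a `score` field and the flag shapes match the snippet above:

```python
# tests/test_evaluation.py — sketch; response/flag shapes are assumed
# to match evaluate_clinical_accuracy_v2() above.
from src.evaluation.evaluators import evaluate_clinical_accuracy_v2


def test_clinical_accuracy_is_deterministic():
    biomarkers = {"Glucose": 185, "HbA1c": 8.2}
    response = {"biomarker_flags": [
        {"name": "Glucose", "status": "HIGH"},
        {"name": "HbA1c", "status": "HIGH"},
    ]}
    first = evaluate_clinical_accuracy_v2(response, biomarkers)
    second = evaluate_clinical_accuracy_v2(response, biomarkers)
    # Same inputs must always yield the same score — no LLM variance.
    assert first.score == second.score
    assert 0.0 <= first.score <= 1.0
```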
---

## Issue 6: HuggingFace Code Duplication (P2)

### Problem

`huggingface/app.py` is **1175 lines** that reimplement:

- Biomarker parsing (duplicated from chat.py)
- Disease prediction (duplicated)
- Guild initialization (duplicated)
- Gradio UI (different from src/gradio_app.py)
- Environment handling (custom)

### Solution

**Refactor to import from the main package:**

```python
# huggingface/app.py (simplified to ~200 lines)
import sys
sys.path.insert(0, "..")

from functools import lru_cache

from src.workflow import create_guild
from src.state import PatientInput
from scripts.chat import extract_biomarkers, predict_disease_simple


@lru_cache(maxsize=1)
def get_guild():
    """Build the guild once and reuse it across requests."""
    return create_guild()


# Only Gradio-specific code here
def analyze_biomarkers(input_text: str):
    biomarkers, context = extract_biomarkers(input_text)
    prediction = predict_disease_simple(biomarkers)
    patient_input = PatientInput(
        biomarkers=biomarkers,
        model_prediction=prediction,
        patient_context=context
    )
    guild = get_guild()
    result = guild.run(patient_input)
    return format_result(result)

# Gradio interface...
```

**Create a shared utilities module** (see the sketch below):

```python
# src/utils/biomarker_extraction.py
# Move extract_biomarkers() from chat.py here

# src/utils/disease_scoring.py
# Move predict_disease_simple() here
```
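To make the move concrete, a sketch of the extracted module's surface. The regex and return shape are illustrative assumptions only; the real body should be `extract_biomarkers()` from `scripts/chat.py` moved over verbatim.

```python
# src/utils/biomarker_extraction.py — illustrative sketch; the real
# implementation is extract_biomarkers() moved verbatim from scripts/chat.py.
import re
from typing import Dict, Tuple

# Matches lines like "Glucose: 185" or "HbA1c = 8.2" (assumed input format)
_PAIR = re.compile(r"(?P<name>[A-Za-z][A-Za-z0-9 /%-]*?)\s*[:=]\s*(?P<value>\d+(?:\.\d+)?)")


def extract_biomarkers(text: str) -> Tuple[Dict[str, float], Dict[str, str]]:
    """Return (biomarkers, patient_context) parsed from free text."""
    biomarkers = {m["name"].strip(): float(m["value"]) for m in _PAIR.finditer(text)}
    context: Dict[str, str] = {}  # age/gender parsing elided in this sketch
    return biomarkers, context
```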
---

## Issue 7: Inadequate Test Coverage (P1)

### Problem

Current tests are mostly:

- Import validation (`test_basic.py`)
- Unit tests with mocks (`test_agentic_rag.py`)
- Schema validation (`test_schemas.py`)

**Missing:**

- End-to-end workflow tests
- API integration tests
- Regression tests for medical accuracy

### Solution

**Add integration tests:**

```python
# tests/integration/test_full_workflow.py
import pytest
from src.workflow import create_guild
from src.state import PatientInput


@pytest.fixture(scope="module")
def guild():
    return create_guild()


def test_diabetes_patient_analysis(guild):
    patient = PatientInput(
        biomarkers={"Glucose": 185, "HbA1c": 8.2},
        model_prediction={"disease": "Diabetes", "confidence": 0.87, "probabilities": {}},
        patient_context={"age": 52, "gender": "male"}
    )
    result = guild.run(patient)

    # Assertions
    assert result.get("final_response") is not None
    assert len(result.get("biomarker_flags", [])) >= 2
    assert any(f["name"] == "Glucose" for f in result["biomarker_flags"])
    assert "Diabetes" in result["final_response"]["prediction_explanation"]["primary_disease"]


def test_anemia_patient_analysis(guild):
    patient = PatientInput(
        biomarkers={"Hemoglobin": 9.5, "MCV": 75},
        model_prediction={"disease": "Anemia", "confidence": 0.75, "probabilities": {}},
        patient_context={}
    )
    result = guild.run(patient)
    assert result.get("final_response") is not None
```

**Add API tests:**

```python
# tests/integration/test_api_endpoints.py
import pytest
from fastapi.testclient import TestClient
from src.main import app


@pytest.fixture
def client():
    return TestClient(app)


def test_health_endpoint(client):
    response = client.get("/health")
    assert response.status_code == 200
    assert response.json()["status"] == "healthy"


def test_analyze_structured(client):
    response = client.post("/analyze/structured", json={
        "biomarkers": {"Glucose": 140, "HbA1c": 7.0}
    })
    assert response.status_code == 200
    assert "prediction" in response.json()
```

**Add to CI:**

```yaml
# .github/workflows/test.yml
- name: Run integration tests
  run: pytest tests/integration/ -v
  env:
    GROQ_API_KEY: ${{ secrets.GROQ_API_KEY }}
```

---

## Issue 8: Database Schema Unused (P3)

### Problem

- `alembic/` is configured but `alembic/versions/` is empty
- `src/database.py` exists but is barely used
- `src/db/models.py` defines tables that aren't created

### Solution

**If database features are wanted:**

1. Create the initial migration:

   ```bash
   cd src
   alembic revision --autogenerate -m "Initial schema"
   alembic upgrade head
   ```

2. Use the models for:
   - Storing analysis history
   - Persisting evolved SOPs (see the model sketch at the end of this issue)
   - User feedback collection

**If not needed:**

- Remove the `alembic/` directory
- Remove `src/database.py`
- Remove `src/db/` if empty
- Remove `postgres` from `docker-compose.yml`
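For the "persisting evolved SOPs" use case (also Issue 4, step 3), one example of what a migration-backed table could look like. `src/db/models.py` may already define something similar; every name below is illustrative.

```python
# src/db/models.py — illustrative sketch of a sop_versions table;
# column names and the Base location are assumptions, not the repo's schema.
from sqlalchemy import Boolean, Column, DateTime, Float, Integer, String, Text, func
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class SOPVersion(Base):
    __tablename__ = "sop_versions"

    id = Column(Integer, primary_key=True)
    agent_name = Column(String(64), nullable=False)  # which specialist agent
    sop_text = Column(Text, nullable=False)          # the evolved SOP body
    eval_score = Column(Float)                       # score from src/evaluation
    is_active = Column(Boolean, default=False)       # tracks the version in use
    created_at = Column(DateTime, server_default=func.now())
```

With a model registered on `Base.metadata` (and Alembic's `target_metadata` pointing at it), the `--autogenerate` command in step 1 picks the table up.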
---

## Issue 9: Documentation Misalignment (P1)

### Problem

README.md claims:

- "ML prediction" → It's rule-based
- "6 Specialist Agents" → There is also the agentic RAG pipeline (7+ nodes)
- "Production-ready" → Two competing entry points

### Solution

**Update README.md:**

```markdown
## How It Works

### Analysis Pipeline

RagBot uses a **multi-agent LangGraph workflow** to analyze biomarkers:

1. **Input Routing** - Validates that a query is medical, routes to analysis or Q&A
2. **Biomarker Analyzer** - Validates values against clinical reference ranges
3. **Disease Scorer** - Rule-based heuristics predict the most likely condition
4. **Disease Explainer** - RAG retrieval for pathophysiology from medical PDFs
5. **Guidelines Agent** - RAG retrieval for treatment recommendations
6. **Response Synthesizer** - Compiles findings into a patient-friendly summary

### Supported Conditions

- Diabetes (via Glucose, HbA1c)
- Anemia (via Hemoglobin, MCV)
- Heart Disease (via Cholesterol, Troponin, LDL)
- Thrombocytopenia (via Platelets)
- Thalassemia (via MCV + Hemoglobin pattern)

> **Note:** Disease prediction uses rule-based scoring, not ML models.
> Future versions may include trained classifiers.
```

---

## Issue 10: Gradio App Dependencies (P2)

### Problem

`src/gradio_app.py` is just an HTTP client:

```python
def _call_ask(question: str) -> str:
    resp = client.post(f"{API_BASE}/ask", json={"question": question})
```

It requires the FastAPI server to be running at `http://localhost:8000`.

### Solution

**Option A: Document the dependency clearly**

Add startup instructions:

````markdown
## Running the Gradio UI

1. Start the API server:

   ```bash
   uvicorn src.main:app --reload
   ```

2. In another terminal, start Gradio:

   ```bash
   python -m src.gradio_app
   ```

3. Open http://localhost:7860
````

**Option B: Add embedded mode**

```python
# src/gradio_app.py
def _call_ask_embedded(question: str) -> str:
    """Direct workflow invocation without HTTP."""
    from src.services.agents.agentic_rag import AgenticRAGService
    # Assumes a cached get_rag_service() accessor for AgenticRAGService
    service = get_rag_service()
    result = service.ask(query=question)
    return result.get("final_answer", "No answer.")


def launch_gradio(embedded: bool = False, share: bool = False):
    ask_fn = _call_ask_embedded if embedded else _call_ask
    # ... rest of UI
```

---

## Implementation Roadmap

### Phase 1: Critical Fixes (Week 1)

| Day | Task | Owner |
|-----|------|-------|
| 1 | Fix documentation claims (README.md) | - |
| 1-2 | Consolidate entry points (delete api/app/main.py) | - |
| 2-3 | Create unified retriever interface | - |
| 3-4 | Add integration tests for workflow | - |
| 5 | Update Gradio startup docs | - |

### Phase 2: Architecture Cleanup (Week 2)

| Day | Task | Owner |
|-----|------|-------|
| 1-2 | Merge AgenticRAG + ClinicalInsightGuild | - |
| 3 | Refactor HuggingFace app to use shared code | - |
| 4 | Wire up or remove evolution system | - |
| 5 | Review and deploy | - |

### Phase 3: Quality Improvements (Week 3)

| Day | Task | Owner |
|-----|------|-------|
| 1 | Replace LLM-as-judge with deterministic metrics | - |
| 2 | Add proper disease classifier (optional) | - |
| 3-4 | Expand test coverage to 80%+ | - |
| 5 | Final documentation pass | - |

---

## Quick Wins (Do Today)

1. **Rename `predict_disease_simple`** to `score_disease_heuristic` to be honest
2. **Add an `## Architecture` section** to README explaining the two workflows
3. **Create `scripts/start_full.ps1`** that starts both the API and Gradio
4. **Delete the empty `alembic/versions/`** and document "DB not implemented"
5. **Add type hints** to the top 5 most-used functions

---

## Checklist

- [ ] P0: Single FastAPI entry point (`src/main.py` only)
- [ ] P1: Documentation accurately describes capabilities
- [ ] P1: Unified retriever interface (FAISS + OpenSearch)
- [ ] P1: Integration tests exist and pass
- [ ] P2: Evolution system removed or functional
- [ ] P2: HuggingFace app imports from main package
- [ ] P2: Evaluation metrics are deterministic
- [ ] P3: Database either used or removed