MediGuard AI / RagBot - Comprehensive Remediation Plan
Generated: February 24, 2026
Status: ✅ COMPLETED
Last Updated: Session completion
Priority Levels: P0 (Critical) → P3 (Nice-to-have)
Implementation Status
| # | Issue | Status | Notes |
|---|---|---|---|
| 1 | Dual Architecture | ✅ Complete | Consolidated to src/main.py |
| 2 | Fake ML Prediction | ✅ Complete | Renamed to rule-based heuristics |
| 3 | Vector Store Abstraction | ✅ Complete | Created unified retriever interface |
| 4 | Evolution System | ✅ Complete | Archived to archive/evolution/ |
| 5 | Evaluation System | ✅ Complete | Added deterministic mode |
| 6 | HuggingFace Duplication | ✅ Complete | Reduced from 1175→1086 lines |
| 7 | Test Coverage | ✅ Complete | Added tests/test_integration.py |
| 8 | Database Schema | ⏸️ Deferred | Not needed for HuggingFace |
| 9 | Documentation | ✅ Complete | README.md updated |
| 10 | Gradio Dependencies | ✅ Complete | Shared utils created |
Table of Contents
- Executive Summary
- Issue 1: Dual Architecture Confusion
- Issue 2: Fake ML Disease Prediction
- Issue 3: Vector Store Abstraction
- Issue 4: Orphaned Evolution System
- Issue 5: Unreliable Evaluation System
- Issue 6: HuggingFace Code Duplication
- Issue 7: Inadequate Test Coverage
- Issue 8: Database Schema Unused
- Issue 9: Documentation Misalignment
- Issue 10: Gradio App Dependencies
- Implementation Roadmap
Executive Summary
The RagBot codebase has 10 structural issues that create confusion, maintenance burden, and misleading claims. The most critical issues are:
| Priority | Issue | Impact | Effort |
|---|---|---|---|
| P0 | Dual Architecture | High confusion, duplicated code paths | 3-5 days |
| P1 | Fake ML Prediction | Misleading users, false claims | 2-3 days |
| P1 | Vector Store Mess | Production vs local mismatch | 2 days |
| P1 | Missing Tests | Unreliable deployments | 3-4 days |
| P1 | Doc Misalignment | User confusion | 1 day |
| P2 | Orphaned Evolution | Dead code, wasted complexity | 1-2 days |
| P2 | Evaluation System | Unreliable quality metrics | 2 days |
| P2 | HuggingFace Duplication | 1175-line standalone app | 2-3 days |
| P2 | Gradio Dependencies | Can't run standalone | 0.5 days |
| P3 | Unused Database | Alembic setup with no migrations | 1 day |
Issue 1: Dual Architecture Confusion (P0)
Problem
Two competing LangGraph workflows exist:
| Component | Path | Purpose |
|---|---|---|
| ClinicalInsightGuild | `src/workflow.py` | Original 6-agent biomarker analysis |
| AgenticRAGService | `src/services/agents/agentic_rag.py` | Newer Q&A RAG pipeline |
The API routes them confusingly:
- `/analyze/*` → ClinicalInsightGuild via `api/app/services/ragbot.py`
- `/ask` → AgenticRAGService via `src/routers/ask.py`
Evidence:
- `src/main.py` initializes BOTH services at startup (lines 91-106)
- `api/app/main.py` is a SEPARATE FastAPI app from `src/main.py`
- Users don't know which one is "production"
Solution
Option A: Merge into Single Unified Pipeline (Recommended)
```
┌────────────────────────────────────────────────────────────┐
│                    Unified RAG Pipeline                    │
├────────────────────────────────────────────────────────────┤
│ Input → Guardrail → Router ─┬─ Biomarker Analysis Path     │
│                             │  (6 specialist agents)       │
│                             └─ General Q&A Path            │
│                                (retrieve → grade → gen)    │
│ → Output Synthesizer → Response                            │
└────────────────────────────────────────────────────────────┘
```
Implementation Steps:

1. Create the unified graph in `src/pipelines/unified_rag.py`:

   ```python
   # Merge both workflows into one StateGraph
   # Use routing logic from guardrail_node to dispatch
   ```

2. Delete redundant files:
   - Move `api/app/logic` into `src/routers/`
   - Delete `api/app/main.py` (use `src/main.py` only)
   - Keep `api/app/services/ragbot.py` as a legacy adapter

3. Single entry point:
   - `src/main.py` becomes THE server
   - `uvicorn src.main:app` everywhere

4. Update imports:

   ```python
   # In src/main.py, replace:
   from api.app.services.ragbot import get_ragbot_service
   # With:
   from src.pipelines.unified_rag import UnifiedRAGService
   ```
Files to Create:
- `src/pipelines/__init__.py`
- `src/pipelines/unified_rag.py`
- `src/pipelines/nodes/__init__.py` (merge all nodes)

Files to Delete/Archive:
- `api/app/main.py` → Archive to `api/app/main_legacy.py`
- `api/app/routes/` → Merge into `src/routers/`
Issue 2: Fake ML Disease Prediction (P1)
Problem
The README claims "ML prediction" but predict_disease_simple() is pure if/else:
```python
# scripts/chat.py lines 151-216
if glucose > 126:
    scores["Diabetes"] += 0.4
if hba1c >= 6.5:
    scores["Diabetes"] += 0.5
```
There's also an LLM-based predictor (predict_disease_llm()) that just asks an LLM to guess.
Solution
Option A: Be Honest (Quick Fix)
Update all documentation to say "rule-based heuristics", not "ML prediction":

```markdown
# In README.md:
- **Disease Prediction** - Rule-based scoring on 5 conditions
  (Diabetes, Anemia, Heart Disease, Thrombocytopenia, Thalassemia)
```
Option B: Implement Real ML (Longer)
Create a proper classifier:
Create a proper classifier:

```python
# src/models/disease_classifier.py
from sklearn.ensemble import RandomForestClassifier
import joblib

class DiseaseClassifier:
    def __init__(self, model_path: str = "models/disease_rf.joblib"):
        self.model = joblib.load(model_path)
        self.feature_names = [...]  # 24 biomarkers

    def predict(self, biomarkers: dict) -> dict:
        features = self._to_feature_vector(biomarkers)
        proba = self.model.predict_proba([features])[0]
        return {
            "disease": self.model.classes_[proba.argmax()],
            "confidence": float(proba.max()),
            "probabilities": dict(zip(self.model.classes_, proba.tolist())),
        }
```

Train on synthetic data:
- Create `scripts/train_disease_model.py`
- Generate synthetic patient data with known conditions
- Train a RandomForest/XGBoost classifier
- Save to `models/disease_rf.joblib`

Replace predictor calls:

```python
# Instead of predict_disease_simple(biomarkers)
from src.models.disease_classifier import get_classifier
prediction = get_classifier().predict(biomarkers)
```
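The "generate synthetic patient data with known conditions" step could start like this. The conditions, biomarkers, and value ranges are illustrative placeholders only, not clinically validated numbers:

```python
# Hypothetical sketch for scripts/train_disease_model.py: draw biomarkers
# from condition-specific ranges so the label is recoverable from features.
import random

CONDITIONS = ["Healthy", "Diabetes", "Anemia"]

def synth_patient(condition: str, rng: random.Random) -> dict:
    if condition == "Diabetes":
        glucose, hb = rng.uniform(130, 250), rng.uniform(12, 16)
    elif condition == "Anemia":
        glucose, hb = rng.uniform(70, 110), rng.uniform(7, 11)
    else:  # Healthy
        glucose, hb = rng.uniform(70, 110), rng.uniform(12, 16)
    return {"Glucose": glucose, "Hemoglobin": hb, "label": condition}

def make_dataset(n: int, seed: int = 42) -> list:
    rng = random.Random(seed)  # seeded for reproducible training data
    return [synth_patient(rng.choice(CONDITIONS), rng) for _ in range(n)]
```

The rows feed straight into a RandomForest/XGBoost fit, with `label` as the target column.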
Recommendation: Do Option A immediately, Option B as a follow-up feature.
Issue 3: Vector Store Abstraction (P1)
Problem
Two different vector stores used inconsistently:
| Context | Store | Configuration |
|---|---|---|
| Local dev | FAISS | data/vector_stores/medical_knowledge.faiss |
| Production | OpenSearch | OPENSEARCH__HOST env var |
| HuggingFace | FAISS | Bundled in huggingface/ |
The code has:
- `src/pdf_processor.py` → FAISS
- `src/services/opensearch/client.py` → OpenSearch
- `src/services/agents/nodes/retrieve_node.py` → OpenSearch only
Solution
Create a unified retriever interface:

```python
# src/services/retrieval/interface.py
from abc import ABC, abstractmethod
from typing import List, Dict, Any

class BaseRetriever(ABC):
    @abstractmethod
    def search(self, query: str, top_k: int = 10) -> List[Dict[str, Any]]:
        """Return a list of {id, score, text, title, section, metadata}."""
        ...

    @abstractmethod
    def search_hybrid(self, query: str, embedding: List[float], top_k: int = 10) -> List[Dict[str, Any]]:
        ...
```

```python
# src/services/retrieval/faiss_retriever.py
class FAISSRetriever(BaseRetriever):
    def __init__(self, vector_store_path: str, embedding_model):
        self.store = FAISS.load_local(vector_store_path, embedding_model, ...)

    def search(self, query: str, top_k: int = 10):
        docs = self.store.similarity_search(query, k=top_k)
        return [{"id": i, "score": 0, "text": d.page_content, ...} for i, d in enumerate(docs)]
```

```python
# src/services/retrieval/opensearch_retriever.py
class OpenSearchRetriever(BaseRetriever):
    def __init__(self, client: OpenSearchClient):
        self.client = client

    def search(self, query: str, top_k: int = 10):
        return self.client.search_bm25(query, top_k=top_k)
```

```python
# src/services/retrieval/__init__.py
def get_retriever() -> BaseRetriever:
    """Factory that returns the appropriate retriever based on config."""
    settings = get_settings()
    if settings.opensearch.host and _opensearch_available():
        return OpenSearchRetriever(make_opensearch_client())
    return FAISSRetriever("data/vector_stores", get_embedding_model())
```

Update `retrieve_node.py`:

```python
def retrieve_node(state: dict, *, context: Any) -> dict:
    retriever = context.retriever  # Now uses the unified interface
    results = retriever.search_hybrid(query, embedding, top_k=10)
    ...
```
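One payoff of the interface: tests can swap in a trivial in-memory implementation instead of FAISS or OpenSearch. A self-contained sketch (the interface is trimmed to `search()` here for brevity; the toy scorer is illustrative):

```python
# In-memory stub satisfying the retriever contract, useful in unit tests.
from abc import ABC, abstractmethod
from typing import Any, Dict, List

class BaseRetriever(ABC):
    @abstractmethod
    def search(self, query: str, top_k: int = 10) -> List[Dict[str, Any]]: ...

class InMemoryRetriever(BaseRetriever):
    """Toy keyword-overlap matcher returning the same result shape."""
    def __init__(self, docs: List[str]):
        self.docs = docs

    def search(self, query: str, top_k: int = 10) -> List[Dict[str, Any]]:
        terms = set(query.lower().split())
        scored = [
            {"id": i, "score": len(terms & set(d.lower().split())), "text": d}
            for i, d in enumerate(self.docs)
        ]
        scored.sort(key=lambda r: r["score"], reverse=True)
        return scored[:top_k]
```

Any node that only depends on `BaseRetriever` can then be exercised without a vector store running.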
Issue 4: Orphaned Evolution System (P2)
Problem
`src/evolution/` contains a complete SOP evolution system that:
- Has `SOPGenePool` for versioning
- Has `performance_diagnostician()` for diagnosis
- Has `sop_architect()` for mutations
- Has an Airflow DAG (`airflow/dags/sop_evolution.py`)

But:
- No Airflow deployment exists
- `run_evolution_cycle()` requires manual invocation
- No UI to trigger evolution
- No tracking of which SOP version is in use
Solution
Option A: Remove It (Quick)
Delete or archive the unused code:

```bash
mkdir -p archive/evolution
mv src/evolution/* archive/evolution/
mv airflow/dags/sop_evolution.py archive/
```
Update imports to remove references.
Option B: Wire It Up (If Actually Wanted)
1. Add a CLI command:

   ```python
   # scripts/evolve_sop.py
   from src.evolution.director import run_evolution_cycle
   from src.workflow import create_guild

   if __name__ == "__main__":
       gene_pool = SOPGenePool()
       # Load baseline, run evolution, save results
   ```

2. Add an API endpoint:

   ```python
   # src/routers/admin.py
   @router.post("/admin/evolve")
   async def trigger_evolution(request: Request):
       # Requires admin auth
       result = run_evolution_cycle(...)
       return {"new_versions": len(result)}
   ```

3. Persist to the database:
   - Use Alembic migrations to create a `sop_versions` table
   - Store evolved SOPs with evaluation scores
Issue 5: Unreliable Evaluation System (P2)
Problem
src/evaluation/evaluators.py uses LLM-as-judge for:
- `evaluate_clinical_accuracy()` - LLM grades medical correctness
- `evaluate_actionability()` - LLM grades recommendations
Problems:
- LLMs are unreliable judges of medical accuracy
- No ground truth comparison
- Scores can fluctuate between runs
- Falls back to 0.5 on JSON parse errors (line 91)
Solution
Replace with deterministic metrics where possible:

```python
# For clinical_accuracy: use BiomarkerValidator as ground truth
def evaluate_clinical_accuracy_v2(response: Dict, biomarkers: Dict) -> GradedScore:
    validator = BiomarkerValidator()
    # Check whether the flagged biomarkers match the validator
    expected_flags = validator.validate_all(biomarkers)[0]
    actual_flags = response.get("biomarker_flags", [])
    expected_abnormal = {f.name for f in expected_flags if f.status != "NORMAL"}
    actual_abnormal = {f["name"] for f in actual_flags if f["status"] != "NORMAL"}
    precision = len(expected_abnormal & actual_abnormal) / max(len(actual_abnormal), 1)
    recall = len(expected_abnormal & actual_abnormal) / max(len(expected_abnormal), 1)
    f1 = 2 * precision * recall / max(precision + recall, 0.001)
    return GradedScore(
        score=f1,
        reasoning=f"Precision: {precision:.2f}, Recall: {recall:.2f}",
    )
```
Keep LLM-as-judge only for subjective metrics:
- Clarity (readability) - already programmatic ✅
- Helpfulness of recommendations - needs human judgment
Add human-in-the-loop:

```python
# src/evaluation/human_eval.py
def collect_human_rating(response_id: str) -> Optional[float]:
    """Store human ratings for later analysis."""
    # Integrate with Langfuse or a custom feedback endpoint
    ...
```
Issue 6: HuggingFace Code Duplication (P2)
Problem
huggingface/app.py is 1175 lines that reimplements:
- Biomarker parsing (duplicated from chat.py)
- Disease prediction (duplicated)
- Guild initialization (duplicated)
- Gradio UI (different from src/gradio_app.py)
- Environment handling (custom)
Solution
Refactor to import from the main package:

```python
# huggingface/app.py (simplified to ~200 lines)
import sys
sys.path.insert(0, "..")

from src.workflow import create_guild
from src.state import PatientInput
from scripts.chat import extract_biomarkers, predict_disease_simple

# Only Gradio-specific code here
def analyze_biomarkers(input_text: str):
    biomarkers, context = extract_biomarkers(input_text)
    prediction = predict_disease_simple(biomarkers)
    patient_input = PatientInput(
        biomarkers=biomarkers,
        model_prediction=prediction,
        patient_context=context,
    )
    guild = get_guild()
    result = guild.run(patient_input)
    return format_result(result)

# Gradio interface...
```
Create shared utility modules:

```python
# src/utils/biomarker_extraction.py
# Move extract_biomarkers() from chat.py here

# src/utils/disease_scoring.py
# Move predict_disease_simple() here
```
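As a sketch of what the extracted utility could look like: the real `extract_biomarkers()` in `scripts/chat.py` may differ, and the regex and `KNOWN` map below are illustrative assumptions:

```python
# Hypothetical shape for src/utils/biomarker_extraction.py.
import re

# Matches "Glucose: 185", "HbA1c = 8.2", "hemoglobin 9.5" style pairs.
_PAIR = re.compile(r"(?P<name>[A-Za-z][A-Za-z0-9]+)\s*[:=]?\s*(?P<value>\d+(?:\.\d+)?)")

# Canonical names the pipeline understands (illustrative subset).
KNOWN = {"glucose": "Glucose", "hba1c": "HbA1c", "hemoglobin": "Hemoglobin",
         "mcv": "MCV", "platelets": "Platelets"}

def extract_biomarkers(text: str) -> dict:
    """Pull known biomarker name/value pairs out of free text."""
    found = {}
    for m in _PAIR.finditer(text):
        canonical = KNOWN.get(m.group("name").lower())
        if canonical:
            found[canonical] = float(m.group("value"))
    return found
```

Both the main API and `huggingface/app.py` would then import this one function instead of each keeping a copy.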
Issue 7: Inadequate Test Coverage (P1)
Problem
Current tests are mostly:
- Import validation (`test_basic.py`)
- Unit tests with mocks (`test_agentic_rag.py`)
- Schema validation (`test_schemas.py`)
Missing:
- End-to-end workflow tests
- API integration tests
- Regression tests for medical accuracy
Solution
Add integration tests:

```python
# tests/integration/test_full_workflow.py
import pytest
from src.workflow import create_guild
from src.state import PatientInput

@pytest.fixture(scope="module")
def guild():
    return create_guild()

def test_diabetes_patient_analysis(guild):
    patient = PatientInput(
        biomarkers={"Glucose": 185, "HbA1c": 8.2},
        model_prediction={"disease": "Diabetes", "confidence": 0.87, "probabilities": {}},
        patient_context={"age": 52, "gender": "male"},
    )
    result = guild.run(patient)

    assert result.get("final_response") is not None
    assert len(result.get("biomarker_flags", [])) >= 2
    assert any(f["name"] == "Glucose" for f in result["biomarker_flags"])
    assert "Diabetes" in result["final_response"]["prediction_explanation"]["primary_disease"]

def test_anemia_patient_analysis(guild):
    patient = PatientInput(
        biomarkers={"Hemoglobin": 9.5, "MCV": 75},
        model_prediction={"disease": "Anemia", "confidence": 0.75, "probabilities": {}},
        patient_context={},
    )
    result = guild.run(patient)
    assert result.get("final_response") is not None
```
Add API tests:

```python
# tests/integration/test_api_endpoints.py
import pytest
from fastapi.testclient import TestClient
from src.main import app

@pytest.fixture
def client():
    return TestClient(app)

def test_health_endpoint(client):
    response = client.get("/health")
    assert response.status_code == 200
    assert response.json()["status"] == "healthy"

def test_analyze_structured(client):
    response = client.post("/analyze/structured", json={
        "biomarkers": {"Glucose": 140, "HbA1c": 7.0}
    })
    assert response.status_code == 200
    assert "prediction" in response.json()
```
Add to CI:

```yaml
# .github/workflows/test.yml
- name: Run integration tests
  run: pytest tests/integration/ -v
  env:
    GROQ_API_KEY: ${{ secrets.GROQ_API_KEY }}
```
Issue 8: Database Schema Unused (P3)
Problem
- `alembic/` is configured but `alembic/versions/` is empty
- `src/database.py` exists but is barely used
- `src/db/models.py` defines tables that aren't created
Solution
If database features are wanted:
1. Create the initial migration:

   ```bash
   cd src
   alembic revision --autogenerate -m "Initial schema"
   alembic upgrade head
   ```

2. Use the models for:
   - Storing analysis history
   - Persisting evolved SOPs
   - User feedback collection
If not needed:
- Remove the `alembic/` directory
- Remove `src/database.py`
- Remove `src/db/` if empty
- Remove `postgres` from `docker-compose.yml`
Issue 9: Documentation Misalignment (P1)
Problem
README.md claims:
- "ML prediction" → It's rule-based
- "6 Specialist Agents" → Also has agentic RAG (7+ nodes)
- "Production-ready" → Two competing entry points
Solution
Update README.md:
## How It Works
### Analysis Pipeline
RagBot uses a **multi-agent LangGraph workflow** to analyze biomarkers:
1. **Input Routing** - Validates query is medical, routes to analysis or Q&A
2. **Biomarker Analyzer** - Validates values against clinical reference ranges
3. **Disease Scorer** - Rule-based heuristics predict most likely condition
4. **Disease Explainer** - RAG retrieval for pathophysiology from medical PDFs
5. **Guidelines Agent** - RAG retrieval for treatment recommendations
6. **Response Synthesizer** - Compiles findings into patient-friendly summary
### Supported Conditions
- Diabetes (via Glucose, HbA1c)
- Anemia (via Hemoglobin, MCV)
- Heart Disease (via Cholesterol, Troponin, LDL)
- Thrombocytopenia (via Platelets)
- Thalassemia (via MCV + Hemoglobin pattern)
> **Note:** Disease prediction uses rule-based scoring, not ML models.
> Future versions may include trained classifiers.
Issue 10: Gradio App Dependencies (P2)
Problem
src/gradio_app.py is just an HTTP client:
```python
def _call_ask(question: str) -> str:
    resp = client.post(f"{API_BASE}/ask", json={"question": question})
```
It requires the FastAPI server running at http://localhost:8000.
Solution
Option A: Document the dependency clearly

Add startup instructions to the README:

1. Start the API server:

   ```bash
   uvicorn src.main:app --reload
   ```

2. In another terminal, start Gradio:

   ```bash
   python -m src.gradio_app
   ```

Option B: Add embedded mode

```python
# src/gradio_app.py
def _call_ask_embedded(question: str) -> str:
    """Direct workflow invocation without HTTP."""
    from src.services.agents.agentic_rag import AgenticRAGService
    service = get_rag_service()
    result = service.ask(query=question)
    return result.get("final_answer", "No answer.")

def launch_gradio(embedded: bool = False, share: bool = False):
    ask_fn = _call_ask_embedded if embedded else _call_ask
    # ... rest of UI
```
Implementation Roadmap
Phase 1: Critical Fixes (Week 1)
| Day | Task | Owner |
|---|---|---|
| 1 | Fix documentation claims (README.md) | - |
| 1-2 | Consolidate entry points (delete api/app/main.py) | - |
| 2-3 | Create unified retriever interface | - |
| 3-4 | Add integration tests for workflow | - |
| 5 | Update Gradio startup docs | - |
Phase 2: Architecture Cleanup (Week 2)
| Day | Task | Owner |
|---|---|---|
| 1-2 | Merge AgenticRAG + ClinicalInsightGuild | - |
| 3 | Refactor HuggingFace app to use shared code | - |
| 4 | Wire up or remove evolution system | - |
| 5 | Review and deploy | - |
Phase 3: Quality Improvements (Week 3)
| Day | Task | Owner |
|---|---|---|
| 1 | Replace LLM-as-judge with deterministic metrics | - |
| 2 | Add proper disease classifier (optional) | - |
| 3-4 | Expand test coverage to 80%+ | - |
| 5 | Final documentation pass | - |
Quick Wins (Do Today)
- Rename `predict_disease_simple` to `score_disease_heuristic` to be honest
- Add an `## Architecture` section to the README explaining the two workflows
- Create `scripts/start_full.ps1` that starts both the API and Gradio
- Delete the empty `alembic/versions/` and document "DB not implemented"
- Add type hints to the top 5 most-used functions
Checklist
- P0: Single FastAPI entry point (`src/main.py` only)
- P1: Documentation accurately describes capabilities
- P1: Unified retriever interface (FAISS + OpenSearch)
- P1: Integration tests exist and pass
- P2: Evolution system removed or functional
- P2: HuggingFace app imports from the main package
- P2: Evaluation metrics are deterministic
- P3: Database either used or removed