# MediGuard AI / RagBot - Comprehensive Remediation Plan
> **Author:** Nikhil Pravin Pise
> **Generated:** February 24, 2026
> **Status:** ✅ COMPLETED
> **Last Updated:** Session completion
> **Priority Levels:** P0 (Critical) → P3 (Nice-to-have)
---
## Implementation Status
| # | Issue | Status | Notes |
|---|-------|--------|-------|
| 1 | Dual Architecture | ✅ Complete | Consolidated to src/main.py |
| 2 | Fake ML Prediction | ✅ Complete | Renamed to rule-based heuristics |
| 3 | Vector Store Abstraction | ✅ Complete | Created unified retriever interface |
| 4 | Evolution System | ✅ Complete | Archived to archive/evolution/ |
| 5 | Evaluation System | ✅ Complete | Added deterministic mode |
| 6 | HuggingFace Duplication | ✅ Complete | Reduced from 1175 → 1086 lines |
| 7 | Test Coverage | ✅ Complete | Added tests/test_integration.py |
| 8 | Database Schema | ⏭️ Deferred | Not needed for HuggingFace |
| 9 | Documentation | ✅ Complete | README.md updated |
| 10 | Gradio Dependencies | ✅ Complete | Shared utils created |
---
## Table of Contents
1. [Executive Summary](#executive-summary)
2. [Issue 1: Dual Architecture Confusion](#issue-1-dual-architecture-confusion-p0)
3. [Issue 2: Fake ML Disease Prediction](#issue-2-fake-ml-disease-prediction-p1)
4. [Issue 3: Vector Store Abstraction](#issue-3-vector-store-abstraction-p1)
5. [Issue 4: Orphaned Evolution System](#issue-4-orphaned-evolution-system-p2)
6. [Issue 5: Unreliable Evaluation System](#issue-5-unreliable-evaluation-system-p2)
7. [Issue 6: HuggingFace Code Duplication](#issue-6-huggingface-code-duplication-p2)
8. [Issue 7: Inadequate Test Coverage](#issue-7-inadequate-test-coverage-p1)
9. [Issue 8: Database Schema Unused](#issue-8-database-schema-unused-p3)
10. [Issue 9: Documentation Misalignment](#issue-9-documentation-misalignment-p1)
11. [Issue 10: Gradio App Dependencies](#issue-10-gradio-app-dependencies-p2)
12. [Implementation Roadmap](#implementation-roadmap)
---
## Executive Summary
The RagBot codebase has **10 structural issues** that create confusion, maintenance burden, and misleading claims. The most critical issues are:
| Priority | Issue | Impact | Effort |
|----------|-------|--------|--------|
| P0 | Dual Architecture | High confusion, duplicated code paths | 3-5 days |
| P1 | Fake ML Prediction | Misleading users, false claims | 2-3 days |
| P1 | Vector Store Mess | Production vs local mismatch | 2 days |
| P1 | Missing Tests | Unreliable deployments | 3-4 days |
| P1 | Doc Misalignment | User confusion | 1 day |
| P2 | Orphaned Evolution | Dead code, wasted complexity | 1-2 days |
| P2 | Evaluation System | Unreliable quality metrics | 2 days |
| P2 | HuggingFace Duplication | 1175-line standalone app | 2-3 days |
| P2 | Gradio Dependencies | Can't run standalone | 0.5 days |
| P3 | Unused Database | Alembic setup with no migrations | 1 day |
---
## Issue 1: Dual Architecture Confusion (P0)
### Problem
Two competing LangGraph workflows exist:
| Component | Path | Purpose |
|-----------|------|---------|
| **ClinicalInsightGuild** | `src/workflow.py` | Original 6-agent biomarker analysis |
| **AgenticRAGService** | `src/services/agents/agentic_rag.py` | Newer Q&A RAG pipeline |
The API routes them confusingly:
- `/analyze/*` → ClinicalInsightGuild via `api/app/services/ragbot.py`
- `/ask` → AgenticRAGService via `src/routers/ask.py`
**Evidence:**
- `src/main.py` initializes BOTH services at startup (lines 91-106)
- `api/app/main.py` is a SEPARATE FastAPI app from `src/main.py`
- Users don't know which one is "production"
### Solution
**Option A: Merge into Single Unified Pipeline (Recommended)**
```
┌──────────────────────────────────────────────────────────────┐
│                     Unified RAG Pipeline                     │
├──────────────────────────────────────────────────────────────┤
│ Input → Guardrail → Router ─┬→ Biomarker Analysis Path       │
│                             │    (6 specialist agents)       │
│                             └→ General Q&A Path              │
│                                 (retrieve → grade → gen)     │
│                             → Output Synthesizer → Response  │
└──────────────────────────────────────────────────────────────┘
```
**Implementation Steps:**
1. **Create unified graph** in `src/pipelines/unified_rag.py`:
```python
# Merge both workflows into one StateGraph
# Use routing logic from guardrail_node to dispatch
```
2. **Delete redundant files:**
- Move `api/app/` logic into `src/routers/`
- Delete `api/app/main.py` (use `src/main.py` only)
- Keep `api/app/services/ragbot.py` as legacy adapter
3. **Single entry point:**
- `src/main.py` becomes THE server
- `uvicorn src.main:app` everywhere
4. **Update imports:**
```python
# In src/main.py, replace:
from api.app.services.ragbot import get_ragbot_service
# With:
from src.pipelines.unified_rag import UnifiedRAGService
```
**Files to Create:**
- `src/pipelines/__init__.py`
- `src/pipelines/unified_rag.py`
- `src/pipelines/nodes/__init__.py` (merge all nodes)
**Files to Delete/Archive:**
- `api/app/main.py` → Archive to `api/app/main_legacy.py`
- `api/app/routes/` → Merge into `src/routers/`
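Before committing to a merged LangGraph layout, the routing step can be sketched in plain Python to validate the dispatch logic. Node and state-key names below are illustrative, not the actual `src/` identifiers:

```python
# Illustrative dispatch for the unified pipeline; names are hypothetical.
from typing import Any, Callable, Dict

State = Dict[str, Any]

def route(state: State) -> str:
    """Route to the biomarker path when structured values are present."""
    return "biomarker_analysis" if state.get("biomarkers") else "general_qa"

def biomarker_analysis(state: State) -> State:
    # The 6 specialist agents would run here
    return {**state, "path": "biomarker_analysis"}

def general_qa(state: State) -> State:
    # retrieve -> grade -> generate would run here
    return {**state, "path": "general_qa"}

NODES: Dict[str, Callable[[State], State]] = {
    "biomarker_analysis": biomarker_analysis,
    "general_qa": general_qa,
}

def run_unified(state: State) -> State:
    """Single entry point: guardrail/router decides the branch."""
    return NODES[route(state)](state)
```

In the real graph this becomes a conditional edge out of the guardrail node, but the branch condition stays the same.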
---
## Issue 2: Fake ML Disease Prediction (P1)
### Problem
The README claims "ML prediction" but `predict_disease_simple()` is pure if/else:
```python
# scripts/chat.py lines 151-216
if glucose > 126:
scores["Diabetes"] += 0.4
if hba1c >= 6.5:
scores["Diabetes"] += 0.5
```
There's also an LLM-based predictor (`predict_disease_llm()`) that just asks an LLM to guess.
### Solution
**Option A: Be Honest (Quick Fix)**
Update all documentation to say "rule-based heuristics" not "ML prediction":
```markdown
# In README.md:
- **Disease Prediction** - Rule-based scoring on 5 conditions
(Diabetes, Anemia, Heart Disease, Thrombocytopenia, Thalassemia)
```
**Option B: Implement Real ML (Longer)**
1. **Create a proper classifier:**
```python
# src/models/disease_classifier.py
from sklearn.ensemble import RandomForestClassifier
import joblib
class DiseaseClassifier:
def __init__(self, model_path: str = "models/disease_rf.joblib"):
self.model = joblib.load(model_path)
self.feature_names = [...] # 24 biomarkers
def predict(self, biomarkers: dict) -> dict:
features = self._to_feature_vector(biomarkers)
proba = self.model.predict_proba([features])[0]
return {
"disease": self.model.classes_[proba.argmax()],
"confidence": float(proba.max()),
"probabilities": dict(zip(self.model.classes_, proba.tolist()))
}
```
2. **Train on synthetic data:**
- Create `scripts/train_disease_model.py`
- Generate synthetic patient data with known conditions
- Train RandomForest/XGBoost classifier
- Save to `models/disease_rf.joblib`
3. **Replace predictor calls:**
```python
# Instead of predict_disease_simple(biomarkers)
from src.models.disease_classifier import get_classifier
prediction = get_classifier().predict(biomarkers)
```
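Step 2's synthetic-data generation could start as simply as the sketch below. The thresholds are borrowed from the existing rule-based heuristics purely for illustration; they are not clinically validated, and the biomarker set is trimmed to three features:

```python
# Sketch of synthetic training data for the classifier (illustrative only).
import random

def make_patient(condition: str, rng: random.Random) -> dict:
    """Generate one synthetic patient whose biomarkers match `condition`."""
    if condition == "Diabetes":
        return {"Glucose": rng.uniform(130, 250), "HbA1c": rng.uniform(6.5, 12.0),
                "Hemoglobin": rng.uniform(12.0, 16.0), "label": "Diabetes"}
    if condition == "Anemia":
        return {"Glucose": rng.uniform(70, 110), "HbA1c": rng.uniform(4.5, 5.6),
                "Hemoglobin": rng.uniform(6.0, 11.0), "label": "Anemia"}
    return {"Glucose": rng.uniform(70, 110), "HbA1c": rng.uniform(4.5, 5.6),
            "Hemoglobin": rng.uniform(12.0, 16.0), "label": "Healthy"}

def make_dataset(n_per_class: int = 100, seed: int = 42) -> list:
    """Balanced, seeded dataset suitable for fitting a RandomForest."""
    rng = random.Random(seed)
    return [make_patient(c, rng)
            for c in ("Diabetes", "Anemia", "Healthy")
            for _ in range(n_per_class)]
```

A real `scripts/train_disease_model.py` would widen the feature set to all 24 biomarkers and add label noise so the classifier learns more than the rules it was seeded from.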
**Recommendation:** Do Option A immediately, Option B as a follow-up feature.
---
## Issue 3: Vector Store Abstraction (P1)
### Problem
Two different vector stores used inconsistently:
| Context | Store | Configuration |
|---------|-------|---------------|
| Local dev | FAISS | `data/vector_stores/medical_knowledge.faiss` |
| Production | OpenSearch | `OPENSEARCH__HOST` env var |
| HuggingFace | FAISS | Bundled in `huggingface/` |
The code has:
- `src/pdf_processor.py` → FAISS
- `src/services/opensearch/client.py` → OpenSearch
- `src/services/agents/nodes/retrieve_node.py` → OpenSearch only
### Solution
**Create a unified retriever interface:**
```python
# src/services/retrieval/interface.py
from abc import ABC, abstractmethod
from typing import List, Dict, Any
class BaseRetriever(ABC):
@abstractmethod
def search(self, query: str, top_k: int = 10) -> List[Dict[str, Any]]:
"""Return list of {id, score, text, title, section, metadata}"""
pass
@abstractmethod
def search_hybrid(self, query: str, embedding: List[float], top_k: int = 10) -> List[Dict[str, Any]]:
pass
```
```python
# src/services/retrieval/faiss_retriever.py
class FAISSRetriever(BaseRetriever):
def __init__(self, vector_store_path: str, embedding_model):
self.store = FAISS.load_local(vector_store_path, embedding_model, ...)
def search(self, query: str, top_k: int = 10):
docs = self.store.similarity_search(query, k=top_k)
return [{"id": i, "score": 0, "text": d.page_content, ...} for i, d in enumerate(docs)]
```
```python
# src/services/retrieval/opensearch_retriever.py
class OpenSearchRetriever(BaseRetriever):
def __init__(self, client: OpenSearchClient):
self.client = client
def search(self, query: str, top_k: int = 10):
return self.client.search_bm25(query, top_k=top_k)
```
```python
# src/services/retrieval/__init__.py
def get_retriever() -> BaseRetriever:
"""Factory that returns appropriate retriever based on config."""
settings = get_settings()
if settings.opensearch.host and _opensearch_available():
return OpenSearchRetriever(make_opensearch_client())
else:
return FAISSRetriever("data/vector_stores", get_embedding_model())
```
**Update retrieve_node.py:**
```python
def retrieve_node(state: dict, *, context: Any) -> dict:
retriever = context.retriever # Now uses unified interface
results = retriever.search_hybrid(query, embedding, top_k=10)
...
```
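`search_hybrid` is left abstract in the interface above. One common way to merge BM25 and vector hits is reciprocal rank fusion, sketched here; the project's actual fusion logic may weight raw scores instead:

```python
# Reciprocal rank fusion: merge two ranked hit lists into one.
# A sketch only; the real search_hybrid may differ.
from typing import Any, Dict, List

def rrf_merge(bm25_hits: List[Dict[str, Any]],
              vector_hits: List[Dict[str, Any]],
              top_k: int = 10, k: int = 60) -> List[Dict[str, Any]]:
    """Each hit is a {"id": ..., ...} dict; rank position drives the score."""
    fused: Dict[Any, float] = {}
    docs: Dict[Any, Dict[str, Any]] = {}
    for hits in (bm25_hits, vector_hits):
        for rank, hit in enumerate(hits):
            # Documents appearing high in either list accumulate score
            fused[hit["id"]] = fused.get(hit["id"], 0.0) + 1.0 / (k + rank + 1)
            docs.setdefault(hit["id"], hit)
    ranked = sorted(fused, key=fused.get, reverse=True)[:top_k]
    return [{**docs[i], "score": fused[i]} for i in ranked]
```

Because RRF only uses rank positions, it sidesteps the problem that BM25 and cosine scores live on incomparable scales.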
---
## Issue 4: Orphaned Evolution System (P2)
### Problem
`src/evolution/` contains a complete SOP evolution system that:
- Has `SOPGenePool` for versioning
- Has `performance_diagnostician()` for diagnosis
- Has `sop_architect()` for mutations
- Has an Airflow DAG (`airflow/dags/sop_evolution.py`)
**But:**
- No Airflow deployment exists
- `run_evolution_cycle()` requires manual invocation
- No UI to trigger evolution
- No tracking of which SOP version is in use
### Solution
**Option A: Remove It (Quick)**
Delete or archive the unused code:
```
mkdir -p archive/evolution
mv src/evolution/* archive/evolution/
mv airflow/dags/sop_evolution.py archive/
```
Update imports to remove references.
**Option B: Wire It Up (If Actually Wanted)**
1. **Add CLI command:**
```python
# scripts/evolve_sop.py
from src.evolution.director import run_evolution_cycle
from src.workflow import create_guild
if __name__ == "__main__":
gene_pool = SOPGenePool()
# Load baseline, run evolution, save results
```
2. **Add API endpoint:**
```python
# src/routers/admin.py
@router.post("/admin/evolve")
async def trigger_evolution(request: Request):
# Requires admin auth
result = run_evolution_cycle(...)
return {"new_versions": len(result)}
```
3. **Persist to database:**
- Use Alembic migrations to create `sop_versions` table
- Store evolved SOPs with evaluation scores
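The `sop_versions` table from step 3 might look like the sketch below. It uses stdlib `sqlite3` for illustration; the real project would define a SQLAlchemy model plus an Alembic migration, and the column names here are assumptions:

```python
# Illustrative schema for persisted SOP versions; column names are
# assumptions, not the project's actual src/db/models.py definitions.
import sqlite3
from typing import Optional

SCHEMA = """
CREATE TABLE IF NOT EXISTS sop_versions (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    version TEXT NOT NULL,
    sop_text TEXT NOT NULL,
    parent_version TEXT,
    eval_score REAL,
    is_active INTEGER NOT NULL DEFAULT 0,
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
"""

def save_sop(conn: sqlite3.Connection, version: str, sop_text: str,
             eval_score: float, parent: Optional[str] = None) -> int:
    """Insert one evolved SOP row and return its id."""
    cur = conn.execute(
        "INSERT INTO sop_versions (version, sop_text, parent_version, eval_score)"
        " VALUES (?, ?, ?, ?)",
        (version, sop_text, parent, eval_score),
    )
    conn.commit()
    return cur.lastrowid
```

Tracking `parent_version` and `is_active` would also close the "no tracking of which SOP version is in use" gap noted above.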
---
## Issue 5: Unreliable Evaluation System (P2)
### Problem
`src/evaluation/evaluators.py` uses LLM-as-judge for:
- `evaluate_clinical_accuracy()` - LLM grades medical correctness
- `evaluate_actionability()` - LLM grades recommendations
**Problems:**
1. LLMs are unreliable judges of medical accuracy
2. No ground truth comparison
3. Scores can fluctuate between runs
4. Falls back to 0.5 on JSON parse errors (line 91)
### Solution
**Replace with deterministic metrics where possible:**
```python
# For clinical_accuracy: Use BiomarkerValidator as ground truth
def evaluate_clinical_accuracy_v2(response: Dict, biomarkers: Dict) -> GradedScore:
validator = BiomarkerValidator()
# Check if flagged biomarkers match validator
expected_flags = validator.validate_all(biomarkers)[0]
actual_flags = response.get("biomarker_flags", [])
expected_abnormal = {f.name for f in expected_flags if f.status != "NORMAL"}
actual_abnormal = {f["name"] for f in actual_flags if f["status"] != "NORMAL"}
precision = len(expected_abnormal & actual_abnormal) / max(len(actual_abnormal), 1)
recall = len(expected_abnormal & actual_abnormal) / max(len(expected_abnormal), 1)
f1 = 2 * precision * recall / max(precision + recall, 0.001)
return GradedScore(
score=f1,
reasoning=f"Precision: {precision:.2f}, Recall: {recall:.2f}"
)
```
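As a worked example of the metric above (flag names made up): if the validator expects {Glucose, HbA1c} to be abnormal and the response flags {Glucose, Platelets}, there is one correct flag out of two on each side, so precision, recall, and F1 all come out to 0.5:

```python
# Worked example of the set-based precision/recall/F1 metric.
expected_abnormal = {"Glucose", "HbA1c"}    # from BiomarkerValidator (illustrative)
actual_abnormal = {"Glucose", "Platelets"}  # flagged by the response (illustrative)

hits = len(expected_abnormal & actual_abnormal)               # 1 correct flag
precision = hits / max(len(actual_abnormal), 1)               # 1/2 = 0.5
recall = hits / max(len(expected_abnormal), 1)                # 1/2 = 0.5
f1 = 2 * precision * recall / max(precision + recall, 0.001)  # 0.5
```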
**Keep LLM-as-judge only for subjective metrics:**
- Clarity (readability) - already programmatic ✓
- Helpfulness of recommendations - needs human judgment
**Add human-in-the-loop:**
```python
# src/evaluation/human_eval.py
def collect_human_rating(response_id: str) -> Optional[float]:
"""Store human ratings for later analysis."""
# Integrate with Langfuse or custom feedback endpoint
```
---
## Issue 6: HuggingFace Code Duplication (P2)
### Problem
`huggingface/app.py` is a **1175-line** standalone app that reimplements:
- Biomarker parsing (duplicated from chat.py)
- Disease prediction (duplicated)
- Guild initialization (duplicated)
- Gradio UI (different from src/gradio_app.py)
- Environment handling (custom)
### Solution
**Refactor to import from main package:**
```python
# huggingface/app.py (simplified to ~200 lines)
import sys
from pathlib import Path

# Make the repo root importable regardless of the working directory
sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
from src.workflow import create_guild
from src.state import PatientInput
from scripts.chat import extract_biomarkers, predict_disease_simple
# Only Gradio-specific code here
def analyze_biomarkers(input_text: str):
biomarkers, context = extract_biomarkers(input_text)
prediction = predict_disease_simple(biomarkers)
patient_input = PatientInput(
biomarkers=biomarkers,
model_prediction=prediction,
patient_context=context
)
guild = get_guild()
result = guild.run(patient_input)
return format_result(result)
# Gradio interface...
```
**Create shared utilities module:**
```python
# src/utils/biomarker_extraction.py
# Move extract_biomarkers() from chat.py here
# src/utils/disease_scoring.py
# Move predict_disease_simple() here
```
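A hypothetical sketch of what the shared extractor could look like after the move. The actual `extract_biomarkers()` in `chat.py` may parse input differently; the regex and return shape here are assumptions:

```python
# src/utils/biomarker_extraction.py (hypothetical sketch; the real parser
# in chat.py may use different patterns and marker names).
import re
from typing import Dict, Tuple

# Matches "Name: value" or "Name = value" pairs, e.g. "Glucose: 185"
_PATTERN = re.compile(
    r"(?P<name>[A-Za-z][A-Za-z0-9 ]*?)\s*[:=]\s*(?P<value>\d+(?:\.\d+)?)"
)

def extract_biomarkers(text: str) -> Tuple[Dict[str, float], str]:
    """Pull biomarker pairs out of free text; the remainder becomes context."""
    biomarkers = {m.group("name").strip(): float(m.group("value"))
                  for m in _PATTERN.finditer(text)}
    context = _PATTERN.sub("", text).strip()
    return biomarkers, context
```

Whatever the real implementation looks like, moving it into `src/utils/` lets `scripts/chat.py`, `huggingface/app.py`, and the Gradio UI all import one copy.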
---
## Issue 7: Inadequate Test Coverage (P1)
### Problem
Current tests are mostly:
- Import validation (`test_basic.py`)
- Unit tests with mocks (`test_agentic_rag.py`)
- Schema validation (`test_schemas.py`)
**Missing:**
- End-to-end workflow tests
- API integration tests
- Regression tests for medical accuracy
### Solution
**Add integration tests:**
```python
# tests/integration/test_full_workflow.py
import pytest
from src.workflow import create_guild
from src.state import PatientInput
@pytest.fixture(scope="module")
def guild():
return create_guild()
def test_diabetes_patient_analysis(guild):
patient = PatientInput(
biomarkers={"Glucose": 185, "HbA1c": 8.2},
model_prediction={"disease": "Diabetes", "confidence": 0.87, "probabilities": {}},
patient_context={"age": 52, "gender": "male"}
)
result = guild.run(patient)
# Assertions
assert result.get("final_response") is not None
assert len(result.get("biomarker_flags", [])) >= 2
assert any(f["name"] == "Glucose" for f in result["biomarker_flags"])
assert "Diabetes" in result["final_response"]["prediction_explanation"]["primary_disease"]
def test_anemia_patient_analysis(guild):
patient = PatientInput(
biomarkers={"Hemoglobin": 9.5, "MCV": 75},
model_prediction={"disease": "Anemia", "confidence": 0.75, "probabilities": {}},
patient_context={}
)
result = guild.run(patient)
assert result.get("final_response") is not None
```
**Add API tests:**
```python
# tests/integration/test_api_endpoints.py
import pytest
from fastapi.testclient import TestClient
from src.main import app
@pytest.fixture
def client():
return TestClient(app)
def test_health_endpoint(client):
response = client.get("/health")
assert response.status_code == 200
assert response.json()["status"] == "healthy"
def test_analyze_structured(client):
response = client.post("/analyze/structured", json={
"biomarkers": {"Glucose": 140, "HbA1c": 7.0}
})
assert response.status_code == 200
assert "prediction" in response.json()
```
**Add to CI:**
```yaml
# .github/workflows/test.yml
- name: Run integration tests
run: pytest tests/integration/ -v
env:
GROQ_API_KEY: ${{ secrets.GROQ_API_KEY }}
```
---
## Issue 8: Database Schema Unused (P3)
### Problem
- `alembic/` is configured but `alembic/versions/` is empty
- `src/database.py` exists but is barely used
- `src/db/models.py` defines tables that aren't created
### Solution
**If database features are wanted:**
1. Create initial migration:
```bash
cd src
alembic revision --autogenerate -m "Initial schema"
alembic upgrade head
```
2. Use models for:
- Storing analysis history
- Persisting evolved SOPs
- User feedback collection
**If not needed:**
- Remove `alembic/` directory
- Remove `src/database.py`
- Remove `src/db/` if empty
- Remove `postgres` from `docker-compose.yml`
---
## Issue 9: Documentation Misalignment (P1)
### Problem
README.md claims:
- "ML prediction" → It's rule-based
- "6 Specialist Agents" → Also has agentic RAG (7+ nodes)
- "Production-ready" → Two competing entry points
### Solution
**Update README.md:**
```markdown
## How It Works
### Analysis Pipeline
RagBot uses a **multi-agent LangGraph workflow** to analyze biomarkers:
1. **Input Routing** - Validates query is medical, routes to analysis or Q&A
2. **Biomarker Analyzer** - Validates values against clinical reference ranges
3. **Disease Scorer** - Rule-based heuristics predict most likely condition
4. **Disease Explainer** - RAG retrieval for pathophysiology from medical PDFs
5. **Guidelines Agent** - RAG retrieval for treatment recommendations
6. **Response Synthesizer** - Compiles findings into patient-friendly summary
### Supported Conditions
- Diabetes (via Glucose, HbA1c)
- Anemia (via Hemoglobin, MCV)
- Heart Disease (via Cholesterol, Troponin, LDL)
- Thrombocytopenia (via Platelets)
- Thalassemia (via MCV + Hemoglobin pattern)
> **Note:** Disease prediction uses rule-based scoring, not ML models.
> Future versions may include trained classifiers.
```
---
## Issue 10: Gradio App Dependencies (P2)
### Problem
`src/gradio_app.py` is just an HTTP client:
```python
def _call_ask(question: str) -> str:
resp = client.post(f"{API_BASE}/ask", json={"question": question})
```
It requires the FastAPI server running at `http://localhost:8000`.
### Solution
**Option A: Document the dependency clearly:**
Add startup instructions:
````markdown
## Running the Gradio UI
1. Start the API server:
   ```bash
   uvicorn src.main:app --reload
   ```
2. In another terminal, start Gradio:
   ```bash
   python -m src.gradio_app
   ```
3. Open http://localhost:7860
````
**Option B: Add embedded mode:**
```python
# src/gradio_app.py
def _call_ask_embedded(question: str) -> str:
"""Direct workflow invocation without HTTP."""
from src.services.agents.agentic_rag import AgenticRAGService
service = get_rag_service()
result = service.ask(query=question)
return result.get("final_answer", "No answer.")
def launch_gradio(embedded: bool = False, share: bool = False):
ask_fn = _call_ask_embedded if embedded else _call_ask
# ... rest of UI
```
---
## Implementation Roadmap
### Phase 1: Critical Fixes (Week 1)
| Day | Task | Owner |
|-----|------|-------|
| 1 | Fix documentation claims (README.md) | - |
| 1-2 | Consolidate entry points (delete api/app/main.py) | - |
| 2-3 | Create unified retriever interface | - |
| 3-4 | Add integration tests for workflow | - |
| 5 | Update Gradio startup docs | - |
### Phase 2: Architecture Cleanup (Week 2)
| Day | Task | Owner |
|-----|------|-------|
| 1-2 | Merge AgenticRAG + ClinicalInsightGuild | - |
| 3 | Refactor HuggingFace app to use shared code | - |
| 4 | Wire up or remove evolution system | - |
| 5 | Review and deploy | - |
### Phase 3: Quality Improvements (Week 3)
| Day | Task | Owner |
|-----|------|-------|
| 1 | Replace LLM-as-judge with deterministic metrics | - |
| 2 | Add proper disease classifier (optional) | - |
| 3-4 | Expand test coverage to 80%+ | - |
| 5 | Final documentation pass | - |
---
## Quick Wins (Do Today)
1. **Rename `predict_disease_simple`** to `score_disease_heuristic` so the name reflects what it actually does
2. **Add `## Architecture` section** to README explaining the two workflows
3. **Create `scripts/start_full.ps1`** that starts both API and Gradio
4. **Delete empty `alembic/versions/`** and document "DB not implemented"
5. **Add type hints** to top 5 most-used functions
---
## Checklist
- [ ] P0: Single FastAPI entry point (`src/main.py` only)
- [ ] P1: Documentation accurately describes capabilities
- [ ] P1: Unified retriever interface (FAISS + OpenSearch)
- [ ] P1: Integration tests exist and pass
- [ ] P2: Evolution system removed or functional
- [ ] P2: HuggingFace app imports from main package
- [ ] P2: Evaluation metrics are deterministic
- [ ] P3: Database either used or removed