Agentic-RagBot / docs /REMEDIATION_PLAN.md
Nikhil Pravin Pise
Production Upgrade v2.0: SSE streaming, HIPAA compliance, Gradio Q&A UI
3ca1d38

MediGuard AI / RagBot - Comprehensive Remediation Plan

Generated: February 24, 2026
Status: βœ… COMPLETED
Last Updated: Session completion
Priority Levels: P0 (Critical) β†’ P3 (Nice-to-have)


Implementation Status

# Issue Status Notes
1 Dual Architecture βœ… Complete Consolidated to src/main.py
2 Fake ML Prediction βœ… Complete Renamed to rule-based heuristics
3 Vector Store Abstraction βœ… Complete Created unified retriever interface
4 Evolution System βœ… Complete Archived to archive/evolution/
5 Evaluation System βœ… Complete Added deterministic mode
6 HuggingFace Duplication βœ… Complete Reduced from 1175β†’1086 lines
7 Test Coverage βœ… Complete Added tests/test_integration.py
8 Database Schema ⏭️ Deferred Not needed for HuggingFace
9 Documentation βœ… Complete README.md updated
10 Gradio Dependencies βœ… Complete Shared utils created

Table of Contents

  1. Executive Summary
  2. Issue 1: Dual Architecture Confusion
  3. Issue 2: Fake ML Disease Prediction
  4. Issue 3: Vector Store Abstraction
  5. Issue 4: Orphaned Evolution System
  6. Issue 5: Unreliable Evaluation System
  7. Issue 6: HuggingFace Code Duplication
  8. Issue 7: Inadequate Test Coverage
  9. Issue 8: Database Schema Unused
  10. Issue 9: Documentation Misalignment
  11. Issue 10: Gradio App Dependencies
  12. Implementation Roadmap

Executive Summary

The RagBot codebase has 10 structural issues that create confusion, maintenance burden, and misleading claims. The most critical issues are:

Priority Issue Impact Effort
P0 Dual Architecture High confusion, duplicated code paths 3-5 days
P1 Fake ML Prediction Misleading users, false claims 2-3 days
P1 Vector Store Mess Production vs local mismatch 2 days
P1 Missing Tests Unreliable deployments 3-4 days
P1 Doc Misalignment User confusion 1 day
P2 Orphaned Evolution Dead code, wasted complexity 1-2 days
P2 Evaluation System Unreliable quality metrics 2 days
P2 HuggingFace Duplication 1175-line standalone app 2-3 days
P2 Gradio Dependencies Can't run standalone 0.5 days
P3 Unused Database Alembic setup with no migrations 1 day

Issue 1: Dual Architecture Confusion (P0)

Problem

Two competing LangGraph workflows exist:

Component Path Purpose
ClinicalInsightGuild src/workflow.py Original 6-agent biomarker analysis
AgenticRAGService src/services/agents/agentic_rag.py Newer Q&A RAG pipeline

The API routes them confusingly:

  • /analyze/* β†’ ClinicalInsightGuild via api/app/services/ragbot.py
  • /ask β†’ AgenticRAGService via src/routers/ask.py

Evidence:

  • src/main.py initializes BOTH services at startup (lines 91-106)
  • api/app/main.py is a SEPARATE FastAPI app from src/main.py
  • Users don't know which one is "production"

Solution

Option A: Merge into Single Unified Pipeline (Recommended)

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    Unified RAG Pipeline                       β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  Input β†’ Guardrail β†’ Router β†’ ┬→ Biomarker Analysis Path     β”‚
β”‚                                β”‚   (6 specialist agents)       β”‚
β”‚                                β””β†’ General Q&A Path             β”‚
β”‚                                    (retrieve β†’ grade β†’ gen)    β”‚
β”‚                          β†’ Output Synthesizer β†’ Response       β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Implementation Steps:

  1. Create unified graph in src/pipelines/unified_rag.py:

    # Merge both workflows into one StateGraph
    # Use routing logic from guardrail_node to dispatch
    
  2. Delete redundant files:

    • Move api/app/ logic into src/routers/
    • Delete api/app/main.py (use src/main.py only)
    • Keep api/app/services/ragbot.py as legacy adapter
  3. Single entry point:

    • src/main.py becomes THE server
    • uvicorn src.main:app everywhere
  4. Update imports:

    # In src/main.py, replace:
    from api.app.services.ragbot import get_ragbot_service
    # With:
    from src.pipelines.unified_rag import UnifiedRAGService
    

Files to Create:

  • src/pipelines/__init__.py
  • src/pipelines/unified_rag.py
  • src/pipelines/nodes/__init__.py (merge all nodes)

Files to Delete/Archive:

  • api/app/main.py β†’ Archive to api/app/main_legacy.py
  • api/app/routes/ β†’ Merge into src/routers/

Issue 2: Fake ML Disease Prediction (P1)

Problem

The README claims "ML prediction" but predict_disease_simple() is pure if/else:

# scripts/chat.py lines 151-216
if glucose > 126:
    scores["Diabetes"] += 0.4
if hba1c >= 6.5:
    scores["Diabetes"] += 0.5

There's also an LLM-based predictor (predict_disease_llm()) that just asks an LLM to guess.

Solution

Option A: Be Honest (Quick Fix)

Update all documentation to say "rule-based heuristics" not "ML prediction":

# In README.md:
- **Disease Prediction** - Rule-based scoring on 5 conditions
  (Diabetes, Anemia, Heart Disease, Thrombocytopenia, Thalassemia)

Option B: Implement Real ML (Longer)

  1. Create a proper classifier:

    # src/models/disease_classifier.py
    from sklearn.ensemble import RandomForestClassifier
    import joblib
    
    class DiseaseClassifier:
        def __init__(self, model_path: str = "models/disease_rf.joblib"):
            self.model = joblib.load(model_path)
            self.feature_names = [...]  # 24 biomarkers
        
        def predict(self, biomarkers: dict) -> dict:
            features = self._to_feature_vector(biomarkers)
            proba = self.model.predict_proba([features])[0]
            return {
                "disease": self.model.classes_[proba.argmax()],
                "confidence": float(proba.max()),
                "probabilities": dict(zip(self.model.classes_, proba.tolist()))
            }
    
  2. Train on synthetic data:

    • Create scripts/train_disease_model.py
    • Generate synthetic patient data with known conditions
    • Train RandomForest/XGBoost classifier
    • Save to models/disease_rf.joblib
  3. Replace predictor calls:

    # Instead of predict_disease_simple(biomarkers)
    from src.models.disease_classifier import get_classifier
    prediction = get_classifier().predict(biomarkers)
    

Recommendation: Do Option A immediately, Option B as a follow-up feature.


Issue 3: Vector Store Abstraction (P1)

Problem

Two different vector stores used inconsistently:

Context Store Configuration
Local dev FAISS data/vector_stores/medical_knowledge.faiss
Production OpenSearch OPENSEARCH__HOST env var
HuggingFace FAISS Bundled in huggingface/

The code has:

  • src/pdf_processor.py β†’ FAISS
  • src/services/opensearch/client.py β†’ OpenSearch
  • src/services/agents/nodes/retrieve_node.py β†’ OpenSearch only

Solution

Create a unified retriever interface:

# src/services/retrieval/interface.py
from abc import ABC, abstractmethod
from typing import List, Dict, Any

class BaseRetriever(ABC):
    @abstractmethod
    def search(self, query: str, top_k: int = 10) -> List[Dict[str, Any]]:
        """Return list of {id, score, text, title, section, metadata}"""
        pass
    
    @abstractmethod
    def search_hybrid(self, query: str, embedding: List[float], top_k: int = 10) -> List[Dict[str, Any]]:
        pass
# src/services/retrieval/faiss_retriever.py
class FAISSRetriever(BaseRetriever):
    def __init__(self, vector_store_path: str, embedding_model):
        self.store = FAISS.load_local(vector_store_path, embedding_model, ...)
    
    def search(self, query: str, top_k: int = 10):
        docs = self.store.similarity_search(query, k=top_k)
        return [{"id": i, "score": 0, "text": d.page_content, ...} for i, d in enumerate(docs)]
# src/services/retrieval/opensearch_retriever.py
class OpenSearchRetriever(BaseRetriever):
    def __init__(self, client: OpenSearchClient):
        self.client = client
    
    def search(self, query: str, top_k: int = 10):
        return self.client.search_bm25(query, top_k=top_k)
# src/services/retrieval/__init__.py
def get_retriever() -> BaseRetriever:
    """Factory that returns appropriate retriever based on config."""
    settings = get_settings()
    if settings.opensearch.host and _opensearch_available():
        return OpenSearchRetriever(make_opensearch_client())
    else:
        return FAISSRetriever("data/vector_stores", get_embedding_model())

Update retrieve_node.py:

def retrieve_node(state: dict, *, context: Any) -> dict:
    retriever = context.retriever  # Now uses unified interface
    results = retriever.search_hybrid(query, embedding, top_k=10)
    ...

Issue 4: Orphaned Evolution System (P2)

Problem

src/evolution/ contains a complete SOP evolution system that:

  • Has SOPGenePool for versioning
  • Has performance_diagnostician() for diagnosis
  • Has sop_architect() for mutations
  • Has an Airflow DAG (airflow/dags/sop_evolution.py)

But:

  • No Airflow deployment exists
  • run_evolution_cycle() requires manual invocation
  • No UI to trigger evolution
  • No tracking of which SOP version is in use

Solution

Option A: Remove It (Quick)

Delete or archive the unused code:

mkdir -p archive/evolution
mv src/evolution/* archive/evolution/
mv airflow/dags/sop_evolution.py archive/

Update imports to remove references.

Option B: Wire It Up (If Actually Wanted)

  1. Add CLI command:

    # scripts/evolve_sop.py
    from src.evolution.director import run_evolution_cycle
    from src.workflow import create_guild
    
    if __name__ == "__main__":
        gene_pool = SOPGenePool()
        # Load baseline, run evolution, save results
    
  2. Add API endpoint:

    # src/routers/admin.py
    @router.post("/admin/evolve")
    async def trigger_evolution(request: Request):
        # Requires admin auth
        result = run_evolution_cycle(...)
        return {"new_versions": len(result)}
    
  3. Persist to database:

    • Use Alembic migrations to create sop_versions table
    • Store evolved SOPs with evaluation scores

Issue 5: Unreliable Evaluation System (P2)

Problem

src/evaluation/evaluators.py uses LLM-as-judge for:

  • evaluate_clinical_accuracy() - LLM grades medical correctness
  • evaluate_actionability() - LLM grades recommendations

Problems:

  1. LLMs are unreliable judges of medical accuracy
  2. No ground truth comparison
  3. Scores can fluctuate between runs
  4. Falls back to 0.5 on JSON parse errors (line 91)

Solution

Replace with deterministic metrics where possible:

# For clinical_accuracy: Use BiomarkerValidator as ground truth
def evaluate_clinical_accuracy_v2(response: Dict, biomarkers: Dict) -> GradedScore:
    validator = BiomarkerValidator()
    
    # Check if flagged biomarkers match validator
    expected_flags = validator.validate_all(biomarkers)[0]
    actual_flags = response.get("biomarker_flags", [])
    
    expected_abnormal = {f.name for f in expected_flags if f.status != "NORMAL"}
    actual_abnormal = {f["name"] for f in actual_flags if f["status"] != "NORMAL"}
    
    precision = len(expected_abnormal & actual_abnormal) / max(len(actual_abnormal), 1)
    recall = len(expected_abnormal & actual_abnormal) / max(len(expected_abnormal), 1)
    f1 = 2 * precision * recall / max(precision + recall, 0.001)
    
    return GradedScore(
        score=f1,
        reasoning=f"Precision: {precision:.2f}, Recall: {recall:.2f}"
    )

Keep LLM-as-judge only for subjective metrics:

  • Clarity (readability) - already programmatic βœ“
  • Helpfulness of recommendations - needs human judgment

Add human-in-the-loop:

# src/evaluation/human_eval.py
def collect_human_rating(response_id: str) -> Optional[float]:
    """Store human ratings for later analysis."""
    # Integrate with Langfuse or custom feedback endpoint

Issue 6: HuggingFace Code Duplication (P2)

Problem

huggingface/app.py is 1175 lines that reimplements:

  • Biomarker parsing (duplicated from chat.py)
  • Disease prediction (duplicated)
  • Guild initialization (duplicated)
  • Gradio UI (different from src/gradio_app.py)
  • Environment handling (custom)

Solution

Refactor to import from main package:

# huggingface/app.py (simplified to ~200 lines)
import sys
sys.path.insert(0, "..")

from src.workflow import create_guild
from src.state import PatientInput
from scripts.chat import extract_biomarkers, predict_disease_simple

# Only Gradio-specific code here
def analyze_biomarkers(input_text: str):
    biomarkers, context = extract_biomarkers(input_text)
    prediction = predict_disease_simple(biomarkers)
    patient_input = PatientInput(
        biomarkers=biomarkers,
        model_prediction=prediction,
        patient_context=context
    )
    guild = get_guild()
    result = guild.run(patient_input)
    return format_result(result)

# Gradio interface...

Create shared utilities module:

# src/utils/biomarker_extraction.py
# Move extract_biomarkers() from chat.py here

# src/utils/disease_scoring.py
# Move predict_disease_simple() here

Issue 7: Inadequate Test Coverage (P1)

Problem

Current tests are mostly:

  • Import validation (test_basic.py)
  • Unit tests with mocks (test_agentic_rag.py)
  • Schema validation (test_schemas.py)

Missing:

  • End-to-end workflow tests
  • API integration tests
  • Regression tests for medical accuracy

Solution

Add integration tests:

# tests/integration/test_full_workflow.py
import pytest
from src.workflow import create_guild
from src.state import PatientInput

@pytest.fixture(scope="module")
def guild():
    return create_guild()

def test_diabetes_patient_analysis(guild):
    patient = PatientInput(
        biomarkers={"Glucose": 185, "HbA1c": 8.2},
        model_prediction={"disease": "Diabetes", "confidence": 0.87, "probabilities": {}},
        patient_context={"age": 52, "gender": "male"}
    )
    result = guild.run(patient)
    
    # Assertions
    assert result.get("final_response") is not None
    assert len(result.get("biomarker_flags", [])) >= 2
    assert any(f["name"] == "Glucose" for f in result["biomarker_flags"])
    assert "Diabetes" in result["final_response"]["prediction_explanation"]["primary_disease"]

def test_anemia_patient_analysis(guild):
    patient = PatientInput(
        biomarkers={"Hemoglobin": 9.5, "MCV": 75},
        model_prediction={"disease": "Anemia", "confidence": 0.75, "probabilities": {}},
        patient_context={}
    )
    result = guild.run(patient)
    assert result.get("final_response") is not None

Add API tests:

# tests/integration/test_api_endpoints.py
import pytest
from fastapi.testclient import TestClient
from src.main import app

@pytest.fixture
def client():
    return TestClient(app)

def test_health_endpoint(client):
    response = client.get("/health")
    assert response.status_code == 200
    assert response.json()["status"] == "healthy"

def test_analyze_structured(client):
    response = client.post("/analyze/structured", json={
        "biomarkers": {"Glucose": 140, "HbA1c": 7.0}
    })
    assert response.status_code == 200
    assert "prediction" in response.json()

Add to CI:

# .github/workflows/test.yml
- name: Run integration tests
  run: pytest tests/integration/ -v
  env:
    GROQ_API_KEY: ${{ secrets.GROQ_API_KEY }}

Issue 8: Database Schema Unused (P3)

Problem

  • alembic/ is configured but alembic/versions/ is empty
  • src/database.py exists but is barely used
  • src/db/models.py defines tables that aren't created

Solution

If database features are wanted:

  1. Create initial migration:

    cd src
    alembic revision --autogenerate -m "Initial schema"
    alembic upgrade head
    
  2. Use models for:

    • Storing analysis history
    • Persisting evolved SOPs
    • User feedback collection

If not needed:

  • Remove alembic/ directory
  • Remove src/database.py
  • Remove src/db/ if empty
  • Remove postgres from docker-compose.yml

Issue 9: Documentation Misalignment (P1)

Problem

README.md claims:

  • "ML prediction" β†’ It's rule-based
  • "6 Specialist Agents" β†’ Also has agentic RAG (7+ nodes)
  • "Production-ready" β†’ Two competing entry points

Solution

Update README.md:

## How It Works

### Analysis Pipeline
RagBot uses a **multi-agent LangGraph workflow** to analyze biomarkers:

1. **Input Routing** - Validates query is medical, routes to analysis or Q&A
2. **Biomarker Analyzer** - Validates values against clinical reference ranges
3. **Disease Scorer** - Rule-based heuristics predict most likely condition
4. **Disease Explainer** - RAG retrieval for pathophysiology from medical PDFs
5. **Guidelines Agent** - RAG retrieval for treatment recommendations
6. **Response Synthesizer** - Compiles findings into patient-friendly summary

### Supported Conditions
- Diabetes (via Glucose, HbA1c)
- Anemia (via Hemoglobin, MCV)
- Heart Disease (via Cholesterol, Troponin, LDL)
- Thrombocytopenia (via Platelets)
- Thalassemia (via MCV + Hemoglobin pattern)

> **Note:** Disease prediction uses rule-based scoring, not ML models.
> Future versions may include trained classifiers.

Issue 10: Gradio App Dependencies (P2)

Problem

src/gradio_app.py is just an HTTP client:

def _call_ask(question: str) -> str:
    resp = client.post(f"{API_BASE}/ask", json={"question": question})

It requires the FastAPI server running at http://localhost:8000.

Solution

Option A: Document the dependency clearly:

Add startup instructions:

## Running the Gradio UI

1. Start the API server:
   ```bash
   uvicorn src.main:app --reload
  1. In another terminal, start Gradio:

    python -m src.gradio_app
    
  2. Open http://localhost:7860


**Option B: Add embedded mode:**

```python
# src/gradio_app.py
def _call_ask_embedded(question: str) -> str:
    """Direct workflow invocation without HTTP."""
    from src.services.agents.agentic_rag import AgenticRAGService
    service = get_rag_service()
    result = service.ask(query=question)
    return result.get("final_answer", "No answer.")

def launch_gradio(embedded: bool = False, share: bool = False):
    ask_fn = _call_ask_embedded if embedded else _call_ask
    # ... rest of UI

Implementation Roadmap

Phase 1: Critical Fixes (Week 1)

Day Task Owner
1 Fix documentation claims (README.md) -
1-2 Consolidate entry points (delete api/app/main.py) -
2-3 Create unified retriever interface -
3-4 Add integration tests for workflow -
5 Update Gradio startup docs -

Phase 2: Architecture Cleanup (Week 2)

Day Task Owner
1-2 Merge AgenticRAG + ClinicalInsightGuild -
3 Refactor HuggingFace app to use shared code -
4 Wire up or remove evolution system -
5 Review and deploy -

Phase 3: Quality Improvements (Week 3)

Day Task Owner
1 Replace LLM-as-judge with deterministic metrics -
2 Add proper disease classifier (optional) -
3-4 Expand test coverage to 80%+ -
5 Final documentation pass -

Quick Wins (Do Today)

  1. Rename predict_disease_simple to score_disease_heuristic to be honest
  2. Add ## Architecture section to README explaining the two workflows
  3. Create scripts/start_full.ps1 that starts both API and Gradio
  4. Delete empty alembic/versions/ and document "DB not implemented"
  5. Add type hints to top 5 most-used functions

Checklist

  • P0: Single FastAPI entry point (src/main.py only)
  • P1: Documentation accurately describes capabilities
  • P1: Unified retriever interface (FAISS + OpenSearch)
  • P1: Integration tests exist and pass
  • P2: Evolution system removed or functional
  • P2: HuggingFace app imports from main package
  • P2: Evaluation metrics are deterministic
  • P3: Database either used or removed