
RagBot Deep Review

Last updated: February 2026
Items marked [RESOLVED] have been fixed. Items marked [OPEN] remain as future work.

Scope

This review covers the end-to-end workflow and supporting services for RagBot, focusing on design correctness, reliability, safety guardrails, and maintainability. The review is based on a close reading of the workflow orchestration, agent implementations, API wiring, extraction and prediction logic, and the knowledge base pipeline.

Primary files reviewed:

  • src/workflow.py
  • src/state.py
  • src/config.py
  • src/agents/*
  • src/biomarker_validator.py
  • src/pdf_processor.py
  • api/app/main.py
  • api/app/routes/analyze.py
  • api/app/services/extraction.py
  • api/app/services/ragbot.py
  • scripts/chat.py

Architectural Understanding (Condensed)

End-to-End Flow

  1. Input arrives via CLI (scripts/chat.py) or REST API (api/app/routes/analyze.py).
  2. Natural language inputs are parsed by the extraction service (api/app/services/extraction.py) to produce normalized biomarkers and patient context.
  3. A rule-based prediction (predict_disease_simple) produces a disease hypothesis and probabilities.
  4. The LangGraph workflow (src/workflow.py) orchestrates six agents: Biomarker Analyzer, Disease Explainer, Biomarker Linker, Clinical Guidelines, Confidence Assessor, Response Synthesizer.
  5. The synthesized output is formatted into API schemas (api/app/services/ragbot.py) or into CLI-friendly responses (scripts/chat.py).
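The five steps above can be sketched as a single glue function. All names here are hypothetical stubs mirroring the flow, not the actual signatures in the codebase:

```python
# Hypothetical stubs standing in for the real services; names mirror the
# five steps above but are not the actual function signatures.
def extract(text: str):
    return {"glucose": 190.0}, {"age": 52}             # step 2: biomarkers, context


def predict_disease_simple(biomarkers: dict) -> dict:
    return {"disease": "Diabetes", "confidence": 0.7}  # step 3: rule-based hypothesis


def run_workflow(biomarkers, prediction, context) -> dict:
    return {"final_response": {"summary": "..."}}      # step 4: six-agent LangGraph run


def format_response(workflow_result: dict) -> dict:
    return {"response": workflow_result["final_response"]}  # step 5: API/CLI schema


def analyze(text: str) -> dict:
    biomarkers, context = extract(text)
    prediction = predict_disease_simple(biomarkers)
    return format_response(run_workflow(biomarkers, prediction, context))
```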

Key Data Structures

  • GuildState in src/state.py is the shared workflow state; it depends on additive accumulation for parallel outputs.
  • PatientInput holds structured biomarkers, prediction data, and patient context.
  • The response format is built in ResponseSynthesizerAgent and then translated into API schemas in RagBotService.
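The additive-accumulation pattern noted for GuildState can be sketched with a minimal TypedDict. Field names follow this review; the real definitions in src/state.py may differ:

```python
import operator
from typing import Annotated, List, TypedDict


class AgentOutput(TypedDict):
    agent_name: str
    content: str


class GuildState(TypedDict):
    # Annotated with operator.add so LangGraph concatenates the lists
    # returned by parallel nodes instead of overwriting them.
    agent_outputs: Annotated[List[AgentOutput], operator.add]
    biomarker_flags: Annotated[List[str], operator.add]
    safety_alerts: Annotated[List[str], operator.add]


# Simulate how the reducer merges two parallel node updates:
def merge(a: List, b: List) -> List:
    return operator.add(a, b)


merged = merge([{"agent_name": "analyzer", "content": "ok"}],
               [{"agent_name": "explainer", "content": "ok"}])
```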

Knowledge Base

  • PDFs are chunked and embedded into FAISS (src/pdf_processor.py).
  • Three retrievers (disease explainer, biomarker linker, clinical guidelines) share the same FAISS index with varying k values.
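The shared-index, per-retriever-k arrangement can be sketched as follows. The search function and the specific k values are placeholders, not the project's actual settings:

```python
from dataclasses import dataclass
from typing import Callable, List


# Stand-in for the shared FAISS index: one search function, three
# retrievers that differ only in how many chunks (k) they request.
@dataclass
class Retriever:
    search: Callable[[str, int], List[str]]
    k: int

    def retrieve(self, query: str) -> List[str]:
        return self.search(query, self.k)


def fake_index_search(query: str, k: int) -> List[str]:
    # Placeholder for something like index.similarity_search(query, k=k)
    return [f"chunk-{i} for {query!r}" for i in range(k)]


# Hypothetical k values; the real ones live in the project config.
disease_explainer = Retriever(fake_index_search, k=6)
biomarker_linker = Retriever(fake_index_search, k=4)
clinical_guidelines = Retriever(fake_index_search, k=8)
```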

Deep Review Findings

Critical Issues

  1. [OPEN] State propagation is incomplete across the workflow.

    • src/agents/biomarker_analyzer.py returns only agent_outputs; the computed biomarker_flags and safety_alerts are never written into the top-level GuildState keys that the workflow expects to accumulate.
    • src/workflow.py initializes biomarker_flags and safety_alerts in the state, but none of the agents return updates to those keys. As a result, workflow_result.get("biomarker_flags") and workflow_result.get("safety_alerts") are likely empty when the API response is formatted in api/app/services/ragbot.py.
    • Effect: API output will frequently miss biomarkers and alerts, and downstream consumers will incorrectly assume a clean result set.
    • Recommendation: return biomarker_flags and safety_alerts from the Biomarker Analyzer agent so they accumulate in the state. Ensure the Response Synthesizer reads those same keys when building the final response.
  2. [OPEN] LangGraph merge behavior is unsafe for parallel outputs.

    • GuildState uses Annotated[List[AgentOutput], operator.add] for additive merging, but the nodes return only { 'agent_outputs': [output] } and nothing else. This is okay for agent_outputs, but parallel agents also read from the full agent_outputs list inside the state to infer prior results.
    • In parallel branches, a given agent might read a partial agent_outputs list depending on execution order. This is visible in the BiomarkerDiseaseLinkerAgent and ClinicalGuidelinesAgent which read the prior Biomarker Analyzer output by searching agent_outputs.
    • Effect: nondeterministic behavior if LangGraph schedules a branch before the Biomarker Analyzer output is merged, or if merges occur after the branch starts. This can degrade evidence selection and recommendations.
    • Recommendation: explicitly pass relevant artifacts as dedicated state fields updated by the Biomarker Analyzer, and read those fields directly instead of scanning agent_outputs.
  3. [RESOLVED] Schema mismatch between workflow output and API formatter.

    • ResponseSynthesizerAgent returns a structured response with keys like patient_summary, prediction_explanation, clinical_recommendations, confidence_assessment, and safety_alerts.
    • RagBotService._format_response() now correctly reads from final_response and handles both Pydantic objects and dicts.
    • The CLI (scripts/chat.py) uses _coerce_to_dict() and format_conversational() to safely handle all output types.
    • Fix applied: _format_response() updated + _coerce_to_dict() helper added.
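The recommendations in items 1 and 2 can be sketched together: the analyzer node returns the dedicated keys so the additive reducers merge them into shared state, and downstream agents read those fields directly instead of scanning agent_outputs. Key names follow the review; the node bodies are assumptions, not the actual code:

```python
def biomarker_analyzer_node(state: dict) -> dict:
    biomarkers = state.get("biomarkers", {})
    flags = [name for name, info in biomarkers.items() if info.get("abnormal")]
    alerts = [f"Critical: {name}" for name in flags
              if biomarkers[name].get("critical")]
    output = {"agent_name": "biomarker_analyzer", "flags": flags}
    # Return all three keys so the additive reducers accumulate them into
    # the shared state, not just agent_outputs.
    return {
        "agent_outputs": [output],
        "biomarker_flags": flags,
        "safety_alerts": alerts,
    }


def clinical_guidelines_node(state: dict) -> dict:
    # Read the dedicated field directly; no order-dependent scan over
    # agent_outputs looking for the analyzer's entry.
    flags = state.get("biomarker_flags", [])
    recs = [f"Review guideline for {name}" for name in flags]
    return {"agent_outputs": [{"agent_name": "clinical_guidelines",
                               "recommendations": recs}]}


update = biomarker_analyzer_node({
    "biomarkers": {
        "glucose": {"abnormal": True, "critical": True},
        "hdl": {"abnormal": False},
    }
})
followup = clinical_guidelines_node({"biomarker_flags": update["biomarker_flags"]})
```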

High Priority Issues

  1. [OPEN] Prediction confidence is forced to 0.5 and default disease is always Diabetes.

    • Both the API and CLI predict_disease_simple functions enforce a minimum confidence of 0.5 and default to Diabetes when confidence is low.
    • Effect: leads to biased predictions and false confidence. This is risky in a medical domain and undermines reliability assessments.
    • Recommendation: return a low-confidence prediction explicitly and mark reliability as low; avoid forcing a disease when evidence is insufficient.
  2. [RESOLVED] Different biomarker naming schemes across extraction modules.

    • Both CLI and API now use the shared src/biomarker_normalization.py module with 80+ aliases mapped to 24 canonical names.
    • Fix applied: unified normalization in both scripts/chat.py and api/app/services/extraction.py.
  3. [RESOLVED] Use of console glyphs and non-ASCII prefixes in logs and output.

    • Debug prints removed from CLI. Logging suppressed for noisy HuggingFace/transformers output.
    • API responses use clean JSON only; CLI uses UTF-8 emojis only in user-facing output.
    • Fix applied: [DEBUG] prints removed, BertModel LOAD REPORT suppressed, HuggingFace deprecation warnings filtered.
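The recommendation for High Priority item 1 can be sketched as a predictor that surfaces low-confidence results instead of flooring confidence at 0.5 or defaulting to Diabetes. The function shape is an assumption, not the real predict_disease_simple:

```python
from typing import Optional, TypedDict


class Prediction(TypedDict):
    disease: Optional[str]
    confidence: float
    reliability: str


def predict_disease(scores: dict, threshold: float = 0.5) -> Prediction:
    if not scores:
        # No evidence at all: say so explicitly rather than guessing.
        return {"disease": None, "confidence": 0.0, "reliability": "low"}
    disease, confidence = max(scores.items(), key=lambda kv: kv[1])
    if confidence < threshold:
        # Surface the low-confidence result instead of forcing a floor
        # of 0.5 or falling back to a default disease.
        return {"disease": disease, "confidence": confidence,
                "reliability": "low"}
    return {"disease": disease, "confidence": confidence,
            "reliability": "normal"}
```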
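The shared normalization described in High Priority item 2 can be sketched with a small alias table. The aliases below are a hypothetical subset; the real module maps 80+ aliases to 24 canonical names:

```python
# Hypothetical subset of the alias table in src/biomarker_normalization.py.
BIOMARKER_ALIASES = {
    "blood sugar": "glucose",
    "fasting glucose": "glucose",
    "hb": "hemoglobin",
    "haemoglobin": "hemoglobin",
    "hdl cholesterol": "hdl",
}


def normalize_biomarker(name: str) -> str:
    # Case- and whitespace-insensitive lookup; unknown names pass through
    # lowercased so both CLI and API land on the same canonical form.
    key = name.strip().lower()
    return BIOMARKER_ALIASES.get(key, key)
```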

Medium Priority Issues

  1. [RESOLVED] Inconsistent model selection between agents.

    • All agents now use llm_config centralized configuration (planner, analyzer, explainer, synthesizer properties).
    • Fix applied: src/llm_config.py provides LLMConfig singleton with per-role properties.
  2. [RESOLVED] Potential JSON parsing fragility in extraction.

    • _parse_llm_json() now handles markdown fences, trailing text, and partial JSON recovery.
    • Fix applied: robust JSON parser in api/app/services/extraction.py with test coverage (test_json_parsing.py).
  3. [RESOLVED] Knowledge base retrieval does not enforce citations.

    • Disease Explainer agent now checks sop.require_pdf_citations and returns "insufficient evidence" when no documents are retrieved.
    • Fix applied: citation guardrail in src/agents/disease_explainer.py with test (test_citation_guardrails.py).
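The centralized model selection in Medium Priority item 1 can be sketched as a singleton with per-role properties. Model names here are placeholders, not the project's actual choices:

```python
class LLMConfig:
    """Singleton exposing one model name per agent role."""
    _instance = None

    # Hypothetical model names; the real mapping lives in src/llm_config.py.
    _MODELS = {
        "planner": "model-large",
        "analyzer": "model-large",
        "explainer": "model-small",
        "synthesizer": "model-small",
    }

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    @property
    def planner(self) -> str:
        return self._MODELS["planner"]

    @property
    def analyzer(self) -> str:
        return self._MODELS["analyzer"]

    @property
    def explainer(self) -> str:
        return self._MODELS["explainer"]

    @property
    def synthesizer(self) -> str:
        return self._MODELS["synthesizer"]


llm_config = LLMConfig()
```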
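The robust parsing in Medium Priority item 2 can be sketched as below: strip markdown fences first, then fall back to the first brace-delimited span. This is a simplified stand-in for _parse_llm_json, not the actual implementation:

```python
import json
import re


def parse_llm_json(text: str) -> dict:
    """Best-effort JSON extraction from an LLM reply: strips markdown
    fences, then falls back to the first {...} span in the text."""
    # Remove ```json ... ``` fences if present.
    fenced = re.search(r"```(?:json)?\s*(.*?)```", text, re.DOTALL)
    if fenced:
        text = fenced.group(1)
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # Fall back: grab the brace-delimited span amid surrounding chatter.
    span = re.search(r"\{.*\}", text, re.DOTALL)
    if span:
        return json.loads(span.group(0))
    raise ValueError("no JSON object found in LLM output")
```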
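The citation guardrail in Medium Priority item 3 can be sketched as a check that refuses to answer when retrieval comes back empty. The function shape and document schema here are assumptions about disease_explainer.py, not the real code:

```python
from typing import List


def explain_with_citations(query: str, docs: List[dict],
                           require_citations: bool = True) -> dict:
    # Hypothetical shape of the guardrail: refuse to answer when citation
    # enforcement is on and no documents were retrieved.
    if require_citations and not docs:
        return {"answer": "insufficient evidence", "citations": []}
    citations = [d.get("source", "unknown") for d in docs]
    return {"answer": f"explanation for {query!r}", "citations": citations}
```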

Low Priority Issues

  1. [OPEN] Error handling does not preserve original exceptions cleanly in API layer.

    • Exceptions are wrapped in a bare RuntimeError, discarding the original exception type and chained cause; RagBotService.analyze() does not attach contextual hints (e.g., which agent failed).
    • Recommendation: wrap exceptions with agent name and error classification to improve observability.
  2. [RESOLVED] Hard-coded expected biomarker count (24) in Confidence Assessor.

    • Now uses BiomarkerValidator().expected_biomarker_count() which reads from config/biomarker_references.json.
    • Test: test_validator_count.py verifies count matches reference config.
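The recommendation for Low Priority item 1 can be sketched as an error type that carries the agent name and a coarse error class while preserving the original exception via `raise ... from`. Names here are illustrative, not the actual API-layer code:

```python
class AgentError(RuntimeError):
    """RuntimeError enriched with the failing agent's name and a coarse
    error classification; the original exception survives as __cause__."""

    def __init__(self, agent: str, kind: str, original: Exception):
        super().__init__(f"[{agent}] {kind}: {original}")
        self.agent = agent
        self.kind = kind


def run_agent(agent_name: str, fn):
    try:
        return fn()
    except Exception as exc:
        # Chain the original exception so tracebacks stay intact.
        raise AgentError(agent_name, type(exc).__name__, exc) from exc
```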
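The config-driven count in Low Priority item 2 can be sketched as follows; the config shape is a guess at config/biomarker_references.json, and the helper name is illustrative:

```python
import json


def expected_biomarker_count(config_text: str) -> int:
    # Count entries in the reference config rather than hard-coding 24.
    return len(json.loads(config_text))


# Hypothetical config mirroring the reference file's top-level shape.
sample_config = json.dumps({f"marker_{i}": {"unit": "mg/dL"} for i in range(24)})
```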

Suggested Improvements (Summary)

  1. Align workflow output and API schema. [RESOLVED]
  2. Promote biomarker flags and safety alerts to first-class state fields in the workflow. [OPEN]
  3. Use a shared normalization utility. [RESOLVED]
  4. Remove forced minimum confidence and default disease; permit "low confidence" results. [OPEN]
  5. Introduce citation enforcement as a guardrail for RAG outputs. [RESOLVED]
  6. Centralize model selection and logging format. [RESOLVED]

Verification Gaps

The following should be tested once fixes are made:

  • Natural language extraction with partial and noisy inputs.
  • Workflow run where no abnormal biomarkers are detected.
  • API response schema validation for both natural and structured routes.
  • Parallel agent execution determinism (state access to biomarker analysis).
  • CLI behavior for biomarker names that differ from API normalization.