# RagBot Deep Review

> **Last updated**: February 2026  
> Items marked **[RESOLVED]** have been fixed. Items marked **[OPEN]** remain as future work.

## Scope

This review covers the end-to-end workflow and supporting services for RagBot, focusing on design correctness, reliability, safety guardrails, and maintainability. The review is based on a close reading of the workflow orchestration, agent implementations, API wiring, extraction and prediction logic, and the knowledge base pipeline.

Primary files reviewed:
- `src/workflow.py`
- `src/state.py`
- `src/config.py`
- `src/agents/*`
- `src/biomarker_validator.py`
- `src/pdf_processor.py`
- `api/app/main.py`
- `api/app/routes/analyze.py`
- `api/app/services/extraction.py`
- `api/app/services/ragbot.py`
- `scripts/chat.py`

## Architectural Understanding (Condensed)

### End-to-End Flow
1. Input arrives via CLI (`scripts/chat.py`) or REST API (`api/app/routes/analyze.py`).
2. Natural language inputs are parsed by the extraction service (`api/app/services/extraction.py`) to produce normalized biomarkers and patient context.
3. A rule-based prediction (`predict_disease_simple`) produces a disease hypothesis and probabilities.
4. The LangGraph workflow (`src/workflow.py`) orchestrates six agents: Biomarker Analyzer, Disease Explainer, Biomarker Linker, Clinical Guidelines, Confidence Assessor, Response Synthesizer.
5. The synthesized output is formatted into API schemas (`api/app/services/ragbot.py`) or into CLI-friendly responses (`scripts/chat.py`).
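The five steps above can be condensed into plain function composition. The stage bodies below are illustrative stand-ins for the modules listed, not the project's actual code:

```python
def extract(raw_text: str) -> dict:
    # Step 2 stand-in: the real extraction service parses free text
    # into normalized biomarkers plus patient context.
    return {"biomarkers": {"glucose": 300.0}, "context": raw_text}

def predict(extracted: dict) -> dict:
    # Step 3 stand-in for the rule-based predict_disease_simple.
    high = extracted["biomarkers"]["glucose"] > 126
    return {"disease": "Diabetes" if high else None,
            "confidence": 0.8 if high else 0.2}

def run_workflow(extracted: dict, prediction: dict) -> dict:
    # Step 4 stand-in: the LangGraph workflow fans out to six agents
    # and synthesizes a final response.
    return {"final_response": {"patient_summary": "Elevated glucose noted.",
                               "prediction": prediction}}

def format_response(workflow_result: dict) -> dict:
    # Step 5 stand-in for RagBotService / CLI formatting.
    return workflow_result["final_response"]

def analyze(raw_text: str) -> dict:
    extracted = extract(raw_text)
    prediction = predict(extracted)
    return format_response(run_workflow(extracted, prediction))
```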

### Key Data Structures
- `GuildState` in `src/state.py` is the shared workflow state; it relies on additive accumulation (`operator.add`) to merge outputs from parallel agents.
- `PatientInput` holds structured biomarkers, prediction data, and patient context.
- The response format is built in `ResponseSynthesizerAgent` and then translated into API schemas in `RagBotService`.
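A minimal sketch of the accumulating state shape described above, assuming the field names used in this review (the real definitions in `src/state.py` may differ):

```python
import operator
from typing import Annotated, Any, Dict, List, TypedDict

class AgentOutput(TypedDict):
    agent_name: str
    content: Dict[str, Any]

class GuildState(TypedDict):
    # Annotated with operator.add so LangGraph concatenates the lists
    # returned by parallel nodes instead of overwriting them.
    agent_outputs: Annotated[List[AgentOutput], operator.add]
    biomarker_flags: Annotated[List[Dict[str, Any]], operator.add]
    safety_alerts: Annotated[List[str], operator.add]

# The reducer LangGraph applies to such fields is plain list addition:
merged = operator.add(["alert from agent A"], ["alert from agent B"])
```

The `operator.add` reducer is what makes it safe for two parallel nodes to each return a partial update for the same key.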

### Knowledge Base
- PDFs are chunked and embedded into FAISS (`src/pdf_processor.py`).
- Three retrievers (disease explainer, biomarker linker, clinical guidelines) share the same FAISS index with varying `k` values.
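The shared-index, varying-`k` arrangement can be illustrated with a stdlib-only stand-in (the real code uses FAISS via `src/pdf_processor.py`; the retriever names and `k` values below are assumed for illustration):

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class SharedIndex:
    """Stand-in for the FAISS index built by the PDF pipeline."""
    chunks: List[str]

    def search(self, query: str, k: int) -> List[str]:
        # Toy similarity: keep chunks that mention the query term.
        hits = [c for c in self.chunks if query.lower() in c.lower()]
        return hits[:k]

def make_retriever(index: SharedIndex, k: int) -> Callable[[str], List[str]]:
    # Each retriever shares the same index but asks for a different k.
    return lambda query: index.search(query, k)

index = SharedIndex(chunks=[
    "HbA1c above 6.5% supports a diabetes diagnosis",
    "HbA1c should be rechecked every 3 months",
    "LDL targets depend on cardiovascular risk",
])
disease_explainer = make_retriever(index, k=3)     # broad context
biomarker_linker = make_retriever(index, k=2)
clinical_guidelines = make_retriever(index, k=1)   # tight, citation-focused
```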

## Deep Review Findings

### Critical Issues

1. **[OPEN] State propagation is incomplete across the workflow.**
   - `src/agents/biomarker_analyzer.py` returns only `agent_outputs` and not the computed `biomarker_flags` or `safety_alerts` into the top-level `GuildState` keys that the workflow expects to accumulate.
   - `src/workflow.py` initializes `biomarker_flags` and `safety_alerts` in the state, but none of the agents return updates to those keys. As a result, `workflow_result.get("biomarker_flags")` and `workflow_result.get("safety_alerts")` are likely empty when the API response is formatted in `api/app/services/ragbot.py`.
   - Effect: API output will frequently miss biomarkers and alerts, and downstream consumers will incorrectly assume a clean result set.
   - Recommendation: return `biomarker_flags` and `safety_alerts` from the Biomarker Analyzer agent so they accumulate in the state. Ensure the response synthesizer reads those same keys.

2. **[OPEN] LangGraph merge behavior is unsafe for parallel outputs.**
   - `GuildState` uses `Annotated[List[AgentOutput], operator.add]` for additive merging, but the nodes return only `{ 'agent_outputs': [output] }`. That is fine for `agent_outputs` itself; the problem is that parallel agents also read the full `agent_outputs` list from state to infer prior results.
   - In parallel branches, a given agent might read a partial `agent_outputs` list depending on execution order. This is visible in the `BiomarkerDiseaseLinkerAgent` and `ClinicalGuidelinesAgent` which read the prior Biomarker Analyzer output by searching `agent_outputs`.
   - Effect: nondeterministic behavior whenever LangGraph schedules a branch before the Biomarker Analyzer's output has been merged into state, which can silently degrade evidence selection and recommendations.
   - Recommendation: explicitly pass relevant artifacts as dedicated state fields updated by the Biomarker Analyzer, and read those fields directly instead of scanning `agent_outputs`.

3. **[RESOLVED] Schema mismatch between workflow output and API formatter.**
   - `ResponseSynthesizerAgent` returns a structured response with keys like `patient_summary`, `prediction_explanation`, `clinical_recommendations`, `confidence_assessment`, and `safety_alerts`.
   - `RagBotService._format_response()` now correctly reads from `final_response` and handles both Pydantic objects and dicts.
   - The CLI (`scripts/chat.py`) uses `_coerce_to_dict()` and `format_conversational()` to safely handle all output types.
   - **Fix applied**: `_format_response()` updated + `_coerce_to_dict()` helper added.
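A hedged sketch of the fix recommended for the two open critical issues above: the analyzer node returns the accumulating top-level keys, and downstream agents read those dedicated fields instead of scanning `agent_outputs`. All function and field names here are illustrative, not the project's actual code:

```python
from typing import Any, Dict, List

def biomarker_analyzer_node(state: Dict[str, Any]) -> Dict[str, Any]:
    flags = [b for b in state["biomarkers"] if b["value"] > b["upper_limit"]]
    alerts = [f"{b['name']} critically high" for b in flags
              if b["value"] > 2 * b["upper_limit"]]
    # Return the top-level keys the workflow accumulates, not just
    # agent_outputs, so the API formatter actually sees them.
    return {
        "agent_outputs": [{"agent": "biomarker_analyzer"}],
        "biomarker_flags": flags,
        "safety_alerts": alerts,
    }

def biomarker_linker_node(state: Dict[str, Any]) -> Dict[str, Any]:
    # Deterministic read from a dedicated field instead of searching
    # the (possibly partially merged) agent_outputs list.
    flags: List[Dict[str, Any]] = state.get("biomarker_flags", [])
    links = [f"{f['name']}: evidence lookup" for f in flags]
    return {"agent_outputs": [{"agent": "biomarker_linker", "links": links}]}

update = biomarker_analyzer_node(
    {"biomarkers": [{"name": "glucose", "value": 300, "upper_limit": 140}]}
)
```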

### High Priority Issues

1. **[OPEN] Prediction confidence is forced to 0.5 and default disease is always Diabetes.**
   - Both the API and CLI `predict_disease_simple` functions enforce a minimum confidence of 0.5 and default to Diabetes when confidence is low.
   - Effect: leads to biased predictions and false confidence. This is risky in a medical domain and undermines reliability assessments.
   - Recommendation: return a low-confidence prediction explicitly and mark reliability as low; avoid forcing a disease when evidence is insufficient.

2. **[RESOLVED] Different biomarker naming schemes across extraction modules.**
   - Both CLI and API now use the shared `src/biomarker_normalization.py` module with 80+ aliases mapped to 24 canonical names.
   - **Fix applied**: unified normalization in both `scripts/chat.py` and `api/app/services/extraction.py`.

3. **[RESOLVED] Use of console glyphs and non-ASCII prefixes in logs and output.**
   - Debug prints removed from CLI. Logging suppressed for noisy HuggingFace/transformers output.
   - API responses use clean JSON only; CLI uses UTF-8 emojis only in user-facing output.
   - **Fix applied**: `[DEBUG]` prints removed, `BertModel LOAD REPORT` suppressed, HuggingFace deprecation warnings filtered.
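For the first high-priority item, the recommended behavior can be sketched as follows. The threshold and tier labels are assumptions for illustration, not values from the project:

```python
from typing import Dict, Optional, Tuple

LOW_CONFIDENCE_THRESHOLD = 0.4  # assumed cutoff, not the project's value

def predict_disease(scores: Dict[str, float]) -> Tuple[Optional[str], float, str]:
    """Return (disease, confidence, reliability) without inflating weak evidence."""
    if not scores:
        return None, 0.0, "low"
    disease, confidence = max(scores.items(), key=lambda kv: kv[1])
    if confidence < LOW_CONFIDENCE_THRESHOLD:
        # Report the weak hypothesis honestly instead of clamping the
        # score to 0.5 and defaulting to Diabetes.
        return None, confidence, "low"
    return disease, confidence, "high" if confidence >= 0.7 else "moderate"
```

The key property is that a weak signal surfaces as `(None, confidence, "low")` so downstream reliability assessments stay honest.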

### Medium Priority Issues

1. **[RESOLVED] Inconsistent model selection between agents.**
   - All agents now use `llm_config` centralized configuration (planner, analyzer, explainer, synthesizer properties).
   - **Fix applied**: `src/llm_config.py` provides `LLMConfig` singleton with per-role properties.

2. **[RESOLVED] Potential JSON parsing fragility in extraction.**
   - `_parse_llm_json()` now handles markdown fences, trailing text, and partial JSON recovery.
   - **Fix applied**: robust JSON parser in `api/app/services/extraction.py` with test coverage (`test_json_parsing.py`).

3. **[RESOLVED] Knowledge base retrieval does not enforce citations.**
   - Disease Explainer agent now checks `sop.require_pdf_citations` and returns "insufficient evidence" when no documents are retrieved.
   - **Fix applied**: citation guardrail in `src/agents/disease_explainer.py` with test (`test_citation_guardrails.py`).
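The kind of tolerant parsing described in the second medium-priority item can be sketched like this. This is a minimal stand-in for `_parse_llm_json()`, whose actual implementation in `api/app/services/extraction.py` may differ:

```python
import json
import re
from typing import Any, Optional

def parse_llm_json(text: str) -> Optional[Any]:
    # Drop ```json ... ``` fences if the model wrapped its answer.
    fenced = re.search(r"```(?:json)?\s*(.*?)```", text, re.DOTALL)
    if fenced:
        text = fenced.group(1)
    # Fall back to the outermost {...} span to shed trailing prose.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        return json.loads(text[start : end + 1])
    except json.JSONDecodeError:
        return None
```

Returning `None` on failure (rather than raising) lets the caller fall back to a retry or a "could not extract" response.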

### Low Priority Issues

1. **[OPEN] Error handling does not preserve original exceptions cleanly in API layer.**
   - Exceptions are wrapped in a bare `RuntimeError` that discards the error classification, and `RagBotService.analyze()` does not attach contextual hints (e.g., which agent failed).
   - Recommendation: wrap exceptions with agent name and error classification to improve observability.

2. **[RESOLVED] Hard-coded expected biomarker count (24) in Confidence Assessor.**
   - Now uses `BiomarkerValidator().expected_biomarker_count()` which reads from `config/biomarker_references.json`.
   - Test: `test_validator_count.py` verifies count matches reference config.
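For the open error-handling item, the recommendation can be sketched with exception chaining; the class and function names below are illustrative, not the project's:

```python
class AgentExecutionError(RuntimeError):
    """Wraps a failure with the name of the agent that raised it."""

    def __init__(self, agent_name: str, original: Exception):
        super().__init__(f"agent '{agent_name}' failed: {original}")
        self.agent_name = agent_name

def run_agent(agent_name: str, fn, state):
    try:
        return fn(state)
    except Exception as exc:
        # `raise ... from exc` preserves the original traceback for
        # logs while exposing the agent name to API error handlers.
        raise AgentExecutionError(agent_name, exc) from exc
```

With `__cause__` preserved, observability tooling can still classify the underlying exception type while the API layer reports which agent failed.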

## Suggested Improvements (Summary)

1. ~~Align workflow output and API schema.~~ **[RESOLVED]**
2. Promote biomarker flags and safety alerts to first-class state fields in the workflow. **[OPEN]**
3. ~~Use a shared normalization utility.~~ **[RESOLVED]**
4. Remove forced minimum confidence and default disease; permit "low confidence" results. **[OPEN]**
5. ~~Introduce citation enforcement as a guardrail for RAG outputs.~~ **[RESOLVED]**
6. ~~Centralize model selection and logging format.~~ **[RESOLVED]**

## Verification Gaps

The following should be tested once fixes are made:
- Natural language extraction with partial and noisy inputs.
- Workflow run where no abnormal biomarkers are detected.
- API response schema validation for both natural and structured routes.
- Parallel agent execution determinism (state access to biomarker analysis).
- CLI behavior for biomarker names that differ from API normalization.