# RagBot Deep Review
> **Last updated**: February 2026
> Items marked **[RESOLVED]** have been fixed. Items marked **[OPEN]** remain as future work.
## Scope
This review covers the end-to-end workflow and supporting services for RagBot, focusing on design correctness, reliability, safety guardrails, and maintainability. The review is based on a close reading of the workflow orchestration, agent implementations, API wiring, extraction and prediction logic, and the knowledge base pipeline.
Primary files reviewed:
- `src/workflow.py`
- `src/state.py`
- `src/config.py`
- `src/agents/*`
- `src/biomarker_validator.py`
- `src/pdf_processor.py`
- `api/app/main.py`
- `api/app/routes/analyze.py`
- `api/app/services/extraction.py`
- `api/app/services/ragbot.py`
- `scripts/chat.py`
## Architectural Understanding (Condensed)
### End-to-End Flow
1. Input arrives via CLI (`scripts/chat.py`) or REST API (`api/app/routes/analyze.py`).
2. Natural language inputs are parsed by the extraction service (`api/app/services/extraction.py`) to produce normalized biomarkers and patient context.
3. A rule-based prediction (`predict_disease_simple`) produces a disease hypothesis and probabilities.
4. The LangGraph workflow (`src/workflow.py`) orchestrates six agents: Biomarker Analyzer, Disease Explainer, Biomarker Linker, Clinical Guidelines, Confidence Assessor, Response Synthesizer.
5. The synthesized output is formatted into API schemas (`api/app/services/ragbot.py`) or into CLI-friendly responses (`scripts/chat.py`).
### Key Data Structures
- `GuildState` in `src/state.py` is the shared workflow state; it depends on additive accumulation for parallel outputs.
- `PatientInput` holds structured biomarkers, prediction data, and patient context.
- The response format is built in `ResponseSynthesizerAgent` and then translated into API schemas in `RagBotService`.
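The additive-accumulation contract can be sketched as follows; the `AgentOutput` fields here are illustrative stand-ins, not the project's exact schema.

```python
import operator
from dataclasses import dataclass
from typing import Annotated, List, TypedDict

@dataclass
class AgentOutput:
    agent_name: str
    content: dict

class GuildState(TypedDict):
    # Annotated with operator.add: when parallel nodes each return
    # {"agent_outputs": [output]}, LangGraph concatenates the lists
    # instead of overwriting the key.
    agent_outputs: Annotated[List[AgentOutput], operator.add]

# Simulate the merge LangGraph performs for two parallel updates.
left = [AgentOutput("biomarker_analyzer", {"flags": ["glucose_high"]})]
right = [AgentOutput("disease_explainer", {"summary": "..."})]
merged = operator.add(left, right)
```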
### Knowledge Base
- PDFs are chunked and embedded into FAISS (`src/pdf_processor.py`).
- Three retrievers (disease explainer, biomarker linker, clinical guidelines) share the same FAISS index with varying `k` values.
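The shared-index pattern can be sketched with a toy word-overlap ranker standing in for FAISS embeddings (chunk texts here are invented):

```python
from typing import Callable, List

# Toy stand-in for the shared FAISS index: a list of text chunks.
SHARED_CHUNKS = [
    "Fasting glucose above 126 mg/dL suggests diabetes.",
    "HbA1c reflects average glucose over roughly three months.",
    "LDL cholesterol targets depend on cardiovascular risk.",
    "Creatinine and eGFR together assess kidney function.",
    "TSH outside the reference range suggests thyroid dysfunction.",
]

def make_retriever(k: int) -> Callable[[str], List[str]]:
    """Build a retriever closed over the shared index with its own k."""
    def retrieve(query: str) -> List[str]:
        # Naive relevance: rank chunks by word overlap with the query.
        words = set(query.lower().split())
        ranked = sorted(
            SHARED_CHUNKS,
            key=lambda chunk: -len(words & set(chunk.lower().split())),
        )
        return ranked[:k]
    return retrieve

# One index, three retrievers with different breadth.
disease_explainer_retriever = make_retriever(k=4)
biomarker_linker_retriever = make_retriever(k=2)
guidelines_retriever = make_retriever(k=3)
```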
## Deep Review Findings
### Critical Issues
1. **[OPEN] State propagation is incomplete across the workflow.**
- `src/agents/biomarker_analyzer.py` returns only `agent_outputs` and not the computed `biomarker_flags` or `safety_alerts` into the top-level `GuildState` keys that the workflow expects to accumulate.
- `src/workflow.py` initializes `biomarker_flags` and `safety_alerts` in the state, but none of the agents return updates to those keys. As a result, `workflow_result.get("biomarker_flags")` and `workflow_result.get("safety_alerts")` are likely empty when the API response is formatted in `api/app/services/ragbot.py`.
- Effect: API output will frequently miss biomarkers and alerts, and downstream consumers will incorrectly assume a clean result set.
- Recommendation: return `biomarker_flags` and `safety_alerts` from the Biomarker Analyzer agent so they accumulate in the state, and ensure the Response Synthesizer reads those same keys.
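A sketch of the recommended fix, assuming the state annotates those keys additively (key names match the review; the agent body is a placeholder):

```python
import operator
from typing import Annotated, Any, Dict, List, TypedDict

class GuildState(TypedDict, total=False):
    agent_outputs: Annotated[List[Dict[str, Any]], operator.add]
    biomarker_flags: Annotated[List[str], operator.add]
    safety_alerts: Annotated[List[str], operator.add]

def biomarker_analyzer_node(state: GuildState) -> Dict[str, Any]:
    """Return flags/alerts alongside agent_outputs so the additive
    reducers merge them into the top-level keys instead of dropping them."""
    flags = ["glucose_high"]                  # placeholder for real analysis
    alerts = ["glucose critically elevated"]  # placeholder
    return {
        "agent_outputs": [{"agent": "biomarker_analyzer", "flags": flags}],
        "biomarker_flags": flags,
        "safety_alerts": alerts,
    }

update = biomarker_analyzer_node({"agent_outputs": []})
```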
2. **[OPEN] LangGraph merge behavior is unsafe for parallel outputs.**
- `GuildState` uses `Annotated[List[AgentOutput], operator.add]` for additive merging, and each node returns only `{ 'agent_outputs': [output] }`. That merge is safe in itself, but parallel agents also read the accumulated `agent_outputs` list from the state to infer prior results.
- In parallel branches, a given agent might read a partial `agent_outputs` list depending on execution order. This is visible in the `BiomarkerDiseaseLinkerAgent` and `ClinicalGuidelinesAgent` which read the prior Biomarker Analyzer output by searching `agent_outputs`.
- Effect: nondeterministic behavior if LangGraph schedules a branch before the Biomarker Analyzer output is merged, or if merges occur after the branch starts. This can degrade evidence selection and recommendations.
- Recommendation: explicitly pass relevant artifacts as dedicated state fields updated by the Biomarker Analyzer, and read those fields directly instead of scanning `agent_outputs`.
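The difference between scanning `agent_outputs` and reading a dedicated field can be sketched like this (function and key names are hypothetical):

```python
from typing import Any, Dict, List

def read_flags_fragile(state: Dict[str, Any]) -> List[str]:
    # Scans agent_outputs, hoping the analyzer's output has merged already;
    # silently returns empty if this branch was scheduled first.
    for out in state.get("agent_outputs", []):
        if out.get("agent") == "biomarker_analyzer":
            return out.get("flags", [])
    return []

def read_flags_robust(state: Dict[str, Any]) -> List[str]:
    # Reads a dedicated field written before the parallel fan-out.
    return state.get("biomarker_flags", [])

# Simulate a branch that starts before the analyzer's output is merged.
early_state = {"agent_outputs": [], "biomarker_flags": ["ldl_high"]}
```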
3. **[RESOLVED] Schema mismatch between workflow output and API formatter.**
- `ResponseSynthesizerAgent` returns a structured response with keys like `patient_summary`, `prediction_explanation`, `clinical_recommendations`, `confidence_assessment`, and `safety_alerts`.
- `RagBotService._format_response()` now correctly reads from `final_response` and handles both Pydantic objects and dicts.
- The CLI (`scripts/chat.py`) uses `_coerce_to_dict()` and `format_conversational()` to safely handle all output types.
- **Fix applied**: `_format_response()` updated + `_coerce_to_dict()` helper added.
### High Priority Issues
1. **[OPEN] Prediction confidence is forced to 0.5 and default disease is always Diabetes.**
- Both the API and CLI `predict_disease_simple` functions enforce a minimum confidence of 0.5 and default to Diabetes when confidence is low.
- Effect: leads to biased predictions and false confidence. This is risky in a medical domain and undermines reliability assessments.
- Recommendation: return a low-confidence prediction explicitly and mark reliability as low; avoid forcing a disease when evidence is insufficient.
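A minimal sketch of the recommended behavior, assuming the rule-based scorer yields per-disease scores (names and the threshold are illustrative):

```python
from typing import Dict

def predict_disease_honest(scores: Dict[str, float],
                           threshold: float = 0.5) -> Dict[str, object]:
    """Report the raw top score and a reliability flag instead of
    flooring confidence at 0.5 and defaulting to Diabetes."""
    if not scores:
        return {"disease": None, "confidence": 0.0, "reliable": False}
    disease, confidence = max(scores.items(), key=lambda kv: kv[1])
    reliable = confidence >= threshold
    return {
        "disease": disease if reliable else None,  # no forced default
        "confidence": confidence,                  # no floor applied
        "reliable": reliable,
    }

weak = predict_disease_honest({"Diabetes": 0.31, "Thyroid": 0.28})
```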
2. **[RESOLVED] Different biomarker naming schemes across extraction modules.**
- Both CLI and API now use the shared `src/biomarker_normalization.py` module with 80+ aliases mapped to 24 canonical names.
- **Fix applied**: unified normalization in both `scripts/chat.py` and `api/app/services/extraction.py`.
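The shared-normalization approach can be illustrated with a truncated alias table (a hypothetical slice, not the module's actual contents):

```python
# Hypothetical slice of an alias table; the real module maps 80+ aliases
# to 24 canonical names.
ALIASES = {
    "hba1c": "HbA1c",
    "a1c": "HbA1c",
    "glycated hemoglobin": "HbA1c",
    "fasting glucose": "Glucose_Fasting",
    "fbs": "Glucose_Fasting",
    "ldl": "LDL_Cholesterol",
    "ldl-c": "LDL_Cholesterol",
}

def normalize_biomarker(name: str) -> str:
    """Map a free-text biomarker name onto its canonical form;
    unknown names pass through unchanged."""
    return ALIASES.get(name.strip().lower(), name.strip())
```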
3. **[RESOLVED] Use of console glyphs and non-ASCII prefixes in logs and output.**
- Debug prints removed from CLI. Logging suppressed for noisy HuggingFace/transformers output.
- API responses use clean JSON only; CLI uses UTF-8 emojis only in user-facing output.
- **Fix applied**: `[DEBUG]` prints removed, `BertModel LOAD REPORT` suppressed, HuggingFace deprecation warnings filtered.
### Medium Priority Issues
1. **[RESOLVED] Inconsistent model selection between agents.**
- All agents now use `llm_config` centralized configuration (planner, analyzer, explainer, synthesizer properties).
- **Fix applied**: `src/llm_config.py` provides `LLMConfig` singleton with per-role properties.
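The per-role configuration pattern might look like the following sketch; the model names and temperatures are placeholders, not what `src/llm_config.py` actually ships:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelSpec:
    name: str
    temperature: float

class LLMConfig:
    """Singleton exposing one property per agent role, so every agent
    draws its model from one place instead of hard-coding it."""
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    @property
    def analyzer(self) -> ModelSpec:
        return ModelSpec("model-for-analysis", temperature=0.0)

    @property
    def synthesizer(self) -> ModelSpec:
        return ModelSpec("model-for-synthesis", temperature=0.3)

llm_config = LLMConfig()
```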
2. **[RESOLVED] Potential JSON parsing fragility in extraction.**
- `_parse_llm_json()` now handles markdown fences, trailing text, and partial JSON recovery.
- **Fix applied**: robust JSON parser in `api/app/services/extraction.py` with test coverage (`test_json_parsing.py`).
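A tolerant parser in the spirit of `_parse_llm_json()` could look like this sketch (the brace counting ignores braces inside strings, so it is illustrative only):

```python
import json
import re
from typing import Any, Optional

def parse_llm_json(text: str) -> Optional[Any]:
    """Strip markdown fences, then fall back to the first balanced
    {...} region when trailing text breaks a direct parse."""
    cleaned = re.sub(r"^```(?:json)?\s*|\s*```$", "", text.strip())
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        pass
    start = cleaned.find("{")
    if start == -1:
        return None
    depth = 0
    for i, ch in enumerate(cleaned[start:], start):
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                try:
                    return json.loads(cleaned[start:i + 1])
                except json.JSONDecodeError:
                    return None
    return None
```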
3. **[RESOLVED] Knowledge base retrieval does not enforce citations.**
- Disease Explainer agent now checks `sop.require_pdf_citations` and returns "insufficient evidence" when no documents are retrieved.
- **Fix applied**: citation guardrail in `src/agents/disease_explainer.py` with test (`test_citation_guardrails.py`).
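The citation guardrail can be sketched as a simple precondition check (types and field names are assumptions):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class SOP:
    require_pdf_citations: bool = True

@dataclass
class RetrievedDoc:
    source: str
    content: str

def explain_disease(docs: List[RetrievedDoc], sop: SOP) -> dict:
    """Refuse to answer from the model alone when citations are
    required but retrieval returned nothing."""
    if sop.require_pdf_citations and not docs:
        return {"answer": "insufficient evidence", "citations": []}
    return {
        "answer": " ".join(d.content for d in docs),  # placeholder synthesis
        "citations": sorted({d.source for d in docs}),
    }
```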
### Low Priority Issues
1. **[OPEN] Error handling does not preserve original exceptions cleanly in API layer.**
- Exceptions are wrapped in `RuntimeError` without detail separation; `RagBotService.analyze()` does not attach contextual hints (e.g., which agent failed).
- Recommendation: wrap exceptions with agent name and error classification to improve observability.
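One way to implement the recommendation, sketched with a hypothetical `AgentError` carrying the agent name and a coarse error class:

```python
class AgentError(RuntimeError):
    """Wraps a failure with the agent name and an error classification,
    so API logs show where and what kind without digging through tracebacks."""
    def __init__(self, agent: str, kind: str, original: Exception):
        super().__init__(f"[{agent}] {kind}: {original}")
        self.agent = agent
        self.kind = kind

def run_agent(agent_name: str, fn, *args):
    """Execute an agent callable, translating failures into AgentError
    while preserving the original exception via `from exc`."""
    try:
        return fn(*args)
    except ValueError as exc:
        raise AgentError(agent_name, "bad_input", exc) from exc
    except Exception as exc:
        raise AgentError(agent_name, "internal", exc) from exc
```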
2. **[RESOLVED] Hard-coded expected biomarker count (24) in Confidence Assessor.**
- Now uses `BiomarkerValidator().expected_biomarker_count()` which reads from `config/biomarker_references.json`.
- Test: `test_validator_count.py` verifies count matches reference config.
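The config-driven count can be sketched with an embedded, truncated reference table (the real file defines 24 entries):

```python
import json

# Truncated stand-in for config/biomarker_references.json.
REFERENCES_JSON = """
{
  "Glucose_Fasting": {"low": 70, "high": 100, "unit": "mg/dL"},
  "HbA1c": {"low": 4.0, "high": 5.6, "unit": "%"},
  "LDL_Cholesterol": {"low": 0, "high": 100, "unit": "mg/dL"}
}
"""

class BiomarkerValidator:
    def __init__(self, raw: str = REFERENCES_JSON):
        self._references = json.loads(raw)

    def expected_biomarker_count(self) -> int:
        # Derived from the config, so adding a biomarker to the JSON
        # updates downstream checks without code changes.
        return len(self._references)
```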
## Suggested Improvements (Summary)
1. ~~Align workflow output and API schema.~~ **[RESOLVED]**
2. Promote biomarker flags and safety alerts to first-class state fields in the workflow. **[OPEN]**
3. ~~Use a shared normalization utility.~~ **[RESOLVED]**
4. Remove forced minimum confidence and default disease; permit "low confidence" results. **[OPEN]**
5. ~~Introduce citation enforcement as a guardrail for RAG outputs.~~ **[RESOLVED]**
6. ~~Centralize model selection and logging format.~~ **[RESOLVED]**
## Verification Gaps
The following should be tested once fixes are made:
- Natural language extraction with partial and noisy inputs.
- Workflow run where no abnormal biomarkers are detected.
- API response schema validation for both natural and structured routes.
- Parallel agent execution determinism (state access to biomarker analysis).
- CLI behavior for biomarker names that differ from API normalization.