Agentic-RagBot / docs /DEEP_REVIEW.md

Nikhil Pravin Pise

docs: update all documentation to reflect current codebase state

aefac4f 19 days ago

	# RagBot Deep Review

	> Last updated: February 2026
	> Items marked [RESOLVED] have been fixed. Items marked [OPEN] remain as future work.

	## Scope

	This review covers the end-to-end workflow and supporting services for RagBot, focusing on design correctness, reliability, safety guardrails, and maintainability. The review is based on a close reading of the workflow orchestration, agent implementations, API wiring, extraction and prediction logic, and the knowledge base pipeline.

	Primary files reviewed:
	- `src/workflow.py`
	- `src/state.py`
	- `src/config.py`
	- `src/agents/*`
	- `src/biomarker_validator.py`
	- `src/pdf_processor.py`
	- `api/app/main.py`
	- `api/app/routes/analyze.py`
	- `api/app/services/extraction.py`
	- `api/app/services/ragbot.py`
	- `scripts/chat.py`

	## Architectural Understanding (Condensed)

	### End-to-End Flow
	1. Input arrives via CLI (`scripts/chat.py`) or REST API (`api/app/routes/analyze.py`).
	2. Natural language inputs are parsed by the extraction service (`api/app/services/extraction.py`) to produce normalized biomarkers and patient context.
	3. A rule-based prediction (`predict_disease_simple`) produces a disease hypothesis and probabilities.
	4. The LangGraph workflow (`src/workflow.py`) orchestrates six agents: Biomarker Analyzer, Disease Explainer, Biomarker Linker, Clinical Guidelines, Confidence Assessor, Response Synthesizer.
	5. The synthesized output is formatted into API schemas (`api/app/services/ragbot.py`) or into CLI-friendly responses (`scripts/chat.py`).

	### Key Data Structures
	- `GuildState` in `src/state.py` is the shared workflow state; it depends on additive accumulation for parallel outputs.
	- `PatientInput` holds structured biomarkers, prediction data, and patient context.
	- The response format is built in `ResponseSynthesizerAgent` and then translated into API schemas in `RagBotService`.
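The additive accumulation that `GuildState` relies on can be illustrated without the graph runtime. This is a minimal sketch: `GuildStateSketch`, `patient_biomarkers`, and `merge_update` are hypothetical names for illustration, not the contents of `src/state.py`, and `merge_update` only imitates reducer-style merging.

```python
import operator
from typing import Annotated, Any, Dict, List, TypedDict, get_type_hints


class GuildStateSketch(TypedDict, total=False):
    # Hypothetical, trimmed-down stand-in for GuildState.
    patient_biomarkers: Dict[str, float]
    # The reducer (operator.add) asks the runtime to concatenate list
    # updates from different nodes instead of overwriting the field.
    agent_outputs: Annotated[List[Dict[str, Any]], operator.add]


def merge_update(state: Dict[str, Any], update: Dict[str, Any]) -> Dict[str, Any]:
    """Imitate reducer-based merging: use the annotated reducer if any, else overwrite."""
    hints = get_type_hints(GuildStateSketch, include_extras=True)
    merged = dict(state)
    for key, value in update.items():
        metadata = getattr(hints.get(key), "__metadata__", ())
        if metadata and key in merged:
            merged[key] = metadata[0](merged[key], value)  # e.g. list + list
        else:
            merged[key] = value
    return merged
```

Fields without a reducer are simply replaced by the latest update, which is why only annotated fields are safe targets for parallel writers.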

	### Knowledge Base
	- PDFs are chunked and embedded into FAISS (`src/pdf_processor.py`).
	- Three retrievers (disease explainer, biomarker linker, clinical guidelines) share the same FAISS index with varying `k` values.

	## Deep Review Findings

	### Critical Issues

	1. [OPEN] State propagation is incomplete across the workflow.
	- `src/agents/biomarker_analyzer.py` returns only `agent_outputs` and not the computed `biomarker_flags` or `safety_alerts` into the top-level `GuildState` keys that the workflow expects to accumulate.
	- `src/workflow.py` initializes `biomarker_flags` and `safety_alerts` in the state, but none of the agents return updates to those keys. As a result, `workflow_result.get("biomarker_flags")` and `workflow_result.get("safety_alerts")` are likely empty when the API response is formatted in `api/app/services/ragbot.py`.
	- Effect: API output will frequently miss biomarkers and alerts, and downstream consumers will incorrectly assume a clean result set.
	- Recommendation: return `biomarker_flags` and `safety_alerts` from the Biomarker Analyzer agent so they accumulate in the state. Ensure the Response Synthesizer reads those same keys.
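One possible shape for this recommendation is sketched below. The node body, the `patient_biomarkers` and `reference_ranges` keys, and the flag and alert dictionaries are assumptions made for illustration; the real agent in `src/agents/biomarker_analyzer.py` will differ.

```python
from typing import Any, Dict, List


def biomarker_analyzer_node(state: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical node: returns flags and alerts as top-level state updates."""
    flags: List[Dict[str, Any]] = []
    alerts: List[Dict[str, Any]] = []
    for name, value in state.get("patient_biomarkers", {}).items():
        low, high = state["reference_ranges"][name]  # assumed (low, high) pairs
        if value < low or value > high:
            status = "LOW" if value < low else "HIGH"
            flags.append({"name": name, "value": value, "status": status})
            # Simplification: every out-of-range value raises an alert.
            alerts.append({"biomarker": name, "message": f"{name} is {status}"})
    output = {"agent_name": "Biomarker Analyzer", "findings": {"flag_count": len(flags)}}
    return {
        "agent_outputs": [output],   # existing behavior
        "biomarker_flags": flags,    # newly surfaced to the shared state
        "safety_alerts": alerts,     # newly surfaced to the shared state
    }
```

With this shape, `workflow_result.get("biomarker_flags")` would be populated whenever the analyzer finds an out-of-range value, instead of staying at its initial empty value.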

	2. [OPEN] LangGraph merge behavior is unsafe for parallel outputs.
	- `GuildState` uses `Annotated[List[AgentOutput], operator.add]` for additive merging, but the nodes return only `{ 'agent_outputs': [output] }` and nothing else. This is okay for `agent_outputs`, but parallel agents also read from the full `agent_outputs` list inside the state to infer prior results.
	- In parallel branches, a given agent might read a partial `agent_outputs` list depending on execution order. This is visible in the `BiomarkerDiseaseLinkerAgent` and `ClinicalGuidelinesAgent` which read the prior Biomarker Analyzer output by searching `agent_outputs`.
	- Effect: nondeterministic behavior if LangGraph schedules a branch before the Biomarker Analyzer output is merged, or if merges occur after the branch starts. This can degrade evidence selection and recommendations.
	- Recommendation: explicitly pass relevant artifacts as dedicated state fields updated by the Biomarker Analyzer, and read those fields directly instead of scanning `agent_outputs`.
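The difference between the two access patterns can be sketched as follows. The function names and the `status` values are hypothetical; the point is only that a dedicated field can fail loudly when it is absent, while a scan of `agent_outputs` silently returns nothing.

```python
from typing import Any, Dict, List, Optional


def find_analyzer_output(agent_outputs: List[Dict[str, Any]]) -> Optional[Dict[str, Any]]:
    """Fragile pattern: only works if the analyzer output was already merged."""
    for output in agent_outputs:
        if output.get("agent_name") == "Biomarker Analyzer":
            return output
    return None  # indistinguishable from "analyzer found nothing"


def abnormal_biomarkers(state: Dict[str, Any]) -> List[str]:
    """Preferred pattern: read a dedicated field and fail loudly if it is missing."""
    if "biomarker_flags" not in state:
        raise KeyError("biomarker_flags missing: analyzer has not populated the state")
    return [flag["name"] for flag in state["biomarker_flags"] if flag["status"] != "NORMAL"]
```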

	3. [RESOLVED] Schema mismatch between workflow output and API formatter.
	- `ResponseSynthesizerAgent` returns a structured response with keys like `patient_summary`, `prediction_explanation`, `clinical_recommendations`, `confidence_assessment`, and `safety_alerts`.
	- `RagBotService._format_response()` now correctly reads from `final_response` and handles both Pydantic objects and dicts.
	- The CLI (`scripts/chat.py`) uses `_coerce_to_dict()` and `format_conversational()` to safely handle all output types.
	- Fix applied: `_format_response()` updated + `_coerce_to_dict()` helper added.

	### High Priority Issues

	1. [OPEN] Prediction confidence is floored at 0.5 and the default disease is always Diabetes.
	- Both the API and CLI `predict_disease_simple` functions enforce a minimum confidence of 0.5 and default to Diabetes when confidence is low.
	- Effect: predictions are biased toward Diabetes and report inflated confidence. This is risky in a medical domain and undermines reliability assessments.
	- Recommendation: return a low-confidence prediction explicitly and mark reliability as low; avoid forcing a disease when evidence is insufficient.
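A minimal sketch of the recommended behavior, assuming the rule-based scorer produces a mapping from disease name to score. `finalize_prediction`, the `reliability` labels, and the 0.5 threshold are illustrative choices, not the existing `predict_disease_simple` API.

```python
from typing import Dict, Optional, TypedDict


class Prediction(TypedDict):
    disease: Optional[str]   # None when evidence is insufficient
    confidence: float        # reported as computed, never floored
    reliability: str         # hypothetical labels: "LOW" or "MODERATE"


LOW_CONFIDENCE_THRESHOLD = 0.5  # assumed cut-off for illustration


def finalize_prediction(scores: Dict[str, float]) -> Prediction:
    """Pick the top-scoring disease without forcing a default or a minimum."""
    if not scores:
        return {"disease": None, "confidence": 0.0, "reliability": "LOW"}
    disease, confidence = max(scores.items(), key=lambda item: item[1])
    if confidence < LOW_CONFIDENCE_THRESHOLD:
        # Keep the raw score but decline to name a disease.
        return {"disease": None, "confidence": confidence, "reliability": "LOW"}
    return {"disease": disease, "confidence": confidence, "reliability": "MODERATE"}
```

Downstream agents would then need to handle `disease is None` explicitly, which makes the low-evidence case visible rather than hidden behind a default.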

	2. [RESOLVED] Different biomarker naming schemes across extraction modules.
	- Both CLI and API now use the shared `src/biomarker_normalization.py` module with 80+ aliases mapped to 24 canonical names.
	- Fix applied: unified normalization in both `scripts/chat.py` and `api/app/services/extraction.py`.
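For readers unfamiliar with the approach, alias-based normalization can be sketched as below. The alias entries and the function name are illustrative assumptions and are not copied from `src/biomarker_normalization.py`.

```python
from typing import Dict

# Illustrative subset only; the real module maps 80+ aliases to 24 names.
ALIASES: Dict[str, str] = {
    "hba1c": "HbA1c",
    "a1c": "HbA1c",
    "glycated hemoglobin": "HbA1c",
    "glucose": "Glucose",
    "blood sugar": "Glucose",
    "fasting glucose": "Glucose",
}


def normalize_biomarker_name(raw: str) -> str:
    """Map a user-supplied name to its canonical form; unknown names pass through."""
    key = " ".join(raw.strip().lower().replace("_", " ").split())
    return ALIASES.get(key, raw.strip())
```

Because both entry points call the same function, a name such as `blood_sugar` resolves identically in the CLI and the API.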

	3. [RESOLVED] Use of console glyphs and non-ASCII prefixes in logs and output.
	- Debug prints removed from CLI. Logging suppressed for noisy HuggingFace/transformers output.
	- API responses use clean JSON only; CLI uses UTF-8 emojis only in user-facing output.
	- Fix applied: `[DEBUG]` prints removed, `BertModel LOAD REPORT` suppressed, HuggingFace deprecation warnings filtered.

	### Medium Priority Issues

	1. [RESOLVED] Inconsistent model selection between agents.
	- All agents now use the centralized `llm_config` configuration (planner, analyzer, explainer, and synthesizer properties).
	- Fix applied: `src/llm_config.py` provides `LLMConfig` singleton with per-role properties.

	2. [RESOLVED] Potential JSON parsing fragility in extraction.
	- `_parse_llm_json()` now handles markdown fences, trailing text, and partial JSON recovery.
	- Fix applied: robust JSON parser in `api/app/services/extraction.py` with test coverage (`test_json_parsing.py`).
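The fence and trailing-text handling can be sketched with the standard library alone. This is not the project's `_parse_llm_json()`; `parse_llm_json_sketch` is a hypothetical name, and recovery of truncated JSON is deliberately left out.

```python
import json
import re
from typing import Any, Dict

FENCE = "`" * 3  # built programmatically to avoid a literal fence in this example
_FENCE_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL | re.IGNORECASE)


def parse_llm_json_sketch(text: str) -> Dict[str, Any]:
    """Extract the first JSON object from LLM output, ignoring fences and trailing text."""
    match = _FENCE_RE.search(text)
    candidate = match.group(1) if match else text
    start = candidate.find("{")
    if start == -1:
        raise ValueError("no JSON object found in LLM output")
    # raw_decode parses one JSON value and ignores whatever follows it.
    obj, _end = json.JSONDecoder().raw_decode(candidate[start:])
    return obj
```

Malformed or truncated objects still raise `json.JSONDecodeError` (a `ValueError` subclass), so callers can catch a single exception type.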

	3. [RESOLVED] Knowledge base retrieval does not enforce citations.
	- Disease Explainer agent now checks `sop.require_pdf_citations` and returns "insufficient evidence" when no documents are retrieved.
	- Fix applied: citation guardrail in `src/agents/disease_explainer.py` with test (`test_citation_guardrails.py`).
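The guardrail logic described above amounts to an early return when citations are required but retrieval came back empty. The sketch below assumes retrieved documents are dictionaries with `source` and `page` keys and uses a placeholder instead of an LLM call; the actual agent code and return schema may differ.

```python
from typing import Any, Dict, List


def explain_with_citation_guardrail(
    retrieved_docs: List[Dict[str, Any]],
    require_pdf_citations: bool,
) -> Dict[str, Any]:
    """Refuse to explain when citations are required but nothing was retrieved."""
    if require_pdf_citations and not retrieved_docs:
        return {"status": "insufficient evidence", "explanation": None, "citations": []}
    citations = [f'{doc["source"]} (p. {doc["page"]})' for doc in retrieved_docs]
    # Placeholder: the real agent would call the LLM with the retrieved context here.
    return {"status": "ok", "explanation": "<generated explanation>", "citations": citations}
```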

	### Low Priority Issues

	1. [OPEN] Error handling does not preserve original exceptions cleanly in API layer.
	- Exceptions are wrapped in a generic `RuntimeError` that does not separate the original error detail from the wrapper message, and `RagBotService.analyze()` does not attach contextual hints (e.g., which agent failed).
	- Recommendation: wrap exceptions with agent name and error classification to improve observability.
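One way to realize this recommendation is a dedicated exception type that carries the agent name and a coarse category while chaining the original error. `AgentExecutionError`, `classify_error`, `run_agent`, and the category names are all hypothetical.

```python
from typing import Any, Callable


class AgentExecutionError(RuntimeError):
    """Wraps a failure with the agent name and a coarse error category."""

    def __init__(self, agent_name: str, category: str, original: BaseException) -> None:
        super().__init__(f"[{category}] {agent_name} failed: {original}")
        self.agent_name = agent_name
        self.category = category


def classify_error(exc: BaseException) -> str:
    """Illustrative classification; real categories would be project-specific."""
    if isinstance(exc, TimeoutError):
        return "timeout"
    if isinstance(exc, (KeyError, ValueError)):
        return "bad_input"
    return "internal"


def run_agent(agent_name: str, fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
    """Run an agent callable, re-raising failures with context attached."""
    try:
        return fn(*args, **kwargs)
    except Exception as exc:
        # "from exc" keeps the original traceback available as __cause__.
        raise AgentExecutionError(agent_name, classify_error(exc), exc) from exc
```

The API layer could then map `category` to an HTTP status and log `agent_name` without parsing message strings.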

	2. [RESOLVED] Hard-coded expected biomarker count (24) in Confidence Assessor.
	- Now uses `BiomarkerValidator().expected_biomarker_count()` which reads from `config/biomarker_references.json`.
	- Test: `test_validator_count.py` verifies count matches reference config.
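Deriving the count from configuration rather than a literal can be sketched as follows. The top-level `"biomarkers"` key is an assumption about the layout of `config/biomarker_references.json`, and this free function stands in for the `BiomarkerValidator` method.

```python
import json
from pathlib import Path


def expected_biomarker_count(config_path: str) -> int:
    """Count reference entries so the total tracks the config file."""
    data = json.loads(Path(config_path).read_text(encoding="utf-8"))
    # Assumption: references live under a top-level "biomarkers" mapping.
    return len(data["biomarkers"])
```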

	## Suggested Improvements (Summary)

	1. ~~Align workflow output and API schema.~~ [RESOLVED]
	2. Promote biomarker flags and safety alerts to first-class state fields in the workflow. [OPEN]
	3. ~~Use a shared normalization utility.~~ [RESOLVED]
	4. Remove forced minimum confidence and default disease; permit "low confidence" results. [OPEN]
	5. ~~Introduce citation enforcement as a guardrail for RAG outputs.~~ [RESOLVED]
	6. ~~Centralize model selection and logging format.~~ [RESOLVED]

	## Verification Gaps

	The following should be tested once fixes are made:
	- Natural language extraction with partial and noisy inputs.
	- Workflow run where no abnormal biomarkers are detected.
	- API response schema validation for both natural and structured routes.
	- Parallel agent execution determinism (state access to biomarker analysis).
	- CLI behavior for biomarker names that differ from API normalization.