Spaces:
Sleeping
MedicalTriage Codebase Analysis
Date: 2026-04-07 Scope: Full repository review of environment logic, agents, training/evaluation pipeline, scripts, packaging, docs, and tests.
1. Executive Summary
This repository contains a working triage simulation core and passing unit tests for the environment itself, but the surrounding training/evaluation ecosystem is partially broken due to naming drift and API mismatches.
In short:
- The core environment loop is functional and reasonably well-shaped for RL experimentation.
- Most script entrypoints for RL/Q-learning training and comparison are currently not runnable as-is.
- Documentation and examples are partially stale and describe an older message-echo API that no longer matches the triage action schema.
- Packaging configuration is incomplete for distributable usage.
2. What The System Is Doing
2.1 Core Runtime Model
The main simulation is implemented in TriageEnvironment and follows a standard episodic loop:
reset()initializes 3 patients and limited resources.step(action)processes one action (treat,allocate_ventilator,wait).- Reward is computed from:
- immediate action quality,
- time progression penalties,
- health delta,
- global stability bonus,
- terminal reward at episode end.
- Episode ends on step limit, all-dead state, or all-alive stabilized threshold.
Evidence:
triage_env/server/triage_env_environment.py:39triage_env/server/triage_env_environment.py:63triage_env/server/triage_env_environment.py:176triage_env/server/triage_env_environment.py:190triage_env/server/triage_env_environment.py:304
2.2 API Surface
- Client payload shape is action-first (
action_type,patient_id), not message-first. - Observation includes
patients,resources,step_count,message,reward,done,metadata.
Evidence:
triage_env/client.py:12triage_env/models.py:20triage_env/models.py:25
2.3 Agent Layer
Current agents include:
RandomAgent: random valid action among wait/treat (does not use ventilators).RuleBasedAgent: treats alive patient with lowest health.LLMAgent: builds prompt from patient status and parses JSON response.- RL/Q-learning implementations exist but are inconsistent across files.
Evidence:
triage_env/agents/random_agent.py:8triage_env/agents/rule_based_agent.py:10triage_env/agents/llm_agent.py:19triage_env/agents/rl_agents.py:13triage_env/agents/q_learning_agents.py:9
3. Validation Performed
3.1 Tests
Executed:
python -m pytest -q
Result:
- 17 passed
Interpretation:
- Environment core behavior is stable for covered scenarios.
- Passing tests do not guarantee script/packaging/training pipeline health.
3.2 Compile/Syntax Check
Executed:
python -m compileall -q triage_env
Result:
- No syntax compile errors.
Interpretation:
- Most breakages are semantic/runtime (imports, wrong API assumptions), not syntax errors.
3.3 Runtime Checks For Entry Points
Validated failures:
triage_env.scripts.train_rlfails due to missing moduletriage_env.agents.rl_agent.triage_env.training.train_q_agentfails due to missing moduletriage_env.agents.q_learning_agent.triage_env.scripts.compare_baselinesfails due to importing non-existentevaluatesymbol.training.rollout.run_episodefails becauseenv.reset(task=...)passes unsupported kwarg.RLAgent.actfails becauseobservation.taskdoes not exist in model.
4. Findings (Prioritized)
Critical
- Broken RL/Q-learning import paths (hard runtime failure)
trained_q_agent.pyimportstriage_env.agents.q_learning_agent, but file isq_learning_agents.py.train_q_agent.pyuses same bad import.- Multiple scripts import
triage_env.agents.rl_agent, but file isrl_agents.py.
Evidence:
triage_env/agents/trained_q_agent.py:1triage_env/training/train_q_agent.py:3triage_env/scripts/train_rl.py:3triage_env/scripts/evaluate_all_agents.py:5triage_env/scripts/evaluate_rl.py:3
Impact:
- RL and Q-learning workflows are effectively unusable without manual fixes.
- Training rollout uses incompatible environment API
run_episode()callsenv.reset(task=task), butTriageEnvironment.reset()accepts notaskargument.
Evidence:
triage_env/training/rollout.py:2triage_env/server/triage_env_environment.py:39
Impact:
- Any pipeline depending on
training.rollout.run_episodecrashes immediately.
- RL state encoding relies on nonexistent observation field
RLAgent._state_key()accessesobservation.task, not present inTriageObservation.
Evidence:
triage_env/agents/rl_agents.py:33triage_env/agents/rl_agents.py:44triage_env/models.py:25
Impact:
- RL action selection and updates crash at runtime.
High
- Evaluator API mismatch across scripts
evaluation/evaluator.pydefinesevaluate_agent, but several scripts import/useevaluate.
Evidence:
triage_env/evaluation/evaluator.py:22triage_env/scripts/compare_baselines.py:5triage_env/scripts/evaluate_all_agents.py:6triage_env/scripts/evaluate_rule_based_agent.py:4triage_env/scripts/evaluate_random_agent.py:4
Impact:
- Baseline comparison scripts fail or require ad-hoc edits.
- Packaging metadata omits major subpackages
pyproject.tomlonly includestriage_envandtriage_env.serverin setuptools package list.triage_env.agents,triage_env.evaluation,triage_env.training,triage_env.scriptsare not packaged for distribution.
Evidence:
triage_env/pyproject.toml:44
Impact:
- Installed package may work partially in development but fails in clean/distributed usage.
- README examples are stale and describe old message-echo API
- Uses
TriageAction(message=...)andobservation.echoed_message, which are not in current models.
Evidence:
README.md:94README.md:100triage_env/models.py:20triage_env/models.py:25
Impact:
- New contributors receive incorrect onboarding instructions and hit immediate errors.
Medium
- Concurrency intent mismatch between environment and app settings
- Environment declares
SUPPORTS_CONCURRENT_SESSIONS = True. - Server app is configured with
max_concurrent_envs=1.
Evidence:
triage_env/server/triage_env_environment.py:24triage_env/server/app.py:52
Impact:
- Performance/scaling behavior may not match expectations from code comments/docs.
- Unused/partially integrated prompt tooling
prompt_builder.pydefines a richer prompt pipeline but is not integrated intoLLMAgent.- Also returns nothing when no alive patients (return path only inside
if sorted_alive).
Evidence:
triage_env/agents/prompt_builder.py:7triage_env/agents/prompt_builder.py:27triage_env/agents/prompt_builder.py:35triage_env/agents/llm_agent.py:20
Impact:
- Prompt quality and safety controls are fragmented; hidden bug in edge state if reused.
- Difficulty/task concept is declared but not used in environment dynamics
difficultyexists in constructor but does not influence reset distributions or transition behavior.
Evidence:
triage_env/server/triage_env_environment.py:26triage_env/server/triage_env_environment.py:28triage_env/server/triage_env_environment.py:39
Impact:
- Evaluation across "easy/medium/hard" in scripts is currently nominal, not environmental.
Low
- Duplicate/parallel script ecosystems increase drift risk
- Similar logic appears under both
triage_env/evaluationandtriage_env/scriptswith inconsistent imports.
Evidence:
triage_env/evaluation/run_benchmark.py:1triage_env/scripts/compare_baselines.py:1triage_env/evaluation/run_rule_based.py:1triage_env/scripts/run_random.py:1
Impact:
- Maintenance burden and future regressions increase.
- Trailing whitespace / formatting cleanliness in some modules
- Not functionally harmful but indicates uneven code hygiene.
Evidence:
triage_env/agents/llm_agent.py:75
5. Strengths
- Core environment logic is coherent and test-covered.
- Reward decomposition is explicit and auditable via metadata (
reward_breakdown). - Resource reset and patient progression are deterministic and understandable.
- Unit tests validate important environment invariants.
- Reset, step progression, invalid action penalties, death behavior, and done state are covered.
- Model layer is clear and strongly typed.
- Pydantic models for action/observation/state improve interface clarity.
6. Gaps In Current Test Strategy
Current tests focus almost exclusively on environment internals and do not cover:
- Script entrypoint execution (
triage_env/scripts/*) - Import path correctness after packaging/install
- RL/Q-learning training loops
- LLM integration safety and fallback behavior
- README quickstart correctness
Practical result: core tests pass while user-facing workflows remain broken.
7. Recommended Remediation Plan
Phase 1 (Stabilize Runtime)
- Normalize module names/imports:
- pick singular or plural convention (
rl_agentvsrl_agents,q_learning_agentvsq_learning_agents) and align all imports.
- pick singular or plural convention (
- Fix evaluator API usage:
- either expose
evaluate()wrapper in evaluator module or update all scripts toevaluate_agent.
- either expose
- Repair rollout/task wiring:
- remove
taskkwarg in reset call, or formally add task support in environment model/state.
- remove
- Fix RL observation schema usage:
- replace
observation.taskwith valid features from current observation/state.
- replace
Phase 2 (Consistency + Packaging)
- Update README and examples to current action schema (
action_type,patient_id). - Update
pyproject.tomlto include all importable subpackages. - Consolidate duplicate script sets into one canonical runner path.
Phase 3 (Quality + Coverage)
- Add smoke tests that execute each main script module.
- Add regression tests for RL and Q-learning initialization paths.
- Add docs-validation test to ensure README snippets match public models.
8. Architecture Snapshot
Primary flow:
- Agent ->
TriageAction->TriageEnvironment.step()->TriageObservation+ reward metadata - Training/evaluation wrappers orchestrate repeated episodes and aggregate metrics
- OpenEnv server adapter exposes environment over HTTP/WebSocket
Data contracts are good at the model level, but orchestration layers have drifted from those contracts.
9. Bottom Line
The simulation kernel is in good shape and test-backed, but your surrounding experimentation stack is in a partially broken state due to API and naming drift. If your goal is to iterate quickly on agent strategies, you should first complete Phase 1 fixes; otherwise most RL/evaluation scripts will continue to fail despite green unit tests.