Spaces:

nik-55
/

medchain-openenv-hackathon

Sleeping

App Files Files Community

nik-55 commited on Apr 3

Commit

4afc4db

verified ·

1 Parent(s): 39cf111

Upload folder using huggingface_hub

Browse files

Files changed (30) hide show

CLAUDE.md +105 -0
Dockerfile +81 -0
README.md +269 -5
__init__.py +15 -0
client.py +157 -0
hackathon_guide.md +365 -0
inference.py +524 -0
logs/inference_20260403_015029.log +79 -0
logs/inference_20260403_015146.log +118 -0
logs/inference_20260403_015354.log +69 -0
logs/inference_20260403_015503.log +58 -0
logs/inference_20260403_020521.log +147 -0
logs/inference_20260403_021847.log +22 -0
logs/inference_20260403_130724.log +63 -0
models.py +49 -0
openenv.yaml +78 -0
pyproject.toml +39 -0
sample_inference.py +187 -0
server/Dockerfile +80 -0
server/__init__.py +5 -0
server/app.py +51 -0
server/erp_formatter.py +379 -0
server/grader.py +225 -0
server/medchain_env_environment.py +382 -0
server/requirements.txt +6 -0
server/simulation.py +994 -0
server/tasks.py +404 -0
test.py +152 -0
uv.lock +0 -0
validate-submission.sh +185 -0

CLAUDE.md ADDED Viewed

	@@ -0,0 +1,105 @@

+# CLAUDE.md
+This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+## Project Context
+Hackathon submission for the Scaler School of Technology — Meta PyTorch Hackathon (OpenEnv track).
+Deadline: 8 April 2026.
+Environment: **MedChain Env** — hospital supply chain management simulation.
+Agents operate a legacy ERP system across 4 tasks of increasing difficulty.
+## Key Commands
+```bash
+# Build Docker image
+docker build -t <LOCAL_IMAGE_NAME> -f server/Dockerfile .
+# Set env vars then run inference
+export API_BASE_URL=https://router.huggingface.co/v1
+export MODEL_NAME=openai/gpt-oss-120b:groq
+export HF_TOKEN=<your_token>
+export LOCAL_IMAGE_NAME=nik-55_medchain-openenv
+uv run python inference.py
+# Run non-LLM integration test (writes test_inference.log)
+uv run python test.py
+# Run a single task container manually
+docker run -e MEDCHAIN_TASK=hospital_network_crisis -p 8000:8000 $LOCAL_IMAGE_NAME
+```
+### Environment variables
+| Variable | Description |
+|----------|-------------|
+| `API_BASE_URL` | LLM API endpoint (OpenAI-compatible) |
+| `MODEL_NAME` / `MODEL` | Model identifier |
+| `HF_TOKEN` / `API_KEY` | API authentication key |
+| `LOCAL_IMAGE_NAME` | Docker image tag used by inference.py |
+| `MEDCHAIN_TASK` | Select a single task in one container (default: `single_ward_stable`) |
+| `SLEEP_BETWEEN_STEPS` | Seconds between LLM calls, default 2 |
+| `LOG_LEVEL` | `INFO` (default) or `DEBUG` (writes timestamped log to `logs/`) |
+## Architecture
+### Package layout
+The repo root is both the `medchain_env` package root and a Python namespace. `pyproject.toml` maps:
+- `medchain_env` → repo root (`client.py`, `models.py`)
+- `medchain_env.server` → `server/` subdirectory
+`inference.py` imports from `medchain_env` directly (not `server/`).
+### Request flow (server-side)
+```
+FastAPI (app.py)
+  └─ MedchainEnvironment (medchain_env_environment.py)   ← reward logic lives here
+       ├─ MedchainSimulation (simulation.py)             ← FEFO, demand, events, budget
+       └─ _MedchainMCPDelegate (MCPEnvironment)          ← FastMCP schema + tool dispatch
+            └─ FastMCP server with 9 @mcp.tool functions
+```
+`MedchainEnvironment` is an `Environment` (not an `MCPEnvironment`). It holds a `_MedchainMCPDelegate` for tool dispatch via composition, then wraps the result in a `MedchainToolObservation` with reward already computed. This pattern exists so that reward calculation happens in the outer class, not inside FastMCP.
+### Two reward streams
+1. **Per-step shaping** (`_shaping_reward` in `medchain_env_environment.py`): small fixed rewards for useful actions (read_inbox/query_erp: +0.01 first time per shift; submit_po success: +0.02; transfer/quarantine: +0.01; incoherent justification: −0.05).
+2. **Terminal score** (`grader.py`, called by `_sim.get_last_reward()` on `end_shift`): deterministic formula with task-specific weights — no LLM judge.
+### Expedited order flow (two-step)
+`submit_po(..., priority="expedited")` → returns `BUDGET_OVERRIDE_REQUIRED` with a ticket ID → agent must call `file_justification(ticket_id=..., reason=...)` to proceed. Justification grading in `grader.py:grade_justification()` uses keyword matching against active event types — incoherent justifications score −0.05 each (capped at −0.15).
+### Context window management (inference.py)
+At each `end_shift`, the conversation is pruned to: `[system_prompt, all_past_shift_summaries_msg, last_SHIFT_HISTORY_KEEP=6_messages]`. Summaries are extracted from `end_shift` result text by keyword (DEMAND, FULFILLED, DELIVERIES, SPEND, WASTE, EXPIRY, CRITICAL, etc.). This keeps context O(days) not O(steps).
+### Key constants in inference.py
+- `MAX_STEPS_PER_TASK = 150` — hard cap per task episode
+- `TEMPERATURE = 0.1` — low temperature for deterministic ordering decisions
+- `MAX_TOKENS = 6000` — max completion tokens
+- `SHIFT_HISTORY_KEEP = 6` — recent messages retained across shift boundary
+- `MAX_CONSECUTIVE_ERRORS = 5` — aborts episode after 5 consecutive BadRequestErrors
+## Tasks at a Glance
+| Task | Days | Actions/shift | Budget |
+|------|------|---------------|--------|
+| `orientation_ward` | 2 | 5 | $5k |
+| `single_ward_stable` | 3 | 6 | $20k |
+| `multi_ward_seasonal` | 6 | 8 | $50k |
+| `hospital_network_crisis` | 12 | 10 | $150k |
+The hardest task, `hospital_network_crisis` (internally `TASK3`, 0-indexed), has 5 overlapping events: cold-chain breach (day 3), supplier force majeure (day 6), MCI standby warning (day 8), MCI activation with 3× blood demand (days 9-11), and mandatory IV saline lot recall (day 11).
+## Simulation internals
+- **FEFO**: inventory stored as `(qty, expiry_day)` lots in `simulation.py`, consumed oldest-first.
+- **Event system**: `SimEvent.trigger_day` fires the inbox message; `warning_message` fires one day earlier. The simulation checks active events each `end_shift()` to apply demand multipliers or supplier lead-time changes.
+- **Justification coherence**: `grade_justification()` checks reason text against a dict of event-type keywords; falls back to generic keywords (urgent, critical, stockout…) if no active event type matches.
+- **No external deps**: fully self-contained — no databases, no external APIs.

Dockerfile ADDED Viewed

	@@ -0,0 +1,81 @@

+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+#
+# This source code is licensed under the BSD-style license found in the
+# LICENSE file in the root directory of this source tree.
+# Multi-stage build using openenv-base
+# This Dockerfile is flexible and works for both:
+# - In-repo environments (with local OpenEnv sources)
+# - Standalone environments (with openenv from PyPI/Git)
+# The build script (openenv build) handles context detection and sets appropriate build args.
+ARG BASE_IMAGE=ghcr.io/meta-pytorch/openenv-base:latest
+FROM ${BASE_IMAGE} AS builder
+WORKDIR /app
+# Ensure git is available (required for installing dependencies from VCS)
+RUN apt-get update && \
+    apt-get install -y --no-install-recommends git && \
+    rm -rf /var/lib/apt/lists/*
+# Build argument to control whether we're building standalone or in-repo
+ARG BUILD_MODE=in-repo
+ARG ENV_NAME=medchain_env
+# Copy environment code (always at root of build context)
+COPY . /app/env
+# For in-repo builds, openenv is already vendored in the build context
+# For standalone builds, openenv will be installed via pyproject.toml
+WORKDIR /app/env
+# Ensure uv is available (for local builds where base image lacks it)
+RUN if ! command -v uv >/dev/null 2>&1; then \
+        curl -LsSf https://astral.sh/uv/install.sh | sh && \
+        mv /root/.local/bin/uv /usr/local/bin/uv && \
+        mv /root/.local/bin/uvx /usr/local/bin/uvx; \
+    fi
+# Install dependencies using uv sync
+# If uv.lock exists, use it; otherwise resolve on the fly
+RUN --mount=type=cache,target=/root/.cache/uv \
+    if [ -f uv.lock ]; then \
+        uv sync --frozen --no-install-project --no-editable; \
+    else \
+        uv sync --no-install-project --no-editable; \
+    fi
+RUN --mount=type=cache,target=/root/.cache/uv \
+    if [ -f uv.lock ]; then \
+        uv sync --frozen --no-editable; \
+    else \
+        uv sync --no-editable; \
+    fi
+# Final runtime stage
+FROM ${BASE_IMAGE}
+WORKDIR /app
+# Copy the virtual environment from builder
+COPY --from=builder /app/env/.venv /app/.venv
+# Copy the environment code
+COPY --from=builder /app/env /app/env
+# Set PATH to use the virtual environment
+ENV PATH="/app/.venv/bin:$PATH"
+# Set PYTHONPATH so imports work correctly
+ENV PYTHONPATH="/app/env:$PYTHONPATH"
+# Health check
+HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
+    CMD curl -f http://localhost:8000/health || exit 1
+# Run the FastAPI server
+# The module path is constructed to work with the /app/env structure
+ENV ENABLE_WEB_INTERFACE=true
+CMD ["sh", "-c", "cd /app/env && uvicorn server.app:app --host 0.0.0.0 --port 8000"]

README.md CHANGED Viewed

@@ -1,10 +1,274 @@
 ---
-title: Medchain Openenv Hackathon
-emoji: 😻
-colorFrom: gray
-colorTo: gray
 sdk: docker
 pinned: false
 ---
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

 ---
+title: Medchain Env Environment Server
+emoji: 🎰
+colorFrom: red
+colorTo: yellow
 sdk: docker
 pinned: false
+app_port: 8000
+base_path: /web
+tags:
+  - openenv
 ---
+# MedChain Env — Hospital Supply Chain Management
+> Train AI agents to keep hospitals stocked — where running out of blood costs lives and ordering too much costs money.
+---
+## What is hospital supply chain management?
+A hospital runs on supplies: surgical gloves, IV bags, blood, insulin, saline — hundreds of products across dozens of wards and operating theatres. Someone has to make sure the right amount of each product is in the right place at the right time. That person is the supply chain manager.
+The job sounds simple but has several compounding pressures:
+**Orders take time to arrive.** A supplier might take 2–7 days to deliver. If a ward runs out of IV bags today, you can't just call and get them in an hour. You have to have ordered them days ago — which means constantly forecasting future need.
+**Supplies expire.** Blood expires in 42 days. Platelets in 5 days. Insulin in 28 days. Order too much and it rots on the shelf, wasting both money and the product. Order too little and patients go without.
+**Demand spikes without warning.** A multi-vehicle accident sends 18 trauma patients to the ER at once. Flu season doubles demand for antivirals over two weeks. An elective surgery backlog clears and suddenly the ward needs 35% more consumables. A manager who doesn't read the situation will be caught short.
+**Emergencies require documentation.** In a real hospital, spending extra money to rush an urgent order isn't automatic — it requires a written justification audited by Finance. A manager who writes "routine restock" when they're actually responding to a mass casualty event raises a red flag.
+**Multiple suppliers, multiple trade-offs.** The cheap supplier takes 7 days; the premium supplier delivers in 1 day at 40% higher cost. When the cheap supplier announces delays due to a warehouse strike, you have to pivot — but if you always use the premium supplier you'll blow the budget.
+---
+## What is this environment?
+MedChain Env puts an AI agent in the role of a hospital supply chain manager. Each episode runs for 2–12 simulated days. Every day is a **shift**: the agent gets a limited number of actions to check the situation and respond before the simulation advances.
+The agent interacts with a **simulated legacy ERP system** — the kind of fragmented, text-heavy enterprise software real hospitals actually use. There are three sub-systems to navigate:
+- **COMMS Pager** — the inbox. Unstructured text messages from the incident command system, suppliers, ward managers, and the pharmacy. Critical alerts arrive here: MCI activations, supplier disruptions, lot recalls.
+- **Inventory DB** — query stock levels by ward, product, and expiry date. Returns table-formatted output.
+- **Procurement Portal** — place purchase orders, file justifications for expedited spending, track deliveries.
+The agent must decide *what to look up* (with a limited action budget), interpret what it finds, and act. There is no clean "here is everything you need to know" state dump — just the same messy interface a human manager would use.
+---
+## Why does this matter for AI research?
+Classical inventory management algorithms (like the (s, S) policy — "reorder when stock drops below S, order up to s") work reasonably well when demand is predictable and stable.
+They fail completely when the environment sends a message like:
+> *"Incident Command activated. Multi-vehicle accident I-95 northbound. Confirmed 18 critically injured en route. Blood bank placed on AMBER alert."*
+A classical algorithm sees only numbers. It cannot read that message and reason: "O-negative blood is the universal donor, expiry is 42 days but we only have 8 units left, lead time from the blood bank is 1 day, I need to order 30 units now and file a justification because this is expedited."
+An LLM that reads and understands that message can respond correctly.
+**The heuristic→LLM performance gap in Task 3 (0.38 → 0.58) is the environment's core scientific contribution** — a quantified measure of how much contextual language understanding is worth in real operational decisions.
+---
+## Tasks
+Four tasks of increasing difficulty. Each task is a self-contained episode with its own scenario, products, suppliers, and events.
+### Task 1 — `orientation_ward` (Easy, 2 days)
+A single general ward, 3 non-perishable supplies, one reliable supplier with 1-day delivery. Initial stock covers only the first day. The agent's job: read the inbox to understand the situation, check what's in stock, and place at least one replenishment order before supplies run out on Day 2.
+This task is purely about exploring the interface. No crises, no expiry pressure, no multi-supplier decisions.
+| | |
+|---|---|
+| **Score formula** | 70% service level + 30% whether at least one order was placed |
+| **Expected scores** | Random agent: 0.55 · Heuristic: 0.88 · LLM: 0.95 |
+---
+### Task 2 — `single_ward_stable` (Medium, 3 days)
+One ward, 6 products (some with expiry dates), stable demand, 2-day delivery. Initial stock covers 2 days — an agent that waits until Day 2 to order will arrive at Day 3 with empty shelves. The task introduces cost efficiency: placing sensibly-sized orders (not over-ordering) is rewarded alongside not running out.
+| | |
+|---|---|
+| **Score formula** | 50% service level + 50% cost efficiency vs. benchmark |
+| **Expected scores** | Random agent: 0.30 · Heuristic: 0.68 · LLM: 0.82 |
+---
+### Task 3 — `multi_ward_seasonal` (Medium-Hard, 6 days)
+Three wards plus a central pharmacy. Ten products. Two suppliers with different speed/cost trade-offs:
+- **FastMed**: delivers in 1 day, costs 40% more
+- **MedLine**: delivers in 4 days, base price
+Two events unfold over the episode — both announced in the inbox before they hit:
+**Day 2 (early warning) → Day 3–5 (active):** Regional influenza alert. Antiviral, mask, and paracetamol demand surges 50% above normal. An agent that pre-orders on Day 2 after reading the warning is protected. An agent that ignores the warning scrambles to catch up.
+**Day 4–6:** MedLine warehouse strike. Standard delivery extends from 4 to 7 days — which means any order placed after Day 4 via MedLine arrives after the episode ends. The agent must pivot to FastMed (at higher cost) for anything urgent.
+The task tests whether the agent can act on early warnings rather than reacting to crises after they arrive.
+| | |
+|---|---|
+| **Score formula** | 40% service level + 35% cost efficiency + 15% capacity management + 10% transfer efficiency |
+| **Expected scores** | Random: 0.22 · Heuristic (ignores alerts): 0.55 · LLM (reads alerts): 0.73 |
+---
+### Task 4 — `hospital_network_crisis` (Hard, 12 days)
+A full regional network: 3 hospitals plus a regional distribution centre, 15 products including **life-critical perishables**: O-negative blood (universal donor, expires in 42 days), platelets (expires in 5 days), fresh frozen plasma. Budget ceiling: $150,000 outstanding at any time.
+Five crisis events unfold across the 12-day episode — some overlapping:
+| Day | Event |
+|-----|-------|
+| 3 | **Cold chain breach** at the regional DC — refrigeration failure destroys all platelet inventory. An alert fires. Agent must order replacements immediately or hospitals run out within days. |
+| 6–14 | **Supplier force majeure** — HealthCo Supplies lead time extends from 3 to 7 days due to flu absenteeism. Agent must switch to the premium express supplier for urgent items. |
+| 8 (warning) → 9–11 (active) | **Mass casualty incident** — large multi-vehicle accident. Blood product demand triples across all hospitals for 3 days. An agent that reads the Day 8 warning and pre-orders blood survives this; one that waits until Day 9 faces critical stockouts. |
+| 11 | **Mandatory product recall** — a specific lot of IV Saline is flagged by the health authority. The agent must find which locations hold that lot, quarantine all units, and order replacements. Failing to quarantine by end of shift is a patient safety failure. |
+This task also introduces the **paper trail mechanic**: any expedited (rush) order triggers a mandatory written justification that Finance reviews. If the agent writes "routine restock" when there is an active mass casualty event, the justification is flagged as incoherent and a score penalty applies.
+| | |
+|---|---|
+| **Score formula** | 35% service level + 25% cost efficiency + 20% critical product availability + 15% waste reduction (expired product value) + 5% crisis response speed |
+| **Justification penalty** | −0.05 per incoherent expedited justification (max −0.15) |
+| **Expected scores** | Random: 0.12 · Heuristic (no alert reading): 0.38 · LLM: 0.58 · Near-optimal: 0.82 |
+---
+## How scoring works
+Each task produces a score between 0.0 and 1.0 computed deterministically from the simulation history. There is no LLM judge.
+### Partial credit during the episode
+Small reward signals after each action reinforce useful behaviour:
+- Reading the inbox or checking inventory (first time each shift): **+0.01**
+- Successfully placing an order: **+0.02**
+- Executing a transfer or quarantine: **+0.01**
+- Filing a coherent justification: **+0.01**
+- Filing an incoherent justification (flagged by Finance): **−0.05**
+### Terminal score at episode end
+The main score is computed when the episode completes. Components vary by task:
+- **Service level** — what fraction of demand was actually fulfilled across all products, locations, and days
+- **Cost efficiency** — how close actual spend was to a reasonable benchmark (spending less than the benchmark is good; spending far over it penalises over-ordering)
+- **Critical product availability** (Task 4) — blood and platelets tracked separately; running out of these carries heavy penalty
+- **Waste fraction** (Task 4) — value of expired inventory divided by total spend; rewards active expiry management
+- **Crisis response score** (Task 4) — how quickly the agent positioned blood during the MCI window and handled the recall
+---
+## How the agent interacts — the 9 tools
+The agent has 9 tools, each consuming one action from the shift budget:
+| Tool | What it does |
+|------|-------------|
+| `read_inbox` | Read messages from the COMMS pager (filter: unread / all / flagged) |
+| `query_erp` | Query stock levels, expiry dates, pipeline orders, or demand history |
+| `query_supplier` | Get a supplier's current lead time and any disruption notices |
+| `query_forecast` | Request a demand forecast for a product at a location |
+| `submit_po` | Place a purchase order (standard or expedited) |
+| `transfer` | Move stock from one location to another |
+| `quarantine_lot` | Isolate a specific inventory lot (for recalls or cold chain failures) |
+| `file_justification` | Write the Finance audit reason for an expedited order |
+| `end_shift` | Close the shift — simulation advances by one day |
+**The action budget is the core constraint.** Each shift, the agent gets 5–10 actions depending on the task. Using all 10 actions to query every product at every location leaves no budget to place orders. The agent must triage: what is most urgent to check right now?
+---
+## How the LLM agent works (`inference.py`)
+The inference script runs a multi-turn LLM agent using the OpenAI API format. Any model accessible through an OpenAI-compatible endpoint works.
+**Each shift is one LLM conversation:**
+1. The agent sees the shift dashboard (what day it is, how many actions remain, inbox alerts count)
+2. The LLM picks a tool to call
+3. The tool result is appended to the conversation
+4. Loop repeats until the agent calls `end_shift()` or runs out of action budget
+5. At `end_shift()`, the conversation is compressed into a short summary before the next shift begins
+**Context window management:** After each shift, the history is pruned down to the system prompt + all past shift summaries + the last 6 messages from the current shift. This keeps context bounded regardless of how many days the episode runs.
+**System prompt guidance:** The agent is told to prioritise: read inbox → check inventory → place orders → end shift. It's given explicit guidance on lead time arithmetic, expiry rotation, how to respond to MCI alerts, and when expedited orders are warranted.
+---
+## Setup & Running
+### Prerequisites
+- Docker
+- Python 3.10+ with `uv`
+### Build the Docker image
+```bash
+docker build -t <LOCAL_IMAGE_NAME> -f server/Dockerfile .
+```
+### Set environment variables and run
+```bash
+export API_BASE_URL=https://router.huggingface.co/v1
+export MODEL_NAME=openai/gpt-oss-120b:groq
+export HF_TOKEN=<your_token>
+export LOCAL_IMAGE_NAME=nik-55_medchain-openenv
+uv run python inference.py
+```
+The script runs all 4 tasks in sequence and emits structured logs to stdout:
+```
+[START] task=orientation_ward env=medchain model=openai/gpt-oss-120b:groq
+[STEP]  step=1 action=read_inbox({}) reward=0.01 done=false error=null
+[STEP]  step=2 action=query_erp({...}) reward=0.01 done=false error=null
+...
+[END]   success=true steps=10 score=0.923 rewards=0.01,0.01,0.02,...
+```
+### Environment variables
+| Variable | Description |
+|----------|-------------|
+| `API_BASE_URL` | LLM API endpoint (OpenAI-compatible) |
+| `MODEL_NAME` | Model identifier |
+| `HF_TOKEN` or `API_KEY` | API authentication key |
+| `LOCAL_IMAGE_NAME` | Docker image tag used by inference.py to launch containers |
+| `MEDCHAIN_TASK` | Run a single task in one container (default: `single_ward_stable`) |
+| `SLEEP_BETWEEN_STEPS` | Seconds between LLM calls, default 2 |
+| `LOG_LEVEL` | `INFO` (default) or `DEBUG` (writes timestamped log to `logs/`) |
+### Run a single task container
+```bash
+docker run -e MEDCHAIN_TASK=hospital_network_crisis -p 8000:8000 <LOCAL_IMAGE_NAME>
+```
+---
+## Project Structure
+```
+medchain_env/
+├── inference.py                     # Entry point — LLM agent runs all 4 tasks
+├── client.py                        # MedchainEnv OpenEnv client
+├── models.py                        # State and observation types
+├── openenv.yaml                     # OpenEnv manifest
+├── pyproject.toml                   # Dependencies
+└── server/
+    ├── app.py                       # FastAPI application
+    ├── Dockerfile                   # Container build
+    ├── medchain_env_environment.py  # OpenEnv Environment + 9 MCP tools
+    ├── simulation.py                # Simulation engine (inventory, demand, events)
+    ├── tasks.py                     # Task configurations
+    ├── grader.py                    # Terminal reward computation
+    └── erp_formatter.py             # ERP text output formatters
+```

__init__.py ADDED Viewed

	@@ -0,0 +1,15 @@

+"""MedChain Env — Hospital Supply Chain Management Environment."""
+from openenv.core.env_server.mcp_types import CallToolAction, CallToolObservation
+from .client import MedchainEnv
+from .models import AVAILABLE_TOOLS, MedchainState, MedchainToolObservation
+__all__ = [
+    "MedchainEnv",
+    "MedchainState",
+    "MedchainToolObservation",
+    "AVAILABLE_TOOLS",
+    "CallToolAction",
+    "CallToolObservation",
+]

client.py ADDED Viewed

	@@ -0,0 +1,157 @@

+"""MedChain Env Environment Client."""
+import logging
+import re
+from typing import Any, Dict, List, Optional
+from openenv.core import EnvClient
+from openenv.core.client_types import StepResult
+from openenv.core.env_server.mcp_types import (
+    CallToolAction,
+    ListToolsAction,
+    ListToolsObservation,
+    Tool,
+)
+from openenv.core.env_server.types import Observation, State
+from .models import MedchainState
+_log = logging.getLogger(__name__)
+class MedchainEnv(EnvClient[CallToolAction, Observation, MedchainState]):
+    """
+    Client for the MedChain Env hospital supply chain environment.
+    Inherits from EnvClient and communicates via the standard OpenEnv
+    WebSocket protocol (simulation mode).
+    Example:
+        >>> async with MedchainEnv(base_url="http://localhost:8000") as env:
+        ...     obs = await env.reset()
+        ...     print(obs.observation.metadata["dashboard"])
+        ...     tools = await env.list_tools()
+        ...     result = await env.step(CallToolAction(tool_name="read_inbox", arguments={}))
+    Example with Docker:
+        >>> env = await MedchainEnv.from_docker_image(
+        ...     "medchain_env-env:latest",
+        ...     env_vars={"MEDCHAIN_TASK": "single_ward_stable"},
+        ... )
+        >>> obs = await env.reset()
+    """
+    def __init__(self, **kwargs: Any) -> None:
+        kwargs.setdefault("message_timeout_s", 1500.0)
+        super().__init__(**kwargs)
+        self._tools_cache: Optional[List[Tool]] = None
+    # ── EnvClient abstract methods ─────────────────────────────────────────
+    def _step_payload(self, action: Any) -> Dict[str, Any]:
+        if isinstance(action, ListToolsAction):
+            return {"type": "list_tools"}
+        if isinstance(action, CallToolAction):
+            return {
+                "type": "call_tool",
+                "tool_name": action.tool_name,
+                "arguments": action.arguments,
+            }
+        raise ValueError(f"Unsupported action type: {type(action).__name__}")
+    def _parse_result(self, payload: Dict[str, Any]) -> StepResult[Observation]:
+        obs_data = payload.get("observation", {})
+        reward = payload.get("reward")
+        done = payload.get("done", False) or obs_data.get("done", False)
+        # ── List-tools response ──────────────────────────────────────────
+        if "tools" in obs_data:
+            tools = [
+                Tool(
+                    name=t.get("name", ""),
+                    description=t.get("description", ""),
+                    input_schema=t.get("input_schema", t.get("inputSchema", {})),
+                )
+                for t in obs_data.get("tools", [])
+            ]
+            observation = ListToolsObservation(
+                tools=tools,
+                done=done,
+                reward=reward,
+            )
+            return StepResult(observation=observation, reward=reward, done=done)
+        # ── Reset response (has "dashboard" field) ───────────────────────
+        if "dashboard" in obs_data:
+            observation = Observation(done=done, reward=reward, metadata=obs_data)
+            return StepResult(observation=observation, reward=reward, done=done)
+        # ── Tool-call response (has "tool_name" and "tool_result") ───────
+        if "tool_name" in obs_data:
+            result_text = obs_data.get("tool_result", "")
+            # Safety net: if reward is still None (should not happen after the
+            # serialization fix), fall back to parsing the Final Score from text.
+            if reward is None and result_text:
+                m = re.search(r"Final Score:\s*([\d.]+)", result_text)
+                if m:
+                    reward = float(m.group(1))
+            observation = Observation(
+                done=done,
+                reward=reward,
+                metadata={"tool_result": result_text},
+            )
+            return StepResult(observation=observation, reward=reward, done=done)
+        # ── Generic fallback ─────────────────────────────────────────────
+        observation = Observation(done=done, reward=reward, metadata=obs_data)
+        return StepResult(observation=observation, reward=reward, done=done)
+    def _parse_state(self, payload: Dict[str, Any]) -> MedchainState:
+        return MedchainState(
+            episode_id=payload.get("episode_id"),
+            step_count=payload.get("step_count", 0),
+            task=payload.get("task", ""),
+            day=payload.get("day", 0),
+            max_days=payload.get("max_days", 0),
+            actions_remaining=payload.get("actions_remaining", 0),
+            budget_used=payload.get("budget_used", 0.0),
+            budget_limit=payload.get("budget_limit", 0.0),
+            unread_messages=payload.get("unread_messages", 0),
+            orders_in_transit=payload.get("orders_in_transit", 0),
+        )
+    # ── Tool discovery ─────────────────────────────────────────────────────
+    async def list_tools(self, use_cache: bool = True) -> List[Tool]:
+        """
+        Discover the 9 ERP tools available in this environment.
+        Args:
+            use_cache: Return cached tools if available (default True).
+        Returns:
+            List of Tool objects with name, description, and input_schema.
+        """
+        if use_cache and self._tools_cache is not None:
+            return self._tools_cache
+        result = await self.step(ListToolsAction())
+        if isinstance(result.observation, ListToolsObservation):
+            self._tools_cache = result.observation.tools
+            return self._tools_cache
+        self._tools_cache = []
+        return self._tools_cache
+    # ── Resource cleanup ───────────────────────────────────────────────────
+    async def close(self) -> None:
+        """Close client, tolerating Docker stop timeouts gracefully."""
+        try:
+            await super().close()
+        except Exception as e:
+            # docker stop can time out (10 s) when the container is slow to exit.
+            # Log and swallow so the inference script doesn't crash.
+            _log.warning("MedchainEnv.close() suppressed error during shutdown: %s", e)

hackathon_guide.md ADDED Viewed

	@@ -0,0 +1,365 @@

+# Scaler School of Technology — Meta PyTorch Hackathon
+## OpenEnv Hackathon Dashboard
+**URL:** https://www.scaler.com/school-of-technology/meta-pytorch-hackathon/dashboard#form
+---
+## Timeline
+| Stage        | Dates                    |
+|--------------|--------------------------|
+| Registration | 14th March – 3rd April   |
+| Declaration  | Before Round 1           |
+| Prepare      | Now – 25th March         |
+| Round 1      | 25th March – 8th April   |
+| Results      | 10th April               |
+| Finals       | 25th – 26th (April)      |
+---
+## Community
+- **Discord:** Join the Discord Community — all announcements, mentor access, and team matching happens here.
+---
+## Participation
+- Currently registered as **Solo Warrior**
+- Locked for Round 1 — cannot switch to a team until Round 1 is over.
+---
+## Problem Statement
+### The Task
+> Build a complete, real-world OpenEnv environment that an AI agent can learn from through the standard `step()` / `reset()` / `state()` API.
+---
+## Key Requirements at a Glance
+- Must simulate a real-world task (not games or toys)
+- Implement full OpenEnv spec: typed models, `step()`/`reset()`/`state()`, `openenv.yaml`
+- Minimum 3 tasks with agent graders (easy → medium → hard, scores 0.0–1.0)
+- Meaningful reward function with partial progress signals
+- Baseline inference script with reproducible scores
+- Deploy to Hugging Face Spaces + working Dockerfile
+- README with environment description, action/observation spaces, setup instructions
+---
+## Detailed Requirements
+### Real-world task simulation
+The environment must simulate a task humans actually do. Not games, not toys.
+Examples: email triage, code review, data cleaning, scheduling, customer support, content moderation.
+### OpenEnv spec compliance
+Implement the full OpenEnv interface:
+- Typed `Observation`, `Action`, and `Reward` Pydantic models
+- `step(action)` → returns observation, reward, done, info
+- `reset()` → returns initial observation
+- `state()` → returns current state
+- `openenv.yaml` with metadata
+- Tested via `openenv validate`
+### Minimum 3 tasks with agent graders
+Each task defines a concrete objective an agent must accomplish, with a programmatic grader that scores performance (0.0–1.0). Tasks should range: easy → medium → hard. Graders must have clear, deterministic success/failure criteria.
+### Meaningful reward function
+Provides signal over the full trajectory (not just binary end-of-episode). Rewards partial progress toward task completion. Penalizes clearly undesirable behavior (e.g. infinite loops, destructive actions).
+### Baseline inference script
+Uses the OpenAI API client to run a model against the environment. Reads API credentials from environment variables (`OPENAI_API_KEY`). Produces a reproducible baseline score on all 3 tasks.
+---
+## Non-Functional Requirements
+### Deploys to a Hugging Face Space
+Environment must run as a containerized HF Space tagged with `openenv`.
+Must include a working Dockerfile. The environment should start cleanly with `docker build` + `docker run`.
+### Documentation
+README must include:
+- Environment description and motivation
+- Action and observation space definitions
+- Task descriptions with expected difficulty
+- Setup and usage instructions
+- Baseline scores
+---
+## Evaluation Criteria
+| Parameter                  | Weight | Description                                                                 |
+|----------------------------|--------|-----------------------------------------------------------------------------|
+| Real-world utility         | 30%    | Does the environment model a genuine task? Would someone actually use this to train or evaluate agents? |
+| Task & grader quality      | 25%    | Are tasks well-defined with clear objectives? Do graders accurately and fairly measure success? Meaningful difficulty progression? |
+| Environment design         | 20%    | Clean state management, sensible action/observation spaces, good reward shaping, proper episode boundaries |
+| Code quality & spec compliance | 15% | Follows OpenEnv spec, clean project structure, typed models, documented, tested, Dockerfile works |
+| Creativity & novelty       | 10%    | Novel problem domain, interesting mechanics, clever reward design, original approach |
+### Scoring Breakdown (Real-world utility)
+- **0–5:** Toy/artificial problem with no practical application
+- **6–15:** Valid domain but shallow modeling of the real task
+- **16–25:** Good domain modeling, would be useful for agent evaluation
+- **26–30:** Excellent — fills a real gap, immediate value for the RL/agent community
+### Scoring Checklist Questions
+**Task & grader quality:**
+- 3+ tasks with difficulty range?
+- Graders produce scores between 0.0–1.0?
+- Graders deterministic and reproducible?
+- Hard task genuinely challenges frontier models?
+**Environment design:**
+- `reset()` produces clean state?
+- Action/observation types well-designed and documented?
+- Reward function provides useful varying signal (not just sparse)?
+- Episode boundaries sensible?
+**Code quality & spec compliance:**
+- `openenv validate` passes?
+- `docker build && docker run` works?
+- HF Space deploys and responds?
+- Baseline script runs and reproduces scores?
+**Creativity & novelty:**
+- Domain we haven't seen in OpenEnv before?
+- Reward design has interesting properties?
+- Clever mechanics that make the environment engaging?
+---
+## How Judging Works
+- **Phase 1 — Automated Validation:** Pass/fail gate — HF Space deploys, OpenEnv spec compliance, Dockerfile builds, baseline reproduces, 3+ tasks with graders.
+- **Phase 2 — Agentic Evaluation:** Scored — baseline agent re-run, standard Open LLM agent (e.g. Nemotron 3 Super) run against all environments, score variance check.
+- **Phase 3 — Human Review:** Top submissions reviewed by Meta and Hugging Face engineers for real-world utility, creativity, and exploit checks.
+---
+## Disqualification Criteria
+- Environment does not deploy or respond
+- Plagiarized or trivially modified existing environments
+- Graders that always return the same score
+- No baseline inference script
+---
+## Pre-Submission Checklist — all must pass or you're disqualified
+| Check                    | Requirement                                                                                     |
+|--------------------------|-------------------------------------------------------------------------------------------------|
+| HF Space deploys         | Automated ping to the Space URL — must return 200 and respond to `reset()`                     |
+| OpenEnv spec compliance  | Validate `openenv.yaml`, typed models, `step()`/`reset()`/`state()` endpoints                  |
+| Dockerfile builds        | Automated docker build on the submitted repo                                                    |
+| Baseline reproduces      | Run the submitted inference script — must complete without error and produce scores             |
+| 3+ tasks with graders    | Enumerate tasks, run each grader, verify scores in 0.0–1.0 range                               |
+| Infra Restrictions       | Runtime of inference script should be less than 20 min. Must run on vcpu=2, memory=8gb        |
+| Validator                | Run the pre-submission validation script before submitting                                      |
+### Mandatory Additional Instructions
+Before submitting, ensure the following variables are defined in your environment configuration:
+| Variable        | Description                              |
+|-----------------|------------------------------------------|
+| `API_BASE_URL`  | The API endpoint for the LLM            |
+| `MODEL_NAME`    | The model identifier to use for inference |
+| `HF_TOKEN`      | Your Hugging Face / API key             |
+- The inference script must be named `inference.py` and placed in the root directory of the project.
+- Participants must use the OpenAI Client for all LLM calls using the above variables.
+- Participants must emit structured stdout logs strictly following the `[START]`, `[STEP]`, and `[END]` format defined in the sample inference script. Any deviation in field names, ordering, or formatting will result in incorrect evaluation scoring. Refer to [`sample_inference.py`](./sample_inference.py) for the complete format specification and examples.
+### Infra Restrictions
+- Runtime of inference script must be less than 20 minutes.
+- Ensure your env and inference can run on a machine with `vcpu=2`, `memory=8gb`.
+### Validator
+Run the pre-submission validation script at [`pre_validate.sh`](./pre_validate.sh) before submitting.
+### Sample Inference Script
+See [`sample_inference.py`](./sample_inference.py) for the complete example, including the mandatory `[START]`, `[STEP]`, and `[END]` structured log format.
+---
+## Submission
+- **Submission window opens:** 28th March
+- **Deadline:** 8 April 2026, 11:59 PM IST
+### Step 1
+Choose solo or team before you can start the assessment.
+### Step 2
+Complete Step 1 first. Problem Statement is live. Build and submit.
+---
+## Study Material
+**4 modules · ~3.5 hours**
+Each module: read the README first, then open the notebook in Colab. No local setup needed.
+### Module 1 — Essential for Round 1 (45 min)
+**What you'll do:** Connect to 3 real AI environments hosted online — an Echo bot, a Catch game, and Wordle — and interact with each using the exact same code pattern.
+### Module 2 — Essential for Round 1 (50 min)
+**What you'll do:** Write 4 different game-playing strategies for a Catch game, run a competition between them, then switch to a completely different game using the same code.
+### Module 3 — Essential for Round 1 (45 min)
+**What you'll do:** Clone an existing environment, modify it, run it on your machine, then deploy your version live to Hugging Face Spaces with one command.
+### Module 4 — Most Important for Round 1
+**What you'll do:** Build a complete word-guessing game environment from scratch — define the rules, implement the logic, test it locally, and deploy it live. About 100 lines of real code.
+- View full course repository
+---
+## Guide
+### What to Expect
+Example of what a problem statement looks like:
+> "Build a mini-game RL environment with clearly defined tasks, automated graders, and deploy it live to Hugging Face Spaces."
+### Prerequisites (from Step 1 assessment)
+- Write graders that verify task completion
+- Define reward logic for scoring
+- Package using OpenEnv for automated evaluation
+**Install before April 1st:**
+| Tool                  | Requirement                          | Command                                      |
+|-----------------------|--------------------------------------|----------------------------------------------|
+| Python 3.10+          | Install 3.10, 3.11, or 3.12          | `python --version`                           |
+| Git + GitHub account  | Push your submission to GitHub or HF | `git --version`                              |
+| Hugging Face CLI      | Deploy to HF Spaces                  | `pip install huggingface_hub`                |
+|                       |                                      | `huggingface-cli login`                      |
+| OpenEnv               | The framework                        | `pip install openenv-core`                   |
+| Google Colab          | Prep course runs in Colab (free tier works) | colab.research.google.com             |
+| Docker                | Isolated container testing           | `docker --version`                           |
+| VS Code (Recommended) | Best Python + Docker support         |                                              |
+### Step 1 Evaluation Criteria
+| Criteria              | Standard                        |
+|-----------------------|---------------------------------|
+| Runtime correctness   | Runs without errors             |
+| Interface compliance  | Follows OpenEnv standard        |
+| Task design           | Clear, realistic, testable      |
+| Grading logic         | Reward system makes sense       |
+### How to Submit
+When Round 1 starts on 1 April:
+**Step 1 — Application Form**
+Choose your problem domain. The task is open-ended — build any real-world OpenEnv environment that a human would actually do.
+**Step 2 — Scaffold**
+```bash
+openenv init my_env
+```
+Generate project structure.
+**Step 3 — Build**
+Define your environment in the generated files.
+**Step 4 — Test locally**
+```bash
+uv run server
+```
+**Step 5 — Deploy**
+```bash
+openenv push --repo-id your-username/my-env
+```
+**Step 6 — Submit**
+Paste your HF Spaces URL on the platform before the deadline.
+- Submission window opens 28th March
+- Deadline: 8 April 2026, 11:59 PM IST
+> **Note:** Only team leaders can make the final submission.
+> **Note:** The Guide above references "4–5 problem statements" — this is outdated. Round 1 is open-ended. There is no fixed list of problem statements to choose from. Build any real-world environment that a human would actually do (e.g. email triage, code review, data cleaning). The requirements and evaluation criteria remain the same.
+---
+## FAQs
+### How does the team/solo declaration work?
+If you choose to compete solo, you will participate individually for Round 1.
+If you form a team (2–3 members), only the Team Lead fills out the team formation form before the Round 1 assessment window opens and adds teammates using their registered email IDs. Once a team is confirmed, it cannot be changed.
+Note: Since Round 2 is a 48-hour in-person hackathon, solo participants who qualify will be matched with other qualifying participants to form teams for the final round.
+### Who should fill the team form?
+Only the team lead completes the team registration form. Teammates do not need to fill out anything at this stage. Once the Team Lead submits the form, listed members will receive an invite on their dashboards. The team will be reflected on their dashboards only after they accept the invite.
+### What if someone already added me to their team?
+This will only happen once you accept their invite; your dashboard will then automatically update to reflect the team you have joined. After confirmation, you will not be able to switch to solo mode or join/form another team. Team assignments are permanent once confirmed.
+### Can I change my team or switch to solo after confirming?
+No. Teams are permanent once confirmed, no changes are allowed. Solo declarations are locked for Round 1. A confirmation prompt is shown before submission, so please review carefully before proceeding.
+### Do I need to complete the prep course?
+While not mandatory, it is strongly recommended.
+### What happens during Round 1?
+You will select one problem statement from a set of challenges and build an RL environment using the OpenEnv framework.
+### Can I update my submission?
+Yes. You may update your submission multiple times until the Round 1 deadline (5th April, 11:59 PM IST). Only the latest submission will be evaluated.
+### How are submissions evaluated?
+Round 1 uses an LLM-based evaluator with structured rubrics. The finale includes LLM screening, manual review, and judging by Meta's global team. Evaluation criteria include runtime correctness, OpenEnv interface compliance, task design quality, grading logic, and overall code quality.
+### What framework must be used?
+All environments must be built using the OpenEnv framework by Meta and Hugging Face.
+### What happens after Round 1?
+Results will be announced on 10 April. The top 3,000 teams will advance to the Grand Finale, a 48-hour on-campus hackathon at Scaler School of Technology, Bangalore (25th–26th April).
+### What do I need to submit?
+A public GitHub repository with your environment code, a `requirements.txt`, a demo script, and a README. A deployed Hugging Face Spaces URL showcasing your working demo.
+### Where can I get help?
+Join the Discord community for announcements and support.
+For account or registration issues, email: help_openenvhackathon@scaler.com
+---
+## Support
+**Need help? Reach out to us:**
+- Email: help_openenvhackathon@scaler.com

inference.py ADDED Viewed

	@@ -0,0 +1,524 @@

+"""
+MedChain Env — Inference Script
+================================
+Runs all tasks sequentially and reports scores.
+MANDATORY environment variables:
+    API_BASE_URL        The API endpoint for the LLM
+    MODEL_NAME / MODEL  The model identifier for inference
+    HF_TOKEN / API_KEY  Your Hugging Face / API key
+STDOUT FORMAT
+- The script emits exactly three line types to stdout, in this order:
+    [START] task=<task_name> env=medchain model=<model_name>
+    [STEP]  step=<n> action=<action_str> reward=<0.00> done=<true|false> error=<msg|null>
+    [END]   success=<true|false> steps=<n> score=<0.000> rewards=<r1,r2,...,rn>
+  Rules:
+    - One [START] line at episode begin.
+    - One [STEP] line per step, immediately after env.step() returns.
+    - One [END] line after env.close(), always emitted (even on exception).
+    - reward and rewards are formatted to 2 decimal places; score to 3.
+    - done and success are lowercase booleans: true or false.
+    - error is the raw error string, or null if none.
+    - All fields on a single line with no newlines within a line.
+"""
+import asyncio
+import json
+import logging
+import os
+import sys
+from datetime import datetime
+from pathlib import Path
+from typing import Any, Dict, List, Optional
+from openai import BadRequestError, OpenAI
+sys.path.insert(0, str(Path(__file__).parent.parent))
+from medchain_env import CallToolAction, MedchainEnv
+# Both modes log to stdout. DEBUG additionally saves to a timestamped file under logs/
+LOG_LEVEL = os.getenv("LOG_LEVEL", "INFO").upper()
+_log_fmt = logging.Formatter(
+    "%(asctime)s [%(levelname)s] %(message)s", datefmt="%H:%M:%S"
+)
+_stream_handler = logging.StreamHandler(sys.stdout)
+_stream_handler.setFormatter(_log_fmt)
+_handlers: list = [_stream_handler]
+if LOG_LEVEL == "DEBUG":
+    os.makedirs("logs", exist_ok=True)
+    _log_filename = datetime.now().strftime("logs/inference_%Y%m%d_%H%M%S.log")
+    _file_handler = logging.FileHandler(_log_filename)
+    _file_handler.setFormatter(_log_fmt)
+    _handlers.append(_file_handler)
+    print(f"[DEBUG] Logging to file: {_log_filename}", flush=True)
+logging.basicConfig(level=logging.WARNING, handlers=_handlers)
+log = logging.getLogger(__name__)
+log.setLevel(getattr(logging, LOG_LEVEL, logging.INFO))
+API_BASE_URL = os.getenv("API_BASE_URL", "https://router.huggingface.co/v1")
+API_KEY = os.getenv("HF_TOKEN") or os.getenv("API_KEY")
+MODEL_NAME = os.getenv("MODEL_NAME") or os.getenv("MODEL", "openai/gpt-oss-120b:groq")
+TASKS = [
+    "orientation_ward",
+    "single_ward_stable",
+    "multi_ward_seasonal",
+    "hospital_network_crisis",
+]
+MAX_STEPS_PER_TASK = 150
+MAX_TOKENS = 6000
+TEMPERATURE = 0.1
+MAX_CONSECUTIVE_ERRORS = 5
+SLEEP_BETWEEN_STEPS = int(os.getenv("SLEEP_BETWEEN_STEPS", "2"))
+SHIFT_HISTORY_KEEP = 6
+BENCHMARK = "medchain"
+SYSTEM_PROMPT = """You are an experienced hospital supply chain manager operating a legacy ERP system.
+Your goal is to maintain adequate medical supplies across all locations while controlling costs.
+CRITICAL — ACTION BUDGET: You have a strictly limited number of actions per shift.
+Budget does NOT roll over. Unspent actions are lost at end_shift().
+Recommended budget allocation (highest priority first):
+  1. read_inbox()                         — ALWAYS do this first to catch urgent alerts
+  2. query_erp(table='inventory')         — check current stock levels across all locations
+  3. submit_po(...)                        — place orders for items below safety stock (PRIORITY)
+  4. end_shift()                           — call this when budget is exhausted OR tasks are done
+Query tools (query_erp expiry/pipeline, query_forecast, query_supplier) are LOW PRIORITY.
+Only use them if you have budget remaining AFTER placing critical orders.
+MANDATORY RULES:
+- If you receive "Action budget exhausted" → call end_shift() as your VERY NEXT action.
+  Do NOT call any other tool. The budget cannot be restored until end_shift() is called.
+- Order early: factor in lead times. If lead time is 2 days, order today to avoid stockout in 2 days.
+- Expedited orders require file_justification(ticket_id=...) with a real clinical reason.
+- FEFO: oldest stock consumed first — check expiry and rotate perishables proactively.
+- Recalls: quarantine the recalled lot immediately, then order a replacement.
+- MCI events: pre-emptive ordering beats reactive ordering. Order extra blood/critical supplies NOW.
+Safety stock target: aim for at least (lead_time + 1) × daily_demand units on hand.
+When calling tools, use the EXACT parameter names shown in the tool descriptions.
+"""
+def log_start(task: str, model: str) -> None:
+    print(f"[START] task={task} env={BENCHMARK} model={model}", flush=True)
+def log_step(
+    step: int, action: str, reward: float, done: bool, error: Optional[str]
+) -> None:
+    error_val = error if error else "null"
+    print(
+        f"[STEP] step={step} action={action} reward={reward:.2f} done={str(done).lower()} error={error_val}",
+        flush=True,
+    )
+def log_end(success: bool, steps: int, score: float, rewards: List[float]) -> None:
+    rewards_str = ",".join(f"{r:.2f}" for r in rewards)
+    print(
+        f"[END] success={str(success).lower()} steps={steps} score={score:.3f} rewards={rewards_str}",
+        flush=True,
+    )
+def _tools_to_openai_format(tools) -> List[dict]:
+    """Convert MCP tools to OpenAI function-calling format."""
+    openai_tools = []
+    for tool in tools:
+        properties = {}
+        required = []
+        if tool.input_schema and "properties" in tool.input_schema:
+            for name, schema in tool.input_schema["properties"].items():
+                properties[name] = {
+                    "type": schema.get("type", "string"),
+                    "description": schema.get("description", ""),
+                }
+            required = tool.input_schema.get("required", [])
+        openai_tools.append(
+            {
+                "type": "function",
+                "function": {
+                    "name": tool.name,
+                    "description": tool.description or "",
+                    "parameters": {
+                        "type": "object",
+                        "properties": properties,
+                        "required": required,
+                    },
+                },
+            }
+        )
+        log.debug("Tool registered: %s (required=%s)", tool.name, required)
+    return openai_tools
+def _make_shift_summary(shift_day: int, end_shift_result: str) -> str:
+    """Build a compact summary of a completed shift for the context window."""
+    lines = []
+    for line in (end_shift_result or "").splitlines():
+        stripped = line.strip()
+        if stripped and any(
+            kw in stripped
+            for kw in [
+                "DEMAND:",
+                "FULFILLED:",
+                "DELIVERIES:",
+                "SPEND:",
+                "WASTE:",
+                "EXPIRY:",
+                "STOCKOUT",
+                "CRITICAL",
+                "END OF SHIFT",
+                "Day ",
+                "Score:",
+                "═",
+                "─",
+            ]
+        ):
+            lines.append(stripped)
+        if len(lines) >= 20:
+            break
+    summary_body = "\n".join(lines) if lines else (end_shift_result or "")[:400]
+    return f"[SHIFT DAY {shift_day} SUMMARY]\n{summary_body}"
+async def run_task_episode(
+    env: MedchainEnv,
+    client: OpenAI,
+    tools: List[dict],
+    task_name: str,
+) -> Dict[str, Any]:
+    """Run one episode of a task and return the result."""
+    tool_names = [t["function"]["name"] for t in tools]
+    obs = await env.reset()
+    obs = obs.observation
+    dashboard = obs.metadata.get("dashboard", "")
+    log_start(task=task_name, model=MODEL_NAME)
+    log.debug("[%s] Episode started. Tools: %s", task_name, tool_names)
+    chat_history: List[dict] = [
+        {"role": "system", "content": SYSTEM_PROMPT},
+        {
+            "role": "user",
+            "content": f"Your shift has started. Current dashboard:\n\n{dashboard}",
+        },
+    ]
+    step_count = 0
+    final_reward = 0.0
+    done = obs.done
+    consecutive_errors = 0
+    rewards: List[float] = []
+    past_shift_summaries: List[str] = []
+    current_shift_messages: List[dict] = []
+    while not done and step_count < MAX_STEPS_PER_TASK:
+        step_count += 1
+        log.debug(
+            "[%s] Step %d/%d — %d messages in context",
+            task_name,
+            step_count,
+            MAX_STEPS_PER_TASK,
+            len(chat_history),
+        )
+        try:
+            response = client.chat.completions.create(
+                model=MODEL_NAME,
+                messages=chat_history,
+                tools=tools,
+                tool_choice="required",
+                max_completion_tokens=MAX_TOKENS,
+                temperature=TEMPERATURE,
+            )
+            consecutive_errors = 0
+        except BadRequestError as e:
+            consecutive_errors += 1
+            log.warning(
+                "[%s] Step %d — BadRequestError (%d/%d): %s",
+                task_name,
+                step_count,
+                consecutive_errors,
+                MAX_CONSECUTIVE_ERRORS,
+                e,
+            )
+            if consecutive_errors >= MAX_CONSECUTIVE_ERRORS:
+                log.error(
+                    "[%s] Aborting after %d consecutive errors",
+                    task_name,
+                    MAX_CONSECUTIVE_ERRORS,
+                )
+                break
+            err_msg = (
+                f"Your previous tool call was rejected with an error:\n{e}\n\n"
+                "Please retry with a valid tool call. If your budget is exhausted, call end_shift()."
+            )
+            chat_history.append({"role": "user", "content": err_msg})
+            current_shift_messages.append({"role": "user", "content": err_msg})
+            continue
+        message = response.choices[0].message
+        log.debug(
+            "[%s] Step %d — finish_reason=%s tool_calls=%d",
+            task_name,
+            step_count,
+            response.choices[0].finish_reason,
+            len(message.tool_calls) if message.tool_calls else 0,
+        )
+        if not message.tool_calls:
+            log.warning(
+                "[%s] Step %d — no tool_calls in response; falling back to end_shift",
+                task_name,
+                step_count,
+            )
+            tool_name = "end_shift"
+            tool_args = {}
+            tool_call_id = "fallback"
+        else:
+            tc = message.tool_calls[0]
+            tool_name = tc.function.name
+            tool_call_id = tc.id
+            try:
+                tool_args = json.loads(tc.function.arguments)
+            except (json.JSONDecodeError, AttributeError):
+                log.warning(
+                    "[%s] Step %d — failed to parse tool arguments: %r",
+                    task_name,
+                    step_count,
+                    tc.function.arguments,
+                )
+                tool_args = {}
+        if tool_name not in tool_names:
+            log.warning(
+                "[%s] Step %d — unknown tool %r; falling back to end_shift",
+                task_name,
+                step_count,
+                tool_name,
+            )
+            tool_name = "end_shift"
+            tool_args = {}
+        log.debug(
+            "[%s] Step %d — calling %s(%s)", task_name, step_count, tool_name, tool_args
+        )
+        assistant_msg = {
+            "role": "assistant",
+            "content": None,
+            "tool_calls": [
+                {
+                    "id": tool_call_id,
+                    "type": "function",
+                    "function": {
+                        "name": tool_name,
+                        "arguments": json.dumps(tool_args),
+                    },
+                }
+            ],
+        }
+        chat_history.append(assistant_msg)
+        current_shift_messages.append(assistant_msg)
+        action = CallToolAction(tool_name=tool_name, arguments=tool_args)
+        step_result = await env.step(action)
+        obs = step_result.observation
+        done = obs.done
+        result_text = obs.metadata.get("tool_result", str(obs.metadata))
+        step_reward = obs.reward or 0.0
+        step_error: Optional[str] = None
+        if "EPISODE COMPLETE" in (result_text or ""):
+            log.info("[%s] Step %d — episode complete detected", task_name, step_count)
+            done = True
+        if obs.reward is not None and obs.reward > 0:
+            final_reward = obs.reward
+        rewards.append(step_reward)
+        action_str = f"{tool_name}({json.dumps(tool_args)})"
+        log_step(
+            step=step_count,
+            action=action_str,
+            reward=step_reward,
+            done=done,
+            error=step_error,
+        )
+        tool_result_msg = {
+            "role": "tool",
+            "tool_call_id": tool_call_id,
+            "content": result_text[:2000] if result_text else "OK",
+        }
+        chat_history.append(tool_result_msg)
+        current_shift_messages.append(tool_result_msg)
+        # Budget exhausted — inject directive and skip sleep
+        if "Action budget exhausted" in (result_text or ""):
+            log.info(
+                "[%s] Step %d — budget exhausted; injecting end_shift directive",
+                task_name,
+                step_count,
+            )
+            directive = (
+                "SYSTEM ALERT: Your action budget for this shift is fully exhausted. "
+                "You MUST call end_shift() as your very next action. "
+                "Every other tool call will fail until you do."
+            )
+            chat_history.append({"role": "user", "content": directive})
+            current_shift_messages.append({"role": "user", "content": directive})
+            continue
+        if SLEEP_BETWEEN_STEPS > 0:
+            await asyncio.sleep(SLEEP_BETWEEN_STEPS)
+        # Shift ended — summarise and prune context, then set up next shift
+        if (
+            tool_name == "end_shift"
+            and "END OF SHIFT" in (result_text or "")
+            and not done
+        ):
+            shift_day = "?"
+            for part in (result_text or "").split():
+                if part.isdigit():
+                    shift_day = part
+                    break
+            shift_summary = _make_shift_summary(shift_day, result_text or "")
+            past_shift_summaries.append(shift_summary)
+            log.info(
+                "[%s] Step %d — shift %s ended; pruning context (%d summaries)",
+                task_name,
+                step_count,
+                shift_day,
+                len(past_shift_summaries),
+            )
+            summaries_msg = {
+                "role": "user",
+                "content": "COMPLETED SHIFT SUMMARIES:\n\n"
+                + "\n\n".join(past_shift_summaries),
+            }
+            trimmed = (
+                current_shift_messages[-SHIFT_HISTORY_KEEP:]
+                if len(current_shift_messages) > SHIFT_HISTORY_KEEP
+                else list(current_shift_messages)
+            )
+            # Strip orphaned leading tool-response messages to avoid API errors
+            while trimmed and trimmed[0].get("role") == "tool":
+                log.debug(
+                    "[%s] Dropping orphaned leading tool msg (tool_call_id=%s)",
+                    task_name,
+                    trimmed[0].get("tool_call_id"),
+                )
+                trimmed = trimmed[1:]
+            chat_history = (
+                [
+                    {"role": "system", "content": SYSTEM_PROMPT},
+                    summaries_msg,
+                ]
+                + trimmed
+                + [
+                    {
+                        "role": "user",
+                        "content": "Your next shift has begun. The dashboard is shown above in the last tool result. "
+                        "Continue managing the supply chain.",
+                    },
+                ]
+            )
+            current_shift_messages = []
+    log.info(
+        "[%s] Episode finished. steps=%d done=%s final_reward=%.4f",
+        task_name,
+        step_count,
+        done,
+        final_reward,
+    )
+    return {
+        "task": task_name,
+        "reward": final_reward,
+        "steps": step_count,
+        "done": done,
+        "rewards": rewards,
+    }
+async def async_main() -> None:
+    if not API_KEY:
+        raise SystemExit("HF_TOKEN or API_KEY must be set.")
+    if not MODEL_NAME:
+        raise SystemExit("MODEL_NAME or MODEL must be set.")
+    log.info("Starting. API_BASE_URL=%s MODEL_NAME=%s", API_BASE_URL, MODEL_NAME)
+    client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY)
+    results = []
+    for task_name in TASKS:
+        log.info("Launching task: %s", task_name)
+        env = await MedchainEnv.from_docker_image(
+            "medchain_env-env:latest",
+            env_vars={"MEDCHAIN_TASK": task_name},
+        )
+        final_reward = 0.0
+        success = False
+        steps = 0
+        step_rewards: List[float] = []
+        try:
+            mcp_tools = await env.list_tools()
+            tools = _tools_to_openai_format(mcp_tools)
+            log.info("[%s] %d tools discovered", task_name, len(tools))
+            result = await run_task_episode(env, client, tools, task_name)
+            results.append(result)
+            final_reward = result["reward"]
+            steps = result["steps"]
+            success = result["done"]
+            step_rewards = result["rewards"]
+            log.info(
+                "[%s] Task complete: reward=%.4f steps=%d",
+                task_name,
+                final_reward,
+                steps,
+            )
+        except Exception as e:
+            log.error("[%s] Task failed with exception: %s", task_name, e)
+        finally:
+            try:
+                await env.close()
+            except Exception as e:
+                log.error("[%s] env.close() failed: %s", task_name, e)
+            log_end(
+                success=success, steps=steps, score=final_reward, rewards=step_rewards
+            )
+    if results:
+        avg_reward = sum(r["reward"] for r in results) / len(results)
+        log.info("All tasks complete. avg_reward=%.4f", avg_reward)
+def main() -> None:
+    asyncio.run(async_main())
+if __name__ == "__main__":
+    main()

logs/inference_20260403_015029.log ADDED Viewed

	@@ -0,0 +1,79 @@

+01:50:29 [DEBUG] Using selector: EpollSelector
+01:50:29 [INFO] Starting. API_BASE_URL=https://router.huggingface.co/v1 MODEL_NAME=openai/gpt-oss-20b:groq
+01:50:29 [INFO] Launching task: orientation_ward
+01:50:30 [DEBUG] Starting new HTTP connection (1): localhost:60093
+01:50:30 [DEBUG] Starting new HTTP connection (1): localhost:60093
+01:50:31 [DEBUG] Starting new HTTP connection (1): localhost:60093
+01:50:31 [DEBUG] Starting new HTTP connection (1): localhost:60093
+01:50:32 [DEBUG] Starting new HTTP connection (1): localhost:60093
+01:50:32 [DEBUG] Starting new HTTP connection (1): localhost:60093
+01:50:33 [DEBUG] Starting new HTTP connection (1): localhost:60093
+01:50:33 [DEBUG] Starting new HTTP connection (1): localhost:60093
+01:50:34 [DEBUG] Starting new HTTP connection (1): localhost:60093
+01:50:34 [DEBUG] http://localhost:60093 "GET /health HTTP/1.1" 200 20
+01:50:34 [DEBUG] = connection is CONNECTING
+01:50:34 [DEBUG] > GET /ws HTTP/1.1
+01:50:34 [DEBUG] > Host: localhost:60093
+01:50:34 [DEBUG] > Upgrade: websocket
+01:50:34 [DEBUG] > Connection: Upgrade
+01:50:34 [DEBUG] > Sec-WebSocket-Key: +R8/wDU495sD6dSvTQfJJA==
+01:50:34 [DEBUG] > Sec-WebSocket-Version: 13
+01:50:34 [DEBUG] > Sec-WebSocket-Extensions: permessage-deflate; client_max_window_bits
+01:50:34 [DEBUG] > User-Agent: Python/3.12 websockets/16.0
+01:50:34 [DEBUG] < HTTP/1.1 101 Switching Protocols
+01:50:34 [DEBUG] < Upgrade: websocket
+01:50:34 [DEBUG] < Connection: Upgrade
+01:50:34 [DEBUG] < Sec-WebSocket-Accept: dLM5YlCPG5KeH5IniKYMatgH6Ao=
+01:50:34 [DEBUG] < Sec-WebSocket-Extensions: permessage-deflate
+01:50:34 [DEBUG] < date: Thu, 02 Apr 2026 20:20:33 GMT
+01:50:34 [DEBUG] < server: uvicorn
+01:50:34 [DEBUG] = connection is OPEN
+01:50:34 [DEBUG] > TEXT '{"type": "step", "data": {"type": "list_tools"}}' [48 bytes]
+01:50:34 [DEBUG] < TEXT '{"type":"observation","data":{"observation":{"t...rd":null,"done":false}}' [5881 bytes]
+01:50:34 [DEBUG] Tool registered: read_inbox (required=[])
+01:50:34 [DEBUG] Tool registered: query_erp (required=['table'])
+01:50:34 [DEBUG] Tool registered: query_supplier (required=['supplier_id'])
+01:50:34 [DEBUG] Tool registered: query_forecast (required=['product_id', 'location_id'])
+01:50:34 [DEBUG] Tool registered: submit_po (required=['supplier_id', 'product_id', 'destination_id', 'quantity'])
+01:50:34 [DEBUG] Tool registered: transfer (required=['from_location_id', 'to_location_id', 'product_id', 'quantity'])
+01:50:34 [DEBUG] Tool registered: quarantine_lot (required=['location_id', 'sku', 'lot_id'])
+01:50:34 [DEBUG] Tool registered: file_justification (required=['ticket_id', 'reason'])
+01:50:34 [DEBUG] Tool registered: end_shift (required=[])
+01:50:34 [INFO] [orientation_ward] 9 tools discovered
+01:50:34 [DEBUG] > TEXT '{"type": "reset", "data": {}}' [29 bytes]
+01:50:34 [DEBUG] < TEXT '{"type":"observation","data":{"observation":{"d...ard":0.0,"done":false}}' [1985 bytes]
+01:50:34 [INFO] [orientation_ward] Episode started. Tools: ['read_inbox', 'query_erp', 'query_supplier', 'query_forecast', 'submit_po', 'transfer', 'quarantine_lot', 'file_justification', 'end_shift']
+01:50:34 [INFO] [orientation_ward] Step 1/150 — 2 messages in context
+01:50:34 [DEBUG] Request options: {'method': 'post', 'url': '/chat/completions', 'files': None, 'idempotency_key': 'stainless-python-retry-120d6e65-e8bc-4e74-9bc2-54608097b49a', 'content': None, 'json_data': {'messages': [{'role': 'system', 'content': 'You are an experienced hospital supply chain manager operating a legacy ERP system.\nYour goal is to maintain adequate medical supplies across all locations while controlling costs.\n\nCRITICAL — ACTION BUDGET: You have a strictly limited number of actions per shift.\nBudget does NOT roll over. Unspent actions are lost at end_shift().\n\nRecommended budget allocation (highest priority first):\n  1. read_inbox()                         — ALWAYS do this first to catch urgent alerts\n  2. query_erp(table=\'inventory\')         — check current stock levels across all locations\n  3. submit_po(...)                        — place orders for items below safety stock (PRIORITY)\n  4. end_shift()                           — call this when budget is exhausted OR tasks are done\n\nQuery tools (query_erp expiry/pipeline, query_forecast, query_supplier) are LOW PRIORITY.\nOnly use them if you have budget remaining AFTER placing critical orders.\n\nMANDATORY RULES:\n- If you receive "Action budget exhausted" → call end_shift() as your VERY NEXT action.\n  Do NOT call any other tool. The budget cannot be restored until end_shift() is called.\n- Order early: factor in lead times. If lead time is 2 days, order today to avoid stockout in 2 days.\n- Expedited orders require file_justification(ticket_id=...) with a real clinical reason.\n- FEFO: oldest stock consumed first — check expiry and rotate perishables proactively.\n- Recalls: quarantine the recalled lot immediately, then order a replacement.\n- MCI events: pre-emptive ordering beats reactive ordering. Order extra blood/critical supplies NOW.\n\nSafety stock target: aim for at least (lead_time + 1) × daily_demand units on hand.\n\nWhen calling tools, use the EXACT parameter names shown in the tool descriptions.\n'}, {'role': 'user', 'content': 'Your shift has started. Current dashboard:\n\n╔════════════════════════════════════════════════════════════════════╗\n║   MEDSUPPLY ERP v2.1 — CENTRAL HOSPITAL NETWORK                    ║\n║   Task: orientation_ward | Shift: Day 1 of 2                       ║\n║   Actions remaining: 5/5                                           ║\n║   Budget used: $0 / $5,000                                         ║\n╠════════════════════════════════════════════════════════════════════╣\n║   [!] COMMS PAGER: 1 unread message(s)                             ║\n║   [·] INVDB: No expiry warnings                                    ║\n║   [·] PROCURENET: 0 order(s) in transit                            ║\n╠════════════════════════════════════════════════════════════════════╣\n║   SUPPLIERS (use exact IDs below):                                 ║\n║   MEDLINE → GLOVE-001, SYR-10, MASK-001                            ║\n╚════════════════════════════════════════════════════════════════════╝\nAwaiting input.\nAvailable tools: read_inbox, query_erp, query_supplier, query_forecast, submit_po, transfer, quarantine_lot, file_justification, end_shift'}], 'model': 'openai/gpt-oss-20b:groq', 'max_completion_tokens': 6000, 'temperature': 0.1, 'tool_choice': 'required', 'tools': [{'type': 'function', 'function': {'name': 'read_inbox', 'description': "Read messages from the COMMS PAGER inbox.\n\nArgs:\n    filter: Message filter — 'unread' (default), 'all', or 'flagged'\n\nReturns:\n    Formatted inbox messages as raw text", 'parameters': {'type': 'object', 'properties': {'filter': {'type': 'string', 'description': ''}}, 'required': []}}}, {'type': 'function', 'function': {'name': 'query_erp', 'description': "Query the legacy ERP database.\n\nArgs:\n    table: Table to query — 'inventory', 'expiry', 'pipeline_orders', or 'demand_history'\n    location: Location ID or 'all'. E.g. 'ward_general', 'ward_icu', 'hospital_a'\n    sku: Product SKU or 'all'. E.g. 'B-001', 'IV-500', 'GLOVE-001'\n\nReturns:\n    ASCII table with query results (legacy ERP format)", 'parameters': {'type': 'object', 'properties': {'table': {'type': 'string', 'description': ''}, 'location': {'type': 'string', 'description': ''}, 'sku': {'type': 'string', 'description': ''}}, 'required': ['table']}}}, {'type': 'function', 'function': {'name': 'query_supplier', 'description': 'Query supplier information including current lead times and disruptions.\n\nArgs:\n    supplier_id: Supplier identifier. Check the dashboard for valid supplier IDs.\n\nReturns:\n    Supplier status text including lead times and any active disruptions', 'parameters': {'type': 'object', 'properties': {'supplier_id': {'type': 'string', 'description': ''}}, 'required': ['supplier_id']}}}, {'type': 'function', 'function': {'name': 'query_forecast', 'description': "Get demand forecast for a product at a location.\n\nArgs:\n    product_id: Product SKU to forecast. Use query_erp(table='inventory') to see available SKUs.\n    location_id: Location to forecast for. Use query_erp(table='inventory') to see valid location IDs.\n    horizon_days: Forecast horizon in days (1-21, default 7)\n\nReturns:\n    Forecasted daily demand table", 'parameters': {'type': 'object', 'properties': {'product_id': {'type': 'string', 'description': ''}, 'location_id': {'type': 'string', 'description': ''}, 'horizon_days': {'type': 'integer', 'description': ''}}, 'required': ['product_id', 'location_id']}}}, {'type': 'function', 'function': {'name': 'submit_po', 'description': "Submit a purchase order to a supplier.\n\nArgs:\n    supplier_id: Supplier to order from. Check the dashboard for valid supplier IDs.\n    product_id: Product SKU to order. Use query_erp(table='inventory') to see available SKUs.\n    destination_id: Delivery location. Use query_erp(table='inventory') to see valid location IDs.\n    quantity: Number of units to order (must be positive)\n    priority: 'standard' (default) or 'expedited' (+50% cost, -2 day lead time; requires justification)\n\nReturns:\n    Confirmation with PO ID and ETA, or error if budget/validation fails.\n    For expedited orders: returns BUDGET_OVERRIDE_REQUIRED with a ticket ID.\n    Use file_justification(ticket_id=...) to proceed.", 'parameters': {'type': 'object', 'properties': {'supplier_id': {'type': 'string', 'description': ''}, 'product_id': {'type': 'string', 'description': ''}, 'destination_id': {'type': 'string', 'description': ''}, 'quantity': {'type': 'integer', 'description': ''}, 'priority': {'type': 'string', 'description': ''}}, 'required': ['supplier_id', 'product_id', 'destination_id', 'quantity']}}}, {'type': 'function', 'function': {'name': 'transfer', 'description': "Transfer inventory between locations (small handling fee).\n\nArgs:\n    from_location_id: Source location. Use query_erp(table='inventory') to see valid location IDs.\n    to_location_id: Destination location. Use query_erp(table='inventory') to see valid location IDs.\n    product_id: Product SKU to transfer. Use query_erp(table='inventory') to see available SKUs.\n    quantity: Units to transfer (must not exceed available stock at source)\n\nReturns:\n    Confirmation or error (insufficient stock, capacity exceeded)", 'parameters': {'type': 'object', 'properties': {'from_location_id': {'type': 'string', 'description': ''}, 'to_location_id': {'type': 'string', 'description': ''}, 'product_id': {'type': 'string', 'description': ''}, 'quantity': {'type': 'integer', 'description': ''}}, 'required': ['from_location_id', 'to_location_id', 'product_id', 'quantity']}}}, {'type': 'function', 'function': {'name': 'quarantine_lot', 'description': "Quarantine a specific inventory lot (e.g. for product recalls or cold chain breaches).\nQuarantined lots are excluded from demand fulfillment.\n\nArgs:\n    location_id: Location where the lot is stored\n    sku: Product SKU of the lot\n    lot_id: Lot identifier (from inventory query). Use 'all' to quarantine all lots of this SKU at this location.\n\nReturns:\n    Confirmation with quarantine details and disposal ticket ID", 'parameters': {'type': 'object', 'properties': {'location_id': {'type': 'string', 'description': ''}, 'sku': {'type': 'string', 'description': ''}, 'lot_id': {'type': 'string', 'description': ''}}, 'required': ['location_id', 'sku', 'lot_id']}}}, {'type': 'function', 'function': {'name': 'file_justification', 'description': "File a budget override justification to proceed with an expedited order.\n\nRequired after submit_po returns BUDGET_OVERRIDE_REQUIRED.\nThe justification is audited by Finance — it must reference the current clinical situation.\nA false or incoherent justification is flagged and results in a scoring penalty.\n\nArgs:\n    ticket_id: The ticket ID from the BUDGET_OVERRIDE_REQUIRED error (e.g. 'BOT-0002')\n    reason: Free-text justification. Must clearly reference the reason for urgency.\n\nReturns:\n    OK confirmation and PO submission details, or FLAGGED audit warning", 'parameters': {'type': 'object', 'properties': {'ticket_id': {'type': 'string', 'description': ''}, 'reason': {'type': 'string', 'description': ''}}, 'required': ['ticket_id', 'reason']}}}, {'type': 'function', 'function': {'name': 'end_shift', 'description': 'End the current shift and advance the simulation by one day.\n\nCommits all pending decisions. Simulates demand, deliveries, and expiry for the day.\nResets your action budget for the next shift.\nUnspent actions are lost — no rollover.\n\nReturns:\n    Day summary report + next shift dashboard', 'parameters': {'type': 'object', 'properties': {}, 'required': []}}}]}}
+01:50:34 [DEBUG] Sending HTTP Request: POST https://router.huggingface.co/v1/chat/completions
+01:50:34 [DEBUG] connect_tcp.started host='router.huggingface.co' port=443 local_address=None timeout=5.0 socket_options=None
+01:50:34 [DEBUG] connect_tcp.complete return_value=<httpcore._backends.sync.SyncStream object at 0x747917eb6240>
+01:50:34 [DEBUG] start_tls.started ssl_context=<ssl.SSLContext object at 0x74791cb135d0> server_hostname='router.huggingface.co' timeout=5.0
+01:50:34 [DEBUG] start_tls.complete return_value=<httpcore._backends.sync.SyncStream object at 0x747917eb4500>
+01:50:34 [DEBUG] send_request_headers.started request=<Request [b'POST']>
+01:50:34 [DEBUG] send_request_headers.complete
+01:50:34 [DEBUG] send_request_body.started request=<Request [b'POST']>
+01:50:34 [DEBUG] send_request_body.complete
+01:50:34 [DEBUG] receive_response_headers.started request=<Request [b'POST']>
+01:50:35 [DEBUG] receive_response_headers.complete return_value=(b'HTTP/1.1', 200, b'OK', [(b'Content-Type', b'application/json'), (b'Transfer-Encoding', b'chunked'), (b'Connection', b'keep-alive'), (b'Date', b'Thu, 02 Apr 2026 20:20:35 GMT'), (b'x-ratelimit-reset-requests', b'60ms'), (b'x-ratelimit-reset-tokens', b'227ms'), (b'X-Powered-By', b'huggingface-moon'), (b'x-request-id', b'req_01kn7xnnw2et7b44t68w493sbh'), (b'cross-origin-opener-policy', b'same-origin'), (b'Referrer-Policy', b'strict-origin-when-cross-origin'), (b'vary', b'Origin'), (b'Access-Control-Allow-Origin', b'*'), (b'Access-Control-Expose-Headers', b'X-Repo-Commit,X-Request-Id,X-Error-Code,X-Error-Message,X-Total-Count,ETag,Link,Accept-Ranges,Content-Range,X-Linked-Size,X-Linked-ETag,X-Xet-Hash'), (b'X-Robots-Tag', b'none'), (b'x-inference-provider', b'groq'), (b'cache-control', b'private, max-age=0, no-store, no-cache, must-revalidate'), (b'cf-cache-status', b'DYNAMIC'), (b'cf-ray', b'9e6288f8ae430833-IAD'), (b'server', b'cloudflare'), (b'set-cookie', b'__cf_bm=3TErgywCHmJ_bGgyJphVZdQpluX.hxyLeoVsm8_e_CY-1775161235.3111424-1.0.1.1-U.ik6GSM5toZqCr9TaLOgDqgUJnOiV4SOllSRV14nfHpefPesAChv35s_asXm1_5rjY.rCEVP99ehFWf4o7xE4KUcikmwkYAGRHlPBg.mqYf.csPV611ScNfRGbN3h8G; HttpOnly; Secure; Path=/; Domain=groq.com; Expires=Thu, 02 Apr 2026 20:50:35 GMT'), (b'strict-transport-security', b'max-age=15552000'), (b'x-groq-region', b'msp'), (b'x-ratelimit-limit-requests', b'1440000'), (b'x-ratelimit-limit-tokens', b'750000'), (b'x-ratelimit-remaining-requests', b'1439999'), (b'x-ratelimit-remaining-tokens', b'747158'), (b'X-Cache', b'Miss from cloudfront'), (b'Via', b'1.1 e6a37c61d86d6e0bcdada5b6b948004c.cloudfront.net (CloudFront)'), (b'X-Amz-Cf-Pop', b'BLR50-P4'), (b'X-Amz-Cf-Id', b'E630iyt9rgMwsBGL2qEjR0o9NYKo5FhjdWsUglT3OhS6BLdKRxVETQ==')])
+01:50:35 [INFO] HTTP Request: POST https://router.huggingface.co/v1/chat/completions "HTTP/1.1 200 OK"
+01:50:35 [DEBUG] receive_response_body.started request=<Request [b'POST']>
+01:50:35 [DEBUG] receive_response_body.complete
+01:50:35 [DEBUG] response_closed.started
+01:50:35 [DEBUG] response_closed.complete
+01:50:35 [DEBUG] HTTP Response: POST https://router.huggingface.co/v1/chat/completions "200 OK" Headers({'content-type': 'application/json', 'transfer-encoding': 'chunked', 'connection': 'keep-alive', 'date': 'Thu, 02 Apr 2026 20:20:35 GMT', 'x-ratelimit-reset-requests': '60ms', 'x-ratelimit-reset-tokens': '227ms', 'x-powered-by': 'huggingface-moon', 'x-request-id': 'req_01kn7xnnw2et7b44t68w493sbh', 'cross-origin-opener-policy': 'same-origin', 'referrer-policy': 'strict-origin-when-cross-origin', 'vary': 'Origin', 'access-control-allow-origin': '*', 'access-control-expose-headers': 'X-Repo-Commit,X-Request-Id,X-Error-Code,X-Error-Message,X-Total-Count,ETag,Link,Accept-Ranges,Content-Range,X-Linked-Size,X-Linked-ETag,X-Xet-Hash', 'x-robots-tag': 'none', 'x-inference-provider': 'groq', 'cache-control': 'private, max-age=0, no-store, no-cache, must-revalidate', 'cf-cache-status': 'DYNAMIC', 'cf-ray': '9e6288f8ae430833-IAD', 'server': 'cloudflare', 'set-cookie': '__cf_bm=3TErgywCHmJ_bGgyJphVZdQpluX.hxyLeoVsm8_e_CY-1775161235.3111424-1.0.1.1-U.ik6GSM5toZqCr9TaLOgDqgUJnOiV4SOllSRV14nfHpefPesAChv35s_asXm1_5rjY.rCEVP99ehFWf4o7xE4KUcikmwkYAGRHlPBg.mqYf.csPV611ScNfRGbN3h8G; HttpOnly; Secure; Path=/; Domain=groq.com; Expires=Thu, 02 Apr 2026 20:50:35 GMT', 'strict-transport-security': 'max-age=15552000', 'x-groq-region': 'msp', 'x-ratelimit-limit-requests': '1440000', 'x-ratelimit-limit-tokens': '750000', 'x-ratelimit-remaining-requests': '1439999', 'x-ratelimit-remaining-tokens': '747158', 'x-cache': 'Miss from cloudfront', 'via': '1.1 e6a37c61d86d6e0bcdada5b6b948004c.cloudfront.net (CloudFront)', 'x-amz-cf-pop': 'BLR50-P4', 'x-amz-cf-id': 'E630iyt9rgMwsBGL2qEjR0o9NYKo5FhjdWsUglT3OhS6BLdKRxVETQ=='})
+01:50:35 [DEBUG] request_id: req_01kn7xnnw2et7b44t68w493sbh
+01:50:35 [DEBUG] [orientation_ward] Step 1 — finish_reason=tool_calls tool_calls=1
+01:50:35 [DEBUG] [orientation_ward] Step 1 — calling read_inbox({'filter': 'unread'})
+01:50:35 [DEBUG] > TEXT '{"type": "step", "data": {"type": "call_tool", ... {"filter": "unread"}}}' [109 bytes]
+01:50:35 [DEBUG] < TEXT '{"type":"observation","data":{"observation":{"t...rd":0.01,"done":false}}' [530 bytes]
+01:50:36 [DEBUG] > TEXT '{"type": "close"}' [17 bytes]
+01:50:36 [DEBUG] > CLOSE 1000 (OK) [2 bytes]
+01:50:36 [DEBUG] = connection is CLOSING
+01:50:36 [DEBUG] < CLOSE 1000 (OK) [2 bytes]
+01:50:36 [DEBUG] < EOF
+01:50:36 [DEBUG] > EOF
+01:50:36 [DEBUG] = connection is CLOSED
+01:50:36 [DEBUG] x half-closing TCP connection
+01:50:37 [DEBUG] close.started
+01:50:37 [DEBUG] close.complete

logs/inference_20260403_015146.log ADDED Viewed

	@@ -0,0 +1,118 @@

+01:51:46 [DEBUG] Using selector: EpollSelector
+01:51:46 [INFO] Starting. API_BASE_URL=https://router.huggingface.co/v1 MODEL_NAME=openai/gpt-oss-20b:groq
+01:51:46 [INFO] Launching task: orientation_ward
+01:51:48 [DEBUG] Starting new HTTP connection (1): localhost:56401
+01:51:48 [DEBUG] Starting new HTTP connection (1): localhost:56401
+01:51:49 [DEBUG] Starting new HTTP connection (1): localhost:56401
+01:51:49 [DEBUG] Starting new HTTP connection (1): localhost:56401
+01:51:50 [DEBUG] Starting new HTTP connection (1): localhost:56401
+01:51:50 [DEBUG] Starting new HTTP connection (1): localhost:56401
+01:51:51 [DEBUG] Starting new HTTP connection (1): localhost:56401
+01:51:51 [DEBUG] Starting new HTTP connection (1): localhost:56401
+01:51:51 [DEBUG] http://localhost:56401 "GET /health HTTP/1.1" 200 20
+01:51:51 [DEBUG] = connection is CONNECTING
+01:51:51 [DEBUG] > GET /ws HTTP/1.1
+01:51:51 [DEBUG] > Host: localhost:56401
+01:51:51 [DEBUG] > Upgrade: websocket
+01:51:51 [DEBUG] > Connection: Upgrade
+01:51:51 [DEBUG] > Sec-WebSocket-Key: UPQpJ2hOA/eVbLZYiexy3A==
+01:51:51 [DEBUG] > Sec-WebSocket-Version: 13
+01:51:51 [DEBUG] > Sec-WebSocket-Extensions: permessage-deflate; client_max_window_bits
+01:51:51 [DEBUG] > User-Agent: Python/3.12 websockets/16.0
+01:51:51 [DEBUG] < HTTP/1.1 101 Switching Protocols
+01:51:51 [DEBUG] < Upgrade: websocket
+01:51:51 [DEBUG] < Connection: Upgrade
+01:51:51 [DEBUG] < Sec-WebSocket-Accept: uozQiiC2n5BYDgzKNQ0WuU0Lrhs=
+01:51:51 [DEBUG] < Sec-WebSocket-Extensions: permessage-deflate
+01:51:51 [DEBUG] < date: Thu, 02 Apr 2026 20:21:51 GMT
+01:51:51 [DEBUG] < server: uvicorn
+01:51:51 [DEBUG] = connection is OPEN
+01:51:51 [DEBUG] > TEXT '{"type": "step", "data": {"type": "list_tools"}}' [48 bytes]
+01:51:51 [DEBUG] < TEXT '{"type":"observation","data":{"observation":{"t...rd":null,"done":false}}' [5881 bytes]
+01:51:51 [DEBUG] Tool registered: read_inbox (required=[])
+01:51:51 [DEBUG] Tool registered: query_erp (required=['table'])
+01:51:51 [DEBUG] Tool registered: query_supplier (required=['supplier_id'])
+01:51:51 [DEBUG] Tool registered: query_forecast (required=['product_id', 'location_id'])
+01:51:51 [DEBUG] Tool registered: submit_po (required=['supplier_id', 'product_id', 'destination_id', 'quantity'])
+01:51:51 [DEBUG] Tool registered: transfer (required=['from_location_id', 'to_location_id', 'product_id', 'quantity'])
+01:51:51 [DEBUG] Tool registered: quarantine_lot (required=['location_id', 'sku', 'lot_id'])
+01:51:51 [DEBUG] Tool registered: file_justification (required=['ticket_id', 'reason'])
+01:51:51 [DEBUG] Tool registered: end_shift (required=[])
+01:51:51 [INFO] [orientation_ward] 9 tools discovered
+01:51:51 [DEBUG] > TEXT '{"type": "reset", "data": {}}' [29 bytes]
+01:51:51 [DEBUG] < TEXT '{"type":"observation","data":{"observation":{"d...ard":0.0,"done":false}}' [1985 bytes]
+01:51:51 [INFO] [orientation_ward] Episode started. Tools: ['read_inbox', 'query_erp', 'query_supplier', 'query_forecast', 'submit_po', 'transfer', 'quarantine_lot', 'file_justification', 'end_shift']
+01:51:51 [INFO] [orientation_ward] Step 1/150 — 2 messages in context
+01:51:51 [DEBUG] Request options: {'method': 'post', 'url': '/chat/completions', 'files': None, 'idempotency_key': 'stainless-python-retry-e3ff487d-994c-4de9-bab8-4d1e8b2b2359', 'content': None, 'json_data': {'messages': [{'role': 'system', 'content': 'You are an experienced hospital supply chain manager operating a legacy ERP system.\nYour goal is to maintain adequate medical supplies across all locations while controlling costs.\n\nCRITICAL — ACTION BUDGET: You have a strictly limited number of actions per shift.\nBudget does NOT roll over. Unspent actions are lost at end_shift().\n\nRecommended budget allocation (highest priority first):\n  1. read_inbox()                         — ALWAYS do this first to catch urgent alerts\n  2. query_erp(table=\'inventory\')         — check current stock levels across all locations\n  3. submit_po(...)                        — place orders for items below safety stock (PRIORITY)\n  4. end_shift()                           — call this when budget is exhausted OR tasks are done\n\nQuery tools (query_erp expiry/pipeline, query_forecast, query_supplier) are LOW PRIORITY.\nOnly use them if you have budget remaining AFTER placing critical orders.\n\nMANDATORY RULES:\n- If you receive "Action budget exhausted" → call end_shift() as your VERY NEXT action.\n  Do NOT call any other tool. The budget cannot be restored until end_shift() is called.\n- Order early: factor in lead times. If lead time is 2 days, order today to avoid stockout in 2 days.\n- Expedited orders require file_justification(ticket_id=...) with a real clinical reason.\n- FEFO: oldest stock consumed first — check expiry and rotate perishables proactively.\n- Recalls: quarantine the recalled lot immediately, then order a replacement.\n- MCI events: pre-emptive ordering beats reactive ordering. Order extra blood/critical supplies NOW.\n\nSafety stock target: aim for at least (lead_time + 1) × daily_demand units on hand.\n\nWhen calling tools, use the EXACT parameter names shown in the tool descriptions.\n'}, {'role': 'user', 'content': 'Your shift has started. Current dashboard:\n\n╔════════════════════════════════════════════════════════════════════╗\n║   MEDSUPPLY ERP v2.1 — CENTRAL HOSPITAL NETWORK                    ║\n║   Task: orientation_ward | Shift: Day 1 of 2                       ║\n║   Actions remaining: 5/5                                           ║\n║   Budget used: $0 / $5,000                                         ║\n╠════════════════════════════════════════════════════════════════════╣\n║   [!] COMMS PAGER: 1 unread message(s)                             ║\n║   [·] INVDB: No expiry warnings                                    ║\n║   [·] PROCURENET: 0 order(s) in transit                            ║\n╠════════════════════════════════════════════════════════════════════╣\n║   SUPPLIERS (use exact IDs below):                                 ║\n║   MEDLINE → GLOVE-001, SYR-10, MASK-001                            ║\n╚════════════════════════════════════════════════════════════════════╝\nAwaiting input.\nAvailable tools: read_inbox, query_erp, query_supplier, query_forecast, submit_po, transfer, quarantine_lot, file_justification, end_shift'}], 'model': 'openai/gpt-oss-20b:groq', 'max_completion_tokens': 6000, 'temperature': 0.1, 'tool_choice': 'required', 'tools': [{'type': 'function', 'function': {'name': 'read_inbox', 'description': "Read messages from the COMMS PAGER inbox.\n\nArgs:\n    filter: Message filter — 'unread' (default), 'all', or 'flagged'\n\nReturns:\n    Formatted inbox messages as raw text", 'parameters': {'type': 'object', 'properties': {'filter': {'type': 'string', 'description': ''}}, 'required': []}}}, {'type': 'function', 'function': {'name': 'query_erp', 'description': "Query the legacy ERP database.\n\nArgs:\n    table: Table to query — 'inventory', 'expiry', 'pipeline_orders', or 'demand_history'\n    location: Location ID or 'all'. E.g. 'ward_general', 'ward_icu', 'hospital_a'\n    sku: Product SKU or 'all'. E.g. 'B-001', 'IV-500', 'GLOVE-001'\n\nReturns:\n    ASCII table with query results (legacy ERP format)", 'parameters': {'type': 'object', 'properties': {'table': {'type': 'string', 'description': ''}, 'location': {'type': 'string', 'description': ''}, 'sku': {'type': 'string', 'description': ''}}, 'required': ['table']}}}, {'type': 'function', 'function': {'name': 'query_supplier', 'description': 'Query supplier information including current lead times and disruptions.\n\nArgs:\n    supplier_id: Supplier identifier. Check the dashboard for valid supplier IDs.\n\nReturns:\n    Supplier status text including lead times and any active disruptions', 'parameters': {'type': 'object', 'properties': {'supplier_id': {'type': 'string', 'description': ''}}, 'required': ['supplier_id']}}}, {'type': 'function', 'function': {'name': 'query_forecast', 'description': "Get demand forecast for a product at a location.\n\nArgs:\n    product_id: Product SKU to forecast. Use query_erp(table='inventory') to see available SKUs.\n    location_id: Location to forecast for. Use query_erp(table='inventory') to see valid location IDs.\n    horizon_days: Forecast horizon in days (1-21, default 7)\n\nReturns:\n    Forecasted daily demand table", 'parameters': {'type': 'object', 'properties': {'product_id': {'type': 'string', 'description': ''}, 'location_id': {'type': 'string', 'description': ''}, 'horizon_days': {'type': 'integer', 'description': ''}}, 'required': ['product_id', 'location_id']}}}, {'type': 'function', 'function': {'name': 'submit_po', 'description': "Submit a purchase order to a supplier.\n\nArgs:\n    supplier_id: Supplier to order from. Check the dashboard for valid supplier IDs.\n    product_id: Product SKU to order. Use query_erp(table='inventory') to see available SKUs.\n    destination_id: Delivery location. Use query_erp(table='inventory') to see valid location IDs.\n    quantity: Number of units to order (must be positive)\n    priority: 'standard' (default) or 'expedited' (+50% cost, -2 day lead time; requires justification)\n\nReturns:\n    Confirmation with PO ID and ETA, or error if budget/validation fails.\n    For expedited orders: returns BUDGET_OVERRIDE_REQUIRED with a ticket ID.\n    Use file_justification(ticket_id=...) to proceed.", 'parameters': {'type': 'object', 'properties': {'supplier_id': {'type': 'string', 'description': ''}, 'product_id': {'type': 'string', 'description': ''}, 'destination_id': {'type': 'string', 'description': ''}, 'quantity': {'type': 'integer', 'description': ''}, 'priority': {'type': 'string', 'description': ''}}, 'required': ['supplier_id', 'product_id', 'destination_id', 'quantity']}}}, {'type': 'function', 'function': {'name': 'transfer', 'description': "Transfer inventory between locations (small handling fee).\n\nArgs:\n    from_location_id: Source location. Use query_erp(table='inventory') to see valid location IDs.\n    to_location_id: Destination location. Use query_erp(table='inventory') to see valid location IDs.\n    product_id: Product SKU to transfer. Use query_erp(table='inventory') to see available SKUs.\n    quantity: Units to transfer (must not exceed available stock at source)\n\nReturns:\n    Confirmation or error (insufficient stock, capacity exceeded)", 'parameters': {'type': 'object', 'properties': {'from_location_id': {'type': 'string', 'description': ''}, 'to_location_id': {'type': 'string', 'description': ''}, 'product_id': {'type': 'string', 'description': ''}, 'quantity': {'type': 'integer', 'description': ''}}, 'required': ['from_location_id', 'to_location_id', 'product_id', 'quantity']}}}, {'type': 'function', 'function': {'name': 'quarantine_lot', 'description': "Quarantine a specific inventory lot (e.g. for product recalls or cold chain breaches).\nQuarantined lots are excluded from demand fulfillment.\n\nArgs:\n    location_id: Location where the lot is stored\n    sku: Product SKU of the lot\n    lot_id: Lot identifier (from inventory query). Use 'all' to quarantine all lots of this SKU at this location.\n\nReturns:\n    Confirmation with quarantine details and disposal ticket ID", 'parameters': {'type': 'object', 'properties': {'location_id': {'type': 'string', 'description': ''}, 'sku': {'type': 'string', 'description': ''}, 'lot_id': {'type': 'string', 'description': ''}}, 'required': ['location_id', 'sku', 'lot_id']}}}, {'type': 'function', 'function': {'name': 'file_justification', 'description': "File a budget override justification to proceed with an expedited order.\n\nRequired after submit_po returns BUDGET_OVERRIDE_REQUIRED.\nThe justification is audited by Finance — it must reference the current clinical situation.\nA false or incoherent justification is flagged and results in a scoring penalty.\n\nArgs:\n    ticket_id: The ticket ID from the BUDGET_OVERRIDE_REQUIRED error (e.g. 'BOT-0002')\n    reason: Free-text justification. Must clearly reference the reason for urgency.\n\nReturns:\n    OK confirmation and PO submission details, or FLAGGED audit warning", 'parameters': {'type': 'object', 'properties': {'ticket_id': {'type': 'string', 'description': ''}, 'reason': {'type': 'string', 'description': ''}}, 'required': ['ticket_id', 'reason']}}}, {'type': 'function', 'function': {'name': 'end_shift', 'description': 'End the current shift and advance the simulation by one day.\n\nCommits all pending decisions. Simulates demand, deliveries, and expiry for the day.\nResets your action budget for the next shift.\nUnspent actions are lost — no rollover.\n\nReturns:\n    Day summary report + next shift dashboard', 'parameters': {'type': 'object', 'properties': {}, 'required': []}}}]}}
+01:51:51 [DEBUG] Sending HTTP Request: POST https://router.huggingface.co/v1/chat/completions
+01:51:51 [DEBUG] connect_tcp.started host='router.huggingface.co' port=443 local_address=None timeout=5.0 socket_options=None
+01:51:51 [DEBUG] connect_tcp.complete return_value=<httpcore._backends.sync.SyncStream object at 0x723ff2d1e240>
+01:51:51 [DEBUG] start_tls.started ssl_context=<ssl.SSLContext object at 0x723ff3a175d0> server_hostname='router.huggingface.co' timeout=5.0
+01:51:51 [DEBUG] start_tls.complete return_value=<httpcore._backends.sync.SyncStream object at 0x723ff2d5e1b0>
+01:51:51 [DEBUG] send_request_headers.started request=<Request [b'POST']>
+01:51:51 [DEBUG] send_request_headers.complete
+01:51:51 [DEBUG] send_request_body.started request=<Request [b'POST']>
+01:51:51 [DEBUG] send_request_body.complete
+01:51:51 [DEBUG] receive_response_headers.started request=<Request [b'POST']>
+01:51:52 [DEBUG] receive_response_headers.complete return_value=(b'HTTP/1.1', 200, b'OK', [(b'Content-Type', b'application/json'), (b'Transfer-Encoding', b'chunked'), (b'Connection', b'keep-alive'), (b'Date', b'Thu, 02 Apr 2026 20:21:52 GMT'), (b'x-ratelimit-reset-requests', b'60ms'), (b'x-ratelimit-reset-tokens', b'243ms'), (b'X-Powered-By', b'huggingface-moon'), (b'x-request-id', b'req_01kn7xr0r6esjbexpypgn91mcn'), (b'cross-origin-opener-policy', b'same-origin'), (b'Referrer-Policy', b'strict-origin-when-cross-origin'), (b'vary', b'Origin'), (b'Access-Control-Allow-Origin', b'*'), (b'Access-Control-Expose-Headers', b'X-Repo-Commit,X-Request-Id,X-Error-Code,X-Error-Message,X-Total-Count,ETag,Link,Accept-Ranges,Content-Range,X-Linked-Size,X-Linked-ETag,X-Xet-Hash'), (b'X-Robots-Tag', b'none'), (b'x-inference-provider', b'groq'), (b'cache-control', b'private, max-age=0, no-store, no-cache, must-revalidate'), (b'cf-cache-status', b'DYNAMIC'), (b'cf-ray', b'9e628ad7db6238b3-IAD'), (b'server', b'cloudflare'), (b'set-cookie', b'__cf_bm=3duIfDUqh7IEx.T59dNY7MCKZ6iACTEm0iEiUVIoY2U-1775161311.9782407-1.0.1.1-X0AsT1C1g8zYMMIH_MXwx4S8ez43Cv5CvQjRikgQxslvIxEKA.l.n68Qvk_S0dezSdAvUZJG5vYLnWbrIvo6IAlffVPELTYMA0aG8qTBxXhp7M94.R7HUdnNY27vxIG2; HttpOnly; Secure; Path=/; Domain=groq.com; Expires=Thu, 02 Apr 2026 20:51:52 GMT'), (b'strict-transport-security', b'max-age=15552000'), (b'x-groq-region', b'yul'), (b'x-ratelimit-limit-requests', b'1440000'), (b'x-ratelimit-limit-tokens', b'750000'), (b'x-ratelimit-remaining-requests', b'1439999'), (b'x-ratelimit-remaining-tokens', b'746961'), (b'X-Cache', b'Miss from cloudfront'), (b'Via', b'1.1 61abf3889106f07b08b458478f6aee64.cloudfront.net (CloudFront)'), (b'X-Amz-Cf-Pop', b'BLR50-P4'), (b'X-Amz-Cf-Id', b'7uub2pGJ0RN4YwVF3Pq0QciaRg4blFRbyBjxqtf4uVWgSrQC_pBGRQ==')])
+01:51:52 [INFO] HTTP Request: POST https://router.huggingface.co/v1/chat/completions "HTTP/1.1 200 OK"
+01:51:52 [DEBUG] receive_response_body.started request=<Request [b'POST']>
+01:51:52 [DEBUG] receive_response_body.complete
+01:51:52 [DEBUG] response_closed.started
+01:51:52 [DEBUG] response_closed.complete
+01:51:52 [DEBUG] HTTP Response: POST https://router.huggingface.co/v1/chat/completions "200 OK" Headers({'content-type': 'application/json', 'transfer-encoding': 'chunked', 'connection': 'keep-alive', 'date': 'Thu, 02 Apr 2026 20:21:52 GMT', 'x-ratelimit-reset-requests': '60ms', 'x-ratelimit-reset-tokens': '243ms', 'x-powered-by': 'huggingface-moon', 'x-request-id': 'req_01kn7xr0r6esjbexpypgn91mcn', 'cross-origin-opener-policy': 'same-origin', 'referrer-policy': 'strict-origin-when-cross-origin', 'vary': 'Origin', 'access-control-allow-origin': '*', 'access-control-expose-headers': 'X-Repo-Commit,X-Request-Id,X-Error-Code,X-Error-Message,X-Total-Count,ETag,Link,Accept-Ranges,Content-Range,X-Linked-Size,X-Linked-ETag,X-Xet-Hash', 'x-robots-tag': 'none', 'x-inference-provider': 'groq', 'cache-control': 'private, max-age=0, no-store, no-cache, must-revalidate', 'cf-cache-status': 'DYNAMIC', 'cf-ray': '9e628ad7db6238b3-IAD', 'server': 'cloudflare', 'set-cookie': '__cf_bm=3duIfDUqh7IEx.T59dNY7MCKZ6iACTEm0iEiUVIoY2U-1775161311.9782407-1.0.1.1-X0AsT1C1g8zYMMIH_MXwx4S8ez43Cv5CvQjRikgQxslvIxEKA.l.n68Qvk_S0dezSdAvUZJG5vYLnWbrIvo6IAlffVPELTYMA0aG8qTBxXhp7M94.R7HUdnNY27vxIG2; HttpOnly; Secure; Path=/; Domain=groq.com; Expires=Thu, 02 Apr 2026 20:51:52 GMT', 'strict-transport-security': 'max-age=15552000', 'x-groq-region': 'yul', 'x-ratelimit-limit-requests': '1440000', 'x-ratelimit-limit-tokens': '750000', 'x-ratelimit-remaining-requests': '1439999', 'x-ratelimit-remaining-tokens': '746961', 'x-cache': 'Miss from cloudfront', 'via': '1.1 61abf3889106f07b08b458478f6aee64.cloudfront.net (CloudFront)', 'x-amz-cf-pop': 'BLR50-P4', 'x-amz-cf-id': '7uub2pGJ0RN4YwVF3Pq0QciaRg4blFRbyBjxqtf4uVWgSrQC_pBGRQ=='})
+01:51:52 [DEBUG] request_id: req_01kn7xr0r6esjbexpypgn91mcn
+01:51:52 [DEBUG] [orientation_ward] Step 1 — finish_reason=tool_calls tool_calls=1
+01:51:52 [DEBUG] [orientation_ward] Step 1 — calling read_inbox({'filter': 'unread'})
+01:51:52 [DEBUG] > TEXT '{"type": "step", "data": {"type": "call_tool", ... {"filter": "unread"}}}' [109 bytes]
+01:51:52 [DEBUG] < TEXT '{"type":"observation","data":{"observation":{"t...rd":0.01,"done":false}}' [530 bytes]
+01:51:54 [INFO] [orientation_ward] Step 2/150 — 4 messages in context
+01:51:54 [DEBUG] Request options: {'method': 'post', 'url': '/chat/completions', 'files': None, 'idempotency_key': 'stainless-python-retry-c34f7880-6ee7-4fd1-8588-a7abbd81c6db', 'content': None, 'json_data': {'messages': [{'role': 'system', 'content': 'You are an experienced hospital supply chain manager operating a legacy ERP system.\nYour goal is to maintain adequate medical supplies across all locations while controlling costs.\n\nCRITICAL — ACTION BUDGET: You have a strictly limited number of actions per shift.\nBudget does NOT roll over. Unspent actions are lost at end_shift().\n\nRecommended budget allocation (highest priority first):\n  1. read_inbox()                         — ALWAYS do this first to catch urgent alerts\n  2. query_erp(table=\'inventory\')         — check current stock levels across all locations\n  3. submit_po(...)                        — place orders for items below safety stock (PRIORITY)\n  4. end_shift()                           — call this when budget is exhausted OR tasks are done\n\nQuery tools (query_erp expiry/pipeline, query_forecast, query_supplier) are LOW PRIORITY.\nOnly use them if you have budget remaining AFTER placing critical orders.\n\nMANDATORY RULES:\n- If you receive "Action budget exhausted" → call end_shift() as your VERY NEXT action.\n  Do NOT call any other tool. The budget cannot be restored until end_shift() is called.\n- Order early: factor in lead times. If lead time is 2 days, order today to avoid stockout in 2 days.\n- Expedited orders require file_justification(ticket_id=...) with a real clinical reason.\n- FEFO: oldest stock consumed first — check expiry and rotate perishables proactively.\n- Recalls: quarantine the recalled lot immediately, then order a replacement.\n- MCI events: pre-emptive ordering beats reactive ordering. Order extra blood/critical supplies NOW.\n\nSafety stock target: aim for at least (lead_time + 1) × daily_demand units on hand.\n\nWhen calling tools, use the EXACT parameter names shown in the tool descriptions.\n'}, {'role': 'user', 'content': 'Your shift has started. Current dashboard:\n\n╔════════════════════════════════════════════════════════════════════╗\n║   MEDSUPPLY ERP v2.1 — CENTRAL HOSPITAL NETWORK                    ║\n║   Task: orientation_ward | Shift: Day 1 of 2                       ║\n║   Actions remaining: 5/5                                           ║\n║   Budget used: $0 / $5,000                                         ║\n╠════════════════════════════════════════════════════════════════════╣\n║   [!] COMMS PAGER: 1 unread message(s)                             ║\n║   [·] INVDB: No expiry warnings                                    ║\n║   [·] PROCURENET: 0 order(s) in transit                            ║\n╠════════════════════════════════════════════════════════════════════╣\n║   SUPPLIERS (use exact IDs below):                                 ║\n║   MEDLINE → GLOVE-001, SYR-10, MASK-001                            ║\n╚════════════════════════════════════════════════════════════════════╝\nAwaiting input.\nAvailable tools: read_inbox, query_erp, query_supplier, query_forecast, submit_po, transfer, quarantine_lot, file_justification, end_shift'}, {'role': 'assistant', 'content': None, 'tool_calls': [{'id': 'fc_edbe312b-e6c1-4594-9789-a9b65675543d', 'type': 'function', 'function': {'name': 'read_inbox', 'arguments': '{"filter": "unread"}'}}]}, {'role': 'tool', 'tool_call_id': 'fc_edbe312b-e6c1-4594-9789-a9b65675543d', 'content': '\n[MSG MSG-0001 | READ | PRIORITY: LOW | Day 1 08:00]\nFROM: System\nSUBJ: Shift Handover Notes\n\nWelcome to the orientation_ward scenario.\nYou are managing medical supplies for 2 days.\nAction budget: 5 actions per shift.\nBudget ceiling: $5,000 outstanding orders.\n\nUse read_inbox to check messages, query_erp to check stock,\nsubmit_po to order supplies, and end_shift to advance the day.\n'}], 'model': 'openai/gpt-oss-20b:groq', 'max_completion_tokens': 6000, 'temperature': 0.1, 'tool_choice': 'required', 'tools': [{'type': 'function', 'function': {'name': 'read_inbox', 'description': "Read messages from the COMMS PAGER inbox.\n\nArgs:\n    filter: Message filter — 'unread' (default), 'all', or 'flagged'\n\nReturns:\n    Formatted inbox messages as raw text", 'parameters': {'type': 'object', 'properties': {'filter': {'type': 'string', 'description': ''}}, 'required': []}}}, {'type': 'function', 'function': {'name': 'query_erp', 'description': "Query the legacy ERP database.\n\nArgs:\n    table: Table to query — 'inventory', 'expiry', 'pipeline_orders', or 'demand_history'\n    location: Location ID or 'all'. E.g. 'ward_general', 'ward_icu', 'hospital_a'\n    sku: Product SKU or 'all'. E.g. 'B-001', 'IV-500', 'GLOVE-001'\n\nReturns:\n    ASCII table with query results (legacy ERP format)", 'parameters': {'type': 'object', 'properties': {'table': {'type': 'string', 'description': ''}, 'location': {'type': 'string', 'description': ''}, 'sku': {'type': 'string', 'description': ''}}, 'required': ['table']}}}, {'type': 'function', 'function': {'name': 'query_supplier', 'description': 'Query supplier information including current lead times and disruptions.\n\nArgs:\n    supplier_id: Supplier identifier. Check the dashboard for valid supplier IDs.\n\nReturns:\n    Supplier status text including lead times and any active disruptions', 'parameters': {'type': 'object', 'properties': {'supplier_id': {'type': 'string', 'description': ''}}, 'required': ['supplier_id']}}}, {'type': 'function', 'function': {'name': 'query_forecast', 'description': "Get demand forecast for a product at a location.\n\nArgs:\n    product_id: Product SKU to forecast. Use query_erp(table='inventory') to see available SKUs.\n    location_id: Location to forecast for. Use query_erp(table='inventory') to see valid location IDs.\n    horizon_days: Forecast horizon in days (1-21, default 7)\n\nReturns:\n    Forecasted daily demand table", 'parameters': {'type': 'object', 'properties': {'product_id': {'type': 'string', 'description': ''}, 'location_id': {'type': 'string', 'description': ''}, 'horizon_days': {'type': 'integer', 'description': ''}}, 'required': ['product_id', 'location_id']}}}, {'type': 'function', 'function': {'name': 'submit_po', 'description': "Submit a purchase order to a supplier.\n\nArgs:\n    supplier_id: Supplier to order from. Check the dashboard for valid supplier IDs.\n    product_id: Product SKU to order. Use query_erp(table='inventory') to see available SKUs.\n    destination_id: Delivery location. Use query_erp(table='inventory') to see valid location IDs.\n    quantity: Number of units to order (must be positive)\n    priority: 'standard' (default) or 'expedited' (+50% cost, -2 day lead time; requires justification)\n\nReturns:\n    Confirmation with PO ID and ETA, or error if budget/validation fails.\n    For expedited orders: returns BUDGET_OVERRIDE_REQUIRED with a ticket ID.\n    Use file_justification(ticket_id=...) to proceed.", 'parameters': {'type': 'object', 'properties': {'supplier_id': {'type': 'string', 'description': ''}, 'product_id': {'type': 'string', 'description': ''}, 'destination_id': {'type': 'string', 'description': ''}, 'quantity': {'type': 'integer', 'description': ''}, 'priority': {'type': 'string', 'description': ''}}, 'required': ['supplier_id', 'product_id', 'destination_id', 'quantity']}}}, {'type': 'function', 'function': {'name': 'transfer', 'description': "Transfer inventory between locations (small handling fee).\n\nArgs:\n    from_location_id: Source location. Use query_erp(table='inventory') to see valid location IDs.\n    to_location_id: Destination location. Use query_erp(table='inventory') to see valid location IDs.\n    product_id: Product SKU to transfer. Use query_erp(table='inventory') to see available SKUs.\n    quantity: Units to transfer (must not exceed available stock at source)\n\nReturns:\n    Confirmation or error (insufficient stock, capacity exceeded)", 'parameters': {'type': 'object', 'properties': {'from_location_id': {'type': 'string', 'description': ''}, 'to_location_id': {'type': 'string', 'description': ''}, 'product_id': {'type': 'string', 'description': ''}, 'quantity': {'type': 'integer', 'description': ''}}, 'required': ['from_location_id', 'to_location_id', 'product_id', 'quantity']}}}, {'type': 'function', 'function': {'name': 'quarantine_lot', 'description': "Quarantine a specific inventory lot (e.g. for product recalls or cold chain breaches).\nQuarantined lots are excluded from demand fulfillment.\n\nArgs:\n    location_id: Location where the lot is stored\n    sku: Product SKU of the lot\n    lot_id: Lot identifier (from inventory query). Use 'all' to quarantine all lots of this SKU at this location.\n\nReturns:\n    Confirmation with quarantine details and disposal ticket ID", 'parameters': {'type': 'object', 'properties': {'location_id': {'type': 'string', 'description': ''}, 'sku': {'type': 'string', 'description': ''}, 'lot_id': {'type': 'string', 'description': ''}}, 'required': ['location_id', 'sku', 'lot_id']}}}, {'type': 'function', 'function': {'name': 'file_justification', 'description': "File a budget override justification to proceed with an expedited order.\n\nRequired after submit_po returns BUDGET_OVERRIDE_REQUIRED.\nThe justification is audited by Finance — it must reference the current clinical situation.\nA false or incoherent justification is flagged and results in a scoring penalty.\n\nArgs:\n    ticket_id: The ticket ID from the BUDGET_OVERRIDE_REQUIRED error (e.g. 'BOT-0002')\n    reason: Free-text justification. Must clearly reference the reason for urgency.\n\nReturns:\n    OK confirmation and PO submission details, or FLAGGED audit warning", 'parameters': {'type': 'object', 'properties': {'ticket_id': {'type': 'string', 'description': ''}, 'reason': {'type': 'string', 'description': ''}}, 'required': ['ticket_id', 'reason']}}}, {'type': 'function', 'function': {'name': 'end_shift', 'description': 'End the current shift and advance the simulation by one day.\n\nCommits all pending decisions. Simulates demand, deliveries, and expiry for the day.\nResets your action budget for the next shift.\nUnspent actions are lost — no rollover.\n\nReturns:\n    Day summary report + next shift dashboard', 'parameters': {'type': 'object', 'properties': {}, 'required': []}}}]}}
+01:51:54 [DEBUG] Sending HTTP Request: POST https://router.huggingface.co/v1/chat/completions
+01:51:54 [DEBUG] send_request_headers.started request=<Request [b'POST']>
+01:51:54 [DEBUG] send_request_headers.complete
+01:51:54 [DEBUG] send_request_body.started request=<Request [b'POST']>
+01:51:54 [DEBUG] send_request_body.complete
+01:51:54 [DEBUG] receive_response_headers.started request=<Request [b'POST']>
+01:51:55 [DEBUG] receive_response_headers.complete return_value=(b'HTTP/1.1', 200, b'OK', [(b'Content-Type', b'application/json'), (b'Transfer-Encoding', b'chunked'), (b'Connection', b'keep-alive'), (b'Date', b'Thu, 02 Apr 2026 20:21:55 GMT'), (b'x-ratelimit-reset-requests', b'60ms'), (b'x-ratelimit-reset-tokens', b'244ms'), (b'X-Powered-By', b'huggingface-moon'), (b'x-request-id', b'req_01kn7xr3eze10saazav3qq3avr'), (b'cross-origin-opener-policy', b'same-origin'), (b'Referrer-Policy', b'strict-origin-when-cross-origin'), (b'vary', b'Origin'), (b'Access-Control-Allow-Origin', b'*'), (b'Access-Control-Expose-Headers', b'X-Repo-Commit,X-Request-Id,X-Error-Code,X-Error-Message,X-Total-Count,ETag,Link,Accept-Ranges,Content-Range,X-Linked-Size,X-Linked-ETag,X-Xet-Hash'), (b'X-Robots-Tag', b'none'), (b'x-inference-provider', b'groq'), (b'cache-control', b'private, max-age=0, no-store, no-cache, must-revalidate'), (b'cf-cache-status', b'DYNAMIC'), (b'cf-ray', b'9e628ae939d838b3-IAD'), (b'server', b'cloudflare'), (b'set-cookie', b'__cf_bm=dBrvfDghu6i3ulHi3vL4dcahvKG84d.NJza2A7k.naI-1775161314.7556033-1.0.1.1-UPaGIuRInFzY_8EW8sUq7zBvVsbkkqOvo3mEBMwIh1VHd1MUMM.mZaZIM8rYH_sDKwHI5XzOqeMmvxjIIvF4iCClMg3tRiVcAWUIFPecnhLf7plP.bbH0kheaUYHsL7z; HttpOnly; Secure; Path=/; Domain=groq.com; Expires=Thu, 02 Apr 2026 20:51:55 GMT'), (b'strict-transport-security', b'max-age=15552000'), (b'x-groq-region', b'yul'), (b'x-ratelimit-limit-requests', b'1440000'), (b'x-ratelimit-limit-tokens', b'750000'), (b'x-ratelimit-remaining-requests', b'1439999'), (b'x-ratelimit-remaining-tokens', b'746947'), (b'X-Cache', b'Miss from cloudfront'), (b'Via', b'1.1 61abf3889106f07b08b458478f6aee64.cloudfront.net (CloudFront)'), (b'X-Amz-Cf-Pop', b'BLR50-P4'), (b'X-Amz-Cf-Id', b'G8Q1q3n2mQUXYa12dQp0QZxz7lhl-M_7TOJJZ16T0sy44b8Mpz3ERw==')])
+01:51:55 [INFO] HTTP Request: POST https://router.huggingface.co/v1/chat/completions "HTTP/1.1 200 OK"
+01:51:55 [DEBUG] receive_response_body.started request=<Request [b'POST']>
+01:51:55 [DEBUG] receive_response_body.complete
+01:51:55 [DEBUG] response_closed.started
+01:51:55 [DEBUG] response_closed.complete
+01:51:55 [DEBUG] HTTP Response: POST https://router.huggingface.co/v1/chat/completions "200 OK" Headers({'content-type': 'application/json', 'transfer-encoding': 'chunked', 'connection': 'keep-alive', 'date': 'Thu, 02 Apr 2026 20:21:55 GMT', 'x-ratelimit-reset-requests': '60ms', 'x-ratelimit-reset-tokens': '244ms', 'x-powered-by': 'huggingface-moon', 'x-request-id': 'req_01kn7xr3eze10saazav3qq3avr', 'cross-origin-opener-policy': 'same-origin', 'referrer-policy': 'strict-origin-when-cross-origin', 'vary': 'Origin', 'access-control-allow-origin': '*', 'access-control-expose-headers': 'X-Repo-Commit,X-Request-Id,X-Error-Code,X-Error-Message,X-Total-Count,ETag,Link,Accept-Ranges,Content-Range,X-Linked-Size,X-Linked-ETag,X-Xet-Hash', 'x-robots-tag': 'none', 'x-inference-provider': 'groq', 'cache-control': 'private, max-age=0, no-store, no-cache, must-revalidate', 'cf-cache-status': 'DYNAMIC', 'cf-ray': '9e628ae939d838b3-IAD', 'server': 'cloudflare', 'set-cookie': '__cf_bm=dBrvfDghu6i3ulHi3vL4dcahvKG84d.NJza2A7k.naI-1775161314.7556033-1.0.1.1-UPaGIuRInFzY_8EW8sUq7zBvVsbkkqOvo3mEBMwIh1VHd1MUMM.mZaZIM8rYH_sDKwHI5XzOqeMmvxjIIvF4iCClMg3tRiVcAWUIFPecnhLf7plP.bbH0kheaUYHsL7z; HttpOnly; Secure; Path=/; Domain=groq.com; Expires=Thu, 02 Apr 2026 20:51:55 GMT', 'strict-transport-security': 'max-age=15552000', 'x-groq-region': 'yul', 'x-ratelimit-limit-requests': '1440000', 'x-ratelimit-limit-tokens': '750000', 'x-ratelimit-remaining-requests': '1439999', 'x-ratelimit-remaining-tokens': '746947', 'x-cache': 'Miss from cloudfront', 'via': '1.1 61abf3889106f07b08b458478f6aee64.cloudfront.net (CloudFront)', 'x-amz-cf-pop': 'BLR50-P4', 'x-amz-cf-id': 'G8Q1q3n2mQUXYa12dQp0QZxz7lhl-M_7TOJJZ16T0sy44b8Mpz3ERw=='})
+01:51:55 [DEBUG] request_id: req_01kn7xr3eze10saazav3qq3avr
+01:51:55 [DEBUG] [orientation_ward] Step 2 — finish_reason=tool_calls tool_calls=1
+01:51:55 [DEBUG] [orientation_ward] Step 2 — calling query_erp({'location': 'all', 'sku': 'all', 'table': 'inventory'})
+01:51:55 [DEBUG] > TEXT '{"type": "step", "data": {"type": "call_tool", ..."table": "inventory"}}}' [143 bytes]
+01:51:55 [DEBUG] < TEXT '{"type":"observation","data":{"observation":{"t...rd":0.01,"done":false}}' [841 bytes]
+01:51:57 [INFO] [orientation_ward] Step 3/150 — 6 messages in context
+01:51:57 [DEBUG] Request options: {'method': 'post', 'url': '/chat/completions', 'files': None, 'idempotency_key': 'stainless-python-retry-38cd9502-480f-4ac6-a7d5-67b52541d9bb', 'content': None, 'json_data': {'messages': [{'role': 'system', 'content': 'You are an experienced hospital supply chain manager operating a legacy ERP system.\nYour goal is to maintain adequate medical supplies across all locations while controlling costs.\n\nCRITICAL — ACTION BUDGET: You have a strictly limited number of actions per shift.\nBudget does NOT roll over. Unspent actions are lost at end_shift().\n\nRecommended budget allocation (highest priority first):\n  1. read_inbox()                         — ALWAYS do this first to catch urgent alerts\n  2. query_erp(table=\'inventory\')         — check current stock levels across all locations\n  3. submit_po(...)                        — place orders for items below safety stock (PRIORITY)\n  4. end_shift()                           — call this when budget is exhausted OR tasks are done\n\nQuery tools (query_erp expiry/pipeline, query_forecast, query_supplier) are LOW PRIORITY.\nOnly use them if you have budget remaining AFTER placing critical orders.\n\nMANDATORY RULES:\n- If you receive "Action budget exhausted" → call end_shift() as your VERY NEXT action.\n  Do NOT call any other tool. The budget cannot be restored until end_shift() is called.\n- Order early: factor in lead times. If lead time is 2 days, order today to avoid stockout in 2 days.\n- Expedited orders require file_justification(ticket_id=...) with a real clinical reason.\n- FEFO: oldest stock consumed first — check expiry and rotate perishables proactively.\n- Recalls: quarantine the recalled lot immediately, then order a replacement.\n- MCI events: pre-emptive ordering beats reactive ordering. Order extra blood/critical supplies NOW.\n\nSafety stock target: aim for at least (lead_time + 1) × daily_demand units on hand.\n\nWhen calling tools, use the EXACT parameter names shown in the tool descriptions.\n'}, {'role': 'user', 'content': 'Your shift has started. Current dashboard:\n\n╔════════════════════════════════════════════════════════════════════╗\n║   MEDSUPPLY ERP v2.1 — CENTRAL HOSPITAL NETWORK                    ║\n║   Task: orientation_ward | Shift: Day 1 of 2                       ║\n║   Actions remaining: 5/5                                           ║\n║   Budget used: $0 / $5,000                                         ║\n╠════════════════════════════════════════════════════════════════════╣\n║   [!] COMMS PAGER: 1 unread message(s)                             ║\n║   [·] INVDB: No expiry warnings                                    ║\n║   [·] PROCURENET: 0 order(s) in transit                            ║\n╠════════════════════════════════════════════════════════════════════╣\n║   SUPPLIERS (use exact IDs below):                                 ║\n║   MEDLINE → GLOVE-001, SYR-10, MASK-001                            ║\n╚════════════════════════════════════════════════════════════════════╝\nAwaiting input.\nAvailable tools: read_inbox, query_erp, query_supplier, query_forecast, submit_po, transfer, quarantine_lot, file_justification, end_shift'}, {'role': 'assistant', 'content': None, 'tool_calls': [{'id': 'fc_edbe312b-e6c1-4594-9789-a9b65675543d', 'type': 'function', 'function': {'name': 'read_inbox', 'arguments': '{"filter": "unread"}'}}]}, {'role': 'tool', 'tool_call_id': 'fc_edbe312b-e6c1-4594-9789-a9b65675543d', 'content': '\n[MSG MSG-0001 | READ | PRIORITY: LOW | Day 1 08:00]\nFROM: System\nSUBJ: Shift Handover Notes\n\nWelcome to the orientation_ward scenario.\nYou are managing medical supplies for 2 days.\nAction budget: 5 actions per shift.\nBudget ceiling: $5,000 outstanding orders.\n\nUse read_inbox to check messages, query_erp to check stock,\nsubmit_po to order supplies, and end_shift to advance the day.\n'}, {'role': 'assistant', 'content': None, 'tool_calls': [{'id': 'fc_d42e8acc-f5ce-43e5-98b1-0b7628ad4eea', 'type': 'function', 'function': {'name': 'query_erp', 'arguments': '{"location": "all", "sku": "all", "table": "inventory"}'}}]}, {'role': 'tool', 'tool_call_id': 'fc_d42e8acc-f5ce-43e5-98b1-0b7628ad4eea', 'content': 'SYSTEM QUERY RESULT [TABLE: INVENTORY] [LOC: ALL] [SKU: ALL]\n[TIMESTAMP: Day 1]\n----------------------------------------------------------------------\nLOT_ID                 | DESC                     |   QTY | EXP_DAY | DAYS_LEFT | STATUS\n----------------------------------------------------------------------\nINIT-GLOVE-001-ward_general | Surgical Gloves (box)    |    20 | N/A     | N/A       | NON-PERISH\nINIT-MASK-001-ward_general | Surgical Masks           |    25 | N/A     | N/A       | NON-PERISH\nINIT-SYR-10-ward_general | Syringes 10ml            |    30 | N/A     | N/A       | NON-PERISH\n----------------------------------------------------------------------\nQUERY OK | 3 row(s) returned'}], 'model': 'openai/gpt-oss-20b:groq', 'max_completion_tokens': 6000, 'temperature': 0.1, 'tool_choice': 'required', 'tools': [{'type': 'function', 'function': {'name': 'read_inbox', 'description': "Read messages from the COMMS PAGER inbox.\n\nArgs:\n    filter: Message filter — 'unread' (default), 'all', or 'flagged'\n\nReturns:\n    Formatted inbox messages as raw text", 'parameters': {'type': 'object', 'properties': {'filter': {'type': 'string', 'description': ''}}, 'required': []}}}, {'type': 'function', 'function': {'name': 'query_erp', 'description': "Query the legacy ERP database.\n\nArgs:\n    table: Table to query — 'inventory', 'expiry', 'pipeline_orders', or 'demand_history'\n    location: Location ID or 'all'. E.g. 'ward_general', 'ward_icu', 'hospital_a'\n    sku: Product SKU or 'all'. E.g. 'B-001', 'IV-500', 'GLOVE-001'\n\nReturns:\n    ASCII table with query results (legacy ERP format)", 'parameters': {'type': 'object', 'properties': {'table': {'type': 'string', 'description': ''}, 'location': {'type': 'string', 'description': ''}, 'sku': {'type': 'string', 'description': ''}}, 'required': ['table']}}}, {'type': 'function', 'function': {'name': 'query_supplier', 'description': 'Query supplier information including current lead times and disruptions.\n\nArgs:\n    supplier_id: Supplier identifier. Check the dashboard for valid supplier IDs.\n\nReturns:\n    Supplier status text including lead times and any active disruptions', 'parameters': {'type': 'object', 'properties': {'supplier_id': {'type': 'string', 'description': ''}}, 'required': ['supplier_id']}}}, {'type': 'function', 'function': {'name': 'query_forecast', 'description': "Get demand forecast for a product at a location.\n\nArgs:\n    product_id: Product SKU to forecast. Use query_erp(table='inventory') to see available SKUs.\n    location_id: Location to forecast for. Use query_erp(table='inventory') to see valid location IDs.\n    horizon_days: Forecast horizon in days (1-21, default 7)\n\nReturns:\n    Forecasted daily demand table", 'parameters': {'type': 'object', 'properties': {'product_id': {'type': 'string', 'description': ''}, 'location_id': {'type': 'string', 'description': ''}, 'horizon_days': {'type': 'integer', 'description': ''}}, 'required': ['product_id', 'location_id']}}}, {'type': 'function', 'function': {'name': 'submit_po', 'description': "Submit a purchase order to a supplier.\n\nArgs:\n    supplier_id: Supplier to order from. Check the dashboard for valid supplier IDs.\n    product_id: Product SKU to order. Use query_erp(table='inventory') to see available SKUs.\n    destination_id: Delivery location. Use query_erp(table='inventory') to see valid location IDs.\n    quantity: Number of units to order (must be positive)\n    priority: 'standard' (default) or 'expedited' (+50% cost, -2 day lead time; requires justification)\n\nReturns:\n    Confirmation with PO ID and ETA, or error if budget/validation fails.\n    For expedited orders: returns BUDGET_OVERRIDE_REQUIRED with a ticket ID.\n    Use file_justification(ticket_id=...) to proceed.", 'parameters': {'type': 'object', 'properties': {'supplier_id': {'type': 'string', 'description': ''}, 'product_id': {'type': 'string', 'description': ''}, 'destination_id': {'type': 'string', 'description': ''}, 'quantity': {'type': 'integer', 'description': ''}, 'priority': {'type': 'string', 'description': ''}}, 'required': ['supplier_id', 'product_id', 'destination_id', 'quantity']}}}, {'type': 'function', 'function': {'name': 'transfer', 'description': "Transfer inventory between locations (small handling fee).\n\nArgs:\n    from_location_id: Source location. Use query_erp(table='inventory') to see valid location IDs.\n    to_location_id: Destination location. Use query_erp(table='inventory') to see valid location IDs.\n    product_id: Product SKU to transfer. Use query_erp(table='inventory') to see available SKUs.\n    quantity: Units to transfer (must not exceed available stock at source)\n\nReturns:\n    Confirmation or error (insufficient stock, capacity exceeded)", 'parameters': {'type': 'object', 'properties': {'from_location_id': {'type': 'string', 'description': ''}, 'to_location_id': {'type': 'string', 'description': ''}, 'product_id': {'type': 'string', 'description': ''}, 'quantity': {'type': 'integer', 'description': ''}}, 'required': ['from_location_id', 'to_location_id', 'product_id', 'quantity']}}}, {'type': 'function', 'function': {'name': 'quarantine_lot', 'description': "Quarantine a specific inventory lot (e.g. for product recalls or cold chain breaches).\nQuarantined lots are excluded from demand fulfillment.\n\nArgs:\n    location_id: Location where the lot is stored\n    sku: Product SKU of the lot\n    lot_id: Lot identifier (from inventory query). Use 'all' to quarantine all lots of this SKU at this location.\n\nReturns:\n    Confirmation with quarantine details and disposal ticket ID", 'parameters': {'type': 'object', 'properties': {'location_id': {'type': 'string', 'description': ''}, 'sku': {'type': 'string', 'description': ''}, 'lot_id': {'type': 'string', 'description': ''}}, 'required': ['location_id', 'sku', 'lot_id']}}}, {'type': 'function', 'function': {'name': 'file_justification', 'description': "File a budget override justification to proceed with an expedited order.\n\nRequired after submit_po returns BUDGET_OVERRIDE_REQUIRED.\nThe justification is audited by Finance — it must reference the current clinical situation.\nA false or incoherent justification is flagged and results in a scoring penalty.\n\nArgs:\n    ticket_id: The ticket ID from the BUDGET_OVERRIDE_REQUIRED error (e.g. 'BOT-0002')\n    reason: Free-text justification. Must clearly reference the reason for urgency.\n\nReturns:\n    OK confirmation and PO submission details, or FLAGGED audit warning", 'parameters': {'type': 'object', 'properties': {'ticket_id': {'type': 'string', 'description': ''}, 'reason': {'type': 'string', 'description': ''}}, 'required': ['ticket_id', 'reason']}}}, {'type': 'function', 'function': {'name': 'end_shift', 'description': 'End the current shift and advance the simulation by one day.\n\nCommits all pending decisions. Simulates demand, deliveries, and expiry for the day.\nResets your action budget for the next shift.\nUnspent actions are lost — no rollover.\n\nReturns:\n    Day summary report + next shift dashboard', 'parameters': {'type': 'object', 'properties': {}, 'required': []}}}]}}
+01:51:57 [DEBUG] Sending HTTP Request: POST https://router.huggingface.co/v1/chat/completions
+01:51:57 [DEBUG] send_request_headers.started request=<Request [b'POST']>
+01:51:57 [DEBUG] send_request_headers.complete
+01:51:57 [DEBUG] send_request_body.started request=<Request [b'POST']>
+01:51:57 [DEBUG] send_request_body.complete
+01:51:57 [DEBUG] receive_response_headers.started request=<Request [b'POST']>
+01:51:57 [DEBUG] receive_response_headers.complete return_value=(b'HTTP/1.1', 200, b'OK', [(b'Content-Type', b'application/json'), (b'Transfer-Encoding', b'chunked'), (b'Connection', b'keep-alive'), (b'Date', b'Thu, 02 Apr 2026 20:21:57 GMT'), (b'x-ratelimit-reset-requests', b'60ms'), (b'x-ratelimit-reset-tokens', b'264ms'), (b'X-Powered-By', b'huggingface-moon'), (b'x-request-id', b'req_01kn7xr62yewy8ecn3ny8pg0fn'), (b'cross-origin-opener-policy', b'same-origin'), (b'Referrer-Policy', b'strict-origin-when-cross-origin'), (b'vary', b'Origin'), (b'Access-Control-Allow-Origin', b'*'), (b'Access-Control-Expose-Headers', b'X-Repo-Commit,X-Request-Id,X-Error-Code,X-Error-Message,X-Total-Count,ETag,Link,Accept-Ranges,Content-Range,X-Linked-Size,X-Linked-ETag,X-Xet-Hash'), (b'X-Robots-Tag', b'none'), (b'x-inference-provider', b'groq'), (b'cache-control', b'private, max-age=0, no-store, no-cache, must-revalidate'), (b'cf-cache-status', b'DYNAMIC'), (b'cf-ray', b'9e628af9fba039af-IAD'), (b'server', b'cloudflare'), (b'set-cookie', b'__cf_bm=0jcgEATn_f0m4unTeLno4ew9Qwn8j3dMp.nuSh_qJK4-1775161317.4404798-1.0.1.1-2acGuYN_YTY9DrrprC_g1GQgUQ8I1BpCaSQMTac_xy.GvlOEZEYhnvLVKIFHInJivrQHLwVUl1S9nnrghqwUuYEHc73.IIwYyPJVW0AYxMo4Rmlsr6MeQnGsJUA7J4pc; HttpOnly; Secure; Path=/; Domain=groq.com; Expires=Thu, 02 Apr 2026 20:51:57 GMT'), (b'strict-transport-security', b'max-age=15552000'), (b'x-groq-region', b'yul'), (b'x-ratelimit-limit-requests', b'1440000'), (b'x-ratelimit-limit-tokens', b'750000'), (b'x-ratelimit-remaining-requests', b'1439999'), (b'x-ratelimit-remaining-tokens', b'746698'), (b'X-Cache', b'Miss from cloudfront'), (b'Via', b'1.1 61abf3889106f07b08b458478f6aee64.cloudfront.net (CloudFront)'), (b'X-Amz-Cf-Pop', b'BLR50-P4'), (b'X-Amz-Cf-Id', b'VMt-_rDlzqV5Ubh18n1mtpcxqIoAemFVDIvVFlwhokqfcAKhbfu5oA==')])
+01:51:57 [INFO] HTTP Request: POST https://router.huggingface.co/v1/chat/completions "HTTP/1.1 200 OK"
+01:51:57 [DEBUG] receive_response_body.started request=<Request [b'POST']>
+01:51:57 [DEBUG] receive_response_body.complete
+01:51:57 [DEBUG] response_closed.started
+01:51:57 [DEBUG] response_closed.complete
+01:51:57 [DEBUG] HTTP Response: POST https://router.huggingface.co/v1/chat/completions "200 OK" Headers({'content-type': 'application/json', 'transfer-encoding': 'chunked', 'connection': 'keep-alive', 'date': 'Thu, 02 Apr 2026 20:21:57 GMT', 'x-ratelimit-reset-requests': '60ms', 'x-ratelimit-reset-tokens': '264ms', 'x-powered-by': 'huggingface-moon', 'x-request-id': 'req_01kn7xr62yewy8ecn3ny8pg0fn', 'cross-origin-opener-policy': 'same-origin', 'referrer-policy': 'strict-origin-when-cross-origin', 'vary': 'Origin', 'access-control-allow-origin': '*', 'access-control-expose-headers': 'X-Repo-Commit,X-Request-Id,X-Error-Code,X-Error-Message,X-Total-Count,ETag,Link,Accept-Ranges,Content-Range,X-Linked-Size,X-Linked-ETag,X-Xet-Hash', 'x-robots-tag': 'none', 'x-inference-provider': 'groq', 'cache-control': 'private, max-age=0, no-store, no-cache, must-revalidate', 'cf-cache-status': 'DYNAMIC', 'cf-ray': '9e628af9fba039af-IAD', 'server': 'cloudflare', 'set-cookie': '__cf_bm=0jcgEATn_f0m4unTeLno4ew9Qwn8j3dMp.nuSh_qJK4-1775161317.4404798-1.0.1.1-2acGuYN_YTY9DrrprC_g1GQgUQ8I1BpCaSQMTac_xy.GvlOEZEYhnvLVKIFHInJivrQHLwVUl1S9nnrghqwUuYEHc73.IIwYyPJVW0AYxMo4Rmlsr6MeQnGsJUA7J4pc; HttpOnly; Secure; Path=/; Domain=groq.com; Expires=Thu, 02 Apr 2026 20:51:57 GMT', 'strict-transport-security': 'max-age=15552000', 'x-groq-region': 'yul', 'x-ratelimit-limit-requests': '1440000', 'x-ratelimit-limit-tokens': '750000', 'x-ratelimit-remaining-requests': '1439999', 'x-ratelimit-remaining-tokens': '746698', 'x-cache': 'Miss from cloudfront', 'via': '1.1 61abf3889106f07b08b458478f6aee64.cloudfront.net (CloudFront)', 'x-amz-cf-pop': 'BLR50-P4', 'x-amz-cf-id': 'VMt-_rDlzqV5Ubh18n1mtpcxqIoAemFVDIvVFlwhokqfcAKhbfu5oA=='})
+01:51:57 [DEBUG] request_id: req_01kn7xr62yewy8ecn3ny8pg0fn
+01:51:57 [DEBUG] [orientation_ward] Step 3 — finish_reason=tool_calls tool_calls=1
+01:51:57 [DEBUG] [orientation_ward] Step 3 — calling query_supplier({'supplier_id': 'MEDLINE'})
+01:51:57 [DEBUG] > TEXT '{"type": "step", "data": {"type": "call_tool", ...plier_id": "MEDLINE"}}}' [119 bytes]
+01:51:57 [DEBUG] < TEXT '{"type":"observation","data":{"observation":{"t...ard":0.0,"done":false}}' [533 bytes]
+01:51:59 [DEBUG] > TEXT '{"type": "close"}' [17 bytes]
+01:51:59 [DEBUG] > CLOSE 1000 (OK) [2 bytes]
+01:51:59 [DEBUG] = connection is CLOSING
+01:51:59 [DEBUG] < CLOSE 1000 (OK) [2 bytes]
+01:51:59 [DEBUG] < EOF
+01:51:59 [DEBUG] > EOF
+01:51:59 [DEBUG] = connection is CLOSED
+01:51:59 [DEBUG] x half-closing TCP connection
+01:51:59 [DEBUG] close.started
+01:51:59 [DEBUG] close.complete

logs/inference_20260403_015354.log ADDED Viewed

	@@ -0,0 +1,69 @@

+01:53:54 [INFO] Starting. API_BASE_URL=https://router.huggingface.co/v1 MODEL_NAME=openai/gpt-oss-20b:groq
+01:53:55 [INFO] Launching task: orientation_ward
+01:53:59 [DEBUG] Tool registered: read_inbox (required=[])
+01:53:59 [DEBUG] Tool registered: query_erp (required=['table'])
+01:53:59 [DEBUG] Tool registered: query_supplier (required=['supplier_id'])
+01:53:59 [DEBUG] Tool registered: query_forecast (required=['product_id', 'location_id'])
+01:53:59 [DEBUG] Tool registered: submit_po (required=['supplier_id', 'product_id', 'destination_id', 'quantity'])
+01:53:59 [DEBUG] Tool registered: transfer (required=['from_location_id', 'to_location_id', 'product_id', 'quantity'])
+01:53:59 [DEBUG] Tool registered: quarantine_lot (required=['location_id', 'sku', 'lot_id'])
+01:53:59 [DEBUG] Tool registered: file_justification (required=['ticket_id', 'reason'])
+01:53:59 [DEBUG] Tool registered: end_shift (required=[])
+01:53:59 [INFO] [orientation_ward] 9 tools discovered
+01:53:59 [INFO] [orientation_ward] Episode started. Tools: ['read_inbox', 'query_erp', 'query_supplier', 'query_forecast', 'submit_po', 'transfer', 'quarantine_lot', 'file_justification', 'end_shift']
+01:53:59 [INFO] [orientation_ward] Step 1/150 — 2 messages in context
+01:54:00 [DEBUG] [orientation_ward] Step 1 — finish_reason=tool_calls tool_calls=1
+01:54:00 [DEBUG] [orientation_ward] Step 1 — calling read_inbox({'filter': 'unread'})
+01:54:02 [INFO] [orientation_ward] Step 2/150 — 4 messages in context
+01:54:03 [DEBUG] [orientation_ward] Step 2 — finish_reason=tool_calls tool_calls=1
+01:54:03 [DEBUG] [orientation_ward] Step 2 — calling query_erp({'location': 'all', 'sku': 'all', 'table': 'inventory'})
+01:54:05 [INFO] [orientation_ward] Step 3/150 — 6 messages in context
+01:54:05 [DEBUG] [orientation_ward] Step 3 — finish_reason=tool_calls tool_calls=1
+01:54:05 [DEBUG] [orientation_ward] Step 3 — calling query_supplier({'supplier_id': 'MEDLINE'})
+01:54:07 [INFO] [orientation_ward] Step 4/150 — 8 messages in context
+01:54:12 [DEBUG] [orientation_ward] Step 4 — finish_reason=tool_calls tool_calls=1
+01:54:12 [DEBUG] [orientation_ward] Step 4 — calling query_forecast({'horizon_days': 7, 'location_id': 'ward_general', 'product_id': 'SYR-10'})
+01:54:14 [INFO] [orientation_ward] Step 5/150 — 10 messages in context
+01:54:15 [DEBUG] [orientation_ward] Step 5 — finish_reason=tool_calls tool_calls=1
+01:54:15 [DEBUG] [orientation_ward] Step 5 — calling submit_po({'destination_id': 'ward_general', 'priority': 'standard', 'product_id': 'SYR-10', 'quantity': 60, 'supplier_id': 'MEDLINE'})
+01:54:17 [INFO] [orientation_ward] Step 6/150 — 12 messages in context
+01:54:17 [WARNING] [orientation_ward] Step 6 — BadRequestError (1/5): Error code: 400 - {'error': {'message': 'Failed to parse tool call arguments as JSON', 'type': 'invalid_request_error', 'code': 'tool_use_failed', 'failed_generation': '{"name": "end_shift", "arguments": {""}"}'}}
+01:54:17 [INFO] [orientation_ward] Step 7/150 — 13 messages in context
+01:54:18 [DEBUG] [orientation_ward] Step 7 — finish_reason=tool_calls tool_calls=1
+01:54:18 [DEBUG] [orientation_ward] Step 7 — calling end_shift({})
+01:54:20 [INFO] [orientation_ward] Step 7 — shift 1 ended; pruning context (1 summaries)
+01:54:20 [DEBUG] [orientation_ward] Dropping orphaned leading tool msg (tool_call_id=fc_49c2f8c1-1ba9-4a77-adff-9d3845be0b4b)
+01:54:20 [INFO] [orientation_ward] Step 8/150 — 8 messages in context
+01:54:21 [DEBUG] [orientation_ward] Step 8 — finish_reason=tool_calls tool_calls=1
+01:54:21 [DEBUG] [orientation_ward] Step 8 — calling read_inbox({'filter': 'unread'})
+01:54:23 [INFO] [orientation_ward] Step 9/150 — 10 messages in context
+01:54:23 [DEBUG] [orientation_ward] Step 9 — finish_reason=tool_calls tool_calls=1
+01:54:23 [DEBUG] [orientation_ward] Step 9 — calling query_erp({'location': 'all', 'sku': 'all', 'table': 'inventory'})
+01:54:25 [INFO] [orientation_ward] Step 10/150 — 12 messages in context
+01:54:26 [DEBUG] [orientation_ward] Step 10 — finish_reason=tool_calls tool_calls=1
+01:54:26 [DEBUG] [orientation_ward] Step 10 — calling query_supplier({'supplier_id': 'MEDLINE'})
+01:54:28 [INFO] [orientation_ward] Step 11/150 — 14 messages in context
+01:54:29 [DEBUG] [orientation_ward] Step 11 — finish_reason=tool_calls tool_calls=1
+01:54:29 [DEBUG] [orientation_ward] Step 11 — calling query_forecast({'horizon_days': 7, 'location_id': 'ward_general', 'product_id': 'SYR-10'})
+01:54:31 [INFO] [orientation_ward] Step 12/150 — 16 messages in context
+01:54:32 [DEBUG] [orientation_ward] Step 12 — finish_reason=tool_calls tool_calls=1
+01:54:32 [DEBUG] [orientation_ward] Step 12 — calling end_shift({})
+01:54:32 [INFO] [orientation_ward] Step 12 — episode complete detected
+01:54:34 [INFO] [orientation_ward] Episode finished. steps=12 done=True final_reward=0.7599
+01:54:34 [INFO] [orientation_ward] Task complete: reward=0.7599 steps=12
+01:54:44 [WARNING] MedchainEnv.close() suppressed error during shutdown: Command '['docker', 'stop', '1e65c475a5a595ff160edc4efa0bac62e613c6fc2b4158e62eb8a8ab317bec3f']' timed out after 10 seconds
+01:54:44 [INFO] Launching task: single_ward_stable
+01:54:49 [DEBUG] Tool registered: read_inbox (required=[])
+01:54:49 [DEBUG] Tool registered: query_erp (required=['table'])
+01:54:49 [DEBUG] Tool registered: query_supplier (required=['supplier_id'])
+01:54:49 [DEBUG] Tool registered: query_forecast (required=['product_id', 'location_id'])
+01:54:49 [DEBUG] Tool registered: submit_po (required=['supplier_id', 'product_id', 'destination_id', 'quantity'])
+01:54:49 [DEBUG] Tool registered: transfer (required=['from_location_id', 'to_location_id', 'product_id', 'quantity'])
+01:54:49 [DEBUG] Tool registered: quarantine_lot (required=['location_id', 'sku', 'lot_id'])
+01:54:49 [DEBUG] Tool registered: file_justification (required=['ticket_id', 'reason'])
+01:54:49 [DEBUG] Tool registered: end_shift (required=[])
+01:54:49 [INFO] [single_ward_stable] 9 tools discovered
+01:54:49 [INFO] [single_ward_stable] Episode started. Tools: ['read_inbox', 'query_erp', 'query_supplier', 'query_forecast', 'submit_po', 'transfer', 'quarantine_lot', 'file_justification', 'end_shift']
+01:54:49 [INFO] [single_ward_stable] Step 1/150 — 2 messages in context
+01:54:49 [DEBUG] [single_ward_stable] Step 1 — finish_reason=tool_calls tool_calls=1
+01:54:49 [DEBUG] [single_ward_stable] Step 1 — calling read_inbox({'filter': 'unread'})

logs/inference_20260403_015503.log ADDED Viewed

	@@ -0,0 +1,58 @@

+01:55:03 [INFO] Starting. API_BASE_URL=https://router.huggingface.co/v1 MODEL_NAME=openai/gpt-oss-20b:groq
+01:55:03 [INFO] Launching task: orientation_ward
+01:55:08 [DEBUG] Tool registered: read_inbox (required=[])
+01:55:08 [DEBUG] Tool registered: query_erp (required=['table'])
+01:55:08 [DEBUG] Tool registered: query_supplier (required=['supplier_id'])
+01:55:08 [DEBUG] Tool registered: query_forecast (required=['product_id', 'location_id'])
+01:55:08 [DEBUG] Tool registered: submit_po (required=['supplier_id', 'product_id', 'destination_id', 'quantity'])
+01:55:08 [DEBUG] Tool registered: transfer (required=['from_location_id', 'to_location_id', 'product_id', 'quantity'])
+01:55:08 [DEBUG] Tool registered: quarantine_lot (required=['location_id', 'sku', 'lot_id'])
+01:55:08 [DEBUG] Tool registered: file_justification (required=['ticket_id', 'reason'])
+01:55:08 [DEBUG] Tool registered: end_shift (required=[])
+01:55:08 [INFO] [orientation_ward] 9 tools discovered
+01:55:08 [INFO] [orientation_ward] Episode started. Tools: ['read_inbox', 'query_erp', 'query_supplier', 'query_forecast', 'submit_po', 'transfer', 'quarantine_lot', 'file_justification', 'end_shift']
+01:55:08 [INFO] [orientation_ward] Step 1/150 — 2 messages in context
+01:55:09 [DEBUG] [orientation_ward] Step 1 — finish_reason=tool_calls tool_calls=1
+01:55:09 [DEBUG] [orientation_ward] Step 1 — calling read_inbox({'filter': 'unread'})
+01:55:11 [INFO] [orientation_ward] Step 2/150 — 4 messages in context
+01:55:11 [DEBUG] [orientation_ward] Step 2 — finish_reason=tool_calls tool_calls=1
+01:55:11 [DEBUG] [orientation_ward] Step 2 — calling query_erp({'location': 'all', 'sku': 'all', 'table': 'inventory'})
+01:55:13 [INFO] [orientation_ward] Step 3/150 — 6 messages in context
+01:55:14 [DEBUG] [orientation_ward] Step 3 — finish_reason=tool_calls tool_calls=1
+01:55:14 [DEBUG] [orientation_ward] Step 3 — calling query_forecast({'horizon_days': 7, 'location_id': 'ward_general', 'product_id': 'GLOVE-001'})
+01:55:16 [INFO] [orientation_ward] Step 4/150 — 8 messages in context
+01:55:17 [DEBUG] [orientation_ward] Step 4 — finish_reason=tool_calls tool_calls=1
+01:55:17 [DEBUG] [orientation_ward] Step 4 — calling query_supplier({'supplier_id': 'MEDLINE'})
+01:55:19 [INFO] [orientation_ward] Step 5/150 — 10 messages in context
+01:55:20 [DEBUG] [orientation_ward] Step 5 — finish_reason=tool_calls tool_calls=1
+01:55:20 [DEBUG] [orientation_ward] Step 5 — calling submit_po({'destination_id': 'ward_general', 'priority': 'standard', 'product_id': 'GLOVE-001', 'quantity': 20, 'supplier_id': 'MEDLINE'})
+01:55:22 [INFO] [orientation_ward] Step 6/150 — 12 messages in context
+01:55:23 [WARNING] [orientation_ward] Step 6 — BadRequestError (1/5): Error code: 400 - {'error': {'message': 'Failed to parse tool call arguments as JSON', 'type': 'invalid_request_error', 'code': 'tool_use_failed', 'failed_generation': '{"name": "end_shift", "arguments": {""}"}'}}
+01:55:23 [INFO] [orientation_ward] Step 7/150 — 13 messages in context
+01:55:24 [DEBUG] [orientation_ward] Step 7 — finish_reason=tool_calls tool_calls=1
+01:55:24 [DEBUG] [orientation_ward] Step 7 — calling end_shift({})
+01:55:26 [INFO] [orientation_ward] Step 7 — shift 1 ended; pruning context (1 summaries)
+01:55:26 [DEBUG] [orientation_ward] Dropping orphaned leading tool msg (tool_call_id=fc_8153125b-ba62-49db-ae31-d9170c8a07ab)
+01:55:26 [INFO] [orientation_ward] Step 8/150 — 8 messages in context
+01:55:27 [DEBUG] [orientation_ward] Step 8 — finish_reason=tool_calls tool_calls=1
+01:55:27 [DEBUG] [orientation_ward] Step 8 — calling query_erp({'location': 'all', 'sku': 'all', 'table': 'inventory'})
+01:55:29 [INFO] [orientation_ward] Step 9/150 — 10 messages in context
+01:55:29 [DEBUG] [orientation_ward] Step 9 — finish_reason=tool_calls tool_calls=1
+01:55:29 [DEBUG] [orientation_ward] Step 9 — calling query_forecast({'horizon_days': 7, 'location_id': 'ward_general', 'product_id': 'GLOVE-001'})
+01:55:31 [INFO] [orientation_ward] Step 10/150 — 12 messages in context
+01:55:32 [DEBUG] [orientation_ward] Step 10 — finish_reason=tool_calls tool_calls=1
+01:55:32 [DEBUG] [orientation_ward] Step 10 — calling query_supplier({'supplier_id': 'MEDLINE'})
+01:55:34 [INFO] [orientation_ward] Step 11/150 — 14 messages in context
+01:55:35 [DEBUG] [orientation_ward] Step 11 — finish_reason=tool_calls tool_calls=1
+01:55:35 [DEBUG] [orientation_ward] Step 11 — calling submit_po({'destination_id': 'ward_general', 'priority': 'standard', 'product_id': 'GLOVE-001', 'quantity': 40, 'supplier_id': 'MEDLINE'})
+01:55:37 [INFO] [orientation_ward] Step 12/150 — 16 messages in context
+01:55:39 [DEBUG] [orientation_ward] Step 12 — finish_reason=tool_calls tool_calls=1
+01:55:39 [DEBUG] [orientation_ward] Step 12 — calling submit_po({'destination_id': 'ward_general', 'priority': 'standard', 'product_id': 'GLOVE-001', 'quantity': 100, 'supplier_id': 'MEDLINE'})
+01:55:41 [INFO] [orientation_ward] Step 13/150 — 18 messages in context
+01:55:41 [DEBUG] [orientation_ward] Step 13 — finish_reason=tool_calls tool_calls=1
+01:55:41 [DEBUG] [orientation_ward] Step 13 — calling end_shift({})
+01:55:41 [INFO] [orientation_ward] Step 13 — episode complete detected
+01:55:44 [INFO] [orientation_ward] Episode finished. steps=13 done=True final_reward=0.7854
+01:55:44 [INFO] [orientation_ward] Task complete: reward=0.7854 steps=13
+01:55:54 [WARNING] MedchainEnv.close() suppressed error during shutdown: Command '['docker', 'stop', 'b667c7cb1b5aab1043d379f2f81357cefda81aab0068de49a8664b9c859c67f5']' timed out after 10 seconds
+01:55:54 [INFO] All tasks complete. avg_reward=0.7854

logs/inference_20260403_020521.log ADDED Viewed

	@@ -0,0 +1,147 @@

+02:05:21 [INFO] Starting. API_BASE_URL=https://router.huggingface.co/v1 MODEL_NAME=openai/gpt-oss-20b:groq
+02:05:21 [INFO] Launching task: orientation_ward
+02:05:25 [DEBUG] Tool registered: read_inbox (required=[])
+02:05:25 [DEBUG] Tool registered: query_erp (required=['table'])
+02:05:25 [DEBUG] Tool registered: query_supplier (required=['supplier_id'])
+02:05:25 [DEBUG] Tool registered: query_forecast (required=['product_id', 'location_id'])
+02:05:25 [DEBUG] Tool registered: submit_po (required=['supplier_id', 'product_id', 'destination_id', 'quantity'])
+02:05:25 [DEBUG] Tool registered: transfer (required=['from_location_id', 'to_location_id', 'product_id', 'quantity'])
+02:05:25 [DEBUG] Tool registered: quarantine_lot (required=['location_id', 'sku', 'lot_id'])
+02:05:25 [DEBUG] Tool registered: file_justification (required=['ticket_id', 'reason'])
+02:05:25 [DEBUG] Tool registered: end_shift (required=[])
+02:05:25 [INFO] [orientation_ward] 9 tools discovered
+02:05:25 [DEBUG] [orientation_ward] Episode started. Tools: ['read_inbox', 'query_erp', 'query_supplier', 'query_forecast', 'submit_po', 'transfer', 'quarantine_lot', 'file_justification', 'end_shift']
+02:05:25 [DEBUG] [orientation_ward] Step 1/150 — 2 messages in context
+02:05:26 [DEBUG] [orientation_ward] Step 1 — finish_reason=tool_calls tool_calls=1
+02:05:26 [DEBUG] [orientation_ward] Step 1 — calling read_inbox({'filter': 'unread'})
+02:05:28 [DEBUG] [orientation_ward] Step 2/150 — 4 messages in context
+02:05:29 [DEBUG] [orientation_ward] Step 2 — finish_reason=tool_calls tool_calls=1
+02:05:29 [DEBUG] [orientation_ward] Step 2 — calling query_erp({'location': 'all', 'sku': 'all', 'table': 'inventory'})
+02:05:31 [DEBUG] [orientation_ward] Step 3/150 — 6 messages in context
+02:05:32 [DEBUG] [orientation_ward] Step 3 — finish_reason=tool_calls tool_calls=1
+02:05:32 [DEBUG] [orientation_ward] Step 3 — calling query_forecast({'horizon_days': 7, 'location_id': 'ward_general', 'product_id': 'GLOVE-001'})
+02:05:34 [DEBUG] [orientation_ward] Step 4/150 — 8 messages in context
+02:05:35 [DEBUG] [orientation_ward] Step 4 — finish_reason=tool_calls tool_calls=1
+02:05:35 [DEBUG] [orientation_ward] Step 4 — calling query_supplier({'supplier_id': 'MEDLINE'})
+02:05:37 [DEBUG] [orientation_ward] Step 5/150 — 10 messages in context
+02:05:38 [DEBUG] [orientation_ward] Step 5 — finish_reason=tool_calls tool_calls=1
+02:05:38 [DEBUG] [orientation_ward] Step 5 — calling submit_po({'destination_id': 'ward_general', 'priority': 'standard', 'product_id': 'GLOVE-001', 'quantity': 160, 'supplier_id': 'MEDLINE'})
+02:05:40 [DEBUG] [orientation_ward] Step 6/150 — 12 messages in context
+02:05:41 [WARNING] [orientation_ward] Step 6 — BadRequestError (1/5): Error code: 400 - {'error': {'message': 'Failed to parse tool call arguments as JSON', 'type': 'invalid_request_error', 'code': 'tool_use_failed', 'failed_generation': '{"name": "end_shift", "arguments": {""}"}'}}
+02:05:41 [DEBUG] [orientation_ward] Step 7/150 — 13 messages in context
+02:05:42 [DEBUG] [orientation_ward] Step 7 — finish_reason=tool_calls tool_calls=1
+02:05:42 [DEBUG] [orientation_ward] Step 7 — calling end_shift({})
+02:05:44 [INFO] [orientation_ward] Step 7 — shift 1 ended; pruning context (1 summaries)
+02:05:44 [DEBUG] [orientation_ward] Dropping orphaned leading tool msg (tool_call_id=fc_4f2efe5a-4060-4444-9e31-d6fe229ccdd1)
+02:05:44 [DEBUG] [orientation_ward] Step 8/150 — 8 messages in context
+02:05:44 [DEBUG] [orientation_ward] Step 8 — finish_reason=tool_calls tool_calls=1
+02:05:44 [DEBUG] [orientation_ward] Step 8 — calling read_inbox({'filter': 'unread'})
+02:05:46 [DEBUG] [orientation_ward] Step 9/150 — 10 messages in context
+02:05:47 [DEBUG] [orientation_ward] Step 9 — finish_reason=tool_calls tool_calls=1
+02:05:47 [DEBUG] [orientation_ward] Step 9 — calling query_erp({'location': 'all', 'sku': 'all', 'table': 'inventory'})
+02:05:49 [DEBUG] [orientation_ward] Step 10/150 — 12 messages in context
+02:05:50 [DEBUG] [orientation_ward] Step 10 — finish_reason=tool_calls tool_calls=1
+02:05:50 [DEBUG] [orientation_ward] Step 10 — calling submit_po({'destination_id': 'ward_general', 'priority': 'standard', 'product_id': 'GLOVE-001', 'quantity': 200, 'supplier_id': 'MEDLINE'})
+02:05:52 [DEBUG] [orientation_ward] Step 11/150 — 14 messages in context
+02:05:53 [DEBUG] [orientation_ward] Step 11 — finish_reason=tool_calls tool_calls=1
+02:05:53 [DEBUG] [orientation_ward] Step 11 — calling query_erp({'location': 'all', 'sku': 'GLOVE-001', 'table': 'inventory'})
+02:05:55 [DEBUG] [orientation_ward] Step 12/150 — 16 messages in context
+02:05:56 [DEBUG] [orientation_ward] Step 12 — finish_reason=tool_calls tool_calls=1
+02:05:56 [DEBUG] [orientation_ward] Step 12 — calling query_supplier({'supplier_id': 'MEDLINE'})
+02:05:58 [DEBUG] [orientation_ward] Step 13/150 — 18 messages in context
+02:06:00 [DEBUG] [orientation_ward] Step 13 — finish_reason=tool_calls tool_calls=1
+02:06:00 [DEBUG] [orientation_ward] Step 13 — calling end_shift({})
+02:06:00 [INFO] [orientation_ward] Step 13 — episode complete detected
+02:06:02 [INFO] [orientation_ward] Episode finished. steps=13 done=True final_reward=0.8007
+02:06:02 [INFO] [orientation_ward] Task complete: reward=0.8007 steps=13
+02:06:12 [WARNING] MedchainEnv.close() suppressed error during shutdown: Command '['docker', 'stop', '5ef373ff9b040d1a9b9a5d25924f54fff0443fee3d0abd338750f7a0469bde69']' timed out after 10 seconds
+02:06:12 [INFO] Launching task: single_ward_stable
+02:06:17 [DEBUG] Tool registered: read_inbox (required=[])
+02:06:17 [DEBUG] Tool registered: query_erp (required=['table'])
+02:06:17 [DEBUG] Tool registered: query_supplier (required=['supplier_id'])
+02:06:17 [DEBUG] Tool registered: query_forecast (required=['product_id', 'location_id'])
+02:06:17 [DEBUG] Tool registered: submit_po (required=['supplier_id', 'product_id', 'destination_id', 'quantity'])
+02:06:17 [DEBUG] Tool registered: transfer (required=['from_location_id', 'to_location_id', 'product_id', 'quantity'])
+02:06:17 [DEBUG] Tool registered: quarantine_lot (required=['location_id', 'sku', 'lot_id'])
+02:06:17 [DEBUG] Tool registered: file_justification (required=['ticket_id', 'reason'])
+02:06:17 [DEBUG] Tool registered: end_shift (required=[])
+02:06:17 [INFO] [single_ward_stable] 9 tools discovered
+02:06:17 [DEBUG] [single_ward_stable] Episode started. Tools: ['read_inbox', 'query_erp', 'query_supplier', 'query_forecast', 'submit_po', 'transfer', 'quarantine_lot', 'file_justification', 'end_shift']
+02:06:17 [DEBUG] [single_ward_stable] Step 1/150 — 2 messages in context
+02:06:17 [DEBUG] [single_ward_stable] Step 1 — finish_reason=tool_calls tool_calls=1
+02:06:17 [DEBUG] [single_ward_stable] Step 1 — calling read_inbox({'filter': 'unread'})
+02:06:19 [DEBUG] [single_ward_stable] Step 2/150 — 4 messages in context
+02:06:20 [DEBUG] [single_ward_stable] Step 2 — finish_reason=tool_calls tool_calls=1
+02:06:20 [DEBUG] [single_ward_stable] Step 2 — calling query_erp({'location': 'all', 'sku': 'all', 'table': 'inventory'})
+02:06:22 [DEBUG] [single_ward_stable] Step 3/150 — 6 messages in context
+02:06:23 [DEBUG] [single_ward_stable] Step 3 — finish_reason=tool_calls tool_calls=1
+02:06:23 [DEBUG] [single_ward_stable] Step 3 — calling query_forecast({'horizon_days': 7, 'location_id': 'ward_general', 'product_id': 'GLOVE-001'})
+02:06:25 [DEBUG] [single_ward_stable] Step 4/150 — 8 messages in context
+02:06:25 [DEBUG] [single_ward_stable] Step 4 — finish_reason=tool_calls tool_calls=1
+02:06:25 [DEBUG] [single_ward_stable] Step 4 — calling query_supplier({'supplier_id': 'MEDLINE'})
+02:06:27 [DEBUG] [single_ward_stable] Step 5/150 — 10 messages in context
+02:06:28 [DEBUG] [single_ward_stable] Step 5 — finish_reason=tool_calls tool_calls=1
+02:06:28 [DEBUG] [single_ward_stable] Step 5 — calling query_forecast({'horizon_days': 7, 'location_id': 'ward_general', 'product_id': 'IV-500'})
+02:06:30 [DEBUG] [single_ward_stable] Step 6/150 — 12 messages in context
+02:06:32 [DEBUG] [single_ward_stable] Step 6 — finish_reason=tool_calls tool_calls=1
+02:06:32 [DEBUG] [single_ward_stable] Step 6 — calling submit_po({'destination_id': 'ward_general', 'product_id': 'GLOVE-001', 'quantity': 50, 'supplier_id': 'MEDLINE'})
+02:06:34 [DEBUG] [single_ward_stable] Step 7/150 — 14 messages in context
+02:06:34 [WARNING] [single_ward_stable] Step 7 — BadRequestError (1/5): Error code: 400 - {'error': {'message': 'Failed to parse tool call arguments as JSON', 'type': 'invalid_request_error', 'code': 'tool_use_failed', 'failed_generation': '{"name": "end_shift", "arguments": {""}"}'}}
+02:06:34 [DEBUG] [single_ward_stable] Step 8/150 — 15 messages in context
+02:06:35 [DEBUG] [single_ward_stable] Step 8 — finish_reason=tool_calls tool_calls=1
+02:06:35 [DEBUG] [single_ward_stable] Step 8 — calling end_shift({})
+02:06:37 [INFO] [single_ward_stable] Step 8 — shift 1 ended; pruning context (1 summaries)
+02:06:37 [DEBUG] [single_ward_stable] Dropping orphaned leading tool msg (tool_call_id=fc_a6c7005e-8882-4b92-a1dc-828fd8e57012)
+02:06:37 [DEBUG] [single_ward_stable] Step 9/150 — 8 messages in context
+02:06:38 [DEBUG] [single_ward_stable] Step 9 — finish_reason=tool_calls tool_calls=1
+02:06:38 [DEBUG] [single_ward_stable] Step 9 — calling read_inbox({'filter': 'unread'})
+02:06:40 [DEBUG] [single_ward_stable] Step 10/150 — 10 messages in context
+02:06:41 [DEBUG] [single_ward_stable] Step 10 — finish_reason=tool_calls tool_calls=1
+02:06:41 [DEBUG] [single_ward_stable] Step 10 — calling query_erp({'location': 'all', 'sku': 'all', 'table': 'inventory'})
+02:06:43 [DEBUG] [single_ward_stable] Step 11/150 — 12 messages in context
+02:06:44 [DEBUG] [single_ward_stable] Step 11 — finish_reason=tool_calls tool_calls=1
+02:06:44 [DEBUG] [single_ward_stable] Step 11 — calling query_erp({'location': 'ward_general', 'sku': 'all', 'table': 'demand_history'})
+02:06:46 [DEBUG] [single_ward_stable] Step 12/150 — 14 messages in context
+02:06:47 [DEBUG] [single_ward_stable] Step 12 — finish_reason=tool_calls tool_calls=1
+02:06:47 [DEBUG] [single_ward_stable] Step 12 — calling query_forecast({'horizon_days': 7, 'location_id': 'ward_general', 'product_id': 'GLOVE-001'})
+02:06:49 [DEBUG] [single_ward_stable] Step 13/150 — 16 messages in context
+02:06:49 [DEBUG] [single_ward_stable] Step 13 — finish_reason=tool_calls tool_calls=1
+02:06:49 [DEBUG] [single_ward_stable] Step 13 — calling query_supplier({'supplier_id': 'MEDLINE'})
+02:06:51 [DEBUG] [single_ward_stable] Step 14/150 — 18 messages in context
+02:06:52 [DEBUG] [single_ward_stable] Step 14 — finish_reason=tool_calls tool_calls=1
+02:06:52 [DEBUG] [single_ward_stable] Step 14 — calling submit_po({'destination_id': 'ward_general', 'product_id': 'GLOVE-001', 'quantity': 102, 'supplier_id': 'MEDLINE'})
+02:06:54 [DEBUG] [single_ward_stable] Step 15/150 — 20 messages in context
+02:06:56 [DEBUG] [single_ward_stable] Step 15 — finish_reason=tool_calls tool_calls=1
+02:06:56 [DEBUG] [single_ward_stable] Step 15 — calling submit_po({'destination_id': 'ward_general', 'priority': 'expedited', 'product_id': 'GLOVE-001', 'quantity': 102, 'supplier_id': 'MEDLINE'})
+02:06:56 [INFO] [single_ward_stable] Step 15 — budget exhausted; injecting end_shift directive
+02:06:56 [DEBUG] [single_ward_stable] Step 16/150 — 23 messages in context
+02:06:56 [DEBUG] [single_ward_stable] Step 16 — finish_reason=tool_calls tool_calls=1
+02:06:56 [DEBUG] [single_ward_stable] Step 16 — calling end_shift({})
+02:06:58 [INFO] [single_ward_stable] Step 16 — shift 2 ended; pruning context (2 summaries)
+02:06:58 [DEBUG] [single_ward_stable] Dropping orphaned leading tool msg (tool_call_id=fc_3b3ef679-1e01-425a-bf58-a5e04cf204e4)
+02:06:58 [DEBUG] [single_ward_stable] Step 17/150 — 8 messages in context
+02:06:59 [DEBUG] [single_ward_stable] Step 17 — finish_reason=tool_calls tool_calls=1
+02:06:59 [DEBUG] [single_ward_stable] Step 17 — calling query_erp({'table': 'inventory'})
+02:07:01 [DEBUG] [single_ward_stable] Step 18/150 — 10 messages in context
+02:07:02 [DEBUG] [single_ward_stable] Step 18 — finish_reason=tool_calls tool_calls=1
+02:07:02 [DEBUG] [single_ward_stable] Step 18 — calling query_supplier({'supplier_id': 'MEDLINE'})
+02:07:04 [DEBUG] [single_ward_stable] Step 19/150 — 12 messages in context
+02:07:05 [DEBUG] [single_ward_stable] Step 19 — finish_reason=tool_calls tool_calls=1
+02:07:05 [DEBUG] [single_ward_stable] Step 19 — calling query_forecast({'horizon_days': 3, 'location_id': 'ward_general', 'product_id': 'GLOVE-001'})
+02:07:07 [DEBUG] [single_ward_stable] Step 20/150 — 14 messages in context
+02:07:07 [DEBUG] [single_ward_stable] Step 20 — finish_reason=tool_calls tool_calls=1
+02:07:07 [DEBUG] [single_ward_stable] Step 20 — calling submit_po({'destination_id': 'ward_general', 'product_id': 'GLOVE-001', 'quantity': 150, 'supplier_id': 'MEDLINE'})
+02:07:09 [DEBUG] [single_ward_stable] Step 21/150 — 16 messages in context
+02:07:10 [DEBUG] [single_ward_stable] Step 21 — finish_reason=tool_calls tool_calls=1
+02:07:10 [DEBUG] [single_ward_stable] Step 21 — calling read_inbox({'filter': 'unread'})
+02:07:12 [DEBUG] [single_ward_stable] Step 22/150 — 18 messages in context
+02:07:13 [DEBUG] [single_ward_stable] Step 22 — finish_reason=tool_calls tool_calls=1
+02:07:13 [DEBUG] [single_ward_stable] Step 22 — calling submit_po({'destination_id': 'ward_general', 'priority': 'expedited', 'product_id': 'GLOVE-001', 'quantity': 150, 'supplier_id': 'MEDLINE'})
+02:07:15 [DEBUG] [single_ward_stable] Step 23/150 — 20 messages in context
+02:07:16 [DEBUG] [single_ward_stable] Step 23 — finish_reason=tool_calls tool_calls=1
+02:07:16 [DEBUG] [single_ward_stable] Step 23 — calling end_shift({})
+02:07:16 [INFO] [single_ward_stable] Step 23 — episode complete detected
+02:07:18 [INFO] [single_ward_stable] Episode finished. steps=23 done=True final_reward=0.8621
+02:07:18 [INFO] [single_ward_stable] Task complete: reward=0.8621 steps=23
+02:07:28 [WARNING] MedchainEnv.close() suppressed error during shutdown: Command '['docker', 'stop', 'd3354d1bb205aedcaa7ecc10ce164bc21f891bf33b6ccd2f50759d487577e429']' timed out after 10 seconds
+02:07:28 [INFO] All tasks complete. avg_reward=0.8314

logs/inference_20260403_021847.log ADDED Viewed

	@@ -0,0 +1,22 @@

+02:18:47 [INFO] Starting. API_BASE_URL=https://router.huggingface.co/v1 MODEL_NAME=openai/gpt-oss-20b:groq
+02:18:47 [INFO] Launching task: orientation_ward
+02:18:52 [DEBUG] Tool registered: read_inbox (required=[])
+02:18:52 [DEBUG] Tool registered: query_erp (required=['table'])
+02:18:52 [DEBUG] Tool registered: query_supplier (required=['supplier_id'])
+02:18:52 [DEBUG] Tool registered: query_forecast (required=['product_id', 'location_id'])
+02:18:52 [DEBUG] Tool registered: submit_po (required=['supplier_id', 'product_id', 'destination_id', 'quantity'])
+02:18:52 [DEBUG] Tool registered: transfer (required=['from_location_id', 'to_location_id', 'product_id', 'quantity'])
+02:18:52 [DEBUG] Tool registered: quarantine_lot (required=['location_id', 'sku', 'lot_id'])
+02:18:52 [DEBUG] Tool registered: file_justification (required=['ticket_id', 'reason'])
+02:18:52 [DEBUG] Tool registered: end_shift (required=[])
+02:18:52 [INFO] [orientation_ward] 9 tools discovered
+02:18:52 [DEBUG] [orientation_ward] Episode started. Tools: ['read_inbox', 'query_erp', 'query_supplier', 'query_forecast', 'submit_po', 'transfer', 'quarantine_lot', 'file_justification', 'end_shift']
+02:18:52 [DEBUG] [orientation_ward] Step 1/150 — 2 messages in context
+02:18:53 [DEBUG] [orientation_ward] Step 1 — finish_reason=tool_calls tool_calls=1
+02:18:53 [DEBUG] [orientation_ward] Step 1 — calling read_inbox({'filter': 'unread'})
+02:18:55 [DEBUG] [orientation_ward] Step 2/150 — 4 messages in context
+02:18:55 [DEBUG] [orientation_ward] Step 2 — finish_reason=tool_calls tool_calls=1
+02:18:55 [DEBUG] [orientation_ward] Step 2 — calling query_erp({'location': 'all', 'sku': 'all', 'table': 'inventory'})
+02:18:57 [DEBUG] [orientation_ward] Step 3/150 — 6 messages in context
+02:18:58 [DEBUG] [orientation_ward] Step 3 — finish_reason=tool_calls tool_calls=1
+02:18:58 [DEBUG] [orientation_ward] Step 3 — calling query_forecast({'horizon_days': 7, 'location_id': 'ward_general', 'product_id': 'GLOVE-001'})

logs/inference_20260403_130724.log ADDED Viewed

	@@ -0,0 +1,63 @@

+13:07:24 [INFO] Starting. API_BASE_URL=https://router.huggingface.co/v1 MODEL_NAME=openai/gpt-oss-20b:groq
+13:07:24 [INFO] Launching task: orientation_ward
+13:07:29 [DEBUG] Tool registered: read_inbox (required=[])
+13:07:29 [DEBUG] Tool registered: query_erp (required=['table'])
+13:07:29 [DEBUG] Tool registered: query_supplier (required=['supplier_id'])
+13:07:29 [DEBUG] Tool registered: query_forecast (required=['product_id', 'location_id'])
+13:07:29 [DEBUG] Tool registered: submit_po (required=['supplier_id', 'product_id', 'destination_id', 'quantity'])
+13:07:29 [DEBUG] Tool registered: transfer (required=['from_location_id', 'to_location_id', 'product_id', 'quantity'])
+13:07:29 [DEBUG] Tool registered: quarantine_lot (required=['location_id', 'sku', 'lot_id'])
+13:07:29 [DEBUG] Tool registered: file_justification (required=['ticket_id', 'reason'])
+13:07:29 [DEBUG] Tool registered: end_shift (required=[])
+13:07:29 [INFO] [orientation_ward] 9 tools discovered
+13:07:29 [DEBUG] [orientation_ward] Episode started. Tools: ['read_inbox', 'query_erp', 'query_supplier', 'query_forecast', 'submit_po', 'transfer', 'quarantine_lot', 'file_justification', 'end_shift']
+13:07:29 [DEBUG] [orientation_ward] Step 1/150 — 2 messages in context
+13:07:30 [DEBUG] [orientation_ward] Step 1 — finish_reason=tool_calls tool_calls=1
+13:07:30 [DEBUG] [orientation_ward] Step 1 — calling read_inbox({'filter': 'unread'})
+13:07:32 [DEBUG] [orientation_ward] Step 2/150 — 4 messages in context
+13:07:33 [DEBUG] [orientation_ward] Step 2 — finish_reason=tool_calls tool_calls=1
+13:07:33 [DEBUG] [orientation_ward] Step 2 — calling query_erp({'location': 'all', 'sku': 'all', 'table': 'inventory'})
+13:07:35 [DEBUG] [orientation_ward] Step 3/150 — 6 messages in context
+13:07:36 [DEBUG] [orientation_ward] Step 3 — finish_reason=tool_calls tool_calls=1
+13:07:36 [DEBUG] [orientation_ward] Step 3 — calling query_forecast({'horizon_days': 7, 'location_id': 'ward_general', 'product_id': 'GLOVE-001'})
+13:07:38 [DEBUG] [orientation_ward] Step 4/150 — 8 messages in context
+13:07:38 [DEBUG] [orientation_ward] Step 4 — finish_reason=tool_calls tool_calls=1
+13:07:38 [DEBUG] [orientation_ward] Step 4 — calling query_supplier({'supplier_id': 'MEDLINE'})
+13:07:40 [DEBUG] [orientation_ward] Step 5/150 — 10 messages in context
+13:07:47 [WARNING] [orientation_ward] Step 5 — BadRequestError (1/5): Error code: 400 - {'error': {'message': 'Tool choice is required, but model did not call a tool', 'type': 'invalid_request_error', 'code': 'tool_use_failed', 'failed_generation': ''}}
+13:07:47 [DEBUG] [orientation_ward] Step 6/150 — 11 messages in context
+13:07:48 [DEBUG] [orientation_ward] Step 6 — finish_reason=tool_calls tool_calls=1
+13:07:48 [DEBUG] [orientation_ward] Step 6 — calling submit_po({'destination_id': 'ward_general', 'priority': 'standard', 'product_id': 'GLOVE-001', 'quantity': 20, 'supplier_id': 'MEDLINE'})
+13:07:50 [DEBUG] [orientation_ward] Step 7/150 — 13 messages in context
+13:07:50 [WARNING] [orientation_ward] Step 7 — BadRequestError (1/5): Error code: 400 - {'error': {'message': 'Failed to parse tool call arguments as JSON', 'type': 'invalid_request_error', 'code': 'tool_use_failed', 'failed_generation': '{"name": "end_shift", "arguments": {""}"}'}}
+13:07:50 [DEBUG] [orientation_ward] Step 8/150 — 14 messages in context
+13:07:51 [DEBUG] [orientation_ward] Step 8 — finish_reason=tool_calls tool_calls=1
+13:07:51 [DEBUG] [orientation_ward] Step 8 — calling end_shift({})
+13:07:53 [INFO] [orientation_ward] Step 8 — shift 1 ended; pruning context (1 summaries)
+13:07:53 [DEBUG] [orientation_ward] Step 9/150 — 9 messages in context
+13:07:54 [DEBUG] [orientation_ward] Step 9 — finish_reason=tool_calls tool_calls=1
+13:07:54 [DEBUG] [orientation_ward] Step 9 — calling read_inbox({'filter': 'unread'})
+13:07:56 [DEBUG] [orientation_ward] Step 10/150 — 11 messages in context
+13:07:56 [DEBUG] [orientation_ward] Step 10 — finish_reason=tool_calls tool_calls=1
+13:07:56 [DEBUG] [orientation_ward] Step 10 — calling query_erp({'table': 'inventory'})
+13:07:58 [DEBUG] [orientation_ward] Step 11/150 — 13 messages in context
+13:07:59 [DEBUG] [orientation_ward] Step 11 — finish_reason=tool_calls tool_calls=1
+13:07:59 [DEBUG] [orientation_ward] Step 11 — calling query_forecast({'horizon_days': 7, 'location_id': 'ward_general', 'product_id': 'SYR-10'})
+13:08:01 [DEBUG] [orientation_ward] Step 12/150 — 15 messages in context
+13:08:02 [DEBUG] [orientation_ward] Step 12 — finish_reason=tool_calls tool_calls=1
+13:08:02 [DEBUG] [orientation_ward] Step 12 — calling submit_po({'destination_id': 'ward_general', 'priority': 'standard', 'product_id': 'SYR-10', 'quantity': 100, 'supplier_id': 'MEDLINE'})
+13:08:04 [DEBUG] [orientation_ward] Step 13/150 — 17 messages in context
+13:08:05 [DEBUG] [orientation_ward] Step 13 — finish_reason=tool_calls tool_calls=1
+13:08:05 [DEBUG] [orientation_ward] Step 13 — calling query_supplier({'supplier_id': 'MEDLINE'})
+13:08:07 [DEBUG] [orientation_ward] Step 14/150 — 19 messages in context
+13:08:08 [DEBUG] [orientation_ward] Step 14 — finish_reason=tool_calls tool_calls=1
+13:08:08 [DEBUG] [orientation_ward] Step 14 — calling query_erp({'sku': 'GLOVE-001', 'table': 'inventory'})
+13:08:08 [INFO] [orientation_ward] Step 14 — budget exhausted; injecting end_shift directive
+13:08:08 [DEBUG] [orientation_ward] Step 15/150 — 22 messages in context
+13:08:08 [DEBUG] [orientation_ward] Step 15 — finish_reason=tool_calls tool_calls=1
+13:08:08 [DEBUG] [orientation_ward] Step 15 — calling end_shift({})
+13:08:08 [INFO] [orientation_ward] Step 15 — episode complete detected
+13:08:10 [INFO] [orientation_ward] Episode finished. steps=15 done=True final_reward=0.7854
+13:08:10 [INFO] [orientation_ward] Task complete: reward=0.7854 steps=15
+13:08:20 [WARNING] MedchainEnv.close() suppressed error during shutdown: Command '['docker', 'stop', 'a0b94417a826d0d040984577f5f06901a5ff78826054ff6be04878d0953f7706']' timed out after 10 seconds
+13:08:20 [INFO] All tasks complete. avg_reward=0.7854

models.py ADDED Viewed

	@@ -0,0 +1,49 @@

+"""Client-side state and observation types for the MedChain Env environment."""
+from openenv.core.env_server import State
+from openenv.core.env_server.types import Observation
+from pydantic import Field
+from typing import List, Optional
+AVAILABLE_TOOLS = [
+    "read_inbox",
+    "query_erp",
+    "query_supplier",
+    "query_forecast",
+    "submit_po",
+    "transfer",
+    "quarantine_lot",
+    "file_justification",
+    "end_shift",
+]
+class MedchainState(State):
+    """Runtime state exposed by the environment server."""
+    task: str = Field(
+        default="",
+        description="Task name (single_ward_stable / multi_ward_seasonal / hospital_network_crisis)",
+    )
+    day: int = Field(default=0, description="Current simulation day (1-indexed)")
+    max_days: int = Field(default=0, description="Total simulation days for this task")
+    actions_remaining: int = Field(default=0, description="Actions left this shift")
+    budget_used: float = Field(default=0.0, description="Outstanding committed PO budget ($)")
+    budget_limit: float = Field(default=0.0, description="Budget ceiling for outstanding orders ($)")
+    unread_messages: int = Field(default=0, description="Unread inbox messages")
+    orders_in_transit: int = Field(default=0, description="POs currently in transit")
+class MedObservation(Observation):
+    """Initial observation returned by reset(). Contains the shift dashboard text."""
+    dashboard: str = Field(default="", description="Dashboard state")
+    available_tools: List[str] = Field(default_factory=list, description="Available tools")
+    episode_id: str = Field(default="", description="Episode ID")
+class MedchainToolObservation(Observation):
+    """Observation returned for every tool-call step."""
+    tool_name: str = Field(default="", description="Name of the tool that was called")
+    tool_result: str = Field(default="", description="Text result from the tool")
+    error_msg: Optional[str] = Field(default=None, description="Error message if call failed")

openenv.yaml ADDED Viewed

	@@ -0,0 +1,78 @@

+spec_version: 1
+name: medchain_env
+type: space
+runtime: fastapi
+app: server.app:app
+port: 8000
+description: >
+  Hospital medical supply chain management environment.
+  AI agents operate a simulated legacy ERP system to manage
+  inventory across hospital wards, order supplies from suppliers,
+  and respond to crises (mass casualty incidents, product recalls,
+  supplier disruptions) — all through a fragmented text-based interface.
+tasks:
+  - id: orientation_ward
+    name: Orientation Day (Single Ward)
+    description: Explore the ERP tools for 2 days with 3 non-critical products and stable demand. Initial stock covers only Day 1 — agent must check inventory and place one replenishment order to avoid a Day 2 stockout.
+    difficulty: easy
+    max_days: 2
+    actions_per_shift: 5
+    expected_scores:
+      random: 0.55
+      heuristic: 0.88
+      llm_baseline: 0.95
+  - id: single_ward_stable
+    name: Single Ward (Stable Demand)
+    description: Manage one ward with 6 non-critical products and stable demand for 3 days. Initial stock covers only 2 days — agent must order on Day 1 to avoid stockout.
+    difficulty: medium
+    max_days: 3
+    actions_per_shift: 6
+    expected_scores:
+      random: 0.30
+      heuristic: 0.68
+      llm_baseline: 0.82
+  - id: multi_ward_seasonal
+    name: Multi-Ward (Seasonal Events)
+    description: Manage 3 wards + central pharmacy for 6 days. Two suppliers (fast/1-day vs standard/4-day). Flu surge activates Day 3 (+50% demand). Supplier delay fires Day 4 (standard lead time doubles — agent must pivot to fast supplier).
+    difficulty: medium-hard
+    max_days: 6
+    actions_per_shift: 8
+    expected_scores:
+      random: 0.22
+      heuristic: 0.55
+      llm_baseline: 0.73
+  - id: hospital_network_crisis
+    name: Hospital Network Crisis
+    description: Manage a 3-hospital + regional DC network for 12 days with 15 products including life-critical perishables. 5 crisis events — cold chain breach (day 3), supplier force majeure (day 6), MCI activation days 9-11 (blood demand ×3), and mandatory product recall (day 11).
+    difficulty: hard
+    max_days: 12
+    actions_per_shift: 10
+    expected_scores:
+      random: 0.12
+      heuristic: 0.38
+      llm_baseline: 0.58
+action_space:
+  type: mcp_tools
+  tools:
+    - read_inbox
+    - query_erp
+    - query_supplier
+    - query_forecast
+    - submit_po
+    - transfer
+    - quarantine_lot
+    - file_justification
+    - end_shift
+observation_space:
+  type: text
+  description: >
+    ASCII text output from simulated legacy MEDSUPPLY ERP v2.1.
+    Dashboard shows daily status. Tools return formatted text tables,
+    free-text messages, and system confirmations/errors.

pyproject.toml ADDED Viewed

	@@ -0,0 +1,39 @@

+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+#
+# This source code is licensed under the BSD-style license found in the
+# LICENSE file in the root directory of this source tree.
+[build-system]
+requires = ["setuptools>=45", "wheel"]
+build-backend = "setuptools.build_meta"
+[project]
+name = "openenv-medchain_env"
+version = "0.1.0"
+description = "Medchain Env environment for OpenEnv"
+requires-python = ">=3.10"
+dependencies = [
+    "openenv-core[core]>=0.2.1",
+    "fastmcp>=2.0.0",
+    "pydantic>=2.0.0",
+    "numpy>=1.24.0",
+    "uvicorn>=0.24.0",
+    "fastapi>=0.115.0",
+]
+[project.optional-dependencies]
+dev = [
+    "pytest>=8.0.0",
+    "pytest-cov>=4.0.0",
+]
+[project.scripts]
+# Server entry point - enables running via: uv run --project . server
+# or: python -m medchain_env.server.app
+server = "medchain_env.server.app:main"
+[tool.setuptools]
+include-package-data = true
+packages = ["medchain_env", "medchain_env.server"]
+package-dir = { "medchain_env" = ".", "medchain_env.server" = "server" }

sample_inference.py ADDED Viewed

	@@ -0,0 +1,187 @@

+"""
+Inference Script Example
+===================================
+MANDATORY
+- Before submitting, ensure the following variables are defined in your environment configuration:
+    API_BASE_URL   The API endpoint for the LLM.
+    MODEL_NAME     The model identifier to use for inference.
+    HF_TOKEN       Your Hugging Face / API key.
+    LOCAL_IMAGE_NAME The name of the local image to use for the environment if you are using from_docker_image()
+                     method
+- Defaults are set only for API_BASE_URL and MODEL_NAME
+    (and should reflect your active inference setup):
+    API_BASE_URL = os.getenv("API_BASE_URL", "<your-active-endpoint>")
+    MODEL_NAME = os.getenv("MODEL_NAME", "<your-active-model>")
+- The inference script must be named `inference.py` and placed in the root directory of the project
+- Participants must use OpenAI Client for all LLM calls using above variables
+STDOUT FORMAT
+- The script must emit exactly three line types to stdout, in this order:
+    [START] task=<task_name> env=<benchmark> model=<model_name>
+    [STEP]  step=<n> action=<action_str> reward=<0.00> done=<true|false> error=<msg|null>
+    [END]   success=<true|false> steps=<n> rewards=<r1,r2,...,rn>
+  Rules:
+    - One [START] line at episode begin.
+    - One [STEP] line per step, immediately after env.step() returns.
+    - One [END] line after env.close(), always emitted (even on exception).
+    - reward and rewards are formatted to 2 decimal places.
+    - done and success are lowercase booleans: true or false.
+    - error is the raw last_action_error string, or null if none.
+    - All fields on a single line with no newlines within a line.
+  Example:
+    [START] task=click-test env=miniwob model=Qwen3-VL-30B
+    [STEP] step=1 action=click('123') reward=0.00 done=false error=null
+    [STEP] step=2 action=fill('456','text') reward=0.00 done=false error=null
+    [STEP] step=3 action=click('789') reward=1.00 done=true error=null
+    [END] success=true steps=3 rewards=0.00,0.00,1.00
+"""
+import asyncio
+import os
+import textwrap
+from typing import List, Optional
+from openai import OpenAI
+from my_env_v4 import MyEnvV4Action, MyEnvV4Env
+IMAGE_NAME = os.getenv("IMAGE_NAME") # If you are using docker image
+API_KEY = os.getenv("HF_TOKEN") or os.getenv("API_KEY")
+API_BASE_URL = os.getenv("API_BASE_URL") or "https://router.huggingface.co/v1"
+MODEL_NAME = os.getenv("MODEL_NAME") or "Qwen/Qwen2.5-72B-Instruct"
+TASK_NAME = os.getenv("MY_ENV_V4_TASK", "echo")
+BENCHMARK = os.getenv("MY_ENV_V4_BENCHMARK", "my_env_v4")
+MAX_STEPS = 8
+TEMPERATURE = 0.7
+MAX_TOKENS = 150
+SUCCESS_SCORE_THRESHOLD = 0.1  # normalized score in [0, 1]
+# Max possible reward: each token contributes 0.1, across all steps
+_MAX_REWARD_PER_STEP = MAX_TOKENS * 0.1
+MAX_TOTAL_REWARD = MAX_STEPS * _MAX_REWARD_PER_STEP
+SYSTEM_PROMPT = textwrap.dedent(
+    """
+    You are interacting with a simple echo environment.
+    Each turn you must send a message. The environment will echo it back.
+    Reward is proportional to message length: reward = len(message) * 0.1
+    Your goal is to maximize total reward by sending meaningful, substantive messages.
+    Reply with exactly one message string — no quotes, no prefixes, just the message text.
+    """
+).strip()
+def log_start(task: str, env: str, model: str) -> None:
+    print(f"[START] task={task} env={env} model={model}", flush=True)
+def log_step(step: int, action: str, reward: float, done: bool, error: Optional[str]) -> None:
+    error_val = error if error else "null"
+    done_val = str(done).lower()
+    print(
+        f"[STEP] step={step} action={action} reward={reward:.2f} done={done_val} error={error_val}",
+        flush=True,
+    )
+def log_end(success: bool, steps: int, score: float, rewards: List[float]) -> None:
+    rewards_str = ",".join(f"{r:.2f}" for r in rewards)
+    print(f"[END] success={str(success).lower()} steps={steps} score={score:.3f} rewards={rewards_str}", flush=True)
+def build_user_prompt(step: int, last_echoed: str, last_reward: float, history: List[str]) -> str:
+    history_block = "\n".join(history[-4:]) if history else "None"
+    return textwrap.dedent(
+        f"""
+        Step: {step}
+        Last echoed message: {last_echoed!r}
+        Last reward: {last_reward:.2f}
+        Previous steps:
+        {history_block}
+        Send your next message.
+        """
+    ).strip()
+def get_model_message(client: OpenAI, step: int, last_echoed: str, last_reward: float, history: List[str]) -> str:
+    user_prompt = build_user_prompt(step, last_echoed, last_reward, history)
+    try:
+        completion = client.chat.completions.create(
+            model=MODEL_NAME,
+            messages=[
+                {"role": "system", "content": SYSTEM_PROMPT},
+                {"role": "user", "content": user_prompt},
+            ],
+            temperature=TEMPERATURE,
+            max_tokens=MAX_TOKENS,
+            stream=False,
+        )
+        text = (completion.choices[0].message.content or "").strip()
+        return text if text else "hello"
+    except Exception as exc:
+        print(f"[DEBUG] Model request failed: {exc}", flush=True)
+        return "hello"
+async def main() -> None:
+    client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY)
+    env = await MyEnvV4Env.from_docker_image(IMAGE_NAME)
+    history: List[str] = []
+    rewards: List[float] = []
+    steps_taken = 0
+    score = 0.0
+    success = False
+    log_start(task=TASK_NAME, env=BENCHMARK, model=MODEL_NAME)
+    try:
+        result = await env.reset() # OpenENV.reset()
+        last_echoed = result.observation.echoed_message
+        last_reward = 0.0
+        for step in range(1, MAX_STEPS + 1):
+            if result.done:
+                break
+            message = get_model_message(client, step, last_echoed, last_reward, history)
+            result = await env.step(MyEnvV4Action(message=message))
+            obs = result.observation
+            reward = result.reward or 0.0
+            done = result.done
+            error = None
+            rewards.append(reward)
+            steps_taken = step
+            last_echoed = obs.echoed_message
+            last_reward = reward
+            log_step(step=step, action=message, reward=reward, done=done, error=error)
+            history.append(f"Step {step}: {message!r} -> reward {reward:+.2f}")
+            if done:
+                break
+        score = sum(rewards) / MAX_TOTAL_REWARD if MAX_TOTAL_REWARD > 0 else 0.0
+        score = min(max(score, 0.0), 1.0)  # clamp to [0, 1]
+        success = score >= SUCCESS_SCORE_THRESHOLD
+    finally:
+        try:
+            await env.close()
+        except Exception as e:
+            print(f"[DEBUG] env.close() error (container cleanup): {e}", flush=True)
+        log_end(success=success, steps=steps_taken, score=score, rewards=rewards)
+if __name__ == "__main__":
+    asyncio.run(main())

server/Dockerfile ADDED Viewed

	@@ -0,0 +1,80 @@

+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+#
+# This source code is licensed under the BSD-style license found in the
+# LICENSE file in the root directory of this source tree.
+# Multi-stage build using openenv-base
+# This Dockerfile is flexible and works for both:
+# - In-repo environments (with local OpenEnv sources)
+# - Standalone environments (with openenv from PyPI/Git)
+# The build script (openenv build) handles context detection and sets appropriate build args.
+ARG BASE_IMAGE=ghcr.io/meta-pytorch/openenv-base:latest
+FROM ${BASE_IMAGE} AS builder
+WORKDIR /app
+# Ensure git is available (required for installing dependencies from VCS)
+RUN apt-get update && \
+    apt-get install -y --no-install-recommends git && \
+    rm -rf /var/lib/apt/lists/*
+# Build argument to control whether we're building standalone or in-repo
+ARG BUILD_MODE=in-repo
+ARG ENV_NAME=medchain_env
+# Copy environment code (always at root of build context)
+COPY . /app/env
+# For in-repo builds, openenv is already vendored in the build context
+# For standalone builds, openenv will be installed via pyproject.toml
+WORKDIR /app/env
+# Ensure uv is available (for local builds where base image lacks it)
+RUN if ! command -v uv >/dev/null 2>&1; then \
+        curl -LsSf https://astral.sh/uv/install.sh | sh && \
+        mv /root/.local/bin/uv /usr/local/bin/uv && \
+        mv /root/.local/bin/uvx /usr/local/bin/uvx; \
+    fi
+# Install dependencies using uv sync
+# If uv.lock exists, use it; otherwise resolve on the fly
+RUN --mount=type=cache,target=/root/.cache/uv \
+    if [ -f uv.lock ]; then \
+        uv sync --frozen --no-install-project --no-editable; \
+    else \
+        uv sync --no-install-project --no-editable; \
+    fi
+RUN --mount=type=cache,target=/root/.cache/uv \
+    if [ -f uv.lock ]; then \
+        uv sync --frozen --no-editable; \
+    else \
+        uv sync --no-editable; \
+    fi
+# Final runtime stage
+FROM ${BASE_IMAGE}
+WORKDIR /app
+# Copy the virtual environment from builder
+COPY --from=builder /app/env/.venv /app/.venv
+# Copy the environment code
+COPY --from=builder /app/env /app/env
+# Set PATH to use the virtual environment
+ENV PATH="/app/.venv/bin:$PATH"
+# Set PYTHONPATH so imports work correctly
+ENV PYTHONPATH="/app/env:$PYTHONPATH"
+# Health check
+HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
+    CMD curl -f http://localhost:8000/health || exit 1
+# Run the FastAPI server
+# The module path is constructed to work with the /app/env structure
+CMD ["sh", "-c", "cd /app/env && uvicorn server.app:app --host 0.0.0.0 --port 8000"]

server/__init__.py ADDED Viewed

	@@ -0,0 +1,5 @@

+"""MedChain Env environment server components."""
+from .medchain_env_environment import MedchainEnvironment
+__all__ = ["MedchainEnvironment"]

server/app.py ADDED Viewed

	@@ -0,0 +1,51 @@

+"""
+FastAPI application for the MedChain Env environment.
+Environment Variables:
+    MEDCHAIN_TASK: Task to serve (default: single_ward_stable).
+        Options: orientation_ward, single_ward_stable, multi_ward_seasonal, hospital_network_crisis
+Usage:
+    uvicorn server.app:app --host 0.0.0.0 --port 8000
+"""
+import os
+from openenv.core.env_server.http_server import create_app
+from openenv.core.env_server.mcp_types import CallToolAction
+try:
+    from .medchain_env_environment import MedchainEnvironment
+    from ..models import MedchainToolObservation
+except ImportError:
+    from server.medchain_env_environment import MedchainEnvironment
+    from models import MedchainToolObservation
+TASK = os.environ.get("MEDCHAIN_TASK", "single_ward_stable")
+def _env_factory():
+    """Create a new MedchainEnvironment instance for each client session."""
+    return MedchainEnvironment(task=TASK)
+app = create_app(
+    _env_factory,
+    CallToolAction,
+    MedchainToolObservation,
+    env_name="medchain_env",
+)
+def main(host: str = "0.0.0.0", port: int = 8000):
+    import uvicorn
+    uvicorn.run(app, host=host, port=port)
+if __name__ == "__main__":
+    import argparse
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--port", type=int, default=8000)
+    args = parser.parse_args()
+    main(port=args.port) # Bypass multi node deployment requirement in openenv: main()

server/erp_formatter.py ADDED Viewed

	@@ -0,0 +1,379 @@

+"""
+ASCII table formatters for the MEDSUPPLY ERP v2.1 interface.
+All functions receive SimState + TaskConfig and return plain text strings
+for display in an LLM chat interface.
+"""
+from __future__ import annotations
+import math
+from typing import TYPE_CHECKING
+if TYPE_CHECKING:
+    from .simulation import SimState
+    from .tasks import Product, Supplier, TaskConfig
+_W = 70  # default box width
+def _box(lines: list[str], width: int = _W) -> str:
+    """Wrap a list of content strings in a simple Unicode box."""
+    inner_width = width - 2
+    top    = "╔" + "═" * inner_width + "╗"
+    mid    = "╠" + "═" * inner_width + "╣"
+    bottom = "╚" + "═" * inner_width + "╝"
+    rows = [top]
+    for i, line in enumerate(lines):
+        if line == "---DIVIDER---":
+            rows.append(mid)
+        else:
+            padded = ("║ " + line).ljust(inner_width + 1) + "║"
+            rows.append(padded)
+    rows.append(bottom)
+    return "\n".join(rows)
+def _sep(width: int = _W) -> str:
+    return "-" * width
+def _status(days_left) -> str:
+    if days_left is None:
+        return "NON-PERISH"
+    if days_left <= 0:
+        return "*** EXPIRED ***"
+    if days_left <= 3:
+        return "WARN_CRITICAL"
+    if days_left <= 7:
+        return "WARN_LOW"
+    return "OK"
+# ─── Dashboard ───────────────────────────────────────────────────────────────
+def format_dashboard(state: "SimState", task_config: "TaskConfig") -> str:
+    """
+    Returns the ERP dashboard shown at start of each shift.
+    """
+    unread = sum(1 for m in state.inbox if not m.read)
+    in_transit = sum(1 for po in state.pipeline_orders if po.status == "in_transit")
+    # Expiry warning: any lot expiring within 7 days
+    expiry_warning = False
+    for lots in state.inventory.values():
+        for lot in lots:
+            if lot.expiry_day is not None and lot.lot_id not in state.quarantined_lots:
+                days_left = lot.expiry_day - state.day
+                if days_left <= 7:
+                    expiry_warning = True
+                    break
+    pager_icon = "[!]" if unread > 0 else "[·]"
+    pager_msg  = f"{unread} unread message(s)" if unread > 0 else "No new messages"
+    expiry_icon = "[!]" if expiry_warning else "[·]"
+    expiry_msg  = "Expiry warnings present — check INVDB" if expiry_warning else "No expiry warnings"
+    pipeline_icon = "[·]" if in_transit == 0 else "[·]"
+    supplier_lines = [
+        f"  {s.supplier_id} → {', '.join(s.products)}"
+        for s in task_config.suppliers
+    ]
+    lines = [
+        "  MEDSUPPLY ERP v2.1 — CENTRAL HOSPITAL NETWORK",
+        f"  Task: {task_config.name} | Shift: Day {state.day} of {state.max_days}",
+        f"  Actions remaining: {state.actions_remaining}/{state.actions_per_shift}",
+        f"  Budget used: ${state.budget_used:,.0f} / ${state.budget_limit:,.0f}",
+        "---DIVIDER---",
+        f"  {pager_icon} COMMS PAGER: {pager_msg}",
+        f"  {expiry_icon} INVDB: {expiry_msg}",
+        f"  {pipeline_icon} PROCURENET: {in_transit} order(s) in transit",
+        "---DIVIDER---",
+        "  SUPPLIERS (use exact IDs below):",
+    ] + supplier_lines
+    box = _box(lines)
+    tools = "read_inbox, query_erp, query_supplier, query_forecast, submit_po, transfer, quarantine_lot, file_justification, end_shift"
+    return box + "\nAwaiting input.\nAvailable tools: " + tools
+# ─── Inventory Table ─────────────────────────────────────────────────────────
+def format_inventory_table(
+    state: "SimState",
+    task_config: "TaskConfig",
+    location: str,
+    sku: str,
+) -> str:
+    loc_filter = location.lower()
+    sku_filter = sku.lower()
+    loc_label = location.upper() if loc_filter != "all" else "ALL"
+    sku_label = sku.upper() if sku_filter != "all" else "ALL"
+    header = [
+        f"SYSTEM QUERY RESULT [TABLE: INVENTORY] [LOC: {loc_label}] [SKU: {sku_label}]",
+        f"[TIMESTAMP: Day {state.day}]",
+    ]
+    sep = _sep()
+    col_header = f"{'LOT_ID':<22} | {'DESC':<24} | {'QTY':>5} | {'EXP_DAY':>7} | {'DAYS_LEFT':>9} | STATUS"
+    rows = []
+    product_map = {p.product_id: p for p in task_config.products}
+    for (loc_id, product_id), lots in sorted(state.inventory.items()):
+        if loc_filter != "all" and loc_id.lower() != loc_filter:
+            continue
+        if sku_filter != "all" and product_id.lower() != sku_filter:
+            continue
+        product = product_map.get(product_id)
+        desc = product.name if product else product_id
+        for lot in lots:
+            if lot.qty == 0:
+                continue
+            is_quarantined = lot.lot_id in state.quarantined_lots
+            days_left = (lot.expiry_day - state.day) if lot.expiry_day is not None else None
+            status = "[QUARANTINED]" if is_quarantined else _status(days_left)
+            exp_str = f"{lot.expiry_day:07d}" if lot.expiry_day is not None else "N/A    "
+            dl_str  = f"{days_left:09d}" if days_left is not None else "N/A      "
+            row = f"{lot.lot_id:<22} | {desc[:24]:<24} | {lot.qty:>5} | {exp_str:>7} | {dl_str:>9} | {status}"
+            rows.append(row)
+    lines = header + [sep, col_header, sep]
+    if rows:
+        lines += rows
+    else:
+        lines.append("(no stock found)")
+    lines += [sep, f"QUERY OK | {len(rows)} row(s) returned"]
+    return "\n".join(lines)
+# ─── Expiry Table ─────────────────────────────────────────────────────────────
+def format_expiry_table(
+    state: "SimState",
+    task_config: "TaskConfig",
+    location: str,
+    sku: str,
+) -> str:
+    loc_filter = location.lower()
+    sku_filter = sku.lower()
+    header = [
+        "SYSTEM QUERY RESULT [TABLE: EXPIRY] [Lots expiring within 14 days]",
+        f"[TIMESTAMP: Day {state.day}]",
+    ]
+    sep = _sep()
+    col_header = f"{'LOT_ID':<22} | {'LOC':<16} | {'SKU':<12} | {'QTY':>5} | {'EXP_DAY':>7} | {'DAYS_LEFT':>9} | STATUS"
+    rows = []
+    product_map = {p.product_id: p for p in task_config.products}
+    for (loc_id, product_id), lots in sorted(state.inventory.items()):
+        if loc_filter != "all" and loc_id.lower() != loc_filter:
+            continue
+        if sku_filter != "all" and product_id.lower() != sku_filter:
+            continue
+        for lot in lots:
+            if lot.qty == 0:
+                continue
+            if lot.expiry_day is None:
+                continue
+            days_left = lot.expiry_day - state.day
+            if days_left > 14:
+                continue
+            is_quarantined = lot.lot_id in state.quarantined_lots
+            status = "[QUARANTINED]" if is_quarantined else _status(days_left)
+            row = (
+                f"{lot.lot_id:<22} | {loc_id:<16} | {product_id:<12} | "
+                f"{lot.qty:>5} | {lot.expiry_day:>7} | {days_left:>9} | {status}"
+            )
+            rows.append(row)
+    lines = header + [sep, col_header, sep]
+    if rows:
+        lines += rows
+    else:
+        lines.append("(no expiry warnings)")
+    lines += [sep, f"QUERY OK | {len(rows)} row(s) returned"]
+    return "\n".join(lines)
+# ─── Pipeline Orders Table ────────────────────────────────────────────────────
+def format_pipeline_table(
+    state: "SimState",
+    location: str,
+    sku: str,
+) -> str:
+    loc_filter = location.lower()
+    sku_filter = sku.lower()
+    header = ["SYSTEM QUERY RESULT [TABLE: PIPELINE_ORDERS]", f"[TIMESTAMP: Day {state.day}]"]
+    sep = _sep()
+    col_header = f"{'PO_ID':<9} | {'SUPPLIER':<12} | {'SKU':<12} | {'DESTINATION':<16} | {'QTY':>5} | {'PRIORITY':<10} | {'ETA':>5} | STATUS"
+    rows = []
+    for po in state.pipeline_orders:
+        if loc_filter != "all" and po.destination_id.lower() != loc_filter:
+            continue
+        if sku_filter != "all" and po.product_id.lower() != sku_filter:
+            continue
+        status = po.status.upper().replace("_", " ")
+        row = (
+            f"{po.po_id:<9} | {po.supplier_id:<12} | {po.product_id:<12} | "
+            f"{po.destination_id:<16} | {po.quantity:>5} | {po.priority:<10} | "
+            f"D-{po.eta_day:02d} | {status}"
+        )
+        rows.append(row)
+    lines = header + [sep, col_header, sep]
+    if rows:
+        lines += rows
+    else:
+        lines.append("(no orders in pipeline)")
+    lines += [sep, f"QUERY OK | {len(rows)} row(s) returned"]
+    return "\n".join(lines)
+# ─── Demand History Table ─────────────────────────────────────────────────────
+def format_demand_history(
+    state: "SimState",
+    task_config: "TaskConfig",
+    location: str,
+    sku: str,
+) -> str:
+    loc_filter = location.lower()
+    sku_filter = sku.lower()
+    loc_label = location.upper() if loc_filter != "all" else "ALL"
+    sku_label = sku.upper() if sku_filter != "all" else "ALL"
+    header = [
+        f"SYSTEM QUERY RESULT [TABLE: DEMAND_HISTORY] [LOC: {loc_label}] [SKU: {sku_label}]",
+        f"[TIMESTAMP: Day {state.day}] [Days 1–{state.day - 1} recorded]",
+    ]
+    sep = _sep()
+    if loc_filter != "all" and sku_filter != "all":
+        # Specific (location, product) pair
+        key = (location, sku)
+        demands   = state.daily_product_demand.get(key, [])
+        fulfilleds = state.daily_product_fulfilled.get(key, [])
+        col_header = f"{'DAY':>4} | {'DEMAND':>6} | {'FULFILLED':>9} | SVC_LVL"
+        rows = []
+        for i, (d, f) in enumerate(zip(demands, fulfilleds)):
+            svc = f / max(d, 1)
+            rows.append(f"{i + 1:04d} | {d:>6} | {f:>9} | {svc:.2f}")
+        lines = header + [sep, col_header, sep]
+        lines += rows if rows else ["(no history yet)"]
+    else:
+        # Aggregate across matching keys
+        col_header = f"{'DAY':>4} | {'DEMAND':>8} | {'FULFILLED':>9} | SVC_LVL"
+        num_days = len(state.daily_demand)
+        rows = []
+        for i in range(num_days):
+            d = state.daily_demand[i]
+            f = state.daily_fulfilled[i]
+            svc = f / max(d, 1)
+            rows.append(f"{i + 1:04d} | {int(d):>8} | {int(f):>9} | {svc:.2f}")
+        lines = header + [sep, col_header, sep]
+        lines += rows if rows else ["(no history yet)"]
+    lines += [sep, f"QUERY OK | {len(rows)} row(s) returned"]
+    return "\n".join(lines)
+# ─── Supplier Info ────────────────────────────────────────────────────────────
+def format_supplier_info(
+    supplier: "Supplier",
+    effective_lead_time: int,
+    disruption_note: str,
+) -> str:
+    sep = _sep()
+    products_str = ", ".join(supplier.products)
+    lines = [
+        f"SUPPLIER INFO: {supplier.supplier_id}",
+        sep,
+        f"Name:          {supplier.name}",
+        f"Status:        ACTIVE",
+        f"Lead Time:     {effective_lead_time} days (effective)",
+        f"Base Lead:     {supplier.base_lead_time} days",
+        f"Cost Mult:     {supplier.cost_multiplier:.1f}× base price",
+        f"Products:      {products_str}",
+        f"Notes:         {disruption_note}",
+        sep,
+    ]
+    return "\n".join(lines)
+# ─── Forecast Table ───────────────────────────────────────────────────────────
+def format_forecast(
+    state: "SimState",
+    task_config: "TaskConfig",
+    product: "Product",
+    location_id: str,
+    horizon_days: int,
+) -> str:
+    import math
+    sku_label = product.product_id.upper()
+    loc_label = location_id.upper() if location_id != "all" else "ALL"
+    header = [
+        f"SYSTEM QUERY RESULT [TABLE: FORECAST] [SKU: {sku_label}] [LOC: {loc_label}]",
+        f"[Forecast from Day {state.day} for {horizon_days} day(s)]",
+    ]
+    sep = _sep()
+    col_header = f"{'DAY':>4} | {'FORECAST_DEMAND':>15} | NOTES"
+    rows = []
+    locations_to_forecast = (
+        [location_id] if location_id != "all" else product.locations
+    )
+    for loc in locations_to_forecast:
+        for offset in range(horizon_days):
+            future_day = state.day + offset
+            base = product.base_demand
+            # Seasonal component (deterministic mean — no noise)
+            if product.seasonal_amplitude > 0 and product.seasonal_period > 0:
+                seasonal = product.seasonal_amplitude * math.sin(
+                    2 * math.pi * future_day / product.seasonal_period + product.seasonal_phase
+                )
+                base *= (1 + seasonal)
+            notes = []
+            # Apply active + future event effects
+            for event in task_config.events:
+                event_active_on_day = (
+                    event.trigger_day <= future_day <= event.trigger_day + event.duration_days - 1
+                    if event.duration_days > 0
+                    else event.trigger_day == future_day
+                )
+                if event_active_on_day:
+                    if event.event_type == "demand_surge":
+                        if product.product_id in event.params.get("products", []):
+                            base *= event.params.get("multiplier", 1.4)
+                            notes.append("SURGE_ACTIVE")
+                    elif event.event_type == "mci":
+                        if product.criticality in ("CRITICAL", "HIGH") and loc in event.params.get("locations", []):
+                            base *= event.params.get("demand_multiplier", 3.0)
+                            notes.append("MCI_ACTIVE")
+            notes_str = ", ".join(notes) if notes else ""
+            prefix = f"{loc.upper()}:" if location_id == "all" and len(locations_to_forecast) > 1 else ""
+            rows.append(f"{future_day:04d} | {int(round(base)):>15} | {prefix}{notes_str}")
+    lines = header + [sep, col_header, sep]
+    lines += rows if rows else ["(no forecast data)"]
+    lines += [sep, "Forecast based on historical demand mean. Active alerts applied."]
+    return "\n".join(lines)

server/grader.py ADDED Viewed

	@@ -0,0 +1,225 @@

+"""
+Deterministic terminal reward computation for the MedChain Env environment.
+Two reward streams exist:
+  - Per-step shaping rewards (in medchain_env_environment.py)
+  - Terminal score on the final end_shift() call — this module
+All formulas are deterministic — no LLM judge.
+"""
+from __future__ import annotations
+from typing import TYPE_CHECKING, Dict, List, Set
+if TYPE_CHECKING:
+    from .simulation import SimState
+    from .tasks import TaskConfig
+def compute_reward(state: "SimState", task_config: "TaskConfig") -> float:
+    """Dispatch to task-specific terminal scorer."""
+    if task_config.name == "orientation_ward":
+        return compute_reward_task0(state, task_config)
+    elif task_config.name == "single_ward_stable":
+        return compute_reward_task1(state, task_config)
+    elif task_config.name == "multi_ward_seasonal":
+        return compute_reward_task2(state, task_config)
+    elif task_config.name == "hospital_network_crisis":
+        return compute_reward_task3(state, task_config)
+    return 0.0
+def compute_reward_task0(state: "SimState", task_config: "TaskConfig") -> float:
+    """
+    Intro task: score = 0.70 × service_level + 0.30 × ordered_at_least_once
+    Rewards reading the situation and placing at least one replenishment order.
+    """
+    if not state.daily_demand:
+        return 0.0
+    total_demand    = sum(state.daily_demand)
+    total_fulfilled = sum(state.daily_fulfilled)
+    service_level   = total_fulfilled / max(total_demand, 1)
+    ordered = 1.0 if state.pipeline_orders or state.total_spend > 0 else 0.0
+    return min(1.0, 0.70 * service_level + 0.30 * ordered)
+def compute_reward_task1(state: "SimState", task_config: "TaskConfig") -> float:
+    """
+    score = 0.50 × avg_service_level + 0.50 × cost_efficiency_vs_benchmark
+    """
+    if not state.daily_demand:
+        return 0.0
+    total_demand    = sum(state.daily_demand)
+    total_fulfilled = sum(state.daily_fulfilled)
+    service_level   = total_fulfilled / max(total_demand, 1)
+    avg_unit_cost = (
+        sum(p.unit_cost * p.base_demand for p in task_config.products)
+        / max(sum(p.base_demand for p in task_config.products), 1)
+    )
+    benchmark_spend = total_fulfilled * avg_unit_cost * 1.15
+    actual_spend    = state.total_spend
+    if actual_spend <= 0:
+        cost_efficiency = 0.0
+    else:
+        cost_efficiency = min(1.0, benchmark_spend / actual_spend)
+    return 0.50 * service_level + 0.50 * cost_efficiency
+def compute_reward_task2(state: "SimState", task_config: "TaskConfig") -> float:
+    """
+    score = 0.40 × avg_service_level
+          + 0.35 × cost_efficiency
+          + 0.15 × capacity_score
+          + 0.10 × transfer_efficiency
+    """
+    if not state.daily_demand:
+        return 0.0
+    total_demand    = sum(state.daily_demand)
+    total_fulfilled = sum(state.daily_fulfilled)
+    service_level   = total_fulfilled / max(total_demand, 1)
+    avg_unit_cost = (
+        sum(p.unit_cost * p.base_demand for p in task_config.products)
+        / max(sum(p.base_demand for p in task_config.products), 1)
+    )
+    benchmark_spend = total_fulfilled * avg_unit_cost * 1.2
+    cost_efficiency = min(1.0, benchmark_spend / max(state.total_spend, 0.01))
+    total_days      = len(state.daily_demand)
+    capacity_score  = max(0.0, 1.0 - state.capacity_violation_days / max(total_days, 1))
+    avg_transfers_per_day = state.transfer_count / max(total_days, 1)
+    transfer_efficiency   = max(0.0, 1.0 - max(0.0, avg_transfers_per_day - 10) / 10.0)
+    return (
+        0.40 * service_level
+        + 0.35 * cost_efficiency
+        + 0.15 * capacity_score
+        + 0.10 * transfer_efficiency
+    )
+def compute_reward_task3(state: "SimState", task_config: "TaskConfig") -> float:
+    """
+    score = 0.35 × avg_service_level
+          + 0.25 × cost_efficiency
+          + 0.20 × (1 - critical_stockout_rate)
+          + 0.15 × (1 - waste_fraction)
+          + 0.05 × crisis_response_score
+          - justification_penalty (capped at 0.15)
+    """
+    if not state.daily_demand:
+        return 0.0
+    total_demand    = sum(state.daily_demand)
+    total_fulfilled = sum(state.daily_fulfilled)
+    service_level   = total_fulfilled / max(total_demand, 1)
+    avg_unit_cost = (
+        sum(p.unit_cost * p.base_demand for p in task_config.products)
+        / max(sum(p.base_demand for p in task_config.products), 1)
+    )
+    benchmark_spend  = total_fulfilled * avg_unit_cost * 1.2
+    cost_efficiency  = min(1.0, benchmark_spend / max(state.total_spend, 0.01))
+    total_crit_dem   = sum(state.daily_critical_demand)
+    total_crit_ful   = sum(state.daily_critical_fulfilled)
+    critical_service = total_crit_ful / max(total_crit_dem, 1)
+    critical_stockout_rate = 1.0 - critical_service
+    waste_fraction = min(1.0, state.total_wasted_value / max(state.total_spend, 0.01))
+    crisis_response_score = _compute_crisis_response_score(state, task_config)
+    incoherent_count     = sum(1 for r in state.justification_log if not r.is_coherent)
+    justification_penalty = min(0.15, incoherent_count * 0.05)
+    score = (
+        0.35 * service_level
+        + 0.25 * cost_efficiency
+        + 0.20 * (1.0 - critical_stockout_rate)
+        + 0.15 * (1.0 - waste_fraction)
+        + 0.05 * crisis_response_score
+        - justification_penalty
+    )
+    return max(0.0, min(1.0, score))
+def _compute_crisis_response_score(
+    state: "SimState",
+    task_config: "TaskConfig",
+) -> float:
+    """
+    Measures crisis response for MCI and recall events.
+    Returns 0.0 to 1.0.
+    """
+    score     = 0.0
+    max_score = 0.0
+    mci_event = next((e for e in task_config.events if e.event_id == "mci_activation"), None)
+    if mci_event:
+        max_score += 0.6
+        total_crit_dem = sum(state.daily_critical_demand)
+        total_crit_ful = sum(state.daily_critical_fulfilled)
+        mci_service = total_crit_ful / max(total_crit_dem, 1)
+        score += 0.6 * mci_service
+    recall_event = next((e for e in task_config.events if e.event_id == "iv_saline_recall"), None)
+    if recall_event:
+        max_score += 0.4
+        if state.recall_handled_by_day is not None:
+            days_delayed = state.recall_handled_by_day - recall_event.trigger_day
+            if days_delayed <= 0:
+                score += 0.4
+            elif days_delayed <= 2:
+                score += 0.2
+    if max_score == 0:
+        return 1.0
+    return score / max_score
+def grade_justification(reason: str, active_event_types: Set[str]) -> bool:
+    """
+    Deterministic keyword-based justification grading.
+    Returns True if coherent (no penalty), False if incoherent.
+    """
+    CRISIS_KEYWORDS: Dict[str, List[str]] = {
+        "mci": ["mci", "mass casualty", "trauma", "incident", "accident",
+                "emergency", "casualties", "blood", "critical patients"],
+        "supplier_disruption": ["disruption", "delay", "lead time", "supplier",
+                                "shortage", "force majeure", "extended"],
+        "product_recall": ["recall", "quarantine", "contamination", "lot",
+                           "health authority", "batch", "defective", "compromised"],
+        "budget_tighten": ["budget", "fiscal", "quarter", "constraint",
+                           "ceiling", "limit", "finance"],
+        "cold_chain_breach": ["cold chain", "temperature", "breach",
+                              "refriger", "spoilage", "compromised"],
+        "demand_surge": ["demand", "surge", "increased", "elevated",
+                         "high usage", "outbreak", "flu", "influenza"],
+    }
+    GENERIC_KEYWORDS = [
+        "urgent", "critical", "shortage", "low stock",
+        "stockout", "emergency", "insufficient",
+    ]
+    reason_lower = reason.lower()
+    if not active_event_types:
+        return any(kw in reason_lower for kw in GENERIC_KEYWORDS)
+    for event_type in active_event_types:
+        keywords = CRISIS_KEYWORDS.get(event_type, [])
+        if any(kw in reason_lower for kw in keywords):
+            return True
+    return any(kw in reason_lower for kw in GENERIC_KEYWORDS)

server/medchain_env_environment.py ADDED Viewed

	@@ -0,0 +1,382 @@

+"""
+MedChain Env — OpenEnv Environment implementation.
+Tool dispatch is delegated to an internal _MedchainMCPDelegate (MCPEnvironment)
+to retain FastMCP schema generation and MCP-level argument validation.
+MedchainEnvironment controls reward computation and returns MedchainToolObservation
+— a plain Pydantic model — so the reward field survives the WebSocket
+serialization round-trip intact.
+"""
+import uuid
+from typing import Any, Optional
+from fastmcp import FastMCP
+from openenv.core.env_server.interfaces import Environment
+from openenv.core.env_server.mcp_environment import MCPEnvironment
+from openenv.core.env_server.mcp_types import (
+    CallToolAction,
+    CallToolObservation,
+    ListToolsAction,
+    ListToolsObservation,
+)
+from openenv.core.env_server.types import Action, Observation
+try:
+    from ..models import AVAILABLE_TOOLS, MedchainState, MedObservation, MedchainToolObservation
+    from .simulation import MedchainSimulation
+    from .tasks import get_task_config
+except ImportError:
+    from models import AVAILABLE_TOOLS, MedchainState, MedObservation, MedchainToolObservation
+    from server.simulation import MedchainSimulation
+    from server.tasks import get_task_config
+class _MedchainMCPDelegate(MCPEnvironment):
+    """
+    Thin MCPEnvironment wrapper that holds the FastMCP server.
+    Provides _handle_call_tool() and _handle_list_tools() to
+    MedchainEnvironment without any reward logic.
+    """
+    def reset(self, **kwargs: Any) -> Observation:  # type: ignore[override]
+        return Observation(done=False, reward=0.0)
+    def _step_impl(self, action: Action, **kwargs: Any) -> Observation:
+        return Observation(done=False, reward=0.0)
+    @property
+    def state(self) -> MedchainState:
+        return MedchainState()
+class MedchainEnvironment(Environment):
+    """
+    Hospital supply chain management environment.
+    The agent operates a simulated legacy ERP system (MEDSUPPLY v2.1) with 9 tools.
+    Each call to end_shift() advances the simulation by one day.
+    The episode terminates after max_days with a terminal reward in [0, 1].
+    """
+    SUPPORTS_CONCURRENT_SESSIONS: bool = True
+    def __init__(self, task: str = "single_ward_stable"):
+        self._sim = MedchainSimulation(get_task_config(task))
+        self._step_count = 0
+        self._state = MedchainState()
+        # ── Register 9 MCP tools ──────────────────────────────────────────
+        mcp = FastMCP("medchain_env")
+        @mcp.tool
+        def read_inbox(filter: str = "unread") -> str:
+            """
+            Read messages from the COMMS PAGER inbox.
+            Args:
+                filter: Message filter — 'unread' (default), 'all', or 'flagged'
+            Returns:
+                Formatted inbox messages as raw text
+            """
+            return self._sim.read_inbox(filter)
+        @mcp.tool
+        def query_erp(table: str, location: str = "all", sku: str = "all") -> str:
+            """
+            Query the legacy ERP database.
+            Args:
+                table: Table to query — 'inventory', 'expiry', 'pipeline_orders', or 'demand_history'
+                location: Location ID or 'all'. E.g. 'ward_general', 'ward_icu', 'hospital_a'
+                sku: Product SKU or 'all'. E.g. 'B-001', 'IV-500', 'GLOVE-001'
+            Returns:
+                ASCII table with query results (legacy ERP format)
+            """
+            return self._sim.query_erp(table, location, sku)
+        @mcp.tool
+        def query_supplier(supplier_id: str) -> str:
+            """
+            Query supplier information including current lead times and disruptions.
+            Args:
+                supplier_id: Supplier identifier. Check the dashboard for valid supplier IDs.
+            Returns:
+                Supplier status text including lead times and any active disruptions
+            """
+            return self._sim.query_supplier(supplier_id)
+        @mcp.tool
+        def query_forecast(product_id: str, location_id: str, horizon_days: int = 7) -> str:
+            """
+            Get demand forecast for a product at a location.
+            Args:
+                product_id: Product SKU to forecast. Use query_erp(table='inventory') to see available SKUs.
+                location_id: Location to forecast for. Use query_erp(table='inventory') to see valid location IDs.
+                horizon_days: Forecast horizon in days (1-21, default 7)
+            Returns:
+                Forecasted daily demand table
+            """
+            return self._sim.query_forecast(product_id, location_id, horizon_days)
+        @mcp.tool
+        def submit_po(
+            supplier_id: str,
+            product_id: str,
+            destination_id: str,
+            quantity: int,
+            priority: str = "standard",
+        ) -> str:
+            """
+            Submit a purchase order to a supplier.
+            Args:
+                supplier_id: Supplier to order from. Check the dashboard for valid supplier IDs.
+                product_id: Product SKU to order. Use query_erp(table='inventory') to see available SKUs.
+                destination_id: Delivery location. Use query_erp(table='inventory') to see valid location IDs.
+                quantity: Number of units to order (must be positive)
+                priority: 'standard' (default) or 'expedited' (+50% cost, -2 day lead time; requires justification)
+            Returns:
+                Confirmation with PO ID and ETA, or error if budget/validation fails.
+                For expedited orders: returns BUDGET_OVERRIDE_REQUIRED with a ticket ID.
+                Use file_justification(ticket_id=...) to proceed.
+            """
+            return self._sim.submit_po(supplier_id, product_id, destination_id, quantity, priority)
+        @mcp.tool
+        def transfer(
+            from_location_id: str,
+            to_location_id: str,
+            product_id: str,
+            quantity: int,
+        ) -> str:
+            """
+            Transfer inventory between locations (small handling fee).
+            Args:
+                from_location_id: Source location. Use query_erp(table='inventory') to see valid location IDs.
+                to_location_id: Destination location. Use query_erp(table='inventory') to see valid location IDs.
+                product_id: Product SKU to transfer. Use query_erp(table='inventory') to see available SKUs.
+                quantity: Units to transfer (must not exceed available stock at source)
+            Returns:
+                Confirmation or error (insufficient stock, capacity exceeded)
+            """
+            return self._sim.transfer(from_location_id, to_location_id, product_id, quantity)
+        @mcp.tool
+        def quarantine_lot(location_id: str, sku: str, lot_id: str) -> str:
+            """
+            Quarantine a specific inventory lot (e.g. for product recalls or cold chain breaches).
+            Quarantined lots are excluded from demand fulfillment.
+            Args:
+                location_id: Location where the lot is stored
+                sku: Product SKU of the lot
+                lot_id: Lot identifier (from inventory query). Use 'all' to quarantine all lots of this SKU at this location.
+            Returns:
+                Confirmation with quarantine details and disposal ticket ID
+            """
+            return self._sim.quarantine_lot(location_id, sku, lot_id)
+        @mcp.tool
+        def file_justification(ticket_id: str, reason: str) -> str:
+            """
+            File a budget override justification to proceed with an expedited order.
+            Required after submit_po returns BUDGET_OVERRIDE_REQUIRED.
+            The justification is audited by Finance — it must reference the current clinical situation.
+            A false or incoherent justification is flagged and results in a scoring penalty.
+            Args:
+                ticket_id: The ticket ID from the BUDGET_OVERRIDE_REQUIRED error (e.g. 'BOT-0002')
+                reason: Free-text justification. Must clearly reference the reason for urgency.
+            Returns:
+                OK confirmation and PO submission details, or FLAGGED audit warning
+            """
+            return self._sim.file_justification(ticket_id, reason)
+        @mcp.tool
+        def end_shift() -> str:
+            """
+            End the current shift and advance the simulation by one day.
+            Commits all pending decisions. Simulates demand, deliveries, and expiry for the day.
+            Resets your action budget for the next shift.
+            Unspent actions are lost — no rollover.
+            Returns:
+                Day summary report + next shift dashboard
+            """
+            return self._sim.end_shift_tool()
+        # Initialise Environment base (no MCPEnvironment in MRO)
+        super().__init__()
+        # Composition: delegate tool dispatch to internal MCPEnvironment
+        self._mcp_env = _MedchainMCPDelegate(mcp)
+    # ── OpenEnv Interface ──────────────────────────────────────────────────
+    def reset(
+        self,
+        seed: Optional[int] = None,
+        episode_id: Optional[str] = None,
+        **kwargs: Any,
+    ) -> Observation:
+        self._step_count = 0
+        ep_id    = episode_id or str(uuid.uuid4())
+        seed_val = seed if seed is not None else 42
+        dashboard = self._sim.reset(seed=seed_val, episode_id=ep_id)
+        return MedObservation(
+            dashboard=dashboard,
+            available_tools=AVAILABLE_TOOLS,
+            episode_id=ep_id,
+            done=False,
+            reward=0.0,
+        )
+    def step(
+        self,
+        action: Action,
+        timeout_s: Optional[float] = None,
+        **kwargs: Any,
+    ) -> Observation:
+        self._step_count += 1
+        if isinstance(action, ListToolsAction):
+            return self._mcp_env._handle_list_tools()
+        if isinstance(action, CallToolAction):
+            return self._handle_call_tool(action, timeout_s=timeout_s)
+        return Observation(
+            done=False,
+            reward=0.0,
+            metadata={
+                "error": (
+                    f"Unknown action type: {type(action).__name__}. "
+                    f"Use CallToolAction with one of: {AVAILABLE_TOOLS}"
+                )
+            },
+        )
+    def _handle_call_tool(
+        self,
+        action: CallToolAction,
+        timeout_s: Optional[float] = None,
+    ) -> MedchainToolObservation:
+        """
+        Dispatch a CallToolAction via the internal MCPEnvironment, compute
+        reward, and return a MedchainToolObservation with a plain float reward.
+        """
+        call_obs: CallToolObservation = self._mcp_env._handle_call_tool(
+            action, timeout_s=timeout_s
+        )
+        result_text = self._extract_result_text(call_obs)
+        is_error = (call_obs.error is not None) or result_text.startswith("ERROR")
+        if action.tool_name == "end_shift":
+            reward = self._sim.get_last_reward()
+            done = self._sim.is_done()
+        else:
+            reward = self._shaping_reward(action.tool_name, result_text, is_error)
+            done = False
+        error_msg: Optional[str] = None
+        if call_obs.error is not None:
+            err = call_obs.error
+            error_msg = err.message if hasattr(err, "message") else str(err)
+        return MedchainToolObservation(
+            tool_name=action.tool_name,
+            tool_result=result_text,
+            error_msg=error_msg,
+            done=done,
+            reward=reward,
+        )
+    @staticmethod
+    def _extract_result_text(call_obs: CallToolObservation) -> str:
+        """Extract plain text from a CallToolObservation."""
+        if call_obs.error is not None:
+            err = call_obs.error
+            msg = err.message if hasattr(err, "message") else str(err)
+            return f"ERROR: {msg}"
+        r = call_obs.result
+        if r is None:
+            return ""
+        if hasattr(r, "data") and r.data is not None:
+            return str(r.data)
+        if hasattr(r, "content") and r.content:
+            first = r.content[0]
+            return first.text if hasattr(first, "text") else str(first)
+        return str(r)
+    def _shaping_reward(
+        self, tool_name: str, result_text: str, is_error: bool
+    ) -> float:
+        """Per-call shaping rewards. Small signals that teach good ERP behaviour."""
+        if is_error and tool_name not in ("file_justification",):
+            return 0.0
+        state = self._sim._state
+        if state is None:
+            return 0.0
+        if tool_name in ("read_inbox", "query_erp"):
+            if tool_name not in state.info_rewards_given_this_shift:
+                if tool_name == "read_inbox" and "INBOX EMPTY" in result_text:
+                    return 0.0
+                state.info_rewards_given_this_shift.add(tool_name)
+                return 0.01
+            return 0.0
+        if tool_name == "submit_po":
+            if not is_error and "BUDGET_OVERRIDE_REQUIRED" not in result_text:
+                return 0.02
+            return 0.0
+        if tool_name == "transfer" and not is_error:
+            return 0.01
+        if tool_name == "quarantine_lot" and not is_error:
+            return 0.01
+        if tool_name == "file_justification":
+            if "FLAGGED" in result_text:
+                return -0.05
+            return 0.01
+        return 0.0
+    @property
+    def state(self) -> MedchainState:
+        s = self._sim._state
+        if s is None:
+            return MedchainState()
+        return MedchainState(
+            episode_id=s.episode_id,
+            step_count=self._step_count,
+            task=s.task,
+            day=s.day,
+            max_days=s.max_days,
+            actions_remaining=s.actions_remaining,
+            budget_used=s.budget_used,
+            budget_limit=s.budget_limit,
+            unread_messages=sum(1 for m in s.inbox if not m.read),
+            orders_in_transit=sum(1 for po in s.pipeline_orders if po.status == "in_transit"),
+        )

server/requirements.txt ADDED Viewed

	@@ -0,0 +1,6 @@

+openenv-core[core]>=0.2.1
+fastmcp>=2.0.0
+numpy>=1.24.0
+fastapi>=0.115.0
+uvicorn>=0.24.0
+pydantic>=2.0.0

server/simulation.py ADDED Viewed

	@@ -0,0 +1,994 @@

+"""
+Core simulation engine for the MedChain Env environment.
+MedchainSimulation manages the full episode lifecycle:
+  - Inventory tracked as FEFO lots per (location, product)
+  - Purchase order pipeline with stochastic lead times
+  - Event-driven inbox messages (crises, recalls, demand surges)
+  - Daily demand generation and fulfillment
+"""
+from __future__ import annotations
+import uuid
+from dataclasses import dataclass, field
+from typing import Dict, List, Optional, Set, Tuple
+import numpy as np
+from .tasks import SimEvent, TaskConfig
+# ─── Simulation Dataclasses ───────────────────────────────────────────────────
+@dataclass
+class Lot:
+    lot_id: str
+    qty: int
+    expiry_day: Optional[int]   # None = non-perishable. Expired when current_day >= expiry_day.
+    cost_per_unit: float
+@dataclass
+class PurchaseOrder:
+    po_id: str
+    supplier_id: str
+    product_id: str
+    destination_id: str
+    quantity: int
+    priority: str               # "standard" or "expedited"
+    day_submitted: int
+    eta_day: int
+    unit_cost: float
+    total_cost: float
+    status: str                 # "pending_justification", "in_transit", "delivered"
+    lot_id: str
+@dataclass
+class PendingBudgetOverride:
+    ticket_id: str
+    po: PurchaseOrder
+@dataclass
+class InboxMessage:
+    msg_id: str
+    priority: str
+    timestamp_str: str          # "Day {n} {HH:MM}"
+    sender: str
+    subject: str
+    body: str
+    read: bool
+    flagged: bool
+    event_id: str
+@dataclass
+class JustificationRecord:
+    ticket_id: str
+    po_id: str
+    reason: str
+    is_coherent: bool
+@dataclass
+class SimState:
+    # Episode meta
+    task: str
+    episode_id: str
+    seed: int
+    rng: np.random.Generator
+    # Time
+    day: int
+    max_days: int
+    # Action budget
+    actions_remaining: int
+    actions_per_shift: int
+    # Budget
+    budget_used: float
+    budget_limit: float
+    # Inventory: (location_id, product_id) -> List[Lot]  (FEFO-sorted)
+    inventory: Dict[Tuple[str, str], List[Lot]]
+    # Orders
+    pipeline_orders: List[PurchaseOrder]
+    po_counter: int
+    # Inbox
+    inbox: List[InboxMessage]
+    msg_counter: int
+    # Budget override tickets
+    pending_overrides: Dict[str, PendingBudgetOverride]
+    # Quarantine
+    quarantined_lots: Set[str]
+    # Demand / fulfillment tracking (one value per completed day)
+    daily_demand: List[float]
+    daily_fulfilled: List[float]
+    daily_critical_demand: List[float]
+    daily_critical_fulfilled: List[float]
+    # Per-(location, product) daily tracking (for demand_history queries)
+    daily_product_demand: Dict[Tuple[str, str], List[int]]
+    daily_product_fulfilled: Dict[Tuple[str, str], List[int]]
+    # Spend tracking
+    total_spend: float
+    total_wasted_value: float
+    # Transfer tracking (task 2)
+    transfer_count: int
+    transfer_cost_paid: float
+    # Capacity violations (task 2)
+    capacity_violation_days: int
+    # Active event effects: event_id -> last_day_active (inclusive)
+    active_events: Dict[str, int]
+    # Per-shift shaping reward helpers
+    info_rewards_given_this_shift: Set[str]
+    daily_stockout_count: int
+    daily_expired_lots: int
+    # Task 3 crisis tracking
+    justification_log: List[JustificationRecord]
+    mci_preemptive_order: bool
+    recall_handled_by_day: Optional[int]
+# ─── MedchainSimulation ───────────────────────────────────────────────────────
+class MedchainSimulation:
+    """
+    Core simulation engine. Called by MedchainEnvironment's MCP tools.
+    All public tool methods return a string (displayed to agent as ERP output).
+    end_shift_tool() also stores _last_reward and _done for the environment.
+    """
+    def __init__(self, task_config: TaskConfig):
+        self._task = task_config
+        self._state: Optional[SimState] = None
+        self._last_reward: float = 0.0
+        self._done: bool = False
+    # ── Called by environment.reset() ──────────────────────────────────────
+    def reset(self, seed: int, episode_id: str) -> str:
+        """Initialize a new episode. Returns dashboard text."""
+        self._done = False
+        self._last_reward = 0.0
+        rng = np.random.default_rng(seed)
+        self._state = SimState(
+            task=self._task.name,
+            episode_id=episode_id,
+            seed=seed,
+            rng=rng,
+            day=1,
+            max_days=self._task.max_days,
+            actions_remaining=self._task.actions_per_shift,
+            actions_per_shift=self._task.actions_per_shift,
+            budget_used=0.0,
+            budget_limit=self._task.budget_limit,
+            inventory={},
+            pipeline_orders=[],
+            po_counter=1,
+            inbox=[],
+            msg_counter=1,
+            pending_overrides={},
+            quarantined_lots=set(),
+            daily_demand=[],
+            daily_fulfilled=[],
+            daily_critical_demand=[],
+            daily_critical_fulfilled=[],
+            daily_product_demand={},
+            daily_product_fulfilled={},
+            total_spend=0.0,
+            total_wasted_value=0.0,
+            transfer_count=0,
+            transfer_cost_paid=0.0,
+            capacity_violation_days=0,
+            active_events={},
+            info_rewards_given_this_shift=set(),
+            daily_stockout_count=0,
+            daily_expired_lots=0,
+            justification_log=[],
+            mci_preemptive_order=False,
+            recall_handled_by_day=None,
+        )
+        self._initialize_inventory()
+        self._inject_day1_inbox()
+        from .erp_formatter import format_dashboard
+        return format_dashboard(self._state, self._task)
+    def _initialize_inventory(self):
+        """Seed initial inventory: initial_stock_days × base_demand per location/product."""
+        state = self._state
+        for product in self._task.products:
+            for loc_id in product.locations:
+                key = (loc_id, product.product_id)
+                qty = int(product.base_demand * self._task.initial_stock_days)
+                expiry_day = (
+                    1 + int(product.shelf_life_days * 0.7)
+                    if product.shelf_life_days is not None
+                    else None
+                )
+                lot = Lot(
+                    lot_id=f"INIT-{product.product_id}-{loc_id}",
+                    qty=qty,
+                    expiry_day=expiry_day,
+                    cost_per_unit=product.unit_cost,
+                )
+                state.inventory[key] = [lot]
+    def _inject_day1_inbox(self):
+        """Add Day 1 inbox messages (welcome + any Day 1 events)."""
+        state = self._state
+        welcome = InboxMessage(
+            msg_id=f"MSG-{state.msg_counter:04d}",
+            priority="LOW",
+            timestamp_str="Day 1 08:00",
+            sender="System",
+            subject="Shift Handover Notes",
+            body=(
+                f"Welcome to the {self._task.name} scenario.\n"
+                f"You are managing medical supplies for {self._task.max_days} days.\n"
+                f"Action budget: {self._task.actions_per_shift} actions per shift.\n"
+                f"Budget ceiling: ${self._task.budget_limit:,.0f} outstanding orders.\n\n"
+                "Use read_inbox to check messages, query_erp to check stock,\n"
+                "submit_po to order supplies, and end_shift to advance the day."
+            ),
+            read=False,
+            flagged=False,
+            event_id="system_welcome",
+        )
+        state.inbox.append(welcome)
+        state.msg_counter += 1
+        self._inject_events_for_day(1)
+    # ── Action Budget Helper ────────────────────────────────────────────────
+    def _check_action_budget(self, tool_name: str) -> Optional[str]:
+        """Returns error string if budget exhausted, None if OK. Does NOT decrement."""
+        if tool_name == "end_shift":
+            return None
+        if self._state is None:
+            return "ERROR: Environment not initialized. Call reset() first."
+        if self._state.actions_remaining <= 0:
+            return (
+                "ERROR: Action budget exhausted for this shift.\n"
+                f"Actions used: {self._state.actions_per_shift}/{self._state.actions_per_shift}\n"
+                "Call end_shift() to advance to the next day and restore your action budget."
+            )
+        return None
+    # ── MCP Tool Implementations ────────────────────────────────────────────
+    def read_inbox(self, filter: str = "unread") -> str:
+        err = self._check_action_budget("read_inbox")
+        if err:
+            return err
+        self._state.actions_remaining -= 1
+        messages = list(self._state.inbox)
+        if filter == "unread":
+            messages = [m for m in messages if not m.read]
+        elif filter == "flagged":
+            messages = [m for m in messages if m.flagged]
+        # "all" → use full inbox
+        for m in messages:
+            m.read = True
+        if not messages:
+            return f"INBOX EMPTY\nFilter: {filter} | No messages matching filter."
+        lines = []
+        for m in messages:
+            read_status = "READ" if m.read else "UNREAD"
+            lines.append(
+                f"\n[MSG {m.msg_id} | {read_status} | PRIORITY: {m.priority} | {m.timestamp_str}]"
+            )
+            lines.append(f"FROM: {m.sender}")
+            lines.append(f"SUBJ: {m.subject}")
+            lines.append("")
+            lines.append(m.body)
+            lines.append("")
+        return "\n".join(lines)
+    def query_erp(self, table: str, location: str = "all", sku: str = "all") -> str:
+        err = self._check_action_budget("query_erp")
+        if err:
+            return err
+        self._state.actions_remaining -= 1
+        valid_tables = ["inventory", "expiry", "pipeline_orders", "demand_history"]
+        if table not in valid_tables:
+            return f"ERROR: Unknown table '{table}'. Valid tables: {valid_tables}"
+        from .erp_formatter import (
+            format_demand_history,
+            format_expiry_table,
+            format_inventory_table,
+            format_pipeline_table,
+        )
+        if table == "inventory":
+            return format_inventory_table(self._state, self._task, location, sku)
+        elif table == "expiry":
+            return format_expiry_table(self._state, self._task, location, sku)
+        elif table == "pipeline_orders":
+            return format_pipeline_table(self._state, location, sku)
+        elif table == "demand_history":
+            return format_demand_history(self._state, self._task, location, sku)
+        return "ERROR: Unexpected table."
+    def query_supplier(self, supplier_id: str) -> str:
+        err = self._check_action_budget("query_supplier")
+        if err:
+            return err
+        self._state.actions_remaining -= 1
+        supplier = next((s for s in self._task.suppliers if s.supplier_id == supplier_id), None)
+        if not supplier:
+            available = [s.supplier_id for s in self._task.suppliers]
+            return f"ERROR: Supplier '{supplier_id}' not found. Available: {available}"
+        effective_lead_time = supplier.base_lead_time
+        disruption_note = "No disruptions reported."
+        for event_id, last_day in self._state.active_events.items():
+            event = next((e for e in self._task.events if e.event_id == event_id), None)
+            if (
+                event
+                and event.event_type == "supplier_disruption"
+                and event.params.get("supplier_id") == supplier_id
+            ):
+                effective_lead_time = event.params["new_lead_time"]
+                disruption_note = (
+                    f"ACTIVE DISRUPTION: Lead time extended to {effective_lead_time} days. "
+                    f"Reason: {event.params['reason']}"
+                )
+        from .erp_formatter import format_supplier_info
+        return format_supplier_info(supplier, effective_lead_time, disruption_note)
+    def query_forecast(self, product_id: str, location_id: str, horizon_days: int = 7) -> str:
+        err = self._check_action_budget("query_forecast")
+        if err:
+            return err
+        self._state.actions_remaining -= 1
+        horizon_days = max(1, min(21, horizon_days))
+        product = next((p for p in self._task.products if p.product_id == product_id), None)
+        if not product:
+            return f"ERROR: Product '{product_id}' not found."
+        if location_id not in product.locations and location_id != "all":
+            return f"ERROR: Product '{product_id}' is not stocked at '{location_id}'."
+        from .erp_formatter import format_forecast
+        return format_forecast(self._state, self._task, product, location_id, horizon_days)
+    def submit_po(
+        self,
+        supplier_id: str,
+        product_id: str,
+        destination_id: str,
+        quantity: int,
+        priority: str = "standard",
+    ) -> str:
+        err = self._check_action_budget("submit_po")
+        if err:
+            return err
+        if priority not in ("standard", "expedited"):
+            return "ERROR: priority must be 'standard' or 'expedited'."
+        if quantity <= 0:
+            return "ERROR: quantity must be positive."
+        supplier = next((s for s in self._task.suppliers if s.supplier_id == supplier_id), None)
+        if not supplier:
+            return f"ERROR: Supplier '{supplier_id}' not found."
+        if product_id not in supplier.products:
+            return f"ERROR: Supplier '{supplier_id}' does not supply '{product_id}'."
+        valid_locs = [l.location_id for l in self._task.locations]
+        if destination_id not in valid_locs:
+            return f"ERROR: Destination '{destination_id}' not found. Valid: {valid_locs}"
+        product = next((p for p in self._task.products if p.product_id == product_id), None)
+        expedited_multiplier = 1.5 if priority == "expedited" else 1.0
+        unit_cost  = product.unit_cost * supplier.cost_multiplier * expedited_multiplier
+        total_cost = unit_cost * quantity
+        if self._state.budget_used + total_cost > self._state.budget_limit:
+            overage = (self._state.budget_used + total_cost) - self._state.budget_limit
+            return (
+                f"ERROR: BUDGET_EXCEEDED\n"
+                f"Order cost: ${total_cost:,.2f} | "
+                f"Current outstanding: ${self._state.budget_used:,.2f} | "
+                f"Limit: ${self._state.budget_limit:,.2f}\n"
+                f"Overage: ${overage:,.2f}\n"
+                f"Reduce order quantity or wait for existing orders to be delivered."
+            )
+        # Effective lead time (check active disruptions)
+        lead_time = supplier.base_lead_time
+        for event_id, last_day in self._state.active_events.items():
+            event = next((e for e in self._task.events if e.event_id == event_id), None)
+            if (
+                event
+                and event.event_type == "supplier_disruption"
+                and event.params.get("supplier_id") == supplier_id
+            ):
+                lead_time = event.params["new_lead_time"]
+        if priority == "expedited":
+            lead_time = max(1, lead_time - 2)
+        # Stochastic jitter for task 3
+        if supplier.lead_time_std > 0:
+            jitter = int(round(self._state.rng.normal(0, supplier.lead_time_std)))
+            lead_time = max(1, lead_time + jitter)
+        eta_day = self._state.day + lead_time
+        po_id  = f"POD-{self._state.po_counter:04d}"
+        lot_id = f"LOT-{po_id}"
+        self._state.po_counter += 1
+        # Expedited: requires justification
+        if priority == "expedited":
+            ticket_id = f"BOT-{self._state.po_counter:04d}"
+            self._state.po_counter += 1
+            po = PurchaseOrder(
+                po_id=po_id, supplier_id=supplier_id, product_id=product_id,
+                destination_id=destination_id, quantity=quantity, priority=priority,
+                day_submitted=self._state.day, eta_day=eta_day, unit_cost=unit_cost,
+                total_cost=total_cost, status="pending_justification", lot_id=lot_id,
+            )
+            self._state.pending_overrides[ticket_id] = PendingBudgetOverride(
+                ticket_id=ticket_id, po=po
+            )
+            self._state.actions_remaining -= 1
+            return (
+                f"ERROR: BUDGET_OVERRIDE_REQUIRED\n"
+                f"Order {po_id} ({priority}, ${total_cost:,.2f} incl. 50% expedite premium) "
+                f"requires justification.\n"
+                f"Ticket ID: {ticket_id}\n"
+                f"Use file_justification(ticket_id=\"{ticket_id}\", reason=\"...\") to proceed.\n"
+                f"Justification will be audited by Finance. False justifications are flagged."
+            )
+        # Standard order: submit immediately
+        self._state.actions_remaining -= 1
+        po = PurchaseOrder(
+            po_id=po_id, supplier_id=supplier_id, product_id=product_id,
+            destination_id=destination_id, quantity=quantity, priority=priority,
+            day_submitted=self._state.day, eta_day=eta_day, unit_cost=unit_cost,
+            total_cost=total_cost, status="in_transit", lot_id=lot_id,
+        )
+        self._state.pipeline_orders.append(po)
+        self._state.budget_used += total_cost
+        return (
+            f"OK — PO {po_id} submitted.\n"
+            f"Product: {product_id} × {quantity} units\n"
+            f"Supplier: {supplier_id} | Priority: {priority}\n"
+            f"Destination: {destination_id} | ETA: Day {eta_day}\n"
+            f"Cost: ${total_cost:,.2f} | "
+            f"Budget remaining: ${self._state.budget_limit - self._state.budget_used:,.2f}"
+        )
+    def transfer(
+        self,
+        from_location_id: str,
+        to_location_id: str,
+        product_id: str,
+        quantity: int,
+    ) -> str:
+        err = self._check_action_budget("transfer")
+        if err:
+            return err
+        self._state.actions_remaining -= 1
+        if quantity <= 0:
+            return "ERROR: quantity must be positive."
+        valid_locs = {l.location_id for l in self._task.locations}
+        if from_location_id not in valid_locs:
+            return f"ERROR: Location '{from_location_id}' not found."
+        if to_location_id not in valid_locs:
+            return f"ERROR: Location '{to_location_id}' not found."
+        key_from = (from_location_id, product_id)
+        lots = sorted(
+            [
+                l for l in self._state.inventory.get(key_from, [])
+                if l.lot_id not in self._state.quarantined_lots
+            ],
+            key=lambda l: (l.expiry_day is None, l.expiry_day or 0),
+        )
+        available = sum(l.qty for l in lots)
+        if available < quantity:
+            return (
+                f"ERROR: Insufficient stock at {from_location_id}. "
+                f"Available: {available} units of {product_id}."
+            )
+        # Check destination capacity (task 2)
+        dest_loc = next(
+            (l for l in self._task.locations if l.location_id == to_location_id), None
+        )
+        if dest_loc and dest_loc.capacity is not None:
+            current_at_dest = sum(
+                sum(lot.qty for lot in lots2)
+                for (loc, pid), lots2 in self._state.inventory.items()
+                if loc == to_location_id
+            )
+            if current_at_dest + quantity > dest_loc.capacity:
+                return (
+                    f"ERROR: CAPACITY_EXCEEDED — {to_location_id} capacity {dest_loc.capacity}. "
+                    f"Current: {current_at_dest}, Transfer: {quantity}."
+                )
+        # FEFO transfer
+        remaining = quantity
+        key_to = (to_location_id, product_id)
+        if key_to not in self._state.inventory:
+            self._state.inventory[key_to] = []
+        for lot in lots:
+            if remaining <= 0:
+                break
+            take = min(remaining, lot.qty)
+            lot.qty -= take
+            remaining -= take
+            self._state.inventory[key_to].append(
+                Lot(
+                    lot_id=f"XFR-{lot.lot_id}",
+                    qty=take,
+                    expiry_day=lot.expiry_day,
+                    cost_per_unit=lot.cost_per_unit,
+                )
+            )
+        self._state.inventory[key_from] = [
+            l for l in self._state.inventory[key_from] if l.qty > 0
+        ]
+        TRANSFER_FEE = 0.5
+        fee = quantity * TRANSFER_FEE
+        self._state.transfer_count += 1
+        self._state.transfer_cost_paid += fee
+        return (
+            f"OK — Transfer complete.\n"
+            f"{quantity} units of {product_id}: {from_location_id} → {to_location_id}\n"
+            f"Transfer fee: ${fee:.2f}"
+        )
+    def quarantine_lot(self, location_id: str, sku: str, lot_id: str) -> str:
+        err = self._check_action_budget("quarantine_lot")
+        if err:
+            return err
+        self._state.actions_remaining -= 1
+        valid_locs = {l.location_id for l in self._task.locations}
+        if location_id not in valid_locs:
+            return f"ERROR: Location '{location_id}' not found."
+        key = (location_id, sku)
+        lots = self._state.inventory.get(key, [])
+        if lot_id == "all":
+            target_lots = [l for l in lots]
+        else:
+            target_lots = [l for l in lots if l.lot_id == lot_id]
+            if not target_lots:
+                target_lots = [l for l in lots if lot_id in l.lot_id]
+        if not target_lots:
+            available_lots = [l.lot_id for l in lots]
+            return (
+                f"ERROR: Lot '{lot_id}' not found at {location_id} for SKU {sku}. "
+                f"Available lots: {available_lots}"
+            )
+        quarantined_qty = 0
+        disposal_ids = []
+        for lot in target_lots:
+            if lot.lot_id not in self._state.quarantined_lots:
+                self._state.quarantined_lots.add(lot.lot_id)
+                quarantined_qty += lot.qty
+                disposal_ids.append(lot.lot_id)
+        # Track recall completion for task 3
+        if sku == "IV-SAL-500" and "RECALL-LOT" in lot_id:
+            self._check_recall_completion()
+        disposal_ticket = f"DIS-{self._state.po_counter:04d}"
+        self._state.po_counter += 1
+        return (
+            f"OK — Quarantine complete.\n"
+            f"SKU: {sku} | Location: {location_id}\n"
+            f"Lots quarantined: {disposal_ids}\n"
+            f"Units quarantined: {quarantined_qty}\n"
+            f"Disposal ticket: {disposal_ticket} created."
+        )
+    def file_justification(self, ticket_id: str, reason: str) -> str:
+        err = self._check_action_budget("file_justification")
+        if err:
+            return err
+        self._state.actions_remaining -= 1
+        if ticket_id not in self._state.pending_overrides:
+            return (
+                f"ERROR: Ticket '{ticket_id}' not found or already processed.\n"
+                f"Active tickets: {list(self._state.pending_overrides.keys())}"
+            )
+        override = self._state.pending_overrides.pop(ticket_id)
+        po = override.po
+        active_event_types: Set[str] = set()
+        for event_id in self._state.active_events:
+            event = next((e for e in self._task.events if e.event_id == event_id), None)
+            if event:
+                active_event_types.add(event.event_type)
+        from .grader import grade_justification
+        is_coherent = grade_justification(reason, active_event_types)
+        record = JustificationRecord(
+            ticket_id=ticket_id, po_id=po.po_id, reason=reason, is_coherent=is_coherent
+        )
+        self._state.justification_log.append(record)
+        po.status = "in_transit"
+        self._state.pipeline_orders.append(po)
+        self._state.budget_used += po.total_cost
+        audit_note = ""
+        if not is_coherent:
+            audit_note = (
+                "\n[AUDIT FLAG] Justification does not reference active crisis conditions. "
+                "Flagged for Finance review. Penalty applied."
+            )
+        return (
+            f"OK — Justification {'accepted' if is_coherent else 'FLAGGED'}. "
+            f"PO {po.po_id} submitted.\n"
+            f"Product: {po.product_id} × {po.quantity} units | Destination: {po.destination_id}\n"
+            f"ETA: Day {po.eta_day} | Cost: ${po.total_cost:,.2f}"
+            f"{audit_note}"
+        )
+    def end_shift_tool(self) -> str:
+        """Advance simulation by one day. Stores _last_reward and _done."""
+        state = self._state
+        if state is None:
+            return "ERROR: Environment not initialized."
+        day = state.day
+        report_lines = [f"╔═══ END OF SHIFT — Day {day} {'═' * 40}╗"]
+        # ── Step 1: Deliver arriving orders ──────────────────────────────
+        delivered = []
+        for po in list(state.pipeline_orders):
+            if po.eta_day <= day:
+                product = next(
+                    (p for p in self._task.products if p.product_id == po.product_id), None
+                )
+                key = (po.destination_id, po.product_id)
+                if key not in state.inventory:
+                    state.inventory[key] = []
+                expiry_day = (day + product.shelf_life_days) if product.shelf_life_days else None
+                lot = Lot(
+                    lot_id=po.lot_id, qty=po.quantity,
+                    expiry_day=expiry_day, cost_per_unit=po.unit_cost
+                )
+                state.inventory[key].append(lot)
+                state.budget_used -= po.total_cost
+                state.total_spend += po.total_cost
+                po.status = "delivered"
+                delivered.append(po)
+        state.pipeline_orders = [po for po in state.pipeline_orders if po.status != "delivered"]
+        if delivered:
+            report_lines.append(f"  DELIVERIES: {len(delivered)} order(s) received.")
+        # ── Step 2: Expire old lots ───────────────────────────────────────
+        total_expired_units = 0
+        total_expired_value = 0.0
+        for key in list(state.inventory.keys()):
+            fresh, expired = [], []
+            for lot in state.inventory[key]:
+                if lot.expiry_day is not None and lot.expiry_day <= day:
+                    expired.append(lot)
+                else:
+                    fresh.append(lot)
+            if expired:
+                for lot in expired:
+                    total_expired_units += lot.qty
+                    total_expired_value += lot.qty * lot.cost_per_unit
+                    state.total_wasted_value += lot.qty * lot.cost_per_unit
+                state.daily_expired_lots += len(expired)
+            state.inventory[key] = fresh
+        if total_expired_units > 0:
+            report_lines.append(
+                f"  EXPIRED: {total_expired_units} units (${total_expired_value:,.2f} written off)"
+            )
+        # ── Step 3: Generate and fulfill demand ───────────────────────────
+        day_demand = 0.0
+        day_fulfilled = 0.0
+        day_critical_demand = 0.0
+        day_critical_fulfilled = 0.0
+        for product in self._task.products:
+            for loc_id in product.locations:
+                demand    = self._generate_demand(product, loc_id, day)
+                fulfilled = self._fefo_fulfill(product.product_id, loc_id, demand, day)
+                day_demand    += demand
+                day_fulfilled += fulfilled
+                if product.criticality == "CRITICAL":
+                    day_critical_demand    += demand
+                    day_critical_fulfilled += fulfilled
+                # Per-product daily tracking
+                key = (loc_id, product.product_id)
+                if key not in state.daily_product_demand:
+                    state.daily_product_demand[key] = []
+                    state.daily_product_fulfilled[key] = []
+                state.daily_product_demand[key].append(demand)
+                state.daily_product_fulfilled[key].append(fulfilled)
+        state.daily_demand.append(day_demand)
+        state.daily_fulfilled.append(day_fulfilled)
+        state.daily_critical_demand.append(day_critical_demand)
+        state.daily_critical_fulfilled.append(day_critical_fulfilled)
+        day_svc = day_fulfilled / max(day_demand, 1)
+        report_lines.append(
+            f"  DEMAND: {int(day_demand)} units | FULFILLED: {int(day_fulfilled)} ({100 * day_svc:.1f}%)"
+        )
+        # ── Step 4: Check capacity violations (task 2) ────────────────────
+        if any(l.capacity is not None for l in self._task.locations):
+            for location in self._task.locations:
+                if location.capacity is None:
+                    continue
+                current = sum(
+                    sum(lot.qty for lot in lots)
+                    for (lid, pid), lots in state.inventory.items()
+                    if lid == location.location_id
+                )
+                if current > location.capacity:
+                    state.capacity_violation_days += 1
+        # ── Step 5: Inject recall lot for task 3 (Day 2, silent) ──────��──
+        if self._task.name == "hospital_network_crisis" and day == 2:
+            self._inject_recall_lot()
+        # ── Step 6: Advance day, reset budget, inject next-day events ────
+        state.day += 1
+        state.actions_remaining = state.actions_per_shift
+        self._update_active_events(state.day)
+        self._inject_events_for_day(state.day)
+        # ── Step 7: Daily shaping reward ──────────────────────────────────
+        shaping = 0.0
+        day_service = day_fulfilled / max(day_demand, 1)
+        shaping += 0.10 * day_service
+        total_units = sum(
+            lot.qty
+            for lots in state.inventory.values()
+            for lot in lots
+            if lot.lot_id not in state.quarantined_lots
+        )
+        shaping -= 0.00005 * total_units
+        shaping -= min(0.30, state.daily_expired_lots * 0.10)
+        shaping -= min(0.50, state.daily_stockout_count * 0.20)
+        state.info_rewards_given_this_shift = set()
+        state.daily_stockout_count = 0
+        state.daily_expired_lots   = 0
+        # ── Step 8: Compute terminal score & check done ───────────────────
+        from .grader import compute_reward
+        final_score = compute_reward(state, self._task)
+        done = state.day > state.max_days
+        if done:
+            report_lines.append(
+                f"╠═══ EPISODE COMPLETE — Final Score: {final_score:.3f} {'═' * 30}╣"
+            )
+            total_d = sum(state.daily_demand)
+            total_f = sum(state.daily_fulfilled)
+            report_lines.append(
+                f"  Service Level:  {total_f / max(total_d, 1) * 100:.1f}%"
+            )
+            report_lines.append(f"  Total Spend:    ${state.total_spend:,.2f}")
+            report_lines.append(f"  Waste Value:    ${state.total_wasted_value:,.2f}")
+            report_lines.append(f"╚{'═' * 68}╝")
+            self._done = True
+            self._last_reward = final_score
+            return "\n".join(report_lines)
+        self._done = False
+        self._last_reward = shaping
+        report_lines.append(
+            f"╚═══ Day {day} committed. Day {state.day} begins. {'═' * 38}╝"
+        )
+        report_lines.append("")
+        from .erp_formatter import format_dashboard
+        report_lines.append(format_dashboard(state, self._task))
+        return "\n".join(report_lines)
+    # ── Private Helpers ────────────────────────────────────────────────────
+    def _generate_demand(self, product, location_id: str, day: int) -> int:
+        import math as _math
+        state = self._state
+        base  = product.base_demand
+        if product.seasonal_amplitude > 0 and product.seasonal_period > 0:
+            seasonal = product.seasonal_amplitude * _math.sin(
+                2 * _math.pi * day / product.seasonal_period + product.seasonal_phase
+            )
+            base *= (1 + seasonal)
+        for event_id, last_day in state.active_events.items():
+            event = next((e for e in self._task.events if e.event_id == event_id), None)
+            if event is None:
+                continue
+            if event.event_type == "mci":
+                if (
+                    product.criticality in ("CRITICAL", "HIGH")
+                    and location_id in event.params.get("locations", [])
+                ):
+                    base *= event.params.get("demand_multiplier", 3.0)
+            elif event.event_type == "demand_surge":
+                if product.product_id in event.params.get("products", []):
+                    base *= event.params.get("multiplier", 1.4)
+        noise = state.rng.normal(0, product.demand_std)
+        return max(0, int(round(base + noise)))
+    def _fefo_fulfill(
+        self, product_id: str, location_id: str, demand: int, day: int
+    ) -> int:
+        state = self._state
+        key   = (location_id, product_id)
+        lots  = state.inventory.get(key, [])
+        lots_sorted = sorted(
+            [l for l in lots if l.lot_id not in state.quarantined_lots and l.qty > 0],
+            key=lambda l: (l.expiry_day is None, l.expiry_day or 0),
+        )
+        fulfilled = 0
+        for lot in lots_sorted:
+            if fulfilled >= demand:
+                break
+            take = min(demand - fulfilled, lot.qty)
+            lot.qty   -= take
+            fulfilled += take
+        state.inventory[key] = [l for l in lots if l.qty > 0]
+        if fulfilled < demand:
+            state.daily_stockout_count += 1
+        return fulfilled
+    def _update_active_events(self, day: int):
+        state = self._state
+        state.active_events = {
+            eid: last_day
+            for eid, last_day in state.active_events.items()
+            if last_day >= day
+        }
+        for event in self._task.events:
+            if event.trigger_day == day and event.duration_days > 0:
+                state.active_events[event.event_id] = day + event.duration_days - 1
+    def _inject_events_for_day(self, day: int):
+        state = self._state
+        for event in self._task.events:
+            if event.trigger_day == day:
+                msg = InboxMessage(
+                    msg_id=f"MSG-{state.msg_counter:04d}",
+                    priority=event.message.priority,
+                    timestamp_str=f"Day {day} 06:00",
+                    sender=event.message.sender,
+                    subject=event.message.subject,
+                    body=event.message.body,
+                    read=False,
+                    flagged=(event.message.priority == "CRITICAL"),
+                    event_id=event.event_id,
+                )
+                state.inbox.append(msg)
+                state.msg_counter += 1
+                if event.event_type == "cold_chain_breach":
+                    self._apply_cold_chain_breach(event)
+                if event.event_type == "budget_tighten":
+                    state.budget_limit = event.params["new_budget_limit"]
+            if event.warning_message and event.trigger_day - 1 == day:
+                msg = InboxMessage(
+                    msg_id=f"MSG-{state.msg_counter:04d}",
+                    priority=event.warning_message.priority,
+                    timestamp_str=f"Day {day} 18:00",
+                    sender=event.warning_message.sender,
+                    subject=event.warning_message.subject,
+                    body=event.warning_message.body,
+                    read=False,
+                    flagged=False,
+                    event_id=f"{event.event_id}_warning",
+                )
+                state.inbox.append(msg)
+                state.msg_counter += 1
+    def _apply_cold_chain_breach(self, event: SimEvent):
+        state = self._state
+        loc  = event.params["location_id"]
+        prod = event.params["product_id"]
+        key  = (loc, prod)
+        for lot in state.inventory.get(key, []):
+            state.quarantined_lots.add(lot.lot_id)
+    def _inject_recall_lot(self):
+        state = self._state
+        recall_lot_id = "RECALL-LOT-IV2026-9821"
+        for event in self._task.events:
+            if event.event_id == "iv_saline_recall":
+                qty     = event.params["qty_per_location"]
+                product = next(
+                    (p for p in self._task.products if p.product_id == "IV-SAL-500"), None
+                )
+                if product is None:
+                    break
+                for loc_id in event.params["locations_with_lot"]:
+                    key = (loc_id, "IV-SAL-500")
+                    if key not in state.inventory:
+                        state.inventory[key] = []
+                    lot = Lot(
+                        lot_id=recall_lot_id,
+                        qty=qty,
+                        expiry_day=None,
+                        cost_per_unit=product.unit_cost,
+                    )
+                    state.inventory[key].append(lot)
+                break
+    def _check_recall_completion(self):
+        state = self._state
+        recall_lot_id = "RECALL-LOT-IV2026-9821"
+        if recall_lot_id not in state.quarantined_lots:
+            return
+        if state.recall_handled_by_day is None:
+            state.recall_handled_by_day = state.day
+    # ── Accessors used by MedchainEnvironment ──────────────────────────────
+    def get_last_reward(self) -> float:
+        return self._last_reward
+    def is_done(self) -> bool:
+        return self._done

server/tasks.py ADDED Viewed

	@@ -0,0 +1,404 @@

+"""
+Task configurations for the MedChain Env environment.
+Four tasks of increasing difficulty:
+  0. orientation_ward        — 2 days,  1 ward,  no events       (intro)
+  1. single_ward_stable      — 3 days,  1 ward,  no events
+  2. multi_ward_seasonal     — 6 days,  4 wards, flu surge + supplier delay
+  3. hospital_network_crisis — 12 days, 4 sites, 5 overlapping crises
+"""
+from dataclasses import dataclass, field
+from typing import Dict, List, Optional
+@dataclass
+class Product:
+    product_id: str
+    name: str
+    shelf_life_days: Optional[int]  # None = non-perishable
+    criticality: str                # "CRITICAL", "HIGH", "NORMAL"
+    unit_cost: float
+    locations: List[str]
+    base_demand: float
+    demand_std: float
+    seasonal_amplitude: float       # 0.0 = no seasonality
+    seasonal_period: int            # 0 = no cycle
+    seasonal_phase: float           # phase offset in radians
+@dataclass
+class Supplier:
+    supplier_id: str
+    name: str
+    base_lead_time: int
+    lead_time_std: float            # 0.0 = deterministic
+    cost_multiplier: float
+    products: List[str]
+@dataclass
+class Location:
+    location_id: str
+    name: str
+    capacity: Optional[int]         # None = unlimited
+@dataclass(frozen=True)
+class InboxMessageTemplate:
+    priority: str
+    sender: str
+    subject: str
+    body: str
+@dataclass
+class SimEvent:
+    event_id: str
+    event_type: str                 # "mci", "supplier_disruption", "product_recall",
+                                    # "demand_surge", "budget_tighten", "cold_chain_breach"
+    trigger_day: int
+    duration_days: int              # 0 = instant/one-off
+    params: Dict
+    message: InboxMessageTemplate
+    warning_message: Optional[InboxMessageTemplate]  # fires on trigger_day-1
+@dataclass
+class TaskConfig:
+    name: str
+    max_days: int
+    actions_per_shift: int
+    budget_limit: float
+    products: List[Product]
+    suppliers: List[Supplier]
+    locations: List[Location]
+    events: List[SimEvent]
+    initial_stock_days: int
+    benchmark_score: float
+# ─── Task 0: Orientation Ward ────────────────────────────────────────────────
+TASK0 = TaskConfig(
+    name="orientation_ward",
+    max_days=2,
+    actions_per_shift=5,
+    budget_limit=5_000.0,
+    initial_stock_days=1,
+    benchmark_score=0.88,
+    locations=[
+        Location("ward_general", "General Ward", capacity=None),
+    ],
+    products=[
+        Product("GLOVE-001", "Surgical Gloves (box)", None, "NORMAL", 5.0, ["ward_general"], 20, 3, 0.0, 0, 0.0),
+        Product("SYR-10",    "Syringes 10ml",         None, "NORMAL", 0.5, ["ward_general"], 30, 5, 0.0, 0, 0.0),
+        Product("MASK-001",  "Surgical Masks",        None, "NORMAL", 1.0, ["ward_general"], 25, 4, 0.0, 0, 0.0),
+    ],
+    suppliers=[
+        Supplier("MEDLINE", "MedLine Medical", base_lead_time=1, lead_time_std=0.0,
+                 cost_multiplier=1.0,
+                 products=["GLOVE-001", "SYR-10", "MASK-001"]),
+    ],
+    events=[],
+)
+# ─── Task 1: Single Ward Stable ──────────────────────────────────────────────
+TASK1 = TaskConfig(
+    name="single_ward_stable",
+    max_days=3,
+    actions_per_shift=6,
+    budget_limit=20_000.0,
+    initial_stock_days=2,
+    benchmark_score=0.68,
+    locations=[
+        Location("ward_general", "General Ward", capacity=None),
+    ],
+    products=[
+        Product("GLOVE-001", "Surgical Gloves (box)",  None, "NORMAL", 5.0,  ["ward_general"], 50, 8,  0.0, 0, 0.0),
+        Product("IV-500",    "IV Bags 500ml",          540,  "NORMAL", 3.5,  ["ward_general"], 30, 5,  0.0, 0, 0.0),
+        Product("SYR-10",    "Syringes 10ml",          None, "NORMAL", 0.5,  ["ward_general"], 80, 12, 0.0, 0, 0.0),
+        Product("PARA-500",  "Paracetamol 500mg",      360,  "NORMAL", 0.1,  ["ward_general"], 40, 6,  0.0, 0, 0.0),
+        Product("MASK-001",  "Surgical Masks",         None, "NORMAL", 1.0,  ["ward_general"], 60, 10, 0.0, 0, 0.0),
+        Product("SAL-001",   "Saline Solution",        720,  "NORMAL", 2.5,  ["ward_general"], 25, 4,  0.0, 0, 0.0),
+    ],
+    suppliers=[
+        Supplier("MEDLINE", "MedLine Medical", base_lead_time=2, lead_time_std=0.0,
+                 cost_multiplier=1.0,
+                 products=["GLOVE-001", "IV-500", "SYR-10", "PARA-500", "MASK-001", "SAL-001"]),
+    ],
+    events=[],
+)
+# ─── Task 2: Multi-Ward Seasonal ─────────────────────────────────────────────
+TASK2 = TaskConfig(
+    name="multi_ward_seasonal",
+    max_days=6,
+    actions_per_shift=8,
+    budget_limit=50_000.0,
+    initial_stock_days=3,
+    benchmark_score=0.55,
+    locations=[
+        Location("central_pharmacy", "Central Pharmacy", capacity=15_000),
+        Location("ward_icu",         "ICU",              capacity=None),
+        Location("ward_emergency",   "Emergency",        capacity=None),
+        Location("ward_general",     "General Medicine", capacity=None),
+    ],
+    products=[
+        Product("IV-500",    "IV Bags 500ml",        540,  "NORMAL",   3.5, ["central_pharmacy", "ward_icu"], 25, 4, 0.0, 0, 0.0),
+        Product("SAL-001",   "Saline Solution",      720,  "NORMAL",   2.5, ["central_pharmacy", "ward_icu", "ward_emergency"], 20, 3, 0.0, 0, 0.0),
+        Product("SYR-10",    "Syringes 10ml",        None, "NORMAL",   0.5, ["central_pharmacy", "ward_icu", "ward_emergency", "ward_general"], 70, 10, 0.0, 0, 0.0),
+        Product("ANTIVIR-01","Antiviral Medication",  365, "HIGH",    15.0, ["central_pharmacy", "ward_emergency"], 10, 2, 0.55, 365, 0.0),
+        Product("MASK-N95",  "N95 Masks",            None, "NORMAL",   2.5, ["central_pharmacy", "ward_emergency", "ward_icu"], 30, 5, 0.4, 365, 0.0),
+        Product("PARA-500",  "Paracetamol 500mg",    360,  "NORMAL",   0.1, ["central_pharmacy", "ward_general"], 40, 6, 0.3, 180, 0.0),
+        Product("GLOVE-001", "Surgical Gloves (box)", None, "NORMAL",  5.0, ["central_pharmacy", "ward_icu", "ward_emergency", "ward_general"], 50, 8, 0.0, 0, 0.0),
+        Product("SUTURE-01", "Suture Kit",           None, "NORMAL",  12.0, ["central_pharmacy", "ward_general"], 8,  2, 0.0, 0, 0.0),
+        Product("DRAIN-01",  "Surgical Drain",       None, "NORMAL",  18.0, ["central_pharmacy", "ward_general"], 5,  1, 0.0, 0, 0.0),
+        Product("DRESS-01",  "Wound Dressing Kit",   None, "NORMAL",   6.0, ["central_pharmacy", "ward_general", "ward_emergency"], 20, 4, 0.0, 0, 0.0),
+    ],
+    suppliers=[
+        Supplier("FASTMED", "FastMed Express", base_lead_time=1, lead_time_std=0.0, cost_multiplier=1.4,
+                 products=["IV-500", "SAL-001", "SYR-10", "ANTIVIR-01", "MASK-N95", "PARA-500", "GLOVE-001", "SUTURE-01", "DRAIN-01", "DRESS-01"]),
+        Supplier("MEDLINE", "MedLine Medical", base_lead_time=4, lead_time_std=0.0, cost_multiplier=1.0,
+                 products=["IV-500", "SAL-001", "SYR-10", "ANTIVIR-01", "MASK-N95", "PARA-500", "GLOVE-001", "SUTURE-01", "DRAIN-01", "DRESS-01"]),
+    ],
+    events=[
+        SimEvent(
+            event_id="flu_surge",
+            event_type="demand_surge",
+            trigger_day=3,
+            duration_days=3,
+            params={"products": ["ANTIVIR-01", "MASK-N95", "PARA-500"], "multiplier": 1.5},
+            message=InboxMessageTemplate(
+                priority="HIGH",
+                sender="Regional Health Authority",
+                subject="Influenza Activity Alert — Demand Surge Active",
+                body=(
+                    "Regional influenza activity has been elevated above seasonal baseline.\n"
+                    "Emergency department visits up 40-55%. Surge in effect now.\n"
+                    "Antiviral, N95, and analgesic stock levels require immediate review.\n"
+                    "Prepare for increased ICU admissions over the next 3 days."
+                ),
+            ),
+            warning_message=InboxMessageTemplate(
+                priority="LOW",
+                sender="Regional Health Authority",
+                subject="Early Warning: Influenza Activity Rising",
+                body=(
+                    "Early indicators suggest influenza activity is rising above seasonal norms.\n"
+                    "Consider reviewing antiviral and PPE stock levels as a precaution."
+                ),
+            ),
+        ),
+        SimEvent(
+            event_id="medline_delay",
+            event_type="supplier_disruption",
+            trigger_day=4,
+            duration_days=3,
+            params={"supplier_id": "MEDLINE", "new_lead_time": 7, "reason": "industrial action"},
+            message=InboxMessageTemplate(
+                priority="HIGH",
+                sender="MedLine Medical — Supply Chain",
+                subject="Service Disruption Notice — Lead Time Extension",
+                body=(
+                    "Due to ongoing industrial action at our primary warehouse facility,\n"
+                    "MedLine Medical lead times are currently extended from 4 to 7 days.\n"
+                    "This affects all standard orders. FastMed Express remains unaffected.\n"
+                    "We apologise for the inconvenience. Reference: SUPDIS-2026-0006."
+                ),
+            ),
+            warning_message=None,
+        ),
+    ],
+)
+# ─── Task 3: Hospital Network Crisis ─────────────────────────────────────────
+TASK3 = TaskConfig(
+    name="hospital_network_crisis",
+    max_days=12,
+    actions_per_shift=10,
+    budget_limit=150_000.0,
+    initial_stock_days=4,
+    benchmark_score=0.38,
+    locations=[
+        Location("regional_dc", "Regional Distribution Centre", capacity=None),
+        Location("hospital_a",  "Hospital A",                   capacity=None),
+        Location("hospital_b",  "Hospital B",                   capacity=None),
+        Location("hospital_c",  "Hospital C",                   capacity=None),
+    ],
+    products=[
+        Product("B-001",     "O-Negative Blood (RBC)",      42,   "CRITICAL", 350.0, ["regional_dc", "hospital_a", "hospital_b", "hospital_c"], 3, 1, 0.0, 0, 0.0),
+        Product("B-002",     "Platelet Pack",               5,    "CRITICAL", 250.0, ["regional_dc", "hospital_a", "hospital_b", "hospital_c"], 2, 1, 0.0, 0, 0.0),
+        Product("FFP-001",   "Fresh Frozen Plasma",         365,  "HIGH",     120.0, ["regional_dc", "hospital_a", "hospital_b", "hospital_c"], 4, 1, 0.0, 0, 0.0),
+        Product("INS-001",   "Insulin (opened vial)",       28,   "HIGH",     45.0,  ["hospital_a", "hospital_b", "hospital_c"], 8,  2, 0.0, 0, 0.0),
+        Product("CHEMO-01",  "Chemotherapy Agent (recon.)", 7,    "HIGH",     800.0, ["hospital_a", "hospital_b"],              1,  0, 0.0, 0, 0.0),
+        Product("IV-SAL-500","IV Saline Solution 500ml",    720,  "NORMAL",   3.5,   ["regional_dc", "hospital_a", "hospital_b", "hospital_c"], 60, 10, 0.0, 0, 0.0),
+        Product("IV-500",    "IV Bags 500ml",               540,  "NORMAL",   3.5,   ["regional_dc", "hospital_a", "hospital_b", "hospital_c"], 40, 7,  0.0, 0, 0.0),
+        Product("SYR-10",    "Syringes 10ml",               None, "NORMAL",   0.5,   ["regional_dc", "hospital_a", "hospital_b", "hospital_c"], 100,15, 0.0, 0, 0.0),
+        Product("GLOVE-001", "Surgical Gloves (box)",       None, "NORMAL",   5.0,   ["regional_dc", "hospital_a", "hospital_b", "hospital_c"], 40, 6,  0.0, 0, 0.0),
+        Product("PARA-500",  "Paracetamol 500mg",           360,  "NORMAL",   0.1,   ["hospital_a", "hospital_b", "hospital_c"], 60, 8,  0.0, 0, 0.0),
+        Product("B-003",     "AB-Positive Blood (RBC)",     42,   "HIGH",     320.0, ["regional_dc", "hospital_a", "hospital_b", "hospital_c"], 2, 1, 0.0, 0, 0.0),
+        Product("MASK-001",  "Surgical Masks",              None, "NORMAL",   1.0,   ["regional_dc", "hospital_a", "hospital_b", "hospital_c"], 80, 12, 0.0, 0, 0.0),
+        Product("DRAIN-01",  "Surgical Drain",              None, "NORMAL",   18.0,  ["hospital_a", "hospital_b", "hospital_c"], 4,  1,  0.0, 0, 0.0),
+        Product("SAL-001",   "Saline Solution",             720,  "NORMAL",   2.5,   ["regional_dc", "hospital_a", "hospital_b", "hospital_c"], 30, 5,  0.0, 0, 0.0),
+        Product("DRESS-01",  "Wound Dressing Kit",          None, "NORMAL",   6.0,   ["hospital_a", "hospital_b", "hospital_c"], 25, 4,  0.0, 0, 0.0),
+    ],
+    suppliers=[
+        Supplier("BLOODBANK-A", "Regional Blood Bank", base_lead_time=1, lead_time_std=0.5, cost_multiplier=1.0,
+                 products=["B-001", "B-002", "FFP-001", "B-003"]),
+        Supplier("SUPPLIER-A",  "HealthCo Supplies",   base_lead_time=3, lead_time_std=0.5, cost_multiplier=1.0,
+                 products=["IV-SAL-500", "IV-500", "SYR-10", "GLOVE-001", "PARA-500", "MASK-001", "DRAIN-01", "SAL-001", "DRESS-01"]),
+        Supplier("SUPPLIER-B",  "MedFast Express",     base_lead_time=2, lead_time_std=0.0, cost_multiplier=1.4,
+                 products=["IV-SAL-500", "IV-500", "SYR-10", "GLOVE-001", "PARA-500", "MASK-001", "DRAIN-01", "SAL-001", "DRESS-01"]),
+        Supplier("PHARMA-X",    "PharmaCorp",          base_lead_time=3, lead_time_std=0.5, cost_multiplier=1.0,
+                 products=["INS-001", "CHEMO-01", "PARA-500"]),
+    ],
+    events=[
+        SimEvent(
+            event_id="cold_chain_breach",
+            event_type="cold_chain_breach",
+            trigger_day=3,
+            duration_days=0,
+            params={
+                "location_id": "regional_dc",
+                "product_id": "B-002",
+                "qty_affected": "all",
+            },
+            message=InboxMessageTemplate(
+                priority="CRITICAL",
+                sender="Regional DC Facilities Management",
+                subject="URGENT: Cold Chain Breach — Platelet Inventory Compromised",
+                body=(
+                    "Cold chain monitoring detected temperature excursion in Fridge Unit 3.\n"
+                    "Recorded +8°C for 4 hours (safe range: +20 to +24°C).\n"
+                    "ALL platelet units at Regional DC are presumed compromised and auto-quarantined.\n"
+                    "Affected product: Platelet Pack (B-002) — ALL lots at regional_dc.\n"
+                    "Action required:\n"
+                    "  1. Arrange emergency replacement order from BLOODBANK-A\n"
+                    "  2. Notify clinical teams — platelet supply now limited to hospital stock only.\n"
+                    "Ref: CCBR-2026-0003"
+                ),
+            ),
+            warning_message=None,
+        ),
+        SimEvent(
+            event_id="supplier_a_disruption",
+            event_type="supplier_disruption",
+            trigger_day=6,
+            duration_days=9,
+            params={"supplier_id": "SUPPLIER-A", "new_lead_time": 7, "reason": "force majeure — flu absenteeism"},
+            message=InboxMessageTemplate(
+                priority="HIGH",
+                sender="HealthCo Supplies — Logistics",
+                subject="Force Majeure Notice — Extended Lead Times",
+                body=(
+                    "HealthCo Supplies hereby provides formal notice of force majeure event.\n"
+                    "Widespread workforce absenteeism due to influenza has impacted fulfilment operations.\n"
+                    "Effective immediately, standard lead times are extended from 3 to 7 days.\n"
+                    "All product lines affected. MedFast Express (SUPPLIER-B) is unaffected.\n"
+                    "Ref: FM-2026-0006"
+                ),
+            ),
+            warning_message=None,
+        ),
+        SimEvent(
+            event_id="mci_warning",
+            event_type="demand_surge",
+            trigger_day=8,
+            duration_days=0,
+            params={},
+            message=InboxMessageTemplate(
+                priority="HIGH",
+                sender="Emergency Management Coordination",
+                subject="Mass Casualty Incident — STANDBY",
+                body=(
+                    "Multi-vehicle collision reported on Interstate highway.\n"
+                    "Current status: STANDBY. Incident Command not yet activated.\n"
+                    "Preliminary estimate: 15-25 critically injured.\n"
+                    "All blood banks placed on AMBER alert.\n"
+                    "Recommend pre-emptive review of O-neg and platelet stock levels NOW.\n"
+                    "Further update to follow."
+                ),
+            ),
+            warning_message=None,
+        ),
+        SimEvent(
+            event_id="mci_activation",
+            event_type="mci",
+            trigger_day=9,
+            duration_days=3,
+            params={
+                "products": ["B-001", "B-002", "FFP-001"],
+                "demand_multiplier": 3.0,
+                "locations": ["hospital_a", "hospital_b", "hospital_c"],
+                "mci_tracking": True,
+            },
+            message=InboxMessageTemplate(
+                priority="CRITICAL",
+                sender="Incident Command System",
+                subject="MCI ACTIVATION — Mass Casualty Event",
+                body=(
+                    "INCIDENT COMMAND ACTIVATED.\n"
+                    "Multi-vehicle collision — confirmed 23 critically injured en route.\n"
+                    "Hospital A ETA: 18 min | Hospital B ETA: 29 min | Hospital C ETA: 41 min.\n"
+                    "Trauma surgery teams mobilised at all hospitals.\n"
+                    "All blood banks placed on RED alert.\n"
+                    "Est. blood product requirements: 60-90 units O-neg RBC, 30-40 platelet packs.\n"
+                    "IMMEDIATE ACTION: Verify blood product inventory and initiate emergency procurement."
+                ),
+            ),
+            warning_message=None,
+        ),
+        SimEvent(
+            event_id="iv_saline_recall",
+            event_type="product_recall",
+            trigger_day=11,
+            duration_days=0,
+            params={
+                "product_id": "IV-SAL-500",
+                "lot_id": "RECALL-LOT-IV2026-9821",
+                "locations_with_lot": ["hospital_a", "hospital_b", "hospital_c", "regional_dc"],
+                "qty_per_location": 60,
+            },
+            message=InboxMessageTemplate(
+                priority="CRITICAL",
+                sender="Pharmacy Automated System",
+                subject="MANDATORY RECALL — IV Saline Solution 500ml",
+                body=(
+                    "MANDATORY RECALL — Health Authority ref #HA-2026-0013\n"
+                    "Product: IV Saline Solution 500ml (IV-SAL-500)\n"
+                    "Affected lot: RECALL-LOT-IV2026-9821\n"
+                    "Reason: Potential endotoxin contamination detected in batch.\n"
+                    "ACTION REQUIRED:\n"
+                    "  1. Query inventory at ALL locations for lot RECALL-LOT-IV2026-9821\n"
+                    "  2. Quarantine ALL affected units immediately — do not use\n"
+                    "  3. Submit replacement order\n"
+                    "Supplier contact: SUPPLIER-A case #88291. Ref: RECALL-2026-0013."
+                ),
+            ),
+            warning_message=None,
+        ),
+    ],
+)
+def get_task_config(task_name: str) -> TaskConfig:
+    if task_name == "orientation_ward":
+        return TASK0
+    elif task_name == "single_ward_stable":
+        return TASK1
+    elif task_name == "multi_ward_seasonal":
+        return TASK2
+    elif task_name == "hospital_network_crisis":
+        return TASK3
+    else:
+        raise ValueError(
+            f"Unknown task: {task_name!r}. "
+            "Choose from: orientation_ward, single_ward_stable, multi_ward_seasonal, hospital_network_crisis"
+        )

test.py ADDED Viewed

	@@ -0,0 +1,152 @@

+"""
+Test inference script for medchain_env simulating runs without LLM calls.
+"""
+import asyncio
+import logging
+import sys
+from pathlib import Path
+# Add project root to sys.path so we can import medchain_env
+sys.path.insert(0, str(Path(__file__).parent))
+from medchain_env import CallToolAction, MedchainEnv
+_log_fmt = logging.Formatter("%(asctime)s [%(levelname)s] %(message)s", datefmt="%H:%M:%S")
+class DualWriter:
+    def __init__(self, filepath):
+        self.terminal = sys.stdout
+        self.log = open(filepath, "w")
+    def write(self, message):
+        self.terminal.write(message)
+        self.log.write(message)
+        self.log.flush()
+    def flush(self):
+        self.terminal.flush()
+        self.log.flush()
+sys.stdout = DualWriter("test_inference.log")
+_stream_handler = logging.StreamHandler(sys.stdout)
+_stream_handler.setFormatter(_log_fmt)
+logging.basicConfig(level=logging.INFO, handlers=[_stream_handler])
+log = logging.getLogger(__name__)
+async def run_test(task_name: str, actions_to_take: list):
+    log.info("Starting task: %s", task_name)
+    # We use the same Docker environment as in inference_medchain_env.py
+    env = await MedchainEnv.from_docker_image(
+        "medchain_env-env:latest",
+        env_vars={"MEDCHAIN_TASK": task_name},
+    )
+    try:
+        log.info("[%s] Docker env started", task_name)
+        mcp_tools = await env.list_tools()
+        tool_names = [t.name for t in mcp_tools]
+        log.info("Available tools: %s", tool_names)
+        obs = await env.reset()
+        obs = obs.observation
+        dashboard = obs.metadata.get("dashboard", "")
+        log.info("[%s] env.reset() complete. done=%s  metadata_keys=%s",
+                 task_name, obs.done, list(obs.metadata.keys()))
+        print(f"\n{'=' * 60}")
+        print(f"TASK: {task_name}")
+        print(f"{'=' * 60}")
+        print(dashboard[:500])
+        step_count = 0
+        final_reward = 0.0
+        done = obs.done
+        for act_dict in actions_to_take:
+            if done:
+                log.info("[%s] Episode already done before taking action %s", task_name, act_dict["tool_name"])
+                break
+            step_count += 1
+            tool_name = act_dict["tool_name"]
+            tool_args = act_dict["arguments"]
+            print(f"\n{'─' * 60}")
+            print(f"[{task_name}] Step {step_count} — predefined action")
+            print(f"{'─' * 60}")
+            print(f"\n[{task_name}] Step {step_count} — SIMULATED AGENT RESPONSE:")
+            print(f"  TOOL CALL: {tool_name}({tool_args})")
+            log.info("[%s] Step %d - calling tool: %s(%s)", task_name, step_count, tool_name, tool_args)
+            action = CallToolAction(tool_name=tool_name, arguments=tool_args)
+            step_result = await env.step(action)
+            obs = step_result.observation
+            done = obs.done
+            result_text = obs.metadata.get("tool_result", str(obs.metadata))
+            if "EPISODE COMPLETE" in (result_text or ""):
+                log.info("[%s] Step %d - 'EPISODE COMPLETE' detected in result text; marking done", task_name, step_count)
+                done = True
+            print(f"\n[{task_name}] Step {step_count} — SERVER RESPONSE (tool_result):")
+            print(f"  {(result_text or 'EMPTY')[:500]}")
+            log.info("[%s] Step %d - env.step() returned. done=%s reward=%s result_preview=%r",
+                     task_name, step_count, done, obs.reward, (result_text or "")[:120])
+            if obs.reward is not None and obs.reward > 0:
+                final_reward = obs.reward
+                print(f"    Reward: {obs.reward:.4f} | Done: {done}")
+            # Sleep slightly to replicate inference_medchain_env behaviour and prevent overwhelming
+            await asyncio.sleep(0.1)
+        log.info("[%s] Episode finished. steps=%d done=%s final_reward=%.4f", task_name, step_count, done, final_reward)
+        print(f"  Final reward: {final_reward:.4f} | Steps: {step_count} | Done: {done}")
+    finally:
+        await env.close()
+async def main():
+    # Intro/easy task: orientation_ward (2 days, explore tools, place 1 order)
+    intro_actions = [
+        {"tool_name": "read_inbox", "arguments": {"filter": "all"}},
+        {"tool_name": "query_erp", "arguments": {"table": "inventory"}},
+        {"tool_name": "submit_po", "arguments": {"supplier_id": "MEDLINE", "product_id": "GLOVE-001", "destination_id": "ward_general", "quantity": 40}},
+        {"tool_name": "submit_po", "arguments": {"supplier_id": "MEDLINE", "product_id": "SYR-10", "destination_id": "ward_general", "quantity": 60}},
+        {"tool_name": "end_shift", "arguments": {}},
+        {"tool_name": "end_shift", "arguments": {}},
+    ]
+    await run_test("orientation_ward", intro_actions)
+    # Medium task: single_ward_stable (3 days, 6 products, no events)
+    easy_actions = [
+        {"tool_name": "read_inbox", "arguments": {"filter": "unread"}},
+        {"tool_name": "query_erp", "arguments": {"table": "inventory"}},
+        {"tool_name": "submit_po", "arguments": {"supplier_id": "MEDLINE", "product_id": "IV-500", "destination_id": "ward_general", "quantity": 100}},
+        {"tool_name": "end_shift", "arguments": {}},
+        {"tool_name": "end_shift", "arguments": {}},
+        {"tool_name": "end_shift", "arguments": {}},
+        {"tool_name": "end_shift", "arguments": {}},
+    ]
+    await run_test("single_ward_stable", easy_actions)
+    # Medium-hard task: multi_ward_seasonal (6 days, flu surge + supplier delay)
+    medium_actions = [
+        {"tool_name": "read_inbox", "arguments": {"filter": "unread"}},
+        {"tool_name": "query_erp", "arguments": {"table": "inventory"}},
+        {"tool_name": "transfer", "arguments": {"from_location_id": "central_pharmacy", "to_location_id": "ward_icu", "product_id": "IV-500", "quantity": 50}},
+        {"tool_name": "submit_po", "arguments": {"supplier_id": "MEDLINE", "product_id": "IV-500", "destination_id": "central_pharmacy", "quantity": 200}},
+    ]
+    # Add enough end_shift actions to finish the 6-day episode
+    for _ in range(7):
+        medium_actions.append({"tool_name": "end_shift", "arguments": {}})
+    await run_test("multi_ward_seasonal", medium_actions)
+if __name__ == "__main__":
+    asyncio.run(main())

uv.lock ADDED Viewed

The diff for this file is too large to render. See raw diff

validate-submission.sh ADDED Viewed

	@@ -0,0 +1,185 @@

+#!/usr/bin/env bash
+#
+# validate-submission.sh — OpenEnv Submission Validator
+#
+# Checks that your HF Space is live, Docker image builds, and openenv validate passes.
+#
+# Prerequisites:
+#   - Docker:       https://docs.docker.com/get-docker/
+#   - openenv-core: pip install openenv-core
+#   - curl (usually pre-installed)
+#
+# Run:
+#   curl -fsSL https://raw.githubusercontent.com/<owner>/<repo>/main/scripts/validate-submission.sh | bash -s -- <ping_url> [repo_dir]
+#
+#   Or download and run locally:
+#     chmod +x validate-submission.sh
+#     ./validate-submission.sh <ping_url> [repo_dir]
+#
+# Arguments:
+#   ping_url   Your HuggingFace Space URL (e.g. https://your-space.hf.space)
+#   repo_dir   Path to your repo (default: current directory)
+#
+# Examples:
+#   ./validate-submission.sh https://my-team.hf.space
+#   ./validate-submission.sh https://my-team.hf.space ./my-repo
+#
+set -uo pipefail
+DOCKER_BUILD_TIMEOUT=600
+if [ -t 1 ]; then
+  RED='\033[0;31m'
+  GREEN='\033[0;32m'
+  YELLOW='\033[1;33m'
+  BOLD='\033[1m'
+  NC='\033[0m'
+else
+  RED='' GREEN='' YELLOW='' BOLD='' NC=''
+fi
+run_with_timeout() {
+  local secs="$1"; shift
+  if command -v timeout &>/dev/null; then
+    timeout "$secs" "$@"
+  elif command -v gtimeout &>/dev/null; then
+    gtimeout "$secs" "$@"
+  else
+    "$@" &
+    local pid=$!
+    ( sleep "$secs" && kill "$pid" 2>/dev/null ) &
+    local watcher=$!
+    wait "$pid" 2>/dev/null
+    local rc=$?
+    kill "$watcher" 2>/dev/null
+    wait "$watcher" 2>/dev/null
+    return $rc
+  fi
+}
+portable_mktemp() {
+  local prefix="${1:-validate}"
+  mktemp "${TMPDIR:-/tmp}/${prefix}-XXXXXX" 2>/dev/null || mktemp
+}
+CLEANUP_FILES=()
+cleanup() { rm -f "${CLEANUP_FILES[@]+"${CLEANUP_FILES[@]}"}"; }
+trap cleanup EXIT
+PING_URL="${1:-}"
+REPO_DIR="${2:-.}"
+if [ -z "$PING_URL" ]; then
+  printf "Usage: %s <ping_url> [repo_dir]\n" "$0"
+  printf "\n"
+  printf "  ping_url   Your HuggingFace Space URL (e.g. https://your-space.hf.space)\n"
+  printf "  repo_dir   Path to your repo (default: current directory)\n"
+  exit 1
+fi
+if ! REPO_DIR="$(cd "$REPO_DIR" 2>/dev/null && pwd)"; then
+  printf "Error: directory '%s' not found\n" "${2:-.}"
+  exit 1
+fi
+PING_URL="${PING_URL%/}"
+export PING_URL
+PASS=0
+log()  { printf "[%s] %b\n" "$(date -u +%H:%M:%S)" "$*"; }
+pass() { log "${GREEN}PASSED${NC} -- $1"; PASS=$((PASS + 1)); }
+fail() { log "${RED}FAILED${NC} -- $1"; }
+hint() { printf "  ${YELLOW}Hint:${NC} %b\n" "$1"; }
+stop_at() {
+  printf "\n"
+  printf "${RED}${BOLD}Validation stopped at %s.${NC} Fix the above before continuing.\n" "$1"
+  exit 1
+}
+printf "\n"
+printf "${BOLD}========================================${NC}\n"
+printf "${BOLD}  OpenEnv Submission Validator${NC}\n"
+printf "${BOLD}========================================${NC}\n"
+log "Repo:     $REPO_DIR"
+log "Ping URL: $PING_URL"
+printf "\n"
+log "${BOLD}Step 1/3: Pinging HF Space${NC} ($PING_URL/reset) ..."
+CURL_OUTPUT=$(portable_mktemp "validate-curl")
+CLEANUP_FILES+=("$CURL_OUTPUT")
+HTTP_CODE=$(curl -s -o "$CURL_OUTPUT" -w "%{http_code}" -X POST \
+  -H "Content-Type: application/json" -d '{}' \
+  "$PING_URL/reset" --max-time 30 2>"$CURL_OUTPUT" || printf "000")
+if [ "$HTTP_CODE" = "200" ]; then
+  pass "HF Space is live and responds to /reset"
+elif [ "$HTTP_CODE" = "000" ]; then
+  fail "HF Space not reachable (connection failed or timed out)"
+  hint "Check your network connection and that the Space is running."
+  hint "Try: curl -s -o /dev/null -w '%%{http_code}' -X POST $PING_URL/reset"
+  stop_at "Step 1"
+else
+  fail "HF Space /reset returned HTTP $HTTP_CODE (expected 200)"
+  hint "Make sure your Space is running and the URL is correct."
+  hint "Try opening $PING_URL in your browser first."
+  stop_at "Step 1"
+fi
+log "${BOLD}Step 2/3: Running docker build${NC} ..."
+if ! command -v docker &>/dev/null; then
+  fail "docker command not found"
+  hint "Install Docker: https://docs.docker.com/get-docker/"
+  stop_at "Step 2"
+fi
+if [ -f "$REPO_DIR/Dockerfile" ]; then
+  DOCKER_CONTEXT="$REPO_DIR"
+elif [ -f "$REPO_DIR/server/Dockerfile" ]; then
+  DOCKER_CONTEXT="$REPO_DIR/server"
+else
+  fail "No Dockerfile found in repo root or server/ directory"
+  stop_at "Step 2"
+fi
+log "  Found Dockerfile in $DOCKER_CONTEXT"
+BUILD_OK=false
+BUILD_OUTPUT=$(run_with_timeout "$DOCKER_BUILD_TIMEOUT" docker build "$DOCKER_CONTEXT" 2>&1) && BUILD_OK=true
+if [ "$BUILD_OK" = true ]; then
+  pass "Docker build succeeded"
+else
+  fail "Docker build failed (timeout=${DOCKER_BUILD_TIMEOUT}s)"
+  printf "%s\n" "$BUILD_OUTPUT" | tail -20
+  stop_at "Step 2"
+fi
+log "${BOLD}Step 3/3: Running openenv validate${NC} ..."
+if ! command -v openenv &>/dev/null; then
+  fail "openenv command not found"
+  hint "Install it: pip install openenv-core"
+  stop_at "Step 3"
+fi
+VALIDATE_OK=false
+VALIDATE_OUTPUT=$(cd "$REPO_DIR" && openenv validate 2>&1) && VALIDATE_OK=true
+if [ "$VALIDATE_OK" = true ]; then
+  pass "openenv validate passed"
+  [ -n "$VALIDATE_OUTPUT" ] && log "  $VALIDATE_OUTPUT"
+else
+  fail "openenv validate failed"
+  printf "%s\n" "$VALIDATE_OUTPUT"
+  stop_at "Step 3"
+fi
+printf "\n"
+printf "${BOLD}========================================${NC}\n"
+printf "${GREEN}${BOLD}  All 3/3 checks passed!${NC}\n"
+printf "${GREEN}${BOLD}  Your submission is ready to submit.${NC}\n"
+printf "${BOLD}========================================${NC}\n"
+printf "\n"
+exit 0