Space: Lonelyguyse1/ecom

Lonelyguyse1 committed on Apr 2

Commit ce9fb65 (verified) · 1 parent: b20f230

Upload folder using huggingface_hub

Files changed (9)

Dockerfile +3 -0
README.md +172 -175
__init__.py +3 -2
client.py +27 -68
models.py +53 -10
openenv.yaml +5 -1
server/__init__.py +1 -1
server/app.py +25 -31
server/ecom_environment.py +794 -63

Dockerfile CHANGED

@@ -27,6 +27,9 @@ ARG ENV_NAME=ecom
 # Copy environment code (always at root of build context)
 COPY . /app/env
+# Remove host virtual environment if present in build context.
+RUN rm -rf /app/env/.venv
 # For in-repo builds, openenv is already vendored in the build context
 # For standalone builds, openenv will be installed via pyproject.toml
 WORKDIR /app/env

README.md CHANGED

@@ -1,255 +1,252 @@
 ---
-title: Ecom Environment Server
-emoji: 🎻
-colorFrom: indigo
-colorTo: yellow
 sdk: docker
 pinned: false
 app_port: 8000
 base_path: /web
 tags:
   - openenv
 ---
-# Ecom Environment
-A simple test environment that echoes back messages. Perfect for testing the env APIs as well as demonstrating environment usage patterns.
-## Quick Start
-The simplest way to use the Ecom environment is through the `EcomEnv` class:
-```python
-from ecom import EcomAction, EcomEnv
-try:
-    # Create environment from Docker image
-    ecomenv = EcomEnv.from_docker_image("ecom-env:latest")
-    # Reset
-    result = ecomenv.reset()
-    print(f"Reset: {result.observation.echoed_message}")
-    # Send multiple messages
-    messages = ["Hello, World!", "Testing echo", "Final message"]
-    for msg in messages:
-        result = ecomenv.step(EcomAction(message=msg))
-        print(f"Sent: '{msg}'")
-        print(f"  → Echoed: '{result.observation.echoed_message}'")
-        print(f"  → Length: {result.observation.message_length}")
-        print(f"  → Reward: {result.reward}")
-finally:
-    # Always clean up
-    ecomenv.close()
-```
-That's it! The `EcomEnv.from_docker_image()` method handles:
-- Starting the Docker container
-- Waiting for the server to be ready
-- Connecting to the environment
-- Container cleanup when you call `close()`
-## Building the Docker Image
-Before using the environment, you need to build the Docker image:
-```bash
-# From project root
-docker build -t ecom-env:latest -f server/Dockerfile .
-```
-## Deploying to Hugging Face Spaces
-You can easily deploy your OpenEnv environment to Hugging Face Spaces using the `openenv push` command:
-```bash
-# From the environment directory (where openenv.yaml is located)
-openenv push
-# Or specify options
-openenv push --namespace my-org --private
-```
-The `openenv push` command will:
-1. Validate that the directory is an OpenEnv environment (checks for `openenv.yaml`)
-2. Prepare a custom build for Hugging Face Docker space (enables web interface)
-3. Upload to Hugging Face (ensuring you're logged in)
-### Prerequisites
-- Authenticate with Hugging Face: The command will prompt for login if not already authenticated
-### Options
-- `--directory`, `-d`: Directory containing the OpenEnv environment (defaults to current directory)
-- `--repo-id`, `-r`: Repository ID in format 'username/repo-name' (defaults to 'username/env-name' from openenv.yaml)
-- `--base-image`, `-b`: Base Docker image to use (overrides Dockerfile FROM)
-- `--private`: Deploy the space as private (default: public)
-### Examples
-```bash
-# Push to your personal namespace (defaults to username/env-name from openenv.yaml)
-openenv push
-# Push to a specific repository
-openenv push --repo-id my-org/my-env
-# Push with a custom base image
-openenv push --base-image ghcr.io/meta-pytorch/openenv-base:latest
-# Push as a private space
-openenv push --private
-# Combine options
-openenv push --repo-id my-org/my-env --base-image custom-base:latest --private
-```
-After deployment, your space will be available at:
-`https://huggingface.co/spaces/<repo-id>`
-The deployed space includes:
-- **Web Interface** at `/web` - Interactive UI for exploring the environment
-- **API Documentation** at `/docs` - Full OpenAPI/Swagger interface
-- **Health Check** at `/health` - Container health monitoring
-- **WebSocket** at `/ws` - Persistent session endpoint for low-latency interactions
-## Environment Details
-### Action
-**EcomAction**: Contains a single field
-- `message` (str) - The message to echo back
-### Observation
-**EcomObservation**: Contains the echo response and metadata
-- `echoed_message` (str) - The message echoed back
-- `message_length` (int) - Length of the message
-- `reward` (float) - Reward based on message length (length × 0.1)
-- `done` (bool) - Always False for echo environment
-- `metadata` (dict) - Additional info like step count
-### Reward
-The reward is calculated as: `message_length × 0.1`
-- "Hi" → reward: 0.2
-- "Hello, World!" → reward: 1.3
-- Empty message → reward: 0.0
-## Advanced Usage
-### Connecting to an Existing Server
-If you already have an Ecom environment server running, you can connect directly:
-```python
-from ecom import EcomEnv
-# Connect to existing server
-ecomenv = EcomEnv(base_url="<ENV_HTTP_URL_HERE>")
-# Use as normal
-result = ecomenv.reset()
-result = ecomenv.step(EcomAction(message="Hello!"))
-```
-Note: When connecting to an existing server, `ecomenv.close()` will NOT stop the server.
-### Using the Context Manager
-The client supports context manager usage for automatic connection management:
 ```python
 from ecom import EcomAction, EcomEnv
-# Connect with context manager (auto-connects and closes)
-with EcomEnv(base_url="http://localhost:8000") as env:
-    result = env.reset()
-    print(f"Reset: {result.observation.echoed_message}")
-    # Multiple steps with low latency
-    for msg in ["Hello", "World", "!"]:
-        result = env.step(EcomAction(message=msg))
-        print(f"Echoed: {result.observation.echoed_message}")
 ```
-The client uses WebSocket connections for:
-- **Lower latency**: No HTTP connection overhead per request
-- **Persistent session**: Server maintains your environment state
-- **Efficient for episodes**: Better for many sequential steps
-### Concurrent WebSocket Sessions
-The server supports multiple concurrent WebSocket connections. To enable this,
-modify `server/app.py` to use factory mode:
-```python
-# In server/app.py - use factory mode for concurrent sessions
-app = create_app(
-    EcomEnvironment,  # Pass class, not instance
-    EcomAction,
-    EcomObservation,
-    max_concurrent_envs=4,  # Allow 4 concurrent sessions
-)
 ```
-Then multiple clients can connect simultaneously:
-```python
-from ecom import EcomAction, EcomEnv
-from concurrent.futures import ThreadPoolExecutor
-def run_episode(client_id: int):
-    with EcomEnv(base_url="http://localhost:8000") as env:
-        result = env.reset()
-        for i in range(10):
-            result = env.step(EcomAction(message=f"Client {client_id}, step {i}"))
-        return client_id, result.observation.message_length
-# Run 4 episodes concurrently
-with ThreadPoolExecutor(max_workers=4) as executor:
-    results = list(executor.map(run_episode, range(4)))
 ```
-## Development & Testing
-### Direct Environment Testing
-Test the environment logic directly without starting the HTTP server:
 ```bash
-# From the server directory
-python3 server/ecom_environment.py
 ```
-This verifies that:
-- Environment resets correctly
-- Step executes actions properly
-- State tracking works
-- Rewards are calculated correctly
-### Running Locally
-Run the server locally for development:
 ```bash
-uvicorn server.app:app --reload
 ```
-## Project Structure
 ```
-ecom/
-├── .dockerignore         # Docker build exclusions
-├── __init__.py            # Module exports
-├── README.md              # This file
-├── openenv.yaml           # OpenEnv manifest
-├── pyproject.toml         # Project metadata and dependencies
-├── uv.lock                # Locked dependencies (generated)
-├── client.py              # EcomEnv client
-├── models.py              # Action and Observation models
-└── server/
-    ├── __init__.py        # Server module exports
-    ├── ecom_environment.py  # Core environment logic
-    ├── app.py             # FastAPI application (HTTP + WebSocket endpoints)
-    └── Dockerfile         # Container image definition
 ```

 ---
+title: E-commerce Returns Decision Environment
+emoji: 📦
+colorFrom: blue
+colorTo: green
 sdk: docker
 pinned: false
 app_port: 8000
 base_path: /web
 tags:
   - openenv
+  - operations
+  - decision-making
 ---
+# E-commerce Returns Decision Environment
+This environment simulates a real operations workflow in online retail: deciding how to handle customer return requests under policy constraints, latent fraud risk, and financial trade-offs.
+It is designed as a **partially observable decision problem**, not a classification toy.
+## Why this environment matters
+Returns handling is a major cost center in e-commerce.
+In production settings, an operations associate (or AI agent) must balance:
+- customer satisfaction,
+- policy compliance,
+- fraud prevention,
+- and cost efficiency.
+This environment captures that exact tension with structured observations, hidden variables, and deterministic graders.
+## Environment API (OpenEnv)
+The environment follows the OpenEnv simulation API:
+- `reset(...)` -> initial observation
+- `step(action)` -> observation, reward, done, info
+- `state` -> current episode state
+The `step` info channel is exposed via `observation.info`.
+## Action space
+`EcomAction`:
+- `action_type`: one of `APPROVE`, `REJECT`, `ESCALATE`, `REQUEST_INFO`
+- `reason_code`: required when `action_type == REJECT` and not allowed for any other action type; one of:
+  - `TIME_EXPIRED`
+  - `POLICY_VIOLATION`
+  - `SUSPECTED_FRAUD`
+## Observation space
+`EcomObservation` fields:
+- `return_reason`
+- `product_category`
+- `product_value` (`low | medium | high`)
+- `days_since_purchase`
+- `user_account_age_days`
+- `product_condition_notes`
+- `return_rate` (0.0 to 1.0)
+- `total_orders`
+- `policy_summary` (plain text, includes rules and exceptions)
+- `info` (step metadata)
+No identifier-only fields are included in the observation.
+## Hidden state (grader-only)
+The environment keeps the following latent variables hidden from the agent:
+- `fraud_risk_score`
+- `true_intent` (`genuine` or `abusive`)
+- `cost_impact` by candidate action
+- `optimal_action`
+These are used to compute scores/rewards and evaluate decision quality.
+## Episode flow and boundaries
+- One request per episode.
+- `APPROVE`, `REJECT`, `ESCALATE` are terminal actions (`done=True`).
+- `REQUEST_INFO` is non-terminal on first use and deterministically refines existing observation fields:
+  - `product_condition_notes`
+  - `return_reason` (optional refinement)
+  - slight refinement of `return_rate`
+- No new fields are introduced after `REQUEST_INFO`.
+## Scenario generation
+Scenarios are generated programmatically from controlled distributions.
+The generator includes mandatory realism correlations:
+- higher `return_rate` -> higher fraud likelihood,
+- lower `return_rate` -> lower fraud likelihood,
+- higher `product_value` -> higher fraud likelihood,
+- lower `product_value` -> lower fraud likelihood.
+Difficulty is not just fraud probability; it also changes ambiguity and signal conflict.
+## Reward design
+Reward is deterministic and normalized to `[0.0, 1.0]`.
+1. **Policy gate** (hard constraint)
+   - policy violation => reward `0.0`
+2. Component scores are bounded independently:
+   - `financial_score in [0,1]`
+   - `fraud_score in [0,1]`
+   - `efficiency_score in [0,1]`
+3. Weighted final score:
+   - `0.5 * financial + 0.3 * fraud + 0.2 * efficiency`
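+Because every component is bounded to `[0, 1]` and the weights sum to `1.0`, the final score also stays in `[0.0, 1.0]`, so graders never see out-of-range values.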
+## Tasks and graders (easy -> medium -> hard)
+The environment ships with 3 deterministic benchmark tasks, each with fixed seed + threshold:
+1. `easy_policy_compliance`
+   - clear low-risk case
+   - success threshold: `0.75`
+2. `medium_balanced_judgment`
+   - ambiguous policy/risk trade-off
+   - success threshold: `0.68`
+3. `hard_conflicting_signals`
+   - high-value conflicting signals + exception pressure
+   - success threshold: `0.62`
+Terminal observation includes grader outputs in `info`:
+- `grader_score` (0.0 to 1.0)
+- `grader_success` (bool)
+- detailed component `breakdown`
+## Quick start
+### Local dev server
+```bash
+uvicorn server.app:app --reload --host 0.0.0.0 --port 8000
+```
+### Python usage
 ```python
+import asyncio
 from ecom import EcomAction, EcomEnv
+async def run():
+    env = await EcomEnv.from_docker_image("ecom-env:latest")
+    try:
+        result = await env.reset(task_name="medium_balanced_judgment")
+        # optional extra context
+        result = await env.step(EcomAction(action_type="REQUEST_INFO"))
+        # final decision
+        result = await env.step(EcomAction(action_type="REJECT", reason_code="SUSPECTED_FRAUD"))
+        print(result.reward, result.done, result.observation.info)
+    finally:
+        await env.close()
+asyncio.run(run())
 ```
+## Baseline inference
+`inference.py` is at repo root as required.
+Required env vars:
+- `MODEL_NAME`
+- `LOCAL_IMAGE_NAME`
+- `HF_TOKEN` (or `OPENAI_API_KEY`)
+Optional:
+- `API_BASE_URL` (defaults to `https://api.openai.com/v1`)
+Run:
+```bash
+python inference.py
 ```
+The script emits strict structured logs:
+- `[START] ...`
+- `[STEP] ...`
+- `[END] ...`
+### Reproducible baseline scores
+Current deterministic baseline (heuristic fallback) on default task seeds:
+- `easy_policy_compliance`: `0.7997`
+- `medium_balanced_judgment`: `0.8388`
+- `hard_conflicting_signals`: `0.8253`
+## Hugging Face Spaces deployment
+From `ecom/`:
+```bash
+openenv push
 ```
+Or explicit options:
+```bash
+openenv push --repo-id <namespace>/<space-name> --private
+```
+## Docker
+Build from the environment root (`ecom/`):
 ```bash
+docker build -t ecom-env:latest -f server/Dockerfile .
 ```
+Run:
+```bash
+docker run --rm -p 8000:8000 ecom-env:latest
+```
+Health check:
 ```bash
+curl http://localhost:8000/health
 ```
+## Validation
+From `ecom/`:
+```bash
+openenv validate .
 ```
+Optional pre-check from repository root:
+```bash
+./validate-submission.sh <your-space-url> .
 ```
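The episode boundaries in this README (terminal `APPROVE`/`REJECT`/`ESCALATE`, and a non-terminal first `REQUEST_INFO` that only refines existing fields) amount to a small state machine. The sketch below is illustrative, not the environment's `step()` code: `EpisodeFlow` and `ObsSketch` are hypothetical names, reason codes are omitted for brevity, the refinement text and the `+0.02` nudge to `return_rate` are placeholder values, and because the README does not say what a second `REQUEST_INFO` does, this sketch simply raises an error.

```python
# Hypothetical sketch of the README's episode boundaries; not the real step() logic.
from dataclasses import dataclass, replace

TERMINAL_ACTIONS = {"APPROVE", "REJECT", "ESCALATE"}


@dataclass(frozen=True)
class ObsSketch:
    """Stand-in for EcomObservation, reduced to fields REQUEST_INFO may refine."""
    return_reason: str
    product_condition_notes: str
    return_rate: float


@dataclass
class EpisodeFlow:
    obs: ObsSketch
    info_requested: bool = False
    done: bool = False

    def step(self, action_type: str) -> ObsSketch:
        if self.done:
            raise RuntimeError("episode already finished; reset first")
        if action_type in TERMINAL_ACTIONS:
            self.done = True  # one request per episode: a final decision ends it
            return self.obs
        if action_type != "REQUEST_INFO":
            raise ValueError(f"unknown action_type: {action_type!r}")
        if self.info_requested:
            # Assumption: the README only specifies the first REQUEST_INFO.
            raise RuntimeError("REQUEST_INFO already used in this episode")
        self.info_requested = True
        # Deterministic refinement of existing fields; no new fields appear.
        # The note text and the +0.02 nudge are made-up placeholder values.
        self.obs = replace(
            self.obs,
            product_condition_notes=self.obs.product_condition_notes + "; inspection: minor wear",
            return_rate=round(min(1.0, self.obs.return_rate + 0.02), 2),
        )
        return self.obs


if __name__ == "__main__":
    flow = EpisodeFlow(ObsSketch("arrived damaged", "box opened", 0.30))
    print(flow.step("REQUEST_INFO"), flow.done)  # refined observation, done stays False
    flow.step("ESCALATE")
    print(flow.done)  # True
```

Keeping the observation frozen and producing a new instance with `dataclasses.replace` makes it explicit that `REQUEST_INFO` edits values of existing fields rather than adding new ones.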

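The reward design in this README (a hard policy gate, three component scores bounded to `[0, 1]`, fixed `0.5 / 0.3 / 0.2` weights, and a per-task success threshold) can be sketched in a few lines of plain Python. This is a minimal illustration, not the code in `server/ecom_environment.py`: `weighted_reward` and `grade` are hypothetical names, the component scores are taken as inputs because their derivation is not shown here, and treating `score >= threshold` as success is an assumption.

```python
# Hypothetical sketch of the README's reward formula; not the environment's implementation.

WEIGHTS = {"financial": 0.5, "fraud": 0.3, "efficiency": 0.2}  # weights from the README

# Success thresholds listed in the README for the three benchmark tasks.
TASK_THRESHOLDS = {
    "easy_policy_compliance": 0.75,
    "medium_balanced_judgment": 0.68,
    "hard_conflicting_signals": 0.62,
}


def _clamp01(x: float) -> float:
    """Bound a score to [0, 1]."""
    return max(0.0, min(1.0, x))


def weighted_reward(policy_violation: bool, financial: float, fraud: float, efficiency: float) -> float:
    """Return the normalized reward in [0, 1]."""
    if policy_violation:
        # Hard policy gate: any violation zeroes the reward.
        return 0.0
    score = (
        WEIGHTS["financial"] * _clamp01(financial)
        + WEIGHTS["fraud"] * _clamp01(fraud)
        + WEIGHTS["efficiency"] * _clamp01(efficiency)
    )
    # Weights sum to 1.0, so score is already in [0, 1]; the clamp guards float rounding.
    return _clamp01(score)


def grade(task_name: str, score: float) -> dict:
    """Hypothetical grader output; success assumed to mean score >= threshold."""
    return {"grader_score": score, "grader_success": score >= TASK_THRESHOLDS[task_name]}


if __name__ == "__main__":
    r = weighted_reward(False, financial=0.9, fraud=0.8, efficiency=0.5)
    print(round(r, 4))  # 0.45 + 0.24 + 0.10 → 0.79
    print(grade("easy_policy_compliance", r)["grader_success"])  # True (0.79 >= 0.75)
```

Clamping each component before weighting is what keeps the final score inside `[0.0, 1.0]` even if an upstream scorer misbehaves.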
__init__.py CHANGED

@@ -4,13 +4,14 @@
 # This source code is licensed under the BSD-style license found in the
 # LICENSE file in the root directory of this source tree.
-"""Ecom Environment."""
 from .client import EcomEnv
-from .models import EcomAction, EcomObservation
 __all__ = [
     "EcomAction",
     "EcomObservation",
     "EcomEnv",
 ]

 # This source code is licensed under the BSD-style license found in the
 # LICENSE file in the root directory of this source tree.
+"""Ecom returns decision environment."""
 from .client import EcomEnv
+from .models import EcomAction, EcomObservation, EcomReward
 __all__ = [
     "EcomAction",
     "EcomObservation",
+    "EcomReward",
     "EcomEnv",
 ]

client.py CHANGED

@@ -4,9 +4,9 @@
 # This source code is licensed under the BSD-style license found in the
 # LICENSE file in the root directory of this source tree.
-"""Ecom Environment Client."""
-from typing import Dict
 from openenv.core import EnvClient
 from openenv.core.client_types import StepResult
@@ -15,85 +15,44 @@ from openenv.core.env_server.types import State
 from .models import EcomAction, EcomObservation
-class EcomEnv(
-    EnvClient[EcomAction, EcomObservation, State]
-):
-    """
-    Client for the Ecom Environment.
-    This client maintains a persistent WebSocket connection to the environment server,
-    enabling efficient multi-step interactions with lower latency.
-    Each client instance has its own dedicated environment session on the server.
-    Example:
-        >>> # Connect to a running server
-        >>> with EcomEnv(base_url="http://localhost:8000") as client:
-        ...     result = client.reset()
-        ...     print(result.observation.echoed_message)
-        ...
-        ...     result = client.step(EcomAction(message="Hello!"))
-        ...     print(result.observation.echoed_message)
-    Example with Docker:
-        >>> # Automatically start container and connect
-        >>> client = EcomEnv.from_docker_image("ecom-env:latest")
-        >>> try:
-        ...     result = client.reset()
-        ...     result = client.step(EcomAction(message="Test"))
-        ... finally:
-        ...     client.close()
-    """
-    def _step_payload(self, action: EcomAction) -> Dict:
-        """
-        Convert EcomAction to JSON payload for step message.
-        Args:
-            action: EcomAction instance
-        Returns:
-            Dictionary representation suitable for JSON encoding
-        """
-        return {
-            "message": action.message,
         }
-    def _parse_result(self, payload: Dict) -> StepResult[EcomObservation]:
-        """
-        Parse server response into StepResult[EcomObservation].
-        Args:
-            payload: JSON response data from server
-        Returns:
-            StepResult with EcomObservation
-        """
         obs_data = payload.get("observation", {})
         observation = EcomObservation(
-            echoed_message=obs_data.get("echoed_message", ""),
-            message_length=obs_data.get("message_length", 0),
-            done=payload.get("done", False),
             reward=payload.get("reward"),
-            metadata=obs_data.get("metadata", {}),
         )
         return StepResult(
             observation=observation,
             reward=payload.get("reward"),
-            done=payload.get("done", False),
         )
-    def _parse_state(self, payload: Dict) -> State:
-        """
-        Parse server response into State object.
-        Args:
-            payload: JSON response from state request
-        Returns:
-            State object with episode_id and step_count
-        """
         return State(
             episode_id=payload.get("episode_id"),
-            step_count=payload.get("step_count", 0),
         )

 # This source code is licensed under the BSD-style license found in the
 # LICENSE file in the root directory of this source tree.
+"""Ecom returns decision environment client."""
+from typing import Any, Dict
 from openenv.core import EnvClient
 from openenv.core.client_types import StepResult
 from .models import EcomAction, EcomObservation
+class EcomEnv(EnvClient[EcomAction, EcomObservation, State]):
+    """Client for the returns decision environment."""
+    def _step_payload(self, action: EcomAction) -> Dict[str, Any]:
+        payload: Dict[str, Any] = {
+            "action_type": action.action_type,
         }
+        if action.reason_code is not None:
+            payload["reason_code"] = action.reason_code
+        if action.metadata:
+            payload["metadata"] = action.metadata
+        return payload
+    def _parse_result(self, payload: Dict[str, Any]) -> StepResult[EcomObservation]:
         obs_data = payload.get("observation", {})
         observation = EcomObservation(
+            return_reason=obs_data.get("return_reason", ""),
+            product_category=obs_data.get("product_category", ""),
+            product_value=obs_data.get("product_value", "low"),
+            days_since_purchase=int(obs_data.get("days_since_purchase", 0)),
+            user_account_age_days=int(obs_data.get("user_account_age_days", 0)),
+            product_condition_notes=obs_data.get("product_condition_notes", ""),
+            return_rate=float(obs_data.get("return_rate", 0.0)),
+            total_orders=int(obs_data.get("total_orders", 1)),
+            policy_summary=obs_data.get("policy_summary", ""),
+            info=obs_data.get("info", {}),
+            done=bool(payload.get("done", False)),
             reward=payload.get("reward"),
         )
         return StepResult(
             observation=observation,
             reward=payload.get("reward"),
+            done=bool(payload.get("done", False)),
         )
+    def _parse_state(self, payload: Dict[str, Any]) -> State:
         return State(
             episode_id=payload.get("episode_id"),
+            step_count=int(payload.get("step_count", 0)),
         )
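The new `_step_payload` always sends `action_type` and adds `reason_code` and `metadata` only when they carry a value, which keeps the wire format minimal. The standalone function below mirrors that rule without depending on `openenv`; `build_step_payload` is a hypothetical name, and its plain arguments stand in for an `EcomAction` instance.

```python
from typing import Any, Dict, Optional


def build_step_payload(
    action_type: str,
    reason_code: Optional[str] = None,
    metadata: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Mirror of EcomEnv._step_payload: omit optional keys that carry no value."""
    payload: Dict[str, Any] = {"action_type": action_type}
    if reason_code is not None:
        payload["reason_code"] = reason_code
    if metadata:  # an empty dict is treated like "no metadata", as in the client
        payload["metadata"] = metadata
    return payload


if __name__ == "__main__":
    print(build_step_payload("APPROVE"))
    print(build_step_payload("REJECT", "SUSPECTED_FRAUD"))
```

Omitting `reason_code` for non-`REJECT` actions matters here because the server-side `EcomAction` validator refuses a reason code on any other action type.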

models.py CHANGED

@@ -4,24 +4,67 @@
 # This source code is licensed under the BSD-style license found in the
 # LICENSE file in the root directory of this source tree.
-"""
-Data models for the Ecom Environment.
-The ecom environment is a simple test environment that echoes back messages.
-"""
 from openenv.core.env_server.types import Action, Observation
-from pydantic import Field
 class EcomAction(Action):
-    """Action for the Ecom environment - just a message to echo."""
-    message: str = Field(..., description="Message to echo back")
 class EcomObservation(Observation):
-    """Observation from the Ecom environment - the echoed message."""
-    echoed_message: str = Field(default="", description="The echoed message")
-    message_length: int = Field(default=0, description="Length of the echoed message")

 # This source code is licensed under the BSD-style license found in the
 # LICENSE file in the root directory of this source tree.
+"""Data models for the returns decision environment."""
+from typing import Any, Dict, Literal, Optional
 from openenv.core.env_server.types import Action, Observation
+from pydantic import BaseModel, Field, model_validator
+ActionType = Literal["APPROVE", "REJECT", "ESCALATE", "REQUEST_INFO"]
+RejectReason = Literal["TIME_EXPIRED", "POLICY_VIOLATION", "SUSPECTED_FRAUD"]
+ValueBucket = Literal["low", "medium", "high"]
 class EcomAction(Action):
+    """Action schema for return-request handling."""
+    action_type: ActionType = Field(..., description="Decision type")
+    reason_code: Optional[RejectReason] = Field(
+        default=None,
+        description="Required when action_type is REJECT",
+    )
+    @model_validator(mode="after")
+    def validate_reason_code(self) -> "EcomAction":
+        if self.action_type == "REJECT" and self.reason_code is None:
+            raise ValueError("reason_code is required when action_type is REJECT")
+        if self.action_type != "REJECT" and self.reason_code is not None:
+            raise ValueError("reason_code is only allowed when action_type is REJECT")
+        return self
 class EcomObservation(Observation):
+    """Observation schema for the partially observable returns task."""
+    return_reason: str = Field(..., description="Customer-provided return reason")
+    product_category: str = Field(..., description="Product category")
+    product_value: ValueBucket = Field(..., description="Value bucket")
+    days_since_purchase: int = Field(..., ge=0, description="Elapsed days")
+    user_account_age_days: int = Field(..., ge=0, description="Account age in days")
+    product_condition_notes: str = Field(..., description="Condition summary")
+    return_rate: float = Field(
+        ..., ge=0.0, le=1.0, description="Historical return rate"
+    )
+    total_orders: int = Field(..., ge=1, description="Total historical orders")
+    policy_summary: str = Field(
+        ...,
+        description="Natural-language policy text with rules and exceptions",
+    )
+    info: Dict[str, Any] = Field(
+        default_factory=dict,
+        description="Step info payload (OpenEnv-compatible info channel)",
+    )
+class EcomReward(BaseModel):
+    """Typed reward breakdown used by deterministic task graders."""
+    policy_gate: float = Field(..., ge=0.0, le=1.0)
+    financial_score: float = Field(..., ge=0.0, le=1.0)
+    fraud_score: float = Field(..., ge=0.0, le=1.0)
+    efficiency_score: float = Field(..., ge=0.0, le=1.0)
+    normalized_reward: float = Field(..., ge=0.0, le=1.0)
+    policy_violation: bool
+    optimal_action: Optional[str] = None
+    matched_optimal: Optional[bool] = None
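The `model_validator` on `EcomAction` enforces a two-way rule: `REJECT` must carry a `reason_code`, and every other action type must not. The sketch below reproduces that rule with only the standard library (a frozen dataclass with `__post_init__` in place of the pydantic model), so it runs without `openenv` or `pydantic`; `ActionSketch` is a hypothetical name. Because it has no `Literal` type enforcement, it also checks the allowed values by hand.

```python
from dataclasses import dataclass
from typing import Optional

ACTION_TYPES = ("APPROVE", "REJECT", "ESCALATE", "REQUEST_INFO")
REJECT_REASONS = ("TIME_EXPIRED", "POLICY_VIOLATION", "SUSPECTED_FRAUD")


@dataclass(frozen=True)
class ActionSketch:
    """Stand-in for EcomAction: same cross-field rule, no pydantic dependency."""
    action_type: str
    reason_code: Optional[str] = None

    def __post_init__(self) -> None:
        # Value checks that pydantic's Literal types perform in the real model.
        if self.action_type not in ACTION_TYPES:
            raise ValueError(f"invalid action_type: {self.action_type!r}")
        if self.reason_code is not None and self.reason_code not in REJECT_REASONS:
            raise ValueError(f"invalid reason_code: {self.reason_code!r}")
        # Cross-field rule mirrored from EcomAction.validate_reason_code.
        if self.action_type == "REJECT" and self.reason_code is None:
            raise ValueError("reason_code is required when action_type is REJECT")
        if self.action_type != "REJECT" and self.reason_code is not None:
            raise ValueError("reason_code is only allowed when action_type is REJECT")


if __name__ == "__main__":
    print(ActionSketch("REJECT", "TIME_EXPIRED"))
    try:
        ActionSketch("APPROVE", "SUSPECTED_FRAUD")
    except ValueError as exc:
        print("rejected:", exc)
```

In the real pydantic model the same failures surface as a `ValidationError`, which in pydantic v2 is a subclass of `ValueError`, so callers can catch either.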

openenv.yaml CHANGED

@@ -4,4 +4,8 @@ type: space
 runtime: fastapi
 app: server.app:app
 port: 8000
+description: "Policy-constrained e-commerce returns decision environment with latent fraud risk"
+tags:
+  - openenv
+  - operations
+  - decision-making

server/__init__.py CHANGED

@@ -4,7 +4,7 @@
 # This source code is licensed under the BSD-style license found in the
 # LICENSE file in the root directory of this source tree.
-"""Ecom environment server components."""
 from .ecom_environment import EcomEnvironment

 # This source code is licensed under the BSD-style license found in the
 # LICENSE file in the root directory of this source tree.
+"""Ecom returns environment server components."""
 from .ecom_environment import EcomEnvironment

server/app.py CHANGED

@@ -4,29 +4,7 @@
 # This source code is licensed under the BSD-style license found in the
 # LICENSE file in the root directory of this source tree.
-"""
-FastAPI application for the Ecom Environment.
-This module creates an HTTP server that exposes the EcomEnvironment
-over HTTP and WebSocket endpoints, compatible with EnvClient.
-Endpoints:
-    - POST /reset: Reset the environment
-    - POST /step: Execute an action
-    - GET /state: Get current environment state
-    - GET /schema: Get action/observation schemas
-    - WS /ws: WebSocket endpoint for persistent sessions
-Usage:
-    # Development (with auto-reload):
-    uvicorn server.app:app --reload --host 0.0.0.0 --port 8000
-    # Production:
-    uvicorn server.app:app --host 0.0.0.0 --port 8000 --workers 4
-    # Or run directly:
-    python -m server.app
-"""
 try:
     from openenv.core.env_server.http_server import create_app
@@ -35,18 +13,39 @@ except Exception as e:  # pragma: no cover
         "openenv is required for the web interface. Install dependencies with '\n    uv sync\n'"
     ) from e
 from ecom.models import EcomAction, EcomObservation
 from ecom.server.ecom_environment import EcomEnvironment
 # Create the app with web interface and README integration
 app = create_app(
-    EcomEnvironment,
     EcomAction,
     EcomObservation,
     env_name="ecom",
-    max_concurrent_envs=1,  # increase this number to allow more concurrent WebSocket sessions
 )
@@ -73,9 +72,4 @@ def main(host: str = "0.0.0.0", port: int = 8000):
 if __name__ == "__main__":
-    import argparse
-    parser = argparse.ArgumentParser()
-    parser.add_argument("--port", type=int, default=8000)
-    args = parser.parse_args()
-    main(port=args.port)

 # This source code is licensed under the BSD-style license found in the
 # LICENSE file in the root directory of this source tree.
+"""FastAPI application for the returns decision environment."""
 try:
     from openenv.core.env_server.http_server import create_app
         "openenv is required for the web interface. Install dependencies with '\n    uv sync\n'"
     ) from e
+import os
 from ecom.models import EcomAction, EcomObservation
 from ecom.server.ecom_environment import EcomEnvironment
+def _env_factory() -> EcomEnvironment:
+    mode = os.getenv("ECOM_MODE", "medium").strip().lower()
+    if mode not in {"easy", "medium", "hard"}:
+        mode = "medium"
+    def _maybe_float(name: str) -> float | None:
+        raw = os.getenv(name)
+        if raw is None or raw.strip() == "":
+            return None
+        return float(raw)
+    return EcomEnvironment(
+        mode=mode,
+        fraud_probability=_maybe_float("ECOM_FRAUD_PROBABILITY"),
+        ambiguity_rate=_maybe_float("ECOM_AMBIGUITY_RATE"),
+        conflict_rate=_maybe_float("ECOM_CONFLICT_RATE"),
+    )
 # Create the app with web interface and README integration
 app = create_app(
+    _env_factory,
     EcomAction,
     EcomObservation,
     env_name="ecom",
+    max_concurrent_envs=4,
 )
 if __name__ == "__main__":
+    main()
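`_env_factory` reads its configuration from environment variables: `ECOM_MODE` is trimmed and lower-cased and silently falls back to `medium` when unrecognized, while the three optional `ECOM_*` overrides become `None` when unset or blank. A non-numeric value such as `ECOM_FRAUD_PROBABILITY=abc` makes `float(raw)` raise `ValueError` when the factory runs. The pure function below reproduces that resolution against a plain mapping so it can be tested without touching `os.environ`; `resolve_env_config` is a hypothetical helper, not part of the app.

```python
from typing import Dict, Mapping, Optional


def resolve_env_config(env: Mapping[str, str]) -> Dict[str, object]:
    """Mirror _env_factory's parsing; `env` stands in for os.environ."""
    mode = env.get("ECOM_MODE", "medium").strip().lower()
    if mode not in {"easy", "medium", "hard"}:
        mode = "medium"  # unknown modes silently fall back

    def maybe_float(name: str) -> Optional[float]:
        raw = env.get(name)
        if raw is None or raw.strip() == "":
            return None  # unset or blank: no override (presumably the difficulty default applies)
        return float(raw)  # raises ValueError on non-numeric input

    return {
        "mode": mode,
        "fraud_probability": maybe_float("ECOM_FRAUD_PROBABILITY"),
        "ambiguity_rate": maybe_float("ECOM_AMBIGUITY_RATE"),
        "conflict_rate": maybe_float("ECOM_CONFLICT_RATE"),
    }


if __name__ == "__main__":
    print(resolve_env_config({"ECOM_MODE": " HARD ", "ECOM_FRAUD_PROBABILITY": "0.3"}))
```

With the app as committed, a shell invocation such as `ECOM_MODE=hard uvicorn server.app:app --port 8000` would therefore build each session's environment in hard mode, for up to the four concurrent sessions that `max_concurrent_envs=4` allows.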

server/ecom_environment.py CHANGED

@@ -4,99 +4,830 @@
 # This source code is licensed under the BSD-style license found in the
 # LICENSE file in the root directory of this source tree.
-"""
-Ecom Environment Implementation.
-A simple test environment that echoes back messages sent to it.
-Perfect for testing HTTP server infrastructure.
-"""
 from uuid import uuid4
 from openenv.core.env_server.interfaces import Environment
 from openenv.core.env_server.types import State
-from ecom.models import EcomAction, EcomObservation
-class EcomEnvironment(Environment):
-    """
-    A simple echo environment that echoes back messages.
-    This environment is designed for testing the HTTP server infrastructure.
-    It maintains minimal state and simply echoes back whatever message it receives.
-    Example:
-        >>> env = EcomEnvironment()
-        >>> obs = env.reset()
-        >>> print(obs.echoed_message)  # "Ecom environment ready!"
-        >>>
-        >>> obs = env.step(EcomAction(message="Hello"))
-        >>> print(obs.echoed_message)  # "Hello"
-        >>> print(obs.message_length)  # 5
-    """
-    # Enable concurrent WebSocket sessions.
-    # Set to True if your environment isolates state between instances.
-    # When True, multiple WebSocket clients can connect simultaneously, each
-    # getting their own environment instance (when using factory mode in app.py).
     SUPPORTS_CONCURRENT_SESSIONS: bool = True
-    def __init__(self):
-        """Initialize the ecom environment."""
-        self._state = State(episode_id=str(uuid4()), step_count=0)
-        self._reset_count = 0
-    def reset(self) -> EcomObservation:
-        """
-        Reset the environment.
-        Returns:
-            EcomObservation with a ready message
-        """
         self._state = State(episode_id=str(uuid4()), step_count=0)
-        self._reset_count += 1
-        return EcomObservation(
-            echoed_message="Ecom environment ready!",
-            message_length=0,
             done=False,
-            reward=0.0,
         )
-    def step(self, action: EcomAction) -> EcomObservation:  # type: ignore[override]
-        """
-        Execute a step in the environment by echoing the message.
-        Args:
-            action: EcomAction containing the message to echo
-        Returns:
-            EcomObservation with the echoed message and its length
-        """
         self._state.step_count += 1
-        message = action.message
-        length = len(message)
-        # Simple reward: longer messages get higher rewards
-        reward = length * 0.1
-        return EcomObservation(
-            echoed_message=message,
-            message_length=length,
-            done=False,
             reward=reward,
-            metadata={"original_message": message, "step": self._state.step_count},
         )
     @property
     def state(self) -> State:
-        """
-        Get the current environment state.
-        Returns:
-            Current State with episode_id and step_count
-        """
-        return self._state

 # This source code is licensed under the BSD-style license found in the
 # LICENSE file in the root directory of this source tree.
+"""Returns decision environment implementation."""
+from __future__ import annotations
+from dataclasses import dataclass
+from typing import Any, Dict, Literal, Optional
 from uuid import uuid4
 from openenv.core.env_server.interfaces import Environment
 from openenv.core.env_server.types import State
+from ecom.models import EcomAction, EcomObservation, EcomReward
+Difficulty = Literal["easy", "medium", "hard"]
+Intent = Literal["genuine", "abusive"]
+FinalAction = Literal["APPROVE", "REJECT", "ESCALATE"]
+@dataclass(frozen=True)
+class DifficultyConfig:
+    fraud_probability: float
+    ambiguity_rate: float
+    conflict_rate: float
+@dataclass(frozen=True)
+class TaskSpec:
+    difficulty: Difficulty
+    seed: int
+    objective: str
+    success_threshold: float
+@dataclass(frozen=True)
+class PolicyProfile:
+    window_days: int
+    non_returnable: tuple[str, ...]
+    exception_text: str
+    def summary(self) -> str:
+        categories = ", ".join(self.non_returnable)
+        return (
+            f"Returns accepted within {self.window_days} days. "
+            f"Non-returnable categories: {categories}. "
+            f"Exception: {self.exception_text}."
+        )
+@dataclass
+class HiddenCaseState:
+    fraud_risk_score: float
+    true_intent: Intent
+    optimal_action: FinalAction
+    cost_impact: Dict[str, float]
+    category_policy_violated: bool
+    time_policy_violated: bool
+    exception_applies: bool
+    is_ambiguous: bool
+@dataclass
+class VisibleCase:
+    return_reason: str
+    product_category: str
+    product_value: Literal["low", "medium", "high"]
+    days_since_purchase: int
+    user_account_age_days: int
+    product_condition_notes: str
+    return_rate: float
+    total_orders: int
+    policy_summary: str
+class EcomEnvironment(Environment[EcomAction, EcomObservation, State]):
+    """Single-request return decision environment with partial observability."""
     SUPPORTS_CONCURRENT_SESSIONS: bool = True
+    _VALUE_INDEX: Dict[str, int] = {"low": 0, "medium": 1, "high": 2}
+    _DIFFICULTY_DEFAULTS: Dict[Difficulty, DifficultyConfig] = {
+        "easy": DifficultyConfig(
+            fraud_probability=0.10, ambiguity_rate=0.10, conflict_rate=0.05
+        ),
+        "medium": DifficultyConfig(
+            fraud_probability=0.25, ambiguity_rate=0.30, conflict_rate=0.20
+        ),
+        "hard": DifficultyConfig(
+            fraud_probability=0.40, ambiguity_rate=0.55, conflict_rate=0.45
+        ),
+    }
+    _CATEGORY_POLICIES: Dict[str, PolicyProfile] = {
+        "electronics": PolicyProfile(
+            window_days=30,
+            non_returnable=("final-sale", "personal-care"),
+            exception_text="Defective electronics remain returnable even beyond standard restrictions",
+        ),
+        "fashion": PolicyProfile(
+            window_days=45,
+            non_returnable=("underwear", "swimwear"),
+            exception_text="Quality defects override category restrictions",
+        ),
+        "home": PolicyProfile(
+            window_days=60,
+            non_returnable=("custom-furniture",),
+            exception_text="Damage in transit is always eligible for return",
+        ),
+    }
+    _CONDITION_NOTES: Dict[str, tuple[str, str]] = {
+        "defective": (
+            "Customer reports device fails to power on intermittently",
+            "Diagnostic notes show repeat hardware faults and consistent malfunction",
+        ),
+        "wrong-item": (
+            "Packaging label and item description appear mismatched",
+            "Warehouse scan and photo check confirm SKU mismatch from fulfillment",
+        ),
+        "damaged-shipping": (
+            "Outer box shows dents and seal damage from transit",
+            "Carrier handoff log notes impact event with photo-confirmed product damage",
+        ),
+        "changed-mind": (
+            "Customer no longer wants the item and packaging appears opened",
+            "Follow-up confirms item used lightly with no defect evidence",
+        ),
+        "size-issue": (
+            "Customer reports fit mismatch after trying item once",
+            "Follow-up confirms sizing mismatch with item otherwise in resellable condition",
+        ),
+    }
+    _TASKS: Dict[str, TaskSpec] = {
+        "easy_policy_compliance": TaskSpec(
+            difficulty="easy",
+            seed=111,
+            objective=(
+                "Handle a straightforward, low-risk return and maximize policy-compliant value."
+            ),
+            success_threshold=0.75,
+        ),
+        "medium_balanced_judgment": TaskSpec(
+            difficulty="medium",
+            seed=222,
+            objective=(
+                "Balance policy, fraud risk, and cost trade-offs in an ambiguous return request."
+            ),
+            success_threshold=0.68,
+        ),
+        "hard_conflicting_signals": TaskSpec(
+            difficulty="hard",
+            seed=333,
+            objective=(
+                "Resolve conflicting policy exceptions and risk signals in a high-value case."
+            ),
+            success_threshold=0.62,
+        ),
+    }
+    def __init__(
+        self,
+        mode: Difficulty = "medium",
+        *,
+        fraud_probability: Optional[float] = None,
+        ambiguity_rate: Optional[float] = None,
+        conflict_rate: Optional[float] = None,
+        task_name: Optional[str] = None,
+    ):
+        self._task_name: Optional[str] = task_name
+        self._task_spec: Optional[TaskSpec] = None
+        if task_name is not None:
+            if task_name not in self._TASKS:
+                valid = ", ".join(sorted(self._TASKS))
+                raise ValueError(
+                    f"Unknown task_name '{task_name}'. Valid tasks: {valid}"
+                )
+            self._task_spec = self._TASKS[task_name]
+            mode = self._task_spec.difficulty
+        if mode not in self._DIFFICULTY_DEFAULTS:
+            raise ValueError("mode must be one of: easy, medium, hard")
+        self._mode: Difficulty = mode
+        base_cfg = self._DIFFICULTY_DEFAULTS[mode]
+        self._cfg = DifficultyConfig(
+            fraud_probability=self._clamp01(
+                base_cfg.fraud_probability
+                if fraud_probability is None
+                else fraud_probability
+            ),
+            ambiguity_rate=self._clamp01(
+                base_cfg.ambiguity_rate if ambiguity_rate is None else ambiguity_rate
+            ),
+            conflict_rate=self._clamp01(
+                base_cfg.conflict_rate if conflict_rate is None else conflict_rate
+            ),
+        )
         self._state = State(episode_id=str(uuid4()), step_count=0)
+        self._visible_case: Optional[VisibleCase] = None
+        self._hidden_case: Optional[HiddenCaseState] = None
+        self._requested_info = False
+        self._done = False
+        self._task_seed: Optional[int] = (
+            self._task_spec.seed if self._task_spec else None
+        )
+    def reset(
+        self,
+        seed: Optional[int] = None,
+        episode_id: Optional[str] = None,
+        task_name: Optional[str] = None,
+        **kwargs,
+    ) -> EcomObservation:
+        del kwargs
+        if task_name is not None:
+            if task_name not in self._TASKS:
+                valid = ", ".join(sorted(self._TASKS))
+                raise ValueError(
+                    f"Unknown task_name '{task_name}'. Valid tasks: {valid}"
+                )
+            self._task_name = task_name
+            self._task_spec = self._TASKS[task_name]
+            self._task_seed = self._task_spec.seed
+            self._mode = self._task_spec.difficulty
+            base_cfg = self._DIFFICULTY_DEFAULTS[self._mode]
+            self._cfg = DifficultyConfig(
+                fraud_probability=base_cfg.fraud_probability,
+                ambiguity_rate=base_cfg.ambiguity_rate,
+                conflict_rate=base_cfg.conflict_rate,
+            )
+        self._state = State(episode_id=episode_id or str(uuid4()), step_count=0)
+        self._requested_info = False
+        self._done = False
+        effective_seed = seed
+        if effective_seed is None and self._task_seed is not None:
+            effective_seed = self._task_seed
+        rng = self._rng(effective_seed)
+        self._visible_case, self._hidden_case = self._generate_case(rng)
+        return self._to_observation(
+            self._visible_case,
+            reward=None,
             done=False,
+            info={
+                "mode": self._mode,
+                "task_name": self._task_name,
+                "task_objective": self._task_spec.objective
+                if self._task_spec
+                else None,
+                "task_seed": effective_seed,
+                "phase": "initial",
+                "step_contract": "observation_reward_done_info",
+            },
         )
+    def step(
+        self,
+        action: EcomAction,
+        timeout_s: Optional[float] = None,
+        **kwargs,
+    ) -> EcomObservation:
+        del timeout_s, kwargs
+        if self._visible_case is None or self._hidden_case is None:
+            # Allow stateless HTTP /step calls by lazily initializing an episode.
+            self.reset()
+        if self._done:
+            raise RuntimeError(
+                "Episode already terminated. Call reset() to start a new episode"
+            )
         self._state.step_count += 1
+        if action.action_type == "REQUEST_INFO":
+            if self._requested_info:
+                info = {
+                    "invalid_action": "REQUEST_INFO already used",
+                    "allowed_actions": ["APPROVE", "REJECT", "ESCALATE"],
+                    "step_penalty": -0.10,
+                    "step_contract": "observation_reward_done_info",
+                }
+                return self._to_observation(
+                    self._visible_case,
+                    reward=-0.10,
+                    done=False,
+                    info=info,
+                )
+            self._requested_info = True
+            self._visible_case = self._refine_after_request_info(
+                self._visible_case,
+                self._hidden_case,
+            )
+            info_gain_reward = 0.08 if self._hidden_case.is_ambiguous else -0.03
+            info = {
+                "phase": "post_request_info",
+                "revealed": ["product_condition_notes", "return_reason"],
+                "step_reward": info_gain_reward,
+                "step_contract": "observation_reward_done_info",
+            }
+            return self._to_observation(
+                self._visible_case,
+                reward=info_gain_reward,
+                done=False,
+                info=info,
+            )
+        if action.action_type not in ("APPROVE", "REJECT", "ESCALATE"):
+            info = {
+                "invalid_action": "Final action must be APPROVE, REJECT, or ESCALATE",
+                "step_penalty": -0.05,
+                "step_contract": "observation_reward_done_info",
+            }
+            return self._to_observation(
+                self._visible_case,
+                reward=-0.05,
+                done=False,
+                info=info,
+            )
+        reward, breakdown = self._evaluate(
+            action, self._visible_case, self._hidden_case
+        )
+        self._done = True
+        info = {
+            "mode": self._mode,
+            "task_name": self._task_name,
+            "phase": "terminal",
+            "breakdown": breakdown,
+            "grader_score": float(breakdown["normalized_reward"]),
+            "grader_success": self._task_success(float(breakdown["normalized_reward"])),
+            "step_contract": "observation_reward_done_info",
+        }
+        return self._to_observation(
+            self._visible_case,
             reward=reward,
+            done=True,
+            info=info,
         )
     @property
     def state(self) -> State:
+        return State(
+            episode_id=self._state.episode_id,
+            step_count=self._state.step_count,
+            mode=self._mode,
+            task_name=self._task_name,
+            task_objective=self._task_spec.objective if self._task_spec else None,
+            done=self._done,
+            requested_info=self._requested_info,
+        )
+    @classmethod
+    def task_names(cls) -> tuple[str, ...]:
+        return tuple(cls._TASKS.keys())
+    @classmethod
+    def task_spec(cls, task_name: str) -> TaskSpec:
+        if task_name not in cls._TASKS:
+            valid = ", ".join(sorted(cls._TASKS))
+            raise ValueError(f"Unknown task_name '{task_name}'. Valid tasks: {valid}")
+        return cls._TASKS[task_name]
+    def _task_success(self, score: float) -> bool:
+        if self._task_spec is None:
+            return score >= 0.7
+        return score >= self._task_spec.success_threshold
+    def grader_score(self, action: EcomAction) -> float:
+        if self._visible_case is None or self._hidden_case is None:
+            raise RuntimeError("Environment must be reset() before grader scoring")
+        score, _ = self._evaluate(action, self._visible_case, self._hidden_case)
+        return score
+    def get_metadata(self) -> "EnvironmentMetadata":
+        from openenv.core.env_server.types import EnvironmentMetadata
+        return EnvironmentMetadata(
+            name="ecom-returns-decision",
+            description=(
+                "Operational return-decision environment with policy constraints, "
+                "latent fraud risk, and cost-aware grading."
+            ),
+            version="1.0.0",
+            author="OpenEnv_H",
+            documentation_url="https://huggingface.co/spaces/<repo-id>",
+        )
+    @staticmethod
+    def _rng(seed: Optional[int]):
+        import random
+        return random.Random(seed)
+    def _generate_case(self, rng) -> tuple[VisibleCase, HiddenCaseState]:
+        category = rng.choice(tuple(self._CATEGORY_POLICIES.keys()))
+        policy = self._CATEGORY_POLICIES[category]
+        return_reason = self._weighted_choice(
+            rng,
+            {
+                "defective": 0.24,
+                "wrong-item": 0.14,
+                "damaged-shipping": 0.12,
+                "changed-mind": 0.28,
+                "size-issue": 0.22,
+            },
+        )
+        value_bucket = self._weighted_choice(
+            rng,
+            {
+                "low": 0.40,
+                "medium": 0.40,
+                "high": 0.20,
+            },
+        )
+        days_since_purchase = rng.randint(0, 90)
+        user_account_age_days = rng.randint(15, 2200)
+        total_orders = rng.randint(2, 220)
+        # Mandatory behavioral signal.
+        return_rate = self._sample_return_rate(rng, total_orders)
+        # Mandatory correlations.
+        # 1) Higher return_rate -> higher fraud, lower return_rate -> lower fraud.
+        # 2) Higher product_value -> higher fraud, lower product_value -> lower fraud.
+        base_risk = self._cfg.fraud_probability
+        risk = base_risk
+        risk += 0.35 * (return_rate - 0.30)
+        risk += 0.10 * (self._VALUE_INDEX[value_bucket] - 1)
+        risk += 0.08 if return_reason == "changed-mind" else 0.0
+        risk -= (
+            0.10
+            if return_reason in ("defective", "wrong-item", "damaged-shipping")
+            else 0.0
+        )
+        risk += 0.10 if user_account_age_days < 90 else 0.0
+        fraud_risk_score = self._clamp01(risk)
+        # Intent depends on computed latent risk, not independent coin flips.
+        true_intent: Intent = (
+            "abusive" if rng.random() < fraud_risk_score else "genuine"
+        )
+        # Policy constraints with exception support.
+        exception_applies = return_reason in ("defective", "damaged-shipping")
+        category_policy_violated = False
+        if not exception_applies:
+            category_flag_prob = 0.10 + 0.12 * self._cfg.conflict_rate
+            if return_reason == "changed-mind":
+                category_flag_prob += 0.10
+            if rng.random() < self._clamp01(category_flag_prob):
+                policy_tag = rng.choice(policy.non_returnable)
+                category_policy_violated = True
+            else:
+                policy_tag = None
+        else:
+            policy_tag = None
+        time_policy_violated = (
+            days_since_purchase > policy.window_days and not exception_applies
+        )
+        is_ambiguous = rng.random() < self._cfg.ambiguity_rate
+        is_conflicting = rng.random() < self._cfg.conflict_rate
+        # Hardness is not just fraud probability; ambiguity and conflicts reshape signals.
+        condition_brief, condition_detailed = self._CONDITION_NOTES[return_reason]
+        if category_policy_violated and policy_tag is not None:
+            condition_brief += (
+                f"; order line is tagged under restricted class '{policy_tag}'"
+            )
+            condition_detailed += (
+                f"; policy audit confirms '{policy_tag}' item class on this order line"
+            )
+        if is_conflicting:
+            condition_brief = self._inject_conflict_signal(
+                condition_brief, return_reason
+            )
+            condition_detailed = self._inject_conflict_signal(
+                condition_detailed, return_reason
+            )
+        policy_summary = policy.summary()
+        if is_ambiguous:
+            policy_summary = self._make_policy_more_ambiguous(policy_summary)
+        visible = VisibleCase(
+            return_reason=return_reason,
+            product_category=category,
+            product_value=value_bucket,
+            days_since_purchase=days_since_purchase,
+            user_account_age_days=user_account_age_days,
+            product_condition_notes=condition_brief,
+            return_rate=return_rate,
+            total_orders=total_orders,
+            policy_summary=policy_summary,
+        )
+        financial_scores = self._financial_scores(
+            value_bucket=value_bucket,
+            intent=true_intent,
+            category_violation=category_policy_violated,
+            time_violation=time_policy_violated,
+            return_reason=return_reason,
+        )
+        optimal_action = self._argmax_action(financial_scores)
+        hidden = HiddenCaseState(
+            fraud_risk_score=fraud_risk_score,
+            true_intent=true_intent,
+            optimal_action=optimal_action,
+            cost_impact=financial_scores,
+            category_policy_violated=category_policy_violated,
+            time_policy_violated=time_policy_violated,
+            exception_applies=exception_applies,
+            is_ambiguous=is_ambiguous or is_conflicting,
+        )
+        return visible, hidden
+    @staticmethod
+    def _sample_return_rate(rng, total_orders: int) -> float:
+        band = rng.random()
+        if band < 0.60:
+            center = 0.12
+            spread = 0.08
+        elif band < 0.90:
+            center = 0.30
+            spread = 0.10
+        else:
+            center = 0.55
+            spread = 0.12
+        noise = (rng.random() * 2.0 - 1.0) * spread
+        historical_pressure = min(0.08, 8.0 / float(total_orders))
+        return EcomEnvironment._clamp01(center + noise + historical_pressure)
+    @staticmethod
+    def _inject_conflict_signal(text: str, reason: str) -> str:
+        if reason in ("defective", "damaged-shipping"):
+            return text + "; inspection has mixed indicators and partial evidence"
+        return text + "; customer claims conflict with available logistics notes"
+    @staticmethod
+    def _make_policy_more_ambiguous(text: str) -> str:
+        return (
+            text + " In borderline cases, consistency checks and risk controls apply."
+        )
+    @staticmethod
+    def _financial_scores(
+        *,
+        value_bucket: str,
+        intent: Intent,
+        category_violation: bool,
+        time_violation: bool,
+        return_reason: str,
+    ) -> Dict[FinalAction, float]:
+        value_scale = {"low": 1.0, "medium": 1.7, "high": 2.6}[value_bucket]
+        approve_gain = 0.45
+        if return_reason in ("defective", "wrong-item", "damaged-shipping"):
+            approve_gain += 0.25
+        if intent == "abusive":
+            approve_gain -= 0.45 * value_scale
+        else:
+            approve_gain -= 0.15 * value_scale
+        if category_violation or time_violation:
+            approve_gain -= 0.35
+        reject_gain = 0.25
+        if intent == "abusive":
+            reject_gain += 0.35 * value_scale
+        else:
+            reject_gain -= 0.30
+        if return_reason in ("defective", "wrong-item", "damaged-shipping"):
+            reject_gain -= 0.20
+        escalate_gain = 0.30
+        escalate_gain -= 0.08 * value_scale
+        if intent == "abusive":
+            escalate_gain += 0.12
+        if return_reason in ("defective", "damaged-shipping"):
+            escalate_gain += 0.05
+        return {
+            "APPROVE": approve_gain,
+            "REJECT": reject_gain,
+            "ESCALATE": escalate_gain,
+        }
+    @staticmethod
+    def _argmax_action(scores: Dict[FinalAction, float]) -> FinalAction:
+        return max(scores.keys(), key=lambda k: scores[k])
+    def _evaluate(
+        self,
+        action: EcomAction,
+        visible: VisibleCase,
+        hidden: HiddenCaseState,
+    ) -> tuple[float, Dict[str, Any]]:
+        policy_ok = self._policy_gate(action, visible, hidden)
+        if not policy_ok:
+            reward_model = EcomReward(
+                policy_gate=0.0,
+                financial_score=0.0,
+                fraud_score=0.0,
+                efficiency_score=0.0,
+                normalized_reward=0.0,
+                policy_violation=True,
+            )
+            return 0.0, reward_model.model_dump()
+        final_action = action.action_type
+        if final_action == "REJECT":
+            reason_bonus = 0.0
+            if hidden.time_policy_violated and action.reason_code == "TIME_EXPIRED":
+                reason_bonus = 0.05
+            elif (
+                hidden.category_policy_violated
+                and action.reason_code == "POLICY_VIOLATION"
+            ):
+                reason_bonus = 0.05
+            elif (
+                hidden.true_intent == "abusive"
+                and action.reason_code == "SUSPECTED_FRAUD"
+            ):
+                reason_bonus = 0.05
+        else:
+            reason_bonus = 0.0
+        trajectory_bonus = 0.0
+        if self._requested_info and hidden.is_ambiguous:
+            trajectory_bonus += 0.05
+        # Each component score is bounded to [0, 1] before weighting.
+        financial_raw = (
+            hidden.cost_impact[final_action] + reason_bonus + trajectory_bonus
+        )
+        financial_score = self._normalize_financial(financial_raw)
+        fraud_score = self._fraud_component(final_action, hidden)
+        efficiency_score = self._efficiency_component(final_action)
+        final_reward = (
+            0.50 * financial_score + 0.30 * fraud_score + 0.20 * efficiency_score
+        )
+        final_reward = self._clamp01(final_reward)
+        reward_model = EcomReward(
+            policy_gate=1.0,
+            financial_score=financial_score,
+            fraud_score=fraud_score,
+            efficiency_score=efficiency_score,
+            normalized_reward=final_reward,
+            policy_violation=False,
+            optimal_action=hidden.optimal_action,
+            matched_optimal=final_action == hidden.optimal_action,
+        )
+        return final_reward, reward_model.model_dump()
+    @staticmethod
+    def _policy_gate(
+        action: EcomAction,
+        visible: VisibleCase,
+        hidden: HiddenCaseState,
+    ) -> bool:
+        if action.action_type == "APPROVE":
+            if hidden.time_policy_violated or hidden.category_policy_violated:
+                return False
+        if action.action_type == "REJECT":
+            if hidden.time_policy_violated and hidden.category_policy_violated:
+                if action.reason_code not in ("TIME_EXPIRED", "POLICY_VIOLATION"):
+                    return False
+            elif hidden.time_policy_violated:
+                if action.reason_code != "TIME_EXPIRED":
+                    return False
+            elif hidden.category_policy_violated:
+                if action.reason_code != "POLICY_VIOLATION":
+                    return False
+            else:
+                if action.reason_code in ("TIME_EXPIRED", "POLICY_VIOLATION"):
+                    return False
+            # Block unsupported fraud accusations when latent fraud risk is below 0.45.
+            if (
+                action.reason_code == "SUSPECTED_FRAUD"
+                and hidden.fraud_risk_score < 0.45
+            ):
+                return False
+        # If no violation and no fraud signal, rejecting a clear service-failure claim is policy-inconsistent.
+        if (
+            action.action_type == "REJECT"
+            and not hidden.time_policy_violated
+            and not hidden.category_policy_violated
+            and hidden.fraud_risk_score < 0.30
+            and visible.return_reason in ("defective", "wrong-item", "damaged-shipping")
+        ):
+            return False
+        return True
+    @staticmethod
+    def _normalize_financial(raw_value: float) -> float:
+        # Linearly map raw values from roughly [-1.5, 1.5] into [0, 1], clamping any overflow.
+        return EcomEnvironment._clamp01((raw_value + 1.5) / 3.0)
+    @staticmethod
+    def _fraud_component(final_action: FinalAction, hidden: HiddenCaseState) -> float:
+        risk = hidden.fraud_risk_score
+        if final_action == "REJECT":
+            if hidden.true_intent == "abusive":
+                return EcomEnvironment._clamp01(0.60 + 0.40 * risk)
+            return EcomEnvironment._clamp01(0.20 + 0.30 * (1.0 - risk))
+        if final_action == "APPROVE":
+            if hidden.true_intent == "genuine":
+                return EcomEnvironment._clamp01(0.65 + 0.35 * (1.0 - risk))
+            return EcomEnvironment._clamp01(0.10 + 0.20 * (1.0 - risk))
+        # ESCALATE
+        if hidden.true_intent == "abusive":
+            return EcomEnvironment._clamp01(0.50 + 0.30 * risk)
+        return EcomEnvironment._clamp01(0.45 + 0.25 * (1.0 - risk))
+    def _efficiency_component(self, final_action: FinalAction) -> float:
+        # Escalation and a prior info request each incur an efficiency penalty.
+        base = 1.0
+        if self._requested_info:
+            base -= 0.20
+        if final_action == "ESCALATE":
+            base -= 0.30
+        return self._clamp01(base)
+    def _refine_after_request_info(
+        self,
+        visible: VisibleCase,
+        hidden: HiddenCaseState,
+    ) -> VisibleCase:
+        reason = visible.return_reason
+        if hidden.true_intent == "abusive":
+            refined_reason = (
+                "changed-mind" if reason in ("defective", "wrong-item") else reason
+            )
+            refined_notes = (
+                visible.product_condition_notes
+                + "; follow-up review found no reproducible defect evidence"
+            )
+        else:
+            refined_reason = reason
+            refined_notes = (
+                self._CONDITION_NOTES[reason][1]
+                if reason in self._CONDITION_NOTES
+                else visible.product_condition_notes
+            )
+        # Deterministic, existing-field-only refinement.
+        refined_return_rate = self._clamp01(
+            visible.return_rate - 0.03
+            if hidden.true_intent == "genuine"
+            else visible.return_rate + 0.03
+        )
+        return VisibleCase(
+            return_reason=refined_reason,
+            product_category=visible.product_category,
+            product_value=visible.product_value,
+            days_since_purchase=visible.days_since_purchase,
+            user_account_age_days=visible.user_account_age_days,
+            product_condition_notes=refined_notes,
+            return_rate=refined_return_rate,
+            total_orders=visible.total_orders,
+            policy_summary=visible.policy_summary,
+        )
+    @staticmethod
+    def _to_observation(
+        case: VisibleCase,
+        *,
+        reward: Optional[float],
+        done: bool,
+        info: Dict[str, Any],
+    ) -> EcomObservation:
+        return EcomObservation(
+            return_reason=case.return_reason,
+            product_category=case.product_category,
+            product_value=case.product_value,
+            days_since_purchase=case.days_since_purchase,
+            user_account_age_days=case.user_account_age_days,
+            product_condition_notes=case.product_condition_notes,
+            return_rate=case.return_rate,
+            total_orders=case.total_orders,
+            policy_summary=case.policy_summary,
+            reward=reward,
+            done=done,
+            info=info,
+        )
+    @staticmethod
+    def _weighted_choice(rng, distribution: Dict[str, float]) -> str:
+        threshold = rng.random()
+        cumulative = 0.0
+        last = None
+        for key, weight in distribution.items():
+            cumulative += weight
+            last = key
+            if threshold <= cumulative:
+                return key
+        assert last is not None
+        return last
+    @staticmethod
+    def _clamp01(value: float) -> float:
+        if value < 0.0:
+            return 0.0
+        if value > 1.0:
+            return 1.0
+        return float(value)
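The terminal grading path in this diff is spread across `_financial_scores`, `_evaluate`, `_fraud_component`, `_efficiency_component`, and `_normalize_financial`. The standalone sketch below is not part of the commit. It re-implements those formulas with the constants copied from the diff so the 0.50 / 0.30 / 0.20 weighting can be checked by hand. The function names are hypothetical, the sketch assumes `_policy_gate` has already passed (a failed gate yields a reward of 0.0), and it folds `reason_bonus` and `trajectory_bonus` into a single `bonus` argument.

```python
from typing import Dict, Literal

Intent = Literal["genuine", "abusive"]

# Constants mirrored from EcomEnvironment in the diff above.
SERVICE_FAILURES = ("defective", "wrong-item", "damaged-shipping")
VALUE_SCALE = {"low": 1.0, "medium": 1.7, "high": 2.6}


def clamp01(value: float) -> float:
    """Bound a value to [0, 1], like EcomEnvironment._clamp01."""
    return min(1.0, max(0.0, float(value)))


def financial_scores(value_bucket: str, intent: Intent, violated: bool, reason: str) -> Dict[str, float]:
    """Raw per-action gains; `violated` means a time or category policy violation."""
    scale = VALUE_SCALE[value_bucket]
    approve = 0.45 + (0.25 if reason in SERVICE_FAILURES else 0.0)
    approve -= (0.45 if intent == "abusive" else 0.15) * scale
    if violated:
        approve -= 0.35
    reject = 0.25 + (0.35 * scale if intent == "abusive" else -0.30)
    if reason in SERVICE_FAILURES:
        reject -= 0.20
    escalate = 0.30 - 0.08 * scale + (0.12 if intent == "abusive" else 0.0)
    if reason in ("defective", "damaged-shipping"):
        escalate += 0.05
    return {"APPROVE": approve, "REJECT": reject, "ESCALATE": escalate}


def fraud_component(action: str, intent: Intent, risk: float) -> float:
    """Reward for handling latent fraud risk correctly."""
    if action == "REJECT":
        return clamp01(0.60 + 0.40 * risk if intent == "abusive" else 0.20 + 0.30 * (1.0 - risk))
    if action == "APPROVE":
        return clamp01(0.65 + 0.35 * (1.0 - risk) if intent == "genuine" else 0.10 + 0.20 * (1.0 - risk))
    # ESCALATE
    return clamp01(0.50 + 0.30 * risk if intent == "abusive" else 0.45 + 0.25 * (1.0 - risk))


def efficiency_component(action: str, requested_info: bool) -> float:
    """Start at 1.0; subtract 0.20 for REQUEST_INFO and 0.30 for ESCALATE."""
    base = 1.0 - (0.20 if requested_info else 0.0) - (0.30 if action == "ESCALATE" else 0.0)
    return clamp01(base)


def terminal_reward(
    action: str,
    scores: Dict[str, float],
    intent: Intent,
    risk: float,
    requested_info: bool,
    bonus: float = 0.0,
) -> float:
    """Weighted terminal reward, assuming the policy gate passed."""
    financial = clamp01((scores[action] + bonus + 1.5) / 3.0)
    fraud = fraud_component(action, intent, risk)
    efficiency = efficiency_component(action, requested_info)
    return clamp01(0.50 * financial + 0.30 * fraud + 0.20 * efficiency)


if __name__ == "__main__":
    # Low-value defective return from a genuine customer with latent risk 0.2.
    scores = financial_scores("low", "genuine", violated=False, reason="defective")
    best = max(scores, key=scores.get)
    reward = terminal_reward("APPROVE", scores, "genuine", risk=0.2, requested_info=False)
    print(best, round(reward, 4))  # APPROVE 0.8207
```

In this example the financial component is (0.55 + 1.5) / 3 ≈ 0.683, the fraud component is 0.65 + 0.35 × 0.8 = 0.93, and efficiency stays at 1.0, so the weighted sum is about 0.821. That clears every `success_threshold` defined in `_TASKS`.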