---
title: E-commerce Returns Decision Environment
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
  - openenv
  - operations
  - decision-making
---

# E-commerce Returns Decision Environment

This environment is a partially observable, policy-constrained decision process
for a single e-commerce return case per episode.

## Environment root and loader contract

This repository uses `ecom/` as the OpenEnv environment root.

- `ecom/openenv.yaml` is the authoritative manifest.
- `app: server.app:app` resolves inside `ecom/`, so it maps to
  `ecom/server/app.py`.
- Validate from repository root with `openenv validate ecom`, or from
  `ecom/` with `openenv validate .`.

## Formal task definition

Episode is defined over hidden state `s_t` and observation `o_t`.

- Hidden state contains fraud intent, policy violations, latent risk, and
  optimal action target.
- Observation exposes only operational case fields and policy summary text.
- Agent takes one action from `A = {APPROVE, REJECT, ESCALATE, REQUEST_INFO}`.
- Terminal objective is maximizing normalized reward while satisfying policy gate
  constraints.

The environment is not a static classifier. It is a short-horizon sequential
decision loop with action-dependent transition and scoring.

## Schemas

### Action (`EcomAction`)

- `action_type`: `APPROVE | REJECT | ESCALATE | REQUEST_INFO`
- `reason_code` is required only when `action_type == REJECT`
- Allowed reject reasons:
  - `TIME_EXPIRED`
  - `POLICY_VIOLATION`
  - `SUSPECTED_FRAUD`

Validation is strict: non-REJECT actions cannot carry `reason_code`.

### Observation (`EcomObservation`)

- `return_reason`
- `product_category`
- `product_value` in `{low, medium, high}`
- `days_since_purchase`
- `user_account_age_days`
- `product_condition_notes`
- `return_rate` in `[0,1]`
- `total_orders >= 1`
- `policy_summary`
- `reward`, `done`, `info`

### Reward payload (`EcomReward`)

Terminal breakdown keys:

- `policy_gate`
- `financial_score`
- `fraud_score`
- `efficiency_score`
- `normalized_reward`
- `policy_violation`
- `optimal_action`
- `matched_optimal`

All numeric reward components are bounded to `[0,1]`.
`optimal_action` is the highest-scoring legal terminal action label, such as
`APPROVE`, `ESCALATE`, or `REJECT(<REASON>)`; it may be `null` when no legal
terminal action exists from the current state.

## Episode protocol

### Reset

`reset(seed=None, episode_id=None, task_name=None)`:

- initializes state
- samples deterministic or stochastic case
- returns initial observation with:
  - `info.phase=initial`
  - `info.available_actions=[APPROVE, REJECT, ESCALATE, REQUEST_INFO]`
  - `info.reject_reason_codes=[TIME_EXPIRED, POLICY_VIOLATION, SUSPECTED_FRAUD]`
  - `info.task_name`, `info.task_seed`, `info.task_objective` when task-based

### Step

`step(action)` follows these guards and transitions:

1. If called before reset:
   - action is ignored
   - returns fresh initial observation
   - sets `invalid_action` and `last_action_error` to
     `step_called_before_reset_action_ignored`
2. If called after terminal:
   - returns terminal observation
   - `reward=0.0`, `done=true`
   - sets `invalid_action` and `last_action_error` to
     `episode_already_terminated_call_reset`
3. `REQUEST_INFO` first use:
   - non-terminal
   - refines existing fields only
   - reward shaping: `+0.08` if ambiguous else `-0.03`
4. Repeated `REQUEST_INFO`:
   - non-terminal penalty `-0.10`
   - error code: `request_info_already_used`
5. Invalid non-terminal-final action type:
   - non-terminal penalty `-0.05`
   - error code: `invalid_final_action`
6. Valid terminal action (`APPROVE|REJECT|ESCALATE`):
   - runs policy gate then reward model
   - returns terminal observation with grader fields

Hard cap is `_MAX_STEPS=4`. Exceeding cap returns terminal `0.0` with
`termination_reason=max_steps_exceeded`.

## Info-channel contract

`info` is the machine-readable control channel. It is used for policy hints,
error handling, and grader reporting.

Common keys by phase:

- Initial phase:
  - `phase=initial`
  - `available_actions`
  - `reject_reason_codes`
- Post-`REQUEST_INFO` phase:
  - `phase=post_request_info`
  - `revealed`
  - `available_actions`
  - `reject_reason_codes`
- Terminal phase:
  - `phase=terminal`
  - `breakdown`
  - `grader_score`
  - `grader_success`
  - `decision_audit`
- Invalid action paths:
  - `invalid_action` (stable machine code)
  - `last_action_error` (same machine code)

## Case generation model

### Difficulty presets

- `easy`: fraud `0.10`, ambiguity `0.10`, conflict `0.05`
- `medium`: fraud `0.25`, ambiguity `0.30`, conflict `0.20`
- `hard`: fraud `0.40`, ambiguity `0.55`, conflict `0.45`

### Latent risk construction

For non-hard-template episodes, latent fraud risk is derived from correlated
signals, not independent labels.

Base formula (clamped to `[0,1]`):

```text
risk = base_fraud_probability
     + 0.35 * (return_rate - 0.30)
     + 0.10 * value_index
     + reason_and_account_adjustments
```

Where `value_index` maps low/medium/high to `-1/0/+1` offset through internal
indexing. Intent is then sampled from this latent risk.

### Policy model

Each category defines:

- return window days
- non-returnable category list
- exception text

Policy violations are split into:

- `time_policy_violated`
- `category_policy_violated`

Exception handling is explicitly modeled and influences both generation and
policy gate decisions.

### Ambiguity and conflict injection

- Ambiguity and conflict are sampled from difficulty-controlled rates.
- Conflict mutates condition/policy wording to create realistic contradictory
  evidence patterns.

### Hard template (`hard_conflicting_signals`)

The hard task uses a deterministic high-risk template:

- high-value electronics focus
- near-window timing
- intentionally conflicting evidence phrases
- stricter policy-gate behavior requiring evidence handling before finalization

## Transition semantics for `REQUEST_INFO`

`REQUEST_INFO` does not add new fields. It only refines existing observable
fields deterministically from hidden intent:

- `product_condition_notes`
- `return_reason` (may refine)
- `return_rate` (small deterministic shift)

This keeps schema fixed while allowing information-gathering behavior.

## Policy gate

If policy gate fails, terminal reward is forced to `0.0`.

Core constraints enforced:

- `APPROVE` is blocked on time/category violations.
- `APPROVE` may be blocked in high-risk ambiguous cases without exception.
- `REJECT` requires reason-code consistency with actual violation structure.
- Fraud rejection is blocked when fraud signal is too low.
- Rejecting clear low-fraud service-failure claims is blocked.
- In ambiguous hard scenarios, direct finalization before evidence collection can
  be blocked.

## Reward model

After gate pass:

1. Financial component:

```text
financial_raw = cost_impact[action] + reason_bonus + trajectory_bonus
financial_score = clamp01((financial_raw + 1.5) / 3.0)
```

2. Fraud component uses action-intent-risk-conditioned piecewise scoring.

3. Efficiency component:

```text
efficiency = 1.0 - 0.20*(requested_info_used) - 0.30*(action==ESCALATE)
```

4. Final reward:

```text
reward = clamp01(
    0.50 * financial_score
  + 0.30 * fraud_score
  + 0.20 * efficiency_score
)
```

Trajectory shaping:

- positive bonus for requesting info in ambiguous cases
- penalty for skipping info in ambiguous cases

## Creative evaluator features

To improve real-world utility and exploit resistance during review, terminal
responses include a deterministic decision audit payload.

`info.decision_audit` includes:

- `chosen_action`
- `chosen_reward`
- `best_counterfactual_reward`
- `decision_gap`
- `counterfactual_rewards` for:
  - `APPROVE`
  - `ESCALATE`
  - `REJECT(TIME_EXPIRED)`
  - `REJECT(POLICY_VIOLATION)`
  - `REJECT(SUSPECTED_FRAUD)`
- `risk_band` (`low|medium|high`)
- `policy_flags` (`time_policy_violated`, `category_policy_violated`,
  `exception_applies`, `ambiguous_case`)

This creates a transparent, machine-checkable explanation surface without
changing reward determinism.

## Deterministic task set

Tasks are fixed-name benchmarks with fixed seed and threshold:

1. `easy_policy_compliance`:
   - seed `111`
   - threshold `0.75`
2. `medium_balanced_judgment`:
   - seed `222`
   - threshold `0.68`
3. `hard_conflicting_signals`:
   - seed `333`
   - threshold `0.74`

Terminal `grader_success` is computed against the active task threshold.

## Determinism and reproducibility

- Uses `random.Random(seed)` for case generation.
- Task mode pins seed unless an explicit seed override is passed.
- No wall-clock dependence in generation or scoring.
- `grader_score(action)` is deterministic for a fixed latent case.

## Inference contract (`../inference.py`)

Baseline runner enforces strict one-line logs:

- `[START] task=<task> env=<benchmark> model=<model>`
- `[STEP] step=<n> action=<action> reward=<r> done=<bool> error=<value|null>`
- `[END] success=<bool> steps=<n> score=<score> rewards=<r1,r2,...>`

Action selection path uses environment-provided control hints:

- `available_actions`
- `reject_reason_codes`
- `invalid_action` / `last_action_error`

This reduces invalid-action loops and keeps inference behavior aligned with
runtime contract.

LLM proxy requirement for submission validation:

- client initialization must use injected environment variables only:
  - `api_key=os.environ["API_KEY"]`
  - `base_url=os.environ["API_BASE_URL"]`
- do not hardcode provider keys or bypass the injected proxy URL.

Exploit-hardening and evaluation integrity notes:

- `step()` before `reset()` does not execute the supplied action; it returns
  initial state with machine-readable error codes.
- post-terminal `step()` calls return terminal state with `reward=0.0` and
  explicit terminal error code, preventing undefined behavior loops.
- invalid-action branches emit stable machine codes and explicit
  `available_actions` / `reject_reason_codes` to avoid parser ambiguity.
- task grading is deterministic with fixed seeds and fixed success thresholds;
  no hidden stochastic post-processing in scoring.

## Validation checklist

From repository root:

```bash
openenv validate ecom
python -m pytest tests -q
./validate-submission.sh <space-url> .
```

From `ecom/`:

```bash
openenv validate .
openenv push
```