---
title: E-commerce Returns Decision Environment
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
- openenv
- operations
- decision-making
---
# E-commerce Returns Decision Environment
This environment is a partially observable, policy-constrained decision process
for a single e-commerce return case per episode.
## Environment root and loader contract
This repository uses `ecom/` as the OpenEnv environment root.
- `ecom/openenv.yaml` is the authoritative manifest.
- `app: server.app:app` resolves inside `ecom/`, so it maps to
`ecom/server/app.py`.
- Validate from repository root with `openenv validate ecom`, or from
`ecom/` with `openenv validate .`.
## Formal task definition
Each episode is defined over a hidden state `s_t` and an observation `o_t`.
- Hidden state contains fraud intent, policy violations, latent risk, and
optimal action target.
- Observation exposes only operational case fields and policy summary text.
- Agent takes one action from `A = {APPROVE, REJECT, ESCALATE, REQUEST_INFO}`.
- The terminal objective is to maximize normalized reward while satisfying the
policy gate constraints.
The environment is not a static classifier. It is a short-horizon sequential
decision loop with action-dependent transition and scoring.
## Schemas
### Action (`EcomAction`)
- `action_type`: `APPROVE | REJECT | ESCALATE | REQUEST_INFO`
- `reason_code` is required only when `action_type == REJECT`
- Allowed reject reasons:
  - `TIME_EXPIRED`
  - `POLICY_VIOLATION`
  - `SUSPECTED_FRAUD`
Validation is strict: non-REJECT actions cannot carry `reason_code`.
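The validation rule above can be sketched with a plain dataclass. This is a minimal illustration, not the repository's `EcomAction`: the class name `ActionSketch` and the use of `ValueError` are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional

ACTION_TYPES = {"APPROVE", "REJECT", "ESCALATE", "REQUEST_INFO"}
REJECT_REASONS = {"TIME_EXPIRED", "POLICY_VIOLATION", "SUSPECTED_FRAUD"}


@dataclass(frozen=True)
class ActionSketch:
    """Hypothetical stand-in for EcomAction (not the real class)."""

    action_type: str
    reason_code: Optional[str] = None

    def __post_init__(self) -> None:
        if self.action_type not in ACTION_TYPES:
            raise ValueError(f"unknown action_type: {self.action_type}")
        if self.action_type == "REJECT":
            # REJECT must carry one of the allowed reason codes.
            if self.reason_code not in REJECT_REASONS:
                raise ValueError("REJECT requires a valid reason_code")
        elif self.reason_code is not None:
            # Strict mode: every other action must leave reason_code unset.
            raise ValueError("reason_code is only allowed with REJECT")
```

With this sketch, `ActionSketch("REJECT", "TIME_EXPIRED")` is accepted, while `ActionSketch("APPROVE", "TIME_EXPIRED")` raises.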
### Observation (`EcomObservation`)
- `return_reason`
- `product_category`
- `product_value` in `{low, medium, high}`
- `days_since_purchase`
- `user_account_age_days`
- `product_condition_notes`
- `return_rate` in `[0,1]`
- `total_orders >= 1`
- `policy_summary`
- `reward`, `done`, `info`
### Reward payload (`EcomReward`)
Terminal breakdown keys:
- `policy_gate`
- `financial_score`
- `fraud_score`
- `efficiency_score`
- `normalized_reward`
- `policy_violation`
- `optimal_action`
- `matched_optimal`
All numeric reward components are bounded to `[0,1]`.
`optimal_action` is the highest-scoring legal terminal action label, such as
`APPROVE`, `ESCALATE`, or `REJECT(<REASON>)`; it may be `null` when no legal
terminal action exists from the current state.
## Episode protocol
### Reset
`reset(seed=None, episode_id=None, task_name=None)`:
- initializes state
- samples deterministic or stochastic case
- returns initial observation with:
  - `info.phase=initial`
  - `info.available_actions=[APPROVE, REJECT, ESCALATE, REQUEST_INFO]`
  - `info.reject_reason_codes=[TIME_EXPIRED, POLICY_VIOLATION, SUSPECTED_FRAUD]`
  - `info.task_name`, `info.task_seed`, `info.task_objective` when task-based
### Step
`step(action)` follows these guards and transitions:
1. If called before reset:
   - action is ignored
   - returns fresh initial observation
   - sets `invalid_action` and `last_action_error` to
     `step_called_before_reset_action_ignored`
2. If called after terminal:
   - returns terminal observation
   - `reward=0.0`, `done=true`
   - sets `invalid_action` and `last_action_error` to
     `episode_already_terminated_call_reset`
3. `REQUEST_INFO` first use:
   - non-terminal
   - refines existing fields only
   - reward shaping: `+0.08` if ambiguous else `-0.03`
4. Repeated `REQUEST_INFO`:
   - non-terminal penalty `-0.10`
   - error code: `request_info_already_used`
5. Action type that is not a valid final action:
   - non-terminal penalty `-0.05`
   - error code: `invalid_final_action`
6. Valid terminal action (`APPROVE|REJECT|ESCALATE`):
   - runs policy gate then reward model
   - returns terminal observation with grader fields
The hard step cap is `_MAX_STEPS=4`. Exceeding the cap returns a terminal
observation with reward `0.0` and `termination_reason=max_steps_exceeded`.
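The guard order above can be modeled as a toy state machine. This is a sketch under assumptions, not the server code: `StepGuardSketch` is a hypothetical name, `terminal_reward` stands in for the policy gate plus reward model, `ambiguous` is passed by the caller although it is hidden state in the real environment, the initial observation is assumed to carry `reward=0.0`, a pre-reset `step()` is assumed to start the episode, and the `_MAX_STEPS` cap is left out.

```python
TERMINAL_ACTIONS = {"APPROVE", "REJECT", "ESCALATE"}


class StepGuardSketch:
    """Toy model of the step() guards; terminal scoring is stubbed."""

    def __init__(self, terminal_reward: float = 0.5) -> None:
        self.terminal_reward = terminal_reward  # stand-in for gate + reward model
        self.started = False
        self.done = False
        self.info_used = False

    def reset(self) -> dict:
        self.started, self.done, self.info_used = True, False, False
        return {"reward": 0.0, "done": False, "info": {"phase": "initial"}}

    def _error(self, reward: float, code: str, phase: str, done: bool = False) -> dict:
        info = {"phase": phase, "invalid_action": code, "last_action_error": code}
        return {"reward": reward, "done": done, "info": info}

    def step(self, action_type: str, ambiguous: bool = False) -> dict:
        if not self.started:
            # Guard 1: the action is ignored; a fresh initial observation is returned.
            obs = self.reset()
            code = "step_called_before_reset_action_ignored"
            obs["info"].update(invalid_action=code, last_action_error=code)
            return obs
        if self.done:
            # Guard 2: post-terminal calls yield reward 0.0 and stay terminal.
            return self._error(0.0, "episode_already_terminated_call_reset", "terminal", done=True)
        phase = "post_request_info" if self.info_used else "initial"
        if action_type == "REQUEST_INFO":
            if self.info_used:
                # Guard 4: repeated REQUEST_INFO is penalized.
                return self._error(-0.10, "request_info_already_used", phase)
            # Guard 3: first use is shaped by whether the case is ambiguous.
            self.info_used = True
            reward = 0.08 if ambiguous else -0.03
            return {"reward": reward, "done": False, "info": {"phase": "post_request_info"}}
        if action_type not in TERMINAL_ACTIONS:
            # Guard 5: anything else is not a valid final action.
            return self._error(-0.05, "invalid_final_action", phase)
        # Guard 6: valid terminal action ends the episode.
        self.done = True
        return {"reward": self.terminal_reward, "done": True, "info": {"phase": "terminal"}}
```

Calling `step()` on a new instance before `reset()` returns a non-terminal observation tagged with `step_called_before_reset_action_ignored`, which matches guard 1.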
## Info-channel contract
`info` is the machine-readable control channel. It is used for policy hints,
error handling, and grader reporting.
Common keys by phase:
- Initial phase:
  - `phase=initial`
  - `available_actions`
  - `reject_reason_codes`
- Post-`REQUEST_INFO` phase:
  - `phase=post_request_info`
  - `revealed`
  - `available_actions`
  - `reject_reason_codes`
- Terminal phase:
  - `phase=terminal`
  - `breakdown`
  - `grader_score`
  - `grader_success`
  - `decision_audit`
- Invalid action paths:
  - `invalid_action` (stable machine code)
  - `last_action_error` (same machine code)
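A client can read these keys defensively, since most of them appear only in some phases. The helper below is a hypothetical sketch (`summarize_info` is not part of the repository) that assumes `info` arrives as a plain dict.

```python
def summarize_info(info: dict) -> dict:
    """Extract the control fields a client typically needs from `info`."""
    return {
        "phase": info.get("phase"),
        # Present only on invalid-action paths.
        "error": info.get("last_action_error"),
        # Present in the initial and post-REQUEST_INFO phases.
        "actions": list(info.get("available_actions", [])),
        "reject_reasons": list(info.get("reject_reason_codes", [])),
        # Present only in the terminal phase.
        "score": info.get("grader_score"),
        "success": info.get("grader_success"),
    }


example = {
    "phase": "post_request_info",
    "available_actions": ["APPROVE", "REJECT", "ESCALATE", "REQUEST_INFO"],
    "reject_reason_codes": ["TIME_EXPIRED", "POLICY_VIOLATION", "SUSPECTED_FRAUD"],
    "last_action_error": "request_info_already_used",
}
summary = summarize_info(example)
```

Missing keys come back as `None` or an empty list, so the same helper works for every phase.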
## Case generation model
### Difficulty presets
- `easy`: fraud `0.10`, ambiguity `0.10`, conflict `0.05`
- `medium`: fraud `0.25`, ambiguity `0.30`, conflict `0.20`
- `hard`: fraud `0.40`, ambiguity `0.55`, conflict `0.45`
### Latent risk construction
For non-hard-template episodes, latent fraud risk is derived from correlated
signals, not independent labels.
Base formula (clamped to `[0,1]`):
```text
risk = base_fraud_probability
     + 0.35 * (return_rate - 0.30)
     + 0.10 * value_index
     + reason_and_account_adjustments
```
Here `value_index` maps `low`/`medium`/`high` to `-1`/`0`/`+1`. Fraud intent is
then sampled from this latent risk.
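The formula can be worked through in a few lines of Python. This sketch makes three assumptions that the text does not confirm: the preset `fraud` rate plays the role of `base_fraud_probability`, `reason_and_account_adjustments` is supplied by the caller as a single number, and intent is drawn as a Bernoulli sample with probability `risk`.

```python
import random

PRESETS = {
    "easy": {"fraud": 0.10, "ambiguity": 0.10, "conflict": 0.05},
    "medium": {"fraud": 0.25, "ambiguity": 0.30, "conflict": 0.20},
    "hard": {"fraud": 0.40, "ambiguity": 0.55, "conflict": 0.45},
}
VALUE_INDEX = {"low": -1, "medium": 0, "high": 1}


def clamp01(x: float) -> float:
    return max(0.0, min(1.0, x))


def latent_risk(difficulty: str, return_rate: float, product_value: str,
                adjustments: float = 0.0) -> float:
    """Base formula from the README; `adjustments` is a caller-supplied placeholder."""
    base = PRESETS[difficulty]["fraud"]  # assumed to be base_fraud_probability
    raw = (base
           + 0.35 * (return_rate - 0.30)
           + 0.10 * VALUE_INDEX[product_value]
           + adjustments)
    return clamp01(raw)


def sample_fraud_intent(risk: float, rng: random.Random) -> bool:
    """Assumed Bernoulli draw: fraudulent with probability `risk`."""
    return rng.random() < risk


risk = latent_risk("medium", return_rate=0.50, product_value="high")
intent = sample_fraud_intent(risk, random.Random(222))
```

For the `medium` preset with `return_rate=0.50`, a high-value product, and no adjustments, the terms are `0.25 + 0.07 + 0.10`, so `risk` is 0.42 up to floating-point rounding.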
### Policy model
Each category defines:
- return window days
- non-returnable category list
- exception text
Policy violations are split into:
- `time_policy_violated`
- `category_policy_violated`
Exception handling is explicitly modeled and influences both generation and
policy gate decisions.
### Ambiguity and conflict injection
- Ambiguity and conflict are sampled from difficulty-controlled rates.
- Conflict mutates condition/policy wording to create realistic contradictory
evidence patterns.
### Hard template (`hard_conflicting_signals`)
The hard task uses a deterministic high-risk template:
- high-value electronics focus
- near-window timing
- intentionally conflicting evidence phrases
- stricter policy-gate behavior requiring evidence handling before finalization
## Transition semantics for `REQUEST_INFO`
`REQUEST_INFO` does not add new fields. It only refines existing observable
fields deterministically from hidden intent:
- `product_condition_notes`
- `return_reason` (may refine)
- `return_rate` (small deterministic shift)
This keeps schema fixed while allowing information-gathering behavior.
## Policy gate
If policy gate fails, terminal reward is forced to `0.0`.
Core constraints enforced:
- `APPROVE` is blocked on time/category violations.
- `APPROVE` may be blocked in high-risk ambiguous cases without exception.
- `REJECT` requires reason-code consistency with actual violation structure.
- Fraud rejection is blocked when fraud signal is too low.
- Rejecting clear low-fraud service-failure claims is blocked.
- In ambiguous hard scenarios, direct finalization before evidence collection can
be blocked.
## Reward model
After gate pass:
1. Financial component:
```text
financial_raw = cost_impact[action] + reason_bonus + trajectory_bonus
financial_score = clamp01((financial_raw + 1.5) / 3.0)
```
2. Fraud component: piecewise scoring conditioned on the action, the hidden
intent, and the latent risk.
3. Efficiency component:
```text
efficiency_score = 1.0 - 0.20 * (requested_info_used) - 0.30 * (action == ESCALATE)
```
4. Final reward:
```text
reward = clamp01(
    0.50 * financial_score
    + 0.30 * fraud_score
    + 0.20 * efficiency_score
)
```
Trajectory shaping:
- positive bonus for requesting info in ambiguous cases
- penalty for skipping info in ambiguous cases
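The components above combine as in the following sketch. Function names are hypothetical, the fraud score is taken as an input because its piecewise rules are not spelled out here, and the numeric inputs in the example are illustrative rather than values from the environment.

```python
def clamp01(x: float) -> float:
    return max(0.0, min(1.0, x))


def financial_score(cost_impact: float, reason_bonus: float = 0.0,
                    trajectory_bonus: float = 0.0) -> float:
    # Maps financial_raw from [-1.5, 1.5] onto [0, 1], clamping anything outside.
    financial_raw = cost_impact + reason_bonus + trajectory_bonus
    return clamp01((financial_raw + 1.5) / 3.0)


def efficiency_score(requested_info_used: bool, action_type: str) -> float:
    # Booleans act as 0/1, mirroring the indicator terms in the formula.
    return 1.0 - 0.20 * requested_info_used - 0.30 * (action_type == "ESCALATE")


def terminal_reward(gate_passed: bool, financial: float, fraud: float,
                    efficiency: float) -> float:
    if not gate_passed:
        return 0.0  # a failed policy gate forces the terminal reward to zero
    return clamp01(0.50 * financial + 0.30 * fraud + 0.20 * efficiency)


# Illustrative inputs: APPROVE after one REQUEST_INFO, fraud score given as 0.8.
fin = financial_score(cost_impact=0.6)
eff = efficiency_score(requested_info_used=True, action_type="APPROVE")
reward = round(terminal_reward(True, fin, 0.8, eff), 2)
```

With these inputs the financial score is 0.7 and the efficiency score is 0.8, so the weighted sum is `0.35 + 0.24 + 0.16 = 0.75`.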
## Creative evaluator features
To improve real-world utility and exploit resistance during review, terminal
responses include a deterministic decision audit payload.
`info.decision_audit` includes:
- `chosen_action`
- `chosen_reward`
- `best_counterfactual_reward`
- `decision_gap`
- `counterfactual_rewards` for:
  - `APPROVE`
  - `ESCALATE`
  - `REJECT(TIME_EXPIRED)`
  - `REJECT(POLICY_VIOLATION)`
  - `REJECT(SUSPECTED_FRAUD)`
- `risk_band` (`low|medium|high`)
- `policy_flags` (`time_policy_violated`, `category_policy_violated`,
`exception_applies`, `ambiguous_case`)
This creates a transparent, machine-checkable explanation surface without
changing reward determinism.
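One plausible way to derive the summary fields from the counterfactual table is sketched below. The exact definitions are assumptions: here `best_counterfactual_reward` is the maximum over all listed actions (including the chosen one) and `decision_gap` is that maximum minus `chosen_reward`. The reward values are illustrative only.

```python
def build_audit_sketch(chosen_action: str, counterfactual_rewards: dict) -> dict:
    """Hypothetical reconstruction of the audit summary fields."""
    chosen = counterfactual_rewards[chosen_action]
    best = max(counterfactual_rewards.values())
    return {
        "chosen_action": chosen_action,
        "chosen_reward": chosen,
        "best_counterfactual_reward": best,
        "decision_gap": best - chosen,  # assumed definition; 0.0 when optimal
        "counterfactual_rewards": dict(counterfactual_rewards),
    }


# Illustrative numbers, not produced by the environment.
audit = build_audit_sketch("ESCALATE", {
    "APPROVE": 0.0,
    "ESCALATE": 0.5,
    "REJECT(TIME_EXPIRED)": 0.0,
    "REJECT(POLICY_VIOLATION)": 0.0,
    "REJECT(SUSPECTED_FRAUD)": 0.75,
})
```

Under these assumptions a gap of zero means the chosen action was already the best-scoring option in the table.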
## Deterministic task set
Tasks are fixed-name benchmarks with fixed seed and threshold:
1. `easy_policy_compliance`:
   - seed `111`
   - threshold `0.75`
2. `medium_balanced_judgment`:
   - seed `222`
   - threshold `0.68`
3. `hard_conflicting_signals`:
   - seed `333`
   - threshold `0.74`
Terminal `grader_success` is computed against the active task threshold.
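The task table and the success check can be written down as a small lookup. This is a sketch: the `TASKS` dict and function name are hypothetical, and treating a score exactly equal to the threshold as a success (`>=`) is an assumption.

```python
TASKS = {
    "easy_policy_compliance": {"seed": 111, "threshold": 0.75},
    "medium_balanced_judgment": {"seed": 222, "threshold": 0.68},
    "hard_conflicting_signals": {"seed": 333, "threshold": 0.74},
}


def grader_success_sketch(task_name: str, grader_score: float) -> bool:
    """Compare a terminal score with the task threshold (>= is assumed)."""
    return grader_score >= TASKS[task_name]["threshold"]


passed = grader_success_sketch("medium_balanced_judgment", 0.70)
failed = grader_success_sketch("hard_conflicting_signals", 0.70)
```

The same score of `0.70` clears the `medium` threshold but not the `hard` one.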
## Determinism and reproducibility
- Uses `random.Random(seed)` for case generation.
- Task mode pins seed unless an explicit seed override is passed.
- No wall-clock dependence in generation or scoring.
- `grader_score(action)` is deterministic for a fixed latent case.
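The seeding guarantee rests on standard `random.Random` behavior: two generators built from the same seed produce the same sequence. The sampler below is a hypothetical illustration; its field ranges are placeholders, not the environment's real distributions.

```python
import random


def sample_case_fields(seed: int) -> dict:
    """Draw a few placeholder case fields from a private, seeded generator."""
    rng = random.Random(seed)  # isolated from the global random state
    return {
        "return_rate": round(rng.random(), 3),
        "days_since_purchase": rng.randint(0, 60),
        "product_value": rng.choice(["low", "medium", "high"]),
    }


first = sample_case_fields(111)
second = sample_case_fields(111)
```

Because each call builds its own `random.Random(seed)`, `first` and `second` are identical regardless of any other random calls made in between.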
## Inference contract (`../inference.py`)
Baseline runner enforces strict one-line logs:
- `[START] task=<task> env=<benchmark> model=<model>`
- `[STEP] step=<n> action=<action> reward=<r> done=<bool> error=<value|null>`
- `[END] success=<bool> steps=<n> score=<score> rewards=<r1,r2,...>`
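A formatter for these three line shapes might look like the sketch below. The field order follows the templates above; the lowercase booleans and two-decimal number formatting are assumptions, and the helper names are hypothetical rather than taken from `inference.py`.

```python
from typing import Optional, Sequence


def _fmt_bool(value: bool) -> str:
    return "true" if value else "false"  # assumed lowercase rendering


def start_line(task: str, env: str, model: str) -> str:
    return f"[START] task={task} env={env} model={model}"


def step_line(step: int, action: str, reward: float, done: bool,
              error: Optional[str] = None) -> str:
    err = error if error else "null"
    return (f"[STEP] step={step} action={action} reward={reward:.2f} "
            f"done={_fmt_bool(done)} error={err}")


def end_line(success: bool, steps: int, score: float,
             rewards: Sequence[float]) -> str:
    joined = ",".join(f"{r:.2f}" for r in rewards)
    return (f"[END] success={_fmt_bool(success)} steps={steps} "
            f"score={score:.2f} rewards={joined}")


line = step_line(1, "REQUEST_INFO", 0.08, False)
```

Each helper returns a single line with no embedded newline, which keeps the log parseable one record at a time.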
Action selection path uses environment-provided control hints:
- `available_actions`
- `reject_reason_codes`
- `invalid_action` / `last_action_error`
This reduces invalid-action loops and keeps inference behavior aligned with
runtime contract.
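One way to use those hints is to check a proposed action against them before sending it. The sketch below is not the baseline runner's logic: `sanitize_action` is a hypothetical helper, and falling back to `ESCALATE` when the proposal is unusable is an assumed policy.

```python
from typing import Optional, Tuple

FALLBACK: Tuple[str, Optional[str]] = ("ESCALATE", None)  # assumed fallback choice


def sanitize_action(action_type: str, reason_code: Optional[str],
                    info: dict) -> Tuple[str, Optional[str]]:
    """Return a schema-valid (action_type, reason_code) pair using `info` hints."""
    allowed = list(info.get("available_actions", []))
    reasons = list(info.get("reject_reason_codes", []))
    if action_type not in allowed:
        return FALLBACK
    if action_type == "REJECT":
        # REJECT is only usable with one of the advertised reason codes.
        return (action_type, reason_code) if reason_code in reasons else FALLBACK
    # Non-REJECT actions must not carry a reason code, so drop any stray one.
    return action_type, None


hints = {
    "available_actions": ["APPROVE", "REJECT", "ESCALATE", "REQUEST_INFO"],
    "reject_reason_codes": ["TIME_EXPIRED", "POLICY_VIOLATION", "SUSPECTED_FRAUD"],
}
cleaned = sanitize_action("APPROVE", "TIME_EXPIRED", hints)
```

Here the stray `reason_code` on `APPROVE` is dropped, which avoids the strict schema rejection described in the Action section.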
LLM proxy requirement for submission validation:
- client initialization must use injected environment variables only:
  - `api_key=os.environ["API_KEY"]`
  - `base_url=os.environ["API_BASE_URL"]`
- do not hardcode provider keys or bypass the injected proxy URL.
Exploit-hardening and evaluation integrity notes:
- `step()` before `reset()` does not execute the supplied action; it returns
initial state with machine-readable error codes.
- post-terminal `step()` calls return terminal state with `reward=0.0` and
explicit terminal error code, preventing undefined behavior loops.
- invalid-action branches emit stable machine codes and explicit
`available_actions` / `reject_reason_codes` to avoid parser ambiguity.
- task grading is deterministic with fixed seeds and fixed success thresholds;
no hidden stochastic post-processing in scoring.
## Validation checklist
From repository root:
```bash
openenv validate ecom
python -m pytest tests -q
./validate-submission.sh <space-url> .
```
From `ecom/`:
```bash
openenv validate .
openenv push
```