title: E-commerce Returns Decision Environment
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
- openenv
- operations
- decision-making
E-commerce Returns Decision Environment
This environment is a partially observable, policy-constrained decision process for a single e-commerce return case per episode.
Environment root and loader contract
This repository uses ecom/ as the OpenEnv environment root.
- ecom/openenv.yaml is the authoritative manifest.
- app: server.app:app resolves inside ecom/, so it maps to ecom/server/app.py.
- Validate from repository root with openenv validate ecom, or from ecom/ with openenv validate .
Formal task definition
An episode is defined over a hidden state s_t and an observation o_t.
- Hidden state contains fraud intent, policy violations, latent risk, and optimal action target.
- Observation exposes only operational case fields and policy summary text.
- Agent takes one action from A = {APPROVE, REJECT, ESCALATE, REQUEST_INFO}.
- Terminal objective is maximizing normalized reward while satisfying policy gate constraints.
The environment is not a static classifier. It is a short-horizon sequential decision loop with action-dependent transition and scoring.
Schemas
Action (EcomAction)
- action_type: APPROVE | REJECT | ESCALATE | REQUEST_INFO
- reason_code is required only when action_type == REJECT
- Allowed reject reasons:
  - TIME_EXPIRED
  - POLICY_VIOLATION
  - SUSPECTED_FRAUD
Validation is strict: non-REJECT actions cannot carry reason_code.
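The validation rule above can be illustrated with a minimal sketch. ActionSketch is a hypothetical stand-in, not the repository's EcomAction class, and the choice of ValueError is an assumption; only the action types, reason codes, and the "reason_code only with REJECT" rule come from this document.

```python
from dataclasses import dataclass
from typing import Optional

ACTION_TYPES = {"APPROVE", "REJECT", "ESCALATE", "REQUEST_INFO"}
REJECT_REASONS = {"TIME_EXPIRED", "POLICY_VIOLATION", "SUSPECTED_FRAUD"}


@dataclass(frozen=True)
class ActionSketch:
    """Hypothetical stand-in for EcomAction (illustration only)."""

    action_type: str
    reason_code: Optional[str] = None

    def __post_init__(self) -> None:
        if self.action_type not in ACTION_TYPES:
            raise ValueError(f"unknown action_type: {self.action_type}")
        if self.action_type == "REJECT":
            # REJECT must carry one of the allowed reason codes.
            if self.reason_code not in REJECT_REASONS:
                raise ValueError("REJECT requires a valid reason_code")
        elif self.reason_code is not None:
            # Strict mode: every other action must leave reason_code unset.
            raise ValueError("reason_code is only allowed with REJECT")
```

Under these assumptions, ActionSketch("REJECT", "TIME_EXPIRED") constructs normally, while ActionSketch("APPROVE", "TIME_EXPIRED") raises.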
Observation (EcomObservation)
- return_reason
- product_category
- product_value in {low, medium, high}
- days_since_purchase
- user_account_age_days
- product_condition_notes
- return_rate in [0,1]
- total_orders >= 1
- policy_summary
- reward, done, info
Reward payload (EcomReward)
Terminal breakdown keys:
- policy_gate
- financial_score
- fraud_score
- efficiency_score
- normalized_reward
- policy_violation
- optimal_action
- matched_optimal
All numeric reward components are bounded to [0,1].
optimal_action is the highest-scoring legal terminal action label, such as
APPROVE, ESCALATE, or REJECT(<REASON>); it may be null when no legal
terminal action exists from the current state.
Episode protocol
Reset
reset(seed=None, episode_id=None, task_name=None):
- initializes state
- samples deterministic or stochastic case
- returns initial observation with:
  - info.phase=initial
  - info.available_actions=[APPROVE, REJECT, ESCALATE, REQUEST_INFO]
  - info.reject_reason_codes=[TIME_EXPIRED, POLICY_VIOLATION, SUSPECTED_FRAUD]
  - info.task_name, info.task_seed, info.task_objective when task-based
Step
step(action) follows these guards and transitions:
- If called before reset:
  - action is ignored
  - returns fresh initial observation
  - sets invalid_action and last_action_error to step_called_before_reset_action_ignored
- If called after terminal:
  - returns terminal observation
  - reward=0.0, done=true
  - sets invalid_action and last_action_error to episode_already_terminated_call_reset
- REQUEST_INFO first use:
  - non-terminal
  - refines existing fields only
  - reward shaping: +0.08 if ambiguous else -0.03
- Repeated REQUEST_INFO:
  - non-terminal penalty -0.10
  - error code: request_info_already_used
- Invalid non-terminal-final action type:
  - non-terminal penalty -0.05
  - error code: invalid_final_action
- Valid terminal action (APPROVE|REJECT|ESCALATE):
  - runs policy gate then reward model
  - returns terminal observation with grader fields
The hard cap is _MAX_STEPS=4. Exceeding the cap returns a terminal observation with reward 0.0 and
termination_reason=max_steps_exceeded.
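The guards and transitions above can be summarized as a small decision function. This is a sketch under stated assumptions, not the environment's implementation: Outcome and step_outcome are hypothetical names, the order in which guards are checked is assumed, reward=None marks values this document does not specify (the pre-reset case) or that come from the policy gate and reward model (valid terminal actions), and "exceeding the cap" is modeled as steps_taken >= MAX_STEPS.

```python
from dataclasses import dataclass
from typing import Optional

MAX_STEPS = 4  # mirrors _MAX_STEPS from the protocol description
TERMINAL_ACTIONS = {"APPROVE", "REJECT", "ESCALATE"}


@dataclass(frozen=True)
class Outcome:
    """Hypothetical summary of one step() call."""

    reward: Optional[float]  # None = not specified here / computed elsewhere
    done: bool
    error: Optional[str] = None
    termination_reason: Optional[str] = None


def step_outcome(action_type: str, *, has_reset: bool = True,
                 is_terminal: bool = False, info_used: bool = False,
                 ambiguous: bool = False, steps_taken: int = 0) -> Outcome:
    if not has_reset:
        # Action is ignored; a fresh initial observation is returned.
        return Outcome(None, False, error="step_called_before_reset_action_ignored")
    if is_terminal:
        return Outcome(0.0, True, error="episode_already_terminated_call_reset")
    if steps_taken >= MAX_STEPS:  # assumed reading of "exceeding cap"
        return Outcome(0.0, True, termination_reason="max_steps_exceeded")
    if action_type == "REQUEST_INFO":
        if info_used:
            return Outcome(-0.10, False, error="request_info_already_used")
        return Outcome(0.08 if ambiguous else -0.03, False)
    if action_type in TERMINAL_ACTIONS:
        # Reward comes from the policy gate followed by the reward model.
        return Outcome(None, True)
    return Outcome(-0.05, False, error="invalid_final_action")
```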
Info-channel contract
info is the machine-readable control channel. It is used for policy hints,
error handling, and grader reporting.
Common keys by phase:
- Initial phase:
  - phase=initial
  - available_actions
  - reject_reason_codes
- Post-REQUEST_INFO phase:
  - phase=post_request_info
  - revealed
  - available_actions
  - reject_reason_codes
- Terminal phase:
  - phase=terminal
  - breakdown
  - grader_score
  - grader_success
  - decision_audit
- Invalid action paths:
  - invalid_action (stable machine code)
  - last_action_error (same machine code)
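A client can check the per-phase keys listed above before relying on them. The helper below is a hypothetical sketch: missing_info_keys is not part of the environment, and it assumes info arrives as a plain mapping with the key names given in this section.

```python
from typing import Dict, List, Mapping

# Expected keys per phase, copied from the list above.
EXPECTED_INFO_KEYS: Dict[str, List[str]] = {
    "initial": ["phase", "available_actions", "reject_reason_codes"],
    "post_request_info": ["phase", "revealed", "available_actions",
                          "reject_reason_codes"],
    "terminal": ["phase", "breakdown", "grader_score", "grader_success",
                 "decision_audit"],
}


def missing_info_keys(info: Mapping[str, object]) -> List[str]:
    """Return the expected keys that are absent for the reported phase."""
    expected = EXPECTED_INFO_KEYS.get(info.get("phase"), ["phase"])
    return [key for key in expected if key not in info]
```

An empty result means the payload carries every key documented for its phase; an unknown or missing phase is reported as a missing phase key.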
Case generation model
Difficulty presets
- easy: fraud 0.10, ambiguity 0.10, conflict 0.05
- medium: fraud 0.25, ambiguity 0.30, conflict 0.20
- hard: fraud 0.40, ambiguity 0.55, conflict 0.45
Latent risk construction
For non-hard-template episodes, latent fraud risk is derived from correlated signals, not independent labels.
Base formula (clamped to [0,1]):
risk = base_fraud_probability
+ 0.35 * (return_rate - 0.30)
+ 0.10 * value_index
+ reason_and_account_adjustments
Where value_index maps low/medium/high to -1/0/+1 offset through internal
indexing. Intent is then sampled from this latent risk.
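The base formula can be written out as a short function. This is a sketch, not the generator's code: reason_and_account_adjustments is passed in as a single number because its composition is not described here, and sampling intent as one Bernoulli draw against the risk is an assumption.

```python
import random

VALUE_INDEX = {"low": -1, "medium": 0, "high": 1}


def clamp01(x: float) -> float:
    return max(0.0, min(1.0, x))


def latent_risk(base_fraud_probability: float, return_rate: float,
                product_value: str,
                reason_and_account_adjustments: float = 0.0) -> float:
    """Base risk formula from the text, clamped to [0, 1]."""
    raw = (base_fraud_probability
           + 0.35 * (return_rate - 0.30)
           + 0.10 * VALUE_INDEX[product_value]
           + reason_and_account_adjustments)
    return clamp01(raw)


def sample_fraud_intent(risk: float, rng: random.Random) -> bool:
    # Assumed sampling rule: one Bernoulli draw with probability `risk`.
    return rng.random() < risk
```

For example, a medium-difficulty base of 0.25 with return_rate 0.50 and a high-value item gives 0.25 + 0.07 + 0.10 = 0.42 before any reason or account adjustment.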
Policy model
Each category defines:
- return window days
- non-returnable category list
- exception text
Policy violations are split into:
- time_policy_violated
- category_policy_violated
Exception handling is explicitly modeled and influences both generation and policy gate decisions.
Ambiguity and conflict injection
- Ambiguity and conflict are sampled from difficulty-controlled rates.
- Conflict mutates condition/policy wording to create realistic contradictory evidence patterns.
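A minimal sketch of difficulty-controlled sampling follows. The rates are the presets listed under "Difficulty presets"; the function name sample_flags and the use of two independent Bernoulli draws are assumptions made for illustration.

```python
import random
from typing import Dict

PRESETS: Dict[str, Dict[str, float]] = {
    "easy": {"fraud": 0.10, "ambiguity": 0.10, "conflict": 0.05},
    "medium": {"fraud": 0.25, "ambiguity": 0.30, "conflict": 0.20},
    "hard": {"fraud": 0.40, "ambiguity": 0.55, "conflict": 0.45},
}


def sample_flags(difficulty: str, rng: random.Random) -> Dict[str, bool]:
    """Draw ambiguity and conflict flags at the preset rates (assumed independent)."""
    rates = PRESETS[difficulty]
    return {
        "ambiguous": rng.random() < rates["ambiguity"],
        "conflict": rng.random() < rates["conflict"],
    }
```

Because the draws come from a seeded random.Random instance, the same seed and difficulty always yield the same flags.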
Hard template (hard_conflicting_signals)
The hard task uses a deterministic high-risk template:
- high-value electronics focus
- near-window timing
- intentionally conflicting evidence phrases
- stricter policy-gate behavior requiring evidence handling before finalization
Transition semantics for REQUEST_INFO
REQUEST_INFO does not add new fields. It only refines existing observable
fields deterministically from hidden intent:
- product_condition_notes
- return_reason (may refine)
- return_rate (small deterministic shift)
This keeps schema fixed while allowing information-gathering behavior.
Policy gate
If policy gate fails, terminal reward is forced to 0.0.
Core constraints enforced:
- APPROVE is blocked on time/category violations.
- APPROVE may be blocked in high-risk ambiguous cases without exception.
- REJECT requires reason-code consistency with actual violation structure.
- Fraud rejection is blocked when fraud signal is too low.
- Rejecting clear low-fraud service-failure claims is blocked.
- In ambiguous hard scenarios, direct finalization before evidence collection can be blocked.
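Two of the constraints above are simple enough to sketch. The code below is deliberately partial and hypothetical: it covers only the APPROVE block on violations and an assumed mapping from reject reason codes to violation flags (TIME_EXPIRED to time_policy_violated, POLICY_VIOLATION to category_policy_violated). Exception handling, fraud-signal thresholds, and the ambiguity rules are omitted because this document does not give their exact conditions.

```python
from typing import Optional


def gate_allows(action_type: str, reason_code: Optional[str] = None, *,
                time_policy_violated: bool,
                category_policy_violated: bool) -> bool:
    """Partial gate sketch; rules not modeled here default to 'allowed'."""
    if action_type == "APPROVE":
        return not (time_policy_violated or category_policy_violated)
    if action_type == "REJECT":
        # Assumed mapping between reason codes and violation flags.
        if reason_code == "TIME_EXPIRED":
            return time_policy_violated
        if reason_code == "POLICY_VIOLATION":
            return category_policy_violated
    return True


def gated_reward(allowed: bool, model_reward: float) -> float:
    # A failed gate forces the terminal reward to 0.0.
    return model_reward if allowed else 0.0
```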
Reward model
After gate pass:
- Financial component:
  financial_raw = cost_impact[action] + reason_bonus + trajectory_bonus
  financial_score = clamp01((financial_raw + 1.5) / 3.0)
- Fraud component uses action-intent-risk-conditioned piecewise scoring.
- Efficiency component:
  efficiency = 1.0 - 0.20*(requested_info_used) - 0.30*(action==ESCALATE)
- Final reward:
  reward = clamp01(
      0.50 * financial_score
      + 0.30 * fraud_score
      + 0.20 * efficiency_score
  )
Trajectory shaping:
- positive bonus for requesting info in ambiguous cases
- penalty for skipping info in ambiguous cases
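The reward arithmetic can be checked with a small worked sketch. The function names are illustrative; cost_impact is passed as the already-looked-up value for the chosen action, and fraud_score is taken as an input because its piecewise rule is not spelled out here.

```python
def clamp01(x: float) -> float:
    return max(0.0, min(1.0, x))


def financial_score(cost_impact: float, reason_bonus: float = 0.0,
                    trajectory_bonus: float = 0.0) -> float:
    financial_raw = cost_impact + reason_bonus + trajectory_bonus
    # Maps financial_raw from [-1.5, 1.5] onto [0, 1].
    return clamp01((financial_raw + 1.5) / 3.0)


def efficiency_score(requested_info_used: bool, action_type: str) -> float:
    return (1.0
            - 0.20 * float(requested_info_used)
            - 0.30 * float(action_type == "ESCALATE"))


def final_reward(financial: float, fraud: float, efficiency: float,
                 gate_passed: bool = True) -> float:
    if not gate_passed:
        return 0.0  # policy gate failure forces the terminal reward to 0.0
    return clamp01(0.50 * financial + 0.30 * fraud + 0.20 * efficiency)
```

With financial_raw = 0.0 (score 0.5), fraud_score = 1.0, and one REQUEST_INFO followed by APPROVE (efficiency 0.8), the weighted sum is 0.25 + 0.30 + 0.16 = 0.71.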
Creative evaluator features
To improve real-world utility and exploit resistance during review, terminal responses include a deterministic decision audit payload.
info.decision_audit includes:
- chosen_action
- chosen_reward
- best_counterfactual_reward
- decision_gap
- counterfactual_rewards for:
  - APPROVE
  - ESCALATE
  - REJECT(TIME_EXPIRED)
  - REJECT(POLICY_VIOLATION)
  - REJECT(SUSPECTED_FRAUD)
- risk_band (low|medium|high)
- policy_flags (time_policy_violated, category_policy_violated, exception_applies, ambiguous_case)
This creates a transparent, machine-checkable explanation surface without changing reward determinism.
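One plausible way to derive the gap fields from the counterfactual table is sketched below. The definition decision_gap = best_counterfactual_reward - chosen_reward, and the choice to include the chosen action when taking the maximum, are assumptions; the reward values in the usage note are made up for illustration.

```python
from typing import Dict


def summarize_audit(chosen_action: str,
                    counterfactual_rewards: Dict[str, float]) -> Dict[str, object]:
    """Hypothetical derivation of the gap-related audit fields."""
    chosen_reward = counterfactual_rewards[chosen_action]
    best = max(counterfactual_rewards.values())
    return {
        "chosen_action": chosen_action,
        "chosen_reward": chosen_reward,
        "best_counterfactual_reward": best,
        # Assumed definition: how far the choice fell short of the best option.
        "decision_gap": best - chosen_reward,
    }
```

For instance, with illustrative rewards {"APPROVE": 0.5, "ESCALATE": 0.75, "REJECT(TIME_EXPIRED)": 0.0}, choosing APPROVE yields a decision_gap of 0.25 under this definition.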
Deterministic task set
Tasks are fixed-name benchmarks with fixed seed and threshold:
- easy_policy_compliance:
  - seed 111
  - threshold 0.75
- medium_balanced_judgment:
  - seed 222
  - threshold 0.68
- hard_conflicting_signals:
  - seed 333
  - threshold 0.74
Terminal grader_success is computed against the active task threshold.
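The task table and the success check can be expressed as data plus one comparison. The seeds and thresholds are taken from the list above; treating the threshold as inclusive (score >= threshold) is an assumption, and the function name is illustrative.

```python
from typing import Dict

TASKS: Dict[str, Dict[str, float]] = {
    "easy_policy_compliance": {"seed": 111, "threshold": 0.75},
    "medium_balanced_judgment": {"seed": 222, "threshold": 0.68},
    "hard_conflicting_signals": {"seed": 333, "threshold": 0.74},
}


def is_grader_success(task_name: str, grader_score: float) -> bool:
    # Assumed comparison: meeting the threshold exactly counts as success.
    return grader_score >= TASKS[task_name]["threshold"]
```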
Determinism and reproducibility
- Uses random.Random(seed) for case generation.
- Task mode pins seed unless an explicit seed override is passed.
- No wall-clock dependence in generation or scoring.
- grader_score(action) is deterministic for a fixed latent case.
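The seed-pinning rule can be sketched as follows. resolve_seed is a hypothetical helper, not the environment's function; it assumes an explicit seed always wins over the task's pinned seed, and that a run with neither is unseeded.

```python
import random
from typing import Optional

TASK_SEEDS = {
    "easy_policy_compliance": 111,
    "medium_balanced_judgment": 222,
    "hard_conflicting_signals": 333,
}


def resolve_seed(task_name: Optional[str] = None,
                 seed: Optional[int] = None) -> Optional[int]:
    """Explicit seed overrides the task's pinned seed (assumed precedence)."""
    if seed is not None:
        return seed
    if task_name is not None:
        return TASK_SEEDS[task_name]
    return None  # no task and no seed: stochastic case


def first_draw(seed: Optional[int]) -> float:
    # A dedicated Random instance keeps generation independent of global state.
    return random.Random(seed).random()
```

Two calls to first_draw with the same resolved seed return the same value, which is the property the determinism notes above rely on.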
Inference contract (../inference.py)
Baseline runner enforces strict one-line logs:
- [START] task=<task> env=<benchmark> model=<model>
- [STEP] step=<n> action=<action> reward=<r> done=<bool> error=<value|null>
- [END] success=<bool> steps=<n> score=<score> rewards=<r1,r2,...>
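The three log lines can be produced by small formatters such as the ones below. The field order follows the templates above; the two-decimal number formatting and lowercase true/false booleans are assumptions, since the templates do not fix them, and the function names are illustrative.

```python
from typing import Optional, Sequence


def format_start(task: str, env: str, model: str) -> str:
    return f"[START] task={task} env={env} model={model}"


def format_step(step: int, action: str, reward: float, done: bool,
                error: Optional[str] = None) -> str:
    err = "null" if error is None else error
    return (f"[STEP] step={step} action={action} reward={reward:.2f} "
            f"done={str(done).lower()} error={err}")


def format_end(success: bool, steps: int, score: float,
               rewards: Sequence[float]) -> str:
    joined = ",".join(f"{r:.2f}" for r in rewards)
    return (f"[END] success={str(success).lower()} steps={steps} "
            f"score={score:.2f} rewards={joined}")
```

Each function returns a single line with no embedded newline, which keeps the output compatible with a strict one-line-per-event parser.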
Action selection path uses environment-provided control hints:
- available_actions
- reject_reason_codes
- invalid_action / last_action_error
This reduces invalid-action loops and keeps inference behavior aligned with runtime contract.
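A hint-aware selection step might look like the sketch below. It is not the baseline runner's code: sanitize_action is a hypothetical helper, and falling back to ESCALATE whenever the proposed action is unusable is an assumption chosen only to keep the example terminating.

```python
from typing import Mapping, Optional, Tuple

FALLBACK = "ESCALATE"  # assumed safe default, not taken from the runner


def sanitize_action(proposed_type: str, proposed_reason: Optional[str],
                    info: Mapping[str, object]) -> Tuple[str, Optional[str]]:
    """Clamp a model-proposed action to what the info channel allows."""
    allowed = list(info.get("available_actions", []))
    reasons = list(info.get("reject_reason_codes", []))
    if proposed_type not in allowed:
        return FALLBACK, None
    if (proposed_type == "REQUEST_INFO"
            and info.get("last_action_error") == "request_info_already_used"):
        # Avoid looping on an action the environment already rejected.
        return FALLBACK, None
    if proposed_type == "REJECT":
        if proposed_reason in reasons:
            return "REJECT", proposed_reason
        return FALLBACK, None
    # Non-REJECT actions must not carry a reason_code.
    return proposed_type, None
```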
LLM proxy requirement for submission validation:
- client initialization must use injected environment variables only:
  - api_key=os.environ["API_KEY"]
  - base_url=os.environ["API_BASE_URL"]
- do not hardcode provider keys or bypass the injected proxy URL.
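A minimal sketch of the requirement: read both values from the injected environment and fail loudly if either is missing. proxy_client_kwargs is a hypothetical helper; which client class receives these keyword arguments is not stated here, so the sketch stops at building them.

```python
import os
from typing import Dict


def proxy_client_kwargs() -> Dict[str, str]:
    """Build client keyword arguments from injected variables only."""
    # Indexing os.environ directly raises KeyError when a variable is missing,
    # so there is no silent fallback to a hardcoded key or URL.
    return {
        "api_key": os.environ["API_KEY"],
        "base_url": os.environ["API_BASE_URL"],
    }
```

The returned dictionary can be unpacked into whichever LLM client constructor the runner uses, provided that constructor accepts api_key and base_url.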
Exploit-hardening and evaluation integrity notes:
- step() before reset() does not execute the supplied action; it returns initial state with machine-readable error codes.
- post-terminal step() calls return terminal state with reward=0.0 and explicit terminal error code, preventing undefined behavior loops.
- invalid-action branches emit stable machine codes and explicit available_actions / reject_reason_codes to avoid parser ambiguity.
- task grading is deterministic with fixed seeds and fixed success thresholds; no hidden stochastic post-processing in scoring.
Validation checklist
From repository root:
openenv validate ecom
python -m pytest tests -q
./validate-submission.sh <space-url> .
From ecom/:
openenv validate .
openenv push