title: E-commerce Returns Decision Environment
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
  - openenv
  - operations
  - decision-making

E-commerce Returns Decision Environment

This environment is a partially observable, policy-constrained decision process for a single e-commerce return case per episode.

Environment root and loader contract

This repository uses ecom/ as the OpenEnv environment root.

  • ecom/openenv.yaml is the authoritative manifest.
  • app: server.app:app resolves inside ecom/, so it maps to ecom/server/app.py.
  • Validate from the repository root with openenv validate ecom, or from ecom/ with openenv validate . (the trailing dot is the path argument).

Formal task definition

An episode is defined over a hidden state s_t and an observation o_t.

  • Hidden state contains fraud intent, policy violations, latent risk, and optimal action target.
  • Observation exposes only operational case fields and policy summary text.
  • Agent takes one action from A = {APPROVE, REJECT, ESCALATE, REQUEST_INFO}.
  • Terminal objective is maximizing normalized reward while satisfying policy gate constraints.

The environment is not a static classifier. It is a short-horizon sequential decision loop with action-dependent transitions and scoring.

Schemas

Action (EcomAction)

  • action_type: APPROVE | REJECT | ESCALATE | REQUEST_INFO
  • reason_code is required only when action_type == REJECT
  • Allowed reject reasons:
    • TIME_EXPIRED
    • POLICY_VIOLATION
    • SUSPECTED_FRAUD

Validation is strict: non-REJECT actions cannot carry reason_code.
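The validation rule above can be sketched as a small dataclass. This is a minimal illustration, not the actual EcomAction implementation: the class name ActionSketch and the choice of ValueError are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional

ACTION_TYPES = {"APPROVE", "REJECT", "ESCALATE", "REQUEST_INFO"}
REJECT_REASONS = {"TIME_EXPIRED", "POLICY_VIOLATION", "SUSPECTED_FRAUD"}


@dataclass(frozen=True)
class ActionSketch:
    """Hypothetical stand-in for EcomAction; the real class may differ."""

    action_type: str
    reason_code: Optional[str] = None

    def __post_init__(self) -> None:
        if self.action_type not in ACTION_TYPES:
            raise ValueError(f"unknown action_type: {self.action_type}")
        if self.action_type == "REJECT":
            # REJECT must carry one of the allowed reason codes.
            if self.reason_code not in REJECT_REASONS:
                raise ValueError("REJECT requires a valid reason_code")
        elif self.reason_code is not None:
            # Strict mode: every other action must omit reason_code.
            raise ValueError("reason_code is only allowed with REJECT")
```

Under these assumptions, ActionSketch("REJECT", "TIME_EXPIRED") constructs normally, while ActionSketch("APPROVE", "TIME_EXPIRED") raises.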

Observation (EcomObservation)

  • return_reason
  • product_category
  • product_value in {low, medium, high}
  • days_since_purchase
  • user_account_age_days
  • product_condition_notes
  • return_rate in [0,1]
  • total_orders >= 1
  • policy_summary
  • reward, done, info

Reward payload (EcomReward)

Terminal breakdown keys:

  • policy_gate
  • financial_score
  • fraud_score
  • efficiency_score
  • normalized_reward
  • policy_violation
  • optimal_action
  • matched_optimal

All numeric reward components are bounded to [0,1]. optimal_action is the highest-scoring legal terminal action label, such as APPROVE, ESCALATE, or REJECT(<REASON>); it may be null when no legal terminal action exists from the current state.

Episode protocol

Reset

reset(seed=None, episode_id=None, task_name=None):

  • initializes state
  • samples a deterministic or stochastic case
  • returns initial observation with:
    • info.phase=initial
    • info.available_actions=[APPROVE, REJECT, ESCALATE, REQUEST_INFO]
    • info.reject_reason_codes=[TIME_EXPIRED, POLICY_VIOLATION, SUSPECTED_FRAUD]
    • info.task_name, info.task_seed, info.task_objective when task-based

Step

step(action) follows these guards and transitions:

  1. If called before reset:
    • action is ignored
    • returns fresh initial observation
    • sets invalid_action and last_action_error to step_called_before_reset_action_ignored
  2. If called after terminal:
    • returns terminal observation
    • reward=0.0, done=true
    • sets invalid_action and last_action_error to episode_already_terminated_call_reset
  3. REQUEST_INFO first use:
    • non-terminal
    • refines existing fields only
    • reward shaping: +0.08 if ambiguous else -0.03
  4. Repeated REQUEST_INFO:
    • non-terminal penalty -0.10
    • error code: request_info_already_used
  5. Invalid final action type (not REQUEST_INFO and not a valid terminal action):
    • non-terminal penalty -0.05
    • error code: invalid_final_action
  6. Valid terminal action (APPROVE|REJECT|ESCALATE):
    • runs policy gate then reward model
    • returns terminal observation with grader fields

The hard cap is _MAX_STEPS=4. Exceeding the cap returns a terminal observation with reward 0.0 and termination_reason=max_steps_exceeded.
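The guard order in step() can be sketched as a pure function that maps the current episode flags and the action type to an outcome. This is an illustrative sketch only: StepOutcome and classify_step are hypothetical names, the guards are assumed to be evaluated in the order listed, and the step cap is left out because its position in the guard order is not stated here. A reward of None marks paths where no fixed value is given.

```python
from typing import NamedTuple, Optional

TERMINAL_ACTIONS = {"APPROVE", "REJECT", "ESCALATE"}


class StepOutcome(NamedTuple):
    error_code: Optional[str]  # stable machine code, or None on a valid path
    reward: Optional[float]    # None where no fixed value is documented
    done: bool


def classify_step(*, reset_called: bool, done: bool, info_used: bool,
                  ambiguous: bool, action_type: str) -> StepOutcome:
    if not reset_called:
        # Action is ignored; a fresh initial observation is returned.
        return StepOutcome("step_called_before_reset_action_ignored", None, False)
    if done:
        return StepOutcome("episode_already_terminated_call_reset", 0.0, True)
    if action_type == "REQUEST_INFO":
        if info_used:
            return StepOutcome("request_info_already_used", -0.10, False)
        return StepOutcome(None, 0.08 if ambiguous else -0.03, False)
    if action_type not in TERMINAL_ACTIONS:
        return StepOutcome("invalid_final_action", -0.05, False)
    # Valid terminal action: reward comes from the policy gate and reward model.
    return StepOutcome(None, None, True)
```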

Info-channel contract

info is the machine-readable control channel. It is used for policy hints, error handling, and grader reporting.

Common keys by phase:

  • Initial phase:
    • phase=initial
    • available_actions
    • reject_reason_codes
  • Post-REQUEST_INFO phase:
    • phase=post_request_info
    • revealed
    • available_actions
    • reject_reason_codes
  • Terminal phase:
    • phase=terminal
    • breakdown
    • grader_score
    • grader_success
    • decision_audit
  • Invalid action paths:
    • invalid_action (stable machine code)
    • last_action_error (same machine code)

Case generation model

Difficulty presets

  • easy: fraud 0.10, ambiguity 0.10, conflict 0.05
  • medium: fraud 0.25, ambiguity 0.30, conflict 0.20
  • hard: fraud 0.40, ambiguity 0.55, conflict 0.45

Latent risk construction

For non-hard-template episodes, latent fraud risk is derived from correlated signals, not independent labels.

Base formula (clamped to [0,1]):

risk = base_fraud_probability
     + 0.35 * (return_rate - 0.30)
     + 0.10 * value_index
     + reason_and_account_adjustments

Here value_index maps product_value (low/medium/high) to an offset of -1/0/+1. Fraud intent is then sampled from this latent risk.
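As a worked illustration of the base formula, the sketch below computes the clamped risk for given inputs. The function name latent_risk is hypothetical, the adjustments argument stands in for reason_and_account_adjustments (whose exact terms are not listed here), and treating base_fraud_probability as the difficulty preset's fraud rate is an assumption.

```python
VALUE_INDEX = {"low": -1, "medium": 0, "high": 1}


def clamp01(x: float) -> float:
    """Clamp a value into [0, 1]."""
    return max(0.0, min(1.0, x))


def latent_risk(base_fraud_probability: float, return_rate: float,
                product_value: str, adjustments: float = 0.0) -> float:
    # adjustments is a placeholder for reason_and_account_adjustments.
    risk = (
        base_fraud_probability
        + 0.35 * (return_rate - 0.30)
        + 0.10 * VALUE_INDEX[product_value]
        + adjustments
    )
    return clamp01(risk)
```

For example, with a base of 0.25, a return_rate of 0.50, and a high-value product, the terms are 0.25 + 0.07 + 0.10, giving a risk of about 0.42 before any adjustments.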

Policy model

Each category defines:

  • return window days
  • non-returnable category list
  • exception text

Policy violations are split into:

  • time_policy_violated
  • category_policy_violated

Exception handling is explicitly modeled and influences both generation and policy gate decisions.

Ambiguity and conflict injection

  • Ambiguity and conflict are sampled from difficulty-controlled rates.
  • Conflict mutates condition/policy wording to create realistic contradictory evidence patterns.

Hard template (hard_conflicting_signals)

The hard task uses a deterministic high-risk template:

  • high-value electronics focus
  • near-window timing
  • intentionally conflicting evidence phrases
  • stricter policy-gate behavior requiring evidence handling before finalization

Transition semantics for REQUEST_INFO

REQUEST_INFO does not add new fields. It only refines existing observable fields deterministically from hidden intent:

  • product_condition_notes
  • return_reason (may refine)
  • return_rate (small deterministic shift)

This keeps schema fixed while allowing information-gathering behavior.

Policy gate

If the policy gate fails, the terminal reward is forced to 0.0.

Core constraints enforced:

  • APPROVE is blocked on time/category violations.
  • APPROVE may be blocked in high-risk ambiguous cases without exception.
  • REJECT requires reason-code consistency with actual violation structure.
  • Fraud rejection is blocked when fraud signal is too low.
  • Rejecting clear low-fraud service-failure claims is blocked.
  • In ambiguous hard scenarios, direct finalization before evidence collection can be blocked.

Reward model

After gate pass:

  1. Financial component:

financial_raw = cost_impact[action] + reason_bonus + trajectory_bonus
financial_score = clamp01((financial_raw + 1.5) / 3.0)

  2. Fraud component: fraud_score uses piecewise scoring conditioned on action, intent, and risk.

  3. Efficiency component:

efficiency_score = 1.0 - 0.20*(requested_info_used) - 0.30*(action==ESCALATE)

  4. Final reward:

reward = clamp01(
    0.50 * financial_score
  + 0.30 * fraud_score
  + 0.20 * efficiency_score
)

Trajectory shaping:

  • positive bonus for requesting info in ambiguous cases
  • penalty for skipping info in ambiguous cases
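The reward composition can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the function names are hypothetical, fraud_score is taken as an input because its piecewise rules are not spelled out here, and cost_impact, reason_bonus, and trajectory_bonus are assumed to be already summed into financial_raw.

```python
def clamp01(x: float) -> float:
    """Clamp a value into [0, 1]."""
    return max(0.0, min(1.0, x))


def financial_score(financial_raw: float) -> float:
    # Maps a raw value in roughly [-1.5, 1.5] onto [0, 1].
    return clamp01((financial_raw + 1.5) / 3.0)


def efficiency_score(requested_info_used: bool, action: str) -> float:
    return 1.0 - 0.20 * int(requested_info_used) - 0.30 * int(action == "ESCALATE")


def final_reward(financial: float, fraud: float, efficiency: float,
                 gate_passed: bool = True) -> float:
    if not gate_passed:
        # A failed policy gate forces the terminal reward to 0.0.
        return 0.0
    return clamp01(0.50 * financial + 0.30 * fraud + 0.20 * efficiency)
```

As a worked case, financial_raw = 0.6 gives a financial score of about 0.7; an APPROVE after one REQUEST_INFO gives an efficiency score of 0.8; with a fraud score of 0.8 the final reward is 0.35 + 0.24 + 0.16, or about 0.75.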

Creative evaluator features

To improve real-world utility and exploit resistance during review, terminal responses include a deterministic decision audit payload.

info.decision_audit includes:

  • chosen_action
  • chosen_reward
  • best_counterfactual_reward
  • decision_gap
  • counterfactual_rewards for:
    • APPROVE
    • ESCALATE
    • REJECT(TIME_EXPIRED)
    • REJECT(POLICY_VIOLATION)
    • REJECT(SUSPECTED_FRAUD)
  • risk_band (low|medium|high)
  • policy_flags (time_policy_violated, category_policy_violated, exception_applies, ambiguous_case)

This creates a transparent, machine-checkable explanation surface without changing reward determinism.
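A consumer of the audit payload might summarize it as follows. This sketch assumes that decision_gap equals best_counterfactual_reward minus chosen_reward and that the chosen action appears as a key in counterfactual_rewards; neither is confirmed here, and summarize_audit is a hypothetical helper.

```python
from typing import Dict


def summarize_audit(chosen_action: str,
                    counterfactual_rewards: Dict[str, float]) -> Dict[str, object]:
    # Assumption: the chosen action's reward is listed among the counterfactuals.
    chosen_reward = counterfactual_rewards[chosen_action]
    best = max(counterfactual_rewards.values())
    return {
        "chosen_action": chosen_action,
        "chosen_reward": chosen_reward,
        "best_counterfactual_reward": best,
        # Assumption: the gap is measured against the best alternative.
        "decision_gap": best - chosen_reward,
    }


example = summarize_audit(
    "ESCALATE",
    {
        "APPROVE": 0.75,
        "ESCALATE": 0.5,
        "REJECT(TIME_EXPIRED)": 0.0,
        "REJECT(POLICY_VIOLATION)": 0.0,
        "REJECT(SUSPECTED_FRAUD)": 0.25,
    },
)
```

A decision_gap of zero would then mean the chosen action matched the best available counterfactual.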

Deterministic task set

Tasks are fixed-name benchmarks with fixed seed and threshold:

  1. easy_policy_compliance:
    • seed 111
    • threshold 0.75
  2. medium_balanced_judgment:
    • seed 222
    • threshold 0.68
  3. hard_conflicting_signals:
    • seed 333
    • threshold 0.74

Terminal grader_success is computed against the active task threshold.
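The task table and the success check can be expressed as data plus one comparison. The seeds and thresholds below are taken from the list above; the TASKS dictionary, the function name, and the use of an inclusive comparison (>=) are assumptions for illustration.

```python
TASKS = {
    "easy_policy_compliance": {"seed": 111, "threshold": 0.75},
    "medium_balanced_judgment": {"seed": 222, "threshold": 0.68},
    "hard_conflicting_signals": {"seed": 333, "threshold": 0.74},
}


def grader_success(task_name: str, grader_score: float) -> bool:
    # Assumption: a score equal to the threshold counts as success.
    return grader_score >= TASKS[task_name]["threshold"]
```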

Determinism and reproducibility

  • Uses random.Random(seed) for case generation.
  • Task mode pins seed unless an explicit seed override is passed.
  • No wall-clock dependence in generation or scoring.
  • grader_score(action) is deterministic for a fixed latent case.
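The seeding pattern can be illustrated with a toy generator. Only the use of random.Random(seed) comes from the notes above; the function sample_case_sketch, the sampled fields, and their ranges are hypothetical and do not reflect the real case generator.

```python
import random


def sample_case_sketch(seed: int) -> dict:
    # A private RNG instance avoids touching the global random state.
    rng = random.Random(seed)
    return {
        "return_rate": round(rng.random(), 3),
        "total_orders": rng.randint(1, 50),
        "product_value": rng.choice(["low", "medium", "high"]),
    }


# The same seed always reproduces the same case.
assert sample_case_sketch(111) == sample_case_sketch(111)
```

Because each call builds its own random.Random instance, repeated resets with the same seed do not interfere with one another or with other code that uses the module-level random functions.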

Inference contract (../inference.py)

Baseline runner enforces strict one-line logs:

  • [START] task=<task> env=<benchmark> model=<model>
  • [STEP] step=<n> action=<action> reward=<r> done=<bool> error=<value|null>
  • [END] success=<bool> steps=<n> score=<score> rewards=<r1,r2,...>
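The three log lines could be produced by formatters like the ones below. The field order follows the templates above; the function names, the two-decimal number formatting, and the lowercase true/false rendering are assumptions, so the real inference.py output may differ.

```python
from typing import Optional, Sequence


def _flag(value: bool) -> str:
    return "true" if value else "false"


def format_start(task: str, env: str, model: str) -> str:
    return f"[START] task={task} env={env} model={model}"


def format_step(step: int, action: str, reward: float, done: bool,
                error: Optional[str]) -> str:
    err = "null" if error is None else error
    return (f"[STEP] step={step} action={action} reward={reward:.2f} "
            f"done={_flag(done)} error={err}")


def format_end(success: bool, steps: int, score: float,
               rewards: Sequence[float]) -> str:
    joined = ",".join(f"{r:.2f}" for r in rewards)
    return (f"[END] success={_flag(success)} steps={steps} "
            f"score={score:.2f} rewards={joined}")
```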

Action selection path uses environment-provided control hints:

  • available_actions
  • reject_reason_codes
  • invalid_action / last_action_error

This reduces invalid-action loops and keeps inference behavior aligned with runtime contract.
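One way to use these hints is to check a proposed action against info before sending it. The helper below is a hypothetical sketch, not the logic in inference.py; it only assumes that info carries available_actions and reject_reason_codes as lists of strings.

```python
from typing import Any, Dict, Optional


def is_allowed(action_type: str, reason_code: Optional[str],
               info: Dict[str, Any]) -> bool:
    if action_type not in info.get("available_actions", []):
        return False
    if action_type == "REJECT":
        # REJECT needs a reason code that the environment advertises.
        return reason_code in info.get("reject_reason_codes", [])
    # Non-REJECT actions must not carry a reason code.
    return reason_code is None


initial_info = {
    "phase": "initial",
    "available_actions": ["APPROVE", "REJECT", "ESCALATE", "REQUEST_INFO"],
    "reject_reason_codes": ["TIME_EXPIRED", "POLICY_VIOLATION", "SUSPECTED_FRAUD"],
}
```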

LLM proxy requirement for submission validation:

  • client initialization must use injected environment variables only:
    • api_key=os.environ["API_KEY"]
    • base_url=os.environ["API_BASE_URL"]
  • do not hardcode provider keys or bypass the injected proxy URL.
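A small helper can keep the client configuration tied to the injected variables. The function name proxy_client_kwargs is hypothetical; the returned dictionary is meant to be unpacked into whichever client constructor the runner uses (for instance an OpenAI-compatible client that accepts api_key and base_url, which is an assumption about the runner).

```python
import os
from typing import Dict, Mapping, Optional


def proxy_client_kwargs(environ: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    env = os.environ if environ is None else environ
    # Indexing (not .get) makes a missing variable fail loudly with KeyError
    # instead of silently falling back to a hardcoded default.
    return {
        "api_key": env["API_KEY"],
        "base_url": env["API_BASE_URL"],
    }
```

Passing a mapping explicitly is only for testing; in a submission the function would be called with no arguments so that it reads os.environ.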

Exploit-hardening and evaluation integrity notes:

  • step() before reset() does not execute the supplied action; it returns initial state with machine-readable error codes.
  • post-terminal step() calls return terminal state with reward=0.0 and explicit terminal error code, preventing undefined behavior loops.
  • invalid-action branches emit stable machine codes and explicit available_actions / reject_reason_codes to avoid parser ambiguity.
  • task grading is deterministic with fixed seeds and fixed success thresholds; no hidden stochastic post-processing in scoring.

Validation checklist

From repository root:

openenv validate ecom
python -m pytest tests -q
./validate-submission.sh <space-url> .

From ecom/:

openenv validate .
openenv push