title: E-commerce Returns Decision Environment
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
  - openenv
  - operations
  - decision-making

E-commerce Returns Decision Environment

This environment is a partially observable, policy-constrained decision process for a single e-commerce return case per episode.

Environment root and loader contract

This repository uses ecom/ as the OpenEnv environment root.

ecom/openenv.yaml is the authoritative manifest.
app: server.app:app resolves inside ecom/, so it maps to ecom/server/app.py.
Validate from the repository root with openenv validate ecom, or from ecom/ with openenv validate . (the current directory).

Formal task definition

Each episode is defined over a hidden state s_t and an observation o_t.

Hidden state contains fraud intent, policy violations, latent risk, and optimal action target.
Observation exposes only operational case fields and policy summary text.
Agent takes one action from A = {APPROVE, REJECT, ESCALATE, REQUEST_INFO}.
Terminal objective is maximizing normalized reward while satisfying policy gate constraints.

The environment is not a static classifier. It is a short-horizon sequential decision loop with action-dependent transition and scoring.

Schemas

Action (`EcomAction`)

action_type: APPROVE | REJECT | ESCALATE | REQUEST_INFO
reason_code is required only when action_type == REJECT
Allowed reject reasons:
- TIME_EXPIRED
- POLICY_VIOLATION
- SUSPECTED_FRAUD

Validation is strict: non-REJECT actions cannot carry reason_code.
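The validation rules above can be sketched as a small dataclass. This is a minimal illustration, not the actual EcomAction implementation: the class name ActionSketch and the use of ValueError are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional

ACTION_TYPES = {"APPROVE", "REJECT", "ESCALATE", "REQUEST_INFO"}
REJECT_REASONS = {"TIME_EXPIRED", "POLICY_VIOLATION", "SUSPECTED_FRAUD"}


@dataclass(frozen=True)
class ActionSketch:
    """Hypothetical stand-in for EcomAction, used only to show the rules."""

    action_type: str
    reason_code: Optional[str] = None

    def __post_init__(self) -> None:
        if self.action_type not in ACTION_TYPES:
            raise ValueError(f"unknown action_type: {self.action_type}")
        if self.action_type == "REJECT":
            # REJECT must carry one of the allowed reason codes.
            if self.reason_code not in REJECT_REASONS:
                raise ValueError("REJECT requires a valid reason_code")
        elif self.reason_code is not None:
            # Strict validation: non-REJECT actions cannot carry reason_code.
            raise ValueError("reason_code is only allowed with REJECT")
```

For example, ActionSketch("REJECT", "TIME_EXPIRED") is accepted, while ActionSketch("APPROVE", "TIME_EXPIRED") raises.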

Observation (`EcomObservation`)

return_reason
product_category
product_value in {low, medium, high}
days_since_purchase
user_account_age_days
product_condition_notes
return_rate in [0,1]
total_orders >= 1
policy_summary
reward, done, info

Reward payload (`EcomReward`)

Terminal breakdown keys:

policy_gate
financial_score
fraud_score
efficiency_score
normalized_reward
policy_violation
optimal_action
matched_optimal

All numeric reward components are bounded to [0,1]. optimal_action is the highest-scoring legal terminal action label, such as APPROVE, ESCALATE, or REJECT(<REASON>); it may be null when no legal terminal action exists from the current state.

Episode protocol

Reset

reset(seed=None, episode_id=None, task_name=None):

initializes state
samples deterministic or stochastic case
returns initial observation with:
- info.phase=initial
- info.available_actions=[APPROVE, REJECT, ESCALATE, REQUEST_INFO]
- info.reject_reason_codes=[TIME_EXPIRED, POLICY_VIOLATION, SUSPECTED_FRAUD]
- info.task_name, info.task_seed, info.task_objective when task-based

Step

step(action) follows these guards and transitions:

If called before reset:
- action is ignored
- returns fresh initial observation
- sets invalid_action and last_action_error to step_called_before_reset_action_ignored
If called after terminal:
- returns terminal observation
- reward=0.0, done=true
- sets invalid_action and last_action_error to episode_already_terminated_call_reset
REQUEST_INFO first use:
- non-terminal
- refines existing fields only
- reward shaping: +0.08 if ambiguous else -0.03
Repeated REQUEST_INFO:
- non-terminal penalty -0.10
- error code: request_info_already_used
Invalid non-terminal-final action type:
- non-terminal penalty -0.05
- error code: invalid_final_action
Valid terminal action (APPROVE|REJECT|ESCALATE):
- runs policy gate then reward model
- returns terminal observation with grader fields

The hard cap is _MAX_STEPS=4. Exceeding the cap returns a terminal observation with reward 0.0 and termination_reason=max_steps_exceeded.
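The guard order above can be sketched as a pure dispatch function. This is a rough illustration under stated assumptions: EpisodeFlags, step_outcome, and the branch labels are hypothetical names, the order of checks is inferred from the list, and the step cap is left out. A reward of None marks a value the list does not specify or that the gate and reward model compute.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

TERMINAL_ACTIONS = {"APPROVE", "REJECT", "ESCALATE"}


@dataclass
class EpisodeFlags:
    """Hypothetical per-episode bookkeeping for this sketch."""

    reset_called: bool = False
    terminated: bool = False
    info_requested: bool = False
    ambiguous: bool = False


def step_outcome(flags: EpisodeFlags, action_type: str) -> Tuple[str, Optional[float], Optional[str]]:
    """Return (branch, reward, error_code) for one step call."""
    if not flags.reset_called:
        # The supplied action is ignored; a fresh initial observation is returned.
        return ("initial", None, "step_called_before_reset_action_ignored")
    if flags.terminated:
        return ("terminal", 0.0, "episode_already_terminated_call_reset")
    if action_type == "REQUEST_INFO":
        if flags.info_requested:
            return ("non_terminal", -0.10, "request_info_already_used")
        flags.info_requested = True
        return ("non_terminal", 0.08 if flags.ambiguous else -0.03, None)
    if action_type not in TERMINAL_ACTIONS:
        return ("non_terminal", -0.05, "invalid_final_action")
    # Valid terminal action: the policy gate and reward model decide the reward.
    flags.terminated = True
    return ("terminal", None, None)
```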

Info-channel contract

info is the machine-readable control channel. It is used for policy hints, error handling, and grader reporting.

Common keys by phase:

Initial phase:
- phase=initial
- available_actions
- reject_reason_codes
Post-REQUEST_INFO phase:
- phase=post_request_info
- revealed
- available_actions
- reject_reason_codes
Terminal phase:
- phase=terminal
- breakdown
- grader_score
- grader_success
- decision_audit
Invalid action paths:
- invalid_action (stable machine code)
- last_action_error (same machine code)

Case generation model

Difficulty presets

easy: fraud 0.10, ambiguity 0.10, conflict 0.05
medium: fraud 0.25, ambiguity 0.30, conflict 0.20
hard: fraud 0.40, ambiguity 0.55, conflict 0.45

Latent risk construction

For non-hard-template episodes, latent fraud risk is derived from correlated signals, not independent labels.

Base formula (clamped to [0,1]):

risk = base_fraud_probability
     + 0.35 * (return_rate - 0.30)
     + 0.10 * value_index
     + reason_and_account_adjustments

Here value_index maps product_value low/medium/high to an offset of -1/0/+1 through internal indexing. Fraud intent is then sampled from this latent risk.
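The base formula can be sketched in a few lines of Python. The function names, the adjustments parameter (standing in for reason_and_account_adjustments, whose internals are not described here), and the Bernoulli draw used for intent sampling are assumptions for illustration.

```python
import random

# low/medium/high mapped to the -1/0/+1 offset described above.
VALUE_INDEX = {"low": -1, "medium": 0, "high": 1}


def clamp01(x: float) -> float:
    """Clamp a value to the closed interval [0, 1]."""
    return max(0.0, min(1.0, x))


def latent_risk(base_fraud_probability: float, return_rate: float,
                product_value: str, adjustments: float = 0.0) -> float:
    """Base risk formula; adjustments is a placeholder for the unspecified term."""
    raw = (base_fraud_probability
           + 0.35 * (return_rate - 0.30)
           + 0.10 * VALUE_INDEX[product_value]
           + adjustments)
    return clamp01(raw)


def sample_intent(risk: float, rng: random.Random) -> bool:
    """Assumed sampling rule: fraud intent is a Bernoulli draw with p = risk."""
    return rng.random() < risk
```

With the medium preset (fraud 0.25), a return_rate of 0.50, and a high-value item, the formula gives 0.25 + 0.07 + 0.10 = 0.42 before any reason or account adjustments.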

Policy model

Each category defines:

return window days
non-returnable category list
exception text

Policy violations are split into:

time_policy_violated
category_policy_violated

Exception handling is explicitly modeled and influences both generation and policy gate decisions.

Ambiguity and conflict injection

Ambiguity and conflict are sampled from difficulty-controlled rates.
Conflict mutates condition/policy wording to create realistic contradictory evidence patterns.

Hard template (`hard_conflicting_signals`)

The hard task uses a deterministic high-risk template:

high-value electronics focus
near-window timing
intentionally conflicting evidence phrases
stricter policy-gate behavior requiring evidence handling before finalization

Transition semantics for `REQUEST_INFO`

REQUEST_INFO does not add new fields. It only refines existing observable fields deterministically from hidden intent:

product_condition_notes
return_reason (may refine)
return_rate (small deterministic shift)

This keeps schema fixed while allowing information-gathering behavior.

Policy gate

If policy gate fails, terminal reward is forced to 0.0.

Core constraints enforced:

APPROVE is blocked on time/category violations.
APPROVE may be blocked in high-risk ambiguous cases without exception.
REJECT requires reason-code consistency with actual violation structure.
Fraud rejection is blocked when fraud signal is too low.
Rejecting clear low-fraud service-failure claims is blocked.
In ambiguous hard scenarios, direct finalization before evidence collection can be blocked.

Reward model

After gate pass:

Financial component:

financial_raw = cost_impact[action] + reason_bonus + trajectory_bonus
financial_score = clamp01((financial_raw + 1.5) / 3.0)

Fraud component uses action-intent-risk-conditioned piecewise scoring.
Efficiency component:

efficiency = 1.0 - 0.20*(requested_info_used) - 0.30*(action==ESCALATE)

Final reward:

reward = clamp01(
    0.50 * financial_score
  + 0.30 * fraud_score
  + 0.20 * efficiency_score
)

Trajectory shaping:

positive bonus for requesting info in ambiguous cases
penalty for skipping info in ambiguous cases
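The scoring formulas in this section combine as in the sketch below. The function names are illustrative, and fraud_score is taken as an input because its piecewise rule is not spelled out here.

```python
def clamp01(x: float) -> float:
    """Clamp a value to the closed interval [0, 1]."""
    return max(0.0, min(1.0, x))


def financial_score(financial_raw: float) -> float:
    # financial_raw = cost_impact[action] + reason_bonus + trajectory_bonus
    return clamp01((financial_raw + 1.5) / 3.0)


def efficiency_score(requested_info_used: bool, action: str) -> float:
    return 1.0 - 0.20 * int(requested_info_used) - 0.30 * int(action == "ESCALATE")


def final_reward(financial: float, fraud: float, efficiency: float,
                 gate_passed: bool = True) -> float:
    # A failed policy gate forces the terminal reward to 0.0.
    if not gate_passed:
        return 0.0
    return clamp01(0.50 * financial + 0.30 * fraud + 0.20 * efficiency)
```

As a worked example, financial_raw = 0.6 gives financial_score = 0.7. With fraud_score = 0.8 and an APPROVE after one REQUEST_INFO (efficiency 0.8), the reward is 0.35 + 0.24 + 0.16 = 0.75.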

Creative evaluator features

To improve real-world utility and exploit resistance during review, terminal responses include a deterministic decision audit payload.

info.decision_audit includes:

chosen_action
chosen_reward
best_counterfactual_reward
decision_gap
counterfactual_rewards for:
- APPROVE
- ESCALATE
- REJECT(TIME_EXPIRED)
- REJECT(POLICY_VIOLATION)
- REJECT(SUSPECTED_FRAUD)
risk_band (low|medium|high)
policy_flags (time_policy_violated, category_policy_violated, exception_applies, ambiguous_case)

This creates a transparent, machine-checkable explanation surface without changing reward determinism.
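A consumer of the audit payload might recompute the gap as a sanity check. The sketch below assumes that counterfactual_rewards is a mapping keyed by action label, that the chosen action's label appears among those keys, and that decision_gap equals best_counterfactual_reward minus chosen_reward. None of these details are confirmed by the field list above.

```python
from typing import Dict


def recompute_gap(chosen_action: str, counterfactual_rewards: Dict[str, float]) -> Dict[str, float]:
    """Recompute audit fields under the assumed definition gap = best - chosen."""
    chosen = counterfactual_rewards[chosen_action]
    best = max(counterfactual_rewards.values())
    return {
        "chosen_reward": chosen,
        "best_counterfactual_reward": best,
        "decision_gap": best - chosen,
    }
```

Under this reading, a decision_gap of 0.0 means the chosen action tied with the best counterfactual.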

Deterministic task set

Tasks are fixed-name benchmarks with fixed seed and threshold:

easy_policy_compliance:
- seed 111
- threshold 0.75
medium_balanced_judgment:
- seed 222
- threshold 0.68
hard_conflicting_signals:
- seed 333
- threshold 0.74

Terminal grader_success is computed against the active task threshold.
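The task table can be expressed as a lookup, with the success check as a threshold comparison. The dictionary layout and function name are illustrative, and the use of >= rather than > is an assumption; the source only says the score is compared against the threshold.

```python
TASKS = {
    "easy_policy_compliance": {"seed": 111, "threshold": 0.75},
    "medium_balanced_judgment": {"seed": 222, "threshold": 0.68},
    "hard_conflicting_signals": {"seed": 333, "threshold": 0.74},
}


def grader_success(task_name: str, grader_score: float) -> bool:
    """Assumed rule: success when the score meets or exceeds the task threshold."""
    return grader_score >= TASKS[task_name]["threshold"]
```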

Determinism and reproducibility

Uses random.Random(seed) for case generation.
Task mode pins seed unless an explicit seed override is passed.
No wall-clock dependence in generation or scoring.
grader_score(action) is deterministic for a fixed latent case.
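The effect of seeding through random.Random(seed) can be shown with a toy generator. The function sample_case_sketch and its field ranges are hypothetical and do not reflect the real case generator; the point is only that a private, seeded generator reproduces the same case on every call.

```python
import random


def sample_case_sketch(seed: int) -> dict:
    """Toy case sampler: all randomness flows through one seeded generator."""
    rng = random.Random(seed)  # private RNG, independent of global random state
    return {
        "product_value": rng.choice(["low", "medium", "high"]),
        "days_since_purchase": rng.randint(0, 60),  # illustrative range
        "return_rate": round(rng.random(), 3),
    }


# Same seed, same case; no wall-clock or global state involved.
assert sample_case_sketch(111) == sample_case_sketch(111)
```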

Inference contract (`../inference.py`)

Baseline runner enforces strict one-line logs:

[START] task=<task> env=<benchmark> model=<model>
[STEP] step=<n> action=<action> reward=<r> done=<bool> error=<value|null>
[END] success=<bool> steps=<n> score=<score> rewards=<r1,r2,...>
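A minimal formatter for these three line types might look like the following. The helper names, the two-decimal number formatting, and the lowercase true/false rendering are assumptions; check them against ../inference.py before relying on exact output.

```python
from typing import Optional, Sequence


def _bool(flag: bool) -> str:
    """Render booleans in lowercase (assumed convention)."""
    return "true" if flag else "false"


def format_start(task: str, env: str, model: str) -> str:
    return f"[START] task={task} env={env} model={model}"


def format_step(step: int, action: str, reward: float, done: bool,
                error: Optional[str]) -> str:
    error_text = "null" if error is None else error
    return (f"[STEP] step={step} action={action} reward={reward:.2f} "
            f"done={_bool(done)} error={error_text}")


def format_end(success: bool, steps: int, score: float,
               rewards: Sequence[float]) -> str:
    joined = ",".join(f"{r:.2f}" for r in rewards)
    return f"[END] success={_bool(success)} steps={steps} score={score:.2f} rewards={joined}"
```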

Action selection path uses environment-provided control hints:

available_actions
reject_reason_codes
invalid_action / last_action_error

This reduces invalid-action loops and keeps inference behavior aligned with runtime contract.

LLM proxy requirement for submission validation:

client initialization must use injected environment variables only:
- api_key=os.environ["API_KEY"]
- base_url=os.environ["API_BASE_URL"]
do not hardcode provider keys or bypass the injected proxy URL.
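One way to satisfy this requirement is to read both values in a single helper and fail fast when either is missing. The helper name proxy_client_kwargs is hypothetical, and the source does not name the client library, so the sketch stops at building the keyword arguments.

```python
import os
from typing import Dict, Mapping


def proxy_client_kwargs(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Build client keyword arguments from injected variables only.

    Raises KeyError if API_KEY or API_BASE_URL is missing, rather than
    falling back to a hardcoded key or URL.
    """
    return {
        "api_key": env["API_KEY"],
        "base_url": env["API_BASE_URL"],
    }
```

The resulting dictionary can be unpacked into any client constructor that accepts api_key and base_url keyword arguments.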

Exploit-hardening and evaluation integrity notes:

step() before reset() does not execute the supplied action; it returns initial state with machine-readable error codes.
post-terminal step() calls return terminal state with reward=0.0 and explicit terminal error code, preventing undefined behavior loops.
invalid-action branches emit stable machine codes and explicit available_actions / reject_reason_codes to avoid parser ambiguity.
task grading is deterministic with fixed seeds and fixed success thresholds; no hidden stochastic post-processing in scoring.

Validation checklist

From repository root:

openenv validate ecom
python -m pytest tests -q
./validate-submission.sh <space-url> .

From ecom/:

openenv validate .
openenv push