payops_env / README.md
padmapriyagosakan's picture
Fix grader import path: use root-level graders module instead of server.graders
220acb1
|
Raw
History Blame Contribute Delete
23.2 kB
metadata
title: PayOps  Payment Operations Incident Response
emoji: 💳
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
tags:
  - openenv
  - finance
  - fraud-detection
  - compliance
  - reinforcement-learning
pinned: false
fullWidth: false
build_version: 2026-04-12-v6

PayOps — Payment Operations Incident Response

An OpenEnv-compatible reinforcement-learning environment where an AI agent acts as a Payment Operations analyst. The agent reviews financial transactions one by one and must decide the correct compliance action for each.


Motivation

Payment operations teams process thousands of transactions every day. A skilled analyst uses dozens of signals — risk scores, velocity, KYC status, flag patterns — to make fast, accurate decisions. This environment lets an AI agent learn and be evaluated on exactly this task, spanning clear-cut cases all the way to subtle adversarial patterns like model-score poisoning and Authorised Push Payment (APP) scams.


Environment Description

Each episode steps through all 30 transactions (6 easy, 8 medium, 10 hard, 6 critical). For each transaction the agent observes a rich set of signals and chooses one of 10 possible actions — 5 terminal decisions and 5 investigation sub-actions. A reward is returned immediately, and the next transaction is presented until the episode is complete.


Action Space

Terminal decisions (no budget cost) commit to a final outcome for the transaction. Investigation sub-actions (with budget cost) reveal more information and let the agent act again on the same transaction.

Action Type Description Budget Cost
approve terminal Mark transaction as legitimate; allow it through
reject terminal Block the transaction outright
flag terminal Soft hold; mark for manual review
escalate terminal Route to senior compliance officer / fraud team
hold terminal Temporary hold pending more information
inspect investigation Pull additional signals (logs, KYC, velocity) — yields inspection_notes 0.10
request_docs investigation Ask sender for supporting documents (invoice, contract) — yields docs_notes 0.20
verify_kyc investigation Trigger an active KYC re-verification check — yields kyc_notes 0.20
contact_sender investigation Contact the sender directly to confirm intent — yields contact_notes 0.30
file_sar investigation File a Suspicious Activity Report to the regulator (required on AML/structuring tasks) 0.10

Observation Space

Field Type Description
transaction_id str Unique transaction identifier
amount float Transaction amount in the stated currency
currency str ISO-4217 currency code
sender str Sender identifier (email / account / alias)
receiver str Receiver identifier
transaction_type str transfer | payment | withdrawal | refund | internal | loan_repayment | payroll
status str pending | approved | rejected | flagged | escalated | held | inspected | docs_requested | kyc_triggered | sender_contacted | sar_filed
risk_score float [0,1] Composite ML risk score
ml_confidence float [0,1] Model's self-reported confidence in risk_score — low value signals possible model poisoning
flags List[str] Active risk flags (e.g. high_value, unknown_sender, velocity_breach)
velocity_1h int? Transactions from sender in the past hour
velocity_24h int? Transactions from sender in the past 24 hours
avg_transaction_amount float? Sender's historical average transaction amount
account_age_days int? Age of the sender account in days
country_risk str? low | medium | high | sanctioned
kyc_status str? verified | pending | failed | none | expired
kyc_expiry_days int? Days until KYC expires (negative = already expired)
previous_violations int? Prior compliance violations for this sender
previous_sars int? Suspicious Activity Reports previously filed for this sender
counterparty_risk str? clean | unknown | watchlist | blacklist
chain_step int Current step in a multi-hop investigation chain (1 = initial presentation)
chain_total int Total investigation steps for this task (1 = single-step)
chain_context str? Accumulated summary of findings from earlier chain steps
steps_remaining int? Investigation sub-steps remaining before a terminal decision is required
action_cost float Budget cost incurred by the last action
budget_remaining float Remaining investigation budget (starts at 5.0; decreases with each investigation action)
inspection_notes str? Additional details revealed after an inspect action
docs_notes str? Document review findings after a request_docs action
kyc_notes str? KYC re-verification outcome after a verify_kyc action
contact_notes str? Outcome after a contact_sender action
investigation_hints List[str] Sub-actions recommended for this task (e.g. inspect, verify_kyc). Using them before the terminal decision earns bonus reward. Empty = no specific investigation required.
recent_decisions List[dict] Last ≤3 completed decisions in this episode (for pattern context)
network_graph dict? Mule-chain / correspondent-bank relationship graph where present
task_id str Identifier of the active task
task_difficulty str easy | medium | hard | critical
step_in_episode int Steps elapsed in this episode
reward float Reward from the last action
reward_breakdown dict Itemised reward components: base, confidence_bonus, cost_penalty, etc.
cumulative_reward float Total reward accumulated so far in this episode
done bool Whether the episode has ended
info dict Diagnostic info (event, correct action, etc.)

Task Descriptions

Easy (4 tasks — clear signals)

ID Description Correct Action
EASY-001 Low-value domestic transfer between verified users; no flags approve
EASY-002 Textbook fraud: unknown sender, offshore, sanctioned country, risk=0.97 reject
EASY-003 Standard refund to verified customer; tiny amount, no flags approve
EASY-004 ATM withdrawal burst — 15 withdrawals in 58 minutes flag
EASY-005 Scheduled monthly mortgage repayment; regular amount, verified borrower approve
EASY-006 Suspected duplicate payment: same sender/receiver/amount submitted twice in 4 minutes flag

Medium (8 tasks — ambiguous, multi-signal reasoning required)

ID Description Correct Action
MED-001 Large B2B wire, verified CFO, cross-border to medium-risk jurisdiction escalate
MED-002 Internal treasury transfer; large amount, KYC pending renewal hold
MED-003 Recurring subscription 3× higher than historical average flag
MED-004 Payment to licensed crypto exchange from verified personal account flag
MED-005 Expired KYC on high-frequency corporate payroll account; KYC lapsed 12 days ago hold
MED-006 Real estate advance payment; large first-time transfer to new receiver but signed contract exists escalate
MED-007 Supplier emails to say bank details have changed; first payment to new account matches large invoice (BEC indicator) hold
MED-008 Buy Now Pay Later high-value purchase; new account, thin credit file, elevated risk signals flag

Hard (10 tasks — adversarial / edge-case)

ID Description Correct Action
HARD-001 Fraud model poisoning: risk_score=0.18 but manual signals scream escalate escalate
HARD-002 APP (Authorised Push Payment) scam: victim sending willingly to mule account reject
HARD-003 Structuring / smurfing: just-below-CTR-threshold payments, same UBO reject
HARD-004 Legitimate FX correspondent banking settlement — looks alarming, is not approve
HARD-005 Insider threat: employee initiating transfers to personal family accounts escalate
HARD-006 Ghost account: dormant 5 years, suddenly received 20 inbound transfers this week flag
HARD-007 SIM-swap attack: phone ported 6 hours ago; account now requesting large crypto withdrawal to new address reject
HARD-008 Romance scam / pig butchering: 4th escalating transfer to overseas 'romantic partner' met online reject
HARD-009 Synthetic identity fraud: new business account with AI-generated-looking perfect profile escalate
HARD-010 Payroll diversion: HR system breach rerouted employee salary to newly added account reject

Critical (6 tasks — regulatory + multi-step investigation chains)

ID Description Correct Action
CRIT-001 Multi-step chain: large PE wire to new counterparty; inspect then request docs before deciding (chain of 3) approve
CRIT-002 Fraud ring: coordinated small payments from 3 related accounts aggregating above reporting threshold; SAR required reject
CRIT-003 Trade-based money laundering: over-invoiced international trade payment (4× market price) escalate
CRIT-004 Compromised corporate account: geo-impossible login (NY → Lagos in 8 min); confirmed account takeover reject
CRIT-005 OFAC sanctions evasion: large USD payment routed through UAE shell chain; UBO is on SDN list (chain of 3) reject
CRIT-006 Correspondent banking: partner bank added to FinCEN 311 Special Measures list; in-flight payments must be escalated escalate

Reward Design

Outcome Reward
Correct action +1.0
Partial-credit adjacent action (per-task) +0.2 – +0.6
inspect (information seeking, first time) +0.15
approve when correct is reject / escalate −1.0
approve when correct is flag / hold −0.5
reject when correct is approve −0.5
Any other wrong action −0.25

The episode score (0–1) is: max(0, total_reward) / max_possible_reward. A score ≥ 0.5 is considered a passing episode.


API Endpoints

Method Path Description
POST /reset Reset environment, return first observation
POST /step Execute an action
GET /state Current internal environment state
GET /schema JSON schemas for action / observation / state
GET /tasks Full task list with metadata
GET /grader Grade the current episode
POST /baseline Run rule-based baseline and return scores
GET /health Health check
WS /ws WebSocket persistent session

Interactive API docs: http://localhost:8000/docs


Setup & Running

Local (Python)

# 1. Install dependencies
pip install -r requirements.txt

# 2. Start the server (from the parent directory of payops_env)
PYTHONPATH=$(pwd) uvicorn payops_env.server.app:app --host 0.0.0.0 --port 8000

# 3. Verify
curl http://localhost:8000/health

Run the baseline agent

# Via the API endpoint (no extra script needed)
curl -s -X POST http://localhost:8000/baseline | python3 -m json.tool

Docker

# Build
docker build -t payops-env .

# Run locally on port 8000
docker run -p 8000:7860 -e PORT=7860 payops-env

# Verify
curl http://localhost:8000/health

HuggingFace Space

The Dockerfile exposes port 7860 (HF Spaces default). Push the repo to a HF Space with Docker runtime — no additional configuration required.


Example Agent Interaction

import httpx

base = "http://localhost:8000"

# Reset
obs = httpx.post(f"{base}/reset").json()
print(obs["transaction_id"], obs["risk_score"], obs["flags"])

# Step
while not obs["done"]:
    # ... agent decides action_type ...
    obs = httpx.post(f"{base}/step", json={
        "action_type": "approve",
        "transaction_id": obs["transaction_id"],
    }).json()
    print(f"reward={obs['reward']:+.2f}  done={obs['done']}")

# Grade
score = httpx.get(f"{base}/grader").json()
print(f"Episode score: {score['normalised_score']:.4f}")

Baseline Results

Rule-based baseline (POST /baseline)

The rule-based baseline uses a deterministic priority-ordered policy in scripts_util.py.

Metric Rule-based baseline (v2, 30 tasks)
Normalised score 0.68–0.76
Passed (≥ 0.5) Yes
Strong at Easy tasks, clear velocity/flag patterns
Weak at Hard adversarial tasks (HARD-001 model-poisoning, HARD-004 FX settlement)
Critical coverage Partial — misses some SAR filing requirements

Scores vary slightly per run due to per-episode parameter jitter.

Run POST /baseline to reproduce.

LLM baseline (inference.pyllama-3.1-8b-instant via Groq)

Run locally against seed 42 (reproducible) with investigation sub-actions enabled.

Metric llama-3.1-8b-instant (Groq)
Normalised score 0.6028
Total reward 17.000 / 28.200 max
Tasks correct 6 / 20 (30%)
Budget spent 5.50 / 5.00
Budget penalty 0.05
Episode steps 57 (incl. investigation sub-actions)
Duration ~290 s
Passed (≥ 0.5) YES ✓
Seed 42 (fixed — deterministic across re-runs)

Per-task decisions:

Task LLM Action Correct Action Weighted Reward
EASY-001 approve approve +1.000 ✓
EASY-002 flag reject −0.250 ✗ (flag no longer partial credit)
EASY-003 approve approve +1.000 ✓
EASY-004 flag flag +1.000 ✓
MED-001 flag escalate +0.900 (partial + investigation bonus)
MED-002 flag hold +0.540 (partial + investigation bonus)
MED-003 flag flag +1.200 ✓
MED-004 flag flag +1.200 ✓
MED-005 flag hold +0.660 (partial + investigation bonus)
MED-006 flag escalate +0.600 (partial + investigation bonus)
HARD-001 flag escalate +1.275 (partial + investigation bonus)
HARD-002 flag reject +0.525 (partial + investigation bonus)
HARD-003 flag reject +0.675 (partial + investigation bonus)
HARD-004 flag approve +0.825 (partial + investigation bonus)
HARD-005 flag escalate +0.825 (partial + investigation bonus)
HARD-006 flag flag +2.025 ✓ (+ investigation bonus)
CRIT-001 flag approve +1.100 (partial + investigation bonus)
CRIT-002 flag reject +0.900 (partial + investigation bonus)
CRIT-003 flag escalate +1.300 (partial + investigation bonus)
CRIT-004 flag reject −0.250 ✗

Observations: The model used investigation sub-actions (inspect, verify_kyc, contact_sender) before terminal decisions, earning investigation bonuses that raised the score from a naive always-flag baseline. Easy cases with clear evidence now penalise lazy flag decisions (e.g. EASY-002). Agents that correctly identify terminal actions on top of proper investigation can exceed 0.90.

To reproduce exactly (seed=42 is the default):

export OPENAI_API_KEY="gsk_..."           # your Groq API key
export API_BASE_URL="https://api.groq.com/openai/v1"
export MODEL_NAME="llama-3.1-8b-instant"
export PAYOPS_BASE_URL="https://padmapriyagosakan-payops-env.hf.space"
# INFERENCE_SEED=42  # default; set to "random" for a fresh episode
PYTHONPATH=$(pwd) python payops_env/inference.py

For Groq setup instructions see the Running inference with Groq section below.


Running inference with Groq (recommended — free)

Groq provides a completely free API with no monthly credit cap and no installation required. It uses the same OpenAI-compatible interface that inference.py already targets.

Prerequisites

  1. Create a free Groq account — go to console.groq.com and sign up (Google / GitHub login available)

  2. Generate an API key — click API Keys → Create API Key, copy the key (starts with gsk_)

  3. Install the Python dependency (already in requirements.txt):

    pip install openai
    

Run inference

cd /path/to/payops_env      # project root (parent of payops_env/)

export OPENAI_API_KEY="gsk_..."          # your Groq API key
export API_BASE_URL="https://api.groq.com/openai/v1"
export MODEL_NAME="llama-3.1-8b-instant"
export PAYOPS_BASE_URL="https://padmapriyagosakan-payops-env.hf.space"

PYTHONPATH=$(pwd) python payops_env/inference.py

Why Groq?

  • Free tier: 14,400 requests/day, 500,000 tokens/minute — a 20-task episode uses ~30 calls
  • No monthly credit pool that runs out mid-run (unlike the HF free tier)
  • No installation or model download (unlike Ollama)
  • temperature=0.0 is already set in inference.py so results are reproducible
  • Inference speed: ~750 tok/s → full episode completes in under 30 seconds

Alternative free models on Groq

Model Notes
llama-3.1-8b-instant Fastest, good reasoning
llama-3.3-70b-versatile Best quality on hard tasks; same free tier
mixtral-8x7b-32768 Large context window
gemma2-9b-it Google Gemma 2

Alternative: Ollama (fully local, no internet required for LLM calls)

If you prefer to run the model entirely on your machine:

# 1. Install
brew install ollama

# 2. Pull a model (choose based on available RAM)
ollama pull qwen2.5:3b      # ~2 GB  –  8 GB RAM
ollama pull qwen2.5:7b      # ~4.7 GB – 16 GB RAM

# 3. Start the server (keep running in a separate terminal)
ollama serve

# 4. Run inference
export OPENAI_API_KEY=ollama
export API_BASE_URL="http://localhost:11434/v1"
export MODEL_NAME="qwen2.5:3b"
export PAYOPS_BASE_URL="https://padmapriyagosakan-payops-env.hf.space"
PYTHONPATH=$(pwd) python payops_env/inference.py

Project Structure

payops_env/
├── models.py              # PayOpsAction, PayOpsObservation, PayOpsState (Pydantic)
├── environment.py         # PayOpsEnvironment — reset_async / step_async / state
├── tasks.py               # 30 tasks (EASY×6, MED×8, HARD×10, CRIT×6) with ground-truth labels
├── grader.py              # Partial-credit reward function + episode grader
├── scripts_util.py        # Baseline runner helper (used by /baseline endpoint)
├── server/
│   └── app.py             # FastAPI server with all required endpoints
├── inference.py           # Competition inference script (OpenAI client, root-level)
├── validate.py            # Pre-submission checklist validator
├── openenv.yaml           # OpenEnv manifest v2.0.0
├── Dockerfile             # Docker / HuggingFace Space container (port 7860)
├── requirements.txt       # Python dependencies
└── README.md              # This file

Evaluation Criteria Alignment

Criterion Implementation
Real-world utility Payment fraud and compliance triage — deployed daily by fintech ops teams worldwide
Task & grader quality 30 tasks across 4 difficulty tiers (easy→critical); partial-credit grader; clear pass/fail
Environment design 30-field observation space; 10-action space (5 terminal + 5 investigation); budget mechanic; episode state tracking
Code quality & spec compliance Pydantic v2 models; async API; all 11 required endpoints; openenv.yaml v2; Dockerfile; validate.py
Creativity & novelty Adversarial model-poisoning task; APP scam; AML structuring with SAR requirement; PEP detection

Reward Design (v2 — Trajectory-Based)

Rewards are dense across the full trajectory, not just on the final decision:

Component Value Condition
Correct terminal action +1.0 per task (difficulty-weighted in episode score)
Investigation sub-action +0.15 per eligible sub-action, first use only
Flag identification +0.20 agent used inspect AND key diagnostic flags present
Confidence bonus +0.10 confidence ≥ 0.8 AND correct
Confidence penalty −0.10 confidence ≥ 0.8 AND wrong
Regulatory SAR bonus +0.20 file_sar before terminal on a regulatory task
Duplicate investigation −0.05 same sub-action used twice on same task
Approve a fraud/sanctioned −1.00 worst mistake

Difficulty weights: easy×1.0, medium×1.2, hard×1.5, critical×2.0
Episode score is strictly clamped to [0.0, 1.0]. Passing threshold: 0.5.

Per-Episode Parameter Jitter

Each POST /reset generates a unique episode_seed and applies small random perturbations to prevent agent overfitting:

Field Jitter
amount × Uniform(0.85, 1.20)
risk_score + Gauss(0, 0.03), clamped [0,1]
velocity_1h + Randint(−3, +3), min 0
velocity_24h + Randint(−3, +3), min 0

The correct_action and all ground-truth labels are never changed — only the observable values the agent uses to make decisions.

The episode_seed is returned by GET /health and GET /state for reproducibility.

Network Graph

Selected tasks include a network_graph field in the observation exposing mule-chain / correspondent-banking relationships (e.g. victim → mule → offshore). This gives agents richer context for complex fraud patterns.