title: PayOps — Payment Operations Incident Response
emoji: 💳
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
tags:
- openenv
- finance
- fraud-detection
- compliance
- reinforcement-learning
pinned: false
fullWidth: false
build_version: 2026-04-12-v6
PayOps — Payment Operations Incident Response
An OpenEnv-compatible reinforcement-learning environment where an AI agent acts as a Payment Operations analyst. The agent reviews financial transactions one by one and must decide the correct compliance action for each.
Motivation
Payment operations teams process thousands of transactions every day. A skilled analyst uses dozens of signals — risk scores, velocity, KYC status, flag patterns — to make fast, accurate decisions. This environment lets an AI agent learn and be evaluated on exactly this task, spanning clear-cut cases all the way to subtle adversarial patterns like model-score poisoning and Authorised Push Payment (APP) scams.
Environment Description
Each episode steps through all 30 transactions (6 easy, 8 medium, 10 hard, 6 critical). For each transaction the agent observes a rich set of signals and chooses one of 10 possible actions — 5 terminal decisions and 5 investigation sub-actions. A reward is returned immediately, and the next transaction is presented until the episode is complete.
Action Space
Terminal decisions (no budget cost) commit to a final outcome for the transaction. Investigation sub-actions (with budget cost) reveal more information and let the agent act again on the same transaction.
| Action | Type | Description | Budget Cost |
|---|---|---|---|
| `approve` | terminal | Mark transaction as legitimate; allow it through | none |
| `reject` | terminal | Block the transaction outright | none |
| `flag` | terminal | Soft hold; mark for manual review | none |
| `escalate` | terminal | Route to senior compliance officer / fraud team | none |
| `hold` | terminal | Temporary hold pending more information | none |
| `inspect` | investigation | Pull additional signals (logs, KYC, velocity); yields `inspection_notes` | 0.10 |
| `request_docs` | investigation | Ask sender for supporting documents (invoice, contract); yields `docs_notes` | 0.20 |
| `verify_kyc` | investigation | Trigger an active KYC re-verification check; yields `kyc_notes` | 0.20 |
| `contact_sender` | investigation | Contact the sender directly to confirm intent; yields `contact_notes` | 0.30 |
| `file_sar` | investigation | File a Suspicious Activity Report to the regulator (required on AML/structuring tasks) | 0.10 |
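The split between free terminal decisions and budgeted investigation sub-actions can be sketched as a small lookup. This is an illustration only: the costs are copied from the table above, while the names `ACTION_COSTS`, `is_terminal`, and `charge` are hypothetical and are not taken from `models.py`.

```python
# Budget cost per action, copied from the Action Space table.
# Terminal decisions are free; investigation sub-actions draw down the budget.
ACTION_COSTS = {
    "approve": 0.0,
    "reject": 0.0,
    "flag": 0.0,
    "escalate": 0.0,
    "hold": 0.0,
    "inspect": 0.10,
    "request_docs": 0.20,
    "verify_kyc": 0.20,
    "contact_sender": 0.30,
    "file_sar": 0.10,
}

TERMINAL_ACTIONS = {"approve", "reject", "flag", "escalate", "hold"}


def is_terminal(action_type: str) -> bool:
    """Return True if the action commits a final outcome for the transaction."""
    return action_type in TERMINAL_ACTIONS


def charge(budget_remaining: float, action_type: str) -> float:
    """Return the budget left after taking an action (hypothetical helper)."""
    return round(budget_remaining - ACTION_COSTS[action_type], 2)


budget = 5.0  # starting budget, per the budget_remaining field description
for action in ["inspect", "verify_kyc", "escalate"]:
    budget = charge(budget, action)
print(budget)
```

In this example the two investigation steps cost 0.30 in total and the final `escalate` costs nothing.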
Observation Space
| Field | Type | Description |
|---|---|---|
| `transaction_id` | `str` | Unique transaction identifier |
| `amount` | `float` | Transaction amount in the stated currency |
| `currency` | `str` | ISO-4217 currency code |
| `sender` | `str` | Sender identifier (email / account / alias) |
| `receiver` | `str` | Receiver identifier |
| `transaction_type` | `str` | `transfer`, `payment`, `withdrawal`, `refund`, `internal`, `loan_repayment`, `payroll` |
| `status` | `str` | `pending`, `approved`, `rejected`, `flagged`, `escalated`, `held`, `inspected`, `docs_requested`, `kyc_triggered`, `sender_contacted`, `sar_filed` |
| `risk_score` | `float [0,1]` | Composite ML risk score |
| `ml_confidence` | `float [0,1]` | Model's self-reported confidence in `risk_score`; a low value signals possible model poisoning |
| `flags` | `List[str]` | Active risk flags (e.g. `high_value`, `unknown_sender`, `velocity_breach`) |
| `velocity_1h` | `int?` | Transactions from sender in the past hour |
| `velocity_24h` | `int?` | Transactions from sender in the past 24 hours |
| `avg_transaction_amount` | `float?` | Sender's historical average transaction amount |
| `account_age_days` | `int?` | Age of the sender account in days |
| `country_risk` | `str?` | `low`, `medium`, `high`, `sanctioned` |
| `kyc_status` | `str?` | `verified`, `pending`, `failed`, `none`, `expired` |
| `kyc_expiry_days` | `int?` | Days until KYC expires (negative = already expired) |
| `previous_violations` | `int?` | Prior compliance violations for this sender |
| `previous_sars` | `int?` | Suspicious Activity Reports previously filed for this sender |
| `counterparty_risk` | `str?` | `clean`, `unknown`, `watchlist`, `blacklist` |
| `chain_step` | `int` | Current step in a multi-hop investigation chain (1 = initial presentation) |
| `chain_total` | `int` | Total investigation steps for this task (1 = single-step) |
| `chain_context` | `str?` | Accumulated summary of findings from earlier chain steps |
| `steps_remaining` | `int?` | Investigation sub-steps remaining before a terminal decision is required |
| `action_cost` | `float` | Budget cost incurred by the last action |
| `budget_remaining` | `float` | Remaining investigation budget (starts at 5.0; decreases with each investigation action) |
| `inspection_notes` | `str?` | Additional details revealed after an `inspect` action |
| `docs_notes` | `str?` | Document review findings after a `request_docs` action |
| `kyc_notes` | `str?` | KYC re-verification outcome after a `verify_kyc` action |
| `contact_notes` | `str?` | Outcome after a `contact_sender` action |
| `investigation_hints` | `List[str]` | Sub-actions recommended for this task (e.g. `inspect`, `verify_kyc`). Using them before the terminal decision earns bonus reward. Empty = no specific investigation required. |
| `recent_decisions` | `List[dict]` | Last ≤3 completed decisions in this episode (for pattern context) |
| `network_graph` | `dict?` | Mule-chain / correspondent-bank relationship graph where present |
| `task_id` | `str` | Identifier of the active task |
| `task_difficulty` | `str` | `easy`, `medium`, `hard`, `critical` |
| `step_in_episode` | `int` | Steps elapsed in this episode |
| `reward` | `float` | Reward from the last action |
| `reward_breakdown` | `dict` | Itemised reward components: `base`, `confidence_bonus`, `cost_penalty`, etc. |
| `cumulative_reward` | `float` | Total reward accumulated so far in this episode |
| `done` | `bool` | Whether the episode has ended |
| `info` | `dict` | Diagnostic info (event, correct action, etc.) |
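Some of these fields carry conventions that are easy to misread, such as optional values and the sign of `kyc_expiry_days`. The sketch below shows one way an agent could read them from an observation dict. The helper names and the 0.5 confidence cut-off are assumptions made for illustration; the environment does not define them.

```python
from typing import Any, Dict


def kyc_expired(obs: Dict[str, Any]) -> bool:
    """True if KYC is marked expired or kyc_expiry_days is negative."""
    days = obs.get("kyc_expiry_days")  # int? in the table, so it may be None
    return obs.get("kyc_status") == "expired" or (days is not None and days < 0)


def score_is_untrustworthy(obs: Dict[str, Any], min_confidence: float = 0.5) -> bool:
    """True if the model reports low confidence in its own risk_score.

    The 0.5 cut-off is an illustrative assumption, not an environment constant.
    """
    return obs["ml_confidence"] < min_confidence


example = {
    "kyc_status": "verified",
    "kyc_expiry_days": -12,  # negative means already expired
    "risk_score": 0.18,
    "ml_confidence": 0.31,
}
print(kyc_expired(example), score_is_untrustworthy(example))
```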
Task Descriptions
Easy (6 tasks — clear signals)
| ID | Description | Correct Action |
|---|---|---|
| EASY-001 | Low-value domestic transfer between verified users; no flags | approve |
| EASY-002 | Textbook fraud: unknown sender, offshore, sanctioned country, risk=0.97 | reject |
| EASY-003 | Standard refund to verified customer; tiny amount, no flags | approve |
| EASY-004 | ATM withdrawal burst — 15 withdrawals in 58 minutes | flag |
| EASY-005 | Scheduled monthly mortgage repayment; regular amount, verified borrower | approve |
| EASY-006 | Suspected duplicate payment: same sender/receiver/amount submitted twice in 4 minutes | flag |
Medium (8 tasks — ambiguous, multi-signal reasoning required)
| ID | Description | Correct Action |
|---|---|---|
| MED-001 | Large B2B wire, verified CFO, cross-border to medium-risk jurisdiction | escalate |
| MED-002 | Internal treasury transfer; large amount, KYC pending renewal | hold |
| MED-003 | Recurring subscription 3× higher than historical average | flag |
| MED-004 | Payment to licensed crypto exchange from verified personal account | flag |
| MED-005 | Expired KYC on high-frequency corporate payroll account; KYC lapsed 12 days ago | hold |
| MED-006 | Real estate advance payment; large first-time transfer to new receiver but signed contract exists | escalate |
| MED-007 | Supplier emails to say bank details have changed; first payment to new account matches large invoice (BEC indicator) | hold |
| MED-008 | Buy Now Pay Later high-value purchase; new account, thin credit file, elevated risk signals | flag |
Hard (10 tasks — adversarial / edge-case)
| ID | Description | Correct Action |
|---|---|---|
| HARD-001 | Fraud model poisoning: risk_score=0.18 but manual signals scream escalate | escalate |
| HARD-002 | APP (Authorised Push Payment) scam: victim sending willingly to mule account | reject |
| HARD-003 | Structuring / smurfing: just-below-CTR-threshold payments, same UBO | reject |
| HARD-004 | Legitimate FX correspondent banking settlement — looks alarming, is not | approve |
| HARD-005 | Insider threat: employee initiating transfers to personal family accounts | escalate |
| HARD-006 | Ghost account: dormant 5 years, suddenly received 20 inbound transfers this week | flag |
| HARD-007 | SIM-swap attack: phone ported 6 hours ago; account now requesting large crypto withdrawal to new address | reject |
| HARD-008 | Romance scam / pig butchering: 4th escalating transfer to overseas 'romantic partner' met online | reject |
| HARD-009 | Synthetic identity fraud: new business account with AI-generated-looking perfect profile | escalate |
| HARD-010 | Payroll diversion: HR system breach rerouted employee salary to newly added account | reject |
Critical (6 tasks — regulatory + multi-step investigation chains)
| ID | Description | Correct Action |
|---|---|---|
| CRIT-001 | Multi-step chain: large PE wire to new counterparty; inspect then request docs before deciding (chain of 3) | approve |
| CRIT-002 | Fraud ring: coordinated small payments from 3 related accounts aggregating above reporting threshold; SAR required | reject |
| CRIT-003 | Trade-based money laundering: over-invoiced international trade payment (4× market price) | escalate |
| CRIT-004 | Compromised corporate account: geo-impossible login (NY → Lagos in 8 min); confirmed account takeover | reject |
| CRIT-005 | OFAC sanctions evasion: large USD payment routed through UAE shell chain; UBO is on SDN list (chain of 3) | reject |
| CRIT-006 | Correspondent banking: partner bank added to FinCEN 311 Special Measures list; in-flight payments must be escalated | escalate |
Reward Design
| Outcome | Reward |
|---|---|
| Correct action | +1.0 |
| Partial-credit adjacent action (per-task) | +0.2 – +0.6 |
| `inspect` (information seeking, first time) | +0.15 |
| `approve` when correct is `reject` / `escalate` | −1.0 |
| `approve` when correct is `flag` / `hold` | −0.5 |
| `reject` when correct is `approve` | −0.5 |
| Any other wrong action | −0.25 |
The episode score (0–1) is: max(0, total_reward) / max_possible_reward.
A score ≥ 0.5 is considered a passing episode.
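As a minimal sketch of the scoring rule above (the function name and signature are assumptions, not the grader's actual API):

```python
def episode_score(total_reward: float, max_possible_reward: float) -> float:
    """Normalise an episode's total reward: max(0, total) / max_possible."""
    if max_possible_reward <= 0:
        raise ValueError("max_possible_reward must be positive")
    return max(0.0, total_reward) / max_possible_reward


PASS_THRESHOLD = 0.5

# Figures taken from the LLM baseline run reported in Baseline Results.
score = episode_score(17.0, 28.2)
print(f"{score:.4f}", score >= PASS_THRESHOLD)
```

A negative total reward is floored at zero before normalising, so the score never drops below 0.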
API Endpoints
| Method | Path | Description |
|---|---|---|
| `POST` | `/reset` | Reset environment, return first observation |
| `POST` | `/step` | Execute an action |
| `GET` | `/state` | Current internal environment state |
| `GET` | `/schema` | JSON schemas for action / observation / state |
| `GET` | `/tasks` | Full task list with metadata |
| `GET` | `/grader` | Grade the current episode |
| `POST` | `/baseline` | Run rule-based baseline and return scores |
| `GET` | `/health` | Health check |
| `WS` | `/ws` | WebSocket persistent session |
Interactive API docs: http://localhost:8000/docs
Setup & Running
Local (Python)
# 1. Install dependencies
pip install -r requirements.txt
# 2. Start the server (from the parent directory of payops_env)
PYTHONPATH=$(pwd) uvicorn payops_env.server.app:app --host 0.0.0.0 --port 8000
# 3. Verify
curl http://localhost:8000/health
Run the baseline agent
# Via the API endpoint (no extra script needed)
curl -s -X POST http://localhost:8000/baseline | python3 -m json.tool
Docker
# Build
docker build -t payops-env .
# Run locally on port 8000
docker run -p 8000:7860 -e PORT=7860 payops-env
# Verify
curl http://localhost:8000/health
HuggingFace Space
The Dockerfile exposes port 7860 (HF Spaces default). Push the repo to
a HF Space with Docker runtime — no additional configuration required.
Example Agent Interaction
import httpx
base = "http://localhost:8000"
# Reset
obs = httpx.post(f"{base}/reset").json()
print(obs["transaction_id"], obs["risk_score"], obs["flags"])
# Step
while not obs["done"]:
    # ... agent decides action_type ...
    obs = httpx.post(f"{base}/step", json={
        "action_type": "approve",
        "transaction_id": obs["transaction_id"],
    }).json()
    print(f"reward={obs['reward']:+.2f} done={obs['done']}")
# Grade
score = httpx.get(f"{base}/grader").json()
print(f"Episode score: {score['normalised_score']:.4f}")
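The `# ... agent decides action_type ...` placeholder is where a policy plugs in. As a rough illustration, a hand-written policy over a few observation fields might look like the sketch below. The thresholds and rule order are invented for this example; this is not the rule-based baseline in `scripts_util.py`.

```python
from typing import Any, Dict


def decide_action(obs: Dict[str, Any]) -> str:
    """Toy policy mapping an observation dict to a terminal action.

    All thresholds here are illustrative assumptions.
    """
    if obs.get("country_risk") == "sanctioned" or obs.get("counterparty_risk") == "blacklist":
        return "reject"
    if obs.get("kyc_status") in {"expired", "pending", "failed"}:
        return "hold"
    if obs["risk_score"] >= 0.9:
        return "reject"
    # A score the model itself is unsure about is routed to a human.
    if obs["ml_confidence"] < 0.5:
        return "escalate"
    if obs["flags"] or obs["risk_score"] >= 0.5:
        return "flag"
    return "approve"


clean = {"risk_score": 0.05, "ml_confidence": 0.95, "flags": [],
         "country_risk": "low", "kyc_status": "verified", "counterparty_risk": "clean"}
print(decide_action(clean))
```

Inside the loop, `decide_action(obs)` would replace the hard-coded `"approve"` in the `/step` payload.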
Baseline Results
Rule-based baseline (POST /baseline)
The rule-based baseline uses a deterministic priority-ordered policy in scripts_util.py.
| Metric | Rule-based baseline (v2, 30 tasks) |
|---|---|
| Normalised score | 0.68–0.76 |
| Passed (≥ 0.5) | Yes |
| Strong at | Easy tasks, clear velocity/flag patterns |
| Weak at | Hard adversarial tasks (HARD-001 model-poisoning, HARD-004 FX settlement) |
| Critical coverage | Partial — misses some SAR filing requirements |
Scores vary slightly per run due to per-episode parameter jitter.
Run POST /baseline to reproduce.
LLM baseline (inference.py — llama-3.1-8b-instant via Groq)
Run locally against seed 42 (reproducible) with investigation sub-actions enabled.
| Metric | llama-3.1-8b-instant (Groq) |
|---|---|
| Normalised score | 0.6028 |
| Total reward | 17.000 / 28.200 max |
| Tasks correct | 6 / 20 (30%) |
| Budget spent | 5.50 / 5.00 |
| Budget penalty | 0.05 |
| Episode steps | 57 (incl. investigation sub-actions) |
| Duration | ~290 s |
| Passed (≥ 0.5) | YES ✓ |
| Seed | 42 (fixed — deterministic across re-runs) |
Per-task decisions:
| Task | LLM Action | Correct Action | Weighted Reward |
|---|---|---|---|
| EASY-001 | `approve` | `approve` | +1.000 ✓ |
| EASY-002 | `flag` | `reject` | −0.250 ✗ (flag no longer partial credit) |
| EASY-003 | `approve` | `approve` | +1.000 ✓ |
| EASY-004 | `flag` | `flag` | +1.000 ✓ |
| MED-001 | `flag` | `escalate` | +0.900 (partial + investigation bonus) |
| MED-002 | `flag` | `hold` | +0.540 (partial + investigation bonus) |
| MED-003 | `flag` | `flag` | +1.200 ✓ |
| MED-004 | `flag` | `flag` | +1.200 ✓ |
| MED-005 | `flag` | `hold` | +0.660 (partial + investigation bonus) |
| MED-006 | `flag` | `escalate` | +0.600 (partial + investigation bonus) |
| HARD-001 | `flag` | `escalate` | +1.275 (partial + investigation bonus) |
| HARD-002 | `flag` | `reject` | +0.525 (partial + investigation bonus) |
| HARD-003 | `flag` | `reject` | +0.675 (partial + investigation bonus) |
| HARD-004 | `flag` | `approve` | +0.825 (partial + investigation bonus) |
| HARD-005 | `flag` | `escalate` | +0.825 (partial + investigation bonus) |
| HARD-006 | `flag` | `flag` | +2.025 ✓ (+ investigation bonus) |
| CRIT-001 | `flag` | `approve` | +1.100 (partial + investigation bonus) |
| CRIT-002 | `flag` | `reject` | +0.900 (partial + investigation bonus) |
| CRIT-003 | `flag` | `escalate` | +1.300 (partial + investigation bonus) |
| CRIT-004 | `flag` | `reject` | −0.250 ✗ |
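The per-task rewards can be cross-checked against the summary table. The sketch below sums the Weighted Reward column and subtracts the reported budget penalty. Treating the penalty as a single flat subtraction from the total is an assumption, though it does reproduce the reported 17.000.

```python
# Weighted rewards copied from the per-task table, in row order.
weighted_rewards = [
    1.000, -0.250, 1.000, 1.000,                # EASY-001..004
    0.900, 0.540, 1.200, 1.200, 0.660, 0.600,   # MED-001..006
    1.275, 0.525, 0.675, 0.825, 0.825, 2.025,   # HARD-001..006
    1.100, 0.900, 1.300, -0.250,                # CRIT-001..004
]
budget_penalty = 0.05  # "Budget penalty" row of the summary table

# Assumption: the penalty is subtracted once from the summed rewards.
total_reward = sum(weighted_rewards) - budget_penalty
max_reward = 28.2  # denominator of the "Total reward" row

print(f"total={total_reward:.3f} score={total_reward / max_reward:.4f}")
```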
Observations: The model used investigation sub-actions (inspect, verify_kyc, contact_sender) before terminal decisions, earning investigation bonuses that raised the score from a naive always-flag baseline. Easy cases with clear evidence now penalise lazy flag decisions (e.g. EASY-002). Agents that correctly identify terminal actions on top of proper investigation can exceed 0.90.
To reproduce exactly (seed=42 is the default):
export OPENAI_API_KEY="gsk_..." # your Groq API key
export API_BASE_URL="https://api.groq.com/openai/v1"
export MODEL_NAME="llama-3.1-8b-instant"
export PAYOPS_BASE_URL="https://padmapriyagosakan-payops-env.hf.space"
# INFERENCE_SEED=42 # default; set to "random" for a fresh episode
PYTHONPATH=$(pwd) python payops_env/inference.py
For Groq setup instructions see the Running inference with Groq section below.
Running inference with Groq (recommended — free)
Groq provides a completely free API with no monthly credit cap and no installation required. It uses the same OpenAI-compatible interface that inference.py already targets.
Prerequisites
- Create a free Groq account: go to console.groq.com and sign up (Google / GitHub login available)
- Generate an API key: click API Keys → Create API Key, then copy the key (starts with gsk_)
- Install the Python dependency (already in requirements.txt): pip install openai
Run inference
cd /path/to/payops_env # project root (parent of payops_env/)
export OPENAI_API_KEY="gsk_..." # your Groq API key
export API_BASE_URL="https://api.groq.com/openai/v1"
export MODEL_NAME="llama-3.1-8b-instant"
export PAYOPS_BASE_URL="https://padmapriyagosakan-payops-env.hf.space"
PYTHONPATH=$(pwd) python payops_env/inference.py
Why Groq?
- Free tier: 14,400 requests/day, 500,000 tokens/minute — a 20-task episode uses ~30 calls
- No monthly credit pool that runs out mid-run (unlike the HF free tier)
- No installation or model download (unlike Ollama)
- temperature=0.0 is already set in inference.py, so results are reproducible
- Inference speed: ~750 tok/s → full episode completes in under 30 seconds
Alternative free models on Groq
| Model | Notes |
|---|---|
| `llama-3.1-8b-instant` | Fastest, good reasoning |
| `llama-3.3-70b-versatile` | Best quality on hard tasks; same free tier |
| `mixtral-8x7b-32768` | Large context window |
| `gemma2-9b-it` | Google Gemma 2 |
Alternative: Ollama (fully local, no internet required for LLM calls)
If you prefer to run the model entirely on your machine:
# 1. Install
brew install ollama
# 2. Pull a model (choose based on available RAM)
ollama pull qwen2.5:3b # ~2 GB – 8 GB RAM
ollama pull qwen2.5:7b # ~4.7 GB – 16 GB RAM
# 3. Start the server (keep running in a separate terminal)
ollama serve
# 4. Run inference
export OPENAI_API_KEY=ollama
export API_BASE_URL="http://localhost:11434/v1"
export MODEL_NAME="qwen2.5:3b"
export PAYOPS_BASE_URL="https://padmapriyagosakan-payops-env.hf.space"
PYTHONPATH=$(pwd) python payops_env/inference.py
Project Structure
payops_env/
├── models.py # PayOpsAction, PayOpsObservation, PayOpsState (Pydantic)
├── environment.py # PayOpsEnvironment — reset_async / step_async / state
├── tasks.py # 30 tasks (EASY×6, MED×8, HARD×10, CRIT×6) with ground-truth labels
├── grader.py # Partial-credit reward function + episode grader
├── scripts_util.py # Baseline runner helper (used by /baseline endpoint)
├── server/
│ └── app.py # FastAPI server with all required endpoints
├── inference.py # Competition inference script (OpenAI client, root-level)
├── validate.py # Pre-submission checklist validator
├── openenv.yaml # OpenEnv manifest v2.0.0
├── Dockerfile # Docker / HuggingFace Space container (port 7860)
├── requirements.txt # Python dependencies
└── README.md # This file
Evaluation Criteria Alignment
| Criterion | Implementation |
|---|---|
| Real-world utility | Payment fraud and compliance triage — deployed daily by fintech ops teams worldwide |
| Task & grader quality | 30 tasks across 4 difficulty tiers (easy→critical); partial-credit grader; clear pass/fail |
| Environment design | 41-field observation space; 10-action space (5 terminal + 5 investigation); budget mechanic; episode state tracking |
| Code quality & spec compliance | Pydantic v2 models; async API; all 11 required endpoints; openenv.yaml v2; Dockerfile; validate.py |
| Creativity & novelty | Adversarial model-poisoning task; APP scam; AML structuring with SAR requirement; PEP detection |
Reward Design (v2 — Trajectory-Based)
Rewards are dense across the full trajectory, not just on the final decision:
| Component | Value | Condition |
|---|---|---|
| Correct terminal action | +1.0 | per task (difficulty-weighted in episode score) |
| Investigation sub-action | +0.15 | per eligible sub-action, first use only |
| Flag identification | +0.20 | agent used inspect AND key diagnostic flags present |
| Confidence bonus | +0.10 | confidence ≥ 0.8 AND correct |
| Confidence penalty | −0.10 | confidence ≥ 0.8 AND wrong |
| Regulatory SAR bonus | +0.20 | file_sar before terminal on a regulatory task |
| Duplicate investigation | −0.05 | same sub-action used twice on same task |
| Approve a fraud/sanctioned | −1.00 | worst mistake |
Difficulty weights: easy×1.0, medium×1.2, hard×1.5, critical×2.0
Episode score is strictly clamped to [0.0, 1.0]. Passing threshold: 0.5.
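One reading of how these pieces combine is that a task's unweighted components are summed and then multiplied by the task's difficulty weight. That reading is inferred from the reported baseline numbers (for example, HARD-006 at +2.025 = 1.5 × (1.0 + 0.15 + 0.20)) and is not taken from `grader.py`, so treat the functions below as a sketch with hypothetical names.

```python
DIFFICULTY_WEIGHTS = {"easy": 1.0, "medium": 1.2, "hard": 1.5, "critical": 2.0}


def weighted_task_reward(difficulty: str, components: dict) -> float:
    """Sum a task's reward components and apply its difficulty weight."""
    return DIFFICULTY_WEIGHTS[difficulty] * sum(components.values())


def max_possible_reward(task_counts: dict) -> float:
    """Upper bound if only the +1.0 correct-action term counts per task.

    Assumption: bonuses are excluded from the maximum, which matches the
    28.200 reported for the 20-task LLM baseline run.
    """
    return sum(DIFFICULTY_WEIGHTS[tier] * n for tier, n in task_counts.items())


hard_006 = weighted_task_reward(
    "hard",
    {"correct_action": 1.0, "investigation": 0.15, "flag_identification": 0.20},
)
baseline_max = max_possible_reward({"easy": 4, "medium": 6, "hard": 6, "critical": 4})
print(round(hard_006, 3), round(baseline_max, 3))
```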
Per-Episode Parameter Jitter
Each POST /reset generates a unique episode_seed and applies small random perturbations to prevent agent overfitting:
| Field | Jitter |
|---|---|
| `amount` | × Uniform(0.85, 1.20) |
| `risk_score` | + Gauss(0, 0.03), clamped [0,1] |
| `velocity_1h` | + Randint(−3, +3), min 0 |
| `velocity_24h` | + Randint(−3, +3), min 0 |
The correct_action and all ground-truth labels are never changed — only the observable values the agent uses to make decisions.
The episode_seed is returned by GET /health and GET /state for reproducibility.
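The perturbations in the table can be sketched with Python's `random` module. The function name and the choice to seed a `random.Random` directly from `episode_seed` are illustrative assumptions; the actual implementation in `environment.py` may differ.

```python
import random
from typing import Any, Dict


def jitter_observation(task: Dict[str, Any], episode_seed: int) -> Dict[str, Any]:
    """Return a copy of `task` with the jitter from the table applied."""
    rng = random.Random(episode_seed)  # same seed gives the same perturbations
    out = dict(task)
    out["amount"] = task["amount"] * rng.uniform(0.85, 1.20)
    out["risk_score"] = min(1.0, max(0.0, task["risk_score"] + rng.gauss(0, 0.03)))
    for field in ("velocity_1h", "velocity_24h"):
        if task.get(field) is not None:  # velocity fields are optional
            out[field] = max(0, task[field] + rng.randint(-3, 3))
    return out


task = {"amount": 1000.0, "risk_score": 0.97, "velocity_1h": 1,
        "velocity_24h": 4, "correct_action": "reject"}
a = jitter_observation(task, episode_seed=42)
b = jitter_observation(task, episode_seed=42)
print(a == b, a["correct_action"])
```

Only observable values change; the ground-truth `correct_action` is copied through untouched, and re-using a seed reproduces the same episode parameters.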
Network Graph
Selected tasks include a network_graph field in the observation exposing mule-chain / correspondent-banking relationships (e.g. victim → mule → offshore). This gives agents richer context for complex fraud patterns.
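The exact schema of `network_graph` is not documented here. Purely as an illustration, assuming a hypothetical shape with an `edges` list of `{"from": ..., "to": ...}` pairs, an agent could follow the money trail like this:

```python
from typing import Dict, List


def trace_chain(graph: Dict, start: str) -> List[str]:
    """Follow outgoing edges from `start` until a dead end or a cycle.

    Assumes a hypothetical schema: {"edges": [{"from": str, "to": str}, ...]}
    with at most one outgoing edge per node.
    """
    next_hop = {e["from"]: e["to"] for e in graph.get("edges", [])}
    path, seen = [start], {start}
    while path[-1] in next_hop and next_hop[path[-1]] not in seen:
        node = next_hop[path[-1]]
        path.append(node)
        seen.add(node)
    return path


graph = {"edges": [{"from": "victim", "to": "mule"}, {"from": "mule", "to": "offshore"}]}
print(" -> ".join(trace_chain(graph, "victim")))
```

If the real field uses a different layout (for example an adjacency map), only the `next_hop` construction would need to change.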