---
title: PayOps Payment Operations Incident Response
emoji: 💳
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
tags:
- openenv
- finance
- fraud-detection
- compliance
- reinforcement-learning
pinned: false
fullWidth: false
build_version: 2026-04-12-v6
---
# PayOps — Payment Operations Incident Response
An **OpenEnv-compatible** reinforcement-learning environment where an AI agent
acts as a Payment Operations analyst. The agent reviews financial transactions
one by one and must decide the correct compliance action for each.
---
## Motivation
Payment operations teams process thousands of transactions every day. A
skilled analyst uses dozens of signals — risk scores, velocity, KYC status,
flag patterns — to make fast, accurate decisions. This environment lets an AI
agent learn and be evaluated on exactly this task, spanning clear-cut cases all
the way to subtle adversarial patterns like model-score poisoning and
Authorised Push Payment (APP) scams.
---
## Environment Description
Each **episode** steps through all **30 transactions** (6 easy, 8 medium, 10 hard, 6 critical).
For each transaction the agent observes a rich set of signals and chooses one
of **10 possible actions** — 5 terminal decisions and 5 investigation sub-actions.
A reward is returned immediately, and the next transaction is presented until
the episode is complete.
---
## Action Space
Terminal decisions (no budget cost) commit to a final outcome for the transaction.
Investigation sub-actions (with budget cost) reveal more information and let the agent act again on the same transaction.
| Action | Type | Description | Budget Cost |
|-----------------|---------------|-------------|-------------|
| `approve` | terminal | Mark transaction as legitimate; allow it through | — |
| `reject` | terminal | Block the transaction outright | — |
| `flag` | terminal | Soft hold; mark for manual review | — |
| `escalate` | terminal | Route to senior compliance officer / fraud team | — |
| `hold` | terminal | Temporary hold pending more information | — |
| `inspect` | investigation | Pull additional signals (logs, KYC, velocity) — yields `inspection_notes` | 0.10 |
| `request_docs` | investigation | Ask sender for supporting documents (invoice, contract) — yields `docs_notes` | 0.20 |
| `verify_kyc` | investigation | Trigger an active KYC re-verification check — yields `kyc_notes` | 0.20 |
| `contact_sender` | investigation | Contact the sender directly to confirm intent — yields `contact_notes` | 0.30 |
| `file_sar` | investigation | File a Suspicious Activity Report to the regulator (required on AML/structuring tasks) | 0.10 |
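The split between free terminal decisions and paid investigation sub-actions can be sketched as a small lookup. The names `TERMINAL_ACTIONS`, `INVESTIGATION_COSTS`, `is_terminal` and `budget_after` are hypothetical and not taken from the repository; the costs come from the table above and the 5.0 starting budget from the `budget_remaining` field description. The sketch does not clamp at zero, since the baseline results report a budget overspend.

```python
# Illustrative sketch only: these names are hypothetical, not the environment's API.
TERMINAL_ACTIONS = {"approve", "reject", "flag", "escalate", "hold"}

# Budget cost per investigation sub-action, copied from the Action Space table.
INVESTIGATION_COSTS = {
    "inspect": 0.10,
    "request_docs": 0.20,
    "verify_kyc": 0.20,
    "contact_sender": 0.30,
    "file_sar": 0.10,
}


def is_terminal(action_type: str) -> bool:
    """True when the action commits a final outcome for the transaction."""
    return action_type in TERMINAL_ACTIONS


def budget_after(actions: list[str], starting_budget: float = 5.0) -> float:
    """Budget left after a sequence of actions; terminal actions cost nothing."""
    spent = sum(INVESTIGATION_COSTS.get(a, 0.0) for a in actions)
    return round(starting_budget - spent, 2)


remaining = budget_after(["inspect", "verify_kyc", "approve"])
print(remaining)
```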
---
## Observation Space
| Field | Type | Description |
|------------------------|-------------------|-------------|
| `transaction_id` | `str` | Unique transaction identifier |
| `amount` | `float` | Transaction amount in the stated currency |
| `currency` | `str` | ISO-4217 currency code |
| `sender` | `str` | Sender identifier (email / account / alias) |
| `receiver` | `str` | Receiver identifier |
| `transaction_type` | `str` | transfer \| payment \| withdrawal \| refund \| internal \| loan_repayment \| payroll |
| `status` | `str` | pending \| approved \| rejected \| flagged \| escalated \| held \| inspected \| docs_requested \| kyc_triggered \| sender_contacted \| sar_filed |
| `risk_score` | `float [0,1]` | Composite ML risk score |
| `ml_confidence` | `float [0,1]` | Model's self-reported confidence in `risk_score` — low value signals possible model poisoning |
| `flags` | `List[str]` | Active risk flags (e.g. `high_value`, `unknown_sender`, `velocity_breach`) |
| `velocity_1h` | `int?` | Transactions from sender in the past hour |
| `velocity_24h` | `int?` | Transactions from sender in the past 24 hours |
| `avg_transaction_amount`| `float?` | Sender's historical average transaction amount |
| `account_age_days` | `int?` | Age of the sender account in days |
| `country_risk` | `str?` | low \| medium \| high \| sanctioned |
| `kyc_status` | `str?` | verified \| pending \| failed \| none \| expired |
| `kyc_expiry_days` | `int?` | Days until KYC expires (negative = already expired) |
| `previous_violations` | `int?` | Prior compliance violations for this sender |
| `previous_sars` | `int?` | Suspicious Activity Reports previously filed for this sender |
| `counterparty_risk` | `str?` | clean \| unknown \| watchlist \| blacklist |
| `chain_step` | `int` | Current step in a multi-hop investigation chain (1 = initial presentation) |
| `chain_total` | `int` | Total investigation steps for this task (1 = single-step) |
| `chain_context` | `str?` | Accumulated summary of findings from earlier chain steps |
| `steps_remaining` | `int?` | Investigation sub-steps remaining before a terminal decision is required |
| `action_cost` | `float` | Budget cost incurred by the last action |
| `budget_remaining` | `float` | Remaining investigation budget (starts at 5.0; decreases with each investigation action) |
| `inspection_notes` | `str?` | Additional details revealed after an `inspect` action |
| `docs_notes` | `str?` | Document review findings after a `request_docs` action |
| `kyc_notes` | `str?` | KYC re-verification outcome after a `verify_kyc` action |
| `contact_notes` | `str?` | Outcome after a `contact_sender` action |
| `investigation_hints` | `List[str]` | Sub-actions recommended for this task (e.g. `inspect`, `verify_kyc`). Using them before the terminal decision earns bonus reward. Empty = no specific investigation required. |
| `recent_decisions` | `List[dict]` | Last ≤3 completed decisions in this episode (for pattern context) |
| `network_graph` | `dict?` | Mule-chain / correspondent-bank relationship graph where present |
| `task_id` | `str` | Identifier of the active task |
| `task_difficulty` | `str` | easy \| medium \| hard \| critical |
| `step_in_episode` | `int` | Steps elapsed in this episode |
| `reward` | `float` | Reward from the last action |
| `reward_breakdown` | `dict` | Itemised reward components: base, confidence_bonus, cost_penalty, etc. |
| `cumulative_reward` | `float` | Total reward accumulated so far in this episode |
| `done` | `bool` | Whether the episode has ended |
| `info` | `dict` | Diagnostic info (event, correct action, etc.) |
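Fields whose type ends in `?` may be absent, so an agent should guard against missing values before comparing them. The helpers below are a minimal sketch under the assumption that such fields arrive as JSON `null` (Python `None`) or are omitted from the response; `kyc_expired` and `amount_ratio` are hypothetical names, not part of the environment.

```python
from typing import Any, Optional


def kyc_expired(obs: dict[str, Any]) -> Optional[bool]:
    """True/False when KYC expiry can be determined, None when it cannot."""
    if obs.get("kyc_status") == "expired":
        return True
    days = obs.get("kyc_expiry_days")
    if days is None:
        return None
    # Negative values mean the KYC record has already expired.
    return days < 0


def amount_ratio(obs: dict[str, Any]) -> Optional[float]:
    """Transaction amount relative to the sender's historical average."""
    avg = obs.get("avg_transaction_amount")
    if not avg:  # covers None and 0.0
        return None
    return obs["amount"] / avg


sample = {"amount": 300.0, "avg_transaction_amount": 100.0,
          "kyc_status": "verified", "kyc_expiry_days": -12}
print(kyc_expired(sample), amount_ratio(sample))
```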
---
## Task Descriptions
### Easy (6 tasks — clear signals)
| ID | Description | Correct Action |
|----------|-------------|----------------|
| EASY-001 | Low-value domestic transfer between verified users; no flags | `approve` |
| EASY-002 | Textbook fraud: unknown sender, offshore, sanctioned country, risk=0.97 | `reject` |
| EASY-003 | Standard refund to verified customer; tiny amount, no flags | `approve` |
| EASY-004 | ATM withdrawal burst — 15 withdrawals in 58 minutes | `flag` |
| EASY-005 | Scheduled monthly mortgage repayment; regular amount, verified borrower | `approve` |
| EASY-006 | Suspected duplicate payment: same sender/receiver/amount submitted twice in 4 minutes | `flag` |
### Medium (8 tasks — ambiguous, multi-signal reasoning required)
| ID | Description | Correct Action |
|---------|-------------|----------------|
| MED-001 | Large B2B wire, verified CFO, cross-border to medium-risk jurisdiction | `escalate` |
| MED-002 | Internal treasury transfer; large amount, KYC pending renewal | `hold` |
| MED-003 | Recurring subscription 3× higher than historical average | `flag` |
| MED-004 | Payment to licensed crypto exchange from verified personal account | `flag` |
| MED-005 | Expired KYC on high-frequency corporate payroll account; KYC lapsed 12 days ago | `hold` |
| MED-006 | Real estate advance payment; large first-time transfer to new receiver but signed contract exists | `escalate` |
| MED-007 | Supplier emails to say bank details have changed; first payment to new account matches large invoice (BEC indicator) | `hold` |
| MED-008 | Buy Now Pay Later high-value purchase; new account, thin credit file, elevated risk signals | `flag` |
### Hard (10 tasks — adversarial / edge-case)
| ID | Description | Correct Action |
|----------|-------------|----------------|
| HARD-001 | Fraud model poisoning: risk_score=0.18 but manual signals scream escalate | `escalate` |
| HARD-002 | APP (Authorised Push Payment) scam: victim sending willingly to mule account | `reject` |
| HARD-003 | Structuring / smurfing: just-below-CTR-threshold payments, same UBO | `reject` |
| HARD-004 | Legitimate FX correspondent banking settlement — looks alarming, is not | `approve` |
| HARD-005 | Insider threat: employee initiating transfers to personal family accounts | `escalate` |
| HARD-006 | Ghost account: dormant 5 years, suddenly received 20 inbound transfers this week | `flag` |
| HARD-007 | SIM-swap attack: phone ported 6 hours ago; account now requesting large crypto withdrawal to new address | `reject` |
| HARD-008 | Romance scam / pig butchering: 4th escalating transfer to overseas 'romantic partner' met online | `reject` |
| HARD-009 | Synthetic identity fraud: new business account with AI-generated-looking perfect profile | `escalate` |
| HARD-010 | Payroll diversion: HR system breach rerouted employee salary to newly added account | `reject` |
### Critical (6 tasks — regulatory + multi-step investigation chains)
| ID | Description | Correct Action |
|----------|-------------|----------------|
| CRIT-001 | Multi-step chain: large PE wire to new counterparty; inspect then request docs before deciding (chain of 3) | `approve` |
| CRIT-002 | Fraud ring: coordinated small payments from 3 related accounts aggregating above reporting threshold; SAR required | `reject` |
| CRIT-003 | Trade-based money laundering: over-invoiced international trade payment (4× market price) | `escalate` |
| CRIT-004 | Compromised corporate account: geo-impossible login (NY → Lagos in 8 min); confirmed account takeover | `reject` |
| CRIT-005 | OFAC sanctions evasion: large USD payment routed through UAE shell chain; UBO is on SDN list (chain of 3) | `reject` |
| CRIT-006 | Correspondent banking: partner bank added to FinCEN 311 Special Measures list; in-flight payments must be escalated | `escalate` |
---
## Reward Design
| Outcome | Reward |
|---------|--------|
| Correct action | **+1.0** |
| Partial-credit adjacent action (per-task) | **+0.2 – +0.6** |
| `inspect` (information seeking, first time) | **+0.15** |
| `approve` when correct is `reject` / `escalate` | **−1.0** |
| `approve` when correct is `flag` / `hold` | **−0.5** |
| `reject` when correct is `approve` | **−0.5** |
| Any other wrong action | **−0.25** |
The **episode score** (0–1) is: `max(0, total_reward) / max_possible_reward`.
A score ≥ 0.5 is considered a passing episode.
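The penalty rows and the score formula above can be written out as two small functions. This is a sketch, not the code in `grader.py`: the function names are hypothetical, and the per-task partial-credit values are left out because the table gives only a range for them.

```python
def wrong_action_penalty(chosen: str, correct: str) -> float:
    """Penalty for a wrong terminal action that earns no partial credit."""
    if chosen == "approve" and correct in {"reject", "escalate"}:
        return -1.0
    if chosen == "approve" and correct in {"flag", "hold"}:
        return -0.5
    if chosen == "reject" and correct == "approve":
        return -0.5
    return -0.25  # any other wrong action


def episode_score(total_reward: float, max_possible_reward: float) -> float:
    """Score in [0, 1]: negative totals are floored at zero before dividing."""
    return max(0.0, total_reward) / max_possible_reward


def passed(score: float, threshold: float = 0.5) -> bool:
    return score >= threshold


print(wrong_action_penalty("approve", "reject"))
print(episode_score(5.0, 10.0), passed(episode_score(5.0, 10.0)))
```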
---
## API Endpoints
| Method | Path | Description |
|--------|------|-------------|
| `POST` | `/reset` | Reset environment, return first observation |
| `POST` | `/step` | Execute an action |
| `GET` | `/state` | Current internal environment state |
| `GET` | `/schema` | JSON schemas for action / observation / state |
| `GET` | `/tasks` | Full task list with metadata |
| `GET` | `/grader` | Grade the current episode |
| `POST` | `/baseline` | Run rule-based baseline and return scores |
| `GET` | `/health` | Health check |
| `WS` | `/ws` | WebSocket persistent session |
Interactive API docs: `http://localhost:8000/docs`
---
## Setup & Running
### Local (Python)
```bash
# 1. Install dependencies
pip install -r requirements.txt
# 2. Start the server (from the parent directory of payops_env)
PYTHONPATH=$(pwd) uvicorn payops_env.server.app:app --host 0.0.0.0 --port 8000
# 3. Verify
curl http://localhost:8000/health
```
### Run the baseline agent
```bash
# Via the API endpoint (no extra script needed)
curl -s -X POST http://localhost:8000/baseline | python3 -m json.tool
```
### Docker
```bash
# Build
docker build -t payops-env .
# Run locally on port 8000
docker run -p 8000:7860 -e PORT=7860 payops-env
# Verify
curl http://localhost:8000/health
```
### HuggingFace Space
The `Dockerfile` exposes port **7860** (HF Spaces default). Push the repo to
a HF Space with Docker runtime — no additional configuration required.
---
## Example Agent Interaction
```python
import httpx
base = "http://localhost:8000"
# Reset
obs = httpx.post(f"{base}/reset").json()
print(obs["transaction_id"], obs["risk_score"], obs["flags"])
# Step
while not obs["done"]:
    # ... agent decides action_type ...
    obs = httpx.post(f"{base}/step", json={
        "action_type": "approve",
        "transaction_id": obs["transaction_id"],
    }).json()
    print(f"reward={obs['reward']:+.2f} done={obs['done']}")
# Grade
score = httpx.get(f"{base}/grader").json()
print(f"Episode score: {score['normalised_score']:.4f}")
```
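The loop above always sends `approve`. A simple policy function could take its place; the one below is a hypothetical illustration built from observation fields, with made-up thresholds. It is not the rule-based baseline shipped in `scripts_util.py`. To use it, pass `choose_action(obs)` as the `action_type` in the `/step` request.

```python
def choose_action(obs: dict) -> str:
    """Toy priority-ordered policy; thresholds are illustrative assumptions."""
    flags = set(obs.get("flags") or [])

    # Hard blockers first: sanctioned geography or blacklisted counterparty.
    if obs.get("country_risk") == "sanctioned" or obs.get("counterparty_risk") == "blacklist":
        return "reject"
    # Unresolved KYC: pause rather than decide.
    if obs.get("kyc_status") in {"expired", "pending", "failed"}:
        return "hold"
    # Elevated model score or a velocity breach: send to manual review.
    if obs.get("risk_score", 0.0) >= 0.5 or "velocity_breach" in flags:
        return "flag"
    return "approve"


clean = {"risk_score": 0.05, "flags": [], "kyc_status": "verified", "country_risk": "low"}
print(choose_action(clean))
```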
---
## Baseline Results
### Rule-based baseline (`POST /baseline`)
The rule-based baseline uses a deterministic priority-ordered policy in `scripts_util.py`.
| Metric | Rule-based baseline (v2, 30 tasks) |
|--------|------------------------------------|
| Normalised score | 0.68–0.76 |
| Passed (≥ 0.5) | Yes |
| Strong at | Easy tasks, clear velocity/flag patterns |
| Weak at | Hard adversarial tasks (HARD-001 model-poisoning, HARD-004 FX settlement) |
| Critical coverage | Partial — misses some SAR filing requirements |
Scores vary slightly per run due to per-episode parameter jitter.
Run `POST /baseline` to reproduce.
### LLM baseline (`inference.py` — `llama-3.1-8b-instant` via Groq)
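Run locally against seed 42 (reproducible) with investigation sub-actions enabled. The figures below cover 20 tasks (4 easy, 6 medium, 6 hard, 4 critical), which is fewer than the 30 tasks listed under Task Descriptions.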
| Metric | llama-3.1-8b-instant (Groq) |
|--------|-----------------------------|
| Normalised score | **0.6028** |
| Total reward | 17.000 / 28.200 max |
| Tasks correct | 6 / 20 (30%) |
| Budget spent | 5.50 / 5.00 |
| Budget penalty | 0.05 |
| Episode steps | 57 (incl. investigation sub-actions) |
| Duration | ~290 s |
| Passed (≥ 0.5) | **YES ✓** |
| Seed | 42 (fixed — deterministic across re-runs) |
**Per-task decisions:**
| Task | LLM Action | Correct Action | Weighted Reward |
|------|-----------|----------------|----------------|
| EASY-001 | `approve` | `approve` | +1.000 ✓ |
| EASY-002 | `flag` | `reject` | −0.250 ✗ (flag no longer partial credit) |
| EASY-003 | `approve` | `approve` | +1.000 ✓ |
| EASY-004 | `flag` | `flag` | +1.000 ✓ |
| MED-001 | `flag` | `escalate` | +0.900 (partial + investigation bonus) |
| MED-002 | `flag` | `hold` | +0.540 (partial + investigation bonus) |
| MED-003 | `flag` | `flag` | +1.200 ✓ |
| MED-004 | `flag` | `flag` | +1.200 ✓ |
| MED-005 | `flag` | `hold` | +0.660 (partial + investigation bonus) |
| MED-006 | `flag` | `escalate` | +0.600 (partial + investigation bonus) |
| HARD-001 | `flag` | `escalate` | +1.275 (partial + investigation bonus) |
| HARD-002 | `flag` | `reject` | +0.525 (partial + investigation bonus) |
| HARD-003 | `flag` | `reject` | +0.675 (partial + investigation bonus) |
| HARD-004 | `flag` | `approve` | +0.825 (partial + investigation bonus) |
| HARD-005 | `flag` | `escalate` | +0.825 (partial + investigation bonus) |
| HARD-006 | `flag` | `flag` | +2.025 ✓ (+ investigation bonus) |
| CRIT-001 | `flag` | `approve` | +1.100 (partial + investigation bonus) |
| CRIT-002 | `flag` | `reject` | +0.900 (partial + investigation bonus) |
| CRIT-003 | `flag` | `escalate` | +1.300 (partial + investigation bonus) |
| CRIT-004 | `flag` | `reject` | −0.250 ✗ |
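The per-task rewards in this table can be cross-checked against the summary metrics. The sketch assumes the budget penalty is subtracted from the summed weighted rewards; that is an inference from the reported numbers, not something stated by the grader.

```python
# Weighted rewards copied from the per-task table, in row order.
per_task = [
    1.000, -0.250, 1.000, 1.000,                    # EASY-001..004
    0.900, 0.540, 1.200, 1.200, 0.660, 0.600,       # MED-001..006
    1.275, 0.525, 0.675, 0.825, 0.825, 2.025,       # HARD-001..006
    1.100, 0.900, 1.300, -0.250,                    # CRIT-001..004
]
budget_penalty = 0.05   # from the metrics table
max_reward = 28.2       # from the metrics table

# Assumption: the penalty is applied once to the episode total.
total = round(sum(per_task) - budget_penalty, 3)
score = round(max(0.0, total) / max_reward, 4)

print(total, score)
```

Under that assumption the total comes to 17.0 and the normalised score to 0.6028, matching the reported values.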
**Observations:** The model used investigation sub-actions (`inspect`, `verify_kyc`, `contact_sender`) before terminal decisions, earning investigation bonuses that raised the score from a naive always-flag baseline. Easy cases with clear evidence now penalise lazy `flag` decisions (e.g. EASY-002). Agents that correctly identify terminal actions on top of proper investigation can exceed 0.90.
To reproduce exactly (seed=42 is the default):
```bash
export OPENAI_API_KEY="gsk_..." # your Groq API key
export API_BASE_URL="https://api.groq.com/openai/v1"
export MODEL_NAME="llama-3.1-8b-instant"
export PAYOPS_BASE_URL="https://padmapriyagosakan-payops-env.hf.space"
# INFERENCE_SEED=42 # default; set to "random" for a fresh episode
PYTHONPATH=$(pwd) python payops_env/inference.py
```
For Groq setup instructions see the **Running inference with Groq** section below.
---
## Running inference with Groq (recommended — free)
[Groq](https://console.groq.com) provides a completely free API with no monthly credit cap and no installation required. It uses the same OpenAI-compatible interface that `inference.py` already targets.
### Prerequisites
1. **Create a free Groq account** — go to [console.groq.com](https://console.groq.com) and sign up (Google / GitHub login available)
2. **Generate an API key** — click **API Keys → Create API Key**, copy the key (starts with `gsk_`)
3. **Install the Python dependency** (already in `requirements.txt`):
```bash
pip install openai
```
### Run inference
```bash
cd /path/to/project-root # parent directory of payops_env/
export OPENAI_API_KEY="gsk_..." # your Groq API key
export API_BASE_URL="https://api.groq.com/openai/v1"
export MODEL_NAME="llama-3.1-8b-instant"
export PAYOPS_BASE_URL="https://padmapriyagosakan-payops-env.hf.space"
PYTHONPATH=$(pwd) python payops_env/inference.py
```
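The four exported variables could be collected and validated before any request is sent. The snippet is a hypothetical sketch using only the standard library; `InferenceConfig` and `load_config` are illustrative names, and this is not the code in `inference.py`.

```python
import os
from dataclasses import dataclass
from typing import Mapping

REQUIRED_VARS = ("OPENAI_API_KEY", "API_BASE_URL", "MODEL_NAME", "PAYOPS_BASE_URL")


@dataclass
class InferenceConfig:
    api_key: str
    api_base_url: str
    model_name: str
    payops_base_url: str


def load_config(env: Mapping[str, str] = os.environ) -> InferenceConfig:
    """Read the settings exported above; fail early with a clear message."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return InferenceConfig(
        api_key=env["OPENAI_API_KEY"],
        api_base_url=env["API_BASE_URL"],
        model_name=env["MODEL_NAME"],
        payops_base_url=env["PAYOPS_BASE_URL"],
    )


example = load_config({
    "OPENAI_API_KEY": "gsk_example",
    "API_BASE_URL": "https://api.groq.com/openai/v1",
    "MODEL_NAME": "llama-3.1-8b-instant",
    "PAYOPS_BASE_URL": "http://localhost:8000",
})
print(example.model_name)
```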
> **Why Groq?**
> - Free tier: 14,400 requests/day, 500,000 tokens/minute — a 20-task episode uses ~30 calls
> - No monthly credit pool that runs out mid-run (unlike the HF free tier)
> - No installation or model download (unlike Ollama)
> - `temperature=0.0` is already set in `inference.py` so results are reproducible
> - Inference speed: ~750 tok/s → a full episode can complete in under 30 seconds, although the seed-42 run reported under Baseline Results took ~290 s end to end
### Alternative free models on Groq
| Model | Notes |
|-------|-------|
| `llama-3.1-8b-instant` | Fastest, good reasoning |
| `llama-3.3-70b-versatile` | Best quality on hard tasks; same free tier |
| `mixtral-8x7b-32768` | Large context window |
| `gemma2-9b-it` | Google Gemma 2 |
### Alternative: Ollama (fully local, no internet required for LLM calls)
If you prefer to run the model entirely on your machine:
```bash
# 1. Install
brew install ollama
# 2. Pull a model (choose based on available RAM)
ollama pull qwen2.5:3b # ~2 GB – 8 GB RAM
ollama pull qwen2.5:7b # ~4.7 GB – 16 GB RAM
# 3. Start the server (keep running in a separate terminal)
ollama serve
# 4. Run inference
export OPENAI_API_KEY=ollama
export API_BASE_URL="http://localhost:11434/v1"
export MODEL_NAME="qwen2.5:3b"
export PAYOPS_BASE_URL="https://padmapriyagosakan-payops-env.hf.space"
PYTHONPATH=$(pwd) python payops_env/inference.py
```
---
## Project Structure
```
payops_env/
├── models.py # PayOpsAction, PayOpsObservation, PayOpsState (Pydantic)
├── environment.py # PayOpsEnvironment — reset_async / step_async / state
├── tasks.py # 30 tasks (EASY×6, MED×8, HARD×10, CRIT×6) with ground-truth labels
├── grader.py # Partial-credit reward function + episode grader
├── scripts_util.py # Baseline runner helper (used by /baseline endpoint)
├── server/
│ └── app.py # FastAPI server with all required endpoints
├── inference.py # Competition inference script (OpenAI client, root-level)
├── validate.py # Pre-submission checklist validator
├── openenv.yaml # OpenEnv manifest v2.0.0
├── Dockerfile # Docker / HuggingFace Space container (port 7860)
├── requirements.txt # Python dependencies
└── README.md # This file
```
---
## Evaluation Criteria Alignment
| Criterion | Implementation |
|-----------|---------------|
| Real-world utility | Payment fraud and compliance triage — deployed daily by fintech ops teams worldwide |
| Task & grader quality | 30 tasks across 4 difficulty tiers (easy→critical); partial-credit grader; clear pass/fail |
| Environment design | 41-field observation space; 10-action space (5 terminal + 5 investigation); budget mechanic; episode state tracking |
| Code quality & spec compliance | Pydantic v2 models; async API; all 11 required endpoints; openenv.yaml v2; Dockerfile; validate.py |
| Creativity & novelty | Adversarial model-poisoning task; APP scam; AML structuring with SAR requirement; PEP detection |
---
## Reward Design (v2 — Trajectory-Based)
Rewards are dense across the full trajectory, not just on the final decision:
| Component | Value | Condition |
|-----------|-------|-----------|
| Correct terminal action | **+1.0** | per task (difficulty-weighted in episode score) |
| Investigation sub-action | **+0.15** | per eligible sub-action, first use only |
| Flag identification | **+0.20** | agent used `inspect` AND key diagnostic flags present |
| Confidence bonus | +0.10 | confidence ≥ 0.8 AND correct |
| Confidence penalty | −0.10 | confidence ≥ 0.8 AND wrong |
| Regulatory SAR bonus | +0.20 | `file_sar` before terminal on a regulatory task |
| Duplicate investigation | −0.05 | same sub-action used twice on same task |
| Approve a fraud/sanctioned | **−1.00** | worst mistake |
Difficulty weights: easy×1.0, medium×1.2, hard×1.5, critical×2.0
Episode score is **strictly clamped to `[0.0, 1.0]`**. Passing threshold: **0.5**.
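The difficulty weights can be applied as a per-task multiplier. The sketch below assumes that the maximum possible reward counts only the +1.0 correct-terminal component for each task; that assumption reproduces the 28.200 maximum reported for the 20-task LLM baseline run, but the grader may compute it differently. Function names are hypothetical.

```python
DIFFICULTY_WEIGHTS = {"easy": 1.0, "medium": 1.2, "hard": 1.5, "critical": 2.0}


def weighted_reward(raw_reward: float, difficulty: str) -> float:
    """Scale a task's raw reward by its difficulty weight."""
    return round(raw_reward * DIFFICULTY_WEIGHTS[difficulty], 3)


def max_weighted_reward(task_counts: dict[str, int]) -> float:
    """Assumed maximum: +1.0 per task, scaled by difficulty weight."""
    return round(sum(DIFFICULTY_WEIGHTS[d] * n for d, n in task_counts.items()), 3)


# A correct decision on a medium task, e.g. MED-003 in the LLM baseline table.
print(weighted_reward(1.0, "medium"))

# Task mix of the 20-task LLM baseline run versus the full 30-task set.
baseline_mix = {"easy": 4, "medium": 6, "hard": 6, "critical": 4}
full_mix = {"easy": 6, "medium": 8, "hard": 10, "critical": 6}
print(max_weighted_reward(baseline_mix), max_weighted_reward(full_mix))
```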
### Per-Episode Parameter Jitter
Each `POST /reset` generates a unique `episode_seed` and applies small random perturbations to prevent agent overfitting:
| Field | Jitter |
|-------|--------|
| `amount` | × Uniform(0.85, 1.20) |
| `risk_score` | + Gauss(0, 0.03), clamped [0,1] |
| `velocity_1h` | + Randint(−3, +3), min 0 |
| `velocity_24h` | + Randint(−3, +3), min 0 |
The `correct_action` and all ground-truth labels are **never changed** — only the observable values the agent uses to make decisions.
The `episode_seed` is returned by `GET /health` and `GET /state` for reproducibility.
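The perturbations in the jitter table can be sketched with a seeded random generator. How the environment actually derives its generator from `episode_seed` is not documented here, so `jitter_observation` is a hypothetical illustration of the listed distributions only.

```python
import random


def jitter_observation(task: dict, episode_seed: int) -> dict:
    """Return a copy of the task with the documented perturbations applied."""
    rng = random.Random(episode_seed)  # same seed gives the same jitter
    out = dict(task)
    out["amount"] = task["amount"] * rng.uniform(0.85, 1.20)
    noisy = task["risk_score"] + rng.gauss(0.0, 0.03)
    out["risk_score"] = min(1.0, max(0.0, noisy))  # clamp to [0, 1]
    for key in ("velocity_1h", "velocity_24h"):
        if task.get(key) is not None:
            out[key] = max(0, task[key] + rng.randint(-3, 3))  # floor at 0
    return out


base_task = {"amount": 1000.0, "risk_score": 0.97, "velocity_1h": 1,
             "velocity_24h": 12, "correct_action": "reject"}
jittered = jitter_observation(base_task, episode_seed=42)
print(jittered["correct_action"])
```

Only the observable fields are altered; the ground-truth `correct_action` is copied through unchanged, in line with the note above.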
### Network Graph
Selected tasks include a `network_graph` field in the observation exposing mule-chain / correspondent-banking relationships (e.g. victim → mule → offshore). This gives agents richer context for complex fraud patterns.