---
title: PayOps - Payment Operations Incident Response
emoji: 💳
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
tags:
  - openenv
  - finance
  - fraud-detection
  - compliance
  - reinforcement-learning
pinned: false
fullWidth: false
build_version: 2026-04-12-v6
---

# PayOps - Payment Operations Incident Response

An **OpenEnv-compatible** reinforcement-learning environment where an AI agent acts as a Payment Operations analyst. The agent reviews financial transactions one by one and must decide the correct compliance action for each.

---

## Motivation

Payment operations teams process thousands of transactions every day. A skilled analyst uses dozens of signals (risk scores, velocity, KYC status, flag patterns) to make fast, accurate decisions. This environment lets an AI agent learn and be evaluated on exactly this task, spanning clear-cut cases all the way to subtle adversarial patterns like model-score poisoning and Authorised Push Payment (APP) scams.

---

## Environment Description

Each **episode** steps through all **30 transactions** (6 easy, 8 medium, 10 hard, 6 critical). For each transaction the agent observes a rich set of signals and chooses one of **10 possible actions**: 5 terminal decisions and 5 investigation sub-actions. A reward is returned immediately, and the next transaction is presented until the episode is complete.

---

## Action Space

Terminal decisions (no budget cost) commit to a final outcome for the transaction. Investigation sub-actions (with budget cost) reveal more information and let the agent act again on the same transaction.

| Action | Type | Description | Budget Cost |
|--------|------|-------------|-------------|
| `approve` | terminal | Mark transaction as legitimate; allow it through | none |
| `reject` | terminal | Block the transaction outright | none |
| `flag` | terminal | Soft hold; mark for manual review | none |
| `escalate` | terminal | Route to senior compliance officer / fraud team | none |
| `hold` | terminal | Temporary hold pending more information | none |
| `inspect` | investigation | Pull additional signals (logs, KYC, velocity); yields `inspection_notes` | 0.10 |
| `request_docs` | investigation | Ask sender for supporting documents (invoice, contract); yields `docs_notes` | 0.20 |
| `verify_kyc` | investigation | Trigger an active KYC re-verification check; yields `kyc_notes` | 0.20 |
| `contact_sender` | investigation | Contact the sender directly to confirm intent; yields `contact_notes` | 0.30 |
| `file_sar` | investigation | File a Suspicious Activity Report to the regulator (required on AML/structuring tasks) | 0.10 |

---

## Observation Space

| Field | Type | Description |
|-------|------|-------------|
| `transaction_id` | `str` | Unique transaction identifier |
| `amount` | `float` | Transaction amount in the stated currency |
| `currency` | `str` | ISO-4217 currency code |
| `sender` | `str` | Sender identifier (email / account / alias) |
| `receiver` | `str` | Receiver identifier |
| `transaction_type` | `str` | transfer \| payment \| withdrawal \| refund \| internal \| loan_repayment \| payroll |
| `status` | `str` | pending \| approved \| rejected \| flagged \| escalated \| held \| inspected \| docs_requested \| kyc_triggered \| sender_contacted \| sar_filed |
| `risk_score` | `float [0,1]` | Composite ML risk score |
| `ml_confidence` | `float [0,1]` | Model's self-reported confidence in `risk_score`; a low value signals possible model poisoning |
| `flags` | `List[str]` | Active risk flags (e.g. `high_value`, `unknown_sender`, `velocity_breach`) |
| `velocity_1h` | `int?` | Transactions from sender in the past hour |
| `velocity_24h` | `int?` | Transactions from sender in the past 24 hours |
| `avg_transaction_amount` | `float?` | Sender's historical average transaction amount |
| `account_age_days` | `int?` | Age of the sender account in days |
| `country_risk` | `str?` | low \| medium \| high \| sanctioned |
| `kyc_status` | `str?` | verified \| pending \| failed \| none \| expired |
| `kyc_expiry_days` | `int?` | Days until KYC expires (negative = already expired) |
| `previous_violations` | `int?` | Prior compliance violations for this sender |
| `previous_sars` | `int?` | Suspicious Activity Reports previously filed for this sender |
| `counterparty_risk` | `str?` | clean \| unknown \| watchlist \| blacklist |
| `chain_step` | `int` | Current step in a multi-hop investigation chain (1 = initial presentation) |
| `chain_total` | `int` | Total investigation steps for this task (1 = single-step) |
| `chain_context` | `str?` | Accumulated summary of findings from earlier chain steps |
| `steps_remaining` | `int?` | Investigation sub-steps remaining before a terminal decision is required |
| `action_cost` | `float` | Budget cost incurred by the last action |
| `budget_remaining` | `float` | Remaining investigation budget (starts at 5.0; decreases with each investigation action) |
| `inspection_notes` | `str?` | Additional details revealed after an `inspect` action |
| `docs_notes` | `str?` | Document review findings after a `request_docs` action |
| `kyc_notes` | `str?` | KYC re-verification outcome after a `verify_kyc` action |
| `contact_notes` | `str?` | Outcome after a `contact_sender` action |
| `investigation_hints` | `List[str]` | Sub-actions recommended for this task (e.g. `inspect`, `verify_kyc`). Using them before the terminal decision earns bonus reward. Empty = no specific investigation required. |
| `recent_decisions` | `List[dict]` | Last ≤3 completed decisions in this episode (for pattern context) |
| `network_graph` | `dict?` | Mule-chain / correspondent-bank relationship graph where present |
| `task_id` | `str` | Identifier of the active task |
| `task_difficulty` | `str` | easy \| medium \| hard \| critical |
| `step_in_episode` | `int` | Steps elapsed in this episode |
| `reward` | `float` | Reward from the last action |
| `reward_breakdown` | `dict` | Itemised reward components: base, confidence_bonus, cost_penalty, etc. |
| `cumulative_reward` | `float` | Total reward accumulated so far in this episode |
| `done` | `bool` | Whether the episode has ended |
| `info` | `dict` | Diagnostic info (event, correct action, etc.) |

---

## Task Descriptions

### Easy (6 tasks: clear signals)

| ID | Description | Correct Action |
|----|-------------|----------------|
| EASY-001 | Low-value domestic transfer between verified users; no flags | `approve` |
| EASY-002 | Textbook fraud: unknown sender, offshore, sanctioned country, risk=0.97 | `reject` |
| EASY-003 | Standard refund to verified customer; tiny amount, no flags | `approve` |
| EASY-004 | ATM withdrawal burst: 15 withdrawals in 58 minutes | `flag` |
| EASY-005 | Scheduled monthly mortgage repayment; regular amount, verified borrower | `approve` |
| EASY-006 | Suspected duplicate payment: same sender/receiver/amount submitted twice in 4 minutes | `flag` |

### Medium (8 tasks: ambiguous, multi-signal reasoning required)

| ID | Description | Correct Action |
|----|-------------|----------------|
| MED-001 | Large B2B wire, verified CFO, cross-border to medium-risk jurisdiction | `escalate` |
| MED-002 | Internal treasury transfer; large amount, KYC pending renewal | `hold` |
| MED-003 | Recurring subscription 3× higher than historical average | `flag` |
| MED-004 | Payment to licensed crypto exchange from verified personal account | `flag` |
| MED-005 | Expired KYC on high-frequency corporate payroll account; KYC lapsed 12 days ago | `hold` |
| MED-006 | Real estate advance payment; large first-time transfer to new receiver but signed contract exists | `escalate` |
| MED-007 | Supplier emails to say bank details have changed; first payment to new account matches large invoice (BEC indicator) | `hold` |
| MED-008 | Buy Now Pay Later high-value purchase; new account, thin credit file, elevated risk signals | `flag` |

### Hard (10 tasks: adversarial / edge-case)

| ID | Description | Correct Action |
|----|-------------|----------------|
| HARD-001 | Fraud model poisoning: risk_score=0.18 but manual signals clearly call for escalation | `escalate` |
| HARD-002 | APP (Authorised Push Payment) scam: victim sending willingly to mule account | `reject` |
| HARD-003 | Structuring / smurfing: just-below-CTR-threshold payments, same UBO | `reject` |
| HARD-004 | Legitimate FX correspondent banking settlement; looks alarming, is not | `approve` |
| HARD-005 | Insider threat: employee initiating transfers to personal family accounts | `escalate` |
| HARD-006 | Ghost account: dormant 5 years, suddenly received 20 inbound transfers this week | `flag` |
| HARD-007 | SIM-swap attack: phone ported 6 hours ago; account now requesting large crypto withdrawal to new address | `reject` |
| HARD-008 | Romance scam / pig butchering: 4th escalating transfer to overseas 'romantic partner' met online | `reject` |
| HARD-009 | Synthetic identity fraud: new business account with AI-generated-looking perfect profile | `escalate` |
| HARD-010 | Payroll diversion: HR system breach rerouted employee salary to newly added account | `reject` |

### Critical (6 tasks: regulatory + multi-step investigation chains)

| ID | Description | Correct Action |
|----|-------------|----------------|
| CRIT-001 | Multi-step chain: large PE wire to new counterparty; inspect then request docs before deciding (chain of 3) | `approve` |
| CRIT-002 | Fraud ring: coordinated small payments from 3 related accounts aggregating above reporting threshold; SAR required | `reject` |
| CRIT-003 | Trade-based money laundering: over-invoiced international trade payment (4× market price) | `escalate` |
| CRIT-004 | Compromised corporate account: geo-impossible login (NY → Lagos in 8 min); confirmed account takeover | `reject` |
| CRIT-005 | OFAC sanctions evasion: large USD payment routed through UAE shell chain; UBO is on SDN list (chain of 3) | `reject` |
| CRIT-006 | Correspondent banking: partner bank added to FinCEN 311 Special Measures list; in-flight payments must be escalated | `escalate` |

---

## Reward Design

| Outcome | Reward |
|---------|--------|
| Correct action | **+1.0** |
| Partial-credit adjacent action (per-task) | **+0.2 – +0.6** |
| `inspect` (information seeking, first time) | **+0.15** |
| `approve` when correct is `reject` / `escalate` | **−1.0** |
| `approve` when correct is `flag` / `hold` | **−0.5** |
| `reject` when correct is `approve` | **−0.5** |
| Any other wrong action | **−0.25** |

The **episode score** (0–1) is: `max(0, total_reward) / max_possible_reward`. A score ≥ 0.5 is considered a passing episode.

---

## API Endpoints

| Method | Path | Description |
|--------|------|-------------|
| `POST` | `/reset` | Reset environment, return first observation |
| `POST` | `/step` | Execute an action |
| `GET` | `/state` | Current internal environment state |
| `GET` | `/schema` | JSON schemas for action / observation / state |
| `GET` | `/tasks` | Full task list with metadata |
| `GET` | `/grader` | Grade the current episode |
| `POST` | `/baseline` | Run rule-based baseline and return scores |
| `GET` | `/health` | Health check |
| `WS` | `/ws` | WebSocket persistent session |

Interactive API docs: `http://localhost:8000/docs`

---

## Setup & Running

### Local (Python)

```bash
# 1. Install dependencies
pip install -r requirements.txt

# 2. Start the server (from the parent directory of payops_env)
PYTHONPATH=$(pwd) uvicorn payops_env.server.app:app --host 0.0.0.0 --port 8000

# 3. Verify
curl http://localhost:8000/health
```

### Run the baseline agent

```bash
# Via the API endpoint (no extra script needed)
curl -s -X POST http://localhost:8000/baseline | python3 -m json.tool
```

### Docker

```bash
# Build
docker build -t payops-env .

# Run locally on port 8000
docker run -p 8000:7860 -e PORT=7860 payops-env

# Verify
curl http://localhost:8000/health
```

### HuggingFace Space

The `Dockerfile` exposes port **7860** (HF Spaces default). Push the repo to a HF Space with Docker runtime; no additional configuration is required.

---

## Example Agent Interaction

```python
import httpx

base = "http://localhost:8000"

# Reset
obs = httpx.post(f"{base}/reset").json()
print(obs["transaction_id"], obs["risk_score"], obs["flags"])

# Step
while not obs["done"]:
    # ... agent decides action_type ...
    action_type = "approve"  # placeholder: replace with the agent's decision
    obs = httpx.post(f"{base}/step", json={
        "action_type": action_type,
        "transaction_id": obs["transaction_id"],
    }).json()
    print(f"reward={obs['reward']:+.2f}  done={obs['done']}")

# Grade
score = httpx.get(f"{base}/grader").json()
print(f"Episode score: {score['normalised_score']:.4f}")
```

---

## Baseline Results

### Rule-based baseline (`POST /baseline`)

The rule-based baseline uses a deterministic priority-ordered policy in `scripts_util.py`.

| Metric | Rule-based baseline (v2, 30 tasks) |
|--------|------------------------------------|
| Normalised score | 0.68–0.76 |
| Passed (≥ 0.5) | Yes |
| Strong at | Easy tasks, clear velocity/flag patterns |
| Weak at | Hard adversarial tasks (HARD-001 model-poisoning, HARD-004 FX settlement) |
| Critical coverage | Partial; misses some SAR filing requirements |

Scores vary slightly per run due to per-episode parameter jitter. Run `POST /baseline` to reproduce.

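To make "priority-ordered policy" concrete, here is a minimal sketch of how such a rule chain could be written against the observation fields documented above. It is not the code in `scripts_util.py`: the function name `rule_policy`, the rule order, and every threshold are assumptions chosen for illustration.

```python
from typing import Any, Dict


def rule_policy(obs: Dict[str, Any]) -> str:
    """Pick a terminal action by checking rules in priority order.

    Hypothetical sketch: rule order and thresholds are illustrative
    assumptions, not the values used by the shipped baseline.
    """
    flags = set(obs.get("flags") or [])
    risk = obs.get("risk_score") or 0.0
    confidence = obs.get("ml_confidence")

    # 1. Hard blocks: sanctioned geography or blacklisted counterparty.
    if obs.get("country_risk") == "sanctioned" or obs.get("counterparty_risk") == "blacklist":
        return "reject"
    # 2. Very high model risk.
    if risk >= 0.9:
        return "reject"
    # 3. Low model confidence: do not trust risk_score, hand to a human.
    if confidence is not None and confidence < 0.5:
        return "escalate"
    # 4. KYC is not current: pause until it is resolved.
    if obs.get("kyc_status") in {"expired", "pending", "failed"}:
        return "hold"
    # 5. Velocity or moderate-risk signals: send to manual review.
    if "velocity_breach" in flags or (obs.get("velocity_1h") or 0) >= 10 or risk >= 0.5:
        return "flag"
    # 6. Nothing fired: allow the transaction through.
    return "approve"
```

Because the first matching rule wins, the ordering encodes which signals dominate. A policy of this shape handles clear velocity and flag patterns but has no way to recognise a case such as HARD-004, where alarming signals are legitimate.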
### LLM baseline (`inference.py`, `llama-3.1-8b-instant` via Groq)

Run locally against seed 42 (reproducible) with investigation sub-actions enabled.

| Metric | llama-3.1-8b-instant (Groq) |
|--------|-----------------------------|
| Normalised score | **0.6028** |
| Total reward | 17.000 / 28.200 max |
| Tasks correct | 6 / 20 (30%) |
| Budget spent | 5.50 / 5.00 |
| Budget penalty | 0.05 |
| Episode steps | 57 (incl. investigation sub-actions) |
| Duration | ~290 s |
| Passed (≥ 0.5) | **YES ✓** |
| Seed | 42 (fixed; deterministic across re-runs) |

**Per-task decisions:**

| Task | LLM Action | Correct Action | Weighted Reward |
|------|-----------|----------------|----------------|
| EASY-001 | `approve` | `approve` | +1.000 ✓ |
| EASY-002 | `flag` | `reject` | −0.250 ✗ (flag no longer partial credit) |
| EASY-003 | `approve` | `approve` | +1.000 ✓ |
| EASY-004 | `flag` | `flag` | +1.000 ✓ |
| MED-001 | `flag` | `escalate` | +0.900 (partial + investigation bonus) |
| MED-002 | `flag` | `hold` | +0.540 (partial + investigation bonus) |
| MED-003 | `flag` | `flag` | +1.200 ✓ |
| MED-004 | `flag` | `flag` | +1.200 ✓ |
| MED-005 | `flag` | `hold` | +0.660 (partial + investigation bonus) |
| MED-006 | `flag` | `escalate` | +0.600 (partial + investigation bonus) |
| HARD-001 | `flag` | `escalate` | +1.275 (partial + investigation bonus) |
| HARD-002 | `flag` | `reject` | +0.525 (partial + investigation bonus) |
| HARD-003 | `flag` | `reject` | +0.675 (partial + investigation bonus) |
| HARD-004 | `flag` | `approve` | +0.825 (partial + investigation bonus) |
| HARD-005 | `flag` | `escalate` | +0.825 (partial + investigation bonus) |
| HARD-006 | `flag` | `flag` | +2.025 ✓ (+ investigation bonus) |
| CRIT-001 | `flag` | `approve` | +1.100 (partial + investigation bonus) |
| CRIT-002 | `flag` | `reject` | +0.900 (partial + investigation bonus) |
| CRIT-003 | `flag` | `escalate` | +1.300 (partial + investigation bonus) |
| CRIT-004 | `flag` | `reject` | −0.250 ✗ |

**Observations:** The model used investigation sub-actions (`inspect`, `verify_kyc`, `contact_sender`) before terminal decisions, earning investigation bonuses that raised the score above a naive always-flag baseline. Easy cases with clear evidence now penalise lazy `flag` decisions (e.g. EASY-002). Agents that correctly identify terminal actions on top of proper investigation can exceed 0.90.

To reproduce exactly (seed=42 is the default):

```bash
export OPENAI_API_KEY="gsk_..."   # your Groq API key
export API_BASE_URL="https://api.groq.com/openai/v1"
export MODEL_NAME="llama-3.1-8b-instant"
export PAYOPS_BASE_URL="https://padmapriyagosakan-payops-env.hf.space"
# INFERENCE_SEED=42  # default; set to "random" for a fresh episode
PYTHONPATH=$(pwd) python payops_env/inference.py
```

For Groq setup instructions see the **Running inference with Groq** section below.

---

## Running inference with Groq (recommended, free)

[Groq](https://console.groq.com) provides a completely free API with no monthly credit cap and no installation required. It uses the same OpenAI-compatible interface that `inference.py` already targets.

### Prerequisites

1. **Create a free Groq account**: go to [console.groq.com](https://console.groq.com) and sign up (Google / GitHub login available)
2. **Generate an API key**: click **API Keys → Create API Key**, copy the key (starts with `gsk_`)
3. **Install the Python dependency** (already in `requirements.txt`):
   ```bash
   pip install openai
   ```

### Run inference

```bash
cd /path/to/payops_env   # project root (parent of payops_env/)

export OPENAI_API_KEY="gsk_..."   # your Groq API key
export API_BASE_URL="https://api.groq.com/openai/v1"
export MODEL_NAME="llama-3.1-8b-instant"
export PAYOPS_BASE_URL="https://padmapriyagosakan-payops-env.hf.space"

PYTHONPATH=$(pwd) python payops_env/inference.py
```

> **Why Groq?**
> - Free tier: 14,400 requests/day, 500,000 tokens/minute (a 20-task episode uses ~30 calls)
> - No monthly credit pool that runs out mid-run (unlike the HF free tier)
> - No installation or model download (unlike Ollama)
> - `temperature=0.0` is already set in `inference.py` so results are reproducible
> - Inference speed: ~750 tok/s → full episode completes in under 30 seconds

### Alternative free models on Groq

| Model | Notes |
|-------|-------|
| `llama-3.1-8b-instant` | Fastest, good reasoning |
| `llama-3.3-70b-versatile` | Best quality on hard tasks; same free tier |
| `mixtral-8x7b-32768` | Large context window |
| `gemma2-9b-it` | Google Gemma 2 |

### Alternative: Ollama (fully local, no internet required for LLM calls)

If you prefer to run the model entirely on your machine:

```bash
# 1. Install
brew install ollama

# 2. Pull a model (choose based on available RAM)
ollama pull qwen2.5:3b    # ~2 GB – 8 GB RAM
ollama pull qwen2.5:7b    # ~4.7 GB – 16 GB RAM

# 3. Start the server (keep running in a separate terminal)
ollama serve

# 4. Run inference
export OPENAI_API_KEY=ollama
export API_BASE_URL="http://localhost:11434/v1"
export MODEL_NAME="qwen2.5:3b"
export PAYOPS_BASE_URL="https://padmapriyagosakan-payops-env.hf.space"
PYTHONPATH=$(pwd) python payops_env/inference.py
```

---

## Project Structure

```
payops_env/
├── models.py          # PayOpsAction, PayOpsObservation, PayOpsState (Pydantic)
├── environment.py     # PayOpsEnvironment: reset_async / step_async / state
├── tasks.py           # 30 tasks (EASY×6, MED×8, HARD×10, CRIT×6) with ground-truth labels
├── grader.py          # Partial-credit reward function + episode grader
├── scripts_util.py    # Baseline runner helper (used by /baseline endpoint)
├── server/
│   └── app.py         # FastAPI server with all required endpoints
├── inference.py       # Competition inference script (OpenAI client, root-level)
├── validate.py        # Pre-submission checklist validator
├── openenv.yaml       # OpenEnv manifest v2.0.0
├── Dockerfile         # Docker / HuggingFace Space container (port 7860)
├── requirements.txt   # Python dependencies
└── README.md          # This file
```

---

## Evaluation Criteria Alignment

| Criterion | Implementation |
|-----------|---------------|
| Real-world utility | Payment fraud and compliance triage, a task fintech ops teams perform daily |
| Task & grader quality | 30 tasks across 4 difficulty tiers (easy→critical); partial-credit grader; clear pass/fail |
| Environment design | 41-field observation space; 10-action space (5 terminal + 5 investigation); budget mechanic; episode state tracking |
| Code quality & spec compliance | Pydantic v2 models; async API; all 11 required endpoints; openenv.yaml v2; Dockerfile; validate.py |
| Creativity & novelty | Adversarial model-poisoning task; APP scam; AML structuring with SAR requirement; PEP detection |

---

## Reward Design (v2, Trajectory-Based)

Rewards are dense across the full trajectory, not just on the final decision:

| Component | Value | Condition |
|-----------|-------|-----------|
| Correct terminal action | **+1.0** | per task (difficulty-weighted in episode score) |
| Investigation sub-action | **+0.15** | per eligible sub-action, first use only |
| Flag identification | **+0.20** | agent used `inspect` AND key diagnostic flags present |
| Confidence bonus | +0.10 | confidence ≥ 0.8 AND correct |
| Confidence penalty | −0.10 | confidence ≥ 0.8 AND wrong |
| Regulatory SAR bonus | +0.20 | `file_sar` before terminal on a regulatory task |
| Duplicate investigation | −0.05 | same sub-action used twice on same task |
| Approve a fraud/sanctioned | **−1.00** | worst mistake |

Difficulty weights: easy×1.0, medium×1.2, hard×1.5, critical×2.0

Episode score is **strictly clamped to `[0.0, 1.0]`**. Passing threshold: **0.5**.

### Per-Episode Parameter Jitter

Each `POST /reset` generates a unique `episode_seed` and applies small random perturbations to prevent agent overfitting:

| Field | Jitter |
|-------|--------|
| `amount` | × Uniform(0.85, 1.20) |
| `risk_score` | + Gauss(0, 0.03), clamped [0,1] |
| `velocity_1h` | + Randint(−3, +3), min 0 |
| `velocity_24h` | + Randint(−3, +3), min 0 |

The `correct_action` and all ground-truth labels are **never changed**; jitter affects only the observable values the agent uses to make decisions. The `episode_seed` is returned by `GET /health` and `GET /state` for reproducibility.

### Network Graph

Selected tasks include a `network_graph` field in the observation exposing mule-chain / correspondent-banking relationships (e.g. victim → mule → offshore). This gives agents richer context for complex fraud patterns.
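
### Jitter sketch

As an illustration of the per-episode parameter jitter described earlier, the following sketch applies the four perturbations from the jitter table using Python's standard `random` module. The function name `jitter_observation` and the task-dict layout are hypothetical, and it assumes that 0.03 is the standard deviation of the Gaussian noise. The environment's actual generator may differ.

```python
import random
from typing import Any, Dict


def jitter_observation(task: Dict[str, Any], episode_seed: int) -> Dict[str, Any]:
    """Return a jittered copy of a task's observable values.

    Hypothetical sketch of the documented jitter table; ground-truth
    labels such as correct_action are copied through unchanged.
    """
    rng = random.Random(episode_seed)  # seeded, so the same seed gives the same episode
    out = dict(task)

    # amount: multiplied by Uniform(0.85, 1.20)
    out["amount"] = task["amount"] * rng.uniform(0.85, 1.20)

    # risk_score: additive Gaussian noise (assumed sigma = 0.03), clamped to [0, 1]
    noisy_risk = task["risk_score"] + rng.gauss(0.0, 0.03)
    out["risk_score"] = min(1.0, max(0.0, noisy_risk))

    # velocities: additive integer in [-3, +3], floored at 0; optional fields stay None
    for key in ("velocity_1h", "velocity_24h"):
        if task.get(key) is not None:
            out[key] = max(0, task[key] + rng.randint(-3, 3))

    return out
```

Seeding a dedicated `random.Random` instance rather than the module-level generator keeps each episode reproducible from its `episode_seed` without affecting other randomness in the process.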