---
title: GraphStrike
emoji: 🕵️
colorFrom: blue
colorTo: indigo
sdk: docker
app_port: 7860
pinned: false
license: mit
tags:
  - reinforcement-learning
  - social-network
  - fraud-detection
  - openenv
  - llm-agent
---



An OpenEnv-compatible reinforcement learning environment where an LLM agent must identify all 10 members of a coordinated fake account network hidden inside a synthetic social network. The agent learns via Reflexion and a dynamic hybrid rule/LLM policy, not via gradient updates or fine-tuning.



### *Deployed Endpoint Verification*

The live environment at [huggingface.co/spaces/Pandago/graphstrike](https://huggingface.co/spaces/Pandago/graphstrike) responds to all standard OpenEnv endpoints:

```bash
# Health check
curl https://pandago-graphstrike.hf.space/health
# → {"status": "healthy"}

# Task discovery
curl https://pandago-graphstrike.hf.space/tasks
# → {"tasks": ["easy","medium","hard"], "action_schema": {...}, "score_range": [0.0, 1.0]}

# Baseline (deterministic, reproducible)
curl -X POST https://pandago-graphstrike.hf.space/baseline
# → {"scores": {"easy": 0.91, "medium": 0.906, "hard": 0.9038}, "agent": "rule_based"}
```

---
We evaluate GraphStrike's hybrid rule/LLM policy across multiple frontier models to measure how well each model handles the investigation task. All runs use the same inference pipeline (`inference.py`) with identical system prompts and structured logging. Each model ran (1) seed=0 on all 3 tasks and (2) seeds 0–2 on all 3 tasks for variance measurement.

**Seed=0 scores (single episode per task):**

Model Performance Table


**3-seed variance scores (mean across seeds 0, 1, 2):**

Model Performance Table


**Rule-Based Baseline (no LLM, deterministic)**

Model Performance Table


---

**The task:** A social network contains fake accounts organised into a single coordinated network of 10. The network behaves in a coordinated way: same posting hour, same IP subnet, stolen celebrity photos, copy-paste bios. The agent must find all 10 within a limited step budget by inspecting accounts and flagging suspects.

**What makes this non-trivial:** The network is large (50–1000 accounts depending on difficulty). Fake accounts are mixed with innocent high-signal "decoy" accounts. In hard mode, the fake accounts actively evade (dropping intra-gang follows, renaming profiles) while the agent is mid-investigation. The agent cannot see the full network upfront: it must explore via INSPECT and INVESTIGATE_NETWORK actions, spending steps to reveal information.

**What makes the learning novel:** The inference LLM (via AWS Bedrock) cannot be fine-tuned; it is a black-box API. The agent learns via Reflexion, i.e., post-episode lessons are written back into memory and injected into every future prompt. A dynamic hybrid policy (α-weighted) blends the LLM with a deterministic rule engine, with the blend weight α updating based on recent win rate. Rules dominate early; the LLM takes over as it proves itself.

---

## Detection Signals

Detection operates entirely on numeric metadata, with no content processing.
Three signal categories, computed at different points:

**Node signals** (pre-computed by content pipeline, static per account):

| Feature | Fake range | Real range | Notes |
|---|---|---|---|
| `photo_reuse_score` | 0.30–0.95 | 0.00–0.15 | pHash fingerprint match against celebrity photo DB |
| `bio_template_score` | 0.20–0.90 | 0.00–0.12 | Cosine similarity to known fake bio templates |
| `comment_repeat_score` | 0.60–0.90 | 0.00–0.08 | Copy-paste spam fraction across accounts |

**Behavioral signals** (static, from account metadata):

| Feature | Fake pattern |
|---|---|
| `avg_post_hour` | All 10 gang members cluster within ±0.5 hours (coordinated scheduling) |
| `account_age_days` | Created within the same week (`base_age ± 7 days`) |
| `shared_ip_count` | 9: all 10 share one IP subnet (`ip_gang_{seed}`) |

**Graph signals** (dynamic: computed at INSPECT time, shifting as the investigation progresses):

| Feature | Why it matters |
|---|---|
| `mutual_follow_rate` | Gang members mutually follow each other at 0.6–0.9 density; legitimate hubs don't follow back |
| `flagged_neighbor_count` | Grows as more gang members are flagged (the cascade signal) |
| `post_hour_cluster_score` | Alignment to mean posting hour of currently-flagged accounts (wrap-around aware) |
| `suspicious_mutual_ratio` | Used to compute hub legitimacy; protects celebrities from false positives |

Graph signals are the most powerful: once one gang member is flagged, `flagged_neighbor_count` rises for all connected members, compounding with each subsequent flag.

---

## Synthetic Network Composition

150 episodes are pre-generated deterministically (50 per task). Each episode is a JSON file (`episodes/{task}_{seed:03d}.json`).
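
To make the naming scheme concrete, here is a minimal sketch that expands the `episodes/{task}_{seed:03d}.json` pattern into file paths. The helper name `episode_path` is hypothetical, and the sketch assumes seeds are numbered 0 to 49 within each task, which the text implies but does not state outright.

```python
from pathlib import Path

TASKS = ("easy", "medium", "hard")
SEEDS_PER_TASK = 50  # from the text: 50 episodes per task


def episode_path(task: str, seed: int, root: str = "episodes") -> Path:
    """Build one episode path from the {task}_{seed:03d}.json pattern."""
    if task not in TASKS:
        raise ValueError(f"unknown task: {task!r}")
    return Path(root) / f"{task}_{seed:03d}.json"


# Assumption: seeds run 0..49 for every task, giving 3 x 50 files.
all_paths = [episode_path(t, s) for t in TASKS for s in range(SEEDS_PER_TASK)]
```

Under that assumption the list covers every pre-generated episode, so a loader can iterate over it without globbing the directory.
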

| Task | Accounts | Gang | Decoys | Max steps | Evasion |
|---|---|---|---|---|---|
| easy | 50 | 10 | 0 | 30 | None |
| medium | 200 | 10 | 20 | 50 | Step 20 (once) |
| hard | 1000 | 10 | 50 | 80 | Steps 15/30/45/60 |

- **Gang:** Dense intra-follow graph (density 0.60–0.80), same IP subnet, tightly clustered post hours (std 0.5/1.5/2.5 by task).
- **Decoys** (medium/hard only): Real accounts with elevated `photo_reuse` and `bio_template` scores (0.20–0.40). They score as suspicious but are not gang members; they penalise reckless flagging.
- **Celebrities** (2 per episode): 100k–5M followers, near-zero fake scores. The hub legitimacy formula protects them.
- **Zero-edge isolates** (2 per episode): `follower_count=0`, no edges. They test whether the agent wastes steps on disconnected nodes.

---

## Actions

| Action | Cost | Effect |
|---|---|---|
| `inspect` | 1 step | Reveals full `AccountProfile` (all 22 features), adds neighbors to visible set |
| `investigate_network` | 2 steps | Bidirectional 2-hop expansion; reveals account IDs only (no profiles); re-cascades SUSPECT |
| `flag` | 0 steps | Marks account CONFIRMED_FAKE; dual cascade: follow-graph + IP cluster |
| `unflag` | 0 steps | Clears CONFIRMED_FAKE status |
| `submit` | 0 steps | Ends episode, triggers scoring |

**Dual SUSPECT cascade on FLAG:**

1. *Follow-graph:* Every visible account that the flagged account follows → SUSPECT (high precision: gang follow density 0.70+).
2. *IP cluster:* Every visible account sharing the same `ip_cluster_id` → SUSPECT (zero false positives: real accounts each have a unique IP; the gang shares `ip_gang_{seed}`).

Both mechanisms surface in `obs.suspect_ids`, the agent's highest-priority INSPECT targets.

---

## Risk Scoring (`server/scoring.py`)

All functions are stateless, called inside `_build_profile()` at INSPECT time and on re-profiling after each FLAG.
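
As a rough illustration of what "stateless" means here, the sketch below writes the formulas listed in this section as pure Python functions that depend only on their inputs. It is not the contents of `server/scoring.py`: the names `RiskInputs`, `fake_risk`, and `classify` are hypothetical, `follow_ratio_norm` is assumed to arrive already normalised to [0, 1], and the follower term is left unclipped because the source does not say whether it is capped.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class RiskInputs:
    """Hypothetical container for the features the formulas consume."""
    photo_reuse: float
    bio_template: float
    account_age_days: float
    post_hour_cluster_score: float
    flagged_neighbor_count: int
    inspected_neighbor_count: int
    mutual_follow_rate: float
    avg_neighbor_photo_reuse: float
    followers: int
    follow_ratio_norm: float  # assumption: pre-normalised to [0, 1]
    suspicious_mutual_ratio: float


def fake_risk(p: RiskInputs) -> float:
    """Combine node, behavior, graph, and hub terms into one clipped score."""
    node_risk = 0.60 * p.photo_reuse + 0.40 * p.bio_template
    age_norm = min(1.0, p.account_age_days / 365)
    behavior_risk = 0.55 * (1 - age_norm) + 0.45 * p.post_hour_cluster_score
    flagged_ratio = p.flagged_neighbor_count / max(p.inspected_neighbor_count, 1)
    graph_risk = (0.45 * flagged_ratio + 0.35 * p.mutual_follow_rate
                  + 0.20 * p.avg_neighbor_photo_reuse)
    hub_legitimacy = (0.45 * math.log1p(p.followers) / math.log1p(1_000_000)
                      + 0.25 * (1 - p.follow_ratio_norm)
                      + 0.20 * age_norm
                      + 0.10 * (1 - p.suspicious_mutual_ratio))
    raw = (0.30 * node_risk + 0.25 * behavior_risk
           + 0.45 * graph_risk - 0.25 * hub_legitimacy)
    return max(0.0, min(1.0, raw))  # clip to [0, 1]


def classify(risk: float) -> str:
    """Map a risk score to the formula-level label used in this section."""
    if risk < 0.35:
        return "normal"
    if risk < 0.60:
        return "suspect"
    return "confirmed_fake"
```

Because nothing is cached between calls, re-profiling after a FLAG amounts to calling `fake_risk` again with an updated `flagged_neighbor_count`.
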
```
node_risk      = 0.60 × photo_reuse + 0.40 × bio_template
age_norm       = min(1.0, account_age_days / 365)
behavior_risk  = 0.55 × (1 − age_norm) + 0.45 × post_hour_cluster_score
flagged_ratio  = flagged_neighbor_count / max(inspected_neighbor_count, 1)
graph_risk     = 0.45 × flagged_ratio + 0.35 × mutual_follow_rate + 0.20 × avg_neighbor_photo_reuse
hub_legitimacy = 0.45 × log(1+followers)/log(1+1M) + 0.25 × (1 − follow_ratio_norm) + 0.20 × age_norm + 0.10 × (1 − suspicious_mutual_ratio)
fake_risk      = clip(0.30×node_risk + 0.25×behavior_risk + 0.45×graph_risk − 0.25×hub_legitimacy, 0, 1)
```

**Weight rationale:** Graph risk (0.45) is dominant: structural signals are hardest to fake and compound across the investigation. Hub legitimacy is subtractive: a celebrity with 5M followers produces `hub_legitimacy ≈ 1.0`, making their `fake_risk` near zero even if gang members follow them.

**Classification thresholds:**

- `fake_risk < 0.35` → normal
- `0.35 ≤ fake_risk < 0.60` → suspect
- `fake_risk ≥ 0.60` → confirmed_fake (formula-level; explicit FLAG overrides)

**Grader score** (normalised [0.0, 1.0], returned by `/grader`):

```
recall     = tp / 10
precision  = tp / max(tp + fp, 1)
efficiency = max(0, (max_steps − steps_used) / max_steps)

if recall ≥ 0.8 AND precision ≥ 0.7:
    score = 0.55 + 0.20×recall + 0.15×precision + 0.10×efficiency
else:
    score = 0.30×recall + 0.10×precision
```

Maximum 1.0 (all 10 found, zero false positives, zero steps used). Win threshold ≈ 0.815, the floor of the upper branch: 0.55 + 0.20×0.8 + 0.15×0.7 with zero efficiency.

---

## Hybrid Policy (`agent/hybrid_policy.py`)

The agent blends a deterministic rule engine with Qwen3-Next-80B (via AWS Bedrock) using a per-task trust weight α.
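
Since α is driven by recent win rate, it helps to pin down what a win looks like numerically. The sketch below transcribes the grader formula from the Risk Scoring section into Python; the function name `grader_score` and its argument names are hypothetical, and this is not the code behind `/grader`.

```python
GANG_SIZE = 10  # from the text: the network always has 10 members


def grader_score(tp: int, fp: int, steps_used: int, max_steps: int) -> float:
    """Score one submission from true positives, false positives, and steps."""
    recall = tp / GANG_SIZE
    precision = tp / max(tp + fp, 1)
    efficiency = max(0.0, (max_steps - steps_used) / max_steps)
    if recall >= 0.8 and precision >= 0.7:
        # Upper branch: base bonus plus weighted recall, precision, efficiency.
        return 0.55 + 0.20 * recall + 0.15 * precision + 0.10 * efficiency
    # Lower branch: no bonus and no credit for efficiency.
    return 0.30 * recall + 0.10 * precision
```

The gap between the two branches is large: finding 7 of 10 with no false positives scores well under half of what 8 of 10 with three false positives earns, so recall below 0.8 is effectively a loss regardless of precision.
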
**Alpha update** (per episode, after win/loss recorded):

```
reflection_factor = min(1.0, n_reflections / 4.0)
raw   = 0.20 + reflection_factor × (0.80 × recent_win_rate + 0.12)
alpha = clamp(raw, 0.20, task_cap)
```

| Task | α cap | Rationale |
|---|---|---|
| easy | 0.50 | Rule engine alone hits ~91%; the LLM assists, doesn't override |
| medium | 0.70 | Decoys require LLM judgment, but cascade must stay |
| hard | 0.85 | LLM needs latitude for evasion adaptation |

`reflection_factor` gates α: the LLM must accumulate ≥4 post-episode lessons before reaching meaningful trust, regardless of raw win rate.

**Blending decision:**

```python
rule_action, rule_conf = get_rule_action(obs)  # deterministic, with confidence score
llm_action, _ = get_action(obs, ...)           # Qwen3 via Bedrock

if rule_action == llm_action:
    final = llm_action   # agree
elif rule_conf >= alpha:
    final = rule_action  # rule overrides
else:
    final = llm_action   # LLM trusted
```

Rule confidences: SUBMIT-forced=1.00, INSPECT-suspect=0.95, FLAG-high-risk=0.95, FLAG-threshold=0.70+, INSPECT-explore=0.30. At `α=0.50` (easy cap), safety decisions (suspects, forced submit) always override; exploration goes to the LLM.

**Reflexion learning:** After each episode, Qwen3 generates a 2–3 sentence lesson from the action log and outcome. Lessons are stored in `memory/reflections_{task}.jsonl` and injected into every future prompt (last 4 lessons + best winning trajectory as few-shot example). Memory persists across container restarts via Docker volume.
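
The alpha update and blending rule described in this section can be exercised together in a few lines. This is a minimal sketch, not the contents of `agent/hybrid_policy.py`: the function names `update_alpha` and `blend` are hypothetical, and actions are represented as plain strings purely for illustration.

```python
TASK_CAPS = {"easy": 0.50, "medium": 0.70, "hard": 0.85}  # per-task α caps from the table
ALPHA_FLOOR = 0.20


def update_alpha(task: str, n_reflections: int, recent_win_rate: float) -> float:
    """Recompute the trust weight α after an episode."""
    reflection_factor = min(1.0, n_reflections / 4.0)
    raw = ALPHA_FLOOR + reflection_factor * (0.80 * recent_win_rate + 0.12)
    return max(ALPHA_FLOOR, min(raw, TASK_CAPS[task]))  # clamp to [floor, cap]


def blend(rule_action: str, rule_conf: float, llm_action: str, alpha: float) -> str:
    """Pick the final action: rules win whenever their confidence reaches α."""
    if rule_action == llm_action:
        return llm_action
    if rule_conf >= alpha:
        return rule_action
    return llm_action


# With no reflections yet, α stays at the floor no matter how often the agent wins.
alpha_cold = update_alpha("hard", n_reflections=0, recent_win_rate=1.0)
# At the floor, even a low-confidence exploration rule (0.30) overrides the LLM.
cold_choice = blend("inspect:a", 0.30, "flag:b", alpha_cold)
# At the easy cap, the same exploration rule yields to the LLM.
warm_choice = blend("inspect:a", 0.30, "flag:b", update_alpha("easy", 4, 1.0))
```

One consequence visible in the sketch: because the hard cap is 0.85, rule confidences of 0.95 and 1.00 still override the LLM at maximum trust, so forced submits and suspect inspections are never delegated.
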
---

## API Reference

| Endpoint | Method | Description |
|---|---|---|
| `/health` | GET | `{"status": "healthy"}` |
| `/tasks` | GET | Task list + `action_schema` + `score_range: [0.0, 1.0]` |
| `/reset` | POST | `{task, seed}` → initial observation |
| `/step` | POST | `{action_type, account_id?}` → updated observation |
| `/state` | GET | Episode metadata (step count, task, score, evasion count) |
| `/grader` | GET | Normalised [0.0, 1.0] score after SUBMIT (400 if not done) |
| `/baseline` | POST | Runs rule-based agent on all 3 tasks, seed=0 |
| `/metadata` | GET | OpenEnv metadata block |
| `/schema` | GET | Full JSON schema for actions and observations |
| `/mcp` | POST | JSON-RPC 2.0 tool discovery (Model Context Protocol) |

Live: `https://pandago-graphstrike.hf.space`

---

## File Structure

```
server/
  app.py            # FastAPI + Gradio UI (gr.mount_gradio_app)
  environment.py    # Episode lifecycle, action mechanics, cascade logic
  generator.py      # Deterministic episode generation (150 JSON files)
  scoring.py        # Stateless risk formula functions
  models.py         # Pydantic models: AccountProfile, FakeGangObservation, ActionType
agent/
  policy.py         # Qwen3 prompt construction + action parsing
  hybrid_policy.py  # Alpha blending, rule engine with confidence scores
  reflection.py     # Post-episode lesson generation
  memory.py         # JSONL persistence for reflections, trajectories, alpha
inference.py        # Submission entrypoint: [START]/[STEP]/[END] structured logs, OpenAI client
validate.py         # 24-point pre-submission validator (local + HTTP)
train.py            # Full training loop with curriculum
episodes/           # 150 pre-generated JSON episode files (baked into Docker image)
memory/             # Docker volume: reflections, win history, alpha values
```

---

## Baseline Scores

| Task | Seed=0 | Win rate (50 seeds) | Mean (50 seeds) |
|---|---|---|---|
| easy | 0.910 | 100% | ~0.91 |
| medium | 0.906 | 84% | ~0.77 |
| hard | 0.9038 | 52% | ~0.47 |

The rule-based baseline (no LLM) is competitive on easy/medium.
Hard is the real differentiator: evasion events drop intra-gang edges mid-investigation, destroying graph signals. Frontier LLM agents with accumulated reflections adapt; the rule engine degrades.

---

*Built by team computeXor*