---
title: GraphStrike
emoji: 🕵️
colorFrom: blue
colorTo: indigo
sdk: docker
app_port: 7860
pinned: false
license: mit
tags:
- reinforcement-learning
- social-network
- fraud-detection
- openenv
- llm-agent
---
An OpenEnv-compatible reinforcement learning environment where an LLM agent must identify all 10 members of a coordinated fake account network hidden inside a synthetic social network. The agent learns via Reflexion and a dynamic hybrid rule/LLM policy, not via gradient updates or fine-tuning.
## Deployed Endpoint Verification
The live environment at [huggingface.co/spaces/Pandago/graphstrike](https://huggingface.co/spaces/Pandago/graphstrike)
responds to all standard OpenEnv endpoints:
```bash
# Health check
curl https://pandago-graphstrike.hf.space/health
# → {"status": "healthy"}
# Task discovery
curl https://pandago-graphstrike.hf.space/tasks
# → {"tasks": ["easy","medium","hard"], "action_schema": {...}, "score_range": [0.0, 1.0]}
# Baseline (deterministic, reproducible)
curl -X POST https://pandago-graphstrike.hf.space/baseline
# → {"scores": {"easy": 0.91, "medium": 0.906, "hard": 0.9038}, "agent": "rule_based"}
```
---
We evaluate GraphStrike's hybrid rule/LLM policy across multiple frontier models to measure how well each model handles the investigation task. All runs use
the same inference pipeline (`inference.py`) with identical system prompts and structured logging. Each model ran: (1) seed=0 on all 3 tasks, and
(2) seeds 0-2 on all 3 tasks for variance measurement.
**Seed=0 scores (single episode per task):**
**3-seed variance scores (mean across seeds 0, 1, 2):**
**Rule-Based Baseline (no LLM, deterministic)**
---
**The task:** A social network contains fake accounts organised into a
single coordinated network of 10. The network behaves in a coordinated way — same posting hour,
same IP subnet, stolen celebrity photos, copy-paste bios. The agent must find
all 10 by navigating a limited step budget, inspecting accounts, and flagging suspects.
**What makes this non-trivial:** The network is large (50–1000 accounts depending on difficulty). Fake accounts are mixed with innocent high-signal "decoy" accounts. In hard mode, the fake accounts actively evade detection (dropping intra-gang follows, renaming profiles) while the agent is mid-investigation. The agent cannot see the full network upfront: it must explore via INSPECT and INVESTIGATE_NETWORK actions, spending steps to reveal information.
**What makes the learning novel:** The inference LLM (via AWS Bedrock) cannot be fine-tuned; it is a black-box API. The agent learns via Reflexion: post-episode lessons are written back into memory and injected into every future prompt. A dynamic hybrid policy (α-weighted) blends the LLM with a deterministic rule engine, with the blend weight α updating based on recent win rate. Rules dominate early; the LLM takes over as it proves itself.
---
## Detection Signals
Detection operates entirely on numeric metadata — no content processing. Three signal categories, computed at different points:
**Node signals** (pre-computed by content pipeline, static per account):
| Feature | Fake range | Real range | Notes |
|---|---|---|---|
| `photo_reuse_score` | 0.30–0.95 | 0.00–0.15 | pHash fingerprint match against celebrity photo DB |
| `bio_template_score` | 0.20–0.90 | 0.00–0.12 | Cosine sim to known fake bio templates |
| `comment_repeat_score` | 0.60–0.90 | 0.00–0.08 | Copy-paste spam fraction across accounts |
**Behavioral signals** (static, from account metadata):
| Feature | Fake pattern |
|---|---|
| `avg_post_hour` | All 10 gang members cluster within ±0.5 hours — coordinated scheduling |
| `account_age_days` | Created within the same week (`base_age ± 7 days`) |
| `shared_ip_count` | 9 — all 10 share one IP subnet (`ip_gang_{seed}`) |
**Graph signals** (dynamic — computed at INSPECT time, shift as investigation progresses):
| Feature | Why it matters |
|---|---|
| `mutual_follow_rate` | Gang members mutually follow each other at 0.6–0.9 density; legitimate hubs don't follow back |
| `flagged_neighbor_count` | Grows as more gang members are flagged — the cascade signal |
| `post_hour_cluster_score` | Alignment to mean posting hour of currently-flagged accounts (wrap-around aware) |
| `suspicious_mutual_ratio` | Used to compute hub legitimacy — protects celebrities from false positives |
Graph signals are the most powerful: once one gang member is flagged, `flagged_neighbor_count` rises for all connected members, compounding with each subsequent flag.
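The "wrap-around aware" note on `post_hour_cluster_score` matters because posting hours live on a 24-hour circle: 23.5 and 0.5 are one hour apart, not 23. The sketch below shows one way such a score could be computed. It is an illustration only: the circular mean, the linear falloff over 12 hours, and the helper names `circular_mean_hour` and `circular_hour_distance` are assumptions, not the formula used in `server/scoring.py`.

```python
import math

def circular_mean_hour(hours):
    """Mean of clock hours on a 24-hour circle (assumed approach)."""
    angles = [2 * math.pi * h / 24.0 for h in hours]
    s = sum(math.sin(a) for a in angles)
    c = sum(math.cos(a) for a in angles)
    return (math.atan2(s, c) * 24.0 / (2 * math.pi)) % 24.0

def circular_hour_distance(a, b):
    """Shortest distance between two clock hours, in the range 0..12."""
    d = abs(a - b) % 24.0
    return min(d, 24.0 - d)

def post_hour_cluster_score(account_hour, flagged_hours):
    """1.0 when aligned with the flagged accounts' mean hour, 0.0 when opposite.

    The linear falloff is a hypothetical choice for illustration.
    """
    if not flagged_hours:
        return 0.0  # nothing flagged yet, so there is no cluster to align to
    center = circular_mean_hour(flagged_hours)
    return 1.0 - circular_hour_distance(account_hour, center) / 12.0
```

With flagged accounts posting at 23.5 and 0.5, an account posting at midnight scores close to 1.0, whereas a naive arithmetic mean (12.0) would place the cluster at noon.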
---
## Synthetic Network Composition
150 episodes pre-generated deterministically (50 per task). Each episode is a JSON file (`episodes/{task}_{seed:03d}.json`).
| Task | Accounts | Gang | Decoys | Max steps | Evasion |
|---|---|---|---|---|---|
| easy | 50 | 10 | 0 | 30 | None |
| medium | 200 | 10 | 20 | 50 | Step 20 (once) |
| hard | 1000 | 10 | 50 | 80 | Steps 15/30/45/60 |
- **Gang:** Dense intra-follow graph (density 0.60–0.80), same IP subnet, tightly clustered post hours (std 0.5/1.5/2.5 by task).
- **Decoys** (medium/hard only): Real accounts with elevated `photo_reuse` and `bio_template` scores (0.20–0.40). They score as suspicious but are not gang members — they penalise reckless flagging.
- **Celebrities** (2 per episode): 100k–5M followers, near-zero fake scores. Hub legitimacy formula protects them.
- **Zero-edge isolates** (2 per episode): `follower_count=0`, no edges. Test whether the agent wastes steps on disconnected nodes.
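For quick reference, the task table and the episode file naming can be captured in a few lines. This is a sketch: the `TASKS` dictionary, its key names, and the `episode_path` helper are hypothetical, and the seed range 0–49 is an assumption based on "50 per task".

```python
from pathlib import Path

# Values copied from the task table above; the key names are illustrative.
TASKS = {
    "easy":   {"accounts": 50,   "gang": 10, "decoys": 0,  "max_steps": 30, "evasion_steps": []},
    "medium": {"accounts": 200,  "gang": 10, "decoys": 20, "max_steps": 50, "evasion_steps": [20]},
    "hard":   {"accounts": 1000, "gang": 10, "decoys": 50, "max_steps": 80, "evasion_steps": [15, 30, 45, 60]},
}

def episode_path(task, seed, root="episodes"):
    """Build the path episodes/{task}_{seed:03d}.json for a task and seed."""
    if task not in TASKS:
        raise ValueError(f"unknown task: {task!r}")
    if not 0 <= seed < 50:  # assumed seed range: 50 episodes per task
        raise ValueError(f"seed out of range: {seed}")
    return Path(root) / f"{task}_{seed:03d}.json"
```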
---
## Actions
| Action | Cost | Effect |
|---|---|---|
| `inspect` | 1 step | Reveals full `AccountProfile` (all 22 features), adds neighbors to visible set |
| `investigate_network` | 2 steps | Bidirectional 2-hop expansion — reveals account IDs only (no profiles); re-cascades SUSPECT |
| `flag` | 0 steps | Marks account CONFIRMED_FAKE; dual cascade: follow-graph + IP cluster |
| `unflag` | 0 steps | Clears CONFIRMED_FAKE status |
| `submit` | 0 steps | Ends episode, triggers scoring |
**Dual SUSPECT cascade on FLAG:**
1. *Follow-graph:* Every visible account that the flagged account follows → SUSPECT (high precision: gang follow density 0.70+).
2. *IP cluster:* Every visible account sharing the same `ip_cluster_id` → SUSPECT (zero false positives: real accounts each have a unique IP; gang shares `ip_gang_{seed}`).
Both mechanisms surface in `obs.suspect_ids` — the agent's highest-priority INSPECT targets.
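The two cascade rules can be sketched as a single set operation. The data structures here (`follows` as a dict of lists, `ip_cluster` as a dict of cluster IDs) and the function name `suspect_cascade` are assumptions for illustration; the real logic lives in `server/environment.py` and may differ, for example in how it treats accounts that are already flagged.

```python
def suspect_cascade(flagged_id, follows, ip_cluster, visible, confirmed):
    """Return the account IDs that become SUSPECT when `flagged_id` is flagged.

    follows:    account ID -> iterable of account IDs it follows
    ip_cluster: account ID -> ip_cluster_id
    visible:    set of account IDs the agent has revealed so far
    confirmed:  set of account IDs already marked CONFIRMED_FAKE
    """
    # Rule 1: every visible account that the flagged account follows.
    follow_hits = {a for a in follows.get(flagged_id, ()) if a in visible}

    # Rule 2: every visible account sharing the flagged account's IP cluster.
    cluster = ip_cluster.get(flagged_id)
    ip_hits = set()
    if cluster is not None:
        ip_hits = {a for a in visible if ip_cluster.get(a) == cluster}

    # Assumption: accounts already confirmed (and the flagged one) are skipped.
    return (follow_hits | ip_hits) - set(confirmed) - {flagged_id}
```

Note that the follow-graph rule can mark a legitimate account (such as a celebrity the gang follows) as SUSPECT; per the scoring section, hub legitimacy is what keeps such accounts from being classified as fake once inspected.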
---
## Risk Scoring (`server/scoring.py`)
All functions are stateless, called inside `_build_profile()` at INSPECT time and on re-profiling after each FLAG.
```
node_risk = 0.60 × photo_reuse + 0.40 × bio_template
age_norm = min(1.0, account_age_days / 365)
behavior_risk = 0.55 × (1 − age_norm) + 0.45 × post_hour_cluster_score
flagged_ratio = flagged_neighbor_count / max(inspected_neighbor_count, 1)
graph_risk = 0.45 × flagged_ratio + 0.35 × mutual_follow_rate + 0.20 × avg_neighbor_photo_reuse
hub_legitimacy = 0.45 × log(1+followers)/log(1+1M)
               + 0.25 × (1 − follow_ratio_norm)
               + 0.20 × age_norm
               + 0.10 × (1 − suspicious_mutual_ratio)
fake_risk = clip(0.30×node_risk + 0.25×behavior_risk + 0.45×graph_risk − 0.25×hub_legitimacy, 0, 1)
```
**Weight rationale:** Graph risk (0.45) is dominant — structural signals are hardest to fake and compound across the investigation. Hub legitimacy is subtractive — a celebrity with 5M followers produces `hub_legitimacy ≈ 1.0`, making their fake_risk near zero even if gang members follow them.
**Classification thresholds:**
- `fake_risk < 0.35` → normal
- `0.35 ≤ fake_risk < 0.60` → suspect
- `fake_risk ≥ 0.60` → confirmed_fake (formula-level; explicit FLAG overrides)
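The formulas and thresholds above translate directly into Python. The following is a sketch rather than the contents of `server/scoring.py`: the function signatures are hypothetical, `1M` is read as 1,000,000, and the follower term is left unclipped because the formula shows no clip. The two example profiles are invented inputs chosen to sit inside the ranges quoted in this document.

```python
import math

def clip(x, lo=0.0, hi=1.0):
    return max(lo, min(hi, x))

def fake_risk(*, photo_reuse, bio_template, account_age_days,
              post_hour_cluster_score, flagged_neighbor_count,
              inspected_neighbor_count, mutual_follow_rate,
              avg_neighbor_photo_reuse, followers,
              follow_ratio_norm, suspicious_mutual_ratio):
    """Combine node, behavior, and graph risk, minus hub legitimacy."""
    node_risk = 0.60 * photo_reuse + 0.40 * bio_template
    age_norm = min(1.0, account_age_days / 365)
    behavior_risk = 0.55 * (1 - age_norm) + 0.45 * post_hour_cluster_score
    flagged_ratio = flagged_neighbor_count / max(inspected_neighbor_count, 1)
    graph_risk = (0.45 * flagged_ratio + 0.35 * mutual_follow_rate
                  + 0.20 * avg_neighbor_photo_reuse)
    hub_legitimacy = (0.45 * math.log(1 + followers) / math.log(1 + 1_000_000)
                      + 0.25 * (1 - follow_ratio_norm)
                      + 0.20 * age_norm
                      + 0.10 * (1 - suspicious_mutual_ratio))
    return clip(0.30 * node_risk + 0.25 * behavior_risk
                + 0.45 * graph_risk - 0.25 * hub_legitimacy)

def classify(risk):
    """Map a fake_risk value to the formula-level label."""
    if risk < 0.35:
        return "normal"
    if risk < 0.60:
        return "suspect"
    return "confirmed_fake"

# Invented example inputs (not taken from any real episode).
gang_like = dict(photo_reuse=0.9, bio_template=0.8, account_age_days=30,
                 post_hour_cluster_score=0.9, flagged_neighbor_count=3,
                 inspected_neighbor_count=4, mutual_follow_rate=0.8,
                 avg_neighbor_photo_reuse=0.7, followers=200,
                 follow_ratio_norm=0.9, suspicious_mutual_ratio=0.8)
celebrity = dict(photo_reuse=0.02, bio_template=0.01, account_age_days=3650,
                 post_hour_cluster_score=0.1, flagged_neighbor_count=2,
                 inspected_neighbor_count=5, mutual_follow_rate=0.0,
                 avg_neighbor_photo_reuse=0.3, followers=5_000_000,
                 follow_ratio_norm=0.0, suspicious_mutual_ratio=0.0)
```

Calling `classify(fake_risk(**gang_like))` lands in the `confirmed_fake` band, while the celebrity profile is clipped to 0.0 even though two of its five inspected neighbors are flagged, which illustrates the subtractive role of hub legitimacy.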
**Grader score** (normalised [0.0, 1.0], returned by `/grader`):
```
recall = tp / 10
precision = tp / max(tp + fp, 1)
efficiency = max(0, (max_steps − steps_used) / max_steps)
if recall ≥ 0.8 AND precision ≥ 0.7:
    score = 0.55 + 0.20×recall + 0.15×precision + 0.10×efficiency
else:
    score = 0.30×recall + 0.10×precision
```
Maximum 1.0 (all 10 found, zero false positives, zero steps used). Win threshold ≈ 0.815, the floor of the winning branch (recall 0.8, precision 0.7, efficiency 0: 0.55 + 0.16 + 0.105).
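The grader formula can be checked by hand with a small Python transcription. This is a sketch: the function name `grader_score` and its argument names are hypothetical, and only the arithmetic is taken from the block above.

```python
def grader_score(tp, fp, steps_used, max_steps, gang_size=10):
    """Score an episode from true positives, false positives, and steps used."""
    recall = tp / gang_size
    precision = tp / max(tp + fp, 1)
    efficiency = max(0.0, (max_steps - steps_used) / max_steps)
    if recall >= 0.8 and precision >= 0.7:
        # Winning branch: fixed bonus plus weighted recall/precision/efficiency.
        return 0.55 + 0.20 * recall + 0.15 * precision + 0.10 * efficiency
    # Losing branch: no bonus and no credit for efficiency.
    return 0.30 * recall + 0.10 * precision
```

For example, finding all 10 gang members with no false positives in 27 of 30 steps gives 0.55 + 0.20 + 0.15 + 0.10 × 0.1 = 0.91, while finding only 7 with no false positives falls into the losing branch at 0.30 × 0.7 + 0.10 × 1.0 = 0.31. The gap between the two branches means a near miss on recall costs far more than a few extra steps.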
---
## Hybrid Policy (`agent/hybrid_policy.py`)
The agent blends a deterministic rule engine with Qwen3-Next-80B (via AWS Bedrock) using a per-task trust weight α.
**Alpha update** (per episode, after win/loss recorded):
```
reflection_factor = min(1.0, n_reflections / 4.0)
raw = 0.20 + reflection_factor × (0.80 × recent_win_rate + 0.12)
alpha = clamp(raw, 0.20, task_cap)
```
| Task | α cap | Rationale |
|---|---|---|
| easy | 0.50 | Rule engine alone hits ~91% — LLM assists, doesn't override |
| medium | 0.70 | Decoys require LLM judgment, but cascade must stay |
| hard | 0.85 | LLM needs latitude for evasion adaptation |
`reflection_factor` gates α: it scales linearly with the number of post-episode lessons and reaches 1.0 only at 4 lessons, so the LLM cannot earn full trust before then, regardless of raw win rate.
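A direct transcription of the alpha update makes the effect of the floor, the per-task cap, and the reflection gate easy to see. The function name `update_alpha` and the `ALPHA_CAPS` dictionary are hypothetical; the constants come from the formula and table above.

```python
ALPHA_CAPS = {"easy": 0.50, "medium": 0.70, "hard": 0.85}

def update_alpha(task, n_reflections, recent_win_rate):
    """Recompute the LLM trust weight for a task after an episode."""
    reflection_factor = min(1.0, n_reflections / 4.0)
    raw = 0.20 + reflection_factor * (0.80 * recent_win_rate + 0.12)
    return max(0.20, min(ALPHA_CAPS[task], raw))
```

With no reflections, α stays at the 0.20 floor whatever the win rate. With 2 reflections and a 50% win rate on hard, α = 0.20 + 0.5 × (0.40 + 0.12) = 0.46. With 4 reflections and a 100% win rate, the raw value (1.12) is clamped to the task cap.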
**Blending decision:**
```python
rule_action, rule_conf = get_rule_action(obs) # deterministic, with confidence score
llm_action, _ = get_action(obs, ...) # Qwen3 via Bedrock
if rule_action == llm_action: final = llm_action # agree
elif rule_conf >= alpha: final = rule_action # rule overrides
else: final = llm_action # LLM trusted
```
Rule confidences: SUBMIT-forced=1.00, INSPECT-suspect=0.95, FLAG-high-risk=0.95, FLAG-threshold=0.70+, INSPECT-explore=0.30. At `α=0.50` (easy cap), safety decisions (suspects, forced submit) always override; exploration goes to the LLM.
**Reflexion learning:** After each episode, Qwen3 generates a 2–3 sentence lesson from the action log and outcome. Lessons are stored in `memory/reflections_{task}.jsonl` and injected into every future prompt (last 4 lessons + best winning trajectory as few-shot example). Memory persists across container restarts via Docker volume.
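The storage side of Reflexion amounts to appending to a JSONL file and reading back the most recent entries. The sketch below assumes a record shape of `{"lesson": ..., "won": ...}` and uses hypothetical helper names; the actual schema in `agent/memory.py` is not shown in this document and may differ.

```python
import json
from pathlib import Path

def _reflections_file(memory_dir, task):
    return Path(memory_dir) / f"reflections_{task}.jsonl"

def append_lesson(memory_dir, task, lesson, won):
    """Append one post-episode lesson as a JSON line (assumed record shape)."""
    path = _reflections_file(memory_dir, task)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps({"lesson": lesson, "won": won}) + "\n")

def recent_lessons(memory_dir, task, k=4):
    """Return the last k lessons for a task, oldest first."""
    path = _reflections_file(memory_dir, task)
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as fh:
        records = [json.loads(line) for line in fh if line.strip()]
    return [r["lesson"] for r in records[-k:]]

def lessons_prompt_block(lessons):
    """Format lessons for prompt injection (hypothetical layout)."""
    if not lessons:
        return ""
    return "Lessons from past episodes:\n" + "\n".join(f"- {l}" for l in lessons)
```

Because the file is append-only and read in full on each call, it survives restarts as long as `memory/` is mounted as a volume; keeping only the last 4 lessons in the prompt bounds prompt growth.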
---
## API Reference
| Endpoint | Method | Description |
|---|---|---|
| `/health` | GET | `{"status": "healthy"}` |
| `/tasks` | GET | Task list + `action_schema` + `score_range: [0.0, 1.0]` |
| `/reset` | POST | `{task, seed}` → initial observation |
| `/step` | POST | `{action_type, account_id?}` → updated observation |
| `/state` | GET | Episode metadata (step count, task, score, evasion count) |
| `/grader` | GET | Normalised [0.0, 1.0] score after SUBMIT (400 if not done) |
| `/baseline` | POST | Runs rule-based agent on all 3 tasks, seed=0 |
| `/metadata` | GET | OpenEnv metadata block |
| `/schema` | GET | Full JSON schema for actions and observations |
| `/mcp` | POST | JSON-RPC 2.0 tool discovery (Model Context Protocol) |
Live: `https://pandago-graphstrike.hf.space`
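A minimal Python client for these endpoints can be written with the standard library alone. The request bodies follow the `{task, seed}` and `{action_type, account_id?}` shapes from the table; everything else is an assumption, including the helper names, the type of `account_id`, and the structure of the responses, which should be checked against `/schema`.

```python
import json
import urllib.request

BASE_URL = "https://pandago-graphstrike.hf.space"

def reset_payload(task, seed):
    return {"task": task, "seed": seed}

def step_payload(action_type, account_id=None):
    """Build a /step body; account_id is omitted for actions like submit."""
    payload = {"action_type": action_type}
    if account_id is not None:
        payload["account_id"] = account_id
    return payload

def call(path, payload=None, method="GET", base_url=BASE_URL, timeout=30):
    """Send a JSON request and decode the JSON response."""
    data = None if payload is None else json.dumps(payload).encode("utf-8")
    req = urllib.request.Request(
        base_url + path, data=data, method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))

def run_trivial_episode(task="easy", seed=0):
    """Reset, submit immediately, and fetch the score (network required)."""
    call("/reset", reset_payload(task, seed), method="POST")
    call("/step", step_payload("submit"), method="POST")
    return call("/grader")
```

`run_trivial_episode` flags nothing, so it only demonstrates the reset, step, grader sequence; a real agent would issue `inspect` and `flag` steps between reset and submit.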
---
## File Structure
```
server/
  app.py            # FastAPI + Gradio UI (gr.mount_gradio_app)
  environment.py    # Episode lifecycle, action mechanics, cascade logic
  generator.py      # Deterministic episode generation (150 JSON files)
  scoring.py        # Stateless risk formula functions
  models.py         # Pydantic models: AccountProfile, FakeGangObservation, ActionType
agent/
  policy.py         # Qwen3 prompt construction + action parsing
  hybrid_policy.py  # Alpha blending, rule engine with confidence scores
  reflection.py     # Post-episode lesson generation
  memory.py         # JSONL persistence for reflections, trajectories, alpha
inference.py        # Submission entrypoint: [START]/[STEP]/[END] structured logs, OpenAI client
validate.py         # 24-point pre-submission validator (local + HTTP)
train.py            # Full training loop with curriculum
episodes/           # 150 pre-generated JSON episode files (baked into Docker image)
memory/             # Docker volume: reflections, win history, alpha values
```
---
## Baseline Scores
| Task | Seed=0 | Win rate (50 seeds) | Mean (50 seeds) |
|---|---|---|---|
| easy | 0.910 | 100% | ~0.91 |
| medium | 0.906 | 84% | ~0.77 |
| hard | 0.9038 | 52% | ~0.47 |
The rule-based baseline (no LLM) is competitive on easy/medium. Hard is the real differentiator — evasion events drop intra-gang edges mid-investigation, destroying graph signals. Frontier LLM agents with accumulated reflections adapt; the rule engine degrades.
---
*Built by team computeXor*