title: GraphStrike
emoji: 🕵️
colorFrom: blue
colorTo: indigo
sdk: docker
app_port: 7860
pinned: false
license: mit
tags:
- reinforcement-learning
- social-network
- fraud-detection
- openenv
- llm-agent
An OpenEnv-compatible reinforcement learning environment where an LLM agent must identify all 10 members of a coordinated fake account network hidden inside a synthetic social network. The agent learns via Reflexion and a dynamic hybrid rule/LLM policy, not via gradient updates or fine-tuning.
Deployed Endpoint Verification
The live environment at huggingface.co/spaces/Pandago/graphstrike responds to all standard OpenEnv endpoints:
# Health check
curl https://pandago-graphstrike.hf.space/health
# → {"status": "healthy"}
# Task discovery
curl https://pandago-graphstrike.hf.space/tasks
# → {"tasks": ["easy","medium","hard"], "action_schema": {...}, "score_range": [0.0, 1.0]}
# Baseline (deterministic, reproducible)
curl -X POST https://pandago-graphstrike.hf.space/baseline
# → {"scores": {"easy": 0.91, "medium": 0.906, "hard": 0.9038}, "agent": "rule_based"}
We evaluate GraphStrike's hybrid rule/LLM policy across multiple frontier models to measure how well each model handles the investigation task. All runs use the same inference pipeline (inference.py) with identical system prompts and structured logging. Each model ran: (1) seed=0 on all 3 tasks, and (2) seeds 0-2 on all 3 tasks for variance measurement.
Seed=0 scores (single episode per task):
3-seed variance scores (mean across seeds 0, 1, 2):
Rule-Based Baseline (no LLM, deterministic)
The task: A social network contains fake accounts organised into a single coordinated network of 10. The network behaves in a coordinated way — same posting hour, same IP subnet, stolen celebrity photos, copy-paste bios. The agent must find all 10 by navigating a limited step budget, inspecting accounts, and flagging suspects.
What makes this non-trivial: The network is large (50–1000 accounts depending on difficulty). Fake accounts are mixed with innocent high-signal "decoy" accounts. In hard mode, the fake accounts actively evade, dropping intra-gang follows and renaming profiles while the agent is mid-investigation. The agent cannot see the full network upfront: it must explore via INSPECT and INVESTIGATE_NETWORK actions, spending steps to reveal information.
What makes the learning novel: The inference LLM (via AWS Bedrock) cannot be fine-tuned; it is a black-box API. The agent learns via Reflexion, i.e., post-episode lessons are written back into memory and injected into every future prompt. A dynamic hybrid policy (α-weighted) blends the LLM with a deterministic rule engine, with the blend weight α updating based on recent win rate. Rules dominate early; the LLM takes over as it proves itself.
Detection Signals
Detection operates entirely on numeric metadata — no content processing. Three signal categories, computed at different points:
Node signals (pre-computed by content pipeline, static per account):
| Feature | Fake range | Real range | Notes |
|---|---|---|---|
| `photo_reuse_score` | 0.30–0.95 | 0.00–0.15 | pHash fingerprint match against celebrity photo DB |
| `bio_template_score` | 0.20–0.90 | 0.00–0.12 | Cosine similarity to known fake bio templates |
| `comment_repeat_score` | 0.60–0.90 | 0.00–0.08 | Copy-paste spam fraction across accounts |
Behavioral signals (static, from account metadata):
| Feature | Fake pattern |
|---|---|
| `avg_post_hour` | All 10 gang members cluster within ±0.5 hours (coordinated scheduling) |
| `account_age_days` | Created within the same week (base_age ± 7 days) |
| `shared_ip_count` | 9: all 10 share one IP subnet (`ip_gang_{seed}`) |
Graph signals (dynamic — computed at INSPECT time, shift as investigation progresses):
| Feature | Why it matters |
|---|---|
| `mutual_follow_rate` | Gang members mutually follow each other at 0.6–0.9 density; legitimate hubs don't follow back |
| `flagged_neighbor_count` | Grows as more gang members are flagged; this is the cascade signal |
| `post_hour_cluster_score` | Alignment to mean posting hour of currently-flagged accounts (wrap-around aware) |
| `suspicious_mutual_ratio` | Used to compute hub legitimacy, which protects celebrities from false positives |
Graph signals are the most powerful: once one gang member is flagged, flagged_neighbor_count rises for all connected members, compounding with each subsequent flag.
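The wrap-around handling behind post_hour_cluster_score can be illustrated with a short sketch. The source does not give the formula, so the circular mean and the linear falloff over 12 hours used here are assumptions, and all function names are hypothetical.

```python
import math

def circular_mean_hour(hours):
    """Mean of clock hours on a 24-hour circle, so 23.5 and 0.5 average to midnight."""
    angles = [2 * math.pi * h / 24.0 for h in hours]
    x = sum(math.cos(a) for a in angles)
    y = sum(math.sin(a) for a in angles)
    return (math.atan2(y, x) * 24.0 / (2 * math.pi)) % 24.0

def circular_hour_distance(a, b):
    """Shortest distance between two clock hours, at most 12."""
    d = abs(a - b) % 24.0
    return min(d, 24.0 - d)

def post_hour_cluster_score(avg_post_hour, flagged_hours):
    """Assumed scoring: 1.0 at the flagged accounts' mean hour, 0.0 twelve hours away."""
    if not flagged_hours:
        return 0.0  # nothing flagged yet, so there is no cluster to align with
    center = circular_mean_hour(flagged_hours)
    return 1.0 - circular_hour_distance(avg_post_hour, center) / 12.0
```

A naive arithmetic mean of 23.5 and 0.5 would give 12.0, the opposite side of the clock, which is why the sketch averages on the circle instead.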
Synthetic Network Composition
150 episodes pre-generated deterministically (50 per task). Each episode is a JSON file (episodes/{task}_{seed:03d}.json).
| Task | Accounts | Gang | Decoys | Max steps | Evasion |
|---|---|---|---|---|---|
| easy | 50 | 10 | 0 | 30 | None |
| medium | 200 | 10 | 20 | 50 | Step 20 (once) |
| hard | 1000 | 10 | 50 | 80 | Steps 15/30/45/60 |
- Gang: Dense intra-follow graph (density 0.60–0.80), same IP subnet, tightly clustered post hours (std 0.5/1.5/2.5 by task).
- Decoys (medium/hard only): Real accounts with elevated `photo_reuse` and `bio_template` scores (0.20–0.40). They score as suspicious but are not gang members, so they penalise reckless flagging.
- Celebrities (2 per episode): 100k–5M followers, near-zero fake scores. Hub legitimacy formula protects them.
- Zero-edge isolates (2 per episode): `follower_count=0`, no edges. Test whether the agent wastes steps on disconnected nodes.
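The task table and the episode file naming pattern can be captured in a small lookup. This is a sketch only: the `TASKS` dictionary layout and the `episode_path` helper are hypothetical, and the seed range 0–49 is an assumption based on "50 per task".

```python
# Task parameters copied from the composition table above.
TASKS = {
    "easy":   {"accounts": 50,   "gang": 10, "decoys": 0,  "max_steps": 30, "evasion_steps": []},
    "medium": {"accounts": 200,  "gang": 10, "decoys": 20, "max_steps": 50, "evasion_steps": [20]},
    "hard":   {"accounts": 1000, "gang": 10, "decoys": 50, "max_steps": 80, "evasion_steps": [15, 30, 45, 60]},
}

def episode_path(task, seed):
    """Build the path episodes/{task}_{seed:03d}.json for a pre-generated episode."""
    if task not in TASKS:
        raise ValueError(f"unknown task: {task!r}")
    if not 0 <= seed < 50:  # assumption: seeds are numbered 0..49
        raise ValueError(f"seed out of range: {seed}")
    return f"episodes/{task}_{seed:03d}.json"
```

For example, `episode_path("hard", 7)` yields `episodes/hard_007.json`.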
Actions
| Action | Cost | Effect |
|---|---|---|
| `inspect` | 1 step | Reveals full AccountProfile (all 22 features), adds neighbors to visible set |
| `investigate_network` | 2 steps | Bidirectional 2-hop expansion: reveals account IDs only (no profiles); re-cascades SUSPECT |
| `flag` | 0 steps | Marks account CONFIRMED_FAKE; dual cascade: follow-graph + IP cluster |
| `unflag` | 0 steps | Clears CONFIRMED_FAKE status |
| `submit` | 0 steps | Ends episode, triggers scoring |
Dual SUSPECT cascade on FLAG:
- Follow-graph: Every visible account that the flagged account follows → SUSPECT (high precision: gang follow density 0.70+).
- IP cluster: Every visible account sharing the same `ip_cluster_id` → SUSPECT (zero false positives: real accounts each have a unique IP; the gang shares `ip_gang_{seed}`).
Both mechanisms surface in obs.suspect_ids — the agent's highest-priority INSPECT targets.
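The dual cascade can be sketched with plain sets and dictionaries. The data structures here (`follows`, `ip_cluster`, `visible`, `confirmed`) and the function name are hypothetical stand-ins; the actual logic lives in server/environment.py and may differ.

```python
def cascade_suspects(flagged_id, follows, ip_cluster, visible, confirmed):
    """Return the account IDs promoted to SUSPECT when flagged_id is flagged.

    follows:    dict mapping account ID -> set of account IDs it follows
    ip_cluster: dict mapping account ID -> ip_cluster_id string
    visible:    set of account IDs the agent has revealed so far
    confirmed:  set of account IDs already marked CONFIRMED_FAKE
    """
    # Follow-graph cascade: visible accounts that the flagged account follows.
    via_follow = {a for a in follows.get(flagged_id, set()) if a in visible}

    # IP-cluster cascade: visible accounts sharing the flagged account's cluster.
    cluster = ip_cluster.get(flagged_id)
    via_ip = set()
    if cluster is not None:
        via_ip = {a for a in visible if ip_cluster.get(a) == cluster}

    # Already-confirmed accounts and the flagged account itself are not suspects.
    return (via_follow | via_ip) - confirmed - {flagged_id}
```

Restricting both cascades to the visible set keeps the mechanism from leaking accounts the agent has not yet paid steps to reveal.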
Risk Scoring (server/scoring.py)
All functions are stateless, called inside _build_profile() at INSPECT time and on re-profiling after each FLAG.
node_risk = 0.60 × photo_reuse + 0.40 × bio_template
age_norm = min(1.0, account_age_days / 365)
behavior_risk = 0.55 × (1 − age_norm) + 0.45 × post_hour_cluster_score
flagged_ratio = flagged_neighbor_count / max(inspected_neighbor_count, 1)
graph_risk = 0.45 × flagged_ratio + 0.35 × mutual_follow_rate + 0.20 × avg_neighbor_photo_reuse
hub_legitimacy = 0.45 × log(1+followers)/log(1+1M)
+ 0.25 × (1 − follow_ratio_norm)
+ 0.20 × age_norm
+ 0.10 × (1 − suspicious_mutual_ratio)
fake_risk = clip(0.30×node_risk + 0.25×behavior_risk + 0.45×graph_risk − 0.25×hub_legitimacy, 0, 1)
Weight rationale: Graph risk (0.45) is dominant — structural signals are hardest to fake and compound across the investigation. Hub legitimacy is subtractive — a celebrity with 5M followers produces hub_legitimacy ≈ 1.0, making their fake_risk near zero even if gang members follow them.
Classification thresholds:
- `fake_risk < 0.35` → normal
- `0.35 ≤ fake_risk < 0.60` → suspect
- `fake_risk ≥ 0.60` → confirmed_fake (formula-level; explicit FLAG overrides)
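The risk formulas and thresholds above translate directly into stateless functions. The sketch below is not the contents of server/scoring.py: function signatures are hypothetical, and `follow_ratio_norm` is taken as an input because the source does not define how it is normalised.

```python
import math

def clip(x, lo=0.0, hi=1.0):
    return max(lo, min(hi, x))

def age_norm(account_age_days):
    return min(1.0, account_age_days / 365)

def node_risk(photo_reuse, bio_template):
    return 0.60 * photo_reuse + 0.40 * bio_template

def behavior_risk(account_age_days, post_hour_cluster_score):
    return 0.55 * (1 - age_norm(account_age_days)) + 0.45 * post_hour_cluster_score

def graph_risk(flagged_neighbors, inspected_neighbors, mutual_follow_rate, avg_neighbor_photo_reuse):
    flagged_ratio = flagged_neighbors / max(inspected_neighbors, 1)
    return 0.45 * flagged_ratio + 0.35 * mutual_follow_rate + 0.20 * avg_neighbor_photo_reuse

def hub_legitimacy(followers, follow_ratio_norm, account_age_days, suspicious_mutual_ratio):
    # follow_ratio_norm is assumed to be pre-normalised to [0, 1] by the caller.
    return (0.45 * math.log(1 + followers) / math.log(1 + 1_000_000)
            + 0.25 * (1 - follow_ratio_norm)
            + 0.20 * age_norm(account_age_days)
            + 0.10 * (1 - suspicious_mutual_ratio))

def fake_risk(node, behavior, graph, hub):
    return clip(0.30 * node + 0.25 * behavior + 0.45 * graph - 0.25 * hub)

def classify(risk):
    if risk < 0.35:
        return "normal"
    if risk < 0.60:
        return "suspect"
    return "confirmed_fake"
```

With illustrative inputs, a young account with high photo reuse and three of four inspected neighbors flagged lands above 0.60, while a 5M-follower account is clipped to 0 by the subtractive hub term even when two of its neighbors are flagged.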
Grader score (normalised [0.0, 1.0], returned by /grader):
recall = tp / 10
precision = tp / max(tp + fp, 1)
efficiency = max(0, (max_steps − steps_used) / max_steps)
if recall ≥ 0.8 AND precision ≥ 0.7:
score = 0.55 + 0.20×recall + 0.15×precision + 0.10×efficiency
else:
score = 0.30×recall + 0.10×precision
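Maximum 1.0 (all 10 found, zero false positives, zero steps used). Win threshold ≈ 0.815, which matches the floor of the winning branch: 0.55 + 0.20×0.8 + 0.15×0.7 with zero efficiency bonus.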
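The grader formula can be checked with a few lines of Python. The function name and argument order are hypothetical; the arithmetic follows the pseudocode above.

```python
GANG_SIZE = 10  # every episode hides exactly 10 gang members

def grader_score(tp, fp, steps_used, max_steps):
    """Normalised score in [0.0, 1.0] from true/false positives and step usage."""
    recall = tp / GANG_SIZE
    precision = tp / max(tp + fp, 1)
    efficiency = max(0.0, (max_steps - steps_used) / max_steps)
    if recall >= 0.8 and precision >= 0.7:
        # Winning branch: base bonus plus weighted recall, precision, efficiency.
        return 0.55 + 0.20 * recall + 0.15 * precision + 0.10 * efficiency
    # Losing branch: no base bonus and no efficiency credit.
    return 0.30 * recall + 0.10 * precision
```

Under this formula, a perfect easy run that uses 27 of 30 steps scores 0.55 + 0.20 + 0.15 + 0.10 × 0.1 = 0.91, and finding only 7 members with no false positives scores 0.30 × 0.7 + 0.10 × 1.0 = 0.31, so the gap between the two branches is large.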
Hybrid Policy (agent/hybrid_policy.py)
The agent blends a deterministic rule engine with Qwen3-Next-80B (via AWS Bedrock) using a per-task trust weight α.
Alpha update (per episode, after win/loss recorded):
reflection_factor = min(1.0, n_reflections / 4.0)
raw = 0.20 + reflection_factor × (0.80 × recent_win_rate + 0.12)
alpha = clamp(raw, 0.20, task_cap)
| Task | α cap | Rationale |
|---|---|---|
| easy | 0.50 | Rule engine alone hits ~91% — LLM assists, doesn't override |
| medium | 0.70 | Decoys require LLM judgment, but cascade must stay |
| hard | 0.85 | LLM needs latitude for evasion adaptation |
reflection_factor gates α: trust ramps linearly with the number of post-episode lessons and reaches its full value only once the LLM has accumulated ≥4 of them, regardless of raw win rate.
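The alpha update and per-task caps can be written as a single function. The names `update_alpha`, `ALPHA_FLOOR`, and `TASK_CAPS` are hypothetical; the constants come from the formula and table above.

```python
ALPHA_FLOOR = 0.20
TASK_CAPS = {"easy": 0.50, "medium": 0.70, "hard": 0.85}

def update_alpha(task, n_reflections, recent_win_rate):
    """Recompute the LLM trust weight after an episode's win/loss is recorded."""
    reflection_factor = min(1.0, n_reflections / 4.0)
    raw = ALPHA_FLOOR + reflection_factor * (0.80 * recent_win_rate + 0.12)
    # Clamp between the global floor and the per-task cap.
    return max(ALPHA_FLOOR, min(raw, TASK_CAPS[task]))
```

With no reflections the result stays at the 0.20 floor whatever the win rate. With four reflections and a 50% recent win rate on hard, raw = 0.20 + 0.52 = 0.72, which is below the 0.85 cap and is returned unchanged.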
Blending decision:
rule_action, rule_conf = get_rule_action(obs) # deterministic, with confidence score
llm_action, _ = get_action(obs, ...) # Qwen3 via Bedrock
if rule_action == llm_action: final = llm_action # agree
elif rule_conf >= alpha: final = rule_action # rule overrides
else: final = llm_action # LLM trusted
Rule confidences: SUBMIT-forced=1.00, INSPECT-suspect=0.95, FLAG-high-risk=0.95, FLAG-threshold=0.70+, INSPECT-explore=0.30. At α=0.50 (easy cap), safety decisions (suspects, forced submit) always override; exploration goes to the LLM.
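The blending decision reduces to a three-way comparison. In the sketch below the function name and the `(action_type, account_id)` tuple representation are assumptions made for illustration, as are the account IDs in the usage note.

```python
def blend(rule_action, rule_conf, llm_action, alpha):
    """Pick the final action from the rule engine and LLM proposals.

    Actions are represented as (action_type, account_id) tuples (an assumption).
    """
    if rule_action == llm_action:
        return llm_action   # both policies agree
    if rule_conf >= alpha:
        return rule_action  # rule is confident enough to override
    return llm_action       # LLM is trusted over a low-confidence rule
```

At α = 0.50, a rule proposal to inspect a suspect (confidence 0.95) wins over a different LLM proposal, whereas an exploratory inspect (confidence 0.30) yields to the LLM. At the 0.20 floor, even the exploratory rule action wins, which is how rules dominate early in training.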
Reflexion learning: After each episode, Qwen3 generates a 2–3 sentence lesson from the action log and outcome. Lessons are stored in memory/reflections_{task}.jsonl and injected into every future prompt (last 4 lessons + best winning trajectory as few-shot example). Memory persists across container restarts via Docker volume.
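The persistence side of Reflexion can be sketched as an append-only JSONL file per task. Only the path pattern memory/reflections_{task}.jsonl and the "last 4 lessons" rule come from the text above; the record schema (`lesson`, `won`), the function names, and the prompt wording are assumptions.

```python
import json
from pathlib import Path

def append_reflection(memory_dir, task, lesson, won):
    """Append one post-episode lesson to reflections_{task}.jsonl."""
    path = Path(memory_dir) / f"reflections_{task}.jsonl"
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps({"lesson": lesson, "won": won}) + "\n")

def recent_lessons(memory_dir, task, k=4):
    """Return the last k lessons for a task, oldest first."""
    path = Path(memory_dir) / f"reflections_{task}.jsonl"
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as f:
        records = [json.loads(line) for line in f if line.strip()]
    return [r["lesson"] for r in records[-k:]]

def build_prompt_prefix(lessons):
    """Format lessons as a block to prepend to the next episode's prompt."""
    if not lessons:
        return ""
    bullets = "\n".join(f"- {lesson}" for lesson in lessons)
    return f"Lessons from previous episodes:\n{bullets}\n"
```

An append-only file keeps writes cheap and survives restarts when the directory is a mounted volume; trimming to the last four lessons happens at read time, so older lessons remain on disk.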
API Reference
| Endpoint | Method | Description |
|---|---|---|
| `/health` | GET | `{"status": "healthy"}` |
| `/tasks` | GET | Task list + action_schema + score_range: [0.0, 1.0] |
| `/reset` | POST | `{task, seed}` → initial observation |
| `/step` | POST | `{action_type, account_id?}` → updated observation |
| `/state` | GET | Episode metadata (step count, task, score, evasion count) |
| `/grader` | GET | Normalised [0.0, 1.0] score after SUBMIT (400 if not done) |
| `/baseline` | POST | Runs rule-based agent on all 3 tasks, seed=0 |
| `/metadata` | GET | OpenEnv metadata block |
| `/schema` | GET | Full JSON schema for actions and observations |
| `/mcp` | POST | JSON-RPC 2.0 tool discovery (Model Context Protocol) |
Live: https://pandago-graphstrike.hf.space
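A minimal Python client for these endpoints might look like the sketch below, using only the standard library. It assumes the POST endpoints accept JSON bodies with the field names shown in the table and that the server tracks the current episode without extra session handling; neither is confirmed here, and the helper names are hypothetical. The `/schema` endpoint is the authoritative description of the payloads.

```python
import json
import urllib.request

BASE_URL = "https://pandago-graphstrike.hf.space"

def build_request(method, path, payload=None, base_url=BASE_URL):
    """Create a urllib Request; a JSON body is attached only when payload is given."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    headers = {"Content-Type": "application/json"} if data is not None else {}
    return urllib.request.Request(base_url + path, data=data, headers=headers, method=method)

def call(method, path, payload=None, base_url=BASE_URL, timeout=30):
    """Send the request and decode the JSON response."""
    req = build_request(method, path, payload, base_url)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))

def reset(task, seed):
    return call("POST", "/reset", {"task": task, "seed": seed})

def step(action_type, account_id=None):
    payload = {"action_type": action_type}
    if account_id is not None:  # account_id is optional, e.g. for submit
        payload["account_id"] = account_id
    return call("POST", "/step", payload)

def grader():
    # The table notes a 400 response if the episode is not done;
    # urllib raises urllib.error.HTTPError in that case.
    return call("GET", "/grader")
```

A typical session would call `reset("easy", 0)`, issue a sequence of `step(...)` calls ending with `step("submit")`, and then read the score with `grader()`.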
File Structure
server/
app.py — FastAPI + Gradio UI (gr.mount_gradio_app)
environment.py — Episode lifecycle, action mechanics, cascade logic
generator.py — Deterministic episode generation (150 JSON files)
scoring.py — Stateless risk formula functions
models.py — Pydantic models: AccountProfile, FakeGangObservation, ActionType
agent/
policy.py — Qwen3 prompt construction + action parsing
hybrid_policy.py — Alpha blending, rule engine with confidence scores
reflection.py — Post-episode lesson generation
memory.py — JSONL persistence for reflections, trajectories, alpha
inference.py — Submission entrypoint: [START]/[STEP]/[END] structured logs, OpenAI client
validate.py — 24-point pre-submission validator (local + HTTP)
train.py — Full training loop with curriculum
episodes/ — 150 pre-generated JSON episode files (baked into Docker image)
memory/ — Docker volume: reflections, win history, alpha values
Baseline Scores
| Task | Seed=0 | Win rate (50 seeds) | Mean (50 seeds) |
|---|---|---|---|
| easy | 0.910 | 100% | ~0.91 |
| medium | 0.906 | 84% | ~0.77 |
| hard | 0.9038 | 52% | ~0.47 |
The rule-based baseline (no LLM) is competitive on easy/medium. Hard is the real differentiator — evasion events drop intra-gang edges mid-investigation, destroying graph signals. Frontier LLM agents with accumulated reflections adapt; the rule engine degrades.
Built by team computeXor