---
title: GraphStrike
emoji: 🕵️
colorFrom: blue
colorTo: indigo
sdk: docker
app_port: 7860
pinned: false
license: mit
tags:
  - reinforcement-learning
  - social-network
  - fraud-detection
  - openenv
  - llm-agent
---



An OpenEnv-compatible reinforcement learning environment where an LLM agent must identify all 10 members of a coordinated fake account network hidden inside a synthetic social network. The agent learns via Reflexion and a dynamic hybrid rule/LLM policy, not via gradient updates or fine-tuning.



Deployed Endpoint Verification

The live environment at huggingface.co/spaces/Pandago/graphstrike responds to all standard OpenEnv endpoints:

# Health check
curl https://pandago-graphstrike.hf.space/health
# → {"status": "healthy"}

# Task discovery
curl https://pandago-graphstrike.hf.space/tasks
# → {"tasks": ["easy","medium","hard"], "action_schema": {...}, "score_range": [0.0, 1.0]}

# Baseline (deterministic, reproducible)
curl -X POST https://pandago-graphstrike.hf.space/baseline
# → {"scores": {"easy": 0.91, "medium": 0.906, "hard": 0.9038}, "agent": "rule_based"}


We evaluate GraphStrike's hybrid rule/LLM policy across multiple frontier models to measure how well each model handles the investigation task. All runs use the same inference pipeline (inference.py) with identical system prompts and structured logging. Each model ran: (1) seed=0 on all 3 tasks, and (2) seeds 0-2 on all 3 tasks for variance measurement.


Seed=0 scores (single episode per task):

Model Performance Table


3-seed variance scores (mean across seeds 0, 1, 2):

Model Performance Table


Rule-Based Baseline (no LLM, deterministic)

Model Performance Table



The task: A social network contains fake accounts organised into a single coordinated network of 10. The network behaves in a coordinated way — same posting hour, same IP subnet, stolen celebrity photos, copy-paste bios. The agent must find all 10 by navigating a limited step budget, inspecting accounts, and flagging suspects.

What makes this non-trivial: The network is large (50–1000 accounts depending on difficulty). Fake accounts are mixed with innocent high-signal "decoy" accounts. In hard mode, the fake accounts actively evade while the agent is mid-investigation, dropping intra-gang follows and renaming profiles. The agent cannot see the full network upfront: it must explore via INSPECT and INVESTIGATE_NETWORK actions, spending steps to reveal information.

What makes the learning novel: The inference LLM (accessed via AWS Bedrock) cannot be fine-tuned; it is a black-box API. The agent instead learns via Reflexion: post-episode lessons are written back into memory and injected into every future prompt. A dynamic hybrid policy (α-weighted) blends the LLM with a deterministic rule engine, with the blend weight α updating based on recent win rate. Rules dominate early; the LLM takes over as it proves itself.


Detection Signals

Detection operates entirely on numeric metadata — no content processing. Three signal categories, computed at different points:

Node signals (pre-computed by content pipeline, static per account):

| Feature | Fake range | Real range | Notes |
| --- | --- | --- | --- |
| photo_reuse_score | 0.30–0.95 | 0.00–0.15 | pHash fingerprint match against celebrity photo DB |
| bio_template_score | 0.20–0.90 | 0.00–0.12 | Cosine similarity to known fake bio templates |
| comment_repeat_score | 0.60–0.90 | 0.00–0.08 | Copy-paste spam fraction across accounts |

Behavioral signals (static, from account metadata):

| Feature | Fake pattern |
| --- | --- |
| avg_post_hour | All 10 gang members cluster within ±0.5 hours (coordinated scheduling) |
| account_age_days | Created within the same week (base_age ± 7 days) |
| shared_ip_count | 9 (all 10 share one IP subnet, ip_gang_{seed}) |

Graph signals (dynamic — computed at INSPECT time, shift as investigation progresses):

| Feature | Why it matters |
| --- | --- |
| mutual_follow_rate | Gang members mutually follow each other at 0.6–0.9 density; legitimate hubs don't follow back |
| flagged_neighbor_count | Grows as more gang members are flagged (the cascade signal) |
| post_hour_cluster_score | Alignment to mean posting hour of currently-flagged accounts (wrap-around aware) |
| suspicious_mutual_ratio | Used to compute hub legitimacy; protects celebrities from false positives |

Graph signals are the most powerful: once one gang member is flagged, flagged_neighbor_count rises for all connected members, compounding with each subsequent flag.
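The "wrap-around aware" alignment used by post_hour_cluster_score can be illustrated with circular arithmetic on a 24-hour clock. The sketch below is an assumption about how such a score might be computed, not the code in server/scoring.py: the function names, the linear mapping to [0, 1], and the 0.0 result when nothing is flagged are all hypothetical.

```python
import math


def circular_mean_hour(hours):
    """Mean hour on a 24-hour clock, computed on the unit circle."""
    angles = [2 * math.pi * h / 24 for h in hours]
    x = sum(math.cos(a) for a in angles)
    y = sum(math.sin(a) for a in angles)
    return (math.atan2(y, x) * 24 / (2 * math.pi)) % 24


def circular_hour_distance(a, b):
    """Shortest distance between two hours, so 23:00 and 01:00 are 2 apart."""
    d = abs(a - b) % 24
    return min(d, 24 - d)


def post_hour_cluster_score(avg_post_hour, flagged_hours):
    """Hypothetical score: 1.0 at the flagged mean, 0.0 twelve hours away."""
    if not flagged_hours:
        return 0.0  # assumption: no flagged accounts means no alignment signal
    mean = circular_mean_hour(flagged_hours)
    return 1.0 - circular_hour_distance(avg_post_hour, mean) / 12.0
```

A plain arithmetic mean of 23 and 1 would give 12, placing the cluster at noon; the circular mean places it at midnight, which is why an account posting at 23:30 scores high against flagged accounts at 23:00 and 01:00.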


Synthetic Network Composition

150 episodes pre-generated deterministically (50 per task). Each episode is a JSON file (episodes/{task}_{seed:03d}.json).

| Task | Accounts | Gang | Decoys | Max steps | Evasion |
| --- | --- | --- | --- | --- | --- |
| easy | 50 | 10 | 0 | 30 | None |
| medium | 200 | 10 | 20 | 50 | Step 20 (once) |
| hard | 1000 | 10 | 50 | 80 | Steps 15/30/45/60 |

  • Gang: Dense intra-follow graph (density 0.60–0.80), same IP subnet, tightly clustered post hours (std 0.5/1.5/2.5 by task).
  • Decoys (medium/hard only): Real accounts with elevated photo_reuse and bio_template scores (0.20–0.40). They score as suspicious but are not gang members — they penalise reckless flagging.
  • Celebrities (2 per episode): 100k–5M followers, near-zero fake scores. Hub legitimacy formula protects them.
  • Zero-edge isolates (2 per episode): follower_count=0, no edges. Test whether the agent wastes steps on disconnected nodes.
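
The per-task parameters and the episode file-naming scheme described in this section can be captured in a small lookup. This is an illustrative sketch rather than the contents of server/generator.py: TaskConfig, TASKS, and episode_path are hypothetical names, and only the numbers and the filename pattern come from the text above.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class TaskConfig:
    accounts: int
    gang: int
    decoys: int
    max_steps: int
    evasion_steps: tuple  # steps at which an evasion event fires


# Values copied from the composition table.
TASKS = {
    "easy": TaskConfig(50, 10, 0, 30, ()),
    "medium": TaskConfig(200, 10, 20, 50, (20,)),
    "hard": TaskConfig(1000, 10, 50, 80, (15, 30, 45, 60)),
}


def episode_path(task, seed, root="episodes"):
    """Build episodes/{task}_{seed:03d}.json for a known task."""
    if task not in TASKS:
        raise ValueError(f"unknown task: {task!r}")
    return Path(root) / f"{task}_{seed:03d}.json"
```

For example, episode_path("hard", 7) points at episodes/hard_007.json.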

Actions

| Action | Cost | Effect |
| --- | --- | --- |
| inspect | 1 step | Reveals full AccountProfile (all 22 features), adds neighbors to visible set |
| investigate_network | 2 steps | Bidirectional 2-hop expansion; reveals account IDs only (no profiles); re-cascades SUSPECT |
| flag | 0 steps | Marks account CONFIRMED_FAKE; dual cascade: follow-graph + IP cluster |
| unflag | 0 steps | Clears CONFIRMED_FAKE status |
| submit | 0 steps | Ends episode, triggers scoring |

Dual SUSPECT cascade on FLAG:

  1. Follow-graph: Every visible account that the flagged account follows → SUSPECT (high precision: gang follow density 0.70+).
  2. IP cluster: Every visible account sharing the same ip_cluster_id → SUSPECT (zero false positives: real accounts each have a unique IP; gang shares ip_gang_{seed}).

Both mechanisms surface in obs.suspect_ids — the agent's highest-priority INSPECT targets.
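
The two cascade rules can be sketched as a single function over simple dictionaries. The data structures here (follows, ip_cluster, visible, status) and the choice not to downgrade accounts that are already confirmed are assumptions made for illustration; the actual logic lives in server/environment.py and may differ.

```python
def cascade_suspects(flagged_id, follows, ip_cluster, visible, status):
    """Mark SUSPECT accounts after a FLAG and return the newly marked IDs.

    follows:    dict account_id -> set of account IDs it follows
    ip_cluster: dict account_id -> ip_cluster_id
    visible:    set of account IDs the agent has revealed so far
    status:     dict account_id -> "suspect" | "confirmed_fake" (mutated)
    """
    candidates = set()

    # 1. Follow-graph: visible accounts that the flagged account follows.
    for acc in follows.get(flagged_id, ()):
        if acc in visible:
            candidates.add(acc)

    # 2. IP cluster: visible accounts sharing the flagged account's cluster.
    cluster = ip_cluster.get(flagged_id)
    if cluster is not None:
        for acc in visible:
            if acc != flagged_id and ip_cluster.get(acc) == cluster:
                candidates.add(acc)

    # Assumption: an account that is already confirmed stays confirmed.
    newly = {a for a in candidates if status.get(a) != "confirmed_fake"}
    for acc in newly:
        status[acc] = "suspect"
    return newly
```

Note that the follow-graph rule marks every visible followee, so a celebrity followed by a gang member also becomes SUSPECT until an INSPECT shows its low fake_risk.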


Risk Scoring (server/scoring.py)

All functions are stateless, called inside _build_profile() at INSPECT time and on re-profiling after each FLAG.

node_risk     = 0.60 × photo_reuse + 0.40 × bio_template

age_norm      = min(1.0, account_age_days / 365)
behavior_risk = 0.55 × (1 − age_norm) + 0.45 × post_hour_cluster_score

flagged_ratio = flagged_neighbor_count / max(inspected_neighbor_count, 1)
graph_risk    = 0.45 × flagged_ratio + 0.35 × mutual_follow_rate + 0.20 × avg_neighbor_photo_reuse

hub_legitimacy = 0.45 × log(1+followers)/log(1+1M)
              + 0.25 × (1 − follow_ratio_norm)
              + 0.20 × age_norm
              + 0.10 × (1 − suspicious_mutual_ratio)

fake_risk = clip(0.30×node_risk + 0.25×behavior_risk + 0.45×graph_risk − 0.25×hub_legitimacy, 0, 1)

Weight rationale: Graph risk (0.45) is dominant — structural signals are hardest to fake and compound across the investigation. Hub legitimacy is subtractive — a celebrity with 5M followers produces hub_legitimacy ≈ 1.0, making their fake_risk near zero even if gang members follow them.

Classification thresholds:

  • fake_risk < 0.35 → normal
  • 0.35 ≤ fake_risk < 0.60 → suspect
  • fake_risk ≥ 0.60 → confirmed_fake (formula-level; explicit FLAG overrides)
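
The formulas and thresholds above translate directly into Python. The sketch below follows the stated weights term by term, but the function signatures are assumptions, and follow_ratio_norm is taken as a precomputed input in [0, 1] because its definition is not given here.

```python
import math


def clip(x, lo=0.0, hi=1.0):
    return max(lo, min(hi, x))


def age_norm(account_age_days):
    return min(1.0, account_age_days / 365)


def node_risk(photo_reuse, bio_template):
    return 0.60 * photo_reuse + 0.40 * bio_template


def behavior_risk(account_age_days, post_hour_cluster_score):
    return 0.55 * (1 - age_norm(account_age_days)) + 0.45 * post_hour_cluster_score


def graph_risk(flagged_neighbors, inspected_neighbors, mutual_follow_rate,
               avg_neighbor_photo_reuse):
    flagged_ratio = flagged_neighbors / max(inspected_neighbors, 1)
    return (0.45 * flagged_ratio + 0.35 * mutual_follow_rate
            + 0.20 * avg_neighbor_photo_reuse)


def hub_legitimacy(followers, follow_ratio_norm, account_age_days,
                   suspicious_mutual_ratio):
    return (0.45 * math.log1p(followers) / math.log1p(1_000_000)
            + 0.25 * (1 - follow_ratio_norm)
            + 0.20 * age_norm(account_age_days)
            + 0.10 * (1 - suspicious_mutual_ratio))


def fake_risk(node, behavior, graph, hub):
    return clip(0.30 * node + 0.25 * behavior + 0.45 * graph - 0.25 * hub)


def classify(risk):
    if risk < 0.35:
        return "normal"
    if risk < 0.60:
        return "suspect"
    return "confirmed_fake"


# Hypothetical gang member: young account, reused photo, 2 of 4 neighbors flagged.
gang = fake_risk(node_risk(0.8, 0.6), behavior_risk(30, 0.9),
                 graph_risk(2, 4, 0.7, 0.6), hub_legitimacy(200, 0.8, 30, 0.9))

# Hypothetical celebrity: 5M followers, old account, one flagged neighbor.
celeb = fake_risk(node_risk(0.05, 0.05), behavior_risk(2000, 0.1),
                  graph_risk(1, 5, 0.0, 0.3), hub_legitimacy(5_000_000, 0.0, 2000, 0.0))
```

With these made-up inputs the gang member lands at roughly 0.65 (confirmed_fake), while the celebrity's subtractive hub term pushes the raw sum below zero, so the clipped risk is 0.0 (normal).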

Grader score (normalised [0.0, 1.0], returned by /grader):

recall    = tp / 10
precision = tp / max(tp + fp, 1)
efficiency = max(0, (max_steps − steps_used) / max_steps)

if recall ≥ 0.8 AND precision ≥ 0.7:
    score = 0.55 + 0.20×recall + 0.15×precision + 0.10×efficiency
else:
    score = 0.30×recall + 0.10×precision

Maximum 1.0 (all 10 found, zero false positives, zero steps used). Win threshold ≈ 0.815.
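
The grader formula and the quoted win threshold can be checked with a few lines of Python. This is a sketch of the formula as written above, not the server's grader code; the function names and the gang_size parameter are hypothetical.

```python
def score_from_rates(recall, precision, efficiency):
    """Apply the two-branch scoring rule to already-computed rates."""
    if recall >= 0.8 and precision >= 0.7:
        return 0.55 + 0.20 * recall + 0.15 * precision + 0.10 * efficiency
    return 0.30 * recall + 0.10 * precision


def grader_score(tp, fp, steps_used, max_steps, gang_size=10):
    """Score an episode from true positives, false positives, and steps used."""
    recall = tp / gang_size
    precision = tp / max(tp + fp, 1)
    efficiency = max(0.0, (max_steps - steps_used) / max_steps)
    return score_from_rates(recall, precision, efficiency)
```

The win threshold is the lowest score on the winning branch: recall 0.8, precision 0.7, and efficiency 0 give 0.55 + 0.16 + 0.105 = 0.815. Any episode that misses either gate scores below 0.37, so the two branches never overlap.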


Hybrid Policy (agent/hybrid_policy.py)

The agent blends a deterministic rule engine with Qwen3-Next-80B (via AWS Bedrock) using a per-task trust weight α.

Alpha update (per episode, after win/loss recorded):

reflection_factor = min(1.0, n_reflections / 4.0)
raw   = 0.20 + reflection_factor × (0.80 × recent_win_rate + 0.12)
alpha = clamp(raw, 0.20, task_cap)

| Task | α cap | Rationale |
| --- | --- | --- |
| easy | 0.50 | Rule engine alone hits ~91%; the LLM assists, doesn't override |
| medium | 0.70 | Decoys require LLM judgment, but cascade must stay |
| hard | 0.85 | LLM needs latitude for evasion adaptation |

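reflection_factor gates α: it scales the learned term linearly with the number of stored lessons, so the LLM needs at least 4 post-episode lessons before its win rate counts at full weight. With zero lessons, α stays at the 0.20 floor regardless of raw win rate.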
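
The alpha update can be written out as a small function to make the gating concrete. This sketch mirrors the formula and caps above; update_alpha and TASK_CAPS are hypothetical names, not necessarily those used in agent/hybrid_policy.py.

```python
TASK_CAPS = {"easy": 0.50, "medium": 0.70, "hard": 0.85}


def update_alpha(n_reflections, recent_win_rate, task):
    """Recompute the LLM trust weight for a task after an episode."""
    reflection_factor = min(1.0, n_reflections / 4.0)
    raw = 0.20 + reflection_factor * (0.80 * recent_win_rate + 0.12)
    return max(0.20, min(raw, TASK_CAPS[task]))
```

For instance, with 2 stored lessons and a 50% recent win rate on hard, the factor is 0.5 and α = 0.20 + 0.5 × 0.52 = 0.46. With 4 lessons and a perfect win rate the raw value is 1.12, which the cap reduces to 0.85 on hard and 0.50 on easy.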

Blending decision:

rule_action, rule_conf = get_rule_action(obs)   # deterministic, with confidence score
llm_action,  _        = get_action(obs, ...)    # Qwen3 via Bedrock

if rule_action == llm_action:   final = llm_action     # agree
elif rule_conf >= alpha:        final = rule_action     # rule overrides
else:                           final = llm_action      # LLM trusted

Rule confidences: SUBMIT-forced=1.00, INSPECT-suspect=0.95, FLAG-high-risk=0.95, FLAG-threshold=0.70+, INSPECT-explore=0.30. At α=0.50 (easy cap), safety decisions (suspects, forced submit) always override; exploration goes to the LLM.
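
The blending decision can be isolated into a pure function, which makes the effect of α easy to test. In this sketch actions are represented as (action_type, account_id) tuples and the second return value names which branch fired; both are assumptions for illustration.

```python
def blend(rule_action, rule_conf, llm_action, alpha):
    """Pick the final action from the rule engine and LLM proposals.

    Returns (action, source) where source is "agree", "rule", or "llm".
    """
    if rule_action == llm_action:
        return llm_action, "agree"
    if rule_conf >= alpha:
        return rule_action, "rule"   # rule confidence beats current LLM trust
    return llm_action, "llm"         # LLM trusted over a low-confidence rule


# At the easy cap (alpha=0.50), an INSPECT-suspect rule (0.95) overrides the LLM,
# while an INSPECT-explore rule (0.30) defers to it.
override = blend(("inspect", "a1"), 0.95, ("flag", "a2"), 0.50)
explore = blend(("inspect", "a1"), 0.30, ("inspect", "a9"), 0.50)
```

At the α floor of 0.20, every listed rule confidence (0.30 and above) satisfies rule_conf >= alpha, so the rule engine wins every disagreement until reflections and wins raise α.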

Reflexion learning: After each episode, Qwen3 generates a 2–3 sentence lesson from the action log and outcome. Lessons are stored in memory/reflections_{task}.jsonl and injected into every future prompt (last 4 lessons + best winning trajectory as few-shot example). Memory persists across container restarts via Docker volume.
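
A minimal version of the lesson store might look like the following. Only the file name pattern (reflections_{task}.jsonl) and the "last 4 lessons" window come from the description above; the record fields (lesson, won) and the function names are assumptions, and agent/memory.py may store more than this.

```python
import json
from pathlib import Path


def _reflections_file(memory_dir, task):
    return Path(memory_dir) / f"reflections_{task}.jsonl"


def append_reflection(memory_dir, task, lesson, won):
    """Append one post-episode lesson as a JSON line."""
    path = _reflections_file(memory_dir, task)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps({"lesson": lesson, "won": won}) + "\n")


def load_recent_lessons(memory_dir, task, k=4):
    """Return the text of the last k lessons, oldest first."""
    path = _reflections_file(memory_dir, task)
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as fh:
        records = [json.loads(line) for line in fh if line.strip()]
    return [r["lesson"] for r in records[-k:]]


def lessons_prompt_block(lessons):
    """Format lessons for injection into the system prompt."""
    if not lessons:
        return ""
    return "Lessons from past episodes:\n" + "\n".join(f"- {s}" for s in lessons)
```

Because the file is append-only JSONL, a crash mid-episode cannot corrupt earlier lessons, and pointing memory_dir at the Docker volume is enough to keep them across restarts.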


API Reference

| Endpoint | Method | Description |
| --- | --- | --- |
| /health | GET | {"status": "healthy"} |
| /tasks | GET | Task list + action_schema + score_range: [0.0, 1.0] |
| /reset | POST | {task, seed} → initial observation |
| /step | POST | {action_type, account_id?} → updated observation |
| /state | GET | Episode metadata (step count, task, score, evasion count) |
| /grader | GET | Normalised [0.0, 1.0] score after SUBMIT (400 if not done) |
| /baseline | POST | Runs rule-based agent on all 3 tasks, seed=0 |
| /metadata | GET | OpenEnv metadata block |
| /schema | GET | Full JSON schema for actions and observations |
| /mcp | POST | JSON-RPC 2.0 tool discovery (Model Context Protocol) |

Live: https://pandago-graphstrike.hf.space
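
A client can drive an episode using only /reset, /step, and /grader. The sketch below uses the standard library and treats every response as opaque JSON, since the observation and grader response shapes are not specified here (see /schema for the authoritative definitions). The helper names, the max_calls guard, and the injectable call parameter are assumptions added for illustration and offline testing.

```python
import json
from urllib import request

BASE_URL = "https://pandago-graphstrike.hf.space"


def http_json(method, path, payload=None, base_url=BASE_URL):
    """Send a JSON request and decode the JSON response."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    req = request.Request(base_url + path, data=data, method=method,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))


def run_episode(task, seed, choose_action, call=http_json, max_calls=500):
    """Reset, step until the policy submits, then fetch the grader result.

    choose_action(obs) must return a dict such as
    {"action_type": "inspect", "account_id": "..."} or {"action_type": "submit"}.
    """
    obs = call("POST", "/reset", {"task": task, "seed": seed})
    for _ in range(max_calls):
        action = choose_action(obs)
        obs = call("POST", "/step", action)
        if action["action_type"] == "submit":
            break
    return call("GET", "/grader")
```

If the policy never submits within max_calls, the final /grader request is expected to fail with the 400 noted in the table, which urllib surfaces as an HTTPError.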


File Structure

server/
  app.py          — FastAPI + Gradio UI (gr.mount_gradio_app)
  environment.py  — Episode lifecycle, action mechanics, cascade logic
  generator.py    — Deterministic episode generation (150 JSON files)
  scoring.py      — Stateless risk formula functions
  models.py       — Pydantic models: AccountProfile, FakeGangObservation, ActionType

agent/
  policy.py       — Qwen3 prompt construction + action parsing
  hybrid_policy.py — Alpha blending, rule engine with confidence scores
  reflection.py   — Post-episode lesson generation
  memory.py       — JSONL persistence for reflections, trajectories, alpha

inference.py      — Submission entrypoint: [START]/[STEP]/[END] structured logs, OpenAI client
validate.py       — 24-point pre-submission validator (local + HTTP)
train.py          — Full training loop with curriculum
episodes/         — 150 pre-generated JSON episode files (baked into Docker image)
memory/           — Docker volume: reflections, win history, alpha values

Baseline Scores

| Task | Seed=0 | Win rate (50 seeds) | Mean (50 seeds) |
| --- | --- | --- | --- |
| easy | 0.910 | 100% | ~0.91 |
| medium | 0.906 | 84% | ~0.77 |
| hard | 0.9038 | 52% | ~0.47 |

The rule-based baseline (no LLM) is competitive on easy/medium. Hard is the real differentiator — evasion events drop intra-gang edges mid-investigation, destroying graph signals. Frontier LLM agents with accumulated reflections adapt; the rule engine degrades.


Built by team computeXor