Spaces:
Sleeping
title: GraphStrike
emoji: π΅οΈ
colorFrom: blue
colorTo: indigo
sdk: docker
app_port: 7860
pinned: false
license: mit
tags:
- reinforcement-learning
- social-network
- fraud-detection
- openenv
- llm-agent
base_path: /web
An OpenEnv-compatible reinforcement learning environment where an LLM agent must identify all 10 members of a coordinated fake account network hidden inside a synthetic social network. The agent learns via Reflexion and a dynamic hybrid rule/LLM policy , not via gradient updates or fine-tuning.
Theme
SUPPORT
Customer Service Agents
Complex environment where agents resolve multi-step queries using external tools and APIs.
Problem Statement
The task: A social network contains fake accounts organised into a single coordinated ring of 10. The ring behaves in a coordinated way β same posting hour, same IP subnet, stolen celebrity photos, copy-paste bios. The agent must find all 10 by navigating a limited step budget, inspecting accounts, and flagging suspects.
Proposed Solution
An OpenEnv-compatible reinforcement learning environment where an LLM agent must identify all 10 members of a coordinated fake account ring hidden inside a synthetic social network. The agent learns via Reflexion and a dynamic hybrid rule/LLM policy β not via gradient updates or fine-tuning.
Novelty Highlights
- Adaptive Hybrid Intelligence (Rules + LLM): Unlike static ensembles, GraphStrike dynamically blends deterministic rules and LLM reasoning using a trust gate, shifting control as performance improves.
- Learning Without Fine-Tuning: Instead of updating model weights, the agent learns through Reflexion lessons and best-trajectory memory injected into future prompts.
- Graph-First Detection Pipeline: Detection is not account-by-account only; it uses cascade effects, neighbor propagation, and multi-hop graph expansion to uncover coordinated rings.
- Math-Grounded Decision Control: Risk composition, trust calibration, and grader alignment are formula-driven, making behavior interpretable and reproducible.
- Adversarial Evasion Benchmarking: Hard-mode includes timed evasion events, so success reflects robustness under disruption rather than overfitting to static patterns.
- Safety-Net by Design: High-confidence rule overrides prevent catastrophic LLM errors while preserving LLM flexibility for strategic exploration.
Performance Summary
We evaluate GraphStrike's hybrid rule/LLM policy across multiple frontier models to measure how well each model handles the investigation task. All runs use
the same inference pipeline (inference.py) with identical system prompts and structured logging. Each model ran: (1) seed=0 on all 3 tasks, and
(2) seeds 0-2 on all 3 tasks for variance measurement.
Seed=0 scores (single episode per task):
3-seed variance scores (mean across seeds 0, 1, 2):
Rule-Based Baseline (no LLM, deterministic)
Table of Contents
- What This Is
- The Problem: How Fake Detection Actually Works
- Synthetic Data Generation
- Data Model
- The RL Environment
- Risk Scoring Mathematics
- The LLM Policy (Qwen3 via Bedrock)
- Reflexion β How the Agent Learns
- Hybrid Policy β The Novel Contribution
- Training Loop End-to-End
- API Reference
- Docker Deployment
- Submission Requirements
- Verification & Validation
1. What is this !?
This is an OpenEnv hackathon submission. OpenEnv is a framework for building RL environments with a standard microservice interface (/reset, /step, /state) so that any agent implementation can plug in.
What makes this non-trivial:
- The network is large (50β1000 accounts depending on difficulty).
- Fake accounts are mixed with innocent high-signal "decoy" accounts.
- In hard mode, the gang actively evades β dropping intra-gang follows, renaming profiles β while the agent is mid-investigation.
- The agent cannot see the full network upfront: it must explore via INSPECT and INVESTIGATE_NETWORK actions, spending steps to reveal information.
What makes the learning novel:
- The LLM (inference via AWS Bedrock) cannot be fine-tuned β it is a black-box API.
- The agent learns via Reflexion: post-episode lessons are written back into memory and injected into every future prompt.
- A dynamic hybrid policy (Ξ±-weighted) blends the LLM with a deterministic rule engine, with the blend weight Ξ± updating based on recent win rate. Rules dominate early; the LLM takes over as it proves itself.
System Architecture
2. The Problem: How Fake Detection Actually Works
A real-world fake account detector does not read post content. Detection relies on three categories of signals computed from metadata:
Signal Hierarchy (Node -> Behavioral -> Graph)
- Node signals (offline): content fingerprints like photo reuse, bio-template similarity, and comment repetition provide the first suspicion layer.
- Behavioral signals (temporal/device): coordinated posting hour, account-age clustering, and shared IP subnet add stronger gang-level evidence.
- Graph signals (live at INSPECT): mutual follows, flagged-neighbor growth, and cluster alignment are hardest to evade, so they carry the highest weight in risk scoring.
- False-positive control: high-legitimacy hubs (for example celebrities) are down-weighted through hub-legitimacy discounting.
3. Synthetic Data Generation
File: server/generator.py
Episodes are generated deterministically by seed. 150 episodes are pre-generated (50 per task) and cached as JSON files in episodes/.
Network Composition
| Task | Network size | Gang | Decoys | Real | Max steps |
|---|---|---|---|---|---|
| easy | 50 | 10 | 0 | 40 | 30 |
| medium | 200 | 10 | 20 | 170 | 50 |
| hard | 1000 | 10 | 50 | 940 | 80 |
- Gang accounts: All 10 share
base_age(same creation week), tightly clusteredavg_post_hour, highphoto_reuse_score/bio_template_score,comment_repeat_scorein [0.60, 0.90],ip_cluster_id = "ip_gang_{seed}", and dense intra-gang follow edges (density 0.60β0.80). - Real accounts: Log-normal follower distributions, unique IP clusters, low fake scores.
- Decoy accounts (medium/hard): Real accounts with elevated fraud scores (0.20β0.40 range) β they look suspicious but are NOT gang members and penalise reckless flagging.
- Celebrity accounts (2 per episode): 100kβ5M followers, very low fake scores, high
hub_legitimacy_score. - Zero-edge isolates (2 per episode): No edges β test whether the agent wastes steps on disconnected nodes.
4. Data Model
File: models.py
ActionType
| Value | Cost | Effect |
|---|---|---|
inspect |
1 step | Reveals full AccountProfile + follow list |
investigate_network |
2 steps | Expands 2 hops; reveals account IDs only |
flag |
0 steps | Marks account as gang member; triggers SUSPECT cascade |
unflag |
0 steps | Removes flag; clears CONFIRMED_FAKE status |
submit |
0 steps | Ends episode; triggers scoring |
AccountProfile β key fields
| Category | Fields |
|---|---|
| Raw counts | follower_count, following_count, post_count |
| Temporal | avg_post_hour, account_age_days |
| Content pipeline (0β1) | photo_reuse_score, bio_template_score, comment_repeat_score |
| IP/device | shared_ip_count, ip_cluster_id |
| Graph (live at INSPECT) | mutual_follow_rate, flagged_neighbor_count, avg_neighbor_photo_reuse, post_hour_cluster_score |
| Risk breakdown | fake_risk_score, node_risk, behavior_risk, graph_risk, hub_legitimacy_score |
| Evasion/status | name_change_count, status (NORMAL/SUSPECT/CONFIRMED_FAKE) |
FakeGangObservation β what the agent sees each step
done, reward, visible_accounts, visible_account_ids, flagged_ids, inspected_ids, suspect_ids, graph_edges, steps_remaining, evasion_triggered, evasion_count, task, message
5. The RL Environment
File: server/environment.py
Episode Lifecycle & Action Mechanics
FLAG cascade (dual): When FLAG(X) is called β (1) every visible account that X follows becomes SUSPECT via the follow-graph, and (2) every visible account sharing X's ip_cluster_id becomes SUSPECT. Gang members share ip_gang_{seed}; real accounts have unique IPs β zero false positives.
Reward Function
base_reward = tpΓ1.0 β fpΓ0.5 β fnΓ0.3
Win condition:
easy/medium: recall β₯ 0.8 AND precision β₯ 0.7
hard: recall β₯ 0.9 AND precision β₯ 0.8
Bonuses:
+5.0 full win
+3.0 perfect recall
+2.0 partial win (high recall, low precision)
+1.0 efficiency (SUBMIT with β₯50% steps remaining)
β1.0 per evasion event (hard mode)
β2.0 forced submit (ran out of steps)
Evasion (hard mode)
unfollow_intragang: 30% of intra-gang edges randomly removed mid-investigation β destroys graph signal. Fires 4 times (steps 15, 30, 45, 60).rename_count: Random gang members getname_change_count += 1β a visual evasion signal.
6. Risk Scoring Mathematics
File: server/scoring.py β all functions are stateless and deterministic.
8. The LLM Policy (Qwen3 via Bedrock)
File: agent/policy.py
Model: qwen.qwen3-next-80b-a3b via AWS Bedrock Converse API (maxTokens=512, temperature=0.4)
Prompt Structure
Every step, the policy builds a prompt from three components:
[reflections from past episodes] β grows richer every episode
[best trajectory few-shot example] β best win ever, showing the full action log
βββ CURRENT CASE βββ
[formatted observation] β status badges, risk scores, suspect list
What is your next action?
Accounts in the observation are sorted by fake_risk_score descending, with status badges prepended. fnbr=N(!) highlights when flagged_neighbor_count > 0; [HUB?] warns the LLM not to flag high-legitimacy accounts.
Required Response Format
<thinking>
Reasoning β which account is most suspicious and why.
</thinking>
<action>
INSPECT acc_0041
</action>
If parsing fails, a heuristic fallback inspects the highest-scored uninspected account. Retries use exponential backoff (1s, 2s, 4s) up to 3 attempts.
9. Reflexion β How the Agent Learns
Files: agent/reflection.py, agent/memory.py
The agent cannot update Qwen3's weights β Bedrock is a black-box API. Instead, it learns via Reflexion: post-episode lessons are written as text and injected into future prompts.
Reflexion Learning Loop
Episode N:
1. LLM acts using: system_prompt + reflections[last 4] + best_trajectory
2. Episode ends β WIN or LOSS
3. Post-episode:
LOSS β generate_reflection(action_log, outcome) β lesson stored
WIN β save trajectory if better reward + generate_success_reflection
Episode N+1:
β last 4 reflections + best win trajectory injected into prompt
β LLM has learned from its past
Example generated reflection:
"The starting accounts were all real; I wasted 8 steps inspecting low-signal nodes before pivoting. When photo_reuse and bio_template are both below 0.3 after 3 inspections, immediately use INVESTIGATE_NETWORK to jump to a different graph region."
All memory persists in a Docker volume (memory/) across container restarts β reflections, best trajectories, win history, and Ξ± values per task.
10. Hybrid Policy β The Novel Contribution
File: agent/hybrid_policy.py
Key insight: A new LLM agent starts dumb but improves over time. A rule engine is always consistent but cannot adapt. The hybrid policy exploits both β rules provide a safety net early while the LLM builds its track record; once the LLM proves itself, rules step back.
Architecture
Alpha (Ξ±): The Trust Weight
Ξ± is a per-task value in [0.20, cap] representing current trust in the LLM:
reflection_factor = min(1.0, n_reflections / 4.0)
raw = 0.20 + reflection_factor Γ (0.80 Γ recent_win_rate + 0.12)
Ξ± = clamp(raw, 0.20, cap)
| Task | Ξ± cap | Rationale |
|---|---|---|
| easy | 0.50 | Rule engine alone achieves ~91% β LLM should assist, not override |
| medium | 0.70 | Decoys require some LLM judgment, but cascade must stay |
| hard | 0.85 | LLM needs latitude for evasion adaptation, but safety rules remain |
Alpha trajectory over training (easy task, cap=0.50):
| Episode | Win rate | Reflections | Ξ± (capped) |
|---|---|---|---|
| 1 | 0% | 0 | 0.20 |
| 5 | 20% | 4 | 0.48 |
| 10 | 50% | 9 | 0.50 |
| 20 | 80% | 19 | 0.50 |
Rule Confidence Levels
| Situation | Action | Confidence |
|---|---|---|
| Steps remaining = 0 | SUBMIT | 1.00 |
| Uninspected SUSPECT accounts exist | INSPECT suspects[0] | 0.95 |
fake_risk β₯ 0.85 |
FLAG that account | 0.95 |
fake_risk in [threshold, 0.85) |
FLAG that account | 0.70+ |
| 10 accounts already flagged | SUBMIT | 0.85 |
| Steps remaining β€ 3 | SUBMIT | 0.90 |
| Uninspected accounts available | INSPECT top candidate | 0.30 |
At Ξ±=0.20 (early): rules dominate (~90% of decisions). At Ξ±=0.50 (moderate): LLM controls exploration; rules control safety. At Ξ±=0.85 (high): LLM controls most decisions; rules only override forced submits and uninspected suspects.
Ξ± is saved to memory/alpha_{task}.json and persists across Docker restarts β the agent doesn't reset to 0.20 every time.
11. Training Loop End-to-End
File: train.py
Curriculum
| Phase | Episodes | Task | Goal |
|---|---|---|---|
| 1 | 1β20 | easy | Learn basic signal thresholds, build first reflections |
| 2 | 21β35 | medium | Handle decoys, learn evasion response |
| 3 | 36β50 | hard | Feature-only detection, persistent evasion |
Seeds rotate deterministically: seed = (episode_num + task_offset) % 50
Per-Episode Flow
for ep in range(n_episodes):
1. DETERMINE TASK curriculum_task(ep) or fixed task
2. COMPUTE ALPHA compute_alpha(win_rate, n_reflections, task)
3. LOAD CONTEXT last 4 reflections + best win trajectory
4. RUN EPISODE while not obs.done:
blend(rule_action, llm_action, rule_conf, Ξ±)
β obs = env.step(final)
5. POST-EPISODE record_win β update Ξ± β generate reflection
6. LOG task | win/loss | reward | recall | precision | Ξ± | modes
Episode metrics (flushed to runs/metrics.jsonl every 5 episodes) include: episode, task, won, reward, recall, precision, steps_used, alpha_used, mode_agree, mode_rule, mode_llm, n_reflections_used.
You can watch the transition: early episodes have high rule counts; later episodes have high agree counts (LLM learned to make the same decisions as the rules, but also brings strategic reasoning the rules can't).
12. API Reference
File: server/app.py
| Endpoint | Method | Description |
|---|---|---|
/health |
GET | {"status": "healthy"} |
/tasks |
GET | Task list + action_schema + score_range: [0.0, 1.0] |
/reset |
POST | Accepts {task, seed} β returns initial observation |
/step |
POST | Accepts any FakeGangAction β returns updated observation |
/state |
GET | Current episode metadata (step count, task, score) |
/grader |
GET | Normalised [0.0, 1.0] score after SUBMIT |
/baseline |
POST | Runs rule-based agent on all 3 tasks, returns scores |
Baseline performance:
| Task | Seed=0 score | Win rate (50 seeds) | Mean score (50 seeds) |
|---|---|---|---|
| easy | 0.91 | 100% | ~0.91 |
| medium | 0.906 | 84% | ~0.77 |
| hard | 0.9038 | 52% | ~0.47 |
13. Docker Deployment
# Build
docker build -f server/Dockerfile -t graphstrike .
# Run
docker run -it \
-e AWS_ACCESS_KEY_ID=your_key \
-e AWS_SECRET_ACCESS_KEY=your_secret \
-v $(pwd)/memory:/app/memory \
-v $(pwd)/runs:/app/runs \
-p 8000:8000 \
graphstrike
The memory/ and runs/ volumes preserve all learning between container restarts.
Environment Variables
| Variable | Default | Description |
|---|---|---|
AWS_ACCESS_KEY_ID |
(required) | For Bedrock/Qwen3 access |
AWS_SECRET_ACCESS_KEY |
(required) | For Bedrock/Qwen3 access |
AWS_DEFAULT_REGION |
us-east-1 |
Bedrock region |
TRAIN_TASK |
(curriculum) | Fix to easy/medium/hard |
TRAIN_EPISODES |
50 |
Total training episodes |
TRAIN_TEMP |
0.4 |
LLM sampling temperature |
TRAIN_VERBOSE |
0 |
Set 1 for per-step action logging |
SERVER_PORT |
8000 |
FastAPI port |
Startup Sequence (run.sh)
1. Validate AWS credentials
2. python server/generator.py β generates 150 episode JSON files
3. uvicorn server.app:app β starts the environment server
4. Health check polling β waits until /health responds
5. python train.py β runs the full training loop
Full HTTP validation
python3 -m uvicorn server.app:app --port 8001 &
sleep 3
python3 validate.py --url http://localhost:8001
# Expected: Results: 24/24 passed β all OK
Deployed Endpoint Verification
curl https://pandago-graphstrike.hf.space/health
# β {"status": "healthy"}
curl https://pandago-graphstrike.hf.space/tasks
# β {"tasks": ["easy","medium","hard"], "action_schema": {...}, "score_range": [0.0, 1.0]}
curl -X POST https://pandago-graphstrike.hf.space/baseline
# β {"scores": {"easy": 0.91, "medium": 0.906, "hard": 0.9038}, "agent": "rule_based"}








