---
title: GraphStrike
emoji: 🕵️
colorFrom: blue
colorTo: indigo
sdk: docker
app_port: 7860
pinned: false
license: mit
tags:
- reinforcement-learning
- social-network
- fraud-detection
- openenv
- llm-agent
base_path: /web
---
An OpenEnv-compatible reinforcement learning environment where an LLM agent must identify all 10 members of a coordinated fake account network hidden inside a synthetic social network. The agent learns via Reflexion and a dynamic hybrid rule/LLM policy, not via gradient updates or fine-tuning.
## Theme
**SUPPORT**
### Customer Service Agents
Complex environment where agents resolve multi-step queries using external tools and APIs.
## Problem Statement
**The task:** A social network contains fake accounts organised into a single coordinated ring of 10. The ring behaves in a coordinated way — same posting hour, same IP subnet, stolen celebrity photos, copy-paste bios. The agent must find all 10 by navigating a limited step budget, inspecting accounts, and flagging suspects.
## Proposed Solution
An OpenEnv-compatible reinforcement learning environment where an LLM agent must identify all 10 members of a coordinated fake account ring hidden inside a synthetic social network. The agent learns via **Reflexion** and a **dynamic hybrid rule/LLM policy** — not via gradient updates or fine-tuning.
---
## Novelty Highlights
- **Adaptive Hybrid Intelligence (Rules + LLM):** Unlike static ensembles, GraphStrike dynamically blends deterministic rules and LLM reasoning using a trust gate, shifting control as performance improves.
- **Learning Without Fine-Tuning:** Instead of updating model weights, the agent learns through Reflexion lessons and best-trajectory memory injected into future prompts.
- **Graph-First Detection Pipeline:** Detection is not account-by-account only; it uses cascade effects, neighbor propagation, and multi-hop graph expansion to uncover coordinated rings.
- **Math-Grounded Decision Control:** Risk composition, trust calibration, and grader alignment are formula-driven, making behavior interpretable and reproducible.
- **Adversarial Evasion Benchmarking:** Hard-mode includes timed evasion events, so success reflects robustness under disruption rather than overfitting to static patterns.
- **Safety-Net by Design:** High-confidence rule overrides prevent catastrophic LLM errors while preserving LLM flexibility for strategic exploration.
---
## Performance Summary
We evaluate GraphStrike's hybrid rule/LLM policy across multiple frontier models to measure how well each model handles the investigation task. All runs use the same inference pipeline (`inference.py`) with identical system prompts and structured logging. Each model ran (1) seed=0 on all 3 tasks and (2) seeds 0-2 on all 3 tasks for variance measurement.
**Seed=0 scores (single episode per task):**
**3-seed variance scores (mean across seeds 0, 1, 2):**
**Rule-Based Baseline (no LLM, deterministic)**
---
## Table of Contents
1. [What This Is](#1-what-this-is)
2. [The Problem: How Fake Detection Actually Works](#2-the-problem-how-fake-detection-actually-works)
3. [Synthetic Data Generation](#3-synthetic-data-generation)
4. [Data Model](#4-data-model)
5. [The RL Environment](#5-the-rl-environment)
6. [Risk Scoring Mathematics](#6-risk-scoring-mathematics)
8. [The LLM Policy (Qwen3 via Bedrock)](#8-the-llm-policy-qwen3-via-bedrock)
9. [Reflexion — How the Agent Learns](#9-reflexion--how-the-agent-learns)
10. [Hybrid Policy — The Novel Contribution](#10-hybrid-policy--the-novel-contribution)
11. [Training Loop End-to-End](#11-training-loop-end-to-end)
12. [API Reference](#12-api-reference)
13. [Docker Deployment](#13-docker-deployment)
14. [Submission Requirements](#14-submission-requirements)
15. [Verification & Validation](#15-verification--validation)
---
## 1. What This Is
This is an **OpenEnv hackathon** submission. OpenEnv is a framework for building RL environments with a standard microservice interface (`/reset`, `/step`, `/state`) so that any agent implementation can plug in.
**What makes this non-trivial:**
- The network is large (50–1000 accounts depending on difficulty).
- Fake accounts are mixed with innocent high-signal "decoy" accounts.
- In hard mode, the gang actively evades — dropping intra-gang follows, renaming profiles — while the agent is mid-investigation.
- The agent cannot see the full network upfront: it must explore via INSPECT and INVESTIGATE_NETWORK actions, spending steps to reveal information.
**What makes the learning novel:**
- The LLM (inference via AWS Bedrock) cannot be fine-tuned — it is a black-box API.
- The agent learns via **Reflexion**: post-episode lessons are written back into memory and injected into every future prompt.
- A **dynamic hybrid policy** (α-weighted) blends the LLM with a deterministic rule engine, with the blend weight α updating based on recent win rate. Rules dominate early; the LLM takes over as it proves itself.
### System Architecture

---
## 2. The Problem: How Fake Detection Actually Works
A real-world fake account detector does **not** read post content. Detection relies on three categories of signals computed from metadata:
### Signal Hierarchy (Node -> Behavioral -> Graph)

- **Node signals (offline):** content fingerprints like photo reuse, bio-template similarity, and comment repetition provide the first suspicion layer.
- **Behavioral signals (temporal/device):** coordinated posting hour, account-age clustering, and shared IP subnet add stronger gang-level evidence.
- **Graph signals (live at INSPECT):** mutual follows, flagged-neighbor growth, and cluster alignment are hardest to evade, so they carry the highest weight in risk scoring.
- **False-positive control:** high-legitimacy hubs (for example celebrities) are down-weighted through hub-legitimacy discounting.
---
## 3. Synthetic Data Generation
**File:** `server/generator.py`
Episodes are generated deterministically by seed. 150 episodes are pre-generated (50 per task) and cached as JSON files in `episodes/`.
### Network Composition
| Task | Network size | Gang | Decoys | Real | Max steps |
|---|---|---|---|---|---|
| easy | 50 | 10 | 0 | 40 | 30 |
| medium | 200 | 10 | 20 | 170 | 50 |
| hard | 1000 | 10 | 50 | 940 | 80 |
- **Gang accounts:** All 10 share `base_age` (same creation week), tightly clustered `avg_post_hour`, high `photo_reuse_score`/`bio_template_score`, `comment_repeat_score` in [0.60, 0.90], `ip_cluster_id = "ip_gang_{seed}"`, and dense intra-gang follow edges (density 0.60–0.80).
- **Real accounts:** Log-normal follower distributions, unique IP clusters, low fake scores.
- **Decoy accounts** (medium/hard): Real accounts with elevated fraud scores (0.20–0.40 range) — they look suspicious but are NOT gang members and penalise reckless flagging.
- **Celebrity accounts** (2 per episode): 100k–5M followers, very low fake scores, high `hub_legitimacy_score`.
- **Zero-edge isolates** (2 per episode): No edges — test whether the agent wastes steps on disconnected nodes.
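
The gang properties listed above can be illustrated with a small seeded generator. This is only a sketch, not the code in `server/generator.py`: the function name `make_gang`, the account ID format, and the half-hour spread around the shared posting hour are assumptions made for illustration. The `comment_repeat_score` range, the shared `ip_cluster_id`, and the 0.60–0.80 edge density are taken from the list above.

```python
import random


def make_gang(seed: int, size: int = 10) -> dict:
    """Hypothetical sketch: build gang accounts and intra-gang follow edges."""
    rng = random.Random(seed)  # same seed -> same episode
    base_hour = rng.uniform(0, 24)  # assumed: one shared posting hour
    accounts = []
    for i in range(size):
        accounts.append({
            "id": f"gang_{i}",  # assumed ID format
            "avg_post_hour": (base_hour + rng.uniform(-0.5, 0.5)) % 24,
            "comment_repeat_score": rng.uniform(0.60, 0.90),
            "ip_cluster_id": f"ip_gang_{seed}",
        })
    # Each directed follow edge exists with probability `density`.
    density = rng.uniform(0.60, 0.80)
    edges = [
        (a["id"], b["id"])
        for a in accounts
        for b in accounts
        if a["id"] != b["id"] and rng.random() < density
    ]
    return {"accounts": accounts, "edges": edges}
```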
---
## 4. Data Model
**File:** `models.py`
### ActionType
| Value | Cost | Effect |
|---|---|---|
| `inspect` | 1 step | Reveals full `AccountProfile` + follow list |
| `investigate_network` | 2 steps | Expands 2 hops; reveals account IDs only |
| `flag` | 0 steps | Marks account as gang member; triggers SUSPECT cascade |
| `unflag` | 0 steps | Removes flag; clears CONFIRMED_FAKE status |
| `submit` | 0 steps | Ends episode; triggers scoring |
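
As a rough illustration of the table above, the action values and their step costs could be modelled as follows. The real definitions live in `models.py` and may look different; `STEP_COST` and `steps_after` are hypothetical names introduced here.

```python
from enum import Enum


class ActionType(str, Enum):
    INSPECT = "inspect"
    INVESTIGATE_NETWORK = "investigate_network"
    FLAG = "flag"
    UNFLAG = "unflag"
    SUBMIT = "submit"


# Step cost per action, taken from the table above.
STEP_COST = {
    ActionType.INSPECT: 1,
    ActionType.INVESTIGATE_NETWORK: 2,
    ActionType.FLAG: 0,
    ActionType.UNFLAG: 0,
    ActionType.SUBMIT: 0,
}


def steps_after(action: ActionType, steps_remaining: int) -> int:
    """Return the remaining budget after paying for one action."""
    return max(0, steps_remaining - STEP_COST[action])
```

Because flagging is free, the step budget is spent entirely on gathering information with `inspect` and `investigate_network`.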
### AccountProfile — key fields
| Category | Fields |
|---|---|
| Raw counts | `follower_count`, `following_count`, `post_count` |
| Temporal | `avg_post_hour`, `account_age_days` |
| Content pipeline (0–1) | `photo_reuse_score`, `bio_template_score`, `comment_repeat_score` |
| IP/device | `shared_ip_count`, `ip_cluster_id` |
| Graph (live at INSPECT) | `mutual_follow_rate`, `flagged_neighbor_count`, `avg_neighbor_photo_reuse`, `post_hour_cluster_score` |
| Risk breakdown | `fake_risk_score`, `node_risk`, `behavior_risk`, `graph_risk`, `hub_legitimacy_score` |
| Evasion/status | `name_change_count`, `status` (NORMAL/SUSPECT/CONFIRMED_FAKE) |
### FakeGangObservation — what the agent sees each step
`done`, `reward`, `visible_accounts`, `visible_account_ids`, `flagged_ids`, `inspected_ids`, `suspect_ids`, `graph_edges`, `steps_remaining`, `evasion_triggered`, `evasion_count`, `task`, `message`
---
## 5. The RL Environment
**File:** `server/environment.py`
### Episode Lifecycle & Action Mechanics

**FLAG cascade (dual):** When FLAG(X) is called, (1) every visible account that X follows becomes SUSPECT via the follow graph, and (2) every visible account sharing X's `ip_cluster_id` becomes SUSPECT. Gang members share `ip_gang_{seed}` while real accounts have unique IPs, so the IP cascade produces zero false positives.
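
The dual cascade can be sketched as a pure function over two lookups. This is an illustration under assumed data shapes (plain dicts and sets), not the implementation in `server/environment.py`; the name `flag_cascade` and its parameters are hypothetical.

```python
def flag_cascade(flagged_id, follows, ip_cluster, visible_ids):
    """Return the visible accounts that become SUSPECT when flagged_id is flagged.

    follows: dict mapping account ID -> set of IDs it follows (assumed shape)
    ip_cluster: dict mapping account ID -> ip_cluster_id (assumed shape)
    """
    suspects = set()
    # (1) Follow-graph cascade: every visible account the flagged one follows.
    suspects |= follows.get(flagged_id, set()) & visible_ids
    # (2) IP cascade: every visible account on the same IP cluster.
    cluster = ip_cluster.get(flagged_id)
    suspects |= {
        acc for acc in visible_ids
        if cluster is not None and ip_cluster.get(acc) == cluster
    }
    suspects.discard(flagged_id)  # the flagged account itself is not a suspect
    return suspects
```

Note that only visible accounts are affected, so accounts that have not yet been revealed stay untouched by the cascade.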
### Reward Function
```
base_reward = tp×1.0 − fp×0.5 − fn×0.3

Win condition:
  easy/medium: recall ≥ 0.8 AND precision ≥ 0.7
  hard:        recall ≥ 0.9 AND precision ≥ 0.8

Bonuses:
  +5.0  full win
  +3.0  perfect recall
  +2.0  partial win (high recall, low precision)
  +1.0  efficiency (SUBMIT with ≥50% steps remaining)
  −1.0  per evasion event (hard mode)
  −2.0  forced submit (ran out of steps)
```
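
For reference, the base reward and the win condition translate directly into code. The sketch below covers only those two pieces; the bonus terms are left out because their exact triggers (for example, what counts as a partial win) are not specified here. `WIN_THRESHOLDS`, `base_reward`, and `is_win` are names chosen for this example.

```python
WIN_THRESHOLDS = {  # task -> (min recall, min precision)
    "easy": (0.8, 0.7),
    "medium": (0.8, 0.7),
    "hard": (0.9, 0.8),
}


def base_reward(tp: int, fp: int, fn: int) -> float:
    """Reward before bonuses: true positives minus weighted errors."""
    return tp * 1.0 - fp * 0.5 - fn * 0.3


def is_win(task: str, tp: int, fp: int, fn: int) -> bool:
    """Check the per-task recall and precision thresholds."""
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    min_recall, min_precision = WIN_THRESHOLDS[task]
    return recall >= min_recall and precision >= min_precision
```

With 8 correct flags, 2 wrong flags, and 2 missed gang members, recall and precision are both 0.8, which is a win on easy and medium but not on hard.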
### Evasion (hard mode)
- **`unfollow_intragang`:** 30% of intra-gang edges randomly removed mid-investigation — destroys graph signal. Fires 4 times (steps 15, 30, 45, 60).
- **`rename_count`:** Random gang members get `name_change_count += 1` — a visual evasion signal.
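
A minimal sketch of the `unfollow_intragang` event, assuming edges are stored as `(follower, followee)` tuples and that the 30% is rounded to the nearest whole edge. The function signature is hypothetical and may not match `server/environment.py`.

```python
import random


def unfollow_intragang(edges, gang_ids, rng, fraction=0.30):
    """Remove a random `fraction` of intra-gang edges; leave other edges alone."""
    intra = [e for e in edges if e[0] in gang_ids and e[1] in gang_ids]
    n_remove = round(len(intra) * fraction)
    removed = set(rng.sample(intra, n_remove))
    return [e for e in edges if e not in removed]
```

Passing in a seeded `random.Random` instance keeps the event reproducible for a given episode seed.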
---
## 6. Risk Scoring Mathematics
**File:** `server/scoring.py` — all functions are stateless and deterministic.



---
## 8. The LLM Policy (Qwen3 via Bedrock)
**File:** `agent/policy.py`
**Model:** `qwen.qwen3-next-80b-a3b` via AWS Bedrock Converse API (`maxTokens=512, temperature=0.4`)
### Prompt Structure
Every step, the policy builds a prompt from three components:
```
[reflections from past episodes] ← grows richer every episode
[best trajectory few-shot example] ← best win ever, showing the full action log
━━━ CURRENT CASE ━━━
[formatted observation] ← status badges, risk scores, suspect list
What is your next action?
```
Accounts in the observation are **sorted by `fake_risk_score` descending**, with status badges prepended. `fnbr=N(!)` highlights when `flagged_neighbor_count > 0`; `[HUB?]` warns the LLM not to flag high-legitimacy accounts.
### Required Response Format
```xml
Reasoning — which account is most suspicious and why.
INSPECT acc_0041
```
If parsing fails, a heuristic fallback inspects the highest-scored uninspected account. Retries use exponential backoff (1s, 2s, 4s) up to 3 attempts.
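
The retry and fallback behaviour described above might look roughly like this. It is a sketch, not the code in `agent/policy.py`: it reads "1s, 2s, 4s" as the waits before three retries, and the `account_id` key, the function names, and the `SUBMIT` default when nothing is left to inspect are all assumptions.

```python
import time


def call_with_backoff(call, delays=(1, 2, 4), sleep=time.sleep):
    """Try `call`; on failure wait 1s, 2s, then 4s before retrying."""
    for delay in delays:
        try:
            return call()
        except Exception:
            sleep(delay)
    return call()  # final attempt; lets the last error propagate


def fallback_action(accounts, inspected_ids):
    """Heuristic fallback: inspect the highest-risk account not yet inspected."""
    candidates = [a for a in accounts if a["account_id"] not in inspected_ids]
    if not candidates:
        return "SUBMIT"  # assumed behaviour when nothing is left to inspect
    best = max(candidates, key=lambda a: a["fake_risk_score"])
    return f"INSPECT {best['account_id']}"
```

The `sleep` parameter is injectable so the backoff schedule can be checked in tests without real waiting.
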
---
## 9. Reflexion — How the Agent Learns
**Files:** `agent/reflection.py`, `agent/memory.py`
The agent **cannot** update Qwen3's weights — Bedrock is a black-box API. Instead, it learns via **Reflexion**: post-episode lessons are written as text and injected into future prompts.
### Reflexion Learning Loop

```
Episode N:
  1. LLM acts using: system_prompt + reflections[last 4] + best_trajectory
  2. Episode ends → WIN or LOSS
  3. Post-episode:
       LOSS → generate_reflection(action_log, outcome) → lesson stored
       WIN  → save trajectory if better reward + generate_success_reflection
Episode N+1:
  → last 4 reflections + best win trajectory injected into prompt
  → LLM has learned from its past
```
**Example generated reflection:**
> *"The starting accounts were all real; I wasted 8 steps inspecting low-signal nodes before pivoting. When photo_reuse and bio_template are both below 0.3 after 3 inspections, immediately use INVESTIGATE_NETWORK to jump to a different graph region."*
All memory persists in a Docker volume (`memory/`) across container restarts — reflections, best trajectories, win history, and α values per task.
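
To make the memory mechanics concrete, here is a small in-memory sketch of the store behind the loop above. The class and method names are hypothetical, the prompt text format is invented for the example, and persistence to `memory/` is omitted; the actual logic lives in `agent/memory.py` and `agent/reflection.py`.

```python
from dataclasses import dataclass, field


@dataclass
class ReflexionMemory:
    """Hypothetical in-memory store; the real one persists to memory/."""
    reflections: list = field(default_factory=list)
    best_reward: float = float("-inf")
    best_trajectory: list = field(default_factory=list)

    def add_reflection(self, lesson: str) -> None:
        self.reflections.append(lesson)

    def record_win(self, reward: float, action_log: list) -> None:
        # Keep only the highest-reward winning trajectory.
        if reward > self.best_reward:
            self.best_reward = reward
            self.best_trajectory = list(action_log)

    def build_context(self, k: int = 4) -> str:
        """Render the last k reflections plus the best trajectory as prompt text."""
        parts = [f"Lesson: {r}" for r in self.reflections[-k:]]
        if self.best_trajectory:
            parts.append("Best trajectory: " + " -> ".join(self.best_trajectory))
        return "\n".join(parts)
```
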
---
## 10. Hybrid Policy — The Novel Contribution
**File:** `agent/hybrid_policy.py`
**Key insight:** A new LLM agent starts dumb but improves over time. A rule engine is always consistent but cannot adapt. The hybrid policy exploits both — rules provide a safety net early while the LLM builds its track record; once the LLM proves itself, rules step back.
### Architecture

### Alpha (α): The Trust Weight
α is a per-task value in [0.20, cap] representing current trust in the LLM:
```
reflection_factor = min(1.0, n_reflections / 4.0)
raw = 0.20 + reflection_factor × (0.80 × recent_win_rate + 0.12)
α = clamp(raw, 0.20, cap)
```
| Task | α cap | Rationale |
|---|---|---|
| easy | 0.50 | Rule engine alone scores ~0.91; LLM should assist, not override |
| medium | 0.70 | Decoys require some LLM judgment, but cascade must stay |
| hard | 0.85 | LLM needs latitude for evasion adaptation, but safety rules remain |
**Alpha trajectory over training (easy task, cap=0.50):**
| Episode | Win rate | Reflections | α (capped) |
|---|---|---|---|
| 1 | 0% | 0 | 0.20 |
| 5 | 20% | 4 | 0.48 |
| 10 | 50% | 9 | **0.50** |
| 20 | 80% | 19 | **0.50** |
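
The α formula and the per-task caps can be checked against the trajectory table with a few lines of Python. The argument order follows the `compute_alpha(win_rate, n_reflections, task)` call used in the training loop; the constant names are made up for this sketch.

```python
ALPHA_FLOOR = 0.20
ALPHA_CAP = {"easy": 0.50, "medium": 0.70, "hard": 0.85}


def compute_alpha(recent_win_rate: float, n_reflections: int, task: str) -> float:
    """Trust in the LLM, clamped to [0.20, cap] for the given task."""
    reflection_factor = min(1.0, n_reflections / 4.0)
    raw = ALPHA_FLOOR + reflection_factor * (0.80 * recent_win_rate + 0.12)
    return max(ALPHA_FLOOR, min(raw, ALPHA_CAP[task]))
```

For episode 5 in the table (20% win rate, 4 reflections) the raw value is 0.20 + 1.0 × (0.16 + 0.12) = 0.48, which is below the easy cap. By episode 10 the raw value is 0.72, so the cap of 0.50 takes over.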

### Rule Confidence Levels
| Situation | Action | Confidence |
|---|---|---|
| Steps remaining = 0 | SUBMIT | 1.00 |
| Uninspected SUSPECT accounts exist | INSPECT suspects[0] | 0.95 |
| `fake_risk ≥ 0.85` | FLAG that account | 0.95 |
| `fake_risk` in [threshold, 0.85) | FLAG that account | 0.70+ |
| 10 accounts already flagged | SUBMIT | 0.85 |
| Steps remaining ≤ 3 | SUBMIT | 0.90 |
| Uninspected accounts available | INSPECT top candidate | 0.30 |
At **α=0.20** (early): rules dominate (~90% of decisions). At **α=0.50** (moderate): LLM controls exploration; rules control safety. At **α=0.85** (high): LLM controls most decisions; rules only override forced submits and uninspected suspects.
α is saved to `memory/alpha_{task}.json` and persists across Docker restarts — the agent doesn't reset to 0.20 every time.
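
One simple way to realise the trust gate is a threshold comparison between rule confidence and α. This is an assumed mechanism, not necessarily what `agent/hybrid_policy.py` does; only the `blend(rule_action, llm_action, rule_conf, α)` argument list and the `agree`/`rule`/`llm` mode names are taken from this README.

```python
def blend(rule_action: str, llm_action: str, rule_conf: float, alpha: float):
    """Return (final_action, mode), where mode is 'agree', 'rule', or 'llm'."""
    if rule_action == llm_action:
        return rule_action, "agree"
    # Assumed trust gate: a rule wins only when its confidence exceeds alpha.
    if rule_conf > alpha:
        return rule_action, "rule"
    return llm_action, "llm"
```

Under this reading, the 0.30-confidence exploration rule still wins at α=0.20 but yields to the LLM at α=0.50, while the 0.90+ safety rules keep overriding even at α=0.85.
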
---
## 11. Training Loop End-to-End
**File:** `train.py`
### Curriculum
| Phase | Episodes | Task | Goal |
|---|---|---|---|
| 1 | 1–20 | easy | Learn basic signal thresholds, build first reflections |
| 2 | 21–35 | medium | Handle decoys, learn evasion response |
| 3 | 36–50 | hard | Feature-only detection, persistent evasion |
Seeds rotate deterministically: `seed = (episode_num + task_offset) % 50`
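
The curriculum table and the seed formula could be combined as in the sketch below. It assumes 1-based episode numbers as shown in the table, and the `TASK_OFFSET` values are placeholders, since the actual offsets in `train.py` are not listed here.

```python
TASK_OFFSET = {"easy": 0, "medium": 17, "hard": 31}  # hypothetical offsets


def curriculum_task(episode_num: int) -> str:
    """Map a 1-based episode number to its curriculum phase."""
    if episode_num <= 20:
        return "easy"
    if episode_num <= 35:
        return "medium"
    return "hard"


def episode_seed(episode_num: int, task: str) -> int:
    """Rotate through the 50 pre-generated episodes of a task."""
    return (episode_num + TASK_OFFSET[task]) % 50
```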
### Per-Episode Flow
```
for ep in range(n_episodes):
    1. DETERMINE TASK   curriculum_task(ep) or fixed task
    2. COMPUTE ALPHA    compute_alpha(win_rate, n_reflections, task)
    3. LOAD CONTEXT     last 4 reflections + best win trajectory
    4. RUN EPISODE      while not obs.done:
                            blend(rule_action, llm_action, rule_conf, α)
                            → obs = env.step(final)
    5. POST-EPISODE     record_win → update α → generate reflection
    6. LOG              task | win/loss | reward | recall | precision | α | modes
```
Episode metrics (flushed to `runs/metrics.jsonl` every 5 episodes) include: `episode`, `task`, `won`, `reward`, `recall`, `precision`, `steps_used`, `alpha_used`, `mode_agree`, `mode_rule`, `mode_llm`, `n_reflections_used`.
The transition is visible in the mode counts: early episodes have high `mode_rule` counts, while later episodes have high `mode_agree` counts, meaning the LLM has learned to make the same decisions as the rules while also contributing strategic reasoning that the rules cannot provide.
---
## 12. API Reference
**File:** `server/app.py`
| Endpoint | Method | Description |
|---|---|---|
| `/health` | GET | `{"status": "healthy"}` |
| `/tasks` | GET | Task list + `action_schema` + `score_range: [0.0, 1.0]` |
| `/reset` | POST | Accepts `{task, seed}` → returns initial observation |
| `/step` | POST | Accepts any `FakeGangAction` → returns updated observation |
| `/state` | GET | Current episode metadata (step count, task, score) |
| `/grader` | GET | Normalised [0.0, 1.0] score after SUBMIT |
| `/baseline` | POST | Runs rule-based agent on all 3 tasks, returns scores |
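
A client only needs the standard library to talk to these endpoints. The sketch below builds a `/reset` request with the `{task, seed}` body from the table; the helper names are made up for this example, and the shape of the returned observation is not asserted here.

```python
import json
import urllib.request


def build_reset_request(base_url: str, task: str, seed: int) -> urllib.request.Request:
    """Build (but do not send) a POST /reset request with a JSON body."""
    body = json.dumps({"task": task, "seed": seed}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/reset",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send a request and decode the JSON response."""
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a server running locally, `send(build_reset_request("http://localhost:8000", "easy", 0))` would return the initial observation as a dictionary.
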
**Baseline performance:**
| Task | Seed=0 score | Win rate (50 seeds) | Mean score (50 seeds) |
|---|---|---|---|
| easy | 0.91 | 100% | ~0.91 |
| medium | 0.906 | 84% | ~0.77 |
| hard | 0.9038 | 52% | ~0.47 |
---
## 13. Docker Deployment
```bash
# Build
docker build -f server/Dockerfile -t graphstrike .

# Run
docker run -it \
  -e AWS_ACCESS_KEY_ID=your_key \
  -e AWS_SECRET_ACCESS_KEY=your_secret \
  -v "$(pwd)/memory:/app/memory" \
  -v "$(pwd)/runs:/app/runs" \
  -p 8000:8000 \
  graphstrike
```
The `memory/` and `runs/` volumes preserve all learning between container restarts.
### Environment Variables
| Variable | Default | Description |
|---|---|---|
| `AWS_ACCESS_KEY_ID` | (required) | For Bedrock/Qwen3 access |
| `AWS_SECRET_ACCESS_KEY` | (required) | For Bedrock/Qwen3 access |
| `AWS_DEFAULT_REGION` | `us-east-1` | Bedrock region |
| `TRAIN_TASK` | (curriculum) | Fix to `easy`/`medium`/`hard` |
| `TRAIN_EPISODES` | `50` | Total training episodes |
| `TRAIN_TEMP` | `0.4` | LLM sampling temperature |
| `TRAIN_VERBOSE` | `0` | Set `1` for per-step action logging |
| `SERVER_PORT` | `8000` | FastAPI port |
### Startup Sequence (`run.sh`)
```
1. Validate AWS credentials
2. python server/generator.py → generates 150 episode JSON files
3. uvicorn server.app:app → starts the environment server
4. Health check polling → waits until /health responds
5. python train.py → runs the full training loop
```
---
## 15. Verification & Validation
### Full HTTP validation
```bash
python3 -m uvicorn server.app:app --port 8001 &
sleep 3
python3 validate.py --url http://localhost:8001
# Expected: Results: 24/24 passed — all OK
```
### Deployed Endpoint Verification
```bash
curl https://pandago-graphstrike.hf.space/health
# → {"status": "healthy"}
curl https://pandago-graphstrike.hf.space/tasks
# → {"tasks": ["easy","medium","hard"], "action_schema": {...}, "score_range": [0.0, 1.0]}
curl -X POST https://pandago-graphstrike.hf.space/baseline
# → {"scores": {"easy": 0.91, "medium": 0.906, "hard": 0.9038}, "agent": "rule_based"}
```
---

## Developed with ❤️ by Team ComputeXOR
### {
### [Sai Nivedh](https://github.com/SaiNivedh26) ,
### [Charuvarthan](https://github.com/Charuvarthan-T) ,
### [Sajeev](https://github.com/SajeevSenthil)
### }