GraphStrike: Single Source of Truth
Consolidates: FINAL_SUMMARY.md, IMPLEMENTATION_COMPLETE.md, IMPLEMENTATION_STATUS.md, INFERENCE_UPDATE.md, PIPELINE.md, QUICKSTART.md, ROUND2_COMPLETE.md, ROUND2_STATUS.md, ROUND2_TRAINING_READY.md, server/ROUND2_FINAL_STATUS.md, and the top-level ROUND2_ARCHITECTURE.md / ROUND2_IMPLEMENTATION_PLAN.md / ROUND2_QUICK_REFERENCE.md / OpenEnv-Complete.md.
The HF-Space README.md is kept (it contains the YAML frontmatter Spaces needs). The per-directory dashboard/README.md describes only the local dashboard and stays with it.
1. What GraphStrike is
An OpenEnv-compatible RL environment. An LLM agent must identify the 10 members of a coordinated fake-account ring hidden inside a synthetic social network. Round 2 makes detection platform-adaptive:
- Each episode belongs to a platform (Instagram, Snapchat, X, LinkedIn, Reddit, … any name).
- A PlatformPolicy is compiled from real transparency-report text via a Bayesian threshold formula and cached per-platform.
- The high-signal account fields (photo_reuse_score, bio_template_score, ip_cluster_id) start hidden and are revealed only by explicit tool actions.
- Reward shape, FP penalty, grader score, and the moderation-decision package are all derived from the compiled policy rather than hardcoded.
A separate shared evaluation runner drives episodes deterministically and consults the LLM at exactly two decision points per suspicious account; six thin model shims plug in HF-router or Bedrock models against that runner.
2. Round 2 deltas (what changed vs Round 1)
| Area | Round 1 | Round 2 |
|---|---|---|
| Platform | (none) | platform field per episode; any name supported (env defaults to seed-parity Instagram/Snapchat) |
| Policy | hardcoded thresholds | PlatformPolicy compiled dynamically from transparency reports (Bayesian θ*) with 30-day cache freshness and sanity checks |
| Signals | all visible at INSPECT | photo_reuse_score, bio_template_score, ip_cluster_id start at 0.0 / "" and are revealed only by tool actions |
| Visible accounts | populated only on INSPECT | populated for every visible account from reset; tool reveals propagate immediately |
| Per-step reward | null for non-terminal steps | float delta of self._score returned every step |
| Actions | inspect, investigate_network, flag, unflag, submit | + get_policy, reverse_image_search, analyze_bio, check_ip |
| Reward shaping | terminal only | + first-action GET_POLICY bonus (+0.20), redundant-tool penalties, no-evidence flag deny |
| Submit response | {observation, done, reward, message} | + top-level decision_package and grader_score |
| Eval | one monolithic qwen_test_judge_eval.py per model | shared _round2_runner.py + 6 thin shims, two LLM decision points per account |
Platform assignment is deterministic in the env: seed % 2 == 0 → Instagram, else Snapchat. The eval runner remaps seeds so any requested platform actually fires (--platform Instagram forces even seeds, --platform Snapchat forces odd).
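The parity rule and the runner's remap can be sketched as follows. Only the parity rule itself comes from the text; `remap_seed` and its `2 * seed + parity` mapping are hypothetical, chosen so that distinct requested seeds stay distinct. The runner's actual remap is not shown in this document.

```python
def platform_for_seed(seed: int) -> str:
    """Env-side rule from the text: even seeds are Instagram, odd seeds are Snapchat."""
    return "Instagram" if seed % 2 == 0 else "Snapchat"


# Parity each platform needs under the env rule.
_PARITY = {"Instagram": 0, "Snapchat": 1}


def remap_seed(seed: int, platform: str) -> int:
    """Hypothetical remap: move a requested seed into the parity class of the platform."""
    parity = _PARITY.get(platform)
    if parity is None:
        return seed  # other platform names pass seeds through unmodified
    return 2 * seed + parity


if __name__ == "__main__":
    for s in (0, 1, 2):
        print(s, remap_seed(s, "Instagram"), remap_seed(s, "Snapchat"))
```

A doubling map is used here rather than "bump to the next matching seed" because bumping would send seeds 1 and 2 to the same Instagram episode.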
3. End-to-end policy flow (from transparency report to gradient signal)
This is the spine of Round 2. Every other component reads from this pipeline.
(ONE-TIME / OFFLINE)

transparency-report URLs                             policy_cache/
────────────────────────                             ─────────────
          │                                                │
          ▼                                                ▲
Tavily search                                              │
  query: "{platform} fake account content                  │
          policy enforcement 2024 2025"                    │
          │                                                │
          ▼                                                │
Groq Llama-3.1-8B extraction                               │
  → {base_rate π, fn_cost_signal, fp_cost_signal,          │
     harm_weight, primary_signal, confidence}              │
          │                                                │
          ▼                                                │
sanitize_pi() → clamp [0.0005, 0.05]                       │
  (>0.05 → "enforcement rate misread", clamp + warn)       │
          │                                                │
          ▼                                                │
compute_threshold(π, fn_signal, fp_signal, hw)             │
──────────────────────────────────────────────             │
  C_fn  = FN_COST_MAP[fn_signal]                           │
  C_fp  = FP_COST_MAP[fp_signal]                           │
  θ_raw = C_fn·π / [C_fn·π + C_fp·(1−π)]                   │
  θ*    = clamp(θ_raw / harm_weight, 0.01, 0.95)           │
  fp_penalty_weight = C_fp                                 │
          │                                                │
          ▼                                                │
PlatformPolicy(threshold=θ*, base_rate=π,                  │
               fn/fp_cost_signal, harm_weight,             │
               primary_enforcement_signal,                 │
               fp_penalty_weight=C_fp,                     │
               confidence, sources, used_fallback) ────────┘
          │
          ▼  sanity_check_policy() → surfaces warnings
          ▼  (high θ*, suspicious π, low confidence, bad signal name)
          ▼
cached to policy_cache/{platform}.json
          │
==========│================================
          │   (PER EPISODE / RUNTIME)
          ▼
client.reset(task, seed)
  env.platform = "Instagram"
  env._policy  = get_policy("Instagram")  ◄── reads cached JSON
          │                                   (recompiles if >30 days old)
          ▼
deterministic step 0: GET_POLICY (free, +0.20 first-action bonus)
  message: "Policy compiled: Platform: Instagram |
            Threshold: 0.369 | Primary Signal: photo_reuse | FP Penalty: 0.1x | …"
          │
          ▼
runner._policy_from_message() → policy dict {threshold, primary_signal, fp_weight}
          │
          ▼
per suspicious account, sorted by risk_score desc:
  INSPECT (deterministic)
  INVESTIGATE_NETWORK if risk ≥ 0.80 (deterministic, once)
  ┌─ DP1 (LLM) ───────────────────────────────┐
  │ prompt includes platform, primary_signal, │
  │ θ*, revealed-vs-None signals, budget      │
  │ → "reverse_image_search" / "analyze_bio"  │
  │   / "check_ip" / "done"                   │
  └───────────────────────────────────────────┘
          │ (loop until "done" or signals sufficient)
  ┌─ DP2 (LLM) ───────────────────────────────┐
  │ prompt includes revealed signals,         │
  │ θ*, fp_penalty=C_fp, running tp/fp count  │
  │ → "flag" / "skip"                         │
  └───────────────────────────────────────────┘
          │
          ▼
SUBMIT (deterministic)
  reward = tp·1.0 − fp·C_fp − fn·0.3 + bonuses − penalties
                        ▲
                        └── platform-specific via fp_penalty_weight
  grader_score and decision_package surfaced at top level of /step response.
Two views of the same policy:
- θ* is in the prompt at DP1/DP2, so the LLM conditions on it.
- C_fp (= fp_penalty_weight) is in the terminal reward, so the LLM is graded against it.
Both come from the same compile-time computation; they cannot drift apart.
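The hand-off from the GET_POLICY message to the runner's policy dict can be illustrated with a small parser. This is a sketch that assumes the message always follows the `Key: value | Key: value` layout quoted in the flow above; `policy_from_message` is a hypothetical stand-in, not the runner's actual `_policy_from_message`.

```python
import re


def policy_from_message(message: str) -> dict:
    """Pull threshold, primary signal and FP weight out of a GET_POLICY message.

    Assumes the 'Threshold: … | Primary Signal: … | FP Penalty: …x' layout;
    fields that are missing come back as None.
    """
    threshold = re.search(r"Threshold:\s*([0-9.]+)", message)
    signal = re.search(r"Primary Signal:\s*(\w+)", message)
    fp_weight = re.search(r"FP Penalty:\s*([0-9.]+)x", message)
    return {
        "threshold": float(threshold.group(1)) if threshold else None,
        "primary_signal": signal.group(1) if signal else None,
        "fp_weight": float(fp_weight.group(1)) if fp_weight else None,
    }


if __name__ == "__main__":
    msg = ("Policy compiled: Platform: Instagram | Threshold: 0.369 | "
           "Primary Signal: photo_reuse | FP Penalty: 0.1x | ...")
    print(policy_from_message(msg))
```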
4. Policy Compiler (server/policy_compiler.py)
4.1 Formula
θ_raw = C_fn · π / [C_fn · π + C_fp · (1 − π)]
θ* = clamp(θ_raw / harm_weight, 0.01, 0.95)
fp_penalty_weight = C_fp
Action rule the threshold serves: FLAG if risk_score ≥ θ*.
θ_raw is the share of expected cost coming from missed fakes. It rises with C_fn and with the base rate π, and falls as C_fp rises; since θ* = θ_raw / harm_weight (before clamping), θ* moves in the same direction as θ_raw.
harm_weight > 1 is strict (lowers θ*); harm_weight < 1 is lenient (raises θ*).
History note. The original spec used θ_raw = C_fp(1−π) / [C_fp(1−π) + C_fn·π], the complementary probability. With small π that formula collapses to ≈ 1 for every platform (π is the bottleneck, not the costs). Audit on 2026-04-25 confirmed this was a formula-direction error; the orientation above is correct for our action rule.
4.2 Cost maps
FN_COST_MAP = {"low": 0.5, "medium": 1.0, "high": 2.0, "critical": 4.0}
FP_COST_MAP = {"low": 0.1, "medium": 0.5, "high": 1.5}
Signals are extracted from policy text by an LLM and constrained to these keys (defaults high / medium if absent or invalid).
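The formula and cost maps combine into a few lines of Python. This is a minimal sketch, not the contents of server/policy_compiler.py: the function signature and the fallback-to-default lookups are assumptions, while the cost maps, clamp bounds and formula are taken from the sections above.

```python
FN_COST_MAP = {"low": 0.5, "medium": 1.0, "high": 2.0, "critical": 4.0}
FP_COST_MAP = {"low": 0.1, "medium": 0.5, "high": 1.5}


def compute_threshold(pi: float, fn_signal: str, fp_signal: str,
                      harm_weight: float = 1.0) -> tuple[float, float]:
    """Return (theta_star, fp_penalty_weight) for one platform."""
    # Unknown signals fall back to the documented defaults (high / medium).
    c_fn = FN_COST_MAP.get(fn_signal, FN_COST_MAP["high"])
    c_fp = FP_COST_MAP.get(fp_signal, FP_COST_MAP["medium"])
    theta_raw = c_fn * pi / (c_fn * pi + c_fp * (1.0 - pi))
    theta_star = min(max(theta_raw / harm_weight, 0.01), 0.95)
    return theta_star, c_fp


if __name__ == "__main__":
    # Inputs copied from two rows of the compiled-policy table in § 4.9.
    print(round(compute_threshold(0.005, "high", "low", 1.0)[0], 3))      # 0.091
    print(round(compute_threshold(0.030, "critical", "low", 1.5)[0], 3))  # 0.369
```

Both printed values match the θ* column in § 4.9, which is a quick way to confirm the formula orientation.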
4.3 Extraction inputs
| Field | Source | Sanitization |
|---|---|---|
| base_rate (π) | LLM extraction from transparency report | sanitize_pi: clamp to [0.0005, 0.05]; >0.05 logs "likely enforcement rate misread, clamped". The prompt also instructs the LLM to return 0.005 if it sees an enforcement rate or no prevalence figure. |
| fn_cost_signal | LLM extraction | invalid → high |
| fp_cost_signal | LLM extraction | invalid → medium |
| harm_weight | LLM extraction | non-numeric → 1.0 |
| primary_enforcement_signal | LLM extraction | None / blank / non-string → photo_reuse |
| confidence | LLM extraction | non-numeric → 0.0 |
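The coercion rules in the table can be sketched as one sanitizing pass over the raw extraction dict. `sanitize_extraction` and `_to_float` are hypothetical helper names, and defaulting a non-numeric base_rate to 0.005 is an assumption borrowed from the prompt instruction; the other defaults come from the table.

```python
import logging

VALID_FN = {"low", "medium", "high", "critical"}
VALID_FP = {"low", "medium", "high"}


def sanitize_pi(pi: float) -> float:
    """Clamp the base rate to [0.0005, 0.05], warning on likely misreads."""
    if pi > 0.05:
        logging.warning("base_rate %.4f: likely enforcement rate misread, clamped", pi)
    return min(max(pi, 0.0005), 0.05)


def _to_float(value, default: float) -> float:
    try:
        return float(value)
    except (TypeError, ValueError):
        return default


def sanitize_extraction(raw: dict) -> dict:
    """Apply the per-field fallbacks from the table to a raw LLM extraction."""
    fn = raw.get("fn_cost_signal")
    fp = raw.get("fp_cost_signal")
    signal = raw.get("primary_enforcement_signal")
    return {
        # Assumption: a non-numeric base rate falls back to 0.005.
        "base_rate": sanitize_pi(_to_float(raw.get("base_rate"), 0.005)),
        "fn_cost_signal": fn if isinstance(fn, str) and fn in VALID_FN else "high",
        "fp_cost_signal": fp if isinstance(fp, str) and fp in VALID_FP else "medium",
        "harm_weight": _to_float(raw.get("harm_weight"), 1.0),
        "primary_enforcement_signal":
            signal if isinstance(signal, str) and signal.strip() else "photo_reuse",
        "confidence": _to_float(raw.get("confidence"), 0.0),
    }
```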
4.4 Tavily query (generic, platform-agnostic)
query = f"{platform} fake account content policy enforcement 2024 2025"
The previous query was Meta/Instagram-specific; the generic form works for any platform name. Domain filtering (is_high_signal_source) was removed for the same reason: it gated to meta.com/snap.com domains.
4.5 Caching & freshness
- Cached at policy_cache/{platform_lowercase}.json.
- Entries older than CACHE_TTL_DAYS = 30 are treated as stale and recompiled.
- compile_policy(platform, use_cache=True) is the runtime entry; the --use-cache flag controls CLI behavior (default re-compile when invoked from CLI).
4.6 Fallbacks
- FALLBACK_POLICIES provides hardcoded params for Instagram / Snapchat. Any other platform falls back to GENERIC_FALLBACK (π=0.005, fn=high, fp=medium, hw=1.0).
- Fallback policies set used_fallback=True (a new field on PlatformPolicy).
- The threshold value in fallbacks is computed via the same formula; there is no hardcoded threshold in the policy compiler anymore.
4.7 Sanity check (sanity_check_policy)
After every compile, the compiler prints warnings for any of:
| Trigger | Meaning |
|---|---|
| θ* > 0.90 | agent will almost never flag; check fn_cost extraction |
| θ* < 0.005 | agent will flag nearly everything; check fp_cost extraction |
| base_rate > 0.05 | likely enforcement-rate misread |
| confidence < 0.60 | low extraction quality; consider falling back |
| primary_signal ∉ {photo_reuse, bio_template, ip_cluster, behavior} | not a known tool action |
Sanity check does not block compilation; it surfaces issues so an operator can review before running eval.
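The trigger table translates directly into a non-blocking check. This sketch returns a list of warning strings instead of printing; the function name and message wording are illustrative, and the thresholds are the ones in the table.

```python
KNOWN_SIGNALS = {"photo_reuse", "bio_template", "ip_cluster", "behavior"}


def sanity_warnings(threshold: float, base_rate: float,
                    confidence: float, primary_signal: str) -> list[str]:
    """Collect warnings for a compiled policy; never raises, never blocks."""
    warnings = []
    if threshold > 0.90:
        warnings.append("threshold > 0.90: agent will almost never flag")
    if threshold < 0.005:
        warnings.append("threshold < 0.005: agent will flag nearly everything")
    if base_rate > 0.05:
        warnings.append("base_rate > 0.05: likely enforcement-rate misread")
    if confidence < 0.60:
        warnings.append("confidence < 0.60: low extraction quality")
    if primary_signal not in KNOWN_SIGNALS:
        warnings.append(f"primary_signal {primary_signal!r} is not a known tool action")
    return warnings


if __name__ == "__main__":
    # A row with confidence 0.50 trips only the confidence warning.
    print(sanity_warnings(0.025, 0.005, 0.50, "photo_reuse"))
```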
4.8 CLI
python -m server.policy_compiler --platform <Name> # always recompile
python -m server.policy_compiler --platform <Name> --use-cache
4.9 Currently compiled policies
| Platform | π | fn_signal | fp_signal | hw | θ* | C_fp | confidence | used_fallback |
|---|---|---|---|---|---|---|---|---|
| X | 0.005 | high | low | 1.0 | 0.091 | 0.10 | 0.80 | False |
| Instagram | 0.030 | critical | low | 1.5 | 0.369 | 0.10 | 0.80 | False |
| Snapchat | 0.005 | low | low | 1.0 | 0.025 | 0.10 | 0.50 ⚠ | False |
| LinkedIn | 0.005 | critical | low | 1.0 | 0.167 | 0.10 | 0.80 | False |
| Reddit | 0.005 | low | low | 1.0 | 0.025 | 0.10 | 0.50 ⚠ | False |
Snapchat and Reddit currently raise the low-confidence sanity warning because extraction is noisy on those transparency reports. Consider forcing the fallback path before training on them.
5. Hidden-signal architecture
Episode JSON stores hidden signals at episode level, not per account:
{
"episode_id": "easy_042_Instagram",
"platform": "Instagram",
"hidden_signals": {
"photo_reuse": {"acc_0001": 0.87, ...},
"bio_template": {"acc_0001": 0.72, ...},
"ip_cluster": {"acc_0001": "ip_gang_42", ...}
}
}
account.features start with photo_reuse_score = 0.0, bio_template_score = 0.0, ip_cluster_id = "". Tool handlers copy from ep["hidden_signals"] into account.features and refresh the cached profile so subsequent observations carry the revealed value.
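The reveal step can be pictured as a lookup from tool name to a (hidden-signal key, feature field) pair. This is a sketch over plain dicts; the real handlers operate on account objects and also refresh the cached profile, which is omitted here. `TOOL_FIELDS` and `reveal` are hypothetical names.

```python
EPISODE = {
    "hidden_signals": {
        "photo_reuse": {"acc_0001": 0.87},
        "bio_template": {"acc_0001": 0.72},
        "ip_cluster": {"acc_0001": "ip_gang_42"},
    }
}

# tool action -> (key under hidden_signals, field in account features)
TOOL_FIELDS = {
    "reverse_image_search": ("photo_reuse", "photo_reuse_score"),
    "analyze_bio": ("bio_template", "bio_template_score"),
    "check_ip": ("ip_cluster", "ip_cluster_id"),
}


def reveal(episode: dict, features: dict, tool: str, acc_id: str):
    """Copy one hidden signal into the account's visible features."""
    if acc_id not in features:
        raise KeyError(f"unknown account {acc_id}")
    key, field = TOOL_FIELDS[tool]
    # Accounts without a hidden value keep their current (default) feature.
    value = episode["hidden_signals"][key].get(acc_id, features[acc_id][field])
    features[acc_id][field] = value
    return value


if __name__ == "__main__":
    feats = {"acc_0001": {"photo_reuse_score": 0.0, "bio_template_score": 0.0,
                          "ip_cluster_id": ""}}
    reveal(EPISODE, feats, "reverse_image_search", "acc_0001")
    print(feats["acc_0001"])
```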
Known limitation. generator.py accepts a platform arg but currently produces identical hidden-signal distributions for every platform. Platform conditioning is therefore purely prompt-side: the LLM learns to read θ* and C_fp from the prompt and reward, not to recognize platform-specific data shape. Parametrizing the generator by platform is a separate follow-up.
6. Scoring (server/scoring.py)
Stateless risk functions (kept from Round 1): compute_node_risk, compute_behavior_risk, compute_graph_risk, compute_hub_legitimacy, compute_fake_risk.
Round 2 additions:
- compute_weighted_fake_risk(..., primary_signal) boosts the platform's primary signal (node risk +0.15 for content signals; behavior risk +0.15 for ip_cluster).
- classify_risk(fake_risk, threshold) accepts platform threshold.
- grader_score(tp, fp, fn, steps, max_steps, threshold, fp_penalty_weight) adds 0.05 × (1 − threshold) to reward stricter platforms.
Win conditions (unchanged from Round 1): easy/medium recall ≥ 0.8, precision ≥ 0.7; hard recall ≥ 0.9, precision ≥ 0.8.
7. Tool-action contracts
| Action | Step cost | Score delta | Reveals | Notes |
|---|---|---|---|---|
| GET_POLICY | 0 | +0.20 once (first action) | none (returns PlatformPolicy summary in message) | Free; bonus only fires on _action_count == 1 |
| INSPECT | 1 | −0.01 | full profile, edges | needed before any DP1/DP2 logic |
| REVERSE_IMAGE_SEARCH | 1 | −0.01 (−0.05 if redundant) | photo_reuse_score | sets account.features.photo_reuse_score |
| ANALYZE_BIO | 1 | −0.01 (−0.05 if redundant) | bio_template_score | sets account.features.bio_template_score |
| CHECK_IP | 2 | −0.02 (−0.10 if redundant) | ip_cluster_id + cluster-size message | heaviest tool; only worth it for shared_ip ≥ 5 |
| INVESTIGATE_NETWORK | 2 | −0.02 | 2-hop expansion + SUSPECT cascade | unchanged from Round 1 |
| FLAG | 0 | −0.15 if no evidence (deny) | dual SUSPECT cascade (follow-graph + IP) | "no evidence" = not inspected AND no tool used on the account |
| UNFLAG | 0 | 0 | none | unchanged |
| SUBMIT | 0 | terminal-reward formula (§ 8) | end episode | also surfaces decision_package and grader_score at top level |
All tool handlers validate acc_id in self._accounts, refresh the cached profile, and force _do_submit(forced=True) if max steps were consumed.
8. Reward shape (per-step deltas + terminal)
Per-step delta is now visible on every /step response: it is round(self._score - self._last_score, 4). terminal_reward overrides the delta on the SUBMIT step so the caller sees the full episode reward there.
8.1 Per-step shaping (visible immediately)
+0.20 GET_POLICY as first action (once per episode)
-0.01 per inspect / reverse_image_search / analyze_bio (time cost)
-0.02 per check_ip / investigate_network (time cost)
-0.05 per redundant reverse_image_search / analyze_bio
-0.10 per redundant check_ip
-0.15 blind FLAG (no inspect, no tool used on account) → deny + penalty
8.2 Terminal reward at SUBMIT
reward = tp · 1.0
 − fp · self._policy.fp_penalty_weight (= C_fp; varies per platform)
 − fn · 0.3
 + 5.0 if recall ≥ win_recall AND precision ≥ win_precision
 + 3.0 if tp == 10 (perfect recall)
 + 2.0 if partial win (recall met, precision missed)
 + 1.0 if SUBMIT with ≥ 50% steps remaining
 + 2.0 if Instagram and precision ≥ 0.95
 + 2.0 if Snapchat and recall ≥ 0.95
 − 1.0 × evasion_count (hard task only)
 − 2.0 if forced SUBMIT (ran out of steps)
 − 0.15 × |unsupported_flags| (flags with no revealed signals at submit time)
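Note: fp_penalty_weight is platform-specific and is the principal lever the policy compiler pulls. The same FP behavior costs 15× more under fp_cost_signal = high (C_fp = 1.5) than under low (C_fp = 0.1); every policy currently compiled in § 4.9 uses low, so all of them have C_fp = 0.10.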
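Read literally, the terminal formula can be written as one function. This is a sketch, not the code in server/environment.py: the signature is invented, the win thresholds are taken from § 6, and treating "partial win" as mutually exclusive with the full-win bonus is an assumption.

```python
# (win_recall, win_precision) per task, from the win conditions in § 6.
WIN = {"easy": (0.8, 0.7), "medium": (0.8, 0.7), "hard": (0.9, 0.8)}


def terminal_reward(tp: int, fp: int, fn: int, *, task: str, platform: str,
                    c_fp: float, steps_remaining: int, max_steps: int,
                    evasion_count: int = 0, forced: bool = False,
                    unsupported_flags: int = 0) -> float:
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    win_recall, win_precision = WIN[task]

    reward = tp * 1.0 - fp * c_fp - fn * 0.3
    if recall >= win_recall and precision >= win_precision:
        reward += 5.0                      # full win
    elif recall >= win_recall:
        reward += 2.0                      # partial win: recall met, precision missed
    if tp == 10:
        reward += 3.0                      # perfect recall
    if steps_remaining >= 0.5 * max_steps:
        reward += 1.0                      # early SUBMIT
    if platform == "Instagram" and precision >= 0.95:
        reward += 2.0
    if platform == "Snapchat" and recall >= 0.95:
        reward += 2.0
    if task == "hard":
        reward -= 1.0 * evasion_count
    if forced:
        reward -= 2.0
    reward -= 0.15 * unsupported_flags
    return round(reward, 4)
```

For a perfect easy Instagram episode submitted with most of the budget left, the terms add up to 10 + 5 + 3 + 1 + 2 = 21.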
9. Schemas (OpenEnv-compliant)
9.1 Models
- FakeGangAction: action_type: ActionType, account_id: Optional[str]
- FakeGangObservation: done, reward (per-step delta or terminal), visible_accounts[AccountProfile] (now populated for every visible id), visible_account_ids, flagged_ids, inspected_ids, graph_edges, steps_remaining, evasion_triggered, evasion_count, task, message, suspect_ids, platform
- FakeGangState: episode_id, step_count, task, score_so_far, evasion_count, network_size, gang_size, episode_seed, platform
- PlatformPolicy: platform, threshold, base_rate, fn_cost_signal, fp_cost_signal, harm_weight, primary_enforcement_signal, fp_penalty_weight, sources, confidence, compiled_at, used_fallback
9.2 StepResponse (HTTP)
{
"observation": { ... },
"done": <bool>,
"reward": <float | null>,
"message": "...",
"decision_package": { ... } | null, // populated after SUBMIT
"grader_score": <float> | null // populated after SUBMIT, sourced from decision_package
}
decision_package (after SUBMIT) carries:
- platform, flagged_accounts[], recommended_action ∈ {queue_for_review, temporary_hold, scheduled_ban, batch_takedown}
- evidence_summary: flagged, revealed_photo_reuse, revealed_bio_template, revealed_ip_cluster, unsupported_flags[]
- policy_rationale: textual explanation including θ*, primary signal, FP penalty, observed precision/recall
- tp, fp, fn, precision, recall, reward, grader_score
The terminal message also embeds the keywords flagged_accounts, evidence_summary, policy_rationale, grader_score for callers that grep the message string.
10. HTTP API (server/app.py)
| Endpoint | Method | Notes |
|---|---|---|
| /health | GET | {"status":"healthy"} |
| /reset | POST | {task, seed, episode_id} → StepResponse |
| /step | POST | FakeGangAction body → StepResponse (per-step reward delta + decision_package + grader_score on SUBMIT) |
| /state | GET | Current FakeGangState |
| /tasks | GET | Task list + Round 2 action_schema (9 actions) |
| /grader | GET | Normalized [0,1] score; requires SUBMIT first |
| /metadata | GET | HF Spaces metadata |
| /schema | GET | Pydantic JSON schemas |
| /mcp | POST | MCP JSON-RPC for tools/list |
| /baseline | POST | Runs rule-based baseline on all 3 tasks |
| / | GET | Gradio playground |
openenv.yaml action schema mirrors all nine action types (Round 1 five plus the four Round 2 tools).
11. Evaluation runner (eval-models/_round2_runner.py)
11.1 Outer loop (deterministic)
reset(task, seed)
  ↓
GET_POLICY (step 0): always; bonus +0.20
  ↓
loop over visible accounts sorted by (suspect_flag, risk_score) desc:
  INSPECT if not yet inspected
  INVESTIGATE_NETWORK if risk_score ≥ 0.80 (once per account, ≥5 steps left)
  DP1 loop (LLM): pick a tool or "done"
    reverse_image_search | analyze_bio | check_ip | done
    stops on "done", missing budget, or photo + bio both revealed
  DP2 (LLM): flag-or-skip
    flag → env.step(FLAG)
    skip → leave alone, move to next account
  ↓
SUBMIT
Stops early when done is signaled, steps_remaining ≤ 1, or max_accounts_per_episode = 15 accounts have been processed.
11.2 The two LLM decision points
DP1 (tool selection) prompt includes:
- platform, primary_signal, θ*
- account_id, risk_score, hub_legitimacy
- Each revealed signal value or None
- steps_remaining, tool costs
DP2 (flag decision) prompt includes:
- All revealed signals for the account
- θ*, fp_penalty = C_fp
- Running flagged / 10, steps_remaining
Each prompt asks for exactly one token so parsing is robust. Invalid completions are counted in dp1_invalid / dp2_invalid for QA.
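Parsing a single-token reply might look like the sketch below. The choice sets come from the prompts described above; `parse_choice`, the punctuation stripping, and the fallback values ("done" for DP1, "skip" for DP2) are assumptions about how an invalid completion could be handled.

```python
import re

DP1_CHOICES = {"reverse_image_search", "analyze_bio", "check_ip", "done"}
DP2_CHOICES = {"flag", "skip"}
_THINK = re.compile(r"<think>.*?</think>", re.DOTALL)


def parse_choice(completion: str, choices: set[str], default: str) -> tuple[str, bool]:
    """Return (choice, was_valid); invalid completions fall back to `default`."""
    text = _THINK.sub("", completion).strip().lower()
    token = text.split()[0].strip("\"'`.,:;") if text else ""
    if token in choices:
        return token, True
    return default, False


if __name__ == "__main__":
    counters = {"dp1_invalid": 0}
    choice, ok = parse_choice("<think>photo first?</think>\nCheck_IP", DP1_CHOICES, "done")
    counters["dp1_invalid"] += 0 if ok else 1
    print(choice, counters)
```

Returning the validity flag alongside the choice keeps the dp1_invalid / dp2_invalid counters in the caller, next to the rest of the episode log.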
11.3 Per-episode JSONL log
eval-models/results/{model}_{platform}_results.jsonl, one line per episode:
{
"model": "Bedrock/qwen.qwen3-next-80b-a3b",
"platform": "Instagram",
"task": "easy", "seed": 0,
"episode_id": "easy_000_Instagram",
"threshold": 0.369, "primary_signal": "photo_reuse",
"steps_taken": 14, "inspected": 5,
"tool_calls": {"reverse_image_search": 5, "analyze_bio": 4, "check_ip": 1,
"get_policy": 1, "investigate_network": 1},
"flagged": 7,
"dp1_calls": 12, "dp2_calls": 5, "dp1_invalid": 0, "dp2_invalid": 0,
"reward": 4.32, "grader_score": 0.71,
"final_message": "...", "wall_seconds": 23.4
}
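Because each episode is one JSON object per line, a results file can be summarized with the standard library alone. This sketch uses only field names that appear in the sample record above; `summarize` is a hypothetical helper, not part of the runner.

```python
import json
from statistics import mean
from typing import Iterable


def summarize(lines: Iterable[str]) -> dict:
    """Aggregate per-episode JSONL records into a few headline numbers."""
    rows = [json.loads(line) for line in lines if line.strip()]
    if not rows:
        return {"episodes": 0}
    return {
        "episodes": len(rows),
        "mean_reward": mean(r["reward"] for r in rows),
        "mean_grader_score": mean(r["grader_score"] for r in rows),
        "invalid_completions": sum(r["dp1_invalid"] + r["dp2_invalid"] for r in rows),
    }


if __name__ == "__main__":
    sample = [
        '{"reward": 4.0, "grader_score": 0.5, "dp1_invalid": 0, "dp2_invalid": 1}',
        '{"reward": 2.0, "grader_score": 0.7, "dp1_invalid": 2, "dp2_invalid": 0}',
    ]
    print(summarize(sample))
```

With a real file, pass an open file handle: `summarize(open(path, encoding="utf-8"))`.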
11.4 Public entry point
from _round2_runner import run_evaluation
run_evaluation(
model_name="qwen-72b",
call_llm=lambda prompt: ..., # injectable adapter
platform="Instagram",
base_url="http://localhost:7860",
tasks=["easy", "medium", "hard"],
seeds=[0, 1, 2],
)
The runner remaps requested seeds to the env's parity rule so --platform Instagram actually runs Instagram episodes (even) and --platform Snapchat runs Snapchat (odd). Other platform names pass seeds through unmodified (env then falls back to its parity default for that seed).
11.5 Import-order safety
The runner unconditionally inserts the project root at sys.path[0] and evicts any cached models / client modules so a stale copy in ~/.local/lib/python3.12/site-packages cannot win. If your shim raises ActionType has no attribute GET_POLICY, the safety insert was skipped; verify you are running the current version of the runner.
12. Model shims (eval-models/{qwen,gemma,deepseek,llama,mistral,nvidia}_test_judge_eval.py)
Each shim is ~30 lines. It declares the model identifiers and delegates to the runner via _llm_adapters.make_caller:
| Shim | HF model | Bedrock model |
|---|---|---|
| qwen | Qwen/Qwen2.5-72B-Instruct | qwen.qwen3-next-80b-a3b |
| gemma | google.gemma-3-12b-it | same |
| deepseek | deepseek.v3.2 | same |
| llama | meta.llama4-scout-17b-instruct-v1:0 | same |
| mistral | mistral.ministral-3-8b-instruct | same |
| nvidia | nvidia.nemotron-super-3-120b | same |
_llm_adapters.py exposes make_hf_caller(model), make_bedrock_caller(model_id), and a unified make_caller(backend, hf_model, bedrock_model). Both backends strip <think>...</think> reasoning blocks and retry up to 3× with exponential backoff.
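The strip-and-retry behavior can be wrapped around any `prompt -> str` callable. This is a sketch under assumptions: `with_retries` is a hypothetical wrapper, "up to 3×" is read as three attempts in total, and the 1 s base delay is illustrative.

```python
import re
import time
from typing import Callable

_THINK = re.compile(r"<think>.*?</think>", re.DOTALL)


def strip_think(text: str) -> str:
    """Remove <think>...</think> reasoning blocks from a completion."""
    return _THINK.sub("", text).strip()


def with_retries(call: Callable[[str], str], attempts: int = 3,
                 base_delay: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> Callable[[str], str]:
    """Wrap a backend call with exponential backoff (base_delay, 2x, 4x, ...)."""
    def wrapped(prompt: str) -> str:
        for attempt in range(attempts):
            try:
                return strip_think(call(prompt))
            except Exception:
                if attempt == attempts - 1:
                    raise  # out of attempts: surface the last error
                sleep(base_delay * 2 ** attempt)
        raise RuntimeError("attempts must be >= 1")
    return wrapped
```

Injecting `sleep` makes the wrapper testable without real delays.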
Usage
# HF router (needs HF_TOKEN)
python eval-models/qwen_test_judge_eval.py --url http://localhost:7860 --platform Instagram
# AWS Bedrock (needs AWS_* env vars)
python eval-models/qwen_test_judge_eval.py --bedrock --url http://localhost:7860 --platform Snapchat \
--tasks easy medium --seeds 0 1 2
13. Files that matter
Source of truth (read first):
- reference.md: this file
- models.py: data schemas (PlatformPolicy.used_fallback is new)
- server/policy_compiler.py: Bayesian θ*, sanity check, generic Tavily, 30-day cache
- server/environment.py: reset/step/state, tool handlers, per-step reward delta, no-evidence flag deny, decision package
- server/app.py: StepResponse with top-level decision_package and grader_score
- server/scoring.py: risk/grader math
- server/generator.py: episode generation, hidden_signals
- eval-models/_round2_runner.py: deterministic loop + DP1/DP2
- eval-models/_llm_adapters.py: HF + Bedrock callers
- eval-models/{model}_test_judge_eval.py: six thin shims
- openenv.yaml: action schema mirrors all 9 actions
- check.sh: 12-step Round 2 system check (server side)
Operational:
- policy_cache/{platform}.json: compiled policies (delete to force recompile)
- episodes/{task}_{seed}.json: generated episodes (regenerate with python -m server.generator)
- eval-models/results/{model}_{platform}_results.jsonl: per-episode eval logs
Round 1 still functional:
- agent/train.py, agent/policy.py, agent/memory.py, agent/reflection.py, agent/hybrid_policy.py
- inference.py, bedrock_model.py, client.py
- validate.py, test_round2.py
14. Quickstart
# 1. Install
cd fake_gang_env
uv sync # or: pip install -r requirements.txt
# 2. Compile / refresh platform policies (one-time, then again once the cache is older than 30 days)
python -m server.policy_compiler --platform Instagram
python -m server.policy_compiler --platform Snapchat
python -m server.policy_compiler --platform X
python -m server.policy_compiler --platform LinkedIn
# 3. (Re)generate episodes
python -m server.generator
# 4. Start the env server
python -m uvicorn server.app:app --port 7860
# 5. End-to-end system check (12 verifications)
bash check.sh
# 6. Run a model shim against the live server
export HF_TOKEN=... # or AWS_*
python eval-models/qwen_test_judge_eval.py \
--url http://localhost:7860 \
--platform Instagram \
--tasks easy medium hard \
--seeds 0 1 2
# Logs: eval-models/results/Qwen_Qwen2.5-72B-Instruct_instagram_results.jsonl
Docker:
docker build -f server/Dockerfile -t graphstrike .
docker run -p 7860:7860 -v $(pwd)/memory:/app/memory -v $(pwd)/runs:/app/runs graphstrike
15. System check (check.sh)
Twelve numbered checks against a running server at http://localhost:7860:
| # | Check | Pass criterion |
|---|---|---|
| 1-4 | health, /tasks, /reset, /step GET_POLICY | endpoints respond; action schema lists 9 types; threshold appears in message |
| 5 | INSPECT first visible account | profile returned; account_id is real (extracted from visible_account_ids before inspect) |
| 6 | REVERSE_IMAGE_SEARCH | photo_reuse_score > 0 for that account in observation.visible_accounts[*] |
| 7 | ANALYZE_BIO | bio_template_score > 0 |
| 8 | CHECK_IP | message reports cluster, shared_ip_count populated |
| 9 | GET_POLICY first-action bonus | per-step reward ≥ 0.15 |
| 10 | redundant tool penalty | second reverse_image_search reward < first |
| 11 | blind FLAG penalty | flag without prior inspect/tool → reward ≤ −0.10 |
| 12 | full episode | submit response carries the four decision-package keywords + non-null grader_score |
Checks 5-8 read from observation.visible_accounts[*] rather than a non-existent top-level profile field; the prior version of check.sh had that bug.
16. Bug fixes shipped 2026-04-25
| # | File | Symptom | Root cause | Fix |
|---|---|---|---|---|
| 1 | server/environment.py | reward: null on every non-terminal step | _make_observation only set terminal_reward | Track _last_score; return score - _last_score as per-step delta |
| 2 | server/environment.py | visible_accounts: [] until INSPECT | observation included only _profiled | Build a profile for every _visible_id (cached for inspected, fresh otherwise). Tool reveals propagate because _build_profile reads from account.features which the tool handlers update. |
| 3 | server/environment.py | Tool reveals invisible to caller | covered by Bug 2 | none needed |
| 4 | server/environment.py | GET_POLICY +0.20 not visible | accumulated into _score but _make_observation never returned it | covered by Bug 1 |
| 5 | server/environment.py, server/app.py | submit response missing decision-package keywords + grader_score | message lacked the literal keywords; StepResponse only had four fields | enrich submit message; add decision_package and grader_score as top-level fields on StepResponse |
| 6 | server/environment.py | blind FLAG (no inspect, no tool) returned 0 reward | submit-time unsupported_flags only fires at SUBMIT | _do_flag now denies blind flags immediately with −0.15 |
| 7 | eval-models/_round2_runner.py | ActionType has no attribute GET_POLICY when running shims | if _PARENT not in sys.path guard skipped the insert because path was already present at lower priority; site-packages models.py won | Insert _PARENT at index 0 unconditionally; evict cached models/client from sys.modules |
| 8 | check.sh | acc_000 hardcoded; profile field read from wrong path | script bugs | extract real account_id from observation.visible_account_ids before CHECK 5; read profiles from observation.visible_accounts[*] |
| - | server/policy_compiler.py | θ* always ≈ 0.95 | formula direction inverted (computed FP-cost share, not FN-cost share) | θ_raw = C_fn·π / [C_fn·π + C_fp·(1−π)] |
| - | server/policy_compiler.py | enforcement-rate misreads (e.g. Snap π=0.262) | LLM confusion between "% removed" and "% prevalence" | sanitize_pi clamp [0.0005, 0.05] + warning; extraction prompt explicitly disambiguates |
| - | server/policy_compiler.py | crash on Pydantic validation when LLM returned None for primary_enforcement_signal | strict typing | coerce None / blank to photo_reuse; same for confidence |
17. Sanity rules for adding a new platform
After running python -m server.policy_compiler --platform <Name>:
| Property | Acceptable range | Action if outside |
|---|---|---|
| threshold | [0.005, 0.90] | review; likely cost-signal extraction issue |
| base_rate | [0.0005, 0.05] | review; likely enforcement-rate misread |
| confidence | ≥ 0.60 | force fallback or improve sources |
| primary_signal | one of {photo_reuse, bio_template, ip_cluster, behavior} | coerced to photo_reuse |
| used_fallback | match expectation | ensure Tavily/Groq keys are set if False expected |
Cross-platform ordering is not an invariant. Any platform may land anywhere on the [0.01, 0.95] ΞΈ* scale depending on its actual policy.
18. Outstanding (optional) work
- Platform-specific episode generation: generate_episode accepts a platform arg but produces identical hidden-signal distributions. Parametrize π, signal strengths, and evasion behavior per platform for richer training data.
- TRL/GRPO trainer wrapper: the runner produces (prompt, completion) pairs at DP1/DP2 and per-step rewards. Threading these into a TRL DataCollator is the next step (training-side scope, not part of this readiness pass).
- Force-fallback flag on the CLI: a convenient way to ignore Tavily and use hardcoded params when the sanity check raises low-confidence warnings.
- hybrid_policy.py platform-aware upgrade: the Round-1 rule engine still uses fixed _THRESHOLDS; it could read env._policy.threshold. Low priority since agent/train.py and the eval runner are independent.
- Dashboard: dashboard/DASHBOARD_SPEC.md describes a React + D3 demo; not required.
19. Design decisions (kept from earlier docs, condensed)
- Hidden signals at episode level, not account level: easier to track revelation, cleaner rollback between episodes.
- Platform assignment by seed parity (env): reproducible without extra RNG state; eval runner remaps seeds when --platform is requested.
- Bayesian θ*: principled, explainable, varies sensibly when policy text changes. Action rule is FLAG if risk ≥ θ*.
- Asymmetric tool costs: CHECK_IP is 2× to force the agent to use cheap signals first.
- Cached policies + 30-day TTL: hackathon-demo viable without network; live recompile on staleness.
- Two LLM decision points: keeps the LLM's job focused (tool-pick + flag/skip) and makes (prompt, completion, reward) tuples cleanly attributable for future RL training.
- Top-level decision_package + grader_score: callers shouldn't have to grep the message string for the four submission fields.
20. Known tests / validation
- bash check.sh: 12-step end-to-end against a running server (Round 2 system check).
- test_round2.py: 9-stage Python test against server/environment.py.
- validate.py: 24 HTTP validator checks against a running server.
- eval-models/{model}_test_judge_eval.py: judge model vs. environment scoring with two-decision-point loop.
All four were verified against the current tree on 2026-04-25.