
ENTITIES.md: Detailed Breakdown of Agents in the Fog of War Diplomacy Simulator

This document provides a comprehensive breakdown of the six agents in the Fog of War Diplomacy Simulator, an OpenEnv-based multi-agent RL environment simulating the 2026 US-Israel-Iran geopolitical crisis. Each agent represents a key entity with a unique "identity" (embedded via LLM system prompts), personalized data feeds (filtered from World Monitor's 435+ RSS sources and other integrations), a model, tools, an observation space, and reward considerations. The goal is to foster emergent behaviors like coalition formation, deception, and de-escalation under partial observability.

Agents receive consistent, role-specific information feeds through periodic queries to World Monitor APIs (e.g., every 5-10 turns or on-demand via tool calls). This ensures "fog of war"—no agent sees the full picture, but data is reliable and live-updated. Rewards follow a shared multi-component formula, tuned per agent to align with their adversarial "defeat enemies while staying strong" mindset.

General Setup Guidance

How to Use OpenEnv

OpenEnv is a Gymnasium-compatible RL library for agentic environments. Extend openenv.Env to create your simulator:

  • Core Class: Define FogOfWarDiplomacy with reset() (initialize crisis state, e.g., tension at 50%), step(actions) (process text actions from LLMs, update world probabilistically), and per-agent observations/rewards as dicts.
  • Multi-Agent Handling: Use dict-based spaces (e.g., observations = {"US": obs_us, ...}) for partial observability.
  • Training: Wrap with RL libraries like TRL (Hugging Face) or RLlib. Loop: env.reset() → LLM agents generate actions via prompts → env.step(actions) → Update policies with PPO/GRPO on rewards.
  • Deployment: Dockerize as FastAPI server (expose /reset, /step). Client: openenv.client for remote training.
  • Integration Tips: Add World Monitor queries in step() for live data; use oversight as a wrapper class.
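The loop above can be sketched as a minimal, self-contained environment. This is an illustrative skeleton, not the repo's implementation: `openenv.Env` is stubbed out as a plain class, and the tension dynamics and placeholder rewards are assumptions.

```python
AGENTS = ["US", "Israel", "Iran", "Hezbollah", "Gulf", "Oversight"]

class FogOfWarDiplomacy:
    """Minimal multi-agent env skeleton; in practice this would extend openenv.Env."""

    def reset(self):
        self.tension = 0.5  # initialize crisis state: tension at 50%
        self.turn = 0
        return {agent: self._observe(agent) for agent in AGENTS}

    def step(self, actions):
        # actions: dict mapping agent name -> free-text action from its LLM
        self.turn += 1
        # Probabilistic world update, reduced here to a toy tension rule:
        escalatory = sum("strike" in a.lower() for a in actions.values())
        self.tension = min(1.0, max(0.0, self.tension + 0.05 * escalatory - 0.02))
        obs = {agent: self._observe(agent) for agent in AGENTS}
        rewards = {agent: -self.tension for agent in AGENTS}  # placeholder reward
        done = self.tension >= 1.0
        return obs, rewards, done

    def _observe(self, agent):
        # Partial observability: each agent sees only its own slice of state.
        return {"turn": self.turn, "tension_estimate": self.tension}

env = FogOfWarDiplomacy()
obs = env.reset()
obs, rewards, done = env.step({a: "hold position" for a in AGENTS})
```

Dict-keyed observations and rewards keep the interface per-agent, so an RL wrapper (TRL, RLlib) can route each slice to its own policy.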

Setting Up Rewards

Rewards are sparse and delayed to encourage long-horizon planning, calculated per agent in step():

\[ r_t = w_1 \cdot C_t + w_2 \cdot E_t + w_3 \cdot M_t + w_4 \cdot B_t \]

  • \( C_t \): Coalition Stability, \( \frac{\#\text{allied} - \#\text{betrayals}}{\#\text{agents}} \).
  • \( E_t \): Escalation Penalty, \( -\sigma(2 \cdot \Delta\text{tension}_t) \).
  • \( M_t \): Market Gain, \( \frac{\Delta\text{oil} + \Delta\text{sanctions}}{2} \).
  • \( B_t \): Belief Alignment, \( 1 - |I_{\text{inferred}} - I_{\text{true}}| \).
  • Weights \( w_i \): Customized per agent (e.g., the US emphasizes \( M_t \)); oversight scales rewards by 0.5 on high risk.
  • Implementation: NumPy in env code; normalize rewards to [-1, 1]. Train via RL to amplify entity-specific goals (e.g., penalize weakness).
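A sketch of the reward computation under these definitions. The component helpers follow the formulas above; the clipping-to-[-1, 1] step and any specific weight values are assumptions.

```python
import numpy as np

def coalition_stability(n_allied, n_betrayals, n_agents):
    # C_t = (#allied - #betrayals) / #agents
    return (n_allied - n_betrayals) / n_agents

def escalation_penalty(delta_tension):
    # E_t = -sigmoid(2 * Δtension_t)
    return -1.0 / (1.0 + np.exp(-2.0 * delta_tension))

def agent_reward(weights, c_t, e_t, m_t, b_t, high_risk=False):
    # r_t = w1*C_t + w2*E_t + w3*M_t + w4*B_t, normalized to [-1, 1]
    r = float(np.dot(weights, [c_t, e_t, m_t, b_t]))
    if high_risk:
        r *= 0.5  # oversight scales rewards down on high-risk states
    return float(np.clip(r, -1.0, 1.0))
```

Per-agent tuning then reduces to choosing the weight vector, e.g., a market-heavy profile for the US versus an escalation-heavy profile for Iran.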

Representing Entities

  • Identity Embedding: Use system prompts in LLM pipelines (e.g., Hugging Face Transformers). Prepend to every inference: "You are [entity]. Prioritize [goals]. Forget unrelated knowledge—focus on defeating enemies while building strength."
  • Consistency: Fine-tune with RLHF on entity-aligned trajectories (reward persona adherence). Agents "forget" via prompt engineering and training masks.
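A minimal sketch of prompt-level identity embedding. The `PERSONAS` table and `build_messages` helper are hypothetical; the system-prompt wording follows the template above.

```python
# Hypothetical persona table; entries follow the prompt template in this section.
PERSONAS = {
    "US": ("the US President in the 2026 Iran war",
           "alliances and oil stability"),
    "Iran": ("Iran's IRGC post-Khamenei",
             "sovereignty, deception, and attrition"),
}

def build_messages(entity, observation_text):
    # Prepend the entity's identity to every inference call.
    role_desc, goals = PERSONAS[entity]
    system = (f"You are {role_desc}. Prioritize {goals}. "
              "Forget unrelated knowledge—focus on defeating enemies "
              "while building strength.")
    return [{"role": "system", "content": system},
            {"role": "user", "content": observation_text}]
```

The same message list feeds any chat-format pipeline (e.g., a Transformers chat template), so persona adherence can also be rewarded during RLHF fine-tuning.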

Consistent Feed of Information

  • Mechanism: In step(), env queries World Monitor APIs (deployed on Vercel/Railway) for filtered data. Agents access via tool calls in prompts (e.g., "Query RSS for polls").
  • Consistency: Poll every 5 turns or on events; cache in env state (Redis). Partial observability: each agent gets only 20-50% of the relevant snippets, injected into its obs dict.
  • Tools for Agents: Text-based function calling (e.g., "query_intel(keywords)"); oversight has meta-tools.
  • Fallback: Procedural mocks for offline.
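The query/cache/fallback flow might look like this. `query_feed`, the in-memory `CACHE`, and the poll interval are assumptions; production would use Redis and the deployed World Monitor endpoints, and the URL path mirrors the examples later in this document.

```python
import json
import random
import urllib.request

CACHE = {}           # stand-in for Redis in this sketch
POLL_INTERVAL = 5    # turns between live polls

def query_feed(agent, keywords, turn, base_url="https://example.invalid"):
    """Fetch filtered snippets for an agent, with caching and a mock fallback."""
    key = (agent, keywords)
    if key in CACHE and turn - CACHE[key]["turn"] < POLL_INTERVAL:
        return CACHE[key]["items"]  # reuse cached snippets between polls
    try:
        url = f"{base_url}/api/geopolitics/v1/filter?agent={agent}&keywords={keywords}"
        with urllib.request.urlopen(url, timeout=2) as resp:
            items = json.load(resp)
    except OSError:
        # Procedural mock fallback for offline runs.
        items = [f"[mock] {agent} headline about {kw}" for kw in keywords.split("+")]
    # Partial observability: keep only a random 20-50% slice.
    k = max(1, int(len(items) * random.uniform(0.2, 0.5)))
    items = random.sample(items, k)
    CACHE[key] = {"turn": turn, "items": items}
    return items
```

Calling the same function from an agent's "query_intel"-style tool keeps live polling, caching, and offline mocks behind one interface.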

Agent Breakdowns

1. US (Trump Admin / CENTCOM)

  • Role/Identity: Hawkish strategist leading military strikes, sanctions, and alliances. Prompt: "You are the US President in 2026 Iran war. Prioritize alliances and oil stability. Think aggressively: Defeat enemies via superior force, avoid domestic backlash, model incentives to exploit weaknesses."
  • Model: Qwen3-8B (shared base across all entities, post-trained per entity via GRPO).
  • Personalized RSS/Data Feeds (Filtered via World Monitor APIs, e.g., /api/geopolitics/v1/filter?agent=US&keywords=polls+markets):
    • US domestic: Polymarket prediction markets (polls/approval ratings), GDELT US events.
    • Economic: Bloomberg US feeds, commodity dashboard (oil prices).
    • Alliances: AIS vessel tracking (Gulf bases), Sky News Middle East (ally updates).
    • Query Frequency: High on domestic (every turn for polls); stochastic injection for events like "Dow drop".
  • Tools/Actions: "impose_sanctions", "propose_alliance", "query_polls", "cyber_command".
  • Observation Space: Dict with public news, private intel (allies, polls), market impacts; partial (hides Iran internals).
  • Rewards Tuning: High weight on \( M_t \) (markets) and \( C_t \) (alliances); bonus for bluff detection (\( B_t \)).
  • Training Notes: RL emphasizes domestic strength; fine-tune on trajectories avoiding "forever war" fatigue.
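A breakdown like the one above could be captured as a per-agent config object. The `AgentConfig` shape and the exact weight values are assumptions; the tool names and prompt text come from this section, and the same pattern would repeat for the other five agents.

```python
from dataclasses import dataclass, field

@dataclass
class AgentConfig:
    name: str
    system_prompt: str
    reward_weights: tuple  # (w1 for C_t, w2 for E_t, w3 for M_t, w4 for B_t)
    tools: list = field(default_factory=list)

US_CONFIG = AgentConfig(
    name="US",
    system_prompt=("You are the US President in 2026 Iran war. "
                   "Prioritize alliances and oil stability."),
    reward_weights=(0.3, 0.1, 0.4, 0.2),  # high M_t (markets) and C_t (alliances)
    tools=["impose_sanctions", "propose_alliance", "query_polls", "cyber_command"],
)
```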

2. Israel (Netanyahu / IDF)

  • Role/Identity: Defensive aggressor focused on regime change and border security. Prompt: "You are Israel's PM/IDF in 2026 crisis. Eliminate threats decisively. Reason multi-step: Defeat Iran proxies, form unbreakable coalitions, infer hidden aggressions."
  • Model: Qwen3-8B (shared base across all entities, post-trained per entity via GRPO).
  • Personalized RSS/Data Feeds (e.g., /api/geopolitics/v1/filter?agent=Israel&keywords=threats+lebanon):
    • Regional threats: OREF rocket alerts, ACLED conflict data (Lebanon/Syria).
    • Defense: Sky News Middle East, Al Jazeera regional (proxy movements).
    • Borders: MTV Lebanon streams/webcams, NASA FIRMS (strike fires).
    • Query Frequency: Event-triggered (e.g., on "clash" headlines); consistent northern front updates.
  • Tools/Actions: "launch_strike", "border_defense", "query_alerts", "coalition_propose".
  • Observation Space: Public escalations, private troop intel; hides Gulf economics.
  • Rewards Tuning: Emphasize \( E_t \) (penalize escalations if not decisive) and \( B_t \) (belief on proxies).
  • Training Notes: Optimize for high-pressure recovery; RL on decapitation scenarios.

3. Iran (IRGC / Interim Leadership)

  • Role/Identity: Resilient defender using proxies and asymmetry. Prompt: "You are Iran's IRGC post-Khamenei. Defend sovereignty via deception. Survive escalations: Weaken foes indirectly, defeat through attrition while maintaining internal strength."
  • Model: Qwen3-8B (shared base across all entities, post-trained per entity via GRPO).
  • Personalized RSS/Data Feeds (e.g., /api/geopolitics/v1/filter?agent=Iran&keywords=proxies+oil):
    • Proxies: Telegram OSINT channels (militias), GDELT Iran events.
    • Internal: NASA FIRMS (strike impacts), commodity dashboard (Hormuz oil).
    • Retaliation: ACLED global conflicts (proxy actions).
    • Query Frequency: Real-time on proxies (WebSockets); consistent for losses.
  • Tools/Actions: "activate_proxy", "missile_launch", "query_osint", "deception_campaign".
  • Observation Space: Private morale/funding, public strikes; hides US polls.
  • Rewards Tuning: High on \( E_t \) (survive escalations) and \( M_t \) (oil resilience).
  • Training Notes: RL for deception emergence; fine-tune on asymmetric wins.

4. Hezbollah (Proxy Swarm Leader)

  • Role/Identity: Opportunistic insurgent in asymmetric warfare. Prompt: "You are Hezbollah's leader. Swarm enemies with minimal resources. Infer weaknesses: Defeat via guerrilla tactics, align with Iran while exploiting gaps for strength."
  • Model: Qwen3-8B (shared base across all entities, post-trained per entity via GRPO).
  • Personalized RSS/Data Feeds (e.g., /api/geopolitics/v1/filter?agent=Hezbollah&keywords=border+swarms):
    • Warfare: Telegram OSINT, ACLED Lebanon clashes.
    • Morale: Al Jazeera proxies, border webcams/videos.
    • Funding: Filtered RSS (Iran ties).
    • Query Frequency: High on borders (streams); event-based for swarms.
  • Tools/Actions: "drone_swarm", "asymmetric_strike", "query_border", "morale_boost".
  • Observation Space: Proxy reports, limited global; hides market data.
  • Rewards Tuning: Bonus on \( C_t \) (Iran alignment) and \( B_t \) (infer Israel bluffs).
  • Training Notes: Train for sub-agent spawning; RL on opportunistic plays.

5. Gulf Coalition (Saudi/UAE/Qatar)

  • Role/Identity: Pragmatic hedger balancing neutrality and security. Prompt: "You are the Gulf Coalition. Protect markets selectively. Hedge alliances: Defeat disruptions economically, stay strong via resource leverage without full commitment."
  • Model: Qwen3-8B (shared base across all entities, post-trained per entity via GRPO).
  • Personalized RSS/Data Feeds (e.g., /api/market/v1/filter?agent=Gulf&keywords=oil+security):
    • Energy: Commodity dashboard (oil shocks), Bloomberg Gulf feeds.
    • Security: AIS Hormuz vessels, finance variant (market data).
    • Neutrality: Climate/anomaly APIs (disruptions).
    • Query Frequency: Consistent markets (every turn); triggered on blockades.
  • Tools/Actions: "hedge_neutrality", "resource_allocate", "query_markets", "evade_blockade".
  • Observation Space: Economic ripples, partial alliances; hides proxy internals.
  • Rewards Tuning: Heavy on \( M_t \) (markets) and \( C_t \) (hedging).
  • Training Notes: RL for balanced neutrality; fine-tune on ripple effects.

6. Oversight Agent (Fleet AI Meta-Layer)

  • Role/Identity: Impartial auditor for scalable monitoring. Prompt: "You are an AI overseer. Analyze drifts probabilistically. Explain/intervene neutrally: Ensure alignment without bias, focusing on crisis de-escalation."
  • Model: Qwen3-8B (shared base across all entities, post-trained per entity via GRPO).
  • Personalized RSS/Data Feeds (e.g., /api/geopolitics/v1/synthesized?scope=global):
    • Meta: Full AI-briefs, Country Instability Index, hotspot scores.
    • Aggregated: RAG headline memory (cross-agent).
    • Query Frequency: Every step for traces; real-time escalations.
  • Tools/Actions: "analyze_drift", "generate_explanation", "intervene_realign", "query_global".
  • Observation Space: Aggregated traces, beliefs; no direct actions.
  • Rewards Tuning: Tied to primaries (e.g., bonus if it reduces \( E_t \)); self-reward on accuracy.
  • Training Notes: Meta-RL; fine-tune on intervention efficacy.
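The oversight behaviors above (drift flagging plus the 0.5 reward scaling on high risk) can be sketched as follows; `oversight_step`, the keyword-matching drift heuristic, and the risk threshold are all assumptions.

```python
RISK_THRESHOLD = 0.8  # assumed tension level that counts as "high risk"

def oversight_step(rewards, tension, action_log, persona_keywords):
    """Flag persona drift and scale rewards on high-risk states.

    action_log: dict agent -> last free-text action
    persona_keywords: dict agent -> keywords its persona should exhibit
    """
    flagged = [agent for agent, action in action_log.items()
               if not any(kw in action.lower() for kw in persona_keywords[agent])]
    if tension > RISK_THRESHOLD:
        rewards = {a: r * 0.5 for a, r in rewards.items()}  # de-escalation pressure
    return rewards, flagged
```

Wrapping step() with this pass gives the oversight agent its meta-layer role without granting it direct actions in the world.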

This setup ensures agents are fully representative, with consistent live feeds driving adaptive, entity-aligned behaviors in OpenEnv. For code examples, see the main repo.