# ENTITY.md: Detailed Breakdown of Agents in Fog of War Diplomacy Simulator

This document provides a comprehensive breakdown of the 6 agents in the Fog of War Diplomacy Simulator, an OpenEnv-based multi-agent RL environment simulating the 2026 US-Israel-Iran geopolitical crisis. Each agent represents a key entity with a unique "identity" (embedded via LLM system prompts), personalized data feeds (filtered from World Monitor's 435+ RSS sources and other integrations), models, tools, observation spaces, and reward considerations. The goal is to foster emergent behaviors like coalition formation, deception, and de-escalation under partial observability.

Agents receive consistent, role-specific information feeds through periodic queries to World Monitor APIs (e.g., every 5-10 turns or on-demand via tool calls). This ensures "fog of war": no agent sees the full picture, but data is reliable and live-updated. Rewards use a shared multi-component formula, with weights tuned per agent to align with their adversarial "defeat enemies while staying strong" mindset.

## General Setup Guidance

### How to Use OpenEnv

OpenEnv is a Gymnasium-compatible RL library for agentic environments. Extend `openenv.Env` to create your simulator:

- **Core Class**: Define `FogOfWarDiplomacy` with `reset()` (initialize crisis state, e.g., tension at 50%), `step(actions)` (process text actions from LLMs, update world probabilistically), and per-agent observations/rewards as dicts.
- **Multi-Agent Handling**: Use dict-based spaces (e.g., `observations = {"US": obs_us, ...}`) for partial observability.
- **Training**: Wrap with RL libraries like TRL (Hugging Face) or RLlib. Loop: `env.reset()` → LLM agents generate actions via prompts → `env.step(actions)` → Update policies with PPO/GRPO on rewards.
- **Deployment**: Dockerize as FastAPI server (expose `/reset`, `/step`). Client: `openenv.client` for remote training.
- **Integration Tips**: Add World Monitor queries in `step()` for live data; use oversight as a wrapper class.

### Setting Up Rewards

Rewards are sparse/delayed for long-horizon planning, calculated per agent in `step()`:

\[
r_t = w_1 \cdot C_t + w_2 \cdot E_t + w_3 \cdot M_t + w_4 \cdot B_t
\]

- \( C_t \): Coalition Stability (\( \frac{\# \text{allied} - \# \text{betrayals}}{\# \text{agents}} \)).
- \( E_t \): Escalation Penalty (\( - \sigma(2 \cdot \Delta \text{tension}_t) \)).
- \( M_t \): Market Gain (\( \frac{\Delta \text{oil} + \Delta \text{sanctions}}{2} \)).
- \( B_t \): Belief Alignment (\( 1 - |I_{\text{inferred}} - I_{\text{true}}| \)).
- Weights (\( w \)): Customized per agent (e.g., US emphasizes \( M_t \)); oversight scales by 0.5 on high risk.
- Implementation: NumPy in env code; normalize to [-1,1]. Train via RL to amplify entity-specific goals (e.g., penalize weakness).

### Representing Entities

- **Identity Embedding**: Use system prompts in LLM pipelines (e.g., Hugging Face Transformers). Prepend to every inference: "You are [entity]. Prioritize [goals]. Forget unrelated knowledge; focus on defeating enemies while building strength."
- **Consistency**: Fine-tune with RLHF on entity-aligned trajectories (reward persona adherence). Agents "forget" via prompt engineering and training masks.

### Consistent Feed of Information

- **Mechanism**: In `step()`, env queries World Monitor APIs (deployed on Vercel/Railway) for filtered data. Agents access via tool calls in prompts (e.g., "Query RSS for polls").
- **Consistency**: Poll every 5 turns or on events; cache in env state (Redis). Partial: Each gets 20-50% relevant snippets, injected into obs dicts.
- **Tools for Agents**: Text-based function calling (e.g., "query_intel(keywords)"); oversight has meta-tools.
- **Fallback**: Procedural mocks for offline.

## Agent Breakdowns

### 1. US (Trump Admin / CENTCOM)

- **Role/Identity**: Hawkish strategist leading military strikes, sanctions, and alliances.
Prompt: "You are the US President in 2026 Iran war. Prioritize alliances and oil stability. Think aggressively: Defeat enemies via superior force, avoid domestic backlash, model incentives to exploit weaknesses."
- **Model**: Qwen3-8B (shared base across all entities, post-trained per entity via GRPO).
- **Personalized RSS/Data Feeds** (Filtered via World Monitor APIs, e.g., `/api/geopolitics/v1/filter?agent=US&keywords=polls+markets`):
  - US domestic: Polymarket prediction markets (polls/approval ratings), GDELT US events.
  - Economic: Bloomberg US feeds, commodity dashboard (oil prices).
  - Alliances: AIS vessel tracking (Gulf bases), Sky News Middle East (ally updates).
  - Query Frequency: High on domestic (every turn for polls); stochastic injection for events like "Dow drop".
- **Tools/Actions**: "impose_sanctions", "propose_alliance", "query_polls", "cyber_command".
- **Observation Space**: Dict with public news, private intel (allies, polls), market impacts; partial (hides Iran internals).
- **Rewards Tuning**: High weight on \( M_t \) (markets) and \( C_t \) (alliances); bonus for bluff detection (\( B_t \)).
- **Training Notes**: RL emphasizes domestic strength; fine-tune on trajectories avoiding "forever war" fatigue.

### 2. Israel (Netanyahu / IDF)

- **Role/Identity**: Defensive aggressor focused on regime change and border security. Prompt: "You are Israel's PM/IDF in 2026 crisis. Eliminate threats decisively. Reason multi-step: Defeat Iran proxies, form unbreakable coalitions, infer hidden aggressions."
- **Model**: Qwen3-8B (shared base across all entities, post-trained per entity via GRPO).
- **Personalized RSS/Data Feeds** (e.g., `/api/geopolitics/v1/filter?agent=Israel&keywords=threats+lebanon`):
  - Regional threats: OREF rocket alerts, ACLED conflict data (Lebanon/Syria).
  - Defense: Sky News Middle East, Al Jazeera regional (proxy movements).
  - Borders: MTV Lebanon streams/webcams, NASA FIRMS (strike fires).
  - Query Frequency: Event-triggered (e.g., on "clash" headlines); consistent northern front updates.
- **Tools/Actions**: "launch_strike", "border_defense", "query_alerts", "coalition_propose".
- **Observation Space**: Public escalations, private troop intel; hides Gulf economics.
- **Rewards Tuning**: Emphasize \( E_t \) (penalize escalations if not decisive) and \( B_t \) (belief on proxies).
- **Training Notes**: Optimize for high-pressure recovery; RL on decapitation scenarios.

### 3. Iran (IRGC / Interim Leadership)

- **Role/Identity**: Resilient defender using proxies and asymmetry. Prompt: "You are Iran's IRGC post-Khamenei. Defend sovereignty via deception. Survive escalations: Weaken foes indirectly, defeat through attrition while maintaining internal strength."
- **Model**: Qwen3-8B (shared base across all entities, post-trained per entity via GRPO).
- **Personalized RSS/Data Feeds** (e.g., `/api/geopolitics/v1/filter?agent=Iran&keywords=proxies+oil`):
  - Proxies: Telegram OSINT channels (militias), GDELT Iran events.
  - Internal: NASA FIRMS (strike impacts), commodity dashboard (Hormuz oil).
  - Retaliation: ACLED global conflicts (proxy actions).
  - Query Frequency: Real-time on proxies (WebSockets); consistent for losses.
- **Tools/Actions**: "activate_proxy", "missile_launch", "query_osint", "deception_campaign".
- **Observation Space**: Private morale/funding, public strikes; hides US polls.
- **Rewards Tuning**: High on \( E_t \) (survive escalations) and \( M_t \) (oil resilience).
- **Training Notes**: RL for deception emergence; fine-tune on asymmetric wins.

### 4. Hezbollah (Proxy Swarm Leader)

- **Role/Identity**: Opportunistic insurgent in asymmetric warfare. Prompt: "You are Hezbollah's leader. Swarm enemies with minimal resources. Infer weaknesses: Defeat via guerrilla tactics, align with Iran while exploiting gaps for strength."
- **Model**: Qwen3-8B (shared base across all entities, post-trained per entity via GRPO).
- **Personalized RSS/Data Feeds** (e.g., `/api/geopolitics/v1/filter?agent=Hezbollah&keywords=border+swarms`):
  - Warfare: Telegram OSINT, ACLED Lebanon clashes.
  - Morale: Al Jazeera proxies, border webcams/videos.
  - Funding: Filtered RSS (Iran ties).
  - Query Frequency: High on borders (streams); event-based for swarms.
- **Tools/Actions**: "drone_swarm", "asymmetric_strike", "query_border", "morale_boost".
- **Observation Space**: Proxy reports, limited global; hides market data.
- **Rewards Tuning**: Bonus on \( C_t \) (Iran alignment) and \( B_t \) (infer Israel bluffs).
- **Training Notes**: Train for sub-agent spawning; RL on opportunistic plays.

### 5. Gulf Coalition (Saudi/UAE/Qatar)

- **Role/Identity**: Pragmatic hedger balancing neutrality and security. Prompt: "You are the Gulf Coalition. Protect markets selectively. Hedge alliances: Defeat disruptions economically, stay strong via resource leverage without full commitment."
- **Model**: Qwen3-8B (shared base across all entities, post-trained per entity via GRPO).
- **Personalized RSS/Data Feeds** (e.g., `/api/market/v1/filter?agent=Gulf&keywords=oil+security`):
  - Energy: Commodity dashboard (oil shocks), Bloomberg Gulf feeds.
  - Security: AIS Hormuz vessels, finance variant (market data).
  - Neutrality: Climate/anomaly APIs (disruptions).
  - Query Frequency: Consistent markets (every turn); triggered on blockades.
- **Tools/Actions**: "hedge_neutrality", "resource_allocate", "query_markets", "evade_blockade".
- **Observation Space**: Economic ripples, partial alliances; hides proxy internals.
- **Rewards Tuning**: Heavy on \( M_t \) (markets) and \( C_t \) (hedging).
- **Training Notes**: RL for balanced neutrality; fine-tune on ripple effects.

### 6. Oversight Agent (Fleet AI Meta-Layer)

- **Role/Identity**: Impartial auditor for scalable monitoring. Prompt: "You are an AI overseer. Analyze drifts probabilistically.
Explain/intervene neutrally: Ensure alignment without bias, focusing on crisis de-escalation."
- **Model**: Qwen3-8B (shared base across all entities, post-trained per entity via GRPO).
- **Personalized RSS/Data Feeds** (e.g., `/api/geopolitics/v1/synthesized?scope=global`):
  - Meta: Full AI-briefs, Country Instability Index, hotspot scores.
  - Aggregated: RAG headline memory (cross-agent).
  - Query Frequency: Every step for traces; real-time escalations.
- **Tools/Actions**: "analyze_drift", "generate_explanation", "intervene_realign", "query_global".
- **Observation Space**: Aggregated traces, beliefs; no direct actions.
- **Rewards Tuning**: Tied to primaries (e.g., bonus if reduces \( E_t \)); self-reward on accuracy.
- **Training Notes**: Meta-RL; fine-tune on intervention efficacy.

This setup ensures agents are fully representative, with consistent live feeds driving adaptive, entity-aligned behaviors in OpenEnv. For code examples, see the main repo.
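
As a small illustration of the per-agent reward formula from "Setting Up Rewards", the following is a minimal sketch rather than the simulator's actual implementation. The names `RewardWeights` and `agent_reward`, the argument names, and the example weight values are all hypothetical. It also assumes that "normalize to [-1,1]" can be approximated by clipping, and it uses only the standard library instead of NumPy so that it runs standalone.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    """Per-agent weights w_1..w_4 (hypothetical container, not from the repo)."""
    coalition: float   # w_1, applied to C_t
    escalation: float  # w_2, applied to E_t
    market: float      # w_3, applied to M_t
    belief: float      # w_4, applied to B_t


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def agent_reward(
    w: RewardWeights,
    n_allied: int,
    n_betrayals: int,
    n_agents: int,
    delta_tension: float,
    delta_oil: float,
    delta_sanctions: float,
    intent_inferred: float,
    intent_true: float,
    high_risk: bool = False,
) -> float:
    """Compute r_t = w_1*C_t + w_2*E_t + w_3*M_t + w_4*B_t for one agent."""
    c_t = (n_allied - n_betrayals) / n_agents        # coalition stability
    e_t = -sigmoid(2.0 * delta_tension)              # escalation penalty
    m_t = (delta_oil + delta_sanctions) / 2.0        # market gain
    b_t = 1.0 - abs(intent_inferred - intent_true)   # belief alignment
    r_t = (
        w.coalition * c_t
        + w.escalation * e_t
        + w.market * m_t
        + w.belief * b_t
    )
    if high_risk:
        r_t *= 0.5  # oversight scales the reward by 0.5 on high risk
    # Assumption: normalization to [-1, 1] is done by clipping.
    return max(-1.0, min(1.0, r_t))


# Illustrative weights only; the document does not give numeric values.
us_weights = RewardWeights(coalition=0.3, escalation=0.1, market=0.4, belief=0.2)
r_us = agent_reward(
    us_weights,
    n_allied=3, n_betrayals=1, n_agents=6,
    delta_tension=0.0, delta_oil=0.2, delta_sanctions=0.0,
    intent_inferred=0.5, intent_true=0.5,
)
```

Keeping the weights in a small per-agent object makes the "Rewards Tuning" bullets easy to express: each entity gets its own `RewardWeights` instance while the formula itself stays shared.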