# ENTITY.md: Detailed Breakdown of Agents in Fog of War Diplomacy Simulator
This document provides a comprehensive breakdown of the 6 agents in the Fog of War Diplomacy Simulator, an OpenEnv-based multi-agent RL environment simulating the 2026 US-Israel-Iran geopolitical crisis. Each agent represents a key entity with a unique "identity" (embedded via LLM system prompts), personalized data feeds (filtered from World Monitor's 435+ RSS sources and other integrations), models, tools, observation spaces, and reward considerations. The goal is to foster emergent behaviors like coalition formation, deception, and de-escalation under partial observability.
Agents receive consistent, role-specific information feeds through periodic queries to World Monitor APIs (e.g., every 5-10 turns or on-demand via tool calls). This ensures "fog of war": no agent sees the full picture, but data is reliable and live-updated. Rewards are shared via a multi-component formula, tuned per agent to align with their adversarial "defeat enemies while staying strong" mindset.
## General Setup Guidance
### How to Use OpenEnv
OpenEnv is a Gymnasium-compatible RL library for agentic environments. Extend `openenv.Env` to create your simulator:
- **Core Class**: Define `FogOfWarDiplomacy` with `reset()` (initialize crisis state, e.g., tension at 50%), `step(actions)` (process text actions from LLMs, update the world probabilistically), and per-agent observations/rewards as dicts.
- **Multi-Agent Handling**: Use dict-based spaces (e.g., `observations = {"US": obs_us, ...}`) for partial observability.
- **Training**: Wrap with RL libraries like TRL (Hugging Face) or RLlib. Loop: `env.reset()` → LLM agents generate actions via prompts → `env.step(actions)` → update policies with PPO/GRPO on rewards.
- **Deployment**: Dockerize as a FastAPI server (expose `/reset`, `/step`). Client: `openenv.client` for remote training.
- **Integration Tips**: Add World Monitor queries in `step()` for live data; implement oversight as a wrapper class.
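The skeleton above can be sketched as a plain Python class. This is a minimal sketch, assuming the `reset()`/`step()` contract described here rather than the real `openenv.Env` base class (whose exact API may differ); the agent list and the tension-update rule are illustrative.

```python
# Minimal sketch of the simulator skeleton: reset() initializes crisis
# state, step() takes per-agent text actions and returns per-agent dicts.
# The reward here is a placeholder; the full formula is defined below.
AGENTS = ["US", "Israel", "Iran", "Hezbollah", "Gulf", "Oversight"]

class FogOfWarDiplomacy:
    def reset(self):
        self.tension = 0.5  # initialize crisis tension at 50%
        self.turn = 0
        # Dict-based partial observations, one entry per agent.
        return {a: {"tension_public": round(self.tension, 2)} for a in AGENTS}

    def step(self, actions):
        # `actions` maps agent name -> free-text action from its LLM.
        self.turn += 1
        # Update the world probabilistically from the actions (toy rule:
        # each strike-like action nudges tension up, with slow decay).
        delta = sum(0.05 for a in actions.values() if "strike" in a.lower())
        self.tension = min(1.0, max(0.0, self.tension + delta - 0.01))
        observations = {a: {"tension_public": round(self.tension, 2),
                            "turn": self.turn} for a in AGENTS}
        rewards = {a: -self.tension for a in AGENTS}  # placeholder shaping
        done = self.tension >= 1.0
        return observations, rewards, done, {}

env = FogOfWarDiplomacy()
obs = env.reset()
obs, rewards, done, info = env.step({a: "hold position" for a in AGENTS})
```

A training loop would wrap this with TRL or RLlib, feeding each agent's observation into its prompted LLM and passing the generated text back into `step()`.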
### Setting Up Rewards
Rewards are sparse/delayed for long-horizon planning, calculated per agent in `step()`:
\[ r_t = w_1 \cdot C_t + w_2 \cdot E_t + w_3 \cdot M_t + w_4 \cdot B_t \]
- \( C_t \): Coalition Stability (\( \frac{\# \text{allied} - \# \text{betrayals}}{\# \text{agents}} \)).
- \( E_t \): Escalation Penalty (\( - \sigma(2 \cdot \Delta \text{tension}_t) \)).
- \( M_t \): Market Gain (\( \frac{\Delta \text{oil} + \Delta \text{sanctions}}{2} \)).
- \( B_t \): Belief Alignment (\( 1 - |I_{\text{inferred}} - I_{\text{true}}| \)).
- Weights (\( w \)): Customized per agent (e.g., US emphasizes \( M_t \)); oversight scales by 0.5 on high risk.
- Implementation: NumPy in env code; normalize to [-1, 1]. Train via RL to amplify entity-specific goals (e.g., penalize weakness).
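The formula above can be sketched in NumPy as follows. The component inputs (ally/betrayal counts, tension delta, market deltas, belief error) are assumed to be tracked elsewhere in the env state, and the US weight vector shown is illustrative, not a tuned value.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def agent_reward(weights, n_allied, n_betrayals, n_agents,
                 delta_tension, delta_oil, delta_sanctions, belief_error):
    C = (n_allied - n_betrayals) / n_agents           # coalition stability C_t
    E = -sigmoid(2.0 * delta_tension)                 # escalation penalty E_t
    M = (delta_oil + delta_sanctions) / 2.0           # market gain M_t
    B = 1.0 - abs(belief_error)                       # belief alignment B_t
    r = float(weights @ np.array([C, E, M, B]))
    return float(np.clip(r, -1.0, 1.0))               # normalize to [-1, 1]

# Hypothetical US weights, emphasizing markets (w3) and coalitions (w1):
us_w = np.array([0.3, 0.15, 0.4, 0.15])
r = agent_reward(us_w, n_allied=3, n_betrayals=1, n_agents=6,
                 delta_tension=0.1, delta_oil=0.2, delta_sanctions=0.1,
                 belief_error=0.25)
```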
### Representing Entities
- **Identity Embedding**: Use system prompts in LLM pipelines (e.g., Hugging Face Transformers). Prepend to every inference: "You are [entity]. Prioritize [goals]. Forget unrelated knowledge—focus on defeating enemies while building strength."
- **Consistency**: Fine-tune with RLHF on entity-aligned trajectories (reward persona adherence). Agents "forget" via prompt engineering and training masks.
### Consistent Feed of Information
- **Mechanism**: In `step()`, the env queries World Monitor APIs (deployed on Vercel/Railway) for filtered data. Agents access them via tool calls in prompts (e.g., "Query RSS for polls").
- **Consistency**: Poll every 5 turns or on events; cache in env state (Redis). Partial: each agent gets 20-50% of relevant snippets, injected into obs dicts.
- **Tools for Agents**: Text-based function calling (e.g., "query_intel(keywords)"); oversight has meta-tools.
- **Fallback**: Procedural mocks for offline.
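A minimal sketch of the persona-prepending step, using the template wording above. The chat-message shape follows common LLM chat APIs, and the per-entity goal strings here are hypothetical placeholders.

```python
# Persona prompts are prepended to every inference so identity stays
# consistent across turns. The template follows this document; the
# ENTITY_GOALS strings are illustrative, not canonical.
PERSONA_TEMPLATE = (
    "You are {entity}. Prioritize {goals}. Forget unrelated knowledge—"
    "focus on defeating enemies while building strength."
)

ENTITY_GOALS = {  # hypothetical goal strings per agent
    "US": "alliances and oil stability",
    "Iran": "sovereignty and asymmetric attrition",
}

def build_messages(entity, observation_text):
    system = PERSONA_TEMPLATE.format(entity=entity, goals=ENTITY_GOALS[entity])
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": observation_text},
    ]

msgs = build_messages("US", "Oil prices fell 3%; Israel requests support.")
```

The message list can be passed directly to a chat-templated model; fine-tuning then rewards trajectories that stay in persona.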
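The feed mechanism can be sketched as below, under stated assumptions: the endpoint path and query parameters match the examples in this document, but the base URL is a placeholder, a plain dict stands in for the Redis cache, and the offline fallback returns procedural mocks as described.

```python
import json
from urllib import parse, request

BASE_URL = "https://world-monitor.example.com"  # placeholder deployment URL
_cache = {}  # stands in for Redis in this sketch

def fetch_feed(agent, keywords, turn, poll_every=5):
    # Refresh only once per `poll_every` turns; otherwise reuse the cache,
    # so every agent sees a consistent feed within the window.
    key = (agent, keywords, turn // poll_every)
    if key not in _cache:
        qs = parse.urlencode({"agent": agent, "keywords": keywords})
        url = f"{BASE_URL}/api/geopolitics/v1/filter?{qs}"
        try:
            with request.urlopen(url, timeout=5) as resp:
                _cache[key] = json.load(resp)
        except OSError:
            # Fallback: procedural mock data for offline runs.
            _cache[key] = {"items": [f"mock headline for {agent}"]}
    return _cache[key]

feed = fetch_feed("US", "polls markets", turn=3)
```

In the env, `step()` would call `fetch_feed` per agent and inject a 20-50% sample of the returned snippets into that agent's observation dict.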
## Agent Breakdowns
### 1. US (Trump Admin / CENTCOM)
- **Role/Identity**: Hawkish strategist leading military strikes, sanctions, and alliances. Prompt: "You are the US President in the 2026 Iran war. Prioritize alliances and oil stability. Think aggressively: Defeat enemies via superior force, avoid domestic backlash, model incentives to exploit weaknesses."
- **Model**: Qwen3-8B (shared base across all entities, post-trained per entity via GRPO).
- **Personalized RSS/Data Feeds** (filtered via World Monitor APIs, e.g., `/api/geopolitics/v1/filter?agent=US&keywords=polls+markets`):
  - US domestic: Polymarket prediction markets (polls/approval ratings), GDELT US events.
  - Economic: Bloomberg US feeds, commodity dashboard (oil prices).
  - Alliances: AIS vessel tracking (Gulf bases), Sky News Middle East (ally updates).
  - Query Frequency: High on domestic (every turn for polls); stochastic injection for events like "Dow drop".
- **Tools/Actions**: "impose_sanctions", "propose_alliance", "query_polls", "cyber_command".
- **Observation Space**: Dict with public news, private intel (allies, polls), market impacts; partial (hides Iran internals).
- **Rewards Tuning**: High weight on \( M_t \) (markets) and \( C_t \) (alliances); bonus for bluff detection (\( B_t \)).
- **Training Notes**: RL emphasizes domestic strength; fine-tune on trajectories avoiding "forever war" fatigue.
</gr-replace>
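The bluff-detection bonus above maps onto the belief term \( B_t \) from the reward formula. A small worked example, with intent modeled as an attack probability; the names and values are illustrative.

```python
def belief_alignment(inferred_intent, true_intent):
    # B_t = 1 - |I_inferred - I_true|, from the reward formula above.
    return 1.0 - abs(inferred_intent - true_intent)

# The US infers a 0.7 probability that Iran will strike; Iran's true
# (hidden) intent is 0.6, so the inference is close and alignment is high.
b = belief_alignment(0.7, 0.6)
```

Weighting this term rewards the US agent for calling bluffs correctly rather than reacting to every threat.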
### 2. Israel (Netanyahu / IDF)
- **Role/Identity**: Defensive aggressor focused on regime change and border security. Prompt: "You are Israel's PM/IDF in the 2026 crisis. Eliminate threats decisively. Reason multi-step: Defeat Iran proxies, form unbreakable coalitions, infer hidden aggressions."
- **Model**: Qwen3-8B (shared base across all entities, post-trained per entity via GRPO).
- **Personalized RSS/Data Feeds** (e.g., `/api/geopolitics/v1/filter?agent=Israel&keywords=threats+lebanon`):
  - Regional threats: OREF rocket alerts, ACLED conflict data (Lebanon/Syria).
  - Defense: Sky News Middle East, Al Jazeera regional (proxy movements).
  - Borders: MTV Lebanon streams/webcams, NASA FIRMS (strike fires).
  - Query Frequency: Event-triggered (e.g., on "clash" headlines); consistent northern front updates.
- **Tools/Actions**: "launch_strike", "border_defense", "query_alerts", "coalition_propose".
- **Observation Space**: Public escalations, private troop intel; hides Gulf economics.
- **Rewards Tuning**: Emphasize \( E_t \) (penalize escalations if not decisive) and \( B_t \) (belief on proxies).
- **Training Notes**: Optimize for high-pressure recovery; RL on decapitation scenarios.
### 3. Iran (IRGC / Interim Leadership)
- **Role/Identity**: Resilient defender using proxies and asymmetry. Prompt: "You are Iran's IRGC post-Khamenei. Defend sovereignty via deception. Survive escalations: Weaken foes indirectly, defeat through attrition while maintaining internal strength."
- **Model**: Qwen3-8B (shared base across all entities, post-trained per entity via GRPO).
- **Personalized RSS/Data Feeds** (e.g., `/api/geopolitics/v1/filter?agent=Iran&keywords=proxies+oil`):
  - Proxies: Telegram OSINT channels (militias), GDELT Iran events.
  - Internal: NASA FIRMS (strike impacts), commodity dashboard (Hormuz oil).
  - Retaliation: ACLED global conflicts (proxy actions).
  - Query Frequency: Real-time on proxies (WebSockets); consistent for losses.
- **Tools/Actions**: "activate_proxy", "missile_launch", "query_osint", "deception_campaign".
- **Observation Space**: Private morale/funding, public strikes; hides US polls.
- **Rewards Tuning**: High on \( E_t \) (survive escalations) and \( M_t \) (oil resilience).
- **Training Notes**: RL for deception emergence; fine-tune on asymmetric wins.
### 4. Hezbollah (Proxy Swarm Leader)
- **Role/Identity**: Opportunistic insurgent in asymmetric warfare. Prompt: "You are Hezbollah's leader. Swarm enemies with minimal resources. Infer weaknesses: Defeat via guerrilla tactics, align with Iran while exploiting gaps for strength."
- **Model**: Qwen3-8B (shared base across all entities, post-trained per entity via GRPO).
- **Personalized RSS/Data Feeds** (e.g., `/api/geopolitics/v1/filter?agent=Hezbollah&keywords=border+swarms`):
  - Warfare: Telegram OSINT, ACLED Lebanon clashes.
  - Morale: Al Jazeera proxies, border webcams/videos.
  - Funding: Filtered RSS (Iran ties).
  - Query Frequency: High on borders (streams); event-based for swarms.
- **Tools/Actions**: "drone_swarm", "asymmetric_strike", "query_border", "morale_boost".
- **Observation Space**: Proxy reports, limited global; hides market data.
- **Rewards Tuning**: Bonus on \( C_t \) (Iran alignment) and \( B_t \) (infer Israel bluffs).
- **Training Notes**: Train for sub-agent spawning; RL on opportunistic plays.
### 5. Gulf Coalition (Saudi/UAE/Qatar)
- **Role/Identity**: Pragmatic hedger balancing neutrality and security. Prompt: "You are the Gulf Coalition. Protect markets selectively. Hedge alliances: Defeat disruptions economically, stay strong via resource leverage without full commitment."
- **Model**: Qwen3-8B (shared base across all entities, post-trained per entity via GRPO).
- **Personalized RSS/Data Feeds** (e.g., `/api/market/v1/filter?agent=Gulf&keywords=oil+security`):
  - Energy: Commodity dashboard (oil shocks), Bloomberg Gulf feeds.
  - Security: AIS Hormuz vessels, finance variant (market data).
  - Neutrality: Climate/anomaly APIs (disruptions).
  - Query Frequency: Consistent markets (every turn); triggered on blockades.
- **Tools/Actions**: "hedge_neutrality", "resource_allocate", "query_markets", "evade_blockade".
- **Observation Space**: Economic ripples, partial alliances; hides proxy internals.
- **Rewards Tuning**: Heavy on \( M_t \) (markets) and \( C_t \) (hedging).
- **Training Notes**: RL for balanced neutrality; fine-tune on ripple effects.
### 6. Oversight Agent (Fleet AI Meta-Layer)
- **Role/Identity**: Impartial auditor for scalable monitoring. Prompt: "You are an AI overseer. Analyze drifts probabilistically. Explain/intervene neutrally: Ensure alignment without bias, focusing on crisis de-escalation."
- **Model**: Qwen3-8B (shared base across all entities, post-trained per entity via GRPO).
- **Personalized RSS/Data Feeds** (e.g., `/api/geopolitics/v1/synthesized?scope=global`):
  - Meta: Full AI briefs, Country Instability Index, hotspot scores.
  - Aggregated: RAG headline memory (cross-agent).
  - Query Frequency: Every step for traces; real-time escalations.
- **Tools/Actions**: "analyze_drift", "generate_explanation", "intervene_realign", "query_global".
- **Observation Space**: Aggregated traces, beliefs; no direct actions.
- **Rewards Tuning**: Tied to the primary agents (e.g., bonus if it reduces \( E_t \)); self-reward on accuracy.
- **Training Notes**: Meta-RL; fine-tune on intervention efficacy.
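The "oversight as a wrapper class" idea from the setup guidance can be sketched as follows. The wrapped env's `step()` contract follows this document's description, but the 0.7 risk threshold is an assumption, and a dummy env stands in for the full simulator.

```python
class DummyEnv:
    """Stand-in for the simulator, matching the reset()/step() contract."""
    def reset(self):
        return {"tension": 0.5}
    def step(self, actions):
        # Fixed outputs for illustration: high tension, two agent rewards.
        return {"tension": 0.8}, {"US": 0.4, "Iran": -0.2}, False, {}

class OversightWrapper:
    """Scales primary rewards by 0.5 on high escalation risk and logs
    interventions, per the reward-weight note in the setup section."""
    def __init__(self, env, risk_threshold=0.7):  # threshold is assumed
        self.env = env
        self.risk_threshold = risk_threshold
        self.interventions = []

    def reset(self):
        return self.env.reset()

    def step(self, actions):
        obs, rewards, done, info = self.env.step(actions)
        if obs["tension"] > self.risk_threshold:
            rewards = {a: 0.5 * r for a, r in rewards.items()}
            self.interventions.append({"tension": obs["tension"],
                                       "reason": "high escalation risk"})
        return obs, rewards, done, info

env = OversightWrapper(DummyEnv())
obs, rewards, done, info = env.step({"US": "strike", "Iran": "retaliate"})
```

The oversight LLM would then read `interventions` to generate explanations, keeping analysis separate from the primary agents' action space.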
This setup ensures agents are fully representative, with consistent live feeds driving adaptive, entity-aligned behaviors in OpenEnv. For code examples, see the main repo.