Add files using upload-large-folder tool

Files changed (12)

seed_1/Qwen/Qwen2.5-7B-Instruct/adapters/agent_adapter/adapter_model.safetensors +3 -0
seed_1/Qwen/Qwen2.5-7B-Instruct/adapters/critic_adapter/adapter_model.safetensors +3 -0
seed_1/agent_trainer/critic_optimizer_state.pt +3 -0
seed_1/agent_trainer/policy_optimizer_state.pt +3 -0
seed_1/agent_trainer/trainer_annealing_state.pkl +3 -0
seed_1/random_state.pkl +3 -0
src_code_for_reproducibility/chat_utils/__pycache__/apply_template.cpython-312.pyc +0 -0
src_code_for_reproducibility/chat_utils/__pycache__/chat_turn.cpython-312.pyc +0 -0
src_code_for_reproducibility/chat_utils/__pycache__/template_specific.cpython-312.pyc +0 -0
src_code_for_reproducibility/docs/source/environments/diplomacy.rst +459 -0
src_code_for_reproducibility/docs/source/environments/dond.rst +410 -0
src_code_for_reproducibility/docs/source/environments/ipd.rst +411 -0

seed_1/Qwen/Qwen2.5-7B-Instruct/adapters/agent_adapter/adapter_model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:2e32664941be2e0c6817b127b010dd4a1eb8f08cdd60de3255cd4f68b332c0a1
+size 323014168

seed_1/Qwen/Qwen2.5-7B-Instruct/adapters/critic_adapter/adapter_model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:4c44c3464099d92dfebb2b132524339800fbf19760b378a02c3c527ac3380b88
+size 323014168

seed_1/agent_trainer/critic_optimizer_state.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:f1574fdb90735a922b09c67d07f7abdbd51181f00dc7bed878cb80adb5f50c1d
+size 2631

seed_1/agent_trainer/policy_optimizer_state.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:d9914eb6446f3a8cb8587ca443e615f09881d4b2b8a6d5ef372a95dc28fa8eca
+size 646269121

seed_1/agent_trainer/trainer_annealing_state.pkl ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:6c07eb90c6fa0603ca8c306fdea0966b1ebd2bffa8f2b5689b7a02a9ea64d470
+size 104

seed_1/random_state.pkl ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:f28589ab454f5d36c998ec8da2108bb512ec481844561988171a71dee238d6e7
+size 12218

src_code_for_reproducibility/chat_utils/__pycache__/apply_template.cpython-312.pyc ADDED

Binary file (3.93 kB).

src_code_for_reproducibility/chat_utils/__pycache__/chat_turn.cpython-312.pyc ADDED

Binary file (1.32 kB).

src_code_for_reproducibility/chat_utils/__pycache__/template_specific.cpython-312.pyc ADDED

Binary file (4.24 kB).

src_code_for_reproducibility/docs/source/environments/diplomacy.rst ADDED

	@@ -0,0 +1,459 @@

+=================
+Diplomacy
+=================
+The Diplomacy environment provides a multi-agent negotiation interface for the classic board game Diplomacy,
+based on DeepMind's implementation. This document describes the API for interacting with the Diplomacy environment
+and its associated agent handler.
+Overview
+--------
+Diplomacy is a strategic board game set in Europe before World War I, in which each player controls one of seven European powers
+and negotiates with the others to gain control of supply centers. Each game year is divided into movement, retreat,
+and build (adjustment) phases.
+Our implementation adapts DeepMind's Diplomacy code to the Multi-Agent Negotiation Environment standard, allowing it
+to be used with LLM agents through a text-based interface.
+Game Rules
+----------
+### Game Board and Powers
+Diplomacy is played on a map of Europe divided into provinces. The game features seven Great Powers that players can control:
+- England (blue)
+- France (light blue)
+- Germany (black)
+- Italy (green)
+- Austria-Hungary (red)
+- Russia (white)
+- Turkey (yellow)
+Each power begins with three supply centers (except Russia, which starts with four) and an equal number of units.
+### Units and Movement
+There are two types of units in Diplomacy:
+- **Armies (A)**: Can move to adjacent land provinces or be convoyed across water by fleets
+- **Fleets (F)**: Can move to adjacent coastal provinces and sea regions
+During movement phases, each unit can execute one of these orders:
+- **Hold**: The unit remains in its current province (e.g., "A PAR H")
+  - Format: [Unit Type] [Province] H
+  - Example: "A PAR H" means "Army in Paris holds its position"
+- **Move**: The unit attempts to move to an adjacent province (e.g., "A PAR - BUR")
+  - Format: [Unit Type] [Current Province] - [Destination Province]
+  - Example: "A PAR - BUR" means "Army in Paris moves to Burgundy"
+  - Example: "F BRE - ENG" means "Fleet in Brest moves to the English Channel"
+- **Support**: The unit supports another unit's move or hold (e.g., "A PAR S A MAR - BUR")
+  - Format for supporting a move: [Unit Type] [Province] S [Unit Type] [Province] - [Destination]
+  - Format for supporting a hold: [Unit Type] [Province] S [Unit Type] [Province]
+  - Example: "A PAR S A MAR - BUR" means "Army in Paris supports the move of the Army in Marseilles to Burgundy"
+  - Example: "F LON S F NTH" means "Fleet in London supports the Fleet in North Sea holding its position"
+- **Convoy**: A fleet can convoy an army across water (e.g., "F ENG C A LON - BRE")
+  - Format: [Fleet] [Sea Province] C [Army] [Coastal Province] - [Coastal Province]
+  - Example: "F ENG C A LON - BRE" means "Fleet in English Channel convoys the Army in London to Brest"
+All orders are executed simultaneously, and conflicts are resolved by strength: each unit counts for 1, plus 1 for every unit supporting it.
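
The four order formats above can be recognized with a few regular expressions. The following is a minimal sketch, not part of the environment's API: ``parse_order`` and the returned dictionary layout are hypothetical, and the patterns assume plain three-letter province codes (no coast suffixes).

```python
import re
from typing import Optional

# Hypothetical patterns for the order notation described above.
# Assumption: provinces are plain three-letter codes such as PAR or ENG.
_PATTERNS = [
    ("hold", re.compile(r"^([AF]) ([A-Z]{3}) H$")),
    ("move", re.compile(r"^([AF]) ([A-Z]{3}) - ([A-Z]{3})$")),
    ("convoy", re.compile(r"^(F) ([A-Z]{3}) C (A) ([A-Z]{3}) - ([A-Z]{3})$")),
    ("support_move", re.compile(r"^([AF]) ([A-Z]{3}) S ([AF]) ([A-Z]{3}) - ([A-Z]{3})$")),
    ("support_hold", re.compile(r"^([AF]) ([A-Z]{3}) S ([AF]) ([A-Z]{3})$")),
]


def parse_order(text: str) -> Optional[dict]:
    """Classify an order string; return None if it matches no known format."""
    # Collapse repeated whitespace and upper-case the text before matching.
    normalized = " ".join(text.strip().upper().split())
    for kind, pattern in _PATTERNS:
        match = pattern.match(normalized)
        if match:
            return {"kind": kind, "fields": match.groups()}
    return None


print(parse_order("A PAR - BUR"))
print(parse_order("F ENG C A LON - BRE"))
```

A parser like this only checks syntax. Whether an order is legal on the board still has to be checked against the environment's list of possible actions.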
+### Common Province Abbreviations
+Diplomacy uses three-letter abbreviations for provinces. Some common ones include:
+- **PAR**: Paris
+- **LON**: London
+- **BER**: Berlin
+- **MUN**: Munich
+- **BUR**: Burgundy
+- **MAR**: Marseilles
+- **BRE**: Brest
+- **ENG**: English Channel
+- **NTH**: North Sea
+- **VIE**: Vienna
+- **ROM**: Rome
+- **VEN**: Venice
+- **MOW**: Moscow
+- **CON**: Constantinople
+### Example: Movement and Conflicts
+For example, if France orders "A PAR - BUR" and Germany orders "A MUN - BUR", neither move succeeds as they have equal strength. However, if France also orders "A MAR S A PAR - BUR", then the French army from Paris would successfully move to Burgundy with strength of 2 against Germany's strength of 1.
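
The strength comparison in this example can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the adjudicator used by the environment: ``resolve_contested_moves`` is a hypothetical helper that ignores cut supports, units already holding the destination, and convoys.

```python
from collections import Counter


def resolve_contested_moves(moves, supports):
    """Return {destination: winning origin, or None on a standoff}.

    moves: dict mapping origin province -> destination province.
    supports: list of (origin, destination) pairs, one per supporting unit.
    Simplification: supports are never cut and destinations are empty.
    """
    support_counts = Counter(supports)
    contenders = {}
    for origin, destination in moves.items():
        # A move has strength 1 plus one per unit supporting that exact move.
        strength = 1 + support_counts[(origin, destination)]
        contenders.setdefault(destination, []).append((strength, origin))

    winners = {}
    for destination, entries in contenders.items():
        top = max(strength for strength, _ in entries)
        strongest = [origin for strength, origin in entries if strength == top]
        # A tie at the top strength means nobody moves in.
        winners[destination] = strongest[0] if len(strongest) == 1 else None
    return winners


moves = {"PAR": "BUR", "MUN": "BUR"}
print(resolve_contested_moves(moves, []))
print(resolve_contested_moves(moves, [("PAR", "BUR")]))
```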
+### Turn Structure
+A game year consists of five phases:
+1. **Spring Movement**: All powers submit orders for their units
+2. **Spring Retreat**: Units dislodged in the movement phase must retreat or be disbanded
+3. **Fall Movement**: Another round of movement orders
+4. **Fall Retreat**: Retreat orders for dislodged units
+5. **Winter Adjustment**: Powers gain or lose units based on the number of supply centers they control
+### Supply Centers and Building
+Supply centers (marked on the map) are key to victory. A power gains control of a supply center when one of its units occupies it at the end of a Fall turn. During the Winter Adjustment phase:
+- If you control more supply centers than you have units, you can build new units in your home supply centers
+- If you control fewer supply centers than you have units, you must remove excess units
+### Example: Building and Removing Units
+If France controls 5 supply centers but only has 4 units, during the Winter phase they can build one new unit in an unoccupied home supply center (Paris, Marseilles, or Brest). Conversely, if France controls only 3 supply centers but has 4 units, they must remove one unit of their choice.
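
The adjustment arithmetic above is small enough to state as code. In this sketch, ``winter_adjustment`` is a hypothetical helper. It assumes the caller already knows how many of the power's home supply centers are unoccupied, since builds are only possible there.

```python
def winter_adjustment(supply_centers: int, units: int, free_home_centers: int) -> dict:
    """Return how many units a power may build or must remove in Winter."""
    difference = supply_centers - units
    if difference > 0:
        # Builds are capped by the number of unoccupied home supply centers.
        return {"build": min(difference, free_home_centers), "remove": 0}
    # With fewer centers than units, the excess units must be removed.
    return {"build": 0, "remove": -difference}


print(winter_adjustment(supply_centers=5, units=4, free_home_centers=3))
print(winter_adjustment(supply_centers=3, units=4, free_home_centers=0))
```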
+### Negotiation
+A critical component of Diplomacy is the negotiation between players. Before submitting orders, players can communicate freely to form alliances, coordinate attacks, or mislead opponents. These negotiations are not binding, and betrayal is a common strategy.
+### Example: Alliance and Betrayal
+England and France might agree to an alliance against Germany, with England promising to support France's move into Belgium. However, England could secretly order their fleet to move into Belgium themselves or support a German move instead.
+### Victory Conditions
+The game ends when one power controls 18 or more supply centers (a majority of the 34 total centers), or when players agree to a draw. In tournament settings, games may also end after a predetermined number of game years.
+DiplomacyEnv
+------------
+The ``DiplomacyEnv`` class provides an interface to the Diplomacy game environment that follows the Multi-Agent
+Negotiation Environment standard.
+.. code-block:: python
+    class DiplomacyEnv:
+        """
+        Multi-Agent Negotiation Environment for Diplomacy, adapting DeepMind's implementation
+        to the MarlEnvironment standard.
+        """
+        def __init__(self,
+                    initial_state: Optional[DiplomacyState] = None,
+                    max_turns: int = 100,
+                    points_per_supply_centre: bool = True,
+                    forced_draw_probability: float = 0.0,
+                    min_years_forced_draw: int = 35):
+            """Initialize the Diplomacy environment.
+            Args:
+                initial_state: Initial DiplomacyState (optional)
+                max_turns: Maximum number of turns in the game
+                points_per_supply_centre: Whether to award points per supply center in case of a draw
+                forced_draw_probability: Probability of forcing a draw after min_years_forced_draw
+                min_years_forced_draw: Minimum years before considering a forced draw
+            """
+            # ...
+        def reset(self):
+            """Reset the environment to an initial state and return the initial observation.
+            Returns:
+                observation (dict): A dictionary where keys are agent identifiers and values are observations.
+                Each observation contains:
+                - board_state: Current state of the board
+                - current_season: Current season in the game
+                - player_index: Index of the player's power
+                - possible_actions: List of possible actions in DeepMind's format
+                - human_readable_actions: List of human-readable action descriptions
+                - supply_centers: List of supply centers owned by the player
+                - units: List of units owned by the player
+                - year: Current year in the game
+            """
+            # ...
+        def step(self, actions):
+            """Take a step in the environment using the provided actions.
+            Args:
+                actions (dict): A dictionary where keys are agent identifiers and values are actions.
+                    Actions can be:
+                    - List of integer actions in DeepMind's format
+                    - List of string actions in text format (e.g., "A MUN - BER")
+            Returns:
+                observations (dict): A dictionary where keys are agent identifiers and values are observations.
+                    Each observation has the same structure as in reset().
+                done (bool): Whether the episode has ended.
+                info (dict): Additional information about the environment, including:
+                    - turn: Current turn number
+                    - returns: Game returns if the game is done, otherwise None
+                    - waiting_for: List of agents that still need to provide actions (if not all actions are provided)
+            """
+            # ...
+        def get_log_info(self):
+            """Get additional information about the environment for logging.
+            Returns:
+                log_info (dict): Information about the environment required to log the game, including:
+                    - power_names: List of power names
+                    - game_history: History of the game
+                    - current_turn: Current turn number
+                    - current_season: Current season name
+                    - supply_centers: Dictionary mapping power names to supply center counts
+            """
+            # ...
+        def render(self):
+            """Render the current state of the environment.
+            Displays a visualization of the current game state.
+            """
+            # ...
+        def close(self):
+            """Perform any necessary cleanup."""
+            # ...
+Key Implementation Details
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+The ``DiplomacyEnv`` class implements several key features:
+1. **Multi-Agent Support**: The environment tracks multiple agents (powers) and manages their interactions.
+2. **Turn-Based Gameplay**: The environment enforces the turn structure of Diplomacy, including different phases.
+3. **Action Processing**: The environment can handle actions in both text format and DeepMind's integer format.
+4. **Observation Generation**: The environment generates detailed observations for each agent, including board state, supply centers, and possible actions.
+5. **Game Termination**: The environment tracks game termination conditions, including supply center victory and maximum turn limits.
+Observation Structure
+~~~~~~~~~~~~~~~~~~~~~
+Each agent receives an observation dictionary with the following structure:
+.. code-block:: python
+    {
+        "board_state": np.ndarray,  # Board state representation
+        "current_season": int,      # Season index (0-4)
+        "player_index": int,        # Index of the player's power (0-6)
+        "possible_actions": [int],  # List of possible actions in DeepMind's format
+        "human_readable_actions": [str],  # List of human-readable action descriptions
+        "supply_centers": [str],    # List of supply centers owned by the player
+        "units": [dict],            # List of units owned by the player
+        "year": int                 # Current year in the game
+    }
+Action Structure
+~~~~~~~~~~~~~~~~
+Actions can be provided in two formats:
+1. **Text Format**: String actions like ``"A MUN - BER"`` or ``"F NTH C A LON - BEL"``.
+2. **Integer Format**: Lists of integers corresponding to DeepMind's action representation.
+The environment will convert text actions to the internal format as needed.
+DiplomacyAgent
+--------------
+The ``DiplomacyAgent`` class implements the agent handler interface for Diplomacy, processing observations from the environment and generating actions through an LLM.
+.. code-block:: python
+    class DiplomacyAgent:
+        """
+        Agent handler for Diplomacy, implementing the AgentState interface
+        for the multi-agent negotiation standard.
+        """
+        def __init__(self,
+                    power_name: str,
+                    use_text_interface: bool = True,
+                    system_prompt: Optional[str] = None):
+            """Initialize the Diplomacy agent handler.
+            Args:
+                power_name: Name of the power this agent controls
+                use_text_interface: Whether to use text-based interface (vs. structured)
+                system_prompt: Optional system prompt to use for the LLM
+            """
+            # ...
+        def step(self, observation_from_env, policy_output=None):
+            """Update the agent state based on the observation and action.
+            Args:
+                observation_from_env: The observation from the environment, with structure:
+                    - board_state: Current state of the board
+                    - current_season: Current season in the game
+                    - player_index: Index of the player's power
+                    - possible_actions: List of possible actions
+                    - human_readable_actions: List of human-readable action descriptions
+                    - supply_centers: List of supply centers owned by the player
+                    - units: List of units owned by the player
+                    - year: Current year in the game
+                policy_output: The output of the policy (LLM response), or None for initial prompt
+            Returns:
+                policy_id (str): The policy identifier ("llm_policy")
+                policy_input (dict): The input to the policy, with structure:
+                    - messages: List of conversation messages in the format:
+                        [{"role": "system", "content": "..."},
+                         {"role": "user", "content": "..."}]
+                action: The official action to be sent to the environment, or None if not ready
+                done (bool): Whether the LLM action is ready to be sent to the environment
+                info (dict): Additional information about the agent:
+                    - valid_action: Whether the extracted action is valid
+            """
+            # ...
+        def get_log_info(self):
+            """Get information about the agent required to log a trajectory.
+            Returns:
+                log_info (dict): Information about the agent required to log a trajectory:
+                    - power_name: Name of the power this agent controls
+                    - conversation_history: List of conversation messages
+                    - current_action: The current action, if any
+            """
+            # ...
+        def render(self):
+            """Render the current state of the agent.
+            Displays the agent's current state, including conversation history.
+            """
+            # ...
+        def close(self):
+            """Perform any necessary cleanup."""
+            # ...
+Key Implementation Details
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+The ``DiplomacyAgent`` class implements several key features:
+1. **LLM Interaction**: The agent generates prompts for an LLM and processes the LLM's responses to extract actions.
+2. **Conversation Management**: The agent maintains a conversation history for coherent interactions with the LLM.
+3. **Action Validation**: The agent validates extracted actions against the set of possible actions provided by the environment.
+4. **Error Handling**: The agent generates clarification prompts when invalid actions are detected.
+5. **Text-Based Interface**: The agent formats game state information into human-readable text for the LLM.
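
Action validation (item 3) could look roughly like the sketch below. It is an illustration only: ``extract_valid_orders`` is a hypothetical helper, and it assumes that the entries of ``human_readable_actions`` use the same text notation as the orders the LLM writes, one order per line.

```python
def extract_valid_orders(llm_response: str, human_readable_actions: list[str]):
    """Split an LLM response into orders that are allowed and orders that are not."""
    # Normalize case and whitespace so that "a par  - bur" matches "A PAR - BUR".
    allowed = {" ".join(action.upper().split()) for action in human_readable_actions}
    valid, invalid = [], []
    for line in llm_response.splitlines():
        candidate = " ".join(line.strip().strip('"').upper().split())
        if not candidate:
            continue  # skip blank lines
        (valid if candidate in allowed else invalid).append(candidate)
    return valid, invalid


response = "A PAR - BUR\n\nA MAR - SPA\nF BRE H"
possible = ["A PAR - BUR", "A PAR H", "F BRE H"]
print(extract_valid_orders(response, possible))
```

A non-empty ``invalid`` list is the natural trigger for the clarification prompt mentioned in item 4.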
+Prompt Structure
+~~~~~~~~~~~~~~~~
+The agent generates prompts that include:
+1. **System Prompt**: Instructions and context for the LLM, explaining its role as a Diplomacy player.
+2. **Game State Description**: A text description of the current game state, including:
+   - Current year and season
+   - Supply centers owned
+   - Units controlled
+   - Possible actions
+3. **Action Request**: Instructions on how to format actions.
+Example system prompt:
+.. code-block:: text
+    You are playing the role of FRANCE in a game of Diplomacy.
+    Your goal is to control as many supply centers as possible.
+    You can negotiate with other players and form alliances, but remember that
+    these alliances are not binding. When you need to submit orders for your units,
+    write them in the correct format, with each order on a new line.
+Example game state description:
+.. code-block:: text
+    Year: 1901, Season: SPRING_MOVES
+    You are playing as FRANCE.
+    You currently control 3 supply centers: PAR, MAR, BRE.
+    Your units are: A PAR, A MAR, F BRE.
+    Please provide orders for your units. Here are your possible actions:
+    A PAR - BUR
+    A PAR - GAS
+    A PAR - PIC
+    A PAR H
+    ...
+    Submit your orders, one per line, in the format like: "A MUN - BER" or "F NTH C A LON - BEL"
+Running Diplomacy Games
+-----------------------
+To run Diplomacy games with LLM agents, you can use the ``run_batched_matches`` function with the ``DiplomacyEnv`` and ``DiplomacyAgent`` classes:
+.. code-block:: python
+    from mllm.environments.diplomacy.diplomacy_env import DiplomacyEnv
+    from mllm.environments.diplomacy.diplomacy_agent import DiplomacyAgent
+    from mllm.run_matches import run_batched_matches
+    # Create environment and agent handlers
+    env = DiplomacyEnv(max_turns=30)
+    agent_handlers = {
+        "AUSTRIA": DiplomacyAgent(power_name="AUSTRIA"),
+        "ENGLAND": DiplomacyAgent(power_name="ENGLAND"),
+        "FRANCE": DiplomacyAgent(power_name="FRANCE"),
+        "GERMANY": DiplomacyAgent(power_name="GERMANY"),
+        "ITALY": DiplomacyAgent(power_name="ITALY"),
+        "RUSSIA": DiplomacyAgent(power_name="RUSSIA"),
+        "TURKEY": DiplomacyAgent(power_name="TURKEY")
+    }
+    # Define policy mapping (mapping from policy IDs to actual policy functions)
+    policy_mapping = {
+        "llm_policy": my_llm_policy_function
+    }
+    # Run the game
+    game_results = run_batched_matches(
+        envs=[env],
+        agent_handlers_per_env=[agent_handlers],
+        policy_mapping=policy_mapping,
+        max_parallel_matches=1
+    )
+    # Process results
+    for result in game_results:
+        print(f"Game finished. Winner: {result['winner']}")
+        print(f"Supply centers: {result['supply_centers']}")
+This setup allows you to run Diplomacy games with LLM agents using the Multi-Agent Negotiation Environment standard.
+Limitations and Considerations
+------------------------------
+1. **Performance**: Processing observations and actions for seven powers using LLMs can be computationally intensive.
+2. **Action Parsing**: Extracting valid actions from LLM outputs may require sophisticated parsing and error handling.
+3. **Game Complexity**: Diplomacy is a complex game with many rules and edge cases, which may be challenging for LLMs to fully grasp.
+4. **Turn Duration**: Real Diplomacy games include negotiation phases of variable duration, which are not fully captured in this implementation.
+5. **Text Formatting**: The quality of LLM interactions depends heavily on the formatting and clarity of text prompts.
+Advanced Usage
+--------------
+For advanced usage, you can customize:
+1. **System Prompts**: Modify agent behavior by providing custom system prompts.
+2. **Observation Processing**: Extend the observation processing to include additional information.
+3. **Action Parsing**: Implement more sophisticated action parsing for complex orders.
+4. **Visualization**: Add custom visualization methods to the environment's render function.
+5. **Logging**: Extend the logging capabilities to capture additional information about the game state.

src_code_for_reproducibility/docs/source/environments/dond.rst ADDED

	@@ -0,0 +1,410 @@

+=================
+Deal or No Deal
+=================
+The Deal or No Deal (DoND) environment provides a multi-agent negotiation interface where players trade
+items with different values. This document describes the API for interacting with the DoND environment
+and its associated agent handler.
+Overview
+--------
+Deal or No Deal is a negotiation game where two agents must agree on how to divide a set of items,
+each of which may be worth a different amount to each agent. The agents engage in a back-and-forth dialogue to
+determine an allocation of the items, with each trying to maximize their own total value.
+Our implementation follows the Multi-Agent Negotiation Environment standard, allowing it to be used
+with LLM agents through a text-based interface.
+Game Rules
+----------
+### Basic Structure
+The core mechanics of Deal or No Deal are:
+1. Two agents negotiate over a set of items (e.g., books, balls, hats)
+2. Each item has:
+   - A specific quantity (how many of each item are available)
+   - A value for each agent (which may differ between agents)
+3. Agents take turns sending messages to negotiate how to split the items
+4. Once an agreement is reached, agents finalize the deal
+5. Points are awarded based on the value of items each agent receives
+### Detailed Gameplay
+#### Setup Phase
+The game begins with:
+- A set of items (e.g., "book", "hat", "ball")
+- Each item has a quantity (e.g., 6 books, 2 hats, 4 balls)
+- Each agent has private values for each item (e.g., books might be worth 5 points to one agent but only 2 points to the other)
+- Agents are assigned roles (starting negotiator and responding negotiator)
+#### Negotiation Phase
+1. Agents take turns sending free-form text messages to each other
+2. Messages can include offers, counter-offers, questions, or strategic communication
+3. There is a maximum number of messages permitted (preventing endless negotiations)
+4. Either agent can propose to finalize an agreement at any time
+For example:
+- Agent 1: "I propose I get all the books and you get all the hats and balls."
+- Agent 2: "That doesn't work for me. How about you get 3 books and I get 3 books, all the hats, and all the balls?"
+- Agent 1: "Let me counter-offer: I get 4 books and 2 balls, you get 2 books, all hats, and 2 balls."
+#### Finalization Phase
+1. When an agent wants to finalize a deal, they must specify the exact allocation:
+   - How many of each item they receive
+   - How many of each item the other agent receives
+2. The other agent must then either agree (by submitting the same allocation) or reject the finalization
+3. If both agents submit matching finalizations, the deal is executed
+4. If finalizations don't match, no agreement is reached, and both agents receive 0 points
+#### Scoring
+1. Each agent's score is calculated based on the value of items they receive
+2. The formula is: Sum(quantity_of_item_i × value_of_item_i_to_agent)
+3. If no agreement is reached, both agents receive 0 points
+### Example Game
+Let's walk through a simple example:
+**Setup:**
+- Items: Books (4), Hats (2), Balls (6)
+- Agent 1 values: Books=5, Hats=1, Balls=2
+- Agent 2 values: Books=3, Hats=6, Balls=1
+**Negotiation (simplified):**
+1. Agent 1: "I would like all the books and balls. You can have the hats."
+2. Agent 2: "That doesn't work for me. Books are valuable. I propose I get all the hats and 2 books, you get 2 books and all the balls."
+3. Agent 1: "How about I get 3 books and all the balls, and you get 1 book and all the hats?"
+4. Agent 2: "I accept your proposal."
+**Finalization:**
+- Agent 1 submits: Agent 1 gets (Books: 3, Hats: 0, Balls: 6), Agent 2 gets (Books: 1, Hats: 2, Balls: 0)
+- Agent 2 submits the same allocation, confirming agreement
+**Scoring:**
+- Agent 1 score: (3 books × 5) + (0 hats × 1) + (6 balls × 2) = 15 + 0 + 12 = 27 points
+- Agent 2 score: (1 book × 3) + (2 hats × 6) + (0 balls × 1) = 3 + 12 + 0 = 15 points
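
The scoring arithmetic in this example can be reproduced with a one-line sum. The sketch below is illustrative: ``score`` is a hypothetical helper, and the dictionaries simply restate the setup and allocation given above.

```python
def score(allocation: dict, values: dict) -> int:
    """Sum quantity x value over every item an agent receives."""
    return sum(quantity * values[item] for item, quantity in allocation.items())


agent1_values = {"book": 5, "hat": 1, "ball": 2}
agent2_values = {"book": 3, "hat": 6, "ball": 1}

agent1_items = {"book": 3, "hat": 0, "ball": 6}
agent2_items = {"book": 1, "hat": 2, "ball": 0}

print(score(agent1_items, agent1_values))  # 27
print(score(agent2_items, agent2_values))  # 15
```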
+### Game Variations
+The DoND environment supports several variations through configuration parameters:
+#### Different Value Distributions
+The environment offers multiple ways to assign values to items:
+1. **Standard Random Setup (dond_random_setup)**:
+   - Items have even-numbered quantities
+   - Each agent receives distinct random values for each item
+   - Values are drawn from a uniform distribution
+2. **Independent Random Values (independent_random_vals)**:
+   - Item quantities can be any number in the specified range
+   - Values for each agent are drawn independently
+   - Creates more varied negotiation scenarios
+3. **Bicameral Value Distribution (bicameral_vals_assignator)**:
+   - Creates a "high value" and "low value" distribution for each item
+   - Each agent values roughly half of the items highly and places a low value on the rest
+   - Values are drawn from normal distributions with different means
+   - Creates scenarios with clear trade opportunities
+#### Visibility Options
+1. **Finalization Visibility**:
+   - When enabled, both agents can see each other's finalization proposals
+   - When disabled, finalization proposals remain private until both are submitted
+2. **Other Values Visibility**:
+   - When enabled, agents can see each other's value functions
+   - When disabled, agents only know their own values
+   - Creates information asymmetry and richer negotiation dynamics
+#### Game Modes
+1. **Cooperative Mode ("coop")**:
+   - Agents are encouraged to find mutually beneficial solutions
+   - Success is measured by the sum of both agents' scores
+2. **Competitive Mode ("comp")**:
+   - Agents aim to maximize their individual scores
+   - Creates more adversarial negotiations
+#### Round Structure
+1. **Single Round**:
+   - One negotiation session between the two agents
+   - Simple evaluation of negotiation skills
+2. **Multiple Rounds**:
+   - Agents negotiate multiple times with different item setups
+   - Allows for learning and adaptation over time
+   - Roles can be swapped between rounds
+DondEnv
+------------
+The ``DondEnv`` class provides an interface to the Deal or No Deal environment that follows the Multi-Agent
+Negotiation Environment standard.
+.. code-block:: python
+    class DondEnv:
+        """
+        Multi-Agent Negotiation Environment for Deal or No Deal.
+        """
+        def __init__(
+            self,
+            agents,
+            mode="coop",
+            max_messages=None,
+            min_messages=None,
+            max_chars_per_message=None,
+            rounds_per_game=1,
+            random_setup_func=None,
+            random_setup_kwargs=None,
+            role_assignator_func=None,
+            role_assignator_func_kwargs=None,
+            finalization_visibility=False,
+            other_values_visibility=False,
+            random_seed=None
+        ):
+            """Initialize the Deal or No Deal environment.
+            Args:
+                agents: List of agent IDs participating in the game
+                mode: Game mode ("coop" or "comp")
+                max_messages: Maximum number of messages per agent per round
+                min_messages: Minimum number of messages per agent per round
+                max_chars_per_message: Maximum characters per message
+                rounds_per_game: Number of negotiation rounds to play
+                random_setup_func: Function to generate item quantities and values
+                random_setup_kwargs: Arguments for the random setup function
+                role_assignator_func: Function to assign roles to agents
+                role_assignator_func_kwargs: Arguments for the role assignator
+                finalization_visibility: Whether agents can see each other's finalizations
+                other_values_visibility: Whether agents can see each other's values
+                random_seed: Seed for reproducibility
+            """
+            # ...
+        def reset(self):
+            """Reset the environment to an initial state and return the initial observation.
+            Returns:
+                observation (dict): A dictionary where keys are agent identifiers and values are observations.
+            """
+            # ...
+        def step(self, actions):
+            """Take a step in the environment using the provided actions.
+            Args:
+                actions (dict): A dictionary where keys are agent identifiers and values are actions.
+                    Actions can be messages or finalization proposals.
+            Returns:
+                observations (dict): A dictionary where keys are agent identifiers and values are observations.
+                done (bool): Whether the episode has ended.
+                info (dict): Additional information about the environment.
+            """
+            # ...
+        def get_state(self):
+            """Retrieve the current state of the game.
+            Returns:
+                state (dict): The current state of the game, including items, quantities, values, etc.
+            """
+            # ...
+Key Implementation Details
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+The ``DondEnv`` class implements several key features:
+1. **Multi-Agent Support**: The environment tracks two agents and manages their alternating messages.
+2. **Turn-Based Dialogue**: The environment enforces turn structure and limits on message count.
+3. **Finalization Processing**: The environment validates and processes finalization proposals.
+4. **Random Setup**: The environment supports multiple methods of generating negotiation scenarios.
+5. **Round Management**: The environment can handle multiple rounds with different setups.
+Observation Structure
+~~~~~~~~~~~~~~~~~~~~~
+Each agent receives an observation dictionary describing the current game state:
+.. code-block:: python
+    {
+        "mode": str,                 # Game mode ("coop" or "comp")
+        "role_values": dict,         # Value mappings for each role
+        "role_props": dict,          # Properties for each role
+        "agent_to_role": dict,       # Mapping from agent IDs to roles
+        "is_new_round": bool,        # Whether this is the start of a new round
+        "is_new_game": bool,         # Whether this is the start of a new game
+        "game_over": bool,           # Whether the game is over
+        "items": list,               # List of item names
+        "quantities": dict,          # Quantities of each item
+        "has_finalized": bool,       # Whether finalization has been proposed
+        "last_message": dict,        # The last message sent
+        "messages_remaining": dict,  # Number of messages each agent can still send
+        # And various history tracking fields
+    }
+Action Structure
+~~~~~~~~~~~~~~~~
+Actions can be:
+1. **Text Messages**: Free-form text for negotiation.
+2. **Finalization Proposals**: Structured data specifying the exact allocation of items.
+Example finalization format:
+.. code-block:: python
+    {
+        "type": "finalize",
+        "allocation": {
+            "agent1": {"book": 3, "hat": 0, "ball": 6},
+            "agent2": {"book": 1, "hat": 2, "ball": 0}
+        }
+    }
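A finalization proposal is only meaningful if the two shares add up to the available quantities. The sketch below shows one way to check that. The helper name ``is_valid_allocation`` and the ``quantities`` dictionary are assumptions made for illustration; the validation that ``DondEnv`` actually performs is not shown in this document.

```python
from typing import Dict


def is_valid_allocation(
    allocation: Dict[str, Dict[str, int]],
    quantities: Dict[str, int],
) -> bool:
    """Return True if every item is split exactly, with no negative counts.

    Hypothetical helper: not part of the documented DondEnv API.
    """
    known_items = set(quantities)
    # Reject proposals that mention items that are not in the game.
    if any(not set(share) <= known_items for share in allocation.values()):
        return False
    for item, total in quantities.items():
        counts = [share.get(item, 0) for share in allocation.values()]
        if any(count < 0 for count in counts) or sum(counts) != total:
            return False
    return True


proposal = {
    "type": "finalize",
    "allocation": {
        "agent1": {"book": 3, "hat": 0, "ball": 6},
        "agent2": {"book": 1, "hat": 2, "ball": 0},
    },
}
# Assumed quantities that match the example proposal above.
quantities = {"book": 4, "hat": 2, "ball": 6}
print(is_valid_allocation(proposal["allocation"], quantities))
```

With the assumed quantities, the example proposal distributes every item exactly once, so the check passes.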
+Value Setup Functions
+---------------------
+The DoND environment provides several functions for setting up item values:
+.. code-block:: python
+    def dond_random_setup(items, min_quant, max_quant, min_val, max_val, random_seed=None):
+        """
+        Generates items, even-numbered quantities and distinct random values for each category for both agents.
+        Args:
+            items (list): List of items.
+            min_quant (int): Minimum quantity per item.
+            max_quant (int): Maximum quantity per item.
+            min_val (int): Minimum value per item.
+            max_val (int): Maximum value per item.
+            random_seed (int, optional): Seed for random generation.
+        Returns:
+            tuple: (items, quantities, (val_starting_negotiator, val_responding_negotiator))
+        """
+        # ...
+    def independent_random_vals(items, min_quant, max_quant, min_val, max_val, random_seed=None):
+        """
+        Generates random quantities and independent random values for both agents.
+        Args:
+            Similar to dond_random_setup
+        Returns:
+            tuple: (items, quantities, (val_starting_negotiator, val_responding_negotiator))
+        """
+        # ...
+    def bicameral_vals_assignator(items, min_quant, max_quant, low_val_mean, low_val_std, high_val_mean, high_val_std, random_seed=None):
+        """
+        Generates values with a bicameral distribution - each agent values half the items highly.
+        Args:
+            items (list): List of items.
+            min_quant, max_quant: Range for quantities
+            low_val_mean, low_val_std: Mean and standard deviation for the "low value" distribution
+            high_val_mean, high_val_std: Mean and standard deviation for the "high value" distribution
+            random_seed: Seed for reproducibility
+        Returns:
+            tuple: (items, quantities, (val_starting_negotiator, val_responding_negotiator))
+        """
+        # ...
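The function bodies above are elided, so here is a minimal sketch of what a setup in the spirit of ``dond_random_setup`` could look like. It assumes that "even-numbered quantities" means each quantity is an even integer in ``[min_quant, max_quant]`` and that "distinct" means each agent's values differ across items. The name ``sketch_random_setup`` is hypothetical, and the real implementation may differ.

```python
import random
from typing import Dict, List, Optional, Tuple


def sketch_random_setup(
    items: List[str],
    min_quant: int,
    max_quant: int,
    min_val: int,
    max_val: int,
    random_seed: Optional[int] = None,
) -> Tuple[List[str], Dict[str, int], Tuple[Dict[str, int], Dict[str, int]]]:
    """Illustrative stand-in for dond_random_setup (assumed behavior)."""
    rng = random.Random(random_seed)  # local RNG so the seed has no global effect
    even_quantities = [q for q in range(min_quant, max_quant + 1) if q % 2 == 0]
    if not even_quantities:
        raise ValueError("no even quantity in [min_quant, max_quant]")
    value_range = range(min_val, max_val + 1)
    if len(value_range) < len(items):
        raise ValueError("value range too small for distinct values")
    quantities = {item: rng.choice(even_quantities) for item in items}
    # rng.sample draws without replacement, so values are distinct per agent.
    val_starting = dict(zip(items, rng.sample(value_range, k=len(items))))
    val_responding = dict(zip(items, rng.sample(value_range, k=len(items))))
    return items, quantities, (val_starting, val_responding)


items, quantities, (vals_a, vals_b) = sketch_random_setup(
    ["book", "hat", "ball"], 2, 8, 1, 10, random_seed=0
)
print(quantities, vals_a, vals_b)
```

Using a dedicated ``random.Random`` instance keeps the setup reproducible for a given seed without touching the global random state.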
+Running DoND Games
+----------------------
+To run Deal or No Deal games with LLM agents, you can use the following structure:
+.. code-block:: python
+    from mllm.environments.dond.dond_game import DondEnv
+    from mllm.environments.dond.dond_agent import DondAgent
+    from mllm.run_matches import run_batched_matches
+    # Create environment
+    env = DondEnv(
+        agents=["agent1", "agent2"],
+        mode="coop",
+        max_messages=10,
+        rounds_per_game=1,
+        random_setup_func="dond_random_setup",
+        random_setup_kwargs={
+            "items": ["book", "hat", "ball"],
+            "min_quant": 2,
+            "max_quant": 8,
+            "min_val": 1,
+            "max_val": 10
+        },
+        finalization_visibility=False
+    )
+    # Create agent handlers (implementation details would vary)
+    agent_handlers = {
+        "agent1": DondAgent(agent_id="agent1"),
+        "agent2": DondAgent(agent_id="agent2")
+    }
+    # Define policy mapping (my_llm_policy_function is a placeholder for your own policy callable)
+    policy_mapping = {
+        "llm_policy": my_llm_policy_function
+    }
+    # Run the game
+    game_results = run_batched_matches(
+        envs=[env],
+        agent_handlers_per_env=[agent_handlers],
+        policy_mapping=policy_mapping,
+        max_parallel_matches=1
+    )
+Limitations and Considerations
+------------------------------
+1. **Negotiation Complexity**: The open-ended nature of negotiations can be challenging for some LLM agents.
+2. **Parsing Challenges**: Extracting structured finalization proposals from free-form text requires robust parsing.
+3. **Optimization Opportunities**: Different agents may employ different negotiation strategies to optimize outcomes.
+4. **Fairness Evaluation**: The environment allows research into questions of fair division and Pareto optimality.
+5. **Strategic Deception**: Agents might strategically misrepresent their true values, adding complexity to negotiations.
+Advanced Usage
+--------------
+For advanced usage, you can:
+1. **Custom Value Functions**: Create more complex distributions of item values for specific research questions.
+2. **Novel Negotiation Scenarios**: Design item sets and values to test specific negotiation skills.
+3. **Curriculum Learning**: Create progressively more difficult negotiation scenarios.
+4. **Communication Analysis**: Analyze the language and strategies used in successful negotiations.
+5. **Multi-Round Dynamics**: Study how agents adapt their strategies over multiple rounds.

src_code_for_reproducibility/docs/source/environments/ipd.rst ADDED

	@@ -0,0 +1,411 @@

+===========================
+Iterated Prisoner's Dilemma
+===========================
+The Iterated Prisoner's Dilemma environment provides a classic game theory setting for studying cooperation
+and competition between agents. This document describes the API for interacting with the IPD environment
+and its associated agent handler.
+Overview
+--------
+The Prisoner's Dilemma is a fundamental problem in game theory that demonstrates why two rational individuals might not
+cooperate, even when it appears in their best interest to do so. In the iterated version, the same two players
+repeatedly face the same dilemma, allowing for the development of trust or retaliation based on previous interactions.
+Our implementation follows the Multi-Agent Negotiation Environment standard, allowing it to be used with
+LLM agents through a text-based interface.
+Game Rules
+----------
+Basic Premise
+~~~~~~~~~~~~~
+The scenario behind the Prisoner's Dilemma is as follows:
+Two criminals are arrested and imprisoned. Each prisoner is in solitary confinement with no means of communicating with
+the other. The prosecutors lack sufficient evidence to convict the pair on the principal charge, but they have enough
+to convict both on a lesser charge. Simultaneously, the prosecutors offer each prisoner a bargain:
+- If both prisoners betray each other, each serves 2 years in prison (the "punishment" payoff)
+- If one betrays the other while the other remains silent, the betrayer goes free (the "temptation" payoff) while the
+  silent accomplice serves 3 years (the "sucker" payoff)
+- If both remain silent, each serves only 1 year in prison (the "reward" payoff)
+Game Mechanics
+~~~~~~~~~~~~~~
+In our implementation, the choices are simplified to:
+- **C**: Cooperate (remain silent)
+- **D**: Defect (betray the other prisoner)
+Each round, both players simultaneously choose either C or D, and receive points based on the combination of their choices:
+- Both choose C: Both receive the "reward" payoff (3 points by default)
+- Both choose D: Both receive the "punishment" payoff (1 point by default)
+- One chooses C, one chooses D: The defector receives the "temptation" payoff (5 points by default), while the cooperator
+  receives the "sucker" payoff (0 points by default)
+### Example: Single Round
+Let's see how a single round plays out:
+1. Alice and Bob simultaneously make their choices
+2. If Alice chooses C and Bob chooses C:
+   - Alice receives 3 points
+   - Bob receives 3 points
+3. If Alice chooses C and Bob chooses D:
+   - Alice receives 0 points
+   - Bob receives 5 points
+4. If Alice chooses D and Bob chooses C:
+   - Alice receives 5 points
+   - Bob receives 0 points
+5. If Alice chooses D and Bob chooses D:
+   - Alice receives 1 point
+   - Bob receives 1 point
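The four cases above amount to a small lookup. The sketch below encodes them with the default payoff values from this section. The function name ``round_payoffs`` is an assumption for illustration and is not taken from the ``IPDEnv`` source.

```python
from typing import Tuple


def round_payoffs(
    alice: str,
    bob: str,
    reward: float = 3.0,
    punishment: float = 1.0,
    temptation: float = 5.0,
    sucker: float = 0.0,
) -> Tuple[float, float]:
    """Return (alice_points, bob_points) for one simultaneous round."""
    if alice not in ("C", "D") or bob not in ("C", "D"):
        raise ValueError("actions must be 'C' or 'D'")
    if alice == "C" and bob == "C":
        return reward, reward
    if alice == "D" and bob == "D":
        return punishment, punishment
    if alice == "D":  # Alice defects against a cooperating Bob
        return temptation, sucker
    return sucker, temptation  # Alice cooperates against a defecting Bob


for a in "CD":
    for b in "CD":
        print(a, b, round_payoffs(a, b))
```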
+### Iterated Game Structure
+The iterated version repeats this basic game for a fixed number of rounds. The key features are:
+1. Players know the total number of rounds in advance
+2. After each round, players learn what choice the other player made
+3. Players maintain a cumulative score across all rounds
+4. Players can adjust their strategy based on the history of previous interactions
+Game Variations
+~~~~~~~~~~~~~~~
+The IPD environment supports several variations through configuration parameters:
+#### Different Payoff Matrices
+The standard payoff values can be modified to create different incentive structures:
+- **Traditional PD**: reward=3, punishment=1, temptation=5, sucker=0
+- **Weak Temptation**: reward=3, punishment=1, temptation=4, sucker=0 (reduces the incentive to defect)
+- **Harsh Punishment**: reward=3, punishment=0, temptation=5, sucker=0 (increases the cost of mutual defection; because punishment equals sucker, defecting is no longer strictly better against a defector)
+- **Generous**: reward=4, punishment=2, temptation=5, sucker=1 (cushions the blow of being betrayed)
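When changing payoffs, it can help to check whether the result is still a Prisoner's Dilemma in the textbook sense: ``temptation > reward > punishment > sucker``, plus ``2 * reward > temptation + sucker`` so that taking turns exploiting each other does not beat mutual cooperation. These conditions come from standard game theory rather than from this environment, and the helper ``is_strict_pd`` is a hypothetical name.

```python
def is_strict_pd(
    reward: float, punishment: float, temptation: float, sucker: float
) -> bool:
    """Check the textbook Prisoner's Dilemma conditions on a payoff set."""
    ordering_holds = temptation > reward > punishment > sucker
    cooperation_beats_alternation = 2 * reward > temptation + sucker
    return ordering_holds and cooperation_beats_alternation


variants = {
    "Traditional PD": dict(reward=3, punishment=1, temptation=5, sucker=0),
    "Weak Temptation": dict(reward=3, punishment=1, temptation=4, sucker=0),
    "Harsh Punishment": dict(reward=3, punishment=0, temptation=5, sucker=0),
    "Generous": dict(reward=4, punishment=2, temptation=5, sucker=1),
}

for name, payoffs in variants.items():
    print(f"{name}: {is_strict_pd(**payoffs)}")
```

Three of the four variants satisfy both conditions. "Harsh Punishment" does not, because its punishment and sucker payoffs are equal.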
+#### Game Length Variations
+The number of rounds can significantly impact strategy:
+- **Short Games** (5-10 rounds): Tend to encourage defection, especially near the end
+- **Medium Games** (20-50 rounds): Allow tit-for-tat and forgiveness strategies to develop
+- **Long Games** (100+ rounds): Tend to favor steady cooperation with occasional "probing" defections
+Common Strategies
+~~~~~~~~~~~~~~~~~
+While not enforced by the environment, several well-known strategies can emerge:
+- **Always Cooperate**: Always choose C
+- **Always Defect**: Always choose D
+- **Tit for Tat**: Start with C, then copy what the opponent did in the previous round
+- **Forgiving Tit for Tat**: Like Tit for Tat, but occasionally cooperate even after being defected against
+- **Grudger**: Cooperate until the opponent defects once, then always defect
+- **Random**: Choose randomly between C and D
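Several of the strategies listed above can be written as small functions of the game history. The sketch below plays two of them against each other with the default payoffs. The names ``play``, ``PAYOFFS``, and the history format (a list of ``(my_move, opponent_move)`` pairs) are assumptions for illustration and do not use ``IPDEnv``.

```python
from typing import Callable, List, Tuple

History = List[Tuple[str, str]]  # (my_move, opponent_move) for each past round
Strategy = Callable[[History], str]


def always_defect(history: History) -> str:
    return "D"


def tit_for_tat(history: History) -> str:
    # Cooperate first, then copy the opponent's previous move.
    return history[-1][1] if history else "C"


def grudger(history: History) -> str:
    # Cooperate until the opponent has defected at least once.
    return "D" if any(opp == "D" for _, opp in history) else "C"


# Default payoffs: (points for first player, points for second player).
PAYOFFS = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}


def play(a: Strategy, b: Strategy, rounds: int = 10) -> Tuple[int, int]:
    """Play a fixed number of rounds and return cumulative scores."""
    hist_a: History = []
    hist_b: History = []
    score_a = score_b = 0
    for _ in range(rounds):
        move_a, move_b = a(hist_a), b(hist_b)  # both choose before seeing the other
        pts_a, pts_b = PAYOFFS[(move_a, move_b)]
        score_a += pts_a
        score_b += pts_b
        hist_a.append((move_a, move_b))
        hist_b.append((move_b, move_a))
    return score_a, score_b


print(play(tit_for_tat, always_defect))  # (9, 14)
print(play(tit_for_tat, grudger))        # (30, 30)
```

Tit for Tat loses only the first round to Always Defect and then matches it, while two "nice" strategies cooperate throughout.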
+IPDEnv
+------
+The ``IPDEnv`` class provides an interface to the Iterated Prisoner's Dilemma environment that follows the
+Multi-Agent Negotiation Environment standard.
+.. code-block:: python
+    from typing import Any, Dict, Optional, Tuple
+    class IPDEnv:
+        """
+        Iterated Prisoner's Dilemma environment following the MarlEnvironment standard.
+        In each round of the game, two agents simultaneously choose to either cooperate (C) or defect (D).
+        The payoffs are as follows:
+        - If both cooperate: Both receive the "reward" (usually 3 points)
+        - If both defect: Both receive the "punishment" (usually 1 point)
+        - If one cooperates and one defects: The defector receives the "temptation" (usually 5 points)
+          and the cooperator receives the "sucker" payoff (usually 0 points)
+        The game is played for a specified number of rounds.
+        """
+        def __init__(
+            self,
+            rounds_per_game: int = 10,
+            reward: float = 3.0,           # Both cooperate
+            punishment: float = 1.0,       # Both defect
+            temptation: float = 5.0,       # Defector's reward when other cooperates
+            sucker: float = 0.0,           # Cooperator's reward when other defects
+            random_seed: Optional[int] = None,
+        ):
+            """
+            Initialize the Iterated Prisoner's Dilemma environment.
+            Args:
+                rounds_per_game: Number of rounds to play
+                reward: Payoff when both agents cooperate
+                punishment: Payoff when both agents defect
+                temptation: Payoff for defecting when other agent cooperates
+                sucker: Payoff for cooperating when other agent defects
+                random_seed: Random seed for reproducibility
+            """
+            # ...
+        def reset(self) -> Dict[str, Dict[str, Any]]:
+            """
+            Reset the environment to an initial state and return the initial observation.
+            Returns:
+                observation (dict): A dictionary where keys are agent identifiers and values are observations.
+            """
+            # ...
+        def step(self, actions: Dict[str, str]) -> Tuple[Dict[str, Dict[str, Any]], bool, Dict[str, Any]]:
+            """
+            Take a step in the environment using the provided actions.
+            Args:
+                actions (dict): A dictionary where keys are agent identifiers and values are actions ('C' or 'D').
+            Returns:
+                observations (dict): A dictionary where keys are agent identifiers and values are observations.
+                done (bool): Whether the episode has ended.
+                info (dict): Additional information about the environment.
+            """
+            # ...
+Key Implementation Details
+~~~~~~~~~~~~~~~~~~~~~~~~~
+The ``IPDEnv`` class implements several key features:
+1. **Two-Agent Support**: The environment tracks two agents ("alice" and "bob") and manages their interactions.
+2. **Round-Based Play**: The environment collects both agents' simultaneous actions each round and tracks game history.
+3. **Payoff Matrix**: The environment calculates rewards based on the standard prisoner's dilemma payoff matrix.
+4. **Observation Generation**: The environment generates detailed observations for each agent, including action history and rewards.
+5. **Game Termination**: The environment tracks game termination after the specified number of rounds.
+Observation Structure
+~~~~~~~~~~~~~~~~~~~~
+Each agent receives an observation dictionary with the following structure:
+.. code-block:: python
+    {
+        "current_round": int,                # Current round number (0-indexed)
+        "rounds_per_game": int,              # Total number of rounds in the game
+        "history": List[Dict],               # Complete game history so far
+        "last_round_actions": Dict[str, str], # Actions from the previous round (if any)
+        "last_round_reward": float,          # Reward received in the previous round (if any)
+        "total_reward": float,               # Cumulative reward so far
+        "payoff_matrix": Dict[str, float],   # The game's payoff matrix values
+    }
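To make the ``reset``/``step`` contract and the observation fields concrete, here is a small stand-in environment. It is a sketch only: the class name ``ToyIPDEnv``, the shape of each ``history`` entry, the contents of ``info``, and the first-round defaults (empty ``last_round_actions``, zero ``last_round_reward``) are all assumptions, not the behavior of the real ``IPDEnv``.

```python
from typing import Any, Dict, List, Tuple


class ToyIPDEnv:
    """Illustrative stand-in for IPDEnv; not the project's implementation."""

    agents = ("alice", "bob")

    def __init__(self, rounds_per_game: int = 10, reward: float = 3.0,
                 punishment: float = 1.0, temptation: float = 5.0,
                 sucker: float = 0.0) -> None:
        self.rounds_per_game = rounds_per_game
        self.payoff_matrix = {"reward": reward, "punishment": punishment,
                              "temptation": temptation, "sucker": sucker}
        self.reset()

    def reset(self) -> Dict[str, Dict[str, Any]]:
        self.current_round = 0
        self.history: List[Dict[str, Any]] = []
        self.totals = {agent: 0.0 for agent in self.agents}
        return self._observations()

    def step(self, actions: Dict[str, str]
             ) -> Tuple[Dict[str, Dict[str, Any]], bool, Dict[str, Any]]:
        a, b = actions["alice"], actions["bob"]
        if a not in ("C", "D") or b not in ("C", "D"):
            raise ValueError("actions must be 'C' or 'D'")
        p = self.payoff_matrix
        if a == b:
            r_a = r_b = p["reward"] if a == "C" else p["punishment"]
        elif a == "D":
            r_a, r_b = p["temptation"], p["sucker"]
        else:
            r_a, r_b = p["sucker"], p["temptation"]
        rewards = {"alice": r_a, "bob": r_b}
        for agent in self.agents:
            self.totals[agent] += rewards[agent]
        self.history.append({"actions": dict(actions), "rewards": rewards})
        self.current_round += 1
        done = self.current_round >= self.rounds_per_game
        return self._observations(), done, {"totals": dict(self.totals)}

    def _observations(self) -> Dict[str, Dict[str, Any]]:
        last = self.history[-1] if self.history else None
        return {
            agent: {
                "current_round": self.current_round,
                "rounds_per_game": self.rounds_per_game,
                "history": list(self.history),
                "last_round_actions": dict(last["actions"]) if last else {},
                "last_round_reward": last["rewards"][agent] if last else 0.0,
                "total_reward": self.totals[agent],
                "payoff_matrix": dict(self.payoff_matrix),
            }
            for agent in self.agents
        }


env = ToyIPDEnv(rounds_per_game=2)
obs = env.reset()
obs, done, info = env.step({"alice": "C", "bob": "D"})
print(obs["alice"]["last_round_reward"], obs["bob"]["total_reward"], done)
```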
+Action Structure
+~~~~~~~~~~~~~~~~
+Actions are simple strings:
+1. ``"C"`` for Cooperate
+2. ``"D"`` for Defect
+IPDAgent
+--------------
+The ``IPDAgent`` class implements the agent handler interface for the Iterated Prisoner's Dilemma, processing observations from the environment and generating actions through an LLM.
+.. code-block:: python
+    from typing import Any, Dict, Optional, Tuple
+    class IPDAgent:
+        """
+        Agent handler for Iterated Prisoner's Dilemma, implementing the AgentState interface
+        for the multi-agent negotiation standard.
+        """
+        def __init__(
+            self,
+            agent_id: str,
+            policy_id: str = "llm_policy",
+            system_prompt: Optional[str] = None,
+            max_errors: int = 3,
+            opponent_id: Optional[str] = None,
+        ):
+            """
+            Initialize the IPD agent handler.
+            Args:
+                agent_id: Identifier for this agent ("alice" or "bob")
+                policy_id: Identifier for the policy this agent uses
+                system_prompt: Optional custom system prompt for the LLM
+                max_errors: Maximum number of parsing errors before defaulting to cooperate
+                opponent_id: Optional identifier of the opponent (inferred if not provided)
+            """
+            # ...
+        def step(self, observation_from_env: Dict[str, Any], policy_output: Optional[str] = None) -> Tuple[str, Dict[str, Any], str, bool, Dict[str, Any]]:
+            """
+            Update the agent state based on the observation and process the policy output.
+            Args:
+                observation_from_env: The observation from the environment
+                policy_output: The output from the policy (LLM response)
+            Returns:
+                policy_id: The policy identifier
+                policy_input: The input to the policy
+                action: The action to be sent to the environment
+                done: Whether the action is ready to be sent to the environment
+                info: Additional information about the agent
+            """
+            # ...
+Key Implementation Details
+~~~~~~~~~~~~~~~~~~~~~~~~~
+The ``IPDAgent`` class implements several key features:
+1. **LLM Interaction**: The agent generates prompts for an LLM and processes the LLM's responses.
+2. **Action Extraction**: The agent parses the LLM's output to extract valid actions (C or D).
+3. **Error Handling**: The agent provides helpful error messages when parsing fails and defaults to cooperation after multiple failures.
+4. **History Tracking**: The agent maintains and provides the complete game history in its prompts.
+5. **Strategy Explanation**: The agent can extract and log the reasoning behind an LLM's decisions.
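Action extraction and the fallback to cooperation can be sketched as follows, assuming the ``<action>C</action>`` tag format that the agent's prompts request. The helper names ``extract_action`` and ``choose_action``, the choice to use the last tag when several appear, and the case-insensitive match are assumptions; the parsing logic inside ``IPDAgent`` is not shown in this document.

```python
import re
from typing import Iterable, Optional

ACTION_PATTERN = re.compile(r"<action>\s*([CD])\s*</action>", re.IGNORECASE)


def extract_action(llm_output: str) -> Optional[str]:
    """Return 'C' or 'D' if a valid action tag is present, else None."""
    matches = ACTION_PATTERN.findall(llm_output)
    if not matches:
        return None
    # Assumption: if the model emits several tags, the last one is its answer.
    return matches[-1].upper()


def choose_action(responses: Iterable[str], max_errors: int = 3) -> str:
    """Try successive LLM responses and fall back to 'C' after max_errors failures.

    Also falls back to 'C' if the responses run out before a valid action appears.
    """
    errors = 0
    for response in responses:
        action = extract_action(response)
        if action is not None:
            return action
        errors += 1
        if errors >= max_errors:
            break
    return "C"


print(extract_action("They defected last round, so <action>D</action>."))
print(choose_action(["I cooperate", "still no tag", "nope", "<action>D</action>"]))
```

In the second call the first three responses fail to parse, so with ``max_errors=3`` the fourth response is never consulted and the fallback applies.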
+Prompt Structure
+~~~~~~~~~~~~~~~~
+The agent generates prompts that include:
+1. **System Prompt**: Instructions and context for the LLM, explaining its role and the rules of the Prisoner's Dilemma.
+2. **Game State Description**: A text description of the current game state, including:
+   - Current round number
+   - History of previous rounds (if any)
+   - Cumulative score
+3. **Action Request**: Instructions on how to format the response, requiring an explicit action tag.
+Example system prompt:
+.. code-block:: text
+    You are playing as Alice in an Iterated Prisoner's Dilemma game against Bob.
+    In each round, you must choose to either Cooperate (C) or Defect (D).
+    The payoffs are:
+    - If both players Cooperate: You each get 3 points
+    - If both players Defect: You each get 1 point
+    - If you Cooperate and Bob Defects: You get 0 points, Bob gets 5 points
+    - If you Defect and Bob Cooperates: You get 5 points, Bob gets 0 points
+    Your goal is to maximize your total points across all rounds.
+    The game will last for exactly 10 rounds, and both players know this.
+Example game state prompt:
+.. code-block:: text
+    Current round: 3/10
+    History:
+    Round 1: You chose C, Bob chose C. You earned 3 points.
+    Round 2: You chose C, Bob chose D. You earned 0 points.
+    Your total score so far: 3 points
+    What is your choice for round 3?
+    Please respond with <action>C</action> to cooperate or <action>D</action> to defect,
+    and explain your reasoning.
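A game state prompt like the one above can be assembled from the round history. The sketch below reproduces the example text. The function name ``render_state_prompt`` and the history format (tuples of own move, opponent move, and points earned) are assumptions for illustration, not the agent's actual prompt builder.

```python
from typing import List, Tuple


def _points(n: int) -> str:
    """Format a score with the right singular or plural unit."""
    return f"{n} point" if n == 1 else f"{n} points"


def render_state_prompt(
    history: List[Tuple[str, str, int]],
    rounds_per_game: int,
    opponent: str = "Bob",
) -> str:
    """Build the game state prompt for the next round from past rounds."""
    next_round = len(history) + 1
    lines = [f"Current round: {next_round}/{rounds_per_game}"]
    if history:
        lines.append("History:")
        for i, (mine, theirs, earned) in enumerate(history, start=1):
            lines.append(
                f"Round {i}: You chose {mine}, {opponent} chose {theirs}. "
                f"You earned {_points(earned)}."
            )
    total = sum(earned for _, _, earned in history)
    lines.append(f"Your total score so far: {_points(total)}")
    lines.append(f"What is your choice for round {next_round}?")
    lines.append(
        "Please respond with <action>C</action> to cooperate "
        "or <action>D</action> to defect,"
    )
    lines.append("and explain your reasoning.")
    return "\n".join(lines)


print(render_state_prompt([("C", "C", 3), ("C", "D", 0)], rounds_per_game=10))
```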
+Running IPD Games
+----------------------
+To run Iterated Prisoner's Dilemma games with LLM agents, you can use the following code structure:
+.. code-block:: python
+    from mllm.environments.ipd.ipd_game import IPDEnv
+    from mllm.environments.ipd.ipd_agent import IPDAgent
+    from mllm.run_matches import run_batched_matches
+    # Create environment
+    env = IPDEnv(
+        rounds_per_game=10,
+        reward=3.0,
+        punishment=1.0,
+        temptation=5.0,
+        sucker=0.0
+    )
+    # Create agent handlers
+    agent_handlers = {
+        "alice": IPDAgent(agent_id="alice"),
+        "bob": IPDAgent(agent_id="bob")
+    }
+    # Define policy mapping (my_llm_policy_function is a placeholder for your own policy callable)
+    policy_mapping = {
+        "llm_policy": my_llm_policy_function
+    }
+    # Run the game
+    game_results = run_batched_matches(
+        envs=[env],
+        agent_handlers_per_env=[agent_handlers],
+        policy_mapping=policy_mapping,
+        max_parallel_matches=1
+    )
+    # Process results
+    for result in game_results:
+        print(f"Game finished. Scores: {result['total_rewards']}")
+Statistics and Analysis
+-----------------------
+The IPD environment includes utility functions for analyzing game outcomes:
+1. **Cooperation Rates**: Percentage of rounds where each agent cooperated.
+2. **Mutual Cooperation/Defection**: Percentage of rounds where both agents cooperated, and percentage where both defected.
+3. **Score Distribution**: Analysis of how points were accumulated over the game.
+These statistics can be calculated using the ``gather_ipd_statistics`` function:
+.. code-block:: python
+    from mllm.environments.ipd.ipd_statistics_funcs import gather_ipd_statistics
+    stats = gather_ipd_statistics(match_info, env_info)
+    print(f"Cooperation rates: {stats['cooperation_rate']}")
+    print(f"Mutual cooperation rate: {stats['mutual_cooperation_rate']}")
+    print(f"Mutual defection rate: {stats['mutual_defection_rate']}")
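The inputs and internals of ``gather_ipd_statistics`` are not described here, so the following is only a sketch of how the three rates could be computed from a plain list of per-round actions. The function name ``cooperation_stats`` and the input format are assumptions; the output keys mirror the ones used in the snippet above.

```python
from typing import Any, Dict, List


def cooperation_stats(rounds: List[Dict[str, str]]) -> Dict[str, Any]:
    """Compute cooperation rates from a list of {agent_id: 'C' or 'D'} dicts."""
    if not rounds:
        raise ValueError("at least one round is required")
    n = len(rounds)
    agents = sorted(rounds[0])
    cooperation_rate = {
        agent: sum(r[agent] == "C" for r in rounds) / n for agent in agents
    }
    mutual_c = sum(all(move == "C" for move in r.values()) for r in rounds) / n
    mutual_d = sum(all(move == "D" for move in r.values()) for r in rounds) / n
    return {
        "cooperation_rate": cooperation_rate,
        "mutual_cooperation_rate": mutual_c,
        "mutual_defection_rate": mutual_d,
    }


rounds = [
    {"alice": "C", "bob": "C"},
    {"alice": "C", "bob": "D"},
    {"alice": "D", "bob": "D"},
    {"alice": "C", "bob": "C"},
]
print(cooperation_stats(rounds))
```

The rates are returned as fractions in ``[0, 1]``; multiply by 100 if percentages are preferred.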
+Limitations and Considerations
+------------------------------
+1. **Determinism**: Payoffs are a deterministic function of the two actions; ``random_seed`` affects only initialization.
+2. **Limited Player Count**: The IPD environment only supports exactly two players.
+3. **Perfect Information**: Both players have perfect information about the game history.
+4. **Simultaneous Actions**: Both players act simultaneously, which requires adaptations for some LLM interfaces.
+5. **Fixed Game Length**: The total number of rounds is fixed and known to both players from the start.
+Advanced Usage
+--------------
+For advanced usage, you can customize:
+1. **Payoff Matrix**: Modify reward values to create different incentive structures.
+2. **System Prompts**: Customize the LLM's understanding of the game and potential strategies.
+3. **Error Handling**: Adjust how the agent responds to invalid LLM outputs.
+4. **Analysis**: Create custom statistics gathering for specific research questions.
+5. **Integration**: Connect the IPD environment to other negotiation frameworks or tournament systems.