Muqeeth committed
Commit d34e8a3 · verified · 1 Parent(s): 201e790

Add files using upload-large-folder tool

This view is limited to 50 files because the commit contains too many changes. See the raw diff for the full set.
Files changed (50):
  1. seed_1111/Qwen/Qwen2.5-7B-Instruct/adapters/agent_adapter/adapter_model.safetensors +3 -0
  2. seed_1111/Qwen/Qwen2.5-7B-Instruct/adapters/critic_adapter/adapter_model.safetensors +3 -0
  3. seed_1111/agent_trainer/critic_optimizer_state.pt +3 -0
  4. seed_1111/agent_trainer/policy_optimizer_state.pt +3 -0
  5. seed_1111/agent_trainer/trainer_annealing_state.pkl +3 -0
  6. seed_1111/random_state.pkl +3 -0
  7. src_code_for_reproducibility/chat_utils/__pycache__/chat_turn.cpython-312.pyc +0 -0
  8. src_code_for_reproducibility/chat_utils/__pycache__/template_specific.cpython-312.pyc +0 -0
  9. src_code_for_reproducibility/docs/source/conf.py +48 -0
  10. src_code_for_reproducibility/docs/source/environments.rst +35 -0
  11. src_code_for_reproducibility/docs/source/index.rst +22 -0
  12. src_code_for_reproducibility/docs/source/marl_standard.rst +141 -0
  13. src_code_for_reproducibility/docs/source/src.environments.dond.dond_log_funcs.rst +7 -0
  14. src_code_for_reproducibility/docs/source/src.environments.dond.dond_player.rst +7 -0
  15. src_code_for_reproducibility/docs/source/src.environments.dond.dond_return_funcs.rst +7 -0
  16. src_code_for_reproducibility/docs/source/src.environments.dond.dond_statistics_funcs.rst +7 -0
  17. src_code_for_reproducibility/docs/source/src.environments.dond.rst +19 -0
  18. src_code_for_reproducibility/docs/source/src.environments.environment_imports.rst +7 -0
  19. src_code_for_reproducibility/docs/source/src.environments.ipd.ipd_statistics_funcs.rst +7 -0
  20. src_code_for_reproducibility/docs/source/src.environments.rst +25 -0
  21. src_code_for_reproducibility/docs/source/src.experiments.arithmetic_test.rst +7 -0
  22. src_code_for_reproducibility/docs/source/src.experiments.last_completion.rst +7 -0
  23. src_code_for_reproducibility/docs/source/src.experiments.rst +17 -0
  24. src_code_for_reproducibility/docs/source/src.generation.rst +15 -0
  25. src_code_for_reproducibility/docs/source/src.generation.run_games.rst +7 -0
  26. src_code_for_reproducibility/docs/source/src.models.oai_agent.rst +7 -0
  27. src_code_for_reproducibility/docs/source/src.models.rst +20 -0
  28. src_code_for_reproducibility/docs/source/src.models.updatable_worker.rst +7 -0
  29. src_code_for_reproducibility/docs/source/src.models.vllm_worker_wrap.rst +7 -0
  30. src_code_for_reproducibility/docs/source/src.rst +28 -0
  31. src_code_for_reproducibility/docs/source/src.training.ppo_train.rst +7 -0
  32. src_code_for_reproducibility/docs/source/src.training.rst +19 -0
  33. src_code_for_reproducibility/docs/source/src.utils.log_gpu_usage.rst +7 -0
  34. src_code_for_reproducibility/markov_games/diplomacy/diplomacy_agent.py +259 -0
  35. src_code_for_reproducibility/markov_games/diplomacy/diplomacy_env.py +230 -0
  36. src_code_for_reproducibility/markov_games/diplomacy/diplomacy_logging.py +360 -0
  37. src_code_for_reproducibility/markov_games/diplomacy/diplomacy_logging_for_training.py +0 -0
  38. src_code_for_reproducibility/markov_games/ipd/Ipd_hard_coded_agents.py +72 -0
  39. src_code_for_reproducibility/markov_games/ipd/__init__.py +7 -0
  40. src_code_for_reproducibility/markov_games/ipd/__pycache__/Ipd_hard_coded_agents.cpython-312.pyc +0 -0
  41. src_code_for_reproducibility/markov_games/ipd/__pycache__/__init__.cpython-312.pyc +0 -0
  42. src_code_for_reproducibility/markov_games/ipd/__pycache__/ipd_simulation.cpython-312.pyc +0 -0
  43. src_code_for_reproducibility/markov_games/ipd/__pycache__/ipd_statistics.cpython-312.pyc +0 -0
  44. src_code_for_reproducibility/markov_games/ipd/ipd_agent.py +115 -0
  45. src_code_for_reproducibility/markov_games/ipd/ipd_simulation.py +162 -0
  46. src_code_for_reproducibility/markov_games/ipd/ipd_statistics.py +18 -0
  47. src_code_for_reproducibility/markov_games/negotiation/README.md +40 -0
  48. src_code_for_reproducibility/markov_games/negotiation/__pycache__/dond_agent.cpython-312.pyc +0 -0
  49. src_code_for_reproducibility/markov_games/negotiation/__pycache__/dond_simulation.cpython-312.pyc +0 -0
  50. src_code_for_reproducibility/markov_games/negotiation/__pycache__/nego_agent.cpython-312.pyc +0 -0
seed_1111/Qwen/Qwen2.5-7B-Instruct/adapters/agent_adapter/adapter_model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:51c7556de3a256fc5617051978705cc989aaa572366999e755aa162e03fca885
+ size 323014168
seed_1111/Qwen/Qwen2.5-7B-Instruct/adapters/critic_adapter/adapter_model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:59c990983fc8fae67d5cab961c03f7c68cb799470378d58c1ec8e42789cbc620
+ size 323014168
seed_1111/agent_trainer/critic_optimizer_state.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f1574fdb90735a922b09c67d07f7abdbd51181f00dc7bed878cb80adb5f50c1d
+ size 2631
seed_1111/agent_trainer/policy_optimizer_state.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:5f85f1f84eaee7fa314b2b4c345a051ea2ec17f4b13d15bade07d30107145ea9
+ size 646269121
seed_1111/agent_trainer/trainer_annealing_state.pkl ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:45d86164247159137f1b2aad8d24c59d64d493371eff50f6334cb468e177ed83
+ size 104
seed_1111/random_state.pkl ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b71bae3ed719da08fb5942994d7b6018f48839521fa21e68a5a0e0c2a55a398b
+ size 12174
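Each of the six checkpoint artifacts above is stored as a Git LFS pointer rather than as raw bytes: a small fixed-format text stub (spec version, SHA-256 object ID, byte size) that Git LFS resolves to the real object at checkout time. The generic layout, with a placeholder hash and size for illustration, is:

    version https://git-lfs.github.com/spec/v1
    oid sha256:<64-hex-digit object hash>
    size <byte count>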
src_code_for_reproducibility/chat_utils/__pycache__/chat_turn.cpython-312.pyc ADDED
Binary file (1.32 kB).
src_code_for_reproducibility/chat_utils/__pycache__/template_specific.cpython-312.pyc ADDED
Binary file (4.24 kB).
src_code_for_reproducibility/docs/source/conf.py ADDED
@@ -0,0 +1,48 @@
+ # Configuration file for the Sphinx documentation builder.
+ import os
+ import sys
+ sys.path.insert(0, os.path.abspath('../..'))
+
+ # -- Project information -----------------------------------------------------
+ project = 'llm_negotiation'
+ copyright = '2023, Your Name'
+ author = 'Your Name'
+
+ # -- General configuration ---------------------------------------------------
+ extensions = [
+     'sphinx.ext.autodoc',
+     'sphinx.ext.viewcode',
+     'sphinx.ext.napoleon',
+     'sphinx.ext.autosummary',
+     'sphinx.ext.intersphinx',
+     'sphinx.ext.mathjax',
+     'sphinxcontrib.mermaid',
+     'sphinx_rtd_theme',
+ ]
+
+ templates_path = ['_templates']
+ exclude_patterns = []
+
+ # -- Options for HTML output -------------------------------------------------
+ html_theme = 'sphinx_rtd_theme'
+ html_static_path = ['_static']
+
+ # -- Napoleon settings -------------------------------------------------------
+ napoleon_google_docstring = True
+ napoleon_numpy_docstring = False
+ napoleon_include_init_with_doc = True
+ napoleon_include_private_with_doc = False
+ napoleon_include_special_with_doc = True
+ napoleon_use_admonition_for_examples = False
+ napoleon_use_admonition_for_notes = False
+ napoleon_use_admonition_for_references = False
+ napoleon_use_ivar = False
+ napoleon_use_param = True
+ napoleon_use_rtype = True
+ napoleon_preprocess_types = False
+ napoleon_type_aliases = None
+ napoleon_attr_annotations = True
+
+ # -- Path setup --------------------------------------------------------------
+ # Make sure the project's modules can be found by Sphinx
+ sys.path.insert(0, os.path.abspath('../../src'))
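With this conf.py in place, the documentation builds with the standard Sphinx CLI. A typical invocation, assuming the usual docs/source layout shown in this commit (the build output directory is a guess, not something the commit specifies):

    pip install sphinx sphinx-rtd-theme sphinxcontrib-mermaid
    sphinx-build -b html docs/source docs/build/html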
src_code_for_reproducibility/docs/source/environments.rst ADDED
@@ -0,0 +1,35 @@
+ =================
+ MARL Environments
+ =================
+
+ This section provides detailed documentation for the multi-agent negotiation environments included in the library.
+
+ Each environment follows the standard interface described in :doc:`marl_standard` but has its own unique game rules,
+ dynamics, and implementation details.
+
+ .. toctree::
+     :maxdepth: 2
+     :caption: Available Environments:
+
+     environments/ipd
+     environments/diplomacy
+     environments/dond
+
+ Overview
+ --------
+
+ The library currently includes the following environments:
+
+ 1. **Iterated Prisoner's Dilemma (IPD)**: A classic game-theory problem in which two agents repeatedly decide whether to cooperate or defect, with different payoffs based on their joint actions.
+
+ 2. **Diplomacy**: An adaptation of the board game Diplomacy, where seven European powers compete for control of supply centers through strategic moves and alliances.
+
+ 3. **Deal or No Deal (DOND)**: A negotiation environment based on the paper `Deal or No Deal? End-to-End Learning for Negotiation Dialogues <https://arxiv.org/pdf/1706.05125>`_, in which agents negotiate over the distribution of a set of prizes.
+
+ Each environment's documentation includes:
+
+ - Game rules and background
+ - Implementation details
+ - API reference
+ - Example usage
+ - Advanced features and customization options
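For concreteness, IPD payoffs are conventionally written as a 2x2 matrix. A minimal sketch using the standard textbook values (T=5, R=3, P=1, S=0; illustrative only, since the payoffs actually used by this repo's IPD environment are not shown in this commit):

    # (my_move, their_move) -> my reward
    IPD_PAYOFFS = {
        ("C", "C"): 3,  # mutual cooperation (R)
        ("C", "D"): 0,  # sucker's payoff (S)
        ("D", "C"): 5,  # temptation to defect (T)
        ("D", "D"): 1,  # mutual defection (P)
    }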
src_code_for_reproducibility/docs/source/index.rst ADDED
@@ -0,0 +1,22 @@
+ Welcome to LLM Negotiation's documentation!
+ ===========================================
+ This library is a collection of tools for training and evaluating LLM-based agents in multi-agent environments. It is designed to be easy to use and extend.
+
+ .. toctree::
+     :maxdepth: 3
+     :caption: Contents:
+
+     installation
+     marl_standard
+     environments
+     launch
+     usage
+     modules
+     contributing
+
+ Indices and tables
+ ==================
+
+ * :ref:`genindex`
+ * :ref:`modindex`
+ * :ref:`search`
src_code_for_reproducibility/docs/source/marl_standard.rst ADDED
@@ -0,0 +1,141 @@
+ ===========================================================
+ Abstract Standard for Multi-Agent Negotiation Environments
+ ===========================================================
+
+ Multi-Agent Negotiation Environments require more features than gymnasium environments in order to be used as interfaces in general game-running code.
+ The fundamental differences between gymnasium environments and Multi-Agent Negotiation Environments are:
+
+ 1. The response from the LLM is a text action, not a discrete action. Appropriate parsing of the text is therefore required, and the model may need to be run multiple times to obtain the full action.
+    This is why we introduce the `AgentHandler` class, which is responsible for parsing the LLM's response.
+ 2. The environment needs to be able to handle multi-agent interactions.
+    This is why we introduce the `NegotiationEnvironment` class, which is responsible for handling the multi-agent interactions.
+ 3. MARL environments are complex to describe, and the same environment may be described differently in different contexts. Therefore, both the environment and the agent handlers are
+    responsible for describing a particular trajectory. This information is provided by the `get_log_info` method.
+ 4. There may be substantial overlap between the neural networks used by each agent; for instance, the same model may be used for all agents. This motivates a requirement for a
+    policy identifier for each agent.
+
+ Taking inspiration from the `gymnasium <https://gymnasium.farama.org/>`_ library, we introduce a new standard for Multi-Agent Negotiation Environments.
+
+ Our standard is built around the following two interfaces.
+
+ Environments have the form:
+
+ .. code-block:: python
+
+     class MarlEnvironment():
+
+         def __init__(self):
+             """Initialize the environment."""
+             pass
+
+         def reset(self):
+             """Reset the environment to an initial state and return the initial observation.
+
+             Returns:
+                 observation (dict): A dictionary where keys are agent identifiers and values are observations.
+             """
+             # (...)
+             return observation
+
+         def step(self, actions):
+             """Take a step in the environment using the provided actions.
+
+             Args:
+                 actions (dict): A dictionary where keys are agent identifiers and values are actions.
+
+             Returns:
+                 observations (dict): A dictionary where keys are agent identifiers and values are observations.
+                 done (bool): Whether the episode has ended.
+                 info (dict): Additional information about the environment.
+             """
+             # (...)
+             return observations, done, info
+
+         def get_log_info(self):
+             """Get additional information about the environment, used to log the game.
+
+             Returns:
+                 log_info (dict): Information about the environment required to log the game.
+             """
+             # (...)
+             return log_info
+
+         def render(self):
+             """Render the current state of the environment."""
+             pass
+
+         def close(self):
+             """Perform any necessary cleanup."""
+             pass
+
+
+     class AgentState():
+
+         def __init__(self):
+             """Initialize the agent state."""
+             pass
+
+         def step(self, observation_from_env, policy_output=None):
+             """Update the agent state based on the observation and the policy output (the LLM's response).
+
+             Args:
+                 observation_from_env (dict): The observation of the environment.
+                 policy_output : The output of the policy.
+
+             Returns:
+                 policy_id (str): The policy identifier.
+                 policy_input (dict): The input to the policy.
+                 action : The official action to be sent to the environment.
+                 done (bool): Whether the LLM action is ready to be sent to the environment.
+                 info (dict): Additional information about the agent.
+             """
+             # (...)
+             return policy_id, policy_input, action, done, info
+
+         def get_log_info(self):
+             """Get information about the agent required to log a trajectory.
+
+             Returns:
+                 log_info (dict): Information about the agent required to log a trajectory.
+             """
+             # (...)
+             return log_info
+
+         def render(self):
+             """Render the current state of the agent."""
+             pass
+
+         def close(self):
+             """Perform any necessary cleanup."""
+             pass
+
+
+ Implicitly, the keys of the `observations` dictionary returned by the `step` method of the `MarlEnvironment` interface represent the set of agents from which an action is expected at the current step. The next step should only expect actions from the agents in the `observations` dictionary.
+
+ Note that both classes have a `get_log_info` method, which is used to log the game. It returns a dictionary whose keys are agent identifiers and whose values are the information to log. We need this because the environment and the agent handler may log different information, and it makes it easier to log from the perspective of each agent. The core environment class should not need to know about the details of the agent handler.
+
+ Running Environments in Parallel
+ --------------------------------
+ This standard allows the use of the `run_batched_matches` function (TODO: link) to run environments efficiently. The core idea is to batch the policy calls for all agents across all environments.
+
+ .. note::
+     The ``run_batched_matches`` function allows you to run multiple negotiation games, or "matches," in parallel.
+     After each environment is initialized, the function continuously loops over all active matches and checks which agents
+     still have pending actions. Each agent's logic can require multiple calls to the policy (e.g., an LLM) before an action
+     becomes "ready" to be sent to the environment. (For instance, an agent might need several policy calls before producing a string that can be parsed into a valid action.) While agents are waiting for policy outputs, these calls for all agents across all matches are grouped together by unique policy identifier and processed in batch for efficiency. This is the core functionality of the ``run_batched_matches`` function.
+
+     Only once all actions from the required agents at a given step of an environment are ready does the function make a single ``env.step(...)`` call; this ensures
+     every match moves forward in lockstep for all its active agents. As soon as an environment signals it is done, the function
+     retrieves logged information from both the environment and the agent states before removing the match from the active set.
+
+     If more matches are waiting to be processed, they are then started one by one to maintain the specified degree of parallelism.
+     This batching approach provides an efficient mechanism for handling multi-agent or multi-policy environments, ensuring minimal
+     overhead and a clear, unified flow for stepping through matches.
+
+ Here is a diagram that shows how the `run_batched_matches` function works at a high level:
+
+ .. image:: media/runbatch.png
+     :alt: High-level flow of run_batched_matches
+     :width: 1000px
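The batching loop described in the note above can be sketched in a few lines of Python. This is an illustrative sketch only: `envs`, `agent_states` (match index to a dict of AgentState objects), and `policies` (policy id to a batched callable) are assumed inputs, and the helper names are hypothetical, not the repo's actual `run_batched_matches` implementation.

    def run_batched_matches_sketch(envs, agent_states, policies):
        """Illustrative only: drive many MarlEnvironments in lockstep, batching policy calls."""
        active = {i: env.reset() for i, env in enumerate(envs)}   # match id -> pending observations
        pending = {(i, a): (obs, None) for i, obs_dict in active.items() for a, obs in obs_dict.items()}
        ready_actions = {i: {} for i in active}                   # match id -> {agent id: action}
        while active:
            # 1) Advance every pending agent one step; collect policy requests by policy id.
            requests = {}  # policy_id -> list of ((match, agent), policy_input)
            for (i, a), (obs, out) in list(pending.items()):
                policy_id, policy_input, action, done, _ = agent_states[i][a].step(obs, out)
                if done:
                    ready_actions[i][a] = action
                    del pending[(i, a)]
                else:
                    requests.setdefault(policy_id, []).append(((i, a), policy_input))
            # 2) One batched call per policy id, across all matches.
            for policy_id, batch in requests.items():
                outputs = policies[policy_id]([inp for _, inp in batch])
                for (key, _), out in zip(batch, outputs):
                    pending[key] = (pending[key][0], out)
            # 3) Step every environment whose required agents are all ready.
            for i in list(active):
                if set(ready_actions[i]) == set(active[i]):
                    obs_dict, done, info = envs[i].step(ready_actions[i])
                    ready_actions[i] = {}
                    if done:
                        del active[i]  # logs would be collected here via get_log_info()
                    else:
                        active[i] = obs_dict
                        pending.update({(i, a): (o, None) for a, o in obs_dict.items()})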
src_code_for_reproducibility/docs/source/src.environments.dond.dond_log_funcs.rst ADDED
@@ -0,0 +1,7 @@
+ src.environments.dond.dond\_log\_funcs module
+ =============================================
+
+ .. automodule:: src.environments.dond.dond_log_funcs
+     :members:
+     :undoc-members:
+     :show-inheritance:
src_code_for_reproducibility/docs/source/src.environments.dond.dond_player.rst ADDED
@@ -0,0 +1,7 @@
+ src.environments.dond.dond\_agent module
+ ========================================
+
+ .. automodule:: src.environments.dond.dond_agent
+     :members:
+     :undoc-members:
+     :show-inheritance:
src_code_for_reproducibility/docs/source/src.environments.dond.dond_return_funcs.rst ADDED
@@ -0,0 +1,7 @@
+ src.environments.dond.dond\_return\_funcs module
+ ================================================
+
+ .. automodule:: src.environments.dond.dond_return_funcs
+     :members:
+     :undoc-members:
+     :show-inheritance:
src_code_for_reproducibility/docs/source/src.environments.dond.dond_statistics_funcs.rst ADDED
@@ -0,0 +1,7 @@
+ src.environments.dond.dond\_statistics\_funcs module
+ ====================================================
+
+ .. automodule:: src.environments.dond.dond_statistics_funcs
+     :members:
+     :undoc-members:
+     :show-inheritance:
src_code_for_reproducibility/docs/source/src.environments.dond.rst ADDED
@@ -0,0 +1,19 @@
+ src.environments.dond package
+ =============================
+
+ .. automodule:: src.environments.dond
+     :members:
+     :undoc-members:
+     :show-inheritance:
+
+ Submodules
+ ----------
+
+ .. toctree::
+     :maxdepth: 4
+
+     src.environments.dond.dond_agent
+     src.environments.dond.dond_game
+     src.environments.dond.dond_log_funcs
+     src.environments.dond.dond_statistics_funcs
+     src.environments.dond.dond_training_data_funcs
src_code_for_reproducibility/docs/source/src.environments.environment_imports.rst ADDED
@@ -0,0 +1,7 @@
+ src.environments.environment\_imports module
+ ============================================
+
+ .. automodule:: src.environments.environment_imports
+     :members:
+     :undoc-members:
+     :show-inheritance:
src_code_for_reproducibility/docs/source/src.environments.ipd.ipd_statistics_funcs.rst ADDED
@@ -0,0 +1,7 @@
+ src.environments.ipd.ipd\_statistics\_funcs module
+ ==================================================
+
+ .. automodule:: src.environments.ipd.ipd_statistics_funcs
+     :members:
+     :undoc-members:
+     :show-inheritance:
src_code_for_reproducibility/docs/source/src.environments.rst ADDED
@@ -0,0 +1,25 @@
+ src.environments package
+ ========================
+
+ .. automodule:: src.environments
+     :members:
+     :undoc-members:
+     :show-inheritance:
+
+ Subpackages
+ -----------
+
+ .. toctree::
+     :maxdepth: 4
+
+     src.environments.dond
+     src.environments.ipd
+
+ Submodules
+ ----------
+
+ .. toctree::
+     :maxdepth: 4
+
+     src.environments.env_imports
+     src.environments.environment_imports
src_code_for_reproducibility/docs/source/src.experiments.arithmetic_test.rst ADDED
@@ -0,0 +1,7 @@
+ src.experiments.arithmetic\_test module
+ =======================================
+
+ .. automodule:: src.experiments.arithmetic_test
+     :members:
+     :undoc-members:
+     :show-inheritance:
src_code_for_reproducibility/docs/source/src.experiments.last_completion.rst ADDED
@@ -0,0 +1,7 @@
+ src.experiments.last\_completion module
+ =======================================
+
+ .. automodule:: src.experiments.last_completion
+     :members:
+     :undoc-members:
+     :show-inheritance:
src_code_for_reproducibility/docs/source/src.experiments.rst ADDED
@@ -0,0 +1,17 @@
+ src.experiments package
+ =======================
+
+ .. automodule:: src.experiments
+     :members:
+     :undoc-members:
+     :show-inheritance:
+
+ Submodules
+ ----------
+
+ .. toctree::
+     :maxdepth: 4
+
+     src.experiments.arithmetic_test
+     src.experiments.generate_and_train
+     src.experiments.last_completion
src_code_for_reproducibility/docs/source/src.generation.rst ADDED
@@ -0,0 +1,15 @@
+ src.generation package
+ ======================
+
+ .. automodule:: src.generation
+     :members:
+     :undoc-members:
+     :show-inheritance:
+
+ Submodules
+ ----------
+
+ .. toctree::
+     :maxdepth: 4
+
+     src.generation.run_games
src_code_for_reproducibility/docs/source/src.generation.run_games.rst ADDED
@@ -0,0 +1,7 @@
+ src.generation.run\_games module
+ ================================
+
+ .. automodule:: src.generation.run_games
+     :members:
+     :undoc-members:
+     :show-inheritance:
src_code_for_reproducibility/docs/source/src.models.oai_agent.rst ADDED
@@ -0,0 +1,7 @@
+ src.models.oai\_agent module
+ ============================
+
+ .. automodule:: src.models.oai_agent
+     :members:
+     :undoc-members:
+     :show-inheritance:
src_code_for_reproducibility/docs/source/src.models.rst ADDED
@@ -0,0 +1,20 @@
+ src.models package
+ ==================
+
+ .. automodule:: src.models
+     :members:
+     :undoc-members:
+     :show-inheritance:
+
+ Submodules
+ ----------
+
+ .. toctree::
+     :maxdepth: 4
+
+     src.models.dummy_local_llm
+     src.models.local_llm
+     src.models.new_local_llm
+     src.models.server_llm
+     src.models.updatable_worker
+     src.models.vllm_worker_wrap
src_code_for_reproducibility/docs/source/src.models.updatable_worker.rst ADDED
@@ -0,0 +1,7 @@
+ src.models.updatable\_worker module
+ ===================================
+
+ .. automodule:: src.models.updatable_worker
+     :members:
+     :undoc-members:
+     :show-inheritance:
src_code_for_reproducibility/docs/source/src.models.vllm_worker_wrap.rst ADDED
@@ -0,0 +1,7 @@
+ src.models.vllm\_worker\_wrap module
+ ====================================
+
+ .. automodule:: src.models.vllm_worker_wrap
+     :members:
+     :undoc-members:
+     :show-inheritance:
src_code_for_reproducibility/docs/source/src.rst ADDED
@@ -0,0 +1,28 @@
+ src package
+ ===========
+
+ .. automodule:: src
+     :members:
+     :undoc-members:
+     :show-inheritance:
+
+ Subpackages
+ -----------
+
+ .. toctree::
+     :maxdepth: 4
+
+     src.environments
+     src.experiments
+     src.generation
+     src.models
+     src.training
+     src.utils
+
+ Submodules
+ ----------
+
+ .. toctree::
+     :maxdepth: 4
+
+     src.run
src_code_for_reproducibility/docs/source/src.training.ppo_train.rst ADDED
@@ -0,0 +1,7 @@
+ src.training.ppo\_train module
+ ==============================
+
+ .. automodule:: src.training.ppo_train
+     :members:
+     :undoc-members:
+     :show-inheritance:
src_code_for_reproducibility/docs/source/src.training.rst ADDED
@@ -0,0 +1,19 @@
+ src.training package
+ ====================
+
+ .. automodule:: src.training
+     :members:
+     :undoc-members:
+     :show-inheritance:
+
+ Submodules
+ ----------
+
+ .. toctree::
+     :maxdepth: 4
+
+     src.training.ppo_train
+     src.training.ppo_train_value_head
+     src.training.reinforce_training
+     src.training.rl_convs_processing
+     src.training.train_main
src_code_for_reproducibility/docs/source/src.utils.log_gpu_usage.rst ADDED
@@ -0,0 +1,7 @@
+ src.utils.log\_gpu\_usage module
+ ================================
+
+ .. automodule:: src.utils.log_gpu_usage
+     :members:
+     :undoc-members:
+     :show-inheritance:
src_code_for_reproducibility/markov_games/diplomacy/diplomacy_agent.py ADDED
@@ -0,0 +1,259 @@
+ import copy
+ import random
+ from typing import Dict, List, Tuple, Optional, Any
+
+ class DiplomacyAgent:
+     """Agent handler for the Diplomacy game that follows the MARL standard.
+
+     This class is responsible for parsing LLM output into valid Diplomacy orders,
+     managing the agent state, and providing information for logging.
+     """
+
+     def __init__(self, policy_id: str, power_name: str, random_valid_move=False):
+         """Initialize the agent handler for a power in the Diplomacy game.
+
+         Args:
+             policy_id: The identifier for the policy this agent uses
+             power_name: The name of the power this agent controls (e.g., 'FRANCE', 'ENGLAND')
+             random_valid_move: If True, select random valid moves instead of querying the LLM (default: False)
+         """
+         self.policy_id = policy_id
+         self.power_name = power_name
+         self.orders = []
+         self.wait = True
+         self.processing_state = "WAITING_FOR_ORDERS"
+         self.parsed_orders = []
+         self.order_status = {}
+         self.message_history = []
+         self.random_valid_move = random_valid_move
+
+     def step(self, observation_from_env, policy_output=None):
+         """Update the agent state based on the observation and LLM output.
+
+         Args:
+             observation_from_env: The observation from the environment
+             policy_output: The output from the LLM
+
+         Returns:
+             policy_id: The policy identifier
+             policy_input: The input to the policy
+             action: The official action to be sent to the environment
+             done: Whether the LLM action is ready to be sent to the environment
+             info: Additional information about the agent
+         """
+         info = {}
+
+         # If random_valid_move is enabled, select random valid moves
+         if self.random_valid_move:
+             valid_orders = self._select_random_valid_moves(observation_from_env)
+             self.orders = valid_orders
+             self.wait = False
+             action = {
+                 "orders": valid_orders,
+                 "wait": False
+             }
+             return self.policy_id, {}, action, True, info
+
+         # If there is no policy output, this is the initial step: prepare the prompt
+         if policy_output is None:
+             # Create the initial prompt for the LLM
+             phase = observation_from_env.get('phase', '')
+             units = observation_from_env.get('units', {}).get(self.power_name, [])
+             centers = observation_from_env.get('centers', {}).get(self.power_name, [])
+             orderable_locations = observation_from_env.get('orderable_locations', {})
+
+             prompt = self._create_prompt(phase, units, centers, orderable_locations)
+
+             return self.policy_id, {"prompt": prompt}, None, False, info
+
+         # Process the LLM output to extract orders
+         success, parsed_orders = self._parse_llm_output(policy_output)
+         self.parsed_orders = parsed_orders
+
+         if not success:
+             # Need more information from the LLM
+             clarification_prompt = self._create_clarification_prompt(policy_output, parsed_orders)
+             return self.policy_id, {"prompt": clarification_prompt}, None, False, info
+
+         # Validate whether the orders are valid for the current phase
+         valid_orders = self._validate_orders(parsed_orders, observation_from_env)
+
+         if valid_orders:
+             # Orders are valid, prepare the action for the environment
+             self.orders = valid_orders
+             self.wait = False
+             action = {
+                 "orders": valid_orders,
+                 "wait": False
+             }
+             return self.policy_id, {}, action, True, info
+         else:
+             # Orders are invalid, ask for new ones
+             error_prompt = self._create_error_prompt(parsed_orders, observation_from_env)
+             return self.policy_id, {"prompt": error_prompt}, None, False, info
+
+     def _create_prompt(self, phase, units, centers, orderable_locations):
+         """Create the initial prompt for the LLM.
+
+         Args:
+             phase: The current game phase
+             units: List of units controlled by this power
+             centers: List of supply centers controlled by this power
+             orderable_locations: List of locations where orders can be issued
+
+         Returns:
+             A prompt string for the LLM
+         """
+         prompt = f"You are playing as {self.power_name} in Diplomacy. The current phase is {phase}.\n\n"
+         prompt += f"Your units: {', '.join(units)}\n"
+         prompt += f"Your supply centers: {', '.join(centers)}\n"
+         prompt += f"Locations you can order: {', '.join(orderable_locations)}\n\n"
+
+         if phase.endswith('M'):  # Movement phase
+             prompt += "Please provide orders for your units in the form:\n"
+             prompt += "- A LON H (hold)\n"
+             prompt += "- F NTH - NWY (move)\n"
+             prompt += "- A WAL S F LON (support)\n"
+             prompt += "- F NWG C A NWY - EDI (convoy)\n"
+         elif phase.endswith('R'):  # Retreat phase
+             prompt += "Please provide retreat orders for your dislodged units:\n"
+             prompt += "- A PAR R MAR (retreat to MAR)\n"
+             prompt += "- A PAR D (disband)\n"
+         elif phase.endswith('A'):  # Adjustment phase
+             if len(units) < len(centers):
+                 prompt += "You can build units. Please provide build orders:\n"
+                 prompt += "- A PAR B (build army in PAR)\n"
+                 prompt += "- F BRE B (build fleet in BRE)\n"
+                 prompt += "- WAIVE (waive a build)\n"
+             elif len(units) > len(centers):
+                 prompt += "You must remove units. Please provide disbandment orders:\n"
+                 prompt += "- A PAR D (disband army in PAR)\n"
+                 prompt += "- F BRE D (disband fleet in BRE)\n"
+
+         prompt += "\nProvide your orders as a list, one per line."
+         return prompt
+
+     def _parse_llm_output(self, llm_output):
+         """Parse the LLM output to extract orders.
+
+         Args:
+             llm_output: The raw output from the LLM
+
+         Returns:
+             success: Whether parsing was successful
+             parsed_orders: List of parsed orders
+         """
+         # Simple parsing for now: extract lines that look like orders
+         lines = llm_output.strip().split('\n')
+         orders = []
+
+         for line in lines:
+             # Remove list markers, hyphens, etc.
+             line = line.strip('- *•').strip()
+
+             # Skip empty lines and lines that don't look like orders
+             if not line or line.startswith('I ') or line.startswith('Let\'s'):
+                 continue
+
+             # Check whether it looks like a Diplomacy order
+             if (' H' in line or ' -' in line or ' S ' in line or ' C ' in line or
+                     ' R ' in line or ' D' in line or ' B' in line or line == 'WAIVE'):
+                 orders.append(line)
+
+         return len(orders) > 0, orders
+
+     def _validate_orders(self, orders, observation):
+         """Validate whether the orders are valid for the current phase.
+
+         Args:
+             orders: List of orders to validate
+             observation: Current observation from the environment
+
+         Returns:
+             List of valid orders, or None if invalid
+         """
+         # For simplicity, we assume all parsed orders are valid.
+         # In a real implementation, we would use the game's validation logic.
+         return orders
+
+     def _create_clarification_prompt(self, previous_output, parsed_orders):
+         """Create a prompt asking for clarification when orders couldn't be parsed.
+
+         Args:
+             previous_output: The previous LLM output
+             parsed_orders: Any orders that were successfully parsed
+
+         Returns:
+             A prompt string for the LLM
+         """
+         prompt = f"I couldn't fully understand your orders for {self.power_name}. "
+
+         if parsed_orders:
+             prompt += "I understood these orders:\n"
+             for order in parsed_orders:
+                 prompt += f"- {order}\n"
+
+         prompt += "\nPlease provide clear, valid Diplomacy orders in the format:\n"
+         prompt += "- A LON H\n- F NTH - NWY\n- etc.\n"
+         return prompt
+
+     def _create_error_prompt(self, invalid_orders, observation):
+         """Create a prompt when orders are invalid.
+
+         Args:
+             invalid_orders: The invalid orders
+             observation: Current observation from the environment
+
+         Returns:
+             A prompt string for the LLM
+         """
+         prompt = f"The following orders for {self.power_name} are invalid:\n"
+         for order in invalid_orders:
+             prompt += f"- {order}\n"
+
+         prompt += "\nPlease provide valid orders for your units."
+         return prompt
+
+     def get_log_info(self):
+         """Get information about the agent required to log a trajectory.
+
+         Returns:
+             log_info: Information about the agent required to log a trajectory.
+         """
+         return {
+             "power_name": self.power_name,
+             "orders": self.orders,
+             "wait": self.wait,
+             "parsing_state": self.processing_state,
+             "message_history": self.message_history
+         }
+
+     def render(self):
+         """Render the current state of the agent."""
+         print(f"Power: {self.power_name}")
+         print(f"Orders: {self.orders}")
+         print(f"Wait: {self.wait}")
+
+     def close(self):
+         """Perform any necessary cleanup."""
+         pass
+
+     def _select_random_valid_moves(self, observation):
+         """Select random valid moves for all units.
+
+         Args:
+             observation: Current observation from the environment
+
+         Returns:
+             List of valid orders
+         """
+         possible_orders = observation.get('possible_orders', {})
+         valid_orders = []
+
+         # For each location with possible orders, select one randomly
+         for location, orders in possible_orders.items():
+             if orders:  # If there are any possible orders for this location
+                 valid_orders.append(random.choice(orders))
+
+         return valid_orders
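The multi-call contract of the MARL standard shows up clearly here: `step` keeps returning `done=False` with a fresh prompt until parsing succeeds. A hypothetical driver loop illustrating the intended usage (the `llm_policy` callable is an assumption, not part of this commit):

    def drive_agent(agent, observation, llm_policy, max_calls=5):
        """Illustrative: call the policy until the agent's action is ready."""
        policy_output = None
        for _ in range(max_calls):
            policy_id, policy_input, action, done, info = agent.step(observation, policy_output)
            if done:
                return action
            policy_output = llm_policy(policy_input["prompt"])  # one LLM call per retry
        raise RuntimeError("agent never produced a parseable action")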
src_code_for_reproducibility/markov_games/diplomacy/diplomacy_env.py ADDED
@@ -0,0 +1,230 @@
+ from typing import Dict, List, Tuple, Optional, Any
+ from diplomacy import Game
+ import random
+
+ class DiplomacyEnv:
+     """Multi-Agent Reinforcement Learning environment for Diplomacy.
+
+     This class wraps the Diplomacy game engine to provide an interface
+     compliant with the MARL standard.
+     """
+
+     def __init__(self, random_seed=None, map_name="standard", game_id=None, rules=None, max_steps=50):
+         """Initialize the Diplomacy environment.
+
+         Args:
+             random_seed: Optional seed for randomness
+             map_name: The name of the map to use (default: "standard")
+             game_id: Optional game ID
+             rules: Optional rules to apply to the game
+             max_steps: Maximum number of steps before forcing game end (default: 50)
+         """
+         self.random_seed = random_seed
+         self.map_name = map_name
+         self.game_id = game_id
+         self.rules = rules or []
+         self.game = None
+         self.active_powers = []
+         self.render_mode = None
+         self.max_steps = max_steps
+         self.current_steps = 0
+
+     def reset(self):
+         """Reset the environment to an initial state and return the initial observation.
+
+         Returns:
+             observation: A dictionary where keys are agent identifiers and values are observations.
+         """
+         # Initialize a new game
+         self.game = Game(game_id=self.game_id, map_name=self.map_name)
+
+         # Apply rules
+         for rule in self.rules:
+             self.game.add_rule(rule)
+
+         # Determine active powers (not eliminated)
+         self.active_powers = [name for name, power in self.game.powers.items()
+                               if not power.is_eliminated()]
+
+         # Reset step counter
+         self.current_steps = 0
+
+         # Create initial observations for all powers
+         observations = {}
+         for power_name in self.active_powers:
+             observations[power_name] = self._create_observation(power_name)
+
+         return observations
+
+     def step(self, actions):
+         """Take a step in the environment using the provided actions.
+
+         Args:
+             actions: A dictionary where keys are agent identifiers and values are actions.
+
+         Returns:
+             observations: A dictionary where keys are agent identifiers and values are observations.
+             done: Whether the episode has ended.
+             info: Additional information about the environment.
+         """
+         print(f"stepping {self.current_steps}")  # debug trace
+         self.current_steps += 1
+         # Apply actions (orders) for each power
+         for power_name, action in actions.items():
+             if power_name in self.active_powers:
+                 orders = action.get("orders", [])
+                 wait = action.get("wait", True)
+
+                 # Set orders for the power
+                 if orders:
+                     self.game.set_orders(power_name, orders)
+
+                 # Set wait flag
+                 self.game.set_wait(power_name, wait)
+
+         # Check whether all active powers are ready to proceed
+         if self.game.does_not_wait():
+             # Process the current phase
+             self.game.process()
+
+         # Update the active powers list after processing
+         self.active_powers = [name for name, power in self.game.powers.items()
+                               if not power.is_eliminated()]
+
+         # Create observations for all active powers
+         observations = {}
+         for power_name in self.active_powers:
+             observations[power_name] = self._create_observation(power_name)
+
+         # Check whether the game is done (either naturally or due to max steps)
+         done = self.game.is_game_done or self.current_steps >= self.max_steps
+
+         # Create info dict
+         info = {
+             "phase": self.game.get_current_phase(),
+             "active_powers": self.active_powers,
+             "centers": self.game.get_centers(),
+             "units": self.game.get_units(),
+             "current_steps": self.current_steps,
+             "max_steps_reached": self.current_steps >= self.max_steps
+         }
+
+         return observations, done, info
+
+     def _create_observation(self, power_name):
+         """Create the observation for a specific power.
+
+         Args:
+             power_name: The name of the power
+
+         Returns:
+             An observation dictionary
+         """
+         observation = {
+             "phase": self.game.get_current_phase(),
+             "units": self.game.get_units(),
+             "centers": self.game.get_centers(),
+             "orderable_locations": self.game.get_orderable_locations(power_name),
+             "order_status": self.game.get_order_status(power_name),
+             "possible_orders": self._get_possible_orders_for_power(power_name)
+         }
+         return observation
+
+     def _get_possible_orders_for_power(self, power_name):
+         """Get all possible orders for a power's units.
+
+         Args:
+             power_name: The name of the power
+
+         Returns:
+             A dictionary mapping units to their possible orders
+         """
+         all_possible_orders = self.game.get_all_possible_orders()
+
+         # Filter for only the locations where this power has units
+         power_units = self.game.get_units(power_name)
+         power_unit_locations = [unit[2:] for unit in power_units]
+
+         # For retreat phases, include retreating units
+         if self.game.phase_type == 'R':
+             power = self.game.get_power(power_name)
+             power_unit_locations.extend([unit[2:] for unit in power.retreats])
+
+         # For adjustment phases, include buildable locations
+         elif self.game.phase_type == 'A':
+             power = self.game.get_power(power_name)
+             # If we have more centers than units, we can build
+             if len(power.centers) > len(power.units):
+                 buildable_sites = self.game._build_sites(power)
+                 power_unit_locations.extend(buildable_sites)
+             # If we have more units than centers, we need to remove
+             elif len(power.units) > len(power.centers):
+                 # All units are candidates for removal
+                 pass
+
+         # Filter the possible orders to only those for this power's units/locations
+         power_possible_orders = {}
+         for loc, orders in all_possible_orders.items():
+             if loc[:3] in power_unit_locations:
+                 power_possible_orders[loc] = orders
+
+         return power_possible_orders
+
+     def get_log_info(self):
+         """Get additional information about the environment for logging.
+
+         Returns:
+             log_info: Information about the environment required to log the game.
+         """
+         if not self.game:
+             return {}
+
+         return {
+             "game_id": self.game.game_id,
+             "phase": self.game.get_current_phase(),
+             "map_name": self.game.map_name,
+             "centers": self.game.get_centers(),
+             "units": self.game.get_units(),
+             "powers": {name: {
+                 "units": power.units,
+                 "centers": power.centers,
+                 "is_eliminated": power.is_eliminated(),
+                 "order_status": self.game.get_order_status(name)
+             } for name, power in self.game.powers.items()},
+             "orders": self.game.get_orders(),
+             "active_powers": self.active_powers,
+             "is_game_done": self.game.is_game_done,
+             "outcome": self.game.outcome if self.game.is_game_done else None
+         }
+
+     def render(self, mode='human'):
+         """Render the current state of the environment.
+
+         Args:
+             mode: The rendering mode ('human', 'svg', etc.)
+
+         Returns:
+             The rendered image if applicable
+         """
+         self.render_mode = mode
+         if self.game:
+             if mode == 'human':
+                 # Just print basic game state
+                 print(f"Game: {self.game.game_id}")
+                 print(f"Phase: {self.game.get_current_phase()}")
+                 print(f"Active Powers: {self.active_powers}")
+                 print("Supply Centers:")
+                 for power_name, centers in self.game.get_centers().items():
+                     print(f"  {power_name}: {centers}")
+                 print("Units:")
+                 for power_name, units in self.game.get_units().items():
+                     print(f"  {power_name}: {units}")
+                 return None
+             elif mode == 'svg':
+                 # Return SVG representation
+                 return self.game.render(output_format='svg')
+         return None
+
+     def close(self):
+         """Perform any necessary cleanup."""
+         self.game = None
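A minimal rollout sketch tying this environment to the DiplomacyAgent handler from the previous file, using the built-in random-move mode so no LLM is required (illustrative; assumes the `diplomacy` package is installed):

    env = DiplomacyEnv(max_steps=10)
    observations = env.reset()
    agents = {p: DiplomacyAgent(policy_id="random", power_name=p, random_valid_move=True)
              for p in observations}
    done = False
    while not done:
        # Only the powers present in `observations` owe an action this step.
        actions = {p: agents[p].step(observations[p])[2] for p in observations}
        observations, done, info = env.step(actions)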
src_code_for_reproducibility/markov_games/diplomacy/diplomacy_logging.py ADDED
@@ -0,0 +1,360 @@
+ import os
+ import json
+ from utils.common_imports import *
+
+
+ def diplomacy_log_match(
+     path,
+     agents_log_info,
+     env_log_info,
+     metrics_func=None,
+     metrics_func_args=None
+ ):
+     """
+     Logs the Diplomacy game data and generates HTML visualizations using the get_log_info methods.
+
+     Args:
+         path (str): Base path to save the data.
+         agents_log_info (list): List of agent information dictionaries containing the get_log_info results.
+         env_log_info (dict): Environment information from its get_log_info method.
+         metrics_func (str, optional): Name of the function to calculate metrics.
+         metrics_func_args (dict, optional): Arguments for the metrics function.
+     """
+     # Create directory structure
+     os.makedirs(path, exist_ok=True)
+
+     # Save the environment log info
+     env_log_path = os.path.join(path, "env_log.json")
+     with open(env_log_path, "w") as f:
+         json.dump(env_log_info, f, indent=4, default=_json_serialize)
+
+     # Process each agent's log info
+     for agent_log in agents_log_info:
+         power_name = agent_log["power_name"]
+
+         # Define paths for raw data and statistics subfolders
+         power_path = os.path.join(path, power_name)
+         raw_data_path = os.path.join(power_path, "raw_data")
+         statistics_path = os.path.join(power_path, "statistics")
+
+         # Ensure directories exist
+         os.makedirs(raw_data_path, exist_ok=True)
+         os.makedirs(statistics_path, exist_ok=True)
+
+         # Determine the next available file number for raw data
+         raw_files = os.listdir(raw_data_path)
+         raw_numbers = [int(f.split('_')[-1].split('.')[0]) for f in raw_files if f.startswith("log_")]
+         next_raw_number = max(raw_numbers, default=0) + 1
+         raw_file = os.path.join(raw_data_path, f"log_{next_raw_number}.json")
+
+         # Save agent log info
+         with open(raw_file, "w") as f:
+             json.dump(agent_log, f, indent=4, default=_json_serialize)
+
+         # Log metrics if a metrics function is provided
+         if metrics_func:
+             metrics_files = os.listdir(statistics_path)
+             metrics_numbers = [int(f.split('_')[-1].split('.')[0]) for f in metrics_files if f.startswith("metrics_")]
+             next_metrics_number = max(metrics_numbers, default=0) + 1
+             metrics_file = os.path.join(statistics_path, f"metrics_{next_metrics_number}.json")
+
+             metrics = globals()[metrics_func](agent_log, env_log_info, **(metrics_func_args or {}))
+             with open(metrics_file, "w") as f:
+                 json.dump(metrics, f, indent=4)
+
+     # Generate the HTML visualization
+     html_content = generate_diplomacy_html(agents_log_info, env_log_info)
+
+     # Ensure the html directory exists
+     html_path = os.path.join(path, "html")
+     os.makedirs(html_path, exist_ok=True)
+
+     # Determine the next available file number for HTML
+     html_files = os.listdir(html_path)
+     html_numbers = [int(f.split('_')[-1].split('.')[0]) for f in html_files if f.startswith("game_summary_")]
+     next_html_number = max(html_numbers, default=0) + 1
+     html_file = os.path.join(html_path, f"game_summary_{next_html_number}.html")
+
+     # Save the HTML content to a file
+     with open(html_file, "w") as f:
+         f.write(html_content)
+
+ def generate_diplomacy_html(agent_infos, env_info):
+     """
+     Generate HTML visualization for a Diplomacy game.
+
+     Args:
+         agent_infos (list): List of agent information dictionaries from get_log_info.
+         env_info (dict): Environment information from get_log_info.
+
+     Returns:
+         str: HTML content for the game visualization.
+     """
+     # Extract game information
+     game_id = env_info.get("game_id", "Unknown")
+     phase = env_info.get("phase", "Unknown")
+     map_name = env_info.get("map_name", "standard")
+     is_game_done = env_info.get("is_game_done", False)
+     outcome = env_info.get("outcome", [])
+
+     centers = env_info.get("centers", {})
+     units = env_info.get("units", {})
+
+     # HTML head and style
+     html_content = """
+ <!DOCTYPE html>
+ <html lang="en">
+ <head>
+     <meta charset="UTF-8">
+     <meta name="viewport" content="width=device-width, initial-scale=1.0">
+     <title>Diplomacy Game {game_id}</title>
+     <style>
+         body {{
+             font-family: 'Arial', sans-serif;
+             background-color: #f5f5f5;
+             color: #333333;
+             margin: 0;
+             padding: 20px;
+         }}
+         .container {{
+             display: grid;
+             grid-template-columns: repeat(3, 1fr);
+             grid-gap: 20px;
+             margin-bottom: 30px;
+         }}
+         .central-info {{
+             grid-column: span 3;
+             background: #fff;
+             padding: 20px;
+             border-radius: 10px;
+             box-shadow: 0 4px 12px rgba(0, 0, 0, 0.1);
+             margin-bottom: 20px;
+         }}
+         .power-column {{
+             background: #fff;
+             padding: 15px;
+             border-radius: 10px;
+             box-shadow: 0 4px 12px rgba(0, 0, 0, 0.1);
+         }}
+         .message {{
+             margin-bottom: 15px;
+             padding: 12px;
+             border-radius: 8px;
+             box-shadow: 0 1px 4px rgba(0, 0, 0, 0.1);
+         }}
+         .user {{
+             background: rgba(235, 245, 255, 0.8);
+             border-left: 4px solid #007bff;
+         }}
+         .assistant {{
+             background: rgba(240, 255, 240, 0.8);
+             border-right: 4px solid #28a745;
+         }}
+         .orders {{
+             background: rgba(255, 248, 225, 0.8);
+             border-left: 4px solid #ffc107;
+         }}
+         .role {{
+             font-weight: bold;
+             margin-bottom: 5px;
+             color: #333333;
+         }}
+         .power-name {{
+             text-align: center;
+             font-size: 1.4em;
+             margin-bottom: 15px;
+             color: #000;
+             font-weight: 600;
+             text-transform: uppercase;
+             letter-spacing: 1px;
+         }}
+         .game-info {{
+             display: grid;
+             grid-template-columns: repeat(2, 1fr);
+             grid-gap: 15px;
+         }}
+         .info-card {{
+             background: #f9f9f9;
+             padding: 15px;
+             border-radius: 8px;
+             box-shadow: 0 1px 3px rgba(0, 0, 0, 0.1);
+         }}
+         .supply-centers, .units-list {{
+             display: flex;
+             flex-wrap: wrap;
+             justify-content: space-between;
+         }}
+         .supply-center, .unit {{
+             flex: 0 0 30%;
+             margin-bottom: 10px;
+             padding: 8px;
+             background: #f0f0f0;
+             border-radius: 5px;
+             text-align: center;
+         }}
+         h2 {{
+             border-bottom: 2px solid #eee;
+             padding-bottom: 10px;
+             margin-top: 0;
+         }}
+         .outcome {{
+             background: #e8f5e9;
+             padding: 15px;
+             border-radius: 8px;
+             margin-top: 15px;
+             font-weight: bold;
+             text-align: center;
+         }}
+         .austria {{ border-top: 5px solid #ff5050; }}
+         .england {{ border-top: 5px solid #5050ff; }}
+         .france {{ border-top: 5px solid #50c0ff; }}
+         .germany {{ border-top: 5px solid #808080; }}
+         .italy {{ border-top: 5px solid #50ff50; }}
+         .russia {{ border-top: 5px solid #ffffff; border: 1px solid #ccc; }}
+         .turkey {{ border-top: 5px solid #c0c000; }}
+     </style>
+ </head>
+ <body>
+     <div class="central-info">
+         <h2>Game Information</h2>
+         <div class="game-info">
+             <div class="info-card">
+                 <h3>Game Details</h3>
+                 <p><strong>Game ID:</strong> {game_id}</p>
+                 <p><strong>Phase:</strong> {phase}</p>
+                 <p><strong>Map:</strong> {map_name}</p>
+                 <p><strong>Status:</strong> {status}</p>
+             </div>
+             <div class="info-card">
+                 <h3>Supply Centers</h3>
+                 <div class="supply-centers">
+ """.format(
+         game_id=game_id,
+         phase=phase,
+         map_name=map_name,
+         status="Completed" if is_game_done else "Active"
+     )
+
+     # Add supply center information
+     for power, power_centers in centers.items():
+         html_content += f"""
+                     <div class="supply-center">
+                         <strong>{power}:</strong> {len(power_centers)}
+                     </div>
+ """
+
+     html_content += """
+                 </div>
+             </div>
+         </div>
+ """
+
+     # Add outcome if game is done
+     if is_game_done and outcome:
+         winners = outcome[1:] if len(outcome) > 1 else ["Draw"]
+         html_content += f"""
+         <div class="outcome">
+             <h3>Game Outcome</h3>
+             <p>Winners: {', '.join(winners)}</p>
+         </div>
+ """
+
+     html_content += """
+     </div>
+     <div class="container">
+ """
+
+     # Add each power's information
+     for agent_log in agent_infos:
+         power_name = agent_log["power_name"]
+         power_class = power_name.lower()
+         orders = agent_log.get("orders", [])
+         message_history = agent_log.get("message_history", [])
+
+         html_content += f"""
+         <div class="power-column {power_class}">
+             <div class="power-name">{power_name}</div>
+
+             <div class="info-card">
+                 <h3>Units</h3>
+                 <ul>
+ """
+
+         # Add units information
+         power_units = units.get(power_name, [])
+         for unit in power_units:
+             html_content += f"<li>{unit}</li>"
+
+         html_content += """
+                 </ul>
+             </div>
+
+             <div class="message orders">
+                 <div class="role">Final Orders</div>
+                 <ul>
+ """
+
+         # Add orders
+         for order in orders:
+             html_content += f"<li>{order}</li>"
+
+         html_content += """
+                 </ul>
+             </div>
+ """
+
+         # Add message history
+         for message in message_history:
+             if isinstance(message, dict):
+                 # Skip system messages or handle them differently
+                 if message.get("role") == "system":
+                     continue
+
+                 role = message.get("role", "unknown")
+                 content = message.get("content", "")
+
+                 role_class = "user" if role == "user" else "assistant"
+                 role_display = "Environment" if role == "user" else f"LLM ({power_name})"
+
+                 # Escape HTML characters in content
+                 content = content.replace("<", "&lt;").replace(">", "&gt;").replace("\n", "<br>")
+
+                 html_content += f"""
+             <div class="message {role_class}">
+                 <div class="role">{role_display}</div>
+                 <p>{content}</p>
+             </div>
+ """
+             elif isinstance(message, str):
+                 # Simple string messages (may be used in some implementations)
+                 html_content += f"""
+             <div class="message">
+                 <p>{message}</p>
+             </div>
+ """
+
+         html_content += """
+         </div>
+ """
+
+     html_content += """
+     </div>
+ </body>
+ </html>
+ """
+
+     return html_content
+
+ def _json_serialize(obj):
+     """
+     A helper function to convert non-JSON-serializable objects
+     (like OrderResult) into strings or dicts.
+     """
+     # Check for the specific object types known to be problematic
+     if obj.__class__.__name__ == "OrderResult":
+         # Return a string representation or a dict
+         return str(obj)
+
+     # Fallback: attempt to convert anything else to string
+     return str(obj)
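Wiring this into a finished match is straightforward: after a rollout ends, collect each side's `get_log_info()` output and hand it to the logger. A sketch, where `agents` and `env` follow the classes defined in the files above and the output path is arbitrary:

    agents_log_info = [agent.get_log_info() for agent in agents.values()]
    env_log_info = env.get_log_info()
    diplomacy_log_match("runs/match_0001", agents_log_info, env_log_info)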
src_code_for_reproducibility/markov_games/diplomacy/diplomacy_logging_for_training.py ADDED
File without changes
src_code_for_reproducibility/markov_games/ipd/Ipd_hard_coded_agents.py ADDED
@@ -0,0 +1,72 @@
+ from dataclasses import dataclass
+ from typing import Any, Tuple
+
+ from mllm.markov_games.ipd.ipd_agent import IPDAgent
+ from mllm.markov_games.rollout_tree import AgentActLog, ChatTurn
+
+
+ @dataclass
+ class AlwaysCooperateIPDAgent(IPDAgent):
+     async def act(self, observation) -> Tuple[Any, AgentActLog]:
+         """
+         Always plays the cooperate action, ignoring the observation.
+         Returns the configured cooperate_string so the simulation parses it as "C".
+         """
+         action = self.cooperate_string
+
+         # Log a minimal, structured chat turn for consistency with other agents
+         turn_text = f"Playing cooperate: {action}"
+         self.state.chat_history.append(
+             ChatTurn(
+                 agent_id=self.agent_id,
+                 role="assistant",
+                 content=turn_text,
+                 is_state_end=True,
+             )
+         )
+
+         act_log = AgentActLog(
+             chat_turns=[self.state.chat_history[-1]],
+             info=None,
+         )
+
+         # Advance internal counters similar to IPDAgent semantics
+         self.state.chat_counter = len(self.state.chat_history)
+         self.state.round_nb = observation.round_nb
+
+         return action, act_log
+
+
+ @dataclass
+ class AlwaysDefectIPDAgent(IPDAgent):
+     async def act(self, observation) -> Tuple[Any, AgentActLog]:
+         """
+         Always plays the defect action, ignoring the observation.
+         Returns the configured defect_string so the simulation parses it as "D".
+         """
+         action = self.defect_string
+
+         # Log a minimal, structured chat turn for consistency with other agents
+         turn_text = f"Playing defect: {action}"
+         self.state.chat_history.append(
+             ChatTurn(
+                 agent_id=self.agent_id,
+                 role="assistant",
+                 content=turn_text,
+                 is_state_end=True,
+             )
+         )
+
+         act_log = AgentActLog(
+             chat_turns=[self.state.chat_history[-1]],
+             info=None,
+         )
+
+         # Advance internal counters similar to IPDAgent semantics
+         self.state.chat_counter = len(self.state.chat_history)
+         self.state.round_nb = observation.round_nb
+
+         return action, act_log
src_code_for_reproducibility/markov_games/ipd/__init__.py ADDED
@@ -0,0 +1,7 @@
+ from .Ipd_hard_coded_agents import AlwaysCooperateIPDAgent, AlwaysDefectIPDAgent
+
+ __all__ = [
+     "AlwaysCooperateIPDAgent",
+     "AlwaysDefectIPDAgent",
+ ]
src_code_for_reproducibility/markov_games/ipd/__pycache__/Ipd_hard_coded_agents.cpython-312.pyc ADDED
Binary file (2.86 kB).
 
src_code_for_reproducibility/markov_games/ipd/__pycache__/__init__.cpython-312.pyc ADDED
Binary file (310 Bytes).
 
src_code_for_reproducibility/markov_games/ipd/__pycache__/ipd_simulation.cpython-312.pyc ADDED
Binary file (6.72 kB).
 
src_code_for_reproducibility/markov_games/ipd/__pycache__/ipd_statistics.cpython-312.pyc ADDED
Binary file (1.28 kB).
 
src_code_for_reproducibility/markov_games/ipd/ipd_agent.py ADDED
@@ -0,0 +1,115 @@
+ import copy
+ import re
+ from collections.abc import Callable
+ from dataclasses import dataclass
+ from typing import Any, Dict, List, Tuple
+
+ from mllm.markov_games.agent import Agent
+ from mllm.markov_games.rollout_tree import AgentActLog, ChatTurn
+
+
+ @dataclass
+ class IPDAgentState:
+     """
+     Mutable state of an IPD agent: the number of policy retries so far,
+     the current round, how many chat turns have already been logged,
+     and the full chat history.
+     """
+
+     nb_retries: int
+     round_nb: int
+     chat_counter: int
+     chat_history: List[ChatTurn]
+
+
+ @dataclass
+ class IPDAgent(Agent):
+     seed: int
+     agent_id: str
+     agent_name: str
+     policy: Callable[[List[Dict]], str]
+     intro_prompt: str  # Introduction prompt explaining the game rules
+     goal_prompt: str  # Prompt explaining the agent's goal
+     strategy_prompt: str  # Prompt suggesting a strategy to the agent
+     max_errors: int  # Maximum number of errors allowed before default action
+     allow_reasoning: bool  # Whether to allow reasoning in the response
+     max_reasoning_chars: int  # Maximum number of characters for reasoning
+     cooperate_string: str  # String parsed as playing cooperate by the simulation
+     defect_string: str  # String parsed as playing defect by the simulation
+
+     def __post_init__(self):
+         self.state = IPDAgentState(
+             nb_retries=0, round_nb=0, chat_counter=0, chat_history=[]
+         )
+
+     async def act(self, observation) -> Tuple[Any, AgentActLog]:
+         """
+         Choose an action for the current round. On the first turn, the intro
+         prompt is appended to the chat; at the start of each new round, the
+         co-agent's last move is reported. The policy is then queried for a
+         move constrained to the cooperate/defect strings, and the action is
+         returned together with a log of the chat turns produced by this call.
+         """
+         round_nb = observation.round_nb
+
+         # If it's the first round, we need to send the intro prompt
+         if round_nb == 0 and self.state.chat_counter == 0:
+             self.state.chat_history.append(
+                 ChatTurn(
+                     agent_id=self.agent_id,
+                     role="user",
+                     content=self.intro_prompt,
+                     is_state_end=True,
+                 )
+             )
+
+         # If a new round has started, report the co-agent's last move
+         if round_nb > self.state.round_nb:
+             coagent_action = observation.last_coagent_move
+             user_message = f"Last round, the other agent played {coagent_action}."
+             self.state.chat_history.append(
+                 ChatTurn(
+                     agent_id=self.agent_id,
+                     role="user",
+                     content=user_message,
+                     is_state_end=True,
+                 )
+             )
+
+         # Query the policy for an action, constrained to the two legal action
+         # strings (escaped so regex metacharacters are taken literally)
+         output_chat_turn: ChatTurn = await self.policy(
+             state=self.state.chat_history,
+             agent_id=self.agent_id,
+             regex=f"({re.escape(self.cooperate_string)}|{re.escape(self.defect_string)})",
+         )
+         self.state.chat_history.append(output_chat_turn)
+         action = output_chat_turn.content
+
+         agent_step_log = AgentActLog(
+             chat_turns=self.state.chat_history[self.state.chat_counter :], info=None
+         )
+         self.state.chat_counter = len(self.state.chat_history)
+         self.state.round_nb = round_nb
+
+         return action, agent_step_log
+
+     def get_safe_copy(self):
+         """
+         Return a safe copy of the agent: a shallow copy
+         with a deep copy of its mutable state.
+         """
+         agent_copy = copy.copy(self)
+         agent_copy.state = copy.deepcopy(self.state)
+         return agent_copy
+
+     def reset(self):
+         # Reinitialize to the same fresh state as __post_init__
+         self.state = IPDAgentState(
+             nb_retries=0, round_nb=0, chat_counter=0, chat_history=[]
+         )
+
+     def render(self):
+         pass
+
+     def close(self):
+         pass
+
+     def get_agent_info(self):
+         pass
src_code_for_reproducibility/markov_games/ipd/ipd_simulation.py ADDED
@@ -0,0 +1,162 @@
+ import copy
+ from dataclasses import dataclass
+ from typing import Dict, List, Tuple
+
+ from mllm.markov_games.markov_game import Simulation
+ from mllm.markov_games.rollout_tree import SimulationStepLog
+ from mllm.utils.get_coagent_id import get_coagent_id
+
+
+ @dataclass
+ class IPDState:
+     """
+     State of the Iterated Prisoner's Dilemma game.
+     """
+
+     round_nb: int = 0
+     done: bool = False
+     last_moves: Dict[str, str] | None = None
+
+
+ @dataclass
+ class IPDObs:
+     """
+     Observation in the Iterated Prisoner's Dilemma game.
+     """
+
+     round_nb: int
+     last_coagent_move: str | None
+
+
+ class IPD(Simulation):
+     """
+     Iterated Prisoner's Dilemma simulation following the standard.
+
+     In each round of the game, two agents simultaneously choose to either cooperate (C) or defect (D).
+     The payoffs are as follows:
+     - If both cooperate: both receive the "reward" (usually 3 points)
+     - If both defect: both receive the "punishment" (usually 1 point)
+     - If one cooperates and one defects: the defector receives the "temptation" (usually 5 points)
+       and the cooperator receives the "sucker" payoff (usually 0 points)
+
+     The game is played for a specified number of rounds.
+     """
+
+     def __init__(
+         self,
+         agent_ids: List[str],
+         agent_names: List[str],
+         seed: int,
+         rounds_per_game: int,
+         reward: float,  # Both cooperate
+         punishment: float,  # Both defect
+         temptation: float,  # Defector's reward when the other cooperates
+         sucker: float,  # Cooperator's reward when the other defects
+         cooperate_actions: List[str],
+         defect_actions: List[str],
+     ):
+         self.agent_ids = agent_ids
+         self.agent_names = agent_names
+         self.seed = seed
+         self.rounds_per_game = rounds_per_game
+         self.reward = reward
+         self.punishment = punishment
+         self.temptation = temptation
+         self.sucker = sucker
+         self.cooperate_actions = cooperate_actions
+         self.defect_actions = defect_actions
+         self.state = IPDState()
+
+     def step(self, actions: Dict[str, str]) -> Tuple[bool, SimulationStepLog]:
+         """
+         Take a step in the environment using the provided actions.
+
+         Args:
+             actions (dict): A dictionary mapping agent identifiers to actions ('C' or 'D').
+
+         Returns:
+             done (bool): Whether the episode has ended.
+             step_log (SimulationStepLog): Per-agent rewards and the normalized actions.
+         """
+
+         agent0_action = actions[self.agent_ids[0]]
+         agent1_action = actions[self.agent_ids[1]]
+
+         # Normalize actions to the standard C/D format;
+         # unrecognized (gibberish) actions are treated as defect
+         def normalize_action(action):
+             if action in self.cooperate_actions:
+                 return "C"
+             elif action in self.defect_actions:
+                 return "D"
+             else:
+                 return "D"
+
+         norm_action0 = normalize_action(agent0_action)
+         norm_action1 = normalize_action(agent1_action)
+
+         # Calculate rewards using the payoff matrix
+         payoffs = {
+             ("C", "C"): [self.reward, self.reward],
+             ("C", "D"): [self.sucker, self.temptation],
+             ("D", "C"): [self.temptation, self.sucker],
+             ("D", "D"): [self.punishment, self.punishment],
+         }
+
+         round_rewards = {
+             self.agent_ids[0]: payoffs[(norm_action0, norm_action1)][0],
+             self.agent_ids[1]: payoffs[(norm_action0, norm_action1)][1],
+         }
+
+         # Update game state
+         self.state.round_nb += 1
+         self.state.last_moves = copy.deepcopy(actions)
+         done = self.state.round_nb >= self.rounds_per_game
+         step_log = SimulationStepLog(
+             rewards=round_rewards,
+             info={
+                 "actions": {
+                     self.agent_ids[0]: norm_action0,
+                     self.agent_ids[1]: norm_action1,
+                 }
+             },
+         )
+
+         return done, step_log
+
+     def get_obs(self):
+         """Returns all agent observations in a dict keyed by agent id."""
+         observations = {}
+         for agent_id in self.agent_ids:
+             observations[agent_id] = self.get_obs_agent(agent_id)
+         return observations
+
+     def get_obs_agent(self, agent_id):
+         """Returns the observation for agent_id."""
+         if self.state.last_moves is not None:
+             other_id = get_coagent_id(self.agent_ids, agent_id)
+             last_coagent_move = self.state.last_moves[other_id]
+         else:
+             last_coagent_move = None
+         obs = IPDObs(round_nb=self.state.round_nb, last_coagent_move=last_coagent_move)
+         return obs
+
+     def reset(self):
+         """Resets the game state and returns initial observations."""
+         self.state = IPDState()
+         return self.get_obs()
+
+     def get_safe_copy(self):
+         """
+         Return a safe copy of the simulation: a shallow copy with a deep copy of the state.
+         """
+         simulation_copy = copy.copy(self)
+         simulation_copy.state = copy.deepcopy(self.state)
+         return simulation_copy
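+
+
+ # Example usage (a sketch; the argument values below are illustrative, not
+ # repo defaults):
+ #
+ #     sim = IPD(
+ #         agent_ids=["agent_0", "agent_1"], agent_names=["A", "B"], seed=0,
+ #         rounds_per_game=10, reward=3.0, punishment=1.0, temptation=5.0,
+ #         sucker=0.0, cooperate_actions=["C"], defect_actions=["D"],
+ #     )
+ #     sim.reset()
+ #     done, step_log = sim.step({"agent_0": "C", "agent_1": "D"})
+ #     # step_log.rewards == {"agent_0": 0.0, "agent_1": 5.0}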
src_code_for_reproducibility/markov_games/ipd/ipd_statistics.py ADDED
@@ -0,0 +1,18 @@
+ from __future__ import annotations
+
+ from typing import Callable, List, Tuple
+
+ from mllm.markov_games.rollout_tree import SimulationStepLog
+
+
+ def avg_reward(sl: SimulationStepLog) -> List[Tuple[str, float]] | None:
+     """
+     Return one (name, reward) pair per agent for this step.
+     Steps that belong to buffer (non-live) agents are skipped by returning None.
+     """
+     for aid in (sl.rewards or {}).keys():
+         if "buffer" in str(aid) and "live" not in str(aid):
+             return None
+     # One value per agent at each step
+     rewards_dict = {f"reward-{aid}": float(v) for aid, v in (sl.rewards or {}).items()}
+     return [(key, value) for key, value in rewards_dict.items() if value is not None]
+
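+ # Example (a sketch; the field values below are illustrative):
+ #     avg_reward(SimulationStepLog(rewards={"agent_0": 3.0, "agent_1": 3.0}, info=None))
+ #     -> [("reward-agent_0", 3.0), ("reward-agent_1", 3.0)]
+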
+ # Statistics functions applied to each simulation step log
+ stat_functs: list[Callable[[SimulationStepLog], List[Tuple[str, float]]]] = [
+     avg_reward,
+ ]
src_code_for_reproducibility/markov_games/negotiation/README.md ADDED
@@ -0,0 +1,40 @@
+ ## Negotiation Games: core mechanics and variants
+
+ This family of games features two agents who, in each round, may briefly communicate and then simultaneously propose how to split a fixed resource (most commonly 10 coins). Each agent's reward is the amount it keeps multiplied by its per-unit value. The starting speaker alternates deterministically across rounds.
+
+ Communication is optional and variant-dependent: some settings encourage rich messaging to share private information, while others remove messaging entirely to focus on allocation behavior.
+
+ Proportional splitting is used when the two proposals exceed the available total: allocations are scaled down proportionally rather than discarded, as in the sketch below. This preserves a useful learning signal even when agents over-claim.
+
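+ A minimal sketch of the proportional rule (function and argument names are illustrative, not the repo's actual API):
+
+ ```python
+ def proportional_split(claim_a: float, claim_b: float, total: float = 10.0):
+     """Scale over-claiming proposals down so they exactly exhaust the total."""
+     claimed = claim_a + claim_b
+     if claimed <= total:
+         return claim_a, claim_b  # feasible proposals stand as written
+     scale = total / claimed  # e.g., claims of 8 and 7 over 10 coins -> scale 2/3
+     return claim_a * scale, claim_b * scale
+ ```
+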
+ ### Variants (in increasing difficulty)
+
+ - No‑Press Split
+   - Single item type (coins).
+   - No communication; agents go straight to making split proposals, with the starting player alternating deterministically.
+   - Motivation: mirrors no‑communication setups (e.g., Advantage Alignment) while keeping the split decision nontrivial.
+   - Deterministic mode: values are fixed and public; one agent values coins at 10, the other at 1 (alternating each round).
+   - Stochastic mode: values are random and uncorrelated.
+
+ - Trust-and-Split RPS (TAS-RPS)
+   - Single item type (coins).
+   - Each round, a rock–paper–scissors hand draw creates a strong asymmetry: the winner’s per-coin value is 10, the loser’s is 1.
+   - Each agent initially sees only their own hand and must communicate to coordinate an optimal split.
+   - Motivation: enforce a large value disparity so one’s own value reveals little about the other’s (avoiding ceiling effects) and incentivize meaningful communication.
+
+ - Trust-and-Split (TAS)
+   - Single item type (coins); each round, each agent’s per-coin value is independently sampled from a broad range (e.g., 1–20).
+   - Each agent observes only their own value; they may use short messages to share and negotiate.
+   - Motivation: a simple blend that tests whether agents learn to exchange private information and coordinate proportional, value-aware splits.
+
+ - Deal-or-No-Deal (DOND)
+   - Introduced in [Deal or No Deal? End-to-End Learning for Negotiation Dialogues](https://arxiv.org/pdf/1706.05125).
+   - Multiple item types (typically "books", "hats", and "balls") with limited stocks; each agent has its own per-type values.
+   - A deal pays out only if both proposals exactly agree and respect the stock; otherwise there is no deal (zero reward) that round (see the sketch after this list).
+   - Motivation: a known benchmark closer to real-world bargaining, where both parties must explicitly agree.
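+
+ A minimal sketch of the DOND agreement rule (a hypothetical helper, not the repo's actual API), assuming proposals are per-item-type counts of what each agent keeps:
+
+ ```python
+ def deal_is_valid(stock: dict[str, int], keep_a: dict[str, int], keep_b: dict[str, int]) -> bool:
+     """True iff the two keep-proposals exactly partition each item's stock."""
+     return all(
+         keep_a.get(item, 0) + keep_b.get(item, 0) == count
+         for item, count in stock.items()
+     )
+
+ # deal_is_valid({"books": 2, "hats": 1}, {"books": 2}, {"hats": 1})   -> True
+ # deal_is_valid({"books": 2, "hats": 1}, {"books": 2}, {"books": 1})  -> False
+ ```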
src_code_for_reproducibility/markov_games/negotiation/__pycache__/dond_agent.cpython-312.pyc ADDED
Binary file (4.19 kB).
 
src_code_for_reproducibility/markov_games/negotiation/__pycache__/dond_simulation.cpython-312.pyc ADDED
Binary file (10.2 kB).
 
src_code_for_reproducibility/markov_games/negotiation/__pycache__/nego_agent.cpython-312.pyc ADDED
Binary file (10.9 kB).