Add files using upload-large-folder tool

Files changed (12)

seed_42/Qwen/Qwen2.5-7B-Instruct/adapters/agent_adapter/adapter_model.safetensors +1 -1
seed_42/agent_trainer/policy_optimizer_state.pt +1 -1
seed_42/agent_trainer/trainer_annealing_state.pkl +1 -1
seed_42/random_state.pkl +1 -1
src_code_for_reproducibility/markov_games/__pycache__/alternative_actions_runner.cpython-312.pyc +0 -0
src_code_for_reproducibility/markov_games/__pycache__/markov_game.cpython-312.pyc +0 -0
src_code_for_reproducibility/markov_games/negotiation/README.md +3 -16
src_code_for_reproducibility/markov_games/negotiation/nego_hard_coded_policies.py +10 -4
src_code_for_reproducibility/markov_games/negotiation/negotiation_statistics.py +5 -0
src_code_for_reproducibility/markov_games/negotiation/no_press_nego_agent.py +14 -0
src_code_for_reproducibility/markov_games/negotiation/tas_rps_simulation.py +19 -10
src_code_for_reproducibility/models/__pycache__/__init__.cpython-312.pyc +0 -0

seed_42/Qwen/Qwen2.5-7B-Instruct/adapters/agent_adapter/adapter_model.safetensors CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:5f52affcd642fa67620c5f7c3155cb8a867b8f45e80119606c46cb2301660cde
 size 323014168

 version https://git-lfs.github.com/spec/v1
+oid sha256:6cf7df3f718064f8b8bccd484ee71d607c706663cefa78a180e45d4dcf8fc0b7
 size 323014168

seed_42/agent_trainer/policy_optimizer_state.pt CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:911335e20ef7b5f2e34bb166cb8f236807ed6874707956bad488dac1989ca6e9
 size 646269121

 version https://git-lfs.github.com/spec/v1
+oid sha256:2ad344079b33d8b7633e4db957f7f603999d25b31d44becaf441bf8f8a6cb607
 size 646269121

seed_42/agent_trainer/trainer_annealing_state.pkl CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:7b6175536b701094d5172932b38c6ca6c17baa2f07ab83ebb00f80f9d1c96bc9
 size 104

 version https://git-lfs.github.com/spec/v1
+oid sha256:7a41c35f20678f6c02b24d48db5127433d12a342565274268816a2f39fd757e8
 size 104

seed_42/random_state.pkl CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:b2113fdb42ab3e7764f6a201a6b7edb00a002a4d9dead874859847cfcadac96f
 size 12254

 version https://git-lfs.github.com/spec/v1
+oid sha256:1bed461beebec279976e9e4353eb3aff688ccd1bd5ff66516a692c1a0356b610
 size 12254

src_code_for_reproducibility/markov_games/__pycache__/alternative_actions_runner.cpython-312.pyc CHANGED

Binary files a/src_code_for_reproducibility/markov_games/__pycache__/alternative_actions_runner.cpython-312.pyc and b/src_code_for_reproducibility/markov_games/__pycache__/alternative_actions_runner.cpython-312.pyc differ

src_code_for_reproducibility/markov_games/__pycache__/markov_game.cpython-312.pyc CHANGED

Binary files a/src_code_for_reproducibility/markov_games/__pycache__/markov_game.cpython-312.pyc and b/src_code_for_reproducibility/markov_games/__pycache__/markov_game.cpython-312.pyc differ

src_code_for_reproducibility/markov_games/negotiation/README.md CHANGED

@@ -9,29 +9,16 @@ Proportional splitting is used when the two proposals exceed the available total
 ### Variants (in increasing difficulty)
 - No‑Press Split
-  - Single item type (coins)
-  - No communication; agents go straight to making split proposals, with the starting player alternating deterministically.
   - Motivation: mirrors no‑communication setups (e.g., Advantage Alignment) while keeping the split decision nontrivial.
-  - Deterministic Mode: values are fixed and public: one agent values coins at 10, the other at 1 (alternates each round).
-  - Stochastic Mode: values are random and uncorrelated.
 - Trust-and-Split RPS (TAS-RPS)
   - Single item type (coins)
   - Each round, a rock–paper–scissors hand draw creates a strong asymmetry: the winner’s per-coin value is 10, the loser’s is 1.
   - Each agent initially sees only their own hand and must communicate to coordinate an optimal split.
   - Motivation: enforce large value disparity so one’s own value reveals little about the other’s (avoiding ceiling effects) and incentivize meaningful communication.
-- Trust-and-Split (TAS)
-  - Single item type (coins); each round, each agent’s per-coin value is independently sampled in a broad range (e.g., 1–20).
-  - Each agent observes only their own value; they may use short messages to share and negotiate.
-  - Motivation: a simple blend that tests whether agents learn to exchange private information and coordinate proportional, value-aware splits.
-- Deal-or-No-Deal (DOND)
-  - Introduced in [Deal or No Deal? End-to-End Learning for Negotiation Dialogues](https://arxiv.org/pdf/1706.05125)
-  - Multiple item types (typically "books", "hats" and "balls") with limited stocks; each agent has its own per-type values.
-  - A deal pays out only if both proposals exactly agree and respect the stock; otherwise no deal (zero reward) that round.
-  - Motivation: a known benchmark closer to real-world bargaining, where both parties must explicitly agree.

 ### Variants (in increasing difficulty)
 - No‑Press Split
+  - Multiple item types (e.g., hats, balls, books)
+  - The item values for each agent are public.
+  - No communication; agents go straight to making split proposals.
   - Motivation: mirrors no‑communication setups (e.g., Advantage Alignment) while keeping the split decision nontrivial.
 - Trust-and-Split RPS (TAS-RPS)
   - Single item type (coins)
   - Each round, a rock–paper–scissors hand draw creates a strong asymmetry: the winner’s per-coin value is 10, the loser’s is 1.
   - Each agent initially sees only their own hand and must communicate to coordinate an optimal split.
   - Motivation: enforce large value disparity so one’s own value reveals little about the other’s (avoiding ceiling effects) and incentivize meaningful communication.
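
The proportional-splitting rule mentioned in the README can be illustrated with a minimal sketch. This is not the repository's `compute_tas_style_rewards`; the function name `proportional_rewards`, its signature, and the single-item (coins) setup are assumptions made for illustration only.

```python
from typing import Dict


def proportional_rewards(
    values: Dict[str, float],
    proposals: Dict[str, int],
    total: int = 10,
) -> Dict[str, float]:
    """Hypothetical helper: reward = per-coin value * coins actually received.

    If the agents together ask for more coins than exist, every request is
    scaled down by the same factor so the allocations sum to ``total``.
    """
    requested = sum(proposals.values())
    # Scale only when the combined requests exceed the available coins.
    scale = 1.0 if requested <= total else total / requested
    return {
        agent: values[agent] * proposals[agent] * scale
        for agent in proposals
    }


if __name__ == "__main__":
    values = {"alice": 10.0, "bob": 1.0}  # e.g. alice won the RPS draw
    # Compatible proposals (7 + 3 = 10): no scaling is applied.
    print(proportional_rewards(values, {"alice": 7, "bob": 3}))
    # Over-subscribed proposals (8 + 6 = 14): both are scaled by 10/14.
    print(proportional_rewards(values, {"alice": 8, "bob": 6}))
```

Under this sketch, over-asking shrinks both allocations, which is why the agent with the low per-coin value gains little from requesting a large share.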

src_code_for_reproducibility/markov_games/negotiation/nego_hard_coded_policies.py CHANGED

@@ -1,11 +1,17 @@
 import asyncio
-from typing import Optional
 from mllm.markov_games.negotiation.nego_agent import NegotiationAgent
 from mllm.markov_games.negotiation.no_press_nego_agent import NoPressAgent
 from mllm.markov_games.negotiation.no_press_nego_simulation import NoPressObs
 from mllm.markov_games.rollout_tree import AgentActLog, ChatTurn
-from mllm.markov_games.negotiation.nego_simulation import Split
-from typing import Any, Tuple
 class HardCodedNegoWelfareMaximizingPolicy(NoPressAgent):
     async def act(self, observation: NoPressObs) -> Tuple[Any, AgentActLog]:
@@ -40,6 +46,7 @@ class HardCodedNegoWelfareMaximizingPolicy(NoPressAgent):
         )
         return action, act_log
 class HardCodedNegoGreedyPolicy(NoPressAgent):
     async def act(self, observation: NoPressObs) -> Tuple[Any, AgentActLog]:
         """
@@ -61,4 +68,3 @@ class HardCodedNegoGreedyPolicy(NoPressAgent):
             info=None,
         )
         return action, act_log

+"""
+File: mllm/markov_games/negotiation/nego_hard_coded_policies.py
+Summary: Provides deterministic negotiation policies for testing and baselines.
+"""
 import asyncio
+from typing import Any, Optional, Tuple
 from mllm.markov_games.negotiation.nego_agent import NegotiationAgent
+from mllm.markov_games.negotiation.nego_simulation import Split
 from mllm.markov_games.negotiation.no_press_nego_agent import NoPressAgent
 from mllm.markov_games.negotiation.no_press_nego_simulation import NoPressObs
 from mllm.markov_games.rollout_tree import AgentActLog, ChatTurn
 class HardCodedNegoWelfareMaximizingPolicy(NoPressAgent):
     async def act(self, observation: NoPressObs) -> Tuple[Any, AgentActLog]:
         )
         return action, act_log
 class HardCodedNegoGreedyPolicy(NoPressAgent):
     async def act(self, observation: NoPressObs) -> Tuple[Any, AgentActLog]:
         """
             info=None,
         )
         return action, act_log

src_code_for_reproducibility/markov_games/negotiation/negotiation_statistics.py CHANGED

@@ -1,3 +1,8 @@
 from __future__ import annotations
 from typing import Callable, Dict, List, Tuple

+"""
+File: mllm/markov_games/negotiation/negotiation_statistics.py
+Summary: Aggregates and reports statistics for negotiation experiments.
+"""
 from __future__ import annotations
 from typing import Callable, Dict, List, Tuple

src_code_for_reproducibility/markov_games/negotiation/no_press_nego_agent.py CHANGED

@@ -1,3 +1,8 @@
 from typing import Any, Dict, List, Tuple
 from mllm.markov_games.negotiation.nego_agent import (
@@ -49,9 +54,11 @@ class NoPressAgent(NegotiationAgent):
         self.send_split_prompt = "Submit Your Proposal\n" "Respond as {proposal_style}"
     def get_message_regex(self, observation: NoPressObs) -> str:
         return r"^$"  # No messages allowed
     def get_split_regex(self, observation: NoPressObs) -> str:
         items = list(observation.quantities.keys())
         # Accept both singular and plural forms
         item_pattern = "|".join(
@@ -61,6 +68,12 @@ class NoPressAgent(NegotiationAgent):
         return regex
     def get_split_action(self, policy_output: str, observation: NoPressObs) -> Split:
         items = list(observation.quantities.keys())
         import re as _re
@@ -78,6 +91,7 @@ class NoPressAgent(NegotiationAgent):
             inner_regex = rf"(?i)(10|[0-9])\s*({item_pattern})"
             def normalize_item_name(item_str):
                 for orig in items:
                     if item_str.lower() == orig.lower():
                         return orig

+"""
+File: mllm/markov_games/negotiation/no_press_nego_agent.py
+Summary: Agent variant for no-press negotiations without explicit messaging.
+"""
 from typing import Any, Dict, List, Tuple
 from mllm.markov_games.negotiation.nego_agent import (
         self.send_split_prompt = "Submit Your Proposal\n" "Respond as {proposal_style}"
     def get_message_regex(self, observation: NoPressObs) -> str:
+        """Return an empty pattern because the no-press variant forbids chat."""
         return r"^$"  # No messages allowed
     def get_split_regex(self, observation: NoPressObs) -> str:
+        """Match proposals like ``Proposal: 4 coins, 6 apples`` case-insensitively."""
         items = list(observation.quantities.keys())
         # Accept both singular and plural forms
         item_pattern = "|".join(
         return regex
     def get_split_action(self, policy_output: str, observation: NoPressObs) -> Split:
+        """
+        Parse the LLM proposal into a normalized ``Split`` structure.
+        The regex-based parser is lenient (accepts pluralization variants) so that
+        prompt tweaks do not require re-training the extraction logic.
+        """
         items = list(observation.quantities.keys())
         import re as _re
             inner_regex = rf"(?i)(10|[0-9])\s*({item_pattern})"
             def normalize_item_name(item_str):
+                """Canonicalize plural/singular user text back to the config item id."""
                 for orig in items:
                     if item_str.lower() == orig.lower():
                         return orig
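
The lenient proposal parsing described in the new docstrings can be sketched as follows. Only the inner pattern `(?i)(10|[0-9])\s*({item_pattern})` comes from the diff; the helpers `build_item_pattern` and `parse_proposal`, and the rule of treating a trailing `s` as the plural marker, are assumptions and may differ from the actual `NoPressAgent` code.

```python
import re
from typing import Dict, List


def _stem(item: str) -> str:
    """Drop a trailing 's' so 'coins' and 'coin' share one stem (assumed rule)."""
    return item[:-1] if item.endswith("s") else item


def build_item_pattern(items: List[str]) -> str:
    """Hypothetical helper: alternation accepting singular and plural names."""
    return "|".join(rf"{re.escape(_stem(item))}s?" for item in items)


def parse_proposal(text: str, items: List[str]) -> Dict[str, int]:
    """Hypothetical parser: map each configured item to the count requested."""
    item_pattern = build_item_pattern(items)
    # Same shape as the inner regex in the diff: a count in 0..10, then an item.
    inner_regex = rf"(?i)(10|[0-9])\s*({item_pattern})"
    kept = {item: 0 for item in items}
    for count, name in re.findall(inner_regex, text):
        for orig in items:
            # Canonicalize the matched text back to the configured item id.
            if _stem(name.lower()) == _stem(orig.lower()):
                kept[orig] = int(count)
    return kept


if __name__ == "__main__":
    print(parse_proposal("Proposal: 4 coins, 6 apples", ["coins", "apples"]))
```

With the docstring's own example, `Proposal: 4 coins, 6 apples`, this sketch returns `{"coins": 4, "apples": 6}`; a singular, capitalized form such as `1 Coin` maps back to `coins` as well.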

src_code_for_reproducibility/markov_games/negotiation/tas_rps_simulation.py CHANGED

@@ -1,19 +1,12 @@
 """
-Trust-and-Split simulation.
-This environment models a simple bargaining game over 10 coins with messaging.
-Agents are assigned rock/paper/scissors hands, with the winner getting value 10 per coin
-and the loser getting value 1 per coin. Agents alternate sending messages for a fixed
-number of turns per round and then each submits a split proposal indicating how many
-coins they keep for themselves. Rewards are proportional if the proposed totals exceed 10.
 """
 import copy
 from dataclasses import dataclass
 from typing import Any, Dict, List, Literal, Tuple
-from numpy.random import default_rng
 from mllm.markov_games.negotiation.nego_simulation import (
     Message,
     NegotiationObs,
@@ -46,6 +39,8 @@ def _get_rps_winner(
 @dataclass
 class TrustAndSplitRPSState(NegotiationState):
     hands: Dict[
         AgentId, Literal["rock", "paper", "scissors"]
     ]  # rock, paper, or scissors
@@ -54,6 +49,8 @@ class TrustAndSplitRPSState(NegotiationState):
 @dataclass
 class TrustAndSplitRPSObs(NegotiationObs):
     hand: Literal["rock", "paper", "scissors"]
     last_hand_agent: Literal["rock", "paper", "scissors"] | None
     last_hand_coagent: Literal["rock", "paper", "scissors"] | None
@@ -61,6 +58,8 @@ class TrustAndSplitRPSObs(NegotiationObs):
 class TrustAndSplitRPSSimulation(NegotiationSimulation):
     def __init__(
         self,
         alternating_hands: bool = False,
@@ -81,6 +80,13 @@ class TrustAndSplitRPSSimulation(NegotiationSimulation):
         self,
         alternate_hands: bool = False,
     ) -> Tuple[Dict[AgentId, str], Dict[AgentId, float]]:
         hands = ["rock", "paper", "scissors"]
         if alternate_hands:
             previous_hands = list(self.state.previous_hands.values())
@@ -115,6 +121,7 @@ class TrustAndSplitRPSSimulation(NegotiationSimulation):
             return agent_hands, values
     def set_new_round_of_variant(self):
         self.state.previous_hands = copy.deepcopy(self.state.hands)
         new_hands, new_values = self._sample_hands_and_values(
             alternate_hands=self.alternating_hands
@@ -128,6 +135,7 @@ class TrustAndSplitRPSSimulation(NegotiationSimulation):
     def get_info_of_variant(
         self, state: NegotiationState, actions: Dict[AgentId, Any]
     ) -> Dict[str, Any]:
         return {
             "quantities": copy.deepcopy(state.quantities),
             "hands": copy.deepcopy(state.hands),
@@ -138,12 +146,13 @@ class TrustAndSplitRPSSimulation(NegotiationSimulation):
         }
     def get_rewards(self, splits: Dict[AgentId, Split]) -> Dict[AgentId, float]:
         return compute_tas_style_rewards(
             self.agent_ids, self.state.values, splits, self.state.quantities
         )
     def get_obs_agent(self, agent_id):
-        """Returns observation for agent_id"""
         other_id = self._other(agent_id)
         last_value_coagent = (
             None

 """
+File: mllm/markov_games/negotiation/tas_rps_simulation.py
+Summary: Simulation for TAS Rock-Paper-Scissors blended scenarios.
 """
 import copy
 from dataclasses import dataclass
 from typing import Any, Dict, List, Literal, Tuple
 from mllm.markov_games.negotiation.nego_simulation import (
     Message,
     NegotiationObs,
 @dataclass
 class TrustAndSplitRPSState(NegotiationState):
+    """Negotiation state augmented with the current and previous RPS hands."""
     hands: Dict[
         AgentId, Literal["rock", "paper", "scissors"]
     ]  # rock, paper, or scissors
 @dataclass
 class TrustAndSplitRPSObs(NegotiationObs):
+    """Agent-facing observation enriched with last-hand metadata."""
     hand: Literal["rock", "paper", "scissors"]
     last_hand_agent: Literal["rock", "paper", "scissors"] | None
     last_hand_coagent: Literal["rock", "paper", "scissors"] | None
 class TrustAndSplitRPSSimulation(NegotiationSimulation):
+    """Negotiation variant that splices TAS splitting with RPS-determined stakes."""
     def __init__(
         self,
         alternating_hands: bool = False,
         self,
         alternate_hands: bool = False,
     ) -> Tuple[Dict[AgentId, str], Dict[AgentId, float]]:
+        """
+        Sample a rock-paper-scissors hand for each agent plus the per-hand value.
+        When ``alternate_hands`` is True we deliberately flip the previous round's
+        winner/loser roles to create nonstationary payoffs; otherwise we draw
+        uniformly without replacement.
+        """
         hands = ["rock", "paper", "scissors"]
         if alternate_hands:
             previous_hands = list(self.state.previous_hands.values())
             return agent_hands, values
     def set_new_round_of_variant(self):
+        """Refresh hands/values and reset round-specific state."""
         self.state.previous_hands = copy.deepcopy(self.state.hands)
         new_hands, new_values = self._sample_hands_and_values(
             alternate_hands=self.alternating_hands
     def get_info_of_variant(
         self, state: NegotiationState, actions: Dict[AgentId, Any]
     ) -> Dict[str, Any]:
+        """Expose variant-specific tensors for downstream logging/analysis."""
         return {
             "quantities": copy.deepcopy(state.quantities),
             "hands": copy.deepcopy(state.hands),
         }
     def get_rewards(self, splits: Dict[AgentId, Split]) -> Dict[AgentId, float]:
+        """Delegates to TAS reward helper because the payout rule is identical."""
         return compute_tas_style_rewards(
             self.agent_ids, self.state.values, splits, self.state.quantities
         )
     def get_obs_agent(self, agent_id):
+        """Return a full Trust-and-Split observation for ``agent_id``."""
         other_id = self._other(agent_id)
         last_value_coagent = (
             None
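
The hand-sampling behaviour described in the `_sample_hands_and_values` docstring can be sketched roughly as below. The body of the real method is not shown in this diff, so everything here is an assumption: the function name, the use of the standard-library `random` module, and in particular the choice to implement "flipping roles" by swapping the two agents' previous hands.

```python
import random
from typing import Dict, List, Optional, Tuple

# Each key beats its value.
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}


def sample_hands_and_values(
    agent_ids: List[str],
    previous_hands: Optional[Dict[str, str]] = None,
    rng: Optional[random.Random] = None,
) -> Tuple[Dict[str, str], Dict[str, float]]:
    """Hypothetical sketch: draw two distinct hands, then value them 10 vs 1."""
    rng = rng or random.Random()
    first_id, second_id = agent_ids
    if previous_hands is not None:
        # Assumed alternation rule: swap last round's hands to flip the roles.
        hands = {
            first_id: previous_hands[second_id],
            second_id: previous_hands[first_id],
        }
    else:
        # Sampling without replacement guarantees the hands differ (no ties).
        first_hand, second_hand = rng.sample(["rock", "paper", "scissors"], 2)
        hands = {first_id: first_hand, second_id: second_hand}
    winner = first_id if BEATS[hands[first_id]] == hands[second_id] else second_id
    values = {agent: (10.0 if agent == winner else 1.0) for agent in agent_ids}
    return hands, values


if __name__ == "__main__":
    hands, values = sample_hands_and_values(["alice", "bob"], rng=random.Random(0))
    print(hands, values)
    # Next round with alternation: the winner and loser trade places.
    print(sample_hands_and_values(["alice", "bob"], previous_hands=hands))
```

Because the two hands are always distinct, exactly one agent gets the per-coin value of 10 and the other gets 1, matching the asymmetry described in the README.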

src_code_for_reproducibility/models/__pycache__/__init__.cpython-312.pyc CHANGED

Binary files a/src_code_for_reproducibility/models/__pycache__/__init__.cpython-312.pyc and b/src_code_for_reproducibility/models/__pycache__/__init__.cpython-312.pyc differ