Add files using upload-large-folder tool
- .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/__init__.py +0 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/__pycache__/__init__.cpython-311.pyc +0 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/__pycache__/attention_net_supervised.cpython-311.pyc +0 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/__pycache__/parametric_actions_cartpole.cpython-311.pyc +0 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/__pycache__/parametric_actions_cartpole_embeddings_learnt_by_model.cpython-311.pyc +0 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/attention_net_supervised.py +77 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/__init__.py +0 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/__pycache__/__init__.cpython-311.pyc +0 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/__pycache__/action_mask_model.cpython-311.pyc +0 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/__pycache__/autoregressive_action_dist.cpython-311.pyc +0 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/__pycache__/autoregressive_action_model.cpython-311.pyc +0 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/__pycache__/centralized_critic_models.cpython-311.pyc +0 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/__pycache__/custom_loss_model.cpython-311.pyc +0 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/__pycache__/fast_model.cpython-311.pyc +0 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/__pycache__/mobilenet_v2_encoder.cpython-311.pyc +0 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/__pycache__/mobilenet_v2_with_lstm_models.cpython-311.pyc +0 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/__pycache__/parametric_actions_model.cpython-311.pyc +0 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/__pycache__/shared_weights_model.cpython-311.pyc +0 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/__pycache__/simple_rpg_model.cpython-311.pyc +0 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/action_mask_model.py +126 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/autoregressive_action_dist.py +149 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/autoregressive_action_model.py +162 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/centralized_critic_models.py +182 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/custom_loss_model.py +137 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/fast_model.py +80 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/mobilenet_v2_encoder.py +48 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/mobilenet_v2_with_lstm_models.py +160 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/neural_computer.py +247 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/parametric_actions_model.py +201 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/shared_weights_model.py +206 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/simple_rpg_model.py +65 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/parametric_actions_cartpole.py +121 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/parametric_actions_cartpole_embeddings_learnt_by_model.py +107 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/checkpoints/__pycache__/__init__.cpython-311.pyc +0 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/checkpoints/__pycache__/cartpole_dqn_export.cpython-311.pyc +0 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/checkpoints/__pycache__/change_config_during_training.cpython-311.pyc +0 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/checkpoints/__pycache__/checkpoint_by_custom_criteria.cpython-311.pyc +0 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/checkpoints/__pycache__/restore_1_of_n_agents_from_checkpoint.cpython-311.pyc +0 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/multi_agent/__init__.py +0 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/multi_agent/__pycache__/__init__.cpython-311.pyc +0 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/multi_agent/__pycache__/custom_heuristic_policy.cpython-311.pyc +0 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/multi_agent/__pycache__/different_spaces_for_agents.cpython-311.pyc +0 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/multi_agent/__pycache__/multi_agent_cartpole.cpython-311.pyc +0 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/multi_agent/__pycache__/multi_agent_pendulum.cpython-311.pyc +0 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/multi_agent/__pycache__/pettingzoo_independent_learning.cpython-311.pyc +0 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/multi_agent/__pycache__/pettingzoo_parameter_sharing.cpython-311.pyc +0 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/multi_agent/__pycache__/pettingzoo_shared_value_function.cpython-311.pyc +0 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/multi_agent/__pycache__/rock_paper_scissors_heuristic_vs_learned.cpython-311.pyc +0 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/multi_agent/__pycache__/rock_paper_scissors_learned_vs_learned.cpython-311.pyc +0 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/multi_agent/__pycache__/self_play_league_based_with_open_spiel.cpython-311.pyc +0 -0
.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/__init__.py
ADDED: file without changes

.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/__pycache__/__init__.cpython-311.pyc
ADDED: binary file (206 Bytes)

.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/__pycache__/attention_net_supervised.cpython-311.pyc
ADDED: binary file (4.78 kB)

.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/__pycache__/parametric_actions_cartpole.cpython-311.pyc
ADDED: binary file (4.59 kB)

.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/__pycache__/parametric_actions_cartpole_embeddings_learnt_by_model.cpython-311.pyc
ADDED: binary file (4.3 kB)
.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/attention_net_supervised.py
ADDED (77 lines)

# @OldAPIStack
from gymnasium.spaces import Box, Discrete
import numpy as np

from ray.rllib.models.tf.attention_net import TrXLNet
from ray.rllib.utils.framework import try_import_tf

tf1, tf, tfv = try_import_tf()


def bit_shift_generator(seq_length, shift, batch_size):
    while True:
        values = np.array([0.0, 1.0], dtype=np.float32)
        seq = np.random.choice(values, (batch_size, seq_length, 1))
        targets = np.squeeze(np.roll(seq, shift, axis=1).astype(np.int32))
        targets[:, :shift] = 0
        yield seq, targets


def train_loss(targets, outputs):
    loss = tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=targets, logits=outputs
    )
    return tf.reduce_mean(loss)


def train_bit_shift(seq_length, num_iterations, print_every_n):

    optimizer = tf.keras.optimizers.Adam(1e-3)

    model = TrXLNet(
        observation_space=Box(low=0, high=1, shape=(1,), dtype=np.int32),
        action_space=Discrete(2),
        num_outputs=2,
        model_config={"max_seq_len": seq_length},
        name="trxl",
        num_transformer_units=1,
        attention_dim=10,
        num_heads=5,
        head_dim=20,
        position_wise_mlp_dim=20,
    )

    shift = 10
    train_batch = 10
    test_batch = 100
    data_gen = bit_shift_generator(seq_length, shift=shift, batch_size=train_batch)
    test_gen = bit_shift_generator(seq_length, shift=shift, batch_size=test_batch)

    @tf.function
    def update_step(inputs, targets):
        model_out = model(
            {"obs": inputs},
            state=[tf.reshape(inputs, [-1, seq_length, 1])],
            seq_lens=np.full(shape=(train_batch,), fill_value=seq_length),
        )
        optimizer.minimize(
            lambda: train_loss(targets, model_out), lambda: model.trainable_variables
        )

    for i, (inputs, targets) in zip(range(num_iterations), data_gen):
        inputs_in = np.reshape(inputs, [-1, 1])
        targets_in = np.reshape(targets, [-1])
        update_step(tf.convert_to_tensor(inputs_in), tf.convert_to_tensor(targets_in))

        if i % print_every_n == 0:
            test_inputs, test_targets = next(test_gen)
            print(i, train_loss(test_targets, model(test_inputs)))


if __name__ == "__main__":
    tf.enable_eager_execution()
    train_bit_shift(
        seq_length=20,
        num_iterations=2000,
        print_every_n=200,
    )
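The generator in the file above builds its supervised targets by rolling each bit sequence forward by `shift` steps and zeroing the first `shift` positions (where no earlier bit exists to copy). A minimal, framework-free sketch of that target construction (the helper name `make_targets` is illustrative, not from the file):

```python
import numpy as np

def make_targets(seq, shift):
    # Roll each sequence forward by `shift` along the time axis, then zero
    # the first `shift` positions, where there is nothing to copy yet.
    targets = np.squeeze(np.roll(seq, shift, axis=1).astype(np.int32))
    targets[:, :shift] = 0
    return targets

# One batch of 2 sequences, length 5, one feature channel each.
seq = np.array(
    [[[1.0], [0.0], [1.0], [1.0], [0.0]],
     [[0.0], [1.0], [0.0], [0.0], [1.0]]],
    dtype=np.float32,
)
targets = make_targets(seq, shift=2)
print(targets)
# [[0 0 1 0 1]
#  [0 0 0 1 0]]
```

Note that `np.roll` wraps the last `shift` bits around to the front, which is exactly why the first `shift` target positions are overwritten with 0.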
.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/__init__.py
ADDED: file without changes

.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/__pycache__/__init__.cpython-311.pyc
ADDED: binary file (213 Bytes)

.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/__pycache__/action_mask_model.cpython-311.pyc
ADDED: binary file (5.32 kB)

.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/__pycache__/autoregressive_action_dist.cpython-311.pyc
ADDED: binary file (9.65 kB)

.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/__pycache__/autoregressive_action_model.cpython-311.pyc
ADDED: binary file (8.05 kB)

.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/__pycache__/centralized_critic_models.cpython-311.pyc
ADDED: binary file (10.5 kB)

.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/__pycache__/custom_loss_model.cpython-311.pyc
ADDED: binary file (8.29 kB)

.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/__pycache__/fast_model.cpython-311.pyc
ADDED: binary file (5.56 kB)

.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/__pycache__/mobilenet_v2_encoder.cpython-311.pyc
ADDED: binary file (3.04 kB)

.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/__pycache__/mobilenet_v2_with_lstm_models.cpython-311.pyc
ADDED: binary file (9.69 kB)

.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/__pycache__/parametric_actions_model.cpython-311.pyc
ADDED: binary file (8.57 kB)

.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/__pycache__/shared_weights_model.cpython-311.pyc
ADDED: binary file (11.1 kB)

.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/__pycache__/simple_rpg_model.cpython-311.pyc
ADDED: binary file (4.19 kB)
.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/action_mask_model.py
ADDED (126 lines)

# @OldAPIStack
from gymnasium.spaces import Dict

from ray.rllib.models.tf.fcnet import FullyConnectedNetwork
from ray.rllib.models.tf.tf_modelv2 import TFModelV2
from ray.rllib.models.torch.torch_modelv2 import TorchModelV2
from ray.rllib.models.torch.fcnet import FullyConnectedNetwork as TorchFC
from ray.rllib.utils.framework import try_import_tf, try_import_torch
from ray.rllib.utils.torch_utils import FLOAT_MIN

tf1, tf, tfv = try_import_tf()
torch, nn = try_import_torch()


class ActionMaskModel(TFModelV2):
    """Model that handles simple discrete action masking.

    This assumes the outputs are logits for a single Categorical action dist.
    Getting this to work with a more complex output (e.g., if the action space
    is a tuple of several distributions) is also possible but left as an
    exercise to the reader.
    """

    def __init__(
        self, obs_space, action_space, num_outputs, model_config, name, **kwargs
    ):
        orig_space = getattr(obs_space, "original_space", obs_space)
        assert (
            isinstance(orig_space, Dict)
            and "action_mask" in orig_space.spaces
            and "observations" in orig_space.spaces
        )

        super().__init__(obs_space, action_space, num_outputs, model_config, name)

        self.internal_model = FullyConnectedNetwork(
            orig_space["observations"],
            action_space,
            num_outputs,
            model_config,
            name + "_internal",
        )

        # disable action masking --> will likely lead to invalid actions
        self.no_masking = model_config["custom_model_config"].get("no_masking", False)

    def forward(self, input_dict, state, seq_lens):
        # Extract the available actions tensor from the observation.
        action_mask = input_dict["obs"]["action_mask"]

        # Compute the unmasked logits.
        logits, _ = self.internal_model({"obs": input_dict["obs"]["observations"]})

        # If action masking is disabled, directly return unmasked logits.
        if self.no_masking:
            return logits, state

        # Convert action_mask into a [0.0 || -inf]-type mask.
        inf_mask = tf.maximum(tf.math.log(action_mask), tf.float32.min)
        masked_logits = logits + inf_mask

        # Return masked logits.
        return masked_logits, state

    def value_function(self):
        return self.internal_model.value_function()


class TorchActionMaskModel(TorchModelV2, nn.Module):
    """PyTorch version of above ActionMaskingModel."""

    def __init__(
        self,
        obs_space,
        action_space,
        num_outputs,
        model_config,
        name,
        **kwargs,
    ):
        orig_space = getattr(obs_space, "original_space", obs_space)
        assert (
            isinstance(orig_space, Dict)
            and "action_mask" in orig_space.spaces
            and "observations" in orig_space.spaces
        )

        TorchModelV2.__init__(
            self, obs_space, action_space, num_outputs, model_config, name, **kwargs
        )
        nn.Module.__init__(self)

        self.internal_model = TorchFC(
            orig_space["observations"],
            action_space,
            num_outputs,
            model_config,
            name + "_internal",
        )

        # disable action masking --> will likely lead to invalid actions
        self.no_masking = False
        if "no_masking" in model_config["custom_model_config"]:
            self.no_masking = model_config["custom_model_config"]["no_masking"]

    def forward(self, input_dict, state, seq_lens):
        # Extract the available actions tensor from the observation.
        action_mask = input_dict["obs"]["action_mask"]

        # Compute the unmasked logits.
        logits, _ = self.internal_model({"obs": input_dict["obs"]["observations"]})

        # If action masking is disabled, directly return unmasked logits.
        if self.no_masking:
            return logits, state

        # Convert action_mask into a [0.0 || -inf]-type mask.
        inf_mask = torch.clamp(torch.log(action_mask), min=FLOAT_MIN)
        masked_logits = logits + inf_mask

        # Return masked logits.
        return masked_logits, state

    def value_function(self):
        return self.internal_model.value_function()
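Both model variants in the file above rely on the same trick: `log(mask)` maps allowed actions (mask = 1) to 0 and forbidden ones (mask = 0) to -inf, which is clamped to the dtype's minimum and added to the logits, so softmax assigns the forbidden actions (effectively) zero probability. A framework-free NumPy sketch of that step (the helper name `mask_logits` is illustrative):

```python
import numpy as np

FLOAT_MIN = np.finfo(np.float32).min

def mask_logits(logits, action_mask):
    # log(1) = 0 leaves allowed logits untouched; log(0) = -inf is clamped
    # to FLOAT_MIN to avoid NaNs from inf arithmetic downstream.
    with np.errstate(divide="ignore"):
        inf_mask = np.maximum(np.log(action_mask), FLOAT_MIN)
    return logits + inf_mask

logits = np.array([2.0, 1.0, 0.5], dtype=np.float32)
mask = np.array([1.0, 0.0, 1.0], dtype=np.float32)  # action 1 is forbidden
masked = mask_logits(logits, mask)

# Softmax over the masked logits: the forbidden action's probability vanishes.
probs = np.exp(masked - masked.max())
probs /= probs.sum()
```

Clamping to `FLOAT_MIN` rather than using -inf directly is what keeps gradients and log-prob computations finite in the RL loss.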
.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/autoregressive_action_dist.py
ADDED (149 lines)

# @OldAPIStack
from ray.rllib.models.tf.tf_action_dist import Categorical, ActionDistribution
from ray.rllib.models.torch.torch_action_dist import (
    TorchCategorical,
    TorchDistributionWrapper,
)
from ray.rllib.utils.framework import try_import_tf, try_import_torch

tf1, tf, tfv = try_import_tf()
torch, nn = try_import_torch()


class BinaryAutoregressiveDistribution(ActionDistribution):
    """Action distribution P(a1, a2) = P(a1) * P(a2 | a1)"""

    def deterministic_sample(self):
        # First, sample a1.
        a1_dist = self._a1_distribution()
        a1 = a1_dist.deterministic_sample()

        # Sample a2 conditioned on a1.
        a2_dist = self._a2_distribution(a1)
        a2 = a2_dist.deterministic_sample()
        self._action_logp = a1_dist.logp(a1) + a2_dist.logp(a2)

        # Return the action tuple.
        return (a1, a2)

    def sample(self):
        # First, sample a1.
        a1_dist = self._a1_distribution()
        a1 = a1_dist.sample()

        # Sample a2 conditioned on a1.
        a2_dist = self._a2_distribution(a1)
        a2 = a2_dist.sample()
        self._action_logp = a1_dist.logp(a1) + a2_dist.logp(a2)

        # Return the action tuple.
        return (a1, a2)

    def logp(self, actions):
        a1, a2 = actions[:, 0], actions[:, 1]
        a1_vec = tf.expand_dims(tf.cast(a1, tf.float32), 1)
        a1_logits, a2_logits = self.model.action_model([self.inputs, a1_vec])
        return Categorical(a1_logits).logp(a1) + Categorical(a2_logits).logp(a2)

    def sampled_action_logp(self):
        return self._action_logp

    def entropy(self):
        a1_dist = self._a1_distribution()
        a2_dist = self._a2_distribution(a1_dist.sample())
        return a1_dist.entropy() + a2_dist.entropy()

    def kl(self, other):
        a1_dist = self._a1_distribution()
        a1_terms = a1_dist.kl(other._a1_distribution())

        a1 = a1_dist.sample()
        a2_terms = self._a2_distribution(a1).kl(other._a2_distribution(a1))
        return a1_terms + a2_terms

    def _a1_distribution(self):
        BATCH = tf.shape(self.inputs)[0]
        a1_logits, _ = self.model.action_model([self.inputs, tf.zeros((BATCH, 1))])
        a1_dist = Categorical(a1_logits)
        return a1_dist

    def _a2_distribution(self, a1):
        a1_vec = tf.expand_dims(tf.cast(a1, tf.float32), 1)
        _, a2_logits = self.model.action_model([self.inputs, a1_vec])
        a2_dist = Categorical(a2_logits)
        return a2_dist

    @staticmethod
    def required_model_output_shape(action_space, model_config):
        return 16  # controls model output feature vector size


class TorchBinaryAutoregressiveDistribution(TorchDistributionWrapper):
    """Action distribution P(a1, a2) = P(a1) * P(a2 | a1)"""

    def deterministic_sample(self):
        # First, sample a1.
        a1_dist = self._a1_distribution()
        a1 = a1_dist.deterministic_sample()

        # Sample a2 conditioned on a1.
        a2_dist = self._a2_distribution(a1)
        a2 = a2_dist.deterministic_sample()
        self._action_logp = a1_dist.logp(a1) + a2_dist.logp(a2)

        # Return the action tuple.
        return (a1, a2)

    def sample(self):
        # First, sample a1.
        a1_dist = self._a1_distribution()
        a1 = a1_dist.sample()

        # Sample a2 conditioned on a1.
        a2_dist = self._a2_distribution(a1)
        a2 = a2_dist.sample()
        self._action_logp = a1_dist.logp(a1) + a2_dist.logp(a2)

        # Return the action tuple.
        return (a1, a2)

    def logp(self, actions):
        a1, a2 = actions[:, 0], actions[:, 1]
        a1_vec = torch.unsqueeze(a1.float(), 1)
        a1_logits, a2_logits = self.model.action_module(self.inputs, a1_vec)
        return TorchCategorical(a1_logits).logp(a1) + TorchCategorical(a2_logits).logp(
            a2
        )

    def sampled_action_logp(self):
        return self._action_logp

    def entropy(self):
        a1_dist = self._a1_distribution()
        a2_dist = self._a2_distribution(a1_dist.sample())
        return a1_dist.entropy() + a2_dist.entropy()

    def kl(self, other):
        a1_dist = self._a1_distribution()
        a1_terms = a1_dist.kl(other._a1_distribution())

        a1 = a1_dist.sample()
        a2_terms = self._a2_distribution(a1).kl(other._a2_distribution(a1))
        return a1_terms + a2_terms

    def _a1_distribution(self):
        BATCH = self.inputs.shape[0]
        zeros = torch.zeros((BATCH, 1)).to(self.inputs.device)
        a1_logits, _ = self.model.action_module(self.inputs, zeros)
        a1_dist = TorchCategorical(a1_logits)
        return a1_dist

    def _a2_distribution(self, a1):
        a1_vec = torch.unsqueeze(a1.float(), 1)
        _, a2_logits = self.model.action_module(self.inputs, a1_vec)
        a2_dist = TorchCategorical(a2_logits)
        return a2_dist

    @staticmethod
    def required_model_output_shape(action_space, model_config):
        return 16  # controls model output feature vector size
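The distribution in the file above factorizes the joint as P(a1, a2) = P(a1) * P(a2 | a1), so the joint log-prob is simply the sum of the two conditional log-probs (as in its `logp` and `sampled_action_logp`). A minimal NumPy sketch of that chain-rule factorization with hard-coded toy tables (numbers and names are illustrative, not from RLlib):

```python
import numpy as np

# Toy distributions over two binary actions: P(a1) and P(a2 | a1).
p_a1 = np.array([0.75, 0.25])
p_a2_given_a1 = np.array([[0.6, 0.4],   # conditioned on a1 = 0
                          [0.1, 0.9]])  # conditioned on a1 = 1

def joint_logp(a1, a2):
    # log P(a1, a2) = log P(a1) + log P(a2 | a1)
    return np.log(p_a1[a1]) + np.log(p_a2_given_a1[a1, a2])

# Exponentiating and summing over all (a1, a2) pairs must give 1,
# i.e. the factorization defines a valid joint distribution.
total = sum(np.exp(joint_logp(a1, a2)) for a1 in (0, 1) for a2 in (0, 1))
```

This is why the example can reuse two plain `Categorical` heads: sampling a1 first, then feeding it into the a2 head, implements the joint without ever materializing a 4-way distribution.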
.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/autoregressive_action_model.py
ADDED
|
@@ -0,0 +1,162 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# @OldAPIStack
|
| 2 |
+
from gymnasium.spaces import Discrete, Tuple
|
| 3 |
+
|
| 4 |
+
from ray.rllib.models.tf.misc import normc_initializer
|
| 5 |
+
from ray.rllib.models.tf.tf_modelv2 import TFModelV2
|
| 6 |
+
from ray.rllib.models.torch.misc import normc_initializer as normc_init_torch
|
| 7 |
+
from ray.rllib.models.torch.misc import SlimFC
|
| 8 |
+
from ray.rllib.models.torch.torch_modelv2 import TorchModelV2
|
| 9 |
+
from ray.rllib.utils.framework import try_import_tf, try_import_torch
|
| 10 |
+
|
| 11 |
+
tf1, tf, tfv = try_import_tf()
|
| 12 |
+
torch, nn = try_import_torch()
|
| 13 |
+
|
| 14 |
+
|
| 15 |
+
class AutoregressiveActionModel(TFModelV2):
|
| 16 |
+
"""Implements the `.action_model` branch required above."""
|
| 17 |
+
|
| 18 |
+
def __init__(self, obs_space, action_space, num_outputs, model_config, name):
|
| 19 |
+
super(AutoregressiveActionModel, self).__init__(
|
| 20 |
+
obs_space, action_space, num_outputs, model_config, name
|
| 21 |
+
)
|
| 22 |
+
if action_space != Tuple([Discrete(2), Discrete(2)]):
|
| 23 |
+
raise ValueError("This model only supports the [2, 2] action space")
|
| 24 |
+
|
| 25 |
+
# Inputs
|
| 26 |
+
obs_input = tf.keras.layers.Input(shape=obs_space.shape, name="obs_input")
|
| 27 |
+
a1_input = tf.keras.layers.Input(shape=(1,), name="a1_input")
|
| 28 |
+
ctx_input = tf.keras.layers.Input(shape=(num_outputs,), name="ctx_input")
|
| 29 |
+
|
| 30 |
+
# Output of the model (normally 'logits', but for an autoregressive
|
| 31 |
+
# dist this is more like a context/feature layer encoding the obs)
|
| 32 |
+
context = tf.keras.layers.Dense(
|
| 33 |
+
num_outputs,
|
| 34 |
+
name="hidden",
|
| 35 |
+
activation=tf.nn.tanh,
|
| 36 |
+
kernel_initializer=normc_initializer(1.0),
|
| 37 |
+
)(obs_input)
|
| 38 |
+
|
| 39 |
+
# V(s)
|
| 40 |
+
value_out = tf.keras.layers.Dense(
|
| 41 |
+
1,
|
| 42 |
+
name="value_out",
|
| 43 |
+
activation=None,
|
| 44 |
+
kernel_initializer=normc_initializer(0.01),
|
| 45 |
+
)(context)
|
| 46 |
+
|
| 47 |
+
# P(a1 | obs)
|
| 48 |
+
a1_logits = tf.keras.layers.Dense(
|
| 49 |
+
2,
|
| 50 |
+
name="a1_logits",
|
| 51 |
+
activation=None,
|
| 52 |
+
kernel_initializer=normc_initializer(0.01),
|
| 53 |
+
)(ctx_input)
|
| 54 |
+
|
| 55 |
+
# P(a2 | a1)
|
| 56 |
+
# --note: typically you'd want to implement P(a2 | a1, obs) as follows:
|
| 57 |
+
# a2_context = tf.keras.layers.Concatenate(axis=1)(
|
| 58 |
+
# [ctx_input, a1_input])
|
| 59 |
+
a2_context = a1_input
|
| 60 |
+
a2_hidden = tf.keras.layers.Dense(
|
| 61 |
+
16,
|
| 62 |
+
name="a2_hidden",
|
| 63 |
+
activation=tf.nn.tanh,
|
| 64 |
+
kernel_initializer=normc_initializer(1.0),
|
| 65 |
+
)(a2_context)
|
| 66 |
+
a2_logits = tf.keras.layers.Dense(
|
| 67 |
+
2,
|
| 68 |
+
name="a2_logits",
|
| 69 |
+
activation=None,
|
| 70 |
+
kernel_initializer=normc_initializer(0.01),
|
| 71 |
+
)(a2_hidden)
|
| 72 |
+
|
| 73 |
+
# Base layers
|
| 74 |
+
self.base_model = tf.keras.Model(obs_input, [context, value_out])
|
| 75 |
+
self.base_model.summary()
|
| 76 |
+
|
| 77 |
+
# Autoregressive action sampler
|
| 78 |
+
self.action_model = tf.keras.Model(
|
| 79 |
+
[ctx_input, a1_input], [a1_logits, a2_logits]
|
| 80 |
+
)
|
| 81 |
+
self.action_model.summary()
|
| 82 |
+
|
| 83 |
+
def forward(self, input_dict, state, seq_lens):
|
| 84 |
+
context, self._value_out = self.base_model(input_dict["obs"])
|
| 85 |
+
        return context, state

    def value_function(self):
        return tf.reshape(self._value_out, [-1])


class TorchAutoregressiveActionModel(TorchModelV2, nn.Module):
    """PyTorch version of the AutoregressiveActionModel above."""

    def __init__(self, obs_space, action_space, num_outputs, model_config, name):
        TorchModelV2.__init__(
            self, obs_space, action_space, num_outputs, model_config, name
        )
        nn.Module.__init__(self)

        if action_space != Tuple([Discrete(2), Discrete(2)]):
            raise ValueError("This model only supports the [2, 2] action space")

        # Output of the model (normally 'logits', but for an autoregressive
        # dist this is more like a context/feature layer encoding the obs)
        self.context_layer = SlimFC(
            in_size=obs_space.shape[0],
            out_size=num_outputs,
            initializer=normc_init_torch(1.0),
            activation_fn=nn.Tanh,
        )

        # V(s)
        self.value_branch = SlimFC(
            in_size=num_outputs,
            out_size=1,
            initializer=normc_init_torch(0.01),
            activation_fn=None,
        )

        # P(a1 | obs)
        self.a1_logits = SlimFC(
            in_size=num_outputs,
            out_size=2,
            activation_fn=None,
            initializer=normc_init_torch(0.01),
        )

        class _ActionModel(nn.Module):
            def __init__(self):
                nn.Module.__init__(self)
                self.a2_hidden = SlimFC(
                    in_size=1,
                    out_size=16,
                    activation_fn=nn.Tanh,
                    initializer=normc_init_torch(1.0),
                )
                self.a2_logits = SlimFC(
                    in_size=16,
                    out_size=2,
                    activation_fn=None,
                    initializer=normc_init_torch(0.01),
                )

            def forward(self_, ctx_input, a1_input):
                a1_logits = self.a1_logits(ctx_input)
                a2_logits = self_.a2_logits(self_.a2_hidden(a1_input))
                return a1_logits, a2_logits

        # P(a2 | a1)
        # --note: typically you'd want to implement P(a2 | a1, obs) as follows:
        # a2_context = tf.keras.layers.Concatenate(axis=1)(
        #     [ctx_input, a1_input])
        self.action_module = _ActionModel()

        self._context = None

    def forward(self, input_dict, state, seq_lens):
        self._context = self.context_layer(input_dict["obs"])
        return self._context, state

    def value_function(self):
        return torch.reshape(self.value_branch(self._context), [-1])
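The autoregressive flow these models feed (first sample a1 from the obs context, then sample a2 conditioned on a1) can be sketched framework-free. This is a toy illustration only: `argmax_index`, `a1_logits`, and `a2_logits` below are hypothetical stand-ins for the distribution/head objects, not RLlib APIs, and `a2` here depends only on `a1`, matching the simplified example model.

```python
def argmax_index(logits):
    # Deterministic stand-in for sampling from a categorical distribution.
    return max(range(len(logits)), key=lambda i: logits[i])

def a1_logits(context):
    # Hypothetical linear head for P(a1 | obs-context).
    return [sum(context), -sum(context)]

def a2_logits(a1):
    # Hypothetical head for P(a2 | a1): conditioned only on a1, as in the
    # simplified example model above (not on the obs context).
    return [1.0, -1.0] if a1 == 0 else [-1.0, 1.0]

def sample_actions(context):
    # Autoregressive order: a1 first, then a2 given a1.
    a1 = argmax_index(a1_logits(context))
    a2 = argmax_index(a2_logits(a1))
    return a1, a2

print(sample_actions([0.5, 0.25]))  # -> (0, 0)
```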
.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/centralized_critic_models.py
ADDED
@@ -0,0 +1,182 @@
# @OldAPIStack
from gymnasium.spaces import Box

from ray.rllib.models.modelv2 import ModelV2
from ray.rllib.models.tf.tf_modelv2 import TFModelV2
from ray.rllib.models.tf.fcnet import FullyConnectedNetwork
from ray.rllib.models.torch.misc import SlimFC
from ray.rllib.models.torch.torch_modelv2 import TorchModelV2
from ray.rllib.models.torch.fcnet import FullyConnectedNetwork as TorchFC
from ray.rllib.utils.annotations import override
from ray.rllib.utils.framework import try_import_tf, try_import_torch

tf1, tf, tfv = try_import_tf()
torch, nn = try_import_torch()


class CentralizedCriticModel(TFModelV2):
    """Multi-agent model that implements a centralized value function."""

    def __init__(self, obs_space, action_space, num_outputs, model_config, name):
        super(CentralizedCriticModel, self).__init__(
            obs_space, action_space, num_outputs, model_config, name
        )
        # Base of the model
        self.model = FullyConnectedNetwork(
            obs_space, action_space, num_outputs, model_config, name
        )

        # Central VF maps (obs, opp_obs, opp_act) -> vf_pred
        obs = tf.keras.layers.Input(shape=(6,), name="obs")
        opp_obs = tf.keras.layers.Input(shape=(6,), name="opp_obs")
        opp_act = tf.keras.layers.Input(shape=(2,), name="opp_act")
        concat_obs = tf.keras.layers.Concatenate(axis=1)([obs, opp_obs, opp_act])
        central_vf_dense = tf.keras.layers.Dense(
            16, activation=tf.nn.tanh, name="c_vf_dense"
        )(concat_obs)
        central_vf_out = tf.keras.layers.Dense(1, activation=None, name="c_vf_out")(
            central_vf_dense
        )
        self.central_vf = tf.keras.Model(
            inputs=[obs, opp_obs, opp_act], outputs=central_vf_out
        )

    @override(ModelV2)
    def forward(self, input_dict, state, seq_lens):
        return self.model.forward(input_dict, state, seq_lens)

    def central_value_function(self, obs, opponent_obs, opponent_actions):
        return tf.reshape(
            self.central_vf(
                [obs, opponent_obs, tf.one_hot(tf.cast(opponent_actions, tf.int32), 2)]
            ),
            [-1],
        )

    @override(ModelV2)
    def value_function(self):
        return self.model.value_function()  # not used


class YetAnotherCentralizedCriticModel(TFModelV2):
    """Multi-agent model that implements a centralized value function.

    It assumes the observation is a dict with 'own_obs' and 'opponent_obs', the
    former of which can be used for computing actions (i.e., decentralized
    execution), and the latter for optimization (i.e., centralized learning).

    This model has two parts:
    - An action model that looks at just 'own_obs' to compute actions
    - A value model that also looks at the 'opponent_obs' / 'opponent_action'
      to compute the value (it does this by using the 'obs_flat' tensor).
    """

    def __init__(self, obs_space, action_space, num_outputs, model_config, name):
        super(YetAnotherCentralizedCriticModel, self).__init__(
            obs_space, action_space, num_outputs, model_config, name
        )

        self.action_model = FullyConnectedNetwork(
            Box(low=0, high=1, shape=(6,)),  # one-hot encoded Discrete(6)
            action_space,
            num_outputs,
            model_config,
            name + "_action",
        )

        self.value_model = FullyConnectedNetwork(
            obs_space, action_space, 1, model_config, name + "_vf"
        )

    def forward(self, input_dict, state, seq_lens):
        self._value_out, _ = self.value_model(
            {"obs": input_dict["obs_flat"]}, state, seq_lens
        )
        return self.action_model({"obs": input_dict["obs"]["own_obs"]}, state, seq_lens)

    def value_function(self):
        return tf.reshape(self._value_out, [-1])


class TorchCentralizedCriticModel(TorchModelV2, nn.Module):
    """Multi-agent model that implements a centralized VF."""

    def __init__(self, obs_space, action_space, num_outputs, model_config, name):
        TorchModelV2.__init__(
            self, obs_space, action_space, num_outputs, model_config, name
        )
        nn.Module.__init__(self)

        # Base of the model
        self.model = TorchFC(obs_space, action_space, num_outputs, model_config, name)

        # Central VF maps (obs, opp_obs, opp_act) -> vf_pred
        input_size = 6 + 6 + 2  # obs + opp_obs + opp_act
        self.central_vf = nn.Sequential(
            SlimFC(input_size, 16, activation_fn=nn.Tanh),
            SlimFC(16, 1),
        )

    @override(ModelV2)
    def forward(self, input_dict, state, seq_lens):
        model_out, _ = self.model(input_dict, state, seq_lens)
        return model_out, []

    def central_value_function(self, obs, opponent_obs, opponent_actions):
        input_ = torch.cat(
            [
                obs,
                opponent_obs,
                torch.nn.functional.one_hot(opponent_actions.long(), 2).float(),
            ],
            1,
        )
        return torch.reshape(self.central_vf(input_), [-1])

    @override(ModelV2)
    def value_function(self):
        return self.model.value_function()  # not used


class YetAnotherTorchCentralizedCriticModel(TorchModelV2, nn.Module):
    """Multi-agent model that implements a centralized value function.

    It assumes the observation is a dict with 'own_obs' and 'opponent_obs', the
    former of which can be used for computing actions (i.e., decentralized
    execution), and the latter for optimization (i.e., centralized learning).

    This model has two parts:
    - An action model that looks at just 'own_obs' to compute actions
    - A value model that also looks at the 'opponent_obs' / 'opponent_action'
      to compute the value (it does this by using the 'obs_flat' tensor).
    """

    def __init__(self, obs_space, action_space, num_outputs, model_config, name):
        TorchModelV2.__init__(
            self, obs_space, action_space, num_outputs, model_config, name
        )
        nn.Module.__init__(self)

        self.action_model = TorchFC(
            Box(low=0, high=1, shape=(6,)),  # one-hot encoded Discrete(6)
            action_space,
            num_outputs,
            model_config,
            name + "_action",
        )

        self.value_model = TorchFC(
            obs_space, action_space, 1, model_config, name + "_vf"
        )
        self._model_in = None

    def forward(self, input_dict, state, seq_lens):
        # Store model-input for possible `value_function()` call.
        self._model_in = [input_dict["obs_flat"], state, seq_lens]
        return self.action_model({"obs": input_dict["obs"]["own_obs"]}, state, seq_lens)

    def value_function(self):
        value_out, _ = self.value_model(
            {"obs": self._model_in[0]}, self._model_in[1], self._model_in[2]
        )
        return torch.reshape(value_out, [-1])
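Both the TF and torch centralized critics above feed the value net a flat concatenation of own observation (6 dims), opponent observation (6 dims), and the opponent's action one-hot encoded over 2 choices, i.e. a 14-dim input. A minimal plain-Python sketch of that input layout (`central_vf_input` and `one_hot` are illustrative helpers, not RLlib functions):

```python
def one_hot(action, n):
    # One-hot encode a discrete action index over n choices.
    v = [0.0] * n
    v[action] = 1.0
    return v

def central_vf_input(obs, opp_obs, opp_action):
    # (obs, opp_obs, opp_act) -> flat critic input: 6 + 6 + 2 = 14 dims,
    # mirroring the Concatenate / torch.cat calls in the models above.
    return list(obs) + list(opp_obs) + one_hot(opp_action, 2)

x = central_vf_input([0.0] * 6, [1.0] * 6, 1)
print(len(x))  # -> 14
```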
.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/custom_loss_model.py
ADDED
@@ -0,0 +1,137 @@
import numpy as np

from ray.rllib.models.modelv2 import ModelV2, restore_original_dimensions
from ray.rllib.models.tf.tf_action_dist import Categorical
from ray.rllib.models.tf.tf_modelv2 import TFModelV2
from ray.rllib.models.tf.fcnet import FullyConnectedNetwork
from ray.rllib.models.torch.torch_action_dist import TorchCategorical
from ray.rllib.models.torch.torch_modelv2 import TorchModelV2
from ray.rllib.models.torch.fcnet import FullyConnectedNetwork as TorchFC
from ray.rllib.utils.annotations import override
from ray.rllib.utils.framework import try_import_tf, try_import_torch
from ray.rllib.offline import JsonReader

tf1, tf, tfv = try_import_tf()
torch, nn = try_import_torch()


class CustomLossModel(TFModelV2):
    """Custom model that adds an imitation loss on top of the policy loss."""

    def __init__(self, obs_space, action_space, num_outputs, model_config, name):
        super().__init__(obs_space, action_space, num_outputs, model_config, name)

        self.fcnet = FullyConnectedNetwork(
            self.obs_space, self.action_space, num_outputs, model_config, name="fcnet"
        )

    @override(ModelV2)
    def forward(self, input_dict, state, seq_lens):
        # Delegate to our FCNet.
        return self.fcnet(input_dict, state, seq_lens)

    @override(ModelV2)
    def value_function(self):
        # Delegate to our FCNet.
        return self.fcnet.value_function()

    @override(ModelV2)
    def custom_loss(self, policy_loss, loss_inputs):
        # Create a new input reader per worker.
        reader = JsonReader(self.model_config["custom_model_config"]["input_files"])
        input_ops = reader.tf_input_ops()

        # Define a secondary loss by building a graph copy with weight sharing.
        obs = restore_original_dimensions(
            tf.cast(input_ops["obs"], tf.float32), self.obs_space
        )
        logits, _ = self.forward({"obs": obs}, [], None)

        # Compute the IL loss.
        action_dist = Categorical(logits, self.model_config)
        self.policy_loss = policy_loss
        self.imitation_loss = tf.reduce_mean(-action_dist.logp(input_ops["actions"]))
        return policy_loss + 10 * self.imitation_loss

    def metrics(self):
        return {
            "policy_loss": self.policy_loss,
            "imitation_loss": self.imitation_loss,
        }


class TorchCustomLossModel(TorchModelV2, nn.Module):
    """PyTorch version of the CustomLossModel above."""

    def __init__(
        self, obs_space, action_space, num_outputs, model_config, name, input_files
    ):
        super().__init__(obs_space, action_space, num_outputs, model_config, name)
        nn.Module.__init__(self)

        self.input_files = input_files
        # Create a new input reader per worker.
        self.reader = JsonReader(self.input_files)
        self.fcnet = TorchFC(
            self.obs_space, self.action_space, num_outputs, model_config, name="fcnet"
        )

    @override(ModelV2)
    def forward(self, input_dict, state, seq_lens):
        # Delegate to our FCNet.
        return self.fcnet(input_dict, state, seq_lens)

    @override(ModelV2)
    def value_function(self):
        # Delegate to our FCNet.
        return self.fcnet.value_function()

    @override(ModelV2)
    def custom_loss(self, policy_loss, loss_inputs):
        """Calculates a custom loss on top of the given policy_loss(es).

        Args:
            policy_loss (List[TensorType]): The list of already calculated
                policy losses (as many as there are optimizers).
            loss_inputs: Struct of np.ndarrays holding the
                entire train batch.

        Returns:
            List[TensorType]: The altered list of policy losses. In case the
                custom loss should have its own optimizer, make sure the
                returned list is one larger than the incoming policy_loss list.
                In case you simply want to mix in the custom loss into the
                already calculated policy losses, return a list of altered
                policy losses (as done in this example below).
        """
        # Get the next batch from our input files.
        batch = self.reader.next()

        # Define a secondary loss by building a graph copy with weight sharing.
        obs = restore_original_dimensions(
            torch.from_numpy(batch["obs"]).float().to(policy_loss[0].device),
            self.obs_space,
            tensorlib="torch",
        )
        logits, _ = self.forward({"obs": obs}, [], None)

        # Compute the IL loss.
        action_dist = TorchCategorical(logits, self.model_config)
        imitation_loss = torch.mean(
            -action_dist.logp(
                torch.from_numpy(batch["actions"]).to(policy_loss[0].device)
            )
        )
        self.imitation_loss_metric = imitation_loss.item()
        self.policy_loss_metric = np.mean([loss.item() for loss in policy_loss])

        # Add the imitation loss to each already calculated policy loss term.
        # Alternatively (if custom loss has its own optimizer):
        # return policy_loss + [10 * self.imitation_loss]
        return [loss_ + 10 * imitation_loss for loss_ in policy_loss]

    def metrics(self):
        return {
            "policy_loss": self.policy_loss_metric,
            "imitation_loss": self.imitation_loss_metric,
        }
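The key arithmetic in both `custom_loss()` implementations above is mixing a weighted imitation-loss term into each already-computed policy-loss term. A framework-free sketch of just that mixing step (`combine_losses` is an illustrative helper, not an RLlib function; the weight of 10 matches the hard-coded factor above):

```python
def combine_losses(policy_losses, imitation_loss, weight=10.0):
    # Mix the imitation loss into every policy-loss term, as the
    # custom_loss() methods above do with their hard-coded weight of 10.
    return [pl + weight * imitation_loss for pl in policy_losses]

print(combine_losses([1.0, 2.0], 0.5))  # -> [6.0, 7.0]
```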
.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/fast_model.py
ADDED
@@ -0,0 +1,80 @@
# @OldAPIStack
from ray.rllib.models.modelv2 import ModelV2
from ray.rllib.models.tf.tf_modelv2 import TFModelV2
from ray.rllib.models.torch.misc import SlimFC
from ray.rllib.models.torch.torch_modelv2 import TorchModelV2
from ray.rllib.utils.annotations import override
from ray.rllib.utils.framework import try_import_tf, try_import_torch

tf1, tf, tfv = try_import_tf()
torch, nn = try_import_torch()


class FastModel(TFModelV2):
    """An example for a non-Keras ModelV2 in tf that learns a single weight.

    Defines all network architecture in `forward` (not `__init__` as it's
    usually done for Keras-style TFModelV2s).
    """

    def __init__(self, obs_space, action_space, num_outputs, model_config, name):
        super().__init__(obs_space, action_space, num_outputs, model_config, name)
        # Have we registered our vars yet (see `forward`)?
        self._registered = False

    @override(ModelV2)
    def forward(self, input_dict, state, seq_lens):
        with tf1.variable_scope("model", reuse=tf1.AUTO_REUSE):
            bias = tf1.get_variable(
                dtype=tf.float32,
                name="bias",
                initializer=tf.keras.initializers.Zeros(),
                shape=(),
            )
            output = bias + tf.zeros([tf.shape(input_dict["obs"])[0], self.num_outputs])
            self._value_out = tf.reduce_mean(output, -1)  # fake value

        if not self._registered:
            self.register_variables(
                tf1.get_collection(
                    tf1.GraphKeys.TRAINABLE_VARIABLES, scope=".+/model/.+"
                )
            )
            self._registered = True

        return output, []

    @override(ModelV2)
    def value_function(self):
        return tf.reshape(self._value_out, [-1])


class TorchFastModel(TorchModelV2, nn.Module):
    """Torch version of FastModel (tf)."""

    def __init__(self, obs_space, action_space, num_outputs, model_config, name):
        TorchModelV2.__init__(
            self, obs_space, action_space, num_outputs, model_config, name
        )
        nn.Module.__init__(self)

        self.bias = nn.Parameter(
            torch.tensor([0.0], dtype=torch.float32, requires_grad=True)
        )

        # Only needed to give some params to the optimizer (even though,
        # they are never used anywhere).
        self.dummy_layer = SlimFC(1, 1)
        self._output = None

    @override(ModelV2)
    def forward(self, input_dict, state, seq_lens):
        self._output = self.bias + torch.zeros(
            size=(input_dict["obs"].shape[0], self.num_outputs)
        ).to(self.bias.device)
        return self._output, []

    @override(ModelV2)
    def value_function(self):
        assert self._output is not None, "must call forward first!"
        return torch.reshape(torch.mean(self._output, -1), [-1])
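Both FastModel variants above do the same two things: broadcast one learned scalar to a `(batch, num_outputs)` output, and report the per-row mean of that output as a fake value estimate. A plain-Python sketch of that computation (`fast_forward` is an illustrative helper, not part of RLlib):

```python
def fast_forward(batch_size, num_outputs, bias):
    # Broadcast a single learned scalar to (batch, num_outputs), like the
    # forward() methods above; the "value" is each row's mean.
    out = [[bias] * num_outputs for _ in range(batch_size)]
    values = [sum(row) / len(row) for row in out]
    return out, values

out, values = fast_forward(2, 3, 0.5)
print(values)  # -> [0.5, 0.5]
```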
.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/mobilenet_v2_encoder.py
ADDED
@@ -0,0 +1,48 @@
# @OldAPIStack
"""
This file implements a MobileNet v2 Encoder.
It uses MobileNet v2 to encode images into a latent space of 1000 dimensions.

Depending on the experiment, the MobileNet v2 encoder layers can be frozen or
unfrozen. This is controlled by the `freeze` parameter in the config.

This is an example of how a pre-trained neural network can be used as an encoder
in RLlib. You can modify this example to accommodate your own encoder network or
other pre-trained networks.
"""

from ray.rllib.core.models.base import Encoder, ENCODER_OUT
from ray.rllib.core.models.configs import ModelConfig
from ray.rllib.core.models.torch.base import TorchModel
from ray.rllib.utils.framework import try_import_torch

torch, nn = try_import_torch()

MOBILENET_INPUT_SHAPE = (3, 224, 224)


class MobileNetV2EncoderConfig(ModelConfig):
    # MobileNet v2 has a flat output with a length of 1000.
    output_dims = (1000,)
    freeze = True

    def build(self, framework):
        assert framework == "torch", "Unsupported framework `{}`!".format(framework)
        return MobileNetV2Encoder(self)


class MobileNetV2Encoder(TorchModel, Encoder):
    """A MobileNet v2 encoder for RLlib."""

    def __init__(self, config):
        super().__init__(config)
        self.net = torch.hub.load(
            "pytorch/vision:v0.6.0", "mobilenet_v2", pretrained=True
        )
        if config.freeze:
            # We don't want to train this encoder, so freeze its parameters!
            for p in self.net.parameters():
                p.requires_grad = False

    def _forward(self, input_dict, **kwargs):
        return {ENCODER_OUT: (self.net(input_dict["obs"]))}
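The config-to-model pattern above (a config object holds the hyperparameters and its `build(framework)` constructs the model) can be sketched without RLlib or torch. `EncoderConfig` below is a hypothetical stand-in that returns a plain dict instead of an encoder module:

```python
class EncoderConfig:
    # Mirrors MobileNetV2EncoderConfig's class-level hyperparameters.
    output_dims = (1000,)
    freeze = True

    def build(self, framework):
        # Same guard as above: only a "torch" build is supported.
        assert framework == "torch", "Unsupported framework `{}`!".format(framework)
        # Stand-in for constructing the encoder from this config.
        return {"output_dims": self.output_dims, "frozen": self.freeze}

enc = EncoderConfig().build("torch")
print(enc["frozen"])  # -> True
```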
.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/mobilenet_v2_with_lstm_models.py
ADDED
@@ -0,0 +1,160 @@
# @OldAPIStack
import numpy as np

from ray.rllib.models.modelv2 import ModelV2
from ray.rllib.models.tf.recurrent_net import RecurrentNetwork
from ray.rllib.models.torch.misc import SlimFC
from ray.rllib.models.torch.recurrent_net import RecurrentNetwork as TorchRNN
from ray.rllib.utils.annotations import override
from ray.rllib.utils.framework import try_import_tf, try_import_torch

tf1, tf, tfv = try_import_tf()
torch, nn = try_import_torch()


class MobileV2PlusRNNModel(RecurrentNetwork):
    """A conv. + recurrent keras net example using a pre-trained MobileNet."""

    def __init__(
        self, obs_space, action_space, num_outputs, model_config, name, cnn_shape
    ):
        super(MobileV2PlusRNNModel, self).__init__(
            obs_space, action_space, num_outputs, model_config, name
        )

        self.cell_size = 16
        visual_size = cnn_shape[0] * cnn_shape[1] * cnn_shape[2]

        state_in_h = tf.keras.layers.Input(shape=(self.cell_size,), name="h")
        state_in_c = tf.keras.layers.Input(shape=(self.cell_size,), name="c")
        seq_in = tf.keras.layers.Input(shape=(), name="seq_in", dtype=tf.int32)

        inputs = tf.keras.layers.Input(shape=(None, visual_size), name="visual_inputs")

        input_visual = inputs
        input_visual = tf.reshape(
            input_visual, [-1, cnn_shape[0], cnn_shape[1], cnn_shape[2]]
        )
        cnn_input = tf.keras.layers.Input(shape=cnn_shape, name="cnn_input")

        cnn_model = tf.keras.applications.mobilenet_v2.MobileNetV2(
            alpha=1.0,
            include_top=True,
            weights=None,
            input_tensor=cnn_input,
            pooling=None,
        )
        vision_out = cnn_model(input_visual)
        vision_out = tf.reshape(
            vision_out, [-1, tf.shape(inputs)[1], vision_out.shape.as_list()[-1]]
        )

        lstm_out, state_h, state_c = tf.keras.layers.LSTM(
            self.cell_size, return_sequences=True, return_state=True, name="lstm"
        )(
            inputs=vision_out,
            mask=tf.sequence_mask(seq_in),
            initial_state=[state_in_h, state_in_c],
        )

        # Postprocess LSTM output with another hidden layer and compute values.
        logits = tf.keras.layers.Dense(
            self.num_outputs, activation=tf.keras.activations.linear, name="logits"
        )(lstm_out)
        values = tf.keras.layers.Dense(1, activation=None, name="values")(lstm_out)

        # Create the RNN model
        self.rnn_model = tf.keras.Model(
            inputs=[inputs, seq_in, state_in_h, state_in_c],
            outputs=[logits, values, state_h, state_c],
        )
        self.rnn_model.summary()

    @override(RecurrentNetwork)
    def forward_rnn(self, inputs, state, seq_lens):
        model_out, self._value_out, h, c = self.rnn_model([inputs, seq_lens] + state)
        return model_out, [h, c]

    @override(ModelV2)
    def get_initial_state(self):
        return [
            np.zeros(self.cell_size, np.float32),
            np.zeros(self.cell_size, np.float32),
        ]

    @override(ModelV2)
    def value_function(self):
        return tf.reshape(self._value_out, [-1])


class TorchMobileV2PlusRNNModel(TorchRNN, nn.Module):
    """A conv. + recurrent torch net example using a pre-trained MobileNet."""

    def __init__(
        self, obs_space, action_space, num_outputs, model_config, name, cnn_shape
    ):
        TorchRNN.__init__(
            self, obs_space, action_space, num_outputs, model_config, name
        )
        nn.Module.__init__(self)

        self.lstm_state_size = 16
        self.cnn_shape = list(cnn_shape)
        self.visual_size_in = cnn_shape[0] * cnn_shape[1] * cnn_shape[2]
        # MobileNetV2 has a flat output of (1000,).
        self.visual_size_out = 1000

        # Load the MobileNetV2 from torch.hub.
        self.cnn_model = torch.hub.load(
            "pytorch/vision:v0.6.0", "mobilenet_v2", pretrained=True
        )

        self.lstm = nn.LSTM(
            self.visual_size_out, self.lstm_state_size, batch_first=True
        )

        # Postprocess LSTM output with another hidden layer and compute values.
        self.logits = SlimFC(self.lstm_state_size, self.num_outputs)
        self.value_branch = SlimFC(self.lstm_state_size, 1)
        # Holds the current "base" output (before logits layer).
        self._features = None

    @override(TorchRNN)
    def forward_rnn(self, inputs, state, seq_lens):
        # Create image dims.
        vision_in = torch.reshape(inputs, [-1] + self.cnn_shape)
        vision_out = self.cnn_model(vision_in)
        # Flatten.
        vision_out_time_ranked = torch.reshape(
            vision_out, [inputs.shape[0], inputs.shape[1], vision_out.shape[-1]]
        )
        if len(state[0].shape) == 2:
            state[0] = state[0].unsqueeze(0)
            state[1] = state[1].unsqueeze(0)
        # Forward through LSTM.
        self._features, [h, c] = self.lstm(vision_out_time_ranked, state)
        # Forward LSTM out through logits layer and value layer.
        logits = self.logits(self._features)
        return logits, [h.squeeze(0), c.squeeze(0)]

    @override(ModelV2)
    def get_initial_state(self):
        # Place hidden states on same device as model.
        h = [
            list(self.cnn_model.modules())[-1]
            .weight.new(1, self.lstm_state_size)
            .zero_()
            .squeeze(0),
            list(self.cnn_model.modules())[-1]
            .weight.new(1, self.lstm_state_size)
            .zero_()
            .squeeze(0),
        ]
        return h

    @override(ModelV2)
    def value_function(self):
        assert self._features is not None, "must call forward() first"
        return torch.reshape(self.value_branch(self._features), [-1])
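The reshaping in `forward_rnn()` above is pure shape bookkeeping: the `(batch, time, flat_pixels)` RNN input is folded into `(batch * time, C, H, W)` frames for the CNN, and the CNN's 1000-dim features are then unfolded back to `(batch, time, 1000)` for the LSTM. A plain-Python sketch of those shapes (`rnn_cnn_shapes` is an illustrative helper, not an RLlib function):

```python
def rnn_cnn_shapes(batch, time, cnn_shape, feat_dim=1000):
    # Shape flow from forward_rnn() above: fold time into the batch axis
    # for the per-frame CNN pass, then unfold it again for the LSTM.
    flat = cnn_shape[0] * cnn_shape[1] * cnn_shape[2]
    lstm_style_in = (batch, time, flat)   # flattened visual input
    cnn_in = (batch * time,) + tuple(cnn_shape)  # one frame per row
    lstm_feed = (batch, time, feat_dim)   # re-ranked CNN features
    return lstm_style_in, cnn_in, lstm_feed

print(rnn_cnn_shapes(2, 5, (3, 224, 224))[1])  # -> (10, 3, 224, 224)
```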
.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/neural_computer.py
ADDED
@@ -0,0 +1,247 @@
# @OldAPIStack
from collections import OrderedDict
import gymnasium as gym
from typing import Union, Dict, List, Tuple

from ray.rllib.models.torch.torch_modelv2 import TorchModelV2
from ray.rllib.models.torch.misc import SlimFC
from ray.rllib.utils.framework import try_import_torch
from ray.rllib.utils.typing import ModelConfigDict, TensorType

try:
    from dnc import DNC
except ModuleNotFoundError:
    print("dnc module not found. Did you forget to 'pip install dnc'?")
    raise

torch, nn = try_import_torch()


class DNCMemory(TorchModelV2, nn.Module):
    """Differentiable Neural Computer wrapper around ixaxaar's DNC implementation,
    see https://github.com/ixaxaar/pytorch-dnc"""

    DEFAULT_CONFIG = {
        "dnc_model": DNC,
        # Number of controller hidden layers
        "num_hidden_layers": 1,
        # Number of weights per controller hidden layer
        "hidden_size": 64,
        # Number of LSTM units
        "num_layers": 1,
        # Number of read heads, i.e. how many addrs are read at once
        "read_heads": 4,
        # Number of memory cells in the controller
        "nr_cells": 32,
        # Size of each cell
        "cell_size": 16,
        # LSTM activation function
        "nonlinearity": "tanh",
        # Observation goes through this torch.nn.Module before
        # feeding to the DNC
        "preprocessor": torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.Tanh()),
        # Input size to the preprocessor
        "preprocessor_input_size": 64,
        # The output size of the preprocessor
        # and the input size of the dnc
        "preprocessor_output_size": 64,
    }

    MEMORY_KEYS = [
        "memory",
        "link_matrix",
        "precedence",
        "read_weights",
        "write_weights",
        "usage_vector",
    ]

    def __init__(
        self,
        obs_space: gym.spaces.Space,
        action_space: gym.spaces.Space,
        num_outputs: int,
        model_config: ModelConfigDict,
        name: str,
        **custom_model_kwargs,
    ):
        nn.Module.__init__(self)
        super(DNCMemory, self).__init__(
            obs_space, action_space, num_outputs, model_config, name
        )
        self.num_outputs = num_outputs
        self.obs_dim = gym.spaces.utils.flatdim(obs_space)
        self.act_dim = gym.spaces.utils.flatdim(action_space)

        self.cfg = dict(self.DEFAULT_CONFIG, **custom_model_kwargs)
        assert (
            self.cfg["num_layers"] == 1
        ), "num_layers != 1 has not been implemented yet"
        self.cur_val = None

        self.preprocessor = torch.nn.Sequential(
            torch.nn.Linear(self.obs_dim, self.cfg["preprocessor_input_size"]),
            self.cfg["preprocessor"],
        )

        self.logit_branch = SlimFC(
            in_size=self.cfg["hidden_size"],
            out_size=self.num_outputs,
            activation_fn=None,
            initializer=torch.nn.init.xavier_uniform_,
        )

        self.value_branch = SlimFC(
            in_size=self.cfg["hidden_size"],
            out_size=1,
            activation_fn=None,
            initializer=torch.nn.init.xavier_uniform_,
        )

        self.dnc: Union[None, DNC] = None

    def get_initial_state(self) -> List[TensorType]:
        ctrl_hidden = [
            torch.zeros(self.cfg["num_hidden_layers"], self.cfg["hidden_size"]),
            torch.zeros(self.cfg["num_hidden_layers"], self.cfg["hidden_size"]),
        ]
        m = self.cfg["nr_cells"]
        r = self.cfg["read_heads"]
        w = self.cfg["cell_size"]
        memory = [
            torch.zeros(m, w),  # memory
            torch.zeros(1, m, m),  # link_matrix
            torch.zeros(1, m),  # precedence
            torch.zeros(r, m),  # read_weights
            torch.zeros(1, m),  # write_weights
            torch.zeros(m),  # usage_vector
        ]

        read_vecs = torch.zeros(w * r)

        state = [*ctrl_hidden, read_vecs, *memory]
        assert len(state) == 9
        return state

    def value_function(self) -> TensorType:
        assert self.cur_val is not None, "must call forward() first"
        return self.cur_val

    def unpack_state(
        self,
        state: List[TensorType],
    ) -> Tuple[List[Tuple[TensorType, TensorType]], Dict[str, TensorType], TensorType]:
        """Given a list of tensors, reformat for self.dnc input"""
        assert len(state) == 9, "Failed to verify unpacked state"
        ctrl_hidden: List[Tuple[TensorType, TensorType]] = [
            (
                state[0].permute(1, 0, 2).contiguous(),
                state[1].permute(1, 0, 2).contiguous(),
            )
        ]
        read_vecs: TensorType = state[2]
        memory: List[TensorType] = state[3:]
        memory_dict: OrderedDict[str, TensorType] = OrderedDict(
            zip(self.MEMORY_KEYS, memory)
        )

        return ctrl_hidden, memory_dict, read_vecs

    def pack_state(
        self,
        ctrl_hidden: List[Tuple[TensorType, TensorType]],
        memory_dict: Dict[str, TensorType],
        read_vecs: TensorType,
    ) -> List[TensorType]:
        """Given the dnc output, pack it into a list of tensors
        for rllib state. Order is ctrl_hidden, read_vecs, memory_dict"""
        state = []
        ctrl_hidden = [
            ctrl_hidden[0][0].permute(1, 0, 2),
            ctrl_hidden[0][1].permute(1, 0, 2),
        ]
        state += ctrl_hidden
        assert len(state) == 2, "Failed to verify packed state"
        state.append(read_vecs)
        assert len(state) == 3, "Failed to verify packed state"
        state += memory_dict.values()
        assert len(state) == 9, "Failed to verify packed state"
        return state

    def validate_unpack(self, dnc_output, unpacked_state):
        """Ensure the unpacked state shapes match the DNC output"""
        s_ctrl_hidden, s_memory_dict, s_read_vecs = unpacked_state
        ctrl_hidden, memory_dict, read_vecs = dnc_output

        for i in range(len(ctrl_hidden)):
            for j in range(len(ctrl_hidden[i])):
                assert s_ctrl_hidden[i][j].shape == ctrl_hidden[i][j].shape, (
                    "Controller state mismatch: got "
                    f"{s_ctrl_hidden[i][j].shape} should be "
                    f"{ctrl_hidden[i][j].shape}"
                )

        for k in memory_dict:
            assert s_memory_dict[k].shape == memory_dict[k].shape, (
                "Memory state mismatch at key "
                f"{k}: got {s_memory_dict[k].shape} should be "
                f"{memory_dict[k].shape}"
            )

        assert s_read_vecs.shape == read_vecs.shape, (
            "Read state mismatch: got "
            f"{s_read_vecs.shape} should be "
            f"{read_vecs.shape}"
        )

    def build_dnc(self, device_idx: Union[int, None]) -> None:
        self.dnc = self.cfg["dnc_model"](
            input_size=self.cfg["preprocessor_output_size"],
            hidden_size=self.cfg["hidden_size"],
            num_layers=self.cfg["num_layers"],
            num_hidden_layers=self.cfg["num_hidden_layers"],
            read_heads=self.cfg["read_heads"],
            cell_size=self.cfg["cell_size"],
            nr_cells=self.cfg["nr_cells"],
            nonlinearity=self.cfg["nonlinearity"],
            gpu_id=device_idx,
        )

    def forward(
        self,
        input_dict: Dict[str, TensorType],
        state: List[TensorType],
        seq_lens: TensorType,
    ) -> Tuple[TensorType, List[TensorType]]:

        flat = input_dict["obs_flat"]
        # Batch and Time
        # Forward expects outputs as [B, T, logits]
        B = len(seq_lens)
        T = flat.shape[0] // B

        # Deconstruct batch into batch and time dimensions: [B, T, feats]
        flat = torch.reshape(flat, [-1, T] + list(flat.shape[1:]))

        # First run
        if self.dnc is None:
            gpu_id = flat.device.index if flat.device.index is not None else -1
            self.build_dnc(gpu_id)
            hidden = (None, None, None)

        else:
            hidden = self.unpack_state(state)  # type: ignore

        # Run thru preprocessor before DNC
        z = self.preprocessor(flat.reshape(B * T, self.obs_dim))
        z = z.reshape(B, T, self.cfg["preprocessor_output_size"])
        output, hidden = self.dnc(z, hidden)
        packed_state = self.pack_state(*hidden)

        # Compute action/value from output
        logits = self.logit_branch(output.view(B * T, -1))
        values = self.value_branch(output.view(B * T, -1))

        self.cur_val = values.squeeze(1)

        return logits, packed_state
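The `pack_state`/`unpack_state` pair above must be exact inverses: RLlib hands the policy a flat 9-element state list, and the model reshapes it into the `(ctrl_hidden, memory_dict, read_vecs)` triple the DNC expects and back again every step. A minimal, dependency-free sketch of that round trip, using plain ints in place of tensors (the `permute` calls are omitted here since they cancel out over a pack/unpack cycle):

```python
from collections import OrderedDict

MEMORY_KEYS = ["memory", "link_matrix", "precedence",
               "read_weights", "write_weights", "usage_vector"]

def unpack_state(state):
    # Flat layout: [h, c, read_vecs, *memory] -- 9 entries total.
    assert len(state) == 9
    ctrl_hidden = [(state[0], state[1])]
    read_vecs = state[2]
    memory_dict = OrderedDict(zip(MEMORY_KEYS, state[3:]))
    return ctrl_hidden, memory_dict, read_vecs

def pack_state(ctrl_hidden, memory_dict, read_vecs):
    # Re-flatten in the same order: ctrl_hidden, read_vecs, memory values.
    state = [ctrl_hidden[0][0], ctrl_hidden[0][1], read_vecs]
    state += list(memory_dict.values())
    assert len(state) == 9
    return state

flat = list(range(9))  # stand-ins for the 9 state tensors
assert pack_state(*unpack_state(flat)) == flat  # round trip is lossless
```

The `OrderedDict` keyed by `MEMORY_KEYS` is what guarantees the six memory tensors re-flatten in a stable order; any reordering would silently corrupt the recurrent state between steps.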
.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/parametric_actions_model.py
ADDED
@@ -0,0 +1,201 @@
# @OldAPIStack
from gymnasium.spaces import Box

from ray.rllib.algorithms.dqn.distributional_q_tf_model import DistributionalQTFModel
from ray.rllib.algorithms.dqn.dqn_torch_model import DQNTorchModel
from ray.rllib.models.tf.fcnet import FullyConnectedNetwork
from ray.rllib.models.torch.fcnet import FullyConnectedNetwork as TorchFC
from ray.rllib.utils.framework import try_import_tf, try_import_torch
from ray.rllib.utils.torch_utils import FLOAT_MAX, FLOAT_MIN

tf1, tf, tfv = try_import_tf()
torch, nn = try_import_torch()


class ParametricActionsModel(DistributionalQTFModel):
    """Parametric action model that handles the dot product and masking.

    This assumes the outputs are logits for a single Categorical action dist.
    Getting this to work with a more complex output (e.g., if the action space
    is a tuple of several distributions) is also possible but left as an
    exercise to the reader.
    """

    def __init__(
        self,
        obs_space,
        action_space,
        num_outputs,
        model_config,
        name,
        true_obs_shape=(4,),
        action_embed_size=2,
        **kw
    ):
        super(ParametricActionsModel, self).__init__(
            obs_space, action_space, num_outputs, model_config, name, **kw
        )
        self.action_embed_model = FullyConnectedNetwork(
            Box(-1, 1, shape=true_obs_shape),
            action_space,
            action_embed_size,
            model_config,
            name + "_action_embed",
        )

    def forward(self, input_dict, state, seq_lens):
        # Extract the available actions tensor from the observation.
        avail_actions = input_dict["obs"]["avail_actions"]
        action_mask = input_dict["obs"]["action_mask"]

        # Compute the predicted action embedding
        action_embed, _ = self.action_embed_model({"obs": input_dict["obs"]["cart"]})

        # Expand the model output to [BATCH, 1, EMBED_SIZE]. Note that the
        # avail actions tensor is of shape [BATCH, MAX_ACTIONS, EMBED_SIZE].
        intent_vector = tf.expand_dims(action_embed, 1)

        # Batch dot product => shape of logits is [BATCH, MAX_ACTIONS].
        action_logits = tf.reduce_sum(avail_actions * intent_vector, axis=2)

        # Mask out invalid actions (use tf.float32.min for stability)
        inf_mask = tf.maximum(tf.math.log(action_mask), tf.float32.min)
        return action_logits + inf_mask, state

    def value_function(self):
        return self.action_embed_model.value_function()


class TorchParametricActionsModel(DQNTorchModel):
    """PyTorch version of above ParametricActionsModel."""

    def __init__(
        self,
        obs_space,
        action_space,
        num_outputs,
        model_config,
        name,
        true_obs_shape=(4,),
        action_embed_size=2,
        **kw
    ):
        DQNTorchModel.__init__(
            self, obs_space, action_space, num_outputs, model_config, name, **kw
        )

        self.action_embed_model = TorchFC(
            Box(-1, 1, shape=true_obs_shape),
            action_space,
            action_embed_size,
            model_config,
            name + "_action_embed",
        )

    def forward(self, input_dict, state, seq_lens):
        # Extract the available actions tensor from the observation.
        avail_actions = input_dict["obs"]["avail_actions"]
        action_mask = input_dict["obs"]["action_mask"]

        # Compute the predicted action embedding
        action_embed, _ = self.action_embed_model({"obs": input_dict["obs"]["cart"]})

        # Expand the model output to [BATCH, 1, EMBED_SIZE]. Note that the
        # avail actions tensor is of shape [BATCH, MAX_ACTIONS, EMBED_SIZE].
        intent_vector = torch.unsqueeze(action_embed, 1)

        # Batch dot product => shape of logits is [BATCH, MAX_ACTIONS].
        action_logits = torch.sum(avail_actions * intent_vector, dim=2)

        # Mask out invalid actions (use -inf to tag invalid).
        # These are then recognized by the EpsilonGreedy exploration component
        # as invalid actions that are not to be chosen.
        inf_mask = torch.clamp(torch.log(action_mask), FLOAT_MIN, FLOAT_MAX)

        return action_logits + inf_mask, state

    def value_function(self):
        return self.action_embed_model.value_function()


class ParametricActionsModelThatLearnsEmbeddings(DistributionalQTFModel):
    """Same as the above ParametricActionsModel.

    However, this version also learns the action embeddings.
    """

    def __init__(
        self,
        obs_space,
        action_space,
        num_outputs,
        model_config,
        name,
        true_obs_shape=(4,),
        action_embed_size=2,
        **kw
    ):
        super(ParametricActionsModelThatLearnsEmbeddings, self).__init__(
            obs_space, action_space, num_outputs, model_config, name, **kw
        )

        action_ids_shifted = tf.constant(
            list(range(1, num_outputs + 1)), dtype=tf.float32
        )

        obs_cart = tf.keras.layers.Input(shape=true_obs_shape, name="obs_cart")
        valid_avail_actions_mask = tf.keras.layers.Input(
            shape=(num_outputs,), name="valid_avail_actions_mask"
        )

        self.pred_action_embed_model = FullyConnectedNetwork(
            Box(-1, 1, shape=true_obs_shape),
            action_space,
            action_embed_size,
            model_config,
            name + "_pred_action_embed",
        )

        # Compute the predicted action embedding
        pred_action_embed, _ = self.pred_action_embed_model({"obs": obs_cart})
        _value_out = self.pred_action_embed_model.value_function()

        # Expand the model output to [BATCH, 1, EMBED_SIZE]. Note that the
        # avail actions tensor is of shape [BATCH, MAX_ACTIONS, EMBED_SIZE].
        intent_vector = tf.expand_dims(pred_action_embed, 1)

        valid_avail_actions = action_ids_shifted * valid_avail_actions_mask
        # Embedding for valid available actions which will be learned.
        # Embedding vector for 0 is an invalid embedding (a "dummy embedding").
        valid_avail_actions_embed = tf.keras.layers.Embedding(
            input_dim=num_outputs + 1,
            output_dim=action_embed_size,
            name="action_embed_matrix",
        )(valid_avail_actions)

        # Batch dot product => shape of logits is [BATCH, MAX_ACTIONS].
        action_logits = tf.reduce_sum(valid_avail_actions_embed * intent_vector, axis=2)

        # Mask out invalid actions (use tf.float32.min for stability)
        inf_mask = tf.maximum(tf.math.log(valid_avail_actions_mask), tf.float32.min)

        action_logits = action_logits + inf_mask

        self.param_actions_model = tf.keras.Model(
            inputs=[obs_cart, valid_avail_actions_mask],
            outputs=[action_logits, _value_out],
        )
        self.param_actions_model.summary()

    def forward(self, input_dict, state, seq_lens):
        # Extract the available actions mask tensor from the observation.
        valid_avail_actions_mask = input_dict["obs"]["valid_avail_actions_mask"]

        action_logits, self._value_out = self.param_actions_model(
            [input_dict["obs"]["cart"], valid_avail_actions_mask]
        )

        return action_logits, state

    def value_function(self):
        return self._value_out
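Both frameworks above mask invalid actions the same way: add `log(mask)` to the logits, clamped so that `log(0) = -inf` becomes a huge finite negative number (`tf.float32.min` / `FLOAT_MIN`) rather than a NaN-prone infinity. A valid action (mask 1) gets `log(1) = 0` added and is untouched; an invalid one gets a penalty so large it can never win an argmax or receive meaningful softmax mass. A stdlib-only sketch of the trick (`FLOAT_MIN` here is an illustrative stand-in for the framework constants):

```python
FLOAT_MIN = -3.4e38  # stand-in for torch FLOAT_MIN / tf.float32.min

def masked_logits(logits, action_mask):
    # log(1) == 0 leaves valid actions unchanged; log(0) == -inf is clamped
    # to FLOAT_MIN so invalid actions get ~zero probability, no NaNs.
    return [
        logit + (0.0 if m == 1 else FLOAT_MIN)
        for logit, m in zip(logits, action_mask)
    ]

out = masked_logits([1.0, 2.0, 3.0], [1, 0, 1])
assert out[0] == 1.0 and out[2] == 3.0  # valid actions untouched
assert out[1] < -1e37                   # masked action can never be argmax
```

Clamping instead of using a true `-inf` matters because downstream code may multiply or subtract these logits (e.g. in a softmax cross-entropy), and `inf - inf` or `0 * inf` would produce NaNs.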
.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/shared_weights_model.py
ADDED
@@ -0,0 +1,206 @@
# @OldAPIStack
import numpy as np

from ray.rllib.models.modelv2 import ModelV2
from ray.rllib.models.tf.tf_modelv2 import TFModelV2
from ray.rllib.models.torch.misc import SlimFC
from ray.rllib.models.torch.torch_modelv2 import TorchModelV2
from ray.rllib.utils.annotations import override
from ray.rllib.utils.framework import try_import_tf, try_import_torch

tf1, tf, tfv = try_import_tf()
torch, nn = try_import_torch()

TF2_GLOBAL_SHARED_LAYER = None


class TF2SharedWeightsModel(TFModelV2):
    """Example of weight sharing between two different TFModelV2s.

    NOTE: This will only work for tf2.x. When running with config.framework=tf,
    use SharedWeightsModel1 and SharedWeightsModel2 below, instead!

    The shared (single) layer is simply defined outside of the two Models,
    then used by both Models in their forward pass.
    """

    def __init__(
        self, observation_space, action_space, num_outputs, model_config, name
    ):
        super().__init__(
            observation_space, action_space, num_outputs, model_config, name
        )

        global TF2_GLOBAL_SHARED_LAYER
        # The global, shared layer to be used by both models.
        if TF2_GLOBAL_SHARED_LAYER is None:
            TF2_GLOBAL_SHARED_LAYER = tf.keras.layers.Dense(
                units=64, activation=tf.nn.relu, name="fc1"
            )

        inputs = tf.keras.layers.Input(observation_space.shape)
        last_layer = TF2_GLOBAL_SHARED_LAYER(inputs)
        output = tf.keras.layers.Dense(
            units=num_outputs, activation=None, name="fc_out"
        )(last_layer)
        vf = tf.keras.layers.Dense(units=1, activation=None, name="value_out")(
            last_layer
        )
        self.base_model = tf.keras.models.Model(inputs, [output, vf])

    @override(ModelV2)
    def forward(self, input_dict, state, seq_lens):
        out, self._value_out = self.base_model(input_dict["obs"])
        return out, []

    @override(ModelV2)
    def value_function(self):
        return tf.reshape(self._value_out, [-1])


class SharedWeightsModel1(TFModelV2):
    """Example of weight sharing between two different TFModelV2s.

    NOTE: This will only work for tf1 (static graph). When running with
    config.framework_str=tf2, use TF2SharedWeightsModel, instead!

    Here, we share the variables defined in the 'shared' variable scope
    by entering it explicitly with tf1.AUTO_REUSE. This creates the
    variables for the 'fc1' layer in a global scope called 'shared'
    (outside of the Policy's normal variable scope).
    """

    def __init__(
        self, observation_space, action_space, num_outputs, model_config, name
    ):
        super().__init__(
            observation_space, action_space, num_outputs, model_config, name
        )

        inputs = tf.keras.layers.Input(observation_space.shape)
        with tf1.variable_scope(
            tf1.VariableScope(tf1.AUTO_REUSE, "shared"),
            reuse=tf1.AUTO_REUSE,
            auxiliary_name_scope=False,
        ):
            last_layer = tf.keras.layers.Dense(
                units=64, activation=tf.nn.relu, name="fc1"
            )(inputs)
        output = tf.keras.layers.Dense(
            units=num_outputs, activation=None, name="fc_out"
        )(last_layer)
        vf = tf.keras.layers.Dense(units=1, activation=None, name="value_out")(
            last_layer
        )
        self.base_model = tf.keras.models.Model(inputs, [output, vf])

    @override(ModelV2)
    def forward(self, input_dict, state, seq_lens):
        out, self._value_out = self.base_model(input_dict["obs"])
        return out, []

    @override(ModelV2)
    def value_function(self):
        return tf.reshape(self._value_out, [-1])


class SharedWeightsModel2(TFModelV2):
    """The "other" TFModelV2 using the same shared space as the one above."""

    def __init__(
        self, observation_space, action_space, num_outputs, model_config, name
    ):
        super().__init__(
            observation_space, action_space, num_outputs, model_config, name
        )

        inputs = tf.keras.layers.Input(observation_space.shape)

        # Weights shared with SharedWeightsModel1.
        with tf1.variable_scope(
            tf1.VariableScope(tf1.AUTO_REUSE, "shared"),
            reuse=tf1.AUTO_REUSE,
            auxiliary_name_scope=False,
        ):
            last_layer = tf.keras.layers.Dense(
                units=64, activation=tf.nn.relu, name="fc1"
            )(inputs)
        output = tf.keras.layers.Dense(
            units=num_outputs, activation=None, name="fc_out"
        )(last_layer)
        vf = tf.keras.layers.Dense(units=1, activation=None, name="value_out")(
            last_layer
        )
        self.base_model = tf.keras.models.Model(inputs, [output, vf])

    @override(ModelV2)
    def forward(self, input_dict, state, seq_lens):
        out, self._value_out = self.base_model(input_dict["obs"])
        return out, []

    @override(ModelV2)
    def value_function(self):
        return tf.reshape(self._value_out, [-1])


TORCH_GLOBAL_SHARED_LAYER = None
if torch:
    # The global, shared layer to be used by both models.
    TORCH_GLOBAL_SHARED_LAYER = SlimFC(
        64,
        64,
        activation_fn=nn.ReLU,
        initializer=torch.nn.init.xavier_uniform_,
    )


class TorchSharedWeightsModel(TorchModelV2, nn.Module):
    """Example of weight sharing between two different TorchModelV2s.

    The shared (single) layer is simply defined outside of the two Models,
    then used by both Models in their forward pass.
    """

    def __init__(
        self, observation_space, action_space, num_outputs, model_config, name
    ):
        TorchModelV2.__init__(
            self, observation_space, action_space, num_outputs, model_config, name
        )
        nn.Module.__init__(self)

        # Non-shared initial layer.
        self.first_layer = SlimFC(
            int(np.prod(observation_space.shape)),
            64,
            activation_fn=nn.ReLU,
            initializer=torch.nn.init.xavier_uniform_,
        )

        # Non-shared final layer.
        self.last_layer = SlimFC(
            64,
            self.num_outputs,
            activation_fn=None,
            initializer=torch.nn.init.xavier_uniform_,
        )
        self.vf = SlimFC(
            64,
            1,
            activation_fn=None,
            initializer=torch.nn.init.xavier_uniform_,
        )
        self._global_shared_layer = TORCH_GLOBAL_SHARED_LAYER
        self._output = None

    @override(ModelV2)
    def forward(self, input_dict, state, seq_lens):
        out = self.first_layer(input_dict["obs"])
        self._output = self._global_shared_layer(out)
        model_out = self.last_layer(self._output)
        return model_out, []

    @override(ModelV2)
    def value_function(self):
        assert self._output is not None, "must call forward first!"
        return torch.reshape(self.vf(self._output), [-1])
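The torch variant above needs no framework machinery at all: `TORCH_GLOBAL_SHARED_LAYER` is created once at module scope, and every model instance stores a reference to that same object, so gradient updates through one model are immediately visible to the other. A dependency-free sketch of that pattern (class and attribute names here are illustrative, not RLlib's):

```python
class SharedLayer:
    """Stand-in for a module-level shared nn.Module."""
    def __init__(self):
        self.weight = 0.0

SHARED = SharedLayer()  # created once at module scope

class ModelA:
    def __init__(self):
        self.shared = SHARED  # a reference, not a copy

class ModelB:
    def __init__(self):
        self.shared = SHARED

a, b = ModelA(), ModelB()
a.shared.weight = 1.5          # "gradient step" made through model A...
assert b.shared.weight == 1.5  # ...is seen by model B
assert a.shared is b.shared    # both hold the identical object
```

The tf1 classes need the `variable_scope`/`AUTO_REUSE` dance instead because in static-graph mode variables are looked up by scoped name rather than by Python object identity.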
.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/simple_rpg_model.py
ADDED
@@ -0,0 +1,65 @@
| 1 |
+
# @OldAPIStack
|
| 2 |
+
from ray.rllib.models.tf.tf_modelv2 import TFModelV2
|
| 3 |
+
from ray.rllib.models.tf.fcnet import FullyConnectedNetwork as TFFCNet
|
from ray.rllib.models.torch.torch_modelv2 import TorchModelV2
from ray.rllib.models.torch.fcnet import FullyConnectedNetwork as TorchFCNet
from ray.rllib.utils.framework import try_import_tf, try_import_torch

tf1, tf, tfv = try_import_tf()
torch, nn = try_import_torch()


class CustomTorchRPGModel(TorchModelV2, nn.Module):
    """Example of interpreting repeated observations."""

    def __init__(self, obs_space, action_space, num_outputs, model_config, name):
        super().__init__(obs_space, action_space, num_outputs, model_config, name)
        nn.Module.__init__(self)
        self.model = TorchFCNet(
            obs_space, action_space, num_outputs, model_config, name
        )

    def forward(self, input_dict, state, seq_lens):
        # The unpacked input tensors, where M=MAX_PLAYERS, N=MAX_ITEMS:
        # {
        #     'items', <torch.Tensor shape=(?, M, N, 5)>,
        #     'location', <torch.Tensor shape=(?, M, 2)>,
        #     'status', <torch.Tensor shape=(?, M, 10)>,
        # }
        print("The unpacked input tensors:", input_dict["obs"])
        print()
        print("Unbatched repeat dim", input_dict["obs"].unbatch_repeat_dim())
        print()
        print("Fully unbatched", input_dict["obs"].unbatch_all())
        print()
        return self.model.forward(input_dict, state, seq_lens)

    def value_function(self):
        return self.model.value_function()


class CustomTFRPGModel(TFModelV2):
    """Example of interpreting repeated observations."""

    def __init__(self, obs_space, action_space, num_outputs, model_config, name):
        super().__init__(obs_space, action_space, num_outputs, model_config, name)
        self.model = TFFCNet(obs_space, action_space, num_outputs, model_config, name)

    def forward(self, input_dict, state, seq_lens):
        # The unpacked input tensors, where M=MAX_PLAYERS, N=MAX_ITEMS:
        # {
        #     'items', <tf.Tensor shape=(?, M, N, 5)>,
        #     'location', <tf.Tensor shape=(?, M, 2)>,
        #     'status', <tf.Tensor shape=(?, M, 10)>,
        # }
        print("The unpacked input tensors:", input_dict["obs"])
        print()
        print("Unbatched repeat dim", input_dict["obs"].unbatch_repeat_dim())
        print()
        if tf.executing_eagerly():
            print("Fully unbatched", input_dict["obs"].unbatch_all())
            print()
        return self.model.forward(input_dict, state, seq_lens)

    def value_function(self):
        return self.model.value_function()
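The `unbatch_all()` calls above recover ragged, per-row observation lists from batch tensors that are padded out to a fixed repeat dimension. A minimal pure-Python sketch of that idea, with no RLlib dependency (`unbatch_all`, `MAX_ITEMS`, and the sample values here are illustrative stand-ins, not RLlib's implementation):

```python
# Each batch row carries up to MAX_ITEMS padded entries plus a true length;
# "fully unbatching" drops the padding and returns the ragged per-row lists.
MAX_ITEMS = 4


def unbatch_all(padded, lengths):
    """Keep only the first lengths[i] items of each padded row."""
    return [row[:n] for row, n in zip(padded, lengths)]


padded = [
    [10, 11, 0, 0],  # row 0 really has 2 items; trailing zeros are padding
    [20, 0, 0, 0],   # row 1 really has 1 item
]
ragged = unbatch_all(padded, [2, 1])
print(ragged)  # [[10, 11], [20]]
```

RLlib's `Repeated` space machinery does the same bookkeeping on tensors, which is why the printed "fully unbatched" structure is only available when lengths are concrete (hence the eager-mode guard in the TF model).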
.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/parametric_actions_cartpole.py
ADDED
@@ -0,0 +1,121 @@
# @OldAPIStack
"""Example of handling variable length or parametric action spaces.

This toy example demonstrates the action-embedding based approach for handling large
discrete action spaces (potentially infinite in size), similar to this example:

https://neuro.cs.ut.ee/the-use-of-embeddings-in-openai-five/

This example works with RLlib's policy gradient style algorithms
(e.g., PG, PPO, IMPALA, A2C) and DQN.

Note that since the model outputs now include "-inf" tf.float32.min
values, not all algorithm options are supported. For example,
algorithms might crash if they don't properly ignore the -inf action scores.
Working configurations are given below.
"""

import argparse
import os

import ray
from ray import air, tune
from ray.air.constants import TRAINING_ITERATION
from ray.rllib.examples.envs.classes.parametric_actions_cartpole import (
    ParametricActionsCartPole,
)
from ray.rllib.examples._old_api_stack.models.parametric_actions_model import (
    ParametricActionsModel,
    TorchParametricActionsModel,
)
from ray.rllib.models import ModelCatalog
from ray.rllib.utils.metrics import (
    ENV_RUNNER_RESULTS,
    EPISODE_RETURN_MEAN,
    NUM_ENV_STEPS_SAMPLED_LIFETIME,
)
from ray.rllib.utils.test_utils import check_learning_achieved
from ray.tune.registry import register_env

parser = argparse.ArgumentParser()
parser.add_argument(
    "--run", type=str, default="PPO", help="The RLlib-registered algorithm to use."
)
parser.add_argument(
    "--framework",
    choices=["tf", "tf2", "torch"],
    default="torch",
    help="The DL framework specifier.",
)
parser.add_argument(
    "--as-test",
    action="store_true",
    help="Whether this script should be run as a test: --stop-reward must "
    "be achieved within --stop-timesteps AND --stop-iters.",
)
parser.add_argument(
    "--stop-iters", type=int, default=200, help="Number of iterations to train."
)
parser.add_argument(
    "--stop-timesteps", type=int, default=100000, help="Number of timesteps to train."
)
parser.add_argument(
    "--stop-reward", type=float, default=150.0, help="Reward at which we stop training."
)

if __name__ == "__main__":
    args = parser.parse_args()
    ray.init()

    register_env("pa_cartpole", lambda _: ParametricActionsCartPole(10))
    ModelCatalog.register_custom_model(
        "pa_model",
        TorchParametricActionsModel
        if args.framework == "torch"
        else ParametricActionsModel,
    )

    if args.run == "DQN":
        cfg = {
            # TODO(ekl) we need to set these to prevent the masked values
            # from being further processed in DistributionalQModel, which
            # would mess up the masking. It is possible to support these if we
            # defined a custom DistributionalQModel that is aware of masking.
            "hiddens": [],
            "dueling": False,
            "enable_rl_module_and_learner": False,
            "enable_env_runner_and_connector_v2": False,
        }
    else:
        cfg = {}

    config = dict(
        {
            "env": "pa_cartpole",
            "model": {
                "custom_model": "pa_model",
            },
            # Use GPUs iff `RLLIB_NUM_GPUS` env var set to > 0.
            "num_gpus": int(os.environ.get("RLLIB_NUM_GPUS", "0")),
            "num_env_runners": 0,
            "framework": args.framework,
        },
        **cfg,
    )

    stop = {
        TRAINING_ITERATION: args.stop_iters,
        f"{NUM_ENV_STEPS_SAMPLED_LIFETIME}": args.stop_timesteps,
        f"{ENV_RUNNER_RESULTS}/{EPISODE_RETURN_MEAN}": args.stop_reward,
    }

    results = tune.Tuner(
        args.run,
        run_config=air.RunConfig(stop=stop, verbose=1),
        param_space=config,
    ).fit()

    if args.as_test:
        check_learning_achieved(results, args.stop_reward)

    ray.shutdown()
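The docstring above notes that the parametric-actions models emit `tf.float32.min` ("-inf") scores for invalid actions, and that algorithms must leave those scores untouched for the masking to work. A small pure-Python sketch of why that trick yields a valid distribution (`masked_softmax` and `FLOAT_MIN` here are illustrative, not RLlib code):

```python
import math

# Stand-in for tf.float32.min, the value the masking models add to
# the logits of unavailable actions.
FLOAT_MIN = -3.4e38


def masked_softmax(logits, mask):
    """Push invalid logits to ~-inf, then take a numerically stable softmax."""
    masked = [l if m else FLOAT_MIN for l, m in zip(logits, mask)]
    mx = max(masked)
    exps = [math.exp(l - mx) for l in masked]  # exp of ~-inf underflows to 0.0
    total = sum(exps)
    return [e / total for e in exps]


# Action 1 is unavailable: its probability comes out exactly 0.0,
# while the remaining mass is renormalized over valid actions.
probs = masked_softmax([1.0, 2.0, 0.5], [True, False, True])
print(probs)
```

This also illustrates the failure mode mentioned above: any post-processing that adds to or rescales the masked logits (as a dueling or distributional Q head would) destroys the "-inf" guarantee, which is why the DQN config disables `hiddens` and `dueling`.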
.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/parametric_actions_cartpole_embeddings_learnt_by_model.py
ADDED
@@ -0,0 +1,107 @@
# @OldAPIStack
"""Example of handling variable length or parametric action spaces.

This is a toy example of the action-embedding based approach for handling large
discrete action spaces (potentially infinite in size), similar to this:

https://neuro.cs.ut.ee/the-use-of-embeddings-in-openai-five/

This currently works with RLlib's policy gradient style algorithms
(e.g., PG, PPO, IMPALA, A2C) and also DQN.

Note that since the model outputs now include "-inf" tf.float32.min
values, not all algorithm options are supported at the moment. For example,
algorithms might crash if they don't properly ignore the -inf action scores.
Working configurations are given below.
"""

import argparse
import os

import ray
from ray import air, tune
from ray.air.constants import TRAINING_ITERATION
from ray.rllib.examples.envs.classes.parametric_actions_cartpole import (
    ParametricActionsCartPoleNoEmbeddings,
)
from ray.rllib.examples._old_api_stack.models.parametric_actions_model import (
    ParametricActionsModelThatLearnsEmbeddings,
)
from ray.rllib.models import ModelCatalog
from ray.rllib.utils.metrics import (
    ENV_RUNNER_RESULTS,
    EPISODE_RETURN_MEAN,
    NUM_ENV_STEPS_SAMPLED_LIFETIME,
)
from ray.rllib.utils.test_utils import check_learning_achieved
from ray.tune.registry import register_env

parser = argparse.ArgumentParser()
parser.add_argument("--run", type=str, default="PPO")
parser.add_argument(
    "--framework",
    choices=["tf", "tf2"],
    default="tf",
    help="The DL framework specifier (Torch not supported "
    "due to the lack of a model).",
)
parser.add_argument("--as-test", action="store_true")
parser.add_argument("--stop-iters", type=int, default=200)
parser.add_argument("--stop-reward", type=float, default=150.0)
parser.add_argument("--stop-timesteps", type=int, default=100000)

if __name__ == "__main__":
    args = parser.parse_args()
    ray.init()

    register_env("pa_cartpole", lambda _: ParametricActionsCartPoleNoEmbeddings(10))

    ModelCatalog.register_custom_model(
        "pa_model", ParametricActionsModelThatLearnsEmbeddings
    )

    if args.run == "DQN":
        cfg = {
            # TODO(ekl) we need to set these to prevent the masked values
            # from being further processed in DistributionalQModel, which
            # would mess up the masking. It is possible to support these if we
            # defined a custom DistributionalQModel that is aware of masking.
            "hiddens": [],
            "dueling": False,
            "enable_rl_module_and_learner": False,
            "enable_env_runner_and_connector_v2": False,
        }
    else:
        cfg = {}

    config = dict(
        {
            "env": "pa_cartpole",
            "model": {
                "custom_model": "pa_model",
            },
            # Use GPUs iff `RLLIB_NUM_GPUS` env var set to > 0.
            "num_gpus": int(os.environ.get("RLLIB_NUM_GPUS", "0")),
            "num_env_runners": 0,
            "framework": args.framework,
            "action_mask_key": "valid_avail_actions_mask",
        },
        **cfg,
    )

    stop = {
        TRAINING_ITERATION: args.stop_iters,
        NUM_ENV_STEPS_SAMPLED_LIFETIME: args.stop_timesteps,
        f"{ENV_RUNNER_RESULTS}/{EPISODE_RETURN_MEAN}": args.stop_reward,
    }

    results = tune.Tuner(
        args.run,
        run_config=air.RunConfig(stop=stop, verbose=2),
        param_space=config,
    ).fit()

    if args.as_test:
        check_learning_achieved(results, args.stop_reward)

    ray.shutdown()
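Both scripts' docstrings cite the action-embedding approach: instead of one output head per discrete action, the policy produces an "intent" vector and scores each currently available action by its dot product with that action's embedding (learned by the model in this second variant). A minimal sketch of the scoring step, with illustrative names and toy numbers rather than RLlib's actual model code:

```python
def action_scores(intent, action_embeddings):
    """Score each available action as dot(intent, embedding).

    The action set can vary per step (or be huge): only the embeddings of
    the currently available actions need to be passed in.
    """
    return [
        sum(i * e for i, e in zip(intent, emb))
        for emb in action_embeddings
    ]


# Two available actions with 2-d embeddings; the intent [1.0, 0.0]
# prefers whichever embedding points most along the first axis.
scores = action_scores([1.0, 0.0], [[0.5, 0.5], [2.0, -1.0]])
print(scores)  # [0.5, 2.0]
```

In the full models, these scores are then combined with the "-inf" mask for unavailable actions before the softmax, which is what keeps the approach compatible with a fixed-size policy head.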
.venv/lib/python3.11/site-packages/ray/rllib/examples/checkpoints/__pycache__/__init__.cpython-311.pyc
ADDED
Binary file (203 Bytes)
.venv/lib/python3.11/site-packages/ray/rllib/examples/checkpoints/__pycache__/cartpole_dqn_export.cpython-311.pyc
ADDED
Binary file (4.55 kB)
.venv/lib/python3.11/site-packages/ray/rllib/examples/checkpoints/__pycache__/change_config_during_training.cpython-311.pyc
ADDED
Binary file (11.8 kB)
.venv/lib/python3.11/site-packages/ray/rllib/examples/checkpoints/__pycache__/checkpoint_by_custom_criteria.cpython-311.pyc
ADDED
Binary file (6.36 kB)
.venv/lib/python3.11/site-packages/ray/rllib/examples/checkpoints/__pycache__/restore_1_of_n_agents_from_checkpoint.cpython-311.pyc
ADDED
Binary file (7.57 kB)
.venv/lib/python3.11/site-packages/ray/rllib/examples/multi_agent/__init__.py
ADDED
File without changes
.venv/lib/python3.11/site-packages/ray/rllib/examples/multi_agent/__pycache__/__init__.cpython-311.pyc
ADDED
Binary file (203 Bytes)
.venv/lib/python3.11/site-packages/ray/rllib/examples/multi_agent/__pycache__/custom_heuristic_policy.cpython-311.pyc
ADDED
Binary file (4.37 kB)
.venv/lib/python3.11/site-packages/ray/rllib/examples/multi_agent/__pycache__/different_spaces_for_agents.cpython-311.pyc
ADDED
Binary file (5.9 kB)
.venv/lib/python3.11/site-packages/ray/rllib/examples/multi_agent/__pycache__/multi_agent_cartpole.cpython-311.pyc
ADDED
Binary file (2.98 kB)
.venv/lib/python3.11/site-packages/ray/rllib/examples/multi_agent/__pycache__/multi_agent_pendulum.cpython-311.pyc
ADDED
Binary file (3.33 kB)
.venv/lib/python3.11/site-packages/ray/rllib/examples/multi_agent/__pycache__/pettingzoo_independent_learning.cpython-311.pyc
ADDED
Binary file (5.32 kB)
.venv/lib/python3.11/site-packages/ray/rllib/examples/multi_agent/__pycache__/pettingzoo_parameter_sharing.cpython-311.pyc
ADDED
Binary file (4.66 kB)
.venv/lib/python3.11/site-packages/ray/rllib/examples/multi_agent/__pycache__/pettingzoo_shared_value_function.cpython-311.pyc
ADDED
Binary file (485 Bytes)
.venv/lib/python3.11/site-packages/ray/rllib/examples/multi_agent/__pycache__/rock_paper_scissors_heuristic_vs_learned.cpython-311.pyc
ADDED
Binary file (6.03 kB)
.venv/lib/python3.11/site-packages/ray/rllib/examples/multi_agent/__pycache__/rock_paper_scissors_learned_vs_learned.cpython-311.pyc
ADDED
Binary file (4.22 kB)
.venv/lib/python3.11/site-packages/ray/rllib/examples/multi_agent/__pycache__/self_play_league_based_with_open_spiel.cpython-311.pyc
ADDED
Binary file (11.4 kB)