koichi12 committed
Commit fb8b131 · verified · 1 Parent(s): 28dd79d

Add files using upload-large-folder tool

This view is limited to 50 files because it contains too many changes.
Files changed (50)
  1. .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/__init__.py +0 -0
  2. .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/__pycache__/__init__.cpython-311.pyc +0 -0
  3. .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/__pycache__/attention_net_supervised.cpython-311.pyc +0 -0
  4. .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/__pycache__/parametric_actions_cartpole.cpython-311.pyc +0 -0
  5. .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/__pycache__/parametric_actions_cartpole_embeddings_learnt_by_model.cpython-311.pyc +0 -0
  6. .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/attention_net_supervised.py +77 -0
  7. .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/__init__.py +0 -0
  8. .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/__pycache__/__init__.cpython-311.pyc +0 -0
  9. .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/__pycache__/action_mask_model.cpython-311.pyc +0 -0
  10. .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/__pycache__/autoregressive_action_dist.cpython-311.pyc +0 -0
  11. .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/__pycache__/autoregressive_action_model.cpython-311.pyc +0 -0
  12. .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/__pycache__/centralized_critic_models.cpython-311.pyc +0 -0
  13. .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/__pycache__/custom_loss_model.cpython-311.pyc +0 -0
  14. .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/__pycache__/fast_model.cpython-311.pyc +0 -0
  15. .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/__pycache__/mobilenet_v2_encoder.cpython-311.pyc +0 -0
  16. .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/__pycache__/mobilenet_v2_with_lstm_models.cpython-311.pyc +0 -0
  17. .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/__pycache__/parametric_actions_model.cpython-311.pyc +0 -0
  18. .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/__pycache__/shared_weights_model.cpython-311.pyc +0 -0
  19. .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/__pycache__/simple_rpg_model.cpython-311.pyc +0 -0
  20. .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/action_mask_model.py +126 -0
  21. .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/autoregressive_action_dist.py +149 -0
  22. .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/autoregressive_action_model.py +162 -0
  23. .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/centralized_critic_models.py +182 -0
  24. .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/custom_loss_model.py +137 -0
  25. .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/fast_model.py +80 -0
  26. .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/mobilenet_v2_encoder.py +48 -0
  27. .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/mobilenet_v2_with_lstm_models.py +160 -0
  28. .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/neural_computer.py +247 -0
  29. .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/parametric_actions_model.py +201 -0
  30. .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/shared_weights_model.py +206 -0
  31. .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/simple_rpg_model.py +65 -0
  32. .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/parametric_actions_cartpole.py +121 -0
  33. .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/parametric_actions_cartpole_embeddings_learnt_by_model.py +107 -0
  34. .venv/lib/python3.11/site-packages/ray/rllib/examples/checkpoints/__pycache__/__init__.cpython-311.pyc +0 -0
  35. .venv/lib/python3.11/site-packages/ray/rllib/examples/checkpoints/__pycache__/cartpole_dqn_export.cpython-311.pyc +0 -0
  36. .venv/lib/python3.11/site-packages/ray/rllib/examples/checkpoints/__pycache__/change_config_during_training.cpython-311.pyc +0 -0
  37. .venv/lib/python3.11/site-packages/ray/rllib/examples/checkpoints/__pycache__/checkpoint_by_custom_criteria.cpython-311.pyc +0 -0
  38. .venv/lib/python3.11/site-packages/ray/rllib/examples/checkpoints/__pycache__/restore_1_of_n_agents_from_checkpoint.cpython-311.pyc +0 -0
  39. .venv/lib/python3.11/site-packages/ray/rllib/examples/multi_agent/__init__.py +0 -0
  40. .venv/lib/python3.11/site-packages/ray/rllib/examples/multi_agent/__pycache__/__init__.cpython-311.pyc +0 -0
  41. .venv/lib/python3.11/site-packages/ray/rllib/examples/multi_agent/__pycache__/custom_heuristic_policy.cpython-311.pyc +0 -0
  42. .venv/lib/python3.11/site-packages/ray/rllib/examples/multi_agent/__pycache__/different_spaces_for_agents.cpython-311.pyc +0 -0
  43. .venv/lib/python3.11/site-packages/ray/rllib/examples/multi_agent/__pycache__/multi_agent_cartpole.cpython-311.pyc +0 -0
  44. .venv/lib/python3.11/site-packages/ray/rllib/examples/multi_agent/__pycache__/multi_agent_pendulum.cpython-311.pyc +0 -0
  45. .venv/lib/python3.11/site-packages/ray/rllib/examples/multi_agent/__pycache__/pettingzoo_independent_learning.cpython-311.pyc +0 -0
  46. .venv/lib/python3.11/site-packages/ray/rllib/examples/multi_agent/__pycache__/pettingzoo_parameter_sharing.cpython-311.pyc +0 -0
  47. .venv/lib/python3.11/site-packages/ray/rllib/examples/multi_agent/__pycache__/pettingzoo_shared_value_function.cpython-311.pyc +0 -0
  48. .venv/lib/python3.11/site-packages/ray/rllib/examples/multi_agent/__pycache__/rock_paper_scissors_heuristic_vs_learned.cpython-311.pyc +0 -0
  49. .venv/lib/python3.11/site-packages/ray/rllib/examples/multi_agent/__pycache__/rock_paper_scissors_learned_vs_learned.cpython-311.pyc +0 -0
  50. .venv/lib/python3.11/site-packages/ray/rllib/examples/multi_agent/__pycache__/self_play_league_based_with_open_spiel.cpython-311.pyc +0 -0
.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/__init__.py ADDED
File without changes
.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/__pycache__/__init__.cpython-311.pyc ADDED
Binary file (206 Bytes).
.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/__pycache__/attention_net_supervised.cpython-311.pyc ADDED
Binary file (4.78 kB).
.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/__pycache__/parametric_actions_cartpole.cpython-311.pyc ADDED
Binary file (4.59 kB).
.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/__pycache__/parametric_actions_cartpole_embeddings_learnt_by_model.cpython-311.pyc ADDED
Binary file (4.3 kB).
.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/attention_net_supervised.py ADDED
@@ -0,0 +1,77 @@
+ # @OldAPIStack
+ from gymnasium.spaces import Box, Discrete
+ import numpy as np
+
+ from ray.rllib.models.tf.attention_net import TrXLNet
+ from ray.rllib.utils.framework import try_import_tf
+
+ tf1, tf, tfv = try_import_tf()
+
+
+ def bit_shift_generator(seq_length, shift, batch_size):
+     while True:
+         values = np.array([0.0, 1.0], dtype=np.float32)
+         seq = np.random.choice(values, (batch_size, seq_length, 1))
+         targets = np.squeeze(np.roll(seq, shift, axis=1).astype(np.int32))
+         targets[:, :shift] = 0
+         yield seq, targets
+
+
+ def train_loss(targets, outputs):
+     loss = tf.nn.sparse_softmax_cross_entropy_with_logits(
+         labels=targets, logits=outputs
+     )
+     return tf.reduce_mean(loss)
+
+
+ def train_bit_shift(seq_length, num_iterations, print_every_n):
+
+     optimizer = tf.keras.optimizers.Adam(1e-3)
+
+     model = TrXLNet(
+         observation_space=Box(low=0, high=1, shape=(1,), dtype=np.int32),
+         action_space=Discrete(2),
+         num_outputs=2,
+         model_config={"max_seq_len": seq_length},
+         name="trxl",
+         num_transformer_units=1,
+         attention_dim=10,
+         num_heads=5,
+         head_dim=20,
+         position_wise_mlp_dim=20,
+     )
+
+     shift = 10
+     train_batch = 10
+     test_batch = 100
+     data_gen = bit_shift_generator(seq_length, shift=shift, batch_size=train_batch)
+     test_gen = bit_shift_generator(seq_length, shift=shift, batch_size=test_batch)
+
+     @tf.function
+     def update_step(inputs, targets):
+         model_out = model(
+             {"obs": inputs},
+             state=[tf.reshape(inputs, [-1, seq_length, 1])],
+             seq_lens=np.full(shape=(train_batch,), fill_value=seq_length),
+         )
+         optimizer.minimize(
+             lambda: train_loss(targets, model_out), lambda: model.trainable_variables
+         )
+
+     for i, (inputs, targets) in zip(range(num_iterations), data_gen):
+         inputs_in = np.reshape(inputs, [-1, 1])
+         targets_in = np.reshape(targets, [-1])
+         update_step(tf.convert_to_tensor(inputs_in), tf.convert_to_tensor(targets_in))
+
+         if i % print_every_n == 0:
+             test_inputs, test_targets = next(test_gen)
+             print(i, train_loss(test_targets, model(test_inputs)))
+
+
+ if __name__ == "__main__":
+     tf.enable_eager_execution()
+     train_bit_shift(
+         seq_length=20,
+         num_iterations=2000,
+         print_every_n=200,
+     )
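The bit-shift task this script trains on can be checked without TensorFlow or RLlib. The sketch below (plain NumPy, with assumed small toy sizes) reproduces the target construction from `bit_shift_generator`: the target at step `t` is the input bit from step `t - shift`, with the first `shift` positions zeroed out.

```python
import numpy as np

seq_length, shift, batch_size = 8, 3, 2
rng = np.random.default_rng(0)

# Random binary input sequences, shape (batch, seq, 1).
seq = rng.choice(
    np.array([0.0, 1.0], dtype=np.float32), size=(batch_size, seq_length, 1)
)

# Same construction as bit_shift_generator: roll along the time axis,
# then zero out the first `shift` positions (no valid history there).
targets = np.squeeze(np.roll(seq, shift, axis=1).astype(np.int32))
targets[:, :shift] = 0

# For t >= shift, target[t] equals the input bit at t - shift.
assert (targets[:, shift:] == np.squeeze(seq)[:, :-shift].astype(np.int32)).all()
```

A model that can attend `shift` steps back in time solves this task exactly, which is what makes it a good supervised sanity check for the TrXL attention net.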
.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/__init__.py ADDED
File without changes
.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/__pycache__/__init__.cpython-311.pyc ADDED
Binary file (213 Bytes).
.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/__pycache__/action_mask_model.cpython-311.pyc ADDED
Binary file (5.32 kB).
.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/__pycache__/autoregressive_action_dist.cpython-311.pyc ADDED
Binary file (9.65 kB).
.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/__pycache__/autoregressive_action_model.cpython-311.pyc ADDED
Binary file (8.05 kB).
.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/__pycache__/centralized_critic_models.cpython-311.pyc ADDED
Binary file (10.5 kB).
.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/__pycache__/custom_loss_model.cpython-311.pyc ADDED
Binary file (8.29 kB).
.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/__pycache__/fast_model.cpython-311.pyc ADDED
Binary file (5.56 kB).
.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/__pycache__/mobilenet_v2_encoder.cpython-311.pyc ADDED
Binary file (3.04 kB).
.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/__pycache__/mobilenet_v2_with_lstm_models.cpython-311.pyc ADDED
Binary file (9.69 kB).
.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/__pycache__/parametric_actions_model.cpython-311.pyc ADDED
Binary file (8.57 kB).
.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/__pycache__/shared_weights_model.cpython-311.pyc ADDED
Binary file (11.1 kB).
.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/__pycache__/simple_rpg_model.cpython-311.pyc ADDED
Binary file (4.19 kB).
.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/action_mask_model.py ADDED
@@ -0,0 +1,126 @@
+ # @OldAPIStack
+ from gymnasium.spaces import Dict
+
+ from ray.rllib.models.tf.fcnet import FullyConnectedNetwork
+ from ray.rllib.models.tf.tf_modelv2 import TFModelV2
+ from ray.rllib.models.torch.torch_modelv2 import TorchModelV2
+ from ray.rllib.models.torch.fcnet import FullyConnectedNetwork as TorchFC
+ from ray.rllib.utils.framework import try_import_tf, try_import_torch
+ from ray.rllib.utils.torch_utils import FLOAT_MIN
+
+ tf1, tf, tfv = try_import_tf()
+ torch, nn = try_import_torch()
+
+
+ class ActionMaskModel(TFModelV2):
+     """Model that handles simple discrete action masking.
+
+     This assumes the outputs are logits for a single Categorical action dist.
+     Getting this to work with a more complex output (e.g., if the action space
+     is a tuple of several distributions) is also possible but left as an
+     exercise to the reader.
+     """
+
+     def __init__(
+         self, obs_space, action_space, num_outputs, model_config, name, **kwargs
+     ):
+
+         orig_space = getattr(obs_space, "original_space", obs_space)
+         assert (
+             isinstance(orig_space, Dict)
+             and "action_mask" in orig_space.spaces
+             and "observations" in orig_space.spaces
+         )
+
+         super().__init__(obs_space, action_space, num_outputs, model_config, name)
+
+         self.internal_model = FullyConnectedNetwork(
+             orig_space["observations"],
+             action_space,
+             num_outputs,
+             model_config,
+             name + "_internal",
+         )
+
+         # disable action masking --> will likely lead to invalid actions
+         self.no_masking = model_config["custom_model_config"].get("no_masking", False)
+
+     def forward(self, input_dict, state, seq_lens):
+         # Extract the available actions tensor from the observation.
+         action_mask = input_dict["obs"]["action_mask"]
+
+         # Compute the unmasked logits.
+         logits, _ = self.internal_model({"obs": input_dict["obs"]["observations"]})
+
+         # If action masking is disabled, directly return unmasked logits
+         if self.no_masking:
+             return logits, state
+
+         # Convert action_mask into a [0.0 || -inf]-type mask.
+         inf_mask = tf.maximum(tf.math.log(action_mask), tf.float32.min)
+         masked_logits = logits + inf_mask
+
+         # Return masked logits.
+         return masked_logits, state
+
+     def value_function(self):
+         return self.internal_model.value_function()
+
+
+ class TorchActionMaskModel(TorchModelV2, nn.Module):
+     """PyTorch version of above ActionMaskingModel."""
+
+     def __init__(
+         self,
+         obs_space,
+         action_space,
+         num_outputs,
+         model_config,
+         name,
+         **kwargs,
+     ):
+         orig_space = getattr(obs_space, "original_space", obs_space)
+         assert (
+             isinstance(orig_space, Dict)
+             and "action_mask" in orig_space.spaces
+             and "observations" in orig_space.spaces
+         )
+
+         TorchModelV2.__init__(
+             self, obs_space, action_space, num_outputs, model_config, name, **kwargs
+         )
+         nn.Module.__init__(self)
+
+         self.internal_model = TorchFC(
+             orig_space["observations"],
+             action_space,
+             num_outputs,
+             model_config,
+             name + "_internal",
+         )
+
+         # disable action masking --> will likely lead to invalid actions
+         self.no_masking = False
+         if "no_masking" in model_config["custom_model_config"]:
+             self.no_masking = model_config["custom_model_config"]["no_masking"]
+
+     def forward(self, input_dict, state, seq_lens):
+         # Extract the available actions tensor from the observation.
+         action_mask = input_dict["obs"]["action_mask"]
+
+         # Compute the unmasked logits.
+         logits, _ = self.internal_model({"obs": input_dict["obs"]["observations"]})
+
+         # If action masking is disabled, directly return unmasked logits
+         if self.no_masking:
+             return logits, state
+
+         # Convert action_mask into a [0.0 || -inf]-type mask.
+         inf_mask = torch.clamp(torch.log(action_mask), min=FLOAT_MIN)
+         masked_logits = logits + inf_mask
+
+         # Return masked logits.
+         return masked_logits, state
+
+     def value_function(self):
+         return self.internal_model.value_function()
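The `[0.0 || -inf]` masking trick both `forward` methods rely on can be demonstrated in plain NumPy, independent of RLlib; the logits and mask below are made up for illustration. `log(1) = 0` leaves valid logits unchanged, while `log(0) = -inf` (clamped to the float minimum) drives the softmax probability of invalid actions to effectively zero.

```python
import numpy as np

logits = np.array([2.0, 1.0, 0.5])
action_mask = np.array([1.0, 0.0, 1.0])  # action 1 is invalid

# Same construction as the models above: elementwise log of the mask,
# clamped so that -inf becomes a large-but-finite negative number.
with np.errstate(divide="ignore"):
    inf_mask = np.maximum(np.log(action_mask), np.finfo(np.float32).min)
masked = logits + inf_mask

# Numerically stable softmax over the masked logits.
probs = np.exp(masked - masked.max())
probs /= probs.sum()

assert probs[1] < 1e-30          # invalid action gets ~zero probability
assert abs(probs.sum() - 1.0) < 1e-6
```

Adding the mask to the logits (rather than multiplying probabilities) keeps the operation differentiable with respect to the valid logits, which is why this pattern is standard for action masking.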
.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/autoregressive_action_dist.py ADDED
@@ -0,0 +1,149 @@
+ # @OldAPIStack
+ from ray.rllib.models.tf.tf_action_dist import Categorical, ActionDistribution
+ from ray.rllib.models.torch.torch_action_dist import (
+     TorchCategorical,
+     TorchDistributionWrapper,
+ )
+ from ray.rllib.utils.framework import try_import_tf, try_import_torch
+
+ tf1, tf, tfv = try_import_tf()
+ torch, nn = try_import_torch()
+
+
+ class BinaryAutoregressiveDistribution(ActionDistribution):
+     """Action distribution P(a1, a2) = P(a1) * P(a2 | a1)"""
+
+     def deterministic_sample(self):
+         # First, sample a1.
+         a1_dist = self._a1_distribution()
+         a1 = a1_dist.deterministic_sample()
+
+         # Sample a2 conditioned on a1.
+         a2_dist = self._a2_distribution(a1)
+         a2 = a2_dist.deterministic_sample()
+         self._action_logp = a1_dist.logp(a1) + a2_dist.logp(a2)
+
+         # Return the action tuple.
+         return (a1, a2)
+
+     def sample(self):
+         # First, sample a1.
+         a1_dist = self._a1_distribution()
+         a1 = a1_dist.sample()
+
+         # Sample a2 conditioned on a1.
+         a2_dist = self._a2_distribution(a1)
+         a2 = a2_dist.sample()
+         self._action_logp = a1_dist.logp(a1) + a2_dist.logp(a2)
+
+         # Return the action tuple.
+         return (a1, a2)
+
+     def logp(self, actions):
+         a1, a2 = actions[:, 0], actions[:, 1]
+         a1_vec = tf.expand_dims(tf.cast(a1, tf.float32), 1)
+         a1_logits, a2_logits = self.model.action_model([self.inputs, a1_vec])
+         return Categorical(a1_logits).logp(a1) + Categorical(a2_logits).logp(a2)
+
+     def sampled_action_logp(self):
+         return self._action_logp
+
+     def entropy(self):
+         a1_dist = self._a1_distribution()
+         a2_dist = self._a2_distribution(a1_dist.sample())
+         return a1_dist.entropy() + a2_dist.entropy()
+
+     def kl(self, other):
+         a1_dist = self._a1_distribution()
+         a1_terms = a1_dist.kl(other._a1_distribution())
+
+         a1 = a1_dist.sample()
+         a2_terms = self._a2_distribution(a1).kl(other._a2_distribution(a1))
+         return a1_terms + a2_terms
+
+     def _a1_distribution(self):
+         BATCH = tf.shape(self.inputs)[0]
+         a1_logits, _ = self.model.action_model([self.inputs, tf.zeros((BATCH, 1))])
+         a1_dist = Categorical(a1_logits)
+         return a1_dist
+
+     def _a2_distribution(self, a1):
+         a1_vec = tf.expand_dims(tf.cast(a1, tf.float32), 1)
+         _, a2_logits = self.model.action_model([self.inputs, a1_vec])
+         a2_dist = Categorical(a2_logits)
+         return a2_dist
+
+     @staticmethod
+     def required_model_output_shape(action_space, model_config):
+         return 16  # controls model output feature vector size
+
+
+ class TorchBinaryAutoregressiveDistribution(TorchDistributionWrapper):
+     """Action distribution P(a1, a2) = P(a1) * P(a2 | a1)"""
+
+     def deterministic_sample(self):
+         # First, sample a1.
+         a1_dist = self._a1_distribution()
+         a1 = a1_dist.deterministic_sample()
+
+         # Sample a2 conditioned on a1.
+         a2_dist = self._a2_distribution(a1)
+         a2 = a2_dist.deterministic_sample()
+         self._action_logp = a1_dist.logp(a1) + a2_dist.logp(a2)
+
+         # Return the action tuple.
+         return (a1, a2)
+
+     def sample(self):
+         # First, sample a1.
+         a1_dist = self._a1_distribution()
+         a1 = a1_dist.sample()
+
+         # Sample a2 conditioned on a1.
+         a2_dist = self._a2_distribution(a1)
+         a2 = a2_dist.sample()
+         self._action_logp = a1_dist.logp(a1) + a2_dist.logp(a2)
+
+         # Return the action tuple.
+         return (a1, a2)
+
+     def logp(self, actions):
+         a1, a2 = actions[:, 0], actions[:, 1]
+         a1_vec = torch.unsqueeze(a1.float(), 1)
+         a1_logits, a2_logits = self.model.action_module(self.inputs, a1_vec)
+         return TorchCategorical(a1_logits).logp(a1) + TorchCategorical(a2_logits).logp(
+             a2
+         )
+
+     def sampled_action_logp(self):
+         return self._action_logp
+
+     def entropy(self):
+         a1_dist = self._a1_distribution()
+         a2_dist = self._a2_distribution(a1_dist.sample())
+         return a1_dist.entropy() + a2_dist.entropy()
+
+     def kl(self, other):
+         a1_dist = self._a1_distribution()
+         a1_terms = a1_dist.kl(other._a1_distribution())
+
+         a1 = a1_dist.sample()
+         a2_terms = self._a2_distribution(a1).kl(other._a2_distribution(a1))
+         return a1_terms + a2_terms
+
+     def _a1_distribution(self):
+         BATCH = self.inputs.shape[0]
+         zeros = torch.zeros((BATCH, 1)).to(self.inputs.device)
+         a1_logits, _ = self.model.action_module(self.inputs, zeros)
+         a1_dist = TorchCategorical(a1_logits)
+         return a1_dist
+
+     def _a2_distribution(self, a1):
+         a1_vec = torch.unsqueeze(a1.float(), 1)
+         _, a2_logits = self.model.action_module(self.inputs, a1_vec)
+         a2_dist = TorchCategorical(a2_logits)
+         return a2_dist
+
+     @staticmethod
+     def required_model_output_shape(action_space, model_config):
+         return 16  # controls model output feature vector size
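The factorization P(a1, a2) = P(a1) * P(a2 | a1) that both distributions implement (summing log-probs in `sample` and `logp`) can be verified numerically. The sketch below uses plain NumPy with hypothetical logits in place of RLlib's `Categorical` classes:

```python
import numpy as np

def categorical_logp(logits, a):
    # Log-softmax, then pick the chosen action's log-probability.
    z = logits - logits.max()
    return z[a] - np.log(np.exp(z).sum())

# Hypothetical logits: one head for a1, and per-a1 logits for a2.
a1_logits = np.array([0.3, -0.1])
a2_logits_given_a1 = {0: np.array([1.0, 0.0]), 1: np.array([-0.5, 0.5])}

a1, a2 = 1, 0
# Summing log-probs (as in sample()/logp() above) is multiplying probs.
joint_logp = categorical_logp(a1_logits, a1) + categorical_logp(
    a2_logits_given_a1[a1], a2
)

p1 = np.exp(categorical_logp(a1_logits, a1))
p2 = np.exp(categorical_logp(a2_logits_given_a1[a1], a2))
assert abs(np.exp(joint_logp) - p1 * p2) < 1e-12
```

This is why `_action_logp` can be stored as a simple sum: the chain rule of probability turns the product of the marginal and the conditional into a sum in log space.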
.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/autoregressive_action_model.py ADDED
@@ -0,0 +1,162 @@
+ # @OldAPIStack
+ from gymnasium.spaces import Discrete, Tuple
+
+ from ray.rllib.models.tf.misc import normc_initializer
+ from ray.rllib.models.tf.tf_modelv2 import TFModelV2
+ from ray.rllib.models.torch.misc import normc_initializer as normc_init_torch
+ from ray.rllib.models.torch.misc import SlimFC
+ from ray.rllib.models.torch.torch_modelv2 import TorchModelV2
+ from ray.rllib.utils.framework import try_import_tf, try_import_torch
+
+ tf1, tf, tfv = try_import_tf()
+ torch, nn = try_import_torch()
+
+
+ class AutoregressiveActionModel(TFModelV2):
+     """Implements the `.action_model` branch required above."""
+
+     def __init__(self, obs_space, action_space, num_outputs, model_config, name):
+         super(AutoregressiveActionModel, self).__init__(
+             obs_space, action_space, num_outputs, model_config, name
+         )
+         if action_space != Tuple([Discrete(2), Discrete(2)]):
+             raise ValueError("This model only supports the [2, 2] action space")
+
+         # Inputs
+         obs_input = tf.keras.layers.Input(shape=obs_space.shape, name="obs_input")
+         a1_input = tf.keras.layers.Input(shape=(1,), name="a1_input")
+         ctx_input = tf.keras.layers.Input(shape=(num_outputs,), name="ctx_input")
+
+         # Output of the model (normally 'logits', but for an autoregressive
+         # dist this is more like a context/feature layer encoding the obs)
+         context = tf.keras.layers.Dense(
+             num_outputs,
+             name="hidden",
+             activation=tf.nn.tanh,
+             kernel_initializer=normc_initializer(1.0),
+         )(obs_input)
+
+         # V(s)
+         value_out = tf.keras.layers.Dense(
+             1,
+             name="value_out",
+             activation=None,
+             kernel_initializer=normc_initializer(0.01),
+         )(context)
+
+         # P(a1 | obs)
+         a1_logits = tf.keras.layers.Dense(
+             2,
+             name="a1_logits",
+             activation=None,
+             kernel_initializer=normc_initializer(0.01),
+         )(ctx_input)
+
+         # P(a2 | a1)
+         # --note: typically you'd want to implement P(a2 | a1, obs) as follows:
+         # a2_context = tf.keras.layers.Concatenate(axis=1)(
+         #     [ctx_input, a1_input])
+         a2_context = a1_input
+         a2_hidden = tf.keras.layers.Dense(
+             16,
+             name="a2_hidden",
+             activation=tf.nn.tanh,
+             kernel_initializer=normc_initializer(1.0),
+         )(a2_context)
+         a2_logits = tf.keras.layers.Dense(
+             2,
+             name="a2_logits",
+             activation=None,
+             kernel_initializer=normc_initializer(0.01),
+         )(a2_hidden)
+
+         # Base layers
+         self.base_model = tf.keras.Model(obs_input, [context, value_out])
+         self.base_model.summary()
+
+         # Autoregressive action sampler
+         self.action_model = tf.keras.Model(
+             [ctx_input, a1_input], [a1_logits, a2_logits]
+         )
+         self.action_model.summary()
+
+     def forward(self, input_dict, state, seq_lens):
+         context, self._value_out = self.base_model(input_dict["obs"])
+         return context, state
+
+     def value_function(self):
+         return tf.reshape(self._value_out, [-1])
+
+
+ class TorchAutoregressiveActionModel(TorchModelV2, nn.Module):
+     """PyTorch version of the AutoregressiveActionModel above."""
+
+     def __init__(self, obs_space, action_space, num_outputs, model_config, name):
+         TorchModelV2.__init__(
+             self, obs_space, action_space, num_outputs, model_config, name
+         )
+         nn.Module.__init__(self)
+
+         if action_space != Tuple([Discrete(2), Discrete(2)]):
+             raise ValueError("This model only supports the [2, 2] action space")
+
+         # Output of the model (normally 'logits', but for an autoregressive
+         # dist this is more like a context/feature layer encoding the obs)
+         self.context_layer = SlimFC(
+             in_size=obs_space.shape[0],
+             out_size=num_outputs,
+             initializer=normc_init_torch(1.0),
+             activation_fn=nn.Tanh,
+         )
+
+         # V(s)
+         self.value_branch = SlimFC(
+             in_size=num_outputs,
+             out_size=1,
+             initializer=normc_init_torch(0.01),
+             activation_fn=None,
+         )
+
+         # P(a1 | obs)
+         self.a1_logits = SlimFC(
+             in_size=num_outputs,
+             out_size=2,
+             activation_fn=None,
+             initializer=normc_init_torch(0.01),
+         )
+
+         class _ActionModel(nn.Module):
+             def __init__(self):
+                 nn.Module.__init__(self)
+                 self.a2_hidden = SlimFC(
+                     in_size=1,
+                     out_size=16,
+                     activation_fn=nn.Tanh,
+                     initializer=normc_init_torch(1.0),
+                 )
+                 self.a2_logits = SlimFC(
+                     in_size=16,
+                     out_size=2,
+                     activation_fn=None,
+                     initializer=normc_init_torch(0.01),
+                 )
+
+             def forward(self_, ctx_input, a1_input):
+                 a1_logits = self.a1_logits(ctx_input)
+                 a2_logits = self_.a2_logits(self_.a2_hidden(a1_input))
+                 return a1_logits, a2_logits
+
+         # P(a2 | a1)
+         # --note: typically you'd want to implement P(a2 | a1, obs) as follows:
+         # a2_context = tf.keras.layers.Concatenate(axis=1)(
+         #     [ctx_input, a1_input])
+         self.action_module = _ActionModel()
+
+         self._context = None
+
+     def forward(self, input_dict, state, seq_lens):
+         self._context = self.context_layer(input_dict["obs"])
+         return self._context, state
+
+     def value_function(self):
+         return torch.reshape(self.value_branch(self._context), [-1])
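The two-pass flow this model enables (the base model encodes the observation into a context once; the distribution then calls the action branch twice, first with a dummy a1 to get P(a1), then with the sampled a1 to get P(a2 | a1)) can be sketched with toy NumPy stand-ins. All weights and shapes below are made up for illustration; they play the roles of `context_layer`, `a1_logits`, and `a2_hidden`/`a2_logits` above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the model branches: obs -> context,
# context -> a1 logits, and a1 -> a2 logits.
W_ctx = rng.normal(size=(4, 16))
W_a1 = rng.normal(size=(16, 2))
W_a2 = rng.normal(size=(1, 2))

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

obs = rng.normal(size=4)
context = np.tanh(obs @ W_ctx)  # one forward pass through the base model

# Pass 1: sample a1 from P(a1 | obs) using the context.
a1 = rng.choice(2, p=softmax(context @ W_a1))
# Pass 2: feed the sampled a1 back in to get P(a2 | a1).
a2 = rng.choice(2, p=softmax(np.array([float(a1)]) @ W_a2))
assert a1 in (0, 1) and a2 in (0, 1)
```

Note that, exactly as in the example file, this toy a2 head conditions only on a1, not on the observation; conditioning on both would mean concatenating the context with a1 before the a2 head.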
.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/centralized_critic_models.py ADDED
@@ -0,0 +1,182 @@
+ # @OldAPIStack
+ from gymnasium.spaces import Box
+
+ from ray.rllib.models.modelv2 import ModelV2
+ from ray.rllib.models.tf.tf_modelv2 import TFModelV2
+ from ray.rllib.models.tf.fcnet import FullyConnectedNetwork
+ from ray.rllib.models.torch.misc import SlimFC
+ from ray.rllib.models.torch.torch_modelv2 import TorchModelV2
+ from ray.rllib.models.torch.fcnet import FullyConnectedNetwork as TorchFC
+ from ray.rllib.utils.annotations import override
+ from ray.rllib.utils.framework import try_import_tf, try_import_torch
+
+ tf1, tf, tfv = try_import_tf()
+ torch, nn = try_import_torch()
+
+
+ class CentralizedCriticModel(TFModelV2):
+     """Multi-agent model that implements a centralized value function."""
+
+     def __init__(self, obs_space, action_space, num_outputs, model_config, name):
+         super(CentralizedCriticModel, self).__init__(
+             obs_space, action_space, num_outputs, model_config, name
+         )
+         # Base of the model
+         self.model = FullyConnectedNetwork(
+             obs_space, action_space, num_outputs, model_config, name
+         )
+
+         # Central VF maps (obs, opp_obs, opp_act) -> vf_pred
+         obs = tf.keras.layers.Input(shape=(6,), name="obs")
+         opp_obs = tf.keras.layers.Input(shape=(6,), name="opp_obs")
+         opp_act = tf.keras.layers.Input(shape=(2,), name="opp_act")
+         concat_obs = tf.keras.layers.Concatenate(axis=1)([obs, opp_obs, opp_act])
+         central_vf_dense = tf.keras.layers.Dense(
+             16, activation=tf.nn.tanh, name="c_vf_dense"
+         )(concat_obs)
+         central_vf_out = tf.keras.layers.Dense(1, activation=None, name="c_vf_out")(
+             central_vf_dense
+         )
+         self.central_vf = tf.keras.Model(
+             inputs=[obs, opp_obs, opp_act], outputs=central_vf_out
+         )
+
+     @override(ModelV2)
+     def forward(self, input_dict, state, seq_lens):
+         return self.model.forward(input_dict, state, seq_lens)
+
+     def central_value_function(self, obs, opponent_obs, opponent_actions):
+         return tf.reshape(
+             self.central_vf(
+                 [obs, opponent_obs, tf.one_hot(tf.cast(opponent_actions, tf.int32), 2)]
+             ),
+             [-1],
+         )
+
+     @override(ModelV2)
+     def value_function(self):
+         return self.model.value_function()  # not used
+
+
+ class YetAnotherCentralizedCriticModel(TFModelV2):
+     """Multi-agent model that implements a centralized value function.
+
+     It assumes the observation is a dict with 'own_obs' and 'opponent_obs', the
+     former of which can be used for computing actions (i.e., decentralized
+     execution), and the latter for optimization (i.e., centralized learning).
+
+     This model has two parts:
+     - An action model that looks at just 'own_obs' to compute actions
+     - A value model that also looks at the 'opponent_obs' / 'opponent_action'
+       to compute the value (it does this by using the 'obs_flat' tensor).
+     """
+
+     def __init__(self, obs_space, action_space, num_outputs, model_config, name):
+         super(YetAnotherCentralizedCriticModel, self).__init__(
+             obs_space, action_space, num_outputs, model_config, name
+         )
+
+         self.action_model = FullyConnectedNetwork(
+             Box(low=0, high=1, shape=(6,)),  # one-hot encoded Discrete(6)
+             action_space,
+             num_outputs,
+             model_config,
+             name + "_action",
+         )
+
+         self.value_model = FullyConnectedNetwork(
+             obs_space, action_space, 1, model_config, name + "_vf"
+         )
+
+     def forward(self, input_dict, state, seq_lens):
+         self._value_out, _ = self.value_model(
+             {"obs": input_dict["obs_flat"]}, state, seq_lens
+         )
+         return self.action_model({"obs": input_dict["obs"]["own_obs"]}, state, seq_lens)
+
+     def value_function(self):
+         return tf.reshape(self._value_out, [-1])
+
+
+ class TorchCentralizedCriticModel(TorchModelV2, nn.Module):
+     """Multi-agent model that implements a centralized VF."""
+
+     def __init__(self, obs_space, action_space, num_outputs, model_config, name):
+         TorchModelV2.__init__(
+             self, obs_space, action_space, num_outputs, model_config, name
+         )
+         nn.Module.__init__(self)
+
+         # Base of the model
+         self.model = TorchFC(obs_space, action_space, num_outputs, model_config, name)
+
+         # Central VF maps (obs, opp_obs, opp_act) -> vf_pred
+         input_size = 6 + 6 + 2  # obs + opp_obs + opp_act
+         self.central_vf = nn.Sequential(
+             SlimFC(input_size, 16, activation_fn=nn.Tanh),
+             SlimFC(16, 1),
+         )
+
+     @override(ModelV2)
+     def forward(self, input_dict, state, seq_lens):
+         model_out, _ = self.model(input_dict, state, seq_lens)
+         return model_out, []
+
+     def central_value_function(self, obs, opponent_obs, opponent_actions):
+         input_ = torch.cat(
+             [
+                 obs,
+                 opponent_obs,
+                 torch.nn.functional.one_hot(opponent_actions.long(), 2).float(),
+             ],
+             1,
+         )
+         return torch.reshape(self.central_vf(input_), [-1])
+
+     @override(ModelV2)
+     def value_function(self):
+         return self.model.value_function()  # not used
+
+
+ class YetAnotherTorchCentralizedCriticModel(TorchModelV2, nn.Module):
+     """Multi-agent model that implements a centralized value function.
+
+     It assumes the observation is a dict with 'own_obs' and 'opponent_obs', the
+     former of which can be used for computing actions (i.e., decentralized
+     execution), and the latter for optimization (i.e., centralized learning).
+
+     This model has two parts:
+     - An action model that looks at just 'own_obs' to compute actions
+     - A value model that also looks at the 'opponent_obs' / 'opponent_action'
+       to compute the value (it does this by using the 'obs_flat' tensor).
152
+ """
153
+
154
+ def __init__(self, obs_space, action_space, num_outputs, model_config, name):
155
+ TorchModelV2.__init__(
156
+ self, obs_space, action_space, num_outputs, model_config, name
157
+ )
158
+ nn.Module.__init__(self)
159
+
160
+ self.action_model = TorchFC(
161
+ Box(low=0, high=1, shape=(6,)), # one-hot encoded Discrete(6)
162
+ action_space,
163
+ num_outputs,
164
+ model_config,
165
+ name + "_action",
166
+ )
167
+
168
+ self.value_model = TorchFC(
169
+ obs_space, action_space, 1, model_config, name + "_vf"
170
+ )
171
+ self._model_in = None
172
+
173
+ def forward(self, input_dict, state, seq_lens):
174
+ # Store model-input for possible `value_function()` call.
175
+ self._model_in = [input_dict["obs_flat"], state, seq_lens]
176
+ return self.action_model({"obs": input_dict["obs"]["own_obs"]}, state, seq_lens)
177
+
178
+ def value_function(self):
179
+ value_out, _ = self.value_model(
180
+ {"obs": self._model_in[0]}, self._model_in[1], self._model_in[2]
181
+ )
182
+ return torch.reshape(value_out, [-1])
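The central value functions above embed the opponent's discrete action via a two-class one-hot encoding before concatenating it with both observations. A quick framework-free sketch of that encoding (pure Python; the helper name is ours, standing in for `tf.one_hot` / `torch.nn.functional.one_hot`):

```python
def one_hot(actions, num_classes=2):
    # One-hot encode a batch of integer actions, as central_value_function
    # does for opponent_actions before feeding the central VF.
    return [[1.0 if i == a else 0.0 for i in range(num_classes)] for a in actions]

# A batch of two opponent actions.
print(one_hot([0, 1]))  # -> [[1.0, 0.0], [0.0, 1.0]]
```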
.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/custom_loss_model.py ADDED
@@ -0,0 +1,137 @@
+import numpy as np
+
+from ray.rllib.models.modelv2 import ModelV2, restore_original_dimensions
+from ray.rllib.models.tf.tf_action_dist import Categorical
+from ray.rllib.models.tf.tf_modelv2 import TFModelV2
+from ray.rllib.models.tf.fcnet import FullyConnectedNetwork
+from ray.rllib.models.torch.torch_action_dist import TorchCategorical
+from ray.rllib.models.torch.torch_modelv2 import TorchModelV2
+from ray.rllib.models.torch.fcnet import FullyConnectedNetwork as TorchFC
+from ray.rllib.utils.annotations import override
+from ray.rllib.utils.framework import try_import_tf, try_import_torch
+from ray.rllib.offline import JsonReader
+
+tf1, tf, tfv = try_import_tf()
+torch, nn = try_import_torch()
+
+
+class CustomLossModel(TFModelV2):
+    """Custom model that adds an imitation loss on top of the policy loss."""
+
+    def __init__(self, obs_space, action_space, num_outputs, model_config, name):
+        super().__init__(obs_space, action_space, num_outputs, model_config, name)
+
+        self.fcnet = FullyConnectedNetwork(
+            self.obs_space, self.action_space, num_outputs, model_config, name="fcnet"
+        )
+
+    @override(ModelV2)
+    def forward(self, input_dict, state, seq_lens):
+        # Delegate to our FCNet.
+        return self.fcnet(input_dict, state, seq_lens)
+
+    @override(ModelV2)
+    def value_function(self):
+        # Delegate to our FCNet.
+        return self.fcnet.value_function()
+
+    @override(ModelV2)
+    def custom_loss(self, policy_loss, loss_inputs):
+        # Create a new input reader per worker.
+        reader = JsonReader(self.model_config["custom_model_config"]["input_files"])
+        input_ops = reader.tf_input_ops()
+
+        # Define a secondary loss by building a graph copy with weight sharing.
+        obs = restore_original_dimensions(
+            tf.cast(input_ops["obs"], tf.float32), self.obs_space
+        )
+        logits, _ = self.forward({"obs": obs}, [], None)
+
+        # Compute the IL loss.
+        action_dist = Categorical(logits, self.model_config)
+        self.policy_loss = policy_loss
+        self.imitation_loss = tf.reduce_mean(-action_dist.logp(input_ops["actions"]))
+        return policy_loss + 10 * self.imitation_loss
+
+    def metrics(self):
+        return {
+            "policy_loss": self.policy_loss,
+            "imitation_loss": self.imitation_loss,
+        }
+
+
+class TorchCustomLossModel(TorchModelV2, nn.Module):
+    """PyTorch version of the CustomLossModel above."""
+
+    def __init__(
+        self, obs_space, action_space, num_outputs, model_config, name, input_files
+    ):
+        super().__init__(obs_space, action_space, num_outputs, model_config, name)
+        nn.Module.__init__(self)
+
+        self.input_files = input_files
+        # Create a new input reader per worker.
+        self.reader = JsonReader(self.input_files)
+        self.fcnet = TorchFC(
+            self.obs_space, self.action_space, num_outputs, model_config, name="fcnet"
+        )
+
+    @override(ModelV2)
+    def forward(self, input_dict, state, seq_lens):
+        # Delegate to our FCNet.
+        return self.fcnet(input_dict, state, seq_lens)
+
+    @override(ModelV2)
+    def value_function(self):
+        # Delegate to our FCNet.
+        return self.fcnet.value_function()
+
+    @override(ModelV2)
+    def custom_loss(self, policy_loss, loss_inputs):
+        """Calculates a custom loss on top of the given policy_loss(es).
+
+        Args:
+            policy_loss (List[TensorType]): The list of already calculated
+                policy losses (as many as there are optimizers).
+            loss_inputs: Struct of np.ndarrays holding the
+                entire train batch.
+
+        Returns:
+            List[TensorType]: The altered list of policy losses. In case the
+                custom loss should have its own optimizer, make sure the
+                returned list is one larger than the incoming policy_loss list.
+                In case you simply want to mix in the custom loss into the
+                already calculated policy losses, return a list of altered
+                policy losses (as done in this example below).
+        """
+        # Get the next batch from our input files.
+        batch = self.reader.next()
+
+        # Define a secondary loss by building a graph copy with weight sharing.
+        obs = restore_original_dimensions(
+            torch.from_numpy(batch["obs"]).float().to(policy_loss[0].device),
+            self.obs_space,
+            tensorlib="torch",
+        )
+        logits, _ = self.forward({"obs": obs}, [], None)
+
+        # Compute the IL loss.
+        action_dist = TorchCategorical(logits, self.model_config)
+        imitation_loss = torch.mean(
+            -action_dist.logp(
+                torch.from_numpy(batch["actions"]).to(policy_loss[0].device)
+            )
+        )
+        self.imitation_loss_metric = imitation_loss.item()
+        self.policy_loss_metric = np.mean([loss.item() for loss in policy_loss])
+
+        # Add the imitation loss to each already calculated policy loss term.
+        # Alternatively (if custom loss has its own optimizer):
+        # return policy_loss + [10 * self.imitation_loss]
+        return [loss_ + 10 * imitation_loss for loss_ in policy_loss]
+
+    def metrics(self):
+        return {
+            "policy_loss": self.policy_loss_metric,
+            "imitation_loss": self.imitation_loss_metric,
+        }
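Both `custom_loss` methods mix the imitation term into the policy loss with a fixed weight of 10. The weighting itself can be sketched without any framework (`combined_losses` is a hypothetical helper, not RLlib API):

```python
def combined_losses(policy_losses, imitation_loss, weight=10.0):
    # Mix the imitation term into every policy-loss term, matching the
    # shape of the Torch model's return statement above.
    return [pl + weight * imitation_loss for pl in policy_losses]

print(combined_losses([1.0, 2.0], 0.5))  # -> [6.0, 7.0]
```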
.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/fast_model.py ADDED
@@ -0,0 +1,80 @@
+# @OldAPIStack
+from ray.rllib.models.modelv2 import ModelV2
+from ray.rllib.models.tf.tf_modelv2 import TFModelV2
+from ray.rllib.models.torch.misc import SlimFC
+from ray.rllib.models.torch.torch_modelv2 import TorchModelV2
+from ray.rllib.utils.annotations import override
+from ray.rllib.utils.framework import try_import_tf, try_import_torch
+
+tf1, tf, tfv = try_import_tf()
+torch, nn = try_import_torch()
+
+
+class FastModel(TFModelV2):
+    """An example for a non-Keras ModelV2 in tf that learns a single weight.
+
+    Defines all network architecture in `forward` (not `__init__` as it's
+    usually done for Keras-style TFModelV2s).
+    """
+
+    def __init__(self, obs_space, action_space, num_outputs, model_config, name):
+        super().__init__(obs_space, action_space, num_outputs, model_config, name)
+        # Have we registered our vars yet (see `forward`)?
+        self._registered = False
+
+    @override(ModelV2)
+    def forward(self, input_dict, state, seq_lens):
+        with tf1.variable_scope("model", reuse=tf1.AUTO_REUSE):
+            bias = tf1.get_variable(
+                dtype=tf.float32,
+                name="bias",
+                initializer=tf.keras.initializers.Zeros(),
+                shape=(),
+            )
+        output = bias + tf.zeros([tf.shape(input_dict["obs"])[0], self.num_outputs])
+        self._value_out = tf.reduce_mean(output, -1)  # fake value
+
+        if not self._registered:
+            self.register_variables(
+                tf1.get_collection(
+                    tf1.GraphKeys.TRAINABLE_VARIABLES, scope=".+/model/.+"
+                )
+            )
+            self._registered = True
+
+        return output, []
+
+    @override(ModelV2)
+    def value_function(self):
+        return tf.reshape(self._value_out, [-1])
+
+
+class TorchFastModel(TorchModelV2, nn.Module):
+    """Torch version of FastModel (tf)."""
+
+    def __init__(self, obs_space, action_space, num_outputs, model_config, name):
+        TorchModelV2.__init__(
+            self, obs_space, action_space, num_outputs, model_config, name
+        )
+        nn.Module.__init__(self)
+
+        self.bias = nn.Parameter(
+            torch.tensor([0.0], dtype=torch.float32, requires_grad=True)
+        )
+
+        # Only needed to give some params to the optimizer (even though,
+        # they are never used anywhere).
+        self.dummy_layer = SlimFC(1, 1)
+        self._output = None
+
+    @override(ModelV2)
+    def forward(self, input_dict, state, seq_lens):
+        self._output = self.bias + torch.zeros(
+            size=(input_dict["obs"].shape[0], self.num_outputs)
+        ).to(self.bias.device)
+        return self._output, []
+
+    @override(ModelV2)
+    def value_function(self):
+        assert self._output is not None, "must call forward first!"
+        return torch.reshape(torch.mean(self._output, -1), [-1])
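Both FastModel variants simply broadcast one learned bias over a zero output tensor and use the per-row mean as a "fake" value. A pure-Python stand-in for that forward pass (our own helper, for illustration only):

```python
def fast_forward(bias, batch_size, num_outputs):
    # Broadcast the single learned weight over a [batch, num_outputs]
    # zero tensor, mirroring FastModel.forward.
    output = [[bias] * num_outputs for _ in range(batch_size)]
    # "Fake" value: per-row mean of the output, as in self._value_out.
    values = [sum(row) / num_outputs for row in output]
    return output, values

out, vals = fast_forward(0.5, 2, 3)
print(out, vals)  # -> [[0.5, 0.5, 0.5], [0.5, 0.5, 0.5]] [0.5, 0.5]
```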
.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/mobilenet_v2_encoder.py ADDED
@@ -0,0 +1,48 @@
+# @OldAPIStack
+"""
+This file implements a MobileNet v2 Encoder.
+It uses MobileNet v2 to encode images into a latent space of 1000 dimensions.
+
+Depending on the experiment, the MobileNet v2 encoder layers can be frozen or
+unfrozen. This is controlled by the `freeze` parameter in the config.
+
+This is an example of how a pre-trained neural network can be used as an encoder
+in RLlib. You can modify this example to accommodate your own encoder network or
+other pre-trained networks.
+"""
+
+from ray.rllib.core.models.base import Encoder, ENCODER_OUT
+from ray.rllib.core.models.configs import ModelConfig
+from ray.rllib.core.models.torch.base import TorchModel
+from ray.rllib.utils.framework import try_import_torch
+
+torch, nn = try_import_torch()
+
+MOBILENET_INPUT_SHAPE = (3, 224, 224)
+
+
+class MobileNetV2EncoderConfig(ModelConfig):
+    # MobileNet v2 has a flat output with a length of 1000.
+    output_dims = (1000,)
+    freeze = True
+
+    def build(self, framework):
+        assert framework == "torch", "Unsupported framework `{}`!".format(framework)
+        return MobileNetV2Encoder(self)
+
+
+class MobileNetV2Encoder(TorchModel, Encoder):
+    """A MobileNet v2 encoder for RLlib."""
+
+    def __init__(self, config):
+        super().__init__(config)
+        self.net = torch.hub.load(
+            "pytorch/vision:v0.6.0", "mobilenet_v2", pretrained=True
+        )
+        if config.freeze:
+            # We don't want to train this encoder, so freeze its parameters!
+            for p in self.net.parameters():
+                p.requires_grad = False
+
+    def _forward(self, input_dict, **kwargs):
+        return {ENCODER_OUT: (self.net(input_dict["obs"]))}
.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/mobilenet_v2_with_lstm_models.py ADDED
@@ -0,0 +1,160 @@
+# @OldAPIStack
+import numpy as np
+
+from ray.rllib.models.modelv2 import ModelV2
+from ray.rllib.models.tf.recurrent_net import RecurrentNetwork
+from ray.rllib.models.torch.misc import SlimFC
+from ray.rllib.models.torch.recurrent_net import RecurrentNetwork as TorchRNN
+from ray.rllib.utils.annotations import override
+from ray.rllib.utils.framework import try_import_tf, try_import_torch
+
+tf1, tf, tfv = try_import_tf()
+torch, nn = try_import_torch()
+
+
+class MobileV2PlusRNNModel(RecurrentNetwork):
+    """A conv. + recurrent keras net example using a pre-trained MobileNet."""
+
+    def __init__(
+        self, obs_space, action_space, num_outputs, model_config, name, cnn_shape
+    ):
+
+        super(MobileV2PlusRNNModel, self).__init__(
+            obs_space, action_space, num_outputs, model_config, name
+        )
+
+        self.cell_size = 16
+        visual_size = cnn_shape[0] * cnn_shape[1] * cnn_shape[2]
+
+        state_in_h = tf.keras.layers.Input(shape=(self.cell_size,), name="h")
+        state_in_c = tf.keras.layers.Input(shape=(self.cell_size,), name="c")
+        seq_in = tf.keras.layers.Input(shape=(), name="seq_in", dtype=tf.int32)
+
+        inputs = tf.keras.layers.Input(shape=(None, visual_size), name="visual_inputs")
+
+        input_visual = inputs
+        input_visual = tf.reshape(
+            input_visual, [-1, cnn_shape[0], cnn_shape[1], cnn_shape[2]]
+        )
+        cnn_input = tf.keras.layers.Input(shape=cnn_shape, name="cnn_input")
+
+        cnn_model = tf.keras.applications.mobilenet_v2.MobileNetV2(
+            alpha=1.0,
+            include_top=True,
+            weights=None,
+            input_tensor=cnn_input,
+            pooling=None,
+        )
+        vision_out = cnn_model(input_visual)
+        vision_out = tf.reshape(
+            vision_out, [-1, tf.shape(inputs)[1], vision_out.shape.as_list()[-1]]
+        )
+
+        lstm_out, state_h, state_c = tf.keras.layers.LSTM(
+            self.cell_size, return_sequences=True, return_state=True, name="lstm"
+        )(
+            inputs=vision_out,
+            mask=tf.sequence_mask(seq_in),
+            initial_state=[state_in_h, state_in_c],
+        )
+
+        # Postprocess LSTM output with another hidden layer and compute values.
+        logits = tf.keras.layers.Dense(
+            self.num_outputs, activation=tf.keras.activations.linear, name="logits"
+        )(lstm_out)
+        values = tf.keras.layers.Dense(1, activation=None, name="values")(lstm_out)
+
+        # Create the RNN model
+        self.rnn_model = tf.keras.Model(
+            inputs=[inputs, seq_in, state_in_h, state_in_c],
+            outputs=[logits, values, state_h, state_c],
+        )
+        self.rnn_model.summary()
+
+    @override(RecurrentNetwork)
+    def forward_rnn(self, inputs, state, seq_lens):
+        model_out, self._value_out, h, c = self.rnn_model([inputs, seq_lens] + state)
+        return model_out, [h, c]
+
+    @override(ModelV2)
+    def get_initial_state(self):
+        return [
+            np.zeros(self.cell_size, np.float32),
+            np.zeros(self.cell_size, np.float32),
+        ]
+
+    @override(ModelV2)
+    def value_function(self):
+        return tf.reshape(self._value_out, [-1])
+
+
+class TorchMobileV2PlusRNNModel(TorchRNN, nn.Module):
+    """A conv. + recurrent torch net example using a pre-trained MobileNet."""
+
+    def __init__(
+        self, obs_space, action_space, num_outputs, model_config, name, cnn_shape
+    ):
+
+        TorchRNN.__init__(
+            self, obs_space, action_space, num_outputs, model_config, name
+        )
+        nn.Module.__init__(self)
+
+        self.lstm_state_size = 16
+        self.cnn_shape = list(cnn_shape)
+        self.visual_size_in = cnn_shape[0] * cnn_shape[1] * cnn_shape[2]
+        # MobileNetV2 has a flat output of (1000,).
+        self.visual_size_out = 1000
+
+        # Load the MobileNetV2 from torch.hub.
+        self.cnn_model = torch.hub.load(
+            "pytorch/vision:v0.6.0", "mobilenet_v2", pretrained=True
+        )
+
+        self.lstm = nn.LSTM(
+            self.visual_size_out, self.lstm_state_size, batch_first=True
+        )
+
+        # Postprocess LSTM output with another hidden layer and compute values.
+        self.logits = SlimFC(self.lstm_state_size, self.num_outputs)
+        self.value_branch = SlimFC(self.lstm_state_size, 1)
+        # Holds the current "base" output (before logits layer).
+        self._features = None
+
+    @override(TorchRNN)
+    def forward_rnn(self, inputs, state, seq_lens):
+        # Create image dims.
+        vision_in = torch.reshape(inputs, [-1] + self.cnn_shape)
+        vision_out = self.cnn_model(vision_in)
+        # Flatten.
+        vision_out_time_ranked = torch.reshape(
+            vision_out, [inputs.shape[0], inputs.shape[1], vision_out.shape[-1]]
+        )
+        if len(state[0].shape) == 2:
+            state[0] = state[0].unsqueeze(0)
+            state[1] = state[1].unsqueeze(0)
+        # Forward through LSTM.
+        self._features, [h, c] = self.lstm(vision_out_time_ranked, state)
+        # Forward LSTM out through logits layer and value layer.
+        logits = self.logits(self._features)
+        return logits, [h.squeeze(0), c.squeeze(0)]
+
+    @override(ModelV2)
+    def get_initial_state(self):
+        # Place hidden states on same device as model.
+        h = [
+            list(self.cnn_model.modules())[-1]
+            .weight.new(1, self.lstm_state_size)
+            .zero_()
+            .squeeze(0),
+            list(self.cnn_model.modules())[-1]
+            .weight.new(1, self.lstm_state_size)
+            .zero_()
+            .squeeze(0),
+        ]
+        return h
+
+    @override(ModelV2)
+    def value_function(self):
+        assert self._features is not None, "must call forward() first"
+        return torch.reshape(self.value_branch(self._features), [-1])
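The Keras LSTM above ignores padded timesteps via `mask=tf.sequence_mask(seq_in)`. A pure-Python analogue of that masking (our own helper, for illustration only):

```python
def sequence_mask(seq_lens, max_len=None):
    # True for valid timesteps, False for padding -- what the LSTM
    # receives through its `mask` argument.
    if max_len is None:
        max_len = max(seq_lens)
    return [[t < n for t in range(max_len)] for n in seq_lens]

print(sequence_mask([1, 3]))  # -> [[True, False, False], [True, True, True]]
```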
.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/neural_computer.py ADDED
@@ -0,0 +1,247 @@
+# @OldAPIStack
+from collections import OrderedDict
+import gymnasium as gym
+from typing import Union, Dict, List, Tuple
+
+from ray.rllib.models.torch.torch_modelv2 import TorchModelV2
+from ray.rllib.models.torch.misc import SlimFC
+from ray.rllib.utils.framework import try_import_torch
+from ray.rllib.utils.typing import ModelConfigDict, TensorType
+
+try:
+    from dnc import DNC
+except ModuleNotFoundError:
+    print("dnc module not found. Did you forget to 'pip install dnc'?")
+    raise
+
+torch, nn = try_import_torch()
+
+
+class DNCMemory(TorchModelV2, nn.Module):
+    """Differentiable Neural Computer wrapper around ixaxaar's DNC implementation,
+    see https://github.com/ixaxaar/pytorch-dnc"""
+
+    DEFAULT_CONFIG = {
+        "dnc_model": DNC,
+        # Number of controller hidden layers
+        "num_hidden_layers": 1,
+        # Number of weights per controller hidden layer
+        "hidden_size": 64,
+        # Number of LSTM units
+        "num_layers": 1,
+        # Number of read heads, i.e. how many addrs are read at once
+        "read_heads": 4,
+        # Number of memory cells in the controller
+        "nr_cells": 32,
+        # Size of each cell
+        "cell_size": 16,
+        # LSTM activation function
+        "nonlinearity": "tanh",
+        # Observation goes through this torch.nn.Module before
+        # feeding to the DNC
+        "preprocessor": torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.Tanh()),
+        # Input size to the preprocessor
+        "preprocessor_input_size": 64,
+        # The output size of the preprocessor
+        # and the input size of the dnc
+        "preprocessor_output_size": 64,
+    }
+
+    MEMORY_KEYS = [
+        "memory",
+        "link_matrix",
+        "precedence",
+        "read_weights",
+        "write_weights",
+        "usage_vector",
+    ]
+
+    def __init__(
+        self,
+        obs_space: gym.spaces.Space,
+        action_space: gym.spaces.Space,
+        num_outputs: int,
+        model_config: ModelConfigDict,
+        name: str,
+        **custom_model_kwargs,
+    ):
+        nn.Module.__init__(self)
+        super(DNCMemory, self).__init__(
+            obs_space, action_space, num_outputs, model_config, name
+        )
+        self.num_outputs = num_outputs
+        self.obs_dim = gym.spaces.utils.flatdim(obs_space)
+        self.act_dim = gym.spaces.utils.flatdim(action_space)
+
+        self.cfg = dict(self.DEFAULT_CONFIG, **custom_model_kwargs)
+        assert (
+            self.cfg["num_layers"] == 1
+        ), "num_layers != 1 has not been implemented yet"
+        self.cur_val = None
+
+        self.preprocessor = torch.nn.Sequential(
+            torch.nn.Linear(self.obs_dim, self.cfg["preprocessor_input_size"]),
+            self.cfg["preprocessor"],
+        )
+
+        self.logit_branch = SlimFC(
+            in_size=self.cfg["hidden_size"],
+            out_size=self.num_outputs,
+            activation_fn=None,
+            initializer=torch.nn.init.xavier_uniform_,
+        )
+
+        self.value_branch = SlimFC(
+            in_size=self.cfg["hidden_size"],
+            out_size=1,
+            activation_fn=None,
+            initializer=torch.nn.init.xavier_uniform_,
+        )
+
+        self.dnc: Union[None, DNC] = None
+
+    def get_initial_state(self) -> List[TensorType]:
+        ctrl_hidden = [
+            torch.zeros(self.cfg["num_hidden_layers"], self.cfg["hidden_size"]),
+            torch.zeros(self.cfg["num_hidden_layers"], self.cfg["hidden_size"]),
+        ]
+        m = self.cfg["nr_cells"]
+        r = self.cfg["read_heads"]
+        w = self.cfg["cell_size"]
+        memory = [
+            torch.zeros(m, w),  # memory
+            torch.zeros(1, m, m),  # link_matrix
+            torch.zeros(1, m),  # precedence
+            torch.zeros(r, m),  # read_weights
+            torch.zeros(1, m),  # write_weights
+            torch.zeros(m),  # usage_vector
+        ]
+
+        read_vecs = torch.zeros(w * r)
+
+        state = [*ctrl_hidden, read_vecs, *memory]
+        assert len(state) == 9
+        return state
+
+    def value_function(self) -> TensorType:
+        assert self.cur_val is not None, "must call forward() first"
+        return self.cur_val
+
+    def unpack_state(
+        self,
+        state: List[TensorType],
+    ) -> Tuple[List[Tuple[TensorType, TensorType]], Dict[str, TensorType], TensorType]:
+        """Given a list of tensors, reformat for self.dnc input"""
+        assert len(state) == 9, "Failed to verify unpacked state"
+        ctrl_hidden: List[Tuple[TensorType, TensorType]] = [
+            (
+                state[0].permute(1, 0, 2).contiguous(),
+                state[1].permute(1, 0, 2).contiguous(),
+            )
+        ]
+        read_vecs: TensorType = state[2]
+        memory: List[TensorType] = state[3:]
+        memory_dict: OrderedDict[str, TensorType] = OrderedDict(
+            zip(self.MEMORY_KEYS, memory)
+        )
+
+        return ctrl_hidden, memory_dict, read_vecs
+
+    def pack_state(
+        self,
+        ctrl_hidden: List[Tuple[TensorType, TensorType]],
+        memory_dict: Dict[str, TensorType],
+        read_vecs: TensorType,
+    ) -> List[TensorType]:
+        """Given the dnc output, pack it into a list of tensors
+        for rllib state. Order is ctrl_hidden, read_vecs, memory_dict"""
+        state = []
+        ctrl_hidden = [
+            ctrl_hidden[0][0].permute(1, 0, 2),
+            ctrl_hidden[0][1].permute(1, 0, 2),
+        ]
+        state += ctrl_hidden
+        assert len(state) == 2, "Failed to verify packed state"
+        state.append(read_vecs)
+        assert len(state) == 3, "Failed to verify packed state"
+        state += memory_dict.values()
+        assert len(state) == 9, "Failed to verify packed state"
+        return state
+
+    def validate_unpack(self, dnc_output, unpacked_state):
+        """Ensure the unpacked state shapes match the DNC output"""
+        s_ctrl_hidden, s_memory_dict, s_read_vecs = unpacked_state
+        ctrl_hidden, memory_dict, read_vecs = dnc_output
+
+        for i in range(len(ctrl_hidden)):
+            for j in range(len(ctrl_hidden[i])):
+                assert s_ctrl_hidden[i][j].shape == ctrl_hidden[i][j].shape, (
+                    "Controller state mismatch: got "
+                    f"{s_ctrl_hidden[i][j].shape} should be "
+                    f"{ctrl_hidden[i][j].shape}"
+                )
+
+        for k in memory_dict:
+            assert s_memory_dict[k].shape == memory_dict[k].shape, (
+                "Memory state mismatch at key "
+                f"{k}: got {s_memory_dict[k].shape} should be "
+                f"{memory_dict[k].shape}"
+            )
+
+        assert s_read_vecs.shape == read_vecs.shape, (
+            "Read state mismatch: got "
+            f"{s_read_vecs.shape} should be "
+            f"{read_vecs.shape}"
+        )
+
+    def build_dnc(self, device_idx: Union[int, None]) -> None:
+        self.dnc = self.cfg["dnc_model"](
+            input_size=self.cfg["preprocessor_output_size"],
+            hidden_size=self.cfg["hidden_size"],
+            num_layers=self.cfg["num_layers"],
+            num_hidden_layers=self.cfg["num_hidden_layers"],
+            read_heads=self.cfg["read_heads"],
+            cell_size=self.cfg["cell_size"],
+            nr_cells=self.cfg["nr_cells"],
+            nonlinearity=self.cfg["nonlinearity"],
+            gpu_id=device_idx,
+        )
+
+    def forward(
+        self,
+        input_dict: Dict[str, TensorType],
+        state: List[TensorType],
+        seq_lens: TensorType,
+    ) -> Tuple[TensorType, List[TensorType]]:
+
+        flat = input_dict["obs_flat"]
+        # Batch and Time
+        # Forward expects outputs as [B, T, logits]
+        B = len(seq_lens)
+        T = flat.shape[0] // B
+
+        # Deconstruct batch into batch and time dimensions: [B, T, feats]
+        flat = torch.reshape(flat, [-1, T] + list(flat.shape[1:]))
+
+        # First run
+        if self.dnc is None:
+            gpu_id = flat.device.index if flat.device.index is not None else -1
+            self.build_dnc(gpu_id)
+            hidden = (None, None, None)
+
+        else:
+            hidden = self.unpack_state(state)  # type: ignore
+
+        # Run thru preprocessor before DNC
+        z = self.preprocessor(flat.reshape(B * T, self.obs_dim))
+        z = z.reshape(B, T, self.cfg["preprocessor_output_size"])
+        output, hidden = self.dnc(z, hidden)
+        packed_state = self.pack_state(*hidden)
+
+        # Compute action/value from output
+        logits = self.logit_branch(output.view(B * T, -1))
+        values = self.value_branch(output.view(B * T, -1))
+
+        self.cur_val = values.squeeze(1)
+
+        return logits, packed_state
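`DNCMemory.forward` recovers the batch and time dimensions from the flat input via `B = len(seq_lens)` and `T = flat.shape[0] // B`. The same split, sketched on plain lists (the helper name is ours):

```python
def batch_time_split(flat, seq_lens):
    # [B*T, ...] -> [B, T, ...], as done at the top of DNCMemory.forward.
    B = len(seq_lens)
    T = len(flat) // B
    return [flat[b * T:(b + 1) * T] for b in range(B)]

# Six rows with two sequences -> 2 batches of 3 timesteps each.
print(batch_time_split([0, 1, 2, 3, 4, 5], [3, 3]))  # -> [[0, 1, 2], [3, 4, 5]]
```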
.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/parametric_actions_model.py ADDED
@@ -0,0 +1,201 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # @OldAPIStack
2
+ from gymnasium.spaces import Box
3
+
4
+ from ray.rllib.algorithms.dqn.distributional_q_tf_model import DistributionalQTFModel
5
+ from ray.rllib.algorithms.dqn.dqn_torch_model import DQNTorchModel
6
+ from ray.rllib.models.tf.fcnet import FullyConnectedNetwork
7
+ from ray.rllib.models.torch.fcnet import FullyConnectedNetwork as TorchFC
8
+ from ray.rllib.utils.framework import try_import_tf, try_import_torch
9
+ from ray.rllib.utils.torch_utils import FLOAT_MAX, FLOAT_MIN
10
+
11
+ tf1, tf, tfv = try_import_tf()
12
+ torch, nn = try_import_torch()
13
+
14
+
15
+ class ParametricActionsModel(DistributionalQTFModel):
16
+ """Parametric action model that handles the dot product and masking.
17
+
18
+ This assumes the outputs are logits for a single Categorical action dist.
19
+ Getting this to work with a more complex output (e.g., if the action space
20
+ is a tuple of several distributions) is also possible but left as an
21
+     exercise to the reader.
+     """
+
+     def __init__(
+         self,
+         obs_space,
+         action_space,
+         num_outputs,
+         model_config,
+         name,
+         true_obs_shape=(4,),
+         action_embed_size=2,
+         **kw
+     ):
+         super(ParametricActionsModel, self).__init__(
+             obs_space, action_space, num_outputs, model_config, name, **kw
+         )
+         self.action_embed_model = FullyConnectedNetwork(
+             Box(-1, 1, shape=true_obs_shape),
+             action_space,
+             action_embed_size,
+             model_config,
+             name + "_action_embed",
+         )
+
+     def forward(self, input_dict, state, seq_lens):
+         # Extract the available actions tensor from the observation.
+         avail_actions = input_dict["obs"]["avail_actions"]
+         action_mask = input_dict["obs"]["action_mask"]
+
+         # Compute the predicted action embedding.
+         action_embed, _ = self.action_embed_model({"obs": input_dict["obs"]["cart"]})
+
+         # Expand the model output to [BATCH, 1, EMBED_SIZE]. Note that the
+         # avail actions tensor is of shape [BATCH, MAX_ACTIONS, EMBED_SIZE].
+         intent_vector = tf.expand_dims(action_embed, 1)
+
+         # Batch dot product => shape of logits is [BATCH, MAX_ACTIONS].
+         action_logits = tf.reduce_sum(avail_actions * intent_vector, axis=2)
+
+         # Mask out invalid actions (use tf.float32.min for stability).
+         inf_mask = tf.maximum(tf.math.log(action_mask), tf.float32.min)
+         return action_logits + inf_mask, state
+
+     def value_function(self):
+         return self.action_embed_model.value_function()
+
+
+ class TorchParametricActionsModel(DQNTorchModel):
+     """PyTorch version of the above ParametricActionsModel."""
+
+     def __init__(
+         self,
+         obs_space,
+         action_space,
+         num_outputs,
+         model_config,
+         name,
+         true_obs_shape=(4,),
+         action_embed_size=2,
+         **kw
+     ):
+         DQNTorchModel.__init__(
+             self, obs_space, action_space, num_outputs, model_config, name, **kw
+         )
+
+         self.action_embed_model = TorchFC(
+             Box(-1, 1, shape=true_obs_shape),
+             action_space,
+             action_embed_size,
+             model_config,
+             name + "_action_embed",
+         )
+
+     def forward(self, input_dict, state, seq_lens):
+         # Extract the available actions tensor from the observation.
+         avail_actions = input_dict["obs"]["avail_actions"]
+         action_mask = input_dict["obs"]["action_mask"]
+
+         # Compute the predicted action embedding.
+         action_embed, _ = self.action_embed_model({"obs": input_dict["obs"]["cart"]})
+
+         # Expand the model output to [BATCH, 1, EMBED_SIZE]. Note that the
+         # avail actions tensor is of shape [BATCH, MAX_ACTIONS, EMBED_SIZE].
+         intent_vector = torch.unsqueeze(action_embed, 1)
+
+         # Batch dot product => shape of logits is [BATCH, MAX_ACTIONS].
+         action_logits = torch.sum(avail_actions * intent_vector, dim=2)
+
+         # Mask out invalid actions (use -inf to tag invalid).
+         # These are then recognized by the EpsilonGreedy exploration component
+         # as invalid actions that are not to be chosen.
+         inf_mask = torch.clamp(torch.log(action_mask), FLOAT_MIN, FLOAT_MAX)
+
+         return action_logits + inf_mask, state
+
+     def value_function(self):
+         return self.action_embed_model.value_function()
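
The `log(action_mask)` trick used in the `forward()` methods above maps invalid actions to a very large negative logit (clamped to the float32 minimum for numerical stability), so a subsequent softmax or argmax effectively ignores them. A minimal NumPy sketch of the idea, with a made-up toy batch for illustration:

```python
import numpy as np

# Hypothetical mini-batch: 2 samples, 4 candidate actions each.
action_mask = np.array([[1.0, 0.0, 1.0, 1.0],
                        [0.0, 1.0, 0.0, 1.0]], dtype=np.float32)
action_logits = np.array([[0.5, 2.0, -1.0, 0.3],
                          [1.2, 0.1, 0.9, -0.4]], dtype=np.float32)

# log(1) == 0 leaves valid logits untouched; log(0) == -inf is clamped
# to the float32 minimum so all downstream arithmetic stays finite.
with np.errstate(divide="ignore"):
    inf_mask = np.maximum(np.log(action_mask), np.finfo(np.float32).min)

masked_logits = action_logits + inf_mask
greedy_actions = masked_logits.argmax(axis=1)  # only valid actions can win
```

Even though action 1 has the highest raw logit in the first sample, it is masked out, so the greedy choice falls back to the best valid action.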
+
+
+ class ParametricActionsModelThatLearnsEmbeddings(DistributionalQTFModel):
+     """Same as the above ParametricActionsModel.
+
+     However, this version also learns the action embeddings.
+     """
+
+     def __init__(
+         self,
+         obs_space,
+         action_space,
+         num_outputs,
+         model_config,
+         name,
+         true_obs_shape=(4,),
+         action_embed_size=2,
+         **kw
+     ):
+         super(ParametricActionsModelThatLearnsEmbeddings, self).__init__(
+             obs_space, action_space, num_outputs, model_config, name, **kw
+         )
+
+         action_ids_shifted = tf.constant(
+             list(range(1, num_outputs + 1)), dtype=tf.float32
+         )
+
+         obs_cart = tf.keras.layers.Input(shape=true_obs_shape, name="obs_cart")
+         valid_avail_actions_mask = tf.keras.layers.Input(
+             shape=(num_outputs,), name="valid_avail_actions_mask"
+         )
+
+         self.pred_action_embed_model = FullyConnectedNetwork(
+             Box(-1, 1, shape=true_obs_shape),
+             action_space,
+             action_embed_size,
+             model_config,
+             name + "_pred_action_embed",
+         )
+
+         # Compute the predicted action embedding.
+         pred_action_embed, _ = self.pred_action_embed_model({"obs": obs_cart})
+         _value_out = self.pred_action_embed_model.value_function()
+
+         # Expand the model output to [BATCH, 1, EMBED_SIZE]. Note that the
+         # avail actions tensor is of shape [BATCH, MAX_ACTIONS, EMBED_SIZE].
+         intent_vector = tf.expand_dims(pred_action_embed, 1)
+
+         valid_avail_actions = action_ids_shifted * valid_avail_actions_mask
+         # Embedding for valid available actions, which will be learned.
+         # The embedding vector for id 0 is an invalid embedding (a "dummy embedding").
+         valid_avail_actions_embed = tf.keras.layers.Embedding(
+             input_dim=num_outputs + 1,
+             output_dim=action_embed_size,
+             name="action_embed_matrix",
+         )(valid_avail_actions)
+
+         # Batch dot product => shape of logits is [BATCH, MAX_ACTIONS].
+         action_logits = tf.reduce_sum(valid_avail_actions_embed * intent_vector, axis=2)
+
+         # Mask out invalid actions (use tf.float32.min for stability).
+         inf_mask = tf.maximum(tf.math.log(valid_avail_actions_mask), tf.float32.min)
+
+         action_logits = action_logits + inf_mask
+
+         self.param_actions_model = tf.keras.Model(
+             inputs=[obs_cart, valid_avail_actions_mask],
+             outputs=[action_logits, _value_out],
+         )
+         self.param_actions_model.summary()
+
+     def forward(self, input_dict, state, seq_lens):
+         # Extract the available actions mask tensor from the observation.
+         valid_avail_actions_mask = input_dict["obs"]["valid_avail_actions_mask"]
+
+         action_logits, self._value_out = self.param_actions_model(
+             [input_dict["obs"]["cart"], valid_avail_actions_mask]
+         )
+
+         return action_logits, state
+
+     def value_function(self):
+         return self._value_out
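
The id-shifting in `ParametricActionsModelThatLearnsEmbeddings` reserves embedding row 0 as a dummy for masked-out actions: multiplying the 1-based action ids by the 0/1 mask sends every invalid slot to id 0 before the embedding lookup. A small NumPy sketch of that lookup (the embedding matrix here is an arbitrary placeholder for the learned table):

```python
import numpy as np

num_actions, embed_size = 4, 2
# Stand-in for the learned embedding table; row 0 is the dummy row.
embed_matrix = np.arange((num_actions + 1) * embed_size,
                         dtype=np.float32).reshape(num_actions + 1, embed_size)

valid_avail_actions_mask = np.array([1.0, 0.0, 1.0, 1.0], dtype=np.float32)
action_ids_shifted = np.arange(1, num_actions + 1, dtype=np.float32)

# Invalid actions (mask == 0) collapse to id 0 -> the dummy embedding row.
ids = (action_ids_shifted * valid_avail_actions_mask).astype(np.int64)
valid_avail_actions_embed = embed_matrix[ids]
```

The dummy row still produces a logit, but the subsequent `log(mask)` addition drives that logit to the float32 minimum, so it never influences action selection.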
.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/shared_weights_model.py ADDED
@@ -0,0 +1,206 @@
+ # @OldAPIStack
+ import numpy as np
+
+ from ray.rllib.models.modelv2 import ModelV2
+ from ray.rllib.models.tf.tf_modelv2 import TFModelV2
+ from ray.rllib.models.torch.misc import SlimFC
+ from ray.rllib.models.torch.torch_modelv2 import TorchModelV2
+ from ray.rllib.utils.annotations import override
+ from ray.rllib.utils.framework import try_import_tf, try_import_torch
+
+ tf1, tf, tfv = try_import_tf()
+ torch, nn = try_import_torch()
+
+ TF2_GLOBAL_SHARED_LAYER = None
+
+
+ class TF2SharedWeightsModel(TFModelV2):
+     """Example of weight sharing between two different TFModelV2s.
+
+     NOTE: This will only work for tf2.x. When running with config.framework=tf,
+     use SharedWeightsModel1 and SharedWeightsModel2 below, instead!
+
+     The shared (single) layer is simply defined outside of the two Models,
+     then used by both Models in their forward pass.
+     """
+
+     def __init__(
+         self, observation_space, action_space, num_outputs, model_config, name
+     ):
+         super().__init__(
+             observation_space, action_space, num_outputs, model_config, name
+         )
+
+         global TF2_GLOBAL_SHARED_LAYER
+         # The global, shared layer to be used by both models.
+         if TF2_GLOBAL_SHARED_LAYER is None:
+             TF2_GLOBAL_SHARED_LAYER = tf.keras.layers.Dense(
+                 units=64, activation=tf.nn.relu, name="fc1"
+             )
+
+         inputs = tf.keras.layers.Input(observation_space.shape)
+         last_layer = TF2_GLOBAL_SHARED_LAYER(inputs)
+         output = tf.keras.layers.Dense(
+             units=num_outputs, activation=None, name="fc_out"
+         )(last_layer)
+         vf = tf.keras.layers.Dense(units=1, activation=None, name="value_out")(
+             last_layer
+         )
+         self.base_model = tf.keras.models.Model(inputs, [output, vf])
+
+     @override(ModelV2)
+     def forward(self, input_dict, state, seq_lens):
+         out, self._value_out = self.base_model(input_dict["obs"])
+         return out, []
+
+     @override(ModelV2)
+     def value_function(self):
+         return tf.reshape(self._value_out, [-1])
+
+
+ class SharedWeightsModel1(TFModelV2):
+     """Example of weight sharing between two different TFModelV2s.
+
+     NOTE: This will only work for tf1 (static graph). When running with
+     config.framework_str=tf2, use TF2SharedWeightsModel, instead!
+
+     Here, we share the variables defined in the 'shared' variable scope
+     by entering it explicitly with tf1.AUTO_REUSE. This creates the
+     variables for the 'fc1' layer in a global scope called 'shared'
+     (outside of the Policy's normal variable scope).
+     """
+
+     def __init__(
+         self, observation_space, action_space, num_outputs, model_config, name
+     ):
+         super().__init__(
+             observation_space, action_space, num_outputs, model_config, name
+         )
+
+         inputs = tf.keras.layers.Input(observation_space.shape)
+         with tf1.variable_scope(
+             tf1.VariableScope(tf1.AUTO_REUSE, "shared"),
+             reuse=tf1.AUTO_REUSE,
+             auxiliary_name_scope=False,
+         ):
+             last_layer = tf.keras.layers.Dense(
+                 units=64, activation=tf.nn.relu, name="fc1"
+             )(inputs)
+         output = tf.keras.layers.Dense(
+             units=num_outputs, activation=None, name="fc_out"
+         )(last_layer)
+         vf = tf.keras.layers.Dense(units=1, activation=None, name="value_out")(
+             last_layer
+         )
+         self.base_model = tf.keras.models.Model(inputs, [output, vf])
+
+     @override(ModelV2)
+     def forward(self, input_dict, state, seq_lens):
+         out, self._value_out = self.base_model(input_dict["obs"])
+         return out, []
+
+     @override(ModelV2)
+     def value_function(self):
+         return tf.reshape(self._value_out, [-1])
+
+
+ class SharedWeightsModel2(TFModelV2):
+     """The "other" TFModelV2 using the same shared space as the one above."""
+
+     def __init__(
+         self, observation_space, action_space, num_outputs, model_config, name
+     ):
+         super().__init__(
+             observation_space, action_space, num_outputs, model_config, name
+         )
+
+         inputs = tf.keras.layers.Input(observation_space.shape)
+
+         # Weights shared with SharedWeightsModel1.
+         with tf1.variable_scope(
+             tf1.VariableScope(tf1.AUTO_REUSE, "shared"),
+             reuse=tf1.AUTO_REUSE,
+             auxiliary_name_scope=False,
+         ):
+             last_layer = tf.keras.layers.Dense(
+                 units=64, activation=tf.nn.relu, name="fc1"
+             )(inputs)
+         output = tf.keras.layers.Dense(
+             units=num_outputs, activation=None, name="fc_out"
+         )(last_layer)
+         vf = tf.keras.layers.Dense(units=1, activation=None, name="value_out")(
+             last_layer
+         )
+         self.base_model = tf.keras.models.Model(inputs, [output, vf])
+
+     @override(ModelV2)
+     def forward(self, input_dict, state, seq_lens):
+         out, self._value_out = self.base_model(input_dict["obs"])
+         return out, []
+
+     @override(ModelV2)
+     def value_function(self):
+         return tf.reshape(self._value_out, [-1])
+
+
+ TORCH_GLOBAL_SHARED_LAYER = None
+ if torch:
+     # The global, shared layer to be used by both models.
+     TORCH_GLOBAL_SHARED_LAYER = SlimFC(
+         64,
+         64,
+         activation_fn=nn.ReLU,
+         initializer=torch.nn.init.xavier_uniform_,
+     )
+
+
+ class TorchSharedWeightsModel(TorchModelV2, nn.Module):
+     """Example of weight sharing between two different TorchModelV2s.
+
+     The shared (single) layer is simply defined outside of the two Models,
+     then used by both Models in their forward pass.
+     """
+
+     def __init__(
+         self, observation_space, action_space, num_outputs, model_config, name
+     ):
+         TorchModelV2.__init__(
+             self, observation_space, action_space, num_outputs, model_config, name
+         )
+         nn.Module.__init__(self)
+
+         # Non-shared initial layer.
+         self.first_layer = SlimFC(
+             int(np.prod(observation_space.shape)),
+             64,
+             activation_fn=nn.ReLU,
+             initializer=torch.nn.init.xavier_uniform_,
+         )
+
+         # Non-shared final layer.
+         self.last_layer = SlimFC(
+             64,
+             self.num_outputs,
+             activation_fn=None,
+             initializer=torch.nn.init.xavier_uniform_,
+         )
+         self.vf = SlimFC(
+             64,
+             1,
+             activation_fn=None,
+             initializer=torch.nn.init.xavier_uniform_,
+         )
+         self._global_shared_layer = TORCH_GLOBAL_SHARED_LAYER
+         self._output = None
+
+     @override(ModelV2)
+     def forward(self, input_dict, state, seq_lens):
+         out = self.first_layer(input_dict["obs"])
+         self._output = self._global_shared_layer(out)
+         model_out = self.last_layer(self._output)
+         return model_out, []
+
+     @override(ModelV2)
+     def value_function(self):
+         assert self._output is not None, "must call forward first!"
+         return torch.reshape(self.vf(self._output), [-1])
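
The sharing pattern in these models can be sketched framework-free: define the shared parameters once at module scope and let every model instance reference that one object, while each instance keeps its own output head. All names and shapes below are illustrative, not part of the RLlib API:

```python
import numpy as np

rng = np.random.default_rng(0)

# The single shared hidden layer, defined once outside the models
# (analogous to TF2_GLOBAL_SHARED_LAYER / TORCH_GLOBAL_SHARED_LAYER).
SHARED_W = rng.normal(size=(4, 64)).astype(np.float32)

class TinyModel:
    """Each instance owns its output head but reuses SHARED_W."""

    def __init__(self, num_outputs):
        self.head = rng.normal(size=(64, num_outputs)).astype(np.float32)

    def forward(self, obs):
        hidden = np.maximum(obs @ SHARED_W, 0.0)  # shared ReLU layer
        return hidden @ self.head

policy_net, value_net = TinyModel(2), TinyModel(1)
obs = rng.normal(size=(3, 4)).astype(np.float32)
logits, values = policy_net.forward(obs), value_net.forward(obs)
```

Because both instances read the same `SHARED_W` object, any update to the shared layer immediately affects both forward passes; only the heads differ.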
.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/simple_rpg_model.py ADDED
@@ -0,0 +1,65 @@
+ # @OldAPIStack
+ from ray.rllib.models.tf.tf_modelv2 import TFModelV2
+ from ray.rllib.models.tf.fcnet import FullyConnectedNetwork as TFFCNet
+ from ray.rllib.models.torch.torch_modelv2 import TorchModelV2
+ from ray.rllib.models.torch.fcnet import FullyConnectedNetwork as TorchFCNet
+ from ray.rllib.utils.framework import try_import_tf, try_import_torch
+
+ tf1, tf, tfv = try_import_tf()
+ torch, nn = try_import_torch()
+
+
+ class CustomTorchRPGModel(TorchModelV2, nn.Module):
+     """Example of interpreting repeated observations."""
+
+     def __init__(self, obs_space, action_space, num_outputs, model_config, name):
+         super().__init__(obs_space, action_space, num_outputs, model_config, name)
+         nn.Module.__init__(self)
+         self.model = TorchFCNet(
+             obs_space, action_space, num_outputs, model_config, name
+         )
+
+     def forward(self, input_dict, state, seq_lens):
+         # The unpacked input tensors, where M=MAX_PLAYERS, N=MAX_ITEMS:
+         # {
+         #     'items', <torch.Tensor shape=(?, M, N, 5)>,
+         #     'location', <torch.Tensor shape=(?, M, 2)>,
+         #     'status', <torch.Tensor shape=(?, M, 10)>,
+         # }
+         print("The unpacked input tensors:", input_dict["obs"])
+         print()
+         print("Unbatched repeat dim", input_dict["obs"].unbatch_repeat_dim())
+         print()
+         print("Fully unbatched", input_dict["obs"].unbatch_all())
+         print()
+         return self.model.forward(input_dict, state, seq_lens)
+
+     def value_function(self):
+         return self.model.value_function()
+
+
+ class CustomTFRPGModel(TFModelV2):
+     """Example of interpreting repeated observations."""
+
+     def __init__(self, obs_space, action_space, num_outputs, model_config, name):
+         super().__init__(obs_space, action_space, num_outputs, model_config, name)
+         self.model = TFFCNet(obs_space, action_space, num_outputs, model_config, name)
+
+     def forward(self, input_dict, state, seq_lens):
+         # The unpacked input tensors, where M=MAX_PLAYERS, N=MAX_ITEMS:
+         # {
+         #     'items', <tf.Tensor shape=(?, M, N, 5)>,
+         #     'location', <tf.Tensor shape=(?, M, 2)>,
+         #     'status', <tf.Tensor shape=(?, M, 10)>,
+         # }
+         print("The unpacked input tensors:", input_dict["obs"])
+         print()
+         print("Unbatched repeat dim", input_dict["obs"].unbatch_repeat_dim())
+         print()
+         if tf.executing_eagerly():
+             print("Fully unbatched", input_dict["obs"].unbatch_all())
+             print()
+         return self.model.forward(input_dict, state, seq_lens)
+
+     def value_function(self):
+         return self.model.value_function()
.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/parametric_actions_cartpole.py ADDED
@@ -0,0 +1,121 @@
+ # @OldAPIStack
+ """Example of handling variable length or parametric action spaces.
+
+ This toy example demonstrates the action-embedding based approach for handling large
+ discrete action spaces (potentially infinite in size), similar to this example:
+
+ https://neuro.cs.ut.ee/the-use-of-embeddings-in-openai-five/
+
+ This example works with RLlib's policy gradient style algorithms
+ (e.g., PG, PPO, IMPALA, A2C) and DQN.
+
+ Note that since the model outputs now include "-inf" tf.float32.min
+ values, not all algorithm options are supported. For example,
+ algorithms might crash if they don't properly ignore the -inf action scores.
+ Working configurations are given below.
+ """
+
+ import argparse
+ import os
+
+ import ray
+ from ray import air, tune
+ from ray.air.constants import TRAINING_ITERATION
+ from ray.rllib.examples.envs.classes.parametric_actions_cartpole import (
+     ParametricActionsCartPole,
+ )
+ from ray.rllib.examples._old_api_stack.models.parametric_actions_model import (
+     ParametricActionsModel,
+     TorchParametricActionsModel,
+ )
+ from ray.rllib.models import ModelCatalog
+ from ray.rllib.utils.metrics import (
+     ENV_RUNNER_RESULTS,
+     EPISODE_RETURN_MEAN,
+     NUM_ENV_STEPS_SAMPLED_LIFETIME,
+ )
+ from ray.rllib.utils.test_utils import check_learning_achieved
+ from ray.tune.registry import register_env
+
+ parser = argparse.ArgumentParser()
+ parser.add_argument(
+     "--run", type=str, default="PPO", help="The RLlib-registered algorithm to use."
+ )
+ parser.add_argument(
+     "--framework",
+     choices=["tf", "tf2", "torch"],
+     default="torch",
+     help="The DL framework specifier.",
+ )
+ parser.add_argument(
+     "--as-test",
+     action="store_true",
+     help="Whether this script should be run as a test: --stop-reward must "
+     "be achieved within --stop-timesteps AND --stop-iters.",
+ )
+ parser.add_argument(
+     "--stop-iters", type=int, default=200, help="Number of iterations to train."
+ )
+ parser.add_argument(
+     "--stop-timesteps", type=int, default=100000, help="Number of timesteps to train."
+ )
+ parser.add_argument(
+     "--stop-reward", type=float, default=150.0, help="Reward at which we stop training."
+ )
+
+ if __name__ == "__main__":
+     args = parser.parse_args()
+     ray.init()
+
+     register_env("pa_cartpole", lambda _: ParametricActionsCartPole(10))
+     ModelCatalog.register_custom_model(
+         "pa_model",
+         TorchParametricActionsModel
+         if args.framework == "torch"
+         else ParametricActionsModel,
+     )
+
+     if args.run == "DQN":
+         cfg = {
+             # TODO(ekl) we need to set these to prevent the masked values
+             # from being further processed in DistributionalQModel, which
+             # would mess up the masking. It is possible to support these if we
+             # defined a custom DistributionalQModel that is aware of masking.
+             "hiddens": [],
+             "dueling": False,
+             "enable_rl_module_and_learner": False,
+             "enable_env_runner_and_connector_v2": False,
+         }
+     else:
+         cfg = {}
+
+     config = dict(
+         {
+             "env": "pa_cartpole",
+             "model": {
+                 "custom_model": "pa_model",
+             },
+             # Use GPUs iff `RLLIB_NUM_GPUS` env var set to > 0.
+             "num_gpus": int(os.environ.get("RLLIB_NUM_GPUS", "0")),
+             "num_env_runners": 0,
+             "framework": args.framework,
+         },
+         **cfg,
+     )
+
+     stop = {
+         TRAINING_ITERATION: args.stop_iters,
+         f"{NUM_ENV_STEPS_SAMPLED_LIFETIME}": args.stop_timesteps,
+         f"{ENV_RUNNER_RESULTS}/{EPISODE_RETURN_MEAN}": args.stop_reward,
+     }
+
+     results = tune.Tuner(
+         args.run,
+         run_config=air.RunConfig(stop=stop, verbose=1),
+         param_space=config,
+     ).fit()
+
+     if args.as_test:
+         check_learning_achieved(results, args.stop_reward)
+
+     ray.shutdown()
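
The `config = dict({...}, **cfg)` line in the script above merges the algorithm-specific overrides into the base config via the plain `dict(mapping, **kwargs)` constructor; when a key appears in both, the keyword arguments win (in this script the two key sets happen to be disjoint). A tiny standalone illustration with made-up keys:

```python
# Base config plus DQN-style overrides (values are illustrative only).
base = {"env": "pa_cartpole", "num_env_runners": 0, "dueling": True}
overrides = {"dueling": False, "hiddens": []}

# dict(mapping, **kwargs): kwargs are applied after the mapping,
# so overlapping keys take the override value.
merged = dict(base, **overrides)
```

This is behaviorally equivalent to copying `base` and calling `merged.update(overrides)`.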
.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/parametric_actions_cartpole_embeddings_learnt_by_model.py ADDED
@@ -0,0 +1,107 @@
+ # @OldAPIStack
+ """Example of handling variable length or parametric action spaces.
+
+ This is a toy example of the action-embedding based approach for handling large
+ discrete action spaces (potentially infinite in size), similar to this:
+
+ https://neuro.cs.ut.ee/the-use-of-embeddings-in-openai-five/
+
+ This currently works with RLlib's policy gradient style algorithms
+ (e.g., PG, PPO, IMPALA, A2C) and also DQN.
+
+ Note that since the model outputs now include "-inf" tf.float32.min
+ values, not all algorithm options are supported at the moment. For example,
+ algorithms might crash if they don't properly ignore the -inf action scores.
+ Working configurations are given below.
+ """
+
+ import argparse
+ import os
+
+ import ray
+ from ray import air, tune
+ from ray.air.constants import TRAINING_ITERATION
+ from ray.rllib.examples.envs.classes.parametric_actions_cartpole import (
+     ParametricActionsCartPoleNoEmbeddings,
+ )
+ from ray.rllib.examples._old_api_stack.models.parametric_actions_model import (
+     ParametricActionsModelThatLearnsEmbeddings,
+ )
+ from ray.rllib.models import ModelCatalog
+ from ray.rllib.utils.metrics import (
+     ENV_RUNNER_RESULTS,
+     EPISODE_RETURN_MEAN,
+     NUM_ENV_STEPS_SAMPLED_LIFETIME,
+ )
+ from ray.rllib.utils.test_utils import check_learning_achieved
+ from ray.tune.registry import register_env
+
+ parser = argparse.ArgumentParser()
+ parser.add_argument("--run", type=str, default="PPO")
+ parser.add_argument(
+     "--framework",
+     choices=["tf", "tf2"],
+     default="tf",
+     help="The DL framework specifier (Torch not supported "
+     "due to the lack of a model).",
+ )
+ parser.add_argument("--as-test", action="store_true")
+ parser.add_argument("--stop-iters", type=int, default=200)
+ parser.add_argument("--stop-reward", type=float, default=150.0)
+ parser.add_argument("--stop-timesteps", type=int, default=100000)
+
+ if __name__ == "__main__":
+     args = parser.parse_args()
+     ray.init()
+
+     register_env("pa_cartpole", lambda _: ParametricActionsCartPoleNoEmbeddings(10))
+
+     ModelCatalog.register_custom_model(
+         "pa_model", ParametricActionsModelThatLearnsEmbeddings
+     )
+
+     if args.run == "DQN":
+         cfg = {
+             # TODO(ekl) we need to set these to prevent the masked values
+             # from being further processed in DistributionalQModel, which
+             # would mess up the masking. It is possible to support these if we
+             # defined a custom DistributionalQModel that is aware of masking.
+             "hiddens": [],
+             "dueling": False,
+             "enable_rl_module_and_learner": False,
+             "enable_env_runner_and_connector_v2": False,
+         }
+     else:
+         cfg = {}
+
+     config = dict(
+         {
+             "env": "pa_cartpole",
+             "model": {
+                 "custom_model": "pa_model",
+             },
+             # Use GPUs iff `RLLIB_NUM_GPUS` env var set to > 0.
+             "num_gpus": int(os.environ.get("RLLIB_NUM_GPUS", "0")),
+             "num_env_runners": 0,
+             "framework": args.framework,
+             "action_mask_key": "valid_avail_actions_mask",
+         },
+         **cfg,
+     )
+
+     stop = {
+         TRAINING_ITERATION: args.stop_iters,
+         NUM_ENV_STEPS_SAMPLED_LIFETIME: args.stop_timesteps,
+         f"{ENV_RUNNER_RESULTS}/{EPISODE_RETURN_MEAN}": args.stop_reward,
+     }
+
+     results = tune.Tuner(
+         args.run,
+         run_config=air.RunConfig(stop=stop, verbose=2),
+         param_space=config,
+     ).fit()
+
+     if args.as_test:
+         check_learning_achieved(results, args.stop_reward)
+
+     ray.shutdown()
.venv/lib/python3.11/site-packages/ray/rllib/examples/checkpoints/__pycache__/__init__.cpython-311.pyc ADDED
Binary file (203 Bytes). View file
 
.venv/lib/python3.11/site-packages/ray/rllib/examples/checkpoints/__pycache__/cartpole_dqn_export.cpython-311.pyc ADDED
Binary file (4.55 kB). View file
 
.venv/lib/python3.11/site-packages/ray/rllib/examples/checkpoints/__pycache__/change_config_during_training.cpython-311.pyc ADDED
Binary file (11.8 kB). View file
 
.venv/lib/python3.11/site-packages/ray/rllib/examples/checkpoints/__pycache__/checkpoint_by_custom_criteria.cpython-311.pyc ADDED
Binary file (6.36 kB). View file
 
.venv/lib/python3.11/site-packages/ray/rllib/examples/checkpoints/__pycache__/restore_1_of_n_agents_from_checkpoint.cpython-311.pyc ADDED
Binary file (7.57 kB). View file
 
.venv/lib/python3.11/site-packages/ray/rllib/examples/multi_agent/__init__.py ADDED
File without changes
.venv/lib/python3.11/site-packages/ray/rllib/examples/multi_agent/__pycache__/__init__.cpython-311.pyc ADDED
Binary file (203 Bytes). View file
 
.venv/lib/python3.11/site-packages/ray/rllib/examples/multi_agent/__pycache__/custom_heuristic_policy.cpython-311.pyc ADDED
Binary file (4.37 kB). View file
 
.venv/lib/python3.11/site-packages/ray/rllib/examples/multi_agent/__pycache__/different_spaces_for_agents.cpython-311.pyc ADDED
Binary file (5.9 kB). View file
 
.venv/lib/python3.11/site-packages/ray/rllib/examples/multi_agent/__pycache__/multi_agent_cartpole.cpython-311.pyc ADDED
Binary file (2.98 kB). View file
 
.venv/lib/python3.11/site-packages/ray/rllib/examples/multi_agent/__pycache__/multi_agent_pendulum.cpython-311.pyc ADDED
Binary file (3.33 kB). View file
 
.venv/lib/python3.11/site-packages/ray/rllib/examples/multi_agent/__pycache__/pettingzoo_independent_learning.cpython-311.pyc ADDED
Binary file (5.32 kB). View file
 
.venv/lib/python3.11/site-packages/ray/rllib/examples/multi_agent/__pycache__/pettingzoo_parameter_sharing.cpython-311.pyc ADDED
Binary file (4.66 kB). View file
 
.venv/lib/python3.11/site-packages/ray/rllib/examples/multi_agent/__pycache__/pettingzoo_shared_value_function.cpython-311.pyc ADDED
Binary file (485 Bytes). View file
 
.venv/lib/python3.11/site-packages/ray/rllib/examples/multi_agent/__pycache__/rock_paper_scissors_heuristic_vs_learned.cpython-311.pyc ADDED
Binary file (6.03 kB). View file
 
.venv/lib/python3.11/site-packages/ray/rllib/examples/multi_agent/__pycache__/rock_paper_scissors_learned_vs_learned.cpython-311.pyc ADDED
Binary file (4.22 kB). View file
 
.venv/lib/python3.11/site-packages/ray/rllib/examples/multi_agent/__pycache__/self_play_league_based_with_open_spiel.cpython-311.pyc ADDED
Binary file (11.4 kB). View file