Add files using upload-large-folder tool
- .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/__init__.py +0 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/__pycache__/__init__.cpython-311.pyc +0 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/__pycache__/attention_net_supervised.cpython-311.pyc +0 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/__pycache__/parametric_actions_cartpole.cpython-311.pyc +0 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/__pycache__/parametric_actions_cartpole_embeddings_learnt_by_model.cpython-311.pyc +0 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/attention_net_supervised.py +77 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/__init__.py +0 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/__pycache__/__init__.cpython-311.pyc +0 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/__pycache__/action_mask_model.cpython-311.pyc +0 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/__pycache__/autoregressive_action_dist.cpython-311.pyc +0 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/__pycache__/autoregressive_action_model.cpython-311.pyc +0 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/__pycache__/centralized_critic_models.cpython-311.pyc +0 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/__pycache__/custom_loss_model.cpython-311.pyc +0 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/__pycache__/fast_model.cpython-311.pyc +0 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/__pycache__/mobilenet_v2_encoder.cpython-311.pyc +0 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/__pycache__/mobilenet_v2_with_lstm_models.cpython-311.pyc +0 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/__pycache__/parametric_actions_model.cpython-311.pyc +0 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/__pycache__/shared_weights_model.cpython-311.pyc +0 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/__pycache__/simple_rpg_model.cpython-311.pyc +0 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/action_mask_model.py +126 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/autoregressive_action_dist.py +149 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/autoregressive_action_model.py +162 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/centralized_critic_models.py +182 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/custom_loss_model.py +137 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/fast_model.py +80 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/mobilenet_v2_encoder.py +48 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/mobilenet_v2_with_lstm_models.py +160 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/neural_computer.py +247 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/parametric_actions_model.py +201 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/shared_weights_model.py +206 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/simple_rpg_model.py +65 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/parametric_actions_cartpole.py +121 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/parametric_actions_cartpole_embeddings_learnt_by_model.py +107 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/checkpoints/__pycache__/__init__.cpython-311.pyc +0 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/checkpoints/__pycache__/cartpole_dqn_export.cpython-311.pyc +0 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/checkpoints/__pycache__/change_config_during_training.cpython-311.pyc +0 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/checkpoints/__pycache__/checkpoint_by_custom_criteria.cpython-311.pyc +0 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/checkpoints/__pycache__/restore_1_of_n_agents_from_checkpoint.cpython-311.pyc +0 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/multi_agent/__init__.py +0 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/multi_agent/__pycache__/__init__.cpython-311.pyc +0 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/multi_agent/__pycache__/custom_heuristic_policy.cpython-311.pyc +0 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/multi_agent/__pycache__/different_spaces_for_agents.cpython-311.pyc +0 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/multi_agent/__pycache__/multi_agent_cartpole.cpython-311.pyc +0 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/multi_agent/__pycache__/multi_agent_pendulum.cpython-311.pyc +0 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/multi_agent/__pycache__/pettingzoo_independent_learning.cpython-311.pyc +0 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/multi_agent/__pycache__/pettingzoo_parameter_sharing.cpython-311.pyc +0 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/multi_agent/__pycache__/pettingzoo_shared_value_function.cpython-311.pyc +0 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/multi_agent/__pycache__/rock_paper_scissors_heuristic_vs_learned.cpython-311.pyc +0 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/multi_agent/__pycache__/rock_paper_scissors_learned_vs_learned.cpython-311.pyc +0 -0
- .venv/lib/python3.11/site-packages/ray/rllib/examples/multi_agent/__pycache__/self_play_league_based_with_open_spiel.cpython-311.pyc +0 -0
.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/__init__.py
ADDED: file without changes

.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/__pycache__/__init__.cpython-311.pyc
ADDED: binary file (206 Bytes)

.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/__pycache__/attention_net_supervised.cpython-311.pyc
ADDED: binary file (4.78 kB)

.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/__pycache__/parametric_actions_cartpole.cpython-311.pyc
ADDED: binary file (4.59 kB)

.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/__pycache__/parametric_actions_cartpole_embeddings_learnt_by_model.cpython-311.pyc
ADDED: binary file (4.3 kB)
.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/attention_net_supervised.py
ADDED (77 lines)

# @OldAPIStack
from gymnasium.spaces import Box, Discrete
import numpy as np

from ray.rllib.models.tf.attention_net import TrXLNet
from ray.rllib.utils.framework import try_import_tf

tf1, tf, tfv = try_import_tf()


def bit_shift_generator(seq_length, shift, batch_size):
    while True:
        values = np.array([0.0, 1.0], dtype=np.float32)
        seq = np.random.choice(values, (batch_size, seq_length, 1))
        targets = np.squeeze(np.roll(seq, shift, axis=1).astype(np.int32))
        targets[:, :shift] = 0
        yield seq, targets


def train_loss(targets, outputs):
    loss = tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=targets, logits=outputs
    )
    return tf.reduce_mean(loss)


def train_bit_shift(seq_length, num_iterations, print_every_n):

    optimizer = tf.keras.optimizers.Adam(1e-3)

    model = TrXLNet(
        observation_space=Box(low=0, high=1, shape=(1,), dtype=np.int32),
        action_space=Discrete(2),
        num_outputs=2,
        model_config={"max_seq_len": seq_length},
        name="trxl",
        num_transformer_units=1,
        attention_dim=10,
        num_heads=5,
        head_dim=20,
        position_wise_mlp_dim=20,
    )

    shift = 10
    train_batch = 10
    test_batch = 100
    data_gen = bit_shift_generator(seq_length, shift=shift, batch_size=train_batch)
    test_gen = bit_shift_generator(seq_length, shift=shift, batch_size=test_batch)

    @tf.function
    def update_step(inputs, targets):
        model_out = model(
            {"obs": inputs},
            state=[tf.reshape(inputs, [-1, seq_length, 1])],
            seq_lens=np.full(shape=(train_batch,), fill_value=seq_length),
        )
        optimizer.minimize(
            lambda: train_loss(targets, model_out), lambda: model.trainable_variables
        )

    for i, (inputs, targets) in zip(range(num_iterations), data_gen):
        inputs_in = np.reshape(inputs, [-1, 1])
        targets_in = np.reshape(targets, [-1])
        update_step(tf.convert_to_tensor(inputs_in), tf.convert_to_tensor(targets_in))

        if i % print_every_n == 0:
            test_inputs, test_targets = next(test_gen)
            print(i, train_loss(test_targets, model(test_inputs)))


if __name__ == "__main__":
    tf.enable_eager_execution()
    train_bit_shift(
        seq_length=20,
        num_iterations=2000,
        print_every_n=200,
    )
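The generator in the file above builds its supervised targets by rolling each bit sequence forward by `shift` steps and zeroing the first `shift` positions (where no earlier bit exists to copy). A minimal, framework-free sketch of that target construction (the helper name `make_targets` is illustrative, not from the file):

```python
import numpy as np

def make_targets(seq, shift):
    # Roll each sequence forward by `shift` along the time axis, then zero
    # the first `shift` positions, where there is nothing to copy yet.
    targets = np.squeeze(np.roll(seq, shift, axis=1).astype(np.int32))
    targets[:, :shift] = 0
    return targets

# One batch of 2 sequences, length 5, one feature channel each.
seq = np.array(
    [[[1.0], [0.0], [1.0], [1.0], [0.0]],
     [[0.0], [1.0], [0.0], [0.0], [1.0]]],
    dtype=np.float32,
)
targets = make_targets(seq, shift=2)
print(targets)
# [[0 0 1 0 1]
#  [0 0 0 1 0]]
```

Note that `np.roll` wraps the last `shift` bits around to the front, which is exactly why the first `shift` target positions are overwritten with 0.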
.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/__init__.py
ADDED: file without changes

.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/__pycache__/__init__.cpython-311.pyc
ADDED: binary file (213 Bytes)

.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/__pycache__/action_mask_model.cpython-311.pyc
ADDED: binary file (5.32 kB)

.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/__pycache__/autoregressive_action_dist.cpython-311.pyc
ADDED: binary file (9.65 kB)

.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/__pycache__/autoregressive_action_model.cpython-311.pyc
ADDED: binary file (8.05 kB)

.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/__pycache__/centralized_critic_models.cpython-311.pyc
ADDED: binary file (10.5 kB)

.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/__pycache__/custom_loss_model.cpython-311.pyc
ADDED: binary file (8.29 kB)

.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/__pycache__/fast_model.cpython-311.pyc
ADDED: binary file (5.56 kB)

.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/__pycache__/mobilenet_v2_encoder.cpython-311.pyc
ADDED: binary file (3.04 kB)

.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/__pycache__/mobilenet_v2_with_lstm_models.cpython-311.pyc
ADDED: binary file (9.69 kB)

.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/__pycache__/parametric_actions_model.cpython-311.pyc
ADDED: binary file (8.57 kB)

.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/__pycache__/shared_weights_model.cpython-311.pyc
ADDED: binary file (11.1 kB)

.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/__pycache__/simple_rpg_model.cpython-311.pyc
ADDED: binary file (4.19 kB)
.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/action_mask_model.py
ADDED (126 lines)

# @OldAPIStack
from gymnasium.spaces import Dict

from ray.rllib.models.tf.fcnet import FullyConnectedNetwork
from ray.rllib.models.tf.tf_modelv2 import TFModelV2
from ray.rllib.models.torch.torch_modelv2 import TorchModelV2
from ray.rllib.models.torch.fcnet import FullyConnectedNetwork as TorchFC
from ray.rllib.utils.framework import try_import_tf, try_import_torch
from ray.rllib.utils.torch_utils import FLOAT_MIN

tf1, tf, tfv = try_import_tf()
torch, nn = try_import_torch()


class ActionMaskModel(TFModelV2):
    """Model that handles simple discrete action masking.

    This assumes the outputs are logits for a single Categorical action dist.
    Getting this to work with a more complex output (e.g., if the action space
    is a tuple of several distributions) is also possible but left as an
    exercise to the reader.
    """

    def __init__(
        self, obs_space, action_space, num_outputs, model_config, name, **kwargs
    ):
        orig_space = getattr(obs_space, "original_space", obs_space)
        assert (
            isinstance(orig_space, Dict)
            and "action_mask" in orig_space.spaces
            and "observations" in orig_space.spaces
        )

        super().__init__(obs_space, action_space, num_outputs, model_config, name)

        self.internal_model = FullyConnectedNetwork(
            orig_space["observations"],
            action_space,
            num_outputs,
            model_config,
            name + "_internal",
        )

        # disable action masking --> will likely lead to invalid actions
        self.no_masking = model_config["custom_model_config"].get("no_masking", False)

    def forward(self, input_dict, state, seq_lens):
        # Extract the available actions tensor from the observation.
        action_mask = input_dict["obs"]["action_mask"]

        # Compute the unmasked logits.
        logits, _ = self.internal_model({"obs": input_dict["obs"]["observations"]})

        # If action masking is disabled, directly return unmasked logits.
        if self.no_masking:
            return logits, state

        # Convert action_mask into a [0.0 || -inf]-type mask.
        inf_mask = tf.maximum(tf.math.log(action_mask), tf.float32.min)
        masked_logits = logits + inf_mask

        # Return masked logits.
        return masked_logits, state

    def value_function(self):
        return self.internal_model.value_function()


class TorchActionMaskModel(TorchModelV2, nn.Module):
    """PyTorch version of above ActionMaskingModel."""

    def __init__(
        self,
        obs_space,
        action_space,
        num_outputs,
        model_config,
        name,
        **kwargs,
    ):
        orig_space = getattr(obs_space, "original_space", obs_space)
        assert (
            isinstance(orig_space, Dict)
            and "action_mask" in orig_space.spaces
            and "observations" in orig_space.spaces
        )

        TorchModelV2.__init__(
            self, obs_space, action_space, num_outputs, model_config, name, **kwargs
        )
        nn.Module.__init__(self)

        self.internal_model = TorchFC(
            orig_space["observations"],
            action_space,
            num_outputs,
            model_config,
            name + "_internal",
        )

        # disable action masking --> will likely lead to invalid actions
        self.no_masking = False
        if "no_masking" in model_config["custom_model_config"]:
            self.no_masking = model_config["custom_model_config"]["no_masking"]

    def forward(self, input_dict, state, seq_lens):
        # Extract the available actions tensor from the observation.
        action_mask = input_dict["obs"]["action_mask"]

        # Compute the unmasked logits.
        logits, _ = self.internal_model({"obs": input_dict["obs"]["observations"]})

        # If action masking is disabled, directly return unmasked logits.
        if self.no_masking:
            return logits, state

        # Convert action_mask into a [0.0 || -inf]-type mask.
        inf_mask = torch.clamp(torch.log(action_mask), min=FLOAT_MIN)
        masked_logits = logits + inf_mask

        # Return masked logits.
        return masked_logits, state

    def value_function(self):
        return self.internal_model.value_function()
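Both model variants in the file above rely on the same trick: `log(mask)` maps allowed actions (mask = 1) to 0 and forbidden ones (mask = 0) to -inf, which is clamped to the dtype's minimum and added to the logits, so softmax assigns the forbidden actions (effectively) zero probability. A framework-free NumPy sketch of that step (the helper name `mask_logits` is illustrative):

```python
import numpy as np

FLOAT_MIN = np.finfo(np.float32).min

def mask_logits(logits, action_mask):
    # log(1) = 0 leaves allowed logits untouched; log(0) = -inf is clamped
    # to FLOAT_MIN to avoid NaNs from inf arithmetic downstream.
    with np.errstate(divide="ignore"):
        inf_mask = np.maximum(np.log(action_mask), FLOAT_MIN)
    return logits + inf_mask

logits = np.array([2.0, 1.0, 0.5], dtype=np.float32)
mask = np.array([1.0, 0.0, 1.0], dtype=np.float32)  # action 1 is forbidden
masked = mask_logits(logits, mask)

# Softmax over the masked logits: the forbidden action's probability vanishes.
probs = np.exp(masked - masked.max())
probs /= probs.sum()
```

Clamping to `FLOAT_MIN` rather than using -inf directly is what keeps gradients and log-prob computations finite in the RL loss.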
.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/autoregressive_action_dist.py
ADDED (149 lines)

# @OldAPIStack
from ray.rllib.models.tf.tf_action_dist import Categorical, ActionDistribution
from ray.rllib.models.torch.torch_action_dist import (
    TorchCategorical,
    TorchDistributionWrapper,
)
from ray.rllib.utils.framework import try_import_tf, try_import_torch

tf1, tf, tfv = try_import_tf()
torch, nn = try_import_torch()


class BinaryAutoregressiveDistribution(ActionDistribution):
    """Action distribution P(a1, a2) = P(a1) * P(a2 | a1)"""

    def deterministic_sample(self):
        # First, sample a1.
        a1_dist = self._a1_distribution()
        a1 = a1_dist.deterministic_sample()

        # Sample a2 conditioned on a1.
        a2_dist = self._a2_distribution(a1)
        a2 = a2_dist.deterministic_sample()
        self._action_logp = a1_dist.logp(a1) + a2_dist.logp(a2)

        # Return the action tuple.
        return (a1, a2)

    def sample(self):
        # First, sample a1.
        a1_dist = self._a1_distribution()
        a1 = a1_dist.sample()

        # Sample a2 conditioned on a1.
        a2_dist = self._a2_distribution(a1)
        a2 = a2_dist.sample()
        self._action_logp = a1_dist.logp(a1) + a2_dist.logp(a2)

        # Return the action tuple.
        return (a1, a2)

    def logp(self, actions):
        a1, a2 = actions[:, 0], actions[:, 1]
        a1_vec = tf.expand_dims(tf.cast(a1, tf.float32), 1)
        a1_logits, a2_logits = self.model.action_model([self.inputs, a1_vec])
        return Categorical(a1_logits).logp(a1) + Categorical(a2_logits).logp(a2)

    def sampled_action_logp(self):
        return self._action_logp

    def entropy(self):
        a1_dist = self._a1_distribution()
        a2_dist = self._a2_distribution(a1_dist.sample())
        return a1_dist.entropy() + a2_dist.entropy()

    def kl(self, other):
        a1_dist = self._a1_distribution()
        a1_terms = a1_dist.kl(other._a1_distribution())

        a1 = a1_dist.sample()
        a2_terms = self._a2_distribution(a1).kl(other._a2_distribution(a1))
        return a1_terms + a2_terms

    def _a1_distribution(self):
        BATCH = tf.shape(self.inputs)[0]
        a1_logits, _ = self.model.action_model([self.inputs, tf.zeros((BATCH, 1))])
        a1_dist = Categorical(a1_logits)
        return a1_dist

    def _a2_distribution(self, a1):
        a1_vec = tf.expand_dims(tf.cast(a1, tf.float32), 1)
        _, a2_logits = self.model.action_model([self.inputs, a1_vec])
        a2_dist = Categorical(a2_logits)
        return a2_dist

    @staticmethod
    def required_model_output_shape(action_space, model_config):
        return 16  # controls model output feature vector size


class TorchBinaryAutoregressiveDistribution(TorchDistributionWrapper):
    """Action distribution P(a1, a2) = P(a1) * P(a2 | a1)"""

    def deterministic_sample(self):
        # First, sample a1.
        a1_dist = self._a1_distribution()
        a1 = a1_dist.deterministic_sample()

        # Sample a2 conditioned on a1.
        a2_dist = self._a2_distribution(a1)
        a2 = a2_dist.deterministic_sample()
        self._action_logp = a1_dist.logp(a1) + a2_dist.logp(a2)

        # Return the action tuple.
        return (a1, a2)

    def sample(self):
        # First, sample a1.
        a1_dist = self._a1_distribution()
        a1 = a1_dist.sample()

        # Sample a2 conditioned on a1.
        a2_dist = self._a2_distribution(a1)
        a2 = a2_dist.sample()
        self._action_logp = a1_dist.logp(a1) + a2_dist.logp(a2)

        # Return the action tuple.
        return (a1, a2)

    def logp(self, actions):
        a1, a2 = actions[:, 0], actions[:, 1]
        a1_vec = torch.unsqueeze(a1.float(), 1)
        a1_logits, a2_logits = self.model.action_module(self.inputs, a1_vec)
        return TorchCategorical(a1_logits).logp(a1) + TorchCategorical(a2_logits).logp(
            a2
        )

    def sampled_action_logp(self):
        return self._action_logp

    def entropy(self):
        a1_dist = self._a1_distribution()
        a2_dist = self._a2_distribution(a1_dist.sample())
        return a1_dist.entropy() + a2_dist.entropy()

    def kl(self, other):
        a1_dist = self._a1_distribution()
        a1_terms = a1_dist.kl(other._a1_distribution())

        a1 = a1_dist.sample()
        a2_terms = self._a2_distribution(a1).kl(other._a2_distribution(a1))
        return a1_terms + a2_terms

    def _a1_distribution(self):
        BATCH = self.inputs.shape[0]
        zeros = torch.zeros((BATCH, 1)).to(self.inputs.device)
        a1_logits, _ = self.model.action_module(self.inputs, zeros)
        a1_dist = TorchCategorical(a1_logits)
        return a1_dist

    def _a2_distribution(self, a1):
        a1_vec = torch.unsqueeze(a1.float(), 1)
        _, a2_logits = self.model.action_module(self.inputs, a1_vec)
        a2_dist = TorchCategorical(a2_logits)
        return a2_dist

    @staticmethod
    def required_model_output_shape(action_space, model_config):
        return 16  # controls model output feature vector size
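The distribution in the file above factorizes the joint as P(a1, a2) = P(a1) * P(a2 | a1), so the joint log-prob is simply the sum of the two conditional log-probs (as in its `logp` and `sampled_action_logp`). A minimal NumPy sketch of that chain-rule factorization with hard-coded toy tables (numbers and names are illustrative, not from RLlib):

```python
import numpy as np

# Toy distributions over two binary actions: P(a1) and P(a2 | a1).
p_a1 = np.array([0.75, 0.25])
p_a2_given_a1 = np.array([[0.6, 0.4],   # conditioned on a1 = 0
                          [0.1, 0.9]])  # conditioned on a1 = 1

def joint_logp(a1, a2):
    # log P(a1, a2) = log P(a1) + log P(a2 | a1)
    return np.log(p_a1[a1]) + np.log(p_a2_given_a1[a1, a2])

# Exponentiating and summing over all (a1, a2) pairs must give 1,
# i.e. the factorization defines a valid joint distribution.
total = sum(np.exp(joint_logp(a1, a2)) for a1 in (0, 1) for a2 in (0, 1))
```

This is why the example can reuse two plain `Categorical` heads: sampling a1 first, then feeding it into the a2 head, implements the joint without ever materializing a 4-way distribution.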
.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/autoregressive_action_model.py
ADDED
|
@@ -0,0 +1,162 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# @OldAPIStack
|
| 2 |
+
from gymnasium.spaces import Discrete, Tuple
|
| 3 |
+
|
| 4 |
+
from ray.rllib.models.tf.misc import normc_initializer
|
| 5 |
+
from ray.rllib.models.tf.tf_modelv2 import TFModelV2
|
| 6 |
+
from ray.rllib.models.torch.misc import normc_initializer as normc_init_torch
|
| 7 |
+
from ray.rllib.models.torch.misc import SlimFC
|
| 8 |
+
from ray.rllib.models.torch.torch_modelv2 import TorchModelV2
|
| 9 |
+
from ray.rllib.utils.framework import try_import_tf, try_import_torch
|
| 10 |
+
|
| 11 |
+
tf1, tf, tfv = try_import_tf()
|
| 12 |
+
torch, nn = try_import_torch()
|
| 13 |
+
|
| 14 |
+
|
| 15 |
+
class AutoregressiveActionModel(TFModelV2):
|
| 16 |
+
"""Implements the `.action_model` branch required above."""
|
| 17 |
+
|
| 18 |
+
def __init__(self, obs_space, action_space, num_outputs, model_config, name):
|
| 19 |
+
super(AutoregressiveActionModel, self).__init__(
|
| 20 |
+
obs_space, action_space, num_outputs, model_config, name
|
| 21 |
+
)
|
| 22 |
+
if action_space != Tuple([Discrete(2), Discrete(2)]):
|
| 23 |
+
raise ValueError("This model only supports the [2, 2] action space")
|
| 24 |
+
|
| 25 |
+
# Inputs
|
| 26 |
+
obs_input = tf.keras.layers.Input(shape=obs_space.shape, name="obs_input")
|
| 27 |
+
a1_input = tf.keras.layers.Input(shape=(1,), name="a1_input")
|
| 28 |
+
ctx_input = tf.keras.layers.Input(shape=(num_outputs,), name="ctx_input")
|
| 29 |
+
|
| 30 |
+
# Output of the model (normally 'logits', but for an autoregressive
|
| 31 |
+
# dist this is more like a context/feature layer encoding the obs)
|
| 32 |
+
context = tf.keras.layers.Dense(
|
| 33 |
+
num_outputs,
|
| 34 |
+
name="hidden",
|
| 35 |
+
activation=tf.nn.tanh,
|
| 36 |
+
kernel_initializer=normc_initializer(1.0),
|
| 37 |
+
)(obs_input)
|
| 38 |
+
|
| 39 |
+
# V(s)
|
| 40 |
+
value_out = tf.keras.layers.Dense(
|
| 41 |
+
1,
|
| 42 |
+
name="value_out",
|
| 43 |
+
activation=None,
|
| 44 |
+
kernel_initializer=normc_initializer(0.01),
|
| 45 |
+
)(context)
|
| 46 |
+
|
| 47 |
+
# P(a1 | obs)
|
| 48 |
+
a1_logits = tf.keras.layers.Dense(
|
| 49 |
+
2,
|
| 50 |
+
name="a1_logits",
|
| 51 |
+
activation=None,
|
| 52 |
+
kernel_initializer=normc_initializer(0.01),
|
| 53 |
+
)(ctx_input)
|
| 54 |
+
|
| 55 |
+
# P(a2 | a1)
|
| 56 |
+
# --note: typically you'd want to implement P(a2 | a1, obs) as follows:
|
| 57 |
+
# a2_context = tf.keras.layers.Concatenate(axis=1)(
|
| 58 |
+
# [ctx_input, a1_input])
|
| 59 |
+
a2_context = a1_input
|
| 60 |
+
a2_hidden = tf.keras.layers.Dense(
|
| 61 |
+
16,
|
| 62 |
+
name="a2_hidden",
|
| 63 |
+
activation=tf.nn.tanh,
|
| 64 |
+
kernel_initializer=normc_initializer(1.0),
|
| 65 |
+
)(a2_context)
|
| 66 |
+
a2_logits = tf.keras.layers.Dense(
|
| 67 |
+
2,
|
| 68 |
+
name="a2_logits",
|
| 69 |
+
activation=None,
|
| 70 |
+
kernel_initializer=normc_initializer(0.01),
|
| 71 |
+
)(a2_hidden)
|
| 72 |
+
|
| 73 |
+
# Base layers
|
| 74 |
+
self.base_model = tf.keras.Model(obs_input, [context, value_out])
|
| 75 |
+
self.base_model.summary()
|
| 76 |
+
|
| 77 |
+
# Autoregressive action sampler
|
| 78 |
+
self.action_model = tf.keras.Model(
|
| 79 |
+
[ctx_input, a1_input], [a1_logits, a2_logits]
|
| 80 |
+
)
|
| 81 |
+
self.action_model.summary()
|
| 82 |
+
|
| 83 |
+
def forward(self, input_dict, state, seq_lens):
|
| 84 |
+
context, self._value_out = self.base_model(input_dict["obs"])
|
| 85 |
+
        return context, state

    def value_function(self):
        return tf.reshape(self._value_out, [-1])


class TorchAutoregressiveActionModel(TorchModelV2, nn.Module):
    """PyTorch version of the AutoregressiveActionModel above."""

    def __init__(self, obs_space, action_space, num_outputs, model_config, name):
        TorchModelV2.__init__(
            self, obs_space, action_space, num_outputs, model_config, name
        )
        nn.Module.__init__(self)

        if action_space != Tuple([Discrete(2), Discrete(2)]):
            raise ValueError("This model only supports the [2, 2] action space")

        # Output of the model (normally 'logits', but for an autoregressive
        # dist this is more like a context/feature layer encoding the obs)
        self.context_layer = SlimFC(
            in_size=obs_space.shape[0],
            out_size=num_outputs,
            initializer=normc_init_torch(1.0),
            activation_fn=nn.Tanh,
        )

        # V(s)
        self.value_branch = SlimFC(
            in_size=num_outputs,
            out_size=1,
            initializer=normc_init_torch(0.01),
            activation_fn=None,
        )

        # P(a1 | obs)
        self.a1_logits = SlimFC(
            in_size=num_outputs,
            out_size=2,
            activation_fn=None,
            initializer=normc_init_torch(0.01),
        )

        class _ActionModel(nn.Module):
            def __init__(self):
                nn.Module.__init__(self)
                self.a2_hidden = SlimFC(
                    in_size=1,
                    out_size=16,
                    activation_fn=nn.Tanh,
                    initializer=normc_init_torch(1.0),
                )
                self.a2_logits = SlimFC(
                    in_size=16,
                    out_size=2,
                    activation_fn=None,
                    initializer=normc_init_torch(0.01),
                )

            def forward(self_, ctx_input, a1_input):
                a1_logits = self.a1_logits(ctx_input)
                a2_logits = self_.a2_logits(self_.a2_hidden(a1_input))
                return a1_logits, a2_logits

        # P(a2 | a1)
        # --note: typically you'd want to implement P(a2 | a1, obs) as follows:
        # a2_context = tf.keras.layers.Concatenate(axis=1)(
        #     [ctx_input, a1_input])
        self.action_module = _ActionModel()

        self._context = None

    def forward(self, input_dict, state, seq_lens):
        self._context = self.context_layer(input_dict["obs"])
        return self._context, state

    def value_function(self):
        return torch.reshape(self.value_branch(self._context), [-1])
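The autoregressive flow these models feed (first sample a1 from the obs context, then sample a2 conditioned on a1) can be sketched framework-free. This is a toy illustration only: `argmax_index`, `a1_logits`, and `a2_logits` below are hypothetical stand-ins for the distribution/head objects, not RLlib APIs, and `a2` here depends only on `a1`, matching the simplified example model.

```python
def argmax_index(logits):
    # Deterministic stand-in for sampling from a categorical distribution.
    return max(range(len(logits)), key=lambda i: logits[i])

def a1_logits(context):
    # Hypothetical linear head for P(a1 | obs-context).
    return [sum(context), -sum(context)]

def a2_logits(a1):
    # Hypothetical head for P(a2 | a1): conditioned only on a1, as in the
    # simplified example model above (not on the obs context).
    return [1.0, -1.0] if a1 == 0 else [-1.0, 1.0]

def sample_actions(context):
    # Autoregressive order: a1 first, then a2 given a1.
    a1 = argmax_index(a1_logits(context))
    a2 = argmax_index(a2_logits(a1))
    return a1, a2

print(sample_actions([0.5, 0.25]))  # -> (0, 0)
```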
.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/centralized_critic_models.py
ADDED
@@ -0,0 +1,182 @@
# @OldAPIStack
from gymnasium.spaces import Box

from ray.rllib.models.modelv2 import ModelV2
from ray.rllib.models.tf.tf_modelv2 import TFModelV2
from ray.rllib.models.tf.fcnet import FullyConnectedNetwork
from ray.rllib.models.torch.misc import SlimFC
from ray.rllib.models.torch.torch_modelv2 import TorchModelV2
from ray.rllib.models.torch.fcnet import FullyConnectedNetwork as TorchFC
from ray.rllib.utils.annotations import override
from ray.rllib.utils.framework import try_import_tf, try_import_torch

tf1, tf, tfv = try_import_tf()
torch, nn = try_import_torch()


class CentralizedCriticModel(TFModelV2):
    """Multi-agent model that implements a centralized value function."""

    def __init__(self, obs_space, action_space, num_outputs, model_config, name):
        super(CentralizedCriticModel, self).__init__(
            obs_space, action_space, num_outputs, model_config, name
        )
        # Base of the model
        self.model = FullyConnectedNetwork(
            obs_space, action_space, num_outputs, model_config, name
        )

        # Central VF maps (obs, opp_obs, opp_act) -> vf_pred
        obs = tf.keras.layers.Input(shape=(6,), name="obs")
        opp_obs = tf.keras.layers.Input(shape=(6,), name="opp_obs")
        opp_act = tf.keras.layers.Input(shape=(2,), name="opp_act")
        concat_obs = tf.keras.layers.Concatenate(axis=1)([obs, opp_obs, opp_act])
        central_vf_dense = tf.keras.layers.Dense(
            16, activation=tf.nn.tanh, name="c_vf_dense"
        )(concat_obs)
        central_vf_out = tf.keras.layers.Dense(1, activation=None, name="c_vf_out")(
            central_vf_dense
        )
        self.central_vf = tf.keras.Model(
            inputs=[obs, opp_obs, opp_act], outputs=central_vf_out
        )

    @override(ModelV2)
    def forward(self, input_dict, state, seq_lens):
        return self.model.forward(input_dict, state, seq_lens)

    def central_value_function(self, obs, opponent_obs, opponent_actions):
        return tf.reshape(
            self.central_vf(
                [obs, opponent_obs, tf.one_hot(tf.cast(opponent_actions, tf.int32), 2)]
            ),
            [-1],
        )

    @override(ModelV2)
    def value_function(self):
        return self.model.value_function()  # not used


class YetAnotherCentralizedCriticModel(TFModelV2):
    """Multi-agent model that implements a centralized value function.

    It assumes the observation is a dict with 'own_obs' and 'opponent_obs', the
    former of which can be used for computing actions (i.e., decentralized
    execution), and the latter for optimization (i.e., centralized learning).

    This model has two parts:
    - An action model that looks at just 'own_obs' to compute actions
    - A value model that also looks at the 'opponent_obs' / 'opponent_action'
      to compute the value (it does this by using the 'obs_flat' tensor).
    """

    def __init__(self, obs_space, action_space, num_outputs, model_config, name):
        super(YetAnotherCentralizedCriticModel, self).__init__(
            obs_space, action_space, num_outputs, model_config, name
        )

        self.action_model = FullyConnectedNetwork(
            Box(low=0, high=1, shape=(6,)),  # one-hot encoded Discrete(6)
            action_space,
            num_outputs,
            model_config,
            name + "_action",
        )

        self.value_model = FullyConnectedNetwork(
            obs_space, action_space, 1, model_config, name + "_vf"
        )

    def forward(self, input_dict, state, seq_lens):
        self._value_out, _ = self.value_model(
            {"obs": input_dict["obs_flat"]}, state, seq_lens
        )
        return self.action_model({"obs": input_dict["obs"]["own_obs"]}, state, seq_lens)

    def value_function(self):
        return tf.reshape(self._value_out, [-1])


class TorchCentralizedCriticModel(TorchModelV2, nn.Module):
    """Multi-agent model that implements a centralized VF."""

    def __init__(self, obs_space, action_space, num_outputs, model_config, name):
        TorchModelV2.__init__(
            self, obs_space, action_space, num_outputs, model_config, name
        )
        nn.Module.__init__(self)

        # Base of the model
        self.model = TorchFC(obs_space, action_space, num_outputs, model_config, name)

        # Central VF maps (obs, opp_obs, opp_act) -> vf_pred
        input_size = 6 + 6 + 2  # obs + opp_obs + opp_act
        self.central_vf = nn.Sequential(
            SlimFC(input_size, 16, activation_fn=nn.Tanh),
            SlimFC(16, 1),
        )

    @override(ModelV2)
    def forward(self, input_dict, state, seq_lens):
        model_out, _ = self.model(input_dict, state, seq_lens)
        return model_out, []

    def central_value_function(self, obs, opponent_obs, opponent_actions):
        input_ = torch.cat(
            [
                obs,
                opponent_obs,
                torch.nn.functional.one_hot(opponent_actions.long(), 2).float(),
            ],
            1,
        )
        return torch.reshape(self.central_vf(input_), [-1])

    @override(ModelV2)
    def value_function(self):
        return self.model.value_function()  # not used


class YetAnotherTorchCentralizedCriticModel(TorchModelV2, nn.Module):
    """Multi-agent model that implements a centralized value function.

    It assumes the observation is a dict with 'own_obs' and 'opponent_obs', the
    former of which can be used for computing actions (i.e., decentralized
    execution), and the latter for optimization (i.e., centralized learning).

    This model has two parts:
    - An action model that looks at just 'own_obs' to compute actions
    - A value model that also looks at the 'opponent_obs' / 'opponent_action'
      to compute the value (it does this by using the 'obs_flat' tensor).
    """

    def __init__(self, obs_space, action_space, num_outputs, model_config, name):
        TorchModelV2.__init__(
            self, obs_space, action_space, num_outputs, model_config, name
        )
        nn.Module.__init__(self)

        self.action_model = TorchFC(
            Box(low=0, high=1, shape=(6,)),  # one-hot encoded Discrete(6)
            action_space,
            num_outputs,
            model_config,
            name + "_action",
        )

        self.value_model = TorchFC(
            obs_space, action_space, 1, model_config, name + "_vf"
        )
        self._model_in = None

    def forward(self, input_dict, state, seq_lens):
        # Store model-input for possible `value_function()` call.
        self._model_in = [input_dict["obs_flat"], state, seq_lens]
        return self.action_model({"obs": input_dict["obs"]["own_obs"]}, state, seq_lens)

    def value_function(self):
        value_out, _ = self.value_model(
            {"obs": self._model_in[0]}, self._model_in[1], self._model_in[2]
        )
        return torch.reshape(value_out, [-1])
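Both the TF and torch centralized critics above feed the value net a flat concatenation of own observation (6 dims), opponent observation (6 dims), and the opponent's action one-hot encoded over 2 choices, i.e. a 14-dim input. A minimal plain-Python sketch of that input layout (`central_vf_input` and `one_hot` are illustrative helpers, not RLlib functions):

```python
def one_hot(action, n):
    # One-hot encode a discrete action index over n choices.
    v = [0.0] * n
    v[action] = 1.0
    return v

def central_vf_input(obs, opp_obs, opp_action):
    # (obs, opp_obs, opp_act) -> flat critic input: 6 + 6 + 2 = 14 dims,
    # mirroring the Concatenate / torch.cat calls in the models above.
    return list(obs) + list(opp_obs) + one_hot(opp_action, 2)

x = central_vf_input([0.0] * 6, [1.0] * 6, 1)
print(len(x))  # -> 14
```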
.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/custom_loss_model.py
ADDED
@@ -0,0 +1,137 @@
import numpy as np

from ray.rllib.models.modelv2 import ModelV2, restore_original_dimensions
from ray.rllib.models.tf.tf_action_dist import Categorical
from ray.rllib.models.tf.tf_modelv2 import TFModelV2
from ray.rllib.models.tf.fcnet import FullyConnectedNetwork
from ray.rllib.models.torch.torch_action_dist import TorchCategorical
from ray.rllib.models.torch.torch_modelv2 import TorchModelV2
from ray.rllib.models.torch.fcnet import FullyConnectedNetwork as TorchFC
from ray.rllib.utils.annotations import override
from ray.rllib.utils.framework import try_import_tf, try_import_torch
from ray.rllib.offline import JsonReader

tf1, tf, tfv = try_import_tf()
torch, nn = try_import_torch()


class CustomLossModel(TFModelV2):
    """Custom model that adds an imitation loss on top of the policy loss."""

    def __init__(self, obs_space, action_space, num_outputs, model_config, name):
        super().__init__(obs_space, action_space, num_outputs, model_config, name)

        self.fcnet = FullyConnectedNetwork(
            self.obs_space, self.action_space, num_outputs, model_config, name="fcnet"
        )

    @override(ModelV2)
    def forward(self, input_dict, state, seq_lens):
        # Delegate to our FCNet.
        return self.fcnet(input_dict, state, seq_lens)

    @override(ModelV2)
    def value_function(self):
        # Delegate to our FCNet.
        return self.fcnet.value_function()

    @override(ModelV2)
    def custom_loss(self, policy_loss, loss_inputs):
        # Create a new input reader per worker.
        reader = JsonReader(self.model_config["custom_model_config"]["input_files"])
        input_ops = reader.tf_input_ops()

        # Define a secondary loss by building a graph copy with weight sharing.
        obs = restore_original_dimensions(
            tf.cast(input_ops["obs"], tf.float32), self.obs_space
        )
        logits, _ = self.forward({"obs": obs}, [], None)

        # Compute the IL loss.
        action_dist = Categorical(logits, self.model_config)
        self.policy_loss = policy_loss
        self.imitation_loss = tf.reduce_mean(-action_dist.logp(input_ops["actions"]))
        return policy_loss + 10 * self.imitation_loss

    def metrics(self):
        return {
            "policy_loss": self.policy_loss,
            "imitation_loss": self.imitation_loss,
        }


class TorchCustomLossModel(TorchModelV2, nn.Module):
    """PyTorch version of the CustomLossModel above."""

    def __init__(
        self, obs_space, action_space, num_outputs, model_config, name, input_files
    ):
        super().__init__(obs_space, action_space, num_outputs, model_config, name)
        nn.Module.__init__(self)

        self.input_files = input_files
        # Create a new input reader per worker.
        self.reader = JsonReader(self.input_files)
        self.fcnet = TorchFC(
            self.obs_space, self.action_space, num_outputs, model_config, name="fcnet"
        )

    @override(ModelV2)
    def forward(self, input_dict, state, seq_lens):
        # Delegate to our FCNet.
        return self.fcnet(input_dict, state, seq_lens)

    @override(ModelV2)
    def value_function(self):
        # Delegate to our FCNet.
        return self.fcnet.value_function()

    @override(ModelV2)
    def custom_loss(self, policy_loss, loss_inputs):
        """Calculates a custom loss on top of the given policy_loss(es).

        Args:
            policy_loss (List[TensorType]): The list of already calculated
                policy losses (as many as there are optimizers).
            loss_inputs: Struct of np.ndarrays holding the
                entire train batch.

        Returns:
            List[TensorType]: The altered list of policy losses. In case the
                custom loss should have its own optimizer, make sure the
                returned list is one larger than the incoming policy_loss list.
                In case you simply want to mix in the custom loss into the
                already calculated policy losses, return a list of altered
                policy losses (as done in this example below).
        """
        # Get the next batch from our input files.
        batch = self.reader.next()

        # Define a secondary loss by building a graph copy with weight sharing.
        obs = restore_original_dimensions(
            torch.from_numpy(batch["obs"]).float().to(policy_loss[0].device),
            self.obs_space,
            tensorlib="torch",
        )
        logits, _ = self.forward({"obs": obs}, [], None)

        # Compute the IL loss.
        action_dist = TorchCategorical(logits, self.model_config)
        imitation_loss = torch.mean(
            -action_dist.logp(
                torch.from_numpy(batch["actions"]).to(policy_loss[0].device)
            )
        )
        self.imitation_loss_metric = imitation_loss.item()
        self.policy_loss_metric = np.mean([loss.item() for loss in policy_loss])

        # Add the imitation loss to each already calculated policy loss term.
        # Alternatively (if custom loss has its own optimizer):
        # return policy_loss + [10 * self.imitation_loss]
        return [loss_ + 10 * imitation_loss for loss_ in policy_loss]

    def metrics(self):
        return {
            "policy_loss": self.policy_loss_metric,
            "imitation_loss": self.imitation_loss_metric,
        }
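The key arithmetic in both `custom_loss()` implementations above is mixing a weighted imitation-loss term into each already-computed policy-loss term. A framework-free sketch of just that mixing step (`combine_losses` is an illustrative helper, not an RLlib function; the weight of 10 matches the hard-coded factor above):

```python
def combine_losses(policy_losses, imitation_loss, weight=10.0):
    # Mix the imitation loss into every policy-loss term, as the
    # custom_loss() methods above do with their hard-coded weight of 10.
    return [pl + weight * imitation_loss for pl in policy_losses]

print(combine_losses([1.0, 2.0], 0.5))  # -> [6.0, 7.0]
```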
.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/fast_model.py
ADDED
@@ -0,0 +1,80 @@
# @OldAPIStack
from ray.rllib.models.modelv2 import ModelV2
from ray.rllib.models.tf.tf_modelv2 import TFModelV2
from ray.rllib.models.torch.misc import SlimFC
from ray.rllib.models.torch.torch_modelv2 import TorchModelV2
from ray.rllib.utils.annotations import override
from ray.rllib.utils.framework import try_import_tf, try_import_torch

tf1, tf, tfv = try_import_tf()
torch, nn = try_import_torch()


class FastModel(TFModelV2):
    """An example for a non-Keras ModelV2 in tf that learns a single weight.

    Defines all network architecture in `forward` (not `__init__` as it's
    usually done for Keras-style TFModelV2s).
    """

    def __init__(self, obs_space, action_space, num_outputs, model_config, name):
        super().__init__(obs_space, action_space, num_outputs, model_config, name)
        # Have we registered our vars yet (see `forward`)?
        self._registered = False

    @override(ModelV2)
    def forward(self, input_dict, state, seq_lens):
        with tf1.variable_scope("model", reuse=tf1.AUTO_REUSE):
            bias = tf1.get_variable(
                dtype=tf.float32,
                name="bias",
                initializer=tf.keras.initializers.Zeros(),
                shape=(),
            )
            output = bias + tf.zeros([tf.shape(input_dict["obs"])[0], self.num_outputs])
            self._value_out = tf.reduce_mean(output, -1)  # fake value

        if not self._registered:
            self.register_variables(
                tf1.get_collection(
                    tf1.GraphKeys.TRAINABLE_VARIABLES, scope=".+/model/.+"
                )
            )
            self._registered = True

        return output, []

    @override(ModelV2)
    def value_function(self):
        return tf.reshape(self._value_out, [-1])


class TorchFastModel(TorchModelV2, nn.Module):
    """Torch version of FastModel (tf)."""

    def __init__(self, obs_space, action_space, num_outputs, model_config, name):
        TorchModelV2.__init__(
            self, obs_space, action_space, num_outputs, model_config, name
        )
        nn.Module.__init__(self)

        self.bias = nn.Parameter(
            torch.tensor([0.0], dtype=torch.float32, requires_grad=True)
        )

        # Only needed to give some params to the optimizer (even though,
        # they are never used anywhere).
        self.dummy_layer = SlimFC(1, 1)
        self._output = None

    @override(ModelV2)
    def forward(self, input_dict, state, seq_lens):
        self._output = self.bias + torch.zeros(
            size=(input_dict["obs"].shape[0], self.num_outputs)
        ).to(self.bias.device)
        return self._output, []

    @override(ModelV2)
    def value_function(self):
        assert self._output is not None, "must call forward first!"
        return torch.reshape(torch.mean(self._output, -1), [-1])
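Both FastModel variants above do the same two things: broadcast one learned scalar to a `(batch, num_outputs)` output, and report the per-row mean of that output as a fake value estimate. A plain-Python sketch of that computation (`fast_forward` is an illustrative helper, not part of RLlib):

```python
def fast_forward(batch_size, num_outputs, bias):
    # Broadcast a single learned scalar to (batch, num_outputs), like the
    # forward() methods above; the "value" is each row's mean.
    out = [[bias] * num_outputs for _ in range(batch_size)]
    values = [sum(row) / len(row) for row in out]
    return out, values

out, values = fast_forward(2, 3, 0.5)
print(values)  # -> [0.5, 0.5]
```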
.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/mobilenet_v2_encoder.py
ADDED
@@ -0,0 +1,48 @@
# @OldAPIStack
"""
This file implements a MobileNet v2 Encoder.
It uses MobileNet v2 to encode images into a latent space of 1000 dimensions.

Depending on the experiment, the MobileNet v2 encoder layers can be frozen or
unfrozen. This is controlled by the `freeze` parameter in the config.

This is an example of how a pre-trained neural network can be used as an encoder
in RLlib. You can modify this example to accommodate your own encoder network or
other pre-trained networks.
"""

from ray.rllib.core.models.base import Encoder, ENCODER_OUT
from ray.rllib.core.models.configs import ModelConfig
from ray.rllib.core.models.torch.base import TorchModel
from ray.rllib.utils.framework import try_import_torch

torch, nn = try_import_torch()

MOBILENET_INPUT_SHAPE = (3, 224, 224)


class MobileNetV2EncoderConfig(ModelConfig):
    # MobileNet v2 has a flat output with a length of 1000.
    output_dims = (1000,)
    freeze = True

    def build(self, framework):
        assert framework == "torch", "Unsupported framework `{}`!".format(framework)
        return MobileNetV2Encoder(self)


class MobileNetV2Encoder(TorchModel, Encoder):
    """A MobileNet v2 encoder for RLlib."""

    def __init__(self, config):
        super().__init__(config)
        self.net = torch.hub.load(
            "pytorch/vision:v0.6.0", "mobilenet_v2", pretrained=True
        )
        if config.freeze:
            # We don't want to train this encoder, so freeze its parameters!
            for p in self.net.parameters():
                p.requires_grad = False

    def _forward(self, input_dict, **kwargs):
        return {ENCODER_OUT: (self.net(input_dict["obs"]))}
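The config-to-model pattern above (a config object holds the hyperparameters and its `build(framework)` constructs the model) can be sketched without RLlib or torch. `EncoderConfig` below is a hypothetical stand-in that returns a plain dict instead of an encoder module:

```python
class EncoderConfig:
    # Mirrors MobileNetV2EncoderConfig's class-level hyperparameters.
    output_dims = (1000,)
    freeze = True

    def build(self, framework):
        # Same guard as above: only a "torch" build is supported.
        assert framework == "torch", "Unsupported framework `{}`!".format(framework)
        # Stand-in for constructing the encoder from this config.
        return {"output_dims": self.output_dims, "frozen": self.freeze}

enc = EncoderConfig().build("torch")
print(enc["frozen"])  # -> True
```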
.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/mobilenet_v2_with_lstm_models.py
ADDED
@@ -0,0 +1,160 @@
# @OldAPIStack
import numpy as np

from ray.rllib.models.modelv2 import ModelV2
from ray.rllib.models.tf.recurrent_net import RecurrentNetwork
from ray.rllib.models.torch.misc import SlimFC
from ray.rllib.models.torch.recurrent_net import RecurrentNetwork as TorchRNN
from ray.rllib.utils.annotations import override
from ray.rllib.utils.framework import try_import_tf, try_import_torch

tf1, tf, tfv = try_import_tf()
torch, nn = try_import_torch()


class MobileV2PlusRNNModel(RecurrentNetwork):
    """A conv. + recurrent keras net example using a pre-trained MobileNet."""

    def __init__(
        self, obs_space, action_space, num_outputs, model_config, name, cnn_shape
    ):
        super(MobileV2PlusRNNModel, self).__init__(
            obs_space, action_space, num_outputs, model_config, name
        )

        self.cell_size = 16
        visual_size = cnn_shape[0] * cnn_shape[1] * cnn_shape[2]

        state_in_h = tf.keras.layers.Input(shape=(self.cell_size,), name="h")
        state_in_c = tf.keras.layers.Input(shape=(self.cell_size,), name="c")
        seq_in = tf.keras.layers.Input(shape=(), name="seq_in", dtype=tf.int32)

        inputs = tf.keras.layers.Input(shape=(None, visual_size), name="visual_inputs")

        input_visual = inputs
        input_visual = tf.reshape(
            input_visual, [-1, cnn_shape[0], cnn_shape[1], cnn_shape[2]]
        )
        cnn_input = tf.keras.layers.Input(shape=cnn_shape, name="cnn_input")

        cnn_model = tf.keras.applications.mobilenet_v2.MobileNetV2(
            alpha=1.0,
            include_top=True,
            weights=None,
            input_tensor=cnn_input,
            pooling=None,
        )
        vision_out = cnn_model(input_visual)
        vision_out = tf.reshape(
            vision_out, [-1, tf.shape(inputs)[1], vision_out.shape.as_list()[-1]]
        )

        lstm_out, state_h, state_c = tf.keras.layers.LSTM(
            self.cell_size, return_sequences=True, return_state=True, name="lstm"
        )(
            inputs=vision_out,
            mask=tf.sequence_mask(seq_in),
            initial_state=[state_in_h, state_in_c],
        )

        # Postprocess LSTM output with another hidden layer and compute values.
        logits = tf.keras.layers.Dense(
            self.num_outputs, activation=tf.keras.activations.linear, name="logits"
        )(lstm_out)
        values = tf.keras.layers.Dense(1, activation=None, name="values")(lstm_out)

        # Create the RNN model
        self.rnn_model = tf.keras.Model(
            inputs=[inputs, seq_in, state_in_h, state_in_c],
            outputs=[logits, values, state_h, state_c],
        )
        self.rnn_model.summary()

    @override(RecurrentNetwork)
    def forward_rnn(self, inputs, state, seq_lens):
        model_out, self._value_out, h, c = self.rnn_model([inputs, seq_lens] + state)
        return model_out, [h, c]

    @override(ModelV2)
    def get_initial_state(self):
        return [
            np.zeros(self.cell_size, np.float32),
            np.zeros(self.cell_size, np.float32),
        ]

    @override(ModelV2)
    def value_function(self):
        return tf.reshape(self._value_out, [-1])


class TorchMobileV2PlusRNNModel(TorchRNN, nn.Module):
    """A conv. + recurrent torch net example using a pre-trained MobileNet."""

    def __init__(
        self, obs_space, action_space, num_outputs, model_config, name, cnn_shape
    ):
        TorchRNN.__init__(
            self, obs_space, action_space, num_outputs, model_config, name
        )
        nn.Module.__init__(self)

        self.lstm_state_size = 16
        self.cnn_shape = list(cnn_shape)
        self.visual_size_in = cnn_shape[0] * cnn_shape[1] * cnn_shape[2]
        # MobileNetV2 has a flat output of (1000,).
        self.visual_size_out = 1000

        # Load the MobileNetV2 from torch.hub.
        self.cnn_model = torch.hub.load(
            "pytorch/vision:v0.6.0", "mobilenet_v2", pretrained=True
        )

        self.lstm = nn.LSTM(
            self.visual_size_out, self.lstm_state_size, batch_first=True
        )

        # Postprocess LSTM output with another hidden layer and compute values.
        self.logits = SlimFC(self.lstm_state_size, self.num_outputs)
        self.value_branch = SlimFC(self.lstm_state_size, 1)
        # Holds the current "base" output (before logits layer).
        self._features = None

    @override(TorchRNN)
    def forward_rnn(self, inputs, state, seq_lens):
        # Create image dims.
        vision_in = torch.reshape(inputs, [-1] + self.cnn_shape)
        vision_out = self.cnn_model(vision_in)
        # Flatten.
        vision_out_time_ranked = torch.reshape(
            vision_out, [inputs.shape[0], inputs.shape[1], vision_out.shape[-1]]
        )
        if len(state[0].shape) == 2:
            state[0] = state[0].unsqueeze(0)
            state[1] = state[1].unsqueeze(0)
        # Forward through LSTM.
        self._features, [h, c] = self.lstm(vision_out_time_ranked, state)
        # Forward LSTM out through logits layer and value layer.
        logits = self.logits(self._features)
        return logits, [h.squeeze(0), c.squeeze(0)]

    @override(ModelV2)
    def get_initial_state(self):
        # Place hidden states on same device as model.
        h = [
            list(self.cnn_model.modules())[-1]
            .weight.new(1, self.lstm_state_size)
            .zero_()
            .squeeze(0),
            list(self.cnn_model.modules())[-1]
            .weight.new(1, self.lstm_state_size)
            .zero_()
            .squeeze(0),
        ]
        return h

    @override(ModelV2)
    def value_function(self):
        assert self._features is not None, "must call forward() first"
        return torch.reshape(self.value_branch(self._features), [-1])
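The reshaping in `forward_rnn()` above is pure shape bookkeeping: the `(batch, time, flat_pixels)` RNN input is folded into `(batch * time, C, H, W)` frames for the CNN, and the CNN's 1000-dim features are then unfolded back to `(batch, time, 1000)` for the LSTM. A plain-Python sketch of those shapes (`rnn_cnn_shapes` is an illustrative helper, not an RLlib function):

```python
def rnn_cnn_shapes(batch, time, cnn_shape, feat_dim=1000):
    # Shape flow from forward_rnn() above: fold time into the batch axis
    # for the per-frame CNN pass, then unfold it again for the LSTM.
    flat = cnn_shape[0] * cnn_shape[1] * cnn_shape[2]
    lstm_style_in = (batch, time, flat)   # flattened visual input
    cnn_in = (batch * time,) + tuple(cnn_shape)  # one frame per row
    lstm_feed = (batch, time, feat_dim)   # re-ranked CNN features
    return lstm_style_in, cnn_in, lstm_feed

print(rnn_cnn_shapes(2, 5, (3, 224, 224))[1])  # -> (10, 3, 224, 224)
```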
.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/neural_computer.py
ADDED
@@ -0,0 +1,247 @@
# @OldAPIStack
from collections import OrderedDict
import gymnasium as gym
from typing import Union, Dict, List, Tuple

from ray.rllib.models.torch.torch_modelv2 import TorchModelV2
from ray.rllib.models.torch.misc import SlimFC
from ray.rllib.utils.framework import try_import_torch
from ray.rllib.utils.typing import ModelConfigDict, TensorType

try:
    from dnc import DNC
except ModuleNotFoundError:
    print("dnc module not found. Did you forget to 'pip install dnc'?")
    raise

torch, nn = try_import_torch()


class DNCMemory(TorchModelV2, nn.Module):
    """Differentiable Neural Computer wrapper around ixaxaar's DNC implementation,
    see https://github.com/ixaxaar/pytorch-dnc"""

    DEFAULT_CONFIG = {
        "dnc_model": DNC,
        # Number of controller hidden layers
        "num_hidden_layers": 1,
        # Number of weights per controller hidden layer
        "hidden_size": 64,
        # Number of LSTM units
        "num_layers": 1,
        # Number of read heads, i.e. how many addrs are read at once
        "read_heads": 4,
        # Number of memory cells in the controller
        "nr_cells": 32,
        # Size of each cell
        "cell_size": 16,
        # LSTM activation function
        "nonlinearity": "tanh",
        # Observation goes through this torch.nn.Module before
        # feeding to the DNC
        "preprocessor": torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.Tanh()),
        # Input size to the preprocessor
        "preprocessor_input_size": 64,
        # The output size of the preprocessor
        # and the input size of the dnc
        "preprocessor_output_size": 64,
    }

    MEMORY_KEYS = [
        "memory",
        "link_matrix",
        "precedence",
        "read_weights",
        "write_weights",
        "usage_vector",
    ]

    def __init__(
        self,
        obs_space: gym.spaces.Space,
        action_space: gym.spaces.Space,
        num_outputs: int,
        model_config: ModelConfigDict,
        name: str,
        **custom_model_kwargs,
    ):
        nn.Module.__init__(self)
        super(DNCMemory, self).__init__(
            obs_space, action_space, num_outputs, model_config, name
        )
        self.num_outputs = num_outputs
        self.obs_dim = gym.spaces.utils.flatdim(obs_space)
        self.act_dim = gym.spaces.utils.flatdim(action_space)

        self.cfg = dict(self.DEFAULT_CONFIG, **custom_model_kwargs)
        assert (
            self.cfg["num_layers"] == 1
        ), "num_layers != 1 has not been implemented yet"
        self.cur_val = None

        self.preprocessor = torch.nn.Sequential(
            torch.nn.Linear(self.obs_dim, self.cfg["preprocessor_input_size"]),
            self.cfg["preprocessor"],
        )

        self.logit_branch = SlimFC(
            in_size=self.cfg["hidden_size"],
            out_size=self.num_outputs,
            activation_fn=None,
            initializer=torch.nn.init.xavier_uniform_,
        )

        self.value_branch = SlimFC(
            in_size=self.cfg["hidden_size"],
            out_size=1,
            activation_fn=None,
            initializer=torch.nn.init.xavier_uniform_,
        )

        self.dnc: Union[None, DNC] = None

    def get_initial_state(self) -> List[TensorType]:
        ctrl_hidden = [
            torch.zeros(self.cfg["num_hidden_layers"], self.cfg["hidden_size"]),
            torch.zeros(self.cfg["num_hidden_layers"], self.cfg["hidden_size"]),
        ]
        m = self.cfg["nr_cells"]
        r = self.cfg["read_heads"]
        w = self.cfg["cell_size"]
        memory = [
            torch.zeros(m, w),  # memory
            torch.zeros(1, m, m),  # link_matrix
            torch.zeros(1, m),  # precedence
            torch.zeros(r, m),  # read_weights
            torch.zeros(1, m),  # write_weights
            torch.zeros(m),  # usage_vector
        ]

        read_vecs = torch.zeros(w * r)

        state = [*ctrl_hidden, read_vecs, *memory]
        assert len(state) == 9
        return state

    def value_function(self) -> TensorType:
        assert self.cur_val is not None, "must call forward() first"
        return self.cur_val

    def unpack_state(
        self,
        state: List[TensorType],
    ) -> Tuple[List[Tuple[TensorType, TensorType]], Dict[str, TensorType], TensorType]:
        """Given a list of tensors, reformat for self.dnc input"""
        assert len(state) == 9, "Failed to verify unpacked state"
        ctrl_hidden: List[Tuple[TensorType, TensorType]] = [
            (
                state[0].permute(1, 0, 2).contiguous(),
                state[1].permute(1, 0, 2).contiguous(),
            )
        ]
        read_vecs: TensorType = state[2]
        memory: List[TensorType] = state[3:]
        memory_dict: OrderedDict[str, TensorType] = OrderedDict(
            zip(self.MEMORY_KEYS, memory)
        )

        return ctrl_hidden, memory_dict, read_vecs

    def pack_state(
        self,
        ctrl_hidden: List[Tuple[TensorType, TensorType]],
        memory_dict: Dict[str, TensorType],
        read_vecs: TensorType,
    ) -> List[TensorType]:
        """Given the dnc output, pack it into a list of tensors
        for rllib state. Order is ctrl_hidden, read_vecs, memory_dict"""
        state = []
        ctrl_hidden = [
            ctrl_hidden[0][0].permute(1, 0, 2),
            ctrl_hidden[0][1].permute(1, 0, 2),
        ]
        state += ctrl_hidden
        assert len(state) == 2, "Failed to verify packed state"
        state.append(read_vecs)
        assert len(state) == 3, "Failed to verify packed state"
        state += memory_dict.values()
        assert len(state) == 9, "Failed to verify packed state"
        return state

    def validate_unpack(self, dnc_output, unpacked_state):
        """Ensure the unpacked state shapes match the DNC output"""
        s_ctrl_hidden, s_memory_dict, s_read_vecs = unpacked_state
        ctrl_hidden, memory_dict, read_vecs = dnc_output

        for i in range(len(ctrl_hidden)):
            for j in range(len(ctrl_hidden[i])):
                assert s_ctrl_hidden[i][j].shape == ctrl_hidden[i][j].shape, (
                    "Controller state mismatch: got "
                    f"{s_ctrl_hidden[i][j].shape} should be "
                    f"{ctrl_hidden[i][j].shape}"
                )

        for k in memory_dict:
            assert s_memory_dict[k].shape == memory_dict[k].shape, (
                "Memory state mismatch at key "
                f"{k}: got {s_memory_dict[k].shape} should be "
                f"{memory_dict[k].shape}"
            )

        assert s_read_vecs.shape == read_vecs.shape, (
            "Read state mismatch: got "
            f"{s_read_vecs.shape} should be "
            f"{read_vecs.shape}"
        )

    def build_dnc(self, device_idx: Union[int, None]) -> None:
        self.dnc = self.cfg["dnc_model"](
            input_size=self.cfg["preprocessor_output_size"],
            hidden_size=self.cfg["hidden_size"],
            num_layers=self.cfg["num_layers"],
            num_hidden_layers=self.cfg["num_hidden_layers"],
            read_heads=self.cfg["read_heads"],
            cell_size=self.cfg["cell_size"],
            nr_cells=self.cfg["nr_cells"],
            nonlinearity=self.cfg["nonlinearity"],
            gpu_id=device_idx,
        )

    def forward(
        self,
        input_dict: Dict[str, TensorType],
        state: List[TensorType],
        seq_lens: TensorType,
    ) -> Tuple[TensorType, List[TensorType]]:

        flat = input_dict["obs_flat"]
        # Batch and Time
        # Forward expects outputs as [B, T, logits]
        B = len(seq_lens)
        T = flat.shape[0] // B

        # Deconstruct batch into batch and time dimensions: [B, T, feats]
        flat = torch.reshape(flat, [-1, T] + list(flat.shape[1:]))

        # First run
        if self.dnc is None:
            gpu_id = flat.device.index if flat.device.index is not None else -1
            self.build_dnc(gpu_id)
            hidden = (None, None, None)

        else:
            hidden = self.unpack_state(state)  # type: ignore

        # Run thru preprocessor before DNC
        z = self.preprocessor(flat.reshape(B * T, self.obs_dim))
        z = z.reshape(B, T, self.cfg["preprocessor_output_size"])
        output, hidden = self.dnc(z, hidden)
        packed_state = self.pack_state(*hidden)

        # Compute action/value from output
        logits = self.logit_branch(output.view(B * T, -1))
        values = self.value_branch(output.view(B * T, -1))

        self.cur_val = values.squeeze(1)

        return logits, packed_state
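The `pack_state`/`unpack_state` pair above must be exact inverses: RLlib hands the policy a flat 9-element state list, and the model reshapes it into the `(ctrl_hidden, memory_dict, read_vecs)` triple the DNC expects and back again every step. A minimal, dependency-free sketch of that round trip, using plain ints in place of tensors (the `permute` calls are omitted here since they cancel out over a pack/unpack cycle):

```python
from collections import OrderedDict

MEMORY_KEYS = ["memory", "link_matrix", "precedence",
               "read_weights", "write_weights", "usage_vector"]

def unpack_state(state):
    # Flat layout: [h, c, read_vecs, *memory] -- 9 entries total.
    assert len(state) == 9
    ctrl_hidden = [(state[0], state[1])]
    read_vecs = state[2]
    memory_dict = OrderedDict(zip(MEMORY_KEYS, state[3:]))
    return ctrl_hidden, memory_dict, read_vecs

def pack_state(ctrl_hidden, memory_dict, read_vecs):
    # Re-flatten in the same order: ctrl_hidden, read_vecs, memory values.
    state = [ctrl_hidden[0][0], ctrl_hidden[0][1], read_vecs]
    state += list(memory_dict.values())
    assert len(state) == 9
    return state

flat = list(range(9))  # stand-ins for the 9 state tensors
assert pack_state(*unpack_state(flat)) == flat  # round trip is lossless
```

The `OrderedDict` keyed by `MEMORY_KEYS` is what guarantees the six memory tensors re-flatten in a stable order; any reordering would silently corrupt the recurrent state between steps.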
.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/parametric_actions_model.py
ADDED
@@ -0,0 +1,201 @@
# @OldAPIStack
from gymnasium.spaces import Box

from ray.rllib.algorithms.dqn.distributional_q_tf_model import DistributionalQTFModel
from ray.rllib.algorithms.dqn.dqn_torch_model import DQNTorchModel
from ray.rllib.models.tf.fcnet import FullyConnectedNetwork
from ray.rllib.models.torch.fcnet import FullyConnectedNetwork as TorchFC
from ray.rllib.utils.framework import try_import_tf, try_import_torch
from ray.rllib.utils.torch_utils import FLOAT_MAX, FLOAT_MIN

tf1, tf, tfv = try_import_tf()
torch, nn = try_import_torch()


class ParametricActionsModel(DistributionalQTFModel):
    """Parametric action model that handles the dot product and masking.

    This assumes the outputs are logits for a single Categorical action dist.
    Getting this to work with a more complex output (e.g., if the action space
    is a tuple of several distributions) is also possible but left as an
    exercise to the reader.
    """

    def __init__(
        self,
        obs_space,
        action_space,
        num_outputs,
        model_config,
        name,
        true_obs_shape=(4,),
        action_embed_size=2,
        **kw
    ):
        super(ParametricActionsModel, self).__init__(
            obs_space, action_space, num_outputs, model_config, name, **kw
        )
        self.action_embed_model = FullyConnectedNetwork(
            Box(-1, 1, shape=true_obs_shape),
            action_space,
            action_embed_size,
            model_config,
            name + "_action_embed",
        )

    def forward(self, input_dict, state, seq_lens):
        # Extract the available actions tensor from the observation.
        avail_actions = input_dict["obs"]["avail_actions"]
        action_mask = input_dict["obs"]["action_mask"]

        # Compute the predicted action embedding
        action_embed, _ = self.action_embed_model({"obs": input_dict["obs"]["cart"]})

        # Expand the model output to [BATCH, 1, EMBED_SIZE]. Note that the
        # avail actions tensor is of shape [BATCH, MAX_ACTIONS, EMBED_SIZE].
        intent_vector = tf.expand_dims(action_embed, 1)

        # Batch dot product => shape of logits is [BATCH, MAX_ACTIONS].
        action_logits = tf.reduce_sum(avail_actions * intent_vector, axis=2)

        # Mask out invalid actions (use tf.float32.min for stability)
        inf_mask = tf.maximum(tf.math.log(action_mask), tf.float32.min)
        return action_logits + inf_mask, state

    def value_function(self):
        return self.action_embed_model.value_function()


class TorchParametricActionsModel(DQNTorchModel):
    """PyTorch version of above ParametricActionsModel."""

    def __init__(
        self,
        obs_space,
        action_space,
        num_outputs,
        model_config,
        name,
        true_obs_shape=(4,),
        action_embed_size=2,
        **kw
    ):
        DQNTorchModel.__init__(
            self, obs_space, action_space, num_outputs, model_config, name, **kw
        )

        self.action_embed_model = TorchFC(
            Box(-1, 1, shape=true_obs_shape),
            action_space,
            action_embed_size,
            model_config,
            name + "_action_embed",
        )

    def forward(self, input_dict, state, seq_lens):
        # Extract the available actions tensor from the observation.
        avail_actions = input_dict["obs"]["avail_actions"]
        action_mask = input_dict["obs"]["action_mask"]

        # Compute the predicted action embedding
        action_embed, _ = self.action_embed_model({"obs": input_dict["obs"]["cart"]})

        # Expand the model output to [BATCH, 1, EMBED_SIZE]. Note that the
        # avail actions tensor is of shape [BATCH, MAX_ACTIONS, EMBED_SIZE].
        intent_vector = torch.unsqueeze(action_embed, 1)

        # Batch dot product => shape of logits is [BATCH, MAX_ACTIONS].
        action_logits = torch.sum(avail_actions * intent_vector, dim=2)

        # Mask out invalid actions (use -inf to tag invalid).
        # These are then recognized by the EpsilonGreedy exploration component
        # as invalid actions that are not to be chosen.
        inf_mask = torch.clamp(torch.log(action_mask), FLOAT_MIN, FLOAT_MAX)

        return action_logits + inf_mask, state

    def value_function(self):
        return self.action_embed_model.value_function()


class ParametricActionsModelThatLearnsEmbeddings(DistributionalQTFModel):
    """Same as the above ParametricActionsModel.

    However, this version also learns the action embeddings.
    """

    def __init__(
        self,
        obs_space,
        action_space,
        num_outputs,
        model_config,
        name,
        true_obs_shape=(4,),
        action_embed_size=2,
        **kw
    ):
        super(ParametricActionsModelThatLearnsEmbeddings, self).__init__(
            obs_space, action_space, num_outputs, model_config, name, **kw
        )

        action_ids_shifted = tf.constant(
            list(range(1, num_outputs + 1)), dtype=tf.float32
        )

        obs_cart = tf.keras.layers.Input(shape=true_obs_shape, name="obs_cart")
        valid_avail_actions_mask = tf.keras.layers.Input(
            shape=(num_outputs,), name="valid_avail_actions_mask"
        )

        self.pred_action_embed_model = FullyConnectedNetwork(
            Box(-1, 1, shape=true_obs_shape),
            action_space,
            action_embed_size,
            model_config,
            name + "_pred_action_embed",
        )

        # Compute the predicted action embedding
        pred_action_embed, _ = self.pred_action_embed_model({"obs": obs_cart})
        _value_out = self.pred_action_embed_model.value_function()

        # Expand the model output to [BATCH, 1, EMBED_SIZE]. Note that the
        # avail actions tensor is of shape [BATCH, MAX_ACTIONS, EMBED_SIZE].
        intent_vector = tf.expand_dims(pred_action_embed, 1)

        valid_avail_actions = action_ids_shifted * valid_avail_actions_mask
        # Embedding for valid available actions which will be learned.
        # Embedding vector for 0 is an invalid embedding (a "dummy embedding").
        valid_avail_actions_embed = tf.keras.layers.Embedding(
            input_dim=num_outputs + 1,
            output_dim=action_embed_size,
            name="action_embed_matrix",
        )(valid_avail_actions)

        # Batch dot product => shape of logits is [BATCH, MAX_ACTIONS].
        action_logits = tf.reduce_sum(valid_avail_actions_embed * intent_vector, axis=2)

        # Mask out invalid actions (use tf.float32.min for stability)
        inf_mask = tf.maximum(tf.math.log(valid_avail_actions_mask), tf.float32.min)

        action_logits = action_logits + inf_mask

        self.param_actions_model = tf.keras.Model(
            inputs=[obs_cart, valid_avail_actions_mask],
            outputs=[action_logits, _value_out],
        )
        self.param_actions_model.summary()

    def forward(self, input_dict, state, seq_lens):
        # Extract the available actions mask tensor from the observation.
        valid_avail_actions_mask = input_dict["obs"]["valid_avail_actions_mask"]

        action_logits, self._value_out = self.param_actions_model(
            [input_dict["obs"]["cart"], valid_avail_actions_mask]
        )

        return action_logits, state

    def value_function(self):
        return self._value_out
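Both frameworks above mask invalid actions the same way: add `log(mask)` to the logits, clamped so that `log(0) = -inf` becomes a huge finite negative number (`tf.float32.min` / `FLOAT_MIN`) rather than a NaN-prone infinity. A valid action (mask 1) gets `log(1) = 0` added and is untouched; an invalid one gets a penalty so large it can never win an argmax or receive meaningful softmax mass. A stdlib-only sketch of the trick (`FLOAT_MIN` here is an illustrative stand-in for the framework constants):

```python
FLOAT_MIN = -3.4e38  # stand-in for torch FLOAT_MIN / tf.float32.min

def masked_logits(logits, action_mask):
    # log(1) == 0 leaves valid actions unchanged; log(0) == -inf is clamped
    # to FLOAT_MIN so invalid actions get ~zero probability, no NaNs.
    return [
        logit + (0.0 if m == 1 else FLOAT_MIN)
        for logit, m in zip(logits, action_mask)
    ]

out = masked_logits([1.0, 2.0, 3.0], [1, 0, 1])
assert out[0] == 1.0 and out[2] == 3.0  # valid actions untouched
assert out[1] < -1e37                   # masked action can never be argmax
```

Clamping instead of using a true `-inf` matters because downstream code may multiply or subtract these logits (e.g. in a softmax cross-entropy), and `inf - inf` or `0 * inf` would produce NaNs.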
.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/shared_weights_model.py
ADDED
@@ -0,0 +1,206 @@
# @OldAPIStack
import numpy as np

from ray.rllib.models.modelv2 import ModelV2
from ray.rllib.models.tf.tf_modelv2 import TFModelV2
from ray.rllib.models.torch.misc import SlimFC
from ray.rllib.models.torch.torch_modelv2 import TorchModelV2
from ray.rllib.utils.annotations import override
from ray.rllib.utils.framework import try_import_tf, try_import_torch

tf1, tf, tfv = try_import_tf()
torch, nn = try_import_torch()

TF2_GLOBAL_SHARED_LAYER = None


class TF2SharedWeightsModel(TFModelV2):
    """Example of weight sharing between two different TFModelV2s.

    NOTE: This will only work for tf2.x. When running with config.framework=tf,
    use SharedWeightsModel1 and SharedWeightsModel2 below, instead!

    The shared (single) layer is simply defined outside of the two Models,
    then used by both Models in their forward pass.
    """

    def __init__(
        self, observation_space, action_space, num_outputs, model_config, name
    ):
        super().__init__(
            observation_space, action_space, num_outputs, model_config, name
        )

        global TF2_GLOBAL_SHARED_LAYER
        # The global, shared layer to be used by both models.
        if TF2_GLOBAL_SHARED_LAYER is None:
            TF2_GLOBAL_SHARED_LAYER = tf.keras.layers.Dense(
                units=64, activation=tf.nn.relu, name="fc1"
            )

        inputs = tf.keras.layers.Input(observation_space.shape)
        last_layer = TF2_GLOBAL_SHARED_LAYER(inputs)
        output = tf.keras.layers.Dense(
            units=num_outputs, activation=None, name="fc_out"
        )(last_layer)
        vf = tf.keras.layers.Dense(units=1, activation=None, name="value_out")(
            last_layer
        )
        self.base_model = tf.keras.models.Model(inputs, [output, vf])

    @override(ModelV2)
    def forward(self, input_dict, state, seq_lens):
        out, self._value_out = self.base_model(input_dict["obs"])
        return out, []

    @override(ModelV2)
    def value_function(self):
        return tf.reshape(self._value_out, [-1])


class SharedWeightsModel1(TFModelV2):
    """Example of weight sharing between two different TFModelV2s.

    NOTE: This will only work for tf1 (static graph). When running with
    config.framework_str=tf2, use TF2SharedWeightsModel, instead!

    Here, we share the variables defined in the 'shared' variable scope
    by entering it explicitly with tf1.AUTO_REUSE. This creates the
    variables for the 'fc1' layer in a global scope called 'shared'
    (outside of the Policy's normal variable scope).
    """

    def __init__(
        self, observation_space, action_space, num_outputs, model_config, name
    ):
        super().__init__(
            observation_space, action_space, num_outputs, model_config, name
        )

        inputs = tf.keras.layers.Input(observation_space.shape)
        with tf1.variable_scope(
            tf1.VariableScope(tf1.AUTO_REUSE, "shared"),
            reuse=tf1.AUTO_REUSE,
            auxiliary_name_scope=False,
        ):
            last_layer = tf.keras.layers.Dense(
                units=64, activation=tf.nn.relu, name="fc1"
            )(inputs)
        output = tf.keras.layers.Dense(
            units=num_outputs, activation=None, name="fc_out"
        )(last_layer)
        vf = tf.keras.layers.Dense(units=1, activation=None, name="value_out")(
            last_layer
        )
        self.base_model = tf.keras.models.Model(inputs, [output, vf])

    @override(ModelV2)
    def forward(self, input_dict, state, seq_lens):
        out, self._value_out = self.base_model(input_dict["obs"])
        return out, []

    @override(ModelV2)
    def value_function(self):
        return tf.reshape(self._value_out, [-1])


class SharedWeightsModel2(TFModelV2):
    """The "other" TFModelV2 using the same shared space as the one above."""

    def __init__(
        self, observation_space, action_space, num_outputs, model_config, name
    ):
        super().__init__(
            observation_space, action_space, num_outputs, model_config, name
        )

        inputs = tf.keras.layers.Input(observation_space.shape)

        # Weights shared with SharedWeightsModel1.
        with tf1.variable_scope(
            tf1.VariableScope(tf1.AUTO_REUSE, "shared"),
            reuse=tf1.AUTO_REUSE,
            auxiliary_name_scope=False,
        ):
            last_layer = tf.keras.layers.Dense(
                units=64, activation=tf.nn.relu, name="fc1"
            )(inputs)
        output = tf.keras.layers.Dense(
            units=num_outputs, activation=None, name="fc_out"
        )(last_layer)
        vf = tf.keras.layers.Dense(units=1, activation=None, name="value_out")(
            last_layer
        )
        self.base_model = tf.keras.models.Model(inputs, [output, vf])

    @override(ModelV2)
    def forward(self, input_dict, state, seq_lens):
        out, self._value_out = self.base_model(input_dict["obs"])
        return out, []

    @override(ModelV2)
    def value_function(self):
        return tf.reshape(self._value_out, [-1])


TORCH_GLOBAL_SHARED_LAYER = None
if torch:
    # The global, shared layer to be used by both models.
    TORCH_GLOBAL_SHARED_LAYER = SlimFC(
        64,
        64,
        activation_fn=nn.ReLU,
        initializer=torch.nn.init.xavier_uniform_,
    )


class TorchSharedWeightsModel(TorchModelV2, nn.Module):
    """Example of weight sharing between two different TorchModelV2s.

    The shared (single) layer is simply defined outside of the two Models,
    then used by both Models in their forward pass.
    """

    def __init__(
        self, observation_space, action_space, num_outputs, model_config, name
    ):
        TorchModelV2.__init__(
            self, observation_space, action_space, num_outputs, model_config, name
        )
        nn.Module.__init__(self)

        # Non-shared initial layer.
        self.first_layer = SlimFC(
            int(np.prod(observation_space.shape)),
            64,
            activation_fn=nn.ReLU,
            initializer=torch.nn.init.xavier_uniform_,
        )

        # Non-shared final layer.
        self.last_layer = SlimFC(
            64,
            self.num_outputs,
            activation_fn=None,
            initializer=torch.nn.init.xavier_uniform_,
        )
        self.vf = SlimFC(
            64,
            1,
            activation_fn=None,
            initializer=torch.nn.init.xavier_uniform_,
        )
        self._global_shared_layer = TORCH_GLOBAL_SHARED_LAYER
        self._output = None

    @override(ModelV2)
    def forward(self, input_dict, state, seq_lens):
        out = self.first_layer(input_dict["obs"])
        self._output = self._global_shared_layer(out)
        model_out = self.last_layer(self._output)
        return model_out, []

    @override(ModelV2)
    def value_function(self):
        assert self._output is not None, "must call forward first!"
        return torch.reshape(self.vf(self._output), [-1])
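The torch variant above needs no framework machinery at all: `TORCH_GLOBAL_SHARED_LAYER` is created once at module scope, and every model instance stores a reference to that same object, so gradient updates through one model are immediately visible to the other. A dependency-free sketch of that pattern (class and attribute names here are illustrative, not RLlib's):

```python
class SharedLayer:
    """Stand-in for a module-level shared nn.Module."""
    def __init__(self):
        self.weight = 0.0

SHARED = SharedLayer()  # created once at module scope

class ModelA:
    def __init__(self):
        self.shared = SHARED  # a reference, not a copy

class ModelB:
    def __init__(self):
        self.shared = SHARED

a, b = ModelA(), ModelB()
a.shared.weight = 1.5          # "gradient step" made through model A...
assert b.shared.weight == 1.5  # ...is seen by model B
assert a.shared is b.shared    # both hold the identical object
```

The tf1 classes need the `variable_scope`/`AUTO_REUSE` dance instead because in static-graph mode variables are looked up by scoped name rather than by Python object identity.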
.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/models/simple_rpg_model.py
ADDED
@@ -0,0 +1,65 @@
| 1 |
+
# @OldAPIStack
|
| 2 |
+
from ray.rllib.models.tf.tf_modelv2 import TFModelV2
|
| 3 |
+
from ray.rllib.models.tf.fcnet import FullyConnectedNetwork as TFFCNet
|
from ray.rllib.models.torch.torch_modelv2 import TorchModelV2
from ray.rllib.models.torch.fcnet import FullyConnectedNetwork as TorchFCNet
from ray.rllib.utils.framework import try_import_tf, try_import_torch

tf1, tf, tfv = try_import_tf()
torch, nn = try_import_torch()


class CustomTorchRPGModel(TorchModelV2, nn.Module):
    """Example of interpreting repeated observations."""

    def __init__(self, obs_space, action_space, num_outputs, model_config, name):
        super().__init__(obs_space, action_space, num_outputs, model_config, name)
        nn.Module.__init__(self)
        self.model = TorchFCNet(
            obs_space, action_space, num_outputs, model_config, name
        )

    def forward(self, input_dict, state, seq_lens):
        # The unpacked input tensors, where M=MAX_PLAYERS, N=MAX_ITEMS:
        # {
        #     'items', <torch.Tensor shape=(?, M, N, 5)>,
        #     'location', <torch.Tensor shape=(?, M, 2)>,
        #     'status', <torch.Tensor shape=(?, M, 10)>,
        # }
        print("The unpacked input tensors:", input_dict["obs"])
        print()
        print("Unbatched repeat dim", input_dict["obs"].unbatch_repeat_dim())
        print()
        print("Fully unbatched", input_dict["obs"].unbatch_all())
        print()
        return self.model.forward(input_dict, state, seq_lens)

    def value_function(self):
        return self.model.value_function()


class CustomTFRPGModel(TFModelV2):
    """Example of interpreting repeated observations."""

    def __init__(self, obs_space, action_space, num_outputs, model_config, name):
        super().__init__(obs_space, action_space, num_outputs, model_config, name)
        self.model = TFFCNet(obs_space, action_space, num_outputs, model_config, name)

    def forward(self, input_dict, state, seq_lens):
        # The unpacked input tensors, where M=MAX_PLAYERS, N=MAX_ITEMS:
        # {
        #     'items', <tf.Tensor shape=(?, M, N, 5)>,
        #     'location', <tf.Tensor shape=(?, M, 2)>,
        #     'status', <tf.Tensor shape=(?, M, 10)>,
        # }
        print("The unpacked input tensors:", input_dict["obs"])
        print()
        print("Unbatched repeat dim", input_dict["obs"].unbatch_repeat_dim())
        print()
        if tf.executing_eagerly():
            print("Fully unbatched", input_dict["obs"].unbatch_all())
            print()
        return self.model.forward(input_dict, state, seq_lens)

    def value_function(self):
        return self.model.value_function()
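The `unbatch_all()` calls above recover ragged, per-row observation lists from batch tensors that are padded out to a fixed repeat dimension. A minimal pure-Python sketch of that idea, with no RLlib dependency (`unbatch_all`, `MAX_ITEMS`, and the sample values here are illustrative stand-ins, not RLlib's implementation):

```python
# Each batch row carries up to MAX_ITEMS padded entries plus a true length;
# "fully unbatching" drops the padding and returns the ragged per-row lists.
MAX_ITEMS = 4


def unbatch_all(padded, lengths):
    """Keep only the first lengths[i] items of each padded row."""
    return [row[:n] for row, n in zip(padded, lengths)]


padded = [
    [10, 11, 0, 0],  # row 0 really has 2 items; trailing zeros are padding
    [20, 0, 0, 0],   # row 1 really has 1 item
]
ragged = unbatch_all(padded, [2, 1])
print(ragged)  # [[10, 11], [20]]
```

RLlib's `Repeated` space machinery does the same bookkeeping on tensors, which is why the printed "fully unbatched" structure is only available when lengths are concrete (hence the eager-mode guard in the TF model).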
.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/parametric_actions_cartpole.py
ADDED
@@ -0,0 +1,121 @@
# @OldAPIStack
"""Example of handling variable length or parametric action spaces.

This toy example demonstrates the action-embedding based approach for handling large
discrete action spaces (potentially infinite in size), similar to this example:

https://neuro.cs.ut.ee/the-use-of-embeddings-in-openai-five/

This example works with RLlib's policy gradient style algorithms
(e.g., PG, PPO, IMPALA, A2C) and DQN.

Note that since the model outputs now include "-inf" tf.float32.min
values, not all algorithm options are supported. For example,
algorithms might crash if they don't properly ignore the -inf action scores.
Working configurations are given below.
"""

import argparse
import os

import ray
from ray import air, tune
from ray.air.constants import TRAINING_ITERATION
from ray.rllib.examples.envs.classes.parametric_actions_cartpole import (
    ParametricActionsCartPole,
)
from ray.rllib.examples._old_api_stack.models.parametric_actions_model import (
    ParametricActionsModel,
    TorchParametricActionsModel,
)
from ray.rllib.models import ModelCatalog
from ray.rllib.utils.metrics import (
    ENV_RUNNER_RESULTS,
    EPISODE_RETURN_MEAN,
    NUM_ENV_STEPS_SAMPLED_LIFETIME,
)
from ray.rllib.utils.test_utils import check_learning_achieved
from ray.tune.registry import register_env

parser = argparse.ArgumentParser()
parser.add_argument(
    "--run", type=str, default="PPO", help="The RLlib-registered algorithm to use."
)
parser.add_argument(
    "--framework",
    choices=["tf", "tf2", "torch"],
    default="torch",
    help="The DL framework specifier.",
)
parser.add_argument(
    "--as-test",
    action="store_true",
    help="Whether this script should be run as a test: --stop-reward must "
    "be achieved within --stop-timesteps AND --stop-iters.",
)
parser.add_argument(
    "--stop-iters", type=int, default=200, help="Number of iterations to train."
)
parser.add_argument(
    "--stop-timesteps", type=int, default=100000, help="Number of timesteps to train."
)
parser.add_argument(
    "--stop-reward", type=float, default=150.0, help="Reward at which we stop training."
)

if __name__ == "__main__":
    args = parser.parse_args()
    ray.init()

    register_env("pa_cartpole", lambda _: ParametricActionsCartPole(10))
    ModelCatalog.register_custom_model(
        "pa_model",
        TorchParametricActionsModel
        if args.framework == "torch"
        else ParametricActionsModel,
    )

    if args.run == "DQN":
        cfg = {
            # TODO(ekl) we need to set these to prevent the masked values
            # from being further processed in DistributionalQModel, which
            # would mess up the masking. It is possible to support these if we
            # defined a custom DistributionalQModel that is aware of masking.
            "hiddens": [],
            "dueling": False,
            "enable_rl_module_and_learner": False,
            "enable_env_runner_and_connector_v2": False,
        }
    else:
        cfg = {}

    config = dict(
        {
            "env": "pa_cartpole",
            "model": {
                "custom_model": "pa_model",
            },
            # Use GPUs iff `RLLIB_NUM_GPUS` env var set to > 0.
            "num_gpus": int(os.environ.get("RLLIB_NUM_GPUS", "0")),
            "num_env_runners": 0,
            "framework": args.framework,
        },
        **cfg,
    )

    stop = {
        TRAINING_ITERATION: args.stop_iters,
        f"{NUM_ENV_STEPS_SAMPLED_LIFETIME}": args.stop_timesteps,
        f"{ENV_RUNNER_RESULTS}/{EPISODE_RETURN_MEAN}": args.stop_reward,
    }

    results = tune.Tuner(
        args.run,
        run_config=air.RunConfig(stop=stop, verbose=1),
        param_space=config,
    ).fit()

    if args.as_test:
        check_learning_achieved(results, args.stop_reward)

    ray.shutdown()
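The docstring above notes that the parametric-actions models emit `tf.float32.min` ("-inf") scores for invalid actions, and that algorithms must leave those scores untouched for the masking to work. A small pure-Python sketch of why that trick yields a valid distribution (`masked_softmax` and `FLOAT_MIN` here are illustrative, not RLlib code):

```python
import math

# Stand-in for tf.float32.min, the value the masking models add to
# the logits of unavailable actions.
FLOAT_MIN = -3.4e38


def masked_softmax(logits, mask):
    """Push invalid logits to ~-inf, then take a numerically stable softmax."""
    masked = [l if m else FLOAT_MIN for l, m in zip(logits, mask)]
    mx = max(masked)
    exps = [math.exp(l - mx) for l in masked]  # exp of ~-inf underflows to 0.0
    total = sum(exps)
    return [e / total for e in exps]


# Action 1 is unavailable: its probability comes out exactly 0.0,
# while the remaining mass is renormalized over valid actions.
probs = masked_softmax([1.0, 2.0, 0.5], [True, False, True])
print(probs)
```

This also illustrates the failure mode mentioned above: any post-processing that adds to or rescales the masked logits (as a dueling or distributional Q head would) destroys the "-inf" guarantee, which is why the DQN config disables `hiddens` and `dueling`.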
.venv/lib/python3.11/site-packages/ray/rllib/examples/_old_api_stack/parametric_actions_cartpole_embeddings_learnt_by_model.py
ADDED
@@ -0,0 +1,107 @@
# @OldAPIStack
"""Example of handling variable length or parametric action spaces.

This is a toy example of the action-embedding based approach for handling large
discrete action spaces (potentially infinite in size), similar to this:

https://neuro.cs.ut.ee/the-use-of-embeddings-in-openai-five/

This currently works with RLlib's policy gradient style algorithms
(e.g., PG, PPO, IMPALA, A2C) and also DQN.

Note that since the model outputs now include "-inf" tf.float32.min
values, not all algorithm options are supported at the moment. For example,
algorithms might crash if they don't properly ignore the -inf action scores.
Working configurations are given below.
"""

import argparse
import os

import ray
from ray import air, tune
from ray.air.constants import TRAINING_ITERATION
from ray.rllib.examples.envs.classes.parametric_actions_cartpole import (
    ParametricActionsCartPoleNoEmbeddings,
)
from ray.rllib.examples._old_api_stack.models.parametric_actions_model import (
    ParametricActionsModelThatLearnsEmbeddings,
)
from ray.rllib.models import ModelCatalog
from ray.rllib.utils.metrics import (
    ENV_RUNNER_RESULTS,
    EPISODE_RETURN_MEAN,
    NUM_ENV_STEPS_SAMPLED_LIFETIME,
)
from ray.rllib.utils.test_utils import check_learning_achieved
from ray.tune.registry import register_env

parser = argparse.ArgumentParser()
parser.add_argument("--run", type=str, default="PPO")
parser.add_argument(
    "--framework",
    choices=["tf", "tf2"],
    default="tf",
    help="The DL framework specifier (Torch not supported "
    "due to the lack of a model).",
)
parser.add_argument("--as-test", action="store_true")
parser.add_argument("--stop-iters", type=int, default=200)
parser.add_argument("--stop-reward", type=float, default=150.0)
parser.add_argument("--stop-timesteps", type=int, default=100000)

if __name__ == "__main__":
    args = parser.parse_args()
    ray.init()

    register_env("pa_cartpole", lambda _: ParametricActionsCartPoleNoEmbeddings(10))

    ModelCatalog.register_custom_model(
        "pa_model", ParametricActionsModelThatLearnsEmbeddings
    )

    if args.run == "DQN":
        cfg = {
            # TODO(ekl) we need to set these to prevent the masked values
            # from being further processed in DistributionalQModel, which
            # would mess up the masking. It is possible to support these if we
            # defined a custom DistributionalQModel that is aware of masking.
            "hiddens": [],
            "dueling": False,
            "enable_rl_module_and_learner": False,
            "enable_env_runner_and_connector_v2": False,
        }
    else:
        cfg = {}

    config = dict(
        {
            "env": "pa_cartpole",
            "model": {
                "custom_model": "pa_model",
            },
            # Use GPUs iff `RLLIB_NUM_GPUS` env var set to > 0.
            "num_gpus": int(os.environ.get("RLLIB_NUM_GPUS", "0")),
            "num_env_runners": 0,
            "framework": args.framework,
            "action_mask_key": "valid_avail_actions_mask",
        },
        **cfg,
    )

    stop = {
        TRAINING_ITERATION: args.stop_iters,
        NUM_ENV_STEPS_SAMPLED_LIFETIME: args.stop_timesteps,
        f"{ENV_RUNNER_RESULTS}/{EPISODE_RETURN_MEAN}": args.stop_reward,
    }

    results = tune.Tuner(
        args.run,
        run_config=air.RunConfig(stop=stop, verbose=2),
        param_space=config,
    ).fit()

    if args.as_test:
        check_learning_achieved(results, args.stop_reward)

    ray.shutdown()
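Both scripts' docstrings cite the action-embedding approach: instead of one output head per discrete action, the policy produces an "intent" vector and scores each currently available action by its dot product with that action's embedding (learned by the model in this second variant). A minimal sketch of the scoring step, with illustrative names and toy numbers rather than RLlib's actual model code:

```python
def action_scores(intent, action_embeddings):
    """Score each available action as dot(intent, embedding).

    The action set can vary per step (or be huge): only the embeddings of
    the currently available actions need to be passed in.
    """
    return [
        sum(i * e for i, e in zip(intent, emb))
        for emb in action_embeddings
    ]


# Two available actions with 2-d embeddings; the intent [1.0, 0.0]
# prefers whichever embedding points most along the first axis.
scores = action_scores([1.0, 0.0], [[0.5, 0.5], [2.0, -1.0]])
print(scores)  # [0.5, 2.0]
```

In the full models, these scores are then combined with the "-inf" mask for unavailable actions before the softmax, which is what keeps the approach compatible with a fixed-size policy head.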
.venv/lib/python3.11/site-packages/ray/rllib/examples/checkpoints/__pycache__/__init__.cpython-311.pyc
ADDED
Binary file (203 Bytes)
.venv/lib/python3.11/site-packages/ray/rllib/examples/checkpoints/__pycache__/cartpole_dqn_export.cpython-311.pyc
ADDED
Binary file (4.55 kB)
.venv/lib/python3.11/site-packages/ray/rllib/examples/checkpoints/__pycache__/change_config_during_training.cpython-311.pyc
ADDED
Binary file (11.8 kB)
.venv/lib/python3.11/site-packages/ray/rllib/examples/checkpoints/__pycache__/checkpoint_by_custom_criteria.cpython-311.pyc
ADDED
Binary file (6.36 kB)
.venv/lib/python3.11/site-packages/ray/rllib/examples/checkpoints/__pycache__/restore_1_of_n_agents_from_checkpoint.cpython-311.pyc
ADDED
Binary file (7.57 kB)
.venv/lib/python3.11/site-packages/ray/rllib/examples/multi_agent/__init__.py
ADDED
File without changes
.venv/lib/python3.11/site-packages/ray/rllib/examples/multi_agent/__pycache__/__init__.cpython-311.pyc
ADDED
Binary file (203 Bytes)
.venv/lib/python3.11/site-packages/ray/rllib/examples/multi_agent/__pycache__/custom_heuristic_policy.cpython-311.pyc
ADDED
Binary file (4.37 kB)
.venv/lib/python3.11/site-packages/ray/rllib/examples/multi_agent/__pycache__/different_spaces_for_agents.cpython-311.pyc
ADDED
Binary file (5.9 kB)
.venv/lib/python3.11/site-packages/ray/rllib/examples/multi_agent/__pycache__/multi_agent_cartpole.cpython-311.pyc
ADDED
Binary file (2.98 kB)
.venv/lib/python3.11/site-packages/ray/rllib/examples/multi_agent/__pycache__/multi_agent_pendulum.cpython-311.pyc
ADDED
Binary file (3.33 kB)
.venv/lib/python3.11/site-packages/ray/rllib/examples/multi_agent/__pycache__/pettingzoo_independent_learning.cpython-311.pyc
ADDED
Binary file (5.32 kB)
.venv/lib/python3.11/site-packages/ray/rllib/examples/multi_agent/__pycache__/pettingzoo_parameter_sharing.cpython-311.pyc
ADDED
Binary file (4.66 kB)
.venv/lib/python3.11/site-packages/ray/rllib/examples/multi_agent/__pycache__/pettingzoo_shared_value_function.cpython-311.pyc
ADDED
Binary file (485 Bytes)
.venv/lib/python3.11/site-packages/ray/rllib/examples/multi_agent/__pycache__/rock_paper_scissors_heuristic_vs_learned.cpython-311.pyc
ADDED
Binary file (6.03 kB)
.venv/lib/python3.11/site-packages/ray/rllib/examples/multi_agent/__pycache__/rock_paper_scissors_learned_vs_learned.cpython-311.pyc
ADDED
Binary file (4.22 kB)
.venv/lib/python3.11/site-packages/ray/rllib/examples/multi_agent/__pycache__/self_play_league_based_with_open_spiel.cpython-311.pyc
ADDED
Binary file (11.4 kB)