---
title: Reinforcement Learning Graphical Representations
date: 2026-04-08
category: Reinforcement Learning
description: A comprehensive gallery of 230 standard RL components and their graphical representations.
---

# Reinforcement Learning Graphical Representations

This repository contains a full set of 230 visualizations representing foundational concepts, algorithms, and advanced topics in Reinforcement Learning.

| Category | Component | Illustration | Details | Context |
|----------|-----------|--------------|---------|---------|
| **MDP & Environment** | **Agent-Environment Interaction Loop** | ![Illustration](graphs/agent_environment_interaction_loop.png) | Core cycle: observation of state → selection of action → environment transition → receipt of reward + next state (see the code sketch after the table) | All RL algorithms |
| **MDP & Environment** | **Markov Decision Process (MDP) Tuple** | ![Illustration](graphs/markov_decision_process_mdp_tuple.png) | (S, A, P, R, γ) with transition dynamics and reward function (P(s′\|s,a) and R(s,a,s′)) | All RL algorithms |
| **MDP & Environment** | **State Transition Graph** | ![Illustration](graphs/state_transition_graph.png) | Full probabilistic transitions between discrete states | Gridworld, Taxi, Cliff Walking |
| **MDP & Environment** | **Trajectory / Episode Sequence** | ![Illustration](graphs/trajectory_episode_sequence.png) | Sequence of (s₀, a₀, r₁, s₁, …, s_T) | Monte Carlo, episodic tasks |
| **MDP & Environment** | **Continuous State/Action Space Visualization** | ![Illustration](graphs/continuous_state_action_space_visualization.png) | High-dimensional spaces (e.g., robot joints, pixel inputs) | Continuous-control tasks (MuJoCo, PyBullet) |
| **MDP & Environment** | **Reward Function / Landscape** | ![Illustration](graphs/reward_function_landscape.png) | Scalar reward as function of state/action | All algorithms; especially reward shaping |
| **MDP & Environment** | **Discount Factor (γ) Effect** | ![Illustration](graphs/discount_factor_effect.png) | How future rewards are weighted | All discounted MDPs |
| **Value & Policy** | **State-Value Function V(s)** | ![Illustration](graphs/state_value_function_v_s.png) | Expected return from state s under policy π | Value-based methods |
| **Value & Policy** | **Action-Value Function Q(s,a)** | ![Illustration](graphs/action_value_function_q_s_a.png) | Expected return from state-action pair | Q-learning family |
| **Value & Policy** | **Policy π(s) or π(a\|s)** | ![Illustration](graphs/policy_s_or_a.png) | Deterministic mapping or action distribution over states; shown as arrow overlays on grid (optimal policy), probability bar charts, or softmax heatmaps | All RL algorithms |
| **Value & Policy** | **Advantage Function A(s,a)** | ![Illustration](graphs/advantage_function_a_s_a.png) | Q(s,a) – V(s) | A2C, PPO, SAC, TD3 |
| **Value & Policy** | **Optimal Value Function V\* / Q\*** | ![Illustration](graphs/optimal_value_function_v_q.png) | Solution to Bellman optimality | Value iteration, Q-learning |
| **Dynamic Programming** | **Policy Evaluation Backup** | ![Illustration](graphs/policy_evaluation_backup.png) | Iterative update of V using Bellman expectation | Policy iteration |
| **Dynamic Programming** | **Policy Improvement** | ![Illustration](graphs/policy_improvement.png) | Greedy policy update over Q | Policy iteration |
| **Dynamic Programming** | **Value Iteration Backup** | ![Illustration](graphs/value_iteration_backup.png) | Update using Bellman optimality | Value iteration |
| **Dynamic Programming** | **Policy Iteration Full Cycle** | ![Illustration](graphs/policy_iteration_full_cycle.png) | Evaluation → Improvement loop | Classic DP methods |
| **Monte Carlo** | **Monte Carlo Backup** | ![Illustration](graphs/monte_carlo_backup.png) | Update using full episode return G_t | First-visit / every-visit MC |
| **Monte Carlo** | **Monte Carlo Tree (MCTS)** | ![Illustration](graphs/monte_carlo_tree_mcts.png) | Search tree with selection, expansion, simulation, backprop | AlphaGo, AlphaZero |
| **Monte Carlo** | **Importance Sampling Ratio** | ![Illustration](graphs/importance_sampling_ratio.png) | Off-policy correction ρ = π(a\|s) / b(a\|s) | Off-policy Monte Carlo methods |
| **Temporal Difference** | **TD(0) Backup** | ![Illustration](graphs/td_0_backup.png) | Bootstrapped update using R + γV(s′) | TD learning |
| **Temporal Difference** | **Bootstrapping (general)** | ![Illustration](graphs/bootstrapping_general.png) | Using estimated future value instead of full return | All TD methods |
| **Temporal Difference** | **n-step TD Backup** | ![Illustration](graphs/n_step_td_backup.png) | Multi-step return G_t^{(n)} | n-step TD, TD(λ) |
| **Temporal Difference** | **TD(λ) & Eligibility Traces** | ![Illustration](graphs/td_eligibility_traces.png) | Decaying trace z_t for credit assignment | TD(λ), SARSA(λ), Q(λ) |
| **Temporal Difference** | **SARSA Update** | ![Illustration](graphs/sarsa_update.png) | On-policy TD control | SARSA |
| **Temporal Difference** | **Q-Learning Update** | ![Illustration](graphs/q_learning_update.png) | Off-policy TD control (sketched in code after the table) | Q-learning, Deep Q-Network |
| **Temporal Difference** | **Expected SARSA** | ![Illustration](graphs/expected_sarsa.png) | Expectation over next action under policy | Expected SARSA |
| **Temporal Difference** | **Double Q-Learning / Double DQN** | ![Illustration](graphs/double_q_learning_double_dqn.png) | Two separate Q estimators to reduce overestimation | Double DQN, TD3 |
| **Temporal Difference** | **Dueling DQN Architecture** | ![Illustration](graphs/dueling_dqn_architecture.png) | Separate streams for state value V(s) and advantage A(s,a) | Dueling DQN |
| **Temporal Difference** | **Prioritized Experience Replay** | ![Illustration](graphs/prioritized_experience_replay.png) | Importance sampling of transitions by TD error | Prioritized DQN, Rainbow |
| **Temporal Difference** | **Rainbow DQN Components** | ![Illustration](graphs/rainbow_dqn_components.png) | All extensions combined (Double, Dueling, PER, etc.) | Rainbow DQN |
| **Function Approximation** | **Linear Function Approximation** | ![Illustration](graphs/linear_function_approximation.png) | Feature vector φ(s) → wᵀφ(s) | Tabular → linear FA |
| **Function Approximation** | **Neural Network Layers (MLP, CNN, RNN, Transformer)** | ![Illustration](graphs/neural_network_layers_mlp_cnn_rnn_transformer.png) | Full deep network for value/policy | DQN, A3C, PPO, Decision Transformer |
| **Function Approximation** | **Computation Graph / Backpropagation Flow** | ![Illustration](graphs/computation_graph_backpropagation_flow.png) | Gradient flow through network | All deep RL |
| **Function Approximation** | **Target Network** | ![Illustration](graphs/target_network.png) | Frozen copy of Q-network for stability | DQN, DDQN, SAC, TD3 |
| **Policy Gradients** | **Policy Gradient Theorem** | ![Illustration](graphs/policy_gradient_theorem.png) | ∇_θ J(θ) = E[∇_θ log π(a\|s) · Q^π(s,a)]; flow diagram from reward → log-prob → gradient | All policy-gradient methods |
| **Policy Gradients** | **REINFORCE Update** | ![Illustration](graphs/reinforce_update.png) | Monte-Carlo policy gradient | REINFORCE |
| **Policy Gradients** | **Baseline / Advantage Subtraction** | ![Illustration](graphs/baseline_advantage_subtraction.png) | Subtract b(s) to reduce variance | All modern PG |
| **Policy Gradients** | **Trust Region (TRPO)** | ![Illustration](graphs/trust_region_trpo.png) | KL-divergence constraint on policy update | TRPO |
| **Policy Gradients** | **Proximal Policy Optimization (PPO)** | ![Illustration](graphs/proximal_policy_optimization_ppo.png) | Clipped surrogate objective (sketched in code after the table) | PPO, PPO-Clip |
| **Actor-Critic** | **Actor-Critic Architecture** | ![Illustration](graphs/actor_critic_architecture.png) | Separate or shared actor (policy) + critic (value) networks | A2C, A3C, SAC, TD3 |
| **Actor-Critic** | **Advantage Actor-Critic (A2C/A3C)** | ![Illustration](graphs/advantage_actor_critic_a2c_a3c.png) | Synchronous/asynchronous multi-worker | A2C/A3C |
| **Actor-Critic** | **Soft Actor-Critic (SAC)** | ![Illustration](graphs/soft_actor_critic_sac.png) | Entropy-regularized policy + twin critics | SAC |
| **Actor-Critic** | **Twin Delayed DDPG (TD3)** | ![Illustration](graphs/twin_delayed_ddpg_td3.png) | Twin critics + delayed policy + target smoothing | TD3 |
| **Exploration** | **ε-Greedy Strategy** | ![Illustration](graphs/greedy_strategy.png) | Probability ε of random action (see the Q-learning sketch after the table) | DQN family |
| **Exploration** | **Softmax / Boltzmann Exploration** | ![Illustration](graphs/softmax_boltzmann_exploration.png) | Temperature τ in softmax | Softmax policies |
| **Exploration** | **Upper Confidence Bound (UCB)** | ![Illustration](graphs/upper_confidence_bound_ucb.png) | Optimism in face of uncertainty | UCB1, bandits |
| **Exploration** | **Intrinsic Motivation / Curiosity** | ![Illustration](graphs/intrinsic_motivation_curiosity.png) | Prediction error as intrinsic reward | ICM, RND, Curiosity-driven RL |
| **Exploration** | **Entropy Regularization** | ![Illustration](graphs/entropy_regularization.png) | Bonus term αH(π) | SAC, maximum-entropy RL |
| **Hierarchical RL** | **Options Framework** | ![Illustration](graphs/options_framework.png) | High-level policy over options (temporally extended actions) | Option-Critic |
| **Hierarchical RL** | **Feudal Networks / Hierarchical Actor-Critic** | ![Illustration](graphs/feudal_networks_hierarchical_actor_critic.png) | Manager-worker hierarchy | Feudal RL |
| **Hierarchical RL** | **Skill Discovery** | ![Illustration](graphs/skill_discovery.png) | Unsupervised emergence of reusable skills | DIAYN, VALOR |
| **Model-Based RL** | **Learned Dynamics Model** | ![Illustration](graphs/learned_dynamics_model.png) | Learned transition model P̂(s′\|s,a); separate model network diagram (often RNN or transformer) | Model-based RL |
| **Model-Based RL** | **Model-Based Planning** | ![Illustration](graphs/model_based_planning.png) | Rollouts inside learned model | MuZero, DreamerV3 |
| **Model-Based RL** | **Imagination-Augmented Agents (I2A)** | ![Illustration](graphs/imagination_augmented_agents_i2a.png) | Imagination module + policy | I2A |
| **Offline RL** | **Offline Dataset** | ![Illustration](graphs/offline_dataset.png) | Fixed batch of trajectories | BC, CQL, IQL |
| **Offline RL** | **Conservative Q-Learning (CQL)** | ![Illustration](graphs/conservative_q_learning_cql.png) | Penalty on out-of-distribution actions | CQL |
| **Multi-Agent RL** | **Multi-Agent Interaction Graph** | ![Illustration](graphs/multi_agent_interaction_graph.png) | Agents communicating or competing | MARL, MADDPG |
| **Multi-Agent RL** | **Centralized Training Decentralized Execution (CTDE)** | ![Illustration](graphs/centralized_training_decentralized_execution_ctde.png) | Shared critic during training | QMIX, VDN, MADDPG |
| **Multi-Agent RL** | **Cooperative / Competitive Payoff Matrix** | ![Illustration](graphs/cooperative_competitive_payoff_matrix.png) | Joint reward for multiple agents | Prisoner's Dilemma, multi-agent gridworlds |
| **Inverse RL / IRL** | **Reward Inference** | ![Illustration](graphs/reward_inference.png) | Infer reward from expert demonstrations | IRL, GAIL |
| **Inverse RL / IRL** | **Generative Adversarial Imitation Learning (GAIL)** | ![Illustration](graphs/generative_adversarial_imitation_learning_gail.png) | Discriminator vs. policy generator | GAIL, AIRL |
| **Meta-RL** | **Meta-RL Architecture** | ![Illustration](graphs/meta_rl_architecture.png) | Outer loop (meta-policy) + inner loop (task adaptation) | MAML for RL, RL² |
| **Meta-RL** | **Task Distribution Visualization** | ![Illustration](graphs/task_distribution_visualization.png) | Multiple MDPs sampled from meta-distribution | Meta-RL benchmarks |
| **Advanced / Misc** | **Experience Replay Buffer** | ![Illustration](graphs/experience_replay_buffer.png) | Stored (s,a,r,s′,done) tuples | DQN and all off-policy deep RL |
| **Advanced / Misc** | **State Visitation / Occupancy Measure** | ![Illustration](graphs/state_visitation_occupancy_measure.png) | Frequency of visiting each state | All algorithms (analysis) |
| **Advanced / Misc** | **Learning Curve** | ![Illustration](graphs/learning_curve.png) | Average episodic return vs. episodes / steps | Standard performance reporting |
| **Advanced / Misc** | **Regret / Cumulative Regret** | ![Illustration](graphs/regret_cumulative_regret.png) | Sub-optimality accumulated | Bandits and online RL |
| **Advanced / Misc** | **Attention Mechanisms (Transformers in RL)** | ![Illustration](graphs/attention_mechanisms_transformers_in_rl.png) | Attention weights | Decision Transformer, Trajectory Transformer |
| **Advanced / Misc** | **Diffusion Policy** | ![Illustration](graphs/diffusion_policy.png) | Denoising diffusion process for action generation | Diffusion-RL policies |
| **Advanced / Misc** | **Graph Neural Networks for RL** | ![Illustration](graphs/graph_neural_networks_for_rl.png) | Node/edge message passing | Graph RL, relational RL |
| **Advanced / Misc** | **World Model / Latent Space** | ![Illustration](graphs/world_model_latent_space.png) | Encoder-decoder dynamics in latent space | Dreamer, PlaNet |
| **Advanced / Misc** | **Convergence Analysis Plots** | ![Illustration](graphs/convergence_analysis_plots.png) | Error / value change over iterations | DP, TD, value iteration |
| **Advanced / Misc** | **RL Algorithm Taxonomy** | ![Illustration](graphs/rl_algorithm_taxonomy.png) | Comprehensive classification of algorithms | All RL |
| **Advanced / Misc** | **Probabilistic Graphical Model (RL as Inference)** | ![Illustration](graphs/probabilistic_graphical_model_rl_as_inference.png) | Formalizing RL as probabilistic inference | Control as Inference, MaxEnt RL |
| **Value & Policy** | **Distributional RL (C51 / Categorical)** | ![Illustration](graphs/distributional_rl_c51_categorical.png) | Representing return as a probability distribution | C51, QR-DQN, IQN |
| **Exploration** | **Hindsight Experience Replay (HER)** | ![Illustration](graphs/hindsight_experience_replay_her.png) | Learning from failures by relabeling goals | Sparse reward robotics, HER |
| **Model-Based RL** | **Dyna-Q Architecture** | ![Illustration](graphs/dyna_q_architecture.png) | Integration of real experience and model-based planning | Dyna-Q, Dyna-2 |
| **Function Approximation** | **Noisy Networks (Parameter Noise)** | ![Illustration](graphs/noisy_networks_parameter_noise.png) | Stochastic weights for exploration | Noisy DQN, Rainbow |
| **Exploration** | **Intrinsic Curiosity Module (ICM)** | ![Illustration](graphs/intrinsic_curiosity_module_icm.png) | Reward based on prediction error | Curiosity-driven exploration, ICM |
| **Temporal Difference** | **V-trace (IMPALA)** | ![Illustration](graphs/v_trace_impala.png) | Asynchronous off-policy importance sampling | IMPALA, V-trace |
| **Multi-Agent RL** | **QMIX Mixing Network** | ![Illustration](graphs/qmix_mixing_network.png) | Monotonic value function factorization | QMIX, VDN |
| **Advanced / Misc** | **Saliency Maps / Attention on State** | ![Illustration](graphs/saliency_maps_attention_on_state.png) | Visualizing what the agent "sees" or prioritizes | Interpretability, Atari RL |
| **Exploration** | **Action Selection Noise (OU vs Gaussian)** | ![Illustration](graphs/action_selection_noise_ou_vs_gaussian.png) | Temporal correlation in exploration noise | DDPG, TD3 |
| **Advanced / Misc** | **t-SNE / UMAP State Embeddings** | ![Illustration](graphs/t_sne_umap_state_embeddings.png) | Dimension reduction of high-dim neural states | Interpretability, SRL |
| **Advanced / Misc** | **Loss Landscape Visualization** | ![Illustration](graphs/loss_landscape_visualization.png) | Optimization surface geometry | Training stability analysis |
| **Advanced / Misc** | **Success Rate vs Steps** | ![Illustration](graphs/success_rate_vs_steps.png) | Percentage of successful episodes | Goal-conditioned RL, Robotics |
| **Advanced / Misc** | **Hyperparameter Sensitivity Heatmap** | ![Illustration](graphs/hyperparameter_sensitivity_heatmap.png) | Performance across parameter grids | Hyperparameter tuning |
| **Dynamics** | **Action Persistence (Frame Skipping)** | ![Illustration](graphs/action_persistence_frame_skipping.png) | Temporal abstraction by repeating actions | Atari RL, Robotics |
| **Model-Based RL** | **MuZero Dynamics Search Tree** | ![Illustration](graphs/muzero_dynamics_search_tree.png) | Planning with learned transition and value functions | MuZero, Gumbel MuZero |
| **Deep RL** | **Policy Distillation** | ![Illustration](graphs/policy_distillation.png) | Compressing knowledge from teacher to student | Kickstarting, multitask learning |
| **Transformers** | **Decision Transformer Token Sequence** | ![Illustration](graphs/decision_transformer_token_sequence.png) | Casting RL as a sequence-modeling task over (return, state, action) tokens | Decision Transformer, TT |
| **Advanced / Misc** | **Performance Profiles (rliable)** | ![Illustration](graphs/performance_profiles_rliable.png) | Robust aggregate performance metrics | Reliable RL evaluation |
| **Safety RL** | **Safety Shielding / Barrier Functions** | ![Illustration](graphs/safety_shielding_barrier_functions.png) | Hard constraints on the action space | Constrained MDPs, Safe RL |
| **Training** | **Automated Curriculum Learning** | ![Illustration](graphs/automated_curriculum_learning.png) | Progressively increasing task difficulty | Curriculum RL, ALP-GMM |
| **Sim-to-Real** | **Domain Randomization** | ![Illustration](graphs/domain_randomization.png) | Generalizing across environment variations | Robotics, Sim-to-Real |
| **Alignment** | **RL with Human Feedback (RLHF)** | ![Illustration](graphs/rl_with_human_feedback_rlhf.png) | Aligning agents with human preferences | ChatGPT, InstructGPT |
| **Neuro-inspired RL** | **Successor Representation (SR)** | ![Illustration](graphs/successor_representation_sr.png) | Predictive state representations | SR-Dyna, Neuro-RL |
| **Inverse RL / IRL** | **Maximum Entropy IRL** | ![Illustration](graphs/maximum_entropy_irl.png) | Probability distribution over trajectories | MaxEnt IRL, Ziebart |
| **Theory** | **Information Bottleneck** | ![Illustration](graphs/information_bottleneck.png) | Mutual information $I(S;Z)$ and $I(Z;A)$ balance | VIB-RL, Information Theory |
| **Evolutionary RL** | **Evolutionary Strategies Population** | ![Illustration](graphs/evolutionary_strategies_population.png) | Population-based parameter search | OpenAI-ES, Salimans |
| **Safety RL** | **Control Barrier Functions (CBF)** | ![Illustration](graphs/control_barrier_functions_cbf.png) | Set-theoretic safety guarantees | CBF-RL, Control Theory |
| **Exploration** | **Count-based Exploration Heatmap** | ![Illustration](graphs/count_based_exploration_heatmap.png) | Visitation frequency and intrinsic bonus | MBIE-EB, RND |
| **Exploration** | **Thompson Sampling Posteriors** | ![Illustration](graphs/thompson_sampling_posteriors.png) | Direct uncertainty-based action selection | Bandits, Bayesian RL |
| **Multi-Agent RL** | **Adversarial RL Interaction** | ![Illustration](graphs/adversarial_rl_interaction.png) | Competition between protagonist and antagonist | Robust RL, RARL |
| **Hierarchical RL** | **Hierarchical Subgoal Trajectory** | ![Illustration](graphs/hierarchical_subgoal_trajectory.png) | Decomposing long-horizon tasks | Subgoal RL, HIRO |
| **Offline RL** | **Offline Action Distribution Shift** | ![Illustration](graphs/offline_action_distribution_shift.png) | Mismatch between dataset and current policy | CQL, IQL, D4RL |
| **Exploration** | **Random Network Distillation (RND)** | ![Illustration](graphs/random_network_distillation_rnd.png) | Prediction error as intrinsic reward | RND, OpenAI |
| **Offline RL** | **Batch-Constrained Q-learning (BCQ)** | ![Illustration](graphs/batch_constrained_q_learning_bcq.png) | Constraining actions to behavior dataset | BCQ, Fujimoto |
| **Training** | **Population-Based Training (PBT)** | ![Illustration](graphs/population_based_training_pbt.png) | Evolutionary hyperparameter optimization | PBT, DeepMind |
| **Deep RL** | **Recurrent State Flow (DRQN/R2D2)** | ![Illustration](graphs/recurrent_state_flow_drqn_r2d2.png) | Temporal dependency in state-action value | DRQN, R2D2 |
| **Theory** | **Belief State in POMDPs** | ![Illustration](graphs/belief_state_in_pomdps.png) | Probability distribution over hidden states | POMDPs, Belief Space |
| **Multi-Objective RL** | **Multi-Objective Pareto Front** | ![Illustration](graphs/multi_objective_pareto_front.png) | Balancing conflicting reward signals | MORL, Pareto Optimal |
| **Theory** | **Differential Value (Average Reward RL)** | ![Illustration](graphs/differential_value_average_reward_rl.png) | Values relative to average gain | Average Reward RL, Mahadevan |
| **Infrastructure** | **Distributed RL Cluster (Ray/RLlib)** | ![Illustration](graphs/distributed_rl_cluster_ray_rllib.png) | Parallelizing experience collection | Ray, RLlib, Ape-X |
| **Evolutionary RL** | **Neuroevolution Topology Evolution** | ![Illustration](graphs/neuroevolution_topology_evolution.png) | Evolving neural network architectures | NEAT, HyperNEAT |
| **Continual RL** | **Elastic Weight Consolidation (EWC)** | ![Illustration](graphs/elastic_weight_consolidation_ewc.png) | Preventing catastrophic forgetting | EWC, Kirkpatrick |
| **Theory** | **Successor Features (SF)** | ![Illustration](graphs/successor_features_sf.png) | Generalizing predictive representations | SF-Dyna, Barreto |
| **Safety** | **Adversarial State Noise (Perception)** | ![Illustration](graphs/adversarial_state_noise_perception.png) | Attacks on agent observation space | Adversarial RL, Huang |
| **Imitation Learning** | **Behavioral Cloning (Imitation)** | ![Illustration](graphs/behavioral_cloning_imitation.png) | Direct supervised learning from experts | BC, DAgger |
| **Relational RL** | **Relational Graph State Representation** | ![Illustration](graphs/relational_graph_state_representation.png) | Modeling objects and their relations | Relational MDPs, BoxWorld |
| **Quantum RL** | **Quantum RL Circuit (PQC)** | ![Illustration](graphs/quantum_rl_circuit_pqc.png) | Gate-based quantum policy networks | Quantum RL, PQC |
| **Symbolic RL** | **Symbolic Policy Tree** | ![Illustration](graphs/symbolic_policy_tree.png) | Policies as mathematical expressions | Symbolic RL, GP |
| **Control** | **Differentiable Physics Gradient Flow** | ![Illustration](graphs/differentiable_physics_gradient_flow.png) | Gradient-based planning through simulators | Brax, Isaac Gym |
| **Multi-Agent RL** | **MARL Communication Channel** | ![Illustration](graphs/marl_communication_channel.png) | Information exchange between agents | CommNet, DIAL |
| **Safety** | **Lagrangian Constraint Landscape** | ![Illustration](graphs/lagrangian_constraint_landscape.png) | Constrained optimization boundaries | Constrained RL, CPO |
| **Hierarchical RL** | **MAXQ Task Hierarchy** | ![Illustration](graphs/maxq_task_hierarchy.png) | Recursive task decomposition | MAXQ, Dietterich |
| **Agentic AI** | **ReAct Agentic Cycle** | ![Illustration](graphs/react_agentic_cycle.png) | Reasoning-Action loops for LLMs | ReAct, Agentic LLM |
| **Bio-inspired RL** | **Synaptic Plasticity RL** | ![Illustration](graphs/synaptic_plasticity_rl.png) | Hebbian-style synaptic weight updates | Hebbian RL, STDP |
| **Control** | **Guided Policy Search (GPS)** | ![Illustration](graphs/guided_policy_search_gps.png) | Distilling trajectories into a policy | GPS, Levine |
| **Robotics** | **Sim-to-Real Jitter & Latency** | ![Illustration](graphs/sim_to_real_jitter_latency.png) | Temporal robustness in transfer | Sim-to-Real, Robustness |
| **Policy Gradients** | **Deterministic Policy Gradient (DDPG) Flow** | ![Illustration](graphs/deterministic_policy_gradient_ddpg_flow.png) | Gradient flow for deterministic policies | DDPG |
| **Model-Based RL** | **Dreamer Latent Imagination** | ![Illustration](graphs/dreamer_latent_imagination.png) | Learning and planning in latent space | Dreamer (V1-V3) |
| **Deep RL** | **UNREAL Auxiliary Tasks** | ![Illustration](graphs/unreal_auxiliary_tasks.png) | Learning from non-reward signals | UNREAL, A3C extension |
| **Offline RL** | **Implicit Q-Learning (IQL) Expectile** | ![Illustration](graphs/implicit_q_learning_iql_expectile.png) | In-sample learning via expectile regression | IQL |
| **Model-Based RL** | **Prioritized Sweeping** | ![Illustration](graphs/prioritized_sweeping.png) | Planning prioritized by TD error | Sutton & Barto classic MBRL |
| **Imitation Learning** | **DAgger Expert Loop** | ![Illustration](graphs/dagger_expert_loop.png) | Training on expert labels in agent-visited states | DAgger |
| **Representation** | **Self-Predictive Representations (SPR)** | ![Illustration](graphs/self_predictive_representations_spr.png) | Consistency between predicted and target latents | SPR, sample-efficient RL |
| **Multi-Agent RL** | **Joint Action Space** | ![Illustration](graphs/joint_action_space.png) | Cartesian product of individual actions | MARL theory, Game Theory |
| **Multi-Agent RL** | **Dec-POMDP Formal Model** | ![Illustration](graphs/dec_pomdp_formal_model.png) | Decentralized partially observable MDP | Multi-agent coordination |
| **Theory** | **Bisimulation Metric** | ![Illustration](graphs/bisimulation_metric.png) | State equivalence based on transitions/rewards | State abstraction, bisimulation theory |
| **Theory** | **Potential-Based Reward Shaping** | ![Illustration](graphs/potential_based_reward_shaping.png) | Reward transformation preserving optimal policy | Sutton & Barto, Ng et al. |
| **Training** | **Transfer RL: Source to Target** | ![Illustration](graphs/transfer_rl_source_to_target.png) | Reusing knowledge across different MDPs | Transfer Learning, Distillation |
| **Deep RL** | **Multi-Task Backbone Arch** | ![Illustration](graphs/multi_task_backbone_arch.png) | Single agent learning multiple tasks | Multi-task RL, IMPALA |
| **Bandits** | **Contextual Bandit Pipeline** | ![Illustration](graphs/contextual_bandit_pipeline.png) | Decision making given context but no transitions | Personalization, Ad-tech |
| **Theory** | **Theoretical Regret Bounds** | ![Illustration](graphs/theoretical_regret_bounds.png) | Analytical performance guarantees | Online Learning, Bandits |
| **Value-based** | **Soft Q Boltzmann Probabilities** | ![Illustration](graphs/soft_q_boltzmann_probabilities.png) | Probabilistic action selection from Q-values: $\pi(a\|s) \propto \exp(Q/\tau)$ | Soft Q-learning, SAC |
| **Robotics** | **Autonomous Driving RL Pipeline** | ![Illustration](graphs/autonomous_driving_rl_pipeline.png) | End-to-end or modular driving stack | Wayve, Tesla, Comma.ai |
| **Policy** | **Policy action gradient comparison** | ![Illustration](graphs/policy_action_gradient_comparison.png) | Comparison of gradient derivation types | PG Theorem vs DPG Theorem |
| **Inverse RL / IRL** | **IRL: Feature Expectation Matching** | ![Illustration](graphs/irl_feature_expectation_matching.png) | Comparing expert vs learner feature visitation frequency | $\mu(\pi^*) - \mu(\pi)$ |
| **Imitation Learning** | **Apprenticeship Learning Loop** | ![Illustration](graphs/apprenticeship_learning_loop.png) | Training to match expert performance via reward inference | Apprenticeship Learning |
| **Theory** | **Active Inference Loop** | ![Illustration](graphs/active_inference_loop.png) | Agents minimizing surprise (free energy) | Free Energy Principle, Friston |
| **Theory** | **Bellman Residual Landscape** | ![Illustration](graphs/bellman_residual_landscape.png) | Training surface of the Bellman error | TD learning, fitted Q-iteration |
| **Model-Based RL** | **Plan-to-Explore Uncertainty Map** | ![Illustration](graphs/plan_to_explore_uncertainty_map.png) | Systematic exploration in learned world models | Plan-to-Explore, Sekar et al. |
| **Safety RL** | **Robust RL Uncertainty Set** | ![Illustration](graphs/robust_rl_uncertainty_set.png) | Optimizing for the worst-case environment transition | Robust MDPs, minimax RL |
| **Training** | **HPO Bayesian Opt Cycle** | ![Illustration](graphs/hpo_bayesian_opt_cycle.png) | Automating hyperparameter selection with GP | Hyperparameter Optimization |
| **Applied RL** | **Slate RL Recommendation** | ![Illustration](graphs/slate_rl_recommendation.png) | Optimizing list/slate of items for users | Recommender Systems, Ie et al. |
| **Multi-Agent RL** | **Fictitious Play Interaction** | ![Illustration](graphs/fictitious_play_interaction.png) | Belief-based learning in games | Game Theory, Brown (1951) |
| **Conceptual** | **Universal RL Framework Diagram** | ![Illustration](graphs/universal_rl_framework_diagram.png) | High-level summary of RL components | All RL |
| **Offline RL** | **Offline Density Ratio Estimator** | ![Illustration](graphs/offline_density_ratio_estimator.png) | Estimating $w(s,a)$ for off-policy data | Importance Sampling, Offline RL |
| **Continual RL** | **Continual Task Interference Heatmap** | ![Illustration](graphs/continual_task_interference_heatmap.png) | Measuring negative transfer between tasks | Lifelong Learning, EWC |
| **Safety RL** | **Lyapunov Stability Safe Set** | ![Illustration](graphs/lyapunov_stability_safe_set.png) | Invariant sets for safe control | Lyapunov RL, Chow et al. |
| **Applied RL** | **Molecular RL (Atom Coordinates)** | ![Illustration](graphs/molecular_rl_atom_coordinates.png) | RL for molecular design/protein folding | Chemistry RL, AlphaFold-style |
| **Architecture** | **MoE Multi-task Architecture** | ![Illustration](graphs/moe_multi_task_architecture.png) | Scaling models with mixture of experts | MoE-RL, Sparsity |
| **Direct Policy Search** | **CMA-ES Policy Search** | ![Illustration](graphs/cma_es_policy_search.png) | Evolutionary strategy for policy weights | ES for RL, Salimans |
| **Alignment** | **Elo Rating Preference Plot** | ![Illustration](graphs/elo_rating_preference_plot.png) | Measuring agent strength over time | AlphaZero, League training |
| **Explainable RL** | **Explainable RL (SHAP Attribution)** | ![Illustration](graphs/explainable_rl_shap_attribution.png) | Local attribution of features to agent actions | Interpretability, SHAP/LIME |
| **Meta-RL** | **PEARL Context Encoder** | ![Illustration](graphs/pearl_context_encoder.png) | Learning latent task representations | PEARL, Rakelly et al. |
| **Applied RL** | **Medical RL Therapy Pipeline** | ![Illustration](graphs/medical_rl_therapy_pipeline.png) | Personalized medicine and dosing | Healthcare RL, ICU Sepsis |
| **Applied RL** | **Supply Chain RL Pipeline** | ![Illustration](graphs/supply_chain_rl_pipeline.png) | Optimizing stock levels and orders | Logistics, Inventory Management |
| **Robotics** | **Sim-to-Real SysID Loop** | ![Illustration](graphs/sim_to_real_sysid_loop.png) | Closing the reality gap via parameter estimation | System Identification, Robotics |
| **Architecture** | **Transformer World Model** | ![Illustration](graphs/transformer_world_model.png) | Sequence-to-sequence dynamics modeling | DreamerV3, Transframer |
| **Applied RL** | **Network Traffic RL** | ![Illustration](graphs/network_traffic_rl.png) | Optimizing data packet routing in graphs | Networking, Traffic Engineering |
| **Training** | **RLHF: PPO with Reference Policy** | ![Illustration](graphs/rlhf_ppo_with_reference_policy.png) | Ensuring RL fine-tuning doesn't drift too far | InstructGPT, Llama 2/3 |
| **Multi-Agent RL** | **PSRO Meta-Game Update** | ![Illustration](graphs/psro_meta_game_update.png) | Reaching Nash equilibrium in large games | PSRO, Lanctot et al. |
| **Multi-Agent RL** | **DIAL: Differentiable Comm** | ![Illustration](graphs/dial_differentiable_comm.png) | End-to-end learning of communication protocols | DIAL, Foerster et al. |
| **Batch RL** | **Fitted Q-Iteration Loop** | ![Illustration](graphs/fitted_q_iteration_loop.png) | Data-driven iteration with a supervised regressor | Ernst et al. (2005) |
| **Safety RL** | **CMDP Feasible Region** | ![Illustration](graphs/cmdp_feasible_region.png) | Constrained optimization within a safety budget | Constrained MDPs, Altman |
| **Control** | **MPC vs RL Planning** | ![Illustration](graphs/mpc_vs_rl_planning.png) | Comparison of control paradigms | Control Theory vs RL |
| **AutoML** | **Learning to Optimize (L2O)** | ![Illustration](graphs/learning_to_optimize_l2o.png) | Using RL to learn an optimization update rule | L2O, Li & Malik |
| **Applied RL** | **Smart Grid RL Management** | ![Illustration](graphs/smart_grid_rl_management.png) | Optimizing energy supply and demand | Energy RL, Smart Grids |
| **Applied RL** | **Quantum State Tomography RL** | ![Illustration](graphs/quantum_state_tomography_rl.png) | RL for quantum state estimation | Quantum RL, Neural Tomography |
| **Applied RL** | **RL for Chip Placement** | ![Illustration](graphs/rl_for_chip_placement.png) | Placing components on silicon grids | Google Chip Placement |
| **Applied RL** | **RL Compiler Optimization (MLGO)** | ![Illustration](graphs/rl_compiler_optimization_mlgo.png) | Inlining and sizing in compilers | MLGO, LLVM |
| **Applied RL** | **RL for Theorem Proving** | ![Illustration](graphs/rl_for_theorem_proving.png) | Automated reasoning and proof search | LeanRL, AlphaProof |
| **Modern RL** | **Diffusion-QL Offline RL** | ![Illustration](graphs/diffusion_ql_offline_rl.png) | Policy as reverse diffusion process $\pi(a\|s,k)$ with noise injection | Diffusion-QL, Wang et al. |
| **Principles** | **Fairness-reward Pareto Frontier** | ![Illustration](graphs/fairness_reward_pareto_frontier.png) | Balancing equity and returns | Fair RL, Jabbari et al. |
| **Principles** | **Differentially Private RL** | ![Illustration](graphs/differentially_private_rl.png) | Privacy-preserving training | DP-RL, Agarwal et al. |
| **Applied RL** | **Smart Agriculture RL** | ![Illustration](graphs/smart_agriculture_rl.png) | Optimizing crop yield and resources | Precision Agriculture |
| **Applied RL** | **Climate Mitigation RL (Grid)** | ![Illustration](graphs/climate_mitigation_rl_grid.png) | Environmental control policies | ClimateRL, Carbon Control |
| **Applied RL** | **AI Education (Knowledge Tracing)** | ![Illustration](graphs/ai_education_knowledge_tracing.png) | Personalized learning paths | ITS, Bayesian Knowledge Tracing |
| **Modern RL** | **Decision SDE Flow** | ![Illustration](graphs/decision_sde_flow.png) | RL in continuous stochastic systems | Neural SDEs, Control |
| **Control** | **Differentiable physics (Brax)** | ![Illustration](graphs/differentiable_physics_brax.png) | Gradients through simulators | Brax, PhysX, MuJoCo |
| **Applied RL** | **Wireless Beamforming RL** | ![Illustration](graphs/wireless_beamforming_rl.png) | Optimizing antenna signal directions | 5G/6G Networking |
| **Applied RL** | **Quantum Error Correction RL** | ![Illustration](graphs/quantum_error_correction_rl.png) | Correcting noise in quantum circuits | Quantum Computing RL |
| **Multi-Agent RL** | **Mean Field RL Interaction** | ![Illustration](graphs/mean_field_rl_interaction.png) | Large population agent dynamics | MF-RL, Yang et al. |
| **HRL** | **Goal-GAN Curriculum** | ![Illustration](graphs/goal_gan_curriculum.png) | Automatic goal generation | Goal-GAN, Florensa et al. |
| **Modern RL** | **JEPA: Predictive Architecture** | ![Illustration](graphs/jepa_predictive_architecture.png) | LeCun's world model framework | JEPA, I-JEPA |
| **Offline RL** | **CQL Value Penalty Landscape** | ![Illustration](graphs/cql_value_penalty_landscape.png) | Conservatism in value functions | CQL, Kumar et al. |
| **Applied RL** | **Causal RL** | ![Illustration](graphs/causal_rl.png) | Causal Inverse RL Graph | DAG with $S, A, R$ and latent $U$ |
| **Quantum RL** | **VQE-RL Optimization** | ![Illustration](graphs/vqe_rl_optimization.png) | Quantum circuit param tuning | VQE, Quantum RL |
| **Applied RL** | **De-novo Drug Discovery RL** | ![Illustration](graphs/de_novo_drug_discovery_rl.png) | Generating optimized lead molecules | Drug Discovery, Molecule RL |
| **Applied RL** | **Traffic Signal Coordination RL** | ![Illustration](graphs/traffic_signal_coordination_rl.png) | Multi-intersection coordination | IntelliLight, PressLight |
| **Applied RL** | **Mars Rover Pathfinding RL** | ![Illustration](graphs/mars_rover_pathfinding_rl.png) | Navigation on rough terrain | Space RL, Mars Rover |
| **Applied RL** | **Sports Player Movement RL** | ![Illustration](graphs/sports_player_movement_rl.png) | Predicting/Optimizing player actions | Sports Analytics, Ghosting |
| **Applied RL** | **Cryptography Attack RL** | ![Illustration](graphs/cryptography_attack_rl.png) | Searching for keys/vulnerabilities | Crypto-RL, Learning to Attack |
| **Applied RL** | **Humanitarian Resource RL** | ![Illustration](graphs/humanitarian_resource_rl.png) | Disaster response allocation | AI for Good, Resource RL |
| **Applied RL** | **Video Compression RL (RD)** | ![Illustration](graphs/video_compression_rl_rd.png) | Optimizing bit-rate vs distortion | Learned Video Compression |
| **Applied RL** | **Kubernetes Auto-scaling RL** | ![Illustration](graphs/kubernetes_auto_scaling_rl.png) | Cloud resource management | Cloud RL, K8s Scaling |
| **Applied RL** | **Fluid Dynamics Flow Control RL** | ![Illustration](graphs/fluid_dynamics_flow_control_rl.png) | Airfoil/Turbulence control | Aero-RL, Flow Control |
| **Applied RL** | **Structural Optimization RL** | ![Illustration](graphs/structural_optimization_rl.png) | Topology/Material design | Structural RL, Topology Opt |
| **Applied RL** | **Human Decision Modeling** | ![Illustration](graphs/human_decision_modeling.png) | Prospect Theory in RL | Behavioral RL, Prospect Theory |
| **Applied RL** | **Semantic Parsing RL** | ![Illustration](graphs/semantic_parsing_rl.png) | Language to Logic transformation | Semantic Parsing, Seq2Seq-RL |
| **Applied RL** | **Music Melody RL** | ![Illustration](graphs/music_melody_rl.png) | Reward-based melody generation | Music-RL, Magenta |
| **Applied RL** | **Plasma Fusion Control RL** | ![Illustration](graphs/plasma_fusion_control_rl.png) | Magnetic control of Tokamaks | DeepMind Fusion, Tokamak RL |
| **Applied RL** | **Carbon Capture RL cycle** | ![Illustration](graphs/carbon_capture_rl_cycle.png) | Adsorption/Desorption optimization | Carbon Capture, Green RL |
| **Applied RL** | **Swarm Robotics RL** | ![Illustration](graphs/swarm_robotics_rl.png) | Decentralized swarm coordination | Swarm-RL, Multi-Robot |
| **Applied RL** | **Legal Compliance RL Game** | ![Illustration](graphs/legal_compliance_rl_game.png) | Regulatory games | Legal-RL, RegTech |
| **Physics RL** | **Physics-Informed RL (PINN)** | ![Illustration](graphs/physics_informed_rl_pinn.png) | Constraint-based RL loss | PINN-RL, SciML |
| **Modern RL** | **Neuro-Symbolic RL** | ![Illustration](graphs/neuro_symbolic_rl.png) | Combining logic and neural nets | Neuro-Symbolic, Logic RL |
| **Applied RL** | **DeFi Liquidity Pool RL** | ![Illustration](graphs/defi_liquidity_pool_rl.png) | Yield farming/Liquidity balancing | DeFi-RL, AMM Optimization |
| **Neuro RL** | **Dopamine Reward Prediction Error** | ![Illustration](graphs/dopamine_reward_prediction_error.png) | Biological RL signal curves | Neuroscience-RL, Wolfram Schultz |
| **Robotics** | **Proprioceptive Sensory-Motor RL** | ![Illustration](graphs/proprioceptive_sensory_motor_rl.png) | Low-level joint control | Proprioceptive RL, Unitree |
| **Applied RL** | **AR Object Placement RL** | ![Illustration](graphs/ar_object_placement_rl.png) | AR visual overlay optimization | AR-RL, Visual Overlay |
| **Reco RL** | **Sequential Bundle RL** | ![Illustration](graphs/sequential_bundle_rl.png) | Recommendation item grouping | Bundle-RL, E-commerce |
| **Theoretical** | **Online Gradient Descent vs RL** | ![Illustration](graphs/online_gradient_descent_vs_rl.png) | Gradient-based learning comparison | Online Learning, Regret |
| **Modern RL** | **Active Learning: Query RL** | ![Illustration](graphs/active_learning_query_rl.png) | Query-based sample selection | Active-RL, Query Opt |
| **Modern RL** | **Federated RL global Aggregator** | ![Illustration](graphs/federated_rl_global_aggregator.png) | Privacy-preserving distributed RL | Federated-RL, FedAvg-RL |
| **Conceptual** | **Ultimate Universal RL Mastery Diagram** | ![Illustration](graphs/ultimate_universal_rl_mastery_diagram.png) | Final summary of 230 items | Absolute Mastery Milestone |
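
## Code Sketches

A few of the components above map directly onto short programs. The sketches below are illustrative, not canonical implementations. First, the Agent-Environment Interaction Loop from the top of the table, written against the Gymnasium API; the `CartPole-v1` environment id and the uniform-random policy are placeholder choices.

```python
import gymnasium as gym

# Core cycle: observe state -> select action -> environment transition
# -> receive reward + next state, repeated until the episode ends.
env = gym.make("CartPole-v1")
state, info = env.reset(seed=0)

episode_return, done = 0.0, False
while not done:
    action = env.action_space.sample()  # placeholder policy: uniform random
    state, reward, terminated, truncated, info = env.step(action)
    episode_return += reward
    done = terminated or truncated

env.close()
print(f"episode return: {episode_return:.1f}")
```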
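
The Q-Learning Update and ε-Greedy Strategy rows combine into a complete tabular learner. A minimal sketch assuming a small discrete environment; `FrozenLake-v1`, the seed, and all hyperparameters (α, γ, ε, episode count) are arbitrary illustrative values.

```python
import numpy as np
import gymnasium as gym

env = gym.make("FrozenLake-v1", is_slippery=False)
n_states, n_actions = env.observation_space.n, env.action_space.n
Q = np.zeros((n_states, n_actions))     # tabular action-value estimates
alpha, gamma, epsilon = 0.1, 0.99, 0.1  # illustrative hyperparameters
rng = np.random.default_rng(0)

for _ in range(2000):
    s, _ = env.reset()
    done = False
    while not done:
        # ε-greedy: random action with probability ε, else greedy w.r.t. Q
        a = env.action_space.sample() if rng.random() < epsilon else int(np.argmax(Q[s]))
        s_next, r, terminated, truncated, _ = env.step(a)
        done = terminated or truncated
        # Off-policy TD target r + γ·max_a' Q(s',a'); no bootstrapping at terminal states
        target = r + (0.0 if terminated else gamma * np.max(Q[s_next]))
        Q[s, a] += alpha * (target - Q[s, a])
        s = s_next
```

Because the TD target takes the max over next actions rather than the action the behavior policy actually selects, the update is off-policy: it learns about the greedy policy while following the ε-greedy one.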
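
Finally, the clipped surrogate objective from the PPO row, as a standalone PyTorch loss. Here `ratio` (the probability ratio π_θ(a|s) / π_θ_old(a|s)) and `adv` (advantage estimates) are assumed to be computed elsewhere in a full training loop.

```python
import torch

def ppo_clip_loss(ratio: torch.Tensor, adv: torch.Tensor, eps: float = 0.2) -> torch.Tensor:
    """Negative clipped surrogate: -E[min(r·A, clip(r, 1-ε, 1+ε)·A)]."""
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * adv
    return -torch.min(unclipped, clipped).mean()  # minimize negative = maximize surrogate
```

Minimizing this loss with gradient descent maximizes the clipped objective, which keeps each policy update close to the policy that collected the data.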