---
title: Reinforcement Learning Graphical Representations
date: 2026-04-08
category: Reinforcement Learning
description: A comprehensive gallery of 230 standard RL components and their graphical presentations.
---

# Reinforcement Learning Graphical Representations

This repository contains a full set of 230 visualizations representing foundational concepts, algorithms, and advanced topics in Reinforcement Learning.

| Category | Component | Illustration | Details | Context |
|----------|-----------|--------------|---------|---------|
| **MDP & Environment** | **Agent-Environment Interaction Loop** | ![Illustration](graphs/agent_environment_interaction_loop.png) | Core cycle: observation of state → selection of action → environment transition → receipt of reward + next state | All RL algorithms |
| **MDP & Environment** | **Markov Decision Process (MDP) Tuple** | ![Illustration](graphs/markov_decision_process_mdp_tuple.png) | (S, A, P, R, γ) with transition dynamics P(s′ \| s, a) and reward function R(s, a, s′) | Formal basis of all RL problems |
| **MDP & Environment** | **State Transition Graph** | ![Illustration](graphs/state_transition_graph.png) | Full probabilistic transitions between discrete states | Gridworld, Taxi, Cliff Walking |
| **MDP & Environment** | **Trajectory / Episode Sequence** | ![Illustration](graphs/trajectory_episode_sequence.png) | Sequence of (s₀, a₀, r₁, s₁, …, s_T) | Monte Carlo, episodic tasks |
| **MDP & Environment** | **Continuous State/Action Space Visualization** | ![Illustration](graphs/continuous_state_action_space_visualization.png) | High-dimensional spaces (e.g., robot joints, pixel inputs) | Continuous-control tasks (MuJoCo, PyBullet) |
| **MDP & Environment** | **Reward Function / Landscape** | ![Illustration](graphs/reward_function_landscape.png) | Scalar reward as function of state/action | All algorithms; especially reward shaping |
| **MDP & Environment** | **Discount Factor (γ) Effect** | ![Illustration](graphs/discount_factor_effect.png) | How future rewards are weighted | All discounted MDPs |
| **Value & Policy** | **State-Value Function V(s)** | ![Illustration](graphs/state_value_function_v_s.png) | Expected return from state s under policy π | Value-based methods |
| **Value & Policy** | **Action-Value Function Q(s,a)** | ![Illustration](graphs/action_value_function_q_s_a.png) | Expected return from state-action pair | Q-learning family |
| **Value & Policy** | **Policy π(s) or π(a \| s)** | ![Illustration](graphs/policy_s_or_a.png) | Mapping from states to actions (deterministic) or to action probabilities (stochastic) | Arrow overlays on grid (optimal policy), probability bar charts, or softmax heatmaps |
| **Value & Policy** | **Advantage Function A(s,a)** | ![Illustration](graphs/advantage_function_a_s_a.png) | Q(s,a) – V(s) | A2C, PPO, SAC, TD3 |
| **Value & Policy** | **Optimal Value Function V\* / Q\*** | ![Illustration](graphs/optimal_value_function_v_q.png) | Solution to Bellman optimality | Value iteration, Q-learning |
| **Dynamic Programming** | **Policy Evaluation Backup** | ![Illustration](graphs/policy_evaluation_backup.png) | Iterative update of V using Bellman expectation | Policy iteration |
| **Dynamic Programming** | **Policy Improvement** | ![Illustration](graphs/policy_improvement.png) | Greedy policy update over Q | Policy iteration |
| **Dynamic Programming** | **Value Iteration Backup** | ![Illustration](graphs/value_iteration_backup.png) | Update using Bellman optimality (see the sketches after the table) | Value iteration |
| **Dynamic Programming** | **Policy Iteration Full Cycle** | ![Illustration](graphs/policy_iteration_full_cycle.png) | Evaluation → Improvement loop | Classic DP methods |
| **Monte Carlo** | **Monte Carlo Backup** | ![Illustration](graphs/monte_carlo_backup.png) | Update using full episode return G_t | First-visit / every-visit MC |
| **Monte Carlo** | **Monte Carlo Tree Search (MCTS)** | ![Illustration](graphs/monte_carlo_tree_mcts.png) | Search tree with selection, expansion, simulation, backprop | AlphaGo, AlphaZero |
| **Monte Carlo** | **Importance Sampling Ratio** | ![Illustration](graphs/importance_sampling_ratio.png) | Off-policy correction ρ = π(a \| s) / b(a \| s) | Off-policy MC evaluation and control |
| **Temporal Difference** | **TD(0) Backup** | ![Illustration](graphs/td_0_backup.png) | Bootstrapped update using R + γV(s′) | TD learning |
| **Temporal Difference** | **Bootstrapping (general)** | ![Illustration](graphs/bootstrapping_general.png) | Using estimated future value instead of full return | All TD methods |
| **Temporal Difference** | **n-step TD Backup** | ![Illustration](graphs/n_step_td_backup.png) | Multi-step return G_t^{(n)} | n-step TD, TD(λ) |
| **Temporal Difference** | **TD(λ) & Eligibility Traces** | ![Illustration](graphs/td_eligibility_traces.png) | Decaying trace z_t for credit assignment | TD(λ), SARSA(λ), Q(λ) |
| **Temporal Difference** | **SARSA Update** | ![Illustration](graphs/sarsa_update.png) | On-policy TD control | SARSA |
| **Temporal Difference** | **Q-Learning Update** | ![Illustration](graphs/q_learning_update.png) | Off-policy TD control (see the sketches after the table) | Q-learning, Deep Q-Network |
| **Temporal Difference** | **Expected SARSA** | ![Illustration](graphs/expected_sarsa.png) | Expectation over next action under policy | Expected SARSA |
| **Temporal Difference** | **Double Q-Learning / Double DQN** | ![Illustration](graphs/double_q_learning_double_dqn.png) | Two separate Q estimators to reduce overestimation | Double DQN, TD3 |
| **Temporal Difference** | **Dueling DQN Architecture** | ![Illustration](graphs/dueling_dqn_architecture.png) | Separate streams for state value V(s) and advantage A(s,a) | Dueling DQN |
| **Temporal Difference** | **Prioritized Experience Replay** | ![Illustration](graphs/prioritized_experience_replay.png) | Importance sampling of transitions by TD error | Prioritized DQN, Rainbow |
| **Temporal Difference** | **Rainbow DQN Components** | ![Illustration](graphs/rainbow_dqn_components.png) | All extensions combined (Double, Dueling, PER, etc.) | Rainbow DQN |
| **Function Approximation** | **Linear Function Approximation** | ![Illustration](graphs/linear_function_approximation.png) | Feature vector φ(s) → wᵀφ(s) | Tabular → linear FA |
| **Function Approximation** | **Neural Network Layers (MLP, CNN, RNN, Transformer)** | ![Illustration](graphs/neural_network_layers_mlp_cnn_rnn_transformer.png) | Full deep network for value/policy | DQN, A3C, PPO, Decision Transformer |
| **Function Approximation** | **Computation Graph / Backpropagation Flow** | ![Illustration](graphs/computation_graph_backpropagation_flow.png) | Gradient flow through network | All deep RL |
| **Function Approximation** | **Target Network** | ![Illustration](graphs/target_network.png) | Frozen copy of Q-network for stability | DQN, DDQN, SAC, TD3 |
| **Policy Gradients** | **Policy Gradient Theorem** | ![Illustration](graphs/policy_gradient_theorem.png) | ∇_θ J(θ) = E[∇_θ log π(a \| s) · Q^π(s,a)] | Flow diagram from reward → log-prob → gradient |
| **Policy Gradients** | **REINFORCE Update** | ![Illustration](graphs/reinforce_update.png) | Monte-Carlo policy gradient (see the sketches after the table) | REINFORCE |
| **Policy Gradients** | **Baseline / Advantage Subtraction** | ![Illustration](graphs/baseline_advantage_subtraction.png) | Subtract b(s) to reduce variance | All modern PG |
| **Policy Gradients** | **Trust Region (TRPO)** | ![Illustration](graphs/trust_region_trpo.png) | KL-divergence constraint on policy update | TRPO |
| **Policy Gradients** | **Proximal Policy Optimization (PPO)** | ![Illustration](graphs/proximal_policy_optimization_ppo.png) | Clipped surrogate objective (see the sketches after the table) | PPO, PPO-Clip |
| **Actor-Critic** | **Actor-Critic Architecture** | ![Illustration](graphs/actor_critic_architecture.png) | Separate or shared actor (policy) + critic (value) networks | A2C, A3C, SAC, TD3 |
| **Actor-Critic** | **Advantage Actor-Critic (A2C/A3C)** | ![Illustration](graphs/advantage_actor_critic_a2c_a3c.png) | Synchronous/asynchronous multi-worker | A2C/A3C |
| **Actor-Critic** | **Soft Actor-Critic (SAC)** | ![Illustration](graphs/soft_actor_critic_sac.png) | Entropy-regularized policy + twin critics | SAC |
| **Actor-Critic** | **Twin Delayed DDPG (TD3)** | ![Illustration](graphs/twin_delayed_ddpg_td3.png) | Twin critics + delayed policy + target smoothing | TD3 |
| **Exploration** | **ε-Greedy Strategy** | ![Illustration](graphs/greedy_strategy.png) | Probability ε of random action (see the sketches after the table) | DQN family |
| **Exploration** | **Softmax / Boltzmann Exploration** | ![Illustration](graphs/softmax_boltzmann_exploration.png) | Temperature τ in softmax (see the sketches after the table) | Softmax policies |
| **Exploration** | **Upper Confidence Bound (UCB)** | ![Illustration](graphs/upper_confidence_bound_ucb.png) | Optimism in face of uncertainty | UCB1, bandits |
| **Exploration** | **Intrinsic Motivation / Curiosity** | ![Illustration](graphs/intrinsic_motivation_curiosity.png) | Prediction error as intrinsic reward | ICM, RND, Curiosity-driven RL |
| **Exploration** | **Entropy Regularization** | ![Illustration](graphs/entropy_regularization.png) | Bonus term αH(π) | SAC, maximum-entropy RL |
| **Hierarchical RL** | **Options Framework** | ![Illustration](graphs/options_framework.png) | High-level policy over options (temporally extended actions) | Option-Critic |
| **Hierarchical RL** | **Feudal Networks / Hierarchical Actor-Critic** | ![Illustration](graphs/feudal_networks_hierarchical_actor_critic.png) | Manager-worker hierarchy | Feudal RL |
| **Hierarchical RL** | **Skill Discovery** | ![Illustration](graphs/skill_discovery.png) | Unsupervised emergence of reusable skills | DIAYN, VALOR |
| **Model-Based RL** | **Learned Dynamics Model** | ![Illustration](graphs/learned_dynamics_model.png) | Learned transition model P̂(s′ \| s, a) | Separate model network diagram (often RNN or transformer) |
| **Model-Based RL** | **Model-Based Planning** | ![Illustration](graphs/model_based_planning.png) | Rollouts inside learned model | MuZero, DreamerV3 |
| **Model-Based RL** | **Imagination-Augmented Agents (I2A)** | ![Illustration](graphs/imagination_augmented_agents_i2a.png) | Imagination module + policy | I2A |
| **Offline RL** | **Offline Dataset** | ![Illustration](graphs/offline_dataset.png) | Fixed batch of trajectories | BC, CQL, IQL |
| **Offline RL** | **Conservative Q-Learning (CQL)** | ![Illustration](graphs/conservative_q_learning_cql.png) | Penalty on out-of-distribution actions | CQL |
| **Multi-Agent RL** | **Multi-Agent Interaction Graph** | ![Illustration](graphs/multi_agent_interaction_graph.png) | Agents communicating or competing | MARL, MADDPG |
| **Multi-Agent RL** | **Centralized Training Decentralized Execution (CTDE)** | ![Illustration](graphs/centralized_training_decentralized_execution_ctde.png) | Shared critic during training | QMIX, VDN, MADDPG |
| **Multi-Agent RL** | **Cooperative / Competitive Payoff Matrix** | ![Illustration](graphs/cooperative_competitive_payoff_matrix.png) | Joint reward for multiple agents | Prisoner's Dilemma, multi-agent gridworlds |
| **Inverse RL / IRL** | **Reward Inference** | ![Illustration](graphs/reward_inference.png) | Infer reward from expert demonstrations | IRL, GAIL |
| **Inverse RL / IRL** | **Generative Adversarial Imitation Learning (GAIL)** | ![Illustration](graphs/generative_adversarial_imitation_learning_gail.png) | Discriminator vs. policy generator | GAIL, AIRL |
| **Meta-RL** | **Meta-RL Architecture** | ![Illustration](graphs/meta_rl_architecture.png) | Outer loop (meta-policy) + inner loop (task adaptation) | MAML for RL, RL² |
| **Meta-RL** | **Task Distribution Visualization** | ![Illustration](graphs/task_distribution_visualization.png) | Multiple MDPs sampled from meta-distribution | Meta-RL benchmarks |
| **Advanced / Misc** | **Experience Replay Buffer** | ![Illustration](graphs/experience_replay_buffer.png) | Stored (s,a,r,s′,done) tuples (see the sketches after the table) | DQN and all off-policy deep RL |
| **Advanced / Misc** | **State Visitation / Occupancy Measure** | ![Illustration](graphs/state_visitation_occupancy_measure.png) | Frequency of visiting each state | All algorithms (analysis) |
| **Advanced / Misc** | **Learning Curve** | ![Illustration](graphs/learning_curve.png) | Average episodic return vs. episodes / steps | Standard performance reporting |
| **Advanced / Misc** | **Regret / Cumulative Regret** | ![Illustration](graphs/regret_cumulative_regret.png) | Sub-optimality accumulated | Bandits and online RL |
| **Advanced / Misc** | **Attention Mechanisms (Transformers in RL)** | ![Illustration](graphs/attention_mechanisms_transformers_in_rl.png) | Attention weights | Decision Transformer, Trajectory Transformer |
| **Advanced / Misc** | **Diffusion Policy** | ![Illustration](graphs/diffusion_policy.png) | Denoising diffusion process for action generation | Diffusion-RL policies |
| **Advanced / Misc** | **Graph Neural Networks for RL** | ![Illustration](graphs/graph_neural_networks_for_rl.png) | Node/edge message passing | Graph RL, relational RL |
| **Advanced / Misc** | **World Model / Latent Space** | ![Illustration](graphs/world_model_latent_space.png) | Encoder-decoder dynamics in latent space | Dreamer, PlaNet |
| **Advanced / Misc** | **Convergence Analysis Plots** | ![Illustration](graphs/convergence_analysis_plots.png) | Error / value change over iterations | DP, TD, value iteration |
| **Advanced / Misc** | **RL Algorithm Taxonomy** | ![Illustration](graphs/rl_algorithm_taxonomy.png) | Comprehensive classification of algorithms | All RL |
| **Advanced / Misc** | **Probabilistic Graphical Model (RL as Inference)** | ![Illustration](graphs/probabilistic_graphical_model_rl_as_inference.png) | Formalizing RL as probabilistic inference | Control as Inference, MaxEnt RL |
| **Value & Policy** | **Distributional RL (C51 / Categorical)** | ![Illustration](graphs/distributional_rl_c51_categorical.png) | Representing return as a probability distribution | C51, QR-DQN, IQN |
| **Exploration** | **Hindsight Experience Replay (HER)** | ![Illustration](graphs/hindsight_experience_replay_her.png) | Learning from failures by relabeling goals | Sparse reward robotics, HER |
| **Model-Based RL** | **Dyna-Q Architecture** | ![Illustration](graphs/dyna_q_architecture.png) | Integration of real experience and model-based planning | Dyna-Q, Dyna-2 |
| **Function Approximation** | **Noisy Networks (Parameter Noise)** | ![Illustration](graphs/noisy_networks_parameter_noise.png) | Stochastic weights for exploration | Noisy DQN, Rainbow |
| **Exploration** | **Intrinsic Curiosity Module (ICM)** | ![Illustration](graphs/intrinsic_curiosity_module_icm.png) | Reward based on prediction error | Curiosity-driven exploration, ICM |
| **Temporal Difference** | **V-trace (IMPALA)** | ![Illustration](graphs/v_trace_impala.png) | Asynchronous off-policy importance sampling | IMPALA, V-trace |
| **Multi-Agent RL** | **QMIX Mixing Network** | ![Illustration](graphs/qmix_mixing_network.png) | Monotonic value function factorization | QMIX, VDN |
| **Advanced / Misc** | **Saliency Maps / Attention on State** | ![Illustration](graphs/saliency_maps_attention_on_state.png) | Visualizing what the agent "sees" or prioritizes | Interpretability, Atari RL |
| **Exploration** | **Action Selection Noise (OU vs Gaussian)** | ![Illustration](graphs/action_selection_noise_ou_vs_gaussian.png) | Temporal correlation in exploration noise | DDPG, TD3 |
| **Advanced / Misc** | **t-SNE / UMAP State Embeddings** | ![Illustration](graphs/t_sne_umap_state_embeddings.png) | Dimension reduction of high-dim neural states | Interpretability, SRL |
| **Advanced / Misc** | **Loss Landscape Visualization** | ![Illustration](graphs/loss_landscape_visualization.png) | Optimization surface geometry | Training stability analysis |
| **Advanced / Misc** | **Success Rate vs Steps** | ![Illustration](graphs/success_rate_vs_steps.png) | Percentage of successful episodes | Goal-conditioned RL, Robotics |
| **Advanced / Misc** | **Hyperparameter Sensitivity Heatmap** | ![Illustration](graphs/hyperparameter_sensitivity_heatmap.png) | Performance across parameter grids | Hyperparameter tuning |
| **Dynamics** | **Action Persistence (Frame Skipping)** | ![Illustration](graphs/action_persistence_frame_skipping.png) | Temporal abstraction by repeating actions | Atari RL, Robotics |
| **Model-Based RL** | **MuZero Dynamics Search Tree** | ![Illustration](graphs/muzero_dynamics_search_tree.png) | Planning with learned transition and value functions | MuZero, Gumbel MuZero |
| **Deep RL** | **Policy Distillation** | ![Illustration](graphs/policy_distillation.png) | Compressing knowledge from teacher to student | Kickstarting, multitask learning |
| **Transformers** | **Decision Transformer Token Sequence** | ![Illustration](graphs/decision_transformer_token_sequence.png) | Casting RL as a sequence-modeling task over return, state, and action tokens | Decision Transformer, TT |
| **Advanced / Misc** | **Performance Profiles (rliable)** | ![Illustration](graphs/performance_profiles_rliable.png) | Robust aggregate performance metrics | Reliable RL evaluation |
| **Safety RL** | **Safety Shielding / Barrier Functions** | ![Illustration](graphs/safety_shielding_barrier_functions.png) | Hard constraints on the action space | Constrained MDPs, Safe RL |
| **Training** | **Automated Curriculum Learning** | ![Illustration](graphs/automated_curriculum_learning.png) | Progressively increasing task difficulty | Curriculum RL, ALP-GMM |
| **Sim-to-Real** | **Domain Randomization** | ![Illustration](graphs/domain_randomization.png) | Generalizing across environment variations | Robotics, Sim-to-Real |
| **Alignment** | **RL with Human Feedback (RLHF)** | ![Illustration](graphs/rl_with_human_feedback_rlhf.png) | Aligning agents with human preferences | ChatGPT, InstructGPT |
| **Neuro-inspired RL** | **Successor Representation (SR)** | ![Illustration](graphs/successor_representation_sr.png) | Predictive state representations | SR-Dyna, Neuro-RL |
| **Inverse RL / IRL** | **Maximum Entropy IRL** | ![Illustration](graphs/maximum_entropy_irl.png) | Probability distribution over trajectories | MaxEnt IRL, Ziebart |
| **Theory** | **Information Bottleneck** | ![Illustration](graphs/information_bottleneck.png) | Mutual information $I(S;Z)$ and $I(Z;A)$ balance | VIB-RL, Information Theory |
| **Evolutionary RL** | **Evolutionary Strategies Population** | ![Illustration](graphs/evolutionary_strategies_population.png) | Population-based parameter search | OpenAI-ES, Salimans |
| **Safety RL** | **Control Barrier Functions (CBF)** | ![Illustration](graphs/control_barrier_functions_cbf.png) | Set-theoretic safety guarantees | CBF-RL, Control Theory |
| **Exploration** | **Count-based Exploration Heatmap** | ![Illustration](graphs/count_based_exploration_heatmap.png) | Visitation frequency and intrinsic bonus | MBIE-EB, RND |
| **Exploration** | **Thompson Sampling Posteriors** | ![Illustration](graphs/thompson_sampling_posteriors.png) | Direct uncertainty-based action selection | Bandits, Bayesian RL |
| **Multi-Agent RL** | **Adversarial RL Interaction** | ![Illustration](graphs/adversarial_rl_interaction.png) | Competition between protagonist and antagonist | Robust RL, RARL |
| **Hierarchical RL** | **Hierarchical Subgoal Trajectory** | ![Illustration](graphs/hierarchical_subgoal_trajectory.png) | Decomposing long-horizon tasks | Subgoal RL, HIRO |
| **Offline RL** | **Offline Action Distribution Shift** | ![Illustration](graphs/offline_action_distribution_shift.png) | Mismatch between dataset and current policy | CQL, IQL, D4RL |
| **Exploration** | **Random Network Distillation (RND)** | ![Illustration](graphs/random_network_distillation_rnd.png) | Prediction error as intrinsic reward | RND, OpenAI |
| **Offline RL** | **Batch-Constrained Q-learning (BCQ)** | ![Illustration](graphs/batch_constrained_q_learning_bcq.png) | Constraining actions to behavior dataset | BCQ, Fujimoto |
| **Training** | **Population-Based Training (PBT)** | ![Illustration](graphs/population_based_training_pbt.png) | Evolutionary hyperparameter optimization | PBT, DeepMind |
| **Deep RL** | **Recurrent State Flow (DRQN/R2D2)** | ![Illustration](graphs/recurrent_state_flow_drqn_r2d2.png) | Temporal dependency in state-action value | DRQN, R2D2 |
| **Theory** | **Belief State in POMDPs** | ![Illustration](graphs/belief_state_in_pomdps.png) | Probability distribution over hidden states | POMDPs, Belief Space |
| **Multi-Objective RL** | **Multi-Objective Pareto Front** | ![Illustration](graphs/multi_objective_pareto_front.png) | Balancing conflicting reward signals | MORL, Pareto Optimal |
| **Theory** | **Differential Value (Average Reward RL)** | ![Illustration](graphs/differential_value_average_reward_rl.png) | Values relative to average gain | Average Reward RL, Mahadevan |
| **Infrastructure** | **Distributed RL Cluster (Ray/RLlib)** | ![Illustration](graphs/distributed_rl_cluster_ray_rllib.png) | Parallelizing experience collection | Ray, RLlib, Ape-X |
| **Evolutionary RL** | **Neuroevolution Topology Evolution** | ![Illustration](graphs/neuroevolution_topology_evolution.png) | Evolving neural network architectures | NEAT, HyperNEAT |
| **Continual RL** | **Elastic Weight Consolidation (EWC)** | ![Illustration](graphs/elastic_weight_consolidation_ewc.png) | Preventing catastrophic forgetting | EWC, Kirkpatrick |
| **Theory** | **Successor Features (SF)** | ![Illustration](graphs/successor_features_sf.png) | Generalizing predictive representations | SF-Dyna, Barreto |
| **Safety** | **Adversarial State Noise (Perception)** | ![Illustration](graphs/adversarial_state_noise_perception.png) | Attacks on agent observation space | Adversarial RL, Huang |
| **Imitation Learning** | **Behavioral Cloning (Imitation)** | ![Illustration](graphs/behavioral_cloning_imitation.png) | Direct supervised learning from experts | BC, DAgger |
| **Relational RL** | **Relational Graph State Representation** | ![Illustration](graphs/relational_graph_state_representation.png) | Modeling objects and their relations | Relational MDPs, BoxWorld |
| **Quantum RL** | **Quantum RL Circuit (PQC)** | ![Illustration](graphs/quantum_rl_circuit_pqc.png) | Gate-based quantum policy networks | Quantum RL, PQC |
| **Symbolic RL** | **Symbolic Policy Tree** | ![Illustration](graphs/symbolic_policy_tree.png) | Policies as mathematical expressions | Symbolic RL, GP |
| **Control** | **Differentiable Physics Gradient Flow** | ![Illustration](graphs/differentiable_physics_gradient_flow.png) | Gradient-based planning through simulators | Brax, Isaac Gym |
| **Multi-Agent RL** | **MARL Communication Channel** | ![Illustration](graphs/marl_communication_channel.png) | Information exchange between agents | CommNet, DIAL |
| **Safety** | **Lagrangian Constraint Landscape** | ![Illustration](graphs/lagrangian_constraint_landscape.png) | Constrained optimization boundaries | Constrained RL, CPO |
| **Hierarchical RL** | **MAXQ Task Hierarchy** | ![Illustration](graphs/maxq_task_hierarchy.png) | Recursive task decomposition | MAXQ, Dietterich |
| **Agentic AI** | **ReAct Agentic Cycle** | ![Illustration](graphs/react_agentic_cycle.png) | Reasoning-Action loops for LLMs | ReAct, Agentic LLM |
| **Bio-inspired RL** | **Synaptic Plasticity RL** | ![Illustration](graphs/synaptic_plasticity_rl.png) | Hebbian-style synaptic weight updates | Hebbian RL, STDP |
| **Control** | **Guided Policy Search (GPS)** | ![Illustration](graphs/guided_policy_search_gps.png) | Distilling trajectories into a policy | GPS, Levine |
| **Robotics** | **Sim-to-Real Jitter & Latency** | ![Illustration](graphs/sim_to_real_jitter_latency.png) | Temporal robustness in transfer | Sim-to-Real, Robustness |
| **Policy Gradients** | **Deterministic Policy Gradient (DDPG) Flow** | ![Illustration](graphs/deterministic_policy_gradient_ddpg_flow.png) | Gradient flow for deterministic policies | DDPG |
| **Model-Based RL** | **Dreamer Latent Imagination** | ![Illustration](graphs/dreamer_latent_imagination.png) | Learning and planning in latent space | Dreamer (V1-V3) |
| **Deep RL** | **UNREAL Auxiliary Tasks** | ![Illustration](graphs/unreal_auxiliary_tasks.png) | Learning from non-reward signals | UNREAL, A3C extension |
| **Offline RL** | **Implicit Q-Learning (IQL) Expectile** | ![Illustration](graphs/implicit_q_learning_iql_expectile.png) | In-sample learning via expectile regression | IQL |
| **Model-Based RL** | **Prioritized Sweeping** | ![Illustration](graphs/prioritized_sweeping.png) | Planning prioritized by TD error | Sutton & Barto classic MBRL |
| **Imitation Learning** | **DAgger Expert Loop** | ![Illustration](graphs/dagger_expert_loop.png) | Training on expert labels in agent-visited states | DAgger |
| **Representation** | **Self-Predictive Representations (SPR)** | ![Illustration](graphs/self_predictive_representations_spr.png) | Consistency between predicted and target latents | SPR, sample-efficient RL |
| **Multi-Agent RL** | **Joint Action Space** | ![Illustration](graphs/joint_action_space.png) | Cartesian product of individual actions | MARL theory, Game Theory |
| **Multi-Agent RL** | **Dec-POMDP Formal Model** | ![Illustration](graphs/dec_pomdp_formal_model.png) | Decentralized partially observable MDP | Multi-agent coordination |
| **Theory** | **Bisimulation Metric** | ![Illustration](graphs/bisimulation_metric.png) | State equivalence based on transitions/rewards | State abstraction, bisimulation theory |
| **Theory** | **Potential-Based Reward Shaping** | ![Illustration](graphs/potential_based_reward_shaping.png) | Reward transformation preserving optimal policy | Sutton & Barto, Ng et al. |
| **Training** | **Transfer RL: Source to Target** | ![Illustration](graphs/transfer_rl_source_to_target.png) | Reusing knowledge across different MDPs | Transfer Learning, Distillation |
| **Deep RL** | **Multi-Task Backbone Arch** | ![Illustration](graphs/multi_task_backbone_arch.png) | Single agent learning multiple tasks | Multi-task RL, IMPALA |
| **Bandits** | **Contextual Bandit Pipeline** | ![Illustration](graphs/contextual_bandit_pipeline.png) | Decision making given context but no transitions | Personalization, Ad-tech |
| **Theory** | **Theoretical Regret Bounds** | ![Illustration](graphs/theoretical_regret_bounds.png) | Analytical performance guarantees | Online Learning, Bandits |
| **Value-based** | **Soft Q Boltzmann Probabilities** | ![Illustration](graphs/soft_q_boltzmann_probabilities.png) | Probabilistic action selection from Q-values | $\pi(a \mid s) \propto \exp(Q(s,a)/\tau)$ |
| **Robotics** | **Autonomous Driving RL Pipeline** | ![Illustration](graphs/autonomous_driving_rl_pipeline.png) | End-to-end or modular driving stack | Wayve, Tesla, Comma.ai |
| **Policy** | **Policy Action Gradient Comparison** | ![Illustration](graphs/policy_action_gradient_comparison.png) | Comparison of gradient derivation types | PG Theorem vs DPG Theorem |
| **Inverse RL / IRL** | **IRL: Feature Expectation Matching** | ![Illustration](graphs/irl_feature_expectation_matching.png) | Comparing expert vs learner feature visitation frequency | $\mu(\pi^*) - \mu(\pi)$ |
| **Imitation Learning** | **Apprenticeship Learning Loop** | ![Illustration](graphs/apprenticeship_learning_loop.png) | Training to match expert performance via reward inference | Apprenticeship Learning |
| **Theory** | **Active Inference Loop** | ![Illustration](graphs/active_inference_loop.png) | Agents minimizing surprise (free energy) | Free Energy Principle, Friston |
| **Theory** | **Bellman Residual Landscape** | ![Illustration](graphs/bellman_residual_landscape.png) | Training surface of the Bellman error | TD learning, fitted Q-iteration |
| **Model-Based RL** | **Plan-to-Explore Uncertainty Map** | ![Illustration](graphs/plan_to_explore_uncertainty_map.png) | Systematic exploration in learned world models | Plan-to-Explore, Sekar et al. |
| **Safety RL** | **Robust RL Uncertainty Set** | ![Illustration](graphs/robust_rl_uncertainty_set.png) | Optimizing for the worst-case environment transition | Robust MDPs, minimax RL |
| **Training** | **HPO Bayesian Opt Cycle** | ![Illustration](graphs/hpo_bayesian_opt_cycle.png) | Automating hyperparameter selection with GP | Hyperparameter Optimization |
| **Applied RL** | **Slate RL Recommendation** | ![Illustration](graphs/slate_rl_recommendation.png) | Optimizing list/slate of items for users | Recommender Systems, Ie et al. |
| **Multi-Agent RL** | **Fictitious Play Interaction** | ![Illustration](graphs/fictitious_play_interaction.png) | Belief-based learning in games | Game Theory, Brown (1951) |
| **Conceptual** | **Universal RL Framework Diagram** | ![Illustration](graphs/universal_rl_framework_diagram.png) | High-level summary of RL components | All RL |
| **Offline RL** | **Offline Density Ratio Estimator** | ![Illustration](graphs/offline_density_ratio_estimator.png) | Estimating $w(s,a)$ for off-policy data | Importance Sampling, Offline RL |
| **Continual RL** | **Continual Task Interference Heatmap** | ![Illustration](graphs/continual_task_interference_heatmap.png) | Measuring negative transfer between tasks | Lifelong Learning, EWC |
| **Safety RL** | **Lyapunov Stability Safe Set** | ![Illustration](graphs/lyapunov_stability_safe_set.png) | Invariant sets for safe control | Lyapunov RL, Chow et al. |
| **Applied RL** | **Molecular RL (Atom Coordinates)** | ![Illustration](graphs/molecular_rl_atom_coordinates.png) | RL for molecular design/protein folding | Chemistry RL, AlphaFold-style |
| **Architecture** | **MoE Multi-task Architecture** | ![Illustration](graphs/moe_multi_task_architecture.png) | Scaling models with mixture of experts | MoE-RL, Sparsity |
| **Direct Policy Search** | **CMA-ES Policy Search** | ![Illustration](graphs/cma_es_policy_search.png) | Evolutionary strategy for policy weights | ES for RL, Salimans |
| **Alignment** | **Elo Rating Preference Plot** | ![Illustration](graphs/elo_rating_preference_plot.png) | Measuring agent strength over time | AlphaZero, League training |
| **Explainable RL** | **Explainable RL (SHAP Attribution)** | ![Illustration](graphs/explainable_rl_shap_attribution.png) | Local attribution of features to agent actions | Interpretability, SHAP/LIME |
| **Meta-RL** | **PEARL Context Encoder** | ![Illustration](graphs/pearl_context_encoder.png) | Learning latent task representations | PEARL, Rakelly et al. |
| **Applied RL** | **Medical RL Therapy Pipeline** | ![Illustration](graphs/medical_rl_therapy_pipeline.png) | Personalized medicine and dosing | Healthcare RL, ICU Sepsis |
| **Applied RL** | **Supply Chain RL Pipeline** | ![Illustration](graphs/supply_chain_rl_pipeline.png) | Optimizing stock levels and orders | Logistics, Inventory Management |
| **Robotics** | **Sim-to-Real SysID Loop** | ![Illustration](graphs/sim_to_real_sysid_loop.png) | Closing the reality gap via parameter estimation | System Identification, Robotics |
| **Architecture** | **Transformer World Model** | ![Illustration](graphs/transformer_world_model.png) | Sequence-to-sequence dynamics modeling | DreamerV3, Transframer |
| **Applied RL** | **Network Traffic RL** | ![Illustration](graphs/network_traffic_rl.png) | Optimizing data packet routing in graphs | Networking, Traffic Engineering |
| **Training** | **RLHF: PPO with Reference Policy** | ![Illustration](graphs/rlhf_ppo_with_reference_policy.png) | Keeping the fine-tuned policy from drifting too far from a reference policy | InstructGPT, Llama 2/3 |
| **Multi-Agent RL** | **PSRO Meta-Game Update** | ![Illustration](graphs/psro_meta_game_update.png) | Reaching Nash equilibrium in large games | PSRO, Lanctot et al. |
| **Multi-Agent RL** | **DIAL: Differentiable Comm** | ![Illustration](graphs/dial_differentiable_comm.png) | End-to-end learning of communication protocols | DIAL, Foerster et al. |
| **Batch RL** | **Fitted Q-Iteration Loop** | ![Illustration](graphs/fitted_q_iteration_loop.png) | Data-driven iteration with a supervised regressor | Ernst et al. (2005) |
| **Safety RL** | **CMDP Feasible Region** | ![Illustration](graphs/cmdp_feasible_region.png) | Constrained optimization within a safety budget | Constrained MDPs, Altman |
| **Control** | **MPC vs RL Planning** | ![Illustration](graphs/mpc_vs_rl_planning.png) | Comparison of control paradigms | Control Theory vs RL |
| **AutoML** | **Learning to Optimize (L2O)** | ![Illustration](graphs/learning_to_optimize_l2o.png) | Using RL to learn an optimization update rule | L2O, Li & Malik |
| **Applied RL** | **Smart Grid RL Management** | ![Illustration](graphs/smart_grid_rl_management.png) | Optimizing energy supply and demand | Energy RL, Smart Grids |
| **Applied RL** | **Quantum State Tomography RL** | ![Illustration](graphs/quantum_state_tomography_rl.png) | RL for quantum state estimation | Quantum RL, Neural Tomography |
| **Applied RL** | **RL for Chip Placement** | ![Illustration](graphs/rl_for_chip_placement.png) | Placing components on silicon grids | Google Chip Placement |
| **Applied RL** | **RL Compiler Optimization (MLGO)** | ![Illustration](graphs/rl_compiler_optimization_mlgo.png) | Inlining and sizing in compilers | MLGO, LLVM |
| **Applied RL** | **RL for Theorem Proving** | ![Illustration](graphs/rl_for_theorem_proving.png) | Automated reasoning and proof search | LeanRL, AlphaProof |
| **Modern RL** | **Diffusion-QL Offline RL** | ![Illustration](graphs/diffusion_ql_offline_rl.png) | Policy as reverse diffusion process | $\pi_\theta(a \mid s, k)$ with noise injection |
| **Principles** | **Fairness-reward Pareto Frontier** | ![Illustration](graphs/fairness_reward_pareto_frontier.png) | Balancing equity and returns | Fair RL, Jabbari et al. |
| **Principles** | **Differentially Private RL** | ![Illustration](graphs/differentially_private_rl.png) | Privacy-preserving training | DP-RL, Agarwal et al. |
| **Applied RL** | **Smart Agriculture RL** | ![Illustration](graphs/smart_agriculture_rl.png) | Optimizing crop yield and resources | Precision Agriculture |
| **Applied RL** | **Climate Mitigation RL (Grid)** | ![Illustration](graphs/climate_mitigation_rl_grid.png) | Environmental control policies | ClimateRL, Carbon Control |
| **Applied RL** | **AI Education (Knowledge Tracing)** | ![Illustration](graphs/ai_education_knowledge_tracing.png) | Personalized learning paths | ITS, Bayesian Knowledge Tracing |
| **Modern RL** | **Decision SDE Flow** | ![Illustration](graphs/decision_sde_flow.png) | RL in continuous stochastic systems | Neural SDEs, Control |
| **Control** | **Differentiable Physics (Brax)** | ![Illustration](graphs/differentiable_physics_brax.png) | Gradients through simulators | Brax, PhysX, MuJoCo |
| **Applied RL** | **Wireless Beamforming RL** | ![Illustration](graphs/wireless_beamforming_rl.png) | Optimizing antenna signal directions | 5G/6G Networking |
| **Applied RL** | **Quantum Error Correction RL** | ![Illustration](graphs/quantum_error_correction_rl.png) | Correcting noise in quantum circuits | Quantum Computing RL |
| **Multi-Agent RL** | **Mean Field RL Interaction** | ![Illustration](graphs/mean_field_rl_interaction.png) | Large population agent dynamics | MF-RL, Yang et al. |
| **Hierarchical RL** | **Goal-GAN Curriculum** | ![Illustration](graphs/goal_gan_curriculum.png) | Automatic goal generation | Goal-GAN, Florensa et al. |
| **Modern RL** | **JEPA: Predictive Architecture** | ![Illustration](graphs/jepa_predictive_architecture.png) | LeCun's world model framework | JEPA, I-JEPA |
| **Offline RL** | **CQL Value Penalty Landscape** | ![Illustration](graphs/cql_value_penalty_landscape.png) | Conservatism in value functions | CQL, Kumar et al. |
| **Applied RL** | **Causal RL** | ![Illustration](graphs/causal_rl.png) | Causal inverse RL graph | DAG with $S, A, R$ and latent $U$ |
| **Quantum RL** | **VQE-RL Optimization** | ![Illustration](graphs/vqe_rl_optimization.png) | Quantum circuit parameter tuning | VQE, Quantum RL |
| **Applied RL** | **De-novo Drug Discovery RL** | ![Illustration](graphs/de_novo_drug_discovery_rl.png) | Generating optimized lead molecules | Drug Discovery, Molecule RL |
| **Applied RL** | **Traffic Signal Coordination RL** | ![Illustration](graphs/traffic_signal_coordination_rl.png) | Multi-intersection coordination | IntelliLight, PressLight |
| **Applied RL** | **Mars Rover Pathfinding RL** | ![Illustration](graphs/mars_rover_pathfinding_rl.png) | Navigation on rough terrain | Space RL, Mars Rover |
| **Applied RL** | **Sports Player Movement RL** | ![Illustration](graphs/sports_player_movement_rl.png) | Predicting/optimizing player actions | Sports Analytics, Ghosting |
| **Applied RL** | **Cryptography Attack RL** | ![Illustration](graphs/cryptography_attack_rl.png) | Searching for keys/vulnerabilities | Crypto-RL, Learning to Attack |
| **Applied RL** | **Humanitarian Resource RL** | ![Illustration](graphs/humanitarian_resource_rl.png) | Disaster response allocation | AI for Good, Resource RL |
| **Applied RL** | **Video Compression RL (RD)** | ![Illustration](graphs/video_compression_rl_rd.png) | Optimizing bit-rate vs distortion | Learned Video Compression |
| **Applied RL** | **Kubernetes Auto-scaling RL** | ![Illustration](graphs/kubernetes_auto_scaling_rl.png) | Cloud resource management | Cloud RL, K8s Scaling |
| **Applied RL** | **Fluid Dynamics Flow Control RL** | ![Illustration](graphs/fluid_dynamics_flow_control_rl.png) | Airfoil/turbulence control | Aero-RL, Flow Control |
| **Applied RL** | **Structural Optimization RL** | ![Illustration](graphs/structural_optimization_rl.png) | Topology/material design | Structural RL, Topology Opt |
| **Applied RL** | **Human Decision Modeling** | ![Illustration](graphs/human_decision_modeling.png) | Prospect Theory in RL | Behavioral RL, Prospect Theory |
| **Applied RL** | **Semantic Parsing RL** | ![Illustration](graphs/semantic_parsing_rl.png) | Language-to-logic transformation | Semantic Parsing, Seq2Seq-RL |
| **Applied RL** | **Music Melody RL** | ![Illustration](graphs/music_melody_rl.png) | Reward-based melody generation | Music-RL, Magenta |
| **Applied RL** | **Plasma Fusion Control RL** | ![Illustration](graphs/plasma_fusion_control_rl.png) | Magnetic control of tokamaks | DeepMind Fusion, Tokamak RL |
| **Applied RL** | **Carbon Capture RL Cycle** | ![Illustration](graphs/carbon_capture_rl_cycle.png) | Adsorption/desorption optimization | Carbon Capture, Green RL |
| **Applied RL** | **Swarm Robotics RL** | ![Illustration](graphs/swarm_robotics_rl.png) | Decentralized swarm coordination | Swarm-RL, Multi-Robot |
| **Applied RL** | **Legal Compliance RL Game** | ![Illustration](graphs/legal_compliance_rl_game.png) | Regulatory games | Legal-RL, RegTech |
| **Physics RL** | **Physics-Informed RL (PINN)** | ![Illustration](graphs/physics_informed_rl_pinn.png) | Constraint-based RL loss | PINN-RL, SciML |
| **Modern RL** | **Neuro-Symbolic RL** | ![Illustration](graphs/neuro_symbolic_rl.png) | Combining logic and neural nets | Neuro-Symbolic, Logic RL |
| **Applied RL** | **DeFi Liquidity Pool RL** | ![Illustration](graphs/defi_liquidity_pool_rl.png) | Yield farming/liquidity balancing | DeFi-RL, AMM Optimization |
| **Neuro RL** | **Dopamine Reward Prediction Error** | ![Illustration](graphs/dopamine_reward_prediction_error.png) | Biological RL signal curves | Neuroscience-RL, Wolfram Schultz |
| **Robotics** | **Proprioceptive Sensory-Motor RL** | ![Illustration](graphs/proprioceptive_sensory_motor_rl.png) | Low-level joint control | Proprioceptive RL, Unitree |
| **Applied RL** | **AR Object Placement RL** | ![Illustration](graphs/ar_object_placement_rl.png) | AR visual overlay optimization | AR-RL, Visual Overlay |
| **Reco RL** | **Sequential Bundle RL** | ![Illustration](graphs/sequential_bundle_rl.png) | Recommendation item grouping | Bundle-RL, E-commerce |
| **Theory** | **Online Gradient Descent vs RL** | ![Illustration](graphs/online_gradient_descent_vs_rl.png) | Gradient-based learning comparison | Online Learning, Regret |
| **Modern RL** | **Active Learning: Query RL** | ![Illustration](graphs/active_learning_query_rl.png) | Query-based sample selection | Active-RL, Query Opt |
| **Modern RL** | **Federated RL Global Aggregator** | ![Illustration](graphs/federated_rl_global_aggregator.png) | Privacy-preserving distributed RL | Federated-RL, FedAvg-RL |
| **Conceptual** | **Ultimate Universal RL Mastery Diagram** | ![Illustration](graphs/ultimate_universal_rl_mastery_diagram.png) | Final summary of 230 items | Absolute Mastery Milestone |
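
## Code Sketches

The sketches below are not taken from the illustrated implementations; they are minimal, self-contained Python examples added to make a few of the core updates in the table concrete. Names, hyperparameters, and toy environments are illustrative assumptions only.

The Value Iteration Backup row describes the Bellman optimality backup V(s) ← max_a [R(s,a) + γ Σ_s′ P(s′ | s,a) V(s′)]. A minimal sketch on a hypothetical two-state, two-action MDP (the transition and reward arrays are made up for illustration):

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP: P[s, a, s'] transition probabilities, R[s, a] expected rewards.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.95

V = np.zeros(2)
for _ in range(1000):
    # Bellman optimality backup: Q(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) V(s')
    Q = R + gamma * np.einsum("sap,p->sa", P, V)
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:   # stop once the backup has converged
        V = V_new
        break
    V = V_new

greedy_policy = Q.argmax(axis=1)           # policy improvement over the converged Q
print(V, greedy_policy)
```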
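
The Q-Learning Update and ε-Greedy Strategy rows combine into the classic tabular control loop, Q(s,a) ← Q(s,a) + α[r + γ max_a′ Q(s′,a′) − Q(s,a)]. A minimal sketch, assuming the `gymnasium` package and its `FrozenLake-v1` environment are available (any discrete environment would do):

```python
import numpy as np
import gymnasium as gym   # assumed dependency, not part of this repository

env = gym.make("FrozenLake-v1", is_slippery=False)
n_states, n_actions = env.observation_space.n, env.action_space.n

Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.99, 0.1     # illustrative hyperparameters
rng = np.random.default_rng(0)

for episode in range(2000):
    s, _ = env.reset()
    done = False
    while not done:
        # epsilon-greedy: explore with probability epsilon, otherwise act greedily
        a = int(rng.integers(n_actions)) if rng.random() < epsilon else int(Q[s].argmax())
        s_next, r, terminated, truncated, _ = env.step(a)
        done = terminated or truncated
        # off-policy TD target uses the max over next actions (Q-learning)
        target = r + gamma * (0.0 if terminated else Q[s_next].max())
        Q[s, a] += alpha * (target - Q[s, a])
        s = s_next
```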
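
The REINFORCE Update row is the Monte-Carlo form of the policy gradient theorem, ∇_θ J(θ) = E[G_t ∇_θ log π(a_t | s_t)]. A minimal sketch with a linear softmax policy, again assuming `gymnasium` and `CartPole-v1`; a practical implementation would normally subtract a baseline, as the Baseline / Advantage Subtraction row notes:

```python
import numpy as np
import gymnasium as gym   # assumed dependency

env = gym.make("CartPole-v1")
obs_dim, n_actions = env.observation_space.shape[0], env.action_space.n
theta = np.zeros((obs_dim, n_actions))     # linear softmax policy parameters
alpha, gamma = 0.01, 0.99
rng = np.random.default_rng(0)

def policy(s):
    logits = s @ theta
    p = np.exp(logits - logits.max())      # stable softmax
    return p / p.sum()

for episode in range(500):
    s, _ = env.reset()
    states, actions, rewards, done = [], [], [], False
    while not done:
        a = int(rng.choice(n_actions, p=policy(s)))
        s_next, r, terminated, truncated, _ = env.step(a)
        states.append(s); actions.append(a); rewards.append(r)
        s, done = s_next, terminated or truncated

    # Monte-Carlo returns G_t = sum_{k >= t} gamma^(k-t) r_k
    G, returns = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()

    # REINFORCE update: theta += alpha * G_t * grad log pi(a_t | s_t)
    for s_t, a_t, G_t in zip(states, actions, returns):
        probs = policy(s_t)
        grad_log = -np.outer(s_t, probs)   # d log softmax / d theta over all actions
        grad_log[:, a_t] += s_t
        theta += alpha * G_t * grad_log
```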
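
The Proximal Policy Optimization (PPO) row refers to the clipped surrogate objective E[min(r_t(θ) Â_t, clip(r_t(θ), 1−ε, 1+ε) Â_t)], where r_t(θ) = π_θ(a_t | s_t) / π_θold(a_t | s_t). A minimal NumPy sketch of just that objective (the function name and toy inputs are illustrative):

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped surrogate objective from PPO, written as a loss to be minimized."""
    ratio = np.exp(logp_new - logp_old)                 # pi_new(a|s) / pi_old(a|s)
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # pessimistic (min) bound: clipping removes the incentive to push the ratio
    # outside [1 - eps, 1 + eps] when doing so would increase the objective
    return -np.mean(np.minimum(ratio * advantages, clipped * advantages))

# toy check: a large positive-advantage ratio gets clipped, a small one does not
print(ppo_clip_loss(np.log([2.0, 1.05]), np.log([1.0, 1.0]), np.array([1.0, 1.0])))
```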
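
The Experience Replay Buffer row stores (s, a, r, s′, done) tuples and samples them uniformly for off-policy updates. A minimal sketch (class name and capacity are illustrative):

```python
import random
from collections import deque, namedtuple

Transition = namedtuple("Transition", "state action reward next_state done")

class ReplayBuffer:
    """Fixed-size FIFO buffer of (s, a, r, s', done) tuples with uniform sampling."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are evicted first

    def push(self, *args):
        self.buffer.append(Transition(*args))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        return Transition(*zip(*batch))        # struct-of-arrays style mini-batch

    def __len__(self):
        return len(self.buffer)

buf = ReplayBuffer(capacity=1000)
buf.push([0.0], 1, 1.0, [0.1], False)          # example transition
```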
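
The Softmax / Boltzmann Exploration and Soft Q Boltzmann Probabilities rows both use π(a | s) ∝ exp(Q(s,a)/τ). A minimal sketch showing how the temperature τ interpolates between greedy and near-uniform action selection (the toy Q-values are illustrative):

```python
import numpy as np

def boltzmann_probs(q_values, tau=1.0):
    """pi(a|s) proportional to exp(Q(s,a)/tau); small tau is near-greedy, large tau near-uniform."""
    z = (q_values - np.max(q_values)) / tau    # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

q = np.array([1.0, 2.0, 0.5])
for tau in (0.1, 1.0, 10.0):
    print(tau, boltzmann_probs(q, tau).round(3))
```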