---
title: Reinforcement Learning Graphical Representations
date: 2026-04-08
category: Reinforcement Learning
description: A comprehensive gallery of 230 standard RL components and their graphical representations.
---

# Reinforcement Learning Graphical Representations

This repository contains a full set of 230 visualizations covering foundational concepts, algorithms, and advanced topics in Reinforcement Learning.
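The first entry below, the agent-environment interaction loop, can be sketched in a few lines of code. This is an illustrative sketch only, not code from this gallery: the tiny `RandomWalkEnv` class and its `reset`/`step` interface are assumptions modeled loosely on the common Gymnasium-style API.

```python
import random

class RandomWalkEnv:
    """A tiny 5-state random-walk MDP, used only to illustrate the loop."""
    def __init__(self, n_states=5):
        self.n_states = n_states
    def reset(self):
        # Start in the middle of the chain.
        self.state = self.n_states // 2
        return self.state
    def step(self, action):  # action: 0 = left, 1 = right
        self.state += 1 if action == 1 else -1
        done = self.state <= 0 or self.state >= self.n_states - 1
        reward = 1.0 if self.state >= self.n_states - 1 else 0.0
        return self.state, reward, done

def run_episode(env, policy):
    """Core RL cycle: observe state -> select action -> transition -> receive reward."""
    trajectory = []
    state, done = env.reset(), False
    while not done:
        action = policy(state)                       # agent selects an action
        next_state, reward, done = env.step(action)  # environment responds
        trajectory.append((state, action, reward, next_state))
        state = next_state                           # next state becomes current
    return trajectory

random.seed(0)
env = RandomWalkEnv()
traj = run_episode(env, policy=lambda s: random.randint(0, 1))
```

Every transition tuple collected here, `(s, a, r, s′)`, is exactly the trajectory/episode sequence listed in the table below.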
| Category | Component | Illustration | Details | Context |
|----------|-----------|--------------|---------|---------|
| **MDP & Environment** | **Agent-Environment Interaction Loop** |  | Core cycle: observation of state → selection of action → environment transition → receipt of reward + next state | All RL algorithms |
| **MDP & Environment** | **Markov Decision Process (MDP) Tuple** |  | (S, A, P, R, γ) with transition dynamics P(s′\|s,a) and reward function R(s,a,s′) |  |
| **MDP & Environment** | **State Transition Graph** |  | Full probabilistic transitions between discrete states | Gridworld, Taxi, Cliff Walking |
| **MDP & Environment** | **Trajectory / Episode Sequence** |  | Sequence of (s₀, a₀, r₁, s₁, …, s_T) | Monte Carlo, episodic tasks |
| **MDP & Environment** | **Continuous State/Action Space Visualization** |  | High-dimensional spaces (e.g., robot joints, pixel inputs) | Continuous-control tasks (MuJoCo, PyBullet) |
| **MDP & Environment** | **Reward Function / Landscape** |  | Scalar reward as function of state/action | All algorithms; especially reward shaping |
| **MDP & Environment** | **Discount Factor (γ) Effect** |  | How future rewards are weighted | All discounted MDPs |
| **Value & Policy** | **State-Value Function V(s)** |  | Expected return from state s under policy π | Value-based methods |
| **Value & Policy** | **Action-Value Function Q(s,a)** |  | Expected return from state-action pair | Q-learning family |
| **Value & Policy** | **Policy π(s) or π(a\|s)** |  | Arrow overlays on grid (optimal policy), probability bar charts, or softmax heatmaps |  |
| **Value & Policy** | **Advantage Function A(s,a)** |  | Q(s,a) – V(s) | A2C, PPO, SAC, TD3 |
| **Value & Policy** | **Optimal Value Function V\* / Q\*** |  | Solution to Bellman optimality | Value iteration, Q-learning |
| **Dynamic Programming** | **Policy Evaluation Backup** |  | Iterative update of V using Bellman expectation | Policy iteration |
| **Dynamic Programming** | **Policy Improvement** |  | Greedy policy update over Q | Policy iteration |
| **Dynamic Programming** | **Value Iteration Backup** |  | Update using Bellman optimality | Value iteration |
| **Dynamic Programming** | **Policy Iteration Full Cycle** |  | Evaluation → Improvement loop | Classic DP methods |
| **Monte Carlo** | **Monte Carlo Backup** |  | Update using full episode return G_t | First-visit / every-visit MC |
| **Monte Carlo** | **Monte Carlo Tree Search (MCTS)** |  | Search tree with selection, expansion, simulation, backprop | AlphaGo, AlphaZero |
| **Monte Carlo** | **Importance Sampling Ratio** |  | Off-policy correction ρ = π(a\|s) / b(a\|s) |  |
| **Temporal Difference** | **TD(0) Backup** |  | Bootstrapped update using R + γV(s′) | TD learning |
| **Temporal Difference** | **Bootstrapping (general)** |  | Using estimated future value instead of full return | All TD methods |
| **Temporal Difference** | **n-step TD Backup** |  | Multi-step return G_t^{(n)} | n-step TD, TD(λ) |
| **Temporal Difference** | **TD(λ) & Eligibility Traces** |  | Decaying trace z_t for credit assignment | TD(λ), SARSA(λ), Q(λ) |
| **Temporal Difference** | **SARSA Update** |  | On-policy TD control | SARSA |
| **Temporal Difference** | **Q-Learning Update** |  | Off-policy TD control | Q-learning, Deep Q-Network |
| **Temporal Difference** | **Expected SARSA** |  | Expectation over next action under policy | Expected SARSA |
| **Temporal Difference** | **Double Q-Learning / Double DQN** |  | Two separate Q estimators to reduce overestimation | Double DQN, TD3 |
| **Temporal Difference** | **Dueling DQN Architecture** |  | Separate streams for state value V(s) and advantage A(s,a) | Dueling DQN |
| **Temporal Difference** | **Prioritized Experience Replay** |  | Importance sampling of transitions by TD error | Prioritized DQN, Rainbow |
| **Temporal Difference** | **Rainbow DQN Components** |  | All extensions combined (Double, Dueling, PER, etc.) | Rainbow DQN |
| **Function Approximation** | **Linear Function Approximation** |  | Feature vector φ(s) → wᵀφ(s) | Tabular → linear FA |
| **Function Approximation** | **Neural Network Layers (MLP, CNN, RNN, Transformer)** |  | Full deep network for value/policy | DQN, A3C, PPO, Decision Transformer |
| **Function Approximation** | **Computation Graph / Backpropagation Flow** |  | Gradient flow through network | All deep RL |
| **Function Approximation** | **Target Network** |  | Frozen copy of Q-network for stability | DQN, DDQN, SAC, TD3 |
| **Policy Gradients** | **Policy Gradient Theorem** |  | ∇_θ J(θ) = E[∇_θ log π(a\|s) Q^π(s,a)] | Flow diagram from reward → log-prob → gradient |
| **Policy Gradients** | **REINFORCE Update** |  | Monte-Carlo policy gradient | REINFORCE |
| **Policy Gradients** | **Baseline / Advantage Subtraction** |  | Subtract b(s) to reduce variance | All modern PG |
| **Policy Gradients** | **Trust Region (TRPO)** |  | KL-divergence constraint on policy update | TRPO |
| **Policy Gradients** | **Proximal Policy Optimization (PPO)** |  | Clipped surrogate objective | PPO, PPO-Clip |
| **Actor-Critic** | **Actor-Critic Architecture** |  | Separate or shared actor (policy) + critic (value) networks | A2C, A3C, SAC, TD3 |
| **Actor-Critic** | **Advantage Actor-Critic (A2C/A3C)** |  | Synchronous/asynchronous multi-worker | A2C/A3C |
| **Actor-Critic** | **Soft Actor-Critic (SAC)** |  | Entropy-regularized policy + twin critics | SAC |
| **Actor-Critic** | **Twin Delayed DDPG (TD3)** |  | Twin critics + delayed policy + target smoothing | TD3 |
| **Exploration** | **ε-Greedy Strategy** |  | Probability ε of random action | DQN family |
| **Exploration** | **Softmax / Boltzmann Exploration** |  | Temperature τ in softmax | Softmax policies |
| **Exploration** | **Upper Confidence Bound (UCB)** |  | Optimism in face of uncertainty | UCB1, bandits |
| **Exploration** | **Intrinsic Motivation / Curiosity** |  | Prediction error as intrinsic reward | ICM, RND, Curiosity-driven RL |
| **Exploration** | **Entropy Regularization** |  | Bonus term αH(π) | SAC, maximum-entropy RL |
| **Hierarchical RL** | **Options Framework** |  | High-level policy over options (temporally extended actions) | Option-Critic |
| **Hierarchical RL** | **Feudal Networks / Hierarchical Actor-Critic** |  | Manager-worker hierarchy | Feudal RL |
| **Hierarchical RL** | **Skill Discovery** |  | Unsupervised emergence of reusable skills | DIAYN, VALOR |
| **Model-Based RL** | **Learned Dynamics Model** |  | Learned transition model P̂(s′\|s,a) | Separate model network diagram (often RNN or transformer) |
| **Model-Based RL** | **Model-Based Planning** |  | Rollouts inside learned model | MuZero, DreamerV3 |
| **Model-Based RL** | **Imagination-Augmented Agents (I2A)** |  | Imagination module + policy | I2A |
| **Offline RL** | **Offline Dataset** |  | Fixed batch of trajectories | BC, CQL, IQL |
| **Offline RL** | **Conservative Q-Learning (CQL)** |  | Penalty on out-of-distribution actions | CQL |
| **Multi-Agent RL** | **Multi-Agent Interaction Graph** |  | Agents communicating or competing | MARL, MADDPG |
| **Multi-Agent RL** | **Centralized Training Decentralized Execution (CTDE)** |  | Shared critic during training | QMIX, VDN, MADDPG |
| **Multi-Agent RL** | **Cooperative / Competitive Payoff Matrix** |  | Joint reward for multiple agents | Prisoner's Dilemma, multi-agent gridworlds |
| **Inverse RL / IRL** | **Reward Inference** |  | Infer reward from expert demonstrations | IRL, GAIL |
| **Inverse RL / IRL** | **Generative Adversarial Imitation Learning (GAIL)** |  | Discriminator vs. policy generator | GAIL, AIRL |
| **Meta-RL** | **Meta-RL Architecture** |  | Outer loop (meta-policy) + inner loop (task adaptation) | MAML for RL, RL² |
| **Meta-RL** | **Task Distribution Visualization** |  | Multiple MDPs sampled from meta-distribution | Meta-RL benchmarks |
| **Advanced / Misc** | **Experience Replay Buffer** |  | Stored (s,a,r,s′,done) tuples | DQN and all off-policy deep RL |
| **Advanced / Misc** | **State Visitation / Occupancy Measure** |  | Frequency of visiting each state | All algorithms (analysis) |
| **Advanced / Misc** | **Learning Curve** |  | Average episodic return vs. episodes / steps | Standard performance reporting |
| **Advanced / Misc** | **Regret / Cumulative Regret** |  | Sub-optimality accumulated | Bandits and online RL |
| **Advanced / Misc** | **Attention Mechanisms (Transformers in RL)** |  | Attention weights | Decision Transformer, Trajectory Transformer |
| **Advanced / Misc** | **Diffusion Policy** |  | Denoising diffusion process for action generation | Diffusion-RL policies |
| **Advanced / Misc** | **Graph Neural Networks for RL** |  | Node/edge message passing | Graph RL, relational RL |
| **Advanced / Misc** | **World Model / Latent Space** |  | Encoder-decoder dynamics in latent space | Dreamer, PlaNet |
| **Advanced / Misc** | **Convergence Analysis Plots** |  | Error / value change over iterations | DP, TD, value iteration |
| **Advanced / Misc** | **RL Algorithm Taxonomy** |  | Comprehensive classification of algorithms | All RL |
| **Advanced / Misc** | **Probabilistic Graphical Model (RL as Inference)** |  | Formalizing RL as probabilistic inference | Control as Inference, MaxEnt RL |
| **Value & Policy** | **Distributional RL (C51 / Categorical)** |  | Representing return as a probability distribution | C51, QR-DQN, IQN |
| **Exploration** | **Hindsight Experience Replay (HER)** |  | Learning from failures by relabeling goals | Sparse reward robotics, HER |
| **Model-Based RL** | **Dyna-Q Architecture** |  | Integration of real experience and model-based planning | Dyna-Q, Dyna-2 |
| **Function Approximation** | **Noisy Networks (Parameter Noise)** |  | Stochastic weights for exploration | Noisy DQN, Rainbow |
| **Exploration** | **Intrinsic Curiosity Module (ICM)** |  | Reward based on prediction error | Curiosity-driven exploration, ICM |
| **Temporal Difference** | **V-trace (IMPALA)** |  | Asynchronous off-policy importance sampling | IMPALA, V-trace |
| **Multi-Agent RL** | **QMIX Mixing Network** |  | Monotonic value function factorization | QMIX, VDN |
| **Advanced / Misc** | **Saliency Maps / Attention on State** |  | Visualizing what the agent "sees" or prioritizes | Interpretability, Atari RL |
| **Exploration** | **Action Selection Noise (OU vs Gaussian)** |  | Temporal correlation in exploration noise | DDPG, TD3 |
| **Advanced / Misc** | **t-SNE / UMAP State Embeddings** |  | Dimension reduction of high-dim neural states | Interpretability, SRL |
| **Advanced / Misc** | **Loss Landscape Visualization** |  | Optimization surface geometry | Training stability analysis |
| **Advanced / Misc** | **Success Rate vs Steps** |  | Percentage of successful episodes | Goal-conditioned RL, Robotics |
| **Advanced / Misc** | **Hyperparameter Sensitivity Heatmap** |  | Performance across parameter grids | Hyperparameter tuning |
| **Dynamics** | **Action Persistence (Frame Skipping)** |  | Temporal abstraction by repeating actions | Atari RL, Robotics |
| **Model-Based RL** | **MuZero Dynamics Search Tree** |  | Planning with learned transition and value functions | MuZero, Gumbel MuZero |
| **Deep RL** | **Policy Distillation** |  | Compressing knowledge from teacher to student | Kickstarting, multitask learning |
| **Transformers** | **Decision Transformer Token Sequence** |  | Casting RL as a sequence-modeling task over (return, state, action) tokens | Decision Transformer, TT |
| **Advanced / Misc** | **Performance Profiles (rliable)** |  | Robust aggregate performance metrics | Reliable RL evaluation |
| **Safety RL** | **Safety Shielding / Barrier Functions** |  | Hard constraints on the action space | Constrained MDPs, Safe RL |
| **Training** | **Automated Curriculum Learning** |  | Progressively increasing task difficulty | Curriculum RL, ALP-GMM |
| **Sim-to-Real** | **Domain Randomization** |  | Generalizing across environment variations | Robotics, Sim-to-Real |
| **Alignment** | **RL with Human Feedback (RLHF)** |  | Aligning agents with human preferences | ChatGPT, InstructGPT |
| **Neuro-inspired RL** | **Successor Representation (SR)** |  | Predictive state representations | SR-Dyna, Neuro-RL |
| **Inverse RL / IRL** | **Maximum Entropy IRL** |  | Probability distribution over trajectories | MaxEnt IRL, Ziebart |
| **Theory** | **Information Bottleneck** |  | Mutual information $I(S;Z)$ and $I(Z;A)$ balance | VIB-RL, Information Theory |
| **Evolutionary RL** | **Evolutionary Strategies Population** |  | Population-based parameter search | OpenAI-ES, Salimans |
| **Safety RL** | **Control Barrier Functions (CBF)** |  | Set-theoretic safety guarantees | CBF-RL, Control Theory |
| **Exploration** | **Count-based Exploration Heatmap** |  | Visitation frequency and intrinsic bonus | MBIE-EB, RND |
| **Exploration** | **Thompson Sampling Posteriors** |  | Direct uncertainty-based action selection | Bandits, Bayesian RL |
| **Multi-Agent RL** | **Adversarial RL Interaction** |  | Competition between protagonist and antagonist | Robust RL, RARL |
| **Hierarchical RL** | **Hierarchical Subgoal Trajectory** |  | Decomposing long-horizon tasks | Subgoal RL, HIRO |
| **Offline RL** | **Offline Action Distribution Shift** |  | Mismatch between dataset and current policy | CQL, IQL, D4RL |
| **Exploration** | **Random Network Distillation (RND)** |  | Prediction error as intrinsic reward | RND, OpenAI |
| **Offline RL** | **Batch-Constrained Q-learning (BCQ)** |  | Constraining actions to behavior dataset | BCQ, Fujimoto |
| **Training** | **Population-Based Training (PBT)** |  | Evolutionary hyperparameter optimization | PBT, DeepMind |
| **Deep RL** | **Recurrent State Flow (DRQN/R2D2)** |  | Temporal dependency in state-action value | DRQN, R2D2 |
| **Theory** | **Belief State in POMDPs** |  | Probability distribution over hidden states | POMDPs, Belief Space |
| **Multi-Objective RL** | **Multi-Objective Pareto Front** |  | Balancing conflicting reward signals | MORL, Pareto Optimal |
| **Theory** | **Differential Value (Average Reward RL)** |  | Values relative to average gain | Average Reward RL, Mahadevan |
| **Infrastructure** | **Distributed RL Cluster (Ray/RLlib)** |  | Parallelizing experience collection | Ray, RLlib, Ape-X |
| **Evolutionary RL** | **Neuroevolution Topology Evolution** |  | Evolving neural network architectures | NEAT, HyperNEAT |
| **Continual RL** | **Elastic Weight Consolidation (EWC)** |  | Preventing catastrophic forgetting | EWC, Kirkpatrick |
| **Theory** | **Successor Features (SF)** |  | Generalizing predictive representations | SF-Dyna, Barreto |
| **Safety** | **Adversarial State Noise (Perception)** |  | Attacks on agent observation space | Adversarial RL, Huang |
| **Imitation Learning** | **Behavioral Cloning (Imitation)** |  | Direct supervised learning from experts | BC, DAgger |
| **Relational RL** | **Relational Graph State Representation** |  | Modeling objects and their relations | Relational MDPs, BoxWorld |
| **Quantum RL** | **Quantum RL Circuit (PQC)** |  | Gate-based quantum policy networks | Quantum RL, PQC |
| **Symbolic RL** | **Symbolic Policy Tree** |  | Policies as mathematical expressions | Symbolic RL, GP |
| **Control** | **Differentiable Physics Gradient Flow** |  | Gradient-based planning through simulators | Brax, Isaac Gym |
| **Multi-Agent RL** | **MARL Communication Channel** |  | Information exchange between agents | CommNet, DIAL |
| **Safety** | **Lagrangian Constraint Landscape** |  | Constrained optimization boundaries | Constrained RL, CPO |
| **Hierarchical RL** | **MAXQ Task Hierarchy** |  | Recursive task decomposition | MAXQ, Dietterich |
| **Agentic AI** | **ReAct Agentic Cycle** |  | Reasoning-Action loops for LLMs | ReAct, Agentic LLM |
| **Bio-inspired RL** | **Synaptic Plasticity RL** |  | Hebbian-style synaptic weight updates | Hebbian RL, STDP |
| **Control** | **Guided Policy Search (GPS)** |  | Distilling trajectories into a policy | GPS, Levine |
| **Robotics** | **Sim-to-Real Jitter & Latency** |  | Temporal robustness in transfer | Sim-to-Real, Robustness |
| **Policy Gradients** | **Deterministic Policy Gradient (DDPG) Flow** |  | Gradient flow for deterministic policies | DDPG |
| **Model-Based RL** | **Dreamer Latent Imagination** |  | Learning and planning in latent space | Dreamer (V1-V3) |
| **Deep RL** | **UNREAL Auxiliary Tasks** |  | Learning from non-reward signals | UNREAL, A3C extension |
| **Offline RL** | **Implicit Q-Learning (IQL) Expectile** |  | In-sample learning via expectile regression | IQL |
| **Model-Based RL** | **Prioritized Sweeping** |  | Planning prioritized by TD error | Sutton & Barto classic MBRL |
| **Imitation Learning** | **DAgger Expert Loop** |  | Training on expert labels in agent-visited states | DAgger |
| **Representation** | **Self-Predictive Representations (SPR)** |  | Consistency between predicted and target latents | SPR, sample-efficient RL |
| **Multi-Agent RL** | **Joint Action Space** |  | Cartesian product of individual actions | MARL theory, Game Theory |
| **Multi-Agent RL** | **Dec-POMDP Formal Model** |  | Decentralized partially observable MDP | Multi-agent coordination |
| **Theory** | **Bisimulation Metric** |  | State equivalence based on transitions/rewards | State abstraction, bisimulation theory |
| **Theory** | **Potential-Based Reward Shaping** |  | Reward transformation preserving optimal policy | Sutton & Barto, Ng et al. |
| **Training** | **Transfer RL: Source to Target** |  | Reusing knowledge across different MDPs | Transfer Learning, Distillation |
| **Deep RL** | **Multi-Task Backbone Arch** |  | Single agent learning multiple tasks | Multi-task RL, IMPALA |
| **Bandits** | **Contextual Bandit Pipeline** |  | Decision making given context but no transitions | Personalization, Ad-tech |
| **Theory** | **Theoretical Regret Bounds** |  | Analytical performance guarantees | Online Learning, Bandits |
| **Value-based** | **Soft Q Boltzmann Probabilities** |  | Probabilistic action selection from Q-values: $\pi(a \mid s) \propto \exp(Q(s,a)/\tau)$ |  |
| **Robotics** | **Autonomous Driving RL Pipeline** |  | End-to-end or modular driving stack | Wayve, Tesla, Comma.ai |
| **Policy** | **Policy Action Gradient Comparison** |  | Comparison of gradient derivation types | PG Theorem vs DPG Theorem |
| **Inverse RL / IRL** | **IRL: Feature Expectation Matching** |  | Comparing expert vs learner feature visitation frequency | $\mu(\pi^*) - \mu(\pi)$ |
| **Imitation Learning** | **Apprenticeship Learning Loop** |  | Training to match expert performance via reward inference | Apprenticeship Learning |
| **Theory** | **Active Inference Loop** |  | Agents minimizing surprise (free energy) | Free Energy Principle, Friston |
| **Theory** | **Bellman Residual Landscape** |  | Training surface of the Bellman error | TD learning, fitted Q-iteration |
| **Model-Based RL** | **Plan-to-Explore Uncertainty Map** |  | Systematic exploration in learned world models | Plan-to-Explore, Sekar et al. |
| **Safety RL** | **Robust RL Uncertainty Set** |  | Optimizing for the worst-case environment transition | Robust MDPs, minimax RL |
| **Training** | **HPO Bayesian Opt Cycle** |  | Automating hyperparameter selection with GP | Hyperparameter Optimization |
| **Applied RL** | **Slate RL Recommendation** |  | Optimizing list/slate of items for users | Recommender Systems, Ie et al. |
| **Multi-Agent RL** | **Fictitious Play Interaction** |  | Belief-based learning in games | Game Theory, Brown (1951) |
| **Conceptual** | **Universal RL Framework Diagram** |  | High-level summary of RL components | All RL |
| **Offline RL** | **Offline Density Ratio Estimator** |  | Estimating $w(s,a)$ for off-policy data | Importance Sampling, Offline RL |
| **Continual RL** | **Continual Task Interference Heatmap** |  | Measuring negative transfer between tasks | Lifelong Learning, EWC |
| **Safety RL** | **Lyapunov Stability Safe Set** |  | Invariant sets for safe control | Lyapunov RL, Chow et al. |
| **Applied RL** | **Molecular RL (Atom Coordinates)** |  | RL for molecular design/protein folding | Chemistry RL, AlphaFold-style |
| **Architecture** | **MoE Multi-task Architecture** |  | Scaling models with mixture of experts | MoE-RL, Sparsity |
| **Direct Policy Search** | **CMA-ES Policy Search** |  | Evolutionary strategy for policy weights | ES for RL, Salimans |
| **Alignment** | **Elo Rating Preference Plot** |  | Measuring agent strength over time | AlphaZero, League training |
| **Explainable RL** | **Explainable RL (SHAP Attribution)** |  | Local attribution of features to agent actions | Interpretability, SHAP/LIME |
| **Meta-RL** | **PEARL Context Encoder** |  | Learning latent task representations | PEARL, Rakelly et al. |
| **Applied RL** | **Medical RL Therapy Pipeline** |  | Personalized medicine and dosing | Healthcare RL, ICU Sepsis |
| **Applied RL** | **Supply Chain RL Pipeline** |  | Optimizing stock levels and orders | Logistics, Inventory Management |
| **Robotics** | **Sim-to-Real SysID Loop** |  | Closing the reality gap via parameter estimation | System Identification, Robotics |
| **Architecture** | **Transformer World Model** |  | Sequence-to-sequence dynamics modeling | DreamerV3, Transframer |
| **Applied RL** | **Network Traffic RL** |  | Optimizing data packet routing in graphs | Networking, Traffic Engineering |
| **Training** | **RLHF: PPO with Reference Policy** |  | Ensuring RL fine-tuning doesn't drift too far | InstructGPT, Llama 2/3 |
| **Multi-Agent RL** | **PSRO Meta-Game Update** |  | Reaching Nash equilibrium in large games | PSRO, Lanctot et al. |
| **Multi-Agent RL** | **DIAL: Differentiable Comm** |  | End-to-end learning of communication protocols | DIAL, Foerster et al. |
| **Batch RL** | **Fitted Q-Iteration Loop** |  | Data-driven iteration with a supervised regressor | Ernst et al. (2005) |
| **Safety RL** | **CMDP Feasible Region** |  | Constrained optimization within a safety budget | Constrained MDPs, Altman |
| **Control** | **MPC vs RL Planning** |  | Comparison of control paradigms | Control Theory vs RL |
| **AutoML** | **Learning to Optimize (L2O)** |  | Using RL to learn an optimization update rule | L2O, Li & Malik |
| **Applied RL** | **Smart Grid RL Management** |  | Optimizing energy supply and demand | Energy RL, Smart Grids |
| **Applied RL** | **Quantum State Tomography RL** |  | RL for quantum state estimation | Quantum RL, Neural Tomography |
| **Applied RL** | **RL for Chip Placement** |  | Placing components on silicon grids | Google Chip Placement |
| **Applied RL** | **RL Compiler Optimization (MLGO)** |  | Inlining and sizing in compilers | MLGO, LLVM |
| **Applied RL** | **RL for Theorem Proving** |  | Automated reasoning and proof search | LeanRL, AlphaProof |
| **Modern RL** | **Diffusion-QL Offline RL** |  | Policy as reverse diffusion process; denoising steps $\pi_\theta(a \mid s, k)$ with noise injection |  |
| **Principles** | **Fairness-reward Pareto Frontier** |  | Balancing equity and returns | Fair RL, Jabbari et al. |
| **Principles** | **Differentially Private RL** |  | Privacy-preserving training | DP-RL, Agarwal et al. |
| **Applied RL** | **Smart Agriculture RL** |  | Optimizing crop yield and resources | Precision Agriculture |
| **Applied RL** | **Climate Mitigation RL (Grid)** |  | Environmental control policies | ClimateRL, Carbon Control |
| **Applied RL** | **AI Education (Knowledge Tracing)** |  | Personalized learning paths | ITS, Bayesian Knowledge Tracing |
| **Modern RL** | **Decision SDE Flow** |  | RL in continuous stochastic systems | Neural SDEs, Control |
| **Control** | **Differentiable Physics (Brax)** |  | Gradients through simulators | Brax, PhysX, MuJoCo |
| **Applied RL** | **Wireless Beamforming RL** |  | Optimizing antenna signal directions | 5G/6G Networking |
| **Applied RL** | **Quantum Error Correction RL** |  | Correcting noise in quantum circuits | Quantum Computing RL |
| **Multi-Agent RL** | **Mean Field RL Interaction** |  | Large population agent dynamics | MF-RL, Yang et al. |
| **HRL** | **Goal-GAN Curriculum** |  | Automatic goal generation | Goal-GAN, Florensa et al. |
| **Modern RL** | **JEPA: Predictive Architecture** |  | LeCun's world model framework | JEPA, I-JEPA |
| **Offline RL** | **CQL Value Penalty Landscape** |  | Conservatism in value functions | CQL, Kumar et al. |
| **Applied RL** | **Causal Inverse RL Graph** |  | DAG with $S, A, R$ and latent confounder $U$ | Causal RL |
| **Quantum RL** | **VQE-RL Optimization** |  | Quantum circuit param tuning | VQE, Quantum RL |
| **Applied RL** | **De-novo Drug Discovery RL** |  | Generating optimized lead molecules | Drug Discovery, Molecule RL |
| **Applied RL** | **Traffic Signal Coordination RL** |  | Multi-intersection coordination | IntelliLight, PressLight |
| **Applied RL** | **Mars Rover Pathfinding RL** |  | Navigation on rough terrain | Space RL, Mars Rover |
| **Applied RL** | **Sports Player Movement RL** |  | Predicting/Optimizing player actions | Sports Analytics, Ghosting |
| **Applied RL** | **Cryptography Attack RL** |  | Searching for keys/vulnerabilities | Crypto-RL, Learning to Attack |
| **Applied RL** | **Humanitarian Resource RL** |  | Disaster response allocation | AI for Good, Resource RL |
| **Applied RL** | **Video Compression RL (RD)** |  | Optimizing bit-rate vs distortion | Learned Video Compression |
| **Applied RL** | **Kubernetes Auto-scaling RL** |  | Cloud resource management | Cloud RL, K8s Scaling |
| **Applied RL** | **Fluid Dynamics Flow Control RL** |  | Airfoil/Turbulence control | Aero-RL, Flow Control |
| **Applied RL** | **Structural Optimization RL** |  | Topology/Material design | Structural RL, Topology Opt |
| **Applied RL** | **Human Decision Modeling** |  | Prospect Theory in RL | Behavioral RL, Prospect Theory |
| **Applied RL** | **Semantic Parsing RL** |  | Language to Logic transformation | Semantic Parsing, Seq2Seq-RL |
| **Applied RL** | **Music Melody RL** |  | Reward-based melody generation | Music-RL, Magenta |
| **Applied RL** | **Plasma Fusion Control RL** |  | Magnetic control of Tokamaks | DeepMind Fusion, Tokamak RL |
| **Applied RL** | **Carbon Capture RL Cycle** |  | Adsorption/Desorption optimization | Carbon Capture, Green RL |
| **Applied RL** | **Swarm Robotics RL** |  | Decentralized swarm coordination | Swarm-RL, Multi-Robot |
| **Applied RL** | **Legal Compliance RL Game** |  | Regulatory games | Legal-RL, RegTech |
| **Physics RL** | **Physics-Informed RL (PINN)** |  | Constraint-based RL loss | PINN-RL, SciML |
| **Modern RL** | **Neuro-Symbolic RL** |  | Combining logic and neural nets | Neuro-Symbolic, Logic RL |
| **Applied RL** | **DeFi Liquidity Pool RL** |  | Yield farming/Liquidity balancing | DeFi-RL, AMM Optimization |
| **Neuro RL** | **Dopamine Reward Prediction Error** |  | Biological RL signal curves | Neuroscience-RL, Schultz |
| **Robotics** | **Proprioceptive Sensory-Motor RL** |  | Low-level joint control | Proprioceptive RL, Unitree |
| **Applied RL** | **AR Object Placement RL** |  | AR visual overlay optimization | AR-RL, Visual Overlay |
| **Reco RL** | **Sequential Bundle RL** |  | Recommendation item grouping | Bundle-RL, E-commerce |
| **Theoretical** | **Online Gradient Descent vs RL** |  | Gradient-based learning comparison | Online Learning, Regret |
| **Modern RL** | **Active Learning: Query RL** |  | Query-based sample selection | Active-RL, Query Opt |
| **Modern RL** | **Federated RL Global Aggregator** |  | Privacy-preserving distributed RL | Federated-RL, FedAvg-RL |
| **Conceptual** | **Ultimate Universal RL Mastery Diagram** |  | Final summary of 230 items | Absolute Mastery Milestone |
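To make the "Q-Learning Update" entry above concrete, here is a minimal tabular sketch of the off-policy TD control rule Q(s,a) ← Q(s,a) + α(r + γ max_a′ Q(s′,a′) − Q(s,a)). The 5-state chain environment, constants, and helper names are illustrative assumptions, not artifacts from this gallery.

```python
import random
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9, n_actions=2):
    """One Q-learning step: bootstrap on the greedy value of the next state."""
    td_target = r + gamma * max(Q[(s_next, a2)] for a2 in range(n_actions))
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])

# Toy 5-state chain {0..4}: states 0 and 4 are terminal;
# stepping right from state 3 into state 4 pays reward 1.
random.seed(0)
Q = defaultdict(float)
for _ in range(2000):
    s = random.randint(1, 3)            # sample a non-terminal state
    a = random.randint(0, 1)            # 0 = left, 1 = right (pure exploration)
    s_next = s + (1 if a == 1 else -1)
    r = 1.0 if s_next == 4 else 0.0
    if s_next in (0, 4):                # terminal: no bootstrapped future value
        Q[(s, a)] += 0.1 * (r - Q[(s, a)])
    else:
        q_learning_update(Q, s, a, r, s_next)

print(Q[(3, 1)] > Q[(3, 0)])  # prints True: right at state 3 beats left
```

Because the max over next actions is taken regardless of the (here purely random) behavior policy, this is off-policy: the learned Q approaches the optimal values even though no greedy policy was ever executed.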