thibaud frere committed on
Commit
52bc805
·
1 Parent(s): 59924a2

update article

Browse files
app/scripts/latex-to-markdown/index.mjs CHANGED
@@ -2,6 +2,7 @@
2
 
3
  import { join, dirname } from 'path';
4
  import { fileURLToPath } from 'url';
 
5
  import { convertLatexToMarkdown } from './latex-converter.mjs';
6
  import { convertToMdx } from './mdx-converter.mjs';
7
  import { cleanBibliography } from './bib-cleaner.mjs';
@@ -12,6 +13,7 @@ const __dirname = dirname(__filename);
12
  // Default configuration
13
  const DEFAULT_INPUT = join(__dirname, 'input', 'main.tex');
14
  const DEFAULT_OUTPUT = join(__dirname, 'output');
 
15
 
16
  function parseArgs() {
17
  const args = process.argv.slice(2);
@@ -110,6 +112,15 @@ function main() {
110
 
111
  console.log('📝 Converting Markdown to MDX...');
112
  convertToMdx(markdownFile, mdxFile);
 
 
 
 
 
 
 
 
 
113
  }
114
 
115
  } catch (error) {
 
2
 
3
  import { join, dirname } from 'path';
4
  import { fileURLToPath } from 'url';
5
+ import { copyFileSync } from 'fs';
6
  import { convertLatexToMarkdown } from './latex-converter.mjs';
7
  import { convertToMdx } from './mdx-converter.mjs';
8
  import { cleanBibliography } from './bib-cleaner.mjs';
 
13
  // Default configuration
14
  const DEFAULT_INPUT = join(__dirname, 'input', 'main.tex');
15
  const DEFAULT_OUTPUT = join(__dirname, 'output');
16
+ const ASTRO_CONTENT_PATH = join(__dirname, '..', '..', 'src', 'content', 'article.mdx');
17
 
18
  function parseArgs() {
19
  const args = process.argv.slice(2);
 
112
 
113
  console.log('📝 Converting Markdown to MDX...');
114
  convertToMdx(markdownFile, mdxFile);
115
+
116
+ // Copy MDX to Astro content directory
117
+ console.log('📋 Copying MDX to Astro content directory...');
118
+ try {
119
+ copyFileSync(mdxFile, ASTRO_CONTENT_PATH);
120
+ console.log(` ✅ Copied to ${ASTRO_CONTENT_PATH}`);
121
+ } catch (error) {
122
+ console.warn(` ⚠️ Failed to copy MDX to Astro: ${error.message}`);
123
+ }
124
  }
125
 
126
  } catch (error) {
app/scripts/latex-to-markdown/mdx-converter.mjs CHANGED
@@ -356,6 +356,47 @@ date: "${new Date().toISOString().split('T')[0]}"
356
  return content;
357
  }
358
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
359
  /**
360
  * Clean up MDX-incompatible syntax
361
  * @param {string} content - MDX content
@@ -391,6 +432,7 @@ function processMdxContent(content) {
391
 
392
  // Apply each transformation step sequentially
393
  processedContent = ensureFrontmatter(processedContent);
 
394
  processedContent = cleanMdxSyntax(processedContent);
395
  processedContent = transformImages(processedContent);
396
  processedContent = transformStyledSpans(processedContent);
 
356
  return content;
357
  }
358
 
359
/**
 * Collapse stray newlines inside short inline ($...$) math spans.
 *
 * LaTeX-to-Markdown conversion sometimes hard-wraps inline math across
 * lines, which breaks MDX rendering. This pass rejoins those spans into a
 * single line while leaving display math ($$...$$) and ordinary prose
 * untouched.
 *
 * @param {string} content - MDX content
 * @returns {string} - Content with cleaned math blocks
 */
function cleanSingleLineMathNewlines(content) {
  console.log(' 🔢 Cleaning newlines in single-line math blocks...');

  let cleanedCount = 0;

  // Match short single-dollar spans only; the 200-char cap avoids swallowing
  // whole paragraphs when a stray dollar sign appears in prose.
  const cleanedContent = content.replace(
    /\$([^$]{1,200}?)\$/g,
    (match, mathContent, offset, source) => {
      // Skip matches adjacent to another '$': those are the interior of a
      // display-math ($$...$$) block, or the prose gap right after one,
      // where the $/$$ pairing is off. Rewriting them would strip newlines
      // from display math and glue surrounding prose onto the delimiters.
      const prevChar = source[offset - 1];
      const nextChar = source[offset + match.length];
      if (prevChar === '$' || nextChar === '$') {
        return match;
      }

      // Only rejoin spans containing simple newlines; a blank line is a
      // paragraph break, which inline math never legitimately contains.
      if (!mathContent.includes('\n') || mathContent.includes('\n\n')) {
        return match;
      }

      cleanedCount++;

      // Flatten newlines and collapse whitespace runs, preserving the math
      // expression itself.
      const cleanedMath = mathContent.replace(/\s+/g, ' ').trim();
      return `$${cleanedMath}$`;
    }
  );

  if (cleanedCount > 0) {
    console.log(` ✅ Cleaned ${cleanedCount} single-line math block(s) with newlines`);
  }

  return cleanedContent;
}
399
+
400
  /**
401
  * Clean up MDX-incompatible syntax
402
  * @param {string} content - MDX content
 
432
 
433
  // Apply each transformation step sequentially
434
  processedContent = ensureFrontmatter(processedContent);
435
+ processedContent = cleanSingleLineMathNewlines(processedContent);
436
  processedContent = cleanMdxSyntax(processedContent);
437
  processedContent = transformImages(processedContent);
438
  processedContent = transformStyledSpans(processedContent);
app/scripts/latex-to-markdown/output/main.mdx CHANGED
@@ -325,8 +325,7 @@ Deriving the end-effector’s *pose*--position *and* orientation--in some $m$-di
325
 
326
  In the simplified case here considered (for which $\boldsymbol{p} \equiv p$, as the orientation of the end-effector is disregarded for simplicity), one can solve the problem of controlling the end-effector’s location to reach a goal position $p^*$ by solving analytically for $q: p(q) = f_{\text{FK}}(q) = p^*$. However, in the general case, one might not be able to solve this problem analytically, and can typically resort to iterative optimization methods comparing candidate solutions using a loss function (in the simplest case, $\Vert p(q) - p^* \Vert_2^2$ is a natural candidate), yielding:
327
 
328
- $\min_{q \in \mathcal Q} \Vert p(q) - p^* \Vert_2^2 \, .
329
- $
330
 
331
  Exact analytical solutions to IK are even less appealing when one considers the presence of obstacles in the robot’s workspace, resulting in constraints on the possible values of $q \in \mathcal Q \subseteq [-\pi, +\pi]^n \subset \mathbb R^n$ in the general case of $n$-links robots.
332
 
@@ -334,8 +333,7 @@ For instance, the robot in Figure <a href="#fig:planar-manipulator-floor" data-
334
 
335
  However, IK--solving eq. <a href="#eq:ik_problem" data-reference-type="ref" data-reference="eq:ik_problem">[eq:ik_problem]</a> for a feasible $q$--only proves useful in determining information regarding the robot’s configuration in the goal pose, and crucially does not provide information on the *trajectory* to follow over time to reach a target pose. Expert-defined trajectories obviate to this problem providing a length-$K$ succession of goal poses $\tau_K = [p^*_0, p^*_1, \dots p^*_K]$ for tracking. In practice, trajectories can also be obtained automatically through *motion planning* algorithms, thus avoiding expensive trajectory definition from human experts. However, tracking $\tau_K$ via IK can prove prohibitively expensive, as tracking would require $K$ resolutions of eq. <a href="#eq:ik_problem" data-reference-type="ref" data-reference="eq:ik_problem">[eq:ik_problem]</a> (one for each target pose). *Differential* inverse kinematics (diff-IK) complements IK via closed-form solution of a variant of eq. <a href="#eq:ik_problem" data-reference-type="ref" data-reference="eq:ik_problem">[eq:ik_problem]</a>. Let $J(q)$ denote the Jacobian matrix of (partial) derivatives of the FK-function $f_\text{FK}: \mathcal Q \mapsto \mathcal P$, such that $J(q) = \frac{\partial f_{FK}(q)}{\partial q }$. Then, one can apply the chain rule to any $p(q) = f_{\text{FK}}(q)$, deriving $\dot p = J(q) \dot q$, and thus finally relating variations in the robot configurations to variations in pose, thereby providing a platform for control.
336
 
337
- Given a desired end-effector trajectory $\dot {p}^*(t)$ (1) indicating anchor regions in space and (2) how much time to spend in each region, diff-IK finds $\dot q(t)$ solving for joints’ *velocities* instead of *configurations*, $\dot q(t) = \arg\min_\nu \; \lVert J(q(t)) \nu - \dot {p}^*(t) \rVert_2^2
338
- $
339
 
340
  Unlike eq. <a href="#eq:ik_problem" data-reference-type="ref" data-reference="eq:ik_problem">[eq:ik_problem]</a>, solving for $\dot q$ is much less dependent on the environment (typically, variations in velocity are constrained by physical limits on the actuators). Conveniently, eq. <a href="#eq:reg_ik_velocity" data-reference-type="ref" data-reference="eq:reg_ik_velocity">[eq:reg_ik_velocity]</a> also often admits the closed-form solution $\dot q = J(q)^+ \dot {p}^*$, where $J^+(q)$ denotes the Moore-Penrose pseudo-inverse of $J(q)$. Finally, discrete-time joint configurations $q$ can be reconstructed from joint velocities $\dot q$ using forward-integration on the continuous-time joint velocity , $q_{t+1} = q_t + \Delta t\,\dot q_t$ for a given $\Delta t$, resulting in tracking via diff-IK.
341
 
@@ -480,11 +478,7 @@ A length-$T$ *trajectory* is the (random) sequence
480
  \end{equation}
481
  ```
482
  with per-step rewards defined as $r_t = r (s_t, a_t, s_{t+1})$ for ease of notation.Interestingly, assuming both the environment dynamics and conditional distribution over actions given states--the *policy*--to be *Markovian*:
483
- $$
484
- `\mathbb P(s_{t+1}\vert s_t, a_t, s_{t-1}, a_{t-1}, \dots s_0, a_0 ) = \mathbb P (s_{t+1}\vert s_t, a_t)\\
485
- \mathbb P(a_t\vert s_t, a_{t-1}, s_{t-1}, s_0, a_0) = \mathbb P(a_t\vert s_t) `
486
- $$
487
- The probability of observing a given trajectory $\tau$ factorizes into
488
  ``` math
489
  \begin{equation}
490
 
@@ -492,11 +486,7 @@ $$
492
  \end{equation}
493
  ```
494
 
495
- Policies $\mathbb P(a_t\vert s_t)$ are typically indicated as $\pi(a_t\vert s_t)$, and often parametrized via $\theta$, yielding $\pi_\theta (a_t\vert s_t)$. Policies are trained optimizing the (discounted) *return* associated to a given $\tau$, i.e. the (random) sum of measured rewards over trajectory:
496
- ``` math
497
- G(\tau) = \sum_{t=0}^{T-1} \gamma^{t} r_t.
498
- ```
499
- In that, agents seek to learn control strategies (*policies*, $\pi_\theta$) maximizing the expected return $\mathbb E_{\tau \sim \pi_\theta} G(\tau)$. For a given dynamics $\mathcal D$--i.e., for a given problem--taking the expectation over the (possibly random) trajectories resulting from acting according to a certain policy provides a direct, goal-conditioned ordering in the space of all the possible policies $\Pi$, yielding the (maximization) target $J : \Pi \mapsto \mathbb R$
500
  $$
501
  `J(\pi_\theta) = \mathbb E_{\tau \sim \mathbb P_{\theta; \mathcal D}} [G(\tau)],\\
502
  \mathbb P_{\theta; \mathcal D} (\tau) = \rho \prod_{t=0}^{T-1} \mathcal D (s_t, a_t, s_{t+1})\ \pi_\theta (a_t\vert s_t).`
@@ -512,12 +502,7 @@ can be used to discriminate between desirable and undesirable state in terms of
512
  Q_\pi(s,a) = \mathbb E_{\tau \sim \pi} [G (\tau) \big \vert s_0 = s, a_0=a]
513
  ```
514
  Crucially, value functions are interrelated:
515
- $$
516
- `Q_\pi(s_t, a_t) = \mathbb{E}_{s_{t+1}\sim \mathbb P(\bullet \vert s_t, a_t)} [r_t + \gamma V_\pi(s_{t+1})]\\
517
- V_\pi(s_t) = \mathbb E_{a_t\sim \pi(\bullet \vert s_t)} [Q_\pi (s_t, a_t)]
518
- `
519
- $$
520
- Inducing an ordering over states and state-action pairs under $\pi$, value functions are central to most RL algorithms. A variety of methods have been developed in RL as standalone attemps to find (approximate) solutions to the problem of maximizing cumulative reward (Figure <a href="#fig:rl-algos-atlas" data-reference-type="ref" data-reference="fig:rl-algos-atlas">15</a>).
521
 
522
  <ResponsiveImage
523
  src={ch3_rl_algorithms_atlas}
@@ -599,8 +584,7 @@ $$
599
  (\underbrace{y_i - Q_{\theta_i}(s_t, a_t)}_{\delta_i})^2
600
  \big],\\
601
  y_i = \mathbb E_{s_{t+1} \sim \mathbb P(\bullet \vert s_t, a_t)} \big[ r_t + \gamma \max_{a_t\in \mathcal A} Q_{\theta_{i-1}} (s_{t+1}, a_{t+1}) \big], `
602
- $$
603
- Where $\chi$ represents a behavior distribution over state-action pairs. Crucially, $\chi$ can in principle be different from the policy being followed, effectively allowing to reuse prior data stored in a *replay buffer* in the form of $(s_t, a_t, r_t, s_{t+1})$ transitions, used to form the TD-target $y_i$, TD-error $\delta_i$ and loss function <a href="#eq:dqn-loss" data-reference-type="ref" data-reference="eq:dqn-loss">[eq:dqn-loss]</a> via Monte-Carlo (MC) estimates.
604
 
605
  While effective in handling large, unstructured state spaces for discrete action-space problems, DQN application’s to continous control problems proved challenging. Indeed, in the case of high-capacity function approximators such as neural networks, solving $\max_{a_t \in \mathcal A} Q_\theta(s_t, a_t)$ at each timestep is simply unfeasible due to the (1) continous nature of the action space ($\mathcal A\subset \mathbb R^n$ for some $n$) and (2) impossibility to express the find a cheap (ideally, closed-form) solution to $Q_\theta$.  @silverDeterministicPolicyGradient2014 tackle this fundamental challenge by using a *deterministic* function of the state $s_t$ as policy, $\mu_\phi(s_t) = a_t$, parametrized by $\phi$. Thus, policies can be iteratively refined updating $\phi$ along the direction:
606
  ``` math
@@ -795,8 +779,7 @@ $$
795
  \mathbb{E}_{z \sim p_\theta(\cdot \vert (o,a)_i)} \big[ \log p_\theta((o,a)_i \vert z) \big]
796
  - \text{D}_{\text{KL}}\big[ q_\theta(z \vert (o,a)_i) \Vert p(z) \big]
797
  \right) `
798
- $$
799
- The true, generally intractable posterior $p_\theta (z \vert o,a)$ prevents computing both the expectation and KL divergence terms in <a href="#eq:ELBO-intractable" data-reference-type="ref" data-reference="eq:ELBO-intractable">[eq:ELBO-intractable]</a>, and therefore @kingmaAutoEncodingVariationalBayes2022 propose deriving the ELBO using an *approximate* posterior $q_\phi(z \vert o,a)$, resulting in the final, tractable ELBO objective, $\text{ELBO}_{\mathcal D}(\theta, \phi) = \sum_{i=0}^{N} \left(
800
  \mathbb{E}_{z \sim q_\phi(\cdot \vert (o,a)_i)} \big[ \log p_\theta((o,a)_i \vert z) \big]
801
  - \text{D}_{\text{KL}}\big[ q_\phi(z \vert (o,a)_i) \Vert p(z) \big]
802
  \right)
@@ -851,8 +834,7 @@ $$
851
  \mathbb{E}_{z_1 \sim q(\bullet \vert z_0)} \log p_\theta (z_0 \vert z_1) -\\
852
  \mathbb{E}_{z_{T-1} \sim q(\bullet \vert z_0)} \big[ \text{D}_{\text{KL}}(q(z_T \vert z_{T-1}) \Vert p(z_T) ) \big] - \notag\\
853
  \sum_{t=1}^{T-1} \mathbb{E}_{(z_{t-1}, z_{t+1}) \sim q(\bullet \vert z_0)} \big[ \text{D}_{\text{KL}}(q(z_t \vert z_{t-1}) \Vert p_\theta(z_t \vert z_{t-1}) ) \big], \notag`
854
- $$
855
- providing an optimization target in the form of $\max_\theta \log p_\theta (\mathcal D)$.
856
 
857
  In their seminal work on using DMs for variational inference, @hoDenoisingDiffusionProbabilistic2020 introduce major contributions regarding solving $\min_\theta -\log p_\theta(o,a)$. In particular, @hoDenoisingDiffusionProbabilistic2020 exclusively adopt a fixed *Gaussian* posterior in the form of $q(z_t \vert z_{t-1}) = \mathcal{N}(\sqrt{1-\beta_t}z_{t-1}, \beta_t \mathbf I)$. The choice of adopting Gaussians has profound implications on the generative process modeled. Indeed, under the (mild) assumption that the variance is sufficiently small $\beta_t \leq \eta, \eta \in \mathbb R^+$, @sohl-dicksteinDeepUnsupervisedLearning2015 proved that the likelihood $p(z_{t-1} \vert z_t)$ is Gaussian as well, which allows for the particularly convenient parametrization of the approximate likelihood $p_\theta (x_{t-1} \vert x_t) = \mathcal N(\mu_\theta(x_t, t), \Sigma_\theta(x_t,t)), \ t \in [1,T]$, as well as for closed-form tractability of the KL-divergence terms in <a href="#eq:diffusion-likelihood" data-reference-type="ref" data-reference="eq:diffusion-likelihood">[eq:diffusion-likelihood]</a>. Further, the posterior’s structure also enables an analytical description for the distribution of the $t$-th latent variable, $q(z_t \vert z_0) = \mathcal N (\sqrt{\bar{\alpha}_t}z_0, (1-\bar{\alpha}_t) \mathbf{I})$, with $\alpha_t = 1-\beta_t, \ \bar \alpha_t = \prod_{k=1}^t \alpha_k$, which conveniently prevents iterative posterior sampling.
858
 
@@ -891,10 +873,7 @@ By learning the total displacement from a generally, uninformative corrupted sam
891
  ### Flow Matching
892
 
893
  The posterior parametrization adopted by DMs proved traditionally effective, yet it raised concerns circa its efficiency at inference time, where a possibly large of compute-expensive denoising steps are needed in order to recover a sample from the target distribution. Flow Matching (FM) @lipmanFlowMatchingGenerative2023 extends DMs to the general case of arbitrary, parametrized likelihood and posteriors, and in this defines a superseding class of GMs providing a unified framework for learning *continuous transformations* between distributions, encompassing and generalizing DMs. Instead of a *stochastic, discrete, multi-step* denoising process, FM aims to learn a *deterministic, continuous, differentiable flow* $\psi [0,1] \times Z \mapsto Z$, formalized starting from possibly time-dependent vector field $v: [0,1] \times Z \mapsto Z$ transporting samples from a simple prior distribution $p_0$--e.g., a standard Gaussian--to a more complex, potentially unknown data distribution $p_1$ over time. Note how FM models time $t \in [0,1]$ to be varying continuously while moving away *from* an easy-to-sample distribution $p_0$ *towards* the unknown data-distribution, $p_1$. This results in a continuous and deterministic trajectory for each sample, which can be more efficient to generate compared to the stochastic paths of DMs. Formally, FM can be fully characterized by an ordinary differential equation (ODE) relating instantaneous variations of flows with the underlying vector field, and hence providing complete trajectories over the distributions’ support when integrating over time,
894
- $$
895
- `\frac{d}{dt} \psi(z, t) = v(t, \psi(t, z))\\
896
- \psi(0, z) = z`
897
- $$
898
 
899
 
900
  FM proved very effective in a variety of applications, ranging from image @esserScalingRectifiedFlow2024 and video generation @polyakMovieGenCast2025 to robotics control @blackp0VisionLanguageActionFlow2024. Most notably, in their introductory work on FM for GM, @lipmanFlowMatchingGenerative2023 show how DMs can be seen as a specific instance of FM where the *conditional* target vector field $u$ approximated by the noise regressor corresponds to
@@ -928,9 +907,7 @@ While the noising schedule of DMs results in a stochastic process that resembles
928
  caption={'Compared to diffusion, flow matching distorts distribution along a less randomic pattern, resulting in a clearer interpolation between source and target distribution. The visualization shows an example comparison between these two methods on joint distribution of robot observations and actions over T = 50 steps.'}
929
  />
930
 
931
- In practice, FM can be applied to generative modeling by learning a vector field regressor $v_\theta(z, t)$ to approximate a given target vector field $u(t, z)$. In the particular case of DMs, $u(t, z)$ is defined as in <a href="#eq:fm-diffusion-vector-field" data-reference-type="ref" data-reference="eq:fm-diffusion-vector-field">[eq:fm-diffusion-vector-field]</a>, while in priciple the target vector field can be learned to induce a particular transportation, or fixed according to OT. Given a sample from the data distribution $z_1 \sim p_1$ and a sample from an easy-to-sample prior $z_0 \sim p_0$, CFM defines a simple path between them using *linear interpolation* between samples $z_t = (1-t)z_0 + t z_1$, resulting in the target vector field $u(t, z_t) = z_1 - z_0$. Then, a FM model can be trained with the simple regression objective defined as $
932
- \mathcal L(\theta) = \mathbb{E}_{t, z_0, z_1} \big[
933
- \Vert v_\theta((1-t)z_0 + t z_1, t) - (z_1 - z_0) \Vert^2 \big], \quad t \sim \mathcal{U}([0,1]),$ where $z_0 \sim p_0(\bullet)$ and $z_1 \sim p_1(\bullet)$. Note how in <a href="#eq:flow-matching-objective" data-reference-type="ref" data-reference="eq:flow-matching-objective">[eq:flow-matching-objective]</a>--differently from <a href="#eq:diffusion-simplified-loss" data-reference-type="ref" data-reference="eq:diffusion-simplified-loss">[eq:diffusion-simplified-loss]</a>--time is assumed to be varying continuously $t \sim \mathcal U([0,1])$ rather than discretely $t \sim \mathcal U(\{0,1\})$, a key property of flow-based models. The objective in <a href="#eq:flow-matching-objective" data-reference-type="ref" data-reference="eq:flow-matching-objective">[eq:flow-matching-objective]</a> directly regresses the learned vector field onto the simple, straight path connecting a point from the prior and a point from the data, providing a simulation-free training procedure that is both stable and efficient. At inference time, samples are generated by starting with $z_0 \sim p_0$ and iteratively refined according to $\frac{dz}{dt} = v_\theta(z_t, t)$ for $t \in [0,1]$--an operation that can be numerically carried out with standard ODE solvers.
934
 
935
  ## Action Chunking with Transformers
936
 
@@ -1189,8 +1166,7 @@ $$
1189
  \tau \sim \mathrm{Beta}_{[0,s]}(1.5,1), \quad
1190
  \epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I}), \quad
1191
  o_t, a_{t:t+H_a} \sim \mathcal D \notag`
1192
- $$
1193
- Where the experts parametrized by the separate weights $\phi, \theta$ interact with each other via self-attention layers only, so that the action expert $v_\theta$ internal computations also depend on the VLM backbone’s parameters $\phi$. Importantly, @blackp0VisionLanguageActionFlow2024 minimize <a href="#eq:pi0-loss" data-reference-type="ref" data-reference="eq:pi0-loss">[eq:pi0-loss]</a> over both the multimodal backbone and action expert parameters, thus updating the internal representations of the VLM using BC-specific gradients. In contrast, @driessKnowledgeInsulatingVisionLanguageAction2025 later show that failing to insulate the VLM knowledge from the flow matching gradients actually harms performance. Inference is performed iteratively refining action chunks while numerically forward-integrating the vector field predicted by the action expert,
1194
  ``` math
1195
  \begin{equation}
1196
  a_{t:t+H_a}^{\tau + \delta} = a_{t:t+H_a}^{\tau } + \delta v_\theta(a_{t:t+H_a}^{\tau }, o_t)
 
325
 
326
  In the simplified case here considered (for which $\boldsymbol{p} \equiv p$, as the orientation of the end-effector is disregarded for simplicity), one can solve the problem of controlling the end-effector’s location to reach a goal position $p^*$ by solving analytically for $q: p(q) = f_{\text{FK}}(q) = p^*$. However, in the general case, one might not be able to solve this problem analytically, and can typically resort to iterative optimization methods comparing candidate solutions using a loss function (in the simplest case, $\Vert p(q) - p^* \Vert_2^2$ is a natural candidate), yielding:
327
 
328
+ $\min_{q \in \mathcal Q} \Vert p(q) - p^* \Vert_2^2 \, . $
 
329
 
330
  Exact analytical solutions to IK are even less appealing when one considers the presence of obstacles in the robot’s workspace, resulting in constraints on the possible values of $q \in \mathcal Q \subseteq [-\pi, +\pi]^n \subset \mathbb R^n$ in the general case of $n$-links robots.
331
 
 
333
 
334
  However, IK--solving eq. <a href="#eq:ik_problem" data-reference-type="ref" data-reference="eq:ik_problem">[eq:ik_problem]</a> for a feasible $q$--only proves useful in determining information regarding the robot’s configuration in the goal pose, and crucially does not provide information on the *trajectory* to follow over time to reach a target pose. Expert-defined trajectories obviate to this problem providing a length-$K$ succession of goal poses $\tau_K = [p^*_0, p^*_1, \dots p^*_K]$ for tracking. In practice, trajectories can also be obtained automatically through *motion planning* algorithms, thus avoiding expensive trajectory definition from human experts. However, tracking $\tau_K$ via IK can prove prohibitively expensive, as tracking would require $K$ resolutions of eq. <a href="#eq:ik_problem" data-reference-type="ref" data-reference="eq:ik_problem">[eq:ik_problem]</a> (one for each target pose). *Differential* inverse kinematics (diff-IK) complements IK via closed-form solution of a variant of eq. <a href="#eq:ik_problem" data-reference-type="ref" data-reference="eq:ik_problem">[eq:ik_problem]</a>. Let $J(q)$ denote the Jacobian matrix of (partial) derivatives of the FK-function $f_\text{FK}: \mathcal Q \mapsto \mathcal P$, such that $J(q) = \frac{\partial f_{FK}(q)}{\partial q }$. Then, one can apply the chain rule to any $p(q) = f_{\text{FK}}(q)$, deriving $\dot p = J(q) \dot q$, and thus finally relating variations in the robot configurations to variations in pose, thereby providing a platform for control.
335
 
336
+ Given a desired end-effector trajectory $\dot {p}^*(t)$ (1) indicating anchor regions in space and (2) how much time to spend in each region, diff-IK finds $\dot q(t)$ solving for joints’ *velocities* instead of *configurations*, $\dot q(t) = \arg\min_\nu \; \lVert J(q(t)) \nu - \dot {p}^*(t) \rVert_2^2 $
 
337
 
338
  Unlike eq. <a href="#eq:ik_problem" data-reference-type="ref" data-reference="eq:ik_problem">[eq:ik_problem]</a>, solving for $\dot q$ is much less dependent on the environment (typically, variations in velocity are constrained by physical limits on the actuators). Conveniently, eq. <a href="#eq:reg_ik_velocity" data-reference-type="ref" data-reference="eq:reg_ik_velocity">[eq:reg_ik_velocity]</a> also often admits the closed-form solution $\dot q = J(q)^+ \dot {p}^*$, where $J^+(q)$ denotes the Moore-Penrose pseudo-inverse of $J(q)$. Finally, discrete-time joint configurations $q$ can be reconstructed from joint velocities $\dot q$ using forward-integration on the continuous-time joint velocity , $q_{t+1} = q_t + \Delta t\,\dot q_t$ for a given $\Delta t$, resulting in tracking via diff-IK.
339
 
 
478
  \end{equation}
479
  ```
480
  with per-step rewards defined as $r_t = r (s_t, a_t, s_{t+1})$ for ease of notation.Interestingly, assuming both the environment dynamics and conditional distribution over actions given states--the *policy*--to be *Markovian*:
481
+ $$\mathbb P(s_{t+1}\vert s_t, a_t, s_{t-1}, a_{t-1}, \dots s_0, a_0 ) = \mathbb P (s_{t+1}\vert s_t, a_t)\\ \mathbb P(a_t\vert s_t, a_{t-1}, s_{t-1}, s_0, a_0) = \mathbb P(a_t\vert s_t) $$ The probability of observing a given trajectory $\tau$ factorizes into
 
 
 
 
482
  ``` math
483
  \begin{equation}
484
 
 
486
  \end{equation}
487
  ```
488
 
489
+ Policies $\mathbb P(a_t\vert s_t)$ are typically indicated as $\pi(a_t\vert s_t)$, and often parametrized via $\theta$, yielding $\pi_\theta (a_t\vert s_t)$. Policies are trained optimizing the (discounted) *return* associated to a given $\tau$, i.e. the (random) sum of measured rewards over trajectory: ``` math G(\tau) = \sum_{t=0}^{T-1} \gamma^{t} r_t. ``` In that, agents seek to learn control strategies (*policies*,$\pi_\theta$) maximizing the expected return $\mathbb E_{\tau \sim \pi_\theta} G(\tau)$. For a given dynamics $\mathcal D$--i.e., for a given problem--taking the expectation over the (possibly random) trajectories resulting from acting according to a certain policy provides a direct, goal-conditioned ordering in the space of all the possible policies $\Pi$, yielding the (maximization) target $J : \Pi \mapsto \mathbb R$
 
 
 
 
490
  $$
491
  `J(\pi_\theta) = \mathbb E_{\tau \sim \mathbb P_{\theta; \mathcal D}} [G(\tau)],\\
492
  \mathbb P_{\theta; \mathcal D} (\tau) = \rho \prod_{t=0}^{T-1} \mathcal D (s_t, a_t, s_{t+1})\ \pi_\theta (a_t\vert s_t).`
 
502
  Q_\pi(s,a) = \mathbb E_{\tau \sim \pi} [G (\tau) \big \vert s_0 = s, a_0=a]
503
  ```
504
  Crucially, value functions are interrelated:
505
+ $$Q_\pi(s_t, a_t) = \mathbb{E}_{s_{t+1}\sim \mathbb P(\bullet \vert s_t, a_t)} [r_t + \gamma V_\pi(s_{t+1})]\\ V_\pi(s_t) = \mathbb E_{a_t\sim \pi(\bullet \vert s_t)} [Q_\pi (s_t, a_t)] $$ Inducing an ordering over states and state-action pairs under $\pi$, value functions are central to most RL algorithms. A variety of methods have been developed in RL as standalone attempts to find (approximate) solutions to the problem of maximizing cumulative reward (Figure <a href="#fig:rl-algos-atlas" data-reference-type="ref" data-reference="fig:rl-algos-atlas">15</a>).
 
 
 
 
 
506
 
507
  <ResponsiveImage
508
  src={ch3_rl_algorithms_atlas}
 
584
  (\underbrace{y_i - Q_{\theta_i}(s_t, a_t)}_{\delta_i})^2
585
  \big],\\
586
  y_i = \mathbb E_{s_{t+1} \sim \mathbb P(\bullet \vert s_t, a_t)} \big[ r_t + \gamma \max_{a_t\in \mathcal A} Q_{\theta_{i-1}} (s_{t+1}, a_{t+1}) \big], `
587
+ $$ Where $\chi$ represents a behavior distribution over state-action pairs. Crucially, $\chi$ can in principle be different from the policy being followed, effectively allowing to reuse prior data stored in a *replay buffer* in the form of $(s_t, a_t, r_t, s_{t+1})$ transitions, used to form the TD-target $y_i$, TD-error $\delta_i$ and loss function <a href="#eq:dqn-loss" data-reference-type="ref" data-reference="eq:dqn-loss">[eq:dqn-loss]</a> via Monte-Carlo (MC) estimates.
 
588
 
589
  While effective in handling large, unstructured state spaces for discrete action-space problems, DQN application’s to continous control problems proved challenging. Indeed, in the case of high-capacity function approximators such as neural networks, solving $\max_{a_t \in \mathcal A} Q_\theta(s_t, a_t)$ at each timestep is simply unfeasible due to the (1) continous nature of the action space ($\mathcal A\subset \mathbb R^n$ for some $n$) and (2) impossibility to express the find a cheap (ideally, closed-form) solution to $Q_\theta$.  @silverDeterministicPolicyGradient2014 tackle this fundamental challenge by using a *deterministic* function of the state $s_t$ as policy, $\mu_\phi(s_t) = a_t$, parametrized by $\phi$. Thus, policies can be iteratively refined updating $\phi$ along the direction:
590
  ``` math
 
779
  \mathbb{E}_{z \sim p_\theta(\cdot \vert (o,a)_i)} \big[ \log p_\theta((o,a)_i \vert z) \big]
780
  - \text{D}_{\text{KL}}\big[ q_\theta(z \vert (o,a)_i) \Vert p(z) \big]
781
  \right) `
782
+ $$ The true, generally intractable posterior $p_\theta (z \vert o,a)$ prevents computing both the expectation and KL divergence terms in <a href="#eq:ELBO-intractable" data-reference-type="ref" data-reference="eq:ELBO-intractable">[eq:ELBO-intractable]</a>, and therefore @kingmaAutoEncodingVariationalBayes2022 propose deriving the ELBO using an *approximate* posterior $q_\phi(z \vert o,a)$, resulting in the final, tractable ELBO objective, $\text{ELBO}_{\mathcal D}(\theta, \phi) = \sum_{i=0}^{N} \left(
 
783
  \mathbb{E}_{z \sim q_\phi(\cdot \vert (o,a)_i)} \big[ \log p_\theta((o,a)_i \vert z) \big]
784
  - \text{D}_{\text{KL}}\big[ q_\phi(z \vert (o,a)_i) \Vert p(z) \big]
785
  \right)
 
834
  \mathbb{E}_{z_1 \sim q(\bullet \vert z_0)} \log p_\theta (z_0 \vert z_1) -\\
835
  \mathbb{E}_{z_{T-1} \sim q(\bullet \vert z_0)} \big[ \text{D}_{\text{KL}}(q(z_T \vert z_{T-1}) \Vert p(z_T) ) \big] - \notag\\
836
  \sum_{t=1}^{T-1} \mathbb{E}_{(z_{t-1}, z_{t+1}) \sim q(\bullet \vert z_0)} \big[ \text{D}_{\text{KL}}(q(z_t \vert z_{t-1}) \Vert p_\theta(z_t \vert z_{t-1}) ) \big], \notag`
837
+ $$ providing an optimization target in the form of $\max_\theta \log p_\theta (\mathcal D)$.
 
  In their seminal work on using DMs for variational inference, @hoDenoisingDiffusionProbabilistic2020 introduce major contributions regarding solving $\min_\theta -\log p_\theta(o,a)$. In particular, @hoDenoisingDiffusionProbabilistic2020 exclusively adopt a fixed *Gaussian* posterior in the form of $q(z_t \vert z_{t-1}) = \mathcal{N}(\sqrt{1-\beta_t}z_{t-1}, \beta_t \mathbf I)$. The choice of adopting Gaussians has profound implications on the generative process modeled. Indeed, under the (mild) assumption that the variance is sufficiently small $\beta_t \leq \eta, \eta \in \mathbb R^+$, @sohl-dicksteinDeepUnsupervisedLearning2015 proved that the likelihood $p(z_{t-1} \vert z_t)$ is Gaussian as well, which allows for the particularly convenient parametrization of the approximate likelihood $p_\theta (z_{t-1} \vert z_t) = \mathcal N(\mu_\theta(z_t, t), \Sigma_\theta(z_t,t)), \ t \in [1,T]$, as well as for closed-form tractability of the KL-divergence terms in <a href="#eq:diffusion-likelihood" data-reference-type="ref" data-reference="eq:diffusion-likelihood">[eq:diffusion-likelihood]</a>. Further, the posterior’s structure also enables an analytical description for the distribution of the $t$-th latent variable, $q(z_t \vert z_0) = \mathcal N (\sqrt{\bar{\alpha}_t}z_0, (1-\bar{\alpha}_t) \mathbf{I})$, with $\alpha_t = 1-\beta_t, \ \bar \alpha_t = \prod_{k=1}^t \alpha_k$, which conveniently avoids the need for iterative posterior sampling.
 
 
  ### Flow Matching

  The posterior parametrization adopted by DMs proved traditionally effective, yet it raised concerns about its efficiency at inference time, where a possibly large number of compute-expensive denoising steps are needed in order to recover a sample from the target distribution. Flow Matching (FM) @lipmanFlowMatchingGenerative2023 extends DMs to the general case of arbitrary, parametrized likelihood and posteriors, and in doing so defines a superseding class of GMs providing a unified framework for learning *continuous transformations* between distributions, encompassing and generalizing DMs. Instead of a *stochastic, discrete, multi-step* denoising process, FM aims to learn a *deterministic, continuous, differentiable flow* $\psi: [0,1] \times Z \mapsto Z$, formalized starting from a possibly time-dependent vector field $v: [0,1] \times Z \mapsto Z$ transporting samples from a simple prior distribution $p_0$--e.g., a standard Gaussian--to a more complex, potentially unknown data distribution $p_1$ over time. Note how FM models time $t \in [0,1]$ to be varying continuously while moving away *from* an easy-to-sample distribution $p_0$ *towards* the unknown data-distribution, $p_1$. This results in a continuous and deterministic trajectory for each sample, which can be more efficient to generate compared to the stochastic paths of DMs. Formally, FM can be fully characterized by an ordinary differential equation (ODE) relating instantaneous variations of flows with the underlying vector field, and hence providing complete trajectories over the distributions’ support when integrating over time,
+ $$\frac{d}{dt} \psi(t, z) = v(t, \psi(t, z)), \qquad \psi(0, z) = z$$
 
 
 
  FM proved very effective in a variety of applications, ranging from image generation @esserScalingRectifiedFlow2024 and video generation @polyakMovieGenCast2025 to robotics control @blackp0VisionLanguageActionFlow2024. Most notably, in their introductory work on FM for GM, @lipmanFlowMatchingGenerative2023 show how DMs can be seen as a specific instance of FM where the *conditional* target vector field $u$ approximated by the noise regressor corresponds to
 
  caption={'Compared to diffusion, flow matching distorts the distribution along a less random pattern, resulting in a clearer interpolation between source and target distribution. The visualization shows an example comparison between these two methods on the joint distribution of robot observations and actions over T = 50 steps.'}
  />

+ In practice, FM can be applied to generative modeling by learning a vector field regressor $v_\theta(z, t)$ to approximate a given target vector field $u(t, z)$. In the particular case of DMs, $u(t, z)$ is defined as in <a href="#eq:fm-diffusion-vector-field" data-reference-type="ref" data-reference="eq:fm-diffusion-vector-field">[eq:fm-diffusion-vector-field]</a>, while in principle the target vector field can be learned to induce a particular transportation, or fixed according to optimal transport (OT). Given a sample from the data distribution $z_1 \sim p_1$ and a sample from an easy-to-sample prior $z_0 \sim p_0$, conditional flow matching (CFM) defines a simple path between them using *linear interpolation* between samples $z_t = (1-t)z_0 + t z_1$, resulting in the target vector field $u(t, z_t) = z_1 - z_0$. Then, a FM model can be trained with the simple regression objective defined as $\mathcal L(\theta) = \mathbb{E}_{t, z_0, z_1} \big[ \Vert v_\theta((1-t)z_0 + t z_1, t) - (z_1 - z_0) \Vert^2 \big], \quad t \sim \mathcal{U}([0,1]),$ where $z_0 \sim p_0(\bullet)$ and $z_1 \sim p_1(\bullet)$. Note how in <a href="#eq:flow-matching-objective" data-reference-type="ref" data-reference="eq:flow-matching-objective">[eq:flow-matching-objective]</a>--differently from <a href="#eq:diffusion-simplified-loss" data-reference-type="ref" data-reference="eq:diffusion-simplified-loss">[eq:diffusion-simplified-loss]</a>--time is assumed to be varying continuously $t \sim \mathcal U([0,1])$ rather than discretely $t \sim \mathcal U(\{0,1\})$, a key property of flow-based models. The objective in <a href="#eq:flow-matching-objective" data-reference-type="ref" data-reference="eq:flow-matching-objective">[eq:flow-matching-objective]</a> directly regresses the learned vector field onto the simple, straight path connecting a point from the prior and a point from the data, providing a simulation-free training procedure that is both stable and efficient.
At inference time, samples are generated by starting from $z_0 \sim p_0$ and iteratively refining it according to $\frac{dz}{dt} = v_\theta(z_t, t)$ for $t \in [0,1]$--an operation that can be numerically carried out with standard ODE solvers.
 
 

  ## Action Chunking with Transformers

 
  \tau \sim \mathrm{Beta}_{[0,s]}(1.5,1), \quad
  \epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I}), \quad
  o_t, a_{t:t+H_a} \sim \mathcal D \notag
+ $$ where the experts, parametrized by the separate weights $\phi, \theta$, interact with each other via self-attention layers only, so that the action expert $v_\theta$’s internal computations also depend on the VLM backbone’s parameters $\phi$. Importantly, @blackp0VisionLanguageActionFlow2024 minimize <a href="#eq:pi0-loss" data-reference-type="ref" data-reference="eq:pi0-loss">[eq:pi0-loss]</a> over both the multimodal backbone and action expert parameters, thus updating the internal representations of the VLM using BC-specific gradients. In contrast, @driessKnowledgeInsulatingVisionLanguageAction2025 later show that failing to insulate the VLM knowledge from the flow matching gradients actually harms performance. Inference is performed by iteratively refining action chunks while numerically forward-integrating the vector field predicted by the action expert,
 
  ``` math
  \begin{equation}
  a_{t:t+H_a}^{\tau + \delta} = a_{t:t+H_a}^{\tau } + \delta v_\theta(a_{t:t+H_a}^{\tau }, o_t)
app/src/content/article.mdx CHANGED
The diff for this file is too large to render. See raw diff