Title: Wider Space Needs Finer Steps in Online RL for Flow-based VLAs

URL Source: https://arxiv.org/html/2603.02083

Published Time: Tue, 10 Mar 2026 02:21:11 GMT


  1. Abstract
  2. 1 Introduction
  3. 2 Related Works
     1. 2.1 Online RL for VLAs
     2. 2.2 Policy Optimization for Generative Models
  4. 3 Preliminaries
     1. 3.1 Flow Matching for VLA Models
     2. 3.2 RL Fine-tuning and the Likelihood Gap
  5. 4 Method
     1. 4.1 Step-wise Transitions and Mirror Errors
     2. 4.2 π-StepNFT: Step-wise Contrastive Objective
     3. 4.3 Validity and Optimized Direction
     4. 4.4 Comparison with Diffusion-NFT (Weighted-MSE)
  6. 5 Experiments
     1. 5.1 Experimental Setup
     2. 5.2 Main Results
     3. 5.3 Ablation Studies
  7. 6 Conclusion
  8. References
  9. A Theoretical Analysis and Proofs
     1. A.1 Proof of Equation 3 (Affine Mean Derivation)
     2. A.2 Proof of Lemma 4.2 (Log-Likelihood Ratio)
     3. A.3 Proof of Proposition 4.3 (Bayes Monotonicity)
     4. A.4 Oracle Splits from Diffusion-NFT
     5. A.5 Proof of Theorem 4.4 (Gradient Form and Alignment)
     6. A.6 Proof of Theorem 4.5 (Comparison with wMSE)
  10. B Detailed Related Works
     1. B.1 Online RL for VLAs
     2. B.2 Policy Optimization for Generative Models
  11. C Experiment Details
     1. C.1 Detailed Introduction of Benchmarks
     2. C.2 Hyperparameters for Training
  12. D Additional Ablation: Step Selection Strategy
License: arXiv.org perpetual non-exclusive license

arXiv:2603.02083v2 [cs.RO] 09 Mar 2026

π-StepNFT: Wider Space Needs Finer Steps in Online RL for Flow-based VLAs

Abstract

Flow-based vision-language-action (VLA) models excel in embodied control but suffer from intractable likelihoods during multi-step sampling, hindering online reinforcement learning. We propose π-StepNFT (Step-wise Negative-aware Fine-Tuning), a critic-and-likelihood-free framework that requires only a single forward pass per optimization step and eliminates auxiliary value networks. We identify that wider exploration spaces necessitate finer-grained, step-wise guidance for alignment. Empirically, π-StepNFT unlocks latent potential on LIBERO with competitive few-shot robustness. Moreover, it achieves superior generalization on ManiSkill, outperforming value-based baselines in OOD scenarios by preventing overfitting to multimodal features. This property offers a scalable solution promising for complex real-world applications. Our implementation builds upon RLinf and is publicly available at https://wangst0181.github.io/pi-StepNFT/.

Siting Wang, Xiaofeng Wang, Zheng Zhu, Minnan Pei, Xinyu Cui, Cheng Deng, Jian Zhao, Guan Huang, Haifeng Zhang, Jun Wang

Figure 1: Comparison of training paradigms. Left (ODE): Terminal supervision is well-posed for deterministic ODEs but results in a narrow expert manifold. Middle (Naive SDE): Stochastic rollouts introduce a wider exploration space, but coarse terminal supervision fails to correct deviations, leading to misalignment. Right (π-StepNFT): Our method leverages the wider space from SDE but applies finer, step-wise ranking guidance to ensure robust alignment with the expert manifold.

1 Introduction

Vision-language-action (VLA) models have recently demonstrated increasingly general capabilities, enabling robots to follow open-ended natural language instructions and perform complex tasks across diverse environments. While early approaches discretized actions into tokens (Zitkovich et al., 2023; Kim et al., 2024) or mapped observations to continuous regression features (Kim et al., 2025b), recent advances have converged on flow-matching-based policies (Black et al., 2026, 2025; Intelligence et al., 2025; Ye et al., 2025; Bjorck et al., 2025). These models, by integrating large-scale vision-language pretraining with generative action prediction, have established a new standard for complex manipulation tasks.

A recent systematic analysis (Pan et al., 2025) suggests that after supervised fine-tuning (SFT) converges, the output of flow-based VLAs often collapses to a single mode. Consequently, their efficacy stems from a mechanism of stochasticity injection combined with supervised iterative correction: injecting noise during training and correcting iteratively during inference enable the policy to resist error accumulation and adhere closely to the expert manifold during closed-loop execution. However, we contend that, because it relies heavily on the density of expert demonstrations, SFT merely establishes a foundational behavioral manifold resembling a narrow line, from which the model often cannot recover once it deviates due to micro-perturbations during testing. To overcome this fragility, reinforcement learning (RL) is required, aiming not to learn from scratch but to explore an expanded manifold around the expert trajectory, endowing the model with the local error-correction capability needed to mitigate state deviations.

However, training flow-based VLAs with RL faces fundamental bottlenecks. The deterministic nature of ordinary differential equation (ODE) trajectories confines the action exploration space entirely to the initial noise distribution. Moreover, multi-step ODE integration renders exact action likelihoods computationally intractable, as calculating gradients requires expensive Jacobian trace estimation or backpropagation through the solver. Faced with these challenges, existing solutions either bypass likelihoods by employing latent-space value distillation (Li et al., 2025b) or by training separate value functions to introduce explicit conditioning on trajectory quality (Intelligence et al., 2025), or approximate likelihoods via Gaussian parameterization at each denoising step (Chen et al., 2025). In contrast, Diffusion-NFT (Zheng et al., 2026) offers a likelihood-free alternative from the image generation domain. It optimizes the flow field directly on the forward diffusion process and defines a contrastive improvement direction by splitting samples into positive and negative subsets.

Crucially, directly transferring Diffusion-NFT to embodied control reveals a fundamental domain gap. While standard deterministic ODE sampling yields a safe but narrow manifold, it lacks the exploratory capacity for self-improvement. Attempting to broaden this scope via a stochastic differential equation (SDE) theoretically enables manifold expansion during exploration, yet in practice it yields a wider space but a misaligned policy. This failure stems from the nature of supervision. Unlike image generation, which targets static distribution matching, embodied control is sequential and thus sensitive to compounding errors. Moreover, embodied control typically uses a short denoising path to meet interaction latency, and prior empirical evidence indicates diminishing returns from moderately longer paths (Chen et al., 2025), making fine-grained, step-wise denoising supervision feasible in practice.

As illustrated in Figure 1, under deterministic ODE rollouts the intermediate state x_t stays on a narrow trajectory, making “point-level” terminal matching ❶ on x_0 relatively well-posed. Under SDE rollouts, however, injected noise accumulates along the denoising path, and naively regressing the final x_0 forces unstable point-to-point matching with high-variance gradients. A noise-aware view ❷ reveals that each denoising update induces a Gaussian transition, suggesting “region-level” supervision with variance normalization; yet supervising only the terminal state ❸ still yields a coarse correction that lags behind on-policy drift. This motivates step-wise supervision ❹ that targets the immediate next solver state x_{t^-}, providing fine-grained local guidance to stabilize stochastic exploration and accelerate convergence.

To address these issues, we propose π-StepNFT, a critic-and-likelihood-free online RL framework tailored for embodied VLAs. We achieve robust manifold expansion by systematically redesigning the interplay between exploration and supervision. First, regarding the lack of exploration, effective policy improvement necessitates a wider behavioral space. We introduce an SDE-based sampling mechanism during training that augments deterministic rollouts with structured noise, forcing the model to traverse adjacent states and thereby inflating the behavioral manifold. Second, regarding the supervision target mismatch, anchoring this expanded exploration demands finer-grained, step-wise supervision. We shift the prediction target from the final denoised output x_0 to the immediate next denoising step x_{t^-}, using a noise-based regression to generate the precise local gradients needed for robust alignment. Third, regarding suppressed successful exploration, we identify that the previous objective in Diffusion-NFT suffers from an “implicit penalty”, which inadvertently suppresses policy updates to minimize the magnitude of branch separation. To overcome this, we introduce a logistic contrastive ranking loss, establishing a “push-pull” dynamic: maximizing the likelihood of successful trajectories while suppressing failed ones. This noise-based, step-wise, bidirectional, and penalty-free signal enforces strict preference separation, enabling more aggressive and precise policy improvement.

In summary, our main contributions are as follows:

  • We propose π-StepNFT, a critic-and-likelihood-free online RL framework tailored for flow-based VLAs, which eliminates the need to train auxiliary value networks that are prone to multimodal overfitting, and requires only a single forward pass per optimization step.
  • We identify that the wider exploration spaces induced by SDEs necessitate finer-grained guidance. By shifting the supervision target from the terminal output to the immediate next denoising step and incorporating a logistic contrastive ranking objective, we resolve the supervision mismatch and reinforce successful exploration.
  • We validate our approach through extensive experiments on the LIBERO (Liu et al., 2023) and ManiSkill (Mu et al., 2021) benchmarks. On LIBERO, π-StepNFT unlocks the policy’s potential in few-shot settings, achieving a 32.9% improvement over SFT. On ManiSkill, it demonstrates superior generalization in visually diverse OOD settings by preventing critic-induced multimodal overfitting, outperforming critic-based baselines by 11.1% in unseen scenarios, highlighting its potential for complex real-world deployment.

2 Related Works

2.1 Online RL for VLAs

Recent VLA models have evolved from discrete tokenization (Zitkovich et al., 2023; Kim et al., 2024; Liu et al., 2026) to continuous flow-based policies (Ghosh et al., 2024; Black et al., 2026, 2025; Intelligence et al., 2025), which establish strong priors for manipulation. However, fine-tuning these flow-based VLAs faces the challenge of intractable action likelihoods due to multi-step ODE sampling. Existing solutions generally adopt two strategies to circumvent this: bypassing likelihood calculation via value distillation or preference feedback (e.g., GR-RL (Li et al., 2025b), π*0.6), or approximating likelihoods by transforming ODEs into SDEs (e.g., πRL (Chen et al., 2025)). Similar to test-time scaling strategies (Yang et al., 2025; Song et al., 2025), noise injection in SDEs facilitates exploration, yet existing methods often struggle to balance exploration width with supervision granularity.

2.2 Policy Optimization for Generative Models

To handle intractable likelihoods in generative policy optimization, prior works typically follow three paradigms. Explicit gradient and advantage methods (Black et al., 2023; Liu et al., 2025; Zhang et al., 2025b) treat denoising as a sequential process but often require expensive backpropagation through solvers. Reward-weighted methods (Pfrommer et al., 2025; Fan et al., 2025; McAllister et al., 2025) avoid exact likelihoods by re-weighting regression targets, yet they can suffer from high variance in gradient estimation. Preference and contrastive methods (Wallace et al., 2024; Zhang et al., 2025a) offer a more stable alternative by aligning distributions via ranking. Most notably, Diffusion-NFT (Zheng et al., 2026) proposes a likelihood-free framework using implicit forward-process updates. Our work extends this efficient paradigm to embodied control, addressing the unique supervision gaps that arise when applying it to multi-step VLA policies. Please refer to Appendix B for detailed related works.

3 Preliminaries

We use two time scales throughout the paper: environment steps and denoising time. An episode trajectory is indexed by environment steps i = 0, …, H−1, yielding τ = {(s_i, a_i)}_{i=0}^{H−1}. Separately, the flow policy uses a continuous denoising time t ∈ [0, 1] (or a discretization thereof) to generate an action. When discretized, we use K solver steps with schedule 1 = t_0 > t_1 > ⋯ > t_K = 0, and denote intermediate sampler states by {x_{t_j}}_{j=0}^{K}, where K is typically small in embodied control due to real-time constraints. Unless stated otherwise, t refers to denoising time in the sampler, while the environment index is i, avoiding ambiguity when both the RL objective and the flow dynamics appear in the same derivation.

3.1 Flow Matching for VLA Models

We consider a VLA policy that generates continuous actions x_0 ∈ ℝ^d conditioned on context c. Flow matching (Lipman et al., 2022) learns a time-dependent vector field v_θ(x, t, c) to generate data by transforming a noise distribution x_1 ∼ 𝒩(0, I) into the data distribution x_0 over time t ∈ [0, 1].

The standard flow-matching objective regresses the network prediction v_θ onto the target field u_t = x_1 − x_0 by minimizing ℒ_CFM(θ) = 𝔼_{t, x_0, x_1}[ ‖v_θ(x_t, t, c) − u_t‖² ], where x_t = t x_1 + (1−t) x_0.
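The objective above can be sketched numerically. The following minimal example (NumPy in place of a real network; the context c is omitted, and `v_theta` is any hypothetical callable) checks that the per-sample loss vanishes for the exact conditional target field:

```python
import numpy as np

def cfm_loss(v_theta, x0, x1, t):
    """Per-sample conditional flow-matching loss: ||v_theta(x_t, t) - u_t||^2,
    with x_t = t*x1 + (1-t)*x0 and target field u_t = x1 - x0.
    (Context c is omitted in this sketch; v_theta is any callable.)"""
    x_t = t * x1 + (1.0 - t) * x0
    u_t = x1 - x0
    return float(np.sum((v_theta(x_t, t) - u_t) ** 2))

# The exact conditional field drives the loss to zero at any t.
x0, x1 = np.array([0.5, -1.0]), np.array([0.1, 0.3])
exact_field = lambda x, t: x1 - x0
assert cfm_loss(exact_field, x0, x1, t=0.7) == 0.0
```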

ODE Sampling. In the standard deterministic setting, inference is performed by numerically integrating the ODE dx = v_θ(x, t, c) dt from t = 1 to t = 0. Using a discrete step size δ_t > 0, the Euler update rule for the next step x_{t^-} (where t^- = t − δ_t) is:

x_{t^-} = x_t − v_θ(x_t, t, c) δ_t.   (1)

While efficient for generation, this deterministic trajectory lacks the exploratory capability required for reinforcement learning.
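Eq. (1) amounts to a few lines of code. A minimal NumPy sketch (uniform schedule, context omitted; `v_theta` is any hypothetical velocity callable, not the paper's model):

```python
import numpy as np

def euler_ode_sample(v_theta, x1, K):
    """Integrate dx = v_theta(x, t) dt from t=1 to t=0 with K Euler steps,
    i.e. repeated application of Eq. (1): x_{t^-} = x_t - v_theta(x_t, t) * dt."""
    x, dt = x1, 1.0 / K
    for j in range(K):
        t = 1.0 - j * dt
        x = x - v_theta(x, t) * dt
    return x

# With the exact straight-line field v = x1 - x0, Euler recovers x0 exactly;
# the trajectory is fully determined by the initial noise x1 (no exploration).
x0, x1 = np.array([2.0, -1.0]), np.array([0.0, 0.0])
assert np.allclose(euler_ode_sample(lambda x, t: x1 - x0, x1, K=4), x0)
```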

SDE Sampling. To enable exploration, we adopt the reverse-time SDE formulation (Liu et al., 2025), which injects stochasticity while preserving the marginal distribution. The Euler–Maruyama discretized update is given by:

x_{t^-} = x_t − [ v_θ(x_t, t) + (σ_t² / (2t)) (x_t + (1−t) v_θ(x_t, t)) ] δ_t + σ_t √δ_t ε,   (2)

where ε ∼ 𝒩(0, I) provides exploration noise. This update step induces a Gaussian transition density q_{θ,t}(x_{t^-} | x_t, c) = 𝒩(μ_{θ,t}(x_t), Σ_t). Crucially, the mean of this transition is an affine transformation of the network output:

μ_{θ,t}(x_t, c) = U_t(x_t, t) + B_t(t) v_θ(x_t, t, c),   (3)

where U_t and B_t are pre-determined coefficients derived from the noise schedule (detailed in Appendix A.1).

This linear relationship allows us to propagate gradients efficiently from the transition target to the policy parameters without backpropagating through the ODE solver.
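To make the affine structure concrete, the sketch below reads U_t and B_t directly off the Euler–Maruyama update in Eq. (2). The closed forms here are our own rearrangement under a scalar noise scale σ_t (the paper derives the general coefficients in Appendix A.1), so treat them as an illustration rather than the released implementation:

```python
import numpy as np

def sde_step(v, x_t, t, dt, sigma_t, eps):
    """One Euler-Maruyama step of Eq. (2)."""
    drift = v + (sigma_t**2 / (2.0 * t)) * (x_t + (1.0 - t) * v)
    return x_t - drift * dt + sigma_t * np.sqrt(dt) * eps

def affine_mean_coeffs(x_t, t, dt, sigma_t):
    """Rearrange the noise-free part of Eq. (2) as mean = U_t + B_t * v (Eq. 3):
    collect the x_t terms into U_t and the v terms into the scalar B_t."""
    U_t = (1.0 - sigma_t**2 * dt / (2.0 * t)) * x_t
    B_t = -dt * (1.0 + sigma_t**2 * (1.0 - t) / (2.0 * t))
    return U_t, B_t

# Sanity check: the noise-free step (eps = 0) equals the affine mean U_t + B_t * v,
# so gradients w.r.t. v (and hence theta) flow through a single linear map B_t.
x_t, v = np.array([0.4, -0.2]), np.array([1.0, 0.5])
t, dt, sigma_t = 0.75, 0.25, 0.3
U_t, B_t = affine_mean_coeffs(x_t, t, dt, sigma_t)
assert np.allclose(sde_step(v, x_t, t, dt, sigma_t, eps=0.0), U_t + B_t * v)
```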

3.2 RL Fine-tuning and the Likelihood Gap

We formulate the fine-tuning task as maximizing the expected return J(θ) over trajectories τ = (s_0, a_0, …): J(θ) = 𝔼_{τ ∼ p_θ(τ)}[R(τ)], where the trajectory distribution is determined by the environment dynamics and the policy: p_θ(τ) = p(s_0) ∏_{i=0}^{H−1} π_θ(a_i | s_i) p(s_{i+1} | s_i, a_i).

Standard policy gradient methods rely on the score function: ∇_θ J(θ) = 𝔼_τ[ ∑_i ∇_θ log π_θ(a_i | s_i) · Ψ_i ], where Ψ_i is the advantage or return (e.g., REINFORCE (Williams, 1992), PPO (Schulman et al., 2017)).

However, for flow-based policies, calculating the explicit log-likelihood log⁡π θ​(a i|s i)\log\pi_{\theta}(a_{i}|s_{i}) is computationally expensive and numerically unstable, as it requires integrating the instantaneous change of variables (Jacobian trace) along the entire generation trajectory. This intractability prevents the direct application of standard RL algorithms, motivating our likelihood-free approach.

4 Method

Algorithm 1 π-StepNFT: Step-wise Negative-aware Fine-Tuning with Contrastive Ranking

Require: Flow policy π_{θ^old}, simulator ℰ, env steps H, solver steps K, schedule {t_j}_{j=0}^{K}, hyperparams β, λ_TR.
Initialize θ ← θ^old, buffer 𝒟 ← ∅.
for each iteration m do
  // Phase 1: Data Collection
  for each task (initial state s_0, language prompt c) do
    Roll out H env steps using π_{θ^old}.
    for i = 0 to H−1 do
      Run the Flow-SDE sampler; get chain {x_{t_j}}_{j=0}^{K}.
      Sample j ∼ 𝒰{0, …, K−1}; set t ← t_j.
      Set x_t ← x_{t_j} and x_{t^-} ← x_{t_{j+1}}.
      Set v_t^old ← π_{θ^old}(c, s_i, x_t, t).
      Record d_i = (x_t, x_{t^-}, v_t^old, t, s_i, c).
      Execute x_{t_K} in ℰ and update s_i.
    end for
    Observe terminal r ∈ {0, 1}; 𝒟 ← 𝒟 ∪ {(d_i, r)}_{i=0}^{H−1}.
  end for
  // Phase 2: Optimization
  for each batch (x_t, x_{t^-}, v_t^old, t, s, c, r) ∼ 𝒟 do
    Predict v_{θ,t} ← π_θ(c, s, x_t, t); drift Δv_θ ← v_{θ,t} − v_t^old.
    Construct mirrors: v_θ^± ← v_t^old ± β Δv_θ.
    Compute means/variance: μ_{θ,t}^±, Σ_t ← Mean_Var(x_t, v_θ^±, t).
    Compute errors: E_{θ,t}^± ← ‖x_{t^-} − μ_{θ,t}^±‖²_{Σ_t^{-1}}.
    Set y ← 2r − 1 and ΔE_θ ← E_{θ,t}^+ − E_{θ,t}^-.
    ℒ_total ← softplus((1/2) y ΔE_θ) + λ_TR ‖Δv_θ‖².
    Update θ ← θ − η ∇_θ ℒ_total.
  end for
  θ^old ← α_m θ^old + (1 − α_m) θ; clear 𝒟.
end for
Output: Optimized policy π_θ.

We introduce π-StepNFT (Step-wise Negative-aware Fine-Tuning), an online RL framework designed for flow-based VLA models, as shown in Algorithm 1. Our method is inspired by Diffusion-NFT (Zheng et al., 2026), which fine-tunes diffusion models using a weighted-MSE objective on the final denoised output x_0 generated by ODE rollouts.

Wider Space Needs Finer Steps. While effective for image generation, we observe that directly transferring the ODE-based formulation to embodied control yields suboptimal convergence. We attribute this domain gap to two critical factors, requiring us to establish a Wider Space for exploration anchored by Finer Steps of supervision (made practical by the typically short denoising path in embodied control):

  • Lack of Exploration: Deterministic ODE rollouts quickly collapse to a narrow manifold, failing to discover diverse solutions in high-dimensional action spaces. We instead adopt an SDE-based formulation to inject controlled noise. This active expansion creates the necessary wider space for the policy to traverse and learn from adjacent regions around the expert trajectory.
  • Supervision Target Mismatch: Operating within this wider space renders standard terminal-x_0 supervision unstable, as injected noise accumulates and amplifies variance over the rollout horizon. To counteract this, we require finer steps of guidance: we supervise the immediate one-step transition x_t → x_{t^-} with variance normalization, providing the precise, low-variance local gradients needed for robust alignment.

4.1 Step-wise Transitions and Mirror Errors

We conduct rollouts with the Flow-SDE solver described in Section 3.1, using a rollout policy π_{θ^old} that is updated via EMA across iterations. Each episode yields an environment trajectory τ = {(s_i, a_i)}_{i=0}^{H−1}. At each environment step i, the policy generates a_i by running a K-step solver with schedule 1 = t_0 > t_1 > ⋯ > t_K = 0, producing sampler states {x_{t_j}^{(i)}}_{j=0}^{K}; we execute the chunked terminal sample x_{t_K}^{(i)} in the simulator as a_i. For efficiency, we uniformly sample one solver index j ∼ 𝒰{0, …, K−1} and define a single solver transition (x_t, x_{t^-}, t) = (x_{t_j}^{(i)}, x_{t_{j+1}}^{(i)}, t_j), where t^- denotes the next solver time point in the discretization. We additionally record the rollout velocity v^old = π_{θ^old}(c, x_t, t). Each episode also provides a terminal optimality signal r(τ) ∈ [0, 1].

Following Diffusion-NFT, we construct two “mirrored” velocity candidates v_θ^+ and v_θ^-, symmetric around v^old along the update direction Δv_θ = v_θ − v^old:

v_θ^+ = (1 − β) v^old + β v_θ,   (4)
v_θ^- = (1 + β) v^old − β v_θ,   (5)

where β > 0 is a trust-region hyperparameter controlling how far we deviate from the rollout policy to estimate a local improvement signal. This construction ensures symmetry: v_θ^+ − v^old = v^old − v_θ^- = β Δv_θ.

Importantly, under the Flow-SDE transition Eq. (3), the one-step mean is an affine function of the velocity, so these two velocity candidates induce two Gaussian transition means μ_{θ,t}^± = μ_t(v_θ^±) with shared covariance Σ_t. We then compute the variance-normalized step errors against the sampled next state x_{t^-}:

E_{θ,t}^+ = ‖x_{t^-} − μ_{θ,t}^+‖²_{Σ_t^{-1}},  E_{θ,t}^- = ‖x_{t^-} − μ_{θ,t}^-‖²_{Σ_t^{-1}}.   (6)

Intuitively, E_{θ,t}^+ measures how well the positive mirrored branch explains the observed stochastic transition, while E_{θ,t}^- measures the same for the negative branch. Normalizing by Σ_t (which reflects the injected noise level at solver time t) stabilizes gradient scales across timesteps. We next show how this yields a step-wise contrastive objective over mirrored perturbations.

4.2 π-StepNFT: Step-wise Contrastive Objective

Given a sampled solver transition (x_t → x_{t^-}) with terminal signal r(τ), we define y = 2 r(τ) − 1. We use oracle to denote the ideal, outcome-conditioned comparison between p(x_{t^-} | x_t, c, o=1) and p(x_{t^-} | x_t, c, o=0), which is not directly observable. Our method replaces this with a computable ranking surrogate: we construct two symmetric perturbations around the rollout policy along the update direction and rank the two branches by which one assigns higher likelihood to the observed transition.

Definition 4.1 (π-StepNFT Objective).

For a sampled solver transition tuple (x_t, x_{t^-}, t, c) with episode label y ∈ [−1, 1], let E_{θ,t}^+ and E_{θ,t}^- denote the step-wise errors defined in Section 4.1. The step-level objective is:

ℓ_t(θ) = softplus( (1/2) y · (E_{θ,t}^+ − E_{θ,t}^-) ).   (7)

Minimizing ℓ_t encourages E_{θ,t}^+ < E_{θ,t}^- when the episode is successful (y > 0) and reverses the inequality for failures (y < 0). Intuitively, this ranks two local transition hypotheses along the update direction Δv_θ = v_θ − v^old, using the episode label as a weak preference signal.
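Eqs. (4)–(7) combine into a single scalar loss per transition. The sketch below assumes a diagonal covariance Σ_t = σ² I and a precomputed affine pair (U_t, B_t); all names are illustrative simplifications, not the released implementation:

```python
import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))

def step_nft_loss(v_theta, v_old, x_next, U_t, B_t, sigma2, y, beta=0.5):
    """Step-wise contrastive ranking loss (Eqs. 4-7) for one solver transition.
    Assumes Sigma_t = sigma2 * I and a scalar affine coefficient B_t (sketch only)."""
    dv = v_theta - v_old                              # update direction
    mu_pos = U_t + B_t * (v_old + beta * dv)          # mean of positive mirror (Eq. 4)
    mu_neg = U_t + B_t * (v_old - beta * dv)          # mean of negative mirror (Eq. 5)
    E_pos = np.sum((x_next - mu_pos) ** 2) / sigma2   # normalized errors (Eq. 6)
    E_neg = np.sum((x_next - mu_neg) ** 2) / sigma2
    return softplus(0.5 * y * (E_pos - E_neg))        # ranking objective (Eq. 7)

# For a success (y = +1), a prediction whose positive mirror better explains the
# observed x_next incurs a lower loss than one pushing in the opposite direction.
U_t, B_t, sigma2 = np.zeros(1), -0.25, 1.0
v_old, x_next = np.zeros(1), np.array([-0.25])
good = step_nft_loss(np.array([1.0]), v_old, x_next, U_t, B_t, sigma2, y=1.0)
bad = step_nft_loss(np.array([-1.0]), v_old, x_next, U_t, B_t, sigma2, y=1.0)
assert good < bad
```

For a failure (y = −1) the preference flips, penalizing updates that make the observed transition more likely.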

Lemma 4.2 (Log-Likelihood Ratio).

Under the shared covariance Σ_t, the difference in squared errors is proportional to the log-likelihood ratio of the two mirrored branches:

log [ q_{θ,t}^+(x_{t^-} | x_t, c) / q_{θ,t}^-(x_{t^-} | x_t, c) ] = −(1/2) (E_{θ,t}^+ − E_{θ,t}^-).   (8)

Proof. See Appendix A.2.

Lemma 4.2 shows that minimizing ℓ_t(θ) adjusts the constructed transition log-ratio log(q_θ^+ / q_θ^-) according to the episode label y. The two densities q_θ^± arise from symmetric perturbations ±β Δv_θ of the rollout policy, so this log-ratio provides a directional signal on the same observed solver transition (x_t → x_{t^-}). In contrast, DPO (Rafailov et al., 2023) ranks outcome-conditioned distributions under a shared context, whereas we rank update perturbations using only episode-level feedback. For y = +1, minimizing the loss increases the log-ratio, favoring updates that make the observed transition more likely under the positive perturbation; for y = −1, it encourages the opposite preference. Thus, ℓ_t yields a low-variance step-wise surrogate, and in Section 4.3 we show that under small-step assumptions its induced updates align with the oracle improvement direction from outcome-conditioned posterior splits.
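Lemma 4.2 is a one-line Gaussian identity; a quick numeric sanity check (diagonal shared covariance, arbitrary means) confirms that the normalizers cancel:

```python
import numpy as np

def log_gauss(x, mu, var):
    """Log-density of N(mu, var*I) at x (diagonal covariance)."""
    d = x.size
    return -0.5 * (d * np.log(2 * np.pi * var) + np.sum((x - mu) ** 2) / var)

rng = np.random.default_rng(0)
x_next = rng.normal(size=3)                               # observed x_{t^-}
mu_pos, mu_neg = rng.normal(size=3), rng.normal(size=3)   # mirrored branch means
var = 0.4                                                  # shared Sigma_t = var*I

E_pos = np.sum((x_next - mu_pos) ** 2) / var               # variance-normalized errors
E_neg = np.sum((x_next - mu_neg) ** 2) / var
log_ratio = log_gauss(x_next, mu_pos, var) - log_gauss(x_next, mu_neg, var)

# Eq. (8): log(q+/q-) = -1/2 (E+ - E-); the shared normalizer drops out.
assert np.isclose(log_ratio, -0.5 * (E_pos - E_neg))
```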

4.3 Validity and Optimized Direction

This section closes the conceptual loop between our constructed step-wise objective and the oracle improvement signal.

Oracle direction from posterior splits (not directly computable).

Let o ∈ {0, 1} denote the latent episode outcome under context c. Posterior splits (Appendix A.4) induce an outcome-conditioned decomposition of the rollout posterior at solver time t, which defines the oracle mean gap Δμ_t^⋆(x_t, c) (Lemma A.4). Intuitively, Δμ_t^⋆ captures how the one-step transition mean would change if we could condition the rollout transition on success versus failure; we treat it as an ideal local improvement direction in mean space. Importantly, this oracle quantity is a reference defined under outcome conditioning, and is not directly observable from online rollouts.

Proposition 4.3 (Bayes Monotonicity).

For fixed (x_t, c), the posterior ℙ(o = 1 | x_{t^-}, x_t, c) is strictly increasing in the oracle likelihood ratio p(x_{t^-} | x_t, c, o=1) / p(x_{t^-} | x_t, c, o=0).

Proof. See Appendix A.3.

Proposition 4.3 provides the key monotonic link: increasing the oracle transition ratio on the observed transition (x_t → x_{t^-}) strictly increases the posterior probability of success. This motivates seeking a step-wise objective that increases a success-vs-failure transition ratio, while acknowledging that the oracle-conditioned densities p(· | o) are inaccessible.

Computable surrogate via mirrored transitions (what we actually optimize).

Since the oracle success–failure ratio p​(⋅∣o=1)p​(⋅∣o=0)\frac{p(\cdot\mid o=1)}{p(\cdot\mid o=0)} is not observable online, we optimize a constructed step-wise surrogate based on the mirrored perturbations defined in Section 4.1. By Lemma 4.2, our ranking loss is equivalent to increasing (for y>0 y>0) or decreasing (for y<0 y<0) the constructed transition ratio q θ+q θ−\frac{q_{\theta}^{+}}{q_{\theta}^{-}} evaluated on the observed transition (x t→x t−)(x_{t}\rightarrow x_{t^{-}}). We next show that, under a small-step regime, the expected gradient induced by this constructed ratio aligns with the oracle mean-gap direction.

Theorem 4.4(Gradient Form and Small-Step Alignment).

Let e t=x t−−μ t old e_{t}=x_{t^{-}}-\mu_{t}^{\mathrm{old}} be the residual of the rollout mean, d t=μ θ,t+−μ t old d_{t}=\mu_{\theta,t}^{+}-\mu_{t}^{\mathrm{old}} be the displacement in mean space, and B t B_{t} be the affine coefficient from Eq. (3).

  1. (a) The error difference satisfies

E_{\theta,t}^{+}-E_{\theta,t}^{-}=-4\langle\Sigma_{t}^{-1}e_{t},d_{t}\rangle.(9)

  2. (b) Consequently, the gradient of the step loss ℓ t\ell_{t} is

-\nabla_{\theta}\ell_{t}(\theta)\propto\sigma(z_{t})\,y\,\Big(\frac{\partial v_{\theta}}{\partial\theta}\Big)^{\top}B_{t}\Sigma_{t}^{-1}e_{t},(10)

where z t z_{t} is the logit inside the softplus ranking loss and σ​(⋅)\sigma(\cdot) is the sigmoid.

  3. (c) In the binary-success setting and for small updates (v θ≈v old v_{\theta}\approx v^{\mathrm{old}}, so σ​(z t)≈const\sigma(z_{t})\approx\mathrm{const}), the conditional expected direction aligns with the oracle mean gap:

\mathbb{E}[-\nabla_{\theta}\ell_{t}(\theta)\mid x_{t},c]\parallel\Big(\frac{\partial v_{\theta}}{\partial\theta}\Big)^{\top}B_{t}\Sigma_{t}^{-1}\Delta\mu_{t}^{\star}(x_{t},c),(11)

where Δ​μ t⋆\Delta\mu_{t}^{\star} is defined by posterior splits in Appendix A.4.

Proof. See AppendixA.5.

Theorem4.4 provides the missing “closed loop”: posterior splits define an oracle local improvement signal Δ​μ t⋆\Delta\mu_{t}^{\star}, while our mirrored construction yields a computable surrogate ratio whose small-step expected gradient provably points in the same mean-space direction.
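Part (a) of Theorem 4.4 is a purely algebraic identity about mirrored means and can be checked numerically. The sketch below (hypothetical variable names, random SPD covariance) verifies that the error difference reduces to the inner-product form of Eq. (9).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
# random symmetric positive-definite covariance Sigma_t and its inverse
A = rng.normal(size=(n, n))
Sigma = A @ A.T + n * np.eye(n)
Sigma_inv = np.linalg.inv(Sigma)

mu_old = rng.normal(size=n)          # rollout transition mean
d = 0.1 * rng.normal(size=n)         # mean-space displacement d_t = mu^+ - mu_old
x_next = mu_old + rng.normal(size=n) # observed next state x_{t^-}
e = x_next - mu_old                  # rollout residual e_t

mu_plus, mu_minus = mu_old + d, mu_old - d   # mirrored branch means

def sq(v):
    # squared norm ||v||^2 in the Sigma^{-1} metric
    return v @ Sigma_inv @ v

lhs = sq(x_next - mu_plus) - sq(x_next - mu_minus)   # E+ - E-
rhs = -4.0 * (Sigma_inv @ e) @ d                     # -4 <Sigma^{-1} e, d>
assert np.allclose(lhs, rhs)
```

Expanding ‖e − d‖² − ‖e + d‖² in the Σ⁻¹ metric cancels both the ‖e‖² and ‖d‖² terms, leaving exactly the −4⟨Σ⁻¹e, d⟩ cross term.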

4.4 Comparison with Diffusion-NFT (Weighted-MSE)

Diffusion-NFT optimizes a reward-weighted regression objective. In our step-wise setting, the analogous form is ℓ t wMSE​(θ)=r​E θ,t++(1−r)​E θ,t−\ell_{t}^{\text{wMSE}}(\theta)=r\,E_{\theta,t}^{+}+(1-r)\,E_{\theta,t}^{-}, where E θ,t±E_{\theta,t}^{\pm} are the mirrored step errors from Section 4.1. We next show that this weighted MSE contains an implicit separation penalty that can suppress branch separation (and hence policy updates), whereas our logistic ranking objective isolates the directional alignment signal and induces a clearer push–pull behavior.

Theorem 4.5(Separation Penalty in wMSE).

The wMSE loss decomposes as:

\ell_{t}^{\text{wMSE}}(\theta)=\text{const}-2y\langle\Sigma_{t}^{-1}e_{t},d_{t}\rangle+\|d_{t}\|^{2}_{\Sigma_{t}^{-1}}.(12)

Proof. See AppendixA.6.

Here, as defined in Section 4.3, e t=x t−−μ t old e_{t}=x_{t^{-}}-\mu_{t}^{\mathrm{old}} is the rollout residual and d t=μ θ,t+−μ t old d_{t}=\mu_{\theta,t}^{+}-\mu_{t}^{\mathrm{old}} is the mirrored mean displacement. The middle term is the directional alignment signal driven by y y, while the last term ‖d t‖Σ t−1 2\|d_{t}\|^{2}_{\Sigma_{t}^{-1}} is a separation penalty that discourages large branch displacement irrespective of y y.

In contrast, the core directional term in π\pi-StepNFT (derived in Theorem4.4) depends only on the error difference:

E θ,t+−E θ,t−=−4​⟨Σ t−1​e t,d t⟩.E_{\theta,t}^{+}-E_{\theta,t}^{-}=-4\langle\Sigma_{t}^{-1}e_{t},d_{t}\rangle.(13)

Implicit Penalty: The decomposition of the objective above shows that wMSE optimizes the same alignment term plus an additional quadratic penalty ‖d t‖Σ t−1 2\|d_{t}\|^{2}_{\Sigma_{t}^{-1}}. This penalty explicitly discourages branch separation (and thus suppresses the magnitude of the policy update), even when the data suggests a strong corrective move (i.e., large alignment between e t e_{t} and d t d_{t}). In contrast, by using a logistic ranking loss, π\pi-StepNFT removes this intrinsic suppression and preserves the alignment signal as the dominant driver for policy improvement.
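The decomposition in Eq. (12) can likewise be verified numerically. This sketch expands r E⁺ + (1−r) E⁻ with mirrored means μ± = μ_old ± d and confirms it equals the constant, alignment, and separation-penalty terms, with const = ‖e‖²_{Σ⁻¹} and y = 2r − 1; the concrete values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, r = 4, 0.75                      # reward r in [0, 1]; y is its signed form
y = 2.0 * r - 1.0
Sigma_inv = np.eye(n)               # unit covariance for simplicity
e = rng.normal(size=n)              # rollout residual e_t
d = 0.3 * rng.normal(size=n)        # mirrored mean displacement d_t

E_plus = (e - d) @ Sigma_inv @ (e - d)    # positive-branch step error
E_minus = (e + d) @ Sigma_inv @ (e + d)   # negative-branch step error
wmse = r * E_plus + (1.0 - r) * E_minus

# Eq. (12): const - 2 y <Sigma^{-1} e, d> + ||d||^2,  const = ||e||^2
decomposed = e @ Sigma_inv @ e - 2.0 * y * (Sigma_inv @ e) @ d + d @ Sigma_inv @ d
assert np.allclose(wmse, decomposed)
```

The identity holds for any r in [0, 1], which makes the ‖d‖² separation penalty visible even under soft success probabilities, not just binary rewards.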

Push-pull dynamics: In the binary case r∈{0,1} r\in\{0,1\}, the wMSE objective reduces to fitting only one branch: it pulls the selected branch toward the observed transition but does not explicitly push the other branch away. By contrast, our logistic ranking enforces a strict ordering between E θ,t+E_{\theta,t}^{+} and E θ,t−E_{\theta,t}^{-}: for successful episodes it simultaneously pulls the positive branch and pushes the negative branch away (and vice versa for failures). This bidirectional signal yields stronger separation and typically sharper gradients during fine-tuning, which translates into faster convergence and higher asymptotic performance in our experiments.

5 Experiments

5.1 Experimental Setup

Evaluation Benchmarks. We evaluate on two multitask benchmarks. For LIBERO (Liu et al., 2023), we follow the standard protocol across four suites (Spatial, Object, Goal, Long), reporting average success rates over 500 episodes (50 states ×\times 10 sub-tasks) per suite. For ManiSkill (Mu et al., 2021), we adopt the PutOnPlateInScene multitask setting from RL4VLA (Liu et al., 2026), which tests generalization across 4,352 compositional tasks derived from 16 objects, 17 receptacles, and 16 tabletop scenes.

Model Architectures. We employ π 0\pi_{0} and π 0.5\pi_{0.5}, OpenPi’s flow-based VLAs combining a PaliGemma-3B(Beyer et al., 2024; Steiner et al., 2024) backbone with a ∼\sim 300M parameter flow-matching action expert. π 0.5\pi_{0.5} incorporates an improved training paradigm. Adhering to official configurations, π 0\pi_{0} uses vision, text, and proprioception, while π 0.5\pi_{0.5} omits proprioception on LIBERO; this modality setting remains consistent across SFT and RL phases.

SFT Initialization. We initialize our policy using π RL\pi_{\texttt{RL}} checkpoints. For LIBERO, to prevent performance saturation from masking RL gains, we train on pruned subsets of the total 1,692 trajectories: π 0\pi_{0} uses 58 trajectories for Spatial/Object/Goal and 208 for Long; π 0.5\pi_{0.5} uses a unified few-shot set of 40 trajectories (1 per sub-task). For ManiSkill, we use the full set of 16,384 trajectories due to task complexity.

RL Training Protocol. We freeze the VLM backbone and fine-tune only the action expert. Training utilizes the RLinf (Yu et al., 2025) framework, which maximizes throughput by co-locating the environment, rollout policy, and actor on the same GPU. Main experiments were conducted on 8×8\times NVIDIA H100 (80GB) GPUs; ablations used 8×8\times NVIDIA RTX 4090 (48GB) GPUs. Hyperparameters are detailed in Appendix C.2.

5.2 Main Results

Table 1: Success rates (%) on LIBERO in the few-shot setting.

| Model | Spatial | Object | Goal | Long | Avg. | Δ\Delta Avg. |
| --- | --- | --- | --- | --- | --- | --- |
| *Full SFT* |  |  |  |  |  |  |
| Octo (Ghosh et al., 2024) | 78.9 | 85.7 | 84.6 | 51.1 | 75.1 | — |
| OpenVLA (Kim et al., 2024) | 84.7 | 88.4 | 79.2 | 53.7 | 76.5 | — |
| π fast (Pertsch et al., 2025) | 96.4 | 96.8 | 88.6 | 60.2 | 85.5 | — |
| OpenVLA-OFT (Kim et al., 2025b) | 91.6 | 95.3 | 90.6 | 86.5 | 91.0 | — |
| π 0 | 96.8 | 98.8 | 95.8 | 85.2 | 94.2 | — |
| π 0.5 | 98.8 | 98.2 | 98.0 | 92.4 | 96.9 | — |
| *Few-shot SFT + RL (π 0)* |  |  |  |  |  |  |
| SFT | 65.3 | 64.4 | 49.8 | 51.2 | 57.6 | — |
| π RL (Flow-SDE + PPO) | 98.4 | 99.4 | 96.2 | 90.2 | 96.0 | +38.4 |
| π RL (Flow-SDE + GRPO) | 97.8 | 97.8 | 83.2 | 81.4 | 90.0 | +32.4 |
| π-StepNFT | 93.5 | 98.0 | 83.7 | 86.7 | 90.5 | +32.9 |
| *Few-shot SFT + RL (π 0.5)* |  |  |  |  |  |  |
| SFT | 84.6 | 95.4 | 84.6 | 43.9 | 77.1 | — |
| π RL (Flow-SDE + PPO) | 99.6 | 100 | 98.8 | 93.0 | 97.9 | +20.8 |
| π RL (Flow-SDE + GRPO) | 97.4 | 99.8 | 91.2 | 77.6 | 91.5 | +14.4 |
| π-StepNFT | 97.8 | 100 | 98.2 | 79.8 | 94.0 | +16.9 |

Table 2: Success rates (%) on ManiSkill across In-Distribution (IND) and Out-Of-Distribution (OOD) settings.

| Model | IND | OOD Vision | OOD Semantic | OOD Execution | OOD Avg. |
| --- | --- | --- | --- | --- | --- |
| π 0 Full SFT | 38.4 | – | – | – | 32.6 |
| π RL (Flow-SDE + PPO) | 78.8 | 61.1 | 25.4 | – | 39.3 |
| π-StepNFT | 79.2 | 69.1 | 49.1 | – | 50.4 |
| π 0.5 Full SFT | 40.1 | – | – | – | 40.2 |
| π RL (Flow-SDE + PPO) | 90.9 | 68.0 | 34.5 | – | 49.3 |
| π-StepNFT | 85.4 | 76.9 | 56.6 | – | 59.5 |

LIBERO: Unlocking potential from few-shot SFT.

Table1 reveals that SFT baselines are constrained by a narrow expert manifold, yielding initial success rates of only 57.6% (π 0\pi_{0}) and 77.1% (π 0.5\pi_{0.5}). π\pi-StepNFT unlocks the model’s latent capacity via “wider space” exploration, significantly boosting average performance to 90.5% and 94.0%, respectively. Notably, on short-horizon tasks (e.g., Object), our method achieves performance comparable to PPO. Regarding alignment, while critic-based methods (PPO) maintain an advantage in long-horizon tasks due to temporal credit assignment, our method notably outperforms the critic-free GRPO baseline (e.g., 86.7% vs. 81.4% on π 0\pi_{0} Long), demonstrating that step-wise supervision offers highly competitive guidance without the need for estimating advantages.

ManiSkill: Critic-free generalization.

Unlike LIBERO, ManiSkill features high visual diversity, requiring generalization to unseen textures, objects and positions (OOD). Value-based methods estimate values from vision-language embeddings, which often causes critics to overfit to nuisance visual features and specific language prompts rather than task semantics. π\pi-StepNFT bypasses this by relying on ground-truth outcomes. As shown in Table2, while PPO is competitive in IND settings, π\pi-StepNFT dominates in OOD scenarios. For π 0\pi_{0}, it achieves an OOD average of 50.4% (+11.1% over PPO), nearly doubling success rates on Semantic shifts (unseen objects/instructions) to 49.1%. This robust trend holds for π 0.5\pi_{0.5} (59.5% OOD average vs. 49.3%), confirming that critic-free supervision effectively mitigates visual overfitting.

5.3 Ablation Studies

Figure 2: Flow-SDE sampling and step-wise supervision improve on-policy stability. (a) Stochastic rollouts: ODE vs. Flow-SDE variants. (b) Regression target: x 0 x_{0} vs. step-wise x t−x_{t^{-}} supervision.

Figure 3: Contrastive ranking enables stable critic-free learning. (a) Loss formulation: wMSE vs. contrastive ranking. (b) Critic-free learning: sparse labels vs. advantage signals.

Figure 4: Hyperparameter sensitivity analysis across (a) noise level σ\sigma, (b) trust region size β\beta, and (c) decay α\alpha. The configuration selected for the main experiments is highlighted by the bold pink curves.

To verify the efficacy of π\pi-StepNFT, we conduct a component-wise analysis aligned with the challenges identified in Section1. First, we decouple the effects of stochastic exploration, investigating whether SDE sampling with noise-aware correction is strictly necessary for manifold expansion compared to deterministic ODEs. Second, we analyze supervision granularity, contrasting our step-wise x t−x_{t^{-}} target against the standard terminal x 0 x_{0} regression to demonstrate its impact on training stability and convergence speed. Third, we evaluate the learning objective, comparing our contrastive ranking formulation against weighted-MSE to highlight the benefits of removing the implicit separation penalty. Finally, we assess the necessity of explicit critics, showing that our likelihood-free framework achieves competitive performance using only sparse binary rewards, and provide a sensitivity analysis for key hyperparameters.

Impact of Stochastic Exploration. To isolate the benefit of exploration, we compare rollout strategies while fixing the regression target to the final denoised output x 0 x_{0} and using a conservative EMA schedule (α\alpha: 0.9 →\to 0.995). As shown in Figure2(a), deterministic ODE rollouts ❶(Figure1) plateau early, confirming that restricted state coverage hinders policy improvement. While standard SDE ❷ widens the visited manifold, significant performance gains are only realized when the objective explicitly accounts for injected noise via mean correction ❸ (Eq.(3)). This indicates that effective exploration requires not just traversing a wider space, but utilizing a learning signal that mathematically aligns the noisy transition back to the policy’s velocity field.
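For reference, a single solver step under the two rollout strategies can be sketched as follows, using the affine mean coefficients U_t and B_t of Eq. (3) (derived in Appendix A.1). The σ√δ noise scale follows the standard Euler–Maruyama form and is an assumption of this sketch, as are the function names.

```python
import numpy as np

def solver_step(x_t, v, t, delta, sigma=0.0, rng=None):
    """One reverse solver step t -> t - delta for a rectified flow.

    sigma = 0 recovers the deterministic Euler (ODE) step x - delta * v;
    sigma > 0 adds exploration noise around the noise-aware mean of Eq. (3).
    """
    U_t = 1.0 - sigma**2 * delta / (2.0 * t)
    B_t = -delta - (1.0 - t) * sigma**2 * delta / (2.0 * t)
    mu = U_t * x_t + B_t * v                       # corrected transition mean
    if sigma == 0.0:
        return mu                                  # ODE: no injected noise
    noise = (rng or np.random.default_rng()).normal(size=x_t.shape)
    return mu + sigma * np.sqrt(delta) * noise     # SDE: widened rollout

x = np.ones(3)
v = 0.5 * np.ones(3)
ode = solver_step(x, v, t=0.8, delta=0.1)          # reduces to x - 0.1 * v
assert np.allclose(ode, x - 0.1 * v)
```

Note that the mean correction shrinks x_t by U_t < 1 and enlarges the velocity coefficient |B_t| > δ, which is exactly the adjustment that aligns the noisy transition back to the policy's velocity field.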

Regression Target Granularity. We evaluate the efficacy of terminal x 0 x_{0} versus step-wise x t−x_{t^{-}} regression targets under stochastic rollouts in Figure 2(b). Empirical results demonstrate that supervision via x 0 x_{0} induces significant instability, necessitating overly conservative synchronization to prevent policy collapse. Conversely, the step-wise target x t−x_{t^{-}} ❹ facilitates stable, near on-policy learning even under aggressive updates, thereby accelerating convergence. This suggests that precise, local supervision is essential to counteract the distribution shift introduced by active exploration, whereas terminal targets provide gradients that are too coarse for effective manifold adherence.

Objective Formulation: Ranking vs. wMSE. We benchmark the proposed contrastive ranking objective against a weighted-MSE (wMSE) baseline and single-branch ablations in Figure 3(a). We observe that utilizing exclusively the positive or the negative branch yields partial improvement, confirming that valid gradient signals exist in both directions, yet combining them yields superior performance. Crucially, the wMSE baseline underperforms because, in the binary reward setting, it degenerates to fitting a single branch and thus fails to leverage positive and negative signals simultaneously. In contrast, our ranking objective establishes a “push–pull” dynamic by utilizing both branches to enforce strict preference separation. This effectively removes the “implicit separation penalty” that otherwise suppresses the policy update magnitude.

Necessity of Value Estimation. We investigate the trade-off between supervision density and training complexity in Figure 3(b). Although dense step-level value estimates can in principle improve long-horizon credit assignment, sparse trajectory-level outcomes remain highly competitive for general manipulation tasks. Empirically, binary supervision yields smoother training because it relies on accurate environment feedback rather than approximate value estimates. Unlike image generation, where sample quality varies continuously and advantage-style soft weighting is more natural, embodied control typically has discrete success-or-failure outcomes, similar to mathematical reasoning (Chen et al., 2026). Correspondingly, treating r r as a bounded trajectory success probability r∈[0,1]r\in[0,1] avoids the instability of unbounded advantage scores and reduces the need for normalization and clipping. Our probability-based formulation is also compatible with denser supervision: the sparse signal can be replaced by an offline-learned critic that predicts step-wise success probabilities, enabling finer credit assignment without changing the architecture.

Hyperparameter Sensitivity and Robustness. Figure 4 illustrates key trade-offs. For noise level σ\sigma, excessive noise impedes convergence by overly expanding the search space, while insufficient noise limits exploration. For trust region size β\beta, values around [1.0,2.0][1.0,2.0] are optimal; larger values violate local linearity, whereas smaller steps induce gradient instability. Regarding the decay α\alpha, a dynamic strategy proves most effective. High decay (slow updates) causes significant off-policy lag and lowers the performance ceiling, while constant or overly aggressive updates risk collapse. A dynamic schedule that progressively increases decay balances initial acceleration with final stability, which matches the small-step alignment in Theorem 4.4.
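A dynamic decay schedule of the kind described above can be sketched as follows. The 0.9 → 0.995 range mirrors the ablation setup, but the linear interpolation and function names are illustrative choices, not the paper's exact configuration.

```python
def ema_decay_schedule(step, total_steps, alpha_start=0.9, alpha_end=0.995):
    """Progressively increase the EMA decay alpha over training.

    A small alpha early lets the rollout (EMA) policy track the actor closely,
    accelerating initial improvement; a large alpha late slows rollout-policy
    drift, matching the small-step regime assumed by Theorem 4.4.
    """
    frac = min(max(step / total_steps, 0.0), 1.0)
    return alpha_start + frac * (alpha_end - alpha_start)

def ema_update(theta_ema, theta, alpha):
    # rollout parameters as an exponential moving average of actor parameters
    return alpha * theta_ema + (1.0 - alpha) * theta
```

Under this schedule the actor dominates early updates of the rollout policy and is blended in ever more conservatively as training converges.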

6 Conclusion

In this paper, we introduced π\pi-StepNFT, a critic-and-likelihood-free framework for flow-based VLAs that structurally eliminates auxiliary value networks and requires only a single forward pass per step. We identified that wider exploration spaces necessitate finer-grained, step-wise guidance for effective alignment. Empirically, π\pi-StepNFT unlocks latent potential on LIBERO in few-shot SFT settings and achieves superior OOD generalization on ManiSkill by preventing multimodal overfitting. These results establish a scalable, robust paradigm for fine-tuning generalist robot policies in complex real-world scenarios.

Impact Statement

This paper introduces a framework for fine-tuning flow-based vision-language-action (VLA) policies to improve the efficiency and robustness of embodied agents. Beyond algorithmic advances, our work has implications for the accessibility and sustainability of robotic learning.

Democratization of embodied AI research: Training large-scale VLA models often requires substantial compute, in part due to the cost of differentiating through ODE trajectories. By proposing a likelihood-free, critic-free approach that uses a single forward pass per optimization step, π\pi-StepNFT lowers the hardware barrier. This reduced overhead can broaden participation by smaller labs and academic groups, supporting a more diverse research community.

Safety and robustness: By improving out-of-distribution (OOD) generalization, our method can yield agents that behave more reliably in unstructured real-world settings. While increased capability may introduce dual-use concerns, our fine-grained supervision encourages adherence to expert manifolds and may reduce unpredictable behaviors during deployment.

References

  • J. Bai, X. Yu, M. Xu, W. Lu, X. Pan, K. Maeng, D. Kifer, J. Wang, and Y. Wang (2025)Towards better optimization for listwise preference in diffusion models. arXiv preprint arXiv:2510.01540. Cited by: Appendix D.
  • L. Beyer, A. Steiner, A. S. Pinto, A. Kolesnikov, X. Wang, D. Salz, M. Neumann, I. Alabdulmohsin, M. Tschannen, E. Bugliarello, et al. (2024)Paligemma: a versatile 3b vlm for transfer. arXiv preprint arXiv:2407.07726. Cited by: §5.1.
  • J. Bjorck, F. Castañeda, N. Cherniadev, X. Da, R. Ding, L. Fan, Y. Fang, D. Fox, F. Hu, S. Huang, et al. (2025)Gr00t n1: an open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734. Cited by: §B.1, §1.
  • K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. R. Equi, C. Finn, N. Fusai, M. Y. Galliker, et al. (2025)π 0.5\pi_{0.5}: A vision-language-action model with open-world generalization. In 9th Annual Conference on Robot Learning, Cited by: §B.1, §1, §2.1.
  • K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, S. Jakubczak, T. Jones, L. Ke, S. Levine, A. Li-Bell, M. Mothukuri, S. Nair, K. Pertsch, L. X. Shi, J. Tanner, Q. Vuong, A. Walling, H. Wang, and U. Zhilinsky (2026)π 0\pi_{0}: A vision-language-action flow model for general robot control. External Links: 2410.24164, LinkCited by: §B.1, §1, §2.1.
  • K. Black, M. Janner, Y. Du, I. Kostrikov, and S. Levine (2023)Training diffusion models with reinforcement learning. External Links: 2305.13301 Cited by: §B.2, §2.2.
  • H. Chen, K. Zheng, Q. Zhang, G. Cui, Y. Cui, H. Ye, T. Lin, M. Liu, J. Zhu, and H. Wang (2026)NFT: bridging supervised learning and reinforcement learning in math reasoning. In The Fourteenth International Conference on Learning Representations, External Links: LinkCited by: §5.3.
  • K. Chen, Z. Liu, T. Zhang, Z. Guo, S. Xu, H. Lin, H. Zang, X. Li, Q. Zhang, Z. Yu, G. Fan, T. Huang, Y. Wang, and C. Yu (2025)π RL\pi_{\texttt{RL}}: Online rl fine-tuning for flow-based vision-language-action models. External Links: 2510.25889, LinkCited by: §B.1, §1, §1, §2.1.
  • P. F. Christiano, J. Leike, T. Brown, M. Martic, S. Legg, and D. Amodei (2017)Deep reinforcement learning from human preferences. Advances in neural information processing systems 30. Cited by: Appendix D.
  • J. Fan, S. Shen, C. Cheng, Y. Chen, C. Liang, and G. Liu (2025)Online reward-weighted fine-tuning of flow matching with wasserstein regularization. In The Thirteenth International Conference on Learning Representations, Cited by: §B.2, §2.2.
  • Y. Fan, O. Watkins, Y. Du, H. Liu, M. Ryu, C. Boutilier, P. Abbeel, M. Ghavamzadeh, K. Lee, and K. Lee (2023)Reinforcement learning for fine-tuning text-to-image diffusion models. In Thirty-seventh Conference on Neural Information Processing Systems, External Links: LinkCited by: §B.2.
  • D. Ghosh, H. Walke, K. Pertsch, K. Black, O. Mees, S. Dasari, J. Hejna, T. Kreiman, C. Xu, J. Luo, Y. L. Tan, L. Y. Chen, P. Sanketi, Q. Vuong, T. Xiao, D. Sadigh, C. Finn, and S. Levine (2024)Octo: an open-source generalist robot policy. External Links: 2405.12213, LinkCited by: §B.1, §2.1, Table 1.
  • J. Han, A. Wang, M. Xu, W. Chu, M. Dang, Y. Yue, and S. Ermon (2025)Discrete diffusion trajectory alignment via stepwise decomposition. arXiv preprint arXiv:2507.04832. Cited by: Appendix D.
  • J. Ho, A. Jain, and P. Abbeel (2020)Denoising diffusion probabilistic models. Advances in neural information processing systems 33, pp.6840–6851. Cited by: Appendix D.
  • P. Intelligence, A. Amin, R. Aniceto, A. Balakrishna, K. Black, K. Conley, G. Connors, J. Darpinian, K. Dhabalia, J. DiCarlo, et al. (2025)π 0.6∗\pi^{*}_{0.6}: A vla that learns from experience. arXiv preprint arXiv:2511.14759. Cited by: §B.1, §1, §1, §2.1.
  • D. Kim, S. Lyu, S. W. Kim, and P. H. Seo (2025a)Direct diffusion score preference optimization via stepwise contrastive policy-pair supervision. arXiv preprint arXiv:2512.23426. Cited by: Appendix D.
  • M. J. Kim, C. Finn, and P. Liang (2025b)Fine-tuning vision-language-action models: optimizing speed and success. arXiv preprint arXiv:2502.19645. Cited by: §B.1, §1, Table 1.
  • M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi, et al. (2024)Openvla: an open-source vision-language-action model. arXiv preprint arXiv:2406.09246. Cited by: §B.1, §1, §2.1, Table 1.
  • H. Li, Y. Zuo, J. Yu, Y. Zhang, Z. Yang, K. Zhang, X. Zhu, Y. Zhang, T. Chen, G. Cui, et al. (2025a)Simplevla-rl: scaling vla training via reinforcement learning. arXiv preprint arXiv:2509.09674. Cited by: §B.1.
  • Y. Li, X. Ma, J. Xu, Y. Cui, Z. Cui, Z. Han, L. Huang, T. Kong, Y. Liu, H. Niu, et al. (2025b)Gr-rl: going dexterous and precise for long-horizon robotic manipulation. arXiv preprint arXiv:2512.01801. Cited by: §B.1, §1, §2.1.
  • X. Liao, W. Wei, X. Qu, and Y. Cheng (2025)Step-level reward for free in rl-based t2i diffusion model fine-tuning. arXiv preprint arXiv:2505.19196. Cited by: Appendix D.
  • T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra (2015)Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971. Cited by: Appendix D.
  • Y. Lipman, R. T. Chen, H. Ben-Hamu, M. Nickel, and M. Le (2022)Flow matching for generative modeling. arXiv preprint arXiv:2210.02747. Cited by: §3.1.
  • B. Liu, Y. Zhu, C. Gao, Y. Feng, Q. Liu, Y. Zhu, and P. Stone (2023)Libero: benchmarking knowledge transfer for lifelong robot learning. Advances in Neural Information Processing Systems 36, pp.44776–44791. Cited by: 1st item, 3rd item, §5.1.
  • J. Liu, G. Liu, J. Liang, Y. Li, J. Liu, X. Wang, P. Wan, D. Zhang, and W. Ouyang (2025)Flow-grpo: training flow matching models via online rl. arXiv preprint arXiv:2505.05470. Cited by: §B.2, §2.2, §3.1.
  • J. Liu, F. Gao, B. Wei, X. Chen, Q. Liao, Y. Wu, C. Yu, and Y. Wang (2026)What can rl bring to vla generalization? an empirical study. External Links: 2505.19789, LinkCited by: §B.1, 2nd item, §2.1, §5.1.
  • G. Lu, W. Guo, C. Zhang, Y. Zhou, H. Jiang, Z. Gao, Y. Tang, and Z. Wang (2025)Vla-rl: towards masterful and general robotic manipulation with scalable reinforcement learning. arXiv preprint arXiv:2505.18719. Cited by: §B.1.
  • D. McAllister, S. Ge, B. Yi, C. M. Kim, E. Weber, H. Choi, H. Feng, and A. Kanazawa (2025)Flow matching policy gradients. arXiv preprint arXiv:2507.21053. Cited by: §B.2, §2.2.
  • T. Mu, Z. Ling, F. Xiang, D. Yang, X. Li, S. Tao, Z. Huang, Z. Jia, and H. Su (2021)Maniskill: generalizable manipulation skill benchmark with large-scale demonstrations. arXiv preprint arXiv:2107.14483. Cited by: 2nd item, 3rd item, §5.1.
  • NVIDIA (2022)NVIDIA h100 tensor core gpu datasheet. Technical report NVIDIA. Note: Technical Report External Links: LinkCited by: Appendix D.
  • NVIDIA (2023)NVIDIA ampere gpu architecture whitepaper. Technical report NVIDIA. Note: Technical Report External Links: LinkCited by: Appendix D.
  • C. Pan, G. Anantharaman, N. Huang, C. Jin, D. Pfrommer, C. Yuan, F. Permenter, G. Qu, N. Boffi, G. Shi, et al. (2025)Much ado about noising: dispelling the myths of generative robotic control. arXiv preprint arXiv:2512.01809. Cited by: §1.
  • M. Pei, G. Li, J. Si, Z. Zhu, Z. Mo, P. Wang, Z. Song, X. Liang, and J. Cheng (2025)GCC: a 3dgs inference architecture with gaussian-wise and cross-stage conditional processing. In Proceedings of the 58th IEEE/ACM International Symposium on Microarchitecture, pp.1824–1837. Cited by: Appendix D.
  • K. Pertsch, K. Stachowicz, B. Ichter, D. Driess, S. Nair, Q. Vuong, O. Mees, C. Finn, and S. Levine (2025)Fast: efficient action tokenization for vision-language-action models. arXiv preprint arXiv:2501.09747. Cited by: Table 1.
  • S. Pfrommer, Y. Huang, and S. Sojoudi (2025)Reinforcement learning for flow-matching policies. arXiv preprint arXiv:2507.15073. Cited by: §B.2, §2.2.
  • R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn (2023)Direct preference optimization: your language model is secretly a reward model. Advances in neural information processing systems 36, pp.53728–53741. Cited by: §4.2.
  • J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017)Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §3.2.
  • M. Shukor, D. Aubakirova, F. Capuano, P. Kooijmans, S. Palma, A. Zouitine, M. Aractingi, C. Pascal, M. Russi, A. Marafioti, et al. (2025)Smolvla: a vision-language-action model for affordable and efficient robotics. arXiv preprint arXiv:2506.01844. Cited by: Appendix D.
  • H. Song, D. Qu, Y. Yao, Q. Chen, Q. Lv, Y. Tang, M. Shi, G. Ren, M. Yao, B. Zhao, et al. (2025)Hume: introducing system-2 thinking in visual-language-action model. arXiv preprint arXiv:2505.21432. Cited by: §B.1, §2.1.
  • Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole (2020)Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456. Cited by: Appendix D.
  • A. Steiner, A. S. Pinto, M. Tschannen, D. Keysers, X. Wang, Y. Bitton, A. Gritsenko, M. Minderer, A. Sherbondy, S. Long, et al. (2024)Paligemma 2: a family of versatile vlms for transfer. arXiv preprint arXiv:2412.03555. Cited by: §5.1.
  • M. Sun, P. Ding, W. Zhang, and D. Wang (2025)Iterative refinement of flow policies in probability space for online reinforcement learning. arXiv preprint arXiv:2510.15388. Cited by: Appendix D.
  • R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour (1999)Policy gradient methods for reinforcement learning with function approximation. Advances in neural information processing systems 12. Cited by: Appendix D.
  • A. Wagenmaker, M. Nakamoto, Y. Zhang, S. Park, W. Yagoub, A. Nagabandi, A. Gupta, and S. Levine (2025)Steering your diffusion policy with latent space reinforcement learning. arXiv preprint arXiv:2506.15799. Cited by: §B.1.
  • B. Wallace, M. Dang, R. Rafailov, L. Zhou, A. Lou, S. Purushwalkam, S. Ermon, C. Xiong, S. Joty, and N. Naik (2024)Diffusion model alignment using direct preference optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.8228–8238. Cited by: §B.2, §2.2.
  • F. Wang and Z. Yu (2025)Coefficients-preserving sampling for reinforcement learning with flow matching. arXiv preprint arXiv:2509.05952. Cited by: Appendix D.
  • R. J. Williams (1992)Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning 8 (3), pp.229–256. Cited by: §3.2.
  • Y. Wu, B. Ruan, C. Tseng, and H. Shuai (2025)Ranking-based preference optimization for diffusion models from implicit user feedback. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, Cited by: Appendix D.
  • S. Yang, Y. Zhang, H. He, L. Pan, X. Li, C. Bai, and X. Li (2025)Steering vision-language-action models as anti-exploration: a test-time scaling approach. arXiv preprint arXiv:2512.02834. Cited by: §B.1, §2.1.
  • A. Ye, B. Wang, C. Ni, G. Huang, G. Zhao, H. Li, J. Li, J. Zhu, L. Feng, et al. (2025)Gigabrain-0: a world model-powered vision-language-action model. arXiv preprint arXiv:2510.19430. Cited by: §1.
  • C. Yu, Y. Wang, Z. Guo, H. Lin, S. Xu, H. Zang, Q. Zhang, Y. Wu, C. Zhu, J. Hu, et al. (2025)Rlinf: flexible and efficient large-scale reinforcement learning via macro-to-micro flow transformation. arXiv preprint arXiv:2509.15965. Cited by: §5.1.
  • T. Zhang, C. Da, K. Ding, H. Yang, K. Jin, Y. Li, T. Gao, D. Zhang, S. Xiang, and C. Pan (2025a)Diffusion model as a noise-aware latent reward model for step-level preference optimization. arXiv preprint arXiv:2502.01051. Cited by: §B.2, §2.2.
  • T. Zhang, C. Yu, S. Su, and Y. Wang (2025b)ReinFlow: fine-tuning flow matching policy with online reinforcement learning. External Links: 2505.22094, LinkCited by: §B.2, §2.2.
  • K. Zheng, H. Chen, H. Ye, H. Wang, Q. Zhang, K. Jiang, H. Su, S. Ermon, J. Zhu, and M. Liu (2026)DiffusionNFT: online diffusion reinforcement with forward process. In The Fourteenth International Conference on Learning Representations, External Links: LinkCited by: §B.2, §1, §2.2, §4.
  • D. M. Ziegler, N. Stiennon, J. Wu, T. B. Brown, A. Radford, D. Amodei, P. Christiano, and G. Irving (2019)Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593. Cited by: Appendix D.
  • B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xia, J. Wu, P. Wohlhart, S. Welker, A. Wahid, et al. (2023)Rt-2: vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pp.2165–2183. Cited by: §B.1, §1, §2.1.

Appendix A Theoretical Analysis and Proofs

In this section, we provide detailed proofs for the lemmas, propositions, and theorems presented in the main text.

A.1 Proof of Equation 3 (Affine Mean Derivation)

We derive the explicit forms of U t U_{t} and B t B_{t} stated in Equation 3.

The flow-SDE solver computes the next mean μ t​(v)\mu_{t}(v) by mixing the rectified flow endpoints x 0 pred=x t−v⋅t x_{0}^{\text{pred}}=x_{t}-v\cdot t and x 1 pred=x t+v⋅(1−t)x_{1}^{\text{pred}}=x_{t}+v\cdot(1-t) with weights derived from the Euler-Maruyama discretization. The weights are given by:

w 0=1−t+δ t,w 1=(t−δ t)−σ t 2​δ t 2​t.w_{0}=1-t+\delta_{t},\qquad w_{1}=(t-\delta_{t})-\frac{\sigma_{t}^{2}\delta_{t}}{2t}.(14)

The mean is given by the linear combination:

\mu_{t}(v)=w_{0}x_{0}^{\text{pred}}+w_{1}x_{1}^{\text{pred}}=w_{0}(x_{t}-vt)+w_{1}(x_{t}+v(1-t))=(w_{0}+w_{1})x_{t}+\big(-tw_{0}+(1-t)w_{1}\big)v.(15)

Directly computing the coefficients for x t x_{t} and v v yields:

Coefficient for x t x_{t} (U t U_{t}):

U t=w 0+w 1=(1−t+δ t)+(t−δ t−σ t 2​δ t 2​t)=1−σ t 2​δ t 2​t.U_{t}=w_{0}+w_{1}=(1-t+\delta_{t})+\left(t-\delta_{t}-\frac{\sigma_{t}^{2}\delta_{t}}{2t}\right)=1-\frac{\sigma_{t}^{2}\delta_{t}}{2t}.(16)

Coefficient for v v (B t B_{t}):

B t\displaystyle B_{t}=−t​w 0+(1−t)​w 1\displaystyle=-tw_{0}+(1-t)w_{1} =−t​(1−t+δ t)+(1−t)​(t−δ t−σ t 2​δ t 2​t)\displaystyle=-t(1-t+\delta_{t})+(1-t)\left(t-\delta_{t}-\frac{\sigma_{t}^{2}\delta_{t}}{2t}\right) =−t+t 2−t​δ t+(t−δ t−σ t 2​δ t 2​t−t 2+t​δ t+t​σ t 2​δ t 2​t)\displaystyle=-t+t^{2}-t\delta_{t}+(t-\delta_{t}-\frac{\sigma_{t}^{2}\delta_{t}}{2t}-t^{2}+t\delta_{t}+t\frac{\sigma_{t}^{2}\delta_{t}}{2t}) =−δ t−(1−t)​σ t 2​δ t 2​t.\displaystyle=-\delta_{t}-(1-t)\frac{\sigma_{t}^{2}\delta_{t}}{2t}.(17)

Thus, μ t​(v)=U t​(x t,t)+B t​(t)​v\mu_{t}(v)=U_{t}(x_{t},t)+B_{t}(t)v, matching the affine form in Equation(3). ∎
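As a quick numerical sanity check of Eqs. (16)–(17) (a sketch, not part of the paper's codebase; the values of $t$, $\delta_t$, $\sigma_t$ below are arbitrary test points), the snippet verifies that mixing the rectified-flow endpoints with the weights of Eq. (14) reproduces the closed-form affine mean $U_t x_t + B_t v$:

```python
import numpy as np

def solver_mean_from_endpoints(x_t, v, t, delta_t, sigma_t):
    """Mix the rectified-flow endpoints with the Euler-Maruyama weights (Eq. 14)."""
    w0 = 1 - t + delta_t
    w1 = (t - delta_t) - sigma_t**2 * delta_t / (2 * t)
    x0_pred = x_t - v * t          # predicted endpoint at t = 0
    x1_pred = x_t + v * (1 - t)    # predicted endpoint at t = 1
    return w0 * x0_pred + w1 * x1_pred

def affine_mean(x_t, v, t, delta_t, sigma_t):
    """Closed-form affine mean U_t * x_t + B_t * v (Eqs. 16-17)."""
    U_t = 1 - sigma_t**2 * delta_t / (2 * t)
    B_t = -delta_t - (1 - t) * sigma_t**2 * delta_t / (2 * t)
    return U_t * x_t + B_t * v

rng = np.random.default_rng(0)
x_t, v = rng.normal(size=4), rng.normal(size=4)
t, delta_t, sigma_t = 0.7, 0.25, 0.2   # arbitrary test values
assert np.allclose(solver_mean_from_endpoints(x_t, v, t, delta_t, sigma_t),
                   affine_mean(x_t, v, t, delta_t, sigma_t))
```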

A.2 Proof of Lemma 4.2 (Log-Likelihood Ratio)

We prove that the difference of variance-normalized errors equals the log-likelihood ratio. Let $q(x) = \mathcal{N}(\mu, \Sigma)$. Its log-density is:

$$\log q(x) = -\frac{1}{2}(x-\mu)^{\top}\Sigma^{-1}(x-\mu) - \frac{1}{2}\log\det(2\pi\Sigma). \tag{18}$$

The two branches $q_t^{+} = \mathcal{N}(\mu_t^{+}, \Sigma_t)$ and $q_t^{-} = \mathcal{N}(\mu_t^{-}, \Sigma_t)$ share the same covariance $\Sigma_t$. Subtracting their log-densities cancels the normalization constant:

$$\begin{aligned}
\log q_t^{+}(x_{t^-}) - \log q_t^{-}(x_{t^-})
&= -\frac{1}{2}\left((x_{t^-}-\mu_t^{+})^{\top}\Sigma_t^{-1}(x_{t^-}-\mu_t^{+}) - (x_{t^-}-\mu_t^{-})^{\top}\Sigma_t^{-1}(x_{t^-}-\mu_t^{-})\right) \\
&= -\frac{1}{2}\left(\|x_{t^-}-\mu_t^{+}\|_{\Sigma_t^{-1}}^{2} - \|x_{t^-}-\mu_t^{-}\|_{\Sigma_t^{-1}}^{2}\right) \\
&= -\frac{1}{2}\left(E_\theta^{+} - E_\theta^{-}\right).
\end{aligned} \tag{19}$$

This concludes the proof. ∎
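Lemma 4.2 can be checked numerically for a pair of Gaussians with a shared covariance. The sketch below is not from the paper; the dimension and random test values are arbitrary:

```python
import numpy as np

def log_gauss(x, mu, Sigma):
    """Log-density of N(mu, Sigma) evaluated at x."""
    d = x - mu
    _, logdet = np.linalg.slogdet(2 * np.pi * Sigma)
    return -0.5 * d @ np.linalg.solve(Sigma, d) - 0.5 * logdet

rng = np.random.default_rng(1)
dim = 3
A = rng.normal(size=(dim, dim))
Sigma = A @ A.T + dim * np.eye(dim)   # shared SPD covariance
mu_plus, mu_minus, x = (rng.normal(size=dim) for _ in range(3))

# Variance-normalized step errors for the two branches.
E_plus = (x - mu_plus) @ np.linalg.solve(Sigma, x - mu_plus)
E_minus = (x - mu_minus) @ np.linalg.solve(Sigma, x - mu_minus)

# The normalization constants cancel: the log-ratio is -1/2 (E+ - E-).
lhs = log_gauss(x, mu_plus, Sigma) - log_gauss(x, mu_minus, Sigma)
assert np.isclose(lhs, -0.5 * (E_plus - E_minus))
```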

A.3 Proof of Proposition 4.3 (Bayes Monotonicity)

Fix $(x_t, c)$ and denote the prior success probability conditioned on $(x_t, c)$ by $\pi(x_t, c) \triangleq \mathbb{P}(o=1 \mid x_t, c) \in (0,1)$.

By Bayes' rule, the posterior success probability given the observed next state $x_{t^-}$ is

$$\eta(x_{t^-}; x_t, c) \triangleq \mathbb{P}(o=1 \mid x_{t^-}, x_t, c) = \frac{p(x_{t^-} \mid x_t, c, o=1)\,\pi(x_t, c)}{p(x_{t^-} \mid x_t, c, o=1)\,\pi(x_t, c) + p(x_{t^-} \mid x_t, c, o=0)\,(1-\pi(x_t, c))}. \tag{20}$$

Define the oracle likelihood ratio

$$\Lambda(x_{t^-}; x_t, c) \triangleq \frac{p(x_{t^-} \mid x_t, c, o=1)}{p(x_{t^-} \mid x_t, c, o=0)}.$$

Dividing the numerator and denominator by $p(x_{t^-} \mid x_t, c, o=0)$ yields

$$\eta(x_{t^-}; x_t, c) = \frac{\Lambda(x_{t^-}; x_t, c)\,\pi(x_t, c)}{\Lambda(x_{t^-}; x_t, c)\,\pi(x_t, c) + (1-\pi(x_t, c))}. \tag{21}$$

For constants $a = \pi(x_t, c) > 0$ and $b = 1 - \pi(x_t, c) > 0$, consider $f(\lambda) = \frac{a\lambda}{a\lambda + b}$ for $\lambda > 0$. Its derivative is

$$f'(\lambda) = \frac{ab}{(a\lambda + b)^2} > 0.$$

Therefore $\eta(x_{t^-}; x_t, c)$ is strictly increasing in the likelihood ratio $\Lambda(x_{t^-}; x_t, c)$, completing the proof. ∎
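The monotonicity claim is easy to confirm numerically. The sketch below (not from the paper; the prior values and the $\lambda$ grid are arbitrary) checks that $f(\lambda) = a\lambda/(a\lambda + b)$ is strictly increasing for any prior in $(0,1)$:

```python
import numpy as np

def posterior(lam, prior):
    """eta as a function of the likelihood ratio lam, for prior a = pi(x_t, c)."""
    a, b = prior, 1.0 - prior
    return a * lam / (a * lam + b)

lams = np.linspace(0.01, 50.0, 2000)   # arbitrary positive test grid
for prior in (0.05, 0.5, 0.95):        # arbitrary priors in (0, 1)
    eta = posterior(lams, prior)
    # f'(lam) = a*b / (a*lam + b)^2 > 0, so eta must be strictly increasing.
    assert np.all(np.diff(eta) > 0)
```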

A.4 Oracle Splits from Diffusion-NFT

Notation and symbol disambiguation.

We use $\kappa_t(x_t \mid x_0)$ to denote the forward/noising kernel that maps a terminal sample $x_0$ to an intermediate noisy state $x_t$ (as in diffusion models).

Setup.

Fix a context $c$. Let $x_0$ denote the terminal sample (e.g., the final solver output used to form an action), with rollout terminal distribution $\pi_0^{\mathrm{old}}(x_0 \mid c)$ induced by $\pi_{\theta^{\mathrm{old}}}$. Let $\kappa_t(x_t \mid x_0)$ be the forward kernel and define the induced marginal

$$\pi_t^{\mathrm{old}}(x_t \mid c) = \int \kappa_t(x_t \mid x_0)\,\pi_0^{\mathrm{old}}(x_0 \mid c)\,dx_0,$$

and the diffusion posterior

$$\pi_{0|t}^{\mathrm{old}}(x_0 \mid x_t, c) = \frac{\kappa_t(x_t \mid x_0)\,\pi_0^{\mathrm{old}}(x_0 \mid c)}{\pi_t^{\mathrm{old}}(x_t \mid c)}.$$

Optimality variable.

Introduce a latent optimality variable $o \in \{0,1\}$ and an instance-level score $r(x_0, c) \in [0,1]$ satisfying $r(x_0, c) = \mathbb{P}(o=1 \mid x_0, c)$. This is the same setup as Diffusion-NFT; see their appendix for proofs.

Lemma A.1 (Distribution Split (Diffusion-NFT)).

Let $p(c) \triangleq \mathbb{E}_{x_0 \sim \pi_0^{\mathrm{old}}(\cdot \mid c)}[r(x_0, c)]$. Define the oracle terminal distributions

$$\pi_0^{+}(x_0 \mid c) = \frac{r(x_0, c)\,\pi_0^{\mathrm{old}}(x_0 \mid c)}{p(c)}, \qquad \pi_0^{-}(x_0 \mid c) = \frac{(1 - r(x_0, c))\,\pi_0^{\mathrm{old}}(x_0 \mid c)}{1 - p(c)}.$$

Then

$$\pi_0^{\mathrm{old}}(x_0 \mid c) = p(c)\,\pi_0^{+}(x_0 \mid c) + (1 - p(c))\,\pi_0^{-}(x_0 \mid c).$$

Reference. Diffusion-NFT Appendix (Distribution Split).

Lemma A.2 (Posterior Split (Diffusion-NFT)).

Let $\pi_{0|t}^{\pm}(x_0 \mid x_t, c)$ be the posteriors induced by $\pi_0^{\pm}$:

$$\pi_{0|t}^{\pm}(x_0 \mid x_t, c) = \frac{\kappa_t(x_t \mid x_0)\,\pi_0^{\pm}(x_0 \mid c)}{\pi_t^{\pm}(x_t \mid c)}, \qquad \pi_t^{\pm}(x_t \mid c) = \int \kappa_t(x_t \mid x_0)\,\pi_0^{\pm}(x_0 \mid c)\,dx_0.$$

Then the rollout posterior satisfies

$$\pi_{0|t}^{\mathrm{old}}(x_0 \mid x_t, c) = \alpha(x_t, c)\,\pi_{0|t}^{+}(x_0 \mid x_t, c) + \bigl(1 - \alpha(x_t, c)\bigr)\,\pi_{0|t}^{-}(x_0 \mid x_t, c),$$

where the mixing weight is

$$\alpha(x_t, c) = \mathbb{P}(o=1 \mid x_t, c) = p(c)\,\frac{\pi_t^{+}(x_t \mid c)}{\pi_t^{\mathrm{old}}(x_t \mid c)}.$$

Reference. Diffusion-NFT Appendix (Posterior Split).

Corollary A.3 (Posterior Expectation Split).

Under Lemma A.2, for any integrable function $\phi(x_0)$,

$$\mathbb{E}_{\pi_{0|t}^{\mathrm{old}}(\cdot \mid x_t, c)}[\phi(x_0)] = \alpha(x_t, c)\,\mathbb{E}_{\pi_{0|t}^{+}(\cdot \mid x_t, c)}[\phi(x_0)] + (1 - \alpha(x_t, c))\,\mathbb{E}_{\pi_{0|t}^{-}(\cdot \mid x_t, c)}[\phi(x_0)].$$

Oracle vs. constructed branches.

The oracle objects $(\pi_0^{\pm}, \pi_{0|t}^{\pm})$ are defined by conditioning on the latent outcome $o$. In contrast, our method constructs mirrored branches $v_\theta^{\pm} = v^{\mathrm{old}} \pm \beta(v_\theta - v^{\mathrm{old}})$ and the induced solver transitions $q_{\theta,t}^{\pm}(x_{t^-} \mid x_t, c)$ (Section 4.1).
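The mirrored-branch construction is simple to state in code. The sketch below illustrates it with placeholder values ($\beta$, the affine coefficients $A_t$, $B_t$, and all vectors are arbitrary, not settings from the paper) and checks the symmetry of the induced branch means around the rollout mean:

```python
import numpy as np

def mirrored_branches(v_old, v_theta, beta):
    """Constructed velocities v_old +/- beta * (v_theta - v_old)."""
    dv = v_theta - v_old
    return v_old + beta * dv, v_old - beta * dv

def affine_mean(x_t, v, A_t, B_t):
    """One-step solver mean mu_t(v) = A_t x_t + B_t v (Eq. 3)."""
    return A_t * x_t + B_t * v

rng = np.random.default_rng(2)
x_t, v_old, v_theta = (rng.normal(size=4) for _ in range(3))
beta, A_t, B_t = 0.9, 1.0, -0.25   # placeholder scalars

v_p, v_m = mirrored_branches(v_old, v_theta, beta)
mu_p = affine_mean(x_t, v_p, A_t, B_t)
mu_m = affine_mean(x_t, v_m, A_t, B_t)
mu_old = affine_mean(x_t, v_old, A_t, B_t)

# The branch means sit symmetrically around the rollout mean (cf. Eq. 24).
assert np.allclose(mu_p - mu_old, mu_old - mu_m)
```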

Lemma A.4 (Oracle Velocity/Mean Splits (for alignment)).

Assume the (oracle) velocity field at solver time $t$ can be expressed as a posterior expectation under $\pi_{0|t}(\cdot \mid x_t, c)$, i.e., $v(x_t, c, t) = \mathbb{E}_{\pi_{0|t}(\cdot \mid x_t, c)}[\psi(x_0, x_t, c, t)]$ for some function $\psi$. Define oracle velocities $v^{\pm}$ by taking the same expectation under $\pi_{0|t}^{\pm}$. Then

$$v^{\mathrm{old}}(x_t, c, t) = \alpha(x_t, c)\,v^{+}(x_t, c, t) + (1 - \alpha(x_t, c))\,v^{-}(x_t, c, t).$$

If additionally the one-step solver mean admits the affine form $\mu_t(v) = A_t x_t + B_t v$ (Eq. 3), then

$$\mu_t^{\mathrm{old}}(x_t, c) = \alpha(x_t, c)\,\mu_t^{+}(x_t, c) + (1 - \alpha(x_t, c))\,\mu_t^{-}(x_t, c), \qquad \Delta\mu_t^{\star}(x_t, c) \triangleq \mu_t^{+}(x_t, c) - \mu_t^{-}(x_t, c) = B_t\bigl(v^{+} - v^{-}\bigr).$$

Remark. This lemma is a direct consequence of Corollary A.3 and the linearity of expectation.
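The distribution and posterior splits of Lemmas A.1–A.2 can be verified exactly in a toy discrete setting. In the sketch below (not from the paper; the grid sizes and random distributions are arbitrary), $x_0$ lives on a finite grid, $r(x_0)$ is an arbitrary score in $[0,1]$, and the forward kernel is an arbitrary row-stochastic matrix:

```python
import numpy as np

rng = np.random.default_rng(3)
n_x0, n_xt = 5, 4
pi0_old = rng.dirichlet(np.ones(n_x0))            # rollout terminal distribution
r = rng.uniform(size=n_x0)                        # instance-level score r(x0)
kappa = rng.dirichlet(np.ones(n_xt), size=n_x0)   # forward kernel kappa(x_t | x0)

p = pi0_old @ r                                   # p(c) = E[r(x0, c)]
pi0_plus = r * pi0_old / p
pi0_minus = (1 - r) * pi0_old / (1 - p)
# Lemma A.1: the oracle-branch mixture recovers the rollout distribution.
assert np.allclose(pi0_old, p * pi0_plus + (1 - p) * pi0_minus)

pit_old = pi0_old @ kappa                         # induced marginals over x_t
pit_plus = pi0_plus @ kappa
alpha = p * pit_plus / pit_old                    # mixing weight of Lemma A.2

post_old = kappa * pi0_old[:, None] / pit_old     # posteriors pi(x0 | x_t)
post_plus = kappa * pi0_plus[:, None] / pit_plus
post_minus = kappa * pi0_minus[:, None] / (pi0_minus @ kappa)
# Lemma A.2: the rollout posterior is the alpha-mixture of the branch posteriors.
assert np.allclose(post_old, alpha * post_plus + (1 - alpha) * post_minus)
```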

A.5 Proof of Theorem 4.4 (Gradient Form and Alignment)

Fix a sampled training tuple $(x_t, x_{t^-}, c)$ collected under the rollout policy $v^{\mathrm{old}} = \pi_{\theta^{\mathrm{old}}}(c, x_t, t)$. Let the rollout one-step mean be $\mu_t^{\mathrm{old}} := \mu_t(v^{\mathrm{old}})$ and define the residual $e_t \triangleq x_{t^-} - \mu_t^{\mathrm{old}}$.

Recall that our constructed mirrored branches are $v_\theta^{\pm} = v^{\mathrm{old}} \pm \beta(v_\theta - v^{\mathrm{old}})$ with $\Delta v_\theta \triangleq v_\theta - v^{\mathrm{old}}$, and that the one-step mean admits the affine form (Eq. 3) $\mu_t(v) = A_t x_t + B_t v$, with shared covariance $\Sigma_t$.

Step 1: Constructed mean shift.

Define the constructed branch means $\mu_{\theta,t}^{\pm} \triangleq \mu_t(v_\theta^{\pm})$.

Using the affine mean form,

$$\mu_{\theta,t}^{+} - \mu_t^{\mathrm{old}} = B_t\,(v_\theta^{+} - v^{\mathrm{old}}) = B_t\bigl(\beta(v_\theta - v^{\mathrm{old}})\bigr) = \beta B_t \Delta v_\theta \triangleq d_t, \tag{22}$$

$$\mu_t^{\mathrm{old}} - \mu_{\theta,t}^{-} = B_t\,(v^{\mathrm{old}} - v_\theta^{-}) = B_t\bigl(\beta(v_\theta - v^{\mathrm{old}})\bigr) = d_t. \tag{23}$$

Hence,

$$\mu_{\theta,t}^{\pm} = \mu_t^{\mathrm{old}} \pm d_t, \qquad d_t = \beta B_t \Delta v_\theta. \tag{24}$$

Step 2: Error difference (Theorem 4.4.a).

Recall the step errors $E_\theta^{\pm} = \|x_{t^-} - \mu_{\theta,t}^{\pm}\|_{\Sigma_t^{-1}}^{2}$.

Substituting $\mu_{\theta,t}^{\pm} = \mu_t^{\mathrm{old}} \pm d_t$ and $e_t = x_{t^-} - \mu_t^{\mathrm{old}}$ gives

$$\begin{aligned}
E_\theta^{+} - E_\theta^{-} &= \|e_t - d_t\|_{\Sigma_t^{-1}}^{2} - \|e_t + d_t\|_{\Sigma_t^{-1}}^{2} \\
&= \bigl(\|e_t\|_{\Sigma_t^{-1}}^{2} + \|d_t\|_{\Sigma_t^{-1}}^{2} - 2\langle e_t, d_t\rangle_{\Sigma_t^{-1}}\bigr) - \bigl(\|e_t\|_{\Sigma_t^{-1}}^{2} + \|d_t\|_{\Sigma_t^{-1}}^{2} + 2\langle e_t, d_t\rangle_{\Sigma_t^{-1}}\bigr) \\
&= -4\langle e_t, d_t\rangle_{\Sigma_t^{-1}} = -4\langle \Sigma_t^{-1} e_t, d_t\rangle,
\end{aligned} \tag{25}$$

which proves part (a).
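The identity in Eq. (25) is a one-line algebra fact that can be confirmed numerically. The sketch below uses arbitrary test vectors and an arbitrary SPD precision matrix (not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(4)
dim = 6
A = rng.normal(size=(dim, dim))
Sigma_inv = np.linalg.inv(A @ A.T + dim * np.eye(dim))   # SPD precision
e, d = rng.normal(size=dim), rng.normal(size=dim)        # residual and mean shift

# E+ and E- with mu_plus/minus = mu_old +/- d, so x - mu_plus = e - d, etc.
E_plus = (e - d) @ Sigma_inv @ (e - d)
E_minus = (e + d) @ Sigma_inv @ (e + d)

# The squared-norm terms cancel, leaving only the cross term (Eq. 25).
assert np.isclose(E_plus - E_minus, -4 * e @ Sigma_inv @ d)
```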

Step 3: Gradient form (Theorem 4.4.b).

Define the logit $z_t \triangleq \tfrac{1}{2} y (E_\theta^{+} - E_\theta^{-})$ and the loss $\ell_t(\theta) = \mathrm{softplus}(z_t)$.

Using $\nabla\,\mathrm{softplus}(z) = \sigma(z)$,

$$\nabla_\theta \ell_t(\theta) = \sigma(z_t)\,\nabla_\theta z_t.$$

From Step 2,

$$E_\theta^{+} - E_\theta^{-} = -4\, e_t^{\top} \Sigma_t^{-1} d_t,$$

thus

$$\nabla_\theta z_t = \tfrac{1}{2} y\,\nabla_\theta (E_\theta^{+} - E_\theta^{-}) = \tfrac{1}{2} y\,\nabla_\theta\bigl(-4\, e_t^{\top} \Sigma_t^{-1} d_t\bigr).$$

During optimization we treat the rollout branch $v^{\mathrm{old}}$ (hence $\mu_t^{\mathrm{old}}$ and $e_t$) as constant with respect to $\theta$, so only $d_t$ depends on $\theta$. From Equation (24), $d_t = \beta B_t (v_\theta - v^{\mathrm{old}})$ and therefore

$$\nabla_\theta d_t = \beta B_t \nabla_\theta v_\theta, \qquad \text{equivalently} \quad (\nabla_\theta d_t)^{\top} = \left(\frac{\partial v_\theta}{\partial \theta}\right)^{\top} \beta B_t^{\top}.$$

Absorbing constant scalar factors into $\propto$, we obtain

$$-\nabla_\theta \ell_t(\theta) \propto \sigma(z_t)\, y \left(\frac{\partial v_\theta}{\partial \theta}\right)^{\top} B_t \Sigma_t^{-1} e_t, \tag{26}$$

which proves part (b).

Step 4: Small-step alignment (general $r \in [0,1]$ and the binary case).

We relate the conditional expected update direction to an oracle improvement signal. Recall that the signed label is defined in the main text as $y \triangleq 2r - 1$, $r \in [0,1]$, where $r$ is the observed terminal signal (e.g., a success indicator or a normalized score).

Conditioned on $(x_t, c)$, the rollout residual $e_t \triangleq x_{t^-} - \mu_t^{\mathrm{old}}(x_t, c)$ is zero-mean since $\mu_t^{\mathrm{old}}(x_t, c) = \mathbb{E}[x_{t^-} \mid x_t, c]$, hence

$$\mathbb{E}[e_t \mid x_t, c] = 0. \tag{27}$$

Therefore,

$$\begin{aligned}
\mathbb{E}[y e_t \mid x_t, c] &= \mathbb{E}[(2r - 1) e_t \mid x_t, c] \\
&= 2\,\mathbb{E}[r e_t \mid x_t, c] - \mathbb{E}[e_t \mid x_t, c] \\
&= 2\,\mathbb{E}[r e_t \mid x_t, c].
\end{aligned} \tag{28}$$

(i) General case ($r \in [0,1]$). Equation (28) shows that the conditional expected direction is governed by the correlation between the terminal signal $r$ and the local rollout residual $e_t$. In general, this produces a mixture of success- and failure-associated components (and does not reduce to a single oracle mean-gap direction without additional assumptions relating $r$ to the latent optimality variable).

(ii) Binary case ($r \in \{0,1\}$ with $r = o$). In sparse-success RL we often take $r$ to be the episode success indicator, i.e., $r = o \in \{0,1\}$ where $o$ is the latent optimality variable. Recall $\alpha(x_t, c) \triangleq \mathbb{P}(o=1 \mid x_t, c)$ from Lemma A.2.

Then, using the indicator property of $o$ and Equation (27),

$$\begin{aligned}
\mathbb{E}[y e_t \mid x_t, c] &= \mathbb{E}[(2o - 1) e_t \mid x_t, c] \\
&= 2\,\mathbb{E}[o e_t \mid x_t, c] - \mathbb{E}[e_t \mid x_t, c] \\
&= 2\,\mathbb{E}[o (x_{t^-} - \mu_t^{\mathrm{old}}) \mid x_t, c] \\
&= 2\,\mathbb{P}(o=1 \mid x_t, c)\,\mathbb{E}[x_{t^-} - \mu_t^{\mathrm{old}} \mid x_t, c, o=1] \\
&= 2\alpha(x_t, c)\bigl(\mu_t^{+}(x_t, c) - \mu_t^{\mathrm{old}}(x_t, c)\bigr),
\end{aligned} \tag{29}$$

where $\mu_t^{+}(x_t, c) \triangleq \mathbb{E}[x_{t^-} \mid x_t, c, o=1]$ is the oracle success-branch mean.

By Lemma A.4, the oracle mean-mixture identity holds:

$$\mu_t^{\mathrm{old}}(x_t, c) = \alpha(x_t, c)\,\mu_t^{+}(x_t, c) + (1 - \alpha(x_t, c))\,\mu_t^{-}(x_t, c),$$

hence

$$\mu_t^{+}(x_t, c) - \mu_t^{\mathrm{old}}(x_t, c) = (1 - \alpha(x_t, c))\bigl(\mu_t^{+}(x_t, c) - \mu_t^{-}(x_t, c)\bigr) = (1 - \alpha(x_t, c))\,\Delta\mu_t^{\star}(x_t, c),$$

where $\Delta\mu_t^{\star}(x_t, c) \triangleq \mu_t^{+}(x_t, c) - \mu_t^{-}(x_t, c)$ is the oracle mean gap. Substituting into Equation (29) yields

$$\mathbb{E}[y e_t \mid x_t, c] = 2\alpha(x_t, c)(1 - \alpha(x_t, c))\,\Delta\mu_t^{\star}(x_t, c). \tag{30}$$

Finally, at the start of training (or for sufficiently small updates), where $v_\theta \approx v^{\mathrm{old}}$ so that $\sigma(z_t) \approx \mathrm{const}$, taking the conditional expectation of the gradient form in Step 3 and using Equation (30) gives

$$\mathbb{E}[-\nabla_\theta \ell_t(\theta) \mid x_t, c] \parallel \left(\frac{\partial v_\theta}{\partial \theta}\right)^{\top} B_t \Sigma_t^{-1} \Delta\mu_t^{\star}(x_t, c),$$

where scalar factors such as $2\alpha(1-\alpha)$ do not affect alignment. This proves part (c). $\square$
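Equation (30) can be checked exactly in a toy binary setting, because the expectation over $o$ is a two-term sum. The sketch below is illustrative only; $\alpha$ and the branch means are arbitrary test values:

```python
import numpy as np

rng = np.random.default_rng(5)
alpha = 0.3                                       # arbitrary P(o=1 | x_t, c)
mu_plus, mu_minus = rng.normal(size=3), rng.normal(size=3)
mu_old = alpha * mu_plus + (1 - alpha) * mu_minus  # rollout mean (Lemma A.4)

# E[y e_t | x_t, c] over o: y = 2o - 1, and E[e_t | o] = mu^{o} - mu_old
# (any zero-mean noise in x_{t-} drops out of the expectation).
expected = (alpha * (+1) * (mu_plus - mu_old)
            + (1 - alpha) * (-1) * (mu_minus - mu_old))

# Eq. (30): the expected signed residual aligns with the oracle mean gap.
assert np.allclose(expected, 2 * alpha * (1 - alpha) * (mu_plus - mu_minus))
```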

A.6 Proof of Theorem 4.5 (Comparison with wMSE)

A.6.1 Decomposition of wMSE

We substitute $E_\theta^{\pm} = \|e_t \mp d_t\|_{\Sigma_t^{-1}}^{2}$ into the weighted-MSE objective $L_{\text{wMSE}} = r E_\theta^{+} + (1-r) E_\theta^{-}$. Expanding the terms immediately yields:

$$\begin{aligned}
L_{\text{wMSE}} &= r\bigl(\|e_t\|^2 + \|d_t\|^2 - 2\langle \Sigma_t^{-1} e_t, d_t\rangle\bigr) + (1-r)\bigl(\|e_t\|^2 + \|d_t\|^2 + 2\langle \Sigma_t^{-1} e_t, d_t\rangle\bigr) \\
&= (r + 1 - r)\bigl(\|e_t\|^2 + \|d_t\|^2\bigr) + 2\langle \Sigma_t^{-1} e_t, d_t\rangle\bigl(-r + (1-r)\bigr) \\
&= \text{const} + \|d_t\|_{\Sigma_t^{-1}}^{2} + 2(1 - 2r)\langle \Sigma_t^{-1} e_t, d_t\rangle.
\end{aligned} \tag{31}$$

Using $y = 2r - 1$ (which implies $1 - 2r = -y$), we obtain:

$$L_{\text{wMSE}} = \text{const} - 2y\langle \Sigma_t^{-1} e_t, d_t\rangle + \|d_t\|_{\Sigma_t^{-1}}^{2}. \tag{32}$$
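The decomposition of Eq. (32) is another algebraic identity that can be confirmed numerically. The sketch below uses arbitrary test values (dimension, $r$, vectors, precision matrix) that are not from the paper:

```python
import numpy as np

rng = np.random.default_rng(6)
dim = 5
A = rng.normal(size=(dim, dim))
Sigma_inv = np.linalg.inv(A @ A.T + dim * np.eye(dim))   # SPD precision
e, d = rng.normal(size=dim), rng.normal(size=dim)
r = 0.8                                                  # arbitrary reward in [0, 1]
y = 2 * r - 1

E_plus = (e - d) @ Sigma_inv @ (e - d)
E_minus = (e + d) @ Sigma_inv @ (e + d)
L_wmse = r * E_plus + (1 - r) * E_minus

# Eq. (32): up to the constant ||e||^2, wMSE is a linear-plus-quadratic form in d.
const = e @ Sigma_inv @ e
assert np.isclose(L_wmse,
                  const - 2 * y * (e @ Sigma_inv @ d) + d @ Sigma_inv @ d)
```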

A.6.2 Ranking Calibration

Define the ranking-error event at a sampled step $t$:

$$\mathcal{E}_t := \{y(x_0, c)\cdot(E_\theta^{+} - E_\theta^{-}) > 0\}. \tag{33}$$

$\pi$-StepNFT minimizes $\mathrm{softplus}(y(E^{+} - E^{-}))$, which is a convex upper bound on the indicator $\mathbf{1}_{\mathcal{E}_t}$, thus directly minimizing ranking errors. In contrast, wMSE minimizes a regression loss with the penalty $\|d_t\|^2$ (from the decomposition in A.6.1), which restricts the branch separation required to satisfy the ranking condition $\mathcal{E}_t$ when the margin is small.

A.6.3 Binary Case Analysis

When $r \in \{0,1\}$:

- wMSE: If $r = 1$, $L_{\text{wMSE}} = E^{+}$. It pulls $\mu^{+}$ toward $x_{t^-}$ but provides no signal to $\mu^{-}$.
- $\pi$-StepNFT: minimizes $\mathrm{softplus}(E^{+} - E^{-})$. It simultaneously pulls $\mu^{+}$ toward $x_{t^-}$ and pushes $\mu^{-}$ away. This "push-pull" dynamic generates stronger gradients for discrimination. ∎
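The push-pull contrast can be made concrete with a small numerical example. The sketch below (illustrative only; identity covariance, gradients taken directly with respect to the branch means, and all vectors are arbitrary) shows that for $r = 1$ the wMSE gradient on the negative branch vanishes while the contrastive objective still pushes $\mu^{-}$ away from the observation:

```python
import numpy as np

# Arbitrary observation and branch means (identity covariance for simplicity).
x = np.array([1.0, 0.0])
mu_p = np.array([0.5, 0.0])
mu_m = np.array([0.8, 0.0])

E_p = np.sum((x - mu_p) ** 2)
E_m = np.sum((x - mu_m) ** 2)

# wMSE with r = 1 reduces to E_plus: its gradient w.r.t. mu_minus is zero.
grad_wmse_mu_m = np.zeros_like(mu_m)

# softplus(E_plus - E_minus): d/d(mu_minus) = sigmoid(E_p - E_m) * 2 (x - mu_m),
# so a gradient-descent step moves mu_minus AWAY from x (the "push").
sig = 1.0 / (1.0 + np.exp(-(E_p - E_m)))
grad_nft_mu_m = sig * 2 * (x - mu_m)

assert np.all(grad_wmse_mu_m == 0)
assert np.linalg.norm(grad_nft_mu_m) > 0   # nonzero push on the negative branch
```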

Appendix B Detailed Related Works

B.1 Online RL for VLAs

VLA models map multimodal inputs to actions via diverse representations: discretizing actions into tokens (RT series (Zitkovich et al., 2023), OpenVLA (Kim et al., 2024)), mapping to continuous regression features (OpenVLA-OFT (Kim et al., 2025b)), or outputting actions via generative denoising processes (Octo (Ghosh et al., 2024), GR00T (Bjorck et al., 2025), OpenPi (Black et al., 2026, 2025; Intelligence et al., 2025)). While pre-training establishes broad capabilities, the post-training focus is shifting from SFT to online RL to bridge the domain gap. How RL is adapted depends on these representations: discrete approaches (VLA-RL (Lu et al., 2025), RL4VLA (Liu et al., 2026)) leverage accessible token probabilities, while continuous mappings (SimpleVLA-RL (Li et al., 2025a)) treat outputs as Gaussian means. Flow-based VLAs, however, face intractable likelihoods due to multi-step ODE sampling. Some methods bypass likelihood calculation entirely: GR-RL (Li et al., 2025b) distills value functions in the latent space, while $\pi^{*}_{0.6}$ utilizes preference-based feedback. Conversely, $\pi_{\texttt{RL}}$ (Chen et al., 2025) addresses this by transforming the deterministic ODE into an SDE or by adding auxiliary noise networks. Crucially, this noise injection serves a dual purpose: it not only enables a tractable likelihood approximation but also significantly enhances exploration. The importance of noise-induced exploration is further echoed by test-time scaling strategies such as TACO (Yang et al., 2025) and Hume (Song et al., 2025), as well as by DSRL (Wagenmaker et al., 2025), which runs RL directly in the diffusion noise space.

B.2 Policy Optimization for Generative Models

Integrating online RL into generative models typically follows three paradigms for handling intractable likelihoods. Explicit Gradient and Advantage Methods. Approaches such as DDPO (Black et al., 2023) and DPOK (Fan et al., 2023) treat denoising as a sequential decision process. Flow-GRPO (Liu et al., 2025) and ReinFlow (Zhang et al., 2025b) further facilitate this by converting ODEs to SDEs or using Gaussian approximations to enable policy-gradient updates. Reward-Weighted Likelihood-Free Methods. To avoid exact likelihood computation, methods such as RWFM (Pfrommer et al., 2025; Fan et al., 2025) and FPO (McAllister et al., 2025) construct proxy objectives or advantage-weighted ratios, effectively optimizing the flow model via regression targets derived from high-reward samples. However, these paradigms often suffer from high variance in gradient estimation or rely on complex reward proxies to stabilize training. Preference and Contrastive Methods. These approaches align distributions via ranking losses, bypassing explicit advantages. Diffusion-DPO (Wallace et al., 2024) aligns models based on trajectory outcomes, while LPO (Zhang et al., 2025a) enforces fine-grained consistency at the latent noise-step level. Uniquely, Diffusion-NFT (Zheng et al., 2026) proposes a solver-agnostic framework that constructs implicit positive and negative update directions directly within the forward process, offering a computationally efficient paradigm without explicit likelihoods or value networks.

Appendix C Experiment Details

C.1 Detailed Introduction of Benchmarks

We evaluate on 2 multitask benchmarks.

  • LIBERO (Liu et al., 2023): We follow the standard protocol across four suites (Spatial, Object, Goal, Long), reporting average success rates over 500 episodes (50 states × 10 sub-tasks) per suite. The agent receives dual 224×224 RGB inputs, language instructions, and 7-dimensional proprioceptive states (6-DoF joints + gripper), and outputs continuous end-effector actions. The environment provides a sparse binary reward (1 for success, 0 otherwise).
  • ManiSkill (Mu et al., 2021): We adopt the PutOnPlateInScene multitask setting from RL4VLA (Liu et al., 2026). This benchmark defines 4,352 compositional tasks derived from 16 objects, 17 receptacles, and 16 tabletop scenes. Observations consist of a single 480×640 third-person view, language instructions, and joint poses. Actions are continuous joint-space commands. The environment provides a composite reward to discourage degenerate throwing behaviors.

Table 3: OOD task mapping for ManiSkill PutOnPlateInScene25* across Vision, Semantics, and Execution categories.

| Category | Sub-category (OOD type) | ManiSkill env ID |
| --- | --- | --- |
| Vision | Unseen Table (background) | PutOnPlateInScene25VisionImage-v1 |
| Vision | Dynamic Textures (foreground, weak) | PutOnPlateInScene25VisionTexture03-v1 |
| Vision | Dynamic Textures (foreground, strong) | PutOnPlateInScene25VisionTexture05-v1 |
| Vision | Dynamic Noise (image-level, weak) | PutOnPlateInScene25VisionWhole03-v1 |
| Vision | Dynamic Noise (image-level, strong) | PutOnPlateInScene25VisionWhole05-v1 |
| Semantics | Unseen Objects | PutOnPlateInScene25Carrot-v1 |
| Semantics | Unseen Receptacles | PutOnPlateInScene25Plate-v1 |
| Semantics | Unseen Instruction Phrasings | PutOnPlateInScene25Instruct-v1 |
| Semantics | Multi-Object | PutOnPlateInScene25MultiCarrot-v1 |
| Semantics | Distractive Receptacle | PutOnPlateInScene25MultiPlate-v1 |
| Execution | Unseen Position | PutOnPlateInScene25Position-v1 |
| Execution | Unseen Robot Init Pose | PutOnPlateInScene25EEPose-v1 |
| Execution | Mid-Episode Object Reposition | PutOnPlateInScene25PositionChangeTo-v1 |

C.2 Hyperparameters for Training

Table 4: Hyperparameter settings for LIBERO and ManiSkill.

| Parameters | LIBERO $\pi_0$ Spatial | Object | Goal | Long | LIBERO $\pi_{0.5}$ Spatial | Object | Goal | Long | ManiSkill $\pi_0$ Multitask | ManiSkill $\pi_{0.5}$ Multitask |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Train epochs | 400 | 400 | 400 | 400 | 400 | 400 | 400 | 400 | 240 | 240 |
| Batch size | 2048 | 2048 | 2048 | 2048 | 2048 | 2048 | 2048 | 2048 | 5120 | 5120 |
| Update epochs | 2 | 2 | 4 | 4 | 1 | 1 | 3 | 4 | 5 | 5 |
| Actor lr | 1e-5 | 1e-5 | 1e-5 | 1e-5 | 1e-5 | 1e-5 | 1e-5 | 1e-5 | 8e-6 | 8e-6 |
| Interaction steps | 240 | 240 | 320 | 480 | 240 | 240 | 320 | 480 | 60 | 60 |
| Parallel environments | 64 | 64 | 64 | 64 | 64 | 64 | 64 | 64 | 64 | 64 |
| Rollout epochs | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | – | – |
| Action chunk $H$ | 5 | 5 | 5 | 10 | 5 | 5 | 5 | 10 | 5 | 5 |
| Denoise steps | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 |
| Noise level $\sigma$ | 0.2 | 0.2 | 0.2 | 0.2 | 0.2 | 0.2 | 0.2 | 0.2 | 0.2 | 0.2 |
| Trust region size $\beta$ | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| Initial decay $\alpha_0$ | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 |
| End decay $\alpha_{-1}$ | 0.995 | 0.995 | 0.995 | 0.995 | 0.995 | 0.995 | 0.995 | 0.995 | 0.995 | 0.995 |

Appendix D Additional Ablation: Step Selection Strategy

Motivation.

Our step-wise supervision is defined on a single solver transition $(x_t \rightarrow x_{t^-})$ sampled from a $K$-step Flow-SDE rollout. In our default implementation, we uniformly sample the solver step index $j \sim \mathcal{U}\{0, \dots, K-1\}$ at each training iteration and construct $(x_t, x_{t^-}, t) = (x_{t_j}, x_{t_{j+1}}, t_j)$. This stochastic step selection exposes the model to transitions at different noise levels and denoising stages, providing more balanced supervision across the entire solver trajectory.
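The default step-selection strategy can be sketched in a few lines. The snippet below is illustrative, not the paper's implementation; `rollout_states` is a hypothetical stand-in for the stored Flow-SDE trajectory, and the shapes and time grid are arbitrary:

```python
import numpy as np

def sample_step_transition(rollout_states, timesteps, rng):
    """Pick one solver transition (x_{t_j} -> x_{t_{j+1}}) uniformly at random.

    rollout_states: (K+1, ...) array of solver states along the rollout.
    timesteps:      (K+1,) array of solver times t_0, ..., t_K.
    """
    K = len(timesteps) - 1
    j = rng.integers(0, K)                 # j ~ Uniform{0, ..., K-1}
    return rollout_states[j], rollout_states[j + 1], timesteps[j]

rng = np.random.default_rng(7)
K = 4
states = rng.normal(size=(K + 1, 8))       # toy 4-step rollout, 8-dim states
ts = np.linspace(1.0, 0.0, K + 1)          # denoising times t_0 > ... > t_K
x_t, x_t_next, t_j = sample_step_transition(states, ts, rng)
assert x_t.shape == x_t_next.shape == (8,)
```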

Ablation setup.

We compare the default Random Step strategy against several Fixed Step variants, where the solver index j j is held constant throughout training. All other configurations (solver, objective, training budget, and environment settings) remain identical.

Results.

As shown in Figure 5, uniformly random step selection achieves more stable optimization and improves the final success rate compared to fixed-step choices. We hypothesize that fixed-step supervision biases learning toward a narrow noise regime, while random step selection provides coverage over multiple denoising stages and thus yields more robust policy learning.


Figure 5: Step selection ablation. Performance comparison between uniform random solver-step sampling and fixed-step selection strategies.
