FlowDiT V2 — Video-to-Navigation (GENESIS)

Part of the GENESIS research framework: video-conditioned robot learning.

Paper: Action Agent: Agentic Video Generation Meets Flow-Constrained Diffusion (IROS 2026)

Code: github.com/jeffrinsam/GENESIS → part2_navigation/

Model Description

FlowDiT V2 is a flow-constrained Diffusion Transformer (DiT) that translates a reference goal video and a current observation image into continuous navigation commands [vx, vy, yaw_rate].

Architecture:

Visual encoder: DINOv2-ViT-B/14 (frozen), which extracts spatial features from the current observation
Flow encoder: RAFT optical flow → temporal flow tokens from the goal video
DiT backbone: Diffusion Transformer with cross-attention between flow tokens and observation features
Output: 3-DOF velocity command at 2 Hz control frequency
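
To make the output format concrete, the sketch below integrates a sequence of [vx, vy, yaw_rate] commands into a planar pose, holding each command for 0.5 s (the period of a 2 Hz controller). It assumes the velocities are expressed in the robot body frame, in m/s and rad/s; this card does not state the frame or the units. The name `integrate_commands` is illustrative and is not part of GENESIS.

```python
import math

CONTROL_HZ = 2.0        # control frequency stated in this model card
DT = 1.0 / CONTROL_HZ   # each command is held for 0.5 s


def integrate_commands(commands, pose=(0.0, 0.0, 0.0)):
    """Forward-Euler integration of [vx, vy, yaw_rate] commands.

    Assumption: vx and vy are body-frame velocities in m/s and yaw_rate
    is in rad/s. Returns a list of world-frame poses (x, y, yaw) that
    starts with the initial pose.
    """
    x, y, yaw = pose
    poses = [(x, y, yaw)]
    for vx, vy, yaw_rate in commands:
        # Rotate the body-frame velocity into the world frame.
        x += (vx * math.cos(yaw) - vy * math.sin(yaw)) * DT
        y += (vx * math.sin(yaw) + vy * math.cos(yaw)) * DT
        yaw += yaw_rate * DT
        poses.append((x, y, yaw))
    return poses
```

For example, four consecutive commands of (0.5, 0.0, 0.0) move the robot 1 m forward under these assumptions, since each command covers 0.5 m/s × 0.5 s = 0.25 m.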

Training data: ~50k episodes across wheeled and legged embodiments (Unitree B1, G1, custom wheeled platforms) in Isaac Sim.

Performance

Evaluated on 41 tasks across 3 robot embodiments in Isaac Sim:

| Metric | Value |
|---|---|
| Success Rate (SR @ 3.0 m) | 100% |
| SPL | 0.91 |
| Avg Trajectory Error (ATE) | 0.42 m |
| Direction Accuracy (cosine > 0.75) | 96.3% |
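
This card does not define how the metrics are computed. As a rough guide, the sketch below shows the conventional definitions these names usually refer to. SPL follows the standard "Success weighted by Path Length" formula. Reading "SR @ 3.0 m" as "final position within 3.0 m of the goal" and "cosine > 0.75" as a per-step threshold on the cosine between predicted and reference planar directions are assumptions. The GENESIS evaluation code may differ, and all function names here are illustrative.

```python
import math


def is_success(final_xy, goal_xy, radius=3.0):
    """Assumed reading of 'SR @ 3.0 m': the episode ends within `radius` metres of the goal."""
    return math.dist(final_xy, goal_xy) <= radius


def spl(episodes):
    """Success weighted by Path Length.

    episodes: iterable of (success, shortest_path_len, actual_path_len).
    Each successful episode contributes shortest / max(actual, shortest);
    failed episodes contribute 0.
    """
    episodes = list(episodes)
    if not episodes:
        raise ValueError("need at least one episode")
    total = 0.0
    for success, shortest, actual in episodes:
        longest = max(actual, shortest)
        if success and longest > 0:
            total += shortest / longest
    return total / len(episodes)


def direction_accuracy(pred_dirs, ref_dirs, threshold=0.75):
    """Fraction of steps whose predicted and reference 2D directions have cosine > threshold."""
    pairs = list(zip(pred_dirs, ref_dirs))
    if not pairs:
        raise ValueError("need at least one step")
    hits = 0
    for (px, py), (rx, ry) in pairs:
        denom = math.hypot(px, py) * math.hypot(rx, ry)
        cosine = (px * rx + py * ry) / denom if denom > 0 else 0.0
        hits += cosine > threshold
    return hits / len(pairs)
```

Under this definition, an SPL of 0.91 with a 100% success rate would mean that the executed paths are on average about 10% longer than the shortest paths.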

Usage

# Activate the environment and enter the repository
conda activate genesis-navigation
cd GENESIS

# Single inference
python part2_navigation/flow_constrained_v2/inference.py \
  --checkpoint path/to/best.pth \
  --goal_video reference.mp4 \
  --current_obs frame.jpg \
  --output actions.npy
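
To run the same command over a folder of observation frames, a thin Python wrapper around the CLI is one option. The sketch below only uses the four flags shown above. The helper names, the `*.jpg` pattern, and the one-output-file-per-frame layout are assumptions rather than GENESIS conventions.

```python
import subprocess
import sys
from pathlib import Path

# Script path taken from the usage example; run from the GENESIS repository root.
SCRIPT = "part2_navigation/flow_constrained_v2/inference.py"


def build_inference_cmd(checkpoint, goal_video, current_obs, output):
    """Build the argument list for one inference.py call."""
    return [
        sys.executable, SCRIPT,
        "--checkpoint", str(checkpoint),
        "--goal_video", str(goal_video),
        "--current_obs", str(current_obs),
        "--output", str(output),
    ]


def run_on_frames(checkpoint, goal_video, frame_dir, out_dir):
    """Run inference once per .jpg in frame_dir, writing <frame stem>.npy to out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for frame in sorted(Path(frame_dir).glob("*.jpg")):
        cmd = build_inference_cmd(
            checkpoint, goal_video, frame, out_dir / f"{frame.stem}.npy"
        )
        # check=True raises CalledProcessError if the script exits non-zero.
        subprocess.run(cmd, check=True)
```

Each call starts a new process and therefore reloads the checkpoint, so this approach suits offline batches better than closed-loop control.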

Download the checkpoint with the GENESIS checkpoint script:

bash scripts/download_checkpoints.sh

Checkpoint Details

| File | Size | Format |
|---|---|---|
| `best.pth` | 1.2 GB | PyTorch state dict + config |

The .pth file contains:

{
  "model_state_dict": ...,
  "config": {"use_raft": False, "hidden_dim": 512, ...},
  "epoch": ...,
  "val_loss": ...
}
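
A quick way to check a downloaded file against this layout is to load it on the CPU and summarize its top-level keys. The sketch below assumes only the four keys listed above. The helper names are illustrative, and the contents of `config` beyond `use_raft` and `hidden_dim` are not documented here.

```python
EXPECTED_KEYS = ("model_state_dict", "config", "epoch", "val_loss")


def summarize_checkpoint(ckpt):
    """Check that the documented top-level keys exist and return a small summary."""
    missing = [key for key in EXPECTED_KEYS if key not in ckpt]
    if missing:
        raise KeyError(f"checkpoint is missing keys: {missing}")
    return {
        "num_tensors": len(ckpt["model_state_dict"]),
        "config": dict(ckpt["config"]),
        "epoch": ckpt["epoch"],
        "val_loss": ckpt["val_loss"],
    }


def load_and_summarize(path):
    """Load the .pth file on the CPU and summarize it (requires PyTorch)."""
    import torch  # imported lazily so summarize_checkpoint works without PyTorch

    ckpt = torch.load(path, map_location="cpu")
    return summarize_checkpoint(ckpt)
```

Calling `load_and_summarize("path/to/best.pth")` returns the stored config, epoch, and validation loss without constructing the model.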

Citation

@inproceedings{sam2026actionagent,
  title     = {Action Agent: Agentic Video Generation Meets Flow-Constrained Diffusion},
  author    = {Sam, Jeffrin and Khang, Nguyen and Mahmoud, Yara and
               Altamirano Cabrera, Miguel and Tsetserukou, Dzmitry},
  booktitle = {2026 IEEE/RSJ International Conference on Intelligent Robots
               and Systems (IROS)},
  year      = {2026},
  note      = {arXiv:2605.01477}
}

License

Apache 2.0. See LICENSE.

