Action Agent: Agentic Video Generation Meets Flow-Constrained Diffusion
Part of the GENESIS research framework: video-conditioned robot learning.
Paper: Action Agent: Agentic Video Generation Meets Flow-Constrained Diffusion (IROS 2026)
Code: github.com/jeffrinsam/GENESIS → part2_navigation/
FlowDiT V2 is a flow-constrained Diffusion Transformer (DiT) that translates a reference goal video and a current observation image into continuous navigation commands [vx, vy, yaw_rate].
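To make the output format concrete, the sketch below integrates a sequence of [vx, vy, yaw_rate] commands into a planar pose. It assumes the velocities are expressed in the robot body frame (m/s and rad/s) and are applied at a fixed timestep `dt`. The frame convention, units, and control rate are not stated here, so these choices and the function name `integrate_commands` are illustrative assumptions, not part of the released code.

```python
import math
from typing import Iterable, List, Tuple

Pose = Tuple[float, float, float]  # (x, y, heading) in the world frame


def integrate_commands(
    commands: Iterable[Tuple[float, float, float]],
    dt: float = 0.1,  # assumed control period in seconds (not from the source)
    start: Pose = (0.0, 0.0, 0.0),
) -> List[Pose]:
    """Forward-Euler integration of body-frame (vx, vy, yaw_rate) commands."""
    x, y, theta = start
    poses = [start]
    for vx, vy, yaw_rate in commands:
        # Rotate the body-frame velocity into the world frame, then step.
        x += (vx * math.cos(theta) - vy * math.sin(theta)) * dt
        y += (vx * math.sin(theta) + vy * math.cos(theta)) * dt
        theta += yaw_rate * dt
        poses.append((x, y, theta))
    return poses


if __name__ == "__main__":
    # Ten steps straight ahead at 0.5 m/s, then ten steps turning in place.
    cmds = [(0.5, 0.0, 0.0)] * 10 + [(0.0, 0.0, 0.2)] * 10
    final_x, final_y, final_theta = integrate_commands(cmds)[-1]
    print(f"x={final_x:.2f} m, y={final_y:.2f} m, heading={final_theta:.2f} rad")
```

A holonomic model is used so that a non-zero `vy` is honored; for a differential-drive base, `vy` would simply be zero.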
Architecture:
Training data: ~50k episodes across wheeled and legged embodiments (Unitree B1, G1, custom wheeled platforms) in Isaac Sim.
Evaluated on 41 tasks across 3 robot embodiments in Isaac Sim:
| Metric | Value |
|---|---|
| Success Rate (SR @ 3.0 m) | 100% |
| SPL | 0.91 |
| Avg Trajectory Error (ATE) | 0.42 m |
| Direction Accuracy (cosine > 0.75) | 96.3% |
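For readers who want to reproduce metrics of this kind, here is a minimal sketch of how they could be computed. It assumes the standard SPL definition (mean of S_i · l_i / max(p_i, l_i), where l_i is the shortest-path length and p_i the path length actually taken), treats success as ending within 3.0 m of the goal, and takes direction accuracy to be the fraction of steps whose predicted and reference 2D direction vectors have cosine similarity above 0.75. The exact evaluation protocol is not spelled out here, so the function names and the choice of vectors being compared are assumptions.

```python
import math
from typing import Sequence, Tuple

Vec2 = Tuple[float, float]


def is_success(final_pos: Vec2, goal: Vec2, radius: float = 3.0) -> bool:
    """An episode succeeds if it ends within `radius` metres of the goal."""
    return math.dist(final_pos, goal) <= radius


def spl(
    successes: Sequence[bool],
    shortest: Sequence[float],  # l_i: shortest-path length, assumed > 0
    taken: Sequence[float],     # p_i: length of the path actually driven
) -> float:
    """Success weighted by Path Length: mean of S_i * l_i / max(p_i, l_i)."""
    terms = [
        float(s) * l / max(p, l)
        for s, l, p in zip(successes, shortest, taken)
    ]
    return sum(terms) / len(terms)


def direction_accuracy(
    pred: Sequence[Vec2], ref: Sequence[Vec2], threshold: float = 0.75
) -> float:
    """Fraction of steps whose cosine similarity exceeds `threshold`."""
    hits = 0
    for (px, py), (rx, ry) in zip(pred, ref):
        norm = math.hypot(px, py) * math.hypot(rx, ry)
        # Zero-length vectors have no direction; count them as misses.
        cos = (px * rx + py * ry) / norm if norm > 0 else 0.0
        hits += cos > threshold
    return hits / len(pred)
```

Note that SPL can stay below 1.0 even at a 100% success rate, because every detour beyond the shortest path lowers the per-episode term.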
```bash
# Activate the environment and enter the repository
conda activate genesis-navigation
cd GENESIS

# Single inference
python part2_navigation/flow_constrained_v2/inference.py \
    --checkpoint path/to/best.pth \
    --goal_video reference.mp4 \
    --current_obs frame.jpg \
    --output actions.npy
```
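To run the same command over several observation frames, it can be wrapped from Python. The sketch below uses only the script path and flags shown above. The helper names `build_inference_cmd` and `run_batch`, the `_actions.npy` output naming, and the `dry_run` switch are hypothetical conveniences, not part of the repository.

```python
import subprocess
from pathlib import Path
from typing import Iterable, List

SCRIPT = "part2_navigation/flow_constrained_v2/inference.py"


def build_inference_cmd(
    checkpoint: str, goal_video: str, current_obs: str, output: str
) -> List[str]:
    """Assemble the inference command as an argument list (no shell quoting needed)."""
    return [
        "python", SCRIPT,
        "--checkpoint", checkpoint,
        "--goal_video", goal_video,
        "--current_obs", current_obs,
        "--output", output,
    ]


def run_batch(
    checkpoint: str,
    goal_video: str,
    frames: Iterable[str],
    out_dir: str,
    dry_run: bool = True,
) -> List[List[str]]:
    """Build one command per frame; execute them only when dry_run is False."""
    cmds = []
    for frame in frames:
        # Hypothetical naming scheme: frame_000.jpg -> frame_000_actions.npy
        out = str(Path(out_dir) / (Path(frame).stem + "_actions.npy"))
        cmd = build_inference_cmd(checkpoint, goal_video, frame, out)
        cmds.append(cmd)
        if not dry_run:
            # Must be launched from the GENESIS repository root.
            subprocess.run(cmd, check=True)
    return cmds
```

With `dry_run=True` the function only returns the commands, which is useful for checking paths before launching a long batch.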
Download via the GENESIS checkpoint script:
```bash
bash scripts/download_checkpoints.sh
```
| File | Size | Format |
|---|---|---|
| best.pth | 1.2 GB | PyTorch state dict + config |
The .pth file contains:
```python
{
    "model_state_dict": ...,
    "config": {"use_raft": False, "hidden_dim": 512, ...},
    "epoch": ...,
    "val_loss": ...
}
```
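A quick way to sanity-check a downloaded checkpoint is to load it on the CPU and confirm that the four top-level keys listed above are present. The sketch below is one way to do that. `summarize_checkpoint` and `load_checkpoint` are hypothetical helper names, and building the model from `config` is left out because the model class is defined in the repository rather than here.

```python
from typing import Any, Dict

EXPECTED_KEYS = ("model_state_dict", "config", "epoch", "val_loss")


def summarize_checkpoint(ckpt: Dict[str, Any]) -> Dict[str, Any]:
    """Validate the top-level layout and return a small, printable summary."""
    missing = [key for key in EXPECTED_KEYS if key not in ckpt]
    if missing:
        raise KeyError(f"checkpoint is missing keys: {missing}")
    return {
        "epoch": ckpt["epoch"],
        "val_loss": ckpt["val_loss"],
        "config": dict(ckpt["config"]),
        "num_tensors": len(ckpt["model_state_dict"]),
    }


def load_checkpoint(path: str) -> Dict[str, Any]:
    """Load the .pth file onto the CPU (requires PyTorch)."""
    import torch  # imported lazily so the summary helper works without PyTorch

    return torch.load(path, map_location="cpu")


if __name__ == "__main__":
    import sys

    # Usage: python inspect_checkpoint.py path/to/best.pth
    if len(sys.argv) > 1:
        print(summarize_checkpoint(load_checkpoint(sys.argv[1])))
```

Reading `config` before constructing the model avoids hard-coding values such as `hidden_dim` that must match the saved weights.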
```bibtex
@inproceedings{sam2026actionagent,
  title     = {Action Agent: Agentic Video Generation Meets Flow-Constrained Diffusion},
  author    = {Sam, Jeffrin and Khang, Nguyen and Mahmoud, Yara and
               Altamirano Cabrera, Miguel and Tsetserukou, Dzmitry},
  booktitle = {2026 IEEE/RSJ International Conference on Intelligent Robots
               and Systems (IROS)},
  year      = {2026},
  note      = {arXiv:2605.01477}
}
```
Apache 2.0. See LICENSE.