---
license: apache-2.0
language:
- en
tags:
- robotics
- navigation
- video-to-navigation
- diffusion-transformer
- optical-flow
- GENESIS
library_name: pytorch
pipeline_tag: robotics
---

# FlowDiT V2: Video-to-Navigation (GENESIS)

Part of the **GENESIS** research framework: video-conditioned robot learning.

**Paper**: [Action Agent: Agentic Video Generation Meets Flow-Constrained Diffusion](https://arxiv.org/abs/2605.01477) (IROS 2026)

**Code**: [github.com/jeffrinsam/GENESIS](https://github.com/jeffrinsam/GENESIS) → `part2_navigation/`

## Model Description

FlowDiT V2 is a flow-constrained Diffusion Transformer (DiT) that translates a reference goal video and a current observation image into continuous navigation commands `[vx, vy, yaw_rate]`.

**Architecture:**

- **Visual encoder**: DINOv2-ViT-B/14 (frozen), which extracts spatial features from the current observation
- **Flow encoder**: RAFT optical flow → temporal flow tokens from the goal video
- **DiT backbone**: Diffusion Transformer with cross-attention between the flow tokens and the observation features
- **Output**: 3-DOF velocity command at a 2 Hz control frequency

**Training data**: ~50k episodes across wheeled and legged embodiments (Unitree B1, G1, custom wheeled platforms) in Isaac Sim.
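To make the output format concrete, here is a minimal sketch of how a sequence of `[vx, vy, yaw_rate]` commands issued at 2 Hz could be integrated into a planar trajectory. It assumes body-frame linear velocities in m/s and a yaw rate in rad/s, and it uses plain forward-Euler steps. The function name `integrate_commands` and these unit and frame conventions are illustrative assumptions, not something taken from the GENESIS code.

```python
import math

CONTROL_HZ = 2.0        # control frequency stated in the model description
DT = 1.0 / CONTROL_HZ   # 0.5 s between consecutive commands


def integrate_commands(commands, x=0.0, y=0.0, yaw=0.0, dt=DT):
    """Integrate [vx, vy, yaw_rate] commands into a list of (x, y, yaw) poses.

    Assumption (not confirmed by the source): vx and vy are body-frame
    velocities in m/s and yaw_rate is in rad/s.
    """
    poses = [(x, y, yaw)]
    for vx, vy, yaw_rate in commands:
        # Rotate the body-frame velocity into the world frame, then step.
        x += (vx * math.cos(yaw) - vy * math.sin(yaw)) * dt
        y += (vx * math.sin(yaw) + vy * math.cos(yaw)) * dt
        yaw += yaw_rate * dt
        # Keep the heading wrapped to [-pi, pi).
        yaw = (yaw + math.pi) % (2.0 * math.pi) - math.pi
        poses.append((x, y, yaw))
    return poses
```

Under these assumptions, two consecutive commands of `(0.5, 0.0, 0.0)` move the robot 0.5 m forward in one second, since each command is held for 0.5 s.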
## Performance

Evaluated on 41 tasks across 3 robot embodiments in Isaac Sim:

| Metric | Value |
|--------|-------|
| Success Rate (SR @ 3.0 m) | 100% |
| SPL | 0.91 |
| Avg Trajectory Error (ATE) | 0.42 m |
| Direction Accuracy (cosine > 0.75) | 96.3% |

## Usage

```bash
# Install dependencies
conda activate genesis-navigation
cd GENESIS

# Single inference
python part2_navigation/flow_constrained_v2/inference.py \
    --checkpoint path/to/best.pth \
    --goal_video reference.mp4 \
    --current_obs frame.jpg \
    --output actions.npy
```

Download via the GENESIS checkpoint script:

```bash
bash scripts/download_checkpoints.sh
```

## Checkpoint Details

| File | Size | Format |
|------|------|--------|
| `best.pth` | 1.2 GB | PyTorch state dict + config |

The `.pth` file contains:

```python
{
    "model_state_dict": ...,
    "config": {"use_raft": False, "hidden_dim": 512, ...},
    "epoch": ...,
    "val_loss": ...
}
```

## Citation

```bibtex
@inproceedings{sam2026actionagent,
  title     = {Action Agent: Agentic Video Generation Meets Flow-Constrained Diffusion},
  author    = {Sam, Jeffrin and Khang, Nguyen and Mahmoud, Yara and Altamirano Cabrera, Miguel and Tsetserukou, Dzmitry},
  booktitle = {2026 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
  year      = {2026},
  note      = {arXiv:2605.01477}
}
```

## License

Apache 2.0. See [LICENSE](https://github.com/jeffrinsam/GENESIS/blob/main/LICENSE).
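As a reference for the SR and SPL rows of the Performance table, the sketch below shows the commonly used definition of Success weighted by Path Length, SPL = (1/N) Σ S_i · l_i / max(p_i, l_i), together with a distance-threshold success check. It is a sketch under stated assumptions: reading "SR @ 3.0 m" as a 3.0 m success radius around the goal is an interpretation, and the helper names and episode dictionary keys are hypothetical. The GENESIS evaluation scripts may compute these metrics differently.

```python
import math


def path_length(points):
    """Total Euclidean length of a polyline given as (x, y) points."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))


def is_success(final_xy, goal_xy, radius=3.0):
    """Assumed success rule: the final position lies within `radius` metres of the goal."""
    return math.dist(final_xy, goal_xy) <= radius


def spl(episodes):
    """Success weighted by Path Length over a list of episodes.

    Each episode is a dict with hypothetical keys:
      "success"  -> bool, whether the episode succeeded (S_i)
      "shortest" -> shortest-path length l_i in metres
      "taken"    -> length p_i of the path actually travelled, in metres
    """
    if not episodes:
        raise ValueError("need at least one episode")
    total = 0.0
    for ep in episodes:
        if not ep["success"]:
            continue  # failed episodes contribute zero
        denom = max(ep["taken"], ep["shortest"])
        # A zero-length successful episode counts as perfectly efficient.
        total += ep["shortest"] / denom if denom > 0 else 1.0
    return total / len(episodes)
```

With this definition, an agent that succeeds on every episode still scores below 1.0 whenever its paths are longer than the shortest ones, which is how a 100% success rate and an SPL below 1.0 can coexist.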