---
license: apache-2.0
language:
- en
tags:
- robotics
- navigation
- video-to-navigation
- diffusion-transformer
- optical-flow
- GENESIS
library_name: pytorch
pipeline_tag: robotics
---

# FlowDiT V2 — Video-to-Navigation (GENESIS)

Part of the **GENESIS** research framework: video-conditioned robot learning.

**Paper**: [Action Agent: Agentic Video Generation Meets Flow-Constrained Diffusion](https://arxiv.org/abs/2605.01477) (IROS 2026)

**Code**: [github.com/jeffrinsam/GENESIS](https://github.com/jeffrinsam/GENESIS) → `part2_navigation/`

## Model Description

FlowDiT V2 is a flow-constrained Diffusion Transformer (DiT) that translates a reference goal video and a current observation image into continuous navigation commands `[vx, vy, yaw_rate]`.

**Architecture:**
- **Visual encoder**: DINOv2-ViT-B/14 (frozen) — extracts spatial features from current observation
- **Flow encoder**: RAFT optical flow → temporal flow tokens from goal video
- **DiT backbone**: Diffusion Transformer with cross-attention between flow tokens and obs features
- **Output**: 3-DOF velocity command at 2 Hz control frequency
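
To make the output format concrete, the sketch below rolls a sequence of `[vx, vy, yaw_rate]` commands forward at the stated 2 Hz rate to obtain a planar pose trajectory. It is only an illustration: the card does not state frames or units, so the code assumes body-frame velocities in m/s and a yaw rate in rad/s, and the helper name `integrate_commands` is hypothetical rather than part of the GENESIS code base.

```python
import math

CONTROL_HZ = 2.0       # control frequency stated in the model card
DT = 1.0 / CONTROL_HZ  # 0.5 s between consecutive commands


def integrate_commands(commands, x=0.0, y=0.0, yaw=0.0, dt=DT):
    """Roll out [vx, vy, yaw_rate] commands into a list of (x, y, yaw) poses.

    Assumption: vx, vy are body-frame velocities (m/s) and yaw_rate is in
    rad/s. Uses simple forward-Euler integration.
    """
    poses = [(x, y, yaw)]
    for vx, vy, yaw_rate in commands:
        # Rotate the body-frame velocity into the world frame.
        x += (vx * math.cos(yaw) - vy * math.sin(yaw)) * dt
        y += (vx * math.sin(yaw) + vy * math.cos(yaw)) * dt
        yaw += yaw_rate * dt
        poses.append((x, y, yaw))
    return poses


if __name__ == "__main__":
    # Hypothetical commands: drive forward for 2 s, then turn in place for 1 s.
    cmds = [(0.5, 0.0, 0.0)] * 4 + [(0.0, 0.0, math.pi / 4)] * 2
    for pose in integrate_commands(cmds):
        print(pose)
```

At 2 Hz each command is held for 0.5 s, so integration error grows with speed and turn rate; a real controller would normally close the loop on fresh observations rather than rely on this open-loop rollout.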

**Training data**: ~50k episodes across wheeled and legged embodiments (Unitree B1, G1, custom wheeled platforms) in Isaac Sim.

## Performance

Evaluated on 41 tasks across 3 robot embodiments in Isaac Sim:

| Metric | Value |
|--------|-------|
| Success Rate (SR @ 3.0 m) | 100% |
| SPL | 0.91 |
| Avg Trajectory Error (ATE) | 0.42 m |
| Direction Accuracy (cosine > 0.75) | 96.3% |

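
For readers who want to compute comparable numbers, here is a minimal sketch of three of the metrics above. It assumes the commonly used definitions: success means the final distance to the goal is within the 3.0 m threshold, SPL is the mean of `S_i * l_i / max(p_i, l_i)` over episodes, and direction accuracy is the fraction of steps whose cosine similarity to a reference direction exceeds 0.75. The card does not spell out its exact evaluation protocol, so these functions and their names are illustrative, not the authors' evaluation code.

```python
import math


def success_rate(final_dists, threshold=3.0):
    """Fraction of episodes ending within `threshold` metres of the goal."""
    return sum(d <= threshold for d in final_dists) / len(final_dists)


def spl(successes, shortest_lens, path_lens):
    """Success weighted by Path Length, assuming every shortest length is > 0."""
    total = 0.0
    for s, l, p in zip(successes, shortest_lens, path_lens):
        total += s * l / max(p, l)
    return total / len(successes)


def direction_accuracy(pred_vecs, ref_vecs, threshold=0.75):
    """Fraction of 2D direction pairs whose cosine similarity exceeds `threshold`."""
    hits = 0
    for (px, py), (rx, ry) in zip(pred_vecs, ref_vecs):
        denom = math.hypot(px, py) * math.hypot(rx, ry)
        # Treat zero-length vectors as a miss instead of dividing by zero.
        cos = (px * rx + py * ry) / denom if denom > 0 else 0.0
        hits += cos > threshold
    return hits / len(pred_vecs)
```

SPL penalises detours: an episode that succeeds but travels twice the shortest path contributes 0.5 rather than 1.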

## Usage

```bash
# Activate the project environment (assumed to already exist with its dependencies installed)
conda activate genesis-navigation
cd GENESIS

# Single inference
python part2_navigation/flow_constrained_v2/inference.py \
    --checkpoint path/to/best.pth \
    --goal_video reference.mp4 \
    --current_obs frame.jpg \
    --output actions.npy
```

Download the checkpoint via the GENESIS checkpoint script:

```bash
bash scripts/download_checkpoints.sh
```

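
When the single-inference command has to be run over many observation frames, it can help to generate the command lines programmatically. The sketch below builds one argument list per frame using only the four flags shown in the usage example; the helper names, frame file names, and output directory are hypothetical, and the commands are printed rather than executed.

```python
import shlex
from pathlib import Path

# Script path taken from the usage example, relative to the GENESIS repo root.
SCRIPT = "part2_navigation/flow_constrained_v2/inference.py"


def build_inference_cmd(checkpoint, goal_video, current_obs, output):
    """Return the argv list for one inference call, using only the documented flags."""
    return [
        "python", SCRIPT,
        "--checkpoint", str(checkpoint),
        "--goal_video", str(goal_video),
        "--current_obs", str(current_obs),
        "--output", str(output),
    ]


def build_batch(checkpoint, goal_video, frames, out_dir):
    """Build one command per observation frame; each output is named after its frame."""
    out_dir = Path(out_dir)
    return [
        build_inference_cmd(checkpoint, goal_video, f, out_dir / (Path(f).stem + ".npy"))
        for f in frames
    ]


if __name__ == "__main__":
    frames = ["frame_000.jpg", "frame_001.jpg"]  # hypothetical frame files
    for cmd in build_batch("path/to/best.pth", "reference.mp4", frames, "actions"):
        # To execute instead of print: subprocess.run(cmd, check=True)
        print(shlex.join(cmd))
```

Each call reloads the checkpoint, so this pattern suits small batches; for long sequences, importing the model once in Python would likely be faster.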

## Checkpoint Details

| File | Size | Format |
|------|------|--------|
| `best.pth` | 1.2 GB | PyTorch state dict + config |

The `.pth` file contains:

```python
{
    "model_state_dict": ...,
    "config": {"use_raft": False, "hidden_dim": 512, ...},
    "epoch": ...,
    "val_loss": ...
}
```
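
A quick way to sanity-check a downloaded checkpoint is to load it on the CPU and summarise the four top-level keys listed above. The sketch below is a hedged example: `summarize_checkpoint` and `load_checkpoint` are hypothetical helpers, not part of GENESIS, and only the config fields shown in the card (`use_raft`, `hidden_dim`) are read. PyTorch is imported lazily so the summary function can also be tried on a plain dictionary.

```python
REQUIRED_KEYS = ("model_state_dict", "config", "epoch", "val_loss")


def summarize_checkpoint(ckpt):
    """Check the documented top-level keys and return a small summary dict."""
    missing = [k for k in REQUIRED_KEYS if k not in ckpt]
    if missing:
        raise KeyError(f"checkpoint is missing keys: {missing}")
    config = ckpt["config"]
    return {
        "epoch": ckpt["epoch"],
        "val_loss": ckpt["val_loss"],
        "use_raft": config.get("use_raft"),
        "hidden_dim": config.get("hidden_dim"),
        "num_tensors": len(ckpt["model_state_dict"]),
    }


def load_checkpoint(path):
    """Load the .pth file onto the CPU (requires PyTorch to be installed)."""
    import torch  # lazy import so summarize_checkpoint works without PyTorch

    return torch.load(path, map_location="cpu")


if __name__ == "__main__":
    # Stand-in dictionary with the documented layout; replace with
    # load_checkpoint("path/to/best.pth") once the file is downloaded.
    fake = {
        "model_state_dict": {"layer.weight": None},
        "config": {"use_raft": False, "hidden_dim": 512},
        "epoch": 0,
        "val_loss": 0.0,
    }
    print(summarize_checkpoint(fake))
```

Building the model itself still requires the class definitions from the repository; the `config` entry presumably supplies their constructor arguments before `load_state_dict` is called.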

## Citation

```bibtex
@inproceedings{sam2026actionagent,
  title     = {Action Agent: Agentic Video Generation Meets Flow-Constrained Diffusion},
  author    = {Sam, Jeffrin and Khang, Nguyen and Mahmoud, Yara and
               Altamirano Cabrera, Miguel and Tsetserukou, Dzmitry},
  booktitle = {2026 IEEE/RSJ International Conference on Intelligent Robots
               and Systems (IROS)},
  year      = {2026},
  note      = {arXiv:2605.01477}
}
```

## License

Apache 2.0. See [LICENSE](https://github.com/jeffrinsam/GENESIS/blob/main/LICENSE).