---
license: apache-2.0
language:
- en
tags:
- robotics
- navigation
- video-to-navigation
- diffusion-transformer
- optical-flow
- GENESIS
library_name: pytorch
pipeline_tag: robotics
---

# FlowDiT V2 — Video-to-Navigation (GENESIS)

Part of the **GENESIS** research framework: video-conditioned robot learning.

**Paper**: [Action Agent: Agentic Video Generation Meets Flow-Constrained Diffusion](https://arxiv.org/abs/2605.01477) (IROS 2026)

**Code**: [github.com/jeffrinsam/GENESIS](https://github.com/jeffrinsam/GENESIS) → `part2_navigation/`

## Model Description

FlowDiT V2 is a flow-constrained Diffusion Transformer (DiT) that translates a reference goal video and a current observation image into continuous navigation commands `[vx, vy, yaw_rate]`.

**Architecture:**
- **Visual encoder**: DINOv2-ViT-B/14 (frozen) — extracts spatial features from current observation
- **Flow encoder**: RAFT optical flow → temporal flow tokens from goal video
- **DiT backbone**: Diffusion Transformer with cross-attention between the flow tokens and the observation features
- **Output**: 3-DOF velocity command at 2 Hz control frequency
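
To make the output format concrete, here is a minimal sketch of how a sequence of `[vx, vy, yaw_rate]` commands issued at 2 Hz could be integrated into a planar trajectory. It is an illustration only, not part of the GENESIS code: the function name `integrate_commands` is hypothetical, and the units (m/s, rad/s) and the robot-frame convention are assumptions that the model card does not state.

```python
import math

CONTROL_HZ = 2.0       # control frequency stated in the model card
DT = 1.0 / CONTROL_HZ  # 0.5 s between consecutive commands


def integrate_commands(commands, x=0.0, y=0.0, yaw=0.0, dt=DT):
    """Dead-reckon planar poses from [vx, vy, yaw_rate] commands.

    Assumption: vx and vy are body-frame velocities in m/s and
    yaw_rate is in rad/s. Returns a list of (x, y, yaw) poses,
    starting with the initial pose.
    """
    poses = [(x, y, yaw)]
    for vx, vy, yaw_rate in commands:
        # Rotate the body-frame velocity into the world frame.
        x += (vx * math.cos(yaw) - vy * math.sin(yaw)) * dt
        y += (vx * math.sin(yaw) + vy * math.cos(yaw)) * dt
        # Simple Euler step for the heading.
        yaw += yaw_rate * dt
        poses.append((x, y, yaw))
    return poses
```

Under these assumptions, four consecutive commands of `(1.0, 0.0, 0.0)` advance the robot 2 m along its initial heading, since each command is held for 0.5 s.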

**Training data**: ~50k episodes across wheeled and legged embodiments (Unitree B1, G1, custom wheeled platforms) in Isaac Sim.

## Performance

Evaluated on 41 tasks across 3 robot embodiments in Isaac Sim:

| Metric | Value |
|--------|-------|
| Success Rate (SR @ 3.0 m) | 100% |
| SPL | 0.91 |
| Avg Trajectory Error (ATE) | 0.42 m |
| Direction Accuracy (cosine > 0.75) | 96.3% |
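
For readers who want to reproduce metrics of this kind, the sketch below shows one plausible way to compute them. It is not the GENESIS evaluation code. SPL follows the standard definition (success weighted by the ratio of shortest path length to the path actually taken). The reading of "SR @ 3.0 m" as "final distance to goal at most 3.0 m" and the use of planar `(vx, vy)` vectors for the cosine test are assumptions, and all function names are hypothetical.

```python
import math


def success_rate(final_distances, threshold=3.0):
    """Fraction of episodes ending within `threshold` metres of the goal."""
    return sum(d <= threshold for d in final_distances) / len(final_distances)


def spl(successes, shortest_lengths, path_lengths):
    """Success weighted by Path Length, averaged over episodes.

    successes: 0/1 flags; shortest_lengths: optimal path length per episode
    (assumed > 0); path_lengths: length of the path actually travelled.
    """
    total = 0.0
    for s, l, p in zip(successes, shortest_lengths, path_lengths):
        total += s * l / max(p, l)
    return total / len(successes)


def direction_accuracy(predicted, reference, threshold=0.75):
    """Fraction of steps whose predicted and reference planar velocity
    vectors have cosine similarity above `threshold`."""
    hits = 0
    for (px, py), (rx, ry) in zip(predicted, reference):
        norm = math.hypot(px, py) * math.hypot(rx, ry)
        # Treat a zero-length vector as a miss rather than dividing by zero.
        cosine = (px * rx + py * ry) / norm if norm > 0 else 0.0
        hits += cosine > threshold
    return hits / len(predicted)
```

A cosine threshold of 0.75 corresponds to an angular deviation of roughly 41 degrees between the predicted and reference directions.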

## Usage

```bash
# Activate the environment and enter the repository
conda activate genesis-navigation
cd GENESIS

# Single inference
python part2_navigation/flow_constrained_v2/inference.py \
  --checkpoint path/to/best.pth \
  --goal_video reference.mp4 \
  --current_obs frame.jpg \
  --output actions.npy
```
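
If the script needs to be called repeatedly, for example once per observation frame, a thin Python wrapper around the command above may be convenient. The sketch below only reuses the script path and the four flags shown in the command; the helper names `build_inference_cmd` and `run_inference` are hypothetical and not part of the repository, and the wrapper is assumed to run from the `GENESIS` directory inside the activated environment.

```python
import subprocess
import sys

# Script path as given in the usage command, relative to the GENESIS root.
SCRIPT = "part2_navigation/flow_constrained_v2/inference.py"


def build_inference_cmd(checkpoint, goal_video, current_obs, output):
    """Build the argument list for one inference call."""
    return [
        sys.executable, SCRIPT,
        "--checkpoint", str(checkpoint),
        "--goal_video", str(goal_video),
        "--current_obs", str(current_obs),
        "--output", str(output),
    ]


def run_inference(checkpoint, goal_video, current_obs, output):
    """Run the inference script and raise if it exits with an error."""
    subprocess.run(
        build_inference_cmd(checkpoint, goal_video, current_obs, output),
        check=True,
    )
```

Passing the arguments as a list avoids shell quoting problems when paths contain spaces.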

Download the checkpoint with the GENESIS download script:
```bash
bash scripts/download_checkpoints.sh
```

## Checkpoint Details

| File | Size | Format |
|------|------|--------|
| `best.pth` | 1.2 GB | PyTorch state dict + config |

The `.pth` file contains:
```python
{
  "model_state_dict": ...,
  "config": {"use_raft": False, "hidden_dim": 512, ...},
  "epoch": ...,
  "val_loss": ...
}
```
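
A quick way to check a downloaded file against this layout is sketched below. It assumes only the four top-level keys listed above and the two config entries shown; the helper names are hypothetical. `torch.load` with `map_location="cpu"` is standard PyTorch, although recent PyTorch releases default to `weights_only=True`, which may reject a checkpoint whose config holds non-standard Python objects.

```python
REQUIRED_KEYS = ("model_state_dict", "config", "epoch", "val_loss")


def load_checkpoint(path):
    """Load the checkpoint onto the CPU (requires PyTorch)."""
    import torch  # imported here so the summary helper works without torch

    return torch.load(path, map_location="cpu")


def summarize_checkpoint(ckpt):
    """Validate the top-level keys and return a small summary dict."""
    missing = [key for key in REQUIRED_KEYS if key not in ckpt]
    if missing:
        raise KeyError(f"checkpoint is missing keys: {missing}")
    config = ckpt["config"]
    return {
        "num_tensors": len(ckpt["model_state_dict"]),
        "use_raft": config.get("use_raft"),
        "hidden_dim": config.get("hidden_dim"),
        "epoch": ckpt["epoch"],
        "val_loss": ckpt["val_loss"],
    }
```

Typical use would be `summarize_checkpoint(load_checkpoint("path/to/best.pth"))`, which fails early with a `KeyError` if the file does not follow the layout described here.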

## Citation

```bibtex
@inproceedings{sam2026actionagent,
  title     = {Action Agent: Agentic Video Generation Meets Flow-Constrained Diffusion},
  author    = {Sam, Jeffrin and Khang, Nguyen and Mahmoud, Yara and
               Altamirano Cabrera, Miguel and Tsetserukou, Dzmitry},
  booktitle = {2026 IEEE/RSJ International Conference on Intelligent Robots
               and Systems (IROS)},
  year      = {2026},
  note      = {arXiv:2605.01477}
}
```

## License

Apache 2.0. See [LICENSE](https://github.com/jeffrinsam/GENESIS/blob/main/LICENSE).