	---
	license: apache-2.0
	language:
	- en
	tags:
	- robotics
	- navigation
	- video-to-navigation
	- diffusion-transformer
	- optical-flow
	- GENESIS
	library_name: pytorch
	pipeline_tag: robotics
	---

	# FlowDiT V2 — Video-to-Navigation (GENESIS)

	FlowDiT V2 is part of GENESIS, a research framework for video-conditioned robot learning.

	Paper: [Action Agent: Agentic Video Generation Meets Flow-Constrained Diffusion](https://arxiv.org/abs/2605.01477) (IROS 2026)

	Code: [github.com/jeffrinsam/GENESIS](https://github.com/jeffrinsam/GENESIS) → `part2_navigation/`

	## Model Description

	FlowDiT V2 is a flow-constrained Diffusion Transformer (DiT) that translates a reference goal video and a current observation image into continuous navigation commands `[vx, vy, yaw_rate]`.
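
To make the command format concrete, here is a minimal sketch that integrates a sequence of `[vx, vy, yaw_rate]` commands into a planar pose. It assumes body-frame velocities in m/s, a yaw rate in rad/s, and a fixed step of 0.5 s (the 2 Hz control rate stated in this card). The function name `integrate_commands` and the forward-Euler scheme are illustrative and are not taken from the GENESIS code.

```python
import math
from typing import Iterable, List, Tuple

Pose = Tuple[float, float, float]  # (x, y, yaw) in the world frame


def integrate_commands(
    commands: Iterable[Tuple[float, float, float]],
    dt: float = 0.5,  # assumption: 2 Hz control rate -> 0.5 s per command
    start: Pose = (0.0, 0.0, 0.0),
) -> List[Pose]:
    """Forward-Euler integration of body-frame [vx, vy, yaw_rate] commands."""
    x, y, yaw = start
    poses = [start]
    for vx, vy, yaw_rate in commands:
        # Rotate the body-frame velocity into the world frame, then step.
        x += (vx * math.cos(yaw) - vy * math.sin(yaw)) * dt
        y += (vx * math.sin(yaw) + vy * math.cos(yaw)) * dt
        yaw += yaw_rate * dt
        poses.append((x, y, yaw))
    return poses


if __name__ == "__main__":
    # Drive straight ahead at 0.5 m/s for 2 s (four commands at 2 Hz).
    path = integrate_commands([(0.5, 0.0, 0.0)] * 4)
    print(path[-1])
```

A rollout like this is only a quick sanity check on predicted commands; a real robot or simulator applies its own dynamics and low-level controller.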

	Architecture:
	- Visual encoder: DINOv2 ViT-B/14 (frozen), which extracts spatial features from the current observation
	- Flow encoder: RAFT optical flow computed on the goal video and encoded as temporal flow tokens
	- DiT backbone: Diffusion Transformer with cross-attention between the flow tokens and the observation features
	- Output: 3-DOF velocity command `[vx, vy, yaw_rate]` at a 2 Hz control frequency
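
The cross-attention step in the backbone can be illustrated with a single-head NumPy sketch. This card does not state which stream provides the queries, so the sketch assumes the flow tokens query the observation features. The token count, the random weights, and the name `cross_attention` are illustrative assumptions. The 256 x 768 observation shape is what DINOv2 ViT-B/14 produces for a 224 x 224 image, and 512 matches the `hidden_dim` shown in the checkpoint config.

```python
import numpy as np


def cross_attention(queries, context, w_q, w_k, w_v):
    """Single-head scaled dot-product cross-attention.

    queries: (n_q, d_q) tokens that attend; context: (n_c, d_c) tokens attended to.
    Returns the fused tokens (n_q, d) and the attention weights (n_q, n_c).
    """
    q = queries @ w_q
    k = context @ w_k
    v = context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # subtract row max for stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over context tokens
    return weights @ v, weights


rng = np.random.default_rng(0)
hidden_dim = 512
flow_tokens = rng.standard_normal((8, hidden_dim))  # assumption: 8 temporal flow tokens
obs_features = rng.standard_normal((256, 768))      # 16 x 16 patches, 768-dim each
w_q = rng.standard_normal((hidden_dim, hidden_dim)) / np.sqrt(hidden_dim)
w_k = rng.standard_normal((768, hidden_dim)) / np.sqrt(768)
w_v = rng.standard_normal((768, hidden_dim)) / np.sqrt(768)

fused, attn = cross_attention(flow_tokens, obs_features, w_q, w_k, w_v)
print(fused.shape, attn.shape)  # (8, 512) (8, 256)
```

Each flow token ends up as a weighted mixture of observation patches. The actual model may use multiple heads, a different query/key assignment, or additional conditioning.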

	Training data: ~50k episodes across wheeled and legged embodiments (Unitree B1, G1, custom wheeled platforms) in Isaac Sim.

	## Performance

	Evaluated on 41 tasks across 3 robot embodiments in Isaac Sim:

	| Metric | Value |
	|--------|-------|
	| Success Rate (SR @ 3.0 m) | 100% |
	| SPL | 0.91 |
	| Avg Trajectory Error (ATE) | 0.42 m |
	| Direction Accuracy (cosine > 0.75) | 96.3% |
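
The card does not spell out how these metrics are computed. As a hedged illustration, the sketch below uses the standard definition of SPL (success weighted by the ratio of shortest to actual path length). It also assumes that success means ending within 3.0 m of the goal, and that direction accuracy is the fraction of steps whose predicted and reference planar velocities have cosine similarity above 0.75. All function names and inputs are hypothetical.

```python
import math
from typing import Sequence, Tuple

Vec2 = Tuple[float, float]


def is_success(final_pos: Vec2, goal: Vec2, radius: float = 3.0) -> bool:
    """Assumed success test: the final position lies within `radius` metres of the goal."""
    return math.dist(final_pos, goal) <= radius


def spl(successes: Sequence[bool], shortest: Sequence[float], taken: Sequence[float]) -> float:
    """Success weighted by Path Length: mean of S_i * l_i / max(p_i, l_i).

    `shortest` (l_i) and `taken` (p_i) are path lengths in metres; l_i must be positive.
    """
    terms = [float(s) * l / max(p, l) for s, l, p in zip(successes, shortest, taken)]
    return sum(terms) / len(terms)


def direction_accuracy(pred: Sequence[Vec2], ref: Sequence[Vec2], threshold: float = 0.75) -> float:
    """Fraction of steps whose cosine similarity to the reference exceeds `threshold`."""
    hits = 0
    for (px, py), (rx, ry) in zip(pred, ref):
        norm = math.hypot(px, py) * math.hypot(rx, ry)
        cos = (px * rx + py * ry) / norm if norm > 0 else 0.0  # zero vectors count as misses
        hits += cos > threshold
    return hits / len(pred)


if __name__ == "__main__":
    # Three toy episodes: optimal success, detoured success, failure.
    print(round(spl([True, True, False], [10.0, 10.0, 10.0], [10.0, 12.5, 10.0]), 3))  # 0.6
```

Because SPL penalizes detours, a policy can reach a 100% success rate and still score below 1.0 on SPL.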

	## Usage

	```bash
	# Activate the conda environment (it must already exist)
	conda activate genesis-navigation
	cd GENESIS

	# Single inference
	python part2_navigation/flow_constrained_v2/inference.py \
	    --checkpoint path/to/best.pth \
	    --goal_video reference.mp4 \
	    --current_obs frame.jpg \
	    --output actions.npy
	```

	Download via the GENESIS checkpoint script:
	```bash
	bash scripts/download_checkpoints.sh
	```

	## Checkpoint Details

	| File | Size | Format |
	|------|------|--------|
	| `best.pth` | 1.2 GB | PyTorch state dict + config |

	The `.pth` file contains:
	```python
	{
	    "model_state_dict": ...,
	    "config": {"use_raft": False, "hidden_dim": 512, ...},
	    "epoch": ...,
	    "val_loss": ...
	}
	```
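
Given that layout, a checkpoint could be inspected roughly as follows. This is a sketch: the helper names `check_checkpoint` and `load_checkpoint` are hypothetical, and only the four top-level keys listed above are assumed. `torch.load` is the standard PyTorch call. Its `weights_only=True` option restricts unpickling to tensors and basic Python types, so it may need to be relaxed if the file stores other objects.

```python
from typing import Any, Dict

EXPECTED_KEYS = ("model_state_dict", "config", "epoch", "val_loss")


def check_checkpoint(ckpt: Dict[str, Any]) -> Dict[str, Any]:
    """Verify the documented top-level keys and return the stored config."""
    missing = [key for key in EXPECTED_KEYS if key not in ckpt]
    if missing:
        raise KeyError(f"checkpoint is missing keys: {missing}")
    return ckpt["config"]


def load_checkpoint(path: str) -> Dict[str, Any]:
    """Load a checkpoint onto the CPU and validate its layout."""
    import torch  # imported lazily so check_checkpoint works without PyTorch

    ckpt = torch.load(path, map_location="cpu", weights_only=True)
    check_checkpoint(ckpt)
    return ckpt


if __name__ == "__main__":
    # Stand-in dictionary with the documented layout (values are placeholders).
    demo = {
        "model_state_dict": {},
        "config": {"use_raft": False, "hidden_dim": 512},
        "epoch": 0,
        "val_loss": 0.0,
    }
    print(check_checkpoint(demo)["hidden_dim"])  # 512
```

With the real file, `load_checkpoint("path/to/best.pth")` returns the full dictionary. Its `model_state_dict` entry can then be passed to `load_state_dict` on a model built from the stored config. The model class itself lives in the GENESIS repository and is not shown here.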

	## Citation

	```bibtex
	@inproceedings{sam2026actionagent,
	  title     = {Action Agent: Agentic Video Generation Meets Flow-Constrained Diffusion},
	  author    = {Sam, Jeffrin and Khang, Nguyen and Mahmoud, Yara and
	               Altamirano Cabrera, Miguel and Tsetserukou, Dzmitry},
	  booktitle = {2026 IEEE/RSJ International Conference on Intelligent Robots
	               and Systems (IROS)},
	  year      = {2026},
	  note      = {arXiv:2605.01477}
	}
	```

	## License

	Apache 2.0. See [LICENSE](https://github.com/jeffrinsam/GENESIS/blob/main/LICENSE).