---
license: other
license_name: nvidia-open-model-license
license_link: https://developer.nvidia.com/open-model-license
language:
- en
library_name: transformers
tags:
- robotics
- vision-language-action
- manipulation
- gr00t
- nvidia
- physical-ai
- humanoid
- reachy2
- lerobot
datasets:
- ganatrask/NOVA
base_model:
- nvidia/GR00T-N1.6-3B
pipeline_tag: robotics
---

# NOVA Model - GR00T N1.6 Fine-tuned for Reachy 2

<p align="center">
  <img src="https://img.shields.io/badge/NVIDIA-GR00T%20N1.6-76B900?style=for-the-badge&logo=nvidia" alt="GR00T N1.6"/>
  <img src="https://img.shields.io/badge/Robot-Reachy%202-0066CC?style=for-the-badge" alt="Reachy 2"/>
  <img src="https://img.shields.io/badge/Task-Pick%20%26%20Place-green?style=for-the-badge" alt="Pick & Place"/>
</p>

**NOVA** (Neural Open Vision Actions) is a fine-tuned version of NVIDIA's GR00T N1.6 vision-language-action model, trained specifically for [Pollen Robotics' Reachy 2](https://www.pollen-robotics.com/reachy/) humanoid robot.

## Model Description

This model is part of an end-to-end Physical AI pipeline that combines:
- **Voice Input**: Parakeet CTC 0.6B for speech-to-text
- **Scene Reasoning**: Cosmos Reason 2 for object detection and spatial understanding
- **Action Policy**: This fine-tuned GR00T N1.6 model for manipulation

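The hand-off between the three stages can be sketched as below. This is a minimal illustration of the ordering only, not the project's code: the `Pipeline` class and its three callables are hypothetical stand-ins for Parakeet, Cosmos Reason 2, and the GR00T policy, and their signatures are assumptions.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Pipeline:
    """Hypothetical wiring of the three stages; every callable is a stand-in."""

    speech_to_text: Callable[[bytes], str]       # stand-in for Parakeet CTC 0.6B
    scene_reasoner: Callable[[str, Any], str]    # stand-in for Cosmos Reason 2
    action_policy: Callable[[str, Any], list]    # stand-in for the GR00T policy

    def run(self, audio: bytes, image: Any) -> list:
        # 1. Voice input: turn the spoken command into text.
        command = self.speech_to_text(audio)
        # 2. Scene reasoning: ground the command in the current camera image.
        instruction = self.scene_reasoner(command, image)
        # 3. Action policy: map the grounded instruction to an action vector.
        return self.action_policy(instruction, image)
```

Passing the stages in as callables keeps the sketch independent of any particular model API, so a real model can replace each stub one at a time.
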
### Model Details

| Property | Value |
|----------|-------|
| **Base Model** | [nvidia/GR00T-N1.6-3B](https://huggingface.co/nvidia/GR00T-N1.6-3B) |
| **Parameters** | ~3B |
| **Embodiment** | Reachy 2 (custom embodiment tag) |
| **Action Space** | 8-DOF (7 arm joints + gripper) |
| **Training Steps** | 30,000 |
| **Final Loss** | ~0.008-0.01 |

### Action Space

```python
action = [
    shoulder_pitch,  # -180° to 90°
    shoulder_roll,   # -180° to 10°
    elbow_yaw,       # -90° to 90°
    elbow_pitch,     # -125° to 0°
    wrist_roll,      # -100° to 100°
    wrist_pitch,     # -45° to 45°
    wrist_yaw,       # -30° to 30°
    gripper,         # 0 (closed) to 1 (open)
]
```

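Before a predicted action is sent to the robot, it can help to clamp it to the ranges listed above. The following is a minimal sketch, assuming the seven joint values are in degrees and the gripper value is in the 0 to 1 range (this card does not state the units the policy emits). `ACTION_LIMITS` and `clamp_action` are illustrative names, not part of Isaac-GR00T or the Reachy 2 SDK.

```python
# (name, lower bound, upper bound); joints in degrees, gripper unitless.
ACTION_LIMITS = [
    ("shoulder_pitch", -180.0, 90.0),
    ("shoulder_roll", -180.0, 10.0),
    ("elbow_yaw", -90.0, 90.0),
    ("elbow_pitch", -125.0, 0.0),
    ("wrist_roll", -100.0, 100.0),
    ("wrist_pitch", -45.0, 45.0),
    ("wrist_yaw", -30.0, 30.0),
    ("gripper", 0.0, 1.0),
]


def clamp_action(action):
    """Return a copy of `action` with every entry clipped to its limit."""
    if len(action) != len(ACTION_LIMITS):
        raise ValueError(
            f"expected {len(ACTION_LIMITS)} values, got {len(action)}"
        )
    return [
        min(max(value, low), high)
        for value, (_, low, high) in zip(action, ACTION_LIMITS)
    ]
```

Clamping is only a last line of defense; a value that lands far outside its range usually points to a units or normalization mismatch upstream.
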
## Intended Use

This model is designed for:
- **Pick-and-place manipulation** tasks on the Reachy 2 robot
- **Language-conditioned control** ("Pick up the red cube")
- **Research** in vision-language-action models and robotic manipulation

### Supported Tasks

- Pick up objects (cube, cylinder, capsule, rectangular box)
- Place objects in target locations
- Handle 8 color variations (red, green, blue, yellow, cyan, magenta, orange, purple)

## Training

### Training Data

Trained on the [ganatrask/NOVA dataset](https://huggingface.co/datasets/ganatrask/NOVA):
- **100 episodes** of expert demonstrations
- **32 task variations** (4 objects × 8 colors)
- Domain randomization (position, lighting, camera jitter)
- LeRobot v2.1 format

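The 32 task variations are the Cartesian product of the four object shapes and eight colors listed under Supported Tasks. The snippet below enumerates them. The instruction template is an assumption modeled on this card's "Pick up the red cube" example; the annotation strings stored in the dataset may be worded differently.

```python
from itertools import product

OBJECTS = ["cube", "cylinder", "capsule", "rectangular box"]
COLORS = ["red", "green", "blue", "yellow", "cyan", "magenta", "orange", "purple"]


def task_instructions():
    """Build one instruction per (object, color) pair."""
    return [
        f"Pick up the {color} {obj}"  # assumed template, see note above
        for obj, color in product(OBJECTS, COLORS)
    ]


instructions = task_instructions()
print(len(instructions))  # 32
print(instructions[0])    # Pick up the red cube
```
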
### Training Configuration

| Parameter | Value |
|-----------|-------|
| GPU Model | NVIDIA A100-SXM4-80GB |
| GPU Count | 2 |
| Batch Size | 64 |
| Max Steps | 30,000 |
| Save Steps | 3,000 |
| Video Backend | decord |

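The schedule implied by these settings can be worked out directly from the table. The sketch below assumes a checkpoint is written at every multiple of the save interval, including the final step, and that the batch size of 64 is global rather than per GPU; neither detail is stated on this card.

```python
MAX_STEPS = 30_000
SAVE_STEPS = 3_000
BATCH_SIZE = 64  # assumed to be the global batch size


def checkpoint_steps(max_steps, save_steps):
    """Steps at which a checkpoint would be written under the assumption above."""
    return list(range(save_steps, max_steps + 1, save_steps))


steps = checkpoint_steps(MAX_STEPS, SAVE_STEPS)
samples_seen = MAX_STEPS * BATCH_SIZE

print(len(steps), steps[0], steps[-1])  # 10 3000 30000
print(samples_seen)                     # 1920000
```
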
### Training Command

```bash
python -m gr00t.train \
    --dataset_repo_id ganatrask/NOVA \
    --embodiment_tag reachy2 \
    --video_backend decord \
    --num_gpus 2 \
    --batch_size 64 \
    --max_steps 30000 \
    --save_steps 3000 \
    --output_dir ./checkpoints/groot-reachy2
```

## Usage

### Prerequisites

You need to apply a patch to Isaac-GR00T to add the Reachy 2 embodiment tag:

```bash
cd Isaac-GR00T
patch -p1 < ../patches/add_reachy2_embodiment.patch
```

### Inference

```python
import importlib.util

import numpy as np

from gr00t.data.embodiment_tags import EmbodimentTag
from gr00t.policy.gr00t_policy import Gr00tPolicy

# Load the modality config first; executing the module runs its setup code
spec = importlib.util.spec_from_file_location(
    "modality_config",
    "configs/reachy2_modality_config.py",
)
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)

# Load policy
policy = Gr00tPolicy(
    embodiment_tag=EmbodimentTag.REACHY2,
    model_path="ganatrask/NOVA",  # or local checkpoint path
    device="cuda",
    strict=True,
)

# Placeholder inputs; replace with a real front-camera frame and joint readings
image = np.zeros((224, 224, 3), dtype=np.uint8)  # (H, W, 3); uint8 RGB assumed
joints = np.zeros(7, dtype=np.float32)           # 7 arm joints

# Run inference
obs = {
    "video": {"front_cam": image[None, None, :, :, :]},  # (1, 1, H, W, 3)
    "state": {"arm_joints": joints[None, None, :]},      # (1, 1, 7)
    "language": {"annotation.human.task_description": [["Pick up the red cube"]]},
}
action, _ = policy.get_action(obs)
```

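The observation layout is easy to get wrong, so it may be worth wrapping it in a small helper that checks shapes before adding the leading batch and time axes. This is a sketch only: `build_observation` is a hypothetical helper, not part of Isaac-GR00T, and the dictionary keys are copied from the inference example above.

```python
import numpy as np


def build_observation(image, joints, instruction):
    """Validate inputs and add the leading (batch, time) axes."""
    image = np.asarray(image)
    joints = np.asarray(joints, dtype=np.float32)
    if image.ndim != 3 or image.shape[-1] != 3:
        raise ValueError(f"expected an (H, W, 3) image, got {image.shape}")
    if joints.shape != (7,):
        raise ValueError(f"expected 7 joint values, got shape {joints.shape}")
    return {
        "video": {"front_cam": image[None, None]},       # (1, 1, H, W, 3)
        "state": {"arm_joints": joints[None, None]},     # (1, 1, 7)
        "language": {"annotation.human.task_description": [[instruction]]},
    }


obs = build_observation(
    np.zeros((224, 224, 3), dtype=np.uint8),  # placeholder frame
    np.zeros(7),                              # placeholder joint readings
    "Pick up the red cube",
)
print(obs["video"]["front_cam"].shape)   # (1, 1, 224, 224, 3)
print(obs["state"]["arm_joints"].shape)  # (1, 1, 7)
```
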
## Performance

| Metric | Value |
|--------|-------|
| Inference Speed | ~40ms/step (A100) |
| VRAM Usage | ~44GB / 80GB |
| Training Time | ~6 hours (30K steps) |

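These figures translate into two rough rates. The sketch below assumes one inference call per control step; if the policy returns several actions per call, the effective control rate would be higher. The constants are the approximate values from the table, so the results are estimates, not measurements.

```python
INFERENCE_MS_PER_STEP = 40.0  # approximate, from the table above
TRAINING_HOURS = 6.0          # approximate, from the table above
TRAINING_STEPS = 30_000


def max_control_rate_hz(latency_ms):
    """Upper bound on control frequency with one inference call per step."""
    return 1000.0 / latency_ms


def seconds_per_training_step(hours, steps):
    """Average wall-clock time per optimizer step."""
    return hours * 3600.0 / steps


print(max_control_rate_hz(INFERENCE_MS_PER_STEP))                 # 25.0
print(seconds_per_training_step(TRAINING_HOURS, TRAINING_STEPS))  # 0.72
```
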
## Limitations

- **Simulation-trained**: Primarily trained on MuJoCo simulation data
- **Single-arm**: Currently supports right arm manipulation only
- **Fixed camera setup**: Expects front camera input at 224×224 resolution
- **Task scope**: Optimized for pick-and-place; may not generalize to other manipulation tasks

## Ethical Considerations

- This model is intended for research purposes
- Human supervision is recommended for real robot deployment
- Not intended for safety-critical applications without extensive testing

## Citation

If you use this model, please cite:

```bibtex
@misc{nova2025,
  title={NOVA: Neural Open Vision Actions},
  author={ganatrask},
  year={2025},
  publisher={HuggingFace},
  url={https://huggingface.co/ganatrask/NOVA}
}
```

## Acknowledgments

- **[NVIDIA](https://developer.nvidia.com/)** - GR00T N1.6 base model
- **[Pollen Robotics](https://www.pollen-robotics.com/)** - Reachy 2 robot
- **[HuggingFace](https://huggingface.co/)** - LeRobot framework
- **[VESSL AI](https://vessl.ai/)** - GPU compute for training

## License

This model inherits the [NVIDIA Open Model License](https://developer.nvidia.com/open-model-license) from the base GR00T N1.6 model.

## Links

- **GitHub**: [ganatrask/NOVA](https://github.com/ganatrask/NOVA)
- **Dataset**: [ganatrask/NOVA](https://huggingface.co/datasets/ganatrask/NOVA)
- **Base Model**: [nvidia/GR00T-N1.6-3B](https://huggingface.co/nvidia/GR00T-N1.6-3B)