---
license: other
license_name: nvidia-open-model-license
license_link: https://developer.nvidia.com/open-model-license
language:
  - en
library_name: transformers
tags:
  - robotics
  - vision-language-action
  - manipulation
  - gr00t
  - nvidia
  - physical-ai
  - humanoid
  - reachy2
  - lerobot
datasets:
  - ganatrask/NOVA
base_model:
  - nvidia/GR00T-N1.6-3B
pipeline_tag: robotics
---

# NOVA Model - GR00T N1.6 Fine-tuned for Reachy 2

<p align="center">
  <img src="https://img.shields.io/badge/NVIDIA-GR00T%20N1.6-76B900?style=for-the-badge&logo=nvidia" alt="GR00T N1.6"/>
  <img src="https://img.shields.io/badge/Robot-Reachy%202-0066CC?style=for-the-badge" alt="Reachy 2"/>
  <img src="https://img.shields.io/badge/Task-Pick%20%26%20Place-green?style=for-the-badge" alt="Pick & Place"/>
</p>

**NOVA** (Neural Open Vision Actions) is a fine-tuned version of NVIDIA's GR00T N1.6 vision-language-action model, trained specifically for [Pollen Robotics' Reachy 2](https://www.pollen-robotics.com/reachy/) humanoid robot.

## Model Description

This model is part of an end-to-end Physical AI pipeline that combines:
- **Voice Input**: Parakeet CTC 0.6B for speech-to-text
- **Scene Reasoning**: Cosmos Reason 2 for object detection and spatial understanding
- **Action Policy**: This fine-tuned GR00T N1.6 model for manipulation

### Model Details

| Property | Value |
|----------|-------|
| **Base Model** | [nvidia/GR00T-N1.6-3B](https://huggingface.co/nvidia/GR00T-N1.6-3B) |
| **Parameters** | ~3B |
| **Embodiment** | Reachy 2 (custom embodiment tag) |
| **Action Space** | 8-DOF (7 arm joints + gripper) |
| **Training Steps** | 30,000 |
| **Final Loss** | ~0.008-0.01 |

### Action Space

```python
action = [
    shoulder_pitch,  # -180° to 90°
    shoulder_roll,   # -180° to 10°
    elbow_yaw,       # -90° to 90°
    elbow_pitch,     # -125° to 0°
    wrist_roll,      # -100° to 100°
    wrist_pitch,     # -45° to 45°
    wrist_yaw,       # -30° to 30°
    gripper,         # 0 (closed) to 1 (open)
]
```
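Raw policy outputs can drift slightly outside these ranges, so it is worth clipping them before sending commands to the robot. A minimal sketch using the limits from the layout above (the `clamp_action` helper is illustrative, not part of this repository):

```python
import numpy as np

# Joint limits in degrees, matching the 8-DOF action layout above
# (the gripper channel is unitless: 0 = closed, 1 = open).
ACTION_LOW = np.array([-180, -180, -90, -125, -100, -45, -30, 0], dtype=np.float32)
ACTION_HIGH = np.array([90, 10, 90, 0, 100, 45, 30, 1], dtype=np.float32)

def clamp_action(action: np.ndarray) -> np.ndarray:
    """Clip a raw 8-DOF action vector to the Reachy 2 joint limits."""
    return np.clip(action, ACTION_LOW, ACTION_HIGH)
```

For example, a shoulder-pitch command of 200° would be clipped to the 90° upper limit before execution.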

## Intended Use

This model is designed for:
- **Pick-and-place manipulation** tasks on Reachy 2 robot
- **Language-conditioned control** ("Pick up the red cube")
- **Research** in vision-language-action models and robotic manipulation

### Supported Tasks

- Pick up objects (cube, cylinder, capsule, rectangular box)
- Place objects in target locations
- Handle 8 color variations (red, green, blue, yellow, cyan, magenta, orange, purple)
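The 4 objects × 8 colors above yield the 32 task variations the model was trained on. A quick way to enumerate them (the instruction template here is illustrative; the dataset's exact phrasing may differ):

```python
# Enumerate the 32 task variations (4 objects x 8 colors).
OBJECTS = ["cube", "cylinder", "capsule", "rectangular box"]
COLORS = ["red", "green", "blue", "yellow", "cyan", "magenta", "orange", "purple"]

tasks = [f"Pick up the {color} {obj}" for obj in OBJECTS for color in COLORS]
print(len(tasks))  # 32
```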

## Training

### Training Data

Trained on the [ganatrask/NOVA dataset](https://huggingface.co/datasets/ganatrask/NOVA):
- **100 episodes** of expert demonstrations
- **32 task variations** (4 objects × 8 colors)
- Domain randomization (position, lighting, camera jitter)
- LeRobot v2.1 format

### Training Configuration

| Parameter | Value |
|-----------|-------|
| GPU | NVIDIA A100-SXM4-80GB |
| GPUs | 2 |
| Batch Size | 64 |
| Max Steps | 30,000 |
| Save Steps | 3,000 |
| Video Backend | decord |

### Training Command

```bash
python -m gr00t.train \
    --dataset_repo_id ganatrask/NOVA \
    --embodiment_tag reachy2 \
    --video_backend decord \
    --num_gpus 2 \
    --batch_size 64 \
    --max_steps 30000 \
    --save_steps 3000 \
    --output_dir ./checkpoints/groot-reachy2
```

## Usage

### Prerequisites

You need to apply a patch to Isaac-GR00T to add the Reachy 2 embodiment tag:

```bash
cd Isaac-GR00T
patch -p1 < ../patches/add_reachy2_embodiment.patch
```

### Inference

```python
from gr00t.data.embodiment_tags import EmbodimentTag
from gr00t.policy.gr00t_policy import Gr00tPolicy
import importlib.util

# Load the Reachy 2 modality config first (executing the module registers it with GR00T)
spec = importlib.util.spec_from_file_location(
    "modality_config",
    "configs/reachy2_modality_config.py"
)
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)

# Load policy
policy = Gr00tPolicy(
    embodiment_tag=EmbodimentTag.REACHY2,
    model_path="ganatrask/NOVA",  # or local checkpoint path
    device="cuda",
    strict=True,
)

# Run inference
obs = {
    "video": {"front_cam": image[None, None, :, :, :]},  # (1, 1, H, W, 3)
    "state": {"arm_joints": joints[None, None, :]},      # (1, 1, 7)
    "language": {"annotation.human.task_description": [["Pick up the red cube"]]},
}
action, _ = policy.get_action(obs)
```
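The observation dict above is easy to get wrong because of the leading batch and time dimensions. A small helper that assembles it from raw arrays (this `build_obs` function is illustrative, not part of this repository; it mirrors the shapes in the snippet above):

```python
import numpy as np

def build_obs(frame: np.ndarray, joints: np.ndarray, instruction: str) -> dict:
    """Assemble a GR00T observation dict with the expected batch/time dims.

    frame:  (H, W, 3) front-camera image (the model expects 224x224)
    joints: (7,) current right-arm joint positions
    """
    assert frame.ndim == 3 and frame.shape[-1] == 3, "frame must be (H, W, 3)"
    assert joints.shape == (7,), "joints must be (7,)"
    return {
        "video": {"front_cam": frame[None, None]},     # (1, 1, H, W, 3)
        "state": {"arm_joints": joints[None, None]},   # (1, 1, 7)
        "language": {"annotation.human.task_description": [[instruction]]},
    }
```

With this in place, `policy.get_action(build_obs(frame, joints, "Pick up the red cube"))` replaces the hand-built dict.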

## Performance

| Metric | Value |
|--------|-------|
| Inference Speed | ~40ms/step (A100) |
| VRAM Usage | ~44GB / 80GB |
| Training Time | ~6 hours (30K steps) |

## Limitations

- **Simulation-trained**: Primarily trained on MuJoCo simulation data
- **Single-arm**: Currently supports right arm manipulation only
- **Fixed camera setup**: Expects front camera input at 224×224 resolution
- **Task scope**: Optimized for pick-and-place; may not generalize to other manipulation tasks

## Ethical Considerations

- This model is intended for research use
- Human supervision is recommended for real-robot deployment
- Not intended for safety-critical applications without extensive testing

## Citation

If you use this model, please cite:

```bibtex
@misc{nova2025,
  title={NOVA: Neural Open Vision Actions},
  author={ganatrask},
  year={2025},
  publisher={HuggingFace},
  url={https://huggingface.co/ganatrask/NOVA}
}
```

## Acknowledgments

- **[NVIDIA](https://developer.nvidia.com/)** - GR00T N1.6 base model
- **[Pollen Robotics](https://www.pollen-robotics.com/)** - Reachy 2 robot
- **[HuggingFace](https://huggingface.co/)** - LeRobot framework
- **[VESSL AI](https://vessl.ai/)** - GPU compute for training

## License

This model inherits the [NVIDIA Open Model License](https://developer.nvidia.com/open-model-license) from the base GR00T N1.6 model.

## Links

- **GitHub**: [ganatrask/NOVA](https://github.com/ganatrask/NOVA)
- **Dataset**: [ganatrask/NOVA](https://huggingface.co/datasets/ganatrask/NOVA)
- **Base Model**: [nvidia/GR00T-N1.6-3B](https://huggingface.co/nvidia/GR00T-N1.6-3B)