---
license: other
license_name: nvidia-open-model-license
license_link: https://developer.nvidia.com/open-model-license
language:
  - en
library_name: transformers
tags:
  - robotics
  - vision-language-action
  - manipulation
  - gr00t
  - nvidia
  - physical-ai
  - humanoid
  - reachy2
  - lerobot
datasets:
  - ganatrask/NOVA
base_model:
  - nvidia/GR00T-N1.6-3B
pipeline_tag: robotics
---

# NOVA Model - GR00T N1.6 Fine-tuned for Reachy 2

<p align="center">
  <img src="https://img.shields.io/badge/NVIDIA-GR00T%20N1.6-76B900?style=for-the-badge&logo=nvidia" alt="GR00T N1.6"/>
  <img src="https://img.shields.io/badge/Robot-Reachy%202-0066CC?style=for-the-badge" alt="Reachy 2"/>
  <img src="https://img.shields.io/badge/Task-Pick%20%26%20Place-green?style=for-the-badge" alt="Pick & Place"/>
</p>

**NOVA** (Neural Open Vision Actions) is a fine-tuned version of NVIDIA's GR00T N1.6 vision-language-action model, trained specifically for [Pollen Robotics' Reachy 2](https://www.pollen-robotics.com/reachy/) humanoid robot.

## Model Description

This model is part of an end-to-end Physical AI pipeline that combines:
- **Voice Input**: Parakeet CTC 0.6B for speech-to-text
- **Scene Reasoning**: Cosmos Reason 2 for object detection and spatial understanding
- **Action Policy**: This fine-tuned GR00T N1.6 model for manipulation

### Model Details

| Property | Value |
|----------|-------|
| **Base Model** | [nvidia/GR00T-N1.6-3B](https://huggingface.co/nvidia/GR00T-N1.6-3B) |
| **Parameters** | ~3B |
| **Embodiment** | Reachy 2 (custom embodiment tag) |
| **Action Space** | 8-DOF (7 arm joints + gripper) |
| **Training Steps** | 30,000 |
| **Final Loss** | ~0.008-0.01 |

### Action Space

```python
action = [
    shoulder_pitch,  # -180° to 90°
    shoulder_roll,   # -180° to 10°
    elbow_yaw,       # -90° to 90°
    elbow_pitch,     # -125° to 0°
    wrist_roll,      # -100° to 100°
    wrist_pitch,     # -45° to 45°
    wrist_yaw,       # -30° to 30°
    gripper,         # 0 (closed) to 1 (open)
]
```
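Raw policy outputs can drift slightly outside these ranges, so it is worth clipping them before sending commands to the robot. A minimal sketch using the limits from the layout above (the `clamp_action` helper is illustrative, not part of this repository):

```python
import numpy as np

# Joint limits in degrees, matching the 8-DOF action layout above
# (the gripper channel is unitless: 0 = closed, 1 = open).
ACTION_LOW = np.array([-180, -180, -90, -125, -100, -45, -30, 0], dtype=np.float32)
ACTION_HIGH = np.array([90, 10, 90, 0, 100, 45, 30, 1], dtype=np.float32)

def clamp_action(action: np.ndarray) -> np.ndarray:
    """Clip a raw 8-DOF action vector to the Reachy 2 joint limits."""
    return np.clip(action, ACTION_LOW, ACTION_HIGH)
```

For example, a shoulder-pitch command of 200° would be clipped to the 90° upper limit before execution.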

## Intended Use

This model is designed for:
- **Pick-and-place manipulation** tasks on Reachy 2 robot
- **Language-conditioned control** ("Pick up the red cube")
- **Research** in vision-language-action models and robotic manipulation

### Supported Tasks

- Pick up objects (cube, cylinder, capsule, rectangular box)
- Place objects in target locations
- Handle 8 color variations (red, green, blue, yellow, cyan, magenta, orange, purple)
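The 4 objects × 8 colors above yield the 32 task variations the model was trained on. A quick way to enumerate them (the instruction template here is illustrative; the dataset's exact phrasing may differ):

```python
# Enumerate the 32 task variations (4 objects x 8 colors).
OBJECTS = ["cube", "cylinder", "capsule", "rectangular box"]
COLORS = ["red", "green", "blue", "yellow", "cyan", "magenta", "orange", "purple"]

tasks = [f"Pick up the {color} {obj}" for obj in OBJECTS for color in COLORS]
print(len(tasks))  # 32
```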

## Training

### Training Data

Trained on the [ganatrask/NOVA dataset](https://huggingface.co/datasets/ganatrask/NOVA):
- **100 episodes** of expert demonstrations
- **32 task variations** (4 objects × 8 colors)
- Domain randomization (position, lighting, camera jitter)
- LeRobot v2.1 format

### Training Configuration

| Parameter | Value |
|-----------|-------|
| GPU | NVIDIA A100-SXM4-80GB |
| GPUs | 2 |
| Batch Size | 64 |
| Max Steps | 30,000 |
| Save Steps | 3,000 |
| Video Backend | decord |

### Training Command

```bash
python -m gr00t.train \
    --dataset_repo_id ganatrask/NOVA \
    --embodiment_tag reachy2 \
    --video_backend decord \
    --num_gpus 2 \
    --batch_size 64 \
    --max_steps 30000 \
    --save_steps 3000 \
    --output_dir ./checkpoints/groot-reachy2
```

## Usage

### Prerequisites

You need to apply a patch to Isaac-GR00T to add the Reachy 2 embodiment tag:

```bash
cd Isaac-GR00T
patch -p1 < ../patches/add_reachy2_embodiment.patch
```

### Inference

```python
from gr00t.data.embodiment_tags import EmbodimentTag
from gr00t.policy.gr00t_policy import Gr00tPolicy
import importlib.util

# Load the Reachy 2 modality config first (executing the module registers it with GR00T)
spec = importlib.util.spec_from_file_location(
    "modality_config",
    "configs/reachy2_modality_config.py"
)
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)

# Load policy
policy = Gr00tPolicy(
    embodiment_tag=EmbodimentTag.REACHY2,
    model_path="ganatrask/NOVA",  # or local checkpoint path
    device="cuda",
    strict=True,
)

# Run inference
obs = {
    "video": {"front_cam": image[None, None, :, :, :]},  # (1, 1, H, W, 3)
    "state": {"arm_joints": joints[None, None, :]},      # (1, 1, 7)
    "language": {"annotation.human.task_description": [["Pick up the red cube"]]},
}
action, _ = policy.get_action(obs)
```
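The observation dict above is easy to get wrong because of the leading batch and time dimensions. A small helper that assembles it from raw arrays (this `build_obs` function is illustrative, not part of this repository; it mirrors the shapes in the snippet above):

```python
import numpy as np

def build_obs(frame: np.ndarray, joints: np.ndarray, instruction: str) -> dict:
    """Assemble a GR00T observation dict with the expected batch/time dims.

    frame:  (H, W, 3) front-camera image (the model expects 224x224)
    joints: (7,) current right-arm joint positions
    """
    assert frame.ndim == 3 and frame.shape[-1] == 3, "frame must be (H, W, 3)"
    assert joints.shape == (7,), "joints must be (7,)"
    return {
        "video": {"front_cam": frame[None, None]},     # (1, 1, H, W, 3)
        "state": {"arm_joints": joints[None, None]},   # (1, 1, 7)
        "language": {"annotation.human.task_description": [[instruction]]},
    }
```

With this in place, `policy.get_action(build_obs(frame, joints, "Pick up the red cube"))` replaces the hand-built dict.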

## Performance

| Metric | Value |
|--------|-------|
| Inference Speed | ~40ms/step (A100) |
| VRAM Usage | ~44GB / 80GB |
| Training Time | ~6 hours (30K steps) |

## Limitations

- **Simulation-trained**: Primarily trained on MuJoCo simulation data
- **Single-arm**: Currently supports right arm manipulation only
- **Fixed camera setup**: Expects front camera input at 224×224 resolution
- **Task scope**: Optimized for pick-and-place; may not generalize to other manipulation tasks

## Ethical Considerations

- This model is intended for research use
- Human supervision is recommended for real-robot deployment
- Not intended for safety-critical applications without extensive testing

## Citation

If you use this model, please cite:

```bibtex
@misc{nova2025,
  title={NOVA: Neural Open Vision Actions},
  author={ganatrask},
  year={2025},
  publisher={HuggingFace},
  url={https://huggingface.co/ganatrask/NOVA}
}
```

## Acknowledgments

- **[NVIDIA](https://developer.nvidia.com/)** - GR00T N1.6 base model
- **[Pollen Robotics](https://www.pollen-robotics.com/)** - Reachy 2 robot
- **[HuggingFace](https://huggingface.co/)** - LeRobot framework
- **[VESSL AI](https://vessl.ai/)** - GPU compute for training

## License

This model inherits the [NVIDIA Open Model License](https://developer.nvidia.com/open-model-license) from the base GR00T N1.6 model.

## Links

- **GitHub**: [ganatrask/NOVA](https://github.com/ganatrask/NOVA)
- **Dataset**: [ganatrask/NOVA](https://huggingface.co/datasets/ganatrask/NOVA)
- **Base Model**: [nvidia/GR00T-N1.6-3B](https://huggingface.co/nvidia/GR00T-N1.6-3B)