Ryukijano committed
Commit de97b76 · verified · 1 Parent(s): 155f358

Upload README.md with huggingface_hub

Files changed (1):
  1. README.md +56 -307
README.md CHANGED
@@ -1,334 +1,83 @@
  ---
  language:
  - en
- license: mit
- library_name: transformers
- pipeline_tag: robotics
- datasets:
- - lerobot/robot_sim.PickNPlace
- - lerobot/so100_strawberry_grape
- base_model: NVEagle/eagle_er-qwen3_1_7B-Siglip2_400M_stage1_5_128gpu_er_v7_1mlp_nops
  tags:
  - robotics
- - vision-language-action
- - reinforcement-learning
  - imitation-learning
- - nvidia
- - gr00t
- - gemma
  - diffusion-policy
- - lerobot
- - robot-learning
- - embodied-ai
- - humanoid-robots
- - robot-manipulation
- - computer-vision
- - natural-language-processing
- - deep-learning
- - transformer
- - vision-transformer
- - flow-matching
- - foundation-model
- - multi-modal
- - human-robot-interaction
- - autonomous-robots
- - robot-control
- - robot-perception
- - robot-vision
  ---
 
- # Gemma-GR00T: A Vision-Language-Action Model for Robotic Control
-
- This is a fine-tuned version of the NVIDIA GR00T N1.5 model, adapted for robotic control tasks using the LeRobot framework. The model combines vision, language, and action generation capabilities to enable robots to perform complex manipulation tasks based on natural language instructions.
-
- ## Model Description
-
- Gemma-GR00T is a state-of-the-art multimodal vision-language-action policy that combines Google's Gemma language model with NVIDIA's GR00T robotics framework. This model is specifically designed for advanced robotic manipulation tasks, enabling robots to understand natural language instructions, perceive their environment through vision, and perform precise manipulation actions.
-
- ## Model Details
-
- - **Model type:** Vision-Language-Action (VLA) model
- - **Base Model:** [NVEagle/eagle_er-qwen3_1_7B-Siglip2_400M_stage1_5_128gpu_er_v7_1mlp_nops](https://huggingface.co/NVEagle/eagle_er-qwen3_1_7B-Siglip2_400M_stage1_5_128gpu_er_v7_1mlp_nops)
- - **Task:** text-to-video (robot action generation from vision and language)
- - **Training Data:** Trained on LeRobot datasets using the `fourier_gr1_arms_only` configuration
- - **Framework:** PyTorch with Hugging Face Transformers
- - **Related Models:** [NVIDIA GR00T-N1.5-3B](https://huggingface.co/nvidia/GR00T-N1.5-3B), [LeRobot Models](https://huggingface.co/lerobot)
- - **Related Datasets:** [LeRobot Datasets](https://huggingface.co/lerobot/datasets)
-
- ### Model Architecture
-
- The model is built on a sophisticated multimodal architecture that combines state-of-the-art vision and language models for robotic control:
-
- 1. **Backbone**: `Eagle2_5_VLForConditionalGeneration`
-    - A powerful vision-language model that processes both visual and textual inputs
-    - Integrates vision and language representations for multimodal understanding
-
- 2. **Text Encoder**: `Qwen3-1.7B`
-    - Base Model: [Qwen/Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B)
-    - Type: Causal Language Model
-    - Parameters: 1.7B
-    - Layers: 28
-    - Attention: 16 heads for Q, 8 heads for KV (GQA)
-    - Context Length: 32,768 tokens
-    - Features:
-      - Strong reasoning and instruction-following capabilities
-      - Optimized for long-context understanding
-      - Supports complex language understanding and generation
-
- 3. **Vision Encoder**: `SigLIP` (Sigmoid Loss for Language-Image Pre-training)
-    - Base Model: [google/siglip-base-patch16-224](https://huggingface.co/google/siglip-base-patch16-224)
-    - Type: Vision Transformer (ViT)
-    - Patch Size: 16x16
-    - Image Size: 224x224
-    - Hidden Size: 768
-    - Layers: 12
-    - Attention Heads: 12
-    - Features:
-      - Strong visual representation learning
-      - Excellent zero-shot classification capabilities
-      - Robust to various visual domains
-
- 4. **Action Head**: Diffusion-based Policy
-    - Type: Flow-matching action head
-    - Architecture: 4-layer transformer (ScaledDP)
-    - Hidden Size: 512
-    - Feed-Forward Size: 2,048
-    - Attention Heads: 8
-    - Features:
-      - Generates smooth, continuous actions for robotic control
-      - Uses diffusion process for action generation
-
- ## Training & Evaluation
-
- ### Training Performance
-
- - **Total Training Steps**: 30,000
- - **Final Epoch**: 114.5
- - **Initial Loss**: 1.27
- - **Final Loss**: 0.11
- - **Learning Rate**: Warmup to 1e-5 with gradual decay
- - **Gradient Norm**: Stabilized around 0.3-1.0 (initial: 11.1)
 
- ### Recommended Evaluation Metrics
-
- #### Task Performance
- - **Success Rate**: Percentage of successful task completions
- - **Path Length**: Efficiency of movement (shorter paths are better)
- - **Smoothness**: L2 norm of action derivatives (lower is smoother)
- - **Goal Distance**: Final distance to target position
- - **Success Rate at k (SR@k)**: Success rate within k attempts
-
- #### Model Accuracy
- - **Action MSE**: Mean squared error of predicted vs. ground truth actions
- - **Per-Joint Position Error**: Error for each degree of freedom
- - **Gripper Accuracy**: Binary classification of gripper state
- - **Trajectory Error**: Dynamic Time Warping (DTW) distance from reference
-
- #### System Efficiency
- - **Inference Time**: Per-step latency (ms)
- - **Memory Usage**: Peak GPU memory consumption (GB)
- - **FLOPS**: Computational requirements
- - **Throughput**: Steps/second during inference
-
- #### Robustness
- - **Success Rate under Noise**: Performance with added sensor noise
- - **Generalization**: Performance on unseen objects/scenes
- - **Failure Mode Analysis**: Categorization of common failures
- - **Recovery Rate**: Ability to recover from perturbations
-
- ### Evaluation Protocol
-
- 1. **Test Environments**
-    - Fixed initial conditions
-    - Multiple random seeds (recommended: 5+)
-    - Human baseline comparison
-    - Ablation studies
-
- 2. **Visualization**
-    - Trajectory plots (ground truth vs predicted)
-    - Attention heatmaps
-    - Failure case analysis
-    - Action distribution plots
-
- 3. **Reporting**
-    - Mean and standard deviation across seeds
-    - Statistical significance testing
-    - Compute requirements (GPU hours, memory)
-    - Hyperparameter sensitivity analysis
-    - Processes both visual and language conditioning
-
- 5. **Training Configuration**:
-    - Optimizer: AdamW (lr=1e-4, weight_decay=1e-6)
-    - Diffusion Steps: 100
-    - Chunk Size: 16
-    - Action Steps: 8
-    - Observation Steps: 1
-
- The model processes visual inputs through the SigLIP vision encoder and textual instructions through the Qwen3-1.7B language model, then fuses these representations in the Eagle2.5 backbone to generate precise control actions via the diffusion-based policy head. The architecture is specifically designed for real-time robotic control with low-latency inference.
-
- ## Uses
-
- ### Direct Use
-
- This model is part of the [Gemma-GR00T](https://github.com/Ryukijano/Gemma-Grook) project and is designed for research and development of robotic manipulation systems. It can be used for:
-
- - Robotic arm manipulation tasks (pick-and-place, assembly, etc.)
- - Sim-to-real transfer learning in robotics
- - Multimodal robotic control with natural language instructions
- - Research in reinforcement and imitation learning for robotics
- - Integration with the [LeRobot](https://github.com/huggingface/lerobot) ecosystem
-
- ### Related Projects
-
- - [LeRobot](https://github.com/huggingface/lerobot): The base framework used for training
- - [GR00T](https://developer.nvidia.com/gr00t): NVIDIA's foundation model for humanoid robots
- - [Gemma](https://huggingface.co/google/gemma-7b): The language model backbone
-
- ### Out-of-Scope Use
-
- This model is not intended for:
- - Critical systems where failure could lead to harm
- - Applications without proper safety measures
- - Real-time control without thorough testing
- - Non-robotic applications
-
- ## How to Use
-
- ### Installation
-
- ```bash
- pip install -r requirements.txt
- ```
-
- ### Loading the Model
 
  ```python
- from transformers import AutoModelForCausalLM, AutoConfig
 
- # Load the model
- model = AutoModelForCausalLM.from_pretrained("path/to/exported_weights")
  ```
 
- ### Inference Example
-
- ```python
- # Example code for running inference with the model
- import torch
-
- def run_inference(observation, language_instruction):
-     # Preprocess observation and instruction
-     inputs = preprocess(observation, language_instruction)
-
-     # Run model inference
-     with torch.no_grad():
-         actions = model(**inputs)
-
-     return actions
  ```
 
- ## Training Details
-
- ### Training Data
-
- This model was trained using the [LeRobot](https://github.com/huggingface/lerobot) framework, which provides standardized datasets and tools for robotic learning. The training utilized the following configuration:
-
- - **Primary Datasets:**
-   - `lerobot/robot_sim.PickNPlace`: Simulated pick and place tasks
-   - `lerobot/so100_strawberry_grape`: Real-world manipulation tasks
- - **Data Configuration:** `fourier_gr1_arms_only`
- - **Dataset Documentation:** [LeRobot Datasets](https://huggingface.co/lerobot/datasets)
- - **Data Processing:** Follows LeRobot's standardized data pipeline for consistency with other models in the ecosystem
- - **Environment:** [Isaac Sim](https://developer.nvidia.com/isaac-sim)
- - **Training Steps:** 30,000
- - **Batch Size:** 32
- - **Learning Rate:** 1e-4
- - **Optimizer:** AdamW
- - **Weight Decay:** 1e-5
- - **Warmup Ratio:** 0.05
- - **Hardware:** 3× NVIDIA L40S GPUs
- - **Framework:** PyTorch with Hugging Face Transformers
-
- ### Data Processing
-
- The model processes the following modalities from the LeRobot dataset:
- - **Visual Inputs:** Processed through a vision encoder
- - **Proprioception:** Arm joint states and gripper status
- - **Actions:** 32-dimensional continuous action space
- - **Language Instructions:** Natural language task descriptions
-
- ### Training Procedure
-
- The model was trained using a combination of:
- - Imitation learning from demonstration data
- - Reinforcement learning with PPO
- - Behavior cloning
-
- ## Evaluation
-
- ### Metrics
-
- - **Success Rate:** 85% on validation tasks
- - **Task Completion:** 90% of tasks completed successfully
- - **Generalization:** 75% success on unseen objects
 
- ### Results
-
- | Task             | Success Rate |
- |------------------|-------------:|
- | Pick and Place   |          88% |
- | Object Stacking  |          83% |
- | Tool Use         |          79% |
- | Multi-step Tasks |          72% |
-
- ## Limitations and Bias
-
- - The model's performance is highly dependent on the quality and diversity of the training data.
- - May not generalize well to completely novel objects or environments.
- - Performance may degrade in cluttered or highly dynamic environments.
- - Safety mechanisms should be implemented for real-world deployment.
-
- ## Environmental Impact
-
- - **Carbon Emissions:** Estimated 120 kg CO2eq
- - **Hardware Type:** NVIDIA L40S GPUs
- - **Hours used:** 240
- - **Cloud Provider:** Private cluster
- - **Compute Region:** UK
- - **Energy Mix:** 40% renewable
-
- ## Technical Specifications
-
- ### Model Architecture
-
- - **Parameters:** 1.7B
- - **Layers:** 16
- - **Attention Heads:** 32
- - **Hidden Size:** 2048
- - **Context Length:** 2048 tokens
-
- ### Hardware and Software
-
- - **Training Hardware:** 3× NVIDIA L40S GPUs
- - **Inference Hardware:** NVIDIA L4 or better
- - **Framework:** PyTorch 2.7.1+
- - **CUDA Version:** 12.4
 
  ## Citation
-
- ```bibtex
- @misc{gemmagroot2024,
-   title={Gemma-GR00T: Multimodal Robotic Manipulation with Language Models},
-   author={Your Name},
-   year={2024},
-   publisher={GitHub},
-   howpublished={\url{https://github.com/Ryukijano/Gemma-Grook}},
- }
- ```
-
- ## Model Card Contact
-
- For questions or comments about this model, please open an issue in the [GitHub repository](https://github.com/Ryukijano/Gemma-Grook/issues).
-
- ## License
-
- This model is licensed under the MIT License. See the [LICENSE](LICENSE) file for more details.
 
  ---
+ license: apache-2.0
  language:
  - en
  tags:
  - robotics
+ - vla
+ - lerobot
  - imitation-learning
  - diffusion-policy
+ - gemma-3
+ - siglip
+ - scaledp
+ - multimodal
  ---
 
+ # Gemma-Le: SigLIP + Gemma 3 + ScaleDP (LeRobot VLA Policy)
 
+ Gemma-Le is a compact Vision-Language-Action policy for robotic manipulation built on top of LeRobot.
+ It replaces the previous NV Eagle/EagleBackbone stack with:
 
+ - SigLIP `siglip-so400m-patch14-384` as the vision encoder
+ - Gemma 3 `gemma-3-4b-it` as the language/reasoning encoder (with LoRA PEFT)
+ - ScaleDP (Scalable Diffusion Transformer) as the action head for denoising-based action generation
 
+ This repo hosts the exported checkpoints trained on LeRobot-format datasets (e.g., `robot_sim.PickNPlace`).
 
+ ## Architecture at a glance
+ - Vision: SigLIP ViT encoder (384px, patch14), pooled embedding
+ - Text: Gemma 3 4B-IT, mean-pooled hidden states, LoRA on q/k/v/o proj (rank=16)
+ - Fusion: Linear/MLP fusion of vision + text to a conditioning vector (default 768)
+ - Action head: ScaleDP Transformer (layers=12, d_model=320, heads=8, ff=1280) producing diffusion noise over T steps (default 50)
+ - Temporal context: chunk_size=8 (actions conditioned on short history)
+ - Mixed precision: AMP (bf16/fp16) selected dynamically for stability
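Taken together, the bullets above describe one forward pass: pool the two encoder outputs, fuse them into a single conditioning vector, and let the ScaleDP transformer predict diffusion noise for an action chunk. A minimal stand-alone sketch of that flow (illustrative only, not the repo's module code; the encoder widths 1152/2560 and the 32-d action vector are assumptions, while cond=768, d_model=320, 12 layers, 8 heads, ff=1280, chunk=8 follow the list above):

```python
import torch
import torch.nn as nn

class GemmaLeSketch(nn.Module):
    """Hedged sketch of the fusion + ScaleDP denoiser described in this card."""
    def __init__(self, vis_dim=1152, txt_dim=2560, cond_dim=768,
                 d_model=320, n_layers=12, n_heads=8, ff=1280,
                 action_dim=32, chunk_size=8):
        super().__init__()
        self.fuse = nn.Linear(vis_dim + txt_dim, cond_dim)    # vision + text -> cond vector
        self.cond_proj = nn.Linear(cond_dim, d_model)         # broadcast cond onto tokens
        self.act_in = nn.Linear(action_dim, d_model)          # embed the noisy action chunk
        layer = nn.TransformerEncoderLayer(d_model, n_heads, ff, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)  # 12-layer ScaleDP-style trunk
        self.noise_out = nn.Linear(d_model, action_dim)       # per-step noise prediction

    def forward(self, vis_emb, txt_emb, noisy_actions):
        cond = self.fuse(torch.cat([vis_emb, txt_emb], dim=-1))             # (B, 768)
        tokens = self.act_in(noisy_actions) + self.cond_proj(cond)[:, None]  # (B, chunk, 320)
        return self.noise_out(self.blocks(tokens))                           # (B, chunk, 32)

model = GemmaLeSketch()
eps_hat = model(torch.randn(2, 1152), torch.randn(2, 2560), torch.randn(2, 8, 32))
print(tuple(eps_hat.shape))  # (2, 8, 32)
```

During training such a head is optimized with an MSE between the predicted noise and the noise actually added to the ground-truth action chunk, as in standard diffusion policies.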
 
+ Compared to prior NV Eagle-based setups, Gemma-Le:
+ - Removes EagleBackbone and NV-specific multi-modal blocks
+ - Uses standard Hugging Face SigLIP and Gemma 3 components
+ - Trains an explicit diffusion policy head (ScaleDP) for smooth action generation
 
+ ## Files
+ - `model.safetensors`: weights of the Gemma-Le policy (vision + text adapters + action head)
+ - `config.json`: policy/configuration metadata
+ - `train_config.json`: training run metadata (steps, scheduler, etc.)
 
+ ## Usage (with this repo’s LeRobot fork)
+ Install the dependencies and set `PYTHONPATH` to include the `lerobot` directory of this repository.
 
+ Example evaluation-style load (pseudo-code):
  ```python
+ from lerobot.common.policies.gemma_le.configuration_gemma_le import GemmaLeConfig
+ from lerobot.common.policies.gemma_le.modeling_gemma_le import GemmaLePolicy
+ from huggingface_hub import snapshot_download
+
+ ckpt_dir = snapshot_download(repo_id="Ryukijano/gemma-groot", revision="main")
+ policy = GemmaLePolicy.from_pretrained(ckpt_dir, torch_dtype="bfloat16")
+ policy.eval()
  ```
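Once loaded, stepping the policy in a closed loop would look roughly like this (pseudo-code in the same spirit as the snippet above; the `select_action` call and the batch keys follow LeRobot's usual policy interface and are assumptions about this fork):

```
# Pseudo-code: one control step. Batch keys mirror the dataset's feature names.
batch = {
    "observation.images.ego_view": rgb_frame,   # current camera image tensor
    "observation.state": robot_state,           # current proprioceptive state
    "task": ["pick up the object and place it in the bin"],
}
with torch.no_grad():
    action = policy.select_action(batch)        # next action to send to the robot
```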
 
+ Training entrypoint (in this repo):
+ ```bash
+ python lerobot/lerobot/scripts/train.py \
+   --policy.type gemma_le \
+   --dataset.repo_id local/robot_sim.PickNPlace \
+   --dataset.root /path/to/robot_sim.PickNPlace \
+   --dataset.episodes "[0,1,2,3,4]" \
+   --batch_size 2 \
+   --steps 60000 \
+   --save_freq 20000 \
+   --policy.vision_model_id google/siglip-so400m-patch14-384 \
+   --policy.text_model_id google/gemma-3-4b-it \
+   --policy.use_amp true
  ```
 
+ ## Checkpoints
+ Recent example: step 020000 from `2025-08-12/13-06-07_gemma_le` (uploaded here).
+ Additional runs exist under `outputs/train/2025-08-12/.../checkpoints/<step>/pretrained_model`.
 
+ ## Data
+ - Format: LeRobotDataset (parquet + video + metadata)
+ - Example: `robot_sim.PickNPlace` subset with RGB ego camera `observation.images.ego_view` and action vector `action`.
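For orientation, a single frame from such a dataset looks roughly like this (a sketch only: the keys come from the card above, while every shape and the instruction string are placeholder assumptions; the real dimensions are defined in the dataset's metadata):

```python
import torch

# Placeholder shapes; the actual dims come from the dataset's metadata files.
frame = {
    "observation.images.ego_view": torch.zeros(3, 256, 256),  # RGB ego camera (C, H, W)
    "observation.state": torch.zeros(44),                     # proprioceptive state vector
    "action": torch.zeros(32),                                # continuous action target
    "timestamp": torch.tensor(0.0),                           # seconds into the episode
    "task": "pick the fruit and place it in the basket",      # language instruction
}
print(sorted(frame))
```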
 
+ ## Notes
+ - Access to base models: `google/gemma-3-4b-it` may be gated; accept its terms of use on Hugging Face to reproduce training.
+ - Performance varies by dataset and embodiment; this is a compact 4B-LM + vision policy sized to train on 3× L40 GPUs.
+ - Intended for imitation learning; RL fine-tuning or ThinkAct-style extensions can be layered on top.
 
  ## Citation
+ If you use this model, please cite LeRobot and the base models:
+ - LeRobot: https://github.com/huggingface/lerobot
+ - Gemma 3: https://ai.google.dev/gemma
+ - SigLIP: https://arxiv.org/abs/2303.15343
+ - Diffusion Policy: https://arxiv.org/abs/2303.04137