---
license: apache-2.0
language:
- en
tags:
- robotics
- vla
- lerobot
- imitation-learning
- diffusion-policy
- gemma-3
- siglip
- scaledp
- multimodal
---

# Gemma-Le: SigLIP + Gemma 3 + ScaleDP (LeRobot VLA Policy)

Gemma-Le is a compact Vision-Language-Action policy for robotic manipulation built on top of LeRobot. It replaces NV Eagle with standard Hugging Face components:

- SigLIP `google/siglip-so400m-patch14-384` for vision
- Gemma 3 `google/gemma-3-4b-it` for language/reasoning (with LoRA PEFT)
- ScaleDP (Scalable Diffusion Transformer) as the action head

This repo hosts exported checkpoints trained on LeRobot-format datasets (e.g., `robot_sim.PickNPlace`).

## Architecture

- Vision: SigLIP ViT encoder (384px, patch14), pooled embedding
- Text: Gemma 3 4B-IT, mean-pooled hidden states
- LoRA: rank=16 on `[q_proj, k_proj, v_proj, o_proj]`
- Fusion: MLP projects [vision || text] -> `conditioning_dim=768`
- Action head: ScaleDP Transformer (layers=12, d_model=320, heads=8, ff=1280) predicts diffusion noise
- Temporal context: `chunk_size=8`; diffusion steps `num_diffusion_steps=50`
- Mixed precision: AMP auto-selects bf16/fp16; bf16 uses no GradScaler

## Default config (excerpt)

```yaml
vision_model_id: google/siglip-so400m-patch14-384
text_model_id: google/gemma-3-4b-it
image_features: ["observation.images.ego_view"]
action_feature: "action"
chunk_size: 8
num_diffusion_steps: 50
conditioning_dim: 768
plan_update_interval: 10
scaledp_num_layers: 12
scaledp_dim_model: 320
scaledp_num_heads: 8
scaledp_dim_feedforward: 1280
use_lora: true
lora_rank: 16
lora_target_modules: ["q_proj","k_proj","v_proj","o_proj"]
optimizer_lr: 1e-4
optimizer_weight_decay: 1e-6
```

## Usage (with this repo’s LeRobot fork)

Install deps and set `PYTHONPATH` to include `lerobot` in this repository.
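When exporting `PYTHONPATH` in the shell is inconvenient (for example in a notebook), the same effect can be had from inside Python. The following is a minimal sketch, assuming the fork sits in a `lerobot/` directory at the repository root; that layout is inferred from the `lerobot/lerobot/scripts/train.py` path in the training command and may differ in your checkout. `add_to_sys_path` is a hypothetical helper, not part of LeRobot.

```python
import sys
from pathlib import Path


def add_to_sys_path(package_root):
    """Prepend package_root to sys.path (idempotent) and return the resolved path."""
    root = Path(package_root).resolve()
    if not root.is_dir():
        raise FileNotFoundError(f"not a directory: {root}")
    root_str = str(root)
    # Avoid duplicate entries when the helper is called more than once.
    if root_str not in sys.path:
        sys.path.insert(0, root_str)
    return root_str


# Assumed layout: <repo>/lerobot contains the importable `lerobot` package.
# add_to_sys_path("lerobot")
```

Call the helper before the first `import lerobot...` statement; it only affects the current interpreter session.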
Evaluation-style load:

```python
import torch
from lerobot.common.policies.gemma_le.modeling_gemma_le import GemmaLePolicy
from huggingface_hub import snapshot_download

# Download the exported checkpoint files from the Hub into the local cache.
ckpt_dir = snapshot_download(repo_id="Ryukijano/gemma-groot", revision="main")

# Load the policy in bf16 and switch to inference mode.
policy = GemmaLePolicy.from_pretrained(ckpt_dir, torch_dtype=torch.bfloat16)
policy.eval()
```

Training entrypoint:

```bash
python lerobot/lerobot/scripts/train.py \
  --policy.type gemma_le \
  --dataset.repo_id local/robot_sim.PickNPlace \
  --dataset.root /path/to/robot_sim.PickNPlace \
  --dataset.episodes "[0,1,2,3,4]" \
  --batch_size 3 \
  --steps 200000 \
  --log_freq 100 \
  --save_freq 5000 \
  --policy.vision_model_id google/siglip-so400m-patch14-384 \
  --policy.text_model_id google/gemma-3-4b-it \
  --policy.use_amp true \
  --progress_bar true \
  --push_to_hub true \
  --push_repo_id Ryukijano/gemma-groot \
  --push_branch main \
  --push_exist_ok true
```

### Slurm (3× L40)

See `submit_job.sh`. Keep caches on scratch storage and set:

- `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`
- `HF_HOME`, `HUGGINGFACE_HUB_CACHE`, `TRANSFORMERS_CACHE` to scratch

## Checkpoints

- The latest runs are uploaded under `runs///` in this repo.
- Example: `runs/2025-08-12/13-06-07_gemma_le/020000/`.

## Data

- LeRobotDataset (parquet + mp4 + metadata). Single RGB view: `observation.images.ego_view`. Targets: `action`.
- Timestamp tolerance is auto-relaxed to `max(tolerance_s, 1/fps + 1e-4)` during training for robust decoding.

## Notes

- Base model access: `google/gemma-3-4b-it` may require accepting the model's terms of use.
- Intended for imitation learning; ThinkAct-style planning can be layered on top.

## Citations

- LeRobot: https://github.com/huggingface/lerobot
- Gemma 3: https://ai.google.dev/gemma
- SigLIP: https://huggingface.co/timm/ViT-SigLIP
- Diffusion Policy: https://arxiv.org/abs/2303.04137
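
As a supplement to the Data section, the timestamp-tolerance rule `max(tolerance_s, 1/fps + 1e-4)` can be illustrated with a small helper. This is a sketch of the formula only, not the training code itself; `relaxed_tolerance` is a hypothetical name, and the fps and tolerance values in the note after the block are made up for illustration.

```python
def relaxed_tolerance(tolerance_s, fps):
    """Return max(tolerance_s, 1/fps + 1e-4), the relaxed tolerance in seconds."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    # One frame period plus a small margin is the floor for the tolerance.
    return max(tolerance_s, 1.0 / fps + 1e-4)
```

For a 20 fps video and a requested tolerance of 1e-4 s, the helper returns roughly 0.0501 s, slightly more than one frame period. A requested tolerance that is already larger than that floor is returned unchanged.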