---
license: other
license_name: openmdw-1.1
license_link: LICENSE
base_model: nvidia/Cosmos3-Nano
tags:
- robotics
- cosmos
- world-model
- action
- dk1
- lerobot
---
# cosmos3-dk1-cartesian

Fine-tune of **NVIDIA Cosmos3-Nano** on the **DK-1 bimanual robot** datamix:
multi-mode world + action SFT with a **cartesian end-effector action space**
(single-step SE(3) pose deltas, à la the Cosmos DROID layout). Checkpoint at
**iter 10000**.

> ⚠️ **This is a DELTA, not a standalone model.** It contains only the ~1.56 B
> *trained* parameters and must be applied on top of the public
> [`nvidia/Cosmos3-Nano`](https://huggingface.co/nvidia/Cosmos3-Nano) base
> (the other ~13.6 B params are frozen and not included here).
## What's in here

| file | what |
|---|---|
| `cosmos3-dk1-cartesian-delta.safetensors` | the 1.555 B trained params (bf16, 3.1 GB) |
| `merge_delta.py` | fold this delta into a base Cosmos3-Nano → **stock-architecture** checkpoint |
| `lora_config.json` | gen-MLP LoRA config (r16 / α32 / targets) |
| `dk1_action_normalization_cartesian.json` | quantile q01/q99 stats for the 20-D cartesian action |

**Trained parameters** (365 tensors, 1,555,175,424 params):

- **Gen attention** `q/k/v/o_proj_moe_gen`: **full fine-tuned** (1.51 B). Existing base modules, trained in place.
- **Action I/O** `action2llm` / `llm2action` / `action_modality_embed`: **full fine-tuned** (17 M). Existing base modules.
- **Gen MLP** `mlp_moe_gen.{gate,up,down}_proj`: **LoRA r16** (28 M adapters, `lora_*` keys). The *only* structurally-new keys.
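
The breakdown above can be checked without loading any tensor data, because a `.safetensors` file begins with an 8-byte little-endian header length followed by a JSON header listing every tensor's shape. The sketch below reads that header using only the standard library and sums parameters per group. The `classify` rule is an assumption that matches on the key substrings named in the list above; the exact key names in the delta file may differ, so adjust it after inspecting the real keys.

```python
import json
import struct
from collections import Counter
from math import prod


def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file without reading tensor data."""
    with open(path, "rb") as f:
        # First 8 bytes: unsigned little-endian 64-bit length of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    header.pop("__metadata__", None)  # optional free-form metadata entry
    return header


def classify(key):
    """Hypothetical grouping by key substring (an assumption, not the repo's logic)."""
    if "lora_" in key:
        return "gen_mlp_lora"
    if "_proj_moe_gen" in key:
        return "gen_attention"
    if "action" in key:
        return "action_io"
    return "other"


def count_params(header):
    """Sum the number of elements per group from the header's tensor shapes."""
    counts = Counter()
    for key, info in header.items():
        counts[classify(key)] += prod(info["shape"])
    return counts


# Usage (requires the delta file to be present locally):
#   header = read_safetensors_header("cosmos3-dk1-cartesian-delta.safetensors")
#   print(len(header), "tensors;", dict(count_params(header)))
```

If the grouping rule fits the real key names, the three groups should add up to the 1,555,175,424 parameters stated above.
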
## Compatibility (loads in the stock Cosmos framework)

The architecture is **identical to Cosmos3-Nano**. Our additions are either full
fine-tunes of existing modules or extra LoRA keys, never shape changes:

- The full-FT parts (98% of the trained parameters) load **directly** into a vanilla base
  (same keys/shapes; overwrite values).
- The LoRA adapters are the only extra keys. Either inject LoRA (`lora_config.json`)
  and load them, **or fold them in** with `merge_delta.py`. After merging, the model
  is **100% stock architecture, no framework patches required**.
- The **`dk1_cartesian` embodiment is `domain_id = 26`**, an *existing* row of the base
  `num_embodiment_domains = 32` embedding, so it adds no new params. To use it, just pass
  `domain_id = 26` with the 20-D action layout.
- The training **modes** (`policy` / `causal_policy` / `forward_dynamics` /
  `inverse_dynamics`) are **input-masking recipes, not model features**. The net uses
  bidirectional gen-attention, so e.g. `causal_policy` just means "give past frames +
  an RTC action prefix." Any user can run them; they need the masking logic, not special weights.
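
To illustrate what "folding in" a LoRA adapter means, here is a minimal sketch of the standard LoRA merge, `W' = W + (α / r) · B · A`, written in plain Python on nested lists. It is not the code in `merge_delta.py`: the function names, the `A` (r × in) and `B` (out × r) layout, and the scaling rule are the usual LoRA convention and are assumptions here. With r16 / α32 the scale would be 2.

```python
def matmul(a, b):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def fold_lora(weight, lora_a, lora_b, r=16, alpha=32):
    """Return W + (alpha / r) * (B @ A).

    Assumed layout (standard LoRA): weight is out x in, lora_a is r x in,
    lora_b is out x r.
    """
    scale = alpha / r  # 32 / 16 = 2.0 for the r16 / α32 config
    update = matmul(lora_b, lora_a)
    return [
        [w + scale * u for w, u in zip(w_row, u_row)]
        for w_row, u_row in zip(weight, update)
    ]


# Tiny rank-1 demo (r=1, alpha=2 keeps the same scale of 2 as r16 / α32).
W = [[1, 0], [0, 1]]
A = [[1, 2]]      # r x in
B = [[1], [3]]    # out x r
W_merged = fold_lora(W, A, B, r=1, alpha=2)
```

After a merge of this kind the adapter matrices are no longer needed, which is why a merged checkpoint can keep the stock key set.
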
## How to use

```bash
# Stock, patch-free model (recommended): fold the delta into the base.
python merge_delta.py \
  --base /path/to/Cosmos3-Nano-dcp/model \
  --delta cosmos3-dk1-cartesian-delta.safetensors \
  --out /path/to/Cosmos3-dk1-cartesian-merged/model
```

For DK-1 cartesian closed-loop control, the action convention is:

- **20-D** bimanual = per arm `[pos_delta(3), rot6d_delta(6), gripper(1)]`, left then right.
- **Single-step** (`backward_framewise`) SE(3) deltas `ΔT_t = T_{t-1}⁻¹ T_t` of the
  end-effector (**flange** frame, OpenCV convention: z=approach/front, x=right), 6D rotation
  (Zhou et al. 2019); gripper is the **absolute** state (not a delta).
- Quantile-normalized with `dk1_action_normalization_cartesian.json` (grippers forced (0,1)→(−1,1)).
- EE poses via FK over the DK-1 dual-arm URDF; closed-loop needs IK back to joints.
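
The convention above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the project's implementation: poses are passed as `(R, t)` pairs, the 6D rotation is taken as the first two columns of `R` listed column by column (some libraries use the first two rows instead), the quantile stats are flat `q01` / `q99` lists (the actual JSON layout is not documented here), and clipping to [−1, 1] is an added assumption. All function names are hypothetical.

```python
def transpose3(r):
    """Transpose of a 3x3 matrix; for a rotation this is its inverse."""
    return [list(col) for col in zip(*r)]


def mat3_mul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]


def mat3_vec(a, v):
    return [sum(a[i][k] * v[k] for k in range(3)) for i in range(3)]


def backward_delta(r_prev, t_prev, r_cur, t_cur):
    """ΔT_t = T_{t-1}^-1 T_t for rigid transforms given as (R, t)."""
    r_inv = transpose3(r_prev)
    d_rot = mat3_mul(r_inv, r_cur)
    # Translation of the product: R_prev^T (t_cur - t_prev).
    d_pos = mat3_vec(r_inv, [c - p for c, p in zip(t_cur, t_prev)])
    return d_rot, d_pos


def rot6d(r):
    """First two columns of R, column by column (assumed ordering)."""
    return [r[i][0] for i in range(3)] + [r[i][1] for i in range(3)]


def arm_action(r_prev, t_prev, r_cur, t_cur, gripper):
    """10-D per-arm vector: [pos_delta(3), rot6d_delta(6), gripper(1)]."""
    d_rot, d_pos = backward_delta(r_prev, t_prev, r_cur, t_cur)
    return d_pos + rot6d(d_rot) + [gripper]  # gripper is absolute, not a delta


def bimanual_action(left, right):
    """20-D action: left arm first, then right arm."""
    return left + right


def normalize(action, q01, q99):
    """Map [q01, q99] to [-1, 1] per dimension; clipping is an assumption."""
    out = []
    for x, lo, hi in zip(action, q01, q99):
        y = 2.0 * (x - lo) / max(hi - lo, 1e-8) - 1.0
        out.append(max(-1.0, min(1.0, y)))
    return out
```

With gripper stats of `q01 = 0` and `q99 = 1`, `normalize` sends a gripper value of 0 to −1 and 1 to +1, matching the (0,1)→(−1,1) mapping noted above.
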
## Training recipe (summary)

- Base: Cosmos3-Nano (MoT: frozen Qwen3-VL-8B reasoner + diffusion gen expert).
- Gen attention **full-FT** + gen MLP **LoRA r16** (path-qualified to the gen expert only) + action I/O **full-FT**.
- Multi-mode SFT (policy / causal_policy / forward_dynamics / inverse_dynamics), RTC, JSON caption metadata + CFG dropout.
- 21-source DK-1 datamix, 480p, chunk_length 32, `action_loss_weight = 2`, ~0.18 epoch at 25k steps.
- Full project + training code: https://github.com/andreaskoepf/cosmos3-dk1

## License

OpenMDW-1.1 (inherits from Cosmos3-Nano). The DK-1 URDF used for FK is Apache-2.0.