	---
	license: other
	license_name: openmdw-1.1
	license_link: LICENSE
	base_model: nvidia/Cosmos3-Nano
	tags:
	- robotics
	- cosmos
	- world-model
	- action
	- dk1
	- lerobot
	---

	# cosmos3-dk1-cartesian

	Fine-tune of NVIDIA Cosmos3-Nano on the DK-1 bimanual robot datamix —
	multi-mode world + action SFT with a cartesian end-effector action space
	(single-step SE(3) pose deltas, à la the Cosmos DROID layout). Checkpoint at
	iter 10000.

	> ⚠️ This is a DELTA, not a standalone model. It contains only the ~1.56 B
	> trained parameters and must be applied on top of the public
	> [`nvidia/Cosmos3-Nano`](https://huggingface.co/nvidia/Cosmos3-Nano) base
	> (the other ~13.6 B params are frozen and not included here).

	## What's in here

	| File | Contents |
	|---|---|
	| `cosmos3-dk1-cartesian-delta.safetensors` | the 1.555 B trained params (bf16, 3.1 GB) |
	| `merge_delta.py` | folds this delta into a base Cosmos3-Nano → stock-architecture checkpoint |
	| `lora_config.json` | gen-MLP LoRA config (r16 / α32 / targets) |
	| `dk1_action_normalization_cartesian.json` | quantile q01/q99 stats for the 20-D cartesian action |

	Trained parameters (365 tensors, 1,555,175,424 params):
	- Gen attention `q/k/v/o_proj_moe_gen`: fully fine-tuned (1.51 B). Existing base modules, trained in place.
	- Action I/O `action2llm` / `llm2action` / `action_modality_embed`: fully fine-tuned (17 M). Existing base modules.
	- Gen MLP `mlp_moe_gen.{gate,up,down}_proj`: LoRA r16 (28 M adapter params, `lora_` keys). The only structurally new keys.
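To check this breakdown against a downloaded delta, one could bucket tensor names by the substrings listed above. The following is a minimal sketch, not part of the released scripts. It assumes the key names in the safetensors file contain these substrings verbatim (with `q/k/v/o_proj_moe_gen` read as four separate module names), and the demo key names are hypothetical.

```python
from collections import Counter
from math import prod

# Substrings taken from the list above; matching on them is an assumption
# about how the delta's tensor names are spelled.
ACTION_IO = ("action2llm", "llm2action", "action_modality_embed")
GEN_ATTN = ("q_proj_moe_gen", "k_proj_moe_gen", "v_proj_moe_gen", "o_proj_moe_gen")


def classify(name: str) -> str:
    """Assign a tensor name to one of the three trained groups."""
    if "lora_" in name:
        return "gen_mlp_lora"
    if any(s in name for s in ACTION_IO):
        return "action_io"
    if any(s in name for s in GEN_ATTN):
        return "gen_attn"
    return "other"


def count_params(shapes: dict[str, tuple[int, ...]]) -> Counter:
    """Sum parameter counts per group from a {name: shape} mapping."""
    totals = Counter()
    for name, shape in shapes.items():
        totals[classify(name)] += prod(shape)
    return totals


# Toy shapes with hypothetical names, only to show the bucketing.
demo = {
    "layers.0.self_attn.q_proj_moe_gen.weight": (8, 8),
    "layers.0.mlp_moe_gen.up_proj.lora_A.weight": (2, 8),
    "action2llm.weight": (8, 20),
}
print(count_params(demo))
```

For the real file, the `{name: shape}` mapping could be built with `safetensors.safe_open(path, framework="pt")`, iterating `f.keys()` and reading `f.get_tensor(k).shape`. A non-empty `other` bucket would indicate that the substring assumption does not hold.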

	## Compatibility (loads in the stock Cosmos framework)

	The architecture is identical to Cosmos3-Nano — our additions are full-FT of
	existing modules or extra LoRA keys, never shape changes:

	- The full-FT parts (about 98% of the trained parameters) load directly into a vanilla base
	(same keys and shapes; the values are simply overwritten).
	- The LoRA adapters are the only extra keys. Either inject LoRA (`lora_config.json`)
	and load them, or fold them in with `merge_delta.py` → after merging the model
	is 100% stock architecture, no framework patches required.
	- The `dk1_cartesian` embodiment is `domain_id = 26`, an existing row of the base
	`num_embodiment_domains = 32` embedding — no new params. To use it, just pass
	`domain_id = 26` with the 20-D action layout.
	- The training modes (`policy` / `causal_policy` / `forward_dynamics` /
	`inverse_dynamics`) are input-masking recipes, not model features. The network uses
	bidirectional gen attention, so e.g. `causal_policy` just means "give past frames +
	an RTC action prefix." Anyone can run them; they need the masking logic, not special weights.
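Folding a LoRA adapter into its base weight is what makes the merged checkpoint stock-architecture. The sketch below shows the standard LoRA merge, `W + (α / r) · B · A`, with the r16 / α32 values from `lora_config.json`. It is an illustration of the general technique, not a copy of `merge_delta.py`; the function name, the matrix shapes, and the A/B orientation are assumptions.

```python
import numpy as np


def fold_lora(w, lora_a, lora_b, alpha=32.0, r=16):
    """Return w + (alpha / r) * lora_b @ lora_a (standard LoRA merge)."""
    # lora_a: (r, d_in), lora_b: (d_out, r), w: (d_out, d_in)
    assert lora_a.shape[0] == r and lora_b.shape[1] == r
    return w + (alpha / r) * (lora_b @ lora_a)


rng = np.random.default_rng(0)
d_out, d_in, r = 6, 4, 16  # toy sizes, not the real layer dimensions
w = rng.normal(size=(d_out, d_in))
a = rng.normal(size=(r, d_in))
b = rng.normal(size=(d_out, r))
x = rng.normal(size=(d_in,))

merged = fold_lora(w, a, b)

# Base path plus adapter path, as computed when LoRA is injected instead.
adapter_out = w @ x + (32.0 / 16) * (b @ (a @ x))

# The merged weight gives the same output with no extra keys or modules.
print(np.allclose(merged @ x, adapter_out))
```

Because the merged tensor has the same shape as the base weight, the `lora_` keys can be dropped after folding.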

	## How to use

	```bash
	# Stock, patch-free model (recommended): fold the delta into the base.
	python merge_delta.py \
	  --base /path/to/Cosmos3-Nano-dcp/model \
	  --delta cosmos3-dk1-cartesian-delta.safetensors \
	  --out /path/to/Cosmos3-dk1-cartesian-merged/model
	```

	For DK-1 cartesian closed-loop control, the action convention is:
	- 20-D bimanual = per arm `[pos_delta(3), rot6d_delta(6), gripper(1)]`, left then right.
	- Single-step (`backward_framewise`) SE(3) deltas `ΔT_t = T_{t-1}⁻¹ T_t` of the
	end-effector (flange frame, OpenCV convention: z=approach/front, x=right), 6D rotation
	(Zhou et al. 2019); gripper is the absolute state (not a delta).
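	- Quantile-normalized with `dk1_action_normalization_cartesian.json`; the gripper dimensions use fixed bounds so that (0, 1) maps to (−1, 1).
	- End-effector (EE) poses are computed via forward kinematics (FK) over the DK-1 dual-arm URDF; closed-loop control needs inverse kinematics (IK) to map poses back to joints.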
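The action convention above can be sketched in a few lines. This is a hedged illustration, not the project's code. It assumes 4x4 homogeneous EE poses, a 6D rotation built from the first two columns of the rotation matrix (some libraries use rows instead), and a linear quantile map `2 (x − q01) / (q99 − q01) − 1`. That map is consistent with the (0, 1) → (−1, 1) gripper rule, but neither it nor the layout of the JSON stats file is confirmed here. All function names are hypothetical.

```python
import numpy as np


def se3_delta(t_prev, t_curr):
    """Single-step delta inv(T_{t-1}) @ T_t for 4x4 homogeneous poses."""
    return np.linalg.inv(t_prev) @ t_curr


def rot6d(rot):
    """6D rotation: first two columns of R, flattened (assumed ordering)."""
    return np.concatenate([rot[:, 0], rot[:, 1]])


def arm_action(t_prev, t_curr, gripper):
    """10-D per-arm action: [pos_delta(3), rot6d_delta(6), gripper(1)]."""
    d = se3_delta(t_prev, t_curr)
    # The gripper entry is the absolute state, not a delta.
    return np.concatenate([d[:3, 3], rot6d(d[:3, :3]), [gripper]])


def quantile_normalize(x, q01, q99):
    """Map [q01, q99] linearly onto [-1, 1] (assumed formula)."""
    return 2.0 * (x - q01) / (q99 - q01) - 1.0


# Toy poses: the left EE moves 1 cm along its z (approach) axis,
# the right EE stays put.
eye = np.eye(4)
moved = np.eye(4)
moved[:3, 3] = [0.0, 0.0, 0.01]

left = arm_action(eye, moved, gripper=0.5)
right = arm_action(eye, eye, gripper=1.0)
action = np.concatenate([left, right])  # left then right, 20-D

# Gripper dimensions (indices 9 and 19) with the fixed (0, 1) bounds.
grippers = quantile_normalize(action[[9, 19]], 0.0, 1.0)
print(action.shape, grippers)
```

The remaining 18 dimensions would be normalized the same way with the per-dimension q01/q99 values from `dk1_action_normalization_cartesian.json`; at inference time the inverse map and an IK step turn predicted actions back into joint commands.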

	## Training recipe (summary)

	- Base: Cosmos3-Nano (MoT: frozen Qwen3-VL-8B reasoner + diffusion gen expert).
	- Gen attention full-FT + gen MLP LoRA r16 (path-qualified to the gen expert only) + action I/O full-FT.
	- Multi-mode SFT (policy / causal_policy / forward_dynamics / inverse_dynamics), RTC, JSON caption metadata + CFG dropout.
	- 21-source DK-1 datamix, 480p, chunk_length 32, `action_loss_weight = 2`, ~0.18 epoch at 25k steps.
	- Full project + training code: https://github.com/andreaskoepf/cosmos3-dk1

	## License
	OpenMDW-1.1 (inherits from Cosmos3-Nano). The DK-1 URDF used for FK is Apache-2.0.