moojink-nv committed
Commit 336b538 · verified · 1 Parent(s): 80d313c

Upload folder using huggingface_hub

Files changed (2)
  1. Cosmos-Policy-LIBERO-Predict2-2B.pt +3 -0
  2. README.md +101 -0
Cosmos-Policy-LIBERO-Predict2-2B.pt ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:8818528d8c9150cda0ddf8c711b0f221b21dac8ac379bd26d5690235954d33e2
size 3913017345

README.md ADDED

# Cosmos-Policy-LIBERO-Predict2-2B

## Model Description

Cosmos-Policy-LIBERO-Predict2-2B is a robot manipulation policy fine-tuned from the [NVIDIA Cosmos-Predict2-2B-Video2World](https://huggingface.co/nvidia/Cosmos-Predict2-2B-Video2World) video foundation model (checkpoint: `model-480p-16fps.pt`). This model achieves state-of-the-art performance on the LIBERO simulation benchmark with a 98.5% average success rate across four task suites.
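
The checkpoint ships as a single PyTorch `.pt` file (~3.9 GB). A minimal loading sketch follows; the repository id below is an assumption, and actually running the policy requires the Cosmos Policy inference code, which is not shown here.

```python
import torch
from huggingface_hub import hf_hub_download

# Assumed repo id; adjust to the actual Hugging Face repository.
ckpt_path = hf_hub_download(
    repo_id="nvidia/Cosmos-Policy-LIBERO-Predict2-2B",
    filename="Cosmos-Policy-LIBERO-Predict2-2B.pt",
)
state = torch.load(ckpt_path, map_location="cpu")  # raw checkpoint contents
print(type(state))  # typically a dict of model weights / training state
```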

**Paper**: *Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning*

### Key Features

- **Single-stage fine-tuning**: Adapted from the pretrained video model with no architectural modifications
- **Multimodal outputs**: Jointly predicts actions, future states, and values through unified video diffusion
- **High performance**: 98.5% average success rate on LIBERO (Spatial: 98.1%, Object: 100.0%, Goal: 98.2%, Long: 97.6%)

### Model Architecture

This model uses the same architecture as the base Cosmos-Predict2-2B model (a diffusion transformer with latent video diffusion). Please refer to the [base model card](https://huggingface.co/nvidia/Cosmos-Predict2-2B-Video2World) for detailed architecture specifications.

**Key adaptation**: Actions, proprioceptive states, and values are encoded as latent frames and injected directly into the video model's latent diffusion sequence, enabling the model to generate these modalities alongside predicted future images.
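
A conceptual sketch of this adaptation, with made-up tensor shapes (the actual latent dimensions, tokenizer, and frame layout are defined in the paper and codebase): each low-dimensional modality is projected to the shape of one latent video frame and concatenated onto the sequence that the diffusion transformer denoises.

```python
import torch
import torch.nn as nn

# Hypothetical latent sizes, for illustration only.
C, H, W = 16, 28, 28  # latent frame channels and spatial dims (assumed)

class LatentPacker(nn.Module):
    """Embed actions, proprioception, and value as extra latent frames."""

    def __init__(self):
        super().__init__()
        self.action_proj = nn.Linear(16 * 7, C * H * W)  # 16-step, 7-dim chunk
        self.state_proj = nn.Linear(9, C * H * W)        # 9-dim proprioception
        self.value_proj = nn.Linear(1, C * H * W)        # scalar value

    def forward(self, video_latents, actions, proprio, value):
        # video_latents: (B, T, C, H, W); actions: (B, 16, 7);
        # proprio: (B, 9); value: (B, 1)
        B = video_latents.shape[0]
        extra = [
            self.action_proj(actions.flatten(1)).view(B, 1, C, H, W),
            self.state_proj(proprio).view(B, 1, C, H, W),
            self.value_proj(value).view(B, 1, C, H, W),
        ]
        # One unified latent sequence for the diffusion transformer.
        return torch.cat([video_latents, *extra], dim=1)
```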

## Model Details

### Inputs

- **Current state images**:
  - Third-person camera (agentview): Resized to 224x224 RGB
  - Wrist-mounted camera (eye-in-hand): Resized to 224x224 RGB
- **Robot proprioception**: 9-dimensional (2 gripper joints + 3 end-effector position + 4 end-effector quaternion)
- **Task description**: Natural language text (e.g., "put the black bowl on top of the cabinet")

### Outputs

- **Action chunk**: 16-timestep sequence of 7-dimensional actions (6-DoF end-effector control + 1 gripper)
- **Future robot proprioception**: 9-dimensional state at timestep t+16
- **Future state images**:
  - Third-person camera prediction at timestep t+16
  - Wrist camera prediction at timestep t+16
- **Future state value**: Expected cumulative reward from the predicted future state
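
A sketch of this input/output interface, using a hypothetical `policy` object in place of the real inference stack; the names and call signature are illustrative, but the shapes follow the lists above.

```python
import numpy as np
from PIL import Image

def preprocess(agentview: Image.Image, wrist: Image.Image,
               proprio: np.ndarray, task: str) -> dict:
    """Resize both camera views to 224x224 RGB and bundle the policy inputs."""
    assert proprio.shape == (9,)  # 2 gripper + 3 position + 4 quaternion
    return {
        "agentview": np.asarray(agentview.convert("RGB").resize((224, 224))),
        "wrist": np.asarray(wrist.convert("RGB").resize((224, 224))),
        "proprio": proprio,
        "task": task,  # e.g., "put the black bowl on top of the cabinet"
    }

# Hypothetical call; the real entry point comes from the Cosmos Policy code.
# out = policy.generate(**preprocess(agentview_img, wrist_img, proprio, task))
# out["actions"]        -> (16, 7) action chunk (6-DoF end-effector + gripper)
# out["future_proprio"] -> (9,) proprioception at t+16
# out["future_images"]  -> two 224x224 camera predictions at t+16
# out["value"]          -> scalar expected cumulative reward
```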

### Training Details

**Training Data**: [LIBERO-Cosmos-Policy](https://huggingface.co/datasets/nvidia/LIBERO-Cosmos-Policy) dataset
- 4 task suites: LIBERO-Spatial, LIBERO-Object, LIBERO-Goal, LIBERO-Long
- 500 demonstrations per suite (50 demos × 10 tasks)
- Successful demonstrations used for policy training
- All demonstrations (including failures) used for world model and value function training (see the sketch below)
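
One way this split could be implemented when assembling batches; the `success` flag and record layout below are assumptions about the dataset format, not its documented schema.

```python
# Hypothetical episode records; the real dataset schema may differ.
episodes = [
    {"suite": "libero_spatial", "task_id": 0, "success": True},
    {"suite": "libero_spatial", "task_id": 0, "success": False},
]

policy_episodes = [e for e in episodes if e["success"]]  # policy objective only
world_and_value_episodes = episodes  # world model + value function objectives
```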

**Training Configuration**:
- **Base model**: NVIDIA Cosmos-Predict2-2B-Video2World (`model-480p-16fps.pt`)
- **Training steps**: 40,000 gradient steps
- **Batch size**: 1,920 (global)
- **GPUs**: 64 H100 GPUs
- **Training time**: ~48 hours
- **Optimization**: Full model fine-tuning (all weights updated)
- **Action chunk size**: 16 timesteps
- **Image resolution**: 224x224 pixels

**Training Objective**: The model is trained with a hybrid log-normal-uniform noise distribution (modified from the base model's log-normal distribution; see the paper for details) to improve action prediction accuracy. Training batches are split 50/25/25 across the policy, world model, and value function objectives, respectively.
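
A sketch of one way to realize this sampling scheme; the mixture weight and log-normal parameters below are placeholders (the paper specifies the actual values).

```python
import random
import torch

def sample_sigma(batch: int, p_uniform: float = 0.5, p_mean: float = 0.0,
                 p_std: float = 1.0, s_min: float = 4.0,
                 s_max: float = 80.0) -> torch.Tensor:
    """Hybrid noise levels: log-normal draws mixed with uniform draws."""
    lognormal = torch.exp(p_mean + p_std * torch.randn(batch))
    uniform = s_min + (s_max - s_min) * torch.rand(batch)
    return torch.where(torch.rand(batch) < p_uniform, uniform, lognormal)

# 50/25/25 batch split across the three training objectives.
objective = random.choices(
    ["policy", "world_model", "value"], weights=[0.50, 0.25, 0.25]
)[0]
```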

**Inference Settings**:
- Denoising steps: 5 (can be changed without retraining)
- Noise level range: σ_min = 4.0, σ_max = 80.0
- Generation mode: Parallel (action, future state, and value generated simultaneously)
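
The card does not name the sampler, so as an assumption, here is the common Karras (EDM-style) spacing of 5 noise levels between σ_max = 80.0 and σ_min = 4.0:

```python
import torch

def karras_sigmas(n: int = 5, s_min: float = 4.0, s_max: float = 80.0,
                  rho: float = 7.0) -> torch.Tensor:
    """EDM-style noise schedule from s_max down to s_min (assumed sampler)."""
    ramp = torch.linspace(0.0, 1.0, n)
    return (s_max ** (1 / rho)
            + ramp * (s_min ** (1 / rho) - s_max ** (1 / rho))) ** rho

print(karras_sigmas())  # 5 decreasing noise levels, 80.0 -> 4.0
```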

## Performance

### LIBERO Benchmark Results

| Task Suite     | Success Rate |
|----------------|--------------|
| LIBERO-Spatial | 98.1%        |
| LIBERO-Object  | 100.0%       |
| LIBERO-Goal    | 98.2%        |
| LIBERO-Long    | 97.6%        |
| **Average**    | **98.5%**    |

Success rates are averaged over 500 trials per suite (10 tasks × 50 episodes) across 3 random seeds (6,000 trials total).
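
A quick check of the reported aggregates:

```python
rates = {"Spatial": 98.1, "Object": 100.0, "Goal": 98.2, "Long": 97.6}
print(sum(rates.values()) / len(rates))  # 98.475 -> reported as 98.5%
print(4 * 10 * 50 * 3)  # 4 suites x 10 tasks x 50 episodes x 3 seeds = 6000
```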

## Notes

- **Simulation only**: This checkpoint is trained and evaluated exclusively in LIBERO simulation environments
- **Single robot platform**: Trained only for the Franka Emika Panda robot arm
- **Fixed camera setup**: Requires the specific camera configuration described above (third-person + wrist views)

## Citation

If you use this model, please cite the Cosmos Policy paper by Kim et al.
<!-- ```bibtex
# TODO: Add Cosmos Policy BibTeX
``` -->

## License

Please refer to the [base model license](https://huggingface.co/nvidia/Cosmos-Predict2-2B-Video2World) for licensing information.

## Related Resources

- **Base Model**: [Cosmos-Predict2-2B-Video2World](https://huggingface.co/nvidia/Cosmos-Predict2-2B-Video2World)
- **Training Dataset**: [LIBERO-Cosmos-Policy](https://huggingface.co/datasets/nvidia/LIBERO-Cosmos-Policy)
- **Paper**: *Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning*