Cosmos
Diffusers
Safetensors
cosmos3_omni
nvidia
cosmos3
vllm
vllm-omni
sglang
sglang-diffusion
text, image, video, audio, and action generation
omnimodel
Add SGLang serving instructions (#10), opened by MickJ
- README.md +62 -67
- sound_tokenizer.ckpt +3 -0
- sound_tokenizer.json +42 -0
- sound_tokenizer/diffusion_pytorch_model.safetensors +2 -2
README.md
CHANGED
|
@@ -169,7 +169,6 @@ Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated sys
|
|
| 169 |
- [PyTorch](https://github.com/nvidia/cosmos3)
|
| 170 |
- [vLLM-Omni](https://github.com/vllm-project/vllm-omni)
|
| 171 |
- [Hugging Face Diffusers](https://huggingface.co/docs/diffusers/en/index)
|
| 172 |
-
- [SGLang](https://github.com/sgl-project/sglang)
|
| 173 |
|
| 174 |
**Supported Hardware Microarchitecture Compatibility:**
|
| 175 |
|
|
@@ -857,6 +856,67 @@ Example output from the command above:
|
|
| 857 |
4. Place the flower into the red bottle.
|
| 858 |
```
|
| 859 |
|
| 860 |
### Diffusers
|
| 861 |
|
| 862 |
#### Container
|
|
@@ -924,71 +984,6 @@ Example output:
|
|
| 924 |
|
| 925 |
<video controls width="1280" height="720" src="https://huggingface.co/nvidia/Cosmos3-Super/resolve/main/assets/example_t2v_diffusers_output.mp4"></video>
|
| 926 |
|
| 927 |
-
### SGLang
|
| 928 |
-
|
| 929 |
-
[SGLang Diffusion](https://docs.sglang.io/docs/sglang-diffusion/index) can serve `nvidia/Cosmos3-Super` through OpenAI-compatible image and video generation endpoints. Install SGLang from the main branch with diffusion dependencies, then start the server:
|
| 930 |
-
|
| 931 |
-
```bash
|
| 932 |
-
git clone --branch main https://github.com/sgl-project/sglang.git
|
| 933 |
-
cd sglang
|
| 934 |
-
pip install -e "python[diffusion]"
|
| 935 |
-
pip install "cosmos-guardrail==0.3.1"
|
| 936 |
-
|
| 937 |
-
sglang serve \
|
| 938 |
-
--model-path nvidia/Cosmos3-Super \
|
| 939 |
-
--num-gpus 4
|
| 940 |
-
```
|
| 941 |
-
|
| 942 |
-
Cosmos 3 support in SGLang Diffusion currently requires the SGLang main branch. Switch to a stable SGLang release once Cosmos 3 support is included there.
|
| 943 |
-
|
| 944 |
-
For the video-specialized checkpoint:
|
| 945 |
-
|
| 946 |
-
```bash
|
| 947 |
-
sglang serve \
|
| 948 |
-
--model-path nvidia/Cosmos3-Super-Image2Video \
|
| 949 |
-
--num-gpus 4
|
| 950 |
-
```
|
| 951 |
-
|
| 952 |
-
Supported SGLang endpoints:
|
| 953 |
-
|
| 954 |
-
| Mode | Endpoint | Notes |
|
| 955 |
-
| --- | --- | --- |
|
| 956 |
-
| Text to image | `POST /v1/images/generations` | Returns base64 image data by default |
|
| 957 |
-
| Text to video | `POST /v1/videos` | Creates an async job; poll `GET /v1/videos/{id}` and download `/content` |
|
| 958 |
-
| Image to video | `POST /v1/videos` | Upload the conditioning image with `input_reference` |
|
| 959 |
-
|
| 960 |
-
Example text-to-video request:
|
| 961 |
-
|
| 962 |
-
```bash
|
| 963 |
-
job_id=$(curl -sS -X POST http://localhost:30000/v1/videos \
|
| 964 |
-
--form-string "prompt=A small warehouse robot moves a blue box across a clean floor." \
|
| 965 |
-
--form-string "negative_prompt=blurry, distorted, low quality" \
|
| 966 |
-
--form-string "size=1280x720" \
|
| 967 |
-
--form-string "num_frames=81" \
|
| 968 |
-
--form-string "fps=24" \
|
| 969 |
-
--form-string "num_inference_steps=35" \
|
| 970 |
-
--form-string "guidance_scale=4.0" \
|
| 971 |
-
--form-string "flow_shift=10.0" \
|
| 972 |
-
--form-string "seed=42" \
|
| 973 |
-
--form-string 'extra_params={"guardrails":true,"use_resolution_template":false,"use_duration_template":false}' \
|
| 974 |
-
| python -c 'import json, sys; print(json.load(sys.stdin)["id"])')
|
| 975 |
-
|
| 976 |
-
while true; do
|
| 977 |
-
status=$(curl -sS "http://localhost:30000/v1/videos/${job_id}" \
|
| 978 |
-
| python -c 'import json, sys; print(json.load(sys.stdin)["status"])')
|
| 979 |
-
[ "$status" = "completed" ] && break
|
| 980 |
-
[ "$status" = "failed" ] && exit 1
|
| 981 |
-
sleep 1
|
| 982 |
-
done
|
| 983 |
-
|
| 984 |
-
curl -sS -L "http://localhost:30000/v1/videos/${job_id}/content" \
|
| 985 |
-
-o cosmos3_super_t2v_output.mp4
|
| 986 |
-
```
|
| 987 |
-
|
| 988 |
-
Video-to-video, video-with-sound, and action generation are not supported by SGLang yet.
|
| 989 |
-
|
| 990 |
-
For complete serving instructions and request examples, see the [Cosmos3 SGLang cookbook](https://docs.sglang.io/cookbook/diffusion/Cosmos/Cosmos3).
|
| 991 |
-
|
| 992 |
## Limitations
|
| 993 |
|
| 994 |
Cosmos3 may produce imperfect outputs in challenging scenarios. Generation artifacts include temporal inconsistency, unstable camera or object motion, imprecise physical interactions, inaccurate audio-video synchronization, and action-state drift — especially in long-horizon or high-resolution outputs. Reasoning may also be incorrect: object states, causal relationships, spatial geometry, temporal ordering, agent intent, and future outcomes can be misinferred, and complex or long-context inputs may yield hallucinated entities, inconsistent interpretations, or implausible predictions. Because the model lacks an explicit physics simulator, 3D geometry, 4D space-time evolution, object permanence, contact dynamics, and physical laws are only approximated — producing artifacts such as disappearing or morphing objects, unrealistic collisions, and physically implausible motions. Quality further degrades in out-of-distribution environments, safety-critical edge cases, and domains underrepresented in training.
|
|
@@ -997,7 +992,7 @@ Cosmos3 outputs should not be treated as physically accurate simulation, reliabl
|
|
| 997 |
|
| 998 |
## Inference
|
| 999 |
|
| 1000 |
-
**Acceleration Engine:** [PyTorch](https://pytorch.org/), [vLLM](https://github.com/vllm-project/vllm), [vLLM-Omni](https://github.com/vllm-project/vllm-omni), [Hugging Face Diffusers](https://github.com/huggingface/diffusers)
|
| 1001 |
|
| 1002 |
**Test Hardware:** GB200 and H100
|
| 1003 |
|
|
|
|
| 169 |
- [PyTorch](https://github.com/nvidia/cosmos3)
|
| 170 |
- [vLLM-Omni](https://github.com/vllm-project/vllm-omni)
|
| 171 |
- [Hugging Face Diffusers](https://huggingface.co/docs/diffusers/en/index)
|
|
|
|
| 172 |
|
| 173 |
**Supported Hardware Microarchitecture Compatibility:**
|
| 174 |
|
|
|
|
| 856 |
4. Place the flower into the red bottle.
|
| 857 |
```
|
| 858 |
|
| 859 |
+
### SGLang
|
| 860 |
+
|
| 861 |
+
SGLang-Diffusion can serve `nvidia/Cosmos3-Super` through OpenAI-compatible image and video endpoints. Install SGLang from source with diffusion dependencies, then start the server:
|
| 862 |
+
|
| 863 |
+
```bash
|
| 864 |
+
git clone https://github.com/sgl-project/sglang.git
|
| 865 |
+
cd sglang
|
| 866 |
+
pip install -e "python[diffusion]"
|
| 867 |
+
pip install "cosmos-guardrail==0.3.1"
|
| 868 |
+
|
| 869 |
+
sglang serve \
|
| 870 |
+
--model-path nvidia/Cosmos3-Super \
|
| 871 |
+
--num-gpus 4
|
| 872 |
+
```
|
| 873 |
+
|
| 874 |
+
For the video-specialized checkpoint:
|
| 875 |
+
|
| 876 |
+
```bash
|
| 877 |
+
sglang serve \
|
| 878 |
+
--model-path nvidia/Cosmos3-Super-Image2Video \
|
| 879 |
+
--num-gpus 4
|
| 880 |
+
```
|
| 881 |
+
|
| 882 |
+
Supported SGLang endpoints:
|
| 883 |
+
|
| 884 |
+
| Mode | Endpoint | Notes |
|
| 885 |
+
| --- | --- | --- |
|
| 886 |
+
| Text to image | `POST /v1/images/generations` | Returns base64 image data by default |
|
| 887 |
+
| Text to video | `POST /v1/videos` | Creates an async job; poll `GET /v1/videos/{id}` and download `/content` |
|
| 888 |
+
| Image to video | `POST /v1/videos` | Upload the conditioning image with `input_reference` |
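The text-to-image row notes that the endpoint returns base64 image data by default. A minimal Python sketch of decoding such a response follows. It assumes the body follows the OpenAI images format, with the payload under `data[0].b64_json`, which this README does not spell out; the function names and output path are illustrative only.

```python
import base64
from pathlib import Path


def decode_image_response(payload: dict) -> bytes:
    """Return raw image bytes from an OpenAI-style images response.

    Assumption: the image is stored under data[0]["b64_json"]; check the
    SGLang documentation for the exact response schema.
    """
    items = payload.get("data") or []
    if not items or "b64_json" not in items[0]:
        raise ValueError("response does not contain base64 image data")
    return base64.b64decode(items[0]["b64_json"])


def save_image(payload: dict, path: str) -> int:
    """Decode the response and write it to `path`; returns bytes written."""
    return Path(path).write_bytes(decode_image_response(payload))
```

Calling `save_image(response_json, "cosmos3_super_t2i_output.png")` on the parsed JSON body of `POST /v1/images/generations` would then write the image to disk, provided the assumed schema holds.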
|
| 889 |
+
|
| 890 |
+
Example text-to-video request:
|
| 891 |
+
|
| 892 |
+
```bash
|
| 893 |
+
job_id=$(curl -sS -X POST http://localhost:30000/v1/videos \
|
| 894 |
+
--form-string "prompt=A small warehouse robot moves a blue box across a clean floor." \
|
| 895 |
+
--form-string "negative_prompt=blurry, distorted, low quality" \
|
| 896 |
+
--form-string "size=1280x720" \
|
| 897 |
+
--form-string "num_frames=81" \
|
| 898 |
+
--form-string "fps=24" \
|
| 899 |
+
--form-string "num_inference_steps=35" \
|
| 900 |
+
--form-string "guidance_scale=4.0" \
|
| 901 |
+
--form-string "flow_shift=10.0" \
|
| 902 |
+
--form-string "seed=42" \
|
| 903 |
+
--form-string 'extra_params={"guardrails":true,"use_resolution_template":false,"use_duration_template":false}' \
|
| 904 |
+
| python -c 'import json, sys; print(json.load(sys.stdin)["id"])')
|
| 905 |
+
|
| 906 |
+
while true; do
|
| 907 |
+
status=$(curl -sS "http://localhost:30000/v1/videos/${job_id}" \
|
| 908 |
+
| python -c 'import json, sys; print(json.load(sys.stdin)["status"])')
|
| 909 |
+
[ "$status" = "completed" ] && break
|
| 910 |
+
[ "$status" = "failed" ] && exit 1
|
| 911 |
+
sleep 1
|
| 912 |
+
done
|
| 913 |
+
|
| 914 |
+
curl -sS -L "http://localhost:30000/v1/videos/${job_id}/content" \
|
| 915 |
+
-o cosmos3_super_t2v_output.mp4
|
| 916 |
+
```
|
| 917 |
+
|
| 918 |
+
Video-to-video, video-with-sound, and action generation are not supported by SGLang yet.
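The shell loop in the text-to-video example polls once per second and never gives up if the job stays in a non-terminal state. A hedged Python sketch of the same polling logic with a timeout is shown here. `fetch_status` is a hypothetical caller-supplied function that performs `GET /v1/videos/{id}` and returns the `status` field; only the `completed` and `failed` values come from the example above, and the default timeout is an arbitrary choice.

```python
import time
from typing import Callable


def wait_for_video(
    fetch_status: Callable[[], str],
    timeout_s: float = 600.0,
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Poll until the job reports "completed"; raise on "failed" or timeout.

    fetch_status is a hypothetical helper that queries the job endpoint and
    returns the "status" string from the JSON body.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_status()
        if status == "completed":
            return
        if status == "failed":
            raise RuntimeError("video generation failed")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still {status!r} after {timeout_s} s")
        sleep(interval_s)
```

Passing the status lookup in as a function keeps the loop independent of any particular HTTP client and makes it easy to exercise without a running server.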
|
| 919 |
+
|
| 920 |
### Diffusers
|
| 921 |
|
| 922 |
#### Container
|
|
|
|
| 984 |
|
| 985 |
<video controls width="1280" height="720" src="https://huggingface.co/nvidia/Cosmos3-Super/resolve/main/assets/example_t2v_diffusers_output.mp4"></video>
|
| 986 |
|
| 987 |
## Limitations
|
| 988 |
|
| 989 |
Cosmos3 may produce imperfect outputs in challenging scenarios. Generation artifacts include temporal inconsistency, unstable camera or object motion, imprecise physical interactions, inaccurate audio-video synchronization, and action-state drift — especially in long-horizon or high-resolution outputs. Reasoning may also be incorrect: object states, causal relationships, spatial geometry, temporal ordering, agent intent, and future outcomes can be misinferred, and complex or long-context inputs may yield hallucinated entities, inconsistent interpretations, or implausible predictions. Because the model lacks an explicit physics simulator, 3D geometry, 4D space-time evolution, object permanence, contact dynamics, and physical laws are only approximated — producing artifacts such as disappearing or morphing objects, unrealistic collisions, and physically implausible motions. Quality further degrades in out-of-distribution environments, safety-critical edge cases, and domains underrepresented in training.
|
|
|
|
| 992 |
|
| 993 |
## Inference
|
| 994 |
|
| 995 |
+
**Acceleration Engine:** [PyTorch](https://pytorch.org/), [vLLM](https://github.com/vllm-project/vllm), [vLLM-Omni](https://github.com/vllm-project/vllm-omni), [Hugging Face Diffusers](https://github.com/huggingface/diffusers)
|
| 996 |
|
| 997 |
**Test Hardware:** GB200 and H100
|
| 998 |
|
sound_tokenizer.ckpt
ADDED
|
@@ -0,0 +1,3 @@
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:6daeb68a219f3e86c0918f616d78b9ebf073f3d700df63ff1c02d214c081d72d
|
| 3 |
+
size 1985246007
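The three lines added for `sound_tokenizer.ckpt` are a Git LFS pointer rather than the checkpoint itself: a spec version, a content hash, and a byte size (about 1.99 GB here). As an illustrative sketch, a downloaded file could be checked against such a pointer as follows. The helper names are hypothetical, and the check assumes the `oid` uses sha256, as it does in this pointer.

```python
import hashlib


def parse_lfs_pointer(text: str) -> dict:
    """Parse the "key value" lines of a Git LFS pointer file."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path: str, pointer: dict, chunk: int = 1 << 20) -> bool:
    """Compare a local file's size and sha256 digest with the pointer."""
    hasher, total = hashlib.sha256(), 0
    with open(path, "rb") as f:
        # Read in chunks so multi-gigabyte checkpoints are not loaded at once.
        while block := f.read(chunk):
            hasher.update(block)
            total += len(block)
    return total == pointer["size"] and hasher.hexdigest() == pointer["digest"]
```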
|
sound_tokenizer.json
ADDED
|
@@ -0,0 +1,42 @@
|
|
| 1 |
+
{
|
| 2 |
+
"model_type": "autoencoder_v2",
|
| 3 |
+
"sampling_rate": 48000,
|
| 4 |
+
"stereo": true,
|
| 5 |
+
"use_wav_as_input": true,
|
| 6 |
+
"normalize_volume": true,
|
| 7 |
+
"hop_size": 1920,
|
| 8 |
+
"input_channels": 1,
|
| 9 |
+
"enc_type": "spec_convnext",
|
| 10 |
+
"enc_dim": 192,
|
| 11 |
+
"enc_intermediate_dim": 768,
|
| 12 |
+
"enc_num_layers": 12,
|
| 13 |
+
"enc_num_blocks": 2,
|
| 14 |
+
"enc_n_fft": 64,
|
| 15 |
+
"enc_hop_length": 16,
|
| 16 |
+
"enc_latent_dim": 128,
|
| 17 |
+
"enc_c_mults": [1, 2, 4],
|
| 18 |
+
"enc_strides": [4, 5, 6],
|
| 19 |
+
"enc_identity_init": false,
|
| 20 |
+
"enc_use_snake": true,
|
| 21 |
+
"dec_type": "oobleck",
|
| 22 |
+
"dec_dim": 320,
|
| 23 |
+
"dec_c_mults": [1, 2, 4, 8, 16],
|
| 24 |
+
"dec_strides": [2, 4, 5, 6, 8],
|
| 25 |
+
"dec_use_snake": true,
|
| 26 |
+
"dec_final_tanh": false,
|
| 27 |
+
"dec_out_channels": 2,
|
| 28 |
+
"dec_anti_aliasing": false,
|
| 29 |
+
"dec_use_nearest_upsample": false,
|
| 30 |
+
"dec_use_tanh_at_final": false,
|
| 31 |
+
"bottleneck_type": "vae",
|
| 32 |
+
"bottleneck": {"type": "vae"},
|
| 33 |
+
"activation": "snakebeta",
|
| 34 |
+
"snake_logscale": true,
|
| 35 |
+
"anti_aliasing": false,
|
| 36 |
+
"use_cuda_kernel": false,
|
| 37 |
+
"causal": false,
|
| 38 |
+
"padding_mode": "zeros",
|
| 39 |
+
"vocoder_input_dim": 64,
|
| 40 |
+
"latent_mean": null,
|
| 41 |
+
"latent_std": null
|
| 42 |
+
}
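Several values in this config appear to be tied together arithmetically. The sketch below works through them in Python. The interpretation is an assumption, not something the config states: one latent frame per `hop_size` samples, encoder downsampling equal to `enc_hop_length` times the product of `enc_strides`, and decoder upsampling equal to the product of `dec_strides`.

```python
from math import prod

# Values copied from sound_tokenizer.json above.
sampling_rate = 48000
hop_size = 1920
enc_hop_length = 16
enc_strides = [4, 5, 6]
dec_strides = [2, 4, 5, 6, 8]

# Assumed interpretation: one latent frame per hop_size audio samples.
latent_fps = sampling_rate / hop_size

# Assumed interpretation: the encoder downsamples by its STFT hop times its
# strides, and the decoder upsamples by the product of its strides.
enc_factor = enc_hop_length * prod(enc_strides)
dec_factor = prod(dec_strides)

print(latent_fps, enc_factor, dec_factor)  # prints: 25.0 1920 1920
```

Under these assumptions the tokenizer produces 25 latent frames per second of 48 kHz audio (40 ms per frame), and the encoder and decoder factors both equal `hop_size`.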
|
sound_tokenizer/diffusion_pytorch_model.safetensors
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
-
size
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:9d4c61cde38acfb0cad9048a140c3533750277a8462b19dc08450d9fe1ad9879
|
| 3 |
+
size 1892409600
|