Add SGLang serving instructions

#10
by MickJ - opened
README.md CHANGED
@@ -169,7 +169,6 @@ Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated sys
  - [PyTorch](https://github.com/nvidia/cosmos3)
  - [vLLM-Omni](https://github.com/vllm-project/vllm-omni)
  - [Hugging Face Diffusers](https://huggingface.co/docs/diffusers/en/index)
- - [SGLang](https://github.com/sgl-project/sglang)
 
  **Supported Hardware Microarchitecture Compatibility:**
 
@@ -857,6 +856,67 @@ Example output from the command above:
  4. Place the flower into the red bottle.
  ```
 
+ ### SGLang
+
+ SGLang-Diffusion can serve `nvidia/Cosmos3-Super` through OpenAI-compatible image and video endpoints. Install SGLang from source with diffusion dependencies, then start the server:
+
+ ```bash
+ git clone https://github.com/sgl-project/sglang.git
+ cd sglang
+ pip install -e "python[diffusion]"
+ pip install "cosmos-guardrail==0.3.1"
+
+ sglang serve \
+   --model-path nvidia/Cosmos3-Super \
+   --num-gpus 4
+ ```
+
+ For the video-specialized checkpoint:
+
+ ```bash
+ sglang serve \
+   --model-path nvidia/Cosmos3-Super-Image2Video \
+   --num-gpus 4
+ ```
+
+ Supported SGLang endpoints:
+
+ | Mode | Endpoint | Notes |
+ | --- | --- | --- |
+ | Text to image | `POST /v1/images/generations` | Returns base64 image data by default |
+ | Text to video | `POST /v1/videos` | Creates an async job; poll `GET /v1/videos/{id}` and download `/content` |
+ | Image to video | `POST /v1/videos` | Upload the conditioning image with `input_reference` |
+
+ Example text-to-video request:
+
+ ```bash
+ job_id=$(curl -sS -X POST http://localhost:30000/v1/videos \
+   --form-string "prompt=A small warehouse robot moves a blue box across a clean floor." \
+   --form-string "negative_prompt=blurry, distorted, low quality" \
+   --form-string "size=1280x720" \
+   --form-string "num_frames=81" \
+   --form-string "fps=24" \
+   --form-string "num_inference_steps=35" \
+   --form-string "guidance_scale=4.0" \
+   --form-string "flow_shift=10.0" \
+   --form-string "seed=42" \
+   --form-string 'extra_params={"guardrails":true,"use_resolution_template":false,"use_duration_template":false}' \
+   | python -c 'import json, sys; print(json.load(sys.stdin)["id"])')
+
+ while true; do
+   status=$(curl -sS "http://localhost:30000/v1/videos/${job_id}" \
+     | python -c 'import json, sys; print(json.load(sys.stdin)["status"])')
+   [ "$status" = "completed" ] && break
+   [ "$status" = "failed" ] && exit 1
+   sleep 1
+ done
+
+ curl -sS -L "http://localhost:30000/v1/videos/${job_id}/content" \
+   -o cosmos3_super_t2v_output.mp4
+ ```
+
+ Video-to-video, video-with-sound, and action generation are not supported by SGLang yet.
+
  ### Diffusers
 
  #### Container
@@ -924,71 +984,6 @@ Example output:
 
  <video controls width="1280" height="720" src="https://huggingface.co/nvidia/Cosmos3-Super/resolve/main/assets/example_t2v_diffusers_output.mp4"></video>
 
- ### SGLang
-
- [SGLang Diffusion](https://docs.sglang.io/docs/sglang-diffusion/index) can serve `nvidia/Cosmos3-Super` through OpenAI-compatible image and video generation endpoints. Install SGLang from the main branch with diffusion dependencies, then start the server:
-
- ```bash
- git clone --branch main https://github.com/sgl-project/sglang.git
- cd sglang
- pip install -e "python[diffusion]"
- pip install "cosmos-guardrail==0.3.1"
-
- sglang serve \
-   --model-path nvidia/Cosmos3-Super \
-   --num-gpus 4
- ```
-
- Cosmos 3 support in SGLang Diffusion currently requires the SGLang main branch. Switch to a stable SGLang release once Cosmos 3 support is included there.
-
- For the video-specialized checkpoint:
-
- ```bash
- sglang serve \
-   --model-path nvidia/Cosmos3-Super-Image2Video \
-   --num-gpus 4
- ```
-
- Supported SGLang endpoints:
-
- | Mode | Endpoint | Notes |
- | --- | --- | --- |
- | Text to image | `POST /v1/images/generations` | Returns base64 image data by default |
- | Text to video | `POST /v1/videos` | Creates an async job; poll `GET /v1/videos/{id}` and download `/content` |
- | Image to video | `POST /v1/videos` | Upload the conditioning image with `input_reference` |
-
- Example text-to-video request:
-
- ```bash
- job_id=$(curl -sS -X POST http://localhost:30000/v1/videos \
-   --form-string "prompt=A small warehouse robot moves a blue box across a clean floor." \
-   --form-string "negative_prompt=blurry, distorted, low quality" \
-   --form-string "size=1280x720" \
-   --form-string "num_frames=81" \
-   --form-string "fps=24" \
-   --form-string "num_inference_steps=35" \
-   --form-string "guidance_scale=4.0" \
-   --form-string "flow_shift=10.0" \
-   --form-string "seed=42" \
-   --form-string 'extra_params={"guardrails":true,"use_resolution_template":false,"use_duration_template":false}' \
-   | python -c 'import json, sys; print(json.load(sys.stdin)["id"])')
-
- while true; do
-   status=$(curl -sS "http://localhost:30000/v1/videos/${job_id}" \
-     | python -c 'import json, sys; print(json.load(sys.stdin)["status"])')
-   [ "$status" = "completed" ] && break
-   [ "$status" = "failed" ] && exit 1
-   sleep 1
- done
-
- curl -sS -L "http://localhost:30000/v1/videos/${job_id}/content" \
-   -o cosmos3_super_t2v_output.mp4
- ```
-
- Video-to-video, video-with-sound, and action generation are not supported by SGLang yet.
-
- For complete serving instructions and request examples, see the [Cosmos3 SGLang cookbook](https://docs.sglang.io/cookbook/diffusion/Cosmos/Cosmos3).
-
  ## Limitations
 
  Cosmos3 may produce imperfect outputs in challenging scenarios. Generation artifacts include temporal inconsistency, unstable camera or object motion, imprecise physical interactions, inaccurate audio-video synchronization, and action-state drift — especially in long-horizon or high-resolution outputs. Reasoning may also be incorrect: object states, causal relationships, spatial geometry, temporal ordering, agent intent, and future outcomes can be misinferred, and complex or long-context inputs may yield hallucinated entities, inconsistent interpretations, or implausible predictions. Because the model lacks an explicit physics simulator, 3D geometry, 4D space-time evolution, object permanence, contact dynamics, and physical laws are only approximated — producing artifacts such as disappearing or morphing objects, unrealistic collisions, and physically implausible motions. Quality further degrades in out-of-distribution environments, safety-critical edge cases, and domains underrepresented in training.
@@ -997,7 +992,7 @@ Cosmos3 outputs should not be treated as physically accurate simulation, reliabl
 
  ## Inference
 
- **Acceleration Engine:** [PyTorch](https://pytorch.org/), [vLLM](https://github.com/vllm-project/vllm), [vLLM-Omni](https://github.com/vllm-project/vllm-omni), [Hugging Face Diffusers](https://github.com/huggingface/diffusers), [SGLang](https://github.com/sgl-project/sglang), [SGLang Diffusion](https://docs.sglang.io/docs/sglang-diffusion/index)
+ **Acceleration Engine:** [PyTorch](https://pytorch.org/), [vLLM](https://github.com/vllm-project/vllm), [vLLM-Omni](https://github.com/vllm-project/vllm-omni), [Hugging Face Diffusers](https://github.com/huggingface/diffusers)
 
  **Test Hardware:** GB200 and H100
 
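Reviewer sketch, not part of the PR: the polling loop in the README example above runs until the job reports `completed` or `failed`, and calls `exit 1` on failure, which would also close an interactive shell. A bounded variant might look like the following. The helper names (`fetch_status`, `wait_for_video`, `save_first_image`), the `BASE_URL` variable, the poll limits, and the use of `python3` instead of `python` are all illustrative assumptions. The image helper further assumes that the text-to-image endpoint returns the OpenAI-style layout `{"data": [{"b64_json": ...}]}`, which the diff does not confirm.

```shell
# Sketch only: names and defaults below are hypothetical, not taken from the PR.
BASE_URL="${BASE_URL:-http://localhost:30000}"

# Print the status string of a video job (same request as the README loop).
# Kept as a separate function so it can be replaced by a stub in tests.
fetch_status() {
  curl -sS "${BASE_URL}/v1/videos/$1" \
    | python3 -c 'import json, sys; print(json.load(sys.stdin)["status"])'
}

# Poll a job until it finishes or the poll budget is used up.
# Arguments: job id, maximum number of polls (default 600), seconds between polls (default 1).
# Returns 0 on "completed", 1 on "failed", 2 if the status request itself fails,
# and 3 on timeout. Any other status value is treated as "still running".
wait_for_video() {
  job_id=$1
  max_polls=${2:-600}
  interval=${3:-1}
  polls=0
  while [ "$polls" -lt "$max_polls" ]; do
    status=$(fetch_status "$job_id") || return 2
    case "$status" in
      completed) return 0 ;;
      failed)    return 1 ;;
    esac
    polls=$((polls + 1))
    sleep "$interval"
  done
  return 3
}

# Decode the first image of an images response (read from stdin) into the file "$1".
# ASSUMPTION: the response body is {"data": [{"b64_json": "<base64>"}]}.
save_first_image() {
  python3 -c '
import base64, json, sys
body = json.load(sys.stdin)
with open(sys.argv[1], "wb") as out:
    out.write(base64.b64decode(body["data"][0]["b64_json"]))
' "$1"
}
```

With these helpers, the download step could be guarded as `wait_for_video "$job_id" 900 2 && curl -sS -L "${BASE_URL}/v1/videos/${job_id}/content" -o cosmos3_super_t2v_output.mp4`. Returning a status code instead of calling `exit` leaves the decision about what to do on failure or timeout to the caller.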
sound_tokenizer.ckpt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:6daeb68a219f3e86c0918f616d78b9ebf073f3d700df63ff1c02d214c081d72d
+ size 1985246007
sound_tokenizer.json ADDED
@@ -0,0 +1,42 @@
+ {
+   "model_type": "autoencoder_v2",
+   "sampling_rate": 48000,
+   "stereo": true,
+   "use_wav_as_input": true,
+   "normalize_volume": true,
+   "hop_size": 1920,
+   "input_channels": 1,
+   "enc_type": "spec_convnext",
+   "enc_dim": 192,
+   "enc_intermediate_dim": 768,
+   "enc_num_layers": 12,
+   "enc_num_blocks": 2,
+   "enc_n_fft": 64,
+   "enc_hop_length": 16,
+   "enc_latent_dim": 128,
+   "enc_c_mults": [1, 2, 4],
+   "enc_strides": [4, 5, 6],
+   "enc_identity_init": false,
+   "enc_use_snake": true,
+   "dec_type": "oobleck",
+   "dec_dim": 320,
+   "dec_c_mults": [1, 2, 4, 8, 16],
+   "dec_strides": [2, 4, 5, 6, 8],
+   "dec_use_snake": true,
+   "dec_final_tanh": false,
+   "dec_out_channels": 2,
+   "dec_anti_aliasing": false,
+   "dec_use_nearest_upsample": false,
+   "dec_use_tanh_at_final": false,
+   "bottleneck_type": "vae",
+   "bottleneck": {"type": "vae"},
+   "activation": "snakebeta",
+   "snake_logscale": true,
+   "anti_aliasing": false,
+   "use_cuda_kernel": false,
+   "causal": false,
+   "padding_mode": "zeros",
+   "vocoder_input_dim": 64,
+   "latent_mean": null,
+   "latent_std": null
+ }
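Reviewer sketch, not part of the PR: the numbers in `sound_tokenizer.json` above appear to be mutually consistent, which can be checked with a little arithmetic. The snippet below copies `sampling_rate`, `hop_size`, `enc_hop_length`, `enc_strides`, and `dec_strides` from the config. How the model actually combines these fields is not documented in the diff, so the two formulas are assumptions, and the `product` helper is illustrative.

```shell
# Values copied from sound_tokenizer.json in this diff.
sampling_rate=48000
hop_size=1920
enc_hop_length=16
enc_strides="4 5 6"
dec_strides="2 4 5 6 8"

# Multiply all arguments together and print the result.
product() {
  result=1
  for factor in "$@"; do
    result=$((result * factor))
  done
  echo "$result"
}

# ASSUMPTION: encoder downsampling = spectrogram hop length times the encoder strides.
enc_total=$(( enc_hop_length * $(product $enc_strides) ))
# ASSUMPTION: decoder upsampling = product of the decoder strides.
dec_total=$(product $dec_strides)
# Latent frames produced per second of audio at the configured sampling rate.
latents_per_second=$(( sampling_rate / hop_size ))

echo "encoder samples per latent frame: ${enc_total}"
echo "decoder samples per latent frame: ${dec_total}"
echo "latent frames per second: ${latents_per_second}"
```

Under these assumptions both products equal the configured `hop_size` of 1920 samples (16 × 4 × 5 × 6 and 2 × 4 × 5 × 6 × 8), and 48000 / 1920 gives 25 latent frames per second of audio.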
sound_tokenizer/diffusion_pytorch_model.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:a4b6da5975bf89f6853fe589ad4752d281ac79fbdfad52ea90537fa080b4b9c2
- size 1985176840
+ oid sha256:9d4c61cde38acfb0cad9048a140c3533750277a8462b19dc08450d9fe1ad9879
+ size 1892409600