Add SGLang serving instructions

#10

by MickJ - opened 28 days ago

base: refs/heads/main ← from: refs/pr/10


+109 -69

Files changed (4)

README.md +62 -67
sound_tokenizer.ckpt +3 -0
sound_tokenizer.json +42 -0
sound_tokenizer/diffusion_pytorch_model.safetensors +2 -2

README.md CHANGED

@@ -169,7 +169,6 @@ Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated sys
 - [PyTorch](https://github.com/nvidia/cosmos3)
 - [vLLM-Omni](https://github.com/vllm-project/vllm-omni)
 - [Hugging Face Diffusers](https://huggingface.co/docs/diffusers/en/index)
-- [SGLang](https://github.com/sgl-project/sglang)
 **Supported Hardware Microarchitecture Compatibility:**
@@ -857,6 +856,67 @@ Example output from the command above:
 4. Place the flower into the red bottle.
 ```
 ### Diffusers
 #### Container
@@ -924,71 +984,6 @@ Example output:
 <video controls width="1280" height="720" src="https://huggingface.co/nvidia/Cosmos3-Super/resolve/main/assets/example_t2v_diffusers_output.mp4"></video>
-### SGLang
-[SGLang Diffusion](https://docs.sglang.io/docs/sglang-diffusion/index) can serve `nvidia/Cosmos3-Super` through OpenAI-compatible image and video generation endpoints. Install SGLang from the main branch with diffusion dependencies, then start the server:
-```bash
-git clone --branch main https://github.com/sgl-project/sglang.git
-cd sglang
-pip install -e "python[diffusion]"
-pip install "cosmos-guardrail==0.3.1"
-sglang serve \
-  --model-path nvidia/Cosmos3-Super \
-  --num-gpus 4
-```
-Cosmos 3 support in SGLang Diffusion currently requires the SGLang main branch. Switch to a stable SGLang release once Cosmos 3 support is included there.
-For the video-specialized checkpoint:
-```bash
-sglang serve \
-  --model-path nvidia/Cosmos3-Super-Image2Video \
-  --num-gpus 4
-```
-Supported SGLang endpoints:
-| Mode | Endpoint | Notes |
-| --- | --- | --- |
-| Text to image | `POST /v1/images/generations` | Returns base64 image data by default |
-| Text to video | `POST /v1/videos` | Creates an async job; poll `GET /v1/videos/{id}` and download `/content` |
-| Image to video | `POST /v1/videos` | Upload the conditioning image with `input_reference` |
-Example text-to-video request:
-```bash
-job_id=$(curl -sS -X POST http://localhost:30000/v1/videos \
-  --form-string "prompt=A small warehouse robot moves a blue box across a clean floor." \
-  --form-string "negative_prompt=blurry, distorted, low quality" \
-  --form-string "size=1280x720" \
-  --form-string "num_frames=81" \
-  --form-string "fps=24" \
-  --form-string "num_inference_steps=35" \
-  --form-string "guidance_scale=4.0" \
-  --form-string "flow_shift=10.0" \
-  --form-string "seed=42" \
-  --form-string 'extra_params={"guardrails":true,"use_resolution_template":false,"use_duration_template":false}' \
-  | python -c 'import json, sys; print(json.load(sys.stdin)["id"])')
-while true; do
-  status=$(curl -sS "http://localhost:30000/v1/videos/${job_id}" \
-    | python -c 'import json, sys; print(json.load(sys.stdin)["status"])')
-  [ "$status" = "completed" ] && break
-  [ "$status" = "failed" ] && exit 1
-  sleep 1
-done
-curl -sS -L "http://localhost:30000/v1/videos/${job_id}/content" \
-  -o cosmos3_super_t2v_output.mp4
-```
-Video-to-video, video-with-sound, and action generation are not supported by SGLang yet.
-For complete serving instructions and request examples, see the [Cosmos3 SGLang cookbook](https://docs.sglang.io/cookbook/diffusion/Cosmos/Cosmos3).
 ## Limitations
 Cosmos3 may produce imperfect outputs in challenging scenarios. Generation artifacts include temporal inconsistency, unstable camera or object motion, imprecise physical interactions, inaccurate audio-video synchronization, and action-state drift — especially in long-horizon or high-resolution outputs. Reasoning may also be incorrect: object states, causal relationships, spatial geometry, temporal ordering, agent intent, and future outcomes can be misinferred, and complex or long-context inputs may yield hallucinated entities, inconsistent interpretations, or implausible predictions. Because the model lacks an explicit physics simulator, 3D geometry, 4D space-time evolution, object permanence, contact dynamics, and physical laws are only approximated — producing artifacts such as disappearing or morphing objects, unrealistic collisions, and physically implausible motions. Quality further degrades in out-of-distribution environments, safety-critical edge cases, and domains underrepresented in training.
@@ -997,7 +992,7 @@ Cosmos3 outputs should not be treated as physically accurate simulation, reliabl
 ## Inference
-**Acceleration Engine:** [PyTorch](https://pytorch.org/), [vLLM](https://github.com/vllm-project/vllm), [vLLM-Omni](https://github.com/vllm-project/vllm-omni), [Hugging Face Diffusers](https://github.com/huggingface/diffusers), [SGLang](https://github.com/sgl-project/sglang), [SGLang Diffusion](https://docs.sglang.io/docs/sglang-diffusion/index)
 **Test Hardware:** GB200 and H100

 - [PyTorch](https://github.com/nvidia/cosmos3)
 - [vLLM-Omni](https://github.com/vllm-project/vllm-omni)
 - [Hugging Face Diffusers](https://huggingface.co/docs/diffusers/en/index)
 **Supported Hardware Microarchitecture Compatibility:**
 4. Place the flower into the red bottle.
 ```
+### SGLang
+SGLang-Diffusion can serve `nvidia/Cosmos3-Super` through OpenAI-compatible image and video endpoints. Install SGLang from source with diffusion dependencies, then start the server:
+```bash
+git clone https://github.com/sgl-project/sglang.git
+cd sglang
+pip install -e "python[diffusion]"
+pip install "cosmos-guardrail==0.3.1"
+sglang serve \
+  --model-path nvidia/Cosmos3-Super \
+  --num-gpus 4
+```
+For the video-specialized checkpoint:
+```bash
+sglang serve \
+  --model-path nvidia/Cosmos3-Super-Image2Video \
+  --num-gpus 4
+```
+Supported SGLang endpoints:
+| Mode | Endpoint | Notes |
+| --- | --- | --- |
+| Text to image | `POST /v1/images/generations` | Returns base64 image data by default |
+| Text to video | `POST /v1/videos` | Creates an async job; poll `GET /v1/videos/{id}` and download `/content` |
+| Image to video | `POST /v1/videos` | Upload the conditioning image with `input_reference` |
+Example text-to-video request:
+```bash
+job_id=$(curl -sS -X POST http://localhost:30000/v1/videos \
+  --form-string "prompt=A small warehouse robot moves a blue box across a clean floor." \
+  --form-string "negative_prompt=blurry, distorted, low quality" \
+  --form-string "size=1280x720" \
+  --form-string "num_frames=81" \
+  --form-string "fps=24" \
+  --form-string "num_inference_steps=35" \
+  --form-string "guidance_scale=4.0" \
+  --form-string "flow_shift=10.0" \
+  --form-string "seed=42" \
+  --form-string 'extra_params={"guardrails":true,"use_resolution_template":false,"use_duration_template":false}' \
+  | python -c 'import json, sys; print(json.load(sys.stdin)["id"])')
+while true; do
+  status=$(curl -sS "http://localhost:30000/v1/videos/${job_id}" \
+    | python -c 'import json, sys; print(json.load(sys.stdin)["status"])')
+  [ "$status" = "completed" ] && break
+  [ "$status" = "failed" ] && exit 1
+  sleep 1
+done
+curl -sS -L "http://localhost:30000/v1/videos/${job_id}/content" \
+  -o cosmos3_super_t2v_output.mp4
+```
+Video-to-video, video-with-sound, and action generation are not supported by SGLang yet.
 ### Diffusers
 #### Container
 <video controls width="1280" height="720" src="https://huggingface.co/nvidia/Cosmos3-Super/resolve/main/assets/example_t2v_diffusers_output.mp4"></video>
 ## Limitations
 Cosmos3 may produce imperfect outputs in challenging scenarios. Generation artifacts include temporal inconsistency, unstable camera or object motion, imprecise physical interactions, inaccurate audio-video synchronization, and action-state drift — especially in long-horizon or high-resolution outputs. Reasoning may also be incorrect: object states, causal relationships, spatial geometry, temporal ordering, agent intent, and future outcomes can be misinferred, and complex or long-context inputs may yield hallucinated entities, inconsistent interpretations, or implausible predictions. Because the model lacks an explicit physics simulator, 3D geometry, 4D space-time evolution, object permanence, contact dynamics, and physical laws are only approximated — producing artifacts such as disappearing or morphing objects, unrealistic collisions, and physically implausible motions. Quality further degrades in out-of-distribution environments, safety-critical edge cases, and domains underrepresented in training.
 ## Inference
+**Acceleration Engine:** [PyTorch](https://pytorch.org/), [vLLM](https://github.com/vllm-project/vllm), [vLLM-Omni](https://github.com/vllm-project/vllm-omni), [Hugging Face Diffusers](https://github.com/huggingface/diffusers)
 **Test Hardware:** GB200 and H100
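
Note on the SGLang section added above: its endpoint table lists `POST /v1/images/generations` for text to image, but the only request shown is text to video. A minimal sketch of a text-to-image call might look like the following. The request fields (`prompt`, `size`, `n`, `response_format`) and the response path `data[0].b64_json` come from the OpenAI images API and are only assumed to carry over to SGLang Diffusion; the helper function names are hypothetical, and port 30000 is reused from the text-to-video example.

```shell
# Sketch only: field names follow the OpenAI images API and are ASSUMED
# to be accepted by SGLang Diffusion; verify against the server docs.
BASE_URL="${BASE_URL:-http://localhost:30000}"   # port from the README example
PYTHON="$(command -v python3 || command -v python)"

# Build the JSON request body with a real JSON encoder so that quotes
# inside the prompt are escaped correctly.
build_image_request() {
  "$PYTHON" -c 'import json, sys; print(json.dumps({"prompt": sys.argv[1], "size": sys.argv[2], "n": 1, "response_format": "b64_json"}))' "$1" "$2"
}

# Read an images response on stdin and write the first image to path $1.
save_first_image() {
  "$PYTHON" -c '
import base64, json, sys
payload = json.load(sys.stdin)
with open(sys.argv[1], "wb") as fh:
    fh.write(base64.b64decode(payload["data"][0]["b64_json"]))
' "$1"
}

# Hypothetical helper: send prompt $1 at size $2 and save the image to $3.
generate_image() {
  build_image_request "$1" "$2" \
    | curl -sS -X POST "${BASE_URL}/v1/images/generations" \
        -H "Content-Type: application/json" --data-binary @- \
    | save_first_image "$3"
}
```

With a server running, a call such as `generate_image "A red bottle on a wooden table." 1280x720 cosmos3_super_t2i_output.png` would write the decoded image to disk. Whether `1280x720` is an accepted image size is also an assumption.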

sound_tokenizer.ckpt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:6daeb68a219f3e86c0918f616d78b9ebf073f3d700df63ff1c02d214c081d72d
+size 1985246007

sound_tokenizer.json ADDED

	@@ -0,0 +1,42 @@

+{
+    "model_type": "autoencoder_v2",
+    "sampling_rate": 48000,
+    "stereo": true,
+    "use_wav_as_input": true,
+    "normalize_volume": true,
+    "hop_size": 1920,
+    "input_channels": 1,
+    "enc_type": "spec_convnext",
+    "enc_dim": 192,
+    "enc_intermediate_dim": 768,
+    "enc_num_layers": 12,
+    "enc_num_blocks": 2,
+    "enc_n_fft": 64,
+    "enc_hop_length": 16,
+    "enc_latent_dim": 128,
+    "enc_c_mults": [1, 2, 4],
+    "enc_strides": [4, 5, 6],
+    "enc_identity_init": false,
+    "enc_use_snake": true,
+    "dec_type": "oobleck",
+    "dec_dim": 320,
+    "dec_c_mults": [1, 2, 4, 8, 16],
+    "dec_strides": [2, 4, 5, 6, 8],
+    "dec_use_snake": true,
+    "dec_final_tanh": false,
+    "dec_out_channels": 2,
+    "dec_anti_aliasing": false,
+    "dec_use_nearest_upsample": false,
+    "dec_use_tanh_at_final": false,
+    "bottleneck_type": "vae",
+    "bottleneck": {"type": "vae"},
+    "activation": "snakebeta",
+    "snake_logscale": true,
+    "anti_aliasing": false,
+    "use_cuda_kernel": false,
+    "causal": false,
+    "padding_mode": "zeros",
+    "vocoder_input_dim": 64,
+    "latent_mean": null,
+    "latent_std": null
+}
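
Note on the config above: several of its numbers can be cross-checked with simple arithmetic. The sketch below assumes (the PR does not say so) that `hop_size` is the number of audio samples per latent frame, and that the encoder's STFT hop (`enc_hop_length`) and `enc_strides` compose multiplicatively, as do the `dec_strides`. Under that reading, both products come out to 1920 samples, which at 48 kHz is 25 latent frames per second. The `product` helper is hypothetical.

```shell
# Values copied from sound_tokenizer.json in this PR.
sampling_rate=48000
hop_size=1920
enc_hop_length=16

# Hypothetical helper: multiply all arguments together.
product() (
  p=1
  for s in "$@"; do p=$((p * s)); done
  echo "$p"
)

# ASSUMPTION: total encoder downsampling = STFT hop * product of enc_strides.
enc_total=$(( enc_hop_length * $(product 4 5 6) ))   # 16 * 120 = 1920
# ASSUMPTION: total decoder upsampling = product of dec_strides.
dec_total=$(product 2 4 5 6 8)                       # 1920
# Latent frames per second of audio, if hop_size is samples per frame.
frames_per_second=$(( sampling_rate / hop_size ))    # 25

echo "encoder downsampling: ${enc_total} samples per latent frame"
echo "decoder upsampling:   ${dec_total} samples per latent frame"
echo "latent frame rate:    ${frames_per_second} frames per second"
```

If the assumptions hold, a mismatch between either product and `hop_size` would point to an inconsistent config; here all three agree.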

sound_tokenizer/diffusion_pytorch_model.safetensors CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:a4b6da5975bf89f6853fe589ad4752d281ac79fbdfad52ea90537fa080b4b9c2
-size 1985176840

 version https://git-lfs.github.com/spec/v1
+oid sha256:9d4c61cde38acfb0cad9048a140c3533750277a8462b19dc08450d9fe1ad9879
+size 1892409600