# SGLang diffusion CLI Inference
The SGLang-diffusion CLI provides a quick way to access the inference pipeline for image and video generation.
## Prerequisites
- A working SGLang diffusion installation and the `sglang` CLI available in `$PATH`.
## Supported Arguments
### Server Arguments
- `--model-path {MODEL_PATH}`: Path to the model or model ID
- `--lora-path {LORA_PATH}`: Path to a LoRA adapter (local path or HuggingFace model ID). If not specified, LoRA will not be applied.
- `--lora-nickname {NAME}`: Nickname for the LoRA adapter (default: `default`)
- `--num-gpus {NUM_GPUS}`: Number of GPUs to use
- `--tp-size {TP_SIZE}`: Tensor parallelism size (applies to the text encoder only; keep it at 1 when text-encoder offload is enabled, since layer-wise offload plus prefetch is faster)
- `--sp-degree {SP_SIZE}`: Sequence parallelism size (typically should match the number of GPUs)
- `--ulysses-degree {ULYSSES_DEGREE}`: The degree of DeepSpeed-Ulysses-style SP in USP
- `--ring-degree {RING_DEGREE}`: The degree of ring attention-style SP in USP
- `--attention-backend {BACKEND}`: Attention backend to use. For SGLang-native pipelines use `fa`, `torch_sdpa`, `sage_attn`, etc. For diffusers pipelines use diffusers backend names like `flash`, `_flash_3_hub`, `sage`, `xformers`.
- `--attention-backend-config {CONFIG}`: Configuration for the attention backend. Accepts a JSON string (e.g., '{"k": "v"}'), a path to a JSON/YAML file, or comma-separated key=value pairs (e.g., "k=v,k2=v2"); see the example after this list.
- `--cache-dit-config {PATH}`: Path to a Cache-DiT YAML/JSON config (diffusers backend only)
- `--dit-precision {DTYPE}`: Precision for the DiT model (currently supports fp32, fp16, and bf16).
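The three accepted formats for `--attention-backend-config` look like this (the keys and values are placeholders, as in the description above):
```bash
# Inline JSON string
sglang generate --model-path {MODEL_PATH} --attention-backend-config '{"k": "v"}'

# Path to a JSON or YAML file
sglang generate --model-path {MODEL_PATH} --attention-backend-config backend_config.yaml

# Comma-separated key=value pairs
sglang generate --model-path {MODEL_PATH} --attention-backend-config "k=v,k2=v2"
```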
### Sampling Parameters
- `--prompt {PROMPT}`: Text description of the image or video you want to generate
- `--num-inference-steps {STEPS}`: Number of denoising steps
- `--negative-prompt {PROMPT}`: Negative prompt to guide generation away from certain concepts
- `--seed {SEED}`: Random seed for reproducible generation
**Image/Video Configuration**
- `--height {HEIGHT}`: Height of the generated output
- `--width {WIDTH}`: Width of the generated output
- `--num-frames {NUM_FRAMES}`: Number of frames to generate
- `--fps {FPS}`: Frames per second for the saved output, if this is a video-generation task
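A minimal video-generation call that combines the sampling and size flags above might look like this sketch (the model, prompt, and values are illustrative):
```bash
sglang generate \
  --model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers \
  --prompt "A curious raccoon" \
  --negative-prompt "blurry, low quality" \
  --height 480 \
  --width 832 \
  --num-frames 45 \
  --fps 24 \
  --num-inference-steps 30 \
  --seed 1024 \
  --save-output
```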
**Frame Interpolation** (video only)
Frame interpolation is a post-processing step that synthesizes new frames
between each pair of consecutive generated frames, producing smoother
motion without re-running the diffusion model. The `--frame-interpolation-exp`
flag controls how many rounds of interpolation to apply: each round inserts one
new frame into every gap between adjacent frames, so the output frame count
follows the formula **(N − 1) × 2^exp + 1** (e.g. 5 original frames with
`exp=1` → 4 gaps × 1 new frame + 5 originals = **9** frames; with `exp=2` →
**17** frames).
- `--enable-frame-interpolation`: Enable frame interpolation. Model weights are downloaded automatically on first use.
- `--frame-interpolation-exp {EXP}`: Interpolation exponent — `1` = 2× temporal resolution, `2` = 4×, etc. (default: `1`)
- `--frame-interpolation-scale {SCALE}`: RIFE inference scale; use `0.5` for high-resolution inputs to save memory (default: `1.0`)
- `--frame-interpolation-model-path {PATH}`: Local directory or HuggingFace repo ID containing RIFE `flownet.pkl` weights (default: `elfgum/RIFE-4.22.lite`, downloaded automatically)
Example — generate a 5-frame video and interpolate to 9 frames ((5 − 1) × 2¹ + 1 = 9):
```bash
sglang generate \
  --model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers \
  --prompt "A dog running through a park" \
  --num-frames 5 \
  --enable-frame-interpolation \
  --frame-interpolation-exp 1 \
  --save-output
```
**Output Options**
- `--output-path {PATH}`: Directory to save the generated image/video
- `--save-output`: Whether to save the image/video to disk
- `--return-frames`: Whether to return the raw frames
### Using Configuration Files
Instead of specifying all parameters on the command line, you can use a configuration file:
```bash
sglang generate --config {CONFIG_FILE_PATH}
```
The configuration file should be in JSON or YAML format with the same parameter names as the CLI options. Command-line arguments take precedence over settings in the configuration file, allowing you to override specific values while keeping the rest from the configuration file.
Example configuration file (config.json):
```json
{
  "model_path": "FastVideo/FastHunyuan-diffusers",
  "prompt": "A beautiful woman in a red dress walking down a street",
  "output_path": "outputs/",
  "num_gpus": 2,
  "sp_size": 2,
  "tp_size": 1,
  "num_frames": 45,
  "height": 720,
  "width": 1280,
  "num_inference_steps": 6,
  "seed": 1024,
  "fps": 24,
  "precision": "bf16",
  "vae_precision": "fp16",
  "vae_tiling": true,
  "vae_sp": true,
  "vae_config": {
    "load_encoder": false,
    "load_decoder": true,
    "tile_sample_min_height": 256,
    "tile_sample_min_width": 256
  },
  "text_encoder_precisions": [
    "fp16",
    "fp16"
  ],
  "mask_strategy_file_path": null,
  "enable_torch_compile": false
}
```
Or using YAML format (config.yaml):
```yaml
model_path: "FastVideo/FastHunyuan-diffusers"
prompt: "A beautiful woman in a red dress walking down a street"
output_path: "outputs/"
num_gpus: 2
sp_size: 2
tp_size: 1
num_frames: 45
height: 720
width: 1280
num_inference_steps: 6
seed: 1024
fps: 24
precision: "bf16"
vae_precision: "fp16"
vae_tiling: true
vae_sp: true
vae_config:
  load_encoder: false
  load_decoder: true
  tile_sample_min_height: 256
  tile_sample_min_width: 256
text_encoder_precisions:
  - "fp16"
  - "fp16"
mask_strategy_file_path: null
enable_torch_compile: false
```
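Command-line arguments win over file settings; for example, to reuse the file above but change only the seed:
```bash
# Everything comes from config.yaml except the seed, which the CLI overrides
sglang generate --config config.yaml --seed 42
```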
To see all the options, you can use the `--help` flag:
```bash
sglang generate --help
```
## Serve
Launch the SGLang diffusion HTTP server and interact with it using the OpenAI SDK and curl.
### Start the server
Use the following command to launch the server:
```bash
SERVER_ARGS=(
  --model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers
  --text-encoder-cpu-offload
  --pin-cpu-memory
  --num-gpus 4
  --ulysses-degree=2
  --ring-degree=2
)
sglang serve "${SERVER_ARGS[@]}"
```
- **--model-path**: Which model to load. The example uses `Wan-AI/Wan2.1-T2V-1.3B-Diffusers`.
- **--port**: HTTP port to listen on (default: `30010`).
For detailed API usage, including Image, Video Generation and LoRA management, please refer to the [OpenAI API Documentation](openai_api.md).
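As a quick smoke test, you can hit the server with curl. The sketch below assumes the default port `30010` and an OpenAI-compatible `/v1/images/generations` route; treat the route and request fields as assumptions and consult the API documentation above for the authoritative schema:
```bash
# Assumed OpenAI-compatible route and fields; see openai_api.md for the exact schema
curl http://localhost:30010/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Wan-AI/Wan2.1-T2V-1.3B-Diffusers",
    "prompt": "A curious raccoon"
  }'
```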
### Cloud Storage Support
SGLang diffusion supports automatically uploading generated images and videos to S3-compatible cloud storage (e.g., AWS S3, MinIO, Alibaba Cloud OSS, Tencent Cloud COS).
When enabled, the server follows a **Generate -> Upload -> Delete** workflow:
1. The artifact is generated to a temporary local file.
2. The file is immediately uploaded to the configured S3 bucket in a background thread.
3. Upon successful upload, the local file is deleted.
4. The API response returns the public URL of the uploaded object.
**Configuration**
Cloud storage is enabled via environment variables. Note that `boto3` must be installed separately (`pip install boto3`) to use this feature.
```bash
# Enable S3 storage
export SGLANG_CLOUD_STORAGE_TYPE=s3
export SGLANG_S3_BUCKET_NAME=my-bucket
export SGLANG_S3_ACCESS_KEY_ID=your-access-key
export SGLANG_S3_SECRET_ACCESS_KEY=your-secret-key
# Optional: Custom endpoint for MinIO/OSS/COS
export SGLANG_S3_ENDPOINT_URL=https://minio.example.com
```
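With those variables set (or passed inline) and `boto3` installed, the launch command itself is unchanged; the bucket name and credentials below are placeholders:
```bash
pip install boto3

# One-shot launch with S3 uploads enabled (placeholder values)
SGLANG_CLOUD_STORAGE_TYPE=s3 \
SGLANG_S3_BUCKET_NAME=my-bucket \
SGLANG_S3_ACCESS_KEY_ID=your-access-key \
SGLANG_S3_SECRET_ACCESS_KEY=your-secret-key \
sglang serve --model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers
```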
See [Environment Variables Documentation](../environment_variables.md) for more details.
## Generate
Run a one-off generation task without launching a persistent server.
To use it, pass both server arguments and sampling parameters after the `generate` subcommand in a single command, for example:
```bash
SERVER_ARGS=(
  --model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers
  --text-encoder-cpu-offload
  --pin-cpu-memory
  --num-gpus 4
  --ulysses-degree=2
  --ring-degree=2
)
SAMPLING_ARGS=(
  --prompt "A curious raccoon"
  --save-output
  --output-path outputs
  --output-file-name "A curious raccoon.mp4"
)
sglang generate "${SERVER_ARGS[@]}" "${SAMPLING_ARGS[@]}"
# Alternatively, set the SGLANG_CACHE_DIT_ENABLED environment variable to true to enable cache acceleration
SGLANG_CACHE_DIT_ENABLED=true sglang generate "${SERVER_ARGS[@]}" "${SAMPLING_ARGS[@]}"
```
Once the generation task has finished, the server will shut down automatically.
> [!NOTE]
> The HTTP server-related arguments are ignored in this subcommand.
## Component Path Overrides
SGLang diffusion allows you to override any pipeline component (e.g., `vae`, `transformer`, `text_encoder`) by specifying a custom checkpoint path. This is useful for swapping in an alternative or distilled component without modifying the base model.
### Example: FLUX.2-dev with Tiny AutoEncoder
You can override **any** component with `--<component>-path`, where `<component>` matches the corresponding key in the model's `model_index.json`. For example, replace the default VAE with a distilled tiny autoencoder for ~3x faster decoding:
```bash
# With a HuggingFace repo ID:
sglang serve \
  --model-path=black-forest-labs/FLUX.2-dev \
  --vae-path=fal/FLUX.2-Tiny-AutoEncoder

# Or with a local path:
sglang serve \
  --model-path=black-forest-labs/FLUX.2-dev \
  --vae-path=~/.cache/huggingface/hub/models--fal--FLUX.2-Tiny-AutoEncoder/snapshots/.../vae
```
**Important:**
- The component key must match the one in your model's `model_index.json` (e.g., `vae`).
- The path must:
  - either be a HuggingFace repo ID (e.g., `fal/FLUX.2-Tiny-AutoEncoder`),
  - or point to a **complete component folder** containing `config.json` and safetensors files.
## Diffusers Backend
SGLang diffusion supports a **diffusers backend** that allows you to run any diffusers-compatible model through SGLang's infrastructure using vanilla diffusers pipelines. This is useful for running models without native SGLang implementations or models with custom pipeline classes.
### Arguments
| Argument | Values | Description |
|----------|--------|-------------|
| `--backend` | `auto` (default), `sglang`, `diffusers` | `auto`: prefer native SGLang, fallback to diffusers. `sglang`: force native (fails if unavailable). `diffusers`: force vanilla diffusers pipeline. |
| `--diffusers-attention-backend` | `flash`, `_flash_3_hub`, `sage`, `xformers`, `native` | Attention backend for diffusers pipelines. See [diffusers attention backends](https://huggingface.co/docs/diffusers/main/en/optimization/attention_backends). |
| `--trust-remote-code` | flag | Required for models with custom pipeline classes (e.g., Ovis). |
| `--vae-tiling` | flag | Enable VAE tiling for large image support (decodes tile-by-tile). |
| `--vae-slicing` | flag | Enable VAE slicing for lower memory usage (decodes slice-by-slice). |
| `--dit-precision` | `fp16`, `bf16`, `fp32` | Precision for the diffusion transformer. |
| `--vae-precision` | `fp16`, `bf16`, `fp32` | Precision for the VAE. |
| `--enable-torch-compile` | flag | Enable `torch.compile` for diffusers pipelines. |
| `--cache-dit-config` | `{PATH}` | Path to a Cache-DiT YAML/JSON config file for accelerating diffusers pipelines with Cache-DiT. |
### Example: Running Ovis-Image-7B
[Ovis-Image-7B](https://huggingface.co/AIDC-AI/Ovis-Image-7B) is a 7B text-to-image model optimized for high-quality text rendering.
```bash
sglang generate \
  --model-path AIDC-AI/Ovis-Image-7B \
  --backend diffusers \
  --trust-remote-code \
  --diffusers-attention-backend flash \
  --prompt "A serene Japanese garden with cherry blossoms" \
  --height 1024 \
  --width 1024 \
  --num-inference-steps 30 \
  --save-output \
  --output-path outputs \
  --output-file-name ovis_garden.png
```
### Extra Diffusers Arguments
For pipeline-specific parameters not exposed via CLI, use `diffusers_kwargs` in a config file:
```json
{
  "model_path": "AIDC-AI/Ovis-Image-7B",
  "backend": "diffusers",
  "prompt": "A beautiful landscape",
  "diffusers_kwargs": {
    "cross_attention_kwargs": {"scale": 0.5}
  }
}
```
```bash
sglang generate --config config.json
```
### Cache-DiT Acceleration
Users who use the diffusers backend can also leverage Cache-DiT acceleration and load custom cache configs from a YAML file to boost performance of diffusers pipelines. See the [Cache-DiT Acceleration](https://docs.sglang.io/diffusion/performance/cache/cache_dit.html) documentation for details.
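As a sketch, pointing the diffusers backend at a Cache-DiT config file looks like this (`cache_dit.yaml` is a placeholder; see the linked documentation for the actual config schema):
```bash
sglang generate \
  --model-path AIDC-AI/Ovis-Image-7B \
  --backend diffusers \
  --trust-remote-code \
  --cache-dit-config cache_dit.yaml \
  --prompt "A serene Japanese garden with cherry blossoms" \
  --save-output
```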