# SGLang diffusion CLI Inference
The SGLang diffusion CLI provides a quick way to access the inference pipeline for image and video generation.
## Prerequisites
- A working SGLang diffusion installation and the `sglang` CLI available in `$PATH`.
## Supported Arguments
### Server Arguments
- `--model-path {MODEL_PATH}`: Path to the model or model ID
- `--lora-path {LORA_PATH}`: Path to a LoRA adapter (local path or HuggingFace model ID). If not specified, LoRA will not be applied.
- `--lora-nickname {NAME}`: Nickname for the LoRA adapter (default: `default`).
- `--num-gpus {NUM_GPUS}`: Number of GPUs to use
- `--tp-size {TP_SIZE}`: Tensor parallelism size (only for the encoder; should not be larger than 1 if text encoder offload is enabled, as layer-wise offload plus prefetch is faster)
- `--sp-degree {SP_SIZE}`: Sequence parallelism size (typically should match the number of GPUs)
- `--ulysses-degree {ULYSSES_DEGREE}`: The degree of DeepSpeed-Ulysses-style SP in USP
- `--ring-degree {RING_DEGREE}`: The degree of ring attention-style SP in USP
- `--attention-backend {BACKEND}`: Attention backend to use. For SGLang-native pipelines use `fa`, `torch_sdpa`, `sage_attn`, etc. For diffusers pipelines use diffusers backend names like `flash`, `_flash_3_hub`, `sage`, `xformers`.
- `--attention-backend-config {CONFIG}`: Configuration for the attention backend. Can be a JSON string (e.g., `'{"k": "v"}'`), a path to a JSON/YAML file, or key=value pairs (e.g., `"k=v,k2=v2"`); see the example after this list.
- `--cache-dit-config {PATH}`: Path to a Cache-DiT YAML/JSON config (diffusers backend only)
- `--dit-precision {DTYPE}`: Precision for the DiT model (currently supports fp32, fp16, and bf16).
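As an illustration, the three accepted `--attention-backend-config` formats look like this; the config key `some_option` is a hypothetical placeholder, not a documented option:

```bash
# 1) Inline JSON string (the key "some_option" is a hypothetical placeholder)
sglang generate \
  --model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers \
  --prompt "A curious raccoon" \
  --attention-backend fa \
  --attention-backend-config '{"some_option": 128}' \
  --save-output

# 2) The same configuration as a JSON/YAML file:
#    --attention-backend-config attn_backend.yaml
# 3) Or as key=value pairs:
#    --attention-backend-config "some_option=128"
```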
### Sampling Parameters
- `--prompt {PROMPT}`: Text description of the image or video you want to generate
- `--num-inference-steps {STEPS}`: Number of denoising steps
- `--negative-prompt {PROMPT}`: Negative prompt to guide generation away from certain concepts
- `--seed {SEED}`: Random seed for reproducible generation
**Image/Video Configuration**
- `--height {HEIGHT}`: Height of the generated output
- `--width {WIDTH}`: Width of the generated output
- `--num-frames {NUM_FRAMES}`: Number of frames to generate
- `--fps {FPS}`: Frames per second for the saved output, if this is a video-generation task
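For example, a text-to-image call using the sizing flags above (the model path and values are illustrative):

```bash
sglang generate \
  --model-path black-forest-labs/FLUX.2-dev \
  --prompt "A serene Japanese garden with cherry blossoms" \
  --height 1024 \
  --width 1024 \
  --num-inference-steps 30 \
  --seed 1024 \
  --save-output
```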
**Frame Interpolation** (video only)
Frame interpolation is a post-processing step that synthesizes new frames
between each pair of consecutive generated frames, producing smoother
motion without re-running the diffusion model. The `--frame-interpolation-exp`
flag controls how many rounds of interpolation to apply: each round inserts one
new frame into every gap between adjacent frames, so the output frame count
follows the formula **(N − 1) × 2^exp + 1** (e.g. 5 original frames with
`exp=1` → 4 gaps × 1 new frame + 5 originals = **9** frames; with `exp=2`
**17** frames).
- `--enable-frame-interpolation`: Enable frame interpolation. Model weights are downloaded automatically on first use.
- `--frame-interpolation-exp {EXP}`: Interpolation exponent — `1` = 2× temporal resolution, `2` = 4×, etc. (default: `1`)
- `--frame-interpolation-scale {SCALE}`: RIFE inference scale; use `0.5` for high-resolution inputs to save memory (default: `1.0`)
- `--frame-interpolation-model-path {PATH}`: Local directory or HuggingFace repo ID containing RIFE `flownet.pkl` weights (default: `elfgum/RIFE-4.22.lite`, downloaded automatically)
Example — generate a 5-frame video and interpolate to 9 frames ((5 − 1) × 2¹ + 1 = 9):
```bash
sglang generate \
--model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers \
--prompt "A dog running through a park" \
--num-frames 5 \
--enable-frame-interpolation \
--frame-interpolation-exp 1 \
--save-output
```
**Output Options**
- `--output-path {PATH}`: Directory to save the generated image/video
- `--save-output`: Whether to save the image/video to disk
- `--return-frames`: Whether to return the raw frames
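For example, to write the result to `outputs/raccoon.mp4` (the model path and file name are illustrative):

```bash
sglang generate \
  --model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers \
  --prompt "A curious raccoon" \
  --save-output \
  --output-path outputs \
  --output-file-name raccoon.mp4
```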
### Using Configuration Files
Instead of specifying all parameters on the command line, you can use a configuration file:
```bash
sglang generate --config {CONFIG_FILE_PATH}
```
The configuration file should be in JSON or YAML format with the same parameter names as the CLI options. Command-line arguments take precedence over settings in the configuration file, allowing you to override specific values while keeping the rest from the configuration file.
Example configuration file (config.json):
```json
{
"model_path": "FastVideo/FastHunyuan-diffusers",
"prompt": "A beautiful woman in a red dress walking down a street",
"output_path": "outputs/",
"num_gpus": 2,
"sp_size": 2,
"tp_size": 1,
"num_frames": 45,
"height": 720,
"width": 1280,
"num_inference_steps": 6,
"seed": 1024,
"fps": 24,
"precision": "bf16",
"vae_precision": "fp16",
"vae_tiling": true,
"vae_sp": true,
"vae_config": {
"load_encoder": false,
"load_decoder": true,
"tile_sample_min_height": 256,
"tile_sample_min_width": 256
},
"text_encoder_precisions": [
"fp16",
"fp16"
],
"mask_strategy_file_path": null,
"enable_torch_compile": false
}
```
Or using YAML format (config.yaml):
```yaml
model_path: "FastVideo/FastHunyuan-diffusers"
prompt: "A beautiful woman in a red dress walking down a street"
output_path: "outputs/"
num_gpus: 2
sp_size: 2
tp_size: 1
num_frames: 45
height: 720
width: 1280
num_inference_steps: 6
seed: 1024
fps: 24
precision: "bf16"
vae_precision: "fp16"
vae_tiling: true
vae_sp: true
vae_config:
  load_encoder: false
  load_decoder: true
  tile_sample_min_height: 256
  tile_sample_min_width: 256
text_encoder_precisions:
- "fp16"
- "fp16"
mask_strategy_file_path: null
enable_torch_compile: false
```
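Command-line arguments override the corresponding config values, so you can, for example, reuse the same file while changing only the seed:

```bash
sglang generate --config config.yaml --seed 42
```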
To see all the options, you can use the `--help` flag:
```bash
sglang generate --help
```
## Serve
Launch the SGLang diffusion HTTP server and interact with it using the OpenAI SDK and curl.
### Start the server
Use the following command to launch the server:
```bash
SERVER_ARGS=(
--model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers
--text-encoder-cpu-offload
--pin-cpu-memory
--num-gpus 4
--ulysses-degree=2
--ring-degree=2
)
sglang serve "${SERVER_ARGS[@]}"
```
- **--model-path**: Which model to load. The example uses `Wan-AI/Wan2.1-T2V-1.3B-Diffusers`.
- **--port**: HTTP port to listen on (the default here is `30010`).
For detailed API usage, including image and video generation and LoRA management, please refer to the [OpenAI API Documentation](openai_api.md).
### Cloud Storage Support
SGLang diffusion supports automatically uploading generated images and videos to S3-compatible cloud storage (e.g., AWS S3, MinIO, Alibaba Cloud OSS, Tencent Cloud COS).
When enabled, the server follows a **Generate -> Upload -> Delete** workflow:
1. The artifact is generated to a temporary local file.
2. The file is immediately uploaded to the configured S3 bucket in a background thread.
3. Upon successful upload, the local file is deleted.
4. The API response returns the public URL of the uploaded object.
**Configuration**
Cloud storage is enabled via environment variables. Note that `boto3` must be installed separately (`pip install boto3`) to use this feature.
```bash
# Enable S3 storage
export SGLANG_CLOUD_STORAGE_TYPE=s3
export SGLANG_S3_BUCKET_NAME=my-bucket
export SGLANG_S3_ACCESS_KEY_ID=your-access-key
export SGLANG_S3_SECRET_ACCESS_KEY=your-secret-key
# Optional: Custom endpoint for MinIO/OSS/COS
export SGLANG_S3_ENDPOINT_URL=https://minio.example.com
```
See [Environment Variables Documentation](../environment_variables.md) for more details.
## Generate
Run a one-off generation task without launching a persistent server.
To use it, pass both server arguments and sampling parameters in one command, after the `generate` subcommand, for example:
```bash
SERVER_ARGS=(
--model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers
--text-encoder-cpu-offload
--pin-cpu-memory
--num-gpus 4
--ulysses-degree=2
--ring-degree=2
)
SAMPLING_ARGS=(
--prompt "A curious raccoon"
--save-output
--output-path outputs
--output-file-name "A curious raccoon.mp4"
)
sglang generate "${SERVER_ARGS[@]}" "${SAMPLING_ARGS[@]}"
# Or, set the SGLANG_CACHE_DIT_ENABLED environment variable to true to enable cache acceleration
SGLANG_CACHE_DIT_ENABLED=true sglang generate "${SERVER_ARGS[@]}" "${SAMPLING_ARGS[@]}"
```
Once the generation task finishes, the underlying server shuts down automatically.
> [!NOTE]
> The HTTP server-related arguments are ignored in this subcommand.
## Component Path Overrides
SGLang diffusion allows you to override any pipeline component (e.g., `vae`, `transformer`, `text_encoder`) by specifying a custom checkpoint path. This is useful for swapping in alternative component checkpoints, such as the distilled tiny autoencoder shown below, without modifying the rest of the model.
### Example: FLUX.2-dev with Tiny AutoEncoder
You can override **any** component with `--<component>-path`, where `<component>` matches the corresponding key in the model's `model_index.json`. For example, replace the default VAE with a distilled tiny autoencoder for ~3x faster decoding:
```bash
# Override the VAE with a HuggingFace repo ID
sglang serve \
  --model-path=black-forest-labs/FLUX.2-dev \
  --vae-path=fal/FLUX.2-Tiny-AutoEncoder

# Or with a local path to the component folder
sglang serve \
  --model-path=black-forest-labs/FLUX.2-dev \
  --vae-path=~/.cache/huggingface/hub/models--fal--FLUX.2-Tiny-AutoEncoder/snapshots/.../vae
```
**Important:**
- The component key must match the one in your model's `model_index.json` (e.g., `vae`).
- The path must:
  - either be a HuggingFace repo ID (e.g., `fal/FLUX.2-Tiny-AutoEncoder`),
  - or point to a **complete component folder** containing `config.json` and the safetensors files.
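To check which component keys a given model defines, you can inspect its `model_index.json`; one way to do this (assuming you have access to the repo) is with the `huggingface-cli download` helper:

```bash
# Download only model_index.json and print it; the top-level keys
# (e.g. "vae", "transformer", "text_encoder") are the overridable components.
cat "$(huggingface-cli download black-forest-labs/FLUX.2-dev model_index.json)"
```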
## Diffusers Backend
SGLang diffusion supports a **diffusers backend** that allows you to run any diffusers-compatible model through SGLang's infrastructure using vanilla diffusers pipelines. This is useful for running models without native SGLang implementations or models with custom pipeline classes.
### Arguments
| Argument | Values | Description |
|----------|--------|-------------|
| `--backend` | `auto` (default), `sglang`, `diffusers` | `auto`: prefer native SGLang, fallback to diffusers. `sglang`: force native (fails if unavailable). `diffusers`: force vanilla diffusers pipeline. |
| `--diffusers-attention-backend` | `flash`, `_flash_3_hub`, `sage`, `xformers`, `native` | Attention backend for diffusers pipelines. See [diffusers attention backends](https://huggingface.co/docs/diffusers/main/en/optimization/attention_backends). |
| `--trust-remote-code` | flag | Required for models with custom pipeline classes (e.g., Ovis). |
| `--vae-tiling` | flag | Enable VAE tiling for large image support (decodes tile-by-tile). |
| `--vae-slicing` | flag | Enable VAE slicing for lower memory usage (decodes slice-by-slice). |
| `--dit-precision` | `fp16`, `bf16`, `fp32` | Precision for the diffusion transformer. |
| `--vae-precision` | `fp16`, `bf16`, `fp32` | Precision for the VAE. |
| `--enable-torch-compile` | flag | Enable `torch.compile` for diffusers pipelines. |
| `--cache-dit-config` | `{PATH}` | Path to a Cache-DiT YAML/JSON config file for accelerating diffusers pipelines with Cache-DiT. |
### Example: Running Ovis-Image-7B
[Ovis-Image-7B](https://huggingface.co/AIDC-AI/Ovis-Image-7B) is a 7B text-to-image model optimized for high-quality text rendering.
```bash
sglang generate \
--model-path AIDC-AI/Ovis-Image-7B \
--backend diffusers \
--trust-remote-code \
--diffusers-attention-backend flash \
--prompt "A serene Japanese garden with cherry blossoms" \
--height 1024 \
--width 1024 \
--num-inference-steps 30 \
--save-output \
--output-path outputs \
--output-file-name ovis_garden.png
```
### Extra Diffusers Arguments
For pipeline-specific parameters not exposed via CLI, use `diffusers_kwargs` in a config file:
```json
{
"model_path": "AIDC-AI/Ovis-Image-7B",
"backend": "diffusers",
"prompt": "A beautiful landscape",
"diffusers_kwargs": {
"cross_attention_kwargs": {"scale": 0.5}
}
}
```
```bash
sglang generate --config config.json
```
### Cache-DiT Acceleration
When using the diffusers backend, you can also leverage Cache-DiT acceleration by loading a custom cache config from a YAML file to boost pipeline performance. See the [Cache-DiT Acceleration](https://docs.sglang.io/diffusion/performance/cache/cache_dit.html) documentation for details.
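As a sketch, passing a cache config to the Ovis example above might look like the following; `cache_dit.yaml` is an illustrative file name, and the supported fields are described in the linked documentation:

```bash
sglang generate \
  --model-path AIDC-AI/Ovis-Image-7B \
  --backend diffusers \
  --trust-remote-code \
  --cache-dit-config cache_dit.yaml \
  --prompt "A serene Japanese garden with cherry blossoms" \
  --save-output
```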