# SGLang diffusion CLI Inference

The SGLang diffusion CLI provides a quick way to access the inference pipeline for image and video generation.

## Prerequisites

- A working SGLang diffusion installation and the `sglang` CLI available in `$PATH`.

## Supported Arguments

### Server Arguments

- `--model-path {MODEL_PATH}`: Path to the model or model ID
- `--lora-path {LORA_PATH}`: Path to a LoRA adapter (local path or Hugging Face model ID). If not specified, LoRA will not be applied.
- `--lora-nickname {NAME}`: Nickname for the LoRA adapter (default: `default`)
- `--num-gpus {NUM_GPUS}`: Number of GPUs to use
- `--tp-size {TP_SIZE}`: Tensor parallelism size (only for the encoder; should not be larger than 1 if text encoder offload is enabled, as layer-wise offload plus prefetch is faster)
- `--sp-degree {SP_SIZE}`: Sequence parallelism size (typically should match the number of GPUs)
- `--ulysses-degree {ULYSSES_DEGREE}`: The degree of DeepSpeed-Ulysses-style SP in USP
- `--ring-degree {RING_DEGREE}`: The degree of ring attention-style SP in USP
- `--attention-backend {BACKEND}`: Attention backend to use. For SGLang-native pipelines use `fa`, `torch_sdpa`, `sage_attn`, etc. For diffusers pipelines use diffusers backend names like `flash`, `_flash_3_hub`, `sage`, `xformers`.
- `--attention-backend-config {CONFIG}`: Configuration for the attention backend. Can be a JSON string (e.g., `'{"k": "v"}'`), a path to a JSON/YAML file, or key=value pairs (e.g., `"k=v,k2=v2"`).
- `--cache-dit-config {PATH}`: Path to a Cache-DiT YAML/JSON config (diffusers backend only)
- `--dit-precision {DTYPE}`: Precision for the DiT model (currently supports `fp32`, `fp16`, and `bf16`)

### Sampling Parameters

- `--prompt {PROMPT}`: Text description for the image or video you want to generate
- `--num-inference-steps {STEPS}`: Number of denoising steps
- `--negative-prompt {PROMPT}`: Negative prompt to guide generation away from certain concepts
- `--seed {SEED}`: Random seed for reproducible generation

**Image/Video Configuration**

- `--height {HEIGHT}`: Height of the generated output
- `--width {WIDTH}`: Width of the generated output
- `--num-frames {NUM_FRAMES}`: Number of frames to generate
- `--fps {FPS}`: Frames per second for the saved output (video generation only)
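Taken together, a minimal single-GPU text-to-image run might look like the sketch below. All flags are documented above; the model and parameter values are illustrative, not recommendations.

```bash
# Minimal text-to-image invocation; model and values are illustrative.
sglang generate \
  --model-path black-forest-labs/FLUX.2-dev \
  --prompt "A lighthouse on a cliff at dusk" \
  --height 1024 \
  --width 1024 \
  --num-inference-steps 28 \
  --seed 42 \
  --save-output \
  --output-path outputs
```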
**Frame Interpolation** (video only)

Frame interpolation is a post-processing step that synthesizes new frames between each pair of consecutive generated frames, producing smoother motion without re-running the diffusion model. The `--frame-interpolation-exp` flag controls how many rounds of interpolation to apply: each round inserts one new frame into every gap between adjacent frames, so the output frame count follows the formula **(N − 1) × 2^exp + 1** (e.g. 5 original frames with `exp=1` → 4 gaps × 1 new frame + 5 originals = **9** frames; with `exp=2` → **17** frames).

- `--enable-frame-interpolation`: Enable frame interpolation. Model weights are downloaded automatically on first use.
- `--frame-interpolation-exp {EXP}`: Interpolation exponent — `1` = 2× temporal resolution, `2` = 4×, etc. (default: `1`)
- `--frame-interpolation-scale {SCALE}`: RIFE inference scale; use `0.5` for high-resolution inputs to save memory (default: `1.0`)
- `--frame-interpolation-model-path {PATH}`: Local directory or Hugging Face repo ID containing RIFE `flownet.pkl` weights (default: `elfgum/RIFE-4.22.lite`, downloaded automatically)

Example — generate a 5-frame video and interpolate to 9 frames ((5 − 1) × 2¹ + 1 = 9):

```bash
sglang generate \
  --model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers \
  --prompt "A dog running through a park" \
  --num-frames 5 \
  --enable-frame-interpolation \
  --frame-interpolation-exp 1 \
  --save-output
```

**Output Options**

- `--output-path {PATH}`: Directory to save the generated image/video
- `--save-output`: Whether to save the image/video to disk
- `--return-frames`: Whether to return the raw frames

### Using Configuration Files

Instead of specifying all parameters on the command line, you can use a configuration file:

```bash
sglang generate --config {CONFIG_FILE_PATH}
```

The configuration file should be in JSON or YAML format with the same parameter names as the CLI options. Command-line arguments take precedence over settings in the configuration file, allowing you to override specific values while keeping the rest from the configuration file.

Example configuration file (config.json):

```json
{
  "model_path": "FastVideo/FastHunyuan-diffusers",
  "prompt": "A beautiful woman in a red dress walking down a street",
  "output_path": "outputs/",
  "num_gpus": 2,
  "sp_size": 2,
  "tp_size": 1,
  "num_frames": 45,
  "height": 720,
  "width": 1280,
  "num_inference_steps": 6,
  "seed": 1024,
  "fps": 24,
  "precision": "bf16",
  "vae_precision": "fp16",
  "vae_tiling": true,
  "vae_sp": true,
  "vae_config": {
    "load_encoder": false,
    "load_decoder": true,
    "tile_sample_min_height": 256,
    "tile_sample_min_width": 256
  },
  "text_encoder_precisions": [
    "fp16",
    "fp16"
  ],
  "mask_strategy_file_path": null,
  "enable_torch_compile": false
}
```

Or using YAML format (config.yaml):

```yaml
model_path: "FastVideo/FastHunyuan-diffusers"
prompt: "A beautiful woman in a red dress walking down a street"
output_path: "outputs/"
num_gpus: 2
sp_size: 2
tp_size: 1
num_frames: 45
height: 720
width: 1280
num_inference_steps: 6
seed: 1024
fps: 24
precision: "bf16"
vae_precision: "fp16"
vae_tiling: true
vae_sp: true
vae_config:
  load_encoder: false
  load_decoder: true
  tile_sample_min_height: 256
  tile_sample_min_width: 256
text_encoder_precisions:
  - "fp16"
  - "fp16"
mask_strategy_file_path: null
enable_torch_compile: false
```

To see all the options, you can use the `--help` flag:

```bash
sglang generate --help
```

## Serve

Launch the SGLang diffusion HTTP server and interact with it using the OpenAI SDK and curl.

### Start the server

Use the following command to launch the server:

```bash
SERVER_ARGS=(
  --model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers
  --text-encoder-cpu-offload
  --pin-cpu-memory
  --num-gpus 4
  --ulysses-degree=2
  --ring-degree=2
)
sglang serve "${SERVER_ARGS[@]}"
```

- **--model-path**: Which model to load. The example uses `Wan-AI/Wan2.1-T2V-1.3B-Diffusers`.
- **--port**: HTTP port to listen on (the default here is `30010`).

For detailed API usage, including image and video generation and LoRA management, please refer to the [OpenAI API Documentation](openai_api.md).
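Once the server is up, you can talk to it over plain HTTP. The sketch below is illustrative only: it assumes the server mirrors OpenAI's `images/generations` route on the default port `30010` and that a text-to-image model is being served; consult the [OpenAI API Documentation](openai_api.md) for the authoritative endpoints and request fields.

```bash
# Illustrative sketch: the route, port, and payload fields assume an
# OpenAI-compatible images API; see openai_api.md for the real contract.
curl -s http://localhost:30010/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{"prompt": "A curious raccoon", "n": 1, "size": "1024x1024"}'
```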
### Cloud Storage Support

SGLang diffusion supports automatically uploading generated images and videos to S3-compatible cloud storage (e.g., AWS S3, MinIO, Alibaba Cloud OSS, Tencent Cloud COS).

When enabled, the server follows a **Generate -> Upload -> Delete** workflow:

1. The artifact is generated to a temporary local file.
2. The file is immediately uploaded to the configured S3 bucket in a background thread.
3. Upon successful upload, the local file is deleted.
4. The API response returns the public URL of the uploaded object.

**Configuration**

Cloud storage is enabled via environment variables. Note that `boto3` must be installed separately (`pip install boto3`) to use this feature.

```bash
# Enable S3 storage
export SGLANG_CLOUD_STORAGE_TYPE=s3
export SGLANG_S3_BUCKET_NAME=my-bucket
export SGLANG_S3_ACCESS_KEY_ID=your-access-key
export SGLANG_S3_SECRET_ACCESS_KEY=your-secret-key

# Optional: Custom endpoint for MinIO/OSS/COS
export SGLANG_S3_ENDPOINT_URL=https://minio.example.com
```

See [Environment Variables Documentation](../environment_variables.md) for more details.

## Generate

Run a one-off generation task without launching a persistent server. To use it, pass both server arguments and sampling parameters in one command, after the `generate` subcommand, for example:

```bash
SERVER_ARGS=(
  --model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers
  --text-encoder-cpu-offload
  --pin-cpu-memory
  --num-gpus 4
  --ulysses-degree=2
  --ring-degree=2
)
SAMPLING_ARGS=(
  --prompt "A curious raccoon"
  --save-output
  --output-path outputs
  --output-file-name "A curious raccoon.mp4"
)
sglang generate "${SERVER_ARGS[@]}" "${SAMPLING_ARGS[@]}"

# Or, set SGLANG_CACHE_DIT_ENABLED=true to enable cache acceleration
SGLANG_CACHE_DIT_ENABLED=true sglang generate "${SERVER_ARGS[@]}" "${SAMPLING_ARGS[@]}"
```

Once the generation task has finished, the server shuts down automatically.

> [!NOTE]
> The HTTP server-related arguments are ignored in this subcommand.

## Component Path Overrides

SGLang diffusion allows you to override any pipeline component (e.g., `vae`, `transformer`, `text_encoder`) by specifying a custom checkpoint path. This is useful for swapping in alternative or fine-tuned components without repackaging the base model.

### Example: FLUX.2-dev with Tiny AutoEncoder

You can override **any** component by using `--<component>-path`, where `<component>` matches the key in the model's `model_index.json`.

For example, replace the default VAE with a distilled tiny autoencoder for ~3x faster decoding:

```bash
# With a Hugging Face repo ID:
sglang serve \
  --model-path black-forest-labs/FLUX.2-dev \
  --vae-path fal/FLUX.2-Tiny-AutoEncoder

# Or with a local path:
sglang serve \
  --model-path black-forest-labs/FLUX.2-dev \
  --vae-path ~/.cache/huggingface/hub/models--fal--FLUX.2-Tiny-AutoEncoder/snapshots/.../vae
```

**Important:**

- The component key must match the one in your model's `model_index.json` (e.g., `vae`).
- The path must:
  - either be a Hugging Face repo ID (e.g., `fal/FLUX.2-Tiny-AutoEncoder`)
  - or point to a **complete component folder**, containing `config.json` and safetensors files
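The same pattern extends to every other component key. As a hypothetical sketch, a fine-tuned transformer could be swapped in the same way (the override path below is a placeholder):

```bash
# Hypothetical: any model_index.json key maps to a --<component>-path flag,
# so a fine-tuned DiT can be substituted just like the VAE above.
sglang serve \
  --model-path black-forest-labs/FLUX.2-dev \
  --transformer-path /path/to/finetuned/transformer
```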
## Diffusers Backend

SGLang diffusion supports a **diffusers backend** that allows you to run any diffusers-compatible model through SGLang's infrastructure using vanilla diffusers pipelines. This is useful for running models without native SGLang implementations or models with custom pipeline classes.

### Arguments

| Argument | Values | Description |
|----------|--------|-------------|
| `--backend` | `auto` (default), `sglang`, `diffusers` | `auto`: prefer native SGLang, fall back to diffusers. `sglang`: force native (fails if unavailable). `diffusers`: force vanilla diffusers pipeline. |
| `--diffusers-attention-backend` | `flash`, `_flash_3_hub`, `sage`, `xformers`, `native` | Attention backend for diffusers pipelines. See [diffusers attention backends](https://huggingface.co/docs/diffusers/main/en/optimization/attention_backends). |
| `--trust-remote-code` | flag | Required for models with custom pipeline classes (e.g., Ovis). |
| `--vae-tiling` | flag | Enable VAE tiling for large image support (decodes tile-by-tile). |
| `--vae-slicing` | flag | Enable VAE slicing for lower memory usage (decodes slice-by-slice). |
| `--dit-precision` | `fp16`, `bf16`, `fp32` | Precision for the diffusion transformer. |
| `--vae-precision` | `fp16`, `bf16`, `fp32` | Precision for the VAE. |
| `--enable-torch-compile` | flag | Enable `torch.compile` for diffusers pipelines. |
| `--cache-dit-config` | `{PATH}` | Path to a Cache-DiT YAML/JSON config file for accelerating diffusers pipelines with Cache-DiT. |

### Example: Running Ovis-Image-7B

[Ovis-Image-7B](https://huggingface.co/AIDC-AI/Ovis-Image-7B) is a 7B text-to-image model optimized for high-quality text rendering.

```bash
sglang generate \
  --model-path AIDC-AI/Ovis-Image-7B \
  --backend diffusers \
  --trust-remote-code \
  --diffusers-attention-backend flash \
  --prompt "A serene Japanese garden with cherry blossoms" \
  --height 1024 \
  --width 1024 \
  --num-inference-steps 30 \
  --save-output \
  --output-path outputs \
  --output-file-name ovis_garden.png
```

### Extra Diffusers Arguments

For pipeline-specific parameters not exposed via the CLI, use `diffusers_kwargs` in a config file:

```json
{
  "model_path": "AIDC-AI/Ovis-Image-7B",
  "backend": "diffusers",
  "prompt": "A beautiful landscape",
  "diffusers_kwargs": {
    "cross_attention_kwargs": {"scale": 0.5}
  }
}
```

```bash
sglang generate --config config.json
```

### Cache-DiT Acceleration

With the diffusers backend, you can also leverage Cache-DiT acceleration by loading a custom cache config from a YAML file to boost the performance of diffusers pipelines. See the [Cache-DiT Acceleration](https://docs.sglang.io/diffusion/performance/cache/cache_dit.html) documentation for details.
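For instance, a Cache-DiT config could be attached to the Ovis example above as sketched below. The `--cache-dit-config` flag is documented in the table above; the config file path is a placeholder, and the config schema is described in the linked documentation.

```bash
# Placeholder config path; see the Cache-DiT docs for the config schema.
sglang generate \
  --model-path AIDC-AI/Ovis-Image-7B \
  --backend diffusers \
  --trust-remote-code \
  --cache-dit-config ./cache_dit.yaml \
  --prompt "A serene Japanese garden with cherry blossoms" \
  --save-output
```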