# SGLang Diffusion CLI Inference

The SGLang Diffusion CLI provides a quick way to access the inference pipeline for image and video generation.

## Prerequisites

- A working SGLang Diffusion installation and the `sglang` CLI available in `$PATH`.

## Supported Arguments

### Server Arguments

- `--model-path {MODEL_PATH}`: Path to the model or model ID
- `--lora-path {LORA_PATH}`: Path to a LoRA adapter (local path or HuggingFace model ID). If not specified, LoRA is not applied.
- `--lora-nickname {NAME}`: Nickname for the LoRA adapter (default: `default`).
- `--num-gpus {NUM_GPUS}`: Number of GPUs to use
- `--tp-size {TP_SIZE}`: Tensor parallelism size (applies only to the text encoder; keep it at 1 when text-encoder offload is enabled, since layer-wise offload plus prefetch is faster)
- `--sp-degree {SP_SIZE}`: Sequence parallelism size (typically equal to the number of GPUs)
- `--ulysses-degree {ULYSSES_DEGREE}`: Degree of DeepSpeed-Ulysses-style sequence parallelism in USP
- `--ring-degree {RING_DEGREE}`: Degree of ring-attention-style sequence parallelism in USP
- `--attention-backend {BACKEND}`: Attention backend to use. For SGLang-native pipelines use `fa`, `torch_sdpa`, `sage_attn`, etc.; for diffusers pipelines use diffusers backend names such as `flash`, `_flash_3_hub`, `sage`, or `xformers`.
- `--attention-backend-config {CONFIG}`: Configuration for the attention backend. Can be a JSON string (e.g., `'{"k": "v"}'`), a path to a JSON/YAML file, or comma-separated key=value pairs (e.g., `k=v,k2=v2`).
- `--cache-dit-config {PATH}`: Path to a Cache-DiT YAML/JSON config (diffusers backend only)
- `--dit-precision {DTYPE}`: Precision for the DiT model (currently supports `fp32`, `fp16`, and `bf16`).

### Sampling Parameters

- `--prompt {PROMPT}`: Text description of the image or video you want to generate
- `--num-inference-steps {STEPS}`: Number of denoising steps
- `--negative-prompt {PROMPT}`: Negative prompt to steer generation away from unwanted concepts
- `--seed {SEED}`: Random seed for reproducible generation

**Image/Video Configuration**

- `--height {HEIGHT}`: Height of the generated output
- `--width {WIDTH}`: Width of the generated output
- `--num-frames {NUM_FRAMES}`: Number of frames to generate
- `--fps {FPS}`: Frames per second of the saved output, for video-generation tasks

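For instance, a typical text-to-image call combining these sampling and size options might look like the following (the model ID and parameter values are illustrative, not recommendations):

```bash
sglang generate \
  --model-path black-forest-labs/FLUX.2-dev \
  --prompt "A watercolor painting of a lighthouse at dawn" \
  --negative-prompt "blurry, low quality" \
  --height 1024 \
  --width 1024 \
  --num-inference-steps 28 \
  --seed 42 \
  --save-output
```
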
**Frame Interpolation** (video only)

Frame interpolation is a post-processing step that synthesizes new frames
between each pair of consecutive generated frames, producing smoother
motion without re-running the diffusion model. The `--frame-interpolation-exp`
flag controls how many rounds of interpolation to apply: each round inserts one
new frame into every gap between adjacent frames, so the output frame count
follows the formula **(N − 1) × 2^exp + 1** (e.g. 5 original frames with
`exp=1` → 4 gaps × 1 new frame + 5 originals = **9** frames; with `exp=2` →
**17** frames).

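The frame-count arithmetic is easy to sanity-check with a few lines of Python:

```python
def interpolated_frame_count(n: int, exp: int) -> int:
    """Total frames after `exp` rounds of interpolation: (N - 1) * 2**exp + 1."""
    return (n - 1) * 2**exp + 1

assert interpolated_frame_count(5, 1) == 9   # 4 gaps, 1 new frame per gap
assert interpolated_frame_count(5, 2) == 17  # the number of gaps doubles each round
```
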
- `--enable-frame-interpolation`: Enable frame interpolation. Model weights are downloaded automatically on first use.
- `--frame-interpolation-exp {EXP}`: Interpolation exponent: `1` doubles the temporal resolution, `2` quadruples it, and so on (default: `1`)
- `--frame-interpolation-scale {SCALE}`: RIFE inference scale; use `0.5` for high-resolution inputs to save memory (default: `1.0`)
- `--frame-interpolation-model-path {PATH}`: Local directory or HuggingFace repo ID containing RIFE `flownet.pkl` weights (default: `elfgum/RIFE-4.22.lite`, downloaded automatically)

Example: generate a 5-frame video and interpolate to 9 frames ((5 − 1) × 2¹ + 1 = 9):

```bash
sglang generate \
  --model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers \
  --prompt "A dog running through a park" \
  --num-frames 5 \
  --enable-frame-interpolation \
  --frame-interpolation-exp 1 \
  --save-output
```

**Output Options**

- `--output-path {PATH}`: Directory in which to save the generated image/video
- `--save-output`: Whether to save the image/video to disk
- `--return-frames`: Whether to return the raw frames

### Using Configuration Files

Instead of specifying all parameters on the command line, you can use a configuration file:

```bash
sglang generate --config {CONFIG_FILE_PATH}
```

The configuration file should be in JSON or YAML format and use the same parameter names as the CLI options. Command-line arguments take precedence over settings in the configuration file, allowing you to override specific values while keeping the rest from the configuration file.

Example configuration file (config.json):

```json
{
  "model_path": "FastVideo/FastHunyuan-diffusers",
  "prompt": "A beautiful woman in a red dress walking down a street",
  "output_path": "outputs/",
  "num_gpus": 2,
  "sp_size": 2,
  "tp_size": 1,
  "num_frames": 45,
  "height": 720,
  "width": 1280,
  "num_inference_steps": 6,
  "seed": 1024,
  "fps": 24,
  "precision": "bf16",
  "vae_precision": "fp16",
  "vae_tiling": true,
  "vae_sp": true,
  "vae_config": {
    "load_encoder": false,
    "load_decoder": true,
    "tile_sample_min_height": 256,
    "tile_sample_min_width": 256
  },
  "text_encoder_precisions": [
    "fp16",
    "fp16"
  ],
  "mask_strategy_file_path": null,
  "enable_torch_compile": false
}
```

Or using YAML format (config.yaml):

```yaml
model_path: "FastVideo/FastHunyuan-diffusers"
prompt: "A beautiful woman in a red dress walking down a street"
output_path: "outputs/"
num_gpus: 2
sp_size: 2
tp_size: 1
num_frames: 45
height: 720
width: 1280
num_inference_steps: 6
seed: 1024
fps: 24
precision: "bf16"
vae_precision: "fp16"
vae_tiling: true
vae_sp: true
vae_config:
  load_encoder: false
  load_decoder: true
  tile_sample_min_height: 256
  tile_sample_min_width: 256
text_encoder_precisions:
  - "fp16"
  - "fp16"
mask_strategy_file_path: null
enable_torch_compile: false
```

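Command-line overrides work with either format. For example, the following reuses config.json from above but overrides the seed and the number of denoising steps for a single run:

```bash
sglang generate --config config.json --seed 2048 --num-inference-steps 8
```
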
To see all the options, you can use the `--help` flag:

```bash
sglang generate --help
```

## Serve

Launch the SGLang Diffusion HTTP server and interact with it using the OpenAI SDK or curl.

### Start the server

Use the following command to launch the server:

```bash
SERVER_ARGS=(
  --model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers
  --text-encoder-cpu-offload
  --pin-cpu-memory
  --num-gpus 4
  --ulysses-degree=2
  --ring-degree=2
)

sglang serve "${SERVER_ARGS[@]}"
```

- **--model-path**: Which model to load. The example uses `Wan-AI/Wan2.1-T2V-1.3B-Diffusers`.
- **--port**: HTTP port to listen on (default: `30010`).

For detailed API usage, including image and video generation and LoRA management, please refer to the [OpenAI API Documentation](openai_api.md).

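As a quick smoke test once the server is up, the OpenAI Python SDK can be pointed at it. This is a minimal sketch that assumes the default port `30010` and the OpenAI-compatible image generation route; see the linked API documentation for the authoritative request parameters:

```python
from openai import OpenAI

# The API key may be unused by a local server, but the SDK requires one.
client = OpenAI(base_url="http://localhost:30010/v1", api_key="sk-local")

image = client.images.generate(
    model="Wan-AI/Wan2.1-T2V-1.3B-Diffusers",  # the model loaded by `sglang serve`
    prompt="A curious raccoon",
    size="1024x1024",  # illustrative; supported sizes depend on the model
)
print(image.data[0].url)
```
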
### Cloud Storage Support

SGLang Diffusion supports automatically uploading generated images and videos to S3-compatible cloud storage (e.g., AWS S3, MinIO, Alibaba Cloud OSS, Tencent Cloud COS).

When enabled, the server follows a **Generate -> Upload -> Delete** workflow:

1. The artifact is generated to a temporary local file.
2. The file is immediately uploaded to the configured S3 bucket in a background thread.
3. Upon successful upload, the local file is deleted.
4. The API response returns the public URL of the uploaded object.

**Configuration**

Cloud storage is enabled via environment variables. Note that `boto3` must be installed separately (`pip install boto3`) to use this feature.

```bash
# Enable S3 storage
export SGLANG_CLOUD_STORAGE_TYPE=s3
export SGLANG_S3_BUCKET_NAME=my-bucket
export SGLANG_S3_ACCESS_KEY_ID=your-access-key
export SGLANG_S3_SECRET_ACCESS_KEY=your-secret-key

# Optional: custom endpoint for MinIO/OSS/COS
export SGLANG_S3_ENDPOINT_URL=https://minio.example.com
```

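For reference, the upload step behaves roughly like this illustrative boto3 sketch (the function name is hypothetical; the server's actual implementation may differ):

```python
import os

import boto3  # not installed by default: pip install boto3


def upload_and_cleanup(local_path: str, object_key: str) -> str:
    """Upload a generated artifact to S3-compatible storage, then delete the local copy."""
    s3 = boto3.client(
        "s3",
        endpoint_url=os.environ.get("SGLANG_S3_ENDPOINT_URL"),  # None -> AWS S3
        aws_access_key_id=os.environ["SGLANG_S3_ACCESS_KEY_ID"],
        aws_secret_access_key=os.environ["SGLANG_S3_SECRET_ACCESS_KEY"],
    )
    bucket = os.environ["SGLANG_S3_BUCKET_NAME"]
    s3.upload_file(local_path, bucket, object_key)  # step 2: upload
    os.remove(local_path)                           # step 3: delete the local file
    # Step 4: the public URL format depends on the provider; path-style shown here.
    endpoint = os.environ.get("SGLANG_S3_ENDPOINT_URL", "https://s3.amazonaws.com")
    return f"{endpoint}/{bucket}/{object_key}"
```
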
See [Environment Variables Documentation](../environment_variables.md) for more details.

## Generate

Run a one-off generation task without launching a persistent server.

To use it, pass both server arguments and sampling parameters in one command after the `generate` subcommand, for example:

```bash
SERVER_ARGS=(
  --model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers
  --text-encoder-cpu-offload
  --pin-cpu-memory
  --num-gpus 4
  --ulysses-degree=2
  --ring-degree=2
)

SAMPLING_ARGS=(
  --prompt "A curious raccoon"
  --save-output
  --output-path outputs
  --output-file-name "A curious raccoon.mp4"
)

sglang generate "${SERVER_ARGS[@]}" "${SAMPLING_ARGS[@]}"

# Alternatively, set the SGLANG_CACHE_DIT_ENABLED environment variable to
# true to enable cache acceleration
SGLANG_CACHE_DIT_ENABLED=true sglang generate "${SERVER_ARGS[@]}" "${SAMPLING_ARGS[@]}"
```

Once the generation task has finished, the server shuts down automatically.

> [!NOTE]
> The HTTP server-related arguments are ignored by this subcommand.

## Component Path Overrides

SGLang Diffusion allows you to override any pipeline component (e.g., `vae`, `transformer`, `text_encoder`) by specifying a custom checkpoint path. This is useful for swapping in distilled, fine-tuned, or otherwise customized components, such as the tiny autoencoder in the example below.

### Example: FLUX.2-dev with Tiny AutoEncoder

You can override **any** component by using `--<component>-path`, where `<component>` matches the key in the model's `model_index.json`.

For example, replace the default VAE with a distilled tiny autoencoder for ~3x faster decoding:

```bash
# With a HuggingFace repo ID
sglang serve \
  --model-path=black-forest-labs/FLUX.2-dev \
  --vae-path=fal/FLUX.2-Tiny-AutoEncoder

# Or with a local path
sglang serve \
  --model-path=black-forest-labs/FLUX.2-dev \
  --vae-path=~/.cache/huggingface/hub/models--fal--FLUX.2-Tiny-AutoEncoder/snapshots/.../vae
```

**Important:**

- The component key must match the one in your model's `model_index.json` (e.g., `vae`).
- The path must either:
  - be a HuggingFace repo ID (e.g., `fal/FLUX.2-Tiny-AutoEncoder`), or
  - point to a **complete component folder** containing `config.json` and safetensors files.

## Diffusers Backend

SGLang Diffusion supports a **diffusers backend** that allows you to run any diffusers-compatible model through SGLang's infrastructure using vanilla diffusers pipelines. This is useful for running models without native SGLang implementations or models with custom pipeline classes.

### Arguments

| Argument | Values | Description |
|----------|--------|-------------|
| `--backend` | `auto` (default), `sglang`, `diffusers` | `auto`: prefer native SGLang, fall back to diffusers. `sglang`: force native (fails if unavailable). `diffusers`: force the vanilla diffusers pipeline. |
| `--diffusers-attention-backend` | `flash`, `_flash_3_hub`, `sage`, `xformers`, `native` | Attention backend for diffusers pipelines. See [diffusers attention backends](https://huggingface.co/docs/diffusers/main/en/optimization/attention_backends). |
| `--trust-remote-code` | flag | Required for models with custom pipeline classes (e.g., Ovis). |
| `--vae-tiling` | flag | Enable VAE tiling for large image support (decodes tile-by-tile). |
| `--vae-slicing` | flag | Enable VAE slicing for lower memory usage (decodes slice-by-slice). |
| `--dit-precision` | `fp16`, `bf16`, `fp32` | Precision for the diffusion transformer. |
| `--vae-precision` | `fp16`, `bf16`, `fp32` | Precision for the VAE. |
| `--enable-torch-compile` | flag | Enable `torch.compile` for diffusers pipelines. |
| `--cache-dit-config` | `{PATH}` | Path to a Cache-DiT YAML/JSON config file for accelerating diffusers pipelines with Cache-DiT. |

### Example: Running Ovis-Image-7B

[Ovis-Image-7B](https://huggingface.co/AIDC-AI/Ovis-Image-7B) is a 7B text-to-image model optimized for high-quality text rendering.

```bash
sglang generate \
  --model-path AIDC-AI/Ovis-Image-7B \
  --backend diffusers \
  --trust-remote-code \
  --diffusers-attention-backend flash \
  --prompt "A serene Japanese garden with cherry blossoms" \
  --height 1024 \
  --width 1024 \
  --num-inference-steps 30 \
  --save-output \
  --output-path outputs \
  --output-file-name ovis_garden.png
```

### Extra Diffusers Arguments

For pipeline-specific parameters not exposed via the CLI, use `diffusers_kwargs` in a config file:

```json
{
  "model_path": "AIDC-AI/Ovis-Image-7B",
  "backend": "diffusers",
  "prompt": "A beautiful landscape",
  "diffusers_kwargs": {
    "cross_attention_kwargs": {"scale": 0.5}
  }
}
```

```bash
sglang generate --config config.json
```

### Cache-DiT Acceleration

With the diffusers backend, you can also leverage Cache-DiT acceleration and load custom cache configs from a YAML file to boost the performance of diffusers pipelines. See the [Cache-DiT Acceleration](https://docs.sglang.io/diffusion/performance/cache/cache_dit.html) documentation for details.

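For example, assuming a Cache-DiT config saved as `cache_config.yaml` (its schema is defined by Cache-DiT; see the linked documentation), the `--cache-dit-config` flag from the arguments table above wires it in:

```bash
sglang generate \
  --model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers \
  --backend diffusers \
  --cache-dit-config cache_config.yaml \
  --prompt "A curious raccoon" \
  --save-output
```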
|