# Qwen3-VL Usage

[Qwen3-VL](https://huggingface.co/collections/Qwen/qwen3-vl) is Alibaba's latest multimodal large language model with strong text, vision, and reasoning capabilities. SGLang supports the Qwen3-VL family of models with both image and video inputs.

## Launch commands for SGLang

Below are suggested launch commands tailored to different hardware / precision modes.

### FP8 (quantized) mode

For memory-efficient, latency-optimized deployments (e.g., on H100 or H200) where the FP8 checkpoint is supported:

```bash
python3 -m sglang.launch_server \
  --model-path Qwen/Qwen3-VL-235B-A22B-Instruct-FP8 \
  --tp 8 \
  --ep 8 \
  --host 0.0.0.0 \
  --port 30000 \
  --keep-mm-feature-on-device
```

### Non-FP8 (BF16 / full precision) mode

For deployments on A100/H100 using BF16 (or where the FP8 checkpoint is not used):

```bash
python3 -m sglang.launch_server \
  --model-path Qwen/Qwen3-VL-235B-A22B-Instruct \
  --tp 8 \
  --ep 8 \
  --host 0.0.0.0 \
  --port 30000
```

## Hardware-specific notes / recommendations

- On H100 with FP8: use the FP8 checkpoint for the best memory efficiency.
- On A100/H100 with BF16 (non-FP8): it is recommended to set `--mm-max-concurrent-calls` to control parallel throughput and GPU memory usage during image/video inference.
- On H200 and B200: the model runs out of the box, supporting the full context length plus concurrent image and video processing.

## Sending Image/Video Requests

### Image input

```python
import requests

url = "http://localhost:30000/v1/chat/completions"

data = {
    "model": "Qwen/Qwen3-VL-30B-A3B-Instruct",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://github.com/sgl-project/sglang/blob/main/examples/assets/example_image.png?raw=true"
                    },
                },
            ],
        }
    ],
    "max_tokens": 300,
}

response = requests.post(url, json=data)
print(response.text)
```

### Video input

```python
import requests

url = "http://localhost:30000/v1/chat/completions"

data = {
    "model": "Qwen/Qwen3-VL-30B-A3B-Instruct",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's happening in this video?"},
                {
                    "type": "video_url",
                    "video_url": {
                        "url": "https://github.com/sgl-project/sgl-test-files/raw/refs/heads/main/videos/jobs_presenting_ipod.mp4"
                    },
                },
            ],
        }
    ],
    "max_tokens": 300,
}

response = requests.post(url, json=data)
print(response.text)
```

## Important Server Parameters and Flags

When launching the model server for **multimodal support**, you can use the following command-line arguments and environment variables to fine-tune performance and behavior:

- `--mm-attention-backend`: Specifies the multimodal attention backend, e.g. `fa3` (FlashAttention 3).
- `--mm-max-concurrent-calls`: Specifies the **maximum number of concurrent asynchronous multimodal data processing calls** allowed on the server. Use this to control parallel throughput and GPU memory usage during image/video inference (a client-side sketch follows this list).
- `--mm-per-request-timeout`: Defines the **timeout duration (in seconds)** for each multimodal request. If a request exceeds this time limit (e.g., for very large video inputs), it is automatically terminated.
- `--keep-mm-feature-on-device`: Instructs the server to **retain multimodal feature tensors on the GPU** after processing. This avoids device-to-host (D2H) memory copies and improves performance for repeated or high-frequency inference workloads.
- `SGLANG_USE_CUDA_IPC_TRANSPORT=1` (environment variable): Enables a shared-memory-pool-based CUDA IPC transport for multimodal data, which can significantly improve end-to-end latency.
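To see what `--mm-max-concurrent-calls` governs from the client side, the sketch below fires several image requests in parallel at a running server; the flag then caps how many multimodal processing calls the server executes at once. This is a minimal sketch: the endpoint, model name, and image URL are reused from the examples above, the worker count and prompt are arbitrary illustrations, and the response is assumed to follow the OpenAI chat-completions schema.

```python
import concurrent.futures

import requests

URL = "http://localhost:30000/v1/chat/completions"


def ask_about_image(i: int) -> str:
    # One image per request; the server processes at most
    # --mm-max-concurrent-calls multimodal payloads at a time.
    data = {
        "model": "Qwen/Qwen3-VL-30B-A3B-Instruct",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": f"Request {i}: describe this image briefly."},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://github.com/sgl-project/sglang/blob/main/examples/assets/example_image.png?raw=true"
                        },
                    },
                ],
            }
        ],
        "max_tokens": 100,
    }
    response = requests.post(URL, json=data, timeout=120)
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]


# Submit 8 requests at once; regardless of client-side parallelism,
# the server-side flag bounds concurrent multimodal processing.
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    for answer in pool.map(ask_about_image, range(8)):
        print(answer)
```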
### Example usage with the above optimizations

```bash
SGLANG_USE_CUDA_IPC_TRANSPORT=1 \
SGLANG_VLM_CACHE_SIZE_MB=0 \
python -m sglang.launch_server \
  --model-path Qwen/Qwen3-VL-235B-A22B-Instruct \
  --host 0.0.0.0 \
  --port 30000 \
  --trust-remote-code \
  --tp-size 8 \
  --enable-cache-report \
  --log-level info \
  --max-running-requests 64 \
  --mem-fraction-static 0.65 \
  --chunked-prefill-size 8192 \
  --attention-backend fa3 \
  --mm-attention-backend fa3 \
  --enable-metrics
```
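Once the server is up, a quick plain-text request is a cheap way to confirm it is serving before sending heavier image or video payloads. This is a minimal sketch; the model name must match whatever was passed to `--model-path`.

```python
import requests

# Plain-text sanity check against the OpenAI-compatible endpoint.
url = "http://localhost:30000/v1/chat/completions"
data = {
    "model": "Qwen/Qwen3-VL-235B-A22B-Instruct",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 16,
}

response = requests.post(url, json=data)
print(response.status_code)
print(response.json()["choices"][0]["message"]["content"])
```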