# Supported Features on Ascend NPU
This section lists the basic functions and features supported on the Ascend NPU. If you encounter issues or have any
questions, please [open an issue](https://github.com/sgl-project/sglang/issues).
For the meaning and usage of each parameter,
see [Server Arguments](https://docs.sglang.io/advanced_features/server_arguments.html).
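
As a rough illustration of how the flags in the tables below combine on a command line, the following sketch turns keyword options into an argument list. The `build_launch_command` helper is hypothetical, the `python -m sglang.launch_server` entry point is assumed rather than taken from this page, and the model path and numeric values are placeholders. The sketch only builds and prints the command; it does not start a server.

```python
import shlex


def build_launch_command(model_path, **options):
    """Build an argv list for a launch command (hypothetical helper)."""
    # Assumed entry point; adjust if your installation launches differently.
    argv = ["python", "-m", "sglang.launch_server", "--model-path", model_path]
    for name, value in options.items():
        flag = "--" + name.replace("_", "-")
        if value is True:
            # "bool flag (set to enable)": the flag's presence enables it.
            argv.append(flag)
        elif value is False or value is None:
            # Leave the server default untouched.
            continue
        elif isinstance(value, (list, tuple)):
            # List-valued options are passed as space-separated values.
            argv.append(flag)
            argv.extend(str(v) for v in value)
        else:
            argv.extend([flag, str(value)])
    return argv


command = build_launch_command(
    "/path/to/model",            # placeholder path
    attention_backend="ascend",  # option listed under Kernel Backends
    tp_size=2,                   # illustrative value
    mem_fraction_static=0.8,     # illustrative value
    trust_remote_code=True,
    enable_metrics=False,        # False means the flag is omitted
)
print(shlex.join(command))
```

Keeping the command as a list until the final `shlex.join` avoids quoting mistakes when a value contains spaces or JSON.
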
## Model and tokenizer
| Argument | Defaults | Options | Server supported |
|----------------------------------------|----------|---------------------------------------|:----------------:|
| `--model-path`, `--model` | `None` | Type: str | A2, A3 |
| `--tokenizer-path` | `None` | Type: str | A2, A3 |
| `--tokenizer-mode` | `auto` | `auto`, `slow` | A2, A3 |
| `--tokenizer-worker-num` | `1` | Type: int | A2, A3 |
| `--skip-tokenizer-init` | `False` | bool flag (set to enable) | A2, A3 |
| `--load-format` | `auto` | `auto`, `safetensors` | A2, A3 |
| `--model-loader-extra-config` | `{}` | Type: str | A2, A3 |
| `--trust-remote-code` | `False` | bool flag (set to enable) | A2, A3 |
| `--context-length` | `None` | Type: int | A2, A3 |
| `--is-embedding` | `False` | bool flag (set to enable) | A2, A3 |
| `--enable-multimodal` | `None` | bool flag (set to enable) | A2, A3 |
| `--revision` | `None` | Type: str | A2, A3 |
| `--model-impl` | `auto` | `auto`, `sglang`, `transformers` | A2, A3 |
## HTTP server
| Argument | Defaults | Options | Server supported |
|------------------------|-------------|---------------------------|:----------------:|
| `--host` | `127.0.0.1` | Type: str | A2, A3 |
| `--port` | `30000` | Type: int | A2, A3 |
| `--skip-server-warmup` | `False` | bool flag (set to enable) | A2, A3 |
| `--warmups` | `None` | Type: str | A2, A3 |
| `--nccl-port` | `None` | Type: int | A2, A3 |
| `--fastapi-root-path` | `None` | Type: str | A2, A3 |
| `--grpc-mode` | `False` | bool flag (set to enable) | A2, A3 |
## Quantization and data type
| Argument | Defaults | Options | Server supported |
|---------------------------------------------|----------|-----------------------------------------|:----------------:|
| `--dtype` | `auto` | `auto`, `float16`, `bfloat16` | A2, A3 |
| `--quantization` | `None` | `modelslim` | A2, A3 |
| `--quantization-param-path` | `None` | Type: str | Special for GPU |
| `--kv-cache-dtype` | `auto` | `auto` | A2, A3 |
| `--enable-fp32-lm-head` | `False` | bool flag (set to enable) | A2, A3 |
| `--modelopt-quant` | `None` | Type: str | Special for GPU |
| `--modelopt-checkpoint-restore-path` | `None` | Type: str | Special for GPU |
| `--modelopt-checkpoint-save-path` | `None` | Type: str | Special for GPU |
| `--modelopt-export-path` | `None` | Type: str | Special for GPU |
| `--quantize-and-serve` | `False` | bool flag (set to enable) | Special for GPU |
| `--rl-quant-profile` | `None` | Type: str | Special for GPU |
## Memory and scheduling
| Argument | Defaults | Options | Server supported |
|-----------------------------------------------------|----------|--------------------------------|:----------------:|
| `--mem-fraction-static` | `None` | Type: float | A2, A3 |
| `--max-running-requests` | `None` | Type: int | A2, A3 |
| `--prefill-max-requests` | `None` | Type: int | A2, A3 |
| `--max-queued-requests` | `None` | Type: int | A2, A3 |
| `--max-total-tokens` | `None` | Type: int | A2, A3 |
| `--chunked-prefill-size` | `None` | Type: int | A2, A3 |
| `--max-prefill-tokens` | `16384` | Type: int | A2, A3 |
| `--schedule-policy` | `fcfs` | `lpm`, `fcfs` | A2, A3 |
| `--enable-priority-scheduling` | `False` | bool flag (set to enable) | A2, A3 |
| `--schedule-low-priority-values-first` | `False` | bool flag (set to enable) | A2, A3 |
| `--priority-scheduling-preemption-threshold` | `10` | Type: int | A2, A3 |
| `--schedule-conservativeness` | `1.0` | Type: float | A2, A3 |
| `--page-size` | `128` | Type: int | A2, A3 |
| `--swa-full-tokens-ratio` | `0.8` | Type: float | A2, A3 |
| `--disable-hybrid-swa-memory` | `False` | bool flag (set to enable) | A2, A3 |
| `--abort-on-priority-when-disabled` | `False` | bool flag (set to enable) | A2, A3 |
| `--enable-dynamic-chunking` | `False` | bool flag (set to enable) | A2, A3 |
## Runtime options
| Argument | Defaults | Options | Server supported |
|----------------------------------------------------|----------|---------------------------|:----------------:|
| `--device` | `None` | Type: str | A2, A3 |
| `--tensor-parallel-size`, `--tp-size` | `1` | Type: int | A2, A3 |
| `--pipeline-parallel-size`, `--pp-size` | `1` | Type: int | A2, A3 |
| `--pp-max-micro-batch-size` | `None` | Type: int | A2, A3 |
| `--pp-async-batch-depth` | `None` | Type: int | A2, A3 |
| `--stream-interval` | `1` | Type: int | A2, A3 |
| `--stream-output` | `False` | bool flag (set to enable) | A2, A3 |
| `--random-seed` | `None` | Type: int | A2, A3 |
| `--constrained-json-whitespace-pattern` | `None` | Type: str | A2, A3 |
| `--constrained-json-disable-any-whitespace` | `False` | bool flag (set to enable) | A2, A3 |
| `--watchdog-timeout` | `300` | Type: float | A2, A3 |
| `--soft-watchdog-timeout` | `300` | Type: float | A2, A3 |
| `--dist-timeout` | `None` | Type: int | A2, A3 |
| `--base-gpu-id` | `0` | Type: int | A2, A3 |
| `--gpu-id-step` | `1` | Type: int | A2, A3 |
| `--sleep-on-idle` | `False` | bool flag (set to enable) | A2, A3 |
| `--custom-sigquit-handler` | `None` | Optional[Callable] | A2, A3 |
## Logging
| Argument | Defaults | Options | Server supported |
|----------------------------------------------------|-------------------|--------------------------------|:----------------:|
| `--log-level` | `info` | Type: str | A2, A3 |
| `--log-level-http` | `None` | Type: str | A2, A3 |
| `--log-requests` | `False` | bool flag (set to enable) | A2, A3 |
| `--log-requests-level` | `2` | `0`, `1`, `2`, `3` | A2, A3 |
| `--log-requests-format` | `text` | `text`, `json` | A2, A3 |
| `--crash-dump-folder` | `None` | Type: str | A2, A3 |
| `--enable-metrics` | `False` | bool flag (set to enable) | A2, A3 |
| `--enable-metrics-for-all-schedulers` | `False` | bool flag (set to enable) | A2, A3 |
| `--tokenizer-metrics-custom-labels-header` | `x-custom-labels` | Type: str | A2, A3 |
| `--tokenizer-metrics-allowed-custom-labels` | `None` | List[str] | A2, A3 |
| `--bucket-time-to-first-token` | `None` | List[float] | A2, A3 |
| `--bucket-inter-token-latency` | `None` | List[float] | A2, A3 |
| `--bucket-e2e-request-latency` | `None` | List[float] | A2, A3 |
| `--collect-tokens-histogram` | `False` | bool flag (set to enable) | A2, A3 |
| `--prompt-tokens-buckets` | `None` | List[str] | A2, A3 |
| `--generation-tokens-buckets` | `None` | List[str] | A2, A3 |
| `--gc-warning-threshold-secs` | `0.0` | Type: float | A2, A3 |
| `--decode-log-interval` | `40` | Type: int | A2, A3 |
| `--enable-request-time-stats-logging` | `False` | bool flag (set to enable) | A2, A3 |
| `--kv-events-config` | `None` | Type: str | Special for GPU |
| `--enable-trace` | `False` | bool flag (set to enable) | A2, A3 |
| `--oltp-traces-endpoint` | `localhost:4317` | Type: str | A2, A3 |
## RequestMetricsExporter configuration
| Argument | Defaults | Options | Server supported |
|---------------------------------------|----------|--------------------------------|:----------------:|
| `--export-metrics-to-file` | `False` | bool flag (set to enable) | A2, A3 |
| `--export-metrics-to-file-dir` | `None` | Type: str | A2, A3 |
## API related
| Argument | Defaults | Options | Server supported |
|-------------------------|-----------|--------------------------------|:----------------:|
| `--api-key` | `None` | Type: str | A2, A3 |
| `--admin-api-key` | `None` | Type: str | A2, A3 |
| `--served-model-name` | `None` | Type: str | A2, A3 |
| `--weight-version` | `default` | Type: str | A2, A3 |
| `--chat-template` | `None` | Type: str | A2, A3 |
| `--completion-template` | `None` | Type: str | A2, A3 |
| `--enable-cache-report` | `False` | bool flag (set to enable) | A2, A3 |
| `--reasoning-parser` | `None` | `deepseek-r1` | A2, A3 |
| `--tool-call-parser` | `None` | `llama`, `pythonic` | A2, A3 |
| `--sampling-defaults` | `model` | `openai`, `model` | A2, A3 |
## Data parallelism
| Argument | Defaults | Options | Server supported |
|----------------------------------------|---------------|-----------------------------------------------------------|:----------------:|
| `--data-parallel-size`, `--dp-size` | `1` | Type: int | A2, A3 |
| `--load-balance-method` | `round_robin` | `round_robin`, `total_requests`, `total_tokens` | A2, A3 |
| `--prefill-round-robin-balance` | `False` | bool flag (set to enable) | A2, A3 |
## Multi-node distributed serving
| Argument | Defaults | Options | Server supported |
|-------------------------------------------|----------|-----------|:----------------:|
| `--dist-init-addr`, `--nccl-init-addr` | `None` | Type: str | A2, A3 |
| `--nnodes` | `1` | Type: int | A2, A3 |
| `--node-rank` | `0` | Type: int | A2, A3 |
## Model override args
| Argument | Defaults | Options | Server supported |
|--------------------------------------|----------|-----------|:----------------:|
| `--json-model-override-args` | `{}` | Type: str | A2, A3 |
| `--preferred-sampling-params` | `None` | Type: str | A2, A3 |
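
Because `--json-model-override-args` takes its value as a string, one way to avoid shell-quoting mistakes is to serialize a dictionary and quote the result programmatically. The sketch below assumes the value is a JSON object, as the flag name and its `{}` default suggest. The override key shown is purely illustrative and is not taken from this page.

```python
import json
import shlex

# Hypothetical override; the key and value are placeholders for illustration.
overrides = {"max_position_embeddings": 8192}

# Compact JSON keeps the argument short; the content is what matters.
override_arg = json.dumps(overrides, separators=(",", ":"))

# Keep the flag and its value as separate argv entries.
argv = ["--json-model-override-args", override_arg]

# shlex.join adds the quoting a POSIX shell needs around the JSON string.
print(shlex.join(argv))
```

Passing `argv` directly to `subprocess.run` (without a shell) would make the quoting step unnecessary.
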
## LoRA
| Argument | Defaults | Options | Server supported |
|--------------------------|----------|-------------------------------------|:----------------:|
| `--enable-lora` | `False` | bool flag (set to enable) | A2, A3 |
| `--max-lora-rank` | `None` | Type: int | A2, A3 |
| `--lora-target-modules` | `None` | `all` | A2, A3 |
| `--lora-paths` | `None` | Type: List[str] / JSON objects | A2, A3 |
| `--max-loras-per-batch` | `8` | Type: int | A2, A3 |
| `--max-loaded-loras` | `None` | Type: int | A2, A3 |
| `--lora-eviction-policy` | `lru` | `lru`, `fifo` | A2, A3 |
| `--lora-backend` | `triton` | `triton` | A2, A3 |
| `--max-lora-chunk-size` | `16` | `16`, `32`, `64`, `128` | Special for GPU |
## Kernel Backends (Attention, Sampling, Grammar, GEMM)
| Argument | Defaults | Options | Server supported |
|----------------------------------------|-------------------|------------------------------------------------------------------------------------------------|:----------------:|
| `--attention-backend` | `None` | `ascend` | A2, A3 |
| `--prefill-attention-backend` | `None` | `ascend` | A2, A3 |
| `--decode-attention-backend` | `None` | `ascend` | A2, A3 |
| `--sampling-backend` | `None` | `pytorch`, `ascend` | A2, A3 |
| `--grammar-backend` | `None` | `xgrammar` | A2, A3 |
| `--mm-attention-backend` | `None` | `ascend_attn` | A2, A3 |
| `--nsa-prefill-backend` | `flashmla_sparse` | `flashmla_sparse`, `flashmla_decode`, `fa3`, `tilelang`, `aiter` | Special for GPU |
| `--nsa-decode-backend` | `fa3` | `flashmla_prefill`, `flashmla_kv`, `fa3`, `tilelang`, `aiter` | Special for GPU |
| `--fp8-gemm-backend` | `auto` | `auto`, `deep_gemm`, `flashinfer_trtllm`, `flashinfer_cutlass`, `flashinfer_deepgemm`, `cutlass`, `triton`, `aiter` | Special for GPU |
| `--disable-flashinfer-autotune` | `False` | bool flag (set to enable) | Special for GPU |
## Speculative decoding
| Argument | Defaults | Options | Server supported |
|------------------------------------------------------------------|-----------|--------------------------|:----------------:|
| `--speculative-algorithm` | `None` | `EAGLE3`, `NEXTN` | A2, A3 |
| `--speculative-draft-model-path`, `--speculative-draft-model` | `None` | Type: str | A2, A3 |
| `--speculative-draft-model-revision` | `None` | Type: str | A2, A3 |
| `--speculative-draft-load-format` | `None` | `auto` | A2, A3 |
| `--speculative-num-steps` | `None` | Type: int | A2, A3 |
| `--speculative-eagle-topk` | `None` | Type: int | A2, A3 |
| `--speculative-num-draft-tokens` | `None` | Type: int | A2, A3 |
| `--speculative-accept-threshold-single` | `1.0` | Type: float | Special for GPU |
| `--speculative-accept-threshold-acc` | `1.0` | Type: float | Special for GPU |
| `--speculative-token-map` | `None` | Type: str | A2, A3 |
| `--speculative-attention-mode` | `prefill` | `prefill`, `decode` | A2, A3 |
| `--speculative-moe-runner-backend` | `None` | `auto` | A2, A3 |
| `--speculative-moe-a2a-backend` | `None` | `ascend_fuseep` | A2, A3 |
| `--speculative-draft-attention-backend` | `None` | `ascend` | A2, A3 |
| `--speculative-draft-model-quantization` | `None` | `unquant` | A2, A3 |
## Ngram speculative decoding
| Argument | Defaults | Options | Server supported |
|----------------------------------------------------|------------|--------------------|:----------------:|
| `--speculative-ngram-min-match-window-size` | `1` | Type: int | Experimental |
| `--speculative-ngram-max-match-window-size` | `12` | Type: int | Experimental |
| `--speculative-ngram-min-bfs-breadth` | `1` | Type: int | Experimental |
| `--speculative-ngram-max-bfs-breadth` | `10` | Type: int | Experimental |
| `--speculative-ngram-match-type` | `BFS` | `BFS`, `PROB` | Experimental |
| `--speculative-ngram-branch-length` | `18` | Type: int | Experimental |
| `--speculative-ngram-capacity` | `10000000` | Type: int | Experimental |
## Expert parallelism
| Argument | Defaults | Options | Server supported |
|-------------------------------------------------------|-----------|---------------------------------------------|:----------------:|
| `--expert-parallel-size`, `--ep-size`, `--ep` | `1` | Type: int | A2, A3 |
| `--moe-a2a-backend` | `none` | `none`, `deepep`, `ascend_fuseep` | A2, A3 |
| `--moe-runner-backend` | `auto` | `auto`, `triton` | A2, A3 |
| `--flashinfer-mxfp4-moe-precision` | `default` | `default`, `bf16` | Special for GPU |
| `--enable-flashinfer-allreduce-fusion` | `False` | bool flag (set to enable) | Special for GPU |
| `--deepep-mode` | `auto` | `normal`, `low_latency`, `auto` | A2, A3 |
| `--deepep-config` | `None` | Type: str | Special for GPU |
| `--ep-num-redundant-experts` | `0` | Type: int | A2, A3 |
| `--ep-dispatch-algorithm` | `None` | Type: str | A2, A3 |
| `--init-expert-location` | `trivial` | Type: str | A2, A3 |
| `--enable-eplb` | `False` | bool flag (set to enable) | A2, A3 |
| `--eplb-algorithm` | `auto` | Type: str | A2, A3 |
| `--eplb-rebalance-layers-per-chunk` | `None` | Type: int | A2, A3 |
| `--eplb-min-rebalancing-utilization-threshold` | `1.0` | Type: float | A2, A3 |
| `--expert-distribution-recorder-mode` | `None` | Type: str | A2, A3 |
| `--expert-distribution-recorder-buffer-size` | `None` | Type: int | A2, A3 |
| `--enable-expert-distribution-metrics` | `False` | bool flag (set to enable) | A2, A3 |
| `--moe-dense-tp-size` | `None` | Type: int | A2, A3 |
| `--elastic-ep-backend` | `None` | `none`, `mooncake` | Special for GPU |
| `--mooncake-ib-device` | `None` | Type: str | Special for GPU |
## Mamba Cache
| Argument | Defaults | Options | Server supported |
|------------------------------|-----------|-----------------------------------------------|:----------------:|
| `--max-mamba-cache-size` | `None` | Type: int | A2, A3 |
| `--mamba-ssm-dtype` | `float32` | `float32`, `bfloat16` | A2, A3 |
| `--mamba-full-memory-ratio` | `0.2` | Type: float | A2, A3 |
| `--mamba-scheduler-strategy` | `auto` | `auto`, `no_buffer`, `extra_buffer` | A2, A3 |
| `--mamba-track-interval` | `256` | Type: int | A2, A3 |
## Hierarchical cache
| Argument | Defaults | Options | Server supported |
|-------------------------------------------------|-----------------|---------------------------------------------------------------------|:----------------:|
| `--enable-hierarchical-cache` | `False` | bool flag (set to enable) | A2, A3 |
| `--hicache-ratio` | `2.0` | Type: float | A2, A3 |
| `--hicache-size` | `0` | Type: int | A2, A3 |
| `--hicache-write-policy` | `write_through` | `write_back`, `write_through`, `write_through_selective` | A2, A3 |
| `--radix-eviction-policy` | `lru` | `lru`, `lfu` | A2, A3 |
| `--hicache-io-backend` | `kernel` | `kernel_ascend`, `direct` | A2, A3 |
| `--hicache-mem-layout` | `layer_first` | `page_first_direct`, `page_first_kv_split` | A2, A3 |
| `--hicache-storage-backend` | `None` | `file` | A2, A3 |
| `--hicache-storage-prefetch-policy` | `best_effort` | `best_effort`, `wait_complete`, `timeout` | Special for GPU |
| `--hicache-storage-backend-extra-config` | `None` | Type: str | Special for GPU |
## LMCache
| Argument | Defaults | Options | Server supported |
|--------------------|----------|--------------------------------|:----------------:|
| `--enable-lmcache` | `False` | bool flag (set to enable) | Special for GPU |
## Offloading
| Argument | Defaults | Options | Server supported |
|---------------------------|----------|-----------|:----------------:|
| `--cpu-offload-gb` | `0` | Type: int | A2, A3 |
| `--offload-group-size` | `-1` | Type: int | A2, A3 |
| `--offload-num-in-group` | `1` | Type: int | A2, A3 |
| `--offload-prefetch-step` | `1` | Type: int | A2, A3 |
| `--offload-mode` | `cpu` | Type: str | A2, A3 |
## Args for multi-item scoring
| Argument | Defaults | Options | Server supported |
|----------------------------------|----------|-----------|:----------------:|
| `--multi-item-scoring-delimiter` | `None` | Type: int | A2, A3 |
## Optimization/debug options
| Argument | Defaults | Options | Server supported |
|---------------------------------------------------------|----------|--------------------------------|:----------------:|
| `--disable-radix-cache` | `False` | bool flag (set to enable) | A2, A3 |
| `--cuda-graph-max-bs` | `None` | Type: int | A2, A3 |
| `--cuda-graph-bs` | `None` | List[int] | A2, A3 |
| `--disable-cuda-graph` | `False` | bool flag (set to enable) | A2, A3 |
| `--disable-cuda-graph-padding` | `False` | bool flag (set to enable) | A2, A3 |
| `--enable-profile-cuda-graph` | `False` | bool flag (set to enable) | A2, A3 |
| `--enable-cudagraph-gc` | `False` | bool flag (set to enable) | A2, A3 |
| `--enable-nccl-nvls` | `False` | bool flag (set to enable) | Special for GPU |
| `--enable-symm-mem` | `False` | bool flag (set to enable) | Special for GPU |
| `--disable-flashinfer-cutlass-moe-fp4-allgather` | `False` | bool flag (set to enable) | Special for GPU |
| `--enable-tokenizer-batch-encode` | `False` | bool flag (set to enable) | A2, A3 |
| `--disable-tokenizer-batch-encode` | `False` | bool flag (set to enable) | A2, A3 |
| `--disable-outlines-disk-cache` | `False` | bool flag (set to enable) | A2, A3 |
| `--disable-custom-all-reduce` | `False` | bool flag (set to enable) | A2, A3 |
| `--enable-mscclpp` | `False` | bool flag (set to enable) | Special for GPU |
| `--enable-torch-symm-mem` | `False` | bool flag (set to enable) | Special for GPU |
| `--disable-overlap-schedule` | `False` | bool flag (set to enable) | A2, A3 |
| `--enable-mixed-chunk` | `False` | bool flag (set to enable) | A2, A3 |
| `--enable-dp-attention` | `False` | bool flag (set to enable) | A2, A3 |
| `--enable-dp-lm-head` | `False` | bool flag (set to enable) | A2, A3 |
| `--enable-two-batch-overlap` | `False` | bool flag (set to enable) | Planned |
| `--enable-single-batch-overlap` | `False` | bool flag (set to enable) | A2, A3 |
| `--tbo-token-distribution-threshold` | `0.48` | Type: float | Planned |
| `--enable-torch-compile` | `False` | bool flag (set to enable) | A2, A3 |
| `--enable-torch-compile-debug-mode` | `False` | bool flag (set to enable) | A2, A3 |
| `--enable-piecewise-cuda-graph` | `False` | bool flag (set to enable) | A2, A3 |
| `--piecewise-cuda-graph-tokens` | `None` | Type: JSON list | A2, A3 |
| `--piecewise-cuda-graph-compiler` | `eager` | `eager`, `inductor` | A2, A3 |
| `--torch-compile-max-bs` | `32` | Type: int | A2, A3 |
| `--piecewise-cuda-graph-max-tokens` | `4096` | Type: int | A2, A3 |
| `--torchao-config` | `""` | Type: str | Special for GPU |
| `--enable-nan-detection` | `False` | bool flag (set to enable) | A2, A3 |
| `--enable-p2p-check` | `False` | bool flag (set to enable) | Special for GPU |
| `--triton-attention-reduce-in-fp32` | `False` | bool flag (set to enable) | Special for GPU |
| `--triton-attention-num-kv-splits` | `8` | Type: int | Special for GPU |
| `--triton-attention-split-tile-size` | `None` | Type: int | Special for GPU |
| `--delete-ckpt-after-loading` | `False` | bool flag (set to enable) | A2, A3 |
| `--enable-memory-saver` | `False` | bool flag (set to enable) | A2, A3 |
| `--enable-weights-cpu-backup` | `False` | bool flag (set to enable) | A2, A3 |
| `--enable-draft-weights-cpu-backup` | `False` | bool flag (set to enable) | A2, A3 |
| `--allow-auto-truncate` | `False` | bool flag (set to enable) | A2, A3 |
| `--enable-custom-logit-processor` | `False` | bool flag (set to enable) | A2, A3 |
| `--flashinfer-mla-disable-ragged` | `False` | bool flag (set to enable) | Special for GPU |
| `--disable-shared-experts-fusion` | `False` | bool flag (set to enable) | A2, A3 |
| `--disable-chunked-prefix-cache` | `False` | bool flag (set to enable) | A2, A3 |
| `--disable-fast-image-processor` | `False` | bool flag (set to enable) | A2, A3 |
| `--keep-mm-feature-on-device` | `False` | bool flag (set to enable) | A2, A3 |
| `--enable-return-hidden-states` | `False` | bool flag (set to enable) | A2, A3 |
| `--enable-return-routed-experts` | `False` | bool flag (set to enable) | A2, A3 |
| `--scheduler-recv-interval` | `1` | Type: int | A2, A3 |
| `--numa-node` | `None` | List[int] | A2, A3 |
| `--rl-on-policy-target` | `None` | `fsdp` | Planned |
| `--enable-layerwise-nvtx-marker` | `False` | bool flag (set to enable) | Special for GPU |
| `--enable-attn-tp-input-scattered` | `False` | bool flag (set to enable) | Experimental |
| `--enable-nsa-prefill-context-parallel` | `False` | bool flag (set to enable) | A2, A3 |
| `--enable-fused-qk-norm-rope` | `False` | bool flag (set to enable) | Special for GPU |
## Dynamic batch tokenizer
| Argument | Defaults | Options | Server supported |
|--------------------------------------------------|----------|--------------------------------|:----------------:|
| `--enable-dynamic-batch-tokenizer` | `False` | bool flag (set to enable) | A2, A3 |
| `--dynamic-batch-tokenizer-batch-size` | `32` | Type: int | A2, A3 |
| `--dynamic-batch-tokenizer-batch-timeout` | `0.002` | Type: float | A2, A3 |
## Debug tensor dumps
| Argument | Defaults | Options | Server supported |
|--------------------------------------------|----------|-----------|:----------------:|
| `--debug-tensor-dump-output-folder` | `None` | Type: str | A2, A3 |
| `--debug-tensor-dump-layers` | `None` | List[int] | A2, A3 |
| `--debug-tensor-dump-input-file` | `None` | Type: str | A2, A3 |
## PD disaggregation
| Argument | Defaults | Options | Server supported |
|---------------------------------------------------------|------------|---------------------------------------|:----------------:|
| `--disaggregation-mode` | `null` | `null`, `prefill`, `decode` | A2, A3 |
| `--disaggregation-transfer-backend` | `mooncake` | `ascend` | A2, A3 |
| `--disaggregation-bootstrap-port` | `8998` | Type: int | A2, A3 |
| `--disaggregation-decode-tp` | `None` | Type: int | A2, A3 |
| `--disaggregation-decode-dp` | `None` | Type: int | A2, A3 |
| `--disaggregation-ib-device` | `None` | Type: str | Special for GPU |
| `--disaggregation-decode-enable-offload-kvcache` | `False` | bool flag (set to enable) | A2, A3 |
| `--disaggregation-decode-enable-fake-auto` | `False` | bool flag (set to enable) | A2, A3 |
| `--num-reserved-decode-tokens` | `512` | Type: int | A2, A3 |
| `--disaggregation-decode-polling-interval` | `1` | Type: int | A2, A3 |
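
To make the `prefill` and `decode` modes in the table above more concrete, here is a minimal sketch that builds one command per role, using the `ascend` transfer backend listed as the supported option. The `disaggregated_command` helper is hypothetical, the `python -m sglang.launch_server` entry point is assumed, and the model path and port numbers are placeholders. A real deployment would likely need further flags (for example parallelism settings) that this sketch leaves out.

```python
import shlex

# Assumed entry point and placeholder model path.
COMMON = [
    "python", "-m", "sglang.launch_server",
    "--model-path", "/path/to/model",
    "--disaggregation-transfer-backend", "ascend",
]


def disaggregated_command(mode, port):
    """Build the argv for one role of a prefill/decode pair (hypothetical helper)."""
    if mode not in ("prefill", "decode"):
        raise ValueError(f"unsupported disaggregation mode: {mode!r}")
    return COMMON + ["--disaggregation-mode", mode, "--port", str(port)]


# Illustrative ports: one instance per role.
prefill_cmd = disaggregated_command("prefill", 30000)
decode_cmd = disaggregated_command("decode", 30001)

print(shlex.join(prefill_cmd))
print(shlex.join(decode_cmd))
```
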
## Encode prefill disaggregation
| Argument | Defaults | Options | Server supported |
|------------------------------|--------------------|----------------------------------------------------------------|:----------------:|
| `--encoder-only` | `False` | bool flag (set to enable) | A2, A3 |
| `--language-only` | `False` | bool flag (set to enable) | A2, A3 |
| `--encoder-transfer-backend` | `zmq_to_scheduler` | `zmq_to_scheduler`, `zmq_to_tokenizer`, `mooncake` | A2, A3 |
| `--encoder-urls` | `[]` | List[str] | A2, A3 |
## Custom weight loader
| Argument | Defaults | Options | Server supported |
|-------------------------------------------------------------------------|----------|---------------------------------|:----------------:|
| `--custom-weight-loader` | `None` | List[str] | A2, A3 |
| `--weight-loader-disable-mmap` | `False` | bool flag (set to enable) | A2, A3 |
| `--remote-instance-weight-loader-seed-instance-ip` | `None` | Type: str | A2, A3 |
| `--remote-instance-weight-loader-seed-instance-service-port` | `None` | Type: int | A2, A3 |
| `--remote-instance-weight-loader-send-weights-group-ports` | `None` | Type: JSON list | A2, A3 |
| `--remote-instance-weight-loader-backend` | `nccl` | `transfer_engine`, `nccl` | A2, A3 |
| `--remote-instance-weight-loader-start-seed-via-transfer-engine` | `False` | bool flag (set to enable) | Special for GPU |
## For PD-Multiplexing
| Argument | Defaults | Options | Server supported |
|-----------------------|----------|--------------------------------|:----------------:|
| `--enable-pdmux` | `False` | bool flag (set to enable) | Special for GPU |
| `--pdmux-config-path` | `None` | Type: str | Special for GPU |
| `--sm-group-num` | `8` | Type: int | Special for GPU |
## For Multi-Modal
| Argument | Defaults | Options | Server supported |
|-----------------------------------------------|----------|--------------------------------|:----------------:|
| `--mm-max-concurrent-calls` | `32` | Type: int | A2, A3 |
| `--mm-per-request-timeout` | `10.0` | Type: float | A2, A3 |
| `--enable-broadcast-mm-inputs-process` | `False` | bool flag (set to enable) | A2, A3 |
| `--mm-process-config` | `None` | Type: JSON / Dict | A2, A3 |
| `--mm-enable-dp-encoder` | `False` | bool flag (set to enable) | A2, A3 |
| `--limit-mm-data-per-request` | `None` | Type: JSON / Dict | A2, A3 |
## For checkpoint decryption
| Argument | Defaults | Options | Server supported |
|---------------------------------|----------|--------------------------------|:----------------:|
| `--decrypted-config-file` | `None` | Type: str | A2, A3 |
| `--decrypted-draft-config-file` | `None` | Type: str | A2, A3 |
| `--enable-prefix-mm-cache` | `False` | bool flag (set to enable) | A2, A3 |
## For deterministic inference
| Argument | Defaults | Options | Server supported |
|-------------------------------------------|----------|--------------------------------|:----------------:|
| `--enable-deterministic-inference` | `False` | bool flag (set to enable) | Planned |
## For registering hooks
| Argument | Defaults | Options | Server supported |
|-------------------|----------|-----------------|:----------------:|
| `--forward-hooks` | `None` | Type: JSON list | A2, A3 |
## Configuration file support
| Argument | Defaults | Options | Server supported |
|------------|----------|-----------|:----------------:|
| `--config` | `None` | Type: str | A2, A3 |
## Other Params
The following parameters are not supported because the third-party components they depend on, such as KTransformers and
checkpoint-engine, are not compatible with the NPU.
| Argument | Defaults | Options |
|-------------------------------------------------------------------|-----------|---------------------------|
| `--checkpoint-engine-wait-weights-before-ready` | `False` | bool flag (set to enable) |
| `--kt-weight-path` | `None` | Type: str |
| `--kt-method` | `AMXINT4` | Type: str |
| `--kt-cpuinfer` | `None` | Type: int |
| `--kt-threadpool-count` | `2` | Type: int |
| `--kt-num-gpu-experts` | `None` | Type: int |
| `--kt-max-deferred-experts-per-token` | `None` | Type: int |
The following parameters have some functional deficiencies in the community version.
| Argument | Defaults | Options |
|---------------------------------------|----------|--------------------------------|
| `--enable-double-sparsity` | `False` | bool flag (set to enable) |
| `--ds-channel-config-path` | `None` | Type: str |
| `--ds-heavy-channel-num` | `32` | Type: int |
| `--ds-heavy-token-num` | `256` | Type: int |
| `--ds-heavy-channel-type` | `qk` | Type: str |
| `--ds-sparse-decode-threshold` | `4096` | Type: int |
| `--tool-server` | `None` | Type: str |
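
The "Server supported" column can also be used to pre-check a command line before launching on an NPU. The sketch below is one possible approach, not part of SGLang. The `SUPPORT_STATUS` mapping copies only a handful of rows from this page and is not exhaustive, the `Unsupported` label is introduced here for the flags listed under Other Params, and `flags_needing_attention` is a hypothetical helper.

```python
# Support status for a few flags, copied from the tables on this page.
# "Unsupported" is a label introduced in this sketch for the Other Params flags.
SUPPORT_STATUS = {
    "--attention-backend": "A2, A3",
    "--enable-lmcache": "Special for GPU",
    "--enable-two-batch-overlap": "Planned",
    "--speculative-ngram-match-type": "Experimental",
    "--kt-method": "Unsupported",
}


def flags_needing_attention(argv):
    """Return (flag, status) pairs for known flags not marked 'A2, A3'."""
    flagged = []
    for token in argv:
        if not token.startswith("--"):
            continue  # skip option values
        flag = token.split("=", 1)[0]  # accept both "--flag value" and "--flag=value"
        status = SUPPORT_STATUS.get(flag)
        if status is not None and status != "A2, A3":
            flagged.append((flag, status))
    return flagged


argv = ["--attention-backend", "ascend", "--enable-lmcache", "--kt-method=AMXINT4"]
for flag, status in flags_needing_attention(argv):
    print(f"{flag}: {status}")
```

Flags missing from the mapping are passed over silently, so a fuller version would need every row from this page to be useful as a gate.
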