Support Features on Ascend NPU

This section describes the basic functions and features supported by the Ascend NPU. If you encounter issues or have any questions, please open an issue.

For the meaning and usage of each parameter, see Server Arguments.

Model and tokenizer

Argument	Defaults	Options	Server supported
`--model-path` `--model`	`None`	Type: str	A2, A3
`--tokenizer-path`	`None`	Type: str	A2, A3
`--tokenizer-mode`	`auto`	`auto`, `slow`	A2, A3
`--tokenizer-worker-num`	`1`	Type: int	A2, A3
`--skip-tokenizer-init`	`False`	bool flag (set to enable)	A2, A3
`--load-format`	`auto`	`auto`, `safetensors`	A2, A3
`--model-loader-extra-config`	`{}`	Type: str	A2, A3
`--trust-remote-code`	`False`	bool flag (set to enable)	A2, A3
`--context-length`	`None`	Type: int	A2, A3
`--is-embedding`	`False`	bool flag (set to enable)	A2, A3
`--enable-multimodal`	`None`	bool flag (set to enable)	A2, A3
`--revision`	`None`	Type: str	A2, A3
`--model-impl`	`auto`	`auto`, `sglang`, `transformers`	A2, A3
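
The table above mixes valued arguments (`Type: str`, `Type: int`) with boolean flags that are enabled simply by being present. As a minimal sketch of that distinction, the helper below turns a dictionary of options into a command line. It assumes the server is started with `python -m sglang.launch_server`; the helper name `build_launch_command`, the model path, and the context length are hypothetical and only serve as an illustration.

```python
import shlex


def build_launch_command(options):
    """Turn a dict of server arguments into an argv list.

    True         -> emit the bare flag (bool flags are "set to enable")
    False / None -> omit the argument so the server default applies
    other values -> emit the flag followed by the value as a string
    """
    # Assumption: the server entry point is `python -m sglang.launch_server`.
    argv = ["python", "-m", "sglang.launch_server"]
    for flag, value in options.items():
        if value is None or value is False:
            continue
        argv.append(flag)
        if value is not True:
            argv.append(str(value))
    return argv


options = {
    "--model-path": "/models/example-model",  # hypothetical local path
    "--tokenizer-mode": "auto",
    "--context-length": 8192,                 # illustrative value
    "--trust-remote-code": True,              # bool flag: present means enabled
    "--skip-tokenizer-init": False,           # bool flag left at its default
    "--revision": None,                       # unset, so it is omitted
}

command = build_launch_command(options)
print(shlex.join(command))
```

Leaving a boolean flag out of the command is what keeps it at its `False` default, which is why the sketch drops `False` and `None` entries instead of emitting them.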

HTTP server

Argument	Defaults	Options	Server supported
`--host`	`127.0.0.1`	Type: str	A2, A3
`--port`	`30000`	Type: int	A2, A3
`--skip-server-warmup`	`False`	bool flag (set to enable)	A2, A3
`--warmups`	`None`	Type: str	A2, A3
`--nccl-port`	`None`	Type: int	A2, A3
`--fastapi-root-path`	`None`	Type: str	A2, A3
`--grpc-mode`	`False`	bool flag (set to enable)	A2, A3

Quantization and data type

Argument	Defaults	Options	Server supported
`--dtype`	`auto`	`auto`, `float16`, `bfloat16`	A2, A3
`--quantization`	`None`	`modelslim`	A2, A3
`--quantization-param-path`	`None`	Type: str	Special for GPU
`--kv-cache-dtype`	`auto`	`auto`	A2, A3
`--enable-fp32-lm-head`	`False`	bool flag (set to enable)	A2, A3
`--modelopt-quant`	`None`	Type: str	Special for GPU
`--modelopt-checkpoint-restore-path`	`None`	Type: str	Special for GPU
`--modelopt-checkpoint-save-path`	`None`	Type: str	Special for GPU
`--modelopt-export-path`	`None`	Type: str	Special for GPU
`--quantize-and-serve`	`False`	bool flag (set to enable)	Special for GPU
`--rl-quant-profile`	`None`	Type: str	Special for GPU

Memory and scheduling

Argument	Defaults	Options	Server supported
`--mem-fraction-static`	`None`	Type: float	A2, A3
`--max-running-requests`	`None`	Type: int	A2, A3
`--prefill-max-requests`	`None`	Type: int	A2, A3
`--max-queued-requests`	`None`	Type: int	A2, A3
`--max-total-tokens`	`None`	Type: int	A2, A3
`--chunked-prefill-size`	`None`	Type: int	A2, A3
`--max-prefill-tokens`	`16384`	Type: int	A2, A3
`--schedule-policy`	`fcfs`	`lpm`, `fcfs`	A2, A3
`--enable-priority-scheduling`	`False`	bool flag (set to enable)	A2, A3
`--schedule-low-priority-values-first`	`False`	bool flag (set to enable)	A2, A3
`--priority-scheduling-preemption-threshold`	`10`	Type: int	A2, A3
`--schedule-conservativeness`	`1.0`	Type: float	A2, A3
`--page-size`	`128`	Type: int	A2, A3
`--swa-full-tokens-ratio`	`0.8`	Type: float	A2, A3
`--disable-hybrid-swa-memory`	`False`	bool flag (set to enable)	A2, A3
`--abort-on-priority-when-disabled`	`False`	bool flag (set to enable)	A2, A3
`--enable-dynamic-chunking`	`False`	bool flag (set to enable)	A2, A3

Runtime options

Argument	Defaults	Options	Server supported
`--device`	`None`	Type: str	A2, A3
`--tensor-parallel-size` `--tp-size`	`1`	Type: int	A2, A3
`--pipeline-parallel-size` `--pp-size`	`1`	Type: int	A2, A3
`--pp-max-micro-batch-size`	`None`	Type: int	A2, A3
`--pp-async-batch-depth`	`None`	Type: int	A2, A3
`--stream-interval`	`1`	Type: int	A2, A3
`--stream-output`	`False`	bool flag (set to enable)	A2, A3
`--random-seed`	`None`	Type: int	A2, A3
`--constrained-json-whitespace-pattern`	`None`	Type: str	A2, A3
`--constrained-json-disable-any-whitespace`	`False`	bool flag (set to enable)	A2, A3
`--watchdog-timeout`	`300`	Type: float	A2, A3
`--soft-watchdog-timeout`	`300`	Type: float	A2, A3
`--dist-timeout`	`None`	Type: int	A2, A3
`--base-gpu-id`	`0`	Type: int	A2, A3
`--gpu-id-step`	`1`	Type: int	A2, A3
`--sleep-on-idle`	`False`	bool flag (set to enable)	A2, A3
`--custom-sigquit-handler`	`None`	Optional[Callable]	A2, A3

Logging

Argument	Defaults	Options	Server supported
`--log-level`	`info`	Type: str	A2, A3
`--log-level-http`	`None`	Type: str	A2, A3
`--log-requests`	`False`	bool flag (set to enable)	A2, A3
`--log-requests-level`	`2`	`0`, `1`, `2`, `3`	A2, A3
`--log-requests-format`	`text`	`text`, `json`	A2, A3
`--crash-dump-folder`	`None`	Type: str	A2, A3
`--enable-metrics`	`False`	bool flag (set to enable)	A2, A3
`--enable-metrics-for-all-schedulers`	`False`	bool flag (set to enable)	A2, A3
`--tokenizer-metrics-custom-labels-header`	`x-custom-labels`	Type: str	A2, A3
`--tokenizer-metrics-allowed-custom-labels`	`None`	List[str]	A2, A3
`--bucket-time-to-first-token`	`None`	List[float]	A2, A3
`--bucket-inter-token-latency`	`None`	List[float]	A2, A3
`--bucket-e2e-request-latency`	`None`	List[float]	A2, A3
`--collect-tokens-histogram`	`False`	bool flag (set to enable)	A2, A3
`--prompt-tokens-buckets`	`None`	List[str]	A2, A3
`--generation-tokens-buckets`	`None`	List[str]	A2, A3
`--gc-warning-threshold-secs`	`0.0`	Type: float	A2, A3
`--decode-log-interval`	`40`	Type: int	A2, A3
`--enable-request-time-stats-logging`	`False`	bool flag (set to enable)	A2, A3
`--kv-events-config`	`None`	Type: str	Special for GPU
`--enable-trace`	`False`	bool flag (set to enable)	A2, A3
`--oltp-traces-endpoint`	`localhost:4317`	Type: str	A2, A3

RequestMetricsExporter configuration

Argument	Defaults	Options	Server supported
`--export-metrics-to-file`	`False`	bool flag (set to enable)	A2, A3
`--export-metrics-to-file-dir`	`None`	Type: str	A2, A3

API related

Argument	Defaults	Options	Server supported
`--api-key`	`None`	Type: str	A2, A3
`--admin-api-key`	`None`	Type: str	A2, A3
`--served-model-name`	`None`	Type: str	A2, A3
`--weight-version`	`default`	Type: str	A2, A3
`--chat-template`	`None`	Type: str	A2, A3
`--completion-template`	`None`	Type: str	A2, A3
`--enable-cache-report`	`False`	bool flag (set to enable)	A2, A3
`--reasoning-parser`	`None`	`deepseek-r1`	A2, A3
`--tool-call-parser`	`None`	`llama`, `pythonic`	A2, A3
`--sampling-defaults`	`model`	`openai`, `model`	A2, A3

Data parallelism

Argument	Defaults	Options	Server supported
`--data-parallel-size` `--dp-size`	`1`	Type: int	A2, A3
`--load-balance-method`	`round_robin`	`round_robin`, `total_requests`, `total_tokens`	A2, A3
`--prefill-round-robin-balance`	`False`	bool flag (set to enable)	A2, A3

Multi-node distributed serving

Argument	Defaults	Options	Server supported
`--dist-init-addr` `--nccl-init-addr`	`None`	Type: str	A2, A3
`--nnodes`	`1`	Type: int	A2, A3
`--node-rank`	`0`	Type: int	A2, A3

Model override args

Argument	Defaults	Options	Server supported
`--json-model-override-args`	`{}`	Type: str	A2, A3
`--preferred-sampling-params`	`None`	Type: str	A2, A3

LoRA

Argument	Defaults	Options	Server supported
`--enable-lora`	`False`	bool flag (set to enable)	A2, A3
`--max-lora-rank`	`None`	Type: int	A2, A3
`--lora-target-modules`	`None`	`all`	A2, A3
`--lora-paths`	`None`	Type: List[str] / JSON objects	A2, A3
`--max-loras-per-batch`	`8`	Type: int	A2, A3
`--max-loaded-loras`	`None`	Type: int	A2, A3
`--lora-eviction-policy`	`lru`	`lru`, `fifo`	A2, A3
`--lora-backend`	`triton`	`triton`	A2, A3
`--max-lora-chunk-size`	`16`	`16`, `32`, `64`, `128`	Special for GPU

Kernel Backends (Attention, Sampling, Grammar, GEMM)

Argument	Defaults	Options	Server supported
`--attention-backend`	`None`	`ascend`	A2, A3
`--prefill-attention-backend`	`None`	`ascend`	A2, A3
`--decode-attention-backend`	`None`	`ascend`	A2, A3
`--sampling-backend`	`None`	`pytorch`, `ascend`	A2, A3
`--grammar-backend`	`None`	`xgrammar`	A2, A3
`--mm-attention-backend`	`None`	`ascend_attn`	A2, A3
`--nsa-prefill-backend`	`flashmla_sparse`	`flashmla_sparse`, `flashmla_decode`, `fa3`, `tilelang`, `aiter`	Special for GPU
`--nsa-decode-backend`	`fa3`	`flashmla_prefill`, `flashmla_kv`, `fa3`, `tilelang`, `aiter`	Special for GPU
`--fp8-gemm-backend`	`auto`	`auto`, `deep_gemm`, `flashinfer_trtllm`, `flashinfer_cutlass`, `flashinfer_deepgemm`, `cutlass`, `triton`, `aiter`	Special for GPU
`--disable-flashinfer-autotune`	`False`	bool flag (set to enable)	Special for GPU
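
Because each backend argument in this table accepts only a small set of values on Ascend, a launch configuration can be checked before the server starts. The following is a rough sketch of such a check, with the allowed values and the GPU-only rows copied from the table above. The function name `check_backends` and the message wording are hypothetical, and the check is not part of SGLang itself.

```python
# Values listed in the "Options" column for rows marked "A2, A3".
ASCEND_OPTIONS = {
    "--attention-backend": {"ascend"},
    "--prefill-attention-backend": {"ascend"},
    "--decode-attention-backend": {"ascend"},
    "--sampling-backend": {"pytorch", "ascend"},
    "--grammar-backend": {"xgrammar"},
    "--mm-attention-backend": {"ascend_attn"},
}

# Rows marked "Special for GPU" in the same table.
GPU_ONLY = {
    "--nsa-prefill-backend",
    "--nsa-decode-backend",
    "--fp8-gemm-backend",
    "--disable-flashinfer-autotune",
}


def check_backends(options):
    """Return a list of human-readable problems; empty means no issue found."""
    problems = []
    for flag, value in options.items():
        if flag in GPU_ONLY:
            problems.append(f"{flag} is listed as GPU-only")
        elif flag in ASCEND_OPTIONS and value not in ASCEND_OPTIONS[flag]:
            allowed = ", ".join(sorted(ASCEND_OPTIONS[flag]))
            problems.append(
                f"{flag}={value} is not listed for Ascend (listed: {allowed})"
            )
    return problems


good = {"--attention-backend": "ascend", "--sampling-backend": "pytorch"}
bad = {"--attention-backend": "triton", "--fp8-gemm-backend": "auto"}

print(check_backends(good))
for problem in check_backends(bad):
    print(problem)
```

Flags that appear in neither mapping are passed through unchecked, so the sketch only reports what the table states explicitly.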

Speculative decoding

Argument	Defaults	Options	Server supported
`--speculative-algorithm`	`None`	`EAGLE3`, `NEXTN`	A2, A3
`--speculative-draft-model-path` `--speculative-draft-model`	`None`	Type: str	A2, A3
`--speculative-draft-model-revision`	`None`	Type: str	A2, A3
`--speculative-draft-load-format`	`None`	`auto`	A2, A3
`--speculative-num-steps`	`None`	Type: int	A2, A3
`--speculative-eagle-topk`	`None`	Type: int	A2, A3
`--speculative-num-draft-tokens`	`None`	Type: int	A2, A3
`--speculative-accept-threshold-single`	`1.0`	Type: float	Special for GPU
`--speculative-accept-threshold-acc`	`1.0`	Type: float	Special for GPU
`--speculative-token-map`	`None`	Type: str	A2, A3
`--speculative-attention-mode`	`prefill`	`prefill`, `decode`	A2, A3
`--speculative-moe-runner-backend`	`None`	`auto`	A2, A3
`--speculative-moe-a2a-backend`	`None`	`ascend_fuseep`	A2, A3
`--speculative-draft-attention-backend`	`None`	`ascend`	A2, A3
`--speculative-draft-model-quantization`	`None`	`unquant`	A2, A3

Ngram speculative decoding

Argument	Defaults	Options	Server supported
`--speculative-ngram-min-match-window-size`	`1`	Type: int	Experimental
`--speculative-ngram-max-match-window-size`	`12`	Type: int	Experimental
`--speculative-ngram-min-bfs-breadth`	`1`	Type: int	Experimental
`--speculative-ngram-max-bfs-breadth`	`10`	Type: int	Experimental
`--speculative-ngram-match-type`	`BFS`	`BFS`, `PROB`	Experimental
`--speculative-ngram-branch-length`	`18`	Type: int	Experimental
`--speculative-ngram-capacity`	`10000000`	Type: int	Experimental

Expert parallelism

Argument	Defaults	Options	Server supported
`--expert-parallel-size` `--ep-size` `--ep`	`1`	Type: int	A2, A3
`--moe-a2a-backend`	`none`	`none`, `deepep`, `ascend_fuseep`	A2, A3
`--moe-runner-backend`	`auto`	`auto`, `triton`	A2, A3
`--flashinfer-mxfp4-moe-precision`	`default`	`default`, `bf16`	Special for GPU
`--enable-flashinfer-allreduce-fusion`	`False`	bool flag (set to enable)	Special for GPU
`--deepep-mode`	`auto`	`normal`, `low_latency`, `auto`	A2, A3
`--deepep-config`	`None`	Type: str	Special for GPU
`--ep-num-redundant-experts`	`0`	Type: int	A2, A3
`--ep-dispatch-algorithm`	`None`	Type: str	A2, A3
`--init-expert-location`	`trivial`	Type: str	A2, A3
`--enable-eplb`	`False`	bool flag (set to enable)	A2, A3
`--eplb-algorithm`	`auto`	Type: str	A2, A3
`--eplb-rebalance-layers-per-chunk`	`None`	Type: int	A2, A3
`--eplb-min-rebalancing-utilization-threshold`	`1.0`	Type: float	A2, A3
`--expert-distribution-recorder-mode`	`None`	Type: str	A2, A3
`--expert-distribution-recorder-buffer-size`	`None`	Type: int	A2, A3
`--enable-expert-distribution-metrics`	`False`	bool flag (set to enable)	A2, A3
`--moe-dense-tp-size`	`None`	Type: int	A2, A3
`--elastic-ep-backend`	`None`	`none`, `mooncake`	Special for GPU
`--mooncake-ib-device`	`None`	Type: str	Special for GPU

Mamba Cache

Argument	Defaults	Options	Server supported
`--max-mamba-cache-size`	`None`	Type: int	A2, A3
`--mamba-ssm-dtype`	`float32`	`float32`, `bfloat16`	A2, A3
`--mamba-full-memory-ratio`	`0.2`	Type: float	A2, A3
`--mamba-scheduler-strategy`	`auto`	`auto`, `no_buffer`, `extra_buffer`	A2, A3
`--mamba-track-interval`	`256`	Type: int	A2, A3

Hierarchical cache

Argument	Defaults	Options	Server supported
`--enable-hierarchical-cache`	`False`	bool flag (set to enable)	A2, A3
`--hicache-ratio`	`2.0`	Type: float	A2, A3
`--hicache-size`	`0`	Type: int	A2, A3
`--hicache-write-policy`	`write_through`	`write_back`, `write_through`, `write_through_selective`	A2, A3
`--radix-eviction-policy`	`lru`	`lru`, `lfu`	A2, A3
`--hicache-io-backend`	`kernel`	`kernel_ascend`, `direct`	A2, A3
`--hicache-mem-layout`	`layer_first`	`page_first_direct`, `page_first_kv_split`	A2, A3
`--hicache-storage-backend`	`None`	`file`	A2, A3
`--hicache-storage-prefetch-policy`	`best_effort`	`best_effort`, `wait_complete`, `timeout`	Special for GPU
`--hicache-storage-backend-extra-config`	`None`	Type: str	Special for GPU

LMCache

Argument	Defaults	Options	Server supported
`--enable-lmcache`	`False`	bool flag (set to enable)	Special for GPU

Offloading

Argument	Defaults	Options	Server supported
`--cpu-offload-gb`	`0`	Type: int	A2, A3
`--offload-group-size`	`-1`	Type: int	A2, A3
`--offload-num-in-group`	`1`	Type: int	A2, A3
`--offload-prefetch-step`	`1`	Type: int	A2, A3
`--offload-mode`	`cpu`	Type: str	A2, A3

Args for multi-item scoring

Argument	Defaults	Options	Server supported
`--multi-item-scoring-delimiter`	`None`	Type: int	A2, A3

Optimization/debug options

Argument	Defaults	Options	Server supported
`--disable-radix-cache`	`False`	bool flag (set to enable)	A2, A3
`--cuda-graph-max-bs`	`None`	Type: int	A2, A3
`--cuda-graph-bs`	`None`	List[int]	A2, A3
`--disable-cuda-graph`	`False`	bool flag (set to enable)	A2, A3
`--disable-cuda-graph-padding`	`False`	bool flag (set to enable)	A2, A3
`--enable-profile-cuda-graph`	`False`	bool flag (set to enable)	A2, A3
`--enable-cudagraph-gc`	`False`	bool flag (set to enable)	A2, A3
`--enable-nccl-nvls`	`False`	bool flag (set to enable)	Special for GPU
`--enable-symm-mem`	`False`	bool flag (set to enable)	Special for GPU
`--disable-flashinfer-cutlass-moe-fp4-allgather`	`False`	bool flag (set to enable)	Special for GPU
`--enable-tokenizer-batch-encode`	`False`	bool flag (set to enable)	A2, A3
`--disable-tokenizer-batch-encode`	`False`	bool flag (set to enable)	A2, A3
`--disable-outlines-disk-cache`	`False`	bool flag (set to enable)	A2, A3
`--disable-custom-all-reduce`	`False`	bool flag (set to enable)	A2, A3
`--enable-mscclpp`	`False`	bool flag (set to enable)	Special for GPU
`--enable-torch-symm-mem`	`False`	bool flag (set to enable)	Special for GPU
`--disable-overlap-schedule`	`False`	bool flag (set to enable)	A2, A3
`--enable-mixed-chunk`	`False`	bool flag (set to enable)	A2, A3
`--enable-dp-attention`	`False`	bool flag (set to enable)	A2, A3
`--enable-dp-lm-head`	`False`	bool flag (set to enable)	A2, A3
`--enable-two-batch-overlap`	`False`	bool flag (set to enable)	Planned
`--enable-single-batch-overlap`	`False`	bool flag (set to enable)	A2, A3
`--tbo-token-distribution-threshold`	`0.48`	Type: float	Planned
`--enable-torch-compile`	`False`	bool flag (set to enable)	A2, A3
`--enable-torch-compile-debug-mode`	`False`	bool flag (set to enable)	A2, A3
`--enable-piecewise-cuda-graph`	`False`	bool flag (set to enable)	A2, A3
`--piecewise-cuda-graph-tokens`	`None`	Type: JSON list	A2, A3
`--piecewise-cuda-graph-compiler`	`eager`	`eager`, `inductor`	A2, A3
`--torch-compile-max-bs`	`32`	Type: int	A2, A3
`--piecewise-cuda-graph-max-tokens`	`4096`	Type: int	A2, A3
`--torchao-config`	`""`	Type: str	Special for GPU
`--enable-nan-detection`	`False`	bool flag (set to enable)	A2, A3
`--enable-p2p-check`	`False`	bool flag (set to enable)	Special for GPU
`--triton-attention-reduce-in-fp32`	`False`	bool flag (set to enable)	Special for GPU
`--triton-attention-num-kv-splits`	`8`	Type: int	Special for GPU
`--triton-attention-split-tile-size`	`None`	Type: int	Special for GPU
`--delete-ckpt-after-loading`	`False`	bool flag (set to enable)	A2, A3
`--enable-memory-saver`	`False`	bool flag (set to enable)	A2, A3
`--enable-weights-cpu-backup`	`False`	bool flag (set to enable)	A2, A3
`--enable-draft-weights-cpu-backup`	`False`	bool flag (set to enable)	A2, A3
`--allow-auto-truncate`	`False`	bool flag (set to enable)	A2, A3
`--enable-custom-logit-processor`	`False`	bool flag (set to enable)	A2, A3
`--flashinfer-mla-disable-ragged`	`False`	bool flag (set to enable)	Special for GPU
`--disable-shared-experts-fusion`	`False`	bool flag (set to enable)	A2, A3
`--disable-chunked-prefix-cache`	`False`	bool flag (set to enable)	A2, A3
`--disable-fast-image-processor`	`False`	bool flag (set to enable)	A2, A3
`--keep-mm-feature-on-device`	`False`	bool flag (set to enable)	A2, A3
`--enable-return-hidden-states`	`False`	bool flag (set to enable)	A2, A3
`--enable-return-routed-experts`	`False`	bool flag (set to enable)	A2, A3
`--scheduler-recv-interval`	`1`	Type: int	A2, A3
`--numa-node`	`None`	List[int]	A2, A3
`--rl-on-policy-target`	`None`	`fsdp`	Planned
`--enable-layerwise-nvtx-marker`	`False`	bool flag (set to enable)	Special for GPU
`--enable-attn-tp-input-scattered`	`False`	bool flag (set to enable)	Experimental
`--enable-nsa-prefill-context-parallel`	`False`	bool flag (set to enable)	A2, A3
`--enable-fused-qk-norm-rope`	`False`	bool flag (set to enable)	Special for GPU

Dynamic batch tokenizer

Argument	Defaults	Options	Server supported
`--enable-dynamic-batch-tokenizer`	`False`	bool flag (set to enable)	A2, A3
`--dynamic-batch-tokenizer-batch-size`	`32`	Type: int	A2, A3
`--dynamic-batch-tokenizer-batch-timeout`	`0.002`	Type: float	A2, A3

Debug tensor dumps

Argument	Defaults	Options	Server supported
`--debug-tensor-dump-output-folder`	`None`	Type: str	A2, A3
`--debug-tensor-dump-layers`	`None`	List[int]	A2, A3
`--debug-tensor-dump-input-file`	`None`	Type: str	A2, A3

PD disaggregation

Argument	Defaults	Options	Server supported
`--disaggregation-mode`	`null`	`null`, `prefill`, `decode`	A2, A3
`--disaggregation-transfer-backend`	`mooncake`	`ascend`	A2, A3
`--disaggregation-bootstrap-port`	`8998`	Type: int	A2, A3
`--disaggregation-decode-tp`	`None`	Type: int	A2, A3
`--disaggregation-decode-dp`	`None`	Type: int	A2, A3
`--disaggregation-ib-device`	`None`	Type: str	Special for GPU
`--disaggregation-decode-enable-offload-kvcache`	`False`	bool flag (set to enable)	A2, A3
`--disaggregation-decode-enable-fake-auto`	`False`	bool flag (set to enable)	A2, A3
`--num-reserved-decode-tokens`	`512`	Type: int	A2, A3
`--disaggregation-decode-polling-interval`	`1`	Type: int	A2, A3
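
In PD disaggregation, prefill and decode run as separate server instances that differ mainly in `--disaggregation-mode`. The sketch below assembles one command line per role using only arguments from the table above plus `--model-path` and `--port`. Several things here are assumptions rather than documented behavior: the `python -m sglang.launch_server` entry point, the helper name `build_pd_commands`, the model path, the second port number, and the choice to pass the bootstrap port only to the prefill instance.

```python
def build_pd_commands(model_path, prefill_port=30000, decode_port=30001,
                      bootstrap_port=8998):
    """Build argv lists for one prefill and one decode instance."""
    # Assumption: both roles share the entry point, model, and transfer backend.
    common = [
        "python", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--disaggregation-transfer-backend", "ascend",
    ]
    prefill = common + [
        "--disaggregation-mode", "prefill",
        # Assumption: only the prefill instance is given the bootstrap port.
        "--disaggregation-bootstrap-port", str(bootstrap_port),
        "--port", str(prefill_port),
    ]
    decode = common + [
        "--disaggregation-mode", "decode",
        "--port", str(decode_port),
    ]
    return prefill, decode


# Hypothetical model path, used only for illustration.
prefill_cmd, decode_cmd = build_pd_commands("/models/example-model")
print(" ".join(prefill_cmd))
print(" ".join(decode_cmd))
```

Giving each instance its own `--port` matters when both run on the same host, since the HTTP server default of `30000` would otherwise collide.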

Encode prefill disaggregation

Argument	Defaults	Options	Server supported
`--encoder-only`	`False`	bool flag (set to enable)	A2, A3
`--language-only`	`False`	bool flag (set to enable)	A2, A3
`--encoder-transfer-backend`	`zmq_to_scheduler`	`zmq_to_scheduler`, `zmq_to_tokenizer`, `mooncake`	A2, A3
`--encoder-urls`	`[]`	List[str]	A2, A3

Custom weight loader

Argument	Defaults	Options	Server supported
`--custom-weight-loader`	`None`	List[str]	A2, A3
`--weight-loader-disable-mmap`	`False`	bool flag (set to enable)	A2, A3
`--remote-instance-weight-loader-seed-instance-ip`	`None`	Type: str	A2, A3
`--remote-instance-weight-loader-seed-instance-service-port`	`None`	Type: int	A2, A3
`--remote-instance-weight-loader-send-weights-group-ports`	`None`	Type: JSON list	A2, A3
`--remote-instance-weight-loader-backend`	`nccl`	`transfer_engine`, `nccl`	A2, A3
`--remote-instance-weight-loader-start-seed-via-transfer-engine`	`False`	bool flag (set to enable)	Special for GPU

For PD-Multiplexing

Argument	Defaults	Options	Server supported
`--enable-pdmux`	`False`	bool flag (set to enable)	Special for GPU
`--pdmux-config-path`	`None`	Type: str	Special for GPU
`--sm-group-num`	`8`	Type: int	Special for GPU

For Multi-Modal

Argument	Defaults	Options	Server supported
`--mm-max-concurrent-calls`	`32`	Type: int	A2, A3
`--mm-per-request-timeout`	`10.0`	Type: float	A2, A3
`--enable-broadcast-mm-inputs-process`	`False`	bool flag (set to enable)	A2, A3
`--mm-process-config`	`None`	Type: JSON / Dict	A2, A3
`--mm-enable-dp-encoder`	`False`	bool flag (set to enable)	A2, A3
`--limit-mm-data-per-request`	`None`	Type: JSON / Dict	A2, A3

For checkpoint decryption

Argument	Defaults	Options	Server supported
`--decrypted-config-file`	`None`	Type: str	A2, A3
`--decrypted-draft-config-file`	`None`	Type: str	A2, A3
`--enable-prefix-mm-cache`	`False`	bool flag (set to enable)	A2, A3

For deterministic inference

Argument	Defaults	Options	Server supported
`--enable-deterministic-inference`	`False`	bool flag (set to enable)	Planned

For registering hooks

Argument	Defaults	Options	Server supported
`--forward-hooks`	`None`	Type: JSON list	A2, A3

Configuration file support

Argument	Defaults	Options	Server supported
`--config`	`None`	Type: str	A2, A3

Other Params

The following parameters are not supported because the third-party components they depend on, such as KTransformers and checkpoint-engine, are not compatible with the NPU.

Argument	Defaults	Options
`--checkpoint-engine-wait-weights-before-ready`	`False`	bool flag (set to enable)
`--kt-weight-path`	`None`	Type: str
`--kt-method`	`AMXINT4`	Type: str
`--kt-cpuinfer`	`None`	Type: int
`--kt-threadpool-count`	`2`	Type: int
`--kt-num-gpu-experts`	`None`	Type: int
`--kt-max-deferred-experts-per-token`	`None`	Type: int

The following parameters have some functional deficiencies in the community version:

Argument	Defaults	Options
`--enable-double-sparsity`	`False`	bool flag (set to enable)
`--ds-channel-config-path`	`None`	Type: str
`--ds-heavy-channel-num`	`32`	Type: int
`--ds-heavy-token-num`	`256`	Type: int
`--ds-heavy-channel-type`	`qk`	Type: str
`--ds-sparse-decode-threshold`	`4096`	Type: int
`--tool-server`	`None`	Type: str