Support Features on Ascend NPU
This section describes the basic functions and features supported by the Ascend NPU. If you encounter issues or have any questions, please open an issue.
For the meaning and usage of each parameter, see Server Arguments.
Model and tokenizer
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
| --model-path, --model | None | Type: str | A2, A3 |
| --tokenizer-path | None | Type: str | A2, A3 |
| --tokenizer-mode | auto | auto, slow | A2, A3 |
| --tokenizer-worker-num | 1 | Type: int | A2, A3 |
| --skip-tokenizer-init | False | bool flag (set to enable) | A2, A3 |
| --load-format | auto | auto, safetensors | A2, A3 |
| --model-loader-extra-config | {} | Type: str | A2, A3 |
| --trust-remote-code | False | bool flag (set to enable) | A2, A3 |
| --context-length | None | Type: int | A2, A3 |
| --is-embedding | False | bool flag (set to enable) | A2, A3 |
| --enable-multimodal | None | bool flag (set to enable) | A2, A3 |
| --revision | None | Type: str | A2, A3 |
| --model-impl | auto | auto, sglang, transformers | A2, A3 |
HTTP server
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
| --host | 127.0.0.1 | Type: str | A2, A3 |
| --port | 30000 | Type: int | A2, A3 |
| --skip-server-warmup | False | bool flag (set to enable) | A2, A3 |
| --warmups | None | Type: str | A2, A3 |
| --nccl-port | None | Type: int | A2, A3 |
| --fastapi-root-path | None | Type: str | A2, A3 |
| --grpc-mode | False | bool flag (set to enable) | A2, A3 |
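
The flags in the two tables above can be combined into a launch command. The following is a minimal sketch, not a definitive recipe: it assumes the server is started through the `python -m sglang.launch_server` entry point (the section itself does not name the entry point), and `/path/to/model` is a placeholder. The helper `build_launch_command` is hypothetical and only assembles an argument list.

```python
import shlex


def build_launch_command(model_path, host="127.0.0.1", port=30000, extra_flags=()):
    """Assemble an argv list from flags listed in the tables above.

    The defaults for host and port mirror the table defaults.
    The entry point module is an assumption, not stated in this section.
    """
    cmd = [
        "python", "-m", "sglang.launch_server",  # assumed entry point
        "--model-path", model_path,
        "--host", host,
        "--port", str(port),
    ]
    # Boolean flags such as --trust-remote-code take no value: present means enabled.
    cmd.extend(extra_flags)
    return cmd


cmd = build_launch_command("/path/to/model", extra_flags=["--trust-remote-code"])
print(shlex.join(cmd))
```

Building the command as a list and quoting it with `shlex.join` avoids quoting mistakes when a path contains spaces.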
Quantization and data type
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
| --dtype | auto | auto, float16, bfloat16 | A2, A3 |
| --quantization | None | modelslim | A2, A3 |
| --quantization-param-path | None | Type: str | Special for GPU |
| --kv-cache-dtype | auto | auto | A2, A3 |
| --enable-fp32-lm-head | False | bool flag (set to enable) | A2, A3 |
| --modelopt-quant | None | Type: str | Special for GPU |
| --modelopt-checkpoint-restore-path | None | Type: str | Special for GPU |
| --modelopt-checkpoint-save-path | None | Type: str | Special for GPU |
| --modelopt-export-path | None | Type: str | Special for GPU |
| --quantize-and-serve | False | bool flag (set to enable) | Special for GPU |
| --rl-quant-profile | None | Type: str | Special for GPU |
Memory and scheduling
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
| --mem-fraction-static | None | Type: float | A2, A3 |
| --max-running-requests | None | Type: int | A2, A3 |
| --prefill-max-requests | None | Type: int | A2, A3 |
| --max-queued-requests | None | Type: int | A2, A3 |
| --max-total-tokens | None | Type: int | A2, A3 |
| --chunked-prefill-size | None | Type: int | A2, A3 |
| --max-prefill-tokens | 16384 | Type: int | A2, A3 |
| --schedule-policy | fcfs | lpm, fcfs | A2, A3 |
| --enable-priority-scheduling | False | bool flag (set to enable) | A2, A3 |
| --schedule-low-priority-values-first | False | bool flag (set to enable) | A2, A3 |
| --priority-scheduling-preemption-threshold | 10 | Type: int | A2, A3 |
| --schedule-conservativeness | 1.0 | Type: float | A2, A3 |
| --page-size | 128 | Type: int | A2, A3 |
| --swa-full-tokens-ratio | 0.8 | Type: float | A2, A3 |
| --disable-hybrid-swa-memory | False | bool flag (set to enable) | A2, A3 |
| --abort-on-priority-when-disabled | False | bool flag (set to enable) | A2, A3 |
| --enable-dynamic-chunking | False | bool flag (set to enable) | A2, A3 |
Runtime options
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
| --device | None | Type: str | A2, A3 |
| --tensor-parallel-size, --tp-size | 1 | Type: int | A2, A3 |
| --pipeline-parallel-size, --pp-size | 1 | Type: int | A2, A3 |
| --pp-max-micro-batch-size | None | Type: int | A2, A3 |
| --pp-async-batch-depth | None | Type: int | A2, A3 |
| --stream-interval | 1 | Type: int | A2, A3 |
| --stream-output | False | bool flag (set to enable) | A2, A3 |
| --random-seed | None | Type: int | A2, A3 |
| --constrained-json-whitespace-pattern | None | Type: str | A2, A3 |
| --constrained-json-disable-any-whitespace | False | bool flag (set to enable) | A2, A3 |
| --watchdog-timeout | 300 | Type: float | A2, A3 |
| --soft-watchdog-timeout | 300 | Type: float | A2, A3 |
| --dist-timeout | None | Type: int | A2, A3 |
| --base-gpu-id | 0 | Type: int | A2, A3 |
| --gpu-id-step | 1 | Type: int | A2, A3 |
| --sleep-on-idle | False | bool flag (set to enable) | A2, A3 |
| --custom-sigquit-handler | None | Optional[Callable] | A2, A3 |
Logging
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
| --log-level | info | Type: str | A2, A3 |
| --log-level-http | None | Type: str | A2, A3 |
| --log-requests | False | bool flag (set to enable) | A2, A3 |
| --log-requests-level | 2 | 0, 1, 2, 3 | A2, A3 |
| --log-requests-format | text | text, json | A2, A3 |
| --crash-dump-folder | None | Type: str | A2, A3 |
| --enable-metrics | False | bool flag (set to enable) | A2, A3 |
| --enable-metrics-for-all-schedulers | False | bool flag (set to enable) | A2, A3 |
| --tokenizer-metrics-custom-labels-header | x-custom-labels | Type: str | A2, A3 |
| --tokenizer-metrics-allowed-custom-labels | None | List[str] | A2, A3 |
| --bucket-time-to-first-token | None | List[float] | A2, A3 |
| --bucket-inter-token-latency | None | List[float] | A2, A3 |
| --bucket-e2e-request-latency | None | List[float] | A2, A3 |
| --collect-tokens-histogram | False | bool flag (set to enable) | A2, A3 |
| --prompt-tokens-buckets | None | List[str] | A2, A3 |
| --generation-tokens-buckets | None | List[str] | A2, A3 |
| --gc-warning-threshold-secs | 0.0 | Type: float | A2, A3 |
| --decode-log-interval | 40 | Type: int | A2, A3 |
| --enable-request-time-stats-logging | False | bool flag (set to enable) | A2, A3 |
| --kv-events-config | None | Type: str | Special for GPU |
| --enable-trace | False | bool flag (set to enable) | A2, A3 |
| --oltp-traces-endpoint | localhost:4317 | Type: str | A2, A3 |
RequestMetricsExporter configuration
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
| --export-metrics-to-file | False | bool flag (set to enable) | A2, A3 |
| --export-metrics-to-file-dir | None | Type: str | A2, A3 |
API related
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
| --api-key | None | Type: str | A2, A3 |
| --admin-api-key | None | Type: str | A2, A3 |
| --served-model-name | None | Type: str | A2, A3 |
| --weight-version | default | Type: str | A2, A3 |
| --chat-template | None | Type: str | A2, A3 |
| --completion-template | None | Type: str | A2, A3 |
| --enable-cache-report | False | bool flag (set to enable) | A2, A3 |
| --reasoning-parser | None | deepseek-r1 | A2, A3 |
| --tool-call-parser | None | llama, pythonic | A2, A3 |
| --sampling-defaults | model | openai, model | A2, A3 |
Data parallelism
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
| --data-parallel-size, --dp-size | 1 | Type: int | A2, A3 |
| --load-balance-method | round_robin | round_robin, total_requests, total_tokens | A2, A3 |
| --prefill-round-robin-balance | False | bool flag (set to enable) | A2, A3 |
Multi-node distributed serving
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
| --dist-init-addr, --nccl-init-addr | None | Type: str | A2, A3 |
| --nnodes | 1 | Type: int | A2, A3 |
| --node-rank | 0 | Type: int | A2, A3 |
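
As an illustration of how the three multi-node flags fit together, the sketch below generates the per-node flag set for a small cluster. It rests on assumptions that this section does not state: that `--dist-init-addr` takes a `host:port` string shared by all nodes, and that node ranks run from 0 to `nnodes - 1`. The address `192.0.2.10:5000` is a placeholder and `multi_node_flags` is a hypothetical helper.

```python
def multi_node_flags(dist_init_addr, nnodes, node_rank):
    """Return the multi-node flags for one node as an argv fragment.

    Assumes ranks are numbered 0 .. nnodes - 1 (not stated in the table).
    """
    if not 0 <= node_rank < nnodes:
        raise ValueError(f"node_rank {node_rank} is outside 0..{nnodes - 1}")
    return [
        "--dist-init-addr", dist_init_addr,  # assumed host:port format
        "--nnodes", str(nnodes),
        "--node-rank", str(node_rank),
    ]


# Every node gets the same address and node count; only the rank differs.
for rank in range(2):
    print(f"node {rank}:", " ".join(multi_node_flags("192.0.2.10:5000", 2, rank)))
```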
Model override args
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
| --json-model-override-args | {} | Type: str | A2, A3 |
| --preferred-sampling-params | None | Type: str | A2, A3 |
LoRA
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
| --enable-lora | False | bool flag (set to enable) | A2, A3 |
| --max-lora-rank | None | Type: int | A2, A3 |
| --lora-target-modules | None | all | A2, A3 |
| --lora-paths | None | Type: List[str] / JSON objects | A2, A3 |
| --max-loras-per-batch | 8 | Type: int | A2, A3 |
| --max-loaded-loras | None | Type: int | A2, A3 |
| --lora-eviction-policy | lru | lru, fifo | A2, A3 |
| --lora-backend | triton | triton | A2, A3 |
| --max-lora-chunk-size | 16 | 16, 32, 64, 128 | Special for GPU |
Kernel Backends (Attention, Sampling, Grammar, GEMM)
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
| --attention-backend | None | ascend | A2, A3 |
| --prefill-attention-backend | None | ascend | A2, A3 |
| --decode-attention-backend | None | ascend | A2, A3 |
| --sampling-backend | None | pytorch, ascend | A2, A3 |
| --grammar-backend | None | xgrammar | A2, A3 |
| --mm-attention-backend | None | ascend_attn | A2, A3 |
| --nsa-prefill-backend | flashmla_sparse | flashmla_sparse, flashmla_decode, fa3, tilelang, aiter | Special for GPU |
| --nsa-decode-backend | fa3 | flashmla_prefill, flashmla_kv, fa3, tilelang, aiter | Special for GPU |
| --fp8-gemm-backend | auto | auto, deep_gemm, flashinfer_trtllm, flashinfer_cutlass, flashinfer_deepgemm, cutlass, triton, aiter | Special for GPU |
| --disable-flashinfer-autotune | False | bool flag (set to enable) | Special for GPU |
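
A table like the one above can be turned into a pre-flight check that catches backend values not listed for the NPU before a launch is attempted. The sketch below is a hypothetical helper, not part of the server itself; its option sets are copied from the table, and it only handles the space-separated `--flag value` form.

```python
# Options listed as supported on A2/A3 in the table above.
NPU_BACKEND_OPTIONS = {
    "--attention-backend": {"ascend"},
    "--prefill-attention-backend": {"ascend"},
    "--decode-attention-backend": {"ascend"},
    "--sampling-backend": {"pytorch", "ascend"},
    "--grammar-backend": {"xgrammar"},
    "--mm-attention-backend": {"ascend_attn"},
}

# Flags the table marks as "Special for GPU".
GPU_ONLY_FLAGS = {
    "--nsa-prefill-backend",
    "--nsa-decode-backend",
    "--fp8-gemm-backend",
    "--disable-flashinfer-autotune",
}


def check_backend_flags(argv):
    """Return a list of human-readable problems found in argv."""
    problems = []
    for i, token in enumerate(argv):
        if token in GPU_ONLY_FLAGS:
            problems.append(f"{token} is marked 'Special for GPU'")
        elif token in NPU_BACKEND_OPTIONS:
            value = argv[i + 1] if i + 1 < len(argv) else None
            allowed = NPU_BACKEND_OPTIONS[token]
            if value not in allowed:
                problems.append(f"{token}={value!r} is not in {sorted(allowed)}")
    return problems


for problem in check_backend_flags(["--attention-backend", "triton", "--fp8-gemm-backend", "auto"]):
    print(problem)
```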
Speculative decoding
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
| --speculative-algorithm | None | EAGLE3, NEXTN | A2, A3 |
| --speculative-draft-model-path, --speculative-draft-model | None | Type: str | A2, A3 |
| --speculative-draft-model-revision | None | Type: str | A2, A3 |
| --speculative-draft-load-format | None | auto | A2, A3 |
| --speculative-num-steps | None | Type: int | A2, A3 |
| --speculative-eagle-topk | None | Type: int | A2, A3 |
| --speculative-num-draft-tokens | None | Type: int | A2, A3 |
| --speculative-accept-threshold-single | 1.0 | Type: float | Special for GPU |
| --speculative-accept-threshold-acc | 1.0 | Type: float | Special for GPU |
| --speculative-token-map | None | Type: str | A2, A3 |
| --speculative-attention-mode | prefill | prefill, decode | A2, A3 |
| --speculative-moe-runner-backend | None | auto | A2, A3 |
| --speculative-moe-a2a-backend | None | ascend_fuseep | A2, A3 |
| --speculative-draft-attention-backend | None | ascend | A2, A3 |
| --speculative-draft-model-quantization | None | unquant | A2, A3 |
Ngram speculative decoding
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
| --speculative-ngram-min-match-window-size | 1 | Type: int | Experimental |
| --speculative-ngram-max-match-window-size | 12 | Type: int | Experimental |
| --speculative-ngram-min-bfs-breadth | 1 | Type: int | Experimental |
| --speculative-ngram-max-bfs-breadth | 10 | Type: int | Experimental |
| --speculative-ngram-match-type | BFS | BFS, PROB | Experimental |
| --speculative-ngram-branch-length | 18 | Type: int | Experimental |
| --speculative-ngram-capacity | 10000000 | Type: int | Experimental |
Expert parallelism
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
| --expert-parallel-size, --ep-size, --ep | 1 | Type: int | A2, A3 |
| --moe-a2a-backend | none | none, deepep, ascend_fuseep | A2, A3 |
| --moe-runner-backend | auto | auto, triton | A2, A3 |
| --flashinfer-mxfp4-moe-precision | default | default, bf16 | Special for GPU |
| --enable-flashinfer-allreduce-fusion | False | bool flag (set to enable) | Special for GPU |
| --deepep-mode | auto | normal, low_latency, auto | A2, A3 |
| --deepep-config | None | Type: str | Special for GPU |
| --ep-num-redundant-experts | 0 | Type: int | A2, A3 |
| --ep-dispatch-algorithm | None | Type: str | A2, A3 |
| --init-expert-location | trivial | Type: str | A2, A3 |
| --enable-eplb | False | bool flag (set to enable) | A2, A3 |
| --eplb-algorithm | auto | Type: str | A2, A3 |
| --eplb-rebalance-layers-per-chunk | None | Type: int | A2, A3 |
| --eplb-min-rebalancing-utilization-threshold | 1.0 | Type: float | A2, A3 |
| --expert-distribution-recorder-mode | None | Type: str | A2, A3 |
| --expert-distribution-recorder-buffer-size | None | Type: int | A2, A3 |
| --enable-expert-distribution-metrics | False | bool flag (set to enable) | A2, A3 |
| --moe-dense-tp-size | None | Type: int | A2, A3 |
| --elastic-ep-backend | None | none, mooncake | Special for GPU |
| --mooncake-ib-device | None | Type: str | Special for GPU |
Mamba Cache
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
| --max-mamba-cache-size | None | Type: int | A2, A3 |
| --mamba-ssm-dtype | float32 | float32, bfloat16 | A2, A3 |
| --mamba-full-memory-ratio | 0.2 | Type: float | A2, A3 |
| --mamba-scheduler-strategy | auto | auto, no_buffer, extra_buffer | A2, A3 |
| --mamba-track-interval | 256 | Type: int | A2, A3 |
Hierarchical cache
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
| --enable-hierarchical-cache | False | bool flag (set to enable) | A2, A3 |
| --hicache-ratio | 2.0 | Type: float | A2, A3 |
| --hicache-size | 0 | Type: int | A2, A3 |
| --hicache-write-policy | write_through | write_back, write_through, write_through_selective | A2, A3 |
| --radix-eviction-policy | lru | lru, lfu | A2, A3 |
| --hicache-io-backend | kernel | kernel_ascend, direct | A2, A3 |
| --hicache-mem-layout | layer_first | page_first_direct, page_first_kv_split | A2, A3 |
| --hicache-storage-backend | None | file | A2, A3 |
| --hicache-storage-prefetch-policy | best_effort | best_effort, wait_complete, timeout | Special for GPU |
| --hicache-storage-backend-extra-config | None | Type: str | Special for GPU |
LMCache
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
| --enable-lmcache | False | bool flag (set to enable) | Special for GPU |
Offloading
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
| --cpu-offload-gb | 0 | Type: int | A2, A3 |
| --offload-group-size | -1 | Type: int | A2, A3 |
| --offload-num-in-group | 1 | Type: int | A2, A3 |
| --offload-prefetch-step | 1 | Type: int | A2, A3 |
| --offload-mode | cpu | Type: str | A2, A3 |
Args for multi-item scoring
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
| --multi-item-scoring-delimiter | None | Type: int | A2, A3 |
Optimization/debug options
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
| --disable-radix-cache | False | bool flag (set to enable) | A2, A3 |
| --cuda-graph-max-bs | None | Type: int | A2, A3 |
| --cuda-graph-bs | None | List[int] | A2, A3 |
| --disable-cuda-graph | False | bool flag (set to enable) | A2, A3 |
| --disable-cuda-graph-padding | False | bool flag (set to enable) | A2, A3 |
| --enable-profile-cuda-graph | False | bool flag (set to enable) | A2, A3 |
| --enable-cudagraph-gc | False | bool flag (set to enable) | A2, A3 |
| --enable-nccl-nvls | False | bool flag (set to enable) | Special for GPU |
| --enable-symm-mem | False | bool flag (set to enable) | Special for GPU |
| --disable-flashinfer-cutlass-moe-fp4-allgather | False | bool flag (set to enable) | Special for GPU |
| --enable-tokenizer-batch-encode | False | bool flag (set to enable) | A2, A3 |
| --disable-tokenizer-batch-encode | False | bool flag (set to enable) | A2, A3 |
| --disable-outlines-disk-cache | False | bool flag (set to enable) | A2, A3 |
| --disable-custom-all-reduce | False | bool flag (set to enable) | A2, A3 |
| --enable-mscclpp | False | bool flag (set to enable) | Special for GPU |
| --enable-torch-symm-mem | False | bool flag (set to enable) | Special for GPU |
| --disable-overlap-schedule | False | bool flag (set to enable) | A2, A3 |
| --enable-mixed-chunk | False | bool flag (set to enable) | A2, A3 |
| --enable-dp-attention | False | bool flag (set to enable) | A2, A3 |
| --enable-dp-lm-head | False | bool flag (set to enable) | A2, A3 |
| --enable-two-batch-overlap | False | bool flag (set to enable) | Planned |
| --enable-single-batch-overlap | False | bool flag (set to enable) | A2, A3 |
| --tbo-token-distribution-threshold | 0.48 | Type: float | Planned |
| --enable-torch-compile | False | bool flag (set to enable) | A2, A3 |
| --enable-torch-compile-debug-mode | False | bool flag (set to enable) | A2, A3 |
| --enable-piecewise-cuda-graph | False | bool flag (set to enable) | A2, A3 |
| --piecewise-cuda-graph-tokens | None | Type: JSON list | A2, A3 |
| --piecewise-cuda-graph-compiler | eager | eager, inductor | A2, A3 |
| --torch-compile-max-bs | 32 | Type: int | A2, A3 |
| --piecewise-cuda-graph-max-tokens | 4096 | Type: int | A2, A3 |
| --torchao-config | "" | Type: str | Special for GPU |
| --enable-nan-detection | False | bool flag (set to enable) | A2, A3 |
| --enable-p2p-check | False | bool flag (set to enable) | Special for GPU |
| --triton-attention-reduce-in-fp32 | False | bool flag (set to enable) | Special for GPU |
| --triton-attention-num-kv-splits | 8 | Type: int | Special for GPU |
| --triton-attention-split-tile-size | None | Type: int | Special for GPU |
| --delete-ckpt-after-loading | False | bool flag (set to enable) | A2, A3 |
| --enable-memory-saver | False | bool flag (set to enable) | A2, A3 |
| --enable-weights-cpu-backup | False | bool flag (set to enable) | A2, A3 |
| --enable-draft-weights-cpu-backup | False | bool flag (set to enable) | A2, A3 |
| --allow-auto-truncate | False | bool flag (set to enable) | A2, A3 |
| --enable-custom-logit-processor | False | bool flag (set to enable) | A2, A3 |
| --flashinfer-mla-disable-ragged | False | bool flag (set to enable) | Special for GPU |
| --disable-shared-experts-fusion | False | bool flag (set to enable) | A2, A3 |
| --disable-chunked-prefix-cache | False | bool flag (set to enable) | A2, A3 |
| --disable-fast-image-processor | False | bool flag (set to enable) | A2, A3 |
| --keep-mm-feature-on-device | False | bool flag (set to enable) | A2, A3 |
| --enable-return-hidden-states | False | bool flag (set to enable) | A2, A3 |
| --enable-return-routed-experts | False | bool flag (set to enable) | A2, A3 |
| --scheduler-recv-interval | 1 | Type: int | A2, A3 |
| --numa-node | None | List[int] | A2, A3 |
| --rl-on-policy-target | None | fsdp | Planned |
| --enable-layerwise-nvtx-marker | False | bool flag (set to enable) | Special for GPU |
| --enable-attn-tp-input-scattered | False | bool flag (set to enable) | Experimental |
| --enable-nsa-prefill-context-parallel | False | bool flag (set to enable) | A2, A3 |
| --enable-fused-qk-norm-rope | False | bool flag (set to enable) | Special for GPU |
Dynamic batch tokenizer
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
| --enable-dynamic-batch-tokenizer | False | bool flag (set to enable) | A2, A3 |
| --dynamic-batch-tokenizer-batch-size | 32 | Type: int | A2, A3 |
| --dynamic-batch-tokenizer-batch-timeout | 0.002 | Type: float | A2, A3 |
Debug tensor dumps
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
| --debug-tensor-dump-output-folder | None | Type: str | A2, A3 |
| --debug-tensor-dump-layers | None | List[int] | A2, A3 |
| --debug-tensor-dump-input-file | None | Type: str | A2, A3 |
PD disaggregation
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
| --disaggregation-mode | null | null, prefill, decode | A2, A3 |
| --disaggregation-transfer-backend | mooncake | ascend | A2, A3 |
| --disaggregation-bootstrap-port | 8998 | Type: int | A2, A3 |
| --disaggregation-decode-tp | None | Type: int | A2, A3 |
| --disaggregation-decode-dp | None | Type: int | A2, A3 |
| --disaggregation-ib-device | None | Type: str | Special for GPU |
| --disaggregation-decode-enable-offload-kvcache | False | bool flag (set to enable) | A2, A3 |
| --disaggregation-decode-enable-fake-auto | False | bool flag (set to enable) | A2, A3 |
| --num-reserved-decode-tokens | 512 | Type: int | A2, A3 |
| --disaggregation-decode-polling-interval | 1 | Type: int | A2, A3 |
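
Note that the table lists `mooncake` as the default for `--disaggregation-transfer-backend` but only `ascend` as an NPU option, so the backend presumably has to be set explicitly on the NPU. The sketch below shows the flag fragment for a prefill instance and a decode instance under that reading. `pd_flags` is a hypothetical helper, and a complete deployment may need further flags and components that this section does not describe.

```python
def pd_flags(mode):
    """Return the PD disaggregation flags for one server instance.

    mode must be "prefill" or "decode"; the table's default "null"
    means disaggregation is off, so no flags are needed in that case.
    """
    if mode not in ("prefill", "decode"):
        raise ValueError(f"unsupported disaggregation mode: {mode!r}")
    return [
        "--disaggregation-mode", mode,
        # Only "ascend" is listed as an NPU option; the default is "mooncake".
        "--disaggregation-transfer-backend", "ascend",
    ]


for mode in ("prefill", "decode"):
    print(mode, "->", " ".join(pd_flags(mode)))
```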
Encode prefill disaggregation
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
| --encoder-only | False | bool flag (set to enable) | A2, A3 |
| --language-only | False | bool flag (set to enable) | A2, A3 |
| --encoder-transfer-backend | zmq_to_scheduler | zmq_to_scheduler, zmq_to_tokenizer, mooncake | A2, A3 |
| --encoder-urls | [] | List[str] | A2, A3 |
Custom weight loader
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
| --custom-weight-loader | None | List[str] | A2, A3 |
| --weight-loader-disable-mmap | False | bool flag (set to enable) | A2, A3 |
| --remote-instance-weight-loader-seed-instance-ip | None | Type: str | A2, A3 |
| --remote-instance-weight-loader-seed-instance-service-port | None | Type: int | A2, A3 |
| --remote-instance-weight-loader-send-weights-group-ports | None | Type: JSON list | A2, A3 |
| --remote-instance-weight-loader-backend | nccl | transfer_engine, nccl | A2, A3 |
| --remote-instance-weight-loader-start-seed-via-transfer-engine | False | bool flag (set to enable) | Special for GPU |
For PD-Multiplexing
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
| --enable-pdmux | False | bool flag (set to enable) | Special for GPU |
| --pdmux-config-path | None | Type: str | Special for GPU |
| --sm-group-num | 8 | Type: int | Special for GPU |
For Multi-Modal
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
| --mm-max-concurrent-calls | 32 | Type: int | A2, A3 |
| --mm-per-request-timeout | 10.0 | Type: float | A2, A3 |
| --enable-broadcast-mm-inputs-process | False | bool flag (set to enable) | A2, A3 |
| --mm-process-config | None | Type: JSON / Dict | A2, A3 |
| --mm-enable-dp-encoder | False | bool flag (set to enable) | A2, A3 |
| --limit-mm-data-per-request | None | Type: JSON / Dict | A2, A3 |
For checkpoint decryption
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
| --decrypted-config-file | None | Type: str | A2, A3 |
| --decrypted-draft-config-file | None | Type: str | A2, A3 |
| --enable-prefix-mm-cache | False | bool flag (set to enable) | A2, A3 |
For deterministic inference
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
| --enable-deterministic-inference | False | bool flag (set to enable) | Planned |
For registering hooks
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
| --forward-hooks | None | Type: JSON list | A2, A3 |
Configuration file support
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
| --config | None | Type: str | A2, A3 |
Other Params
The following parameters are not supported because the third-party components they depend on, such as Ktransformer and checkpoint-engine, are not compatible with the NPU.
| Argument | Defaults | Options |
|---|---|---|
| --checkpoint-engine-wait-weights-before-ready | False | bool flag (set to enable) |
| --kt-weight-path | None | Type: str |
| --kt-method | AMXINT4 | Type: str |
| --kt-cpuinfer | None | Type: int |
| --kt-threadpool-count | 2 | Type: int |
| --kt-num-gpu-experts | None | Type: int |
| --kt-max-deferred-experts-per-token | None | Type: int |
The following parameters have some functional deficiencies in the community implementation.
| Argument | Defaults | Options |
|---|---|---|
| --enable-double-sparsity | False | bool flag (set to enable) |
| --ds-channel-config-path | None | Type: str |
| --ds-heavy-channel-num | 32 | Type: int |
| --ds-heavy-token-num | 256 | Type: int |
| --ds-heavy-channel-type | qk | Type: str |
| --ds-sparse-decode-threshold | 4096 | Type: int |
| --tool-server | None | Type: str |