Hanrui / sglang /docs /references /environment_variables.md
Lekr0's picture
Add files using upload-large-folder tool
6268841 verified

Environment Variables

SGLang supports various environment variables that can be used to configure its runtime behavior. This document provides a comprehensive list and aims to stay updated over time.

Note: SGLang uses two prefixes for environment variables: SGL_ and SGLANG_. This is likely due to historical reasons. While both are currently supported for different settings, future versions might consolidate them.

General Configuration

Environment Variable Description Default Value
SGLANG_USE_MODELSCOPE Enable using models from ModelScope false
SGLANG_HOST_IP Host IP address for the server 0.0.0.0
SGLANG_PORT Port for the server auto-detected
SGLANG_LOGGING_CONFIG_PATH Custom logging configuration path Not set
SGLANG_DISABLE_REQUEST_LOGGING Disable request logging false
SGLANG_LOG_REQUEST_HEADERS Comma-separated list of additional HTTP headers to log when --log-requests is enabled. Appends to the default x-smg-routing-key. Not set
SGLANG_HEALTH_CHECK_TIMEOUT Timeout for health check in seconds 20
SGLANG_EPLB_HEATMAP_COLLECTION_INTERVAL The interval of passes to collect the metric of selected count of physical experts on each layer and GPU rank. 0 means disabled. 0
SGLANG_FORWARD_UNKNOWN_TOOLS Forward unknown tool calls to clients instead of dropping them false (drop unknown tools)
SGLANG_REQ_WAITING_TIMEOUT Timeout (in seconds) for requests waiting in the queue before being scheduled -1
SGLANG_REQ_RUNNING_TIMEOUT Timeout (in seconds) for requests running in the decode batch -1

Performance Tuning

Environment Variable Description Default Value
SGLANG_ENABLE_TORCH_INFERENCE_MODE Control whether to use torch.inference_mode false
SGLANG_ENABLE_TORCH_COMPILE Enable torch.compile true
SGLANG_SET_CPU_AFFINITY Enable CPU affinity setting (often set to 1 in Docker builds) 0
SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN Allows the scheduler to overwrite longer context length requests (often set to 1 in Docker builds) 0
SGLANG_IS_FLASHINFER_AVAILABLE Control FlashInfer availability check true
SGLANG_SKIP_P2P_CHECK Skip P2P (peer-to-peer) access check false
SGLANG_CHUNKED_PREFIX_CACHE_THRESHOLD Sets the threshold for enabling chunked prefix caching 8192
SGLANG_FUSED_MLA_ENABLE_ROPE_FUSION Enable RoPE fusion in Fused Multi-Layer Attention 1
SGLANG_DISABLE_CONSECUTIVE_PREFILL_OVERLAP Disable overlap schedule for consecutive prefill batches false
SGLANG_SCHEDULER_MAX_RECV_PER_POLL Set the maximum number of requests per poll, with a negative value indicating no limit -1
SGLANG_DISABLE_FA4_WARMUP Disable Flash Attention 4 warmup passes (set to 1, true, yes, or on to disable) false
SGLANG_DATA_PARALLEL_BUDGET_INTERVAL Interval for DPBudget updates 1
SGLANG_SCHEDULER_RECV_SKIPPER_WEIGHT_DEFAULT Default weight value for scheduler recv skipper counter (used when forward mode doesn't match specific modes). Only active when --scheduler-recv-interval > 1. The counter accumulates weights and triggers request polling when reaching the interval threshold. 1000
SGLANG_SCHEDULER_RECV_SKIPPER_WEIGHT_DECODE Weight increment for decode forward mode in scheduler recv skipper. Works with --scheduler-recv-interval to control polling frequency during decode phase. 1
SGLANG_SCHEDULER_RECV_SKIPPER_WEIGHT_VERIFY Weight increment for target verify forward mode in scheduler recv skipper. Works with --scheduler-recv-interval to control polling frequency during verification phase. 1
SGLANG_SCHEDULER_RECV_SKIPPER_WEIGHT_NONE Weight increment when forward mode is None in scheduler recv skipper. Works with --scheduler-recv-interval to control polling frequency when no specific forward mode is active. 1
SGLANG_MM_BUFFER_SIZE_MB Size of preallocated GPU buffer (in MB) for multi-modal feature hashing optimization. When set to a positive value, temporarily moves features to GPU for faster hash computation, then moves them back to CPU to save GPU memory. Larger features benefit more from GPU hashing. Set to 0 to disable. 0
SGLANG_MM_PRECOMPUTE_HASH Enable precomputing of hash values for MultimodalDataItem false
SGLANG_NCCL_ALL_GATHER_IN_OVERLAP_SCHEDULER_SYNC_BATCH Enable NCCL for gathering when preparing mlp sync batch under overlap scheduler (without this flag gloo is used for gathering) false
SGLANG_SYMM_MEM_PREALLOC_GB_SIZE Size of preallocated GPU buffer (in GB) for NCCL symmetric memory pool to limit memory fragmentation. Only have an effect when server arg --enable-symm-mem is set. 4
SGLANG_CUSTOM_ALLREDUCE_ALGO The algorithm of custom all-reduce. Set to oneshot or 1stage to force use one-shot. Set to twoshot or 2stage to force use two-shot. ``

DeepGEMM Configuration (Advanced Optimization)

Environment Variable Description Default Value
SGLANG_ENABLE_JIT_DEEPGEMM Enable Just-In-Time compilation of DeepGEMM kernels (enabled by default on NVIDIA Hopper (SM90) and Blackwell (SM100) GPUs when the DeepGEMM package is installed; set to "0" to disable) "true"
SGLANG_JIT_DEEPGEMM_PRECOMPILE Enable precompilation of DeepGEMM kernels "true"
SGLANG_JIT_DEEPGEMM_COMPILE_WORKERS Number of workers for parallel DeepGEMM kernel compilation 4
SGLANG_IN_DEEPGEMM_PRECOMPILE_STAGE Indicator flag used during the DeepGEMM precompile script "false"
SGLANG_DG_CACHE_DIR Directory for caching compiled DeepGEMM kernels ~/.cache/deep_gemm
SGLANG_DG_USE_NVRTC Use NVRTC (instead of Triton) for JIT compilation (Experimental) "0"
SGLANG_USE_DEEPGEMM_BMM Use DeepGEMM for Batched Matrix Multiplication (BMM) operations "false"
SGLANG_JIT_DEEPGEMM_FAST_WARMUP Precompile less kernels during warmup, which reduces the warmup time from 30min to less than 3min. Might cause performance degradation during runtime. "false"

DeepEP Configuration

Environment Variable Description Default Value
SGLANG_DEEPEP_BF16_DISPATCH Use Bfloat16 for dispatch "false"
SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK The maximum number of dispatched tokens on each GPU "128"
SGLANG_FLASHINFER_NUM_MAX_DISPATCH_TOKENS_PER_RANK The maximum number of dispatched tokens on each GPU for --moe-a2a-backend=flashinfer "1024"
SGLANG_DEEPEP_LL_COMBINE_SEND_NUM_SMS Number of SMs used for DeepEP combine when single batch overlap is enabled "32"
SGLANG_BLACKWELL_OVERLAP_SHARED_EXPERTS_OUTSIDE_SBO Run shared experts on an alternate stream when single batch overlap is enabled on GB200. When not setting this flag, shared experts and down gemm will be overlapped with DeepEP combine together. "false"

MORI Configuration

Environment Variable Description Default Value
SGLANG_MORI_FP8_DISP Use FP8 for dispatch "false"
SGLANG_MORI_NUM_MAX_DISPATCH_TOKENS_PER_RANK Maximum number of dispatch tokens per rank for MORI-EP buffer allocation 4096
SGLANG_MORI_DISPATCH_INTER_KERNEL_SWITCH_THRESHOLD Threshold for switching between InterNodeV1 and InterNodeV1LL kernel types. InterNodeV1LL is used if SGLANG_MORI_NUM_MAX_DISPATCH_TOKENS_PER_RANK is less than or equal to this threshold; otherwise, InterNodeV1 is used. 256
SGLANG_MORI_QP_PER_TRANSFER Number of RDMA Queue Pairs (QPs) used per transfer operation 1
SGLANG_MORI_POST_BATCH_SIZE Number of RDMA work requests posted in a single batch to each QP -1
SGLANG_MORI_NUM_WORKERS Number of worker threads in the RDMA executor thread pool 1

NSA Backend Configuration (For DeepSeek V3.2)

Environment Variable Description Default Value
SGLANG_NSA_FUSE_TOPK Fuse the operation of picking topk logits and picking topk indices from page table true
SGLANG_NSA_ENABLE_MTP_PRECOMPUTE_METADATA Precompute metadata that can be shared among different draft steps when MTP is enabled true

Memory Management

Environment Variable Description Default Value
SGLANG_DEBUG_MEMORY_POOL Enable memory pool debugging false
SGLANG_CLIP_MAX_NEW_TOKENS_ESTIMATION Clip max new tokens estimation for memory planning 4096
SGLANG_DETOKENIZER_MAX_STATES Maximum states for detokenizer Default value based on system
SGLANG_ENABLE_TP_MEMORY_INBALANCE_CHECK Enable checks for memory imbalance across Tensor Parallel ranks true
SGLANG_MOONCAKE_CUSTOM_MEM_POOL Configure the custom memory pool type for Mooncake. Supports NVLINK, BAREX, INTRA_NODE_NVLINK. If set to true, it defaults to NVLINK. None

Model-Specific Options

Environment Variable Description Default Value
SGLANG_USE_AITER Use AITER optimize implementation false
SGLANG_MOE_PADDING Enable MoE padding (sets padding size to 128 if value is 1, often set to 1 in Docker builds) 0
SGLANG_CUTLASS_MOE (deprecated) Use Cutlass FP8 MoE kernel on Blackwell GPUs (deprecated, use --moe-runner-backend=cutlass) false

Quantization

Environment Variable Description Default Value
SGLANG_INT4_WEIGHT Enable INT4 weight quantization false
SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2 Apply per token group quantization kernel with fused silu and mul and masked m false
SGLANG_FORCE_FP8_MARLIN Force using FP8 MARLIN kernels even if other FP8 kernels are available false
SGLANG_FLASHINFER_FP4_GEMM_BACKEND (deprecated) Select backend for mm_fp4 on Blackwell GPUs. DEPRECATED: Please use --fp4-gemm-backend instead. ``
SGLANG_NVFP4_CKPT_FP8_GEMM_IN_ATTN Quantize q_b_proj from BF16 to FP8 when launching DeepSeek NVFP4 checkpoint false
SGLANG_MOE_NVFP4_DISPATCH Use nvfp4 for moe dispatch (on flashinfer_cutlass or flashinfer_cutedsl moe runner backend) "false"
SGLANG_NVFP4_CKPT_FP8_NEXTN_MOE Quantize moe of nextn layer from BF16 to FP8 when launching DeepSeek NVFP4 checkpoint false
SGLANG_ENABLE_FLASHINFER_FP8_GEMM (deprecated) Use flashinfer kernels when running blockwise fp8 GEMM on Blackwell GPUs. DEPRECATED: Please use --fp8-gemm-backend=flashinfer_trtllm (SM100/SM103) or --fp8-gemm-backend=flashinfer_cutlass (SM120/SM121 and newer) instead. false
SGLANG_SUPPORT_CUTLASS_BLOCK_FP8 (deprecated) Use Cutlass kernels when running blockwise fp8 GEMM on Hopper or Blackwell GPUs. DEPRECATED: Please use --fp8-gemm-backend=cutlass instead. false
SGLANG_QUANT_ALLOW_DOWNCASTING Allow weights downcasting false

Distributed Computing

Environment Variable Description Default Value
SGLANG_BLOCK_NONZERO_RANK_CHILDREN Control blocking of non-zero rank children processes 1
SGLANG_IS_FIRST_RANK_ON_NODE Indicates if the current process is the first rank on its node "true"
SGLANG_PP_LAYER_PARTITION Pipeline parallel layer partition specification Not set
SGLANG_ONE_VISIBLE_DEVICE_PER_PROCESS Set one visible device per process for distributed computing false

Testing & Debugging (Internal/CI)

These variables are primarily used for internal testing, continuous integration, or debugging.

Environment Variable Description Default Value
SGLANG_IS_IN_CI Indicates if running in CI environment false
SGLANG_IS_IN_CI_AMD Indicates running in AMD CI environment 0
SGLANG_TEST_RETRACT Enable retract decode testing false
SGLANG_TEST_RETRACT_NO_PREFILL_BS When SGLANG_TEST_RETRACT is enabled, no prefill is performed if the batch size exceeds SGLANG_TEST_RETRACT_NO_PREFILL_BS. 2 ** 31
SGLANG_RECORD_STEP_TIME Record step time for profiling false
SGLANG_TEST_REQUEST_TIME_STATS Test request time statistics false

Profiling & Benchmarking

Environment Variable Description Default Value
SGLANG_TORCH_PROFILER_DIR Directory for PyTorch profiler output /tmp
SGLANG_PROFILE_WITH_STACK Set with_stack option (bool) for PyTorch profiler (capture stack trace) true
SGLANG_PROFILE_RECORD_SHAPES Set record_shapes option (bool) for PyTorch profiler (record shapes) true
SGLANG_OTLP_EXPORTER_SCHEDULE_DELAY_MILLIS Config BatchSpanProcessor.schedule_delay_millis if tracing is enabled 500
SGLANG_OTLP_EXPORTER_MAX_EXPORT_BATCH_SIZE Config BatchSpanProcessor.max_export_batch_size if tracing is enabled 64

Storage & Caching

Environment Variable Description Default Value
SGLANG_WAIT_WEIGHTS_READY_TIMEOUT Timeout period for waiting on weights 120
SGLANG_DISABLE_OUTLINES_DISK_CACHE Disable Outlines disk cache true
SGLANG_USE_CUSTOM_TRITON_KERNEL_CACHE Use SGLang's custom Triton kernel cache implementation for lower overheads (automatically enabled on CUDA) false

Function Calling / Tool Use

Environment Variable Description Default Value
SGLANG_TOOL_STRICT_LEVEL Controls the strictness level of tool call parsing and validation.
Level 0: Off - No strict validation
Level 1: Function strict - Enables structural tag constraints for all tools (even if none have strict=True set)
Level 2: Parameter strict - Enforces strict parameter validation for all tools, treating them as if they all have strict=True set
0