Unable to use full 192k context in SGLang with MiniMax-M2.7-NVFP4 (runtime capped at ~80,964 tokens)
Hi,
I’m seeing a mismatch between the configured context length and the actual usable context length when serving this quantized model with SGLang.
What I expected
I launched the server with:
--context-length 196608
- model: lukealonso/MiniMax-M2.7-NVFP4
- quantization: modelopt_fp4
- TP=2
- 2x Blackwell GPUs
I expected to be able to use roughly the advertised 192k context window.
What actually happens
SGLang starts successfully and reports:
context_length=196608
But at runtime the effective maximum context appears to be only about 80,964 to 80,970 tokens.
From the logs:
- KV Cache is allocated. #tokens: 80970
- max_total_num_tokens=80970
- later requests fail with:
Input length (104433 tokens) exceeds the maximum allowed length (80964 tokens). Use a shorter input or enable --allow-auto-truncate.
So even though --context-length 196608 is accepted, I cannot actually send prompts anywhere close to 192k.
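The numbers above suggest that the per-request input limit follows the allocated KV-cache pool rather than --context-length. Here is a minimal sketch of that relationship. The function name and the 6-token reserve are assumptions inferred from the gap between 80970 and 80964 in the logs, not taken from SGLang's source:

```python
def effective_input_limit(context_length: int, kv_pool_tokens: int, reserve: int = 6) -> int:
    """Hypothetical helper: the smaller of the configured context length and the
    KV-cache pool size, minus a small reserve (6 is inferred from the logs)."""
    return min(context_length, kv_pool_tokens) - reserve


# Values from this report: --context-length 196608, max_total_num_tokens=80970
limit = effective_input_limit(196608, 80970)
print(limit)            # 80964, matching the limit in the error message
print(104433 <= limit)  # False, so the 104433-token prompt is rejected
```

If this reading is right, raising --context-length alone cannot help. Only a larger KV-cache pool would lift the limit.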
Environment
- SGLang image: lmsysorg/sglang:dev-cu13
- CUDA: 13.0.1
- GPUs: 2x Blackwell
- command:
docker run --rm -it \
--gpus '"device=0,2"' \
--shm-size 32g \
-p 10002:8000 \
-v /media/mukul/data/models:/models \
-e PYTORCH_ALLOC_CONF=expandable_segments:True \
lmsysorg/sglang:dev-cu13 \
python -m sglang.launch_server \
--model-path /models/lukealonso/MiniMax-M2.7-NVFP4 \
--served-model-name jarvis-thinker \
--tp-size 2 \
--quantization modelopt_fp4 \
--tool-call-parser minimax-m2 \
--reasoning-parser minimax \
--host 0.0.0.0 \
--port 8000 \
--trust-remote-code \
--dtype auto \
--mem-fraction-static 0.85 \
--context-length 196608 \
--max-running-requests 16 \
--chunked-prefill-size 16384 \
--sleep-on-idle
Relevant log snippets
server_args=... context_length=196608 ...
KV Cache is allocated. #tokens: 80970
max_total_num_tokens=80970, chunked_prefill_size=16384, max_prefill_tokens=16384, max_running_requests=16, context_len=196608
Input length (104433 tokens) exceeds the maximum allowed length (80964 tokens). Use a shorter input or enable --allow-auto-truncate.
Question
Is this expected for this NVFP4 quant on this hardware, or is there something specific I need to change to actually get close to the full 192k context?
In particular, I’d like to understand:
- Whether this quantized checkpoint is intended to support the full 192k context in SGLang.
- Whether the limitation is from the checkpoint / KV-cache requirements rather than the model config.
- Whether there are recommended launch settings to get closer to the full context length.
Thanks!
[2026-04-19 01:35:00 TP0] Decode batch, #running-req: 1, #token: 75387, token usage: 0.93, cuda graph: True, gen throughput (token/s): 62.56, #queue-req: 0
[2026-04-19 01:35:00 TP0] Decode batch, #running-req: 1, #token: 75427, token usage: 0.93, cuda graph: True, gen throughput (token/s): 62.59, #queue-req: 0
[2026-04-19 01:35:01 TP0] Decode batch, #running-req: 1, #token: 75467, token usage: 0.93, cuda graph: True, gen throughput (token/s): 62.53, #queue-req: 0
[2026-04-19 01:35:02 TP0] Decode batch, #running-req: 1, #token: 75507, token usage: 0.93, cuda graph: True, gen throughput (token/s): 62.55, #queue-req: 0
[2026-04-19 01:46:46 TP0] Input length (104433 tokens) exceeds the maximum allowed length (80964 tokens). Use a shorter input or enable --allow-auto-truncate., self.rid='9191a9b6635b4d7286dadd9f10d2c76f'
[2026-04-19 01:46:46 TP0] Prefill batch, #new-seq: 1, #new-token: 1, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, #pending-token: 0, cuda graph: True, input throughput (token/s): 0.00
[2026-04-19 01:46:46] INFO: 172.17.0.1:48556 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2026-04-19 01:46:49 TP0] Input length (104433 tokens) exceeds the maximum allowed length (80964 tokens). Use a shorter input or enable --allow-auto-truncate., self.rid='e4a97a0674924fdf8598024faca138d1'
[2026-04-19 01:46:49 TP0] Prefill batch, #new-seq: 1, #new-token: 1, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, #pending-token: 0, cuda graph: True, input throughput (token/s): 0.43
[2026-04-19 01:46:49] INFO: 172.17.0.1:48556 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2026-04-19 01:46:53 TP0] Input length (104433 tokens) exceeds the maximum allowed length (80964 tokens). Use a shorter input or enable --allow-auto-truncate., self.rid='40cc016d21bb4d0b9aecddd992c5da55'
[2026-04-19 01:46:53 TP0] Prefill batch, #new-seq: 1, #new-token: 1, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, #pending-token: 0, cuda graph: True, input throughput (token/s): 0.23
[2026-04-19 01:46:53] INFO: 172.17.0.1:48556 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2026-04-19 01:47:01 TP0] Input length (104433 tokens) exceeds the maximum allowed length (80964 tokens). Use a shorter input or enable --allow-auto-truncate., self.rid='beb7d2b333cb4d3c9394a3c7146d873e'
[2026-04-19 01:47:01 TP0] Prefill batch, #new-seq: 1, #new-token: 1, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, #pending-token: 0, cuda graph: True, input throughput (token/s): 0.12
[2026-04-19 01:47:01] INFO: 172.17.0.1:42090 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2026-04-19 01:47:18 TP0] Input length (104433 tokens) exceeds the maximum allowed length (80964 tokens). Use a shorter input or enable --allow-auto-truncate., self.rid='495eb8ce1ea64e63b3532d1708deed0b'
[2026-04-19 01:47:18 TP0] Prefill batch, #new-seq: 1, #new-token: 1, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, #pending-token: 0, cuda graph: True, input throughput (token/s): 0.06
[2026-04-19 01:47:18] INFO: 172.17.0.1:54824 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2026-04-19 01:47:50 TP0] Input length (104433 tokens) exceeds the maximum allowed length (80964 tokens). Use a shorter input or enable --allow-auto-truncate., self.rid='1b2f88b8aca245d6b19b0b34a6ad2c04'
[2026-04-19 01:47:50 TP0] Prefill batch, #new-seq: 1, #new-token: 1, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, #pending-token: 0, cuda graph: True, input throughput (token/s): 0.03
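The `token usage` value in the Decode batch lines above is consistent with the running token count divided by the KV-cache pool size (max_total_num_tokens=80970), not by the configured context length. A quick check, assuming that is how the field is computed (the helper name is hypothetical):

```python
def token_usage(running_tokens: int, kv_pool_tokens: int) -> float:
    """Hypothetical helper: fraction of the KV-cache pool in use, rounded to 2 places."""
    return round(running_tokens / kv_pool_tokens, 2)


# #token values from the Decode batch log lines, pool size from the startup log
print(token_usage(75387, 80970))   # 0.93
print(token_usage(75507, 80970))   # 0.93
# Against the configured context length the same count would look far lower
print(token_usage(75507, 196608))  # 0.38
```

So a single request of about 75k tokens already fills roughly 93% of the pool.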
@mtcl
It is tight. You can try to increase --mem-fraction-static, possibly decrease --chunked-prefill-size, to get to around 130K @ bf16 context.
Or go the --kv-cache-dtype fp8_e4m3 way if you need more. Also, any reason not to use b12x?
See here: https://huggingface.co/lukealonso/MiniMax-M2.7-NVFP4/discussions/4#69e10ba74010ebbc8be99f80
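As a rough back-of-the-envelope for the fp8 suggestion: if the KV cache is currently stored in bf16 (2 bytes per element, an assumption based on the "130K @ bf16" remark), switching to fp8_e4m3 (1 byte per element) should roughly double the number of tokens that fit in the same memory. The sketch below ignores any per-tensor scale overhead and other buffers, and the helper name is hypothetical:

```python
# Bytes per stored KV element for each cache dtype
BYTES_PER_ELEMENT = {"bf16": 2, "fp8_e4m3": 1}


def scaled_kv_tokens(current_tokens: int, current_dtype: str, new_dtype: str) -> int:
    """Hypothetical estimate: token capacity after changing the KV dtype,
    holding the KV-cache memory budget fixed."""
    return current_tokens * BYTES_PER_ELEMENT[current_dtype] // BYTES_PER_ELEMENT[new_dtype]


# Current pool from the logs, assumed to be bf16
print(scaled_kv_tokens(80970, "bf16", "fp8_e4m3"))              # 161940
# The ~130K bf16 figure mentioned above, after memory tuning
print(scaled_kv_tokens(130_000, "bf16", "fp8_e4m3") >= 196608)  # True
```

By this estimate, fp8 alone would still fall short of 196608 tokens on the current pool. Combined with the memory tuning above it could plausibly cover the full window, if the 130K figure holds on this hardware.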
Same happens in b12x.
M2.5 used to go to 192k easily so it looks like something is odd here...