Unable to use full 192k context in SGLang with MiniMax-M2.7-NVFP4 (runtime capped at ~80,964 tokens)

by mtcl - opened Apr 19

Apr 19

Hi,

I’m seeing a mismatch between the configured context length and the actual usable context length when serving this quantized model with SGLang.

What I expected

I launched the server with:

--context-length 196608
model: lukealonso/MiniMax-M2.7-NVFP4
quantization: modelopt_fp4
TP=2
2x Blackwell GPUs

I expected to be able to use roughly the advertised 192k context window.

What actually happens

SGLang starts successfully and reports:

context_length=196608

But at runtime the effective maximum is only about 81k tokens: the KV cache pool holds 80,970 tokens, and requests are capped at 80,964 input tokens.

From the logs:

KV Cache is allocated. #tokens: 80970
max_total_num_tokens=80970
later requests fail with:
- Input length (104433 tokens) exceeds the maximum allowed length (80964 tokens). Use a shorter input or enable --allow-auto-truncate.

So even though --context-length 196608 is accepted, I cannot actually send prompts anywhere close to 192k.

Environment

SGLang image: lmsysorg/sglang:dev-cu13
CUDA: 13.0.1
GPUs: 2x Blackwell
command:

docker run --rm -it \
  --gpus '"device=0,2"' \
  --shm-size 32g \
  -p 10002:8000 \
  -v /media/mukul/data/models:/models \
  -e PYTORCH_ALLOC_CONF=expandable_segments:True \
  lmsysorg/sglang:dev-cu13 \
  python -m sglang.launch_server \
    --model-path /models/lukealonso/MiniMax-M2.7-NVFP4 \
    --served-model-name jarvis-thinker \
    --tp-size 2 \
    --quantization modelopt_fp4 \
    --tool-call-parser minimax-m2 \
    --reasoning-parser minimax \
    --host 0.0.0.0 \
    --port 8000 \
    --trust-remote-code \
    --dtype auto \
    --mem-fraction-static 0.85 \
    --context-length 196608 \
    --max-running-requests 16 \
    --chunked-prefill-size 16384 \
    --sleep-on-idle

Relevant log snippets

server_args=... context_length=196608 ...

KV Cache is allocated. #tokens: 80970

max_total_num_tokens=80970, chunked_prefill_size=16384, max_prefill_tokens=16384, max_running_requests=16, context_len=196608

Input length (104433 tokens) exceeds the maximum allowed length (80964 tokens). Use a shorter input or enable --allow-auto-truncate.
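To make the mismatch explicit, here is a minimal sketch that pulls the two numbers out of the log line above and compares them. The helper name `parse_limits` is made up for illustration; only the log text itself comes from the server.

```python
import re

# Log line copied from the report above.
LOG_LINE = (
    "max_total_num_tokens=80970, chunked_prefill_size=16384, "
    "max_prefill_tokens=16384, max_running_requests=16, context_len=196608"
)


def parse_limits(line: str) -> dict[str, int]:
    """Extract every `name=integer` pair from a log line (hypothetical helper)."""
    return {key: int(value) for key, value in re.findall(r"(\w+)=(\d+)", line)}


limits = parse_limits(LOG_LINE)
kv_pool_tokens = limits["max_total_num_tokens"]  # tokens the KV cache pool can hold
configured_context = limits["context_len"]       # value passed via --context-length

shortfall = configured_context - kv_pool_tokens
fraction = kv_pool_tokens / configured_context

print(f"KV pool: {kv_pool_tokens} tokens, configured context: {configured_context}")
print(f"Shortfall: {shortfall} tokens ({fraction:.1%} of the configured context fits)")
```

On these numbers the KV pool covers well under half of the configured context, which matches the rejected 104,433-token request.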

Question

Is this expected for this NVFP4 quant on this hardware, or is there something specific I need to change to actually get close to the full 192k context?

In particular, I’d like to understand:

- Whether this quantized checkpoint is intended to support the full 192k context in SGLang.
- Whether the limitation is from the checkpoint / KV-cache requirements rather than the model config.
- Whether there are recommended launch settings to get closer to the full context length.

Thanks!

[2026-04-19 01:35:00 TP0] Decode batch, #running-req: 1, #token: 75387, token usage: 0.93, cuda graph: True, gen throughput (token/s): 62.56, #queue-req: 0
[2026-04-19 01:35:00 TP0] Decode batch, #running-req: 1, #token: 75427, token usage: 0.93, cuda graph: True, gen throughput (token/s): 62.59, #queue-req: 0
[2026-04-19 01:35:01 TP0] Decode batch, #running-req: 1, #token: 75467, token usage: 0.93, cuda graph: True, gen throughput (token/s): 62.53, #queue-req: 0
[2026-04-19 01:35:02 TP0] Decode batch, #running-req: 1, #token: 75507, token usage: 0.93, cuda graph: True, gen throughput (token/s): 62.55, #queue-req: 0
[2026-04-19 01:46:46 TP0] Input length (104433 tokens) exceeds the maximum allowed length (80964 tokens). Use a shorter input or enable --allow-auto-truncate., self.rid='9191a9b6635b4d7286dadd9f10d2c76f'
[2026-04-19 01:46:46 TP0] Prefill batch, #new-seq: 1, #new-token: 1, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, #pending-token: 0, cuda graph: True, input throughput (token/s): 0.00
[2026-04-19 01:46:46] INFO:     172.17.0.1:48556 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2026-04-19 01:46:49 TP0] Input length (104433 tokens) exceeds the maximum allowed length (80964 tokens). Use a shorter input or enable --allow-auto-truncate., self.rid='e4a97a0674924fdf8598024faca138d1'
[2026-04-19 01:46:49 TP0] Prefill batch, #new-seq: 1, #new-token: 1, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, #pending-token: 0, cuda graph: True, input throughput (token/s): 0.43
[2026-04-19 01:46:49] INFO:     172.17.0.1:48556 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2026-04-19 01:46:53 TP0] Input length (104433 tokens) exceeds the maximum allowed length (80964 tokens). Use a shorter input or enable --allow-auto-truncate., self.rid='40cc016d21bb4d0b9aecddd992c5da55'
[2026-04-19 01:46:53 TP0] Prefill batch, #new-seq: 1, #new-token: 1, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, #pending-token: 0, cuda graph: True, input throughput (token/s): 0.23
[2026-04-19 01:46:53] INFO:     172.17.0.1:48556 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2026-04-19 01:47:01 TP0] Input length (104433 tokens) exceeds the maximum allowed length (80964 tokens). Use a shorter input or enable --allow-auto-truncate., self.rid='beb7d2b333cb4d3c9394a3c7146d873e'
[2026-04-19 01:47:01 TP0] Prefill batch, #new-seq: 1, #new-token: 1, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, #pending-token: 0, cuda graph: True, input throughput (token/s): 0.12
[2026-04-19 01:47:01] INFO:     172.17.0.1:42090 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2026-04-19 01:47:18 TP0] Input length (104433 tokens) exceeds the maximum allowed length (80964 tokens). Use a shorter input or enable --allow-auto-truncate., self.rid='495eb8ce1ea64e63b3532d1708deed0b'
[2026-04-19 01:47:18 TP0] Prefill batch, #new-seq: 1, #new-token: 1, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, #pending-token: 0, cuda graph: True, input throughput (token/s): 0.06
[2026-04-19 01:47:18] INFO:     172.17.0.1:54824 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2026-04-19 01:47:50 TP0] Input length (104433 tokens) exceeds the maximum allowed length (80964 tokens). Use a shorter input or enable --allow-auto-truncate., self.rid='1b2f88b8aca245d6b19b0b34a6ad2c04'
[2026-04-19 01:47:50 TP0] Prefill batch, #new-seq: 1, #new-token: 1, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, #pending-token: 0, cuda graph: True, input throughput (token/s): 0.03

sousekd

Apr 19

edited Apr 19

@mtcl
It is tight. You can try increasing --mem-fraction-static, and possibly decreasing --chunked-prefill-size, to get to around 130K context with a bf16 KV cache.
Or go the --kv-cache-dtype fp8_e4m3 way if you need more. Also, any reason not to use b12x?
See here: https://huggingface.co/lukealonso/MiniMax-M2.7-NVFP4/discussions/4#69e10ba74010ebbc8be99f80
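As a rough back-of-the-envelope check on the fp8 suggestion, the sketch below scales the reported token capacity by the ratio of element sizes. It assumes the current pool is stored in bf16 (2 bytes per element), that the pool's byte budget stays the same, and that any extra storage for scales is negligible; none of that is confirmed by the logs in this thread.

```python
# Values taken from the logs and replies in this thread.
BF16_POOL_TOKENS = 80_970    # "KV Cache is allocated. #tokens: 80970"
TUNED_BF16_TOKENS = 130_000  # the "around 130K" estimate after tuning memory flags
TARGET_CONTEXT = 196_608     # --context-length

BF16_BYTES = 2  # bytes per KV element, assumed current dtype
FP8_BYTES = 1   # bytes per KV element with an fp8 KV cache


def tokens_after_dtype_change(tokens: int, old_bytes: int, new_bytes: int) -> int:
    """Scale a token capacity by the per-element size ratio (same pool size in bytes)."""
    return tokens * old_bytes // new_bytes


for label, tokens in [("as launched", BF16_POOL_TOKENS), ("after tuning", TUNED_BF16_TOKENS)]:
    fp8_tokens = tokens_after_dtype_change(tokens, BF16_BYTES, FP8_BYTES)
    verdict = "fits" if fp8_tokens >= TARGET_CONTEXT else "does not fit"
    print(f"{label}: ~{fp8_tokens} tokens with fp8 KV, {TARGET_CONTEXT} {verdict}")
```

Under these assumptions, fp8 alone on the current launch settings would still fall short of 196,608 tokens, while fp8 combined with the memory tuning would clear it.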

milizhang

Apr 19

@sousekd Do we have the right k/v scale in this quant for fp8_e4m3 KV Cache? SGLang does not have on-the-fly calibration for these.

mtcl

Apr 19

The same happens with b12x.

M2.5 used to go to 192k easily so it looks like something is odd here...
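One way to sanity-check whether two checkpoints should differ here is to compare their per-token KV cache size. The sketch below uses the standard formula for grouped-query attention; the config values are placeholders and not the real MiniMax numbers, so substitute the values from each model's config.json. It also assumes the KV heads are split evenly across tensor-parallel ranks.

```python
def kv_bytes_per_token(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    bytes_per_elem: int,
    tp_size: int = 1,
) -> int:
    """Per-GPU KV cache bytes needed for one token with standard GQA attention."""
    heads_per_gpu = max(1, num_kv_heads // tp_size)  # assumes an even split across ranks
    # Factor 2: each layer stores one key vector and one value vector per KV head.
    return 2 * num_layers * heads_per_gpu * head_dim * bytes_per_elem


def max_tokens(pool_bytes: int, per_token_bytes: int) -> int:
    """Number of tokens that fit in a KV pool of the given size."""
    return pool_bytes // per_token_bytes


# Placeholder values for illustration only; NOT taken from any real config.
per_token = kv_bytes_per_token(
    num_layers=60, num_kv_heads=8, head_dim=128, bytes_per_elem=2, tp_size=2
)
pool = 10 * 2**30  # hypothetical 10 GiB per GPU left for the KV cache

print(per_token, max_tokens(pool, per_token))
```

If both models give the same per-token size, the difference in usable context would have to come from how much memory is left for the pool (weights, CUDA graphs, activation buffers) rather than from the attention layout.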
