Unable to use full 192k context in SGLang with MiniMax-M2.7-NVFP4 (runtime capped at ~80,964 tokens)

#9
by mtcl - opened

Hi,

I’m seeing a mismatch between the configured context length and the actual usable context length when serving this quantized model with SGLang.

What I expected

I launched the server with:

  • --context-length 196608
  • model: lukealonso/MiniMax-M2.7-NVFP4
  • quantization: modelopt_fp4
  • TP=2
  • 2x Blackwell GPUs

I expected to be able to use roughly the advertised 192k context window.

What actually happens

SGLang starts successfully and reports:

  • context_length=196608

But at runtime it appears the effective max context is only around 80,964 / 80,970 tokens.

From the logs:

  • KV Cache is allocated. #tokens: 80970
  • max_total_num_tokens=80970
  • later requests fail with:
    • Input length (104433 tokens) exceeds the maximum allowed length (80964 tokens). Use a shorter input or enable --allow-auto-truncate.

So even though --context-length 196608 is accepted, I cannot actually send prompts anywhere close to 192k.

Environment

  • SGLang image: lmsysorg/sglang:dev-cu13
  • CUDA: 13.0.1
  • GPUs: 2x Blackwell
  • command:
docker run --rm -it \
  --gpus '"device=0,2"' \
  --shm-size 32g \
  -p 10002:8000 \
  -v /media/mukul/data/models:/models \
  -e PYTORCH_ALLOC_CONF=expandable_segments:True \
  lmsysorg/sglang:dev-cu13 \
  python -m sglang.launch_server \
    --model-path /models/lukealonso/MiniMax-M2.7-NVFP4 \
    --served-model-name jarvis-thinker \
    --tp-size 2 \
    --quantization modelopt_fp4 \
    --tool-call-parser minimax-m2 \
    --reasoning-parser minimax \
    --host 0.0.0.0 \
    --port 8000 \
    --trust-remote-code \
    --dtype auto \
    --mem-fraction-static 0.85 \
    --context-length 196608 \
    --max-running-requests 16 \
    --chunked-prefill-size 16384 \
    --sleep-on-idle

Relevant log snippets

server_args=... context_length=196608 ...
KV Cache is allocated. #tokens: 80970
max_total_num_tokens=80970, chunked_prefill_size=16384, max_prefill_tokens=16384, max_running_requests=16, context_len=196608
Input length (104433 tokens) exceeds the maximum allowed length (80964 tokens). Use a shorter input or enable --allow-auto-truncate.
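
To make the gap explicit, here is a minimal Python sketch that reads the limits off the log line above. It assumes the `key=value` format shown in that line; `parse_limits` is a hypothetical helper, not part of SGLang. It treats the usable context as the smaller of the configured context length and the KV-cache token pool, which matches what I observe. The rejection message reports 80964, a few tokens below the pool size, and the sketch does not model that margin.

```python
import re

# Log line copied from the server output above.
LOG_LINE = (
    "max_total_num_tokens=80970, chunked_prefill_size=16384, "
    "max_prefill_tokens=16384, max_running_requests=16, context_len=196608"
)


def parse_limits(line: str) -> dict[str, int]:
    """Extract integer key=value pairs from a log line (hypothetical helper)."""
    return {key: int(value) for key, value in re.findall(r"(\w+)=(\d+)", line)}


limits = parse_limits(LOG_LINE)
kv_capacity = limits["max_total_num_tokens"]  # tokens the KV pool can hold
configured = limits["context_len"]            # value passed via --context-length

# Assumption: a single request cannot exceed the KV pool, so the usable
# context is bounded by whichever of the two limits is smaller.
effective = min(kv_capacity, configured)
shortfall = configured - effective

print(f"configured={configured} kv_capacity={kv_capacity} "
      f"effective={effective} shortfall={shortfall}")
```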

Question

Is this expected for this NVFP4 quant on this hardware, or is there something specific I need to change to actually get close to the full 192k context?

In particular, I’d like to understand:

  1. Whether this quantized checkpoint is intended to support the full 192k context in SGLang.
  2. Whether the limitation is from the checkpoint / KV-cache requirements rather than the model config.
  3. Whether there are recommended launch settings to get closer to the full context length.

Thanks!

[2026-04-19 01:35:00 TP0] Decode batch, #running-req: 1, #token: 75387, token usage: 0.93, cuda graph: True, gen throughput (token/s): 62.56, #queue-req: 0
[2026-04-19 01:35:00 TP0] Decode batch, #running-req: 1, #token: 75427, token usage: 0.93, cuda graph: True, gen throughput (token/s): 62.59, #queue-req: 0
[2026-04-19 01:35:01 TP0] Decode batch, #running-req: 1, #token: 75467, token usage: 0.93, cuda graph: True, gen throughput (token/s): 62.53, #queue-req: 0
[2026-04-19 01:35:02 TP0] Decode batch, #running-req: 1, #token: 75507, token usage: 0.93, cuda graph: True, gen throughput (token/s): 62.55, #queue-req: 0
[2026-04-19 01:46:46 TP0] Input length (104433 tokens) exceeds the maximum allowed length (80964 tokens). Use a shorter input or enable --allow-auto-truncate., self.rid='9191a9b6635b4d7286dadd9f10d2c76f'
[2026-04-19 01:46:46 TP0] Prefill batch, #new-seq: 1, #new-token: 1, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, #pending-token: 0, cuda graph: True, input throughput (token/s): 0.00
[2026-04-19 01:46:46] INFO:     172.17.0.1:48556 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2026-04-19 01:46:49 TP0] Input length (104433 tokens) exceeds the maximum allowed length (80964 tokens). Use a shorter input or enable --allow-auto-truncate., self.rid='e4a97a0674924fdf8598024faca138d1'
[2026-04-19 01:46:49 TP0] Prefill batch, #new-seq: 1, #new-token: 1, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, #pending-token: 0, cuda graph: True, input throughput (token/s): 0.43
[2026-04-19 01:46:49] INFO:     172.17.0.1:48556 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2026-04-19 01:46:53 TP0] Input length (104433 tokens) exceeds the maximum allowed length (80964 tokens). Use a shorter input or enable --allow-auto-truncate., self.rid='40cc016d21bb4d0b9aecddd992c5da55'
[2026-04-19 01:46:53 TP0] Prefill batch, #new-seq: 1, #new-token: 1, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, #pending-token: 0, cuda graph: True, input throughput (token/s): 0.23
[2026-04-19 01:46:53] INFO:     172.17.0.1:48556 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2026-04-19 01:47:01 TP0] Input length (104433 tokens) exceeds the maximum allowed length (80964 tokens). Use a shorter input or enable --allow-auto-truncate., self.rid='beb7d2b333cb4d3c9394a3c7146d873e'
[2026-04-19 01:47:01 TP0] Prefill batch, #new-seq: 1, #new-token: 1, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, #pending-token: 0, cuda graph: True, input throughput (token/s): 0.12
[2026-04-19 01:47:01] INFO:     172.17.0.1:42090 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2026-04-19 01:47:18 TP0] Input length (104433 tokens) exceeds the maximum allowed length (80964 tokens). Use a shorter input or enable --allow-auto-truncate., self.rid='495eb8ce1ea64e63b3532d1708deed0b'
[2026-04-19 01:47:18 TP0] Prefill batch, #new-seq: 1, #new-token: 1, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, #pending-token: 0, cuda graph: True, input throughput (token/s): 0.06
[2026-04-19 01:47:18] INFO:     172.17.0.1:54824 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2026-04-19 01:47:50 TP0] Input length (104433 tokens) exceeds the maximum allowed length (80964 tokens). Use a shorter input or enable --allow-auto-truncate., self.rid='1b2f88b8aca245d6b19b0b34a6ad2c04'
[2026-04-19 01:47:50 TP0] Prefill batch, #new-seq: 1, #new-token: 1, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, #pending-token: 0, cuda graph: True, input throughput (token/s): 0.03

@mtcl
It is tight. You can try increasing --mem-fraction-static, and possibly decreasing --chunked-prefill-size, to get to around 130K of context with a bf16 KV cache.
Or go the --kv-cache-dtype fp8_e4m3 way if you need more. Also, any reason not to use b12x?
See here: https://huggingface.co/lukealonso/MiniMax-M2.7-NVFP4/discussions/4#69e10ba74010ebbc8be99f80
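
As a rough back-of-the-envelope check, here is a small Python sketch of how much an fp8 KV cache might buy. It assumes the current pool of 80970 tokens stores bf16 (2 bytes per element), that the pool's byte budget stays fixed, and that fp8 storage is exactly 1 byte per element with no extra overhead for scales. `scaled_capacity` is a hypothetical helper, and the real numbers will depend on the model's layer and head layout.

```python
def scaled_capacity(tokens_now: int, bytes_now: int, bytes_new: int) -> int:
    """Estimate token capacity after changing the KV element size.

    Assumes the KV pool's total byte budget is unchanged, so capacity
    scales inversely with bytes per element.
    """
    return tokens_now * bytes_now // bytes_new


bf16_tokens = 80970   # "KV Cache is allocated. #tokens: 80970" from the logs
target = 196608       # value passed via --context-length

fp8_tokens = scaled_capacity(bf16_tokens, bytes_now=2, bytes_new=1)

# How much larger the pool's byte budget would still need to be
# (e.g. via --mem-fraction-static) to hold the full target context in fp8.
required_growth = target / fp8_tokens

print(f"fp8 estimate: {fp8_tokens} tokens")
print(f"pool growth still needed for {target} tokens: {required_growth:.2f}x")
```

Under these assumptions fp8 alone would land near 162k tokens, so reaching the full 196608 would likely also need a larger KV pool.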

@sousekd Do we have the right k/v scale in this quant for fp8_e4m3 KV Cache? SGLang does not have on-the-fly calibration for these.

The same happens with b12x.

M2.5 used to go to 192k easily so it looks like something is odd here...
