nvidia/MiniMax-M2.7-NVFP4 · Context Length for 2X6000 Pros (2x96 = 192GB VRAM)

by mtcl - opened Apr 25

Apr 25

On M2.5 I can easily get the full context length; however, on this version of M2.7 I cannot get more than ~88K of context.

These are the commands.

For M2.5:

docker run --rm -it \
    --gpus '"device=0,2"' \
    --shm-size 32g \
    -p 10002:8000 \
    -v /media/mukul/data/models:/models \
    -e PYTORCH_ALLOC_CONF=expandable_segments:True \
    lmsysorg/sglang:latest \
    python -m sglang.launch_server \
        --model-path /models/nvidia/MiniMax-M2.5-NVFP4 \
        --served-model-name jarvis-thinker \
        --tp-size 2 \
        --quantization modelopt_fp4 \
        --tool-call-parser minimax-m2 \
        --reasoning-parser minimax \
        --host 0.0.0.0 \
        --port 8000 \
        --trust-remote-code \
        --dtype auto \
        --mem-fraction-static 0.90 \
        --context-length 196608 \
        --max-running-requests 16 \
        --chunked-prefill-size 16384 \
        --sleep-on-idle

And for M2.7:

docker run --rm -it \
  --gpus '"device=0,2"' \
  --shm-size 32g \
  -p 10002:8000 \
  -v /media/mukul/data/models:/models \
  -e PYTORCH_ALLOC_CONF=expandable_segments:True \
  lmsysorg/sglang:latest \
  python -m sglang.launch_server \
    --model-path /models/nvidia/MiniMax-M2.7-NVFP4 \
    --served-model-name jarvis-thinker \
    --tp-size 2 \
    --quantization modelopt_fp4 \
    --tool-call-parser minimax-m2 \
    --reasoning-parser minimax \
    --host 0.0.0.0 \
    --port 8000 \
    --trust-remote-code \
    --dtype auto \
    --mem-fraction-static 0.90 \
    --context-length 196608 \
    --max-running-requests 16 \
    --chunked-prefill-size 16384 \
    --sleep-on-idle

mratsim

Apr 27

M2.5 defaulted to an FP8 KV cache. This one defaults to BF16.
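To see why the KV-cache precision matters this much, here is a rough back-of-the-envelope sketch of the cache size for one sequence at the full `--context-length`. The layer count, KV-head count, and head dimension below are assumptions, not values stated in this thread; check the model's `config.json` before relying on them.

```shell
#!/bin/sh
# Rough KV-cache size estimate for a single sequence at full context.
# ASSUMED model shape (hypothetical values, verify against config.json):
LAYERS=62
KV_HEADS=8
HEAD_DIM=128
CONTEXT=196608   # --context-length used in the commands above

kv_bytes() {
  # $1 = bytes per cached element (2 for BF16, 1 for FP8).
  # The leading factor 2 accounts for storing both K and V.
  echo $(( 2 * LAYERS * KV_HEADS * HEAD_DIM * $1 * CONTEXT ))
}

BF16_BYTES=$(kv_bytes 2)
FP8_BYTES=$(kv_bytes 1)

echo "BF16 KV cache: $(( BF16_BYTES / 1024 / 1024 )) MiB"
echo "FP8  KV cache: $(( FP8_BYTES / 1024 / 1024 )) MiB"
```

Whatever the exact shape, BF16 takes twice the bytes per token that FP8 does, so the same memory pool holds about half as many tokens. That is in the same ballpark as the ~88K versus 196,608 gap reported above. If the BF16 default is indeed the cause, SGLang's `--kv-cache-dtype` option (for example `--kv-cache-dtype fp8_e4m3`) may be worth trying, at some possible cost in accuracy.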

mtcl

Apr 27

M2.5 defaulted to fp8 KV-cache. This one to BF16

Thank you for your reply. I realized that NVIDIA's version of this quant actually holds the full context length; I was only facing issues with other NVFP4 quants (lukeanso and others). Somehow this works perfectly!

sousekd

May 13

@mtcl

Somehow this works perfectly!

Have you settled on this one in the end? And do you use a BF16 or FP8 KV cache?
Still looking for the best possible version for 2x96 GB :)

numberfour8

5 days ago

@mtcl

Somehow this works perfectly!

Have you settled on this one in the end? And do you use a BF16 or FP8 KV cache?
Still looking for the best possible version for 2x96 GB :)

Did you eventually find the best working version for M2.7 on 2x PRO 6000s?

sousekd

4 days ago

@numberfour8 I settled on DSv4-Flash for day-to-day stuff.
