Context Length for 2X6000 Pros (2x96 = 192GB VRAM)
#2
by mtcl - opened
On M2.5 I can easily get the full context length. On this version of M2.7, however, I cannot get more than ~88K of context.
These are the commands.
For M2.5:
docker run --rm -it \
--gpus '"device=0,2"' \
--shm-size 32g \
-p 10002:8000 \
-v /media/mukul/data/models:/models \
-e PYTORCH_ALLOC_CONF=expandable_segments:True \
lmsysorg/sglang:latest \
python -m sglang.launch_server \
--model-path /models/nvidia/MiniMax-M2.5-NVFP4 \
--served-model-name jarvis-thinker \
--tp-size 2 \
--quantization modelopt_fp4 \
--tool-call-parser minimax-m2 \
--reasoning-parser minimax \
--host 0.0.0.0 \
--port 8000 \
--trust-remote-code \
--dtype auto \
--mem-fraction-static 0.90 \
--context-length 196608 \
--max-running-requests 16 \
--chunked-prefill-size 16384 \
--sleep-on-idle
And for M2.7:
docker run --rm -it \
--gpus '"device=0,2"' \
--shm-size 32g \
-p 10002:8000 \
-v /media/mukul/data/models:/models \
-e PYTORCH_ALLOC_CONF=expandable_segments:True \
lmsysorg/sglang:latest \
python -m sglang.launch_server \
--model-path /models/nvidia/MiniMax-M2.7-NVFP4 \
--served-model-name jarvis-thinker \
--tp-size 2 \
--quantization modelopt_fp4 \
--tool-call-parser minimax-m2 \
--reasoning-parser minimax \
--host 0.0.0.0 \
--port 8000 \
--trust-remote-code \
--dtype auto \
--mem-fraction-static 0.90 \
--context-length 196608 \
--max-running-requests 16 \
--chunked-prefill-size 16384 \
--sleep-on-idle
M2.5 defaulted to an FP8 KV cache. This one defaults to BF16.
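To see why the KV-cache dtype matters this much, here is a rough sizing sketch. The model dimensions in it are placeholders, not values confirmed in this thread; substitute the numbers from the checkpoint's config.json. It only counts the per-token K and V tensors and ignores weights, activations, and allocator overhead.

```shell
#!/usr/bin/env bash
# Rough KV-cache sizing sketch.
# ASSUMPTION: the three model dimensions below are placeholders.
# Read the real values from the model's config.json.
layers=62        # hypothetical num_hidden_layers
kv_heads=8       # hypothetical num_key_value_heads
head_dim=128     # hypothetical head_dim
context=196608   # --context-length used in the commands above

# Bytes of KV cache per token for a given element size in bytes.
kv_bytes_per_token() {
  local elem_bytes=$1
  # Factor 2 covers keys and values.
  echo $(( 2 * layers * kv_heads * head_dim * elem_bytes ))
}

bf16_per_token=$(kv_bytes_per_token 2)   # BF16 = 2 bytes per element
fp8_per_token=$(kv_bytes_per_token 1)    # FP8  = 1 byte per element

echo "BF16: ${bf16_per_token} bytes/token, $(( bf16_per_token * context / 1024 / 1024 )) MiB per full-length sequence"
echo "FP8:  ${fp8_per_token} bytes/token, $(( fp8_per_token * context / 1024 / 1024 )) MiB per full-length sequence"
```

Whatever the real dimensions are, BF16 takes twice the bytes per token that FP8 does, so the same memory pool holds about half as many tokens. That would be consistent with landing near ~88K instead of 196608, though this thread does not confirm it as the only cause.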
Thank you for your reply. I realized that NVIDIA's version of this quant actually holds the full context length. I was only facing issues with other NVFP4 quants (lukeanso and others). Somehow this works perfectly!
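For checkpoints that do not reach the full context, one thing that could be tried is pinning the KV-cache dtype explicitly instead of relying on the default. This is an untested sketch: SGLang's launcher has a `--kv-cache-dtype` option, but whether `fp8_e4m3` works well with any particular quant is an assumption, and `MODEL_DIR` is a hypothetical placeholder path.

```shell
# ASSUMPTION: MODEL_DIR is a placeholder; point it at the quant being tested.
MODEL_DIR=/models/path/to/other-nvfp4-quant

docker run --rm -it \
  --gpus '"device=0,2"' \
  --shm-size 32g \
  -p 10002:8000 \
  -v /media/mukul/data/models:/models \
  -e PYTORCH_ALLOC_CONF=expandable_segments:True \
  lmsysorg/sglang:latest \
  python -m sglang.launch_server \
  --model-path "$MODEL_DIR" \
  --served-model-name jarvis-thinker \
  --tp-size 2 \
  --quantization modelopt_fp4 \
  --tool-call-parser minimax-m2 \
  --reasoning-parser minimax \
  --host 0.0.0.0 \
  --port 8000 \
  --trust-remote-code \
  --dtype auto \
  --kv-cache-dtype fp8_e4m3 \
  --mem-fraction-static 0.90 \
  --context-length 196608 \
  --max-running-requests 16 \
  --chunked-prefill-size 16384 \
  --sleep-on-idle
```

The only change from the commands earlier in the thread is the model path variable and the added `--kv-cache-dtype` line. FP8 KV cache can affect output quality, so it is worth comparing against BF16 on a few long prompts before settling on it.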
Have you settled on this one in the end? And do you use a BF16 or FP8 KV cache?
Still looking for the best possible version for 2x96 GB :)
Did you eventually find the best working version of M2.7 for 2x PRO 6000s?