Q3_K_M - Results with: 2x RTX 2080Ti (22GB VRAM each) // Xeon E5-2682V4 16C 128G DDR4 2133 Quad Channel

#9
by yuzu127000d - opened

Discussion of model performance on constrained hardware; details below.

Hardware:
GPU: 2x NVIDIA RTX 2080Ti (22GB each, 44GB total) with NVLink
CPU: E5-2682V4
RAM: 128GB (4x32GB) DDR4 @ 2133 MT/s

Performance:
7-8 tokens/s -- long context window (96K or higher)
10 tokens/s -- short context window (4096)

Measured with short prompts ONLY. Real-world testing is still in progress.

Setup:
llama.cpp server via Docker:

docker run -d --name minimax \
  --runtime=nvidia \
  --gpus all \
  --ipc=host \
  -v /data/models/MiniMax-M2.5-GGUF-Q3_K_M/Q3_K_M:/models \
  -p 8080:8080 \
  ghcr.io/ggml-org/llama.cpp:server-cuda \
  -m /models/MiniMax-M2.5-Q3_K_M-00001-of-00004.gguf \
  -c 65536 \
  --cache-type-k q4_0 \
  --cache-type-v q4_0 \
  -ngl 62 \
  -ncmoe 52 \
  --split-mode row \
  --tensor-split 4,1 \
  -b 512 \
  --no-mmap \
  --flash-attn on \
  -t 16 \
  -tb 16 \
  -np 1 \
  --numa isolate \
  --host 0.0.0.0 \
  --port 8080
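Once the container is up, a quick smoke test can be done against the OpenAI-compatible endpoint that llama.cpp's server exposes (the prompt and max_tokens here are arbitrary; a single-model server ignores the model name):

```shell
# Sanity-check the running llama.cpp server via its OpenAI-compatible API.
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64
      }'
```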

JUST A SAMPLE. These may not be the optimal parameters. The main trade-off is balancing how much of the weights fits in GPU VRAM against the size of the context window.
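To get a feel for that trade-off, here is a back-of-envelope estimate of KV-cache size versus context length, which is the memory you win back by quantizing the cache to q4_0 (GGML q4_0 stores an 18-byte block per 32 elements, ~0.56 bytes/element vs 2 bytes for f16). The layer/head dimensions below are hypothetical placeholders, not MiniMax-M2.5's real config; read the true values from the GGUF metadata:

```shell
# Rough KV-cache size vs context length. Dimensions are PLACEHOLDERS.
layers=60 kv_heads=8 head_dim=128
for ctx in 4096 65536; do
  awk -v l=$layers -v h=$kv_heads -v d=$head_dim -v c=$ctx 'BEGIN {
    elems = 2 * l * c * h * d               # K + V elements across all layers
    printf "ctx=%6d  f16: %.2f GiB  q4_0: %.2f GiB\n", c,
           elems * 2 / 2^30,                # f16: 2 bytes per element
           elems * (18 / 32) / 2^30         # q4_0: 18-byte block per 32 elems
  }'
done
```

With these placeholder dims, a 64K-context f16 cache already eats a double-digit-GiB chunk of the 44GB pool, which is why q4_0 K/V caches and offloading some MoE experts to CPU (-ncmoe) make the long-context runs fit at all.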

Posting this just for the record, in case someone needs a reference. If your hardware fully supports FP8 quantization plus AVX512-BF16 and AMX, ktransformers may also be a good choice.
Overall, the model can run even on old hardware. With llama.cpp and relatively newer hardware, performance will be better.
