-ot parameters
I'm very impressed with the doubled speed between llama.cpp and ik_llama, and between the model here and the official GGUF they uploaded. My question, since I'm a bit greedy and would like more than the 5.6 t/s I'm getting: can I get more with precise tensor allocations? Here's the beginning of my log. I have 3x 3090 and 1x 4070 Ti, one on 16x and the rest on 4x PCIe lanes, plus 96 GB of DDR4 RAM and an i7-13700K.
D:\iklama\ik_llama.cpp\build\bin\Release>llama-server.exe ^
  --model "D:\models\step35\Step-3.5-Flash-IQ4_XS-00001-of-00004.gguf" ^
  --alias ubergarm/GLM-4.7 ^
  --ctx-size 8536 ^
  -sm graph ^
  -smgs ^
  -mea 256 ^
  -ngl 99 ^
  --n-cpu-moe 60 ^
  -ts 13,29,29,29 ^
  -ub 512 -b 512 ^
  --threads 24 ^
  --parallel 1 ^
  --host 127.0.0.1 ^
  --port 8085 ^
  --no-mmap ^
  --jinja
INFO [ main] build info | tid="27984" timestamp=1770520791 build=4189 commit="e22b2d12"
INFO [ main] system info | tid="27984" timestamp=1770520791 n_threads=24 n_threads_batch=-1 total_threads=24 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | "
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 4 CUDA devices:
Device 0: NVIDIA GeForce RTX 4070 Ti, compute capability 8.9, VMM: yes, VRAM: 12281 MiB
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24575 MiB
Device 2: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24575 MiB
Device 3: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24575 MiB
CUDA0: using device CUDA0 - 11036 MiB free
CUDA1: using device CUDA1 - 23304 MiB free
CUDA2: using device CUDA2 - 23304 MiB free
CUDA3: using device CUDA3 - 23304 MiB free
Can't you play with these params? Just decrease --n-cpu-moe until you have no more VRAM available.
-ngl 99
--n-cpu-moe 54
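A minimal sketch of that dial-in loop, reusing the paths and flags from the log above (the --n-cpu-moe value that actually fits your 12+24+24+24 GiB is an assumption; it depends on quant and context size, so re-launch with progressively lower values and watch VRAM between runs):

```shell
:: re-launch with a lower --n-cpu-moe (fewer expert layers kept on CPU)
llama-server.exe ^
  --model "D:\models\step35\Step-3.5-Flash-IQ4_XS-00001-of-00004.gguf" ^
  --ctx-size 8536 -sm graph -smgs -mea 256 ^
  -ngl 99 ^
  --n-cpu-moe 54 ^
  -ts 13,29,29,29 ^
  --no-mmap --jinja

:: in a second terminal, check per-GPU headroom once the model has loaded;
:: lower --n-cpu-moe again if there is still free VRAM, raise it if you OOM
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv
```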
Yeah, take some time to dial in your system. Here are some things to read:
- gist covering how it works: https://gist.github.com/DocShotgun/a02a4c0c0a57e43ff4f038b46ca66ae0
- -sm graph, which you're already using: if a model is supported for it on ik, it is the fastest way to run GGUFs anywhere on multi-GPU rigs
Since you're already using -sm graph, you might be able to play around with the ordering of the GPUs by passing env vars, and use -mg 0 to set the main GPU to the fastest one / the one with the most PCIe lanes...
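One hedged way to do that reordering (assuming device order is controlled via the standard CUDA env vars): put a 3090 first so the 4070 Ti, which is physical index 0 in your log, moves to the back, then pin the main GPU. Note the -ts split has to be reordered to match:

```shell
:: enumerate GPUs by PCI bus ID so the index mapping is stable
set CUDA_DEVICE_ORDER=PCI_BUS_ID
:: remap: the three 3090s become devices 0-2, the 4070 Ti becomes device 3
set CUDA_VISIBLE_DEVICES=1,2,3,0
:: -mg 0 now points at the first 3090; -ts reordered to match the new layout
llama-server.exe -mg 0 -ts 29,29,29,13 ^
  --model "D:\models\step35\Step-3.5-Flash-IQ4_XS-00001-of-00004.gguf" ^
  -sm graph -ngl 99 --n-cpu-moe 60 --ctx-size 8536 --no-mmap --jinja
```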
More advanced stuff includes changing the reduce-operation quantization type and other things, but that is beyond me given I mainly use a CPU rig and don't experiment that much with >2 GPUs.
Read up on the closed PRs on ik_llama.cpp or join some discussions over on the BeaverAI discord: https://huggingface.co/BeaverAI
Also, if you can, use -ub 4096 -b 4096 (keep in mind the default values are -ub 512 -b 2048); you can probably get more PP.
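Something like this, as a sketch (larger batches cost extra VRAM for the compute buffers, so as an assumption you may need to give back a few layers via a higher --n-cpu-moe to make room):

```shell
:: bigger physical/logical batch for faster prompt processing (PP);
:: token generation speed is mostly unaffected by this
llama-server.exe ^
  --model "D:\models\step35\Step-3.5-Flash-IQ4_XS-00001-of-00004.gguf" ^
  -ub 4096 -b 4096 ^
  -sm graph -ngl 99 --n-cpu-moe 60 -ts 13,29,29,29 ^
  --ctx-size 8536 --no-mmap --jinja
:: if this OOMs, raise --n-cpu-moe until the compute buffers fit
```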