ik_llama.cpp version
MiniMax M2.5 IQ5_K ran fine with my AMD 9600X + RTX 5090 setup; I got about 9 t/s. After I updated ik_llama.cpp, it became unusable (repeating words and very slow). I had to roll back to f7923739 (build 4081), and everything was back to normal. I just wanted to share my experience.
Heya, good seeing you around!
Can you provide the full command you were using for testing? Also, strangely, the smaller IQ4_NL version might be slightly better (it shows better perplexity, but I didn't test KLD stats). You could use the ik version, which is a little better than the mainline version (the mainline version is mainly for vulkan/mac/mainline folks).
If I understand correctly, it works fine until getting past a certain context length? What client were you using (the built-in web UI, or an agentic coding tool like opencode, etc.)?
Cheers!
@ubergarm
Hi John. I'm so sorry I'm incredibly late to respond to your message. I don't do AI or programming for a living, and sometimes I struggle with system settings and LLM commands and parameters.
I was talking about ik_llama.cpp versions. Ever since I updated ik_llama.cpp, your versions like MiniMax-M2.5 IQ5_K and Kimi-K2.5 IQ3_K stopped working correctly. They output non-words (random characters). I don't know what I'm doing wrong. Everything works fine after I rolled back to the previous version of ik_llama.cpp. I used the following commands to roll back:
git clone https://github.com/ikawrakow/ik_llama.cpp.git .
git reset --hard f7923739
git submodule update --init --recursive
mkdir build && cd build
cmake .. -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release
cmake --build . --config Release -j $(nproc)
I was just using OpenWebUI to ask questions. I want to learn opencode, openhands, openclaw, etc., but those are future projects.
My AMD 9600X was too slow, so I stopped using that computer. Now I'm using an Intel Xeon 8570 + 8-channel 512GB RAM + two RTX 5090s. An example of the parameters I'm using is as follows:
CUDA_VISIBLE_DEVICES=1 ./llama-server -m "/mnt/raid0/ubergarm/Kimi-K2.5-GGUF/Kimi-K2.5-IQ3_K-00001-of-00012.gguf" --alias "Kimi-K2.5 IQ3_K" -mla 3 --parallel 1 --threads-batch 128 -ctk q8_0 -ctv q8_0 -c 32768 -t 52 -ngl 99 --temp 1.0 --min-p 0.01 --top-p 0.95 -ot "blk.([3]).ffn_.*=CUDA0,exps=CPU" --host 0.0.0.0 --port 11234 --special --reasoning-budget 500
I get about 16 t/s from Kimi-K2.5. Recently I tried using Qwen3.5 122B for studying for the CISSP exam, and I was happy with the result. I get decent answers and, most importantly, about 85 t/s from the dual RTX 5090s.
Oh, your new rig is very nice! With 2x GPUs, on some models that support -sm graph on ik_llama.cpp you will get very good performance (near-vLLM speed for a single user with full offload).
ik_llama.cpp moves very quickly, so it is possible there was a regression at some point, hence why you needed to roll back to your known working version. I'd suggest pulling the latest version again (omit the git reset ... step) and see if the latest "tip of main", as we call it, is working for you now on the new rig.
Your command for Kimi-K2.5 is pretty good, though it is simpler to remove:
-ot "blk.([3]).ffn_.*=CUDA0,exps=CPU"
And use the easier:
--n-cpu-moe 50
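Putting that together, a simplified launch might look like this (a sketch assembled from the command quoted above with the -ot regex swapped for --n-cpu-moe; adjust the count to fit your VRAM):

```shell
# same Kimi-K2.5 launch as quoted earlier, with --n-cpu-moe keeping
# the first 50 MoE expert layers on CPU instead of the -ot override
CUDA_VISIBLE_DEVICES=1 ./llama-server \
  -m "/mnt/raid0/ubergarm/Kimi-K2.5-GGUF/Kimi-K2.5-IQ3_K-00001-of-00012.gguf" \
  --alias "Kimi-K2.5 IQ3_K" \
  -mla 3 --parallel 1 --threads-batch 128 -t 52 \
  -ctk q8_0 -ctv q8_0 -c 32768 -ngl 99 \
  --temp 1.0 --min-p 0.01 --top-p 0.95 \
  --n-cpu-moe 50 \
  --host 0.0.0.0 --port 11234 --special --reasoning-budget 500
```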
And hopefully soon you can test for OOM with --dry-run, which is coming here: https://github.com/ikawrakow/ik_llama.cpp/pull/1462
Finally, here is a quick-start opencode.json you can use. I'd suggest creating a directory, e.g. vibecoding, and putting all your projects in subdirectories there. Keep the opencode.json file at the top-level directory. The "cost" estimate is set to Claude Opus 4.6, the most expensive, so I can see how much money I'm saving by running locally! haha
{
  "$schema": "https://opencode.ai/config.json",
  "share": "disabled",
  "autoupdate": false,
  "experimental": {
    "openTelemetry": false
  },
  "permission": {
    "websearch": "allow",
    "todo": "deny",
    "todoread": "deny",
    "todowrite": "deny",
    "doom_loop": "allow"
  },
  "disabled_providers": ["exa"],
  "lsp": false,
  "provider": {
    "LMstudio": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "ik_llama.cpp (local)",
      "options": {
        "baseURL": "http://localhost:8080/v1",
        "timeout": 99999999999
      },
      "models": {
        "AnyNameQwen3.5": {
          "name": "AnyNameQwen3.5",
          "limit": { "context": 262144, "output": 65536 },
          "cost": { "input": 5.0, "output": 25.0 },
          "temperature": true,
          "reasoning": true,
          "tool_call": true,
          "modalities": {
            "input": ["text", "image"],
            "output": ["text"]
          }
        }
      }
    }
  }
}
Finally, for smaller Qwen models take a peep here for one of my latest llama-server examples, knowing you'd likely use -sm graph for your 2x GPUs: https://huggingface.co/ubergarm/Qwen3.5-35B-A3B-GGUF#quick-start
Thank you, John. I definitely want to try opencode when I have time. For some reason I always get worse speed with -sm graph. The following is the answer Gemini gave me.
The logs confirm the throughput drop you noticed. Here are the "Eval" (generation) speeds from your two tasks:
First Run (Graph Mode): 60.73 tokens per second
Second Run (Layer Mode): 87.53 tokens per second
Why? In the first run, your logs show graph splits = 208. This means for every single token generated, your two RTX 5090s had to synchronize their data 208 times. In the second run (Layer mode), the graph splits dropped to only 9.
By removing -sm graph, you eliminated nearly 200 "conversations" between your GPUs per token, allowing them to spend more time calculating and less time waiting for the PCIe bus.
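The throughput gap above can be checked with a quick back-of-the-envelope calculation (a shell one-liner; the token rates are taken from the two runs quoted above):

```shell
# relative speedup of layer mode over graph mode, from the quoted runs
graph_tps=60.73   # -sm graph: 208 graph splits per token
layer_tps=87.53   # -sm layer: 9 graph splits per token
awk -v g="$graph_tps" -v l="$layer_tps" \
  'BEGIN { printf "layer mode is %.1f%% faster\n", (l - g) / g * 100 }'
# prints: layer mode is 44.1% faster
```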
Oh wait, do you not have P2P nvidia drivers? e.g. what is your output of these commands:
# get the nvidia driver and cuda version
$ nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.105.08 Driver Version: 580.105.08 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
# don't think 5090 supports nvlinks so should be inActive
$ nvidia-smi nvlink -s
GPU 0: NVIDIA RTX A6000 (UUID: GPU-b9caf726-e47e-74cf-60c9-ff87caeb0d8c)
NVML: Unable to retrieve NVLink information as all links are inActive
GPU 1: NVIDIA RTX A6000 (UUID: GPU-c97e723d-6e3e-7c71-c7f1-b0540f6f4f57)
NVML: Unable to retrieve NVLink information as all links are inActive
# check for P2P drivers
$ nvidia-smi topo -p2p r
GPU0 GPU1
GPU0 X OK
GPU1 OK X
When you compile it should say ==================== NCCL found!
Are you on Ubuntu, or what flavor of Linux? You might need some packages like libnccl2, or maybe the 5090 Blackwell has special requirements for P2P support; I'm not 100% sure.
@ubergarm
I'm using Ubuntu 24.04.3. It will be so cool if I can use -sm graph.
Here is the output from the commands:
geveent@geveent-MS73-HB1-000:~$ nvidia-smi
Fri Mar 20 06:27:49 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 590.48.01 Driver Version: 590.48.01 CUDA Version: 13.1 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 5090 On | 00000000:40:00.0 On | N/A |
| 0% 49C P8 43W / 575W | 889MiB / 32607MiB | 4% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA GeForce RTX 5090 On | 00000000:6A:00.0 Off | N/A |
| 0% 36C P8 12W / 575W | 27116MiB / 32607MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 4979 G /usr/lib/xorg/Xorg 310MiB |
| 0 N/A N/A 5238 G /usr/bin/gnome-shell 74MiB |
| 0 N/A N/A 6100 G ...rack-uuid=3190708988185955192 220MiB |
| 0 N/A N/A 8031 C+G ...am/ubuntu12_64/steamwebhelper 14MiB |
| 0 N/A N/A 10802 G /usr/bin/gnome-system-monitor 18MiB |
| 0 N/A N/A 29915 G ...ProgramFiles/.venv/bin/python 53MiB |
| 0 N/A N/A 272140 G /opt/Obsidian/obsidian 31MiB |
| 1 N/A N/A 4979 G /usr/lib/xorg/Xorg 33MiB |
| 1 N/A N/A 7799 G ...share/Steam/ubuntu12_32/steam 13MiB |
| 1 N/A N/A 7996 G ./steamwebhelper 8MiB |
| 1 N/A N/A 273326 C ./llama-server 27014MiB |
+-----------------------------------------------------------------------------------------+
geveent@geveent-MS73-HB1-000:~$ nvidia-smi nvlink -s
GPU 0: NVIDIA GeForce RTX 5090 (UUID: GPU-878328f7-cc0a-4d64-0614-8f358544a6a9)
Device does not have or support Nvlink
GPU 1: NVIDIA GeForce RTX 5090 (UUID: GPU-0a7ea96f-0882-718e-bbdb-bfebe8acedd6)
Device does not have or support Nvlink
geveent@geveent-MS73-HB1-000:~$ nvidia-smi topo -p2p r
GPU0 GPU1
GPU0 X CNS
GPU1 CNS X
Legend:
X = Self
OK = Status Ok
CNS = Chipset not supported
GNS = GPU not supported
TNS = Topology not supported
NS = Not supported
U = Unknown
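That matrix can be checked in a script instead of by eye (a sketch parsing the sample text above; with real hardware, pipe live nvidia-smi topo -p2p r output in place of the here-string):

```shell
# sample matrix text copied from the output above; CNS anywhere in the
# matrix means the chipset blocks P2P, so -sm graph will likely fail
topo='GPU0 GPU1
GPU0 X CNS
GPU1 CNS X'
if printf '%s\n' "$topo" | grep -q 'CNS'; then
  echo "P2P blocked (CNS = chipset not supported)"
else
  echo "no CNS entries: P2P may be available"
fi
```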
This time I installed 'libnccl2' and recompiled ik_llama.cpp with '-DGGML_CUDA_NCCL=ON'.
sudo apt-get update
sudo apt-get install libnccl2 libnccl-dev
git clone https://github.com/ikawrakow/ik_llama.cpp
cd ik_llama.cpp
cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON -DGGML_CUDA_NCCL=ON
cmake --build build --config Release -j $(nproc)
Unfortunately, I got the following error:
geveent@geveent-MS73-HB1-000:~/ik_llama.cpp/build/bin$ ./llama-server -m "/home/geveent/lmstudio/models/ubergarm/Qwen3.5-35B-A3B-GGUF/Qwen3.5-35B-A3B-IQ4_KS.gguf" -c 32768 -t 8 -ngl 99 -fa on -ub 1024 -b 1024 --parallel 1 --ctx-checkpoints 16 --host 0.0.0.0 --port 11235 --jinja --reasoning-budget 0 -sm graph
INFO [ main] build info | tid="125942173741056" timestamp=1774005959 build=4332 commit="10b44eca"
INFO [ main] system info | tid="125942173741056" timestamp=1774005959 n_threads=8 n_threads_batch=-1 total_threads=112 system_info="AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | "
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes, VRAM: 32089 MiB
Device 1: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes, VRAM: 32111 MiB
=============================== NCCL main communicator initialized
CUDA0: using device CUDA0 - 30639 MiB free
CUDA1: using device CUDA1 - 31423 MiB free
llama_model_loader: loaded meta data with 47 key-value pairs and 733 tensors from /home/geveent/lmstudio/models/ubergarm/Qwen3.5-35B-A3B-GGUF/Qwen3.5-35B-A3B-IQ4_KS.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen35moe
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.sampling.top_k i32 = 20
llama_model_loader: - kv 3: general.sampling.top_p f32 = 0.950000
llama_model_loader: - kv 4: general.sampling.temp f32 = 1.000000
llama_model_loader: - kv 5: general.name str = Qwen3.5 35B A3B
llama_model_loader: - kv 6: general.basename str = Qwen3.5
llama_model_loader: - kv 7: general.size_label str = 35B-A3B
llama_model_loader: - kv 8: general.license str = apache-2.0
llama_model_loader: - kv 9: general.license.link str = https://huggingface.co/Qwen/Qwen3.5-3...
llama_model_loader: - kv 10: general.tags arr[str,1] = ["image-text-to-text"]
llama_model_loader: - kv 11: qwen35moe.block_count u32 = 40
llama_model_loader: - kv 12: qwen35moe.context_length u32 = 262144
llama_model_loader: - kv 13: qwen35moe.embedding_length u32 = 2048
llama_model_loader: - kv 14: qwen35moe.attention.head_count u32 = 16
llama_model_loader: - kv 15: qwen35moe.attention.head_count_kv u32 = 2
llama_model_loader: - kv 16: qwen35moe.rope.dimension_sections arr[i32,4] = [11, 11, 10, 0]
llama_model_loader: - kv 17: qwen35moe.rope.freq_base f32 = 10000000.000000
llama_model_loader: - kv 18: qwen35moe.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 19: qwen35moe.expert_count u32 = 256
llama_model_loader: - kv 20: qwen35moe.expert_used_count u32 = 8
llama_model_loader: - kv 21: qwen35moe.attention.key_length u32 = 256
llama_model_loader: - kv 22: qwen35moe.attention.value_length u32 = 256
llama_model_loader: - kv 23: general.file_type u32 = 145
llama_model_loader: - kv 24: qwen35moe.expert_feed_forward_length u32 = 512
llama_model_loader: - kv 25: qwen35moe.expert_shared_feed_forward_length u32 = 512
llama_model_loader: - kv 26: qwen35moe.ssm.conv_kernel u32 = 4
llama_model_loader: - kv 27: qwen35moe.ssm.state_size u32 = 128
llama_model_loader: - kv 28: qwen35moe.ssm.group_count u32 = 16
llama_model_loader: - kv 29: qwen35moe.ssm.time_step_rank u32 = 32
llama_model_loader: - kv 30: qwen35moe.ssm.inner_size u32 = 4096
llama_model_loader: - kv 31: qwen35moe.full_attention_interval u32 = 4
llama_model_loader: - kv 32: qwen35moe.rope.dimension_count u32 = 64
llama_model_loader: - kv 33: general.quantization_version u32 = 2
llama_model_loader: - kv 34: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 35: tokenizer.ggml.pre str = qwen35
llama_model_loader: - kv 36: tokenizer.ggml.tokens arr[str,248320] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 37: tokenizer.ggml.token_type arr[i32,248320] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 38: tokenizer.ggml.merges arr[str,247587] = ["Ġ Ġ", "Ġ Ġ Ġ Ġ", "i n", "Ġ t",...
llama_model_loader: - kv 39: tokenizer.ggml.eos_token_id u32 = 248046
llama_model_loader: - kv 40: tokenizer.ggml.padding_token_id u32 = 248044
llama_model_loader: - kv 41: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 42: tokenizer.chat_template str = {%- set image_count = namespace(value...
llama_model_loader: - kv 43: quantize.imatrix.file str = /mnt/data/models/ubergarm/Qwen3.5-35B...
llama_model_loader: - kv 44: quantize.imatrix.dataset str = ubergarm-imatrix-calibration-corpus-v...
llama_model_loader: - kv 45: quantize.imatrix.entries_count i32 = 511
llama_model_loader: - kv 46: quantize.imatrix.chunks_count i32 = 829
llama_model_loader: - type f32: 361 tensors
llama_model_loader: - type q8_0: 252 tensors
llama_model_loader: - type iq4_ks: 80 tensors
llama_model_loader: - type iq5_ks: 40 tensors
load: printing all EOG tokens:
load: - 248044 ('<|endoftext|>')
load: - 248046 ('<|im_end|>')
load: - 248063 ('<|fim_pad|>')
load: - 248064 ('<|repo_name|>')
load: - 248065 ('<|file_sep|>')
load: special tokens cache size = 33
load: token to piece cache size = 1.7581 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = qwen35moe
llm_load_print_meta: n_ctx_train = 262144
llm_load_print_meta: n_embd = 2048
llm_load_print_meta: n_layer = 40
llm_load_print_meta: n_head = 16
llm_load_print_meta: n_head_kv = 2
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_swa_pattern = 1
llm_load_print_meta: n_embd_head_k = 256
llm_load_print_meta: n_embd_head_v = 256
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: n_embd_k_gqa = 512
llm_load_print_meta: n_embd_v_gqa = 512
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 0
llm_load_print_meta: n_expert = 256
llm_load_print_meta: n_expert_used = 8
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 40
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 262144
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: mrope sections = [11, 11, 10, 0]
llm_load_print_meta: ssm_d_conv = 4
llm_load_print_meta: ssm_d_inner = 4096
llm_load_print_meta: ssm_d_state = 128
llm_load_print_meta: ssm_dt_rank = 32
llm_load_print_meta: ssm_n_group = 16
llm_load_print_meta: model type = 35B.A3B
llm_load_print_meta: model ftype = IQ4_KS - 4.25 bpw
llm_load_print_meta: model params = 34.661 B
llm_load_print_meta: model size = 19.799 GiB (4.907 BPW)
llm_load_print_meta: repeating layers = 18.792 GiB (4.798 BPW, 33.643 B parameters)
llm_load_print_meta: general.name = Qwen3.5 35B A3B
print_info: vocab type = BPE
print_info: n_vocab = 248320
print_info: n_merges = 247587
print_info: BOS token = 11 ','
print_info: EOS token = 248046 '<|im_end|>'
print_info: EOT token = 248046 '<|im_end|>'
print_info: PAD token = 248044 '<|endoftext|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 248060 '<|fim_prefix|>'
print_info: FIM SUF token = 248062 '<|fim_suffix|>'
print_info: FIM MID token = 248061 '<|fim_middle|>'
print_info: FIM PAD token = 248063 '<|fim_pad|>'
print_info: FIM REP token = 248064 '<|repo_name|>'
print_info: FIM SEP token = 248065 '<|file_sep|>'
print_info: EOG token = 248044 '<|endoftext|>'
print_info: EOG token = 248046 '<|im_end|>'
print_info: EOG token = 248063 '<|fim_pad|>'
print_info: EOG token = 248064 '<|repo_name|>'
print_info: EOG token = 248065 '<|file_sep|>'
print_info: max token length = 256
Oops: tensor with strange name output_norm.weight
------------------- Layer sizes:
Layer 0: 482.837 MiB
Layer 1: 482.837 MiB
Layer 2: 482.837 MiB
Layer 3: 475.838 MiB
Layer 4: 482.837 MiB
Layer 5: 482.837 MiB
Layer 6: 482.837 MiB
Layer 7: 475.838 MiB
Layer 8: 482.837 MiB
Layer 9: 482.837 MiB
Layer 10: 482.837 MiB
Layer 11: 475.838 MiB
Layer 12: 482.837 MiB
Layer 13: 482.837 MiB
Layer 14: 482.837 MiB
Layer 15: 475.838 MiB
Layer 16: 482.837 MiB
Layer 17: 482.837 MiB
Layer 18: 482.837 MiB
Layer 19: 475.838 MiB
Layer 20: 482.837 MiB
Layer 21: 482.837 MiB
Layer 22: 482.837 MiB
Layer 23: 475.838 MiB
Layer 24: 482.837 MiB
Layer 25: 482.837 MiB
Layer 26: 482.837 MiB
Layer 27: 475.838 MiB
Layer 28: 482.837 MiB
Layer 29: 482.837 MiB
Layer 30: 482.837 MiB
Layer 31: 475.838 MiB
Layer 32: 482.837 MiB
Layer 33: 482.837 MiB
Layer 34: 482.837 MiB
Layer 35: 475.838 MiB
Layer 36: 482.837 MiB
Layer 37: 482.837 MiB
Layer 38: 482.837 MiB
Layer 39: 475.838 MiB
Layer 40: 515.312 MiB (output layer)
Setting default device in layer 0 to 0
Setting default device in layer 1 to 0
Setting default device in layer 2 to 0
Setting default device in layer 3 to 0
Setting default device in layer 4 to 0
Setting default device in layer 5 to 0
Setting default device in layer 6 to 0
Setting default device in layer 7 to 0
Setting default device in layer 8 to 0
Setting default device in layer 9 to 0
Setting default device in layer 10 to 0
Setting default device in layer 11 to 0
Setting default device in layer 12 to 0
Setting default device in layer 13 to 0
Setting default device in layer 14 to 0
Setting default device in layer 15 to 0
Setting default device in layer 16 to 0
Setting default device in layer 17 to 0
Setting default device in layer 18 to 0
Setting default device in layer 19 to 0
Setting default device in layer 20 to 1
Setting default device in layer 21 to 1
Setting default device in layer 22 to 1
Setting default device in layer 23 to 1
Setting default device in layer 24 to 1
Setting default device in layer 25 to 1
Setting default device in layer 26 to 1
Setting default device in layer 27 to 1
Setting default device in layer 28 to 1
Setting default device in layer 29 to 1
Setting default device in layer 30 to 1
Setting default device in layer 31 to 1
Setting default device in layer 32 to 1
Setting default device in layer 33 to 1
Setting default device in layer 34 to 1
Setting default device in layer 35 to 1
Setting default device in layer 36 to 1
Setting default device in layer 37 to 1
Setting default device in layer 38 to 1
Setting default device in layer 39 to 1
Setting default device in layer 40 to 1
llm_load_tensors: ggml ctx size = 4.27 MiB
================================ max_gpu = 0
Estimated model buffer size per device:
Device 0: 9702.23 MiB
Device 1: 9702.23 MiB
No tensors in buffer type CUDA0
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 41/41 layers to GPU
llm_load_tensors: CPU buffer size = 515.31 MiB
llm_load_tensors: CUDA_Split buffer size = 19404.96 MiB
llm_load_tensors: CUDA1 buffer size = 515.32 MiB
.................................................................................................
llama_init_from_model: n_ctx = 32768
llama_init_from_model: n_batch = 1024
llama_init_from_model: n_ubatch = 1024
llama_init_from_model: flash_attn = 1
llama_init_from_model: attn_max_b = 0
llama_init_from_model: fused_moe = 1
llama_init_from_model: grouped er = 0
llama_init_from_model: fused_up_gate = 1
llama_init_from_model: fused_mmad = 1
llama_init_from_model: rope_cache = 0
llama_init_from_model: graph_reuse = 1
llama_init_from_model: k_cache_hadam = 0
llama_init_from_model: split_mode_graph_scheduling = 0
llama_init_from_model: reduce_type = f16
llama_init_from_model: sched_async = 0
llama_init_from_model: ser = -1, 0
llama_init_from_model: freq_base = 10000000.0
llama_init_from_model: freq_scale = 1
llama_kv_cache_init: CUDA_Split KV buffer size = 702.82 MiB
llama_kv_cache_init: KV cache size per device:
Device 0: 351.406 MiB
Device 1: 351.406 MiB
llama_init_from_model: KV self size = 640.00 MiB, K (f16): 320.00 MiB, V (f16): 320.00 MiB
llama_init_from_model: CUDA_Host output buffer size = 0.95 MiB
llama_init_from_model: CUDA0 compute buffer size = 152.00 MiB
llama_init_from_model: CUDA1 compute buffer size = 982.00 MiB
llama_init_from_model: CUDA_Host compute buffer size = 72.03 MiB
llama_init_from_model: graph nodes = 5386
llama_init_from_model: graph splits = 241
llama_init_from_model: enabling only_active_experts scheduling
ggml_cuda_op_reduce: ncclAllReduce failed with status 1
/home/geveent/ik_llama.cpp/ggml/src/ggml-cuda/reduce.cu:159: Fatal error
Could not attach to process. If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user. For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Operation not permitted.
No stack.
The program is not being run.
Aborted (core dumped)
@ubergarm
I'm using Ubuntu 24.04.3. It will be so cool if I can use -sm graph.
Here is the output from the commands:
You need the P2P-modified drivers for your 5090s. Use these:
https://www.reddit.com/r/LocalLLaMA/comments/1r66jyp/vllm_maximum_performance_on_multi3090/
https://github.com/aikitoria/open-gpu-kernel-modules?tab=readme-ov-file
--> It's doable but long, so be patient. Read carefully, because there are a lot of prerequisites (e.g., BIOS support for BAR, IOMMU disabled or NOT, "sudo ./NVIDIA-Linux-x86_64-595.45.04.run --no-kernel-modules", etc.)
Have phun! ;) In the end it will look something like:
vik@SuperDomeLIN:~/NVIDIA_Drivers_P2P$ nvidia-smi topo -p2p r
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7
GPU0 X OK OK OK OK OK OK OK
GPU1 OK X OK OK OK OK OK OK
GPU2 OK OK X OK OK OK OK OK
GPU3 OK OK OK X OK OK OK OK
GPU4 OK OK OK OK X OK OK OK
GPU5 OK OK OK OK OK X OK OK
GPU6 OK OK OK OK OK OK X OK
GPU7 OK OK OK OK OK OK OK X
Thanks for coming to the rescue with instructions on the patched 5090 P2P drivers! I don't have hardware to try it myself.
So basically, you need to install the patched P2P 5090 drivers on your Linux box, and then you will know it is working when you get:
geveent@geveent-MS73-HB1-000:~$ nvidia-smi topo -p2p r
GPU0 GPU1
GPU0 X OK
GPU1 OK X
@dehnhaide @ubergarm
Thank you so much for pointing me in the right direction. I got to the point where the P2P status displayed "OK". Unfortunately, ik_llama.cpp is still not running with -sm graph, and I'm stuck at the moment.
geveent@geveent-MS73-HB1-000:~$ nvidia-smi
Sun Mar 22 13:33:05 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 595.45.04 Driver Version: 595.45.04 CUDA Version: 13.2 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 5090 Off | 00000000:40:00.0 On | N/A |
| 35% 48C P8 37W / 575W | 28544MiB / 32607MiB | 2% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA GeForce RTX 5090 Off | 00000000:6A:00.0 Off | N/A |
| 33% 45C P8 16W / 575W | 31375MiB / 32607MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 5000 G /usr/lib/xorg/Xorg 190MiB |
| 0 N/A N/A 5395 G /usr/bin/gnome-shell 69MiB |
| 0 N/A N/A 6471 G /opt/Obsidian/obsidian 26MiB |
| 0 N/A N/A 6671 G ...rack-uuid=3190708988185955192 120MiB |
| 0 N/A N/A 23353 C ./llama-server 28022MiB |
| 1 N/A N/A 5000 G /usr/lib/xorg/Xorg 4MiB |
| 1 N/A N/A 23353 C ./llama-server 31352MiB |
+-----------------------------------------------------------------------------------------+
geveent@geveent-MS73-HB1-000:~$ nvidia-smi topo -p2p r
GPU0 GPU1
GPU0 X OK
GPU1 OK X
...
geveent@geveent-MS73-HB1-000:~/ik_llama.cpp/build/bin$ sudo lspci -vvv | grep -i -A40 'VGA compatible controller' | grep Region
[sudo] password for geveent:
Region 0: Memory at a4000000 (32-bit, non-prefetchable) [size=16M]
Region 1: Memory at a5000000 (32-bit, non-prefetchable) [size=256K]
Region 2: I/O ports at 2000 [size=128]
Region 0: Memory at c0000000 (32-bit, non-prefetchable) [size=64M]
Region 1: Memory at 202000000000 (64-bit, prefetchable) [size=32G]
Region 3: Memory at 202812000000 (64-bit, prefetchable) [size=32M]
Region 5: I/O ports at 9000 [size=128]
Region 0: Memory at d0000000 (32-bit, non-prefetchable) [size=64M]
Region 1: Memory at 203000000000 (64-bit, prefetchable) [size=32G]
Region 3: Memory at 203812000000 (64-bit, prefetchable) [size=32M]
Region 5: I/O ports at b000 [size=128]
geveent@geveent-MS73-HB1-000:~/ik_llama.cpp/build/bin$ ./llama-server -m "/home/geveent/lmstudio/models/ubergarm/Qwen3.5-35B-A3B-GGUF/Qwen3.5-35B-A3B-IQ4_KS.gguf" -c 32768 -t 8 -ngl 99 -fa on -ub 1024 -b 1024 --parallel 1 --ctx-checkpoints 16 --host 0.0.0.0 --port 11235 --jinja -sm graph
INFO [ main] build info | tid="126177819144192" timestamp=1774200905 build=4335 commit="56e026f6"
INFO [ main] system info | tid="126177819144192" timestamp=1774200905 n_threads=8 n_threads_batch=-1 total_threads=112 system_info="AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | "
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes, VRAM: 32088 MiB
Device 1: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes, VRAM: 32110 MiB
=============================== NCCL main communicator initialized
CUDA0: using device CUDA0 - 30871 MiB free
CUDA1: using device CUDA1 - 31454 MiB free
llama_model_loader: loaded meta data with 47 key-value pairs and 733 tensors from /home/geveent/lmstudio/models/ubergarm/Qwen3.5-35B-A3B-GGUF/Qwen3.5-35B-A3B-IQ4_KS.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen35moe
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.sampling.top_k i32 = 20
...
=========================== ggml_cuda_set_peer_access: Enabling Peer Access between Devices 0->1
=========================== ggml_cuda_set_peer_access: Enabling Peer Access between Devices 1->0
llama_kv_cache_init: CUDA_Split KV buffer size = 702.82 MiB
llama_kv_cache_init: KV cache size per device:
Device 0: 351.406 MiB
Device 1: 351.406 MiB
llama_init_from_model: KV self size = 640.00 MiB, K (f16): 320.00 MiB, V (f16): 320.00 MiB
llama_init_from_model: CUDA_Host output buffer size = 0.95 MiB
llama_init_from_model: CUDA0 compute buffer size = 152.00 MiB
llama_init_from_model: CUDA1 compute buffer size = 982.00 MiB
llama_init_from_model: CUDA_Host compute buffer size = 72.03 MiB
llama_init_from_model: graph nodes = 5386
llama_init_from_model: graph splits = 241
llama_init_from_model: enabling only_active_experts scheduling
ggml_cuda_op_reduce: ncclAllReduce failed with status 1
/home/geveent/ik_llama.cpp/ggml/src/ggml-cuda/reduce.cu:159: Fatal error
Could not attach to process. If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user. For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Operation not permitted.
No stack.
The program is not being run.
Aborted (core dumped)
The following is the answer I got from Gemini Pro:
The discussion you linked highlights exactly why -sm graph (split-mode graph) is causing headaches on a dual RTX 5090 setup. While graph splitting is incredibly fast on enterprise hardware like A100s or H100s, it is actually the wrong approach for consumer GPUs and will result in crashes or severely degraded performance.
Here is the technical reality of why -sm graph is fighting your hardware:
1. The P2P Roadblock
As the diagnostic logs in that Hugging Face thread reveal, running nvidia-smi topo -p2p r on dual 5090s returns CNS (Chipset Not Supported). NVIDIA permanently disables Peer-to-Peer (P2P) communication and NVLink on their 5090s to segment them from data center cards. The GPUs cannot talk to each other directly.
2. Why Graph Mode Fails Here
When you use -sm graph, you are instructing the inference engine to split the computational graph so that both GPUs work on the exact same token simultaneously. To do this, the GPUs must synchronize their math using NCCL operations (like ncclAllReduce) hundreds of times for every single token generated.
Because P2P is disabled on the 5090s, NCCL either:
- Crashes: It attempts to use direct communication, hits the NVIDIA lock, and throws the "ncclAllReduce failed with status 1" error seen in the thread.
- Chokes: If you force NCCL to bypass P2P and use shared system memory, it routes those 200+ synchronization steps per token through your system RAM and CPU. This creates a massive PCIe traffic jam, destroying your tokens-per-second throughput.
The Path Forward for Maximum Speed
For a consumer setup without NVLink, you must use Layer Splitting instead of Graph Splitting.
By running your model without the -sm graph flag, llama.cpp defaults to assigning sequential blocks of layers to each GPU. GPU 0 finishes its layers and hands the result to GPU 1. This method only requires the GPUs to communicate across the PCIe bus a handful of times per token (usually less than 10) instead of hundreds.
To maximize your speed:
- Stop using -sm graph.
- Disable NCCL completely in your build as we discussed previously (-DGGML_CUDA=ON without the NCCL flags).
- Let the standard CUDA backend handle the layer splits. It is optimized to keep your 5090s calculating rather than waiting on a choked PCIe bus.
I got it to run without error by installing NCCL 2.29.7 instead of NCCL 2.18.3. I started at 9am and now it is 5pm, but I squeezed a little more speed out of my setup. I will report back after I play with this more. At the moment, the LLMs generate random characters when I use -sm graph.
Check here for another guy with 2x RTX 6000 (Blackwell) having issues: https://github.com/ikawrakow/ik_llama.cpp/issues/1495#issuecomment-4111368557
Specifically you might be able to recompile with -DGGML_NCCL=OFF and see how that works.
Also, you might be able to play with different dtypes/quants for --graph-reduce-type f32, but I'm just spitballing at this point.
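For the NCCL-off rebuild, the steps would look something like the earlier build commands (a sketch; the flag name here follows the suggestion above, while earlier in the thread -DGGML_CUDA_NCCL=ON was used to turn it on, so check cmake's output for which spelling your version recognizes):

```shell
# rebuild with NCCL disabled so the standard CUDA layer-split path is used
cd ik_llama.cpp
rm -rf build
cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON -DGGML_NCCL=OFF
cmake --build build --config Release -j "$(nproc)"
```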
@ubergarm
Hi John! I hope all is well. I couldn't get -sm graph to work. I gave up for the time being, and then I stumbled upon this Reddit post:
https://www.reddit.com/r/unsloth/comments/1sc9jdc/gemma_4_other_low_bit_quants_gibberish_with_cuda/
So, I tried downgrading my CUDA from 13.2 to 13.0. And voilà, "-sm graph" worked like magic!!
When running ubergarm Qwen3.5-35B-A3B-IQ4_KS, I got about 180 t/s. There was virtually no speed difference between -sm graph and -sm layer.
When running ubergarm Qwen3.5-27B-IQ5_KS, I got about 80 t/s in -sm graph mode and 60 t/s in -sm layer mode. The GPU utilization was about 90% with -sm graph and about 50% with -sm layer.
In order to use "-sm graph", the model needs to support the graph split mode, correct? It seems like GLM-5 and Kimi-K2.5 do not support graph split mode.
Here is a list of supported model architectures for -sm graph: https://github.com/ikawrakow/ik_llama.cpp/blob/main/src/llama.cpp#L1983-L2003
Correct, the DeepSeek MLA attention models, e.g. GLM-5 and Kimi-K2.5, do not support -sm graph yet.