Testing IQ4_NL
W790E Sage + QYFS + 512G + RTX5090
Tensor blk.61.ffn_down_exps.weight buffer type overriden to CPU
llm_load_tensors: offloading 62 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 63/63 layers to GPU
llm_load_tensors: CPU buffer size = 120528.00 MiB
llm_load_tensors: CUDA_Host buffer size = 329.70 MiB
llm_load_tensors: CUDA0 buffer size = 3441.36 MiB
....................................................................................................
~ggml_backend_cuda_context: have 0 graphs
===================================== llama_init_from_model: f16
llama_init_from_model: n_ctx = 200192
llama_init_from_model: n_batch = 4096
llama_init_from_model: n_ubatch = 4096
llama_init_from_model: flash_attn = 1
llama_init_from_model: attn_max_b = 512
llama_init_from_model: fused_moe = 1
llama_init_from_model: grouped er = 1
llama_init_from_model: fused_up_gate = 1
llama_init_from_model: fused_mmad = 1
llama_init_from_model: rope_cache = 0
llama_init_from_model: graph_reuse = 1
llama_init_from_model: k_cache_hadam = 0
llama_init_from_model: split_mode_graph_scheduling = 0
llama_init_from_model: reduce_type = f16
llama_init_from_model: sched_async = 0
llama_init_from_model: ser = -1, 0
llama_init_from_model: freq_base = 5000000.0
llama_init_from_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 19696.65 MiB
llama_init_from_model: KV self size = 19696.62 MiB, K (q6_0): 9848.31 MiB, V (q6_0): 9848.31 MiB
llama_init_from_model: CUDA_Host output buffer size = 0.76 MiB
llama_init_from_model: CUDA0 compute buffer size = 3222.00 MiB
llama_init_from_model: CUDA_Host compute buffer size = 1612.05 MiB
llama_init_from_model: graph nodes = 2361
llama_init_from_model: graph splits = 126
XXXXXXXXXXXXXXXXXXXXX Setting only active experts offload
main: n_kv_max = 200192, n_batch = 4096, n_ubatch = 4096, flash_attn = 1, n_gpu_layers = 99, n_threads = 101, n_threads_batch = 101
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 4096 | 1024 | 0 | 12.206 | 335.58 | 37.231 | 27.50 |
| 4096 | 1024 | 4096 | 12.200 | 335.75 | 30.720 | 33.33 |
| 4096 | 1024 | 8192 | 22.922 | 178.69 | 32.279 | 31.72 |
| 4096 | 1024 | 12288 | 21.965 | 186.48 | 46.059 | 22.23 |
| 4096 | 1024 | 16384 | 23.216 | 176.43 | 47.742 | 21.45 |
| 4096 | 1024 | 20480 | 23.498 | 174.31 | 47.164 | 21.71 |
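For reference, a sweep like the one above would typically come from ik_llama.cpp's llama-sweep-bench tool. Below is a minimal sketch of such an invocation, not the exact command used here: the binary location, model path, and the tensor-override pattern are assumptions, while the numeric settings mirror the log (200192 context, 4096 batch/ubatch, flash attention, q6_0 K/V cache, 101 threads, -ngl 99).

```bash
# Sketch only: paths and the -ot pattern are assumptions, not copied from the run above.
./build/bin/llama-sweep-bench \
    -m /models/model-IQ4_NL-00001-of-00009.gguf \
    -c 200192 -b 4096 -ub 4096 \
    -fa \
    -ctk q6_0 -ctv q6_0 \
    -ngl 99 \
    -ot exps=CPU \
    -t 101 -tb 101
```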
I love seeing your wild vibe-coded creations! Happy lunar new year!
I am going to ask a stupid question and am fine with being slapped!
Why can't I get the model in ik_llama to properly use the GPU at full tilt like it does in vllm / sglang? I love "-sm graph" + p2p, which gives a strong push, but I'm still at about half the t/s I get in sglang or vllm.
And I am not talking about multiple parallel requests. Nope, a single one.
Is it some configuration / ik_llama command-line parametrization I am doing wrong, or is it architectural, i.e. *llama simply can't be that fast (given the lack of true tensor parallelism), and I have to move on with my life and accept it for what it is?
I'll try to answer you over on your discussion: https://huggingface.co/ubergarm/Qwen3.5-397B-A17B-GGUF/discussions/9
tl;dr: are you using -ngl 999 for full GPU offload? If you add -ot exps=CPU or --cpu-moe or --n-cpu-moe 50 or the like, it won't offload all layers onto the GPU.
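To make that concrete, here is a hedged sketch of the two kinds of invocation (llama-server is used only as an example, and the model path is a placeholder):

```bash
# Full GPU offload: every tensor, including the routed experts, goes to VRAM
# (only feasible when the whole quant fits on the GPU).
./build/bin/llama-server -m /models/model-IQ4_NL.gguf -ngl 999 -fa

# Hybrid offload: all layers are "offloaded", but tensors matching "exps" are
# overridden back to CPU, so token generation is limited by CPU/RAM bandwidth
# rather than the GPU.
./build/bin/llama-server -m /models/model-IQ4_NL.gguf -ngl 999 -ot exps=CPU -fa
```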



