Testing IQ5_K

#8
by shewin - opened

Tensor blk.61.ffn_down_exps.weight (size = 954.00 MiB) buffer type overriden to CUDA_Host

Allocating 154.28 GiB of pinned host memory, this may take a while.
Using pinned host memory improves PP performance by a significant margin.
But if it takes too long for your model and amount of patience, kill the process and run using

GGML_CUDA_NO_PINNED=1 your_command_goes_here
done allocating 154.28 GiB in 43265.6 ms

llm_load_tensors: offloading 62 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 63/63 layers to GPU
llm_load_tensors: CUDA_Host buffer size = 157978.76 MiB
llm_load_tensors: CUDA0 buffer size = 3578.73 MiB
....................................................................................................
~ggml_backend_cuda_context: have 0 graphs
llama_init_from_model: n_ctx = 100096
llama_init_from_model: n_batch = 2048
llama_init_from_model: n_ubatch = 2048
llama_init_from_model: flash_attn = 1
llama_init_from_model: attn_max_b = 512
llama_init_from_model: fused_moe = 1
llama_init_from_model: grouped er = 1
llama_init_from_model: fused_up_gate = 1
llama_init_from_model: fused_mmad = 1
llama_init_from_model: rope_cache = 0
llama_init_from_model: graph_reuse = 1
llama_init_from_model: k_cache_hadam = 0
llama_init_from_model: v_cache_hadam = 0
llama_init_from_model: split_mode_graph_scheduling = 0
llama_init_from_model: reduce_type = f16
llama_init_from_model: sched_async = 0
llama_init_from_model: ser = -1, 0
llama_init_from_model: freq_base = 5000000.0
llama_init_from_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 24242.00 MiB
llama_init_from_model: KV self size = 24242.00 MiB, K (f16): 12121.00 MiB, V (f16): 12121.00 MiB
llama_init_from_model: CUDA_Host output buffer size = 0.76 MiB
llama_init_from_model: CUDA0 compute buffer size = 2125.01 MiB
llama_init_from_model: CUDA_Host compute buffer size = 415.02 MiB
llama_init_from_model: graph nodes = 2361
llama_init_from_model: graph splits = 126
llama_init_from_model: enabling only_active_experts scheduling
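The 24242 MiB f16 KV cache reported above is consistent with a simple size formula (n_ctx × layers × per-layer KV width × bytes per value, for each of K and V). A minimal sanity-check sketch — the model dimensions here (62 KV-bearing layers, an effective KV width of 1024 values per token per layer, e.g. 8 KV heads × head_dim 128) are assumptions chosen to match the log, not read from the GGUF:

```python
# Hedged sketch: reproduce the reported f16 KV-cache size.
# n_layers and kv_width are assumptions, not read from the model file.
n_ctx = 100096
n_layers = 62
kv_width = 1024      # K values per token per layer (same again for V)
bytes_f16 = 2

k_mib = n_ctx * n_layers * kv_width * bytes_f16 / 1024**2
print(f"K (f16): {k_mib:.2f} MiB, KV total: {2 * k_mib:.2f} MiB")
# -> K (f16): 12121.00 MiB, KV total: 24242.00 MiB
```

Under those assumed dims this lands exactly on the logged `K (f16): 12121.00 MiB` and `KV self size = 24242.00 MiB`.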

main: n_kv_max = 100096, n_batch = 2048, n_ubatch = 2048, flash_attn = 1, n_gpu_layers = 99, n_threads = 101, n_threads_batch = 101

|   PP |  TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-----:|----:|-----:|-------:|---------:|-------:|---------:|
| 2048 | 512 |    0 |  3.653 |   560.65 | 23.748 |    21.56 |
| 2048 | 512 | 2048 |  3.384 |   605.21 | 23.956 |    21.37 |
| 2048 | 512 | 4096 |  3.414 |   599.86 | 24.295 |    21.07 |
| 2048 | 512 | 6144 |  3.446 |   594.38 | 24.470 |    20.92 |
| 2048 | 512 | 8192 |  3.499 |   585.38 | 24.869 |    20.59 |
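For reading the table: the S_PP and S_TG columns are just tokens divided by wall time for each chunk. A quick sketch recomputing the first row (the tiny mismatch in S_PP comes from T_PP being rounded to three decimals in the printout):

```python
# Hedged sketch: llama-sweep-bench speed columns are tokens / seconds.
pp, tg = 2048, 512          # tokens processed / generated per chunk
t_pp, t_tg = 3.653, 23.748  # seconds, from the N_KV = 0 row above
s_pp = pp / t_pp
s_tg = tg / t_tg
print(f"S_PP = {s_pp:.2f} t/s, S_TG = {s_tg:.2f} t/s")
```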

Top-tier open source model!

With other options:

~ggml_backend_cuda_context: have 0 graphs
llama_init_from_model: n_ctx = 130048
llama_init_from_model: n_batch = 4096
llama_init_from_model: n_ubatch = 4096
llama_init_from_model: flash_attn = 1
llama_init_from_model: attn_max_b = 4096
llama_init_from_model: fused_moe = 1
llama_init_from_model: grouped er = 1
llama_init_from_model: fused_up_gate = 1
llama_init_from_model: fused_mmad = 1
llama_init_from_model: rope_cache = 0
llama_init_from_model: graph_reuse = 1
llama_init_from_model: k_cache_hadam = 0
llama_init_from_model: v_cache_hadam = 0
llama_init_from_model: split_mode_graph_scheduling = 0
llama_init_from_model: reduce_type = f16
llama_init_from_model: sched_async = 0
llama_init_from_model: ser = -1, 0
llama_init_from_model: freq_base = 5000000.0
llama_init_from_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 16732.28 MiB
llama_init_from_model: KV self size = 16732.25 MiB, K (q8_0): 8366.12 MiB, V (q8_0): 8366.12 MiB
llama_init_from_model: CUDA_Host output buffer size = 0.76 MiB
llama_init_from_model: CUDA0 compute buffer size = 3222.00 MiB
llama_init_from_model: CUDA_Host compute buffer size = 1064.05 MiB
llama_init_from_model: graph nodes = 2361
llama_init_from_model: graph splits = 126
llama_init_from_model: enabling only_active_experts scheduling
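This second run quantizes the KV cache to q8_0, which should come out at 17/32 of the f16 size: a q8_0 block packs 32 values into 32 int8 bytes plus one f16 scale, i.e. 34 bytes versus 64 bytes for 32 f16 values. A hedged check, reusing the same assumed model dims as before (62 layers, KV width 1024):

```python
# Hedged sketch: q8_0 KV should be 34/64 (= 17/32) of the f16 size.
# n_layers and kv_width are assumptions, not read from the model file.
n_ctx = 130048
n_layers = 62
kv_width = 1024

f16_k_mib = n_ctx * n_layers * kv_width * 2 / 1024**2
q8_0_k_mib = f16_k_mib * 34 / 64
print(f"f16 K would be {f16_k_mib:.2f} MiB; q8_0 K: {q8_0_k_mib:.2f} MiB")
```

That gives 8366.125 MiB per side, matching the logged `K (q8_0): 8366.12 MiB`.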

main: n_kv_max = 130048, n_batch = 4096, n_ubatch = 4096, flash_attn = 1, n_gpu_layers = 99, n_threads = 101, n_threads_batch = 101

|   PP |   TG |  N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-----:|-----:|------:|-------:|---------:|-------:|---------:|
| 4096 | 1024 |     0 |  3.960 |  1034.23 | 44.808 |    22.85 |
| 4096 | 1024 |  4096 |  3.999 |  1024.23 | 33.446 |    30.62 |
| 4096 | 1024 |  8192 |  4.132 |   991.26 | 35.122 |    29.16 |
| 4096 | 1024 | 12288 |  4.250 |   963.68 | 35.847 |    28.57 |
| 4096 | 1024 | 16384 |  4.384 |   934.34 | 36.511 |    28.05 |

Download in progress...

I always love seeing these reports! Thanks as usual @shewin
