Testing IQ5_K
Tensor blk.61.ffn_down_exps.weight (size = 954.00 MiB) buffer type overriden to CUDA_Host
Allocating 154.28 GiB of pinned host memory, this may take a while.
Using pinned host memory improves PP performance by a significant margin.
But if it takes too long for your model and amount of patience, kill the process and run using
GGML_CUDA_NO_PINNED=1 your_command_goes_here
done allocating 154.28 GiB in 43265.6 ms
llm_load_tensors: offloading 62 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 63/63 layers to GPU
llm_load_tensors: CUDA_Host buffer size = 157978.76 MiB
llm_load_tensors: CUDA0 buffer size = 3578.73 MiB
....................................................................................................
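As a quick cross-check (my arithmetic, not part of the log), the CUDA_Host buffer size reported above matches the pinned allocation:

```python
# CUDA_Host buffer (MiB) from the load log, converted to GiB
host_buffer_mib = 157978.76
print(round(host_buffer_mib / 1024, 2))  # 154.28 GiB, the pinned allocation size
# The allocation took 43265.6 ms, i.e. roughly 3.6 GiB/s of pinning throughput
print(round(host_buffer_mib / 1024 / 43.2656, 2))
```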
~ggml_backend_cuda_context: have 0 graphs
llama_init_from_model: n_ctx = 100096
llama_init_from_model: n_batch = 2048
llama_init_from_model: n_ubatch = 2048
llama_init_from_model: flash_attn = 1
llama_init_from_model: attn_max_b = 512
llama_init_from_model: fused_moe = 1
llama_init_from_model: grouped er = 1
llama_init_from_model: fused_up_gate = 1
llama_init_from_model: fused_mmad = 1
llama_init_from_model: rope_cache = 0
llama_init_from_model: graph_reuse = 1
llama_init_from_model: k_cache_hadam = 0
llama_init_from_model: v_cache_hadam = 0
llama_init_from_model: split_mode_graph_scheduling = 0
llama_init_from_model: reduce_type = f16
llama_init_from_model: sched_async = 0
llama_init_from_model: ser = -1, 0
llama_init_from_model: freq_base = 5000000.0
llama_init_from_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 24242.00 MiB
llama_init_from_model: KV self size = 24242.00 MiB, K (f16): 12121.00 MiB, V (f16): 12121.00 MiB
llama_init_from_model: CUDA_Host output buffer size = 0.76 MiB
llama_init_from_model: CUDA0 compute buffer size = 2125.01 MiB
llama_init_from_model: CUDA_Host compute buffer size = 415.02 MiB
llama_init_from_model: graph nodes = 2361
llama_init_from_model: graph splits = 126
llama_init_from_model: enabling only_active_experts scheduling
main: n_kv_max = 100096, n_batch = 2048, n_ubatch = 2048, flash_attn = 1, n_gpu_layers = 99, n_threads = 101, n_threads_batch = 101
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 2048 | 512 | 0 | 3.653 | 560.65 | 23.748 | 21.56 |
| 2048 | 512 | 2048 | 3.384 | 605.21 | 23.956 | 21.37 |
| 2048 | 512 | 4096 | 3.414 | 599.86 | 24.295 | 21.07 |
| 2048 | 512 | 6144 | 3.446 | 594.38 | 24.470 | 20.92 |
| 2048 | 512 | 8192 | 3.499 | 585.38 | 24.869 | 20.59 |
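Sanity check on the table: the throughput columns are simply tokens over elapsed seconds (tiny last-digit differences come from the printed times being rounded):

```python
# First row of the sweep above: PP = 2048 tokens in 3.653 s, TG = 512 tokens in 23.748 s
pp_tps = 2048 / 3.653
tg_tps = 512 / 23.748
print(f"{pp_tps:.1f} t/s PP, {tg_tps:.2f} t/s TG")  # matches S_PP ~560.6, S_TG ~21.56
```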
With other options (n_batch/n_ubatch = 4096 and a q8_0-quantized KV cache):
~ggml_backend_cuda_context: have 0 graphs
llama_init_from_model: n_ctx = 130048
llama_init_from_model: n_batch = 4096
llama_init_from_model: n_ubatch = 4096
llama_init_from_model: flash_attn = 1
llama_init_from_model: attn_max_b = 4096
llama_init_from_model: fused_moe = 1
llama_init_from_model: grouped er = 1
llama_init_from_model: fused_up_gate = 1
llama_init_from_model: fused_mmad = 1
llama_init_from_model: rope_cache = 0
llama_init_from_model: graph_reuse = 1
llama_init_from_model: k_cache_hadam = 0
llama_init_from_model: v_cache_hadam = 0
llama_init_from_model: split_mode_graph_scheduling = 0
llama_init_from_model: reduce_type = f16
llama_init_from_model: sched_async = 0
llama_init_from_model: ser = -1, 0
llama_init_from_model: freq_base = 5000000.0
llama_init_from_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 16732.28 MiB
llama_init_from_model: KV self size = 16732.25 MiB, K (q8_0): 8366.12 MiB, V (q8_0): 8366.12 MiB
llama_init_from_model: CUDA_Host output buffer size = 0.76 MiB
llama_init_from_model: CUDA0 compute buffer size = 3222.00 MiB
llama_init_from_model: CUDA_Host compute buffer size = 1064.05 MiB
llama_init_from_model: graph nodes = 2361
llama_init_from_model: graph splits = 126
llama_init_from_model: enabling only_active_experts scheduling
main: n_kv_max = 130048, n_batch = 4096, n_ubatch = 4096, flash_attn = 1, n_gpu_layers = 99, n_threads = 101, n_threads_batch = 101
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 4096 | 1024 | 0 | 3.960 | 1034.23 | 44.808 | 22.85 |
| 4096 | 1024 | 4096 | 3.999 | 1024.23 | 33.446 | 30.62 |
| 4096 | 1024 | 8192 | 4.132 | 991.26 | 35.122 | 29.16 |
| 4096 | 1024 | 12288 | 4.250 | 963.68 | 35.847 | 28.57 |
| 4096 | 1024 | 16384 | 4.384 | 934.34 | 36.511 | 28.05 |
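The per-token KV footprint of the two runs differs by almost exactly the f16-to-q8_0 storage ratio: q8_0 packs 32 values plus one f16 scale into 34 bytes, i.e. 8.5 bits per value versus 16 for f16. A quick check against the logged sizes:

```python
# KV cache size (MiB) divided by context length, for each run above
f16_per_tok = 24242.00 / 100096   # first run:  f16 K/V,  n_ctx = 100096
q8_per_tok = 16732.25 / 130048    # second run: q8_0 K/V, n_ctx = 130048
print(round(f16_per_tok / q8_per_tok, 3))  # ~1.882, the measured ratio
print(round(16 / 8.5, 3))                  # 1.882, the theoretical f16/q8_0 ratio
```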
Download in progress...