thank you! - running on llama.cpp on M2 Max now :-)

#8 by ljupco

Thanks for that. It's the biggest model I can run at a decent speed on an M2 Max with 96 GB of unified memory. At 107B-A7B4 it's about the biggest I can make use of comfortably. (The next size up, e.g. MiniMax-M2.7: even at the smallest quant, its A10B drops TG to ~20 tok/s, which is too slow.)

After lots of prodding, agents (Codex, OpenCode/GLM/DeepSeek/Kimi) adapted a llama.cpp fork to run a 4-bit .gguf quant for me. This is great, I'm loving this. :-)
I uploaded one 4-bit quant of the model here:
https://huggingface.co/ljupco/Ling-2.6-flash-GGUF
The code to run it is in this llama.cpp branch:
https://github.com/ljubomirj/llama.cpp/tree/LJ-Ling-2.6-flash-r2
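
In case it helps anyone else on Apple Silicon, here is roughly how to reproduce the setup from the two links above. This is a minimal sketch, assuming the fork builds like upstream llama.cpp (Metal is enabled by default on macOS); the quant filename is taken from my benchmark command below.

```
# fetch the 4-bit quant from the HF repo linked above
huggingface-cli download ljupco/Ling-2.6-flash-GGUF \
  Ling-2.6-flash-IQ4_NL-bailing_hybrid-20260505-LJ.gguf --local-dir models

# build the fork branch
git clone --branch LJ-Ling-2.6-flash-r2 https://github.com/ljubomirj/llama.cpp
cd llama.cpp
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j

# serve it with the same 36K context used in the benchmark below
build/bin/llama-server \
  -m ../models/Ling-2.6-flash-IQ4_NL-bailing_hybrid-20260505-LJ.gguf -c 36000
```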

It holds the TG speed (S_TG tok/s) quite well as the context depth increases, which makes it usable; see the quick check after the benchmark output below.

build/bin/llama-batched-bench -m ~/llama.cpp/models/Ling-2.6-flash-IQ4_NL-bailing_hybrid-20260505-LJ.gguf -npp 512,1024,2048,4096,8192,16384,32768 -ntg 128 -npl 1 -c 36000

main: n_kv_max = 36096, n_batch = 2048, n_ubatch = 512, flash_attn = -1, is_pp_shared = 0, is_tg_separate = 0, n_gpu_layers = -1, n_threads = 8, n_threads_batch = 8

|    PP |  TG | B |  N_KV |  T_PP s | S_PP t/s | T_TG s | S_TG t/s |     T s |  S t/s |
|-------|-----|---|-------|---------|----------|--------|----------|---------|--------|
|   512 | 128 | 1 |   640 |   1.201 |   426.29 |  2.829 |    45.25 |   4.030 | 158.83 |
|  1024 | 128 | 1 |  1152 |   2.804 |   365.22 |  3.682 |    34.76 |   6.486 | 177.62 |
|  2048 | 128 | 1 |  2176 |   6.085 |   336.54 |  3.691 |    34.67 |   9.777 | 222.56 |
|  4096 | 128 | 1 |  4224 |  12.587 |   325.41 |  3.794 |    33.74 |  16.381 | 257.87 |
|  8192 | 128 | 1 |  8320 |  26.703 |   306.78 |  4.023 |    31.82 |  30.726 | 270.78 |
| 16384 | 128 | 1 | 16512 |  58.853 |   278.39 |  4.358 |    29.37 |  63.211 | 261.22 |
| 32768 | 128 | 1 | 32896 | 134.525 |   243.58 |  4.932 |    25.95 | 139.457 | 235.89 |

llama_perf_context_print: load time = 4333.31 ms
llama_perf_context_print: prompt eval time = 242936.45 ms / 65040 tokens ( 3.74 ms per token, 267.72 tokens per second)
llama_perf_context_print: eval time = 27298.86 ms / 896 runs ( 30.47 ms per token, 32.82 tokens per second)
llama_perf_context_print: total time = 274404.15 ms / 65936 tokens
llama_perf_context_print: graphs reused = 889
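
A quick sanity check on the "holds the speed" claim, using only the first and last S_TG rows of the table above:

```
# TG throughput retained when the prefill grows 64x (512 -> 32768 tokens)
awk 'BEGIN { printf "TG retained: %.0f%%\n", 100 * 25.95 / 45.25 }'   # ~57%
```

So TG loses less than half its speed across a 64x increase in depth.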

inclusionAI org

Wow, this is amazing @ljupco! Thanks SO much for sharing your wonderful work. It's quite a revelation for us as open model builders. :)
