thank you! - running on llama.cpp on M2 Max now :-)

#8 by ljupco

Thanks for that. It's the biggest model I can run at a decent speed on an M2 Max with 96 GB of unified memory. At 107B-A7B4 it's about the biggest I can make use of comfortably. (The next size up, e.g. MiniMax-M2.7: even at the smallest quant, its A10B drops TG to ~20 tok/s, which is too slow.)

After lots of prodding, agents (Codex, OpenCode/GLM/DeepSeek/Kimi) adapted a llama.cpp fork to run a 4-bit .gguf quant for me. This is great, I'm loving this. :-)
I uploaded one 4-bit quant of the model here:
https://huggingface.co/ljupco/Ling-2.6-flash-GGUF
The code to run it is in this llama.cpp branch:
https://github.com/ljubomirj/llama.cpp/tree/LJ-Ling-2.6-flash-r2
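
In case it helps anyone else on Apple Silicon, here is roughly how to reproduce the setup from the two links above. This is a minimal sketch, assuming the fork builds like upstream llama.cpp (Metal is enabled by default on macOS); the quant filename is taken from my benchmark command below.

```
# fetch the 4-bit quant from the HF repo linked above
huggingface-cli download ljupco/Ling-2.6-flash-GGUF \
  Ling-2.6-flash-IQ4_NL-bailing_hybrid-20260505-LJ.gguf --local-dir models

# build the fork branch
git clone --branch LJ-Ling-2.6-flash-r2 https://github.com/ljubomirj/llama.cpp
cd llama.cpp
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j

# serve it with the same 36K context used in the benchmark below
build/bin/llama-server \
  -m ../models/Ling-2.6-flash-IQ4_NL-bailing_hybrid-20260505-LJ.gguf -c 36000
```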

It holds the TG speed (S_TG tok/s) quite well as the context depth increases, which makes it usable; see the quick check after the benchmark output below.

build/bin/llama-batched-bench -m ~/llama.cpp/models/Ling-2.6-flash-IQ4_NL-bailing_hybrid-20260505-LJ.gguf -npp 512,1024,2048,4096,8192,16384,32768 -ntg 128 -npl 1 -c 36000

main: n_kv_max = 36096, n_batch = 2048, n_ubatch = 512, flash_attn = -1, is_pp_shared = 0, is_tg_separate = 0, n_gpu_layers = -1, n_threads = 8, n_threads_batch = 8

|    PP |  TG | B |  N_KV |  T_PP s | S_PP t/s | T_TG s | S_TG t/s |     T s |  S t/s |
|-------|-----|---|-------|---------|----------|--------|----------|---------|--------|
|   512 | 128 | 1 |   640 |   1.201 |   426.29 |  2.829 |    45.25 |   4.030 | 158.83 |
|  1024 | 128 | 1 |  1152 |   2.804 |   365.22 |  3.682 |    34.76 |   6.486 | 177.62 |
|  2048 | 128 | 1 |  2176 |   6.085 |   336.54 |  3.691 |    34.67 |   9.777 | 222.56 |
|  4096 | 128 | 1 |  4224 |  12.587 |   325.41 |  3.794 |    33.74 |  16.381 | 257.87 |
|  8192 | 128 | 1 |  8320 |  26.703 |   306.78 |  4.023 |    31.82 |  30.726 | 270.78 |
| 16384 | 128 | 1 | 16512 |  58.853 |   278.39 |  4.358 |    29.37 |  63.211 | 261.22 |
| 32768 | 128 | 1 | 32896 | 134.525 |   243.58 |  4.932 |    25.95 | 139.457 | 235.89 |

llama_perf_context_print: load time = 4333.31 ms
llama_perf_context_print: prompt eval time = 242936.45 ms / 65040 tokens ( 3.74 ms per token, 267.72 tokens per second)
llama_perf_context_print: eval time = 27298.86 ms / 896 runs ( 30.47 ms per token, 32.82 tokens per second)
llama_perf_context_print: total time = 274404.15 ms / 65936 tokens
llama_perf_context_print: graphs reused = 889
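
A quick sanity check on the "holds the speed" claim, using only the first and last S_TG rows of the table above:

```
# TG throughput retained when the prefill grows 64x (512 -> 32768 tokens)
awk 'BEGIN { printf "TG retained: %.0f%%\n", 100 * 25.95 / 45.25 }'   # ~57%
```

So TG loses less than half its speed across a 64x increase in depth.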

inclusionAI org

Wow, this is amazing @ljupco! Thanks SO much for sharing your wonderful work. It's quite a revelation for us as open model builders. :)
