Does the IQ4_XS have the specific ik_llama optimizations?

#4
by MrHills-rs - opened

For example, does it work with -rtr? That's only for quantizations made with IK if I recall correctly?

@MrHills-rs

I forget whether ik_llama.cpp will make an iq4_xs_r8 when using -rtr. You can try and see; I think it probably will, though it may not be any faster unless you're running those tensors on CPU, and even then PP will likely suffer while TG might see marginal gains.

Best advice is to A/B test your command with and without -rtr using llama-sweep-bench; then you'll see the full profile.
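A minimal A/B harness might look like this (a sketch only: the model path, context size, and thread count are placeholders, and llama-sweep-bench flags may vary by build):

```shell
# Hypothetical A/B run: same sweep with and without -rtr, then compare
# the PP and TG columns of the two outputs. All values are placeholders.
model=/path/to/MiniMax-M2.5-IQ4_XS-00001-of-00004.gguf
for extra in "" "-rtr"; do
    echo "=== extra flags: '$extra' ==="
    ./build/bin/llama-sweep-bench \
        --model "$model" \
        -c 32768 -ub 4096 -b 4096 \
        --threads 96 $extra
done
```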

Using -rtr gives me this error:

~/AI/ik/src/llama.cpp:4340: GGML_ASSERT(l.ffn_up_gate_exps->type == l.ffn_up_exps->type && l.ffn_up_gate_exps->type == l.ffn_gate_exps->type) failed
ptrace: Operation not permitted.
No stack.
The program is not being run.

It's not a big deal, mind you. I just wanted to know if I'm doing something wrong or if it's normal behavior.

@MrHills-rs

I just gave the IQ4_XS a try with a CPU-only build of ik_llama.cpp at tip of main, and it seems to be working. I assume you are using ik_llama.cpp despite the folder being renamed to llama.cpp in your example?

#!/usr/bin/env bash

model=/mnt/raid/hf/MiniMax-M2.5-GGUF/IQ4_XS/MiniMax-M2.5-IQ4_XS-00001-of-00004.gguf

export SOCKET=0
numactl -N "$SOCKET" -m "$SOCKET" \
./build/bin/llama-server \
    --model "$model" \
    --alias ubergarm/MiniMax-M2.5 \
    --ctx-size 131072 \
    -rtr \
    -ger \
    --merge-qkv \
    -ctk q8_0 -ctv q8_0 \
    -ub 4096 -b 4096 \
    --parallel 1 \
    --threads 96 \
    --threads-batch 128 \
    --numa numactl \
    --host 127.0.0.1 \
    --port 8080 \
    --no-mmap \
    --jinja \
    --validate-quants

.
.
.

llm_load_tensors:        CPU buffer size = 117598.47 MiB
....................................................................................................
============ Repacked 435 tensors

I did not benchmark with llama-sweep-bench to compare -rtr against omitting it, however; I just confirmed it worked for a single inference.

Are you using CUDA or another backend too?

I'm using CUDA+CPU, though I'm a little tight on memory. Does -rtr require additional memory?
Also, running -muge gives nonsense output. Should I open an issue on the git repo?

./build/bin/llama-server \
    -m ~/AI/ik/models/MiniMax-M2.5-IQ4_XS.gguf \
    -ot "blk.(?:[0-9]|[1-5][0-9]|[6][0-1]).ffn._exps.=CPU" \
    -c 65536 -b 8192 -ub 8192 \
    -ctk q8_0 -ctv q8_0 \
    --threads 8 -ngl 95 \
    -cuda fusion=1,offload-batch-size=8,mmq-id-size=512 \
    -amb 512 -mqkv -muge \
    --webui none --jinja \
    --repeat-last-n 2048 \
    --reasoning-format deepseek-legacy \
    --draft-min 1 --spec-ngram-size-n 8 \
    --draft-max 4 --spec-type ngram-simple --draft-p-min 0.2


I'm using CUDA+CPU, though I'm a little tight on memory. Does -rtr require additional memory?

I don't think it uses more memory; it only repacks the tensors running on CPU, and it should be the same BPW for this quantization type.

You're doing a lot of stuff all at once. I'd recommend dialing back to no more than -ub 4096 -b 4096, as larger batches can sometimes cause problems (at least in general and historically).

You don't need -amb 512 as this model is not MLA.

For testing purposes, remove all the repeat / reasoning-format / draft stuff, as that is pretty new. Just start with the most basic command and make sure that works; you're testing too many things at the same time imo (or at least more than my brain can think about).

Also, running -muge gives nonsense output. Should I open an issue on the git repo?

I don't even know what -muge does, but it sounds funny to say out loud 😅 hah... For now consider removing it, and only after you have the basics working add it back in and see if it is the culprit.

-ot "blk.(?:[0-9]|[1-5][0-9]|[6][0-1]).ffn._exps.=CPU"

I'm not sure how to read that easily, but maybe just use --n-cpu-moe XX or whatever? I know it can be tricky if you're tight on RAM, though.
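For what it's worth, the layer-selection part of that -ot pattern can be sanity-checked in isolation. This sketch uses made-up tensor names (nothing is read from the actual GGUF) and a simplified grep -E version of the alternation to show it selects layers 0 through 61:

```shell
# Generate dummy expert tensor names for 96 layers and count how many the
# layer-index alternation ([0-9]|[1-5][0-9]|6[0-1]) selects for CPU offload.
for i in $(seq 0 95); do
    echo "blk.$i.ffn_up_exps.weight"
done | grep -cE '^blk\.([0-9]|[1-5][0-9]|6[0-1])\.ffn_'
# prints 62 (layers 0-61 inclusive)
```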

Let's see if we can get something that makes sense before opening an issue, as ik seems busy working on Qwen3-Next haha

Ok, so -rtr is incompatible with -muge, but it will work without it.
-rtr also makes PP unbearably slow though (about a third of the speed), so I have to disable it.
-muge still gives nonsense output with this model

Honestly this isn't a big deal. I'll wait for the guys at ik to finish with Qwen 3.5/Next, then I'll open an issue.

Thanks for your help!

MrHills-rs changed discussion status to closed

@MrHills-rs

-rtr also makes PP unbearably slow though (about a third of the speed), so I have to disable it

It depends again, but repacked quant types don't benefit as much from increased batch sizes for PP, even for MoEs, as the un-repacked quants do. That's why I stopped releasing pre-repacked quants a long time ago. Did you observe any benefit for TG, though?

-muge still gives nonsense output with this model

What does -muge do and why are you trying to use it?

I'll wait for the guys at ik to finish with Qwen 3.5/Next, then I'll open an issue.

I think the only guy at ik working on that would be ik himself, haha... I too am excited to give Qwen3(.5)-Next a try, as I've heard pretty good things anecdotally about its long-context performance.

-muge merges the ffn_up and ffn_gate expert tensors. It provides a small speed boost, but it probably interferes with the expert repacking.

From here
https://github.com/ikawrakow/ik_llama.cpp/pull/1137

It's odd, because it worked with MiniMax M2.1; I benchmarked it too.

https://github.com/ikawrakow/ik_llama.cpp/pull/1139

As for -rtr, the loss of PP is really bad, and the gain in TG is barely noticeable, around 2-3%. Unfortunately my bottleneck is RAM speed: I have to move around 4.5 GB of weights from RAM to CPU for every token, and my memory can only do about 80 GB per second. Even with infinite processing speed on both the CPU and GPU, I could only ever hope to reach 16 to 18 tokens per second with my current RAM. Considering I already get 14 tokens per second, a computation speed boost is unlikely to help me much.
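The ceiling those numbers imply can be checked with quick arithmetic (tokens/s ≈ bandwidth ÷ bytes streamed per token):

```shell
# ~80 GB/s RAM bandwidth, ~4.5 GB of expert weights pulled per token.
awk 'BEGIN { printf "%.1f tok/s\n", 80 / 4.5 }'
# prints 17.8 tok/s, consistent with the 16-18 t/s estimate above
```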

Owner · edited Feb 14

@MrHills-rs

From here

Ahh, thanks for the link! ik has added so many small fused ops that add up in terms of speed that it's hard to keep up, even for me, and I spend a lot of time on this stuff haha... I see you were active in that thread. Reading the PR, it sounds like this -muge might need to be implemented differently for different architectures? Might be why it works on some models and not others, depending on the underlying arch?

You could probably let ik know in that existing thread you're in, if you wanted. I haven't tried -muge yet for anything, and amusingly, when calculating imatrix I have to explicitly disable all the fused ops to collect full importance data for each tensor haha...

As for rtr the loss of pp is really bad, and the gain in tg is barely noticeable, around 2-3%.

Yeah, that matches my impression most of the time, assuming large batches e.g. -ub 4096 -b 4096, given the more recent kernel improvements since -rtr was originally introduced.

Unfortunately my bottleneck is RAM speed... a computation speed boost is unlikely to help me much.

Yep, that's inference for you, given decode is autoregressive: pretty much all quantizations and kernels are memory-bandwidth limited here. Interestingly, the KT "trellis" / QTIP-style "EXL3" quantizations are usually still CPU-bound for inference, since they have to calculate the trellis on the fly each time, which is usually prohibitive on CPU. I've always wondered if there is a right mix of "normal" quant types and KT types to keep both the CPU and memory bandwidth saturated, possibly giving marginal improvements in perplexity/quality. But that would likely need tuning per rig, so it's not worth it for me here.

Looking at your specs:

  • 7800x3d
  • 128gb ddr5 6000mt
  • 5090 pcie5

You have a very nice system for inference and gaming. I assume you have 2x64GB DIMMs, as getting 4x DIMMs to POST at 6000MT/s is not so easy. Your estimated memory bandwidth of ~80GB/s is pretty good, honestly.
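That ~80GB/s figure lines up with theory: each DDR5 channel has a 64-bit (8-byte) bus, so dual-channel DDR5-6000 peaks at 6000 MT/s × 8 B × 2 channels = 96 GB/s, making ~80 GB/s measured roughly 83% efficiency:

```shell
# Theoretical peak vs. measured bandwidth for dual-channel DDR5-6000.
awk 'BEGIN { peak = 6000 * 8 * 2 / 1000; printf "%.0f GB/s peak, %.0f%% efficiency\n", peak, 80 / peak * 100 }'
# prints "96 GB/s peak, 83% efficiency"
```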

My own 9950X / 2x48GB DDR5-6400MT/s / 3090 Ti FE rig clocks roughly 86GB/s, but that required some OC of the Infinity Fabric and a slight power-envelope increase to avoid losing CPU boost. I have details here if you're interested: https://forum.level1techs.com/t/ryzen-9950x-ram-tuning-and-benchmarks/219347

Are you using LACT yet for your 5090 to save some power with undervolt without losing perf? I have some links on that too if you're interested.

Cheers!

Yeah, I wanted to build something general-purpose and DeepSeek-capable in a small form factor (micro-ATX) to keep on my desk. Unfortunately I couldn't find any motherboard with quad-channel memory at that size. We'll have to wait for Lisa to bring four channels to consumer platforms.

Actually, I got a 4x64GB RAM kit, 6000MT/s CL32, the G.Skill Flare one; I'm only using half because that's the limit of the CPU. I got it before the price hike; now they're like €4k... I feel like I'm buying Bitcoin or something.
I also just bought a new motherboard (Gigabyte X870M AORUS ELITE WIFI7) and a 9850X3D as soon as it came out. It seems the 9850X3D is on a new stepping with a better memory controller, and it's been spotted running really high speeds. I'll try to OC as far as I can with the full 256GB of memory soon.
Thanks for the link btw - I didn't know you could lose compute by overclocking the memory controller... though my CPU, being only an 8-core with less power draw, might be less affected. Still, something I'll have to watch out for.

Yeah, I'm already using LACT. I'm OCing more than undervolting tbh; it helps long-context TG speeds a little, and definitely helps PP for sequences longer than -cuda offload-batch-size=X.

With some luck, MTP, EAGLE, self-speculative decoding, and maybe sparser models, one should be able to run DeepSeek-sized models at both decent speeds and decent-quality quants on this setup. Maybe even GLM 5? Anyway, as soon as the CPU arrives I'll start testing. Let me know if you want to know what kind of TG/PP you can get with large models on maxed-out consumer RAM!
