Does the IQ4_XS have the specific ik_llama optimizations?

#4
by MrHills-rs - opened

For example, does it work with -rtr? That's only for quantizations made with ik_llama.cpp, if I recall correctly?

@MrHills-rs

I forget if ik_llama.cpp will make an iq4_xs_r8 when using -rtr. You can try and see; I think it probably will, though it may not be any faster unless you're running those tensors on CPU, and even then PP will likely suffer while TG might see marginal gains.

Best advice is to A/B test your command with and without -rtr using llama-sweep-bench; then you'll see the full performance profile.
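A minimal A/B sketch of that (the model path, context size, and thread count here are placeholders, adjust them to your setup):

```shell
#!/usr/bin/env bash
# Run llama-sweep-bench twice with identical settings, once without
# and once with -rtr, then compare the PP/TG columns of the two runs.
model=./models/MiniMax-M2.5-IQ4_XS.gguf   # placeholder path

for extra in "" "-rtr"; do
    echo "=== flags: ${extra:-none} ==="
    ./build/bin/llama-sweep-bench \
        --model "$model" \
        -c 8192 -ub 2048 -b 2048 \
        --threads 16 \
        $extra
done
```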

Using -rtr gives me this error:

~/AI/ik/src/llama.cpp:4340: GGML_ASSERT(l.ffn_up_gate_exps->type == l.ffn_up_exps->type && l.ffn_up_gate_exps->type == l.ffn_gate_exps->type) failed
ptrace: Operation not permitted.
No stack.
The program is not being run.

It's not a big deal, mind you. I just wanted to know if I'm doing something wrong or if it's normal behavior.

@MrHills-rs

I just gave the IQ4_XS a try on a CPU-only rig compiled from ik_llama.cpp tip of main, and it seems to be working. I assume you are using ik_llama.cpp despite the folder being named llama.cpp in your error message?

#!/usr/bin/env bash

model=/mnt/raid/hf/MiniMax-M2.5-GGUF/IQ4_XS/MiniMax-M2.5-IQ4_XS-00001-of-00004.gguf

export SOCKET=0
numactl -N "$SOCKET" -m "$SOCKET" \
./build/bin/llama-server \
    --model "$model"\
    --alias ubergarm/MiniMax-M2.5 \
    --ctx-size 131072 \
    -rtr \
    -ger \
    --merge-qkv \
    -ctk q8_0 -ctv q8_0 \
    -ub 4096 -b 4096 \
    --parallel 1 \
    --threads 96 \
    --threads-batch 128 \
    --numa numactl \
    --host 127.0.0.1 \
    --port 8080 \
    --no-mmap \
    --jinja \
    --validate-quants

...

llm_load_tensors:        CPU buffer size = 117598.47 MiB
....................................................................................................
============ Repacked 435 tensors

I did not benchmark with llama-sweep-bench to compare -rtr to omitting it however, just confirmed it worked in a single inference.

Are you using CUDA or another backend too?

I'm using CUDA+CPU, though I'm a little tight on memory. Does -rtr require additional memory?
Also, running -muge gives nonsense output. Should I open an issue on git?

./build/bin/llama-server \
    -m ~/AI/ik/models/MiniMax-M2.5-IQ4_XS.gguf \
    -ot "blk.(?:[0-9]|[1-5][0-9]|[6][0-1]).ffn._exps.=CPU" \
    -c 65536 -b 8192 -ub 8192 \
    -ctk q8_0 -ctv q8_0 \
    --threads 8 -ngl 95 \
    -cuda fusion=1,offload-batch-size=8,mmq-id-size=512 \
    -amb 512 -mqkv -muge \
    --webui none --jinja \
    --repeat-last-n 2048 \
    --reasoning-format deepseek-legacy \
    --draft-min 1 --spec-ngram-size-n 8 \
    --draft-max 4 --spec-type ngram-simple --draft-p-min 0.2


I'm using CUDA+CPU, though I'm a little tight on memory. Does -rtr require additional memory?

I don't think it uses more memory; it only repacks the tensors running on CPU, and it should be the same BPW for this quantization type.

You're doing a lot of stuff all at once. I'd recommend dialing back to no more than -ub 4096 -b 4096, as larger batches can sometimes cause problems (at least in general and historically).

You don't need -amb 512 as this model is not MLA.

For testing purposes, remove all the repeat/reasoning-format/draft stuff, as that is pretty new. Start with the most basic command and make sure that works; you're testing too many things at the same time imo (or at least more than my brain can think about).

Also, running -muge gives nonsense output. Should I open an issue on git?

I don't even know what -muge does, but it sounds funny to say out loud 😅 hah... For now consider removing it, and only after you have the basics working add it back in and see if it is the culprit.

-ot "blk.(?:[0-9]|[1-5][0-9]|[6][0-1]).ffn._exps.=CPU"

I'm not sure how to read that easily, but maybe just use --n-cpu-moe XX or whatever instead? I know it can be tricky if you're tight on RAM, though.
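For what it's worth, that pattern pins the routed-expert tensors of blocks 0-61 to CPU. Here's a quick way to sanity-check what it matches, rewritten as POSIX ERE since the original uses a PCRE-style non-capturing group (the tensor names below are just illustrative examples):

```shell
#!/usr/bin/env bash
# ERE equivalent of the (?:...) alternation: blocks 0-9, 10-59, 60-61.
pattern='^blk\.([0-9]|[1-5][0-9]|6[01])\.ffn_.*_exps\.'

for i in 0 30 61 62; do
    name="blk.${i}.ffn_up_exps.weight"   # example tensor name
    if echo "$name" | grep -Eq "$pattern"; then
        echo "blk.${i} -> CPU"
    else
        echo "blk.${i} -> GPU"
    fi
done
# Blocks 0, 30, and 61 match (offloaded to CPU); block 62 does not.
```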

Let's see if we can get something that makes sense before opening an issue, as ik seems busy working on Qwen3-Next haha
