MTP?

by floory

are the MTP layers stripped? if not, i would love to use this together with https://github.com/ggml-org/llama.cpp/pull/22673! currently there are no good quants of this model that fit within 24gb, and MTP makes a big difference (20tps --> 50tps)
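for anyone wondering where a jump like 20tps --> 50tps can come from, here's a rough back-of-envelope sketch in Python; the draft length, acceptance rate, and overhead below are made-up assumptions for illustration, not measurements from the PR:

```python
# Rough back-of-envelope for why MTP speeds up token generation (TG).
# All numbers below are illustrative assumptions, not measurements.

base_tps = 20.0      # plain autoregressive decode speed
draft_tokens = 3     # extra tokens the MTP head proposes per step (assumed)
acceptance = 0.5     # average fraction of drafted tokens accepted (assumed)
step_overhead = 1.2  # a verify step costs a bit more than a plain step (assumed)

# Each verification step emits 1 guaranteed token plus the accepted drafts.
tokens_per_step = 1 + draft_tokens * acceptance

effective_tps = base_tps * tokens_per_step / step_overhead
print(f"~{effective_tps:.0f} tps")  # ~42 tps under these assumptions
```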

Yeah, I've been watching that PR (very exciting; I use MTP on my vLLM setup and it's amazing). I'm waiting for it to merge into master, since there's still a bit of final work for it to be fully stable. Once it's merged and done, I'd be more than happy to rebuild and re-post the quants properly. For reference, the llama.cpp build I used for this model was a fresh pull of master from today.

Oh, and since you're on a 24GB card: hope you enjoy the MQ-Q6_K_3 and MQ-Q5_K_S_1, because this model was the first of any I've worked with to fire anomaly detection within the predictive engine. In other words, it detected things that broke standard quantization rules and then exploited the discovered pattern aggressively. That's why there's no Q8: it found Q6 patterns that were not just smaller but had better KLD than Q8, which was pretty cool.

You only see that when anomaly detection fires, because the architecture had weirdness that isn't normal and could be replicated. But both that Q6 and Q5 hit far above their weight; the Q5_K_S_1 beat the standard llama.cpp Q6_K, which was super cool too.
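For anyone unfamiliar with the KLD comparison I keep referring to, here's a minimal sketch of the idea; the logits below are randomly generated stand-ins, not real model outputs, and this isn't the actual MagicQuant code:

```python
import numpy as np

def kld(p_logits: np.ndarray, q_logits: np.ndarray) -> float:
    """Mean KL divergence KL(P || Q) over a batch of next-token logits."""
    def softmax(x):
        x = x - x.max(axis=-1, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=-1, keepdims=True)
    p, q = softmax(p_logits), softmax(q_logits)
    return float(np.mean(np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=-1)))

# Hypothetical logits: a full-precision reference vs. two quantized runs
# over the same evaluation prompts (128 positions, 32000-token vocab).
rng = np.random.default_rng(0)
ref = rng.normal(size=(128, 32000))                   # stand-in for fp16 logits
q6 = ref + rng.normal(scale=0.05, size=ref.shape)     # constructed to drift less
q8 = ref + rng.normal(scale=0.08, size=ref.shape)     # constructed to drift more

# The "anomaly" described above is exactly this outcome: the nominally
# smaller quant scoring a lower (better) KLD than the bigger one.
print("Q6 KLD:", kld(ref, q6))
print("Q8 KLD:", kld(ref, q8))
```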

i've tested the PR with Vulkan and it works fine on my system! people reported -50% PP, which is why it's not merged, but i personally still get 600tps pp (down from maybe 900) and the TG going from 20 to 40-60 is 100% worth it. pretty please? 🥺 it's hard to go back once you try it, and vLLM isn't a great experience on 24gb, but there aren't good quants for it, they all feel dumb </3

can barely run Q5_K_S_1 so i'll check that one out

crazy how you're able to pull this off. i really appreciate your work! been following you for months :D

Thank you! And you're tempting me now! Is that PR just not stripping those MTP tensors or something? Meaning, as long as MTP isn't enabled, is it stable? Would you know?

Also, the wiki isn't fully updated yet, but I'm trying to document how this works, since it all comes down to how v2.0 operates.

But basically it's utilizing what it learned from other model tensor configurations. Usually a better version of a model simply comes from making trades: it's not necessarily fewer bits, just swapping where we prioritize bits. That's the "normal" case.

But what I call an "anomaly" within my system is a strict violation of that rule. In isolated sampling, the ffn_down group at Q8 had a lower KLD than at Q6, so Q8 is better than Q6; that's the obvious, standard result. But the system smoke-tested and validated localized emergent behavior in real hybrid scenarios that violated the rule, and those patterns could be exploited to make Q6_K beat Q8_0 in specific groups.
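Here's a toy sketch of that decision rule; the function, dict layout, and numbers are hypothetical stand-ins, not the actual v2.0 engine:

```python
# Sketch of the rule-violation check described above (hypothetical API).
# The "normal" rule: more bits in a tensor group => lower KLD. An anomaly
# fires when a real hybrid mix breaks that rule for a group.

def is_anomaly(kld_isolated: dict, kld_hybrid: dict, group: str) -> bool:
    # Isolated sampling obeys the rule: Q8 beats Q6 for this group.
    rule_holds_isolated = kld_isolated[(group, "Q8_0")] < kld_isolated[(group, "Q6_K")]
    # But in a real hybrid mix, Q6 in this group beats Q8.
    rule_breaks_hybrid = kld_hybrid[(group, "Q6_K")] < kld_hybrid[(group, "Q8_0")]
    return rule_holds_isolated and rule_breaks_hybrid

# Hypothetical measurements for the ffn_down group:
isolated = {("ffn_down", "Q8_0"): 0.0021, ("ffn_down", "Q6_K"): 0.0034}
hybrid   = {("ffn_down", "Q8_0"): 0.0051, ("ffn_down", "Q6_K"): 0.0043}

if is_anomaly(isolated, hybrid, "ffn_down"):
    print("anomaly: Q6_K beats Q8_0 in hybrid despite losing in isolation")
```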

Similar to how it utilized IQ4_NL in the embedding group. That wasn't a violation of the rule; it just hit way above what it deserved to, with emergent behavior, and that was utilized as well. Each MagicQuant repo is actually fully automated, with everything generated, unless I manually tweak the ReadMe a bit like I tend to do. And the magicquant-manifest folder in every repo holds fully transparent logs of what the hybrids derive from, including when and where Unsloth Dynamic learned configurations are used, plus tensor-by-tensor configuration maps for full reproducibility :)
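As a rough illustration of how those maps can be consumed (the entries below are made up; the real schema is whatever the magicquant-manifest files actually contain):

```python
from collections import Counter

# Stand-in for a tensor-by-tensor configuration map from the
# magicquant-manifest folder; the real file format may differ.
tensor_map = {
    "blk.0.ffn_down.weight": "Q6_K",
    "blk.1.ffn_down.weight": "Q6_K",
    "blk.0.attn_q.weight": "Q5_K",
    "token_embd.weight": "IQ4_NL",
}

# Summarize which quant types were used and how often.
counts = Counter(tensor_map.values())
print(counts)

# List every tensor that deviates from the dominant type.
dominant, _ = counts.most_common(1)[0]
for name, qtype in sorted(tensor_map.items()):
    if qtype != dominant:
        print(f"{name}: {qtype}")
```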

I think the MQ-Q5 actually took learned configurations from Unsloth Dynamic UD-Q6, mixed with some Q8 if I remember correctly, to pull it off.

> Is that PR just not stripping those MTP tensors or something? Meaning, as long as MTP isn't enabled, is it stable? Would you know?

from what i've read, quantising a model normally strips the MTP layer since it's unsupported, but that layer is needed for MTP to work. so from my understanding, as long as it's not stripped, it should work with the PR. not certain, but pretty sure.
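roughly the idea as i understand it, as a toy sketch (the tensor names and converter logic here are hypothetical, not actual llama.cpp code):

```python
# Toy illustration of the "stripping" idea: a converter walks the
# checkpoint's tensors and drops ones it doesn't support. If the MTP head
# tensors get filtered out here, nothing remains for the PR to use at runtime.

UNSUPPORTED_PATTERNS = ("mtp.",)  # hypothetical naming for the MTP head

def keep_tensor(name: str, mtp_supported: bool) -> bool:
    if any(pat in name for pat in UNSUPPORTED_PATTERNS):
        return mtp_supported  # keep MTP tensors only if the target supports them
    return True

tensors = ["blk.0.attn_q.weight", "mtp.proj.weight", "output.weight"]
for supported in (False, True):
    kept = [t for t in tensors if keep_tensor(t, mtp_supported=supported)]
    print(f"mtp_supported={supported}: {kept}")
```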

sadly i just got it onto my disk, only to realise it's 22gb and i can't fit that with 128k KV cache even at q8_0. gonna have to settle for less, but your models are goated regardless

no pressure but like... 🥺 🙏 seriously though if it is not much effort you'd win my heart

Haha, I got you. I was about to get off for the night, and I'm going to be at my brother's wedding and doing wedding stuff for like 2 days. So I may not be able to come back to this and add MTP till this weekend. I just want to test that it doesn't break anything else, but if I have time after dinner I'll try to fit it in before I head out :)

awesome, thank you very much! enjoy the wedding and wishing you all the best, can't wait for MTP :D

I'm heading out for wedding stuff, but I'll continue looking into this when I'm back. There are a lot of reports specifically about it causing issues with images right now, so I need to test it properly and make sure the model stays backwards compatible and MTP can be disabled. That way the posted version is still stable :)

Though I'm very excited to use MTP as well! It's a world of difference.

yes, the only issues are the PP regression and images crashing. if anything, it could probably just go in a separate qwen3.6-27b-mtp-magicquant repo, but it makes sense that you don't wanna bother until everything is stable. hope you have a nice time!
