ManniX PRO
AI & ML interests
Recent Activity
Organizations
SL-AI/GRaPE-2-Pro
This is the flagship model of the GRaPE 2 family and the largest model I have trained to date, sitting at 27B parameters. It is built on Qwen3.5-27B and trained on a closed-source proprietary dataset, with roughly half of post-training focused on code and the rest split between STEAM subjects and structured logical reasoning. It punches seriously above its weight class.
GRaPE 2 Pro supports multimodal input (image + text) and features 6 thinking modes, selected via a dedicated tag. This gives you real control over how hard the model thinks, from skipping the reasoning phase entirely with minimal, all the way up to xtra-Hi for deep, extended thought on hard problems. For most agentic use, auto or low is the move to keep things snappy. It also runs on consumer hardware: you can get it going with as little as 12GB of VRAM on a quantized build.
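As a minimal sketch of how mode selection might look against an OpenAI-compatible endpoint: the "/think:" tag name, its placement in the system prompt, and the local server URL below are all assumptions for illustration, not the model's documented interface.

```python
# Hypothetical sketch: selecting a GRaPE 2 Pro thinking mode via an
# OpenAI-compatible server (e.g. llama-server or vLLM). The "/think:" control
# tag and its placement in the system prompt are assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

def ask(prompt: str, mode: str = "auto") -> str:
    # mode: one of the thinking modes, e.g. "minimal", "auto", "low", "xtra-hi"
    resp = client.chat.completions.create(
        model="SL-AI/GRaPE-2-Pro",
        messages=[
            {"role": "system", "content": f"/think:{mode}"},  # assumed control tag
            {"role": "user", "content": prompt},
        ],
    )
    return resp.choices[0].message.content

print(ask("Refactor this function to be tail-recursive.", mode="low"))
```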
If you want to try it out and give feedback, that would be really appreciated. Email us at contact@skinnertopia.com
▶ Mythic-RDT - OpenMythos blueprint with a retrofit-recurrence fine-tune
https://github.com/mann1x/Mythic-RDT
▶ cross-tokenizer-distill (CTD) - knowledge distillation across different tokenizer vocabularies
https://github.com/mann1x/cross-tokenizer-distill
For Mythic-RDT, I have chosen the pretty outdated DS-Coder-V2 16B.
It's small enough not to need more than 48GB of VRAM, but once I leaned on KL for the depth-recurrence fine-tune (I couldn't get above parity with T=1 when running T=4, which is not great for 4x the inference time), I started investigating the KL recipe and questioning the teacher, which was the same DS-Coder-V2 but at BF16.
For a better teacher there was really only one option, DS-Coder-V2-236B. Not only is it so big that I'd need 4x H100 to run it, it is also surpassed even by Qwen3-Coder-32B on HE/MBPP.
Hence the CTD tool: validated, but still in development while I look for a good recipe for a Qwen->DS distill.
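To make the cross-tokenizer idea concrete, here is a minimal sketch of one way such a distillation can work: align positions where teacher and student token boundaries coincide on the raw text, then match the teacher's top-k next-token candidates to student tokens by their decoded strings. This is only an illustration of the general technique, not the actual CTD recipe; the alignment strategy and function names are assumptions.

```python
import torch
import torch.nn.functional as F

def aligned_positions(text, teacher_tok, student_tok):
    """Character offsets where both tokenizers end a token on the same text."""
    def boundaries(tok):
        enc = tok(text, return_offsets_mapping=True, add_special_tokens=False)
        return {end for _, end in enc["offset_mapping"]}, enc["input_ids"]
    t_ends, t_ids = boundaries(teacher_tok)
    s_ends, s_ids = boundaries(student_tok)
    return t_ends & s_ends, t_ids, s_ids

def distill_position(teacher_logits, student_logits, teacher_tok, student_tok, k=32, T=1.0):
    """KL(student || teacher) over the teacher's top-k candidates that map
    to exactly one student token with the same decoded string."""
    top_p, top_ids = torch.topk(F.softmax(teacher_logits / T, dim=-1), k)
    pairs = []
    for p, tid in zip(top_p.tolist(), top_ids.tolist()):
        piece = teacher_tok.decode([tid])
        sids = student_tok.encode(piece, add_special_tokens=False)
        if len(sids) == 1:                       # only keep 1:1 string matches
            pairs.append((p, sids[0]))
    if not pairs:
        return torch.tensor(0.0, device=student_logits.device)
    t_probs = torch.tensor([p for p, _ in pairs], device=student_logits.device)
    t_probs = t_probs / t_probs.sum()            # renormalise the kept teacher mass
    s_logp = F.log_softmax(student_logits / T, dim=-1)[[s for _, s in pairs]]
    return F.kl_div(s_logp, t_probs, reduction="sum") * (T * T)
```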
▶ Qwen3.5-4B-MicroCoder - code-leaning and reasoning merge of Qwen3.5-4B
ManniX-ITA/Qwen3.5-4B-MicroCoder
▶ Omnimergekit - merge toolkit, merge & quantization scripts, experiment logs
https://github.com/mann1x/omnimergekit
You can find my merge toolkit and scripts in the repo, so they don't get scattered over the HF repos.
MicroCoder was an interesting experiment: only a couple of base, reasoning-broken coding fine-tunes to merge with the excellent instruct-reasoning JackRong-v2.
The result is not truly exciting, but it manages to improve LiveCodeBench above JR-v2, improve MBPP, and not completely break reasoning.
This is achieved with omnimergekit using differential signals: signals derived from each source's delta vs the base model, contrasting its good and wrong answers across the sources (HE/MBPP/AIME).
The very long eval sessions showed that the method does not just bias the scores of these evals, but also improves others, even above the baseline.
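A rough sketch of how such a differential signal could be built: contrast gradient magnitudes collected on prompts a source answers correctly against those it gets wrong, then use the difference to weight that source's delta vs the base. This is my reading of the idea as described above, not the actual omnimergekit implementation; all function names here are illustrative.

```python
import torch

def per_tensor_grad_norms(model, batches, loss_fn):
    """Accumulate mean |grad| per named parameter over a list of batches."""
    norms = {n: 0.0 for n, _ in model.named_parameters()}
    for batch in batches:
        model.zero_grad()
        loss_fn(model, batch).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                norms[n] += p.grad.abs().mean().item()
    return {n: v / max(len(batches), 1) for n, v in norms.items()}

def differential_signal(model, good_batches, wrong_batches, loss_fn, eps=1e-8):
    """Positive where the tensor matters more for good answers than wrong ones."""
    good = per_tensor_grad_norms(model, good_batches, loss_fn)
    wrong = per_tensor_grad_norms(model, wrong_batches, loss_fn)
    return {n: (good[n] - wrong[n]) / (good[n] + wrong[n] + eps) for n in good}

def weighted_delta(finetune_sd, base_sd, signal, floor=0.0):
    """Scale each tensor's delta (finetune - base) by its clipped signal."""
    return {n: (finetune_sd[n] - base_sd[n]) * max(signal.get(n, 0.0), floor)
            for n in finetune_sd}
```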
Follow-up to the M1..M5 study above: re-ran M4 against the updated PR #682 head ("turbo" branch by @Tusm11, HEAD 8d989f6) with rebalanced hyperparams (equal weights w=1/1, density 0.7, closer to the PR's worked example). Same AttnLRP signal as M4-orig, same sources.
▶ Qwen3.5-4B-M4-v2-ex-LRP-turbo
ManniX-ITA/Qwen3.5-4B-M4-v2-ex-LRP-turbo
Q6_K HE / MBPP pass@1, M4-v2 inserted:
- M1 Vanilla DARE-TIES → 51.22 / 47.00
- M2 OMv2 (no signal) → 52.44 / 49.40
- M3 OMv2 + Fisher → 57.93 🥇 / 48.80
- M4 ex-LRP (PR #682 orig) → 51.22 / 49.40
- M4-v2 ex-LRP (PR #682 turbo) → 55.49 / 52.20 🥇
- M5 OMv2 + LRP → 53.05 / 51.40
Δ M4-v2 vs M4-orig: +4.27 pp HE, +2.80 pp MBPP. M4-v2 takes the MBPP medal of the whole study (overtakes M5) while staying competitive on HumanEval. The turbo code path + rebalanced hyperparams clearly beat the original PR head on this configuration.
Findings refresh: Fisher still leads HE; ex-LRP (turbo) now leads MBPP, narrowly ahead of OMv2+LRP. Both LRP variants land within 1 pp on MBPP: strong signal that LRP-driven sparsification is doing real work for code-gen on small Qwen merges.
Big thanks to @Tusm11 for the supercharged ex-LRP turbo head: multimodal support + Iron-Man stabilization + in-place math are a real upgrade. Posted full results + the 6 patches needed to run it against Qwen3_5ForConditionalGeneration on the PR thread:
https://github.com/arcee-ai/mergekit/pull/682
▶ Qwen3.6-27B-Omnimerge-v4-MLP
ManniX-ITA/Qwen3.6-27B-Omnimerge-v4
Same-base DARE-TIES merge of Qwen3.6-27B + 3 fine-tunes (rico03 Claude distill, Esper3.1, kai-os Opus reasoning anchor) via my Omnimerge_v2 method (OBIM-lite + DAREx-q + EMR election).
Hit a Qwen3.6-specific fragility: hyperparams that work flawlessly on 3.5 produced 80% unclosed-<think> on 3.6, collapsing pass@1 to ~20%. Per-tensor delta forensics localized the failure to mlp.{gate,up,down}_proj in layers 27-52. Fix: MLP-passthrough surgery, i.e. copy MLPs verbatim from base, keep merged attn + linear_attn. Leak → 0%.
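The surgery itself is mechanically simple; a minimal sketch under the assumption of a single-file checkpoint with the usual "model.layers.N.mlp.*" naming (real 27B checkpoints are sharded, so the actual script has to iterate over shards):

```python
import re
from safetensors.torch import load_file, save_file

MLP_RE = re.compile(r"model\.layers\.(\d+)\.mlp\.(gate_proj|up_proj|down_proj)\.weight")

def mlp_passthrough(merged_path, base_path, out_path, lo=27, hi=52):
    """Copy MLP projections of layers lo..hi verbatim from base into the merge."""
    merged = load_file(merged_path)
    base = load_file(base_path)
    for name in list(merged.keys()):
        m = MLP_RE.match(name)
        if m and lo <= int(m.group(1)) <= hi:
            merged[name] = base[name].clone()   # verbatim copy from base
    save_file(merged, out_path)

# mlp_passthrough("merged.safetensors", "base.safetensors", "merged_mlp_pass.safetensors")
```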
Q6_K results (vs Qwen3.6 base / vs Omnimerge-v2 on Qwen3.5):
• HumanEval: 84.76% (= base, +5.49 pp vs v2)
• MBPP corrected: 73.40% (+15.80 pp vs base, ≈ v2)
• GPQA Diamond: ~84.75% partial 192/198 (+15.5 pp vs v2)
▶ Qwen3.5-4B Importance-Signal Study (M1..M5)
Controlled 5-way comparison: same Qwen3.5-4B base, same 2 fine-tunes (Jackrong Claude-4.5 distill + Crow Opus-4.6 distill), only the importance signal driving DARE-TIES sparsification varies.
Q6_K HE / MBPP pass@1:
• M1 Vanilla DARE-TIES → 51.22 / 47.00
• M2 OMv2 (no signal) → 52.44 / 49.40
• M3 OMv2 + Fisher → 57.93 🥇 / 48.80
• M4 mergekit ex-LRP (PR #682) → 51.22 / 49.40
• M5 OMv2 + LRP → 53.05 / 51.40 🥇
Findings: Fisher wins HE (+4.88 pp over vanilla), LRP wins MBPP (+2.60 pp). Both signals + Omnimerge_v2 recipe beat vanilla. To make multimodal-LM ex-LRP work end-to-end against Qwen3_5ForConditionalGeneration, I filed
5 patches against arcee-ai/mergekit PR #682 + 1 against rachtibat/lxt.
All five Mx checkpoints + Fisher/LRP signal safetensors + reproducer scripts published.
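For readers less familiar with the signals being compared, here is a bare-bones sketch of an empirical (diagonal) Fisher importance and how such a signal can drive DARE/TIES-style sparsification of a delta: keep only the top-density fraction of entries ranked by importance before merging. This illustrates the general recipe only, not the exact Omnimerge_v2 or mergekit code.

```python
import torch

def empirical_fisher(model, batches, loss_fn):
    """Mean squared gradient per parameter over a small calibration set."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    for batch in batches:
        model.zero_grad()
        loss_fn(model, batch).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
    return {n: f / max(len(batches), 1) for n, f in fisher.items()}

@torch.no_grad()
def sparsify_by_signal(delta, signal, density=0.53):
    """Zero all but the highest-importance `density` fraction of delta entries."""
    k = max(1, int(delta.numel() * density))
    thresh = torch.topk(signal.flatten(), k).values.min()
    return torch.where(signal >= thresh, delta, torch.zeros_like(delta))
```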
🔹 ManniX-ITA/Qwen3.5-27B-Omnimerge-v2
3-source weight-space merge over Qwen3.5-27B combining OBIM-lite magnitude masking + DAREx rescaling + EMR election (sign from consensus, amplitude from max-abs across sources). GPU-accelerated, ~35× over CPU.
Sources: Claude-4.6-Opus-distill (0.40), Esper3.1 code (0.35), Gemini-3.1-Pro-distill (0.25). density 0.53, DAREx q 0.75.
Q6_K vs best source:
• GPQA Diamond: 53.03 → 69.19 (+16.16 pp)
• MBPP pass@1: 71.20 → 74.60 (+3.40)
• HumanEval pass@1: 76.22 → 79.27 (+3.05)
vs Omnimerge v1 (vanilla DARE-TIES): +8.08 pp GPQA, +2.80 MBPP. Amplitude-from-max + sign-from-consensus is what unlocked the GPQA jump.
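A compact sketch of what an EMR-style election can look like: per entry, take the sign that the (weighted) majority of source deltas agrees on, and the amplitude from the largest-magnitude delta that matches that sign. This is a simplified reading of the description above, not the Omnimerge_v2 implementation.

```python
import torch

def emr_elect(deltas, weights=None):
    """deltas: list of same-shape tensors (finetune - base), one per source."""
    stack = torch.stack(deltas)                                   # [S, ...]
    if weights is None:
        weights = torch.ones(stack.shape[0], dtype=stack.dtype, device=stack.device)
    w = weights.view(-1, *([1] * (stack.dim() - 1)))
    # Sign from consensus: the weighted sum of signs decides the elected direction.
    elected_sign = torch.sign((torch.sign(stack) * w).sum(dim=0))
    # Amplitude from max-abs among sources that agree with the elected sign.
    agree = (torch.sign(stack) == elected_sign).to(stack.dtype)
    amplitude = (stack.abs() * agree).amax(dim=0)
    return elected_sign * amplitude
```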
🔹 ManniX-ITA/gemma-4-A4B-98e-v3-it
Gemma 4 26B-A4B pruned 128 → 98 experts/layer (-23.4% MoE capacity, -5.2B params), zero GPQA degradation.
GPQA Diamond:
• 128e reference: 75.25%
• 98e v3 (this): 75.25% → +0.00 pp despite -23.4% capacity, -5.2B params
• 109e v3 (older): 71.72% → -3.53 pp
The win over 109e v3 came from changing the importance map: aggregating per-expert contribution across math/logic/code/science/creative via 128-token teacher forcing, instead of GPQA-specific per-question top-16 (which overfitted). Result: more experts dropped, quality preserved.
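A rough sketch of that pruning criterion: teacher-force short samples from several domains, accumulate each expert's routed probability mass per layer, average across domains so no single domain dominates, and keep the top-k experts per layer. Hook points and tensor shapes are assumptions about a generic MoE router, not Gemma-specific code.

```python
import torch

def accumulate_router_mass(router_probs_per_layer, totals):
    """router_probs_per_layer: list of [tokens, num_experts] tensors (one per layer)."""
    for layer, probs in enumerate(router_probs_per_layer):
        totals[layer] += probs.sum(dim=0)        # probability mass routed to each expert

def select_experts(domain_totals, keep=98):
    """domain_totals: {domain: [num_layers, num_experts] contribution tensor}."""
    # Normalise per layer, then average across domains before ranking.
    avg = torch.stack(
        [t / t.sum(dim=-1, keepdim=True) for t in domain_totals.values()]
    ).mean(dim=0)
    return [torch.topk(layer_scores, keep).indices.sort().values for layer_scores in avg]
```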
Findings worth flagging:
• Experts NOT topic-specialized → 28/32 overlap math/creative top-32.
• Expert weight cosine ≈ 0.05 max → merging destroys the model. Dropping is the only viable structural compression here.
• Contribution Gini ≈ 0.38 → ~75 experts/layer carry 80% of signal.
Eval: lm-eval gpqa_diamond_cot_zeroshot, llama-server --reasoning-format deepseek --reasoning-budget 8192, Gemma 4 official sampling. Feedback welcome.
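For completeness, a sketch of how that eval setup can be wired together: serve the quantized GGUF with llama-server, then point lm-eval-harness's OpenAI-compatible backend at it. Exact flag spellings and the GGUF filename here are assumptions and vary between lm-eval and llama.cpp versions, so treat this as an outline rather than a copy-paste command.

```python
import subprocess
import time

# 1) Serve the quantized model with a reasoning budget (llama.cpp server).
server = subprocess.Popen([
    "llama-server", "-m", "gemma-4-A4B-98e-v3-it.Q6_K.gguf",   # hypothetical filename
    "--reasoning-format", "deepseek", "--reasoning-budget", "8192",
    "--port", "8080",
])
time.sleep(60)  # crude wait; in practice poll the /health endpoint

# 2) Run lm-eval against the local OpenAI-compatible completions endpoint.
subprocess.run([
    "lm_eval",
    "--model", "local-completions",
    "--model_args", "base_url=http://127.0.0.1:8080/v1/completions,model=gemma-4-98e",
    "--tasks", "gpqa_diamond_cot_zeroshot",
])
```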
> LiquidAI/LFM2.5-1.2B-Base: 10T → 28T tokens
> LiquidAI/LFM2.5-1.2B-Instruct: new large-scale multi-stage RL
> LiquidAI/LFM2.5-1.2B-JP: our most polite model
> LiquidAI/LFM2.5-VL-1.6B: multi-image multilingual
> LiquidAI/LFM2.5-Audio-1.5B: 8x faster, no quality loss
Super proud of this release 🤗
the table of contents covers everything you need to know about agents + code:
> advanced prompt techniques
> multi-agent patterns
> tool use and MCP
> you name it
read it here: https://docs.google.com/document/d/1rsaK53T3Lg5KoGwvf8ukOUvbELRtH-V0LnOIFDxBryE/edit?tab=t.0#heading=h.pxcur8v2qagu
you can also pre-order on Amazon (published by Springer) and the royalties go to Save the Children: https://www.amazon.com/Agentic-Design-Patterns-Hands-Intelligent/dp/3032014018/
🔥 We are ready to announce a new series of Supple Diffusion models: a new generation of diffusion models (about 1-2 weeks left before release).
🦾 The new series aims to take diffusion models to the next level, with performance and versatility as the main goals.
🧠 How will our models be better than others? First, we worked on the CLIP models: they now understand your prompts better, so requests are easier to process. Second, we trained the models to a higher quality than all of our previous ones. Third, you won't have to keep 20 models on your disk; 4-6 will be enough.
🗺️ Roadmap:
1. Create Supple Diffusion Small
2. Create Supple Diffusion Medium
3. Create Supple Diffusion Large
Our models are universal: they work for realism, cartoons, anime, and caricatures.
The project really needs your support, recommendations, and reviews; please do not hesitate to write comments under this post. Thank you!
🖼️ Below are demo images made with the pre-release version of Supple Diffusion Small.
Tonight I wrote up a WandB report (the panel editor is super broken in Firefox) that sums up some of the more interesting bits from the results: https://wandb.ai/augmxnt/train-bench/reports/torchtune-vs-axolotl-vs-unsloth-Trainer-Comparison--Vmlldzo4MzU3NTAx
AutoQuant is the evolution of my previous AutoGGUF notebook (https://colab.research.google.com/drive/1P646NEg33BZy4BfLDNpTz0V0lwIU3CHu). It allows you to quantize your models in five different formats:
- GGUF: perfect for inference on CPUs (and LM Studio)
- GPTQ/EXL2: fast inference on GPUs
- AWQ: super fast inference on GPUs with vLLM (https://github.com/vllm-project/vllm)
- HQQ: extreme quantization with decent 2-bit and 3-bit models
Once the model is converted, it automatically uploads it to the Hugging Face Hub. To quantize a 7B model, GGUF only needs a T4 GPU, while the other methods require an A100 GPU.
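For reference, here is a rough outline of what the GGUF path in a notebook like this does under the hood: download the HF model, convert it to GGUF with llama.cpp's converter, then quantize. Script and binary names follow current llama.cpp conventions and may differ from the actual notebook; the repo id is just an example.

```python
import subprocess

MODEL = "mlabonne/AlphaMonarch-7B"   # example repo id, substitute your own
# Download the source checkpoint from the Hub.
subprocess.run(["huggingface-cli", "download", MODEL, "--local-dir", "model"], check=True)
# Convert to an FP16 GGUF with llama.cpp's converter script.
subprocess.run(["python", "llama.cpp/convert_hf_to_gguf.py", "model",
                "--outfile", "model-f16.gguf", "--outtype", "f16"], check=True)
# Quantize, e.g. to Q4_K_M.
subprocess.run(["llama.cpp/llama-quantize", "model-f16.gguf",
                "model-Q4_K_M.gguf", "Q4_K_M"], check=True)
```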
Here's an example of a model I quantized using HQQ and AutoQuant: mlabonne/AlphaMonarch-7B-2bit-HQQ
I hope you'll enjoy it and quantize lots of models! :)
💻 AutoQuant: https://colab.research.google.com/drive/1b6nqC7UZVt8bx4MksX7s656GXPM-eWw4