DavidAU 
posted an update 4 days ago
21 Qwen 3.5 fine-tunes (thinking and instruct); regular and uncensored (2B to 27B). They exceed benchmarks and work better than the original models.

All are benchmarked against the original model.
Many exceed all benchmarks of the original model.
Claude, GLM, Gemini, and other distills.
Thinking AND dedicated Instruct versions.

Core goal: increase benchmarks and address long thinking blocks.

Highlights:

9B and 27B instruct "Claude" versions hit 624 and 675 on ARC-C (hard challenge).

Thinking fine tunes exceed org model performance (in thinking mode).

In many cases there is a drastic reduction in thinking block size.

9B Claude Heretic Uncensored, GGUF:
- Neo, Code Imatrix (dual imatrix)
- Updated Jinja template
- Custom tensor enhancements

DavidAU/Qwen3.5-9B-Claude-4.6-OS-Auto-Variable-HERETIC-UNCENSORED-THINKING-MAX-NEOCODE-Imatrix-GGUF

COLLECTION [21 models]:
https://huggingface.co/collections/DavidAU/qwen-35-08-2-4-9-27-35b-regular-uncensored

UPDATE:
Now 31 models, including experimental 21B and new 13B models.

Hello @DavidAU ,

could you try to make MoE versions, please? We could use something between 9B and 27B.

I was surprised by the capability of the small 9B model. While it wasn't capable of creating something as advanced as bigger models, and the code it generated from scratch was usually broken, when I gave it its own broken code it was smart enough to fix individual issues. I believe it would have been capable of improving the quality iteratively when prompted to do so in small single tasks, just as with fixing the issues. Unfortunately, creating something a little more ambitious from scratch with the 9B model was not possible in a single prompt.

On the other hand, the 27B model is already a bit too demanding for my hardware, and token generation speed is too slow for the model to be usable for me.

The smallest MoE is 35B, which is even bigger than the previous generation, but due to its Mixture of Experts architecture it is still a bit faster than the 27B model for me. There are also some smaller REAP versions, like the 26B Creative and 24B regular versions, which are even faster.

I believe these MoE models would nicely fill the gap between the 9B and 27B dense variants in terms of quality, if you tuned them similarly with the Claude datasets.


Currently I have a fully running 13B (GLM 4.7 Flash), which is very strong, and experimental 21Bs of Qwen 3.5.
These are trained.

These are in testing, and access is limited as of this writing.

As for MoEs:
This is a little more complicated, as scripting must be written for Mergekit to "MoE together" the 0.8B, 2B, 4B, 9Bs, etc.
A draft (by me) has been completed to do this, but it is not tested/debugged yet.
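For reference, mergekit's MoE merging is driven by a YAML config that names a base model and the expert models, with prompts used to initialize each expert's router gate. The sketch below is a hypothetical minimal config, not the actual draft; the model paths and prompts are placeholders.

```yaml
# Hypothetical mergekit-moe config: combine small dense fine-tunes into
# one Mixture-of-Experts model. Paths and prompts are illustrative only.
base_model: path/to/dense-base-model
gate_mode: hidden        # initialize router gates from hidden-state activations on the prompts
dtype: bfloat16
experts:
  - source_model: path/to/code-finetune
    positive_prompts:
      - "Write a function that"
  - source_model: path/to/creative-finetune
    positive_prompts:
      - "Write a story about"
```

With mergekit installed, a config like this is run via `mergekit-moe config.yml ./output-model`. The `gate_mode` options and `positive_prompts` routing come from mergekit's documented MoE workflow; whether they fit these particular fine-tunes is an assumption.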

No timeline here; too many variables.

RE the 35B MoEs: it is possible to address this in a different way, but I have not tried it yet.
This is a different approach than REAP.