4.5bpw Exl3 H6 LLMFan46 Heretic Base Qwopus 3.6 Coder

#2
by tw33kr442 - opened

I'm making a better model for my personal local use and needed an EXL3 quant. I'm currently running a Qwopus 3.6 Coder build off the llmfan46/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved base, so it's a bit different, but I hope it works well in a repo. I'm removing vision and MTP, since this is specifically for my 3090 24GB and dropping them leaves room for a good context size and easier context management. I've had more luck with EXL3 quants in my repos.

I'm hoping for a multi-use local model with an agentic lean, in my preferred quant.
I'm also burning an EXL3 quant of the full coder for good measure.

Just wanted to thank you for the information and the work you guys do. I'm excited to see what I can get on SWE-bench with the new model build. Here are the build details; I'll drop a model card and set up the page for my version. The training I'm doing is about 0.55 epoch, roughly a 24-hour burn on an H200. There seem to be diminishing returns past a certain point with the way the Qwen architecture is built and how I'm running the training. I'm new to model builds, so I'm hoping my methodology can come close to the build you guys made, though without the Qwopus base I'm curious how it will perform. I think this is as close as I can reasonably get in a quick build.

While I think you guys went in a different direction, this is my first full build that might be useful. So any thoughts in case I run another model later would be great!

My hope is that the model works well and the runpod usage was worth the coin.

  • Quantization: EXL3
  • Target bitrate: 4.5 bpw
  • Head bits: 6
  • Context target: 32K training context
  • Serving target: local ExLlamaV3/TextGen-style coder-agent use
  • Vision/MTP: intentionally not part of the serving target
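As a rough sanity check on why 4.5 bpw is a reasonable target for a 24 GB card, here is a back-of-envelope estimate of the weight footprint. It assumes about 27e9 parameters (read off the "27B" in the base model name) and ignores the 6-bit head, the KV cache, activations, and whatever is saved by stripping vision and MTP, so treat the number as an approximation only.

```python
# Back-of-envelope weight footprint for a quantized model.
# Assumption: ~27e9 parameters, taken from the "27B" in the base model name.
PARAMS = 27e9
BITS_PER_WEIGHT = 4.5  # target bitrate from the build details


def weight_bytes(n_params: float, bpw: float) -> float:
    """Approximate size of the quantized weights in bytes."""
    return n_params * bpw / 8


size_gb = weight_bytes(PARAMS, BITS_PER_WEIGHT) / 1e9     # decimal gigabytes
size_gib = weight_bytes(PARAMS, BITS_PER_WEIGHT) / 2**30  # binary gibibytes
print(f"~{size_gb:.1f} GB ({size_gib:.1f} GiB) of weights")
```

Whatever is left of the 24 GB after the weights is what the KV cache and runtime overhead have to fit into.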

Training Summary

The adapter was trained on an H200 using continuous QLoRA SFT with response-only masking.
The source model has hybrid Qwen3.5-style attention, so LoRA coverage includes both
standard self-attention and linear-attention modules.
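Response-only masking can be sketched on plain token-id lists like this. It is a minimal illustration, not the trainer's actual code: the marker ids are made-up toy values, and training on the closing marker (so the model learns to stop) is my assumption about the setup.

```python
IGNORE_INDEX = -100  # label value that PyTorch's CrossEntropyLoss ignores by default


def find_sub(seq, sub, start=0):
    """Index of the first occurrence of `sub` in `seq` at or after `start`, or -1."""
    for i in range(start, len(seq) - len(sub) + 1):
        if seq[i:i + len(sub)] == sub:
            return i
    return -1


def mask_to_responses(input_ids, start_marker, end_marker):
    """Keep labels only inside assistant spans; every other position is ignored."""
    labels = [IGNORE_INDEX] * len(input_ids)
    pos = 0
    while True:
        hit = find_sub(input_ids, start_marker, pos)
        if hit == -1:
            break
        begin = hit + len(start_marker)
        end = find_sub(input_ids, end_marker, begin)
        # Include the closing marker in the loss; an unterminated span runs to the end.
        stop = len(input_ids) if end == -1 else end + len(end_marker)
        labels[begin:stop] = input_ids[begin:stop]
        pos = stop
    return labels


# Toy ids (hypothetical): [1, 2] stands for "<|im_start|>assistant\n", [9] for "<|im_end|>".
ids = [5, 6, 7, 9, 1, 2, 30, 31, 9, 5, 8, 9]
labels = mask_to_responses(ids, [1, 2], [9])
print(labels)
```

Only the assistant tokens and their closing marker carry a loss; the user turn and the assistant header itself stay at the ignore value.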

Core settings:

  • Max sequence length: 32768
  • Target optimizer steps: 1500
  • Effective batch size: 8
  • Dataset exposure: about 0.55 epoch over 21,785 rows
  • Learning rate: 1.5e-4
  • Scheduler: cosine
  • Warmup: 20 steps
  • Checkpoint cadence during training: 50 steps
  • LoRA rank/alpha: 16 / 32
  • Per-device batch size: 1
  • Gradient accumulation: 8
  • Optimizer: adamw_8bit
  • Weight decay: 0.01
  • Precision: bf16
  • Attention backend: FlashAttention 2 for the standard attention path
  • Loss mask: assistant responses only, using <|im_start|>assistant\n as the response marker
  • Target module coverage:
    • self_attn.q_proj, self_attn.k_proj, self_attn.v_proj, self_attn.o_proj
    • linear_attn.in_proj_qkv, linear_attn.in_proj_a, linear_attn.in_proj_b, linear_attn.in_proj_z, linear_attn.out_proj
    • mlp.gate_proj, mlp.up_proj, mlp.down_proj
  • Explicitly excluded from LoRA: MTP, vision, norms, A_log, dt_bias
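The exposure figure follows directly from the settings above, and the schedule is easy to sketch as well. This assumes a single GPU, one dataset row per sample (no packing), and linear warmup followed by cosine decay to zero; the exact scheduler shape in the real run may differ.

```python
import math

ROWS = 21_785          # formatted rows in the curriculum
STEPS = 1_500          # target optimizer steps
PER_DEVICE_BATCH = 1
GRAD_ACCUM = 8
PEAK_LR = 1.5e-4
WARMUP = 20

effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM  # samples per optimizer step
rows_seen = STEPS * effective_batch              # total samples over the run
epochs = rows_seen / ROWS                        # fraction of one pass over the data


def lr_at(step: int) -> float:
    """Learning rate at an optimizer step: linear warmup, then cosine decay (assumed shape)."""
    if step < WARMUP:
        return PEAK_LR * step / WARMUP
    progress = (step - WARMUP) / (STEPS - WARMUP)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(f"effective batch {effective_batch}, rows seen {rows_seen}, epochs {epochs:.2f}")
print(f"lr at step 20: {lr_at(20):.2e}, at step 760: {lr_at(760):.2e}")
```

With a 20-step warmup the peak rate is reached almost immediately, so nearly the whole run sits on the cosine decay.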

Coverage gate:

  • Trainable adapter tensors: 992
  • Trainable parameters: 116,727,808
  • self_attn: 128 trainable tensors
  • linear_attn: 480 trainable tensors
  • mlp: 384 trainable tensors
  • mtp: 0
  • vision: 0

The mtp and vision counts above refer to LoRA trainable coverage only. Those
components are intentionally excluded from adapter training. The final EXL3 serving
artifact is intended to be text-only and non-MTP after post-merge stripping/validation.
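A gate like this can be sketched as a simple count over trainable tensor names. The layer split used here (16 standard-attention layers and 48 linear-attention layers, two LoRA tensors per wrapped module) is my inference from the reported counts, not something read from the model config, and the name substrings are placeholders to adjust to the real tensor names.

```python
SELF_ATTN = ["q_proj", "k_proj", "v_proj", "o_proj"]
LINEAR_ATTN = ["in_proj_qkv", "in_proj_a", "in_proj_b", "in_proj_z", "out_proj"]
MLP = ["gate_proj", "up_proj", "down_proj"]

# Excluded groups are checked first so a nested name cannot hide in another group.
GROUPS = ("mtp", "vision", "self_attn", "linear_attn", "mlp")
FORBIDDEN = ("mtp", "vision")


def coverage(trainable_names):
    """Count trainable tensors per group by substring match on the tensor name."""
    counts = {group: 0 for group in GROUPS}
    for name in trainable_names:
        for group in GROUPS:
            if group in name:
                counts[group] += 1
                break
    return counts


def synthetic_names(n_full=16, n_linear=48):
    """Build hypothetical adapter tensor names for the inferred layer split."""
    names = []
    for layer in range(n_full + n_linear):
        # Which layers use standard attention does not matter for counting.
        group, mods = ("self_attn", SELF_ATTN) if layer < n_full else ("linear_attn", LINEAR_ATTN)
        wrapped = [f"{group}.{m}" for m in mods] + [f"mlp.{m}" for m in MLP]
        for mod in wrapped:
            for half in ("lora_A", "lora_B"):
                names.append(f"layers.{layer}.{mod}.{half}.weight")
    return names


counts = coverage(synthetic_names())
assert all(counts[g] == 0 for g in FORBIDDEN), "adapter leaked into excluded modules"
print(counts, "total:", sum(counts.values()))
```

In a real run the names would come from the model's trainable parameters rather than from `synthetic_names`, which exists only to show that the reported numbers are self-consistent.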

Curriculum

The 32K training curriculum was rendered into final chat-template text before SFT.
It contains 21,785 formatted rows, built from:

  • Claude Opus trace-inversion datasets from the Jackrong catalog
  • Hermes agent reasoning traces
  • Qwen3 Coder 480B distill mini
  • Competitive Python programming blend
  • A small local ECC/Codex/STAR rules-and-agent-behavior slice

The local slice is deliberately small and is meant to steer repo-agent behavior rather
than make the model specific to one private repository.

The training data is a blended single-pass curriculum rather than the official Jackrong
staged production run. It aims to compress the public Qwopus-style trace-inversion,
agentic coding, and long-context behaviors into a practical single-H200 QLoRA build.

Well, my training mixed up tool calls as chat and did not work. However, I put up 2 stock versions with my quant for any other 3090 users: both non-MTP, one with vision and one without.
