This model was trained on the following datasets using the Qwen3.5 chat template, with both enable_thinking and preserve_thinking set to True during training (a rendering sketch follows the list):

  • armand0e/badlogicgames-pi-mono-opus-filtered - Pi traces from Claude Opus (mainly 4.5)
  • armand0e/kimi-k2.6-claude-code-traces - Claude Code traces from kimi k2.6
  • armand0e/kimi-k2.6-agent - Codex traces from kimi k2.6
  • armand0e/minimax-m2.7-agent - Pi traces from minimax m2.7
  • TeichAI/Claude-Opus-4.6-Reasoning-887x (downsampled to 200 examples; included only to stabilize chat behavior)
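
These flags are forwarded to the tokenizer's chat template as extra kwargs. A minimal sketch of rendering a conversation with them, assuming this template accepts both flags the way Qwen3's template accepts enable_thinking:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("armand0e/Qwen3.5-9B-Agent")

messages = [{"role": "user", "content": "List the files in the current directory."}]

# Extra kwargs to apply_chat_template are exposed to the Jinja template,
# which is how enable_thinking / preserve_thinking reach it.
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,
    preserve_thinking=True,
)
print(prompt)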

I recommend using the following sampling parameters (applied in the sketch after this list):

  • temp: 1.0
  • top_k: 20 (though higher values like 40 still seem to work and be stable with tool calling and agentic tasks)
  • top_p: 0.95
  • min_p: 0.00
  • repeat_penalty: 1.0
  • presence_penalty: 1.5
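
These map one-to-one onto llama-cpp-python's sampling arguments; a minimal sketch, assuming a GGUF export of this model (the file path is a placeholder):

from llama_cpp import Llama

# Model path is a placeholder for whatever GGUF quantization you use.
llm = Llama(model_path="Qwen3.5-9B-Agent.Q4_K_M.gguf", n_ctx=49152)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize this repo's build steps."}],
    temperature=1.0,
    top_k=20,
    top_p=0.95,
    min_p=0.00,
    repeat_penalty=1.0,
    presence_penalty=1.5,
)
print(out["choices"][0]["message"]["content"])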

Training code:

import os

from unsloth import FastModel
import torch

MAX_SEQ_LEN = 49152
HF_TOKEN = os.environ.get("HF_TOKEN")  # used by prepare_data below
OUTPUT_DIR = "outputs"                 # checkpoint directory (placeholder)

# Base model load, implied by the get_peft_model call below. The base is
# unsloth/Qwen3.5-9B per this card; load_in_4bit is my assumption.
model, tokenizer = FastModel.from_pretrained(
    model_name     = "unsloth/Qwen3.5-9B",
    max_seq_length = MAX_SEQ_LEN,
    load_in_4bit   = True,
)

model = FastModel.get_peft_model(
    model,
    finetune_vision_layers     = False, # Turn off for just text!
    finetune_language_layers   = True,  # Should leave on!
    finetune_attention_modules = True,  # Attention good for GRPO
    finetune_mlp_modules       = True,  # Should leave on always!

    r = 64,           # Larger = higher accuracy, but might overfit
    lora_alpha = 64,  # Recommended alpha == r at least
    lora_dropout = 0,
    bias = "none",
    random_state = 3407,
)

from teich import prepare_data

train_dataset = prepare_data(
    {
        "opus-agent": {
            "source": "armand0e/badlogicgames-pi-mono-opus-filtered",
        },
        "kimi-claude": {
            "source": "armand0e/kimi-k2.6-claude-code-traces",
        },
        "kimi-codex": {
            "source": "armand0e/kimi-k2.6-agent",
        },
        "minimax-m2.7": {
            "source": "armand0e/minimax-m2.7-agent",
        },
        "chat": {
            "source": "TeichAI/Claude-Opus-4.6-Reasoning-887x",
            "max_examples": 200,
        }
    },
    tokenizer,
    split="train",
    hf_token=HF_TOKEN,
    chat_template_kwargs={"enable_thinking": True, "preserve_thinking": True},
    max_length=MAX_SEQ_LEN,
    drop_oversized_examples=True,
    trim_oversized_followups=True,
    tokenize=True,
    strict=True,
)
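
# A quick sanity check before wiring up the trainer: confirm the mixture
# materialized and eyeball one rendered example. Indexing "input_ids"
# assumes tokenize=True produces that field, which is an assumption
# about teich's output schema.
print(len(train_dataset))
print(tokenizer.decode(train_dataset[0]["input_ids"])[:2000])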

from trl import SFTConfig, SFTTrainer

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    eval_dataset=None,
    args=SFTConfig(
        dataset_text_field="text",
        dataset_num_proc=1,
        max_length=MAX_SEQ_LEN,
        packing=False,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,  # effective batch size of 8
        warmup_steps=5,
        num_train_epochs=2,
        learning_rate=2e-4,
        logging_steps=1,
        save_steps=100,
        save_total_limit=3,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        output_dir=OUTPUT_DIR,
        seed=3407,
        report_to="none",
    ),
)

from teich import mask_data

# Loss masking: keep gradients on reasoning traces, final answers, and
# tool calls; the remaining conversation tokens are masked out of the loss.
trainer = mask_data(
    trainer,
    tokenizer=tokenizer,
    train_on_reasoning=True,
    train_on_final_answers=True,
    train_on_tools=True,
)
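
From here, kicking off the run and saving follow the standard Unsloth pattern; a minimal sketch, where the merged-weights directory name is a placeholder and save_pretrained_merged is Unsloth's export helper:

trainer.train()

# Save just the LoRA adapter alongside the tokenizer...
model.save_pretrained(OUTPUT_DIR)
tokenizer.save_pretrained(OUTPUT_DIR)

# ...or export merged 16-bit weights for serving or quantization.
model.save_pretrained_merged("Qwen3.5-9B-Agent-merged", tokenizer, save_method="merged_16bit")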

This tune was very data-limited, but it still impresses me. I encourage everyone to generate their own high-quality data for their own use cases; individual datasets can all be aggregated together, as in the sketch below.
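
Aggregation here just means extending the mixture dict passed to prepare_data above; the extra source below is a hypothetical placeholder:

mixture = {
    "opus-agent": {"source": "armand0e/badlogicgames-pi-mono-opus-filtered"},
    # ...the other four sources shown above...
    "my-traces": {"source": "your-username/your-agent-traces"},  # hypothetical
}
# Then call prepare_data(mixture, tokenizer, ...) with the same kwargs as above.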


Uploaded finetuned model

  • Developed by: armand0e
  • License: apache-2.0
  • Finetuned from model: unsloth/Qwen3.5-9B

This Qwen3.5 model was trained 2x faster with Unsloth and Hugging Face's TRL library.
