Qwopus3.5-9B-v4
Training Pipeline
Jackrong/Qwopus3.5-9B-v3 (clean base, 87.8% HumanEval)
-> Phase 1: DAPO-GRPO 150 steps (native TRL)
loss=dapo, clip=[0.2, 0.28], beta=0, mask_truncated=True
Multiplicative reward: tags as entry fee, filler penalized
-> Phase 2: SAI-DPO Round 1 (self-generated pairs)
-> Phase 3: SAI-DPO Round 2 (hard example mining)
-> This Model
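Phase 1 maps onto TRL's native GRPO trainer roughly as follows. This is a sketch, assuming a recent TRL release whose GRPOConfig exposes loss_type, epsilon_high, and mask_truncated_completions; check the parameter names against your installed version before use:

```python
from trl import GRPOConfig

# Hypothetical mapping of the Phase 1 settings onto TRL's GRPOConfig.
config = GRPOConfig(
    loss_type="dapo",                 # token-level DAPO loss (no length bias)
    epsilon=0.2,                      # lower clip bound: ratio >= 1 - 0.2
    epsilon_high=0.28,                # asymmetric upper clip bound: ratio <= 1 + 0.28
    beta=0.0,                         # no KL penalty against the reference model
    mask_truncated_completions=True,  # overlong filtering: drop truncated samples from the loss
    max_steps=150,
)
```

With beta=0 no reference model is kept in memory during Phase 1, which is part of why DAPO-style runs fit on a single GPU.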
Key Innovation: Multiplicative Rewards
- No tags + correct answer = score x 0.1 (heavy format penalty)
- Filler thinking (canned phrases) = score x 0.15, then -0.5 penalty
- Full tags + correct = score x 1.0 + bonuses (full credit)
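The multiplicative scheme above can be sketched as a plain scoring function. This is illustrative only; `base_score`, `has_tags`, and `is_filler` are hypothetical names, not the actual reward code:

```python
def shaped_reward(base_score: float, has_tags: bool, is_filler: bool) -> float:
    """Multiplicative reward shaping: the thinking tags act as an entry fee."""
    if not has_tags:
        return base_score * 0.1          # correct but untagged: heavy format penalty
    if is_filler:
        return base_score * 0.15 - 0.5   # canned-phrase thinking: scaled down, then fined
    return base_score * 1.0              # full tags: full credit (bonuses omitted here)
```

Because the penalties multiply the correctness score instead of adding to it, a formally broken but correct answer can never outscore a well-formatted one.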
DAPO Features (Native TRL)
- Token-level loss (no length bias)
- Asymmetric clipping (epsilon_high=0.28)
- Overlong filtering (mask_truncated=True)
- No KL penalty (beta=0)
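For intuition, the asymmetric clipping can be written as a per-token surrogate in plain Python. This is a sketch of the objective, not TRL's implementation; variable names are illustrative:

```python
def clipped_surrogate(ratio: float, advantage: float,
                      eps_low: float = 0.2, eps_high: float = 0.28) -> float:
    """PPO-style surrogate with DAPO's asymmetric clip range [1-eps_low, 1+eps_high]."""
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    # Take the pessimistic (minimum) of the unclipped and clipped objectives.
    return min(ratio * advantage, clipped * advantage)
```

Raising only the upper bound (1.28 vs. 1.2) lets low-probability tokens with positive advantage grow further before clipping kicks in, which is DAPO's argument for better exploration.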
Files
merged-model/ - full merged safetensors weights
Usage
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    'MeridianVector/Qwopus3.5-9B-v4',
    subfolder='merged-model',
    torch_dtype='auto',
    device_map='auto',
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    'MeridianVector/Qwopus3.5-9B-v4',
    subfolder='merged-model',
    trust_remote_code=True,
)
Acknowledgements
Trained on Google Colab G4 (RTX PRO 6000 Blackwell, 96GB). Benchmarks pending.
Model tree for MeridianVector/Qwopus3.5-9B-v4
Qwen/Qwen3.5-9B-Base (base model)
-> Qwen/Qwen3.5-9B (finetune)
-> unsloth/Qwen3.5-9B (finetune)
-> Jackrong/Qwopus3.5-9B-v3 (adapter)
-> MeridianVector/Qwopus3.5-9B-v4 (this model)