---
base_model: Jackrong/Qwopus3.5-9B-v3
tags: [qwen3.5, reasoning, grpo, dapo, dpo, self-correction, verifiable-rewards]
license: apache-2.0
language: [en]
pipeline_tag: text-generation
---

# Qwopus3.5-9B-v4

A reasoning-focused fine-tune of Jackrong/Qwopus3.5-9B-v3: DAPO-style GRPO training with verifiable, multiplicative rewards, followed by two rounds of SAI-DPO on self-generated preference pairs.
|
|
## Training Pipeline
```
Jackrong/Qwopus3.5-9B-v3 (clean base, 87.8% HumanEval)
 -> Phase 1: DAPO-GRPO, 150 steps (native TRL)
      loss=dapo, clip=[0.2, 0.28], beta=0, mask_truncated=True
      Multiplicative reward: tags as entry fee, filler penalized
 -> Phase 2: SAI-DPO Round 1 (self-generated preference pairs)
 -> Phase 3: SAI-DPO Round 2 (hard-example mining)
 -> This model
```
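The two SAI-DPO phases build preference pairs from the model's own samples, keeping only prompts where a verifier can separate good completions from bad ones. A minimal sketch of that loop, with hedging: `verify`, `sample_fn`, and the `k // 4` hardness threshold are illustrative placeholders, not the actual pipeline.

```python
import random

def verify(completion: str, reference: str) -> bool:
    """Placeholder verifier: in practice this would run unit tests or
    answer checking against a verifiable reward."""
    return completion.strip() == reference.strip()

def mine_dpo_pairs(prompts, sample_fn, references, k=8, hard_only=False):
    """Sample k completions per prompt, split them with the verifier, and
    emit (chosen, rejected) pairs. With hard_only=True, keep only prompts
    where most samples fail (Round 2 style hard-example mining)."""
    pairs = []
    for prompt, ref in zip(prompts, references):
        completions = [sample_fn(prompt) for _ in range(k)]
        good = [c for c in completions if verify(c, ref)]
        bad = [c for c in completions if not verify(c, ref)]
        if not good or not bad:
            continue  # need both sides to form a preference pair
        if hard_only and len(good) > k // 4:
            continue  # prompt is too easy for hard-example mining
        pairs.append({"prompt": prompt,
                      "chosen": random.choice(good),
                      "rejected": random.choice(bad)})
    return pairs
```

The resulting `{"prompt", "chosen", "rejected"}` records are the standard input format for a DPO trainer.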
|
|
## Key Innovation: Multiplicative Rewards
- No tags + correct answer = score x 0.1 (bad format forfeits most credit)
- Filler thinking = score x 0.15 - 0.5 (canned-phrase penalty)
- Full tags + correct answer = score x 1.0 + bonuses (full credit)
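The gating above can be sketched as a reward function. This is an illustration of the multiplicative scheme, not the training code: the tag format and the filler-phrase list are assumptions, while the 0.1 / 0.15 - 0.5 / 1.0 factors mirror the bullets.

```python
import re

# Hypothetical canned phrases that would trigger the filler penalty.
FILLER_PHRASES = ("let me think", "okay, so", "as an ai")

def multiplicative_reward(completion: str, correct: bool) -> float:
    """Gate a correctness score by format: tags act as an entry fee,
    filler thinking is penalized, full compliance earns full credit."""
    base = 1.0 if correct else 0.0
    match = re.search(r"<think>(.*?)</think>", completion, re.DOTALL)
    if match is None:
        return base * 0.1  # no tags: 90% of the score is withheld
    think = match.group(1).lower()
    if any(p in think for p in FILLER_PHRASES):
        return base * 0.15 - 0.5  # canned-phrase penalty
    return base * 1.0  # full tags, genuine reasoning (bonuses added upstream)
```

Because the format multiplies rather than adds, a correct answer can never fully compensate for a malformed or filler-padded trace.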
|
|
## DAPO Features (Native TRL)
- Token-level loss (no length bias)
- Asymmetric clipping (epsilon_high=0.28)
- Overlong filtering (mask_truncated=True)
- No KL penalty (beta=0)
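These choices map onto `GRPOConfig` fields in recent TRL releases. A configuration sketch, with the caveat that exact field names can shift between TRL versions:

```python
from trl import GRPOConfig

config = GRPOConfig(
    loss_type="dapo",                 # token-level DAPO loss, no length bias
    epsilon=0.2,                      # lower clipping bound
    epsilon_high=0.28,                # asymmetric upper bound (clip-higher)
    beta=0.0,                         # no KL penalty against the reference model
    mask_truncated_completions=True,  # overlong filtering
)
```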
|
|
## Files
- `merged-model/` — Full merged safetensors
|
|
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "MeridianVector/Qwopus3.5-9B-v4",
    subfolder="merged-model",
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "MeridianVector/Qwopus3.5-9B-v4",
    subfolder="merged-model",
    trust_remote_code=True,
)
```
|
|
## Acknowledgements
- [Jackrong](https://huggingface.co/Jackrong) for Qwopus3.5-9B-v3
- [Qwen Team](https://huggingface.co/Qwen) for Qwen3.5-9B
|
|
Trained on Google Colab G4 (RTX PRO 6000 Blackwell, 96GB).
Benchmarks pending.
|
|