Deimos A1

Satellite Class · 4B · CCoT Fine-tune

Overview

01

Deimos A1 is a concise chain-of-thought (CCoT) fine-tune of Qwen3.5-4B. It produces dense, stepwise <think> blocks averaging roughly one-ninth the tokens of the base model (~150 vs ~1,400 per problem). Accuracy evaluation is still in progress; see Benchmarks.

This is our first model release. It was trained on Quark, a 4,919-row CCoT SFT dataset whose <think> traces were compressed by a Qwen3.6-35B teacher (NVFP4) running in our internal Tokamak pipeline. Final answers in the training data are byte-identical to the source; only the reasoning channel is rewritten.
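
The byte-identical guarantee can be checked mechanically. The sketch below assumes each row stores its completion as one string, with the reasoning wrapped in <think>...</think> and the answer following the closing tag. The function names and that row layout are illustrative assumptions, not the Tokamak pipeline's actual code.

```python
THINK_CLOSE = "</think>"

def answer_channel(completion: str) -> bytes:
    """Return everything after the last </think> tag, as raw UTF-8 bytes.

    If there is no closing tag, the whole string is treated as the answer.
    """
    _, sep, tail = completion.rpartition(THINK_CLOSE)
    return (tail if sep else completion).encode("utf-8")

def answer_preserved(source: str, compressed: str) -> bool:
    """True when the two completions differ only in the reasoning channel."""
    return answer_channel(source) == answer_channel(compressed)

# Toy rows (made up for illustration, not taken from Quark).
source = "<think>Let me add 17 and 25 slowly, digit by digit...</think>\n\n42"
compressed = "<think>17+25=42</think>\n\n42"
print(answer_preserved(source, compressed))
```

Comparing bytes rather than normalised strings means that even a whitespace change in the answer counts as a violation.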

The "A1" suffix means Alpha 1 — the first public iteration of the Deimos line. Future revisions (A2, …) will fold in additional sources (Kimi K2.5, larger Quark builds), longer training runs, and answer-channel compression once it lands in the Tokamak pipeline.

Specifications

02
Model
  • Class: Satellite
  • Parameters: 4B
  • Architecture: Qwen3_5 (Gated DeltaNet + sparse attention)
  • Base: Qwen/Qwen3.5-4B
  • Precision: BF16 (merged)
  • Context: 131,072 tokens

Training
  • Method: LoRA r=128, α=128 (merged)
  • Targets: all attention + MLP
  • Epochs: 3
  • Optimiser: AdamW (cosine, lr 1e-4)
  • Wall time: ~3 h 49 m

Training Details

03
  • Dataset: Michael-Kozu/Quark — 4,919 rows of CCoT SFT data (Opus 4.6 + GPT-5.4 sources). Train/val/test = 3,937 / 491 / 491.
  • Adapter: LoRA rank 128, α 128, dropout 0; targets q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj; trainable parameters 169.9M of 4.7B (3.61%). Adapter merged into base before release.
  • Schedule: 3 epochs · per-device batch 4 · gradient accumulation 4 · effective batch 16 · 741 total steps · cosine LR 1e-4 · 5% warmup · AdamW · weight decay 0.01.
  • Sequence: max 4,096 tokens · packing disabled (Qwen3.5 multimodal architecture).
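
The step count and trainable-parameter share above can be re-derived from the other figures in this list. The sketch below is a sanity check under two assumptions that the card does not state explicitly: a single training device (implied by per-device batch 4 × accumulation 4 = effective batch 16), and the final partial batch of each epoch still counting as an optimiser step.

```python
import math

train_rows = 3_937        # train split size
per_device_batch = 4
grad_accum = 4
num_devices = 1           # assumption: implied by the effective batch of 16
epochs = 3

effective_batch = per_device_batch * grad_accum * num_devices
# Assumption: the last, smaller batch of each epoch is kept (ceil, not floor).
steps_per_epoch = math.ceil(train_rows / effective_batch)
total_steps = steps_per_epoch * epochs

# Share of trainable parameters, using the rounded counts quoted above.
trainable_pct = 100 * 169.9e6 / 4.7e9

print(effective_batch, steps_per_epoch, total_steps, round(trainable_pct, 2))
# → 16 247 741 3.61
```

With floor division instead of ceil, the same inputs would give 738 steps, so the reported 741 suggests the partial batch was kept.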

Loss trajectory

[Figure: Training and eval loss curves]

Train loss falls from 1.043 to 0.477 over 741 steps. Eval loss reaches its minimum of 0.814 at the end of epoch 2, then drifts up to 0.831 at the end of epoch 3, indicating mild overfitting. The released weights use the final epoch-3 state; the epoch-2 checkpoint with the lower validation loss is preserved internally.

Learning-rate schedule

[Figure: Cosine LR schedule with 5% warmup]
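
For readers without the plot, the curve can be approximated in a few lines. This is a minimal sketch that assumes linear warmup from zero, cosine decay to zero, and warmup steps rounded up from 5% of 741; the card only states "cosine, 5% warmup, peak 1e-4", so the exact rounding and the decay floor are assumptions.

```python
import math

def lr_at(step: int, total_steps: int = 741, peak_lr: float = 1e-4,
          warmup_ratio: float = 0.05) -> float:
    """Learning rate at a given optimiser step (0-indexed)."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)  # assumption: rounded up
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak.
        return peak_lr * step / max(1, warmup_steps)
    # Cosine decay from the peak down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

for s in (0, 19, 38, 390, 741):
    print(s, f"{lr_at(s):.2e}")
```

The rate peaks at the end of warmup and is roughly half the peak near the midpoint of the decay phase.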

Benchmarks

04

The headline measurement is token efficiency: mean output tokens per problem and wall-clock time per benchmark run. Both are direct, reproducible measurements taken with the same harness, run against both endpoints on the same hardware.

Token efficiency — Deimos vs Base
  • Mean tokens / problem: ~1,400 → ~150 (≈ -89%)
  • Wall-clock (full bench): 58m → 9m 38s (≈ 6× faster)
  • Contamination (13-gram): 0% overlap between Quark and the GSM8K / MMLU-Pro / ARC-C test sets
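
A 13-gram overlap check of the kind reported above can be sketched as follows. The tokenisation (lowercased whitespace split) and the metric (share of training rows that share at least one 13-gram with any test item) are assumptions made for illustration; the card does not specify either, and the function names are hypothetical.

```python
def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """All word-level n-grams of a text, lowercased and whitespace-tokenised."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contaminated_fraction(train_texts: list[str], test_texts: list[str],
                          n: int = 13) -> float:
    """Fraction of training rows sharing at least one n-gram with the test set."""
    if not train_texts:
        return 0.0
    test_grams: set[tuple[str, ...]] = set()
    for text in test_texts:
        test_grams |= ngrams(text, n)
    hits = sum(1 for text in train_texts if ngrams(text, n) & test_grams)
    return hits / len(train_texts)

# Toy data (made up): the first training row shares one 13-gram with the test item.
train = ["a b c d e f g h i j k l m n", "completely different short text"]
test = ["x a b c d e f g h i j k l m y"]
print(contaminated_fraction(train, test))  # → 0.5
```

Texts shorter than 13 tokens produce no n-grams and therefore can never register as overlapping under this definition.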

Token efficiency by task

[Figure: Mean output tokens per problem, Deimos A1 vs Qwen3.5-4B base]

Accuracy — comprehensive evaluation in progress.

Initial harness runs show that Deimos consistently emits a parseable answer immediately after its <think> block, while the base model under the same harness more often does not. This makes accuracy comparisons sensitive to the parser, the per-task max_tokens budget, and chat-template handling. We are tuning the eval harness (matching Qwen's recommended max_tokens for thinking mode and verifying that our scores reproduce Qwen's published baseline numbers before claiming any delta) and will publish a full report with per-task tables, contamination disclosure, and reproduction instructions in a follow-up update to this card.
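
One way to make this sensitivity visible is to track, per model, how often a completion yields any answer at all. The sketch below is not our harness; it assumes <think>...</think> delimiters appear in the completion itself and treats an unclosed <think> block (for example, one cut off by max_tokens) as unparseable. The function names are hypothetical.

```python
from typing import Optional

def extract_answer(completion: str) -> Optional[str]:
    """Return the text after </think>, or None if no usable answer exists."""
    if "<think>" in completion and "</think>" not in completion:
        return None  # reasoning was cut off before any answer was emitted
    # With no tags at all, rpartition returns the whole string as the answer.
    answer = completion.rpartition("</think>")[2].strip()
    return answer or None

def parse_rate(completions: list[str]) -> float:
    """Fraction of completions from which an answer could be extracted."""
    if not completions:
        return 0.0
    parsed = sum(extract_answer(c) is not None for c in completions)
    return parsed / len(completions)

# Toy completions (made up): one clean, one truncated, one with an empty answer.
outs = [
    "<think>2+2=4</think>\n4",
    "<think>Let me think about this very carefully",
    "<think>done</think>",
]
print(parse_rate(outs))
```

If the chat template places the opening <think> tag in the prompt rather than the completion, the truncation check above would miss cut-off outputs and would need adjusting, which is one instance of the chat-template sensitivity described above.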

Until that report lands, the only quantitative claims we make about this model are the token-efficiency and wall-clock numbers above — both of which are measured the same way for both models and are not sensitive to harness parsing.
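
The two headline ratios follow directly from the table values. The snippet below only re-derives them from the rounded figures quoted on this card, so the results are approximate to the same degree.

```python
base_tokens, deimos_tokens = 1_400, 150   # approximate mean tokens per problem
base_secs = 58 * 60                       # base wall-clock: 58m
deimos_secs = 9 * 60 + 38                 # Deimos wall-clock: 9m 38s

token_reduction = 1 - deimos_tokens / base_tokens
speedup = base_secs / deimos_secs

print(f"{token_reduction:.1%} fewer tokens, {speedup:.2f}x faster")
# → 89.3% fewer tokens, 6.02x faster
```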

Limitations & License

05
  • Subset benchmarks only. Per-task n=10 (stderr ±16%). MMLU-Pro at n=140 (stderr ±4%). Larger-n runs are planned for the next release.
  • Inherited Qwen3.5-4B limitations — language coverage, knowledge cutoff, and any biases of the base model. Quark fine-tuning shifts style, not knowledge.
  • Mild epoch-3 overfitting on the Quark training split (3,937 of 4,919 rows): val loss rises from 0.814 to 0.831 between epochs 2 and 3.
  • English only at this time.
  • License: MIT, consistent with the Quark dataset and the Qwen3.5-4B base license terms.
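
The stderr figures in the first limitation appear consistent with the worst-case binomial standard error, sqrt(p(1-p)/n) at p = 0.5. That reading is our assumption about how they were derived; the sketch below simply reproduces the numbers under it.

```python
import math

def binomial_stderr(n: int, p: float = 0.5) -> float:
    """Standard error of an accuracy estimate from n independent items.

    p = 0.5 gives the widest (most conservative) value.
    """
    return math.sqrt(p * (1 - p) / n)

for n in (10, 140):
    print(n, f"±{binomial_stderr(n):.1%}")
# → 10 ±15.8%
# → 140 ±4.2%
```

Because the error shrinks with the square root of n, halving it requires four times as many items.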
Kozu AI Turning the laws of reality into unparalleled creation.