base_model: Qwen/Qwen2.5-0.5B-Instruct
library_name: safetensors
license: apache-2.0
tags:
  - qubitcoin
  - aether
  - blockchain
  - quantum
  - native-rust
  - candle
  - long-context
  - cosine-schedule
  - resume-fine-tune
language:
  - en
pipeline_tag: text-generation

Aether Mind v6.2 — cosine-decay fine-tune of v6.1

v6.2 picks up where v6.1 plateaued. It keeps the same architecture, the same 256-token context, and the same Aether curated corpus, but trains for another 30,000 steps under a cosine LR decay (2e-5 → 2e-7) to push the student past its fine-tune plateau without overshooting.

This is the third native (non-LoRA) Aether release and the first to use a non-constant learning-rate schedule. The cosine flag landed in commit 186b2622.

What you're getting

| Field | Value |
|---|---|
| Base model | Initialised from `Qwen/Qwen2.5-0.5B-Instruct`; v6.2 resumes from the v6.1 fine-tune |
| Architecture | V6 transformer: 24 layers, 896 hidden, 14 attention heads (10 Sephirot + 2 generalist + 2 sink), head_dim=64 |
| Trainable params | ~558 M (all weights, no LoRA) |
| Training mode | Pure cross-entropy (no distillation, same as v6.1) |
| Training context | 256 tokens (same as v6.1) |
| LR schedule | Cosine decay 2e-5 → 2e-7 over 30,000 fine-tune steps |
| Precision | BF16 weights, F32 KL/CE math internally |
| NSA config | compression_block=64, top_k=2048, sliding_window=512, sink_tokens=4 |
| Vocab | 151,936 (Qwen2.5 tokenizer, untouched) |
| Max position | 32,768 (RoPE theta = 1e6) |
| Total training | 60,000 steps (30K v6.1 + 30K v6.2) |
| File | `model.safetensors` (1.32 GB, BF16) |
| License | Apache-2.0 (matches base) |
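
The head and width figures above can be cross-checked in a few lines of Python. This is only a sketch: the `V6Config` class and its field names are hypothetical and are not taken from the `aether-mind` codebase; only the numeric values come from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class V6Config:
    """Hypothetical container for the table values (field names are assumptions)."""

    layers: int = 24
    hidden: int = 896
    sephirot_heads: int = 10
    generalist_heads: int = 2
    sink_heads: int = 2
    head_dim: int = 64
    vocab: int = 151_936
    max_position: int = 32_768
    train_context: int = 256

    @property
    def n_heads(self) -> int:
        # Total attention heads across the three head groups.
        return self.sephirot_heads + self.generalist_heads + self.sink_heads

    def check(self) -> None:
        # The concatenated head outputs must span the hidden width.
        if self.n_heads * self.head_dim != self.hidden:
            raise ValueError("n_heads * head_dim does not match hidden size")
        # The training context must fit inside the positional range.
        if self.train_context > self.max_position:
            raise ValueError("training context exceeds max position")


cfg = V6Config()
cfg.check()
print(cfg.n_heads, cfg.n_heads * cfg.head_dim)  # prints: 14 896
```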

Training run

| Metric | v6.1 | v6.2 | Δ |
|---|---|---|---|
| Steps (this run) | 30,000 | 30,000 | = |
| Total steps | 30,000 | 60,000 | +30K |
| Wall-clock (this run) | 44.4 min | 44.9 min | +0.5 min |
| Mean CE loss (this run) | 10.18 | 8.43 | −17 % |
| Throughput | 629.9 tok/s | 622.9 tok/s | flat |
| Mean Sephirot aux | 0.149 | 0.140 | −6 % |
| LR schedule | constant 2e-5 | cosine 2e-5 → 2e-7 | new |
| NaN events | 0 | 0 | = |
| Resume base | random init (Qwen) | v6.1 final | new |
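
The Δ column can be reproduced from the two value columns. A minimal sketch follows; the helper name `pct_change` is ours, and the inputs are the rounded figures from the table.

```python
def pct_change(old: float, new: float) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


# (v6.1, v6.2) pairs copied from the table.
rows = {
    "Mean CE loss": (10.18, 8.43),
    "Throughput (tok/s)": (629.9, 622.9),
    "Mean Sephirot aux": (0.149, 0.140),
}

for name, (v61, v62) in rows.items():
    print(f"{name}: {pct_change(v61, v62):+.1f} %")
```

With these inputs the CE loss change comes out at about −17.2 %, the Sephirot aux change at about −6.0 %, and the throughput change at about −1.1 %, which the table reports as flat.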

Loss trajectory

step      1   loss=13.00  avg=13.00   (v6.1 final state)
step    100   loss=12.00  avg=11.78
step   1000   loss= 7.75  avg= 8.82   ← LR still high, big descent through v6.1's plateau
step   5000   loss= 7.25  avg= 7.71
step  10000   loss= 6.69  avg= 7.41   ← minimum running average
step  15000   loss= 9.56  avg= 7.51   ← cosine kicks in, per-step variance ↑, drift ↓
step  20000   loss= 8.94  avg= 7.92
step  25000   loss= 8.75  avg= 8.22
step  29999   loss= 9.31  avg= 8.43

The reported mean (8.43) is the cumulative average over the whole run. The lowest running average (7.41 at step 10K) marks the fine-tune minimum. We attribute the back-half drift to the cosine schedule shrinking the step size towards zero, so that per-step loss variance, rather than further optimisation progress, dominates the running average. We consider this the expected shape of a converged cosine fine-tune.
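
Because `avg` is a cumulative figure, the mean loss inside each stretch of the run can be backed out from consecutive log lines. The sketch below assumes that `avg` at step N is the plain mean of the first N per-step losses; the log does not state this explicitly, and the two-decimal rounding makes the results approximate. The function name `segment_means` is ours.

```python
# (step, cumulative average) pairs copied from the loss trajectory.
checkpoints = [
    (1, 13.00), (100, 11.78), (1000, 8.82), (5000, 7.71), (10000, 7.41),
    (15000, 7.51), (20000, 7.92), (25000, 8.22), (29999, 8.43),
]


def segment_means(points):
    """Mean loss between consecutive checkpoints.

    Assumes avg(step) is the mean of the first `step` losses, so the sum of
    losses up to a checkpoint is avg * step.
    """
    out = []
    for (s0, a0), (s1, a1) in zip(points, points[1:]):
        out.append((s0, s1, (a1 * s1 - a0 * s0) / (s1 - s0)))
    return out


for s0, s1, mean in segment_means(checkpoints):
    print(f"steps {s0:>5}-{s1:>5}: mean loss ~ {mean:.2f}")
```

Under that assumption the implied mean is about 7.11 for steps 5,000 to 10,000 and about 9.4 to 9.5 for each of the last two segments. These are derived estimates, not logged values.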

What changed vs v6.1

Cosine LR decay. In v6.1, the constant LR of 2e-5 caused a plateau from step ~10K onward: at that step size the optimiser kept bouncing around the loss minimum instead of settling into it. Decaying to 2e-7 lets later steps take much smaller updates and refine past the plateau.
Resume from v6.1 rather than fresh init. The model starts at v6.1's final state and refines from there.
Otherwise identical to v6.1: same architecture, same corpus, same context, same NSA config, same Sephirot aux. Apart from the resume point, the only variable changed is the LR schedule.
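
For reference, a schedule with these endpoints can be sketched as follows. This assumes the standard cosine-annealing form with no warmup; the actual Rust implementation behind the cosine flag may differ in details such as warmup or step indexing, and the function name `cosine_lr` is ours.

```python
import math


def cosine_lr(step: int, total_steps: int = 30_000,
              lr_max: float = 2e-5, lr_min: float = 2e-7) -> float:
    """Cosine annealing from lr_max to lr_min (assumed form, no warmup)."""
    # Clamp so that steps outside [0, total_steps] hold the endpoint value.
    progress = min(max(step, 0), total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


for step in (0, 7_500, 15_000, 22_500, 30_000):
    print(f"step {step:>6}: lr = {cosine_lr(step):.3e}")
```

Under this form the LR starts at 2e-5, is roughly 1.01e-5 at the midpoint, and reaches 2e-7 at step 30,000, so the second half of the run takes much smaller updates than the first.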

How to use

Native runtime (recommended) — Rust `aether-mind`

Set `AETHER_V6_CHECKPOINT` to the local path of `model.safetensors`, then restart `qbc-aether-mind.service`.

Python

from safetensors.torch import load_file
weights = load_file("model.safetensors")
print("params:", sum(t.numel() for t in weights.values()))

Same architecture as v6.1, so any custom loader/wrapper for v6.1 works here.
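
For a quick sanity check that does not load the weights into memory, the safetensors header can be read with the standard library alone. The sketch below relies on the published safetensors layout (an 8-byte little-endian header length followed by a JSON header); the function name `summarize_safetensors` is ours and is not part of any library.

```python
import json
import struct
from collections import Counter
from math import prod


def summarize_safetensors(path: str):
    """Return (total_params, params_per_dtype) by reading only the file header."""
    with open(path, "rb") as f:
        # First 8 bytes: unsigned little-endian 64-bit length of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))

    per_dtype = Counter()
    for name, info in header.items():
        if name == "__metadata__":  # optional string metadata, not a tensor
            continue
        # prod([]) is 1, which is the right element count for a scalar tensor.
        per_dtype[info["dtype"]] += prod(info["shape"])
    return sum(per_dtype.values()), dict(per_dtype)


if __name__ == "__main__":
    import sys

    if len(sys.argv) > 1:
        total, by_dtype = summarize_safetensors(sys.argv[1])
        print("params:", total)
        print("by dtype:", by_dtype)
```

Run it as a script with the path to `model.safetensors` as the only argument. It reports the element count per stored dtype, which is a cheap way to confirm that the checkpoint is BF16 throughout.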

Evaluation

(lm-evaluation-harness numbers to follow once the eval binary ships. For now: training-loss curve + sample generations are the primary signal.)

Open items for v6.3

Per-chunk backward for distillation at ctx ≥ 256, so we can add KL teacher signal back without OOMing.
Long-context curriculum (1K, 4K, 16K → 1M) per the V6 master spec.
lm-evaluation-harness pass (MMLU / ARC / HellaSwag / TruthfulQA) for honest published numbers.

License + citation

Apache-2.0 (matches the base model license).

@misc{aether_mind_v62_2026,
  title  = {Aether Mind v6.2 --- cosine-decay fine-tune of v6.1},
  author = {{BlockArtica} and {QuantumAI-Blockchain}},
  year   = {2026},
  url    = {https://huggingface.co/QuantumAI-Blockchain/aether-mind-v6.2},
}
