base_model: Qwen/Qwen2.5-0.5B-Instruct
library_name: safetensors
license: apache-2.0
tags:
  - qubitcoin
  - aether
  - blockchain
  - quantum
  - native-rust
  - candle
  - long-context
  - cosine-schedule
  - resume-fine-tune
language:
  - en
pipeline_tag: text-generation

Aether Mind v6.2: cosine-decay fine-tune of v6.1

V6.2 picks up where v6.1 plateaued. Same architecture, same 256-token context, same Aether curated corpus, but trained for another 30,000 steps under a cosine LR decay (2e-5 → 2e-7) to push the student past its fine-tune plateau without overshooting.

This is the third native (non-LoRA) Aether release and the first to use a learning-rate schedule other than a constant rate. The cosine flag landed in commit 186b2622.

What you're getting

| Field | Value |
| --- | --- |
| Base model | Qwen/Qwen2.5-0.5B-Instruct (initialised from), then v6.1 fine-tune resumed here |
| Architecture | V6 transformer: 24 layers, 896 hidden, 14 attention heads (10 Sephirot + 2 generalist + 2 sink), head_dim=64 |
| Trainable params | ~558 M (all weights, no LoRA) |
| Training mode | Pure cross-entropy (no distillation, same as v6.1) |
| Training context | 256 tokens (same as v6.1) |
| LR schedule | Cosine decay 2e-5 → 2e-7 over 30,000 fine-tune steps |
| Precision | BF16 weights, F32 KL/CE math internally |
| NSA config | compression_block=64, top_k=2048, sliding_window=512, sink_tokens=4 |
| Vocab | 151,936 (Qwen2.5 tokenizer, untouched) |
| Max position | 32,768 (RoPE theta = 1e6) |
| Total training | 60,000 steps (30K v6.1 + 30K v6.2) |
| File | model.safetensors (1.32 GB, BF16) |
| License | Apache-2.0 (matches base) |

Training run

| Metric | v6.1 | v6.2 | Δ |
| --- | --- | --- | --- |
| Steps (this run) | 30,000 | 30,000 | = |
| Total steps | 30,000 | 60,000 | +30K |
| Wall-clock (this run) | 44.4 min | 44.9 min | +0.5 min |
| Mean CE loss (this run) | 10.18 | 8.43 | −17% |
| Throughput | 629.9 tok/s | 622.9 tok/s | flat |
| Mean Sephirot aux | 0.149 | 0.140 | −6% |
| LR schedule | constant 2e-5 | cosine 2e-5 → 2e-7 | new |
| NaN events | 0 | 0 | = |
| Resume base | random init (Qwen) | v6.1 final | new |
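
The Δ column can be reproduced from the two value columns. The snippet below is a small sketch that recomputes the relative changes; the helper name `pct_change` is illustrative and not part of any Aether tooling.

```python
def pct_change(old: float, new: float) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0

# (v6.1, v6.2) values copied from the training-run table.
rows = {
    "mean CE loss": (10.18, 8.43),
    "throughput (tok/s)": (629.9, 622.9),
    "mean Sephirot aux": (0.149, 0.140),
}

deltas = {name: pct_change(old, new) for name, (old, new) in rows.items()}
for name, delta in deltas.items():
    print(f"{name}: {delta:+.1f}%")
```

The loss change comes out at about −17.2%, the aux change at about −6.0%, and throughput at about −1.1%, which the table rounds to "flat".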

Loss trajectory

step      1   loss=13.00  avg=13.00   (v6.1 final state)
step    100   loss=12.00  avg=11.78
step   1000   loss= 7.75  avg= 8.82   ← LR still high, big descent through v6.1's plateau
step   5000   loss= 7.25  avg= 7.71
step  10000   loss= 6.69  avg= 7.41   ← minimum running average
step  15000   loss= 9.56  avg= 7.51   ← cosine kicks in, per-step variance ↑, drift ↓
step  20000   loss= 8.94  avg= 7.92
step  25000   loss= 8.75  avg= 8.22
step  29999   loss= 9.31  avg= 8.43

The reported mean (8.43) is the run-wide average. The lowest observed running average (7.41 at step 10K) is the actual fine-tune minimum; the back-half drift is the cosine schedule reducing step size to near zero, which makes per-step variance dominate the running average. This is the expected shape of a converged cosine fine-tune.
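
To separate what happened inside each stretch of the run from the cumulative average, the logged `avg` values can be differenced. The sketch below assumes that `avg` at step `s` is the plain mean of all per-step losses from step 1 through `s` (the log format itself does not confirm this), and the logged averages are rounded to two decimals, so the results are approximate. The function name `window_means` is illustrative.

```python
# (step, cumulative average) pairs copied from the loss trajectory.
# Assumption: avg(s) is the mean of all per-step losses from step 1 to s.
checkpoints = [
    (1, 13.00), (100, 11.78), (1000, 8.82), (5000, 7.71), (10000, 7.41),
    (15000, 7.51), (20000, 7.92), (25000, 8.22), (29999, 8.43),
]

def window_means(points):
    """Mean loss inside each interval between consecutive checkpoints."""
    out = []
    for (s0, a0), (s1, a1) in zip(points, points[1:]):
        # s * avg(s) is the summed loss up to step s, so the difference of two
        # such sums is the summed loss inside the window (s0, s1].
        out.append((s0, s1, (s1 * a1 - s0 * a0) / (s1 - s0)))
    return out

windows = window_means(checkpoints)
for s0, s1, mean in windows:
    print(f"steps {s0 + 1:>5} to {s1:>5}: mean loss ~ {mean:.2f}")
```

Under that assumption the window mean is about 7.1 for steps 5K to 10K and between roughly 9.1 and 9.5 for each window after step 15K.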

What changed vs v6.1

  1. Cosine LR decay. Constant LR at 2e-5 in v6.1 caused a plateau from step ~10K onward: the optimiser kept bouncing around the loss minimum it could see at that step size. Cosine decay to 2e-7 lets later steps take much smaller updates, fine-tuning past the plateau.

  2. Resume from v6.1 rather than fresh init. The model starts at v6.1's final state and refines from there.

  3. Otherwise identical to v6.1: same architecture, same corpus, same context, same NSA config, same Sephirot aux. The single variable changed is the LR schedule.
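
For reference, the schedule described in item 1 can be sketched as a textbook half-cosine decay. This is an illustration only: it assumes no warmup and the standard cosine form, the function name `cosine_lr` is hypothetical, and the Rust trainer's actual implementation may differ.

```python
import math

def cosine_lr(step: int, total_steps: int = 30_000,
              lr_max: float = 2e-5, lr_min: float = 2e-7) -> float:
    """Textbook cosine decay from lr_max at step 0 to lr_min at the last step."""
    # Clamp the step into [0, total_steps - 1] and map it to progress in [0, 1].
    progress = min(max(step, 0), total_steps - 1) / (total_steps - 1)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

for s in (0, 7_500, 15_000, 22_500, 29_999):
    print(f"step {s:>6}: lr = {cosine_lr(s):.3e}")
```

With this form the learning rate is still roughly 1.5e-5 at step 10K and only approaches 2e-7 in the final few thousand steps.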

How to use

Native runtime (recommended): Rust aether-mind

Set AETHER_V6_CHECKPOINT to the local path of model.safetensors, then restart qbc-aether-mind.service.

Python

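# Load every tensor into a dict of name -> torch.Tensor (requires torch and safetensors).
from safetensors.torch import load_file
weights = load_file("model.safetensors")
# Total element count across all tensors in the checkpoint.
print("params:", sum(t.numel() for t in weights.values()))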

Same architecture as v6.1, so any custom loader/wrapper for v6.1 works here.
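
If you only want to inspect the checkpoint (tensor names, dtypes, parameter count) without installing torch or loading 1.32 GB into memory, the safetensors header can be read with the standard library alone. The file format starts with an 8-byte little-endian length followed by a JSON header of that size. The helper names below are illustrative, not part of the safetensors package.

```python
import json
import os
import struct

def read_safetensors_header(path: str) -> dict:
    """Return the per-tensor JSON header of a .safetensors file without loading tensors."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # u64 little-endian header size
        header = json.loads(f.read(header_len))
    header.pop("__metadata__", None)  # optional free-form metadata, not a tensor
    return header

def count_params(header: dict) -> int:
    """Sum the element counts implied by each tensor's shape."""
    total = 0
    for info in header.values():
        n = 1
        for dim in info["shape"]:
            n *= dim
        total += n
    return total

path = "model.safetensors"  # local path to the downloaded checkpoint
if os.path.exists(path):
    hdr = read_safetensors_header(path)
    print("tensors:", len(hdr))
    print("dtypes:", sorted({info["dtype"] for info in hdr.values()}))
    print("params:", count_params(hdr))
```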

Evaluation

(lm-evaluation-harness numbers to follow once the eval binary ships. For now: training-loss curve + sample generations are the primary signal.)

Open items for v6.3

  • Per-chunk backward for distillation at ctx ≥ 256, so we can add KL teacher signal back without OOMing.
  • Long-context curriculum (1K, 4K, 16K → 1M) per the V6 master spec.
  • lm-evaluation-harness pass (MMLU / ARC / HellaSwag / TruthfulQA) for honest published numbers.

License + citation

Apache-2.0 (matches the base model license).

@misc{aether_mind_v62_2026,
  title  = {Aether Mind v6.2 --- cosine-decay fine-tune of v6.1},
  author = {{BlockArtica} and {QuantumAI-Blockchain}},
  year   = {2026},
  url    = {https://huggingface.co/QuantumAI-Blockchain/aether-mind-v6.2},
}

Links