	---
	base_model: Qwen/Qwen2.5-0.5B-Instruct
	library_name: safetensors
	license: apache-2.0
	tags:
	- qubitcoin
	- aether
	- blockchain
	- quantum
	- native-rust
	- candle
	- long-context
	- cosine-schedule
	- resume-fine-tune
	language:
	- en
	pipeline_tag: text-generation
	---

	# Aether Mind v6.2 — cosine-decay fine-tune of v6.1

	v6.2 picks up where [v6.1](https://huggingface.co/QuantumAI-Blockchain/aether-mind-v6.1)
	plateaued. It keeps the same architecture, the same 256-token context,
	and the same Aether curated corpus, but trains for **another 30,000
	steps under a cosine LR decay (2e-5 → 2e-7)** to push the model past
	its fine-tune plateau without overshooting.

	This is the third native (non-LoRA) Aether release and the first to
	use a learning-rate schedule other than constant. The cosine flag landed
	in commit
	[`186b2622`](https://github.com/QuantumAI-Blockchain/qubitcoin-aether/commit/186b2622).

	## What you're getting

	| Field | Value |
	|---|---|
	| Base model | Initialised from `Qwen/Qwen2.5-0.5B-Instruct`; v6.2 resumes from the v6.1 fine-tune |
	| Architecture | V6 transformer: 24 layers, 896 hidden, 14 attention heads (10 Sephirot + 2 generalist + 2 sink), head_dim=64 |
	| Trainable params | ~558 M (all weights, no LoRA) |
	| Training mode | Pure cross-entropy (no distillation, same as v6.1) |
	| Training context | 256 tokens (same as v6.1) |
	| LR schedule | Cosine decay 2e-5 → 2e-7 over 30,000 fine-tune steps |
	| Precision | BF16 weights, F32 KL/CE math internally |
	| NSA config | compression_block=64, top_k=2048, sliding_window=512, sink_tokens=4 |
	| Vocab | 151,936 (Qwen2.5 tokenizer, untouched) |
	| Max position | 32,768 (RoPE theta = 1e6) |
	| Total training | 60,000 steps (30K v6.1 + 30K v6.2) |
	| File | `model.safetensors` (1.32 GB, BF16) |
	| License | Apache-2.0 (matches base) |
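
As a quick consistency check on the table above, the head layout and hidden size can be written down as a small config object. This is only a sketch: `V6Config` and its field names are hypothetical and are not taken from the Aether codebase; the values are the ones listed in the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class V6Config:
    """Hypothetical container for the values in the table above."""
    num_layers: int = 24
    hidden_size: int = 896
    sephirot_heads: int = 10
    generalist_heads: int = 2
    sink_heads: int = 2
    head_dim: int = 64
    vocab_size: int = 151_936
    train_context: int = 256
    max_position: int = 32_768
    rope_theta: float = 1e6

    @property
    def num_heads(self) -> int:
        # 10 Sephirot + 2 generalist + 2 sink heads.
        return self.sephirot_heads + self.generalist_heads + self.sink_heads

    def check(self) -> None:
        # Head count times per-head width should equal the hidden size.
        assert self.num_heads * self.head_dim == self.hidden_size
        # The training context has to fit inside the positional range.
        assert self.train_context <= self.max_position


cfg = V6Config()
cfg.check()
print(cfg.num_heads, cfg.num_heads * cfg.head_dim)
```

The check passes because 14 heads of width 64 give exactly the 896-wide hidden state listed in the table.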

	## Training run

	| Metric | v6.1 | v6.2 | Δ |
	|---|---|---|---|
	| Steps (this run) | 30,000 | 30,000 | = |
	| Total steps | 30,000 | 60,000 | +30K |
	| Wall-clock (this run) | 44.4 min | 44.9 min | +0.5 min |
	| Mean CE loss (this run) | 10.18 | 8.43 | −17 % |
	| Throughput | 629.9 tok/s | 622.9 tok/s | flat |
	| Mean Sephirot aux | 0.149 | 0.140 | −6 % |
	| LR schedule | constant 2e-5 | cosine 2e-5 → 2e-7 | new |
	| NaN events | 0 | 0 | = |
	| Resume base | random init (Qwen) | v6.1 final | new |
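
The Δ column can be reproduced from the two value columns. The snippet below is a minimal sketch of that arithmetic; `pct_change` and the dictionary keys are illustrative names, and the numbers are copied from the table.

```python
def pct_change(old: float, new: float) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


# (v6.1, v6.2) pairs copied from the table above.
metrics = {
    "mean CE loss": (10.18, 8.43),
    "throughput tok/s": (629.9, 622.9),
    "mean Sephirot aux": (0.149, 0.140),
}

deltas = {name: pct_change(old, new) for name, (old, new) in metrics.items()}
for name, delta in deltas.items():
    print(f"{name}: {delta:+.1f} %")
```

Both means are run-wide averages, and the v6.2 run starts from v6.1's final state, so the −17 % compares two run averages rather than two final losses.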

	### Loss trajectory

	```
	step 1 loss=13.00 avg=13.00 (v6.1 final state)
	step 100 loss=12.00 avg=11.78
	step 1000 loss= 7.75 avg= 8.82 ← LR still high, big descent through v6.1's plateau
	step 5000 loss= 7.25 avg= 7.71
	step 10000 loss= 6.69 avg= 7.41 ← minimum running average
	step 15000 loss= 9.56 avg= 7.51 ← cosine kicks in, per-step variance ↑, drift ↓
	step 20000 loss= 8.94 avg= 7.92
	step 25000 loss= 8.75 avg= 8.22
	step 29999 loss= 9.31 avg= 8.43
	```

	The reported mean (8.43) is the run-wide average. The lowest observed
	running average (7.41 at step 10K) is the fine-tune minimum of that
	average. It rises through the back half as the cosine schedule shrinks
	the step size towards 2e-7 and per-step losses get noisier (9.56 at
	step 15K against 6.69 at step 10K). We read this as the expected shape
	of a converged cosine fine-tune.
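
To see what the running average implies for each stretch of the run, the per-segment mean can be backed out of consecutive `avg` values. This sketch assumes `avg` at step n is the plain mean of the losses over steps 1..n (consistent with the final `avg` matching the run-wide mean); `segment_means` is an illustrative helper, and the results are approximate because the logged averages are rounded to two decimals.

```python
# (step, cumulative average) pairs copied from the trajectory above.
checkpoints = [
    (1, 13.00), (100, 11.78), (1000, 8.82), (5000, 7.71), (10000, 7.41),
    (15000, 7.51), (20000, 7.92), (25000, 8.22), (29999, 8.43),
]


def segment_means(points):
    """Mean loss between consecutive checkpoints, assuming avg is the mean of steps 1..n."""
    out = []
    for (n0, a0), (n1, a1) in zip(points, points[1:]):
        total = n1 * a1 - n0 * a0  # sum of losses over steps n0+1 .. n1
        out.append((n0 + 1, n1, total / (n1 - n0)))
    return out


segments = segment_means(checkpoints)
for start, end, mean in segments:
    print(f"steps {start:>5}-{end:<5} mean loss ≈ {mean:.2f}")
```

Under that assumption the 5,001 to 10,000 segment averages about 7.11 and the final segment about 9.48, so it is the last three segments, each above the run-wide mean, that pull the cumulative average from 7.41 up to 8.43.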

	## What changed vs v6.1

	1. Cosine LR decay. The constant 2e-5 LR in v6.1 caused a plateau
	from step ~10K onward: the optimiser kept bouncing around the
	loss minimum it could reach at that step size. Decaying to 2e-7
	lets later steps take much smaller updates and refine past the
	plateau.

	2. Resume from v6.1 rather than fresh init. The model starts at
	v6.1's final state and refines from there.

	3. Otherwise identical to v6.1: same architecture, same corpus,
	same context, same NSA config, same Sephirot aux. The only
	training hyperparameter changed is the LR schedule.
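
For reference, a cosine decay between the two learning rates quoted above can be sketched as follows. This uses the standard half-cosine form with no warmup, which is an assumption: the actual schedule lives in the Rust trainer (commit `186b2622`) and may differ in details. `cosine_lr` and the constant names are illustrative.

```python
import math

LR_MAX = 2e-5         # starting LR, from the card
LR_MIN = 2e-7         # final LR, from the card
TOTAL_STEPS = 30_000  # fine-tune steps in this run


def cosine_lr(step: int, total_steps: int = TOTAL_STEPS,
              lr_max: float = LR_MAX, lr_min: float = LR_MIN) -> float:
    """Half-cosine decay from lr_max at step 0 to lr_min at total_steps."""
    # Clamp so out-of-range steps stay at the endpoints.
    progress = min(max(step, 0), total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


for step in (0, 5_000, 10_000, 15_000, 20_000, 25_000, 30_000):
    print(f"step {step:>6}: lr = {cosine_lr(step):.3e}")
```

With this form the LR is about 1.01e-5 at the midpoint (step 15,000) and only reaches 2e-7 at the very end, so most of the reduction in step size happens in the second half of the run.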

	## How to use

	### Native runtime (recommended) — Rust `aether-mind`

	Set `AETHER_V6_CHECKPOINT` to the local path of `model.safetensors`,
	then restart `qbc-aether-mind.service`.

	### Python

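	```python
	from safetensors.torch import load_file

	# Load all tensors from the checkpoint as a dict of name -> torch.Tensor.
	weights = load_file("model.safetensors")
	# Sum element counts over every tensor to get the total parameter count.
	print("params:", sum(t.numel() for t in weights.values()))
	```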

	Same architecture as v6.1, so any custom loader/wrapper for v6.1
	works here.

	## Evaluation

	lm-evaluation-harness numbers will follow once the eval binary
	ships. Until then, the training-loss curve and sample generations
	are the primary signal.

	## Open items for v6.3

	- Per-chunk backward for distillation at ctx ≥ 256, so we can
	add KL teacher signal back without OOMing.
	- Long-context curriculum (1K, 4K, 16K → 1M) per the V6 master
	spec.
	- lm-evaluation-harness pass (MMLU / ARC / HellaSwag /
	TruthfulQA) for honest published numbers.

	## License + citation

	Apache-2.0 (matches the base model license).

	```bibtex
	@misc{aether_mind_v62_2026,
	title = {Aether Mind v6.2 --- cosine-decay fine-tune of v6.1},
	author = {{BlockArtica} and {QuantumAI-Blockchain}},
	year = {2026},
	url = {https://huggingface.co/QuantumAI-Blockchain/aether-mind-v6.2},
	}
	```

	## Links

	- Aether Mind v6.1 — [https://huggingface.co/QuantumAI-Blockchain/aether-mind-v6.1](https://huggingface.co/QuantumAI-Blockchain/aether-mind-v6.1)
	- Aether Mind v6.0 — [https://huggingface.co/QuantumAI-Blockchain/aether-mind-v6.0](https://huggingface.co/QuantumAI-Blockchain/aether-mind-v6.0)
	- Aether v5.2-lora — [https://huggingface.co/QuantumAI-Blockchain/aether-v5.2-lora](https://huggingface.co/QuantumAI-Blockchain/aether-v5.2-lora)
	- QuantumAI Blockchain — [qbc.network](https://qbc.network)
	- GitHub — [github.com/QuantumAI-Blockchain](https://github.com/QuantumAI-Blockchain)