	---
	base_model: Qwen/Qwen2.5-0.5B-Instruct
	library_name: safetensors
	license: apache-2.0
	tags:
	- qubitcoin
	- aether
	- blockchain
	- quantum
	- native-rust
	- candle
	- long-context
	language:
	- en
	pipeline_tag: text-generation
	---

	# Aether Mind v6.1 — long-context after the NaN fix

	V6.1 is the third public Aether release and the first trained
	on a longer context window (256 tokens). It supersedes
	[aether-mind-v6.0](https://huggingface.co/QuantumAI-Blockchain/aether-mind-v6.0),
	which was published with a forced `ctx=64` workaround because of a
	forward-pass numerical instability in the NSA compressed branch
	(`v6/attention.rs::compressed_branch`).

	That instability is now diagnosed and fixed. **The compressed-branch
	attention's causal mask was producing all-`-inf` rows for query
	positions before the first 64-token block completed, driving softmax
	to `0/0 = NaN`.** The fix tracks per-row validity, unmasks a single
	block on otherwise fully masked rows to keep softmax finite, and
	multiplies the branch output by a row-validity mask so those rows
	contribute zero attention (their proper behaviour). The source and
	verification log are in
	[`docs/ops/v6-training-nan-bug.md`](https://github.com/QuantumAI-Blockchain/qubitcoin-aether/blob/presale/v1/docs/ops/v6-training-nan-bug.md);
	the fix landed in commit
	[`7f9189f8`](https://github.com/QuantumAI-Blockchain/qubitcoin-aether/commit/7f9189f8).
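The failure and the fix can be sketched in a few lines of plain Python. This is a minimal illustration, not the code in `v6/attention.rs`: the function names, the scalar per-block values, and the exact visibility rule (a block becomes visible once its last token is at or before the query position) are assumptions made for the example.

```python
import math

NEG_INF = float("-inf")

def softmax(scores):
    """Softmax without max-subtraction; an all -inf row gives 0/0."""
    exps = [math.exp(s) for s in scores]          # exp(-inf) == 0.0
    total = sum(exps)
    # IEEE 0/0 is NaN; Python raises ZeroDivisionError instead, so emulate it.
    return [e / total if total > 0.0 else math.nan for e in exps]

def compressed_branch(scores, values, block, safe=True):
    """scores[q][b]: query q vs compressed block b; values[b]: scalar block value."""
    out = []
    for q, row in enumerate(scores):
        # Assumed causal rule: block b is visible once it has fully completed.
        visible = [(b + 1) * block - 1 <= q for b in range(len(row))]
        valid = any(visible)                      # per-row validity
        if safe and not valid:
            visible[0] = True                     # unmask one block: softmax stays finite
        masked = [s if v else NEG_INF for s, v in zip(row, visible)]
        probs = softmax(masked)
        y = sum(p * v for p, v in zip(probs, values))
        if safe:
            y *= 1.0 if valid else 0.0            # invalid rows contribute zero
        out.append(y)
    return out

# Toy setup: 8 query positions, 2 compressed blocks of 4 tokens each.
scores = [[0.0, 0.0] for _ in range(8)]
values = [1.0, 3.0]
naive = compressed_branch(scores, values, block=4, safe=False)
fixed = compressed_branch(scores, values, block=4, safe=True)
print(naive)
print(fixed)
```

In this toy setup the first three query positions see no completed block, so the naive path yields NaN there while the safe path yields exactly zero; all later positions agree between the two paths.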

	V6.1 was trained at 4× the v6.0 context (256 vs 64 tokens) on
	the same 36,860-row Aether curated corpus, on the same RTX 3080 Ti,
	in the same wall-clock envelope (~44 min vs v6.0's 50 min; slightly
	faster because there is no Qwen teacher forward pass).

	## What you're getting

	| Field | Value |
	|---|---|
	| Base model | `Qwen/Qwen2.5-0.5B-Instruct` (initialised from, then CE-trained) |
	| Architecture | V6 transformer: 24 layers, 896 hidden, 14 attention heads (10 Sephirot + 2 generalist + 2 sink), head_dim=64 |
	| Trainable params | ~558 M (all weights, no LoRA) |
	| Training mode | Pure cross-entropy (no distillation in this release; see notes below) |
	| Training context | 256 tokens (4× the v6.0 release) |
	| Precision | BF16 weights, F32 KL/CE math internally for numerical stability |
	| NSA config | compression_block=64, top_k=2048, sliding_window=512, sink_tokens=4 |
	| Vocab | 151,936 (Qwen2.5 tokenizer, untouched) |
	| Max position | 32,768 (RoPE theta = 1e6) |
	| Checkpoint published | step 30,000 (full Phase-1 run) |
	| File | `model.safetensors` (1.32 GB, BF16) |
	| License | Apache-2.0 (matches base) |
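The architecture and NSA rows of the table are mutually consistent, which a short check makes explicit. The dataclass below is a hypothetical stand-in for the real config (its name and field names are assumptions, not the Rust loader's types); it only restates values from the table and derives how many complete 64-token compressed blocks exist at a given context length.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class V6Config:  # hypothetical name; mirrors the table, not the Rust struct
    hidden: int = 896
    layers: int = 24
    head_dim: int = 64
    sephirot_heads: int = 10
    generalist_heads: int = 2
    sink_heads: int = 2
    compression_block: int = 64
    sliding_window: int = 512
    top_k: int = 2048
    sink_tokens: int = 4

    @property
    def num_heads(self) -> int:
        return self.sephirot_heads + self.generalist_heads + self.sink_heads

    def complete_blocks(self, ctx: int) -> int:
        """Number of fully completed compressed blocks in a ctx-token sequence."""
        return ctx // self.compression_block

cfg = V6Config()
assert cfg.num_heads * cfg.head_dim == cfg.hidden   # 14 * 64 == 896
print(cfg.num_heads, cfg.complete_blocks(64), cfg.complete_blocks(256))
```

Assuming the sliding window is measured in tokens, the 512-token window already spans every position at the 256-token training context, and the compressed branch sees at most four complete blocks.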

	## Training run

	| Metric | Value | Δ vs v6.0 |
	|---|---|---|
	| Steps | 30,000 | = |
	| Wall-clock | 44.4 min | −10 % |
	| Tokens scored | 1,676,479 | +0.3 % (4× context lets more rows fit) |
	| Throughput | 629.9 tokens/sec | +12 % |
	| Mean CE loss | 10.18 nats/token | better (v6.0 was 10.35 mean CE under the KL blend) |
	| Mean Sephirot aux | 0.149 | = |
	| Max tokens processed | 167 | (v6.0 truncated to 64) |
	| NaN events | 0 | (v6.0 also 0 thanks to the ctx=64 workaround) |
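For scale, the headline figures can be cross-checked with a few lines of arithmetic. The inputs below are copied from the table; the uniform-guess baseline and the perplexity conversion are standard identities added here for context, not numbers reported by the training run.

```python
import math

tokens_scored = 1_676_479
wall_clock_min = 44.4
steps = 30_000
mean_ce_nats = 10.18
vocab = 151_936

throughput = tokens_scored / (wall_clock_min * 60)   # tokens per second
tokens_per_step = tokens_scored / steps              # average scored tokens per step
perplexity = math.exp(mean_ce_nats)                  # CE in nats -> perplexity
uniform_ce = math.log(vocab)                         # CE of a uniform guess over the vocab

print(f"throughput      ~ {throughput:.1f} tokens/sec")
print(f"tokens per step ~ {tokens_per_step:.1f}")
print(f"perplexity      ~ {perplexity:,.0f}")
print(f"uniform CE      ~ {uniform_ce:.2f} nats/token")
```

The recomputed throughput comes out slightly below the table's 629.9 tokens/sec, presumably because the wall-clock value is rounded to one decimal.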

	### Loss trajectory

	```
	step 1 loss=15.75 avg=15.75 (random init)
	step 100 loss=15.94 avg=16.32 warm-up
	step 1000 loss=11.63 avg=13.20 ← CE/lm-head learning the vocab
	step 5000 loss=10.00 avg=11.01
	step 10000 loss= 9.13 avg=10.07 ← representational floor (higher than v6.0's 7.68 at this step, but apples-to-oranges: v6.0 was loss-blended with KL teacher signal)
	step 15000 loss=11.13 avg= 9.87
	step 20000 loss=10.25 avg=10.02
	step 25000 loss= 9.75 avg=10.15
	step 29999 loss= 9.81 avg=10.18
	```

	The interesting fact: at step 122 (the row where v6.0 first NaN'd,
	tokens=167), v6.1 reports a finite loss in the 9–16 range and continues
	training. **This run is the empirical evidence that the
	compressed-branch fix works.**

	## Architecture (unchanged from v6.0)

	V6 is not a vanilla Qwen2.5 fine-tune. The attention layer
	implements a 14-head split designed for on-chain cognitive routing:

	- 10 Sephirot heads — one per cognitive domain (Keter → Malkuth).
	Each head's attention pattern is what the on-chain
	`pallet_qbc_aether_anchor` records as the per-cycle attestation root.
	- 2 generalist heads — un-gated, full-context attention. Used
	for the "global workspace" path in `aether-mind`.
	- 2 sink heads — anchor-token attention (first 4 tokens) for
	stable long-context performance.

	The NSA compressed branch (the one that NaN'd) now correctly handles
	the early-query case via row-validity masking.

	## How to use

	### Native runtime (recommended) — Rust `aether-mind`

	Set `AETHER_V6_CHECKPOINT` to the local path of `model.safetensors`,
	then restart `qbc-aether-mind.service`. The Rust binary loads the
	checkpoint via candle.

	### Python

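	```python
	# Requires the `safetensors` and `torch` packages.
	from safetensors.torch import load_file
	# Returns a dict of tensor name -> torch.Tensor (315 BF16 tensors).
	weights = load_file("model.safetensors")
	print("params:", sum(t.numel() for t in weights.values()))
	```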
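To inspect the checkpoint without loading 1.32 GB of weights or installing PyTorch, the safetensors header can be read directly. The sketch below relies on the published safetensors layout: an 8-byte little-endian header length, followed by a JSON header that maps each tensor name to its `dtype`, `shape`, and `data_offsets`. The helper names are illustrative, not part of any library.

```python
import json
import struct
from math import prod

def parse_header(raw: bytes) -> dict:
    """Decode a safetensors header from the leading bytes of a file."""
    (n,) = struct.unpack("<Q", raw[:8])       # u64 little-endian header size
    header = json.loads(raw[8:8 + n])
    header.pop("__metadata__", None)          # optional free-form metadata entry
    return header

def read_header(path: str) -> dict:
    """Read only the header bytes from disk, never the tensor data."""
    with open(path, "rb") as f:
        size_bytes = f.read(8)
        (n,) = struct.unpack("<Q", size_bytes)
        return parse_header(size_bytes + f.read(n))

def summarize(header: dict) -> tuple:
    """Return (tensor count per dtype, total parameter count)."""
    counts, total = {}, 0
    for info in header.values():
        counts[info["dtype"]] = counts.get(info["dtype"], 0) + 1
        total += prod(info["shape"])          # prod([]) == 1 for scalars
    return counts, total
```

Calling `summarize(read_header("model.safetensors"))` on the published file should report the tensor count per dtype and the total parameter count, which can be compared against the figures quoted above.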

	There is no upstream 🤗 transformers loader for the V6 14-head
	split + Sephirot routing. Production use goes through the Rust
	binary in
	[`qubitcoin-aether`](https://github.com/QuantumAI-Blockchain/qubitcoin-aether).

	## Evaluation

	Not yet run. An lm-evaluation-harness pass on MMLU, ARC, HellaSwag,
	and TruthfulQA is planned for the next session. We will back-fill the
	numbers here, with a comparison against v5.2-lora and v6.0, when they land.

	## Notes vs v6.0

	- No KL distillation in this release. The full distillation
	path (KL teacher signal + CE + Sephirot aux) hits a CUDA OOM at
	the new ctx=256 because the F32-stable KL log-softmax over the
	151K-vocab logits allocates ~600 MB of intermediates per step that
	are not freed fast enough. Memory optimisation (in-place softmax, KL
	chunking by vocab-tile) is the v6.2 work. v6.1 is CE-only over
	the 4× longer context: a different bet that prioritises context
	reach over teacher matching.
	- All 30K steps used the new attention path. The NaN-safe
	compressed branch runs by default; no env var or config to enable
	it.
	- **Same architecture, weights file format, tokenizer, and config
	shape as v6.0.** The Rust binary loads v6.0 and v6.1 from the same
	loader.
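The ~600 MB figure in the first note is in line with a back-of-the-envelope estimate. The sketch below assumes one row per step (batch size 1) and full `[ctx, vocab]` F32 intermediates; both are assumptions, and which tensors the training binary actually keeps alive is not stated here.

```python
vocab = 151_936
bytes_f32 = 4

def f32_logit_bytes(ctx: int) -> int:
    """Size of one [ctx, vocab] F32 tensor, assuming batch size 1."""
    return ctx * vocab * bytes_f32

at_256 = f32_logit_bytes(256)
at_64 = f32_logit_bytes(64)
tensors_in_600mb = 600e6 / at_256

print(f"ctx=256: {at_256 / 1e6:.1f} MB per tensor")
print(f"ctx=64:  {at_64 / 1e6:.1f} MB per tensor")
print(f"~600 MB corresponds to about {tensors_in_600mb:.1f} such tensors")
```

Under these assumptions, roughly four full-vocab F32 tensors (for example student and teacher logits plus their log-softmaxes) would account for the quoted allocation, and the same set would be a quarter of the size at the v6.0 context of 64.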

	## Open items for v6.2

	- Restore KL+CE distillation at ctx ≥ 256 by chunking the
	151K-vocab log-softmax (compute it per vocab chunk of 512 tokens so
	peak memory stays bounded).
	- Long-context curriculum (16K → 64K → 128K → 1M) per the V6
	master spec, now that the forward-pass NaN is gone.
	- lm-evaluation-harness pass for honest numbers.
	- HumanEval / coding evals if we add a coding-domain corpus
	chunk.
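One way to read the first open item is a two-pass scheme: a streaming log-sum-exp over vocab tiles, followed by a second pass that accumulates the KL term tile by tile, so a full-vocab log-softmax is never materialised. The plain-Python sketch below shows the arithmetic for a single position. The function names, the KL direction (teacher ‖ student), and the 512-entry tile are assumptions, and a candle implementation would operate on tensors rather than lists.

```python
import math

def streaming_logsumexp(logits, chunk):
    """log(sum(exp(logits))) computed one vocab tile at a time (finite logits)."""
    running_max, running_sum = float("-inf"), 0.0
    for i in range(0, len(logits), chunk):
        tile = logits[i:i + chunk]
        new_max = max(running_max, max(tile))
        # Rescale the partial sum to the new max before adding this tile.
        running_sum = running_sum * math.exp(running_max - new_max)
        running_sum += sum(math.exp(x - new_max) for x in tile)
        running_max = new_max
    return running_max + math.log(running_sum)

def chunked_kl(teacher, student, chunk=512):
    """KL(teacher || student) for one position, accumulated per vocab tile."""
    lse_t = streaming_logsumexp(teacher, chunk)
    lse_s = streaming_logsumexp(student, chunk)
    kl = 0.0
    for i in range(0, len(teacher), chunk):
        for t, s in zip(teacher[i:i + chunk], student[i:i + chunk]):
            log_p = t - lse_t                  # teacher log-prob
            log_q = s - lse_s                  # student log-prob
            kl += math.exp(log_p) * (log_p - log_q)
    return kl
```

In list form the chunking saves no memory, since the full logit lists already exist; the point is that each tile only needs the two scalar normalisers, which is what would let a tensor implementation bound its peak intermediate size by the tile width.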

	## License + citation

	Apache-2.0 (matches the base model license).

	```bibtex
	@misc{aether_mind_v61_2026,
	title = {Aether Mind v6.1 --- long-context after the compressed-branch NaN fix},
	author = {{BlockArtica} and {QuantumAI-Blockchain}},
	year = {2026},
	url = {https://huggingface.co/QuantumAI-Blockchain/aether-mind-v6.1},
	}
	```

	## Links

	- QuantumAI Blockchain: [qbc.network](https://qbc.network)
	- GitHub org: [github.com/QuantumAI-Blockchain](https://github.com/QuantumAI-Blockchain)
	- Aether (Rust): [qubitcoin-aether](https://github.com/QuantumAI-Blockchain/qubitcoin-aether)
	- Prior releases:
	  - [aether-mind-v6.0](https://huggingface.co/QuantumAI-Blockchain/aether-mind-v6.0) (ctx=64, distilled)
	  - [aether-v5.2-lora](https://huggingface.co/QuantumAI-Blockchain/aether-v5.2-lora) (7B LoRA)
	- X / Twitter: [@qu_bitcoin](https://x.com/qu_bitcoin)
	- Contact: info@qbc.network

	### Framework versions

	- candle 0.10 + CUDA 12.6
	- Rust `aether-v6-train` binary @ commit
	[`7f9189f8`](https://github.com/QuantumAI-Blockchain/qubitcoin-aether/commit/7f9189f8)
	- Qwen2.5 tokenizer (vocab 151,936)