Aether Mind v6.1: long-context after the NaN fix

V6.1 is the third public Aether release and the first to train on a meaningfully long context window. It supersedes aether-mind-v6.0, which was published with a forced ctx=64 workaround because of a forward-pass numerical instability in the NSA compressed branch (v6/attention.rs::compressed_branch).

That instability is now diagnosed and fixed. The causal mask in the compressed-branch attention produced rows consisting entirely of -inf for query positions before the first 64-token block completed, which drove softmax to 0/0 = NaN. The fix tracks per-row validity, unmasks a single block on otherwise fully masked rows to keep softmax finite, and multiplies the branch output by a row-validity mask so that those rows contribute zero attention (the intended behaviour). The source and verification log are in docs/ops/v6-training-nan-bug.md; the fix landed in commit 7f9189f8.
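The failure mode and the fix can be illustrated with a minimal pure-Python sketch. It is not the Rust code in v6/attention.rs: the function names, the scalar per-block values, and the exact boundary used to decide when a block counts as complete are assumptions made for illustration.

```python
import math

NEG_INF = float("-inf")

def block_causal_mask(seq_len, block):
    """mask[t][j] is True once compressed block j is complete at query position t.
    Assumed boundary: the last token of block j sits at index (j + 1) * block - 1."""
    n_blocks = seq_len // block
    return [[(j + 1) * block - 1 <= t for j in range(n_blocks)]
            for t in range(seq_len)]

def softmax(row):
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]  # all -inf row: -inf - -inf = nan
    total = sum(exps)
    return [e / total for e in exps]

def safe_compressed_attention(scores, values, mask):
    """scores[t][j]: query t vs block j; values[j]: one scalar per block."""
    outputs = []
    for row_scores, row_mask in zip(scores, mask):
        valid = any(row_mask)  # per-row validity
        if not valid:
            # Unmask a single block so the softmax denominator is non-zero.
            row_mask = [True] + [False] * (len(row_mask) - 1)
        masked = [s if m else NEG_INF for s, m in zip(row_scores, row_mask)]
        probs = softmax(masked)
        out = sum(p * v for p, v in zip(probs, values))
        # Multiply by the validity mask: rows with no complete block contribute zero.
        outputs.append(out * (1.0 if valid else 0.0))
    return outputs

# Toy sizes: 8 query positions, block size 4, so 2 compressed blocks.
mask = block_causal_mask(seq_len=8, block=4)
scores = [[0.5, 1.5] for _ in range(8)]
values = [2.0, 4.0]

naive_row0 = softmax([NEG_INF, NEG_INF])  # what an unguarded softmax does
outputs = safe_compressed_attention(scores, values, mask)
print("naive row 0:", naive_row0)
print("safe outputs:", outputs)
```

In this toy setup, query positions 0 to 2 have no complete block, so the unguarded softmax yields NaN for them while the guarded version returns exactly zero.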

V6.1 was trained at 4× the v6.0 context (256 vs 64 tokens) on the same 36,860-row Aether curated corpus, on the same RTX 3080 Ti, in the same wall-clock envelope (~44 min vs v6.0's 50 min; slightly faster because there is no Qwen teacher forward pass).

What you're getting

Field                 Value
Base model            Qwen/Qwen2.5-0.5B-Instruct (initialised from, then CE-trained)
Architecture          V6 transformer: 24 layers, 896 hidden, 14 attention heads (10 Sephirot + 2 generalist + 2 sink), head_dim=64
Trainable params      ~558 M (all weights, no LoRA)
Training mode         Pure cross-entropy (no distillation in this release; see notes below)
Training context      256 tokens (4× the v6.0 release)
Precision             BF16 weights, F32 KL/CE math internally for numerical stability
NSA config            compression_block=64, top_k=2048, sliding_window=512, sink_tokens=4
Vocab                 151,936 (Qwen2.5 tokenizer, untouched)
Max position          32,768 (RoPE theta = 1e6)
Checkpoint published  step 30,000 (full Phase-1 run)
File                  model.safetensors (1.32 GB, BF16)
License               Apache-2.0 (matches base)
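The figures in the table are internally consistent, which a short check can confirm. This is only a sketch: the V6Config class and its field names are hypothetical and do not mirror the actual config schema used by the Rust binary.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class V6Config:
    """Hypothetical container for a few values copied from the table above."""
    hidden: int = 896
    sephirot_heads: int = 10
    generalist_heads: int = 2
    sink_heads: int = 2
    head_dim: int = 64
    compression_block: int = 64
    train_ctx: int = 256

    @property
    def n_heads(self) -> int:
        return self.sephirot_heads + self.generalist_heads + self.sink_heads

    def complete_blocks(self, ctx: int) -> int:
        """Number of full compression blocks that fit in a context of `ctx` tokens."""
        return ctx // self.compression_block

cfg = V6Config()
# 14 heads of width 64 account for the full 896-wide hidden state.
assert cfg.n_heads * cfg.head_dim == cfg.hidden

blocks_v61 = cfg.complete_blocks(cfg.train_ctx)  # training context of this release
blocks_v60 = cfg.complete_blocks(64)             # the v6.0 ctx=64 workaround
print(cfg.n_heads, blocks_v61, blocks_v60)
```

At ctx=256 a full-length row spans four complete compression blocks, compared with a single block under the v6.0 workaround.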

Training run

Metric                Value              Δ vs v6.0
Steps                 30,000             =
Wall-clock            44.4 min           −10 %
Tokens scored         1,676,479          +0.3 % (4× context lets more rows fit)
Throughput            629.9 tokens/sec   +12 %
Mean CE loss          10.18 nats/token   better (v6.0 was 10.35 mean CE under the KL blend)
Mean Sephirot aux     0.149              =
Max tokens processed  167                (v6.0 truncated to 64)
NaN events            0                  (v6.0 also 0 thanks to the ctx=64 workaround)
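As a quick cross-check, the throughput figure follows from the token count and wall-clock time in the table. The sketch below uses only those published numbers; the small gap to the reported 629.9 tokens/sec is consistent with the wall-clock time being rounded to 44.4 min.

```python
tokens_scored = 1_676_479   # from the table
steps = 30_000              # from the table
wall_clock_min = 44.4       # from the table (rounded)

tokens_per_sec = tokens_scored / (wall_clock_min * 60)
tokens_per_step = tokens_scored / steps

print(f"{tokens_per_sec:.1f} tokens/sec")    # 629.3
print(f"{tokens_per_step:.1f} tokens/step")  # 55.9
```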

Loss trajectory

step      1  loss=15.75  avg=15.75   (random init)
step    100  loss=15.94  avg=16.32   warm-up
step   1000  loss=11.63  avg=13.20   ← CE/lm-head learning the vocab
step   5000  loss=10.00  avg=11.01
step  10000  loss= 9.13  avg=10.07   ← representational floor (higher than v6.0's 7.68 at this step, but apples-to-oranges: v6.0 was loss-blended with KL teacher signal)
step  15000  loss=11.13  avg= 9.87
step  20000  loss=10.25  avg=10.02
step  25000  loss= 9.75  avg=10.15
step  29999  loss= 9.81  avg=10.18

The interesting fact: at step 122 (the row where v6.0 first NaN'd, tokens=167), v6.1 reads a real loss in the 9-16 range and continues training. This release is the empirical confirmation that the compressed-branch fix works.

Architecture (unchanged from v6.0)

V6 is not a vanilla Qwen2.5 fine-tune. The attention layer implements a 14-head split designed for on-chain cognitive routing:

  • 10 Sephirot heads: one per cognitive domain (Keter → Malkuth). Each head's attention pattern is what the on-chain pallet_qbc_aether_anchor records as the per-cycle attestation root.
  • 2 generalist heads: ungated, full-context attention. Used for the "global workspace" path in aether-mind.
  • 2 sink heads: anchor-token attention (first 4 tokens) for stable long-context performance.

The NSA compressed branch (the one that NaN'd) now correctly handles the early-query case via row-validity masking.
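The 14-head split described above can be sketched as a mapping from head index to role. The ordering of the heads (Sephirot first, then generalist, then sink) and both function names are assumptions for illustration; the model card does not state how heads are laid out in the weights.

```python
def head_role(head_idx, n_sephirot=10, n_generalist=2, n_sink=2):
    """Map a head index to its role, assuming the heads are laid out contiguously."""
    n_heads = n_sephirot + n_generalist + n_sink
    if not 0 <= head_idx < n_heads:
        raise IndexError(f"head index {head_idx} outside 0..{n_heads - 1}")
    if head_idx < n_sephirot:
        return "sephirot"
    if head_idx < n_sephirot + n_generalist:
        return "generalist"
    return "sink"

def sink_visible_keys(query_pos, sink_tokens=4):
    """Key positions a sink head may attend to: the first `sink_tokens`
    tokens, restricted to positions at or before the query (causal)."""
    return list(range(min(sink_tokens, query_pos + 1)))

roles = [head_role(i) for i in range(14)]
print(roles)
print(sink_visible_keys(1), sink_visible_keys(200))
```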

How to use

Native runtime (recommended): Rust aether-mind

Set AETHER_V6_CHECKPOINT to the local path of model.safetensors, then restart qbc-aether-mind.service. The Rust binary loads the checkpoint via candle.

Python

from safetensors.torch import load_file
weights = load_file("model.safetensors")  # 315 BF16 tensors
print("params:", sum(t.numel() for t in weights.values()))
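If PyTorch is not installed, the checkpoint can still be inspected without loading any tensor data, because a safetensors file starts with an 8-byte little-endian header length followed by a JSON header. The sketch below relies only on that file layout; the helper names are illustrative and not part of any library.

```python
import json
import struct

def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file without reading tensor data."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # little-endian u64
        header = json.loads(f.read(header_len))
    header.pop("__metadata__", None)  # optional metadata entry, not a tensor
    return header

def summarize(header):
    """Count total elements and tensors per dtype from the header alone."""
    total_params = 0
    per_dtype = {}
    for info in header.values():
        count = 1
        for dim in info["shape"]:
            count *= dim
        total_params += count
        per_dtype[info["dtype"]] = per_dtype.get(info["dtype"], 0) + 1
    return total_params, per_dtype

# Usage, assuming model.safetensors is in the working directory:
#   total, dtypes = summarize(read_safetensors_header("model.safetensors"))
#   print(total, dtypes)
```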

There is no upstream 🤗 transformers loader for the V6 14-head split + Sephirot routing. Production use goes through the Rust binary in qubitcoin-aether.

Evaluation

Not yet run. Running lm-evaluation-harness on MMLU / ARC / HellaSwag / TruthfulQA is the next session's work. We will back-fill the numbers and a comparison against v5.2-lora and v6.0 here when they land.

Notes vs v6.0

  • No KL distillation in this release. The full distillation path (KL teacher signal + CE + Sephirot aux) hits a CUDA OOM at the new ctx=256 because the F32-stable KL log-softmax over the 151K-vocab tensor allocates ~600 MB of intermediates per step that are not freed fast enough. Memory optimisation (in-place softmax, KL chunking by vocab tile) is the v6.2 work. v6.1 is CE-only over the 4× longer context: a different bet that prioritises context reach over teacher matching.
  • All 30K steps used the new attention path. The NaN-safe compressed branch runs by default; no env var or config to enable it.
  • Same architecture, weights file format, tokenizer, and config shape as v6.0. The Rust binary loads v6.0 and v6.1 from the same loader.
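The ~600 MB figure in the first note is roughly what a handful of full-size F32 tensors over the vocabulary would occupy. The arithmetic below assumes one 256-token sequence per step and counts only tensors of shape [ctx, vocab]; which intermediates the training binary actually holds is not stated here, so treat this as an order-of-magnitude check.

```python
ctx = 256            # training context of this release
vocab = 151_936      # Qwen2.5 tokenizer vocabulary
bytes_per_f32 = 4

# One F32 tensor of shape [ctx, vocab] (assumed: a single sequence per step).
one_tensor_bytes = ctx * vocab * bytes_per_f32
tensors_in_600mb = 600e6 / one_tensor_bytes

# The same tensor at the v6.0 context of 64 tokens, for comparison.
one_tensor_bytes_ctx64 = 64 * vocab * bytes_per_f32

print(f"{one_tensor_bytes / 1e6:.1f} MB per tensor at ctx=256")
print(f"{one_tensor_bytes_ctx64 / 1e6:.1f} MB per tensor at ctx=64")
print(f"~{tensors_in_600mb:.1f} such tensors in 600 MB")
```

Under these assumptions, 600 MB corresponds to about four live [ctx, vocab] F32 tensors, each four times larger than its ctx=64 counterpart.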

Open items for v6.2

  • Restore KL+CE distillation at ctx ≥ 256 by chunking the 151K-vocab log-softmax (computing it in 512-entry vocab chunks so peak memory stays bounded).
  • Long-context curriculum (16K → 64K → 128K → 1M) per the V6 master spec, now that the forward-pass NaN is gone.
  • lm-evaluation-harness pass for honest numbers.
  • HumanEval / coding evals if we add a coding-domain corpus chunk.
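The vocab-chunked KL in the first item could look roughly like the following, shown in pure Python for a single token position. It is a sketch of the general technique (streaming log-sum-exp over vocab tiles, then a second tiled pass for the KL sum), not the planned v6.2 implementation; the KL direction (teacher relative to student) and the function names are assumptions, and a real version would process each tile as a tensor operation.

```python
import math

def chunked_logsumexp(logits, chunk):
    """Streaming log-sum-exp over vocab tiles of size `chunk`."""
    running_max, running_sum = float("-inf"), 0.0
    for start in range(0, len(logits), chunk):
        tile = logits[start:start + chunk]
        new_max = max(running_max, max(tile))
        # Rescale the accumulated sum to the new reference maximum.
        running_sum = running_sum * math.exp(running_max - new_max)
        running_sum += sum(math.exp(x - new_max) for x in tile)
        running_max = new_max
    return running_max + math.log(running_sum)

def chunked_kl(teacher_logits, student_logits, chunk=512):
    """KL(teacher || student) for one position, accumulated tile by tile.
    Only one tile of log-probabilities is needed at a time."""
    t_lse = chunked_logsumexp(teacher_logits, chunk)
    s_lse = chunked_logsumexp(student_logits, chunk)
    kl = 0.0
    for start in range(0, len(teacher_logits), chunk):
        t_tile = teacher_logits[start:start + chunk]
        s_tile = student_logits[start:start + chunk]
        for t, s in zip(t_tile, s_tile):
            t_logp = t - t_lse
            kl += math.exp(t_logp) * (t_logp - (s - s_lse))
    return kl

# Toy vocabulary of 10 entries, processed in tiles of 3.
teacher = [0.3 * i for i in range(10)]
student = [0.1 * i * i - 1.0 for i in range(10)]
print(chunked_kl(teacher, student, chunk=3))
```

The result matches an unchunked computation up to floating-point error, while the per-step working set scales with the tile size rather than the full vocabulary.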

License + citation

Apache-2.0 (matches the base model license).

@misc{aether_mind_v61_2026,
  title  = {Aether Mind v6.1 --- long-context after the compressed-branch NaN fix},
  author = {{BlockArtica} and {QuantumAI-Blockchain}},
  year   = {2026},
  url    = {https://huggingface.co/QuantumAI-Blockchain/aether-mind-v6.1},
}

Framework versions

  • candle 0.10 + CUDA 12.6
  • Rust aether-v6-train binary @ commit 7f9189f8
  • Qwen2.5 tokenizer (vocab 151,936)