--- base_model: Qwen/Qwen2.5-0.5B-Instruct library_name: safetensors license: apache-2.0 tags: - qubitcoin - aether - blockchain - quantum - native-rust - candle - long-context language: - en pipeline_tag: text-generation --- # Aether Mind v6.1 — long-context after the NaN fix V6.1 is the **third public Aether release** and the first that trains on a meaningfully long context window. It supersedes [aether-mind-v6.0](https://huggingface.co/QuantumAI-Blockchain/aether-mind-v6.0) which was published with a forced `ctx=64` workaround because of a forward-pass numerical instability in the NSA compressed branch (`v6/attention.rs::compressed_branch`). That instability is now diagnosed + fixed. **Compressed-branch attention's causal mask was producing all-`-inf` rows for query positions before the first 64-token block completed, driving softmax to `0/0 = NaN`.** The fix tracks per-row validity, unmasks a single block on otherwise-fully-masked rows to keep softmax finite, and multiplies the branch output by a row-validity mask so those rows contribute zero attention (their proper behaviour). Source + verification log in [`docs/ops/v6-training-nan-bug.md`](https://github.com/QuantumAI-Blockchain/qubitcoin-aether/blob/presale/v1/docs/ops/v6-training-nan-bug.md); the fix landed in commit [`7f9189f8`](https://github.com/QuantumAI-Blockchain/qubitcoin-aether/commit/7f9189f8). V6.1 was trained at **4× the v6.0 context** (256 vs 64 tokens) on the same 36,860-row Aether curated corpus, on the same RTX 3080 Ti, in the same wall-clock envelope (~44 min vs v6.0's 50 min — slightly faster because no Qwen teacher forward). ## What you're getting | Field | Value | |---|---| | Base model | `Qwen/Qwen2.5-0.5B-Instruct` (initialised from, then CE-trained) | | Architecture | V6 transformer: 24 layers, 896 hidden, 14 attention heads (10 Sephirot + 2 generalist + 2 sink), head_dim=64 | | Trainable params | ~558 M (all weights, no LoRA) | | Training mode | **Pure cross-entropy** (no distillation in this release — see notes below) | | Training context | **256 tokens** (4× the v6.0 release) | | Precision | BF16 weights, F32 KL/CE math internally for numerical stability | | NSA config | compression_block=64, top_k=2048, sliding_window=512, sink_tokens=4 | | Vocab | 151,936 (Qwen2.5 tokenizer, untouched) | | Max position | 32,768 (RoPE theta = 1e6) | | Checkpoint published | **step 30,000** (full Phase-1 run) | | File | `model.safetensors` (1.32 GB, BF16) | | License | Apache-2.0 (matches base) | ## Training run | Metric | Value | Δ vs v6.0 | |---|---|---| | Steps | 30,000 | = | | Wall-clock | 44.4 min | −10 % | | Tokens scored | 1,676,479 | +0.3 % (4× context lets more rows fit) | | Throughput | 629.9 tokens/sec | +12 % | | Mean CE loss | **10.18** nats/token | better (v6.0 was 10.35 mean CE under the KL blend) | | Mean Sephirot aux | 0.149 | = | | Max tokens processed | **167** | (v6.0 truncated to 64) | | **NaN events** | **0** | (v6.0 also 0 thanks to the ctx=64 workaround) | ### Loss trajectory ``` step 1 loss=15.75 avg=15.75 (random init) step 100 loss=15.94 avg=16.32 warm-up step 1000 loss=11.63 avg=13.20 ← CE/lm-head learning the vocab step 5000 loss=10.00 avg=11.01 step 10000 loss= 9.13 avg=10.07 ← representational floor (much lower than v6.0's 7.68 at this step — but apples-to-oranges; v6.0 was loss-blended with KL teacher signal) step 15000 loss=11.13 avg= 9.87 step 20000 loss=10.25 avg=10.02 step 25000 loss= 9.75 avg=10.15 step 29999 loss= 9.81 avg=10.18 ``` The interesting fact: at step 122 (the row where v6.0 first NaN'd — tokens=167), v6.1 reads a real loss in the 9-16 range and continues training. **This release is the empirical proof that the compressed-branch fix is the right one.** ## Architecture (unchanged from v6.0) V6 is **not** a vanilla Qwen2.5 fine-tune. The attention layer implements a 14-head split designed for on-chain cognitive routing: - **10 Sephirot heads** — one per cognitive domain (Keter → Malkuth). Each head's attention pattern is what the on-chain `pallet_qbc_aether_anchor` records as the per-cycle attestation root. - **2 generalist heads** — un-gated, full-context attention. Used for the "global workspace" path in `aether-mind`. - **2 sink heads** — anchor-token attention (first 4 tokens) for stable long-context performance. The NSA compressed branch (the one that NaN'd) now correctly handles the early-query case via row-validity masking. ## How to use ### Native runtime (recommended) — Rust `aether-mind` Set `AETHER_V6_CHECKPOINT` to the local path of `model.safetensors`, restart `qbc-aether-mind.service`. The Rust binary loads via candle. ### Python ```python from safetensors.torch import load_file weights = load_file("model.safetensors") # 315 BF16 tensors print("params:", sum(t.numel() for t in weights.values())) ``` There is **no upstream 🤗 transformers loader** for the V6 14-head split + Sephirot routing. Production use goes through the Rust binary in [`qubitcoin-aether`](https://github.com/QuantumAI-Blockchain/qubitcoin-aether). ## Evaluation **Not yet run.** lm-evaluation-harness vs MMLU / ARC / HellaSwag / TruthfulQA is the next session's work. We will back-fill the numbers + comparison vs v5.2-lora + v6.0 here when they land. ## Notes vs v6.0 - **No KL distillation in this release.** The full distillation path (KL teacher signal + CE + Sephirot aux) hits a CUDA OOM at the new ctx=256 because the F32-stable KL log-softmax of the 151K-vocab tensor allocates ~600 MB of intermediates per step that don't free fast enough. Memory optimisation (in-place softmax, KL chunking by vocab-tile) is the v6.2 work. v6.1 is CE-only over the 4× longer context — a different bet that prioritises context reach over teacher matching. - **All 30K steps used the new attention path.** The NaN-safe compressed branch runs by default; no env var or config to enable it. - **Same architecture, weights file format, tokenizer, and config shape as v6.0.** The Rust binary loads v6.0 and v6.1 from the same loader. ## Open items for v6.2 - **Restore KL+CE distillation** at ctx ≥ 256 by chunking the 151K-vocab log-softmax (compute per-512-token vocab-chunk so peak memory stays bounded). - **Long-context curriculum** (16K → 64K → 128K → 1M) per the V6 master spec, now that the forward-pass NaN is gone. - **lm-evaluation-harness pass** for honest numbers. - **HumanEval / coding evals** if we add a coding-domain corpus chunk. ## License + citation Apache-2.0 (matches the base model license). ```bibtex @misc{aether_mind_v61_2026, title = {Aether Mind v6.1 --- long-context after the compressed-branch NaN fix}, author = {{BlockArtica} and {QuantumAI-Blockchain}}, year = {2026}, url = {https://huggingface.co/QuantumAI-Blockchain/aether-mind-v6.1}, } ``` ## Links - **QuantumAI Blockchain:** [qbc.network](https://qbc.network) - **GitHub org:** [github.com/QuantumAI-Blockchain](https://github.com/QuantumAI-Blockchain) - **Aether (Rust):** [qubitcoin-aether](https://github.com/QuantumAI-Blockchain/qubitcoin-aether) - **Prior releases:** - [aether-mind-v6.0](https://huggingface.co/QuantumAI-Blockchain/aether-mind-v6.0) (ctx=64, distilled) - [aether-v5.2-lora](https://huggingface.co/QuantumAI-Blockchain/aether-v5.2-lora) (7B LoRA) - **X / Twitter:** [@qu_bitcoin](https://x.com/qu_bitcoin) - **Contact:** info@qbc.network ### Framework versions - candle 0.10 + CUDA 12.6 - Rust `aether-v6-train` binary @ commit [`7f9189f8`](https://github.com/QuantumAI-Blockchain/qubitcoin-aether/commit/7f9189f8) - Qwen2.5 tokenizer (vocab 151,936)