---
base_model: Qwen/Qwen2.5-0.5B-Instruct
library_name: safetensors
license: apache-2.0
tags:
- qubitcoin
- aether
- blockchain
- quantum
- native-rust
- candle
- long-context
language:
- en
pipeline_tag: text-generation
---
# Aether Mind v6.1: long-context after the NaN fix
V6.1 is the **third public Aether release** and the first that
trains on a meaningfully long context window. It supersedes
[aether-mind-v6.0](https://huggingface.co/QuantumAI-Blockchain/aether-mind-v6.0)
which was published with a forced `ctx=64` workaround because of a
forward-pass numerical instability in the NSA compressed branch
(`v6/attention.rs::compressed_branch`).
That instability is now diagnosed and fixed. **The compressed-branch
attention's causal mask was producing all-`-inf` rows for query
positions before the first 64-token block completed, driving softmax
to `0/0 = NaN`.** The fix tracks per-row validity, unmasks a single
block on otherwise fully masked rows to keep softmax finite, and
multiplies the branch output by a row-validity mask so those rows
contribute zero attention, which is their intended behaviour. The
source and verification log are in
[`docs/ops/v6-training-nan-bug.md`](https://github.com/QuantumAI-Blockchain/qubitcoin-aether/blob/presale/v1/docs/ops/v6-training-nan-bug.md);
the fix landed in commit
[`7f9189f8`](https://github.com/QuantumAI-Blockchain/qubitcoin-aether/commit/7f9189f8).
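
The fix described above can be illustrated for a single query row with a minimal pure-Python sketch. This is not the code in `v6/attention.rs`: the function names (`visible_blocks`, `compressed_branch_row`), the scalar per-block values, and the choice of block 0 as the block to unmask are all assumptions made for illustration.

```python
import math

NEG_INF = float("-inf")

def visible_blocks(query_pos, num_blocks, block_size=64):
    """Causal mask over compressed blocks (illustrative rule): block j is
    visible only once its last token, (j + 1) * block_size - 1, is at or
    before the query position."""
    return [(j + 1) * block_size - 1 <= query_pos for j in range(num_blocks)]

def compressed_branch_row(scores, values, mask):
    """Attention output for one query row with row-validity masking."""
    row_valid = any(mask)
    if not row_valid:
        # Fully masked row: unmask one block (block 0 here, an assumption)
        # so the softmax denominator is non-zero and the result is finite.
        mask = [True] + [False] * (len(mask) - 1)
    masked = [s if m else NEG_INF for s, m in zip(scores, mask)]
    peak = max(masked)                          # finite, since one entry is unmasked
    exps = [math.exp(s - peak) for s in masked] # exp(-inf) evaluates to 0.0
    total = sum(exps)
    weights = [e / total for e in exps]
    out = sum(w * v for w, v in zip(weights, values))
    # Multiply by the validity flag so an originally fully masked row
    # contributes exactly zero instead of the placeholder block's value.
    return out * (1.0 if row_valid else 0.0)
```

With `block_size=64`, a query at position 10 sees no completed block, so `compressed_branch_row` returns `0.0` rather than NaN; a query at position 200 attends over the first three blocks as usual.
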
V6.1 was trained at **4× the v6.0 context** (256 vs 64 tokens) on
the same 36,860-row Aether curated corpus, on the same RTX 3080 Ti,
in the same wall-clock envelope (~44 min vs v6.0's 50 min; slightly
faster because there is no Qwen teacher forward pass).
## What you're getting
| Field | Value |
|---|---|
| Base model | `Qwen/Qwen2.5-0.5B-Instruct` (initialised from, then CE-trained) |
| Architecture | V6 transformer: 24 layers, 896 hidden, 14 attention heads (10 Sephirot + 2 generalist + 2 sink), head_dim=64 |
| Trainable params | ~558 M (all weights, no LoRA) |
| Training mode | **Pure cross-entropy** (no distillation in this release; see notes below) |
| Training context | **256 tokens** (4× the v6.0 release) |
| Precision | BF16 weights, F32 KL/CE math internally for numerical stability |
| NSA config | compression_block=64, top_k=2048, sliding_window=512, sink_tokens=4 |
| Vocab | 151,936 (Qwen2.5 tokenizer, untouched) |
| Max position | 32,768 (RoPE theta = 1e6) |
| Checkpoint published | **step 30,000** (full Phase-1 run) |
| File | `model.safetensors` (1.32 GB, BF16) |
| License | Apache-2.0 (matches base) |
## Training run
| Metric | Value | Δ vs v6.0 |
|---|---|---|
| Steps | 30,000 | = |
| Wall-clock | 44.4 min | −10 % |
| Tokens scored | 1,676,479 | +0.3 % (4× context lets more rows fit) |
| Throughput | 629.9 tokens/sec | +12 % |
| Mean CE loss | **10.18** nats/token | better (v6.0 was 10.35 mean CE under the KL blend) |
| Mean Sephirot aux | 0.149 | = |
| Max tokens processed | **167** | (v6.0 truncated to 64) |
| **NaN events** | **0** | (v6.0 also 0 thanks to the ctx=64 workaround) |
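
As a quick consistency check on the table, the throughput and the mean number of scored tokens per step can be recomputed from the reported totals. The variable names are illustrative, and the wall-clock figure is the rounded 44.4 min from the table, so the recomputed throughput comes out slightly below the reported 629.9 tokens/sec.

```python
# Figures copied from the training-run table above.
tokens_scored = 1_676_479
steps = 30_000
wall_clock_min = 44.4  # rounded value as published

seconds = wall_clock_min * 60
throughput = tokens_scored / seconds     # tokens per second
tokens_per_step = tokens_scored / steps  # mean scored tokens per step

print(f"{throughput:.1f} tokens/sec")
print(f"{tokens_per_step:.1f} tokens/step")
```

The mean of roughly 56 scored tokens per step sits well below the 256-token limit, which is consistent with the reported maximum of 167 tokens.
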
### Loss trajectory
```
step 1 loss=15.75 avg=15.75 (random init)
step 100 loss=15.94 avg=16.32 warm-up
step 1000 loss=11.63 avg=13.20 ← CE/lm-head learning the vocab
step 5000 loss=10.00 avg=11.01
step 10000 loss= 9.13 avg=10.07 ← representational floor (higher than v6.0's 7.68 at this step, but apples-to-oranges: v6.0 was loss-blended with KL teacher signal)
step 15000 loss=11.13 avg= 9.87
step 20000 loss=10.25 avg=10.02
step 25000 loss= 9.75 avg=10.15
step 29999 loss= 9.81 avg=10.18
```
The interesting fact: at step 122 (the row where v6.0 first NaN'd,
tokens=167), v6.1 reads a real loss in the 9-16 range and continues
training. **This release is empirical evidence that the
compressed-branch fix is the right one.**
## Architecture (unchanged from v6.0)
V6 is **not** a vanilla Qwen2.5 fine-tune. The attention layer
implements a 14-head split designed for on-chain cognitive routing:
- **10 Sephirot heads**: one per cognitive domain (Keter → Malkuth).
Each head's attention pattern is what the on-chain
`pallet_qbc_aether_anchor` records as the per-cycle attestation root.
- **2 generalist heads**: un-gated, full-context attention. Used
for the "global workspace" path in `aether-mind`.
- **2 sink heads**: anchor-token attention (first 4 tokens) for
stable long-context performance.
The NSA compressed branch (the one that NaN'd) now correctly handles
the early-query case via row-validity masking.
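
The head split can be sanity-checked against the hidden size with a small sketch. The `HeadLayout` class and its `role_of` method are hypothetical, and the index ordering (Sephirot heads first, then generalist, then sink) is an assumption; only the counts and `head_dim=64` come from this card.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HeadLayout:
    """Hypothetical description of the V6 14-head split."""
    sephirot: int = 10
    generalist: int = 2
    sink: int = 2
    head_dim: int = 64

    @property
    def num_heads(self) -> int:
        return self.sephirot + self.generalist + self.sink

    @property
    def hidden_size(self) -> int:
        return self.num_heads * self.head_dim

    def role_of(self, head_index: int) -> str:
        # Assumed ordering: sephirot, then generalist, then sink.
        if not 0 <= head_index < self.num_heads:
            raise IndexError(f"head index {head_index} out of range")
        if head_index < self.sephirot:
            return "sephirot"
        if head_index < self.sephirot + self.generalist:
            return "generalist"
        return "sink"

layout = HeadLayout()
print(layout.num_heads, layout.hidden_size)
```

The counts give 14 heads of width 64, which matches the 896 hidden size listed in the table above.
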
## How to use
### Native runtime (recommended): Rust `aether-mind`
Set `AETHER_V6_CHECKPOINT` to the local path of `model.safetensors`,
then restart `qbc-aether-mind.service`. The Rust binary loads the weights via candle.
### Python
```python
from safetensors.torch import load_file
weights = load_file("model.safetensors") # 315 BF16 tensors
print("params:", sum(t.numel() for t in weights.values()))
```
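
If only the tensor inventory is needed, the safetensors header can be read without loading any weights. The sketch below relies on the published safetensors layout (an 8-byte little-endian header length followed by a JSON header); the helper names `read_safetensors_header` and `count_params` are hypothetical and not part of the `safetensors` package.

```python
import json
import struct

def read_safetensors_header(path):
    """Return the JSON header of a safetensors file, without tensor data."""
    with open(path, "rb") as f:
        # First 8 bytes: unsigned little-endian 64-bit header length.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    # "__metadata__" is an optional free-form entry, not a tensor.
    header.pop("__metadata__", None)
    return header

def count_params(header):
    """Sum the element counts of every tensor listed in the header."""
    total = 0
    for info in header.values():
        n = 1
        for dim in info["shape"]:
            n *= dim
        total += n
    return total
```

Calling `count_params(read_safetensors_header("model.safetensors"))` should agree with the `numel()` sum printed by the snippet above, while reading only the header rather than the full 1.32 GB file.
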
There is **no upstream 🤗 transformers loader** for the V6 14-head
split + Sephirot routing. Production use goes through the Rust
binary in
[`qubitcoin-aether`](https://github.com/QuantumAI-Blockchain/qubitcoin-aether).
## Evaluation
**Not yet run.** lm-evaluation-harness vs MMLU / ARC / HellaSwag /
TruthfulQA is the next session's work. We will back-fill the
numbers and a comparison against v5.2-lora and v6.0 here when they land.
## Notes vs v6.0
- **No KL distillation in this release.** The full distillation
path (KL teacher signal + CE + Sephirot aux) hits a CUDA OOM at
the new ctx=256 because the F32-stable KL log-softmax of the
151K-vocab tensor allocates ~600 MB of intermediates per step that
don't free fast enough. Memory optimisation (in-place softmax, KL
chunking by vocab-tile) is the v6.2 work. v6.1 is CE-only over
the 4× longer context, a different bet that prioritises context
reach over teacher matching.
- **All 30K steps used the new attention path.** The NaN-safe
compressed branch runs by default; no env var or config to enable
it.
- **Same architecture, weights file format, tokenizer, and config
shape as v6.0.** The Rust binary loads v6.0 and v6.1 from the same
loader.
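
To give a sense of scale for the OOM described in the first note, the size of one F32 tensor over the full vocabulary can be worked out directly. This is a back-of-the-envelope sketch assuming batch size 1 and one F32 value per (token, vocab) pair; how many such tensors are live at once in the actual training step is not stated here, and `logits_bytes` is a hypothetical helper.

```python
VOCAB = 151_936   # Qwen2.5 tokenizer vocabulary size
BYTES_F32 = 4

def logits_bytes(ctx, batch=1):
    """Bytes for one F32 tensor of shape [batch, ctx, VOCAB]."""
    return batch * ctx * VOCAB * BYTES_F32

for ctx in (64, 256):
    print(f"ctx={ctx}: {logits_bytes(ctx) / 1e6:.1f} MB per F32 tensor")
```

At ctx=256 a single such tensor is about 156 MB, against about 39 MB at ctx=64, so a handful of live intermediates (logits, log-softmax for teacher and student, their product) would already be in the range of the ~600 MB quoted above.
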
## Open items for v6.2
- **Restore KL+CE distillation** at ctx ≥ 256 by chunking the
151K-vocab log-softmax (compute it in 512-entry vocab chunks so peak
memory stays bounded).
- **Long-context curriculum** (16K → 64K → 128K → 1M) per the V6
master spec, now that the forward-pass NaN is gone.
- **lm-evaluation-harness pass** for honest numbers.
- **HumanEval / coding evals** if we add a coding-domain corpus
chunk.
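
The vocab-chunking idea in the first item can be sketched in pure Python for a single token position. This is one possible approach, not the planned v6.2 implementation: the function names are hypothetical, the KL direction (teacher relative to student) is an assumption, and real code would operate on F32 tensors rather than Python lists.

```python
import math

def chunked_logsumexp(logits, chunk=512):
    """Streaming log-sum-exp: only one chunk of exponentials is live at a time."""
    running_max = float("-inf")
    running_sum = 0.0
    for start in range(0, len(logits), chunk):
        tile = logits[start:start + chunk]
        new_max = max(running_max, max(tile))
        # Rescale the accumulated sum to the new maximum before adding the tile.
        running_sum = running_sum * math.exp(running_max - new_max)
        running_sum += sum(math.exp(x - new_max) for x in tile)
        running_max = new_max
    return running_max + math.log(running_sum)

def chunked_kl(teacher_logits, student_logits, chunk=512):
    """KL(teacher || student) for one position, accumulated per vocab chunk."""
    t_lse = chunked_logsumexp(teacher_logits, chunk)
    s_lse = chunked_logsumexp(student_logits, chunk)
    kl = 0.0
    for start in range(0, len(teacher_logits), chunk):
        t_tile = teacher_logits[start:start + chunk]
        s_tile = student_logits[start:start + chunk]
        for t, s in zip(t_tile, s_tile):
            t_logp = t - t_lse  # teacher log-probability
            s_logp = s - s_lse  # student log-probability
            kl += math.exp(t_logp) * (t_logp - s_logp)
    return kl
```

The two passes (normaliser first, then the KL sum) trade a second read of the logits for a peak working set of one chunk instead of the full 151,936-entry log-softmax.
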
## License + citation
Apache-2.0 (matches the base model license).
```bibtex
@misc{aether_mind_v61_2026,
title = {Aether Mind v6.1 --- long-context after the compressed-branch NaN fix},
author = {{BlockArtica} and {QuantumAI-Blockchain}},
year = {2026},
url = {https://huggingface.co/QuantumAI-Blockchain/aether-mind-v6.1},
}
```
## Links
- **QuantumAI Blockchain:** [qbc.network](https://qbc.network)
- **GitHub org:** [github.com/QuantumAI-Blockchain](https://github.com/QuantumAI-Blockchain)
- **Aether (Rust):** [qubitcoin-aether](https://github.com/QuantumAI-Blockchain/qubitcoin-aether)
- **Prior releases:**
  - [aether-mind-v6.0](https://huggingface.co/QuantumAI-Blockchain/aether-mind-v6.0) (ctx=64, distilled)
  - [aether-v5.2-lora](https://huggingface.co/QuantumAI-Blockchain/aether-v5.2-lora) (7B LoRA)
- **X / Twitter:** [@qu_bitcoin](https://x.com/qu_bitcoin)
- **Contact:** info@qbc.network
### Framework versions
- candle 0.10 + CUDA 12.6
- Rust `aether-v6-train` binary @ commit
[`7f9189f8`](https://github.com/QuantumAI-Blockchain/qubitcoin-aether/commit/7f9189f8)
- Qwen2.5 tokenizer (vocab 151,936)