---
base_model: Qwen/Qwen2.5-0.5B-Instruct
library_name: safetensors
license: apache-2.0
tags:
- qubitcoin
- aether
- blockchain
- quantum
- native-rust
- candle
- long-context
language:
- en
pipeline_tag: text-generation
---
# Aether Mind v6.1: long-context after the NaN fix
V6.1 is the **third public Aether release** and the first trained
on a meaningfully long context window. It supersedes
[aether-mind-v6.0](https://huggingface.co/QuantumAI-Blockchain/aether-mind-v6.0),
which was published with a forced `ctx=64` workaround for a
forward-pass numerical instability in the NSA compressed branch
(`v6/attention.rs::compressed_branch`).
That instability has now been diagnosed and fixed. **Compressed-branch
attention's causal mask was producing all-`-inf` rows for query
positions before the first 64-token block completed, driving softmax
to `0/0 = NaN`.** The fix tracks per-row validity, unmasks a single
block on otherwise-fully-masked rows to keep softmax finite, and
multiplies the branch output by a row-validity mask so those rows
contribute zero attention (the intended behaviour). The source and
verification log are in
[`docs/ops/v6-training-nan-bug.md`](https://github.com/QuantumAI-Blockchain/qubitcoin-aether/blob/presale/v1/docs/ops/v6-training-nan-bug.md);
the fix landed in commit
[`7f9189f8`](https://github.com/QuantumAI-Blockchain/qubitcoin-aether/commit/7f9189f8).
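The failure mode and the fix described above can be illustrated with a small pure-Python sketch. This is not the code from `v6/attention.rs`: the function names, the block-visibility rule, and the choice to unmask block 0 on fully masked rows are assumptions made for illustration only.

```python
import math

NEG_INF = float("-inf")

def softmax(row):
    """Max-subtracted softmax. On an all -inf row, (-inf) - (-inf) is NaN."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def block_mask(num_queries, block_size, num_blocks):
    """mask[q][b] is True once block b has fully completed at query q (assumed rule)."""
    return [[(b + 1) * block_size <= q + 1 for b in range(num_blocks)]
            for q in range(num_queries)]

def compressed_branch(scores, values, mask):
    """scores[q][b]: query-to-block logits; values[b]: one vector per block."""
    dim = len(values[0])
    outputs = []
    for score_row, mask_row in zip(scores, mask):
        row_valid = any(mask_row)
        if not row_valid:
            # Unmask a single block (block 0 here, an arbitrary choice)
            # so the softmax stays finite on this row.
            mask_row = [b == 0 for b in range(len(mask_row))]
        masked = [s if keep else NEG_INF for s, keep in zip(score_row, mask_row)]
        probs = softmax(masked)
        out = [sum(p * v[d] for p, v in zip(probs, values)) for d in range(dim)]
        # Row-validity mask: rows with no visible block contribute zero.
        gate = 1.0 if row_valid else 0.0
        outputs.append([gate * o for o in out])
    return outputs

# Toy sizes: 6 query positions, blocks of 4 tokens, 2 compressed blocks.
mask = block_mask(num_queries=6, block_size=4, num_blocks=2)
out = compressed_branch([[0.0, 0.0]] * 6, [[1.0, 2.0], [3.0, 4.0]], mask)
```

In this toy setup, queries 0 to 2 see no completed block, so their branch output is all zeros rather than NaN, while queries 3 to 5 attend to the first block only.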
V6.1 was trained at **4× the v6.0 context** (256 vs 64 tokens) on
the same 36,860-row Aether curated corpus, on the same RTX 3080 Ti,
in the same wall-clock envelope (~44 min vs v6.0's 50 min; slightly
faster because there is no Qwen teacher forward pass).
## What you're getting
| Field | Value |
|---|---|
| Base model | `Qwen/Qwen2.5-0.5B-Instruct` (initialised from, then CE-trained) |
| Architecture | V6 transformer: 24 layers, 896 hidden, 14 attention heads (10 Sephirot + 2 generalist + 2 sink), head_dim=64 |
| Trainable params | ~558 M (all weights, no LoRA) |
| Training mode | **Pure cross-entropy** (no distillation in this release; see notes below) |
| Training context | **256 tokens** (4× the v6.0 release) |
| Precision | BF16 weights, F32 KL/CE math internally for numerical stability |
| NSA config | compression_block=64, top_k=2048, sliding_window=512, sink_tokens=4 |
| Vocab | 151,936 (Qwen2.5 tokenizer, untouched) |
| Max position | 32,768 (RoPE theta = 1e6) |
| Checkpoint published | **step 30,000** (full Phase-1 run) |
| File | `model.safetensors` (1.32 GB, BF16) |
| License | Apache-2.0 (matches base) |
## Training run
| Metric | Value | Δ vs v6.0 |
|---|---|---|
| Steps | 30,000 | = |
| Wall-clock | 44.4 min | −10 % |
| Tokens scored | 1,676,479 | +0.3 % (4× context lets more rows fit) |
| Throughput | 629.9 tokens/sec | +12 % |
| Mean CE loss | **10.18** nats/token | better (v6.0 was 10.35 mean CE under the KL blend) |
| Mean Sephirot aux | 0.149 | = |
| Max tokens processed | **167** | (v6.0 truncated to 64) |
| **NaN events** | **0** | (v6.0 also 0 thanks to the ctx=64 workaround) |
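
As a quick consistency check on the table above, the throughput and the average tokens per step can be recomputed from the reported token count, wall-clock time, and step count. This is plain arithmetic on the published figures, not output from the training binary.

```python
tokens_scored = 1_676_479   # "Tokens scored" row
wall_clock_min = 44.4       # "Wall-clock" row (rounded to 0.1 min)
steps = 30_000              # "Steps" row

tokens_per_sec = tokens_scored / (wall_clock_min * 60)
tokens_per_step = tokens_scored / steps

print(f"{tokens_per_sec:.1f} tokens/sec")    # 629.3
print(f"{tokens_per_step:.1f} tokens/step")  # 55.9
```

The recomputed throughput lands within about 0.1 % of the reported 629.9 tokens/sec, which is consistent with the wall-clock figure being rounded to one decimal place.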
### Loss trajectory
```
step     1  loss=15.75  avg=15.75  (random init)
step   100  loss=15.94  avg=16.32  warm-up
step  1000  loss=11.63  avg=13.20  ← CE/lm-head learning the vocab
step  5000  loss=10.00  avg=11.01
step 10000  loss= 9.13  avg=10.07  ← representational floor (not directly comparable to v6.0's 7.68 at this step: v6.0's loss was blended with the KL teacher signal)
step 15000  loss=11.13  avg= 9.87
step 20000  loss=10.25  avg=10.02
step 25000  loss= 9.75  avg=10.15
step 29999  loss= 9.81  avg=10.18
```
The interesting fact: at step 122 (the row where v6.0 first NaN'd,
tokens=167), v6.1 reports a finite loss in the 9-16 range and continues
training. **This release is the empirical proof that the
compressed-branch fix is the right one.**
## Architecture (unchanged from v6.0)
V6 is **not** a vanilla Qwen2.5 fine-tune. The attention layer
implements a 14-head split designed for on-chain cognitive routing:
- **10 Sephirot heads**: one per cognitive domain (Keter to Malkuth).
Each head's attention pattern is what the on-chain
`pallet_qbc_aether_anchor` records as the per-cycle attestation root.
- **2 generalist heads**: un-gated, full-context attention. Used
for the "global workspace" path in `aether-mind`.
- **2 sink heads**: anchor-token attention (first 4 tokens) for
stable long-context performance.
The NSA compressed branch (the one that NaN'd) now correctly handles
the early-query case via row-validity masking.
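
The head split above can be written down as a small configuration object, which makes the arithmetic explicit (14 heads × 64 dims = 896 hidden). The class name, field names, and the assumed head ordering (Sephirot first, then generalist, then sink) are hypothetical; the Rust implementation may lay the heads out differently.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class V6HeadLayout:
    """Hypothetical description of the 14-head split from the model card."""
    sephirot_heads: int = 10
    generalist_heads: int = 2
    sink_heads: int = 2
    head_dim: int = 64
    sink_tokens: int = 4  # sink heads anchor on the first 4 tokens

    @property
    def num_heads(self) -> int:
        return self.sephirot_heads + self.generalist_heads + self.sink_heads

    @property
    def hidden_size(self) -> int:
        return self.num_heads * self.head_dim

    def head_role(self, index: int) -> str:
        """Role of head `index`, under the assumed ordering."""
        if not 0 <= index < self.num_heads:
            raise IndexError(f"head index {index} out of range")
        if index < self.sephirot_heads:
            return "sephirot"
        if index < self.sephirot_heads + self.generalist_heads:
            return "generalist"
        return "sink"

layout = V6HeadLayout()
print(layout.num_heads, layout.hidden_size)  # 14 896
```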
## How to use
### Native runtime (recommended): Rust `aether-mind`
Set `AETHER_V6_CHECKPOINT` to the local path of `model.safetensors`,
then restart `qbc-aether-mind.service`. The Rust binary loads the
checkpoint via candle.
### Python
```python
from safetensors.torch import load_file
weights = load_file("model.safetensors") # 315 BF16 tensors
print("params:", sum(t.numel() for t in weights.values()))
```
There is **no upstream 🤗 transformers loader** for the V6 14-head
split + Sephirot routing. Production use goes through the Rust
binary in
[`qubitcoin-aether`](https://github.com/QuantumAI-Blockchain/qubitcoin-aether).
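
For inspecting the checkpoint without PyTorch installed, the safetensors header can be read with the standard library alone: the file starts with an 8-byte little-endian length, followed by that many bytes of JSON describing each tensor. The sketch below counts tensors and parameters per dtype; the helper names are illustrative and not part of any Aether tooling.

```python
import json
import struct
from collections import Counter
from math import prod

def read_safetensors_header(path):
    """Return the JSON header of a safetensors file as a dict."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # little-endian u64
        return json.loads(f.read(header_len))

def summarize(path):
    """Return (tensor_count, {dtype: parameter_count}) without loading weights."""
    header = read_safetensors_header(path)
    header.pop("__metadata__", None)  # optional free-form metadata entry
    params_by_dtype = Counter()
    for info in header.values():
        params_by_dtype[info["dtype"]] += prod(info["shape"])
    return len(header), dict(params_by_dtype)

# Example (requires the downloaded checkpoint):
# num_tensors, params = summarize("model.safetensors")
```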
## Evaluation
**Not yet run.** lm-evaluation-harness vs MMLU / ARC / HellaSwag /
TruthfulQA is the next session's work. We will back-fill the
numbers + comparison vs v5.2-lora + v6.0 here when they land.
## Notes vs v6.0
- **No KL distillation in this release.** The full distillation
path (KL teacher signal + CE + Sephirot aux) hits a CUDA OOM at
the new ctx=256 because the F32-stable KL log-softmax of the
151K-vocab tensor allocates ~600 MB of intermediates per step that
are not freed fast enough. Memory optimisation (in-place softmax, KL
chunking by vocab-tile) is the v6.2 work. v6.1 is CE-only over
the 4× longer context, a different bet that prioritises context
reach over teacher matching.
- **All 30K steps used the new attention path.** The NaN-safe
compressed branch runs by default; no env var or config to enable
it.
- **Same architecture, weights file format, tokenizer, and config
shape as v6.0.** The Rust binary loads v6.0 and v6.1 from the same
loader.
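
The memory figure in the first note can be sanity-checked with simple arithmetic. The sketch assumes a single sequence at the full training context and F32 tensors of shape `[ctx, vocab]`; how many such tensors the KL path actually keeps alive is not stated in this card.

```python
vocab_size = 151_936  # Qwen2.5 tokenizer vocab
bytes_f32 = 4

def logits_bytes(ctx):
    """Bytes for one F32 tensor of shape [ctx, vocab_size]."""
    return ctx * vocab_size * bytes_f32

bytes_256 = logits_bytes(256)
bytes_64 = logits_bytes(64)
tensors_in_600mb = 600e6 / bytes_256

print(f"ctx=256: {bytes_256 / 1e6:.1f} MB per tensor")  # 155.6 MB
print(f"ctx=64:  {bytes_64 / 1e6:.1f} MB per tensor")   # 38.9 MB
print(f"~{tensors_in_600mb:.1f} such tensors fit in 600 MB")
```

Under these assumptions, roughly four full-size F32 vocab tensors account for the quoted ~600 MB, each four times larger than its ctx=64 counterpart.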
## Open items for v6.2
- **Restore KL+CE distillation** at ctx ≥ 256 by chunking the
151K-vocab log-softmax (compute per-512-token vocab-chunk so peak
memory stays bounded).
- **Long-context curriculum** (16K → 64K → 128K → 1M) per the V6
master spec, now that the forward-pass NaN is gone.
- **lm-evaluation-harness pass** for honest numbers.
- **HumanEval / coding evals** if we add a coding-domain corpus
chunk.
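
The vocab-chunking idea in the first item can be sketched in pure Python: a streaming log-sum-exp keeps only a running max and a running sum per position, and the KL term is then accumulated chunk by chunk. This is a minimal illustration of the math, not the planned v6.2 implementation; the function names, the KL direction (teacher relative to student), and the assumption of finite logits are all choices made here.

```python
import math

def chunked_logsumexp(logits, chunk_size):
    """Streaming log-sum-exp over vocab chunks (assumes finite logits)."""
    running_max, running_sum = float("-inf"), 0.0
    for start in range(0, len(logits), chunk_size):
        chunk = logits[start:start + chunk_size]
        new_max = max(running_max, max(chunk))
        # Rescale the accumulated sum to the new max, then add this chunk.
        running_sum = running_sum * math.exp(running_max - new_max) + sum(
            math.exp(x - new_max) for x in chunk
        )
        running_max = new_max
    return running_max + math.log(running_sum)

def chunked_kl(teacher_logits, student_logits, chunk_size=512):
    """KL(teacher || student) for one position, accumulated per vocab chunk."""
    lse_t = chunked_logsumexp(teacher_logits, chunk_size)
    lse_s = chunked_logsumexp(student_logits, chunk_size)
    kl = 0.0
    for start in range(0, len(teacher_logits), chunk_size):
        t_chunk = teacher_logits[start:start + chunk_size]
        s_chunk = student_logits[start:start + chunk_size]
        for t, s in zip(t_chunk, s_chunk):
            log_p, log_q = t - lse_t, s - lse_s
            kl += math.exp(log_p) * (log_p - log_q)
    return kl

# Tiny example with a 10-entry "vocab" and chunks of 3.
teacher = [0.1 * i for i in range(10)]
student = [0.05 * (i % 4) for i in range(10)]
kl_value = chunked_kl(teacher, student, chunk_size=3)
```

On a GPU the same two-pass structure would let each chunk's log-probabilities be discarded before the next chunk is computed, so the full `[ctx, vocab]` F32 log-softmax never has to be materialised at once.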
## License + citation
Apache-2.0 (matches the base model license).
```bibtex
@misc{aether_mind_v61_2026,
title = {Aether Mind v6.1 --- long-context after the compressed-branch NaN fix},
author = {{BlockArtica} and {QuantumAI-Blockchain}},
year = {2026},
url = {https://huggingface.co/QuantumAI-Blockchain/aether-mind-v6.1},
}
```
## Links
- **QuantumAI Blockchain:** [qbc.network](https://qbc.network)
- **GitHub org:** [github.com/QuantumAI-Blockchain](https://github.com/QuantumAI-Blockchain)
- **Aether (Rust):** [qubitcoin-aether](https://github.com/QuantumAI-Blockchain/qubitcoin-aether)
- **Prior releases:**
  - [aether-mind-v6.0](https://huggingface.co/QuantumAI-Blockchain/aether-mind-v6.0) (ctx=64, distilled)
  - [aether-v5.2-lora](https://huggingface.co/QuantumAI-Blockchain/aether-v5.2-lora) (7B LoRA)
- **X / Twitter:** [@qu_bitcoin](https://x.com/qu_bitcoin)
- **Contact:** info@qbc.network
### Framework versions
- candle 0.10 + CUDA 12.6
- Rust `aether-v6-train` binary @ commit
[`7f9189f8`](https://github.com/QuantumAI-Blockchain/qubitcoin-aether/commit/7f9189f8)
- Qwen2.5 tokenizer (vocab 151,936)