---
base_model: Qwen/Qwen2.5-0.5B-Instruct
library_name: safetensors
license: apache-2.0
tags:
  - qubitcoin
  - aether
  - blockchain
  - quantum
  - native-rust
  - candle
  - long-context
language:
  - en
pipeline_tag: text-generation
---

# Aether Mind v6.1 — long-context after the NaN fix

V6.1 is the **third public Aether release** and the first that
trains on a meaningfully long context window (256 tokens). It supersedes
[aether-mind-v6.0](https://huggingface.co/QuantumAI-Blockchain/aether-mind-v6.0),
which was published with a forced `ctx=64` workaround because of a
forward-pass numerical instability in the NSA compressed branch
(`v6/attention.rs::compressed_branch`).

That instability is now diagnosed and fixed. **Compressed-branch
attention's causal mask was producing all-`-inf` rows for query
positions before the first 64-token block completed, driving softmax
to `0/0 = NaN`.** The fix tracks per-row validity, unmasks a single
block on otherwise-fully-masked rows to keep softmax finite, and
multiplies the branch output by a row-validity mask so those rows
contribute zero attention (their proper behaviour). The source and
verification log are in
[`docs/ops/v6-training-nan-bug.md`](https://github.com/QuantumAI-Blockchain/qubitcoin-aether/blob/presale/v1/docs/ops/v6-training-nan-bug.md);
the fix landed in commit
[`7f9189f8`](https://github.com/QuantumAI-Blockchain/qubitcoin-aether/commit/7f9189f8).
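
The failure mode and the fix described above can be reproduced in a few lines. This is a toy sketch in pure Python, not the `compressed_branch` code from the repository: the function names, the scalar block values, and the exact visibility rule (a block becomes visible once its last token is at or before the query position) are all assumptions made for illustration.

```python
import math

NEG_INF = float("-inf")


def softmax(row):
    """Max-subtracted softmax. A row that is entirely -inf comes out as NaN."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]  # (-inf) - (-inf) is NaN
    total = sum(exps)
    return [e / total for e in exps]


def compressed_branch(scores, block_values, block_size):
    """scores[t][j]: query t against compressed block j (toy scalar values)."""
    outputs = []
    for t, row in enumerate(scores):
        # Assumed causal rule: block j is visible once all its tokens are <= t.
        visible = [(j + 1) * block_size <= t + 1 for j in range(len(row))]
        row_valid = any(visible)
        if not row_valid:
            visible[0] = True  # unmask one block so softmax stays finite
        masked = [s if keep else NEG_INF for s, keep in zip(row, visible)]
        probs = softmax(masked)
        y = sum(p * v for p, v in zip(probs, block_values))
        # Multiply by the row-validity mask. This only zeroes the row because
        # y is finite here: 0.0 * NaN would still be NaN.
        outputs.append(y * (1.0 if row_valid else 0.0))
    return outputs


# 8 query positions, 2 compressed blocks of 4 tokens each.
scores = [[0.1, 0.2] for _ in range(8)]
out = compressed_branch(scores, [1.0, 3.0], block_size=4)
```

In this toy setup the first three query positions see no completed block, so they contribute zero; positions 3 to 6 see only the first block; position 7 sees both. Calling `softmax([NEG_INF, NEG_INF])` directly shows the NaN that the unmask step avoids.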

V6.1 was trained at **4× the v6.0 context** (256 vs 64 tokens) on
the same 36,860-row Aether curated corpus, on the same RTX 3080 Ti,
in a similar wall-clock envelope (~44 min vs v6.0's 50 min; slightly
faster because there is no Qwen teacher forward pass).

## What you're getting

| Field | Value |
|---|---|
| Base model | `Qwen/Qwen2.5-0.5B-Instruct` (initialised from, then CE-trained) |
| Architecture | V6 transformer: 24 layers, 896 hidden, 14 attention heads (10 Sephirot + 2 generalist + 2 sink), head_dim=64 |
| Trainable params | ~558 M (all weights, no LoRA) |
| Training mode | **Pure cross-entropy** (no distillation in this release — see notes below) |
| Training context | **256 tokens** (4× the v6.0 release) |
| Precision | BF16 weights, F32 KL/CE math internally for numerical stability |
| NSA config | compression_block=64, top_k=2048, sliding_window=512, sink_tokens=4 |
| Vocab | 151,936 (Qwen2.5 tokenizer, untouched) |
| Max position | 32,768 (RoPE theta = 1e6) |
| Checkpoint published | **step 30,000** (full Phase-1 run) |
| File | `model.safetensors` (1.32 GB, BF16) |
| License | Apache-2.0 (matches base) |

## Training run

| Metric | Value | Δ vs v6.0 |
|---|---|---|
| Steps | 30,000 | = |
| Wall-clock | 44.4 min | −10 % |
| Tokens scored | 1,676,479 | +0.3 % (4× context lets more rows fit) |
| Throughput | 629.9 tokens/sec | +12 % |
| Mean CE loss | **10.18** nats/token | better (v6.0 was 10.35 mean CE under the KL blend) |
| Mean Sephirot aux | 0.149 | = |
| Max tokens processed | **167** | (v6.0 truncated to 64) |
| **NaN events** | **0** | (v6.0 also 0 thanks to the ctx=64 workaround) |

### Loss trajectory

```
step      1  loss=15.75  avg=15.75   (random init)
step    100  loss=15.94  avg=16.32   warm-up
step   1000  loss=11.63  avg=13.20   ← CE/lm-head learning the vocab
step   5000  loss=10.00  avg=11.01
step  10000  loss= 9.13  avg=10.07   ← representational floor (v6.0 read a lower 7.68 at this step, but that is apples-to-oranges: v6.0's loss was blended with the KL teacher signal)
step  15000  loss=11.13  avg= 9.87
step  20000  loss=10.25  avg=10.02
step  25000  loss= 9.75  avg=10.15
step  29999  loss= 9.81  avg=10.18
```

The interesting fact: at step 122 (the row where v6.0 first NaN'd,
tokens=167), v6.1 produces a finite loss in the 9-16 range and
continues training. **This release is the empirical proof that the
compressed-branch fix is the right one.**
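
To put the nats/token figures on a more familiar scale, the conversion below may help. It is plain arithmetic on two numbers from this card (vocab size and mean CE); the variable names are illustrative only.

```python
import math

VOCAB_SIZE = 151_936   # Qwen2.5 tokenizer size, from this card
MEAN_CE_NATS = 10.18   # mean CE loss, from the training-run table

# Cross-entropy of a uniform guess over the whole vocabulary, in nats.
uniform_ce = math.log(VOCAB_SIZE)

# Perplexity and bits/token implied by the reported mean CE.
perplexity = math.exp(MEAN_CE_NATS)
bits_per_token = MEAN_CE_NATS / math.log(2)

print(f"uniform CE  : {uniform_ce:.2f} nats/token")
print(f"perplexity  : {perplexity:,.0f}")
print(f"bits/token  : {bits_per_token:.2f}")
```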

## Architecture (unchanged from v6.0)

V6 is **not** a vanilla Qwen2.5 fine-tune. The attention layer
implements a 14-head split designed for on-chain cognitive routing:

- **10 Sephirot heads** — one per cognitive domain (Keter → Malkuth).
  Each head's attention pattern is what the on-chain
  `pallet_qbc_aether_anchor` records as the per-cycle attestation root.
- **2 generalist heads** — un-gated, full-context attention. Used
  for the "global workspace" path in `aether-mind`.
- **2 sink heads** — anchor-token attention (first 4 tokens) for
  stable long-context performance.

The NSA compressed branch (the one that NaN'd) now correctly handles
the early-query case via row-validity masking.
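
The head budget can be sanity-checked with a small sketch. The counts and `head_dim` come from the table above; the `HeadLayout` class, its method names, and the ordering of roles by head index (Sephirot first, then generalist, then sink) are assumptions for illustration and may not match the Rust implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HeadLayout:
    sephirot: int = 10
    generalist: int = 2
    sink: int = 2
    head_dim: int = 64

    @property
    def num_heads(self) -> int:
        return self.sephirot + self.generalist + self.sink

    @property
    def hidden_size(self) -> int:
        return self.num_heads * self.head_dim

    def role_of(self, head: int) -> str:
        # Assumed ordering: Sephirot heads first, then generalist, then sink.
        if not 0 <= head < self.num_heads:
            raise IndexError(f"head {head} out of range")
        if head < self.sephirot:
            return "sephirot"
        if head < self.sephirot + self.generalist:
            return "generalist"
        return "sink"


layout = HeadLayout()
print(layout.num_heads, layout.hidden_size)
```

With the defaults, the 14 heads at `head_dim=64` account for the full 896-wide hidden state.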

## How to use

### Native runtime (recommended) — Rust `aether-mind`

Set `AETHER_V6_CHECKPOINT` to the local path of `model.safetensors`,
then restart `qbc-aether-mind.service`. The Rust binary loads the
weights via candle.

### Python

```python
from safetensors.torch import load_file
weights = load_file("model.safetensors")  # 315 BF16 tensors
print("params:", sum(t.numel() for t in weights.values()))
```

There is **no upstream 🤗 transformers loader** for the V6 14-head
split + Sephirot routing. Production use goes through the Rust
binary in
[`qubitcoin-aether`](https://github.com/QuantumAI-Blockchain/qubitcoin-aether).
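
If you only want to inspect the checkpoint without installing `torch`, the safetensors header can be read with the standard library. The sketch below relies on the published file layout (an 8-byte little-endian header length followed by a JSON header); the helper names are made up for this example.

```python
import json
import struct
from math import prod


def parse_safetensors_header(raw: bytes) -> dict:
    """Decode the JSON header from the leading bytes of a safetensors file."""
    (header_len,) = struct.unpack("<Q", raw[:8])  # little-endian u64 length
    header = json.loads(raw[8:8 + header_len])
    header.pop("__metadata__", None)  # optional free-form metadata entry
    return header


def read_safetensors_header(path: str) -> dict:
    """Read only the header from disk; the weight bytes are never loaded."""
    with open(path, "rb") as f:
        prefix = f.read(8)
        (header_len,) = struct.unpack("<Q", prefix)
        return parse_safetensors_header(prefix + f.read(header_len))


def summarize(header: dict):
    """Return (tensor count, total element count, sorted dtype names)."""
    n_params = sum(prod(entry["shape"]) for entry in header.values())
    dtypes = sorted({entry["dtype"] for entry in header.values()})
    return len(header), n_params, dtypes
```

Running `summarize(read_safetensors_header("model.safetensors"))` on the published file should agree with the tensor count and dtype noted in the snippet above.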

## Evaluation

**Not yet run.** An lm-evaluation-harness pass on MMLU, ARC,
HellaSwag, and TruthfulQA is the next session's work. We will
back-fill the numbers, with a comparison against v5.2-lora and v6.0,
here when they land.

## Notes vs v6.0

- **No KL distillation in this release.** The full distillation
  path (KL teacher signal + CE + Sephirot aux) hits a CUDA OOM at
  the new ctx=256 because the F32-stable KL log-softmax over the
  151,936-entry vocabulary allocates ~600 MB of intermediates per
  step that are not freed fast enough. Memory optimisation (in-place
  softmax, KL chunking by vocab tile) is the v6.2 work. v6.1 is
  CE-only over the 4× longer context, a different bet that
  prioritises context reach over teacher matching.
- **All 30K steps used the new attention path.** The NaN-safe
  compressed branch runs by default; no env var or config to enable
  it.
- **Same architecture, weights file format, tokenizer, and config
  shape as v6.0.** The Rust binary loads v6.0 and v6.1 from the same
  loader.
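
A back-of-envelope check shows why the vocab-sized F32 tensors get expensive at ctx=256. The sketch assumes batch size 1 and four live `[ctx, vocab]` F32 intermediates; both are assumptions for illustration, not measurements from the training binary, and the real allocator behaviour may differ.

```python
VOCAB = 151_936   # Qwen2.5 tokenizer size, from this card
F32_BYTES = 4


def vocab_tensor_bytes(ctx: int, vocab: int = VOCAB, elem_bytes: int = F32_BYTES) -> int:
    """Bytes for one [ctx, vocab] tensor at batch size 1 (an assumption)."""
    return ctx * vocab * elem_bytes


per_tensor_256 = vocab_tensor_bytes(256)   # 155,582,464 bytes, about 156 MB
per_tensor_64 = vocab_tensor_bytes(64)     # 38,895,616 bytes, about 39 MB

# Assumption: four live F32 intermediates (for example teacher log-probs,
# student log-probs, teacher probs, and the pointwise KL terms).
ASSUMED_LIVE_TENSORS = 4
peak_256 = ASSUMED_LIVE_TENSORS * per_tensor_256   # 622,329,856 bytes

print(f"one tensor at ctx=256 : {per_tensor_256 / 1e6:.1f} MB")
print(f"one tensor at ctx=64  : {per_tensor_64 / 1e6:.1f} MB")
print(f"assumed peak, ctx=256 : {peak_256 / 1e6:.1f} MB")
```

Under those assumptions the total lands in the same range as the ~600 MB quoted above, and each tensor is 4× its ctx=64 size.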

## Open items for v6.2

- **Restore KL+CE distillation** at ctx ≥ 256 by chunking the
  151,936-entry vocabulary log-softmax (process the vocabulary in
  512-entry tiles so peak memory stays bounded).
- **Long-context curriculum** (16K → 64K → 128K → 1M) per the V6
  master spec, now that the forward-pass NaN is gone.
- **lm-evaluation-harness pass** for honest numbers.
- **HumanEval / coding evals** if we add a coding-domain corpus
  chunk.
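
The vocab-chunking idea in the first item can be sketched for a single token position. This is an illustration in pure Python rather than the planned candle implementation: the function names, the KL direction (teacher relative to student), and the two-pass structure are assumptions. It expects finite logits.

```python
import math


def streaming_logsumexp(logits, tile):
    """log(sum(exp(x))) computed one vocab tile at a time."""
    running_max = float("-inf")
    running_sum = 0.0
    for start in range(0, len(logits), tile):
        chunk = logits[start:start + tile]
        new_max = max(running_max, max(chunk))
        # Rescale the sum accumulated so far to the new max, then add the tile.
        running_sum = running_sum * math.exp(running_max - new_max)
        running_sum += sum(math.exp(x - new_max) for x in chunk)
        running_max = new_max
    return running_max + math.log(running_sum)


def chunked_kl(teacher_logits, student_logits, tile=512):
    """KL(teacher || student) for one position, accumulated tile by tile."""
    lse_t = streaming_logsumexp(teacher_logits, tile)
    lse_s = streaming_logsumexp(student_logits, tile)
    kl = 0.0
    for start in range(0, len(teacher_logits), tile):
        t_chunk = teacher_logits[start:start + tile]
        s_chunk = student_logits[start:start + tile]
        for t, s in zip(t_chunk, s_chunk):
            log_pt = t - lse_t  # teacher log-probability
            log_ps = s - lse_s  # student log-probability
            kl += math.exp(log_pt) * (log_pt - log_ps)
    return kl
```

Only one tile of log-probabilities is live at a time, so peak memory scales with the tile width instead of the full vocabulary. The cost is a second pass over the logits after the normalisers are known.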

## License + citation

Apache-2.0 (matches the base model license).

```bibtex
@misc{aether_mind_v61_2026,
  title  = {Aether Mind v6.1 --- long-context after the compressed-branch NaN fix},
  author = {{BlockArtica} and {QuantumAI-Blockchain}},
  year   = {2026},
  url    = {https://huggingface.co/QuantumAI-Blockchain/aether-mind-v6.1},
}
```

## Links

- **QuantumAI Blockchain:** [qbc.network](https://qbc.network)
- **GitHub org:** [github.com/QuantumAI-Blockchain](https://github.com/QuantumAI-Blockchain)
- **Aether (Rust):** [qubitcoin-aether](https://github.com/QuantumAI-Blockchain/qubitcoin-aether)
- **Prior releases:**
  - [aether-mind-v6.0](https://huggingface.co/QuantumAI-Blockchain/aether-mind-v6.0) (ctx=64, distilled)
  - [aether-v5.2-lora](https://huggingface.co/QuantumAI-Blockchain/aether-v5.2-lora) (7B LoRA)
- **X / Twitter:** [@qu_bitcoin](https://x.com/qu_bitcoin)
- **Contact:** info@qbc.network

### Framework versions

- candle 0.10 + CUDA 12.6
- Rust `aether-v6-train` binary @ commit
  [`7f9189f8`](https://github.com/QuantumAI-Blockchain/qubitcoin-aether/commit/7f9189f8)
- Qwen2.5 tokenizer (vocab 151,936)