---
base_model: Qwen/Qwen2.5-0.5B-Instruct
library_name: safetensors
license: apache-2.0
tags:
  - qubitcoin
  - aether
  - blockchain
  - quantum
  - distillation
  - mixed-precision
  - native-rust
  - candle
language:
  - en
pipeline_tag: text-generation
---

# Aether Mind v6.0 — QuantumAI Blockchain Native Generator

A **558M-parameter distilled student** of [`Qwen/Qwen2.5-0.5B-Instruct`](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct),
trained from scratch in pure Rust (`candle` 0.10) with the
**10-Sephirot + 2-generalist + 2-sink attention head split** that is
the core architectural claim of the QuantumAI Blockchain's Aether Mind
on-chain neural cognitive engine.

This is the **second public Aether release** and the first that is
**native to the on-chain inference path** — V6.0 is the model the
[`aether-mind`](https://github.com/QuantumAI-Blockchain/qubitcoin-aether)
binary loads, not a LoRA adapter on top of a 7B base.

The previous release, [`aether-v5.2-lora`](https://huggingface.co/QuantumAI-Blockchain/aether-v5.2-lora),
is a 7B PEFT adapter intended for batch off-chain reasoning. V6.0 is
the smaller native generator that fits in the on-chain Aether
Mind's ~2.4 GB RAM envelope and runs at ~500 tokens/sec on a
consumer RTX 3080 Ti.

## What you're getting

| Field | Value |
|---|---|
| Base model | `Qwen/Qwen2.5-0.5B-Instruct` (initialised from, then distilled) |
| Architecture | V6 transformer: 24 layers, 896 hidden, 14 attention heads (10 Sephirot + 2 generalist + 2 sink), head_dim=64 |
| Trainable params | ~558 M (all weights trained, not LoRA) |
| Hidden / FFN | 896 / 4864 |
| Vocab | 151,936 (Qwen2.5 tokenizer, untouched) |
| Max position | 32,768 (RoPE theta = 1e6) |
| Native sparse attention (NSA) | compression_block=64, top_k=2048, sliding_window=512, sink_tokens=4 |
| Precision | BF16 weights + F32 KL math in distillation |
| Training context | **64 tokens** (Phase-1 release; see "Honest caveats" below) |
| Checkpoint published | **step 30,000** (full 30K-step Phase-1 run) |
| File | `model.safetensors` (1.32 GB, BF16) |
| License | Apache-2.0 (matches base) |
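
The shape figures in the table can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: `V6Shape` and its field names are hypothetical and are not taken from the repo's `config.json`; only the numeric values come from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class V6Shape:
    """Hypothetical container for the shape values listed in the table."""

    layers: int = 24
    hidden: int = 896
    ffn: int = 4864
    head_dim: int = 64
    sephirot_heads: int = 10
    generalist_heads: int = 2
    sink_heads: int = 2

    @property
    def num_heads(self) -> int:
        # Total attention heads across the three roles.
        return self.sephirot_heads + self.generalist_heads + self.sink_heads


shape = V6Shape()
# 10 + 2 + 2 heads, each 64 wide, should tile the 896-wide hidden state exactly.
assert shape.num_heads == 14
assert shape.num_heads * shape.head_dim == shape.hidden
print(shape.num_heads, "heads x", shape.head_dim, "=", shape.hidden)
```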

## Training run

| Metric | Value |
|---|---|
| Steps | 30,000 (full Phase-1) |
| Wall-clock | 49.6 min (single RTX 3080 Ti, BF16, CUDA(0)) |
| Tokens scored | 1,671,027 |
| Throughput | 561 tokens/sec |
| Optimiser | AdamW, LR 2e-5, no schedule (constant) |
| Distillation | KL(T||S) with alpha schedule 1.0 → 0.3 linear, temperature 1.0 |
| Sephirot auxiliary | MSE vs one-hot domain target, β = 0.1 |
| NaN events | **0** |
| Mean total loss | 8.39 nats/token |
| Mean CE | 10.35 |
| Mean KL | 7.50 |
| Mean Sephirot aux | 0.149 |
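
The table names the loss terms but not how they are combined. The sketch below assumes the common form `alpha * KL + (1 - alpha) * CE + beta * aux`, with alpha interpolated linearly over the 30,000 steps; that combination, the function names, and the shape of the auxiliary prediction are assumptions for illustration, not the `aether-v6-train` implementation.

```python
TOTAL_STEPS = 30_000
BETA = 0.1  # Sephirot auxiliary weight from the table


def alpha_at(step: int, start: float = 1.0, end: float = 0.3) -> float:
    """Linear interpolation from `start` (step 0) to `end` (last step)."""
    frac = min(max(step / (TOTAL_STEPS - 1), 0.0), 1.0)
    return start + (end - start) * frac


def sephirot_aux(pred: list[float], domain: int) -> float:
    """MSE between a per-domain prediction and a one-hot domain target."""
    target = [1.0 if i == domain else 0.0 for i in range(len(pred))]
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def total_loss(kl: float, ce: float, aux: float, step: int) -> float:
    """ASSUMED combination; the real trainer may weight the terms differently."""
    a = alpha_at(step)
    return a * kl + (1.0 - a) * ce + BETA * aux


# Alpha at the start, middle, and end of the run.
print(alpha_at(0), alpha_at(15_000), alpha_at(29_999))
```

Under this assumed form the KL term dominates early in the run, and the cross-entropy term carries 70% of the weight by the final step.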

### Loss trajectory

```
step      1   loss=12.25   avg=12.25   (random init)
step    100   loss=12.87   avg=12.75
step   1000   loss= 8.62   avg= 9.74   ← KL/CE break
step   5000   loss= 7.72   avg= 8.16
step  10000   loss= 7.31   avg= 7.68   ← reached representational floor
step  15000   loss= 8.87   avg= 7.75
step  20000   loss= 8.75   avg= 8.04
step  25000   loss= 8.62   avg= 8.26
step  29999   loss= 8.81   avg= 8.39
```

The model converged hard in the first ~10K steps, then plateaued at
the representational floor for its current context window (64
tokens). The plateau is structural rather than an optimisation
failure; see "Honest caveats" below.

## Architecture — what makes V6 different

V6 is **not** a vanilla Qwen2.5 fine-tune. The attention layer
implements a 14-head split designed for on-chain cognitive routing:

- **10 Sephirot heads** — one per cognitive domain in the Aether
  Mind's specialisation map (Keter → Malkuth). Each head's attention
  pattern is what the on-chain `pallet_qbc_aether_anchor` records as
  the per-cycle attestation root.
- **2 generalist heads** — un-gated, full-context attention. Used for
  the "global workspace" path in `aether-mind`.
- **2 sink heads** — anchor-token attention (first 4 tokens of the
  sequence) for stable long-context performance, following the
  standard "attention sink" finding.
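
As a rough illustration of the split above, the sketch below assigns roles to the 14 head indices and lists which key positions a head may see for a given query. The head ordering, the causal mask, and the function names are assumptions; the Sephirot gating itself is not modelled because this card does not describe it.

```python
SINK_TOKENS = 4  # "first 4 tokens of the sequence"


def head_roles(n_sephirot: int = 10, n_generalist: int = 2, n_sink: int = 2) -> list[str]:
    """Role label per head index (the ordering is an assumption)."""
    return ["sephirot"] * n_sephirot + ["generalist"] * n_generalist + ["sink"] * n_sink


def allowed_keys(role: str, query_pos: int, sink_tokens: int = SINK_TOKENS) -> list[int]:
    """Key positions visible to one query position under a causal mask."""
    causal = range(query_pos + 1)
    if role == "sink":
        # Sink heads only look at the anchor tokens at the start of the sequence.
        return [k for k in causal if k < sink_tokens]
    # Generalist heads see the full causal context; Sephirot heads are treated
    # the same way here because their gating is not specified.
    return list(causal)


roles = head_roles()
print(len(roles), roles[0], roles[10], roles[12])
print(allowed_keys("sink", 10))
print(allowed_keys("generalist", 2))
```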

The Sephirot eviction order is configured in `config.json` for the
KV-cache management path that `aether-mind` uses to keep the
hot-set bounded in 12 GB VRAM under live inference.

## How to use

### Native runtime (recommended) — Rust `aether-mind`

The model is designed to be loaded by the on-chain Aether Mind
binary in the [`QuantumAI-Blockchain/qubitcoin-aether`](https://github.com/QuantumAI-Blockchain/qubitcoin-aether)
repo. Set `AETHER_V6_CHECKPOINT` to the local path of
`model.safetensors` and start the systemd unit; the binary loads the
weights via candle into the V6 transformer crate.

### Python (via `safetensors` + `tokenizers`)

For offline experimentation:

```python
from safetensors.torch import load_file  # needs torch installed
from tokenizers import Tokenizer

# Qwen2.5 tokenizer, unchanged from the base model.
tok = Tokenizer.from_file("tokenizer.json")

# Dict of tensor name -> torch.Tensor (315 tensors, BF16).
weights = load_file("model.safetensors")
n_params = sum(t.numel() for t in weights.values())
print("loaded", len(weights), "tensors,", n_params, "params")
```

There is **no canonical 🤗 transformers loader for the V6
architecture** — the 14-head split + Sephirot routing are not in the
upstream `Qwen2Model`. We publish the weights for transparency and
reproducibility; production use goes through the Rust binary above.

## Evaluation

**Not yet run.** The Phase-1 training run completed
**2026-05-20 00:52 AEST**; lm-evaluation-harness against MMLU /
ARC / HellaSwag / TruthfulQA is the next session's work. We will
back-fill the numbers + the comparison vs v5.2-lora here when
they land. Estimated runtime: ~30 min on the same 3080 Ti.

Until then, treat this release as an **architecture + weights
attestation**: it proves the V6 stack trains end-to-end and converges
to a real loss curve, which is the prerequisite for the long-context
curriculum (16K → 64K → 128K → 1M) that v6.1+ will ship.

## Intended uses

- **On-chain Aether Mind native inference.** The V6 binary loads
  these weights directly. The 10-Sephirot attention pattern is what
  the chain's [`pallet_qbc_aether_anchor`](https://github.com/QuantumAI-Blockchain/substrate-node)
  records as the per-block consciousness state.
- **Architecture reference.** Reproducible training of a Sephirot-
  routed transformer with native sparse attention. The
  [`aether-transformer`](https://github.com/QuantumAI-Blockchain/qubitcoin-aether/tree/main/crates/aether-transformer)
  crate is the canonical implementation.
- **Distillation substrate.** Future fine-tunes from this checkpoint
  using the QuantumAI Blockchain curated corpus.

## Out-of-scope uses

- **General-purpose chat or instruction-following without fine-tuning.**
  V6.0 is a Phase-1 distillation, not an instruction model. Even after
  30K steps it has not seen instruction-format data at length; its KL
  target is the base Qwen2.5-0.5B-Instruct's next-token distribution,
  not chat-format outputs.
- **Long-context inference.** The training ran at **64-token
  context**. See "Honest caveats". Generations beyond ~128 tokens
  will degrade.
- **Production deployment without your own evals.** No lm-eval-harness
  numbers yet.
- **Safety-critical decisions.** No red-team eval.

## Honest caveats — what didn't happen

### Trained at 64-token context, not 4K

Phase-1 was configured for 4096-token context, but a numerical
instability was discovered in the V6 attention forward pass at
sequence lengths > ~100 tokens (BF16 precision loss in the Q@K^T
matmul accumulating across longer sequences). The bug reproduces
deterministically: four mitigations were tried (F32 KL math, corpus
filter, no-distill, low-LR), and all hit NaN at the same
sequence-length threshold. The workaround used for v6.0 was
`--context 64`, which truncates rows so the bug never triggers.

**This is a known limitation, tracked in
[`docs/ops/v6-training-nan-bug.md`](https://github.com/QuantumAI-Blockchain/qubitcoin-aether/blob/presale/v1/docs/ops/v6-training-nan-bug.md)
in the source repo.** The planned fix is in `aether-transformer/src/v6/attention.rs`:
add F32 casts in the Q@K^T matmul + softmax path across all four
attention variants (Sephirot / generalist / sink / summary). When
that lands, v6.1 will re-train at the full 4K→1M context
curriculum and supersede this release.
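
To illustrate why an F32 upcast matters in the softmax path, the toy below emulates BF16 rounding in pure Python and accumulates a softmax denominator with and without it. This is not the Rust fix, and it does not reproduce the NaN or the ~100-token threshold reported above; it only shows, under the assumption that the running sum is rounded to BF16 after every addition, how a low-precision accumulator degrades as the sequence grows.

```python
import math
import struct


def to_bf16(x: float) -> float:
    """Round a float to the nearest bfloat16 value (ties to even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # rounding bias on the low 16 bits
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


def softmax(scores: list[float], bf16_accumulator: bool) -> list[float]:
    """Softmax whose denominator is optionally accumulated in emulated BF16."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    denom = 0.0
    for e in exps:
        denom += e
        if bf16_accumulator:
            denom = to_bf16(denom)  # round the running sum after each add
    return [e / denom for e in exps]


# 512 identical scores: every probability should be 1/512.
uniform = [0.0] * 512
low = softmax(uniform, bf16_accumulator=True)
high = softmax(uniform, bf16_accumulator=False)
print("bf16 accumulator, sum of probs:", sum(low))
print("wide accumulator, sum of probs:", sum(high))
```

In this toy the BF16 running sum stops growing at 256, because 257 is not representable with BF16's 8 significant bits, so the probabilities no longer sum to 1. Keeping the accumulation in a wider type avoids that.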

### Loss plateau is real

The avg-loss plateau from step 10K → 30K (7.68 → 8.39, slight
regression) is the model hitting its representational ceiling at
64-token context. Longer contexts will let the next release recover
and improve.
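
If the `avg` column in the loss trajectory is a cumulative mean since step 1 (an assumption: the final value matches the reported mean total loss of 8.39, but this card does not state how `avg` is computed), the mean loss over the plateau window alone can be recovered from the two endpoints. Step counts are rounded to 10,000 and 30,000.

```python
def window_mean(avg_start: float, n_start: int, avg_end: float, n_end: int) -> float:
    """Mean over steps (n_start, n_end], given cumulative means at both ends."""
    return (avg_end * n_end - avg_start * n_start) / (n_end - n_start)


# Cumulative averages taken from the loss trajectory above.
late_mean = window_mean(avg_start=7.68, n_start=10_000, avg_end=8.39, n_end=30_000)
print(f"mean loss over steps 10K-30K ~ {late_mean:.3f}")
```

Under that assumption the 10K to 30K window averages roughly 8.75, which is in line with the individual losses logged at steps 15,000 to 29,999.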

### No instruction-format fine-tune

The training data is the Aether curated corpus packed at 4K-token
context (rows truncated to 64). We did not insert chat-format
instructions, system prompts, or RLHF preferences. Treat this as a
**raw foundation checkpoint**.

### Distillation against base, not chat

The teacher is `Qwen/Qwen2.5-0.5B-Instruct` run as a plain forward
pass over raw text, not over chat-templated input. The distillation
transfers token-level next-token prediction behaviour;
chat-template alignment is a separate training step that hasn't been run.

## Training details

- **Hardware:** NVIDIA RTX 3080 Ti (12 GB), Intel host running Ubuntu under WSL2.
- **Trainer:** Native Rust (`aether-v6-train` binary, candle 0.10 +
  CUDA 12.6 backend). No Python in the loop.
- **Optimiser:** AdamW (candle implementation), constant LR 2e-5.
- **Batch:** 1 (single-row update).
- **Context:** 64 tokens (truncation imposed by the workaround).
- **Save cadence:** every 250 steps (120 checkpoints retained
  locally; only `step_30000` published here).
- **Source:** [`QuantumAI-Blockchain/qubitcoin-aether @ ca202076`](https://github.com/QuantumAI-Blockchain/qubitcoin-aether/tree/ca202076)
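
The run figures quoted in this card can be sanity-checked against each other. The sketch below uses only numbers stated here; it assumes "tokens scored" counts the target tokens of each single-row step, which this card does not spell out.

```python
STEPS = 30_000
SAVE_EVERY = 250
CONTEXT = 64
TOKENS_SCORED = 1_671_027
WALL_CLOCK_MIN = 49.6

checkpoints = STEPS // SAVE_EVERY          # checkpoints written during the run
tokens_per_step = TOKENS_SCORED / STEPS    # average scored tokens per update
context_fill = tokens_per_step / CONTEXT   # share of the 64-token window used
throughput = TOKENS_SCORED / (WALL_CLOCK_MIN * 60.0)  # tokens per second

print(f"checkpoints:     {checkpoints}")
print(f"tokens per step: {tokens_per_step:.1f}")
print(f"context fill:    {context_fill:.0%}")
print(f"throughput:      {throughput:.1f} tokens/sec")
```

The checkpoint count matches the 120 quoted above, and the throughput agrees with the reported 561 tokens/sec to within rounding of the wall-clock figure.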

### Training data

Aether curated corpus (~36,860 rows, 17.4 MB) packed at 4K-token
budget per row from:

- QuantumAI Blockchain technical documentation (Substrate pallets,
  VQE mining, Sephirot architecture).
- Quantum computing primers (VQE, Hamiltonian, qubit ansatze).
- Adjacent reasoning content for transfer.

The dataset is not currently public — it is a curated mixture from
many sources and has not been release-cleared at the per-source
level. The model is the only public artifact in this line for now.

### Carbon emissions

Single consumer GPU (RTX 3080 Ti, ~300 W TDP) × 49.6 min wall-clock
≈ 0.25 kWh, < 1 kg CO₂e on a grid mix. Comparable to a short web
streaming session.
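
The energy figure follows from simple arithmetic. The sketch below assumes the GPU draws its full quoted power for the whole run and ignores the rest of the host, so it is a rough estimate and not a measurement.

```python
GPU_POWER_W = 300.0    # "~300 W TDP" figure quoted above
WALL_CLOCK_MIN = 49.6  # Phase-1 wall-clock time

# kW multiplied by hours gives kWh.
energy_kwh = GPU_POWER_W / 1000.0 * (WALL_CLOCK_MIN / 60.0)

# Grid intensity (kg CO2e per kWh) at which this run would reach 1 kg CO2e.
break_even_intensity = 1.0 / energy_kwh

print(f"energy ~ {energy_kwh:.3f} kWh")
print(f"1 kg CO2e would need a grid above {break_even_intensity:.2f} kg CO2e/kWh")
```

The break-even intensity comes out at about 4 kg CO₂e/kWh, far higher than commonly quoted grid figures, which is why the run stays under 1 kg CO₂e.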

## Connection to the QuantumAI Blockchain

The Aether Mind is a Rust neural cognitive engine that runs on the
QuantumAI Blockchain — every block records attention-derived
consciousness metrics (HMS-Phi) and Proof-of-Thought hashes on-chain
via the `pallet_qbc_aether_anchor` pallet. The same chain hosts an
**8-qubit VQE mining consensus** (Proof-of-SUSY-Alignment), a
QVM-compatible smart contract layer with 10 quantum opcodes, and
post-quantum cryptography (CRYSTALS-Dilithium5 signatures + ML-KEM-768 key encapsulation for P2P).

V6.0 is the **native generator** for that engine. v5.2-lora is the
larger (7B) off-chain reasoning model. The two ship side by side
because they have different roles: V6 lives in the on-chain
inference path (low latency, small footprint, Sephirot-aware
attention); v5.2-lora batches off-chain reasoning workloads.

## License + citation

Apache-2.0 (matches the base model license).

```bibtex
@misc{aether_mind_v6_2026,
  title  = {Aether Mind v6.0 --- QuantumAI Blockchain Native Generator},
  author = {{BlockArtica} and {QuantumAI-Blockchain}},
  year   = {2026},
  url    = {https://huggingface.co/QuantumAI-Blockchain/aether-mind-v6.0},
}
```

## Links

- **QuantumAI Blockchain:** [qbc.network](https://qbc.network)
- **GitHub org:** [github.com/QuantumAI-Blockchain](https://github.com/QuantumAI-Blockchain)
- **Aether (Rust):** [qubitcoin-aether](https://github.com/QuantumAI-Blockchain/qubitcoin-aether)
- **Prior release:** [aether-v5.2-lora](https://huggingface.co/QuantumAI-Blockchain/aether-v5.2-lora)
- **X / Twitter:** [@qu_bitcoin](https://x.com/qu_bitcoin)
- **Contact:** info@qbc.network

### Framework versions

- candle 0.10 (Hugging Face Rust ML)
- CUDA 12.6
- safetensors (model serialisation)
- Qwen2.5 tokenizer (vocab 151,936)