Aether Mind v6.0 — QuantumAI Blockchain Native Generator

A 558M-parameter distilled student of Qwen/Qwen2.5-0.5B-Instruct, trained from scratch in pure Rust (candle 0.10) with the 10-Sephirot + 2-generalist + 2-sink attention head split that is the core architectural claim of the QuantumAI Blockchain's Aether Mind on-chain neural cognitive engine.

This is the second public Aether release and the first that is native to the on-chain inference path — V6.0 is the model the aether-mind binary loads, not a LoRA adapter on top of a 7B base.

The previous release, aether-v5.2-lora, is a 7B PEFT adapter intended for batch off-chain reasoning. V6.0 is the smaller native generator that fits in the on-chain Aether Mind's ~2.4 GB RAM envelope and runs at ~500 tokens/sec on a consumer RTX 3080 Ti.

What you're getting

Field	Value
Base model	`Qwen/Qwen2.5-0.5B-Instruct` (initialised from, then distilled)
Architecture	V6 transformer: 24 layers, 896 hidden, 14 attention heads (10 Sephirot + 2 generalist + 2 sink), head_dim=64
Trainable params	~558 M (all weights trained, not LoRA)
Hidden / FFN	896 / 4864
Vocab	151,936 (Qwen2.5 tokenizer, untouched)
Max position	32,768 (RoPE theta = 1e6)
Native sparse attention (NSA)	compression_block=64, top_k=2048, sliding_window=512, sink_tokens=4
Precision	BF16 weights + F32 KL math in distillation
Training context	64 tokens (Phase-1 release; see "Honest caveats" below)
Checkpoint published	step 30,000 (full 30K-step Phase-1 run)
File	`model.safetensors` (1.32 GB, BF16)
License	Apache-2.0 (matches base)
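
The dimensions in the table can be cross-checked with a little arithmetic. The sketch below is only a consistency check on the published numbers. The `V6Config` class and its field names are hypothetical and are not taken from config.json or the aether-transformer crate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class V6Config:
    """Hypothetical container for the values listed in the table above."""
    layers: int = 24
    hidden: int = 896
    ffn: int = 4864
    head_dim: int = 64
    sephirot_heads: int = 10
    generalist_heads: int = 2
    sink_heads: int = 2
    vocab: int = 151_936
    max_position: int = 32_768

    @property
    def total_heads(self) -> int:
        return self.sephirot_heads + self.generalist_heads + self.sink_heads

    def check(self) -> None:
        # 10 + 2 + 2 must give the 14 heads stated in the table.
        assert self.total_heads == 14
        # Heads times head_dim must reproduce the hidden size.
        assert self.total_heads * self.head_dim == self.hidden


cfg = V6Config()
cfg.check()
print(cfg.total_heads, cfg.total_heads * cfg.head_dim)  # 14 896
```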

Training run

Metric	Value
Steps	30,000 (full Phase-1)
Wall-clock	49.6 min (single RTX 3080 Ti, BF16, CUDA(0))
Tokens scored	1,671,027
Throughput	561 tokens/sec
Optimiser	AdamW, LR 2e-5, no schedule (constant)
Distillation	KL(T
Sephirot auxiliary	MSE vs one-hot domain target, β = 0.1
NaN events	0
Mean total loss	8.39 nats/token
Mean CE	10.35
Mean KL	7.50
Mean Sephirot aux	0.149
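
The throughput figure follows from the token count and the wall-clock time. A minimal sketch of that arithmetic, using only numbers from the table plus the 64-token context cap (the variable names are illustrative):

```python
steps = 30_000
wall_clock_min = 49.6
tokens_scored = 1_671_027
context = 64  # per-row token cap used for this run

seconds = wall_clock_min * 60
tokens_per_sec = tokens_scored / seconds
tokens_per_step = tokens_scored / steps

print(f"{tokens_per_sec:.1f} tokens/sec")
print(f"{tokens_per_step:.1f} tokens per step (cap {context})")
```

The result is about 561.5 tokens/sec, which matches the reported 561 after truncation. The mean of roughly 55.7 scored tokens per step sits below the 64-token cap, which would be consistent with some rows being shorter than the cap, although the source does not say how scored tokens are counted.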

Loss trajectory

step      1   loss=12.25   avg=12.25   (random init)
step    100   loss=12.87   avg=12.75
step   1000   loss= 8.62   avg= 9.74   ← KL/CE break
step   5000   loss= 7.72   avg= 8.16
step  10000   loss= 7.31   avg= 7.68   ← reached representational floor
step  15000   loss= 8.87   avg= 7.75
step  20000   loss= 8.75   avg= 8.04
step  25000   loss= 8.62   avg= 8.26
step  29999   loss= 8.81   avg= 8.39

The model converged quickly over the first ~10K steps, then plateaued at the representational floor for its current context window (64 tokens). The plateau is structural rather than an optimisation failure; see "Honest caveats" below.
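
As a rough reference point for reading the trajectory, a model that guesses uniformly over the vocabulary has a cross-entropy of ln(V) nats per token. The sketch below compares that baseline with the logged values. The reported loss blends CE, KL and the auxiliary term, so the comparison is indicative only.

```python
import math

vocab = 151_936
uniform_ce = math.log(vocab)  # cross-entropy in nats of a uniform guess

step1_loss = 12.25   # from the trajectory above
final_avg = 8.39     # running average at step 29999

print(f"uniform baseline: {uniform_ce:.2f} nats/token")
print(f"step-1 loss minus baseline: {step1_loss - uniform_ce:+.2f}")
print(f"final average minus baseline: {final_avg - uniform_ce:+.2f}")
```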

Architecture — what makes V6 different

V6 is not a vanilla Qwen2.5 fine-tune. The attention layer implements a 14-head split designed for on-chain cognitive routing:

10 Sephirot heads — one per cognitive domain in the Aether Mind's specialisation map (Keter → Malkuth). Each head's attention pattern is what the on-chain pallet_qbc_aether_anchor records as the per-cycle attestation root.
2 generalist heads — un-gated, full-context attention. Used for the "global workspace" path in aether-mind.
2 sink heads — anchor-token attention (first 4 tokens of the sequence) for stable long-context performance, following the standard "attention sink" finding.
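
The head split above can be sketched as a head-kind layout plus two boolean attention masks. This is an illustration under assumptions, not the aether-transformer implementation: the head ordering is assumed, the eight names between Keter and Malkuth are the traditional Sephirot list (the source names only the endpoints), and the gating of the Sephirot heads is not modelled.

```python
SEPHIROT = ["Keter", "Chokmah", "Binah", "Chesed", "Gevurah",
            "Tiferet", "Netzach", "Hod", "Yesod", "Malkuth"]
SINK_TOKENS = 4


def head_kinds() -> list[str]:
    """Assumed ordering: Sephirot heads first, then generalist, then sink."""
    return [f"sephirot:{name}" for name in SEPHIROT] + ["generalist"] * 2 + ["sink"] * 2


def causal_mask(seq_len: int) -> list[list[bool]]:
    """Generalist view: query i may attend to every key j <= i."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def sink_mask(seq_len: int, sink_tokens: int = SINK_TOKENS) -> list[list[bool]]:
    """Sink view: query i may attend only to the first `sink_tokens` keys (still causal)."""
    return [[j <= i and j < sink_tokens for j in range(seq_len)] for i in range(seq_len)]


kinds = head_kinds()
print(len(kinds), kinds[0], kinds[9], kinds[10], kinds[13])
print(sum(causal_mask(8)[7]), sum(sink_mask(8)[7]))  # 8 4
```

For the last query in an 8-token sequence, the generalist mask exposes all 8 keys while the sink mask exposes only the 4 anchor tokens.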

The Sephirot eviction order is configured in config.json for the KV-cache management path that aether-mind uses to keep the hot-set bounded in 12 GB VRAM under live inference.

How to use

Native runtime (recommended) — Rust `aether-mind`

The model is designed to be loaded by the on-chain Aether Mind binary in the QuantumAI-Blockchain/qubitcoin-aether repo. Set AETHER_V6_CHECKPOINT to the local path of model.safetensors and start the systemd unit; the binary loads the weights via candle into the V6 transformer crate.
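
Before starting the service, it can help to confirm that AETHER_V6_CHECKPOINT points at a readable safetensors file. The sketch below is a hypothetical pre-flight check, not part of the aether-mind binary. It reads only the JSON header (an 8-byte little-endian length followed by the header itself), so it needs neither PyTorch nor a GPU.

```python
import json
import os
import struct
from collections import Counter


def read_safetensors_header(path: str) -> dict:
    """Return the JSON header of a .safetensors file without loading tensor data."""
    with open(path, "rb") as fh:
        # The format starts with an unsigned 64-bit little-endian header length.
        (header_len,) = struct.unpack("<Q", fh.read(8))
        return json.loads(fh.read(header_len))


def summarise(path: str) -> tuple[int, Counter]:
    """Count tensors and their dtypes, skipping the optional metadata entry."""
    header = read_safetensors_header(path)
    tensors = {k: v for k, v in header.items() if k != "__metadata__"}
    return len(tensors), Counter(v["dtype"] for v in tensors.values())


if __name__ == "__main__":
    ckpt = os.environ.get("AETHER_V6_CHECKPOINT", "model.safetensors")
    if os.path.isfile(ckpt):
        count, dtypes = summarise(ckpt)
        print(f"{ckpt}: {count} tensors, dtypes={dict(dtypes)}")
    else:
        print(f"checkpoint not found: {ckpt}")
```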

Python (via `safetensors` + `tokenizers`)

For offline experimentation:

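from safetensors.torch import load_file  # requires PyTorch to be installed
from tokenizers import Tokenizer

# Load the untouched Qwen2.5 tokenizer and the raw V6 weights.
tok = Tokenizer.from_file("tokenizer.json")
weights = load_file("model.safetensors")  # 315 tensors, BF16

# Report how many tensors and parameters were loaded.
n_params = sum(t.numel() for t in weights.values())
print("loaded", len(weights), "tensors,", n_params, "params")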

There is no canonical 🤗 transformers loader for the V6 architecture — the 14-head split + Sephirot routing are not in the upstream Qwen2Model. We publish the weights for transparency and reproducibility; production use goes through the Rust binary above.

Evaluation

Not yet run. The Phase-1 training run completed 2026-05-20 00:52 AEST; lm-evaluation-harness against MMLU / ARC / HellaSwag / TruthfulQA is the next session's work. We will back-fill the numbers + the comparison vs v5.2-lora here when they land. Estimated runtime: ~30 min on the same 3080 Ti.

Until then, treat this release as an architecture + weights attestation: it proves the V6 stack trains end-to-end and converges to a real loss curve, which is the prerequisite for the long-context curriculum (16K → 64K → 128K → 1M) that v6.1+ will ship.

Intended uses

On-chain Aether Mind native inference. The V6 binary loads these weights directly. The 10-Sephirot attention pattern is what the chain's pallet_qbc_aether_anchor records as the per-block consciousness state.
Architecture reference. Reproducible training of a Sephirot-routed transformer with native sparse attention. The aether-transformer crate is the canonical implementation.
Distillation substrate. Future fine-tunes from this checkpoint using the QuantumAI Blockchain curated corpus.

Out-of-scope uses

General-purpose chat or instruction-following without fine-tuning. V6.0 is a Phase-1 distillation, not an instruction model. Even after 30K steps it has not seen instruction-format data at length; its KL target is the base Qwen2.5-0.5B-Instruct's next-token distribution, not chat-format outputs.
Long-context inference. The training ran at 64-token context. See "Honest caveats". Generations beyond ~128 tokens will degrade.
Production deployment without your own evals. No lm-eval-harness numbers yet.
Safety-critical decisions. No red-team eval.

Honest caveats — what didn't happen

Trained at 64-token context, not 4K

Phase-1 was configured for 4096-token context, but a numerical instability was discovered in the V6 attention forward pass at sequence lengths above ~100 tokens (BF16 precision loss in the Q@K^T matmul accumulating across longer sequences). The bug reproduces deterministically. Four mitigations were tried (F32 KL math, corpus filter, no-distill, low-LR), and all hit NaN at the same sequence-length threshold. The workaround used for v6.0 was --context 64, which truncates rows so the bug never triggers.

This is a known limitation, tracked in docs/ops/v6-training-nan-bug.md in the source repo. The planned fix is in aether-transformer/src/v6/attention.rs: add F32 casts in the Q@K^T matmul + softmax path across all four attention variants (Sephirot / generalist / sink / summary). When that lands, v6.1 will re-train at the full 4K→1M context curriculum and supersede this release.
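
To illustrate why a BF16 accumulator is fragile over long sums, the toy below emulates BF16 rounding in pure Python and compares it with a wider accumulator. It is not a reproduction of the V6 bug (it shows a stalled sum, not a NaN), and the helper names are hypothetical. It only demonstrates the general motivation for keeping accumulation in F32.

```python
import struct


def to_bf16(x: float) -> float:
    """Round a value to the nearest bfloat16 (round-to-nearest-even), returned as a float."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def sum_bf16(values: list[float]) -> float:
    """Accumulate with the running total rounded to BF16 after every add."""
    acc = 0.0
    for v in values:
        acc = to_bf16(acc + to_bf16(v))
    return acc


def sum_wide(values: list[float]) -> float:
    """Round inputs to BF16 but keep the accumulator in higher precision."""
    return sum(to_bf16(v) for v in values)


for n in (64, 256, 1024):
    ones = [1.0] * n
    print(n, sum_bf16(ones), sum_wide(ones))
```

With 1024 ones, the BF16 accumulator stalls at 256.0 because 257 is not representable with an 8-bit significand, while the wider accumulator returns 1024.0. At 64 terms the two agree.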

Loss plateau is real

The average loss drifts from 7.68 at step 10K to 8.39 at step 30K, a slight regression rather than a flat plateau. We attribute this to the model hitting its representational limit at 64-token context. Longer contexts are expected to let the next release recover and improve.

No instruction-format fine-tune

The training data is the Aether curated corpus packed at 4K-token context (rows truncated to 64). We did not insert chat-format instructions, system prompts, or RLHF preferences. Treat this as a raw foundation checkpoint.

Distillation against base, not chat

The teacher is Qwen/Qwen2.5-0.5B-Instruct run on raw text, without its chat template applied. The distillation therefore transfers token-level next-token prediction behaviour; chat-template alignment is a separate training step that hasn't been run.
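
The two loss terms named in the training-run table can be sketched for a single token position as follows. This is a minimal illustration under assumptions: the table's distillation entry is truncated, so the KL direction (teacher to student) and the absence of a temperature are guesses, and the way the 10-way Sephirot prediction is produced is not described in this card. The terms are not combined here, because the weighting of CE and KL is not stated.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax computed in full Python float precision."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_teacher_student(teacher_logits: list[float], student_logits: list[float]) -> float:
    """KL(teacher || student) in nats for one token position (direction assumed)."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0.0)


def sephirot_aux(pred: list[float], domain: int) -> float:
    """Mean squared error between a 10-way prediction and a one-hot domain target."""
    target = [1.0 if i == domain else 0.0 for i in range(len(pred))]
    return sum((a - b) ** 2 for a, b in zip(pred, target)) / len(pred)


BETA = 0.1  # auxiliary weight from the training-run table

teacher = [2.0, 0.5, -1.0, 0.0]   # toy logits, not real model outputs
student = [1.0, 1.0, -0.5, 0.2]
kl = kl_teacher_student(teacher, student)
aux = sephirot_aux([0.1] * 10, domain=3)
print(f"KL={kl:.4f} aux={aux:.4f} weighted_aux={BETA * aux:.4f}")
```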

Training details

Hardware: NVIDIA RTX 3080 Ti (12 GB), Intel WSL2 Ubuntu host.
Trainer: Native Rust (aether-v6-train binary, candle 0.10 + CUDA 12.6 backend). No Python in the loop.
Optimiser: AdamW (candle implementation), constant LR 2e-5.
Batch: 1 (single-row update).
Context: 64 tokens (truncation imposed by the workaround).
Save cadence: every 250 steps (120 checkpoints retained locally; only step_30000 published here).
Source: QuantumAI-Blockchain/qubitcoin-aether @ ca202076

Training data

Aether curated corpus (~36,860 rows, 17.4 MB) packed at 4K-token budget per row from:

QuantumAI Blockchain technical documentation (Substrate pallets, VQE mining, Sephirot architecture).
Quantum computing primers (VQE, Hamiltonian, qubit ansatze).
Adjacent reasoning content for transfer.

The dataset is not currently public — it is a curated mixture from many sources and has not been release-cleared at the per-source level. The model is the only public artifact in this line for now.

Carbon emissions

Single consumer GPU (RTX 3080 Ti, ~300 W TDP) × 49.6 min wall-clock ≈ 0.25 kWh, < 1 kg CO₂e on a grid mix. Comparable to a short web streaming session.
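
The energy estimate is power multiplied by time. The sketch below reproduces it and converts it to CO₂e using a placeholder grid intensity. The 0.5 kg/kWh figure is a hypothetical value for illustration only and should be replaced with a local one.

```python
power_kw = 0.300            # ~300 W, as stated above
hours = 49.6 / 60           # wall-clock from the training run
energy_kwh = power_kw * hours

# Hypothetical grid intensity in kg CO2e per kWh; substitute your local figure.
grid_kg_per_kwh = 0.5
co2_kg = energy_kwh * grid_kg_per_kwh

print(f"{energy_kwh:.3f} kWh, {co2_kg:.3f} kg CO2e")  # 0.248 kWh, 0.124 kg CO2e
```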

Connection to the QuantumAI Blockchain

The Aether Mind is a Rust neural cognitive engine that runs on the QuantumAI Blockchain: every block records attention-derived consciousness metrics (HMS-Phi) and Proof-of-Thought hashes on-chain via the pallet_qbc_aether_anchor pallet. The same chain hosts an 8-qubit VQE mining consensus (Proof-of-SUSY-Alignment), a QVM-compatible smart contract layer with 10 quantum opcodes, and post-quantum cryptography (CRYSTALS-Dilithium5 signatures + ML-KEM-768 key encapsulation for P2P).

V6.0 is the native generator for that engine. v5.2-lora is the larger (7B) off-chain reasoning model. The two ship side by side because they have different roles: V6 lives in the on-chain inference path (low latency, small footprint, Sephirot-aware attention); v5.2-lora batches off-chain reasoning workloads.

License + citation

Apache-2.0 (matches the base model license).

@misc{aether_mind_v6_2026,
  title  = {Aether Mind v6.0 --- QuantumAI Blockchain Native Generator},
  author = {{BlockArtica} and {QuantumAI-Blockchain}},
  year   = {2026},
  url    = {https://huggingface.co/QuantumAI-Blockchain/aether-mind-v6.0},
}

