	---
	base_model: Qwen/Qwen2.5-0.5B-Instruct
	library_name: safetensors
	license: apache-2.0
	tags:
	- qubitcoin
	- aether
	- blockchain
	- quantum
	- distillation
	- mixed-precision
	- native-rust
	- candle
	language:
	- en
	pipeline_tag: text-generation
	---

	# Aether Mind v6.0 — QuantumAI Blockchain Native Generator

	A 558M-parameter distilled student of [`Qwen/Qwen2.5-0.5B-Instruct`](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct),
	trained from scratch in pure Rust (`candle` 0.10) with the
	10-Sephirot + 2-generalist + 2-sink attention head split that is
	the core architectural claim of the QuantumAI Blockchain's Aether Mind
	on-chain neural cognitive engine.

	This is the second public Aether release and the first that is
	native to the on-chain inference path — V6.0 is the model the
	[`aether-mind`](https://github.com/QuantumAI-Blockchain/qubitcoin-aether)
	binary loads, not a LoRA adapter on top of a 7B base.

	The previous release, [`aether-v5.2-lora`](https://huggingface.co/QuantumAI-Blockchain/aether-v5.2-lora),
	is a 7B PEFT adapter intended for batch off-chain reasoning. V6.0 is
	the smaller native generator that fits in the on-chain Aether
	Mind's ~2.4 GB RAM envelope and runs at ~500 tokens/sec on a
	consumer RTX 3080 Ti.

	## What you're getting

	| Field | Value |
	|---|---|
	| Base model | `Qwen/Qwen2.5-0.5B-Instruct` (initialised from, then distilled) |
	| Architecture | V6 transformer: 24 layers, 896 hidden, 14 attention heads (10 Sephirot + 2 generalist + 2 sink), head_dim=64 |
	| Trainable params | ~558 M (all weights trained, not LoRA) |
	| Hidden / FFN | 896 / 4864 |
	| Vocab | 151,936 (Qwen2.5 tokenizer, untouched) |
	| Max position | 32,768 (RoPE theta = 1e6) |
	| Native sparse attention (NSA) | compression_block=64, top_k=2048, sliding_window=512, sink_tokens=4 |
	| Precision | BF16 weights + F32 KL math in distillation |
	| Training context | 64 tokens (Phase-1 release; see "Honest caveats" below) |
	| Checkpoint published | step 30,000 (full 30K-step Phase-1 run) |
	| File | `model.safetensors` (1.32 GB, BF16) |
	| License | Apache-2.0 (matches base) |
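
The dimensions in the table can be sanity-checked against each other. The sketch below is illustrative only: `V6Config` and its field names are hypothetical and are not the schema of the published `config.json`. It simply confirms that the 10 + 2 + 2 head split at head_dim=64 tiles the 896-wide hidden state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class V6Config:
    """Hypothetical container for the values in the table above."""
    num_layers: int = 24
    hidden_size: int = 896
    ffn_size: int = 4864
    head_dim: int = 64
    sephirot_heads: int = 10
    generalist_heads: int = 2
    sink_heads: int = 2
    vocab_size: int = 151_936
    max_position: int = 32_768
    rope_theta: float = 1e6

    @property
    def num_heads(self) -> int:
        # Total attention heads across the three roles.
        return self.sephirot_heads + self.generalist_heads + self.sink_heads

    def validate(self) -> None:
        # The heads must tile the hidden dimension exactly.
        if self.num_heads * self.head_dim != self.hidden_size:
            raise ValueError(
                f"{self.num_heads} heads x {self.head_dim} != {self.hidden_size}"
            )


cfg = V6Config()
cfg.validate()
print(cfg.num_heads, "heads,", cfg.num_heads * cfg.head_dim, "hidden")
```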

	## Training run

	| Metric | Value |
	|---|---|
	| Steps | 30,000 (full Phase-1) |
	| Wall-clock | 49.6 min (single RTX 3080 Ti, BF16, CUDA(0)) |
	| Tokens scored | 1,671,027 |
	| Throughput | 561 tokens/sec |
	| Optimiser | AdamW, LR 2e-5, no schedule (constant) |
	| Distillation | KL(T\|\|S) with alpha schedule 1.0 → 0.3 linear, temperature 1.0 |
	| Sephirot auxiliary | MSE vs one-hot domain target, β = 0.1 |
	| NaN events | 0 |
	| Mean total loss | 8.39 nats/token |
	| Mean CE | 10.35 |
	| Mean KL | 7.50 |
	| Mean Sephirot aux | 0.149 |
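
For readers who want to see how the loss terms in the table could fit together, here is a minimal pure-Python sketch. The table gives the ingredients (KL(T||S), a linear alpha schedule 1.0 → 0.3, temperature 1.0, an MSE auxiliary with β = 0.1). The way they are combined below, `alpha * KL + (1 - alpha) * CE + beta * aux`, is an assumption and not a transcription of the Rust trainer. The `head_scores` input to the auxiliary term is also hypothetical.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    lse = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - lse for x in scaled]


def kl_teacher_student(teacher_logits, student_logits, temperature=1.0):
    """KL(T || S) in nats for a single token position."""
    log_p = log_softmax(teacher_logits, temperature)
    log_q = log_softmax(student_logits, temperature)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def cross_entropy(student_logits, target_index):
    """Negative log-likelihood of the ground-truth token."""
    return -log_softmax(student_logits)[target_index]


def alpha_at(step, total_steps, start=1.0, end=0.3):
    """Linear schedule from `start` at step 0 to `end` at the last step."""
    frac = min(max(step / max(total_steps - 1, 1), 0.0), 1.0)
    return start + (end - start) * frac


def sephirot_aux(head_scores, domain_index):
    """MSE between per-head scores and a one-hot domain target."""
    n = len(head_scores)
    return sum(
        (s - (1.0 if i == domain_index else 0.0)) ** 2
        for i, s in enumerate(head_scores)
    ) / n


def total_loss(kl, ce, aux, alpha, beta=0.1):
    # Assumed combination; the real trainer may weight the terms differently.
    return alpha * kl + (1.0 - alpha) * ce + beta * aux
```

Under this assumed form, plugging in the table means at the schedule midpoint (alpha = 0.65) gives `total_loss(7.50, 10.35, 0.149, 0.65)` ≈ 8.51, in the neighbourhood of the reported 8.39. A mean of per-step products need not equal the product of means, so an exact match is not expected.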

	### Loss trajectory

	```
	step 1 loss=12.25 avg=12.25 (random init)
	step 100 loss=12.87 avg=12.75
	step 1000 loss= 8.62 avg= 9.74 ← KL/CE break
	step 5000 loss= 7.72 avg= 8.16
	step 10000 loss= 7.31 avg= 7.68 ← reached representational floor
	step 15000 loss= 8.87 avg= 7.75
	step 20000 loss= 8.75 avg= 8.04
	step 25000 loss= 8.62 avg= 8.26
	step 29999 loss= 8.81 avg= 8.39
	```

	The model converged hard in the first ~10K steps, then plateaued at
	the representational floor for its current context window (64
	tokens), with the running average drifting up from 7.68 to 8.39 by
	the end of the run. The plateau is structural rather than an
	optimisation failure (see "Honest caveats" below).

	## Architecture — what makes V6 different

	V6 is not a vanilla Qwen2.5 fine-tune. The attention layer
	implements a 14-head split designed for on-chain cognitive routing:

	- 10 Sephirot heads — one per cognitive domain in the Aether
	Mind's specialisation map (Keter → Malkuth). Each head's attention
	pattern is what the on-chain `pallet_qbc_aether_anchor` records as
	the per-cycle attestation root.
	- 2 generalist heads — un-gated, full-context attention. Used for
	the "global workspace" path in `aether-mind`.
	- 2 sink heads — anchor-token attention (first 4 tokens of the
	sequence) for stable long-context performance, following the
	standard "attention sink" finding.
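
The head split above can be sketched as a role table plus per-role attention masks. Everything here is illustrative: the head ordering, the spelling of the intermediate Sephirot names, and the mask construction are assumptions, and the actual layout lives in the `aether-transformer` crate.

```python
# Traditional names from Keter to Malkuth; spelling and order are assumed.
SEPHIROT = [
    "Keter", "Chokmah", "Binah", "Chesed", "Gevurah",
    "Tiferet", "Netzach", "Hod", "Yesod", "Malkuth",
]


def head_roles():
    """One role label per head: 10 Sephirot, then 2 generalist, then 2 sink."""
    roles = [f"sephirot:{name}" for name in SEPHIROT]
    roles += ["generalist"] * 2
    roles += ["sink"] * 2
    return roles


def causal_mask(seq_len):
    """Full-context causal mask, as a generalist head would see it."""
    return [[k <= q for k in range(seq_len)] for q in range(seq_len)]


def sink_mask(seq_len, sink_tokens=4):
    """Causal mask restricted to the first `sink_tokens` key positions."""
    return [
        [k <= q and k < sink_tokens for k in range(seq_len)]
        for q in range(seq_len)
    ]


roles = head_roles()
print(len(roles), "heads:", roles)
print("sink row for query 7:", sink_mask(8)[7])
```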

	The Sephirot eviction order is configured in `config.json` for the
	KV-cache management path that `aether-mind` uses to keep the
	hot-set bounded in 12 GB VRAM under live inference.

	## How to use

	### Native runtime (recommended) — Rust `aether-mind`

	The model is designed to be loaded by the on-chain Aether Mind
	binary in the [`QuantumAI-Blockchain/qubitcoin-aether`](https://github.com/QuantumAI-Blockchain/qubitcoin-aether)
	repo. Set `AETHER_V6_CHECKPOINT` to the local path of
	`model.safetensors` and start the systemd unit; the binary loads the
	weights via candle into the V6 transformer crate.

	### Python (via `safetensors` + `tokenizers`)

	For offline experimentation:

	```python
	import torch  # safetensors.torch returns torch.Tensor objects
	from safetensors.torch import load_file
	from tokenizers import Tokenizer

	# Paths assume both files sit in the current working directory.
	tok = Tokenizer.from_file("tokenizer.json")
	weights = load_file("model.safetensors")  # 315 tensors, BF16

	n_params = sum(t.numel() for t in weights.values())
	print(f"loaded {len(weights)} tensors, {n_params:,} params")
	```

	There is **no canonical 🤗 transformers loader for the V6
	architecture** — the 14-head split + Sephirot routing are not in the
	upstream `Qwen2Model`. We publish the weights for transparency and
	reproducibility; production use goes through the Rust binary above.

	## Evaluation

	Not yet run. The Phase-1 training run completed
	2026-05-20 00:52 AEST; the next step is lm-evaluation-harness against
	MMLU / ARC / HellaSwag / TruthfulQA. We will back-fill the numbers
	and the comparison vs v5.2-lora here when they land. Estimated
	runtime: ~30 min on the same 3080 Ti.

	Until then, treat this release as an **architecture + weights
	attestation**: it proves the V6 stack trains end-to-end and converges
	to a real loss curve, which is the prerequisite for the long-context
	curriculum (16K → 64K → 128K → 1M) that v6.1+ will ship.

	## Intended uses

	- On-chain Aether Mind native inference. The V6 binary loads
	these weights directly. The 10-Sephirot attention pattern is what
	the chain's [`pallet_qbc_aether_anchor`](https://github.com/QuantumAI-Blockchain/substrate-node)
	records as the per-block consciousness state.
	- Architecture reference. Reproducible training of a Sephirot-
	routed transformer with native sparse attention. The
	[`aether-transformer`](https://github.com/QuantumAI-Blockchain/qubitcoin-aether/tree/main/crates/aether-transformer)
	crate is the canonical implementation.
	- Distillation substrate. Future fine-tunes from this checkpoint
	using the QuantumAI Blockchain curated corpus.

	## Out-of-scope uses

	- General-purpose chat or instruction-following without fine-tuning.
	V6.0 is a Phase-1 distillation, not an instruction model. Even after
	30K steps it has not seen instruction-format data at length; its KL
	target is the base Qwen2.5-0.5B-Instruct's next-token distribution,
	not chat-format outputs.
	- Long-context inference. The training ran at **64-token
	context**. See "Honest caveats". Generations beyond ~128 tokens
	will degrade.
	- Production deployment without your own evals. No lm-eval-harness
	numbers yet.
	- Safety-critical decisions. No red-team eval.

	## Honest caveats — what didn't happen

	### Trained at 64-token context, not 4K

	Phase-1 was configured for 4096-token context, but a numerical
	instability was discovered in the V6 attention forward pass at
	sequence lengths > ~100 tokens (BF16 precision loss in the Q@K^T
	matmul accumulating across longer sequences). The bug reproduces
	deterministically. Four mitigations were tried (F32 KL math, corpus
	filter, no-distill, low-LR), and all of them hit NaN at the same
	sequence-length threshold. The workaround used for v6.0 was
	`--context 64`, which truncates rows so the bug never triggers.

	**This is a known limitation, tracked in
	[`docs/ops/v6-training-nan-bug.md`](https://github.com/QuantumAI-Blockchain/qubitcoin-aether/blob/presale/v1/docs/ops/v6-training-nan-bug.md)
	in the source repo.** The planned fix is in `aether-transformer/src/v6/attention.rs`:
	add F32 casts in the Q@K^T matmul + softmax path across all four
	attention variants (Sephirot / generalist / sink / summary). When
	that lands, v6.1 will re-train on the full 4K→1M context
	curriculum and supersede this release.
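
To illustrate why accumulating in F32 matters, here is a small pure-Python simulation. It is not the Rust/candle fix and does not reproduce the NaN. BF16 is imitated by truncating a float32 to its top 16 bits, the 1/sqrt(head_dim) scaling is omitted, and the helper names are hypothetical. It only shows that a 64-element dot product of BF16-stored values drifts when every partial sum is rounded to BF16, and stays exact here when partial sums are kept in F32.

```python
import math
import struct


def to_f32(x: float) -> float:
    """Round a Python float (double) to the nearest float32 value."""
    return struct.unpack(">f", struct.pack(">f", x))[0]


def to_bf16(x: float) -> float:
    """Simulated bfloat16: keep only the top 16 bits of the float32 pattern."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]


def dot(q, k, cast):
    """Dot product where every product and partial sum is rounded by `cast`."""
    acc = 0.0
    for a, b in zip(q, k):
        acc = cast(acc + cast(a * b))
    return acc


def softmax_f32(scores):
    """Max-subtracted softmax with F32 rounding at each stage."""
    m = max(scores)
    exps = [to_f32(math.exp(s - m)) for s in scores]
    total = to_f32(sum(exps))
    return [to_f32(e / total) for e in exps]


head_dim = 64
q = [to_bf16(0.1)] * head_dim  # BF16-stored query values
k = [to_bf16(0.1)] * head_dim  # BF16-stored key values
exact = sum(a * b for a, b in zip(q, k))  # double-precision reference

err_bf16 = abs(dot(q, k, to_bf16) - exact)
err_f32 = abs(dot(q, k, to_f32) - exact)
print(f"BF16 accumulation error: {err_bf16:.6f}")
print(f"F32 accumulation error:  {err_f32:.6f}")
```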

	### Loss plateau is real

	The avg-loss plateau from step 10K → 30K (7.68 → 8.39, slight
	regression) is the model hitting its representational ceiling at
	64-token context. Longer contexts will let the next release recover
	and improve.
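
If the `avg` column in the loss trajectory is a cumulative mean since step 1, the mean loss inside any window can be recovered from the two endpoints. That reading is an assumption, although it is consistent with the final `avg` matching the reported mean total loss. The sketch below does that arithmetic, treating the last logged step (29999) as 30,000 for simplicity.

```python
def window_mean(step_a, avg_a, step_b, avg_b):
    """Mean loss over steps (step_a, step_b], given cumulative means at both ends."""
    return (avg_b * step_b - avg_a * step_a) / (step_b - step_a)


# (step, cumulative avg) pairs copied from the loss trajectory.
points = [
    (10_000, 7.68),
    (15_000, 7.75),
    (20_000, 8.04),
    (25_000, 8.26),
    (30_000, 8.39),
]

for (s0, a0), (s1, a1) in zip(points, points[1:]):
    print(f"steps {s0:>6} to {s1:>6}: mean loss {window_mean(s0, a0, s1, a1):.2f}")

late_mean = window_mean(10_000, 7.68, 30_000, 8.39)
print(f"steps 10K to 30K overall: {late_mean:.2f}")
```

Under that assumption the mean loss over steps 10K to 30K comes out at roughly 8.7, which is the figure to compare against the 7.68 cumulative average at step 10K.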

	### No instruction-format fine-tune

	The training data is the Aether curated corpus packed at 4K-token
	context (rows truncated to 64). We did not insert chat-format
	instructions, system prompts, or RLHF preferences. Treat this as a
	raw foundation checkpoint.

	### Distillation against base, not chat

	The teacher signal is `Qwen/Qwen2.5-0.5B-Instruct`'s plain forward
	pass over the corpus text, not a forward pass over chat-formatted
	prompts. The distillation transfers token-level next-token
	prediction behaviour; chat-template alignment is a separate
	training step that hasn't been run.

	## Training details

	- Hardware: NVIDIA RTX 3080 Ti (12 GB), Intel host running Ubuntu under WSL2.
	- Trainer: Native Rust (`aether-v6-train` binary, candle 0.10 +
	CUDA 12.6 backend). No Python in the loop.
	- Optimiser: AdamW (candle implementation), constant LR 2e-5.
	- Batch: 1 (single-row update).
	- Context: 64 tokens (truncation imposed by the workaround).
	- Save cadence: every 250 steps (120 checkpoints retained
	locally; only `step_30000` published here).
	- Source: [`QuantumAI-Blockchain/qubitcoin-aether @ ca202076`](https://github.com/QuantumAI-Blockchain/qubitcoin-aether/tree/ca202076)
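
The run figures in this list and in the training-run table can be cross-checked with a few lines of arithmetic. All inputs are the card's own numbers; only the variable names are made up here.

```python
steps = 30_000
wall_clock_min = 49.6
tokens_scored = 1_671_027
save_every = 250
context = 64

tokens_per_sec = tokens_scored / (wall_clock_min * 60)  # training throughput
tokens_per_step = tokens_scored / steps                 # batch size is 1
checkpoints = steps // save_every                       # checkpoints kept locally

print(f"throughput:      {tokens_per_sec:.1f} tokens/sec")
print(f"tokens per step: {tokens_per_step:.1f} (context cap {context})")
print(f"checkpoints:     {checkpoints}")
```

The average row contributes fewer tokens than the 64-token cap, which is what you would expect if some rows are shorter than the truncation length.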

	### Training data

	Aether curated corpus (~36,860 rows, 17.4 MB) packed at 4K-token
	budget per row from:

	- QuantumAI Blockchain technical documentation (Substrate pallets,
	VQE mining, Sephirot architecture).
	- Quantum computing primers (VQE, Hamiltonian, qubit ansatze).
	- Adjacent reasoning content for transfer.

	The dataset is not currently public — it is a curated mixture from
	many sources and has not been release-cleared at the per-source
	level. The model is the only public artifact in this line for now.

	### Carbon emissions

	Single consumer GPU (RTX 3080 Ti, ~300 W TDP) × 49.6 min wall-clock
	≈ 0.25 kWh, < 1 kg CO₂e on a grid mix. Comparable to a short web
	streaming session.
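
The energy estimate above is a one-line calculation. The sketch below reproduces it from the card's own inputs (the ~300 W figure and the 49.6 min wall-clock). It covers the GPU only, and it also reports the grid intensity at which the run would reach 1 kg CO₂e.

```python
power_kw = 0.300          # card's stated GPU power figure, in kW
hours = 49.6 / 60         # wall-clock time in hours

energy_kwh = power_kw * hours
# Grid carbon intensity (kg CO2e per kWh) at which this run would emit 1 kg.
break_even_intensity = 1.0 / energy_kwh

print(f"energy: {energy_kwh:.3f} kWh")
print(f"1 kg CO2e would need a grid above {break_even_intensity:.2f} kg/kWh")
```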

	## Connection to the QuantumAI Blockchain

	The Aether Mind is a Rust neural cognitive engine that runs on the
	QuantumAI Blockchain — every block records attention-derived
	consciousness metrics (HMS-Phi) and Proof-of-Thought hashes on-chain
	via the `pallet_qbc_aether_anchor` pallet. The same chain hosts an
	8-qubit VQE mining consensus (Proof-of-SUSY-Alignment), a
	QVM-compatible smart contract layer with 10 quantum opcodes, and
	post-quantum cryptography (CRYSTALS-Dilithium5 signatures, ML-KEM-768 key encapsulation for P2P).

	V6.0 is the native generator for that engine. v5.2-lora is the
	larger (7B) off-chain reasoning model. The two ship side by side
	because they have different roles: V6 lives in the on-chain
	inference path (low latency, small footprint, Sephirot-aware
	attention); v5.2-lora batches off-chain reasoning workloads.

	## License + citation

	Apache-2.0 (matches the base model license).

	```bibtex
	@misc{aether_mind_v6_2026,
	title = {Aether Mind v6.0 --- QuantumAI Blockchain Native Generator},
	author = {{BlockArtica} and {QuantumAI-Blockchain}},
	year = {2026},
	url = {https://huggingface.co/QuantumAI-Blockchain/aether-mind-v6.0},
	}
	```

	## Links

	- QuantumAI Blockchain: [qbc.network](https://qbc.network)
	- GitHub org: [github.com/QuantumAI-Blockchain](https://github.com/QuantumAI-Blockchain)
	- Aether (Rust): [qubitcoin-aether](https://github.com/QuantumAI-Blockchain/qubitcoin-aether)
	- Prior release: [aether-v5.2-lora](https://huggingface.co/QuantumAI-Blockchain/aether-v5.2-lora)
	- X / Twitter: [@qu_bitcoin](https://x.com/qu_bitcoin)
	- Contact: info@qbc.network

	### Framework versions

	- candle 0.10 (Hugging Face Rust ML)
	- CUDA 12.6
	- safetensors (model serialisation)
	- Qwen2.5 tokenizer (vocab 151,936)