---
base_model: Qwen/Qwen2.5-0.5B-Instruct
library_name: safetensors
license: apache-2.0
tags:
- qubitcoin
- aether
- blockchain
- quantum
- native-rust
- candle
- long-context
- cosine-schedule
- resume-fine-tune
language:
- en
pipeline_tag: text-generation
---
# Aether Mind v6.2: cosine-decay fine-tune of v6.1
V6.2 picks up where [v6.1](https://huggingface.co/QuantumAI-Blockchain/aether-mind-v6.1)
plateaued. It keeps the same architecture, the same 256-token context,
and the same Aether curated corpus, but trains for **another 30,000
steps under a cosine LR decay (2e-5 → 2e-7)** to push the student past
its fine-tune plateau without overshooting.
This is the third native (non-LoRA) Aether release and the first to
use a learning-rate schedule other than a constant rate. The cosine flag landed
in commit
[`186b2622`](https://github.com/QuantumAI-Blockchain/qubitcoin-aether/commit/186b2622).
## What you're getting
| Field | Value |
|---|---|
| Base model | Initialised from `Qwen/Qwen2.5-0.5B-Instruct`; this run resumes from the v6.1 fine-tune |
| Architecture | V6 transformer: 24 layers, 896 hidden, 14 attention heads (10 Sephirot + 2 generalist + 2 sink), head_dim=64 |
| Trainable params | ~558 M (all weights, no LoRA) |
| Training mode | **Pure cross-entropy** (no distillation, same as v6.1) |
| Training context | **256 tokens** (same as v6.1) |
| LR schedule | **Cosine decay 2e-5 → 2e-7** over 30,000 fine-tune steps |
| Precision | BF16 weights, F32 KL/CE math internally |
| NSA config | compression_block=64, top_k=2048, sliding_window=512, sink_tokens=4 |
| Vocab | 151,936 (Qwen2.5 tokenizer, untouched) |
| Max position | 32,768 (RoPE theta = 1e6) |
| Total training | **60,000 steps** (30K v6.1 + 30K v6.2) |
| File | `model.safetensors` (1.32 GB, BF16) |
| License | Apache-2.0 (matches base) |
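
The architecture row can be sanity-checked with a little arithmetic. The sketch below collects the table's values in a hypothetical `V6Config` dataclass (the class and field names are illustrative assumptions, not the actual `aether-mind` config schema) and checks that the head split and head dimension agree with the hidden size.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class V6Config:
    """Hypothetical container for values copied from the table above."""
    num_layers: int = 24
    hidden_size: int = 896
    head_dim: int = 64
    sephirot_heads: int = 10
    generalist_heads: int = 2
    sink_heads: int = 2
    vocab_size: int = 151_936
    max_position: int = 32_768
    rope_theta: float = 1e6

    @property
    def num_heads(self) -> int:
        # 10 Sephirot + 2 generalist + 2 sink heads
        return self.sephirot_heads + self.generalist_heads + self.sink_heads


cfg = V6Config()
# The three head groups should add up to the 14 heads listed in the table.
assert cfg.num_heads == 14
# 14 heads x 64 dims per head should equal the 896-wide hidden state.
assert cfg.num_heads * cfg.head_dim == cfg.hidden_size
print("heads:", cfg.num_heads, "heads * head_dim:", cfg.num_heads * cfg.head_dim)
```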
## Training run
| Metric | v6.1 | **v6.2** | Δ |
|---|---|---|---|
| Steps (this run) | 30,000 | 30,000 | = |
| Total steps | 30,000 | **60,000** | +30K |
| Wall-clock (this run) | 44.4 min | **44.9 min** | +0.5 min |
| Mean CE loss (this run) | 10.18 | **8.43** | **−17 %** |
| Throughput | 629.9 tok/s | 622.9 tok/s | flat |
| Mean Sephirot aux | 0.149 | **0.140** | −6 % |
| LR schedule | constant 2e-5 | **cosine 2e-5 → 2e-7** | new |
| NaN events | 0 | 0 | = |
| Resume base | random init (Qwen) | v6.1 final | new |
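
The percentage entries in the Δ column are relative changes against the v6.1 value. A minimal sketch of that arithmetic, using only numbers from the table (the helper name `relative_change` is an illustrative assumption):

```python
def relative_change(old: float, new: float) -> float:
    """Return the change from old to new as a percentage of old."""
    return (new - old) / old * 100.0


ce_delta = relative_change(10.18, 8.43)      # mean CE loss, v6.1 vs v6.2
aux_delta = relative_change(0.149, 0.140)    # mean Sephirot aux
tput_delta = relative_change(629.9, 622.9)   # throughput in tok/s

print(f"CE loss: {ce_delta:.1f} %")
print(f"Sephirot aux: {aux_delta:.1f} %")
print(f"Throughput: {tput_delta:.1f} %")
```

The throughput change comes out at about one percent, which is why the table reports it as flat.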
### Loss trajectory
```
step 1 loss=13.00 avg=13.00 (v6.1 final state)
step 100 loss=12.00 avg=11.78
step 1000 loss= 7.75 avg= 8.82 ← LR still high, big descent through v6.1's plateau
step 5000 loss= 7.25 avg= 7.71
step 10000 loss= 6.69 avg= 7.41 ← minimum running average
step 15000 loss= 9.56 avg= 7.51 ← cosine kicks in, per-step variance ↑, drift ↑
step 20000 loss= 8.94 avg= 7.92
step 25000 loss= 8.75 avg= 8.22
step 29999 loss= 9.31 avg= 8.43
```
The reported mean (8.43) is the run-wide average. The lowest observed
running average (7.41 at step 10K) is the actual fine-tune minimum;
the back-half drift is the cosine schedule reducing step size to near
zero, which makes per-step variance dominate the running average.
This is the expected shape of a converged cosine fine-tune.
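
The `avg` column appears to be a cumulative mean, since its final value matches the reported run-wide mean of 8.43. Under that reading, the mean loss inside each logging window can be recovered from consecutive rows. The sketch below does the arithmetic, assuming `avg` at step n is the mean of all per-step losses up to n and that the logged step number equals the number of steps taken so far. Both are assumptions rather than documented trainer behaviour, and the two-decimal rounding in the log limits the precision of the result.

```python
# (step, cumulative average) pairs copied from the log above.
LOGGED = [
    (1, 13.00), (100, 11.78), (1000, 8.82), (5000, 7.71), (10000, 7.41),
    (15000, 7.51), (20000, 7.92), (25000, 8.22), (29999, 8.43),
]


def window_means(rows):
    """Mean loss within each (prev_step, step] window.

    Assumes avg_n is the mean of the first n per-step losses, so the
    sum of losses up to step n is n * avg_n.
    """
    out = []
    for (s0, a0), (s1, a1) in zip(rows, rows[1:]):
        out.append((s0, s1, (s1 * a1 - s0 * a0) / (s1 - s0)))
    return out


for start, end, mean in window_means(LOGGED):
    print(f"steps {start + 1:>5}-{end:>5}: mean loss ~ {mean:.2f}")
```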
## What changed vs v6.1
1. **Cosine LR decay**. Constant LR at 2e-5 in v6.1 caused a plateau
from step ~10K onward: the optimiser kept bouncing around the
loss minimum it could see at that step size. Cosine decay to
2e-7 lets later steps take much smaller updates, fine-tuning past
the plateau.
2. **Resume from v6.1** rather than fresh init. The model starts at
v6.1's final state and refines from there.
3. **Otherwise identical to v6.1**: same architecture, same corpus,
same context, same NSA config, same Sephirot aux. Apart from the
resume point, the only variable changed is the LR schedule.
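
For reference, a cosine decay between the two rates quoted above can be sketched as follows. This is the standard textbook form with no warmup, not necessarily the exact formula behind the trainer's cosine flag; the function name and its defaults are assumptions chosen to match this card's numbers.

```python
import math


def cosine_lr(step: int, total_steps: int = 30_000,
              lr_max: float = 2e-5, lr_min: float = 2e-7) -> float:
    """Standard cosine decay from lr_max (step 0) to lr_min (final step)."""
    # Clamp progress to [0, 1] so out-of-range steps stay at the endpoints.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


for s in (0, 7_500, 15_000, 22_500, 30_000):
    print(f"step {s:>6}: lr = {cosine_lr(s):.3e}")
```

Under this form the rate is still roughly half its peak (about 1e-5) at the midpoint and only approaches 2e-7 near the end of the run.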
## How to use
### Native runtime (recommended): Rust `aether-mind`
Set `AETHER_V6_CHECKPOINT` to the local path of `model.safetensors`,
then restart `qbc-aether-mind.service`.
### Python
```python
from safetensors.torch import load_file

# Load every tensor into memory as a {name: torch.Tensor} dict.
weights = load_file("model.safetensors")
print("params:", sum(t.numel() for t in weights.values()))
```
Same architecture as v6.1, so any custom loader/wrapper for v6.1
works here.
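
To inspect the checkpoint without loading 1.32 GB of tensors, the safetensors header can be read on its own. The sketch below relies on the published file layout (an 8-byte little-endian header length followed by a JSON header) and uses only the standard library; the helper names `parse_header`, `read_header`, and `summarize` are illustrative assumptions, not part of any Aether tooling.

```python
import json
import os
import struct
from math import prod


def parse_header(blob: bytes) -> dict:
    """Decode the JSON header from the leading bytes of a .safetensors file."""
    (header_len,) = struct.unpack("<Q", blob[:8])
    return json.loads(blob[8:8 + header_len].decode("utf-8"))


def read_header(path: str) -> dict:
    """Read only the header from disk, leaving tensor data untouched."""
    with open(path, "rb") as f:
        prefix = f.read(8)
        (header_len,) = struct.unpack("<Q", prefix)
        return parse_header(prefix + f.read(header_len))


def summarize(header: dict):
    """Return (total parameter count, number of tensors per dtype)."""
    total, by_dtype = 0, {}
    for name, info in header.items():
        if name == "__metadata__":  # optional string-to-string metadata
            continue
        total += prod(info["shape"])
        by_dtype[info["dtype"]] = by_dtype.get(info["dtype"], 0) + 1
    return total, by_dtype


CHECKPOINT = "model.safetensors"  # assumed local path
if os.path.exists(CHECKPOINT):
    total, by_dtype = summarize(read_header(CHECKPOINT))
    print("params:", total, "tensors per dtype:", by_dtype)
```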
## Evaluation
(lm-evaluation-harness numbers will follow once the eval binary
ships. For now, the training-loss curve and sample generations are the
primary signal.)
## Open items for v6.3
- **Per-chunk backward** for distillation at ctx ≥ 256, so we can
add KL teacher signal back without OOMing.
- **Long-context curriculum** (1K, 4K, 16K → 1M) per the V6 master
spec.
- **lm-evaluation-harness pass** (MMLU / ARC / HellaSwag /
TruthfulQA) for honest published numbers.
## License + citation
Apache-2.0 (matches the base model license).
```bibtex
@misc{aether_mind_v62_2026,
title = {Aether Mind v6.2 --- cosine-decay fine-tune of v6.1},
author = {{BlockArtica} and {QuantumAI-Blockchain}},
year = {2026},
url = {https://huggingface.co/QuantumAI-Blockchain/aether-mind-v6.2},
}
```
## Links
- **Aether Mind v6.1**: [https://huggingface.co/QuantumAI-Blockchain/aether-mind-v6.1](https://huggingface.co/QuantumAI-Blockchain/aether-mind-v6.1)
- **Aether Mind v6.0**: [https://huggingface.co/QuantumAI-Blockchain/aether-mind-v6.0](https://huggingface.co/QuantumAI-Blockchain/aether-mind-v6.0)
- **Aether v5.2-lora**: [https://huggingface.co/QuantumAI-Blockchain/aether-v5.2-lora](https://huggingface.co/QuantumAI-Blockchain/aether-v5.2-lora)
- **QuantumAI Blockchain**: [qbc.network](https://qbc.network)
- **GitHub**: [github.com/QuantumAI-Blockchain](https://github.com/QuantumAI-Blockchain)