| --- |
| base_model: Qwen/Qwen2.5-0.5B-Instruct |
| library_name: safetensors |
| license: apache-2.0 |
| tags: |
| - qubitcoin |
| - aether |
| - blockchain |
| - quantum |
| - native-rust |
| - candle |
| - long-context |
| - cosine-schedule |
| - resume-fine-tune |
| language: |
| - en |
| pipeline_tag: text-generation |
| --- |
| |
| # Aether Mind v6.2: cosine-decay fine-tune of v6.1 |
|
|
| V6.2 picks up where [v6.1](https://huggingface.co/QuantumAI-Blockchain/aether-mind-v6.1) |
| plateaued. Same architecture, same 256-token context, same Aether |
| curated corpus, but trained for **another 30,000 steps under a |
| cosine LR decay (2e-5 → 2e-7)** to push the student past its |
| fine-tune plateau without overshooting. |
|
|
| This is the third native (non-LoRA) Aether release and the first to |
| use a learning-rate schedule other than constant. The cosine flag landed |
| in commit |
| [`186b2622`](https://github.com/QuantumAI-Blockchain/qubitcoin-aether/commit/186b2622). |
|
|
| ## What you're getting |
|
|
| | Field | Value | |
| |---|---| |
| | Base model | Initialised from `Qwen/Qwen2.5-0.5B-Instruct`; this run resumes from the v6.1 fine-tune | |
| | Architecture | V6 transformer: 24 layers, 896 hidden, 14 attention heads (10 Sephirot + 2 generalist + 2 sink), head_dim=64 | |
| | Trainable params | ~558 M (all weights, no LoRA) | |
| | Training mode | **Pure cross-entropy** (no distillation, same as v6.1) | |
| | Training context | **256 tokens** (same as v6.1) | |
| | LR schedule | **Cosine decay 2e-5 → 2e-7** over 30,000 fine-tune steps | |
| | Precision | BF16 weights, F32 KL/CE math internally | |
| | NSA config | compression_block=64, top_k=2048, sliding_window=512, sink_tokens=4 | |
| | Vocab | 151,936 (Qwen2.5 tokenizer, untouched) | |
| | Max position | 32,768 (RoPE theta = 1e6) | |
| | Total training | **60,000 steps** (30K v6.1 + 30K v6.2) | |
| | File | `model.safetensors` (1.32 GB, BF16) | |
| | License | Apache-2.0 (matches base) | |
| |
| ## Training run |
| |
| | Metric | v6.1 | **v6.2** | Δ | |
| |---|---|---|---| |
| | Steps (this run) | 30,000 | 30,000 | = | |
| | Total steps | 30,000 | **60,000** | +30K | |
| | Wall-clock (this run) | 44.4 min | **44.9 min** | +0.5 min | |
| | Mean CE loss (this run) | 10.18 | **8.43** | **−17 %** | |
| | Throughput | 629.9 tok/s | 622.9 tok/s | flat | |
| | Mean Sephirot aux | 0.149 | **0.140** | −6 % | |
| | LR schedule | constant 2e-5 | **cosine 2e-5 → 2e-7** | new | |
| | NaN events | 0 | 0 | = | |
| | Resume base | random init (Qwen) | v6.1 final | new | |
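
The Δ column can be reproduced from the two value columns. A minimal sketch, assuming Δ is the relative change of v6.2 against v6.1; the helper name `rel_change_pct` is illustrative and not taken from the training code:

```python
def rel_change_pct(old: float, new: float) -> float:
    """Relative change of `new` against `old`, in percent."""
    return (new - old) / old * 100.0


# (v6.1, v6.2) pairs copied from the table above.
metrics = {
    "mean CE loss": (10.18, 8.43),
    "mean Sephirot aux": (0.149, 0.140),
    "throughput (tok/s)": (629.9, 622.9),
}

for name, (v61, v62) in metrics.items():
    print(f"{name}: {rel_change_pct(v61, v62):+.1f} %")
```

Rounded to whole percent, the first two rows give −17 % and −6 %, matching the table; the throughput change is about one percent, which the table reports as flat.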
| |
| ### Loss trajectory |
| |
| ``` |
| step     1   loss=13.00   avg=13.00   (v6.1 final state) |
| step   100   loss=12.00   avg=11.78 |
| step  1000   loss= 7.75   avg= 8.82   ← LR still high, big descent through v6.1's plateau |
| step  5000   loss= 7.25   avg= 7.71 |
| step 10000   loss= 6.69   avg= 7.41   ← minimum running average |
| step 15000   loss= 9.56   avg= 7.51   ← cosine kicks in, per-step variance ↑, drift ↑ |
| step 20000   loss= 8.94   avg= 7.92 |
| step 25000   loss= 8.75   avg= 8.22 |
| step 29999   loss= 9.31   avg= 8.43 |
| ``` |
| |
| The reported mean (8.43) is the run-wide average. The lowest observed |
| running average (7.41 at step 10K) is the actual fine-tune minimum; |
| the back-half drift is the cosine schedule reducing step size to near |
| zero, which makes per-step variance dominate the running average. |
| This is the expected shape of a converged cosine fine-tune. |
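
To read the `avg` column between two logged steps, the mean loss over that interval can be recovered from the two running averages. This is a sketch under two assumptions that the log itself does not state: that `avg` is the cumulative mean since step 1 (consistent with the final `avg` equalling the reported run-wide mean), and that the step number equals the number of losses averaged so far. The function name `window_mean` is illustrative.

```python
def window_mean(step_a: int, avg_a: float, step_b: int, avg_b: float) -> float:
    """Mean loss over steps (step_a, step_b], given two cumulative averages.

    Assumes avg_k is the mean of the first k per-step losses.
    """
    if step_b <= step_a:
        raise ValueError("step_b must be greater than step_a")
    return (step_b * avg_b - step_a * avg_a) / (step_b - step_a)


# (step, avg) pairs copied from the log above.
log = [(1000, 8.82), (5000, 7.71), (10000, 7.41), (15000, 7.51),
       (20000, 7.92), (25000, 8.22), (29999, 8.43)]

for (sa, aa), (sb, ab) in zip(log, log[1:]):
    print(f"steps {sa:>5}..{sb:>5}: mean loss ~ {window_mean(sa, aa, sb, ab):.2f}")
```

The logged averages are rounded to two decimals, so the per-interval values are approximate.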
| |
| ## What changed vs v6.1 |
| |
| 1. **Cosine LR decay**. Constant LR at 2e-5 in v6.1 caused a plateau |
| from step ~10K onward: the optimiser kept bouncing around the |
| loss minimum it could see at that step size. Cosine decay to |
| 2e-7 lets later steps take much smaller updates, fine-tuning past |
| the plateau. |
| |
| 2. **Resume from v6.1** rather than fresh init. The model starts at |
| v6.1's final state and refines from there. |
| |
| 3. **Otherwise identical to v6.1**: same architecture, same corpus, |
| same context, same NSA config, same Sephirot aux. The single |
| variable changed is the LR schedule. |
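
The schedule in item 1 can be sketched as follows. This assumes the textbook cosine-annealing formula with no warmup and a decay horizon equal to the 30,000 fine-tune steps; the implementation behind the cosine flag may differ in detail, and `cosine_lr` is an illustrative name.

```python
import math


def cosine_lr(step: int, total_steps: int = 30_000,
              lr_max: float = 2e-5, lr_min: float = 2e-7) -> float:
    """Cosine-annealed learning rate from lr_max down to lr_min (no warmup assumed)."""
    # Clamp so that steps outside [0, total_steps] stay on the schedule's endpoints.
    progress = min(max(step, 0), total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


for step in (0, 10_000, 15_000, 20_000, 30_000):
    print(f"step {step:>6}: lr = {cosine_lr(step):.3e}")
```

Under these assumptions the learning rate is still about three quarters of its peak at step 10K and reaches the midpoint of the range at step 15K, so most of the step-size reduction happens in the second half of the run.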
| |
| ## How to use |
| |
| ### Native runtime (recommended): Rust `aether-mind` |
| |
| Set `AETHER_V6_CHECKPOINT` to the local path of `model.safetensors`, |
| then restart `qbc-aether-mind.service`. |
| |
| ### Python |
| |
| ```python |
| from safetensors.torch import load_file |
| |
| # Load all tensors on CPU as a {name: tensor} dict, then count parameters. |
| weights = load_file("model.safetensors") |
| print("params:", sum(t.numel() for t in weights.values())) |
| ``` |
| |
| Same architecture as v6.1, so any custom loader/wrapper for v6.1 |
| works here. |
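
When PyTorch is not installed, the tensor inventory can be read directly from the file header. The sketch below uses only the standard library and relies on the safetensors layout: an 8-byte little-endian header length followed by a JSON header that maps tensor names to `dtype`, `shape` and `data_offsets`. The function names are illustrative, and the default path is an assumption about where the file was saved.

```python
import json
import os
import struct
from math import prod


def read_safetensors_header(path: str) -> dict:
    """Return the JSON header of a .safetensors file without loading tensor data."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # u64, little-endian
        header = json.loads(f.read(header_len).decode("utf-8"))
    header.pop("__metadata__", None)  # optional free-form metadata, not a tensor
    return header


def summarize(header: dict) -> tuple[int, dict]:
    """Count elements in total and per dtype from the header's shapes."""
    total, by_dtype = 0, {}
    for info in header.values():
        n = prod(info["shape"])  # prod([]) == 1, so scalars count as one element
        total += n
        by_dtype[info["dtype"]] = by_dtype.get(info["dtype"], 0) + n
    return total, by_dtype


# Only runs if the checkpoint is present in the working directory.
if os.path.exists("model.safetensors"):
    total, by_dtype = summarize(read_safetensors_header("model.safetensors"))
    print("params:", total, "by dtype:", by_dtype)
```

Because only the header is read, this check is fast and a convenient way to confirm the stored dtype before pointing the runtime at the file.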
| |
| ## Evaluation |
| |
| (lm-evaluation-harness numbers to follow once the eval binary |
| ships. For now: training-loss curve + sample generations are the |
| primary signal.) |
| |
| ## Open items for v6.3 |
| |
| - **Per-chunk backward** for distillation at ctx ≥ 256, so we can |
| add KL teacher signal back without OOMing. |
| - **Long-context curriculum** (1K, 4K, 16K → 1M) per the V6 master |
| spec. |
| - **lm-evaluation-harness pass** (MMLU / ARC / HellaSwag / |
| TruthfulQA) for honest published numbers. |
| |
| ## License + citation |
| |
| Apache-2.0 (matches the base model license). |
| |
| ```bibtex |
| @misc{aether_mind_v62_2026, |
| title = {Aether Mind v6.2 --- cosine-decay fine-tune of v6.1}, |
| author = {{BlockArtica} and {QuantumAI-Blockchain}}, |
| year = {2026}, |
| url = {https://huggingface.co/QuantumAI-Blockchain/aether-mind-v6.2}, |
| } |
| ``` |
| |
| ## Links |
| |
| - **Aether Mind v6.1**: [https://huggingface.co/QuantumAI-Blockchain/aether-mind-v6.1](https://huggingface.co/QuantumAI-Blockchain/aether-mind-v6.1) |
| - **Aether Mind v6.0**: [https://huggingface.co/QuantumAI-Blockchain/aether-mind-v6.0](https://huggingface.co/QuantumAI-Blockchain/aether-mind-v6.0) |
| - **Aether v5.2-lora**: [https://huggingface.co/QuantumAI-Blockchain/aether-v5.2-lora](https://huggingface.co/QuantumAI-Blockchain/aether-v5.2-lora) |
| - **QuantumAI Blockchain**: [qbc.network](https://qbc.network) |
| - **GitHub**: [github.com/QuantumAI-Blockchain](https://github.com/QuantumAI-Blockchain) |
| |