# AETHER-Micro 0.5B (Phase 1 Checkpoint)

AETHER-Micro is an experimental Mixture-of-Experts (MoE) language model.
## Model Details
| Item | Value |
|---|---|
| Architecture | MoE big.LITTLE + LTL + MTP |
| Total Parameters | 2.08B |
| Active Parameters | ~0.5B per token |
| Hidden Size | 1024 |
| Layers | 24 |
| Attention | GQA 16 heads, 4 KV heads |
| Experts | 5 Big + 15 Small + 2 Shared |
| Vocab Size | 64,000 (Korean + English + Code) |
| Context Length | 8,192 (RoPE) |
| Training Step | 57,000 / 100,000 |
| Training Loss | ~3.54 |
## Architecture Features
- big.LITTLE MoE: 5 large experts (2048 intermediate) + 15 small experts (1024 intermediate) + 2 shared experts (always active)
- Latent Thought Layer (LTL): K-step latent reasoning (K=0,1,2) via Gumbel-Softmax selection
- Multi-Token Prediction (MTP): 4-step-ahead prediction replacing the standard next-token prediction (NTP) loss
- Wu-Xing Router: Five-element inspired expert routing
- Quality Head: 4-dimensional quality assessment
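The LTL's discrete choice of K can be made differentiable with the Gumbel-Softmax trick. The sketch below is a minimal, hypothetical illustration of that selection step (the actual AETHER-Micro controller and its logits are not described in this card):

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, rng=None):
    """Soft, differentiable sample from a categorical distribution.

    Illustrative only: in the model, `logits` would come from a learned
    controller over the hidden state; here they are made up.
    """
    rng = rng or np.random.default_rng(0)
    u = rng.uniform(1e-9, 1.0, size=logits.shape)
    g = -np.log(-np.log(u))            # Gumbel(0, 1) noise
    y = (logits + g) / tau             # perturbed logits, temperature tau
    y = np.exp(y - y.max())
    return y / y.sum()                 # soft one-hot over the K choices

# Hypothetical controller logits for K = 0, 1, 2 latent reasoning steps.
probs = gumbel_softmax(np.array([0.2, 1.0, -0.5]))
k = int(probs.argmax())                # hard choice at inference time
```

At training time the soft `probs` keep gradients flowing to the controller; at inference a hard argmax picks the number of latent steps.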
## Training
- Phase: 1 of 3 (57% complete)
- Data: 13.1B tokens (Korean 22%, English 25%, Code 21%, Math 24%, Dialogue 8%)
- Optimizer: AdamW (lr=1e-4, cosine decay)
- Precision: FP32
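For reference, the cosine-decay schedule stated above can be sketched as follows. This is a generic cosine schedule under the card's stated values (peak lr 1e-4, 100,000 total steps); the card does not mention warmup or a floor lr, so none is assumed:

```python
import math

def cosine_lr(step, total_steps=100_000, peak_lr=1e-4, min_lr=0.0):
    """Cosine decay from peak_lr to min_lr over total_steps.

    Parameter values follow the model card; warmup is omitted because
    the card does not specify one.
    """
    progress = min(step / total_steps, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

lr_now = cosine_lr(57_000)   # learning rate at the current checkpoint step
```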
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "Be2Jay/AETHER-Micro-0.5B",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("Be2Jay/AETHER-Micro-0.5B")
```
Note: This is a Phase 1 training checkpoint. The model is still in early training and not yet suitable for production use.
## License
Apache 2.0