AETHER-Micro 0.5B (Phase 1 Checkpoint)

AETHER-Micro is an experimental Mixture-of-Experts (MoE) language model.

Model Details

Item               Value
Architecture       MoE big.LITTLE + LTL + MTP
Total Parameters   2.08B
Active Parameters  ~0.5B per token
Hidden Size        1024
Layers             24
Attention          GQA (16 heads, 4 KV heads)
Experts            5 Big + 15 Small + 2 Shared
Vocab Size         64,000 (Korean + English + code)
Context Length     8,192 (RoPE)
Training Step      57,000 / 100,000
Training Loss      ~3.54

Architecture Features

  • big.LITTLE MoE: 5 large experts (2048 intermediate) + 15 small experts (1024 intermediate) + 2 shared experts (always active)
  • Latent Thought Layer (LTL): K-step latent reasoning (K=0,1,2) via Gumbel-Softmax selection
  • Multi-Token Prediction (MTP): 4-step-ahead prediction replacing the standard next-token prediction (NTP) loss
  • Wu-Xing Router: Five-element inspired expert routing
  • Quality Head: 4-dimensional quality assessment
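The K-step selection in the Latent Thought Layer can be illustrated with a small NumPy sketch. This is not the checkpoint's actual code: the function names (`gumbel_softmax_hard`, `latent_thought_layer`), the single-token (batch-free) shapes, and the identity of `step_fn` are all assumptions; the real model would use a differentiable straight-through Gumbel-Softmax so the selection logits receive gradients.

```python
import numpy as np

def gumbel_softmax_hard(logits, tau=1.0, rng=None):
    """Sample a hard one-hot choice via the Gumbel-max trick.

    In training this would be the straight-through Gumbel-Softmax,
    which keeps the hard forward pass but lets gradients flow.
    """
    rng = rng or np.random.default_rng(0)
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))  # Gumbel(0, 1) noise
    onehot = np.zeros_like(logits)
    onehot[np.argmax((logits + g) / tau)] = 1.0
    return onehot

def latent_thought_layer(h, step_fn, logits):
    """Apply K extra latent refinement steps to hidden state h, K in {0, 1, 2}.

    `logits` scores the three choices; the sampled one-hot picks the branch.
    Returns the refined state and the chosen K.
    """
    choice = gumbel_softmax_hard(logits)
    candidates = [h, step_fn(h), step_fn(step_fn(h))]  # K = 0, 1, 2
    k = int(np.argmax(choice))
    return candidates[k], k
```

With a toy `step_fn` such as `lambda x: x + 1.0`, the output differs from the input by exactly the sampled K, which makes the selection easy to inspect.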

Training

  • Phase: 1 of 3 (57% complete)
  • Data: 13.1B tokens (Korean 22%, English 25%, Code 21%, Math 24%, Dialogue 8%)
  • Optimizer: AdamW (lr=1e-4, cosine decay)
  • Precision: FP32

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "Be2Jay/AETHER-Micro-0.5B",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("Be2Jay/AETHER-Micro-0.5B")
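With the checkpoint loaded as above, a minimal generation call might look like this. The prompt and sampling settings are illustrative assumptions, not recommendations from the model authors, and running it requires downloading the checkpoint.

```python
# Assumes `model` and `tokenizer` from the snippet above.
inputs = tokenizer("def fibonacci(n):", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=0.8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```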

Note: This is a Phase 1 training checkpoint. The model is still in early training and not yet suitable for production use.

License

Apache 2.0
