Keural 14.8B: Stage 1 Pretraining Checkpoints
This repository contains raw pretraining checkpoints from Stage 1 of the Keural 14.8B MoE language model, trained from scratch on bilingual Korean-English data.
These are not instruction-tuned models. They are raw base checkpoints useful for research, resuming training, or fine-tuning experiments.
Model Overview
| Field | Value |
|---|---|
| Architecture | Mixtral-style MoE (Mixture of Experts) |
| Parameters | 14.83B total (2.6B active per token) |
| Layers | 24 transformer blocks |
| Hidden size | 4096 |
| Attention heads | 32 (8 KV heads, GQA) |
| Experts | 8 total, top-2 routing |
| Vocab size | 131,072 (custom SentencePiece) |
| Context length | 4096 tokens |
| Languages | Korean + English |
| Training hardware | 2× NVIDIA H200 150GB, FSDP |
| Infrastructure | KT Cloud NIPA2-H200 |
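The table values imply a few derived quantities that help when inspecting the weights. The following is a minimal sketch, assuming the common conventions that the head dimension is the hidden size divided by the number of attention heads and that GQA shares each KV head across an equal group of query heads. The class name `KeuralConfig` and its field names are hypothetical and are not taken from the training code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KeuralConfig:
    """Hypothetical container for the values listed in the overview table."""

    num_layers: int = 24
    hidden_size: int = 4096
    num_attention_heads: int = 32
    num_kv_heads: int = 8
    num_experts: int = 8
    experts_per_token: int = 2  # top-2 routing
    vocab_size: int = 131_072
    context_length: int = 4096

    @property
    def head_dim(self) -> int:
        # Assumes hidden_size is split evenly across attention heads.
        return self.hidden_size // self.num_attention_heads

    @property
    def query_heads_per_kv_head(self) -> int:
        # GQA: each KV head is shared by this many query heads.
        return self.num_attention_heads // self.num_kv_heads

    @property
    def active_expert_fraction(self) -> float:
        # Share of experts that process any single token.
        return self.experts_per_token / self.num_experts


cfg = KeuralConfig()
print(cfg.head_dim, cfg.query_heads_per_kv_head, cfg.active_expert_fraction)
```

Under these assumptions each head has dimension 128, each KV head serves 4 query heads, and a quarter of the experts are active per token.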
Training Data
Stage 1 pretraining used ~64.56B tokens of bilingual web text:
| Source | Language | Tokens |
|---|---|---|
| FineWeb | English | ~40B |
| WanJuan | Chinese/English | ~10B |
| HPLT | Korean | ~8B |
| FineWeb2 | English | ~6B |
Data was filtered for quality (spam removal, length filtering, deduplication) and tokenized with the custom Keural SentencePiece tokenizer (vocab=131,072).
Training Configuration
| Parameter | Value |
|---|---|
| Total steps | 100,000 |
| Batch size | ~645K tokens per step (~64.56B tokens / 100,000 steps) |
| Total tokens | ~64.56B |
| Optimizer | AdamW |
| Learning rate | Cosine schedule with warmup |
| Precision | bfloat16 |
| Parallelism | FSDP (2× H200) |
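The table names a cosine schedule with warmup but does not give its hyperparameters. As an illustration only, here is one common form of that schedule. The warmup length, peak learning rate, and minimum learning rate below are placeholder assumptions, not the values used to train Keural.

```python
import math


def cosine_lr(
    step: int,
    total_steps: int = 100_000,  # from the table above
    warmup_steps: int = 2_000,   # assumed, not from the source
    peak_lr: float = 3e-4,       # assumed, not from the source
    min_lr: float = 3e-5,        # assumed, not from the source
) -> float:
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Linear ramp that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for s in (0, 2_000, 50_000, 100_000):
    print(f"step {s:>7,}: lr = {cosine_lr(s):.2e}")
```

The checkpoints saved late in training would sit on the flat tail of such a curve, which is one reason late checkpoints tend to differ less from each other than early ones.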
Checkpoints in This Repository
| File | Step | Tokens Seen | Notes |
|---|---|---|---|
| checkpoint_19000.pt | 19,000 | ~12.3B | Early training |
| checkpoint_25000.pt | 25,000 | ~16.1B | Early training |
| checkpoint_25500.pt | 25,500 | ~16.5B | Early training |
| checkpoint_36000.pt | 36,000 | ~23.2B | Mid training |
| checkpoint_50000.pt | 50,000 | ~32.3B | Mid training |
| checkpoint_65000.pt | 65,000 | ~41.9B | Late training |
| checkpoint_70000.pt | 70,000 | ~45.2B | Late training |
| checkpoint_75000.pt | 75,000 | ~48.4B | Late training |
| checkpoint_80000.pt | 80,000 | ~51.6B | Late training |
| checkpoint_85000.pt | 85,000 | ~54.9B | Late training |
| checkpoint_90000.pt | 90,000 | ~58.1B | Late training |
| checkpoint_95000.pt | 95,000 | ~61.3B | Late training |
| checkpoint_100000.pt | 100,000 | ~64.56B | Stage 1 Final |
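The "Tokens Seen" column can be approximated from the step number alone. This sketch assumes a constant number of tokens per step, obtained by dividing the stated total (~64.56B) by the 100,000 total steps. The constant and function names are illustrative.

```python
TOTAL_STEPS = 100_000
TOTAL_TOKENS = 64_560_000_000  # "~64.56B" from this README, treated as exact here
TOKENS_PER_STEP = TOTAL_TOKENS // TOTAL_STEPS  # assumes a constant batch size


def tokens_seen(step: int) -> int:
    """Approximate number of training tokens consumed by the given step."""
    if not 0 <= step <= TOTAL_STEPS:
        raise ValueError(f"step must be in [0, {TOTAL_STEPS}], got {step}")
    return step * TOKENS_PER_STEP


for step in (19_000, 50_000, 100_000):
    print(f"step {step:>7,}: ~{tokens_seen(step) / 1e9:.1f}B tokens")
```

Because the total is itself rounded, results can differ from the table in the last printed digit.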
Each .pt file is a full FSDP training checkpoint (~83GB) containing:
- model: model state dict
- optimizer: optimizer state
- step: training step number
- loss: training loss at that step
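Before using a downloaded file, it can be worth confirming that it has the four entries listed above. This is a minimal sketch that works on any dictionary-like checkpoint. The helper names are hypothetical, and the check covers only the top-level keys, not the tensors inside them.

```python
from typing import Any, List, Mapping

# Top-level entries this README says each checkpoint contains.
EXPECTED_KEYS = ("model", "optimizer", "step", "loss")


def missing_keys(ckpt: Mapping[str, Any]) -> List[str]:
    """Return the expected top-level keys that are absent from the checkpoint."""
    return [key for key in EXPECTED_KEYS if key not in ckpt]


def describe(ckpt: Mapping[str, Any]) -> str:
    """One-line summary of a checkpoint, or a note about what is missing."""
    missing = missing_keys(ckpt)
    if missing:
        return "incomplete checkpoint, missing: " + ", ".join(missing)
    return f"step={ckpt['step']} loss={ckpt['loss']} model_entries={len(ckpt['model'])}"


# Stand-in dictionary for illustration; a real one would come from torch.load.
example = {"model": {"w": 0}, "optimizer": {}, "step": 19_000, "loss": 0.0}
print(describe(example))
```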
Full Training Pipeline
This is Stage 1 of the full Keural training pipeline:
| Stage | Scale | Status |
|---|---|---|
| Stage 1: Pretraining | 100K steps, ~64.56B tokens | Done (this repo) |
| Stage 2: Annealing | 20K steps, ~5.16B tokens | Done |
| Stage 3: SFT | 18K steps, 1.13M samples | In Progress |
| Stage 4: DPO | ~8K steps, 233K pairs | Planned |
For the final instruction-tuned model, see: mkd-hossain/keural-14.8b-sft
Loading a Checkpoint
These checkpoints use the custom Keural model architecture. You need the training code to load them:
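import torch

# weights_only=False unpickles arbitrary Python objects: only load files you trust.
# map_location="cpu" keeps the ~83GB checkpoint off the GPU; host RAM must be able to hold it.
ckpt = torch.load("checkpoint_100000.pt", map_location="cpu", weights_only=False)
state_dict = ckpt["model"]  # model state dict
step = ckpt["step"]  # 100000
loss = ckpt["loss"]  # training loss at that step
print(f"Step: {step} | Loss: {loss:.4f}")
print(f"Keys: {list(state_dict.keys())[:5]}")  # first five parameter names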
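For fine-tuning experiments the optimizer state is often unnecessary, and dropping it makes the file smaller. The sketch below shows one way to do that and to count the stored parameters. The helper names are hypothetical, and it assumes the `model` entry is a flat mapping from parameter names to tensors. A sharded or flattened FSDP layout would need handling that is not shown here.

```python
from typing import Any, Dict, Mapping


def count_parameters(state_dict: Mapping[str, Any]) -> int:
    """Sum element counts over entries that expose numel(), such as torch.Tensor."""
    return sum(t.numel() for t in state_dict.values() if hasattr(t, "numel"))


def drop_optimizer_state(ckpt: Mapping[str, Any]) -> Dict[str, Any]:
    """Return a shallow copy of the checkpoint without its "optimizer" entry."""
    return {key: value for key, value in ckpt.items() if key != "optimizer"}


def save_weights_only(src: str, dst: str) -> int:
    """Write a copy of src without optimizer state to dst; return the parameter count."""
    import torch  # imported lazily so the helpers above work without PyTorch

    # weights_only=False unpickles arbitrary objects: only load files you trust.
    ckpt = torch.load(src, map_location="cpu", weights_only=False)
    torch.save(drop_optimizer_state(ckpt), dst)
    return count_parameters(ckpt["model"])
```

A call such as `save_weights_only("checkpoint_100000.pt", "weights_100000.pt")` would be expected to return a count near 14.83B if the state dict is stored unsharded. Buffers or a different storage layout would change that number.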
For inference, use the vLLM-converted safetensors version: mkd-hossain/keural-14.8b-stage2-vllm
Tokenizer
The model uses a custom SentencePiece tokenizer with 131,072 vocab size, trained on Korean and English text:
import sentencepiece as spm
sp = spm.SentencePieceProcessor()
sp.Load("keural_tokenizer.model")
ids = sp.Encode("안녕하세요, Hello!", out_type=int)
text = sp.Decode(ids)
Tokenizer available at: mkd-hossain/keural-14.8b-base
Intended Use
- Research on bilingual Korean-English language models
- Fine-tuning experiments starting from a pretrained base
- Studying MoE model behavior at different training stages
- Resuming or extending pretraining
Not intended for: Direct deployment without instruction tuning. These are raw pretrained checkpoints.
Citation
@misc{keural-14.8b-2026,
title = {Keural 14.8B: A Bilingual Korean-English MoE Language Model Trained from Scratch},
author = {MKD Hossain},
year = {2026},
url = {https://huggingface.co/mkd-hossain/keural-stage1-checkpoints}
}
Keural 14.8B: Trained from scratch on KT Cloud NIPA2-H200 infrastructure. MKD-CORP | 2026