Keural 14.8B — Stage 1 Pretraining Checkpoints

This repository contains raw pretraining checkpoints from Stage 1 of the Keural 14.8B MoE language model, trained from scratch on bilingual Korean-English data.

These are not instruction-tuned models. They are raw base checkpoints useful for research, resuming training, or fine-tuning experiments.

Model Overview

Field	Value
Architecture	Mixtral-style MoE (Mixture of Experts)
Parameters	14.83B total (2.6B active per token)
Layers	24 transformer blocks
Hidden size	4096
Attention heads	32 (8 KV heads, GQA)
Experts	8 total, top-2 routing
Vocab size	131,072 (custom SentencePiece)
Context length	4096 tokens
Languages	Korean + English
Training hardware	2× NVIDIA H200 141GB, FSDP
Infrastructure	KT Cloud NIPA2-H200
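
The routing and attention figures in the table can be made concrete with a small sketch. It is not taken from the Keural training code: the function names are hypothetical, the per-head dimension assumes hidden size divided by attention heads, and the router weighting (a softmax over only the selected logits) is one common Mixtral-style formulation that may differ from what Keural actually does.

```python
import math

# Values from the Model Overview table
NUM_EXPERTS = 8
TOP_K = 2
NUM_Q_HEADS = 32
NUM_KV_HEADS = 8
HIDDEN_SIZE = 4096

# Assumption: head_dim = hidden_size / attention heads (not stated in the table)
HEAD_DIM = HIDDEN_SIZE // NUM_Q_HEADS


def top2_route(router_logits):
    """Pick the TOP_K experts with the highest logits for one token.

    Returns (expert_index, weight) pairs. Weights are a softmax over the
    selected logits only, so they sum to 1.
    """
    if len(router_logits) != NUM_EXPERTS:
        raise ValueError("expected one logit per expert")
    chosen = sorted(range(NUM_EXPERTS),
                    key=lambda i: router_logits[i], reverse=True)[:TOP_K]
    peak = max(router_logits[i] for i in chosen)  # subtract for stability
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


def kv_head_for(query_head):
    """Map a query head to the KV head it shares under GQA."""
    group_size = NUM_Q_HEADS // NUM_KV_HEADS  # 4 query heads per KV head
    return query_head // group_size


routes = top2_route([0.1, 2.0, -1.0, 0.5, 2.0, 0.0, 0.3, -0.2])
```

Only two of the eight expert MLPs run for each token, which is why the active parameter count is much smaller than the total.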

Training Data

Stage 1 pretraining used ~64.56B tokens of bilingual web text:

Source	Language	Tokens
FineWeb	English	~40B
WanJuan	Chinese/English	~10B
HPLT	Korean	~8B
FineWeb2	English	~6B

Data was filtered for quality (spam removal, length filtering, deduplication) and tokenized with the custom Keural SentencePiece tokenizer (vocab=131,072).
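
The actual filtering pipeline is not described here, so the following is only a minimal sketch of two of the named steps, length filtering and exact deduplication. The thresholds, the normalization rule, and the function names are all hypothetical, and spam removal is left out entirely.

```python
import hashlib

# Hypothetical thresholds; the real values are not given in this README
MIN_CHARS = 200
MAX_CHARS = 100_000


def normalize(text):
    """Collapse whitespace and lowercase so trivial variants hash alike."""
    return " ".join(text.split()).lower()


def filter_and_dedup(docs):
    """Yield documents that pass the length filter and are not exact
    duplicates (after normalization) of an earlier document."""
    seen = set()
    for doc in docs:
        if not (MIN_CHARS <= len(doc) <= MAX_CHARS):
            continue  # length filtering
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact deduplication
        seen.add(digest)
        yield doc


sample = ["a" * 300, "short", "A" * 300, "b" * 300]
kept = list(filter_and_dedup(sample))
```

Web-scale pipelines usually add near-duplicate detection (for example MinHash) on top of exact hashing; whether Keural did so is not stated.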

Training Configuration

Parameter	Value
Total steps	100,000
Batch size	~645K tokens per step (~64.56B tokens / 100,000 steps)
Total tokens	~64.56B
Optimizer	AdamW
LR schedule	Cosine decay with warmup
Precision	bfloat16
Parallelism	FSDP (2× H200)
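
For readers unfamiliar with the schedule named in the table, here is a sketch of linear warmup followed by cosine decay over the 100,000 training steps. The peak learning rate, minimum learning rate, and warmup length are placeholders chosen for illustration; the README does not state the values used for Keural.

```python
import math

TOTAL_STEPS = 100_000  # from the table above

# Hypothetical hyperparameters: not stated in this README
PEAK_LR = 3e-4
MIN_LR = 3e-5
WARMUP_STEPS = 2_000


def lr_at(step):
    """Learning rate at a given step: linear warmup, then cosine decay."""
    if step < WARMUP_STEPS:
        # Ramp linearly from 0 up to PEAK_LR
        return PEAK_LR * step / WARMUP_STEPS
    # Fraction of the decay phase completed, clamped to [0, 1]
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    progress = min(max(progress, 0.0), 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + (PEAK_LR - MIN_LR) * cosine


for s in (0, WARMUP_STEPS, 50_000, TOTAL_STEPS):
    print(f"step {s:>7}: lr = {lr_at(s):.2e}")
```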

Checkpoints in This Repository

File	Step	Tokens Seen	Notes
`checkpoint_19000.pt`	19,000	~12.3B	Early training
`checkpoint_25000.pt`	25,000	~16.1B	Early training
`checkpoint_25500.pt`	25,500	~16.5B	Early training
`checkpoint_36000.pt`	36,000	~23.2B	Mid training
`checkpoint_50000.pt`	50,000	~32.3B	Mid training
`checkpoint_65000.pt`	65,000	~41.9B	Late training
`checkpoint_70000.pt`	70,000	~45.2B	Late training
`checkpoint_75000.pt`	75,000	~48.4B	Late training
`checkpoint_80000.pt`	80,000	~51.6B	Late training
`checkpoint_85000.pt`	85,000	~54.9B	Late training
`checkpoint_90000.pt`	90,000	~58.1B	Late training
`checkpoint_95000.pt`	95,000	~61.3B	Late training
`checkpoint_100000.pt`	100,000	~64.56B	Stage 1 Final
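
The "Tokens Seen" column can be approximated from the step number. This sketch assumes a constant number of tokens per step, derived from the stated totals (~64.56B tokens over 100,000 steps); the README does not confirm the batch size was constant, and the table values may differ from this estimate by about 0.1B due to rounding.

```python
TOTAL_STEPS = 100_000    # from the Training Configuration table
TOTAL_TOKENS = 64.56e9   # from the Training Configuration table

# Assumption: every step processed the same number of tokens
TOKENS_PER_STEP = TOTAL_TOKENS / TOTAL_STEPS  # about 645,600


def tokens_seen_billions(step):
    """Estimated tokens consumed by the given step, in billions."""
    return step * TOKENS_PER_STEP / 1e9


for step in (19_000, 50_000, 100_000):
    print(f"checkpoint_{step}.pt: ~{tokens_seen_billions(step):.1f}B tokens")
```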

Each .pt file is a full FSDP training checkpoint (~83GB) containing:

model — model state dict
optimizer — optimizer state
step — training step number
loss — training loss at that step
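
Before using a checkpoint, it can help to confirm that the four entries listed above are present. The helper below is a hypothetical convenience function, not part of the Keural code; it works on any dictionary, so in practice you would pass it the object returned by `torch.load`.

```python
EXPECTED_KEYS = ("model", "optimizer", "step", "loss")


def summarize_checkpoint(ckpt):
    """Check that the documented keys exist and return a small summary."""
    missing = [k for k in EXPECTED_KEYS if k not in ckpt]
    if missing:
        raise KeyError(f"checkpoint is missing keys: {missing}")
    return {
        "step": int(ckpt["step"]),
        "loss": float(ckpt["loss"]),
        # Number of entries in the model state dict
        "num_model_tensors": len(ckpt["model"]),
        # Anything beyond the four documented entries
        "extra_keys": sorted(set(ckpt) - set(EXPECTED_KEYS)),
    }


# Stand-in dictionary with made-up values, used here instead of a real file
fake_ckpt = {"model": {"w": [0.0]}, "optimizer": {}, "step": 100, "loss": 2.5}
summary = summarize_checkpoint(fake_ckpt)
```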

Full Training Pipeline

This is Stage 1 of the full Keural training pipeline:

Stage 1: Pretraining     → 100K steps, ~64.56B tokens  ✅ Done (this repo)
Stage 2: Annealing       → 20K steps,  ~5.16B tokens   ✅ Done
Stage 3: SFT             → 18K steps,  1.13M samples   🔄 In Progress
Stage 4: DPO             → ~8K steps,  233K pairs       ⏳ Planned

For the instruction-tuned model (Stage 3 SFT), see: mkd-hossain/keural-14.8b-sft

Loading a Checkpoint

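These checkpoints use the custom Keural model architecture, so you need the training code to instantiate the model and load the weights into it. The snippet below only opens the checkpoint and inspects its contents: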

import torch

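# weights_only=False unpickles arbitrary Python objects: only load files you trust.
# map_location="cpu" keeps the ~83GB checkpoint in host RAM rather than on a GPU.
ckpt = torch.load("checkpoint_100000.pt", map_location="cpu", weights_only=False)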
state_dict = ckpt["model"]
step = ckpt["step"]        # 100000
loss = ckpt["loss"]        # training loss

print(f"Step: {step} | Loss: {loss:.4f}")
print(f"Keys: {list(state_dict.keys())[:5]}")

For inference, use the vLLM-converted safetensors version: 👉 mkd-hossain/keural-14.8b-stage2-vllm

Tokenizer

The model uses a custom SentencePiece tokenizer with a vocabulary of 131,072 tokens, trained on Korean and English text:

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("keural_tokenizer.model")

ids = sp.Encode("안녕하세요, Hello!", out_type=int)
text = sp.Decode(ids)

Tokenizer available at: mkd-hossain/keural-14.8b-base

Intended Use

Research on bilingual Korean-English language models
Fine-tuning experiments starting from a pretrained base
Studying MoE model behavior at different training stages
Resuming or extending pretraining

Not intended for: Direct deployment without instruction tuning. These are raw pretrained checkpoints.

Citation

@misc{keural-14.8b-2026,
  title  = {Keural 14.8B: A Bilingual Korean-English MoE Language Model Trained from Scratch},
  author = {MKD Hossain},
  year   = {2026},
  url    = {https://huggingface.co/mkd-hossain/keural-stage1-checkpoints}
}

Keural 14.8B — Trained from scratch on KT Cloud NIPA2-H200 infrastructure. MKD-CORP | 2026
