Keural 14.8B: Stage 1 Pretraining Checkpoints

This repository contains raw pretraining checkpoints from Stage 1 of the Keural 14.8B MoE language model, trained from scratch on bilingual Korean-English data.

These are not instruction-tuned models. They are raw base checkpoints useful for research, resuming training, or fine-tuning experiments.


Model Overview

Field              Value
Architecture       Mixtral-style MoE (Mixture of Experts)
Parameters         14.83B total (2.6B active per token)
Layers             24 transformer blocks
Hidden size        4096
Attention heads    32 (8 KV heads, GQA)
Experts            8 total, top-2 routing
Vocab size         131,072 (custom SentencePiece)
Context length     4096 tokens
Languages          Korean + English
Training hardware  2× NVIDIA H200 141GB, FSDP
Infrastructure     KT Cloud NIPA2-H200
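
The routing code itself is not part of this repository, so the following is only a minimal sketch of what "8 experts, top-2 routing" and "32 heads, 8 KV heads" imply. It assumes Mixtral-style gating (softmax renormalized over the two selected experts) and a head dimension equal to hidden size divided by head count; the function name route_top_k and the example logits are hypothetical.

```python
import math

# Values taken from the overview table.
HIDDEN_SIZE = 4096
NUM_HEADS = 32
NUM_KV_HEADS = 8
NUM_EXPERTS = 8
TOP_K = 2

# Assumption: head_dim = hidden_size / num_heads, as in Mixtral.
HEAD_DIM = HIDDEN_SIZE // NUM_HEADS
# With GQA, each KV head is shared by this many query heads.
QUERY_HEADS_PER_KV_HEAD = NUM_HEADS // NUM_KV_HEADS


def route_top_k(router_logits, k=TOP_K):
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    # Subtract the max logit for numerical stability before exponentiating.
    peak = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


# Hypothetical router output for one token (one logit per expert).
example_logits = [0.1, 2.0, -1.0, 0.5, 1.0, 0.0, -0.3, 1.2]
routes = route_top_k(example_logits)
print(HEAD_DIM, QUERY_HEADS_PER_KV_HEAD)
print(routes)  # two (expert_index, weight) pairs whose weights sum to 1
```

Only the two selected experts run for a given token, which is why the active parameter count per token is far below the total.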

Training Data

Stage 1 pretraining used ~64.56B tokens of bilingual web text:

Source    Language         Tokens
FineWeb   English          ~40B
WanJuan   Chinese/English  ~10B
HPLT      Korean           ~8B
FineWeb2  English          ~6B

Data was filtered for quality (spam removal, length filtering, deduplication) and tokenized with the custom Keural SentencePiece tokenizer (vocab=131,072).
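
The filtering pipeline is not published here. As a rough illustration of two of the named steps (length filtering and deduplication), a sketch might look like the following. The function name, the character thresholds, and the choice of exact-match hashing are all assumptions; the real pipeline may use different limits or near-duplicate detection, and spam removal is not covered.

```python
import hashlib


def filter_and_dedup(docs, min_chars=200, max_chars=100_000):
    """Drop documents outside a length range, then drop exact duplicates.

    Thresholds are placeholders, not the values used for Keural.
    """
    seen = set()
    kept = []
    for text in docs:
        # Collapse whitespace so trivially different copies hash the same.
        normalized = " ".join(text.split())
        if not (min_chars <= len(normalized) <= max_chars):
            continue
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(text)
    return kept


# Tiny hypothetical corpus: one too-short document and two duplicates.
sample = ["short", "x" * 50, "x" * 50, "hello   world " * 5, "hello world " * 5]
kept = filter_and_dedup(sample, min_chars=20, max_chars=1000)
print(len(sample), "->", len(kept))
```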


Training Configuration

Parameter      Value
Total steps    100,000
Batch size     ~645.6K tokens per step (64.56B tokens / 100,000 steps)
Total tokens   ~64.56B
Optimizer      AdamW
Learning rate  Cosine schedule with warmup
Precision      bfloat16
Parallelism    FSDP (2× H200)
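
The table names the schedule shape but not its hyperparameters. To make "cosine schedule with warmup" concrete, here is a minimal sketch; the peak learning rate, floor, and warmup length are hypothetical placeholders, not the values used for Keural, and linear warmup is assumed.

```python
import math


def lr_at(step, total_steps=100_000, warmup_steps=2_000,
          peak_lr=3e-4, min_lr=3e-5):
    """Linear warmup to peak_lr, then cosine decay to min_lr.

    All default values except total_steps are illustrative assumptions.
    """
    if step < warmup_steps:
        # Linear ramp that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    # The cosine factor goes from 1 at progress=0 to 0 at progress=1.
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine


for s in (0, 2_000, 50_000, 100_000):
    print(s, lr_at(s))
```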

Checkpoints in This Repository

File                  Step     Tokens Seen  Notes
checkpoint_19000.pt   19,000   ~12.3B       Early training
checkpoint_25000.pt   25,000   ~16.1B       Early training
checkpoint_25500.pt   25,500   ~16.5B       Early training
checkpoint_36000.pt   36,000   ~23.2B       Mid training
checkpoint_50000.pt   50,000   ~32.3B       Mid training
checkpoint_65000.pt   65,000   ~41.9B       Late training
checkpoint_70000.pt   70,000   ~45.2B       Late training
checkpoint_75000.pt   75,000   ~48.4B       Late training
checkpoint_80000.pt   80,000   ~51.6B       Late training
checkpoint_85000.pt   85,000   ~54.9B       Late training
checkpoint_90000.pt   90,000   ~58.1B       Late training
checkpoint_95000.pt   95,000   ~61.3B       Late training
checkpoint_100000.pt  100,000  ~64.56B      Stage 1 Final
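
The "Tokens Seen" column follows from the totals stated above: ~64.56B tokens over 100,000 steps is roughly 645.6K tokens per step. A small sketch that reproduces the estimate from a checkpoint filename, assuming a constant number of tokens per step throughout Stage 1 (the helper names are hypothetical):

```python
import re

TOTAL_TOKENS = 64.56e9   # Stage 1 total, from the training configuration
TOTAL_STEPS = 100_000
TOKENS_PER_STEP = TOTAL_TOKENS / TOTAL_STEPS  # about 645.6K


def step_from_filename(name):
    """Extract the step number from a name like 'checkpoint_19000.pt'."""
    match = re.fullmatch(r"checkpoint_(\d+)\.pt", name)
    if match is None:
        raise ValueError(f"unexpected checkpoint name: {name}")
    return int(match.group(1))


def tokens_seen(step):
    """Approximate tokens processed by a given step (constant batch assumed)."""
    return step * TOKENS_PER_STEP


for name in ("checkpoint_19000.pt", "checkpoint_50000.pt"):
    step = step_from_filename(name)
    print(name, f"~{tokens_seen(step) / 1e9:.1f}B tokens")
```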

Each .pt file is a full FSDP training checkpoint (~83GB) containing:

  • model: model state dict
  • optimizer: optimizer state
  • step: training step number
  • loss: training loss at that step

Full Training Pipeline

This is Stage 1 of the full Keural training pipeline:

Stage 1: Pretraining     → 100K steps, ~64.56B tokens  ✅ Done (this repo)
Stage 2: Annealing       → 20K steps,  ~5.16B tokens   ✅ Done
Stage 3: SFT             → 18K steps,  1.13M samples   🔄 In Progress
Stage 4: DPO             → ~8K steps,  233K pairs      ⏳ Planned

For the final instruction-tuned model, see: mkd-hossain/keural-14.8b-sft


Loading a Checkpoint

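These checkpoints use the custom Keural model architecture. Reading the checkpoint dictionary needs only PyTorch, as shown here; instantiating the model from the state dict requires the Keural training code: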

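import torch

# weights_only=False unpickles arbitrary Python objects, so only load files you trust.
# map_location="cpu" keeps the full (~83GB) checkpoint in host memory rather than on a GPU.
ckpt = torch.load("checkpoint_100000.pt", map_location="cpu", weights_only=False)
state_dict = ckpt["model"]  # model state dict
step = ckpt["step"]         # training step number, e.g. 100000
loss = ckpt["loss"]         # training loss at that step

print(f"Step: {step} | Loss: {float(loss):.4f}")
print(f"Keys: {list(state_dict.keys())[:5]}")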
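
Checkpoints saved from wrapped models sometimes carry wrapper segments in their parameter names (for example "_fsdp_wrapped_module" from FSDP, "_orig_mod" from torch.compile, or a leading "module" from DDP). Whether these checkpoints do is not stated here, so treat the following as a hedged sketch: the helper names are hypothetical, and it only shows one way to normalize keys and compare them against what a model expects before calling load_state_dict.

```python
# Wrapper segments that PyTorch utilities may insert into parameter names.
WRAPPER_SEGMENTS = {"_fsdp_wrapped_module", "_orig_mod"}


def normalize_keys(state_dict):
    """Return a copy of state_dict with known wrapper segments removed."""
    cleaned = {}
    for key, value in state_dict.items():
        parts = key.split(".")
        # DDP prefixes every key with a single leading "module".
        if parts and parts[0] == "module":
            parts = parts[1:]
        parts = [p for p in parts if p not in WRAPPER_SEGMENTS]
        new_key = ".".join(parts)
        if new_key in cleaned:
            raise KeyError(f"key collision after normalization: {new_key}")
        cleaned[new_key] = value
    return cleaned


def diff_keys(state_dict, expected_keys):
    """Return (missing, unexpected) key lists relative to expected_keys."""
    have, want = set(state_dict), set(expected_keys)
    return sorted(want - have), sorted(have - want)


# Hypothetical key names, for illustration only.
raw = {
    "module._fsdp_wrapped_module.embed.weight": 0,
    "module.layers.0._fsdp_wrapped_module.attn.q_proj.weight": 1,
}
clean = normalize_keys(raw)
missing, unexpected = diff_keys(clean, ["embed.weight", "lm_head.weight"])
print(sorted(clean))
print("missing:", missing, "unexpected:", unexpected)
```

Checking the missing and unexpected lists first gives a clearer error than a failed strict load on a checkpoint of this size.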

For inference, use the vLLM-converted safetensors version: 👉 mkd-hossain/keural-14.8b-stage2-vllm


Tokenizer

The model uses a custom SentencePiece tokenizer with a vocabulary of 131,072 tokens, trained on Korean and English text:

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("keural_tokenizer.model")

ids = sp.Encode("안녕하세요, Hello!", out_type=int)
text = sp.Decode(ids)

Tokenizer available at: mkd-hossain/keural-14.8b-base


Intended Use

  • Research on bilingual Korean-English language models
  • Fine-tuning experiments starting from a pretrained base
  • Studying MoE model behavior at different training stages
  • Resuming or extending pretraining

Not intended for: Direct deployment without instruction tuning. These are raw pretrained checkpoints.


Citation

@misc{keural-14.8b-2026,
  title  = {Keural 14.8B: A Bilingual Korean-English MoE Language Model Trained from Scratch},
  author = {MKD Hossain},
  year   = {2026},
  url    = {https://huggingface.co/mkd-hossain/keural-stage1-checkpoints}
}

Keural 14.8B: Trained from scratch on KT Cloud NIPA2-H200 infrastructure. MKD-CORP | 2026
