# Keural 14.8B: SFT Training Checkpoints
This repository contains mid-training checkpoints from Stage 3 (SFT, Supervised Fine-Tuning) of the Keural 14.8B MoE language model.
These checkpoints are saved during SFT training and are useful for research, comparing training progress, or resuming training.
## Model Overview
| Field | Value |
|---|---|
| Architecture | Mixtral-style MoE (Mixture of Experts) |
| Parameters | 14.83B total (2.6B active per token) |
| Layers | 24 transformer blocks |
| Hidden size | 4096 |
| Attention heads | 32 (8 KV heads, GQA) |
| Experts | 8 total, top-2 routing |
| Vocab size | 131,074 (131,072 base + 2 ChatML special tokens) |
| Context length | 4096 tokens |
| Languages | Korean + English |
| Training hardware | 2× NVIDIA H200 150GB, FSDP |
| Infrastructure | KT Cloud NIPA2-H200 |
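
For context on the "8 experts, top-2 routing" row, here is a minimal sketch of a Mixtral-style top-2 gate. It is illustrative only, not the actual Keural routing code; the class name and shapes are assumptions based on the table above.

```python
import torch
import torch.nn.functional as F

class Top2Gate(torch.nn.Module):
    """Illustrative Mixtral-style router: choose 2 of 8 experts per token."""
    def __init__(self, hidden_size: int = 4096, num_experts: int = 8):
        super().__init__()
        self.gate = torch.nn.Linear(hidden_size, num_experts, bias=False)

    def forward(self, x: torch.Tensor):
        # x: (num_tokens, hidden_size)
        logits = self.gate(x)                                  # (num_tokens, num_experts)
        weights, expert_idx = torch.topk(logits, k=2, dim=-1)  # top-2 experts per token
        weights = F.softmax(weights, dim=-1)                   # renormalize over the 2 picked
        return weights, expert_idx
```

Only the two selected experts' feed-forward blocks run for each token, which is how 14.83B total parameters translate to roughly 2.6B active parameters per token.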
## SFT Training Details
| Parameter | Value |
|---|---|
| Base model | Keural 14.8B Stage 2 (120K total steps, after annealing) |
| Dataset | mkd-chanwoo/keural-SFT (1,134,119 samples) |
| Format | ChatML (<|im_start|> / <|im_end|>) |
| Total steps | 18,000 |
| Learning rate | 1e-5 → 1e-6 (cosine decay) |
| Batch size | 2 per GPU × 2 GPUs |
| Precision | bfloat16 |
| Loss masking | Assistant turns only |
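
"Loss masking: assistant turns only" means the cross-entropy loss is computed only on tokens inside assistant turns. Below is a minimal sketch of that masking, assuming a precomputed per-token assistant mask; how that mask is derived from the ChatML turn boundaries is dataset-specific and not part of this repository.

```python
import torch

IGNORE_INDEX = -100  # PyTorch's cross_entropy skips targets labeled -100

def mask_non_assistant(labels: torch.Tensor, assistant_mask: torch.Tensor) -> torch.Tensor:
    """Keep loss only on assistant tokens.

    labels:         (seq_len,) target token IDs
    assistant_mask: (seq_len,) 1 where the token belongs to an assistant turn, else 0
    """
    labels = labels.clone()
    labels[assistant_mask == 0] = IGNORE_INDEX  # system/user tokens contribute no loss
    return labels
```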
## Special Tokens Added

| Token | ID |
|---|---|
| `<\|im_start\|>` | 131072 |
| `<\|im_end\|>` | 131073 |
Embedding layer resized from 131,072 → 131,074 at SFT start. New token embeddings initialized to the mean of existing embeddings.
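
A minimal sketch of that resize-and-mean-initialize step, assuming a plain `torch.nn.Embedding` token-embedding layer (the actual Keural module names are not published here):

```python
import torch

def grow_embedding(old: torch.nn.Embedding, new_vocab_size: int = 131074) -> torch.nn.Embedding:
    old_vocab, dim = old.weight.shape                   # e.g. 131072 x 4096
    new = torch.nn.Embedding(new_vocab_size, dim).to(old.weight.dtype)
    with torch.no_grad():
        new.weight[:old_vocab] = old.weight             # copy existing rows unchanged
        # Rows for <|im_start|> / <|im_end|> start at the mean of the existing embeddings.
        new.weight[old_vocab:] = old.weight.mean(dim=0, keepdim=True)
    return new
```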
## Chat Format (ChatML)

All prompts must use ChatML format:

```
<|im_start|>system
You are a helpful assistant.
<|im_end|>
<|im_start|>user
What is the capital of South Korea?
<|im_end|>
<|im_start|>assistant
The capital of South Korea is Seoul.
<|im_end|>
```
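
A small helper (hypothetical, not shipped with this repository) that renders a list of `{"role", "content"}` messages into the ChatML layout shown above:

```python
def build_chatml(messages, add_generation_prompt=True):
    """messages: list of {"role": "system"|"user"|"assistant", "content": str}."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}\n<|im_end|>\n" for m in messages]
    if add_generation_prompt:
        parts.append("<|im_start|>assistant\n")  # the model continues from here
    return "".join(parts)

prompt = build_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of South Korea?"},
])
```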
## Checkpoints in This Repository

| File | Step | % Complete | Notes |
|---|---|---|---|
| `checkpoint_5000.pt` | 5,000 | 28% | Early SFT: learning format |
| `checkpoint_10000.pt` | 10,000 | 56% | Mid SFT: improving responses |
More checkpoints will be added as training progresses toward step 18,000.
Each `.pt` file is a full FSDP training checkpoint (~83 GB) containing:

- `model`: model state dict (vocab_size=131,074)
- `optimizer`: optimizer state
- `step`: training step number
- `loss`: training loss at that step
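
Because optimizer state is included, training can be resumed from any of these files. A rough sketch, assuming you can rebuild the Keural model and its optimizer with the original training configuration (neither ships with this repository):

```python
import torch

def resume_from(path, model, optimizer):
    """model/optimizer must be constructed with the same config used in training."""
    ckpt = torch.load(path, map_location="cpu", weights_only=False)
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["step"] + 1  # next training step, counting toward 18,000
```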
## Full Training Pipeline
| Stage | Details | Status |
|---|---|---|
| Stage 1: Pretraining | 100K steps, ~64.56B tokens | Done |
| Stage 2: Annealing | 20K steps, ~5.16B tokens | Done |
| Stage 3: SFT | 18K steps, 1.13M ChatML samples | In Progress |
| Stage 4: DPO | ~8K steps, 233K preference pairs | Planned |
Related repositories:
- Base checkpoints: mkd-hossain/keural-stage1-checkpoints
- Stage 2 base (vLLM): mkd-hossain/keural-14.8b-stage2-vllm
- SFT model (vLLM, 7K steps): mkd-hossain/keural-14.8b-sft
- DPO dataset: mkd-hossain/keural-dpo-dataset
## Loading a Checkpoint
```python
import torch

ckpt = torch.load("checkpoint_10000.pt", map_location="cpu", weights_only=False)
state_dict = ckpt["model"]
step = ckpt["step"]  # 10000
loss = ckpt["loss"]
print(f"Step: {step} | Loss: {loss:.4f}")
```
Note: These checkpoints require the custom Keural model architecture code. Vocab size is 131,074 (not 131,072).
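
Continuing from the snippet above, a quick (hypothetical) sanity check that the loaded state dict really carries the enlarged 131,074-row embedding; the exact parameter names depend on the Keural architecture code:

```python
# List 2-D weights whose shape includes the expected vocab size of 131,074.
emb_keys = [k for k, v in state_dict.items()
            if k.endswith("weight") and v.ndim == 2 and 131074 in v.shape]
print(emb_keys)  # expect the input embedding (and possibly the output head)
```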
## Tokenizer
Custom SentencePiece tokenizer with 2 added ChatML special tokens:
```python
import sentencepiece as spm
import re

sp = spm.SentencePieceProcessor()
sp.Load("keural_tokenizer.model")

def encode(text):
    # Route the ChatML special tokens to their fixed IDs; everything else
    # goes through the SentencePiece model.
    parts = re.split(r'(<\|im_start\|>|<\|im_end\|>)', text)
    ids = []
    for part in parts:
        if part == '<|im_start|>':
            ids.append(131072)
        elif part == '<|im_end|>':
            ids.append(131073)
        elif part:
            ids.extend(sp.Encode(part, out_type=int))
    return ids
```
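
Example round trip with the encoder above; the specials map to their fixed IDs while everything else goes through SentencePiece:

```python
prompt = "<|im_start|>user\nWhat is the capital of South Korea?\n<|im_end|>"
ids = encode(prompt)
print(ids[0], ids[-1])  # 131072 ... 131073
```

Without this wrapper, SentencePiece would split the literal `<|im_start|>` string into ordinary sub-word pieces instead of emitting the reserved IDs.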
## Citation

```bibtex
@misc{keural-14.8b-sft-2026,
  title  = {Keural 14.8B: SFT Training Checkpoints},
  author = {MKD Hossain},
  year   = {2026},
  url    = {https://huggingface.co/mkd-hossain/keural-sft-checkpoints}
}
```
Keural 14.8B, trained from scratch on KT Cloud NIPA2-H200 infrastructure. MKD-CORP | 2026