# Keural 14.8B: SFT Training Checkpoints
This repository contains mid-training checkpoints from Stage 3 (SFT, Supervised Fine-Tuning) of the Keural 14.8B MoE language model.
These checkpoints are saved during SFT training and are useful for research, comparing training progress, or resuming training.
## Model Overview
| Field | Value |
|---|---|
| Architecture | Mixtral-style MoE (Mixture of Experts) |
| Parameters | 14.83B total (2.6B active per token) |
| Layers | 24 transformer blocks |
| Hidden size | 4096 |
| Attention heads | 32 (8 KV heads, GQA) |
| Experts | 8 total, top-2 routing |
| Vocab size | 131,074 (131,072 base + 2 ChatML special tokens) |
| Context length | 4096 tokens |
| Languages | Korean + English |
| Training hardware | 2× NVIDIA H200 150GB, FSDP |
| Infrastructure | KT Cloud NIPA2-H200 |
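
For context on the "8 experts, top-2 routing" row, here is a minimal sketch of a Mixtral-style top-2 gate. It is illustrative only, not the actual Keural routing code; the class name and shapes are assumptions based on the table above.

```python
import torch
import torch.nn.functional as F

class Top2Gate(torch.nn.Module):
    """Illustrative Mixtral-style router: choose 2 of 8 experts per token."""
    def __init__(self, hidden_size: int = 4096, num_experts: int = 8):
        super().__init__()
        self.gate = torch.nn.Linear(hidden_size, num_experts, bias=False)

    def forward(self, x: torch.Tensor):
        # x: (num_tokens, hidden_size)
        logits = self.gate(x)                                  # (num_tokens, num_experts)
        weights, expert_idx = torch.topk(logits, k=2, dim=-1)  # top-2 experts per token
        weights = F.softmax(weights, dim=-1)                   # renormalize over the 2 picked
        return weights, expert_idx
```

Only the two selected experts' feed-forward blocks run for each token, which is how 14.83B total parameters translate to roughly 2.6B active parameters per token.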
## SFT Training Details
| Parameter | Value |
|---|---|
| Base model | Keural 14.8B Stage 2 (120K total steps, after annealing) |
| Dataset | mkd-chanwoo/keural-SFT (1,134,119 samples) |
| Format | ChatML (<|im_start|> / <|im_end|>) |
| Total steps | 18,000 |
| Learning rate | 1e-5 → 1e-6 (cosine decay) |
| Batch size | 2 per GPU × 2 GPUs |
| Precision | bfloat16 |
| Loss masking | Assistant turns only |
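
"Loss masking: assistant turns only" means the cross-entropy loss is computed only on tokens inside assistant turns. Below is a minimal sketch of that masking, assuming a precomputed per-token assistant mask; how that mask is derived from the ChatML turn boundaries is dataset-specific and not part of this repository.

```python
import torch

IGNORE_INDEX = -100  # PyTorch's cross_entropy skips targets labeled -100

def mask_non_assistant(labels: torch.Tensor, assistant_mask: torch.Tensor) -> torch.Tensor:
    """Keep loss only on assistant tokens.

    labels:         (seq_len,) target token IDs
    assistant_mask: (seq_len,) 1 where the token belongs to an assistant turn, else 0
    """
    labels = labels.clone()
    labels[assistant_mask == 0] = IGNORE_INDEX  # system/user tokens contribute no loss
    return labels
```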
## Special Tokens Added

| Token | ID |
|---|---|
| `<\|im_start\|>` | 131072 |
| `<\|im_end\|>` | 131073 |
Embedding layer resized from 131,072 → 131,074 at SFT start. New token embeddings initialized to the mean of existing embeddings.
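
A minimal sketch of that resize-and-mean-initialize step, assuming a plain `torch.nn.Embedding` token-embedding layer (the actual Keural module names are not published here):

```python
import torch

def grow_embedding(old: torch.nn.Embedding, new_vocab_size: int = 131074) -> torch.nn.Embedding:
    old_vocab, dim = old.weight.shape                   # e.g. 131072 x 4096
    new = torch.nn.Embedding(new_vocab_size, dim).to(old.weight.dtype)
    with torch.no_grad():
        new.weight[:old_vocab] = old.weight             # copy existing rows unchanged
        # Rows for <|im_start|> / <|im_end|> start at the mean of the existing embeddings.
        new.weight[old_vocab:] = old.weight.mean(dim=0, keepdim=True)
    return new
```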
## Chat Format (ChatML)

All prompts must use ChatML format:

```
<|im_start|>system
You are a helpful assistant.
<|im_end|>
<|im_start|>user
What is the capital of South Korea?
<|im_end|>
<|im_start|>assistant
The capital of South Korea is Seoul.
<|im_end|>
```
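
A small helper (hypothetical, not shipped with this repository) that renders a list of `{"role", "content"}` messages into the ChatML layout shown above:

```python
def build_chatml(messages, add_generation_prompt=True):
    """messages: list of {"role": "system"|"user"|"assistant", "content": str}."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}\n<|im_end|>\n" for m in messages]
    if add_generation_prompt:
        parts.append("<|im_start|>assistant\n")  # the model continues from here
    return "".join(parts)

prompt = build_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of South Korea?"},
])
```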
## Checkpoints in This Repository

| File | Step | % Complete | Notes |
|---|---|---|---|
| `checkpoint_5000.pt` | 5,000 | 28% | Early SFT: learning format |
| `checkpoint_10000.pt` | 10,000 | 56% | Mid SFT: improving responses |
More checkpoints will be added as training progresses toward step 18,000.
Each `.pt` file is a full FSDP training checkpoint (~83 GB) containing:

- `model`: model state dict (vocab_size=131,074)
- `optimizer`: optimizer state
- `step`: training step number
- `loss`: training loss at that step
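
Because optimizer state is included, training can be resumed from any of these files. A rough sketch, assuming you can rebuild the Keural model and its optimizer with the original training configuration (neither ships with this repository):

```python
import torch

def resume_from(path, model, optimizer):
    """model/optimizer must be constructed with the same config used in training."""
    ckpt = torch.load(path, map_location="cpu", weights_only=False)
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["step"] + 1  # next training step, counting toward 18,000
```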
## Full Training Pipeline
| Stage | Details | Status |
|---|---|---|
| Stage 1: Pretraining | 100K steps, ~64.56B tokens | Done |
| Stage 2: Annealing | 20K steps, ~5.16B tokens | Done |
| Stage 3: SFT | 18K steps, 1.13M ChatML samples | In Progress |
| Stage 4: DPO | ~8K steps, 233K preference pairs | Planned |
Related repositories:
- Base checkpoints: mkd-hossain/keural-stage1-checkpoints
- Stage 2 base (vLLM): mkd-hossain/keural-14.8b-stage2-vllm
- SFT model (vLLM, 7K steps): mkd-hossain/keural-14.8b-sft
- DPO dataset: mkd-hossain/keural-dpo-dataset
## Loading a Checkpoint
```python
import torch

ckpt = torch.load("checkpoint_10000.pt", map_location="cpu", weights_only=False)
state_dict = ckpt["model"]
step = ckpt["step"]  # 10000
loss = ckpt["loss"]
print(f"Step: {step} | Loss: {loss:.4f}")
```
Note: These checkpoints require the custom Keural model architecture code. Vocab size is 131,074 (not 131,072).
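
Continuing from the snippet above, a quick (hypothetical) sanity check that the loaded state dict really carries the enlarged 131,074-row embedding; the exact parameter names depend on the Keural architecture code:

```python
# List 2-D weights whose shape includes the expected vocab size of 131,074.
emb_keys = [k for k, v in state_dict.items()
            if k.endswith("weight") and v.ndim == 2 and 131074 in v.shape]
print(emb_keys)  # expect the input embedding (and possibly the output head)
```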
## Tokenizer
Custom SentencePiece tokenizer with 2 added ChatML special tokens:
```python
import sentencepiece as spm
import re

sp = spm.SentencePieceProcessor()
sp.Load("keural_tokenizer.model")

def encode(text):
    # Route the ChatML special tokens to their fixed IDs; everything else
    # goes through the SentencePiece model.
    parts = re.split(r'(<\|im_start\|>|<\|im_end\|>)', text)
    ids = []
    for part in parts:
        if part == '<|im_start|>':
            ids.append(131072)
        elif part == '<|im_end|>':
            ids.append(131073)
        elif part:
            ids.extend(sp.Encode(part, out_type=int))
    return ids
```
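
Example round trip with the encoder above; the specials map to their fixed IDs while everything else goes through SentencePiece:

```python
prompt = "<|im_start|>user\nWhat is the capital of South Korea?\n<|im_end|>"
ids = encode(prompt)
print(ids[0], ids[-1])  # 131072 ... 131073
```

Without this wrapper, SentencePiece would split the literal `<|im_start|>` string into ordinary sub-word pieces instead of emitting the reserved IDs.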
## Citation

```bibtex
@misc{keural-14.8b-sft-2026,
  title  = {Keural 14.8B: SFT Training Checkpoints},
  author = {MKD Hossain},
  year   = {2026},
  url    = {https://huggingface.co/mkd-hossain/keural-sft-checkpoints}
}
```
Keural 14.8B, trained from scratch on KT Cloud NIPA2-H200 infrastructure. MKD-CORP | 2026