Keural 14.8B – SFT Training Checkpoints

This repository contains mid-training checkpoints from Stage 3 (SFT, Supervised Fine-Tuning) of the Keural 14.8B MoE language model.

These checkpoints are saved periodically during SFT and are useful for research, for comparing progress across training steps, or for resuming training.


Model Overview

Field               Value
-----------------   -----------------------------------------------
Architecture        Mixtral-style MoE (Mixture of Experts)
Parameters          14.83B total (2.6B active per token)
Layers              24 transformer blocks
Hidden size         4096
Attention heads     32 (8 KV heads, GQA)
Experts             8 total, top-2 routing
Vocab size          131,074 (131,072 base + 2 ChatML special tokens)
Context length      4096 tokens
Languages           Korean + English
Training hardware   2× NVIDIA H200 150GB, FSDP
Infrastructure      KT Cloud NIPA2-H200
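
For reference, the hyperparameters above map onto a model configuration along these lines. This is only an illustrative sketch; the class and field names are assumptions, not the actual Keural config:

from dataclasses import dataclass

@dataclass
class KeuralConfig:
    # Values taken from the Model Overview table; names are illustrative only.
    vocab_size: int = 131_074        # 131,072 base + 2 ChatML special tokens
    hidden_size: int = 4096
    num_hidden_layers: int = 24
    num_attention_heads: int = 32
    num_key_value_heads: int = 8     # grouped-query attention (GQA)
    num_experts: int = 8
    num_experts_per_token: int = 2   # top-2 routing
    max_position_embeddings: int = 4096

config = KeuralConfig()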

SFT Training Details

Parameter       Value
-------------   --------------------------------------------------------------
Base model      Keural 14.8B Stage 2 (after 120K total steps: 100K pretraining + 20K annealing)
Dataset         mkd-chanwoo/keural-SFT (1,134,119 samples)
Format          ChatML (<|im_start|> / <|im_end|>)
Total steps     18,000
Learning rate   1e-5 → 1e-6 (cosine decay)
Batch size      2 per GPU × 2 GPUs
Precision       bfloat16
Loss masking    Assistant turns only (see the sketch below)
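
Masking the loss to assistant turns is commonly done by setting the labels of all non-assistant tokens to the ignore index, so they contribute nothing to the cross-entropy loss. The following is a minimal sketch of that idea, not the actual Keural training code (the -100 value follows the PyTorch convention):

import torch

def mask_non_assistant(labels: torch.Tensor, assistant_mask: torch.Tensor) -> torch.Tensor:
    # labels: (batch, seq_len) target token IDs
    # assistant_mask: (batch, seq_len) bool, True only for tokens inside assistant turns
    labels = labels.clone()
    labels[~assistant_mask] = -100   # -100 is the ignore_index for cross-entropy
    return labels

# The loss is then computed as usual, e.g.:
#   torch.nn.functional.cross_entropy(logits.view(-1, vocab_size),
#                                     masked_labels.view(-1), ignore_index=-100)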

Special Tokens Added

Token           ID
-------------   -------
<|im_start|>    131072
<|im_end|>      131073

The embedding layer was resized from 131,072 to 131,074 rows at the start of SFT. The two new token embeddings were initialized to the mean of the existing embeddings.
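
A minimal sketch of that resize-and-initialize step, assuming the token embedding is a plain nn.Embedding (illustrative only, not the actual Keural code):

import torch
import torch.nn as nn

def resize_embedding_with_mean_init(old_emb: nn.Embedding, new_vocab_size: int) -> nn.Embedding:
    old_vocab_size, dim = old_emb.weight.shape            # e.g. 131072, 4096
    new_emb = nn.Embedding(new_vocab_size, dim)
    with torch.no_grad():
        new_emb.weight[:old_vocab_size] = old_emb.weight
        # New rows (<|im_start|>, <|im_end|>) start at the mean of the existing embeddings.
        new_emb.weight[old_vocab_size:] = old_emb.weight.mean(dim=0)
    return new_emb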


Chat Format (ChatML)

All prompts must use ChatML format:

<|im_start|>system
You are a helpful assistant.
<|im_end|>
<|im_start|>user
What is the capital of South Korea?
<|im_end|>
<|im_start|>assistant
The capital of South Korea is Seoul.
<|im_end|>
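
A prompt in this format can be assembled from a list of role/content messages. The helper below is a convenience sketch; its name and the message structure are assumptions, not part of the released code:

def build_chatml_prompt(messages, add_generation_prompt=True):
    # messages: list of {"role": "system" | "user" | "assistant", "content": str}
    parts = []
    for m in messages:
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}\n<|im_end|>\n")
    if add_generation_prompt:
        # Leave the prompt open at an assistant turn for generation.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

prompt = build_chatml_prompt([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of South Korea?"},
])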

Checkpoints in This Repository

File                  Step     % Complete   Notes
-------------------   ------   ----------   -----------------------------
checkpoint_5000.pt    5,000    28%          Early SFT – learning format
checkpoint_10000.pt   10,000   56%          Mid SFT – improving responses

More checkpoints will be added as training progresses toward step 18,000.

Each .pt file is a full FSDP training checkpoint (~83GB) containing:

  • model – model state dict (vocab_size=131,074)
  • optimizer – optimizer state
  • step – training step number
  • loss – training loss at that step

Full Training Pipeline

Stage                  Details                            Status
--------------------   --------------------------------   --------------
Stage 1: Pretraining   100K steps, ~64.56B tokens         ✅ Done
Stage 2: Annealing     20K steps, ~5.16B tokens           ✅ Done
Stage 3: SFT           18K steps, 1.13M ChatML samples    🔄 In Progress
Stage 4: DPO           ~8K steps, 233K preference pairs   ⏳ Planned

Related repositories:


Loading a Checkpoint

import torch

# Load the full FSDP training checkpoint on CPU (each file is ~83GB, so plan RAM accordingly).
# weights_only=False loads the whole checkpoint dict, not just the tensors.
ckpt = torch.load("checkpoint_10000.pt", map_location="cpu", weights_only=False)

state_dict = ckpt["model"]    # model weights (vocab_size=131,074)
step = ckpt["step"]           # 10000
loss = ckpt["loss"]           # training loss at this step

print(f"Step: {step} | Loss: {loss:.4f}")

Note: These checkpoints require the custom Keural model architecture code. Vocab size is 131,074 (not 131,072).
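
The optimizer state accounts for a large share of the ~83GB file, so for inference it can be convenient to strip a checkpoint down to the model weights only. A small sketch (the output filename is arbitrary):

import torch

ckpt = torch.load("checkpoint_10000.pt", map_location="cpu", weights_only=False)
# Keep only the weights (plus the step number for bookkeeping); drop the optimizer state.
torch.save({"model": ckpt["model"], "step": ckpt["step"]}, "checkpoint_10000_model_only.pt")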


Tokenizer

Custom SentencePiece tokenizer with 2 added ChatML special tokens:

import sentencepiece as spm
import re

sp = spm.SentencePieceProcessor()
sp.Load("keural_tokenizer.model")

def encode(text):
    # Split the text around the ChatML special tokens so they are mapped to their
    # fixed IDs instead of being broken into pieces by SentencePiece.
    parts = re.split(r'(<\|im_start\|>|<\|im_end\|>)', text)
    ids = []
    for part in parts:
        if part == '<|im_start|>':
            ids.append(131072)
        elif part == '<|im_end|>':
            ids.append(131073)
        elif part:  # ordinary text goes through the SentencePiece model
            ids.extend(sp.Encode(part, out_type=int))
    return ids
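
A matching decode helper and a quick usage check, sketched under the assumption that the two special IDs simply map back to their literal strings:

def decode(ids):
    special = {131072: '<|im_start|>', 131073: '<|im_end|>'}
    out, buffer = [], []
    for i in ids:
        if i in special:
            if buffer:                      # flush ordinary tokens before the special token
                out.append(sp.Decode(buffer))
                buffer = []
            out.append(special[i])
        else:
            buffer.append(i)
    if buffer:
        out.append(sp.Decode(buffer))
    return ''.join(out)

ids = encode("<|im_start|>user\nWhat is the capital of South Korea?\n<|im_end|>\n")
print(decode(ids))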

Citation

@misc{keural-14.8b-sft-2026,
  title  = {Keural 14.8B: SFT Training Checkpoints},
  author = {MKD Hossain},
  year   = {2026},
  url    = {https://huggingface.co/mkd-hossain/keural-sft-checkpoints}
}

Keural 14.8B – Trained from scratch on KT Cloud NIPA2-H200 infrastructure. MKD-CORP | 2026
