Keural-SFT2-14.83B (SFT Epoch 2, step 29,112)
Keural is a bilingual Korean-English Mixture-of-Experts language model trained entirely from scratch; no base model was used. This is the SFT epoch 2 checkpoint at step 29,112, continuing supervised fine-tuning from the epoch 1 checkpoint (18,000 steps) on the same 710K ChatML dataset.
Model Details
| Property | Value |
|---|---|
| Architecture | Mixtral-style MoE (8 experts, top-2 routing) |
| Parameters | 14.83B total / ~7.42B active per token |
| Layers | 24 |
| Hidden size | 4096 |
| Attention heads | 32 (GQA, 8 KV heads) |
| Head dim | 128 |
| Expert intermediate size | 5,632 |
| Experts | 8 total, top-2 per token |
| Context length | 4,096 tokens |
| Vocabulary | 131,074 (131,072 SPM + `<\|im_start\|>` + `<\|im_end\|>`) |
| RoPE theta | 500,000 |
| Sliding window | 512 (alternating every other layer) |
| Norm | RMSNorm (eps=1e-5) |
| Activation | SiLU |
| Dtype | bfloat16 |
| Languages | Korean (primary), English |
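
As a rough cross-check of the table above, the total parameter count can be estimated from the listed dimensions. The sketch below assumes tied input/output embeddings, bias-free linear layers, three weight matrices per expert (gate, up, down, as in Mixtral-style SwiGLU blocks), one linear router per layer, and two RMSNorm weights per layer plus a final norm. None of these details are stated in this card, so treat the result as an estimate rather than the author's own accounting.

```python
# Rough parameter-count estimate from the Model Details table.
# Assumptions (not stated in the card): tied embeddings, no biases,
# gate/up/down projections per expert, one router and two RMSNorms per layer.

HIDDEN = 4096
LAYERS = 24
HEADS = 32
KV_HEADS = 8
HEAD_DIM = 128
EXPERT_FFN = 5632
NUM_EXPERTS = 8
VOCAB = 131_074


def attention_params() -> int:
    """Q, K, V and output projections with grouped-query attention."""
    q = HIDDEN * HEADS * HEAD_DIM
    k = HIDDEN * KV_HEADS * HEAD_DIM
    v = HIDDEN * KV_HEADS * HEAD_DIM
    o = HEADS * HEAD_DIM * HIDDEN
    return q + k + v + o


def expert_params() -> int:
    """One expert: gate, up and down projections."""
    return 3 * HIDDEN * EXPERT_FFN


def total_params() -> int:
    per_layer = (
        attention_params()
        + NUM_EXPERTS * expert_params()
        + HIDDEN * NUM_EXPERTS  # router logits, one per expert
        + 2 * HIDDEN            # two RMSNorm weight vectors
    )
    # Embedding matrix (assumed shared with the output head) plus final norm.
    return LAYERS * per_layer + VOCAB * HIDDEN + HIDDEN


print(f"{total_params() / 1e9:.2f}B")
```

Under these assumptions the estimate comes to about 14.83B, in line with the stated total.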
Full Training Pipeline
| Stage | Steps | Tokens | Data | Hardware |
|---|---|---|---|---|
| Pretraining Stage 1 | 100,000 | ~50B | Korean + English web corpus | 2× H200 SXM |
| Pretraining Stage 2 | 120,000 | ~13B | Korean + English web corpus (continued) | 2× H200 SXM |
| SFT Epoch 1 | 18,000 | 710M | mkd-chanwoo/keural-SFT (1.14M ChatML samples) | 2× H200 SXM |
| DPO (1 full epoch) | 6,927 | n/a | keural-dpo-raw (440K preference pairs) | 2× H200 SXM |
| SFT Epoch 2 (this checkpoint) | 29,112 | 7.63B | mkd-chanwoo/keural-SFT (710K ChatML samples, 2nd pass) | 2× H200 SXM |
SFT Epoch 2 Training Details
| Hyperparameter | Value |
|---|---|
| Resumed from | checkpoint_18000 (SFT epoch 1 final) |
| Learning rate | 1e-5 to 1e-6, cosine decay |
| Min learning rate | 1e-6 |
| Effective batch size | 64 (4 per GPU × 8 grad accum × 2 GPUs) |
| Max sequence length | 4,096 tokens |
| Weight decay | 0.05 |
| Gradient clipping | 1.0 |
| Optimizer | AdamW |
| Total steps | 29,112 |
| Dataset | mkd-chanwoo/keural-SFT (710K samples) |
| Total tokens | ~7.63B |
| Training time | ~56.71h |
| Parallelism | FSDP FULL_SHARD (ZeRO-3 equivalent) |
| Precision | bfloat16 + gradient checkpointing |
| Hardware | 2× NVIDIA H200 SXM (139 GiB each) |
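
The learning-rate and batch-size rows above can be illustrated with a short sketch. It assumes no warmup and a cosine decay that spans the steps from the resumed checkpoint (18,000) to the final step (29,112); the card does not state either detail, and `cosine_lr` is a hypothetical helper, not the training code.

```python
import math

PEAK_LR = 1e-5
MIN_LR = 1e-6
START_STEP = 18_000  # assumption: decay starts at the resumed checkpoint
END_STEP = 29_112    # assumption: decay ends at the final step


def cosine_lr(step: int) -> float:
    """Cosine-decayed learning rate for a global step (hypothetical helper)."""
    clamped = min(max(step, START_STEP), END_STEP)
    progress = (clamped - START_STEP) / (END_STEP - START_STEP)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))


# Effective batch size as listed: per-GPU batch x grad accumulation x GPUs.
effective_batch = 4 * 8 * 2

print("effective batch:", effective_batch)
for step in (18_000, 23_556, 29_112):
    print(step, f"{cosine_lr(step):.2e}")
```

With these assumptions the rate starts at the peak, passes the midpoint of the two values halfway through, and ends at the minimum.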
SFT Dataset
| Source | Samples | Language |
|---|---|---|
| mkd-chanwoo/keural-SFT | 710,000 | Korean + English |
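
The sample count can be related to the step and token figures reported for this checkpoint. The back-of-the-envelope check below assumes one sequence per sample (no packing) and reads 29,112 as a cumulative step count that continues from checkpoint_18000. Both are interpretations, not statements from the card.

```python
import math

SAMPLES = 710_000
EFFECTIVE_BATCH = 64
MAX_SEQ_LEN = 4_096
RESUME_STEP = 18_000
FINAL_STEP = 29_112

# Optimizer steps needed for one pass over the dataset at this batch size.
steps_for_one_pass = math.ceil(SAMPLES / EFFECTIVE_BATCH)

# Steps actually taken after resuming, under the cumulative-count reading.
steps_after_resume = FINAL_STEP - RESUME_STEP

# Upper bound on tokens seen by the final step if every sequence were full length.
max_tokens_at_final_step = FINAL_STEP * EFFECTIVE_BATCH * MAX_SEQ_LEN

print(steps_for_one_pass)
print(steps_after_resume)
print(f"{max_tokens_at_final_step / 1e9:.2f}B")
```

The two step counts are close (roughly 11.1K each), which is consistent with about one pass over 710K samples, and the token bound matches the listed ~7.63B if that figure counts full-length sequences over all 29,112 steps.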
Chat Format (ChatML)
This model uses ChatML format. Always include a system prompt for best results.
<|im_start|>system
You are a helpful bilingual Korean-English assistant. Always respond in the same language as the user.<|im_end|>
<|im_start|>user
안녕하세요! 오늘 날씨가 어때요?<|im_end|>
<|im_start|>assistant
The model generates until it produces <|im_end|> (token ID 131073).
The chat template in `tokenizer_config.json` automatically injects a default system prompt if you don't provide one, so bilingual behavior works out of the box with `apply_chat_template`.
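
To make the layout concrete, here is a minimal sketch that renders a message list into the ChatML format shown above without touching the tokenizer. `format_chatml` and `DEFAULT_SYSTEM` are hypothetical names. The real default system prompt and the exact whitespace of the bundled template are not shown in this card, so the prompt text here is borrowed from the example above. For actual inference, prefer `apply_chat_template`.

```python
# Hypothetical default; the real template's default prompt may differ.
DEFAULT_SYSTEM = (
    "You are a helpful bilingual Korean-English assistant. "
    "Always respond in the same language as the user."
)


def format_chatml(messages, add_generation_prompt=True):
    """Render [{'role': ..., 'content': ...}] into a ChatML prompt string."""
    # Mimic the described behavior: inject a system prompt when none is given.
    if not messages or messages[0]["role"] != "system":
        messages = [{"role": "system", "content": DEFAULT_SYSTEM}] + list(messages)
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    ]
    if add_generation_prompt:
        # Open an assistant turn for the model to complete.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


# "Hello! How is the weather today?"
prompt = format_chatml([{"role": "user", "content": "안녕하세요! 오늘 날씨가 어때요?"}])
print(prompt)
```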
How to Use
With transformers
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "mkd-hossain/keural-sft2"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {
        "role": "system",
        "content": (
            "You are a helpful bilingual Korean-English assistant. "
            "Always respond in the same language as the user's message."
        ),
    },
    # "Please tell me how to sort a list in Python."
    {"role": "user", "content": "파이썬에서 리스트를 정렬하는 방법을 알려주세요."},
]

# Render the ChatML prompt and open an assistant turn.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.7,
        top_p=0.9,
        top_k=50,
        repetition_penalty=1.1,
        no_repeat_ngram_size=8,
        do_sample=True,
        eos_token_id=131073,  # <|im_end|>
    )

# Decode only the newly generated tokens and cut at the end-of-turn marker.
response = tokenizer.decode(output[0][inputs.input_ids.shape[1]:], skip_special_tokens=False)
response = response.split("<|im_end|>")[0].strip()
print(response)
With vLLM (recommended for serving)
pip install vllm
python -m vllm.entrypoints.openai.api_server \
--model mkd-hossain/keural-sft2 \
--tokenizer mkd-hossain/keural-sft2 \
--dtype bfloat16 \
--max-model-len 4096 \
--tensor-parallel-size 1
Once the server is running, query it with the OpenAI Python client:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
response = client.chat.completions.create(
    model="mkd-hossain/keural-sft2",
    messages=[
        {"role": "system", "content": "You are a helpful bilingual assistant. Respond in the same language as the user."},
        {"role": "user", "content": "What is the capital of South Korea?"},
    ],
    max_tokens=512,
    temperature=0.7,
)
print(response.choices[0].message.content)
Special Tokens
| Token | ID | Purpose |
|---|---|---|
| `<\|im_start\|>` | 131072 | Start of a chat turn |
| `<\|im_end\|>` | 131073 | End of a chat turn (generation stop token) |
| `<bos>` | 1 | Beginning of sequence |
| `<eos>` | 2 | End of sequence (not used for chat) |
| `<pad>` | 0 | Padding token |
Critical: Always set `eos_token_id=131073` when generating. Do not use `eos_token_id=2`.
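
Why the stop token matters can be shown without loading the model. The sketch below uses a hypothetical helper, `truncate_at_stop`, to cut a list of generated token IDs at the first stop token; the sample IDs are made up for illustration.

```python
IM_END_ID = 131073  # <|im_end|>, the chat stop token named in this card
EOS_ID = 2          # <eos>, not used for chat


def truncate_at_stop(token_ids, stop_id=IM_END_ID):
    """Return the tokens before the first stop_id (all tokens if it is absent)."""
    for i, tok in enumerate(token_ids):
        if tok == stop_id:
            return list(token_ids[:i])
    return list(token_ids)


# Made-up IDs: a short reply, <|im_end|>, then unwanted continuation.
generated = [512, 77, 9001, IM_END_ID, 131072, 45, 46]

print(truncate_at_stop(generated))          # cut at <|im_end|>
print(truncate_at_stop(generated, EOS_ID))  # <eos> never appears, nothing is cut
```

Stopping on ID 2 leaves the continuation in place because a chat-tuned reply ends with `<|im_end|>`, not `<eos>`.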
Recommended Generation Settings
# Conversational / creative
{
    "max_new_tokens": 512,
    "temperature": 0.7,
    "top_p": 0.9,
    "top_k": 50,
    "repetition_penalty": 1.1,
    "no_repeat_ngram_size": 8,
    "do_sample": True,
    "eos_token_id": 131073,
}
# Factual / deterministic (greedy decoding)
{
    "max_new_tokens": 512,
    "temperature": 0.1,  # has no effect in transformers while do_sample is False
    "repetition_penalty": 1.1,
    "do_sample": False,
    "eos_token_id": 131073,
}
Checkpoint Comparison
| Checkpoint | Stage | Steps | Notes |
|---|---|---|---|
| mkd-hossain/keural-pretrained | Pretraining | 120,000 | Raw base, no instruction tuning |
| mkd-hossain/keural-sft-18k | SFT Epoch 1 | 18,000 | Instruction following, ChatML format |
| mkd-hossain/keural-dpo-3500 | DPO 50% | 3,500 | Early alignment |
| mkd-hossain/keural-dpo-5500 | DPO 79% | 5,500 | Late alignment |
| mkd-hossain/keural-dpo-final | DPO 100% | 6,927 | Full epoch DPO |
| mkd-hossain/keural-sft2 | SFT Epoch 2 | 29,112 | Continued SFT on 710K dataset |
Limitations
- Maximum context is 4,096 tokens.
- The pretraining corpus is Korean-dominant, so always include a system prompt for correct bilingual behavior.
- Not safety-aligned: do not deploy in production without additional safety fine-tuning.
- This is an intermediate checkpoint. SFT epoch 3 on a 2.35M sample merged dataset is in progress.
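
The context limit has a practical consequence for generation settings. Assuming, as is usual, that the 4,096-token window covers the prompt and the generated tokens together, a small guard can keep `max_new_tokens` within budget. `clamp_max_new_tokens` is a hypothetical helper, not part of any library.

```python
CONTEXT_LENGTH = 4096  # maximum context stated in this card


def clamp_max_new_tokens(prompt_tokens: int, requested: int,
                         context: int = CONTEXT_LENGTH) -> int:
    """Limit the generation budget so prompt + output fit in the context window."""
    if prompt_tokens >= context:
        raise ValueError(
            f"prompt uses {prompt_tokens} tokens, leaving no room in a "
            f"{context}-token context"
        )
    return min(requested, context - prompt_tokens)


print(clamp_max_new_tokens(1000, 512))  # short prompt: request is unchanged
print(clamp_max_new_tokens(3900, 512))  # long prompt: budget is reduced
```

In the transformers example, `prompt_tokens` would be `inputs.input_ids.shape[1]`.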
License
Apache 2.0