Keural-14.8B-Base (Stage 1 Checkpoint — Step 80K)

Status: Early-stage pretraining checkpoint. Not a finished model. This is a research preview of an ongoing training run, shared for transparency.

Model Overview

Property	Value
Architecture	Mixtral-style MoE (Mixture of Experts)
Total Parameters	14.83B
Active Parameters per Token	~3.7B (top-2 of 8 experts)
Context Length	4,096 tokens
Languages	English, Korean (primary)
Training Stage	Stage 1 Pretraining (step 80K / 100K)
License	Apache 2.0
Precision	bfloat16
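
To make "top-2 of 8 experts" concrete, here is a minimal pure-Python sketch of top-2 routing for a single token. The card does not describe how Keural normalizes the router weights. The softmax over the two selected logits follows the usual Mixtral convention and is an assumption, as are the function name and the example logits.

```python
import math

def top2_route(router_logits, k=2):
    """Pick the k highest-scoring experts and softmax-normalize their logits.

    Assumption: weights are renormalized over the selected experts only,
    as in Mixtral. The model card does not confirm this for Keural.
    """
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    peak = max(router_logits[i] for i in top)            # for numerical stability
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

# Hypothetical router logits for one token over 8 experts.
logits = [0.1, 2.0, -1.0, 0.5, 1.5, -0.3, 0.0, 0.2]
routes = top2_route(logits)
for expert, weight in routes:
    print(f"expert {expert}: weight {weight:.3f}")
```

Only the two selected experts run their FFN for that token, which is why the active parameter count is much smaller than the total.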

Architecture Details

Parameter	Value
Layers	24
Hidden size	4,096
Attention heads	32
KV heads (GQA)	8
Head dim	128
Experts per layer	8
Active experts per token	2 (top-2 routing)
FFN type	SwiGLU
Positional encoding	RoPE (θ = 500,000)
Attention	Alternating full causal + sliding window (window size 512 tokens)
Norm	RMSNorm (ε = 1e-5)
Vocab size	131,072
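
The table values can be cross-checked with a little arithmetic. The sketch below verifies that heads × head dim equals the hidden size and estimates the bfloat16 KV-cache footprint under GQA. It treats every layer as full attention, so it is an upper bound. Sliding-window layers may need less, depending on the inference engine.

```python
# Values copied from the Architecture Details table.
LAYERS = 24
HIDDEN = 4096
Q_HEADS = 32
KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_VALUE = 2  # bfloat16

# Sanity check: attention heads times head dim should equal the hidden size.
assert Q_HEADS * HEAD_DIM == HIDDEN

# With GQA, each KV head is shared by this many query heads.
queries_per_kv_head = Q_HEADS // KV_HEADS

def kv_cache_bytes(num_tokens, kv_heads=KV_HEADS):
    """Upper-bound KV-cache size: keys and values for every layer and token."""
    return 2 * LAYERS * kv_heads * HEAD_DIM * BYTES_PER_VALUE * num_tokens

per_token = kv_cache_bytes(1)
full_context = kv_cache_bytes(4096)
without_gqa = kv_cache_bytes(4096, kv_heads=Q_HEADS)

print(f"{per_token / 1024:.0f} KiB per token")
print(f"{full_context / 1024**2:.0f} MiB at 4,096 tokens")
print(f"GQA saves a factor of {without_gqa // full_context}")
```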

Training Details

Hardware

GPUs: 2× NVIDIA H200 (150 GB VRAM each)
Parallelism: FSDP (Fully Sharded Data Parallel)
Precision: bfloat16 with gradient checkpointing

Hyperparameters

Parameter	Value
Batch size	4 sequences per GPU
Gradient accumulation	8 steps
Effective batch size	64 sequences
Peak learning rate	3e-4
Min learning rate	3e-5
LR schedule	Cosine decay
Warmup steps	2,000
Weight decay	0.1
Gradient clip	1.0
Optimizer	AdamW (β₁=0.9, β₂=0.95, fused)
Sequence length	4,096 tokens
Total steps	100,000 (this checkpoint: step 80,000)
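
The hyperparameters above imply a specific effective batch size and learning-rate curve. The following is a sketch, not the actual training code. It assumes linear warmup from zero and a cosine decay that runs from the end of warmup to the final step. The card states only "Cosine decay" and the warmup length.

```python
import math

# Values copied from the Hardware and Hyperparameters sections.
GPUS = 2
BATCH_PER_GPU = 4
GRAD_ACCUM = 8
SEQ_LEN = 4096
PEAK_LR = 3e-4
MIN_LR = 3e-5
WARMUP_STEPS = 2_000
TOTAL_STEPS = 100_000

effective_batch = BATCH_PER_GPU * GPUS * GRAD_ACCUM   # sequences per optimizer step
tokens_per_step = effective_batch * SEQ_LEN           # token slots per optimizer step

def learning_rate(step):
    """Assumed schedule: linear warmup from 0, then cosine decay to MIN_LR."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = min((step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS), 1.0)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))

print(f"effective batch: {effective_batch} sequences")
print(f"tokens per step: {tokens_per_step:,}")
print(f"assumed LR at step 80,000: {learning_rate(80_000):.2e}")
```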

Loss Curve

Step	Loss
10	12.68
2,000	~3.5
15,000	2.64
33,000	1.37
50,000	1.79
56,000	1.11
70,000	1.06
78,000	0.85
80,000	~0.85
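
Loss values are easier to interpret as perplexity. The sketch below assumes the reported loss is the mean per-token cross-entropy in nats. The card does not say whether it also includes an MoE auxiliary term, so treat the conversion as approximate. Rows marked "~" in the table are left out.

```python
import math

# Exact rows from the loss table (approximate "~" rows omitted).
loss_by_step = {
    10: 12.68,
    15_000: 2.64,
    33_000: 1.37,
    50_000: 1.79,
    56_000: 1.11,
    70_000: 1.06,
    78_000: 0.85,
}

def perplexity(loss_nats):
    """Perplexity for a mean cross-entropy loss given in nats (assumed unit)."""
    return math.exp(loss_nats)

# Reference point: the loss of a uniform guess over the 131,072-token vocabulary.
uniform_loss = math.log(131_072)

for step, loss in loss_by_step.items():
    print(f"step {step:>6,}: loss {loss:5.2f} -> perplexity {perplexity(loss):,.2f}")
print(f"uniform-guess loss: {uniform_loss:.2f}")
```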

Training Data (Stage 1)

Domain	Source	Tokens
English	FineWeb (HuggingFaceFW)	30B
Code	The Stack v1 (BigCode)	8B
Science	arXiv	3.5B
Science	PubMed	2.4B
Korean	Wikipedia-ko	0.5B
Korean	Korean-Webtext (HAERAE)	2.2B
Korean	WanJuan-Korean	3.0B
Korean	CC-100 Korean	0.16B
Literature	PG-19	0.45B
Total		~50B raw / ~70B packed

Binary dataset: 158 shards, 15.76M sequences, 95.1% sequence utilization.
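
As a rough cross-check, the shard statistics can be combined with the batch settings to estimate how much of the binary dataset a 100K-step run touches. The sketch assumes every stored sequence is 4,096 tokens long and that each optimizer step consumes 64 distinct sequences. The inputs are the rounded figures from this card, so the results are estimates and need not match the table totals exactly.

```python
# Rounded figures from this card.
SEQUENCES = 15_760_000      # "15.76M sequences"
SEQ_LEN = 4096
UTILIZATION = 0.951         # "95.1% sequence utilization"
EFFECTIVE_BATCH = 64        # sequences per optimizer step
TOTAL_STEPS = 100_000

token_slots = SEQUENCES * SEQ_LEN                    # positions, including padding
non_padding_tokens = token_slots * UTILIZATION       # estimated real tokens
sequences_consumed = EFFECTIVE_BATCH * TOTAL_STEPS   # over the full planned run
fraction_of_dataset = sequences_consumed / SEQUENCES

print(f"token slots:        {token_slots / 1e9:.1f}B")
print(f"non-padding tokens: {non_padding_tokens / 1e9:.1f}B")
print(f"dataset fraction seen in 100K steps: {fraction_of_dataset:.1%}")
```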

Tokenizer: Custom SentencePiece model trained on Korean + English + code corpus. Vocab size: 131,072.

Known Limitations

This is a raw pretraining checkpoint, not an instruction-tuned or preference-aligned model. It has significant known issues:

Data quality: Stage 1 training data contains unfiltered web content including HTML artifacts ([content7], <table>), spam, and low-quality Korean web pages. This directly affects output quality.
Korean outputs: May produce brand spam, gambling content, or HTML artifacts inherited from noisy Korean web data in the training set.
No instruction following: This is a base language model. It continues text; it does not follow instructions or answer questions in a chat format.
Not safety-tuned: No RLHF, DPO, or safety filtering has been applied.
Incomplete training: This checkpoint is at step 80K of a planned 100K-step run. Training was ongoing at upload time.

Stage 2 pretraining with cleaner data (FineWeb-edu, FineWeb2-Korean, HPLT-Korean) is planned before instruction tuning.

Tokenizer

Custom SentencePiece tokenizer with 131,072 vocabulary tokens, trained on a multilingual corpus (Korean + English + Code). Uses LlamaTokenizer interface for HuggingFace compatibility.
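
After loading the tokenizer, it can be worth confirming that it reports the special-token IDs documented in this section. The helper below is a small sketch. The function name is hypothetical, and it relies only on the standard `convert_tokens_to_ids` method of Hugging Face tokenizers.

```python
# Special-token IDs as documented in this model card.
EXPECTED_SPECIAL_IDS = {"<unk>": 0, "<s>": 1, "</s>": 2}

def special_token_mismatches(tokenizer):
    """Return {token: actual_id} for every special token whose ID differs
    from the documented value. An empty dict means everything matches."""
    mismatches = {}
    for token, expected_id in EXPECTED_SPECIAL_IDS.items():
        actual_id = tokenizer.convert_tokens_to_ids(token)
        if actual_id != expected_id:
            mismatches[token] = actual_id
    return mismatches

# Usage (downloads the tokenizer from the Hub):
# from transformers import AutoTokenizer
# tok = AutoTokenizer.from_pretrained("mkd-chanwoo/keural-14.8b-base")
# print(special_token_mismatches(tok))
```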

Special tokens:

<s> (BOS) → ID 1
</s> (EOS) → ID 2
<unk> → ID 0

Usage

With vLLM

pip install vllm
vllm serve mkd-chanwoo/keural-14.8b-base --dtype bfloat16 --max-model-len 4096
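
Once the server is running, it can be queried over vLLM's OpenAI-compatible HTTP API. The sketch below uses only the Python standard library and the plain text-completion endpoint, since this is a base model without a chat format. It assumes the server listens on vLLM's default address, http://localhost:8000. The helper names are illustrative.

```python
import json
import urllib.request

MODEL_ID = "mkd-chanwoo/keural-14.8b-base"

def build_completion_request(prompt, base_url="http://localhost:8000",
                             max_tokens=200, temperature=0.7):
    """Build a POST request for the /v1/completions endpoint (not sent yet)."""
    payload = {
        "model": MODEL_ID,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def complete(prompt, **kwargs):
    """Send the request and return the generated continuation text."""
    with urllib.request.urlopen(build_completion_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]

# Usage (requires the vLLM server started above):
# print(complete("인공지능의 미래는"))
```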

With Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "mkd-chanwoo/keural-14.8b-base"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

inputs = tokenizer("인공지능의 미래는", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200, temperature=0.7, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Text Generation with Sampling

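# Reuses `model`, `tokenizer`, and `inputs` from the Transformers example above.
outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    temperature=0.7,
    top_p=0.9,
    top_k=50,
    repetition_penalty=1.5,   # recommended: reduces repetition loops
    do_sample=True,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))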

Model Card Metadata

Model type: Causal language model, MoE
Training regime: Pretraining only (no SFT, no RLHF)
Checkpoint step: 80,000
Converted from: Native Keural .pt format → HuggingFace Mixtral-compatible safetensors
Conversion: Weights remapped to MixtralForCausalLM schema for vLLM/transformers compatibility

Citation

@misc{keural2026,
  title  = {Keural: A Korean-English Mixture-of-Experts Language Model},
  author = {mkd-chanwoo},
  year   = {2026},
  url    = {https://huggingface.co/mkd-chanwoo/keural-14.8b-base}
}

Roadmap

Stage 1 pretraining (50B tokens, mixed quality data)
Stage 1 completion (100K steps)
Stage 2 pretraining (70B clean tokens: FineWeb-edu + FineWeb2-Korean + HPLT-Korean)
Supervised Fine-Tuning (SFT)
Preference alignment (DPO/RLHF)
Evaluation on Korean benchmarks (KoBEST, KLUE)
