Keural-14.8B-Base (Stage 1 Checkpoint, Step 80K)

Status: Early-stage pretraining checkpoint. Not a finished model. This is a research preview of an ongoing training run, shared for transparency.

Model Overview

| Property | Value |
| --- | --- |
| Architecture | Mixtral-style MoE (Mixture of Experts) |
| Total Parameters | 14.83B |
| Active Parameters per Token | ~3.7B (top-2 of 8 experts) |
| Context Length | 4,096 tokens |
| Languages | English, Korean (primary) |
| Training Stage | Stage 1 Pretraining (step 80K / 100K) |
| License | Apache 2.0 |
| Precision | bfloat16 |
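
To make "top-2 of 8 experts" concrete, here is a minimal pure-Python sketch of top-2 gating for a single token. It assumes the router follows the usual Mixtral recipe (a softmax restricted to the two largest router logits); the function name `top2_route` and the example logits are illustrative and are not taken from the Keural training code.

```python
import math

def top2_route(router_logits):
    """Return [(expert_index, weight), ...] for the two highest-scoring experts."""
    if len(router_logits) < 2:
        raise ValueError("need at least two experts")
    # Indices of the two largest logits, highest first.
    top2 = sorted(range(len(router_logits)),
                  key=lambda i: router_logits[i], reverse=True)[:2]
    # Softmax restricted to the selected logits, so the two weights sum to 1.
    peak = max(router_logits[i] for i in top2)
    exps = [math.exp(router_logits[i] - peak) for i in top2]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top2, exps)]

# Hypothetical router logits for one token over 8 experts.
logits = [0.1, 2.0, -1.0, 0.5, 2.0 + math.log(3), 0.0, -0.3, 1.2]
routes = top2_route(logits)
for expert, weight in routes:
    print(f"expert {expert}: weight {weight:.2f}")
```

Only the two selected experts run their FFN for that token, and their outputs are combined with these weights. This is why the active parameter count per token is much smaller than the total.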

Architecture Details

| Parameter | Value |
| --- | --- |
| Layers | 24 |
| Hidden size | 4,096 |
| Attention heads | 32 |
| KV heads (GQA) | 8 |
| Head dim | 128 |
| Experts per layer | 8 |
| Active experts per token | 2 (top-2 routing) |
| FFN type | SwiGLU |
| Positional encoding | RoPE (θ = 500,000) |
| Attention | Alternating full causal + sliding window (512) |
| Norm | RMSNorm (ε = 1e-5) |
| Vocab size | 131,072 |
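
The "alternating full causal + sliding window" attention can be pictured with a small mask sketch. Two things here are assumptions, since the card does not state them: that even-numbered layers are full causal and odd-numbered layers use the window, and that a window of 512 means each token sees itself plus the previous 511 positions. The helper names are hypothetical.

```python
def causal_mask(seq_len, window=None):
    """mask[q][k] is True when query position q may attend to key position k.

    window=None gives full causal attention; an integer limits each query
    to the most recent `window` positions (itself included).
    """
    mask = []
    for q in range(seq_len):
        row = []
        for k in range(seq_len):
            visible = k <= q
            if window is not None:
                visible = visible and (q - k) < window
            row.append(visible)
        mask.append(row)
    return mask

def layer_window(layer_idx, window=512):
    # ASSUMPTION: even layers are full causal, odd layers are sliding-window.
    return None if layer_idx % 2 == 0 else window

# Tiny demo with window=3 so the banded pattern is visible.
full = causal_mask(6)
local = causal_mask(6, window=3)
for row in local:
    print("".join("x" if v else "." for v in row))
```

In the full-causal layers the last token can reach every earlier position, while in the windowed layers it reaches only the most recent ones. Stacking the two kinds lets information from distant tokens still propagate across layers.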

Training Details

Hardware

  • GPUs: 2× NVIDIA H200 (150 GB VRAM each)
  • Parallelism: FSDP (Fully Sharded Data Parallel)
  • Precision: bfloat16 with gradient checkpointing

Hyperparameters

| Parameter | Value |
| --- | --- |
| Batch size | 4 per GPU |
| Gradient accumulation | 8 steps |
| Effective batch size | 64 sequences |
| Peak learning rate | 3e-4 |
| Min learning rate | 3e-5 |
| LR schedule | Cosine decay |
| Warmup steps | 2,000 |
| Weight decay | 0.1 |
| Gradient clip | 1.0 |
| Optimizer | AdamW (β₁=0.9, β₂=0.95, fused) |
| Sequence length | 4,096 tokens |
| Total steps | 100,000 (this checkpoint: step 80,000) |
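
The effective batch size and the learning-rate schedule follow from the values above. The sketch below reproduces the arithmetic and one plausible reading of "cosine decay": linear warmup to the peak, then a cosine curve that reaches the minimum at the final step. The warmup shape and the exact decay endpoint are assumptions, as is treating a "step" as one optimizer update after gradient accumulation.

```python
import math

PER_GPU_BATCH = 4
NUM_GPUS = 2
GRAD_ACCUM = 8
SEQ_LEN = 4096
PEAK_LR, MIN_LR = 3e-4, 3e-5
WARMUP_STEPS, TOTAL_STEPS = 2_000, 100_000

sequences_per_step = PER_GPU_BATCH * NUM_GPUS * GRAD_ACCUM   # 4 * 2 * 8 = 64
tokens_per_step = sequences_per_step * SEQ_LEN               # 64 * 4096

def lr_at(step):
    """Learning rate at a given step (ASSUMED linear warmup, then cosine decay)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = min(1.0, (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS))
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))

print(sequences_per_step, tokens_per_step)
for step in (1_000, 2_000, 80_000, 100_000):
    print(step, f"{lr_at(step):.3e}")
```

Under these assumptions the schedule peaks at step 2,000 and has already decayed most of the way toward the minimum by step 80,000, the point at which this checkpoint was taken.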

Loss Curve

| Step | Loss |
| --- | --- |
| 10 | 12.68 |
| 2,000 | ~3.5 |
| 15,000 | 2.64 |
| 33,000 | 1.37 |
| 50,000 | 1.79 |
| 56,000 | 1.11 |
| 70,000 | 1.06 |
| 78,000 | 0.85 |
| 80,000 | ~0.85 |

Training Data (Stage 1)

| Domain | Source | Tokens |
| --- | --- | --- |
| English | FineWeb (HuggingFaceFW) | 30B |
| Code | The Stack v1 (BigCode) | 8B |
| Science | arXiv | 3.5B |
| Science | PubMed | 2.4B |
| Korean | Wikipedia-ko | 0.5B |
| Korean | Korean-Webtext (HAERAE) | 2.2B |
| Korean | WanJuan-Korean | 3.0B |
| Korean | CC-100 Korean | 0.16B |
| Literature | PG-19 | 0.45B |
| Total | | ~50B raw / ~70B packed |
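
As a quick sanity check on the mix, the per-source counts can be summed by domain. The numbers below are copied from the table above; only the helper name `domain_totals` is made up for this sketch.

```python
# Token counts in billions, copied from the Stage 1 data table.
SOURCES = [
    ("English", "FineWeb", 30.0),
    ("Code", "The Stack v1", 8.0),
    ("Science", "arXiv", 3.5),
    ("Science", "PubMed", 2.4),
    ("Korean", "Wikipedia-ko", 0.5),
    ("Korean", "Korean-Webtext", 2.2),
    ("Korean", "WanJuan-Korean", 3.0),
    ("Korean", "CC-100 Korean", 0.16),
    ("Literature", "PG-19", 0.45),
]

def domain_totals(sources):
    """Sum token counts (in billions) per domain."""
    totals = {}
    for domain, _name, billions in sources:
        totals[domain] = totals.get(domain, 0.0) + billions
    return totals

totals = domain_totals(SOURCES)
grand_total = sum(totals.values())
for domain, billions in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{domain:<10} {billions:6.2f}B  {100 * billions / grand_total:5.1f}%")
print(f"{'Total':<10} {grand_total:6.2f}B")
```

The rows add up to 50.21B tokens, consistent with the "~50B raw" figure. Korean sources account for 5.86B of that, a bit under 12% of the raw mix.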

Binary dataset: 158 shards, 15.76M sequences, 95.1% sequence utilization.

Tokenizer: Custom SentencePiece model trained on a Korean, English, and code corpus. Vocab size: 131,072.


Known Limitations

This is a raw pretraining checkpoint, not an instruction-tuned or RLHF'd model. It has significant known issues:

  • Data quality: Stage 1 training data contains unfiltered web content including HTML artifacts ([content7], <table>), spam, and low-quality Korean web pages. This directly affects output quality.
  • Korean outputs: May produce brand spam, gambling content, or HTML fragments, all inherited from noisy Korean web data in the training set.
  • No instruction following: This is a base language model. It continues text; it does not follow instructions or answer questions in a chat format.
  • Not safety-tuned: No RLHF, DPO, or safety filtering has been applied.
  • Incomplete training: This checkpoint is at step 80K of a planned 100K-step run. Training was still in progress at upload time.

Stage 2 pretraining with cleaner data (FineWeb-edu, FineWeb2-Korean, HPLT-Korean) is planned before instruction tuning.


Tokenizer

Custom SentencePiece tokenizer with a 131,072-token vocabulary, trained on a multilingual corpus (Korean, English, and code). It uses the LlamaTokenizer interface for Hugging Face compatibility.

Special tokens:

  • <s> (BOS) → ID 1
  • </s> (EOS) → ID 2
  • <unk> → ID 0
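
A small check like the following can confirm that a loaded tokenizer exposes these IDs. It relies only on the standard `bos_token_id`, `eos_token_id`, and `unk_token_id` attributes of Hugging Face tokenizers; the helper name is hypothetical, and a stand-in object is used here so the snippet runs without downloading anything.

```python
from types import SimpleNamespace

EXPECTED_SPECIAL_IDS = {"unk_token_id": 0, "bos_token_id": 1, "eos_token_id": 2}

def special_token_mismatches(tokenizer, expected=EXPECTED_SPECIAL_IDS):
    """Return {attribute: (expected, actual)} for every ID that differs."""
    mismatches = {}
    for attr, want in expected.items():
        got = getattr(tokenizer, attr, None)
        if got != want:
            mismatches[attr] = (want, got)
    return mismatches

# Stand-in object for demonstration. With transformers installed, pass
# AutoTokenizer.from_pretrained("mkd-chanwoo/keural-14.8b-base") instead.
stub = SimpleNamespace(unk_token_id=0, bos_token_id=1, eos_token_id=2)
print(special_token_mismatches(stub))  # empty dict when everything matches
```

An empty result means the tokenizer agrees with the IDs listed above; any entry in the returned dict points at a special token worth inspecting before generation.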

Usage

With vLLM

pip install vllm
vllm serve mkd-chanwoo/keural-14.8b-base --dtype bfloat16 --max-model-len 4096
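
Once the server is up, it can be queried over HTTP. The sketch below assumes vLLM's OpenAI-compatible server is listening on its default address (`http://localhost:8000`) and uses the plain completions endpoint rather than the chat endpoint, since this is a base model. The helper names are illustrative.

```python
import json
import urllib.request

# ASSUMPTION: the server was started with the command above on the default host/port.
BASE_URL = "http://localhost:8000/v1"
MODEL_ID = "mkd-chanwoo/keural-14.8b-base"

def build_completion_request(prompt, max_tokens=200, temperature=0.7):
    """Build a POST request for the OpenAI-style /completions endpoint."""
    payload = {
        "model": MODEL_ID,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{BASE_URL}/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def complete(prompt, **kwargs):
    """Send the request and return the generated continuation text."""
    with urllib.request.urlopen(build_completion_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]
```

With the server running, `print(complete("The future of artificial intelligence is"))` prints the model's continuation of the prompt.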

With Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "mkd-chanwoo/keural-14.8b-base"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Korean prompt, roughly "The future of artificial intelligence is"
inputs = tokenizer("인공지능의 미래는", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200, temperature=0.7, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Text Generation with Sampling

outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    temperature=0.7,
    top_p=0.9,
    top_k=50,
    repetition_penalty=1.5,   # recommended; reduces repetition loops
    do_sample=True,
)
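
For intuition about what `repetition_penalty` does, here is a pure-Python sketch of the commonly used rule (the one introduced with CTRL, which the transformers implementation follows as far as we know). It is an illustration on a toy vocabulary, not the library code, and the function name is made up.

```python
def apply_repetition_penalty(logits, seen_token_ids, penalty=1.5):
    """Return a copy of `logits` with already-seen tokens made less likely.

    Positive logits are divided by `penalty` and negative logits are
    multiplied by it, so both move toward lower probability.
    """
    if penalty <= 0:
        raise ValueError("penalty must be positive")
    adjusted = list(logits)
    for token_id in set(seen_token_ids):
        score = adjusted[token_id]
        adjusted[token_id] = score * penalty if score < 0 else score / penalty
    return adjusted

# Toy vocabulary of 4 tokens; tokens 0 and 2 have already appeared.
logits = [3.0, 1.0, -2.0, 0.5]
print(apply_repetition_penalty(logits, seen_token_ids=[0, 2, 0]))
```

A value of 1.0 leaves the logits unchanged. Larger values push harder against tokens that have already appeared, which helps with the repetition loops typical of a base checkpoint but can also suppress legitimate repeats such as names or common function words.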

Model Card Metadata

  • Model type: Causal language model, MoE
  • Training regime: Pretraining only (no SFT, no RLHF)
  • Checkpoint step: 80,000
  • Converted from: Native Keural .pt format → HuggingFace Mixtral-compatible safetensors
  • Conversion: Weights remapped to MixtralForCausalLM schema for vLLM/transformers compatibility

Citation

@misc{keural2026,
  title  = {Keural: A Korean-English Mixture-of-Experts Language Model},
  author = {mkd-chanwoo},
  year   = {2026},
  url    = {https://huggingface.co/mkd-chanwoo/keural-14.8b-base}
}

Roadmap

  • Stage 1 pretraining (50B tokens, mixed-quality data)
  • Stage 1 completion (100K steps)
  • Stage 2 pretraining (70B clean tokens: FineWeb-edu + FineWeb2-Korean + HPLT-Korean)
  • Supervised Fine-Tuning (SFT)
  • Preference alignment (DPO/RLHF)
  • Evaluation on Korean benchmarks (KoBEST, KLUE)