# HanForge 35M (Korean Base)
HanForge 35M is a small Korean causal language model pretrained from scratch on 467M tokens of Korean text. It is designed as a research-friendly base model for downstream fine-tuning. The model is not instruction-tuned and should not be used directly for chat or question answering — see drlee1/HanForge-47M-SFT for that.
## Model Details
| | |
|---|---|
| Architecture | Llama-style decoder (RMSNorm, RoPE, Grouped-Query Attention) |
| Parameters | 34.84M |
| Hidden size | 512 |
| Layers | 8 |
| Attention heads | 8 (KV heads: 2, GQA) |
| Intermediate size | 1408 |
| Max position | 4096 (RoPE θ = 50000) |
| Vocab size | 24,000 |
| Tokenizer | SentencePiece BPE, Korean-optimized (~2.17 chars/token) |
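For reference, the table above corresponds to roughly the following Hugging Face `LlamaConfig`. This is a sketch under the assumption that the checkpoint follows the standard transformers Llama implementation; the config class and exact field names are not confirmed by this card.

```python
from transformers import LlamaConfig

# Assumed mapping of the model details onto the standard Llama config;
# the actual checkpoint's config class and fields may differ.
config = LlamaConfig(
    vocab_size=24_000,
    hidden_size=512,
    num_hidden_layers=8,
    num_attention_heads=8,
    num_key_value_heads=2,        # GQA: 8 query heads share 2 KV heads
    intermediate_size=1408,
    max_position_embeddings=4096,
    rope_theta=50_000.0,
)
```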
## Intended Use
This model is intended for:
- Continued fine-tuning on Korean downstream tasks (instruction tuning, classification, etc.)
- Korean text continuation and language modeling research (see the usage sketch after this list)
- Educational use — exploring small language model training on a single language
It is not intended for:
- Direct chat or instruction following (use the fine-tuned variant)
- Production text generation without further training and safety review
- Tasks requiring factual accuracy, reasoning, or multilingual capability
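A minimal text-continuation sketch, assuming the checkpoint is hosted at `drlee1/HanForge-base` (an inferred repo id) and loads through the standard transformers auto classes:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "drlee1/HanForge-base"  # assumed repo id, not stated in this card
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Base model: give it a prefix to continue, not a chat-style instruction.
prompt = "한국의 전통 음식은"  # "Traditional Korean food is ..."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=0.8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The sampling settings are illustrative; a base model this small will often need temperature and repetition-penalty tuning to avoid degenerate loops.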
## Training Data
The model was pretrained on 467M raw tokens of Korean text drawn from three publicly available sources:
| Source | Description |
|---|---|
| Wikipedia (Korean) | Encyclopedic articles, factual prose |
| FineWeb-2 (Korean subset) | Filtered Korean web text |
| korean-webtext-edu | Educational Korean web content |
The corpus was deduplicated, length-filtered, and tokenized with a Korean-optimized SentencePiece BPE (24k vocab) trained separately on the same data.
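For concreteness, this is roughly how such a tokenizer is trained with the SentencePiece library. The file paths, output names, and `character_coverage` value are illustrative assumptions, not the actual recipe:

```python
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="korean_corpus.txt",    # hypothetical path to the cleaned corpus
    model_prefix="hanforge_bpe",  # hypothetical output name
    vocab_size=24_000,
    model_type="bpe",
    character_coverage=0.9995,    # common choice for CJK-heavy text
)

# Sanity-check the ~2.17 chars/token compression on a sample string.
sp = spm.SentencePieceProcessor(model_file="hanforge_bpe.model")
sample = "한국어 말뭉치에서 학습한 토크나이저입니다."
print(len(sample) / len(sp.encode(sample)))
```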
## Training Procedure
| | |
|---|---|
| Tokens seen | 467M (1 epoch) |
| Batch size (effective) | 16 |
| Optimizer | AdamW (β1=0.9, β2=0.95, weight decay 0.1) |
| Learning rate | Cosine schedule, peak 3e-4 |
| Sequence length | 1024 |
| Precision | bf16 mixed precision |
| Hardware | Mac MPS / single GPU |
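The optimizer and schedule rows translate into roughly the following PyTorch setup. The stand-in model, warmup length, and derived step count are assumptions, since the card does not state them:

```python
import torch
from transformers import get_cosine_schedule_with_warmup

model = torch.nn.Linear(512, 512)  # stand-in; the real model is the 35M LM

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,            # peak learning rate
    betas=(0.9, 0.95),
    weight_decay=0.1,
)

# ~467M tokens / (batch 16 * seq len 1024) ≈ 28.5k optimizer steps per epoch.
total_steps = 467_000_000 // (16 * 1024)
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=1_000, num_training_steps=total_steps
)

# bf16 mixed precision would wrap the forward/backward pass, e.g.
# with torch.autocast(device_type="cuda", dtype=torch.bfloat16): ...
```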
## Evaluation
The base model achieves the following on internal Korean evaluations:
| Metric | Value |
|---|---|
| Korean character ratio (sample mode) | 87.3% |
| Minimal-pair grammar accuracy | 60.8% |
| Held-out perplexity | ~25 |
The Korean character ratio counts degenerate output, such as a repeated Korean token, as valid Korean text, so it includes some false positives; this is expected for a base model that has not learned chat formatting. For coherent Korean output, use the fine-tuned variant.
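One plausible implementation of the Korean character ratio is the fraction of non-whitespace characters in a sample that fall in the Hangul syllables block. The exact definition behind the number above is not stated, so this sketch is an assumption:

```python
def korean_char_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are Hangul syllables."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    # Hangul syllables occupy U+AC00..U+D7A3.
    hangul = sum("\uac00" <= c <= "\ud7a3" for c in chars)
    return hangul / len(chars)

print(korean_char_ratio("한국어 텍스트 with some English"))  # 6/21 ≈ 0.29
```

Note that output consisting of a single repeated Korean token scores 1.0 under this definition, which is consistent with the false-positive caveat above.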
## Limitations and Bias
- Small scale (35M): Limited reasoning, factual accuracy, and long-form coherence
- Single-language pretrain: No English or other language capability
- Web-derived data: May reflect biases present in Korean web text; no explicit safety filtering was applied
- Short pretrain (1 epoch on 467M tokens): Roughly 13 tokens per parameter, below the ~20 tokens-per-parameter Chinchilla-optimal ratio and well below modern practice
This model has not been aligned, RLHF'd, or safety-tuned. Do not deploy in user-facing applications without further training and review.
## License
Released under the Apache License 2.0. The underlying pretraining corpora are subject to their own licenses.
## Citation
@misc{hanforge_base_2026,
  author = {DongRyeol Lee},
  title  = {HanForge 35M: A Small Korean Language Model Pretrained from Scratch},
  year   = {2026},
  note   = {Pretrained on 467M Korean tokens with a 24k SentencePiece BPE tokenizer}
}