# HanForge 35M (Korean Base)
HanForge 35M is a small Korean causal language model pretrained from scratch on 467M tokens of Korean text. It is designed as a research-friendly base model for downstream fine-tuning. The model is not instruction-tuned and should not be used directly for chat or question answering — see drlee1/HanForge-47M-SFT for that.
## Model Details
| | |
|---|---|
| Architecture | Llama-style decoder (RMSNorm, RoPE, Grouped-Query Attention) |
| Parameters | 34.84M |
| Hidden size | 512 |
| Layers | 8 |
| Attention heads | 8 (KV heads: 2, GQA) |
| Intermediate size | 1408 |
| Max position | 4096 (RoPE θ = 50000) |
| Vocab size | 24,000 |
| Tokenizer | SentencePiece BPE, Korean-optimized (~2.17 chars/token) |
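For reference, the table above corresponds to roughly the following Hugging Face `LlamaConfig`. This is a sketch under the assumption that the checkpoint follows the standard transformers Llama implementation; the config class and exact field names are not confirmed by this card.

```python
from transformers import LlamaConfig

# Assumed mapping of the model details onto the standard Llama config;
# the actual checkpoint's config class and fields may differ.
config = LlamaConfig(
    vocab_size=24_000,
    hidden_size=512,
    num_hidden_layers=8,
    num_attention_heads=8,
    num_key_value_heads=2,        # GQA: 8 query heads share 2 KV heads
    intermediate_size=1408,
    max_position_embeddings=4096,
    rope_theta=50_000.0,
)
```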
## Intended Use
This model is intended for:
- Continued fine-tuning on Korean downstream tasks (instruction tuning, classification, etc.)
- Korean text continuation and language modeling research (see the usage sketch after this list)
- Educational use — exploring small language model training on a single language
It is not intended for:
- Direct chat or instruction following (use the fine-tuned variant)
- Production text generation without further training and safety review
- Tasks requiring factual accuracy, reasoning, or multilingual capability
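A minimal text-continuation sketch, assuming the checkpoint is hosted at `drlee1/HanForge-base` (an inferred repo id) and loads through the standard transformers auto classes:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "drlee1/HanForge-base"  # assumed repo id, not stated in this card
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Base model: give it a prefix to continue, not a chat-style instruction.
prompt = "한국의 전통 음식은"  # "Traditional Korean food is ..."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=0.8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The sampling settings are illustrative; a base model this small will often need temperature and repetition-penalty tuning to avoid degenerate loops.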
## Training Data
The model was pretrained on 467M raw tokens of Korean text drawn from three publicly available sources:
| Source | Description |
|---|---|
| Wikipedia (Korean) | Encyclopedic articles, factual prose |
| FineWeb-2 (Korean subset) | Filtered Korean web text |
| korean-webtext-edu | Educational Korean web content |
The corpus was deduplicated, length-filtered, and tokenized with a Korean-optimized SentencePiece BPE (24k vocab) trained separately on the same data.
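For concreteness, this is roughly how such a tokenizer is trained with the SentencePiece library. The file paths, output names, and `character_coverage` value are illustrative assumptions, not the actual recipe:

```python
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="korean_corpus.txt",    # hypothetical path to the cleaned corpus
    model_prefix="hanforge_bpe",  # hypothetical output name
    vocab_size=24_000,
    model_type="bpe",
    character_coverage=0.9995,    # common choice for CJK-heavy text
)

# Sanity-check the ~2.17 chars/token compression on a sample string.
sp = spm.SentencePieceProcessor(model_file="hanforge_bpe.model")
sample = "한국어 말뭉치에서 학습한 토크나이저입니다."
print(len(sample) / len(sp.encode(sample)))
```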
## Training Procedure
| | |
|---|---|
| Tokens seen | 467M (1 epoch) |
| Batch size (effective) | 16 |
| Optimizer | AdamW (β1=0.9, β2=0.95, weight decay 0.1) |
| Learning rate | Cosine schedule, peak 3e-4 |
| Sequence length | 1024 |
| Precision | bf16 mixed precision |
| Hardware | Mac MPS / single GPU |
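The optimizer and schedule rows translate into roughly the following PyTorch setup. The stand-in model, warmup length, and derived step count are assumptions, since the card does not state them:

```python
import torch
from transformers import get_cosine_schedule_with_warmup

model = torch.nn.Linear(512, 512)  # stand-in; the real model is the 35M LM

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,            # peak learning rate
    betas=(0.9, 0.95),
    weight_decay=0.1,
)

# ~467M tokens / (batch 16 * seq len 1024) ≈ 28.5k optimizer steps per epoch.
total_steps = 467_000_000 // (16 * 1024)
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=1_000, num_training_steps=total_steps
)

# bf16 mixed precision would wrap the forward/backward pass, e.g.
# with torch.autocast(device_type="cuda", dtype=torch.bfloat16): ...
```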
## Evaluation
The base model achieves the following on internal Korean evaluations:
| Metric | Value |
|---|---|
| Korean character ratio (sample mode) | 87.3% |
| Minimal-pair grammar accuracy | 60.8% |
| Held-out perplexity | ~25 |
The Korean character ratio counts degenerate output, such as a repeated Korean token, as valid Korean text, so it includes some false positives; this is expected for a base model that has not learned chat formatting. For coherent Korean output, use the fine-tuned variant.
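One plausible implementation of the Korean character ratio is the fraction of non-whitespace characters in a sample that fall in the Hangul syllables block. The exact definition behind the number above is not stated, so this sketch is an assumption:

```python
def korean_char_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are Hangul syllables."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    # Hangul syllables occupy U+AC00..U+D7A3.
    hangul = sum("\uac00" <= c <= "\ud7a3" for c in chars)
    return hangul / len(chars)

print(korean_char_ratio("한국어 텍스트 with some English"))  # 6/21 ≈ 0.29
```

Note that output consisting of a single repeated Korean token scores 1.0 under this definition, which is consistent with the false-positive caveat above.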
## Limitations and Bias
- Small scale (35M): Limited reasoning, factual accuracy, and long-form coherence
- Single-language pretrain: No English or other language capability
- Web-derived data: May reflect biases present in Korean web text; no explicit safety filtering was applied
- Short pretrain (1 epoch on 467M tokens): Roughly 13 tokens per parameter, below the ~20 tokens-per-parameter Chinchilla-optimal ratio and well below modern practice
This model has not been aligned, RLHF'd, or safety-tuned. Do not deploy in user-facing applications without further training and review.
## License
Released under the Apache License 2.0. The underlying pretraining corpora are subject to their own licenses.
## Citation
@misc{hanforge_base_2026,
  author = {DongRyeol Lee},
  title  = {HanForge 35M: A Small Korean Language Model Pretrained from Scratch},
  year   = {2026},
  note   = {Pretrained on 467M Korean tokens with a 24k SentencePiece BPE tokenizer}
}