
HanForge 35M (Korean Base)

HanForge 35M is a small Korean causal language model pretrained from scratch on 467M tokens of Korean text. It is designed as a research-friendly base model for downstream fine-tuning. The model is not instruction-tuned and should not be used directly for chat or question answering — see drlee1/HanForge-47M-SFT for that.
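
Assuming the checkpoint is published in the standard Hugging Face Transformers (Llama-compatible) format, which the card does not state explicitly, plain Korean text continuation would look roughly like this sketch:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "drlee1/HanForge-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Base model: give it Korean text to continue, not an instruction or question.
prompt = "한국의 전통 음식은"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```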

Model Details

| Architecture | Llama-style decoder (RMSNorm, RoPE, Grouped-Query Attention) |
|---|---|
| Parameters | 34.84M |
| Hidden size | 512 |
| Layers | 8 |
| Attention heads | 8 (KV heads: 2, GQA) |
| Intermediate size | 1408 |
| Max position | 4096 (RoPE θ = 50000) |
| Vocab size | 24,000 |
| Tokenizer | SentencePiece BPE, Korean-optimized (~2.17 chars/token) |
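
For reference, the table above corresponds roughly to the following Llama-style configuration. This is a sketch using Transformers' LlamaConfig field names; the shipped config.json may differ:

```python
from transformers import LlamaConfig

# Sketch of a config matching the table above (not the shipped config.json).
config = LlamaConfig(
    vocab_size=24_000,
    hidden_size=512,
    num_hidden_layers=8,
    num_attention_heads=8,
    num_key_value_heads=2,          # GQA: 2 KV heads shared across 8 query heads
    intermediate_size=1408,
    max_position_embeddings=4096,
    rope_theta=50_000.0,
    tie_word_embeddings=True,       # assumption; needed to land near ~35M parameters
)
```

With tied input/output embeddings, these dimensions work out to roughly 34.8M parameters, consistent with the figure listed above.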

Intended Use

This model is intended for:

  • Continued fine-tuning on Korean downstream tasks (instruction tuning, classification, etc.; see the fine-tuning sketch below)
  • Korean text continuation and language modeling research
  • Educational use — exploring small language model training on a single language

It is not intended for:

  • Direct chat or instruction following (use the fine-tuned variant)
  • Production text generation without further training and safety review
  • Tasks requiring factual accuracy, reasoning, or multilingual capability
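
As a concrete starting point for continued fine-tuning, a minimal causal-LM training loop with Transformers might look like this. The dataset file, output directory, and hyperparameters are placeholders, not recommendations from the authors:

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_id = "drlee1/HanForge-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # needed for padding during collation

# Placeholder corpus: one Korean document per line in a local text file.
dataset = load_dataset("text", data_files={"train": "my_korean_corpus.txt"})["train"]
dataset = dataset.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=1024),
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="hanforge-finetuned",
        per_device_train_batch_size=8,
        num_train_epochs=1,
        learning_rate=1e-4,
        bf16=True,
    ),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```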

Training Data

The model was pretrained on 467M raw tokens of Korean text drawn from three publicly available sources:

| Source | Description |
|---|---|
| Wikipedia (Korean) | Encyclopedic articles, factual prose |
| FineWeb-2 (Korean subset) | Filtered Korean web text |
| korean-webtext-edu | Educational Korean web content |

The corpus was deduplicated, length-filtered, and tokenized with a Korean-optimized SentencePiece BPE (24k vocab) trained separately on the same data.
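
The exact tokenizer-training command is not published; a typical SentencePiece invocation for a 24k Korean BPE vocabulary, plus a quick check of the chars-per-token figure, would look roughly like this (file names and flags are illustrative):

```python
import sentencepiece as spm

# Train a 24k BPE vocabulary on the cleaned corpus (flags are assumptions).
spm.SentencePieceTrainer.train(
    input="korean_corpus_dedup.txt",
    model_prefix="hanforge_bpe_24k",
    vocab_size=24_000,
    model_type="bpe",
    character_coverage=0.9995,       # high coverage for the large Hangul syllable set
)

# Estimate compression on held-out text (should land near ~2.17 chars/token).
sp = spm.SentencePieceProcessor(model_file="hanforge_bpe_24k.model")
text = open("heldout_korean.txt", encoding="utf-8").read()
print(len(text) / len(sp.encode(text)))
```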

Training Procedure

| Tokens seen | 467M (1 epoch) |
|---|---|
| Batch size (effective) | 16 |
| Optimizer | AdamW (β1 = 0.9, β2 = 0.95, weight decay 0.1) |
| Learning rate | Cosine schedule, peak 3e-4 |
| Sequence length | 1024 |
| Precision | bf16 mixed precision |
| Hardware | Mac MPS / single GPU |
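
The optimizer setup above maps onto standard PyTorch/Transformers components roughly as follows; the warmup length is a guess, since it is not listed in the table:

```python
import torch
from transformers import AutoModelForCausalLM, get_cosine_schedule_with_warmup

model = AutoModelForCausalLM.from_pretrained("drlee1/HanForge-base")

optimizer = torch.optim.AdamW(
    model.parameters(), lr=3e-4, betas=(0.9, 0.95), weight_decay=0.1
)

# 467M tokens / (16 sequences x 1024 tokens per step) ≈ 28.5k optimizer steps.
total_steps = 28_500
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=1_000, num_training_steps=total_steps  # warmup is an assumption
)
```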

Evaluation

The base model achieves the following on internal Korean evaluations:

| Metric | Value |
|---|---|
| Korean character ratio (sample mode) | 87.3% |
| Minimal-pair grammar accuracy | 60.8% |
| Held-out perplexity | ~25 |
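
The minimal-pair benchmark itself is internal, but the standard scoring method is to compare the model's total log-likelihood of a grammatical sentence against a minimally different ungrammatical one. A sketch with a hypothetical pair:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "drlee1/HanForge-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

def total_logprob(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss          # mean NLL over predicted tokens
    return -loss.item() * (ids.shape[1] - 1)        # rescale to a total log-probability

good = "고양이가 물을 마신다."   # "The cat drinks water." (grammatical; hypothetical pair)
bad = "고양이가 물이 마신다."    # same sentence with the wrong object particle
print(total_logprob(good) > total_logprob(bad))     # counts as correct if True
```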

The Korean character ratio includes some false positives where the model produces repeated Korean tokens — this is expected for a base model that has not learned chat formatting. For coherent Korean output, use the fine-tuned variant.
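
For reference, a straightforward way to compute a Korean character ratio of the kind reported above (the exact internal definition is not published) is the share of non-whitespace characters falling in the Hangul syllable block:

```python
def korean_char_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are Hangul syllables (U+AC00-U+D7A3)."""
    chars = [c for c in text if not c.isspace()]
    hangul = [c for c in chars if "\uac00" <= c <= "\ud7a3"]
    return len(hangul) / max(len(chars), 1)

print(korean_char_ratio("한국어 텍스트 test"))  # 6 of 10 non-space chars are Hangul -> 0.6
```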

Limitations and Bias

  • Small scale (35M): Limited reasoning, factual accuracy, and long-form coherence
  • Single-language pretrain: No English or other language capability
  • Web-derived data: May reflect biases present in Korean web text; no explicit safety filtering was applied
  • Short pretrain (1 epoch on 467M tokens): roughly 13 tokens per parameter, below the ~20 tokens/parameter Chinchilla-optimal ratio and far below the heavily over-trained regimes typical of modern small models

This model has not been aligned, RLHF'd, or safety-tuned. Do not deploy in user-facing applications without further training and review.

License

Released under the Apache License 2.0. The underlying pretraining corpora are subject to their own licenses.

Citation

@misc{hanforge_base_2026,
  author = {DongRyeol Lee},
  title  = {HanForge 35M: A Small Korean Language Model Pretrained from Scratch},
  year   = {2026},
  note   = {Pretrained on 467M Korean tokens with a 24k SentencePiece BPE tokenizer}
}