---
language:
- ko
license: apache-2.0
library_name: transformers
tags:
- korean
- causal-lm
- pretraining
- small-language-model
pipeline_tag: text-generation
---

# HanForge 35M (Korean Base)

HanForge 35M is a small Korean causal language model pretrained from scratch on **467M tokens** of Korean text. It is designed as a research-friendly base model for downstream fine-tuning. The model is **not instruction-tuned** and should not be used directly for chat or question answering; see [`drlee1/HanForge-47M-SFT`](https://huggingface.co/drlee1/HanForge-47M-SFT) for that.
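
For plain text continuation the model loads through the standard Transformers API. A minimal sketch; the repository id below is a placeholder, since this card does not state the base model's own path:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "drlee1/HanForge-35M-Base"  # placeholder: substitute this model's actual repo id

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

# Base model: feed raw Korean text, not a chat template.
prompt = "한국어는"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=True,
    top_p=0.9,
    temperature=0.8,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```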

## Model Details

| | |
|---|---|
| **Architecture** | Llama-style decoder (RMSNorm, RoPE, Grouped-Query Attention) |
| **Parameters** | 34.84M |
| **Hidden size** | 512 |
| **Layers** | 8 |
| **Attention heads** | 8 (KV heads: 2, GQA) |
| **Intermediate size** | 1408 |
| **Max position** | 4096 (RoPE θ = 50000) |
| **Vocab size** | 24,000 |
| **Tokenizer** | SentencePiece BPE, Korean-optimized (~2.17 chars/token) |
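
These hyperparameters map directly onto a stock `LlamaConfig`. A sketch for reproducing the architecture; `tie_word_embeddings=True` is an assumption, included because tied input/output embeddings are what make the total come out near 34.84M:

```python
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    vocab_size=24_000,
    hidden_size=512,
    intermediate_size=1408,
    num_hidden_layers=8,
    num_attention_heads=8,
    num_key_value_heads=2,       # GQA: 8 query heads share 2 KV heads
    max_position_embeddings=4096,
    rope_theta=50_000.0,
    tie_word_embeddings=True,    # assumption: tied embeddings match the 34.84M count
)

model = LlamaForCausalLM(config)
print(f"{model.num_parameters() / 1e6:.2f}M parameters")  # ~34.84M with these dimensions
```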

## Intended Use

This model is intended for:

- **Continued fine-tuning** on Korean downstream tasks (instruction tuning, classification, etc.); a minimal sketch follows this list
- **Korean text continuation** and language modeling research
- **Educational use**: exploring small language model training on a single language
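
As a sketch of the first bullet, continued training with the Hugging Face `Trainer`. The repo id and `korean_corpus.txt` are placeholders, and the hyperparameters are illustrative rather than recommendations from the authors:

```python
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

repo_id = "drlee1/HanForge-35M-Base"  # placeholder repo id
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # the collator needs a pad token

# Hypothetical corpus with one document per line; substitute your own data.
dataset = load_dataset("text", data_files="korean_corpus.txt")["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="hanforge-ft",
        per_device_train_batch_size=8,
        num_train_epochs=1,
        learning_rate=5e-5,
        bf16=True,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```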

It is **not** intended for:

- Direct chat or instruction following (use the fine-tuned variant)
- Production text generation without further training and safety review
- Tasks requiring factual accuracy, reasoning, or multilingual capability

## Training Data

The model was pretrained on **467M raw tokens** of Korean text drawn from three publicly available sources:

| Source | Description |
|---|---|
| Wikipedia (Korean) | Encyclopedic articles, factual prose |
| FineWeb-2 (Korean subset) | Filtered Korean web text |
| korean-webtext-edu | Educational Korean web content |

The corpus was deduplicated, length-filtered, and tokenized with a Korean-optimized SentencePiece BPE (24k vocab) trained separately on the same data.
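
The exact tokenizer training flags are not published; a rough sketch of how a comparable 24k Korean BPE could be trained with the `sentencepiece` library (`korean_corpus.txt` is a placeholder for the prepared corpus):

```python
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="korean_corpus.txt",   # placeholder: deduplicated, length-filtered corpus
    model_prefix="hanforge_bpe",
    vocab_size=24_000,
    model_type="bpe",
    character_coverage=0.9995,   # a common choice for Korean and other CJK scripts
)

sp = spm.SentencePieceProcessor(model_file="hanforge_bpe.model")
text = "한국어 토크나이저 예시 문장입니다."
pieces = sp.encode(text, out_type=str)
print(pieces)
print(f"{len(text) / len(pieces):.2f} chars/token")  # the card reports ~2.17 on its corpus
```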

## Training Procedure

| | |
|---|---|
| **Tokens seen** | 467M (1 epoch) |
| **Batch size (effective)** | 16 |
| **Optimizer** | AdamW (β1=0.9, β2=0.95, weight decay 0.1) |
| **Learning rate** | Cosine schedule, peak 3e-4 |
| **Sequence length** | 1024 |
| **Precision** | bf16 mixed precision |
| **Hardware** | Mac MPS / single GPU |
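
In plain PyTorch the optimizer and schedule rows correspond to the setup below. The warmup length is an assumption; the card only states a cosine schedule with a 3e-4 peak:

```python
import math
import torch

model = torch.nn.Linear(512, 512)  # stand-in module; substitute the HanForge model

optimizer = torch.optim.AdamW(
    model.parameters(), lr=3e-4, betas=(0.9, 0.95), weight_decay=0.1
)

# 467M tokens / (batch 16 x seq len 1024) is roughly 28.5k optimizer steps.
total_steps = 467_000_000 // (16 * 1024)
warmup_steps = 1_000  # assumption: not stated on the card

def lr_lambda(step: int) -> float:
    if step < warmup_steps:
        return step / max(1, warmup_steps)  # linear warmup to the 3e-4 peak
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay to ~0

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```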

## Evaluation

The base model achieves the following on internal Korean evaluations:

| Metric | Value |
|---|---|
| Korean character ratio (sample mode) | 87.3% |
| Minimal-pair grammar accuracy | 60.8% |
| Held-out perplexity | ~25 |
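
The held-out evaluation setup is not published; perplexity of this kind is conventionally computed as below, with token-weighted averaging over documents:

```python
import math
import torch

@torch.no_grad()
def perplexity(model, tokenizer, texts, max_length=1024):
    model.eval()
    nll_sum, token_count = 0.0, 0
    for text in texts:
        enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=max_length)
        out = model(**enc, labels=enc["input_ids"])
        n = enc["input_ids"].numel()
        nll_sum += out.loss.item() * (n - 1)  # loss is averaged over the n-1 predicted tokens
        token_count += n - 1
    return math.exp(nll_sum / token_count)
```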

> The Korean character ratio includes some false positives where the model produces repeated Korean tokens; this is expected for a base model that has not learned chat formatting. For coherent Korean output, use the fine-tuned variant.
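
The metric's exact definition is not published either; a plausible implementation is the share of non-whitespace characters in the Hangul syllables block, which, as noted above, rewards repeated Hangul tokens regardless of coherence:

```python
def korean_char_ratio(text: str) -> float:
    """Share of non-whitespace characters in the Hangul syllables block (U+AC00..U+D7A3)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    hangul = sum(1 for c in chars if "\uac00" <= c <= "\ud7a3")
    return hangul / len(chars)

print(korean_char_ratio("안녕하세요, HanForge 35M입니다."))  # punctuation and ASCII lower the score
```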

## Limitations and Bias

- **Small scale (35M)**: Limited reasoning, factual accuracy, and long-form coherence
- **Single-language pretrain**: No English or other language capability
- **Web-derived data**: May reflect biases present in Korean web text; no explicit safety filtering was applied
- **Short pretrain (1 epoch on 467M tokens)**: Roughly 13 tokens per parameter, below the ~20 of compute-optimal ("Chinchilla") scaling and far below modern practice for small models

This model has not been aligned, RLHF'd, or safety-tuned. Do not deploy in user-facing applications without further training and review.

## License

Released under the **Apache License 2.0**. The underlying pretraining corpora are subject to their own licenses.

## Citation

```bibtex
@misc{hanforge_base_2026,
  author = {DongRyeol Lee},
  title  = {HanForge 35M: A Small Korean Language Model Pretrained from Scratch},
  year   = {2026},
  note   = {Pretrained on 467M Korean tokens with a 24k SentencePiece BPE tokenizer}
}
```