---
language:
- ko
license: apache-2.0
library_name: transformers
tags:
- korean
- causal-lm
- pretraining
- small-language-model
pipeline_tag: text-generation
---

# HanForge 35M (Korean Base)

HanForge 35M is a small Korean causal language model pretrained from scratch on **467M tokens** of Korean text. It is designed as a research-friendly base model for downstream fine-tuning. The model is **not instruction-tuned** and should not be used directly for chat or question answering — see [`drlee1/HanForge-47M-SFT`](https://huggingface.co/drlee1/HanForge-47M-SFT) for that.

## Model Details

| | |
|---|---|
| **Architecture** | Llama-style decoder (RMSNorm, RoPE, Grouped-Query Attention) |
| **Parameters** | 34.84M |
| **Hidden size** | 512 |
| **Layers** | 8 |
| **Attention heads** | 8 (KV heads: 2, GQA) |
| **Intermediate size** | 1408 |
| **Max position** | 4096 (RoPE θ = 50000) |
| **Vocab size** | 24,000 |
| **Tokenizer** | SentencePiece BPE, Korean-optimized (~2.17 chars/token) |
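
A minimal loading sketch, assuming the checkpoint is hosted under a repository id like `drlee1/HanForge-35M` (the exact id is not stated in this card; substitute the actual path). The config fields should line up with the table above:

```python
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

repo_id = "drlee1/HanForge-35M"  # assumed repo id; replace with the actual checkpoint path

config = AutoConfig.from_pretrained(repo_id)
print(config.hidden_size, config.num_hidden_layers)            # expected: 512, 8
print(config.num_attention_heads, config.num_key_value_heads)  # expected: 8, 2 (GQA)
print(config.intermediate_size, config.vocab_size)             # expected: 1408, 24000

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.2f}M parameters")  # ~34.84M
```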

## Intended Use

This model is intended for:

- **Continued fine-tuning** on Korean downstream tasks (instruction tuning, classification, etc.)
- **Korean text continuation** and language modeling research (see the generation sketch below)
- **Educational use** — exploring small language model training on a single language

It is **not** intended for:

- Direct chat or instruction following (use the fine-tuned variant)
- Production text generation without further training and safety review
- Tasks requiring factual accuracy, reasoning, or multilingual capability
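
Because this is a plain base model, it only continues text. A minimal generation sketch, again assuming the `drlee1/HanForge-35M` repo id; the prompt is an arbitrary Korean sentence opener, not a chat turn:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "drlee1/HanForge-35M"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)
model.eval()

prompt = "한국의 수도는"  # "The capital of Korea is" -- plain continuation, no chat template
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=64,
        do_sample=True,
        temperature=0.8,
        top_p=0.95,
    )
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```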

## Training Data

The model was pretrained on **467M raw tokens** of Korean text drawn from three publicly available sources:

| Source | Description |
|---|---|
| Wikipedia (Korean) | Encyclopedic articles, factual prose |
| FineWeb-2 (Korean subset) | Filtered Korean web text |
| korean-webtext-edu | Educational Korean web content |

The corpus was deduplicated, length-filtered, and tokenized with a Korean-optimized SentencePiece BPE (24k vocab) trained separately on the same data.
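
A quick way to sanity-check the quoted ~2.17 characters-per-token figure on your own text; the sample sentence below is illustrative and not drawn from the training corpus:

```python
from transformers import AutoTokenizer

repo_id = "drlee1/HanForge-35M"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(repo_id)

text = "대한민국은 동아시아의 한반도에 위치한 나라이다."  # any Korean sample text
token_ids = tokenizer.encode(text, add_special_tokens=False)
print(f"{len(text)} chars / {len(token_ids)} tokens "
      f"= {len(text) / len(token_ids):.2f} chars/token")
```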

## Training Procedure

| | |
|---|---|
| **Tokens seen** | 467M (1 epoch) |
| **Batch size (effective)** | 16 |
| **Optimizer** | AdamW (β1=0.9, β2=0.95, weight decay 0.1) |
| **Learning rate** | Cosine schedule, peak 3e-4 |
| **Sequence length** | 1024 |
| **Precision** | bf16 mixed precision |
| **Hardware** | Mac MPS / single GPU |
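
A sketch of the optimizer and schedule described above in plain PyTorch. The step counts are derived or assumed (warmup length is not stated in the card), so treat them as illustrative:

```python
import torch
from transformers import AutoModelForCausalLM, get_cosine_schedule_with_warmup

model = AutoModelForCausalLM.from_pretrained("drlee1/HanForge-35M")  # assumed repo id

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,              # peak learning rate from the table
    betas=(0.9, 0.95),
    weight_decay=0.1,
)

# 467M tokens / (16 sequences x 1024 tokens) ~= 28.5k optimizer steps for one epoch
total_steps = 28_500   # approximate, derived from the table
warmup_steps = 1_000   # assumption: the warmup length is not stated in the card
scheduler = get_cosine_schedule_with_warmup(optimizer, warmup_steps, total_steps)
```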

## Evaluation

The base model achieves the following on internal Korean evaluations:

| Metric | Value |
|---|---|
| Korean character ratio (sample mode) | 87.3% |
| Minimal-pair grammar accuracy | 60.8% |
| Held-out perplexity | ~25 |

> The Korean character ratio includes some false positives where the model produces repeated Korean tokens — this is expected for a base model that has not learned chat formatting. For coherent Korean output, use the fine-tuned variant.
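
The evaluation corpus and exact protocol are not specified here; the following is only a minimal sketch of the standard per-token cross-entropy computation that yields a held-out perplexity in this range, with a placeholder where your own eval set goes:

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "drlee1/HanForge-35M"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)
model.eval()

held_out_texts = ["...held-out Korean documents..."]  # placeholder: supply your own eval set

total_nll, total_tokens = 0.0, 0
with torch.no_grad():
    for text in held_out_texts:
        ids = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024).input_ids
        if ids.size(1) < 2:
            continue
        # labels == input_ids: the model returns mean cross-entropy over shifted targets
        loss = model(ids, labels=ids).loss
        n = ids.size(1) - 1  # number of predicted tokens after the shift
        total_nll += loss.item() * n
        total_tokens += n

print(f"perplexity = {math.exp(total_nll / total_tokens):.1f}")
```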

## Limitations and Bias

- **Small scale (35M)**: Limited reasoning, factual accuracy, and long-form coherence
- **Single-language pretraining**: No English or other-language capability
- **Web-derived data**: May reflect biases present in Korean web text; no explicit safety filtering was applied
- **Short pretraining run (1 epoch on 467M tokens)**: Roughly 13 tokens per parameter, well below modern best practice

This model has not been aligned, RLHF'd, or safety-tuned. Do not deploy in user-facing applications without further training and review.

## License

Released under the **Apache License 2.0**. The underlying pretraining corpora are subject to their own licenses.

## Citation

```bibtex
@misc{hanforge_base_2026,
  author = {DongRyeol Lee},
  title  = {HanForge 35M: A Small Korean Language Model Pretrained from Scratch},
  year   = {2026},
  note   = {Pretrained on 467M Korean tokens with a 24k SentencePiece BPE tokenizer}
}
```