---
language:
- ko
license: apache-2.0
library_name: transformers
tags:
- korean
- causal-lm
- pretraining
- small-language-model
pipeline_tag: text-generation
---
# HanForge 35M (Korean Base)
HanForge 35M is a small Korean causal language model pretrained from scratch on **467M tokens** of Korean text. It is designed as a research-friendly base model for downstream fine-tuning. The model is **not instruction-tuned** and should not be used directly for chat or question answering — see [`drlee1/HanForge-47M-SFT`](https://huggingface.co/drlee1/HanForge-47M-SFT) for that.
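A minimal text-continuation sketch with 🤗 Transformers is shown below; the hub id `drlee1/HanForge-base` is an assumption based on this card's path, so adjust it if the actual repo differs.

```python
# Minimal sketch: load the base model and sample a Korean continuation.
# The repo id below is an assumption from this card's path; adjust if needed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "drlee1/HanForge-base"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id, torch_dtype=torch.bfloat16)

prompt = "한국의 수도는 서울이며,"  # "The capital of Korea is Seoul, and ..."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=True,
    temperature=0.8,
    top_p=0.95,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Because this is a raw base model, expect free-form continuation of the prompt rather than answers or chat-style replies.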
## Model Details
| | |
|---|---|
| **Architecture** | Llama-style decoder (RMSNorm, RoPE, Grouped-Query Attention) |
| **Parameters** | 34.84M |
| **Hidden size** | 512 |
| **Layers** | 8 |
| **Attention heads** | 8 (KV heads: 2, GQA) |
| **Intermediate size** | 1408 |
| **Max position** | 4096 (RoPE θ = 50000) |
| **Vocab size** | 24,000 |
| **Tokenizer** | SentencePiece BPE, Korean-optimized (~2.17 chars/token) |
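For orientation, the table maps onto a Llama-style configuration roughly as sketched below. This is a reconstruction from the numbers above, not the shipped `config.json`; weight tying in particular is an assumption made so the total lands near 34.84M.

```python
# Rough reconstruction of the architecture from the table above.
# Not the shipped config.json; tie_word_embeddings is an assumption.
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    vocab_size=24_000,
    hidden_size=512,
    intermediate_size=1408,
    num_hidden_layers=8,
    num_attention_heads=8,        # head dim = 512 / 8 = 64
    num_key_value_heads=2,        # grouped-query attention
    max_position_embeddings=4096,
    rope_theta=50_000.0,
    tie_word_embeddings=True,     # assumed, so the count lands near ~35M
)

model = LlamaForCausalLM(config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.2f}M parameters")
# Per layer: attention 512*(512+128+128+512) ≈ 0.66M, MLP 3*512*1408 ≈ 2.16M.
# 8 layers ≈ 22.5M, plus tied embeddings 24k*512 ≈ 12.3M  →  ≈ 34.84M total.
```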
## Intended Use
This model is intended for:
- **Continued fine-tuning** on Korean downstream tasks (instruction tuning, classification, etc.)
- **Korean text continuation** and language modeling research
- **Educational use** — exploring small language model training on a single language
It is **not** intended for:
- Direct chat or instruction following (use the fine-tuned variant)
- Production text generation without further training and safety review
- Tasks requiring factual accuracy, reasoning, or multilingual capability
## Training Data
The model was pretrained on **467M raw tokens** of Korean text drawn from three publicly available sources:
| Source | Description |
|---|---|
| Wikipedia (Korean) | Encyclopedic articles, factual prose |
| FineWeb-2 (Korean subset) | Filtered Korean web text |
| korean-webtext-edu | Educational Korean web content |
The corpus was deduplicated, length-filtered, and tokenized with a Korean-optimized SentencePiece BPE (24k vocab) trained separately on the same data.
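For illustration, a tokenizer with these properties could be trained with the `sentencepiece` library roughly as follows; the file name and flag values are assumptions, not the exact recipe used here.

```python
# Illustrative sketch: train a 24k BPE tokenizer on the deduplicated Korean corpus.
# "korean_corpus.txt" and the flag values are assumptions, not the exact recipe.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="korean_corpus.txt",        # one document or sentence per line
    model_prefix="hanforge_ko_24k",
    vocab_size=24_000,
    model_type="bpe",
    character_coverage=0.9995,        # keep rare Hangul / Hanja characters
    byte_fallback=True,               # avoid <unk> on unseen characters
)

sp = spm.SentencePieceProcessor(model_file="hanforge_ko_24k.model")
text = "한국어 텍스트를 토큰으로 분할합니다."
pieces = sp.encode(text, out_type=str)
print(len(text) / len(pieces))  # rough chars/token; the ~2.17 above is a corpus-level average
```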
## Training Procedure
| | |
|---|---|
| **Tokens seen** | 467M (1 epoch) |
| **Batch size (effective)** | 16 |
| **Optimizer** | AdamW (β1=0.9, β2=0.95, weight decay 0.1) |
| **Learning rate** | Cosine schedule, peak 3e-4 |
| **Sequence length** | 1024 |
| **Precision** | bf16 mixed precision |
| **Hardware** | Mac MPS / single GPU |
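A sketch of this setup in PyTorch follows; the warmup and total step counts are placeholders, since the card does not state them.

```python
# Sketch of the AdamW + cosine-decay training step described above.
# Warmup/total steps are placeholders; the card does not specify them.
import torch
from transformers import LlamaConfig, LlamaForCausalLM, get_cosine_schedule_with_warmup

model = LlamaForCausalLM(LlamaConfig(          # the 35M config from "Model Details"
    vocab_size=24_000, hidden_size=512, intermediate_size=1408,
    num_hidden_layers=8, num_attention_heads=8, num_key_value_heads=2,
))
optimizer = torch.optim.AdamW(
    model.parameters(), lr=3e-4, betas=(0.9, 0.95), weight_decay=0.1
)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=1_000,        # placeholder
    num_training_steps=28_500,     # ≈ 467M tokens / (16 batch × 1024 seq len)
)

input_ids = torch.randint(0, 24_000, (16, 1024))   # one effective batch
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):  # use "cuda" on GPU
    loss = model(input_ids=input_ids, labels=input_ids).loss
loss.backward()
optimizer.step()
scheduler.step()
optimizer.zero_grad()
```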
## Evaluation
The base model achieves the following on internal Korean evaluations:
| Metric | Value |
|---|---|
| Korean character ratio (sample mode) | 87.3% |
| Minimal-pair grammar accuracy | 60.8% |
| Held-out perplexity | ~25 |
> The Korean character ratio can be inflated by degenerate repetition of Korean tokens, so it overstates output quality somewhat; this is expected for a base model that has not learned chat formatting. For coherent Korean output, use the fine-tuned variant.
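For reference, the Korean character ratio idea can be computed along these lines; this is an illustration of the metric, not the exact evaluation script.

```python
# Illustrative metric: fraction of non-whitespace characters that are Hangul.
# Mirrors the "Korean character ratio" idea; not the exact eval script used here.
def korean_char_ratio(text: str) -> float:
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    hangul = [
        c for c in chars
        if "\uac00" <= c <= "\ud7a3"      # Hangul syllables
        or "\u3131" <= c <= "\u318e"      # Hangul compatibility jamo
    ]
    return len(hangul) / len(chars)

print(korean_char_ratio("한국어 텍스트 예시입니다."))  # ≈ 0.92 (the final "." is not Hangul)
```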
## Limitations and Bias
- **Small scale (35M)**: Limited reasoning, factual accuracy, and long-form coherence
- **Single-language pretrain**: No English or other language capability
- **Web-derived data**: May reflect biases present in Korean web text; no explicit safety filtering was applied
- **Short pretrain (1 epoch on 467M tokens)**: Roughly 13 tokens per parameter, below the ~20 tokens/parameter Chinchilla-optimal ratio and far below the much larger token budgets now common for small models
This model has not been aligned, RLHF'd, or safety-tuned. Do not deploy in user-facing applications without further training and review.
## License
Released under the **Apache License 2.0**. The underlying pretraining corpora are subject to their own licenses.
## Citation
```bibtex
@misc{hanforge_base_2026,
  author = {DongRyeol Lee},
  title  = {HanForge 35M: A Small Korean Language Model Pretrained from Scratch},
  year   = {2026},
  note   = {Pretrained on 467M Korean tokens with a 24k SentencePiece BPE tokenizer}
}
```