---
language:
- ko
- lzh
license: cc-by-sa-4.0
library_name: transformers
tags:
- fill-mask
- text-generation
- sillok
- history
- korean-history
- classical-chinese
pipeline_tag: fill-mask
co2_eq_emissions:
  emissions: 1.1662
  source: codecarbon
  training_type: from_scratch
  geographical_location: South Korea, Seoul
  hardware_used: 1 x NVIDIA A100-PCIE-40GB
datasets:
- "VERITABLE RECORDS of the JOSEON DYNASTY"
---
# SillokBert-Scratch: A Korean Classical Language Model Trained from Scratch

SillokBert-Scratch is a BERT-based language model trained entirely from scratch, using only the original text data of the *Annals of the Joseon Dynasty* (조선왕조실록), without any pre-trained knowledge. This research project aimed to build a "pure-blood" language model that learns only the unique linguistic characteristics of the historical text, moving away from the conventional approach of fine-tuning large multilingual models for specific domains.
## Funding & Support

This model was developed as part of the "Development of an Intelligent Korean Studies Language Model based on Classical Korean Texts" project at the Digital Humanities Research Institute, The Academy of Korean Studies. The training environment was supported by the 2025 High-Performance Computing Support (GPU) Program (No. G2025-0450) of the National IT Industry Promotion Agency (NIPA) under the Ministry of Science and ICT. We sincerely appreciate the support in providing the high-performance computing environment essential for this research.
## Model Performance

| Model | Perplexity (PPL) | Note |
|---|---|---|
| SillokBert-Scratch (this model) | 1.4580 | Trained from scratch |
| SillokBert (previous fine-tuned model) | 4.1219 | Fine-tuned from `bert-base-multilingual-cased` |
| `bert-base-multilingual-cased` (baseline) | 132.5186 | Untrained on Sillok data |
This model reduces perplexity by 64.63% relative to the previous fine-tuned model, and by 98.90% relative to the untrained general-purpose `bert-base-multilingual-cased` baseline.
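These figures are relative perplexity reductions; a quick arithmetic check against the table above:

```python
# Relative perplexity reduction versus each baseline (values from the table above).
scratch, finetuned, baseline = 1.4580, 4.1219, 132.5186

print(f"vs. fine-tuned model: {(finetuned - scratch) / finetuned:.2%}")  # 64.63%
print(f"vs. untrained mBERT:  {(baseline - scratch) / baseline:.2%}")    # 98.90%
```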
## How to Use

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

# Specify the path to the model on Hugging Face.
model_name = "ddokbaro/SillokBert-Scratch"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Test sentence
text = '王이 傳曰, "近日 [MASK]事가 何?"'
inputs = tokenizer(text, return_tensors="pt")
token_logits = model(**inputs).logits

# Find the masked token's position and predict the most probable words.
mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
mask_token_logits = token_logits[0, mask_token_index, :]
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()

print(f"Top 5 most probable words for the [MASK] token in '{text}':")
for token_id in top_5_tokens:
    print(f" - {tokenizer.decode([token_id])}")
```
## Dataset Information

### Data Source and Collection

**Source Data**: Public Data Portal - National Institute of Korean History (Ministry of Education), Annals of the Joseon Dynasty Information, Original Sillok Texts: https://www.data.go.kr/data/15053647/fileData.do. We express our gratitude to the National Institute of Korean History for providing the invaluable data that formed the foundation of this research.
**Data Version and Reproducibility**: This research is based on the data registered on November 3, 2022. Because the official distributor may update the data, the complete set of original XML files used for training is provided as `raw_data/sillok_raw_xml.zip` to ensure full reproducibility. Preprocessed text files ready for immediate use (`train.txt`, `validation.txt`, `test.txt`) are available in the `preprocessed_data/` folder, and can be loaded as sketched below.
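A minimal sketch of loading these preprocessed line-per-example splits with the Hugging Face `datasets` library; the repository-relative paths follow the folder layout described above:

```python
from datasets import load_dataset

# Load the preprocessed line-per-example splits described above.
splits = load_dataset(
    "text",
    data_files={
        "train": "preprocessed_data/train.txt",
        "validation": "preprocessed_data/validation.txt",
        "test": "preprocessed_data/test.txt",
    },
)
print(splits)                      # DatasetDict with train/validation/test splits
print(splits["train"][0]["text"])  # first line of the training corpus
```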
### Path to Preprocessed Data

`/home/work/baro/sillok25060103/preprocessed_corpus/`
## Training Procedure

This model was developed through a four-stage process.
### Stage 1: Sillok-Specific Tokenizer Training

- **Algorithm**: After encountering an inexplicable training halt with the initial WordPiece approach, we ultimately adopted the BPE (Byte-Pair Encoding) algorithm. Byte-level pre-tokenization was used to minimize the occurrence of `[UNK]` tokens.
- **Preprocessing**: Considering the characteristics of classical texts, the preprocessing pipeline includes PUA (Private Use Area) code conversion and Unicode NFC normalization.
- **Vocabulary Size**: 500,000. A large vocabulary was constructed to represent as fully as possible the vast array of Hanja characters and unique terms found in the Sillok.
- **Final Output**: `sillok_tokenizer_bpe_preprocessed/tokenizer.json` (a training sketch follows this list)
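A minimal sketch of this tokenizer setup using the Hugging Face `tokenizers` library. The special-token list is an assumption, and the Sillok-specific PUA code conversion is a separate preprocessing step not shown here:

```python
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers

# Byte-level BPE with Unicode NFC normalization, as described above.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.NFC()
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

trainer = trainers.BpeTrainer(
    vocab_size=500_000,  # documented vocabulary size
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],  # assumed set
)
tokenizer.train(["preprocessed_data/train.txt"], trainer)
tokenizer.save("sillok_tokenizer_bpe_preprocessed/tokenizer.json")
```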
### Stage 2: Model Architecture Definition

- **Architecture**: The same architecture as BERT-base (12 layers, 768 hidden dimensions, 12 attention heads), but initialized from scratch as a blank-slate model without any pre-trained weights (see the sketch after this list).
- **Total Parameters**: 470,542,880. The 500,000-entry vocabulary enlarges the embedding layer, giving the model far more parameters than a standard BERT-base.
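A configuration sketch of this from-scratch initialization; any `BertConfig` values beyond those documented above are library defaults:

```python
from transformers import BertConfig, BertForMaskedLM

# BERT-base shape with the 500k-entry Sillok vocabulary; all weights
# are randomly initialized (no pre-trained checkpoint is loaded).
config = BertConfig(
    vocab_size=500_000,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
)
model = BertForMaskedLM(config)
print(f"{model.num_parameters():,}")  # ~470.5M, dominated by the embedding layer
```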
### Stage 3: Pre-training

- **Objective**: Masked Language Modeling (MLM)
- **Training Datasets**:
  - `train.txt`: 362,107 lines
  - `validation.txt`: 20,116 lines
- **Key Training Arguments** (a configuration sketch follows this list):
  - `output_dir`: `./sillokbert_scratch_pretraining_output`. The directory where model checkpoints and final outputs are saved during training.
  - `num_train_epochs`: `10`. The total number of passes over the entire training dataset.
  - `per_device_train_batch_size`: `4`. The number of training samples processed at once on a single GPU; set to a small value due to GPU memory limitations.
  - `gradient_accumulation_steps`: `4`. Gradients from 4 small batches are accumulated before each model update, maintaining an effective batch size of 16 (4 × 4) while significantly reducing memory usage.
  - `learning_rate`: `5e-5` (0.00005). Controls the speed of model learning; a standard value for the AdamW optimizer.
  - `warmup_steps`: `1000`. The number of steps over which the learning rate is linearly increased from 0 to its initial value, improving training stability at the start.
  - `weight_decay`: `0.01`. A regularization technique that prevents model weights from growing too large, mitigating overfitting.
  - `fp16`: `True`. Enables mixed-precision training with 16-bit floating-point numbers, reducing GPU memory usage and improving training speed.
  - `gradient_checkpointing`: `True`. Recomputes intermediate activations during the backward pass instead of storing them all during the forward pass, drastically reducing memory usage.
  - `eval_strategy`: `"steps"`. Evaluates the model on the validation dataset at the interval specified by `eval_steps`.
  - `eval_steps` / `save_steps` / `logging_steps`: `2000` / `2000` / `500`. The model is evaluated and saved every 2,000 steps, and training metrics are logged every 500 steps.
- **Final Output**: `sillokbert_scratch_pretraining_output/final_model`
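The arguments above map directly onto `transformers.TrainingArguments`; a minimal sketch (dataset preparation, the MLM data collator, and `Trainer` wiring are omitted):

```python
from transformers import TrainingArguments

# The documented pre-training hyperparameters, expressed as TrainingArguments.
training_args = TrainingArguments(
    output_dir="./sillokbert_scratch_pretraining_output",
    num_train_epochs=10,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # effective batch size: 4 * 4 = 16
    learning_rate=5e-5,
    warmup_steps=1000,
    weight_decay=0.01,
    fp16=True,
    gradient_checkpointing=True,
    eval_strategy="steps",
    eval_steps=2000,
    save_steps=2000,
    logging_steps=500,
)
```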
### Stage 4: Evaluation

- **Test Dataset**: `test.txt`
- **Evaluation Metric**: Perplexity (PPL)
- **Eval Loss**: 0.3770
- **Perplexity**: 1.4580 (a consistency check follows this list)
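Perplexity here is the exponential of the mean cross-entropy evaluation loss, so the two reported figures are consistent:

```python
import math

# Perplexity = exp(mean cross-entropy loss) on the test set.
eval_loss = 0.3770
print(math.exp(eval_loss))  # ~1.4579, matching the reported 1.4580 up to rounding
```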
## Baseline Models

This research used the results of the previous SillokBert fine-tuning project and the off-the-shelf `bert-base-multilingual-cased` model as its key performance baselines.

| Rank | Model | Perplexity (PPL) | Note |
|---|---|---|---|
| 1 | SillokBert-Scratch | 1.4580 | Trained from scratch |
| 2 | SillokBert (fine-tuned) | 4.1219 | Fine-tuned from `bert-base-multilingual-cased` |
| 3 | `bert-base-multilingual-cased` | 132.5186 | Untrained on Sillok data |
## Author

- Baro Kim (김바로), The Academy of Korean Studies
## Citation

```bibtex
@misc{kim2025sillokbertscratch,
  author = {Kim, Baro},
  title = {SillokBert-Scratch: Training a Pure-blood Sillok Language Model from Scratch},
  year = {2025},
  publisher = {Hugging Face},
  journal = {Hugging Face repository},
  howpublished = {\url{https://huggingface.co/ddokbaro/SillokBert-Scratch}}
}
```