---
language:
- ko
- lzh
license: cc-by-sa-4.0
library_name: transformers
tags:
- fill-mask
- text-generation
- sillok
- history
- korean-history
- classical-chinese
pipeline_tag: fill-mask
datasets:
- "Veritable Records of the Joseon Dynasty"
co2_eq_emissions:
  emissions: 1.1662
  source: "codecarbon"
  training_type: "from_scratch"
  geographical_location: "South Korea, Seoul"
  hardware_used: "1 x NVIDIA A100-PCIE-40GB"
---

# SillokBert-Scratch: A Korean Classical Language Model Trained from Scratch

`SillokBert-Scratch` is a BERT-based language model trained entirely from scratch, with no pre-trained starting point, using only the original text of the "Annals of the Joseon Dynasty" (Joseon wangjo sillok). This research project aimed to build a **'pure-blood' (純血) language model** that learns the linguistic characteristics unique to the Sillok corpus, departing from the conventional approach of fine-tuning a large multilingual model on a specific domain.

### **Funding & Support**

This model was developed as part of the "Development of an Intelligent Korean Studies Language Model Based on Classical Korean Texts" project at the Digital Humanities Research Institute, The Academy of Korean Studies. The training environment was supported by the 2025 High-Performance Computing Support (GPU) Program (No. G2025-0450) of the National IT Industry Promotion Agency (NIPA), under the Ministry of Science and ICT. We sincerely thank NIPA for providing the high-performance computing environment essential to this research.

## **Model Performance**

| Model | Perplexity (PPL) | Note |
| :--- | :---: | :--- |
| **SillokBert-Scratch (This Model)** | **1.4580** | **Trained from scratch** |
| SillokBert (previous fine-tuned model) | 4.1219 | Fine-tuned from `bert-base-multilingual-cased` |
| `bert-base-multilingual-cased` (baseline) | 132.5186 | Not trained on Sillok data |

This model improves perplexity by **64.63%** over the previous fine-tuned model and by **98.90%** over the untouched `bert-base-multilingual-cased` baseline, demonstrating a decisive advantage for domain-specific pre-training from scratch.

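The improvement percentages follow directly from the PPL values in the table above; a minimal sanity check, using only those reported numbers:

```python
# PPL values reported in the table above.
ppl_scratch = 1.4580
ppl_finetuned = 4.1219
ppl_baseline = 132.5186

def improvement(old: float, new: float) -> float:
    """Relative PPL reduction, as a percentage of the old value."""
    return (old - new) / old * 100

print(f"{improvement(ppl_finetuned, ppl_scratch):.2f}%")  # vs fine-tuned: ~64.63%
print(f"{improvement(ppl_baseline, ppl_scratch):.2f}%")   # vs baseline:   ~98.90%
```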
## **How to Use**

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

# Path to the model on the Hugging Face Hub.
model_name = "ddokbaro/SillokBert-Scratch"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Test sentence: a Sillok-style mixed-script line with one masked token.
text = "王이 傳曰, “近日 [MASK]事가 何?”"

inputs = tokenizer(text, return_tensors="pt")
token_logits = model(**inputs).logits

# Find the masked token's position and predict the most probable tokens.
mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
mask_token_logits = token_logits[0, mask_token_index, :]
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()

print(f"Top 5 most probable tokens for the [MASK] position in '{text}':")
for token_id in top_5_tokens:
    print(f" - {tokenizer.decode([token_id])}")
```

## **Dataset Information**

#### **Data Source and Collection**

**Source Data**: Public Data Portal, National Institute of Korean History (Ministry of Education), "Annals of the Joseon Dynasty Information: Original Sillok Texts" (<https://www.data.go.kr/data/15053647/fileData.do>). We express our gratitude to the National Institute of Korean History for providing the invaluable data that formed the foundation of this research.

**Data Version and Reproducibility**: This research is based on the dataset registered on November 3, 2022. Because the officially distributed data may be updated, the complete set of original XML files used for training is provided as `raw_data/sillok_raw_xml.zip` to ensure full reproducibility. Ready-to-use preprocessed text files (`train.txt`, `validation.txt`, `test.txt`) are also available in the `preprocessed_data/` folder.

#### **Path to Preprocessed Data**

`/home/work/baro/sillok25060103/preprocessed_corpus/`

## **Training Procedure**

This model was developed through a four-stage process.

### **Stage 1: Sillok-Specific Tokenizer Training**

* **Algorithm:**
  After the initial `WordPiece` approach halted training for reasons we could not explain, we ultimately adopted the `BPE (Byte-Pair Encoding)` algorithm. Byte-level pre-tokenization was used to minimize the occurrence of `[UNK]` tokens.

* **Preprocessing:**
  Considering the characteristics of classical texts, the preprocessing pipeline applies **PUA (Private Use Area) code conversion** and **Unicode NFC normalization**.

* **Vocabulary Size:**
  `500,000`. A large vocabulary was constructed to represent as much as possible of the vast inventory of Hanja characters and terms unique to the Sillok.

* **Final Output:** `sillok_tokenizer_bpe_preprocessed/tokenizer.json`

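The tokenizer recipe above (byte-level BPE with NFC normalization and a 500,000-entry vocabulary) can be sketched with the Hugging Face `tokenizers` library. This is a minimal illustration, not the project's actual script: the special-token list, `unk_token` choice, and the tiny stand-in corpus are assumptions, and the PUA code conversion step is omitted.

```python
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers

# BPE model with byte-level pre-tokenization, as described above.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.NFC()  # Unicode NFC normalization
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

trainer = trainers.BpeTrainer(
    vocab_size=500_000,  # large vocabulary for Hanja coverage
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],  # assumed set
)

# The real run trains on the preprocessed Sillok corpus files;
# a tiny in-memory corpus stands in here so the sketch is runnable.
corpus = ["上曰 近日 雨澤 如何", "承旨 啓曰 諸道 農形 尙未 登聞"]
tokenizer.train_from_iterator(corpus, trainer)
tokenizer.save("tokenizer.json")
```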
### **Stage 2: Model Architecture Definition**

* **Architecture:**
  The same architecture as `BERT-base` (12 layers, 768 hidden size, 12 attention heads), but initialized as a blank-slate model from scratch, without any pre-trained weights.

* **Total Parameters:**
  `470,542,880`. The 500,000-entry vocabulary makes the embedding layer far larger than usual, so the model has many more parameters than a standard BERT-base.

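The parameter count is dominated by the token-embedding matrix; a quick back-of-the-envelope check using the dimensions above:

```python
vocab_size = 500_000
hidden_size = 768

# Token-embedding parameters alone.
embedding_params = vocab_size * hidden_size  # 384,000,000
total_params = 470_542_880                   # reported total

# The embedding matrix accounts for roughly 82% of all parameters,
# versus about 21% for standard BERT-base (vocab ~30K, ~110M params).
share = embedding_params / total_params
print(f"{share:.1%}")
```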
### **Stage 3: Pre-training**

* **Objective:** Masked Language Modeling (MLM)

* **Training Datasets:**
  * `train.txt`: 362,107 lines
  * `validation.txt`: 20,116 lines

* **Key Training Arguments:**
  * **`output_dir`**: `./sillokbert_scratch_pretraining_output`
    The directory where model checkpoints and final outputs are saved during training.
  * **`num_train_epochs`**: `10`
    The number of full passes over the training dataset.
  * **`per_device_train_batch_size`**: `4`
    The number of training samples processed at once on a single GPU; kept small due to GPU memory limits.
  * **`gradient_accumulation_steps`**: `4`
    Gradients from 4 small batches are accumulated before each model update, maintaining an effective batch size of 16 (`4 * 4`) while greatly reducing memory usage.
  * **`learning_rate`**: `5e-5` (0.00005)
    A standard learning rate for the AdamW optimizer.
  * **`warmup_steps`**: `1000`
    The number of steps over which the learning rate is linearly increased from 0 to its initial value, improving stability at the start of training.
  * **`weight_decay`**: `0.01`
    A regularization term that keeps model weights from growing too large, mitigating overfitting.
  * **`fp16`**: `True`
    Enables mixed-precision training with 16-bit floating point, reducing GPU memory usage and improving training speed.
  * **`gradient_checkpointing`**: `True`
    Recomputes intermediate activations during the backward pass instead of storing them all in the forward pass, sharply reducing memory usage.
  * **`eval_strategy`**: `"steps"`
    Evaluates the model on the validation dataset at the interval specified by `eval_steps`.
  * **`eval_steps`** / **`save_steps`** / **`logging_steps`**: `2000` / `2000` / `500`
    Evaluates and saves the model every 2,000 steps, and logs training metrics every 500 steps.

* **Final Output:** `sillokbert_scratch_pretraining_output/final_model`

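From the arguments above, the effective batch size and rough step counts follow directly; a small sanity check, assuming one training example per line of `train.txt`:

```python
import math

per_device_train_batch_size = 4
gradient_accumulation_steps = 4
num_train_epochs = 10
train_examples = 362_107  # lines in train.txt

# Effective batch size seen by each optimizer update.
effective_batch = per_device_train_batch_size * gradient_accumulation_steps  # 16

# Approximate optimizer steps per epoch and in total.
steps_per_epoch = math.ceil(train_examples / effective_batch)  # 22,632
total_steps = steps_per_epoch * num_train_epochs               # 226,320

# The 1,000 warmup steps therefore cover well under 1% of training.
print(effective_batch, steps_per_epoch, total_steps)
```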
### **Stage 4: Evaluation**

* **Test Dataset:** `test.txt`

* **Evaluation Metric:** Perplexity (PPL)

* **Eval Loss:** `0.3770`

* **Perplexity:** `1.4580`

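The reported perplexity is simply the exponential of the evaluation (cross-entropy) loss:

```python
import math

eval_loss = 0.3770
perplexity = math.exp(eval_loss)  # ~1.458, matching the reported PPL
print(f"{perplexity:.4f}")
```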
## **Baseline Models**

This research used the results of the previous `SillokBert` fine-tuning project and the off-the-shelf `bert-base-multilingual-cased` model as its key performance baselines.

| Rank | Model | Perplexity (PPL) | Note |
| :--- | :--- | :---: | :--- |
| 1 | **SillokBert-Scratch** | **1.4580** | **Trained from scratch** |
| 2 | SillokBert (fine-tuned) | 4.1219 | Fine-tuned from `bert-base-multilingual-cased` |
| 3 | `bert-base-multilingual-cased` | 132.5186 | Not trained on Sillok data |

## **Author**

* Baro Kim (김바로), The Academy of Korean Studies

## **Citation**

```bibtex
@misc{kim2025sillokbertscratch,
  author       = {Kim, Baro},
  title        = {SillokBert-Scratch: Training a Pure-blood Sillok Language Model from Scratch},
  year         = {2025},
  publisher    = {Hugging Face},
  journal      = {Hugging Face repository},
  howpublished = {\url{https://huggingface.co/ddokbaro/SillokBert-Scratch}}
}
```