
SillokBert-Scratch: A Korean Classical Language Model Trained from Scratch

SillokBert-Scratch is a BERT-based language model trained entirely from scratch, using only the original text data of the "Annals of the Joseon Dynasty" (์กฐ์„ ์™•์กฐ์‹ค๋ก). This research project aimed to build a 'pure-blood' language model that learns the unique linguistic characteristics of the historical text, moving away from the conventional approach of fine-tuning large multilingual models for specific domains.

Funding & Support

๋ณธ ๋ชจ๋ธ์€ ํ•œ๊ตญํ•™์ค‘์•™์—ฐ๊ตฌ์› ๋””์ง€ํ„ธ์ธ๋ฌธํ•™์—ฐ๊ตฌ์†Œ์˜ "ํ•œ๊ตญ ๊ณ ์ „ ๋ฌธํ—Œ ๊ธฐ๋ฐ˜ ์ง€๋Šฅํ˜• ํ•œ๊ตญํ•™ ์–ธ์–ด๋ชจ๋ธ ๊ฐœ๋ฐœ" ํ”„๋กœ์ ํŠธ์˜ ์ผํ™˜์œผ๋กœ ๊ฐœ๋ฐœ๋˜์—ˆ์Šต๋‹ˆ๋‹ค. ๋ณธ ๋ชจ๋ธ์˜ ํ•™์Šต ํ™˜๊ฒฝ์€ ๊ณผํ•™๊ธฐ์ˆ ์ •๋ณดํ†ต์‹ ๋ถ€ ์ •๋ณดํ†ต์‹ ์‚ฐ์—…์ง„ํฅ์›์˜ 2025๋…„ ๊ณ ์„ฑ๋Šฅ์ปดํ“จํŒ…์ง€์›(GPU) ์‚ฌ์—…(G2025-0450)์˜ ์ง€์›์„ ๋ฐ›์•˜์Šต๋‹ˆ๋‹ค. ์—ฐ๊ตฌ์— ํ•„์ˆ˜์ ์ธ ๊ณ ์„ฑ๋Šฅ ์ปดํ“จํŒ… ํ™˜๊ฒฝ์„ ์ง€์›ํ•ด์ฃผ์…”์„œ ์ง„์‹ฌ์œผ๋กœ ๊ฐ์‚ฌ๋“œ๋ฆฝ๋‹ˆ๋‹ค.

This model was developed as part of the "Development of an Intelligent Korean Studies Language Model based on Classical Korean Texts" project at the Digital Humanities Research Institute, The Academy of Korean Studies. The training environment for this model was supported by the 2025 High-Performance Computing Support (GPU) Program of the National IT Industry Promotion Agency (NIPA) (No. G2025-0450). We sincerely appreciate the support for providing the high-performance computing environment essential for our research.

๋ชจ๋ธ ์„ฑ๋Šฅ (Model Performance)

๋ชจ๋ธ (Model) Perplexity (PPL) ๋น„๊ณ  (Note)
SillokBert-Scratch (This Model) 1.4580 Trained from Scratch
SillokBert (Previous Fine-tuned Model) 4.1219 Fine-tuned from bert-base-multilingual-cased
bert-base-multilingual-cased (Baseline) 132.5186 Untrained on Sillok data

๋ณธ ๋ชจ๋ธ์€ ๊ธฐ์กด์˜ ํŒŒ์ธํŠœ๋‹ ๊ธฐ๋ฐ˜ ๋ชจ๋ธ ๋Œ€๋น„ Perplexity๋ฅผ 64.63% ํ–ฅ์ƒ์‹œ์ผฐ์œผ๋ฉฐ, ์•„๋ฌด ํ›ˆ๋ จ๋„ ๊ฑฐ์น˜์ง€ ์•Š์€ ๋ฒ”์šฉ BERT-base ๋ชจ๋ธ๋ณด๋‹ค๋Š” 98.90% ํ–ฅ์ƒ๋œ ์••๋„์ ์ธ ์„ฑ๋Šฅ์„ ๋ณด์ž…๋‹ˆ๋‹ค.

This model achieved a 64.63% improvement in Perplexity compared to the previous fine-tuned model, and a 98.90% improvement over the baseline bert-base-multilingual-cased model, demonstrating its overwhelming superiority.
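
These percentages are the relative reductions in perplexity computed from the table above; a quick check:

# Relative perplexity reduction vs. each baseline (values from the table above).
ppl_scratch, ppl_finetuned, ppl_baseline = 1.4580, 4.1219, 132.5186

print(f"{(1 - ppl_scratch / ppl_finetuned) * 100:.2f}%")  # 64.63%
print(f"{(1 - ppl_scratch / ppl_baseline) * 100:.2f}%")   # 98.90%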

๋ชจ๋ธ ์‚ฌ์šฉ๋ฒ• (How to Use)

from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

# Specify the model path on Hugging Face.
model_name = "ddokbaro/SillokBert-Scratch"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
model.eval()

# Test sentence: insert the tokenizer's own mask token so the example works
# regardless of which mask string this tokenizer is configured with.
text = f"็Ž‹์ด ๅ‚ณๆ›ฐ, โ€œ่ฟ‘ๆ—ฅ {tokenizer.mask_token}ไบ‹๊ฐ€ ไฝ•?โ€"

inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    token_logits = model(**inputs).logits

# Find the masked token's position and predict the most probable words.
mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
mask_token_logits = token_logits[0, mask_token_index, :]
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()

print(f"Top 5 most probable words for the mask token in '{text}':")
for token_id in top_5_tokens:
    print(f"  - {tokenizer.decode([token_id])}")

๋ฐ์ดํ„ฐ์…‹ ์ •๋ณด (Dataset Information)

๋ฐ์ดํ„ฐ ์ถœ์ฒ˜ ๋ฐ ์ˆ˜์ง‘ (Data Source and Collection)

์›์ฒœ ๋ฐ์ดํ„ฐ (Source Data): ๊ณต๊ณต๋ฐ์ดํ„ฐํฌํ„ธ - ๊ต์œก๋ถ€ ๊ตญ์‚ฌํŽธ์ฐฌ์œ„์›ํšŒ_์กฐ์„ ์™•์กฐ์‹ค๋ก ์ •๋ณด_์‹ค๋ก์›๋ฌธ https://www.data.go.kr/data/15053647/fileData.do. ์—ฐ๊ตฌ์˜ ํ† ๋Œ€๊ฐ€ ๋œ ๊ท€์ค‘ํ•œ ์ž๋ฃŒ๋ฅผ ์ œ๊ณตํ•ด์ฃผ์‹  ๊ต์œก๋ถ€ ๊ตญ์‚ฌํŽธ์ฐฌ์œ„์›ํšŒ ์ธก์— ๊ฐ์‚ฌ์˜ ๋ง์”€์„ ์ „ํ•ฉ๋‹ˆ๋‹ค.

Source Data: Public Data Portal - National Institute of Korean History (Ministry of Education)_Annals of the Joseon Dynasty Information_Original Sillok Texts https://www.data.go.kr/data/15053647/fileData.do. We express our gratitude to the National Institute of Korean History for providing the invaluable data that formed the foundation of this research.

๋ฐ์ดํ„ฐ ๋ฒ„์ „ ๋ฐ ์žฌํ˜„์„ฑ (Data Version and Reproducibility): ๋ณธ ์—ฐ๊ตฌ๋Š” 2022๋…„ 11์›” 03์ผ์— ๋“ฑ๋ก๋œ ๋ฐ์ดํ„ฐ๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•ฉ๋‹ˆ๋‹ค. ๊ณต์‹ ๋ฐฐํฌ์ฒ˜์˜ ๋ฐ์ดํ„ฐ๊ฐ€ ์—…๋ฐ์ดํŠธ๋  ์ˆ˜ ์žˆ์–ด, ์™„๋ฒฝํ•œ ์žฌํ˜„์„ฑ์„ ๋ณด์žฅํ•˜๊ธฐ ์œ„ํ•ด ํ•™์Šต์— ์‚ฌ์šฉ๋œ ์›๋ณธ XML ํŒŒ์ผ ์ „์ฒด๋ฅผ raw_data/sillok_raw_xml.zip ํŒŒ์ผ๋กœ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค. ๋˜ํ•œ, ์ฆ‰์‹œ ํ™œ์šฉ ๊ฐ€๋Šฅํ•œ ์ „์ฒ˜๋ฆฌ ์™„๋ฃŒ ํ…์ŠคํŠธ ํŒŒ์ผ(train.txt, validation.txt, test.txt)์€ preprocessed_data/ ํด๋”์—์„œ ํ™•์ธํ•˜์‹ค ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

Data Version and Reproducibility: This research is based on the data registered on November 3, 2022. As the data from the official distributor may be updated, the entire original XML files used for training are provided as raw_data/sillok_raw_xml.zip to ensure complete reproducibility. Additionally, the preprocessed text files (train.txt, validation.txt, test.txt) ready for immediate use are available in the preprocessed_data/ folder.
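
A minimal sketch of loading the preprocessed splits with the Hugging Face datasets library, assuming the three files from preprocessed_data/ have been downloaded into the working directory:

from datasets import load_dataset

# One Sillok passage per line in each plain-text split.
dataset = load_dataset(
    "text",
    data_files={
        "train": "train.txt",
        "validation": "validation.txt",
        "test": "test.txt",
    },
)
print(dataset)                      # DatasetDict with the three splits
print(dataset["train"][0]["text"])  # first training line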

์ „์ฒ˜๋ฆฌ๋œ ๋ฐ์ดํ„ฐ ๊ฒฝ๋กœ (Path to Preprocessed Data)

/home/work/baro/sillok25060103/preprocessed_corpus/

Training Procedure

๋ณธ ๋ชจ๋ธ์€ ์ด 4๋‹จ๊ณ„์˜ ๊ณผ์ •์„ ๊ฑฐ์ณ ๊ฐœ๋ฐœ๋˜์—ˆ์Šต๋‹ˆ๋‹ค. / This model was developed through a four-stage process.

Stage 1: Sillok-Specific Tokenizer Training

  • Algorithm: After the initial WordPiece approach halted during training for reasons we could not determine, we adopted the BPE (Byte-Pair Encoding) algorithm. Byte-level pre-tokenization was used to minimize the occurrence of [UNK] tokens (a training sketch follows this list).

  • Preprocessing: Given the characteristics of the classical text, the preprocessing pipeline applied PUA (Private Use Area) code conversion and Unicode NFC normalization.

  • Vocabulary Size: 500,000. A large vocabulary was built to cover as much as possible of the vast range of Hanja characters and domain-specific terms in the Sillok.

  • Final Output: sillok_tokenizer_bpe_preprocessed/tokenizer.json
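
The following is a minimal sketch of this stage using the tokenizers library; the corpus file names, special tokens, and the omitted PUA mapping are assumptions rather than the exact training script.

import os
import unicodedata
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

# NFC-normalize the corpus line by line (the PUA code conversion step is
# project-specific and not reproduced here).
def normalize_file(src: str, dst: str) -> None:
    with open(src, encoding="utf-8") as f_in, open(dst, "w", encoding="utf-8") as f_out:
        for line in f_in:
            f_out.write(unicodedata.normalize("NFC", line))

normalize_file("train.txt", "train_nfc.txt")

# Byte-level BPE tokenizer with a 500,000-token vocabulary.
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=500_000,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.train(["train_nfc.txt"], trainer)

os.makedirs("sillok_tokenizer_bpe_preprocessed", exist_ok=True)
tokenizer.save("sillok_tokenizer_bpe_preprocessed/tokenizer.json")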

Stage 2: Model Architecture Definition

  • Architecture: The same architecture as BERT-base (12 layers, hidden size 768, 12 attention heads), initialized from scratch as a 'blank slate' model with no pre-trained weights (a configuration sketch follows this list).

  • ์ด ํŒŒ๋ผ๋ฏธํ„ฐ ์ˆ˜ (Total Parameters): 470,542,880. ์–ดํœ˜์ง‘ ํฌ๊ธฐ 50๋งŒ์— ๋”ฐ๋ผ ์ž„๋ฒ ๋”ฉ ๋ ˆ์ด์–ด๊ฐ€ ์ปค์ ธ, ์ผ๋ฐ˜์ ์ธ BERT-base๋ณด๋‹ค ํŒŒ๋ผ๋ฏธํ„ฐ ์ˆ˜๊ฐ€ ๋งŽ์Šต๋‹ˆ๋‹ค. 470,542,880. Due to the large vocabulary size of 500,000, the embedding layer is larger, resulting in more parameters than a standard BERT-base model.

Stage 3: Pre-training

  • Objective: Masked Language Modeling (MLM)

  • ํ›ˆ๋ จ ๋ฐ์ดํ„ฐ์…‹ (Training Datasets):

    • train.txt: 362,107 lines
    • validation.txt: 20,116 lines
  • ์ฃผ์š” ํ›ˆ๋ จ ์ธ์ž (Key Training Arguments):

    • output_dir: ./sillokbert_scratch_pretraining_output. The directory where model checkpoints and final outputs are saved during training.
    • num_train_epochs: 10. The entire training dataset is iterated over 10 times.
    • per_device_train_batch_size: 4. The number of training samples processed at once on a single GPU; kept small because of GPU memory limits.
    • gradient_accumulation_steps: 4. Gradients from 4 small batches are accumulated before each model update, keeping an effective batch size of 16 (4 * 4) while greatly reducing memory usage.
    • learning_rate: 5e-5 (0.00005). Controls how fast the model learns; a standard value for the AdamW optimizer.
    • warmup_steps: 1000. The number of steps over which the learning rate is increased linearly from 0 to its initial value, improving stability early in training.
    • weight_decay: 0.01. A regularization technique that keeps model weights from growing too large, mitigating overfitting.
    • fp16: True. Enables mixed-precision training with 16-bit floating-point numbers, reducing GPU memory usage and improving training speed.
    • gradient_checkpointing: True. Recomputes intermediate activations during the backward pass instead of storing them all during the forward pass, drastically reducing memory usage.
    • eval_strategy: "steps". Evaluates the model on the validation set at the interval specified by eval_steps.
    • eval_steps / save_steps / logging_steps: 2000 / 2000 / 500. The model is evaluated and saved every 2,000 steps, and training metrics are logged every 500 steps.
  • Final Output: sillokbert_scratch_pretraining_output/final_model
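
A minimal sketch of how these arguments map onto transformers' Trainer; the tokenization step, max_length, and masking probability are assumptions, and model and tokenizer refer to the from-scratch model and Sillok BPE tokenizer from the previous stages.

from datasets import load_dataset
from transformers import DataCollatorForLanguageModeling, Trainer, TrainingArguments

# Load and tokenize the pre-training corpus (one passage per line).
raw = load_dataset("text", data_files={"train": "train.txt", "validation": "validation.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])

# Dynamic masking for the MLM objective (default 15% masking probability).
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True)

training_args = TrainingArguments(
    output_dir="./sillokbert_scratch_pretraining_output",
    num_train_epochs=10,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=5e-5,
    warmup_steps=1000,
    weight_decay=0.01,
    fp16=True,
    gradient_checkpointing=True,
    eval_strategy="steps",
    eval_steps=2000,
    save_steps=2000,
    logging_steps=500,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=data_collator,
)
trainer.train()
trainer.save_model("./sillokbert_scratch_pretraining_output/final_model")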

Stage 4: Evaluation

  • ํ‰๊ฐ€ ๋ฐ์ดํ„ฐ์…‹ (Test Dataset): test.txt
  • ํ‰๊ฐ€ ์ง€ํ‘œ (Evaluation Metric): Perplexity (PPL)
  • Eval Loss: 0.3770
  • Perplexity: 1.4580
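
Perplexity here is the exponential of the mean evaluation (cross-entropy) loss, so the two figures above are consistent; a quick check:

import math

eval_loss = 0.3770
print(f"{math.exp(eval_loss):.4f}")  # ~1.458, matching the reported PPL up to rounding of the loss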

Baseline Models

๋ณธ ์—ฐ๊ตฌ๋Š” ์ด์ „ SillokBert ํŒŒ์ธํŠœ๋‹ ํ”„๋กœ์ ํŠธ์˜ ๊ฒฐ๊ณผ๋ฌผ๊ณผ, ์•„๋ฌด๋Ÿฐ ์‚ฌ์ „ํ•™์Šต๋„ ๊ฑฐ์น˜์ง€ ์•Š์€ bert-base-multilingual-cased๋ฅผ ํ•ต์‹ฌ ์„ฑ๋Šฅ ๊ธฐ์ค€์„ (Baseline)์œผ๋กœ ํ™œ์šฉํ–ˆ์Šต๋‹ˆ๋‹ค. This research utilized the results from the previous SillokBert fine-tuning project and the off-the-shelf bert-base-multilingual-cased model as key performance baselines.

| Rank | Model | Perplexity (PPL) | Note |
| --- | --- | --- | --- |
| 1 | SillokBert-Scratch | 1.4580 | Trained from scratch |
| 2 | SillokBert (fine-tuned) | 4.1219 | Fine-tuned from bert-base-multilingual-cased |
| 3 | bert-base-multilingual-cased | 132.5186 | Not trained on Sillok data |

Author

  • ๊น€๋ฐ”๋กœ (Baro Kim), ํ•œ๊ตญํ•™์ค‘์•™์—ฐ๊ตฌ์› (The Academy of Korean Studies)

Citation

@misc{kim2025sillokbertscratch,
    author = {Kim, Baro},
    title = {SillokBert-Scratch: Training a Pure-blood Sillok Language Model from Scratch},
    year = {2025},
    publisher = {Hugging Face},
    journal = {Hugging Face repository},
    howpublished = {\url{https://huggingface.co/ddokbaro/SillokBert-Scratch}}
}