---
language:
- ko
- lzh
license: cc-by-sa-4.0
library_name: transformers
tags:
- fill-mask
- text-generation
- sillok
- history
- korean-history
- classical-chinese
pipeline_tag: fill-mask
datasets:
- "Veritable Records of the Joseon Dynasty"
co2_eq_emissions:
  emissions: 1.1662
  source: "codecarbon"
  training_type: "from_scratch"
  geographical_location: "South Korea, Seoul"
  hardware_used: "1 x NVIDIA A100-PCIE-40GB"
---

# SillokBert-Scratch: A Korean Classical Language Model Trained from Scratch

`SillokBert-Scratch` is a BERT-based language model trained entirely from scratch, with no pre-trained starting point, using only the original text of the "Annals of the Joseon Dynasty" (Joseon wangjo sillok). This research project aimed to build a **'pure-blood' (純血) language model** that learns the linguistic characteristics unique to the Sillok corpus, departing from the conventional approach of fine-tuning a large multilingual model on a specific domain.

### **Funding & Support**

This model was developed as part of the "Development of an Intelligent Korean Studies Language Model Based on Classical Korean Texts" project at the Digital Humanities Research Institute, The Academy of Korean Studies. The training environment was supported by the 2025 High-Performance Computing Support (GPU) Program (No. G2025-0450) of the National IT Industry Promotion Agency (NIPA), under the Ministry of Science and ICT. We sincerely thank NIPA for providing the high-performance computing environment essential to this research.

## **Model Performance**

| Model | Perplexity (PPL) | Note |
| :--- | :---: | :--- |
| **SillokBert-Scratch (This Model)** | **1.4580** | **Trained from scratch** |
| SillokBert (previous fine-tuned model) | 4.1219 | Fine-tuned from `bert-base-multilingual-cased` |
| `bert-base-multilingual-cased` (baseline) | 132.5186 | Not trained on Sillok data |

This model improves perplexity by **64.63%** over the previous fine-tuned model and by **98.90%** over the untouched `bert-base-multilingual-cased` baseline, demonstrating a decisive advantage for domain-specific pre-training from scratch.

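The improvement percentages follow directly from the PPL values in the table above; a minimal sanity check, using only those reported numbers:

```python
# PPL values reported in the table above.
ppl_scratch = 1.4580
ppl_finetuned = 4.1219
ppl_baseline = 132.5186

def improvement(old: float, new: float) -> float:
    """Relative PPL reduction, as a percentage of the old value."""
    return (old - new) / old * 100

print(f"{improvement(ppl_finetuned, ppl_scratch):.2f}%")  # vs fine-tuned: ~64.63%
print(f"{improvement(ppl_baseline, ppl_scratch):.2f}%")   # vs baseline:   ~98.90%
```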
## **How to Use**

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

# Path to the model on the Hugging Face Hub.
model_name = "ddokbaro/SillokBert-Scratch"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Test sentence: a Sillok-style mixed-script line with one masked token.
text = "王이 傳曰, “近日 [MASK]事가 何?”"

inputs = tokenizer(text, return_tensors="pt")
token_logits = model(**inputs).logits

# Find the masked token's position and predict the most probable tokens.
mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
mask_token_logits = token_logits[0, mask_token_index, :]
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()

print(f"Top 5 most probable tokens for the [MASK] position in '{text}':")
for token_id in top_5_tokens:
    print(f" - {tokenizer.decode([token_id])}")
```

## **Dataset Information**

#### **Data Source and Collection**

**Source Data**: Public Data Portal, National Institute of Korean History (Ministry of Education), "Annals of the Joseon Dynasty Information: Original Sillok Texts" (<https://www.data.go.kr/data/15053647/fileData.do>). We express our gratitude to the National Institute of Korean History for providing the invaluable data that formed the foundation of this research.

**Data Version and Reproducibility**: This research is based on the dataset registered on November 3, 2022. Because the officially distributed data may be updated, the complete set of original XML files used for training is provided as `raw_data/sillok_raw_xml.zip` to ensure full reproducibility. Ready-to-use preprocessed text files (`train.txt`, `validation.txt`, `test.txt`) are also available in the `preprocessed_data/` folder.

#### **Path to Preprocessed Data**

`/home/work/baro/sillok25060103/preprocessed_corpus/`

## **Training Procedure**

This model was developed through a four-stage process.

### **Stage 1: Sillok-Specific Tokenizer Training**

* **Algorithm:**
  After the initial `WordPiece` approach halted training for reasons we could not explain, we ultimately adopted the `BPE (Byte-Pair Encoding)` algorithm. Byte-level pre-tokenization was used to minimize the occurrence of `[UNK]` tokens.

* **Preprocessing:**
  Considering the characteristics of classical texts, the preprocessing pipeline applies **PUA (Private Use Area) code conversion** and **Unicode NFC normalization**.

* **Vocabulary Size:**
  `500,000`. A large vocabulary was constructed to represent as much as possible of the vast inventory of Hanja characters and terms unique to the Sillok.

* **Final Output:** `sillok_tokenizer_bpe_preprocessed/tokenizer.json`

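The tokenizer recipe above (byte-level BPE with NFC normalization and a 500,000-entry vocabulary) can be sketched with the Hugging Face `tokenizers` library. This is a minimal illustration, not the project's actual script: the special-token list, `unk_token` choice, and the tiny stand-in corpus are assumptions, and the PUA code conversion step is omitted.

```python
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers

# BPE model with byte-level pre-tokenization, as described above.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.NFC()  # Unicode NFC normalization
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

trainer = trainers.BpeTrainer(
    vocab_size=500_000,  # large vocabulary for Hanja coverage
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],  # assumed set
)

# The real run trains on the preprocessed Sillok corpus files;
# a tiny in-memory corpus stands in here so the sketch is runnable.
corpus = ["上曰 近日 雨澤 如何", "承旨 啓曰 諸道 農形 尙未 登聞"]
tokenizer.train_from_iterator(corpus, trainer)
tokenizer.save("tokenizer.json")
```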
### **Stage 2: Model Architecture Definition**

* **Architecture:**
  The same architecture as `BERT-base` (12 layers, 768 hidden size, 12 attention heads), but initialized as a blank-slate model from scratch, without any pre-trained weights.

* **Total Parameters:**
  `470,542,880`. The 500,000-entry vocabulary makes the embedding layer far larger than usual, so the model has many more parameters than a standard BERT-base.

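The parameter count is dominated by the token-embedding matrix; a quick back-of-the-envelope check using the dimensions above:

```python
vocab_size = 500_000
hidden_size = 768

# Token-embedding parameters alone.
embedding_params = vocab_size * hidden_size  # 384,000,000
total_params = 470_542_880                   # reported total

# The embedding matrix accounts for roughly 82% of all parameters,
# versus about 21% for standard BERT-base (vocab ~30K, ~110M params).
share = embedding_params / total_params
print(f"{share:.1%}")
```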
### **Stage 3: Pre-training**

* **Objective:** Masked Language Modeling (MLM)

* **Training Datasets:**
  * `train.txt`: 362,107 lines
  * `validation.txt`: 20,116 lines

* **Key Training Arguments:**
  * **`output_dir`**: `./sillokbert_scratch_pretraining_output`
    The directory where model checkpoints and final outputs are saved during training.
  * **`num_train_epochs`**: `10`
    The number of full passes over the training dataset.
  * **`per_device_train_batch_size`**: `4`
    The number of training samples processed at once on a single GPU; kept small due to GPU memory limits.
  * **`gradient_accumulation_steps`**: `4`
    Gradients from 4 small batches are accumulated before each model update, maintaining an effective batch size of 16 (`4 * 4`) while greatly reducing memory usage.
  * **`learning_rate`**: `5e-5` (0.00005)
    A standard learning rate for the AdamW optimizer.
  * **`warmup_steps`**: `1000`
    The number of steps over which the learning rate is linearly increased from 0 to its initial value, improving stability at the start of training.
  * **`weight_decay`**: `0.01`
    A regularization term that keeps model weights from growing too large, mitigating overfitting.
  * **`fp16`**: `True`
    Enables mixed-precision training with 16-bit floating point, reducing GPU memory usage and improving training speed.
  * **`gradient_checkpointing`**: `True`
    Recomputes intermediate activations during the backward pass instead of storing them all in the forward pass, sharply reducing memory usage.
  * **`eval_strategy`**: `"steps"`
    Evaluates the model on the validation dataset at the interval specified by `eval_steps`.
  * **`eval_steps`** / **`save_steps`** / **`logging_steps`**: `2000` / `2000` / `500`
    Evaluates and saves the model every 2,000 steps, and logs training metrics every 500 steps.

* **Final Output:** `sillokbert_scratch_pretraining_output/final_model`

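From the arguments above, the effective batch size and rough step counts follow directly; a small sanity check, assuming one training example per line of `train.txt`:

```python
import math

per_device_train_batch_size = 4
gradient_accumulation_steps = 4
num_train_epochs = 10
train_examples = 362_107  # lines in train.txt

# Effective batch size seen by each optimizer update.
effective_batch = per_device_train_batch_size * gradient_accumulation_steps  # 16

# Approximate optimizer steps per epoch and in total.
steps_per_epoch = math.ceil(train_examples / effective_batch)  # 22,632
total_steps = steps_per_epoch * num_train_epochs               # 226,320

# The 1,000 warmup steps therefore cover well under 1% of training.
print(effective_batch, steps_per_epoch, total_steps)
```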
### **Stage 4: Evaluation**

* **Test Dataset:** `test.txt`

* **Evaluation Metric:** Perplexity (PPL)

* **Eval Loss:** `0.3770`

* **Perplexity:** `1.4580`

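The reported perplexity is simply the exponential of the evaluation (cross-entropy) loss:

```python
import math

eval_loss = 0.3770
perplexity = math.exp(eval_loss)  # ~1.458, matching the reported PPL
print(f"{perplexity:.4f}")
```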
## **Baseline Models**

This research used the results of the previous `SillokBert` fine-tuning project and the off-the-shelf `bert-base-multilingual-cased` model as its key performance baselines.

| Rank | Model | Perplexity (PPL) | Note |
| :--- | :--- | :---: | :--- |
| 1 | **SillokBert-Scratch** | **1.4580** | **Trained from scratch** |
| 2 | SillokBert (fine-tuned) | 4.1219 | Fine-tuned from `bert-base-multilingual-cased` |
| 3 | `bert-base-multilingual-cased` | 132.5186 | Not trained on Sillok data |

## **Author**

* Baro Kim (김바로), The Academy of Korean Studies

## **Citation**

```bibtex
@misc{kim2025sillokbertscratch,
  author       = {Kim, Baro},
  title        = {SillokBert-Scratch: Training a Pure-blood Sillok Language Model from Scratch},
  year         = {2025},
  publisher    = {Hugging Face},
  journal      = {Hugging Face repository},
  howpublished = {\url{https://huggingface.co/ddokbaro/SillokBert-Scratch}}
}
```