---
language:
- ko
- lzh
license: cc-by-sa-4.0
library_name: transformers
tags:
- token-classification
- ner
- sillok
- history
- korean-history
- classical-chinese
pipeline_tag: token-classification
co2_eq_emissions:
emissions: 0.153 # kg, 1-epoch sample run extrapolated to 10 epochs.
source: "estimation based on training time and hardware"
training_type: "fine-tuning"
geographical_location: "South Korea, Seoul"
hardware_used: "1 x NVIDIA A100-PCIE-40GB"
datasets:
- "VERITABLE RECORDS of the JOSEON DYNASTY"
---
# **SillokBert-NER: 조선왕조실록 특화 개체명 인식 모델**
# **SillokBert-NER: A Domain-Specific NER Model for the Veritable Records of the Joseon Dynasty**
## **모델 설명 (Model Description)**
`SillokBert-NER`은 조선왕조실록 원문에 특화된 개체명 인식(Named Entity Recognition, NER) 모델입니다. 이 모델은 조선왕조실록 전체 원문(한문)으로 지속적 사전학습(continued pre-training)을 진행한 언어 모델 [ddokbaro/SillokBert](https://huggingface.co/ddokbaro/SillokBert) 프로젝트의 Trial 11 체크포인트를 기반으로 파인튜닝되었으며, 역사 기록물 안에서 다음의 4가지 핵심 개체 유형을 정확하게 식별하도록 설계되었습니다.
(SillokBert-NER is a Named Entity Recognition (NER) model specialized for the Veritable Records of the Joseon Dynasty (์กฐ์ ์์กฐ์ค๋ก). It is fine-tuned from the Trial 11 checkpoint of the ddokbaro/SillokBert project, a language model that was continually pre-trained on the full-text classical Chinese (Hanja) corpus of the Veritable Records. This model is designed to accurately identify four key entity types within the historical texts.)
* **PER**: 인명 (Person)
* **LOC**: 지명 (Location)
* **POH**: 서책명 (Publication of History)
* **DAT**: 연호 (Date / Era Name)
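The model predicts these types as token-level BIO tags (e.g. the `B-PER` outputs shown in the usage example). A sketch of the label inventory this implies; the authoritative mapping is the `id2label` field in the released `config.json`, whose ordering may differ:

```python
# BIO label inventory implied by the four entity types above.
# Illustrative reconstruction; check config.json for the real id2label order.
ENTITY_TYPES = ["PER", "LOC", "POH", "DAT"]
LABELS = ["O"] + [f"{prefix}-{etype}" for etype in ENTITY_TYPES for prefix in ("B", "I")]
print(LABELS)
# ['O', 'B-PER', 'I-PER', 'B-LOC', 'I-LOC', 'B-POH', 'I-POH', 'B-DAT', 'I-DAT']
```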
본 모델은 한국학중앙연구원 디지털인문학연구소의 "한국 고전 문헌 기반 지능형 한국학 언어모델 개발" 프로젝트의 일환으로 개발되었습니다. 본 모델의 학습 환경은 과학기술정보통신부 정보통신산업진흥원의 2025년 고성능컴퓨팅 지원(GPU) 사업(G2025-0450)의 지원을 받았습니다. 연구에 필수적인 고성능 컴퓨팅 환경을 지원해주셔서 진심으로 감사드립니다.
This model was developed as part of the "Development of an Intelligent Korean Studies Language Model based on Classical Korean Texts" project at the Digital Humanities Research Institute, The Academy of Korean Studies. The training environment for this model was supported by the 2025 High-Performance Computing Support (GPU) Program of the National IT Industry Promotion Agency (NIPA) (No. G2025-0450). We sincerely appreciate the support for providing the high-performance computing environment essential for our research.
## **์ฌ์ฉ ๋ชฉ์ ๋ฐ ํ๊ณ (Intended Uses & Limitations)**
์ด ๋ชจ๋ธ์ ํ์ ๋ฐ ์ฐ๊ตฌ ๋ชฉ์ ์ผ๋ก ์ ์๋์์ผ๋ฉฐ, ํนํ ์กฐ์ ์์กฐ์ค๋ก์ด๋ ์ ์ฌํ ํ๋ฌธ ์ญ์ฌ ๊ธฐ๋ก์ ๋ค๋ฃจ๋ ์ฐ๊ตฌ์์ ๊ฐ๋ฐ์์๊ฒ ์ ์ฉํฉ๋๋ค.
This model is intended for academic and research purposes, specifically for scholars and developers working with the Veritable Records of the Joseon Dynasty or similar historical Korean texts written in classical Chinese.
**ํ๊ณ (Limitations):**
* 이 모델은 특정 도메인에 고도로 특화되어 있으므로, 현대 한국어나 다른 종류의 텍스트에 대한 범용 NER 모델로는 **적합하지 않습니다.** (This model is highly domain-specific and is not suitable for general-purpose NER on modern Korean or other types of text.)
* ์๋๋ ๋ฌธ์ฒด์ ํน์ง์ด ๋ค๋ฅธ ์ญ์ฌ ๋ฌธํ์์๋ ์ฑ๋ฅ์ด ๋ค๋ฅด๊ฒ ๋ํ๋ ์ ์์ต๋๋ค. (Performance may vary on historical documents from different eras or with different stylistic features.)
## **์ฌ์ฉ ๋ฐฉ๋ฒ (How to Get Started)**
`transformers` ๋ผ์ด๋ธ๋ฌ๋ฆฌ์ ํ์ดํ๋ผ์ธ์ ํตํด ๊ฐ๋จํ๊ฒ ์ฌ์ฉํ ์ ์์ต๋๋ค. (You can use this model with the `transformers` library pipeline.)
```python
from transformers import pipeline

# 옵션 1 (권장): 허깅페이스 허브에서 직접 모델 로드
# Option 1 (Recommended): Load the model directly from the Hugging Face Hub
ner_pipeline = pipeline("token-classification", model="ddokbaro/SillokBert-NER")

# 옵션 2: 로컬에 저장된 모델 로드 (경로를 실제 환경에 맞게 수정해야 함)
# Option 2: Load the model from a local directory (adjust the path to your environment)
# local_model_path = "/home/work/baro/sillokner20250618/models/SillokBert-NER-trial11"
# ner_pipeline = pipeline("token-classification", model=local_model_path)

text = "ๆๅคชๅฎๅจๆฝ้ธ้ฃ่ถ่ฑ่่ซญๆไธๆฐไปๆๅๅฎถๅๅฎ้่ฅฟๅ่ท้ดจ็ถ ๆชๅ็พ้"
# 태종실록 1권, 태조 1년 1월 15일
# (Veritable Records of Taejong, Vol. 1, 15th day of the 1st month of the 1st year of King Taejo)

results = ner_pipeline(text)
for entity in results:
    print(entity)

# Expected output (scores and offsets are illustrative):
# {'entity': 'B-PER', 'score': 0.99..., 'index': 2, 'word': 'ๅคชๅฎ', 'start': 3, 'end': 5}
# {'entity': 'B-PER', 'score': 0.99..., 'index': 6, 'word': '่ถ่ฑ่', 'start': 15, 'end': 18}
# {'entity': 'B-LOC', 'score': 0.99..., 'index': 13, 'word': '้ดจ็ถ ', 'start': 43, 'end': 45}
```
## **์ฌ์ ํ์ต ๋ชจ๋ธ ์๋ณธ (Original Pre-trained Model)**
๋ณธ ๋ฆฌํฌ์งํ ๋ฆฌ์๋ ์ด NER ๋ชจ๋ธ์ ๊ธฐ๋ฐ์ด ๋ ์๋ณธ SillokBert (Trial 11) ์ฒดํฌํฌ์ธํธ ํ์ผ๋ค์ด 'SillokBert_trial11/' ํด๋์ ํจ๊ป ํฌํจ๋์ด ์์ต๋๋ค. ๋ค๋ฅธ ๋ค์ด์คํธ๋ฆผ ํ์คํฌ์ ์ง์ ํ์ธํ๋์ ์๋ํด๋ณด๊ณ ์ ํ๋ ์ฐ๊ตฌ์๋ค์ ํด๋น ํด๋์ ํ์ผ๋ค์ ํ์ฉํ ์ ์์ต๋๋ค.
This repository also contains the original SillokBert (Trial 11) checkpoint files in the 'SillokBert_trial11/' folder. Researchers who wish to fine-tune this model on other downstream tasks can utilize the files in that directory.
## **ํ์ต ๋ฐ ํ๊ฐ ๋ฐ์ดํฐ (Training and Evaluation Data)**
### **데이터셋 (Dataset)**
์ด ๋ชจ๋ธ์ ์กฐ์ ์์กฐ์ค๋ก ์๋ณธ XML ํ์ผ๋ก๋ถํฐ ๊ตฌ์ถ๋ **Sillok NER Corpus**๋ก ํ์ต๋์์ต๋๋ค. (This model was trained on the `Sillok NER Corpus`, a custom dataset built from the original XML files of the Veritable Records of the Joseon Dynasty.)
* **원천 데이터 (Source Data)**: 공공데이터포털 - 교육부 국사편찬위원회_조선왕조실록 정보_실록원문 (https://www.data.go.kr/data/15053647/fileData.do). 연구의 토대가 된 귀중한 자료를 제공해주신 교육부 국사편찬위원회 측에 감사의 말씀을 전합니다.
We express our gratitude to the National Institute of Korean History (Ministry of Education) for providing the invaluable data that formed the foundation of this research.
* **데이터 버전 및 재현성 (Data Version and Reproducibility)**: 본 연구는 2022년 11월 3일에 등록된 데이터를 기반으로 합니다. 공식 배포처의 데이터가 업데이트될 수 있어, 재현성을 보장하기 위해 학습에 사용된 원본 XML 파일 전체를 `raw_data/sillok_raw_xml.zip` 파일로 제공합니다. 또한, 즉시 활용 가능한 전처리 완료 텍스트 파일(`train.txt`, `validation.txt`, `test.txt`)은 `preprocessed_data/` 폴더에서 확인하실 수 있습니다.
This research is based on the data registered on November 3, 2022. As the data from the official distributor may be updated, we provide the entire set of original XML files used for training as `raw_data/sillok_raw_xml.zip` in this repository to ensure reproducibility. Additionally, the preprocessed text files (`train.txt`, `validation.txt`, `test.txt`), ready for immediate use, can be found in the `preprocessed_data/` folder.
* **전처리 (Preprocessing):** XML의 `<index>` 태그를 파싱하여 개체명 텍스트, 유형(`이름`, `지명`, `서명`, `연호`), 고유 참조 ID를 추출했습니다. 이 정보는 3열의 CoNLL 형식(`token` `ner_tag` `ref_id`)으로 변환되었습니다. (The `<index>` tags in the XML were parsed to extract entity text, types (`이름`, `지명`, `서명`, `연호`), and unique reference IDs. This information was converted into a 3-column CoNLL format (`token` `ner_tag` `ref_id`).)
* **๋ฐ์ดํฐ ๋ถํ (Data Split):** ์ ์ฒด ๋ง๋ญ์น๋ ํ์ต(80%), ๊ฒ์ฆ(10%), ํ๊ฐ(10%) ์ธํธ๋ก ๋ฌด์์ ๋ถํ ๋์์ต๋๋ค. (The full corpus was randomly split into training (80%), validation (10%), and test (10%) sets.)
* **ํ์ต ์ธํธ (Training Set):** 375,366 ๋ฌธ์ฅ
* **๊ฒ์ฆ ์ธํธ (Validation Set):** 46,920 ๋ฌธ์ฅ
* **ํ๊ฐ ์ธํธ (Test Set):** 46,922 ๋ฌธ์ฅ
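The `<index>`-to-CoNLL preprocessing described above can be sketched roughly as follows. The element and attribute names (`sentence`, `index`, `type`, `id`) and the type-to-tag mapping are assumptions based on the description; the actual Sillok XML schema may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from the XML type attribute to NER tags,
# following the entity-type table in this card.
TYPE_TO_TAG = {"이름": "PER", "지명": "LOC", "서명": "POH", "연호": "DAT"}

def sentence_to_conll(sentence_xml: str):
    """Convert one <sentence> element into (token, ner_tag, ref_id) rows,
    one row per character, in the 3-column CoNLL layout described above."""
    root = ET.fromstring(sentence_xml)
    # Plain text before the first entity is tagged O.
    rows = [(ch, "O", "-") for ch in (root.text or "") if not ch.isspace()]
    for el in root.iter("index"):
        tag = TYPE_TO_TAG.get(el.get("type", ""), "O")
        for i, ch in enumerate(el.text or ""):
            bio = f"{'B' if i == 0 else 'I'}-{tag}" if tag != "O" else "O"
            rows.append((ch, bio, el.get("id", "-")))
        # Plain text following the entity is tagged O.
        rows.extend((ch, "O", "-") for ch in (el.tail or "") if not ch.isspace())
    return rows
```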
### **데이터셋 다운로드 (Dataset Download)**
๋ณธ ๋ฆฌํฌ์งํ ๋ฆฌ์๋ ๋ชจ๋ธ๋ฟ๋ง ์๋๋ผ, ์ฐ๊ตฌ์ ์ฌ์ฉ๋ ์ ์ฒ๋ฆฌ ์๋ฃ ๋ฐ์ดํฐ์ ์๋ณธ ๋ฐ์ดํฐ๊ฐ ๋ชจ๋ ํฌํจ๋์ด ์์ด ์ฆ์ ํ์ฉ ๋ฐ ์ฌํ์ด ๊ฐ๋ฅํฉ๋๋ค.
This repository contains not only the model but also the pre-processed and raw data used in the research, allowing for immediate use and reproducibility.
`raw_data/`: 연구의 기반이 된 원본 XML 파일 전체가 포함되어 있습니다. (Contains the complete original XML files that formed the basis of this research.)
`preprocessed_data/`: 즉시 활용 가능한 CoNLL 형식의 `train.txt`, `validation.txt`, `test.txt` 파일이 포함되어 있습니다. (Contains ready-to-use CoNLL formatted files: `train.txt`, `validation.txt`, and `test.txt`.)
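A minimal reader for these preprocessed files might look like this (assuming whitespace-separated columns and blank lines between sentences, the usual CoNLL convention; the released files may use tabs):

```python
def read_conll(path):
    """Read a 3-column CoNLL file (token ner_tag ref_id) into a list of
    (tokens, tags) pairs, one pair per sentence."""
    sentences, tokens, tags = [], [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if not parts:  # blank line: sentence boundary
                if tokens:
                    sentences.append((tokens, tags))
                    tokens, tags = [], []
            else:
                tokens.append(parts[0])
                tags.append(parts[1])
    if tokens:  # flush a final sentence without a trailing blank line
        sentences.append((tokens, tags))
    return sentences

# Example (path is illustrative):
# sents = read_conll("preprocessed_data/train.txt")
```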
### **개체명 유형 (Entity Types)**
| 태그 (Tag) | 설명 (Description) | XML type | 원본 데이터 수 (Raw Data Count) |
| :---- | :---- | :---- | :---- |
| `PER` | Person Name (인명) | 이름 | 1,495,199 |
| `LOC` | Location Name (지명) | 지명 | 490,163 |
| `POH` | Publication of History (서책명) | 서명 | 49,506 |
| `DAT` | Date / Era Name (연호) | 연호 | 5,964 |
## **ํ์ต ์ ์ฐจ (Training Procedure)**
๊ณต์ ํ ํ๊ฐ๋ฅผ ์ํด ๋ชจ๋ ๋น๊ต ๋ชจ๋ธ์ ๋์ผํ ํ์ดํผํ๋ผ๋ฏธํฐ๋ฅผ ์ฌ์ฉํ์ฌ ํ์ธํ๋์ ์งํํ์ต๋๋ค. (The model was fine-tuned using the same set of hyperparameters across all comparative models to ensure a fair evaluation.)
* **ํ์ต๋ฅ (Learning Rate):** 2e-5
* **๋ฐฐ์น ์ฌ์ด์ฆ (Batch Size):** 16
* **์ํญ (Epochs):** 3
* **๊ฐ์ค์น ๊ฐ์ (Weight Decay):** 0.01
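These hyperparameters map directly onto `transformers.TrainingArguments`; a sketch of the setup (the `output_dir` and the Trainer wiring are illustrative, not the authors' actual script):

```python
# The hyperparameters listed above, collected as keyword arguments
# for transformers.TrainingArguments.
hyperparams = dict(
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
)

# Wiring into a Trainer would then look roughly like:
# from transformers import Trainer, TrainingArguments
# args = TrainingArguments(output_dir="SillokBert-NER", **hyperparams)
# Trainer(model=model, args=args, train_dataset=..., eval_dataset=...).train()
```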
## **์ฑ๋ฅ ํ๊ฐ (Evaluation)**
๋๋ฉ์ธ ํนํ ์ฌ์ ํ์ต์ ํจ๊ณผ๋ฅผ ๊ฒ์ฆํ๊ธฐ ์ํด ํฌ๊ด์ ์ธ ๋น๊ต ๋ถ์์ ์ํํ์ต๋๋ค. (We conducted a comprehensive comparative analysis to validate the effectiveness of domain-specific pre-training.)
### **๋น๊ต ๋ชจ๋ธ (Models for Comparison)**
* **๊ทธ๋ฃน 1 (์์ฒด ๋ชจ๋ธ / Our Models):** `SillokBert` (Top 3 Trials) vs. `bert-base-multilingual-cased` (๋ฒ ์ด์ค๋ผ์ธ / Baseline).
* **๊ทธ๋ฃน 2 (์ธ๋ถ ๋ชจ๋ธ / External Models):** ํ๋ ํ๊ตญ์ด(`klue/roberta-large`) ๋๋ ๋ค๋ฅธ ์ค๊ตญ ๊ณ ๋ฌธ(`SIKU-BERT`, `guwenbert-large`)์ผ๋ก ์ฌ์ ํ์ต๋ ๋ชจ๋ธ. (Models pre-trained on modern Korean (`klue/roberta-large`) or other classical Chinese texts (`SIKU-BERT`, `guwenbert-large`).)
* **๊ทธ๋ฃน 3 (SOTA ๋ฒค์น๋งํฌ / SOTA Benchmark):** ์ค๊ตญ ๊ณ ๋ฌธ NER ๊ณผ์ ๋ก ๊ธฐํ์ต๋ ๋ชจ๋ธ(`ethanyt/guwen-ner`). (A pre-trained NER model for classical Chinese (`ethanyt/guwen-ner`).)
### **๊ฒฐ๊ณผ (Results)**
다음 표는 각 모델의 검증 세트에 대한 최고 F1 점수를 요약한 것입니다. (The following table summarizes the best F1 scores on the validation set for each model.)

| 그룹 (Group) | 모델명 (Model) | 기반 데이터 (Base Data) | F1 점수 (F1) | 정밀도 (P) | 재현율 (R) | 정확도 (Acc) | 비고 (Notes) |
| :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- |
| 1 | [**SillokBert (Trial 11)**](https://huggingface.co/ddokbaro/SillokBert) | **실록 (자체)** | **0.9569** | 0.9485 | 0.9655 | 0.9959 | **최고 성능 달성** |
| 1 | SillokBert (Trial 10) | 실록 (자체) | 0.9565 | 0.9572 | 0.9558 | 0.9960 | 최고 성능과 대등 |
| 1 | SillokBert (Trial 4) | 실록 (자체) | 0.9564 | 0.9586 | 0.9542 | 0.9959 | ddokbaro/SillokBert 공식 모델 |
| 1 | `bert-base-multilingual-cased` | 다국어 (범용) | 0.9530 | 0.9544 | 0.9516 | 0.9956 | 사전학습 효과 비교를 위한 베이스라인 |
| 2 | `klue/roberta-large` | 현대 한국어 | 0.9488 | 0.9501 | 0.9475 | 0.9952 | 최신 아키텍처, 도메인 불일치로 성능 하락 |
| 2 | `ethanyt/guwenbert-large` | 중국 고문 (범용) | 0.9461 | 0.9450 | 0.9472 | 0.9951 | 유사 도메인, SillokBert 대비 성능 하락 |
| 2 | `SIKU-BERT/sikubert` | 중국 고문 (사고전서) | 0.9421 | 0.9380 | 0.9463 | 0.9948 | 특정 고문헌, SillokBert 대비 성능 하락 |
| 3 | `ethanyt/guwen-ner` (SOTA) | 중국 고문 (기학습) | 0.1749 | 0.2601 | 0.1317 | 0.9288 | 라벨/도메인 불일치로 성능 측정 불가 |
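The F1/P/R figures above are entity-level (seqeval-style): a predicted entity counts as correct only if both its type and its span exactly match the gold annotation. A simplified pure-Python sketch of that metric, for illustration:

```python
def extract_entities(tags):
    """Collect (type, start, end) spans from a BIO tag sequence
    (simplified: a stray I- tag does not open a new entity)."""
    spans, start, etype = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel flushes the last span
        inside = tag.startswith("I-") and tag[2:] == etype and start is not None
        if not inside:
            if start is not None:
                spans.add((etype, start, i))  # close the open span
            start, etype = (i, tag[2:]) if tag.startswith("B-") else (None, None)
    return spans

def entity_f1(gold_tags, pred_tags):
    """Entity-level F1 over one sentence pair of BIO tag sequences."""
    gold, pred = extract_entities(gold_tags), extract_entities(pred_tags)
    tp = len(gold & pred)  # exact span-and-type matches
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)
```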
### **๊ฒฐ๊ณผ ๋ถ์ (Analysis of Results)**
* **SillokBert의 우수성 (Superiority of SillokBert):** `SillokBert`는 다른 모든 비교 모델보다 일관되게 뛰어난 성능을 보이며, 도메인 특화 지속 사전학습(domain-specific continued pre-training)의 명백한 이점을 보여주었습니다. (`SillokBert` consistently outperformed all other models, demonstrating the clear advantage of domain-specific continued pre-training.)
* **도메인 적합성의 중요성 (Importance of Domain Alignment):** `klue/roberta-large`와 같이 현대 한국어 데이터로 학습된 강력한 모델이나, `guwenbert-large`, `SIKU-BERT` 등 다른 중국 고문 텍스트로 학습된 모델조차 SillokBert의 성능에는 미치지 못했습니다. 이는 본 과제에서 도메인 적합성이 아키텍처 개선이나 일반적인 언어 능력보다 더 중요한 요소임을 강조합니다. (Even powerful models trained on modern Korean (`klue/roberta-large`) or other classical Chinese texts (`guwenbert-large`, `SIKU-BERT`) could not match the performance of `SillokBert`. This highlights that domain alignment is more critical than architectural improvements or general language capabilities for this specific task.)
* **기성 SOTA 모델의 한계 (Limitations of Out-of-the-Box SOTA Models):** 사전학습된 `guwen-ner` 모델은 레이블 체계와 도메인의 불일치로 인해 우리 데이터셋에서 실패했습니다. 이는 외부 도구를 무비판적으로 적용하기보다, 특화된 데이터를 위한 맞춤형 모델을 개발할 필요성을 강조합니다. (The pre-trained `guwen-ner` model failed on our dataset due to a mismatch in label schemas and domains. This underscores the necessity of developing custom models for specialized data rather than uncritically applying external tools.)
## **์ธ์ฉ (Citation)**
์ด ๋ชจ๋ธ์ด๋ Sillok NER Corpus๋ฅผ ์ฐ๊ตฌ์ ์ฌ์ฉํ์ ๋ค๋ฉด, ์ด ๋ฆฌํฌ์งํ ๋ฆฌ๋ฅผ ์ธ์ฉํด ์ฃผ์ญ์์ค. (If you use this model or the Sillok NER Corpus in your research, please cite this repository.)
```bibtex
@misc{SillokBertNER2025,
  author = {Kim, Baro},
  title = {SillokBert-NER: A Domain-Specific NER Model for the Veritable Records of the Joseon Dynasty},
  year = {2025},
  publisher = {Hugging Face},
  journal = {Hugging Face repository},
  howpublished = {\url{https://huggingface.co/ddokbaro/SillokBert-NER}}
}
```