# jhangyejin/epitext-sikuroberta
## Model Description
This model is specifically designed to restore unidentifiable characters in historical Korean rubbing texts. It was trained on a large-scale dataset collected from four major Korean historical institutions and fine-tuned from SIKU-BERT/sikuroberta. By leveraging bidirectional contextual information, it fills in the gaps caused by physical damage or erosion of the original documents.
## Training Data
### Data Sources & Preprocessing
The data was collected from the databases of four major institutions:
- National Institute of Korean History (국사편찬위원회)
- National Heritage Administration (국가유산청)
- Kyujanggak Institute for Korean Studies (규장각한국학연구원)
- National Research Institute of Cultural Heritage (국립문화유산연구원)
Preprocessing steps (a minimal code sketch follows this list):
- Normalization: Unified the differing notations each institution uses to mark unidentifiable characters.
- Noise Removal: Removed the editorial symbols describing stone wear or missing parts, leaving only the text itself.
- Filtering: Excluded entries containing Hiragana or Hangul, which fall outside the research scope.
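The preprocessing code itself is not published; the following is a minimal sketch of the three steps above under assumed notation. The marker symbols, the regular expressions, and the `preprocess` function are all illustrative, not the institutions' actual conventions.

```python
import re
from typing import Optional

# All marker symbols here are assumptions for illustration; the actual
# per-institution notations are not listed in this card.
UNKNOWN_MARKS = re.compile(r"[□▨■]")        # hypothetical unidentifiable-character marks
EDITORIAL_SYMBOLS = re.compile(r"[〔〕()]")  # hypothetical editorial symbols
OUT_OF_SCOPE = re.compile(r"[\u3040-\u309f\uac00-\ud7a3]")  # Hiragana, Hangul syllables

def preprocess(line: str) -> Optional[str]:
    """Normalize, denoise, and filter one line of rubbing text."""
    # Filtering: drop lines containing Hiragana or Hangul (out of scope)
    if OUT_OF_SCOPE.search(line):
        return None
    # Noise removal: strip editorial symbols describing wear or missing parts
    line = EDITORIAL_SYMBOLS.sub("", line)
    # Normalization: unify every unidentifiable-character notation to one mark
    return UNKNOWN_MARKS.sub("□", line)
```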
### Dataset Statistics
| Dataset | Number of Sentences | Ratio |
|---|---|---|
| Training Set | 182,904 | 80% |
| Validation Set | 22,863 | 10% |
| Test Set | 22,863 | 10% |
| Total | 228,630 | 100% |
## Hyperparameters
- Optimizer: AdamW
- Learning Rate: 2e-5
- Scheduler: Linear Warmup (Ratio: 0.06)
- Training Epochs: Max 20, with early stopping (patience: 3 epochs)
- Tokenizer: Vocabulary expanded to 31,885 tokens to cover characters unique to the rubbings
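The training script is not included in this card; below is a minimal sketch of a fine-tuning setup that matches the hyperparameters above, using the Hugging Face `Trainer`. The dataset variables `train_ds` and `val_ds` and the standard 15% MLM masking rate are assumptions.

```python
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, EarlyStoppingCallback,
                          Trainer, TrainingArguments)

base = "SIKU-BERT/sikuroberta"
tokenizer = AutoTokenizer.from_pretrained(base, use_fast=False)
model = AutoModelForMaskedLM.from_pretrained(base)
# The card reports a vocabulary expanded to 31,885 tokens; after adding new
# tokens, the embedding matrix must be resized to match the tokenizer.
model.resize_token_embeddings(len(tokenizer))

# Standard MLM collator; the 15% masking rate is an assumption.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="epitext-sikuroberta",
    learning_rate=2e-5,            # AdamW is the Trainer default optimizer
    lr_scheduler_type="linear",
    warmup_ratio=0.06,             # linear warmup, ratio 0.06
    num_train_epochs=20,           # max 20 epochs
    eval_strategy="epoch",         # `evaluation_strategy` in older transformers
    save_strategy="epoch",
    load_best_model_at_end=True,   # required by EarlyStoppingCallback
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,        # hypothetical pre-tokenized datasets
    eval_dataset=val_ds,
    data_collator=collator,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()
```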
## Evaluation Results
Performance on the test set is as follows:
| Metric | Value |
|---|---|
| Top-1 Accuracy | 68.22% |
| Top-5 Accuracy | 82.56% |
| Test Loss | 1.5714 |
| Test Perplexity | 4.8136 |
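As a consistency check, the test perplexity is the exponential of the test loss: exp(1.5714) ≈ 4.81, matching the reported value.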
## How to Use
Below is an example using the transformers library to restore a masked character in a historical text. The original example sentence was unreadable in the source, so the input shown here is an illustrative classical Chinese line.
```python
from transformers import AutoTokenizer, pipeline

# 1. Set up the model and tokenizer
model_path = "jhangyejin/epitext-sikuroberta"
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)

# 2. Create the fill-mask pipeline
fill_mask = pipeline("fill-mask", model=model_path, tokenizer=tokenizer)

# 3. Inference
# Illustrative input (Analects 1.1); [MASK] marks the unidentifiable character.
text = "學而時習之不亦說[MASK]"
results = fill_mask(text)

# 4. Print the top-5 candidates
print(f"Input: {text}")
print("-" * 30)
for i, item in enumerate(results):
    print(f"Top {i+1} - Score: {item['score']:.4f}, Token: {item['token_str']}")
```
## Citation
If you find this model useful for your research, please cite it as follows:
BibTeX:
```bibtex
@misc{jhangyejin2026epitext,
  author       = {Jhang, Yejin},
  title        = {Epitext-SikuRoBERTa: Context-based Restoration for Historical Korean Rubbing Texts},
  year         = {2026},
  publisher    = {Hugging Face},
  journal      = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/jhangyejin/epitext-sikuroberta}}
}
```