jhangyejin/epitext-sikuroberta

Model Description

This model restores unidentifiable characters in historical Korean rubbing texts. It was fine-tuned from SIKU-BERT/sikuroberta on a large-scale dataset collected from four major Korean historical institutions, and it leverages bidirectional context to fill in characters lost to physical damage or erosion of the original documents.

Training Data

Data Sources & Preprocessing

The data was collected from the databases of four major institutions:

  • National Institute of Korean History (๊ตญ์‚ฌํŽธ์ฐฌ์œ„์›ํšŒ)
  • National Heritage Administration (๊ตญ๊ฐ€์œ ์‚ฐ์ฒญ)
  • Kyujanggak Institute for Korean Studies (๊ทœ์žฅ๊ฐํ•œ๊ตญํ•™์—ฐ๊ตฌ์›)
  • National Research Institute of Cultural Heritage (๊ตญ๋ฆฝ๋ฌธํ™”์œ ์‚ฐ์—ฐ๊ตฌ์›)

Preprocessing Steps (a minimal sketch in code follows below):

  1. Normalization: Unified the varying notations that institutions use to mark unidentifiable characters.
  2. Noise Removal: Stripped editorial symbols annotating stone wear or missing passages, keeping only the bare text.
  3. Filtering: Excluded data containing Hiragana or Hangul, which fall outside the research scope.

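For illustration, the sketch below implements these three steps. The marker characters and editorial symbols are assumptions for demonstration; each institution's actual notation differs, and the real normalization table is not reproduced here.

import re

UNREADABLE_MARKS = r"[□■▨●○]"                   # assumed variants for unidentifiable characters
EDITORIAL_MARKS = r"[〔〕\[\]〈〉()()]"           # assumed editorial bracketing symbols
OUT_OF_SCOPE = r"[\u3040-\u309F\uAC00-\uD7A3]"  # Hiragana and Hangul syllable blocks

def preprocess(line: str) -> str | None:
    # 3. Filtering: drop lines containing Hiragana or Hangul.
    if re.search(OUT_OF_SCOPE, line):
        return None
    # 1. Normalization: unify every unidentifiable-character mark to a single symbol.
    line = re.sub(UNREADABLE_MARKS, "□", line)
    # 2. Noise removal: strip editorial symbols, keeping only the bare text.
    line = re.sub(EDITORIAL_MARKS, "", line)
    return line.strip()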

Dataset Statistics

Dataset          Number of Sentences   Ratio
Training Set     182,904               80%
Validation Set   22,863                10%
Test Set         22,863                10%
Total            228,630               100%
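
The split procedure itself is not documented; the snippet below is one plausible way to produce the 80/10/10 partition above (the seed and the shuffle are assumptions).

import random

random.seed(42)            # assumed seed; the actual split procedure is unspecified
random.shuffle(sentences)  # `sentences`: the 228,630 preprocessed lines

n = len(sentences)
train = sentences[: int(n * 0.8)]
valid = sentences[int(n * 0.8) : int(n * 0.9)]
test = sentences[int(n * 0.9) :]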

Hyperparameters

  • Optimizer: AdamW
  • Learning Rate: 2e-5
  • Scheduler: Linear with warmup (warmup ratio: 0.06)
  • Training Epochs: Max 20, with early stopping (patience: 3 epochs)
  • Tokenizer: Vocabulary expanded to 31,885 tokens to cover characters unique to rubbings (see the training sketch below)
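
The sketch below shows how these settings map onto the transformers Trainer API. The dataset variables, the list of added characters, and the masking probability are hypothetical; the authors' actual training script may differ.

from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          EarlyStoppingCallback, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("SIKU-BERT/sikuroberta")
model = AutoModelForMaskedLM.from_pretrained("SIKU-BERT/sikuroberta")

# Vocabulary expansion to 31,885 tokens; `rubbing_chars` is a hypothetical
# list of characters found in the rubbings but missing from the base vocab.
tokenizer.add_tokens(rubbing_chars)
model.resize_token_embeddings(len(tokenizer))

args = TrainingArguments(
    output_dir="epitext-sikuroberta",
    learning_rate=2e-5,            # AdamW is the Trainer's default optimizer
    lr_scheduler_type="linear",
    warmup_ratio=0.06,
    num_train_epochs=20,
    eval_strategy="epoch",         # `evaluation_strategy` in older transformers
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,   # hypothetical tokenized datasets
    eval_dataset=eval_dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()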

Evaluation Results

The model's performance on the test set is as follows:

Metric            Value
Top-1 Accuracy    68.22%
Top-5 Accuracy    82.56%
Test Loss         1.5714
Test Perplexity   4.8136
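
As a quick consistency check, perplexity is the exponential of the average cross-entropy loss, so the two bottom rows agree:

import math

test_loss = 1.5714
print(math.exp(test_loss))  # ≈ 4.8136, the reported test perplexity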

How to Use

Below is an example using the transformers library to restore a masked character in a historical text.

from transformers import pipeline, AutoTokenizer

# 1. Setup model and tokenizer
model_path = "jhangyejin/epitext-sikuroberta"
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)

# 2. Create pipeline
fill_mask = pipeline("fill-mask", model=model_path, tokenizer=tokenizer)

# 3. Inference
# [MASK] marks the unidentifiable character to be restored
text = "ๅˆ็ซ‹็Ÿณๆ–ผๅ…ˆ็”Ÿๅข“ๅ‰ไปฅ่กจไน‹่€ŒไฝฟๅผŸๆฆฎ็ฅ[MASK]ๅ…ถ้™ฐๅ†…ๅค–ไธ–"
results = fill_mask(text)

# 4. Print Top-5 Results
print(f"Input: {text}")
print("-" * 30)
for i, item in enumerate(results):
    print(f"Top {i+1} - Score: {item['score']:.4f}, Token: {item['token_str']}")

Citation

If you find this model useful for your research, please cite it as follows:

BibTeX:

@misc{jhangyejin2026epitext,
  author = {Jhang, Yejin},
  title = {Epitext-SikuRoBERTa: Context-based Restoration for Historical Korean Rubbing Texts},
  year = {2026},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/jhangyejin/epitext-sikuroberta}}
}