# jhangyejin/epitext-sikuroberta
## Model Description
This model is specifically designed to restore unidentifiable characters in historical Korean rubbing texts. It was trained on a large-scale dataset collected from four major Korean historical institutions and fine-tuned from SIKU-BERT/sikuroberta. By leveraging bidirectional contextual information, it fills in the gaps caused by physical damage or erosion of the original documents.
## Training Data
### Data Sources & Preprocessing
The data was collected from the databases of four major institutions:
- National Institute of Korean History (국사편찬위원회)
- National Heritage Administration (국가유산청)
- Kyujanggak Institute for Korean Studies (규장각한국학연구원)
- National Research Institute of Cultural Heritage (국립문화유산연구원)
Preprocessing steps (a minimal code sketch follows this list):
- Normalization: Unified the differing notations each institution uses to mark unidentifiable characters.
- Noise Removal: Removed the editorial symbols describing stone wear or missing parts, leaving only the text itself.
- Filtering: Excluded entries containing Hiragana or Hangul, which fall outside the research scope.
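The preprocessing code itself is not published; the following is a minimal sketch of the three steps above under assumed notation. The marker symbols, the regular expressions, and the `preprocess` function are all illustrative, not the institutions' actual conventions.

```python
import re
from typing import Optional

# All marker symbols here are assumptions for illustration; the actual
# per-institution notations are not listed in this card.
UNKNOWN_MARKS = re.compile(r"[□▨■]")        # hypothetical unidentifiable-character marks
EDITORIAL_SYMBOLS = re.compile(r"[〔〕()]")  # hypothetical editorial symbols
OUT_OF_SCOPE = re.compile(r"[\u3040-\u309f\uac00-\ud7a3]")  # Hiragana, Hangul syllables

def preprocess(line: str) -> Optional[str]:
    """Normalize, denoise, and filter one line of rubbing text."""
    # Filtering: drop lines containing Hiragana or Hangul (out of scope)
    if OUT_OF_SCOPE.search(line):
        return None
    # Noise removal: strip editorial symbols describing wear or missing parts
    line = EDITORIAL_SYMBOLS.sub("", line)
    # Normalization: unify every unidentifiable-character notation to one mark
    return UNKNOWN_MARKS.sub("□", line)
```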
### Dataset Statistics
| Dataset | Number of Sentences | Ratio |
|---|---|---|
| Training Set | 182,904 | 80% |
| Validation Set | 22,863 | 10% |
| Test Set | 22,863 | 10% |
| Total | 228,630 | 100% |
## Hyperparameters
- Optimizer: AdamW
- Learning Rate: 2e-5
- Scheduler: Linear Warmup (Ratio: 0.06)
- Training Epochs: Max 20, with early stopping (patience: 3 epochs)
- Tokenizer: Vocabulary expanded to 31,885 tokens to cover characters unique to the rubbings
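The training script is not included in this card; below is a minimal sketch of a fine-tuning setup that matches the hyperparameters above, using the Hugging Face `Trainer`. The dataset variables `train_ds` and `val_ds` and the standard 15% MLM masking rate are assumptions.

```python
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, EarlyStoppingCallback,
                          Trainer, TrainingArguments)

base = "SIKU-BERT/sikuroberta"
tokenizer = AutoTokenizer.from_pretrained(base, use_fast=False)
model = AutoModelForMaskedLM.from_pretrained(base)
# The card reports a vocabulary expanded to 31,885 tokens; after adding new
# tokens, the embedding matrix must be resized to match the tokenizer.
model.resize_token_embeddings(len(tokenizer))

# Standard MLM collator; the 15% masking rate is an assumption.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="epitext-sikuroberta",
    learning_rate=2e-5,            # AdamW is the Trainer default optimizer
    lr_scheduler_type="linear",
    warmup_ratio=0.06,             # linear warmup, ratio 0.06
    num_train_epochs=20,           # max 20 epochs
    eval_strategy="epoch",         # `evaluation_strategy` in older transformers
    save_strategy="epoch",
    load_best_model_at_end=True,   # required by EarlyStoppingCallback
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,        # hypothetical pre-tokenized datasets
    eval_dataset=val_ds,
    data_collator=collator,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()
```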
## Evaluation Results
Performance on the test set is as follows:
| Metric | Value |
|---|---|
| Top-1 Accuracy | 68.22% |
| Top-5 Accuracy | 82.56% |
| Test Loss | 1.5714 |
| Test Perplexity | 4.8136 |
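As a consistency check, the test perplexity is the exponential of the test loss: exp(1.5714) ≈ 4.81, matching the reported value.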
## How to Use
Below is an example using the transformers library to restore a masked character in a historical text. The original example sentence was unreadable in the source, so the input shown here is an illustrative classical Chinese line.
```python
from transformers import AutoTokenizer, pipeline

# 1. Set up the model and tokenizer
model_path = "jhangyejin/epitext-sikuroberta"
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)

# 2. Create the fill-mask pipeline
fill_mask = pipeline("fill-mask", model=model_path, tokenizer=tokenizer)

# 3. Inference
# Illustrative input (Analects 1.1); [MASK] marks the unidentifiable character.
text = "學而時習之不亦說[MASK]"
results = fill_mask(text)

# 4. Print the top-5 candidates
print(f"Input: {text}")
print("-" * 30)
for i, item in enumerate(results):
    print(f"Top {i+1} - Score: {item['score']:.4f}, Token: {item['token_str']}")
```
## Citation
If you find this model useful for your research, please cite it as follows:
BibTeX:
```bibtex
@misc{jhangyejin2026epitext,
  author       = {Jhang, Yejin},
  title        = {Epitext-SikuRoBERTa: Context-based Restoration for Historical Korean Rubbing Texts},
  year         = {2026},
  publisher    = {Hugging Face},
  journal      = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/jhangyejin/epitext-sikuroberta}}
}
```