Jeju Satoru

Project Overview

'Jeju Satoru' is a bidirectional Jeju-Standard Korean translation model developed to help preserve the Jeju language, which UNESCO designates as an 'endangered language'. The model aims to narrow the digital divide for elderly Jeju dialect speakers by making digital services more accessible to them.

Model Information

Base Model: KoBART (gogamza/kobart-base-v2)
Model Architecture: Seq2Seq (Encoder-Decoder structure)
Training Data: The model was trained on approximately 930,000 sentence pairs drawn from the publicly available Junhoee/Jeju-Standard-Translation dataset, which is primarily based on text from the KakaoBrain JIT (Jejueo Interlinear Texts) corpus and transcriptions from the AI Hub Jeju dialect speech dataset.

Training Strategy and Parameters

Our model was trained in two stages, domain adaptation followed by translation fine-tuning, to handle the complexities of the Jeju dialect.

Domain Adaptation: The model was first trained separately on Standard Korean and Jeju dialect sentences so that it learns the grammar and style of each variety.
Translation Fine-Tuning: The final stage involved training the model on the bidirectional dataset, with [제주] (Jeju) and [표준] (Standard) tags added to each sentence to explicitly guide the translation direction.
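
As a minimal sketch of the tagging step, the snippet below turns one parallel pair into two training examples, one per direction. It assumes the tag names the language of the source sentence and is prepended to the source only, which matches the inference examples in "How to Use"; the function name and the dictionary keys are hypothetical and not taken from the original training code.

```python
# Direction tags taken from the model card; where they are placed is an assumption.
JEJU_TAG = "[제주]"
STANDARD_TAG = "[표준]"


def make_bidirectional_examples(jeju: str, standard: str) -> list[dict[str, str]]:
    """Turn one parallel pair into two training examples, one per direction.

    Assumption: the tag marks the language of the source sentence and is
    prepended to the source only.
    """
    return [
        {"source": f"{JEJU_TAG} {jeju}", "target": standard},
        {"source": f"{STANDARD_TAG} {standard}", "target": jeju},
    ]


examples = make_bidirectional_examples("우리 집이 펜안허다.", "우리 집은 편안하다.")
for example in examples:
    print(example["source"], "->", example["target"])
```

Emitting both directions from every pair is one way a single model can serve both translation directions, with the tag as the only signal of which one is requested.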

The following key hyperparameters and techniques were applied for performance optimization:

Learning Rate: 2e-5
Epochs: 3
Batch Size: 128
Weight Decay: 0.01
Generation Beams: 5
GPU Memory Efficiency: Mixed-precision training (FP16) was used to reduce memory use and training time, together with gradient accumulation (16 steps).
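
The list above does not say whether the batch size of 128 is the size of each forward pass or the effective size after gradient accumulation. The sketch below only illustrates the arithmetic that relates the two. The per-device batch size of 8 and the single device are hypothetical values chosen so that the product equals 128; they are not taken from the original training setup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    # Values listed in the model card.
    learning_rate: float = 2e-5
    epochs: int = 3
    weight_decay: float = 0.01
    num_beams: int = 5
    fp16: bool = True
    gradient_accumulation_steps: int = 16
    # Assumptions: neither value is stated in the model card.
    per_device_batch_size: int = 8
    num_devices: int = 1

    @property
    def effective_batch_size(self) -> int:
        """Number of examples contributing to one optimizer update."""
        return (
            self.per_device_batch_size
            * self.gradient_accumulation_steps
            * self.num_devices
        )


config = TrainingConfig()
print(config.effective_batch_size)  # 8 * 16 * 1 = 128
```

If 128 were instead the per-device batch size, the same formula would give 128 * 16 = 2048 examples per update on one device.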

Performance Evaluation

The model was evaluated with automatic metrics and a qualitative review of its translations.

Quantitative Evaluation

Direction	SacreBLEU	chrF	BERTScore
Jeju Dialect → Standard	77.19	83.02	0.97
Standard → Jeju Dialect	64.86	72.68	0.94
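
To give a feel for what a character-level score such as chrF measures, here is a simplified sentence-level version in plain Python. It is an illustration only, not the implementation behind the numbers in the table: it averages character n-gram precision and recall over orders 1 to 6 and combines them with an F-score that weights recall more heavily (beta = 2). The function names are hypothetical.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams after removing whitespace."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified sentence-level chrF on a 0-100 scale."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # skip orders longer than either string
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return 100 * (1 + beta**2) * p * r / (beta**2 * p + r)


score = simple_chrf("우리 집은 편안하다.", "우리 집이 편안하다.")
print(round(score, 2))
```

Because it works on characters rather than words, a score of this kind still gives partial credit when a translation differs from the reference only in a particle or a sentence ending.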

Qualitative Evaluation (Summary)

Adequacy: The model accurately captures the meaning of most source sentences.
Fluency: The translated sentences are grammatically correct and natural-sounding.
Tone: The model generally preserves the tone of the source, but it does not always reproduce the nuances and colloquial sentence endings specific to the Jeju dialect.

How to Use

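You can load the model and run inference with the pipeline function from the transformers library. Prefix the input with [제주] to translate from the Jeju dialect to Standard Korean, or with [표준] to translate in the opposite direction.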

1. Installation

pip install transformers torch

2. Inference

from transformers import pipeline

# Load the model pipeline
translator = pipeline(
    "translation",
    model="sbaru/jeju-satoru"
)

# Example: Jeju Dialect -> Standard
jeju_sentence = '[제주] 우리 집이 펜안허다.'
result = translator(jeju_sentence, max_length=128)
print(f"Input: {jeju_sentence}")
print(f"Output: {result[0]['translation_text']}")

# Example: Standard -> Jeju Dialect
standard_sentence = '[표준] 우리 집은 편안하다.'
result = translator(standard_sentence, max_length=128)
print(f"Input: {standard_sentence}")
print(f"Output: {result[0]['translation_text']}")
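
If you call the model from several places, a small wrapper can take care of the direction tag. The following is a sketch under stated assumptions: the helper name, the direction keys, and the stand-in translator are hypothetical, and the helper only assumes that the object passed in behaves like the pipeline above (called with a string, it returns a list containing a dict with a 'translation_text' key).

```python
from typing import Callable

# Tags from the examples above; the tag names the language of the input sentence.
DIRECTION_TAGS = {
    "jeju_to_standard": "[제주]",
    "standard_to_jeju": "[표준]",
}


def translate(text: str, direction: str, translator: Callable, max_length: int = 128) -> str:
    """Prefix the direction tag and return the translated string."""
    if direction not in DIRECTION_TAGS:
        raise ValueError(f"direction must be one of {sorted(DIRECTION_TAGS)}")
    tagged = f"{DIRECTION_TAGS[direction]} {text.strip()}"
    return translator(tagged, max_length=max_length)[0]["translation_text"]


# Stand-in for the real pipeline so the sketch runs without downloading the model.
def echo_translator(text, max_length=128):
    return [{"translation_text": text}]


print(translate("우리 집이 펜안허다.", "jeju_to_standard", echo_translator))
```

With the real model, pass the `translator` object created earlier in place of `echo_translator`.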
