Midas — Korean ↔ English Translation

Everything it touches, it translates.

A bidirectional Korean ↔ English translation model built on mT5-small, fine-tuned on ~150k parallel sentence pairs drawn from TED talks, parallel corpora, and quality-filtered Wikipedia data.


Overview

Midas is a single bidirectional translation model that handles both Korean→English and English→Korean in one set of weights. Direction is controlled by a task prefix in the input — no separate models required.

mT5-small was chosen over standard T5-small specifically for its multilingual SentencePiece tokenizer, which covers Korean script natively and avoids the heavy subword fragmentation that limits T5-small's Korean generation quality.

What Midas is good for:

  • Getting the gist of Korean text in English
  • Drafting English-to-Korean translations for human review
  • Lightweight deployment where model size matters
  • Baseline translation in pipelines where speed > perfection

What Midas is not:

  • A production-grade translator — BLEU 21.88 sits in the "understandable with noticeable flaws" range
  • A replacement for DeepL or Google Translate on critical content
  • Suited for highly domain-specific text (legal, medical, technical)

Model Details

| Property | Value |
|---|---|
| Base model | google/mt5-small |
| Parameters | ~300M |
| Task | Bidirectional Korean ↔ English translation |
| Training examples | ~299,000 (≈150k KO→EN + ≈150k EN→KO) |
| Epochs | 2 |
| Total steps | 8,094 |
| Training time | 2h 20m |
| Final BLEU | 21.88 |
| Final train loss | 7.6516 |
| Final val loss | 1.5326 |
| Hardware | NVIDIA A100-SXM4 80GB |
| Max sequence length | 512 tokens (source and target) |
| Precision | bfloat16 |
| License | Apache 2.0 |

Usage

Transformers

from transformers import AutoTokenizer, MT5ForConditionalGeneration

model_name = "EphAsad/Midas-Korean-English"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = MT5ForConditionalGeneration.from_pretrained(model_name)

# Task prefixes from the Input Format section; the prefix selects the direction.
PREFIXES = {
    "ko-en": "translate Korean to English: ",
    "en-ko": "translate English to Korean: ",
}

def translate(text, direction="ko-en"):
    # Reject unknown directions instead of silently falling back to EN→KO.
    if direction not in PREFIXES:
        raise ValueError(f"direction must be one of {sorted(PREFIXES)}, got {direction!r}")

    # Truncate to the 512-token maximum sequence length listed in Model Details.
    inputs = tokenizer(
        PREFIXES[direction] + text,
        return_tensors="pt",
        max_length=512,
        truncation=True,
    )
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        num_beams=4,
        early_stopping=True,
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Korean → English
print(translate("오늘 날씨가 정말 좋네요.", direction="ko-en"))

# English → Korean
print(translate("The weather is really nice today.", direction="en-ko"))

Pipeline

from transformers import pipeline

ko_en = pipeline(
    "translation",
    model    = "EphAsad/Midas-Korean-English",
    src_lang = "ko",
    tgt_lang = "en",
)

en_ko = pipeline(
    "translation",
    model    = "EphAsad/Midas-Korean-English",
    src_lang = "en",
    tgt_lang = "ko",
)
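The snippet above builds the pipelines but does not show a call. Since the Input Format section says the task prefix is required, one cautious way to call them is to prepend the prefix explicitly. The sketch below is an illustration, not the author's code: `with_prefix` and `run_pipeline` are hypothetical helper names, and it assumes the pipeline does not add a prefix of its own (that depends on the model's config, which is not stated here).

```python
# Task prefixes as documented in the Input Format section.
PREFIXES = {
    "ko-en": "translate Korean to English: ",
    "en-ko": "translate English to Korean: ",
}


def with_prefix(text: str, direction: str) -> str:
    """Prepend the task prefix for the requested direction (hypothetical helper)."""
    if direction not in PREFIXES:
        raise ValueError(f"unknown direction: {direction!r}")
    return PREFIXES[direction] + text


def run_pipeline(translator, text: str, direction: str) -> str:
    """Call a translation pipeline with an explicitly prefixed input.

    `translator` is assumed to behave like a transformers translation
    pipeline: a callable returning a list of dicts with a
    "translation_text" key.
    """
    result = translator(with_prefix(text, direction))
    return result[0]["translation_text"]
```

With the pipelines defined above, a call would look like `run_pipeline(ko_en, "오늘 날씨가 정말 좋네요.", "ko-en")`. If the model's config already supplies a prefix to the pipeline, pass the raw text instead to avoid prefixing twice.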

Input Format

Direction is controlled by the prefix — this is required:

translate Korean to English: {Korean text}
translate English to Korean: {English text}

Omitting the prefix will produce degraded or incorrect output.


Training Data

Three parallel corpus sources were streamed and combined. Each sentence pair was expanded into two training examples — one per direction — giving a balanced 50/50 KO→EN / EN→KO split.

| Dataset | Pairs | Filter | Fields |
|---|---|---|---|
| msarmi9/korean-english-multitarget-ted-talks-task | 30,000 | None | korean, english |
| Moo/korean-parallel-corpora | 20,000 | None | ko, en |
| lemon-mint/korean_english_parallel_wiki_augmented_v1 | 100,000 | score ≥ 0.87 | korean, english |

Total pairs: ~150,000
Total training examples: ~299,000 (after bidirectional expansion)
Val split: 1,000 examples (0.3%)

The wiki dataset quality filter (score ≥ 0.87) was applied to retain only high-confidence alignment pairs. Records below this threshold were discarded during streaming.
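The filtering and bidirectional expansion described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual preprocessing script: the function names, the `source`/`target` output keys, and the `score` field name are hypothetical, and the prefixes are taken from the Input Format section.

```python
from typing import Iterable, Iterator, Optional

KO_EN = "translate Korean to English: "
EN_KO = "translate English to Korean: "


def expand_pair(ko: str, en: str) -> list:
    """Turn one sentence pair into two training examples, one per direction."""
    return [
        {"source": KO_EN + ko, "target": en},
        {"source": EN_KO + en, "target": ko},
    ]


def filter_and_expand(
    records: Iterable[dict],
    ko_field: str,
    en_field: str,
    min_score: Optional[float] = None,
) -> Iterator[dict]:
    """Stream records, drop low-score pairs, and yield both directions.

    When `min_score` is set, records without a "score" field (assumed name)
    are treated as score 0.0 and therefore dropped.
    """
    for rec in records:
        if min_score is not None and rec.get("score", 0.0) < min_score:
            continue
        yield from expand_pair(rec[ko_field], rec[en_field])
```

For the wiki dataset this would be called with `ko_field="korean"`, `en_field="english"`, and `min_score=0.87`; the other two sources would use their own field names and no threshold.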


Training Configuration

base_model        = 'google/mt5-small'
batch_size        = 32
grad_accumulation = 4          # effective batch: 128
learning_rate     = 5e-4
warmup_steps      = 500
num_epochs        = 2
max_source_len    = 512
max_target_len    = 512
bf16              = True

Loss and BLEU curve:

| Step | Train Loss | Val Loss | BLEU |
|---|---|---|---|
| 1,500 | 10.6597 | 2.0668 | 10.50 |
| 3,000 | 9.1004 | 1.7672 | 16.18 |
| 4,500 | 8.2559 | 1.6380 | 18.76 |
| 6,000 | 8.0250 | 1.5700 | 20.84 |
| 7,500 | 7.6715 | 1.5370 | 21.40 |
| 8,094 | 7.6516 | 1.5326 | 21.88 |

BLEU gains slowed markedly after step 6,000: only +1.04 points (20.84 to 21.88) over the final 2,094 steps. Validation loss flattened over the same range, so the model had largely converged within 2 epochs.
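The slowdown can be checked directly from the table. The sketch below copies the (step, BLEU) values from the table and computes the gain per 1,000 steps for each interval; the helper name `gains_per_1k_steps` is invented for illustration.

```python
# (step, BLEU) checkpoints copied from the loss and BLEU table.
CHECKPOINTS = [
    (1500, 10.50),
    (3000, 16.18),
    (4500, 18.76),
    (6000, 20.84),
    (7500, 21.40),
    (8094, 21.88),
]


def gains_per_1k_steps(checkpoints):
    """Return (start_step, end_step, BLEU gain per 1,000 steps) per interval."""
    rows = []
    for (s0, b0), (s1, b1) in zip(checkpoints, checkpoints[1:]):
        rows.append((s0, s1, (b1 - b0) / (s1 - s0) * 1000))
    return rows


for start, end, rate in gains_per_1k_steps(CHECKPOINTS):
    print(f"{start:>5} -> {end:>5}: {rate:+.2f} BLEU per 1k steps")

# Total gain over the final stretch (step 6,000 to step 8,094).
final_gain = CHECKPOINTS[-1][1] - CHECKPOINTS[3][1]
print(f"gain after step 6,000: {final_gain:+.2f}")
```

The per-interval rate drops from several BLEU points per 1,000 steps early in training to well under one point late in training, which is the pattern the paragraph above describes.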


Evaluation

BLEU 21.88 on the held-out validation set (1,000 examples, mixed KO→EN and EN→KO).

For context on what BLEU scores mean in practice:

| BLEU | Quality |
|---|---|
| < 10 | Unusable |
| 10–19 | Rough: gist only |
| 20–29 | Understandable: noticeable flaws |
| 30–40 | Good: approaches human quality |
| 40+ | High quality |

Midas sits at the lower end of the "understandable" band. Translations convey meaning but may have grammatical errors, awkward phrasing, or minor semantic drift — particularly on longer or more complex sentences.


Known Limitations

BLEU ceiling for mT5-small. The model converged at around 21–22 BLEU. Reaching 25+ would likely require mT5-base (580M parameters) or more training steps. The flattening BLEU curve suggests that capacity, not data, is the limiting factor here.

Long sentences. Performance degrades on long sentences that require extensive reordering. Korean is verb-final (SOV) while English is SVO, so constituents often have to move across the whole clause, which remains a challenge at this model scale.

Domain coverage. Training data is dominated by encyclopedic (Wikipedia) and spoken (TED) content. Legal, medical, and technical Korean text will produce lower-quality output.

Sacrebleu tokenisation artefact. mT5's SentencePiece decoder sometimes produces spaces before punctuation (e.g. word . rather than word.). This slightly suppresses the reported BLEU score — actual perceived quality is marginally better than the number suggests.
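If the stray spaces matter for display or scoring, a small post-processing step can remove them. The sketch below is an assumption about how one might do this, not part of the reported evaluation; `tidy_punctuation` is a hypothetical helper, and the set of punctuation marks it handles is a guess that may need extending.

```python
import re

# Whitespace followed by a punctuation mark that should attach to the
# preceding word. The character set is an assumption; extend as needed.
_SPACE_BEFORE_PUNCT = re.compile(r"\s+([.,!?;:])")


def tidy_punctuation(text: str) -> str:
    """Remove whitespace that precedes sentence punctuation."""
    return _SPACE_BEFORE_PUNCT.sub(r"\1", text)
```

Applied to decoder output, `tidy_punctuation("word .")` gives `"word."`. Scores computed after this step are not directly comparable with the BLEU 21.88 reported above, which was measured on the raw output.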


Citation

@misc{midas_korean_english_2026,
  author       = {Asad, Zain},
  title        = {Midas: A Bidirectional Korean-English Translation Model
                  Based on mT5-Small},
  year         = {2026},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/EphAsad/Midas-Korean-English}},
}

License

Released under the Apache 2.0 License, consistent with the mT5-small base model.


Built independently by EphAsad
