Midas — Korean ↔ English Translation

Everything it touches, it translates.

A bidirectional Korean ↔ English translation model built on mT5-small, fine-tuned on ~150k parallel sentence pairs drawn from TED talks, parallel corpora, and quality-filtered Wikipedia data.

Overview

Midas is a single bidirectional translation model that handles both Korean→English and English→Korean in one set of weights. Direction is controlled by a task prefix in the input — no separate models required.

mT5-small was chosen over standard T5-small specifically for its multilingual SentencePiece tokenizer, which covers Korean script natively and avoids the heavy subword fragmentation that limits T5-small's Korean generation quality.

What Midas is good for:

Getting the gist of Korean text in English
Drafting English-to-Korean translations for human review
Lightweight deployment where model size matters
Baseline translation in pipelines where speed matters more than perfection

What Midas is not:

A production-grade translator — BLEU 21.88 sits in the "understandable with noticeable flaws" range
A replacement for DeepL or Google Translate on critical content
Suited for highly domain-specific text (legal, medical, technical)

Model Details

Property	Value
Base model	google/mt5-small
Parameters	~300M
Task	Bidirectional Korean ↔ English translation
Training examples	299,000 (~150k pairs × 2 directions, less 1,000 held out for validation)
Epochs	2
Total steps	8,094
Training time	2h 20m
Final BLEU	21.88
Final train loss	7.6516
Final val loss	1.5326
Hardware	NVIDIA A100-SXM4 80GB
Max sequence length	512 tokens (source and target)
Precision	bfloat16
License	Apache 2.0

Usage

Transformers

from transformers import MT5ForConditionalGeneration, AutoTokenizer

model_name = "EphAsad/Midas-Korean-English"
tokenizer  = AutoTokenizer.from_pretrained(model_name)
model      = MT5ForConditionalGeneration.from_pretrained(model_name)

def translate(text, direction="ko-en"):
    if direction == "ko-en":
        prefix = "translate Korean to English: "
    else:
        prefix = "translate English to Korean: "

    inputs = tokenizer(
        prefix + text,
        return_tensors = "pt",
        max_length     = 512,
        truncation     = True,
    )
    outputs = model.generate(
        **inputs,
        max_new_tokens     = 512,
        num_beams          = 4,
        early_stopping     = True,
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Korean → English
print(translate("오늘 날씨가 정말 좋네요.", direction="ko-en"))

# English → Korean
print(translate("The weather is really nice today.", direction="en-ko"))

Pipeline

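from transformers import pipeline

ko_en = pipeline(
    "translation",
    model    = "EphAsad/Midas-Korean-English",
    src_lang = "ko",
    tgt_lang = "en",
)

en_ko = pipeline(
    "translation",
    model    = "EphAsad/Midas-Korean-English",
    src_lang = "en",
    tgt_lang = "ko",
)

# Assumption: the pipeline does not turn src_lang/tgt_lang into the task
# prefix for this model, so the prefix is passed explicitly in the text.
print(ko_en("translate Korean to English: 오늘 날씨가 정말 좋네요.")[0]["translation_text"])
print(en_ko("translate English to Korean: The weather is really nice today.")[0]["translation_text"])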

Input Format

Direction is controlled by the prefix — this is required:

translate Korean to English: {Korean text}
translate English to Korean: {English text}

Omitting the prefix will produce degraded or incorrect output.
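
A minimal sketch of building the prefixed input so that an unsupported direction fails loudly. The `translate` function in the Usage section treats anything other than "ko-en" as English→Korean, so a typo in the direction would silently pick the wrong prefix. The names `PREFIXES` and `build_input` are illustrative assumptions and not part of the model repository; the prefix strings are the two listed above.

```python
# Hypothetical helper (not shipped with the model): maps a direction code
# to the task prefix listed in the Input Format section.
PREFIXES = {
    "ko-en": "translate Korean to English: ",
    "en-ko": "translate English to Korean: ",
}


def build_input(text: str, direction: str) -> str:
    """Return `text` with the task prefix for `direction` prepended."""
    try:
        prefix = PREFIXES[direction]
    except KeyError:
        raise ValueError(
            f"direction must be one of {sorted(PREFIXES)}, got {direction!r}"
        ) from None
    # Surrounding whitespace is stripped so the prefix is followed directly by the text.
    return prefix + text.strip()
```

The returned string can be passed to the tokenizer in place of `prefix + text`.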

Training Data

Three parallel corpus sources were streamed and combined. Each sentence pair was expanded into two training examples — one per direction — giving a balanced 50/50 KO→EN / EN→KO split.

Dataset	Pairs	Filter	Fields
msarmi9/korean-english-multitarget-ted-talks-task	30,000	None	`korean`, `english`
Moo/korean-parallel-corpora	20,000	None	`ko`, `en`
lemon-mint/korean_english_parallel_wiki_augmented_v1	100,000	`score ≥ 0.87`	`korean`, `english`

Total pairs: ~150,000
Total examples after bidirectional expansion: ~300,000
Training examples: ~299,000
Val split: 1,000 examples (~0.3%)

The wiki dataset quality filter (score ≥ 0.87) was applied to retain only high-confidence alignment pairs. Records below this threshold were discarded during streaming.
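
The filtering and bidirectional expansion described above can be sketched roughly as follows. This is not the author's training script: the function name `expand_pairs`, the output keys `source` and `target`, and the decision to skip pairs with an empty side are all assumptions made for illustration. The `score` field and the 0.87 threshold come from the table above.

```python
from typing import Iterable, Iterator, Optional

KO_EN = "translate Korean to English: "
EN_KO = "translate English to Korean: "


def expand_pairs(
    records: Iterable[dict],
    ko_key: str,
    en_key: str,
    min_score: Optional[float] = None,
) -> Iterator[dict]:
    """Yield two training examples (one per direction) for each kept pair."""
    for rec in records:
        # Quality filter: applied only when a threshold is given
        # (e.g. 0.87 for the wiki dataset); records below it are dropped.
        if min_score is not None and rec.get("score", 0.0) < min_score:
            continue
        ko, en = rec[ko_key].strip(), rec[en_key].strip()
        if not ko or not en:
            continue  # assumption: pairs with an empty side are skipped
        yield {"source": KO_EN + ko, "target": en}
        yield {"source": EN_KO + en, "target": ko}
```

Because it is a generator, it can consume a streamed dataset record by record without loading everything into memory.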

Training Configuration

base_model        = 'google/mt5-small'
batch_size        = 32
grad_accumulation = 4          # effective batch: 128
learning_rate     = 5e-4
warmup_steps      = 500
num_epochs        = 2
max_source_len    = 512
max_target_len    = 512
bf16              = True

Loss and BLEU curve:

Step	Train Loss	Val Loss	BLEU
1,500	10.6597	2.0668	10.50
3,000	9.1004	1.7672	16.18
4,500	8.2559	1.6380	18.76
6,000	8.0250	1.5700	20.84
7,500	7.6715	1.5370	21.40
8,094	7.6516	1.5326	21.88

BLEU gains slowed markedly after step 6,000: only +1.04 points were gained over the final 2,094 steps (20.84 → 21.88). The model had largely converged by the end of the second epoch.
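
To make the slowdown concrete, the short sketch below recomputes the BLEU gain between consecutive checkpoints and normalises it per 1,000 steps. The numbers are copied from the table above; the function name `bleu_gain_per_1k` is an illustrative assumption.

```python
# (step, BLEU) checkpoints copied from the loss and BLEU table.
checkpoints = [
    (1500, 10.50), (3000, 16.18), (4500, 18.76),
    (6000, 20.84), (7500, 21.40), (8094, 21.88),
]


def bleu_gain_per_1k(points):
    """Return (start, end, gain, gain per 1,000 steps) for each interval."""
    rows = []
    for (s0, b0), (s1, b1) in zip(points, points[1:]):
        gain = b1 - b0
        rows.append((s0, s1, round(gain, 2), round(gain / (s1 - s0) * 1000, 2)))
    return rows


for s0, s1, gain, rate in bleu_gain_per_1k(checkpoints):
    print(f"{s0:>5} -> {s1:>5}: +{gain:.2f} BLEU ({rate:.2f} per 1k steps)")
```

Summing the last two intervals gives the +1.04 points gained after step 6,000.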

Evaluation

BLEU 21.88 on the held-out validation set (1,000 examples, mixed KO→EN and EN→KO).

For context on what BLEU scores mean in practice:

BLEU	Quality
< 10	Unusable
10–19	Rough — gist only
20–29	Understandable — noticeable flaws
30–39	Good — approaches human quality
40+	High quality

Midas sits at the lower end of the "understandable" band. Translations convey meaning but may have grammatical errors, awkward phrasing, or minor semantic drift — particularly on longer or more complex sentences.
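
For pipelines that log BLEU automatically, the bands above can be encoded as a small lookup. This is only a sketch: the function name `bleu_band` is hypothetical, and treating each band as a half-open interval (so a score of exactly 40 counts as "High quality") is an assumption about how the boundaries should be read.

```python
def bleu_band(score: float) -> str:
    """Map a BLEU score (0-100 scale) to the rough quality bands above."""
    if not 0 <= score <= 100:
        raise ValueError(f"BLEU must be between 0 and 100, got {score}")
    if score < 10:
        return "Unusable"
    if score < 20:
        return "Rough (gist only)"
    if score < 30:
        return "Understandable (noticeable flaws)"
    if score < 40:
        return "Good (approaches human quality)"
    return "High quality"


print(bleu_band(21.88))
```

The bands are rules of thumb, so the label is best read as a rough indicator rather than a guarantee about any single translation.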

Known Limitations

BLEU ceiling for mT5-small. The model converged at around 21–22 BLEU. Reaching 25+ would likely require mT5-base (580M) or substantially more training. The flattening BLEU curve suggests that capacity, not data, is the limiting factor.

Long sentences. Performance degrades on sentences that require complex reordering. Korean and English differ sharply in word order (Korean is SOV, English is SVO), which remains a challenge at this scale.

Domain coverage. Training data is dominated by encyclopedic (Wikipedia) and spoken (TED) content. Legal, medical, and technical Korean text will produce lower quality outputs.

SacreBLEU tokenisation artefact. mT5's SentencePiece decoder sometimes produces spaces before punctuation (e.g. "word ." rather than "word."). This slightly lowers the reported BLEU score, so perceived quality is marginally better than the number suggests.
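
One possible post-processing step for the spacing artefact is a regular expression that removes whitespace before punctuation in decoded output. This is a sketch, not something the model card reports using: the function name `fix_punctuation_spacing` and the set of punctuation marks covered are assumptions, and its effect on the reported BLEU score has not been measured here.

```python
import re

# Assumption: only these ASCII punctuation marks are affected.
_SPACE_BEFORE_PUNCT = re.compile(r"\s+([.,!?;:])")


def fix_punctuation_spacing(text: str) -> str:
    """Remove whitespace that directly precedes sentence punctuation."""
    return _SPACE_BEFORE_PUNCT.sub(r"\1", text)


print(fix_punctuation_spacing("The weather is really nice today ."))
```

Punctuation that already sits directly after a word, or inside a number such as 3.5, is left unchanged because the pattern only matches when whitespace comes first.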

Citation

@misc{midas_korean_english_2026,
  author       = {Asad, Zain},
  title        = {Midas: A Bidirectional Korean-English Translation Model
                  Based on mT5-Small},
  year         = {2026},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/EphAsad/Midas-Korean-English}},
}

License

Released under the Apache 2.0 License, consistent with the mT5-small base model.

Built independently by EphAsad
