Midas — Korean ↔ English Translation
Everything it touches, it translates.
A bidirectional Korean ↔ English translation model built on mT5-small, fine-tuned on ~150k parallel sentence pairs drawn from TED talks, parallel corpora, and quality-filtered Wikipedia data.
Overview
Midas is a single bidirectional translation model that handles both Korean→English and English→Korean in one set of weights. Direction is controlled by a task prefix in the input — no separate models required.
mT5-small was chosen over standard T5-small specifically for its multilingual SentencePiece tokenizer, which covers Korean script natively and avoids the heavy subword fragmentation that limits T5-small's Korean generation quality.
What Midas is good for:
- Getting the gist of Korean text in English
- Drafting English-to-Korean translations for human review
- Lightweight deployment where model size matters
- Baseline translation in pipelines where speed > perfection
What Midas is not:
- A production-grade translator — BLEU 21.88 sits in the "understandable with noticeable flaws" range
- A replacement for DeepL or Google Translate on critical content
- Suited for highly domain-specific text (legal, medical, technical)
Model Details
| Property | Value |
|---|---|
| Base model | google/mt5-small |
| Parameters | ~300M |
| Task | Bidirectional Korean ↔ English translation |
| Training examples | ~299,000 (≈150k KO→EN + ≈150k EN→KO) |
| Epochs | 2 |
| Total steps | 8,094 |
| Training time | 2h 20m |
| Final BLEU | 21.88 |
| Final train loss | 7.6516 |
| Final val loss | 1.5326 |
| Hardware | NVIDIA A100-SXM4 80GB |
| Max sequence length | 512 tokens (source and target) |
| Precision | bfloat16 |
| License | Apache 2.0 |
Usage
Transformers
from transformers import MT5ForConditionalGeneration, AutoTokenizer

model_name = "EphAsad/Midas-Korean-English"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = MT5ForConditionalGeneration.from_pretrained(model_name)

def translate(text, direction="ko-en"):
    if direction == "ko-en":
        prefix = "translate Korean to English: "
    else:
        prefix = "translate English to Korean: "
    inputs = tokenizer(
        prefix + text,
        return_tensors = "pt",
        max_length = 512,
        truncation = True,
    )
    outputs = model.generate(
        **inputs,
        max_new_tokens = 512,
        num_beams = 4,
        early_stopping = True,
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Korean → English
print(translate("오늘 날씨가 정말 좋네요.", direction="ko-en"))
# English → Korean
print(translate("The weather is really nice today.", direction="en-ko"))
Pipeline
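from transformers import pipeline

ko_en = pipeline(
    "translation",
    model = "EphAsad/Midas-Korean-English",
    src_lang = "ko",
    tgt_lang = "en",
)
en_ko = pipeline(
    "translation",
    model = "EphAsad/Midas-Korean-English",
    src_lang = "en",
    tgt_lang = "ko",
)

# The task prefix must still be part of the input text (see Input Format).
print(ko_en("translate Korean to English: 오늘 날씨가 정말 좋네요.")[0]["translation_text"])
print(en_ko("translate English to Korean: The weather is really nice today.")[0]["translation_text"])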
Input Format
Direction is controlled by the prefix — this is required:
translate Korean to English: {Korean text}
translate English to Korean: {English text}
Omitting the prefix will produce degraded or incorrect output.
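A small helper can make the prefix hard to forget. The sketch below is illustrative only: `PREFIXES` and `build_input` are hypothetical names, not part of the released model or of Transformers. Only the two prefix strings come from this card.

```python
# Hypothetical helper: maps a direction code to the task prefix documented above.
PREFIXES = {
    "ko-en": "translate Korean to English: ",
    "en-ko": "translate English to Korean: ",
}

def build_input(text: str, direction: str) -> str:
    """Return the model input with the required task prefix prepended."""
    if direction not in PREFIXES:
        raise ValueError(
            f"direction must be one of {sorted(PREFIXES)}, got {direction!r}"
        )
    return PREFIXES[direction] + text.strip()

print(build_input("오늘 날씨가 정말 좋네요.", "ko-en"))
# translate Korean to English: 오늘 날씨가 정말 좋네요.
```

Raising on an unknown direction is a design choice of this sketch. It avoids silently translating in the wrong direction when a direction code is mistyped.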
Training Data
Three parallel corpus sources were streamed and combined. Each sentence pair was expanded into two training examples — one per direction — giving a balanced 50/50 KO→EN / EN→KO split.
| Dataset | Pairs | Filter | Fields |
|---|---|---|---|
| msarmi9/korean-english-multitarget-ted-talks-task | 30,000 | None | korean, english |
| Moo/korean-parallel-corpora | 20,000 | None | ko, en |
| lemon-mint/korean_english_parallel_wiki_augmented_v1 | 100,000 | score ≥ 0.87 | korean, english |
Total pairs: ~150,000
Total training examples: ~299,000 (after bidirectional expansion)
Val split: 1,000 examples (0.3%)
The wiki dataset quality filter (score ≥ 0.87) was applied to retain only high-confidence alignment pairs. Records below this threshold were discarded during streaming.
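The filtering and bidirectional expansion described above can be sketched in a few lines. This is not the author's preprocessing script. The `korean` and `english` field names follow the table, while the `score` key, the `source`/`target` output keys, and the function name `expand_pairs` are assumptions made for illustration.

```python
from typing import Iterable, Iterator, Optional

MIN_SCORE = 0.87  # threshold stated above for the wiki dataset

def expand_pairs(
    records: Iterable[dict], min_score: Optional[float] = None
) -> Iterator[dict]:
    """Yield two training examples (one per direction) for each kept pair."""
    for rec in records:
        # Assumption: the alignment score is stored under the key "score".
        if min_score is not None and rec.get("score", 0.0) < min_score:
            continue
        ko, en = rec["korean"], rec["english"]
        yield {"source": "translate Korean to English: " + ko, "target": en}
        yield {"source": "translate English to Korean: " + en, "target": ko}

sample = [
    {"korean": "안녕하세요.", "english": "Hello.", "score": 0.95},
    {"korean": "감사합니다.", "english": "Thank you.", "score": 0.60},  # below threshold
]
examples = list(expand_pairs(sample, min_score=MIN_SCORE))
print(len(examples))  # 2
```

Because the function is a generator, it works with streamed datasets without loading them into memory. Passing `min_score=None` skips filtering, which matches the unfiltered sources.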
Training Configuration
base_model = 'google/mt5-small'
batch_size = 32
grad_accumulation = 4 # effective batch: 128
learning_rate = 5e-4
warmup_steps = 500
num_epochs = 2
max_source_len = 512
max_target_len = 512
bf16 = True
Loss and BLEU curve:
| Step | Train Loss | Val Loss | BLEU |
|---|---|---|---|
| 1,500 | 10.6597 | 2.0668 | 10.50 |
| 3,000 | 9.1004 | 1.7672 | 16.18 |
| 4,500 | 8.2559 | 1.6380 | 18.76 |
| 6,000 | 8.0250 | 1.5700 | 20.84 |
| 7,500 | 7.6715 | 1.5370 | 21.40 |
| 8,094 | 7.6516 | 1.5326 | 21.88 |
BLEU gains slowed markedly after step 6,000: only +1.04 points over the final 2,094 steps (from 20.84 to 21.88). The model had largely converged within 2 epochs.
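The slowdown can be read directly off the table. The short script below recomputes the BLEU gain between consecutive checkpoints. The numbers are copied from the table, and the variable names are illustrative.

```python
# (step, BLEU) checkpoints copied from the table above.
checkpoints = [
    (1500, 10.50), (3000, 16.18), (4500, 18.76),
    (6000, 20.84), (7500, 21.40), (8094, 21.88),
]

# BLEU gained between each pair of consecutive checkpoints.
gains = [
    (prev_step, step, round(bleu - prev_bleu, 2))
    for (prev_step, prev_bleu), (step, bleu) in zip(checkpoints, checkpoints[1:])
]
for start, end, gain in gains:
    print(f"{start:>5} -> {end:>5}: +{gain:.2f} BLEU")

# Total gain across the intervals that start at or after step 6,000.
late_gain = round(sum(g for start, _, g in gains if start >= 6000), 2)
print(f"after step 6000: +{late_gain:.2f} BLEU")
```

Note that the last interval covers only 594 steps, so its gain is not directly comparable to the 1,500-step intervals before it.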
Evaluation
BLEU 21.88 on the held-out validation set (1,000 examples, mixed KO→EN and EN→KO).
For context on what BLEU scores mean in practice:
| BLEU | Quality |
|---|---|
| < 10 | Unusable |
| 10–19 | Rough — gist only |
| 20–29 | Understandable — noticeable flaws |
| 30–40 | Good — approaches human quality |
| 40+ | High quality |
Midas sits at the lower end of the "understandable" band. Translations convey meaning but may have grammatical errors, awkward phrasing, or minor semantic drift — particularly on longer or more complex sentences.
Known Limitations
BLEU ceiling for mT5-small. The model converged around 21–22 BLEU. Reaching 25+ would likely require mT5-base (580M) or more training steps. Given how flat the BLEU curve is after step 6,000, capacity rather than data is the more likely limiting factor.
Long sentences. Performance degrades on sentences that require complex reordering. Korean and English differ structurally (Korean is SOV, English is SVO), and bridging that word-order gap remains a challenge at this scale.
Domain coverage. Training data is dominated by encyclopedic (Wikipedia) and spoken (TED) content. Legal, medical, and technical Korean text will produce lower quality outputs.
SacreBLEU tokenisation artefact. mT5's SentencePiece decoder sometimes produces spaces before punctuation (e.g. "word ." rather than "word."). This slightly suppresses the reported BLEU score, so perceived quality is marginally better than the number suggests.
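If the stray spaces matter downstream, a light post-processing pass can remove them. The function below is a minimal sketch and not part of the author's evaluation pipeline. `fix_punctuation_spacing` is a hypothetical name, and the punctuation set is an assumption that may need extending for Korean-specific marks.

```python
import re

# Whitespace followed by closing punctuation that should attach to the previous word.
_SPACE_BEFORE_PUNCT = re.compile(r"\s+([.,!?;:])")

def fix_punctuation_spacing(text: str) -> str:
    """Remove whitespace that precedes closing punctuation, e.g. 'word .' -> 'word.'."""
    return _SPACE_BEFORE_PUNCT.sub(r"\1", text)

print(fix_punctuation_spacing("The weather is really nice today ."))
# The weather is really nice today.
```

Applying this before scoring would change the BLEU value, so results computed that way are not directly comparable with the 21.88 reported here. The pattern is deliberately simple and will also join cases such as "a : b".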
Citation
@misc{midas_korean_english_2026,
  author       = {Asad, Zain},
  title        = {Midas: A Bidirectional Korean-English Translation Model Based on mT5-Small},
  year         = {2026},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/EphAsad/Midas-Korean-English}},
}
License
Released under the Apache 2.0 License, consistent with the mT5-small base model.
Built independently by EphAsad