language:
  - lo
  - vi
tags:
  - translation
  - machine-translation
  - m2m100
  - lao
  - vietnamese
  - vlsp2023
license: apache-2.0
datasets:
  - lngmnhlinh/vietnamese-lao
metrics:
  - bleu
model-index:
  - name: Vietnamese-to-Lao M2M100 (200M)
    results:
      - task:
          type: translation
          name: Machine Translation
        dataset:
          name: VLSP2023
          type: lngmnhlinh/vietnamese-lao
          split: test
        metrics:
          - name: BLEU
            type: bleu
            value: 41

Vietnamese-to-Lao Translation Model (M2M100, ~200M parameters)

Model Description

This model is a neural machine translation system for Vietnamese ➜ Lao, built from scratch using the architecture and configuration of Facebook's M2M100. It is specifically trained for low-resource translation tasks between Lao and Vietnamese.

  • Architecture: M2M100-based (custom config)
  • Model size: ~200M parameters
  • Task: Machine Translation
  • Source Language: Vietnamese (vi)
  • Target Language: Lao (lo)
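
The exact dimensions of this checkpoint are not published in the card, but as an illustrative sketch, a custom M2M100 configuration in the ~200M-parameter range can be instantiated from scratch with 🤗 Transformers roughly like this (every dimension value below is an assumption, not the real config):

```python
from transformers import M2M100Config, M2M100ForConditionalGeneration

# Hypothetical configuration — illustrative only; the released model's
# actual vocabulary size and layer dimensions may differ.
config = M2M100Config(
    vocab_size=64000,       # assumed joint Vietnamese-Lao vocabulary
    d_model=768,
    encoder_layers=8,
    decoder_layers=8,
    encoder_ffn_dim=3072,
    decoder_ffn_dim=3072,
)

# Randomly initialized weights — this is "from scratch", not a fine-tune
model = M2M100ForConditionalGeneration(config)
print(f"{model.num_parameters():,} parameters")
```

With dimensions in this neighborhood the parameter count lands near 200M; the real checkpoint's budget depends mostly on its vocabulary size and layer widths.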

Training Data

The model was trained on the VLSP 2023 Vietnamese-Lao parallel dataset, supplemented with synthetic translations generated using:

  • Google Translate
  • Gemini (Google's LLM)

Full dataset details: Kaggle - Vietnamese-Lao Parallel Corpus

Evaluation

The model achieved a BLEU score of ~41 on the VLSP2023 test set, calculated using:

  • sacrebleu
  • Hugging Face's evaluate library

Usage

BIG NOTE: The training data and the vocabulary are entirely lowercased 😃, so please lowercase your text before passing it to the tokenizer.
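
Given the lowercasing requirement above, a minimal preprocessing helper (a hypothetical convenience function, not part of the released code) could look like:

```python
def preprocess(text: str) -> str:
    """Lowercase and collapse whitespace before tokenization,
    to match how the training data was prepared (lowercased)."""
    return " ".join(text.lower().split())

print(preprocess("  Bạn ĐANG làm gì?  "))  # -> "bạn đang làm gì?"
```

Python's `str.lower()` handles the Unicode characters used by Vietnamese and Lao, so no extra normalization library is needed for this step.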

To perform translation using this model with 🤗 Transformers:

```python
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = M2M100ForConditionalGeneration.from_pretrained("luongmanhlinh/M2M100_Vi2Lo_Translation").to(device)
tokenizer = M2M100Tokenizer.from_pretrained("luongmanhlinh/M2M100_Vi2Lo_Translation")

# Example input
src_lang = "vi"
tgt_lang = "lo"
text = "Bạn đang làm gì?"  # Vietnamese sentence ("What are you doing?")

# Prepare input (the model expects lowercased text)
text = text.lower()
tokenizer.src_lang = src_lang
inputs = tokenizer(text, return_tensors="pt").to(device)

# Force the decoder to generate in the target language
forced_bos_token_id = tokenizer.lang_code_to_id[tgt_lang]
decoder_start_token_id = tokenizer.lang_code_to_id[tgt_lang]

# Generate translation
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        forced_bos_token_id=forced_bos_token_id,
        decoder_start_token_id=decoder_start_token_id,
        num_beams=5,
        early_stopping=True,
        max_new_tokens=100,
    )

# Decode translation
translation = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
print("Translation:", translation)
```