Vietnamese-Tày Neural Machine Translation

This is the first public release of a Vietnamese-Tày translation model. Expect imperfections and feel free to report issues or suggest improvements so the model can better serve the Tày and Vietnamese communities.

Model name: Vi-Tày Transformer
Description: First publicly released neural translation model for Vietnamese ↔ Tày
First release date: 12/11/2025

This repository provides a sequence-to-sequence Transformer model for bidirectional translation between Vietnamese (vi) and Tày (tyz), implemented in pure PyTorch with a SentencePiece tokenizer.

The model is intended for:

  • Translating Vietnamese text into Tày
  • Translating Tày text into Vietnamese
  • Supporting research and experimentation on low-resource language translation

Model Overview

  • Architecture: Encoder-Decoder Transformer (Seq2Seq)
  • Framework: PyTorch
  • Tokenizer: SentencePiece (subword tokenization)
    • Model file: spm.model
  • Languages:
    • vi - Vietnamese
    • tyz - Tày (ISO 639-3: tyz)
  • Decoding: Beam search

The code assumes the following key components are defined in model.py:

  • ModelConfig
  • Seq2SeqTransformer
  • Constants: PAD, BOS, EOS, LANG2ID
  • (Optionally) a generate method for sequence decoding
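For orientation, here is a hypothetical sketch of what model.py is assumed to export. The specific ID values and the shape of LANG2ID are illustrative assumptions; the real values depend on the tokenizer and training setup in this repository.

```python
# Hypothetical sketch of model.py's expected exports -- illustrative only.
PAD, BOS, EOS = 0, 1, 2          # special-token IDs (assumed values)
LANG2ID = {"vi": 0, "tyz": 1}    # language tags passed to generate() (assumed mapping)
```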

Files in This Repository

Typical files you will see:

  • pytorch_model.bin - Model weights (PyTorch state_dict)
  • config.json - Model configuration (hyperparameters, vocabulary size, etc.)
  • spm.model - SentencePiece tokenizer model
  • model.py - Model architecture and utilities
  • README.md - This file

How to Load the Model

import json
import torch
import sentencepiece as spm

from model import ModelConfig, Seq2SeqTransformer, PAD, BOS, EOS, LANG2ID

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load SentencePiece tokenizer
sp = spm.SentencePieceProcessor(model_file="spm.model")

# Load config
with open("config.json", "r", encoding="utf-8") as f:
    cfg_dict = json.load(f)

cfg = ModelConfig(cfg_dict)  # or ModelConfig(**cfg_dict) depending on implementation

# Create model & load weights
model = Seq2SeqTransformer(cfg)
state_dict = torch.load("pytorch_model.bin", map_location=device)
model.load_state_dict(state_dict)
model.to(device)
model.eval()

Helper: Decoding Token IDs

If you don’t already have a decoding helper, you can use:

def decode_ids(token_ids):
    """Decode a sequence of token IDs to text, removing special tokens."""
    clean_ids = [tid for tid in token_ids if tid not in (PAD, BOS, EOS)]
    return sp.decode(clean_ids)
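The special-token filtering step can be checked in isolation. A minimal illustration with placeholder IDs (PAD=0, BOS=1, EOS=2 are assumptions for the demo, not the model's actual values):

```python
# Placeholder special-token IDs for illustration only
PAD, BOS, EOS = 0, 1, 2

token_ids = [BOS, 15, 7, 42, EOS, PAD, PAD]
clean_ids = [tid for tid in token_ids if tid not in (PAD, BOS, EOS)]
print(clean_ids)  # [15, 7, 42]
```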

Translation Usage

Vietnamese → Tày

@torch.inference_mode()
def translate_vi_to_tay(model, text_vi: str, max_len: int = 128, beam_size: int = 4):
    model.eval()
    # Encode Vietnamese sentence
    src_ids = sp.encode(text_vi, out_type=int)
    src_tensor = torch.tensor(src_ids, dtype=torch.long, device=device).unsqueeze(0)  # [1, src_len]

    # Generate Tày translation
    out = model.generate(
        src_tensor,
        src_lang_id=LANG2ID["vi"],
        tgt_lang_id=LANG2ID["tyz"],
        bos_id=BOS,
        eos_id=EOS,
        beam_size=beam_size,
        max_len=max_len,
    )

    # Assuming model.generate returns shape [1, T]
    output_ids = out[0].tolist()
    return decode_ids(output_ids)


example_vi = "À ơi em ngủ, ngủ say đi, đợi tý mẹ về."
print(translate_vi_to_tay(model, example_vi))

Example output:

Ừ hở noọng nòn, nòn đắc nòn pây, đợi tý me mà.

Expected Inputs & Outputs

  • Input type: Raw Unicode strings (UTF-8), either Vietnamese or Tày
  • Preprocessing: SentencePiece subword encoding via spm.model
  • Output type: UTF-8 strings in the target language
  • Length limits:
    • Typically, max_len is set to around 128 tokens (adjustable)
    • Very long sentences may be truncated or produce degraded translations
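One simple mitigation for over-long inputs is to clip the encoded source before calling generate. truncate_ids below is a hypothetical helper, not part of this repository:

```python
def truncate_ids(token_ids, max_len=128):
    """Clip a list of source token IDs so it never exceeds max_len."""
    return token_ids[:max_len]

ids = list(range(200))
clipped = truncate_ids(ids)
print(len(clipped))  # 128
```

Note that clipping discards the tail of the sentence; for long documents, splitting on sentence boundaries before translation usually gives better results than hard truncation.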

Training Data & Method

Note: This is a high-level description only. Exact dataset details and statistics can be added here if/when they are released.

  • Data: Parallel Vietnamese-Tày sentence pairs collected from various sources (e.g., community texts, educational content, manually constructed pairs).
  • Preprocessing:
    • Normalization and cleaning of text
    • SentencePiece vocabulary training on combined corpora
  • Objective: Standard sequence-to-sequence cross-entropy loss with teacher forcing
  • Optimization: Transformer training with AdamW
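The objective above amounts to token-level cross-entropy over teacher-forced targets, with padding positions masked out of the loss. A generic PyTorch sketch (the shapes, vocabulary size, and PAD id are illustrative, not taken from this model's config):

```python
import torch
import torch.nn as nn

PAD = 0  # assumed padding id
criterion = nn.CrossEntropyLoss(ignore_index=PAD)  # PAD positions contribute no loss

# Dummy decoder logits [batch, tgt_len, vocab] and gold target IDs [batch, tgt_len]
logits = torch.randn(2, 5, 100)
targets = torch.randint(1, 100, (2, 5))  # teacher-forced references

# Flatten to [batch * tgt_len, vocab] vs. [batch * tgt_len] as CrossEntropyLoss expects
loss = criterion(logits.reshape(-1, 100), targets.reshape(-1))
```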

Ethical Considerations

  • Tày is a low-resource and minority language. When using this model:

    • Be mindful of how generated content represents the Tày community and culture.
    • Avoid using translations to misrepresent or manipulate speakers of Tày or Vietnamese.
    • Encourage human review and involvement, especially from native speakers.

Contact & Contributions

  • Issues / bugs: Please open an issue on the model's Hugging Face page or repository.

  • Contributions: Contributions (bug fixes, better examples, improved evaluation, or documentation for Tày) are very welcome. You can:

    • Submit pull requests with improved code or README sections
    • Share evaluation results or additional example translations
    • Help refine the tokenizer or add support for more variants