Vietnamese-Tày Neural Machine Translation

This is the first public release of a Vietnamese-Tày translation model. Expect imperfections and feel free to report issues or suggest improvements so the model can better serve the Tày and Vietnamese communities.

Model name: Vi-Tày Transformer
Description: First publicly released neural translation model for Vietnamese ↔ Tày
First release date: 12/11/2025

This repository provides a sequence-to-sequence Transformer model for bidirectional translation between Vietnamese (vi) and Tày (tyz), implemented in pure PyTorch with a SentencePiece tokenizer.

The model is intended for:

  • Translating Vietnamese text into Tày
  • Translating Tày text into Vietnamese
  • Supporting research and experimentation on low-resource language translation

Model Overview

  • Architecture: Encoder-Decoder Transformer (Seq2Seq)
  • Framework: PyTorch
  • Tokenizer: SentencePiece (subword tokenization)
    • Model file: spm.model
  • Languages:
    • vi - Vietnamese
    • tyz - Tày (ISO 639-3: tyz)
  • Decoding: Beam search

The code assumes the following key components are defined in model.py:

  • ModelConfig
  • Seq2SeqTransformer
  • Constants: PAD, BOS, EOS, LANG2ID
  • (Optionally) a generate method for sequence decoding
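For orientation, here is a hypothetical sketch of what model.py is assumed to export. The specific ID values and the shape of LANG2ID are illustrative assumptions; the real values depend on the tokenizer and training setup in this repository.

```python
# Hypothetical sketch of model.py's expected exports -- illustrative only.
PAD, BOS, EOS = 0, 1, 2          # special-token IDs (assumed values)
LANG2ID = {"vi": 0, "tyz": 1}    # language tags passed to generate() (assumed mapping)
```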

Files in This Repository

Typical files you will see:

  • pytorch_model.bin - Model weights (PyTorch state_dict)
  • config.json - Model configuration (hyperparameters, vocabulary size, etc.)
  • spm.model - SentencePiece tokenizer model
  • model.py - Model architecture and utilities
  • README.md - This file

How to Load the Model

import json
import torch
import sentencepiece as spm

from model import ModelConfig, Seq2SeqTransformer, PAD, BOS, EOS, LANG2ID

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load SentencePiece tokenizer
sp = spm.SentencePieceProcessor(model_file="spm.model")

# Load config
with open("config.json", "r", encoding="utf-8") as f:
    cfg_dict = json.load(f)

cfg = ModelConfig(cfg_dict)  # or ModelConfig(**cfg_dict) depending on implementation

# Create model & load weights
model = Seq2SeqTransformer(cfg)
state_dict = torch.load("pytorch_model.bin", map_location=device)
model.load_state_dict(state_dict)
model.to(device)
model.eval()

Helper: Decoding Token IDs

If you don’t already have a decoding helper, you can use:

def decode_ids(token_ids):
    """Decode a sequence of token IDs to text, removing special tokens."""
    clean_ids = [tid for tid in token_ids if tid not in (PAD, BOS, EOS)]
    return sp.decode(clean_ids)
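The special-token filtering step can be checked in isolation. A minimal illustration with placeholder IDs (PAD=0, BOS=1, EOS=2 are assumptions for the demo, not the model's actual values):

```python
# Placeholder special-token IDs for illustration only
PAD, BOS, EOS = 0, 1, 2

token_ids = [BOS, 15, 7, 42, EOS, PAD, PAD]
clean_ids = [tid for tid in token_ids if tid not in (PAD, BOS, EOS)]
print(clean_ids)  # [15, 7, 42]
```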

Translation Usage

Vietnamese → Tày

@torch.inference_mode()
def translate_vi_to_tay(model, text_vi: str, max_len: int = 128, beam_size: int = 4):
    model.eval()
    # Encode Vietnamese sentence
    src_ids = sp.encode(text_vi, out_type=int)
    src_tensor = torch.tensor(src_ids, dtype=torch.long, device=device).unsqueeze(0)  # [1, src_len]

    # Generate Tày translation
    out = model.generate(
        src_tensor,
        src_lang_id=LANG2ID["vi"],
        tgt_lang_id=LANG2ID["tyz"],
        bos_id=BOS,
        eos_id=EOS,
        beam_size=beam_size,
        max_len=max_len,
    )

    # Assuming model.generate returns shape [1, T]
    output_ids = out[0].tolist()
    return decode_ids(output_ids)


example_vi = "À ơi em ngủ, ngủ say đi, đợi tý mẹ về."
print(translate_vi_to_tay(model, example_vi))

Example output:

Ừ hở noọng nòn, nòn đắc nòn pây, đợi tý me mà.

Expected Inputs & Outputs

  • Input type: Raw Unicode strings (UTF-8), either Vietnamese or Tày
  • Preprocessing: SentencePiece subword encoding via spm.model
  • Output type: UTF-8 strings in the target language
  • Length limits:
    • Typically, max_len is set to around 128 tokens (adjustable)
    • Very long sentences may be truncated or produce degraded translations
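One simple mitigation for over-long inputs is to clip the encoded source before calling generate. truncate_ids below is a hypothetical helper, not part of this repository:

```python
def truncate_ids(token_ids, max_len=128):
    """Clip a list of source token IDs so it never exceeds max_len."""
    return token_ids[:max_len]

ids = list(range(200))
clipped = truncate_ids(ids)
print(len(clipped))  # 128
```

Note that clipping discards the tail of the sentence; for long documents, splitting on sentence boundaries before translation usually gives better results than hard truncation.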

Training Data & Method

Note: This is a high-level description only. Exact dataset details and statistics can be added here if/when they are released.

  • Data: Parallel Vietnamese-Tày sentence pairs collected from various sources (e.g., community texts, educational content, manually constructed pairs).
  • Preprocessing:
    • Normalization and cleaning of text
    • SentencePiece vocabulary training on combined corpora
  • Objective: Standard sequence-to-sequence cross-entropy loss with teacher forcing
  • Optimization: Transformer training with AdamW
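The objective above amounts to token-level cross-entropy over teacher-forced targets, with padding positions masked out of the loss. A generic PyTorch sketch (the shapes, vocabulary size, and PAD id are illustrative, not taken from this model's config):

```python
import torch
import torch.nn as nn

PAD = 0  # assumed padding id
criterion = nn.CrossEntropyLoss(ignore_index=PAD)  # PAD positions contribute no loss

# Dummy decoder logits [batch, tgt_len, vocab] and gold target IDs [batch, tgt_len]
logits = torch.randn(2, 5, 100)
targets = torch.randint(1, 100, (2, 5))  # teacher-forced references

# Flatten to [batch * tgt_len, vocab] vs. [batch * tgt_len] as CrossEntropyLoss expects
loss = criterion(logits.reshape(-1, 100), targets.reshape(-1))
```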

Ethical Considerations

  • Tày is a low-resource and minority language. When using this model:

    • Be mindful of how generated content represents the Tày community and culture.
    • Avoid using translations to misrepresent or manipulate speakers of Tày or Vietnamese.
    • Encourage human review and involvement, especially from native speakers.

Contact & Contributions

  • Issues / bugs: Please open an issue on the model's Hugging Face page or repository.

  • Contributions: Contributions (bug fixes, better examples, improved evaluation, or documentation for Tày) are very welcome. You can:

    • Submit pull requests with improved code or README sections
    • Share evaluation results or additional example translations
    • Help refine the tokenizer or add support for more variants