Vietnamese-Tày Neural Machine Translation
This is the first public release of a Vietnamese-Tày translation model. Expect imperfections and feel free to report issues or suggest improvements so the model can better serve the Tày and Vietnamese communities.
Model name: Vi-Tày Transformer
Description: First publicly released neural translation model for Vietnamese ↔ Tày
First release date: 12/11/2025
This repository provides a sequence-to-sequence Transformer model for bidirectional translation between Vietnamese (vi) and Tày (tyz), implemented in pure PyTorch with a SentencePiece tokenizer.
The model is intended for:
- Translating Vietnamese text into Tày
- Translating Tày text into Vietnamese
- Supporting research and experimentation on low-resource language translation
Model Overview
- Architecture: Encoder-Decoder Transformer (Seq2Seq)
- Framework: PyTorch
- Tokenizer: SentencePiece (subword tokenization)
- Tokenizer file: spm.model
- Model file: pytorch_model.bin
- Languages:
  - vi – Vietnamese
  - tyz – Tày (ISO 639-3: tyz)
- Decoding: Beam search
The code assumes the following key components are defined in model.py:
- ModelConfig
- Seq2SeqTransformer
- Constants: PAD, BOS, EOS, LANG2ID
- (Optionally) a generate method for sequence decoding (a fallback sketch follows below)
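If your copy of model.py does not define generate, a minimal greedy-decoding fallback could look like the sketch below. This is an illustration only: it assumes the model's forward pass accepts (src, tgt, src_lang_id, tgt_lang_id) and returns per-position logits of shape [batch, tgt_len, vocab], which may not match the actual implementation.

import torch

@torch.inference_mode()
def greedy_generate(model, src, src_lang_id, tgt_lang_id, bos_id, eos_id, max_len=128):
    # Hypothetical fallback decoder (assumed forward signature, see note above).
    tgt = torch.full((src.size(0), 1), bos_id, dtype=torch.long, device=src.device)
    for _ in range(max_len):
        logits = model(src, tgt, src_lang_id=src_lang_id, tgt_lang_id=tgt_lang_id)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy pick
        tgt = torch.cat([tgt, next_id], dim=1)
        if (next_id == eos_id).all():  # stop once every sequence emitted EOS
            break
    return tgt

Beam search, the decoding method this model actually uses, explores several candidate sequences in parallel and generally yields better translations than this greedy baseline.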
Files in This Repository
Typical files you will see:
- pytorch_model.bin - Model weights (PyTorch state_dict)
- config.json - Model configuration (hyperparameters, vocabulary size, etc.)
- spm.model - SentencePiece tokenizer model
- model.py - Model architecture and utilities
- README.md - This file
How to Load the Model
import json
import torch
import sentencepiece as spm
from model import ModelConfig, Seq2SeqTransformer, PAD, BOS, EOS, LANG2ID
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Load SentencePiece tokenizer
sp = spm.SentencePieceProcessor(model_file="spm.model")
# Load config
with open("config.json", "r", encoding="utf-8") as f:
cfg_dict = json.load(f)
cfg = ModelConfig(cfg_dict) # or ModelConfig(**cfg_dict) depending on implementation
# Create model & load weights
model = Seq2SeqTransformer(cfg)
state_dict = torch.load("pytorch_model.bin", map_location=device)
model.load_state_dict(state_dict)
model.to(device)
model.eval()
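An optional sanity check after loading (this only inspects the model; the exact parameter count depends on the released configuration):

n_params = sum(p.numel() for p in model.parameters())
print(f"Loaded {n_params:,} parameters on {device}")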
Helper: Decoding Token IDs
If you don’t already have a decoding helper, you can use:
def decode_ids(token_ids):
    """Decode a sequence of token IDs to text, removing special tokens."""
    clean_ids = [tid for tid in token_ids if tid not in (PAD, BOS, EOS)]
    return sp.decode(clean_ids)
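For example, a quick round trip through the tokenizer (the BOS/EOS IDs added here are stripped by the helper):

ids = [BOS] + sp.encode("xin chào", out_type=int) + [EOS]
print(decode_ids(ids))  # xin chào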
Translation Usage
Vietnamese → Tày
@torch.inference_mode()
def translate_vi_to_tay(model, text_vi: str, max_len: int = 128, beam_size: int = 4):
    model.eval()
    # Encode Vietnamese sentence
    src_ids = sp.encode(text_vi, out_type=int)
    src_tensor = torch.tensor(src_ids, dtype=torch.long, device=device).unsqueeze(0)  # [1, src_len]
    # Generate Tày translation
    out = model.generate(
        src_tensor,
        src_lang_id=LANG2ID["vi"],
        tgt_lang_id=LANG2ID["tyz"],
        bos_id=BOS,
        eos_id=EOS,
        beam_size=beam_size,
        max_len=max_len,
    )
    # Assuming model.generate returns shape [1, T]
    output_ids = out[0].tolist()
    return decode_ids(output_ids)
example_vi = "À ơi em ngủ, ngủ say đi, đợi tý mẹ về."
print(translate_vi_to_tay(model, example_vi))
Example output:
Ừ hở noọng nòn, nòn đắc nòn pây, đợi tý me mà.
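Tày → Vietnamese
The reverse direction uses the same interface with the language IDs swapped. A minimal sketch, assuming model.generate behaves as above:

@torch.inference_mode()
def translate_tay_to_vi(model, text_tyz: str, max_len: int = 128, beam_size: int = 4):
    model.eval()
    # Encode Tày sentence
    src_ids = sp.encode(text_tyz, out_type=int)
    src_tensor = torch.tensor(src_ids, dtype=torch.long, device=device).unsqueeze(0)  # [1, src_len]
    # Generate Vietnamese translation (language IDs swapped relative to translate_vi_to_tay)
    out = model.generate(
        src_tensor,
        src_lang_id=LANG2ID["tyz"],
        tgt_lang_id=LANG2ID["vi"],
        bos_id=BOS,
        eos_id=EOS,
        beam_size=beam_size,
        max_len=max_len,
    )
    output_ids = out[0].tolist()
    return decode_ids(output_ids)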
Expected Inputs & Outputs
- Input type: Raw Unicode strings (UTF-8), either Vietnamese or Tày
- Preprocessing: SentencePiece subword encoding via spm.model
- Output type: UTF-8 strings in the target language
- Length limits:
  - max_len is typically set around 128 tokens (can be adjusted)
  - Very long sentences may be truncated or produce degraded translations (see the sketch after this list)
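One simple way to stay within the length limit is to split long text into sentences and translate each piece. A hypothetical helper (the splitting regex and the reuse of translate_vi_to_tay are assumptions, not part of the released code):

import re

def translate_long_vi_text(model, text_vi: str, max_len: int = 128):
    # Split on sentence-final punctuation so each chunk stays short.
    sentences = re.split(r"(?<=[.!?])\s+", text_vi.strip())
    return " ".join(
        translate_vi_to_tay(model, s, max_len=max_len) for s in sentences if s
    )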
Training Data & Method
Note: This is a high-level description only. Exact dataset details and statistics can be added here if/when they are released.
- Data: Parallel Vietnamese-Tày sentence pairs collected from various sources (e.g., community texts, educational content, manually constructed pairs).
- Preprocessing:
- Normalization and cleaning of text
- SentencePiece vocabulary training on combined corpora
- Objective: Standard sequence-to-sequence cross-entropy loss with teacher forcing (illustrated in the sketch after this list)
- Optimization: Transformer training with AdamW
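As a concrete illustration of this objective, a single teacher-forced training step might look like the sketch below. This is not the released training code; the forward signature, batch layout, and learning rate are assumptions.

import torch
import torch.nn.functional as F

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)  # illustrative lr

def train_step(model, src, tgt, src_lang_id, tgt_lang_id):
    # Teacher forcing: the decoder sees tgt[:, :-1] and is trained to
    # predict tgt[:, 1:]; PAD positions are excluded from the loss.
    model.train()
    logits = model(src, tgt[:, :-1], src_lang_id=src_lang_id, tgt_lang_id=tgt_lang_id)
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        tgt[:, 1:].reshape(-1),
        ignore_index=PAD,
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()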
Ethical Considerations
Tày is a low-resource and minority language. When using this model:
- Be mindful of how generated content represents the Tày community and culture.
- Avoid using translations to misrepresent or manipulate speakers of Tày or Vietnamese.
- Encourage human review and involvement, especially from native speakers.
Contact & Contributions
Issues / bugs: Please open an issue on the model's Hugging Face page or repository.
Contributions: Bug fixes, better examples, improved evaluation, or documentation for Tày are very welcome. You can:
- Submit pull requests with improved code or README sections
- Share evaluation results or additional example translations
- Help refine the tokenizer or add support for more variants