Moxhi MT 30 zh-vi

Fast Chinese-to-Vietnamese Marian-style machine translation model, trained for web-novel / xianxia content.

Intended Use

Chinese -> Vietnamese web novel / fiction translation (xianxia, modern, cross-domain).
Fast local or server inference where a small model is preferred.
Strong general / cross-domain coverage (sci-fi, historical, modern, mystery) while keeping xianxia / wuxia / classical register sharp.
Experimental release; review output for high-stakes or publication use.

Model Details

Architecture: Marian seq2seq (asymmetric: 8 encoder layers + 2 decoder layers)
Parameters: ~37M
Tokenizer: SentencePiece, joint source/target vocabulary, 24k
Suggested decoding: num_beams=4, max_length=512

Versions

| Tag | Notes |
| --- | --- |
| `v4.0.1` (current `main`) | v4.0 + tokenizer hotfix: 231 orphaned fused name-pieces marked UNUSED in `source.spm`, which fixes rare proper-noun hallucination (e.g. 柳三 → "Liễu Tam"). Weights unchanged. |
| `v4.0` | Pin with `revision="v4.0"`. |
| `v3.0` | Pin with `revision="v3.0"`. |
| `v2.2` | Pin with `revision="v2.2"`. |

Pin a specific version:

from transformers import AutoModelForSeq2SeqLM
model = AutoModelForSeq2SeqLM.from_pretrained("DanVP/MoxhiMT-30", revision="v3.0")
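
Since the v4.0.1 hotfix touched only the tokenizer files, it is probably safest to pin the tokenizer and the model to the same tag so that vocabulary and weights stay in step. The following is a minimal sketch, assuming the tags listed above resolve as Hub revisions; `load_pinned` is a hypothetical helper name and not part of this repository.

```python
def load_pinned(revision, model_id="DanVP/MoxhiMT-30"):
    """Load tokenizer and model from the same Hub revision (hypothetical helper)."""
    # Imported lazily so that defining the helper does not itself require transformers.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    # Passing the same revision to both calls keeps the tokenizer files
    # and the weights on one snapshot of the repository.
    tok = AutoTokenizer.from_pretrained(model_id, revision=revision)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id, revision=revision)
    return tok, model
```

For example, `tok, model = load_pinned("v3.0")` would fetch both pieces from the `v3.0` tag.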

Quick Start

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "DanVP/MoxhiMT-30"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

text = "他抬头望向远处的山门。"  # Chinese source sentence; the model translates zh -> vi
inputs = tok(text, return_tensors="pt", truncation=True, max_length=512)
out = model.generate(**inputs, max_length=512, num_beams=4)
print(tok.decode(out[0], skip_special_tokens=True))
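
Because the call above truncates inputs at `max_length=512` tokens, a long chapter would likely need to be split before translation. The sketch below is one possible approach, not the author's pipeline: it splits on Chinese sentence-final punctuation and packs sentences into chunks under a character budget. The names `split_sentences` and `pack_chunks` and the `max_chars` value are assumptions for illustration; a character budget only approximates the token limit.

```python
import re

# A sentence: a run of non-terminators, its terminators, then any closing quotes.
_SENTENCE = re.compile(r"[^。！？!?\n]+[。！？!?]*[”’」』]*")


def split_sentences(text):
    """Split Chinese text into sentences, keeping punctuation attached."""
    return [m.group().strip() for m in _SENTENCE.finditer(text) if m.group().strip()]


def pack_chunks(sentences, max_chars=200):
    """Greedily join sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept whole as its own chunk.
    """
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current += sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be passed through `tok` and `model.generate` exactly as in the Quick Start, and the outputs joined in order.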

Fast CPU Runtime

A CTranslate2 INT8 export is in ct2-int8_float32/ for ~3-5x faster CPU inference.

import ctranslate2
from pathlib import Path
from huggingface_hub import snapshot_download
from transformers import AutoTokenizer

model_id = "DanVP/MoxhiMT-30"
model_path = Path(snapshot_download(model_id, allow_patterns=[
    "config.json", "source.spm", "target.spm", "vocab.json",
    "tokenizer_config.json", "ct2-int8_float32/*",
]))
tokenizer = AutoTokenizer.from_pretrained(model_path)
translator = ctranslate2.Translator(
    str(model_path / "ct2-int8_float32"),
    device="cpu", compute_type="int8_float32",
)
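
The snippet above stops once the translator is loaded. A minimal sketch of the translation step follows, assuming the usual CTranslate2 pattern for Marian models (token strings in, token strings out) and reusing the suggested decoding settings. `translate_ct2` is a hypothetical helper name; `translator` and `tokenizer` are the objects created above.

```python
def translate_ct2(translator, tokenizer, texts, beam_size=4, max_length=512):
    """Translate a list of Chinese strings with a CTranslate2 translator."""
    # CTranslate2 consumes token strings rather than ids, so encode with the
    # Hugging Face tokenizer (which appends the end-of-sentence token) and
    # map the ids back to tokens.
    batch = [
        tokenizer.convert_ids_to_tokens(
            tokenizer.encode(text, truncation=True, max_length=max_length)
        )
        for text in texts
    ]
    results = translator.translate_batch(
        batch, beam_size=beam_size, max_decoding_length=max_length
    )
    outputs = []
    for result in results:
        tokens = result.hypotheses[0]  # best-scoring hypothesis
        ids = tokenizer.convert_tokens_to_ids(tokens)
        outputs.append(tokenizer.decode(ids, skip_special_tokens=True))
    return outputs
```

A call such as `translate_ct2(translator, tokenizer, ["他抬头望向远处的山门。"])` would return a one-element list with the Vietnamese output.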

Training Data

Trained from scratch on a curated Chinese-Vietnamese parallel corpus covering xianxia, modern fiction, historical, sci-fi, and cross-domain web-novel content. The corpus includes a research-grounded layer for idioms and classical-Chinese grammar. Training ends with a light preference-tuning (DPO) pass for xianxia/idiom sharpness.

Notes

Prioritizes speed and small footprint.
Known hard cases include rare proper nouns and highly domain-specific, out-of-distribution (OOD) terminology.
For production usage, pair with reviewed glossary/guard layers where appropriate.
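
As one illustration of what a glossary guard might look like, the sketch below flags glossary entries whose Chinese term appears in the source but whose reviewed Vietnamese rendering is missing from the output. This is an assumption about how such a layer could work, not something shipped with the model; `missing_glossary_terms` and the glossary format (a plain dict) are hypothetical.

```python
import unicodedata


def _norm(s):
    # Vietnamese diacritics may arrive composed or decomposed, so compare
    # in NFC form and case-insensitively.
    return unicodedata.normalize("NFC", s).casefold()


def missing_glossary_terms(source, translation, glossary):
    """Return (zh, vi) pairs whose zh term is in source but vi is not in translation."""
    target = _norm(translation)
    return [
        (zh, vi)
        for zh, vi in glossary.items()
        if zh in source and _norm(vi) not in target
    ]
```

With a glossary such as `{"柳三": "Liễu Tam"}`, a non-empty result marks a segment for human review or retranslation; what to do with flagged segments is left to the surrounding pipeline.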

License

CC-BY-NC-4.0 (research / non-commercial use).

Safetensors: 36.5M params, tensor type F32
