---
language:
- zh
- en
tags:
- translation
license: cc-by-4.0
datasets:
- quickmt/quickmt-train.zh-en
model-index:
- name: quickmt-zh-en
  results:
  - task:
      name: Translation zho-eng
      type: translation
      args: zho-eng
    dataset:
      name: flores101-devtest
      type: flores_101
      args: zho_Hans eng_Latn devtest
    metrics:
    - name: BLEU
      type: bleu
      value: 28.58
    - name: CHRF
      type: chrf
      value: 57.46
---

# `quickmt-zh-en` Neural Machine Translation Model

# Usage

## Install `quickmt`

```bash
git clone https://github.com/quickmt/quickmt.git
pip install ./quickmt/
```

## Download model

```bash
quickmt-model-download quickmt/quickmt-zh-en ./quickmt-zh-en
```

## Use model

Inference with `quickmt`:

```python
from quickmt import Translator

# Auto-detects GPU, set to "cpu" to force CPU inference
t = Translator("./quickmt-zh-en/", device="auto")

# Translate - set beam size to 5 for higher quality (but slower speed)
t(["他补充道:“我们现在有 4 个月大没有糖尿病的老鼠,但它们曾经得过该病。”"], beam_size=1)

# Get alternative translations by sampling
# You can pass any CTranslate2 `translate_batch` arguments
t(["他补充道:“我们现在有 4 个月大没有糖尿病的老鼠,但它们曾经得过该病。”"], sampling_temperature=1.2, beam_size=1, sampling_topk=50, sampling_topp=0.9)
```

The model is in `ctranslate2` format, and the tokenizers are `sentencepiece`, so you can use the model files directly if you want. It would be fairly easy to get them to work with e.g. [LibreTranslate](https://libretranslate.com/) which also uses `ctranslate2` and `sentencepiece`.

# Model Information

* Trained using [`eole`](https://github.com/eole-nlp/eole)
  - It took about 1 day on a single RTX 4090 on [vast.ai](https://cloud.vast.ai)
* Exported for fast inference to [CTranslate2](https://github.com/OpenNMT/CTranslate2) format
* Training data: https://huggingface.co/datasets/quickmt/quickmt-train.zh-en/tree/main

## Metrics

BLEU and CHRF2 calculated with [sacrebleu](https://github.com/mjpost/sacrebleu) on the Flores200 `devtest` test set ("zho_Hans"->"eng_Latn").
"Time" is the time to translate the following input with a single CPU core: > 2019冠状病毒病(英語:Coronavirus disease 2019,缩写:COVID-19[17][18]),是一種由嚴重急性呼吸系統綜合症冠狀病毒2型(縮寫:SARS-CoV-2)引發的傳染病,导致了一场持续的疫情,成为人類歷史上致死人數最多的流行病之一。 | Model | bleu | chrf2 | Time (s) | | -------------------------------- | ----- | ----- | ---- | | quickmt/quickmt-zh-en | 28.58 | 57.46 | 0.670 | | Helsinki-NLP/opus-mt-zh-en | 23.35 | 53.60 | 0.838 | | facebook/m2m100_418M | 18.96 | 50.06 | 11.5 | | facebook/nllb-200-distilled-600M | 26.22 | 55.17 | 13.2 | | facebook/nllb-200-distilled-1.3B | 28.54 | 57.34 | 23.6 | | facebook/m2m100_1.2B | 24.68 | 54.68 | 25.7 | | google/madlad400-3b-mt | 28.74 | 58.01 | ??? | `quickmt-zh-en` is the fastest and delivers fairly high quality. Helsinki-NLP/opus-mt-zh-en is one of the most downloaded machine translation models on HuggingFace, and this model is considerably more accurate *and* a bit faster. ## Training Configuration ```yaml ### Vocab src_vocab_size: 20000 tgt_vocab_size: 20000 share_vocab: False data: corpus_1: path_src: hf://quickmt/quickmt-train-zh-en/zh path_tgt: hf://quickmt/quickmt-train-zh-en/en path_sco: hf://quickmt/quickmt-train-zh-en/sco valid: path_src: zh-en/dev.zho path_tgt: zh-en/dev.eng transforms: [sentencepiece, filtertoolong] transforms_configs: sentencepiece: src_subword_model: "zh-en/src.spm.model" tgt_subword_model: "zh-en/tgt.spm.model" filtertoolong: src_seq_length: 512 tgt_seq_length: 512 training: # Run configuration model_path: quickmt-zh-en keep_checkpoint: 4 save_checkpoint_steps: 1000 train_steps: 104000 valid_steps: 1000 # Train on a single GPU world_size: 1 gpu_ranks: [0] # Batching batch_type: "tokens" batch_size: 13312 valid_batch_size: 13312 batch_size_multiple: 8 accum_count: [4] accum_steps: [0] # Optimizer & Compute compute_dtype: "bfloat16" optim: "pagedadamw8bit" learning_rate: 1.0 warmup_steps: 10000 decay_method: "noam" adam_beta2: 0.998 # Data loading bucket_size: 262144 num_workers: 4 prefetch_factor: 100 # Hyperparams 
dropout_steps: [0] dropout: [0.1] attention_dropout: [0.1] max_grad_norm: 0 label_smoothing: 0.1 average_decay: 0.0001 param_init_method: xavier_uniform normalization: "tokens" model: architecture: "transformer" layer_norm: standard share_embeddings: false share_decoder_embeddings: true add_ffnbias: true mlp_activation_fn: gated-silu add_estimator: false add_qkvbias: false norm_eps: 1e-6 hidden_size: 1024 encoder: layers: 8 decoder: layers: 2 heads: 16 transformer_ff: 4096 embeddings: word_vec_size: 1024 position_encoding_type: "SinusoidalInterleaved" ```
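
The optimizer settings above can be hard to read at a glance, since `learning_rate: 1.0` is only a scale factor under `decay_method: "noam"`. The sketch below works out the implied learning rate and the token budget per optimizer update. It assumes `eole` applies the standard Transformer "noam" formula with `hidden_size` as the model dimension, and that `batch_size` is an upper bound on tokens per batch; neither is confirmed by this card, and the function names are illustrative.

```python
# Values copied from the training configuration above
HIDDEN_SIZE = 1024
LEARNING_RATE = 1.0
WARMUP_STEPS = 10_000
TRAIN_STEPS = 104_000
BATCH_SIZE_TOKENS = 13_312
ACCUM_COUNT = 4
WORLD_SIZE = 1


def noam_lr(step, base_lr=LEARNING_RATE, warmup=WARMUP_STEPS, model_dim=HIDDEN_SIZE):
    """Standard noam schedule: linear warmup, then inverse square-root decay.

    Assumption: this matches what eole does for decay_method "noam".
    """
    step = max(step, 1)  # avoid 0 ** -0.5 at the very first step
    return base_lr * model_dim ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)


def tokens_per_update(batch_size=BATCH_SIZE_TOKENS, accum=ACCUM_COUNT, gpus=WORLD_SIZE):
    """Upper bound on tokens seen per optimizer update with gradient accumulation."""
    return batch_size * accum * gpus


if __name__ == "__main__":
    for step in (1_000, WARMUP_STEPS, TRAIN_STEPS):
        print(f"step {step:>7}: lr = {noam_lr(step):.3e}")
    print(f"tokens per update (max): {tokens_per_update():,}")
```

Under these assumptions the learning rate peaks at about 3.125e-4 at the end of warmup (step 10,000) and decays from there, and each update covers at most 13,312 × 4 = 53,248 tokens.
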