radinplaid committed · verified
Commit fb98c61 · 1 Parent(s): d8d6087

Upload folder using huggingface_hub
.ipynb_checkpoints/README-checkpoint.md CHANGED
@@ -1,21 +1,83 @@
  # `quickmt-zh-en` Neural Machine Translation Model

- # Usage

- ## Install `quickmt`

- ```bash
- git clone https://github.com/quickmt/quickmt.git
- pip install ./quickmt/
- ```

- ## Download model

  ```bash
  quickmt-model-download quickmt/quickmt-zh-en ./quickmt-zh-en
  ```

- ## Use model

  ```python
  from quickmt import Translator
@@ -23,135 +85,36 @@ from quickmt import Translator
  # Auto-detects GPU, set to "cpu" to force CPU inference
  t = Translator("./quickmt-zh-en/", device="auto")

- # Translate - set beam size to 5 for higher quality (but slower speed)
- t(["他补充道:“我们现在有 4 个月大没有糖尿病的老鼠,但它们曾经得过该病。”"], beam_size=1)

  # Get alternative translations by sampling
  # You can pass any cTranslate2 `translate_batch` arguments
- t(["他补充道:“我们现在有 4 个月大没有糖尿病的老鼠,但它们曾经得过该病。”"], sampling_temperature=1.2, beam_size=1, sampling_topk=50, sampling_topp=0.9)
  ```

- # Model Information

- * Trained using [`eole`](https://github.com/eole-nlp/eole)
- * Exported for fast inference to []CTranslate2](https://github.com/OpenNMT/CTranslate2) format
- * Training data: https://huggingface.co/datasets/quickmt/quickmt-train.zh-en/tree/main

  ## Metrics

- BLEU and CHRF2 calculated with [sacrebleu](https://github.com/mjpost/sacrebleu) on the Flores200 `devtest` test set ("zho_Hans"->"eng_Latn").
-
- | Model | bleu | chrf2 |
- | ---- | ---- | ---- |
- | quickmt/quickmt-zh-en | 28.58 | 57.46 |
- | Helsinki-NLP/opus-mt-zh-en | 23.35 | 53.60 |
- | facebook/m2m100_418M | 18.96 | 50.06 |
- | facebook/m2m100_1.2B | 24.68 | 54.68 |
- | facebook/nllb-200-distilled-600M | 26.22 | 55.17 |
- | facebook/nllb-200-distilled-1.3B | 28.54 | 57.34 |
- | google/madlad400-3b-mt | 28.74 | 58.01 |
-
- ## Training Configuration
-
- ```yaml
- ## IO
- save_data: zh_en/data_spm
- overwrite: True
- seed: 1234
- report_every: 100
- valid_metrics: ["BLEU"]
- tensorboard: true
- tensorboard_log_dir: tensorboard
-
- ### Vocab
- src_vocab: zh-en/src.eole.vocab
- tgt_vocab: zh-en/tgt.eole.vocab
- src_vocab_size: 20000
- tgt_vocab_size: 20000
- vocab_size_multiple: 8
- share_vocab: False
- n_sample: 0
-
- data:
-   corpus_1:
-     path_src: hf://quickmt/quickmt-train-zh-en/zh
-     path_tgt: hf://quickmt/quickmt-train-zh-en/en
-     path_sco: hf://quickmt/quickmt-train-zh-en/sco
-   valid:
-     path_src: zh-en/dev.zho
-     path_tgt: zh-en/dev.eng
-
- transforms: [sentencepiece, filtertoolong]
- transforms_configs:
-   sentencepiece:
-     src_subword_model: "zh-en/src.spm.model"
-     tgt_subword_model: "zh-en/tgt.spm.model"
-   filtertoolong:
-     src_seq_length: 512
-     tgt_seq_length: 512
-
- training:
-   # Run configuration
-   model_path: quickmt-zh-en
-   keep_checkpoint: 4
-   save_checkpoint_steps: 1000
-   train_steps: 200000
-   valid_steps: 1000
-
-   # Train on a single GPU
-   world_size: 1
-   gpu_ranks: [0]
-
-   # Batching
-   batch_type: "tokens"
-   batch_size: 13312
-   valid_batch_size: 13312
-   batch_size_multiple: 8
-   accum_count: [4]
-   accum_steps: [0]
-
-   # Optimizer & Compute
-   compute_dtype: "bfloat16"
-   optim: "pagedadamw8bit"
-   learning_rate: 1.0
-   warmup_steps: 10000
-   decay_method: "noam"
-   adam_beta2: 0.998
-
-   # Data loading
-   bucket_size: 262144
-   num_workers: 4
-   prefetch_factor: 100
-
-   # Hyperparams
-   dropout_steps: [0]
-   dropout: [0.1]
-   attention_dropout: [0.1]
-   max_grad_norm: 0
-   label_smoothing: 0.1
-   average_decay: 0.0001
-   param_init_method: xavier_uniform
-   normalization: "tokens"
-
- model:
-   architecture: "transformer"
-   layer_norm: standard
-   share_embeddings: false
-   share_decoder_embeddings: true
-   add_ffnbias: true
-   mlp_activation_fn: gated-silu
-   add_estimator: false
-   add_qkvbias: false
-   norm_eps: 1e-6
-   hidden_size: 1024
-   encoder:
-     layers: 8
-   decoder:
-     layers: 2
-   heads: 16
-   transformer_ff: 4096
-   embeddings:
-     word_vec_size: 1024
-     position_encoding_type: "SinusoidalInterleaved"
- ```
+ ---
+ language:
+ - en
+ - zh
+ tags:
+ - translation
+ license: cc-by-4.0
+ datasets:
+ - quickmt/quickmt-train.zh-en
+ - quickmt/madlad400-en-backtranslated-zh
+ - quickmt/newscrawl2024-en-backtranslated-zh
+ model-index:
+ - name: quickmt-zh-en
+   results:
+   - task:
+       name: Translation zho-eng
+       type: translation
+       args: zho-eng
+     dataset:
+       name: flores101-devtest
+       type: flores_101
+       args: zho_Hans eng_Latn devtest
+     metrics:
+     - name: BLEU
+       type: bleu
+       value: 29.9
+     - name: CHRF
+       type: chrf
+       value: 58.42
+     - name: COMET
+       type: comet
+       value: 86.59
+ ---
  # `quickmt-zh-en` Neural Machine Translation Model

+ `quickmt-zh-en` is a reasonably fast and reasonably accurate neural machine translation model for translation from `zh` into `en`.

+ `quickmt` models are roughly 3 times faster for GPU inference than OpusMT models and roughly [40 times](https://huggingface.co/spaces/quickmt/quickmt-vs-libretranslate) faster than [LibreTranslate](https://huggingface.co/spaces/quickmt/quickmt-vs-libretranslate)/[ArgosTranslate](https://github.com/argosopentech/argos-translate).

+ ## *UPDATED VERSION!*
+
+ This model was trained with back-translated data and has improved translation quality!
+
+ * https://huggingface.co/datasets/quickmt/madlad400-en-backtranslated-zh
+ * https://huggingface.co/datasets/quickmt/newscrawl2024-en-backtranslated-zh
+
+ ## Try it on our Huggingface Space
+
+ Give it a try before downloading here: https://huggingface.co/spaces/quickmt/QuickMT-Demo
+
+ ## Model Information
+
+ * Trained using [`eole`](https://github.com/eole-nlp/eole)
+ * 200M parameter transformer 'big' with 8 encoder layers and 2 decoder layers
+ * Separate 32k SentencePiece vocabularies for source and target
+ * Exported for fast inference to [CTranslate2](https://github.com/OpenNMT/CTranslate2) format
+ * The PyTorch model (for use with [`eole`](https://github.com/eole-nlp/eole)) is available in this repository in the `eole-model` folder
+
+ See the `eole` model configuration in this repository for further details, and the `eole-model` folder for the raw `eole` (PyTorch) model.
+
+ ## Usage with `quickmt`
+
+ If you want to do GPU inference, you must install the Nvidia CUDA toolkit first.
+
+ Next, install the `quickmt` python library and download the model:

  ```bash
+ git clone https://github.com/quickmt/quickmt.git
+ pip install -e ./quickmt/
+
  quickmt-model-download quickmt/quickmt-zh-en ./quickmt-zh-en
  ```

+ Finally, use the model in Python:

  ```python
  from quickmt import Translator

  # Auto-detects GPU, set to "cpu" to force CPU inference
  t = Translator("./quickmt-zh-en/", device="auto")

+ # Translate - set beam_size=1 for faster speed (but lower quality)
+ sample_text = '埃胡德·乌尔博士(新斯科舍省哈利法克斯市达尔豪西大学医学教授,加拿大糖尿病协会临床与科学部门教授)提醒,这项研究仍处在早期阶段。'
+
+ t(sample_text, beam_size=5)
+ ```
+
+ > 'Dr. Ehud Ur (Professor of Medicine, Dalhousie University, Halifax, Nova Scotia, and Professor of Clinical and Scientific Division, Canadian Diabetes Association) cautions that the study is still at an early stage.'

+ ```python
  # Get alternative translations by sampling
  # You can pass any cTranslate2 `translate_batch` arguments
+ t([sample_text], sampling_temperature=1.2, beam_size=1, sampling_topk=50, sampling_topp=0.9)
  ```

+ > 'Dr Elhoud (Professor of Medicine at Dalhousie University, Halifax, Nova Scotia, and professor of clinical and scientific Division of the Canadian Diabetes Association) cautions that the study is still at an early stage.'
+
+ The model is in `ctranslate2` format, and the tokenizers are `sentencepiece`, so you can use `ctranslate2` directly instead of through `quickmt`. It is also possible to get this model to work with e.g. [LibreTranslate](https://libretranslate.com/), which also uses `ctranslate2` and `sentencepiece`. A model in safetensors format for use with `eole` is also provided.

  ## Metrics

+ `bleu` and `chrf2` are calculated with [sacrebleu](https://github.com/mjpost/sacrebleu) on the [Flores200 `devtest` test set](https://huggingface.co/datasets/facebook/flores) ("zho_Hans"->"eng_Latn"). `comet22` is calculated with the [`comet`](https://github.com/Unbabel/COMET) library and the [default model](https://huggingface.co/Unbabel/wmt22-comet-da). "Time (s)" is the time in seconds to translate the flores-devtest dataset (1012 sentences) on an RTX 4070s GPU with batch size 32.
+
+ | Model                            |   bleu |   chrf2 |   comet22 |   Time (s) |
+ |:---------------------------------|-------:|--------:|----------:|-----------:|
+ | quickmt/quickmt-zh-en            |  29.9  |   58.42 |     86.59 |       1.22 |
+ | Helsinki-NLP/opus-mt-zh-en       |  22.99 |   53.98 |     84.6  |       3.73 |
+ | facebook/nllb-200-distilled-600M |  26.02 |   55.27 |     85.1  |      21.69 |
+ | facebook/nllb-200-distilled-1.3B |  28.61 |   57.43 |     86.22 |      37.55 |
+ | facebook/m2m100_418M             |  19.55 |   50.83 |     82.04 |      18.2  |
+ | facebook/m2m100_1.2B             |  24.9  |   54.89 |     85.1  |      35.49 |
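For reference, a minimal sketch of how scores like those in the table can be computed; this is not the exact evaluation script behind this commit, and `srcs`, `hyps`, `refs` are assumed to be parallel lists of source sentences, model outputs, and references:

```python
import sacrebleu
from comet import download_model, load_from_checkpoint

# "bleu" and "chrf2" columns (chrF2 is sacrebleu's default chrF configuration)
print(sacrebleu.corpus_bleu(hyps, [refs]).score)
print(sacrebleu.corpus_chrf(hyps, [refs]).score)

# "comet22" column: Unbabel/wmt22-comet-da, reported scaled by 100 in the table
comet = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
data = [{"src": s, "mt": h, "ref": r} for s, h, r in zip(srcs, hyps, refs)]
print(100 * comet.predict(data, batch_size=32, gpus=1).system_score)
```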
.ipynb_checkpoints/eole-config-checkpoint.yaml ADDED
@@ -0,0 +1,106 @@
+ ## IO
+ save_data: data
+ overwrite: True
+ seed: 1234
+ report_every: 100
+ valid_metrics: ["BLEU"]
+ tensorboard: true
+ tensorboard_log_dir: tensorboard
+
+ ### Vocab
+ src_vocab: zh.eole.vocab
+ tgt_vocab: en.eole.vocab
+ src_vocab_size: 32000
+ tgt_vocab_size: 32000
+ vocab_size_multiple: 8
+ share_vocab: false
+ n_sample: 0
+
+ data:
+   corpus_1:
+     path_src: hf://quickmt/quickmt-train.zh-en/zh
+     path_tgt: hf://quickmt/quickmt-train.zh-en/en
+     path_sco: hf://quickmt/quickmt-train.zh-en/sco
+     weight: 2
+   corpus_2:
+     path_src: hf://quickmt/newscrawl2024-en-backtranslated-zh/zh
+     path_tgt: hf://quickmt/newscrawl2024-en-backtranslated-zh/en
+     path_sco: hf://quickmt/newscrawl2024-en-backtranslated-zh/sco
+     weight: 1
+   corpus_3:
+     path_src: hf://quickmt/madlad400-en-backtranslated-zh/zh
+     path_tgt: hf://quickmt/madlad400-en-backtranslated-zh/en
+     path_sco: hf://quickmt/madlad400-en-backtranslated-zh/sco
+     weight: 2
+   valid:
+     path_src: valid.zh
+     path_tgt: valid.en
+
+ transforms: [sentencepiece, filtertoolong]
+ transforms_configs:
+   sentencepiece:
+     src_subword_model: "zh.spm.model"
+     tgt_subword_model: "en.spm.model"
+   filtertoolong:
+     src_seq_length: 256
+     tgt_seq_length: 256
+
+ training:
+   # Run configuration
+   model_path: quickmt-zh-en-eole-model
+   keep_checkpoint: 4
+   train_steps: 200000
+   save_checkpoint_steps: 5000
+   valid_steps: 5000
+
+   # Train on a single GPU
+   world_size: 1
+   gpu_ranks: [0]
+
+   # Batching
+   batch_type: "tokens"
+   batch_size: 6000
+   valid_batch_size: 2048
+   batch_size_multiple: 8
+   accum_count: [20]
+   accum_steps: [0]
+
+   # Optimizer & Compute
+   compute_dtype: "fp16"
+   optim: "adamw"
+   #use_amp: False
+   learning_rate: 3.0
+   warmup_steps: 5000
+   decay_method: "noam"
+   adam_beta2: 0.998
+
+   # Data loading
+   bucket_size: 128000
+   num_workers: 4
+   prefetch_factor: 32
+
+   # Hyperparams
+   dropout_steps: [0]
+   dropout: [0.1]
+   attention_dropout: [0.1]
+   max_grad_norm: 0
+   label_smoothing: 0.1
+   average_decay: 0.0001
+   param_init_method: xavier_uniform
+   normalization: "tokens"
+
+ model:
+   architecture: "transformer"
+   share_embeddings: false
+   share_decoder_embeddings: true
+   hidden_size: 1024
+   encoder:
+     layers: 8
+   decoder:
+     layers: 2
+   heads: 8
+   transformer_ff: 4096
+   embeddings:
+     word_vec_size: 1024
+     position_encoding_type: "SinusoidalInterleaved"
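For context, the batching settings above imply an effective batch of batch_size × accum_count × world_size tokens per optimizer update; a quick check using only values from this config:

```python
# Effective tokens per optimizer update implied by the config above
batch_size = 6000    # tokens per micro-batch (batch_type: "tokens")
accum_count = 20     # gradient-accumulation steps
world_size = 1       # number of GPUs

print(batch_size * accum_count * world_size)  # 120000 tokens per update
```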
README.md CHANGED
@@ -1,12 +1,14 @@
  ---
  language:
- - zh
  - en
  tags:
  - translation
  license: cc-by-4.0
  datasets:
  - quickmt/quickmt-train.zh-en
  model-index:
  - name: quickmt-zh-en
    results:
@@ -21,10 +23,13 @@ model-index:
      metrics:
      - name: BLEU
        type: bleu
-       value: 29.36
      - name: CHRF
        type: chrf
-       value: 58.10
  ---

@@ -32,57 +37,84 @@ model-index:

  `quickmt-zh-en` is a reasonably fast and reasonably accurate neural machine translation model for translation from `zh` into `en`.

  ## Model Information

- * Trained using [`eole`](https://github.com/eole-nlp/eole)
  * 200M parameter transformer 'big' with 8 encoder layers and 2 decoder layers
- * Separate source and target Sentencepiece tokenizers
  * Exported for fast inference to [CTranslate2](https://github.com/OpenNMT/CTranslate2) format
- * Training data: https://huggingface.co/datasets/quickmt/quickmt-train.zh-en/tree/main

- See the `eole` model configuration in this repository for further details.

  ## Usage with `quickmt`

- First, install `quickmt` and download the model

  ```bash
  git clone https://github.com/quickmt/quickmt.git
- pip install ./quickmt/

  quickmt-model-download quickmt/quickmt-zh-en ./quickmt-zh-en
  ```

  ```python
  from quickmt import Translator

  # Auto-detects GPU, set to "cpu" to force CPU inference
  t = Translator("./quickmt-zh-en/", device="auto")

- # Translate - set beam size to 5 for higher quality (but slower speed)
- t(["他补充道:“我们现在有 4 个月大没有糖尿病的老鼠,但它们曾经得过该病。”"], beam_size=1)

  # Get alternative translations by sampling
  # You can pass any cTranslate2 `translate_batch` arguments
- t(["他补充道:“我们现在有 4 个月大没有糖尿病的老鼠,但它们曾经得过该病。”"], sampling_temperature=1.2, beam_size=1, sampling_topk=50, sampling_topp=0.9)
  ```

- The model is in `ctranslate2` format, and the tokenizers are `sentencepiece`, so you can use `ctranslate2` directly instead of through `quickmt`. It is also possible to get this model to work with e.g. [LibreTranslate](https://libretranslate.com/) which also uses `ctranslate2` and `sentencepiece`.

  ## Metrics

- BLEU and CHRF2 calculated with [sacrebleu](https://github.com/mjpost/sacrebleu) on the [Flores200 `devtest` test set](https://huggingface.co/datasets/facebook/flores) ("zho_Hans"->"eng_Latn"). COMET22 with the [`comet`](https://github.com/Unbabel/COMET) library and the [default model](https://huggingface.co/Unbabel/wmt22-comet-da). "Time (s)" is the time in seconds to translate (using `ctranslate2`) the flores-devtest dataset (1012 sentences) on an RTX 4070s GPU with batch size 32.

- | Model | bleu | chrf2 | comet22 | Time (s) |
- | -------------------------------- | ----- | ----- | ---- | ---- |
- | quickmt/quickmt-zh-en | 29.36 | 58.10 | 0.8655 | 0.88 |
- | Helsinki-NLP/opus-mt-zh-en | 23.35 | 53.60 | 0.8426 | 3.78 |
- | facebook/m2m100_418M | 15.99 | 50.13 | 0.7881 | 16.61 |
- | facebook/nllb-200-distilled-600M | 26.22 | 55.18 | 0.8507 | 20.89 |
- | facebook/m2m100_1.2B | 20.30 | 54.23 | 0.8206 | 33.12 |
- | facebook/nllb-200-distilled-1.3B | 28.56 | 57.35 | 0.8620 | 36.64 |

- `quickmt-zh-en` is the fastest *and* highest quality.
  ---
  language:
  - en
+ - zh
  tags:
  - translation
  license: cc-by-4.0
  datasets:
  - quickmt/quickmt-train.zh-en
+ - quickmt/madlad400-en-backtranslated-zh
+ - quickmt/newscrawl2024-en-backtranslated-zh
  model-index:
  - name: quickmt-zh-en
    results:

      metrics:
      - name: BLEU
        type: bleu
+       value: 29.9
      - name: CHRF
        type: chrf
+       value: 58.42
+     - name: COMET
+       type: comet
+       value: 86.59
  ---

  # `quickmt-zh-en` Neural Machine Translation Model

  `quickmt-zh-en` is a reasonably fast and reasonably accurate neural machine translation model for translation from `zh` into `en`.

+ `quickmt` models are roughly 3 times faster for GPU inference than OpusMT models and roughly [40 times](https://huggingface.co/spaces/quickmt/quickmt-vs-libretranslate) faster than [LibreTranslate](https://huggingface.co/spaces/quickmt/quickmt-vs-libretranslate)/[ArgosTranslate](https://github.com/argosopentech/argos-translate).
+
+ ## *UPDATED VERSION!*
+
+ This model was trained with back-translated data and has improved translation quality!
+
+ * https://huggingface.co/datasets/quickmt/madlad400-en-backtranslated-zh
+ * https://huggingface.co/datasets/quickmt/newscrawl2024-en-backtranslated-zh
+
+ ## Try it on our Huggingface Space
+
+ Give it a try before downloading here: https://huggingface.co/spaces/quickmt/QuickMT-Demo

  ## Model Information

+ * Trained using [`eole`](https://github.com/eole-nlp/eole)
  * 200M parameter transformer 'big' with 8 encoder layers and 2 decoder layers
+ * Separate 32k SentencePiece vocabularies for source and target
  * Exported for fast inference to [CTranslate2](https://github.com/OpenNMT/CTranslate2) format
+ * The PyTorch model (for use with [`eole`](https://github.com/eole-nlp/eole)) is available in this repository in the `eole-model` folder

+ See the `eole` model configuration in this repository for further details, and the `eole-model` folder for the raw `eole` (PyTorch) model.

  ## Usage with `quickmt`

+ If you want to do GPU inference, you must install the Nvidia CUDA toolkit first.
+
+ Next, install the `quickmt` python library and download the model:

  ```bash
  git clone https://github.com/quickmt/quickmt.git
+ pip install -e ./quickmt/

  quickmt-model-download quickmt/quickmt-zh-en ./quickmt-zh-en
  ```

+ Finally, use the model in Python:
+
  ```python
  from quickmt import Translator

  # Auto-detects GPU, set to "cpu" to force CPU inference
  t = Translator("./quickmt-zh-en/", device="auto")

+ # Translate - set beam_size=1 for faster speed (but lower quality)
+ sample_text = '埃胡德·乌尔博士(新斯科舍省哈利法克斯市达尔豪西大学医学教授,加拿大糖尿病协会临床与科学部门教授)提醒,这项研究仍处在早期阶段。'
+
+ t(sample_text, beam_size=5)
+ ```

+ > 'Dr. Ehud Ur (Professor of Medicine, Dalhousie University, Halifax, Nova Scotia, and Professor of Clinical and Scientific Division, Canadian Diabetes Association) cautions that the study is still at an early stage.'
+
+ ```python
  # Get alternative translations by sampling
  # You can pass any cTranslate2 `translate_batch` arguments
+ t([sample_text], sampling_temperature=1.2, beam_size=1, sampling_topk=50, sampling_topp=0.9)
  ```

+ > 'Dr Elhoud (Professor of Medicine at Dalhousie University, Halifax, Nova Scotia, and professor of clinical and scientific Division of the Canadian Diabetes Association) cautions that the study is still at an early stage.'
+
+ The model is in `ctranslate2` format, and the tokenizers are `sentencepiece`, so you can use `ctranslate2` directly instead of through `quickmt`. It is also possible to get this model to work with e.g. [LibreTranslate](https://libretranslate.com/), which also uses `ctranslate2` and `sentencepiece`. A model in safetensors format for use with `eole` is also provided.

  ## Metrics

+ `bleu` and `chrf2` are calculated with [sacrebleu](https://github.com/mjpost/sacrebleu) on the [Flores200 `devtest` test set](https://huggingface.co/datasets/facebook/flores) ("zho_Hans"->"eng_Latn"). `comet22` is calculated with the [`comet`](https://github.com/Unbabel/COMET) library and the [default model](https://huggingface.co/Unbabel/wmt22-comet-da). "Time (s)" is the time in seconds to translate the flores-devtest dataset (1012 sentences) on an RTX 4070s GPU with batch size 32.
+
+ | Model                            |   bleu |   chrf2 |   comet22 |   Time (s) |
+ |:---------------------------------|-------:|--------:|----------:|-----------:|
+ | quickmt/quickmt-zh-en            |  29.9  |   58.42 |     86.59 |       1.22 |
+ | Helsinki-NLP/opus-mt-zh-en       |  22.99 |   53.98 |     84.6  |       3.73 |
+ | facebook/nllb-200-distilled-600M |  26.02 |   55.27 |     85.1  |      21.69 |
+ | facebook/nllb-200-distilled-1.3B |  28.61 |   57.43 |     86.22 |      37.55 |
+ | facebook/m2m100_418M             |  19.55 |   50.83 |     82.04 |      18.2  |
+ | facebook/m2m100_1.2B             |  24.9  |   54.89 |     85.1  |      35.49 |
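As the README notes, the exported model can also be driven directly with `ctranslate2` and `sentencepiece`, bypassing the `quickmt` wrapper. A minimal sketch, assuming the tokenizer files shipped in this commit (`src.spm.model` / `tgt.spm.model` alongside `model.bin`):

```python
import ctranslate2
import sentencepiece as spm

translator = ctranslate2.Translator("./quickmt-zh-en", device="auto")
sp_src = spm.SentencePieceProcessor(model_file="./quickmt-zh-en/src.spm.model")
sp_tgt = spm.SentencePieceProcessor(model_file="./quickmt-zh-en/tgt.spm.model")

# Subword-tokenize the source, translate, then detokenize the best hypothesis
tokens = sp_src.encode("这项研究仍处在早期阶段。", out_type=str)
result = translator.translate_batch([tokens], beam_size=5)
print(sp_tgt.decode(result[0].hypotheses[0]))
```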
 
 
eole-config.yaml CHANGED
@@ -1,5 +1,5 @@
  ## IO
- save_data: zh_en/data_spm
  overwrite: True
  seed: 1234
  report_every: 100
@@ -8,39 +8,50 @@ tensorboard: true
  tensorboard_log_dir: tensorboard

  ### Vocab
- src_vocab: zh-en/src.eole.vocab
- tgt_vocab: zh-en/tgt.eole.vocab
  src_vocab_size: 32000
  tgt_vocab_size: 32000
  vocab_size_multiple: 8
- share_vocab: False
  n_sample: 0

  data:
    corpus_1:
-     path_src: hf://quickmt/quickmt-train-zh-en/zh
-     path_tgt: hf://quickmt/quickmt-train-zh-en/en
-     path_sco: hf://quickmt/quickmt-train-zh-en/sco
    valid:
-     path_src: zh-en/dev.zho
-     path_tgt: zh-en/dev.eng

  transforms: [sentencepiece, filtertoolong]
  transforms_configs:
    sentencepiece:
-     src_subword_model: "zh-en/src.spm.model"
-     tgt_subword_model: "zh-en/tgt.spm.model"
    filtertoolong:
      src_seq_length: 256
      tgt_seq_length: 256

  training:
    # Run configuration
-   model_path: zh-en/model
    keep_checkpoint: 4
-   save_checkpoint_steps: 2000
-   train_steps: 100000
-   valid_steps: 2000

    # Train on a single GPU
    world_size: 1
@@ -48,30 +59,31 @@ training:

    # Batching
    batch_type: "tokens"
-   batch_size: 8192
-   valid_batch_size: 8192
    batch_size_multiple: 8
-   accum_count: [16]
    accum_steps: [0]

    # Optimizer & Compute
-   compute_dtype: "bf16"
-   optim: "pagedadamw8bit"
-   learning_rate: 2.0
-   warmup_steps: 10000
    decay_method: "noam"
    adam_beta2: 0.998

    # Data loading
    bucket_size: 128000
    num_workers: 4
-   prefetch_factor: 100

    # Hyperparams
    dropout_steps: [0]
    dropout: [0.1]
    attention_dropout: [0.1]
-   max_grad_norm: 2
    label_smoothing: 0.1
    average_decay: 0.0001
    param_init_method: xavier_uniform
@@ -79,21 +91,16 @@ training:

  model:
    architecture: "transformer"
-   layer_norm: standard
    share_embeddings: false
    share_decoder_embeddings: true
-   add_ffnbias: true
-   mlp_activation_fn: gelu
-   add_estimator: false
-   add_qkvbias: false
-   norm_eps: 1e-6
    hidden_size: 1024
    encoder:
      layers: 8
    decoder:
      layers: 2
-   heads: 16
    transformer_ff: 4096
    embeddings:
      word_vec_size: 1024
      position_encoding_type: "SinusoidalInterleaved"
  ## IO
+ save_data: data
  overwrite: True
  seed: 1234
  report_every: 100

  tensorboard_log_dir: tensorboard

  ### Vocab
+ src_vocab: zh.eole.vocab
+ tgt_vocab: en.eole.vocab
  src_vocab_size: 32000
  tgt_vocab_size: 32000
  vocab_size_multiple: 8
+ share_vocab: false
  n_sample: 0

  data:
    corpus_1:
+     path_src: hf://quickmt/quickmt-train.zh-en/zh
+     path_tgt: hf://quickmt/quickmt-train.zh-en/en
+     path_sco: hf://quickmt/quickmt-train.zh-en/sco
+     weight: 2
+   corpus_2:
+     path_src: hf://quickmt/newscrawl2024-en-backtranslated-zh/zh
+     path_tgt: hf://quickmt/newscrawl2024-en-backtranslated-zh/en
+     path_sco: hf://quickmt/newscrawl2024-en-backtranslated-zh/sco
+     weight: 1
+   corpus_3:
+     path_src: hf://quickmt/madlad400-en-backtranslated-zh/zh
+     path_tgt: hf://quickmt/madlad400-en-backtranslated-zh/en
+     path_sco: hf://quickmt/madlad400-en-backtranslated-zh/sco
+     weight: 2
    valid:
+     path_src: valid.zh
+     path_tgt: valid.en

  transforms: [sentencepiece, filtertoolong]
  transforms_configs:
    sentencepiece:
+     src_subword_model: "zh.spm.model"
+     tgt_subword_model: "en.spm.model"
    filtertoolong:
      src_seq_length: 256
      tgt_seq_length: 256

  training:
    # Run configuration
+   model_path: quickmt-zh-en-eole-model
    keep_checkpoint: 4
+   train_steps: 200000
+   save_checkpoint_steps: 5000
+   valid_steps: 5000

    # Train on a single GPU
    world_size: 1

    # Batching
    batch_type: "tokens"
+   batch_size: 6000
+   valid_batch_size: 2048
    batch_size_multiple: 8
+   accum_count: [20]
    accum_steps: [0]

    # Optimizer & Compute
+   compute_dtype: "fp16"
+   optim: "adamw"
+   #use_amp: False
+   learning_rate: 3.0
+   warmup_steps: 5000
    decay_method: "noam"
    adam_beta2: 0.998

    # Data loading
    bucket_size: 128000
    num_workers: 4
+   prefetch_factor: 32

    # Hyperparams
    dropout_steps: [0]
    dropout: [0.1]
    attention_dropout: [0.1]
+   max_grad_norm: 0
    label_smoothing: 0.1
    average_decay: 0.0001
    param_init_method: xavier_uniform

  model:
    architecture: "transformer"
    share_embeddings: false
    share_decoder_embeddings: true
    hidden_size: 1024
    encoder:
      layers: 8
    decoder:
      layers: 2
+   heads: 8
    transformer_ff: 4096
    embeddings:
      word_vec_size: 1024
      position_encoding_type: "SinusoidalInterleaved"
+
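The new config raises learning_rate to 3.0 and halves warmup to 5000 steps under decay_method "noam". Assuming eole's "noam" follows the standard schedule from "Attention Is All You Need", the configured learning_rate acts as a scale factor rather than the actual step size; a quick sketch:

```python
# Standard Noam schedule (assumed for eole's decay_method: "noam"):
# lr(step) = base_lr * d_model**-0.5 * min(step**-0.5, step * warmup**-1.5)
def noam_lr(step, base_lr=3.0, d_model=1024, warmup=5000):
    return base_lr * d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

print(noam_lr(5_000))    # peak LR at the end of warmup, ~1.3e-3
print(noam_lr(200_000))  # LR at the final training step, ~2.1e-4
```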
eole-model/config.json ADDED
@@ -0,0 +1,153 @@
+ {
+   "n_sample": 0,
+   "share_vocab": false,
+   "report_every": 100,
+   "tgt_vocab_size": 32000,
+   "tensorboard_log_dir": "tensorboard",
+   "tensorboard_log_dir_dated": "tensorboard/Nov-28_15-33-54",
+   "valid_metrics": [
+     "BLEU"
+   ],
+   "src_vocab": "zh.eole.vocab",
+   "tensorboard": true,
+   "seed": 1234,
+   "tgt_vocab": "en.eole.vocab",
+   "vocab_size_multiple": 8,
+   "transforms": [
+     "sentencepiece",
+     "filtertoolong"
+   ],
+   "src_vocab_size": 32000,
+   "overwrite": true,
+   "save_data": "data",
+   "training": {
+     "num_workers": 0,
+     "label_smoothing": 0.1,
+     "accum_count": [
+       20
+     ],
+     "valid_steps": 5000,
+     "gpu_ranks": [
+       0
+     ],
+     "accum_steps": [
+       0
+     ],
+     "warmup_steps": 5000,
+     "world_size": 1,
+     "batch_size_multiple": 8,
+     "optim": "adamw",
+     "normalization": "tokens",
+     "max_grad_norm": 0.0,
+     "bucket_size": 128000,
+     "dropout": [
+       0.1
+     ],
+     "adam_beta2": 0.998,
+     "model_path": "quickmt-zh-en-eole-model",
+     "batch_size": 6000,
+     "batch_type": "tokens",
+     "compute_dtype": "torch.float16",
+     "save_checkpoint_steps": 5000,
+     "keep_checkpoint": 4,
+     "learning_rate": 3.0,
+     "prefetch_factor": 32,
+     "dropout_steps": [
+       0
+     ],
+     "train_steps": 200000,
+     "decay_method": "noam",
+     "average_decay": 0.0001,
+     "valid_batch_size": 2048,
+     "param_init_method": "xavier_uniform",
+     "attention_dropout": [
+       0.1
+     ]
+   },
+   "transforms_configs": {
+     "sentencepiece": {
+       "src_subword_model": "${MODEL_PATH}/zh.spm.model",
+       "tgt_subword_model": "${MODEL_PATH}/en.spm.model"
+     },
+     "filtertoolong": {
+       "src_seq_length": 256,
+       "tgt_seq_length": 256
+     }
+   },
+   "data": {
+     "corpus_1": {
+       "weight": 2,
+       "transforms": [
+         "sentencepiece",
+         "filtertoolong"
+       ],
+       "path_align": null,
+       "path_src": "train.zh",
+       "path_tgt": "train.en"
+     },
+     "corpus_2": {
+       "weight": 1,
+       "transforms": [
+         "sentencepiece",
+         "filtertoolong"
+       ],
+       "path_align": null,
+       "path_src": "/home/mark/mt/data/newscrawl.backtrans.zh",
+       "path_tgt": "/home/mark/mt/data/newscrawl.2024.en"
+     },
+     "corpus_3": {
+       "weight": 2,
+       "transforms": [
+         "sentencepiece",
+         "filtertoolong"
+       ],
+       "path_align": null,
+       "path_src": "/home/mark/mt/data/madlad.backtrans.zh",
+       "path_tgt": "/home/mark/mt/data/madlad.en"
+     },
+     "valid": {
+       "path_src": "valid.zh",
+       "transforms": [
+         "sentencepiece",
+         "filtertoolong"
+       ],
+       "path_tgt": "valid.en",
+       "path_align": null
+     }
+   },
+   "model": {
+     "hidden_size": 1024,
+     "position_encoding_type": "SinusoidalInterleaved",
+     "share_embeddings": false,
+     "architecture": "transformer",
+     "heads": 8,
+     "share_decoder_embeddings": true,
+     "transformer_ff": 4096,
+     "decoder": {
+       "hidden_size": 1024,
+       "layers": 2,
+       "position_encoding_type": "SinusoidalInterleaved",
+       "tgt_word_vec_size": 1024,
+       "n_positions": null,
+       "heads": 8,
+       "decoder_type": "transformer",
+       "transformer_ff": 4096
+     },
+     "embeddings": {
+       "src_word_vec_size": 1024,
+       "word_vec_size": 1024,
+       "position_encoding_type": "SinusoidalInterleaved",
+       "tgt_word_vec_size": 1024
+     },
+     "encoder": {
+       "hidden_size": 1024,
+       "encoder_type": "transformer",
+       "src_word_vec_size": 1024,
+       "layers": 8,
+       "position_encoding_type": "SinusoidalInterleaved",
+       "n_positions": null,
+       "heads": 8,
+       "transformer_ff": 4096
+     }
+   }
+ }
eole-model/en.spm.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c23dc1aa66b7b98263da867137a7eb41d5e4573984fb100c0b295f3010381823
+ size 792100
eole-model/model.00.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ce0b3dfe2ef4c9f6b93f969f27f5d5cf38432e4a7bcd144577c8583209bb701a
+ size 840314816
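Incidentally, the pointer size gives a quick sanity check on the README's "200M parameter" figure: 840,314,816 bytes of (presumably fp32) safetensors weights works out to roughly 210M parameters.

```python
# ~4 bytes per fp32 weight, ignoring the small safetensors header
print(840_314_816 / 4 / 1e6)  # ≈ 210 million parameters
```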
eole-model/vocab.json ADDED
The diff for this file is too large to render. See raw diff
 
eole-model/zh.spm.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:56159dc3a8607805f0e3f2cf4c91f37ee221e91fb824ebb362622d22185872cb
+ size 720056
model.bin CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:494518d6282fcd01f850ab4ab096e6a5c937aa834290a62e5efd435275828c9d
- size 409972810
+ oid sha256:88ef37879afce2d5f0bdf4c53073aab30967f178f0a0fa2eed7c98160270b06a
+ size 409915789
source_vocabulary.json CHANGED
The diff for this file is too large to render. See raw diff
 
src.spm.model CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:23d03d562fc3f8fe57e497dac0ece4827c254675a80c103fc4bb4040638ceb67
- size 733978
+ oid sha256:56159dc3a8607805f0e3f2cf4c91f37ee221e91fb824ebb362622d22185872cb
+ size 720056
target_vocabulary.json CHANGED
The diff for this file is too large to render. See raw diff
 
tgt.spm.model CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:c373f1d78753313b0dbc337058bf8450e1fdd6fe662a49e0941affce44ec14c5
- size 800955
+ oid sha256:c23dc1aa66b7b98263da867137a7eb41d5e4573984fb100c0b295f3010381823
+ size 792100