radinplaid committed (verified)
Commit 4f9134c · 1 Parent(s): 955f9e6

Upload folder using huggingface_hub
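The commit message suggests the repository was pushed with `huggingface_hub`'s `upload_folder`; a minimal sketch of that call, with the folder path assumed and the repo id taken from this page:

```python
# Hypothetical re-creation of this commit: push a local model folder
# to the quickmt/quickmt-ja-en repo with huggingface_hub.
from huggingface_hub import HfApi

api = HfApi()  # assumes you are already logged in (huggingface-cli login)
api.upload_folder(
    folder_path="./quickmt-ja-en",   # assumed local folder with model.bin, *.spm.model, etc.
    repo_id="quickmt/quickmt-ja-en",
    repo_type="model",
    commit_message="Upload folder using huggingface_hub",
)
```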
.ipynb_checkpoints/README-checkpoint.md CHANGED
(same changes as README.md below)
.ipynb_checkpoints/eole-config-checkpoint.yaml ADDED
@@ -0,0 +1,110 @@
+## IO
+save_data: data
+overwrite: True
+seed: 1234
+report_every: 100
+valid_metrics: ["BLEU"]
+tensorboard: true
+tensorboard_log_dir: tensorboard
+
+### Vocab
+src_vocab: ja.eole.vocab
+tgt_vocab: en.eole.vocab
+src_vocab_size: 32000
+tgt_vocab_size: 32000
+vocab_size_multiple: 8
+share_vocab: false
+n_sample: 0
+
+data:
+    corpus_1:
+        path_src: hf://quickmt/quickmt-train.ja-en/ja
+        path_tgt: hf://quickmt/quickmt-train.ja-en/en
+        path_sco: hf://quickmt/quickmt-train.ja-en/sco
+        weight: 2
+    corpus_2:
+        path_src: hf://quickmt/newscrawl2024-en-backtranslated-ja/ja
+        path_tgt: hf://quickmt/newscrawl2024-en-backtranslated-ja/en
+        path_sco: hf://quickmt/newscrawl2024-en-backtranslated-ja/sco
+        weight: 1
+    corpus_3:
+        path_src: hf://quickmt/madlad400-en-backtranslated-ja/ja
+        path_tgt: hf://quickmt/madlad400-en-backtranslated-ja/en
+        path_sco: hf://quickmt/madlad400-en-backtranslated-ja/sco
+        weight: 2
+    valid:
+        path_src: valid.ja
+        path_tgt: valid.en
+
+transforms: [sentencepiece, filtertoolong]
+transforms_configs:
+    sentencepiece:
+        src_subword_model: "ja.spm.model"
+        tgt_subword_model: "en.spm.model"
+    filtertoolong:
+        src_seq_length: 256
+        tgt_seq_length: 256
+
+training:
+    # Run configuration
+    model_path: quickmt-ja-en-eole-model
+    keep_checkpoint: 4
+    train_steps: 200000
+    save_checkpoint_steps: 5000
+    valid_steps: 5000
+
+    # Train on a single GPU
+    world_size: 1
+    gpu_ranks: [0]
+    batch_type: "tokens"
+    batch_size: 15000
+    valid_batch_size: 2048
+    batch_size_multiple: 8
+    accum_count: [8]
+    accum_steps: [0]
+
+    # Optimizer & Compute
+    compute_dtype: "fp16"
+    optim: "adamw"
+    learning_rate: 3.0
+    warmup_steps: 5000
+    decay_method: "noam"
+    adam_beta2: 0.998
+
+    # Data loading
+    bucket_size: 256000
+    num_workers: 4
+    prefetch_factor: 64
+
+    # Hyperparams
+    dropout_steps: [0]
+    dropout: [0.1]
+    attention_dropout: [0.1]
+    max_grad_norm: 0
+    label_smoothing: 0.1
+    average_decay: 0.0001
+    param_init_method: xavier_uniform
+    normalization: "tokens"
+
+model:
+    architecture: "transformer"
+    share_embeddings: false
+    share_decoder_embeddings: false
+    add_estimator: false
+    add_ffnbias: true
+    add_qkvbias: false
+    layer_norm: standard
+    mlp_activation_fn: gelu
+    hidden_size: 768
+    encoder:
+        layers: 12
+    decoder:
+        layers: 2
+    heads: 16
+    transformer_ff: 4096
+    embeddings:
+        word_vec_size: 768
+    position_encoding_type: "SinusoidalInterleaved"
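A plausible way to launch training from the config above, assuming the `eole` CLI keeps its OpenNMT-py-style `-config` flag (check `eole train --help` for the exact spelling on your install):

```bash
# Hypothetical invocation: train the ja-en model from the YAML config above.
eole train -config eole-config.yaml
```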
README.md CHANGED
@@ -7,27 +7,29 @@ tags:
 license: cc-by-4.0
 datasets:
 - quickmt/quickmt-train.ja-en
+- quickmt/madlad400-en-backtranslated-ja
+- quickmt/newscrawl2024-en-backtranslated-ja
 model-index:
 - name: quickmt-ja-en
   results:
   - task:
       name: Translation jpn-eng
       type: translation
       args: jpn-eng
     dataset:
       name: flores101-devtest
       type: flores_101
-      args: jpn_Japn eng_Latn devtest
+      args: jpn_Jpan eng_Latn devtest
     metrics:
-    - name: CHRF
-      type: chrf
-      value: 57.06
     - name: BLEU
       type: bleu
-      value: 27.91
+      value: 28.87
+    - name: CHRF
+      type: chrf
+      value: 58.56
     - name: COMET
       type: comet
-      value: 87.29
+      value: 87.83
 ---
@@ -35,14 +37,26 @@ model-index:
 
 `quickmt-ja-en` is a reasonably fast and reasonably accurate neural machine translation model for translation from `ja` into `en`.
 
+`quickmt` models are roughly 3 times faster for GPU inference than OpusMT models and roughly [40 times](https://huggingface.co/spaces/quickmt/quickmt-vs-libretranslate) faster than [LibreTranslate](https://huggingface.co/spaces/quickmt/quickmt-vs-libretranslate)/[ArgosTranslate](https://github.com/argosopentech/argos-translate).
+
+## *UPDATED!*
+
+This model was trained with back-translated data and a slightly different configuration, and has improved translation quality.
+
+## Try it on our Hugging Face Space
+
+Give it a try before downloading: https://huggingface.co/spaces/quickmt/QuickMT-Demo
+
 ## Model Information
 
 * Trained using [`eole`](https://github.com/eole-nlp/eole)
-* 185M parameter transformer 'big' with 8 encoder layers and 2 decoder layers
-* 20k sentencepiece vocabularies
+* 200M parameter seq2seq transformer
+* 32k separate SentencePiece vocabularies
 * Exported for fast inference to [CTranslate2](https://github.com/OpenNMT/CTranslate2) format
-* Training data: https://huggingface.co/datasets/quickmt/quickmt-train.ja-en/tree/main
+* The pytorch model (for use with [`eole`](https://github.com/eole-nlp/eole)) is available in this repository in the `eole-model` folder
 
 See the `eole` model configuration in this repository for further details and the `eole-model` for the raw `eole` (pytorch) model.
@@ -55,9 +69,9 @@ Next, install the `quickmt` python library and download the model:
 
 ```bash
 git clone https://github.com/quickmt/quickmt.git
-pip install ./quickmt/
+pip install -e ./quickmt/
 
 quickmt-model-download quickmt/quickmt-ja-en ./quickmt-ja-en
 ```
 
 Finally use the model in python:
@@ -68,33 +82,35 @@ from quickmt import Translator
 # Auto-detects GPU, set to "cpu" to force CPU inference
 t = Translator("./quickmt-ja-en/", device="auto")
 
-# Translate - set beam size to 5 for higher quality (but slower speed)
+# Translate - set beam size to 1 for faster speed (but lower quality)
 sample_text = 'ノバスコシア州ハリファックスにあるダルハウジー大学医学部教授でカナダ糖尿病協会の臨床・科学部門の責任者を務めるエフード・ウル博士は、この研究はまだ初期段階にあるとして注意を促しました。'
+
 t(sample_text, beam_size=5)
-
-> 'Dr. Ehud Ulu, a professor of medicine at Dalhousie University in Halifax, Nova Scotia and head of the clinical and scientific division of the Canadian Diabetes Association, cautioned that the study is still in its early stages.'
+```
 
+> 'Dr Ehud Ulu, professor of medicine at Dalhousie University in Halifax, Nova Scotia and head of the clinical and scientific division of the Canadian Diabetes Association, urged caution as the study is still in its infancy.'
 
+```python
 # Get alternative translations by sampling
 # You can pass any cTranslate2 `translate_batch` arguments
 t([sample_text], sampling_temperature=1.2, beam_size=1, sampling_topk=50, sampling_topp=0.9)
-
-> 'Dr Ehud Ul, professor of medicine at the University of Dalhousie’s Halifax, Nova Scotia and head of the clinical and scientific division of the Canadian Diabetes Society, noted the study is in its early stages.'
 ```
 
+> 'Dr Ehud Ulu, Professor of Medicine at Dalhousie University in Halifax, Nova Scotia and head of clinical and scientific departments for the Canadian Diabetes Society, urged caution, saying the research was still in its early stages.'
+
-The model is in `ctranslate2` format, and the tokenizers are `sentencepiece`, so you can use `ctranslate2` directly instead of through `quickmt`. It is also possible to get this model to work with e.g. [LibreTranslate](https://libretranslate.com/) which also uses `ctranslate2` and `sentencepiece`.
+The model is in `ctranslate2` format, and the tokenizers are `sentencepiece`, so you can use `ctranslate2` directly instead of through `quickmt`. It is also possible to get this model to work with e.g. [LibreTranslate](https://libretranslate.com/), which also uses `ctranslate2` and `sentencepiece`. A model in safetensors format to be used with `eole` is also provided.
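A minimal sketch of that direct `ctranslate2` route, assuming the file layout this repo ships (`model.bin`, `src.spm.model`, `tgt.spm.model` inside the downloaded folder):

```python
# Use the exported model directly with ctranslate2 + sentencepiece,
# bypassing the quickmt wrapper. Paths assume `quickmt-model-download` above.
import ctranslate2
import sentencepiece as spm

translator = ctranslate2.Translator("./quickmt-ja-en/", device="auto")
sp_src = spm.SentencePieceProcessor(model_file="./quickmt-ja-en/src.spm.model")
sp_tgt = spm.SentencePieceProcessor(model_file="./quickmt-ja-en/tgt.spm.model")

# Tokenize the source, translate, then detokenize the best hypothesis.
tokens = sp_src.encode("こんにちは、世界。", out_type=str)
result = translator.translate_batch([tokens], beam_size=5)
print(sp_tgt.decode(result[0].hypotheses[0]))
```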
 
 
 ## Metrics
 
-`bleu` and `chrf2` are calculated with [sacrebleu](https://github.com/mjpost/sacrebleu) on the [Flores200 `devtest` test set](https://huggingface.co/datasets/facebook/flores) ("jpn_Jpan"->"eng_Latn"). `comet22` with the [`comet`](https://github.com/Unbabel/COMET) library and the [default model](https://huggingface.co/Unbabel/wmt22-comet-da). "Time (s)" is the time in seconds to translate (using `ctranslate2`) the flores-devtest dataset (1012 sentences) on an RTX 4070s GPU with batch size 32 (faster speed is possible using a large batch size).
+`bleu` and `chrf2` are calculated with [sacrebleu](https://github.com/mjpost/sacrebleu) on the [Flores200 `devtest` test set](https://huggingface.co/datasets/facebook/flores), and `comet22` with the [`comet`](https://github.com/Unbabel/COMET) library and the [default model](https://huggingface.co/Unbabel/wmt22-comet-da). "Time (s)" is the time in seconds to translate the flores-devtest dataset (1012 sentences) on an RTX 4070s GPU with batch size 32.
 
 | | bleu | chrf2 | comet22 | Time (s) |
 |:---------------------------------|-------:|--------:|----------:|-----------:|
-| quickmt/quickmt-ja-en | 27.91 | 57.06 | 87.29 | 1.00 |
+| quickmt/quickmt-ja-en | 28.87 | 58.56 | 87.83 | 1.10 |
 | Helsinki-NLP/opus-mt-ja-en | 19.22 | 49.15 | 82.64 | 3.54 |
 | facebook/nllb-200-distilled-600M | 24.05 | 53.39 | 85.84 | 22.5 |
 | facebook/nllb-200-distilled-1.3B | 28.4 | 56.96 | 87.47 | 37.15 |
 | facebook/m2m100_418M | 18.82 | 49.55 | 82.59 | 18.27 |
 | facebook/m2m100_1.2B | 23.32 | 53.46 | 85.43 | 35.44 |
-
-`quickmt-ja-en` is the fastest and nearly as high-quality as facebook/nllb-200-distilled-1.3B.
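A hedged sketch of how scores like these can be reproduced with `sacrebleu` and `comet`; the hypothesis/reference/source file names are assumptions:

```python
# Score a system output against Flores devtest references with
# sacrebleu (BLEU, chrF2) and COMET (Unbabel/wmt22-comet-da).
import sacrebleu
from comet import download_model, load_from_checkpoint

srcs = open("flores-devtest.ja").read().splitlines()      # assumed source file
refs = open("flores-devtest.en").read().splitlines()      # assumed reference file
hyps = open("quickmt-ja-en.hyp.en").read().splitlines()   # assumed model output

print("bleu: ", sacrebleu.corpus_bleu(hyps, [refs]).score)
print("chrf2:", sacrebleu.corpus_chrf(hyps, [refs]).score)

# COMET is reference-based and also reads the source sentences.
comet = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
data = [{"src": s, "mt": h, "ref": r} for s, h, r in zip(srcs, hyps, refs)]
print("comet22:", 100 * comet.predict(data, batch_size=32).system_score)
```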
 
 
eole-config.yaml CHANGED
@@ -10,8 +10,8 @@ tensorboard_log_dir: tensorboard
 ### Vocab
 src_vocab: ja.eole.vocab
 tgt_vocab: en.eole.vocab
-src_vocab_size: 20000
-tgt_vocab_size: 20000
+src_vocab_size: 32000
+tgt_vocab_size: 32000
 vocab_size_multiple: 8
 share_vocab: false
 n_sample: 0
@@ -21,10 +21,23 @@ data:
         path_src: hf://quickmt/quickmt-train.ja-en/ja
         path_tgt: hf://quickmt/quickmt-train.ja-en/en
         path_sco: hf://quickmt/quickmt-train.ja-en/sco
+        weight: 2
+    corpus_2:
+        path_src: hf://quickmt/newscrawl2024-en-backtranslated-ja/ja
+        path_tgt: hf://quickmt/newscrawl2024-en-backtranslated-ja/en
+        path_sco: hf://quickmt/newscrawl2024-en-backtranslated-ja/sco
+        weight: 1
+    corpus_3:
+        path_src: hf://quickmt/madlad400-en-backtranslated-ja/ja
+        path_tgt: hf://quickmt/madlad400-en-backtranslated-ja/en
+        path_sco: hf://quickmt/madlad400-en-backtranslated-ja/sco
+        weight: 2
     valid:
         path_src: valid.ja
         path_tgt: valid.en
 
+
+
 transforms: [sentencepiece, filtertoolong]
 transforms_configs:
     sentencepiece:
@@ -36,20 +49,18 @@ transforms_configs:
 
 training:
     # Run configuration
-    model_path: eole-model
+    model_path: quickmt-ja-en-eole-model
     keep_checkpoint: 4
-    train_steps: 108000
-    save_checkpoint_steps: 2000
-    valid_steps: 2000
-
+    train_steps: 200000
+    save_checkpoint_steps: 5000
+    valid_steps: 5000
+
     # Train on a single GPU
     world_size: 1
     gpu_ranks: [0]
-
-    # Batching 10240
     batch_type: "tokens"
-    batch_size: 8000
-    valid_batch_size: 4096
+    batch_size: 15000
+    valid_batch_size: 2048
     batch_size_multiple: 8
     accum_count: [8]
     accum_steps: [0]
@@ -57,15 +68,15 @@ training:
     # Optimizer & Compute
     compute_dtype: "fp16"
     optim: "adamw"
-    learning_rate: 2.0
-    warmup_steps: 4000
+    learning_rate: 3.0
+    warmup_steps: 5000
     decay_method: "noam"
    adam_beta2: 0.998
 
     # Data loading
-    bucket_size: 128000
+    bucket_size: 256000
     num_workers: 4
-    prefetch_factor: 32
+    prefetch_factor: 64
 
     # Hyperparams
     dropout_steps: [0]
@@ -81,14 +92,19 @@ model:
     architecture: "transformer"
     share_embeddings: false
     share_decoder_embeddings: false
-    hidden_size: 1024
+    add_estimator: false
+    add_ffnbias: true
+    add_qkvbias: false
+    layer_norm: standard
+    mlp_activation_fn: gelu
+    hidden_size: 768
     encoder:
-        layers: 8
+        layers: 12
     decoder:
         layers: 2
-    heads: 8
+    heads: 16
     transformer_ff: 4096
     embeddings:
-        word_vec_size: 1024
+        word_vec_size: 768
     position_encoding_type: "SinusoidalInterleaved"
eole-model/config.json CHANGED
@@ -1,108 +1,121 @@
(keys are reordered in the new file; only substantive value changes are shown)
 {
-  "src_vocab_size": 20000,
-  "tgt_vocab_size": 20000,
+  "src_vocab_size": 32000,
+  "tgt_vocab_size": 32000,
-  "tensorboard_log_dir_dated": "tensorboard/Apr-15_05-55-59",
+  "tensorboard_log_dir_dated": "tensorboard/Dec-24_20-29-50",
   "training": {
-    "save_checkpoint_steps": 2000,
-    "valid_steps": 2000,
-    "batch_size": 8000,
-    "valid_batch_size": 4096,
-    "bucket_size": 128000,
-    "prefetch_factor": 32,
-    "warmup_steps": 4000,
-    "learning_rate": 2.0
+    "save_checkpoint_steps": 5000,
+    "valid_steps": 5000,
+    "batch_size": 15000,
+    "valid_batch_size": 2048,
+    "bucket_size": 256000,
+    "prefetch_factor": 64,
+    "warmup_steps": 5000,
+    "learning_rate": 3.0
   },
   "model": {
-    "heads": 8,
-    "hidden_size": 1024,
+    "heads": 16,
+    "hidden_size": 768,
+    "layer_norm": "standard",
+    "mlp_activation_fn": "gelu",
+    "add_estimator": false,
+    "add_ffnbias": true,
+    "add_qkvbias": false,
     "encoder": {
-      "layers": 8,
-      "heads": 8,
-      "hidden_size": 1024,
-      "src_word_vec_size": 1024
+      "layers": 12,
+      "heads": 16,
+      "hidden_size": 768,
+      "src_word_vec_size": 768
     },
     "embeddings": {
-      "word_vec_size": 1024,
-      "src_word_vec_size": 1024,
-      "tgt_word_vec_size": 1024
+      "word_vec_size": 768,
+      "src_word_vec_size": 768,
+      "tgt_word_vec_size": 768
     },
     "decoder": {
       "layers": 2,
-      "heads": 8,
-      "hidden_size": 1024,
-      "tgt_word_vec_size": 1024
+      "heads": 16,
+      "hidden_size": 768,
+      "tgt_word_vec_size": 768
     }
   },
   "data": {
     "corpus_1": {
-      "path_src": "ja.txt",
-      "path_tgt": "en.txt"
+      "path_src": "hf://quickmt/quickmt-train.ja-en/ja",
+      "path_tgt": "hf://quickmt/quickmt-train.ja-en/en",
+      "path_sco": "hf://quickmt/quickmt-train.ja-en/sco",
+      "weight": 2
     },
+    "corpus_2": {
+      "path_src": "hf://quickmt/newscrawl2024-en-backtranslated-ja/ja",
+      "path_tgt": "hf://quickmt/newscrawl2024-en-backtranslated-ja/en",
+      "path_sco": "hf://quickmt/newscrawl2024-en-backtranslated-ja/sco",
+      "weight": 1
+    },
+    "corpus_3": {
+      "path_src": "hf://quickmt/madlad400-en-backtranslated-ja/ja",
+      "path_tgt": "hf://quickmt/madlad400-en-backtranslated-ja/en",
+      "path_sco": "hf://quickmt/madlad400-en-backtranslated-ja/sco",
+      "weight": 2
+    },
     "valid": {
       "path_src": "valid.ja",
       "path_tgt": "valid.en"
     }
   }
 }
eole-model/en.spm.model CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:01dc857df9df5987bec61812d1fbc2e78ad530346c09fe3ffaf27184358ab8fd
-size 583983
+oid sha256:d2e58f48c0d030d6b4e6b183ebca7068704ef9cce640f090abfea4ac9ea11da0
+size 797948

eole-model/ja.spm.model CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:cbd449d7e92d850f6d36efeafb40cea3a8468f55db0ca751ee4ece0ceb70d19b
-size 583133
+oid sha256:438e5f7fd498d76b057ca8b6759470b82eb3855b81a67718face917407b31ff5
+size 816774

eole-model/model.00.safetensors CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:f226e709170b27c12510b5b186f305fcc9ddb3dbf4ae3ea1066f8ff4632f7de1
-size 823882912
+oid sha256:850944f80d706c273021d0dc80a82d05536672532122dccccaeb63065d74ace3
+size 829569112

eole-model/vocab.json CHANGED
The diff for this file is too large to render. See raw diff.

model.bin CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:4f9906f44709dab21e2bd168cb19e4379326d4b8e92513dc9b4e5ff4af6cf323
-size 401699775
+oid sha256:d11276f68986d951edc1e5b4b634e00f1f9c493eb14519598be975630965eb47
+size 407101843

source_vocabulary.json CHANGED
The diff for this file is too large to render. See raw diff.

src.spm.model CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:cbd449d7e92d850f6d36efeafb40cea3a8468f55db0ca751ee4ece0ceb70d19b
-size 583133
+oid sha256:438e5f7fd498d76b057ca8b6759470b82eb3855b81a67718face917407b31ff5
+size 816774

target_vocabulary.json CHANGED
The diff for this file is too large to render. See raw diff.

tgt.spm.model CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:01dc857df9df5987bec61812d1fbc2e78ad530346c09fe3ffaf27184358ab8fd
-size 583983
+oid sha256:d2e58f48c0d030d6b4e6b183ebca7068704ef9cce640f090abfea4ac9ea11da0
+size 797948