radinplaid committed on
Commit
cb934b0
·
verified ·
1 Parent(s): d00d112

Upload folder using huggingface_hub

.ipynb_checkpoints/README-checkpoint.md ADDED
@@ -0,0 +1,121 @@
1
+ ---
2
+ language:
3
+ - fr
4
+ - en
5
+ tags:
6
+ - translation
7
+ license: cc-by-4.0
8
+ datasets:
9
+ - quickmt/quickmt-train.fr-en
10
+ - quickmt/madlad400-en-backtranslated-fr
11
+ - quickmt/newscrawl2024-en-backtranslated-fr
12
+ - quickmt/canadian_hansard
13
+ model-index:
14
+ - name: quickmt-fr-en
15
+ results:
16
+ - task:
17
+ name: Translation fra-eng
18
+ type: translation
19
+ args: fra-eng
20
+ dataset:
21
+ name: flores101-devtest
22
+ type: flores_101
23
+ args: fra_Latn eng_Latn devtest
24
+ metrics:
25
+ - name: BLEU
26
+ type: bleu
27
+ value: 46.84
28
+ - name: CHRF
29
+ type: chrf
30
+ value: 69.87
31
+ - name: COMET
32
+ type: comet
33
+ value: 89.4
34
+ ---
35
+
36
+
37
+ # `quickmt-fr-en` Neural Machine Translation Model
38
+
39
+ `quickmt-fr-en` is a reasonably fast and reasonably accurate neural machine translation model for translation from `fr` into `en`.
40
+
41
+ `quickmt` models are roughly 3 times faster for GPU inference than OpusMT models and roughly [40 times](https://huggingface.co/spaces/quickmt/quickmt-vs-libretranslate) faster than [LibreTranslate](https://huggingface.co/spaces/quickmt/quickmt-vs-libretranslate)/[ArgosTranslate](https://github.com/argosopentech/argos-translate).
42
+
43
+
44
+ ## *UPDATED VERSION!*
45
+
46
+ The training data for this model was augmented with back-translated data, which improves translation quality!
47
+
48
+
49
+ ## Try it on our Hugging Face Space
50
+
51
+ Try the model before downloading it: https://huggingface.co/spaces/quickmt/QuickMT-Demo
52
+
53
+
54
+ ## Model Information
55
+
56
+ * Trained using [`eole`](https://github.com/eole-nlp/eole)
57
+ * 200M parameter seq2seq transformer
58
+ * Separate 32k SentencePiece vocabularies for source and target
59
+ * Exported for fast inference to [CTranslate2](https://github.com/OpenNMT/CTranslate2) format
60
+ * The PyTorch model (for use with [`eole`](https://github.com/eole-nlp/eole)) is available in this repository in the `eole-model` folder
61
+
62
+ See the `eole` model configuration in this repository for further details, and the `eole-model` folder for the raw `eole` (PyTorch) model.
63
+
64
+
65
+ ## Usage with `quickmt`
66
+
67
+ If you want to do GPU inference, install the NVIDIA CUDA toolkit first.
68
+
69
+ Next, install the `quickmt` Python library and download the model:
70
+
71
+ ```bash
72
+ git clone https://github.com/quickmt/quickmt.git
73
+ pip install -e ./quickmt/
74
+
75
+ quickmt-model-download quickmt/quickmt-fr-en ./quickmt-fr-en
76
+ ```
77
+
78
+ Finally, use the model in Python:
79
+
80
+ ```python
81
+ from quickmt import Translator
82
+
83
+ # Auto-detects GPU, set to "cpu" to force CPU inference
84
+ mt = Translator("./quickmt-fr-en/", device="auto")
85
+
86
+ # Translate - set beam size to 1 for faster speed (but lower quality)
87
+ sample_text = "Le Dr Ehud Ur, professeur de médecine à l'Université Dalhousie de Halifax (Nouvelle-Écosse) et président de la division clinique et scientifique de l'Association canadienne du diabète, a averti que la recherche en était encore à ses débuts."
88
+
89
+ mt(sample_text, beam_size=5)
90
+ ```
91
+
92
+ > 'Dr. Ehud Ur, professor of medicine at Dalhousie University in Halifax, Nova Scotia and chair of the clinical and scientific division of the Canadian Diabetes Association, warned that research was still in its infancy.'
93
+
94
+ ```python
95
+ # Get alternative translations by sampling
96
+ # You can pass any CTranslate2 `translate_batch` arguments
97
+ mt([sample_text], sampling_temperature=1.2, beam_size=1, sampling_topk=50, sampling_topp=0.9)
98
+ ```
99
+
100
+ > 'Dr. Ehud Ur, Professor of Medicine at Dalhousie University in Halifax, Nova Scotia, and Chair of the Clinical & Scientific Division of the Canadian Diabetes Association, warned the research was still in its infancy.'
101
+
102
+ The model is in `ctranslate2` format, and the tokenizers are `sentencepiece`, so you can use `ctranslate2` directly instead of through `quickmt`. It is also possible to get this model to work with e.g. [LibreTranslate](https://libretranslate.com/) which also uses `ctranslate2` and `sentencepiece`. A model in safetensors format to be used with `eole` is also provided.
103
+
104
+
105
+ ## Metrics
106
+
107
+ `bleu` and `chrf2` are calculated with [sacrebleu](https://github.com/mjpost/sacrebleu) on the [Flores200 `devtest` test set](https://huggingface.co/datasets/facebook/flores). `comet22` is calculated with the [`comet`](https://github.com/Unbabel/COMET) library and the [default model](https://huggingface.co/Unbabel/wmt22-comet-da). "Time (s)" is the time in seconds to translate the flores-devtest dataset (1012 sentences) on an RTX 4070s GPU with batch size 32.
108
+
109
+
110
+ | Model | bleu | chrf2 | comet22 | Time (s) |
111
+ |:--------------------------------------|-------:|--------:|----------:|-----------:|
112
+ | quickmt/quickmt-fr-en | 46.84 | 69.87 | 89.4 | 1.08 |
113
+ | Helsinki-NLP/opus-mt-fr-en | 41.71 | 66.84 | 88.31 | 3.49 |
114
+ | facebook/nllb-200-distilled-600M | 44.05 | 67.81 | 88.48 | 21.52 |
115
+ | facebook/nllb-200-distilled-1.3B | 46.24 | 69.32 | 89.24 | 37.25 |
116
+ | facebook/m2m100_418M | 36.48 | 63.3 | 85.87 | 18.28 |
117
+ | facebook/m2m100_1.2B | 41.69 | 66.51 | 88 | 35.1 |
118
+ | tencent/HY-MT1.5-1.8B | 28.68 | 61.62 | 88.66 | 9 |
119
+ | tencent/Hunyuan-MT-7B-fp8 | 35.66 | 64.94 | 89.72 | 22 |
120
+ | CohereLabs/aya-expanse-8b (bnb quant) | 45.03 | 69.02 | 90.29 | 73.97 |
121
+
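
The README above says the model can be used with `ctranslate2` and `sentencepiece` directly instead of through `quickmt`. A minimal sketch of that path follows. It assumes the repository was downloaded to `./quickmt-fr-en` and that the tokenizers are the `src.spm.model` and `tgt.spm.model` files added in this commit. Whether the source needs an end-of-sentence token appended is not stated in the README, so `append_eos` and the helper `prepare_source` are hypothetical and default to off; check the `quickmt` source before relying on either setting.

```python
from pathlib import Path


def prepare_source(pieces, append_eos=False, eos_token="</s>"):
    """Return a copy of the subword pieces, optionally with an EOS marker.

    Assumption: the README does not say whether the model expects an explicit
    end-of-sentence token, so this is off by default.
    """
    pieces = list(pieces)
    return (pieces + [eos_token]) if append_eos else pieces


def translate(texts, model_dir="./quickmt-fr-en", device="auto",
              beam_size=5, append_eos=False):
    """Translate French strings to English with ctranslate2 + sentencepiece."""
    # Imported here so prepare_source stays usable without the heavy dependencies
    import ctranslate2
    import sentencepiece as spm

    model_dir = Path(model_dir)
    # Tokenizer file names as they appear in this commit (assumed layout)
    src_sp = spm.SentencePieceProcessor(model_file=str(model_dir / "src.spm.model"))
    tgt_sp = spm.SentencePieceProcessor(model_file=str(model_dir / "tgt.spm.model"))
    translator = ctranslate2.Translator(str(model_dir), device=device)

    # ctranslate2 expects a batch of token lists, not raw strings
    batch = [prepare_source(src_sp.encode(t, out_type=str), append_eos) for t in texts]
    results = translator.translate_batch(batch, beam_size=beam_size)

    # Keep the best hypothesis per sentence and detokenize it
    return [tgt_sp.decode(r.hypotheses[0]) for r in results]
```

Calling `translate(["Bonjour le monde."])` should return a one-element list of English strings once the model folder is in place.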
README.md CHANGED
@@ -1,12 +1,15 @@
1
  ---
2
  language:
3
- - en
4
  - fr
 
5
  tags:
6
  - translation
7
  license: cc-by-4.0
8
  datasets:
9
  - quickmt/quickmt-train.fr-en
 
 
 
10
  model-index:
11
  - name: quickmt-fr-en
12
  results:
@@ -19,15 +22,15 @@ model-index:
19
  type: flores_101
20
  args: fra_Latn eng_Latn devtest
21
  metrics:
22
- - name: CHRF
23
- type: chrf
24
- value: 66.77
25
  - name: BLEU
26
  type: bleu
27
- value: 42.17
 
 
 
28
  - name: COMET
29
  type: comet
30
- value: 58.10
31
  ---
32
 
33
 
@@ -35,16 +38,28 @@ model-index:
35
 
36
  `quickmt-fr-en` is a reasonably fast and reasonably accurate neural machine translation model for translation from `fr` into `en`.
37
 
 
 
 
 
 
 
 
 
 
 
 
 
38
 
39
  ## Model Information
40
 
41
- * Trained using [`eole`](https://github.com/eole-nlp/eole)
42
- * 185M parameter transformer 'big' with 8 encoder layers and 2 decoder layers
43
- * 50k joint Sentencepiece vocabulary
44
  * Exported for fast inference to [CTranslate2](https://github.com/OpenNMT/CTranslate2) format
45
- * Training data: https://huggingface.co/datasets/quickmt/quickmt-train.fr-en/tree/main
46
 
47
- See the `eole` model configuration in this repository for further details.
48
 
49
 
50
  ## Usage with `quickmt`
@@ -55,7 +70,7 @@ Next, install the `quickmt` python library and download the model:
55
 
56
  ```bash
57
  git clone https://github.com/quickmt/quickmt.git
58
- pip install ./quickmt/
59
 
60
  quickmt-model-download quickmt/quickmt-fr-en ./quickmt-fr-en
61
  ```
@@ -66,31 +81,41 @@ Finally use the model in python:
66
  from quickmt import Translator
67
 
68
  # Auto-detects GPU, set to "cpu" to force CPU inference
69
- t = Translator("./quickmt-fr-en/", device="auto")
70
 
71
- # Translate - set beam size to 5 for higher quality (but slower speed)
72
- sample_text = "Résigny est une commune française située dans le département de l'Aisne, en région Hauts-de-France. "
73
- t(sample_text, beam_size=1)
 
 
74
 
 
 
 
75
  # Get alternative translations by sampling
76
  # You can pass any cTranslate2 `translate_batch` arguments
77
- t([sample_text], sampling_temperature=1.2, beam_size=1, sampling_topk=50, sampling_topp=0.9)
78
  ```
79
 
80
- The model is in `ctranslate2` format, and the tokenizers are `sentencepiece`, so you can use `ctranslate2` directly instead of through `quickmt`. It is also possible to get this model to work with e.g. [LibreTranslate](https://libretranslate.com/) which also uses `ctranslate2` and `sentencepiece`.
 
 
81
 
82
 
83
  ## Metrics
84
 
85
- `bleu` and `chrf2` are calculated with [sacrebleu](https://github.com/mjpost/sacrebleu) on the [Flores200 `devtest` test set](https://huggingface.co/datasets/facebook/flores) ("fra_Latn"->"eng_Latn"). `comet22` with the [`comet`](https://github.com/Unbabel/COMET) library and the [default model](https://huggingface.co/Unbabel/wmt22-comet-da). "Time (s)" is the time in seconds to translate (using `ctranslate2`) the flores-devtest dataset (1012 sentences) on an RTX 4070s GPU with batch size 32 (faster speed is possible using a large batch size).
 
86
 
87
- | Model | chrf2 | bleu | comet22 | Time (s) |
88
- | -------------------------------- | ----- | ------- | ------- | -------- |
89
- | quickmt/quickmt-fr-en | 68.22 | 44.28 | 88.86 | 1.1 |
90
- | Helsinki-NLP/opus-mt-fr-en | 66.85 | 41.71 | 88.31 | 3.6 |
91
- | facebook/m2m100_418M | 64.39 | 36.49 | 85.87 | 18.0 |
92
- | facebook/m2m100_1.2B | 66.51 | 41.69 | 88.00 | 34.6 |
93
- | facebook/nllb-200-distilled-600M | 67.82 | 44.04 | 88.47 | 21.7 |
94
- | facebook/nllb-200-distilled-1.3B | 69.30 | 46.22 | 89.24 | 37.1 |
 
 
 
95
 
96
- `quickmt-fr-en` is the fastest and is higher quality than `opus-mt-fr-en`, `m2m100_418m`, `m2m100_1.2B` and `nllb-200-distilled-600M`.
 
1
  ---
2
  language:
 
3
  - fr
4
+ - en
5
  tags:
6
  - translation
7
  license: cc-by-4.0
8
  datasets:
9
  - quickmt/quickmt-train.fr-en
10
+ - quickmt/madlad400-en-backtranslated-fr
11
+ - quickmt/newscrawl2024-en-backtranslated-fr
12
+ - quickmt/canadian_hansard
13
  model-index:
14
  - name: quickmt-fr-en
15
  results:
 
22
  type: flores_101
23
  args: fra_Latn eng_Latn devtest
24
  metrics:
 
 
 
25
  - name: BLEU
26
  type: bleu
27
+ value: 46.84
28
+ - name: CHRF
29
+ type: chrf
30
+ value: 69.87
31
  - name: COMET
32
  type: comet
33
+ value: 89.4
34
  ---
35
 
36
 
 
38
 
39
  `quickmt-fr-en` is a reasonably fast and reasonably accurate neural machine translation model for translation from `fr` into `en`.
40
 
41
+ `quickmt` models are roughly 3 times faster for GPU inference than OpusMT models and roughly [40 times](https://huggingface.co/spaces/quickmt/quickmt-vs-libretranslate) faster than [LibreTranslate](https://huggingface.co/spaces/quickmt/quickmt-vs-libretranslate)/[ArgosTranslate](https://github.com/argosopentech/argos-translate).
42
+
43
+
44
+ ## *UPDATED VERSION!*
45
+
46
+ The training data for this model was augmented with back-translated data, which improves translation quality!
47
+
48
+
49
+ ## Try it on our Hugging Face Space
50
+
51
+ Try the model before downloading it: https://huggingface.co/spaces/quickmt/QuickMT-Demo
52
+
53
 
54
  ## Model Information
55
 
56
+ * Trained using [`eole`](https://github.com/eole-nlp/eole)
57
+ * 200M parameter seq2seq transformer
58
+ * Separate 32k SentencePiece vocabularies for source and target
59
  * Exported for fast inference to [CTranslate2](https://github.com/OpenNMT/CTranslate2) format
60
+ * The PyTorch model (for use with [`eole`](https://github.com/eole-nlp/eole)) is available in this repository in the `eole-model` folder
61
 
62
+ See the `eole` model configuration in this repository for further details, and the `eole-model` folder for the raw `eole` (PyTorch) model.
63
 
64
 
65
  ## Usage with `quickmt`
 
70
 
71
  ```bash
72
  git clone https://github.com/quickmt/quickmt.git
73
+ pip install -e ./quickmt/
74
 
75
  quickmt-model-download quickmt/quickmt-fr-en ./quickmt-fr-en
76
  ```
 
81
  from quickmt import Translator
82
 
83
  # Auto-detects GPU, set to "cpu" to force CPU inference
84
+ mt = Translator("./quickmt-fr-en/", device="auto")
85
 
86
+ # Translate - set beam size to 1 for faster speed (but lower quality)
87
+ sample_text = "Le Dr Ehud Ur, professeur de médecine à l'Université Dalhousie de Halifax (Nouvelle-Écosse) et président de la division clinique et scientifique de l'Association canadienne du diabète, a averti que la recherche en était encore à ses débuts."
88
+
89
+ mt(sample_text, beam_size=5)
90
+ ```
91
 
92
+ > 'Dr. Ehud Ur, professor of medicine at Dalhousie University in Halifax, Nova Scotia and chair of the clinical and scientific division of the Canadian Diabetes Association, warned that research was still in its infancy.'
93
+
94
+ ```python
95
  # Get alternative translations by sampling
96
  # You can pass any cTranslate2 `translate_batch` arguments
97
+ mt([sample_text], sampling_temperature=1.2, beam_size=1, sampling_topk=50, sampling_topp=0.9)
98
  ```
99
 
100
+ > 'Dr. Ehud Ur, Professor of Medicine at Dalhousie University in Halifax, Nova Scotia, and Chair of the Clinical & Scientific Division of the Canadian Diabetes Association, warned the research was still in its infancy.'
101
+
102
+ The model is in `ctranslate2` format, and the tokenizers are `sentencepiece`, so you can use `ctranslate2` directly instead of through `quickmt`. It is also possible to get this model to work with e.g. [LibreTranslate](https://libretranslate.com/) which also uses `ctranslate2` and `sentencepiece`. A model in safetensors format to be used with `eole` is also provided.
103
 
104
 
105
  ## Metrics
106
 
107
+ `bleu` and `chrf2` are calculated with [sacrebleu](https://github.com/mjpost/sacrebleu) on the [Flores200 `devtest` test set](https://huggingface.co/datasets/facebook/flores). `comet22` is calculated with the [`comet`](https://github.com/Unbabel/COMET) library and the [default model](https://huggingface.co/Unbabel/wmt22-comet-da). "Time (s)" is the time in seconds to translate the flores-devtest dataset (1012 sentences) on an RTX 4070s GPU with batch size 32.
108
+
109
 
110
+ | Model | bleu | chrf2 | comet22 | Time (s) |
111
+ |:--------------------------------------|-------:|--------:|----------:|-----------:|
112
+ | quickmt/quickmt-fr-en | 46.84 | 69.87 | 89.4 | 1.08 |
113
+ | Helsinki-NLP/opus-mt-fr-en | 41.71 | 66.84 | 88.31 | 3.49 |
114
+ | facebook/nllb-200-distilled-600M | 44.05 | 67.81 | 88.48 | 21.52 |
115
+ | facebook/nllb-200-distilled-1.3B | 46.24 | 69.32 | 89.24 | 37.25 |
116
+ | facebook/m2m100_418M | 36.48 | 63.3 | 85.87 | 18.28 |
117
+ | facebook/m2m100_1.2B | 41.69 | 66.51 | 88 | 35.1 |
118
+ | tencent/HY-MT1.5-1.8B | 28.68 | 61.62 | 88.66 | 9 |
119
+ | tencent/Hunyuan-MT-7B-fp8 | 35.66 | 64.94 | 89.72 | 22 |
120
+ | CohereLabs/aya-expanse-8b (bnb quant) | 45.03 | 69.02 | 90.29 | 73.97 |
121
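
The "Time (s)" column of the Metrics table above can be turned into relative speedups and sentences per second, which makes the README's "roughly 3 times faster than OpusMT" claim easy to check. The timings below are copied from the table; the dictionary and variable names are illustrative only.

```python
# "Time (s)" column from the Metrics table: seconds to translate the
# 1012-sentence flores-devtest set on an RTX 4070s GPU with batch size 32
SECONDS = {
    "quickmt/quickmt-fr-en": 1.08,
    "Helsinki-NLP/opus-mt-fr-en": 3.49,
    "facebook/nllb-200-distilled-600M": 21.52,
    "facebook/nllb-200-distilled-1.3B": 37.25,
    "facebook/m2m100_418M": 18.28,
    "facebook/m2m100_1.2B": 35.1,
    "tencent/HY-MT1.5-1.8B": 9,
    "tencent/Hunyuan-MT-7B-fp8": 22,
    "CohereLabs/aya-expanse-8b (bnb quant)": 73.97,
}
N_SENTENCES = 1012
BASELINE = "quickmt/quickmt-fr-en"

# How many times longer each model takes than the quickmt model
slowdown = {name: t / SECONDS[BASELINE] for name, t in SECONDS.items()}
# Throughput in sentences per second
throughput = {name: N_SENTENCES / t for name, t in SECONDS.items()}

for name in sorted(SECONDS, key=SECONDS.get):
    print(f"{name:40s} {slowdown[name]:6.2f}x  {throughput[name]:8.1f} sent/s")
```

With these numbers, `opus-mt-fr-en` takes about 3.2 times as long as `quickmt-fr-en`, in line with the README's claim.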
 
 
eole-config.yaml CHANGED
@@ -1,5 +1,5 @@
1
  ## IO
2
- save_data: fr-en/data_spm
3
  overwrite: True
4
  seed: 1234
5
  report_every: 100
@@ -8,71 +8,84 @@ tensorboard: true
8
  tensorboard_log_dir: tensorboard
9
 
10
  ### Vocab
11
- src_vocab: fr-en/joint.eole.vocab
12
- tgt_vocab: fr-en/joint.eole.vocab
13
- src_vocab_size: 50000
14
- tgt_vocab_size: 50000
15
  vocab_size_multiple: 8
16
- share_vocab: True
17
  n_sample: 0
18
 
19
  data:
20
  corpus_1:
21
- path_src: hf://quickmt/quickmt-train.fr-en/fr
22
- path_tgt: hf://quickmt/quickmt-train.fr-en/en
23
- path_sco: hf://quickmt/quickmt-train.fr-en/sco
 
 
 
 
 
 
 
 
 
 
 
 
24
  valid:
25
- path_src: fr-en/dev.src
26
- path_tgt: fr-en/dev.tgt
27
 
28
  transforms: [sentencepiece, filtertoolong]
29
  transforms_configs:
30
  sentencepiece:
31
- src_subword_model: "fr-en/joint.spm.model"
32
- tgt_subword_model: "fr-en/joint.spm.model"
33
  filtertoolong:
34
  src_seq_length: 256
35
  tgt_seq_length: 256
36
 
37
  training:
38
  # Run configuration
39
- model_path: fr-en/model
40
  keep_checkpoint: 4
41
- save_checkpoint_steps: 2000
42
- train_steps: 100000
43
- valid_steps: 2000
44
 
45
  # Train on a single GPU
46
  world_size: 1
47
  gpu_ranks: [0]
48
 
49
- # Batching
 
50
  batch_type: "tokens"
51
- batch_size: 8192
52
- valid_batch_size: 8192
53
  batch_size_multiple: 8
54
- accum_count: [16]
55
  accum_steps: [0]
56
 
57
  # Optimizer & Compute
58
- compute_dtype: "bf16"
59
- optim: "pagedadamw8bit"
60
- #optim: "adamw"
61
- learning_rate: 2.0
62
- warmup_steps: 10000
63
  decay_method: "noam"
64
  adam_beta2: 0.998
65
 
66
  # Data loading
67
- bucket_size: 128000
68
  num_workers: 4
69
- prefetch_factor: 100
70
 
71
  # Hyperparams
72
  dropout_steps: [0]
73
  dropout: [0.1]
74
  attention_dropout: [0.1]
75
- max_grad_norm: 2
76
  label_smoothing: 0.1
77
  average_decay: 0.0001
78
  param_init_method: xavier_uniform
@@ -80,22 +93,22 @@ training:
80
 
81
  model:
82
  architecture: "transformer"
83
- layer_norm: standard
84
- share_embeddings: true
85
- share_decoder_embeddings: true
86
- add_ffnbias: true
87
- mlp_activation_fn: gelu
88
  add_estimator: false
 
89
  add_qkvbias: false
90
- norm_eps: 1e-6
91
- hidden_size: 1024
 
92
  encoder:
93
- layers: 8
94
  decoder:
95
  layers: 2
96
- heads: 8
97
  transformer_ff: 4096
98
  embeddings:
99
- word_vec_size: 1024
100
  position_encoding_type: "SinusoidalInterleaved"
101
 
 
 
1
  ## IO
2
+ save_data: data
3
  overwrite: True
4
  seed: 1234
5
  report_every: 100
 
8
  tensorboard_log_dir: tensorboard
9
 
10
  ### Vocab
11
+ src_vocab: fren/fr.eole.vocab
12
+ tgt_vocab: fren/en.eole.vocab
13
+ src_vocab_size: 32000
14
+ tgt_vocab_size: 32000
15
  vocab_size_multiple: 8
16
+ share_vocab: false
17
  n_sample: 0
18
 
19
  data:
20
  corpus_1:
21
+ path_src: fren/train.cleaned.filtered.fr
22
+ path_tgt: fren/train.cleaned.filtered.en
23
+ weight: 200
24
+ corpus_2:
25
+ path_src: ../data/newscrawl.backtrans.cleaned.filtered.fr
26
+ path_tgt: ../data/newscrawl.backtrans.cleaned.filtered.en
27
+ weight: 35
28
+ corpus_3:
29
+ path_src: ../data/madlad.backtrans.cleaned.filtered.fr
30
+ path_tgt: ../data/madlad.backtrans.cleaned.filtered.en
31
+ weight: 68
32
+ corpus_4:
33
+ path_src: ../data/hansard.fr
34
+ path_tgt: ../data/hansard.en
35
+ weight: 5
36
  valid:
37
+ path_src: fren/dev.fr
38
+ path_tgt: fren/dev.en
39
 
40
  transforms: [sentencepiece, filtertoolong]
41
  transforms_configs:
42
  sentencepiece:
43
+ src_subword_model: "fren/fr.spm.model"
44
+ tgt_subword_model: "fren/en.spm.model"
45
  filtertoolong:
46
  src_seq_length: 256
47
  tgt_seq_length: 256
48
 
49
  training:
50
  # Run configuration
51
+ model_path: quickmt-fr-en-eole-model
52
  keep_checkpoint: 4
53
+ train_steps: 200000
54
+ save_checkpoint_steps: 5000
55
+ valid_steps: 5000
56
 
57
  # Train on a single GPU
58
  world_size: 1
59
  gpu_ranks: [0]
60
 
61
+ # Batching: 120,000 tokens per update (batch_size 6000 x accum_count 20)
62
+ # On an RTX 5090: batch_size 15000, accum_count 8
63
  batch_type: "tokens"
64
+ batch_size: 6000
65
+ valid_batch_size: 2048
66
  batch_size_multiple: 8
67
+ accum_count: [20]
68
  accum_steps: [0]
69
 
70
  # Optimizer & Compute
71
+ compute_dtype: "fp16"
72
+ optim: "adamw"
73
+ #use_amp: True
74
+ learning_rate: 3.0
75
+ warmup_steps: 5000
76
  decay_method: "noam"
77
  adam_beta2: 0.998
78
 
79
  # Data loading
80
+ bucket_size: 256000
81
  num_workers: 4
82
+ prefetch_factor: 128
83
 
84
  # Hyperparams
85
  dropout_steps: [0]
86
  dropout: [0.1]
87
  attention_dropout: [0.1]
88
+ max_grad_norm: 0
89
  label_smoothing: 0.1
90
  average_decay: 0.0001
91
  param_init_method: xavier_uniform
 
93
 
94
  model:
95
  architecture: "transformer"
96
+ share_embeddings: false
97
+ share_decoder_embeddings: false
 
 
 
98
  add_estimator: false
99
+ add_ffnbias: true
100
  add_qkvbias: false
101
+ layer_norm: standard
102
+ mlp_activation_fn: gelu
103
+ hidden_size: 768
104
  encoder:
105
+ layers: 12
106
  decoder:
107
  layers: 2
108
+ heads: 16
109
  transformer_ff: 4096
110
  embeddings:
111
+ word_vec_size: 768
112
  position_encoding_type: "SinusoidalInterleaved"
113
 
114
+
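
A few quantities implied by the new `eole-config.yaml` can be derived with simple arithmetic. The sketch below uses values copied from the config and rests on two assumptions: that `eole` applies the standard Noam schedule (linear warmup, then inverse square root decay, scaled by `learning_rate` and `hidden_size ** -0.5`) as OpenNMT-style toolkits do, and that corpus `weight` values set the proportion of examples drawn from each corpus. The function names are illustrative, not part of `eole`.

```python
# Values copied from eole-config.yaml in this commit
BATCH_SIZE = 6000      # tokens per forward/backward pass
ACCUM_COUNT = 20       # gradient accumulation steps per update
WORLD_SIZE = 1         # single GPU
LEARNING_RATE = 3.0
WARMUP_STEPS = 5000
HIDDEN_SIZE = 768
CORPUS_WEIGHTS = {
    "corpus_1": 200,   # fren/train.cleaned.filtered
    "corpus_2": 35,    # newscrawl back-translations
    "corpus_3": 68,    # madlad back-translations
    "corpus_4": 5,     # hansard
}


def tokens_per_update(batch_size, accum_count, world_size=1):
    """Effective batch size in tokens for one optimizer step."""
    return batch_size * accum_count * world_size


def noam_lr(step, learning_rate=LEARNING_RATE, warmup_steps=WARMUP_STEPS,
            hidden_size=HIDDEN_SIZE):
    """Learning rate at a given step under the assumed Noam schedule."""
    step = max(step, 1)  # avoid 0 ** -0.5
    decay = min(step ** -0.5, step * warmup_steps ** -1.5)
    return learning_rate * hidden_size ** -0.5 * decay


def sampling_shares(weights):
    """Fraction of training examples per corpus, assuming weights are proportional."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


print(tokens_per_update(BATCH_SIZE, ACCUM_COUNT, WORLD_SIZE))  # 120000
print(tokens_per_update(15000, 8))                             # 120000 (RTX 5090 note)
print(f"peak lr at step {WARMUP_STEPS}: {noam_lr(WARMUP_STEPS):.6f}")
for name, share in sampling_shares(CORPUS_WEIGHTS).items():
    print(f"{name}: {share:.1%}")
```

Both batching settings mentioned in the config comments give the same 120,000 tokens per update, so the RTX 5090 variant changes memory use per pass rather than the effective batch size.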
eole-model/config.json CHANGED
@@ -1,126 +1,73 @@
1
  {
 
 
 
 
2
  "seed": 1234,
3
- "transforms": [
4
- "sentencepiece",
5
- "filtertoolong"
6
- ],
7
- "report_every": 100,
8
- "save_data": "fr-en/data_spm",
9
- "src_vocab_size": 50000,
10
- "share_vocab": true,
11
- "overwrite": true,
12
- "tgt_vocab": "fr-en/joint.eole.vocab",
13
  "valid_metrics": [
14
  "BLEU"
15
  ],
16
- "tensorboard_log_dir_dated": "tensorboard/Feb-17_09-24-56",
17
- "src_vocab": "fr-en/joint.eole.vocab",
 
 
18
  "tensorboard_log_dir": "tensorboard",
19
- "tensorboard": true,
 
 
 
 
20
  "n_sample": 0,
21
- "tgt_vocab_size": 50000,
22
  "vocab_size_multiple": 8,
 
23
  "training": {
24
- "adam_beta2": 0.998,
25
- "dropout_steps": [
26
- 0
 
 
 
 
27
  ],
 
28
  "param_init_method": "xavier_uniform",
29
- "accum_steps": [
30
- 0
31
- ],
32
- "batch_size": 8192,
33
  "batch_size_multiple": 8,
34
  "gpu_ranks": [
35
  0
36
  ],
37
- "model_path": "fr-en/model3",
38
- "learning_rate": 2.0,
39
- "bucket_size": 128000,
40
- "train_steps": 100000,
41
- "label_smoothing": 0.1,
42
- "num_workers": 0,
43
- "world_size": 1,
44
- "compute_dtype": "torch.bfloat16",
45
- "save_checkpoint_steps": 2000,
46
- "dropout": [
47
- 0.1
48
  ],
49
- "decay_method": "noam",
50
- "keep_checkpoint": 4,
51
- "optim": "pagedadamw8bit",
52
- "normalization": "tokens",
53
- "valid_batch_size": 8192,
54
- "batch_type": "tokens",
55
- "warmup_steps": 10000,
56
  "average_decay": 0.0001,
57
- "prefetch_factor": 100,
58
- "valid_steps": 2000,
59
- "accum_count": [
60
- 16
 
 
61
  ],
62
- "attention_dropout": [
 
 
 
 
 
 
 
 
 
63
  0.1
64
  ],
65
- "max_grad_norm": 2.0
66
- },
67
- "model": {
68
- "share_decoder_embeddings": true,
69
- "hidden_size": 1024,
70
- "mlp_activation_fn": "gelu",
71
- "add_estimator": false,
72
- "add_ffnbias": true,
73
- "share_embeddings": true,
74
- "norm_eps": 1e-06,
75
- "transformer_ff": 4096,
76
- "position_encoding_type": "SinusoidalInterleaved",
77
- "layer_norm": "standard",
78
- "architecture": "transformer",
79
- "add_qkvbias": false,
80
- "heads": 8,
81
- "encoder": {
82
- "layer_norm": "standard",
83
- "rope_config": null,
84
- "encoder_type": "transformer",
85
- "hidden_size": 1024,
86
- "add_qkvbias": false,
87
- "layers": 8,
88
- "src_word_vec_size": 1024,
89
- "add_ffnbias": true,
90
- "n_positions": null,
91
- "norm_eps": 1e-06,
92
- "mlp_activation_fn": "gelu",
93
- "heads": 8,
94
- "transformer_ff": 4096,
95
- "position_encoding_type": "SinusoidalInterleaved"
96
- },
97
- "embeddings": {
98
- "word_vec_size": 1024,
99
- "position_encoding_type": "SinusoidalInterleaved",
100
- "src_word_vec_size": 1024,
101
- "tgt_word_vec_size": 1024
102
- },
103
- "decoder": {
104
- "layer_norm": "standard",
105
- "decoder_type": "transformer",
106
- "rope_config": null,
107
- "tgt_word_vec_size": 1024,
108
- "hidden_size": 1024,
109
- "add_qkvbias": false,
110
- "layers": 2,
111
- "add_ffnbias": true,
112
- "n_positions": null,
113
- "norm_eps": 1e-06,
114
- "mlp_activation_fn": "gelu",
115
- "heads": 8,
116
- "transformer_ff": 4096,
117
- "position_encoding_type": "SinusoidalInterleaved"
118
- }
119
  },
120
  "transforms_configs": {
121
  "sentencepiece": {
122
- "tgt_subword_model": "${MODEL_PATH}/joint.spm.model",
123
- "src_subword_model": "${MODEL_PATH}/joint.spm.model"
124
  },
125
  "filtertoolong": {
126
  "src_seq_length": 256,
@@ -129,22 +76,101 @@
129
  },
130
  "data": {
131
  "corpus_1": {
 
 
 
132
  "transforms": [
133
  "sentencepiece",
134
  "filtertoolong"
135
  ],
 
 
 
 
 
136
  "path_align": null,
137
- "path_src": "fr-en/train.cleaned.src",
138
- "path_tgt": "fr-en/train.cleaned.tgt"
 
 
 
139
  },
140
- "valid": {
 
 
 
 
 
 
 
 
 
 
 
 
 
141
  "transforms": [
142
  "sentencepiece",
143
  "filtertoolong"
144
  ],
 
 
 
 
 
145
  "path_align": null,
146
- "path_src": "fr-en/dev.src",
147
- "path_tgt": "fr-en/dev.tgt"
148
  }
149
  }
150
  }
 
1
  {
2
+ "tensorboard": true,
3
+ "tensorboard_log_dir_dated": "tensorboard/Jan-11_15-01-39",
4
+ "src_vocab_size": 32000,
5
+ "src_vocab": "fren/fr.eole.vocab",
6
  "seed": 1234,
 
 
 
 
 
 
 
 
 
 
7
  "valid_metrics": [
8
  "BLEU"
9
  ],
10
+ "overwrite": true,
11
+ "share_vocab": false,
12
+ "save_data": "data",
13
+ "report_every": 100,
14
  "tensorboard_log_dir": "tensorboard",
15
+ "transforms": [
16
+ "sentencepiece",
17
+ "filtertoolong"
18
+ ],
19
+ "tgt_vocab": "fren/en.eole.vocab",
20
  "n_sample": 0,
 
21
  "vocab_size_multiple": 8,
22
+ "tgt_vocab_size": 32000,
23
  "training": {
24
+ "prefetch_factor": 128,
25
+ "optim": "adamw",
26
+ "keep_checkpoint": 4,
27
+ "world_size": 1,
28
+ "decay_method": "noam",
29
+ "attention_dropout": [
30
+ 0.1
31
  ],
32
+ "max_grad_norm": 0.0,
33
  "param_init_method": "xavier_uniform",
34
+ "normalization": "tokens",
 
 
 
35
  "batch_size_multiple": 8,
36
  "gpu_ranks": [
37
  0
38
  ],
39
+ "accum_count": [
40
+ 20
 
 
 
 
 
 
 
 
 
41
  ],
 
 
 
 
 
 
 
42
  "average_decay": 0.0001,
43
+ "batch_size": 6000,
44
+ "compute_dtype": "torch.float16",
45
+ "adam_beta2": 0.998,
46
+ "valid_steps": 5000,
47
+ "dropout_steps": [
48
+ 0
49
  ],
50
+ "train_steps": 200000,
51
+ "warmup_steps": 5000,
52
+ "learning_rate": 3.0,
53
+ "num_workers": 0,
54
+ "save_checkpoint_steps": 5000,
55
+ "accum_steps": [
56
+ 0
57
+ ],
58
+ "batch_type": "tokens",
59
+ "dropout": [
60
  0.1
61
  ],
62
+ "bucket_size": 256000,
63
+ "label_smoothing": 0.1,
64
+ "model_path": "quickmt-fr-en-eole-model",
65
+ "valid_batch_size": 2048
66
  },
67
  "transforms_configs": {
68
  "sentencepiece": {
69
+ "src_subword_model": "${MODEL_PATH}/fr.spm.model",
70
+ "tgt_subword_model": "${MODEL_PATH}/en.spm.model"
71
  },
72
  "filtertoolong": {
73
  "src_seq_length": 256,
 
76
  },
77
  "data": {
78
  "corpus_1": {
79
+ "path_src": "fren/train.cleaned.filtered.fr",
80
+ "path_tgt": "fren/train.cleaned.filtered.en",
81
+ "path_align": null,
82
  "transforms": [
83
  "sentencepiece",
84
  "filtertoolong"
85
  ],
86
+ "weight": 200
87
+ },
88
+ "corpus_2": {
89
+ "path_src": "../data/newscrawl.backtrans.cleaned.filtered.fr",
90
+ "path_tgt": "../data/newscrawl.backtrans.cleaned.filtered.en",
91
  "path_align": null,
92
+ "transforms": [
93
+ "sentencepiece",
94
+ "filtertoolong"
95
+ ],
96
+ "weight": 35
97
  },
98
+ "corpus_3": {
99
+ "path_src": "../data/madlad.backtrans.cleaned.filtered.fr",
100
+ "path_tgt": "../data/madlad.backtrans.cleaned.filtered.en",
101
+ "path_align": null,
102
+ "transforms": [
103
+ "sentencepiece",
104
+ "filtertoolong"
105
+ ],
106
+ "weight": 68
107
+ },
108
+ "corpus_4": {
109
+ "path_src": "../data/hansard.fr",
110
+ "path_tgt": "../data/hansard.en",
111
+ "path_align": null,
112
  "transforms": [
113
  "sentencepiece",
114
  "filtertoolong"
115
  ],
116
+ "weight": 5
117
+ },
118
+ "valid": {
119
+ "path_src": "fren/dev.fr",
120
+ "path_tgt": "fren/dev.en",
121
  "path_align": null,
122
+ "transforms": [
123
+ "sentencepiece",
124
+ "filtertoolong"
125
+ ]
126
+ }
127
+ },
128
+ "model": {
129
+ "position_encoding_type": "SinusoidalInterleaved",
130
+ "share_decoder_embeddings": false,
131
+ "add_qkvbias": false,
132
+ "architecture": "transformer",
133
+ "add_estimator": false,
134
+ "hidden_size": 768,
135
+ "share_embeddings": false,
136
+ "layer_norm": "standard",
137
+ "add_ffnbias": true,
138
+ "mlp_activation_fn": "gelu",
139
+ "heads": 16,
140
+ "transformer_ff": 4096,
141
+ "decoder": {
142
+ "transformer_ff": 4096,
143
+ "position_encoding_type": "SinusoidalInterleaved",
144
+ "add_qkvbias": false,
145
+ "tgt_word_vec_size": 768,
146
+ "n_positions": null,
147
+ "decoder_type": "transformer",
148
+ "hidden_size": 768,
149
+ "layer_norm": "standard",
150
+ "add_ffnbias": true,
151
+ "mlp_activation_fn": "gelu",
152
+ "heads": 16,
153
+ "layers": 2
154
+ },
155
+ "encoder": {
156
+ "encoder_type": "transformer",
157
+ "transformer_ff": 4096,
158
+ "position_encoding_type": "SinusoidalInterleaved",
159
+ "src_word_vec_size": 768,
160
+ "add_qkvbias": false,
161
+ "n_positions": null,
162
+ "hidden_size": 768,
163
+ "layer_norm": "standard",
164
+ "add_ffnbias": true,
165
+ "mlp_activation_fn": "gelu",
166
+ "heads": 16,
167
+ "layers": 12
168
+ },
169
+ "embeddings": {
170
+ "tgt_word_vec_size": 768,
171
+ "word_vec_size": 768,
172
+ "position_encoding_type": "SinusoidalInterleaved",
173
+ "src_word_vec_size": 768
174
  }
175
  }
176
  }
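
The `model` section above is enough to sanity-check the README's "200M parameter" figure. The estimate below is a rough sketch under stated assumptions, not a dump of the real checkpoint: attention projections without biases (the config sets `add_qkvbias` to false; an unbiased output projection is assumed), feed-forward layers with biases (`add_ffnbias` is true), two layer norms per encoder layer and three per decoder layer plus one final norm each, vocabularies of exactly 32000 entries, untied embeddings and an output projection with bias. Sinusoidal position encodings add no learned parameters.

```python
def attention_params(d):
    # Q, K, V and output projections, each d x d, no biases (assumed)
    return 4 * d * d


def ffn_params(d, ff):
    # Two linear layers with biases
    return d * ff + ff + ff * d + d


def layer_norm_params(d):
    # Scale and shift vectors
    return 2 * d


def estimate_params(d=768, ff=4096, enc_layers=12, dec_layers=2,
                    src_vocab=32000, tgt_vocab=32000):
    """Rough parameter count for the configuration in this commit."""
    enc_layer = attention_params(d) + ffn_params(d, ff) + 2 * layer_norm_params(d)
    # Decoder layers have self-attention and cross-attention
    dec_layer = 2 * attention_params(d) + ffn_params(d, ff) + 3 * layer_norm_params(d)
    parts = {
        "encoder": enc_layers * enc_layer + layer_norm_params(d),
        "decoder": dec_layers * dec_layer + layer_norm_params(d),
        "embeddings": (src_vocab + tgt_vocab) * d,   # share_embeddings: false
        "generator": tgt_vocab * d + tgt_vocab,      # share_decoder_embeddings: false
    }
    parts["total"] = sum(parts.values())
    return parts


for name, count in estimate_params().items():
    print(f"{name:10s} {count / 1e6:7.1f}M")
```

Under these assumptions the total comes out just under 200M, with roughly half of it in the 12 encoder layers; the real checkpoint may differ slightly.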
eole-model/en.spm.model ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:ce51817f4aabdac074cccee54167581e681e9cbade82b563d70d64f7be958e4d
3
+ size 799063
eole-model/fr.spm.model ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:1e41eb3cab1f4e5caf55370cd23362ace5839cec791411da61d159365b6a6451
3
+ size 816271
eole-model/model.00.safetensors CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:7a28ad097a4ed4a2bd3f4b6da5731c5f4d2cf664cc28a54d62d2b150b1f68e0c
3
- size 762769904
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:540388bd6955f125e9e6f79e8dd51aecb54dca6c982a8df30e1cad79bb2a1b1b
3
+ size 829569112
eole-model/vocab.json CHANGED
The diff for this file is too large to render. See raw diff
 
model.bin CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:7a5c0a467de1c354122644c49dc5fad47b4c38b1eeb6ddfa841f3d6e3b2a699b
3
- size 381336824
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:c0657044846861a905f959bb4950a155e0a41c2bf6187a2c47fa86e2a79d3582
3
+ size 407101843
source_vocabulary.json ADDED
The diff for this file is too large to render. See raw diff
 
src.spm.model ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:1e41eb3cab1f4e5caf55370cd23362ace5839cec791411da61d159365b6a6451
3
+ size 816271
target_vocabulary.json ADDED
The diff for this file is too large to render. See raw diff
 
tgt.spm.model ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:ce51817f4aabdac074cccee54167581e681e9cbade82b563d70d64f7be958e4d
3
+ size 799063
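
The binary files in this commit appear as Git LFS pointers (`version`, `oid sha256:...`, `size`). After downloading a file, it can be checked against its pointer. The sketch below is a generic check using only the standard library; the function names are illustrative, and the example pointer is the one shown for `src.spm.model` (the diff shows the same `oid` for `eole-model/fr.spm.model`, so the two files are identical).

```python
import hashlib
from pathlib import Path

# Pointer text as shown in this commit for src.spm.model
POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:1e41eb3cab1f4e5caf55370cd23362ace5839cec791411da61d159365b6a6451
size 816271"""


def parse_lfs_pointer(text):
    """Parse the 'key value' lines of a Git LFS pointer file."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path, pointer, chunk_size=1 << 20):
    """Return True if the file at `path` has the pointer's size and hash."""
    path = Path(path)
    if path.stat().st_size != pointer["size"]:
        return False
    hasher = hashlib.new(pointer["algo"])
    with path.open("rb") as f:
        # Read in chunks so large model files do not need to fit in memory
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == pointer["digest"]
```

For example, `matches_pointer("./quickmt-fr-en/src.spm.model", parse_lfs_pointer(POINTER))` should be true for an intact download, assuming the model was saved to that path.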