radinplaid committed · verified
Commit a81dfb3 · 1 Parent(s): f969727

Upload folder using huggingface_hub

.ipynb_checkpoints/EADME-checkpoint.md ADDED
@@ -0,0 +1,116 @@
+ ---
+ language:
+ - ar
+ - en
+ tags:
+ - translation
+ license: cc-by-4.0
+ datasets:
+ - quickmt/quickmt-train.ar-en
+ - quickmt/madlad400-en-backtranslated-ar
+ - quickmt/newscrawl2024-en-backtranslated-ar
+ model-index:
+ - name: quickmt-ar-en
+   results:
+   - task:
+       name: Translation arb-eng
+       type: translation
+       args: arb-eng
+     dataset:
+       name: flores101-devtest
+       type: flores_101
+       args: arb_Arab eng_Latn devtest
+     metrics:
+     - name: BLEU
+       type: bleu
+       value: 44.11
+     - name: CHRF
+       type: chrf
+       value: 67.96
+     - name: COMET
+       type: comet
+       value: 87.64
+ ---
+
+ # `quickmt-ar-en` Neural Machine Translation Model
+
+ `quickmt-ar-en` is a reasonably fast and reasonably accurate neural machine translation model for translation from `ar` into `en`.
+
+ `quickmt` models are roughly 3 times faster for GPU inference than OpusMT models and roughly [40 times](https://huggingface.co/spaces/quickmt/quickmt-vs-libretranslate) faster than [LibreTranslate](https://huggingface.co/spaces/quickmt/quickmt-vs-libretranslate)/[ArgosTranslate](https://github.com/argosopentech/argos-translate).
+
+
+ ## *UPDATED VERSION!*
+
+ This model was trained with back-translated data and has improved translation quality!
+
+
+ ## Try it on our Hugging Face Space
+
+ Give it a try before downloading here: https://huggingface.co/spaces/quickmt/QuickMT-Demo
+
+
+ ## Model Information
+
+ * Trained using [`eole`](https://github.com/eole-nlp/eole)
+ * 200M parameter seq2seq transformer
+ * Separate 32k SentencePiece vocabularies for source and target
+ * Exported for fast inference to [CTranslate2](https://github.com/OpenNMT/CTranslate2) format
+ * The pytorch model (for use with [`eole`](https://github.com/eole-nlp/eole)) is available in this repository in the `eole-model` folder
+
+ See the `eole` model configuration in this repository for further details and the `eole-model` for the raw `eole` (pytorch) model.
+
+
+ ## Usage with `quickmt`
+
+ You must install the Nvidia CUDA toolkit first if you want to do GPU inference.
+
+ Next, install the `quickmt` python library and download the model:
+
+ ```bash
+ git clone https://github.com/quickmt/quickmt.git
+ pip install -e ./quickmt/
+
+ quickmt-model-download quickmt/quickmt-ar-en ./quickmt-ar-en
+ ```
+
+ Finally use the model in python:
+
+ ```python
+ from quickmt import Translator
+
+ # Auto-detects GPU, set to "cpu" to force CPU inference
+ mt = Translator("./quickmt-ar-en/", device="auto")
+
+ # Translate - set beam size to 1 for faster speed (but lower quality)
+ sample_text = 'نبه الدكتور إيهود أور -أستاذ الطب في جامعة دالهوزي في هاليفاكس، نوفا سكوتيا ورئيس الشعبة الطبية والعلمية في الجمعية الكندية للسكري- إلى أن البحث لا يزال في أيامه الأولى.'
+
+ mt(sample_text, beam_size=5)
+ ```
+
+ > 'Dr. Ehud Orr, a professor of medicine at Dalhousie University in Halifax, Nova Scotia and head of the medical and scientific division of the Canadian Diabetes Association, warned that the research is still in its early days.'
+
+ ```python
+ # Get alternative translations by sampling
+ # You can pass any cTranslate2 `translate_batch` arguments
+ mt([sample_text], sampling_temperature=1.2, beam_size=1, sampling_topk=50, sampling_topp=0.9)
+ ```
+
+ > 'Dr. Ehr, professor of medicine at Dalhousie University in Halifax, Nova Scotia and chief of the medical and scientific division at the Canadian Diabetes Association, warned that the research is still in its early days.'
+
+ The model is in `ctranslate2` format, and the tokenizers are `sentencepiece`, so you can use `ctranslate2` directly instead of through `quickmt`. It is also possible to get this model to work with e.g. [LibreTranslate](https://libretranslate.com/) which also uses `ctranslate2` and `sentencepiece`. A model in safetensors format to be used with `eole` is also provided.
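+
+ For example, here is a minimal sketch of direct `ctranslate2` use, assuming the downloaded folder contains `model.bin`, `src.spm.model` and `tgt.spm.model` (as the file listing of this repository suggests) and reusing `sample_text` from above:
+
+ ```python
+ import ctranslate2
+ import sentencepiece as spm
+
+ # Load the CTranslate2 model and the source/target SentencePiece tokenizers
+ translator = ctranslate2.Translator("./quickmt-ar-en", device="auto")
+ src_sp = spm.SentencePieceProcessor(model_file="./quickmt-ar-en/src.spm.model")
+ tgt_sp = spm.SentencePieceProcessor(model_file="./quickmt-ar-en/tgt.spm.model")
+
+ # Tokenize, translate, detokenize
+ tokens = src_sp.encode(sample_text, out_type=str)
+ result = translator.translate_batch([tokens], beam_size=5)
+ print(tgt_sp.decode(result[0].hypotheses[0]))
+ ```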
+
+
+ ## Metrics
+
+ `bleu` and `chrf2` are calculated with [sacrebleu](https://github.com/mjpost/sacrebleu) on the [Flores200 `devtest` test set](https://huggingface.co/datasets/facebook/flores). `comet22` is calculated with the [`comet`](https://github.com/Unbabel/COMET) library and the [default model](https://huggingface.co/Unbabel/wmt22-comet-da). "Time (s)" is the time in seconds to translate the flores-devtest dataset (1012 sentences) on an RTX 4070s GPU with batch size 32.
+
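+ Scores like these can be reproduced with a short script; a minimal sketch, assuming `srcs`, `hyps` and `refs` are lists of source sentences, system translations and reference translations (the variable names are illustrative):
+
+ ```python
+ import sacrebleu
+ from comet import download_model, load_from_checkpoint
+
+ # Corpus-level BLEU and chrF (sacrebleu's CHRF defaults to beta=2, i.e. chrF2)
+ bleu = sacrebleu.corpus_bleu(hyps, [refs])
+ chrf = sacrebleu.corpus_chrf(hyps, [refs])
+
+ # COMET with the default wmt22-comet-da model
+ comet_model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
+ comet = comet_model.predict(
+     [{"src": s, "mt": h, "ref": r} for s, h, r in zip(srcs, hyps, refs)],
+     batch_size=32,
+ )
+
+ print(bleu.score, chrf.score, comet.system_score)
+ ```
+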
+ |                                  |   bleu |   chrf2 |   comet22 |   Time (s) |
+ |:---------------------------------|-------:|--------:|----------:|-----------:|
+ | quickmt/quickmt-ar-en            |  44.11 |   67.96 |     87.64 |       1.11 |
+ | Helsinki-NLP/opus-mt-ar-en       |  34.22 |   61.26 |     84.5  |       3.67 |
+ | facebook/nllb-200-distilled-600M |  39.13 |   64.14 |     86.22 |      21.76 |
+ | facebook/nllb-200-distilled-1.3B |  42.29 |   66.55 |     87.55 |      37.7  |
+ | facebook/m2m100_418M             |  29.41 |   57.68 |     82.21 |      18.53 |
+ | facebook/m2m100_1.2B             |  29.77 |   56.7  |     80.77 |      36.23 |

README.md CHANGED
@@ -1,33 +1,35 @@
  ---
  language:
- - en
  - ar
+ - en
  tags:
  - translation
  license: cc-by-4.0
  datasets:
  - quickmt/quickmt-train.ar-en
+ - quickmt/madlad400-en-backtranslated-ar
+ - quickmt/newscrawl2024-en-backtranslated-ar
  model-index:
  - name: quickmt-ar-en
    results:
    - task:
-       name: Translation ara-eng
+       name: Translation arb-eng
        type: translation
-       args: ara-eng
+       args: arb-eng
      dataset:
        name: flores101-devtest
        type: flores_101
-       args: ara_Arab eng_Latn devtest
+       args: arb_Arab eng_Latn devtest
      metrics:
-     - name: CHRF
-       type: chrf
-       value: 66.98
      - name: BLEU
        type: bleu
-       value: 42.79
+       value: 44.11
+     - name: CHRF
+       type: chrf
+       value: 67.96
      - name: COMET
        type: comet
-       value: 87.4
+       value: 87.64
  ---
@@ -35,14 +37,26 @@ model-index:
 
  `quickmt-ar-en` is a reasonably fast and reasonably accurate neural machine translation model for translation from `ar` into `en`.
 
+ `quickmt` models are roughly 3 times faster for GPU inference than OpusMT models and roughly [40 times](https://huggingface.co/spaces/quickmt/quickmt-vs-libretranslate) faster than [LibreTranslate](https://huggingface.co/spaces/quickmt/quickmt-vs-libretranslate)/[ArgosTranslate](https://github.com/argosopentech/argos-translate).
+
+
+ ## *UPDATED VERSION!*
+
+ This model was trained with back-translated data and has improved translation quality!
+
+
+ ## Try it on our Hugging Face Space
+
+ Give it a try before downloading here: https://huggingface.co/spaces/quickmt/QuickMT-Demo
+
 
  ## Model Information
 
- * Trained using [`eole`](https://github.com/eole-nlp/eole)
- * 185M parameter transformer 'big' with 8 encoder layers and 2 decoder layers
- * 20k sentencepiece vocabularies
+ * Trained using [`eole`](https://github.com/eole-nlp/eole)
+ * 200M parameter seq2seq transformer
+ * Separate 32k SentencePiece vocabularies for source and target
  * Exported for fast inference to [CTranslate2](https://github.com/OpenNMT/CTranslate2) format
- * Training data: https://huggingface.co/datasets/quickmt/quickmt-train.ar-en/tree/main
+ * The pytorch model (for use with [`eole`](https://github.com/eole-nlp/eole)) is available in this repository in the `eole-model` folder
 
  See the `eole` model configuration in this repository for further details and the `eole-model` for the raw `eole` (pytorch) model.
 
@@ -55,7 +69,7 @@ Next, install the `quickmt` python library and download the model:
 
  ```bash
  git clone https://github.com/quickmt/quickmt.git
- pip install ./quickmt/
+ pip install -e ./quickmt/
 
  quickmt-model-download quickmt/quickmt-ar-en ./quickmt-ar-en
  ```
@@ -66,35 +80,37 @@ Finally use the model in python:
  from quickmt import Translator
 
  # Auto-detects GPU, set to "cpu" to force CPU inference
- t = Translator("./quickmt-ar-en/", device="auto")
+ mt = Translator("./quickmt-ar-en/", device="auto")
 
- # Translate - set beam size to 5 for higher quality (but slower speed)
+ # Translate - set beam size to 1 for faster speed (but lower quality)
  sample_text = 'نبه الدكتور إيهود أور -أستاذ الطب في جامعة دالهوزي في هاليفاكس، نوفا سكوتيا ورئيس الشعبة الطبية والعلمية في الجمعية الكندية للسكري- إلى أن البحث لا يزال في أيامه الأولى.'
- t(sample_text, beam_size=5)
 
- > 'Dr. Ehud Orr, professor of medicine at Dalhousie University in Halifax, Nova Scotia and head of the medical and scientific division of the Canadian Diabetes Association, warned that the research is still in its early days.'
+ mt(sample_text, beam_size=5)
+ ```
+
+ > 'Dr. Ehud Orr, a professor of medicine at Dalhousie University in Halifax, Nova Scotia and head of the medical and scientific division of the Canadian Diabetes Association, warned that the research is still in its early days.'
 
+ ```python
  # Get alternative translations by sampling
  # You can pass any cTranslate2 `translate_batch` arguments
- t([sample_text], sampling_temperature=1.2, beam_size=1, sampling_topk=50, sampling_topp=0.9)
-
- > 'Professor of Medicine at Dalhousie University in Halifax, Nova Scotia and chairman of the Medical and Scientific Division at the Canadian Diabetes Society, cautioned that the research was still in its early days.'
+ mt([sample_text], sampling_temperature=1.2, beam_size=1, sampling_topk=50, sampling_topp=0.9)
  ```
 
- The model is in `ctranslate2` format, and the tokenizers are `sentencepiece`, so you can use `ctranslate2` directly instead of through `quickmt`. It is also possible to get this model to work with e.g. [LibreTranslate](https://libretranslate.com/) which also uses `ctranslate2` and `sentencepiece`.
+ > 'Dr. Ehr, professor of medicine at Dalhousie University in Halifax, Nova Scotia and chief of the medical and scientific division at the Canadian Diabetes Association, warned that the research is still in its early days.'
+
+ The model is in `ctranslate2` format, and the tokenizers are `sentencepiece`, so you can use `ctranslate2` directly instead of through `quickmt`. It is also possible to get this model to work with e.g. [LibreTranslate](https://libretranslate.com/) which also uses `ctranslate2` and `sentencepiece`. A model in safetensors format to be used with `eole` is also provided.
 
 
  ## Metrics
 
- `bleu` and `chrf2` are calculated with [sacrebleu](https://github.com/mjpost/sacrebleu) on the [Flores200 `devtest` test set](https://huggingface.co/datasets/facebook/flores) ("ara_Arab"->"eng_Latn"). `comet22` with the [`comet`](https://github.com/Unbabel/COMET) library and the [default model](https://huggingface.co/Unbabel/wmt22-comet-da). "Time (s)" is the time in seconds to translate (using `ctranslate2`) the flores-devtest dataset (1012 sentences) on an RTX 4070s GPU with batch size 32 (faster speed is possible using a large batch size).
+ `bleu` and `chrf2` are calculated with [sacrebleu](https://github.com/mjpost/sacrebleu) on the [Flores200 `devtest` test set](https://huggingface.co/datasets/facebook/flores). `comet22` is calculated with the [`comet`](https://github.com/Unbabel/COMET) library and the [default model](https://huggingface.co/Unbabel/wmt22-comet-da). "Time (s)" is the time in seconds to translate the flores-devtest dataset (1012 sentences) on an RTX 4070s GPU with batch size 32.
+
 
  | | bleu | chrf2 | comet22 | Time (s) |
  |:---------------------------------|-------:|--------:|----------:|-----------:|
- | quickmt/quickmt-ar-en | 42.79 | 66.98 | 87.4 | 0.88 |
- | Helsink-NLP/opus-mt-ar-en | 34.22 | 61.26 | 84.5 | 3.78 |
- | facebook/nllb-200-distilled-600M | 39.13 | 64.14 | 86.22 | 21.58 |
+ | quickmt/quickmt-ar-en | 44.11 | 67.96 | 87.64 | 1.11 |
+ | Helsinki-NLP/opus-mt-ar-en | 34.22 | 61.26 | 84.5 | 3.67 |
+ | facebook/nllb-200-distilled-600M | 39.13 | 64.14 | 86.22 | 21.76 |
  | facebook/nllb-200-distilled-1.3B | 42.29 | 66.55 | 87.55 | 37.7 |
- | facebook/m2m100_418M | 29.41 | 57.68 | 82.21 | 18.5 |
+ | facebook/m2m100_418M | 29.41 | 57.68 | 82.21 | 18.53 |
  | facebook/m2m100_1.2B | 29.77 | 56.7 | 80.77 | 36.23 |
-
- `quickmt-ar-en` is the fastest and highest quality.
 
eole-config.yaml CHANGED
@@ -1,5 +1,5 @@
  ## IO
- save_data: ar-en/data
+ save_data: data
  overwrite: True
  seed: 1234
  report_every: 100
@@ -10,72 +10,100 @@ tensorboard_log_dir: tensorboard
  ### Vocab
  src_vocab: ar.eole.vocab
  tgt_vocab: en.eole.vocab
- src_vocab_size: 20000
- tgt_vocab_size: 20000
+ src_vocab_size: 32000
+ tgt_vocab_size: 32000
  vocab_size_multiple: 8
  share_vocab: false
  n_sample: 0
 
  data:
      corpus_1:
-         path_src: hf://quickmt/quickmt-train.ar-en/ar
-         path_tgt: hf://quickmt/quickmt-train.ar-en/en
-         path_sco: hf://quickmt/quickmt-train.ar-en/sco
+         path_src: train.ar
+         path_tgt: train.en
+         weight: 2
+     corpus_2:
+         path_src: newscrawl.backtrans.ar
+         path_tgt: newscrawl.2024.en
+         weight: 1
+     corpus_3:
+         path_src: madlad.backtrans.ar
+         path_tgt: madlad.en
+         weight: 2
      valid:
-         path_src: flores-dev.ar
-         path_tgt: flores-dev.en
+         path_src: valid.ar
+         path_tgt: valid.en
+
+ # data:
+ #   corpus_1:
+ #     path_src: hf://quickmt/quickmt-train.ar-en/ar
+ #     path_tgt: hf://quickmt/quickmt-train.ar-en/en
+ #     path_sco: hf://quickmt/quickmt-train.ar-en/sco
+ #     weight: 2
+ #   corpus_2:
+ #     path_src: hf://quickmt/newscrawl2024-en-backtranslated-ar/ar
+ #     path_tgt: hf://quickmt/newscrawl2024-en-backtranslated-ar/en
+ #     path_sco: hf://quickmt/newscrawl2024-en-backtranslated-ar/sco
+ #     weight: 1
+ #   corpus_3:
+ #     path_src: hf://quickmt/madlad400-en-backtranslated-ar/ar
+ #     path_tgt: hf://quickmt/madlad400-en-backtranslated-ar/en
+ #     path_sco: hf://quickmt/madlad400-en-backtranslated-ar/sco
+ #     weight: 2
+ #   valid:
+ #     path_src: valid.ar
+ #     path_tgt: valid.en
+
 
  transforms: [sentencepiece, filtertoolong]
  transforms_configs:
      sentencepiece:
          src_subword_model: "ar.spm.model"
          tgt_subword_model: "en.spm.model"
-         src_subword_alpha: 0.5
-         src_subword_nbest: -1
      filtertoolong:
          src_seq_length: 256
          tgt_seq_length: 256
 
  training:
      # Run configuration
-     model_path: model
-     train_from: model
+     model_path: quickmt-ar-en-eole-model
      keep_checkpoint: 4
-     save_checkpoint_steps: 1000
      train_steps: 200000
-     valid_steps: 1000
+     save_checkpoint_steps: 5000
+     valid_steps: 5000
 
      # Train on a single GPU
      world_size: 1
      gpu_ranks: [0]
 
-     # Batching
+     # Batching 120,000 tokens
+     # For RTX 5090, 15000 batch size, accum_count 8
      batch_type: "tokens"
-     batch_size: 8192
-     valid_batch_size: 8192
+     batch_size: 15000
+     valid_batch_size: 2048
      batch_size_multiple: 8
-     accum_count: [16]
+     accum_count: [8]
      accum_steps: [0]
 
      # Optimizer & Compute
      compute_dtype: "fp16"
-     #use_amp: true
-     optim: "pagedadamw8bit"
-     learning_rate: 2.0
+     optim: "adamw"
+     use_amp: True
+     learning_rate: 3.0
      warmup_steps: 5000
      decay_method: "noam"
      adam_beta2: 0.998
 
      # Data loading
-     bucket_size: 128000
+     bucket_size: 256000
      num_workers: 4
-     prefetch_factor: 32
+     prefetch_factor: 128
 
      # Hyperparams
      dropout_steps: [0]
      dropout: [0.1]
-     attention_dropout: [0]
-     max_grad_norm: 2
+     attention_dropout: [0.1]
+     max_grad_norm: 0
      label_smoothing: 0.1
      average_decay: 0.0001
      param_init_method: xavier_uniform
@@ -83,22 +111,21 @@ training:
 
  model:
      architecture: "transformer"
-     layer_norm: standard
      share_embeddings: false
-     share_decoder_embeddings: true
-     add_ffnbias: true
-     mlp_activation_fn: gelu
+     share_decoder_embeddings: false
      add_estimator: false
+     add_ffnbias: true
      add_qkvbias: false
-     norm_eps: 1e-6
-     hidden_size: 1024
+     layer_norm: standard
+     mlp_activation_fn: gelu
+     hidden_size: 768
      encoder:
-         layers: 8
+         layers: 12
      decoder:
          layers: 2
-     heads: 8
+     heads: 16
      transformer_ff: 4096
      embeddings:
-         word_vec_size: 1024
+         word_vec_size: 768
          position_encoding_type: "SinusoidalInterleaved"
eole-model/ar.spm.model CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:c7b8bf117088a901628696f60ddd15e047057ccbdcf9b0996ca19e2779a234c7
- size 642429
+ oid sha256:1a510ccfeb5cf06584327b9c2482ed0a4111d06348a56e28d165165b6dabe520
+ size 901383
eole-model/config.json CHANGED
@@ -1,152 +1,167 @@
  {
+   "seed": 1234,
+   "save_data": "data",
+   "report_every": 100,
+   "src_vocab_size": 32000,
    "src_vocab": "ar.eole.vocab",
-   "tensorboard_log_dir_dated": "tensorboard/Mar-21_20-28-42",
-   "tensorboard": true,
+   "overwrite": true,
+   "transforms": [
+     "sentencepiece",
+     "filtertoolong"
+   ],
    "valid_metrics": [
      "BLEU"
    ],
-   "save_data": "ar-en/data",
+   "tensorboard": true,
+   "tensorboard_log_dir_dated": "tensorboard/Dec-29_03-28-03",
+   "tgt_vocab": "en.eole.vocab",
    "vocab_size_multiple": 8,
+   "n_sample": 0,
    "share_vocab": false,
+   "tgt_vocab_size": 32000,
    "tensorboard_log_dir": "tensorboard",
-   "tgt_vocab_size": 20000,
-   "src_vocab_size": 20000,
-   "report_every": 100,
-   "transforms": [
-     "sentencepiece",
-     "filtertoolong"
-   ],
-   "seed": 1234,
-   "n_sample": 0,
-   "tgt_vocab": "en.eole.vocab",
-   "overwrite": true,
    "training": {
-     "batch_size": 8192,
-     "batch_type": "tokens",
-     "valid_steps": 1000,
-     "decay_method": "noam",
-     "optim": "pagedadamw8bit",
-     "bucket_size": 128000,
-     "keep_checkpoint": 4,
-     "model_path": "model",
+     "accum_count": [
+       8
+     ],
      "label_smoothing": 0.1,
-     "batch_size_multiple": 8,
-     "param_init_method": "xavier_uniform",
      "gpu_ranks": [
        0
      ],
-     "warmup_steps": 5000,
-     "prefetch_factor": 32,
-     "dropout": [
-       0.1
-     ],
-     "accum_steps": [
-       0
-     ],
+     "normalization": "tokens",
+     "average_decay": 0.0001,
+     "train_steps": 200000,
+     "prefetch_factor": 128,
+     "use_amp": true,
+     "param_init_method": "xavier_uniform",
+     "batch_size_multiple": 8,
+     "learning_rate": 3.0,
+     "adam_beta2": 0.998,
+     "batch_size": 15000,
      "world_size": 1,
      "attention_dropout": [
-       0.0
+       0.1
      ],
-     "average_decay": 0.0001,
-     "max_grad_norm": 2.0,
-     "learning_rate": 2.0,
-     "normalization": "tokens",
-     "valid_batch_size": 8192,
      "dropout_steps": [
        0
      ],
-     "num_workers": 0,
-     "accum_count": [
-       16
-     ],
+     "valid_steps": 5000,
      "compute_dtype": "torch.float16",
-     "adam_beta2": 0.998,
-     "save_checkpoint_steps": 1000,
-     "train_steps": 200000,
-     "train_from": "model"
+     "model_path": "quickmt-ar-en-eole-model",
+     "batch_type": "tokens",
+     "decay_method": "noam",
+     "dropout": [
+       0.1
+     ],
+     "warmup_steps": 5000,
+     "max_grad_norm": 0.0,
+     "num_workers": 0,
+     "save_checkpoint_steps": 5000,
+     "bucket_size": 256000,
+     "keep_checkpoint": 4,
+     "optim": "adamw",
+     "valid_batch_size": 2048,
+     "accum_steps": [
+       0
+     ]
    },
-   "transforms_configs": {
-     "filtertoolong": {
-       "tgt_seq_length": 256,
-       "src_seq_length": 256
+   "data": {
+     "corpus_1": {
+       "transforms": [
+         "sentencepiece",
+         "filtertoolong"
+       ],
+       "path_align": null,
+       "path_src": "train.ar",
+       "path_tgt": "train.en",
+       "weight": 2
      },
-     "sentencepiece": {
-       "tgt_subword_model": "${MODEL_PATH}/en.spm.model",
-       "src_subword_alpha": 0.5,
-       "src_subword_nbest": -1,
-       "src_subword_model": "${MODEL_PATH}/ar.spm.model"
+     "corpus_2": {
+       "transforms": [
+         "sentencepiece",
+         "filtertoolong"
+       ],
+       "path_align": null,
+       "path_src": "newscrawl.backtrans.ar",
+       "path_tgt": "newscrawl.2024.en",
+       "weight": 1
+     },
+     "corpus_3": {
+       "transforms": [
+         "sentencepiece",
+         "filtertoolong"
+       ],
+       "path_align": null,
+       "path_src": "madlad.backtrans.ar",
+       "path_tgt": "madlad.en",
+       "weight": 2
+     },
+     "valid": {
+       "transforms": [
+         "sentencepiece",
+         "filtertoolong"
+       ],
+       "path_align": null,
+       "path_src": "valid.ar",
+       "path_tgt": "valid.en"
      }
    },
    "model": {
-     "heads": 8,
-     "add_qkvbias": false,
-     "layer_norm": "standard",
      "share_embeddings": false,
-     "share_decoder_embeddings": true,
-     "position_encoding_type": "SinusoidalInterleaved",
-     "architecture": "transformer",
-     "hidden_size": 1024,
-     "add_ffnbias": true,
+     "heads": 16,
      "add_estimator": false,
-     "transformer_ff": 4096,
-     "norm_eps": 1e-06,
      "mlp_activation_fn": "gelu",
+     "transformer_ff": 4096,
+     "add_ffnbias": true,
+     "architecture": "transformer",
+     "layer_norm": "standard",
+     "add_qkvbias": false,
+     "hidden_size": 768,
+     "share_decoder_embeddings": false,
+     "position_encoding_type": "SinusoidalInterleaved",
     "embeddings": {
-       "src_word_vec_size": 1024,
-       "tgt_word_vec_size": 1024,
-       "word_vec_size": 1024,
-       "position_encoding_type": "SinusoidalInterleaved"
+       "word_vec_size": 768,
+       "src_word_vec_size": 768,
+       "position_encoding_type": "SinusoidalInterleaved",
+       "tgt_word_vec_size": 768
     },
-     "encoder": {
-       "heads": 8,
+     "decoder": {
+       "heads": 16,
+       "tgt_word_vec_size": 768,
+       "mlp_activation_fn": "gelu",
       "transformer_ff": 4096,
-       "hidden_size": 1024,
+       "add_ffnbias": true,
       "add_qkvbias": false,
       "layer_norm": "standard",
+       "hidden_size": 768,
+       "layers": 2,
+       "position_encoding_type": "SinusoidalInterleaved",
       "n_positions": null,
-       "add_ffnbias": true,
-       "encoder_type": "transformer",
-       "mlp_activation_fn": "gelu",
-       "layers": 8,
-       "src_word_vec_size": 1024,
-       "norm_eps": 1e-06,
-       "position_encoding_type": "SinusoidalInterleaved"
-     },
-     "decoder": {
-       "heads": 8,
-       "tgt_word_vec_size": 1024,
+       "decoder_type": "transformer"
+     },
+     "encoder": {
+       "heads": 16,
+       "encoder_type": "transformer",
+       "mlp_activation_fn": "gelu",
       "transformer_ff": 4096,
-       "hidden_size": 1024,
+       "add_ffnbias": true,
       "add_qkvbias": false,
       "layer_norm": "standard",
+       "src_word_vec_size": 768,
+       "hidden_size": 768,
+       "layers": 12,
       "n_positions": null,
-       "add_ffnbias": true,
-       "decoder_type": "transformer",
-       "mlp_activation_fn": "gelu",
-       "layers": 2,
-       "norm_eps": 1e-06,
       "position_encoding_type": "SinusoidalInterleaved"
     }
   }
 }
eole-model/en.spm.model CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:acf054421b5ed52bd6aee14564fb8595b9a6b492df669e971d43fba9edb33c16
- size 589684
+ oid sha256:8c793ee4a6d01a8b3fcd24d246a1580febf053747e44eefe99606974957cfdaa
+ size 806180
eole-model/model.00.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:f5529927ea7de80e3e4d2a39880dfdbdbd96801c1d46060e1649af9a3266b383
- size 742170000
+ oid sha256:07c8ebd41b7dbef57b13fb30d2124d1345dc574d08331b10e406353dffeebc65
+ size 829569112
eole-model/vocab.json CHANGED
The diff for this file is too large to render. See raw diff
 
model.bin CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:4a4210397c03c0ebddb84a0beaa39aa51dcaa5dce07dc04f63ec5e02bbc93911
- size 360843109
+ oid sha256:2c4e6f5d86de0d9d6c23c54768d7ab6bd20fb7a36099c987368d6a9f45c32da1
+ size 407101843
source_vocabulary.json CHANGED
The diff for this file is too large to render. See raw diff
 
src.spm.model CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:c7b8bf117088a901628696f60ddd15e047057ccbdcf9b0996ca19e2779a234c7
- size 642429
+ oid sha256:1a510ccfeb5cf06584327b9c2482ed0a4111d06348a56e28d165165b6dabe520
+ size 901383
target_vocabulary.json CHANGED
The diff for this file is too large to render. See raw diff
 
tgt.spm.model CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:acf054421b5ed52bd6aee14564fb8595b9a6b492df669e971d43fba9edb33c16
- size 589684
+ oid sha256:8c793ee4a6d01a8b3fcd24d246a1580febf053747e44eefe99606974957cfdaa
+ size 806180