radinplaid committed on
Commit af1488c · verified · 1 Parent(s): 3c5faa8

Upload folder using huggingface_hub
README.md CHANGED
@@ -1,3 +1,100 @@
- ---
- license: cc-by-4.0
- ---
+ ---
+ language:
+ - en
+ - id
+ tags:
+ - translation
+ license: cc-by-4.0
+ datasets:
+ - quickmt/quickmt-train.id-en
+ model-index:
+ - name: quickmt-id-en
+   results:
+   - task:
+       name: Translation ind-eng
+       type: translation
+       args: ind-eng
+     dataset:
+       name: flores101-devtest
+       type: flores_101
+       args: ind_Latn eng_Latn devtest
+     metrics:
+     - name: BLEU
+       type: bleu
+       value: 44.5
+     - name: CHRF
+       type: chrf
+       value: 68.78
+     - name: COMET
+       type: comet
+       value: 89.35
+ ---
+
+
+ # `quickmt-id-en` Neural Machine Translation Model
+
+ `quickmt-id-en` is a reasonably fast and reasonably accurate neural machine translation model for translation from `id` into `en`.
+
+ ## Model Information
+
+ * Trained using [`eole`](https://github.com/eole-nlp/eole)
+ * 185M-parameter 'big' transformer with 8 encoder layers and 2 decoder layers
+ * Separate 20k source and target SentencePiece vocabularies
+ * Exported to [CTranslate2](https://github.com/OpenNMT/CTranslate2) format for fast inference
+ * Training data: https://huggingface.co/datasets/quickmt/quickmt-train.id-en/tree/main
+
+ See the `eole` model configuration in this repository for further details, and the `eole-model` folder for the raw `eole` (PyTorch) model.
+
+ ## Usage with `quickmt`
+
+ If you want to do GPU inference, you must first install the NVIDIA CUDA toolkit.
+
+ Next, install the `quickmt` Python library and download the model:
+
+ ```bash
+ git clone https://github.com/quickmt/quickmt.git
+ pip install ./quickmt/
+
+ quickmt-model-download quickmt/quickmt-id-en ./quickmt-id-en
+ ```
+
+ Finally, use the model in Python:
+
+ ```python
+ from quickmt import Translator
+
+ # Auto-detects GPU; set to "cpu" to force CPU inference
+ t = Translator("./quickmt-id-en/", device="auto")
+
+ # Translate - set beam size to 1 for faster speed (but lower quality)
+ sample_text = 'Dr. Ehud Ur, profesor kedokteran di Dalhousie University di Halifax, Nova Scotia dan ketua divisi klinis dan ilmiah di Perhimpunan Diabetes Kanada memperingatkan bahwa penelitiannya masih berada di tahap awal.'
+
+ t(sample_text, beam_size=5)
+ ```
+
+ > 'Dr. Ehud Ur, professor of medicine at Dalhousie University in Halifax, Nova Scotia and chair of the clinical and scientific division at the Canadian Diabetes Society warned that his research is still in its early stages.'
+
+ ```python
+ # Get alternative translations by sampling
+ # You can pass any CTranslate2 `translate_batch` arguments
+ t([sample_text], sampling_temperature=1.2, beam_size=1, sampling_topk=50, sampling_topp=0.9)
+ ```
+
+ > 'Dr. Ehud Ur, a professor of medicine at Dalhousie University in Halifax, Nova Scotia and chair of the clinical and scientific division in Canadian Diabetes Association said, “his research is at an infancy.”'
+
+ The model is in CTranslate2 format and the tokenizers are SentencePiece models, so you can use `ctranslate2` directly instead of going through `quickmt`. It is also possible to use this model with, e.g., [LibreTranslate](https://libretranslate.com/), which also uses `ctranslate2` and `sentencepiece`.
+
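As a minimal sketch of that direct route, the following uses `ctranslate2` and `sentencepiece` without `quickmt`. It assumes the model has been downloaded to `./quickmt-id-en/` as above, and that the tokenizer file names match this repository's layout (`src.spm.model`, `tgt.spm.model`); the sample sentence is illustrative.

```python
import ctranslate2
import sentencepiece as spm

# Load the CTranslate2 model and the SentencePiece tokenizers shipped in this repo
translator = ctranslate2.Translator("./quickmt-id-en/", device="auto")
src_sp = spm.SentencePieceProcessor(model_file="./quickmt-id-en/src.spm.model")
tgt_sp = spm.SentencePieceProcessor(model_file="./quickmt-id-en/tgt.spm.model")

# Tokenize into subword pieces, translate, then detokenize the best hypothesis
tokens = src_sp.encode("Selamat pagi!", out_type=str)
results = translator.translate_batch([tokens], beam_size=5)
print(tgt_sp.decode(results[0].hypotheses[0]))
```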
+ ## Metrics
+
+ `bleu` and `chrf2` are calculated with [sacrebleu](https://github.com/mjpost/sacrebleu) on the [Flores200 `devtest` test set](https://huggingface.co/datasets/facebook/flores) ("ind_Latn"->"eng_Latn"). `comet22` is calculated with the [`comet`](https://github.com/Unbabel/COMET) library and the [default model](https://huggingface.co/Unbabel/wmt22-comet-da). "Time (s)" is the time in seconds to translate the flores-devtest dataset (1012 sentences) on an RTX 4070s GPU with batch size 32 (a larger batch size gives faster speeds).
+
+ | | bleu | chrf2 | comet22 | Time (s) |
+ |:---------------------------------|-------:|--------:|----------:|-----------:|
+ | quickmt/quickmt-id-en | 44.5 | 68.78 | 89.35 | 1.19 |
+ | Helsinki-NLP/opus-mt-id-en | 34.62 | 62.07 | 86.31 | 3.35 |
+ | facebook/nllb-200-distilled-600M | 42.26 | 66.89 | 88.67 | 21.13 |
+ | facebook/nllb-200-distilled-1.3B | 45.25 | 68.92 | 89.51 | 36.01 |
+ | facebook/m2m100_418M | 33.14 | 60.91 | 84.85 | 17.37 |
+ | facebook/m2m100_1.2B | 39.1 | 65.07 | 87.55 | 33.41 |
+
config.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "add_source_bos": false,
+   "add_source_eos": false,
+   "bos_token": "<s>",
+   "decoder_start_token": "<s>",
+   "eos_token": "</s>",
+   "layer_norm_epsilon": 1e-06,
+   "multi_query_attention": false,
+   "unk_token": "<unk>"
+ }
eole-config.yaml ADDED
@@ -0,0 +1,98 @@
+ ## IO
+ save_data: data
+ overwrite: True
+ seed: 1234
+ report_every: 100
+ valid_metrics: ["BLEU"]
+ tensorboard: true
+ tensorboard_log_dir: tensorboard
+
+ ### Vocab
+ src_vocab: id.eole.vocab
+ tgt_vocab: en.eole.vocab
+ src_vocab_size: 20000
+ tgt_vocab_size: 20000
+ vocab_size_multiple: 8
+ share_vocab: false
+ n_sample: 0
+
+ data:
+   corpus_1:
+     # path_src: hf://quickmt/quickmt-train.id-en/id
+     # path_tgt: hf://quickmt/quickmt-train.id-en/en
+     # path_sco: hf://quickmt/quickmt-train.id-en/sco
+     path_src: train.id
+     path_tgt: train.en
+   valid:
+     path_src: dev.id
+     path_tgt: dev.en
+
+ transforms: [sentencepiece, filtertoolong]
+ transforms_configs:
+   sentencepiece:
+     src_subword_model: "id.spm.model"
+     tgt_subword_model: "en.spm.model"
+   filtertoolong:
+     src_seq_length: 256
+     tgt_seq_length: 256
+
+ training:
+   # Run configuration
+   model_path: quickmt-id-en-eole-model
+   #train_from: model
+   keep_checkpoint: 4
+   train_steps: 100000
+   save_checkpoint_steps: 5000
+   valid_steps: 5000
+
+   # Train on a single GPU
+   world_size: 1
+   gpu_ranks: [0]
+
+   # Batching 10240
+   batch_type: "tokens"
+   batch_size: 8000
+   valid_batch_size: 4096
+   batch_size_multiple: 8
+   accum_count: [10]
+   accum_steps: [0]
+
+   # Optimizer & Compute
+   compute_dtype: "fp16"
+   optim: "adamw"
+   #use_amp: False
+   learning_rate: 2.0
+   warmup_steps: 4000
+   decay_method: "noam"
+   adam_beta2: 0.998
+
+   # Data loading
+   bucket_size: 128000
+   num_workers: 4
+   prefetch_factor: 32
+
+   # Hyperparams
+   dropout_steps: [0]
+   dropout: [0.1]
+   attention_dropout: [0.1]
+   max_grad_norm: 0
+   label_smoothing: 0.1
+   average_decay: 0.0001
+   param_init_method: xavier_uniform
+   normalization: "tokens"
+
+ model:
+   architecture: "transformer"
+   share_embeddings: false
+   share_decoder_embeddings: false
+   hidden_size: 1024
+   encoder:
+     layers: 8
+   decoder:
+     layers: 2
+   heads: 8
+   transformer_ff: 4096
+   embeddings:
+     word_vec_size: 1024
+   position_encoding_type: "SinusoidalInterleaved"
eole-model/config.json ADDED
@@ -0,0 +1,132 @@
+ {
+   "n_sample": 0,
+   "tensorboard": true,
+   "tgt_vocab_size": 20000,
+   "overwrite": true,
+   "src_vocab": "id.eole.vocab",
+   "tgt_vocab": "en.eole.vocab",
+   "report_every": 100,
+   "transforms": ["sentencepiece", "filtertoolong"],
+   "seed": 1234,
+   "vocab_size_multiple": 8,
+   "valid_metrics": ["BLEU"],
+   "tensorboard_log_dir": "tensorboard",
+   "save_data": "data",
+   "share_vocab": false,
+   "tensorboard_log_dir_dated": "tensorboard/May-26_20-00-30",
+   "src_vocab_size": 20000,
+   "training": {
+     "bucket_size": 128000,
+     "dropout_steps": [0],
+     "label_smoothing": 0.1,
+     "save_checkpoint_steps": 5000,
+     "num_workers": 0,
+     "learning_rate": 2.0,
+     "valid_steps": 5000,
+     "warmup_steps": 4000,
+     "param_init_method": "xavier_uniform",
+     "compute_dtype": "torch.float16",
+     "optim": "adamw",
+     "adam_beta2": 0.998,
+     "gpu_ranks": [0],
+     "world_size": 1,
+     "batch_size": 8000,
+     "keep_checkpoint": 4,
+     "decay_method": "noam",
+     "average_decay": 0.0001,
+     "batch_type": "tokens",
+     "batch_size_multiple": 8,
+     "normalization": "tokens",
+     "train_steps": 100000,
+     "max_grad_norm": 0.0,
+     "prefetch_factor": 32,
+     "dropout": [0.1],
+     "attention_dropout": [0.1],
+     "accum_count": [10],
+     "valid_batch_size": 4096,
+     "model_path": "quickmt-id-en-eole-model",
+     "accum_steps": [0]
+   },
+   "model": {
+     "architecture": "transformer",
+     "position_encoding_type": "SinusoidalInterleaved",
+     "heads": 8,
+     "share_embeddings": false,
+     "share_decoder_embeddings": false,
+     "hidden_size": 1024,
+     "transformer_ff": 4096,
+     "embeddings": {
+       "tgt_word_vec_size": 1024,
+       "src_word_vec_size": 1024,
+       "word_vec_size": 1024,
+       "position_encoding_type": "SinusoidalInterleaved"
+     },
+     "encoder": {
+       "layers": 8,
+       "encoder_type": "transformer",
+       "position_encoding_type": "SinusoidalInterleaved",
+       "heads": 8,
+       "n_positions": null,
+       "src_word_vec_size": 1024,
+       "hidden_size": 1024,
+       "transformer_ff": 4096
+     },
+     "decoder": {
+       "decoder_type": "transformer",
+       "layers": 2,
+       "tgt_word_vec_size": 1024,
+       "position_encoding_type": "SinusoidalInterleaved",
+       "heads": 8,
+       "n_positions": null,
+       "hidden_size": 1024,
+       "transformer_ff": 4096
+     }
+   },
+   "transforms_configs": {
+     "sentencepiece": {
+       "tgt_subword_model": "${MODEL_PATH}/en.spm.model",
+       "src_subword_model": "${MODEL_PATH}/id.spm.model"
+     },
+     "filtertoolong": {
+       "tgt_seq_length": 256,
+       "src_seq_length": 256
+     }
+   },
+   "data": {
+     "corpus_1": {
+       "path_tgt": "train.en",
+       "transforms": ["sentencepiece", "filtertoolong"],
+       "path_src": "train.id",
+       "path_align": null
+     },
+     "valid": {
+       "path_tgt": "dev.en",
+       "transforms": ["sentencepiece", "filtertoolong"],
+       "path_src": "dev.id",
+       "path_align": null
+     }
+   }
+ }
eole-model/en.spm.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:eededcf122b0fa0e5447be31d4e1c5ef02084b15bf8a8b2962acb1088230481b
+ size 588071
eole-model/id.spm.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:4e366992fe4c1aeeb771b3adf8fd79de82545828441545780bf5dd55b3708b61
+ size 587418
eole-model/model.00.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:3c5050b0390246bc29f77e2a3c27815fc5d14e364b76ac9605fa12de38244def
+ size 823882912
eole-model/vocab.json ADDED
The diff for this file is too large to render. See raw diff
 
model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:0bc630f2e7835a1d8c5915c33a20811d3e3029869094c98ed53eda06ee895972
+ size 401699775
source_vocabulary.json ADDED
The diff for this file is too large to render. See raw diff
 
src.spm.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:4e366992fe4c1aeeb771b3adf8fd79de82545828441545780bf5dd55b3708b61
+ size 587418
target_vocabulary.json ADDED
The diff for this file is too large to render. See raw diff
 
tgt.spm.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:eededcf122b0fa0e5447be31d4e1c5ef02084b15bf8a8b2962acb1088230481b
+ size 588071
training-id.md ADDED
@@ -0,0 +1,236 @@
+
+ ## Training `quickmt` Models
+
+ ### Environment setup
+
+ ```bash
+ # Install system dependencies
+ sudo apt install libhunspell-dev parallel
+
+ ## Install eole
+ git clone https://github.com/eole-nlp/eole.git
+ pip install -e ./eole
+
+ ## Install ctranslate2
+ git clone --recursive https://github.com/OpenNMT/CTranslate2.git
+ cd CTranslate2
+ mkdir build && cd build
+ cmake -DOPENMP_RUNTIME=COMP -DWITH_MKL=OFF ..
+ make -j8
+ sudo make install
+ sudo ldconfig
+ pip install -e ./python/
+
+ # Install kenlm
+ pip install --config-settings="--build-option=--max_order=7" https://github.com/kpu/kenlm/archive/master.zip
+
+ # Install quickmt
+ python -m pip install -e ./
+ ```
+
+ ### Download Data
+
+ ```bash
+ mv $HOME/.mtdata /path/to/large/disk
+ ln -s /path/to/large/disk $HOME/.mtdata
+
+ # Create experiment data/experiment folder
+ mkdir id-en
+
+ # List corpora
+ mtdata list -l ind-eng | cut -f1 > corpora.txt
+
+ # Download corpora
+ # Select some, then fetch
+ mtdata get -l ind-eng --merge --out id-en --no-fail -j 1 --test Flores-flores200_devtest-1-eng-ind Microsoft-ntrex-128-eng-ind \
+     --dev Neulab-tedtalks_test-1-eng-ind Flores-flores200_dev-1-eng-ind \
+     --train Statmt-news_commentary-14-eng-ind Statmt-news_commentary-15-eng-ind Statmt-news_commentary-16-eng-ind Statmt-news_commentary-17-eng-ind Statmt-news_commentary-18-eng-ind Statmt-news_commentary-18.1-eng-ind Statmt-ccaligned-1-eng-ind_ID Facebook-wikimatrix-1-eng-ind Neulab-tedtalks_train-1-eng-ind Neulab-tedtalks_dev-1-eng-ind ELRC-wikipedia_health-1-eng-ind ELRC-hrw_dataset_v1-1-eng-ind OPUS-ccaligned-v1-eng-ind OPUS-ccmatrix-v1-eng-ind OPUS-elrc_3049_wikipedia_health-v1-eng-ind OPUS-elrc_wikipedia_health-v1-eng-ind OPUS-elrc_2922-v1-eng-ind OPUS-gnome-v1-eng-ind OPUS-globalvoices-v2015-eng-ind OPUS-globalvoices-v2017q3-eng-ind OPUS-globalvoices-v2018q4-eng-ind OPUS-kde4-v2-eng-ind OPUS-multiccaligned-v1-eng-ind OPUS-nllb-v1-eng-ind OPUS-neulab_tedtalks-v1-eng-ind OPUS-news_commentary-v14-eng-ind OPUS-news_commentary-v16-eng-ind OPUS-opensubtitles-v2016-eng-ind OPUS-opensubtitles-v2018-eng-ind OPUS-opensubtitles-v2024-eng-ind OPUS-paracrawl_bonus-v9-eng-ind OPUS-qed-v2.0a-eng-ind OPUS-ted2020-v1-eng-ind OPUS-tanzil-v1-eng-ind OPUS-tatoeba-v2-eng-ind OPUS-tatoeba-v20190709-eng-ind OPUS-tatoeba-v20200531-eng-ind OPUS-tatoeba-v20201109-eng-ind OPUS-tatoeba-v20210310-eng-ind OPUS-tatoeba-v20210722-eng-ind OPUS-tatoeba-v20220303-eng-ind OPUS-tatoeba-v20230412-eng-ind OPUS-ubuntu-v14.10-eng-ind OPUS-wikimatrix-v1-eng-ind OPUS-xlent-v1-eng-ind OPUS-xlent-v1.1-eng-ind OPUS-xlent-v1.2-eng-ind OPUS-bible_uedin-v1-eng-ind OPUS-tico_19-v20201028-eng-ind OPUS-tldr_pages-v20230829-eng-ind OPUS-wikimedia-v20210402-eng-ind OPUS-wikimedia-v20230407-eng-ind Google-wmt24pp-1-eng-ind_ID
+
+ # Move files to standardized src/tgt names
+ cd id-en
+ mv dev.ind dev.id
+ mv dev.eng dev.en
+ mv train.ind train.src
+ mv train.eng train.tgt
+
+ paste -d '\t' train.src train.tgt \
+     | sort | uniq \
+     | parallel --block 70M -j 6 --pipe -k -l 200000 quickmt-clean --src_lang id --tgt_lang en --ft_model_path ../lid.176.bin --length_ratio 3 --src_min_langid_score 0.5 --tgt_min_langid_score 0.5 \
+     | awk 'BEGIN{srand()}{print rand(), $0}' | sort -n -k 1 | awk 'sub(/\S* /,"\t")' \
+     | awk -v FS="\t" '{ print $2 > "train.cleaned.src" ; print $3 > "train.cleaned.tgt" }'
+ ```
+
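The shuffle stage of the cleaning pipeline above uses a decorate-sort-undecorate idiom in awk: prepend a random key, sort numerically, then strip the key. A minimal pure-Python sketch of the same idea, shuffling parallel source/target pairs while keeping them index-aligned (sentences are illustrative):

```python
import random

# Hypothetical parallel corpus: source and target lines are index-aligned
src = ["baris satu", "baris dua", "baris tiga"]
tgt = ["line one", "line two", "line three"]

# Decorate each pair with a random key, sort by it, then drop the key -
# the same trick as `awk '{print rand(), $0}' | sort -n` in the pipeline above
random.seed(1234)
decorated = sorted((random.random(), s, t) for s, t in zip(src, tgt))
shuffled_src = [s for _, s, _ in decorated]
shuffled_tgt = [t for _, _, t in decorated]

# The order changed, but each source still faces its own target
assert set(zip(shuffled_src, shuffled_tgt)) == set(zip(src, tgt))
```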
+ ### Upload Data to Huggingface
+
+ You must first authenticate with Hugging Face, and you need write permissions for the target location (replace `quickmt/quickmt-train.id-en` with `your_username/your_dataset_name`):
+
+ ```bash
+ huggingface-cli login
+ quickmt-corpus-upload quickmt/quickmt-train.id-en --src_in train.cleaned.src --tgt_in train.cleaned.tgt --src_lang id --tgt_lang en
+ ```
+
+ ### Train Tokenizers
+
+ ```bash
+ # Train target tokenizer
+ spm_train --input_sentence_size 10000000 --shuffle_input_sentence false \
+     --input=train.cleaned.tgt --num_threads 4 --model_prefix=en.spm \
+     --vocab_size=20000 --character_coverage=0.9999 --model_type=unigram \
+     --byte_fallback --train_extremely_large_corpus true
+
+ # Train source tokenizer
+ spm_train --input_sentence_size 10000000 --shuffle_input_sentence false \
+     --input=train.cleaned.src --num_threads 4 --model_prefix=id.spm \
+     --vocab_size=20000 --character_coverage=0.9999 --model_type=unigram \
+     --byte_fallback --train_extremely_large_corpus true
+
+ # Train joint tokenizer
+ # spm_train --input_sentence_size 10000000 --shuffle_input_sentence true \
+ #     --input=tok.txt --num_threads 6 --model_prefix=joint.spm \
+ #     --vocab_size=50000 --character_coverage=0.9999 --model_type=unigram
+
+ # Convert spm vocab to eole vocab
+ cat en.spm.vocab | eole tools spm_to_vocab > en.eole.vocab
+ cat id.spm.vocab | eole tools spm_to_vocab > id.eole.vocab
+ #cat fr-en/joint.spm.vocab | eole tools spm_to_vocab > fr-en/joint.eole.vocab
+
+ mv train.cleaned.src train.id
+ mv train.cleaned.tgt train.en
+ ```
+
+ ### Train Model
+
+ ```bash
+ eole train --config eole-config-iden.yaml
+ eole train --config eole-config-enid.yaml
+ ```
+
+ ### Inference with eole
+
+ ```bash
+ eole predict -model_path quickmt-id-en-eole-model/ -src input.txt -output output.txt --batch_size 16 --gpu_ranks 0
+ ```
+
+ ### Convert to ctranslate2
+
+ ```bash
+ python -m ctranslate2.converters.eole_ct2 --model_path quickmt-id-en-eole-model/ --output_dir ct2-iden --force
+
+ # Copy over src and tgt tokenizers
+ cp en.spm.model ct2-iden/tgt.spm.model
+ cp id.spm.model ct2-iden/src.spm.model
+
+ # Copy over the config too
+ cp eole-config-iden.yaml ct2-iden/eole-config.yaml
+ ```
+
+ ### Evaluate
+
+ Evaluate on the `flores-devtest` dataset:
+
+ ```bash
+ quickmt-eval --model_path ct2-iden --tgt_lang eng_Latn --src_lang ind_Latn --output_file quickmt.iden.mt --device cpu
+ ```