radinplaid committed on
Commit af1488c · verified · 1 Parent(s): 3c5faa8

Upload folder using huggingface_hub
README.md CHANGED
@@ -1,3 +1,100 @@
- ---
- license: cc-by-4.0
- ---
+ ---
+ language:
+ - en
+ - id
+ tags:
+ - translation
+ license: cc-by-4.0
+ datasets:
+ - quickmt/quickmt-train.id-en
+ model-index:
+ - name: quickmt-id-en
+   results:
+   - task:
+       name: Translation ind-eng
+       type: translation
+       args: ind-eng
+     dataset:
+       name: flores101-devtest
+       type: flores_101
+       args: ind_Latn eng_Latn devtest
+     metrics:
+     - name: BLEU
+       type: bleu
+       value: 44.5
+     - name: CHRF
+       type: chrf
+       value: 68.78
+     - name: COMET
+       type: comet
+       value: 89.35
+ ---
+
+
+ # `quickmt-id-en` Neural Machine Translation Model
+
+ `quickmt-id-en` is a reasonably fast and reasonably accurate neural machine translation model for translation from `id` into `en`.
+
+ ## Model Information
+
+ * Trained using [`eole`](https://github.com/eole-nlp/eole)
+ * 185M-parameter 'big' transformer with 8 encoder layers and 2 decoder layers
+ * Separate 20k source and target SentencePiece vocabularies
+ * Exported to [CTranslate2](https://github.com/OpenNMT/CTranslate2) format for fast inference
+ * Training data: https://huggingface.co/datasets/quickmt/quickmt-train.id-en/tree/main
+
+ See the `eole` model configuration in this repository for further details, and the `eole-model` folder for the raw `eole` (PyTorch) model.
+
+ ## Usage with `quickmt`
+
+ If you want to do GPU inference, you must first install the NVIDIA CUDA toolkit.
+
+ Next, install the `quickmt` Python library and download the model:
+
+ ```bash
+ git clone https://github.com/quickmt/quickmt.git
+ pip install ./quickmt/
+
+ quickmt-model-download quickmt/quickmt-id-en ./quickmt-id-en
+ ```
+
+ Finally, use the model in Python:
+
+ ```python
+ from quickmt import Translator
+
+ # Auto-detects GPU; set to "cpu" to force CPU inference
+ t = Translator("./quickmt-id-en/", device="auto")
+
+ # Translate - set beam size to 1 for faster speed (but lower quality)
+ sample_text = 'Dr. Ehud Ur, profesor kedokteran di Dalhousie University di Halifax, Nova Scotia dan ketua divisi klinis dan ilmiah di Perhimpunan Diabetes Kanada memperingatkan bahwa penelitiannya masih berada di tahap awal.'
+
+ t(sample_text, beam_size=5)
+ ```
+
+ > 'Dr. Ehud Ur, professor of medicine at Dalhousie University in Halifax, Nova Scotia and chair of the clinical and scientific division at the Canadian Diabetes Society warned that his research is still in its early stages.'
+
+ ```python
+ # Get alternative translations by sampling
+ # You can pass any CTranslate2 `translate_batch` arguments
+ t([sample_text], sampling_temperature=1.2, beam_size=1, sampling_topk=50, sampling_topp=0.9)
+ ```
+
+ > 'Dr. Ehud Ur, a professor of medicine at Dalhousie University in Halifax, Nova Scotia and chair of the clinical and scientific division in Canadian Diabetes Association said, “his research is at an infancy.”'
+
+ The model is in CTranslate2 format and the tokenizers are SentencePiece models, so you can use `ctranslate2` directly instead of going through `quickmt`. It is also possible to use this model with, e.g., [LibreTranslate](https://libretranslate.com/), which also uses `ctranslate2` and `sentencepiece`.
+
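As a minimal sketch of that direct route, the following uses `ctranslate2` and `sentencepiece` without `quickmt`. It assumes the model has been downloaded to `./quickmt-id-en/` as above, and that the tokenizer file names match this repository's layout (`src.spm.model`, `tgt.spm.model`); the sample sentence is illustrative.

```python
import ctranslate2
import sentencepiece as spm

# Load the CTranslate2 model and the SentencePiece tokenizers shipped in this repo
translator = ctranslate2.Translator("./quickmt-id-en/", device="auto")
src_sp = spm.SentencePieceProcessor(model_file="./quickmt-id-en/src.spm.model")
tgt_sp = spm.SentencePieceProcessor(model_file="./quickmt-id-en/tgt.spm.model")

# Tokenize into subword pieces, translate, then detokenize the best hypothesis
tokens = src_sp.encode("Selamat pagi!", out_type=str)
results = translator.translate_batch([tokens], beam_size=5)
print(tgt_sp.decode(results[0].hypotheses[0]))
```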
+ ## Metrics
+
+ `bleu` and `chrf2` are calculated with [sacrebleu](https://github.com/mjpost/sacrebleu) on the [Flores200 `devtest` test set](https://huggingface.co/datasets/facebook/flores) ("ind_Latn"->"eng_Latn"). `comet22` is calculated with the [`comet`](https://github.com/Unbabel/COMET) library and the [default model](https://huggingface.co/Unbabel/wmt22-comet-da). "Time (s)" is the time in seconds to translate the flores-devtest dataset (1012 sentences) on an RTX 4070s GPU with batch size 32 (a larger batch size gives faster speeds).
+
+ | | bleu | chrf2 | comet22 | Time (s) |
+ |:---------------------------------|-------:|--------:|----------:|-----------:|
+ | quickmt/quickmt-id-en | 44.5 | 68.78 | 89.35 | 1.19 |
+ | Helsinki-NLP/opus-mt-id-en | 34.62 | 62.07 | 86.31 | 3.35 |
+ | facebook/nllb-200-distilled-600M | 42.26 | 66.89 | 88.67 | 21.13 |
+ | facebook/nllb-200-distilled-1.3B | 45.25 | 68.92 | 89.51 | 36.01 |
+ | facebook/m2m100_418M | 33.14 | 60.91 | 84.85 | 17.37 |
+ | facebook/m2m100_1.2B | 39.1 | 65.07 | 87.55 | 33.41 |
+
config.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "add_source_bos": false,
+   "add_source_eos": false,
+   "bos_token": "<s>",
+   "decoder_start_token": "<s>",
+   "eos_token": "</s>",
+   "layer_norm_epsilon": 1e-06,
+   "multi_query_attention": false,
+   "unk_token": "<unk>"
+ }
eole-config.yaml ADDED
@@ -0,0 +1,98 @@
+ ## IO
+ save_data: data
+ overwrite: True
+ seed: 1234
+ report_every: 100
+ valid_metrics: ["BLEU"]
+ tensorboard: true
+ tensorboard_log_dir: tensorboard
+
+ ### Vocab
+ src_vocab: id.eole.vocab
+ tgt_vocab: en.eole.vocab
+ src_vocab_size: 20000
+ tgt_vocab_size: 20000
+ vocab_size_multiple: 8
+ share_vocab: false
+ n_sample: 0
+
+ data:
+   corpus_1:
+     # path_src: hf://quickmt/quickmt-train.id-en/id
+     # path_tgt: hf://quickmt/quickmt-train.id-en/en
+     # path_sco: hf://quickmt/quickmt-train.id-en/sco
+     path_src: train.id
+     path_tgt: train.en
+   valid:
+     path_src: dev.id
+     path_tgt: dev.en
+
+ transforms: [sentencepiece, filtertoolong]
+ transforms_configs:
+   sentencepiece:
+     src_subword_model: "id.spm.model"
+     tgt_subword_model: "en.spm.model"
+   filtertoolong:
+     src_seq_length: 256
+     tgt_seq_length: 256
+
+ training:
+   # Run configuration
+   model_path: quickmt-id-en-eole-model
+   #train_from: model
+   keep_checkpoint: 4
+   train_steps: 100000
+   save_checkpoint_steps: 5000
+   valid_steps: 5000
+
+   # Train on a single GPU
+   world_size: 1
+   gpu_ranks: [0]
+
+   # Batching 10240
+   batch_type: "tokens"
+   batch_size: 8000
+   valid_batch_size: 4096
+   batch_size_multiple: 8
+   accum_count: [10]
+   accum_steps: [0]
+
+   # Optimizer & Compute
+   compute_dtype: "fp16"
+   optim: "adamw"
+   #use_amp: False
+   learning_rate: 2.0
+   warmup_steps: 4000
+   decay_method: "noam"
+   adam_beta2: 0.998
+
+   # Data loading
+   bucket_size: 128000
+   num_workers: 4
+   prefetch_factor: 32
+
+   # Hyperparams
+   dropout_steps: [0]
+   dropout: [0.1]
+   attention_dropout: [0.1]
+   max_grad_norm: 0
+   label_smoothing: 0.1
+   average_decay: 0.0001
+   param_init_method: xavier_uniform
+   normalization: "tokens"
+
+ model:
+   architecture: "transformer"
+   share_embeddings: false
+   share_decoder_embeddings: false
+   hidden_size: 1024
+   encoder:
+     layers: 8
+   decoder:
+     layers: 2
+   heads: 8
+   transformer_ff: 4096
+   embeddings:
+     word_vec_size: 1024
+   position_encoding_type: "SinusoidalInterleaved"
eole-model/config.json ADDED
@@ -0,0 +1,132 @@
+ {
+   "n_sample": 0,
+   "tensorboard": true,
+   "tgt_vocab_size": 20000,
+   "overwrite": true,
+   "src_vocab": "id.eole.vocab",
+   "tgt_vocab": "en.eole.vocab",
+   "report_every": 100,
+   "transforms": ["sentencepiece", "filtertoolong"],
+   "seed": 1234,
+   "vocab_size_multiple": 8,
+   "valid_metrics": ["BLEU"],
+   "tensorboard_log_dir": "tensorboard",
+   "save_data": "data",
+   "share_vocab": false,
+   "tensorboard_log_dir_dated": "tensorboard/May-26_20-00-30",
+   "src_vocab_size": 20000,
+   "training": {
+     "bucket_size": 128000,
+     "dropout_steps": [0],
+     "label_smoothing": 0.1,
+     "save_checkpoint_steps": 5000,
+     "num_workers": 0,
+     "learning_rate": 2.0,
+     "valid_steps": 5000,
+     "warmup_steps": 4000,
+     "param_init_method": "xavier_uniform",
+     "compute_dtype": "torch.float16",
+     "optim": "adamw",
+     "adam_beta2": 0.998,
+     "gpu_ranks": [0],
+     "world_size": 1,
+     "batch_size": 8000,
+     "keep_checkpoint": 4,
+     "decay_method": "noam",
+     "average_decay": 0.0001,
+     "batch_type": "tokens",
+     "batch_size_multiple": 8,
+     "normalization": "tokens",
+     "train_steps": 100000,
+     "max_grad_norm": 0.0,
+     "prefetch_factor": 32,
+     "dropout": [0.1],
+     "attention_dropout": [0.1],
+     "accum_count": [10],
+     "valid_batch_size": 4096,
+     "model_path": "quickmt-id-en-eole-model",
+     "accum_steps": [0]
+   },
+   "model": {
+     "architecture": "transformer",
+     "position_encoding_type": "SinusoidalInterleaved",
+     "heads": 8,
+     "share_embeddings": false,
+     "share_decoder_embeddings": false,
+     "hidden_size": 1024,
+     "transformer_ff": 4096,
+     "embeddings": {
+       "tgt_word_vec_size": 1024,
+       "src_word_vec_size": 1024,
+       "word_vec_size": 1024,
+       "position_encoding_type": "SinusoidalInterleaved"
+     },
+     "encoder": {
+       "layers": 8,
+       "encoder_type": "transformer",
+       "position_encoding_type": "SinusoidalInterleaved",
+       "heads": 8,
+       "n_positions": null,
+       "src_word_vec_size": 1024,
+       "hidden_size": 1024,
+       "transformer_ff": 4096
+     },
+     "decoder": {
+       "decoder_type": "transformer",
+       "layers": 2,
+       "tgt_word_vec_size": 1024,
+       "position_encoding_type": "SinusoidalInterleaved",
+       "heads": 8,
+       "n_positions": null,
+       "hidden_size": 1024,
+       "transformer_ff": 4096
+     }
+   },
+   "transforms_configs": {
+     "sentencepiece": {
+       "tgt_subword_model": "${MODEL_PATH}/en.spm.model",
+       "src_subword_model": "${MODEL_PATH}/id.spm.model"
+     },
+     "filtertoolong": {
+       "tgt_seq_length": 256,
+       "src_seq_length": 256
+     }
+   },
+   "data": {
+     "corpus_1": {
+       "path_tgt": "train.en",
+       "transforms": ["sentencepiece", "filtertoolong"],
+       "path_src": "train.id",
+       "path_align": null
+     },
+     "valid": {
+       "path_tgt": "dev.en",
+       "transforms": ["sentencepiece", "filtertoolong"],
+       "path_src": "dev.id",
+       "path_align": null
+     }
+   }
+ }
eole-model/en.spm.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:eededcf122b0fa0e5447be31d4e1c5ef02084b15bf8a8b2962acb1088230481b
+ size 588071
eole-model/id.spm.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:4e366992fe4c1aeeb771b3adf8fd79de82545828441545780bf5dd55b3708b61
+ size 587418
eole-model/model.00.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:3c5050b0390246bc29f77e2a3c27815fc5d14e364b76ac9605fa12de38244def
+ size 823882912
eole-model/vocab.json ADDED
The diff for this file is too large to render. See raw diff
 
model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:0bc630f2e7835a1d8c5915c33a20811d3e3029869094c98ed53eda06ee895972
+ size 401699775
source_vocabulary.json ADDED
The diff for this file is too large to render. See raw diff
 
src.spm.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:4e366992fe4c1aeeb771b3adf8fd79de82545828441545780bf5dd55b3708b61
+ size 587418
target_vocabulary.json ADDED
The diff for this file is too large to render. See raw diff
 
tgt.spm.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:eededcf122b0fa0e5447be31d4e1c5ef02084b15bf8a8b2962acb1088230481b
+ size 588071
training-id.md ADDED
@@ -0,0 +1,236 @@
+
+ ## Training `quickmt` Models
+
+ ### Environment setup
+
+ ```bash
+ # Install system dependencies
+ sudo apt install libhunspell-dev parallel
+
+ ## Install eole
+ git clone https://github.com/eole-nlp/eole.git
+ pip install -e ./eole
+
+ ## Install ctranslate2
+ git clone --recursive https://github.com/OpenNMT/CTranslate2.git
+ cd CTranslate2
+ mkdir build && cd build
+ cmake -DOPENMP_RUNTIME=COMP -DWITH_MKL=OFF ..
+ make -j8
+ sudo make install
+ sudo ldconfig
+ pip install -e ./python/
+
+ # Install kenlm
+ pip install --config-settings="--build-option=--max_order=7" https://github.com/kpu/kenlm/archive/master.zip
+
+ # Install quickmt
+ python -m pip install -e ./
+ ```
+
+ ### Download Data
+
+ ```bash
+ mv $HOME/.mtdata /path/to/large/disk
+ ln -s /path/to/large/disk $HOME/.mtdata
+
+ # Create experiment data/experiment folder
+ mkdir id-en
+
+ # List corpora
+ mtdata list -l ind-eng | cut -f1 > corpora.txt
+
+ # Download corpora
+ # Select some, then fetch
+ mtdata get -l ind-eng --merge --out id-en --no-fail -j 1 --test Flores-flores200_devtest-1-eng-ind Microsoft-ntrex-128-eng-ind \
+     --dev Neulab-tedtalks_test-1-eng-ind Flores-flores200_dev-1-eng-ind \
+     --train Statmt-news_commentary-14-eng-ind Statmt-news_commentary-15-eng-ind Statmt-news_commentary-16-eng-ind Statmt-news_commentary-17-eng-ind Statmt-news_commentary-18-eng-ind Statmt-news_commentary-18.1-eng-ind Statmt-ccaligned-1-eng-ind_ID Facebook-wikimatrix-1-eng-ind Neulab-tedtalks_train-1-eng-ind Neulab-tedtalks_dev-1-eng-ind ELRC-wikipedia_health-1-eng-ind ELRC-hrw_dataset_v1-1-eng-ind OPUS-ccaligned-v1-eng-ind OPUS-ccmatrix-v1-eng-ind OPUS-elrc_3049_wikipedia_health-v1-eng-ind OPUS-elrc_wikipedia_health-v1-eng-ind OPUS-elrc_2922-v1-eng-ind OPUS-gnome-v1-eng-ind OPUS-globalvoices-v2015-eng-ind OPUS-globalvoices-v2017q3-eng-ind OPUS-globalvoices-v2018q4-eng-ind OPUS-kde4-v2-eng-ind OPUS-multiccaligned-v1-eng-ind OPUS-nllb-v1-eng-ind OPUS-neulab_tedtalks-v1-eng-ind OPUS-news_commentary-v14-eng-ind OPUS-news_commentary-v16-eng-ind OPUS-opensubtitles-v2016-eng-ind OPUS-opensubtitles-v2018-eng-ind OPUS-opensubtitles-v2024-eng-ind OPUS-paracrawl_bonus-v9-eng-ind OPUS-qed-v2.0a-eng-ind OPUS-ted2020-v1-eng-ind OPUS-tanzil-v1-eng-ind OPUS-tatoeba-v2-eng-ind OPUS-tatoeba-v20190709-eng-ind OPUS-tatoeba-v20200531-eng-ind OPUS-tatoeba-v20201109-eng-ind OPUS-tatoeba-v20210310-eng-ind OPUS-tatoeba-v20210722-eng-ind OPUS-tatoeba-v20220303-eng-ind OPUS-tatoeba-v20230412-eng-ind OPUS-ubuntu-v14.10-eng-ind OPUS-wikimatrix-v1-eng-ind OPUS-xlent-v1-eng-ind OPUS-xlent-v1.1-eng-ind OPUS-xlent-v1.2-eng-ind OPUS-bible_uedin-v1-eng-ind OPUS-tico_19-v20201028-eng-ind OPUS-tldr_pages-v20230829-eng-ind OPUS-wikimedia-v20210402-eng-ind OPUS-wikimedia-v20230407-eng-ind Google-wmt24pp-1-eng-ind_ID
+
+ # Move files to standardized src/tgt names
+ cd id-en
+ mv dev.ind dev.id
+ mv dev.eng dev.en
+ mv train.ind train.src
+ mv train.eng train.tgt
+
+ paste -d '\t' train.src train.tgt \
+     | sort | uniq \
+     | parallel --block 70M -j 6 --pipe -k -l 200000 quickmt-clean --src_lang id --tgt_lang en --ft_model_path ../lid.176.bin --length_ratio 3 --src_min_langid_score 0.5 --tgt_min_langid_score 0.5 \
+     | awk 'BEGIN{srand()}{print rand(), $0}' | sort -n -k 1 | awk 'sub(/\S* /,"\t")' \
+     | awk -v FS="\t" '{ print $2 > "train.cleaned.src" ; print $3 > "train.cleaned.tgt" }'
+ ```
+
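The shuffle stage of the cleaning pipeline above uses a decorate-sort-undecorate idiom in awk: prepend a random key, sort numerically, then strip the key. A minimal pure-Python sketch of the same idea, shuffling parallel source/target pairs while keeping them index-aligned (sentences are illustrative):

```python
import random

# Hypothetical parallel corpus: source and target lines are index-aligned
src = ["baris satu", "baris dua", "baris tiga"]
tgt = ["line one", "line two", "line three"]

# Decorate each pair with a random key, sort by it, then drop the key -
# the same trick as `awk '{print rand(), $0}' | sort -n` in the pipeline above
random.seed(1234)
decorated = sorted((random.random(), s, t) for s, t in zip(src, tgt))
shuffled_src = [s for _, s, _ in decorated]
shuffled_tgt = [t for _, _, t in decorated]

# The order changed, but each source still faces its own target
assert set(zip(shuffled_src, shuffled_tgt)) == set(zip(src, tgt))
```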
+ ### Upload Data to Huggingface
+
+ You must first authenticate with Hugging Face, and you need write permissions for the target location (replace `quickmt/quickmt-train.id-en` with `your_username/your_dataset_name`):
+
+ ```bash
+ huggingface-cli login
+ quickmt-corpus-upload quickmt/quickmt-train.id-en --src_in train.cleaned.src --tgt_in train.cleaned.tgt --src_lang id --tgt_lang en
+ ```
+
+ ### Train Tokenizers
+
+ ```bash
+ # Train target tokenizer
+ spm_train --input_sentence_size 10000000 --shuffle_input_sentence false \
+     --input=train.cleaned.tgt --num_threads 4 --model_prefix=en.spm \
+     --vocab_size=20000 --character_coverage=0.9999 --model_type=unigram \
+     --byte_fallback --train_extremely_large_corpus true
+
+ # Train source tokenizer
+ spm_train --input_sentence_size 10000000 --shuffle_input_sentence false \
+     --input=train.cleaned.src --num_threads 4 --model_prefix=id.spm \
+     --vocab_size=20000 --character_coverage=0.9999 --model_type=unigram \
+     --byte_fallback --train_extremely_large_corpus true
+
+ # Train joint tokenizer
+ # spm_train --input_sentence_size 10000000 --shuffle_input_sentence true \
+ #     --input=tok.txt --num_threads 6 --model_prefix=joint.spm \
+ #     --vocab_size=50000 --character_coverage=0.9999 --model_type=unigram
+
+ # Convert spm vocab to eole vocab
+ cat en.spm.vocab | eole tools spm_to_vocab > en.eole.vocab
+ cat id.spm.vocab | eole tools spm_to_vocab > id.eole.vocab
+ #cat fr-en/joint.spm.vocab | eole tools spm_to_vocab > fr-en/joint.eole.vocab
+
+ mv train.cleaned.src train.id
+ mv train.cleaned.tgt train.en
+ ```
+
+ ### Train Model
+
+ ```bash
+ eole train --config eole-config-iden.yaml
+ eole train --config eole-config-enid.yaml
+ ```
+
+ ### Inference with eole
+
+ ```bash
+ eole predict -model_path quickmt-id-en-eole-model/ -src input.txt -output output.txt --batch_size 16 --gpu_ranks 0
+ ```
+
+ ### Convert to ctranslate2
+
+ ```bash
+ python -m ctranslate2.converters.eole_ct2 --model_path quickmt-id-en-eole-model/ --output_dir ct2-iden --force
+
+ # Copy over src and tgt tokenizers
+ cp en.spm.model ct2-iden/tgt.spm.model
+ cp id.spm.model ct2-iden/src.spm.model
+
+ # Copy over the config too
+ cp eole-config-iden.yaml ct2-iden/eole-config.yaml
+ ```
+
+ ### Evaluate
+
+ Evaluate on the `flores-devtest` dataset:
+
+ ```bash
+ quickmt-eval --model_path ct2-iden --tgt_lang eng_Latn --src_lang ind_Latn --output_file quickmt.iden.mt --device cpu
+ ```