Translation
Safetensors
Ukrainian
English
Eval Results (legacy)
radinplaid committed on
Commit cdb4ffb · verified · 1 Parent(s): 241b3f8

Upload folder using huggingface_hub

README.md ADDED
---
language:
- uk
- en
tags:
- translation
license: cc-by-4.0
datasets:
- quickmt/quickmt-train.uk-en
- quickmt/finetranslations-sample-uk-en
- HuggingFaceFW/finetranslations
model-index:
- name: quickmt-uk-en
  results:
  - task:
      name: Translation ukr-eng
      type: translation
      args: ukr-eng
    dataset:
      name: flores101-devtest
      type: flores_101
      args: ukr_Cyrl eng_Latn devtest
    metrics:
    - name: BLEU
      type: bleu
      value: 39.88
    - name: CHRF
      type: chrf
      value: 65.65
---

# `quickmt-uk-en` Neural Machine Translation Model

`quickmt-uk-en` is a reasonably fast and reasonably accurate neural machine translation model for translation from `uk` into `en`.

`quickmt` models are roughly 3 times faster for GPU inference than OpusMT models, and roughly [40 times](https://huggingface.co/spaces/quickmt/quickmt-vs-libretranslate) faster than [LibreTranslate](https://huggingface.co/spaces/quickmt/quickmt-vs-libretranslate)/[ArgosTranslate](https://github.com/argosopentech/argos-translate).

## Try it on our Huggingface Space

Give it a try before downloading here: https://huggingface.co/spaces/quickmt/QuickMT-Demo

## Model Information

* Trained using [`quickmt-train`](https://github.com/quickmt/quickmt-train)
* 200M parameter seq2seq transformer
* Separate 32k SentencePiece vocabularies for source and target
* Exported to [CTranslate2](https://github.com/OpenNMT/CTranslate2) format for fast inference
* The PyTorch model (for fine-tuning or PyTorch inference) is available in this repository in the `pytorch_model` folder
* Config file: `pytorch_model/config.yaml`

## Usage with `quickmt`

If you want to do GPU inference, install the NVIDIA CUDA toolkit first.

Next, install the `quickmt` python library and download the model:

```bash
git clone https://github.com/quickmt/quickmt.git
pip install -e ./quickmt/
```

Finally, use the model in Python:

```python
from quickmt import Translator

# Auto-detects GPU; set to "cpu" to force CPU inference
mt = Translator("quickmt/quickmt-uk-en", device="auto")

# Translate - set beam_size to 1 for faster (but lower-quality) output
sample_text = 'Д-р Ехуд Ур, професор медицини в Університеті Делхаузі у Галіфаксі, Нова Шотландія, і голова клінічного та наукового відділу Канадської Асоціації Діабету, попередив, що дослідження лише починаються.'

mt(sample_text, beam_size=5)
```

> 'Dr. Ehud Ur, a professor of medicine at Delhouse University in Halifax, Nova Scotia, and head of the clinical and scientific division of the Canadian Diabetes Association, warned that the research is just beginning.'

```python
# Get alternative translations by sampling
# You can pass any CTranslate2 `translate_batch` arguments
mt([sample_text], sampling_temperature=1.2, beam_size=1, sampling_topk=50, sampling_topp=0.9)
```

> 'Dr Ehud Uhr, an associate professor of medicine at Delhousy University in Halifax, Nova Scotia and the head of the clinical and scientific department of the Canadian Diabetes Association, has warned that the research has just started.'

The model is in `ctranslate2` format, and the tokenizers are `sentencepiece`, so you can use `ctranslate2` directly instead of through `quickmt`. It is also possible to use this model with e.g. [LibreTranslate](https://libretranslate.com/), which also uses `ctranslate2` and `sentencepiece`. A model in safetensors format for use with `eole` is also provided.
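
Driving the exported model with `ctranslate2` and `sentencepiece` directly can be sketched as follows. This is a minimal sketch, assuming the repository has been downloaded to a local `quickmt-uk-en` folder (the `src.spm.model`/`tgt.spm.model` file names are the ones shipped in this repo):

```python
import ctranslate2
import sentencepiece as spm

# Path to a local download of this repository (assumed)
model_dir = "quickmt-uk-en"

# Load the SentencePiece tokenizers shipped alongside the CTranslate2 model
src_sp = spm.SentencePieceProcessor(model_file=f"{model_dir}/src.spm.model")
tgt_sp = spm.SentencePieceProcessor(model_file=f"{model_dir}/tgt.spm.model")

translator = ctranslate2.Translator(model_dir, device="auto")

text = "Дослідження лише починаються."
# CTranslate2 expects pre-tokenized input (subword pieces as strings)
tokens = src_sp.encode(text, out_type=str)
results = translator.translate_batch([tokens], beam_size=5)

# Detokenize the best hypothesis
translation = tgt_sp.decode(results[0].hypotheses[0])
print(translation)
```

CTranslate2 reads the repository's `config.json` itself, so source BOS handling is applied automatically.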

## Metrics

`bleu` and `chrf2` are calculated with [sacrebleu](https://github.com/mjpost/sacrebleu) on the [Flores200 `devtest` test set](https://huggingface.co/datasets/facebook/flores). "Time (s)" is the time in seconds to translate the flores-devtest dataset (1012 sentences) on an RTX 4070s GPU with batch size 32.

| Model                              | Time (s) | bleu  | chrf  |
|------------------------------------|----------|-------|-------|
| quickmt/quickmt_uk_en              | 1.12     | 39.88 | 65.65 |
| Helsinki-NLP/opus-mt-tc-big-zle-en | 3.03     | 39.07 | 64.98 |
| facebook/nllb-200-distilled-1.3B   | 17.25    | 33.59 | 63.50 |
| CohereLabs/tiny-aya-global         | 23.36    | 37.60 | 63.73 |
| google/gemma-4-E2B-it              | 47.71    | 37.22 | 63.75 |
config.json ADDED
{
  "add_source_bos": true,
  "add_source_eos": false,
  "bos_token": "<s>",
  "decoder_start_token": "<s>",
  "eos_token": "</s>",
  "layer_norm_epsilon": null,
  "multi_query_attention": false,
  "unk_token": "<unk>"
}
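
This file is the CTranslate2 inference configuration. A small sketch of reading it with the standard library (JSON inlined here for illustration), showing the token conventions the runtime will apply:

```python
import json

# The CTranslate2 inference config from this repository, inlined for illustration
config = json.loads("""
{
  "add_source_bos": true,
  "add_source_eos": false,
  "bos_token": "<s>",
  "decoder_start_token": "<s>",
  "eos_token": "</s>",
  "layer_norm_epsilon": null,
  "multi_query_attention": false,
  "unk_token": "<unk>"
}
""")

# A <s> token is prepended to the source, and decoding starts from <s>
assert config["add_source_bos"] and not config["add_source_eos"]
print(config["bos_token"], config["eos_token"], config["unk_token"])  # <s> </s> <unk>
```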
model.bin ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:d63c061e3a0eed8b61527ded6a9e5f8a74364983c454ee4652cc995a1eb612be
size 202004997
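
This entry is a Git LFS pointer file (spec v1): the real weights are fetched separately and identified by their SHA-256 digest and byte size. A generic sketch of verifying a downloaded file against a pointer's `oid` and `size` (demonstrated on a throwaway file, not the actual `model.bin`):

```python
import hashlib
import os

def verify_lfs_object(path: str, expected_sha256: str, expected_size: int) -> bool:
    """Check a downloaded file against the oid/size recorded in its LFS pointer."""
    if os.path.getsize(path) != expected_size:
        return False
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Hash in 1 MiB chunks to avoid loading large weights into memory
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest() == expected_sha256

# Example with a throwaway file
with open("demo.bin", "wb") as f:
    f.write(b"hello")
ok = verify_lfs_object("demo.bin", hashlib.sha256(b"hello").hexdigest(), 5)
print(ok)  # True
```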
pytorch_model/averaged_model.safetensors ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:5d59db4c8760263746c13387981f148c466b4712b0fbed7f31ef209db179c274
size 799828424
pytorch_model/config.yaml ADDED
model:
  d_model: 768
  enc_layers: 12
  dec_layers: 2
  n_heads: 16
  ffn_dim: 4096
  max_len: 256
  vocab_size_src: 32000
  vocab_size_tgt: 32000
  dropout: 0.05
  mlp_type: "standard" # standard or gated
  activation: "gelu" # gelu or silu
  norm_type: "layernorm" # layernorm or rmsnorm
  ff_bias: true
  tie_decoder_embeddings: false
  layernorm_eps: 1.0e-5

data:
  src_lang: "uk"
  tgt_lang: "en"
  src_dev_path: "dev.ukr"
  tgt_dev_path: "dev.eng"
  max_tokens_per_batch: 6000
  src_spm_nbest_size: -1
  src_spm_alpha: 0.5
  tgt_spm_nbest_size: 1
  tgt_spm_alpha: 1.0
  corpora:
  - src_file: "train.cleaned.filtered.ukr"
    tgt_file: "train.cleaned.filtered.eng"
    weight: 1
    start_step: 1000
  - src_file: "finetranslations.ukr_Cyrl-eng_Latn.ukr_Cyrl"
    tgt_file: "finetranslations.ukr_Cyrl-eng_Latn.eng_Latn"
    weight: 1
    start_step: 0
    stop_step: 80000

train:
  experiment_name: "uken-base"
  lr: 2.5e-3
  grad_clip: 0.5
  accum_steps: 20
  max_checkpoints: 10
  precision: "bfloat16"
  warmup_steps: 5000
  max_steps: 108000
  eval_steps: 1000

export:
  k: 5
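
As a rough sanity check on the "200M parameter" figure in the model card, the weight-matrix sizes implied by this config (ignoring biases and layer norms, and with untied decoder embeddings per `tie_decoder_embeddings: false`) add up as follows:

```python
# Dimensions from pytorch_model/config.yaml
d_model, ffn = 768, 4096
enc_layers, dec_layers = 12, 2
vocab_src = vocab_tgt = 32000

embeddings = vocab_src * d_model + vocab_tgt * d_model   # source + target embedding tables
enc = enc_layers * (4 * d_model**2 + 2 * d_model * ffn)  # self-attn Q/K/V/O + FFN per layer
dec = dec_layers * (8 * d_model**2 + 2 * d_model * ffn)  # self-attn + cross-attn + FFN per layer
output_proj = d_model * vocab_tgt                        # untied output projection

total = embeddings + enc + dec + output_proj
print(f"{total / 1e6:.1f}M parameters")  # ~199.6M, i.e. the ~200M quoted
```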
pytorch_model/tokenizer_src.model ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:f4ae41ace20c5b2b4287ac9d402cf93f6f857901b10ae1b20f22544073ce1c85
size 1014232
pytorch_model/tokenizer_src.vocab ADDED
The diff for this file is too large to render. See raw diff
 
pytorch_model/tokenizer_tgt.model ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:832da24581d0516be8980cba35eab07fb575e703da700a919aac92d9c5c31311
size 804561
pytorch_model/tokenizer_tgt.vocab ADDED
The diff for this file is too large to render. See raw diff
 
source_vocabulary.json ADDED
The diff for this file is too large to render. See raw diff
 
src.spm.model ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:f4ae41ace20c5b2b4287ac9d402cf93f6f857901b10ae1b20f22544073ce1c85
size 1014232
target_vocabulary.json ADDED
The diff for this file is too large to render. See raw diff
 
tgt.spm.model ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:832da24581d0516be8980cba35eab07fb575e703da700a919aac92d9c5c31311
size 804561