Initial version of model card
#1 by mgrbyte - opened

README.md CHANGED
---
language:
- cy
- en
license: apache-2.0
pipeline_tag: translation
tags:
- translation
- marian
metrics:
- bleu
widget:
- text: Mae gan Lywodraeth Cymru darged i gyrraedd miliwn o siaradwyr Cymraeg erbyn y flwyddyn 2020.
model-index:
- name: mt-dspec-legislation-en-cy
  results:
  - task:
      name: Translation
      type: translation
    metrics:
    - type: bleu
      value: 54
---

# mt-dspec-legislation-en-cy

A language translation model for translating between English and Welsh, specialised to the domain of legislation.

This model was trained using a custom DVC pipeline employing [Marian NMT](https://marian-nmt.github.io/).
The training datasets were generated from the following sources:

- [UK Government Legislation data](https://www.legislation.gov.uk)
- [OPUS-cy-en](https://opus.nlpl.eu/)
- [Cofnod Y Cynulliad](https://record.assembly.wales/)
- [Cofion Techiaith Cymru](https://cofion.techiaith.cymru)

The data was split into train, validation and test sets, the test set comprising a random 20% slice of the total dataset. Segments were selected at random from the text and TMX files in the datasets described above.
The datasets were cleaned, without any pre-tokenisation, using a SentencePiece vocabulary model, and then fed into 10 separate Marian NMT training processes, the data having been split into 10 training and validation sets.
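
The splitting scheme described above (a held-out 20% test slice, with the remainder divided into 10 train/validation shards, one per Marian training process) can be sketched in plain Python. This is a minimal illustration only; the function name, seed and segment list are hypothetical and not part of the project's actual DVC pipeline:

```python
import random

def split_corpus(segments, test_fraction=0.2, n_shards=10, seed=13):
    """Shuffle aligned segments, hold out a test slice, shard the rest."""
    segments = list(segments)
    random.Random(seed).shuffle(segments)
    n_test = int(len(segments) * test_fraction)
    test, rest = segments[:n_test], segments[n_test:]
    # Distribute the remaining segments round-robin into n_shards
    # train/validation shards, one per training process.
    shards = [rest[i::n_shards] for i in range(n_shards)]
    return test, shards

segments = [f"segment-{i}" for i in range(100)]
test, shards = split_corpus(segments)
print(len(test), len(shards), len(shards[0]))  # 20 10 8
```

Round-robin sharding keeps the shards the same size (to within one segment) regardless of the corpus length.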

## Evaluation

The BLEU evaluation score was produced using the python library [SacreBLEU](https://github.com/mjpost/sacrebleu).

## Usage

Ensure you have the prerequisite python libraries installed:

```bash
pip install transformers sentencepiece
```

```python
import transformers

model_id = "mgrbyte/mt-dspec-legislation-en-cy"
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
model = transformers.AutoModelForSeq2SeqLM.from_pretrained(model_id)
translate = transformers.pipeline("translation", model=model, tokenizer=tokenizer)
translated = translate(
    "Mae gan Lywodraeth Cymru darged i gyrraedd miliwn o siaradwyr Cymraeg erbyn y flwyddyn 2020."
)
print(translated[0]["translation_text"])
```