| | --- |
| | language: |
| | - cy |
| | - en |
| | license: apache-2.0 |
| | pipeline_tag: translation |
| | tags: |
| | - translation |
| | - marian |
| | metrics: |
| | - bleu |
| | widget: |
| | - text: "Mae gan Lywodraeth Cymru targed i gyrraedd miliwn o siaradwyr Cymraeg erbyn y flwyddyn 2020." |
| | model-index: |
| | - name: mt-general-cy-en |
| |   results: |
| |   - task: |
| |       name: Translation |
| |       type: translation |
| |     metrics: |
| |     - type: bleu |
| |       value: 54 |
| | --- |
| | # mt-general-cy-en |
| | A general language translation model for translating from Welsh to English. |
| |
|
| | This model was trained using a custom DVC pipeline employing [Marian NMT](https://marian-nmt.github.io/). |
| | The datasets were prepared from the following sources: |
| | - [UK Government Legislation data](https://www.legislation.gov.uk) |
| | - [OPUS-cy-en](https://opus.nlpl.eu/) |
| | - [Cofnod Y Cynulliad](https://record.assembly.wales/) |
| | - [Cofion Techiaith Cymru](https://cofion.techiaith.cymru) |
| |
|
| | The data was split into train, validation and test sets, with the test set comprising a random slice of 20% of the total dataset. Segments were selected randomly from |
| | the text and TMX data in the datasets described above. |
| | The datasets were cleaned without any pre-tokenisation, using a SentencePiece vocabulary model, and then fed into 10 separate Marian NMT training processes, |
| | the data having been split into 10 training and validation sets. |
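
The random split described above can be sketched as follows. This is a minimal illustration rather than the pipeline's actual code: the function name `split_segments`, the fixed seed, and the 10% validation fraction are assumptions made for the example. Only the 20% test fraction comes from the description.

```python
import random


def split_segments(pairs, test_frac=0.2, valid_frac=0.1, seed=0):
    """Shuffle (source, target) pairs and split them into train/validation/test.

    Only test_frac=0.2 is taken from the model card; valid_frac and seed
    are hypothetical values chosen for illustration.
    """
    rng = random.Random(seed)
    shuffled = list(pairs)  # copy so the caller's list is left untouched
    rng.shuffle(shuffled)

    n_test = int(len(shuffled) * test_frac)
    n_valid = int(len(shuffled) * valid_frac)

    test = shuffled[:n_test]
    valid = shuffled[n_test:n_test + n_valid]
    train = shuffled[n_test + n_valid:]
    return train, valid, test


# Toy parallel data standing in for the real Welsh-English segments.
pairs = [(f"cy segment {i}", f"en segment {i}") for i in range(100)]
train, valid, test = split_segments(pairs)
print(len(train), len(valid), len(test))
```

Seeding the generator keeps the split reproducible between runs, which matters when the same test set has to be reused to compare models.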
| |
|
| | ## Evaluation |
| |
|
| | The BLEU evaluation score was produced using the Python library [SacreBLEU](https://github.com/mjpost/sacrebleu). |
| | ## Usage |
| |
|
| | Ensure you have the prerequisite Python libraries installed: |
| |
|
| | ```bash |
| | pip install transformers sentencepiece |
| | ``` |
| |
|
| | ```python |
| | import transformers |
| | model_id = "mgrbyte/mt-general-cy-en" |
| | tokenizer = transformers.AutoTokenizer.from_pretrained(model_id) |
| | model = transformers.AutoModelForSeq2SeqLM.from_pretrained(model_id) |
| | translate = transformers.pipeline("translation", model=model, tokenizer=tokenizer) |
| | # The pipeline returns a list with one dict per input text. |
| | translated = translate( |
| |     "Mae gan Lywodraeth Cymru targed i gyrraedd miliwn o siaradwyr Cymraeg erbyn y flwyddyn 2020." |
| | ) |
| | print(translated[0]["translation_text"]) |
| | ``` |
| |
|