	---
	language:
	- cy
	- en
	license: apache-2.0
	pipeline_tag: translation
	tags:
	- translation
	- marian
	metrics:
	- bleu
	widget:
	- text: "Mae gan Lywodraeth Cymru targed i gyrraedd miliwn o siariadwyr Cymraeg erbyn y flwyddyn 2020."
	model-index:
	- name: mt-general-cy-en
	  results:
	  - task:
	      name: Translation
	      type: translation
	    metrics:
	    - type: bleu
	      value: 54
	---
	# mt-general-cy-en
	A general-purpose machine translation model for translating Welsh to English.

	This model was trained using a custom DVC pipeline employing [Marian NMT](https://marian-nmt.github.io/).
	The training datasets were generated from the following sources:
	- [UK Government Legislation data](https://www.legislation.gov.uk)
	- [OPUS-cy-en](https://opus.nlpl.eu/)
	- [Cofnod Y Cynulliad](https://record.assembly.wales/)
	- [Cofion Techiaith Cymru](https://cofion.techiaith.cymru)

	The data was split into train, validation and test sets, with the test set comprising a random slice of 20% of the total dataset. Segments were selected randomly from the text and TMX data in the datasets described above.
	The datasets were cleaned without any pre-tokenisation, utilising a SentencePiece vocabulary model, and then fed into 10 separate Marian NMT training processes, the data having been split into 10 training and validation sets.
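The splitting step described above can be sketched roughly as follows. This is a minimal illustration, not the actual DVC pipeline: the function name `split_segments`, its parameters, and the k-fold style layout of the 10 training and validation sets are all assumptions, since the model card does not say how the 10 sets were constructed.

```python
import random


def split_segments(segments, test_fraction=0.2, n_splits=10, seed=0):
    """Hold out a random test slice, then build n_splits train/validation pairs.

    Assumption: the 10 sets are built k-fold style, so each split validates
    on a different tenth of the remaining data and trains on the rest.
    """
    rng = random.Random(seed)
    shuffled = list(segments)
    rng.shuffle(shuffled)

    # Random 20% slice of the whole dataset is reserved for testing.
    n_test = int(len(shuffled) * test_fraction)
    test = shuffled[:n_test]
    remainder = shuffled[n_test:]

    splits = []
    for k in range(n_splits):
        # Every n_splits-th segment (offset k) goes to validation.
        validation = remainder[k::n_splits]
        train = [seg for i, seg in enumerate(remainder) if i % n_splits != k]
        splits.append((train, validation))
    return test, splits


# Placeholder (Welsh, English) segment pairs, purely for illustration.
segments = [(f"cy segment {i}", f"en segment {i}") for i in range(100)]
test, splits = split_segments(segments)
print(len(test), len(splits), len(splits[0][0]), len(splits[0][1]))
```

Each `(train, validation)` pair would then feed one of the separate training processes, while `test` stays untouched until evaluation.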

	## Evaluation

	The BLEU evaluation score was produced using the Python library [SacreBLEU](https://github.com/mjpost/sacrebleu).
	## Usage

	Ensure you have the prerequisite Python libraries installed:

	```bash
	pip install transformers sentencepiece
	```

	```python
	import transformers

	model_id = "mgrbyte/mt-general-cy-en"
	# Load the tokenizer and the sequence-to-sequence model from the Hub.
	tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
	model = transformers.AutoModelForSeq2SeqLM.from_pretrained(model_id)
	translate = transformers.pipeline("translation", model=model, tokenizer=tokenizer)
	translated = translate(
	    "Mae gan Lywodraeth Cymru targed i gyrraedd miliwn o siariadwyr Cymraeg erbyn y flwyddyn 2020."
	)
	# The pipeline returns a list with one dict per input sentence.
	print(translated[0]["translation_text"])
	```
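To translate several sentences, the pipeline can also be called with a list of strings, in which case it returns one result dict per input. The helper below is a minimal sketch of that pattern. `translate_batch` and `batch_size` are hypothetical names introduced here for illustration, and `translate` is expected to be a pipeline object like the one created above.

```python
def translate_batch(translate, sentences, batch_size=8):
    """Translate sentences in fixed-size chunks and return plain strings.

    `translate` is any callable that maps a list of strings to a list of
    dicts with a "translation_text" key, as a translation pipeline does.
    """
    results = []
    for start in range(0, len(sentences), batch_size):
        chunk = sentences[start:start + batch_size]
        outputs = translate(chunk)
        results.extend(output["translation_text"] for output in outputs)
    return results


# Usage with the pipeline from the example above (requires the model download):
#   for line in translate_batch(translate, ["Bore da.", "Diolch yn fawr."]):
#       print(line)
```

Chunking keeps memory use predictable for long inputs; a suitable `batch_size` depends on the hardware available.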