---
language:
- en
- ha
license: mit
tags:
- translation
- machine-translation
- low-resource
- english
- hausa
datasets:
- custom
metrics:
- bleu
library_name: transformers
pipeline_tag: translation
model-index:
- name: localenlp-en-hau
results:
- task:
name: Translation
type: translation
dataset:
name: English-Hausa Custom Dataset
type: custom
size: 15k
metrics:
- name: BLEU
type: bleu
value: 39
---
# Model Card for `LOCALENLP/eng_hau`
Fine-tuned MarianMT model for English-to-Hausa translation.
This is a machine translation model for **English → Hausa**, developed by the **LOCALENLP** organization.
It is based on the pretrained `Helsinki-NLP/opus-mt-en-mul` MarianMT model and fine-tuned on a custom parallel corpus of ~15k sentence pairs.
---
## Model Details
### Model Description
- **Developed by:** Mgolo
- **Funded by:** N/A
- **Shared by:** Mgolo
- **Model type:** Seq2Seq Transformer (MarianMT)
- **Languages:** English → Hausa
- **License:** MIT
- **Finetuned from model:** [Helsinki-NLP/opus-mt-en-mul](https://huggingface.co/Helsinki-NLP/opus-mt-en-mul)
### Model Sources
- **Repository:** https://huggingface.co/LOCALENLP/eng_hau
- **Demo:** [Gradio web app](https://huggingface.co/spaces/LocaleNLP/english_hausa)
---
## Uses
### Direct Use
- Translate English text into Hausa for research, education, and communication.
- Useful for low-resource NLP tasks, digital content creation, and cultural preservation.
### Downstream Use
- Can be integrated into translation apps, chatbots, and education platforms.
- Serves as a base for further fine-tuning on domain-specific Hausa corpora.
### Out-of-Scope Use
- Not suitable for high-stakes legal or medical translation (e.g., contracts, prescriptions, medical records).
- Like any automated system, it can mistranslate; human review is recommended before publishing output.
---
## Bias, Risks, and Limitations
- Training data is from a custom collection of parallel sentences (~15k pairs).
- Some informal or culturally nuanced expressions may not be accurately translated.
- Hausa spelling and grammar variation (Latin script) may lead to inconsistencies.
- Model may underperform on domain-specific or long, complex texts.
### Recommendations
- Use human post-editing for high-stakes use cases.
- Evaluate performance on your target domain before deployment.
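For domain evaluation, production setups typically use a standard tool such as `sacrebleu` against a held-out test set. As a quick sanity check, here is a minimal pure-Python sketch of corpus-level BLEU (up to 4-grams, single reference, whitespace tokenization, no smoothing); the function name and simplifications are illustrative, and this is not the exact procedure behind the BLEU score reported above.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU with brevity penalty; one reference per hypothesis."""
    matches = [0] * max_n   # clipped n-gram matches, per order
    totals = [0] * max_n    # hypothesis n-gram counts, per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            # Counter intersection implements clipping against the reference.
            matches[n - 1] += sum((ngrams(h, n) & ngrams(r, n)).values())
            totals[n - 1] += max(len(h) - n + 1, 0)
    if min(matches) == 0:
        return 0.0  # unsmoothed BLEU is zero if any order has no match
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty discourages overly short hypotheses.
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * bp * math.exp(log_precision)

print(corpus_bleu(["a b c d"], ["a b c d"]))  # identical pair scores 100.0
```

Scoring a real domain means running the model over your test sentences and passing the outputs as `hypotheses` with the gold translations as `references`.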
---
## How to Get Started with the Model
```python
from transformers import MarianTokenizer, AutoModelForSeq2SeqLM
model_name = "LOCALENLP/eng_hau"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
text = "Good evening, how was your day?"
# Multilingual MarianMT checkpoints select the output language with a
# target-language token; ">>hau<<" requests Hausa.
inputs = tokenizer(">>hau<< " + text, return_tensors="pt", padding=True, truncation=True)
outputs = model.generate(**inputs, max_length=512, num_beams=4)
translation = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("English:", text)
print("Hausa:", translation)
```