---
language:
- en
- ha
license: mit
tags:
- translation
- machine-translation
- low-resource
- english
- hausa
datasets:
- custom
metrics:
- bleu
library_name: transformers
pipeline_tag: translation
model-index:
- name: localenlp-en-hau
results:
- task:
name: Translation
type: translation
dataset:
name: English-Hausa Custom Dataset
type: custom
size: 15k
metrics:
- name: BLEU
type: bleu
value: 39
---
# Model Card for `LOCALENLP/eng_hau`
Fine-tuned MarianMT model for English-to-Hausa translation.
This is a machine translation model for **English → Hausa**, developed by the **LOCALENLP** organization.
It is based on the pretrained `Helsinki-NLP/opus-mt-en-mul` MarianMT model and fine-tuned on a custom parallel corpus of ~15k sentence pairs.
---
## Model Details
### Model Description
- **Developed by:** Mgolo
- **Funded by:** N/A
- **Shared by:** Mgolo
- **Model type:** Seq2Seq Transformer (MarianMT)
- **Languages:** English → Hausa
- **License:** MIT
- **Finetuned from model:** [Helsinki-NLP/opus-mt-en-mul](https://huggingface.co/Helsinki-NLP/opus-mt-en-mul)
### Model Sources
- **Repository:** https://huggingface.co/LOCALENLP/eng_hau
- **Demo:** [Gradio web app](https://huggingface.co/spaces/LocaleNLP/english_hausa)
---
## Uses
### Direct Use
- Translate English text into Hausa for research, education, and communication.
- Useful for low-resource NLP tasks, digital content creation, and cultural preservation.
### Downstream Use
- Can be integrated into translation apps, chatbots, and education platforms.
- Serves as a base for further fine-tuning on domain-specific Hausa corpora.
### Out-of-Scope Use
- Not suitable for high-stakes legal or medical translation (e.g., contracts, prescriptions, medical records).
- Like any automated system, it can mistranslate; human review is recommended before publishing output.
---
## Bias, Risks, and Limitations
- Training data is from a custom collection of parallel sentences (~15k pairs).
- Some informal or culturally nuanced expressions may not be accurately translated.
- Hausa spelling and grammar variation (Latin script) may lead to inconsistencies.
- Model may underperform on domain-specific or long, complex texts.
### Recommendations
- Use human post-editing for high-stakes use cases.
- Evaluate performance on your target domain before deployment.
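For domain evaluation, production setups typically use a standard tool such as `sacrebleu` against a held-out test set. As a quick sanity check, here is a minimal pure-Python sketch of corpus-level BLEU (up to 4-grams, single reference, whitespace tokenization, no smoothing); the function name and simplifications are illustrative, and this is not the exact procedure behind the BLEU score reported above.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU with brevity penalty; one reference per hypothesis."""
    matches = [0] * max_n   # clipped n-gram matches, per order
    totals = [0] * max_n    # hypothesis n-gram counts, per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            # Counter intersection implements clipping against the reference.
            matches[n - 1] += sum((ngrams(h, n) & ngrams(r, n)).values())
            totals[n - 1] += max(len(h) - n + 1, 0)
    if min(matches) == 0:
        return 0.0  # unsmoothed BLEU is zero if any order has no match
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty discourages overly short hypotheses.
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * bp * math.exp(log_precision)

print(corpus_bleu(["a b c d"], ["a b c d"]))  # identical pair scores 100.0
```

Scoring a real domain means running the model over your test sentences and passing the outputs as `hypotheses` with the gold translations as `references`.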
---
## How to Get Started with the Model
```python
from transformers import MarianTokenizer, AutoModelForSeq2SeqLM
model_name = "LOCALENLP/eng_hau"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
text = "Good evening, how was your day?"
# Multilingual MarianMT checkpoints select the output language with a
# target-language token; ">>hau<<" requests Hausa.
inputs = tokenizer(">>hau<< " + text, return_tensors="pt", padding=True, truncation=True)
outputs = model.generate(**inputs, max_length=512, num_beams=4)
translation = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("English:", text)
print("Hausa:", translation)
```