---
language:
- en
- sw
tags:
- translation
- en-sw
- helsinki-nlp
- fine-tuned
license: apache-2.0
metrics:
- bleu
base_model: Helsinki-NLP/opus-mt-en-sw
model-index:
- name: swahili-model-v2
results:
- task:
type: translation
name: Translation English-to-Swahili
dataset:
name: Swahili Parallel Corpus
type: text
metrics:
- name: Bleu
type: bleu
value: 41.6314
---
# Swahili-English Neural Machine Translation Model (V2)
## Model Description
**swahili-model-v2** is a Neural Machine Translation (NMT) model designed to translate text from English to Swahili.
This model is a fine-tuned version of the **Helsinki-NLP/opus-mt-en-sw** Transformer, adapted specifically for high-accuracy translation tasks using a curated parallel corpus.
By leveraging Transfer Learning on a substantial dataset of **281,000 sentence pairs**, this model achieves a **BLEU score of 41.63** on the validation set.
It demonstrates professional-grade grammatical fluency and robust vocabulary alignment, significantly outperforming the [V1 baseline model trained from scratch](https://huggingface.co/codeshujaaa/swahili-model-V1).
## Dataset Characteristics
The model was trained on a specific subset of the Swahili-English Parallel Corpus.
Prior to training, a comprehensive Exploratory Data Analysis was conducted to ensure data quality and alignment.
### Sentence Length Distribution
The dataset follows a long-tailed distribution typical of natural language corpora.
Most sentences are between 5 and 20 words long, which is optimal for training.
**Figure 3: English Sentence Lengths**
![image](https://cdn-uploads.huggingface.co/production/uploads/6950f0ce20339b949d1af441/dZF5mGvVd9kGDKIFZ5jdr.png)
**Figure 4: Swahili Sentence Lengths**
![image](https://cdn-uploads.huggingface.co/production/uploads/6950f0ce20339b949d1af441/MzcVh3mCRU1E9VAVSC-HM.png)
### Source-Target Alignment
There is a strong linear correlation between English and Swahili sentence lengths.
This indicates a high-quality parallel corpus with few alignment errors.
**Figure 5: Length Correlation**
> *The regression line shows a consistent mapping ratio between source and target lengths.*
![image](https://cdn-uploads.huggingface.co/production/uploads/6950f0ce20339b949d1af441/jhWpYgVHM1t_Mvf8KZOt2.png)
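The length correlation shown above can be computed directly from token counts. A minimal sketch using NumPy, with hypothetical length data standing in for the real corpus:

```python
import numpy as np

# Hypothetical token counts for a few aligned sentence pairs
en_lengths = np.array([5, 8, 12, 20, 15, 7, 9])
sw_lengths = np.array([5, 7, 11, 18, 14, 6, 9])

# Pearson correlation between source and target lengths
r = np.corrcoef(en_lengths, sw_lengths)[0, 1]

# Slope of the least-squares regression line (the mapping ratio)
slope, intercept = np.polyfit(en_lengths, sw_lengths, 1)
print(round(r, 3), round(slope, 2))
```

Values of `r` close to 1.0 on the real corpus would support the alignment-quality claim.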
## Training Details
### Dataset Configuration
* **Source:** Swahili-English Parallel Corpus.
* **Size:** 281,000 sentence pairs.
* **Split:** 90% Training, 10% Validation.
* **Preprocessing:** Tokenization using the Helsinki-NLP SentencePiece tokenizer with dynamic padding.
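The preprocessing step above can be sketched with the standard `transformers` API. The column names `en`/`sw` and `max_length=128` are illustrative assumptions, not details from the training run:

```python
from transformers import AutoTokenizer, DataCollatorForSeq2Seq

tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-sw")

def preprocess(batch):
    # Tokenize English sources; truncation bounds sequence length
    model_inputs = tokenizer(batch["en"], truncation=True, max_length=128)
    # Tokenize Swahili targets to serve as labels
    labels = tokenizer(text_target=batch["sw"], truncation=True, max_length=128)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# Dynamic padding: each batch is padded to its own longest sequence
collator = DataCollatorForSeq2Seq(tokenizer, padding=True)
```

`DataCollatorForSeq2Seq` is what makes the padding dynamic: sequences are padded at collation time rather than to a fixed corpus-wide maximum.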
## Performance and Evaluation
The model was evaluated using the **BLEU (Bilingual Evaluation Understudy)** metric, which measures the similarity between the machine-generated translation and professional human reference translations.
### Evaluation Results
* **BLEU Score:** 41.63
* **Validation Loss:** 0.8659
These metrics indicate high translation quality, with the model successfully capturing complex sentence structures rather than performing simple word-for-word substitution.
### Training Results Table
The following table summarizes the model's performance metrics across all 5 training epochs.
| Epoch | Training Loss | Validation Loss | BLEU Score |
| :--- | :--- | :--- | :--- |
| 1.0 | 1.0863 | 1.0334 | 28.48 |
| 2.0 | 0.9630 | 0.9337 | 32.06 |
| 3.0 | 0.8826 | 0.8913 | 37.38 |
| 4.0 | 0.8266 | 0.8708 | 40.07 |
| **5.0** | **0.8036** | **0.8659** | **41.63** |
### Training Metrics
The training process demonstrated stable convergence with no signs of overfitting. The graphs below illustrate the progression of the BLEU score and Training Loss over 5 epochs.
**Figure 1: BLEU Score Progression**
> *The model achieved a rapid increase in translation quality, stabilizing above 40 BLEU by the final epoch.*
![image](https://cdn-uploads.huggingface.co/production/uploads/6950f0ce20339b949d1af441/AYWqSoChtREEtYK6lgp6m.png)
**Figure 2: Loss Convergence**
> *Validation loss consistently decreased, confirming that the model effectively generalized to unseen data.*
![image](https://cdn-uploads.huggingface.co/production/uploads/6950f0ce20339b949d1af441/DhHy0dsLqbZg0pOEGiLfy.png)
## Intended Uses and Limitations
### Intended Uses
This model is suitable for general-purpose translation tasks, including:
* **Educational Tools:** Assisting learners in understanding English-Swahili sentence structures.
* **Content Localization:** Translating web content, documentation, or simple narratives into Swahili.
* **Communication Aids:** Facilitating basic written communication across language barriers.
* **NLP Research:** Serving as a baseline for low-resource language modeling experiments.
### Limitations
* **Domain Specificity:** The model may struggle with highly technical, medical, or legal jargon that was not present in the training corpus.
* **Context Length:** As a sentence-level translator, it may lose context when translating very long paragraphs as a single block.
* **Dialect Variations:** Swahili has multiple dialects; this model aligns primarily with standard Swahili (Kiswahili Sanifu) and may not accurately capture regional slang or informal variations (Sheng).
## Usage
You can use this model directly with the Hugging Face `transformers` library.
### Python Example
```python
from transformers import pipeline
# Load the translation pipeline
translator = pipeline("translation", model="codeshujaaa/swahili-model-v2")
# Define input text
text = "I am learning to speak Swahili today."
# Generate translation
translation = translator(text)
print(translation[0]['translation_text'])
```
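For finer control over decoding (e.g. beam width or output length), the model can also be loaded directly rather than through `pipeline`. The generation parameters below are illustrative choices, not the settings used during evaluation:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "codeshujaaa/swahili-model-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Encode the English input and generate a Swahili translation
inputs = tokenizer("I am learning to speak Swahili today.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```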
## Citation
If you use this model in your work, please cite the original architecture authors and this repository:
```
@misc{mwangi2025swahili,
  author       = {Denis Mwangi},
  title        = {Fine-Tuned Swahili-English Neural Machine Translation Model},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/codeshujaaa/swahili-model-v2}}
}
```