---
language:
- en
- sw
tags:
- translation
- en-sw
- helsinki-nlp
- fine-tuned
license: apache-2.0
metrics:
- bleu
base_model: Helsinki-NLP/opus-mt-en-sw
model-index:
- name: swahili-model-v2
results:
- task:
type: translation
name: Translation English-to-Swahili
dataset:
name: Swahili Parallel Corpus
type: text
metrics:
- name: Bleu
type: bleu
value: 41.6314
---
# Swahili-English Neural Machine Translation Model (V2)
## Model Description
**swahili-model-v2** is a Neural Machine Translation (NMT) model designed to translate text from English to Swahili.
This model is a fine-tuned version of the **Helsinki-NLP/opus-mt-en-sw** Transformer, adapted specifically for high-accuracy translation tasks using a curated parallel corpus.
By leveraging Transfer Learning on a substantial dataset of **281,000 sentence pairs**, this model achieves a **BLEU score of 41.63** on the validation set.
It demonstrates professional-grade grammatical fluency and robust vocabulary alignment, significantly outperforming the [V1 baseline model trained from scratch](https://huggingface.co/codeshujaaa/swahili-model-V1).
## Dataset Characteristics
The model was trained on a specific subset of the Swahili-English Parallel Corpus.
Prior to training, a comprehensive Exploratory Data Analysis was conducted to ensure data quality and alignment.
### Sentence Length Distribution
The dataset follows a long-tailed distribution typical of natural language corpora.
Most sentences are between 5 and 20 words long, which is optimal for training.
**Figure 3: English Sentence Lengths**
![image](https://cdn-uploads.huggingface.co/production/uploads/6950f0ce20339b949d1af441/dZF5mGvVd9kGDKIFZ5jdr.png)
**Figure 4: Swahili Sentence Lengths**
![image](https://cdn-uploads.huggingface.co/production/uploads/6950f0ce20339b949d1af441/MzcVh3mCRU1E9VAVSC-HM.png)
### Source-Target Alignment
There is a strong linear correlation between English and Swahili sentence lengths.
This indicates a high-quality parallel corpus with few alignment errors.
**Figure 5: Length Correlation**
> *The regression line shows a consistent mapping ratio between source and target lengths.*
![image](https://cdn-uploads.huggingface.co/production/uploads/6950f0ce20339b949d1af441/jhWpYgVHM1t_Mvf8KZOt2.png)
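The length correlation shown above can be computed directly from token counts. A minimal sketch using NumPy, with hypothetical length data standing in for the real corpus:

```python
import numpy as np

# Hypothetical token counts for a few aligned sentence pairs
en_lengths = np.array([5, 8, 12, 20, 15, 7, 9])
sw_lengths = np.array([5, 7, 11, 18, 14, 6, 9])

# Pearson correlation between source and target lengths
r = np.corrcoef(en_lengths, sw_lengths)[0, 1]

# Slope of the least-squares regression line (the mapping ratio)
slope, intercept = np.polyfit(en_lengths, sw_lengths, 1)
print(round(r, 3), round(slope, 2))
```

Values of `r` close to 1.0 on the real corpus would support the alignment-quality claim.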
## Training Details
### Dataset Configuration
* **Source:** Swahili-English Parallel Corpus.
* **Size:** 281,000 sentence pairs.
* **Split:** 90% Training, 10% Validation.
* **Preprocessing:** Tokenization using the Helsinki-NLP SentencePiece tokenizer with dynamic padding.
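The preprocessing step above can be sketched with the standard `transformers` API. The column names `en`/`sw` and `max_length=128` are illustrative assumptions, not details from the training run:

```python
from transformers import AutoTokenizer, DataCollatorForSeq2Seq

tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-sw")

def preprocess(batch):
    # Tokenize English sources; truncation bounds sequence length
    model_inputs = tokenizer(batch["en"], truncation=True, max_length=128)
    # Tokenize Swahili targets to serve as labels
    labels = tokenizer(text_target=batch["sw"], truncation=True, max_length=128)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# Dynamic padding: each batch is padded to its own longest sequence
collator = DataCollatorForSeq2Seq(tokenizer, padding=True)
```

`DataCollatorForSeq2Seq` is what makes the padding dynamic: sequences are padded at collation time rather than to a fixed corpus-wide maximum.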
## Performance and Evaluation
The model was evaluated using the **BLEU (Bilingual Evaluation Understudy)** metric, which measures the similarity between the machine-generated translation and professional human reference translations.
### Evaluation Results
* **BLEU Score:** 41.63
* **Validation Loss:** 0.8659
These metrics indicate high translation quality, with the model successfully capturing complex sentence structures rather than performing simple word-for-word substitution.
### Training Results Table
The following table summarizes the model's performance metrics across all 5 training epochs.
| Epoch | Training Loss | Validation Loss | BLEU Score |
| :--- | :--- | :--- | :--- |
| 1.0 | 1.0863 | 1.0334 | 28.48 |
| 2.0 | 0.9630 | 0.9337 | 32.06 |
| 3.0 | 0.8826 | 0.8913 | 37.38 |
| 4.0 | 0.8266 | 0.8708 | 40.07 |
| **5.0** | **0.8036** | **0.8659** | **41.63** |
### Training Metrics
The training process demonstrated stable convergence with no signs of overfitting. The graphs below illustrate the progression of the BLEU score and Training Loss over 5 epochs.
**Figure 1: BLEU Score Progression**
> *The model achieved a rapid increase in translation quality, stabilizing above 40 BLEU by the final epoch.*
![image](https://cdn-uploads.huggingface.co/production/uploads/6950f0ce20339b949d1af441/AYWqSoChtREEtYK6lgp6m.png)
**Figure 2: Loss Convergence**
> *Validation loss consistently decreased, confirming that the model effectively generalized to unseen data.*
![image](https://cdn-uploads.huggingface.co/production/uploads/6950f0ce20339b949d1af441/DhHy0dsLqbZg0pOEGiLfy.png)
## Intended Uses and Limitations
### Intended Uses
This model is suitable for general-purpose translation tasks, including:
* **Educational Tools:** Assisting learners in understanding English-Swahili sentence structures.
* **Content Localization:** Translating web content, documentation, or simple narratives into Swahili.
* **Communication Aids:** Facilitating basic written communication across language barriers.
* **NLP Research:** Serving as a baseline for low-resource language modeling experiments.
### Limitations
* **Domain Specificity:** The model may struggle with highly technical, medical, or legal jargon that was not present in the training corpus.
* **Context Length:** As a sentence-level translator, it may lose context when translating very long paragraphs as a single block.
* **Dialect Variations:** Swahili has multiple dialects; this model aligns primarily with standard Swahili (Kiswahili Sanifu) and may not accurately capture regional slang or informal variations (Sheng).
## Usage
You can use this model directly with the Hugging Face `transformers` library.
### Python Example
```python
from transformers import pipeline
# Load the translation pipeline
translator = pipeline("translation", model="codeshujaaa/swahili-model-v2")
# Define input text
text = "I am learning to speak Swahili today."
# Generate translation
translation = translator(text)
print(translation[0]['translation_text'])
```
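For finer control over decoding (e.g. beam width or output length), the model can also be loaded directly rather than through `pipeline`. The generation parameters below are illustrative choices, not the settings used during evaluation:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "codeshujaaa/swahili-model-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Encode the English input and generate a Swahili translation
inputs = tokenizer("I am learning to speak Swahili today.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```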
## Citation
If you use this model in your work, please cite the original architecture authors and this repository:
```
@misc{mwangi2025swahili,
  author       = {Denis Mwangi},
  title        = {Fine-Tuned Swahili-English Neural Machine Translation Model},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/codeshujaaa/swahili-model-v2}}
}
```