Yoruba-English Code-Switching Language Identification (LID) - Mini
This model is a highly efficient, fine-tuned version of Davlan/afro-xlmr-mini designed for token-level Language Identification (LID) in Yoruba-English code-switched text.
Research Highlights
- High Accuracy, Low Footprint: Achieves an Overall F1-score of 99.05%, nearly matching a "Large" baseline (550M parameters) while using a much smaller architecture (~17M parameters).
- Efficiency: Optimized for deployment in resource-constrained environments or high-throughput real-time applications.
- African Language Focus: Built upon the AfroXLM-R-Mini base, leveraging pre-training specifically tailored for African linguistic structures.
Performance Evaluation (Test Set)
The following results were obtained on the held-out test set (80,085 tokens); a short script for reproducing metrics of this kind is sketched after the metrics list below.
| Language | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Overall | 0.990 | 0.991 | 0.991 | 80,085 |
| English | 0.994 | 0.995 | 0.994 | 63,016 |
| Yoruba | 0.976 | 0.978 | 0.977 | 17,069 |
Evaluation Metrics
- Overall Accuracy: 99.56%
- Evaluation Loss: 0.0947
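Per-language figures like those in the table above can be reproduced from flattened token-level predictions with scikit-learn. A minimal sketch, assuming hypothetical gold/predicted label lists (`y_true`, `y_pred`; the "en"/"yo" tag names are illustrative) obtained by aligning model outputs with the test-set annotations:

```python
from sklearn.metrics import classification_report

# Hypothetical gold and predicted token-level labels; in practice these
# come from aligning the model's token predictions with the test-set
# annotations.
y_true = ["en", "en", "yo", "yo", "en", "yo"]
y_pred = ["en", "en", "yo", "en", "en", "yo"]

# Per-language precision/recall/F1 plus averages, analogous to the
# performance table above.
print(classification_report(y_true, y_pred, digits=3))
```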
Model Comparison (Ablation Study)
In our research, we compared this "Mini" architecture against a "Large" baseline to evaluate the trade-off between size and accuracy (a throughput-measurement sketch follows the table):
| Model | Parameters | Overall F1 | Speed (Samples/sec) |
|---|---|---|---|
| AfroXLM-R Large | 550M | 99.07% | ~300 |
| AfroXLM-R Mini | 17M | 99.05% | 1712 |
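The throughput column can be approximated by timing the pipeline over a batch of inputs. The sketch below is illustrative only: the sentence, repeat count, and batch size are arbitrary choices, and absolute numbers depend on hardware.

```python
import time
from transformers import pipeline

# GPU device 0; use device=-1 to benchmark on CPU instead.
lid_model = pipeline(
    "token-classification",
    model="Professor/yoruba-en-ner-model-small",
    device=0,
)

# Illustrative workload: 1,000 copies of one code-switched sentence.
samples = ["Ẹ jẹ́ kí á lọ si cinema to watch the latest movie."] * 1000

start = time.perf_counter()
lid_model(samples, batch_size=64)
elapsed = time.perf_counter() - start
print(f"Throughput: {len(samples) / elapsed:.0f} samples/sec")
```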
Training Procedure
Training Narrative
The model was trained for 5 epochs on an A100 GPU. We used a large global batch size (256) and mixed-precision training (BF16) to ensure stable, fast convergence. Unlike larger models, which can overfit rapidly on LID tasks, the Mini architecture showed a healthy learning curve, with validation loss decreasing steadily throughout training (a configuration sketch follows the hyperparameter list below).
Hyperparameters
- Learning Rate: 3e-05
- Global Batch Size: 256
- Optimizer: AdamW (Fused)
- LR Scheduler: Cosine Decay
- Warmup Ratio: 0.1
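For reference, a minimal sketch of how these settings map onto `transformers.TrainingArguments`. The per-device batch size / gradient-accumulation split and the output path are assumptions; any combination yielding a global batch of 256 matches the reported setup.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="yoruba-en-lid-mini",  # hypothetical output path
    num_train_epochs=5,
    learning_rate=3e-5,
    per_device_train_batch_size=64,   # assumption: 64 x 4 accumulation
    gradient_accumulation_steps=4,    #   = global batch size of 256
    optim="adamw_torch_fused",
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    bf16=True,                        # mixed precision, as reported
    eval_strategy="epoch",            # `evaluation_strategy` on older versions
    logging_steps=500,
)
```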
Training Logs
| Training Loss | Epoch | Step | Validation Loss |
|---|---|---|---|
| No log | 1.0 | 313 | 0.2852 |
| 0.4325 | 2.0 | 626 | 0.1622 |
| 0.4325 | 3.0 | 939 | 0.1153 |
| 0.1482 | 4.0 | 1252 | 0.1000 |
| 0.1039 | 5.0 | 1565 | 0.0977 |
Usage
```python
from transformers import pipeline

# Load the model directly from the Hub
lid_model = pipeline("token-classification", model="Professor/yoruba-en-ner-model-small")

text = "Ẹ jẹ́ kí á lọ si cinema to watch the latest movie."
results = lid_model(text)

for entity in results:
    print(f"Token: {entity['word']}, Language: {entity['entity']}")
```
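By default the pipeline returns one prediction per subword token. If whole-word spans are more convenient, the pipeline's built-in `aggregation_strategy` can merge consecutive pieces that share a label; note that aggregated outputs expose `entity_group` instead of `entity`.

```python
from transformers import pipeline

# Same model, with consecutive subword pieces that share a label merged
# into whole-word spans via the "simple" aggregation strategy.
lid_words = pipeline(
    "token-classification",
    model="Professor/yoruba-en-ner-model-small",
    aggregation_strategy="simple",
)

for span in lid_words("Ẹ jẹ́ kí á lọ si cinema to watch the latest movie."):
    # Aggregated outputs expose 'entity_group' instead of 'entity'.
    print(f"Span: {span['word']}, Language: {span['entity_group']}")
```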
Intended Uses & Limitations
This model is intended for researchers and developers working on bilingual text processing for Nigerian English and Yoruba. While the model is highly accurate on in-domain data, performance may vary on text with non-standard orthography or on code-switching that involves a third language (e.g., Nigerian Pidgin).
Citation
If you use this model in your research, please cite the original AfroXLM-R paper and this specific fine-tuned release.