Yoruba-English Code-Switching Language Identification (LID) - Mini

This model is a highly efficient, fine-tuned version of Davlan/afro-xlmr-mini designed for token-level Language Identification (LID) in Yoruba-English code-switched text.

Research Highlights

  • High Accuracy, Low Footprint: Achieved an Overall F1-score of 99.05%, matching the performance of "Large" models (550M parameters) while using a significantly smaller architecture (approx. 17M parameters).
  • Efficiency: Optimized for deployment in resource-constrained environments or high-throughput real-time applications.
  • African Language Focus: Built upon the AfroXLM-R-Mini base, leveraging pre-training specifically tailored for African linguistic structures.

Performance Evaluation (Test Set)

The following results were obtained on the held-out test set (~80k tokens):

| Language | Precision | Recall | F1-Score | Support |
|----------|-----------|--------|----------|---------|
| Overall  | 0.990     | 0.991  | 0.991    | 80,085  |
| English  | 0.994     | 0.995  | 0.994    | 63,016  |
| Yoruba   | 0.976     | 0.978  | 0.977    | 17,069  |
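The overall row is the support-weighted average of the per-language scores. A quick sanity check using the rounded per-language values from the table (agreement with the reported overall figures is therefore only up to rounding):

```python
# Support-weighted average of the per-language scores from the table above.
# Per-language values are rounded to three decimals, so the recomputed
# overall F1 may differ from the reported 0.991 in the last digit.
per_language = {
    "English": {"precision": 0.994, "recall": 0.995, "f1": 0.994, "support": 63016},
    "Yoruba":  {"precision": 0.976, "recall": 0.978, "f1": 0.977, "support": 17069},
}

total = sum(v["support"] for v in per_language.values())  # 80,085 tokens

def weighted(metric):
    return sum(v[metric] * v["support"] for v in per_language.values()) / total

print(f"support-weighted precision: {weighted('precision'):.3f}")
print(f"support-weighted recall:    {weighted('recall'):.3f}")
print(f"support-weighted F1:        {weighted('f1'):.3f}")
```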

Evaluation Metrics

  • Overall Accuracy: 99.56%
  • Evaluation Loss: 0.0947

Model Comparison (Ablation Study)

In our research, we compared this "Mini" architecture against a "Large" baseline to evaluate the trade-off between size and accuracy:

| Model           | Parameters | Overall F1 | Speed (samples/sec) |
|-----------------|------------|------------|---------------------|
| AfroXLM-R Large | 550M       | 99.07%     | ~300                |
| AfroXLM-R Mini  | 17M        | 99.05%     | 1712                |
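To put the trade-off in perspective, the figures in the table imply that the Mini model retains essentially all of the Large model's F1 while using about 3% of the parameters and running roughly 5.7× faster:

```python
# Size/speed trade-off computed from the comparison table above.
large = {"params_m": 550, "f1": 99.07, "throughput": 300}   # throughput is approximate (~300)
mini  = {"params_m": 17,  "f1": 99.05, "throughput": 1712}

param_fraction = mini["params_m"] / large["params_m"]
speedup = mini["throughput"] / large["throughput"]
f1_gap = large["f1"] - mini["f1"]

print(f"parameter fraction: {param_fraction:.1%}")   # ~3.1%
print(f"throughput speedup: {speedup:.1f}x")         # ~5.7x
print(f"F1 gap:             {f1_gap:.2f} points")    # 0.02
```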

Training Procedure

Training Narrative

The model was trained for 5 epochs on an A100 GPU. We utilized a large global batch size (256) and mixed-precision training (BF16) to ensure stable and fast convergence. Unlike larger models that may overfit rapidly on LID tasks, the Mini architecture showed a healthy learning curve with validation loss steadily decreasing throughout the process.

Hyperparameters

  • Learning Rate: 3e-05
  • Global Batch Size: 256
  • Optimizer: AdamW (Fused)
  • LR Scheduler: Cosine Decay
  • Warmup Ratio: 0.1
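For readers who want to reproduce a comparable run, the hyperparameters above map naturally onto a Hugging Face `TrainingArguments` configuration. This is a sketch, not the exact training script: the `output_dir` and the per-device/accumulation split of the global batch size are illustrative assumptions.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="lid-mini-checkpoints",  # hypothetical path, not from the original run
    num_train_epochs=5,
    learning_rate=3e-5,
    # Global batch size of 256; split here as 64 x 4 for illustration only.
    per_device_train_batch_size=64,
    gradient_accumulation_steps=4,
    optim="adamw_torch_fused",          # AdamW (Fused)
    lr_scheduler_type="cosine",         # cosine decay
    warmup_ratio=0.1,
    bf16=True,                          # mixed-precision training (BF16)
)
```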

Training Logs

| Training Loss | Epoch | Step | Validation Loss |
|---------------|-------|------|-----------------|
| No log        | 1.0   | 313  | 0.2852          |
| 0.4325        | 2.0   | 626  | 0.1622          |
| 0.4325        | 3.0   | 939  | 0.1153          |
| 0.1482        | 4.0   | 1252 | 0.1000          |
| 0.1039        | 5.0   | 1565 | 0.0977          |

Usage

```python
from transformers import pipeline

# Load the model directly from the Hub
lid_model = pipeline("token-classification", model="Professor/yoruba-en-ner-model-small")

text = "Ẹ jẹ́ kí á lọ si cinema to watch the latest movie."
results = lid_model(text)

for entity in results:
    print(f"Token: {entity['word']}, Language: {entity['entity']}")
```
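The pipeline emits one prediction per (sub)token; for display it is often convenient to merge consecutive tokens that share a label into contiguous language spans. A minimal, model-agnostic helper is sketched below; the input is hand-written in the pipeline's output format, not real model output, and the label names `YOR`/`ENG` are placeholders (check the model's `id2label` mapping for the actual label set):

```python
def merge_language_spans(token_results):
    """Merge consecutive token predictions sharing a label into (label, text) spans."""
    spans = []
    for tok in token_results:
        label, word = tok["entity"], tok["word"]
        if spans and spans[-1][0] == label:
            spans[-1] = (label, spans[-1][1] + " " + word)
        else:
            spans.append((label, word))
    return spans

# Hand-written example in the token-classification output format
# (labels YOR/ENG are assumptions, not the model's confirmed label names):
fake_results = [
    {"word": "Ẹ", "entity": "YOR"},
    {"word": "jẹ́", "entity": "YOR"},
    {"word": "kí", "entity": "YOR"},
    {"word": "cinema", "entity": "ENG"},
    {"word": "to", "entity": "ENG"},
    {"word": "watch", "entity": "ENG"},
]

print(merge_language_spans(fake_results))
# [('YOR', 'Ẹ jẹ́ kí'), ('ENG', 'cinema to watch')]
```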

Intended Uses & Limitations

This model is intended for researchers and developers working on bilingual text processing for Nigerian English and Yoruba. While highly accurate, users should note that performance may vary on text with non-standard orthography or code-switching involving third languages (e.g., Nigerian Pidgin).

Citation

If you use this model in your research, please cite the original AfroXLM-R paper and this specific fine-tuned release.
