How to use olaverse/lid-neural-5 with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("text-classification", model="olaverse/lid-neural-5")
# Load model directly
from transformers import AutoTokenizer, AutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained("olaverse/lid-neural-5")
model = AutoModelForSequenceClassification.from_pretrained("olaverse/lid-neural-5")

LID-Neural-5 is a transformer sequence classifier fine-tuned for high-accuracy language identification (LID) across five major languages spoken in Nigeria: Yoruba (yor), Igbo (ibo), Hausa (hau), Nigerian Pidgin (pcm), and English (eng).

It is fine-tuned from castorini/afriberta_large, a multilingual model with the XLM-RoBERTa architecture that was pre-trained on African languages, so its subword tokenizer and contextual representations are well suited to African-language text.
| Property | Value |
|---|---|
| Base Model | castorini/afriberta_large (XLM-RoBERTa, 125M parameters) |
| Model Type | Transformer Sequence Classification |
| Model Size | 484.03 MB |
| Supported Languages | Yoruba, Hausa, Igbo, Nigerian Pidgin, English |
| Testing Accuracy | 98.96% (Macro validation) |
| Average Latency | 13.30 ms per sentence (CPU/GPU) |
| Dependencies | PyTorch, Transformers |
| Language | Precision | Recall | F1-Score |
|---|---|---|---|
| Yoruba (yor) | 99.60% | 99.60% | 99.60% |
| Hausa (hau) | 99.60% | 99.20% | 99.40% |
| Igbo (ibo) | 98.79% | 98.20% | 98.50% |
| Nigerian Pidgin (pcm) | 99.20% | 98.80% | 99.00% |
| English (eng) | 97.63% | 99.00% | 98.31% |
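
As a cross-check, the per-language figures can be averaged and compared with the 98.96% reported in the properties table. The sketch below takes the unweighted (macro) mean of each column; the numbers are copied from the table, and the helper name `macro_average` is only illustrative. Macro recall equals overall accuracy only when every language has the same number of test sentences, which is an assumption here, since the card does not state the test set composition.

```python
# Per-language scores copied from the table: (precision, recall, F1) in percent.
scores = {
    "yor": (99.60, 99.60, 99.60),
    "hau": (99.60, 99.20, 99.40),
    "ibo": (98.79, 98.20, 98.50),
    "pcm": (99.20, 98.80, 99.00),
    "eng": (97.63, 99.00, 98.31),
}


def macro_average(column: int) -> float:
    """Unweighted mean of one metric column across all languages."""
    values = [row[column] for row in scores.values()]
    return sum(values) / len(values)


macro_precision = macro_average(0)
macro_recall = macro_average(1)
macro_f1 = macro_average(2)

print(f"macro precision: {macro_precision:.2f}%")  # prints 98.96%
print(f"macro recall:    {macro_recall:.2f}%")     # prints 98.96%
print(f"macro F1:        {macro_f1:.2f}%")         # prints 98.96%
```

All three macro averages round to 98.96%, which is consistent with the headline figure.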
The model is integrated directly into the olaverse Python library.
pip install olaverse[deeplearning]
# installs: torch, transformers
olaverse library (recommended)
from olaverse import LIDNeural5
# Automatically downloads and loads the model on demand
detector = LIDNeural5()
detector.load()
# 1. Predict dominant language
lang = detector.predict("Kedu ka ị mere today?")
print(f"Predicted language: {lang}") # → 'ibo'
# 2. Get probability distributions
probs = detector.predict_proba("How far, wetin dey happen?")
print(probs)
# → {'eng': 0.002, 'hau': 0.001, 'ibo': 0.003, 'pcm': 0.991, 'yor': 0.003}
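
Because the model always picks one of five labels, text in any other language (or heavily code-mixed text) still receives a prediction. One way to handle this is to threshold the dictionary returned by `predict_proba`. The following is a minimal sketch, not part of the olaverse API: the function name `pick_language`, the `0.80` cutoff, and the use of `"und"` (the ISO 639 code for "undetermined") as a fallback are all assumptions chosen for illustration.

```python
from typing import Dict


def pick_language(probs: Dict[str, float], min_confidence: float = 0.80) -> str:
    """Return the most probable label, or 'und' if it is below the cutoff.

    `probs` is a {label: probability} mapping shaped like the
    predict_proba output shown above. The cutoff is a hypothetical default.
    """
    if not probs:
        return "und"
    label, score = max(probs.items(), key=lambda item: item[1])
    return label if score >= min_confidence else "und"


# The distribution from the Pidgin example above: one clear winner.
confident = {"eng": 0.002, "hau": 0.001, "ibo": 0.003, "pcm": 0.991, "yor": 0.003}
print(pick_language(confident))  # prints: pcm

# A made-up, ambiguous distribution: no label clears the cutoff.
ambiguous = {"eng": 0.48, "hau": 0.01, "ibo": 0.02, "pcm": 0.47, "yor": 0.02}
print(pick_language(ambiguous))  # prints: und
```

A suitable cutoff depends on the application and is best tuned on held-out data.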
Hugging Face Transformers
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
tokenizer = AutoTokenizer.from_pretrained("olaverse/lid-neural-5")
model = AutoModelForSequenceClassification.from_pretrained("olaverse/lid-neural-5")
lid = pipeline("text-classification", model=model, tokenizer=tokenizer)
result = lid("Bawo ni, se daadaa ni?")
print(result) # → [{'label': 'yor', 'score': 0.9987}]
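
By default the pipeline returns only the top label. To get scores for all five languages, similar to `predict_proba` in the olaverse library, the text-classification pipeline accepts `top_k=None`. The sketch below is an illustration under a few assumptions: the helper names `to_distribution`, `load_pipeline`, and `detect_all` are hypothetical, and the extra unwrapping step is there because some Transformers versions nest a single input's scores in an outer list.

```python
from functools import lru_cache
from typing import Dict, List, Union

Scores = List[Dict[str, Union[str, float]]]


def to_distribution(result: Union[Scores, List[Scores]]) -> Dict[str, float]:
    """Flatten pipeline output into {label: score}, highest score first."""
    # Some Transformers versions wrap a single input's scores in an extra list.
    if result and isinstance(result[0], list):
        result = result[0]
    ranked = sorted(result, key=lambda item: item["score"], reverse=True)
    return {str(item["label"]): float(item["score"]) for item in ranked}


@lru_cache(maxsize=1)
def load_pipeline():
    """Build the pipeline once; downloads the model on first use."""
    # Imported lazily so the helper above works without torch installed.
    from transformers import pipeline

    return pipeline("text-classification", model="olaverse/lid-neural-5")


def detect_all(text: str) -> Dict[str, float]:
    """Score `text` against every label the model knows."""
    lid = load_pipeline()
    return to_distribution(lid(text, top_k=None))
```

Calling `detect_all("Bawo ni, se daadaa ni?")` requires network access the first time and should return a dictionary with one entry per language, ordered from most to least likely.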