LID-Lite-5
LID-Lite-5 is an ultra-lightweight language identification (LID) classifier for five major languages spoken in Nigeria: Yoruba (yor), Igbo (ibo), Hausa (hau), Nigerian Pidgin (pcm), and English (eng). It needs only Python and NumPy, with no deep learning dependencies.
It uses a customized word-level TF-IDF (unigram and bigram) feature representation coupled with a Logistic Regression classifier. The model weights are serialized into a single, compact JSON file (~1.1 MB).
Model Details
- Model Type: TF-IDF + Logistic Regression
- File Size: 1.10 MB (lid-lite-5.json)
- Vocabulary Size: 5,000 top n-gram terms
- Supported Languages: Yoruba (yor), Hausa (hau), Igbo (ibo), Nigerian Pidgin (pcm), and English (eng)
- Testing Accuracy: 98.12% (Macro validation)
- Average Latency: 0.0139 ms per sentence (extremely fast, runs on any CPU)
- Dependencies: Pure Python and NumPy (no deep learning packages required)
Performance Comparison
| Model | Accuracy (%) | Avg Latency (ms / sentence) | File Size (MB) |
|---|---|---|---|
| LIDLite5 | 98.12 | 0.0139 | 1.10 |
| LIDNeural5 (AfriBERTa) | 98.96 | 13.2967 | 484.03 |
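
To make the trade-off in the table concrete, the short sketch below derives the ratios from the reported figures. It uses only the numbers in the table. The throughput line is an extrapolation that assumes sentences are processed one after another with no extra overhead, so treat it as a rough estimate rather than a measured result.

```python
# Figures copied from the comparison table above.
lite = {"accuracy": 98.12, "latency_ms": 0.0139, "size_mb": 1.10}
neural = {"accuracy": 98.96, "latency_ms": 13.2967, "size_mb": 484.03}

speedup = neural["latency_ms"] / lite["latency_ms"]    # times faster per sentence
size_ratio = neural["size_mb"] / lite["size_mb"]       # times smaller on disk
accuracy_gap = neural["accuracy"] - lite["accuracy"]   # percentage points given up

# Assumption: sequential processing with no overhead beyond the reported latency.
throughput = 1000.0 / lite["latency_ms"]               # sentences per second

print(f"Speedup:      {speedup:.0f}x")
print(f"Size ratio:   {size_ratio:.0f}x")
print(f"Accuracy gap: {accuracy_gap:.2f} points")
print(f"Throughput:   {throughput:,.0f} sentences/s (estimate)")
```

On these numbers, LIDLite5 is roughly 957 times faster and 440 times smaller, at a cost of 0.84 accuracy points.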
Usage
The model is integrated directly into the olaverse Python library.
Installation
    pip install olaverse
Inference
    from olaverse import LIDLite5

    # Automatically downloads and loads the model on demand
    detector = LIDLite5()

    # 1. Predict the dominant language
    lang = detector.predict("Bawo ni, se daadaa ni?")
    print(f"Predicted language: {lang}")  # -> 'yor'

    # 2. Get the probability distribution over all five languages
    probs = detector.predict_proba("How far, wetin dey happen?")
    print(probs)
    # -> {'eng': 0.001, 'hau': 0.000, 'ibo': 0.000, 'pcm': 0.998, 'yor': 0.001}
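
When the input is short or code-switched, the top probability can be low, so it may be useful to reject uncertain predictions. The sketch below works on a plain dictionary shaped like the predict_proba output shown above. The helper name top_language and the min_confidence default of 0.8 are hypothetical choices for illustration and are not part of the olaverse API. The second distribution is made up to show the rejection path.

```python
def top_language(probs, min_confidence=0.8):
    """Return (label, score), or (None, score) if the best score is below the threshold.

    Hypothetical helper: not part of olaverse.
    """
    label, score = max(probs.items(), key=lambda item: item[1])
    if score < min_confidence:
        return None, score
    return label, score


# Distribution copied from the example output above.
confident = {"eng": 0.001, "hau": 0.000, "ibo": 0.000, "pcm": 0.998, "yor": 0.001}
print(top_language(confident))  # -> ('pcm', 0.998)

# Made-up distribution for an ambiguous sentence.
ambiguous = {"eng": 0.55, "hau": 0.0, "ibo": 0.0, "pcm": 0.45, "yor": 0.0}
print(top_language(ambiguous))  # -> (None, 0.55)
```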
Raw JSON Model Structure
If you want to use the raw weights in other languages (JavaScript, Go, Rust, C++), you can parse the hosted lid-lite-5.json directly. The JSON structure is:
    {
      "classes": ["eng", "hau", "ibo", "pcm", "yor"],
      "intercept": [0.123, -0.456, ...],
      "features": {
        "word": {
          "weights": [0.01, -0.05, 0.23, ...],
          "idf": 3.45
        },
        ...
      }
    }
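For custom implementations: split the text into word unigrams and bigrams, count each n-gram that appears in features, apply sublinear term-frequency scaling (1 + ln(count)), multiply by the term's idf (as in standard TF-IDF), L2-normalize the resulting vector, take its dot product with each class's weights, add the intercepts, and apply softmax to obtain probabilities. The order of each weights array and of intercept follows the order of classes.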
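
The steps above can be sketched in pure Python as follows. This is a minimal illustration under stated assumptions, not the library's actual implementation: the tokenizer (lowercasing plus a `\w+` regex), the space-joined bigram keys, and the multiplication by idf are all assumed, and toy_model is a made-up two-class model in the documented layout so the example runs without downloading anything.

```python
import math
import re
from collections import Counter

# Toy model in the documented JSON layout. All numbers are invented for
# illustration; a real run would json.load() the lid-lite-5.json file instead.
toy_model = {
    "classes": ["eng", "pcm"],
    "intercept": [0.0, 0.0],
    "features": {
        "wetin": {"weights": [-1.0, 2.0], "idf": 2.0},
        "how far": {"weights": [-0.5, 1.5], "idf": 2.5},
        "the": {"weights": [1.0, -0.5], "idf": 1.2},
    },
}


def ngrams(text):
    # Assumption: lowercase + \w+ tokens, bigrams joined by one space.
    tokens = re.findall(r"\w+", text.lower())
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams


def predict_proba(model, text):
    features = model["features"]
    counts = Counter(t for t in ngrams(text) if t in features)

    # Sublinear TF times IDF (the IDF step is assumed from the "idf" field).
    tfidf = {t: (1.0 + math.log(c)) * features[t]["idf"] for t, c in counts.items()}

    # L2 normalization; fall back to 1.0 when no known n-gram was found.
    norm = math.sqrt(sum(v * v for v in tfidf.values())) or 1.0

    # Dot product with each class's weights, plus the intercepts.
    logits = list(model["intercept"])
    for term, value in tfidf.items():
        for k, weight in enumerate(features[term]["weights"]):
            logits[k] += weight * value / norm

    # Numerically stable softmax.
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return {label: e / total for label, e in zip(model["classes"], exps)}


probs = predict_proba(toy_model, "How far, wetin dey happen?")
print(max(probs, key=probs.get))  # -> pcm
```

Before relying on a port like this, compare its output against detector.predict_proba on a handful of sentences, since any tokenization mismatch changes which features fire.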