LID-Lite-5

LID-Lite-5 is an ultra-lightweight language identification (LID) classifier for five major languages spoken in Nigeria: Yoruba (yor), Igbo (ibo), Hausa (hau), Nigerian Pidgin (pcm), and English (eng). It requires no deep learning framework, only Python and NumPy.

It uses a customized word-level TF-IDF (unigram and bigram) feature representation coupled with a Logistic Regression classifier. The model weights are serialized into a single, compact JSON file (~1.1 MB).

Model Details

Model Type: TF-IDF + Logistic Regression
File Size: 1.10 MB (lid-lite-5.json)
Vocabulary Size: 5,000 top n-gram terms
Supported Languages: Yoruba (yor), Hausa (hau), Igbo (ibo), Nigerian Pidgin (pcm), and English (eng)
Testing Accuracy: 98.12% (Macro validation)
Average Latency: 0.0139 ms per sentence on CPU (no GPU required)
Dependencies: Pure Python and NumPy (no deep learning packages required)

Performance Comparison

Model	Accuracy (%)	Avg Latency (ms / sentence)	File Size (MB)
LIDLite5	98.12	0.0139	1.10
LIDNeural5 (Afriberta)	98.96	13.2967	484.03
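
To put the table in perspective, the ratios between the two rows can be worked out directly. The short calculation below uses only the figures reported above; the variable names are illustrative.

```python
# Figures copied from the comparison table above.
lite = {"accuracy": 98.12, "latency_ms": 0.0139, "size_mb": 1.10}
neural = {"accuracy": 98.96, "latency_ms": 13.2967, "size_mb": 484.03}

# How many times faster LIDLite5 is per sentence.
speedup = neural["latency_ms"] / lite["latency_ms"]
# How many times smaller the LIDLite5 file is.
size_ratio = neural["size_mb"] / lite["size_mb"]
# Accuracy given up, in percentage points.
accuracy_gap = neural["accuracy"] - lite["accuracy"]

print(f"Speedup: {speedup:.0f}x")
print(f"Size ratio: {size_ratio:.0f}x")
print(f"Accuracy gap: {accuracy_gap:.2f} points")
```

On these numbers, LIDLite5 is roughly 957 times faster and 440 times smaller than LIDNeural5, at a cost of 0.84 accuracy points.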

Usage

The model is integrated directly into the olaverse Python library.

Installation

pip install olaverse

Inference

from olaverse import LIDLite5

# Automatically downloads and loads the model on demand
detector = LIDLite5()

# 1. Predict dominant language
lang = detector.predict("Bawo ni, se daadaa ni?")
print(f"Predicted language: {lang}") # -> 'yor'

# 2. Get probability distributions
probs = detector.predict_proba("How far, wetin dey happen?")
print(probs)
# -> {'eng': 0.001, 'hau': 0.000, 'ibo': 0.000, 'pcm': 0.998, 'yor': 0.001}
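
For short or mixed-language input, the top probability may be low. One possible way to handle this, sketched below, is to accept a prediction only when it clears a confidence threshold. The helper `top_language` and its `min_confidence` default are hypothetical and not part of olaverse; the helper works on any dictionary shaped like the `predict_proba` output shown above.

```python
def top_language(probs, min_confidence=0.6):
    """Return the most probable language code, or None if it is below the threshold.

    `probs` maps language codes to probabilities.
    `min_confidence` is an illustrative default, not a tuned value.
    """
    lang, score = max(probs.items(), key=lambda item: item[1])
    return lang if score >= min_confidence else None


# The distribution shown in the example above.
confident = {"eng": 0.001, "hau": 0.000, "ibo": 0.000, "pcm": 0.998, "yor": 0.001}
print(top_language(confident))  # -> pcm

# A made-up ambiguous distribution, for illustration only.
ambiguous = {"eng": 0.45, "hau": 0.0, "ibo": 0.0, "pcm": 0.50, "yor": 0.05}
print(top_language(ambiguous))  # -> None
```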

Raw JSON Model Structure

If you want to use the raw weights from other programming languages (JavaScript, Go, Rust, C++), you can parse the hosted lid-lite-5.json directly. The JSON structure is:

{
  "classes": ["eng", "hau", "ibo", "pcm", "yor"],
  "intercept": [0.123, -0.456, ...],
  "features": {
    "word": {
      "weights": [0.01, -0.05, 0.23, ...],
      "idf": 3.45
    },
    ...
  }
}

For custom implementations, count the vocabulary unigrams and bigrams in the input, apply sublinear term-frequency scaling (1 + ln(count)), multiply each term by its idf, L2-normalize the resulting vector, take its dot product with each class's weights, add the intercepts, and apply softmax to obtain probabilities.
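
The inference steps can be sketched in pure Python as follows. This is a minimal illustration under several assumptions, not the library's own implementation: the tokenizer (lowercasing plus a `\w+` regex), the space-joined bigram keys, the reading of each `weights` list as one weight per entry in `classes`, and the multiplication by the stored `idf` (as in standard TF-IDF) are all guesses based on the JSON layout above. The function names and the tiny `toy_model` are made up for the demonstration.

```python
import math
import re
from collections import Counter


def extract_ngrams(text):
    """Return word unigrams plus space-joined bigrams (assumed tokenization)."""
    tokens = re.findall(r"\w+", text.lower())
    bigrams = [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams


def predict_proba(model, text):
    """Score `text` against a model dict shaped like lid-lite-5.json."""
    features = model["features"]
    # Count only the n-grams that exist in the model vocabulary.
    counts = Counter(t for t in extract_ngrams(text) if t in features)

    # Sublinear term frequency (1 + ln(count)) times the stored idf (assumed).
    tfidf = {t: (1.0 + math.log(c)) * features[t]["idf"] for t, c in counts.items()}

    # L2-normalize; fall back to 1.0 when no known term was found.
    norm = math.sqrt(sum(v * v for v in tfidf.values())) or 1.0

    # Dot product with each class's weights, starting from the intercepts.
    scores = list(model["intercept"])
    for term, value in tfidf.items():
        for k, weight in enumerate(features[term]["weights"]):
            scores[k] += (value / norm) * weight

    # Numerically stable softmax.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return {label: e / total for label, e in zip(model["classes"], exps)}


# Toy two-class model with invented weights, for illustration only.
toy_model = {
    "classes": ["eng", "pcm"],
    "intercept": [0.0, 0.0],
    "features": {
        "wetin": {"weights": [-2.0, 2.0], "idf": 1.5},
        "what": {"weights": [2.0, -2.0], "idf": 1.5},
    },
}

probs = predict_proba(toy_model, "Wetin dey happen?")
print(max(probs, key=probs.get))  # -> pcm
```

With the real file, the same function could be given the result of `json.load` on a local copy of lid-lite-5.json, provided the assumptions above match how the model was actually trained.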
