LID-Lite-5

LID-Lite-5 is an ultra-lightweight language identification (LID) classifier with no deep learning dependencies, designed for five major languages spoken in Nigeria: Yoruba (yor), Igbo (ibo), Hausa (hau), Nigerian Pidgin (pcm), and English (eng).

It uses a customized word-level TF-IDF (unigram and bigram) feature representation coupled with a Logistic Regression classifier. The model weights are serialized into a single, compact JSON file (~1.1 MB).

Model Details

  • Model Type: TF-IDF + Logistic Regression
  • File Size: 1.10 MB (lid-lite-5.json)
  • Vocabulary Size: 5,000 top n-gram terms
  • Supported Languages: Yoruba (yor), Hausa (hau), Igbo (ibo), Nigerian Pidgin (pcm), and English (eng)
  • Testing Accuracy: 98.12% (Macro validation)
  • Average Latency: 0.0139 ms per sentence (CPU only, no GPU required)
  • Dependencies: Pure Python and NumPy (no deep learning packages required)

Performance Comparison

Model                  | Accuracy (%) | Avg Latency (ms / sentence) | File Size (MB)
-----------------------|--------------|-----------------------------|---------------
LIDLite5               | 98.12        | 0.0139                      | 1.10
LIDNeural5 (AfriBERTa) | 98.96        | 13.2967                     | 484.03
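
To put the trade-off in concrete terms, the ratios can be worked out directly from the comparison figures above. This is only arithmetic on the reported numbers; the variable names are illustrative, and the throughput line assumes the average latency holds for sustained, sequential use.

```python
# Figures copied from the performance comparison above.
lite = {"accuracy": 98.12, "latency_ms": 0.0139, "size_mb": 1.10}
neural = {"accuracy": 98.96, "latency_ms": 13.2967, "size_mb": 484.03}

speedup = neural["latency_ms"] / lite["latency_ms"]    # how much faster LIDLite5 is
size_ratio = neural["size_mb"] / lite["size_mb"]       # how much smaller LIDLite5 is
accuracy_gap = neural["accuracy"] - lite["accuracy"]   # percentage points given up
throughput = 1000.0 / lite["latency_ms"]               # implied sentences per second

print(f"Speedup:      {speedup:.0f}x")
print(f"Size ratio:   {size_ratio:.0f}x")
print(f"Accuracy gap: {accuracy_gap:.2f} points")
print(f"Throughput:   {throughput:.0f} sentences/s")
```

In other words, LIDLite5 gives up 0.84 accuracy points in exchange for a model that is roughly 950 times faster and 440 times smaller.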

Usage

The model is integrated directly into the olaverse Python library.

Installation

pip install olaverse

Inference

from olaverse import LIDLite5

# Automatically downloads and loads the model on demand
detector = LIDLite5()

# 1. Predict dominant language
lang = detector.predict("Bawo ni, se daadaa ni?")
print(f"Predicted language: {lang}") # -> 'yor'

# 2. Get probability distributions
probs = detector.predict_proba("How far, wetin dey happen?")
print(probs)
# -> {'eng': 0.001, 'hau': 0.000, 'ibo': 0.000, 'pcm': 0.998, 'yor': 0.001}
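
When the input is short or mixes languages, it can help to reject low-confidence predictions instead of always taking the top class. The helper below is a minimal sketch that works on a probability dictionary shaped like the predict_proba output shown above. The function name top_language, the 0.5 threshold, and the use of "und" (the ISO 639 code for undetermined) are illustrative choices, not part of the olaverse API.

```python
def top_language(probs, threshold=0.5):
    """Return (label, probability) for the most likely class.

    The label is replaced by "und" when the top probability is below
    `threshold` (a hypothetical cut-off; tune it on your own data).
    """
    label = max(probs, key=probs.get)
    p = probs[label]
    return (label if p >= threshold else "und", p)


# Distribution copied from the example output above.
confident = {"eng": 0.001, "hau": 0.000, "ibo": 0.000, "pcm": 0.998, "yor": 0.001}
print(top_language(confident))  # ('pcm', 0.998)

# A made-up ambiguous distribution, for illustration only.
ambiguous = {"eng": 0.45, "hau": 0.0, "ibo": 0.0, "pcm": 0.40, "yor": 0.15}
print(top_language(ambiguous))  # ('und', 0.45)
```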

Raw JSON Model Structure

If you want to use the raw weights from other programming languages (JavaScript, Go, Rust, C++), you can parse the hosted lid-lite-5.json directly. The JSON structure is:

{
  "classes": ["eng", "hau", "ibo", "pcm", "yor"],
  "intercept": [0.123, -0.456, ...],
  "features": {
    "word": {
      "weights": [0.01, -0.05, 0.23, ...],
      "idf": 3.45
    },
    ...
  }
}
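
Before porting the model, a quick structural check on the parsed file can catch mismatches early. The sketch below assumes that each feature's "weights" list holds one value per class, in the same order as "classes", and that "intercept" has one value per class. That layout is inferred from the structure above and is not confirmed by the model card. The function validate_model and the two-class toy model are hypothetical.

```python
import json


def validate_model(model):
    """Check the assumed layout; return (num_classes, num_features)."""
    n = len(model["classes"])
    if len(model["intercept"]) != n:
        raise ValueError("intercept length must match the number of classes")
    for term, entry in model["features"].items():
        # Assumption: one weight per class, in the order given by "classes".
        if len(entry["weights"]) != n:
            raise ValueError(f"feature {term!r}: expected {n} weights")
        if not isinstance(entry["idf"], (int, float)):
            raise ValueError(f"feature {term!r}: idf must be a number")
    return n, len(model["features"])


# Toy two-class model with made-up numbers, standing in for lid-lite-5.json.
TOY_JSON = """
{
  "classes": ["eng", "pcm"],
  "intercept": [0.1, -0.1],
  "features": {
    "the":   {"weights": [1.0, -1.0], "idf": 1.2},
    "wetin": {"weights": [-2.0, 2.0], "idf": 3.4}
  }
}
"""

toy_model = json.loads(TOY_JSON)
print(validate_model(toy_model))  # (2, 2)
```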

For custom implementations, compute the sublinear term frequency (1 + ln(count)) of each vocabulary term found in the input, multiply it by the term's idf, L2-normalize the resulting vector, take its dot product with each class's weights, add the class intercepts, and apply softmax to obtain probabilities.
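
The steps above can be sketched in pure Python as follows. This is a minimal illustration, not the library's implementation. The tokenizer (lowercasing plus a \w+ regular expression), the space-joined bigram keys, and the one-weight-per-class layout are all assumptions, and the toy model uses made-up numbers. The sketch also multiplies each term by its idf value from the JSON, as is standard for TF-IDF.

```python
import math
import re

TOKEN_RE = re.compile(r"\w+")  # assumption: the real tokenizer may differ


def ngrams(text):
    """Lowercased word unigrams plus space-joined bigrams (assumed key format)."""
    tokens = TOKEN_RE.findall(text.lower())
    return tokens + [" ".join(pair) for pair in zip(tokens, tokens[1:])]


def predict_proba(model, text):
    classes, features = model["classes"], model["features"]

    # Count only the n-grams that are in the model vocabulary.
    counts = {}
    for term in ngrams(text):
        if term in features:
            counts[term] = counts.get(term, 0) + 1

    # Sublinear term frequency times idf, then L2 normalization.
    vec = {t: (1.0 + math.log(c)) * features[t]["idf"] for t, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    if norm > 0:
        vec = {t: v / norm for t, v in vec.items()}

    # Dot product with each class's weights, plus the intercepts.
    scores = list(model["intercept"])
    for t, v in vec.items():
        for k, w in enumerate(features[t]["weights"]):
            scores[k] += v * w

    # Numerically stable softmax.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return {c: e / total for c, e in zip(classes, exps)}


# Toy two-class model with made-up numbers, for illustration only.
TOY = {
    "classes": ["eng", "pcm"],
    "intercept": [0.0, 0.0],
    "features": {
        "the": {"weights": [1.0, -1.0], "idf": 1.2},
        "wetin": {"weights": [-2.0, 2.0], "idf": 3.4},
        "wetin dey": {"weights": [-1.5, 1.5], "idf": 4.0},
    },
}

probs = predict_proba(TOY, "How far, wetin dey happen?")
print(max(probs, key=probs.get))  # pcm
```

An input with no in-vocabulary terms yields a zero vector, so the prediction falls back to the softmax of the intercepts alone.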
