mist models

naija-embed-base

Cross-lingual sentence embeddings for Nigerian languages (Hausa, Yoruba, Igbo). The model is contrastively fine-tuned from olaverse/mist-encoder-base-ng on general-domain synthetic parallel pairs. Clean English sentences (FineWeb, ODC-By) were machine-translated into ha/yo/ig with the MIT-licensed HelpMumHQ/AI-translator-eng-to-9ja, yielding English↔Nigerian and Nigerian↔Nigerian pairs that share an English source. Embeddings use mean pooling and are compared with cosine similarity.
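As a rough illustration of how pairs sharing one English source can be formed, the sketch below enumerates every two-language combination for a single record. The `build_pairs` helper and the placeholder sentences are hypothetical, and whether training used every combination is an assumption; the card only states that both English↔Nigerian and Nigerian↔Nigerian pairs were formed.

```python
from itertools import combinations


def build_pairs(record):
    """Return every two-language pairing of one translated record.

    `record` maps a language code to a sentence; all sentences are
    translations of the same English source (hypothetical layout).
    """
    langs = sorted(record)  # deterministic order: en, ha, ig, yo
    return [((a, record[a]), (b, record[b])) for a, b in combinations(langs, 2)]


# Placeholder text only; real records would hold actual sentences.
record = {
    "en": "<English source sentence>",
    "ha": "<Hausa translation>",
    "yo": "<Yoruba translation>",
    "ig": "<Igbo translation>",
}

pairs = build_pairs(record)
english_pairs = [p for p in pairs if "en" in (p[0][0], p[1][0])]
nigerian_pairs = [p for p in pairs if "en" not in (p[0][0], p[1][0])]
print(len(english_pairs), len(nigerian_pairs))
```

With four languages per record this gives six pairs: three that include English and three between Nigerian languages.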

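from sentence_transformers import SentenceTransformer

# Load the model from the Hugging Face Hub.
m = SentenceTransformer("olaverse/naija-embed-base")

# encode() returns one embedding vector per input sentence.
emb = m.encode(["sentence one", "sentence two"])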
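To make "mean pooling, cosine similarity" concrete, here is a minimal dependency-free sketch of both operations on plain Python lists. The `mean_pool` and `cosine` helpers are illustrative names, not part of any library, and the toy vectors are made up. In normal use `encode()` is expected to handle pooling for you, so only the similarity step would be needed.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding is skipped)."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, x in enumerate(vec):
                total[i] += x
    if count == 0:
        raise ValueError("attention mask selects no tokens")
    return [x / count for x in total]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy example: three token vectors, the last one is padding.
tokens = [[1.0, 0.0], [3.0, 2.0], [9.0, 9.0]]
mask = [1, 1, 0]
sentence_vec = mean_pool(tokens, mask)  # [2.0, 1.0]
print(sentence_vec, cosine(sentence_vec, [4.0, 2.0]))
```

Embeddings returned by `m.encode(...)` can be passed to `cosine` after converting each row with `.tolist()`, assuming the default NumPy output.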

Best for

Cross-lingual retrieval (e.g. Hausa query → Yoruba document), within-language semantic search, clustering, RAG, and deduplication over Nigerian-language text.
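A retrieval step over precomputed embeddings can be sketched as ranking documents by cosine similarity to the query. The `top_k` helper and the two-dimensional toy vectors below are hypothetical; in practice the vectors would come from `encode()`, and a vector index would replace the linear scan for large collections.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(query_vec, doc_vecs, k=3):
    """Return (doc_index, score) for the k documents most similar to the query."""
    scored = [(cosine(query_vec, d), i) for i, d in enumerate(doc_vecs)]
    # Highest score first; ties are broken by the lower document index.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [(i, score) for score, i in scored[:k]]


# Toy vectors standing in for a query embedding and three document embeddings.
query = [1.0, 0.0]
docs = [[0.0, 1.0], [1.0, 0.1], [-1.0, 0.0]]

results = top_k(query, docs, k=2)
print([i for i, _ in results])  # document 1 ranks first, then document 0
```

The same scores can drive deduplication by flagging pairs above a similarity threshold; a suitable threshold is not given in this card and would need to be tuned on your data.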

Evaluation

Within-language usefulness. Frozen embeddings with a logistic regression classifier on MasakhaNEWS topics (test accuracy and macro-F1):

Language   Accuracy   Macro-F1
Hausa      0.818      0.803
Yoruba     0.798      0.796
Igbo       0.808      0.772
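For readers comparing the two columns, the sketch below shows how accuracy and macro-F1 are computed and why they can diverge: macro-F1 averages per-class F1 with equal weight, so a class the classifier handles poorly pulls it down even when overall accuracy stays high. The labels are toy values, not MasakhaNEWS data, and this does not reproduce the evaluation pipeline behind the table.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in gold or predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy labels: the rare class "b" is never predicted.
y_true = ["a", "a", "a", "b"]
y_pred = ["a", "a", "a", "a"]

acc = accuracy(y_true, y_pred)   # 3 of 4 correct
f1 = macro_f1(y_true, y_pred)    # mean of F1("a") = 6/7 and F1("b") = 0
print(acc, f1)
```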

Cross-lingual retrieval. acc@1 on the FLORES+ dev set (human-translated, n=997, no shared source). Because these are real translations rather than machine-translated pairs, this is the trustworthy cross-lingual benchmark:

Pair             acc@1
Hausa → Yoruba   0.670
Igbo → Yoruba    0.581
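One way to read acc@1 is: for each source sentence, retrieve the single most similar target sentence and check whether it is the aligned translation. The sketch below assumes that queries and documents are aligned by index and that the candidate pool is the full target set; both are assumptions about the setup, and `acc_at_1` is a hypothetical helper run here on toy vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def acc_at_1(query_vecs, doc_vecs):
    """Share of queries whose nearest document (by cosine) has the same index."""
    hits = 0
    for i, q in enumerate(query_vecs):
        best = max(range(len(doc_vecs)), key=lambda j: cosine(q, doc_vecs[j]))
        hits += int(best == i)
    return hits / len(query_vecs)


# Toy aligned sets: query i is meant to match document i.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.05]]
documents = [[1.0, 0.1], [0.1, 1.0], [1.0, 0.2]]

# Queries 0 and 1 retrieve their own document; query 2 is closer to document 0.
score = acc_at_1(queries, documents)
print(score)
```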

Limitations

  • Synthetic training data. Pairs are machine-translated and carry MT noise; cross-lingual alignment is genuine (see FLORES) but below what a large model trained on real parallel data would reach. Igbo alignment is slightly looser than Hausa, reflecting translator quality.
  • No Nigerian Pidgin (pcm). The translator only outputs ha/yo/ig, so Pidgin was not part of cross-lingual training.

License & provenance

Apache-2.0 weights. Training data derived from ODC-By English (FineWeb) via an MIT-licensed translation model.

Model size: 30.7M params (Safetensors, tensor type F32)