How to use olaverse/naija-embed-base with sentence-transformers:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("olaverse/naija-embed-base")

sentences = [
    "The weather is lovely today.",
    "It's so sunny outside!",
    "He drove to the stadium."
]
embeddings = model.encode(sentences)

similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
```
naija-embed-base
Cross-lingual sentence embeddings for Nigerian languages (Hausa, Yoruba, Igbo). Contrastively
fine-tuned from olaverse/mist-encoder-base-ng on general-domain synthetic parallel pairs:
clean English sentences (FineWeb, ODC-By) machine-translated into ha/yo/ig with the MIT-licensed
HelpMumHQ/AI-translator-eng-to-9ja, forming English↔Nigerian and Nigerian↔Nigerian pairs that
share an English source. Sentence embeddings use mean pooling and are compared with cosine similarity.
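```python
from sentence_transformers import SentenceTransformer

# Load the model from the Hugging Face Hub
m = SentenceTransformer("olaverse/naija-embed-base")
# Encode a batch of sentences into one embedding vector per sentence
emb = m.encode(["sentence one", "sentence two"])
```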
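The card names mean pooling and cosine similarity as the pooling and comparison steps. The following is a minimal NumPy sketch of what those two operations compute, not the model's actual implementation; the function names `mean_pool` and `cosine_similarity` and the toy arrays are illustrative assumptions.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sentence, ignoring padded positions."""
    mask = attention_mask[..., None].astype(token_embeddings.dtype)  # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # guard against division by zero
    return summed / counts

def cosine_similarity(a, b):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

# Toy input (made up): 2 sentences, 3 token slots, 2-dimensional token vectors.
tokens = np.array([
    [[1.0, 0.0], [3.0, 0.0], [9.0, 9.0]],  # last slot is padding
    [[0.0, 2.0], [0.0, 4.0], [0.0, 6.0]],
])
mask = np.array([[1, 1, 0], [1, 1, 1]])

pooled = mean_pool(tokens, mask)          # one vector per sentence
sims = cosine_similarity(pooled, pooled)  # (2, 2) similarity matrix
```

In normal use `SentenceTransformer.encode` handles pooling internally, so this is only useful for understanding what the embeddings represent.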
Best for
Cross-lingual retrieval (e.g. Hausa query → Yoruba document), within-language semantic search, clustering, RAG, and deduplication over Nigerian-language text.
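Retrieval and semantic search over these embeddings usually come down to ranking documents by cosine similarity to the query. A minimal sketch under that assumption follows; the `search` helper and the toy vectors are hypothetical stand-ins, and in practice both arrays would come from `SentenceTransformer("olaverse/naija-embed-base").encode(...)`.

```python
import numpy as np

def search(query_emb, doc_embs, k=3):
    """Return (index, score) pairs for the k documents most similar to the query."""
    q = query_emb / np.linalg.norm(query_emb)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    scores = d @ q                       # cosine similarity per document
    order = np.argsort(-scores)[:k]      # highest score first
    return [(int(i), float(scores[i])) for i in order]

# Toy stand-ins for embeddings (made up for illustration).
docs = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]])
query = np.array([0.9, 0.1, 0.0])

hits = search(query, docs, k=2)
```

The same ranking works regardless of the languages involved, since a Hausa query and a Yoruba document are embedded into the same vector space.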
Evaluation
Within-language usefulness. A logistic regression classifier trained on frozen embeddings, evaluated on MasakhaNEWS topic classification (test accuracy and macro-F1):
| Language | Accuracy | Macro-F1 |
|---|---|---|
| Hausa | 0.818 | 0.803 |
| Yoruba | 0.798 | 0.796 |
| Igbo | 0.808 | 0.772 |
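The two columns measure different things: accuracy counts every test example equally, while macro-F1 averages per-class F1 so that rare topics weigh as much as common ones. The sketch below shows how the two metrics are typically computed from predictions. It does not reproduce the evaluation above; the labels are invented, and the probe classifier itself is left out.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold label."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float((y_true == y_pred).mean())

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, over the classes present in y_true."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores = []
    for c in np.unique(y_true):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(scores))

# Invented labels, for illustration only.
y_true = ["sports", "sports", "politics", "health"]
y_pred = ["sports", "politics", "politics", "health"]

acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
```

A gap between the two numbers, as in the Igbo row, usually indicates that some topic classes are predicted less reliably than others.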
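Cross-lingual retrieval. acc@1 on the FLORES+ dev set (real human-translated, n=997, no shared source). Because these sentences are human translations rather than machine-translated pairs like the training data, this is the more trustworthy cross-lingual benchmark:
| Pair (query → target) | acc@1 |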
|---|---|
| Hausa → Yoruba | 0.670 |
| Igbo → Yoruba | 0.581 |
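For reference, acc@1 on a parallel set is commonly computed as the fraction of source sentences whose nearest target sentence is its own translation. The sketch below assumes cosine similarity and index-aligned pairs (row i of the source matches row i of the target); the `acc_at_1` name and the toy vectors are illustrative, not taken from the evaluation code.

```python
import numpy as np

def acc_at_1(src_embs, tgt_embs):
    """Fraction of source rows whose most similar target row has the same index."""
    s = src_embs / np.linalg.norm(src_embs, axis=1, keepdims=True)
    t = tgt_embs / np.linalg.norm(tgt_embs, axis=1, keepdims=True)
    nearest = (s @ t.T).argmax(axis=1)   # best-matching target for each source
    return float((nearest == np.arange(len(s))).mean())

# Toy aligned pairs (made up): the third source vector is closer to the
# first target than to its own translation, so it counts as a miss.
src = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.1]])
tgt = np.array([[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]])

score = acc_at_1(src, tgt)
```

With n=997 candidates per query, each query must rank its translation above 996 distractors to count as correct.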
Limitations
- Synthetic training data. Pairs are machine-translated and carry MT noise; cross-lingual alignment is genuine (see the FLORES+ results above) but below what a large model trained on real parallel data would reach. Igbo alignment is slightly looser than Hausa alignment, reflecting translator quality.
- No Nigerian Pidgin (pcm). The translator only outputs ha/yo/ig, so Pidgin was not part of cross-lingual training.
License & provenance
Apache-2.0 weights. Training data derived from ODC-By English (FineWeb) via an MIT-licensed translation model.