Eraynet-nirig

Eraynet-nirig is a baseline semantic retrieval model for Somali lexical search. It supports Somali, English, and Italian queries over a structured Somali dictionary dataset.

Overview

This model uses sentence embeddings to perform semantic search across lexical entries containing:

  • abbreviation
  • Somali term
  • Italian gloss
  • English gloss
  • domain metadata

It is designed as a baseline retrieval system for Somali language technology, terminology search, and dictionary lookup.

Model Details

  • Model: paraphrase-multilingual-MiniLM-L12-v2 (fine-tuned/used for embedding)
  • Embedding Dimension: 384
  • Training Data: 73 structured Somali lexical entries
  • Languages: Somali, English, Italian

Features

  • Exact and semantic lexical retrieval
  • Multilingual query support: Somali, English, Italian
  • Similarity scoring (cosine similarity)
  • Confidence labels: high (≥0.7), medium (≥0.5 and <0.7), low (<0.5)
  • Top-k results (default: 5)
  • Domain-aware search
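
The scoring features above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code in search.py: the function names (cosine_similarity, confidence_label, top_k) and the domain filter arguments are hypothetical, and only the thresholds (0.7, 0.5) and the default k of 5 come from the list above.

```python
import math

# Thresholds taken from the Features list; everything else is an assumption.
HIGH_THRESHOLD = 0.7
MEDIUM_THRESHOLD = 0.5


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def confidence_label(score):
    """Map a similarity score to the high / medium / low label."""
    if score >= HIGH_THRESHOLD:
        return "high"
    if score >= MEDIUM_THRESHOLD:
        return "medium"
    return "low"


def top_k(query_vec, entry_vecs, k=5, domains=None, domain=None):
    """Rank entries by cosine similarity, optionally keeping one domain only.

    Returns a list of (entry_index, score, label) tuples, best match first.
    """
    scored = []
    for idx, vec in enumerate(entry_vecs):
        # Hypothetical domain filter: skip entries outside the requested domain.
        if domain is not None and domains is not None and domains[idx] != domain:
            continue
        score = cosine_similarity(query_vec, vec)
        scored.append((idx, score, confidence_label(score)))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy 2-D vectors standing in for the 384-dimensional embeddings.
toy_entries = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
for idx, score, label in top_k([1.0, 0.0], toy_entries, k=2):
    print(idx, round(score, 4), label)
```

Because the labels are plain threshold checks on the cosine score, they indicate relative closeness in embedding space rather than a probability that the match is correct.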

Files

  • build_embeddings.py: builds vector embeddings from the lexical dataset
  • search.py: runs semantic search with confidence scoring
  • ai_model/embeddings.npy: stored embeddings (NumPy format)
  • ai_model/search_data.csv: structured lexical entries
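
As a rough sketch of the first step a script like build_embeddings.py might perform, the snippet below turns each CSV row into one text string ready for embedding. The column names, the " | " separator, and the helper names are assumptions inferred from the field list in the Overview; the real header of ai_model/search_data.csv and the actual script may differ.

```python
import csv
import io

# Hypothetical column names; check them against the real CSV header.
TEXT_COLUMNS = ("abbreviation", "somali", "italian", "english", "domain")


def entry_to_text(row):
    """Join the non-empty fields of one lexical entry into a single string."""
    parts = [(row.get(col) or "").strip() for col in TEXT_COLUMNS]
    return " | ".join(p for p in parts if p)


def load_entry_texts(csv_file):
    """Read lexical entries from an open CSV file, one text string per row."""
    reader = csv.DictReader(csv_file)
    return [entry_to_text(row) for row in reader]


# Toy in-memory CSV standing in for the dataset file; empty fields are skipped.
sample = io.StringIO(
    "abbreviation,somali,italian,english,domain\n"
    ",Daawo,medicina,medicine,medicine\n"
    ",siyaasad,,politics,\n"
)
texts = load_entry_texts(sample)
print(texts)
```

In a full pipeline, the resulting list would presumably be passed to the sentence-transformers model's encode method and the returned array saved with numpy.save to produce embeddings.npy.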

Example Usage

from search import search

# Search for a term
results = search("medicine")
print(results)

Example output:

rank    somali   english   italian   domain    similarity_score confidence_label
1       Daawo    medicine  medicina  medicine  0.8542            high
2       ...      ...       ...       ...       ...              ...

Example Queries

  • medicine → Daawo
  • politics → siyaasad
  • botany → Botani

Installation

pip install -r requirements.txt

Requirements

  • sentence-transformers
  • pandas
  • numpy
  • scikit-learn
  • fastapi (optional, for API)
  • uvicorn (optional, for API)

Intended Use

This model is suitable for:

  • Somali dictionary search
  • terminology lookup
  • NLP preprocessing support
  • lexical search in multilingual Somali applications

Limitations

  • Small baseline dataset (73 entries)
  • Not a generative model
  • Not a translation model
  • Similarity scores are embedding-based, not calibrated probabilities
  • Confidence labels are based on similarity thresholds, not statistical certainty

Future Work

  • expand dataset size with cleaned dictionary entries
  • add part-of-speech tagging
  • add richer domain annotations
  • support example sentence retrieval
  • train a Somali-English translation model from parallel sentence pairs

Version

  • v0.1: current baseline (73 entries, multilingual semantic retrieval)

Citation

If you use this model, please cite:

Eraynet-nirig: Somali Multilingual Lexical Retrieval Baseline (2026)