Eraynet-nirig
Eraynet-nirig is a baseline semantic retrieval model for Somali lexical search. It supports Somali, English, and Italian queries over a structured Somali dictionary dataset.
Overview
This model uses sentence embeddings to perform semantic search across lexical entries containing:
- abbreviation
- Somali term
- Italian gloss
- English gloss
- domain metadata
It is designed as a baseline retrieval system for Somali language technology, terminology search, and dictionary lookup.
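As an illustration of how one of the entries described above could be flattened into a single string before embedding, here is a minimal sketch. The field names (`abbreviation`, `somali`, `italian`, `english`, `domain`), the separator, and the `entry_to_text` helper are assumptions made for this example; the actual column names in `ai_model/search_data.csv` and the logic in `build_embeddings.py` may differ.

```python
from typing import Mapping

# Hypothetical field order; the real CSV columns may be named differently.
FIELDS = ("abbreviation", "somali", "italian", "english", "domain")


def entry_to_text(entry: Mapping[str, str]) -> str:
    """Join the non-empty fields of one lexical entry into a single string."""
    parts = []
    for field in FIELDS:
        value = (entry.get(field) or "").strip()
        if value:  # skip missing or blank fields
            parts.append(value)
    return " | ".join(parts)


# Illustrative entry based on the example output in this README;
# the empty abbreviation is an assumption.
example = {
    "abbreviation": "",
    "somali": "Daawo",
    "italian": "medicina",
    "english": "medicine",
    "domain": "medicine",
}
print(entry_to_text(example))
```

Putting all glosses in one string is one plausible way to let a single embedding match Somali, English, and Italian queries; embedding each field separately would be an equally reasonable design.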
Model Details
- Model: paraphrase-multilingual-MiniLM-L12-v2 (fine-tuned/used for embedding)
- Embedding Dimension: 384
- Training Data: 73 structured Somali lexical entries
- Languages: Somali, English, Italian
Features
- Exact and semantic lexical retrieval
- Multilingual query support: Somali, English, Italian
- Similarity scoring (cosine similarity)
- Confidence labels: high (≥0.7), medium (≥0.5 and <0.7), low (<0.5)
- Top-k results (default: 5)
- Domain-aware search
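The scoring features listed above can be sketched in plain Python as follows. This is not the code in `search.py`: the function names, the tuple layout of `entries`, the toy 2-dimensional vectors, and the reading of "domain-aware" as an exact-match filter are all assumptions for illustration. Only the thresholds (0.7 and 0.5) and the default of 5 results come from this README.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def confidence_label(score: float) -> str:
    """Map a similarity score to the labels documented in this README."""
    if score >= 0.7:
        return "high"
    if score >= 0.5:
        return "medium"
    return "low"


def top_k(
    query_vec: Sequence[float],
    entries: List[Tuple[str, str, Sequence[float]]],
    k: int = 5,
    domain: Optional[str] = None,
) -> List[Tuple[str, float, str]]:
    """Rank (term, domain, vector) entries by similarity to the query.

    If `domain` is given, only entries with that exact domain are scored
    (an assumed interpretation of "domain-aware search").
    """
    scored = []
    for term, entry_domain, vec in entries:
        if domain is not None and entry_domain != domain:
            continue
        score = cosine_similarity(query_vec, vec)
        scored.append((term, score, confidence_label(score)))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy data: made-up 2-D vectors standing in for 384-dimensional embeddings.
toy_entries = [
    ("Daawo", "medicine", [1.0, 0.0]),
    ("siyaasad", "politics", [0.0, 1.0]),
    ("Botani", "botany", [1.0, 1.0]),
]
for term, score, label in top_k([1.0, 0.0], toy_entries, k=2):
    print(term, round(score, 4), label)
```

In a real setup the query vector would come from the same embedding model as the stored vectors, and a vectorized NumPy or scikit-learn cosine computation would likely replace the pure-Python loop.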
Files
- build_embeddings.py: builds vector embeddings from the lexical dataset
- search.py: runs semantic search with confidence scoring
- ai_model/embeddings.npy: stored embeddings (NumPy format)
- ai_model/search_data.csv: structured lexical entries
Example Usage
from search import search
# Search for a term
results = search("medicine")
print(results)
Example output:
rank  somali  english   italian   domain    similarity_score  confidence_label
1     Daawo   medicine  medicina  medicine  0.8542            high
2     ...     ...       ...       ...       ...               ...
Example Queries
- medicine → Daawo
- politics → siyaasad
- botany → Botani
Installation
pip install -r requirements.txt
Requirements
- sentence-transformers
- pandas
- numpy
- scikit-learn
- fastapi (optional, for API)
- uvicorn (optional, for API)
Intended Use
This model is suitable for:
- Somali dictionary search
- terminology lookup
- NLP preprocessing support
- lexical search in multilingual Somali applications
Limitations
- Small baseline dataset (73 entries)
- Not a generative model
- Not a translation model
- Similarity scores are embedding-based, not calibrated probabilities
- Confidence labels are based on similarity thresholds, not statistical certainty
Future Work
- expand dataset size with cleaned dictionary entries
- add part-of-speech tagging
- add richer domain annotations
- support example sentence retrieval
- train a Somali-English translation model from parallel sentence pairs
Version
- v0.1: Current baseline - 73 entries, multilingual semantic retrieval
Citation
If you use this model, please cite:
Eraynet-nirig: Somali Multilingual Lexical Retrieval Baseline (2026)