---
language:
- so
- en
- it
license: cc-by-4.0
library_name: sentence-transformers
tags:
- semantic-search
- lexical-retrieval
- somali
- multilingual
- dictionary
- terminology
model_index:
- name: Eraynet-nirig
  results:
  - task:
      type: text-retrieval
    metrics:
    - type: retrieval_map
      value: N/A (baseline model)
---

# Eraynet-nirig

Eraynet-nirig is a baseline semantic retrieval model for Somali lexical search. It supports Somali, English, and Italian queries over a structured Somali dictionary dataset.

## Overview

This model uses sentence embeddings to perform semantic search across lexical entries containing:

- abbreviation
- Somali term
- Italian gloss
- English gloss
- domain metadata

It is designed as a baseline retrieval system for Somali language technology, terminology search, and dictionary lookup.

## Model Details

- **Model**: paraphrase-multilingual-MiniLM-L12-v2 (fine-tuned/used for embedding)
- **Embedding Dimension**: 384
- **Training Data**: 73 structured Somali lexical entries
- **Languages**: Somali, English, Italian

## Features

- Exact and semantic lexical retrieval
- Multilingual query support: Somali, English, Italian
- Similarity scoring (cosine similarity)
- Confidence labels: high (≥0.7), medium (≥0.5), low (<0.5)
- Top-k results (default: 5)
- Domain-aware search

## Files

- `build_embeddings.py`: builds vector embeddings from the lexical dataset
- `search.py`: runs semantic search with confidence scoring
- `ai_model/embeddings.npy`: stored embeddings (NumPy format)
- `ai_model/search_data.csv`: structured lexical entries

## Example Usage

```python
from search import search

# Search for a term
results = search("medicine")
print(results)
```

Example output:

```
rank  somali  english   italian   domain    similarity_score  confidence_label
1     Daawo   medicine  medicina  medicine  0.8542            high
2     ...     ...       ...       ...       ...               ...
```

## Example Queries

- `medicine` → `Daawo`
- `politics` → `siyaasad`
- `botany` → `Botani`

## Installation

```bash
pip install -r requirements.txt
```

## Requirements

- sentence-transformers
- pandas
- numpy
- scikit-learn
- fastapi (optional, for API)
- uvicorn (optional, for API)

## Intended Use

This model is suitable for:

- Somali dictionary search
- terminology lookup
- NLP preprocessing support
- lexical search in multilingual Somali applications

## Limitations

- Small baseline dataset (73 entries)
- Not a generative model
- Not a translation model
- Similarity scores are embedding-based, not calibrated probabilities
- Confidence labels are based on similarity thresholds, not statistical certainty

## Future Work

- expand dataset size with cleaned dictionary entries
- add part-of-speech tagging
- add richer domain annotations
- support example sentence retrieval
- train a Somali-English translation model from parallel sentence pairs

## Version

- **v0.1**: Current baseline - 73 entries, multilingual semantic retrieval

## Citation

If you use this model, please cite:

```
Eraynet-nirig: Somali Multilingual Lexical Retrieval Baseline (2026)
```
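
## Scoring Sketch

The Features list describes cosine similarity scoring, threshold-based confidence labels (high ≥0.7, medium ≥0.5, low <0.5), and top-k results with a default of 5. The snippet below is a minimal sketch of how those pieces could fit together. It is not the code in `search.py`: the function names `cosine`, `confidence_label`, and `top_k` are hypothetical, and the 3-dimensional toy vectors stand in for the real 384-dimensional embeddings.

```python
import math

# Thresholds taken from the Features list of this model card.
HIGH_THRESHOLD = 0.7
MEDIUM_THRESHOLD = 0.5


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def confidence_label(score):
    """Map a similarity score to the high / medium / low label."""
    if score >= HIGH_THRESHOLD:
        return "high"
    if score >= MEDIUM_THRESHOLD:
        return "medium"
    return "low"


def top_k(query_vec, entry_vecs, k=5):
    """Return (entry_index, score, label) tuples, best match first."""
    scored = [(i, cosine(query_vec, vec)) for i, vec in enumerate(entry_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [(i, score, confidence_label(score)) for i, score in scored[:k]]


# Toy vectors (hypothetical), not real model embeddings.
entries = [
    [1.0, 0.0, 0.0],
    [0.6, 0.8, 0.0],
    [0.0, 0.0, 1.0],
]
query = [1.0, 0.0, 0.0]

results = top_k(query, entries, k=3)
for rank, (index, score, label) in enumerate(results, start=1):
    print(rank, index, round(score, 4), label)
```

As the Limitations section notes, the labels only reflect fixed similarity cut-offs, so a "high" label here means a score of at least 0.7 and nothing more.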