metadata
language:
  - so
  - en
  - it
license: cc-by-4.0
library_name: sentence-transformers
tags:
  - semantic-search
  - lexical-retrieval
  - somali
  - multilingual
  - dictionary
  - terminology
model-index:
  - name: Eraynet-nirig
    results:
      - task:
          type: text-retrieval
        metrics:
          - type: retrieval_map
            value: N/A (baseline model)

Eraynet-nirig

Eraynet-nirig is a baseline semantic retrieval model for Somali lexical search. It supports Somali, English, and Italian queries over a structured Somali dictionary dataset.

Overview

This model uses sentence embeddings to perform semantic search across lexical entries containing:

  • abbreviation
  • Somali term
  • Italian gloss
  • English gloss
  • domain metadata

It is designed as a baseline retrieval system for Somali language technology, terminology search, and dictionary lookup.
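As an illustration of how one entry with the fields above could be turned into a single string for embedding, here is a minimal sketch. The field names, the `entry_to_text` helper, and the ` | ` separator are assumptions made for this example; the actual `build_embeddings.py` may combine the fields differently.

```python
# Hypothetical field names, mirroring the entry fields listed above.
FIELDS = ("abbreviation", "somali", "italian", "english", "domain")


def entry_to_text(entry):
    """Join the non-empty fields of one lexical entry into one string."""
    parts = []
    for field in FIELDS:
        # Missing or empty fields are skipped rather than embedded as blanks.
        value = str(entry.get(field) or "").strip()
        if value:
            parts.append(value)
    return " | ".join(parts)


entry = {
    "abbreviation": "",
    "somali": "Daawo",
    "italian": "medicina",
    "english": "medicine",
    "domain": "medicine",
}
print(entry_to_text(entry))
```

Putting all three languages into one string is one way to let a single vector match Somali, English, and Italian queries; embedding each field separately would be an equally plausible design.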

Model Details

  • Model: paraphrase-multilingual-MiniLM-L12-v2 (fine-tuned/used for embedding)
  • Embedding Dimension: 384
  • Training Data: 73 structured Somali lexical entries
  • Languages: Somali, English, Italian
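The sketch below shows one way the listed model could be used to produce embeddings with the sentence-transformers library. It assumes the public checkpoint `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2` is loaded as-is; the `embed_texts` helper and its `encoder` parameter are hypothetical and exist only so the function can also be called with a substitute encoder.

```python
import numpy as np

# Assumption: the unmodified public checkpoint is used for embedding.
MODEL_NAME = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"


def embed_texts(texts, encoder=None):
    """Return an (n, dim) float32 array of L2-normalised embeddings."""
    if encoder is None:
        # Imported lazily so the helper also works with a stand-in encoder.
        from sentence_transformers import SentenceTransformer

        encoder = SentenceTransformer(MODEL_NAME)
    vectors = np.asarray(encoder.encode(list(texts)), dtype=np.float32)
    # Normalising once means cosine similarity reduces to a dot product later.
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)
```

With the real model, `embed_texts(["Daawo", "medicine"])` should return an array with 384 columns, matching the embedding dimension stated above.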

Features

  • Exact and semantic lexical retrieval
  • Multilingual query support: Somali, English, Italian
  • Similarity scoring (cosine similarity)
  • Confidence labels: high (≥0.7), medium (≥0.5 and <0.7), low (<0.5)
  • Top-k results (default: 5)
  • Domain-aware search
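The confidence labels listed above can be expressed as a small threshold function. This is a sketch only: the name `confidence_label` and its keyword parameters are assumptions, and the thresholds are the ones stated in the feature list.

```python
def confidence_label(score, high=0.7, medium=0.5):
    """Map a cosine similarity score to 'high', 'medium', or 'low'."""
    if score >= high:
        return "high"
    if score >= medium:
        return "medium"
    return "low"


for score in (0.8542, 0.61, 0.32):
    print(score, confidence_label(score))
```

The labels are plain cut-offs on the similarity score, so they order results by closeness but do not express a probability of correctness.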

Files

  • build_embeddings.py: builds vector embeddings from the lexical dataset
  • search.py: runs semantic search with confidence scoring
  • ai_model/embeddings.npy: stored embeddings (NumPy format)
  • ai_model/search_data.csv: structured lexical entries
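To make the relationship between the stored embeddings and the search step concrete, here is a minimal sketch of top-k cosine ranking with NumPy. The function name `top_k_cosine` is hypothetical and the real `search.py` may be organised differently; in practice the matrix would presumably come from `np.load("ai_model/embeddings.npy")` and the query vector from the embedding model.

```python
import numpy as np


def top_k_cosine(query_vec, embeddings, k=5):
    """Return (row_index, score) pairs for the k rows most similar to query_vec."""
    query = np.asarray(query_vec, dtype=np.float64)
    matrix = np.asarray(embeddings, dtype=np.float64)
    # Cosine similarity: dot product divided by the product of the norms.
    denom = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
    scores = (matrix @ query) / np.clip(denom, 1e-12, None)
    # Sort by descending score and keep at most k rows.
    order = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in order]


# Toy 2-dimensional vectors stand in for the real 384-dimensional embeddings.
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hits = top_k_cosine([1.0, 0.0], toy_embeddings, k=2)
print(hits)
```

Each returned row index can then be used to look up the matching row in `ai_model/search_data.csv`, assuming the CSV rows and the embedding rows share the same order.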

Example Usage

from search import search

# Search for a term
results = search("medicine")
print(results)

Example output:

rank    somali   english   italian   domain    similarity_score confidence_label
1       Daawo    medicine  medicina  medicine  0.8542            high
2       ...      ...       ...       ...       ...              ...

Example Queries

  • medicine → Daawo
  • politics → siyaasad
  • botany → Botani

Installation

pip install -r requirements.txt

Requirements

  • sentence-transformers
  • pandas
  • numpy
  • scikit-learn
  • fastapi (optional, for API)
  • uvicorn (optional, for API)

Intended Use

This model is suitable for:

  • Somali dictionary search
  • terminology lookup
  • NLP preprocessing support
  • lexical search in multilingual Somali applications

Limitations

  • Small baseline dataset (73 entries)
  • Not a generative model
  • Not a translation model
  • Similarity scores are embedding-based, not calibrated probabilities
  • Confidence labels are based on similarity thresholds, not statistical certainty

Future Work

  • expand dataset size with cleaned dictionary entries
  • add part-of-speech tagging
  • add richer domain annotations
  • support example sentence retrieval
  • train a Somali-English translation model from parallel sentence pairs

Version

  • v0.1: Current baseline - 73 entries, multilingual semantic retrieval

Citation

If you use this model, please cite:

Eraynet-nirig: Somali Multilingual Lexical Retrieval Baseline (2026)