---
language:
- tig
license: cc-by-sa-4.0
base_model:
- facebook/SONAR
---

# Tigre Low-Resource Language Resource Collection (Tigre-Data 1.0)

## Overview

This repository introduces the first comprehensive public collection of resources for the **Tigre** language, an under-resourced South Semitic language within the Afro-Asiatic family. The release aggregates multiple modalities (text + speech) and provides baseline models for NLP tasks including language modeling, automatic speech recognition (ASR), and machine translation.

The goal of **Tigre-Data 1.0** is to accelerate research in low-resource NLP, morphologically rich language modeling, speech technologies, and cross-lingual transfer.

---

# tigre-sonar-encoder

A **Tigre–English semantic similarity and quality-checking encoder**, fine-tuned from the SONAR universal embedding model.

## Key Capabilities

- Generates 1024-dimensional embeddings for Tigre and English text
- Computes cosine similarity for translation validation and filtering
- Supports retrieval, clustering, and cross-lingual semantic tasks

---

## Model Description

- **Input Language:** Tigre (`tig`, Ethiopic script, code `tig_Ethi`)
- **Base Model:** `facebook/nllb-200-distilled-1.3B`
- **Model Type:** Encoder-only (text embedding model)
- **Purpose:** Align Tigre embeddings with the universal SONAR cross-lingual space

---

## Training Method: Knowledge Distillation

The model was trained with a teacher–student distillation pipeline:

### 1. Model & Tokenizer Preparation

- Initialized from the NLLB-200 distilled encoder
- Extended the tokenizer with Tigre-specific vocabulary
- Initialized the new token embeddings by averaging their sub-token embeddings (a sketch appears after the usage example below)

### 2. Teacher Embedding Generation

- The SONAR embedding model served as the teacher
- English translations of the Tigre sentences were encoded into 1024-dimensional target vectors

### 3. Distillation Fine-Tuning

- Minimized the **Mean Squared Error (MSE)** between the student (Tigre encoder) and teacher embeddings
- This forces the Tigre encoder to align with the universal cross-lingual space (see the training sketch after the usage example)

---

## Training Details

- **Dataset:** `train_tig_parallel_text.parquet`
- **Contents:** Tigre sentences paired with gold-standard SONAR embeddings
- **Objective:** MSE loss between model output and SONAR target vectors
- **Tokenizer:** Extended NLLB tokenizer with Tigre-specific vocabulary

---

## Evaluation Results

| Metric                         | Result    | Description                                                      |
| ------------------------------ | --------- | ---------------------------------------------------------------- |
| **Accuracy (Source → Target)** | **0.88**  | Retrieval accuracy when querying with Tigre text                 |
| **Accuracy (Target → Source)** | **0.78**  | Retrieval accuracy when querying with English text               |
| **BLEU**                       | **30.74** | From a separate machine-translation evaluation, not this encoder |

A sketch of how the retrieval accuracies can be computed appears after the usage example.

---

## Usage Example (Python)

```bash
pip install transformers torch
```
```python
from transformers import AutoTokenizer, M2M100ForConditionalGeneration
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "BeitTigreAI/tigre-sonar-encoder"

# Load the fine-tuned NLLB-style seq2seq checkpoint and keep only its encoder.
seq2seq = M2M100ForConditionalGeneration.from_pretrained(
    model_id,
    subfolder="model",
)
encoder = seq2seq.get_encoder().to(device).eval()
tokenizer = AutoTokenizer.from_pretrained(model_id, subfolder="model")

@torch.inference_mode()
def embed(texts, lang):
    """Mean-pool encoder states into L2-normalized 1024-d sentence embeddings."""
    tokenizer.src_lang = lang
    batch = tokenizer(
        texts, return_tensors="pt", padding=True, truncation=True, max_length=512
    ).to(device)
    out = encoder(**batch, return_dict=True)
    mask = batch["attention_mask"].unsqueeze(-1).float()
    pooled = (out.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp_min(1.0)
    return torch.nn.functional.normalize(pooled, p=2, dim=1)

def score_pair(tig, eng):
    """Cosine similarity between a Tigre and an English sentence, as a 0-100 score."""
    t = embed([tig], "tig_Ethi")
    e = embed([eng], "eng_Latn")
    sim = float((t * e).sum())  # dot product of unit vectors = cosine similarity
    return round(sim * 100, 1)

print(score_pair("እት እድንየ እግል ትርኤ ተሐዜዮ ተቅዪር ግበእ", "Be the change that you wish to see in the world"))
print(score_pair("ክል ዶል ኢገብእ መስል እስከ ይከለስ", "It always seems impossible until it's done"))
```
---
## License
**CC BY-SA 4.0**