---
language:
- tig
license: cc-by-sa-4.0
base_model:
- facebook/SONAR
---

# Tigre Low-Resource Language Resource Collection (Tigre-Data 1.0)

## Overview

This repository introduces the first comprehensive public collection of resources for the **Tigre** language, an under-resourced South Semitic language within the Afro-Asiatic family. The release aggregates multiple modalities (text and speech) and provides baseline models for NLP tasks including language modeling, automatic speech recognition (ASR), and machine translation.

The goal of **Tigre-Data 1.0** is to accelerate research in low-resource NLP, morphologically rich language modeling, speech technologies, and cross-lingual transfer.

---

# tigre-sonar-encoder

A **Tigre–English semantic similarity and quality-checking encoder**, fine-tuned from the SONAR universal embedding model.

## Key Capabilities

- Generates 1024-dimensional embeddings for Tigre and English text
- Computes cosine similarity for translation validation and filtering (see the filtering sketch below)
- Supports retrieval, clustering, and cross-lingual semantic tasks
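
As a concrete illustration of the validation-and-filtering use case, the sketch below keeps aligned Tigre–English pairs whose cosine similarity clears a cutoff. The `filter_pairs` helper and the 0.70 threshold are illustrative choices, not part of this release; random unit vectors stand in for real embeddings, which would come from the `embed` helper in the Usage Example below.

```python
import torch
import torch.nn.functional as F

def filter_pairs(tig_emb, eng_emb, threshold=0.70):
    """Keep pairs whose row-wise cosine similarity clears `threshold`."""
    sims = (tig_emb * eng_emb).sum(dim=1)  # unit vectors: dot product = cosine
    return sims >= threshold, sims

# Stand-in embeddings; in practice use the encoder (see Usage Example below).
tig_emb = F.normalize(torch.randn(4, 1024), dim=1)
eng_emb = F.normalize(torch.randn(4, 1024), dim=1)
keep, sims = filter_pairs(tig_emb, eng_emb)
print(sims.tolist(), keep.tolist())
```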

---

## Model Description

**Input Language:** Tigre (`tig`, Ethiopic script: `tig_Ethi`)

**Base Model:** `facebook/nllb-200-distilled-1.3B`

**Model Type:** Encoder-only (text embedding model)

**Purpose:** Align Tigre embeddings with the universal SONAR cross-lingual space

---

## Training Method: Knowledge Distillation

The model was trained with a teacher–student distillation pipeline:

### 1. Model & Tokenizer Preparation

- Initialized from the NLLB-200 distilled encoder
- Extended tokenizer with Tigre-specific vocabulary
- New token embeddings initialized by averaging sub-token embeddings (a minimal sketch follows)
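
A minimal sketch of that initialization, assuming the sub-token averaging is done against the original NLLB vocabulary; the tokens added here are placeholders, not the vocabulary actually used for this model.

```python
import torch
from transformers import AutoTokenizer, M2M100ForConditionalGeneration

base_id = "facebook/nllb-200-distilled-1.3B"
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = M2M100ForConditionalGeneration.from_pretrained(base_id)

new_tokens = ["ተቅዪር", "እድንየ"]  # placeholder Tigre entries

# Record each new token's sub-token ids under the *original* vocabulary,
# before the tokenizer is extended.
sub_ids = {t: tokenizer(t, add_special_tokens=False).input_ids for t in new_tokens}

tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))

emb = model.get_input_embeddings().weight.data
with torch.no_grad():
    for tok, ids in sub_ids.items():
        # New row = mean of the embeddings its surface form used to split into.
        emb[tokenizer.convert_tokens_to_ids(tok)] = emb[ids].mean(dim=0)
```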

### 2. Teacher Embedding Generation

- SONAR embedding model used as the Teacher
- English translations of Tigre sentences encoded into 1024-dimensional vectors (see the sketch below)
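
A sketch of producing the teacher vectors with Meta's `sonar-space` package, following the text-pipeline API from the SONAR repository; the exact teacher configuration used for this checkpoint is not detailed in this card.

```python
# pip install sonar-space
from sonar.inference_pipelines.text import TextToEmbeddingModelPipeline

# Encode the English side of the parallel data into 1024-d SONAR vectors,
# which become the distillation targets for the Tigre student encoder.
teacher = TextToEmbeddingModelPipeline(
    encoder="text_sonar_basic_encoder",
    tokenizer="text_sonar_basic_encoder",
)
targets = teacher.predict(
    ["Be the change that you wish to see in the world"],
    source_lang="eng_Latn",
)
print(targets.shape)  # torch.Size([1, 1024])
```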

### 3. Distillation Fine-Tuning

- Minimized **Mean Squared Error (MSE)** loss between Student (Tigre encoder) and Teacher embeddings (single-step sketch below)
- Forced the Tigre model to align with the universal cross-lingual space
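
A minimal sketch of one distillation step; random tensors stand in for the student's mean-pooled outputs and the SONAR targets so the snippet runs on its own.

```python
import torch
import torch.nn.functional as F

# Stand-ins: in training, student_vecs comes from the Tigre encoder
# (mean-pooled, as in the Usage Example) and teacher_vecs from SONAR.
student_vecs = torch.randn(8, 1024, requires_grad=True)
teacher_vecs = torch.randn(8, 1024)

loss = F.mse_loss(student_vecs, teacher_vecs)  # the distillation objective
loss.backward()  # in practice this updates the student encoder's weights
print(float(loss))
```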

---

## Training Details

- **Dataset:** `train_tig_parallel_text.parquet` (loaded in the sketch below)
- **Contents:** Tigre sentences paired with gold-standard SONAR embeddings
- **Objective:** MSE loss between model output and SONAR target vectors
- **Tokenizer:** Extended NLLB tokenizer with Tigre-specific vocabulary
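
The file's schema is not documented in this card, so the sketch below only loads it and prints what it contains (assumes `pandas` with a parquet engine such as `pyarrow`).

```python
import pandas as pd

# Inspect the distillation training set: Tigre sentences paired with
# their gold-standard SONAR target vectors.
df = pd.read_parquet("train_tig_parallel_text.parquet")
print(df.columns.tolist())
print(len(df), "rows")
```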

---

## Evaluation Results

| Metric                         | Result    | Description                                                   |
| ------------------------------ | --------- | ------------------------------------------------------------- |
| **Accuracy (Source → Target)** | **0.88**  | Retrieval accuracy when querying with Tigre text              |
| **Accuracy (Target → Source)** | **0.78**  | Retrieval accuracy when querying with English text            |
| **BLEU**                       | **30.74** | From a separate MT evaluation; not a property of this encoder |
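
The card does not spell out the retrieval protocol; the sketch below shows the standard argmax nearest-neighbor accuracy over aligned test pairs, with random unit vectors standing in for real embeddings.

```python
import torch
import torch.nn.functional as F

def retrieval_accuracy(queries, candidates):
    """Fraction of queries whose nearest candidate (by cosine) is the aligned one."""
    sims = queries @ candidates.T  # unit vectors: cosine similarity matrix
    return (sims.argmax(dim=1) == torch.arange(len(queries))).float().mean().item()

tig = F.normalize(torch.randn(100, 1024), dim=1)  # stand-in Tigre embeddings
eng = F.normalize(torch.randn(100, 1024), dim=1)  # stand-in English embeddings
print(retrieval_accuracy(tig, eng))  # Source → Target
print(retrieval_accuracy(eng, tig))  # Target → Source
```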

---

## Usage Example (Python)

```bash
pip install transformers torch
```

```python
from transformers import AutoTokenizer, M2M100ForConditionalGeneration
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

model_id = "BeitTigreAI/tigre-sonar-encoder"

# The checkpoint is stored as a full NLLB-style seq2seq model under the
# "model" subfolder; only its encoder is needed for embeddings.
seq2seq = M2M100ForConditionalGeneration.from_pretrained(
    model_id,
    subfolder="model"
)
encoder = seq2seq.get_encoder().to(device).eval()
tokenizer = AutoTokenizer.from_pretrained(model_id, subfolder="model")


@torch.inference_mode()
def embed(texts, lang):
    """Mean-pool encoder states into L2-normalized 1024-d sentence embeddings."""
    tokenizer.src_lang = lang  # "tig_Ethi" for Tigre, "eng_Latn" for English
    batch = tokenizer(
        texts, return_tensors="pt", padding=True, truncation=True, max_length=512
    ).to(device)
    out = encoder(**batch, return_dict=True)
    # Zero out padding positions before averaging over the sequence dimension.
    mask = batch["attention_mask"].unsqueeze(-1).float()
    pooled = (out.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp_min(1.0)
    return torch.nn.functional.normalize(pooled, p=2, dim=1)


def score_pair(tig, eng):
    """Cosine similarity between a Tigre/English pair, scaled by 100."""
    t = embed([tig], "tig_Ethi")
    e = embed([eng], "eng_Latn")
    sim = float((t * e).sum())  # dot product of unit vectors = cosine similarity
    return round(sim * 100, 1)


print(score_pair("እት እድንየ እግል ትርኤ ተሐዜዮ ተቅዪር ግበእ", "Be the change that you wish to see in the world"))
print(score_pair("ክል ዶል ኢገብእ መስል እስከ ይከለስ", "It always seems impossible until it's done"))
```
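
The printed scores are cosine similarities scaled by 100, so higher values indicate closer semantic alignment; this makes the output directly usable as a quality signal when validating or filtering Tigre–English translation pairs.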

---

## License

**CC BY-SA 4.0**