---
language:
- tig
license: cc-by-sa-4.0
base_model:
- facebook/SONAR
---


# Tigre Low-Resource Language Resource Collection (Tigre-Data 1.0)

## Overview

This repository introduces the first comprehensive public collection of resources for the **Tigre** language — an under-resourced South Semitic language within the Afro-Asiatic family. The release aggregates multiple modalities (text + speech) and provides baseline models for NLP tasks including language modeling, automatic speech recognition (ASR), and machine translation.

The goal of **Tigre-Data 1.0** is to accelerate research in low-resource NLP, morphologically rich language modeling, speech technologies, and cross-lingual transfer.

---

# tigre-sonar-encoder

A **Tigre–English semantic similarity and quality-checking encoder**, distilled into the embedding space of the SONAR universal embedding model.

## Key Capabilities

- Generates 1024-dimensional embeddings for Tigre and English text
- Computes cosine similarity for translation validation and filtering
- Supports retrieval, clustering, and cross-lingual semantic tasks

---

## Model Description

**Input Language:** Tigre (`tig`, script: Ethiopic — `tig_Ethi`)  
**Base Model:** `facebook/nllb-200-distilled-1.3B` (student encoder, distilled toward `facebook/SONAR` teacher embeddings)  
**Model Type:** Encoder-only (text embedding model)  
**Purpose:** Align Tigre embeddings with the universal SONAR cross-lingual space

---

## Training Method: Knowledge Distillation

The model was trained with a teacher–student distillation pipeline:

### 1. Model & Tokenizer Preparation

- Initialized from the NLLB-200 distilled encoder
- Extended tokenizer with Tigre-specific vocabulary
- New token embeddings initialized by averaging sub-token embeddings
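
A minimal sketch of this preparation step; the token strings are made-up examples, not the actual extended vocabulary:

```python
import torch
from transformers import AutoTokenizer, M2M100ForConditionalGeneration

base = "facebook/nllb-200-distilled-1.3B"
tokenizer = AutoTokenizer.from_pretrained(base)
model = M2M100ForConditionalGeneration.from_pretrained(base)

new_tokens = ["ሐቴ", "እግል"]  # illustrative Tigre-specific entries

# Record how the *original* vocabulary splits each new token before extending.
sub_ids = {t: tokenizer(t, add_special_tokens=False).input_ids for t in new_tokens}

tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))
emb = model.get_input_embeddings()

# Initialize each new row as the mean of its old sub-token embeddings.
with torch.no_grad():
    for tok in new_tokens:
        new_id = tokenizer.convert_tokens_to_ids(tok)
        emb.weight[new_id] = emb.weight[sub_ids[tok]].mean(dim=0)
```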

### 2. Teacher Embedding Generation

- SONAR embedding model used as the Teacher
- English translations of Tigre sentences encoded into 1024-dimensional vectors
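
Teacher vectors can be produced with the `sonar-space` package; a sketch, assuming its published text-embedding pipeline API:

```python
from sonar.inference_pipelines.text import TextToEmbeddingModelPipeline

teacher = TextToEmbeddingModelPipeline(
    encoder="text_sonar_basic_encoder",
    tokenizer="text_sonar_basic_encoder",
)

# English sides of the Tigre-English pairs become 1024-dimensional targets.
english = ["Be the change that you wish to see in the world"]
target_vecs = teacher.predict(english, source_lang="eng_Latn")  # shape (1, 1024)
```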

### 3. Distillation Fine-Tuning

- Minimized **Mean Squared Error (MSE)** loss between Student (Tigre encoder) and Teacher embeddings
- Forced the Tigre model to align with the universal cross-lingual space
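
One training step might look like the following sketch, where the mean-pooling mirrors the usage example below and `encoder`, `tokenizer`, and `optimizer` are assumed to be set up already:

```python
import torch
import torch.nn.functional as F

def mean_pool(last_hidden, attention_mask):
    # Average the token states, ignoring padding positions.
    mask = attention_mask.unsqueeze(-1).float()
    return (last_hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp_min(1.0)

def distill_step(encoder, tokenizer, optimizer, tigre_texts, teacher_vecs, device):
    tokenizer.src_lang = "tig_Ethi"
    batch = tokenizer(tigre_texts, return_tensors="pt", padding=True,
                      truncation=True, max_length=512).to(device)
    student = mean_pool(encoder(**batch, return_dict=True).last_hidden_state,
                        batch["attention_mask"])
    loss = F.mse_loss(student, teacher_vecs.to(device))  # match SONAR targets
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```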

---

## Training Details

- **Dataset:** `train_tig_parallel_text.parquet`
- **Contents:** Tigre sentences paired with gold-standard SONAR embeddings
- **Objective:** MSE loss between model output and SONAR target vectors
- **Tokenizer:** Extended NLLB tokenizer with Tigre-specific vocabulary
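
Loading such a file is straightforward; the column names `text` and `embedding` below are assumptions about the parquet schema, not documented fields:

```python
import numpy as np
import pandas as pd
import torch

df = pd.read_parquet("train_tig_parallel_text.parquet")
texts = df["text"].tolist()
teacher_vecs = torch.from_numpy(np.stack(df["embedding"].to_numpy())).float()  # (N, 1024)
```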

---

## Evaluation Results

| Metric                         | Result    | Description                                                  |
| ------------------------------ | --------- | ------------------------------------------------------------ |
| **Accuracy (Source → Target)** | **0.88**  | Retrieval accuracy when querying with Tigre text             |
| **Accuracy (Target → Source)** | **0.78**  | Retrieval accuracy when querying with English text           |
| **BLEU**                       | **30.74** | From a separate MT evaluation in this project, not of this encoder |
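
The retrieval accuracies correspond to a nearest-neighbor check over aligned test pairs; a sketch, assuming `tig_emb` and `eng_emb` are L2-normalized `(N, 1024)` matrices whose matching rows are translation pairs:

```python
import torch

def retrieval_accuracy(query_emb, index_emb):
    sims = query_emb @ index_emb.T           # cosine similarity matrix (N, N)
    nearest = sims.argmax(dim=1)             # best match for each query row
    gold = torch.arange(query_emb.size(0))   # correct match is the same row
    return (nearest == gold).float().mean().item()

# retrieval_accuracy(tig_emb, eng_emb)  -> Source-to-Target (reported 0.88)
# retrieval_accuracy(eng_emb, tig_emb)  -> Target-to-Source (reported 0.78)
```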

---

## Usage Example (Python)

```bash
pip install transformers torch
```
```python
from transformers import AutoTokenizer, M2M100ForConditionalGeneration
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# The checkpoint stores the full NLLB seq2seq model; only its encoder is needed.
model_id = "BeitTigreAI/tigre-sonar-encoder"
seq2seq = M2M100ForConditionalGeneration.from_pretrained(
    model_id,
    subfolder="model"
)
encoder = seq2seq.get_encoder().to(device).eval()
tokenizer = AutoTokenizer.from_pretrained(model_id, subfolder="model")

@torch.inference_mode()
def embed(texts, lang):
    """Mean-pool the encoder states into L2-normalized sentence embeddings."""
    tokenizer.src_lang = lang
    batch = tokenizer(texts, return_tensors="pt", padding=True,
                      truncation=True, max_length=512).to(device)
    out = encoder(**batch, return_dict=True)
    mask = batch["attention_mask"].unsqueeze(-1).float()
    pooled = (out.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp_min(1.0)
    return torch.nn.functional.normalize(pooled, p=2, dim=1)

def score_pair(tig, eng):
    """Cosine similarity of a Tigre-English pair, scaled to 0-100."""
    t = embed([tig], "tig_Ethi")
    e = embed([eng], "eng_Latn")
    sim = float((t * e).sum())
    return round(sim * 100, 1)

print(score_pair("እት እድንየ እግል ትርኤ ተሐዜዮ ተቅዪር ግበእ", "Be the change that you wish to see in the world"))
print(score_pair("ክል ዶል ኢገብእ መስል እስከ ይከለስ", "It always seems impossible until it's done"))
```
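The same scorer can filter noisy parallel data; a minimal sketch reusing `score_pair`, where the 75.0 cutoff is an illustrative threshold rather than a published value:

```python
def filter_pairs(pairs, threshold=75.0):
    """Keep candidate (tigre, english) pairs whose similarity clears the cutoff."""
    return [(t, e) for t, e in pairs if score_pair(t, e) >= threshold]
```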
---

## License

**CC BY-SA 4.0**