Update README.md
README.md
CHANGED
|
@@ -6,15 +6,84 @@ base_model:
|
|
| 6 |
- facebook/SONAR
|
| 7 |
---
|
| 8 |
|
| 9 |
-
This model provides a Tigre–English quality checker built on a fine-tuned SONAR encoder.
|
| 10 |
-
It produces embeddings for both Tigre and English text and scores their similarity with cosine distance.
|
| 11 |
-
The result is a fast, lightweight tool for filtering parallel data, validating translations, and supporting Tigre–English NLP workflows.
|
| 12 |
|
|
|
|
| 13 |
|
| 14 |
-
|
| 15 |
-
|
| 16 |
|
|
|
|
| 17 |
|
| 18 |
<pre>
|
| 19 |
```python
|
| 20 |
|
|
@@ -45,4 +114,10 @@ def score_pair(tig, eng):
|
|
| 45 |
return round(sim*100, 1)
|
| 46 |
|
| 47 |
print(score_pair("እት እድንየ እግል ትርኤ ተሐዜዮ ተቅዪር ግበእ", "Be the change that you wish to see in the world"))
|
| 48 |
-
print(score_pair("ክል ዶል ኢገብእ መስል እስከ ይከለስ", "It always seems impossible until it's done"))
|
| 6 |
- facebook/SONAR
|
| 7 |
---
|
| 8 |
|
|
|
|
|
|
|
|
|
|
| 9 |
|
| 10 |
+
# Tigre Low-Resource Language Resource Collection (Tigre-Data 1.0)
|
| 11 |
|
| 12 |
+
## Overview
|
| 13 |
+
|
| 14 |
+
This repository introduces the first comprehensive public collection of resources for **Tigre**, an under-resourced South Semitic language of the Afro-Asiatic family. The release combines text and speech data and provides baseline models for language modeling, automatic speech recognition (ASR), and machine translation.
|
| 15 |
+
|
| 16 |
+
The goal of **Tigre-Data 1.0** is to accelerate research in low-resource NLP, morphologically rich language modeling, speech technologies, and cross-lingual transfer.
|
| 17 |
+
|
| 18 |
+
---
|
| 19 |
+
|
| 20 |
+
# tigre-sonar-encoder
|
| 21 |
+
|
| 22 |
+
A **Tigre–English semantic similarity and quality-checking encoder**, fine-tuned from the SONAR universal embedding model.
|
| 23 |
+
|
| 24 |
+
## Key Capabilities
|
| 25 |
+
|
| 26 |
+
- Generates 1024-dimensional embeddings for Tigre and English text
|
| 27 |
+
- Computes cosine similarity for translation validation and filtering
|
| 28 |
+
- Supports retrieval, clustering, and cross-lingual semantic tasks
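As a rough illustration of the scoring step, the sketch below computes cosine similarity between two embedding vectors in plain Python and applies a cutoff for filtering. The `keep_pair` helper and its `threshold=0.7` default are hypothetical and not part of this model; real inputs would be the 1024-dimensional vectors produced by the encoder.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors score 0
    return dot / (norm_u * norm_v)

def keep_pair(tig_vec, eng_vec, threshold=0.7):
    """Hypothetical filter: keep a pair whose similarity meets the threshold."""
    return cosine_similarity(tig_vec, eng_vec) >= threshold

# Toy 2-dimensional vectors stand in for real embeddings.
print(keep_pair([1.0, 0.0], [1.0, 0.0]))
print(keep_pair([1.0, 0.0], [0.0, 1.0]))
```

A suitable threshold depends on the data and would need to be tuned on pairs of known quality.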
|
| 29 |
+
|
| 30 |
+
---
|
| 31 |
+
|
| 32 |
+
## Model Description
|
| 33 |
+
|
| 34 |
+
- **Input Language:** Tigre (`tig`), Ethiopic script (`tig_Ethi`)
|
| 35 |
+
- **Base Model:** `facebook/nllb-200-distilled-1.3B`
|
| 36 |
+
- **Model Type:** Encoder-only (text embedding model)
|
| 37 |
+
- **Purpose:** Align Tigre embeddings with the universal SONAR cross-lingual space
|
| 38 |
+
|
| 39 |
+
---
|
| 40 |
+
|
| 41 |
+
## Training Method: Knowledge Distillation
|
| 42 |
+
|
| 43 |
+
The model was trained with a teacher–student distillation pipeline:
|
| 44 |
+
|
| 45 |
+
### 1. Model & Tokenizer Preparation
|
| 46 |
+
|
| 47 |
+
- Initialized from the NLLB-200 distilled encoder
|
| 48 |
+
- Extended tokenizer with Tigre-specific vocabulary
|
| 49 |
+
- New token embeddings initialized by averaging sub-token embeddings
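The averaging initialization in the last step can be sketched as follows. This is a minimal plain-Python illustration under assumptions: `init_new_token_embedding` is a hypothetical helper, the embedding table is a toy list of lists, and the sub-token ids are taken to be the pieces the original tokenizer produced for the new token. The actual training code is not shown in this README and may differ.

```python
def init_new_token_embedding(sub_token_ids, embedding_table):
    """Average the embedding rows of the sub-tokens a new token used to split into."""
    if not sub_token_ids:
        raise ValueError("need at least one sub-token id")
    rows = [embedding_table[i] for i in sub_token_ids]
    dim = len(rows[0])
    # Element-wise mean across the selected rows.
    return [sum(row[d] for row in rows) / len(rows) for d in range(dim)]

# Toy table: 3 existing tokens with 2-dimensional embeddings.
table = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]

# Suppose a new Tigre token was previously split into tokens 0 and 1.
new_row = init_new_token_embedding([0, 1], table)
table.append(new_row)  # the new token takes the next free id (3)
print(new_row)
```

Starting a new token near the mean of its former pieces keeps its initial representation close to what the encoder already produced for that text.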
|
| 50 |
|
| 51 |
+
### 2. Teacher Embedding Generation
|
| 52 |
|
| 53 |
+
- SONAR embedding model used as the Teacher
|
| 54 |
+
- English translations of Tigre sentences encoded into 1024-dimensional vectors
|
| 55 |
+
|
| 56 |
+
### 3. Distillation Fine-Tuning
|
| 57 |
+
|
| 58 |
+
- Minimized **Mean Squared Error (MSE)** loss between Student (Tigre encoder) and Teacher embeddings
|
| 59 |
+
- Forced the Tigre model to align with the universal cross-lingual space
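The distillation objective can be illustrated with a small sketch. It assumes the loss is averaged over every element of the batch, which matches the default mean reduction of PyTorch's `MSELoss`; the README does not state which reduction was used, and `mse_loss` here is a plain-Python stand-in, not the training code.

```python
def mse_loss(student, teacher):
    """Mean squared error over all elements of two equally shaped batches of vectors."""
    if len(student) != len(teacher):
        raise ValueError("batches must contain the same number of vectors")
    total, count = 0.0, 0
    for s_vec, t_vec in zip(student, teacher):
        if len(s_vec) != len(t_vec):
            raise ValueError("vectors must have the same dimension")
        for s, t in zip(s_vec, t_vec):
            total += (s - t) ** 2
            count += 1
    return total / count

# Student output for a Tigre sentence vs. teacher embedding of its English translation
# (toy 2-dimensional values; real vectors are 1024-dimensional).
student_batch = [[1.0, 2.0]]
teacher_batch = [[1.0, 4.0]]
print(mse_loss(student_batch, teacher_batch))
```

The loss reaches zero only when the student reproduces the teacher vectors exactly, which is what places Tigre sentences near their English translations in the shared space.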
|
| 60 |
+
|
| 61 |
+
---
|
| 62 |
+
|
| 63 |
+
## Training Details
|
| 64 |
+
|
| 65 |
+
- **Dataset:** `train_tig_parallel_text.parquet`
|
| 66 |
+
- **Contents:** Tigre sentences paired with target SONAR embeddings of their English translations
|
| 67 |
+
- **Objective:** MSE loss between model output and SONAR target vectors
|
| 68 |
+
- **Tokenizer:** Extended NLLB tokenizer with Tigre-specific vocabulary
|
| 69 |
+
|
| 70 |
+
---
|
| 71 |
+
|
| 72 |
+
## Evaluation Results
|
| 73 |
+
|
| 74 |
+
| Metric | Result | Description |
|
| 75 |
+
| ------------------------------ | --------- | ------------------------------------------------------------ |
|
| 76 |
+
| **Accuracy (Source → Target)** | **0.88** | Retrieval accuracy when querying with Tigre text |
|
| 77 |
+
| **Accuracy (Target → Source)** | **0.78** | Retrieval accuracy when querying with English text |
|
| 78 |
+
| **BLEU**                       | **30.74** | From a separate MT evaluation; does not measure this encoder |
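For readers unfamiliar with the retrieval metric, here is a minimal sketch of how such an accuracy could be computed. It assumes top-1 retrieval by cosine similarity over an aligned test set, where the query at index `i` should retrieve the candidate at index `i`; the README does not describe the exact evaluation protocol, and `retrieval_accuracy` is a hypothetical helper.

```python
import math

def _cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieval_accuracy(queries, candidates):
    """Fraction of queries whose most similar candidate sits at the same index."""
    if not queries or len(queries) != len(candidates):
        raise ValueError("queries and candidates must be non-empty and aligned")
    hits = 0
    for i, query in enumerate(queries):
        best = max(range(len(candidates)), key=lambda j: _cosine(query, candidates[j]))
        hits += int(best == i)
    return hits / len(queries)

# Toy embeddings: Tigre queries against English candidates (Source -> Target).
tig = [[1.0, 0.0], [0.0, 1.0]]
eng = [[1.0, 0.0], [1.0, 0.2]]
print(retrieval_accuracy(tig, eng))
```

Swapping the two arguments gives the Target → Source direction. The two directions can differ, as in the table, because nearest-neighbor search is not symmetric.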
|
| 79 |
+
|
| 80 |
+
---
|
| 81 |
+
|
| 82 |
+
## Usage Example (Python)
|
| 83 |
+
|
| 84 |
+
```bash
|
| 85 |
+
pip install transformers torch
|
| 86 |
+
```
|
| 87 |
<pre>
|
| 88 |
```python
|
| 89 |
|
|
|
|
| 114 |
return round(sim*100, 1)
|
| 115 |
|
| 116 |
print(score_pair("እት እድንየ እግል ትርኤ ተሐዜዮ ተቅዪር ግበእ", "Be the change that you wish to see in the world"))
|
| 117 |
+
print(score_pair("ክል ዶል ኢገብእ መስል እስከ ይከለስ", "It always seems impossible until it's done"))
```
</pre>

|
| 118 |
+
---
|
| 119 |
+
|
| 120 |
+
## License
|
| 121 |
+
|
| 122 |
+
**CC BY-SA 4.0**
|
| 123 |
+
|