BeitTigreAI / tigre-sonar-encoder


beshiribrahim committed on Nov 24, 2025

Commit

f98328d


1 Parent(s): de4fb05

Update README.md


README.md +81 -6


@@ -6,15 +6,84 @@ base_model:
 - facebook/SONAR
 ---
-This model provides a Tigre–English quality checker built on a fine-tuned SONAR encoder.
-It produces embeddings for both Tigre and English text and scores their similarity with cosine distance.
-The result is a fast, lightweight tool for filtering parallel data, validating translations, and supporting Tigre–English NLP workflows.
-```bash
-pip install transformers torch
 <pre>
 ```python
@@ -45,4 +114,10 @@ def score_pair(tig, eng):
     return round(sim*100, 1)
 print(score_pair("እት እድንየ እግል ትርኤ ተሐዜዮ ተቅዪር ግበእ", "Be the change that you wish to see in the world"))
-print(score_pair("ክል ዶል ኢገብእ መስል እስከ ይከለስ", "It always seems impossible until it's done"))

 - facebook/SONAR
 ---
+# Tigre Low-Resource Language Resource Collection (Tigre-Data 1.0)
+## Overview
+This repository introduces the first comprehensive public collection of resources for the **Tigre** language, an under-resourced South Semitic language within the Afro-Asiatic family. The release aggregates multiple modalities (text and speech) and provides baseline models for NLP tasks including language modeling, automatic speech recognition (ASR), and machine translation.
+The goal of **Tigre-Data 1.0** is to accelerate research in low-resource NLP, morphologically rich language modeling, speech technologies, and cross-lingual transfer.
+---
+# tigre-sonar-encoder
+A **Tigre–English semantic similarity and quality-checking encoder**, fine-tuned from the SONAR universal embedding model.
+## Key Capabilities
+- Generates 1024-dimensional embeddings for Tigre and English text
+- Computes cosine similarity for translation validation and filtering
+- Supports retrieval, clustering, and cross-lingual semantic tasks
+---
+## Model Description
+- **Input Language:** Tigre (`tig`; Ethiopic script, `tig_Ethi`)
+- **Base Model:** `facebook/nllb-200-distilled-1.3B`
+- **Model Type:** Encoder-only (text embedding model)
+- **Purpose:** Align Tigre embeddings with the universal SONAR cross-lingual space
+---
+## Training Method: Knowledge Distillation
+The model was trained with a teacher–student distillation pipeline:
+### 1. Model & Tokenizer Preparation
+- Initialized from the NLLB-200 distilled encoder
+- Extended tokenizer with Tigre-specific vocabulary
+- New token embeddings initialized by averaging sub-token embeddings
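The averaging step above can be illustrated with a minimal, dependency-free sketch. It assumes the new embedding is the element-wise mean of the embeddings of the sub-tokens that the original tokenizer would have produced for the new token. The function name `init_new_token_embedding` and the toy table are hypothetical, not taken from the training code.

```python
def init_new_token_embedding(sub_token_ids, embedding_table):
    """Return the element-wise mean of the embedding rows for the given sub-token ids.

    Assumption: `sub_token_ids` are the ids the original tokenizer assigns to the
    pieces of the new token, and `embedding_table` is a list of equal-length rows.
    """
    if not sub_token_ids:
        raise ValueError("a new token must map to at least one existing sub-token")
    dim = len(embedding_table[sub_token_ids[0]])
    total = [0.0] * dim
    for token_id in sub_token_ids:
        row = embedding_table[token_id]
        for i in range(dim):
            total[i] += row[i]
    return [value / len(sub_token_ids) for value in total]


# Toy 4-row, 3-dimensional embedding table (hypothetical values).
toy_table = [
    [0.0, 0.0, 0.0],
    [1.0, 2.0, 3.0],
    [3.0, 2.0, 1.0],
    [5.0, 5.0, 5.0],
]

# A new token whose pieces are rows 1 and 2 starts at their mean.
new_row = init_new_token_embedding([1, 2], toy_table)
print(new_row)  # [2.0, 2.0, 2.0]
```

Starting a new token near the mean of its pieces keeps it inside the region of the embedding space the encoder already handles, which is usually a gentler starting point than random initialization.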
+### 2. Teacher Embedding Generation
+- SONAR embedding model used as the Teacher
+- English translations of Tigre sentences encoded into 1024-dimensional vectors
+### 3. Distillation Fine-Tuning
+- Minimized **Mean Squared Error (MSE)** loss between Student (Tigre encoder) and Teacher embeddings
+- Forced the Tigre model to align with the universal cross-lingual space
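As a rough illustration of the distillation objective, the sketch below computes the MSE between student and teacher sentence embeddings in plain Python. The helper names `mse_loss` and `batch_mse_loss` and the 2-dimensional vectors are hypothetical; the real embeddings are 1024-dimensional, and a real training loop would compute this with a tensor library and backpropagate through the student encoder.

```python
def mse_loss(student, teacher):
    """Mean squared error between one student vector and one teacher vector."""
    if len(student) != len(teacher):
        raise ValueError("student and teacher embeddings must have the same size")
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)


def batch_mse_loss(student_batch, teacher_batch):
    """Average the per-sentence MSE over a batch of embedding pairs."""
    if len(student_batch) != len(teacher_batch) or not student_batch:
        raise ValueError("batches must be non-empty and of equal length")
    losses = [mse_loss(s, t) for s, t in zip(student_batch, teacher_batch)]
    return sum(losses) / len(losses)


# Student = embeddings of Tigre sentences from the model being trained.
# Teacher = SONAR embeddings of the English translations (toy values here).
student_batch = [[1.0, 2.0], [0.5, -0.5]]
teacher_batch = [[1.0, 4.0], [0.5, -0.5]]

print(mse_loss(student_batch[0], teacher_batch[0]))  # 2.0
print(batch_mse_loss(student_batch, teacher_batch))  # 1.0
```

The loss reaches zero only when the Tigre embedding coincides with the teacher embedding of its translation, which is what pulls the student into the shared cross-lingual space.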
+---
+## Training Details
+- **Dataset:** `train_tig_parallel_text.parquet`
+- **Contents:** Tigre sentences paired with gold-standard SONAR embeddings
+- **Objective:** MSE loss between model output and SONAR target vectors
+- **Tokenizer:** Extended NLLB tokenizer with Tigre-specific vocabulary
+---
+## Evaluation Results
+| Metric                         | Result    | Description                                                  |
+| ------------------------------ | --------- | ------------------------------------------------------------ |
+| **Accuracy (Source → Target)** | **0.88**  | Retrieval accuracy when querying with Tigre text             |
+| **Accuracy (Target → Source)** | **0.78**  | Retrieval accuracy when querying with English text           |
+| **BLEU**                       | **30.74** | From a separate MT evaluation; does not measure this encoder |
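The README does not say how the retrieval accuracies were computed. One common reading, assumed here, is top-1 nearest-neighbour retrieval by cosine similarity over a set of aligned sentence pairs. The sketch below uses hypothetical 2-dimensional vectors and a hypothetical `retrieval_accuracy` helper to show why the two directions can give different scores.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieval_accuracy(queries, candidates):
    """Fraction of queries whose most similar candidate sits at the same index."""
    hits = 0
    for i, query in enumerate(queries):
        best = max(range(len(candidates)), key=lambda j: cosine(query, candidates[j]))
        hits += int(best == i)
    return hits / len(queries)


# Toy aligned pairs: tigre[i] is the translation of english[i].
tigre = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
english = [[1.0, 0.1], [0.1, 1.0], [1.0, 0.2]]

src_to_tgt = retrieval_accuracy(tigre, english)  # query with Tigre
tgt_to_src = retrieval_accuracy(english, tigre)  # query with English
print(src_to_tgt, tgt_to_src)
```

In this toy set every Tigre query finds its own translation, while the third English vector lies closer to the first Tigre vector than to its own pair, so the English-to-Tigre score is lower. The same kind of asymmetry is possible in the table above.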
+---
+## Usage Example (Python)
+```bash
+pip install transformers torch
+```
 <pre>
 ```python
     return round(sim*100, 1)
 print(score_pair("እት እድንየ እግል ትርኤ ተሐዜዮ ተቅዪር ግበእ", "Be the change that you wish to see in the world"))
+print(score_pair("ክል ዶል ኢገብእ መስል እስከ ይከለስ", "It always seems impossible until it's done"))
+---
+## License
+**CC BY-SA 4.0**