beshiribrahim committed on
Commit
f98328d
·
verified ·
1 Parent(s): de4fb05

Update README.md

Files changed (1)
  1. README.md +81 -6
README.md CHANGED
@@ -6,15 +6,84 @@ base_model:
6
  - facebook/SONAR
7
  ---
8
 
9
- This model provides a Tigre–English quality checker built on a fine-tuned SONAR encoder.
10
- It produces embeddings for both Tigre and English text and scores their similarity with cosine distance.
11
- The result is a fast, lightweight tool for filtering parallel data, validating translations, and supporting Tigre–English NLP workflows.
12
 
 
13
 
14
- ```bash
15
- pip install transformers torch
16
 
 
17
 
18
  <pre>
19
  ```python
20
 
@@ -45,4 +114,10 @@ def score_pair(tig, eng):
45
  return round(sim*100, 1)
46
 
47
  print(score_pair("እት እድንየ እግል ትርኤ ተሐዜዮ ተቅዪር ግበእ", "Be the change that you wish to see in the world"))
48
- print(score_pair("ክል ዶል ኢገብእ መስል እስከ ይከለስ", "It always seems impossible until it's done"))
6
  - facebook/SONAR
7
  ---
8
 
9
 
10
+ # Tigre Low-Resource Language Resource Collection (Tigre-Data 1.0)
11
 
12
+ ## Overview
13
+
14
+ This repository introduces the first comprehensive public collection of resources for the **Tigre** language — an under-resourced South Semitic language within the Afro-Asiatic family. The release aggregates multiple modalities (text + speech) and provides baseline models for NLP tasks including language modeling, automatic speech recognition (ASR), and machine translation.
15
+
16
+ The goal of **Tigre-Data 1.0** is to accelerate research in low-resource NLP, morphologically rich language modeling, speech technologies, and cross-lingual transfer.
17
+
18
+ ---
19
+
20
+ # tigre-sonar-encoder
21
+
22
+ A **Tigre–English semantic similarity and quality-checking encoder**, fine-tuned from the SONAR universal embedding model.
23
+
24
+ ## Key Capabilities
25
+
26
+ - Generates 1024-dimensional embeddings for Tigre and English text
27
+ - Computes cosine similarity for translation validation and filtering
28
+ - Supports retrieval, clustering, and cross-lingual semantic tasks
29
+
30
+ ---
31
+
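Note on the "Key Capabilities" list: the cosine-similarity scoring it mentions can be illustrated without the model. The sketch below is a minimal example on toy vectors, not the repository's code. The helper names `cosine_similarity`, `similarity_score`, and `filter_pairs`, and the threshold of 70, are assumptions made for illustration. The 0 to 100 scaling mirrors the `round(sim*100, 1)` line visible in the README's `score_pair` function.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here: a zero vector matches nothing
    return dot / (norm_u * norm_v)


def similarity_score(u, v):
    """Scale cosine similarity to a percentage-like score, rounded to 0.1."""
    return round(cosine_similarity(u, v) * 100, 1)


def filter_pairs(pairs, threshold=70.0):
    """Keep only (source_vec, target_vec) pairs scoring at or above threshold.

    The threshold is a hypothetical value; a real cutoff would be tuned
    on held-out data.
    """
    return [p for p in pairs if similarity_score(p[0], p[1]) >= threshold]


if __name__ == "__main__":
    aligned = ([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])    # same direction
    unrelated = ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])  # orthogonal
    print(similarity_score(*aligned))
    print(similarity_score(*unrelated))
    print(len(filter_pairs([aligned, unrelated])))
```

In real use the toy vectors would be replaced by the 1024-dimensional embeddings the encoder produces for a Tigre sentence and its English counterpart.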
32
+ ## Model Description
33
+
34
+ **Input Language:** Tigre (`tig`, script: Ethiopic — `tig_Ethi`)
35
+ **Base Model:** `facebook/nllb-200-distilled-1.3B`
36
+ **Model Type:** Encoder-only (text embedding model)
37
+ **Purpose:** Align Tigre embeddings with the universal SONAR cross-lingual space
38
+
39
+ ---
40
+
41
+ ## Training Method: Knowledge Distillation
42
+
43
+ The model was trained with a teacher–student distillation pipeline:
44
+
45
+ ### 1. Model & Tokenizer Preparation
46
+
47
+ - Initialized from the NLLB-200 distilled encoder
48
+ - Extended tokenizer with Tigre-specific vocabulary
49
+ - New token embeddings initialized by averaging sub-token embeddings
50
 
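Note on "New token embeddings initialized by averaging sub-token embeddings": one plausible reading is that each newly added Tigre token starts from the mean of the embedding rows of the pieces the original tokenizer split it into. The sketch below shows that idea on a toy embedding table. The function name `average_embedding`, the table contents, and the token ids are all hypothetical, and the actual training script may differ.

```python
def average_embedding(sub_token_ids, embedding_table):
    """Return the element-wise mean of the rows selected by sub_token_ids.

    embedding_table is a list of equal-length rows, one per vocabulary id.
    """
    if not sub_token_ids:
        raise ValueError("need at least one sub-token id")
    rows = [embedding_table[i] for i in sub_token_ids]
    dim = len(rows[0])
    return [sum(row[d] for row in rows) / len(rows) for d in range(dim)]


if __name__ == "__main__":
    # Toy table with 3 vocabulary entries and 2 dimensions (illustrative only).
    table = [
        [0.0, 0.0],
        [2.0, 4.0],
        [4.0, 8.0],
    ]
    # Suppose the original tokenizer split a new word into ids 1 and 2.
    new_row = average_embedding([1, 2], table)
    table.append(new_row)  # the new token takes the next free id
    print(new_row)
```

Starting from the average keeps the new row close to vectors the encoder already handles, which is usually gentler than a random start.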
51
+ ### 2. Teacher Embedding Generation
52
 
53
+ - SONAR embedding model used as the Teacher
54
+ - English translations of Tigre sentences encoded into 1024-dimensional vectors
55
+
56
+ ### 3. Distillation Fine-Tuning
57
+
58
+ - Minimized **Mean Squared Error (MSE)** loss between Student (Tigre encoder) and Teacher embeddings
59
+ - Forced the Tigre model to align with the universal cross-lingual space
60
+
61
+ ---
62
+
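Note on the distillation objective: the MSE loss between Student and Teacher embeddings described above can be sketched on toy vectors. This is a simplified illustration, not the training code. It treats the student's output vector itself as the thing being updated, whereas real training would backpropagate through the encoder's parameters. The names `mse_loss`, `mse_grad`, and `gradient_step`, and the learning rate, are assumptions.

```python
def mse_loss(student, teacher):
    """Mean squared error between two equal-length vectors."""
    n = len(student)
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / n


def mse_grad(student, teacher):
    """Gradient of mse_loss with respect to the student vector."""
    n = len(student)
    return [2.0 * (s - t) / n for s, t in zip(student, teacher)]


def gradient_step(student, teacher, lr=0.1):
    """One plain gradient-descent update pulling student toward teacher."""
    grad = mse_grad(student, teacher)
    return [s - lr * g for s, g in zip(student, grad)]


if __name__ == "__main__":
    teacher = [1.0, 0.0, -1.0]  # stands in for a SONAR vector of the English side
    student = [0.0, 0.0, 0.0]   # stands in for the Tigre encoder's output
    before = mse_loss(student, teacher)
    for _ in range(50):
        student = gradient_step(student, teacher)
    after = mse_loss(student, teacher)
    print(before, after)  # the loss shrinks as the student approaches the teacher
```

Because the target is the teacher's embedding of the English translation, driving this loss down places the Tigre sentence near its translation in the shared space.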
63
+ ## Training Details
64
+
65
+ - **Dataset:** `train_tig_parallel_text.parquet`
66
+ - **Contents:** Tigre sentences paired with gold-standard SONAR embeddings
67
+ - **Objective:** MSE loss between model output and SONAR target vectors
68
+ - **Tokenizer:** Extended NLLB tokenizer with Tigre-specific vocabulary
69
+
70
+ ---
71
+
72
+ ## Evaluation Results
73
+
74
+ | Metric | Result | Description |
75
+ | ------------------------------ | --------- | ------------------------------------------------------------ |
76
+ | **Accuracy (Source → Target)** | **0.88** | Retrieval accuracy when querying with Tigre text |
77
+ | **Accuracy (Target → Source)** | **0.78** | Retrieval accuracy when querying with English text |
78
+ | **BLEU** | **30.74** | (BLEU relates to a separate MT evaluation, not this encoder) |
79
+
80
+ ---
81
+
82
+ ## Usage Example (Python)
83
+
84
+ ```bash
85
+ pip install transformers torch
86
+ ```
87
  <pre>
88
  ```python
89
 
 
114
  return round(sim*100, 1)
115
 
116
  print(score_pair("እት እድንየ እግል ትርኤ ተሐዜዮ ተቅዪር ግበእ", "Be the change that you wish to see in the world"))
117
+ print(score_pair("ክል ዶል ኢገብእ መስል እስከ ይከለስ", "It always seems impossible until it's done"))
118
+ ---
119
+
120
+ ## License
121
+
122
+ **CC BY-SA 4.0**
123
+