---
language:
- tig
license: cc-by-sa-4.0
base_model:
- facebook/SONAR
---
# Tigre Low-Resource Language Resource Collection (Tigre-Data 1.0)
## Overview
This repository introduces the first comprehensive public collection of resources for the **Tigre** language — an under-resourced South Semitic language within the Afro-Asiatic family. The release aggregates multiple modalities (text + speech) and provides baseline models for NLP tasks including language modeling, automatic speech recognition (ASR), and machine translation.
The goal of **Tigre-Data 1.0** is to accelerate research in low-resource NLP, morphologically rich language modeling, speech technologies, and cross-lingual transfer.
---
# tigre-sonar-encoder
A **Tigre–English semantic similarity and quality-checking encoder**, aligned to the SONAR universal embedding space via knowledge distillation.
## Key Capabilities
- Generates 1024-dimensional embeddings for Tigre and English text
- Computes cosine similarity for translation validation and filtering
- Supports retrieval, clustering, and cross-lingual semantic tasks
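One use of the similarity scores above is filtering a noisy parallel corpus. A minimal sketch, assuming already-computed embedding batches; `cosine_sim`, `filter_pairs`, and the 0.7 cutoff are illustrative names and values, not part of this repository:

```python
import torch
import torch.nn.functional as F

def cosine_sim(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Row-wise cosine similarity between two batches of embeddings."""
    return (F.normalize(a, dim=1) * F.normalize(b, dim=1)).sum(dim=1)

def filter_pairs(tig_emb, eng_emb, pairs, threshold=0.7):
    """Keep sentence pairs whose embedding similarity clears the threshold.

    threshold=0.7 is an illustrative value; tune it on held-out data.
    """
    sims = cosine_sim(tig_emb, eng_emb)
    return [p for p, s in zip(pairs, sims.tolist()) if s >= threshold]

# Toy demo with random 1024-d "embeddings"
torch.manual_seed(0)
a = torch.randn(3, 1024)
good = filter_pairs(a, a.clone(), [("t1", "e1"), ("t2", "e2"), ("t3", "e3")])
print(len(good))  # identical vectors score 1.0, so all 3 pairs survive
```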
---
## Model Description
**Input Language:** Tigre (`tig`, script: Ethiopic — `tig_Ethi`)
**Base Model:** `facebook/nllb-200-distilled-1.3B`
**Model Type:** Encoder-only (text embedding model)
**Purpose:** Align Tigre embeddings with the universal SONAR cross-lingual space
---
## Training Method: Knowledge Distillation
The model was trained with a teacher–student distillation pipeline:
### 1. Model & Tokenizer Preparation
- Initialized from the NLLB-200 distilled encoder
- Extended tokenizer with Tigre-specific vocabulary
- New token embeddings initialized by averaging sub-token embeddings
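The averaging step above can be sketched with a toy embedding table; the vocabulary size, dimensions, and sub-token ids here are made up for illustration, and the real script operates on the NLLB encoder's embedding matrix:

```python
import torch

def init_new_token_embedding(embedding: torch.nn.Embedding, sub_token_ids):
    """Initial vector for a new token: the mean of its old sub-token embeddings."""
    with torch.no_grad():
        return embedding.weight[sub_token_ids].mean(dim=0)

# Toy example: a 10-token vocabulary with 8-dim embeddings.
torch.manual_seed(0)
emb = torch.nn.Embedding(10, 8)

# Suppose a new Tigre token was previously split into sub-tokens 3, 5, and 7.
new_vec = init_new_token_embedding(emb, [3, 5, 7])

# Grow the embedding matrix by one row and place the averaged vector there.
grown = torch.nn.Embedding(11, 8)
with torch.no_grad():
    grown.weight[:10] = emb.weight
    grown.weight[10] = new_vec
```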
### 2. Teacher Embedding Generation
- SONAR embedding model used as the Teacher
- English translations of Tigre sentences encoded into 1024-dimensional vectors
### 3. Distillation Fine-Tuning
- Minimized **Mean Squared Error (MSE)** loss between Student (Tigre encoder) and Teacher embeddings
- Forced the Tigre model to align with the universal cross-lingual space
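The teacher–student objective above reduces to regressing student outputs onto fixed teacher vectors. A runnable sketch with stand-in tensors: the 1024-d target dimension matches SONAR, but the linear "student" replaces the actual NLLB encoder, and all data here is random:

```python
import torch

torch.manual_seed(0)
d_model, d_embed, batch = 16, 1024, 8

# Stand-in student: in the real pipeline this is the pooled output of the
# Tigre NLLB encoder; a single linear layer keeps the sketch self-contained.
student = torch.nn.Linear(d_model, d_embed)
optimizer = torch.optim.Adam(student.parameters(), lr=1e-2)
loss_fn = torch.nn.MSELoss()

# Frozen teacher targets: in practice, SONAR embeddings of the English
# translations of the Tigre sentences.
x = torch.randn(batch, d_model)
teacher_targets = torch.randn(batch, d_embed)

losses = []
for _ in range(200):
    optimizer.zero_grad()
    loss = loss_fn(student(x), teacher_targets)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"MSE: {losses[0]:.4f} -> {losses[-1]:.4f}")  # loss shrinks as the student aligns
```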
---
## Training Details
- **Dataset:** `train_tig_parallel_text.parquet`
- **Contents:** Tigre sentences paired with gold-standard SONAR embeddings
- **Objective:** MSE loss between model output and SONAR target vectors
- **Tokenizer:** Extended NLLB tokenizer with Tigre-specific vocabulary
---
## Evaluation Results
| Metric | Result | Description |
| ------------------------------ | --------- | ------------------------------------------------------------ |
| **Accuracy (Source → Target)** | **0.88** | Retrieval accuracy when querying with Tigre text |
| **Accuracy (Target → Source)** | **0.78** | Retrieval accuracy when querying with English text |
| **BLEU** | **30.74** | Reported from a separate MT evaluation; not a metric of this encoder |
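Source→target retrieval accuracy is conventionally the fraction of queries whose nearest neighbour (by cosine similarity) in the other language is the correct pair; whether this release used exactly this protocol is an assumption. A minimal sketch with dummy embeddings:

```python
import torch
import torch.nn.functional as F

def retrieval_accuracy(src: torch.Tensor, tgt: torch.Tensor) -> float:
    """Fraction of src rows whose most similar tgt row sits at the same index."""
    src = F.normalize(src, dim=1)
    tgt = F.normalize(tgt, dim=1)
    sims = src @ tgt.T                 # pairwise cosine similarities
    preds = sims.argmax(dim=1)         # nearest target for each source
    gold = torch.arange(src.size(0))
    return (preds == gold).float().mean().item()

torch.manual_seed(0)
tgt = torch.randn(5, 1024)
src = tgt + 0.01 * torch.randn(5, 1024)  # nearly perfect alignment
print(retrieval_accuracy(src, tgt))      # 1.0 for these toy vectors
```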
---
## Usage Example (Python)
```bash
pip install transformers torch
```
```python
from transformers import AutoTokenizer, M2M100ForConditionalGeneration
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "BeitTigreAI/tigre-sonar-encoder"

# Load the seq2seq checkpoint and keep only its encoder
seq2seq = M2M100ForConditionalGeneration.from_pretrained(
    model_id,
    subfolder="model",
)
encoder = seq2seq.get_encoder().to(device).eval()
tokenizer = AutoTokenizer.from_pretrained(model_id, subfolder="model")

@torch.inference_mode()
def embed(texts, lang):
    tokenizer.src_lang = lang
    batch = tokenizer(
        texts, return_tensors="pt", padding=True, truncation=True, max_length=512
    ).to(device)
    out = encoder(**batch, return_dict=True)
    # Mean-pool over non-padding tokens, then L2-normalize
    mask = batch["attention_mask"].unsqueeze(-1).float()
    pooled = (out.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp_min(1.0)
    return torch.nn.functional.normalize(pooled, p=2, dim=1)

def score_pair(tig, eng):
    t = embed([tig], "tig_Ethi")
    e = embed([eng], "eng_Latn")
    sim = float((t * e).sum())  # cosine similarity of unit vectors
    return round(sim * 100, 1)

print(score_pair("እት እድንየ እግል ትርኤ ተሐዜዮ ተቅዪር ግበእ", "Be the change that you wish to see in the world"))
print(score_pair("ክል ዶል ኢገብእ መስል እስከ ይከለስ", "It always seems impossible until it's done"))
```
---
## License
**CC BY-SA 4.0**