tigre-sonar-encoder / README.md
beshiribrahim's picture
Upload README.md
2549fe1 verified
|
raw
history blame
1.73 kB
metadata
language:
  - tig
license: cc-by-sa-4.0
base_model:
  - facebook/SONAR

This model provides a Tigre–English quality checker built on a fine-tuned SONAR encoder. It produces embeddings for both Tigre and English text and scores their similarity with cosine distance. The result is a fast, lightweight tool for filtering parallel data, validating translations, and supporting Tigre–English NLP workflows.

pip install transformers torch


<pre>
```python

from transformers import AutoTokenizer, M2M100ForConditionalGeneration
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# load your Tigre-trained encoder
model_id = "BeitTigreAI/tigre-sonar-encoder"
seq2seq = M2M100ForConditionalGeneration.from_pretrained(model_id)
encoder = seq2seq.get_encoder().to(device).eval()
tokenizer = AutoTokenizer.from_pretrained(model_id)

@torch.inference_mode()
def embed(texts, lang):
    tokenizer.src_lang = lang
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=512).to(device)
    out = encoder(**batch, return_dict=True)
    mask = batch["attention_mask"].unsqueeze(-1).float()
    pooled = (out.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp_min(1.0)
    return torch.nn.functional.normalize(pooled, p=2, dim=1)

def score_pair(tig, eng):
    t = embed([tig], "tig_Ethi")
    e = embed([eng], "eng_Latn")
    sim = float((t*e).sum())
    return round(sim*100, 1)

print(score_pair("እት እድንየ እግል ትርኤ ተሐዜዮ ተቅዪር ግበእ", "Be the change that you wish to see in the world"))
print(score_pair("ክል ዶል ኢገብእ መስል እስከ ይከለስ", "It always seems impossible until it's done"))