
Model Card for Tajik-Persian multilingual-e5 + Contrastive Learning for Lexical Induction

This model is a fine-tuned version of intfloat/multilingual-e5-base on the TajPersParallelLexicalCorpus dataset. It produces aligned embeddings for Tajik (Cyrillic) and Persian (Arabic script) words, enabling cross-lingual word retrieval and semantic similarity tasks.

Model Details

Model Description

  • Developed by: Mullosharaf K. Arabov (TajikNLPWorld)
  • Funded by: [More Information Needed]
  • Shared by: TajikNLPWorld
  • Model type: Sentence Transformer (contrastive learning)
  • Language(s) (NLP): Tajik (tg), Persian (fa)
  • License: Apache 2.0
  • Finetuned from model: intfloat/multilingual-e5-base

Model Sources

Uses

Direct Use

The model can be used directly to obtain cross-lingual embeddings for Tajik and Persian words. It is optimised for single‑word inputs. Example use cases:

  • Finding translations of a Tajik word from a Persian candidate list.
  • Computing semantic similarity between words in the two languages.
  • Building bilingual lexical resources or improving machine translation pre-processing.

Downstream Use

The model can be integrated into larger systems that require cross-lingual alignment, such as:

  • Bilingual lexicon induction
  • Cross-lingual information retrieval
  • Unsupervised or semi-supervised machine translation (as a pre-processing step for word alignment)

Out-of-Scope Use

  • The model is not designed for multi‑word phrases or full sentences; performance on such inputs may degrade.
  • Rare or out-of-vocabulary words are handled only via subword tokenization; for such words, subword‑based FastText models may be more appropriate.
  • It should not be used for languages other than Tajik and Persian.

Bias, Risks, and Limitations

  • The model was trained on a parallel lexical corpus which may contain biases present in the source data (e.g., under‑representation of certain domains or dialects).
  • Performance varies by part‑of‑speech; conjunctions and proper nouns show lower retrieval accuracy.
  • As a neural model, it may exhibit unpredictable behaviour on adversarial or nonsensical inputs.

Recommendations

Users should evaluate the model on their specific task and consider combining it with other resources (e.g., rule‑based checks) for critical applications. When using the model, be aware of the limitations described above.

How to Get Started with the Model

Use the code below to get started with the model.

```python
from transformers import AutoModel, AutoTokenizer
import torch
import torch.nn.functional as F

model_name = "TajikNLPWorld/tajik-persian-e5-contrastive"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

def get_embedding(text):
    # Tokenize and encode without gradient tracking.
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=128)
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean-pool the token embeddings, then L2-normalize so that
    # dot products between embeddings equal cosine similarities.
    embeddings = outputs.last_hidden_state.mean(dim=1)
    return F.normalize(embeddings, p=2, dim=1)

# Example: similarity between a Tajik word and its Persian translation
tajik_word = "модар"
persian_word = "مادر"
emb_tg = get_embedding(tajik_word)
emb_fa = get_embedding(persian_word)
similarity = torch.cosine_similarity(emb_tg, emb_fa)
print(f"Similarity: {similarity.item():.4f}")  # ~0.78

# Find translations from a candidate list
persian_words = ["مادر", "پدر", "برادر", "خواهر", "دختر", "پسر"]
persian_embeddings = torch.cat([get_embedding(w) for w in persian_words])
query_emb = get_embedding("модар")
# Embeddings are normalized, so the matrix product yields cosine similarities.
similarities = torch.mm(query_emb, persian_embeddings.T).squeeze(0)
top_indices = similarities.argsort(descending=True)[:5]
for i in top_indices:
    print(f"{persian_words[i]}: {similarities[i].item():.4f}")
```

Training Details

Training Data

  • Dataset: TajPersParallelLexicalCorpus
  • Size: 33,222 parallel word pairs (Tajik–Persian)
  • Split: 29,069 pairs for training (33,222 total minus 4,153 held out for evaluation)

Training Procedure

The model was fine‑tuned using contrastive learning (NT‑Xent loss) to align embeddings of translation pairs.

Preprocessing

  • Inputs were tokenized with the multilingual‑e5 tokenizer (max length 128).
  • No additional filtering was applied.

Training Hyperparameters

  • Batch size: 16
  • Learning rate: 2e-5
  • Optimizer: AdamW
  • Epochs: 3
  • Loss: NT‑Xent with temperature 0.02
  • Max sequence length: 128 tokens
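
The NT‑Xent objective used here can be sketched as follows. This is a minimal illustration with in‑batch negatives, not the actual training script; the function name `nt_xent_loss` is hypothetical, while the temperature default matches the reported hyperparameter:

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(emb_a, emb_b, temperature=0.02):
    """Symmetric NT-Xent (InfoNCE) loss over in-batch negatives.

    emb_a, emb_b: (batch, dim) embeddings of translation pairs; row i of
    emb_a and row i of emb_b form a positive pair, and every other row in
    the batch serves as a negative.
    """
    emb_a = F.normalize(emb_a, p=2, dim=1)
    emb_b = F.normalize(emb_b, p=2, dim=1)
    # (batch, batch) matrix of cosine similarities, scaled by temperature.
    logits = emb_a @ emb_b.T / temperature
    # The positive pair for each row sits on the diagonal.
    targets = torch.arange(emb_a.size(0))
    # Average the two retrieval directions (Tajik->Persian and Persian->Tajik).
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2
```

A low temperature such as 0.02 sharpens the softmax, so the loss strongly penalises any negative candidate that scores close to the true translation.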

Speeds, Sizes, Times

  • Training was performed on a single GPU (exact hardware unknown). Each epoch took approximately 1 hour.

Evaluation

Testing Data, Factors & Metrics

Testing Data

A held‑out test set of 4,153 Tajik–Persian word pairs from the same corpus, not seen during training.

Factors

  • Part‑of‑speech: The test set includes nouns, adjectives, verbs, adverbs, proper nouns, interjections, conjunctions, and numerals (tagged for Tajik).
  • Domain: General vocabulary.

Metrics

  • Precision@k (P@1, P@5, P@10): proportion of queries where the correct translation is in the top‑k retrieved candidates.
  • Mean Reciprocal Rank (MRR): average of reciprocal ranks of the correct translation.
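
Given the 1‑based rank of the correct translation for each query, these metrics reduce to a few lines. The helper names below are illustrative, not taken from the original evaluation code:

```python
def precision_at_k(ranks, k):
    # Fraction of queries whose correct translation appears in the top-k candidates.
    # ranks: 1-based rank of the correct candidate for each query.
    return sum(r <= k for r in ranks) / len(ranks)

def mean_reciprocal_rank(ranks):
    # Average of 1/rank of the correct translation across all queries.
    return sum(1.0 / r for r in ranks) / len(ranks)
```

For example, with ranks `[1, 3, 12, 2]`, P@1 is 0.25, P@5 is 0.75, and MRR is (1 + 1/3 + 1/12 + 1/2) / 4 ≈ 0.479.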

Results

Overall Performance

| Metric | Value |
|--------|-------|
| P@1    | 0.559 |
| P@5    | 0.787 |
| P@10   | 0.841 |
| MRR    | 0.661 |

Performance by Part of Speech (P@1)

| POS (Tajik) | English      | P@1   |
|-------------|--------------|-------|
| исм         | Noun         | 0.545 |
| сифат       | Adjective    | 0.598 |
| феъл        | Verb         | 0.489 |
| зарф        | Adverb       | 0.539 |
| исми хос    | Proper Noun  | 0.429 |
| нидо        | Interjection | 0.452 |
| пайвандак   | Conjunction  | 0.375 |
| шумора      | Numeral      | 0.562 |

Comparison with Other Models

| Model | P@1 | P@5 | P@10 | MRR |
|-------|-----|-----|------|-----|
| LaBSE (fine‑tuned) | 0.684 | 0.878 | 0.913 | 0.771 |
| multilingual‑e5 + contrastive (this model) | 0.559 | 0.787 | 0.841 | 0.661 |
| XLM‑RoBERTa + LoRA | 0.000 | 0.001 | 0.001 | 0.001 |
| FastText+VecMap | 0.000 | 0.000 | 0.000 | 0.000 |

This model is the second‑best performer among those evaluated, with particularly strong results on adjectives and numerals.

Environmental Impact

  • Hardware Type: Single GPU (NVIDIA unspecified)
  • Hours used: ~3 hours
  • Cloud Provider: Not applicable (local machine)
  • Compute Region: Not applicable
  • Carbon Emitted: Not estimated

Technical Specifications

Model Architecture and Objective

  • Base architecture: intfloat/multilingual-e5-base (Transformer‑based, 12 layers, 768 hidden size)
  • Training objective: Contrastive (NT‑Xent) loss to maximise similarity between translation pairs and minimise similarity between non‑pairs.

Compute Infrastructure

Hardware

  • GPU with at least 8 GB VRAM (e.g., NVIDIA Tesla T4 or similar)

Software

  • Python 3.9
  • PyTorch 1.13
  • Transformers 4.25
  • Datasets 2.7

Citation

@misc{tajik_persian_e5_2026,
    title = {Tajik-Persian multilingual-e5 + Contrastive Learning for Lexical Induction},
    author = {Arabov, Mullosharaf Kurbonovich},
    year = {2026},
    publisher = {Hugging Face},
    url = {https://huggingface.co/TajikNLPWorld/tajik-persian-e5-contrastive}
}

More Information

[More Information Needed]

Model Card Authors

Mullosharaf K. Arabov (TajikNLPWorld)

Model Card Contact
