# Model Card for Tajik-Persian multilingual-e5 + Contrastive Learning for Lexical Induction
This model is a fine-tuned version of intfloat/multilingual-e5-base on the TajPersParallelLexicalCorpus dataset. It produces aligned embeddings for Tajik (Cyrillic) and Persian (Arabic script) words, enabling cross-lingual word retrieval and semantic similarity tasks.
## Model Details

### Model Description
- Developed by: Mullosharaf K. Arabov (TajikNLPWorld)
- Funded by: [More Information Needed]
- Shared by: TajikNLPWorld
- Model type: Sentence Transformer (contrastive learning)
- Language(s) (NLP): Tajik (tg), Persian (fa)
- License: Apache 2.0
- Finetuned from model: intfloat/multilingual-e5-base
### Model Sources
- Repository: https://huggingface.co/TajikNLPWorld/tajik-persian-e5-contrastive
- Paper: [More Information Needed]
- Demo: [More Information Needed]
## Uses

### Direct Use
The model can be used directly to obtain cross-lingual embeddings for Tajik and Persian words. It is optimised for single‑word inputs. Example use cases:
- Finding translations of a Tajik word from a Persian candidate list.
- Computing semantic similarity between words in the two languages.
- Building bilingual lexical resources or improving machine translation pre-processing.
### Downstream Use
The model can be integrated into larger systems that require cross-lingual alignment, such as:
- Bilingual lexicon induction
- Cross-lingual information retrieval
- Unsupervised or semi-supervised machine translation (as a pre-processing step for word alignment)
### Out-of-Scope Use
- The model is not designed for multi‑word phrases or full sentences; performance on such inputs may degrade.
- It does not handle out-of-vocabulary words beyond its subword tokenization; for rare words, subword-based FastText models may be more appropriate.
- It should not be used for languages other than Tajik and Persian.
## Bias, Risks, and Limitations
- The model was trained on a parallel lexical corpus which may contain biases present in the source data (e.g., under‑representation of certain domains or dialects).
- Performance varies by part‑of‑speech; conjunctions and proper nouns show lower retrieval accuracy.
- As a neural model, it may exhibit unpredictable behaviour on adversarial or nonsensical inputs.
### Recommendations
Users should evaluate the model on their specific task and consider combining it with other resources (e.g., rule‑based checks) for critical applications. When using the model, be aware of the limitations described above.
## How to Get Started with the Model
Use the code below to get started with the model.
```python
from transformers import AutoModel, AutoTokenizer
import torch
import torch.nn.functional as F

model_name = "TajikNLPWorld/tajik-persian-e5-contrastive"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

def get_embedding(text):
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=128)
    with torch.no_grad():
        outputs = model(**inputs)
    embeddings = outputs.last_hidden_state.mean(dim=1)  # mean pooling over tokens
    return F.normalize(embeddings, p=2, dim=1)          # L2-normalize for cosine similarity

# Example similarity
tajik_word = "модар"
persian_word = "مادر"
emb_tg = get_embedding(tajik_word)
emb_fa = get_embedding(persian_word)
similarity = torch.cosine_similarity(emb_tg, emb_fa)
print(f"Similarity: {similarity.item():.4f}")  # ~0.78

# Find translations from a candidate list
persian_words = ["مادر", "پدر", "برادر", "خواهر", "دختر", "پسر"]
persian_embeddings = torch.cat([get_embedding(w) for w in persian_words])
query_emb = get_embedding("модар")
similarities = torch.mm(query_emb, persian_embeddings.T).squeeze(0)
top_indices = similarities.argsort(descending=True)[:5]
for i in top_indices:
    print(f"{persian_words[i]}: {similarities[i].item():.4f}")
```
## Training Details

### Training Data
- Dataset: TajPersParallelLexicalCorpus
- Size: 33,222 parallel word pairs (Tajik–Persian)
- Split: Training set of 29,069 pairs (33,222 total minus 4,153 pairs held out for evaluation)
### Training Procedure
The model was fine‑tuned using contrastive learning (NT‑Xent loss) to align embeddings of translation pairs.
#### Preprocessing
- Inputs were tokenized with the multilingual‑e5 tokenizer (max length 128).
- No additional filtering was applied.
#### Training Hyperparameters
- Batch size: 16
- Learning rate: 2e-5
- Optimizer: AdamW
- Epochs: 3
- Loss: NT‑Xent with temperature 0.02
- Max sequence length: 128 tokens
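The exact loss implementation has not been released; the following is a minimal in-batch NT-Xent sketch consistent with the hyperparameters above (batch size 16, temperature 0.02). The function name and the symmetric two-direction formulation are my own assumptions, not confirmed details of the training code.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(emb_tg, emb_fa, temperature=0.02):
    """In-batch NT-Xent: each Tajik embedding is pulled toward its
    Persian translation; the other batch items serve as negatives."""
    emb_tg = F.normalize(emb_tg, p=2, dim=1)
    emb_fa = F.normalize(emb_fa, p=2, dim=1)
    logits = emb_tg @ emb_fa.T / temperature   # (B, B) similarity matrix
    targets = torch.arange(emb_tg.size(0))     # diagonal entries are the positive pairs
    # Symmetric loss over both retrieval directions (tg->fa and fa->tg).
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2

# Sanity check: perfectly aligned pairs should yield a near-zero loss.
batch = F.normalize(torch.randn(16, 768), dim=1)
loss = nt_xent_loss(batch, batch)
print(f"{loss.item():.4f}")
```

With the low temperature of 0.02, even small similarity gaps between the positive pair and in-batch negatives produce sharp gradients, which suits single-word inputs where translation pairs should be near-duplicates in embedding space.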
#### Speeds, Sizes, Times
- Training was performed on a single GPU (exact hardware unknown). Each epoch took approximately 1 hour.
## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data
A held‑out test set of 4,153 Tajik–Persian word pairs from the same corpus, not seen during training.
#### Factors
- Part‑of‑speech: The test set includes nouns, adjectives, verbs, adverbs, proper nouns, interjections, conjunctions, and numerals (tagged for Tajik).
- Domain: General vocabulary.
#### Metrics
- Precision@k (P@1, P@5, P@10): proportion of queries where the correct translation is in the top‑k retrieved candidates.
- Mean Reciprocal Rank (MRR): average of reciprocal ranks of the correct translation.
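Both metrics can be computed directly from the normalized embeddings. The sketch below is illustrative (the function name and gold-index convention are my own, not taken from any released evaluation script):

```python
import torch

def retrieval_metrics(query_embs, cand_embs, gold, ks=(1, 5, 10)):
    """Precision@k and MRR for cross-lingual word retrieval.

    gold[i] is the index in cand_embs of the correct translation for
    query i. Embeddings are assumed L2-normalized, so the dot product
    equals cosine similarity."""
    sims = query_embs @ cand_embs.T                   # (N, C) similarity matrix
    ranked = sims.argsort(dim=1, descending=True)     # candidate indices, best first
    gold = torch.as_tensor(gold).unsqueeze(1)
    pos = (ranked == gold).nonzero()[:, 1].float()    # 0-based rank of the gold item
    metrics = {f"P@{k}": (pos < k).float().mean().item() for k in ks}
    metrics["MRR"] = (1.0 / (pos + 1)).mean().item()
    return metrics

# Toy example: three orthogonal "embeddings", each its own translation.
embs = torch.eye(3)
print(retrieval_metrics(embs, embs, [0, 1, 2], ks=(1,)))
```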
### Results

#### Overall Performance
| Metric | Value |
|---|---|
| P@1 | 0.559 |
| P@5 | 0.787 |
| P@10 | 0.841 |
| MRR | 0.661 |
#### Performance by Part of Speech (P@1)
| POS (Tajik) | English | P@1 |
|---|---|---|
| исм | Noun | 0.545 |
| сифат | Adjective | 0.598 |
| феъл | Verb | 0.489 |
| зарф | Adverb | 0.539 |
| исми хос | Proper Noun | 0.429 |
| нидо | Interjection | 0.452 |
| пайвандак | Conjunction | 0.375 |
| шумора | Numeral | 0.562 |
#### Comparison with Other Models
| Model | P@1 | P@5 | P@10 | MRR |
|---|---|---|---|---|
| LaBSE (fine‑tuned) | 0.684 | 0.878 | 0.913 | 0.771 |
| multilingual‑e5 + contrastive (this model) | 0.559 | 0.787 | 0.841 | 0.661 |
| XLM‑RoBERTa + LoRA | 0.000 | 0.001 | 0.001 | 0.001 |
| FastText+VecMap | 0.000 | 0.000 | 0.000 | 0.000 |
This model is the second‑best performer among those evaluated, with particularly strong results on adjectives and numerals.
## Environmental Impact
- Hardware Type: Single GPU (NVIDIA unspecified)
- Hours used: ~3 hours
- Cloud Provider: Not applicable (local machine)
- Compute Region: Not applicable
- Carbon Emitted: Not estimated
## Technical Specifications

### Model Architecture and Objective

- Base architecture: intfloat/multilingual-e5-base (Transformer-based, 12 layers, 768 hidden size)
- Training objective: Contrastive (NT-Xent) loss to maximise similarity between translation pairs and minimise similarity between non-pairs.
### Compute Infrastructure

#### Hardware
- GPU with at least 8 GB VRAM (e.g., NVIDIA Tesla T4 or similar)
#### Software
- Python 3.9
- PyTorch 1.13
- Transformers 4.25
- Datasets 2.7
## Citation

```bibtex
@misc{tajik_persian_e5_2026,
  title     = {Tajik-Persian multilingual-e5 + Contrastive Learning for Lexical Induction},
  author    = {Arabov, Mullosharaf Kurbonovich},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/TajikNLPWorld/tajik-persian-e5-contrastive}
}
```
## More Information
[More Information Needed]
## Model Card Authors
Mullosharaf K. Arabov (TajikNLPWorld)
## Model Card Contact
- Hugging Face: TajikNLPWorld
- Email: cool.araby@gmail.com