
Model Card for Tajik-Persian multilingual-e5 + Contrastive Learning for Lexical Induction

This model is a fine-tuned version of intfloat/multilingual-e5-base on the TajPersParallelLexicalCorpus dataset. It produces aligned embeddings for Tajik (Cyrillic) and Persian (Arabic script) words, enabling cross-lingual word retrieval and semantic similarity tasks.

Model Details

Model Description

  • Developed by: Mullosharaf K. Arabov (TajikNLPWorld)
  • Funded by: [More Information Needed]
  • Shared by: TajikNLPWorld
  • Model type: Sentence Transformer (contrastive learning)
  • Language(s) (NLP): Tajik (tg), Persian (fa)
  • License: Apache 2.0
  • Finetuned from model: intfloat/multilingual-e5-base

Model Sources

Uses

Direct Use

The model can be used directly to obtain cross-lingual embeddings for Tajik and Persian words. It is optimised for single‑word inputs. Example use cases:

  • Finding translations of a Tajik word from a Persian candidate list.
  • Computing semantic similarity between words in the two languages.
  • Building bilingual lexical resources or improving machine translation pre-processing.

Downstream Use

The model can be integrated into larger systems that require cross-lingual alignment, such as:

  • Bilingual lexicon induction
  • Cross-lingual information retrieval
  • Unsupervised or semi-supervised machine translation (as a pre-processing step for word alignment)

Out-of-Scope Use

  • The model is not designed for multi‑word phrases or full sentences; performance on such inputs may degrade.
  • Rare or out-of-vocabulary words are handled only via subword tokenization; for such words, subword‑based FastText models may be more appropriate.
  • It should not be used for languages other than Tajik and Persian.

Bias, Risks, and Limitations

  • The model was trained on a parallel lexical corpus which may contain biases present in the source data (e.g., under‑representation of certain domains or dialects).
  • Performance varies by part‑of‑speech; conjunctions and proper nouns show lower retrieval accuracy.
  • As a neural model, it may exhibit unpredictable behaviour on adversarial or nonsensical inputs.

Recommendations

Users should evaluate the model on their specific task and consider combining it with other resources (e.g., rule‑based checks) for critical applications. When using the model, be aware of the limitations described above.

How to Get Started with the Model

Use the code below to get started with the model.

```python
from transformers import AutoModel, AutoTokenizer
import torch
import torch.nn.functional as F

model_name = "TajikNLPWorld/tajik-persian-e5-contrastive"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

def get_embedding(text):
    # Tokenize and encode without gradient tracking.
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=128)
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean-pool the token embeddings, then L2-normalize so that
    # dot products between embeddings equal cosine similarities.
    embeddings = outputs.last_hidden_state.mean(dim=1)
    return F.normalize(embeddings, p=2, dim=1)

# Example: similarity between a Tajik word and its Persian translation
tajik_word = "модар"
persian_word = "مادر"
emb_tg = get_embedding(tajik_word)
emb_fa = get_embedding(persian_word)
similarity = torch.cosine_similarity(emb_tg, emb_fa)
print(f"Similarity: {similarity.item():.4f}")  # ~0.78

# Find translations from a candidate list
persian_words = ["مادر", "پدر", "برادر", "خواهر", "دختر", "پسر"]
persian_embeddings = torch.cat([get_embedding(w) for w in persian_words])
query_emb = get_embedding("модар")
# Embeddings are normalized, so the matrix product yields cosine similarities.
similarities = torch.mm(query_emb, persian_embeddings.T).squeeze(0)
top_indices = similarities.argsort(descending=True)[:5]
for i in top_indices:
    print(f"{persian_words[i]}: {similarities[i].item():.4f}")
```

Training Details

Training Data

  • Dataset: TajPersParallelLexicalCorpus
  • Size: 33,222 parallel word pairs (Tajik–Persian)
  • Split: 29,069 pairs for training (33,222 total minus 4,153 held out for evaluation)

Training Procedure

The model was fine‑tuned using contrastive learning (NT‑Xent loss) to align embeddings of translation pairs.

Preprocessing

  • Inputs were tokenized with the multilingual‑e5 tokenizer (max length 128).
  • No additional filtering was applied.

Training Hyperparameters

  • Batch size: 16
  • Learning rate: 2e-5
  • Optimizer: AdamW
  • Epochs: 3
  • Loss: NT‑Xent with temperature 0.02
  • Max sequence length: 128 tokens
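
The NT‑Xent objective used here can be sketched as follows. This is a minimal illustration with in‑batch negatives, not the actual training script; the function name `nt_xent_loss` is hypothetical, while the temperature default matches the reported hyperparameter:

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(emb_a, emb_b, temperature=0.02):
    """Symmetric NT-Xent (InfoNCE) loss over in-batch negatives.

    emb_a, emb_b: (batch, dim) embeddings of translation pairs; row i of
    emb_a and row i of emb_b form a positive pair, and every other row in
    the batch serves as a negative.
    """
    emb_a = F.normalize(emb_a, p=2, dim=1)
    emb_b = F.normalize(emb_b, p=2, dim=1)
    # (batch, batch) matrix of cosine similarities, scaled by temperature.
    logits = emb_a @ emb_b.T / temperature
    # The positive pair for each row sits on the diagonal.
    targets = torch.arange(emb_a.size(0))
    # Average the two retrieval directions (Tajik->Persian and Persian->Tajik).
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2
```

A low temperature such as 0.02 sharpens the softmax, so the loss strongly penalises any negative candidate that scores close to the true translation.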

Speeds, Sizes, Times

  • Training was performed on a single GPU (exact hardware unknown). Each epoch took approximately 1 hour.

Evaluation

Testing Data, Factors & Metrics

Testing Data

A held‑out test set of 4,153 Tajik–Persian word pairs from the same corpus, not seen during training.

Factors

  • Part‑of‑speech: The test set includes nouns, adjectives, verbs, adverbs, proper nouns, interjections, conjunctions, and numerals (tagged for Tajik).
  • Domain: General vocabulary.

Metrics

  • Precision@k (P@1, P@5, P@10): proportion of queries where the correct translation is in the top‑k retrieved candidates.
  • Mean Reciprocal Rank (MRR): average of reciprocal ranks of the correct translation.
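
Given the 1‑based rank of the correct translation for each query, these metrics reduce to a few lines. The helper names below are illustrative, not taken from the original evaluation code:

```python
def precision_at_k(ranks, k):
    # Fraction of queries whose correct translation appears in the top-k candidates.
    # ranks: 1-based rank of the correct candidate for each query.
    return sum(r <= k for r in ranks) / len(ranks)

def mean_reciprocal_rank(ranks):
    # Average of 1/rank of the correct translation across all queries.
    return sum(1.0 / r for r in ranks) / len(ranks)
```

For example, with ranks `[1, 3, 12, 2]`, P@1 is 0.25, P@5 is 0.75, and MRR is (1 + 1/3 + 1/12 + 1/2) / 4 ≈ 0.479.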

Results

Overall Performance

| Metric | Value |
|--------|-------|
| P@1    | 0.559 |
| P@5    | 0.787 |
| P@10   | 0.841 |
| MRR    | 0.661 |

Performance by Part of Speech (P@1)

| POS (Tajik) | English      | P@1   |
|-------------|--------------|-------|
| исм         | Noun         | 0.545 |
| сифат       | Adjective    | 0.598 |
| феъл        | Verb         | 0.489 |
| зарф        | Adverb       | 0.539 |
| исми хос    | Proper Noun  | 0.429 |
| нидо        | Interjection | 0.452 |
| пайвандак   | Conjunction  | 0.375 |
| шумора      | Numeral      | 0.562 |

Comparison with Other Models

| Model | P@1 | P@5 | P@10 | MRR |
|-------|-----|-----|------|-----|
| LaBSE (fine‑tuned) | 0.684 | 0.878 | 0.913 | 0.771 |
| multilingual‑e5 + contrastive (this model) | 0.559 | 0.787 | 0.841 | 0.661 |
| XLM‑RoBERTa + LoRA | 0.000 | 0.001 | 0.001 | 0.001 |
| FastText+VecMap | 0.000 | 0.000 | 0.000 | 0.000 |

This model is the second‑best performer among those evaluated, with particularly strong results on adjectives and numerals.

Environmental Impact

  • Hardware Type: Single GPU (NVIDIA unspecified)
  • Hours used: ~3 hours
  • Cloud Provider: Not applicable (local machine)
  • Compute Region: Not applicable
  • Carbon Emitted: Not estimated

Technical Specifications

Model Architecture and Objective

  • Base architecture: intfloat/multilingual-e5-base (Transformer‑based, 12 layers, 768 hidden size)
  • Training objective: Contrastive (NT‑Xent) loss to maximise similarity between translation pairs and minimise similarity between non‑pairs.

Compute Infrastructure

Hardware

  • GPU with at least 8 GB VRAM (e.g., NVIDIA Tesla T4 or similar)

Software

  • Python 3.9
  • PyTorch 1.13
  • Transformers 4.25
  • Datasets 2.7

Citation

@misc{tajik_persian_e5_2026,
    title = {Tajik-Persian multilingual-e5 + Contrastive Learning for Lexical Induction},
    author = {Arabov, Mullosharaf Kurbonovich},
    year = {2026},
    publisher = {Hugging Face},
    url = {https://huggingface.co/TajikNLPWorld/tajik-persian-e5-contrastive}
}

More Information

[More Information Needed]

Model Card Authors

Mullosharaf K. Arabov (TajikNLPWorld)

Model Card Contact
