# EgyTriplet Fine-Tuned Model

This model is a fine-tuned version of [multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) for semantic-embedding tasks in Egyptian Arabic and Modern Standard Arabic (MSA).
It was trained on the EgyTriplets-2M dataset, a large-scale triplet dataset built with an automated pipeline of translation, quality scoring, and hard-negative mining.
## What's Special?
- Fine-tuned using Triplet Loss
- Supports dialectal and standard Arabic embedding
- Trained on 2 million anchor-positive-negative triplets
- Boosts retrieval, semantic search, and paraphrase detection tasks
## Training Details
| Setting | Value |
|---|---|
| Base Model | multilingual-e5-large / multilingual-e5-large-instruct |
| Dataset | EgyTriplets-2M |
| Loss Function | TripletMarginLoss (margin = 0.3) |
| Batch Size | 16 |
| Epochs | 3 |
| Learning Rate | 2e-5 |
| Hardware | NVIDIA A100 40GB |
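The training objective in the table above can be sketched with PyTorch's built-in `TripletMarginLoss`. The margin, batch size, and embedding dimension below mirror the listed settings (1024 is the hidden size of multilingual-e5-large); the random tensors are stand-ins for real encoder outputs, not the actual training code:

```python
import torch
import torch.nn as nn

# Triplet margin loss with the margin from the table above.
loss_fn = nn.TripletMarginLoss(margin=0.3, p=2)

# Stand-ins for encoder outputs: one batch of 16 triplets,
# each embedding 1024-dimensional (the e5-large hidden size).
anchor   = torch.randn(16, 1024)
positive = anchor + 0.05 * torch.randn(16, 1024)  # close to the anchor
negative = torch.randn(16, 1024)                  # unrelated

loss = loss_fn(anchor, positive, negative)
# Training pushes d(anchor, positive) + margin below d(anchor, negative).
```

In real training, `anchor`, `positive`, and `negative` would each be the pooled encoder output for the corresponding text in a triplet.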
## Performance (Triplet Loss)
| Model | Final Triplet Loss |
|---|---|
| egytriplet-e5-large | 0.107 |
| egytriplet-e5-large-instruct | 0.103 |
A lower final triplet loss indicates better semantic alignment between anchors and positives relative to negatives.
## Intended Use
This model is best suited for:
- Semantic similarity
- Information retrieval
- Search ranking
- Arabic paraphrase detection
- Dialect-to-MSA alignment tasks
## Languages
- Egyptian Arabic (ar-eg)
- Modern Standard Arabic (msa)
## License
CC BY 4.0: free to use, adapt, and share with attribution.
## Citation
If you use this model in your work, please cite:

```bibtex
@misc{egytriplets2024,
  author = {Mohammad Essam},
  title  = {EgyTriplets: Generating 2 Million Egyptian Arabic Triplets via Transformer-Based Translation and Retrieval},
  year   = {2024},
  url    = {https://huggingface.co/datasets/metga97/egytriplets-2m}
}
```