How to use metga97/egytriplet-e5-large-instruct with Transformers:
# Load model directly
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("metga97/egytriplet-e5-large-instruct")
model = AutoModel.from_pretrained("metga97/egytriplet-e5-large-instruct")

This model is a fine-tuned version of multilingual-e5-large-instruct for semantic embedding tasks in Egyptian Arabic and Modern Standard Arabic (MSA).
It was trained using the EgyTriplets - 2M dataset, a large-scale triplet dataset built using an automated pipeline involving translation, quality scoring, and hard-negative mining.
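The card does not describe how the hard negatives were mined, so the following is only a minimal sketch of the general idea, assuming cosine similarity over precomputed embeddings: for each anchor, pick the candidate that sits closest to the anchor without being its known positive. The helper name `mine_hard_negative` and the toy vectors are hypothetical and are not taken from the EgyTriplets pipeline.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mine_hard_negative(
    anchor: Sequence[float],
    candidates: List[Sequence[float]],
    positive_index: int,
) -> int:
    """Return the index of the non-positive candidate most similar to the anchor.

    Hypothetical helper: a "hard" negative is a wrong candidate that the
    embedding space currently places near the anchor.
    """
    best_index, best_score = -1, -math.inf
    for i, candidate in enumerate(candidates):
        if i == positive_index:
            continue  # never select the true positive as a negative
        score = cosine(anchor, candidate)
        if score > best_score:
            best_index, best_score = i, score
    return best_index


# Toy 2-D embeddings, made up for illustration only.
anchor = [1.0, 0.0]
candidates = [
    [0.9, 0.1],   # index 0: the known positive
    [0.7, 0.7],   # index 1: fairly similar to the anchor, a hard negative
    [-1.0, 0.0],  # index 2: opposite direction, an easy negative
]
hard_index = mine_hard_negative(anchor, candidates, positive_index=0)
print(hard_index)
```

In a real pipeline the embeddings would come from an encoder and the search would typically run over a large corpus with an approximate nearest-neighbour index. The brute-force loop here only illustrates the selection rule.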
| Setting | Value |
|---|---|
| Base Model | multilingual-e5-large / multilingual-e5-large-instruct |
| Dataset | EgyTriplets - 2M |
| Loss Function | TripletMarginLoss (margin = 0.3) |
| Batch Size | 16 |
| Epochs | 3 |
| Learning Rate | 2e-5 |
| Hardware | NVIDIA A100 40GB |

| Model | Final Triplet Loss |
|---|---|
| egytriplet-e5-large | 0.107 |
| egytriplet-e5-large-instruct | 0.103 |
A lower final triplet loss indicates better semantic alignment.
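To make the loss values above concrete, here is a minimal pure-Python sketch of a triplet margin loss for a single triplet, using the margin of 0.3 from the training table. It assumes Euclidean distance, which is the default in PyTorch's `torch.nn.TripletMarginLoss`. The card does not state which distance was used, and the toy vectors are made up for illustration.

```python
import math
from typing import Sequence


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean (L2) distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_margin_loss(
    anchor: Sequence[float],
    positive: Sequence[float],
    negative: Sequence[float],
    margin: float = 0.3,  # margin taken from the training table
) -> float:
    """Compute max(d(anchor, positive) - d(anchor, negative) + margin, 0)."""
    gap = euclidean(anchor, positive) - euclidean(anchor, negative)
    return max(gap + margin, 0.0)


# Toy 2-D embeddings (assumed values, not real model outputs).
anchor = [0.0, 0.0]
near = [0.0, 0.5]  # distance 0.5 from the anchor
far = [1.0, 0.0]   # distance 1.0 from the anchor

# Positive is closer than the negative by more than the margin: loss is zero.
satisfied = triplet_margin_loss(anchor, positive=near, negative=far)

# Roles swapped: the "positive" is farther away, so the loss is roughly 0.8.
violated = triplet_margin_loss(anchor, positive=far, negative=near)

print(satisfied, violated)
```

The loss for a triplet reaches zero only once the negative is at least `margin` farther from the anchor than the positive. A reported training loss is normally an average of this quantity over many triplets.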
This model is best suited for:
CC BY 4.0: free to use, adapt, and share with attribution.
If you use this model in your work, please cite:
@misc{egytriplets2024,
author = {Mohammad Essam},
title = {EgyTriplets: Generating 2 Million Egyptian Arabic Triplets via Transformer-Based Translation and Retrieval},
year = {2024},
url = {https://huggingface.co/datasets/metga97/egytriplets-2m}
}
Base model
intfloat/multilingual-e5-large-instruct