Title: Dialect-Aware, Arabic-Centric, Cross-Lingual, and Cross-Cultural Embedding Models and Benchmarks

URL Source: https://arxiv.org/html/2411.01192

Published Time: Wed, 12 Feb 2025 01:59:52 GMT

Markdown Content: Gagan Bhatia ξ El Moatez Billah Nagoudi ξ,γ Abdellah El Mekki ω Fakhraddin Alwajih ξ

Muhammad Abdul-Mageed ξ,ω,λ

ξ The University of British Columbia, ω MBZUAI, γ PSU, λ Invertible AI

{gagan30@student.,muhammad.mageed@}ubc.ca

Abstract

We introduce Swan, a family of embedding models centered on Arabic, designed for both small-scale and large-scale applications. Swan comprises two variants: Swan-Small, built on ARBERTv2, and Swan-Large, based on ArMistral, a pretrained Arabic large language model. To evaluate our models, we propose a comprehensive benchmark suite, dubbed ArabicMTEB, that assesses cross-lingual, multi-dialectal, multi-domain, and multi-cultural Arabic text embedding performance. ArabicMTEB covers eight diverse tasks sourced from 94 datasets. Swan-Large achieves state-of-the-art results, outperforming Multilingual-E5-large in most Arabic tasks, while Swan-Small consistently surpasses Multilingual-E5-base. Our extensive evaluations show that Swan models are both dialectally and culturally aware, achieving strong performance across diverse Arabic domains while maintaining significant cost efficiency. This work significantly advances the field of Arabic language modelling and provides valuable resources for future research and applications in Arabic NLP. Our models and benchmark are available at our GitHub page: https://github.com/UBC-NLP/swan.


1 Introduction

NLP has seen rapid advancements in recent years, driven by groundbreaking developments in deep learning and the emergence of sophisticated distributed text representations such as word and sentence embeddings Devlin et al. (2018); Reimers and Gurevych (2019). These embeddings, which transform text into dense vectors, enable effective semantic understanding and are pivotal for enhancing performance across many downstream applications, including text classification, semantic search, and machine translation. Moreover, text embeddings have become fundamental to the success of large language models (LLMs) Touvron et al. (2023); Jiang et al. (2023); Gemma-Team et al. (2024), which are increasingly integrated into a variety of real-world systems and tools. One of the most promising applications of these embeddings is Retrieval-Augmented Generation (RAG) Shao et al. (2023); rag (2023), where LLMs are augmented with information retrieval capabilities. In RAG-based systems, lightweight embedding models retrieve relevant information from large corpora, which is then fed as context to models like ChatGPT OpenAI (2023) or GPT-4 OpenAI et al. (2024). This synergy between embeddings and LLMs has demonstrated significant improvements in both general-purpose tasks such as question answering Lin et al. (2023); rag (2023) and domain-specific applications Bhatia et al. (2024); Shi et al. (2023); Lin et al. (2023).


Figure 1: Overview of our ArabicMTEB benchmark tasks, covering clustering, retrieval, reranking, classification, semantic similarity, pair classification, cross-lingual retrieval, and bitext mining.

Despite these advances, the predominant focus of current embedding models has been on English and Chinese, which limits their applicability to other languages. This is particularly true for Arabic, a collection of languages, language varieties, and diverse dialects with rich morphology Abdul-Mageed et al. (2023a, 2024a), making it challenging to develop effective language representations Nagoudi et al. (2022); Huang et al. (2024). Existing multilingual models often fail to capture these complexities, leading to a suboptimal performance on Arabic NLP tasks Abdul-Mageed et al. (2020a); Elmadany et al. (2022). Addressing this limitation requires the development of Arabic-specific embedding models that are sensitive to the linguistic and cultural nuances of Arabic.

In this work, we introduce Swan, a family of dialect-aware, Arabic-centric, cross-lingual, and cross-cultural embedding models designed to bridge this gap and push the boundaries of Arabic NLP. Our contributions are as follows: (1) We introduce Swan, a cutting-edge family of Arabic embedding models. This includes two variants: Swan-Small, based on ARBERTv2 Elmadany et al. (2022), and Swan-Large, built upon ArMistral, a further pretrained Arabic language model. (2) We present ArabicMTEB, a comprehensive evaluation benchmark for Arabic text. ArabicMTEB is designed to assess cross-lingual, multi-dialectal, multi-domain, and multi-cultural performance, spanning eight tasks and 94 datasets. Figure 1 provides an overview of ArabicMTEB. (3) Our larger model, Swan-Large, showcases state-of-the-art text embedding capabilities, surpassing Multilingual-E5-large Wang et al. (2024b) on most Arabic tasks. Similarly, our smaller model, Swan-Small, consistently outperforms Multilingual-E5-base Wang et al. (2024b) on most Arabic tasks. (4) Through rigorous benchmarking, we demonstrate that Swan models are not only dialectally and culturally aware but also excel across diverse Arabic domains while maintaining a significantly lower monetary cost.

The rest of the paper is organized as follows: In Section 2, we review related work, with a particular emphasis on Arabic text embedding models, their applications, and their challenges. We present our approach to training the Swan models in Section 3. Section 4 outlines how we built our benchmark dataset, ArabicMTEB. Section 5 describes our experiments and model analysis. We conclude in Section 6.

2 Related Work

Table 1: Comparison of various text embedding benchmarks proposed in the literature across the different covered task clusters. RTR: retrieval, STS: semantic textual similarity, PairCLF: pair classification, CLF: classification, CLR: clustering, RRK: reranking, BTM: bitext mining, CRTR: cross-lingual retrieval.

In recent years, there have been remarkable advancements in text embedding models, with a shift towards developing universal embeddings for diverse tasks and domains. Despite this, specialized models and benchmarks for languages like Arabic remain underexplored.

Multilingual Text Embedding Models. With the need for language-agnostic embeddings growing, multilingual models such as LASER Artetxe and Schwenk (2019) and LaBSE Feng et al. (2022) were developed using BiLSTM and Transformer encoders, respectively. Building on this, the Multilingual-E5 Wang et al. (2024c) series extends the E5 architecture to support diverse languages using multilingual text pairs and synthetic data. GRIT Muennighoff et al. (2024) further unifies generative and embedding tasks within a single model. Newer models such as ColBERT-XM Louis et al. (2024) and Gecko Lee et al. (2024) refine multilingual embeddings through modular and distilled architectures.

Arabic-Specific Models. Despite progress in Arabic NLP, existing models are not optimized for Arabic text embedding and retrieval. Efforts like ARBERT Abdul-Mageed et al. (2021a) and AraMus Alghamdi et al. (2023) have focused on encoding and generation but are not tailored for sentence-level embeddings. While language-agnostic models such as LASER and Multilingual-E5 include Arabic in their training data, they may not fully capture its linguistic intricacies and diversity. To address this, Nacar and Koubaa (2024) introduced models and training datasets to improve semantic similarity performance for Arabic.

Text Embedding Benchmarks. Most text embedding evaluations rely on a narrow set of datasets, limiting their generalisation ability. To address this, the Massive Text Embedding Benchmark (MTEB) Muennighoff et al. (2023) introduced eight task categories with 58 datasets and 112 languages. However, it remains predominantly focused on English. Similar benchmarks have been developed for other languages, such as C-MTEB Xiao et al. (2023) for Chinese. For Arabic, evaluations have primarily centred on Semantic Text Similarity (STS) tasks Nacar and Koubaa (2024). However, excelling in STS does not guarantee optimal performance in tasks like clustering or reranking Muennighoff et al. (2023). Existing Arabic benchmarks like ORCA Elmadany et al. (2023) and ALUE Seelawi et al. (2021) focus on natural language understanding (NLU), while Dolphin Nagoudi et al. (2023a) targets natural language generation (NLG). This work is the first comprehensive benchmark for evaluating Arabic text embeddings across multiple tasks.

3 Swan

3.1 Training Data

To train Swan, we develop the most extensive training corpus for Arabic embedding models, leveraging a unique assembly of datasets to ensure comprehensive linguistic coverage and diversity. Our training data covers paragraph-based and sentence-based datasets curated from multiple sources. Table 2 shows an overview of our training datasets.

Figure 2: Methodology to generate our synthetic data.

MSA Datasets. We focus on two sources: (i) Human-generated data: Composed from ORCA Elmadany et al. (2023) and mMARCO Bonifacio et al. (2021). ORCA is a compilation of labelled datasets with tasks such as semantic text similarity (STS), sentence classification, text classification, natural language inference (NLI), and question answering. We use all the training sets from ORCA, encompassing 60 different datasets. mMARCO-ar is the translated version of MS MARCO, a human-generated dataset Bajaj et al. (2018). Both datasets are cleaned and de-duplicated using PolyDeDupe Bhatia (2023) (https://github.com/gagan3012/PolyDeDupe), which is further described in Appendix C. (ii) Synthetically-generated data: To augment our MSA training data for retrieval tasks, we use Command R+ Cohere For AI (2024) to generate high-quality synthetic data. (We performed various in-house evaluations comparing multiple models; Command R+ was chosen because it is open-source and efficient at generating Arabic varieties, both standard and dialectal.) The generation methodology is inspired by Wang et al. (2024a), and we employ the procedure shown in Figure 2 to generate our synthetic dataset. We generate 100k general MSA instances and 5k instances for each of the specific domains of finance, news, medicine, and legal, for a total of 120k MSA instances.

| Family | Language | Type | Dataset | Level | Size |
|---|---|---|---|---|---|
| Monoling | Ar | Human | ORCA-MSA | Sent | 378K |
| Monoling | Ar | Human | ORCA-DIA | Sent | 122K |
| Monoling | Ar | Human | mMARCO-ar | Sent | 8.1M |
| Monoling | Ar | Synthetic | Synth-MSA | Parag | 100K |
| Monoling | Ar | Synthetic | Synth-DIA | Parag | 15K |
| Monoling | Ar | Synthetic | Synth-DOM | Parag | 20K |
| Crossling | Ar to 15 lg | Human | mMARCO | Sent | 3M |
| Crossling | Ar to 6 lg | Human | XOR-TyDi | Sent | 20.5K |
| Multiling | 11 lg | Human | Mr-TyDi | Sent | 49K |
| Multiling | 16 lg | Human | MIRACL | Sent | 343K |
| **Total** | | | | | **12.5M** |

Table 2: The diverse datasets employed for training our Arabic embedding models. The synthetic data comprises three datasets: an MSA dataset, a dialectal dataset (Egyptian and Moroccan), and a domain-based dataset focusing on the medical, financial, legal, and news domains.

Dialectal Arabic Datasets. Similar to the MSA datasets, we focus on two sources: (i) Human-generated data: We use publicly available dialectal Arabic data, which primarily covers Gulf, Egyptian, Moroccan, and Levantine varieties of Arabic Elmadany et al. (2022); Nagoudi et al. (2023b); Alwajih et al. (2024); Abdul-Mageed et al. (2020b, 2021c, 2022, 2023b, 2018, c); Keleg et al. (2023); Keleg and Magdy (2023); Zaidan and Callison-Burch (2014); Bouamor et al. (2018). The total number of samples is 122K. (ii) Synthetically-generated data: As most human-generated dialectal data comes from noisy environments such as social media, it often results in short texts of low quality. Thus, we use Command-R+ to generate paragraph-based synthetic data for the Egyptian and Moroccan dialects to improve the performance of our models on dialectal Arabic. We generated 15k dialectal instances using the same methodology as for our synthetic MSA datasets described above.

Cross-Lingual & Multilingual Datasets. To adapt our model for cross-lingual and multilingual scenarios, we incorporate the mMARCO dataset, which provides translations of the MS MARCO dataset into 15 languages Bonifacio et al. (2021). To ensure that documents correspond accurately to their queries in different languages, we utilize specific IDs. We create 100k samples for each cross-lingual pair and shuffle the IDs to prevent repetition, thus guaranteeing that unique data samples are used for each language. We utilize the MIRACL Zhang et al. (2022), XOR-TyDi Asai et al. (2021), and Mr.TyDi Zhang et al. (2021) datasets as our cross-lingual and multilingual resources.

3.2 Training Strategy

For Swan, we consider two models: Swan-Small and Swan-Large. The choice of training two models with different sizes is driven by the need to balance performance and computational efficiency. Swan-Small is designed to cater to scenarios where lower computational resources are available or when a lightweight model is preferred for deployment on edge devices. In contrast, Swan-Large is intended for settings where achieving SoTA performance is paramount, leveraging a larger parameter size to better capture the nuances of Arabic.

Data Preprocessing. We incorporate human-generated and synthetic datasets into our training pipeline to ensure robust performance across various dialects and cultural contexts. We first train on MSA datasets, followed by fine-tuning on dialectal datasets. This two-step approach ensures that both MSA and dialectal varieties are well represented, promoting better generalization across the full spectrum of Arabic varieties. Our dataset is constructed with a query format, including positive and negative samples.
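For illustration only, a single instance in this query format might be structured as below; the field names and text are our assumptions for the sketch, not the authors' actual schema:

```python
# One hypothetical contrastive-training instance: a query, one relevant
# (positive) document, and hard negatives that are topically close but wrong.
# All field names and text are illustrative, not taken from the Swan corpus.
instance = {
    "query": "ما هي عاصمة المغرب؟",            # "What is the capital of Morocco?"
    "positive": "الرباط هي عاصمة المغرب.",      # "Rabat is the capital of Morocco."
    "negatives": [
        "الدار البيضاء أكبر مدينة في المغرب.",  # largest city, but not the capital
        "القاهرة هي عاصمة مصر.",                # a capital, but of Egypt
    ],
}
```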

Swan-Small. Swan-Small is built upon ARBERTv2 Abdul-Mageed et al. (2021b), a powerful BERT-based model for Arabic. The model is trained using the InfoNCE loss van den Oord et al. (2019), which maximizes the similarity between related sentences while minimizing the similarity between unrelated ones. The model is trained for five epochs on the entire dataset with a learning rate of 5e−6 and a batch size of 128, incorporating 15 hard negatives. Swan-Small has 164M parameters and an embedding dimension of 768.

Swan-Large. Swan-Large is based on ArMistral-7B, an in-house, further-pretrained version of Mistral-7B Jiang et al. (2023) (further details about ArMistral can be found in Appendix A). To train Swan-Large, we use LoRA Hu et al. (2021) for parameter-efficient training and the InfoNCE loss for optimization. We train the model for three epochs on the entire dataset with a learning rate of 5e−6 and a batch size of 128, incorporating seven hard negatives. Swan-Large has 7.2B parameters and an embedding dimension of 4,096.

3.3 Training Methodology

Given a relevant query-document pair $(q^{+}, d^{+})$, we modify the query by appending an instructional template to it. This process transforms the original query $q^{+}$ into a new form $q^{+}_{\text{inst}}$, defined as:

$$q^{+}_{\text{inst}} = \text{Instruction: \{task\_instruction\} Query: } q^{+}$$

Here, “{task_instruction}” refers to a one-sentence description of the embedding task, taken from Table 12, which outlines the instructions for the different tasks. Using a pretrained LLM, we append an [EOS] token to the end of both the modified query and the document. These are then fed into the LLM, and the embeddings $\mathbf{h}_{q^{+}_{\text{inst}}}$ and $\mathbf{h}_{d^{+}}$ are extracted from the last-layer hidden state at the [EOS] position. Training of the embedding model is again conducted using the InfoNCE loss function, which is widely recognized for its effectiveness in learning high-quality embeddings. The objective is minimized using the following formulation:

$$\min\left(-\log\frac{\phi(q^{+}_{\text{inst}}, d^{+})}{\phi(q^{+}_{\text{inst}}, d^{+}) + \sum_{n_{i} \in \mathbb{N}} \phi(q^{+}_{\text{inst}}, n_{i})}\right)$$

In the equation above, $\mathbb{N}$ denotes the set of negative samples, and $\phi(q, d)$ is the similarity scoring function between a query $q$ and a document $d$.
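Putting the steps of this section together, a minimal, self-contained sketch of the pipeline (instruction templating, last-token pooling, and the InfoNCE objective) might look like the following. All helper names are ours, the hidden states are toy values, and $\phi$ is assumed to be exp(similarity/τ), a common convention that the formula above leaves abstract:

```python
import math

def build_instructed_query(task_instruction: str, query: str) -> str:
    """Apply the template above: 'Instruction: {task_instruction} Query: {q+}'."""
    return f"Instruction: {task_instruction} Query: {query}"

def last_token_embedding(hidden_states: list) -> list:
    """Use the last-layer hidden state at the final ([EOS]) position as the
    embedding; `hidden_states` stands in for an LLM's per-token vectors."""
    return hidden_states[-1]

def phi(sim: float, temperature: float = 0.05) -> float:
    """Similarity scoring function phi(q, d). The paper leaves phi abstract;
    exp(similarity / temperature) is an assumed, common choice."""
    return math.exp(sim / temperature)

def info_nce_loss(sim_pos: float, sim_negs: list) -> float:
    """-log( phi(q, d+) / (phi(q, d+) + sum_i phi(q, n_i)) ), per the formula above."""
    pos = phi(sim_pos)
    denom = pos + sum(phi(s) for s in sim_negs)
    return -math.log(pos / denom)

q_inst = build_instructed_query(
    "Retrieve relevant passages that answer the question.",  # illustrative instruction
    "ما هي عاصمة المغرب؟",
)
emb = last_token_embedding([[0.1, 0.2], [0.4, 0.1], [0.9, 0.8]])  # toy 3-token output

# A positive that is well separated from the negatives yields a lower loss:
low = info_nce_loss(0.9, [0.1, 0.0])
high = info_nce_loss(0.5, [0.45, 0.4])
```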

3.4 Inclusion of Hard-Negatives

To enhance the model’s performance, it is crucial to use negative documents closely aligned with the query’s context Karpukhin et al. (2020); Khondaker et al. (2022). This approach allows us to observe the impact of introducing more challenging, or "hard", negatives into the training process. We only generate hard negatives for the Arabic subset of our training data from Section 3.1. We found that using 15 hard negatives for Swan-Small yields the best performance, whereas our bigger model, Swan-Large, overfits with a larger number of hard negatives, and 7 hard negatives gives the best performance.

Impact of Hard Negatives. Hard negatives in contrastive learning are examples that closely resemble correct instances but are ultimately incorrect. Their inclusion encourages the model to learn finer-grained distinctions, improving its ability to differentiate between similar but distinct classes. The process involves converting all documents into vector form within the embedding space. These document embeddings are then compared using the cosine similarity score to establish their relevance to the query. Once all documents are scored, they are sorted by their similarity to the query. The top-ranked document is typically the positive example, while the rest are potential negatives. Our experiments assess the impact of varying the number of hard negatives used when training our models, Swan-Large and Swan-Small. We train each model with different quantities of hard negatives, namely values from the set {1, 3, 7, 15, 31} per training instance. Swan-Small achieves its highest performance of 56.25 with 15 hard negatives. The model exhibits a general upward trend as the number of hard negatives increases, peaking at 15 before slightly declining at 31. This pattern suggests that while additional hard negatives initially enhance learning by introducing valuable challenges, excessive complexity may lead to diminishing returns, ultimately hindering further improvement. Swan-Large achieves its peak performance of 60.42 when trained with seven hard negatives, suggesting an optimal balance that enhances learning without overloading the model. Notably, increasing the number of hard negatives beyond this point does not lead to further gains, indicating a threshold where additional complexity ceases to improve learning outcomes.
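The ranking procedure just described (embed, score by cosine similarity, sort, take the top hit as the positive and the next few as hard negatives) can be sketched in a few lines. The embeddings below are toy 2-d vectors and the function names are ours, not from the paper:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mine_hard_negatives(query_vec, doc_vecs, k):
    """Sort documents by similarity to the query; the top-ranked document is
    taken as the positive, and the next k as hard negatives."""
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: cosine(query_vec, doc_vecs[i]),
                    reverse=True)
    return ranked[0], ranked[1:1 + k]

docs = [[1.0, 0.0],   # essentially the query topic -> positive
        [0.9, 0.1],   # very close -> hard negative
        [0.0, 1.0],   # unrelated -> easy negative
        [0.5, 0.5]]   # somewhat related -> hard negative
positive, hard_negatives = mine_hard_negatives([1.0, 0.05], docs, k=2)
```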

Table 3: Impact of number of Hard Negatives (HN).

4 ArabicMTEB Benchmark

| Task | Datasets | Langs | Dialects | Metric |
|---|---|---|---|---|
| RTR | 36 | 1 | 4 | nDCG@10 |
| CRTR | 12 | 7 | 0 | nDCG@10 |
| CLF | 18 | 1 | 6 | AP |
| BTM | 11 | 5 | 8 | F1 |
| RRK | 5 | 2 | 0 | MAP |
| STS | 5 | 1 | 3 | Spearman Corr |
| CLR | 4 | 1 | 0 | v-measure |
| PairCLF | 3 | 1 | 0 | AP |
| **Total** | 94 | 9∗ | 11 | |

Table 4: Overview of our tasks in ArabicMTEB. ∗The total represents the number of unique languages.

In this section, we present ArabicMTEB, a comprehensive Arabic-centric text embedding benchmark designed to evaluate text embeddings across a wide range of tasks and scenarios. ArabicMTEB addresses the limitations of existing benchmarks that either exclude Arabic or lack coverage of diverse Arabic language varieties, dialects, and cultural nuances. Our benchmark includes 94 datasets spanning 8 distinct tasks, as summarized in Table 4. Further details about the datasets used in the benchmark can be found in Appendix B. ArabicMTEB was developed to provide comprehensive coverage of Arabic text embedding capabilities, ensuring the inclusion of MSA and other varieties. It offers diverse task types, such as retrieval, classification, and semantic similarity, to evaluate embeddings holistically across different scenarios. By incorporating novel domain-specific, dialectal, and country-level culturally aware datasets, ArabicMTEB represents a more applicable and realistic assessment of Arabic text embeddings.

4.1 Task Categories

ArabicMTEB categorizes evaluation datasets into the following key task categories, with each type providing a unique perspective on the capabilities of text embeddings. The corresponding metadata for each task, covering the number of datasets, languages, and dialects considered, along with the evaluation metric, is presented in Table 4.

Arabic Text Retrieval. This task uses Arabic queries to retrieve the top-k relevant documents from a large Arabic corpus. ArabicMTEB includes 35 retrieval datasets such as XPDA Shen et al. (2023) and Dolphin’s long-form QA datasets Nagoudi et al. (2023b). Including these datasets helps evaluate complex information retrieval scenarios in Arabic.
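Retrieval and cross-lingual retrieval tasks in ArabicMTEB are scored with nDCG@10 (Table 4). For reference, a minimal pure-Python computation of that metric over a list of relevance labels in ranked order:

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k ranked relevance labels."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    """nDCG@k: DCG of the system ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Ranking the only relevant document first is perfect; ranking it second is penalized.
perfect = ndcg_at_k([1, 0, 0, 0])   # 1.0
demoted = ndcg_at_k([0, 1, 0, 0])   # 1 / log2(3) ≈ 0.63
```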

Bitext Mining. This task identifies sentence-level translations between different languages and dialects. ArabicMTEB includes 12 datasets spanning various language pairs, such as Arabic to French and Arabic to English. This task is crucial for understanding text embeddings’ cross-lingual and dialectal translation capabilities.

Cross-Lingual Retrieval. This task uses Arabic queries to retrieve documents in other languages, such as English, German, Spanish, and Chinese. ArabicMTEB employs the mMARCO Dev set Bonifacio et al. (2021) and includes 11 language pairs.

Re-Ranking. This task reorders candidate documents for a query based on embedding similarity scores. ArabicMTEB features five re-ranking datasets such as MIRACL Zhang et al. (2023), enabling the evaluation of embeddings’ ability to refine search results.

Semantic Textual Similarity (STS). STS measures the correlation between the embeddings of two sentences, assessing their semantic similarity. ArabicMTEB includes five STS datasets like STS17 and STS22 Cer et al. (2017), along with two synthetic datasets generated using GPT-4 (Details of the creation of these datasets can be found in Appendix G).
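STS tasks are scored with the Spearman correlation between model similarity scores and gold human ratings (Table 4). A self-contained sketch of that metric, assuming no tied values to keep the ranking simple:

```python
def rankdata(xs):
    """Assign rank 1..n by ascending value (assumes no ties for this sketch)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    for r, i in enumerate(order):
        ranks[i] = r + 1.0
    return ranks

def spearman(xs, ys):
    """Spearman correlation = Pearson correlation computed on the ranks."""
    rx, ry = rankdata(xs), rankdata(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Model cosine scores vs. gold ratings for four sentence pairs; the ordering
# agrees perfectly, so the correlation is 1.
rho = spearman([0.9, 0.7, 0.3, 0.1], [5.0, 4.0, 2.0, 1.0])
```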

Classification. Classification predicts labels for input texts using text embeddings. ArabicMTEB comprises 18 multi-domain datasets from ORCA Elmadany et al. (2022). This task evaluates models’ ability to categorize Arabic text accurately, making it a valuable benchmark for downstream tasks such as sentiment analysis.

Pair Classification. This task predicts the relationship between two sentences based on their embeddings. ArabicMTEB includes three datasets, such as XNLI Conneau et al. (2018).

Clustering. Clustering groups sentences into clusters based on embedding similarity, evaluating unsupervised learning performance. ArabicMTEB includes four clustering datasets, such as Arabic News Articles and stance detection datasets from Baly et al. (2018).
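Clustering tasks are scored with v-measure (Table 4), the harmonic mean of homogeneity and completeness. A compact pure-Python version, equivalent in spirit to scikit-learn's `v_measure_score`, written out here for clarity:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (natural log) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def conditional_entropy(labels_a, labels_b):
    """H(labels_a | labels_b)."""
    n = len(labels_a)
    joint = Counter(zip(labels_a, labels_b))
    sizes_b = Counter(labels_b)
    return -sum((n_ab / n) * math.log(n_ab / sizes_b[b])
                for (a, b), n_ab in joint.items())

def v_measure(classes, clusters):
    """Harmonic mean of homogeneity and completeness."""
    h_c, h_k = entropy(classes), entropy(clusters)
    homogeneity = 1.0 if h_c == 0 else 1.0 - conditional_entropy(classes, clusters) / h_c
    completeness = 1.0 if h_k == 0 else 1.0 - conditional_entropy(clusters, classes) / h_k
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)

perfect = v_measure([0, 0, 1, 1], [1, 1, 0, 0])  # label permutation: still perfect
mixed = v_measure([0, 0, 1, 1], [0, 1, 0, 1])    # clusters ignore the classes
```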

4.2 Dialectal ArabicMTEB

Dialectal ArabicMTEB is a specialized fork of the original ArabicMTEB, focusing exclusively on Arabic dialectal datasets. This extension addresses the unique challenges posed by the significant variations in Arabic dialects across different regions, which have been underrepresented in NLP research. While research on dialectal Arabic text embedding has been limited, dialectal ArabicMTEB fills this gap by providing a comprehensive collection of 19 datasets specifically curated to evaluate embeddings’ performance on diverse Arabic dialects. These datasets span multiple tasks, offering a robust framework for assessing model performance across various dialectal contexts: (1) Bitext Mining. Eight datasets covering dialects such as Algerian, Egyptian, Jordanian, Lebanese, Moroccan, Saudi, and Yemeni Nagoudi et al. (2022); Bouamor et al. (2014). (2) Retrieval. Five datasets focusing on dialects from Algeria, Egypt, Morocco, and the Gulf regions Nagoudi et al. (2023b). (3) Classification. Five datasets for binary, regional, and country-level dialect identification Abdul-Mageed et al. (2021d, 2024b); Elmadany et al. (2022); Abdul-Mageed et al. (2021b); Ahmed et al. (2024). (4) STS. A novel synthetic dataset for Egyptian text similarity generated using Command-R+.

4.3 Domain-Specific ArabicMTEB

Arabic text retrieval is increasingly used in real-world applications across multiple fields, including the healthcare, finance, and legal sectors. Having specialized evaluation datasets is crucial for building text embeddings tailored to these domains. To meet this need, we introduce domain-specific ArabicMTEB, a specialized fork of the broader ArabicMTEB benchmark. Domain-specific ArabicMTEB focuses on the news, finance, legal, medical, and general knowledge domains, offering a closer approximation to real-world scenarios. To create this benchmark, we collect Arabic documents from these specialized sources and from Arabic Wikipedia. We then segment and chunk the documents into texts of 1,024 tokens each. Subsequently, we randomly select chunks and employ GPT4-Turbo OpenAI et al. (2024) to generate five different styles of queries for each chunk. We filter out duplicate and repeated queries using GPT3.5 OpenAI et al. (2024) to ensure a high-quality evaluation dataset. Our evaluation data creation pipeline is visualized in Figure 3. The resulting benchmark, which we call ArabicMTEB-Lite, contains 10k queries and 100k documents spanning the domains described above.
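The chunking step above can be illustrated as follows; we use whitespace tokens as a stand-in for the paper's (unspecified) tokenizer:

```python
def chunk_tokens(tokens, chunk_size=1024):
    """Split a token sequence into consecutive chunks of at most chunk_size tokens."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]

# A toy 2,500-"token" document yields two full 1,024-token chunks plus a remainder.
tokens = ("كلمة " * 2500).split()
chunks = chunk_tokens(tokens)
```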


Figure 3: Generation pipeline for our domain-specific ArabicMTEB.

4.4 Cultural ArabicMTEB

To show that our models are culturally aware, we introduce Cultural ArabicMTEB, a collection of datasets from 20 different Arab countries focusing on specific cultural aspects such as geography and history. To construct Cultural ArabicMTEB, we use Arabic Wikipedia as our primary data source. For each included Arab country, we extract articles related to that country from its corresponding Wikipedia portal. Each portal covers multiple categories (e.g., geography, economy, history) with subcategories (e.g., local movies, food items). This process resulted in 5K to 55K articles per country. Next, we generate retrieval questions and passages for each country. For this, we use GPT-4o-mini to develop, for each passage (from an article), a corresponding question whose specific answer is available within the passage itself. Following the same methodology, but applied to the Egyptian and Moroccan dialectal versions of Wikipedia, we generate dialectal queries and their corresponding passages using Command-R+. Cultural ArabicMTEB contains 1k queries and an average of 15k documents per country, as described above.

5 Evaluation

Table 5: Overall ArabicMTEB results.

Table 6: Dialectal ArabicMTEB results.

Table 7: Domain-specific ArabicMTEB results.

Table 8: Cultural ArabicMTEB results.

Table 9: The impact of Synthetic Data on Swan performance. ArRTR: Arabic retrieval, DOM-RTR: Domain-specific retrieval, and DIA-RTR: Dialectal Retrieval.

We evaluate the performance of our models, Swan-Small and Swan-Large, across the multiple proposed ArabicMTEB benchmarks and compare them with existing SoTA models, including MARBERT Abdul-Mageed et al. (2020a), ARBERTv2 Elmadany et al. (2022), CamelBERT Inoue et al. (2021), multilingual E5 models Wang et al. (2024b), and Arabic-triplet-Matryoshka-V2 (ATM-V2) Nacar and Koubaa (2024). Our evaluation encompasses the overall ArabicMTEB (Table 5), dialectal ArabicMTEB (Table 6), domain-specific ArabicMTEB (Table 7), and cultural ArabicMTEB (Table 8). In these tables, tasks are referred to as RTR: Retrieval, STS: Semantic Textual Similarity, PairCLF: Pair Classification, CLF: Classification, CLR: Clustering, RRK: Reranking, and BTM: BiText Mining.

ArabicMTEB Results. Table 5 presents the overall results of our models on the ArabicMTEB benchmark. Our models demonstrate top-tier performance across a variety of NLP tasks. Swan-Small achieves an average score of 57.33, surpassing its main competitors, Me5-base (55.29) and Me5-small (55.06), by a significant margin. This model performs exceptionally well in retrieval (58.42), classification (57.34), and pair classification (74.93), outperforming ATM-V2, which only scores 45.24 on average. Similarly, Swan-Large sets a new state-of-the-art performance with an average score of 62.45, beating Me5-large (61.65) and even the massive e5-mistral-7b model (59.00). The model excels particularly in retrieval (65.63), classification (54.89), and bitext mining (71.24), indicating its robustness across both cross-lingual and Arabic-centric tasks. These results validate our training strategy of using diverse training data covering multiple languages, where Swan-Large outperforms its counterparts by more than five points in cross-lingual tasks such as bitext mining.

Dialectal ArabicMTEB Results. Table 6 shows the dialectal ArabicMTEB results. Swan-Small scores an average of 63.41, considerably higher than Me5-small (45.27) and AlcLaM (30.44), showing strong performance across retrieval (63.16) and classification (54.52). Swan-Large achieves an impressive average score of 70.45, leading on all tasks and outperforming the e5-mistral-7b model, which scores 60.81. The standout result is in bitext mining, where Swan-Large achieves 72.10, a substantial improvement of nearly 13 points over AlcLaM (59.38). Our models’ significant advantage in dialectal retrieval and bitext mining stems from their unique training on a combination of synthetic and human-generated dialectal datasets, which is absent in many competitive models.

Domain-Specific ArabicMTEB Results. As seen in Table 7, Swan-Small performs exceptionally well, with an average score of 70.86, surpassing OpenAI's text-embedding-3-small model (68.65) and Cohere-light-v3.0 (67.57). Its best performance is in the legal domain, where it scores 78.86. Swan-Large sets a new standard in domain-specific tasks, scoring 82.49 on average and surpassing OpenAI's text-embedding-3-large (82.20) and Cohere's multilingual model (73.76). The model excels particularly in the news domain (90.42), medical (81.64), and Wikipedia (93.10), indicating its superior generalization across varied Arabic domains. Moreover, the cost-effectiveness of our models is evident: embedding 10k documents with Swan-Large costs only $0.75, compared to $9.88 for OpenAI's model, making it a more efficient solution for large-scale deployments.
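The cost gap follows directly from per-token pricing. The sketch below shows the arithmetic; the assumed corpus size (500 tokens per document) and the per-million-token rates are illustrative values chosen so the outputs match the quoted $0.75 and $9.88 figures, not actual Swan or OpenAI prices.

```python
def embedding_cost(n_docs: int, avg_tokens_per_doc: float,
                   usd_per_million_tokens: float) -> float:
    """Estimated cost (USD) of embedding a corpus under per-token pricing."""
    total_tokens = n_docs * avg_tokens_per_doc
    return round(total_tokens / 1_000_000 * usd_per_million_tokens, 2)

# Illustrative rates only, picked to reproduce the quoted figures;
# they are not actual provider prices.
swan_cost = embedding_cost(10_000, 500, 0.15)
openai_cost = embedding_cost(10_000, 500, 1.976)
print(swan_cost, openai_cost)  # 0.75 9.88
```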

Cultural ArabicMTEB Results. Cultural ArabicMTEB is designed to capture culturally sensitive aspects of the Arabic language, such as regional dialects, local idiomatic expressions, and culturally specific knowledge. We generated queries from country-specific Wikipedia articles, including questions about local cuisine, traditional practices, and historical events, which challenge the models to capture more than just linguistic information. For example, Swan-Large achieved the highest performance on tasks related to Egyptian cultural queries, outperforming other models on retrieval tasks by 1.5%. However, we observed slightly lower performance on Moroccan dialect queries, where cultural nuances (such as regional vocabulary) presented a more significant challenge.

Synthetic Data Analysis. We systematically analyze the impact of synthetic data on the performance of Swan-Small and Swan-Large using different combinations of training datasets. Table 9 presents the results for the base models, models trained with additional human-generated Arabic data, and models enhanced using synthetic subsets such as MSA, domain-specific, and dialectal data. When comparing the initial Swan-Small (average score of 32.46) to its version trained with synthetic MSA data, we observe a substantial increase in average performance to 48.42, an improvement of nearly 16 points. Similarly, Swan-Large benefits from a 6.52-point boost in average performance (from 55.39 to 61.91) with the inclusion of synthetic MSA data.

6 Conclusion

In this paper, we introduced Swan-Small and Swan-Large, along with the comprehensive ArabicMTEB benchmark for evaluating Arabic text embeddings. Our models demonstrate outstanding performance, benefiting from the strategic use of hard negatives and synthetic data in training. The evaluation across multiple benchmarks demonstrates that both Swan-Small and Swan-Large set new standards in Arabic-centric NLP tasks. They outperform existing SoTA models in both cross-lingual and Arabic-specific tasks while being cost-effective and capable of understanding cultural context—making them ideal for real-world applications in diverse Arabic language settings.

7 Limitations

While the development of the Swan models and the introduction of ArabicMTEB mark significant advancements in Arabic text embeddings, several limitations should be considered. For example, although synthetic data significantly enhances model performance, it can introduce biases due to the reliance on specific patterns in the generated content. To mitigate this, we ensured diversity in our synthetic data generation by varying the data sources and generating dialectal data for multiple regions, including Egypt, Morocco, and the Gulf states. We also analyzed our models by examining whether MSA data yielded higher accuracies in retrieval tasks (Table 9). Further, our synthetic data generation pipeline was subjected to human verification for correctness and balance across cultural contexts.

8 Ethical Statement

The societal implications of deploying dialect-aware models, such as Swan, require careful consideration. While these models can bridge gaps in NLP for Arabic-speaking regions, there is a risk of inadvertently reinforcing biases or language hierarchies, particularly in areas where particular dialects are stigmatized or underrepresented. For instance, users in communities with dialects associated with lower socioeconomic status may feel marginalized if their dialect is not adequately supported. To mitigate these concerns, we have prioritized the inclusion of low-resource dialects and ensured that our synthetic data generation pipeline accounts for dialectal diversity. Additionally, future versions of models should include further dialectal balancing, specifically focusing on underrepresented communities.

Importantly, all research and development activities for the Swan models and ArabicMTEB benchmark were conducted with a commitment to ethical standards. Data collection and usage adhered to privacy and confidentiality norms, ensuring no sensitive information was utilized without proper anonymization and consent.

Acknowledgments

We acknowledge support from Canada Research Chairs (CRC), the Natural Sciences and Engineering Research Council of Canada (NSERC; RGPIN-2018-04267), the Social Sciences and Humanities Research Council of Canada (SSHRC; 895-2020-1004; 895-2021-1008), Canadian Foundation for Innovation (CFI; 37771), the Digital Research Alliance of Canada (https://alliancecan.ca), and UBC Advanced Research Computing-Sockeye (https://arc.ubc.ca/ubc-arc-sockeye).

References

Appendix A ArMistral Training

ArMistral is an autoregressive pretrained language model based on Mistral-7B.

Pretraining data. We further pretrain it on a large and diverse Arabic dataset covering all major categories of Arabic, namely Classical Arabic (CA), Dialectal Arabic (DA), and MSA. This data is aggregated from various sources: AraNews v2 Nagoudi et al. (2020), El-Khair El-Khair (2016), Gigaword (LDC catalog), OSCAR Suárez et al. (2019), OSIAN Zeroual et al. (2019), 101 Billion Arabic Words Aloui et al. (2024a), Wikipedia Arabic, and Hindawi Books (OpenITI corpus, v1.6). We also derived ArabicWeb22 (A) and (B) from the open-source ArabicText-2022 data. The pretraining dataset was cleaned, filtered, and deduplicated using Bhatia (2023). We also ensured that the model is pretrained on multiple domains, which enhances its results, as seen in Table 10.

Instruction Finetuning. To enhance the capabilities of ArMistral, we instruction-tune it on three datasets: Alpaca-GPT4, Evol-Instruct, and ShareGPT, extracted from the MultilingualSIFT datasets Chen et al. (2023).

Alignment Dataset. We collected an alignment dataset from the Quora and Mawdoo websites; we took the gold answers as the chosen responses and generated the rejected responses using AceGPT-7B Huang et al. (2024).
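The chosen/rejected construction described above maps naturally onto preference-tuning records. In the sketch below, `generate_rejected` is a stub standing in for AceGPT-7B generation, and the field names (`prompt`, `chosen`, `rejected`) follow common preference-data conventions rather than a confirmed released schema.

```python
def build_preference_pair(question: str, gold_answer: str, generate_rejected):
    """One alignment record: the gold (human) answer is 'chosen',
    and a model generation stands in as 'rejected'.
    Field names are illustrative conventions, not the paper's schema."""
    return {"prompt": question,
            "chosen": gold_answer,
            "rejected": generate_rejected(question)}

# Stub standing in for AceGPT-7B generation (assumption for this sketch).
stub_generator = lambda q: "generic model answer to: " + q

pair = build_preference_pair(
    "What is the capital of Morocco?",
    "Rabat is the capital of Morocco.",
    stub_generator,
)
print(pair["rejected"])
```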

Results. As seen in Table 10, our ArMistral-Chat model outperforms all existing Arabic LLMs.

Table 10: Comparison of ArMistral with other Arabic LLMs.

Appendix B Datasets overview

Table 11 provides a comprehensive summary of the various datasets utilized in the study. It categorizes datasets based on their type, such as Reranking, Bitext Mining, Retrieval, Crosslingual Retrieval, STS, Pair Classification, Clustering, and Classification. Each entry specifies the dataset name, language, citation, and category, reflecting the diversity and scope of data sources for evaluating the model's performance across different tasks and linguistic contexts.

Appendix C PolyDeDupe: A Versatile Cleaning Pipeline

PolyDeDupe is a Python package designed for efficient and effective data deduplication across over 100 languages. It supports both syntactic and semantic deduplication, making it a versatile tool for high-quality data preprocessing in NLP tasks. Key features include customizable Jaccard similarity thresholds, roughly twice the speed of comparable tools such as SlimPajama, and support for deduplicating instruction-tuning data. It can be installed via pip and used to deduplicate datasets, report original and filtered dataset sizes, and identify duplicate clusters. Supported languages span Western, Central, and Eastern European languages, Slavic languages using Cyrillic script, Greek, various Arabic and Devanagari script languages, and more.
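PolyDeDupe's exact API is not reproduced here; the sketch below instead re-implements the core syntactic idea it describes (character n-gram shingles compared at a customizable Jaccard threshold) in plain Python. Character shingles keep the method script-agnostic, which is how a single pipeline can cover Arabic, Cyrillic, and Devanagari scripts alike; this greedy quadratic version is far slower than the optimized package.

```python
def shingles(text: str, n: int = 3) -> set:
    """Character n-gram shingles; script-agnostic, so the same code
    handles Arabic, Cyrillic, Devanagari, etc."""
    text = " ".join(text.split())  # normalize whitespace
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 1.0

def dedupe(docs, threshold: float = 0.8):
    """Greedy near-duplicate filtering at a customizable Jaccard
    threshold (quadratic; PolyDeDupe's actual pipeline is optimized)."""
    kept, kept_shingles = [], []
    for doc in docs:
        s = shingles(doc)
        if all(jaccard(s, t) < threshold for t in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept

docs = ["the cat sat on the mat",
        "the cat sat on the mat!",   # near-duplicate, filtered out
        "a completely different sentence"]
print(dedupe(docs))
```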

| Task | Dataset | Type | Language | Citation | Size |
|---|---|---|---|---|---|
| BitextMining | Darija | S2S | Moroccan Arabic Dialect to English | Nagoudi et al. (2023b) | 2000 |
| BitextMining | Narabizi | S2S | Arabizi to French | Nagoudi et al. (2023b) | 144 |
| BitextMining | Mt_en2ar | S2S | English to MSA | Nagoudi et al. (2023b) | 4000 |
| BitextMining | Mt_fr2ar | S2S | French to MSA | Nagoudi et al. (2023b) | 4000 |
| BitextMining | Mt_es2ar | S2S | Spanish to MSA | Nagoudi et al. (2023b) | 4000 |
| BitextMining | Mt_ru2ar | S2S | Russian to MSA | Nagoudi et al. (2023b) | 4000 |
| BitextMining | Cs_dz_fr | S2S | Algerian Arabic Dialect to French | Nagoudi et al. (2023b) | 200 |
| BitextMining | Cs_eg_en | S2S | Egyptian Arabic Dialect to English | Nagoudi et al. (2023b) | 200 |
| BitextMining | Cs_jo_en | S2S | Jordanian Arabic to English | Nagoudi et al. (2023b) | 200 |
| BitextMining | Cs_ma_fr | S2S | Moroccan Arabic to French | Nagoudi et al. (2023b) | 200 |
| BitextMining | Cs_ps_en | S2S | Palestinian Arabic to English | Nagoudi et al. (2023b) | 200 |
| BitextMining | Cs_ye_en | S2S | Yemeni Arabic to English | Nagoudi et al. (2023b) | 200 |
| Classification | MassiveIntent | S2S | Multilingual (Arabic subset) | FitzGerald et al. (2022) | 100 |
| Classification | MassiveScenario | S2S | Multilingual (Arabic subset) | FitzGerald et al. (2022) | 100 |
| Classification | OrcaSentiment | S2S | Arabic | Elmadany et al. (2022) | 5000 |
| Classification | OrcaDialect_region | S2S | Arabic | Elmadany et al. (2022) | 5000 |
| Classification | OrcaDialect_binary | S2S | Arabic | Elmadany et al. (2022) | 5000 |
| Classification | OrcaDialect_country | S2S | Arabic | Elmadany et al. (2022) | 5000 |
| Classification | OrcaAns_claim | S2S | Arabic | Elmadany et al. (2022) | 5000 |
| Classification | OrcaMachine_generation | S2S | Arabic | Elmadany et al. (2022) | 5000 |
| Classification | OrcaAge | S2S | Arabic | Elmadany et al. (2022) | 5000 |
| Classification | OrcaGender | S2S | Arabic | Elmadany et al. (2022) | 5000 |
| Classification | OrcaAdult | S2S | Arabic | Elmadany et al. (2022) | 5000 |
| Classification | OrcaDangerous | S2S | Arabic | Elmadany et al. (2022) | 5000 |
| Classification | OrcaEmotion | S2S | Arabic | Elmadany et al. (2022) | 5000 |
| Classification | OrcaHate_speech | S2S | Arabic | Elmadany et al. (2022) | 5000 |
| Classification | OrcaOffensive | S2S | Arabic | Elmadany et al. (2022) | 5000 |
| Classification | OrcaIrony | S2S | Arabic | Elmadany et al. (2022) | 5000 |
| Classification | OrcaSarcasm | S2S | Arabic | Elmadany et al. (2022) | 5000 |
| Classification | OrcaAbusive | S2S | Arabic | Elmadany et al. (2022) | 5000 |
| Clustering | Arabic_news | P2P | Arabic | Our Paper | 2500 |
| Clustering | Arabic_topic | S2S | Arabic | Our Paper | 30 |
| Clustering | Arabic_baly_stance | P2P | Arabic | Elmadany et al. (2022) | 1000 |
| Clustering | Arabic_baly_stance | S2S | Arabic | Elmadany et al. (2022) | 100 |
| PairClassification | Arabic_xnli | S2S | Arabic | Our Paper | 538 |
| PairClassification | Arabic_sts | S2S | Arabic | Our Paper | 1256 |
| PairClassification | Arabic_mq2q | S2S | Arabic | Our Paper | 244 |
| Reranking | Miracl_ar | S2P | Multilingual (Arabic subset) | Zhang et al. (2023) | 750 |
| Reranking | Mmarco_arabic | S2P | Arabic | Our Paper | 3000 |
| Reranking | MedicalQA_arabic | S2P | Arabic | Our Paper | 4350 |
| Reranking | Mmarco_en2ar | S2P | English to MSA | Our Paper | 500 |
| Reranking | Mmarco_ar2en | S2P | MSA to English | Our Paper | 500 |
| Retrieval | MultiLongDoc | S2P | Multilingual (Arabic subset) | MDQA | — |
| Retrieval | XPQA | S2S | Multilingual (Arabic subset) | XPQA | — |
| Retrieval | Mintaka | S2S | Multilingual (Arabic subset) | Mintaka | — |
| Retrieval | Lareqa | S2P | Arabic | Nagoudi et al. (2023b) | 220 |
| Retrieval | Dawqs | S2S | Arabic | Nagoudi et al. (2023b) | 318 |
| Retrieval | Exams | S2S | Arabic | Nagoudi et al. (2023b) | 2600 |
| Retrieval | Mkqa | S2S | Arabic | Nagoudi et al. (2023b) | 340 |
| Retrieval | Mlqa | S2S | Arabic | Nagoudi et al. (2023b) | 517 |
| Retrieval | Arcd | S2S | Arabic | Nagoudi et al. (2023b) | 693 |
| Retrieval | Tydiqa | S2S | Arabic | Nagoudi et al. (2023b) | 5700 |
| Retrieval | Xsquad | S2S | Arabic | Nagoudi et al. (2023b) | 5700 |
| Retrieval | Crosslingual_ar2de | S2P | MSA to German | Our Paper | 1831 |
| Retrieval | Crosslingual_ar2en | S2P | MSA to English | Our Paper | 1831 |
| Retrieval | Crosslingual_ar2es | S2P | MSA to Spanish | Our Paper | 1831 |
| Retrieval | Crosslingual_ar2hi | S2P | MSA to Hindi | Our Paper | 1831 |
| Retrieval | Crosslingual_ar2vi | S2P | MSA to Vietnamese | Our Paper | 1831 |
| Retrieval | Crosslingual_ar2zh | S2P | MSA to Chinese | Our Paper | 1831 |
| Retrieval | Crosslingual_de2ar | S2P | German to MSA | Our Paper | 1831 |
| Retrieval | Crosslingual_en2ar | S2P | English to MSA | Our Paper | 1831 |
| Retrieval | Crosslingual_es2ar | S2P | Spanish to MSA | Our Paper | 1831 |
| Retrieval | Crosslingual_hi2ar | S2P | Hindi to MSA | Our Paper | 1831 |
| Retrieval | Crosslingual_vi2ar | S2P | Vietnamese to MSA | Our Paper | 1831 |
| Retrieval | Crosslingual_zh2ar | S2P | Chinese to MSA | Our Paper | 1912 |
| Retrieval | MoroccoCultural | S2P | Arabic | Our Paper | 100 |
| Retrieval | SyriaCultural | S2P | Arabic | Our Paper | 100 |
| Retrieval | LibyaCultural | S2P | Arabic | Our Paper | 100 |
| Retrieval | LebanonCultural | S2P | Arabic | Our Paper | 100 |
| Retrieval | QatarCultural | S2P | Arabic | Our Paper | 100 |
| Retrieval | SudanCultural | S2P | Arabic | Our Paper | 100 |
| Retrieval | AlgeriaCultural | S2P | Arabic | Our Paper | 100 |
| Retrieval | MauritaniaCultural | S2P | Arabic | Our Paper | 100 |
| Retrieval | TunisiaCultural | S2P | Arabic | Our Paper | 100 |
| Retrieval | IraqCultural | S2P | Arabic | Our Paper | 100 |
| Retrieval | EgyptCultural | S2P | Arabic | Our Paper | 100 |
| Retrieval | SomaliaCultural | S2P | Arabic | Our Paper | 100 |
| Retrieval | UAE_Cultural | S2P | Arabic | Our Paper | 100 |
| Retrieval | OmanCultural | S2P | Arabic | Our Paper | 100 |
| Retrieval | KuwaitCultural | S2P | Arabic | Our Paper | 100 |
| Retrieval | BahrainCultural | S2P | Arabic | Our Paper | 100 |
| Retrieval | Saudi_ArabiaCultural | S2P | Arabic | Our Paper | 100 |
| Retrieval | JordanCultural | S2P | Arabic | Our Paper | 100 |
| Retrieval | PalestineCultural | S2P | Arabic | Our Paper | 100 |
| Retrieval | YemenCultural | S2P | Arabic | Our Paper | 100 |
| Retrieval | MoroccoDIA | S2P | Moroccan Arabic Dialect | Our Paper | 100 |
| Retrieval | EgyptDIA | S2P | Egyptian Arabic Dialect | Our Paper | 100 |
| Retrieval | NewsDomainSpecific | S2P | Arabic | Our Paper | 1000 |
| Retrieval | LegalDomainSpecific | S2P | Arabic | Our Paper | 1000 |
| Retrieval | MedicalDomainSpecific | S2P | Arabic | Our Paper | 1000 |
| Retrieval | FinanceDomainSpecific | S2P | Arabic | Our Paper | 1000 |
| Retrieval | WikipediaDomainSpecific | S2P | Arabic | Our Paper | 1000 |
| STS | STS17 | S2S | Arabic | Cer et al. (2017) | 8060 |
| STS | STS22 | P2P | Arabic | Semenov et al. (2023) | 500 |
| STS | Arabic_sts | S2S | Arabic | Our Paper | 750 |
| STS | Arabic_stsb_multi_dialect | S2S | Arabic Dialectal | Our Paper | 1500 |
| STS | Arabic_sts | P2P | Arabic | Our Paper | 500 |

Table 11: Overview of ArabicMTEB datasets. S2S: Sentence to Sentence. S2P: Sentence to Paragraph. P2P: Paragraph to Paragraph.

Appendix D Prompts for evaluation

Table 12 provides an overview of the prompts used for evaluating various tasks. It includes instructions for Reranking, Bitext Mining, Retrieval, Crosslingual Retrieval, Semantic Textual Similarity (STS), Pair Classification, Clustering, and Classification. Each entry outlines the specific task and the corresponding instruction used to guide the model's evaluation process.

Table 12: Prompts used for evaluation.
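For instruction-tuned embedders, such task prompts are typically prepended to the query at encoding time. The sketch below shows the common "Instruct: ... Query: ..." convention popularized by E5-Mistral-style models; the instruction text in the example is a hypothetical illustration, not copied from Table 12.

```python
def with_instruction(instruction: str, query: str) -> str:
    """Common query format for instruction-tuned embedding models
    (a widespread convention; the actual ArabicMTEB prompts are in Table 12)."""
    return f"Instruct: {instruction}\nQuery: {query}"

# Hypothetical instruction, not copied from Table 12.
q = with_instruction(
    "Given a question in Arabic, retrieve Wikipedia passages that answer it",
    "ما هي عاصمة المغرب؟",
)
print(q)
```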

Appendix E Full Leaderboard

Table 13 presents the performance comparison of various models on different tasks within the ArabicMTEB benchmark. It includes metrics for Retrieval, Semantic Textual Similarity (STS), Pair Classification (PairCLF), Classification (CLF), Re-ranking, Clustering, and Bitext Mining (BTM). The table lists each model, its dimensionality, and the scores for each task, along with an overall average score. The results highlight the strengths and weaknesses of each model across a range of tasks, providing a comprehensive overview of their performance.

Table 13: ArabicMTEB Results.

Appendix F Inference Latency


Figure 4: Latency vs Performance.

Inference latency is critical when deploying machine learning models, especially in real-time applications where response time matters. It refers to the time a model takes to produce a prediction for a given input. For text embedding models such as Swan-Small and Swan-Large, lower latency is particularly valuable for user-facing services that rely on fast processing of natural language input, such as chatbots and search engines. From Figure 4, which compares the models from Table 5, we find that Swan-Large, despite its larger size (indicated by a larger bubble), achieves optimized inference times due to architectural efficiencies, while Swan-Small strikes a strong balance between size, performance, and latency.
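Latency figures like those in Figure 4 can be gathered with a simple timing loop. In this sketch, `embed` is a stub standing in for an embedding model; the warm-up runs and median aggregation are standard benchmarking practice, not a description of the paper's exact measurement protocol.

```python
import statistics
import time

def measure_latency_ms(embed, texts, warmup: int = 2, runs: int = 10) -> float:
    """Median per-batch latency in milliseconds for any callable that
    maps a list of texts to embeddings."""
    for _ in range(warmup):      # discard warm-up runs (caches, lazy init)
        embed(texts)
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        embed(texts)
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)

# Stub embedder standing in for a real model such as Swan-Small.
stub_embed = lambda texts: [[0.0] * 8 for _ in texts]
latency_ms = measure_latency_ms(stub_embed, ["مرحبا"] * 32)
print(latency_ms >= 0.0)
```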

Appendix G STS Dataset Creation

The Arabic Semantic Textual Similarity (Arabic-STS) dataset was developed to facilitate research in semantic similarity for the Arabic language. The dataset is derived from the Arabic Billion Words Aloui et al. (2024b) corpus, which serves as a foundation for extracting a diverse collection of sentence pairs. Each pair is annotated with a similarity score that captures the degree of semantic equivalence between the sentences. The dataset generation process was guided by the capabilities of OpenAI's GPT-4 model, ensuring that the resulting sentence pairs are of high quality and reflect nuanced linguistic characteristics. The creation involved several steps: selecting representative sentences from the corpus, generating semantically varied sentence pairs, and annotating similarity scores using both automated methods and human reviewers to maintain consistency and reliability.
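A lightweight consistency check between automated (GPT-4) and human similarity scores might look like the sketch below. The field names and the 0-5 score scale follow common STS conventions and are assumptions here; the released dataset's actual schema and filtering rules may differ.

```python
def validate_sts_record(rec: dict, max_gap: float = 1.0) -> bool:
    """Keep a sentence pair only if both scores sit on the 0-5 STS
    scale and the automated and human scores agree within `max_gap`.
    Field names are illustrative, not the released schema."""
    auto, human = rec["score_auto"], rec["score_human"]
    on_scale = all(0.0 <= s <= 5.0 for s in (auto, human))
    return on_scale and abs(auto - human) <= max_gap

good = {"sentence1": "...", "sentence2": "...",
        "score_auto": 4.2, "score_human": 4.0}
bad = {"sentence1": "...", "sentence2": "...",
       "score_auto": 4.8, "score_human": 1.0}
print(validate_sts_record(good), validate_sts_record(bad))  # True False
```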

Appendix H Country level Cultural Evaluation

Table 14: Country-level cultural evaluation.
