Title: Enhancing Text Editing for Grammatical Error Correction: Arabic as a Case Study

URL Source: https://arxiv.org/html/2503.00985

Markdown Content:
Bashar Alhafni† and Nizar Habash

Computational Approaches to Modeling Language Lab

New York University Abu Dhabi

†Mohamed bin Zayed University of Artificial Intelligence

{alhafni,nizar.habash}@nyu.edu

1 Introduction

Grammatical Error Correction (GEC) is a well-studied problem, particularly in English, with numerous datasets and shared tasks Ng et al. (2013, 2014); Bryant et al. (2019). GEC has applications in both writing assistance for native speakers (L1) and language learning for second-language (L2) learners. While neural machine translation (NMT) approaches have long dominated GEC and continue to achieve strong results when trained on large amounts of data Stahlberg and Kumar (2024); Bryant et al. (2023), they are not inherently the most efficient. Unlike MT, where input and output sequences differ significantly, GEC typically involves minimal changes, with most input tokens copied to the output. Employing full-sequence autoregressive models in such cases can be computationally wasteful Stahlberg and Kumar (2020).

A highly efficient and competitive alternative to sequence-to-sequence (Seq2Seq) models is text editing, which frames GEC as a sequence tagging problem. Instead of generating text autoregressively, text editing models assign edit labels to input tokens, leading to more efficient and interpretable corrections. However, most popular text editing approaches require effort to design language-specific edit tag sets Awasthi et al. (2019); Omelianchuk et al. (2020); Mesham et al. (2023). This limits their adaptability for morphologically rich languages like Arabic Kwon et al. (2023), where the space of possible edits is large.

Inspired by recent advancements in text editing Awasthi et al. (2019); Malmi et al. (2019); Omelianchuk et al. (2020); Straka et al. (2021); Mesham et al. (2023), we introduce a novel text editing approach that eliminates the need for language-specific edits. Instead, our method derives edit tags directly from data, making it more adaptable and scalable across different linguistic settings. We demonstrate the effectiveness of our approach on Arabic GEC. Our contributions are as follows:

    1. We introduce the first successful application of text editing to Arabic GEC and study the effect of edit representation on the task.
    2. We achieve SOTA results on two Arabic GEC benchmarks and perform on par with SOTA on two others.
    3. Our models are over six times faster than existing Arabic GEC systems, making them more practical for real-world applications.
    4. We show through ensembling experiments how different models complement each other, leading to significant performance gains.

2 Background and Related Work

2.1 Grammatical Error Correction

GEC has been approached using a variety of methods, with Transformer-based systems being the most popular Bryant et al. (2023). The use of Transformer-based architectures in GEC began by framing the task as a neural machine translation (NMT) problem Junczys-Dowmunt et al. (2018); Yuan et al. (2019); Zhao et al. (2019); Grundkiewicz et al. (2019); Katsumata and Komachi (2020); Kaneko et al. (2020); Wan et al. (2020); Yuan et al. (2021); Yuan and Bryant (2021); Stahlberg and Kumar (2021); Rothe et al. (2021); Zhou et al. (2023); Luhtaru et al. (2024). To improve efficiency and interpretability, text editing models have emerged as an alternative to Seq2Seq approaches Awasthi et al. (2019); Malmi et al. (2019); Stahlberg and Kumar (2020); Mallinson et al. (2020); Omelianchuk et al. (2020); Straka et al. (2021); Mallinson et al. (2022); Tarnavskyi et al. (2022); Mesham et al. (2023); Zhang et al. (2023). Unlike Seq2Seq models, which generate corrected text from scratch, text editing models treat GEC as a sequence tagging task, producing a set of edit operations that modify the erroneous input. Our work follows this text editing paradigm. LLMs have also been evaluated on GEC Fang et al. (2023); Coyne et al. (2023); Wu et al. (2023); Loem et al. (2023); Raheja et al. (2023); Kaneko and Okazaki (2023); Raheja et al. (2024); Davis et al. (2024); Katinskaia and Yangarber (2024); Omelianchuk et al. (2024); Mita et al. (2024); Kaneko and Okazaki (2024). However, despite their strong generalization capabilities, they remain less effective than Seq2Seq and text editing models.

2.2 Arabic Grammatical Error Correction

Arabic exhibits a diglossic Ferguson (1959) linguistic nature where a non-standard variety, Dialectal Arabic (DA), coexists with Modern Standard Arabic (MSA), the standard form of the language.

MSA GEC

The first major efforts on MSA GEC were initiated by the Qatar Arabic Language Bank (QALB) project Zaghouani et al. (2014, 2015), which organized the QALB-2014 Mohit et al. (2014) and QALB-2015 Rozovskaya et al. (2015) shared tasks. More recently, Habash and Palfreyman (2022) introduced the ZAEBUC corpus, a dataset of essays written by native Arabic-speaking university students. Approaches to MSA GEC have included feature-based classifiers Rozovskaya et al. (2014); Farra et al. (2014); Bougares and Bouamor (2015); Nawar (2015) and NMT-based systems Watson et al. (2018); Solyman et al. (2021, 2022, 2023). LLMs have also been evaluated for MSA GEC Kwon et al. (2023); Alhafni et al. (2023); Magdy et al. (2024), but attempts to adapt text editing models have been largely ineffective. The current SOTA was established by Alhafni et al. (2023), who incorporated contextualized morphological preprocessing and grammatical error detection (GED) features into Seq2Seq models, achieving SOTA results on the QALB-2014, QALB-2015, and ZAEBUC datasets.

DA GEC

Dialectal Arabic (DA) comprises multiple regional varieties that differ from MSA and each other in phonology, morphology, and lexicon. While primarily spoken, DA lacks standardized orthography, though its written use has grown on social media, where it appears in varied and noisy forms. To address this, Habash et al. (2012a, 2018) introduced the Conventional Orthography for Dialectal Arabic (CODA), a standardized spelling convention for DA. CODA has since been used to develop multiple DA datasets Habash et al. (2012b); Eskander et al. (2013); Maamouri et al. (2014); Diab et al. (2014); Pasha et al. (2014); Jarrar et al. (2016); Khalifa et al. (2018). Building on this work, Eryani et al. (2020) created the MADAR CODA Corpus, which consists of parallel sentences in CODA and their original raw form for five Arabic city dialects. CODAfication–the process of normalizing DA into CODA–has been addressed using feature-based methods Eskander et al. (2013) and morphological disambiguation models Pasha et al. (2014); Zalmout et al. (2018); Khalifa et al. (2020); Zalmout and Habash (2020); Obeid et al. (2022). More recently, Alhafni et al. (2024) framed CODAfication as a DA GEC problem, benchmarking pretrained Arabic Seq2Seq models on the MADAR CODA corpus and demonstrating that incorporating dialect identification improves performance. In this work, we propose a generalizable and efficient text editing approach and evaluate its effectiveness on both MSA and DA GEC. For MSA GEC, we benchmark our models against Alhafni et al. (2023) on QALB-2014, QALB-2015, and ZAEBUC. For DA GEC, we build on Alhafni et al. (2024) by framing CODAfication as a DA GEC problem, evaluating our approach on the MADAR CODA corpus and comparing it to their results.

Figure 1: An example showing the different edit representations: words, words (compressed), subwords, and subwords (compressed). The edit operations are keep (K/K*), delete (D/D*), merge before (M), replace (R_[c]), insert (I_[c]), and append (A_[c]). Solid lines indicate word alignments between the corrected and erroneous sentences, while dotted lines denote erroneous subword boundaries. The sentence in the figure can be translated as “Health, especially mental health, must be taken care of”.

3 Approach

We adopt a text editing approach to GEC and frame the task as a sequence tagging problem. Formally, given an input erroneous sequence $x = x_1, x_2, \ldots, x_n$, the goal is to assign a sequence of edit operations $e = e_1, e_2, \ldots, e_n$; $e_i \in E$, where $E$ is the edit vocabulary, such that applying edit $e_i$ on the input token $x_i$ at each position $i$ would result in the corrected sequence $y = y_1, y_2, \ldots, y_m$. In the next two sections, we describe how we extract the edits and the edit representations we use to build our edit-based taggers.

3.1 Edit Extraction

We begin by aligning erroneous and corrected sentence pairs at the word level using a weighted Levenshtein edit distance Levenshtein (1966), which represents the minimum number of insertions, deletions, and replacements required to correct the erroneous sentence, with each edit affecting a single word. However, some errors span multiple words. To capture multi-word edits, we follow the approach of Alhafni et al. (2023) by extending the alignment process with an iterative algorithm that greedily merges or splits adjacent words, minimizing the overall cumulative edit distance. After obtaining the word-level alignment, we apply the algorithm again, this time to each aligned word pair rather than the entire sentence, to determine character-level alignments. This process identifies the minimal character edits in terms of keep (K), delete (D), merge before (M), insert (I_[c]), and replace (R_[c]) that are needed to transform each erroneous word into its correction, where the inserted or replaced character (c) is explicitly specified.

Figure 1 presents an example of an aligned erroneous-corrected sentence pair along with the corresponding edits. For instance, in row b, the erroneous word الإهتمام AlǍhtmAm (word 1; transliterated in the HSB scheme of Habash et al. (2007)) requires the edit KKR_[ا]KKKKK (row c), which consists of eight character edits (one replacement and seven keeps), to produce its corrected form الاهتمام AlAhtmAm. Similarly, لصحه lSHh (row b, word 3) must be merged with the word before it, in addition to one insertion and one replacement (MI_[ا]KKKR_[ة], row c).

In some cases, corrections require the insertion of entirely new characters, forming additional words in the erroneous input. Since we frame the task as a sequence tagging problem, we represent these insertions as appends (A_[c]) to existing edits rather than introducing standalone edits. This ensures that all edits, including word insertions, remain within the tagging framework. For example, to insert a period at the end of the erroneous sentence in Figure 1, we append the tag (A_[.]) to the edit of the final word (row c, word 7).
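The character-level step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it uses an unweighted Levenshtein distance instead of the weighted one, it covers only K, D, I_[c], and R_[c] (the word-level M and A_[c] tags are out of scope), and the function names `char_edits` and `apply_edits` as well as the tie-breaking order in the backtrace are assumptions made for illustration. ASCII stand-ins replace the Arabic strings.

```python
def char_edits(src: str, tgt: str) -> list[str]:
    """Align src to tgt with plain Levenshtein and return K/D/I_[c]/R_[c] ops."""
    n, m = len(src), len(tgt)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i
    for j in range(1, m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dp[i - 1][j - 1] + (src[i - 1] != tgt[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)

    # Backtrace from the end; prefer keep/replace, then delete, then insert
    # (this preference order is an assumption, not taken from the paper).
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (src[i - 1] != tgt[j - 1]):
            ops.append("K" if src[i - 1] == tgt[j - 1] else f"R_[{tgt[j - 1]}]")
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            ops.append("D")
            i -= 1
        else:
            ops.append(f"I_[{tgt[j - 1]}]")
            j -= 1
    return ops[::-1]


def apply_edits(src: str, ops: list[str]) -> str:
    """Replay the ops on src; inserts emit a character without consuming one."""
    out, i = [], 0
    for op in ops:
        if op == "K":
            out.append(src[i])
            i += 1
        elif op == "D":
            i += 1
        elif op.startswith("R_["):
            out.append(op[3:-1])
            i += 1
        elif op.startswith("I_["):
            out.append(op[3:-1])
    return "".join(out)


# ASCII stand-ins for the two words discussed in the text.
print("".join(char_edits("AlIhtmAm", "AlAhtmAm")))  # one replacement, seven keeps
print("".join(char_edits("lSHh", "AlSHp")))         # one insertion, one replacement
```

Replaying the extracted operations with `apply_edits` gives a quick round-trip check that an edit sequence reproduces the corrected word.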

3.2 Edit Representation

The edit representation directly influences the size of the edit vocabulary ($|E|$), creating an important trade-off: a larger vocabulary offers more precise corrections but increases model complexity, whereas a smaller vocabulary enhances learning efficiency at the cost of expressiveness. Controlling $|E|$ is crucial to avoid the explosion of possible edits, which is particularly important when working with morphologically rich languages like Arabic. We explore four methods for controlling $|E|$ while maintaining sufficient coverage.

Edit Compression

Once we obtain character-level edits for each word, we compress them into a more compact representation. The motivation behind this transformation is that while different words may undergo the same type of correction, their character-level edits can differ due to variations in word length. For example, in row b of Figure 1, both words 0 and 2 share a keep edit, yet they receive different edit labels because of their length differences (row c). To address this, we introduce a generalized notation for common edit patterns. Consecutive keep (K) and delete (D) operations are represented as K* and D*, respectively. Similarly, consecutive insertions and appends are merged into a single operation, represented as I_[c*] for insertions and A_[c*] for appends, indicating the insertion or appending of multiple characters. Since there are multiple ways to compress an edit sequence, we select the optimal strategy based on the frequency distribution of edit patterns in the training data. This approach ensures that the most common transformations are encoded in a way that balances expressiveness with efficiency, resulting in a more structured and learnable edit representation.
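A minimal sketch of the run-collapsing part of this compression follows. It assumes edits are given as a list of string tags and applies one fixed strategy: every run of K or D becomes K* or D*, and adjacent I_[...] or A_[...] tags are merged. The frequency-based choice among alternative compressions mentioned above is not reproduced, and the function name `compress` is hypothetical.

```python
def compress(ops: list[str]) -> list[str]:
    """Collapse runs of K/D into K*/D* and merge adjacent inserts or appends."""
    out: list[str] = []
    for op in ops:
        if op in ("K", "D"):
            star = op + "*"
            if not out or out[-1] != star:
                out.append(star)
        elif op.startswith(("I_[", "A_[")) and out and out[-1].startswith(op[:3]):
            # Extend the previous tag of the same kind: I_[a] + I_[b] -> I_[ab]
            out[-1] = out[-1][:-1] + op[3:-1] + "]"
        else:
            out.append(op)
    return out


print(compress(["K", "K", "R_[A]", "K", "K", "K", "K", "K"]))
print(compress(["I_[a]", "I_[b]", "K", "D", "D"]))
```

Under this scheme, two all-keep words of different lengths both map to the single tag K*, which is the effect the paragraph above describes.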

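| Input | Comp. | Subset | Prune | Edits | OOV% | $F_{0.5}$ |
|---|---|---|---|---|---|---|
| Word | ✗ | All | - | 16,221 | 1.00% | 98.4 |
| Subword | ✗ | All | - | 9,060 | 0.36% | 98.7 |
| Word | ✓ | All | - | 10,410 | 1.00% | 98.4 |
| Subword | ✓ | All | - | 6,170 | 0.36% | 98.7 |
| Subword | ✓ | NoPnx | - | 4,799 | 0.27% | 98.8 |
| Subword | ✓ | Pnx | - | 160 | 0.01% | 99.4 |
| Subword | ✓ | All | 10 | 683 | 0.75% | 98.1 |
| Subword | ✓ | All | 20 | 442 | 1.02% | 97.7 |
| Subword | ✓ | All | 30 | 329 | 1.24% | 97.4 |
| Subword | ✓ | NoPnx | 10 | 520 | 0.56% | 98.2 |
| Subword | ✓ | NoPnx | 20 | 335 | 0.75% | 97.8 |
| Subword | ✓ | NoPnx | 30 | 250 | 0.92% | 97.5 |
| Subword | ✓ | Pnx | 10 | 48 | 0.02% | 99.4 |
| Subword | ✓ | Pnx | 20 | 35 | 0.05% | 99.4 |
| Subword | ✓ | Pnx | 30 | 29 | 0.05% | 99.3 |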

Table 1: Edit statistics on QALB-2014. Input is the input unit (word or subword). Comp. indicates whether the edit is compressed. Subset specifies whether the edits capture all errors, punctuation-only errors (Pnx), or non-punctuation errors (NoPnx). Edits represents the total number of unique edits in the training set. OOV% is the percentage of out-of-vocabulary edits (non-unique) in the Dev set of QALB-2014.

Input Unit

Since Transformer-based models operate at the subword level, we project character-level edits onto subwords while maintaining their boundaries to ensure proper alignment. This not only ensures consistency with the model’s input representation but also helps reduce the edit vocabulary size. Our approach is inspired by the method of Straka et al. (2021), but it differs in several key aspects: (1) Straka et al. (2021) tokenize the erroneous and corrected sentence pairs before aligning them to extract the edits at the subword level. In contrast, our method extracts edits at the word level and then projects them onto subwords; (2) They limit the number of character-level edits per subword edit, while our approach imposes no such restrictions, allowing for broader coverage. Figure 1 presents the subword-level edits in both their uncompressed (row f) and compressed (row g) forms. In the uncompressed subword-level edits, we observe that two subwords (3b and 6b in row e), which belong to different words, share the same edit (R_[ة]). In the compressed representation, we notice that several subwords (such as 0, 1b, 2, 6a, and 7a) end up sharing the same edit (K*).
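The projection of word-level character edits onto subwords can be sketched as below. This is an illustration under stated assumptions, not the paper's code: subwords are assumed to be WordPiece-style pieces with a `##` continuation prefix, operations that consume no source character (M, I_[c], A_[c]) are attached to the subword containing the next consumed character, and any trailing ones go to the last subword. The function name `project_to_subwords` is hypothetical.

```python
def project_to_subwords(ops: list[str], pieces: list[str]) -> list[list[str]]:
    """Split one word's character edits along its subword boundaries."""

    def consumes(op: str) -> bool:
        # K, D and R_[c] each consume exactly one character of the erroneous word.
        return op in ("K", "D") or op.startswith("R_[")

    out, k = [], 0
    for piece in pieces:
        text = piece[2:] if piece.startswith("##") else piece
        remaining, chunk = len(text), []
        while remaining > 0:
            op = ops[k]
            k += 1
            chunk.append(op)
            if consumes(op):
                remaining -= 1
        out.append(chunk)
    out[-1].extend(ops[k:])  # trailing inserts/appends stay with the last subword
    return out


# ASCII stand-in for the merged word discussed in Section 3.1.
print(project_to_subwords(["M", "I_[A]", "K", "K", "K", "R_[p]"], ["lSH", "##h"]))
```

Because the final subword of this example receives only R_[p], it can share a tag with subwords of other words that need the same single replacement, which is how subword-level edits shrink the vocabulary.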

Edit Segregation

Both the MSA GEC datasets we report on, QALB-2014 and ZAEBUC, exhibit high frequencies of punctuation errors, with punctuation accounting for 40% of the errors in QALB-2014 and 15% in ZAEBUC training sets Alhafni et al. (2023). To reduce the number of edits that the MSA GEC models must learn, we segregate punctuation edits from non-punctuation edits. This results in two versions of the data: one where only non-punctuation errors are tagged, and another where all non-punctuation errors are corrected, leaving only punctuation errors for the model to focus on. Note that this separation is applied only to the MSA GEC datasets we report on, and not to the DA GEC dataset. Additionally, this approach requires training two systems to be applied sequentially during inference: the first system fixes non-punctuation errors, while the second system addresses only punctuation errors.

Edit Pruning

Morphologically rich languages, in particular, tend to have many infrequent edits in GEC datasets. To improve the model’s learning ability, we analyze the distribution of edits in the training data and prune those that occur less frequently than a threshold $T$, replacing them with the “keep” edit. This pruning is applied exclusively during training, enabling the model to focus on frequent and informative edits.
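Frequency-based pruning as described above can be sketched in a few lines. The sketch assumes the training data is a list of tag sequences and that the keep edit is written K* (the compressed form); the function name `prune_edits` and the toy data are hypothetical.

```python
from collections import Counter


def prune_edits(tagged: list[list[str]], threshold: int, keep: str = "K*") -> list[list[str]]:
    """Replace every edit tag seen fewer than `threshold` times with the keep tag."""
    counts = Counter(tag for sent in tagged for tag in sent)
    return [[tag if counts[tag] >= threshold else keep for tag in sent] for sent in tagged]


train = [
    ["K*", "R_[A]K*"],
    ["K*", "K*R_[p]"],   # K*R_[p] occurs once and is pruned at threshold 2
    ["R_[A]K*", "K*"],
]
print(prune_edits(train, threshold=2))
```

The pruned tags become plain keeps in the training labels only, so a pruned error is simply left uncorrected rather than mapped to a different edit.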

3.3 Edit Coverage

Table 1 presents edit statistics for QALB-2014, illustrating the impact of our strategies to reduce the edit vocabulary size $|E|$ on edit coverage and upper-bound (oracle) performance on the development (Dev) set. Edit coverage is reported as the OOV rate, i.e., the percentage of (non-unique) Dev-set edits that are absent from the training edit vocabulary, while oracle performance is evaluated using the MaxMatch (M$^2$) scorer Dahlmeier and Ng (2012) $F_{0.5}$ (§4.2). We use AraBERTv02 Antoun et al. (2020) for subword tokenization, as it yielded the best results among our tested models (more details in §5). Switching from word-level to subword-level edits reduces unique training edits by 44% (16,221 to 9,060) and lowers the Dev set OOV rate from 1.00% to 0.36%, yielding a 0.3-point $F_{0.5}$ gain. Edit compression further reduces unique edits while preserving OOV% and oracle performance. Segregating punctuation (Pnx) from non-punctuation (NoPnx) edits reduces the combined number of unique training edits from 6,170 to 4,959 (4,799 + 160). However, NoPnx results are not directly comparable, as punctuation is explicitly removed before the evaluation. Pnx $F_{0.5}$ scores are higher as they are evaluated on a Dev set with non-punctuation errors already corrected, making the test easier. To assess the impact of pruning, we apply frequency thresholds of 10, 20, and 30 to remove low-frequency edits. As expected, pruning reduces the number of unique training edits and increases the OOV% in the Dev set, yet $F_{0.5}$ remains largely unaffected. This suggests that the majority of the 6,170 compressed subword edits occur infrequently and contribute little to the model’s upper-bound performance. A similar trend is observed for both Pnx and NoPnx edits, reinforcing the idea that many low-frequency edits can be pruned without degrading oracle performance. We present the same analysis on all datasets in Appendix B, Table 9.
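The OOV% column of Table 1 is, per its caption, a token-level (non-unique) rate over Dev edits. A minimal sketch of that computation follows, assuming both inputs are flat lists of edit tags; the function name `oov_rate` and the toy data are hypothetical.

```python
def oov_rate(train_edits: list[str], dev_edits: list[str]) -> float:
    """Percentage of Dev edit tokens whose tag never occurs in training."""
    vocab = set(train_edits)
    unseen = sum(1 for edit in dev_edits if edit not in vocab)
    return 100.0 * unseen / len(dev_edits)


train = ["K*", "K*", "R_[A]K*"]
dev = ["K*", "K*", "K*R_[p]", "R_[A]K*"]
print(oov_rate(train, dev))  # one of four Dev edits is unseen
```

Counting tokens rather than unique tags explains why pruning hundreds of rare tags moves the OOV rate by only fractions of a percentage point.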

4 Experimental Setup

4.1 Data

MSA GEC

We report on three publicly available MSA GEC datasets. The first is the QALB-2014 shared task dataset Mohit et al. (2014), followed by the native (L1) test set from the QALB-2015 shared task Rozovskaya et al. (2015). The third dataset is ZAEBUC Habash and Palfreyman (2022). QALB-2014 and the L1 test set of QALB-2015 contain comments by native speakers from the Aljazeera news website, whereas ZAEBUC consists of essays written by native university students. We use the publicly available splits for QALB-2014 and QALB-2015, while for ZAEBUC, we use the splits created by Alhafni et al. (2023).

DA GEC

We use the MADAR CODA corpus Eryani et al. (2020), a set of 10,000 sentences from five Arabic city dialects (Beirut, Cairo, Doha, Rabat, and Tunis) written in the CODA standard in parallel with their original raw form. The sentences come from the Multi-Arabic Dialect Applications and Resources (MADAR) Project Bouamor et al. (2018) and are in parallel across the cities (2,000 sentences per city). We use the publicly available splits created by Alhafni et al. (2024). Table 2 summarizes the dataset statistics.

4.2 Evaluation

We use the MaxMatch (M$^2$) scorer Dahlmeier and Ng (2012), which evaluates GEC systems by comparing hypothesis edits with reference edits, calculating precision (P), recall (R), $F_1$, and $F_{0.5}$ scores. $F_{0.5}$ weighs precision twice as much as recall, to prioritize the accuracy of edits relative to all edits made by the system.
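For reference, the standard $F_\beta$ definition behind these scores is $F_\beta = (1+\beta^2)\,P\,R / (\beta^2 P + R)$, with $\beta = 0.5$ for $F_{0.5}$. The sketch below evaluates it directly from precision and recall. The function name `f_beta` is hypothetical, and the M$^2$ scorer itself derives P and R from edit counts, so values recomputed from rounded P and R may differ slightly from reported scores.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Standard F-beta score; beta < 1 favours precision over recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# 4-Ensemble on the QALB-2014 Dev set (Table 3): P = 89.1, R = 61.6
print(round(f_beta(89.1, 61.6), 1))
```

With $\beta = 0.5$, a system that trades recall for precision, as the ensembles do, can raise $F_{0.5}$ even while $F_1$ drops.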

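| Dataset | Split | Lines | Words | Err.% | Domain |
|---|---|---|---|---|---|
| QALB-2014 | Train | 19K | 1M | 30% | Comments |
| | Dev | 1K | 54K | 31% | Comments |
| | Test | 968 | 51K | 32% | Comments |
| QALB-2015 | Test | 920 | 49K | 27% | Comments |
| ZAEBUC | Train | 150 | 25K | 24% | Essays |
| | Dev | 33 | 5K | 25% | Essays |
| | Test | 31 | 5K | 26% | Essays |
| MADAR CODA | Train | 7K | 40K | 22% | Comments |
| | Dev | 1.5K | 9K | 20% | Comments |
| | Test | 1.5K | 9K | 21% | Comments |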

Table 2: Corpus statistics of MSA (QALB, ZAEBUC) and DA (MADAR CODA) GEC datasets.

| System | QALB-2014 P | QALB-2014 R | QALB-2014 $F_1$ | QALB-2014 $F_{0.5}$ | ZAEBUC P | ZAEBUC R | ZAEBUC $F_1$ | ZAEBUC $F_{0.5}$ |
|---|---|---|---|---|---|---|---|---|
| A’2023 (Seq2Seq) | 83.2 | 64.9 | 72.9 | 78.7 | 87.3 | 70.6 | 78.1 | 83.4 |
| A’2023 (Seq2Seq++) | 83.1 | 67.9 | 74.7 | 79.6 | 87.6 | 73.9 | 80.2 | 84.5 |
| GPT-3.5-turbo | 68.6 | 58.6 | 63.2 | 66.3 | 71.0 | 63.5 | 67.1 | 69.4 |
| GPT-4o | 80.7 | 65.7 | 72.4 | 77.2 | 86.5 | 76.8 | 81.3 | 84.3 |
| Fanar | 69.7 | 63.7 | 66.6 | 68.4 | 76.3 | 73.6 | 74.9 | 75.8 |
| Jais-13B-Chat | 49.1 | 36.9 | 42.1 | 46.0 | 50.2 | 19.7 | 28.3 | 38.4 |
| sweet | 81.8 | 68.8 | 74.7 | 78.8 | 85.8 | 72.3 | 78.4 | 82.7 |
| sweet$^{2}$ | 81.9 | 70.4 | 75.7 | 79.3 | 85.8 | 73.3 | 79.1 | 83.0 |
| sweet$^{2}_{\text{NoPnx}}$ + sweet$_{\text{Pnx}}$ | 83.7 | 68.8 | 75.6 | 80.3† | 86.7 | 73.9 | 79.8 | 83.8 |
| 3-Ensemble | 84.9 | 68.8 | 76.0 | 81.1 | 89.6 | 72.8 | 80.3 | 85.6 |
| 4-Ensemble | 89.1 | 61.6 | 72.8 | 81.8‡ | 93.3 | 68.3 | 78.9 | 86.9‡ |

Table 3: MSA GEC results on the Dev sets of QALB-2014 and ZAEBUC. A’2023 is Alhafni et al. (2023). Best non-ensemble results are underlined; best overall results are in bold. † denotes statistical significance over the best baseline; ‡ denotes statistical significance over both the best baseline and the best non-ensemble model.

| System | P | R | $F_1$ | $F_{0.5}$ |
|---|---|---|---|---|
| A’2024 (Seq2Seq) | 86.8 | 77.4 | 81.8 | 84.7 |
| A’2024 (Seq2Seq++) | 87.6 | 79.3 | 83.3 | 85.8 |
| GPT-3.5-turbo | 35.5 | 29.7 | 32.3 | 34.1 |
| GPT-4o | 53.7 | 54.4 | 54.1 | 53.8 |
| Fanar | 24.5 | 28.8 | 26.4 | 25.2 |
| Jais-13B-Chat | 14.1 | 15.0 | 14.5 | 14.3 |
| sweet | 89.1 | 75.5 | 81.7 | 86.0 |
| sweet$^{2}$ | 87.5 | 73.5 | 79.9 | 84.3 |
| 3-Ensemble | 91.7 | 77.4 | 83.9 | 88.4 |
| 4-Ensemble | 93.8 | 72.5 | 81.8 | 88.6‡ |

Table 4: DA GEC results on the MADAR CODA Dev set. A’2024 is Alhafni et al. (2024). Best non-ensemble results are underlined; best overall results are in bold. ‡ denotes statistical significance over both the best baseline and the best non-ensemble model.

4.3 Models

LLMs

We evaluate four LLMs: two commercial models and two open-source, Arabic-centric models. The commercial models include OpenAI’s GPT-3.5-turbo and GPT-4o OpenAI et al. (2024), while the Arabic-centric models are Jais-13B-Chat Sengupta et al. (2023) and the recently introduced Fanar LLM Team et al. (2025). We prompt GPT-3.5-turbo, GPT-4o, and Fanar through the OpenAI API, while Jais-13B-Chat is prompted using Hugging Face’s Transformers Wolf et al. (2020). Our experiments use both English and Arabic prompts, employing 0-shot and 5-shot prompting strategies. We design the prompts to elicit minimal edit-style corrections, ensuring that the LLMs’ outputs remain as close as possible to the original input in phrasing and lexical choices. We present our prompts in Figures 2 and 3 in Appendix H.

Edit Taggers

To investigate the impact of edit representation design on performance (§3.2), we build edit taggers with different configurations. For word-level tagging, we use the representation of the first subword of each word and pass it through the subsequent layers. For subword-level tagging, we use the representation of each subword individually. Several Arabic pretrained transformer encoders based on BERT Devlin et al. (2019) have been developed Antoun et al. (2020); Abdul-Mageed et al. (2021); Inoue et al. (2021); Ghaddar et al. (2022). We select the three best-performing Arabic BERT models, as identified by Inoue et al. (2021) across various sentence and token classification tasks: AraBERTv02 Antoun et al. (2020), ARBERTv2 Abdul-Mageed et al. (2021), and CAMeLBERT-MSA Inoue et al. (2021). For QALB-2014, our edit taggers are trained exclusively on QALB-2014, following the shared task restrictions. For QALB-2015 (L1), we train only on QALB-2014 for consistency. For ZAEBUC, we train on both QALB-2014 and ZAEBUC, upsampling ZAEBUC tenfold to address its smaller size and domain shift. For DA GEC, we train on the MADAR CODA training split. The hyperparameters we used are detailed in Appendix A.

| System | QALB-2014 P | QALB-2014 R | QALB-2014 $F_1$ | QALB-2014 $F_{0.5}$ | QALB-2015 P | QALB-2015 R | QALB-2015 $F_1$ | QALB-2015 $F_{0.5}$ | ZAEBUC P | ZAEBUC R | ZAEBUC $F_1$ | ZAEBUC $F_{0.5}$ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A’2023 (Seq2Seq) | 84.0 | 64.7 | 73.1 | 79.3 | 82.0 | 71.7 | 76.5 | 79.7 | 86.0 | 71.6 | 78.2 | 82.7 |
| A’2023 (Seq2Seq++) | 84.2 | 65.4 | 73.6 | 79.6 | 82.6 | 72.1 | 77.0 | 80.3 | 85.9 | 73.4 | 79.2 | 83.1 |
| GPT-4o | 81.5 | 65.5 | 72.6 | 77.7 | 81.1 | 74.3 | 77.5 | 79.6 | 84.4 | 75.9 | 79.9 | 82.5 |
| sweet$^{2}$ | 82.6 | 69.5 | 75.5 | 79.6 | 80.0 | 74.3 | 77.0 | 78.8 | 85.5 | 74.4 | 79.6 | 83.0 |
| sweet$^{2}_{\text{NoPnx}}$ + sweet$_{\text{Pnx}}$ | 84.5 | 67.7 | 75.2 | 80.5† | 82.2 | 73.6 | 77.7 | 80.3 | 85.7 | 74.1 | 79.5 | 83.1 |
| 3-Ensemble | 85.7 | 67.4 | 75.4 | 81.3 | 83.7 | 73.3 | 78.1 | 81.3 | 89.7 | 73.7 | 80.9 | 85.9 |
| 4-Ensemble | 89.7 | 60.2 | 72.0 | 81.7‡ | 88.3 | 66.7 | 76.0 | 82.9‡ | 93.4 | 68.9 | 79.3 | 87.2‡ |

Table 5: MSA GEC results on the Test sets of QALB-2014, QALB-2015 (L1), and ZAEBUC. A’2023 is Alhafni et al. (2023). Best non-ensemble results are underlined; best overall results are in bold. † denotes statistical significance over the best baseline; ‡ denotes statistical significance over both the best baseline and the best non-ensemble model.

| System | P | R | $F_1$ | $F_{0.5}$ |
|---|---|---|---|---|
| A’2024 (Seq2Seq) | 87.3 | 78.0 | 82.4 | 85.2 |
| A’2024 (Seq2Seq++) | 88.4 | 79.0 | 83.4 | 86.3 |
| GPT-4o | 56.1 | 54.8 | 55.5 | 55.9 |
| sweet | 89.4 | 76.6 | 82.5 | 86.5 |
| 3-Ensemble | 92.2 | 77.7 | 84.3 | 88.9‡ |
| 4-Ensemble | 94.0 | 72.9 | 82.1 | 88.8 |

Table 6: DA GEC results on the Test set of MADAR CODA. A’2024 is Alhafni et al. (2024). Best non-ensemble results are underlined; best overall results are in bold. ‡ denotes statistical significance over both the best baseline and the best non-ensemble model.

4.4 Ensembling

We construct majority vote ensemble models by aggregating the outputs of multiple GEC systems. This is enabled by our edit extraction algorithm (§3.1), which allows us to align and extract edits from models with different architectures. Using this algorithm, we first align each model’s output with the input text, extract the proposed edits, and then determine the final edit sequence through majority voting. Following Tarnavskyi et al. (2022), we retain an edit only if at least $k-1$ models out of $k$ models predict it; otherwise, we leave the input unchanged. This strategy prioritizes precision over recall, which is crucial for GEC systems, as precision is generally more important than correcting every possible error Bryant et al. (2023).
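The voting rule can be sketched as follows, assuming each system's output has already been aligned to the input and reduced to a collection of (token index, correction) pairs. That edit format, the function name `majority_vote`, and the toy data are assumptions made for illustration.

```python
from collections import Counter


def majority_vote(system_edits: list[set[tuple[int, str]]]) -> list[tuple[int, str]]:
    """Keep an edit only if at least k-1 of the k systems propose it."""
    k = len(system_edits)
    votes = Counter(edit for edits in system_edits for edit in set(edits))
    return sorted(edit for edit, count in votes.items() if count >= k - 1)


systems = [
    {(0, "A"), (3, "B")},
    {(0, "A")},
    {(0, "A"), (3, "B"), (5, "C")},
]
print(majority_vote(systems))  # (5, "C") has a single vote and is dropped
```

For $k \geq 3$, two conflicting edits at the same position cannot both reach $k-1$ votes when each system proposes at most one edit per position, so the retained edits are mutually compatible under that assumption.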

5 Results

Tables 3 and 4 show the Dev results for MSA and DA GEC, respectively. For each dataset, we compare our models with the best-performing Seq2Seq and Seq2Seq++ baselines reported by Alhafni et al. (2023) and Alhafni et al. (2024). The Seq2Seq++ setups incorporate additional signals, such as morphological preprocessing and GED information for MSA GEC, or dialect identification for DA GEC. Full results for all Seq2Seq-based baseline variants across datasets are provided in Appendices E and F.

LLMs

We present LLM results on MSA and DA GEC using their best setups, optimized for average $F_{0.5}$ across all datasets based on prompt language and strategy (0-shot vs. 5-shot). Full results are in Table 11 (Appendix D). For QALB-2014, GPT-4o and Fanar outperform GPT-3.5-turbo and Jais-13B-Chat, with GPT-4o achieving the best performance, though none surpass Alhafni et al. (2023). On ZAEBUC, GPT-4o leads, achieving the highest recall (76.8) and $F_1$ (81.3). For DA GEC, GPT-4o is the top LLM, but overall LLM performance is notably lower than for MSA.

| Error type | QALB-2014 Baseline | QALB-2014 sweet | QALB-2014 Ensemble | ZAEBUC Baseline | ZAEBUC sweet | ZAEBUC Ensemble | MADAR CODA Baseline | MADAR CODA sweet | MADAR CODA Ensemble |
|---|---|---|---|---|---|---|---|---|---|
| Delete | 41.1 | 44.3 | 45.2 | 51.9 | 63.6 | 62.5 | 0.0 | 0.0 | 0.0 |
| Merge-B | 94.0 | 93.7 | 93.8 | 96.7 | 96.9 | 96.6 | 94.4 | 86.6 | 92.7 |
| Merge-I | 93.8 | 93.5 | 93.6 | 96.7 | 96.9 | 96.6 | 93.6 | 84.8 | 91.6 |
| M | 33.9 | 33.6 | 28.6 | 48.6 | 50.0 | 41.7 | 82.5 | 78.0 | 82.4 |
| M+O | 58.0 | 61.0 | 60.6 | 55.6 | 100.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| O | 94.3 | 94.5 | 94.4 | 94.4 | 94.4 | 94.1 | 92.1 | 90.2 | 91.4 |
| O+X | 78.1 | 81.5 | 83.3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| P | 75.0 | 75.6 | 76.8 | 62.8 | 71.9 | 70.4 | 65.8 | 35.7 | 55.6 |
| S | 46.4 | 57.1 | 57.4 | 40.4 | 47.6 | 46.9 | 83.1 | 82.3 | 83.0 |
| X | 61.2 | 61.4 | 62.4 | 72.9 | 74.1 | 74.1 | 73.8 | 76.9 | 79.5 |
| Split | 87.1 | 83.8 | 87.6 | 88.2 | 90.0 | 95.2 | 85.9 | 83.3 | 86.8 |
| UNK | 59.2 | 55.0 | 56.0 | 63.1 | 44.6 | 47.6 | 93.0 | 94.6 | 94.1 |
| C | 97.1 | 96.4 | 94.7 | 96.1 | 96.0 | 93.9 | 97.0 | 96.0 | 94.7 |
| Macro Avg. | 70.7 | 71.6 | 71.9 | 66.7 | 71.2 | 63.1 | 78.3 | 73.5 | 77.4 |

Table 7: Error type performance on the Dev sets of QALB-2014, ZAEBUC, and MADAR CODA for the best Seq2Seq++ baseline, the best sweet model (sweet$^{2}_{\text{NoPnx}}$ + sweet$_{\text{Pnx}}$ for QALB-2014 and ZAEBUC; sweet for MADAR CODA), and the best ensemble (4-Ensemble). Results are reported in terms of $F_{0.5}$. Best non-ensemble results are underlined; best overall results are in bold. UNK refers to unknown error types; C refers to correct words.

Edit Taggers

Table 10 (Appendix C) presents the full edit tagging results on the Dev sets, exploring edit design choices using CAMeLBERT-MSA, AraBERTv02, and ARBERTv2. AraBERTv02 consistently performs best. Subword-level edits, compression, and pruning improve performance, with optimal pruning thresholds of 10 for QALB-2014 and MADAR CODA, and 30 for ZAEBUC. The optimal setup for each dataset (subword, compression, pruning) is presented in Tables 3 and 4. We henceforth refer to this system as sweet (Subword Edit Error Tagger). sweet achieves an F$_{0.5}$ of 78.8 on QALB-2014 and 86.0 on MADAR CODA, outperforming the Seq2Seq baseline on QALB-2014 and setting a new SOTA on MADAR CODA (though the improvement is not statistically significant; significance is tested using a two-sided approximate randomization test). On ZAEBUC, it scores 82.7 F$_{0.5}$, trailing behind the Seq2Seq baseline. Consistent with previous work on text editing Omelianchuk et al. (2020); Straka et al. (2021), we find that iterative correction improves MSA GEC up to two iterations (sweet$^{2}$), achieving 79.3 on QALB-2014 and 83.0 F$_{0.5}$ on ZAEBUC, with the highest recall on QALB-2014 (70.4). However, iterative correction degrades DA GEC performance. Separating non-punctuation edits (sweet$_{\text{NoPnx}}$) from punctuation edits (sweet$_{\text{Pnx}}$) improves MSA GEC performance. The best setup applies these systems in sequence: two iterations of non-punctuation correction followed by one iteration of punctuation correction (sweet$^{2}_{\text{NoPnx}}$ + sweet$_{\text{Pnx}}$). This setup achieves the highest F$_{0.5}$ score among text editing models, setting a new SOTA on QALB-2014 with 80.3. This improvement is statistically significant compared to Seq2Seq++ ($p<0.05$) and is driven by a precision of 83.7. Its performance on ZAEBUC leads other edit tagging techniques but trails behind GPT-4o.

For our ensemble models (3-Ensemble), we combine the outputs of the top three non-LLM models per dataset. For QALB-2014 and ZAEBUC, this includes Seq2Seq++, sweet$^{2}$, and the cascaded setup sweet$^{2}_{\text{NoPnx}}$ + sweet$_{\text{Pnx}}$. For MADAR CODA, we ensemble Seq2Seq++, sweet, and the second-best sweet model using CAMeLBERT-MSA (see Table 10 in Appendix C). The 3-Ensembles outperform single models, achieving SOTA results across all datasets, primarily through increased precision at the cost of recall. Adding GPT-4o’s output to the ensemble further boosts performance (4-Ensemble), reaching an F$_{0.5}$ of 81.8 on QALB-2014, 86.9 on ZAEBUC, and 88.6 on MADAR CODA. These gains are statistically significant ($p<0.05$) compared to the best baseline and the best non-ensemble model for each dataset.
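The cascaded setup described above (two passes of non-punctuation correction, then one pass of punctuation correction) can be sketched as a simple function composition. This is a minimal illustration, not the authors' implementation: `apply_cascade`, `toy_nopnx`, and `toy_pnx` are hypothetical names, and the two toy correctors are English string rules standing in for the fine-tuned taggers.

```python
from typing import Callable

# A corrector maps a sentence to its corrected version (hypothetical interface).
Corrector = Callable[[str], str]


def apply_cascade(sentence: str, nopnx: Corrector, pnx: Corrector,
                  nopnx_iterations: int = 2) -> str:
    """Run the non-punctuation corrector repeatedly, then punctuation once."""
    for _ in range(nopnx_iterations):
        sentence = nopnx(sentence)
    return pnx(sentence)


def toy_nopnx(sentence: str) -> str:
    # Toy stand-in: fixes only ONE misspelling per pass, so a second
    # pass can catch what the first one left behind.
    return sentence.replace("teh", "the", 1)


def toy_pnx(sentence: str) -> str:
    # Toy stand-in: adds a final period when it is missing.
    return sentence if sentence.endswith(".") else sentence + "."


if __name__ == "__main__":
    print(apply_cascade("teh cat saw teh dog", toy_nopnx, toy_pnx))
```

With `nopnx_iterations=1` the second misspelling survives, which mirrors why a second iteration can raise recall when a single tagging pass leaves some errors uncorrected.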

Test Results

Tables 5 and 6 present the Test results for MSA and DA GEC, using the best setups identified from the Dev sets. On QALB-2014, the cascaded setup sweet$^{2}_{\text{NoPnx}}$ + sweet$_{\text{Pnx}}$ sets a new SOTA with 80.5 F$_{0.5}$, outperforming the Seq2Seq and Seq2Seq++ baselines (statistically significant at $p<0.05$). On QALB-2015, this setup matches Seq2Seq++ with an F$_{0.5}$ of 80.3. Similarly, on ZAEBUC, it achieves 83.1, on par with Seq2Seq++. On MADAR CODA, sweet achieves 86.5, outperforming Seq2Seq++ (though not statistically significant). Our ensemble models further enhance performance across all datasets, reaching 81.7 on QALB-2014, 82.9 on QALB-2015, 87.2 on ZAEBUC, and 88.9 on MADAR CODA. Notably, adding GPT-4o’s output to the ensemble (i.e., the 4-Ensemble) yields statistically significant improvements for MSA GEC. On MADAR CODA, the 3-Ensemble already achieves statistically significant gains, while the addition of GPT-4o does not lead to further improvement.
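The significance claims above rest on a two-sided approximate randomization test. The following is a generic sketch of that test for a corpus-level F$_{0.5}$, assuming each system's output has already been reduced to per-sentence `(tp, fp, fn)` edit counts; the function names, the number of trials, and the smoothed p-value estimate are our assumptions, not details taken from the paper.

```python
import random
from typing import List, Tuple

Counts = Tuple[int, int, int]  # (true positives, false positives, false negatives)


def corpus_f_beta(stats: List[Counts], beta: float = 0.5) -> float:
    """Corpus-level F-beta computed from summed per-sentence counts."""
    tp = sum(s[0] for s in stats)
    fp = sum(s[1] for s in stats)
    fn = sum(s[2] for s in stats)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def approximate_randomization(stats_a: List[Counts], stats_b: List[Counts],
                              trials: int = 1000, seed: int = 0) -> float:
    """Two-sided p-value for the F-beta difference between systems A and B."""
    rng = random.Random(seed)
    observed = abs(corpus_f_beta(stats_a) - corpus_f_beta(stats_b))
    at_least_as_extreme = 0
    for _ in range(trials):
        shuffled_a, shuffled_b = [], []
        for a, b in zip(stats_a, stats_b):
            # Swap the two systems' outputs for this sentence with probability 0.5.
            if rng.random() < 0.5:
                a, b = b, a
            shuffled_a.append(a)
            shuffled_b.append(b)
        diff = abs(corpus_f_beta(shuffled_a) - corpus_f_beta(shuffled_b))
        if diff >= observed:
            at_least_as_extreme += 1
    # Add-one smoothing keeps the estimate strictly positive.
    return (at_least_as_extreme + 1) / (trials + 1)
```

A difference would be reported as significant at the 0.05 level when the returned value is below 0.05.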

5.1 Error Analysis

Table 7 presents specific error type performance over the Dev sets of QALB-2014, ZAEBUC, and MADAR CODA. We conduct automatic error analysis using ARETA Belkebir and Habash (2021), an automatic MSA error type annotation tool. ARETA defines error types based on seven classes covering: orthography (O), morphology (M), syntax (X), semantics (S), punctuation (P), merges (Merge-Beginning/Merge-Inside), and splits (Split). On QALB-2014, the cascaded setup sweet$^{2}_{\text{NoPnx}}$ + sweet$_{\text{Pnx}}$ outperforms the Seq2Seq++ baseline on most error types, achieving a macro F$_{0.5}$ of 71.6. The 4-Ensemble provides a modest improvement, reaching 71.9. On ZAEBUC, while the cascaded sweet setup does not surpass Seq2Seq++ in overall performance (Table 3), it achieves higher scores on most individual error types, with a macro F$_{0.5}$ of 71.2. This stems from the skewed distribution of error types in the ZAEBUC Dev set, which is dominated by correct words (C) and frequent errors like O and Merge, categories where both models perform similarly. The error types where the cascaded sweet model excels are relatively infrequent (see Table 16 in Appendix G). Notably, while the 4-Ensemble yields the best overall GEC results, it falls short of Seq2Seq++ and the cascaded sweet model in error-type performance, likely due to prioritizing precision over recall (Table 3). On MADAR CODA, neither sweet nor the 4-Ensemble surpasses Seq2Seq++ in error-type performance. While sweet performs slightly better in terms of DA GEC, the improvement is not statistically significant and is driven primarily by precision; in contrast, Seq2Seq++ claims the highest recall (Table 4). The 4-Ensemble further increases precision but at the cost of recall. The high proportion of unknown (UNK) errors also highlights ARETA’s limitations in capturing dialect-specific errors, as it was primarily designed for MSA.

| Model | Params | Init. (s) | Run (s) |
|---|---|---|---|
| A’2023 (Seq2Seq) | 139M | 1.7 | 70.7 |
| A’2023 (Seq2Seq++) | 502M | 24.7 | 218.5 |
| sweet | 135M | 1.3 | 11.6 |
| sweet$^{2}$ | 135M | 1.3 | 23.2 |
| sweet$^{2}_{\text{NoPnx}}$ + sweet$_{\text{Pnx}}$ | 270M | 2.7 | 34.8 |
| 3-Ensemble | 908M | 28.7 | 276.4 |

Table 8: Number of parameters (Params.), initialization time (Init.), and runtime for different models on the Dev set of QALB-2014. Init. and runtime are in seconds and averaged over 10 runs on a single A100 GPU using a batch size of 32.

5.2 Runtime Performance

Table 8 compares our text editing models to the Seq2Seq models from Alhafni et al. (2023) in terms of model size, initialization time, and inference runtime. Initialization and inference times were averaged over 10 runs on the QALB-2014 Dev set using a single A100 GPU with a batch size of 32. The reported values for Seq2Seq++ reflect the combined size, initialization, and inference times of all its components. Our sweet model is 4x smaller than Seq2Seq++, while the cascaded system sweet$^{2}_{\text{NoPnx}}$ + sweet$_{\text{Pnx}}$ is about half the Seq2Seq++ model size. In terms of speed, sweet initializes 19x faster than Seq2Seq++, while the cascaded system achieves a 9x initialization speedup. For inference, sweet is also 19x faster, sweet$^{2}$ is 9x faster, and the cascaded setup is 6x faster. Compared to the vanilla Seq2Seq model, sweet runs 6x faster, sweet$^{2}$ is 3x faster, and the cascaded system is twice as fast. Although the 3-Ensemble setup achieves the best performance, it is the largest in model size and the slowest overall.
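The size and speed ratios quoted above follow directly from the Table 8 figures. The short sketch below recomputes them; the dictionary keys and the `ratio` helper are our own naming, while the numbers are copied from Table 8.

```python
# (parameters in millions, initialization seconds, runtime seconds) from Table 8
TABLE_8 = {
    "Seq2Seq": (139, 1.7, 70.7),
    "Seq2Seq++": (502, 24.7, 218.5),
    "sweet": (135, 1.3, 11.6),
    "sweet^2": (135, 1.3, 23.2),
    "cascade": (270, 2.7, 34.8),
    "3-Ensemble": (908, 28.7, 276.4),
}
PARAMS, INIT, RUN = 0, 1, 2


def ratio(baseline: str, system: str, column: int) -> float:
    """How many times larger (or slower) the baseline is than the system."""
    return TABLE_8[baseline][column] / TABLE_8[system][column]


if __name__ == "__main__":
    for system in ("sweet", "sweet^2", "cascade"):
        print(
            system,
            f"size x{ratio('Seq2Seq++', system, PARAMS):.1f}",
            f"init x{ratio('Seq2Seq++', system, INIT):.1f}",
            f"run vs Seq2Seq++ x{ratio('Seq2Seq++', system, RUN):.1f}",
            f"run vs Seq2Seq x{ratio('Seq2Seq', system, RUN):.1f}",
        )
```

Rounding each ratio to the nearest integer gives the multipliers stated in the text.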

6 Conclusion and Future Work

We introduced a data-driven text editing approach that eliminates the need for predefined language-specific edits. By applying it to Arabic, a diglossic and morphologically rich language, we studied the impact of different edit representations on model performance. Our models set new SOTA results on two Arabic GEC benchmarks and matched top-performing systems on two others. Moreover, they offer a significant efficiency advantage, running over six times faster than existing Arabic GEC systems, making them more suitable for practical deployment. We also explored how ensemble models contribute to further performance improvements. In future work, we plan to extend this approach to other languages and dialectal varieties Jarrar et al. (2016); Khalifa et al. (2018) and investigate its potential for generating synthetic data for GEC Li et al. (2022); Zhang et al. (2022); Stahlberg and Kumar (2024). We also plan to explore other ensembling approaches Qorib and Ng (2023); Qorib et al. (2022).

Limitations

While our work demonstrates promising results, there are several considerations that could impact its broader applicability. One limitation is the use of closed-source commercial LLMs, which introduces a degree of uncertainty, as these models may undergo undisclosed updates over time. Such changes could affect the reproducibility of our results. Additionally, we did not report on L2 Arabic GEC, which could provide valuable insights into how our approach generalizes to second-language learners’ errors. We also did not explore multilingual transformer encoders, as we hypothesize that monolingual models would be more effective for Arabic GEC. However, future work is needed to verify this assumption. Finally, our analysis focused on Arabic, which may limit the generalizability of our findings to languages with different error correction challenges.

Ethical Considerations

GEC systems can aid in identifying and correcting errors, but they also raise ethical concerns. Misidentifications or miscorrections may frustrate learners, and GEC tools should complement, not replace, human judgment. There is also the risk of malicious use, such as profiling learners based on error patterns, which could lead to bias or privacy issues. It is important to use these systems responsibly to protect end users.

Acknowledgments

We thank Ted Briscoe for helpful discussions and constructive feedback. We also acknowledge the support of the High Performance Computing Center at New York University Abu Dhabi.

References

Appendix A Hyperparameters

We use Hugging Face’s Transformers to build our edit taggers. Models trained on QALB-2014 or MADAR CODA are fine-tuned for 10 epochs using a learning rate of 5e-5, a batch size of 32, a maximum sequence length of 512, and a seed of 42 on a single A100 GPU. For models trained on QALB-2014 with the tenfold upsampled ZAEBUC, we use the same hyperparameters but run training for 15 epochs. At the end of fine-tuning, we pick the best checkpoint based on the performance on the Dev sets by using the M$^{2}$ scorer.
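The settings above can be collected in a small configuration helper. This is a sketch under stated assumptions: the first four keys reuse field names from `transformers.TrainingArguments`, `max_seq_length` is our own key for the tokenizer truncation length, the dataset labels are illustrative, and the Dev scores passed to `best_checkpoint` are hypothetical placeholders for M$^{2}$ scorer output.

```python
from typing import Dict


def finetune_config(train_data: str) -> Dict[str, object]:
    """Hyperparameters from Appendix A, keyed by the training data used."""
    # 15 epochs only when QALB-2014 is combined with the upsampled ZAEBUC.
    epochs = 15 if train_data == "QALB-2014+ZAEBUC" else 10
    return {
        "num_train_epochs": epochs,
        "learning_rate": 5e-5,
        "per_device_train_batch_size": 32,
        "seed": 42,
        "max_seq_length": 512,  # our own key, not a TrainingArguments field
    }


def best_checkpoint(dev_scores: Dict[str, float]) -> str:
    """Pick the checkpoint with the highest Dev score."""
    return max(dev_scores, key=dev_scores.get)


if __name__ == "__main__":
    print(finetune_config("MADAR CODA"))
    # Hypothetical checkpoint names and scores, for illustration only.
    print(best_checkpoint({"checkpoint-a": 77.9, "checkpoint-b": 78.8}))
```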

Appendix B Edit Coverage

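| Input | Comp. | Subset | Prune | QALB-2014 Edits | QALB-2014 OOV% | QALB-2014 F$_{0.5}$ | ZAEBUC Edits | ZAEBUC OOV% | ZAEBUC F$_{0.5}$ | MADAR CODA Edits | MADAR CODA OOV% | MADAR CODA F$_{0.5}$ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Word | ✗ | All | - | 16,221 | 1.00% | 98.4 | 1,097 | 2.94% | 96.2 | 1,228 | 1.52% | 98.0 |
| Subword | ✗ | All | - | 9,060 | 0.36% | 98.7 | 905 | 1.85% | 96.5 | 677 | 0.55% | 98.1 |
| Word | ✓ | All | - | 10,410 | 1.00% | 98.4 | 687 | 2.94% | 96.2 | 741 | 1.52% | 98.0 |
| Subword | ✓ | All | - | 6,170 | 0.36% | 98.7 | 563 | 1.85% | 96.5 | 454 | 0.55% | 98.1 |
| Subword | ✓ | NoPnx | - | 4,799 | 0.27% | 98.8 | 498 | 1.74% | 96.2 | - | - | - |
| Subword | ✓ | Pnx | - | 160 | 0.01% | 99.4 | 23 | 0.06% | 99.9 | - | - | - |
| Subword | ✓ | All | 10 | 683 | 0.75% | 98.1 | 58 | 3.71% | 93.9 | 84 | 1.33% | 96.2 |
| Subword | ✓ | All | 20 | 442 | 1.02% | 97.7 | 35 | 4.67% | 92.6 | 52 | 2.02% | 94.1 |
| Subword | ✓ | All | 30 | 329 | 1.24% | 97.4 | 27 | 5.26% | 91.8 | 45 | 2.28% | 93.4 |
| Subword | ✓ | NoPnx | 10 | 520 | 0.56% | 98.2 | 52 | 3.39% | 93.7 | - | - | - |
| Subword | ✓ | NoPnx | 20 | 335 | 0.75% | 97.8 | 30 | 4.31% | 92.3 | - | - | - |
| Subword | ✓ | NoPnx | 30 | 250 | 0.92% | 97.5 | 22 | 4.90% | 91.4 | - | - | - |
| Subword | ✓ | Pnx | 10 | 48 | 0.02% | 99.4 | 6 | 0.11% | 99.9 | - | - | - |
| Subword | ✓ | Pnx | 20 | 35 | 0.05% | 99.4 | 6 | 0.11% | 99.9 | - | - | - |
| Subword | ✓ | Pnx | 30 | 29 | 0.05% | 99.3 | 6 | 0.11% | 99.9 | - | - | - |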

Table 9: Edit statistics on QALB-2014, ZAEBUC and MADAR CODA. Input is the input unit (word or subword). Comp. indicates whether the edit is compressed. Subset specifies whether the edits capture all errors, punctuation-only errors (Pnx), or non-punctuation errors (NoPnx). Edits represents the total number of unique edits in the training set of each dataset. OOV% is the percentage of out-of-vocabulary edits (non-unique) in the Dev set of each dataset.
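The Edits and OOV% columns can be computed with a few lines of counting. The sketch below is illustrative only: it assumes pruning keeps edits that occur at least `min_count` times in training (the exact pruning rule is not stated in this appendix), and the edit labels such as `"K"` or `"R_a"` are made-up placeholders rather than the paper's tag inventory.

```python
from collections import Counter
from typing import Iterable, List, Set


def build_edit_vocab(train_edits: Iterable[str], min_count: int = 1) -> Set[str]:
    """Unique training edits, optionally pruned by a frequency threshold."""
    counts = Counter(train_edits)
    return {edit for edit, count in counts.items() if count >= min_count}


def oov_rate(dev_edits: List[str], vocab: Set[str]) -> float:
    """Percentage of Dev edit tokens (non-unique) missing from the vocabulary."""
    if not dev_edits:
        return 0.0
    missing = sum(1 for edit in dev_edits if edit not in vocab)
    return 100.0 * missing / len(dev_edits)


if __name__ == "__main__":
    # Toy data: "K" = keep, "R_a" = replace with "a", "D" = delete, "I_b" = insert "b".
    train = ["K"] * 5 + ["R_a"] * 3 + ["D"]
    dev = ["K", "K", "R_a", "D", "I_b"]
    for threshold in (1, 2):
        vocab = build_edit_vocab(train, min_count=threshold)
        print(threshold, len(vocab), oov_rate(dev, vocab))
```

The toy run shows the trade-off visible in Table 9: a higher threshold shrinks the edit vocabulary while raising the share of Dev edits that fall outside it.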

Appendix C Edit Tagging Results

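| Model | Input | Comp. | Subset | Prune | QALB-2014 P | QALB-2014 R | QALB-2014 F$_{1}$ | QALB-2014 F$_{0.5}$ | ZAEBUC P | ZAEBUC R | ZAEBUC F$_{1}$ | ZAEBUC F$_{0.5}$ | MADAR CODA P | MADAR CODA R | MADAR CODA F$_{1}$ | MADAR CODA F$_{0.5}$ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AraBERTv02 | Word | ✗ | All | - | 81.0 | 64.3 | 71.7 | 77.0 | 84.8 | 69.5 | 76.4 | 81.2 | 87.9 | 66.5 | 75.7 | 82.6 |
| AraBERTv02 | Subword | ✗ | All | - | 81.0 | 67.8 | 73.8 | 77.9 | 84.4 | 71.3 | 77.3 | 81.4 | 87.6 | 76.8 | 81.9 | 85.2 |
| AraBERTv02 | Word | ✔ | All | - | 80.8 | 66.6 | 73.0 | 77.5 | 83.8 | 71.4 | 77.1 | 81.0 | 85.6 | 76.9 | 81.0 | 83.7 |
| AraBERTv02 | Subword | ✔ | All | - | 81.1 | 69.1 | 74.6 | 78.4 | 84.3 | 72.9 | 78.2 | 81.7 | 86.9 | 79.2 | 82.9 | 85.2 |
| ARBERTv2 | Word | ✗ | All | - | 78.9 | 57.7 | 66.7 | 73.5 | 82.2 | 54.8 | 65.8 | 74.8 | 86.4 | 61.0 | 71.5 | 79.8 |
| ARBERTv2 | Subword | ✗ | All | - | 78.7 | 60.8 | 68.6 | 74.3 | 79.7 | 58.1 | 67.2 | 74.2 | 84.5 | 69.0 | 76.0 | 80.8 |
| ARBERTv2 | Word | ✔ | All | - | 77.8 | 61.4 | 68.6 | 73.8 | 80.7 | 62.8 | 70.6 | 76.3 | 81.8 | 68.4 | 74.5 | 78.7 |
| ARBERTv2 | Subword | ✔ | All | - | 78.6 | 60.0 | 68.0 | 74.0 | 82.7 | 62.1 | 70.9 | 77.5 | 84.2 | 70.8 | 77.0 | 81.2 |
| CAMeLBERT | Word | ✗ | All | - | 81.2 | 61.5 | 70.0 | 76.3 | 84.6 | 66.4 | 74.4 | 80.2 | 88.3 | 66.4 | 75.8 | 82.8 |
| CAMeLBERT | Subword | ✗ | All | - | 80.4 | 65.2 | 72.0 | 76.9 | 83.5 | 69.3 | 75.8 | 80.2 | 87.1 | 76.8 | 81.6 | 84.8 |
| CAMeLBERT | Word | ✔ | All | - | 79.9 | 65.4 | 71.9 | 76.5 | 84.2 | 69.3 | 76.0 | 80.7 | 85.6 | 76.0 | 80.6 | 83.5 |
| CAMeLBERT | Subword | ✔ | All | - | 80.6 | 67.4 | 73.4 | 77.6 | 84.6 | 70.8 | 77.1 | 81.4 | 87.0 | 78.8 | 82.7 | 85.2 |
| AraBERTv02 | Subword | ✔ | All | 10 | 81.8 | 68.8 | 74.7 | 78.8 | 84.5 | 71.9 | 77.7 | 81.6 | 89.1 | 75.5 | 81.7 | 86.0 |
| AraBERTv02 | Subword | ✔ | All | 20 | 81.4 | 68.6 | 74.4 | 78.5 | 85.3 | 72.0 | 78.1 | 82.2 | 87.7 | 73.1 | 79.8 | 84.4 |
| AraBERTv02 | Subword | ✔ | All | 30 | 81.6 | 68.1 | 74.3 | 78.5 | 85.8 | 72.3 | 78.4 | 82.7 | 88.3 | 72.1 | 79.4 | 84.5 |
| CAMeLBERT | Subword | ✔ | All | 10 | 81.2 | 67.4 | 73.7 | 78.0 | 85.1 | 71.0 | 77.4 | 81.8 | 88.4 | 76.3 | 81.9 | 85.7 |
| CAMeLBERT | Subword | ✔ | All | 20 | 81.3 | 66.7 | 73.3 | 77.9 | 84.4 | 70.1 | 76.6 | 81.1 | 88.2 | 72.6 | 79.6 | 84.6 |
| CAMeLBERT | Subword | ✔ | All | 30 | 81.1 | 67.5 | 73.7 | 77.9 | 84.7 | 70.0 | 76.6 | 81.3 | 88.7 | 71.3 | 79.1 | 84.6 |
| AraBERTv02 | Subword | ✔ | NoPnx | - | 88.3 | 77.7 | 82.6 | 85.9 | 87.2 | 77.0 | 81.8 | 85.0 | - | - | - | - |
| AraBERTv02 | Subword | ✔ | NoPnx | 10 | 88.8 | 78.1 | 83.1 | 86.4 | 87.6 | 76.1 | 81.4 | 85.0 | - | - | - | - |
| AraBERTv02 | Subword | ✔ | NoPnx | 20 | 89.0 | 77.8 | 83.0 | 86.5 | 87.9 | 75.8 | 81.4 | 85.1 | - | - | - | - |
| AraBERTv02 | Subword | ✔ | NoPnx | 30 | 89.4 | 77.5 | 83.0 | 86.7 | 88.1 | 76.8 | 82.1 | 85.6 | - | - | - | - |
| AraBERTv02 | Subword | ✔ | Pnx | - | 90.6 | 83.0 | 86.6 | 89.0 | 96.8 | 94.0 | 95.4 | 96.2 | - | - | - | - |
| AraBERTv02 | Subword | ✔ | Pnx | 10 | 89.5 | 83.6 | 86.5 | 88.3 | 96.9 | 93.8 | 95.3 | 96.3 | - | - | - | - |
| AraBERTv02 | Subword | ✔ | Pnx | 20 | 90.7 | 82.8 | 86.5 | 89.0 | 96.7 | 93.6 | 95.1 | 96.1 | - | - | - | - |
| AraBERTv02 | Subword | ✔ | Pnx | 30 | 90.1 | 83.6 | 86.7 | 88.7 | 96.5 | 94.0 | 95.2 | 96.0 | - | - | - | - |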

Table 10: MSA and DA GEC results on the Dev sets of QALB-2014, ZAEBUC, and MADAR CODA. Input is the input unit (word or subword). Comp. indicates whether the edit is compressed. Subset specifies whether the edits capture all errors, punctuation-only errors (Pnx), or non-punctuation errors (NoPnx). NoPnx models are evaluated after removing punctuation, while Pnx models are evaluated on a version of the Dev set where all non-punctuation errors are corrected. Pruning experiments were conducted using the top two models (AraBERTv02 and CAMeLBERT), while punctuation segregation experiments used the best model (AraBERTv02). Best All results are in bold; best NoPnx and Pnx results are underlined.
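The F$_{1}$ and F$_{0.5}$ columns are both instances of the standard F$_{\beta}$ measure, which for $\beta=0.5$ weights precision more heavily than recall. As a worked check, the sketch below recomputes the scores for the AraBERTv02 subword, compressed, prune-10 row on QALB-2014 (P = 81.8, R = 68.8); the function name is ours, and small deviations from the table are expected because the inputs are already rounded.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Standard F-beta: (1 + beta^2) * P * R / (beta^2 * P + R)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


if __name__ == "__main__":
    p, r = 81.8, 68.8  # QALB-2014, AraBERTv02, subword, compressed, prune 10
    print(f"F1   = {f_beta(p, r, beta=1.0):.1f}")
    print(f"F0.5 = {f_beta(p, r, beta=0.5):.1f}")
```

Because precision counts four times as much as recall in the denominator's weighting at $\beta=0.5$, systems that trade recall for precision can still gain F$_{0.5}$.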

Appendix D LLMs Results

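| Model | P-Lang | Shots | QALB-2014 P | QALB-2014 R | QALB-2014 F$_{1}$ | QALB-2014 F$_{0.5}$ | ZAEBUC P | ZAEBUC R | ZAEBUC F$_{1}$ | ZAEBUC F$_{0.5}$ | MADAR CODA P | MADAR CODA R | MADAR CODA F$_{1}$ | MADAR CODA F$_{0.5}$ | Avg. F$_{0.5}$ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-3.5-turbo | EN | 0 | 70.6 | 54.8 | 61.7 | 66.7 | 70.8 | 70.3 | 70.5 | 70.7 | 22.8 | 17.7 | 19.9 | 21.5 | 53.0 |
| GPT-3.5-turbo | EN | 5 | 68.6 | 58.6 | 63.2 | 66.3 | 71.0 | 63.5 | 67.1 | 69.4 | 35.5 | 29.7 | 32.3 | 34.1 | 56.6 |
| GPT-3.5-turbo | AR | 0 | 70.0 | 58.5 | 63.7 | 67.3 | 68.3 | 71.3 | 69.8 | 68.9 | 24.2 | 22.7 | 23.4 | 23.9 | 53.3 |
| GPT-3.5-turbo | AR | 5 | 68.1 | 58.0 | 62.6 | 65.8 | 71.4 | 63.7 | 67.3 | 69.7 | 27.0 | 26.5 | 26.7 | 26.9 | 54.1 |
| GPT-4o | EN | 0 | 82.1 | 56.4 | 66.8 | 75.2 | 80.2 | 75.5 | 77.8 | 79.2 | 28.8 | 25.5 | 27.0 | 28.1 | 60.8 |
| GPT-4o | EN | 5 | 80.7 | 65.7 | 72.4 | 77.2 | 86.5 | 76.8 | 81.3 | 84.3 | 53.7 | 54.4 | 54.1 | 53.8 | 71.8 |
| GPT-4o | AR | 0 | 78.9 | 62.8 | 69.9 | 75.1 | 77.4 | 77.7 | 77.5 | 77.4 | 36.4 | 33.5 | 34.9 | 35.8 | 62.8 |
| GPT-4o | AR | 5 | 79.5 | 66.8 | 72.6 | 76.6 | 82.6 | 75.7 | 79.0 | 81.1 | 50.1 | 48.6 | 49.4 | 49.8 | 69.2 |
| Fanar | EN | 0 | 57.4 | 31.4 | 40.6 | 49.2 | 58.4 | 18.6 | 28.2 | 40.9 | 13.7 | 14.6 | 14.1 | 13.9 | 34.7 |
| Fanar | EN | 5 | 63.3 | 58.8 | 61.0 | 62.4 | 69.2 | 63.5 | 66.2 | 68.0 | 22.4 | 26.8 | 24.4 | 23.1 | 51.2 |
| Fanar | AR | 0 | 62.4 | 57.3 | 59.7 | 61.3 | 57.5 | 33.9 | 42.6 | 50.4 | 17.2 | 19.0 | 18.1 | 17.5 | 43.1 |
| Fanar | AR | 5 | 69.7 | 63.7 | 66.6 | 68.4 | 76.3 | 73.6 | 74.9 | 75.8 | 24.5 | 28.8 | 26.4 | 25.2 | 56.5 |
| Jais-13B-Chat | EN | 0 | 49.1 | 37.0 | 42.2 | 46.1 | 53.3 | 5.5 | 10.0 | 19.5 | 8.6 | 8.3 | 8.4 | 8.5 | 24.7 |
| Jais-13B-Chat | EN | 5 | 48.9 | 36.0 | 41.5 | 45.7 | 46.6 | 4.7 | 8.6 | 16.9 | 14.9 | 16.3 | 15.6 | 15.2 | 25.9 |
| Jais-13B-Chat | AR | 0 | 48.2 | 36.2 | 41.4 | 45.2 | 40.8 | 5.4 | 9.6 | 17.7 | 10.7 | 10.6 | 10.7 | 10.7 | 24.5 |
| Jais-13B-Chat | AR | 5 | 49.1 | 36.9 | 42.1 | 46.0 | 50.2 | 19.7 | 28.3 | 38.4 | 14.1 | 15.0 | 14.5 | 14.3 | 32.9 |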

Table 11: LLM results on MSA and DA GEC on the Dev sets of QALB-2014, ZAEBUC, and MADAR CODA. P-Lang is the prompt language, either English (EN) or Arabic (AR). Best average F$_{0.5}$ results for each LLM are underlined; best overall results are in bold.

Appendix E MSA GEC Results

| Model | QALB-2014 P | QALB-2014 R | QALB-2014 F$_{1}$ | QALB-2014 F$_{0.5}$ | ZAEBUC P | ZAEBUC R | ZAEBUC F$_{1}$ | ZAEBUC F$_{0.5}$ |
|---|---|---|---|---|---|---|---|---|
| AraBART | 83.2 | 64.9 | 72.9 | 78.7 | 87.3 | 70.6 | 78.1 | 83.4 |
| AraT5+Morph+GED$^{43}$ | 83.1 | 67.9 | 74.7 | 79.6 | 85.2 | 71.2 | 77.6 | 82.0 |
| AraBART+Morph+GED$^{13}$ | 83.9 | 65.7 | 73.7 | 79.5 | 87.6 | 73.9 | 80.2 | 84.5 |
| GPT-3.5-turbo | 68.6 | 58.6 | 63.2 | 66.3 | 71.0 | 63.5 | 67.1 | 69.4 |
| GPT-4o | 80.7 | 65.7 | 72.4 | 77.2 | 86.5 | 76.8 | 81.3 | 84.3 |
| Fanar | 69.7 | 63.7 | 66.6 | 68.4 | 76.3 | 73.6 | 74.9 | 75.8 |
| Jais-13B-Chat | 49.1 | 36.9 | 42.1 | 46.0 | 50.2 | 19.7 | 28.3 | 38.4 |
| sweet | 81.8 | 68.8 | 74.7 | 78.8 | 85.8 | 72.3 | 78.4 | 82.7 |
| sweet$^{2}$ | 81.9 | 70.4 | 75.7 | 79.3 | 85.8 | 73.3 | 79.1 | 83.0 |
| sweet$^{2}_{\text{NoPnx}}$ + sweet$_{\text{Pnx}}$ | 83.7 | 68.8 | 75.6 | 80.3† | 86.7 | 73.9 | 79.8 | 83.8 |
| 3-Ensemble | 84.9 | 68.8 | 76.0 | 81.1 | 89.6 | 72.8 | 80.3 | 85.6 |
| 4-Ensemble | 89.1 | 61.6 | 72.8 | 81.8‡ | 93.3 | 68.3 | 78.9 | 86.9† |

Table 12: MSA GEC results on the Dev sets of QALB-2014 and ZAEBUC. Best non-ensemble results are underlined; best overall results are in bold. † denotes statistical significance over the best baseline; ‡ denotes statistical significance over both the best baseline and the best non-ensemble model.

| Model | QALB-2014 P | QALB-2014 R | QALB-2014 F$_{1}$ | QALB-2014 F$_{0.5}$ | QALB-2015 P | QALB-2015 R | QALB-2015 F$_{1}$ | QALB-2015 F$_{0.5}$ | ZAEBUC P | ZAEBUC R | ZAEBUC F$_{1}$ | ZAEBUC F$_{0.5}$ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AraBART | 84.0 | 64.7 | 73.1 | 79.3 | 82.0 | 71.7 | 76.5 | 79.7 | 86.0 | 71.6 | 78.2 | 82.7 |
| AraBART+GED$^{43}$ | 84.2 | 65.4 | 73.6 | 79.6 | 81.2 | 72.4 | 76.5 | 79.3 | 85.4 | 72.6 | 78.5 | 82.5 |
| AraBART+Morph+GED$^{43}$ | 83.9 | 65.7 | 73.7 | 79.5 | 82.6 | 72.1 | 77.0 | 80.3 | 85.4 | 73.7 | 79.1 | 82.7 |
| AraBART+GED$^{13}$ | 84.1 | 65.0 | 73.3 | 79.4 | 81.5 | 72.7 | 76.8 | 79.5 | 85.9 | 73.4 | 79.2 | 83.1 |
| GPT-4o | 81.5 | 65.5 | 72.6 | 77.7 | 81.1 | 74.3 | 77.5 | 79.6 | 84.4 | 75.9 | 79.9 | 82.5 |
| sweet$^{2}$ | 82.6 | 69.5 | 75.5 | 79.6 | 80.0 | 74.3 | 77.0 | 78.8 | 85.5 | 74.4 | 79.6 | 83.0 |
| sweet$^{2}_{\text{NoPnx}}$ + sweet$_{\text{Pnx}}$ | 84.5 | 67.7 | 75.2 | 80.5† | 82.2 | 73.6 | 77.7 | 80.3 | 85.7 | 74.1 | 79.5 | 83.1 |
| 3-Ensemble | 85.7 | 67.4 | 75.4 | 81.3 | 83.7 | 73.3 | 78.1 | 81.3 | 89.7 | 73.7 | 80.9 | 85.9 |
| 4-Ensemble | 89.7 | 60.2 | 72.0 | 81.7‡ | 88.3 | 66.7 | 76.0 | 82.9‡ | 93.4 | 68.9 | 79.3 | 87.2‡ |

Table 13: MSA GEC results on the Test sets of QALB-2014, QALB-2015 (L1), and ZAEBUC. Best non-ensemble results are underlined; best overall results are in bold. † denotes statistical significance over the best baseline; ‡ denotes statistical significance over both the best baseline and the best non-ensemble model.

Appendix F DA GEC Results

| Model | P | R | F$_{1}$ | F$_{0.5}$ |
|---|---|---|---|---|
| AraT5 | 86.8 | 77.4 | 81.8 | 84.7 |
| AraT5+City | 87.6 | 79.3 | 83.3 | 85.8 |
| GPT-3.5-turbo | 35.5 | 29.7 | 32.3 | 34.1 |
| GPT-4o | 53.7 | 54.4 | 54.1 | 53.8 |
| Fanar | 24.5 | 28.8 | 26.4 | 25.2 |
| Jais-13B-Chat | 14.1 | 15.0 | 14.5 | 14.3 |
| sweet | 89.1 | 75.5 | 81.7 | 86.0 |
| sweet$^{2}$ | 87.5 | 73.5 | 79.9 | 84.3 |
| 3-Ensemble | 91.7 | 77.4 | 83.9 | 88.4 |
| 4-Ensemble | 93.8 | 72.5 | 81.8 | 88.6‡ |

Table 14: DA GEC results on the MADAR CODA Dev set. Best non-ensemble results are underlined; best overall results are in bold. ‡ denotes statistical significance over both the best baseline and the best non-ensemble model.

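| Model | P | R | F$_{1}$ | F$_{0.5}$ |
|---|---|---|---|---|
| AraT5 | 87.3 | 78.0 | 82.4 | 85.2 |
| AraT5+DA Phrase | 88.4 | 79.0 | 83.4 | 86.3 |
| GPT-4o | 56.1 | 54.8 | 55.5 | 55.9 |
| sweet | 89.4 | 76.6 | 82.5 | 86.5 |
| 3-Ensemble | 92.2 | 77.7 | 84.3 | 88.9 |
| 4-Ensemble | 94.0 | 72.9 | 82.1 | 88.8‡ |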

Table 15: DA GEC results on the Test set of MADAR CODA. Best non-ensemble results are underlined; best overall results are in bold. ‡ denotes statistical significance over both the best baseline and the best non-ensemble model.

Appendix G Error Type Statistics

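| Error type | QALB-2014 Train | QALB-2014 Dev | QALB-2014 Test | ZAEBUC Train | ZAEBUC Dev | ZAEBUC Test | MADAR CODA Train | MADAR CODA Dev | MADAR CODA Test |
|---|---|---|---|---|---|---|---|---|---|
| Delete | 6,442 | 346 | 540 | 305 | 64 | 66 | 35 | 0 | 1 |
| Merge-B | 15,063 | 797 | 795 | 849 | 180 | 133 | 404 | 102 | 95 |
| Merge-I | 15,296 | 812 | 807 | 851 | 180 | 133 | 429 | 109 | 103 |
| M | 1,466 | 69 | 63 | 137 | 32 | 28 | 159 | 42 | 56 |
| M+O | 243 | 17 | 15 | 7 | 1 | 8 | 0 | 0 | 0 |
| O | 141,752 | 7,380 | 6,976 | 3,203 | 695 | 792 | 5,604 | 1,179 | 1,193 |
| O+X | 323 | 24 | 18 | 20 | 0 | 4 | 0 | 0 | 0 |
| P | 11,379 | 598 | 687 | 237 | 51 | 36 | 18 | 10 | 9 |
| S | 5,436 | 247 | 252 | 169 | 36 | 51 | 408 | 77 | 64 |
| X | 13,592 | 809 | 743 | 528 | 110 | 113 | 911 | 166 | 151 |
| Split | 7,828 | 432 | 399 | 49 | 10 | 10 | 279 | 58 | 69 |
| UNK | 6,835 | 331 | 300 | 361 | 78 | 61 | 2,235 | 430 | 649 |
| C | 795,510 | 41,875 | 39,690 | 18,411 | 3,839 | 3,683 | 29,285 | 6,627 | 6,456 |
| Total | 1,021,165 | 53,737 | 51,285 | 25,127 | 5,276 | 5,118 | 39,767 | 8,800 | 8,846 |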

Table 16: Distribution of error types in QALB-2014, ZAEBUC, and MADAR CODA. UNK refers to unknown error types; C refers to correct words.
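To make the skew in these counts concrete, the sketch below computes, for each Dev set, the share of correct words (C) among all words and the share of UNK among erroneous words only. The counts are copied from Table 16; the helper names are ours, and treating "Total minus C" as the number of erroneous words is our reading of the table.

```python
# Dev-set counts from Table 16.
DEV_COUNTS = {
    "QALB-2014": {"UNK": 331, "C": 41875, "Total": 53737},
    "ZAEBUC": {"UNK": 78, "C": 3839, "Total": 5276},
    "MADAR CODA": {"UNK": 430, "C": 6627, "Total": 8800},
}


def correct_share(counts: dict) -> float:
    """Percentage of all words tagged as correct (C)."""
    return 100.0 * counts["C"] / counts["Total"]


def unk_share_of_errors(counts: dict) -> float:
    """Percentage of erroneous words (Total minus C) tagged as UNK."""
    errors = counts["Total"] - counts["C"]
    return 100.0 * counts["UNK"] / errors


if __name__ == "__main__":
    for name, counts in DEV_COUNTS.items():
        print(f"{name}: C = {correct_share(counts):.1f}% of words, "
              f"UNK = {unk_share_of_errors(counts):.1f}% of errors")
```

On these figures, UNK accounts for roughly a fifth of the erroneous words in the MADAR CODA Dev set, compared with under 6% for the two MSA Dev sets.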

Appendix H Prompts

Figure 2: 0-shot prompts used to evaluate LLM performance on MSA and DA GEC. P-Lang is the prompt language, either English (EN) or Arabic (AR).

Figure 3: 5-shot prompts used to evaluate LLM performance on MSA and DA GEC. P-Lang is the prompt language, either English (EN) or Arabic (AR).
