WALAR
📃 Paper | ⚙️ Code | 🤗 Model | 📭 Contact
This repository contains Qwen3-8B-WALAR, a variant of Qwen3-8B trained using the WALAR framework as presented in the paper Mending the Holes: Mitigating Reward Hacking in Reinforcement Learning for Multilingual Translation.
Overview
We propose WALAR, a reinforcement learning method that uses only monolingual text to improve LLMs' translation quality across a massive number of low-resource languages. Our key insight is to mend the holes in current state-of-the-art neural machine translation metrics, since training directly against these metrics amplifies those holes in the trained LLMs. Specifically, WALAR's reward integrates a quality estimation score, a word alignment score, and language alignment to mitigate the reward hacking these holes cause. We trained three LLMs using WALAR. Extensive experiments on over 1,400 language directions demonstrate that our models outperform the strongest prior multilingual model of the same size.
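To make the reward design concrete, here is a minimal sketch of how the three signals might be combined into a single scalar. The weights, score ranges, and the gating rule are illustrative assumptions on our part, not the exact formulation from the paper:

```python
def walar_reward(qe_score, align_score, lang_ok, w_qe=0.5, w_align=0.5):
    """Combine the three reward signals into one scalar (illustrative only).

    qe_score    -- quality estimation score, assumed in [0, 1]
    align_score -- word alignment score, assumed in [0, 1]
    lang_ok     -- True iff the output is detected as the target language
    """
    # Treating language alignment as a hard gate is one way to close a
    # common reward-hacking hole: fluent text in the wrong language can
    # still score well on QE alone, so it receives zero reward here.
    if not lang_ok:
        return 0.0
    return w_qe * qe_score + w_align * align_score
```

With equal weights, a correct-language output with QE 0.8 and alignment 0.6 would receive a reward of 0.7, while a wrong-language output gets 0 regardless of its other scores.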
📐 Experimental Results
📊 FLORES-101
We conducted extensive experiments on FLORES-101 and report xCOMET and MetricX scores for over 1,400 language directions. The results demonstrate that WALAR improves LLM translation quality by a large margin. Comparing Qwen3-8B, Translategemma-4B-it, and LLaMAX3-8B-Alpaca before and after WALAR training, we observe significant average improvements across all metrics, demonstrating that WALAR generalizes across model families.
We also used Gemini 3 Flash as an LLM judge to provide a more comprehensive evaluation of the translations generated by LLaMAX3-8B-Alpaca with and without WALAR. LLaMAX3-8B-Alpaca trained with WALAR outperforms the base model on all language directions, and its average score exceeds 66, which corresponds to translations with only minor issues under the judging rubric.
📄 Language Consistency
To systematically assess an LLM's ability to generate translations in the desired target language, we define the Language Consistency Rate (LCR) as the proportion of test instances whose outputs are identified as being in the correct target language. As shown in the figure below, WALAR also improves language consistency by a large margin, especially for low-resource target languages such as Swahili.
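The LCR defined above is straightforward to compute given a language identifier. In the sketch below, `detect_language` stands in for any off-the-shelf language-ID tool and is a hypothetical helper, not part of this repo:

```python
def language_consistency_rate(outputs, target_lang, detect_language):
    """LCR = fraction of outputs identified as the target language.

    outputs         -- list of model translations (strings)
    target_lang     -- expected language code, e.g. "sw" for Swahili
    detect_language -- callable mapping text -> language code
                       (hypothetical stand-in for a language-ID tool)
    """
    if not outputs:
        return 0.0
    correct = sum(1 for text in outputs
                  if detect_language(text) == target_lang)
    return correct / len(outputs)
```

For example, if 2 of 4 outputs are detected as Swahili, the LCR for that direction is 0.5.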
📈 Generalization of WALAR
Models trained with WALAR also demonstrate strong generalization to language directions unseen during training. These results indicate that the improvements induced by WALAR transfer beyond the training language set, potentially reducing the amount of parallel data and the number of language directions required to train massively multilingual models.
Model Index
We trained three models using WALAR; this repo hosts Qwen3-8B-WALAR, the variant based on Qwen3-8B. The full model index is shown below:
| Model | Link |
|---|---|
| LLaMAX3-8B-Alpaca-WALAR | https://huggingface.co/lyf07/LLaMAX3-8B-Alpaca-WALAR |
| Qwen3-8B-WALAR | https://huggingface.co/lyf07/Qwen3-8B-WALAR |
| Translategemma-4B-it-WALAR | https://huggingface.co/lyf07/Translategemma-4B-it-WALAR |
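As a usage sketch, the model can be loaded with the standard Hugging Face `transformers` API. The instruction template below is our illustrative assumption; check the model card or paper for the exact prompt format used in training:

```python
def build_translation_prompt(source_text, src_lang, tgt_lang):
    """Format a translation request for the model.

    NOTE: this instruction template is an illustrative assumption,
    not the exact prompt used to train Qwen3-8B-WALAR.
    """
    return (f"Translate the following {src_lang} text into {tgt_lang}.\n"
            f"{src_lang}: {source_text}\n"
            f"{tgt_lang}:")

# The formatted prompt is then passed to the model, e.g.:
#   from transformers import AutoModelForCausalLM, AutoTokenizer
#   tok = AutoTokenizer.from_pretrained("lyf07/Qwen3-8B-WALAR")
#   model = AutoModelForCausalLM.from_pretrained("lyf07/Qwen3-8B-WALAR")
#   inputs = tok(build_translation_prompt("Hello", "English", "Swahili"),
#                return_tensors="pt")
#   print(tok.decode(model.generate(**inputs, max_new_tokens=128)[0]))
```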