Model Card for double7/Qwen2.5-7B-MT-GRRM-Optimized

Model Details

Model Description

double7/Qwen2.5-7B-MT-GRRM-Optimized is a multilingual Machine Translation (MT) model post-trained with Group Relative Policy Optimization (GRPO) using GRRM (Group Relative Reward Model) as the reward provider. The training goal is to improve translation quality, especially on challenging, reasoning-intensive translation cases, by leveraging groupwise relative reward signals that provide fine-grained intra-group ranking feedback.

The model is initialized from Qwen2.5-7B, then:

  1. Cold-started via SFT on Chinese–English data, with LLM-annotated, CoT-style comparative reasoning supervision for translation with reasoning.
  2. Optimized via GRPO on multilingual MT data (TowerBlocks, ~150k samples spanning 10 languages) using GRRM as the reference-free reward model, with Cross-Lingual Augmentation (CLA) enabled.
  • Model type: Causal Language Model (Instruction-tuned / MT-oriented post-training)
  • Primary use: Machine Translation (multilingual, En↔X / Zh↔En emphasized)
  • Language(s): English, Portuguese, Spanish, French, German, Dutch, Italian, Russian, Chinese (and potentially other languages, but not guaranteed)
  • License: Apache License 2.0
  • Finetuned from model: Qwen2.5-7B

Model Sources

Uses

Direct Use

This model is intended for translation-with-reasoning, including:

  • General-domain MT across multiple language pairs (e.g., En↔De/Fr/Es/Pt/It/Nl/Ru/Zh).
  • Challenging MT scenarios where reasoning about ambiguity, localization, idioms, discourse coherence, or subtle adequacy issues is required.

Input / Output Format

Input format

Format the input as an instruction-style MT prompt with explicit reasoning request (See the example below). Wrap the source text in a fenced code block to avoid formatting ambiguity and to support multi-line inputs.
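The described prompt format can be assembled with a small helper. This is a sketch based on the example prompt shown in the quickstart below; the function name and defaults are illustrative, not part of any official API.

```python
def build_prompt(text: str, src: str = "English", tgt: str = "Chinese") -> str:
    # Instruction-style MT prompt with an explicit reasoning request,
    # wrapping the (possibly multi-line) source text in a fenced code block.
    return (
        f"Translate the following text from {src} into {tgt}. "
        "Perform a step by step analysis and output the final translation in a code block.\n\n"
        f"Source text:\n```\n{text}\n```\n"
    )
```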

Output format

The model typically outputs:

  1. A step-by-step analysis section in Markdown, explaining idioms, ambiguity resolution, and translation choices.
  2. The final translation wrapped in a fenced code block, as requested by the prompt.

Bias, Risks, and Limitations

  • Language and domain coverage: Although trained with multilingual GRPO, quality can degrade for low-resource languages, specialized terminology, or very creative text.
  • Prompt sensitivity: Output style may vary with prompting. For consistent outputs, standardize prompts.
  • Reward-model-induced bias: Optimization uses GRRM (a learned judge). While GRRM improves ranking sensitivity in groups, it may still encode preferences (e.g., fluency/style) and occasionally mis-rank edge cases, which can influence the final MT model.

How to Get Started with the Model

For high-throughput inference with vLLM and the official parsing script, please refer to our code repository.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_PATH = "double7/Qwen2.5-7B-MT-GRRM-Optimized"

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH, device_map="auto", torch_dtype="auto"
)

prompt = """Translate the following text from English into Chinese. Perform a step by step analysis and output the final translation in a code block.

Source text:
```
The grass is always greener on the other side.
```
"""

messages = [{"role": "user", "content": prompt}]
input_text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        do_sample=False,
        max_new_tokens=8192,
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Output Example
### Step-by-step Analysis

1.  **"The grass is always greener on the other side."**: This is a well-known English idiom. A literal, word-for-word translation would be unnatural in Chinese. The core meaning is that people often feel their current situation is not as good as they perceive another person's or another situation to be.

2.  **Identifying the Chinese equivalent**: The most common and idiomatic translation for this concept in Chinese is "这山望着那山高" (zhè shān wàngzhe nà shān gāo), which literally means "To look at the other mountain, thinking it's higher." This perfectly captures the essence of wanting what one doesn't have.

3.  **Exploring other options**:
    *   "彼岸的草总是更绿" (Bǐ'àn de cǎo zǒngshì gèng lǜ): A more literal, poetic translation, "The grass on the other shore is always greener," which is also understandable but not as common in everyday speech as the idiom.
    *   "隔岸观火" (gé àn guān huǒ): This idiom means "to watch a fire from the other bank," implying a passive, detached, or envious observation, but it doesn't fully capture the comparison between one's own situation and another's.

4.  **Final Selection**: "这山望着那山高" is the most idiomatic and widely used equivalent. A slightly more formal and written version is "人往高处走,水往低处流" (rén wǎng gāochù zǒu, shuǐ wǎng dīchù liú), meaning "People strive for higher ground, water flows to lower ground," which also conveys a similar message. However, "这山望着那山高" is the most direct and natural translation.

### Final Translation

```
这山望着那山高。
```

Training Details

Training Data

  • SFT cold-start (Zh–En): Chinese–English subset of TowerBlocks, supervised with LLM-annotated reasoning and translation signals.
  • GRPO stage (multilingual): TowerBlocks multilingual translation data covering 10 languages, about 150k training samples.

Training Procedure (High-level)

  1. SFT (cold start): Initialize translation and basic reasoning behaviors (Zh–En).
  2. GRPO w/ GRRM feedback: Sample groups of candidate translations per source, score/rank them with GRRM (groupwise), compute advantages within-group, and update the policy to prefer better candidates—targeting improved reasoning ability and robustness on challenging MT cases.
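The within-group advantage computation described in step 2 can be sketched as follows. This is a simplified illustration under the card's stated settings (group of rollouts per prompt, standard-deviation normalization removed); it is not the authors' implementation.

```python
def group_relative_advantages(rewards: list[float]) -> list[float]:
    # rewards: groupwise scores (e.g. from GRRM) for the candidate
    # translations sampled for one source sentence.
    # Advantage = reward minus the group mean; per the card's RL
    # hyperparameters, standard-deviation normalization is removed.
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]
```

Candidates scored above the group mean get a positive advantage and are reinforced; those below are suppressed, so the policy shifts toward the better translations within each group.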

Training Hyperparameters

  • Hardware: 16 × NVIDIA A100 (80GB)

SFT (policy cold-start)

  • Epochs: 2
  • Global batch size: 64
  • LR scheduler: cosine
  • Peak learning rate: 1e-5
  • Warmup ratio: 0.1

Reinforcement Learning (policy optimization with GRRM)

  • RL algorithm: GSPO (Group Sequence Policy Optimization), with additional stabilization enhancements (see paper appendix)
  • Epochs: 1
  • Learning rate: 1e-5
  • LR scheduler: constant
  • Rollouts per prompt: 4
  • Length control: max 4096 tokens, soft length penalty with overlong buffer = 2048 tokens
  • Total batch size: 512
  • PPO mini-batch size: 128
  • KL penalty: disabled (no KL divergence penalty)
  • Advantage normalization: standard deviation normalization removed
  • Reward scaling: scaled to [0, 0.1] for stability
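The soft length penalty with an overlong buffer can be sketched as a piecewise-linear function: no penalty until the buffer begins, a linearly growing penalty inside the buffer, and a capped penalty beyond the maximum length. The exact shape and the penalty magnitude (here matched to the [0, 0.1] reward scale) are assumptions, not taken from the paper.

```python
def soft_length_penalty(length: int, max_len: int = 4096, buffer: int = 2048) -> float:
    # Sketch of a soft overlong penalty (assumed form):
    # zero up to max_len - buffer, linear inside the buffer,
    # capped at the full penalty once max_len is reached.
    start = max_len - buffer
    if length <= start:
        return 0.0
    if length >= max_len:
        return -0.1
    return -0.1 * (length - start) / buffer
```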

Evaluation

MT performance on WMT and Seed-X-Challenge benchmarks. We report BLEURT-20 and LLM-as-a-Judge scores (evaluated by DeepSeek-R1-0528). Optimizing with GRRM via GRPO significantly improves the translation quality and reasoning capabilities of the base model.

| Model | WMT Zh→En (BLEURT / R1) | WMT En→Zh (BLEURT / R1) | WMT En→X (BLEURT / R1) | Seed-X Zh→En (BLEURT / R1) | Seed-X En→Zh (BLEURT / R1) |
|---|---|---|---|---|---|
| **General LLMs** | | | | | |
| Gemini-2.5-Pro | 68.66 / 92.92 | 66.00 / 91.31 | 68.87 / 90.35 | 71.59 / 89.41 | 69.19 / 86.06 |
| DeepSeek-R1-0528 | 67.78 / 92.34 | 64.87 / 89.24 | 67.72 / 88.48 | 70.92 / 87.95 | 68.23 / 84.40 |
| Qwen2.5-7B-Instruct | 67.31 / 88.49 | 59.92 / 80.51 | 58.72 / 72.51 | 66.59 / 79.23 | 62.75 / 72.37 |
| **Specialized Models** | | | | | |
| TowerInstruct-13B | 67.56 / 84.83 | 62.92 / 77.63 | 66.61 / 82.68 | 63.32 / 69.54 | 63.46 / 71.17 |
| SeedX-PPO | 69.02 / 90.47 | 67.21 / 87.98 | 68.35 / 86.04 | 69.37 / 82.47 | 68.72 / 80.56 |
| SSR-X-Zero-7B | 68.30 / 88.67 | 66.12 / 83.78 | - / - | 68.84 / 81.15 | 67.08 / 77.56 |
| Qwen2.5-7B-SFT | 67.07 / 87.78 | 59.99 / 76.98 | 57.14 / 67.91 | 67.65 / 80.91 | 62.36 / 72.42 |
| ⭐ + GRPO | 67.41 / 92.24 | 64.80 / 87.80 | 64.65 / 83.86 | 69.55 / 85.90 | 67.05 / 82.55 |
| ⭐ + GRPO w/ CLA | 67.39 / 92.09 | 63.91 / 88.29 | 64.50 / 83.71 | 69.25 / 88.58 | 67.07 / 83.33 |

Citation

@article{yang2026grrmgrouprelativereward,
      title={GRRM: Group Relative Reward Modeling for Machine Translation}, 
      author={Sen Yang and Shanbo Cheng and Lu Xu and Jianbing Zhang and Shujian Huang},
      year={2026},
      eprint={2602.14028},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2602.14028},
}