Model Card for double7/Qwen2.5-7B-MT-GRRM-Optimized

Model Details

Model Description

double7/Qwen2.5-7B-MT-GRRM-Optimized is a multilingual Machine Translation (MT) model post-trained with Group Relative Policy Optimization (GRPO) using GRRM (Group Relative Reward Model) as the reward provider. The training goal is to improve translation quality, especially on challenging, reasoning-intensive translation cases, by leveraging groupwise relative reward signals that provide fine-grained intra-group ranking feedback.

The model is initialized from Qwen2.5-7B, then:

  1. Cold-started via SFT on Chinese–English data, with LLM-annotated, CoT-style comparative reasoning supervision for translation with reasoning.
  2. Optimized via GRPO on multilingual MT data (TowerBlocks, ~150k samples spanning 10 languages) using GRRM as the reference-free reward model, with Cross-Lingual Augmentation (CLA) enabled.
  • Model type: Causal Language Model (Instruction-tuned / MT-oriented post-training)
  • Primary use: Machine Translation (multilingual, En↔X / Zh↔En emphasized)
  • Language(s): English, Portuguese, Spanish, French, German, Dutch, Italian, Russian, Chinese (and potentially other languages, but not guaranteed)
  • License: Apache License 2.0
  • Finetuned from model: Qwen2.5-7B

Model Sources

Uses

Direct Use

This model is intended for translation-with-reasoning, including:

  • General-domain MT across multiple language pairs (e.g., En↔De/Fr/Es/Pt/It/Nl/Ru/Zh).
  • Challenging MT scenarios where reasoning about ambiguity, localization, idioms, discourse coherence, or subtle adequacy issues is required.

Input / Output Format

Input format

Format the input as an instruction-style MT prompt with explicit reasoning request (See the example below). Wrap the source text in a fenced code block to avoid formatting ambiguity and to support multi-line inputs.
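The described prompt format can be assembled with a small helper. This is a sketch based on the example prompt shown in the quickstart below; the function name and defaults are illustrative, not part of any official API.

```python
def build_prompt(text: str, src: str = "English", tgt: str = "Chinese") -> str:
    # Instruction-style MT prompt with an explicit reasoning request,
    # wrapping the (possibly multi-line) source text in a fenced code block.
    return (
        f"Translate the following text from {src} into {tgt}. "
        "Perform a step by step analysis and output the final translation in a code block.\n\n"
        f"Source text:\n```\n{text}\n```\n"
    )
```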

Output format

The model typically outputs:

  1. A step-by-step analysis section in Markdown, explaining idioms, ambiguity resolution, and translation choices.
  2. The final translation wrapped in a fenced code block, as requested by the prompt.

Bias, Risks, and Limitations

  • Language and domain coverage: Although trained with multilingual GRPO, quality can degrade for low-resource languages, specialized terminology, or very creative text.
  • Prompt sensitivity: Output style may vary with prompting. For consistent outputs, standardize prompts.
  • Reward-model-induced bias: Optimization uses GRRM (a learned judge). While GRRM improves ranking sensitivity in groups, it may still encode preferences (e.g., fluency/style) and occasionally mis-rank edge cases, which can influence the final MT model.

How to Get Started with the Model

For high-throughput inference with vLLM and the official parsing script, please refer to our code repository.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_PATH = "double7/Qwen2.5-7B-MT-GRRM-Optimized"

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH, device_map="auto", torch_dtype="auto"
)

prompt = """Translate the following text from English into Chinese. Perform a step by step analysis and output the final translation in a code block.

Source text:
```
The grass is always greener on the other side.
```
"""

messages = [{"role": "user", "content": prompt}]
input_text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        do_sample=False,
        max_new_tokens=8192,
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Output Example
### Step-by-step Analysis

1.  **"The grass is always greener on the other side."**: This is a well-known English idiom. A literal, word-for-word translation would be unnatural in Chinese. The core meaning is that people often feel their current situation is not as good as they perceive another person's or another situation to be.

2.  **Identifying the Chinese equivalent**: The most common and idiomatic translation for this concept in Chinese is "这山望着那山高" (zhè shān wàngzhe nà shān gāo), which literally means "To look at the other mountain, thinking it's higher." This perfectly captures the essence of wanting what one doesn't have.

3.  **Exploring other options**:
    *   "彼岸的草总是更绿" (Bǐ'àn de cǎo zǒngshì gèng lǜ): A more literal, poetic translation, "The grass on the other shore is always greener," which is also understandable but not as common in everyday speech as the idiom.
    *   "隔岸观火" (gé àn guān huǒ): This idiom means "to watch a fire from the other bank," implying a passive, detached, or envious observation, but it doesn't fully capture the comparison between one's own situation and another's.

4.  **Final Selection**: "这山望着那山高" is the most idiomatic and widely used equivalent. A slightly more formal and written version is "人往高处走,水往低处流" (rén wǎng gāochù zǒu, shuǐ wǎng dīchù liú), meaning "People strive for higher ground, water flows to lower ground," which also conveys a similar message. However, "这山望着那山高" is the most direct and natural translation.

### Final Translation

```
这山望着那山高。
```

Training Details

Training Data

  • SFT cold-start (Zh–En): Chinese–English subset of TowerBlocks, supervised with LLM-annotated reasoning and translation signals.
  • GRPO stage (multilingual): TowerBlocks multilingual translation data covering 10 languages, about 150k training samples.

Training Procedure (High-level)

  1. SFT (cold start): Initialize translation and basic reasoning behaviors (Zh–En).
  2. GRPO w/ GRRM feedback: Sample groups of candidate translations per source, score/rank them with GRRM (groupwise), compute advantages within-group, and update the policy to prefer better candidates—targeting improved reasoning ability and robustness on challenging MT cases.
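The within-group advantage computation described in step 2 can be sketched as follows. This is a simplified illustration under the card's stated settings (group of rollouts per prompt, standard-deviation normalization removed); it is not the authors' implementation.

```python
def group_relative_advantages(rewards: list[float]) -> list[float]:
    # rewards: groupwise scores (e.g. from GRRM) for the candidate
    # translations sampled for one source sentence.
    # Advantage = reward minus the group mean; per the card's RL
    # hyperparameters, standard-deviation normalization is removed.
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]
```

Candidates scored above the group mean get a positive advantage and are reinforced; those below are suppressed, so the policy shifts toward the better translations within each group.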

Training Hyperparameters

  • Hardware: 16 × NVIDIA A100 (80GB)

SFT (policy cold-start)

  • Epochs: 2
  • Global batch size: 64
  • LR scheduler: cosine
  • Peak learning rate: 1e-5
  • Warmup ratio: 0.1

Reinforcement Learning (policy optimization with GRRM)

  • RL algorithm: GSPO (Group Sequence Policy Optimization), with additional stabilization enhancements (see paper appendix)
  • Epochs: 1
  • Learning rate: 1e-5
  • LR scheduler: constant
  • Rollouts per prompt: 4
  • Length control: max 4096 tokens, soft length penalty with overlong buffer = 2048 tokens
  • Total batch size: 512
  • PPO mini-batch size: 128
  • KL penalty: disabled (no KL divergence penalty)
  • Advantage normalization: standard deviation normalization removed
  • Reward scaling: scaled to [0, 0.1] for stability
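The soft length penalty with an overlong buffer can be sketched as a piecewise-linear function: no penalty until the buffer begins, a linearly growing penalty inside the buffer, and a capped penalty beyond the maximum length. The exact shape and the penalty magnitude (here matched to the [0, 0.1] reward scale) are assumptions, not taken from the paper.

```python
def soft_length_penalty(length: int, max_len: int = 4096, buffer: int = 2048) -> float:
    # Sketch of a soft overlong penalty (assumed form):
    # zero up to max_len - buffer, linear inside the buffer,
    # capped at the full penalty once max_len is reached.
    start = max_len - buffer
    if length <= start:
        return 0.0
    if length >= max_len:
        return -0.1
    return -0.1 * (length - start) / buffer
```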

Evaluation

MT performance on WMT and Seed-X-Challenge benchmarks. We report BLEURT-20 and LLM-as-a-Judge scores (evaluated by DeepSeek-R1-0528). Optimizing with GRRM via GRPO significantly improves the translation quality and reasoning capabilities of the base model.

| Model | WMT Zh→En (BLEURT / R1) | WMT En→Zh (BLEURT / R1) | WMT En→X (BLEURT / R1) | Seed-X Zh→En (BLEURT / R1) | Seed-X En→Zh (BLEURT / R1) |
|---|---|---|---|---|---|
| **General LLMs** | | | | | |
| Gemini-2.5-Pro | 68.66 / 92.92 | 66.00 / 91.31 | 68.87 / 90.35 | 71.59 / 89.41 | 69.19 / 86.06 |
| DeepSeek-R1-0528 | 67.78 / 92.34 | 64.87 / 89.24 | 67.72 / 88.48 | 70.92 / 87.95 | 68.23 / 84.40 |
| Qwen2.5-7B-Instruct | 67.31 / 88.49 | 59.92 / 80.51 | 58.72 / 72.51 | 66.59 / 79.23 | 62.75 / 72.37 |
| **Specialized Models** | | | | | |
| TowerInstruct-13B | 67.56 / 84.83 | 62.92 / 77.63 | 66.61 / 82.68 | 63.32 / 69.54 | 63.46 / 71.17 |
| SeedX-PPO | 69.02 / 90.47 | 67.21 / 87.98 | 68.35 / 86.04 | 69.37 / 82.47 | 68.72 / 80.56 |
| SSR-X-Zero-7B | 68.30 / 88.67 | 66.12 / 83.78 | - / - | 68.84 / 81.15 | 67.08 / 77.56 |
| Qwen2.5-7B-SFT | 67.07 / 87.78 | 59.99 / 76.98 | 57.14 / 67.91 | 67.65 / 80.91 | 62.36 / 72.42 |
| ⭐ + GRPO | 67.41 / 92.24 | 64.80 / 87.80 | 64.65 / 83.86 | 69.55 / 85.90 | 67.05 / 82.55 |
| ⭐ + GRPO w/ CLA | 67.39 / 92.09 | 63.91 / 88.29 | 64.50 / 83.71 | 69.25 / 88.58 | 67.07 / 83.33 |

Citation

@article{yang2026grrmgrouprelativereward,
      title={GRRM: Group Relative Reward Modeling for Machine Translation}, 
      author={Sen Yang and Shanbo Cheng and Lu Xu and Jianbing Zhang and Shujian Huang},
      year={2026},
      eprint={2602.14028},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2602.14028},
}