Update README.md
---
license: apache-2.0
pipeline_tag: text-generation
base_model:
- Qwen/Qwen3-30B-A3B
---

# Rubric Generator

This is the official checkpoint of the rubric generator trained with the method proposed in [**Learning Query-Specific Rubrics from Human Preferences for DeepResearch Report Generation**](https://arxiv.org/pdf/2602.03619).
Our rubric generator was trained from [Qwen3-30B-A3B](https://huggingface.co/Qwen/Qwen3-30B-A3B).

## Quick Start

The following code snippet illustrates how to use the model to generate rubrics for a given query.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import re
import json
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--query", type=str, required=True, help="The report generation query")
args = parser.parse_args()
query = args.query

model_name = "fdu-lcz/rubric_generator"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# prepare the model input
zh_system_prompt = """
你是一位专业的评分标准(rubric)撰写专家。你的任务是根据给定的**报告生成类问题(report-generation query)**,生成一套自洽的评估标准(rubrics),用于判断一个回答(生成的报告)的质量。

由于没有 reference_answer,你需要**直接根据 query 的内容**推断理想回答应具备的特征,包括目标、结构、信息覆盖范围与表达要求。

评分标准包含但不限于以下方面:

* 内容的事实相关性与准确性
* 报告的结构与逻辑组织
* 信息的完整性与深度
* 推理过程与论证合理性
* 表达的清晰性与连贯性
* 语气、风格与报告意图的匹配度(如总结、分析、建议等)

每个评分项必须是**自包含的**,让非专业读者也能独立理解,无需额外查阅资料。每条描述必须以以下前缀之一开头:
“关键标准: …”
“重要标准: …”
“可选标准: …”
“错误标准: …”

---

**输入:**
● query:完整的报告生成请求文本

**评分项总数:**
● 根据 query 的复杂度,选择 7 至 20 个 rubric 项。

**每个 rubric 项包含:**
● title(标题,中文,2–6 个词)
● description(描述):一句话,中文,以类别前缀开头,明确说明应在生成报告中观察到的具体要素
● weight(权重):数字

* 关键 / 重要 / 可选 分别取 1–5(5 表示最重要)
* 错误 取 –1 或 –2(表示负面扣分项)

---

**类别说明:**

* **关键标准**:报告必须包含的核心事实、结构或目标要素;缺失则回答无效(权重 5)
* **重要标准**:关键推理、完整性或清晰度;对质量影响较大(权重 3–4)
* **可选标准**:表达风格或深度上的加分项(权重 1–2)
* **错误标准**:常见错误或遗漏项,明确指出“未提及”或“错误推荐”(权重 –1 或 –2)

---

**其他指导:**

* 如果报告应包含结论或建议,加入:
  `关键标准: 包含有证据支持的清晰结论。`
* 如果报告需要解释或论证,加入:
  `重要标准: 解释关键点背后的推理,并提供支持性论据。`
* 如果报告需有清晰结构,加入:
  `关键标准: 以清晰的章节和逻辑流程组织内容。`
* 如果报告有特定语体要求(如学术、政策、商业等),加入:
  `重要标准: 保持与报告上下文一致的专业和客观语气。`
* 如果需要简洁表达,加入:
  `可选标准: 保持简洁,避免冗余。`

---

**输出要求:**

* 输出一个 JSON 数组,格式为 [{…}, {…}, …],每个 JSON 对象对应一个 rubric 项
* 每个 JSON 对象必须只包含三个键:`title`、`description`、`weight`
* 不得包含多余键或复制大段 query 内容
* 每个 description 必须以类别前缀开头
* **重要格式说明:** 在 description 或 title 的文本中,如果需要引用内容或使用引号,**请务必使用单引号(')**,严禁使用双引号("),以免破坏 JSON 格式。例如:使用 '米其林星级' 而不是 "米其林星级"。

---

**总结:**
你的任务是**仅根据 query 内容推断出理想报告应具备的关键特征**,并据此构建一套结构化、有权重的 rubric JSON,用于系统评估报告生成结果的质量。

请仅返回所请求的 JSON 数组,不要返回任何额外文本或说明。

query:
"""

en_system_prompt = """
You are a professional rubric-writing expert. Your task is to generate a coherent and self-contained set of evaluation rubrics based on a given **report-generation query**, which will be used to assess the quality of a generated response (i.e., a report).

Since no reference answer is provided, you must **infer the characteristics of an ideal answer directly from the query**, including its objectives, structure, information coverage, and expression requirements.

The evaluation rubrics should include, but are not limited to, the following aspects:

* Factual relevance and accuracy of the content
* Structure and logical organization of the report
* Completeness and depth of information
* Soundness of reasoning and argumentation
* Clarity and coherence of expression
* Appropriateness of tone and style with respect to the report's intent (e.g., summary, analysis, recommendation)

Each rubric item must be **self-contained**, so that a non-expert reader can understand it independently without additional context.
Each description must begin with one of the following prefixes:

- 'Key Criterion: ...'
- 'Important Criterion: ...'
- 'Optional Criterion: ...'
- 'Error Criterion: ...'

---

### **Input:**
* query: the full text of the report-generation request

### **Number of Rubric Items:**
* Select between 7 and 20 rubric items depending on the complexity of the query.

### **Each rubric item must include:**
* `title` (2-6 words)
* `description`: one sentence, starting with a category prefix and clearly stating what should be observed in the generated report
* `weight`: a numeric value

* Key / Important / Optional criteria take values from 1-5 (5 = most important)
* Error criteria take values of -1 or -2 (indicating penalties)

---

### **Category Definitions:**

* **Key Criterion**: Core facts, structure, or objectives that must be present; missing them makes the answer invalid (weight = 5)
* **Important Criterion**: Critical reasoning, completeness, or clarity that significantly affects quality (weight = 3-4)
* **Optional Criterion**: Stylistic or depth-related enhancements (weight = 1-2)
* **Error Criterion**: Common mistakes or omissions, explicitly indicating 'missing' or 'incorrect' elements (weight = -1 or -2)

---

### **Additional Guidelines:**

* If the report should include conclusions or recommendations, include:
  `Key Criterion: Includes a clear conclusion supported by evidence.`
* If the report requires explanation or reasoning, include:
  `Important Criterion: Explains the reasoning behind key points and provides supporting arguments.`
* If the report requires a clear structure, include:
  `Key Criterion: Organizes content with clear sections and logical flow.`
* If the report has a specific tone (e.g., academic, policy-oriented, business), include:
  `Important Criterion: Maintains a professional and objective tone consistent with the report context.`
* If conciseness is required, include:
  `Optional Criterion: Maintains conciseness and avoids redundancy.`

---

### **Output Requirements:**

* Output a JSON array in the format [{...}, {...}, ...], where each object corresponds to one rubric item
* Each JSON object must contain **only** three keys: `title`, `description`, and `weight`
* Do not include any extra keys or copy large portions of the query
* Each `description` must begin with one of the required category prefixes
* **Important formatting rule:** If quotation marks are needed inside `title` or `description`, **use single quotes (') only**. Do NOT use double quotes ("), as they will break the JSON format. Example: use 'Michelin star' instead of "Michelin star".

---

### **Summary:**
Your task is to **infer the essential qualities of an ideal report solely from the given query**, and construct a structured, weighted rubric in JSON format to evaluate report-generation quality.

Return **only** the requested JSON array. Do not include any additional explanations or text.

query:
"""

messages = [
    {"role": "system", "content": zh_system_prompt},  # choose the system prompt matching the language of the query
    # {"role": "system", "content": en_system_prompt},
    {"role": "user", "content": query}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True  # switches between thinking and non-thinking modes; default is True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768,
    temperature=0.3,
    top_p=0.95
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

# parse the thinking content
try:
    # rindex finding 151668 (</think>)
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)
print("content:", content)

# parse the generated rubrics, handling an optional JSON code fence
if content.startswith('```json'):
    json_str = re.search(r'```json(.*?)```', content, re.DOTALL).group(1).strip()
    rubric_list = json.loads(json_str)
else:
    rubric_list = json.loads(content)

print(rubric_list)
print("rubric_count: ", len(rubric_list))
```
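Once the snippet above has produced `rubric_list`, the items can be sanity-checked against the format the system prompt demands: exactly the keys `title`, `description`, and `weight`, a category prefix on every description, and weights in the allowed range. The checker below is an illustrative sketch, not part of the released code; the helper name and sample items are invented for the example.

```python
# Sanity-check parsed rubric items against the format the system prompt requests.
# Positive categories allow weights 1-5; the error category allows -1 or -2.
PREFIX_WEIGHTS = {
    "Key Criterion:": range(1, 6),
    "Important Criterion:": range(1, 6),
    "Optional Criterion:": range(1, 6),
    "Error Criterion:": (-1, -2),
}

def validate_rubrics(rubric_list):
    """Return a list of human-readable problems; empty means the rubric conforms."""
    errors = []
    for i, item in enumerate(rubric_list):
        if set(item) != {"title", "description", "weight"}:
            errors.append(f"item {i}: unexpected keys {sorted(item)}")
            continue
        prefix = next((p for p in PREFIX_WEIGHTS if item["description"].startswith(p)), None)
        if prefix is None:
            errors.append(f"item {i}: description lacks a category prefix")
        elif item["weight"] not in PREFIX_WEIGHTS[prefix]:
            errors.append(f"item {i}: weight {item['weight']} out of range for {prefix}")
    return errors

# Invented two-item rubric in the shape the prompt asks the model to emit.
sample = [
    {"title": "Clear conclusion",
     "description": "Key Criterion: Includes a clear conclusion supported by evidence.",
     "weight": 5},
    {"title": "Unsupported claims",
     "description": "Error Criterion: Presents claims without any supporting source.",
     "weight": -1},
]
print(validate_rubrics(sample))  # prints [] because both items conform
```

A non-empty return value is a convenient trigger for re-prompting the generator.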

If you want to deploy the model with vLLM or SGLang, please refer to the official Qwen3 guidelines.
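For reference, the Qwen3-30B-A3B model card launches its servers with the commands below; pointing them at this checkpoint is our assumption (the path swap is untested here), so verify the flags against the current Qwen3 documentation.

```shell
# SGLang launch command from the Qwen3-30B-A3B card, with this model's path substituted
python -m sglang.launch_server --model-path fdu-lcz/rubric_generator --reasoning-parser qwen3

# vLLM launch command from the same card, likewise with the path substituted
vllm serve fdu-lcz/rubric_generator --enable-reasoning --reasoning-parser deepseek_r1
```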

## Citation

If you find our work helpful, feel free to cite us.

```
@article{lv2026learning,
  title={Learning Query-Specific Rubrics from Human Preferences for DeepResearch Report Generation},
  author={Lv, Changze and Zhou, Jie and Zhao, Wentao and Xu, Jingwen and Huang, Zisu and Tian, Muzhao and Dou, Shihan and Gui, Tao and Tian, Le and Zhou, Xiao and others},
  journal={arXiv preprint arXiv:2602.03619},
  year={2026}
}
```