Add metadata and link to paper

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +108 -45
README.md CHANGED
@@ -1,8 +1,17 @@
  # OpenRubrics/RubricARM-8B-Judge
 
  This is an 8B RubricARM-Judge model, fine-tuned from [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B).
- See our [paper](https://arxiv.org/abs/2602.01511) for more details.
-
 
  ## Usage
  ```python
@@ -20,53 +29,108 @@ Here `rubric` should be generated with a `RubricARM-Rubric`
  JUDGE_PROMPT_TEMPLATE = (
  "You are a fair and impartial judge. Your task is to evaluate 'Response A' and 'Response B' "
  "based on a given instruction and a rubric. You will conduct this evaluation in distinct "
- "phases as outlined below.\n\n"
- "### Phase 1: Compliance Check Instructions\n"
- "First, identify the single most important, objective 'Gatekeeper Criterion' from the rubric.\n"
  "- **A rule is objective (and likely a Gatekeeper) if it can be verified without opinion. "
  "Key examples are: word/paragraph limits, required output format (e.g., JSON validity), "
- "required/forbidden sections, or forbidden content.**\n"
  "- **Conversely, a rule is subjective if it requires interpretation or qualitative judgment. "
  "Subjective rules about quality are NOT Gatekeepers. Examples include criteria like \"be creative,\" "
- "\"write clearly,\" \"be engaging,\" or \"use a professional tone.\"**\n"
- f"Think step-by-step to determine this single most important Gatekeeper, then write a 1–2 sentence explanation of your decision.\n\n"
 
- "### Phase 2: Analyze Each Response\n"
  "Next, for each Gatekeeper Criterion and all other criteria in the rubric, evaluate each "
- "response item by item.\n"
- "For each item, think step-by-step and cite concrete evidence from the response before assigning your judgment.\n\n"
 
- "### Phase 3: Final Judgment Instructions\n"
  "Based on the results from the previous phases, determine the winner using these simple rules. "
- "Provide a final justification explaining your decision first and then give your decision.\n"
- "Think step-by-step to aggregate the findings and make the decision; keep the reasoning explicit and concise.\n\n"
- "---\n"
- "### REQUIRED OUTPUT FORMAT\n"
- "You must follow this exact output format below.\n\n"
- "--- Compliance Check ---\n"
- "Gatekeeper Reasoning: <1–2 sentences citing the relevant rubric text>\n"
- "Identified Gatekeeper Criterion: <e.g., Criterion 1: Must be under 50 words.>\n\n"
- "--- Analysis ---\n"
- "**Response A:**\n"
- "- Criterion 1 [Hard Rule]: Justification: <...>\n"
- "- Criterion 2 [Hard Rule]: Justification: <...>\n"
- "- Criterion 3 [Principle]: Justification: <...>\n"
- "- ... (and so on for all other criteria)\n\n"
- "**Response B:**\n"
- "- Criterion 1 [Hard Rule]: Justification: <...>\n"
- "- Criterion 2 [Hard Rule]: Justification: <...>\n"
- "- Criterion 3 [Principle]: Justification: <...>\n"
- "- ... (and so on for all other criteria)\n\n"
- "--- Final Judgment ---\n"
- # "Aggregation Summary: <Provide a detailed, step-by-step explanation (3–6 sentences) of how the Gatekeeper and other criteria led to the decision>\n"
- "Aggregation Summary: <1–3 sentences explaining how Gatekeeper and other criteria led to the decision>\n"
- "Justification: <...>\n"
- "Winner: <Response A / Response B>\n\n\n"
- "Task to Evaluate:\n"
- "Instruction:\n{instruction}\n\n"
- "Rubric:\n{rubric}\n\n"
- "Response A:\n{response_a}\n\n"
- "Response B:\n{response_b}"
  )
 
  user_text = JUDGE_PROMPT_TEMPLATE.format(
@@ -90,13 +154,12 @@ message = tok.apply_chat_template(
  # ...
  # ...
  ```
-
-
 
 
  If you find our work helpful, please consider citing our paper:
 
- ```
  @misc{xu2026alternating,
  title={Alternating Reinforcement Learning for Rubric-Based Reward Modeling in Non-Verifiable LLM Post-Training},
  author={Ran Xu and Tianci Liu and Zihan Dong and Tony You and Ilgee Hong and Carl Yang and Linjun Zhang and Tao Zhao and Haoyu Wang},
@@ -106,4 +169,4 @@ If you find our work helpful, please consider citing our paper:
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2602.01511},
  }
- ```
 
+ ---
+ pipeline_tag: text-generation
+ library_name: transformers
+ base_model: Qwen/Qwen3-8B
+ tags:
+ - reward-modeling
+ - alignment
+ - rubric-based-evaluation
+ ---
+
  # OpenRubrics/RubricARM-8B-Judge
 
  This is an 8B RubricARM-Judge model, fine-tuned from [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B).
+ It was introduced in the paper [Alternating Reinforcement Learning for Rubric-Based Reward Modeling in Non-Verifiable LLM Post-Training](https://huggingface.co/papers/2602.01511).
 
  ## Usage
  ```python
 
  JUDGE_PROMPT_TEMPLATE = (
  "You are a fair and impartial judge. Your task is to evaluate 'Response A' and 'Response B' "
  "based on a given instruction and a rubric. You will conduct this evaluation in distinct "
+ "phases as outlined below.\n\n"
+ "### Phase 1: Compliance Check Instructions\n"
+ "First, identify the single most important, objective 'Gatekeeper Criterion' from the rubric.\n"
  "- **A rule is objective (and likely a Gatekeeper) if it can be verified without opinion. "
  "Key examples are: word/paragraph limits, required output format (e.g., JSON validity), "
+ "required/forbidden sections, or forbidden content.**\n"
  "- **Conversely, a rule is subjective if it requires interpretation or qualitative judgment. "
  "Subjective rules about quality are NOT Gatekeepers. Examples include criteria like \"be creative,\" "
+ "\"write clearly,\" \"be engaging,\" or \"use a professional tone.\"**\n"
+ f"Think step-by-step to determine this single most important Gatekeeper, then write a 1–2 sentence explanation of your decision.\n\n"
 
+ "### Phase 2: Analyze Each Response\n"
  "Next, for each Gatekeeper Criterion and all other criteria in the rubric, evaluate each "
+ "response item by item.\n"
+ "For each item, think step-by-step and cite concrete evidence from the response before assigning your judgment.\n\n"
 
+ "### Phase 3: Final Judgment Instructions\n"
  "Based on the results from the previous phases, determine the winner using these simple rules. "
+ "Provide a final justification explaining your decision first and then give your decision.\n"
+ "Think step-by-step to aggregate the findings and make the decision; keep the reasoning explicit and concise.\n\n"
+ "---\n"
+ "### REQUIRED OUTPUT FORMAT\n"
+ "You must follow this exact output format below.\n\n"
+ "--- Compliance Check ---\n"
+ "Gatekeeper Reasoning: <1–2 sentences citing the relevant rubric text>\n"
+ "Identified Gatekeeper Criterion: <e.g., Criterion 1: Must be under 50 words.>\n\n"
+ "--- Analysis ---\n"
+ "**Response A:**\n"
+ "- Criterion 1 [Hard Rule]: Justification: <...>\n"
+ "- Criterion 2 [Hard Rule]: Justification: <...>\n"
+ "- Criterion 3 [Principle]: Justification: <...>\n"
+ "- ... (and so on for all other criteria)\n\n"
+ "**Response B:**\n"
+ "- Criterion 1 [Hard Rule]: Justification: <...>\n"
+ "- Criterion 2 [Hard Rule]: Justification: <...>\n"
+ "- Criterion 3 [Principle]: Justification: <...>\n"
+ "- ... (and so on for all other criteria)\n\n"
+ "--- Final Judgment ---\n"
+ # "Aggregation Summary: <Provide a detailed, step-by-step explanation (3–6 sentences) of how the Gatekeeper and other criteria led to the decision>\n"
+ "Aggregation Summary: <1–3 sentences explaining how Gatekeeper and other criteria led to the decision>\n"
+ "Justification: <...>\n"
+ "Winner: <Response A / Response B>\n\n\n"
+ "Task to Evaluate:\n"
+ "Instruction:\n{instruction}\n\n"
+ "Rubric:\n{rubric}\n\n"
+ "Response A:\n{response_a}\n\n"
+ "Response B:\n{response_b}"
  )
 
  user_text = JUDGE_PROMPT_TEMPLATE.format(
 
  # ...
  # ...
  ```
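The required output format in the prompt ends with a `Winner: <Response A / Response B>` line, so the verdict can be read back from the generated text. The following is a minimal sketch of that step, not part of the released code: `parse_winner` and `WINNER_RE` are hypothetical names, and it assumes the judge emits the `Winner:` line as specified in the template.

```python
import re
from typing import Optional

# Hypothetical helper (an assumption, not from the RubricARM release).
# It looks for lines shaped like "Winner: Response A" or "Winner: B".
WINNER_RE = re.compile(
    r"^\s*Winner:\s*(?:Response\s+)?([AB])\b",
    re.MULTILINE | re.IGNORECASE,
)


def parse_winner(judge_output: str) -> Optional[str]:
    """Return "A" or "B" from the last "Winner:" line, or None if none is found."""
    matches = WINNER_RE.findall(judge_output)
    # Use the last match so that an earlier mention inside the model's
    # reasoning does not override the final verdict.
    return matches[-1].upper() if matches else None


sample = (
    "--- Final Judgment ---\n"
    "Aggregation Summary: Response A violates the word limit.\n"
    "Justification: Only Response B satisfies the Gatekeeper Criterion.\n"
    "Winner: Response B\n"
)
print(parse_winner(sample))
```

Returning `None` rather than raising leaves it to the caller to decide how to treat malformed generations, for example by retrying or discarding the sample.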
 
 
 
+ ## Citation
 
  If you find our work helpful, please consider citing our paper:
 
+ ```bibtex
  @misc{xu2026alternating,
  title={Alternating Reinforcement Learning for Rubric-Based Reward Modeling in Non-Verifiable LLM Post-Training},
  author={Ran Xu and Tianci Liu and Zihan Dong and Tony You and Ilgee Hong and Carl Yang and Linjun Zhang and Tao Zhao and Haoyu Wang},
 
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2602.01511},
  }
+ ```