	# OpenRubrics/RubricARM-8B-Judge

	This is an 8B RubricARM-Judge model, fine-tuned from [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B).
	See our [paper](https://arxiv.org/abs/2602.01511) for more details.


	## Usage
	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer
	model_id = "OpenRubrics/RubricARM-8B-Judge"
	tok = AutoTokenizer.from_pretrained(model_id, use_fast=True)
	model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
	```

	To evaluate the model, build the input message with the following format.

	Here, `rubric` should be generated with a `RubricARM-Rubric` model.

	```python
	JUDGE_PROMPT_TEMPLATE = (
	"You are a fair and impartial judge. Your task is to evaluate 'Response A' and 'Response B' "
	"based on a given instruction and a rubric. You will conduct this evaluation in distinct "
	"phases as outlined below.\n\n"
	"### Phase 1: Compliance Check Instructions\n"
	"First, identify the single most important, objective 'Gatekeeper Criterion' from the rubric.\n"
	"- **A rule is objective (and likely a Gatekeeper) if it can be verified without opinion. "
	"Key examples are: word/paragraph limits, required output format (e.g., JSON validity), "
	"required/forbidden sections, or forbidden content.**\n"
	"- **Conversely, a rule is subjective if it requires interpretation or qualitative judgment. "
	"Subjective rules about quality are NOT Gatekeepers. Examples include criteria like \"be creative,\" "
	"\"write clearly,\" \"be engaging,\" or \"use a professional tone.\"**\n"
	"Think step-by-step to determine this single most important Gatekeeper, then write a 1–2 sentence explanation of your decision.\n\n"

	"### Phase 2: Analyze Each Response\n"
	"Next, for each Gatekeeper Criterion and all other criteria in the rubric, evaluate each "
	"response item by item.\n"
	"For each item, think step-by-step and cite concrete evidence from the response before assigning your judgment.\n\n"

	"### Phase 3: Final Judgment Instructions\n"
	"Based on the results from the previous phases, determine the winner using these simple rules. "
	"Provide a final justification explaining your decision first and then give your decision.\n"
	"Think step-by-step to aggregate the findings and make the decision; keep the reasoning explicit and concise.\n\n"
	"---\n"
	"### REQUIRED OUTPUT FORMAT\n"
	"You must follow this exact output format below.\n\n"
	"--- Compliance Check ---\n"
	"Gatekeeper Reasoning: <1–2 sentences citing the relevant rubric text>\n"
	"Identified Gatekeeper Criterion: <e.g., Criterion 1: Must be under 50 words.>\n\n"
	"--- Analysis ---\n"
	"Response A:\n"
	"- Criterion 1 [Hard Rule]: Justification: <...>\n"
	"- Criterion 2 [Hard Rule]: Justification: <...>\n"
	"- Criterion 3 [Principle]: Justification: <...>\n"
	"- ... (and so on for all other criteria)\n\n"
	"Response B:\n"
	"- Criterion 1 [Hard Rule]: Justification: <...>\n"
	"- Criterion 2 [Hard Rule]: Justification: <...>\n"
	"- Criterion 3 [Principle]: Justification: <...>\n"
	"- ... (and so on for all other criteria)\n\n"
	"--- Final Judgment ---\n"
	# "Aggregation Summary: <Provide a detailed, step-by-step explanation (3–6 sentences) of how the Gatekeeper and other criteria led to the decision>\n"
	"Aggregation Summary: <1–3 sentences explaining how Gatekeeper and other criteria led to the decision>\n"
	"Justification: <...>\n"
	"Winner: <Response A / Response B>\n\n\n"
	"Task to Evaluate:\n"
	"Instruction:\n{instruction}\n\n"
	"Rubric:\n{rubric}\n\n"
	"Response A:\n{response_a}\n\n"
	"Response B:\n{response_b}"
	)

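	# `instruction`, `rubric`, `response_a`, and `response_b` are strings you supply.
	user_text = JUDGE_PROMPT_TEMPLATE.format(
	    instruction=instruction,
	    rubric=rubric,
	    response_a=response_a,
	    response_b=response_b,
	)

	# Single-turn chat: the whole judging prompt goes in one user message.
	messages_list = [
	    {"role": "user", "content": user_text},
	]
	# Render the chat template to a string, with Qwen3 thinking mode disabled.
	message = tok.apply_chat_template(
	    messages_list,
	    tokenize=False,
	    add_generation_prompt=True,
	    enable_thinking=False,
	)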

	# Remaining step: Use either HF or vLLM for evaluation.
	# ...
	# ...
	```
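
The snippet above stops before generation. The following is a minimal sketch of the remaining step using plain Hugging Face `generate`, followed by extracting the verdict from the `Winner:` line of the required output format. The helper names `run_judge` and `parse_winner`, greedy decoding, and the `max_new_tokens=2048` budget are all assumptions made for illustration; none of them come from the model card, and a vLLM pipeline would look different.

```python
import re

# Hypothetical helpers: names and default values are illustrative assumptions.

def run_judge(model, tok, message, max_new_tokens=2048):
    """Generate the judge's output text for one chat-formatted prompt string."""
    # `message` already contains the chat template, so add no extra special tokens.
    inputs = tok(message, return_tensors="pt", add_special_tokens=False).to(model.device)
    # Greedy decoding is an assumption; the model card does not state sampling settings.
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Drop the prompt tokens and decode only the newly generated part.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tok.decode(new_tokens, skip_special_tokens=True)


# Matches "Winner: Response A" or "Winner: B"; an unfilled "<Response A / ...>" does not match.
_WINNER_RE = re.compile(r"Winner:\s*(?:Response\s+)?([AB])\b")


def parse_winner(judge_output):
    """Return "A" or "B" from the last "Winner:" line, or None if no verdict is found."""
    matches = _WINNER_RE.findall(judge_output)
    return matches[-1] if matches else None
```

With `model`, `tok`, and `message` from the snippets above, `parse_winner(run_judge(model, tok, message))` would give `"A"`, `"B"`, or `None` when the output does not follow the expected format. Taking the last match is a design choice so that any earlier mention of a winner in the analysis does not override the final judgment.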




	If you find our work helpful, please consider citing our paper:

	```
	@misc{xu2026alternating,
	title={Alternating Reinforcement Learning for Rubric-Based Reward Modeling in Non-Verifiable LLM Post-Training},
	author={Ran Xu and Tianci Liu and Zihan Dong and Tony You and Ilgee Hong and Carl Yang and Linjun Zhang and Tao Zhao and Haoyu Wang},
	year={2026},
	eprint={2602.01511},
	archivePrefix={arXiv},
	primaryClass={cs.CL},
	url={https://arxiv.org/abs/2602.01511},
	}
	```