Update README.md
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
```

To evaluate the model, please use the following format to build the message:

```python
JUDGE_PROMPT_TEMPLATE = (
    "You are a fair and impartial judge. Your task is to evaluate 'Response A' and 'Response B' "
    "based on a given instruction and a rubric. You will conduct this evaluation in distinct "
    "phases as outlined below.\n\n"
    "### Phase 1: Compliance Check Instructions\n"
    "First, identify the single most important, objective 'Gatekeeper Criterion' from the rubric.\n"
    "- **A rule is objective (and likely a Gatekeeper) if it can be verified without opinion. "
    "Key examples are: word/paragraph limits, required output format (e.g., JSON validity), "
    "required/forbidden sections, or forbidden content.**\n"
    "- **Conversely, a rule is subjective if it requires interpretation or qualitative judgment. "
    "Subjective rules about quality are NOT Gatekeepers. Examples include criteria like \"be creative,\" "
    "\"write clearly,\" \"be engaging,\" or \"use a professional tone.\"**\n\n"
    "### Phase 2: Analyze Each Response\n"
    "Next, for each Gatekeeper Criterion and all other criteria in the rubric, evaluate each "
    "response item by item.\n\n"
    "### Phase 3: Final Judgment Instructions\n"
    "Based on the results from the previous phases, determine the winner using these simple rules. "
    "Provide a final justification explaining your decision first and then give your decision.\n\n"
    "---\n"
    "### REQUIRED OUTPUT FORMAT\n"
    "You must follow this exact output format below.\n\n"
    "--- Compliance Check ---\n"
    "Identified Gatekeeper Criterion: <e.g., Criterion 1: Must be under 50 words.>\n\n"
    "--- Analysis ---\n"
    "**Response A:**\n"
    "- Criterion 1 [Hard Rule]: Justification: <...>\n"
    "- Criterion 2 [Hard Rule]: Justification: <...>\n"
    "- Criterion 3 [Principle]: Justification: <...>\n"
    "- ... (and so on for all other criteria)\n\n"
    "**Response B:**\n"
    "- Criterion 1 [Hard Rule]: Justification: <...>\n"
    "- Criterion 2 [Hard Rule]: Justification: <...>\n"
    "- Criterion 3 [Principle]: Justification: <...>\n"
    "- ... (and so on for all other criteria)\n\n"
    "--- Final Judgment ---\n"
    "Justification: <...>\n"
    "Winner: <Response A / Response B>\n\n\n"
    "Task to Evaluate:\n"
    "Instruction:\n{instruction}\n\n"
    "Rubric:\n{rubric}\n\n"
    "Response A:\n{response_a}\n\n"
    "Response B:\n{response_b}"
)

# instruction, rubric, response_a, and response_b hold the evaluation inputs.
user_text = JUDGE_PROMPT_TEMPLATE.format(
    instruction=instruction,
    rubric=rubric,
    response_a=response_a,
    response_b=response_b,
)

# tok is the tokenizer loaded alongside the model above.
messages_list = [
    {"role": "user", "content": user_text},
]
message = tok.apply_chat_template(
    messages_list,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)

# Remaining step: use either HF or vLLM to run generation on `message`.
# ...
# ...
```

If you find our work helpful, please consider citing our paper:

```