# 🧠 Gemma-3 Fine-tuned on RACE (GRPO via Unsloth)
This model is a fine-tuned version of Gemma-3 trained on the RACE (ReAding Comprehension from Examinations) dataset using Unsloth.
It specializes in multiple-choice reading comprehension tasks that require reasoning and explanation.
The model was optimized using Group Relative Policy Optimization (GRPO) with a custom reward function that encourages correct, confident, and concise answers ending in the form `Final answer: <A–D>`.
## 🧩 Model Details
| Field | Description |
|---|---|
| Base Model | Gemma-3 |
| Fine-tuning Framework | Unsloth |
| Task Type | Multiple-Choice Reading Comprehension |
| Dataset | RACE |
| Language | English |
| Reward Function | Custom GRPO reward (`mc_reward_grpo`) |
| License | Gemma License |
| Training Objective | Reinforcement Fine-tuning for reasoning clarity and correctness |
## ⚙️ Reward Function
```python
import re
import math

# Extracts the letter from "Final answer: X" (or "Final: X"), case-insensitive.
LETTER_RE = re.compile(r'final(?:\s*answer)?\s*:\s*([A-D])', re.IGNORECASE)
# Extracts a stated confidence percentage such as "80%".
CONF_RE = re.compile(r'(\d{1,3})\s*%')

CORRECT = 1.5     # reward for a correct final letter
INCORRECT = -1.0  # reward for an incorrect final letter
MALFORMED = -1.5  # penalty when no final letter can be parsed

def mc_reward_grpo(completions, answer=None, **kwargs):
    """Scalar reward per completion for GRPO training on multiple-choice QA."""
    penalty = float(kwargs.get("malformed_penalty", MALFORMED))
    texts = [str(c.get("content", c)) if isinstance(c, dict) else str(c)
             for c in completions or []]
    # Broadcast the gold answer(s) across all completions in the group.
    golds = [answer] if isinstance(answer, str) else list(answer or [None] * len(texts))
    if golds:
        golds = (golds * math.ceil(len(texts) / len(golds)))[:len(texts)]
    rewards = []
    for txt, g in zip(texts, golds):
        m = LETTER_RE.findall(txt or "")
        conf_match = CONF_RE.search(txt)
        confidence = float(conf_match.group(1)) / 100 if conf_match else 1.0
        length = len(txt.split())
        if not m:
            # Half penalty if the model at least attempted a "Final ..." line.
            reward = penalty * (0.5 if "final" in txt.lower() else 1)
        else:
            pred = m[-1].upper()  # the last stated letter wins
            reward = CORRECT if g is None or pred == str(g).upper() else INCORRECT
        clarity_bonus = max(0, 1 - length / 200)  # favors completions under 200 words
        reward = reward * confidence + 0.3 * clarity_bonus
        rewards.append(round(reward, 3))
    return rewards
```
## 📊 Reward Components
- ✅ **Correctness:** +1.5 for a correct final letter, −1.0 for an incorrect one
- ⚠️ **Malformed penalty:** −1.5 if the completion lacks a `Final answer:` line (halved when the word "final" still appears)
- 💬 **Confidence scaling:** multiplies the reward by any stated confidence percentage (e.g. "80%")
- 🧾 **Clarity bonus:** up to +0.3 for concise completions (fades out at 200 words)
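Putting the components together: for a correct answer stated with 80% confidence in a 16-word completion, the formula works out as follows (plain arithmetic mirroring `mc_reward_grpo`; the specific numbers are an illustrative example, not from the release):

```python
# Worked example: correct letter, "80%" confidence, 16-word completion.
CORRECT = 1.5
confidence = 0.80                         # parsed from "80%"
length = 16                               # words in the completion
clarity_bonus = max(0, 1 - length / 200)  # 1 - 0.08 = 0.92
reward = round(CORRECT * confidence + 0.3 * clarity_bonus, 3)
print(reward)  # 1.476
```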
## 🧪 Training Configuration
| Setting | Value |
|---|---|
| Optimizer | AdamW |
| Learning Rate | 5e-6 |
| Epochs | 1 |
| Reward Type | Scalar |
| Engine | Unsloth Reinforcement Fine-tuning |
| Hardware | 1× NVIDIA T4 (16 GB) |
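For orientation, the table above could be wired up roughly as follows. This is an illustrative sketch based on recent Unsloth/TRL APIs, not the author's actual training script; the model name, `race_train` dataset object, and generation settings are assumptions.

```python
from unsloth import FastLanguageModel
from trl import GRPOConfig, GRPOTrainer

# Load the base model with Unsloth (model name is illustrative).
model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/gemma-3-4b-it",
    max_seq_length=2048,
    load_in_4bit=True,  # fits the 16 GB T4 listed above
)

args = GRPOConfig(
    learning_rate=5e-6,        # matches the table
    num_train_epochs=1,
    per_device_train_batch_size=1,
    num_generations=4,         # completions per prompt, scored as a group
    max_completion_length=256,
)

trainer = GRPOTrainer(
    model=model,
    reward_funcs=[mc_reward_grpo],  # the scalar reward defined above
    args=args,
    train_dataset=race_train,       # assumed: RACE examples formatted as MCQ prompts
)
trainer.train()
```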
## 📈 Training History
*(Figure: reward function training history)*
## 💬 Example Usage
prompt = """
Passage: The fox saw the grapes hanging high and decided they were sour.
Question: What does the phrase "sour grapes" mean?
Choices:
A. The fox liked the grapes.
B. The fox couldnβt reach the grapes, so he pretended not to care.
C. The grapes were actually sour.
D. The fox wanted to share the grapes.
Final answer:
"""
response = model.generate(prompt)
print(response)
**Example Output:**

```text
The phrase "sour grapes" refers to pretending not to care about something you cannot have.
Final answer: B
```
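To consume outputs like this programmatically, the same pattern the reward function uses can recover the chosen letter. A minimal helper (not part of the released code):

```python
import re

# Same pattern as the reward function: "Final answer: X", case-insensitive.
LETTER_RE = re.compile(r'final(?:\s*answer)?\s*:\s*([A-D])', re.IGNORECASE)

def extract_choice(text):
    """Return the last stated answer letter, or None if the output is malformed."""
    letters = LETTER_RE.findall(text)
    return letters[-1].upper() if letters else None

output = ('The phrase "sour grapes" refers to pretending not to care about '
          'something you cannot have.\nFinal answer: B')
print(extract_choice(output))  # B
```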
## ✅ Intended Use
- Educational QA and reading comprehension
- Reasoning-based multiple-choice benchmarks
- Reinforcement learning and model evaluation research
## ⚠️ Limitations
- May produce verbose or hallucinated explanations.
- Limited generalization beyond English comprehension tasks.
- Confidence percentages are heuristic, not calibrated.
- Not suitable for high-stakes automated testing.
## 📚 Citation
If you use this model or the reward function, please cite:
```bibtex
@misc{gemma3_race_unsloth,
  title        = {Gemma-3 Fine-tuned on RACE with Unsloth and GRPO Reward},
  author       = {Samuel Lopera Torres},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/Senshi5620/gemma-3-finetune}},
}
```
## 🧾 Acknowledgements
- The [Unsloth](https://github.com/unslothai/unsloth) team for the fine-tuning framework
- Google for the Gemma-3 base model
- The authors of the RACE dataset
