🧠 Gemma-3 Fine-tuned on RACE (GRPO via Unsloth)

This model is a fine-tuned version of Gemma-3 trained on the RACE (ReAding Comprehension from Examinations) dataset using Unsloth.
It specializes in multiple-choice reading comprehension tasks that require reasoning and explanation.

The model was optimized using Group Relative Policy Optimization (GRPO) with a custom reward function encouraging correct, confident, and concise answers of the form `Final answer: <A–D>`, optionally followed by a confidence percentage (e.g. "90%").

🧩 Model Details

| Field | Description |
|-------|-------------|
| Base Model | Gemma-3 |
| Fine-tuning Framework | Unsloth |
| Task Type | Multiple-Choice Reading Comprehension |
| Dataset | RACE |
| Language | English |
| Reward Function | Custom GRPO (`mc_reward_grpo`) |
| License | Gemma License |
| Training Objective | Reinforcement fine-tuning for reasoning clarity and correctness |

βš™οΈ Reward Function

```python
import re, math

# Parse "Final answer: X" (or "final: X") and an optional confidence like "85%".
LETTER_RE = re.compile(r'final(?:\s*answer)?\s*:\s*([A-D])', re.IGNORECASE)
CONF_RE = re.compile(r'(\d{1,3})\s*%')

CORRECT = 1.5
INCORRECT = -1.0
MALFORMED = -1.5

def mc_reward_grpo(completions, answer=None, **kwargs):
    """Score a batch of completions against the gold answer letter(s)."""
    penalty = float(kwargs.get("malformed_penalty", MALFORMED))
    # Normalize completions to plain strings (chat-style dicts carry "content").
    texts = [str(c.get("content", c)) if isinstance(c, dict) else str(c) for c in completions or []]
    if not texts:
        return []
    # Broadcast a single gold answer (or a shorter list) across the batch.
    golds = [answer] if isinstance(answer, str) else list(answer or [None] * len(texts))
    golds = (golds * math.ceil(len(texts) / len(golds)))[:len(texts)]

    rewards = []
    for txt, g in zip(texts, golds):
        m = LETTER_RE.findall(txt)
        conf_match = CONF_RE.search(txt)
        confidence = float(conf_match.group(1)) / 100 if conf_match else 1.0
        length = len(txt.split())

        if not m:
            # Malformed output; halve the penalty if the model at least wrote "final".
            reward = penalty * (0.5 if "final" in txt.lower() else 1)
        else:
            pred = m[-1].upper()
            reward = CORRECT if g is None or pred == str(g).upper() else INCORRECT

        # Concision bonus decays linearly to zero at 200 words.
        clarity_bonus = max(0, 1 - length / 200)
        reward = reward * confidence + 0.3 * clarity_bonus
        rewards.append(round(reward, 3))

    return rewards
```

πŸ† Reward Components

  • βœ… Correctness: +1.5 for correct answers, βˆ’1.0 for incorrect
  • ⚠️ Malformed penalty: βˆ’1.5 if the answer lacks Final answer:
  • πŸ’¬ Confidence scaling: adjusts reward based on confidence percentage (e.g. β€œ80%”)
  • 🧾 Clarity bonus: rewards concise and focused completions
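
As an illustration, the scoring of a single well-formed completion can be traced by hand (the completion string below is a made-up example, not model output):

```python
import re

# Same patterns as the reward function above.
LETTER_RE = re.compile(r'final(?:\s*answer)?\s*:\s*([A-D])', re.IGNORECASE)
CONF_RE = re.compile(r'(\d{1,3})\s*%')

txt = "The idiom means pretending not to want what you cannot have. Final answer: B (90%)"

pred = LETTER_RE.findall(txt)[-1].upper()               # parsed letter: 'B'
confidence = float(CONF_RE.search(txt).group(1)) / 100  # 0.9
clarity_bonus = max(0, 1 - len(txt.split()) / 200)      # 15 words -> 0.925

# Gold answer 'B' matches, so the base reward is CORRECT (+1.5),
# scaled by confidence and topped up by the clarity bonus.
reward = 1.5 * confidence + 0.3 * clarity_bonus         # 1.35 + 0.2775 = 1.6275
```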

🧪 Training Configuration

| Setting | Value |
|---------|-------|
| Optimizer | AdamW |
| Learning Rate | 5e-6 |
| Epochs | 1 |
| Reward Type | Scalar |
| Engine | Unsloth reinforcement fine-tuning |
| Hardware | 1× T4 (16 GB) |
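
GRPO does not consume these scalar rewards directly: it normalizes them within each group of completions sampled for the same prompt. A minimal sketch of that normalization step (the function name is illustrative, not taken from the training code):

```python
import statistics

def group_relative_advantages(rewards):
    """Convert one group's scalar rewards into mean-centered, std-scaled advantages."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        # A flat group (all completions scored the same) carries no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]
```

For example, a group scored `[1.5, -1.0, 1.5, -1.5]` yields positive advantages for the two correct completions and negative advantages for the rest, which is the signal the policy update pushes on.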

📈 Training History

(Figure: reward-function training history)


💬 Example Usage

```python
prompt = """
Passage: The fox saw the grapes hanging high and decided they were sour.
Question: What does the phrase "sour grapes" mean?
Choices:
A. The fox liked the grapes.
B. The fox couldn't reach the grapes, so he pretended not to care.
C. The grapes were actually sour.
D. The fox wanted to share the grapes.

Final answer:
"""

response = model.generate(prompt)
print(response)
```

Example Output:

```
The phrase "sour grapes" refers to pretending not to care about something you cannot have.
Final answer: B
```
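
When evaluating outputs like the one above, the predicted letter can be recovered with the same regex the reward function uses:

```python
import re

# Same pattern as the reward function's LETTER_RE.
LETTER_RE = re.compile(r'final(?:\s*answer)?\s*:\s*([A-D])', re.IGNORECASE)

output = (
    'The phrase "sour grapes" refers to pretending not to care '
    "about something you cannot have.\nFinal answer: B"
)
matches = LETTER_RE.findall(output)
# Take the last match so a restated question cannot shadow the final answer.
prediction = matches[-1].upper() if matches else None
print(prediction)  # B
```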

✅ Intended Use

  • Educational QA and reading comprehension
  • Reasoning-based multiple-choice benchmarks
  • Reinforcement learning and model evaluation research

⚠️ Limitations

  • May produce verbose or hallucinated explanations.
  • Limited generalization beyond English comprehension tasks.
  • Confidence percentages are heuristic, not calibrated.
  • Not suitable for high-stakes automated testing.

📚 Citation

If you use this model or the reward function, please cite:

```bibtex
@misc{gemma3_race_unsloth,
  title = {Gemma-3 Fine-tuned on RACE with Unsloth and GRPO Reward},
  author = {Samuel Lopera Torres},
  year = {2025},
  howpublished = {\url{https://huggingface.co/Senshi5620/gemma-3-finetune}},
}
```

🧾 Acknowledgements

This model builds on the Gemma-3 base model, the Unsloth fine-tuning framework, and the RACE dataset.

Model size: 1.0B parameters · Tensor type: BF16 (Safetensors)