# 🧠 Gemma-3 Fine-tuned on RACE (GRPO via Unsloth)
This model is a fine-tuned version of Gemma-3 trained on the RACE (ReAding Comprehension from Examinations) dataset using Unsloth.
It specializes in multiple-choice reading comprehension tasks that require reasoning and explanation.
The model was optimized using Group Relative Policy Optimization (GRPO) with a custom reward function that encourages correct, confident, and concise answers ending in the form `Final answer: <A–D>`.
## 🧩 Model Details
| Field | Description |
|---|---|
| Base Model | Gemma-3 |
| Fine-tuning Framework | Unsloth |
| Task Type | Multiple-Choice Reading Comprehension |
| Dataset | RACE |
| Language | English |
| Reward Function | Custom GRPO reward (`mc_reward_grpo`) |
| License | Gemma License |
| Training Objective | Reinforcement Fine-tuning for reasoning clarity and correctness |
## ⚙️ Reward Function
```python
import re
import math

# Extracts the letter from "Final answer: X" (or "Final: X"), case-insensitive.
LETTER_RE = re.compile(r'final(?:\s*answer)?\s*:\s*([A-D])', re.IGNORECASE)
# Extracts a stated confidence percentage such as "80%".
CONF_RE = re.compile(r'(\d{1,3})\s*%')

CORRECT = 1.5     # reward for a correct final letter
INCORRECT = -1.0  # reward for an incorrect final letter
MALFORMED = -1.5  # penalty when no final letter can be parsed

def mc_reward_grpo(completions, answer=None, **kwargs):
    """Scalar reward per completion for GRPO training on multiple-choice QA."""
    penalty = float(kwargs.get("malformed_penalty", MALFORMED))
    texts = [str(c.get("content", c)) if isinstance(c, dict) else str(c)
             for c in completions or []]
    # Broadcast the gold answer(s) across all completions in the group.
    golds = [answer] if isinstance(answer, str) else list(answer or [None] * len(texts))
    if golds:
        golds = (golds * math.ceil(len(texts) / len(golds)))[:len(texts)]
    rewards = []
    for txt, g in zip(texts, golds):
        m = LETTER_RE.findall(txt or "")
        conf_match = CONF_RE.search(txt)
        confidence = float(conf_match.group(1)) / 100 if conf_match else 1.0
        length = len(txt.split())
        if not m:
            # Half penalty if the model at least attempted a "Final ..." line.
            reward = penalty * (0.5 if "final" in txt.lower() else 1)
        else:
            pred = m[-1].upper()  # the last stated letter wins
            reward = CORRECT if g is None or pred == str(g).upper() else INCORRECT
        clarity_bonus = max(0, 1 - length / 200)  # favors completions under 200 words
        reward = reward * confidence + 0.3 * clarity_bonus
        rewards.append(round(reward, 3))
    return rewards
```
## 📊 Reward Components
- ✅ **Correctness:** +1.5 for a correct final letter, −1.0 for an incorrect one
- ⚠️ **Malformed penalty:** −1.5 if the completion lacks a `Final answer:` line (halved when the word "final" still appears)
- 💬 **Confidence scaling:** multiplies the reward by any stated confidence percentage (e.g. "80%")
- 🧾 **Clarity bonus:** up to +0.3 for concise completions (fades out at 200 words)
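Putting the components together: for a correct answer stated with 80% confidence in a 16-word completion, the formula works out as follows (plain arithmetic mirroring `mc_reward_grpo`; the specific numbers are an illustrative example, not from the release):

```python
# Worked example: correct letter, "80%" confidence, 16-word completion.
CORRECT = 1.5
confidence = 0.80                         # parsed from "80%"
length = 16                               # words in the completion
clarity_bonus = max(0, 1 - length / 200)  # 1 - 0.08 = 0.92
reward = round(CORRECT * confidence + 0.3 * clarity_bonus, 3)
print(reward)  # 1.476
```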
## 🧪 Training Configuration
| Setting | Value |
|---|---|
| Optimizer | AdamW |
| Learning Rate | 5e-6 |
| Epochs | 1 |
| Reward Type | Scalar |
| Engine | Unsloth Reinforcement Fine-tuning |
| Hardware | 1× NVIDIA T4 (16 GB) |
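For orientation, the table above could be wired up roughly as follows. This is an illustrative sketch based on recent Unsloth/TRL APIs, not the author's actual training script; the model name, `race_train` dataset object, and generation settings are assumptions.

```python
from unsloth import FastLanguageModel
from trl import GRPOConfig, GRPOTrainer

# Load the base model with Unsloth (model name is illustrative).
model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/gemma-3-4b-it",
    max_seq_length=2048,
    load_in_4bit=True,  # fits the 16 GB T4 listed above
)

args = GRPOConfig(
    learning_rate=5e-6,        # matches the table
    num_train_epochs=1,
    per_device_train_batch_size=1,
    num_generations=4,         # completions per prompt, scored as a group
    max_completion_length=256,
)

trainer = GRPOTrainer(
    model=model,
    reward_funcs=[mc_reward_grpo],  # the scalar reward defined above
    args=args,
    train_dataset=race_train,       # assumed: RACE examples formatted as MCQ prompts
)
trainer.train()
```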
## 📈 Training History
*(Figure: reward function training history)*
## 💬 Example Usage
prompt = """
Passage: The fox saw the grapes hanging high and decided they were sour.
Question: What does the phrase "sour grapes" mean?
Choices:
A. The fox liked the grapes.
B. The fox couldnβt reach the grapes, so he pretended not to care.
C. The grapes were actually sour.
D. The fox wanted to share the grapes.
Final answer:
"""
response = model.generate(prompt)
print(response)
**Example Output:**

```text
The phrase "sour grapes" refers to pretending not to care about something you cannot have.
Final answer: B
```
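To consume outputs like this programmatically, the same pattern the reward function uses can recover the chosen letter. A minimal helper (not part of the released code):

```python
import re

# Same pattern as the reward function: "Final answer: X", case-insensitive.
LETTER_RE = re.compile(r'final(?:\s*answer)?\s*:\s*([A-D])', re.IGNORECASE)

def extract_choice(text):
    """Return the last stated answer letter, or None if the output is malformed."""
    letters = LETTER_RE.findall(text)
    return letters[-1].upper() if letters else None

output = ('The phrase "sour grapes" refers to pretending not to care about '
          'something you cannot have.\nFinal answer: B')
print(extract_choice(output))  # B
```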
## ✅ Intended Use
- Educational QA and reading comprehension
- Reasoning-based multiple-choice benchmarks
- Reinforcement learning and model evaluation research
## ⚠️ Limitations
- May produce verbose or hallucinated explanations.
- Limited generalization beyond English comprehension tasks.
- Confidence percentages are heuristic, not calibrated.
- Not suitable for high-stakes automated testing.
## 📚 Citation
If you use this model or the reward function, please cite:
```bibtex
@misc{gemma3_race_unsloth,
  title        = {Gemma-3 Fine-tuned on RACE with Unsloth and GRPO Reward},
  author       = {Samuel Lopera Torres},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/Senshi5620/gemma-3-finetune}},
}
```
## 🧾 Acknowledgements
- The [Unsloth](https://github.com/unslothai/unsloth) team for the fine-tuning framework
- Google for the Gemma-3 base model
- The authors of the RACE dataset
