Tags: Transformers, Safetensors, trl, grpo, arabic-poetry, classical-arabic, lora

Shaer-adapters-grpo-vnext

This repo is the first patched rerun after Shaer-AI/Shaer-adapters-grpo was reclassified as reward-hacked.

Place in the story

Project sequence:

  1. Shaer-AI/Shaer-adapters: clean SFT baseline
  2. Shaer-AI/Shaer-adapters-grpo: historically strong-looking but reward-hacked GRPO run
  3. Shaer-AI/Shaer-adapters-grpo-vnext: stricter anti-template and artifact-filtering GRPO rerun (this repo)
  4. Shaer-AI/Shaer-adapters-grpo-friend-v1: first judge-centered rerun
  5. Shaer-AI/Shaer-adapters-grpo-friend-v1-easyfirst: easier judge-centered rerun

What changed here

This run introduced the stricter reward patch designed to suppress the reward-hacked behavior of the previous run:

reward_total = meter * count_adherence * arabic_clean * repeat_penalty

with much stronger internals for:

  • artifact-free Arabic filtering
  • lexical plausibility
  • near-duplicate detection
  • opening diversity
  • distinct-2 phrase diversity
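Because the composition is multiplicative, any single component near zero collapses the total reward, which is what makes template-style hacks unprofitable. The sketch below illustrates that composition with toy versions of two of the internals, distinct-2 phrase diversity and a near-duplicate repeat penalty; the helper definitions and thresholds are illustrative assumptions, not the actual training code:

```python
def distinct_2(tokens):
    """Fraction of unique bigrams among all bigrams (toy distinct-2 diversity)."""
    if len(tokens) < 2:
        return 0.0
    bigrams = list(zip(tokens, tokens[1:]))
    return len(set(bigrams)) / len(bigrams)

def repeat_penalty(lines):
    """Toy near-duplicate penalty: share of unique lines in the poem."""
    if not lines:
        return 0.0
    return len(set(lines)) / len(lines)

def reward_total(meter, count_adherence, arabic_clean, rep_pen):
    """Multiplicative composition: one failing component zeroes the reward."""
    return meter * count_adherence * arabic_clean * rep_pen

# A poem that repeats one line scores low on the repeat penalty,
# dragging the whole reward down despite a strong meter score.
print(reward_total(0.9, 1.0, 1.0, repeat_penalty(["a", "a", "a", "b"])))  # 0.45
```

The real components are continuous scores in [0, 1]; the point of the sketch is only the multiplicative gating, which rewards outputs that pass every filter at once.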

Run snapshot

  • starting adapter: Shaer-AI/Shaer-adapters
  • train subset: dropped-trio curated subset, cap 3000 per surviving meter
  • eval bank: full 13-meter eval, 104 rows

Best tracked checkpoint:

  • step: 500
  • eval total: 0.1937
  • eval meter: 0.5652
  • eval count adherence: 0.9099
  • eval judge diagnostic: 0.3774
  • eval repeat penalty: 0.5577
  • eval arabic clean: 0.8750

What this run proved

This stage mattered because it showed the patched anti-template reward was far more reliable at rejecting the old hacked outputs.

But it was still not the final answer:

  • tracked reward was much lower than the old hacked run
  • generation quality was still not strong enough
  • semantic quality still needed to be modeled more directly

Why we moved on

This repo motivated the next shift: bring in a focused Arabic semantic judge that scores whether the poem:

  • has meaning
  • is not garbage
  • is relevant to the description

That next stage was published as Shaer-AI/Shaer-adapters-grpo-friend-v1.

Recommended use

Use this repo as the first serious post-hack reward patch, not as the final recommended GRPO model.
