Tenacious Judge — SimPO LoRA Adapter

A LoRA adapter trained with SimPO (Simple Preference Optimization) on 228 dimension-reasoning preference pairs derived from Tenacious-Bench v0.1.

The adapter teaches Qwen2.5-3B-Instruct to judge B2B outbound sales emails against five machine-verifiable rubric dimensions (D1–D5) and return a structured verdict.


Backbone

unsloth/Qwen2.5-3B-Instruct


Training

| Parameter | Value |
|---|---|
| Method | SimPO via TRL `CPOConfig(loss_type="simpo", cpo_alpha=0.0)` |
| Preference pairs | 228 (114 tasks × 2 pairs: aggressive wrong + subtle wrong) |
| LoRA rank / alpha | 16 / 16, dropout = 0.0 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Trainable params | ~8.7 M (LoRA only) |
| Epochs | 3 |
| Effective batch size | 16 (per-device batch 4 × grad accum 4) |
| Learning rate | 2e-4 |
| beta / gamma | 2.0 / 0.5 |
| Final train loss | 0.0745 |
| Hardware | Google Colab T4 (16 GB VRAM, fp16) |
| Wall time | 14 min |
| Seed | 42 |
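The beta and gamma values above plug into the SimPO objective, which compares length-normalized sequence log-probabilities against a fixed target margin. A minimal sketch of the per-pair loss (an illustration of the objective, not the TRL implementation):

```python
import math

def simpo_loss(logp_chosen, len_chosen, logp_rejected, len_rejected,
               beta=2.0, gamma=0.5):
    """SimPO loss for one preference pair: length-normalized implicit
    rewards, a fixed target margin gamma, and a sigmoid log-loss."""
    r_w = beta * logp_chosen / len_chosen      # implicit reward, chosen
    r_l = beta * logp_rejected / len_rejected  # implicit reward, rejected
    margin = r_w - r_l - gamma
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)
```

Setting `cpo_alpha=0.0` in `CPOConfig` drops the auxiliary NLL term that CPO normally adds, leaving only this preference loss.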

Preference Pair Design

Each task produces two pairs sharing the same chosen (correct D1–D5 reasoning):

  • Pair A — aggressive wrong: all five dimensions marked PASS → wrong PASS verdict
  • Pair B — subtle wrong: correct REJECT verdict but wrong primary failure dimension

This forces the model to learn which dimension fails, not merely whether to REJECT, and keeps the preference signal from leaking through superficial verdict-sentence style.
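Under a hypothetical task schema (the field names below are illustrative, not the dataset's actual columns), pair construction can be sketched as:

```python
def build_pairs(task):
    """Build the two preference pairs per task described above.
    Both pairs share the same chosen response (correct D1-D5 reasoning)."""
    chosen = task["correct_reasoning"]
    pair_a = {                                   # aggressive wrong
        "prompt": task["prompt"],
        "chosen": chosen,
        "rejected": task["all_pass_reasoning"],  # every dimension PASS
    }
    pair_b = {                                   # subtle wrong
        "prompt": task["prompt"],
        "chosen": chosen,
        "rejected": task["wrong_dimension_reasoning"],  # REJECT, wrong D#
    }
    return [pair_a, pair_b]
```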

Chosen format:

```
D1: PASS — grounding facts verified; $14M matches brief
D2: FAIL — Ambiguous segment with product claim
D3: PASS — ok
D4: PASS — ok
D5: PASS — ok
VERDICT: REJECT
Primary failure: D2
```
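A small parser for this output format (an illustrative sketch, not the benchmark's actual harness code) could look like:

```python
import re

DIM = re.compile(r"^(D[1-5]): (PASS|FAIL)", re.MULTILINE)
VERDICT = re.compile(r"^VERDICT: (PASS|REJECT)", re.MULTILINE)
PRIMARY = re.compile(r"^Primary failure: (D[1-5])", re.MULTILINE)

def parse_verdict(text):
    """Extract per-dimension labels, the overall verdict, and the
    primary failure dimension from a structured judge response."""
    dims = dict(DIM.findall(text))
    verdict = VERDICT.search(text)
    primary = PRIMARY.search(text)
    return {
        "dims": dims,
        "verdict": verdict.group(1) if verdict else "UNKNOWN",
        "primary": primary.group(1) if primary else None,
    }
```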

Evaluation (Act IV Ablation)

Evaluated on Tenacious-Bench v0.1 using scripts/analysis/run_ablation.py.

| Partition | n | Base Qwen2.5-3B | Trained judge | Delta |
|---|---|---|---|---|
| dev | 78 | 60.3% (47/78) | 65.5% (36/55\*) | +5.2 pp |
| held_out | 65 | 50.8% (33/65) | 59.4% (19/32\*) | +8.6 pp |

\* The trained model produced UNKNOWN on 23 dev / 33 held-out tasks (see Limitation).

Dimension-Level Errors (held-out, 65 tasks)

| Dimension | Base errors | Trained errors | Change |
|---|---|---|---|
| D1 grounding_fidelity | 13 | 8 | −5 |
| D2 icp_pitch_alignment | 14 | 0 | −14 |
| D3 signal_directionality | 3 | 3 | 0 |
| D4 tone_compliance | 2 | 2 | 0 |

The trained judge eliminated all D2 (ICP alignment) errors and reduced D1 (grounding) errors by 38% (from 13 to 8).


Limitation

The evaluation harness expects a strict VERDICT: PASS or VERDICT: REJECT on the first output line. The trained model sometimes emits richer structured D1–D5 reasoning before the verdict line; the harness cannot parse such responses and scores them UNKNOWN. Because accuracy is computed only over scored (non-UNKNOWN) responses, the +8.6 pp delta likely underestimates true performance: the model is partially abstaining rather than answering wrongly.
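One mitigation (sketched here as an assumption, not the harness's actual logic) is a lenient fallback that scans every output line for a verdict before giving up:

```python
def extract_verdict(text):
    """Lenient fallback parser: accept a VERDICT line anywhere in the
    output instead of only on the first line, recovering responses
    the strict harness would mark UNKNOWN."""
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("VERDICT:"):
            value = line.split(":", 1)[1].strip()
            if value in ("PASS", "REJECT"):
                return value
    return "UNKNOWN"
```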


How to Use

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

backbone = "unsloth/Qwen2.5-3B-Instruct"
adapter  = "Birkity/tenacious-judge"

# Load the backbone first, then attach the LoRA adapter on top of it
tokenizer = AutoTokenizer.from_pretrained(backbone)
model     = AutoModelForCausalLM.from_pretrained(backbone, torch_dtype="auto")
model     = PeftModel.from_pretrained(model, adapter)
model.eval()

prompt = (
    "You are a sales quality judge for Tenacious Intelligence Corporation.\n"
    "Score D1–D5 and return VERDICT: PASS or VERDICT: REJECT.\n\n"
    "Hiring Brief: ...\nGenerated Email: ...\n\n"
    "Start with D1:"
)

# Greedy decoding (do_sample=False) keeps the structured verdict deterministic
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out    = model.generate(**inputs, max_new_tokens=200, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

Dataset

Birkity/tenacious_bench_v0.1 — 257 tasks, four authoring modes, CC-BY-4.0.

Community Post

Reddit — Built a sales judgment benchmark (Tenacious-Bench)


Citation

@misc{tenacious-judge-2026,
  author    = {Birkity Mekasha},
  title     = {Tenacious Judge: SimPO LoRA Adapter for B2B Sales Email Evaluation},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/Birkity/tenacious-judge}
}

Author

Birkity Mekasha — Birkity@10academy.org
Research Partner, Tenacious Intelligence Corporation
10 Academy TRP1 Week 11
