# Tenacious Judge — SimPO LoRA Adapter
A LoRA adapter trained with SimPO (Simple Preference Optimization) on 228 dimension-reasoning preference pairs derived from Tenacious-Bench v0.1.
The adapter teaches Qwen2.5-3B-Instruct to judge B2B outbound sales emails against
five machine-verifiable rubric dimensions (D1–D5) and return a structured verdict.
## Backbone

`unsloth/Qwen2.5-3B-Instruct` (Qwen2.5-3B-Instruct, loaded in fp16), as used in the How to Use snippet below.

## Training
| Parameter | Value |
|---|---|
| Method | SimPO via TRL `CPOConfig(loss_type="simpo", cpo_alpha=0.0)` |
| Preference pairs | 228 (114 tasks x 2 pairs — aggressive wrong + subtle wrong) |
| LoRA rank / alpha | 16 / 16, dropout=0.0 |
| Target modules | `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj` |
| Trainable params | ~8.7 M (LoRA only) |
| Epochs | 3 |
| Effective batch size | 16 (batch=4 x grad_accum=4) |
| Learning rate | 2e-4 |
| beta / gamma | 2.0 / 0.5 |
| Final train loss | 0.0745 |
| Hardware | Google Colab T4 (16 GB VRAM, fp16) |
| Wall time | 14 min |
| Seed | 42 |
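For reference, the β and γ in the table correspond to the published SimPO objective: a sigmoid margin loss over length-normalized sequence log-probabilities, with no reference model (and `cpo_alpha=0.0` disabling TRL's auxiliary NLL term). A minimal sketch of that per-pair loss, with illustrative placeholder log-probabilities rather than real model outputs:

```python
import math

def simpo_loss(logp_chosen, len_chosen, logp_rejected, len_rejected,
               beta=2.0, gamma=0.5):
    """Per-pair SimPO loss: -log sigmoid of the length-normalized reward margin.

    The average log-probability acts as an implicit reward, so no
    reference model is needed.
    """
    reward_chosen = beta * logp_chosen / len_chosen
    reward_rejected = beta * logp_rejected / len_rejected
    margin = reward_chosen - reward_rejected - gamma
    # -log sigmoid(margin) == log(1 + e^{-margin}), written via log1p
    return math.log1p(math.exp(-margin))

# Illustrative numbers: the chosen sequence is more likely per token.
loss = simpo_loss(logp_chosen=-40.0, len_chosen=80,
                  logp_rejected=-90.0, len_rejected=90)
```

With these placeholder values the reward margin is 0.5, so the loss is `log1p(exp(-0.5)) ≈ 0.474`; a well-separated pair drives the margin up and the loss toward zero.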
## Preference Pair Design
Each task produces two pairs sharing the same chosen (correct D1–D5 reasoning):
- Pair A — aggressive wrong: all five dimensions marked PASS → wrong PASS verdict
- Pair B — subtle wrong: correct REJECT verdict but wrong primary failure dimension
This forces the model to learn which dimension fails, not just whether to REJECT, and prevents the preference signal from leaking through the style of the short verdict sentence alone.
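The pairing scheme can be sketched as follows (field names are illustrative, not the actual dataset schema):

```python
def build_pairs(task):
    """Produce the two SimPO pairs for one task, sharing the same chosen text."""
    chosen = task["correct_reasoning"]          # correct D1-D5 reasoning + verdict
    aggressive = task["all_pass_reasoning"]     # all five dimensions marked PASS
    subtle = task["wrong_dimension_reasoning"]  # REJECT, but wrong primary failure
    return [
        {"prompt": task["prompt"], "chosen": chosen, "rejected": aggressive},
        {"prompt": task["prompt"], "chosen": chosen, "rejected": subtle},
    ]

# 114 dummy tasks -> 228 pairs, matching the training table above.
tasks = [{"prompt": f"task-{i}", "correct_reasoning": "c",
          "all_pass_reasoning": "a", "wrong_dimension_reasoning": "s"}
         for i in range(114)]
pairs = [p for t in tasks for p in build_pairs(t)]
```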
Chosen format:

```text
D1: PASS — grounding facts verified; $14M matches brief
D2: FAIL — Ambiguous segment with product claim
D3: PASS — ok
D4: PASS — ok
D5: PASS — ok
VERDICT: REJECT
Primary failure: D2
```
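A hedged sketch of how a verdict block in this format could be parsed (an illustration of the format above, not the project's actual harness):

```python
import re

def parse_verdict(text):
    """Extract per-dimension results, the verdict, and the primary failure."""
    dims = dict(re.findall(r"^(D[1-5]):\s*(PASS|FAIL)", text, re.MULTILINE))
    verdict = re.search(r"^VERDICT:\s*(PASS|REJECT)", text, re.MULTILINE)
    primary = re.search(r"^Primary failure:\s*(D[1-5])", text, re.MULTILINE)
    return {
        "dims": dims,
        "verdict": verdict.group(1) if verdict else "UNKNOWN",
        "primary_failure": primary.group(1) if primary else None,
    }

sample = """D1: PASS — grounding facts verified; $14M matches brief
D2: FAIL — Ambiguous segment with product claim
D3: PASS — ok
D4: PASS — ok
D5: PASS — ok
VERDICT: REJECT
Primary failure: D2"""
result = parse_verdict(sample)
```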
## Evaluation (Act IV Ablation)
Evaluated on Tenacious-Bench v0.1 using `scripts/analysis/run_ablation.py`.
| Partition | n | Base Qwen2.5-3B | Trained Judge | Delta |
|---|---|---|---|---|
| dev | 78 | 60.3% (47/78) | 65.5% (36/55*) | +5.2% |
| held_out | 65 | 50.8% (33/65) | 59.4% (19/32*) | +8.6% |
* Trained model produced UNKNOWN on 23 dev / 33 held-out tasks (see Limitation).
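The accuracy convention behind the asterisked cells (accuracy over scored, i.e. non-UNKNOWN, responses only) can be reproduced directly from the table's numbers:

```python
def scored_accuracy(correct, total, unknown):
    """Accuracy over scored (non-UNKNOWN) responses only."""
    scored = total - unknown
    return correct / scored, scored

# dev: 36 correct out of 78 tasks, 23 UNKNOWN -> 36/55
acc_dev, n_dev = scored_accuracy(36, 78, 23)
# held_out: 19 correct out of 65 tasks, 33 UNKNOWN -> 19/32
acc_ho, n_ho = scored_accuracy(19, 65, 33)
```

This recovers the 65.5% (36/55) and 59.4% (19/32) figures in the table.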
### Dimension-Level Errors (held-out, 65 tasks)
| Dimension | Base errors | Trained errors | Improvement |
|---|---|---|---|
| D1 grounding_fidelity | 13 | 8 | −5 |
| D2 icp_pitch_alignment | 14 | 0 | −14 |
| D3 signal_directionality | 3 | 3 | 0 |
| D4 tone_compliance | 2 | 2 | 0 |
The trained judge eliminated all D2 (ICP alignment) errors and reduced D1 errors by 38%.
## Limitation
The evaluation harness expects a strict `VERDICT: PASS` or `VERDICT: REJECT` on the
first output line. The trained model sometimes leads with richer structured D1–D5
reasoning instead, which the harness cannot parse and therefore scores as UNKNOWN.
Accuracy is computed only over scored (non-UNKNOWN) responses, so the +8.6% held-out
delta likely understates true performance: the model is partially abstaining rather
than answering wrongly.
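One possible mitigation, sketched here as an assumption rather than a change that exists in the harness, is to scan every output line for a verdict instead of only the first:

```python
import re

def extract_verdict(text):
    """Scan all lines for VERDICT: PASS/REJECT; fall back to UNKNOWN."""
    for line in text.splitlines():
        m = re.match(r"\s*VERDICT:\s*(PASS|REJECT)\b", line)
        if m:
            return m.group(1)
    return "UNKNOWN"

# A response that leads with D1-D5 reasoning still yields a parseable verdict.
rich = "D1: PASS — ok\nD2: FAIL — claim mismatch\nVERDICT: REJECT\nPrimary failure: D2"
```

A lenient pass like this would convert many of the UNKNOWN responses above into scored verdicts, at the cost of accepting output that violates the strict first-line contract.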
## How to Use

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

backbone = "unsloth/Qwen2.5-3B-Instruct"
adapter = "Birkity/tenacious-judge"

tokenizer = AutoTokenizer.from_pretrained(backbone)
model = AutoModelForCausalLM.from_pretrained(backbone, torch_dtype="auto")
model = PeftModel.from_pretrained(model, adapter)
model.eval()

prompt = (
    "You are a sales quality judge for Tenacious Intelligence Corporation.\n"
    "Score D1–D5 and return VERDICT: PASS or VERDICT: REJECT.\n\n"
    "Hiring Brief: ...\nGenerated Email: ...\n\n"
    "Start with D1:"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
## Dataset

`Birkity/tenacious_bench_v0.1` — 257 tasks, four authoring modes, CC-BY-4.0.
## Community Post
Reddit — Built a sales judgment benchmark (Tenacious-Bench)
## Citation

```bibtex
@misc{tenacious-judge-2026,
  author    = {Birkity Mekasha},
  title     = {Tenacious Judge: SimPO LoRA Adapter for B2B Sales Email Evaluation},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/Birkity/tenacious-judge}
}
```
## Author
Birkity Mekasha — Birkity@10academy.org
Research Partner, Tenacious Intelligence Corporation
10 Academy TRP1 Week 11