# Tenacious Judge — SimPO LoRA Adapter
A LoRA adapter trained with SimPO (Simple Preference Optimization) on 228 dimension-reasoning preference pairs derived from Tenacious-Bench v0.1.
The adapter teaches Qwen2.5-3B-Instruct to judge B2B outbound sales emails against
five machine-verifiable rubric dimensions (D1–D5) and return a structured verdict.
## Backbone

`unsloth/Qwen2.5-3B-Instruct` (Qwen2.5-3B-Instruct, loaded in fp16), as used in the How to Use snippet below.

## Training
| Parameter | Value |
|---|---|
| Method | SimPO via TRL `CPOConfig(loss_type="simpo", cpo_alpha=0.0)` |
| Preference pairs | 228 (114 tasks x 2 pairs — aggressive wrong + subtle wrong) |
| LoRA rank / alpha | 16 / 16, dropout=0.0 |
| Target modules | `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj` |
| Trainable params | ~8.7 M (LoRA only) |
| Epochs | 3 |
| Effective batch size | 16 (batch=4 x grad_accum=4) |
| Learning rate | 2e-4 |
| beta / gamma | 2.0 / 0.5 |
| Final train loss | 0.0745 |
| Hardware | Google Colab T4 (16 GB VRAM, fp16) |
| Wall time | 14 min |
| Seed | 42 |
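For reference, the β and γ in the table correspond to the published SimPO objective: a sigmoid margin loss over length-normalized sequence log-probabilities, with no reference model (and `cpo_alpha=0.0` disabling TRL's auxiliary NLL term). A minimal sketch of that per-pair loss, with illustrative placeholder log-probabilities rather than real model outputs:

```python
import math

def simpo_loss(logp_chosen, len_chosen, logp_rejected, len_rejected,
               beta=2.0, gamma=0.5):
    """Per-pair SimPO loss: -log sigmoid of the length-normalized reward margin.

    The average log-probability acts as an implicit reward, so no
    reference model is needed.
    """
    reward_chosen = beta * logp_chosen / len_chosen
    reward_rejected = beta * logp_rejected / len_rejected
    margin = reward_chosen - reward_rejected - gamma
    # -log sigmoid(margin) == log(1 + e^{-margin}), written via log1p
    return math.log1p(math.exp(-margin))

# Illustrative numbers: the chosen sequence is more likely per token.
loss = simpo_loss(logp_chosen=-40.0, len_chosen=80,
                  logp_rejected=-90.0, len_rejected=90)
```

With these placeholder values the reward margin is 0.5, so the loss is `log1p(exp(-0.5)) ≈ 0.474`; a well-separated pair drives the margin up and the loss toward zero.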
## Preference Pair Design
Each task produces two pairs sharing the same chosen (correct D1–D5 reasoning):
- Pair A — aggressive wrong: all five dimensions marked PASS → wrong PASS verdict
- Pair B — subtle wrong: correct REJECT verdict but wrong primary failure dimension
This forces the model to learn which dimension fails, not just whether to REJECT, and prevents the preference signal from leaking through the style of the short verdict sentence alone.
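The pairing scheme can be sketched as follows (field names are illustrative, not the actual dataset schema):

```python
def build_pairs(task):
    """Produce the two SimPO pairs for one task, sharing the same chosen text."""
    chosen = task["correct_reasoning"]          # correct D1-D5 reasoning + verdict
    aggressive = task["all_pass_reasoning"]     # all five dimensions marked PASS
    subtle = task["wrong_dimension_reasoning"]  # REJECT, but wrong primary failure
    return [
        {"prompt": task["prompt"], "chosen": chosen, "rejected": aggressive},
        {"prompt": task["prompt"], "chosen": chosen, "rejected": subtle},
    ]

# 114 dummy tasks -> 228 pairs, matching the training table above.
tasks = [{"prompt": f"task-{i}", "correct_reasoning": "c",
          "all_pass_reasoning": "a", "wrong_dimension_reasoning": "s"}
         for i in range(114)]
pairs = [p for t in tasks for p in build_pairs(t)]
```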
Chosen format:

```text
D1: PASS — grounding facts verified; $14M matches brief
D2: FAIL — Ambiguous segment with product claim
D3: PASS — ok
D4: PASS — ok
D5: PASS — ok
VERDICT: REJECT
Primary failure: D2
```
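A hedged sketch of how a verdict block in this format could be parsed (an illustration of the format above, not the project's actual harness):

```python
import re

def parse_verdict(text):
    """Extract per-dimension results, the verdict, and the primary failure."""
    dims = dict(re.findall(r"^(D[1-5]):\s*(PASS|FAIL)", text, re.MULTILINE))
    verdict = re.search(r"^VERDICT:\s*(PASS|REJECT)", text, re.MULTILINE)
    primary = re.search(r"^Primary failure:\s*(D[1-5])", text, re.MULTILINE)
    return {
        "dims": dims,
        "verdict": verdict.group(1) if verdict else "UNKNOWN",
        "primary_failure": primary.group(1) if primary else None,
    }

sample = """D1: PASS — grounding facts verified; $14M matches brief
D2: FAIL — Ambiguous segment with product claim
D3: PASS — ok
D4: PASS — ok
D5: PASS — ok
VERDICT: REJECT
Primary failure: D2"""
result = parse_verdict(sample)
```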
## Evaluation (Act IV Ablation)
Evaluated on Tenacious-Bench v0.1 using `scripts/analysis/run_ablation.py`.
| Partition | n | Base Qwen2.5-3B | Trained Judge | Delta |
|---|---|---|---|---|
| dev | 78 | 60.3% (47/78) | 65.5% (36/55*) | +5.2% |
| held_out | 65 | 50.8% (33/65) | 59.4% (19/32*) | +8.6% |
* Trained model produced UNKNOWN on 23 dev / 33 held-out tasks (see Limitation).
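The accuracy convention behind the asterisked cells (accuracy over scored, i.e. non-UNKNOWN, responses only) can be reproduced directly from the table's numbers:

```python
def scored_accuracy(correct, total, unknown):
    """Accuracy over scored (non-UNKNOWN) responses only."""
    scored = total - unknown
    return correct / scored, scored

# dev: 36 correct out of 78 tasks, 23 UNKNOWN -> 36/55
acc_dev, n_dev = scored_accuracy(36, 78, 23)
# held_out: 19 correct out of 65 tasks, 33 UNKNOWN -> 19/32
acc_ho, n_ho = scored_accuracy(19, 65, 33)
```

This recovers the 65.5% (36/55) and 59.4% (19/32) figures in the table.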
### Dimension-Level Errors (held-out, 65 tasks)
| Dimension | Base errors | Trained errors | Improvement |
|---|---|---|---|
| D1 grounding_fidelity | 13 | 8 | −5 |
| D2 icp_pitch_alignment | 14 | 0 | −14 |
| D3 signal_directionality | 3 | 3 | 0 |
| D4 tone_compliance | 2 | 2 | 0 |
The trained judge eliminated all D2 (ICP alignment) errors and reduced D1 errors by 38%.
## Limitation
The evaluation harness expects a strict `VERDICT: PASS` or `VERDICT: REJECT` on the
first output line. The trained model sometimes leads with richer structured D1–D5
reasoning instead, which the harness cannot parse and therefore scores as UNKNOWN.
Accuracy is computed only over scored (non-UNKNOWN) responses, so the +8.6% held-out
delta likely understates true performance: the model is partially abstaining rather
than answering wrongly.
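One possible mitigation, sketched here as an assumption rather than a change that exists in the harness, is to scan every output line for a verdict instead of only the first:

```python
import re

def extract_verdict(text):
    """Scan all lines for VERDICT: PASS/REJECT; fall back to UNKNOWN."""
    for line in text.splitlines():
        m = re.match(r"\s*VERDICT:\s*(PASS|REJECT)\b", line)
        if m:
            return m.group(1)
    return "UNKNOWN"

# A response that leads with D1-D5 reasoning still yields a parseable verdict.
rich = "D1: PASS — ok\nD2: FAIL — claim mismatch\nVERDICT: REJECT\nPrimary failure: D2"
```

A lenient pass like this would convert many of the UNKNOWN responses above into scored verdicts, at the cost of accepting output that violates the strict first-line contract.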
## How to Use

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

backbone = "unsloth/Qwen2.5-3B-Instruct"
adapter = "Birkity/tenacious-judge"

tokenizer = AutoTokenizer.from_pretrained(backbone)
model = AutoModelForCausalLM.from_pretrained(backbone, torch_dtype="auto")
model = PeftModel.from_pretrained(model, adapter)
model.eval()

prompt = (
    "You are a sales quality judge for Tenacious Intelligence Corporation.\n"
    "Score D1–D5 and return VERDICT: PASS or VERDICT: REJECT.\n\n"
    "Hiring Brief: ...\nGenerated Email: ...\n\n"
    "Start with D1:"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
## Dataset

`Birkity/tenacious_bench_v0.1` — 257 tasks, four authoring modes, CC-BY-4.0.
## Community Post
Reddit — Built a sales judgment benchmark (Tenacious-Bench)
## Citation

```bibtex
@misc{tenacious-judge-2026,
  author    = {Birkity Mekasha},
  title     = {Tenacious Judge: SimPO LoRA Adapter for B2B Sales Email Evaluation},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/Birkity/tenacious-judge}
}
```
## Author
Birkity Mekasha — Birkity@10academy.org
Research Partner, Tenacious Intelligence Corporation
10 Academy TRP1 Week 11