# Tenacious-Bench Judge – ORPO LoRA Adapter (Qwen2.5-1.5B)
A rubric-aware scoring judge for B2B outbound sales emails, trained with ORPO on Tenacious-Bench v0.1 preference pairs. Deployed as a rejection-sampling gate in the Tenacious Conversion Engine.
- **Base model:** unsloth/Qwen2.5-1.5B-Instruct
- **Adapter type:** LoRA (PEFT) – load with the base model via `PeftModel.from_pretrained`
- **Training algorithm:** ORPO (no reference model required)
- **Precision:** 4-bit quantized during training (Unsloth), fp16 for inference
- **Training data:** 94 ORPO preference pairs from rafiakedir/tenacious-bench-v0.1 (train split)
- **Training:** 3 epochs · 36 steps · lr=8e-6 · beta=0.1 · LoRA r=16, alpha=32
## What It Scores

| Dimension | Trigger Rate (Week 10 probes) | Risk if Missed |
|---|---|---|
| `signal_grounding_fidelity` | 35% | CTO credibility loss |
| `competitor_gap_honesty` | 45% | Irreversible brand damage |
| `icp_segment_appropriateness` | 20% | ~$480K ACV per error |
| `tone_preservation` | 15% | Brand voice violation |
| `bench_commitment_honesty` | 5% | SOW breach / delivery failure |
## Quick Start – Inference
```python
import json, torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

BASE_MODEL = "unsloth/Qwen2.5-1.5B-Instruct"
ADAPTER_ID = "rafiakedir/tenacious-bench-adapter"

# Load the fp16 base model and attach the LoRA adapter.
tokenizer = AutoTokenizer.from_pretrained(ADAPTER_ID)
base = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL, torch_dtype=torch.float16, device_map="auto"
)
model = PeftModel.from_pretrained(base, ADAPTER_ID)
model.eval()

JUDGE_SYSTEM = (
    "You are a rubric-aware judge for Tenacious Consulting B2B outbound sales emails. "
    "Given a task context and a candidate email, score it on the specified rubric dimension. "
    "Respond with a JSON object only:\n"
    '{"dimension": "<dim>", "score": <0.0-1.0>, "pass": <true|false>, "reasoning": "<one sentence>"}'
)

def judge(email, context, dimension):
    user = (
        f"EVALUATION DIMENSION: {dimension}\n\n"
        f"TASK CONTEXT:\n{context}\n\n"
        f"CANDIDATE EMAIL:\n{email}\n\n"
        f"Score this email on the {dimension} dimension."
    )
    msgs = [{"role": "system", "content": JUDGE_SYSTEM},
            {"role": "user", "content": user}]
    text = tokenizer.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=128, temperature=0.1, do_sample=True,
                             pad_token_id=tokenizer.eos_token_id)
    resp = tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True).strip()
    # Extract the first JSON object; fall back to a neutral score on parse failure.
    s, e = resp.find("{"), resp.rfind("}") + 1
    if s >= 0 and e > s:
        try:
            return json.loads(resp[s:e])
        except json.JSONDecodeError:
            pass
    return {"score": 0.5, "raw": resp[:200]}

result = judge(
    email="Casey – TalentBridge has 8 open AI/ML roles this quarter. 30-min scoping call: calendly.com/tenacious",
    context="company: TalentBridge, stage: Series A, open_roles: 8, confidence: high",
    dimension="signal_grounding_fidelity"
)
print(result)
```
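The card describes this adapter as a rejection-sampling gate in the Tenacious Conversion Engine, but the gating logic itself is not published. A minimal sketch built around a `judge()`-shaped scorer might look like the following; the 0.7 threshold and the all-dimensions-must-pass policy are assumptions for illustration, not the production configuration:

```python
# Hypothetical rejection-sampling gate. `score_fn` is any callable with the
# same signature as judge() that returns a dict containing a "score" key.

DIMENSIONS = [
    "signal_grounding_fidelity",
    "competitor_gap_honesty",
    "icp_segment_appropriateness",
    "tone_preservation",
    "bench_commitment_honesty",
]

def gate(email, context, score_fn, threshold=0.7):
    """Accept an email only if every rubric dimension clears the threshold."""
    results = {}
    for dim in DIMENSIONS:
        verdict = score_fn(email, context, dim)
        results[dim] = verdict.get("score", 0.0)
        if results[dim] < threshold:
            return False, results  # reject on the first failing dimension
    return True, results

def sample_until_pass(candidates, context, score_fn, threshold=0.7):
    """Return the first candidate email that passes the gate, else None."""
    for email in candidates:
        ok, scores = gate(email, context, score_fn, threshold)
        if ok:
            return email, scores
    return None, {}
```

Scoring per dimension and rejecting on the first failure keeps judge calls cheap: a candidate that fails an early dimension never pays for the remaining ones.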
## Training Details
| Parameter | Value |
|---|---|
| Base model | unsloth/Qwen2.5-1.5B-Instruct (4-bit during training) |
| LoRA rank | 16 |
| LoRA alpha | 32 |
| Target modules | q_proj, v_proj |
| LoRA dropout | 0.05 |
| Learning rate | 8e-6 |
| Effective batch size | 8 (batch=2, grad_accum=4) |
| Epochs | 3 |
| Total steps | 36 |
| ORPO beta | 0.1 |
| Max sequence length | 1024 |
| Seed | 42 |
Training loss: 2.8676 → 2.9646 → 2.9386 (3 checkpoints). Reward accuracy: 0.5375 → 0.6026 → 0.5128.
Training data: 94 preference pairs from the train partition. Preference leakage prevention: the generator (DeepSeek V3.2) is a different model family from the judge (Claude Sonnet 4.6 / scoring_evaluator.py). All generation decisions are logged in the dataset repo at `training_data/generation_log.jsonl`.
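For reference, the per-pair ORPO objective (the reason no reference model is needed) combines the standard SFT loss on the chosen response with a beta-weighted log-odds-ratio penalty; a scalar sketch over length-normalized sequence log-probabilities, using the beta=0.1 from the table above, might look like:

```python
import math

def odds(logp):
    """odds(y|x) = P / (1 - P) for a length-normalized sequence log-prob."""
    p = math.exp(logp)
    return p / (1.0 - p)

def orpo_loss(logp_chosen, logp_rejected, beta=0.1):
    """ORPO objective for one preference pair: NLL of the chosen response plus
    a beta-weighted penalty that pushes chosen odds above rejected odds."""
    log_odds_ratio = math.log(odds(logp_chosen)) - math.log(odds(logp_rejected))
    l_or = -math.log(1.0 / (1.0 + math.exp(-log_odds_ratio)))  # -log sigmoid
    l_sft = -logp_chosen
    return l_sft + beta * l_or
```

Widening the odds gap between chosen and rejected responses shrinks the penalty term, so in the limit the objective reduces to plain SFT on the chosen response.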
## Evaluation Results
Evaluated on 59 held-out tasks from rafiakedir/tenacious-bench-v0.1.
Full results are in `ablation_results.json` in the dataset repo.
Deployment recommendation: run `ablations/run_ablations.py` with this adapter to obtain Delta A.
The ablation script loads this adapter via Hugging Face and requires a GPU plus `transformers` and `peft`.
## Known Limitations

- **Dimension coverage gap.** 0 training pairs for `bench_commitment_honesty` and only 4 for `icp_segment_appropriateness`, due to a scoring-key mismatch during pair construction. The model received zero gradient signal on bench commitment honesty.
- **Backbone below Prometheus-2 threshold.** Prometheus-2 demonstrated rubric-matching at 7B+ parameters; at 1.5B the model may underfit multi-dimension generalization.
- **Synthetic training distribution.** All pairs derive from synthetic prospect briefs and LLM-generated emails.
- **Static bench_summary.** Judge calibration drifts as the real bench composition changes weekly.
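Since the static `bench_summary` means calibration drifts as the real bench changes, a lightweight weekly check could compare the adapter's scores against a small freshly re-scored reference set. A sketch, where the 0.1 mean-absolute-difference alarm threshold is an assumption to be tuned:

```python
def drift_alarm(judge_scores, reference_scores, max_mad=0.1):
    """Flag calibration drift when the mean absolute difference between the
    adapter's scores and re-scored reference labels exceeds max_mad."""
    if len(judge_scores) != len(reference_scores) or not judge_scores:
        raise ValueError("score lists must be non-empty and aligned")
    mad = sum(abs(a - b) for a, b in zip(judge_scores, reference_scores)) / len(judge_scores)
    return mad > max_mad, mad
```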
## Files

| File | Description |
|---|---|
| `adapter_config.json` | LoRA configuration (r=16, alpha=32, q_proj+v_proj) |
| `adapter_model.safetensors` | Trained LoRA weights (8.4 MB) |
| `tokenizer.json`, `tokenizer_config.json` | Tokenizer (ChatML format) |
| `run_on_colab.ipynb` | End-to-end training + push notebook |
| `train_judge.py` | Training script |
| `inference_example.py` | Per-dimension and all-dimension scoring helper |
Training data: `rafiakedir/tenacious-bench-v0.1`
## Citation

```bibtex
@misc{tenacious-bench-adapter-2026,
  title  = {Tenacious-Bench Judge: ORPO LoRA Adapter for B2B Sales Evaluation},
  author = {Kedir, Rafia},
  year   = {2026},
  url    = {https://huggingface.co/rafiakedir/tenacious-bench-adapter}
}
```