Tenacious-Bench Judge: ORPO LoRA Adapter (Qwen2.5-1.5B)

A rubric-aware scoring judge for B2B outbound sales emails, trained with ORPO on Tenacious-Bench v0.1 preference pairs. Deployed as a rejection-sampling gate in the Tenacious Conversion Engine.

- Base model: unsloth/Qwen2.5-1.5B-Instruct
- Adapter type: LoRA (PEFT), loaded with base model + PeftModel.from_pretrained
- Training algorithm: ORPO (no reference model required)
- Precision: 4-bit quantized during training (Unsloth), fp16 for inference
- Training data: 94 ORPO preference pairs from rafiakedir/tenacious-bench-v0.1 (train split)
- Training: 3 epochs · 36 steps · lr=8e-6 · beta=0.1 · LoRA r=16, alpha=32


What It Scores

| Dimension | Trigger rate (Week 10 probes) | Risk if missed |
|---|---|---|
| signal_grounding_fidelity | 35% | CTO credibility loss |
| competitor_gap_honesty | 45% | Irreversible brand damage |
| icp_segment_appropriateness | 20% | ~$480K ACV per error |
| tone_preservation | 15% | Brand voice violation |
| bench_commitment_honesty | 5% | SOW-breach / delivery failure |

Quick Start: Inference

```python
import json, torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

BASE_MODEL = "unsloth/Qwen2.5-1.5B-Instruct"
ADAPTER_ID = "rafiakedir/tenacious-bench-adapter"

tokenizer = AutoTokenizer.from_pretrained(ADAPTER_ID)
base = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL, torch_dtype=torch.float16, device_map="auto"
)
model = PeftModel.from_pretrained(base, ADAPTER_ID)
model.eval()

JUDGE_SYSTEM = (
    "You are a rubric-aware judge for Tenacious Consulting B2B outbound sales emails. "
    "Given a task context and a candidate email, score it on the specified rubric dimension. "
    "Respond with a JSON object only:\n"
    '{"dimension": "<dim>", "score": <0.0-1.0>, "pass": <true|false>, "reasoning": "<one sentence>"}'
)

def judge(email, context, dimension):
    user = (
        f"EVALUATION DIMENSION: {dimension}\n\n"
        f"TASK CONTEXT:\n{context}\n\n"
        f"CANDIDATE EMAIL:\n{email}\n\n"
        f"Score this email on the {dimension} dimension."
    )
    msgs = [{"role": "system", "content": JUDGE_SYSTEM},
            {"role": "user",   "content": user}]
    text = tokenizer.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=128, temperature=0.1, do_sample=True,
                             pad_token_id=tokenizer.eos_token_id)
    # Decode only the newly generated tokens, then extract the JSON object.
    resp = tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True).strip()
    s, e = resp.find("{"), resp.rfind("}") + 1
    if s < 0 or e <= s:
        # Malformed output: fall back to a neutral score and surface the raw text.
        return {"score": 0.5, "raw": resp[:200]}
    return json.loads(resp[s:e])

result = judge(
    email="Casey - TalentBridge has 8 open AI/ML roles this quarter. 30-min scoping call: calendly.com/tenacious",
    context="company: TalentBridge, stage: Series A, open_roles: 8, confidence: high",
    dimension="signal_grounding_fidelity"
)
print(result)
```
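The card describes deployment as a rejection-sampling gate in the Tenacious Conversion Engine. A minimal sketch of such a gate, built on the `judge` helper above; the 0.7 threshold is an illustrative default, not a value shipped with the adapter:

```python
# Minimal rejection-sampling gate: a candidate email passes only if every
# rubric dimension clears its threshold. The threshold is illustrative.
DIMENSIONS = [
    "signal_grounding_fidelity",
    "competitor_gap_honesty",
    "icp_segment_appropriateness",
    "tone_preservation",
    "bench_commitment_honesty",
]

def gate(email, context, judge_fn, threshold=0.7, dimensions=DIMENSIONS):
    """Score `email` on each dimension; return (passed, per-dimension scores)."""
    scores = {}
    for dim in dimensions:
        verdict = judge_fn(email, context, dim)
        scores[dim] = float(verdict.get("score", 0.0))
    passed = all(s >= threshold for s in scores.values())
    return passed, scores
```

With `judge` from the Quick Start, `gate(email, context, judge)` rejects any candidate that fails a single dimension; in a sampling loop you would regenerate until `passed` is true or a retry budget runs out.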

Training Details

| Parameter | Value |
|---|---|
| Base model | unsloth/Qwen2.5-1.5B-Instruct (4-bit during training) |
| LoRA rank | 16 |
| LoRA alpha | 32 |
| Target modules | q_proj, v_proj |
| LoRA dropout | 0.05 |
| Learning rate | 8e-6 |
| Effective batch size | 8 (batch=2, grad_accum=4) |
| Epochs | 3 |
| Total steps | 36 |
| ORPO beta | 0.1 |
| Max sequence length | 1024 |
| Seed | 42 |

Training loss: 2.8676 → 2.9646 → 2.9386 (3 checkpoints)
Reward accuracy: 0.5375 → 0.6026 → 0.5128
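ORPO needs no reference model because its preference term is computed from the policy's own odds. A minimal sketch of the per-pair loss, assuming mean per-token log-probabilities for the chosen and rejected responses (common in implementations; the paper defines odds over sequence probabilities):

```python
import math

def orpo_loss(logp_chosen, logp_rejected, beta=0.1):
    """ORPO per-pair loss: NLL on the chosen response plus a beta-weighted
    odds-ratio penalty. logp_* are mean per-token log-probabilities under
    the current policy; no frozen reference model is involved."""
    def log_odds(logp):
        # log odds(p) = log(p / (1 - p)), computed stably from log p
        return logp - math.log1p(-math.exp(logp))
    ratio = log_odds(logp_chosen) - log_odds(logp_rejected)
    # -log sigmoid(ratio): small when the chosen response is much more likely
    l_or = math.log1p(math.exp(-ratio))
    l_sft = -logp_chosen  # standard NLL on the preferred response
    return l_sft + beta * l_or
```

With beta=0.1 (as in this run), the odds-ratio term nudges preferences while the SFT term dominates, which is why the training loss above tracks close to plain NLL.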

Training data: 94 preference pairs from the train partition. Preference leakage prevention: the generator (DeepSeek V3.2) is a different model family from the judge (Claude Sonnet 4.6 / scoring_evaluator.py). All generation decisions are logged in the dataset repo at training_data/generation_log.jsonl.


Evaluation Results

Evaluated on 59 held-out tasks from rafiakedir/tenacious-bench-v0.1. Full results in ablation_results.json in the dataset repo.

Deployment recommendation: Run ablations/run_ablations.py with this adapter to compute Delta A. The ablation script loads the adapter from Hugging Face and requires a GPU plus the transformers and peft libraries.


Known Limitations

  1. Dimension coverage gap. There are 0 training pairs for bench_commitment_honesty and only 4 for icp_segment_appropriateness, due to a scoring-key mismatch during pair construction. The model received zero gradient signal on bench_commitment_honesty.

  2. Backbone below Prometheus-2 threshold. Prometheus-2 demonstrated rubric-matching at 7B+ parameters. At 1.5B the model may underfit multi-dimension generalization.

  3. Synthetic training distribution. All pairs derive from synthetic prospect briefs and LLM-generated emails.

  4. Static bench_summary. Judge calibration drifts as real bench composition changes weekly.
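Given limitation 1, a deployment can refuse to auto-gate on dimensions the adapter barely saw during training. A minimal sketch; the helper name and the min_pairs=10 cutoff are illustrative, and only the two pair counts published above are known:

```python
def trusted_dimensions(pair_counts, min_pairs=10):
    """Dimensions with enough training pairs to gate on automatically;
    anything below the cutoff should be routed to human review instead.
    The min_pairs cutoff is an illustrative choice, not from the card."""
    return {dim for dim, n in pair_counts.items() if n >= min_pairs}
```

With the published counts, `trusted_dimensions({"bench_commitment_honesty": 0, "icp_segment_appropriateness": 4})` returns an empty set, i.e. both dimensions fall back to human review.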


Files

| File | Description |
|---|---|
| adapter_config.json | LoRA configuration (r=16, alpha=32, q_proj+v_proj) |
| adapter_model.safetensors | Trained LoRA weights (8.4 MB) |
| tokenizer.json, tokenizer_config.json | Tokenizer (ChatML format) |
| run_on_colab.ipynb | End-to-end training + push notebook |
| train_judge.py | Training script |
| inference_example.py | Per-dimension and all-dimension scoring helper |

Training data: rafiakedir/tenacious-bench-v0.1


Citation

```bibtex
@misc{tenacious-bench-adapter-2026,
  title  = {Tenacious-Bench Judge: ORPO LoRA Adapter for B2B Sales Evaluation},
  author = {Kedir, Rafia},
  year   = {2026},
  url    = {https://huggingface.co/rafiakedir/tenacious-bench-adapter}
}
```