🤖 Tenacious Judge LoRA — B2B Sales Outreach Pre-Send Judge
Base model: unsloth/Qwen2.5-1.5B-Instruct-bnb-4bit
Adapter type: LoRA (ORPO)
Author: Bethelhem Abay · 10 Academy TRP1
Date: 2026-05-02
License: MIT
A LoRA adapter that teaches a small LLM when NOT to send a B2B sales email: 85.2% accuracy on sealed held-out data, trained in about 17 minutes on a Colab T4 free tier.
🔗 Quick Links
| Resource | Link |
|---|---|
| 🤖 Model (this page) | bethelhem21/tenacious-judge-lora |
| 📦 Training Dataset | bethelhem21/tenacious-bench |
| 💻 GitHub Repository | bettyabay/tenacious-bench |
| 📝 Blog Post | Teaching a Sales Agent When NOT to Act |
Overview
What is this model?
Tenacious Judge is a LoRA adapter on top of Qwen2.5-1.5B-Instruct (4-bit quantised via Unsloth), fine-tuned with ORPO on the Tenacious-Bench preference dataset.
It sits between the Tenacious Conversion Engine and the email send queue. Before any B2B outreach is dispatched, the judge evaluates the proposed action against a 7-rule rubric and returns a decision:
| Decision | Meaning |
|---|---|
| SUPPRESS | Disqualifier or opt-out signal present: block all outreach |
| ESCALATE | C-level recipient at >2,000-headcount company: route to human account executive |
| BLOCK | Cross-thread context leak or low-confidence funding cited as fact |
| PENALISE | Generic peer company names: flag for human review |
| PASS | No rule violations: approve for dispatch |
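The priority order behind these decisions can be sketched as a plain function. This is a minimal illustration, not the judge's implementation. The name `apply_rubric`, the set of C-level role strings, and the boolean flags `cross_thread_leak`, `cites_funding_amount`, and `generic_peer_names` are hypothetical stand-ins for judgments the model makes from the email text.

```python
from typing import Any, Dict

# Assumed role labels; the examples in this card use "cto" and "c_level".
C_LEVEL_ROLES = {"c_level", "ceo", "cto", "cfo", "coo"}
LOW_CONFIDENCE = {"low", "insufficient_signal"}


def apply_rubric(
    context: Dict[str, Any],
    channel: str = "email",
    cross_thread_leak: bool = False,
    cites_funding_amount: bool = False,
    generic_peer_names: bool = False,
) -> str:
    """Return the first decision triggered, checking rules in priority order."""
    # Rule 1: any disqualifier blocks all outreach.
    if context.get("disqualifiers"):
        return "SUPPRESS"
    # Rule 2: the prospect opted out of this outreach channel.
    if channel in context.get("opt_out_channels", []):
        return "SUPPRESS"
    # Rule 3: C-level recipient at a company with more than 2,000 employees.
    if (context.get("recipient_role") in C_LEVEL_ROLES
            and context.get("headcount", 0) > 2000):
        return "ESCALATE"
    # Rule 4: content from a different prospect's thread.
    if cross_thread_leak:
        return "BLOCK"
    # Rule 5: funding amount cited while confidence is low.
    if cites_funding_amount and context.get("funding_confidence") in LOW_CONFIDENCE:
        return "BLOCK"
    # Rule 6: generic or reused peer company names.
    if generic_peer_names:
        return "PENALISE"
    # Rule 7: nothing triggered.
    return "PASS"
```

Because the checks return early, a prospect with both a disqualifier and a C-level recipient is suppressed rather than escalated, which is the behaviour the strict priority order implies.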
Key Results
| Metric | Value |
|---|---|
| Held-out accuracy | 85.2% (52 / 61 pairs) |
| 95% confidence interval | [0.77, 0.93] |
| Training time | ~17 minutes (1,009 seconds) on Colab T4 (free tier) |
| Compute | Colab T4 free tier (~$1.50 total incl. data synthesis) |
| Adapter size | ~74 MB |
| Baseline (no judge) | 0.0% (0 / 61) |
How to Use
Requirements
pip install unsloth transformers torch
Load and Run Inference
from unsloth import FastLanguageModel
import torch
import json
# Load the adapter
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="bethelhem21/tenacious-judge-lora",
    max_seq_length=2048,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)
SYSTEM_PROMPT = """You are a sales-outreach judge for Tenacious, a B2B sales agent.
Apply the 7-rule rubric in strict priority order:
1. SUPPRESS if any disqualifier is present (anti_offshore, local_only, do_not_contact, competitor)
2. SUPPRESS if the prospect has opted out of the outreach channel
3. ESCALATE if the recipient is C-level at a company with >2000 employees
4. BLOCK if the output references content from a different prospect's thread
5. BLOCK if a funding amount is cited but funding_confidence is low or insufficient_signal
6. PENALISE if peer company names are generic or reused across prospects
7. PASS if none of the above conditions are triggered
Respond with one of: SUPPRESS | ESCALATE | BLOCK | PENALISE | PASS
Then give a one-sentence rationale."""
# Example: anti-offshore disqualifier case
context = {
    "company": "NearshoreStack Ltd",
    "headcount": 120,
    "funding_stage": "series_a",
    "funding_confidence": "high",
    "disqualifiers": ["anti_offshore"],
    "opt_out_channels": [],
    "recipient_role": "vp_eng",
    "available_signals": {"tech_stack": ["aws", "react"]},
}
agent_output = (
    "Hi Amir, I wanted to introduce Tenacious's offshore engineering placement "
    "service. We've helped similar Series A companies scale their backend teams..."
)
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": f"Context:\n{json.dumps(context, indent=2)}\n\n"
                                f"Agent output:\n{agent_output}"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to("cuda")
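with torch.no_grad():
    # Greedy decoding; do_sample=False replaces temperature=0.0, which
    # transformers ignores (with a warning) when sampling is disabled.
    outputs = model.generate(input_ids=inputs, max_new_tokens=64, do_sample=False)
# Decode only the newly generated tokens, not the prompt.
decision = tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True)
print(decision)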
# Expected: SUPPRESS
# Rationale: NearshoreStack has an anti_offshore disqualifier — Rule 1 fires.
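The raw string returned above still has to be split into a label and a rationale before it can gate the send queue. The following is a minimal sketch of such a parser. The helper name `parse_decision` is hypothetical, and it assumes the label appears in upper case before the rationale, as the system prompt requests.

```python
import re
from typing import Optional, Tuple

DECISIONS = ("SUPPRESS", "ESCALATE", "BLOCK", "PENALISE", "PASS")
_LABEL_RE = re.compile(r"\b(" + "|".join(DECISIONS) + r")\b")


def parse_decision(text: str) -> Tuple[Optional[str], str]:
    """Split raw judge output into (label, rationale).

    Returns (None, text) when no known label is found, so the caller
    can count the response as malformed.
    """
    match = _LABEL_RE.search(text)
    if match is None:
        return None, text.strip()
    # Everything after the first label is treated as the rationale.
    rationale = text[match.end():].strip(" \n:-")
    return match.group(1), rationale


label, rationale = parse_decision("SUPPRESS\nRationale: Rule 1 fires.")
print(label, "|", rationale)
```

Returning `None` for an unrecognised response, instead of defaulting to `PASS`, keeps a malformed output from silently approving an email.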
All Five Decision Paths
# SUPPRESS: disqualifier present
context_suppress = {
    "disqualifiers": ["anti_offshore"], "opt_out_channels": [],
    "headcount": 80, "recipient_role": "cto", "funding_confidence": "high",
}
# ESCALATE: C-level at large company
context_escalate = {
    "disqualifiers": [], "opt_out_channels": [],
    "headcount": 5000, "recipient_role": "c_level", "funding_confidence": "high",
}
# BLOCK: cross-thread leak
context_block = {
    "disqualifiers": [], "opt_out_channels": [],
    "headcount": 200, "recipient_role": "vp_eng", "funding_confidence": "high",
    "thread_id": "thread-042",
    "available_signals": {"leaked_thread": "thread-039"},
}
# PENALISE: generic peer names
agent_output_penalise = (
    "We've helped companies like TechCorp and StartupCo scale their teams..."
)
# PASS: clean outreach
context_pass = {
    "disqualifiers": [], "opt_out_channels": [],
    "headcount": 300, "recipient_role": "vp_eng", "funding_confidence": "high",
}
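To run any of these contexts through the judge, each one needs to be wrapped in the same chat format as the inference example. A small helper along these lines would do it. The name `build_messages` is hypothetical, the agent output string is made up for illustration, and the layout simply mirrors the `Context:` / `Agent output:` prompt used earlier.

```python
import json
from typing import Any, Dict, List


def build_messages(
    system_prompt: str, context: Dict[str, Any], agent_output: str
) -> List[Dict[str, str]]:
    """Build the system + user messages for one context / agent-output pair."""
    user_content = (
        f"Context:\n{json.dumps(context, indent=2)}\n\n"
        f"Agent output:\n{agent_output}"
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_content},
    ]


# Illustrative call with a clean context and a placeholder system prompt.
example_context = {
    "disqualifiers": [], "opt_out_channels": [],
    "headcount": 300, "recipient_role": "vp_eng", "funding_confidence": "high",
}
messages = build_messages("<system prompt here>", example_context, "Hi Dana, ...")
print(messages[1]["content"])
```

The returned list can be passed straight to `tokenizer.apply_chat_template` in place of the hand-built `messages` list.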
Training Details
Method: Why ORPO over DPO?
ORPO (Odds-Ratio Preference Optimisation, Hong et al., 2024) was chosen for three concrete reasons:
1. No reference model. DPO keeps a frozen copy of the original model in memory alongside the model being trained, which costs an additional 3–4 GB of VRAM. On a Colab T4 (15 GB total), this ruled out 7B+ base models. ORPO eliminates the reference model entirely.
2. Combined SFT + preference in one pass. DPO is normally run after a separate supervised fine-tuning step. ORPO combines both objectives into a single training loop, which removes one training stage and roughly halves the compute requirement.
3. Better performance on small datasets. Published benchmarks show ORPO outperforms DPO when training data is < 500 pairs. The Tenacious-Bench training set has 169 pairs, well within this regime.
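To make the "single pass" point concrete, here is a scalar sketch of the ORPO objective from Hong et al. (2024): the usual SFT negative log-likelihood on the chosen response plus a weighted odds-ratio term, with no reference model anywhere in the computation. It works on length-normalised log-probabilities rather than real model outputs, and the weight `lam=0.1` is an assumed illustrative value, not necessarily the one used to train this adapter.

```python
import math


def log_odds(avg_logp: float) -> float:
    """log(p / (1 - p)) for p = exp(avg_logp), with avg_logp < 0."""
    return avg_logp - math.log1p(-math.exp(avg_logp))


def orpo_loss(avg_logp_chosen: float, avg_logp_rejected: float, lam: float = 0.1) -> float:
    """SFT loss on the chosen response plus the weighted odds-ratio penalty."""
    # SFT term: negative mean token log-likelihood of the chosen response.
    nll = -avg_logp_chosen
    # Odds-ratio term: -log(sigmoid(log-odds ratio)), written as log(1 + e^-x).
    log_odds_ratio = log_odds(avg_logp_chosen) - log_odds(avg_logp_rejected)
    odds_ratio_term = math.log1p(math.exp(-log_odds_ratio))
    return nll + lam * odds_ratio_term


# The loss is lower when the chosen response is the more likely one.
print(orpo_loss(-0.2, -1.5), orpo_loss(-1.5, -0.2))
```

Both terms depend only on the policy being trained, which is why the frozen reference copy that DPO needs never has to be loaded.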
Training Configuration
| Parameter | Value |
|---|---|
| Base model | unsloth/Qwen2.5-1.5B-Instruct-bnb-4bit |
| LoRA rank (r) | 16 |
| LoRA alpha | 16 |
| LoRA dropout | 0 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Trainable parameters | 18,087,936 (1.18% of total) |
| Training pairs | 169 (train split) |
| Eval pairs | 93 (dev split) |
| Epochs | 10 |
| Total steps | 200 |
| Effective batch size | 8 |
| Learning rate | 8e-5 |
| Optimizer | AdamW (8-bit via Unsloth) |
| Max sequence length | 2,048 tokens |
| Random seed | 3407 |
| Hardware | Google Colab T4 (free tier, 15 GB VRAM) |
| Training time | ~17 minutes (1,009 seconds) |
| Adapter size on disk | ~74 MB |
Training Progression
| Step | Train Loss | Eval Loss | Rewards Accuracy | Rewards Margin |
|---|---|---|---|---|
| 0 | 3.25 | — | — | — |
| 40 | ~0.85 | ~0.90 | 100% | — |
| 100 | ~0.35 | ~0.50 | 100% | ~0.55 |
| 200 | 0.1411 | 0.3851 | 100% | 0.6934 |
- Loss reduced by 95.7% over 200 steps
- 100% preference accuracy (rewards) reached at step 40 and maintained through step 200
- No overfitting detected (eval loss converges and plateaus; does not diverge)
Note: An initial 60-step run achieved only 8.2% held-out accuracy — the model had not converged. The final 200-step run was required to reach 85.2%.
Evaluation Results
Evaluated on 61 sealed held-out pairs — all representing cases where the Conversion Engine made the wrong decision. The judge must correctly identify the failure.
Summary
| Variant | Correct | Accuracy | 95% CI |
|---|---|---|---|
| No judge (baseline) | 0 / 61 | 0.0% | [0.00, 0.00] |
| ORPO judge (this model) | 52 / 61 | 85.2% | [0.77, 0.93] |
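The card does not state how the 95% interval was computed. One common choice for a small evaluation set is a percentile bootstrap over the 61 per-pair outcomes, sketched below. The function name, resample count, and seed are assumptions for illustration, and the result will vary slightly with the seed.

```python
import random
from typing import List, Tuple


def bootstrap_accuracy_ci(
    outcomes: List[int],
    n_resamples: int = 10_000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for mean accuracy."""
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample the outcomes with replacement and record each mean.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


outcomes = [1] * 52 + [0] * 9  # 52 correct, 9 incorrect held-out pairs
low, high = bootstrap_accuracy_ci(outcomes)
print(f"accuracy={sum(outcomes) / len(outcomes):.3f}  95% CI=[{low:.2f}, {high:.2f}]")
```

With only 61 pairs the interval is wide, so small differences between adapter versions on this set should be read with caution. A baseline with zero correct pairs resamples to zero every time, which matches the degenerate [0.00, 0.00] interval in the table.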
Per-Probe Breakdown
| Probe | Failure Description | Held-out | Correct | Accuracy | Status |
|---|---|---|---|---|---|
| A07 | Anti-offshore disqualifier | 6 | 6 | 100% | ✅ |
| B03 | Funding-tier mismatch | 6 | 5 | 83% | ✅ |
| B04 | Low-confidence funding | 5 | 5 | 100% | ✅ |
| C02 | Bench commitment ignored | 6 | 4 | 67% | ⚠️ |
| C04 | Regulatory caveat omitted | 6 | 3 | 50% | ⚠️ |
| D05 | Soft rejection doubled down | 6 | 6 | 100% | ✅ |
| E01 | Cross-thread context leak | 6 | 6 | 100% | ✅ |
| E02 | Generic peer company names | 6 | 4 | 67% | ⚠️ |
| E03 | Opt-out channel ignored | 6 | 5 | 83% | ✅ |
| G03 | C-level escalation missed | 8 | 8 | 100% | ✅ |
| Total | | 61 | 52 | 85.2% | |
Confusion Matrix Summary
The 9 misclassified held-out pairs break down as:
| Probe | Misses | Root cause |
|---|---|---|
| C02 | 2 | No structured prior_commitments field; judge infers from prose |
| C04 | 3 | Regulated-industry examples underrepresented in training |
| E02 | 2 | Peer-name specificity threshold is subjective without a reference list |
| B03 | 1 | Borderline funding-tier case with ambiguous signal |
| E03 | 1 | Partial opt-out (sms only) with email channel active |
All 52 correct decisions were true positives — the judge correctly identified the failure class and recommended the right action.
Inference Latency
Based on prefill vs. decode phase analysis (see latency breakdown blog post):
| Phase | Bottleneck | Scales with | Cost |
|---|---|---|---|
| Prefill | GPU compute (FLOP/s) | Prompt token count | ~0.2 ms/token |
| Decode | Memory bandwidth (GB/s) | Output token count | ~2 ms/token |
- Observed end-to-end latency: ~200ms on T4 (batch size 1)
- Typical token ratio: ~750 prompt tokens / ~60 output tokens (12:1)
- Dominant phase: Prefill (compute-bound; dominates at high prompt-to-output ratio)
- Optimization priority: Reduce prompt length, not output length
Reducing average prompt length from ~750 to ~400 tokens is projected to reduce total latency by 30–35%.
This latency profile is designed for async pre-send queues, not real-time filtering.
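The per-token costs in the table can be turned into a back-of-envelope estimate. The sketch below assumes latency is simply linear in token counts with no fixed overhead, which is a simplification. The function name is hypothetical and the constants are the approximate figures from the table.

```python
PREFILL_MS_PER_TOKEN = 0.2  # approximate prefill cost from the table
DECODE_MS_PER_TOKEN = 2.0   # approximate decode cost from the table


def estimate_latency_ms(prompt_tokens: int, output_tokens: int) -> dict:
    """Linear latency estimate: prefill scales with prompt, decode with output."""
    prefill = prompt_tokens * PREFILL_MS_PER_TOKEN
    decode = output_tokens * DECODE_MS_PER_TOKEN
    return {"prefill_ms": prefill, "decode_ms": decode, "total_ms": prefill + decode}


baseline = estimate_latency_ms(750, 60)
shorter = estimate_latency_ms(400, 60)
saving = 1 - shorter["total_ms"] / baseline["total_ms"]
print(baseline, shorter, f"saving={saving:.0%}")
```

Under these assumptions the 750-token prompt comes out at about 270 ms and the 400-token prompt at about 200 ms, a saving of roughly 26%. Treat this as a rough guide only: the linear model will not exactly reproduce the observed end-to-end figure or the projection above, but it does show why trimming the prompt helps more than trimming the output at this token ratio.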
Limitations
Active Limitations (v0.1)
1. C02 — Bench commitment probe: 67% accuracy
The context schema has no structured prior_commitments field. The judge must infer commitment windows from free-text rationale strings, which is unreliable for edge cases (e.g., commitment window ending today vs. ending tomorrow). The v0.2 schema adds an explicit prior_commitments: [{starts, ends, type}] array.
2. C04 — Regulated-industry caveat probe: 50% accuracy
Training data included very few regulated-industry examples (SOX post-IPO, GDPR erasure requests, HIPAA-adjacent contexts). The judge cannot reliably detect when a caveat is required for fintech, healthcare, or government prospects. Do not deploy against regulated-industry verticals without retraining on a regulated-industry probe set.
3. Single-turn only
The judge evaluates one context object + one agent output. It cannot detect failures that span a multi-turn conversation thread (e.g., commitment made in message 3 violated in message 7).
4. Signal lag
Ground truth depends on CRM signals recorded at annotation time. A prospect who revoked an opt-out or changed their role after the signal was recorded will be incorrectly classified.
5. English only
All training data is in English. Not validated for non-English outreach.
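For limitation 1, the planned `prior_commitments: [{starts, ends, type}]` array would make the "ending today vs. ending tomorrow" edge case a simple date comparison instead of an inference from prose. The sketch below assumes ISO-format date strings and an inclusive end date. The helper name, the `bench_hold` type, and the sample dates are hypothetical, and the actual v0.2 schema may differ.

```python
from datetime import date
from typing import Dict, List


def active_commitments(
    prior_commitments: List[Dict[str, str]], today: date
) -> List[Dict[str, str]]:
    """Return commitments whose window covers `today` (end date inclusive by assumption)."""
    return [
        c for c in prior_commitments
        if date.fromisoformat(c["starts"]) <= today <= date.fromisoformat(c["ends"])
    ]


# Hypothetical commitment record following the {starts, ends, type} shape.
commitments = [
    {"starts": "2026-04-01", "ends": "2026-05-02", "type": "bench_hold"},
]
print(active_commitments(commitments, date(2026, 5, 2)))  # window ends today
print(active_commitments(commitments, date(2026, 5, 3)))  # window ended yesterday
```

Whether the end date should count as inside the window is a product decision. The point of the structured field is that the choice is made once in code rather than re-inferred by the judge on every call.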
Kill-Switch Conditions
Remove this adapter and revert to the deterministic rule layer if:
- False-positive rate exceeds 15% over a rolling 500-prospect window
- Any probe drops below 50% accuracy on a weekly 20-decision spot-check
- Adapter fails to load or produces malformed output on >1% of calls
- A Tier-1 brand-damage event occurs that was not blocked, traced to a known probe
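The first and third conditions are simple enough to track automatically. The following is a minimal sketch of such a monitor, assuming each judged prospect is later labelled as a false positive or not (for example by human review). The class name and the rule of waiting for a full window before evaluating are assumptions. The weekly spot-check and brand-damage conditions need human input and are not covered.

```python
from collections import deque


class KillSwitchMonitor:
    """Tracks the rolling false-positive rate and the malformed-output rate."""

    def __init__(self, window: int = 500, max_fp_rate: float = 0.15,
                 max_malformed_rate: float = 0.01):
        self.window = window
        self.max_fp_rate = max_fp_rate
        self.max_malformed_rate = max_malformed_rate
        self.false_positives = deque(maxlen=window)  # keeps only the latest `window` outcomes
        self.calls = 0
        self.malformed = 0

    def record(self, false_positive: bool, malformed: bool = False) -> None:
        self.false_positives.append(bool(false_positive))
        self.calls += 1
        self.malformed += int(malformed)

    def should_revert(self) -> bool:
        # Assumption: do not evaluate until one full window has been observed.
        if self.calls < self.window:
            return False
        fp_rate = sum(self.false_positives) / len(self.false_positives)
        malformed_rate = self.malformed / self.calls
        return fp_rate > self.max_fp_rate or malformed_rate > self.max_malformed_rate


monitor = KillSwitchMonitor(window=10)
for _ in range(10):
    monitor.record(false_positive=False)
print(monitor.should_revert())
```

The `deque` with `maxlen` gives the rolling window for free: once it is full, each new outcome pushes the oldest one out.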
Environmental Cost
| Phase | Cost |
|---|---|
| Dataset generation (trace-derived + programmatic) | $0.00 |
| Multi-LLM synthesis (OpenRouter) | ~$1.50 |
| ORPO training (Colab T4 free tier, 17 min) | $0.00 |
| Inference per prospect (local adapter) | $0.00 |
| Total | ~$1.50 |
Training Data
Trained on bethelhem21/tenacious-bench:
- 323 preference pairs across 10 failure probes
- 169 training pairs used for this adapter
- Inter-rater agreement: Cohen's κ = 1.000
- Contamination check: PASS (0 n-gram violations)
- Authoring modes: trace_derived (90), programmatic (73), multi_llm (120), hand_authored (40)
Citation
@misc{tenacious-judge-lora-2026,
author = {Bethelhem Abay},
title = {Tenacious Judge LoRA: Preference-Tuned B2B Sales Outreach Judge},
year = {2026},
publisher = {HuggingFace},
url = {https://huggingface.co/bethelhem21/tenacious-judge-lora}
}
References
@article{hong2024orpo,
title = {ORPO: Monolithic Preference Optimization without Reference Model},
author = {Hong, Jiwoo and Lee, Noah and Thorne, James},
journal = {arXiv preprint arXiv:2403.07691},
year = {2024}
}
@article{rafailov2023dpo,
title = {Direct Preference Optimization: Your Language Model is Secretly a Reward Model},
author = {Rafailov, Rafael and Sharma, Archit and Mitchell, Eric and Manning, Christopher D. and Ermon, Stefano and Finn, Chelsea},
journal = {arXiv preprint arXiv:2305.18290},
year = {2023}
}
Acknowledgments
This work would not have been possible without:
Mentors: My mentors Abdulhamid and Temesgen, who guided me through choosing ORPO over DPO and pushed me to run an inter-rater agreement (IRA) check before training. That conversation, about why label reliability must come before model training, is the reason this project has κ = 1.000 in the header and not a footnote about noisy labels.
Yonatan Wondimu (Community Manager) — for hands-on guidance with HuggingFace dataset and model publishing, and for the daily theory and reflective questions that pushed me to articulate my reasoning instead of just shipping code.
10 Academy: The TRP1 tutors for daily standups, debugging support, and technical tutorials throughout Week 11. The kind of infrastructure that makes a Colab-T4 experiment feel like real research.
Cohort: My TRP1 cohort for the daily accountability. You all made the impossible feel possible.
Adapter built as part of the 10 Academy TRP1 Sales Agent Evaluation Bench challenge (Week 11, 2026). All training data is fully synthetic — no real companies, individuals, or emails.
