Model Card: Tenacious-Bench SimPO Judge Adapter
Model: mistire37/tenacious-bench-lora-adapter
Type: LoRA adapter (Path B: preference-tuned judge/critic)
Backbone: Qwen 2.5 0.5B (Instruct)
Algorithm: SimPO (Simple Preference Optimization, Meng et al. NeurIPS 2024)
Author: Mistire Daniel
Date: 2026-05-01
Intended Use
This adapter is a post-generation judge deployed as a rejection-sampling layer in front of the Tenacious Intelligence Corporation Week 10 outreach generator. It scores every draft against the five Tenacious tone markers and honesty-gate constraints before the email is sent.
Primary use: Score B2B outreach email drafts for compliance with Tenacious-specific policy constraints (confidence-calibrated phrasing, ICP segment routing, bench capacity honesty, tone marker preservation, thread integrity).
Not intended for: General-purpose email generation, non-Tenacious sales contexts, or any domain outside B2B engineering staffing outreach.
Training Data
- Dataset: Tenacious-Bench v0.1 training partition (102 tasks → 79 SimPO preference pairs after quality filtering)
- Format: SimPO preference pairs (chosen: policy-compliant DeepSeek V3 rewrite; rejected: probe-triggered failure)
- Failure families covered: F1 (Confidence-Unaware Phrasing), F2 (ICP Classification), F3 (Bench Over-Commitment), F4 (Tone Drift), F5 (Thread/Data Integrity)
- Preference-leakage prevention: Chosen rewrites generated by DeepSeek V3 (DeepSeek family); quality judge is Qwen3 (a different model family), per Li et al. 2025
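The pair format above can be made concrete. The field names (prompt / chosen / rejected) follow TRL's standard preference-dataset schema; the email content below is an invented stand-in for illustration, not a row from Tenacious-Bench:

```python
import json

# One SimPO preference pair in TRL's "prompt"/"chosen"/"rejected" schema.
# The text is a hypothetical example of an F3 (bench over-commitment) pair.
pair = {
    "prompt": (
        "Draft an outreach email to a Series-B fintech CTO. "
        "Bench status: 2 engineers available, 1 pending rotation."
    ),
    # Policy-compliant rewrite (honest about capacity)
    "chosen": (
        "We currently have two engineers available and could confirm a "
        "third within two weeks, pending an internal rotation."
    ),
    # Probe-triggered failure: an unconditional capacity promise
    "rejected": (
        "We can staff a full five-person team starting Monday, guaranteed."
    ),
}

# Basic schema check before appending the pair to the training JSONL.
assert set(pair) == {"prompt", "chosen", "rejected"}
line = json.dumps(pair)
```

A training file is then just one such JSON object per line.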
Training Configuration
| Parameter | Value |
|---|---|
| Backbone | Qwen 2.5 0.5B (Instruct) |
| Algorithm | SimPO |
| LoRA rank (r) | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.05 |
| Target modules | q_proj, v_proj |
| Epochs | 3 |
| Batch size | 2 (effective: 8 with grad accum=4) |
| Learning rate | 5e-5 |
| SimPO beta | 2.0 |
| SimPO gamma | 0.5 |
| Compute | Google Colab T4 (free tier) |
| Framework | Unsloth + HuggingFace TRL |
| Seed | 42 |
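A minimal sketch of this configuration using TRL's CPOTrainer with loss_type="simpo" (the setup named in the evaluation table). Hyperparameters mirror the table; output_dir, the train_dataset variable, and cpo_alpha=0.0 (disabling the auxiliary NLL term for pure SimPO) are assumptions not stated in the card:

```python
from peft import LoraConfig
from trl import CPOConfig, CPOTrainer

# LoRA configuration, matching the table above.
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

# SimPO is implemented in TRL as a loss variant of CPOTrainer.
args = CPOConfig(
    output_dir="tenacious-bench-lora-adapter",  # placeholder path
    loss_type="simpo",
    beta=2.0,
    simpo_gamma=0.5,          # target reward margin
    cpo_alpha=0.0,            # assumption: 0 disables the NLL term (pure SimPO)
    learning_rate=5e-5,
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,  # effective batch size 8
    seed=42,
)

# trainer = CPOTrainer(
#     model="Qwen/Qwen2.5-0.5B-Instruct",
#     args=args,
#     train_dataset=pref_pairs,   # the 79 preference pairs
#     peft_config=peft_config,
# )
# trainer.train()
```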
Evaluation Results
Evaluated on the sealed 41-task held-out partition. Judge: DeepSeek V3-0324, a different model family from the trained Qwen backbone, so judge-family preference leakage toward the trained adapter is avoided at eval time.
| Metric | Value | Notes |
|---|---|---|
| Delta A (trained vs. baseline) | +0.000 | 95% CI [-0.122, +0.122], p = 0.585; not significant |
| Delta B (trained vs. prompt-eng) | -0.341 | 95% CI [-0.488, -0.195]; prompt engineering dominates |
| Trained held-out pass rate | 14.6% (6/41) | CPOTrainer(loss_type=simpo) on Qwen 2.5 0.5B |
| Baseline pass rate | 14.6% (6/41) | DeepSeek V3, no system prompt |
| Prompt-eng pass rate | 48.8% (20/41) | DeepSeek V3, 10-rule system prompt |
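The card does not state how the confidence intervals were computed; one standard choice for per-task pass/fail outcomes on a shared task set is a paired bootstrap, sketched below on synthetic stand-in data (the real per-task result vectors are not reproduced here):

```python
import random

def paired_bootstrap_delta(a, b, n_boot=10_000, seed=42):
    """95% bootstrap CI for mean(a) - mean(b), where a[i] and b[i] are
    0/1 pass outcomes of the two systems on the same task i."""
    assert len(a) == len(b)
    rng = random.Random(seed)
    n = len(a)
    deltas = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample tasks, keep pairing
        deltas.append(sum(a[i] - b[i] for i in idx) / n)
    deltas.sort()
    return deltas[int(0.025 * n_boot)], deltas[int(0.975 * n_boot)]

# Synthetic stand-in: 41 paired outcomes with 6/41 passes each, but only
# partial overlap in which tasks pass (the true overlap is not published).
trained  = [1] * 6 + [0] * 35
baseline = [0] * 3 + [1] * 6 + [0] * 32
lo, hi = paired_bootstrap_delta(trained, baseline)
```

With equal pass rates the point estimate is 0, and the CI width is driven by how many tasks the two systems disagree on.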
Per-failure-family breakdown (trained vs. baseline):
| Family | n | Trained | Baseline | Δ |
|---|---|---|---|---|
| F1 - Confidence-Unaware Phrasing | 16 | 6.2% | 18.8% | -12.5% |
| F2 - ICP Classification | 8 | 25.0% | 12.5% | +12.5% |
| F3 - Bench Over-Commitment | 6 | 33.3% | 33.3% | 0.0% |
| F4 - Tone Drift | 7 | 14.3% | 0.0% | +14.3% |
| F5 - Thread/Data Integrity | 4 | 0.0% | 0.0% | 0.0% |
Honest interpretation:
Delta A is flat and Delta B is negative. The trained adapter (Qwen 2.5 0.5B) does not outperform a carefully prompted frontier model (DeepSeek V3). Three factors explain this:
- Scale asymmetry: The comparison pits a fine-tuned 0.5B model against DeepSeek V3, a ~671B-parameter MoE frontier model (~37B active per token). At 0.5B, the backbone's generation quality is the binding constraint; preference training cannot compensate for limited capacity on a complex, multi-constraint task.
- Role confusion: Path B trains a judge/critic, not a generator. The correct production deployment routes baseline agent outputs through the trained adapter as a scoring/rejection layer rather than replacing the generator. This ablation evaluated the adapter as a generator, which is not its intended role.
- Training data volume: 79 preference pairs (filtered from 102 tasks) on a 0.5B backbone. Per the LIMA paper, quality dominates quantity at small scale, but the task complexity here (five interacting constraint families) likely requires more diverse examples to generalize.
Production recommendation: do not deploy as a standalone generator. Deploy as a rejection-sampling layer in front of the prompt-engineered DeepSeek pipeline: score every draft, regenerate on fail. Expected combined pass rate: higher than either component alone.
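The score-and-regenerate loop above can be sketched as a small routing function. The generate and judge_passes callables, max_attempts=3, and the escalate-on-exhaustion fallback are all illustrative assumptions, not part of the shipped pipeline:

```python
def rejection_sample(generate, judge_passes, max_attempts=3):
    """Route generator drafts through the judge: return the first draft
    that passes, along with the attempt count. If no draft passes within
    max_attempts, return None so the caller can escalate to human review
    instead of sending a non-compliant email."""
    for attempt in range(1, max_attempts + 1):
        draft = generate(attempt)          # e.g. prompt-engineered DeepSeek call
        if judge_passes(draft):            # e.g. trained adapter as scorer
            return draft, attempt
    return None, max_attempts
```

Usage: wire the prompt-engineered generator into generate and the adapter's pass/fail verdict into judge_passes; anything returning None goes to a human queue.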
Limitations
- Scale asymmetry in ablation: Delta A/B compare Qwen 2.5 0.5B (trained) against DeepSeek V3, a ~671B-parameter MoE (baseline/prompt-eng). A fair comparison would hold the backbone constant.
- Domain-specific: Trained exclusively on Tenacious-style B2B outreach. Will not generalize to other sales domains without retraining.
- Small training set: 79 preference pairs. Performance on rare failure modes (F5 timezone arithmetic, F2 acqui-hire disambiguation) may be inconsistent.
- Rubric-dependent: The adapter learns to satisfy the Tenacious-Bench v0.1 rubric. If the style guide changes, the adapter must be retrained.
- No multi-turn capability: Trained on single-turn outreach tasks. Multi-turn trajectory failures (Path C territory) are not addressed.
- Held-out size: 41 tasks limits statistical power. Wide confidence intervals (roughly ±12pp) make small effects undetectable.
Kill-Switch Trigger
Deploy with monitoring. Trigger rollback if:
- Combined rejection-sampling pipeline pass rate drops below 60% on any 7-day rolling window
- Any F3 (bench over-commitment) task produces a confirmed capacity promise
- Any F5 (opt-out) task produces outreach to an opted-out prospect
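A minimal sketch of the rolling-window trigger for the first rule. The threshold and window mirror the rule above; the min_samples guard is an added assumption to avoid firing on a near-empty window:

```python
from collections import deque

class RollingPassRateMonitor:
    """Track pass/fail outcomes over a sliding time window and flag
    rollback when the pass rate drops below the threshold."""

    def __init__(self, window_seconds=7 * 24 * 3600, threshold=0.60,
                 min_samples=20):
        self.window = window_seconds
        self.threshold = threshold
        self.min_samples = min_samples   # assumption: guard against sparse data
        self.events = deque()            # (timestamp, passed) pairs, time-ordered

    def record(self, timestamp, passed):
        self.events.append((timestamp, passed))
        # Drop events that have aged out of the window.
        while self.events and self.events[0][0] < timestamp - self.window:
            self.events.popleft()

    def should_rollback(self):
        if len(self.events) < self.min_samples:
            return False
        rate = sum(p for _, p in self.events) / len(self.events)
        return rate < self.threshold
```

The F3/F5 rules are single-event triggers and need no windowing: one confirmed violation fires the rollback directly.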
Environmental Cost
- Training: ~1.5 kWh on Google Colab T4 (estimated)
- Inference: LoRA adapter adds ~15% latency overhead vs. base model
- No carbon offset applied
Citation
@misc{tenacious-bench-lora-2026,
  author = {Mistire Daniel},
  title  = {Tenacious-Bench SimPO Judge Adapter},
  year   = {2026},
  url    = {https://huggingface.co/mistire37/tenacious-bench-lora-adapter}
}
Attribution
- Backbone: Qwen 2.5 0.5B (Alibaba Cloud, Apache 2.0)
- Training framework: Unsloth (Apache 2.0), HuggingFace TRL (Apache 2.0)
- Algorithm: SimPO (Meng, Xia, and Chen, NeurIPS 2024)
- Preference-leakage prevention: Li et al. 2025
- Dataset: Tenacious-Bench v0.1 (CC-BY-4.0)