Model Card β€” Tenacious-Bench SimPO Judge Adapter

Model: mistire37/tenacious-bench-lora-adapter
Type: LoRA adapter (Path B — preference-tuned judge/critic)
Backbone: Qwen 2.5 0.5B (Instruct)
Algorithm: SimPO (Simple Preference Optimization; Meng et al., NeurIPS 2024)
Author: Mistire Daniel
Date: 2026-05-01


Intended Use

This adapter is a post-generation judge deployed as a rejection-sampling layer in front of the Tenacious Intelligence Corporation Week 10 outreach generator. It scores every draft against the five Tenacious tone markers and honesty-gate constraints before the email is sent.

Primary use: Score B2B outreach email drafts for compliance with Tenacious-specific policy constraints (confidence-calibrated phrasing, ICP segment routing, bench capacity honesty, tone marker preservation, thread integrity).

Not intended for: General-purpose email generation, non-Tenacious sales contexts, or any domain outside B2B engineering staffing outreach.


Training Data

  • Dataset: Tenacious-Bench v0.1 training partition (102 tasks → 79 SimPO preference pairs after quality filtering)
  • Format: SimPO preference pairs (chosen: policy-compliant DeepSeek V3 rewrite; rejected: probe-triggered failure)
  • Failure families covered: F1 (Confidence-Unaware Phrasing), F2 (ICP Classification), F3 (Bench Over-Commitment), F4 (Tone Drift), F5 (Thread/Data Integrity)
  • Preference-leakage prevention: chosen rewrites generated by DeepSeek V3 (DeepSeek family); quality judge is Qwen3 (a different family), per Li et al. 2025
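A single preference pair in the prompt/chosen/rejected schema that TRL's CPOTrainer consumes might look like the sketch below. The email text is invented for illustration; real pairs come from the Tenacious-Bench v0.1 training partition.

```python
# Illustrative SimPO preference pair in the prompt/chosen/rejected schema
# used by TRL's CPOTrainer. All content below is invented, not real
# Tenacious-Bench data.
pair = {
    "prompt": (
        "Draft a first-touch outreach email to a Series B fintech CTO. "
        "Bench capacity: 2 senior engineers, available in 3 weeks."
    ),
    "chosen": (
        "We currently have two senior engineers who could start in about "
        "three weeks, subject to final confirmation."  # calibrated, honest
    ),
    "rejected": (
        "We can staff your entire team next Monday, guaranteed!"  # F3 failure
    ),
}
```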

Training Configuration

| Parameter | Value |
| --- | --- |
| Backbone | Qwen 2.5 0.5B (Instruct) |
| Algorithm | SimPO |
| LoRA rank (r) | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.05 |
| Target modules | q_proj, v_proj |
| Epochs | 3 |
| Batch size | 2 (effective: 8 with grad accum = 4) |
| Learning rate | 5e-5 |
| SimPO beta | 2.0 |
| SimPO gamma | 0.5 |
| Compute | Google Colab T4 (free tier) |
| Framework | Unsloth + HuggingFace TRL |
| Seed | 42 |
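As a rough reconstruction (not the author's actual training script), the table above maps onto TRL's CPOTrainer run in SimPO mode, whose reference-free loss is -log σ(β·log π(y_w|x)/|y_w| − β·log π(y_l|x)/|y_l| − γ). Field names should be checked against your TRL and PEFT releases:

```python
# Sketch only: reconstructs the config table as TRL + PEFT config objects.
# Output path is a placeholder; verify field names against your TRL version.
from peft import LoraConfig
from trl import CPOConfig

lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

train_cfg = CPOConfig(
    output_dir="tenacious-bench-lora-adapter",
    loss_type="simpo",               # CPOTrainer's SimPO mode
    cpo_alpha=0.0,                   # pure SimPO: no behavior-cloning term
    beta=2.0,                        # reward scaling (SimPO beta)
    simpo_gamma=0.5,                 # target reward margin (SimPO gamma)
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,   # effective batch size 8
    learning_rate=5e-5,
    seed=42,
)
```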

Evaluation Results

Evaluated on the sealed 41-task held-out partition. Judge: DeepSeek V3-0324 (a different model family from the trained adapter's Qwen backbone, so judge-side preference leakage is avoided at eval time).

| Metric | Value | Notes |
| --- | --- | --- |
| Delta A (trained vs. baseline) | +0.000 | 95% CI [−0.122, +0.122], p = 0.585; not significant |
| Delta B (trained vs. prompt-eng) | −0.341 | 95% CI [−0.488, −0.195]; prompt engineering dominates |
| Trained held-out pass rate | 14.6% (6/41) | CPOTrainer(loss_type="simpo") on Qwen 2.5 0.5B |
| Baseline pass rate | 14.6% (6/41) | DeepSeek V3, no system prompt |
| Prompt-eng pass rate | 48.8% (20/41) | DeepSeek V3, 10-rule system prompt |
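The card does not state how these confidence intervals were computed. One standard choice for paired per-task pass/fail outcomes is a percentile bootstrap over the 41 shared tasks; the sketch below assumes that method, and the outcome vectors in any real run would come from the eval harness:

```python
import random

def bootstrap_delta_ci(trained, baseline, n_boot=10_000, alpha=0.05, seed=42):
    """Percentile-bootstrap CI for a paired pass-rate delta.

    trained/baseline: equal-length 0/1 outcome lists on the same tasks.
    Resampling whole tasks preserves the pairing between systems.
    """
    rng = random.Random(seed)
    n = len(trained)
    deltas = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]   # resample tasks
        deltas.append(sum(trained[i] - baseline[i] for i in idx) / n)
    deltas.sort()
    lo = deltas[int((alpha / 2) * n_boot)]
    hi = deltas[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```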

Per-failure-family breakdown (trained vs. baseline):

| Family | n | Trained | Baseline | Δ |
| --- | --- | --- | --- | --- |
| F1 — Confidence-Unaware Phrasing | 16 | 6.2% | 18.8% | −12.5% |
| F2 — ICP Classification | 8 | 25.0% | 12.5% | +12.5% |
| F3 — Bench Over-Commitment | 6 | 33.3% | 33.3% | 0.0% |
| F4 — Tone Drift | 7 | 14.3% | 0.0% | +14.3% |
| F5 — Thread/Data Integrity | 4 | 0.0% | 0.0% | 0.0% |

Honest interpretation:

Delta A is flat and Delta B is negative. The trained adapter (Qwen 2.5 0.5B) does not outperform a carefully prompted frontier model (DeepSeek V3). Three factors explain this:

  1. Scale asymmetry: The comparison pits a fine-tuned 0.5B model against DeepSeek V3, a ~671B-parameter MoE frontier model (~37B active parameters per token). At 0.5B, the backbone's generation quality is the binding constraint; preference training cannot compensate for limited capacity on a complex, multi-constraint task.
  2. Role confusion: Path B trains a judge/critic, not a generator. The correct production deployment routes baseline agent outputs through the trained adapter as a scoring/rejection layer rather than replacing the generator. This ablation evaluated the adapter as a generator, which is not its intended role.
  3. Training data volume: 79 preference pairs (filtered from 102 tasks) on a 0.5B backbone. Per the LIMA paper, quality dominates quantity at small scale, but the task complexity here (five interacting constraint families) likely requires more diverse examples to generalize.

Production recommendation: do not deploy as a standalone generator. Deploy as a rejection-sampling layer in front of the prompt-engineered DeepSeek pipeline: score every draft, regenerate on fail. Expected combined pass rate: higher than either component alone.
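The score-and-regenerate loop described above can be sketched as follows. Here `generate_draft` and `judge_score` are hypothetical stand-ins for the prompt-engineered DeepSeek generator and the trained adapter's scoring call; neither name comes from the actual pipeline.

```python
def rejection_sample(generate_draft, judge_score, threshold=0.5, max_attempts=3):
    """Score each draft; ship the first that passes, else the best seen.

    generate_draft and judge_score are hypothetical callables standing in
    for the prompt-engineered generator and the trained judge adapter.
    """
    best_draft, best_score = None, float("-inf")
    for _ in range(max_attempts):
        draft = generate_draft()
        score = judge_score(draft)
        if score >= threshold:
            return draft, score          # first passing draft ships
        if score > best_score:
            best_draft, best_score = draft, score
    return best_draft, best_score        # nothing passed: fall back to best
```

A fallback on the best-scoring draft (rather than refusing to send) is one design choice; a stricter deployment could return None and escalate to a human instead.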


Limitations

  1. Scale asymmetry in ablation: Delta A/B compare Qwen 2.5 0.5B (trained) against DeepSeek V3, a ~671B-parameter MoE (baseline/prompt-eng). A fair comparison would hold the backbone constant.
  2. Domain-specific: Trained exclusively on Tenacious-style B2B outreach. Will not generalize to other sales domains without retraining.
  3. Small training set: 79 preference pairs. Performance on rare failure modes (F5 timezone arithmetic, F2 acqui-hire disambiguation) may be inconsistent.
  4. Rubric-dependent: The adapter learns to satisfy the Tenacious-Bench v0.1 rubric. If the style guide changes, the adapter must be retrained.
  5. No multi-turn capability: Trained on single-turn outreach tasks. Multi-turn trajectory failures (Path C territory) are not addressed.
  6. Held-out size: 41 tasks limits statistical power; wide confidence intervals (±12 pp) make small effects undetectable.

Kill-Switch Trigger

Deploy with monitoring. Trigger rollback if:

  • Held-out pass rate drops below 60% on any 7-day rolling window
  • Any F3 (bench over-commitment) task produces a confirmed capacity promise
  • Any F5 (opt-out) task produces outreach to an opted-out prospect

Environmental Cost

  • Training: ~1.5 kWh on Google Colab T4 (estimated)
  • Inference: LoRA adapter adds ~15% latency overhead vs. base model
  • No carbon offset applied

Citation

@misc{tenacious-bench-lora-2026,
  author = {Mistire Daniel},
  title  = {Tenacious-Bench SimPO Judge Adapter},
  year   = {2026},
  url    = {https://huggingface.co/mistire37/tenacious-bench-lora-adapter}
}

Attribution

  • Backbone: Qwen 2.5 0.5B Instruct (Alibaba Cloud, Apache 2.0)
  • Training framework: Unsloth (Apache 2.0), HuggingFace TRL (Apache 2.0)
  • Algorithm: SimPO (Meng, Xia, and Chen, NeurIPS 2024)
  • Preference-leakage prevention: Li et al. 2025
  • Dataset: Tenacious-Bench v0.1 (CC-BY-4.0)