Model Card: Tenacious-Bench SimPO Judge Adapter
Model: mistire37/tenacious-bench-lora-adapter
Type: LoRA adapter (Path B: preference-tuned judge/critic)
Backbone: Qwen 2.5 0.5B (Instruct)
Algorithm: SimPO (Simple Preference Optimization, Meng et al. NeurIPS 2024)
Author: Mistire Daniel
Date: 2026-05-01
Intended Use
This adapter is a post-generation judge deployed as a rejection-sampling layer in front of the Tenacious Intelligence Corporation Week 10 outreach generator. It scores every draft against the five Tenacious tone markers and honesty-gate constraints before the email is sent.
Primary use: Score B2B outreach email drafts for compliance with Tenacious-specific policy constraints (confidence-calibrated phrasing, ICP segment routing, bench capacity honesty, tone marker preservation, thread integrity).
Not intended for: General-purpose email generation, non-Tenacious sales contexts, or any domain outside B2B engineering staffing outreach.
Training Data
- Dataset: Tenacious-Bench v0.1 training partition (102 tasks → 79 SimPO preference pairs after quality filtering)
- Format: SimPO preference pairs (chosen: policy-compliant DeepSeek V3 rewrite; rejected: probe-triggered failure)
- Failure families covered: F1 (Confidence-Unaware Phrasing), F2 (ICP Classification), F3 (Bench Over-Commitment), F4 (Tone Drift), F5 (Thread/Data Integrity)
- Preference-leakage prevention: Chosen rewrites generated by DeepSeek V3 (DeepSeek family); quality judge is Qwen3 (a different model family), per Li et al. 2025
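The pair format above can be made concrete. The field names (prompt / chosen / rejected) follow TRL's standard preference-dataset schema; the email content below is an invented stand-in for illustration, not a row from Tenacious-Bench:

```python
import json

# One SimPO preference pair in TRL's "prompt"/"chosen"/"rejected" schema.
# The text is a hypothetical example of an F3 (bench over-commitment) pair.
pair = {
    "prompt": (
        "Draft an outreach email to a Series-B fintech CTO. "
        "Bench status: 2 engineers available, 1 pending rotation."
    ),
    # Policy-compliant rewrite (honest about capacity)
    "chosen": (
        "We currently have two engineers available and could confirm a "
        "third within two weeks, pending an internal rotation."
    ),
    # Probe-triggered failure: an unconditional capacity promise
    "rejected": (
        "We can staff a full five-person team starting Monday, guaranteed."
    ),
}

# Basic schema check before appending the pair to the training JSONL.
assert set(pair) == {"prompt", "chosen", "rejected"}
line = json.dumps(pair)
```

A training file is then just one such JSON object per line.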
Training Configuration
| Parameter | Value |
|---|---|
| Backbone | Qwen 2.5 0.5B (Instruct) |
| Algorithm | SimPO |
| LoRA rank (r) | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.05 |
| Target modules | q_proj, v_proj |
| Epochs | 3 |
| Batch size | 2 (effective: 8 with grad accum=4) |
| Learning rate | 5e-5 |
| SimPO beta | 2.0 |
| SimPO gamma | 0.5 |
| Compute | Google Colab T4 (free tier) |
| Framework | Unsloth + HuggingFace TRL |
| Seed | 42 |
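A minimal sketch of this configuration using TRL's CPOTrainer with loss_type="simpo" (the setup named in the evaluation table). Hyperparameters mirror the table; output_dir, the train_dataset variable, and cpo_alpha=0.0 (disabling the auxiliary NLL term for pure SimPO) are assumptions not stated in the card:

```python
from peft import LoraConfig
from trl import CPOConfig, CPOTrainer

# LoRA configuration, matching the table above.
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

# SimPO is implemented in TRL as a loss variant of CPOTrainer.
args = CPOConfig(
    output_dir="tenacious-bench-lora-adapter",  # placeholder path
    loss_type="simpo",
    beta=2.0,
    simpo_gamma=0.5,          # target reward margin
    cpo_alpha=0.0,            # assumption: 0 disables the NLL term (pure SimPO)
    learning_rate=5e-5,
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,  # effective batch size 8
    seed=42,
)

# trainer = CPOTrainer(
#     model="Qwen/Qwen2.5-0.5B-Instruct",
#     args=args,
#     train_dataset=pref_pairs,   # the 79 preference pairs
#     peft_config=peft_config,
# )
# trainer.train()
```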
Evaluation Results
Evaluated on the sealed 41-task held-out partition. Judge: DeepSeek V3-0324, a different model family from the trained Qwen backbone, so judge-family preference leakage toward the trained adapter is avoided at eval time.
| Metric | Value | Notes |
|---|---|---|
| Delta A (trained vs. baseline) | +0.000 | 95% CI [-0.122, +0.122], p = 0.585; not significant |
| Delta B (trained vs. prompt-eng) | -0.341 | 95% CI [-0.488, -0.195]; prompt engineering dominates |
| Trained held-out pass rate | 14.6% (6/41) | CPOTrainer(loss_type=simpo) on Qwen 2.5 0.5B |
| Baseline pass rate | 14.6% (6/41) | DeepSeek V3, no system prompt |
| Prompt-eng pass rate | 48.8% (20/41) | DeepSeek V3, 10-rule system prompt |
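The card does not state how the confidence intervals were computed; one standard choice for per-task pass/fail outcomes on a shared task set is a paired bootstrap, sketched below on synthetic stand-in data (the real per-task result vectors are not reproduced here):

```python
import random

def paired_bootstrap_delta(a, b, n_boot=10_000, seed=42):
    """95% bootstrap CI for mean(a) - mean(b), where a[i] and b[i] are
    0/1 pass outcomes of the two systems on the same task i."""
    assert len(a) == len(b)
    rng = random.Random(seed)
    n = len(a)
    deltas = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample tasks, keep pairing
        deltas.append(sum(a[i] - b[i] for i in idx) / n)
    deltas.sort()
    return deltas[int(0.025 * n_boot)], deltas[int(0.975 * n_boot)]

# Synthetic stand-in: 41 paired outcomes with 6/41 passes each, but only
# partial overlap in which tasks pass (the true overlap is not published).
trained  = [1] * 6 + [0] * 35
baseline = [0] * 3 + [1] * 6 + [0] * 32
lo, hi = paired_bootstrap_delta(trained, baseline)
```

With equal pass rates the point estimate is 0, and the CI width is driven by how many tasks the two systems disagree on.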
Per-failure-family breakdown (trained vs. baseline):
| Family | n | Trained | Baseline | Δ |
|---|---|---|---|---|
| F1 - Confidence-Unaware Phrasing | 16 | 6.2% | 18.8% | -12.5% |
| F2 - ICP Classification | 8 | 25.0% | 12.5% | +12.5% |
| F3 - Bench Over-Commitment | 6 | 33.3% | 33.3% | 0.0% |
| F4 - Tone Drift | 7 | 14.3% | 0.0% | +14.3% |
| F5 - Thread/Data Integrity | 4 | 0.0% | 0.0% | 0.0% |
Honest interpretation:
Delta A is flat and Delta B is negative. The trained adapter (Qwen 2.5 0.5B) does not outperform a carefully prompted frontier model (DeepSeek V3). Three factors explain this:
- Scale asymmetry: The comparison pits a fine-tuned 0.5B model against DeepSeek V3, a ~671B-parameter MoE frontier model (~37B active per token). At 0.5B, the backbone's generation quality is the binding constraint; preference training cannot compensate for limited capacity on a complex, multi-constraint task.
- Role confusion: Path B trains a judge/critic, not a generator. The correct production deployment routes baseline agent outputs through the trained adapter as a scoring/rejection layer rather than replacing the generator. This ablation evaluated the adapter as a generator, which is not its intended role.
- Training data volume: 79 preference pairs (filtered from 102 tasks) on a 0.5B backbone. Per the LIMA paper, quality dominates quantity at small scale, but the task complexity here (five interacting constraint families) likely requires more diverse examples to generalize.
Production recommendation: do not deploy as a standalone generator. Deploy as a rejection-sampling layer in front of the prompt-engineered DeepSeek pipeline: score every draft, regenerate on fail. Expected combined pass rate: higher than either component alone.
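The score-and-regenerate loop above can be sketched as a small routing function. The generate and judge_passes callables, max_attempts=3, and the escalate-on-exhaustion fallback are all illustrative assumptions, not part of the shipped pipeline:

```python
def rejection_sample(generate, judge_passes, max_attempts=3):
    """Route generator drafts through the judge: return the first draft
    that passes, along with the attempt count. If no draft passes within
    max_attempts, return None so the caller can escalate to human review
    instead of sending a non-compliant email."""
    for attempt in range(1, max_attempts + 1):
        draft = generate(attempt)          # e.g. prompt-engineered DeepSeek call
        if judge_passes(draft):            # e.g. trained adapter as scorer
            return draft, attempt
    return None, max_attempts
```

Usage: wire the prompt-engineered generator into generate and the adapter's pass/fail verdict into judge_passes; anything returning None goes to a human queue.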
Limitations
- Scale asymmetry in ablation: Delta A/B compare Qwen 2.5 0.5B (trained) against DeepSeek V3, a ~671B-parameter MoE (baseline/prompt-eng). A fair comparison would hold the backbone constant.
- Domain-specific: Trained exclusively on Tenacious-style B2B outreach. Will not generalize to other sales domains without retraining.
- Small training set: 79 preference pairs. Performance on rare failure modes (F5 timezone arithmetic, F2 acqui-hire disambiguation) may be inconsistent.
- Rubric-dependent: The adapter learns to satisfy the Tenacious-Bench v0.1 rubric. If the style guide changes, the adapter must be retrained.
- No multi-turn capability: Trained on single-turn outreach tasks. Multi-turn trajectory failures (Path C territory) are not addressed.
- Held-out size: 41 tasks limits statistical power. Wide confidence intervals (roughly ±12pp) make small effects undetectable.
Kill-Switch Trigger
Deploy with monitoring. Trigger rollback if:
- Combined rejection-sampling pipeline pass rate drops below 60% on any 7-day rolling window
- Any F3 (bench over-commitment) task produces a confirmed capacity promise
- Any F5 (opt-out) task produces outreach to an opted-out prospect
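A minimal sketch of the rolling-window trigger for the first rule. The threshold and window mirror the rule above; the min_samples guard is an added assumption to avoid firing on a near-empty window:

```python
from collections import deque

class RollingPassRateMonitor:
    """Track pass/fail outcomes over a sliding time window and flag
    rollback when the pass rate drops below the threshold."""

    def __init__(self, window_seconds=7 * 24 * 3600, threshold=0.60,
                 min_samples=20):
        self.window = window_seconds
        self.threshold = threshold
        self.min_samples = min_samples   # assumption: guard against sparse data
        self.events = deque()            # (timestamp, passed) pairs, time-ordered

    def record(self, timestamp, passed):
        self.events.append((timestamp, passed))
        # Drop events that have aged out of the window.
        while self.events and self.events[0][0] < timestamp - self.window:
            self.events.popleft()

    def should_rollback(self):
        if len(self.events) < self.min_samples:
            return False
        rate = sum(p for _, p in self.events) / len(self.events)
        return rate < self.threshold
```

The F3/F5 rules are single-event triggers and need no windowing: one confirmed violation fires the rollback directly.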
Environmental Cost
- Training: ~1.5 kWh on Google Colab T4 (estimated)
- Inference: LoRA adapter adds ~15% latency overhead vs. base model
- No carbon offset applied
Citation
@misc{tenacious-bench-lora-2026,
  author = {Mistire Daniel},
  title  = {Tenacious-Bench SimPO Judge Adapter},
  year   = {2026},
  url    = {https://huggingface.co/mistire37/tenacious-bench-lora-adapter}
}
Attribution
- Backbone: Qwen 2.5 0.5B (Alibaba Cloud, Apache 2.0)
- Training framework: Unsloth (Apache 2.0), HuggingFace TRL (Apache 2.0)
- Algorithm: SimPO (Meng, Xia, and Chen, NeurIPS 2024)
- Preference-leakage prevention: Li et al. 2025
- Dataset: Tenacious-Bench v0.1 (CC-BY-4.0)