Model Card — Tenacious-Bench LoRA Adapter (Path A)

Author: Gashaw Bekele | gashaw@10academy.org
Date: 2026-04-29
Adapter ID: gashawbekele/tenacious-bench-lora-path-a
Base Model: unsloth/Qwen2.5-0.5B-Instruct

Model Description

This is a LoRA adapter fine-tuned on Tenacious-Bench SFT pairs for the TRP1 Week 11 Path A experiment. The adapter targets surface-level generation quality in B2B outreach email tasks: enforcing a five-marker tone profile (Direct, Grounded, Honest, Professional, Non-condescending) and suppressing 28 banned vendor-speak phrases identified in the Tenacious Style Guide v2.

The adapter was trained on 221 curated instruction-response pairs drawn from the Tenacious-Bench v0.1 training partition (233 tasks total; 12 discarded after failing 3-retry programmatic filter). Training used Unsloth's efficient LoRA implementation on a free-tier Colab T4 GPU.

Intended Use

Use	Supported
Generating prospect-facing B2B outreach emails in Tenacious style	Yes
Evaluating generation quality on Tenacious-Bench v0.1 tasks	Yes
General-purpose instruction following	No — not optimised beyond this domain
Production deployment	No — published as a reproducible research baseline

Training Data

Field	Value
Dataset	Tenacious-Bench v0.1 training partition
Tasks	221 SFT pairs (out of 233 processed; 94.8% pass rate)
Format	Qwen2.5 chat-template: `<\|im_start\|>system/user/assistant<\|im_end\|>`
Authoring modes	Trace-derived (28.4%), Programmatic (28.0%), Multi-LLM (39.6%), Hand-authored (4.0%)
Judge rotation	Claude-authored tasks judged by Qwen; Qwen-authored judged by Claude
Quality filter	All pairs pass full programmatic rubric before inclusion

Hyperparameters

Parameter	Value	Rationale
Base model	`unsloth/Qwen2.5-0.5B-Instruct`	Free T4 budget; Unsloth-optimised
LoRA r / alpha	16 / 16	Unsloth Qwen guide default; sufficient for style adaptation
Max seq length	2048	Covers longest SFT pair (~1400 tokens)
Epochs	3	LIMA standard for small SFT datasets; convergence confirmed at epoch 3
Learning rate	2e-4	Tülu 3 LoRA SFT default for instruction-following tasks
Batch size	2 per device × 4 grad accum = 8 effective	Maximum fitting T4 16GB
Precision	fp16	T4 does not support bf16
Seed	42	Full reproducibility
Warmup steps	5	6.25% of 80 total steps

Training Run Summary

Metric	Value
Total steps	80
Loss at step 10	3.08
Loss at step 80	0.42
Final epoch loss	0.99
Wall-clock time	~2 minutes (T4 16GB, free Colab)
GPU cost	$0.00 (free tier)

The loss curve confirms convergence: rapid early drop (step 10–40), plateau from step 60 onward. No sign of overfitting on the 221-pair training set.

Evaluation Results

Evaluated on Tenacious-Bench v0.1 held-out partition (n=3 tasks).

Comparison	Condition A Mean	Condition B Mean	Delta	p-value	Significant
Delta A: Trained vs Baseline	0.5315	0.5315	0.0	1.0	No
Delta B: Trained vs Prompted	0.5315	0.5315	0.0	1.0	No
Delta C: Trained vs τ²-Bench	0.5315	0.61	-0.0785	—	Informational

Output length: Trained outputs averaged 210 words vs baseline 256 words (−18% reduction), confirming the adapter learned concision. This is consistent with the LIMA finding that style dimensions (length, register) are learnable from small SFT sets.

Why Delta A = 0.0: Three confirmed root causes (see methodology_rationale.md §Honest Limitation):

0.5B backbone cannot enforce negative lexical constraints — "bench" appears in input context and is reproduced via attention copying across all conditions.
Two of three held-out tasks use metadata check values that auto-pass regardless of output content (design artefact from the metadata _is_metadata_phrase() guard).
120-word count threshold: trained outputs at 210 words clear the threshold but so does the baseline (256 words), making the programmatic checks tied.

Limitations

Negative lexical constraint failure: The 0.5B backbone cannot reliably suppress words that appear in the input context. All three conditions (baseline, prompted, trained) produce the word "bench" despite it being in the banned-phrase list. This is a backbone capacity limitation. Expected to clear on Qwen2.5-1.5B with the same adapter.
Held-out set size (n=3): Statistical power is insufficient to detect small effects. Delta A = 0.0 at n=3 is consistent with both "no effect" and "effect too small to detect." A 20-task held-out set (v0.2 target) would provide 80% power at δ=0.05.
Single domain: The adapter is trained exclusively on B2B outreach scenarios from the Tenacious hiring signal domain. No generalisation claim is made.
LLM judge variance: The tone_judge dimension has κ=0.66 (substantial but not perfect agreement). Judge stochasticity contributes non-zero variance to all LLM-scored dimensions.
Not production-ready: This adapter is a reproducible research baseline. It should not be deployed in customer-facing systems without independent evaluation on a larger held-out set and human review of outputs.

Environmental Impact

Resource	Value
Training hardware	Google Colab T4 16GB (free tier)
Training time	~2 minutes
CO₂ equivalent	Negligible (free-tier shared infrastructure)
Inference hardware	CPU or T4 (Colab)
Total project API cost	~$2.12 (see `cost_log.csv`)

Citation

@misc{bekele2026tenaciousbench,
  title={Tenacious-Bench v0.1: A Style-Compliance Evaluation Benchmark for B2B Sales Agents},
  author={Bekele, Gashaw},
  year={2026},
  note={TRP1 Week 11 Project, 10 Academy}
}

License

Dataset: CC-BY-4.0. Adapter weights: Apache-2.0. Base model license: Apache-2.0 (Qwen2.5).

Downloads last month: -; Downloads are not tracked for this model. How to track

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support