qwen14b-code-trainer-v6-aggressive

LoRA adapter for Qwen/Qwen2.5-Coder-14B-Instruct, fine-tuned to generate source code from chat-formatted instructions in the code-trainer-offsec-dataset. This is the winner of the Phase 4A validation sweep across three LoRA configurations (eval_loss = 0.4724 on the full 3,265-row validation split — see eval results).

It is the canonical Phase 5 conversion target for the Code-Trainer V6 / RTPI project (GitHub), chosen over the 3-epoch Phase 4B variant after a head-to-head full-validation comparison.

Intended use

  • Direct use: load the adapter on top of Qwen/Qwen2.5-Coder-14B-Instruct and use it for instruction-following code generation in the eight languages of the dataset (Python, JavaScript, TypeScript, Java, Go, Rust, C++, C#).
  • Downstream: merge into the base model and quantize to GGUF for local serving; see qwen14b-code-trainer-v6-gguf and the merge sketch after this list.
  • Out of scope: this adapter was not trained for safety alignment, RLHF, or any non-code task; treat any non-code response as undefined behaviour.
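
For the downstream GGUF path, the merge itself is a short PEFT call. A minimal sketch (the output directory name is illustrative; GGUF conversion and quantization happen afterwards, e.g. with llama.cpp tooling):

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "Qwen/Qwen2.5-Coder-14B-Instruct"
adapter_id = "cmndcntrlcyber/qwen14b-code-trainer-v6-aggressive"

# Apply the adapter, then fold the LoRA deltas into the base weights so the
# result can be converted and served without PEFT.
base = AutoModelForCausalLM.from_pretrained(base_id, dtype=torch.bfloat16)
merged = PeftModel.from_pretrained(base, adapter_id).merge_and_unload()

merged.save_pretrained("qwen14b-code-trainer-v6-merged")
AutoTokenizer.from_pretrained(base_id).save_pretrained("qwen14b-code-trainer-v6-merged")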

Training data

  • Dataset: cmndcntrlcyber/code-trainer-offsec-dataset (revision main, text-only chat format).
  • Splits seen: 26,126 train rows, 3,265 validation rows.
  • Source: scraped GitHub repositories across 8 languages, filtered by a composite quality score (stars, activity, docs, code quality, community) and code-shape heuristics (20–500 lines, 200 B – 50 KB per file).
  • Format: OpenAI-style messages (system / user / assistant) where the user turn carries an instruction and the assistant turn is the source code to produce.
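
A hypothetical row in that shape (the "messages" field name and the snippet are invented for illustration, not copied from the dataset):

row = {
    "messages": [
        {"role": "system", "content": "You are a senior software engineer."},
        {"role": "user", "content": "Write a Go function that reverses a string."},
        {"role": "assistant", "content": "func Reverse(s string) string { /* ... */ }"},
    ]
}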

Training procedure

| Knob | Value |
|---|---|
| Base model | Qwen/Qwen2.5-Coder-14B-Instruct |
| Adapter | LoRA (PEFT), r = 64, alpha = 128, dropout = 0.05 |
| Optimizer | fused AdamW (BF16 master weights) |
| Learning rate | 3e-4 (cosine decay, warmup ratio 0.03) |
| Batch size × grad accum | 4 × 4 (effective batch = 16) |
| Epochs | 1 (full data; see the Phase 4A vs 4B note below) |
| Sequence length | 2,048 |
| Precision | bfloat16 + gradient checkpointing |
| Hardware | HF Skills a100-large (1× A100 80 GB) |
| Frameworks | transformers, peft, trl (SFTTrainer) |
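
In PEFT/TRL terms, the table corresponds roughly to the configuration below. This is a sketch rather than the project's launch script: the dataset split names and output directory are assumptions, and trl has renamed the sequence-length argument across releases.

from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

ds = load_dataset("cmndcntrlcyber/code-trainer-offsec-dataset")  # split names assumed

lora_cfg = LoraConfig(r=64, lora_alpha=128, lora_dropout=0.05, task_type="CAUSAL_LM")

train_cfg = SFTConfig(
    output_dir="qwen14b-v6-aggressive",   # illustrative
    num_train_epochs=1,
    learning_rate=3e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,        # 4 x 4 = effective batch 16
    bf16=True,
    gradient_checkpointing=True,
    optim="adamw_torch_fused",
    max_seq_length=2048,                  # named max_length in newer trl releases
)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-Coder-14B-Instruct",
    args=train_cfg,
    train_dataset=ds["train"],
    eval_dataset=ds["validation"],
    peft_config=lora_cfg,
)
trainer.train()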

The Phase 4A sweep also covered a conservative (r=16, lr=1.5e-4) and standard (r=32, lr=2e-4) configuration; aggressive won by ≥ 0.007 eval_loss on identical data. See docs/sweep/phase4a-summary.md for the full sweep table.

A second-pass Phase 4B experiment ran the same aggressive config for 3 epochs over an 8 K slice of the data (24 K total samples seen). On the full validation split it scored 0.5126 — measurably worse than this adapter's 0.4724 across 1 epoch on the full 26 K. Conclusion: more unique examples > more passes for this dataset. See docs/sweep/phase4b-summary.md.

Evaluation

Task-specific (full validation split, 3,265 rows)

| Metric | Value |
|---|---|
| eval_loss | 0.4724 |
| eval_runtime | 677.88 s |

Source: HF Job 69f7659298a8d679adfb8b8e, re-evaluated post-hoc against the full val split via job 69f89e5b9d85bec4d76f217e (also phase4-eval-full.json in this repo).

Sweep comparison (Phase 4A — full val)

| Config | r / α | LR | eval_loss |
|---|---|---|---|
| conservative | 16 / 32 | 1.5e-4 | 0.4819 |
| standard | 32 / 64 | 2e-4 | 0.4798 |
| aggressive (this) | 64 / 128 | 3e-4 | 0.4724 |

General benchmark — catastrophic-forgetting check (GSM8K, 0-shot)

GSM8K (grade-school math word problems) is orthogonal to the instruction-to-code training task, so any large drop here would indicate catastrophic forgetting on general reasoning. Both rows use the same lm-evaluation-harness (lm-eval==0.4.4) pipeline.

| Run | exact_match (flexible-extract) | exact_match (strict-match) |
|---|---|---|
| Base Qwen/Qwen2.5-Coder-14B-Instruct | 0.6050 ± 0.013 | 0.0000 |
| + this adapter | 0.6778 ± 0.013 | 0.0000 |
| Δ | +0.0728 (+12.0 % relative) | n/a |

No catastrophic forgetting — the adapter actually improves GSM8K performance by 12 % relative. This is consistent with the chat-format SFT having taught the model cleaner answer formatting on free-form prompts; the LoRA changes did not erase math-reasoning capability.

Notes:

  • flexible-extract is the meaningful column — it regex-extracts the numeric answer from chat-formatted output. strict-match is 0 for both rows because the chat-trained model emits prose like "The answer is 42" rather than GSM8K's raw #### 42 format; this is a formatting artifact, not a reasoning failure.
  • See phase4-benchmark-gsm8k.json and phase4-benchmark-gsm8k-base.json in this repo for the raw lm-evaluation-harness output.
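
The adapter row can be reproduced approximately from Python with the harness's simple_evaluate entry point. This is a sketch under that assumption; the reported numbers came from the job runs above, and harness arguments can differ slightly between 0.4.x releases.

import lm_eval

# 0-shot GSM8K on the base model with this LoRA adapter applied.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args=(
        "pretrained=Qwen/Qwen2.5-Coder-14B-Instruct,"
        "peft=cmndcntrlcyber/qwen14b-code-trainer-v6-aggressive,"
        "dtype=bfloat16"
    ),
    tasks=["gsm8k"],
    num_fewshot=0,
)
print(results["results"]["gsm8k"])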

Limitations

  • Multilingual eval is approximate. The Phase 3 syntax_valid_rate metric uses a Python parser, so the absolute numbers across the dataset's 8 languages are not directly comparable; the delta against the base model is the meaningful signal.
  • No safety tuning. This adapter is for code synthesis; it inherits the base model's safety properties and adds no extra alignment.
  • One epoch. A single pass through the data leaves further headroom; gains are likely possible with longer training or larger sequence length, at the cost of compute.
  • Adapter, not full weights. You need the base model to use this — total download is ~28 GB (base) + ~0.7 GB (this adapter).

How to use

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "Qwen/Qwen2.5-Coder-14B-Instruct"
adapter_id = "cmndcntrlcyber/qwen14b-code-trainer-v6-aggressive"

tokenizer = AutoTokenizer.from_pretrained(base_id)
# Load the base model in bf16, then attach the LoRA adapter on top.
model = AutoModelForCausalLM.from_pretrained(
    base_id, dtype=torch.bfloat16, device_map="auto",
)
model = PeftModel.from_pretrained(model, adapter_id)
model.eval()

messages = [
    {"role": "system", "content": "You are a senior software engineer."},
    {"role": "user", "content": "Write a Rust function that parses an ISO-8601 timestamp."},
]
inputs = tokenizer.apply_chat_template(
    messages, return_tensors="pt", add_generation_prompt=True,
).to(model.device)
out = model.generate(inputs, max_new_tokens=512, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(out[0][inputs.shape[1]:], skip_special_tokens=True))

Reproducibility

  • Code: github.com/cmndcntrlcyber/code-trainer-offsec-pipeline
  • Launch command (HF Jobs):
    python -m src.phase4_qwen_finetuning.scripts.launch_full_training \
        --config src/config/v6_config.yaml --best-config aggressive --wait
    
  • W&B project: rtpi-phase43-qwen14b.
  • Cost: $22 on a100-large (7 h). The job was eventually marked ERROR by HF Skills' lazy timeout enforcement, but the adapter had already been pushed to the Hub before the kill; see the Phase 4A summary.