Wish Engine Tool-Calling FT Runbook

Last updated: 2026-02-22 | Owner: sahilmob

This runbook centralizes fine-tuning status, job history, and evaluation results for the Wish Engine implementor tool-calling optimization project.

Scope

  • Base model: openai/gpt-oss-20b
  • Training method: 2-phase LoRA SFT (strict tool-call focus, then broader curriculum)
  • Orchestration: Hugging Face Jobs (hf jobs uv run)
  • Goal: improve tool-calling reliability and reduce trivial implementor mistakes

Hub Assets

Datasets

Models

Tracking

Current Job History

| Job ID | Hardware | Status | Notes |
|---|---|---|---|
| 699b2e2952d1c53b7df7d2d4 | a10g-large | ERROR | CUDA OOM during model placement (torch.OutOfMemoryError) |
| 699b2eda52d1c53b7df7d2d6 | a10g-large | ERROR | invalid gradient ... expected device meta but got cuda:0 |
| 699b30551aad19adb8aacbdb | a10g-large | ERROR | CUDA OOM in accelerate.prepare |
| 699b310652d1c53b7df7d2d8 | a100-large | COMPLETED | successful v2 two-phase training run |
| 699b380b1aad19adb8aacc34 | a100-large | ERROR | early v2 eval run failed |
| 699b4a1b1aad19adb8aacd14 | a100-large | COMPLETED | v3 datasets baseline eval (base model) |
| 699b4a1a1aad19adb8aacd13 | a100-large | COMPLETED | v3 datasets baseline eval (v2 adapter) |
| 699b6ba31aad19adb8aacee3 | a100-large | COMPLETED | successful v3 two-phase training run |
| 699b735552d1c53b7df7d347 | a100-large | COMPLETED | post-train base eval on v3 datasets |
| 699b73601aad19adb8aacf4d | a100-large | COMPLETED | post-train tuned eval on v3 datasets |

Snapshot JSON for the table above: artifacts/job-history.snapshot.json
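
The snapshot can be summarized programmatically, e.g. to count statuses or pull out error notes. A minimal sketch, assuming each snapshot entry carries `job_id`, `status`, and `notes` fields (the exact schema of `artifacts/job-history.snapshot.json` is an assumption):

```python
from collections import Counter

def summarize_jobs(jobs):
    """Count job statuses and collect (job_id, notes) pairs for failed jobs."""
    statuses = Counter(job["status"] for job in jobs)
    errors = [(job["job_id"], job["notes"]) for job in jobs if job["status"] == "ERROR"]
    return statuses, errors

# Example with two entries in the assumed snapshot shape:
jobs = [
    {"job_id": "699b2e2952d1c53b7df7d2d4", "hardware": "a10g-large",
     "status": "ERROR", "notes": "CUDA OOM during model placement"},
    {"job_id": "699b310652d1c53b7df7d2d8", "hardware": "a100-large",
     "status": "COMPLETED", "notes": "successful v2 two-phase training run"},
]
statuses, errors = summarize_jobs(jobs)
# statuses["ERROR"] == 1, statuses["COMPLETED"] == 1
# errors[0][0] == "699b2e2952d1c53b7df7d2d4"
```

In real use the list would come from `json.load()` on the snapshot file instead of an inline literal.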

Training Outcome (v3 Completed)

From job 699b6ba31aad19adb8aacee3:

  • Phase 1 (sahilmob/wish-engine-toolcall-next-v3-strict-general, 2 epochs)
    • train_loss: 2.904
    • mean_token_accuracy: 0.6004
    • last eval checkpoint:
      • eval_loss: 2.486
      • eval_mean_token_accuracy: 0.5683
  • Phase 2 (sahilmob/wish-engine-toolcall-next-v3-general, 1 epoch)
    • train_loss: 1.924
    • mean_token_accuracy: 0.6225
    • last eval checkpoint:
      • eval_loss: 1.801
      • eval_mean_token_accuracy: 0.6285
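
Assuming `eval_loss` is a mean per-token NLL (the usual convention for causal-LM trainers), eval perplexity follows directly as exp(eval_loss); a quick check on the two phases:

```python
import math

# Perplexity = exp(mean per-token NLL); assumes eval_loss is a mean token NLL.
phase1_eval_loss = 2.486
phase2_eval_loss = 1.801

print(round(math.exp(phase1_eval_loss), 2))  # ~12.01
print(round(math.exp(phase2_eval_loss), 2))  # ~6.06
```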

Evaluation Outcome (v3 Datasets)

Datasets used for all metrics in this section:

Post-train eval (base vs v3 tuned)

Base eval job:

Tuned eval job:

Summary (tuned - base):

  • mean_masked_nll: 4.3395 -> 0.5443 (-3.7952, lower is better)
  • masked_perplexity: 76.6717 -> 1.7235 (-74.9483, lower is better)
  • tool_name_exact_accuracy: 0.0619 -> 0.0000 (-0.0619)
  • tool_name_canonical_exact_accuracy: 0.0667 -> 0.3762 (+0.3095)
  • tool_name_contains_expected_accuracy: 0.0810 -> 0.3762 (+0.2952)
  • tool_name_canonical_contains_expected_accuracy: 0.0810 -> 0.3762 (+0.2952)
  • tool_name_parsed_ratio: 0.2619 -> 0.9810 (+0.7190)
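
The four tool-name accuracy variants differ only in how a predicted name is matched against the expected one. A hypothetical sketch of that matching logic ("canonicalize" here means lowercase and strip non-alphanumerics; the eval script's actual canonicalization rule is an assumption):

```python
import re

def canonicalize(name: str) -> str:
    """Hypothetical canonical form: lowercase, alphanumerics only."""
    return re.sub(r"[^a-z0-9]", "", name.lower())

def score(predicted: str, expected: str) -> dict:
    """One row of the four tool-name match variants for a single sample."""
    return {
        "exact": predicted == expected,
        "canonical_exact": canonicalize(predicted) == canonicalize(expected),
        "contains_expected": expected in predicted,
        "canonical_contains_expected": canonicalize(expected) in canonicalize(predicted),
    }

# A prediction that misses exact match but survives canonicalization:
print(score("Read_File", "read_file"))
# {'exact': False, 'canonical_exact': True, 'contains_expected': False, 'canonical_contains_expected': True}
```

This illustrates why the tuned model can score 0.0 on exact accuracy while scoring 0.3762 on the canonical variants: formatting differences in the emitted name, not wrong tools.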

Comparison vs v2 adapter on same v3 eval set

Baseline v2 adapter eval job:

Key changes (v3 tuned - v2 tuned):

  • mean_masked_nll: +0.0031 (slightly worse)
  • masked_perplexity: +0.0053 (slightly worse)
  • tool_name_canonical_exact_accuracy: +0.0048 (slightly better)
  • tool_name_canonical_contains_expected_accuracy: +0.0048 (slightly better)
  • tool_name_parsed_ratio: -0.0190 (slightly worse)
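
The deltas in both comparison blocks are plain metric-by-metric differences between two eval summary dicts. A minimal sketch (the summary-JSON key layout is an assumption; values below are taken from the base-vs-tuned table above):

```python
def metric_deltas(tuned: dict, base: dict) -> dict:
    """Per-metric difference (tuned - base) over the keys both summaries share."""
    return {k: round(tuned[k] - base[k], 4) for k in tuned if k in base}

base = {"mean_masked_nll": 4.3395, "tool_name_canonical_exact_accuracy": 0.0667}
tuned = {"mean_masked_nll": 0.5443, "tool_name_canonical_exact_accuracy": 0.3762}
print(metric_deltas(tuned, base))
# {'mean_masked_nll': -3.7952, 'tool_name_canonical_exact_accuracy': 0.3095}
```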

Runtime Safeguards Added

These safeguards were added to prevent the failure modes seen in the earlier jobs (CUDA OOM and the meta/cuda device mismatch):

  • runtime.strict_chat_template=true
    • prevents silent fallback formatting of tool-call chat samples
  • runtime.skip_trainer_model_move=true
    • avoids duplicate model move from Trainer when model is already placed
  • runtime.force_single_device_model=true
    • forces consistent single-device placement to avoid meta/cuda mismatch

Current active config snapshot: artifacts/two-phase-toolcall.hf.config.json
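
For reference, the three flags above correspond to entries in the runtime block of that config snapshot. A hypothetical fragment (key nesting is illustrative, not copied from the actual file):

```json
{
  "runtime": {
    "strict_chat_template": true,
    "skip_trainer_model_move": true,
    "force_single_device_model": true
  }
}
```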

Repro Commands

Build payload with configurable budget

HF_JOB_FLAVOR=a100-large HF_JOB_TIMEOUT=5h npm run ft:train:toolcalls:payload

Submit job

hf jobs uv run training/hf-jobs/generated_two_phase_toolcall_train.py \
  --flavor a100-large \
  --secrets HF_TOKEN \
  --timeout 5h \
  -d

Check status

hf jobs inspect 699b6ba31aad19adb8aacee3

Run post-train evals

hf jobs uv run training/hf-jobs/eval_toolcall_models.py \
  --flavor a100-large \
  --timeout 3h \
  --secrets HF_TOKEN \
  -d \
  --base-model openai/gpt-oss-20b \
  --adapter-model sahilmob/gpt-oss-20b-toolcall-two-phase-v3-general-lora \
  --datasets sahilmob/wish-engine-toolcall-next-v3-strict-general,sahilmob/wish-engine-toolcall-next-v3-general \
  --max-samples-per-dataset 120 \
  --generation-samples-per-dataset 120 \
  --max-new-tokens 96 \
  --max-context-tokens 2048 \
  --eval-target tuned \
  --seed 42

Artifacts

  • Job history snapshot: artifacts/job-history.snapshot.json
  • Baseline eval summary: ../eval-results/v3-general-eval-2026-02-22.json
  • Post-train eval summary: ../eval-results/v3-general-eval-post-train-2026-02-22.json

Pending

  • Run Wish Engine implementor benchmark against gpt-oss-20b-toolcall-two-phase-v3-general-lora.
  • Track tool-name canonical exact accuracy and parsed ratio in real pipeline runs.