Wish Engine Tool-Calling FT Runbook

Last updated: 2026-02-22 | Owner: sahilmob

This runbook centralizes fine-tuning status, job history, and evaluation results for the Wish Engine implementor tool-calling optimization project.

Scope

  • Base model: openai/gpt-oss-20b
  • Training method: 2-phase LoRA SFT (strict tool-call focus, then broader curriculum)
  • Orchestration: Hugging Face Jobs (hf jobs uv run)
  • Goal: improve tool-calling reliability and reduce trivial implementor mistakes

Hub Assets

Datasets

Models

Tracking

Current Job History

| Job ID | Hardware | Status | Notes |
|---|---|---|---|
| 699b2e2952d1c53b7df7d2d4 | a10g-large | ERROR | CUDA OOM during model placement (torch.OutOfMemoryError) |
| 699b2eda52d1c53b7df7d2d6 | a10g-large | ERROR | invalid gradient ... expected device meta but got cuda:0 |
| 699b30551aad19adb8aacbdb | a10g-large | ERROR | CUDA OOM in accelerate.prepare |
| 699b310652d1c53b7df7d2d8 | a100-large | COMPLETED | successful v2 two-phase training run |
| 699b380b1aad19adb8aacc34 | a100-large | ERROR | early v2 eval run failed |
| 699b4a1b1aad19adb8aacd14 | a100-large | COMPLETED | v3 datasets baseline eval (base model) |
| 699b4a1a1aad19adb8aacd13 | a100-large | COMPLETED | v3 datasets baseline eval (v2 adapter) |
| 699b6ba31aad19adb8aacee3 | a100-large | COMPLETED | successful v3 two-phase training run |
| 699b735552d1c53b7df7d347 | a100-large | COMPLETED | post-train base eval on v3 datasets |
| 699b73601aad19adb8aacf4d | a100-large | COMPLETED | post-train tuned eval on v3 datasets |

Snapshot JSON for the table above: artifacts/job-history.snapshot.json
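
The snapshot can be summarized programmatically, e.g. to count statuses or pull out error notes. A minimal sketch, assuming each snapshot entry carries `job_id`, `status`, and `notes` fields (the exact schema of `artifacts/job-history.snapshot.json` is an assumption):

```python
from collections import Counter

def summarize_jobs(jobs):
    """Count job statuses and collect (job_id, notes) pairs for failed jobs."""
    statuses = Counter(job["status"] for job in jobs)
    errors = [(job["job_id"], job["notes"]) for job in jobs if job["status"] == "ERROR"]
    return statuses, errors

# Example with two entries in the assumed snapshot shape:
jobs = [
    {"job_id": "699b2e2952d1c53b7df7d2d4", "hardware": "a10g-large",
     "status": "ERROR", "notes": "CUDA OOM during model placement"},
    {"job_id": "699b310652d1c53b7df7d2d8", "hardware": "a100-large",
     "status": "COMPLETED", "notes": "successful v2 two-phase training run"},
]
statuses, errors = summarize_jobs(jobs)
# statuses["ERROR"] == 1, statuses["COMPLETED"] == 1
# errors[0][0] == "699b2e2952d1c53b7df7d2d4"
```

In real use the list would come from `json.load()` on the snapshot file instead of an inline literal.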

Training Outcome (v3 Completed)

From job 699b6ba31aad19adb8aacee3:

  • Phase 1 (sahilmob/wish-engine-toolcall-next-v3-strict-general, 2 epochs)
    • train_loss: 2.904
    • mean_token_accuracy: 0.6004
    • last eval checkpoint:
      • eval_loss: 2.486
      • eval_mean_token_accuracy: 0.5683
  • Phase 2 (sahilmob/wish-engine-toolcall-next-v3-general, 1 epoch)
    • train_loss: 1.924
    • mean_token_accuracy: 0.6225
    • last eval checkpoint:
      • eval_loss: 1.801
      • eval_mean_token_accuracy: 0.6285
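
Assuming `eval_loss` is a mean per-token NLL (the usual convention for causal-LM trainers), eval perplexity follows directly as exp(eval_loss); a quick check on the two phases:

```python
import math

# Perplexity = exp(mean per-token NLL); assumes eval_loss is a mean token NLL.
phase1_eval_loss = 2.486
phase2_eval_loss = 1.801

print(round(math.exp(phase1_eval_loss), 2))  # ~12.01
print(round(math.exp(phase2_eval_loss), 2))  # ~6.06
```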

Evaluation Outcome (v3 Datasets)

Datasets used for all metrics in this section:

Post-train eval (base vs v3 tuned)

Base eval job:

Tuned eval job:

Summary (tuned - base):

  • mean_masked_nll: 4.3395 -> 0.5443 (-3.7952, lower is better)
  • masked_perplexity: 76.6717 -> 1.7235 (-74.9483, lower is better)
  • tool_name_exact_accuracy: 0.0619 -> 0.0000 (-0.0619)
  • tool_name_canonical_exact_accuracy: 0.0667 -> 0.3762 (+0.3095)
  • tool_name_contains_expected_accuracy: 0.0810 -> 0.3762 (+0.2952)
  • tool_name_canonical_contains_expected_accuracy: 0.0810 -> 0.3762 (+0.2952)
  • tool_name_parsed_ratio: 0.2619 -> 0.9810 (+0.7190)
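
The four tool-name accuracy variants differ only in how a predicted name is matched against the expected one. A hypothetical sketch of that matching logic ("canonicalize" here means lowercase and strip non-alphanumerics; the eval script's actual canonicalization rule is an assumption):

```python
import re

def canonicalize(name: str) -> str:
    """Hypothetical canonical form: lowercase, alphanumerics only."""
    return re.sub(r"[^a-z0-9]", "", name.lower())

def score(predicted: str, expected: str) -> dict:
    """One row of the four tool-name match variants for a single sample."""
    return {
        "exact": predicted == expected,
        "canonical_exact": canonicalize(predicted) == canonicalize(expected),
        "contains_expected": expected in predicted,
        "canonical_contains_expected": canonicalize(expected) in canonicalize(predicted),
    }

# A prediction that misses exact match but survives canonicalization:
print(score("Read_File", "read_file"))
# {'exact': False, 'canonical_exact': True, 'contains_expected': False, 'canonical_contains_expected': True}
```

This illustrates why the tuned model can score 0.0 on exact accuracy while scoring 0.3762 on the canonical variants: formatting differences in the emitted name, not wrong tools.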

Comparison vs v2 adapter on same v3 eval set

Baseline v2 adapter eval job:

Key changes (v3 tuned - v2 tuned):

  • mean_masked_nll: +0.0031 (slightly worse)
  • masked_perplexity: +0.0053 (slightly worse)
  • tool_name_canonical_exact_accuracy: +0.0048 (slightly better)
  • tool_name_canonical_contains_expected_accuracy: +0.0048 (slightly better)
  • tool_name_parsed_ratio: -0.0190 (slightly worse)
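
The deltas in both comparison blocks are plain metric-by-metric differences between two eval summary dicts. A minimal sketch (the summary-JSON key layout is an assumption; values below are taken from the base-vs-tuned table above):

```python
def metric_deltas(tuned: dict, base: dict) -> dict:
    """Per-metric difference (tuned - base) over the keys both summaries share."""
    return {k: round(tuned[k] - base[k], 4) for k in tuned if k in base}

base = {"mean_masked_nll": 4.3395, "tool_name_canonical_exact_accuracy": 0.0667}
tuned = {"mean_masked_nll": 0.5443, "tool_name_canonical_exact_accuracy": 0.3762}
print(metric_deltas(tuned, base))
# {'mean_masked_nll': -3.7952, 'tool_name_canonical_exact_accuracy': 0.3095}
```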

Runtime Safeguards Added

These safeguards were added to prevent the failure modes seen in the earlier jobs (CUDA OOM and the meta/cuda device mismatch):

  • runtime.strict_chat_template=true
    • prevents silent fallback formatting of tool-call chat samples
  • runtime.skip_trainer_model_move=true
    • avoids duplicate model move from Trainer when model is already placed
  • runtime.force_single_device_model=true
    • forces consistent single-device placement to avoid meta/cuda mismatch

Current active config snapshot: artifacts/two-phase-toolcall.hf.config.json
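
For reference, the three flags above correspond to entries in the runtime block of that config snapshot. A hypothetical fragment (key nesting is illustrative, not copied from the actual file):

```json
{
  "runtime": {
    "strict_chat_template": true,
    "skip_trainer_model_move": true,
    "force_single_device_model": true
  }
}
```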

Repro Commands

Build payload with configurable budget

HF_JOB_FLAVOR=a100-large HF_JOB_TIMEOUT=5h npm run ft:train:toolcalls:payload

Submit job

hf jobs uv run training/hf-jobs/generated_two_phase_toolcall_train.py \
  --flavor a100-large \
  --secrets HF_TOKEN \
  --timeout 5h \
  -d

Check status

hf jobs inspect 699b6ba31aad19adb8aacee3

Run post-train evals

hf jobs uv run training/hf-jobs/eval_toolcall_models.py \
  --flavor a100-large \
  --timeout 3h \
  --secrets HF_TOKEN \
  -d \
  --base-model openai/gpt-oss-20b \
  --adapter-model sahilmob/gpt-oss-20b-toolcall-two-phase-v3-general-lora \
  --datasets sahilmob/wish-engine-toolcall-next-v3-strict-general,sahilmob/wish-engine-toolcall-next-v3-general \
  --max-samples-per-dataset 120 \
  --generation-samples-per-dataset 120 \
  --max-new-tokens 96 \
  --max-context-tokens 2048 \
  --eval-target tuned \
  --seed 42

Artifacts

  • Job history snapshot: artifacts/job-history.snapshot.json
  • Baseline eval summary: ../eval-results/v3-general-eval-2026-02-22.json
  • Post-train eval summary: ../eval-results/v3-general-eval-post-train-2026-02-22.json

Pending

  • Run Wish Engine implementor benchmark against gpt-oss-20b-toolcall-two-phase-v3-general-lora.
  • Track tool-name canonical exact accuracy and parsed ratio in real pipeline runs.