Wish Engine Tool-Calling FT Runbook
Last updated: 2026-02-22
Owner: sahilmob
This runbook centralizes the fine-tuning status for the Wish Engine implementor tool-calling optimization project.
Scope
- Base model: `openai/gpt-oss-20b`
- Training method: 2-phase LoRA SFT (strict tool-call focus, then broader curriculum)
- Orchestration: Hugging Face Jobs (`hf jobs uv run`)
- Goal: improve tool-calling reliability and reduce trivial implementor mistakes
Hub Assets
Datasets
- Active phase-2 curriculum dataset: `sahilmob/wish-engine-toolcall-next-v3-general` (splits: train 1770, validation 245, test 222)
- Active phase-1 strict dataset: `sahilmob/wish-engine-toolcall-next-v3-strict-general` (splits: train 732, validation 91, test 90)
- Legacy v2 curriculum dataset: `sahilmob/wish-engine-toolcall-next-v2`
- Legacy v2 strict dataset: `sahilmob/wish-engine-toolcall-next-v2-strict`
Models
- Active phase-1 adapter (strict): `sahilmob/gpt-oss-20b-toolcall-phase1-v3-strict-general-lora`
- Active final target model (phase 2): `sahilmob/gpt-oss-20b-toolcall-two-phase-v3-general-lora`
- Legacy phase-1 adapter: `sahilmob/gpt-oss-20b-toolcall-phase1-v2-strict-lora`
- Legacy final model: `sahilmob/gpt-oss-20b-toolcall-two-phase-v2-lora`
Tracking
- Trackio Space: `sahilmob/trackio`
Current Job History
| Job ID | Hardware | Status | Notes |
|---|---|---|---|
| 699b2e2952d1c53b7df7d2d4 | a10g-large | ERROR | CUDA OOM during model placement (`torch.OutOfMemoryError`) |
| 699b2eda52d1c53b7df7d2d6 | a10g-large | ERROR | invalid gradient ... expected device meta but got cuda:0 |
| 699b30551aad19adb8aacbdb | a10g-large | ERROR | CUDA OOM in `accelerate.prepare` |
| 699b310652d1c53b7df7d2d8 | a100-large | COMPLETED | successful v2 two-phase training run |
| 699b380b1aad19adb8aacc34 | a100-large | ERROR | early v2 eval run failed |
| 699b4a1b1aad19adb8aacd14 | a100-large | COMPLETED | v3 datasets baseline eval (base model) |
| 699b4a1a1aad19adb8aacd13 | a100-large | COMPLETED | v3 datasets baseline eval (v2 adapter) |
| 699b6ba31aad19adb8aacee3 | a100-large | COMPLETED | successful v3 two-phase training run |
| 699b735552d1c53b7df7d347 | a100-large | COMPLETED | post-train base eval on v3 datasets |
| 699b73601aad19adb8aacf4d | a100-large | COMPLETED | post-train tuned eval on v3 datasets |
Snapshot JSON for the table above: `artifacts/job-history.snapshot.json`
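The table can be sanity-checked programmatically. A minimal sketch, using job records transcribed inline from the table above (the real `artifacts/job-history.snapshot.json` may use a different schema, so this does not read the file):

```python
from collections import Counter

# (job_id, hardware flavor, final status) transcribed from the table above.
jobs = [
    ("699b2e2952d1c53b7df7d2d4", "a10g-large", "ERROR"),
    ("699b2eda52d1c53b7df7d2d6", "a10g-large", "ERROR"),
    ("699b30551aad19adb8aacbdb", "a10g-large", "ERROR"),
    ("699b310652d1c53b7df7d2d8", "a100-large", "COMPLETED"),
    ("699b380b1aad19adb8aacc34", "a100-large", "ERROR"),
    ("699b4a1b1aad19adb8aacd14", "a100-large", "COMPLETED"),
    ("699b4a1a1aad19adb8aacd13", "a100-large", "COMPLETED"),
    ("699b6ba31aad19adb8aacee3", "a100-large", "COMPLETED"),
    ("699b735552d1c53b7df7d347", "a100-large", "COMPLETED"),
    ("699b73601aad19adb8aacf4d", "a100-large", "COMPLETED"),
]

# Tally outcomes per hardware flavor.
status_by_flavor = Counter((flavor, status) for _, flavor, status in jobs)
print(dict(status_by_flavor))
```

This makes the hardware pattern explicit: every `a10g-large` run errored (OOM or device placement), and all completed runs used `a100-large`.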
Training Outcome (v3 Completed)
From job 699b6ba31aad19adb8aacee3:
- Phase 1 (`sahilmob/wish-engine-toolcall-next-v3-strict-general`, 2 epochs)
  - `train_loss`: 2.904
  - `mean_token_accuracy`: 0.6004
  - last eval checkpoint: `eval_loss`: 2.486, `eval_mean_token_accuracy`: 0.5683
- Phase 2 (`sahilmob/wish-engine-toolcall-next-v3-general`, 1 epoch)
  - `train_loss`: 1.924
  - `mean_token_accuracy`: 0.6225
  - last eval checkpoint: `eval_loss`: 1.801, `eval_mean_token_accuracy`: 0.6285
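The eval losses can be read as perplexities for easier comparison. A quick sketch, assuming `eval_loss` is the mean per-token negative log-likelihood (the usual convention for causal-LM trainers, though not confirmed by the training script itself):

```python
import math

# Last eval checkpoints from the v3 run above.
phase1_eval_loss = 2.486
phase2_eval_loss = 1.801

# If eval_loss is mean per-token NLL, perplexity = exp(loss).
phase1_ppl = math.exp(phase1_eval_loss)
phase2_ppl = math.exp(phase2_eval_loss)
print(f"phase 1 eval ppl: {phase1_ppl:.2f}, phase 2 eval ppl: {phase2_ppl:.2f}")
```

Under that assumption, eval perplexity drops from roughly 12 after the strict phase to roughly 6 after the curriculum phase.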
Evaluation Outcome (v3 Datasets)
Datasets used for all metrics in this section: `sahilmob/wish-engine-toolcall-next-v3-strict-general` and `sahilmob/wish-engine-toolcall-next-v3-general`
Post-train eval (base vs v3 tuned)
Base eval job: 699b735552d1c53b7df7d347
Tuned eval job: 699b73601aad19adb8aacf4d
Summary (tuned - base):
- `mean_masked_nll`: 4.3395 -> 0.5443 (-3.7952, lower is better)
- `masked_perplexity`: 76.6717 -> 1.7235 (-74.9483, lower is better)
- `tool_name_exact_accuracy`: 0.0619 -> 0.0000 (-0.0619)
- `tool_name_canonical_exact_accuracy`: 0.0667 -> 0.3762 (+0.3095)
- `tool_name_contains_expected_accuracy`: 0.0810 -> 0.3762 (+0.2952)
- `tool_name_canonical_contains_expected_accuracy`: 0.0810 -> 0.3762 (+0.2952)
- `tool_name_parsed_ratio`: 0.2619 -> 0.9810 (+0.7190)
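The NLL and perplexity figures above are internally consistent under the standard relation perplexity = exp(NLL); a quick check:

```python
import math

# (mean_masked_nll, masked_perplexity) pairs from the summary above.
base_nll, base_ppl = 4.3395, 76.6717
tuned_nll, tuned_ppl = 0.5443, 1.7235

# masked_perplexity should equal exp(mean_masked_nll) up to rounding.
for nll, ppl in [(base_nll, base_ppl), (tuned_nll, tuned_ppl)]:
    assert math.isclose(math.exp(nll), ppl, rel_tol=1e-3), (nll, ppl)
print("perplexity = exp(nll) holds for both runs")
```

This also explains why the perplexity delta looks so dramatic: a 3.8-nat drop in mean NLL compounds exponentially.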
Comparison vs v2 adapter on same v3 eval set
Baseline v2 adapter eval job: 699b4a1a1aad19adb8aacd13
Key change (v3 tuned - v2 tuned):
- `mean_masked_nll`: +0.0031 (slightly worse)
- `masked_perplexity`: +0.0053 (slightly worse)
- `tool_name_canonical_exact_accuracy`: +0.0048 (slightly better)
- `tool_name_canonical_contains_expected_accuracy`: +0.0048 (slightly better)
- `tool_name_parsed_ratio`: -0.0190 (slightly worse)
Runtime Safeguards Added
These safeguards were added to reduce the failure modes seen in earlier jobs:
- `runtime.strict_chat_template=true` - prevents silent fallback formatting of tool-call chat samples
- `runtime.skip_trainer_model_move=true` - avoids a duplicate model move by the Trainer when the model is already placed
- `runtime.force_single_device_model=true` - forces consistent single-device placement to avoid meta/cuda device mismatches
Current active config snapshot: `artifacts/two-phase-toolcall.hf.config.json`
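The dotted key names above suggest these flags live under a `runtime` block in the config. An illustrative fragment only; the actual schema of `artifacts/two-phase-toolcall.hf.config.json` may differ:

```json
{
  "runtime": {
    "strict_chat_template": true,
    "skip_trainer_model_move": true,
    "force_single_device_model": true
  }
}
```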
Repro Commands
Build payload with configurable budget
```shell
HF_JOB_FLAVOR=a100-large HF_JOB_TIMEOUT=5h npm run ft:train:toolcalls:payload
```
Submit job
```shell
hf jobs uv run training/hf-jobs/generated_two_phase_toolcall_train.py \
  --flavor a100-large \
  --secrets HF_TOKEN \
  --timeout 5h \
  -d
```
Check status
```shell
hf jobs inspect 699b6ba31aad19adb8aacee3
```
Run post-train evals
```shell
hf jobs uv run training/hf-jobs/eval_toolcall_models.py \
  --flavor a100-large \
  --timeout 3h \
  --secrets HF_TOKEN \
  -d \
  --base-model openai/gpt-oss-20b \
  --adapter-model sahilmob/gpt-oss-20b-toolcall-two-phase-v3-general-lora \
  --datasets sahilmob/wish-engine-toolcall-next-v3-strict-general,sahilmob/wish-engine-toolcall-next-v3-general \
  --max-samples-per-dataset 120 \
  --generation-samples-per-dataset 120 \
  --max-new-tokens 96 \
  --max-context-tokens 2048 \
  --eval-target tuned \
  --seed 42
```
Artifacts
- Job history snapshot: `artifacts/job-history.snapshot.json`
- Baseline eval summary: `../eval-results/v3-general-eval-2026-02-22.json`
- Post-train eval summary: `../eval-results/v3-general-eval-post-train-2026-02-22.json`
Pending
- Run the Wish Engine implementor benchmark against `gpt-oss-20b-toolcall-two-phase-v3-general-lora`.
- Track tool-name canonical exact accuracy and parsed ratio in real pipeline runs.