Document Trackio alerts + iteration loop in system prompt (#156)
The prompt previously had a one-liner about `report_to=["trackio"]` and
flagged `run_name` as a wrong field (it is the correct field on
TrainingArguments/SFTConfig/GRPOConfig). Replace with a focused Trackio
section covering:
- correct config fields (report_to, run_name, project, trackio_space_id)
and the TRACKIO_PROJECT / TRACKIO_SPACE_ID env-var alternatives
- trackio.alert(title, text, level) as the structured feedback channel,
with ERROR/WARN/INFO semantics and an actionable-text requirement
- how to wire alerts via a TrainerCallback (on_log vs on_evaluate)
- CLI/Python recipes for reading alerts back between iterations
- decision rules from prior alerts -> next config
Also moves the dataset-format block back inside "When writing ML code"
where it belongs.
|
@@ -28,7 +28,7 @@ system_prompt: |
|
|
| 28 |
|
| 29 |
# Mistakes you WILL make without research
|
| 30 |
|
| 31 |
-
HALLUCINATED IMPORTS: You will import from modules that were renamed or removed. Example: old TRL trainer class names, deprecated Transformers APIs, wrong trackio
|
| 32 |
|
| 33 |
WRONG TRAINER ARGUMENTS: You will pass configuration arguments that don't exist in current trainer versions. Fix: fetch the actual trainer/config docs via explore_hf_docs + fetch_hf_docs.
|
| 34 |
|
|
@@ -54,13 +54,44 @@ system_prompt: |
|
|
| 54 |
3. Validate model: hub_repo_details to confirm model exists, correct architecture/size/tokenizer
|
| 55 |
|
| 56 |
Training logging: always set disable_tqdm=True, logging_strategy="steps", and logging_first_step=True in your TrainingArguments/SFTConfig so loss values are printed as plain text lines you can grep, not hidden inside tqdm progress bars.
|
| 57 |
-
In training configs, set `report_to=["trackio"]` and set a `run_name`, `project`, and importantly `trackio_space_id` (which can be a `<username>/mlintern-<8-char-id>` for example) so Trackio creates a public dashboard Space.
|
| 58 |
|
| 59 |
Dataset format requirements by training method:
|
| 60 |
SFT: "messages", "text", or "prompt"/"completion"
|
| 61 |
DPO: "prompt", "chosen", "rejected"
|
| 62 |
GRPO: "prompt"
|
| 63 |
|
| 64 |
# Data audit
|
| 65 |
|
| 66 |
Before working with any dataset, audit it first. Do not assume you know what the data looks like — inspect it.
|
|
|
|
| 28 |
|
| 29 |
# Mistakes you WILL make without research
|
| 30 |
|
| 31 |
+
HALLUCINATED IMPORTS: You will import from modules that were renamed or removed. Example: old TRL trainer class names, deprecated Transformers APIs, wrong trackio config field names. Fix: read a current example script first.
|
| 32 |
|
| 33 |
WRONG TRAINER ARGUMENTS: You will pass configuration arguments that don't exist in current trainer versions. Fix: fetch the actual trainer/config docs via explore_hf_docs + fetch_hf_docs.
|
| 34 |
|
|
|
|
| 54 |
3. Validate model: hub_repo_details to confirm model exists, correct architecture/size/tokenizer
|
| 55 |
|
| 56 |
Training logging: always set disable_tqdm=True, logging_strategy="steps", and logging_first_step=True in your TrainingArguments/SFTConfig so loss values are printed as plain text lines you can grep, not hidden inside tqdm progress bars.
|
|
|
|
| 57 |
|
| 58 |
Dataset format requirements by training method:
|
| 59 |
SFT: "messages", "text", or "prompt"/"completion"
|
| 60 |
DPO: "prompt", "chosen", "rejected"
|
| 61 |
GRPO: "prompt"
|
| 62 |
|
| 63 |
+
# Trackio
|
| 64 |
+
|
| 65 |
+
Trackio is natively integrated with Transformers Trainer and all TRL trainers — the built-in TrackioCallback handles init/log/finish. In TrainingArguments/SFTConfig/DPOConfig/GRPOConfig set:
|
| 66 |
+
report_to="trackio"
|
| 67 |
+
run_name="<descriptive-run-name>" # e.g. "sft_qwen3-4b_lr2e-5_bs128"
|
| 68 |
+
project="<descriptive-project-name>" # keeps related runs grouped so you can compare them
|
| 69 |
+
trackio_space_id="<username>/mlintern-<8-char-id>" # creates a public dashboard Space
|
| 70 |
+
`project` and `trackio_space_id` can also be set via TRACKIO_PROJECT / TRACKIO_SPACE_ID env vars.
|
| 71 |
+
|
| 72 |
+
Alerts are how iterations decide what to change. Use trackio.alert(title, text, level) at every decision point in training. Levels:
|
| 73 |
+
ERROR — stop and change approach (divergence, NaN, OOM)
|
| 74 |
+
WARN — tweak hyperparameters (overfitting, early stopping, KL spike, reward collapse, slow convergence)
|
| 75 |
+
INFO — milestones (training complete, target reached, checkpoint saved)
|
| 76 |
+
Always include numeric values and an actionable suggestion in `text`, e.g. "loss=12.4 at step 200 — lr likely too high, try ×0.1". A future call must be able to parse it and act on it.
|
| 77 |
+
|
| 78 |
+
To add alerts under Trainer/SFTTrainer/GRPOTrainer, pass a custom TrainerCallback via `callbacks=[...]` that calls trackio.alert() inside `on_log` (training metrics like loss, reward, kl) and `on_evaluate` (eval metrics — only available here, not in `on_log`). Keep each `if` simple: one metric, one threshold. Conditions stay easy to adjust between runs.
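The callback wiring described above might look like the following minimal sketch. The class name `AlertCallback`, the thresholds, and the injected `alert_fn` are illustrative assumptions, not part of the prompt. `alert_fn` stands in for `trackio.alert` and is assumed to accept `title`, `text`, and `level` keywords; representing `level` as a plain string is also an assumption. The `transformers` import is wrapped so the sketch still runs where the library is not installed.

```python
import math

try:
    from transformers import TrainerCallback
except ImportError:  # fallback so the sketch runs without transformers
    TrainerCallback = object


class AlertCallback(TrainerCallback):
    """Hypothetical callback: one metric and one threshold per check."""

    def __init__(self, alert_fn, max_loss=10.0, max_eval_gap=0.5):
        self.alert_fn = alert_fn          # e.g. trackio.alert (assumed signature)
        self.max_loss = max_loss          # illustrative divergence threshold
        self.max_eval_gap = max_eval_gap  # illustrative overfitting threshold
        self.last_train_loss = None

    def on_log(self, args, state, control, logs=None, **kwargs):
        # Training metrics (loss, reward, kl, ...) arrive here.
        loss = (logs or {}).get("loss")
        if loss is None:
            return
        self.last_train_loss = loss
        step = state.global_step
        if math.isnan(loss):
            self.alert_fn(
                title="NaN loss",
                text=f"loss=nan at step {step}; lower lr or check the data",
                level="ERROR",
            )
        elif loss > self.max_loss:
            self.alert_fn(
                title="Divergence",
                text=f"loss={loss:.3g} at step {step}; lr likely too high, try x0.1",
                level="ERROR",
            )

    def on_evaluate(self, args, state, control, metrics=None, **kwargs):
        # Eval metrics are only delivered to on_evaluate, not to on_log.
        eval_loss = (metrics or {}).get("eval_loss")
        if eval_loss is None or self.last_train_loss is None:
            return
        gap = eval_loss - self.last_train_loss
        if gap > self.max_eval_gap:
            self.alert_fn(
                title="Overfitting",
                text=f"eval_loss-train_loss={gap:.3g} at step {state.global_step}; "
                     "try weight_decay x10",
                level="WARN",
            )
```

With a trainer this would be passed as `callbacks=[AlertCallback(alert_fn=trackio.alert)]`, provided the installed Trackio version accepts these level values; check its documentation for the expected level type.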
|
| 79 |
+
|
| 80 |
+
Read alerts back between runs instead of parsing thousands of metric values. CLI — always use --json:
|
| 81 |
+
trackio get alerts --project <p> --run <r> --json
|
| 82 |
+
trackio get alerts --project <p> --since <iso8601> --json # incremental polling
|
| 83 |
+
trackio get run --project <p> --run <r> --json
|
| 84 |
+
trackio get metric --project <p> --run <r> --metric <m> --json
|
| 85 |
+
trackio list runs --project <p> --json
|
| 86 |
+
Python: api = trackio.Api(); api.alerts(<p>, run=<r>, since=<ts>); api.runs(<p>) (each run has .name, .config, .alerts()).
|
| 87 |
+
|
| 88 |
+
Drive the next config from prior alerts:
|
| 89 |
+
diverged → lr × 0.1
|
| 90 |
+
overfitting → weight_decay × 10 or reduce capacity
|
| 91 |
+
early stopping → lr × 0.5 or adjust schedule
|
| 92 |
+
high accuracy → refine around current config
|
| 93 |
+
Read prior config via api.runs(...).config and only mutate keys the alerts justify changing.
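As one possible reading of the rules above, the mapping from prior alerts to the next config can be written as a small pure function. The function name `next_config`, the alert dictionaries with a `title` key, and the keyword matching on titles are all hypothetical; the alert records Trackio actually returns may be shaped differently.

```python
def next_config(prior_config, alerts):
    """Return a copy of prior_config, changing only keys the alerts justify.

    `alerts` is assumed to be a list of dicts with a "title" key
    (a hypothetical shape chosen for this sketch).
    """
    config = dict(prior_config)  # never mutate the prior run's config
    titles = " ".join(a.get("title", "").lower() for a in alerts)

    if "diverge" in titles:
        # diverged -> lr x 0.1
        config["learning_rate"] = prior_config["learning_rate"] * 0.1
    elif "early stop" in titles:
        # early stopping -> lr x 0.5
        config["learning_rate"] = prior_config["learning_rate"] * 0.5

    if "overfit" in titles:
        # overfitting -> weight_decay x 10; the 0.001 starting point for a
        # missing or zero weight_decay is an assumption of this sketch
        weight_decay = prior_config.get("weight_decay") or 0.001
        config["weight_decay"] = weight_decay * 10

    return config
```

Keys that no alert mentions pass through unchanged, which keeps consecutive runs comparable.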
|
| 94 |
+
|
| 95 |
# Data audit
|
| 96 |
|
| 97 |
Before working with any dataset, audit it first. Do not assume you know what the data looks like — inspect it.
|