Treat Trackio as core for training and prefer public Space dashboards (#129)
Browse files* Strengthen Trackio and public Space guidance for training workflows
Treat trackio as a core dependency for training-like job scripts, keep huggingface_hub in baseline dependencies, and reinforce prompt/tool guidance to provide Trackio dashboards. Also instruct agents to publish training dashboards/results to public Spaces with random IDs when feasible.
Made-with: Cursor
* Apply suggestion from @abidlabs
* changes
* changes
* Apply suggestion from @abidlabs
* Keep prompt changes to v3 and restore jobs tool enforcement
Restore the training dependency and Trackio/public-Space guidance logic in hf_jobs, while reverting system_prompt.yaml and system_prompt_v2.yaml so runtime-facing guidance stays concentrated in system_prompt_v3.yaml.
Made-with: Cursor
* changes
* changes
* changes
* Apply suggestion from @abidlabs
* Apply suggestion from @abidlabs
---------
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
|
@@ -54,6 +54,7 @@ system_prompt: |
|
|
| 54 |
3. Validate model: hub_repo_details to confirm model exists, correct architecture/size/tokenizer
|
| 55 |
|
| 56 |
Training logging: always set disable_tqdm=True, logging_strategy="steps", and logging_first_step=True in your TrainingArguments/SFTConfig so loss values are printed as plain text lines you can grep, not hidden inside tqdm progress bars.
|
|
|
|
| 57 |
|
| 58 |
Dataset format requirements by training method:
|
| 59 |
SFT: "messages", "text", or "prompt"/"completion"
|
|
@@ -75,7 +76,7 @@ system_prompt: |
|
|
| 75 |
- Dataset format verified: [columns confirmed via hf_inspect_dataset/hub_repo_details]
|
| 76 |
- push_to_hub=True and hub_model_id set
|
| 77 |
- timeout: [value] (based on: [model size] on [hardware])
|
| 78 |
-
- Trackio monitoring included and
|
| 79 |
|
| 80 |
If you cannot fill in all items, stop and complete the missing steps first.
|
| 81 |
|
|
|
|
| 54 |
3. Validate model: hub_repo_details to confirm model exists, correct architecture/size/tokenizer
|
| 55 |
|
| 56 |
Training logging: always set disable_tqdm=True, logging_strategy="steps", and logging_first_step=True in your TrainingArguments/SFTConfig so loss values are printed as plain text lines you can grep, not hidden inside tqdm progress bars.
|
| 57 |
+
In training configs, set `report_to=["trackio"]` and set a `run_name`, `project`, and importantly `trackio_space_id` (which can be a `<username>/mlintern-<8-char-id>` for example) so Trackio creates a public dashboard Space.
|
| 58 |
|
| 59 |
Dataset format requirements by training method:
|
| 60 |
SFT: "messages", "text", or "prompt"/"completion"
|
|
|
|
| 76 |
- Dataset format verified: [columns confirmed via hf_inspect_dataset/hub_repo_details]
|
| 77 |
- push_to_hub=True and hub_model_id set
|
| 78 |
- timeout: [value] (based on: [model size] on [hardware])
|
| 79 |
+
- Trackio monitoring included and deploying metrics to a public Space
|
| 80 |
|
| 81 |
If you cannot fill in all items, stop and complete the missing steps first.
|
| 82 |
|