abidlabs (HF Staff) and lewtun (HF Staff) committed
Commit 8e93e94 · unverified · 1 Parent(s): 2083476

Treat Trackio as core for training and prefer public Space dashboards (#129)

* Strengthen Trackio and public Space guidance for training workflows

Treat trackio as a core dependency for training-like job scripts, keep huggingface_hub in baseline dependencies, and reinforce prompt/tool guidance to provide Trackio dashboards. Also instruct agents to publish training dashboards/results to public Spaces with random IDs when feasible.

Made-with: Cursor

* Apply suggestion from @abidlabs

* changes

* changes

* Apply suggestion from @abidlabs

* Keep prompt changes to v3 and restore jobs tool enforcement

Restore the training dependency and Trackio/public-Space guidance logic in hf_jobs, while reverting system_prompt.yaml and system_prompt_v2.yaml so runtime-facing guidance stays concentrated in system_prompt_v3.yaml.

Made-with: Cursor

* changes

* changes

* changes

* Apply suggestion from @abidlabs

* Apply suggestion from @abidlabs

---------

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

agent/prompts/system_prompt_v3.yaml CHANGED
@@ -54,6 +54,7 @@ system_prompt: |
  3. Validate model: hub_repo_details to confirm model exists, correct architecture/size/tokenizer

  Training logging: always set disable_tqdm=True, logging_strategy="steps", and logging_first_step=True in your TrainingArguments/SFTConfig so loss values are printed as plain text lines you can grep, not hidden inside tqdm progress bars.
+ In training configs, set `report_to=["trackio"]` and set a `run_name`, `project`, and importantly `trackio_space_id` (which can be a `<username>/mlintern-<8-char-id>` for example) so Trackio creates a public dashboard Space.

  Dataset format requirements by training method:
  SFT: "messages", "text", or "prompt"/"completion"
@@ -75,7 +76,7 @@ system_prompt: |
  - Dataset format verified: [columns confirmed via hf_inspect_dataset/hub_repo_details]
  - push_to_hub=True and hub_model_id set
  - timeout: [value] (based on: [model size] on [hardware])
- - Trackio monitoring included and working
+ - Trackio monitoring included and deploying metrics to a public Space

  If you cannot fill in all items, stop and complete the missing steps first.
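As a rough illustration of the guidance this diff adds to system_prompt_v3.yaml, the sketch below assembles the logging and Trackio settings named in the prompt into a plain dictionary and generates a Space ID of the form `<username>/mlintern-<8-char-id>`. The helper names `make_trackio_space_id` and `build_training_kwargs`, the username, and the project and run names are hypothetical and not part of the commit. Whether `project` and `trackio_space_id` are accepted as config fields depends on the installed transformers/TRL version, so treat the final hand-off as an assumption to verify.

```python
import secrets
import string


def make_trackio_space_id(username: str, length: int = 8) -> str:
    """Build '<username>/mlintern-<random id>' (hypothetical helper, not from the commit)."""
    alphabet = string.ascii_lowercase + string.digits
    suffix = "".join(secrets.choice(alphabet) for _ in range(length))
    return f"{username}/mlintern-{suffix}"


def build_training_kwargs(username: str, project: str, run_name: str) -> dict:
    """Collect the settings the prompt asks for into one kwargs dict."""
    return {
        # Plain-text loss lines that can be grepped, per the prompt.
        "disable_tqdm": True,
        "logging_strategy": "steps",
        "logging_first_step": True,
        # Trackio reporting, per the line added in this commit.
        "report_to": ["trackio"],
        "run_name": run_name,
        "project": project,
        "trackio_space_id": make_trackio_space_id(username),
    }


# Placeholder values for illustration only.
kwargs = build_training_kwargs("my-user", "demo-project", "sft-run-1")
for key, value in kwargs.items():
    print(f"{key} = {value!r}")

# Assumption: with a library version that supports these fields, the dict
# could be passed along the lines of SFTConfig(output_dir="out", **kwargs).
```

Generating the suffix with `secrets` rather than a fixed string keeps each dashboard Space distinct, which matches the commit message's note about using random IDs.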