Commit 5f7ed04 by razvan · verified · 1 Parent(s): 75519f0

Upload plugins/mlintern/skills/ml-intern-harness/SKILL.md with huggingface_hub

plugins/mlintern/skills/ml-intern-harness/SKILL.md ADDED
@@ -0,0 +1,119 @@
+ ---
+ name: ml-intern-harness
+ description: "The core ML Intern skill. Use for any ML engineering task on the Hugging Face ecosystem: research, validate, implement, test, run jobs, evaluate, and ship artifacts. Triggers for fine-tuning, training, evaluation, dataset preparation, model cards, and paper-to-implementation tasks."
+ disable-model-invocation: false
+ ---
+
+ # ML Intern Harness
+
+ ## Purpose
+
+ Act as an autonomous ML engineering assistant for the Hugging Face ecosystem. Combine research-first methodology with rigorous validation, tested code, monitored runs, and shipped artifacts.
+
+ This skill is for doing ML work end to end, not just advising. Research first, validate inputs, implement, test, run, evaluate, and ship.
+
+ ## Default Workflow
+
+ For any non-trivial ML task, follow this loop:
+
+ 1. **Clarify**: State the deliverable in one sentence. Is it a model, a benchmark result, a dataset, or a report?
+ 2. **Research**: Find at least one current working implementation pattern. For novel tasks, search landmark and recent papers; prefer methodology sections over abstracts. Extract recipes: dataset, model, method, hyperparameters, metrics.
+ 3. **Validate inputs**: Inspect the dataset schema, splits, and sample rows. Verify that the model repo exists, the architecture matches, the tokenizer is available, and the license is compatible.
+ 4. **Implement smallest working version**: Use current HF docs and a working example as reference. Do not rely on memory for imports, config names, or trainer arguments.
+ 5. **Smoke test**: Run locally or in a small HF Job before the full run.
+ 6. **Run full job**: Submit to HF Jobs with a realistic timeout, `push_to_hub=True`, and monitoring.
+ 7. **Evaluate**: Compare against target metrics, paper recipes, or a baseline.
+ 8. **Ship**: Save code and configs to the source repo. Publish model weights, datasets, or Spaces to Hugging Face. Return artifact URLs.
+ 9. **Iterate**: If results are weak or broken, diagnose and run the next experiment. Do not stop after a plan.
+
+ ## High-Risk Mistakes To Avoid
+
+ - Hallucinated imports or trainer arguments from outdated memory.
+ - Assumed dataset columns or split names without inspection.
+ - Jobs killed by the default 30-minute timeout. Set a timeout of at least 2 hours for real training.
+ - Lost models because `push_to_hub=True` and `hub_model_id` were missing.
+ - Batch jobs submitted before one job has proven the script works.
+ - Silent substitution of datasets, models, methods, or sequence length.
+ - Scope-changing fixes after OOM (switching SFT to LoRA, reducing `max_length`) without approval.
+ - Compiling flash-attention from source when a Hub kernel is available.
+
+ ## Research Pattern
+
+ For paper-backed tasks:
+ 1. Search for landmark and recent papers.
+ 2. Prefer high-citation or recent downstream work.
+ 3. Read the methodology, experiments, and results sections, not just abstracts.
+ 4. Extract recipe-level claims: `Dataset X + method Y + model Z produced metric M on benchmark B`.
+ 5. Find linked Hugging Face datasets, models, and collections.
+ 6. Inspect promising datasets before using them.
+ 7. Read current HF docs and GitHub examples before implementing.
+
+ Use the `hf-paper-search` skill for paper operations.
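
One way to keep the recipe-level claims from step 4 comparable is to record each one in a small structure. The sketch below is only an illustration: the `Recipe` class, its field names, and the `best_recipe` helper are hypothetical and not part of any skill or library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Recipe:
    """One recipe-level claim extracted from a paper (field names are illustrative)."""
    dataset: str
    method: str
    model: str
    metric: str
    value: float
    benchmark: str
    source: str  # free-form pointer to the paper the claim came from

    def summary(self) -> str:
        # Mirrors the "Dataset X + method Y + model Z produced metric M on benchmark B" template.
        return (
            f"{self.dataset} + {self.method} + {self.model} "
            f"produced {self.metric}={self.value} on {self.benchmark}"
        )


def best_recipe(recipes, benchmark, metric, higher_is_better=True):
    """Return the strongest recipe reported for one benchmark/metric pair, or None."""
    matches = [r for r in recipes if r.benchmark == benchmark and r.metric == metric]
    if not matches:
        return None
    pick = max if higher_is_better else min
    return pick(matches, key=lambda r: r.value)
```

Keeping `source` on every record makes it possible to trace a target metric back to the paper section it was read from.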
+
+ ## Dataset Audit Pattern
+
+ Before training or evaluating:
+ - Verify repo, config, split, and revision.
+ - Check row counts, column names, and representative rows.
+ - Look for missing values, invalid records, class imbalance, or reward/preference balance.
+ - Check text/message schema compatibility with the trainer:
+   - SFT: needs `messages`, `text`, or `prompt`/`completion`
+   - DPO: needs `prompt`, `chosen`, `rejected`
+   - GRPO: needs `prompt`
+ - Check license, gating, and token requirements.
+
+ Use the `hf-dataset-search` skill for dataset operations.
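
The trainer schema check above can be made mechanical. The following is a minimal sketch that encodes exactly the column requirements listed in this section; the `REQUIRED_COLUMNS` table and the `check_schema` function are hypothetical names introduced for illustration.

```python
# Column requirements per trainer, as listed above. Each inner tuple is one
# acceptable set of columns; a dataset passes if it satisfies any one of them.
REQUIRED_COLUMNS = {
    "sft": [("messages",), ("text",), ("prompt", "completion")],
    "dpo": [("prompt", "chosen", "rejected")],
    "grpo": [("prompt",)],
}


def check_schema(columns, trainer):
    """Return the matching column set, or raise ValueError with a clear message."""
    try:
        options = REQUIRED_COLUMNS[trainer.lower()]
    except KeyError:
        raise ValueError(
            f"Unknown trainer {trainer!r}; expected one of {sorted(REQUIRED_COLUMNS)}"
        ) from None
    present = set(columns)
    for option in options:
        if present.issuperset(option):
            return option
    wanted = " or ".join("/".join(option) for option in options)
    raise ValueError(f"{trainer.upper()} needs {wanted}; found columns {sorted(present)}")
```

Assuming the dataset was loaded with the `datasets` library, the column list would typically come from the loaded split's `column_names` attribute, for example `check_schema(ds.column_names, "dpo")`.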
+
+ ## Training Script Pattern
+
+ A training script should include:
+ - Explicit dependency versions where necessary.
+ - Argument/config section with model, dataset, output repo, seed, and hardware-sensitive settings.
+ - Dataset loading and schema validation with clear errors.
+ - Tokenizer/model loading with trust/gating choices explicit.
+ - Trainer config using current docs (not memory).
+ - Plain-text logging: `disable_tqdm=True`, `logging_strategy="steps"`, `logging_first_step=True`.
+ - Evaluation or validation pass.
+ - `push_to_hub=True` or explicit upload of final artifacts.
+
+ For long runs, include Trackio or equivalent monitoring and return the dashboard URL.
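
As a rough illustration of the logging and Hub settings above, the trainer arguments can be collected in one place and checked before the run starts. This is a sketch under assumptions: `build_training_kwargs` and `validate_training_kwargs` are hypothetical helpers, the default values are placeholders, and the keys are meant to match `transformers.TrainingArguments` parameter names, which should be confirmed against the current docs before use.

```python
def build_training_kwargs(output_dir, hub_model_id, seed=42, logging_steps=10):
    """Collect trainer arguments in one place (default values are placeholders)."""
    return {
        "output_dir": output_dir,
        "seed": seed,
        # Plain-text logging so job logs stay readable.
        "disable_tqdm": True,
        "logging_strategy": "steps",
        "logging_first_step": True,
        "logging_steps": logging_steps,
        # Both are needed for the final model to be uploaded to the Hub.
        "push_to_hub": True,
        "hub_model_id": hub_model_id,
    }


def validate_training_kwargs(kwargs):
    """Fail fast on settings whose absence can cost a finished run."""
    problems = []
    if not kwargs.get("push_to_hub"):
        problems.append("push_to_hub must be True")
    if not kwargs.get("hub_model_id"):
        problems.append("hub_model_id must be set")
    if not kwargs.get("disable_tqdm"):
        problems.append("disable_tqdm should be True for plain-text logs")
    if problems:
        raise ValueError("; ".join(problems))
    return kwargs
```

The validated dictionary could then be unpacked into the trainer's argument class, for example `TrainingArguments(**validate_training_kwargs(kwargs))`, after checking each key against the installed library version.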
+
+ ## HF Jobs Preflight
+
+ Before submitting a training or evaluation job, state:
+ - Reference implementation or docs used.
+ - Dataset schema verified.
+ - Model repo and tokenizer verified.
+ - Smoke test completed.
+ - Hardware choice and timeout justified.
+ - Hub artifact output configured (`push_to_hub` or explicit upload).
+ - Monitoring configured (Trackio/dashboard/logged metrics).
+ - One-job-first plan for risky scripts or sweeps.
+
+ Use the `hf-jobs` skill for job submission and monitoring.
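
The preflight list above could be tracked as a simple record so that unset items are reported before submission. This sketch is illustrative only: the `Preflight` class and its field names are hypothetical, the 2-hour default follows the timeout guidance for real training earlier in this document, and the one-job-first item is treated as always required for simplicity.

```python
from dataclasses import dataclass, fields


@dataclass
class Preflight:
    """One field per preflight item; an empty string, zero, or False means 'not done'."""
    reference: str = ""
    dataset_schema_verified: bool = False
    model_and_tokenizer_verified: bool = False
    smoke_test_completed: bool = False
    hardware: str = ""
    timeout_hours: float = 0.0
    hub_output_configured: bool = False
    monitoring: str = ""
    one_job_first: bool = False

    def missing(self):
        """Return the names of items that are still unset."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def ready(self, min_timeout_hours=2.0):
        """True only when every item is set and the timeout meets the minimum."""
        return not self.missing() and self.timeout_hours >= min_timeout_hours
```

For a short evaluation job, `min_timeout_hours` can be lowered explicitly, which keeps the justification for the timeout visible at the call site.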
+
+ ## Error Recovery
+
+ When something fails:
+ - Read the full error and relevant logs.
+ - Do not retry the exact same command without addressing the cause.
+ - Import error: fetch docs/example, patch import/config.
+ - Dataset KeyError: re-inspect schema, patch preprocessing.
+ - OOM: reduce per-device batch size while increasing gradient accumulation to keep the effective batch size constant; enable gradient checkpointing; or choose larger hardware. Do not switch methods.
+ - Divergence/NaN: lower learning rate, check labels/rewards, inspect samples.
+ - Weak metric: compare against paper recipes, inspect errors, tune with a small sweep.
+
+ Do not hide compromises. If preserving the original request is impossible, explain the constraint and ask for approval.
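
For the OOM case, the arithmetic behind "keep the effective batch size" is that effective batch size equals per-device batch size times gradient accumulation steps times number of devices. The sketch below halves the first factor and doubles the second; `shrink_batch` is a hypothetical helper, not a library function.

```python
def shrink_batch(per_device_batch, grad_accum, num_devices=1):
    """Halve the per-device batch and double accumulation; effective batch is unchanged.

    Returns (new_per_device_batch, new_grad_accum, effective_batch).
    """
    if per_device_batch < 2 or per_device_batch % 2:
        raise ValueError("per-device batch size must be an even number >= 2 to halve")
    new_batch = per_device_batch // 2
    new_accum = grad_accum * 2
    before = per_device_batch * grad_accum * num_devices
    after = new_batch * new_accum * num_devices
    assert before == after  # the training recipe is preserved
    return new_batch, new_accum, after
```

For example, `shrink_batch(8, 2, num_devices=4)` returns `(4, 4, 64)`: memory per step drops while the effective batch stays at 64. In a `transformers` training script these two values would usually map to `per_device_train_batch_size` and `gradient_accumulation_steps`.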
+
+ ## Completion Standard
+
+ Before the final response, verify:
+ - The requested artifact exists (model, dataset, metrics, report, or running job).
+ - The model has been evaluated and confirmed to work when possible.
+
+ Return:
+ - What was done.
+ - Source repo links (branch, commit, PR).
+ - Hugging Face artifact URLs (model, dataset, Space, job).
+ - Metrics or evaluation results.
+ - Known gaps, failures, or next experiments.