Upload plugins/mlintern/skills/ml-intern-harness/SKILL.md with huggingface_hub
plugins/mlintern/skills/ml-intern-harness/SKILL.md
ADDED
|
@@ -0,0 +1,119 @@
| 1 |
+
---
|
| 2 |
+
name: ml-intern-harness
|
| 3 |
+
description: "The core ML Intern skill. Use for any ML engineering task on the Hugging Face ecosystem: research, validate, implement, test, run jobs, evaluate, and ship artifacts. Triggers for fine-tuning, training, evaluation, dataset preparation, model cards, and paper-to-implementation tasks."
|
| 4 |
+
disable-model-invocation: false
|
| 5 |
+
---
|
| 6 |
+
|
| 7 |
+
# ML Intern Harness
|
| 8 |
+
|
| 9 |
+
## Purpose
|
| 10 |
+
|
| 11 |
+
Act as an autonomous ML engineering assistant for the Hugging Face ecosystem. Combine research-first methodology with rigorous validation, tested code, monitored runs, and shipped artifacts.
|
| 12 |
+
|
| 13 |
+
This skill is for doing ML work end to end, not just advising. Research first, validate inputs, implement, test, run, evaluate, and ship.
|
| 14 |
+
|
| 15 |
+
## Default Workflow
|
| 16 |
+
|
| 17 |
+
For any non-trivial ML task, follow this loop:
|
| 18 |
+
|
| 19 |
+
1. **Clarify**: State the deliverable in one sentence. Is it a model, a benchmark result, a dataset, or a report?
|
| 20 |
+
2. **Research**: Find at least one current working implementation pattern. For novel tasks, search landmark and recent papers; prefer methodology sections over abstracts. Extract recipes: dataset, model, method, hyperparameters, metrics.
|
| 21 |
+
3. **Validate inputs**: Inspect the dataset schema, splits, and sample rows. Verify that the model repo exists, the architecture matches, the tokenizer is available, and the license is compatible.
|
| 22 |
+
4. **Implement smallest working version**: Use current HF docs and a working example as reference. Do not rely on memory for imports, config names, or trainer arguments.
|
| 23 |
+
5. **Smoke test**: Run locally or in a small HF Job before the full run.
|
| 24 |
+
6. **Run full job**: Submit to HF Jobs with a realistic timeout, `push_to_hub=True`, and monitoring.
|
| 25 |
+
7. **Evaluate**: Compare against target metrics, paper recipes, or baseline.
|
| 26 |
+
8. **Ship**: Save code and configs to the source repo. Publish model weights, datasets, or Spaces to Hugging Face. Return artifact URLs.
|
| 27 |
+
9. **Iterate**: If results are weak or broken, diagnose and run the next experiment. Do not stop after a plan.
|
| 28 |
+
|
| 29 |
+
## High-Risk Mistakes To Avoid
|
| 30 |
+
|
| 31 |
+
- Hallucinated imports or trainer arguments from outdated memory.
|
| 32 |
+
- Assumed dataset columns or split names without inspection.
|
| 33 |
+
- Jobs killed by the default 30-minute timeout. Set a timeout of at least 2 hours for real training.
|
| 34 |
+
- Lost models because `push_to_hub=True` and `hub_model_id` were missing.
|
| 35 |
+
- Batch jobs submitted before one job has proven the script works.
|
| 36 |
+
- Silent substitution of datasets, models, methods, or sequence length.
|
| 37 |
+
- Scope-changing fixes after OOM (switching full fine-tuning to LoRA, reducing `max_length`) without approval.
|
| 38 |
+
- Compiling flash-attention from source when a Hub kernel is available.
|
| 39 |
+
|
| 40 |
+
## Research Pattern
|
| 41 |
+
|
| 42 |
+
For paper-backed tasks:
|
| 43 |
+
1. Search for landmark and recent papers.
|
| 44 |
+
2. Prefer high-citation or recent downstream work.
|
| 45 |
+
3. Read the methodology, experiments, and results sections, not just abstracts.
|
| 46 |
+
4. Extract recipe-level claims: `Dataset X + method Y + model Z produced metric M on benchmark B`.
|
| 47 |
+
5. Find linked Hugging Face datasets, models, and collections.
|
| 48 |
+
6. Inspect promising datasets before using them.
|
| 49 |
+
7. Read current HF docs and GitHub examples before implementing.
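One way to keep the recipe-level claims from step 4 comparable across papers is to record each one as a small structured object. The sketch below is illustrative only: the `Recipe` class and its field names are hypothetical and not part of any Hugging Face library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Recipe:
    """One recipe-level claim extracted from a paper (hypothetical helper)."""

    dataset: str
    method: str
    model: str
    metric: str
    benchmark: str
    source: str = ""  # paper title or identifier, filled in by the reader

    def claim(self) -> str:
        # Mirrors the template: Dataset X + method Y + model Z
        # produced metric M on benchmark B.
        return (
            f"Dataset {self.dataset} + method {self.method} + model {self.model} "
            f"produced metric {self.metric} on benchmark {self.benchmark}"
        )


# Placeholder values only; real entries come from the papers you read.
print(Recipe("X", "Y", "Z", "M", "B").claim())
```

Keeping recipes in this form makes it easier to compare a later evaluation result against the claim it was meant to reproduce.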
|
| 50 |
+
|
| 51 |
+
Use the `hf-paper-search` skill for paper operations.
|
| 52 |
+
|
| 53 |
+
## Dataset Audit Pattern
|
| 54 |
+
|
| 55 |
+
Before training or evaluating:
|
| 56 |
+
- Verify repo, config, split, and revision.
|
| 57 |
+
- Check row counts, column names, and representative rows.
|
| 58 |
+
- Look for missing values, invalid records, class imbalance, or reward/preference balance.
|
| 59 |
+
- Check text/message schema compatibility with the trainer:
|
| 60 |
+
  - SFT: needs `messages`, `text`, or `prompt`/`completion`
|
| 61 |
+
  - DPO: needs `prompt`, `chosen`, `rejected`
|
| 62 |
+
  - GRPO: needs `prompt`
|
| 63 |
+
- Check license, gating, and token requirements.
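The schema compatibility check above can be sketched as a small pure-Python guard that fails with a clear error before any training starts. The function name `check_schema` and the `REQUIRED_COLUMNS` table are hypothetical; the column requirements are copied from the list above and should be confirmed against the current trainer docs.

```python
# Accepted column layouts per trainer, taken from the audit list above.
# Each trainer maps to a list of alternatives; one fully present alternative is enough.
REQUIRED_COLUMNS = {
    "sft": [{"messages"}, {"text"}, {"prompt", "completion"}],
    "dpo": [{"prompt", "chosen", "rejected"}],
    "grpo": [{"prompt"}],
}


def check_schema(trainer, columns):
    """Raise ValueError if `columns` matches none of the layouts for `trainer`."""
    key = trainer.lower()
    if key not in REQUIRED_COLUMNS:
        raise ValueError(
            f"Unknown trainer {trainer!r}; expected one of {sorted(REQUIRED_COLUMNS)}"
        )
    present = set(columns)
    # A layout is satisfied when all of its columns are present.
    if any(option <= present for option in REQUIRED_COLUMNS[key]):
        return
    accepted = " or ".join("/".join(sorted(option)) for option in REQUIRED_COLUMNS[key])
    raise ValueError(
        f"{key.upper()} needs columns {accepted}; found {sorted(present)}"
    )


check_schema("sft", ["messages", "id"])  # passes silently
```

With a `datasets.Dataset` loaded as `ds`, the call would typically be `check_schema("dpo", ds.column_names)`, run once per split.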
|
| 64 |
+
|
| 65 |
+
Use the `hf-dataset-search` skill for dataset operations.
|
| 66 |
+
|
| 67 |
+
## Training Script Pattern
|
| 68 |
+
|
| 69 |
+
A training script should include:
|
| 70 |
+
- Explicit dependency versions where necessary.
|
| 71 |
+
- Argument/config section with model, dataset, output repo, seed, and hardware-sensitive settings.
|
| 72 |
+
- Dataset loading and schema validation with clear errors.
|
| 73 |
+
- Tokenizer/model loading with trust/gating choices explicit.
|
| 74 |
+
- Trainer config using current docs (not memory).
|
| 75 |
+
- Plain-text logging: `disable_tqdm=True`, `logging_strategy="steps"`, `logging_first_step=True`.
|
| 76 |
+
- Evaluation or validation pass.
|
| 77 |
+
- `push_to_hub=True` or explicit upload of final artifacts.
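As a rough illustration of the logging and upload settings listed above, the sketch below gathers them into one dictionary that could be reviewed, logged, and then passed to a trainer config. The helper `build_training_kwargs` is hypothetical, and `logging_steps=10` and the default `seed` are assumed values. The keys follow `transformers.TrainingArguments` parameter names as I understand them, so verify them against the current docs before use.

```python
def build_training_kwargs(hub_model_id, output_dir="outputs", seed=42):
    """Collect trainer settings in one place (hypothetical helper)."""
    if not hub_model_id:
        raise ValueError("hub_model_id must be set so the final model is not lost")
    return {
        "output_dir": output_dir,
        "seed": seed,
        # Plain-text logging that stays readable in job logs.
        "disable_tqdm": True,
        "logging_strategy": "steps",
        "logging_steps": 10,  # assumed value, tune per run
        "logging_first_step": True,
        # Both are needed for the trained model to land in a known Hub repo.
        "push_to_hub": True,
        "hub_model_id": hub_model_id,
    }


kwargs = build_training_kwargs("your-username/your-model")
print(kwargs["push_to_hub"], kwargs["hub_model_id"])
```

In a real script the dictionary would be unpacked into the trainer's config class, for example `TrainingArguments(**kwargs)`, after checking each key against the installed library version.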
|
| 78 |
+
|
| 79 |
+
For long runs, include Trackio or equivalent monitoring and return the dashboard URL.
|
| 80 |
+
|
| 81 |
+
## HF Jobs Preflight
|
| 82 |
+
|
| 83 |
+
Before submitting a training or evaluation job, state:
|
| 84 |
+
- Reference implementation or docs used.
|
| 85 |
+
- Dataset schema verified.
|
| 86 |
+
- Model repo and tokenizer verified.
|
| 87 |
+
- Smoke test completed.
|
| 88 |
+
- Hardware choice and timeout justified.
|
| 89 |
+
- Hub artifact output configured (`push_to_hub` or explicit upload).
|
| 90 |
+
- Monitoring configured (Trackio/dashboard/logged metrics).
|
| 91 |
+
- One-job-first plan for risky scripts or sweeps.
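The preflight items above can be tracked as explicit booleans so that nothing is skipped silently. This is a minimal sketch: the `Preflight` class and its field names are hypothetical and simply mirror the checklist.

```python
from dataclasses import dataclass, fields


@dataclass
class Preflight:
    """Checklist mirroring the preflight items above (hypothetical helper)."""

    reference_used: bool = False
    dataset_schema_verified: bool = False
    model_and_tokenizer_verified: bool = False
    smoke_test_completed: bool = False
    hardware_and_timeout_justified: bool = False
    hub_output_configured: bool = False
    monitoring_configured: bool = False
    one_job_first_plan: bool = False

    def missing(self):
        # Names of every item that has not been confirmed yet.
        return [f.name for f in fields(self) if not getattr(self, f.name)]


check = Preflight(reference_used=True, smoke_test_completed=True)
print(check.missing())  # lists the six items still unconfirmed
```

A submission wrapper could refuse to launch while `missing()` is non-empty and print the outstanding items instead.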
|
| 92 |
+
|
| 93 |
+
Use the `hf-jobs` skill for job submission and monitoring.
|
| 94 |
+
|
| 95 |
+
## Error Recovery
|
| 96 |
+
|
| 97 |
+
When something fails:
|
| 98 |
+
- Read the full error and relevant logs.
|
| 99 |
+
- Do not retry the exact same command without changing the cause.
|
| 100 |
+
- Import error: fetch docs/example, patch import/config.
|
| 101 |
+
- Dataset KeyError: re-inspect schema, patch preprocessing.
|
| 102 |
+
- OOM: reduce per-device batch size while increasing gradient accumulation to keep the effective batch size constant; enable gradient checkpointing; or choose larger hardware. Do not switch methods.
|
| 103 |
+
- Divergence/NaN: lower learning rate, check labels/rewards, inspect samples.
|
| 104 |
+
- Weak metric: compare against paper recipes, inspect errors, tune with a small sweep.
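The OOM rule above relies on simple arithmetic: the effective batch size per device is the per-device batch size times the number of gradient accumulation steps, so dividing one and multiplying the other by the same factor leaves it unchanged. A minimal sketch, where the helper name `rebalance_for_oom` is hypothetical:

```python
def rebalance_for_oom(per_device_batch_size, grad_accum_steps, shrink_factor=2):
    """Shrink the per-device batch and grow accumulation by the same factor."""
    if shrink_factor < 1 or per_device_batch_size % shrink_factor != 0:
        raise ValueError(
            "shrink_factor must be >= 1 and divide per_device_batch_size evenly"
        )
    new_batch = per_device_batch_size // shrink_factor
    new_accum = grad_accum_steps * shrink_factor
    # The product, and therefore the effective batch size, is preserved.
    assert new_batch * new_accum == per_device_batch_size * grad_accum_steps
    return new_batch, new_accum


print(rebalance_for_oom(8, 4))  # (4, 8): effective batch size stays 32 per device
```

This changes memory use and step time but not the optimization recipe, which is why it does not count as a scope-changing fix.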
|
| 105 |
+
|
| 106 |
+
Do not hide compromises. If preserving the original request is impossible, explain the constraint and ask for approval.
|
| 107 |
+
|
| 108 |
+
## Completion Standard
|
| 109 |
+
|
| 110 |
+
Before final response, verify:
|
| 111 |
+
- The requested artifact exists (model, dataset, metrics, report, or running job).
|
| 112 |
+
- The model has been evaluated and confirmed to work when possible.
|
| 113 |
+
|
| 114 |
+
Return:
|
| 115 |
+
- What was done.
|
| 116 |
+
- Source repo links (branch, commit, PR).
|
| 117 |
+
- Hugging Face artifact URLs (model, dataset, Space, job).
|
| 118 |
+
- Metrics or evaluation results.
|
| 119 |
+
- Known gaps, failures, or next experiments.
|