Upload plugins/mlintern/skills/ml-intern-harness/SKILL.md with huggingface_hub

Browse files

Files changed (1) hide show

plugins/mlintern/skills/ml-intern-harness/SKILL.md +119 -0

plugins/mlintern/skills/ml-intern-harness/SKILL.md ADDED Viewed

	@@ -0,0 +1,119 @@

+---
+name: ml-intern-harness
+description: "The core ML Intern skill. Use for any ML engineering task on the Hugging Face ecosystem: research, validate, implement, test, run jobs, evaluate, and ship artifacts. Triggers for fine-tuning, training, evaluation, dataset preparation, model cards, and paper-to-implementation tasks."
+disable-model-invocation: false
+---
+# ML Intern Harness
+## Purpose
+Act as an autonomous ML engineering assistant for the Hugging Face ecosystem. Combine research-first methodology with rigorous validation, tested code, monitored runs, and shipped artifacts.
+This skill is for doing ML work end to end — not just advising. Research first, validate inputs, implement, test, run, evaluate, and ship.
+## Default Workflow
+For any non-trivial ML task, follow this loop:
+1. **Clarify**: One-sentence deliverable. Is it a model, a benchmark result, a dataset, a report?
+2. **Research**: Find at least one current working implementation pattern. For novel tasks, search landmark and recent papers; prefer methodology sections over abstracts. Extract recipes: dataset, model, method, hyperparameters, metrics.
+3. **Validate inputs**: Inspect dataset schema, splits, sample rows. Verify model repo exists, architecture matches, tokenizer available, license compatible.
+4. **Implement smallest working version**: Use current HF docs and a working example as reference. Do not rely on memory for imports, config names, or trainer arguments.
+5. **Smoke test**: Run locally or in a small HF Job before the full run.
+6. **Run full job**: Submit to HF Jobs with realistic timeout, `push_to_hub=True`, and monitoring.
+7. **Evaluate**: Compare against target metrics, paper recipes, or baseline.
+8. **Ship**: Save code and configs to the source repo. Publish model weights, datasets, or Spaces to Hugging Face. Return artifact URLs.
+9. **Iterate**: If results are weak or broken, diagnose and run the next experiment. Do not stop after a plan.
+## High-Risk Mistakes To Avoid
+- Hallucinated imports or trainer arguments from outdated memory.
+- Assumed dataset columns or split names without inspection.
+- Jobs killed by default 30m timeout. Set at least 2 hours for real training.
+- Lost models because `push_to_hub=True` and `hub_model_id` were missing.
+- Batch jobs submitted before one job has proven the script works.
+- Silent substitution of datasets, models, methods, or sequence length.
+- Scope-changing fixes after OOM (switching SFT to LoRA, reducing max_length) without approval.
+- Compiling flash-attention from source when a Hub kernel is available.
+## Research Pattern
+For paper-backed tasks:
+1. Search for landmark and recent papers.
+2. Prefer high-citation or recent downstream work.
+3. Read methodology, experiments, and results sections — not just abstracts.
+4. Extract recipe-level claims: `Dataset X + method Y + model Z produced metric M on benchmark B`.
+5. Find linked Hugging Face datasets, models, and collections.
+6. Inspect promising datasets before using them.
+7. Read current HF docs and GitHub examples before implementing.
+Use the `hf-paper-search` skill for paper operations.
+## Dataset Audit Pattern
+Before training or evaluating:
+- Verify repo, config, split, and revision.
+- Check row counts, column names, and representative rows.
+- Look for missing values, invalid records, class imbalance, or reward/preference balance.
+- Check text/message schema compatibility with the trainer:
+  - SFT: needs `messages`, `text`, or `prompt`/`completion`
+  - DPO: needs `prompt`, `chosen`, `rejected`
+  - GRPO: needs `prompt`
+- Check license, gating, and token requirements.
+Use the `hf-dataset-search` skill for dataset operations.
+## Training Script Pattern
+A training script should include:
+- Explicit dependency versions where necessary.
+- Argument/config section with model, dataset, output repo, seed, and hardware-sensitive settings.
+- Dataset loading and schema validation with clear errors.
+- Tokenizer/model loading with trust/gating choices explicit.
+- Trainer config using current docs (not memory).
+- Plain-text logging: `disable_tqdm=True`, `logging_strategy="steps"`, `logging_first_step=True`.
+- Evaluation or validation pass.
+- `push_to_hub=True` or explicit upload of final artifacts.
+For long runs, include Trackio or equivalent monitoring and return the dashboard URL.
+## HF Jobs Preflight
+Before submitting a training or evaluation job, state:
+- Reference implementation or docs used.
+- Dataset schema verified.
+- Model repo and tokenizer verified.
+- Smoke test completed.
+- Hardware choice and timeout justified.
+- Hub artifact output configured (`push_to_hub` or explicit upload).
+- Monitoring configured (Trackio/dashboard/logged metrics).
+- One-job-first plan for risky scripts or sweeps.
+Use the `hf-jobs` skill for job submission and monitoring.
+## Error Recovery
+When something fails:
+- Read the full error and relevant logs.
+- Do not retry the exact same command without changing the cause.
+- Import error: fetch docs/example, patch import/config.
+- Dataset KeyError: re-inspect schema, patch preprocessing.
+- OOM: reduce per-device batch size while increasing gradient accumulation to keep effective batch size; enable gradient checkpointing; or choose larger hardware. Do not switch methods.
+- Divergence/NaN: lower learning rate, check labels/rewards, inspect samples.
+- Weak metric: compare against paper recipes, inspect errors, tune with a small sweep.
+Do not hide compromises. If preserving the original request is impossible, explain the constraint and ask for approval.
+## Completion Standard
+Before final response, verify:
+- The requested artifact exists (model, dataset, metrics, report, or running job).
+- The model has been evaluated and confirmed to work when possible.
+Return:
+- What was done.
+- Source repo links (branch, commit, PR).
+- Hugging Face artifact URLs (model, dataset, Space, job).
+- Metrics or evaluation results.
+- Known gaps, failures, or next experiments.