| # /mlintern:run |
|
|
| Run an ML Intern task end-to-end. |
|
|
| ## Arguments |
|
|
| - `prompt` (required): A one-sentence description of the ML deliverable. Examples: "fine-tune Qwen3-4B for code completion on python-code-dataset", "benchmark sentence-transformers/all-MiniLM-L6-v2 on STS-B", "train a diffusion LoRA on my art dataset". |
| - `--model` (optional): LiteLLM model ID to use (e.g., `huggingface/openai/gpt-oss-120b`). Defaults to the environment default. |
| - `--background` (optional): Queue the task and return immediately. Check on it later with `--status <job-id>`. |
| - `--status <job-id>` (optional): Check status of a background job. |
| - `--result <job-id>` (optional): Fetch the final report of a completed background job. |
| - `--cancel <job-id>` (optional): Cancel a running background job. |
|
|
| ## Workflow |
|
|
| 1. Clarify the deliverable from the prompt. |
| 2. If the task has 3 or more meaningful steps, create a complete plan with `update_plan` before deep work begins. Keep exactly one step in progress at a time and update the plan at phase transitions. |
| 3. Research the task before writing code: |
|    - Use a research sub-agent for broad or novel research when the active Codex runtime explicitly allows delegation; otherwise run the same focused probes directly. |
|    - Mirror upstream `research` behavior: keep research read-only, papers-first, and isolated from implementation as much as Codex allows. |
|    - If the task is paper-backed or novel, search for landmark and recent papers first. |
|    - Trace citation graphs or related-paper recommendations for old anchors and fast-moving methods. |
|    - Search and inspect likely HF datasets, even for plan-only tasks. |
|    - Read HF docs for current API patterns. |
|    - Find and read a working GitHub implementation example. |
|    - Use web sources only when the answer depends on current information outside HF and GitHub. |
|    - If the user only wants a plan, complete the research steps above, return the plan, and stop. Do not implement. |
| 4. Validate inputs: |
|    - Inspect the dataset schema, splits, and sample rows. |
|    - Verify that the model repo exists, its architecture matches the task, and a tokenizer is available. |
| 5. Implement the smallest working version. |
| 6. Smoke test locally or in a small HF Job. |
| 7. Run the full training/evaluation job with HF Jobs. |
| 8. Evaluate results against the target. |
| 9. Save code, configs, and reports; publish ML artifacts to Hugging Face. |
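
The plan rule in step 2 (exactly one step in progress at a time) can be checked mechanically. The sketch below is a minimal illustration, not the `update_plan` implementation: the `step` and `status` field names, the three status values, and the helpers `check_plan` and `advance` are all assumptions made for this example. It also assumes that a fully completed plan has no step in progress.

```python
from typing import Dict, List

# Assumed status vocabulary; the real update_plan payload may differ.
VALID_STATUSES = {"pending", "in_progress", "completed"}


def check_plan(steps: List[Dict[str, str]]) -> None:
    """Raise ValueError if the plan breaks the one-step-in-progress rule."""
    for item in steps:
        if item["status"] not in VALID_STATUSES:
            raise ValueError(f"unknown status: {item['status']!r}")
    in_progress = [s["step"] for s in steps if s["status"] == "in_progress"]
    all_done = all(s["status"] == "completed" for s in steps)
    # Assumption: once every step is completed, nothing is in progress.
    if not all_done and len(in_progress) != 1:
        raise ValueError(
            f"expected exactly one step in progress, found {len(in_progress)}"
        )


def advance(steps: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return a copy with the current step completed and the next one started."""
    updated = [dict(s) for s in steps]  # copy so the caller's plan is untouched
    for item in updated:
        if item["status"] == "in_progress":
            item["status"] = "completed"
            break
    for item in updated:
        if item["status"] == "pending":
            item["status"] = "in_progress"
            break
    check_plan(updated)
    return updated


plan = [
    {"step": "Research the task", "status": "in_progress"},
    {"step": "Validate inputs", "status": "pending"},
    {"step": "Implement the smallest working version", "status": "pending"},
]
check_plan(plan)
next_plan = advance(plan)  # a phase transition: research done, validation starts
```

Calling `advance` at each phase transition keeps the invariant without manual bookkeeping; a plan with zero or two active steps fails fast in `check_plan`.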
|
|
| ## Output |
|
|
| Return: |
| - Deliverable status (complete / partial / failed). |
| - Evidence checked: papers, datasets, docs, GitHub examples, and external sources. |
| - GitHub branch, commit, PR, or report path for code. |
| - Hugging Face model/dataset/Space URLs for published artifacts. |
| - Job ID and log URL for HF Jobs runs. |
| - Metrics and evaluation results when available. |
| - Known failures, compromises, and next recommended steps. |
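
One way to keep the returned fields consistent across runs is to collect them in a small record. The sketch below is only an illustration: `RunReport` and all of its field names are hypothetical and chosen to mirror the list above, not a format this command is known to emit.

```python
from dataclasses import asdict, dataclass, field
from typing import Dict, List, Optional

ALLOWED_STATUSES = {"complete", "partial", "failed"}


@dataclass
class RunReport:
    """Hypothetical container for the items listed under Output."""

    status: str                                              # complete / partial / failed
    evidence: List[str] = field(default_factory=list)        # papers, datasets, docs, examples
    code_location: Optional[str] = None                      # branch, commit, PR, or report path
    artifact_urls: List[str] = field(default_factory=list)   # model/dataset/Space URLs
    job_id: Optional[str] = None                             # HF Jobs run, if any
    job_log_url: Optional[str] = None
    metrics: Dict[str, float] = field(default_factory=dict)  # only when available
    known_issues: List[str] = field(default_factory=list)    # failures and compromises
    next_steps: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Reject anything outside the three documented statuses.
        if self.status not in ALLOWED_STATUSES:
            raise ValueError(f"status must be one of {sorted(ALLOWED_STATUSES)}")


report = RunReport(
    status="partial",
    known_issues=["smoke test passed; full run not started"],
    next_steps=["launch the full training job"],
)
report_dict = asdict(report)  # plain dict, ready to serialize
```

Optional fields default to empty rather than invented values, so a report for a failed or plan-only run stays honest about what was not produced.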
|
|
| ## Guardrails |
|
|
| - Never silently substitute a dataset, model, or training method. Ask for approval if the original request cannot be carried out as specified. |
| - For multi-step tasks, update the plan at the start, at every phase change, and at completion. |
| - Always set realistic timeouts for HF Jobs (at least 2 hours for real training). |
| - Always include `push_to_hub=True` and `hub_model_id` in training configs. |
| - Run a single job before launching sweeps or ablations. |
| - For OOM errors: reduce batch size and increase gradient accumulation, enable gradient checkpointing, or upgrade hardware. Do not change the requested method. |
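
The OOM guardrail can be sketched as a config adjustment that leaves the effective batch size, and therefore the requested method, unchanged. The key names below match fields on `transformers.TrainingArguments`; the dictionary itself, the `shrink_for_oom` helper, and the placeholder `hub_model_id` value are assumptions made for this illustration.

```python
from typing import Any, Dict


def shrink_for_oom(config: Dict[str, Any]) -> Dict[str, Any]:
    """Return a lower-memory copy of a training config after an OOM error."""
    updated = dict(config)  # never mutate the caller's config
    batch = config["per_device_train_batch_size"]
    accum = config.get("gradient_accumulation_steps", 1)
    if batch % 2 == 0:
        # Halve the batch and double accumulation: batch * accum stays constant.
        updated["per_device_train_batch_size"] = batch // 2
        updated["gradient_accumulation_steps"] = accum * 2
    elif not config.get("gradient_checkpointing", False):
        # Odd batch size (including 1) cannot be halved exactly,
        # so trade extra compute for memory instead.
        updated["gradient_checkpointing"] = True
    else:
        # Nothing left to adjust in software without changing the method.
        raise RuntimeError("out of options: upgrade hardware")
    return updated


base = {
    "per_device_train_batch_size": 8,
    "gradient_accumulation_steps": 2,
    "gradient_checkpointing": False,
    "push_to_hub": True,                        # required by the guardrails
    "hub_model_id": "your-username/your-model",  # placeholder, not a real repo
}
retry = shrink_for_oom(base)
```

Here `retry` uses a per-device batch of 4 with 4 accumulation steps, so each optimizer step still sees 16 samples per device, and the Hub settings carry over untouched.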
|
|