
/mlintern:run

Run an ML Intern task end-to-end.

Arguments

  • prompt (required): A one-sentence description of the ML deliverable. Examples: "fine-tune Qwen3-4B for code completion on python-code-dataset", "benchmark sentence-transformers/all-MiniLM-L6-v2 on STS-B", "train a diffusion LoRA on my art dataset".
  • --model (optional): LiteLLM model ID to use (e.g., huggingface/openai/gpt-oss-120b). Defaults to the model configured in the environment.
  • --background (optional): Queue the task and return immediately. Check on it later with --status.
  • --status <job-id> (optional): Check status of a background job.
  • --result <job-id> (optional): Fetch the final report of a completed background job.
  • --cancel <job-id> (optional): Cancel a running background job.
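
The argument list above can be sketched as an argparse parser. This is only an illustration, not the command's actual implementation: the program name mlintern-run, the helper names, the rule that the prompt may be omitted when --status, --result, or --cancel is given, and the mutual exclusivity of those three flags are all assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # "mlintern-run" is a hypothetical program name used for illustration.
    parser = argparse.ArgumentParser(prog="mlintern-run")
    parser.add_argument("prompt", nargs="?",
                        help="one-sentence description of the ML deliverable")
    parser.add_argument("--model", default=None,
                        help="LiteLLM model ID; None means use the environment default")
    parser.add_argument("--background", action="store_true",
                        help="queue the task and return immediately")
    # Assumption: the three job-management flags are mutually exclusive.
    job = parser.add_mutually_exclusive_group()
    job.add_argument("--status", metavar="JOB_ID")
    job.add_argument("--result", metavar="JOB_ID")
    job.add_argument("--cancel", metavar="JOB_ID")
    return parser


def parse_args(argv):
    parser = build_parser()
    args = parser.parse_args(argv)
    # Assumption: a prompt is only needed when no job-management flag is used.
    if not (args.status or args.result or args.cancel) and not args.prompt:
        parser.error("prompt is required unless --status, --result, or --cancel is given")
    return args
```

For example, parse_args(["train a diffusion LoRA on my art dataset", "--background"]) yields a namespace with background set to True and model left as None.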

Workflow

  1. Clarify the deliverable from the prompt.
  2. If the task has 3 or more meaningful steps, create a full plan with update_plan before deep work begins. Keep exactly one step in progress at a time and update the plan at phase transitions.
  3. Research the task before writing code:
    • Use a research sub-agent for broad or novel research when the active Codex runtime explicitly allows delegation; otherwise run the same focused probes directly.
    • Mirror upstream research behavior: keep research read-only, papers-first, and isolated from implementation as much as Codex allows.
    • If the task is paper-backed or novel, search for landmark and recent papers first.
    • Trace citation graphs or related-paper recommendations for old anchors and fast-moving methods.
    • Search and inspect likely HF datasets, even for plan-only tasks.
    • Read HF docs for current API patterns.
    • Find and read a working GitHub implementation example.
    • Use web sources only when the answer depends on current information outside HF and GitHub.
    • If the user only wants a plan, stop after the full research floor and return the plan. Do not implement.
  4. Validate inputs:
    • Inspect the dataset schema, splits, and sample rows.
    • Verify that the model repo exists, the architecture matches the task, and a tokenizer is available.
  5. Implement the smallest working version.
  6. Smoke test locally or in a small HF Job.
  7. Run the full training/evaluation job with HF Jobs.
  8. Evaluate results against the target.
  9. Save code, configs, and reports; publish ML artifacts to Hugging Face.
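
The plan rule in step 2 (exactly one step in progress at a time, updated at phase transitions) can be modelled with a small tracker. This is a minimal sketch under stated assumptions: the Plan class, its method names, and the status strings are hypothetical and do not represent the update_plan tool's real interface.

```python
class Plan:
    """Hypothetical plan tracker; not the real update_plan interface."""

    def __init__(self, steps):
        if not steps:
            raise ValueError("a plan needs at least one step")
        self.steps = list(steps)
        # Index of the step currently in progress; equals len(steps) when done.
        self.current = 0

    def statuses(self):
        # Steps before the cursor are done, the cursor is active, the rest wait.
        result = []
        for i in range(len(self.steps)):
            if i < self.current:
                result.append("completed")
            elif i == self.current:
                result.append("in_progress")
            else:
                result.append("pending")
        return result

    def advance(self):
        # Call at each phase transition: finish the active step, start the next.
        if self.current >= len(self.steps):
            raise RuntimeError("plan is already complete")
        self.current += 1
        return self.statuses()
```

Because the active step is stored as a single index, the tracker cannot represent two steps in progress at once, which is the invariant the workflow asks for.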

Output

Return:

  • Deliverable status (complete / partial / failed).
  • Evidence checked: papers, datasets, docs, GitHub examples, and external sources.
  • GitHub branch, commit, PR, or report path for code.
  • Hugging Face model/dataset/Space URLs for published artifacts.
  • Job ID and log URL for HF Jobs runs.
  • Metrics and evaluation results when available.
  • Known failures, compromises, and next recommended steps.
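
One way to keep the returned fields consistent is to collect them in a single record before rendering the report. The sketch below is illustrative only: the RunReport class, its field names, and the text layout are assumptions, and only the three status values come from the list above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

VALID_STATUSES = ("complete", "partial", "failed")


@dataclass
class RunReport:
    """Hypothetical container for the fields listed under Output."""
    status: str
    evidence: List[str] = field(default_factory=list)
    code_ref: Optional[str] = None       # branch, commit, PR, or report path
    artifact_urls: List[str] = field(default_factory=list)
    job_id: Optional[str] = None
    job_log_url: Optional[str] = None
    metrics: Dict[str, float] = field(default_factory=dict)
    known_issues: List[str] = field(default_factory=list)
    next_steps: List[str] = field(default_factory=list)

    def __post_init__(self):
        if self.status not in VALID_STATUSES:
            raise ValueError(f"status must be one of {VALID_STATUSES}")

    def to_text(self) -> str:
        # Render only the fields that were actually filled in.
        lines = [f"Status: {self.status}"]
        if self.evidence:
            lines.append("Evidence: " + "; ".join(self.evidence))
        if self.code_ref:
            lines.append(f"Code: {self.code_ref}")
        if self.artifact_urls:
            lines.append("Artifacts: " + ", ".join(self.artifact_urls))
        if self.job_id:
            lines.append(f"Job: {self.job_id} (logs: {self.job_log_url or 'n/a'})")
        for name, value in self.metrics.items():
            lines.append(f"Metric {name}: {value}")
        if self.known_issues:
            lines.append("Known issues: " + "; ".join(self.known_issues))
        if self.next_steps:
            lines.append("Next steps: " + "; ".join(self.next_steps))
        return "\n".join(lines)
```

Rejecting unknown status values at construction time keeps a report from being emitted with an ambiguous outcome.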

Guardrails

  • Never silently substitute a dataset, model, or training method. Ask for approval if the original request is incompatible.
  • For multi-step tasks, do not skip plan updates at start, phase change, or completion.
  • Always set realistic timeouts for HF Jobs (at least 2 hours for real training).
  • Always include push_to_hub=True and hub_model_id in training configs.
  • Run one job first before launching sweeps or ablations.
  • For OOM errors: reduce batch size and increase gradient accumulation, enable gradient checkpointing, or upgrade hardware. Do not change the requested method.
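
Two of the guardrails above lend themselves to mechanical checks. The following is a rough sketch, assuming the training config is a plain dict that uses the push_to_hub and hub_model_id keys named above; the function names are hypothetical. The OOM helper shrinks the per-device batch size to a proper divisor and scales gradient accumulation by the same factor, so the effective batch size (and therefore the requested method) is unchanged.

```python
def check_hub_settings(config):
    """Return a list of guardrail violations for a training config dict."""
    problems = []
    if config.get("push_to_hub") is not True:
        problems.append("push_to_hub must be True")
    if not config.get("hub_model_id"):
        problems.append("hub_model_id must be set")
    return problems


def shrink_for_oom(per_device_batch_size, gradient_accumulation_steps):
    """Reduce batch size after an OOM while keeping the effective batch size."""
    if per_device_batch_size < 1 or gradient_accumulation_steps < 1:
        raise ValueError("batch size and accumulation steps must be positive")
    if per_device_batch_size == 1:
        # Nothing left to shrink: fall back to the other options in the guardrail.
        raise ValueError("batch size is already 1; enable gradient checkpointing "
                         "or upgrade hardware")
    # Largest divisor that is at most half of the current batch size.
    new_batch = next(d for d in range(per_device_batch_size // 2, 0, -1)
                     if per_device_batch_size % d == 0)
    factor = per_device_batch_size // new_batch
    return new_batch, gradient_accumulation_steps * factor
```

For instance, shrink_for_oom(8, 2) returns (4, 4): the per-device batch halves and accumulation doubles, so each optimizer step still sees 16 samples per device.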