update

Files changed (18)

marketplace.json → .agents/plugins/marketplace.json +2 -2
README.md +126 -25
plugins/{mlintern → ml-intern}/.codex-plugin/plugin.json +5 -5
plugins/ml-intern/agents/openai.yaml +4 -0
plugins/{mlintern → ml-intern}/commands/run.md +18 -9
plugins/ml-intern/skills/github-example-search/SKILL.md +99 -0
plugins/{mlintern → ml-intern}/skills/hf-dataset-search/SKILL.md +5 -2
plugins/{mlintern → ml-intern}/skills/hf-docs/SKILL.md +5 -3
plugins/{mlintern → ml-intern}/skills/hf-jobs/SKILL.md +0 -0
plugins/{mlintern → ml-intern}/skills/hf-model-search/SKILL.md +0 -0
plugins/{mlintern → ml-intern}/skills/hf-paper-search/SKILL.md +15 -7
plugins/{mlintern → ml-intern}/skills/ml-intern-harness/SKILL.md +119 -2
plugins/{mlintern → ml-intern}/skills/ml-intern-harness/scripts/inspect_dataset.py +0 -0
plugins/{mlintern → ml-intern}/skills/ml-intern-harness/scripts/papers.py +0 -0
plugins/{mlintern → ml-intern}/skills/ml-intern-harness/scripts/preflight_check.py +0 -0
plugins/ml-intern/skills/ml-intern/SKILL.md +41 -0
plugins/ml-intern/skills/web-search/SKILL.md +73 -0
plugins/mlintern/agents/openai.yaml +0 -4

marketplace.json → .agents/plugins/marketplace.json RENAMED

@@ -5,10 +5,10 @@
   },
   "plugins": [
     {
-      "name": "mlintern",
       "source": {
         "source": "local",
-        "path": "./plugins/mlintern"
       },
       "policy": {
         "installation": "AVAILABLE",

   },
   "plugins": [
     {
+      "name": "ml-intern",
       "source": {
         "source": "local",
+        "path": "./plugins/ml-intern"
       },
       "policy": {
         "installation": "AVAILABLE",

README.md CHANGED

@@ -4,7 +4,7 @@ tags:
 ---
 # ML Intern Plugin for OpenAI Codex
-Hugging Face ML Intern reimagined as an OpenAI Codex plugin. Research papers, inspect datasets and models, run training and evaluation on Hugging Face Jobs, and ship ML artifacts — all inside Codex.
 ## What This Is
@@ -18,41 +18,68 @@ The original `mlintern-plugin` wraps the `ml-intern` CLI binary inside Claude Co
 ## Plugin Structure
-```
-plugins/mlintern/
-├── .codex-plugin/
-│   └── plugin.json          ← Plugin manifest
-├── agents/
-│   └── openai.yaml          ← UI metadata for Codex
-├── commands/
-│   └── run.md               ← /mlintern:run command definition
-├── skills/
-│   ├── ml-intern-harness/   ← Core ML Intern behavior (research, validate, implement, ship)
-│   ├── hf-model-search/     ← Model discovery and validation
-│   ├── hf-dataset-search/   ← Dataset discovery and schema inspection
-│   ├── hf-paper-search/     ← Paper research (search, read, citations)
-│   ├── hf-docs/             ← HF library documentation lookup
-│   └── hf-jobs/             ← Hugging Face cloud job submission and monitoring
-└── assets/                    ← Icon, logo, screenshots
-```
 ## Installation
-### Method 1: Local Install (Recommended for Development)
-Clone this repo and link it into your Codex plugins directory:
 ```bash
 git clone https://github.com/razvan/ml-intern-codex-plugin.git
 cd ml-intern-codex-plugin
-# Link or copy to Codex plugins directory
-mkdir -p ~/.codex/plugins/
-ln -s $(pwd)/plugins/mlintern ~/.codex/plugins/mlintern
 ```
-### Method 2: Marketplace (Future)
-Once OpenAI launches a public plugin marketplace, this plugin can be registered via `marketplace.json`.
 ## Dependencies
@@ -74,6 +101,18 @@ Or use the skill name in your prompt:
 Use ml-intern-harness to research DPO training recipes, find a suitable dataset, and implement a training script.
 ```
 ## Skills Reference
 | Skill | Purpose | Key Tools |
@@ -83,6 +122,8 @@ Use ml-intern-harness to research DPO training recipes, find a suitable dataset,
 | `hf-dataset-search` | Find and validate datasets | `_dataset_search`, `_hub_repo_details`, `inspect_dataset.py` |
 | `hf-paper-search` | Research papers and extract recipes | `_paper_search`, `papers.py` (details, citations, resources) |
 | `hf-docs` | Look up current HF library APIs | `_hf_doc_search`, `_hf_doc_fetch` |
 | `hf-jobs` | Submit and monitor cloud jobs | `_hf_jobs` (run, uv, ps, logs, inspect, cancel) |
 ## Comparison to Original
@@ -98,6 +139,66 @@ Use ml-intern-harness to research DPO training recipes, find a suitable dataset,
 | Job Submission | Built into `ml-intern` binary | `_hf_jobs` via Hugging Face Codex plugin |
 | Sandbox | HF Space sandboxes | Codex local shell + `_hf_jobs` |
 ## Why This Exists
 The original `huggingface/mlintern-plugin` is **Claude Code only** — it's a companion script that spawns the `ml-intern` CLI inside Claude Code sessions. There is no equivalent for Codex. The `huggingface/skills` repo provides general HF skills but not the full ML Intern harness. This plugin bridges the gap.

 ---
 # ML Intern Plugin for OpenAI Codex
+Hugging Face ML Intern reimagined as an OpenAI Codex plugin. Research papers, inspect datasets and models, plan and evaluate AI/RAG systems, run training and evaluation on Hugging Face Jobs, and ship ML artifacts — all inside Codex.
 ## What This Is
 ## Plugin Structure
+- `./.agents/plugins/marketplace.json` - Repo marketplace for Codex
+- `./plugins/ml-intern/.codex-plugin/plugin.json` - Plugin manifest
+- `./plugins/ml-intern/agents/openai.yaml` - UI metadata for Codex
+- `./plugins/ml-intern/commands/run.md` - `/mlintern:run` command definition
+- `./plugins/ml-intern/skills/ml-intern-harness/` - Core ML Intern behavior
+- `./plugins/ml-intern/skills/hf-model-search/` - Model discovery and validation
+- `./plugins/ml-intern/skills/hf-dataset-search/` - Dataset discovery and schema inspection
+- `./plugins/ml-intern/skills/hf-paper-search/` - Paper research, reading, citations
+- `./plugins/ml-intern/skills/hf-docs/` - Hugging Face library documentation lookup
+- `./plugins/ml-intern/skills/github-example-search/` - GitHub example-file discovery
+- `./plugins/ml-intern/skills/web-search/` - Current web search with source filtering
+- `./plugins/ml-intern/skills/hf-jobs/` - Hugging Face cloud job submission and monitoring
 ## Installation
+This plugin is hosted on GitHub, and the easiest local install is to clone the repo and let Codex load the plugin from the local checkout.
+### Method 1: Clean Codex UI Install
+This is the recommended path if you want a clean install inside the Codex UI without copying files into a global plugin directory.
+1. Clone this repo somewhere local that Codex can read:
+```bash
+git clone https://github.com/razvan/ml-intern-codex-plugin.git
+cd ml-intern-codex-plugin
+```
+2. Restart Codex so it reloads the repo marketplace.
+3. Use the repository marketplace entry in this repo. The marketplace file is:
+`./.agents/plugins/marketplace.json`
+4. The marketplace points Codex at the plugin bundle here:
+`./plugins/ml-intern`
+After Codex reloads, the plugin should appear in the Codex UI as **ML Intern for Codex**.
+### Method 2: Manual Local Install
+If you want to install it into your local Codex plugin directory manually:
+1. Clone the repository locally:
 ```bash
 git clone https://github.com/razvan/ml-intern-codex-plugin.git
 cd ml-intern-codex-plugin
 ```
+2. Copy the plugin bundle into your Codex plugins directory:
+```bash
+cp -R plugins/ml-intern ~/.codex/plugins/ml-intern
+```
+3. Make sure Codex can see the plugin via a local marketplace entry or by reloading the Codex UI, depending on how your Codex setup is configured.
+4. Restart Codex.
+5. Look for **ML Intern for Codex** in the plugin list.
 ## Dependencies
 Use ml-intern-harness to research DPO training recipes, find a suitable dataset, and implement a training script.
 ```
+The bundled skills are the main entry points:
+- `ml-intern-harness` for the end-to-end research, validation, implementation, and job loop, including plan-only AI/RAG/search/QA system design
+- `ml-intern` as the short alias for the main plugin workflow
+- `hf-model-search` for model discovery and validation
+- `hf-dataset-search` for dataset discovery and schema inspection
+- `hf-paper-search` for paper research and recipe extraction
+- `hf-docs` for current Hugging Face library docs
+- `github-example-search` for finding working example files in GitHub repos
+- `web-search` for current web sources and general research outside the Hugging Face Hub
+- `hf-jobs` for Hugging Face job submission and monitoring
 ## Skills Reference
 | Skill | Purpose | Key Tools |
 | `hf-dataset-search` | Find and validate datasets | `_dataset_search`, `_hub_repo_details`, `inspect_dataset.py` |
 | `hf-paper-search` | Research papers and extract recipes | `_paper_search`, `papers.py` (details, citations, resources) |
 | `hf-docs` | Look up current HF library APIs | `_hf_doc_search`, `_hf_doc_fetch` |
+| `github-example-search` | Find working example files in GitHub repos | GitHub repo/file search plus `fetch_file` |
+| `web-search` | Find current web sources with filters | Codex web browsing/search tools and citation links |
 | `hf-jobs` | Submit and monitor cloud jobs | `_hf_jobs` (run, uv, ps, logs, inspect, cancel) |
 ## Comparison to Original
 | Job Submission | Built into `ml-intern` binary | `_hf_jobs` via Hugging Face Codex plugin |
 | Sandbox | HF Space sandboxes | Codex local shell + `_hf_jobs` |
+## Parity Status
+This plugin intentionally mirrors the parts of ML Intern that matter most for HF ML work:
+| Area | Status | Notes |
+|---|---|---|
+| Paper discovery | Done | HF paper search plus the deeper `papers.py` research flow is available. |
+| Paper reading | Done | Section reading, citations, recommendations, and linked resources are implemented. |
+| Dataset validation | Done | Schema, splits, sample rows, parquet availability, and compatibility notes are covered. |
+| HF docs lookup | Done | Search plus fetch are available for current HF library guidance. |
+| End-to-end ML workflow | Done | The harness pushes research-first, validate-first behavior. |
+| Generic web search | Partial | Best-effort Codex `web-search` guidance exists, but it is not the exact ML Intern DuckDuckGo wrapper. |
+For the closest ML Intern feel, use the paper and dataset skills first, then docs, then the harness workflow.
+## Behavioral Contract
+When ML Intern is invoked directly, the plugin should behave like a research harness rather than a generic assistant:
+- Route through `ml-intern` -> `ml-intern-harness` for non-trivial AI/ML/RAG/search/evaluation tasks, even when they are not purely Hugging Face tasks.
+- Use plan tracking at the beginning, each phase transition, and completion when Codex exposes a planning tool.
+- Split research into explicit tracks before synthesis, such as platform constraints, technical approaches, and evaluation methodology.
+- Use `hf-paper-search` for literature, benchmarks, and evaluation methods.
+- Use `web-search` for current platform/API constraints, official docs, policies, pricing, rate limits, SDK behavior, and other non-HF facts.
+- Cite important architecture and evaluation decisions with papers, official docs, or primary sources.
+- If the user asks for "plan only", stop after research and do not write implementation code or scaffold files.
+- If a write, shell, network, or sandbox step fails, fail forward with an inline deliverable when possible.
+### Codex Compatibility Layer
+This plugin cannot inject upstream Python tools into Codex directly, so it should reproduce the behavior of the upstream tools using Codex-native primitives:
+| Upstream tool | Codex compatibility layer | Required behavior to preserve |
+|---|---|---|
+| `plan_tool` | `update_plan` | Full-plan replacement, one `in_progress` item, updates at start/phase transition/completion |
+| `research` | delegated sub-agent research when explicitly allowed, otherwise focused sequential research | Separate research context when possible, read-only scope, papers-first workflow, compact evidence-backed summary |
+Faithful `plan_tool` semantics to preserve:
+- Use it for tasks with 3 or more meaningful steps.
+- Replace the whole visible plan each update.
+- Keep exactly one item in progress.
+- Mark completed only after full success.
+Faithful `research` semantics to preserve:
+- Main context stays focused on synthesis and decisions.
+- Research uses papers, citation graphs, dataset inspection, docs, GitHub examples, and web search.
+- Research returns compact findings with concrete references and recipe-level claims.
+- If separate delegation is unavailable, preserve the same research floor directly rather than skipping it.
+Example plan-only trigger:
+```text
+[@ml-intern](plugin://ml-intern@ml-intern-codex)
+i want to query generic Discord servers in natural language.
+first figure out constraints and challenges, then research how to build and test quality.
+i'm only interested in the plan for now.
+```
+Expected behavior: track a plan, use `web-search` for Discord API constraints, use `hf-paper-search` for RAG/forum/social QA and evaluation research, synthesize a cited build-and-test plan, and avoid implementation.
 ## Why This Exists
 The original `huggingface/mlintern-plugin` is **Claude Code only** — it's a companion script that spawns the `ml-intern` CLI inside Claude Code sessions. There is no equivalent for Codex. The `huggingface/skills` repo provides general HF skills but not the full ML Intern harness. This plugin bridges the gap.

plugins/{mlintern → ml-intern}/.codex-plugin/plugin.json RENAMED

@@ -1,7 +1,7 @@
 {
   "name": "ml-intern",
-  "version": "0.1.0",
-  "description": "Hugging Face ML Intern for Codex — research ML papers, inspect models and datasets, run training and evaluation jobs, and ship ML artifacts.",
   "author": {
     "name": "Hugging Face",
     "email": "agents@huggingface.co",
@@ -15,15 +15,15 @@
   "interface": {
     "displayName": "ML Intern",
     "shortDescription": "Hugging Face ML engineering agent for Codex",
-    "longDescription": "ML Intern is an autonomous ML engineering agent for the Hugging Face ecosystem. It researches papers, validates datasets and models, writes training/evaluation code, runs HF Jobs, and ships ML artifacts with zero avoidable errors.",
     "developerName": "Hugging Face",
     "category": "Coding",
     "capabilities": ["Interactive", "Read", "Write"],
     "websiteURL": "https://huggingface.co",
     "defaultPrompt": [
       "Fine-tune a language model on a custom dataset using Hugging Face Jobs",
-      "Find the best open-source embedding model and benchmark it",
-      "Research papers on preference optimization and implement DPO training"
     ],
     "brandColor": "#FF6B00"
   }

 {
   "name": "ml-intern",
+  "version": "0.1.3",
+  "description": "Hugging Face ML Intern for Codex — research ML papers first, inspect models and datasets, run training and evaluation jobs, and ship ML artifacts.",
   "author": {
     "name": "Hugging Face",
     "email": "agents@huggingface.co",
   "interface": {
     "displayName": "ML Intern",
     "shortDescription": "Hugging Face ML engineering agent for Codex",
+    "longDescription": "ML Intern is an autonomous ML engineering agent for the Hugging Face ecosystem. It follows a strict research-first workflow: clarify the deliverable, search papers first for novel or paper-backed tasks, trace citations, inspect likely datasets, read current HF docs and GitHub examples, use web sources only when necessary, validate datasets and models, and only then write code, run HF Jobs, and ship ML artifacts.",
     "developerName": "Hugging Face",
     "category": "Coding",
     "capabilities": ["Interactive", "Read", "Write"],
     "websiteURL": "https://huggingface.co",
     "defaultPrompt": [
+      "Research a paper-backed ML task and return a plan only",
       "Fine-tune a language model on a custom dataset using Hugging Face Jobs",
+      "Find the best open-source embedding model and benchmark it"
     ],
     "brandColor": "#FF6B00"
   }

plugins/ml-intern/agents/openai.yaml ADDED

@@ -0,0 +1,4 @@
+interface:
+  display_name: "ML Intern"
+  short_description: "Hugging Face ML engineering agent"
+  default_prompt: "Act as an ML engineering intern with a strict research-first workflow. Clarify the deliverable, search papers first for paper-backed or novel tasks, trace citations when useful, validate datasets and models, read current HF docs and GitHub examples, use web sources only when current external facts are needed, and if the user only wants a plan, stop after the full research floor and return the plan with evidence checked."

plugins/{mlintern → ml-intern}/commands/run.md RENAMED

@@ -14,23 +14,31 @@ Run an ML Intern task end-to-end.
 ## Workflow
 1. Clarify the deliverable from the prompt.
-2. Research the task before writing code:
-   - Search for landmark and recent papers if the task is novel.
    - Read HF docs for current API patterns.
-   - Find a working implementation example.
-3. Validate inputs:
    - Inspect dataset schema, splits, sample rows.
    - Verify model repo exists, architecture matches, tokenizer available.
-4. Implement the smallest working version.
-5. Smoke test locally or in a small HF Job.
-6. Run the full training/evaluation job with HF Jobs.
-7. Evaluate results against the target.
-8. Save code, configs, and reports; publish ML artifacts to Hugging Face.
 ## Output
 Return:
 - Deliverable status (complete / partial / failed).
 - GitHub branch, commit, PR, or report path for code.
 - Hugging Face model/dataset/Space URLs for published artifacts.
 - Job ID and log URL for HF Jobs runs.
@@ -40,6 +48,7 @@ Return:
 ## Guardrails
 - Never silently substitute a dataset, model, or training method. Ask for approval if the original request is incompatible.
 - Always set realistic timeouts for HF Jobs (at least 2 hours for real training).
 - Always include `push_to_hub=True` and `hub_model_id` in training configs.
 - Run one job first before launching sweeps or ablations.

 ## Workflow
 1. Clarify the deliverable from the prompt.
+2. If the task has 3 or more meaningful steps, create a full `update_plan` plan before deep work begins. Keep exactly one step in progress at a time and update it at phase transitions.
+3. Research the task before writing code:
+   - Use a research sub-agent for broad or novel research when the active Codex runtime explicitly allows delegation; otherwise run the same focused probes directly.
+   - Mirror upstream `research` behavior: keep research read-only, papers-first, and isolated from implementation as much as Codex allows.
+   - If the task is paper-backed or novel, search for landmark and recent papers first.
+   - Trace citation graphs or related-paper recommendations for old anchors and fast-moving methods.
+   - Search and inspect likely HF datasets, even for plan-only tasks.
    - Read HF docs for current API patterns.
+   - Find and read a working GitHub implementation example.
+   - Use web sources only when the answer depends on current information outside HF and GitHub.
+   - If the user only wants a plan, stop after the full research floor and return the plan. Do not implement.
+4. Validate inputs:
    - Inspect dataset schema, splits, sample rows.
    - Verify model repo exists, architecture matches, tokenizer available.
+5. Implement the smallest working version.
+6. Smoke test locally or in a small HF Job.
+7. Run the full training/evaluation job with HF Jobs.
+8. Evaluate results against the target.
+9. Save code, configs, and reports; publish ML artifacts to Hugging Face.
 ## Output
 Return:
 - Deliverable status (complete / partial / failed).
+- Evidence checked: papers, datasets, docs, GitHub examples, and external sources.
 - GitHub branch, commit, PR, or report path for code.
 - Hugging Face model/dataset/Space URLs for published artifacts.
 - Job ID and log URL for HF Jobs runs.
 ## Guardrails
 - Never silently substitute a dataset, model, or training method. Ask for approval if the original request is incompatible.
+- For multi-step tasks, do not skip plan updates at start, phase change, or completion.
 - Always set realistic timeouts for HF Jobs (at least 2 hours for real training).
 - Always include `push_to_hub=True` and `hub_model_id` in training configs.
 - Run one job first before launching sweeps or ablations.
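
The plan-tracking rules in this workflow (use a plan for tasks with 3 or more steps, keep exactly one step in progress) can be checked mechanically. The following is a minimal Python sketch, not part of the commit. It assumes a hypothetical plan shape, a list of dicts with `step` and `status` keys and the status names `pending`, `in_progress`, and `completed`; the real `update_plan` payload in Codex may differ.

```python
from typing import Dict, List

# Hypothetical status vocabulary; only `in_progress` is named in the source.
ALLOWED_STATUSES = {"pending", "in_progress", "completed"}


def validate_plan(plan: List[Dict[str, str]]) -> List[str]:
    """Return a list of rule violations for a hypothetical plan payload."""
    problems = []
    if len(plan) < 3:
        problems.append("plan tracking is meant for tasks with 3 or more steps")
    for item in plan:
        if item.get("status") not in ALLOWED_STATUSES:
            problems.append(f"unknown status for step {item.get('step')!r}")
    in_progress = sum(1 for item in plan if item.get("status") == "in_progress")
    # A fully completed plan legitimately has no step in progress.
    all_done = bool(plan) and all(item.get("status") == "completed" for item in plan)
    if in_progress != 1 and not all_done:
        problems.append(f"expected exactly one in_progress step, found {in_progress}")
    return problems
```

An empty result means the plan satisfies the rules; each string in a non-empty result names one violated rule.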

plugins/ml-intern/skills/github-example-search/SKILL.md ADDED

@@ -0,0 +1,99 @@
+---
+name: github-example-search
+description: "Find working example files in GitHub repositories using path heuristics and GitHub file search, then read the best candidates."
+disable-model-invocation: false
+---
+# GitHub Example Search
+## Purpose
+Replicate the useful part of ML Intern's GitHub example discovery:
+find example scripts, tutorials, notebooks, and guides in a target repo, then read the best matches before implementing code.
+This skill is intentionally path-first. It is not a semantic code search engine.
+## Tools
+Use the GitHub plugin tools:
+- `search_repositories` or `search_installed_repositories_v2` to identify the repo if the user did not name it.
+- `search` to search files within the target repository.
+- `fetch_file` to read the candidate file contents.
+- `search_branches` only if you need to confirm branch names.
+## Search Strategy
+Use prioritized file-path patterns, roughly in this order:
+1. `scripts`
+2. `examples`, `example`
+3. `notebooks`, `notebook`
+4. `tutorials`, `tutorial`, `quickstart`, `walkthrough`, `walkthroughs`
+5. `cookbook`, `recipe`, `recipes`
+6. `demos`, `demo`, `samples`, `sample`
+7. `guides`, `guide`, `getting-started`, `getting_started`
+8. `playground`, `howto`, `how-to`
+9. `use-cases`, `usecases`, `use_cases`
+10. `sandbox`, `showcase`
+## Workflow
+1. Resolve the repository first.
+2. If the repository is ambiguous, search repositories by name or organization until you have a strong candidate.
+3. Search the repo with the highest-priority path patterns first.
+4. If the user gave a keyword, combine it with the pattern queries.
+5. Prefer files under `examples/` or `example/` when there is a tie.
+6. Prefer `scripts/` over other example-like directories.
+7. Prefer shallower paths when multiple matches are similar.
+8. Read the top candidate files with `fetch_file`, using line ranges for large files.
+9. Use the exact file path from the search result to continue the investigation.
+## Practical Query Plan
+When looking for a specific method or trainer, search in this order:
+1. `<keyword> scripts`
+2. `<keyword> examples`
+3. `<keyword> tutorial`
+4. `<keyword> notebook`
+5. `<keyword> guide`
+When no keyword is given, search for the directories themselves:
+1. `scripts`
+2. `examples`
+3. `example`
+4. `notebooks`
+5. `tutorials`
+## Reading Pattern
+After finding a candidate file:
+1. Read the file header and argument parsing.
+2. Read the model, dataset, and trainer setup.
+3. Read the training loop or main execution path.
+4. If the file is long, fetch only the relevant line range.
+## Output Expectations
+Return:
+- the best candidate file path
+- why it was selected
+- the repository it came from
+- the exact line range to read next if the file is large
+## Example
+```
+_search(query="grpo scripts", repository_name="huggingface/trl", topn=10)
+_fetch_file(repository_full_name="huggingface/trl", ref="main", path="examples/scripts/grpo.py", encoding="utf-8")
+```
+## Notes
+- This is the closest Codex-native replacement for ML Intern's `github_find_examples`.
+- It relies on GitHub repo/file search plus the same path-priority intuition as the upstream tool.
+- Once you have a candidate, always read the actual file before implementing.
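
The path-first ordering described in this skill (priority tiers from the Search Strategy list, then shallower paths first) can be illustrated with a small ranking function. This is a sketch only, not the upstream `github_find_examples` implementation and not part of the commit. The function names are hypothetical, and it assumes patterns match whole directory names rather than substrings.

```python
from pathlib import PurePosixPath
from typing import List, Tuple

# Priority tiers copied from the "Search Strategy" list; lower index wins.
PATTERN_TIERS = [
    {"scripts"},
    {"examples", "example"},
    {"notebooks", "notebook"},
    {"tutorials", "tutorial", "quickstart", "walkthrough", "walkthroughs"},
    {"cookbook", "recipe", "recipes"},
    {"demos", "demo", "samples", "sample"},
    {"guides", "guide", "getting-started", "getting_started"},
    {"playground", "howto", "how-to"},
    {"use-cases", "usecases", "use_cases"},
    {"sandbox", "showcase"},
]


def path_rank(path: str) -> Tuple[int, int]:
    """Return (tier, depth); smaller tuples sort first."""
    parts = PurePosixPath(path).parts
    dirs = {p.lower() for p in parts[:-1]}  # directory components only
    tier = len(PATTERN_TIERS)  # paths matching no pattern sort last
    for index, names in enumerate(PATTERN_TIERS):
        if dirs & names:
            tier = index
            break
    return tier, len(parts)


def rank_example_paths(paths: List[str]) -> List[str]:
    """Order candidate file paths by tier, then by depth (shallower first)."""
    return sorted(paths, key=path_rank)
```

Given the paths returned by a file search, `rank_example_paths` puts anything under a `scripts` directory first, so `examples/scripts/grpo.py` would outrank a guide or a library source file.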

plugins/{mlintern → ml-intern}/skills/hf-dataset-search/SKILL.md RENAMED

@@ -35,12 +35,14 @@ This queries the Hugging Face Dataset Viewer API for:
 1. Search for candidate datasets with `dataset_search`.
 2. Inspect metadata with `hub_repo_details` (set `repo_type="dataset"`).
-3. Run `inspect_dataset.py` for schema and sample row details.
 4. Verify training-method compatibility:
    - SFT: needs `messages`, `text`, or `prompt`/`completion`
    - DPO: needs `prompt`, `chosen`, `rejected`
    - GRPO: needs `prompt`
-5. Surface class imbalance, missing values, unexpected formats, or unsafe substitutions.
 ## Example
@@ -56,6 +58,7 @@ Before using a dataset:
 - [ ] Dataset is valid and has Dataset Viewer coverage.
 - [ ] Configs/splits match expectations.
 - [ ] Column names and types are compatible with the trainer.
 - [ ] Sample rows look reasonable.
 - [ ] Row count is sufficient for the task.
 - [ ] License and gating are acceptable.

 1. Search for candidate datasets with `dataset_search`.
 2. Inspect metadata with `hub_repo_details` (set `repo_type="dataset"`).
+3. Run `inspect_dataset.py` for schema, splits, parquet availability, and sample row details.
 4. Verify training-method compatibility:
    - SFT: needs `messages`, `text`, or `prompt`/`completion`
    - DPO: needs `prompt`, `chosen`, `rejected`
    - GRPO: needs `prompt`
+5. If the dataset has a `messages` column, inspect the sample rows for role order and tool-call structure.
+6. Surface class imbalance, missing values, unexpected formats, or unsafe substitutions.
+7. Do not start training until the dataset shape matches the intended trainer.
 ## Example
 - [ ] Dataset is valid and has Dataset Viewer coverage.
 - [ ] Configs/splits match expectations.
 - [ ] Column names and types are compatible with the trainer.
+- [ ] If `messages` exists, the chat format looks correct in sample rows.
 - [ ] Sample rows look reasonable.
 - [ ] Row count is sufficient for the task.
 - [ ] License and gating are acceptable.
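
The trainer-compatibility rule in step 4 (SFT, DPO, and GRPO column requirements) reduces to a set check on column names. The sketch below is illustrative and not part of the commit. `compatible_methods` is a hypothetical helper, and it checks names only, not column types or the chat format of `messages`.

```python
from typing import Iterable, List

# Column requirements as listed in the workflow. Each method maps to
# alternative column sets; any one fully present set is enough.
REQUIRED_COLUMNS = {
    "SFT": [{"messages"}, {"text"}, {"prompt", "completion"}],
    "DPO": [{"prompt", "chosen", "rejected"}],
    "GRPO": [{"prompt"}],
}


def compatible_methods(columns: Iterable[str]) -> List[str]:
    """Return the training methods whose required columns are all present."""
    present = set(columns)
    return [
        method
        for method, alternatives in REQUIRED_COLUMNS.items()
        if any(required <= present for required in alternatives)
    ]
```

A dataset with `prompt`, `chosen`, and `rejected` columns would pass for DPO and GRPO but not SFT. A passing name check is only a first filter; the sample rows still need inspection.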

plugins/{mlintern → ml-intern}/skills/hf-docs/SKILL.md RENAMED

@@ -32,10 +32,11 @@ Look up current API usage patterns, trainer configs, and library documentation b
 ## Workflow
-1. Use `hf_doc_search` to find relevant pages.
-2. Use `hf_doc_fetch` to read the full content of the most relevant page.
 3. Extract the exact imports, class names, config parameters, and argument names.
-4. Use the current API in your implementation.
 ## Example
@@ -50,3 +51,4 @@ Before writing any training, fine-tuning, inference, or evaluation code:
 - Find at least one current working implementation pattern from HF docs or a relevant repo.
 - Verify import paths, trainer class names, and config field names.
 - Check that the example matches your library version constraints.

 ## Workflow
+1. Use `hf_doc_search` to find the most relevant current page for the exact library or trainer.
+2. Use `hf_doc_fetch` to read the full content of the page before coding.
 3. Extract the exact imports, class names, config parameters, and argument names.
+4. Cross-check any example against the current library version you are targeting.
+5. Use the current API in your implementation, not memory or old snippets.
 ## Example
 - Find at least one current working implementation pattern from HF docs or a relevant repo.
 - Verify import paths, trainer class names, and config field names.
 - Check that the example matches your library version constraints.
+- If the API changed recently, prefer the docs page over older code examples.

plugins/{mlintern → ml-intern}/skills/hf-jobs/SKILL.md RENAMED

File without changes

plugins/{mlintern → ml-intern}/skills/hf-model-search/SKILL.md RENAMED

File without changes

plugins/{mlintern → ml-intern}/skills/hf-paper-search/SKILL.md RENAMED

@@ -37,13 +37,21 @@ python skills/ml-intern-harness/scripts/papers.py <operation> [args]
 ## Workflow
-1. Use `paper_search` for quick discovery or `papers.py search` for filtered Semantic Scholar search.
-2. Read methodology and results sections with `papers.py read_paper --arxiv-id <id> --section 3`.
-3. Trace citation graphs with `papers.py citation_graph --arxiv-id <id> --direction citations`.
-4. Extract concrete recipes: dataset, preprocessing, method, hyperparameters, model, metric, result.
-5. Find linked HF datasets/models/collections with `papers.py find_all_resources`.
-6. Inspect promising datasets with `inspect_dataset.py`.
-7. Read current HF docs and GitHub examples before implementing.
 ## Example

 ## Workflow
+1. Use `paper_search` for quick discovery or `papers.py search` when you need filters.
+2. Prefer landmark and recent papers together, not just the most popular result.
+3. Read the paper TOC first, then the methodology and results sections with `papers.py read_paper`.
+4. Trace downstream citations with `papers.py citation_graph --direction citations` before settling on a recipe.
+5. Extract concrete recipes: dataset, preprocessing, method, hyperparameters, model, metric, result.
+6. Use `papers.py find_all_resources` to discover linked HF datasets, models, and collections.
+7. Inspect promising datasets with `inspect_dataset.py` before implementing anything.
+8. Read current HF docs and GitHub examples before implementing the training or evaluation code.
+## What To Prioritize
+- Prefer sections that explain how the result was achieved, not just the abstract.
+- Prefer recent downstream work when the anchor paper is old.
+- Prefer recipe-level claims such as "dataset X + method Y + model Z produced metric M".
+- Treat linked datasets as candidates, not assumptions. Validate schema and sample rows first.
 ## Example

plugins/{mlintern → ml-intern}/skills/ml-intern-harness/SKILL.md RENAMED

@@ -17,8 +17,17 @@ This skill is for doing ML work end to end — not just advising. Research first
 For any non-trivial ML task, follow this loop:
 1. **Clarify**: One-sentence deliverable. Is it a model, a benchmark result, a dataset, a report?
-2. **Research**: Find at least one current working implementation pattern. For novel tasks, search landmark and recent papers; prefer methodology sections over abstracts. Extract recipes: dataset, model, method, hyperparameters, metrics.
-3. **Validate inputs**: Inspect dataset schema, splits, sample rows. Verify model repo exists, architecture matches, tokenizer available, license compatible.
 4. **Implement smallest working version**: Use current HF docs and a working example as reference. Do not rely on memory for imports, config names, or trainer arguments.
 5. **Smoke test**: Run locally or in a small HF Job before the full run.
 6. **Run full job**: Submit to HF Jobs with realistic timeout, `push_to_hub=True`, and monitoring.
@@ -26,6 +35,33 @@ For any non-trivial ML task, follow this loop:
 8. **Ship**: Save code and configs to the source repo. Publish model weights, datasets, or Spaces to Hugging Face. Return artifact URLs.
 9. **Iterate**: If results are weak or broken, diagnose and run the next experiment. Do not stop after a plan.
 ## High-Risk Mistakes To Avoid
 - Hallucinated imports or trainer arguments from outdated memory.
@@ -47,8 +83,85 @@ For paper-backed tasks:
 5. Find linked Hugging Face datasets, models, and collections.
 6. Inspect promising datasets before using them.
 7. Read current HF docs and GitHub examples before implementing.
 Use the `hf-paper-search` skill for paper operations.
 ## Dataset Audit Pattern
@@ -56,6 +169,7 @@ Before training or evaluating:
 - Verify repo, config, split, and revision.
 - Check row counts, column names, and representative rows.
 - Look for missing values, invalid records, class imbalance, or reward/preference balance.
 - Check text/message schema compatibility with the trainer:
   - SFT: needs `messages`, `text`, or `prompt`/`completion`
   - DPO: needs `prompt`, `chosen`, `rejected`
@@ -75,6 +189,8 @@ A training script should include:
 - Plain-text logging: `disable_tqdm=True`, `logging_strategy="steps"`, `logging_first_step=True`.
 - Evaluation or validation pass.
 - `push_to_hub=True` or explicit upload of final artifacts.
 For long runs, include Trackio or equivalent monitoring and return the dashboard URL.
@@ -102,6 +218,7 @@ When something fails:
 - OOM: reduce per-device batch size while increasing gradient accumulation to keep effective batch size; enable gradient checkpointing; or choose larger hardware. Do not switch methods.
 - Divergence/NaN: lower learning rate, check labels/rewards, inspect samples.
 - Weak metric: compare against paper recipes, inspect errors, tune with a small sweep.
 Do not hide compromises. If preserving the original request is impossible, explain the constraint and ask for approval.

 For any non-trivial ML task, follow this loop:
 1. **Clarify**: One-sentence deliverable. Is it a model, a benchmark result, a dataset, a report?
+2. **Research**: Use the strongest evidence first.
+   - For broad or novel research, delegate focused literature/code/dataset probes to a research sub-agent when the active Codex runtime explicitly allows delegation. If delegation is unavailable, do the same probes directly and say that no separate sub-agent was allowed.
+   - Start with landmark and recent papers for novel tasks.
+   - Read methodology, experiments, and results sections before relying on abstracts.
+   - Trace citations to find recent downstream improvements.
+   - Locate linked datasets, models, and collections.
+   - Read current HF docs and GitHub examples before implementing.
+   - Use current web sources only when the answer depends on information outside HF and GitHub.
+   - Extract recipes: dataset, model, method, hyperparameters, metrics.
+   - If the user only wants a plan, stop after the research floor below and synthesize the plan. Do not implement.
+3. **Validate inputs**: Inspect dataset schema, splits, sample rows, and parquet availability. Verify model repo exists, architecture matches, tokenizer available, license compatible.
 4. **Implement smallest working version**: Use current HF docs and a working example as reference. Do not rely on memory for imports, config names, or trainer arguments.
 5. **Smoke test**: Run locally or in a small HF Job before the full run.
 6. **Run full job**: Submit to HF Jobs with realistic timeout, `push_to_hub=True`, and monitoring.
 8. **Ship**: Save code and configs to the source repo. Publish model weights, datasets, or Spaces to Hugging Face. Return artifact URLs.
 9. **Iterate**: If results are weak or broken, diagnose and run the next experiment. Do not stop after a plan.
+## Upstream `plan_tool` Compatibility
+Upstream `huggingface/ml-intern` has a real `plan_tool` with strict semantics. In Codex, emulate it as closely as possible with `update_plan`.
+Rules to preserve:
+- Use plan tracking for tasks with 3 or more meaningful steps.
+- Start with a full plan before deep work begins.
+- Each update replaces the whole visible plan, not just one item.
+- Keep exactly one item `in_progress` at a time.
+- Mark items `completed` immediately after they fully succeed.
+- Do not mark an item completed if it failed, is partial, or is blocked.
+- Update at phase transitions, after major research tracks complete, and at final completion.
+Compatibility mapping:
+- Upstream `todos[].id` becomes a stable numeric prefix in the step text, for example `1. Research papers`.
+- Upstream `todos[].content` becomes the human-readable step text.
+- Upstream `todos[].status` maps directly to Codex `pending`, `in_progress`, `completed`.
+Preferred shape:
+1. Research constraints / external platform facts
+2. Research papers, datasets, and benchmarks
+3. Research current docs and working code examples
+4. Synthesize plan or implement
+5. Verify and report
+When the user only wants a plan, the final `update_plan` call should still mark the synthesis step completed before returning.
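The compatibility mapping above can be sketched in Python. This is a minimal illustration, not upstream code: the `id`, `content`, and `status` fields come from the mapping above, while the helper name `todos_to_plan` and the output shape (a list of `step`/`status` dicts) are assumptions about what an `update_plan` payload might look like. Allowing zero `in_progress` items once every step is completed is also an interpretation of the rules, not something the source states.

```python
ALLOWED_STATUSES = {"pending", "in_progress", "completed"}


def todos_to_plan(todos):
    """Build a full replacement plan from upstream-style todos.

    Hypothetical helper. The output shape (a list of step/status dicts)
    is an assumption, not a documented schema.
    """
    plan = []
    for todo in todos:
        status = todo["status"]
        if status not in ALLOWED_STATUSES:
            raise ValueError(f"unknown status: {status!r}")
        # The upstream id becomes a stable numeric prefix in the step text.
        plan.append({"step": f"{todo['id']}. {todo['content']}", "status": status})

    in_progress = sum(1 for item in plan if item["status"] == "in_progress")
    all_done = all(item["status"] == "completed" for item in plan)
    # Exactly one item is in progress, except when every item is completed.
    expected = 0 if all_done else 1
    if in_progress != expected:
        raise ValueError(f"expected {expected} in_progress item(s), found {in_progress}")
    return plan


todos = [
    {"id": 1, "content": "Research papers", "status": "completed"},
    {"id": 2, "content": "Synthesize plan", "status": "in_progress"},
    {"id": 3, "content": "Verify and report", "status": "pending"},
]
plan = todos_to_plan(todos)
print(plan[1])
```

Because the function always returns the whole list, each call mirrors the rule that an update replaces the full visible plan rather than a single item.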
 ## High-Risk Mistakes To Avoid
 - Hallucinated imports or trainer arguments from outdated memory.
 5. Find linked Hugging Face datasets, models, and collections.
 6. Inspect promising datasets before using them.
 7. Read current HF docs and GitHub examples before implementing.
+8. Use `github-example-search` when you need a working file path quickly, then read the file contents from GitHub.
+9. Use `web-search` when you need current information outside the Hugging Face Hub or GitHub.
+When choosing sources, prefer them in this order:
+1. Published papers and citation graphs.
+2. Linked datasets/models/collections from the paper.
+3. Current HF docs.
+4. Working GitHub examples.
+5. Current web sources for non-HF facts or updates.
 Use the `hf-paper-search` skill for paper operations.
+Use the `github-example-search` skill for example-file discovery in GitHub repos.
+Use the `web-search` skill for general current web research.
+## Plan-Only Research Floor
+When the user asks for research, recommendations, architecture, or a plan only, still behave like ML Intern's research pass. Do not stop at papers alone.
+Minimum research floor:
+- **Literature**: Search anchor papers and recent downstream papers. Read method/experiment/result sections where available. Include citation-graph or related-paper exploration for old anchors or fast-moving areas.
+- **Datasets**: Search HF Hub for task-adjacent datasets and inspect at least the most plausible candidates. If no suitable public dataset exists, state that as a verified gap and propose custom collection/eval.
+- **Code precedent**: Search GitHub for working implementations, especially in `transformers`, `trl`, `datasets`, `peft`, `accelerate`, `sentence-transformers`, or task-specific repos. Read the best candidate files before making API or architecture claims.
+- **Docs**: Read current HF docs for any library/API that the plan depends on.
+- **External constraints**: Use current web/official docs for non-HF platform constraints, policies, rate limits, pricing, or APIs.
+For plan-only outputs, return a compact evidence table before the plan when useful:
+- Source or artifact.
+- What was verified.
+- Design implication.
+- Confidence or gap.
+If runtime policy prevents spawning a research sub-agent, note that only as a process limitation; do not use it as a reason to skip dataset, code, docs, or citation-graph research.
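The evidence table for plan-only outputs could be rendered as a Markdown table along these lines. This is a sketch under stated assumptions: the four column names come from the list above and the `verified` / `inferred` / `not checked` labels from the plugin's research guidance, but `EvidenceRow`, `render_evidence_table`, and the sample rows are hypothetical placeholders, not real findings.

```python
from dataclasses import dataclass

STATUS_LABELS = ("verified", "inferred", "not checked")


@dataclass
class EvidenceRow:
    source: str       # paper id, dataset id, repo path, or doc page
    verified: str     # what was actually checked
    implication: str  # what it means for the design
    status: str       # one of STATUS_LABELS
    gap: str = ""     # optional note on what remains unchecked


def render_evidence_table(rows):
    """Render evidence rows as a Markdown table (hypothetical helper)."""
    lines = [
        "| Source or artifact | What was verified | Design implication | Confidence or gap |",
        "| --- | --- | --- | --- |",
    ]
    for row in rows:
        if row.status not in STATUS_LABELS:
            raise ValueError(f"unknown status: {row.status!r}")
        last = f"{row.status} ({row.gap})" if row.gap else row.status
        lines.append(f"| {row.source} | {row.verified} | {row.implication} | {last} |")
    return "\n".join(lines)


# Placeholder rows for illustration only.
rows = [
    EvidenceRow("example-org/example-dataset", "splits and column names",
                "usable for SFT after schema check", "verified"),
    EvidenceRow("hypothetical paper recipe", "abstract only",
                "hyperparameters need confirmation", "inferred",
                gap="methods section not read"),
]
print(render_evidence_table(rows))
```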
+## Upstream `research` Compatibility
+Upstream `huggingface/ml-intern` has a built-in `research` tool that launches an isolated research sub-agent with its own context and a read-only tool subset. In Codex, emulate that behavior as closely as the runtime allows.
+Primary intent:
+- Keep the main context focused on decisions and synthesis.
+- Offload heavy literature/doc/code crawling into a separate agent context when delegation is explicitly allowed in the active runtime and by the user request.
+- Use a read-only research scope: papers, docs, GitHub/file reads, dataset inspection, and web search. Avoid writes or implementation inside research.
+When delegation is allowed:
+- Spawn one focused sub-agent for broad research, or multiple focused sub-agents for parallel tracks.
+- Give each sub-agent a narrow task with explicit scope, such as papers, datasets, docs, or code precedents.
+- Ask for compact findings, not raw notes.
+- Require recipe-level outputs: `dataset + method + hyperparameters -> result`.
+- Require concrete references: paper ids, dataset ids, repo paths, doc pages.
+When delegation is not allowed:
+- Perform the same probes directly in the main context.
+- State the limitation briefly as a process note only.
+- Still preserve the upstream research order: papers first, then datasets, then docs/examples, then current external constraints.
+Research prompt pattern to emulate:
+- Start from anchor papers or landmark work.
+- Crawl citation graphs or related recent work.
+- Read methodology/experiment/result sections, not just abstracts.
+- Validate linked datasets and code instead of trusting claims.
+- End with a compact evidence-backed summary for the main agent.
+Codex-specific wrap-up behavior:
+- If research gets too broad, stop expanding scope and summarize verified findings.
+- Return an evidence table before the final plan when useful.
+- Distinguish `verified`, `inferred`, and `not checked`.
+## Parity With Upstream ML Intern
+Upstream `huggingface/ml-intern` includes a built-in `research` tool that launches an isolated read-only research agent with papers, citation graph, dataset inspection, HF docs, GitHub example search, repository file reads, and web search. This Codex plugin approximates that behavior through skills and available tools.
+To stay close to upstream behavior:
+- Keep the main context focused on synthesis and decisions.
+- Use parallel research where the runtime explicitly permits it; otherwise run focused sequential probes.
+- Treat `update_plan` as the compatibility layer for upstream `plan_tool`.
+- Prefer recipe-level findings: `dataset + method + hyperparameters -> metric/result`.
+- Validate linked resources instead of trusting paper or README claims.
+- Include concrete file paths, API names, dataset IDs, model IDs, and citation links.
+- Distinguish `verified`, `inferred`, and `not checked`.
+- Preserve a trace of the research in the final response when no artifact is written: sources consulted, datasets/code/docs checked, and remaining gaps.
 ## Dataset Audit Pattern
 - Verify repo, config, split, and revision.
 - Check row counts, column names, and representative rows.
 - Look for missing values, invalid records, class imbalance, or reward/preference balance.
+- If a paper suggested the dataset, verify the paper claim against the actual HF dataset contents instead of assuming they match.
 - Check text/message schema compatibility with the trainer:
   - SFT: needs `messages`, `text`, or `prompt`/`completion`
   - DPO: needs `prompt`, `chosen`, `rejected`
 - Plain-text logging: `disable_tqdm=True`, `logging_strategy="steps"`, `logging_first_step=True`.
 - Evaluation or validation pass.
 - `push_to_hub=True` or explicit upload of final artifacts.
+- The smallest working path first; add optional refinements only after the baseline is proven.
+- If a step depends on a repository example, read the example before changing the implementation.
 For long runs, include Trackio or equivalent monitoring and return the dashboard URL.
 - OOM: reduce per-device batch size while increasing gradient accumulation to keep effective batch size; enable gradient checkpointing; or choose larger hardware. Do not switch methods.
 - Divergence/NaN: lower learning rate, check labels/rewards, inspect samples.
 - Weak metric: compare against paper recipes, inspect errors, tune with a small sweep.
+- If the issue is ambiguous, return to the most authoritative source available before making a speculative change.
 Do not hide compromises. If preserving the original request is impossible, explain the constraint and ask for approval.
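The OOM guidance above (lower the per-device batch size, raise gradient accumulation) comes down to keeping one product constant. The sketch below assumes the number of devices does not change; the function name `rebalance_for_oom` is hypothetical, and its arguments correspond to the `per_device_train_batch_size` and `gradient_accumulation_steps` settings in Transformers `TrainingArguments`.

```python
def rebalance_for_oom(per_device_batch_size, gradient_accumulation_steps,
                      new_per_device_batch_size):
    """Return the accumulation steps that preserve the effective batch size.

    Effective batch size per device = per-device batch size * accumulation steps.
    Hypothetical helper; assumes the device count stays the same.
    """
    if not 0 < new_per_device_batch_size <= per_device_batch_size:
        raise ValueError("new per-device batch size must be positive and no larger than the old one")
    effective = per_device_batch_size * gradient_accumulation_steps
    if effective % new_per_device_batch_size != 0:
        raise ValueError("new per-device batch size must divide the effective batch size")
    return effective // new_per_device_batch_size


new_steps = rebalance_for_oom(8, 4, 2)
print(new_steps)  # 16, since 8 * 4 == 2 * 16
```

Rejecting sizes that do not divide evenly keeps the effective batch size exactly unchanged, so the rerun stays comparable to the original request.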

plugins/{mlintern → ml-intern}/skills/ml-intern-harness/scripts/inspect_dataset.py RENAMED

File without changes

plugins/{mlintern → ml-intern}/skills/ml-intern-harness/scripts/papers.py RENAMED

File without changes

plugins/{mlintern → ml-intern}/skills/ml-intern-harness/scripts/preflight_check.py RENAMED

File without changes

plugins/ml-intern/skills/ml-intern/SKILL.md ADDED

	@@ -0,0 +1,41 @@

+---
+name: ml-intern
+description: "Shortcut entry point for the ML Intern plugin. Use this when you want the main Hugging Face ML workflow without remembering the longer harness name."
+disable-model-invocation: false
+---
+# ml-intern
+Use this as the short alias for the main plugin workflow.
+This skill is only a router into `ml-intern-harness`. Do not improvise a different workflow.
+If the task is non-trivial, delegate to `ml-intern-harness` and follow its exact research-first ML workflow:
+- clarify the deliverable in one sentence
+- if the task has 3 or more real steps, start with a full `update_plan` call that mirrors upstream `plan_tool` semantics
+- for broad or novel research, use a research sub-agent when the active runtime explicitly allows delegation; otherwise perform the same focused research probes directly and disclose the limitation
+- if the task is paper-backed or novel, use `hf-paper-search` first
+- for plan-only research, still check likely HF datasets, current docs, and GitHub implementation precedents before synthesizing
+- use `web-search` only when the answer depends on current information outside HF and GitHub
+- validate datasets and models
+- implement the smallest working version only after research
+- smoke test before full runs
+- run Hugging Face Jobs when needed
+- evaluate and ship artifacts
+For parity with upstream ML Intern:
+- treat Codex `update_plan` as the compatibility layer for upstream `plan_tool`
+- treat delegated focused research as the compatibility layer for upstream `research`
+- update the plan at start, major phase changes, and completion
+- keep research outputs compact, citation-backed, and recipe-level
+If the user says they only want a plan, stop after the full research floor and return the plan. Do not start implementation.
+For focused tasks, use the specialized skills directly:
+- `hf-model-search`
+- `hf-dataset-search`
+- `hf-paper-search`
+- `hf-docs`
+- `hf-jobs`

plugins/ml-intern/skills/web-search/SKILL.md ADDED

	@@ -0,0 +1,73 @@

+---
+name: web-search
+description: "Search the current web for up-to-date information, prefer authoritative sources, and return concise cited results with optional domain filtering."
+disable-model-invocation: false
+---
+# Web Search
+## Purpose
+Replicate the useful part of ML Intern's generic web search behavior:
+find current information outside the Hugging Face Hub and GitHub, prefer high-quality sources, and return a compact result list with links.
+This skill is best used for:
+- current product or API updates
+- release notes and changelogs
+- standards, policies, pricing, schedules, and announcements
+- official docs and primary sources that are not on the Hub
+## Search Behavior
+When searching the web:
+1. Start with the narrowest query that can reasonably work.
+2. Prefer authoritative sources first:
+   - official docs
+   - vendor docs
+   - standards bodies
+   - primary sources
+   - project repositories
+3. If a search is broad, use multiple focused queries instead of one giant query.
+4. If the result set looks noisy, restrict to known-good domains.
+5. If the source is time-sensitive, verify the publication or update date before using it.
+## Domain Filtering
+Use domain filtering when possible:
+- `allowed_domains` for official or trusted sources only
+- `blocked_domains` to avoid low-quality aggregators, mirrors, or SEO spam
+If both are available, prefer allowlists.
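The allowlist-first rule can be illustrated with a small post-filter over result URLs. This is only a sketch: `allowed_domains` and `blocked_domains` are the names used above, but `filter_urls` is a hypothetical helper, and the subdomain matching rule (a host matches a domain if it equals it or ends with `.` plus the domain) is an assumption rather than the behavior of any particular search tool.

```python
from urllib.parse import urlsplit


def _matches(host, domains):
    # A host matches if it equals the domain or is a subdomain of it.
    return any(host == d.lower() or host.endswith("." + d.lower()) for d in domains)


def filter_urls(urls, allowed_domains=None, blocked_domains=None):
    """Filter URLs by domain; the allowlist wins when both lists are given."""
    kept = []
    for url in urls:
        host = urlsplit(url).hostname or ""
        if allowed_domains:
            # Allowlist mode: keep only trusted hosts and ignore the blocklist.
            if _matches(host, allowed_domains):
                kept.append(url)
        elif blocked_domains and _matches(host, blocked_domains):
            continue
        else:
            kept.append(url)
    return kept


urls = [
    "https://docs.example.com/guide",
    "https://notexample.com/post",
    "https://spam.example.net/mirror",
]
print(filter_urls(urls, allowed_domains=["example.com"]))
```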
+## Workflow
+1. Define the question in one sentence.
+2. Search the web with 1-3 targeted queries.
+3. Narrow to authoritative sources or request domain restrictions if needed.
+4. Open the best sources and extract the relevant facts.
+5. Cross-check conflicting claims with at least one other source when the answer matters.
+6. Return links, dates, and a short explanation of why the source is trustworthy.
+## Output Expectations
+Return:
+- a short answer to the question
+- 3-8 source links, ordered by usefulness
+- publication or update dates when available
+- a note if the answer is based on inference rather than direct source text
+## Style
+- Keep results compact and citation-friendly.
+- Prefer source titles and URLs over long quotes.
+- For current or changing facts, always mention the date context.
+## Notes
+- This is a best-effort Codex analogue to ML Intern's web search behavior.
+- It does not replace paper search, docs search, or GitHub example search.
+- Use it when the answer depends on current web sources rather than HF-native resources.

plugins/mlintern/agents/openai.yaml DELETED

@@ -1,4 +0,0 @@
-interface:
-  display_name: "ML Intern"
-  short_description: "Hugging Face ML engineering agent"
-  default_prompt: "Act as an ML engineering intern. Research the task, validate datasets and models, write and test code, run training or evaluation on Hugging Face Jobs, and ship artifacts."