razvan committed
Commit b6b1825 · 1 Parent(s): 250b7b4
marketplace.json → .agents/plugins/marketplace.json RENAMED
@@ -5,10 +5,10 @@
    },
    "plugins": [
      {
-       "name": "mlintern",
+       "name": "ml-intern",
        "source": {
          "source": "local",
-         "path": "./plugins/mlintern"
+         "path": "./plugins/ml-intern"
        },
        "policy": {
          "installation": "AVAILABLE",
README.md CHANGED
@@ -4,7 +4,7 @@ tags:
4
  ---
5
  # ML Intern Plugin for OpenAI Codex
6
 
7
- Hugging Face ML Intern reimagined as an OpenAI Codex plugin. Research papers, inspect datasets and models, run training and evaluation on Hugging Face Jobs, and ship ML artifacts — all inside Codex.
8
 
9
  ## What This Is
10
 
@@ -18,41 +18,68 @@ The original `mlintern-plugin` wraps the `ml-intern` CLI binary inside Claude Co
18
 
19
  ## Plugin Structure
20
 
21
- ```
22
- plugins/mlintern/
23
- ├── .codex-plugin/
24
- │ └── plugin.json ← Plugin manifest
25
- ├── agents/
26
- │ └── openai.yaml ← UI metadata for Codex
27
- ├── commands/
28
- │ └── run.md ← /mlintern:run command definition
29
- ├── skills/
30
- │ ├── ml-intern-harness/ Core ML Intern behavior (research, validate, implement, ship)
31
- │ ├── hf-model-search/ Model discovery and validation
32
- │ ├── hf-dataset-search/ Dataset discovery and schema inspection
33
- │ ├── hf-paper-search/ ← Paper research (search, read, citations)
34
- │ ├── hf-docs/ ← HF library documentation lookup
35
- │ └── hf-jobs/ ← Hugging Face cloud job submission and monitoring
36
- └── assets/ ← Icon, logo, screenshots
37
- ```
38
 
39
  ## Installation
40
 
41
- ### Method 1: Local Install (Recommended for Development)
42
 
43
- Clone this repo and link it into your Codex plugins directory:
44
 
45
  ```bash
46
  git clone https://github.com/razvan/ml-intern-codex-plugin.git
47
  cd ml-intern-codex-plugin
48
- # Link or copy to Codex plugins directory
49
- mkdir -p ~/.codex/plugins/
50
- ln -s $(pwd)/plugins/mlintern ~/.codex/plugins/mlintern
51
  ```
52
 
53
- ### Method 2: Marketplace (Future)
54
 
55
- Once OpenAI launches a public plugin marketplace, this plugin can be registered via `marketplace.json`.
56
 
57
  ## Dependencies
58
 
@@ -74,6 +101,18 @@ Or use the skill name in your prompt:
74
  Use ml-intern-harness to research DPO training recipes, find a suitable dataset, and implement a training script.
75
  ```
76
 
77
  ## Skills Reference
78
 
79
  | Skill | Purpose | Key Tools |
@@ -83,6 +122,8 @@ Use ml-intern-harness to research DPO training recipes, find a suitable dataset,
83
  | `hf-dataset-search` | Find and validate datasets | `_dataset_search`, `_hub_repo_details`, `inspect_dataset.py` |
84
  | `hf-paper-search` | Research papers and extract recipes | `_paper_search`, `papers.py` (details, citations, resources) |
85
  | `hf-docs` | Look up current HF library APIs | `_hf_doc_search`, `_hf_doc_fetch` |
 
 
86
  | `hf-jobs` | Submit and monitor cloud jobs | `_hf_jobs` (run, uv, ps, logs, inspect, cancel) |
87
 
88
  ## Comparison to Original
@@ -98,6 +139,66 @@ Use ml-intern-harness to research DPO training recipes, find a suitable dataset,
98
  | Job Submission | Built into `ml-intern` binary | `_hf_jobs` via Hugging Face Codex plugin |
99
  | Sandbox | HF Space sandboxes | Codex local shell + `_hf_jobs` |
100
 
101
  ## Why This Exists
102
 
103
  The original `huggingface/mlintern-plugin` is **Claude Code only** — it's a companion script that spawns the `ml-intern` CLI inside Claude Code sessions. There is no equivalent for Codex. The `huggingface/skills` repo provides general HF skills but not the full ML Intern harness. This plugin bridges the gap.
 
4
  ---
5
  # ML Intern Plugin for OpenAI Codex
6
 
7
+ Hugging Face ML Intern reimagined as an OpenAI Codex plugin. Research papers, inspect datasets and models, plan and evaluate AI/RAG systems, run training and evaluation on Hugging Face Jobs, and ship ML artifacts — all inside Codex.
8
 
9
  ## What This Is
10
 
 
18
 
19
  ## Plugin Structure
20
 
21
+ - `./.agents/plugins/marketplace.json` - Repo marketplace for Codex
22
+ - `./plugins/ml-intern/.codex-plugin/plugin.json` - Plugin manifest
23
+ - `./plugins/ml-intern/agents/openai.yaml` - UI metadata for Codex
24
+ - `./plugins/ml-intern/commands/run.md` - `/mlintern:run` command definition
25
+ - `./plugins/ml-intern/skills/ml-intern-harness/` - Core ML Intern behavior
26
+ - `./plugins/ml-intern/skills/hf-model-search/` - Model discovery and validation
27
+ - `./plugins/ml-intern/skills/hf-dataset-search/` - Dataset discovery and schema inspection
28
+ - `./plugins/ml-intern/skills/hf-paper-search/` - Paper research, reading, citations
29
+ - `./plugins/ml-intern/skills/hf-docs/` - Hugging Face library documentation lookup
30
+ - `./plugins/ml-intern/skills/github-example-search/` - GitHub example-file discovery
31
+ - `./plugins/ml-intern/skills/web-search/` - Current web search with source filtering
32
+ - `./plugins/ml-intern/skills/hf-jobs/` - Hugging Face cloud job submission and monitoring
33
 
34
  ## Installation
35
 
36
+ This plugin is hosted on GitHub, and the easiest local install is to clone the repo and let Codex load the plugin from the local checkout.
37
+
38
+ ### Method 1: Clean Codex UI Install
39
+
40
+ This is the recommended path if you want a clean install inside the Codex UI without copying files into a global plugin directory.
41
+
42
+ 1. Clone this repo somewhere local that Codex can read:
43
+
44
+ ```bash
45
+ git clone https://github.com/razvan/ml-intern-codex-plugin.git
46
+ cd ml-intern-codex-plugin
47
+ ```
48
+
49
+ 2. Restart Codex so it reloads the repo marketplace.
50
+
51
+ 3. Use the repository marketplace entry in this repo. The marketplace file is:
52
+
53
+ `./.agents/plugins/marketplace.json`
54
+
55
+ 4. The marketplace points Codex at the plugin bundle here:
56
+
57
+ `./plugins/ml-intern`
58
 
59
+ After Codex reloads, the plugin should appear in the Codex UI as **ML Intern for Codex**.
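
If the plugin does not show up after the reload, one thing worth checking is whether the marketplace entry actually resolves to a plugin bundle on disk. The following is a minimal sketch of such a check, not part of the plugin. It assumes that local `path` values are resolved relative to the repository root and that a bundle is recognizable by its `.codex-plugin/plugin.json` manifest; the helper name `check_marketplace` is hypothetical.

```python
import json
from pathlib import Path


def check_marketplace(repo_root: Path) -> list[str]:
    """Return a list of problems found in the repo marketplace file."""
    marketplace = repo_root / ".agents" / "plugins" / "marketplace.json"
    if not marketplace.is_file():
        return [f"missing marketplace file: {marketplace}"]

    data = json.loads(marketplace.read_text(encoding="utf-8"))
    problems = []
    for entry in data.get("plugins", []):
        name = entry.get("name", "<unnamed>")
        source = entry.get("source", {})
        if source.get("source") != "local":
            continue  # only local sources can be checked on disk
        # Assumption: local paths are resolved relative to the repo root.
        bundle = repo_root / source.get("path", "")
        manifest = bundle / ".codex-plugin" / "plugin.json"
        if not manifest.is_file():
            problems.append(f"{name}: no plugin manifest at {manifest}")
    return problems
```

Calling `check_marketplace(Path.cwd())` from the cloned repository returns an empty list when every local entry points at a bundle with a manifest.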
60
+
61
+ ### Method 2: Manual Local Install
62
+
63
+ If you want to install it into your local Codex plugin directory manually:
64
+
65
+ 1. Clone the repository locally:
66
 
67
  ```bash
68
  git clone https://github.com/razvan/ml-intern-codex-plugin.git
69
  cd ml-intern-codex-plugin
 
 
 
70
  ```
71
 
72
+ 2. Copy the plugin bundle into your Codex plugins directory:
73
 
74
+ ```bash
75
+ cp -R plugins/ml-intern ~/.codex/plugins/ml-intern
76
+ ```
77
+
78
+ 3. Make sure Codex can see the plugin via a local marketplace entry or by reloading the Codex UI, depending on how your Codex setup is configured.
79
+
80
+ 4. Restart Codex.
81
+
82
+ 5. Look for **ML Intern for Codex** in the plugin list.
83
 
84
  ## Dependencies
85
 
 
101
  Use ml-intern-harness to research DPO training recipes, find a suitable dataset, and implement a training script.
102
  ```
103
 
104
+ The bundled skills are the main entry points:
105
+
106
+ - `ml-intern-harness` for the end-to-end research, validation, implementation, and job loop, including plan-only AI/RAG/search/QA system design
107
+ - `ml-intern` as the short alias for the main plugin workflow
108
+ - `hf-model-search` for model discovery and validation
109
+ - `hf-dataset-search` for dataset discovery and schema inspection
110
+ - `hf-paper-search` for paper research and recipe extraction
111
+ - `hf-docs` for current Hugging Face library docs
112
+ - `github-example-search` for finding working example files in GitHub repos
113
+ - `web-search` for current web sources and general research outside the Hugging Face Hub
114
+ - `hf-jobs` for Hugging Face job submission and monitoring
115
+
116
  ## Skills Reference
117
 
118
  | Skill | Purpose | Key Tools |
 
122
  | `hf-dataset-search` | Find and validate datasets | `_dataset_search`, `_hub_repo_details`, `inspect_dataset.py` |
123
  | `hf-paper-search` | Research papers and extract recipes | `_paper_search`, `papers.py` (details, citations, resources) |
124
  | `hf-docs` | Look up current HF library APIs | `_hf_doc_search`, `_hf_doc_fetch` |
125
+ | `github-example-search` | Find working example files in GitHub repos | GitHub repo/file search plus `fetch_file` |
126
+ | `web-search` | Find current web sources with filters | Codex web browsing/search tools and citation links |
127
  | `hf-jobs` | Submit and monitor cloud jobs | `_hf_jobs` (run, uv, ps, logs, inspect, cancel) |
128
 
129
  ## Comparison to Original
 
139
  | Job Submission | Built into `ml-intern` binary | `_hf_jobs` via Hugging Face Codex plugin |
140
  | Sandbox | HF Space sandboxes | Codex local shell + `_hf_jobs` |
141
 
142
+ ## Parity Status
143
+
144
+ This plugin intentionally mirrors the parts of ML Intern that matter most for HF ML work:
145
+
146
+ | Area | Status | Notes |
147
+ |---|---|---|
148
+ | Paper discovery | Done | HF paper search plus the deeper `papers.py` research flow is available. |
149
+ | Paper reading | Done | Section reading, citations, recommendations, and linked resources are implemented. |
150
+ | Dataset validation | Done | Schema, splits, sample rows, parquet availability, and compatibility notes are covered. |
151
+ | HF docs lookup | Done | Search plus fetch are available for current HF library guidance. |
152
+ | End-to-end ML workflow | Done | The harness pushes research-first, validate-first behavior. |
153
+ | Generic web search | Partial | Best-effort Codex `web-search` guidance exists, but it is not the exact ML Intern DuckDuckGo wrapper. |
154
+
155
+ For the closest ML Intern feel, use the paper and dataset skills first, then docs, then the harness workflow.
156
+
157
+ ## Behavioral Contract
158
+
159
+ When ML Intern is invoked directly, the plugin should behave like a research harness rather than a generic assistant:
160
+
161
+ - Route through `ml-intern` -> `ml-intern-harness` for non-trivial AI/ML/RAG/search/evaluation tasks, even when they are not purely Hugging Face tasks.
162
+ - Use plan tracking at the beginning, each phase transition, and completion when Codex exposes a planning tool.
163
+ - Split research into explicit tracks before synthesis, such as platform constraints, technical approaches, and evaluation methodology.
164
+ - Use `hf-paper-search` for literature, benchmarks, and evaluation methods.
165
+ - Use `web-search` for current platform/API constraints, official docs, policies, pricing, rate limits, SDK behavior, and other non-HF facts.
166
+ - Cite important architecture and evaluation decisions with papers, official docs, or primary sources.
167
+ - If the user asks for "plan only", stop after research and do not write implementation code or scaffold files.
168
+ - If a write, shell, network, or sandbox step fails, fail forward with an inline deliverable when possible.
169
+
170
+ ### Codex Compatibility Layer
171
+
172
+ This plugin cannot inject upstream Python tools into Codex directly, so it should reproduce the behavior of the upstream tools using Codex-native primitives:
173
+
174
+ | Upstream tool | Codex compatibility layer | Required behavior to preserve |
175
+ |---|---|---|
176
+ | `plan_tool` | `update_plan` | Full-plan replacement, one `in_progress` item, updates at start/phase transition/completion |
177
+ | `research` | delegated sub-agent research when explicitly allowed, otherwise focused sequential research | Separate research context when possible, read-only scope, papers-first workflow, compact evidence-backed summary |
178
+
179
+ Faithful `plan_tool` semantics to preserve:
180
+ - Use it for tasks with 3 or more meaningful steps.
181
+ - Replace the whole visible plan each update.
182
+ - Keep exactly one item in progress.
183
+ - Mark completed only after full success.
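
The plan rules above can be expressed as a small validator. This is only an illustrative sketch: the `PlanItem` type, the `validate_plan` helper, and the `pending`/`completed` status names are assumptions (only `in_progress` appears in this README), and it does not call the real `update_plan` tool. It also assumes a fully completed plan is the one case where no item is in progress.

```python
from dataclasses import dataclass

# Assumption: only "in_progress" is named in the README; the other two are illustrative.
VALID_STATUSES = {"pending", "in_progress", "completed"}


@dataclass(frozen=True)
class PlanItem:
    step: str
    status: str


def validate_plan(items: list[PlanItem]) -> list[str]:
    """Check a full replacement plan against the rules listed above."""
    problems = []
    if len(items) < 3:
        problems.append("plan tracking is meant for tasks with 3 or more steps")
    for item in items:
        if item.status not in VALID_STATUSES:
            problems.append(f"unknown status {item.status!r} for step {item.step!r}")
    in_progress = [item for item in items if item.status == "in_progress"]
    all_done = bool(items) and all(item.status == "completed" for item in items)
    # Assumption: a finished plan has zero items in progress; otherwise exactly one.
    if not all_done and len(in_progress) != 1:
        problems.append(f"expected exactly one in_progress item, found {len(in_progress)}")
    return problems
```

Because each update replaces the whole visible plan, the validator takes the complete list every time rather than a delta.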
184
+
185
+ Faithful `research` semantics to preserve:
186
+ - Main context stays focused on synthesis and decisions.
187
+ - Research uses papers, citation graphs, dataset inspection, docs, GitHub examples, and web search.
188
+ - Research returns compact findings with concrete references and recipe-level claims.
189
+ - If separate delegation is unavailable, preserve the same research floor directly rather than skipping it.
190
+
191
+ Example plan-only trigger:
192
+
193
+ ```text
194
+ [@ml-intern](plugin://ml-intern@ml-intern-codex)
195
+ i want to query generic Discord servers in natural language.
196
+ first figure out constraints and challenges, then research how to build and test quality.
197
+ i'm only interested in the plan for now.
198
+ ```
199
+
200
+ Expected behavior: track a plan, use `web-search` for Discord API constraints, use `hf-paper-search` for RAG/forum/social QA and evaluation research, synthesize a cited build-and-test plan, and avoid implementation.
201
+
202
  ## Why This Exists
203
 
204
  The original `huggingface/mlintern-plugin` is **Claude Code only** — it's a companion script that spawns the `ml-intern` CLI inside Claude Code sessions. There is no equivalent for Codex. The `huggingface/skills` repo provides general HF skills but not the full ML Intern harness. This plugin bridges the gap.
plugins/{mlintern → ml-intern}/.codex-plugin/plugin.json RENAMED
@@ -1,7 +1,7 @@
1
  {
2
  "name": "ml-intern",
3
- "version": "0.1.0",
4
- "description": "Hugging Face ML Intern for Codex — research ML papers, inspect models and datasets, run training and evaluation jobs, and ship ML artifacts.",
5
  "author": {
6
  "name": "Hugging Face",
7
  "email": "agents@huggingface.co",
@@ -15,15 +15,15 @@
15
  "interface": {
16
  "displayName": "ML Intern",
17
  "shortDescription": "Hugging Face ML engineering agent for Codex",
18
- "longDescription": "ML Intern is an autonomous ML engineering agent for the Hugging Face ecosystem. It researches papers, validates datasets and models, writes training/evaluation code, runs HF Jobs, and ships ML artifacts with zero avoidable errors.",
19
  "developerName": "Hugging Face",
20
  "category": "Coding",
21
  "capabilities": ["Interactive", "Read", "Write"],
22
  "websiteURL": "https://huggingface.co",
23
  "defaultPrompt": [
 
24
  "Fine-tune a language model on a custom dataset using Hugging Face Jobs",
25
- "Find the best open-source embedding model and benchmark it",
26
- "Research papers on preference optimization and implement DPO training"
27
  ],
28
  "brandColor": "#FF6B00"
29
  }
 
1
  {
2
  "name": "ml-intern",
3
+ "version": "0.1.3",
4
+ "description": "Hugging Face ML Intern for Codex — research ML papers first, inspect models and datasets, run training and evaluation jobs, and ship ML artifacts.",
5
  "author": {
6
  "name": "Hugging Face",
7
  "email": "agents@huggingface.co",
 
15
  "interface": {
16
  "displayName": "ML Intern",
17
  "shortDescription": "Hugging Face ML engineering agent for Codex",
18
+ "longDescription": "ML Intern is an autonomous ML engineering agent for the Hugging Face ecosystem. It follows a strict research-first workflow: clarify the deliverable, search papers first for novel or paper-backed tasks, trace citations, inspect likely datasets, read current HF docs and GitHub examples, use web sources only when necessary, validate datasets and models, and only then write code, run HF Jobs, and ship ML artifacts.",
19
  "developerName": "Hugging Face",
20
  "category": "Coding",
21
  "capabilities": ["Interactive", "Read", "Write"],
22
  "websiteURL": "https://huggingface.co",
23
  "defaultPrompt": [
24
+ "Research a paper-backed ML task and return a plan only",
25
  "Fine-tune a language model on a custom dataset using Hugging Face Jobs",
26
+ "Find the best open-source embedding model and benchmark it"
 
27
  ],
28
  "brandColor": "#FF6B00"
29
  }
plugins/ml-intern/agents/openai.yaml ADDED
@@ -0,0 +1,4 @@
+ interface:
+   display_name: "ML Intern"
+   short_description: "Hugging Face ML engineering agent"
+   default_prompt: "Act as an ML engineering intern with a strict research-first workflow. Clarify the deliverable, search papers first for paper-backed or novel tasks, trace citations when useful, validate datasets and models, read current HF docs and GitHub examples, use web sources only when current external facts are needed, and if the user only wants a plan, stop after the full research floor and return the plan with evidence checked."
plugins/{mlintern → ml-intern}/commands/run.md RENAMED
@@ -14,23 +14,31 @@ Run an ML Intern task end-to-end.
14
  ## Workflow
15
 
16
  1. Clarify the deliverable from the prompt.
17
- 2. Research the task before writing code:
18
- - Search for landmark and recent papers if the task is novel.
19
  - Read HF docs for current API patterns.
20
- - Find a working implementation example.
21
- 3. Validate inputs:
 
 
22
  - Inspect dataset schema, splits, sample rows.
23
  - Verify model repo exists, architecture matches, tokenizer available.
24
- 4. Implement the smallest working version.
25
- 5. Smoke test locally or in a small HF Job.
26
- 6. Run the full training/evaluation job with HF Jobs.
27
- 7. Evaluate results against the target.
28
- 8. Save code, configs, and reports; publish ML artifacts to Hugging Face.
29
 
30
  ## Output
31
 
32
  Return:
33
  - Deliverable status (complete / partial / failed).
 
34
  - GitHub branch, commit, PR, or report path for code.
35
  - Hugging Face model/dataset/Space URLs for published artifacts.
36
  - Job ID and log URL for HF Jobs runs.
@@ -40,6 +48,7 @@ Return:
40
  ## Guardrails
41
 
42
  - Never silently substitute a dataset, model, or training method. Ask for approval if the original request is incompatible.
 
43
  - Always set realistic timeouts for HF Jobs (at least 2 hours for real training).
44
  - Always include `push_to_hub=True` and `hub_model_id` in training configs.
45
  - Run one job first before launching sweeps or ablations.
 
14
  ## Workflow
15
 
16
  1. Clarify the deliverable from the prompt.
17
+ 2. If the task has 3 or more meaningful steps, create a full `update_plan` plan before deep work begins. Keep exactly one step in progress at a time and update it at phase transitions.
18
+ 3. Research the task before writing code:
19
+ - Use a research sub-agent for broad or novel research when the active Codex runtime explicitly allows delegation; otherwise run the same focused probes directly.
20
+ - Mirror upstream `research` behavior: keep research read-only, papers-first, and isolated from implementation as much as Codex allows.
21
+ - If the task is paper-backed or novel, search for landmark and recent papers first.
22
+ - Trace citation graphs or related-paper recommendations for old anchors and fast-moving methods.
23
+ - Search and inspect likely HF datasets, even for plan-only tasks.
24
  - Read HF docs for current API patterns.
25
+ - Find and read a working GitHub implementation example.
26
+ - Use web sources only when the answer depends on current information outside HF and GitHub.
27
+ - If the user only wants a plan, stop after the full research floor and return the plan. Do not implement.
28
+ 4. Validate inputs:
29
  - Inspect dataset schema, splits, sample rows.
30
  - Verify model repo exists, architecture matches, tokenizer available.
31
+ 5. Implement the smallest working version.
32
+ 6. Smoke test locally or in a small HF Job.
33
+ 7. Run the full training/evaluation job with HF Jobs.
34
+ 8. Evaluate results against the target.
35
+ 9. Save code, configs, and reports; publish ML artifacts to Hugging Face.
36
 
37
  ## Output
38
 
39
  Return:
40
  - Deliverable status (complete / partial / failed).
41
+ - Evidence checked: papers, datasets, docs, GitHub examples, and external sources.
42
  - GitHub branch, commit, PR, or report path for code.
43
  - Hugging Face model/dataset/Space URLs for published artifacts.
44
  - Job ID and log URL for HF Jobs runs.
 
48
  ## Guardrails
49
 
50
  - Never silently substitute a dataset, model, or training method. Ask for approval if the original request is incompatible.
51
+ - For multi-step tasks, do not skip plan updates at start, phase change, or completion.
52
  - Always set realistic timeouts for HF Jobs (at least 2 hours for real training).
53
  - Always include `push_to_hub=True` and `hub_model_id` in training configs.
54
  - Run one job first before launching sweeps or ablations.
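
Some of the guardrails shown above are mechanical enough to check before a job is submitted. The sketch below is an assumption-laden illustration, not part of the command: `check_job_guardrails` is a hypothetical helper, the training config is modeled as a plain dict, and the timeout is passed in seconds. Only the `push_to_hub` and `hub_model_id` keys and the two-hour floor come from the guardrails themselves.

```python
# "at least 2 hours for real training", expressed in seconds.
MIN_TRAINING_TIMEOUT_S = 2 * 60 * 60


def check_job_guardrails(training_config: dict, timeout_s: int) -> list[str]:
    """Return guardrail violations for one planned training job (hypothetical helper)."""
    problems = []
    if training_config.get("push_to_hub") is not True:
        problems.append("push_to_hub must be True")
    if not training_config.get("hub_model_id"):
        problems.append("hub_model_id is missing")
    if timeout_s < MIN_TRAINING_TIMEOUT_S:
        problems.append(
            f"timeout of {timeout_s}s is below the {MIN_TRAINING_TIMEOUT_S}s floor"
        )
    return problems
```

An empty list means these guardrails pass; the others (no silent substitutions, one job before sweeps) still need human or agent judgment.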
plugins/ml-intern/skills/github-example-search/SKILL.md ADDED
@@ -0,0 +1,99 @@
1
+ ---
2
+ name: github-example-search
3
+ description: "Find working example files in GitHub repositories using path heuristics and GitHub file search, then read the best candidates."
4
+ disable-model-invocation: false
5
+ ---
6
+
7
+ # GitHub Example Search
8
+
9
+ ## Purpose
10
+
11
+ Replicate the useful part of ML Intern's GitHub example discovery:
12
+ find example scripts, tutorials, notebooks, and guides in a target repo, then read the best matches before implementing code.
13
+
14
+ This skill is intentionally path-first. It is not a semantic code search engine.
15
+
16
+ ## Tools
17
+
18
+ Use the GitHub plugin tools:
19
+
20
+ - `search_repositories` or `search_installed_repositories_v2` to identify the repo if the user did not name it.
21
+ - `search` to search files within the target repository.
22
+ - `fetch_file` to read the candidate file contents.
23
+ - `search_branches` only if you need to confirm branch names.
24
+
25
+ ## Search Strategy
26
+
27
+ Use prioritized file-path patterns, roughly in this order:
28
+
29
+ 1. `scripts`
30
+ 2. `examples`, `example`
31
+ 3. `notebooks`, `notebook`
32
+ 4. `tutorials`, `tutorial`, `quickstart`, `walkthrough`, `walkthroughs`
33
+ 5. `cookbook`, `recipe`, `recipes`
34
+ 6. `demos`, `demo`, `samples`, `sample`
35
+ 7. `guides`, `guide`, `getting-started`, `getting_started`
36
+ 8. `playground`, `howto`, `how-to`
37
+ 9. `use-cases`, `usecases`, `use_cases`
38
+ 10. `sandbox`, `showcase`
39
+
40
+ ## Workflow
41
+
42
+ 1. Resolve the repository first.
43
+ 2. If the repository is ambiguous, search repositories by name or organization until you have a strong candidate.
44
+ 3. Search the repo with the highest-priority path patterns first.
45
+ 4. If the user gave a keyword, combine it with the pattern queries.
46
+ 5. Prefer files under `examples/` or `example/` when there is a tie.
47
+ 6. Prefer `scripts/` over other example-like directories.
48
+ 7. Prefer shallower paths when multiple matches are similar.
49
+ 8. Read the top candidate files with `fetch_file`, using line ranges for large files.
50
+ 9. Use the exact file path from the search result to continue the investigation.
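
The path-first ranking described above can be sketched as a sort key. This is one possible reading, not the upstream `github_find_examples` logic: it ranks a path by the highest-priority directory name it contains (so `scripts` wins over `examples`), then prefers shallower paths. The function names are hypothetical.

```python
# Priority groups copied from the Search Strategy list, highest priority first.
PRIORITY_GROUPS = [
    ("scripts",),
    ("examples", "example"),
    ("notebooks", "notebook"),
    ("tutorials", "tutorial", "quickstart", "walkthrough", "walkthroughs"),
    ("cookbook", "recipe", "recipes"),
    ("demos", "demo", "samples", "sample"),
    ("guides", "guide", "getting-started", "getting_started"),
    ("playground", "howto", "how-to"),
    ("use-cases", "usecases", "use_cases"),
    ("sandbox", "showcase"),
]


def rank_key(path: str) -> tuple[int, int]:
    """Sort key: (priority of the best matching directory, directory depth)."""
    directories = path.lower().split("/")[:-1]
    best = len(PRIORITY_GROUPS)  # paths with no matching directory sort last
    for rank, names in enumerate(PRIORITY_GROUPS):
        if any(directory in names for directory in directories):
            best = rank
            break
    return best, len(directories)


def rank_paths(paths: list[str]) -> list[str]:
    """Order candidate file paths from most to least promising."""
    return sorted(paths, key=rank_key)
```

Only directory names are matched, which keeps the behavior path-first as stated in the Purpose section; keyword relevance is left to the search query itself.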
51
+
52
+ ## Practical Query Plan
53
+
54
+ When looking for a specific method or trainer, search in this order:
55
+
56
+ 1. `<keyword> scripts`
57
+ 2. `<keyword> examples`
58
+ 3. `<keyword> tutorial`
59
+ 4. `<keyword> notebook`
60
+ 5. `<keyword> guide`
61
+
62
+ When no keyword is given, search for the directories themselves:
63
+
64
+ 1. `scripts`
65
+ 2. `examples`
66
+ 3. `example`
67
+ 4. `notebooks`
68
+ 5. `tutorials`
69
+
70
+ ## Reading Pattern
71
+
72
+ After finding a candidate file:
73
+
74
+ 1. Read the file header and argument parsing.
75
+ 2. Read the model, dataset, and trainer setup.
76
+ 3. Read the training loop or main execution path.
77
+ 4. If the file is long, fetch only the relevant line range.
78
+
79
+ ## Output Expectations
80
+
81
+ Return:
82
+
83
+ - the best candidate file path
84
+ - why it was selected
85
+ - the repository it came from
86
+ - the exact line range to read next if the file is large
87
+
88
+ ## Example
89
+
90
+ ```
91
+ _search(query="grpo scripts", repository_name="huggingface/trl", topn=10)
92
+ _fetch_file(repository_full_name="huggingface/trl", ref="main", path="examples/scripts/grpo.py", encoding="utf-8")
93
+ ```
94
+
95
+ ## Notes
96
+
97
+ - This is the closest Codex-native replacement for ML Intern's `github_find_examples`.
98
+ - It relies on GitHub repo/file search plus the same path-priority intuition as the upstream tool.
99
+ - Once you have a candidate, always read the actual file before implementing.
plugins/{mlintern → ml-intern}/skills/hf-dataset-search/SKILL.md RENAMED
@@ -35,12 +35,14 @@ This queries the Hugging Face Dataset Viewer API for:
35
 
36
  1. Search for candidate datasets with `dataset_search`.
37
  2. Inspect metadata with `hub_repo_details` (set `repo_type="dataset"`).
38
- 3. Run `inspect_dataset.py` for schema and sample row details.
39
  4. Verify training-method compatibility:
40
  - SFT: needs `messages`, `text`, or `prompt`/`completion`
41
  - DPO: needs `prompt`, `chosen`, `rejected`
42
  - GRPO: needs `prompt`
43
- 5. Surface class imbalance, missing values, unexpected formats, or unsafe substitutions.
 
 
44
 
45
  ## Example
46
 
@@ -56,6 +58,7 @@ Before using a dataset:
56
  - [ ] Dataset is valid and has Dataset Viewer coverage.
57
  - [ ] Configs/splits match expectations.
58
  - [ ] Column names and types are compatible with the trainer.
 
59
  - [ ] Sample rows look reasonable.
60
  - [ ] Row count is sufficient for the task.
61
  - [ ] License and gating are acceptable.
 
35
 
36
  1. Search for candidate datasets with `dataset_search`.
37
  2. Inspect metadata with `hub_repo_details` (set `repo_type="dataset"`).
38
+ 3. Run `inspect_dataset.py` for schema, splits, parquet availability, and sample row details.
39
  4. Verify training-method compatibility:
40
  - SFT: needs `messages`, `text`, or `prompt`/`completion`
41
  - DPO: needs `prompt`, `chosen`, `rejected`
42
  - GRPO: needs `prompt`
43
+ 5. If the dataset has a `messages` column, inspect the sample rows for role order and tool-call structure.
44
+ 6. Surface class imbalance, missing values, unexpected formats, or unsafe substitutions.
45
+ 7. Do not start training until the dataset shape matches the intended trainer.
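
The column requirements in step 4 can be checked mechanically once the column names are known. The following is a minimal sketch under that assumption; `is_compatible` is a hypothetical helper, and it only checks column names, not the role order or tool-call structure that step 5 asks you to inspect by eye.

```python
# Each method maps to alternative column sets; any one full set is sufficient.
REQUIRED_COLUMNS = {
    "sft": [{"messages"}, {"text"}, {"prompt", "completion"}],
    "dpo": [{"prompt", "chosen", "rejected"}],
    "grpo": [{"prompt"}],
}


def is_compatible(method: str, columns: list[str]) -> bool:
    """Return True if the dataset columns satisfy the named training method."""
    available = set(columns)
    alternatives = REQUIRED_COLUMNS[method.lower()]
    return any(required <= available for required in alternatives)
```

For example, `is_compatible("dpo", ["prompt", "chosen"])` is false because `rejected` is missing, which is the kind of mismatch step 7 is meant to catch before training starts.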
46
 
47
  ## Example
48
 
 
58
  - [ ] Dataset is valid and has Dataset Viewer coverage.
59
  - [ ] Configs/splits match expectations.
60
  - [ ] Column names and types are compatible with the trainer.
61
+ - [ ] If `messages` exists, the chat format looks correct in sample rows.
62
  - [ ] Sample rows look reasonable.
63
  - [ ] Row count is sufficient for the task.
64
  - [ ] License and gating are acceptable.
plugins/{mlintern → ml-intern}/skills/hf-docs/SKILL.md RENAMED
@@ -32,10 +32,11 @@ Look up current API usage patterns, trainer configs, and library documentation b

  ## Workflow

- 1. Use `hf_doc_search` to find relevant pages.
- 2. Use `hf_doc_fetch` to read the full content of the most relevant page.
+ 1. Use `hf_doc_search` to find the most relevant current page for the exact library or trainer.
+ 2. Use `hf_doc_fetch` to read the full content of the page before coding.
  3. Extract the exact imports, class names, config parameters, and argument names.
- 4. Use the current API in your implementation.
+ 4. Cross-check any example against the current library version you are targeting.
+ 5. Use the current API in your implementation, not memory or old snippets.

  ## Example

@@ -50,3 +51,4 @@ Before writing any training, fine-tuning, inference, or evaluation code:
  - Find at least one current working implementation pattern from HF docs or a relevant repo.
  - Verify import paths, trainer class names, and config field names.
  - Check that the example matches your library version constraints.
+ - If the API changed recently, prefer the docs page over older code examples.
plugins/{mlintern → ml-intern}/skills/hf-jobs/SKILL.md RENAMED
File without changes
plugins/{mlintern → ml-intern}/skills/hf-model-search/SKILL.md RENAMED
File without changes
plugins/{mlintern → ml-intern}/skills/hf-paper-search/SKILL.md RENAMED
@@ -37,13 +37,21 @@ python skills/ml-intern-harness/scripts/papers.py <operation> [args]
37
 
38
  ## Workflow
39
 
40
- 1. Use `paper_search` for quick discovery or `papers.py search` for filtered Semantic Scholar search.
41
- 2. Read methodology and results sections with `papers.py read_paper --arxiv-id <id> --section 3`.
42
- 3. Trace citation graphs with `papers.py citation_graph --arxiv-id <id> --direction citations`.
43
- 4. Extract concrete recipes: dataset, preprocessing, method, hyperparameters, model, metric, result.
44
- 5. Find linked HF datasets/models/collections with `papers.py find_all_resources`.
45
- 6. Inspect promising datasets with `inspect_dataset.py`.
46
- 7. Read current HF docs and GitHub examples before implementing.
47
 
48
  ## Example
49
 
 
37
 
38
  ## Workflow
39
 
40
+ 1. Use `paper_search` for quick discovery or `papers.py search` when you need filters.
41
+ 2. Prefer landmark and recent papers together, not just the most popular result.
42
+ 3. Read the paper TOC first, then the methodology and results sections with `papers.py read_paper`.
43
+ 4. Trace downstream citations with `papers.py citation_graph --direction citations` before settling on a recipe.
44
+ 5. Extract concrete recipes: dataset, preprocessing, method, hyperparameters, model, metric, result.
45
+ 6. Use `papers.py find_all_resources` to discover linked HF datasets, models, and collections.
46
+ 7. Inspect promising datasets with `inspect_dataset.py` before implementing anything.
47
+ 8. Read current HF docs and GitHub examples before implementing the training or evaluation code.
48
+
49
+ ## What To Prioritize
50
+
51
+ - Prefer sections that explain how the result was achieved, not just the abstract.
52
+ - Prefer recent downstream work when the anchor paper is old.
53
+ - Prefer recipe-level claims such as "dataset X + method Y + model Z produced metric M".
54
+ - Treat linked datasets as candidates, not assumptions. Validate schema and sample rows first.
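
A recipe-level claim is easier to compare and audit when it is stored as a record rather than free text. The sketch below is one hypothetical way to do that: the `Recipe` class and `missing_fields` helper are not part of `papers.py`, and the field names simply follow the list in workflow step 5.

```python
from dataclasses import dataclass, field, fields


@dataclass
class Recipe:
    """One paper-backed recipe; empty fields mark what the paper did not state."""
    dataset: str = ""
    preprocessing: str = ""
    method: str = ""
    hyperparameters: dict = field(default_factory=dict)
    model: str = ""
    metric: str = ""
    result: str = ""


def missing_fields(recipe: Recipe) -> list[str]:
    """List the recipe fields that are still empty, in declaration order."""
    return [f.name for f in fields(recipe) if not getattr(recipe, f.name)]
```

A recipe with many missing fields is a hint to go back to the methodology section rather than rely on the abstract.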
55
 
56
  ## Example
57
 
plugins/{mlintern → ml-intern}/skills/ml-intern-harness/SKILL.md RENAMED
@@ -17,8 +17,17 @@ This skill is for doing ML work end to end — not just advising. Research first
17
  For any non-trivial ML task, follow this loop:
18
 
19
  1. **Clarify**: One-sentence deliverable. Is it a model, a benchmark result, a dataset, a report?
20
- 2. **Research**: Find at least one current working implementation pattern. For novel tasks, search landmark and recent papers; prefer methodology sections over abstracts. Extract recipes: dataset, model, method, hyperparameters, metrics.
21
- 3. **Validate inputs**: Inspect dataset schema, splits, sample rows. Verify model repo exists, architecture matches, tokenizer available, license compatible.
22
  4. **Implement smallest working version**: Use current HF docs and a working example as reference. Do not rely on memory for imports, config names, or trainer arguments.
23
  5. **Smoke test**: Run locally or in a small HF Job before the full run.
24
  6. **Run full job**: Submit to HF Jobs with realistic timeout, `push_to_hub=True`, and monitoring.
@@ -26,6 +35,33 @@ For any non-trivial ML task, follow this loop:
26
  8. **Ship**: Save code and configs to the source repo. Publish model weights, datasets, or Spaces to Hugging Face. Return artifact URLs.
27
  9. **Iterate**: If results are weak or broken, diagnose and run the next experiment. Do not stop after a plan.
28
 
29
  ## High-Risk Mistakes To Avoid
30
 
31
  - Hallucinated imports or trainer arguments from outdated memory.
@@ -47,8 +83,85 @@ For paper-backed tasks:
47
  5. Find linked Hugging Face datasets, models, and collections.
48
  6. Inspect promising datasets before using them.
49
  7. Read current HF docs and GitHub examples before implementing.
50
 
51
  Use the `hf-paper-search` skill for paper operations.
52
 
53
  ## Dataset Audit Pattern
54
 
@@ -56,6 +169,7 @@ Before training or evaluating:
56
  - Verify repo, config, split, and revision.
57
  - Check row counts, column names, and representative rows.
58
  - Look for missing values, invalid records, class imbalance, or reward/preference balance.
 
59
  - Check text/message schema compatibility with the trainer:
60
  - SFT: needs `messages`, `text`, or `prompt`/`completion`
61
  - DPO: needs `prompt`, `chosen`, `rejected`
@@ -75,6 +189,8 @@ A training script should include:
75
  - Plain-text logging: `disable_tqdm=True`, `logging_strategy="steps"`, `logging_first_step=True`.
76
  - Evaluation or validation pass.
77
  - `push_to_hub=True` or explicit upload of final artifacts.
 
 
78
 
79
  For long runs, include Trackio or equivalent monitoring and return the dashboard URL.
80
 
@@ -102,6 +218,7 @@ When something fails:
102
  - OOM: reduce per-device batch size while increasing gradient accumulation to keep effective batch size; enable gradient checkpointing; or choose larger hardware. Do not switch methods.
103
  - Divergence/NaN: lower learning rate, check labels/rewards, inspect samples.
104
  - Weak metric: compare against paper recipes, inspect errors, tune with a small sweep.
 
105
 
106
  Do not hide compromises. If preserving the original request is impossible, explain the constraint and ask for approval.
107
 
 
17
  For any non-trivial ML task, follow this loop:
18
 
19
  1. **Clarify**: One-sentence deliverable. Is it a model, a benchmark result, a dataset, a report?
20
+ 2. **Research**: Use the strongest evidence first.
21
+ - For broad or novel research, delegate focused literature/code/dataset probes to a research sub-agent when the active Codex runtime explicitly allows delegation. If delegation is unavailable, do the same probes directly and say that no separate sub-agent was allowed.
22
+ - Start with landmark and recent papers for novel tasks.
23
+ - Read methodology, experiments, and results sections before relying on abstracts.
24
+ - Trace citations to find recent downstream improvements.
25
+ - Locate linked datasets, models, and collections.
26
+ - Read current HF docs and GitHub examples before implementing.
27
+ - Use current web sources only when the answer depends on information outside HF and GitHub.
28
+ - Extract recipes: dataset, model, method, hyperparameters, metrics.
29
+ - If the user only wants a plan, stop after the research floor below and synthesize the plan. Do not implement.
30
+ 3. **Validate inputs**: Inspect dataset schema, splits, sample rows, and parquet availability. Verify model repo exists, architecture matches, tokenizer available, license compatible.
31
  4. **Implement smallest working version**: Use current HF docs and a working example as reference. Do not rely on memory for imports, config names, or trainer arguments.
32
  5. **Smoke test**: Run locally or in a small HF Job before the full run.
33
  6. **Run full job**: Submit to HF Jobs with realistic timeout, `push_to_hub=True`, and monitoring.
 
35
  8. **Ship**: Save code and configs to the source repo. Publish model weights, datasets, or Spaces to Hugging Face. Return artifact URLs.
36
  9. **Iterate**: If results are weak or broken, diagnose and run the next experiment. Do not stop after a plan.
37
 
38
+ ## Upstream plan_tool Compatibility
39
+
40
+ Upstream `huggingface/ml-intern` has a real `plan_tool` with strict semantics. In Codex, emulate it as closely as possible with `update_plan`.
41
+
42
+ Rules to preserve:
43
+ - Use plan tracking for tasks with 3 or more meaningful steps.
44
+ - Start with a full plan before deep work begins.
45
+ - Each update replaces the whole visible plan, not just one item.
46
+ - Keep exactly one item `in_progress` at a time.
47
+ - Mark items `completed` immediately after they fully succeed.
48
+ - Do not mark an item completed if it failed, is partial, or is blocked.
49
+ - Update at phase transitions, after major research tracks complete, and at final completion.
50
+
51
+ Compatibility mapping:
52
+ - Upstream `todos[].id` becomes a stable numeric prefix in the step text, for example `1. Research papers`.
53
+ - Upstream `todos[].content` becomes the human-readable step text.
54
+ - Upstream `todos[].status` maps directly to Codex `pending`, `in_progress`, `completed`.
55
+
56
+ Preferred shape:
57
+ 1. Research constraints / external platform facts
58
+ 2. Research papers, datasets, and benchmarks
59
+ 3. Research current docs and working code examples
60
+ 4. Synthesize plan or implement
61
+ 5. Verify and report
62
+
63
+ When the user only wants a plan, the final `update_plan` call should still mark the synthesis step completed before returning.
64
+
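The plan rules and compatibility mapping above can be sketched in Python. This is a minimal illustration, not code from this commit or from upstream: the `Todo` class, the `to_plan_steps` helper, and the returned dictionary shape are hypothetical, and the actual `update_plan` payload may differ. It also assumes that zero `in_progress` items is acceptable only once every item is `completed`.

```python
from dataclasses import dataclass

VALID_STATUSES = {"pending", "in_progress", "completed"}


@dataclass
class Todo:
    """Hypothetical stand-in for one upstream `todos[]` entry."""
    id: int
    content: str
    status: str


def to_plan_steps(todos):
    """Map todos to step dicts and enforce the one-in-progress rule.

    The step text carries the id as a stable numeric prefix,
    for example "1. Research papers".
    """
    for todo in todos:
        if todo.status not in VALID_STATUSES:
            raise ValueError(f"unknown status: {todo.status!r}")

    active = [t for t in todos if t.status == "in_progress"]
    if len(active) > 1:
        raise ValueError("more than one item is in_progress")
    # Assumption: an idle plan is valid only when all work is done.
    if not active and any(t.status != "completed" for t in todos):
        raise ValueError("no item is in_progress but work remains")

    return [
        {"step": f"{t.id}. {t.content}", "status": t.status}
        for t in todos
    ]
```

Because each update replaces the whole visible plan, a helper like this would be called with the full list of todos every time, not with a single changed item.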
65
  ## High-Risk Mistakes To Avoid
66
 
67
  - Hallucinated imports or trainer arguments from outdated memory.
 
83
  5. Find linked Hugging Face datasets, models, and collections.
84
  6. Inspect promising datasets before using them.
85
  7. Read current HF docs and GitHub examples before implementing.
86
+ 8. Use `github-example-search` when you need a working file path quickly, then read the file with GitHub.
87
+ 9. Use `web-search` when you need current information outside the Hugging Face Hub or GitHub.
88
+
89
+ When choosing sources, prefer them in this order:
90
+ 1. Published papers and citation graphs.
91
+ 2. Linked datasets/models/collections from the paper.
92
+ 3. Current HF docs.
93
+ 4. Working GitHub examples.
94
+ 5. Current web sources for non-HF facts or updates.
95
 
96
  Use the `hf-paper-search` skill for paper operations.
97
+ Use the `github-example-search` skill for example-file discovery in GitHub repos.
98
+ Use the `web-search` skill for general current web research.
99
+
100
+ ## Plan-Only Research Floor
101
+
102
+ When the user asks for research, recommendations, architecture, or a plan only, still behave like ML Intern's research pass. Do not stop at papers alone.
103
+
104
+ Minimum research floor:
105
+ - **Literature**: Search anchor papers and recent downstream papers. Read method/experiment/result sections where available. Include citation-graph or related-paper exploration for old anchors or fast-moving areas.
106
+ - **Datasets**: Search HF Hub for task-adjacent datasets and inspect at least the most plausible candidates. If no suitable public dataset exists, state that as a verified gap and propose custom collection/eval.
107
+ - **Code precedent**: Search GitHub for working implementations, especially in `transformers`, `trl`, `datasets`, `peft`, `accelerate`, `sentence-transformers`, or task-specific repos. Read the best candidate files before making API or architecture claims.
108
+ - **Docs**: Read current HF docs for any library/API that the plan depends on.
109
+ - **External constraints**: Use current web/official docs for non-HF platform constraints, policies, rate limits, pricing, or APIs.
110
+
111
+ For plan-only outputs, return a compact evidence table before the plan when useful:
112
+ - Source or artifact.
113
+ - What was verified.
114
+ - Design implication.
115
+ - Confidence or gap.
116
+
117
+ If runtime policy prevents spawning a research sub-agent, note that only as a process limitation; do not use it as a reason to skip dataset, code, docs, or citation-graph research.
118
+
119
+ ## Upstream research Compatibility
120
+
121
+ Upstream `huggingface/ml-intern` has a built-in `research` tool that launches an isolated research sub-agent with its own context and a read-only tool subset. In Codex, emulate that behavior as closely as the runtime allows.
122
+
123
+ Primary intent:
124
+ - Keep the main context focused on decisions and synthesis.
125
+ - Offload heavy literature/doc/code crawling into a separate agent context when delegation is explicitly allowed in the active runtime and by the user request.
126
+ - Use a read-only research scope: papers, docs, GitHub/file reads, dataset inspection, and web search. Avoid writes or implementation inside research.
127
+
128
+ When delegation is allowed:
129
+ - Spawn one focused sub-agent for broad research, or multiple focused sub-agents for parallel tracks.
130
+ - Give each sub-agent a narrow task with explicit scope, such as papers, datasets, docs, or code precedents.
131
+ - Ask for compact findings, not raw notes.
132
+ - Require recipe-level outputs: `dataset + method + hyperparameters -> result`.
133
+ - Require concrete references: paper ids, dataset ids, repo paths, doc pages.
134
+
135
+ When delegation is not allowed:
136
+ - Perform the same probes directly in the main context.
137
+ - State the limitation briefly as a process note only.
138
+ - Still preserve the upstream research order: papers first, then datasets, then docs/examples, then current external constraints.
139
+
140
+ Research prompt pattern to emulate:
141
+ - Start from anchor papers or landmark work.
142
+ - Crawl citation graphs or related recent work.
143
+ - Read methodology/experiment/result sections, not just abstracts.
144
+ - Validate linked datasets and code instead of trusting claims.
145
+ - End with a compact evidence-backed summary for the main agent.
146
+
147
+ Codex-specific wrap-up behavior:
148
+ - If research gets too broad, stop expanding scope and summarize verified findings.
149
+ - Return an evidence table before the final plan when useful.
150
+ - Distinguish `verified`, `inferred`, and `not checked`.
151
+
152
+ ## Parity With Upstream ML Intern
153
+
154
+ Upstream `huggingface/ml-intern` includes a built-in `research` tool that launches an isolated read-only research agent with papers, citation graph, dataset inspection, HF docs, GitHub example search, repository file reads, and web search. This Codex plugin approximates that behavior through skills and available tools.
155
+
156
+ To stay close to upstream behavior:
157
+ - Keep the main context focused on synthesis and decisions.
158
+ - Use parallel research where the runtime explicitly permits it; otherwise run focused sequential probes.
159
+ - Treat `update_plan` as the compatibility layer for upstream `plan_tool`.
160
+ - Prefer recipe-level findings: `dataset + method + hyperparameters -> metric/result`.
161
+ - Validate linked resources instead of trusting paper or README claims.
162
+ - Include concrete file paths, API names, dataset IDs, model IDs, and citation links.
163
+ - Distinguish "verified", "inferred", and "not checked".
164
+ - Preserve a trace of the research in the final response when no artifact is written: sources consulted, datasets/code/docs checked, and remaining gaps.
165
 
166
  ## Dataset Audit Pattern
167
 
 
169
  - Verify repo, config, split, and revision.
170
  - Check row counts, column names, and representative rows.
171
  - Look for missing values, invalid records, class imbalance, or reward/preference balance.
172
+ - If a paper suggested the dataset, verify the paper claim against the actual HF dataset contents instead of assuming they match.
173
  - Check text/message schema compatibility with the trainer:
174
  - SFT: needs `messages`, `text`, or `prompt`/`completion`
175
  - DPO: needs `prompt`, `chosen`, `rejected`
 
189
  - Plain-text logging: `disable_tqdm=True`, `logging_strategy="steps"`, `logging_first_step=True`.
190
  - Evaluation or validation pass.
191
  - `push_to_hub=True` or explicit upload of final artifacts.
192
+ - The smallest working path first, then optional refinements after the baseline is proven.
193
+ - If a step depends on a repository example, read the example before changing the implementation.
194
 
195
  For long runs, include Trackio or equivalent monitoring and return the dashboard URL.
196
 
 
218
  - OOM: reduce per-device batch size while increasing gradient accumulation to keep effective batch size; enable gradient checkpointing; or choose larger hardware. Do not switch methods.
219
  - Divergence/NaN: lower learning rate, check labels/rewards, inspect samples.
220
  - Weak metric: compare against paper recipes, inspect errors, tune with a small sweep.
221
+ - If the issue is ambiguous, return to the most authoritative source available before making a speculative change.
222
 
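The OOM guidance above (shrink the per-device batch while raising gradient accumulation) comes down to keeping one product constant. A minimal sketch, assuming the effective batch size is `per_device_batch_size * gradient_accumulation_steps * num_devices`; the helper name is hypothetical and is not part of the plugin:

```python
def halve_batch_keep_effective(per_device_batch_size, gradient_accumulation_steps):
    """Halve the per-device batch and double accumulation steps.

    The product of the two values, and therefore the effective batch
    size for a fixed number of devices, is unchanged.
    """
    if per_device_batch_size < 2 or per_device_batch_size % 2 != 0:
        raise ValueError("per-device batch size must be an even number >= 2")
    return per_device_batch_size // 2, gradient_accumulation_steps * 2


# Example: 8 x 4 and 4 x 8 both give 32 samples per optimizer step per device.
new_batch, new_accum = halve_batch_keep_effective(8, 4)
```

If the per-device batch is already 1, this lever is exhausted, and the remaining options from the list are gradient checkpointing or larger hardware.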
223
  Do not hide compromises. If preserving the original request is impossible, explain the constraint and ask for approval.
224
 
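The trainer schema requirements in the Dataset Audit Pattern can be checked mechanically before a job is submitted. The sketch below covers only the SFT and DPO rules visible in this diff; the `REQUIRED_COLUMNS` table and the `schema_matches` helper are hypothetical names, and the column list would in practice come from inspecting the real dataset.

```python
# Each trainer maps to a list of acceptable column sets; any one is enough.
REQUIRED_COLUMNS = {
    "sft": [{"messages"}, {"text"}, {"prompt", "completion"}],
    "dpo": [{"prompt", "chosen", "rejected"}],
}


def schema_matches(trainer, columns):
    """Return True if the dataset columns satisfy one accepted layout."""
    if trainer not in REQUIRED_COLUMNS:
        raise ValueError(f"no schema rule recorded for trainer {trainer!r}")
    available = set(columns)
    return any(required <= available for required in REQUIRED_COLUMNS[trainer])
```

A failed check is a signal to remap or reformat columns before training, not to pick a different dataset silently.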
plugins/{mlintern → ml-intern}/skills/ml-intern-harness/scripts/inspect_dataset.py RENAMED
File without changes
plugins/{mlintern → ml-intern}/skills/ml-intern-harness/scripts/papers.py RENAMED
File without changes
plugins/{mlintern → ml-intern}/skills/ml-intern-harness/scripts/preflight_check.py RENAMED
File without changes
plugins/ml-intern/skills/ml-intern/SKILL.md ADDED
@@ -0,0 +1,41 @@
1
+ ---
2
+ name: ml-intern
3
+ description: "Shortcut entry point for the ML Intern plugin. Use this when you want the main Hugging Face ML workflow without remembering the longer harness name."
4
+ disable-model-invocation: false
5
+ ---
6
+
7
+ # ml-intern
8
+
9
+ Use this as the short alias for the main plugin workflow.
10
+
11
+ This skill is only a router into `ml-intern-harness`. Do not improvise a different workflow.
12
+
13
+ If the task is non-trivial, delegate to `ml-intern-harness` and follow its exact research-first ML workflow:
14
+
15
+ - clarify the deliverable in one sentence
16
+ - if the task has 3 or more real steps, start with a full `update_plan` call that mirrors upstream `plan_tool` semantics
17
+ - for broad or novel research, use a research sub-agent when the active runtime explicitly allows delegation; otherwise perform the same focused research probes directly and disclose the limitation
18
+ - if the task is paper-backed or novel, use `hf-paper-search` first
19
+ - for plan-only research, still check likely HF datasets, current docs, and GitHub implementation precedents before synthesizing
20
+ - use `web-search` only when the answer depends on current information outside HF and GitHub
21
+ - validate datasets and models
22
+ - implement the smallest working version only after research
23
+ - smoke test before full runs
24
+ - run Hugging Face Jobs when needed
25
+ - evaluate and ship artifacts
26
+
27
+ For parity with upstream ML Intern:
28
+ - treat Codex `update_plan` as the compatibility layer for upstream `plan_tool`
29
+ - treat delegated focused research as the compatibility layer for upstream `research`
30
+ - update the plan at start, major phase changes, and completion
31
+ - keep research outputs compact, citation-backed, and recipe-level
32
+
33
+ If the user says they only want a plan, stop after the full research floor and return the plan. Do not start implementation.
34
+
35
+ For focused tasks, use the specialized skills directly:
36
+
37
+ - `hf-model-search`
38
+ - `hf-dataset-search`
39
+ - `hf-paper-search`
40
+ - `hf-docs`
41
+ - `hf-jobs`
plugins/ml-intern/skills/web-search/SKILL.md ADDED
@@ -0,0 +1,73 @@
1
+ ---
2
+ name: web-search
3
+ description: "Search the current web for up-to-date information, prefer authoritative sources, and return concise cited results with optional domain filtering."
4
+ disable-model-invocation: false
5
+ ---
6
+
7
+ # Web Search
8
+
9
+ ## Purpose
10
+
11
+ Replicate the useful part of ML Intern's generic web search behavior:
12
+ find current information outside the Hugging Face Hub and GitHub, prefer high-quality sources, and return a compact result list with links.
13
+
14
+ This skill is best used for:
15
+
16
+ - current product or API updates
17
+ - release notes and changelogs
18
+ - standards, policies, pricing, schedules, and announcements
19
+ - official docs and primary sources that are not on the Hub
20
+
21
+ ## Search Behavior
22
+
23
+ When searching the web:
24
+
25
+ 1. Start with the narrowest query that can reasonably work.
26
+ 2. Prefer authoritative sources first:
27
+ - official docs
28
+ - vendor docs
29
+ - standards bodies
30
+ - primary sources
31
+ - project repositories
32
+ 3. If a search is broad, use multiple focused queries instead of one giant query.
33
+ 4. If the result set looks noisy, restrict to known-good domains.
34
+ 5. If the source is time-sensitive, verify the publication or update date before using it.
35
+
36
+ ## Domain Filtering
37
+
38
+ Use domain filtering when possible:
39
+
40
+ - `allowed_domains` for official or trusted sources only
41
+ - `blocked_domains` to avoid low-quality aggregators, mirrors, or SEO spam
42
+
43
+ If both are available, prefer allowlists.
44
+
45
+ ## Workflow
46
+
47
+ 1. Define the question in one sentence.
48
+ 2. Search the web with 1-3 targeted queries.
49
+ 3. Narrow to authoritative sources or request domain restrictions if needed.
50
+ 4. Open the best sources and extract the relevant facts.
51
+ 5. Cross-check conflicting claims with at least one other source when the answer matters.
52
+ 6. Return links, dates, and a short explanation of why the source is trustworthy.
53
+
54
+ ## Output Expectations
55
+
56
+ Return:
57
+
58
+ - a short answer to the question
59
+ - 3-8 source links, ordered by usefulness
60
+ - publication or update dates when available
61
+ - a note if the answer is based on inference rather than direct source text
62
+
63
+ ## Style
64
+
65
+ - Keep results compact and citation-friendly.
66
+ - Prefer source titles and URLs over long quotes.
67
+ - For current or changing facts, always mention the date context.
68
+
69
+ ## Notes
70
+
71
+ - This is a best-effort Codex analogue to ML Intern's web search behavior.
72
+ - It does not replace paper search, docs search, or GitHub example search.
73
+ - Use it when the answer depends on current web sources rather than HF-native resources.
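The domain-filtering rule in the web-search skill (`allowed_domains`, `blocked_domains`, allowlist preferred when both are given) can be illustrated with a small post-filter over result URLs. This is a sketch under assumptions: the skill does not define how the filters are applied, so `filter_urls` and its subdomain matching are hypothetical.

```python
from urllib.parse import urlparse


def _host_matches(host, domain):
    """True for the domain itself or any of its subdomains."""
    return host == domain or host.endswith("." + domain)


def filter_urls(urls, allowed_domains=None, blocked_domains=None):
    """Keep URLs that pass the domain filters.

    When an allowlist is given it takes precedence and the blocklist
    is ignored, following the "prefer allowlists" rule.
    """
    kept = []
    for url in urls:
        host = (urlparse(url).hostname or "").lower()
        if allowed_domains:
            if any(_host_matches(host, d) for d in allowed_domains):
                kept.append(url)
            continue
        if blocked_domains and any(_host_matches(host, d) for d in blocked_domains):
            continue
        kept.append(url)
    return kept
```

Matching on the parsed hostname instead of a substring of the URL avoids accepting look-alike hosts such as `python.org.example.com`.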
plugins/mlintern/agents/openai.yaml DELETED
@@ -1,4 +0,0 @@
1
- interface:
2
- display_name: "ML Intern"
3
- short_description: "Hugging Face ML engineering agent"
4
- default_prompt: "Act as an ML engineering intern. Research the task, validate datasets and models, write and test code, run training or evaluation on Hugging Face Jobs, and ship artifacts."