Commit 4668dbd · unverified · 1 Parent(s): 6aebbdf
Committed by lewtun (HF Staff) and OpenAI Codex

Add CLI local model support (#228)

* Add CLI local model support

Co-authored-by: OpenAI Codex <codex@openai.com>

* Add shared local model endpoint fallback

Support LOCAL_LLM_BASE_URL and LOCAL_LLM_API_KEY as shared fallbacks while preserving provider-specific local overrides.

Co-authored-by: OpenAI Codex <codex@openai.com>

* Address local model review feedback

Clarify local probe failure behavior, add regression coverage for rejected local switches, and simplify local model validation.

Co-authored-by: OpenAI Codex <codex@openai.com>

---------

Co-authored-by: OpenAI Codex <codex@openai.com>

README.md CHANGED
@@ -28,10 +28,14 @@ Create a `.env` file in the project root (or export these in your shell):
28
  ```bash
29
  ANTHROPIC_API_KEY=<your-anthropic-api-key> # if using anthropic models
30
  OPENAI_API_KEY=<your-openai-api-key> # if using openai models
 
 
31
  HF_TOKEN=<your-hugging-face-token>
32
  GITHUB_TOKEN=<github-personal-access-token>
33
  ```
34
- If no `HF_TOKEN` is set, the CLI will prompt you to paste one on first launch. To get a GITHUB_TOKEN follow the tutorial [here](https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/managing-your-personal-access-tokens#creating-a-fine-grained-personal-access-token).
 
 
35
 
36
  ### Usage
37
 
@@ -52,12 +56,41 @@ ml-intern "fine-tune llama on my dataset"
52
  ```bash
53
  ml-intern --model anthropic/claude-opus-4-7 "your prompt" # requires ANTHROPIC_API_KEY
54
  ml-intern --model openai/gpt-5.5 "your prompt" # requires OPENAI_API_KEY
 
 
55
  ml-intern --max-iterations 100 "your prompt"
56
  ml-intern --no-stream "your prompt"
57
  ```
58
 
59
  Run `ml-intern` then `/model` to see the full list of suggested model ids
60
- (Claude, GPT, and HF-router models like MiniMax, Kimi, GLM, DeepSeek).
61
 
62
  ## Sharing Traces
63
 
 
28
  ```bash
29
  ANTHROPIC_API_KEY=<your-anthropic-api-key> # if using anthropic models
30
  OPENAI_API_KEY=<your-openai-api-key> # if using openai models
31
+ LOCAL_LLM_BASE_URL=http://localhost:8000 # shared fallback for local model prefixes
32
+ LOCAL_LLM_API_KEY=<optional-local-api-key> # optional shared local API key
33
  HF_TOKEN=<your-hugging-face-token>
34
  GITHUB_TOKEN=<github-personal-access-token>
35
  ```
36
+ If no `HF_TOKEN` is set, the CLI will prompt you to paste one on first launch
37
+ unless you start on a local model. To get a GITHUB_TOKEN follow the tutorial
38
+ [here](https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/managing-your-personal-access-tokens#creating-a-fine-grained-personal-access-token).
39
 
40
  ### Usage
41
 
 
56
  ```bash
57
  ml-intern --model anthropic/claude-opus-4-7 "your prompt" # requires ANTHROPIC_API_KEY
58
  ml-intern --model openai/gpt-5.5 "your prompt" # requires OPENAI_API_KEY
59
+ ml-intern --model ollama/llama3.1:8b "your prompt"
60
+ ml-intern --model vllm/meta-llama/Llama-3.1-8B-Instruct "your prompt"
61
  ml-intern --max-iterations 100 "your prompt"
62
  ml-intern --no-stream "your prompt"
63
  ```
64
 
65
  Run `ml-intern` then `/model` to see the full list of suggested model ids
66
+ (Claude, GPT, HF-router models like MiniMax, Kimi, GLM, DeepSeek, and local
67
+ model prefixes).
68
+
69
+ **Local models:**
70
+
71
+ Local model support uses OpenAI-compatible HTTP endpoints through LiteLLM. The
72
+ agent does not load model weights directly from disk; start your inference
73
+ server first, then select it with a provider-specific model prefix:
74
+
75
+ ```bash
76
+ ml-intern --model ollama/llama3.1:8b "your prompt"
77
+ ml-intern --model vllm/meta-llama/Llama-3.1-8B-Instruct "your prompt"
78
+ ```
79
+
80
+ Inside interactive mode, switch with `/model`:
81
+
82
+ ```text
83
+ /model ollama/llama3.1:8b
84
+ /model lm_studio/google/gemma-3-4b
85
+ /model llamacpp/llama-3.1-8b-instruct
86
+ ```
87
+
88
+ Supported local prefixes are `ollama/`, `vllm/`, `lm_studio/`, and
89
+ `llamacpp/`. Set `LOCAL_LLM_BASE_URL` and optional `LOCAL_LLM_API_KEY` to use
90
+ one shared local endpoint, or override a specific provider with its matching
91
+ `*_BASE_URL` / `*_API_KEY` variable, such as `OLLAMA_BASE_URL` or
92
+ `VLLM_API_KEY`. Provider-specific variables take precedence over the shared
93
+ local variables. Base URLs may include or omit `/v1`.
94
 
95
  ## Sharing Traces
96
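Reviewer note (illustrative, not part of the commit): the README text above describes a three-step precedence for local endpoints. The provider-specific variable is tried first, then the shared `LOCAL_LLM_BASE_URL`, then a built-in localhost default, and `/v1` is appended when it is missing. A minimal Python sketch of that rule follows. The function name `resolve_local_endpoint` and the `env` mapping argument are assumptions made for this example; the commit itself reads `os.environ` inside `_resolve_local_model_params`, and only two of the four providers are listed here for brevity.

```python
from typing import Mapping

# Defaults copied from the provider table added in this commit
# (only two providers shown; the table layout here is simplified).
PROVIDER_DEFAULTS = {
    "ollama/": ("OLLAMA_BASE_URL", "http://localhost:11434"),
    "vllm/": ("VLLM_BASE_URL", "http://localhost:8000"),
}
SHARED_BASE_URL_ENV = "LOCAL_LLM_BASE_URL"


def resolve_local_endpoint(model_id: str, env: Mapping[str, str]) -> str:
    """Hypothetical helper: pick a base URL and normalise it to end in /v1."""
    for prefix, (env_name, default) in PROVIDER_DEFAULTS.items():
        if model_id.startswith(prefix):
            # Provider-specific variable wins, then the shared fallback,
            # then the provider's built-in localhost default.
            raw = env.get(env_name) or env.get(SHARED_BASE_URL_ENV) or default
            base = raw.strip().rstrip("/")
            return base if base.endswith("/v1") else f"{base}/v1"
    raise ValueError(f"not a local model id: {model_id}")


# Example: a shared endpoint is used unless the provider has its own override.
shared_only = {"LOCAL_LLM_BASE_URL": "http://localhost:9000/v1/"}
both = {"LOCAL_LLM_BASE_URL": "http://a", "VLLM_BASE_URL": "http://b"}
print(resolve_local_endpoint("vllm/custom-model", shared_only))
print(resolve_local_endpoint("vllm/custom-model", both))
```

Passing the environment as a mapping keeps the sketch easy to test; the precedence itself is what the tests in `tests/unit/test_llm_params.py` exercise.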
 
agent/core/llm_params.py CHANGED
@@ -5,7 +5,17 @@ can import it without pulling in the whole agent loop / tool router and
5
  creating circular imports.
6
  """
7
 
 
 
8
  from agent.core.hf_tokens import get_hf_bill_to, resolve_hf_router_token
 
 
 
 
 
 
 
 
9
 
10
 
11
  def _resolve_hf_router_token(session_hf_token: str | None = None) -> str | None:
@@ -96,6 +106,46 @@ class UnsupportedEffortError(ValueError):
96
  """
97
 
98

99
  def _resolve_llm_params(
100
  model_name: str,
101
  session_hf_token: str | None = None,
@@ -121,6 +171,12 @@ def _resolve_llm_params(
121
  • ``openai/<model>`` — ``reasoning_effort`` forwarded as a top-level
122
  kwarg (GPT-5 / o-series). LiteLLM uses the user's ``OPENAI_API_KEY``.
123
 
 
 
 
 
 
 
124
  • Anything else is treated as a HuggingFace router id. We hit the
125
  auto-routing OpenAI-compatible endpoint at
126
  ``https://router.huggingface.co/v1``. The id can be bare or carry an
@@ -187,6 +243,12 @@ def _resolve_llm_params(
187
  params["reasoning_effort"] = reasoning_effort
188
  return params
189
 
 
 
 
 
 
 
190
  hf_model = model_name.removeprefix("huggingface/")
191
  api_key = _resolve_hf_router_token(session_hf_token)
192
  params = {
 
5
  creating circular imports.
6
  """
7
 
8
+ import os
9
+
10
  from agent.core.hf_tokens import get_hf_bill_to, resolve_hf_router_token
11
+ from agent.core.local_models import (
12
+ LOCAL_MODEL_API_KEY_DEFAULT,
13
+ LOCAL_MODEL_API_KEY_ENV,
14
+ LOCAL_MODEL_BASE_URL_ENV,
15
+ is_reserved_local_model_id,
16
+ local_model_name,
17
+ local_model_provider,
18
+ )
19
 
20
 
21
  def _resolve_hf_router_token(session_hf_token: str | None = None) -> str | None:
 
106
  """
107
 
108
 
109
+ def _local_api_base(base_url: str) -> str:
110
+ base = base_url.strip().rstrip("/")
111
+ if base.endswith("/v1"):
112
+ return base
113
+ return f"{base}/v1"
114
+
115
+
116
+ def _resolve_local_model_params(
117
+ model_name: str,
118
+ reasoning_effort: str | None = None,
119
+ strict: bool = False,
120
+ ) -> dict:
121
+ if reasoning_effort and strict:
122
+ raise UnsupportedEffortError(
123
+ "Local OpenAI-compatible endpoints don't accept reasoning_effort"
124
+ )
125
+
126
+ local_name = local_model_name(model_name)
127
+ if local_name is None:
128
+ raise ValueError(f"Unsupported local model id: {model_name}")
129
+
130
+ provider = local_model_provider(model_name)
131
+ assert provider is not None
132
+ raw_base = (
133
+ os.environ.get(provider["base_url_env"])
134
+ or os.environ.get(LOCAL_MODEL_BASE_URL_ENV)
135
+ or provider["base_url_default"]
136
+ )
137
+ api_key = (
138
+ os.environ.get(provider["api_key_env"])
139
+ or os.environ.get(LOCAL_MODEL_API_KEY_ENV)
140
+ or LOCAL_MODEL_API_KEY_DEFAULT
141
+ )
142
+ return {
143
+ "model": f"openai/{local_name}",
144
+ "api_base": _local_api_base(raw_base),
145
+ "api_key": api_key,
146
+ }
147
+
148
+
149
  def _resolve_llm_params(
150
  model_name: str,
151
  session_hf_token: str | None = None,
 
171
  • ``openai/<model>`` — ``reasoning_effort`` forwarded as a top-level
172
  kwarg (GPT-5 / o-series). LiteLLM uses the user's ``OPENAI_API_KEY``.
173
 
174
+ • ``ollama/<model>``, ``vllm/<model>``, ``lm_studio/<model>``, and
175
+ ``llamacpp/<model>`` — local OpenAI-compatible endpoints. The id prefix
176
+ selects a configurable localhost base URL, and the model suffix is sent
177
+ to LiteLLM as ``openai/<model>``. These endpoints don't receive
178
+ ``reasoning_effort``.
179
+
180
  • Anything else is treated as a HuggingFace router id. We hit the
181
  auto-routing OpenAI-compatible endpoint at
182
  ``https://router.huggingface.co/v1``. The id can be bare or carry an
 
243
  params["reasoning_effort"] = reasoning_effort
244
  return params
245
 
246
+ if is_reserved_local_model_id(model_name):
247
+ raise ValueError(f"Unsupported local model id: {model_name}")
248
+
249
+ if local_model_provider(model_name) is not None:
250
+ return _resolve_local_model_params(model_name, reasoning_effort, strict)
251
+
252
  hf_model = model_name.removeprefix("huggingface/")
253
  api_key = _resolve_hf_router_token(session_hf_token)
254
  params = {
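Reviewer note (illustrative, not part of the commit): read together, the docstring bullets and the new branch in `_resolve_llm_params` suggest a routing order. Direct `anthropic/` and `openai/` ids appear to be handled first, then the reserved `openai-compat/` prefix is rejected, then the four local prefixes are resolved, and anything else goes to the HF router. The sketch below models only that ordering. The function `route` and its string labels are hypothetical; the real function returns LiteLLM parameter dicts rather than labels.

```python
LOCAL_PREFIXES = ("ollama/", "vllm/", "lm_studio/", "llamacpp/")
RESERVED_PREFIXES = ("openai-compat/",)
DIRECT_PREFIXES = ("anthropic/", "openai/")


def route(model_id: str) -> str:
    """Hypothetical classifier mirroring the branch order in the diff."""
    if model_id.startswith(DIRECT_PREFIXES):
        return "direct"
    if model_id.startswith(RESERVED_PREFIXES):
        # Reserved prefixes are rejected instead of reaching the HF router.
        raise ValueError(f"Unsupported local model id: {model_id}")
    for prefix in LOCAL_PREFIXES:
        if model_id.startswith(prefix):
            if not model_id[len(prefix):]:
                # A bare prefix such as "ollama/" has no model name and
                # must not fall through to the HF router either.
                raise ValueError(f"Unsupported local model id: {model_id}")
            return "local"
    return "hf-router"


for candidate in ("openai/gpt-5.5", "ollama/llama3.1:8b", "MiniMaxAI/MiniMax-M2.7"):
    print(candidate, "->", route(candidate))
```

The two `ValueError` cases correspond to the regression tests `test_openai_compat_prefix_is_not_a_local_escape_hatch` and `test_empty_local_model_id_is_not_treated_as_hf_router` in this commit.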
agent/core/local_models.py ADDED
@@ -0,0 +1,59 @@
+ """Helpers for CLI local OpenAI-compatible model ids."""
+
+ LOCAL_MODEL_PROVIDERS: dict[str, dict[str, str]] = {
+     "ollama/": {
+         "base_url_env": "OLLAMA_BASE_URL",
+         "base_url_default": "http://localhost:11434",
+         "api_key_env": "OLLAMA_API_KEY",
+     },
+     "vllm/": {
+         "base_url_env": "VLLM_BASE_URL",
+         "base_url_default": "http://localhost:8000",
+         "api_key_env": "VLLM_API_KEY",
+     },
+     "lm_studio/": {
+         "base_url_env": "LMSTUDIO_BASE_URL",
+         "base_url_default": "http://127.0.0.1:1234",
+         "api_key_env": "LMSTUDIO_API_KEY",
+     },
+     "llamacpp/": {
+         "base_url_env": "LLAMACPP_BASE_URL",
+         "base_url_default": "http://localhost:8080",
+         "api_key_env": "LLAMACPP_API_KEY",
+     },
+ }
+
+ LOCAL_MODEL_PREFIXES = tuple(LOCAL_MODEL_PROVIDERS)
+ RESERVED_LOCAL_MODEL_PREFIXES = ("openai-compat/",)
+ LOCAL_MODEL_BASE_URL_ENV = "LOCAL_LLM_BASE_URL"
+ LOCAL_MODEL_API_KEY_ENV = "LOCAL_LLM_API_KEY"
+ LOCAL_MODEL_API_KEY_DEFAULT = "sk-local-no-key-required"
+
+
+ def local_model_provider(model_id: str) -> dict[str, str] | None:
+     """Return provider config for a local model id, if it uses a local prefix."""
+     for prefix, config in LOCAL_MODEL_PROVIDERS.items():
+         if model_id.startswith(prefix):
+             return config
+     return None
+
+
+ def local_model_name(model_id: str) -> str | None:
+     """Return the backend model name with the local provider prefix removed."""
+     for prefix in LOCAL_MODEL_PREFIXES:
+         if model_id.startswith(prefix):
+             name = model_id[len(prefix) :]
+             return name or None
+     return None
+
+
+ def is_local_model_id(model_id: str) -> bool:
+     """Return True for non-empty, whitespace-free local model ids."""
+     if not model_id or any(char.isspace() for char in model_id):
+         return False
+     return local_model_name(model_id) is not None
+
+
+ def is_reserved_local_model_id(model_id: str) -> bool:
+     """Return True for local-style prefixes intentionally not supported."""
+     return model_id.startswith(RESERVED_LOCAL_MODEL_PREFIXES)
agent/core/model_switcher.py CHANGED
@@ -15,7 +15,17 @@ glues it to CLI output + session state.
15
 
16
  from __future__ import annotations
17
 
 
 
 
 
18
  from agent.core.effort_probe import ProbeInconclusive, probe_effort
 
 
 
 
 
 
19
 
20
 
21
  # Suggested models shown by `/model` (not a gate). Users can paste any HF
@@ -40,6 +50,8 @@ SUGGESTED_MODELS = [
40
 
41
 
42
  _ROUTING_POLICIES = {"fastest", "cheapest", "preferred"}
 
 
43
 
44
 
45
  def is_valid_model_id(model_id: str) -> bool:
@@ -48,13 +60,22 @@ def is_valid_model_id(model_id: str) -> bool:
48
  Accepts:
49
  • anthropic/<model>
50
  • openai/<model>
 
51
  • <org>/<model>[:<tag>] (HF router; tag = provider or policy)
52
  • huggingface/<org>/<model>[:<tag>] (same, accepts legacy prefix)
53
 
54
  Actual availability is verified against the HF router catalog on
55
  switch, and by the provider on the probe's ping call.
56
  """
57
- if not model_id or "/" not in model_id:
 
 
 
 
 
 
 
 
58
  return False
59
  head = model_id.split(":", 1)[0]
60
  parts = head.split("/")
@@ -70,7 +91,7 @@ def _print_hf_routing_info(model_id: str, console) -> bool:
70
  Anthropic / OpenAI ids return ``True`` without printing anything —
71
  the probe below covers "does this model exist".
72
  """
73
- if model_id.startswith(("anthropic/", "openai/")):
74
  return True
75
 
76
  from agent.core import hf_router_catalog as cat
@@ -141,7 +162,9 @@ def print_model_listing(config, console) -> None:
141
  console.print(
142
  "\n[dim]Paste any HF model id (e.g. 'MiniMaxAI/MiniMax-M2.7').\n"
143
  "Add ':fastest', ':cheapest', ':preferred', or ':<provider>' to override routing.\n"
144
- "Use 'anthropic/<model>' or 'openai/<model>' for direct API access.[/dim]"
 
 
145
  )
146
 
147
 
@@ -151,7 +174,21 @@ def print_invalid_id(arg: str, console) -> None:
151
  "[dim]Expected:\n"
152
  " • <org>/<model>[:tag] (HF router — paste from huggingface.co)\n"
153
  " • anthropic/<model>\n"
154
- " • openai/<model>[/dim]"
155
  )
156
 
157
 
@@ -173,9 +210,26 @@ async def probe_and_switch_model(
173
  * ✗ hard error (auth, model-not-found, quota) — we reject the switch
174
  and keep the current model so the user isn't stranded
175
 
176
- Transient errors (5xx, timeout) complete the switch with a yellow
177
- warning; the next real call re-surfaces the error if it's persistent.
 
 
178
  """
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
179
  preference = config.reasoning_effort
180
  if not _print_hf_routing_info(model_id, console):
181
  return
 
15
 
16
  from __future__ import annotations
17
 
18
+ import asyncio
19
+
20
+ from litellm import acompletion
21
+
22
  from agent.core.effort_probe import ProbeInconclusive, probe_effort
23
+ from agent.core.llm_params import _resolve_llm_params
24
+ from agent.core.local_models import (
25
+ LOCAL_MODEL_PREFIXES,
26
+ is_local_model_id,
27
+ is_reserved_local_model_id,
28
+ )
29
 
30
 
31
  # Suggested models shown by `/model` (not a gate). Users can paste any HF
 
50
 
51
 
52
  _ROUTING_POLICIES = {"fastest", "cheapest", "preferred"}
53
+ _DIRECT_PREFIXES = ("anthropic/", "openai/", *LOCAL_MODEL_PREFIXES)
54
+ _LOCAL_PROBE_TIMEOUT = 15.0
55
 
56
 
57
  def is_valid_model_id(model_id: str) -> bool:
 
60
  Accepts:
61
  • anthropic/<model>
62
  • openai/<model>
63
+ • ollama/<model>, vllm/<model>, lm_studio/<model>, llamacpp/<model>
64
  • <org>/<model>[:<tag>] (HF router; tag = provider or policy)
65
  • huggingface/<org>/<model>[:<tag>] (same, accepts legacy prefix)
66
 
67
  Actual availability is verified against the HF router catalog on
68
  switch, and by the provider on the probe's ping call.
69
  """
70
+ if not model_id:
71
+ return False
72
+ if is_local_model_id(model_id):
73
+ return True
74
+ if is_reserved_local_model_id(model_id):
75
+ return False
76
+ if any(model_id.startswith(prefix) for prefix in LOCAL_MODEL_PREFIXES):
77
+ return False
78
+ if "/" not in model_id:
79
  return False
80
  head = model_id.split(":", 1)[0]
81
  parts = head.split("/")
 
91
  Anthropic / OpenAI ids return ``True`` without printing anything —
92
  the probe below covers "does this model exist".
93
  """
94
+ if model_id.startswith(_DIRECT_PREFIXES):
95
  return True
96
 
97
  from agent.core import hf_router_catalog as cat
 
162
  console.print(
163
  "\n[dim]Paste any HF model id (e.g. 'MiniMaxAI/MiniMax-M2.7').\n"
164
  "Add ':fastest', ':cheapest', ':preferred', or ':<provider>' to override routing.\n"
165
+ "Use 'anthropic/<model>' or 'openai/<model>' for direct API access.\n"
166
+ "Use 'ollama/<model>', 'vllm/<model>', 'lm_studio/<model>', or "
167
+ "'llamacpp/<model>' for local OpenAI-compatible endpoints.[/dim]"
168
  )
169
 
170
 
 
174
  "[dim]Expected:\n"
175
  " • <org>/<model>[:tag] (HF router — paste from huggingface.co)\n"
176
  " • anthropic/<model>\n"
177
+ " • openai/<model>\n"
178
+ " • ollama/<model> | vllm/<model> | lm_studio/<model> | llamacpp/<model>[/dim]"
179
+ )
180
+
181
+
182
+ async def _probe_local_model(model_id: str) -> None:
183
+ params = _resolve_llm_params(model_id)
184
+ await asyncio.wait_for(
185
+ acompletion(
186
+ messages=[{"role": "user", "content": "ping"}],
187
+ max_tokens=1,
188
+ stream=False,
189
+ **params,
190
+ ),
191
+ timeout=_LOCAL_PROBE_TIMEOUT,
192
  )
193
 
194
 
 
210
  * ✗ hard error (auth, model-not-found, quota) — we reject the switch
211
  and keep the current model so the user isn't stranded
212
 
213
+ For non-local models, transient errors (5xx, timeout) complete the switch
214
+ with a yellow warning; the next real call re-surfaces the error if it's
215
+ persistent. Local models reject every probe error, including timeouts, and
216
+ keep the current model.
217
  """
218
+ if is_local_model_id(model_id):
219
+ console.print(f"[dim]checking local model {model_id}...[/dim]")
220
+ try:
221
+ await _probe_local_model(model_id)
222
+ except Exception as e:
223
+ console.print(f"[bold red]Switch failed:[/bold red] {e}")
224
+ console.print(f"[dim]Keeping current model: {config.model_name}[/dim]")
225
+ return
226
+
227
+ _commit_switch(model_id, config, session, effective=None, cache=True)
228
+ console.print(
229
+ f"[green]Model switched to {model_id}[/green] [dim](effort: off)[/dim]"
230
+ )
231
+ return
232
+
233
  preference = config.reasoning_effort
234
  if not _print_hf_routing_info(model_id, console):
235
  return
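Reviewer note (illustrative, not part of the commit): the updated docstring says local models reject every probe error, including timeouts, and keep the current model. The sketch below shows that behaviour in isolation using `asyncio.wait_for`. The names `try_switch`, `healthy_ping`, `hanging_ping`, and `refused_ping` are hypothetical stand-ins; the commit calls LiteLLM's `acompletion` with a 15-second timeout, which is shortened here so the example finishes quickly.

```python
import asyncio
from typing import Awaitable, Callable

PROBE_TIMEOUT = 0.05  # seconds; the commit uses 15.0, shortened for the demo


async def try_switch(
    current: str,
    candidate: str,
    ping: Callable[[], Awaitable[object]],
    timeout: float = PROBE_TIMEOUT,
) -> str:
    """Return the model id that should be active after probing `candidate`."""
    try:
        # wait_for cancels the ping and raises a timeout error if it is slow.
        await asyncio.wait_for(ping(), timeout=timeout)
    except Exception:
        # Any failure (timeout, refused connection, ...) keeps the old model.
        return current
    return candidate


async def healthy_ping() -> str:
    return "pong"


async def hanging_ping() -> None:
    await asyncio.sleep(1.0)  # longer than PROBE_TIMEOUT


async def refused_ping() -> None:
    raise ConnectionRefusedError("no server")


for ping in (healthy_ping, hanging_ping, refused_ping):
    active = asyncio.run(try_switch("openai/gpt-5.5", "ollama/llama3.1:8b", ping))
    print(ping.__name__, "->", active)
```

Treating a timeout as a hard failure makes sense for a local server, since a server that has not been started never recovers on its own, whereas the non-local path still tolerates transient errors.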
agent/main.py CHANGED
@@ -25,6 +25,7 @@ from agent.core.approval_policy import is_scheduled_operation
25
  from agent.core.agent_loop import submission_loop
26
  from agent.core import model_switcher
27
  from agent.core.hf_tokens import resolve_hf_token
 
28
  from agent.core.session import OpType
29
  from agent.core.tools import ToolRouter
30
  from agent.messaging.gateway import NotificationGateway
@@ -967,15 +968,15 @@ async def main(model: str | None = None):
967
  # Create prompt session for input (needed early for token prompt)
968
  prompt_session = PromptSession()
969
 
970
- # HF token — required, prompt if missing
971
- hf_token = resolve_hf_token()
972
- if not hf_token:
973
- hf_token = await _prompt_and_save_hf_token(prompt_session)
974
-
975
  config = load_config(CLI_CONFIG_PATH, include_user_defaults=True)
976
  if model:
977
  config.model_name = model
978
 
 
 
 
 
 
979
  # Resolve username for banner
980
  hf_user = _get_hf_user(hf_token)
981
 
@@ -1198,25 +1199,27 @@ async def headless_main(
1198
  logging.basicConfig(level=logging.WARNING)
1199
  _configure_runtime_logging()
1200
 
 
 
 
 
 
 
1201
  hf_token = resolve_hf_token()
1202
- if not hf_token:
1203
  print(
1204
  "ERROR: No HF token found. Set HF_TOKEN or run `huggingface-cli login`.",
1205
  file=sys.stderr,
1206
  )
1207
  sys.exit(1)
1208
 
1209
- print("HF token loaded", file=sys.stderr)
 
1210
 
1211
- config = load_config(CLI_CONFIG_PATH, include_user_defaults=True)
1212
- config.yolo_mode = True # Auto-approve everything in headless mode
1213
  notification_gateway = NotificationGateway(config.messaging)
1214
  await notification_gateway.start()
1215
  hf_user = _get_hf_user(hf_token)
1216
 
1217
- if model:
1218
- config.model_name = model
1219
-
1220
  if max_iterations is not None:
1221
  config.max_iterations = max_iterations
1222
 
 
25
  from agent.core.agent_loop import submission_loop
26
  from agent.core import model_switcher
27
  from agent.core.hf_tokens import resolve_hf_token
28
+ from agent.core.local_models import is_local_model_id
29
  from agent.core.session import OpType
30
  from agent.core.tools import ToolRouter
31
  from agent.messaging.gateway import NotificationGateway
 
968
  # Create prompt session for input (needed early for token prompt)
969
  prompt_session = PromptSession()
970
 
 
 
 
 
 
971
  config = load_config(CLI_CONFIG_PATH, include_user_defaults=True)
972
  if model:
973
  config.model_name = model
974
 
975
+ # HF token — required for Hub-backed models/tools, but not for local LLMs.
976
+ hf_token = resolve_hf_token()
977
+ if not hf_token and not is_local_model_id(config.model_name):
978
+ hf_token = await _prompt_and_save_hf_token(prompt_session)
979
+
980
  # Resolve username for banner
981
  hf_user = _get_hf_user(hf_token)
982
 
 
1199
  logging.basicConfig(level=logging.WARNING)
1200
  _configure_runtime_logging()
1201
 
1202
+ config = load_config(CLI_CONFIG_PATH, include_user_defaults=True)
1203
+ config.yolo_mode = True # Auto-approve everything in headless mode
1204
+
1205
+ if model:
1206
+ config.model_name = model
1207
+
1208
  hf_token = resolve_hf_token()
1209
+ if not hf_token and not is_local_model_id(config.model_name):
1210
  print(
1211
  "ERROR: No HF token found. Set HF_TOKEN or run `huggingface-cli login`.",
1212
  file=sys.stderr,
1213
  )
1214
  sys.exit(1)
1215
 
1216
+ if hf_token:
1217
+ print("HF token loaded", file=sys.stderr)
1218
 
 
 
1219
  notification_gateway = NotificationGateway(config.messaging)
1220
  await notification_gateway.start()
1221
  hf_user = _get_hf_user(hf_token)
1222
 
 
 
 
1223
  if max_iterations is not None:
1224
  config.max_iterations = max_iterations
1225
 
tests/unit/test_cli_local_models.py ADDED
@@ -0,0 +1,121 @@
1
+ import pytest
2
+
3
+ from agent.core import model_switcher
4
+ from agent.core.local_models import is_local_model_id
5
+
6
+
7
+ def test_local_model_helper_accepts_supported_prefixes():
8
+ assert is_local_model_id("ollama/llama3.1:8b")
9
+ assert is_local_model_id("vllm/meta-llama/Llama-3.1-8B-Instruct")
10
+ assert is_local_model_id("lm_studio/google/gemma-3-4b")
11
+ assert is_local_model_id("llamacpp/unsloth/Qwen3.5-2B")
12
+
13
+
14
+ def test_model_switcher_accepts_supported_local_prefixes():
15
+ assert model_switcher.is_valid_model_id("ollama/llama3.1:8b")
16
+ assert model_switcher.is_valid_model_id("vllm/meta-llama/Llama-3.1-8B")
17
+ assert model_switcher.is_valid_model_id("lm_studio/google/gemma-3-4b")
18
+ assert model_switcher.is_valid_model_id("llamacpp/llama-3.1-8b")
19
+
20
+
21
+ def test_model_switcher_rejects_empty_or_whitespace_local_ids():
22
+ assert not model_switcher.is_valid_model_id("ollama/")
23
+ assert not model_switcher.is_valid_model_id("vllm/")
24
+ assert not model_switcher.is_valid_model_id("lm_studio/")
25
+ assert not model_switcher.is_valid_model_id("llamacpp/")
26
+ assert not model_switcher.is_valid_model_id("ollama/llama 3.1")
27
+
28
+
29
+ def test_openai_compat_prefix_is_not_supported():
30
+ assert not model_switcher.is_valid_model_id("openai-compat/custom-model")
31
+
32
+
33
+ def test_local_models_skip_hf_router_catalog_output():
34
+ class NoPrintConsole:
35
+ def print(self, *args, **kwargs):
36
+ raise AssertionError("local models should not print HF catalog info")
37
+
38
+ assert model_switcher._print_hf_routing_info(
39
+ "ollama/llama3.1:8b",
40
+ NoPrintConsole(),
41
+ )
42
+
43
+
44
+ @pytest.mark.asyncio
45
+ async def test_probe_and_switch_local_model_uses_no_effort(monkeypatch):
46
+ calls = []
47
+
48
+ async def fake_acompletion(**kwargs):
49
+ calls.append(kwargs)
50
+ return object()
51
+
52
+ monkeypatch.setattr(model_switcher, "acompletion", fake_acompletion)
53
+
54
+ class Config:
55
+ model_name = "openai/gpt-5.5"
56
+ reasoning_effort = "max"
57
+
58
+ class Session:
59
+ def __init__(self):
60
+ self.model_id = None
61
+ self.model_effective_effort = {}
62
+
63
+ def update_model(self, model_id):
64
+ self.model_id = model_id
65
+
66
+ class Console:
67
+ def print(self, *args, **kwargs):
68
+ pass
69
+
70
+ session = Session()
71
+ await model_switcher.probe_and_switch_model(
72
+ "ollama/llama3.1:8b",
73
+ Config(),
74
+ session,
75
+ Console(),
76
+ hf_token=None,
77
+ )
78
+
79
+ assert session.model_id == "ollama/llama3.1:8b"
80
+ assert session.model_effective_effort["ollama/llama3.1:8b"] is None
81
+ assert calls[0]["model"] == "openai/llama3.1:8b"
82
+ assert "reasoning_effort" not in calls[0]
83
+ assert "extra_body" not in calls[0]
84
+
85
+
86
+ @pytest.mark.asyncio
87
+ async def test_probe_and_switch_local_model_rejects_probe_errors(monkeypatch):
88
+ async def failing_acompletion(**kwargs):
89
+ raise ConnectionRefusedError("no server")
90
+
91
+ monkeypatch.setattr(model_switcher, "acompletion", failing_acompletion)
92
+
93
+ class Config:
94
+ model_name = "openai/gpt-5.5"
95
+ reasoning_effort = None
96
+
97
+ class Session:
98
+ def __init__(self):
99
+ self.model_id = None
100
+ self.model_effective_effort = {}
101
+
102
+ def update_model(self, model_id):
103
+ self.model_id = model_id
104
+
105
+ class Console:
106
+ def print(self, *args, **kwargs):
107
+ pass
108
+
109
+ config = Config()
110
+ session = Session()
111
+ await model_switcher.probe_and_switch_model(
112
+ "ollama/llama3.1:8b",
113
+ config,
114
+ session,
115
+ Console(),
116
+ hf_token=None,
117
+ )
118
+
119
+ assert config.model_name == "openai/gpt-5.5"
120
+ assert session.model_id is None
121
+ assert "ollama/llama3.1:8b" not in session.model_effective_effort
tests/unit/test_llm_params.py CHANGED
@@ -1,3 +1,5 @@
 
 
1
  from agent.core.hf_tokens import resolve_hf_request_token
2
  from agent.core.llm_params import (
3
  UnsupportedEffortError,
@@ -30,6 +32,93 @@ def test_openai_max_effort_is_still_rejected():
30
  raise AssertionError("Expected UnsupportedEffortError for max effort")
31
 
32

33
  def test_hf_router_token_prefers_inference_token(monkeypatch):
34
  monkeypatch.setenv("INFERENCE_TOKEN", " inference-token ")
35
  monkeypatch.setenv("HF_TOKEN", "hf-token")
 
1
+ import pytest
2
+
3
  from agent.core.hf_tokens import resolve_hf_request_token
4
  from agent.core.llm_params import (
5
  UnsupportedEffortError,
 
32
  raise AssertionError("Expected UnsupportedEffortError for max effort")
33
 
34
 
35
+ def test_resolve_ollama_params_adds_v1_and_uses_default_key(monkeypatch):
36
+ monkeypatch.delenv("OLLAMA_API_KEY", raising=False)
37
+ monkeypatch.setenv("OLLAMA_BASE_URL", "http://localhost:11434")
38
+
39
+ params = _resolve_llm_params("ollama/llama3.1:8b")
40
+
41
+ assert params == {
42
+ "model": "openai/llama3.1:8b",
43
+ "api_base": "http://localhost:11434/v1",
44
+ "api_key": "sk-local-no-key-required",
45
+ }
46
+
47
+
48
+ def test_resolve_vllm_params_keeps_existing_v1_and_trims_slash(monkeypatch):
49
+ monkeypatch.delenv("VLLM_API_KEY", raising=False)
50
+ monkeypatch.setenv("VLLM_BASE_URL", "http://localhost:8000/v1/")
51
+
52
+ params = _resolve_llm_params("vllm/meta-llama/Llama-3.1-8B-Instruct")
53
+
54
+ assert params["model"] == "openai/meta-llama/Llama-3.1-8B-Instruct"
55
+ assert params["api_base"] == "http://localhost:8000/v1"
56
+ assert params["api_key"] == "sk-local-no-key-required"
57
+
58
+
59
+ def test_resolve_lm_studio_params_uses_api_key_override(monkeypatch):
60
+ monkeypatch.setenv("LMSTUDIO_BASE_URL", "http://127.0.0.1:1234")
61
+ monkeypatch.setenv("LMSTUDIO_API_KEY", "local-secret")
62
+ monkeypatch.setenv("LOCAL_LLM_BASE_URL", "http://localhost:9999")
63
+ monkeypatch.setenv("LOCAL_LLM_API_KEY", "shared-secret")
64
+
65
+ params = _resolve_llm_params("lm_studio/google/gemma-3-4b")
66
+
67
+ assert params["model"] == "openai/google/gemma-3-4b"
68
+ assert params["api_base"] == "http://127.0.0.1:1234/v1"
69
+ assert params["api_key"] == "local-secret"
70
+
71
+
72
+ def test_resolve_local_params_uses_shared_fallback_env(monkeypatch):
73
+ monkeypatch.delenv("VLLM_BASE_URL", raising=False)
74
+ monkeypatch.delenv("VLLM_API_KEY", raising=False)
75
+ monkeypatch.setenv("LOCAL_LLM_BASE_URL", "http://localhost:9000/v1/")
76
+ monkeypatch.setenv("LOCAL_LLM_API_KEY", "shared-local-secret")
77
+
78
+ params = _resolve_llm_params("vllm/custom-model")
79
+
80
+ assert params["model"] == "openai/custom-model"
81
+ assert params["api_base"] == "http://localhost:9000/v1"
82
+ assert params["api_key"] == "shared-local-secret"
83
+
84
+
85
+ def test_resolve_llamacpp_params_strips_provider_prefix(monkeypatch):
86
+ monkeypatch.delenv("LLAMACPP_API_KEY", raising=False)
87
+ monkeypatch.setenv("LLAMACPP_BASE_URL", "http://localhost:8080")
88
+
89
+ params = _resolve_llm_params("llamacpp/unsloth/Qwen3.5-2B")
90
+
91
+ assert params["model"] == "openai/unsloth/Qwen3.5-2B"
92
+ assert params["api_base"] == "http://localhost:8080/v1"
93
+
94
+
95
+ def test_local_params_reject_reasoning_effort_in_strict_mode():
96
+ with pytest.raises(UnsupportedEffortError, match="reasoning_effort"):
97
+ _resolve_llm_params("ollama/llama3.1", reasoning_effort="high", strict=True)
98
+
99
+
100
+ def test_local_params_drop_reasoning_effort_in_non_strict_mode():
101
+ params = _resolve_llm_params(
102
+ "ollama/llama3.1",
103
+ reasoning_effort="high",
104
+ strict=False,
105
+ )
106
+
107
+ assert params["model"] == "openai/llama3.1"
108
+ assert "reasoning_effort" not in params
109
+ assert "extra_body" not in params
110
+
111
+
112
+ def test_openai_compat_prefix_is_not_a_local_escape_hatch():
113
+ with pytest.raises(ValueError, match="Unsupported local model id"):
114
+ _resolve_llm_params("openai-compat/custom-model")
115
+
116
+
117
+ def test_empty_local_model_id_is_not_treated_as_hf_router():
118
+ with pytest.raises(ValueError, match="Unsupported local model id"):
119
+ _resolve_llm_params("ollama/")
120
+
121
+
122
  def test_hf_router_token_prefers_inference_token(monkeypatch):
123
  monkeypatch.setenv("INFERENCE_TOKEN", " inference-token ")
124
  monkeypatch.setenv("HF_TOKEN", "hf-token")