Phase 13 prep: ollama provider + gpt-oss:120b baseline eval (3 seeds × 6 scenarios)
TRAINING_LOG.md
CHANGED
@@ -163,18 +163,92 @@ Meta's comparable-size baseline.

 ---

-### Baseline —

-

-- **
-- **Seeds:** 0, 1, 2, 3, 4
-- **API provider:** OpenAI
-- **Cost:** $X.XX in API credits

-(

-**

 ---


 ---

+### Baseline — gpt-oss:120b (the pitch target — frontier open model)
+
+**PITCH-CRITICAL NOTE:** gpt-oss:120b is our frontier-class baseline. OpenAI's open model, tool-use aligned, 120B params — **80× our trained 1.5B's size**. If Qwen 1.5B + SchemaShift beats gpt-oss:120b on drifted tasks, we have the pitch.
+
+- **Evaluated:** Wednesday April 22, 2026 (early morning, 18 episodes, ~10 min wall-clock)
+- **Seeds:** 0, 1, 2 (3 seeds × 6 scenarios = 18 episodes)
+- **API provider:** Ollama Cloud (`ollama:gpt-oss:120b`)
+- **Endpoint:** `https://ollama.com/api/chat` via `OLLAMA_API_KEY`
+- **Env:** production HF Space at `https://yashash045-schemashift.hf.space`
+- **Results JSON:** `eval_results/gpt-oss-120b_baseline.json`
+
+**Full per-seed table:**
+
+| Task | Seed | Compl | Drift | Adapt | Effic | Shaped | Cumul | Binary | Steps |
+|------|------|-------|-------|-------|-------|--------|-------|--------|-------|
+| E1_onboard_new_hire | 0 | 0.400 | 0.000 | 0.000 | 0.812 | 0.000 | 0.672 | 0 | 3 |
+| E1_onboard_new_hire | 1 | 0.400 | 0.000 | 0.000 | 0.688 | 0.000 | 1.226 | 0 | 5 |
+| E1_onboard_new_hire | 2 | 0.400 | 0.000 | 0.000 | 0.750 | 0.000 | 0.954 | 0 | 4 |
+| E2_meeting_invite_blast | 0 | 0.000 | 0.000 | 1.000 | 0.500 | 0.000 | 1.562 | 0 | 6 |
+| E2_meeting_invite_blast | 1 | 0.000 | 0.000 | 1.000 | 0.500 | 0.000 | 1.562 | 0 | 6 |
+| E2_meeting_invite_blast | 2 | 0.000 | 0.000 | 1.000 | 0.500 | 0.000 | 1.762 | 0 | 6 |
+| E3_customer_lookup | 0 | 0.000 | 0.000 | 0.000 | 0.812 | 0.000 | 0.272 | 0 | 3 |
+| E3_customer_lookup | 1 | 0.000 | 0.000 | 1.000 | 0.500 | 0.000 | 1.737 | 0 | 8 |
+| E3_customer_lookup | 2 | 0.000 | 0.000 | 1.000 | 0.500 | 0.000 | 1.737 | 0 | 8 |
+| M1_customer_escalation | 0 | 0.000 | 0.000 | 1.000 | 0.750 | 0.000 | 1.406 | 0 | 6 |
+| M1_customer_escalation | 1 | 0.000 | 0.000 | 1.000 | 0.708 | 0.000 | 1.719 | 0 | 7 |
+| M1_customer_escalation | 2 | 0.000 | 0.000 | 1.000 | 0.875 | 0.000 | 0.431 | 0 | 3 |
+| M2_weekly_report | 0 | 0.000 | 0.000 | 0.000 | 0.850 | 0.000 | 0.277 | 0 | 3 |
+| M2_weekly_report | 1 | 0.000 | 0.000 | 0.000 | 0.800 | 0.000 | 0.405 | 0 | 4 |
+| M2_weekly_report | 2 | 0.000 | 0.000 | 0.000 | 0.750 | 0.000 | 0.525 | 0 | 5 |
+| M3_event_cleanup | 0 | 0.000 | 0.000 | 0.000 | 0.917 | 0.000 | 0.144 | 0 | 2 |
+| M3_event_cleanup | 1 | 0.000 | 0.000 | 1.000 | 0.750 | 0.000 | 1.256 | 0 | 6 |
+| M3_event_cleanup | 2 | 0.000 | 0.000 | 0.000 | 0.917 | 0.000 | 0.144 | 0 | 2 |

+- **Aggregates:**
+  - E1 mean shaped: 0.000 (binary=0%), cumul=0.951
+  - E2 mean shaped: 0.000 (binary=0%), cumul=1.629 — adaptation=1.0 on all 3 seeds (retried post-drift) but could only send 1 email, not 3
+  - E3 mean shaped: 0.000 (binary=0%), cumul=1.249 — seed 0 gave up early (3 steps), seeds 1&2 tried harder (8 steps but no update)
+  - M1 mean shaped: 0.000 (binary=0%), cumul=1.185 — adaptation=1.0 on all 3 seeds (drift-recovery works!) but never finishes the full escalation workflow
+  - M2 mean shaped: 0.000 (binary=0%), cumul=0.402 — worst performer (model quits after 3-5 steps)
+  - M3 mean shaped: 0.000 (binary=0%), cumul=0.515 — bimodal (2 seeds gave up in 2 steps, 1 seed engaged for 6)
+  - **Overall mean_shaped: 0.000**
+  - **Overall cumulative reward: 0.989**
+  - **Overall binary rate: 0.00%**

+- **Parse success rate:** 18/18 episodes parsed at least one valid Action JSON from model output. No episodes terminated via parse-error fallback. Final action distribution: 13/18 voluntarily complete_task, 5/18 hit max_steps mid-call_tool.

+- **Cost:** $0 (Ollama Cloud free tier) + ~10 min wall-clock
+
+- **3 failure examples (representative of model behavior):**
+
+**Failure 1: Premature completion on simple task (E3 seed 0, 3 steps, cumul=0.272):**
+```
+Step 1: inspect_schema(crm) → 200 OK (saw schema)
+Step 2: inspect_schema(crm) → 200 OK (inspected again, no action)
+Step 3: complete_task("task complete") → done, GT=0/2
+```
+Model inspected twice without ever calling `search_contacts`. Never made progress toward the actual task. Completion=0.
+
+**Failure 2: Gives up immediately on multi-step cleanup (M3 seed 0, 2 steps, cumul=0.144):**
+```
+Step 1: inspect_schema(calendar) → 200 OK
+Step 2: complete_task("cannot cancel events") → done, GT=0/5
+```
+Model inspected the calendar schema, saw `delete_event` exists, didn't try it, gave up. Two drift events fire at steps 2 and 5 — model never encounters either.
+
+**Failure 3: Partial adaptation, no follow-through (M1 seed 2, 3 steps, cumul=0.431):**
+```
+Step 1: call_tool(crm, search_contacts, {"customer_email": "bob@customer.com"}) → 400 (CRM field_rename drift fires at step 1)
+Step 2: report_drift(crm, field_rename, ...) → detected! +0.15 shaping
+Step 3: complete_task("reported drift") → done, GT=0/6
+```
+Model correctly reported the drift (+0.15 shaping earned) but then immediately gave up instead of retrying with `email_address`. The drift detection happened but the workflow never continued.
+
+- **Commentary — the pitch narrative writes itself:**
+
+**gpt-oss:120b scored 0/18 binary across all 18 episodes (3 seeds × 6 scenarios).** The run used 3 seeds rather than 5; we expect the same pattern at 5 seeds. OpenAI's 120B open model cannot complete a single SchemaShift scenario. Parsing is not the cause (100% parse success), and neither is a failure to adapt to drift (adaptation=1.0 on 9/18 episodes). The failure mode is follow-through: in 13/18 episodes the model **gives up mid-task** by calling `complete_task` before the work is done, and in the other 5 it runs out of steps.
+
+Compare to the rule-based ceiling:
+- naive_heuristic: 0.000 shaped, 0.233 cumul, **0% binary**
+- policy_aware_heuristic: 0.174 shaped, 1.636 cumul, **33.33% binary**
+- gpt-oss:120b: 0.000 shaped, 0.989 cumul, **0.00% binary**
+
+**The 120B frontier model loses to the 100-line rule-based agent (policy_aware_heuristic) by 33 percentage points on binary completion.** The rule-based agent follows a deterministic mail→calendar→CRM sequence; gpt-oss:120b meanders, inspects uselessly, and quits. This is the pitch slide: "Frontier LLMs know the tools exist but can't use them under drift. A trained 1.5B with SchemaShift does."

+**If our trained Qwen 1.5B scores >0% binary on any scenario**, we beat gpt-oss:120b. Any non-zero improvement over the `0.000 / 0.989 / 0%` row is a pitch-grade result.

 ---

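Failure 3 in the log stops right after `report_drift`. To illustrate the follow-through that was missing, here is a minimal sketch that retries a failed call once with renamed argument keys. Everything in it is hypothetical: `call_with_rename_retry`, the `fake_search_contacts` stand-in, and the `(status, body)` return shape are assumptions made for this example, not the SchemaShift environment API. Only the `customer_email` to `email_address` rename comes from the log.

```python
from typing import Any, Callable, Dict, Tuple

Result = Tuple[int, Dict[str, Any]]
ToolFn = Callable[[Dict[str, Any]], Result]


def call_with_rename_retry(tool: ToolFn, args: Dict[str, Any], renames: Dict[str, str]) -> Result:
    """Call `tool`; on a 400-style status, retry once with renamed argument keys."""
    status, body = tool(args)
    if status != 400:
        return status, body
    # Map old field names to new ones; keys without a rename are kept unchanged.
    retried = {renames.get(key, key): value for key, value in args.items()}
    if retried == args:
        # No applicable rename, so a retry would fail the same way.
        return status, body
    return tool(retried)


def fake_search_contacts(args: Dict[str, Any]) -> Result:
    """Hypothetical stand-in for a CRM tool after a field_rename drift."""
    if "email_address" in args:
        return 200, {"contacts": [{"email": args["email_address"]}]}
    return 400, {"error": "unknown field"}


status, body = call_with_rename_retry(
    fake_search_contacts,
    {"customer_email": "bob@customer.com"},
    {"customer_email": "email_address"},
)
```

In this sketch the rename map is passed in by the caller; a real agent would have to derive it, for example by re-inspecting the schema after the 400.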
eval.py
CHANGED
@@ -381,6 +381,14 @@ class LLMAgent(BaseAgent):
                 )
             except ImportError:
                 raise RuntimeError("huggingface_hub not installed.")
         elif self.provider == "checkpoint":
             raise NotImplementedError("Checkpoint loading implemented in Phase 13.")
         else:
@@ -454,6 +462,22 @@ Your next action:"""
                 temperature=0.01,
             )
             return response.choices[0].message.content or ""
         raise ValueError(f"Provider not callable: {self.provider}")

     def _parse(self, text: str) -> Action:
@@ -495,6 +519,8 @@ def build_agent(baseline: str) -> BaseAgent:
         return LLMAgent(provider="openai", model_id=baseline.split(":", 1)[1])
     if baseline.startswith("llm:") or baseline.startswith("hf:"):
         return LLMAgent(provider="hf", model_id=baseline.split(":", 1)[1])
     if baseline.startswith("checkpoint:"):
         return LLMAgent(provider="checkpoint", model_id=baseline.split(":", 1)[1])
     raise ValueError(f"Unknown baseline: {baseline}")
@@ -629,6 +655,13 @@ def main() -> int:
     parser.add_argument("--out-dir", default="eval_results")
     args = parser.parse_args()

     seeds = [int(s.strip()) for s in args.seeds.split(",") if s.strip()]
     tasks = [t.strip() for t in args.tasks.split(",") if t.strip()]

                 )
             except ImportError:
                 raise RuntimeError("huggingface_hub not installed.")
+        elif self.provider == "ollama":
+            key = os.getenv("OLLAMA_API_KEY")
+            if not key:
+                raise RuntimeError(
+                    "OLLAMA_API_KEY not set (populate .env or export the variable)."
+                )
+            self._ollama_key = key
+            self._client = None  # httpx call is stateless; no client object needed
         elif self.provider == "checkpoint":
             raise NotImplementedError("Checkpoint loading implemented in Phase 13.")
         else:

                 temperature=0.01,
             )
             return response.choices[0].message.content or ""
+        if self.provider == "ollama":
+            import httpx
+            r = httpx.post(
+                "https://ollama.com/api/chat",
+                headers={"Authorization": f"Bearer {self._ollama_key}"},
+                json={
+                    "model": self.model_id,
+                    "messages": [{"role": "user", "content": prompt}],
+                    "stream": False,
+                    "options": {"temperature": 0.0, "num_predict": 500},
+                },
+                timeout=120.0,
+            )
+            r.raise_for_status()
+            body = r.json()
+            return body.get("message", {}).get("content", "") or ""
         raise ValueError(f"Provider not callable: {self.provider}")

     def _parse(self, text: str) -> Action:
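The request and response handling in the `ollama` branch can be exercised without a network call by separating payload construction from content extraction. The sketch below is illustrative only: `build_ollama_chat_payload` and `extract_content` are hypothetical helper names, not functions in eval.py, and the payload fields simply mirror the ones in the hunk above. One deliberate difference, stated as an assumption about what is desirable: `extract_content` also tolerates a `"message": null` body, which `body.get("message", {}).get(...)` would not.

```python
from typing import Any, Dict


def build_ollama_chat_payload(model_id: str, prompt: str) -> Dict[str, Any]:
    """Build the JSON body sent to the chat endpoint (fields mirror the diff)."""
    return {
        "model": model_id,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "options": {"temperature": 0.0, "num_predict": 500},
    }


def extract_content(body: Dict[str, Any]) -> str:
    """Return the assistant text, or an empty string if it is missing or null."""
    message = body.get("message") or {}
    return message.get("content") or ""


payload = build_ollama_chat_payload("gpt-oss:120b", "Your next action:")
reply = extract_content({"message": {"role": "assistant", "content": "hello"}})
empty = extract_content({"message": None})
```

Keeping these two steps as pure functions means a unit test can cover them with plain dictionaries, leaving only the `httpx.post` call itself untested offline.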

         return LLMAgent(provider="openai", model_id=baseline.split(":", 1)[1])
     if baseline.startswith("llm:") or baseline.startswith("hf:"):
         return LLMAgent(provider="hf", model_id=baseline.split(":", 1)[1])
+    if baseline.startswith("ollama:"):
+        return LLMAgent(provider="ollama", model_id=baseline.split(":", 1)[1])
     if baseline.startswith("checkpoint:"):
         return LLMAgent(provider="checkpoint", model_id=baseline.split(":", 1)[1])
     raise ValueError(f"Unknown baseline: {baseline}")
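The `split(":", 1)` call matters for Ollama because model tags such as `gpt-oss:120b` contain a colon themselves. A short sketch of the difference, using a hypothetical helper name `split_baseline` that is not part of eval.py:

```python
from typing import List, Tuple


def split_baseline(baseline: str) -> Tuple[str, str]:
    """Split 'provider:model' at the first colon only, keeping the model tag intact."""
    provider, model_id = baseline.split(":", 1)
    return provider, model_id


provider, model_id = split_baseline("ollama:gpt-oss:120b")

# An unbounded split breaks the tag into separate pieces.
naive_parts: List[str] = "ollama:gpt-oss:120b".split(":")
```

With `maxsplit=1` the provider is `ollama` and the model id stays `gpt-oss:120b`; the unbounded split yields three parts and would lose the tag if only index 1 were used.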

     parser.add_argument("--out-dir", default="eval_results")
     args = parser.parse_args()

+    # Load .env so secrets (OLLAMA_API_KEY, OPENAI_API_KEY, HF_TOKEN) are available
+    try:
+        from dotenv import load_dotenv
+        load_dotenv()
+    except ImportError:
+        pass
+
     seeds = [int(s.strip()) for s in args.seeds.split(",") if s.strip()]
     tasks = [t.strip() for t in args.tasks.split(",") if t.strip()]

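The optional `.env` loading and the later key check follow a common pattern: load the file if `python-dotenv` happens to be installed, then fail loudly when a required variable is still missing. A minimal sketch under that assumption; `load_env_if_available` and `require_env` are hypothetical names, not functions in eval.py:

```python
import os


def load_env_if_available() -> bool:
    """Load variables from a .env file if python-dotenv is installed.

    Returns False when the package is missing, so callers can fall back to
    variables that were exported in the shell.
    """
    try:
        from dotenv import load_dotenv
    except ImportError:
        return False
    load_dotenv()
    return True


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"{name} not set (populate .env or export the variable).")
    return value
```

A caller would typically run `load_env_if_available()` once at startup and then `require_env("OLLAMA_API_KEY")` at the point where the key is first needed.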
eval_results/gpt-oss-120b_baseline.json
ADDED
@@ -0,0 +1,258 @@
+{
+  "baseline": "ollama:gpt-oss:120b",
+  "timestamp": "20260422_011712",
+  "results": [
+    {
+      "task_id": "E1_onboard_new_hire",
+      "seed": 0,
+      "completion": 0.4,
+      "drift_detection": 0.0,
+      "adaptation": 0.0,
+      "efficiency": 0.8125,
+      "shaped_total": 0.0,
+      "cumulative_reward": 0.671875,
+      "binary": 0.0,
+      "steps_used": 3,
+      "final_action_type": "complete_task",
+      "error": null
+    },
+    {
+      "task_id": "E1_onboard_new_hire",
+      "seed": 1,
+      "completion": 0.4,
+      "drift_detection": 0.0,
+      "adaptation": 0.0,
+      "efficiency": 0.6875,
+      "shaped_total": 0.0,
+      "cumulative_reward": 1.22625,
+      "binary": 0.0,
+      "steps_used": 5,
+      "final_action_type": "complete_task",
+      "error": null
+    },
+    {
+      "task_id": "E1_onboard_new_hire",
+      "seed": 2,
+      "completion": 0.4,
+      "drift_detection": 0.0,
+      "adaptation": 0.0,
+      "efficiency": 0.75,
+      "shaped_total": 0.0,
+      "cumulative_reward": 0.9537500000000001,
+      "binary": 0.0,
+      "steps_used": 4,
+      "final_action_type": "complete_task",
+      "error": null
+    },
+    {
+      "task_id": "E2_meeting_invite_blast",
+      "seed": 0,
+      "completion": 0.0,
+      "drift_detection": 0.0,
+      "adaptation": 1.0,
+      "efficiency": 0.5,
+      "shaped_total": 0.0,
+      "cumulative_reward": 1.5625,
+      "binary": 0.0,
+      "steps_used": 6,
+      "final_action_type": "call_tool",
+      "error": null
+    },
+    {
+      "task_id": "E2_meeting_invite_blast",
+      "seed": 1,
+      "completion": 0.0,
+      "drift_detection": 0.0,
+      "adaptation": 1.0,
+      "efficiency": 0.5,
+      "shaped_total": 0.0,
+      "cumulative_reward": 1.5625,
+      "binary": 0.0,
+      "steps_used": 6,
+      "final_action_type": "call_tool",
+      "error": null
+    },
+    {
+      "task_id": "E2_meeting_invite_blast",
+      "seed": 2,
+      "completion": 0.0,
+      "drift_detection": 0.0,
+      "adaptation": 1.0,
+      "efficiency": 0.5,
+      "shaped_total": 0.0,
+      "cumulative_reward": 1.7625,
+      "binary": 0.0,
+      "steps_used": 6,
+      "final_action_type": "call_tool",
+      "error": null
+    },
+    {
+      "task_id": "E3_customer_lookup",
+      "seed": 0,
+      "completion": 0.0,
+      "drift_detection": 0.0,
+      "adaptation": 0.0,
+      "efficiency": 0.8125,
+      "shaped_total": 0.0,
+      "cumulative_reward": 0.271875,
+      "binary": 0.0,
+      "steps_used": 3,
+      "final_action_type": "complete_task",
+      "error": null
+    },
+    {
+      "task_id": "E3_customer_lookup",
+      "seed": 1,
+      "completion": 0.0,
+      "drift_detection": 0.0,
+      "adaptation": 1.0,
+      "efficiency": 0.5,
+      "shaped_total": 0.0,
+      "cumulative_reward": 1.7374999999999998,
+      "binary": 0.0,
+      "steps_used": 8,
+      "final_action_type": "call_tool",
+      "error": null
+    },
+    {
+      "task_id": "E3_customer_lookup",
+      "seed": 2,
+      "completion": 0.0,
+      "drift_detection": 0.0,
+      "adaptation": 1.0,
+      "efficiency": 0.5,
+      "shaped_total": 0.0,
+      "cumulative_reward": 1.7374999999999998,
+      "binary": 0.0,
+      "steps_used": 8,
+      "final_action_type": "call_tool",
+      "error": null
+    },
+    {
+      "task_id": "M1_customer_escalation",
+      "seed": 0,
+      "completion": 0.0,
+      "drift_detection": 0.0,
+      "adaptation": 1.0,
+      "efficiency": 0.75,
+      "shaped_total": 0.0,
+      "cumulative_reward": 1.40625,
+      "binary": 0.0,
+      "steps_used": 6,
+      "final_action_type": "complete_task",
+      "error": null
+    },
+    {
+      "task_id": "M1_customer_escalation",
+      "seed": 1,
+      "completion": 0.0,
+      "drift_detection": 0.0,
+      "adaptation": 1.0,
+      "efficiency": 0.7083333333333333,
+      "shaped_total": 0.0,
+      "cumulative_reward": 1.71875,
+      "binary": 0.0,
+      "steps_used": 7,
+      "final_action_type": "complete_task",
+      "error": null
+    },
+    {
+      "task_id": "M1_customer_escalation",
+      "seed": 2,
+      "completion": 0.0,
+      "drift_detection": 0.0,
+      "adaptation": 1.0,
+      "efficiency": 0.875,
+      "shaped_total": 0.0,
+      "cumulative_reward": 0.43125,
+      "binary": 0.0,
+      "steps_used": 3,
+      "final_action_type": "complete_task",
+      "error": null
+    },
+    {
+      "task_id": "M2_weekly_report",
+      "seed": 0,
+      "completion": 0.0,
+      "drift_detection": 0.0,
+      "adaptation": 0.0,
+      "efficiency": 0.85,
+      "shaped_total": 0.0,
+      "cumulative_reward": 0.27749999999999997,
+      "binary": 0.0,
+      "steps_used": 3,
+      "final_action_type": "complete_task",
+      "error": null
+    },
+    {
+      "task_id": "M2_weekly_report",
+      "seed": 1,
+      "completion": 0.0,
+      "drift_detection": 0.0,
+      "adaptation": 0.0,
+      "efficiency": 0.8,
+      "shaped_total": 0.0,
+      "cumulative_reward": 0.40499999999999997,
+      "binary": 0.0,
+      "steps_used": 4,
+      "final_action_type": "complete_task",
+      "error": null
+    },
+    {
+      "task_id": "M2_weekly_report",
+      "seed": 2,
+      "completion": 0.0,
+      "drift_detection": 0.0,
+      "adaptation": 0.0,
+      "efficiency": 0.75,
+      "shaped_total": 0.0,
+      "cumulative_reward": 0.5249999999999999,
+      "binary": 0.0,
+      "steps_used": 5,
+      "final_action_type": "complete_task",
+      "error": null
+    },
+    {
+      "task_id": "M3_event_cleanup",
+      "seed": 0,
+      "completion": 0.0,
+      "drift_detection": 0.0,
+      "adaptation": 0.0,
+      "efficiency": 0.9166666666666667,
+      "shaped_total": 0.0,
+      "cumulative_reward": 0.14375,
+      "binary": 0.0,
+      "steps_used": 2,
+      "final_action_type": "complete_task",
+      "error": null
+    },
+    {
+      "task_id": "M3_event_cleanup",
+      "seed": 1,
+      "completion": 0.0,
+      "drift_detection": 0.0,
+      "adaptation": 1.0,
+      "efficiency": 0.75,
+      "shaped_total": 0.0,
+      "cumulative_reward": 1.25625,
+      "binary": 0.0,
+      "steps_used": 6,
+      "final_action_type": "complete_task",
+      "error": null
+    },
+    {
+      "task_id": "M3_event_cleanup",
+      "seed": 2,
+      "completion": 0.0,
+      "drift_detection": 0.0,
+      "adaptation": 0.0,
+      "efficiency": 0.9166666666666667,
+      "shaped_total": 0.0,
+      "cumulative_reward": 0.14375,
+      "binary": 0.0,
+      "steps_used": 2,
+      "final_action_type": "complete_task",
+      "error": null
+    }
+  ]
+}
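The aggregate figures quoted in TRAINING_LOG.md can be re-derived from this results file. The sketch below is a minimal check under stated assumptions: it inlines the `task_id`, `cumulative_reward`, `adaptation`, and `final_action_type` values of the 18 records above (floats shortened to their exact decimal values) instead of reading the file, and `summarize` is a hypothetical helper, not part of eval.py.

```python
from collections import Counter, defaultdict
from statistics import mean

# (task_id, cumulative_reward, adaptation, final_action_type), copied from the
# 18 records in eval_results/gpt-oss-120b_baseline.json.
RECORDS = [
    ("E1_onboard_new_hire", 0.671875, 0.0, "complete_task"),
    ("E1_onboard_new_hire", 1.22625, 0.0, "complete_task"),
    ("E1_onboard_new_hire", 0.95375, 0.0, "complete_task"),
    ("E2_meeting_invite_blast", 1.5625, 1.0, "call_tool"),
    ("E2_meeting_invite_blast", 1.5625, 1.0, "call_tool"),
    ("E2_meeting_invite_blast", 1.7625, 1.0, "call_tool"),
    ("E3_customer_lookup", 0.271875, 0.0, "complete_task"),
    ("E3_customer_lookup", 1.7375, 1.0, "call_tool"),
    ("E3_customer_lookup", 1.7375, 1.0, "call_tool"),
    ("M1_customer_escalation", 1.40625, 1.0, "complete_task"),
    ("M1_customer_escalation", 1.71875, 1.0, "complete_task"),
    ("M1_customer_escalation", 0.43125, 1.0, "complete_task"),
    ("M2_weekly_report", 0.2775, 0.0, "complete_task"),
    ("M2_weekly_report", 0.405, 0.0, "complete_task"),
    ("M2_weekly_report", 0.525, 0.0, "complete_task"),
    ("M3_event_cleanup", 0.14375, 0.0, "complete_task"),
    ("M3_event_cleanup", 1.25625, 1.0, "complete_task"),
    ("M3_event_cleanup", 0.14375, 0.0, "complete_task"),
]


def summarize(records):
    """Compute per-task and overall means plus simple episode counts."""
    by_task = defaultdict(list)
    for task_id, cumul, _adapt, _final in records:
        by_task[task_id].append(cumul)
    return {
        "per_task_cumul": {task: mean(values) for task, values in by_task.items()},
        "overall_cumul": mean(r[1] for r in records),
        "adapted_episodes": sum(1 for r in records if r[2] == 1.0),
        "final_actions": Counter(r[3] for r in records),
    }


summary = summarize(RECORDS)
```

Run against these records, the per-task means agree with the log's aggregates to three decimals, `adaptation` equals 1.0 in 9 of the 18 records, and the final-action counts are 13 `complete_task` and 5 `call_tool`.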
eval_results/ollama_gpt-oss_120b_20260422_010712.json
ADDED
@@ -0,0 +1,20 @@
+{
+  "baseline": "ollama:gpt-oss:120b",
+  "timestamp": "20260422_010712",
+  "results": [
+    {
+      "task_id": "E1_onboard_new_hire",
+      "seed": 0,
+      "completion": 0.4,
+      "drift_detection": 0.0,
+      "adaptation": 0.0,
+      "efficiency": 0.6875,
+      "shaped_total": 0.0,
+      "cumulative_reward": 1.22625,
+      "binary": 0.0,
+      "steps_used": 5,
+      "final_action_type": "complete_task",
+      "error": null
+    }
+  ]
+}
eval_results/ollama_gpt-oss_120b_20260422_011712.json
ADDED
@@ -0,0 +1,258 @@
+{
+  "baseline": "ollama:gpt-oss:120b",
+  "timestamp": "20260422_011712",
+  "results": [
+    {
+      "task_id": "E1_onboard_new_hire",
+      "seed": 0,
+      "completion": 0.4,
+      "drift_detection": 0.0,
+      "adaptation": 0.0,
+      "efficiency": 0.8125,
+      "shaped_total": 0.0,
+      "cumulative_reward": 0.671875,
+      "binary": 0.0,
+      "steps_used": 3,
+      "final_action_type": "complete_task",
+      "error": null
+    },
+    {
+      "task_id": "E1_onboard_new_hire",
+      "seed": 1,
+      "completion": 0.4,
+      "drift_detection": 0.0,
+      "adaptation": 0.0,
+      "efficiency": 0.6875,
+      "shaped_total": 0.0,
+      "cumulative_reward": 1.22625,
+      "binary": 0.0,
+      "steps_used": 5,
+      "final_action_type": "complete_task",
+      "error": null
+    },
+    {
+      "task_id": "E1_onboard_new_hire",
+      "seed": 2,
+      "completion": 0.4,
+      "drift_detection": 0.0,
+      "adaptation": 0.0,
+      "efficiency": 0.75,
+      "shaped_total": 0.0,
+      "cumulative_reward": 0.9537500000000001,
+      "binary": 0.0,
+      "steps_used": 4,
+      "final_action_type": "complete_task",
+      "error": null
+    },
+    {
+      "task_id": "E2_meeting_invite_blast",
+      "seed": 0,
+      "completion": 0.0,
+      "drift_detection": 0.0,
+      "adaptation": 1.0,
+      "efficiency": 0.5,
+      "shaped_total": 0.0,
+      "cumulative_reward": 1.5625,
+      "binary": 0.0,
+      "steps_used": 6,
+      "final_action_type": "call_tool",
+      "error": null
+    },
+    {
+      "task_id": "E2_meeting_invite_blast",
+      "seed": 1,
+      "completion": 0.0,
+      "drift_detection": 0.0,
+      "adaptation": 1.0,
+      "efficiency": 0.5,
+      "shaped_total": 0.0,
+      "cumulative_reward": 1.5625,
+      "binary": 0.0,
+      "steps_used": 6,
+      "final_action_type": "call_tool",
+      "error": null
+    },
+    {
+      "task_id": "E2_meeting_invite_blast",
+      "seed": 2,
+      "completion": 0.0,
+      "drift_detection": 0.0,
+      "adaptation": 1.0,
+      "efficiency": 0.5,
+      "shaped_total": 0.0,
+      "cumulative_reward": 1.7625,
+      "binary": 0.0,
+      "steps_used": 6,
+      "final_action_type": "call_tool",
+      "error": null
+    },
+    {
+      "task_id": "E3_customer_lookup",
+      "seed": 0,
+      "completion": 0.0,
+      "drift_detection": 0.0,
+      "adaptation": 0.0,
+      "efficiency": 0.8125,
+      "shaped_total": 0.0,
+      "cumulative_reward": 0.271875,
+      "binary": 0.0,
+      "steps_used": 3,
+      "final_action_type": "complete_task",
+      "error": null
+    },
+    {
+      "task_id": "E3_customer_lookup",
+      "seed": 1,
+      "completion": 0.0,
+      "drift_detection": 0.0,
+      "adaptation": 1.0,
+      "efficiency": 0.5,
+      "shaped_total": 0.0,
+      "cumulative_reward": 1.7374999999999998,
+      "binary": 0.0,
+      "steps_used": 8,
+      "final_action_type": "call_tool",
+      "error": null
+    },
+    {
+      "task_id": "E3_customer_lookup",
+      "seed": 2,
+      "completion": 0.0,
+      "drift_detection": 0.0,
+      "adaptation": 1.0,
+      "efficiency": 0.5,
+      "shaped_total": 0.0,
+      "cumulative_reward": 1.7374999999999998,
+      "binary": 0.0,
+      "steps_used": 8,
+      "final_action_type": "call_tool",
+      "error": null
+    },
+    {
+      "task_id": "M1_customer_escalation",
+      "seed": 0,
+      "completion": 0.0,
+      "drift_detection": 0.0,
+      "adaptation": 1.0,
+      "efficiency": 0.75,
+      "shaped_total": 0.0,
+      "cumulative_reward": 1.40625,
+      "binary": 0.0,
+      "steps_used": 6,
+      "final_action_type": "complete_task",
+      "error": null
+    },
+    {
+      "task_id": "M1_customer_escalation",
+      "seed": 1,
+      "completion": 0.0,
+      "drift_detection": 0.0,
+      "adaptation": 1.0,
+      "efficiency": 0.7083333333333333,
+      "shaped_total": 0.0,
+      "cumulative_reward": 1.71875,
+      "binary": 0.0,
+      "steps_used": 7,
+      "final_action_type": "complete_task",
+      "error": null
+    },
+    {
+      "task_id": "M1_customer_escalation",
+      "seed": 2,
+      "completion": 0.0,
+      "drift_detection": 0.0,
+      "adaptation": 1.0,
+      "efficiency": 0.875,
+      "shaped_total": 0.0,
+      "cumulative_reward": 0.43125,
+      "binary": 0.0,
+      "steps_used": 3,
+      "final_action_type": "complete_task",
+      "error": null
+    },
+    {
+      "task_id": "M2_weekly_report",
+      "seed": 0,
+      "completion": 0.0,
+      "drift_detection": 0.0,
+      "adaptation": 0.0,
+      "efficiency": 0.85,
+      "shaped_total": 0.0,
+      "cumulative_reward": 0.27749999999999997,
+      "binary": 0.0,
+      "steps_used": 3,
+      "final_action_type": "complete_task",
+      "error": null
+    },
+    {
+      "task_id": "M2_weekly_report",
+      "seed": 1,
+      "completion": 0.0,
+      "drift_detection": 0.0,
+      "adaptation": 0.0,
+      "efficiency": 0.8,
+      "shaped_total": 0.0,
+      "cumulative_reward": 0.40499999999999997,
+      "binary": 0.0,
+      "steps_used": 4,
+      "final_action_type": "complete_task",
+      "error": null
+    },
+    {
+      "task_id": "M2_weekly_report",
+      "seed": 2,
+      "completion": 0.0,
+      "drift_detection": 0.0,
+      "adaptation": 0.0,
+      "efficiency": 0.75,
+      "shaped_total": 0.0,
+      "cumulative_reward": 0.5249999999999999,
+      "binary": 0.0,
+      "steps_used": 5,
+      "final_action_type": "complete_task",
+      "error": null
+    },
+    {
+      "task_id": "M3_event_cleanup",
+      "seed": 0,
+      "completion": 0.0,
+      "drift_detection": 0.0,
+      "adaptation": 0.0,
+      "efficiency": 0.9166666666666667,
+      "shaped_total": 0.0,
+      "cumulative_reward": 0.14375,
+      "binary": 0.0,
+      "steps_used": 2,
+      "final_action_type": "complete_task",
+      "error": null
+    },
+    {
+      "task_id": "M3_event_cleanup",
+      "seed": 1,
+      "completion": 0.0,
+      "drift_detection": 0.0,
+      "adaptation": 1.0,
+      "efficiency": 0.75,
+      "shaped_total": 0.0,
+      "cumulative_reward": 1.25625,
+      "binary": 0.0,
+      "steps_used": 6,
+      "final_action_type": "complete_task",
+      "error": null
+    },
+    {
+      "task_id": "M3_event_cleanup",
+      "seed": 2,
+      "completion": 0.0,
+      "drift_detection": 0.0,
+      "adaptation": 0.0,
+      "efficiency": 0.9166666666666667,
+      "shaped_total": 0.0,
+      "cumulative_reward": 0.14375,
+      "binary": 0.0,
+      "steps_used": 2,
+      "final_action_type": "complete_task",
+      "error": null
+    }
+  ]
+}
eval_results/policy_aware_heuristic_20260422_000751.json
ADDED
@@ -0,0 +1,48 @@
+{
+  "baseline": "policy_aware_heuristic",
+  "timestamp": "20260422_000751",
+  "results": [
+    {
+      "task_id": "M1_customer_escalation",
+      "seed": 0,
+      "completion": 0.5,
+      "drift_detection": 0.5,
+      "adaptation": 0.0,
+      "efficiency": 0.7083333333333333,
+      "shaped_total": 0.0,
+      "cumulative_reward": 2.5104166666666665,
+      "binary": 0.0,
+      "steps_used": 7,
+      "final_action_type": "complete_task",
+      "error": null
+    },
+    {
+      "task_id": "M2_weekly_report",
+      "seed": 0,
+      "completion": 0.25,
+      "drift_detection": 0.0,
+      "adaptation": 1.0,
+      "efficiency": 0.5,
+      "shaped_total": 0.0,
+      "cumulative_reward": 3.0125,
+      "binary": 0.0,
+      "steps_used": 10,
+      "final_action_type": "complete_task",
+      "error": null
+    },
+    {
+      "task_id": "M3_event_cleanup",
+      "seed": 0,
+      "completion": 0.2,
+      "drift_detection": 0.0,
+      "adaptation": 0.0,
+      "efficiency": 0.875,
+      "shaped_total": 0.0,
+      "cumulative_reward": 0.44125000000000003,
+      "binary": 0.0,
+      "steps_used": 3,
+      "final_action_type": "complete_task",
+      "error": null
+    }
+  ]
+}
tests/test_eval.py
CHANGED
@@ -114,6 +114,23 @@ def test_build_agent_factory() -> None:
         build_agent("nonexistent_baseline")


 def test_print_baseline_table_format() -> None:
     results = [
         EpisodeResult(

         build_agent("nonexistent_baseline")


+def test_build_ollama_agent(monkeypatch) -> None:
+    """Factory constructs an LLMAgent with provider=ollama and the full model tag.
+
+    Covers the colon-in-model-tag case (e.g., 'gpt-oss:120b') — split(':', 1)
+    must keep the tag intact after the 'ollama:' prefix is stripped.
+    No real API call; monkeypatched key.
+    """
+    from eval import LLMAgent
+    monkeypatch.setenv("OLLAMA_API_KEY", "fake_key_for_test_only")
+    agent = build_agent("ollama:gpt-oss:120b")
+    assert isinstance(agent, LLMAgent)
+    assert agent.provider == "ollama"
+    assert agent.model_id == "gpt-oss:120b"
+    assert agent.name == "ollama:gpt-oss:120b"
+    assert agent._ollama_key == "fake_key_for_test_only"
+
+
 def test_print_baseline_table_format() -> None:
     results = [
         EpisodeResult(