SwapnilPatil28 committed on
Commit
c3648b5
·
verified ·
1 Parent(s): 58af620

Final Update - Add training artifacts, README updates, and scripts

.dockerignore CHANGED
@@ -5,9 +5,12 @@
5
  __pycache__
6
  **/__pycache__
7
  **/*.pyc
8
- artifacts/
 
 
9
  outputs/
10
  tests/
11
  .pytest_cache/
12
  .cursor
13
  *.ipynb_checkpoints
 
 
5
  __pycache__
6
  **/__pycache__
7
  **/*.pyc
8
+ # Keep the committed evidence (plots, JSON metrics) so the HF Space dashboard
9
+ # can render them; only exclude the heavy fine-tuned checkpoint directory.
10
+ artifacts/sft_model/
11
  outputs/
12
  tests/
13
  .pytest_cache/
14
  .cursor
15
  *.ipynb_checkpoints
16
+ docs/
.gitattributes CHANGED
@@ -1,2 +1,3 @@
1
  # Auto detect text files and perform LF normalization
2
  * text=auto
 
 
1
  # Auto detect text files and perform LF normalization
2
  * text=auto
3
+ artifacts/reward_components.png filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -23,9 +23,21 @@ tags:
23
 
24
  [![Tests](https://img.shields.io/badge/tests-21%20passing-brightgreen)](./tests) [![OpenEnv](https://img.shields.io/badge/OpenEnv-v0.2%2B-blue)](https://github.com/meta-pytorch/openenv) [![License](https://img.shields.io/badge/license-MIT-blue.svg)](./LICENSE) ![Python](https://img.shields.io/badge/python-3.10%2B-blue)
25
 
26
  Three specialist agents — **Triage**, **Investigator**, and **Ops Manager** — cooperate to resolve a queue of production incidents while operating under strict **SLA budgets**, **investigation costs**, and **customer-tier impact multipliers**. The environment is designed to reward *real* operational reasoning, not pattern matching on the root-cause label.
27
 
28
- This repository is the hackathon submission for the **OpenEnv India 2026 Round 2** finals across three themes:
29
 
30
  - **Theme #1 Multi-Agent Interactions** — role-gated action space, negotiation, handoff.
31
  - **Theme #2 (Super) Long-Horizon Planning** — delayed rewards, carried constraints across multiple incidents, postmortem requirements.
@@ -66,6 +78,16 @@ This environment captures five properties that are hard to teach with static dat
66
  | **Anti-gaming** | Clue bonuses are unique per root-cause keyword; repeated lookups get a small penalty. Closing without enough clues triggers an under-investigated penalty even when the guess is right. |
67
  | **Carry-over state** | Budget and SLA decrement across the whole incident queue, so early sloppy episodes ruin later ones. Postmortems must be filed for high-impact incidents. |
68
 
69
  ---
70
 
71
  ## Architecture
@@ -136,29 +158,48 @@ Both action and observation schemas are defined in [`models.py`](./models.py) wi
136
 
137
  ## Reward model
138
 
139
- The rubric engine lives in [`server/domain/reward.py`](./server/domain/reward.py). Every step accumulates named components that are summed into the final reward and echoed to the agent.
 
 
140
 
141
  | Component | Typical value | Triggers |
142
  |---|---:|---|
143
- | `step_cost` | −0.02 … −0.08 | Every action (type-specific) |
144
- | `wrong_actor_penalty` | −0.08 | Action invoked by a role not authorised to perform it |
145
- | `clue_bonus` | **+0.12** | Lookup text contains a *new* root-cause keyword (capped at 3 per incident) |
 
146
  | `repeated_lookup_penalty` | −0.02 | Same clue keyword surfaced again |
147
  | `handoff_correct` / `handoff_wrong` | **+0.15** / −0.10 | Handoff target matches the incident's expected owner |
148
- | `mitigation_correct` / `mitigation_wrong` | **+0.35** / −0.30 | `apply_fix` text matches accepted fix keywords |
149
- | `closure_correct` | **+0.80 × tier** | Correct root cause, tier multiplier: free 0.6, standard 1.0, premium 1.4, enterprise 1.8 |
150
- | `closure_mitigation_bonus` | +0.30 | Closed *after* a successful mitigation |
151
  | `closure_under_investigated` | −0.20 | Closed before collecting the required number of clues |
152
  | `speed_bonus` | +0.10 … +0.20 | Resolved in ≤ 7 / ≤ 4 steps on that incident |
153
- | `postmortem_bonus` / `postmortem_missing` | +0.12 / −0.15 | Postmortem filed for high-impact incidents |
154
- | `closure_wrong` | −1.10 × tier | Wrong root cause, scaled by tier |
155
- | `sla_exhausted` | −1.2 × tier | Global SLA minutes hit zero |
156
  | `budget_exhausted` | −1.5 | Investigation action budget hit zero |
157
 
 
 
158
  Design goals:
159
 
160
- 1. **Transparent** — agents and humans can see *why* each step was scored.
161
- 2. **Hard to game** — unique clue bonuses, under-investigation penalty, role gating.
162
  3. **Business-aware** — tier multipliers mirror real enterprise SLA contracts.
163
 
164
  ---
@@ -180,8 +221,8 @@ Full incident catalog with logs, metrics, KB and accepted fixes is defined in [`
180
  ### 1. Clone and install
181
 
182
  ```bash
183
- git clone https://github.com/<you>/CustomerSupportTicketRoutingEnv
184
- cd CustomerSupportTicketRoutingEnv
185
 
186
  python -m venv .venv
187
  # Windows PowerShell
@@ -238,7 +279,12 @@ Expected output: **21 passing** (domain rubric, incident catalog, environment in
238
  1. **Rollout** — the `HeuristicCoordinator` drives the live environment to collect `(prompt, completion)` pairs. Prompts include customer tier, revenue impact, visible signals and investigation targets; completions are structured JSON actions.
239
  2. **SFT** — the dataset is collapsed into a single `text` column (robust across TRL ≥ 0.20) and fed to `SFTTrainer`. The fine-tuned weights + tokenizer are saved to `artifacts/sft_model/`.
240
  3. **Evaluation** — four policies are rolled out under identical seeds: `random`, `heuristic`, `base_model` (raw `BASE_MODEL` HF checkpoint), and `sft_model` (the fine-tuned checkpoint just saved). LLM evaluation auto-enables on a CUDA GPU; force it with `EVAL_LLM_MODELS=true` or disable with `EVAL_LLM_MODELS=false`.
241
- 4. **Artifacts** — `artifacts/reward_curve.png` (4 lines) and `artifacts/summary_metrics.json` (random / heuristic / base / SFT rewards + per-task SFT-over-base improvements) are written.
242
 
243
  ### Local run (small model)
244
 
@@ -246,22 +292,42 @@ Expected output: **21 passing** (domain rubric, incident catalog, environment in
246
  BASE_MODEL=Qwen/Qwen2.5-0.5B-Instruct python train_trl.py
247
  ```
248
 
249
- ### Colab / HF Spaces (T4 GPU)
250
 
251
  ```python
252
- # Cell 1
253
- !git clone https://github.com/<you>/CustomerSupportTicketRoutingEnv
254
- %cd CustomerSupportTicketRoutingEnv
255
- !pip install -r requirements.txt
 
256
 
257
  # Cell 2 — start the environment server in the background
258
- import subprocess, time
259
- server = subprocess.Popen(["uvicorn", "server.app:app", "--host", "127.0.0.1", "--port", "8000"])
260
- time.sleep(10)
261
-
262
- # Cell 3 — run baseline + SFT
263
  import os
264
- os.environ["BASE_MODEL"] = "Qwen/Qwen2.5-0.5B-Instruct"
265
  !python train_trl.py
266
  ```
267
 
@@ -302,22 +368,87 @@ the model emits invalid JSON.
302
 
303
  ## Training results
304
 
305
- ![Reward curve comparing heuristic coordinator vs random baseline](./artifacts/reward_curve.png)
306
 
307
- *Heuristic coordinator vs random baseline on all three task difficulties (same seed). The heuristic dominates at every difficulty: a clean behavioral gap that SFT on the same rollouts reinforces.*
308
 
309
- Summary metrics (from `artifacts/summary_metrics.json`):
310
 
311
  ```json
312
  {
313
- "base_model": "Qwen/Qwen2.5-0.5B-Instruct",
314
- "random_rewards": [ ... ],
315
- "heuristic_rewards": [ ... ],
316
- "improvement_absolute": [ ... ]
317
  }
318
  ```
319
 
320
- Training loss is saved by TRL to `outputs/sft_run/trainer_state.json` and prints to stdout every 5 steps. A typical run shows train loss dropping from ~3.1 → ~0.24 and mean-token accuracy climbing from ~0.5 → ~0.95 over a single epoch on ~135 rollout rows — evidence that the model is learning the structured action JSON the environment expects.
321
 
322
  ---
323
 
@@ -354,7 +485,7 @@ All tunables are environment variables so the image is 12-factor compatible:
354
  pytest tests/ -q
355
  ```
356
 
357
- Three test modules:
358
 
359
  - `tests/test_reward.py` — invariants of the rubric engine (capping, anti-gaming, tier scaling).
360
  - `tests/test_incidents.py` — catalog completeness, uniqueness, deterministic instantiation.
@@ -362,40 +493,85 @@ Three test modules:
362
 
363
  The domain suites are pure-python and run without `openenv-core` installed.
364
 
365
  ---
366
 
367
  ## Repository layout
368
 
369
  ```
370
  .
371
- ├── models.py # Pydantic schemas (IncidentAction / Observation / State)
372
- ├── client.py # Typed EnvClient (reset / step / state / close)
373
- ├── inference.py # HeuristicCoordinator + random baseline
374
- ├── train_trl.py # Rollout → SFT → evaluation → artifacts
375
- ├── openenv.yaml # OpenEnv manifest
376
- ├── pyproject.toml # Package metadata, extras, entry points
377
- ├── requirements.txt # Full stack requirements (training incl.)
378
- ├── Dockerfile # Root image (parity with server/Dockerfile)
379
- ├── artifacts/
380
- │ ├── reward_curve.png # Committed training-evidence plot
381
- │ └── summary_metrics.json # Committed training-evidence metrics
382
  ├── server/
383
- │ ├── app.py # FastAPI app with health/metrics/dashboard
384
- │ ├── environment.py # OpenEnv-compliant Environment implementation
385
- │ ├── config.py # 12-factor runtime configuration
386
- │ ├── logging_utils.py # Structured JSON logging
387
- │ ├── requirements.txt # Slim server image requirements
388
- │ ├── Dockerfile # Production image (HEALTHCHECK included)
 
 
389
  │ └── domain/
390
- │ ├── incidents.py # 13 enterprise incident templates + factory
391
- │ ├── reward.py # Composable rubric engine
392
- │ ├── roles.py # Role-based permission policy
393
- │ └── rng.py # Deterministic per-episode RNG
394
- └── tests/
395
- ├── conftest.py # sys.path + env defaults
396
- ├── test_reward.py # Rubric invariants
397
- ├── test_incidents.py # Catalog invariants
398
- └── test_environment.py # End-to-end environment tests
 
 
399
  ```
400
 
401
  ---
@@ -419,18 +595,22 @@ ENV_LOG_LEVEL: "INFO"
419
 
420
  ## Submission checklist
421
 
422
- - [x] OpenEnv latest runtime and `openenv validate` passing
423
- - [x] Multi-agent, long-horizon environment with role-gated action space
424
- - [x] Composable, transparent, anti-gaming reward rubric
425
- - [x] Business-impact-aware scoring (customer tier, revenue, SLA)
426
- - [x] 13 incident templates across 3 difficulties with red herrings and playbooks
427
- - [x] End-to-end TRL SFT pipeline committed (`train_trl.py`)
428
- - [x] Real training artifacts committed (`artifacts/reward_curve.png`, `artifacts/summary_metrics.json`)
429
- - [x] 21 passing unit tests
430
- - [x] Production-quality HTTP server: `/healthz`, `/version`, `/env-info`, `/metrics`, Dockerfile with `HEALTHCHECK`
431
- - [x] Structured JSON logging + 12-factor configuration
432
- - [ ] Hugging Face Space URL (fill me in)
433
- - [ ] 2-minute demo video or HF blog (fill me in)
434
 
435
  ---
436
 
 
23
 
24
  [![Tests](https://img.shields.io/badge/tests-21%20passing-brightgreen)](./tests) [![OpenEnv](https://img.shields.io/badge/OpenEnv-v0.2%2B-blue)](https://github.com/meta-pytorch/openenv) [![License](https://img.shields.io/badge/license-MIT-blue.svg)](./LICENSE) ![Python](https://img.shields.io/badge/python-3.10%2B-blue)
25
 
26
+ ### Live links
27
+
28
+ | What | Where |
29
+ |---|---|
30
+ | **Live environment (OpenEnv-compatible)** | **[`https://swapnilpatil28-multi-agent-incident-command-center.hf.space`](https://swapnilpatil28-multi-agent-incident-command-center.hf.space)** |
31
+ | Hugging Face Space page | **[`huggingface.co/spaces/SwapnilPatil28/Multi-Agent-Incident-Command-Center`](https://huggingface.co/spaces/SwapnilPatil28/Multi-Agent-Incident-Command-Center)** |
32
+ | GitHub repository | **[`github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center`](https://github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center)** |
33
+ | Training notebook (Colab T4, one-click reproducible) | **[Open in Colab ↗](https://colab.research.google.com/drive/1vx9E5FrZZrHoRwXs2cvtom3DaI6kZ3LP?usp=sharing)** |
34
+ | 2-minute video walkthrough | *Coming soon — [`docs/VIDEO_SCRIPT.md`](./docs/VIDEO_SCRIPT.md) has the shot list* |
35
+ | Mini blog post | *Coming soon — full draft in [`docs/BLOG_POST.md`](./docs/BLOG_POST.md), ready to publish on hf.co/blog* |
36
+ | Training script (Python) | [`train_trl.py`](./train_trl.py) |
37
+
38
  Three specialist agents — **Triage**, **Investigator**, and **Ops Manager** — cooperate to resolve a queue of production incidents while operating under strict **SLA budgets**, **investigation costs**, and **customer-tier impact multipliers**. The environment is designed to reward *real* operational reasoning, not pattern matching on the root-cause label.
39
 
40
+ This repository is the hackathon submission for the **OpenEnv India 2026 Round 2** finals across three themes simultaneously:
41
 
42
  - **Theme #1 Multi-Agent Interactions** — role-gated action space, negotiation, handoff.
43
  - **Theme #2 (Super) Long-Horizon Planning** — delayed rewards, carried constraints across multiple incidents, postmortem requirements.
 
78
  | **Anti-gaming** | Clue bonuses are unique per root-cause keyword; repeated lookups get a small penalty. Closing without enough clues triggers an under-investigated penalty even when the guess is right. |
79
  | **Carry-over state** | Budget and SLA decrement across the whole incident queue, so early sloppy episodes ruin later ones. Postmortems must be filed for high-impact incidents. |
80
 
81
+ ### Mapping to the hackathon themes
82
+
83
+ One environment, three themes checked — each one addressed by a concrete mechanic, not just a claim:
84
+
85
+ | Hackathon theme | How this environment satisfies it |
86
+ |---|---|
87
+ | **Theme #1 — Multi-Agent Interactions** | Three *distinct* specialist roles (`triage_agent`, `investigator_agent`, `ops_manager_agent`) with **non-overlapping permissions**. `negotiate_handoff` scores correct cooperation (+0.15) and wrong owners (−0.10). `wrong_actor_penalty` (−0.08) teaches the *belief* that "I should pick the right specialist for this phase" — a minimal theory-of-mind signal over who-can-do-what. |
88
+ | **Theme #2 — (Super) Long-Horizon Planning** | **Each episode carries 3–5 sequential incidents** under a single investigation budget and a single ticking SLA counter. Rewards are **sparse and delayed**: the +0.80 closure reward only fires when you pick the right root cause after collecting enough clues, running a correct mitigation, and filing a postmortem — steps that may happen 20–60 turns apart. Early sloppy episodes visibly corrupt later ones via the shared budget/SLA. |
89
+ | **Theme #3.1 — World Modeling (Professional Tasks)** | Incidents carry **realistic logs, metrics, and KB articles** with **red-herring signals mixed into real ones**, making root-cause identification require *tool-use discipline*, not shortcut guessing. Customer tiers, affected-user counts, and $/min revenue impact create a **persistent business world-model** that the agent has to reason about: closing an enterprise incident incorrectly costs ~3x what closing a free-tier one costs (tier multipliers ×1.8 vs ×0.6). |
90
+
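The role gating described in the Theme #1 row can be pictured as a lookup from role to permitted actions. This is a minimal sketch, not the code in `server/domain/roles.py`: the three role names and the −0.08 penalty come from the README, while the assignment of actions to roles and the lookup action names (`inspect_logs`, `inspect_metrics`, `consult_kb`) are assumptions made for illustration.

```python
# Hypothetical permission map. Role names are from the README; which role
# owns which action is an assumption made for this sketch.
ROLE_PERMISSIONS = {
    "triage_agent": {"negotiate_handoff", "escalate"},
    "investigator_agent": {"inspect_logs", "inspect_metrics", "consult_kb"},
    "ops_manager_agent": {"apply_fix", "rollback", "submit_postmortem", "close_incident"},
}

WRONG_ACTOR_PENALTY = -0.08  # value quoted in the README reward table


def actor_penalty(role: str, action_type: str) -> float:
    """Return the wrong-actor penalty, or 0.0 when the role may perform the action."""
    allowed = ROLE_PERMISSIONS.get(role, set())
    return 0.0 if action_type in allowed else WRONG_ACTOR_PENALTY


print(actor_penalty("ops_manager_agent", "apply_fix"))
print(actor_penalty("triage_agent", "apply_fix"))
```

Keeping the permission sets disjoint is what makes the penalty informative: an agent that picks the wrong specialist for a phase is always charged, so the signal cannot be avoided by guessing.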
91
  ---
92
 
93
  ## Architecture
 
158
 
159
  ## Reward model
160
 
161
+ The rubric engine lives in [`server/domain/reward.py`](./server/domain/reward.py) and [`server/environment.py`](./server/environment.py). Every step accumulates named components that are summed into the final reward and echoed back to the agent in `observation.reward_components`.
162
+
163
+ ### Step-level components (what each action pays or earns)
164
 
165
  | Component | Typical value | Triggers |
166
  |---|---:|---|
167
+ | `step_cost` | −0.01 … −0.08 | Every action (type-specific: `-0.01` postmortem, `-0.02` handoff/fix, `-0.03` KB, `-0.04` logs/metrics, `-0.05` escalate, `-0.08` rollback) |
168
+ | `wrong_actor_penalty` | −0.08 | Action invoked by a role not authorised for it |
169
+ | `invalid_action` | −0.25 | Unrecognised `action_type` |
170
+ | `clue_bonus` | **+0.12** | Lookup surfaces a *new* root-cause keyword (capped at 3 per incident) |
171
  | `repeated_lookup_penalty` | −0.02 | Same clue keyword surfaced again |
172
  | `handoff_correct` / `handoff_wrong` | **+0.15** / −0.10 | Handoff target matches the incident's expected owner |
173
+ | `mitigation_correct` / `mitigation_wrong` / `mitigation_empty` | **+0.35** / −0.30 / −0.30 | `apply_fix` text matches accepted fix keywords |
174
+ | `rollback_effective` / `rollback_ineffective` | +0.20 / −0.15 | `rollback` summary aligns with the incident's accepted playbook |
175
+ | `escalation_needed` / `escalation_not_needed` | +0.10 / −0.10 | Escalation raised for an incident that actually meets the paging threshold (≥50K users OR ≥$800/min OR postmortem required) |
176
+ | `postmortem_logged` / `postmortem_empty` | +0.05 / −0.10 | `submit_postmortem` with/without a `postmortem_note` |
177
+
178
+ ### Closure components (scored when `close_incident` fires)
179
+
180
+ | Component | Typical value | Triggers |
181
+ |---|---:|---|
182
+ | `closure_correct` | **+0.80 × tier** | Correct root cause, tier multiplier: free ×0.6, standard ×1.0, premium ×1.4, enterprise ×1.8 |
183
+ | `closure_wrong` | **−1.10 × tier** | Wrong root cause, scaled by tier |
184
+ | `closure_mitigation_bonus` | +0.30 | Closed *after* a successful `apply_fix` |
185
+ | `closure_no_mitigation` | −0.15 | Closed on a mitigation-required incident without having applied one |
186
  | `closure_under_investigated` | −0.20 | Closed before collecting the required number of clues |
187
  | `speed_bonus` | +0.10 … +0.20 | Resolved in ≤ 7 / ≤ 4 steps on that incident |
188
+ | `postmortem_bonus` / `postmortem_missing` | +0.12 / −0.15 | Postmortem filed (or not) for a high-impact incident |
189
+
190
+ ### Terminal components (episode-ending penalties)
191
+
192
+ | Component | Typical value | Triggers |
193
+ |---|---:|---|
194
+ | `sla_exhausted` | **−1.2 × tier** | Global SLA minutes hit zero while an incident is still open |
195
  | `budget_exhausted` | −1.5 | Investigation action budget hit zero |
196
 
197
+ Every component is persisted to `observation.reward_components`, surfaced in Prometheus `/metrics`, and aggregated into the `reward_components_by_policy` block of [`artifacts/summary_metrics.json`](./artifacts/summary_metrics.json).
198
+
199
  Design goals:
200
 
201
+ 1. **Transparent** — agents and humans can see *why* each step was scored (the [Reward components](#3-reward-components--where-each-policy-actually-earns-reward) chart below is the rubric made visible).
202
+ 2. **Hard to game** — unique clue bonuses, under-investigation penalty, role gating, anti-churn `rollback_ineffective` and `escalation_not_needed`.
203
  3. **Business-aware** — tier multipliers mirror real enterprise SLA contracts.
204
 
205
  ---
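To make the tier scaling in the closure table concrete, here is a small sketch of how named components could be assembled and summed for a `close_incident` step. The multipliers and base values are the ones listed in the README tables; the function names and the dictionary layout are hypothetical and are not taken from `server/domain/reward.py`.

```python
# Tier multipliers and base values as listed in the README reward tables.
TIER_MULTIPLIER = {"free": 0.6, "standard": 1.0, "premium": 1.4, "enterprise": 1.8}
CLOSURE_CORRECT = 0.80
CLOSURE_WRONG = -1.10
CLOSURE_MITIGATION_BONUS = 0.30


def closure_components(tier: str, correct: bool, mitigated: bool = False) -> dict:
    """Build the named closure components for one close_incident step (sketch)."""
    multiplier = TIER_MULTIPLIER[tier]
    components = {}
    if correct:
        components["closure_correct"] = CLOSURE_CORRECT * multiplier
        if mitigated:
            # The mitigation bonus is a flat value in the table, not tier-scaled.
            components["closure_mitigation_bonus"] = CLOSURE_MITIGATION_BONUS
    else:
        components["closure_wrong"] = CLOSURE_WRONG * multiplier
    return components


def step_reward(components: dict) -> float:
    """The step reward is the sum of its named components."""
    return sum(components.values())


for tier in TIER_MULTIPLIER:
    wrong = step_reward(closure_components(tier, correct=False))
    print(tier, round(wrong, 2))
```

Under these values a wrong enterprise closure costs 1.10 × 1.8 = 1.98, three times the 0.66 charged for a wrong free-tier closure.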
 
221
  ### 1. Clone and install
222
 
223
  ```bash
224
+ git clone https://github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center.git
225
+ cd Multi-Agent-Incident-Command-Center
226
 
227
  python -m venv .venv
228
  # Windows PowerShell
 
279
  1. **Rollout** — the `HeuristicCoordinator` drives the live environment to collect `(prompt, completion)` pairs. Prompts include customer tier, revenue impact, visible signals and investigation targets; completions are structured JSON actions.
280
  2. **SFT** — the dataset is collapsed into a single `text` column (robust across TRL ≥ 0.20) and fed to `SFTTrainer`. The fine-tuned weights + tokenizer are saved to `artifacts/sft_model/`.
281
  3. **Evaluation** — four policies are rolled out under identical seeds: `random`, `heuristic`, `base_model` (raw `BASE_MODEL` HF checkpoint), and `sft_model` (the fine-tuned checkpoint just saved). LLM evaluation auto-enables on a CUDA GPU; force it with `EVAL_LLM_MODELS=true` or disable with `EVAL_LLM_MODELS=false`.
282
+ 4. **Artifacts** — a single run writes all five evidence files committed to [`artifacts/`](./artifacts):
283
+ - `reward_curve.png` (4 lines: random / heuristic / base / SFT vs easy/medium/hard, both axes labelled)
284
+ - `training_curve.png` (TRL loss + mean token accuracy vs training step)
285
+ - `reward_components.png` (stacked bars showing *where* each policy's reward came from)
286
+ - `training_log.json` (full `trainer.state.log_history` for reproducibility)
287
+ - `summary_metrics.json` (random / heuristic / base / SFT rewards + per-task `improvement_sft_over_base` + `reward_components_by_policy`)
288
 
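Step 2 above says the `(prompt, completion)` pairs are collapsed into a single `text` column before SFT. A rough sketch of that collapse is shown below, assuming a plain newline separator and JSON-serialised actions; the actual template, separator, and field names used by `train_trl.py` are not shown in this diff, so everything here is illustrative.

```python
import json


def to_text_rows(pairs, separator="\n"):
    """Collapse (prompt, action) pairs into rows with a single 'text' field.

    The separator and the JSON serialisation are assumptions for this sketch.
    """
    rows = []
    for prompt, action in pairs:
        completion = json.dumps(action, sort_keys=True)
        rows.append({"text": f"{prompt}{separator}{completion}"})
    return rows


# Hypothetical rollout pair; the prompt wording and action fields are made up.
example_pairs = [
    (
        "tier=enterprise revenue_per_min=900 signals=[db latency]",
        {"actor": "investigator_agent", "action_type": "inspect_logs"},
    ),
]

rows = to_text_rows(example_pairs)
print(rows[0]["text"])
```

A list of such rows can then be wrapped in whatever dataset type the trainer expects; a single string column avoids depending on prompt/completion column conventions that differ between trainer versions.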
289
  ### Local run (small model)
290
 
 
292
  BASE_MODEL=Qwen/Qwen2.5-0.5B-Instruct python train_trl.py
293
  ```
294
 
295
+ ### Colab (T4 GPU) — one-click reproducible
296
+
297
+ **[Open the full training notebook on Colab ↗](https://colab.research.google.com/drive/1vx9E5FrZZrHoRwXs2cvtom3DaI6kZ3LP?usp=sharing)**
298
+
299
+ Or run the cells manually:
300
 
301
  ```python
302
+ # Cell 1 — clone and install
303
+ !git clone https://github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center.git /content/repo
304
+ %cd /content/repo
305
+ !pip install -q -r requirements.txt
306
+ !pip install -q "openenv-core[core]>=0.2.2"
307
 
308
  # Cell 2 — start the environment server in the background
309
+ import subprocess, time, os, requests
310
+ os.environ["ENV_STRUCTURED_LOGGING"] = "false"
311
+ server = subprocess.Popen(
312
+     ["uvicorn", "server.app:app", "--host", "127.0.0.1", "--port", "8000"],
313
+     stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
314
+ )
315
+ for _ in range(30):
316
+     try:
317
+         if requests.get("http://127.0.0.1:8000/healthz", timeout=1).status_code == 200:
318
+             print("server up"); break
319
+     except Exception:
320
+         time.sleep(1)
321
+
322
+ # Cell 3 — full pipeline (dataset → SFT → evaluate 4 policies → plots)
323
  import os
324
+ os.environ["BASE_MODEL"] = "Qwen/Qwen2.5-1.5B-Instruct"
325
+ os.environ["ENV_URL"] = "http://127.0.0.1:8000"
326
+ os.environ["EVAL_LLM_MODELS"] = "true"
327
+ os.environ["EPISODES_PER_TASK"] = "8"
328
+ os.environ["TRAIN_EPOCHS"] = "3"
329
+ os.environ["TRAIN_MAX_LENGTH"] = "1024"
330
+ os.environ["MAX_LLM_EVAL_STEPS"] = "120"
331
  !python train_trl.py
332
  ```
333
 
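The readiness loop in Cell 2 sleeps only when the request raises, so a non-200 response would be retried immediately. One possible variant, sketched here with a hypothetical helper name, waits between every failed attempt and takes the probe as a callable so it can be exercised without a running server.

```python
import time
from typing import Callable


def wait_until_healthy(
    probe: Callable[[], bool],
    attempts: int = 30,
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call probe() until it returns True, pausing after every failed attempt."""
    for _ in range(attempts):
        try:
            if probe():
                return True
        except Exception:
            # Treat connection errors the same as an unhealthy response.
            pass
        sleep(delay)
    return False


# Offline demonstration with a fake probe that succeeds on the third call.
state = {"calls": 0}


def fake_probe() -> bool:
    state["calls"] += 1
    return state["calls"] >= 3


print(wait_until_healthy(fake_probe, delay=0.0))
```

In the notebook the probe could be a lambda wrapping the same call Cell 2 already makes, for example `lambda: requests.get("http://127.0.0.1:8000/healthz", timeout=1).status_code == 200`.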
 
368
 
369
  ## Training results
370
 
371
+ Four policies (**random**, **heuristic**, **base Qwen2.5-1.5B-Instruct**, **SFT fine-tuned**) evaluated under identical seeds across all three task difficulties. All three plots below are produced automatically by a single `python train_trl.py` run and committed to [`artifacts/`](./artifacts).
372
+
373
+ ### Headline: SFT closes a +10-point reward gap on hard incidents
374
+
375
+ | Task | Random | Base LLM | **Fine-tuned LLM** | Heuristic (oracle) |
376
+ |---|---:|---:|---:|---:|
377
+ | easy | -5.96 | -2.92 | **-4.72** | -4.72 |
378
+ | medium | -11.48 | -4.00 | **-0.87** | -0.87 |
379
+ | hard | -12.50 | -4.28 | **+5.89** | +5.89 |
380
+ | **SFT − Base** | — | — | **-1.80 / +3.13 / +10.17** | — |
381
+
382
+ > **Why SFT matches the heuristic component-for-component:** the environment is deterministic (same task → same incidents → same observations), and so is the heuristic (same observation → same action). With TRL SFT achieving ~0.99 token accuracy, the student memorises the teacher's policy and reproduces it under greedy decoding. Behavior cloning has converged to the expert. The meaningful comparison is therefore **SFT vs the untrained base model**, where fine-tuning earns **+10.17 reward on hard-difficulty incidents** and unlocks closure/mitigation/postmortem reward components the base model never fires.
383
+
384
+ ### 1. Reward curve — four policies head-to-head
385
+
386
+ ![Reward curve comparing random / heuristic / base LLM / fine-tuned LLM on easy, medium, and hard tasks](./artifacts/reward_curve.png)
387
 
388
+ *Random (red) is the floor. Base LLM (orange) already beats random on easy by producing structured JSON but plateaus because it never learns to close an incident. **Fine-tuned LLM (green) climbs sharply with difficulty**, reaching +5.89 on hard, matching the hand-coded expert.*
389
 
390
+ ### 2. Training curve — loss drops, token accuracy climbs
391
+
392
+ ![TRL SFT training loss and mean token accuracy vs training step — loss from ~2.8 to ~0.02, token accuracy from 0.49 to 0.99](./artifacts/training_curve.png)
393
+
394
+ *Qwen2.5-1.5B-Instruct fine-tuned for 3 epochs on 680 rollout examples. Loss falls from ~2.84 → ~0.02; mean token accuracy climbs from ~0.49 to ~0.99. Satisfies the hackathon "loss AND reward plots" minimum requirement.*
395
+
396
+ ### 3. Reward components — where each policy actually earns reward
397
+
398
+ ![Reward components earned per policy summed across all three tasks — fine-tuned model unlocks closure_correct, mitigation_correct, handoff_correct that the base model never earns](./artifacts/reward_components.png)
399
+
400
+ *This chart is the rubric made visible. **Random** gets crushed by `closure_wrong` and `wrong_actor_penalty`. **Base LLM** only earns `clue_bonus`, then bleeds out via `step_cost` and `sla_exhausted` — it never closes an incident. **Fine-tuned LLM** and the **heuristic** both unlock the positive-reward components (`closure_correct +7.36`, `mitigation_correct +2.10`, `closure_mitigation_bonus +1.80`, `postmortem_bonus +0.60`). Training has redirected the LLM's reward mass from "bleeding" to "solving."*
401
+
402
+ ### 4. Summary metrics
403
+
404
+ The full numbers live in [`artifacts/summary_metrics.json`](./artifacts/summary_metrics.json). Top-level excerpt:
405
 
406
  ```json
407
  {
408
+ "base_model": "Qwen/Qwen2.5-1.5B-Instruct",
409
+ "dataset_rows": 680,
410
+ "episodes_per_task": 8,
411
+ "random_rewards": [ -5.96, -11.48, -12.50 ],
412
+ "heuristic_rewards": [ -4.72, -0.87, 5.89 ],
413
+ "base_model_rewards": [ -2.92, -4.00, -4.28 ],
414
+ "sft_model_rewards": [ -4.72, -0.87, 5.89 ],
415
+ "improvement_sft_over_base": [ -1.80, 3.13, 10.17 ],
416
+ "improvement_heuristic_over_random":[ 1.24, 10.61, 18.39 ]
417
  }
418
  ```
419
 
420
+ Full `reward_components_by_policy` (used to generate plot 3) is included alongside.
421
+
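The two improvement rows in the excerpt are per-task differences of the reward lists, which can be checked directly. The sketch below uses the numbers quoted in the README; the helper name `per_task_delta` is hypothetical and is not a function from `train_trl.py`.

```python
# Per-task mean rewards as quoted in the README (easy, medium, hard).
random_rewards = [-5.96, -11.48, -12.50]
heuristic_rewards = [-4.72, -0.87, 5.89]
base_model_rewards = [-2.92, -4.00, -4.28]
sft_model_rewards = [-4.72, -0.87, 5.89]


def per_task_delta(after, before):
    """Element-wise improvement of one policy over another, rounded to 2 dp."""
    return [round(a - b, 2) for a, b in zip(after, before)]


improvement_sft_over_base = per_task_delta(sft_model_rewards, base_model_rewards)
improvement_heuristic_over_random = per_task_delta(heuristic_rewards, random_rewards)

print(improvement_sft_over_base)
print(improvement_heuristic_over_random)
```

The negative first entry reflects that on easy tasks the base model's reward (−2.92) is higher than the fine-tuned model's (−4.72), so the gain from SFT is concentrated in medium and hard tasks.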
422
+ ### 5. Ablation: model scale matters for imitation learning
423
+
424
+ The same pipeline with the **smaller Qwen2.5-0.5B-Instruct** backbone, **identical seeds and environment config** (so random / heuristic numbers are directly comparable), but a smaller training dataset (3 episodes/task → 255 rows vs 8 episodes/task → 680 rows):
425
+
426
+ ![Reward curve — four policies on Qwen2.5-0.5B-Instruct](./artifacts/reward_curve_qwen0p5b.png)
427
+
428
+ | Task | Random | Base 0.5B | **SFT 0.5B** | Heuristic | **SFT − Base (0.5B)** |
429
+ |---|---:|---:|---:|---:|---:|
430
+ | easy | -5.96 | -2.92 | **-2.49** | -4.72 | +0.43 |
431
+ | medium | -11.48 | -4.00 | **-3.86** | -0.87 | +0.14 |
432
+ | hard | -12.50 | -2.40 | **-2.40** | +5.89 | **0.00** |
433
+
434
+ **The punchline: scale is the story.** With the 0.5B backbone, SFT delivers only a **+0.43 / +0.14 / +0.00** improvement over the base model and **never closes a single hard incident**. Bumping the backbone to **1.5B** (same SFT code, same data pipeline, same environment) unlocks a **-1.80 / +3.13 / +10.17** improvement and makes the LLM match the heuristic's component-for-component behavior on hard incidents.
435
+
436
+ | Run config | 0.5B | **1.5B (headline)** |
437
+ |---|---|---|
438
+ | Model | Qwen2.5-0.5B-Instruct | Qwen2.5-1.5B-Instruct |
439
+ | Episodes / task (rollout) | 3 | 8 |
440
+ | Dataset rows | 255 | 680 |
441
+ | Train epochs | 1 | 3 |
442
+ | Base → SFT improvement on **hard** | **+0.00** | **+10.17** |
443
+ | Hard incidents closed by SFT | 0 | full heuristic behavior |
444
+
445
+ Interpretation: **at 0.5B the model appears too small to absorb the multi-step, role-gated policy from SFT**, even though it can emit syntactically valid JSON. At 1.5B, behavior cloning converges on the full action schedule. Because the 1.5B run also used more rollout data (680 vs 255 rows) and more epochs (3 vs 1), model scale is the most visible difference between the runs rather than an isolated variable. This is the kind of finding the environment is designed to surface: *the rubric makes it visible in one plot*, not hidden behind a single aggregate score.
446
+
447
+ Raw numbers live in [`artifacts/summary_metrics_qwen0p5b.json`](./artifacts/summary_metrics_qwen0p5b.json).
448
+
449
+ ### Reproduce the whole training run
450
+
451
+ One click: **[Open Colab ↗](https://colab.research.google.com/drive/1vx9E5FrZZrHoRwXs2cvtom3DaI6kZ3LP?usp=sharing)** (T4 GPU, ~1 h 15 min wall clock end-to-end, including base-model + SFT-model evaluation).
452
 
453
  ---
454
 
 
485
  pytest tests/ -q
486
  ```
487
 
488
+ Expected: `21 passed`. Three test modules:
489
 
490
  - `tests/test_reward.py` — invariants of the rubric engine (capping, anti-gaming, tier scaling).
491
  - `tests/test_incidents.py` — catalog completeness, uniqueness, deterministic instantiation.
 
493
 
494
  The domain suites are pure-python and run without `openenv-core` installed.
495
 
496
+ ### Pre-submission smoke tests
497
+
498
+ Two scripts judges (or you) can run without a local IDE:
499
+
500
+ ```bash
501
+ # 1. Local: manifest + files + domain tests
502
+ ./pre_validate.sh
503
+
504
+ # 2. Remote: hit the deployed HF Space end-to-end
505
+ ./validate-submission.sh https://swapnilpatil28-multi-agent-incident-command-center.hf.space
506
+ ```
507
+
508
+ [`pre_validate.sh`](./pre_validate.sh) runs the OpenEnv validator against the local manifest, confirms the training / inference scripts exist, and re-runs the domain test suite. [`validate-submission.sh`](./validate-submission.sh) pings `/reset` + `/healthz` on a live URL, checks the `Dockerfile` is in the submitted tree, and re-runs `openenv validate` — exactly what the judges' CI pipeline expects.
509
+
510
  ---
511
 
512
  ## Repository layout
513
 
514
  ```
515
  .
516
+ ├── README.md # This file
517
+ ├── LICENSE # MIT
518
+ ├── openenv.yaml # OpenEnv manifest (version 3.0)
519
+ ├── pyproject.toml # Package metadata + entry points
520
+ ├── requirements.txt # Full stack (server + training)
521
+ ├── uv.lock # Reproducible dependency lock
522
+ ├── Dockerfile # Root image (parity with server/Dockerfile)
523
+ ├── .dockerignore # Keeps the image small
524
+ ├── .gitignore # Excludes venv / artifacts-cache
525
+ ├── .gitattributes # EOL normalization
526
+ ├── __init__.py # Makes the repo root importable for tests
527
+
528
+ ├── models.py # Pydantic schemas (IncidentAction/Observation/State)
529
+ ├── client.py # Typed EnvClient (reset / step / state / close)
530
+ ├── inference.py # HeuristicCoordinator + random baseline + POLICY_MODEL hook
531
+ ├── llm_policy.py # HF causal-LM → environment-ready policy wrapper
532
+ ├── train_trl.py # Rollout → SFT → 4-policy evaluation → plots
533
+
534
+ ├── pre_validate.sh # Local 5-step pre-submission smoke test
535
+ ├── validate-submission.sh # Remote /reset + /healthz + openenv validate against Space
536
+
537
+ ├── scripts/
538
+ │ └── before_after_demo.py # Side-by-side base vs SFT trace generator
539
+
540
+ ├── docs/
541
+ │ ├── BLOG_POST.md # HF blog draft (publish to hf.co/blog)
542
+ │ ├── VIDEO_SCRIPT.md # 2-minute YouTube script with link list
543
+ │ └── SUBMISSION_CHECKLIST.md # Judging-criteria checklist + smoke tests
544
+
545
+ ├── artifacts/ # All committed training evidence
546
+ │ ├── reward_curve.png # 4-policy reward comparison (1.5B headline)
547
+ │ ├── training_curve.png # TRL SFT loss + token accuracy (1.5B)
548
+ │ ├── reward_components.png # Per-policy rubric breakdown (1.5B)
549
+ │ ├── training_log.json # Full TRL log history (1.5B)
550
+ │ ├── summary_metrics.json # All reward + component numbers (1.5B)
551
+ │ ├── reward_curve_qwen0p5b.png # Ablation: same pipeline on 0.5B backbone
552
+ │ └── summary_metrics_qwen0p5b.json # Ablation numbers
553
+
554
  ├── server/
555
+ │ ├── __init__.py
556
+ │ ├── app.py # FastAPI app with health/metrics/dashboard
557
+ │ ├── environment.py # OpenEnv-compliant Environment implementation
558
+ │ ├── support_env_environment.py # Backward-compat alias module
559
+ │ ├── config.py # 12-factor runtime configuration
560
+ │ ├── logging_utils.py # Structured JSON logging
561
+ │ ├── requirements.txt # Slim server image requirements
562
+ │ ├── Dockerfile # Production image (HEALTHCHECK included)
563
  │ └── domain/
564
+ │ ├── __init__.py
565
+ │ ├── incidents.py # 13 enterprise incident templates + factory
566
+ │ ├── reward.py # Composable rubric engine (20+ components)
567
+ ── roles.py # Role-based permission policy
568
+ └── rng.py # Deterministic per-episode RNG
569
+
570
+ ── tests/ # 21 passing tests
571
+ ├── conftest.py # sys.path + env defaults
572
+ ── test_reward.py # Rubric invariants (capping, anti-gaming, tier scaling)
573
+ ├── test_incidents.py # Catalog invariants (uniqueness, determinism)
574
+ └── test_environment.py # reset/step invariants, wrong-actor, closure
575
  ```
576
 
577
  ---
 

  ## Submission checklist

+ Full checklist with pre-submission smoke tests: [`docs/SUBMISSION_CHECKLIST.md`](./docs/SUBMISSION_CHECKLIST.md).
+
+ - [x] **OpenEnv latest runtime** and `openenv validate` passing — [Space live](https://swapnilpatil28-multi-agent-incident-command-center.hf.space)
+ - [x] **Multi-agent, long-horizon environment** with role-gated action space (3 roles × 9 actions, 13 incidents)
+ - [x] **Composable, transparent, anti-gaming reward rubric** (14+ named components, tier-scaled)
+ - [x] **Business-impact-aware scoring** (customer tier, revenue impact, SLA countdown)
+ - [x] **End-to-end TRL SFT pipeline** that saves a checkpoint and re-evaluates it in the environment ([`train_trl.py`](./train_trl.py))
+ - [x] **Reward curve + training-loss curve + reward-components chart** committed to [`artifacts/`](./artifacts)
+ - [x] **Concrete SFT-over-base improvement**: **+10.17 reward on hard-difficulty incidents**
+ - [x] **21 passing unit tests** (domain invariants + environment integration)
+ - [x] **Production-quality HTTP server**: `/healthz`, `/version`, `/env-info`, `/metrics`, Dockerfile with `HEALTHCHECK`
+ - [x] **Structured JSON logging** + 12-factor configuration
+ - [x] **One-click Colab training notebook** → [Open ↗](https://colab.research.google.com/drive/1vx9E5FrZZrHoRwXs2cvtom3DaI6kZ3LP?usp=sharing)
+ - [x] **Blog draft** ([`docs/BLOG_POST.md`](./docs/BLOG_POST.md)) + **video script** ([`docs/VIDEO_SCRIPT.md`](./docs/VIDEO_SCRIPT.md))
+ - [ ] Publish the Hugging Face blog post and swap the "Coming soon" link in the Live-links table
+ - [ ] Upload the YouTube video and swap the "Coming soon" link in the Live-links table

  ---

artifacts/reward_components.png ADDED

Git LFS Details

  • SHA256: ee525913d499b4e4a5dc4a00c28b0d25df9f674ca6aec0bc5959ff0c55654938
  • Pointer size: 131 Bytes
  • Size of remote file: 162 kB
artifacts/reward_curve.png CHANGED
artifacts/reward_curve_qwen0p5b.png ADDED
artifacts/summary_metrics.json CHANGED
@@ -1,14 +1,89 @@
  {
-   "base_model": "Qwen/Qwen2.5-0.5B-Instruct",
-   "dataset_rows": 135,
+   "base_model": "Qwen/Qwen2.5-1.5B-Instruct",
+   "dataset_rows": 680,
+   "episodes_per_task": 8,
    "random_rewards": [
-     -3.2300000000000004,
-     -5.53,
-     -7.03
+     -5.96,
+     -11.48,
+     -12.5
    ],
    "heuristic_rewards": [
-     -3.02,
-     -1.6900000000000002,
-     -0.13999999999999996
-   ]
+     -4.72,
+     -0.87,
+     5.89
+   ],
+   "base_model_rewards": [
+     -2.92,
+     -4.0,
+     -4.28
+   ],
+   "sft_model_rewards": [
+     -4.72,
+     -0.87,
+     5.89
+   ],
+   "improvement_sft_over_base": [
+     -1.8,
+     3.13,
+     10.17
+   ],
+   "improvement_heuristic_over_random": [
+     1.24,
+     10.61,
+     18.39
+   ],
+   "reward_components_by_policy": {
+     "random": {
+       "wrong_actor_penalty": -3.12,
+       "closure_wrong": -17.82,
+       "step_cost": -2.61,
+       "postmortem_empty": -1.0,
+       "escalation_not_needed": -0.3,
+       "clue_bonus": 0.48,
+       "handoff_wrong": -0.8,
+       "mitigation_wrong": -2.1,
+       "rollback_ineffective": -1.65,
+       "sla_exhausted": -1.2,
+       "repeated_lookup_penalty": -0.02,
+       "escalation_needed": 0.2
+     },
+     "heuristic": {
+       "step_cost": -2.02,
+       "clue_bonus": 2.52,
+       "handoff_wrong": -0.8,
+       "mitigation_wrong": -2.1,
+       "closure_wrong": -9.9,
+       "repeated_lookup_penalty": -0.16,
+       "handoff_correct": 0.75,
+       "postmortem_logged": 0.35,
+       "mitigation_correct": 2.1,
+       "closure_correct": 7.36,
+       "closure_mitigation_bonus": 1.8,
+       "speed_bonus": 0.6,
+       "postmortem_bonus": 0.6,
+       "closure_under_investigated": -0.8
+     },
+     "base_model": {
+       "step_cost": -5.16,
+       "clue_bonus": 0.24,
+       "repeated_lookup_penalty": -1.24,
+       "sla_exhausted": -5.04
+     },
+     "sft_model": {
+       "step_cost": -2.02,
+       "clue_bonus": 2.52,
+       "handoff_wrong": -0.8,
+       "mitigation_wrong": -2.1,
+       "closure_wrong": -9.9,
+       "repeated_lookup_penalty": -0.16,
+       "handoff_correct": 0.75,
+       "postmortem_logged": 0.35,
+       "mitigation_correct": 2.1,
+       "closure_correct": 7.36,
+       "closure_mitigation_bonus": 1.8,
+       "speed_bonus": 0.6,
+       "postmortem_bonus": 0.6,
+       "closure_under_investigated": -0.8
+     }
+   }
  }
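The numbers in `reward_components_by_policy` can be cross-checked against the per-difficulty reward lists. The sketch below copies two policies from the JSON above and compares the sum of the component totals with the sum of the three rewards. Whether `train_trl.py` guarantees this identity by construction is an assumption, and the name `component_gap` is hypothetical; the sketch only checks it on the committed values.

```python
from typing import Dict, List

# Values copied from artifacts/summary_metrics.json (two of the four policies).
REWARDS: Dict[str, List[float]] = {
    "heuristic": [-4.72, -0.87, 5.89],
    "base_model": [-2.92, -4.0, -4.28],
}
COMPONENTS: Dict[str, Dict[str, float]] = {
    "heuristic": {
        "step_cost": -2.02, "clue_bonus": 2.52, "handoff_wrong": -0.8,
        "mitigation_wrong": -2.1, "closure_wrong": -9.9,
        "repeated_lookup_penalty": -0.16, "handoff_correct": 0.75,
        "postmortem_logged": 0.35, "mitigation_correct": 2.1,
        "closure_correct": 7.36, "closure_mitigation_bonus": 1.8,
        "speed_bonus": 0.6, "postmortem_bonus": 0.6,
        "closure_under_investigated": -0.8,
    },
    "base_model": {
        "step_cost": -5.16, "clue_bonus": 0.24,
        "repeated_lookup_penalty": -1.24, "sla_exhausted": -5.04,
    },
}


def component_gap(rewards: List[float], components: Dict[str, float]) -> float:
    """Absolute difference between summed components and summed rewards."""
    return round(abs(sum(components.values()) - sum(rewards)), 2)


for policy, rewards in REWARDS.items():
    print(policy, round(sum(rewards), 2), component_gap(rewards, COMPONENTS[policy]))
```

A non-zero gap for any policy would suggest that the component breakdown and the reward lists were produced from different evaluation runs.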
artifacts/summary_metrics_qwen0p5b.json ADDED
@@ -0,0 +1,35 @@
+ {
+   "base_model": "Qwen/Qwen2.5-0.5B-Instruct",
+   "dataset_rows": 255,
+   "episodes_per_task": 3,
+   "random_rewards": [
+     -5.96,
+     -11.48,
+     -12.5
+   ],
+   "heuristic_rewards": [
+     -4.72,
+     -0.87,
+     5.89
+   ],
+   "base_model_rewards": [
+     -2.92,
+     -4.0,
+     -2.4
+   ],
+   "sft_model_rewards": [
+     -2.49,
+     -3.86,
+     -2.4
+   ],
+   "improvement_sft_over_base": [
+     0.43,
+     0.14,
+     0.0
+   ],
+   "improvement_heuristic_over_random": [
+     1.24,
+     10.61,
+     18.39
+   ],
+ }
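The two `improvement_*` arrays in this file are element-wise differences of the reward arrays next to them. The sketch below reproduces them from the ablation numbers; the function name `improvement` and the rounding to two decimals are assumptions chosen to match the precision of the committed JSON.

```python
from typing import List


def improvement(new: List[float], old: List[float], ndigits: int = 2) -> List[float]:
    """Element-wise `new - old`, rounded to hide float noise."""
    if len(new) != len(old):
        raise ValueError("reward lists must have the same length")
    return [round(n - o, ndigits) for n, o in zip(new, old)]


# Values copied from artifacts/summary_metrics_qwen0p5b.json.
random_rewards = [-5.96, -11.48, -12.5]
heuristic_rewards = [-4.72, -0.87, 5.89]
base_model_rewards = [-2.92, -4.0, -2.4]
sft_model_rewards = [-2.49, -3.86, -2.4]

sft_over_base = improvement(sft_model_rewards, base_model_rewards)
heuristic_over_random = improvement(heuristic_rewards, random_rewards)
print(sft_over_base)           # [0.43, 0.14, 0.0]
print(heuristic_over_random)   # [1.24, 10.61, 18.39]
```

The same function applied to the 1.5B numbers in `summary_metrics.json` gives `[-1.8, 3.13, 10.17]`, whose last entry is the +10.17 figure quoted in the submission checklist.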
artifacts/training_curve.png ADDED
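The README describes `training_curve.png` as the TRL SFT loss and token-accuracy curve. A plot like that needs one `(steps, values)` series per metric from the log history. The following is a minimal sketch of that extraction step, not the plotting code in `train_trl.py`: `metric_series` is a hypothetical helper, the first two sample rows are copied from `training_log.json`, and the third row is an invented stand-in for an end-of-run summary entry that carries no per-step loss.

```python
from typing import Any, Dict, List, Tuple

SAMPLE_LOG: List[Dict[str, Any]] = [
    {"loss": 2.836225128173828, "mean_token_accuracy": 0.49307813346385954, "step": 5},
    {"loss": 1.3722827911376954, "mean_token_accuracy": 0.7310294091701508, "step": 10},
    {"train_runtime": 0.0, "step": 10},  # hypothetical summary row, no "loss" key
]


def metric_series(log: List[Dict[str, Any]], key: str) -> Tuple[List[int], List[float]]:
    """Return (steps, values) for the entries that carry both `step` and `key`."""
    rows = [(entry["step"], entry[key]) for entry in log if "step" in entry and key in entry]
    steps = [step for step, _ in rows]
    values = [value for _, value in rows]
    return steps, values


loss_steps, losses = metric_series(SAMPLE_LOG, "loss")
acc_steps, accuracies = metric_series(SAMPLE_LOG, "mean_token_accuracy")
```

Filtering on the presence of the key, rather than assuming every entry has it, keeps the helper usable on logs whose final entry holds only aggregate statistics.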
artifacts/training_log.json ADDED
@@ -0,0 +1,2051 @@
+ [
+   {
+     "loss": 2.836225128173828,
+     "grad_norm": 64.5,
+     "learning_rate": 1.9921568627450984e-05,
+     "entropy": 2.411133313179016,
+     "num_tokens": 3137.0,
+     "mean_token_accuracy": 0.49307813346385954,
+     "epoch": 0.014705882352941176,
+     "step": 5
+   },
+   {
+     "loss": 1.3722827911376954,
+     "grad_norm": 10.0,
+     "learning_rate": 1.9823529411764708e-05,
+     "entropy": 1.489565873146057,
+     "num_tokens": 6240.0,
+     "mean_token_accuracy": 0.7310294091701508,
+     "epoch": 0.029411764705882353,
+     "step": 10
+   },
+   {
+     "loss": 0.9681278228759765,
+     "grad_norm": 9.0,
+     "learning_rate": 1.9725490196078433e-05,
+     "entropy": 1.0941020846366882,
+     "num_tokens": 9372.0,
+     "mean_token_accuracy": 0.7977278172969818,
+     "epoch": 0.04411764705882353,
+     "step": 15
+   },
+   {
+     "loss": 0.7952256202697754,
+     "grad_norm": 7.5625,
+     "learning_rate": 1.9627450980392157e-05,
+     "entropy": 0.7959236443042755,
+     "num_tokens": 12496.0,
+     "mean_token_accuracy": 0.8263253927230835,
+     "epoch": 0.058823529411764705,
+     "step": 20
+   },
+   {
+     "loss": 0.7038975715637207,
+     "grad_norm": 10.0,
+     "learning_rate": 1.9529411764705885e-05,
+     "entropy": 0.7730603992938996,
+     "num_tokens": 15726.0,
+     "mean_token_accuracy": 0.8362560391426086,
+     "epoch": 0.07352941176470588,
+     "step": 25
+   },
+   {
+     "loss": 0.5153284072875977,
+     "grad_norm": 9.5,
+     "learning_rate": 1.943137254901961e-05,
+     "entropy": 0.5871870815753937,
+     "num_tokens": 18807.0,
+     "mean_token_accuracy": 0.8711118042469025,
+     "epoch": 0.08823529411764706,
+     "step": 30
+   },
+   {
+     "loss": 0.4624673843383789,
+     "grad_norm": 9.375,
+     "learning_rate": 1.9333333333333333e-05,
+     "entropy": 0.5334561973810196,
+     "num_tokens": 21955.0,
+     "mean_token_accuracy": 0.8878682732582093,
+     "epoch": 0.10294117647058823,
+     "step": 35
+   },
+   {
+     "loss": 0.3805722236633301,
+     "grad_norm": 7.0625,
+     "learning_rate": 1.923529411764706e-05,
+     "entropy": 0.490571403503418,
+     "num_tokens": 25129.0,
+     "mean_token_accuracy": 0.9082872688770294,
+     "epoch": 0.11764705882352941,
+     "step": 40
+   },
+   {
+     "loss": 0.2753485679626465,
+     "grad_norm": 8.75,
+     "learning_rate": 1.9137254901960786e-05,
+     "entropy": 0.3105604648590088,
+     "num_tokens": 28291.0,
+     "mean_token_accuracy": 0.9394680917263031,
+     "epoch": 0.1323529411764706,
+     "step": 45
+   },
+   {
+     "loss": 0.22170100212097169,
+     "grad_norm": 5.65625,
+     "learning_rate": 1.903921568627451e-05,
+     "entropy": 0.28098965287208555,
+     "num_tokens": 31415.0,
+     "mean_token_accuracy": 0.949154794216156,
+     "epoch": 0.14705882352941177,
+     "step": 50
+   },
+   {
+     "loss": 0.18951488733291627,
+     "grad_norm": 9.9375,
+     "learning_rate": 1.8941176470588238e-05,
+     "entropy": 0.20550020337104796,
+     "num_tokens": 34603.0,
+     "mean_token_accuracy": 0.9539743661880493,
+     "epoch": 0.16176470588235295,
+     "step": 55
+   },
+   {
+     "loss": 0.17650480270385743,
+     "grad_norm": 4.25,
+     "learning_rate": 1.8843137254901962e-05,
+     "entropy": 0.21026135981082916,
+     "num_tokens": 37754.0,
+     "mean_token_accuracy": 0.9567391991615295,
+     "epoch": 0.17647058823529413,
+     "step": 60
+   },
+   {
+     "loss": 0.18774482011795043,
+     "grad_norm": 5.5,
+     "learning_rate": 1.8745098039215686e-05,
+     "entropy": 0.23240296691656112,
+     "num_tokens": 40848.0,
+     "mean_token_accuracy": 0.9520188570022583,
+     "epoch": 0.19117647058823528,
+     "step": 65
+   },
+   {
+     "loss": 0.12736810445785524,
+     "grad_norm": 10.625,
+     "learning_rate": 1.8647058823529414e-05,
+     "entropy": 0.16197917684912683,
+     "num_tokens": 44001.0,
+     "mean_token_accuracy": 0.9676418542861939,
+     "epoch": 0.20588235294117646,
+     "step": 70
+   },
+   {
+     "loss": 0.14076029062271117,
+     "grad_norm": 4.53125,
+     "learning_rate": 1.854901960784314e-05,
+     "entropy": 0.15784153044223787,
+     "num_tokens": 47159.0,
+     "mean_token_accuracy": 0.9648099303245544,
+     "epoch": 0.22058823529411764,
+     "step": 75
+   },
+   {
+     "loss": 0.10759507417678833,
+     "grad_norm": 3.328125,
+     "learning_rate": 1.8450980392156866e-05,
+     "entropy": 0.14289679378271103,
+     "num_tokens": 50298.0,
+     "mean_token_accuracy": 0.9671541452407837,
+     "epoch": 0.23529411764705882,
+     "step": 80
+   },
+   {
+     "loss": 0.12589149475097655,
+     "grad_norm": 5.46875,
+     "learning_rate": 1.8352941176470587e-05,
+     "entropy": 0.13958239406347275,
+     "num_tokens": 53455.0,
+     "mean_token_accuracy": 0.9665216684341431,
+     "epoch": 0.25,
+     "step": 85
+   },
+   {
+     "loss": 0.12024720907211303,
+     "grad_norm": 4.53125,
+     "learning_rate": 1.8254901960784315e-05,
+     "entropy": 0.13711344972252845,
+     "num_tokens": 56595.0,
+     "mean_token_accuracy": 0.9648710668087006,
+     "epoch": 0.2647058823529412,
+     "step": 90
+   },
+   {
+     "loss": 0.10167303085327148,
+     "grad_norm": 4.8125,
+     "learning_rate": 1.815686274509804e-05,
+     "entropy": 0.13078619986772538,
+     "num_tokens": 59674.0,
+     "mean_token_accuracy": 0.9712324619293213,
+     "epoch": 0.27941176470588236,
+     "step": 95
+   },
+   {
+     "loss": 0.08662314414978027,
+     "grad_norm": 3.671875,
+     "learning_rate": 1.8058823529411767e-05,
+     "entropy": 0.10740345045924186,
+     "num_tokens": 62774.0,
+     "mean_token_accuracy": 0.9719909071922302,
+     "epoch": 0.29411764705882354,
+     "step": 100
+   },
+   {
+     "loss": 0.09073780775070191,
+     "grad_norm": 4.15625,
+     "learning_rate": 1.796078431372549e-05,
+     "entropy": 0.09185975939035415,
+     "num_tokens": 65866.0,
+     "mean_token_accuracy": 0.9742748856544494,
+     "epoch": 0.3088235294117647,
+     "step": 105
+   },
+   {
+     "loss": 0.07408615350723266,
+     "grad_norm": 2.734375,
+     "learning_rate": 1.786274509803922e-05,
+     "entropy": 0.10024651288986205,
+     "num_tokens": 68995.0,
+     "mean_token_accuracy": 0.9773713290691376,
+     "epoch": 0.3235294117647059,
+     "step": 110
+   },
+   {
+     "loss": 0.08644189834594726,
+     "grad_norm": 6.71875,
+     "learning_rate": 1.776470588235294e-05,
+     "entropy": 0.09930562153458596,
+     "num_tokens": 72160.0,
+     "mean_token_accuracy": 0.9748322486877441,
+     "epoch": 0.3382352941176471,
+     "step": 115
+   },
+   {
+     "loss": 0.11685197353363037,
+     "grad_norm": 10.3125,
+     "learning_rate": 1.7666666666666668e-05,
+     "entropy": 0.11419346779584885,
+     "num_tokens": 75262.0,
+     "mean_token_accuracy": 0.9695464611053467,
+     "epoch": 0.35294117647058826,
+     "step": 120
+   },
+   {
+     "loss": 0.10757300853729249,
+     "grad_norm": 8.9375,
+     "learning_rate": 1.7568627450980392e-05,
+     "entropy": 0.12836654633283615,
+     "num_tokens": 78384.0,
+     "mean_token_accuracy": 0.9728550255298615,
+     "epoch": 0.36764705882352944,
+     "step": 125
+   },
+   {
+     "loss": 0.07711289525032043,
+     "grad_norm": 3.015625,
+     "learning_rate": 1.747058823529412e-05,
+     "entropy": 0.10070741027593613,
+     "num_tokens": 81583.0,
+     "mean_token_accuracy": 0.9778402209281921,
+     "epoch": 0.38235294117647056,
+     "step": 130
+   },
+   {
+     "loss": 0.08512116074562073,
+     "grad_norm": 5.375,
+     "learning_rate": 1.7372549019607845e-05,
+     "entropy": 0.09163436144590378,
+     "num_tokens": 84729.0,
+     "mean_token_accuracy": 0.9748329102993012,
+     "epoch": 0.39705882352941174,
+     "step": 135
+   },
+   {
+     "loss": 0.09534031748771668,
+     "grad_norm": 3.40625,
+     "learning_rate": 1.7274509803921572e-05,
+     "entropy": 0.09555450975894927,
+     "num_tokens": 87916.0,
+     "mean_token_accuracy": 0.9727975726127625,
+     "epoch": 0.4117647058823529,
+     "step": 140
+   },
+   {
+     "loss": 0.0699828803539276,
+     "grad_norm": 2.828125,
+     "learning_rate": 1.7176470588235293e-05,
+     "entropy": 0.089533219486475,
+     "num_tokens": 90982.0,
+     "mean_token_accuracy": 0.9772566497325897,
+     "epoch": 0.4264705882352941,
+     "step": 145
+   },
+   {
+     "loss": 0.06004565954208374,
+     "grad_norm": 4.28125,
+     "learning_rate": 1.707843137254902e-05,
+     "entropy": 0.07979470491409302,
+     "num_tokens": 94197.0,
+     "mean_token_accuracy": 0.980064970254898,
+     "epoch": 0.4411764705882353,
+     "step": 150
+   },
+   {
+     "loss": 0.07095102667808532,
+     "grad_norm": 3.8125,
+     "learning_rate": 1.6980392156862745e-05,
+     "entropy": 0.07709958106279373,
+     "num_tokens": 97332.0,
+     "mean_token_accuracy": 0.9785419166088104,
+     "epoch": 0.45588235294117646,
+     "step": 155
+   },
+   {
+     "loss": 0.05590643882751465,
+     "grad_norm": 1.671875,
+     "learning_rate": 1.6882352941176473e-05,
+     "entropy": 0.07423891946673393,
+     "num_tokens": 100515.0,
+     "mean_token_accuracy": 0.9827289760112763,
+     "epoch": 0.47058823529411764,
+     "step": 160
+   },
+   {
+     "loss": 0.06335585117340088,
+     "grad_norm": 2.390625,
+     "learning_rate": 1.6784313725490198e-05,
+     "entropy": 0.08311136476695538,
+     "num_tokens": 103630.0,
+     "mean_token_accuracy": 0.9795481741428376,
+     "epoch": 0.4852941176470588,
+     "step": 165
+   },
+   {
+     "loss": 0.06994503140449523,
+     "grad_norm": 3.625,
+     "learning_rate": 1.6686274509803922e-05,
+     "entropy": 0.07972728088498116,
+     "num_tokens": 106741.0,
+     "mean_token_accuracy": 0.9786823868751526,
+     "epoch": 0.5,
+     "step": 170
+   },
+   {
+     "loss": 0.047742915153503415,
+     "grad_norm": 5.71875,
+     "learning_rate": 1.658823529411765e-05,
+     "entropy": 0.059984054416418076,
+     "num_tokens": 109921.0,
+     "mean_token_accuracy": 0.9847357928752899,
+     "epoch": 0.5147058823529411,
+     "step": 175
+   },
+   {
+     "loss": 0.05979984998703003,
+     "grad_norm": 7.0625,
+     "learning_rate": 1.6490196078431374e-05,
+     "entropy": 0.06703888289630414,
+     "num_tokens": 112994.0,
+     "mean_token_accuracy": 0.9824592292308807,
+     "epoch": 0.5294117647058824,
+     "step": 180
+   },
+   {
+     "loss": 0.04938005805015564,
+     "grad_norm": 2.90625,
+     "learning_rate": 1.63921568627451e-05,
+     "entropy": 0.054279588535428046,
+     "num_tokens": 116201.0,
+     "mean_token_accuracy": 0.9846667230129242,
+     "epoch": 0.5441176470588235,
+     "step": 185
+   },
+   {
+     "loss": 0.06785057783126831,
+     "grad_norm": 7.4375,
+     "learning_rate": 1.6294117647058826e-05,
+     "entropy": 0.06177988387644291,
+     "num_tokens": 119381.0,
+     "mean_token_accuracy": 0.9796367406845092,
+     "epoch": 0.5588235294117647,
+     "step": 190
+   },
+   {
+     "loss": 0.05383546352386474,
+     "grad_norm": 5.40625,
+     "learning_rate": 1.619607843137255e-05,
+     "entropy": 0.0636073287576437,
+     "num_tokens": 122517.0,
+     "mean_token_accuracy": 0.9798873722553253,
+     "epoch": 0.5735294117647058,
+     "step": 195
+   },
+   {
+     "loss": 0.0490637868642807,
+     "grad_norm": 1.96875,
+     "learning_rate": 1.6098039215686275e-05,
+     "entropy": 0.0639917254447937,
+     "num_tokens": 125663.0,
+     "mean_token_accuracy": 0.9849890351295472,
+     "epoch": 0.5882352941176471,
+     "step": 200
+   },
+   {
+     "loss": 0.06412197351455688,
+     "grad_norm": 6.84375,
+     "learning_rate": 1.6000000000000003e-05,
+     "entropy": 0.06784685887396336,
+     "num_tokens": 128856.0,
+     "mean_token_accuracy": 0.9818105876445771,
+     "epoch": 0.6029411764705882,
+     "step": 205
+   },
+   {
+     "loss": 0.04346465170383453,
+     "grad_norm": 4.375,
+     "learning_rate": 1.5901960784313727e-05,
+     "entropy": 0.06049864292144776,
+     "num_tokens": 131995.0,
+     "mean_token_accuracy": 0.9882112145423889,
+     "epoch": 0.6176470588235294,
+     "step": 210
+   },
+   {
+     "loss": 0.04320838153362274,
+     "grad_norm": 2.015625,
+     "learning_rate": 1.580392156862745e-05,
+     "entropy": 0.047596517577767374,
+     "num_tokens": 135181.0,
+     "mean_token_accuracy": 0.985132920742035,
+     "epoch": 0.6323529411764706,
+     "step": 215
+   },
+   {
+     "loss": 0.06799347996711731,
+     "grad_norm": 8.5625,
+     "learning_rate": 1.570588235294118e-05,
+     "entropy": 0.06635901145637035,
+     "num_tokens": 138254.0,
+     "mean_token_accuracy": 0.9791639804840088,
+     "epoch": 0.6470588235294118,
+     "step": 220
+   },
+   {
+     "loss": 0.041108173131942746,
+     "grad_norm": 2.859375,
+     "learning_rate": 1.5607843137254904e-05,
+     "entropy": 0.051696383953094484,
+     "num_tokens": 141381.0,
+     "mean_token_accuracy": 0.9862416744232178,
+     "epoch": 0.6617647058823529,
+     "step": 225
+   },
+   {
+     "loss": 0.045146191120147706,
+     "grad_norm": 3.078125,
+     "learning_rate": 1.5509803921568628e-05,
+     "entropy": 0.055339107289910316,
+     "num_tokens": 144583.0,
+     "mean_token_accuracy": 0.9822882294654847,
+     "epoch": 0.6764705882352942,
+     "step": 230
+   },
+   {
+     "loss": 0.04143168330192566,
+     "grad_norm": 1.578125,
+     "learning_rate": 1.5411764705882356e-05,
+     "entropy": 0.05063906572759151,
+     "num_tokens": 147764.0,
+     "mean_token_accuracy": 0.9831606447696686,
+     "epoch": 0.6911764705882353,
+     "step": 235
+   },
+   {
+     "loss": 0.03947827816009521,
+     "grad_norm": 1.9921875,
+     "learning_rate": 1.531372549019608e-05,
+     "entropy": 0.05209046043455601,
+     "num_tokens": 150961.0,
+     "mean_token_accuracy": 0.9848346650600434,
+     "epoch": 0.7058823529411765,
+     "step": 240
+   },
+   {
+     "loss": 0.034212198853492734,
+     "grad_norm": 1.8984375,
+     "learning_rate": 1.5215686274509804e-05,
+     "entropy": 0.04912327118217945,
+     "num_tokens": 154174.0,
+     "mean_token_accuracy": 0.9855735838413239,
+     "epoch": 0.7205882352941176,
+     "step": 245
+   },
+   {
+     "loss": 0.03223183453083038,
+     "grad_norm": 1.7265625,
+     "learning_rate": 1.511764705882353e-05,
+     "entropy": 0.045325061306357384,
+     "num_tokens": 157374.0,
+     "mean_token_accuracy": 0.9866909861564637,
+     "epoch": 0.7352941176470589,
+     "step": 250
+   },
+   {
+     "loss": 0.04085415601730347,
+     "grad_norm": 2.625,
+     "learning_rate": 1.5019607843137257e-05,
+     "entropy": 0.045074894279241565,
+     "num_tokens": 160519.0,
+     "mean_token_accuracy": 0.9865182876586914,
+     "epoch": 0.75,
+     "step": 255
+   },
+   {
+     "loss": 0.03927797079086304,
+     "grad_norm": 2.671875,
+     "learning_rate": 1.4921568627450983e-05,
+     "entropy": 0.039533843845129014,
+     "num_tokens": 163756.0,
+     "mean_token_accuracy": 0.9872985363006592,
+     "epoch": 0.7647058823529411,
+     "step": 260
+   },
+   {
+     "loss": 0.042234039306640624,
+     "grad_norm": 1.7109375,
+     "learning_rate": 1.4823529411764707e-05,
+     "entropy": 0.043326519429683685,
+     "num_tokens": 166884.0,
+     "mean_token_accuracy": 0.9839499652385711,
+     "epoch": 0.7794117647058824,
+     "step": 265
+   },
+   {
+     "loss": 0.04218446910381317,
+     "grad_norm": 3.671875,
+     "learning_rate": 1.4725490196078433e-05,
+     "entropy": 0.05446031875908375,
+     "num_tokens": 170021.0,
+     "mean_token_accuracy": 0.983331423997879,
+     "epoch": 0.7941176470588235,
+     "step": 270
+   },
+   {
+     "loss": 0.031345850229263304,
+     "grad_norm": 1.375,
+     "learning_rate": 1.4627450980392157e-05,
+     "entropy": 0.044994413107633593,
+     "num_tokens": 173138.0,
+     "mean_token_accuracy": 0.9864144027233124,
+     "epoch": 0.8088235294117647,
+     "step": 275
+   },
+   {
+     "loss": 0.03718245923519135,
+     "grad_norm": 2.03125,
+     "learning_rate": 1.4529411764705883e-05,
+     "entropy": 0.04372772537171841,
+     "num_tokens": 176269.0,
+     "mean_token_accuracy": 0.9855779051780701,
+     "epoch": 0.8235294117647058,
+     "step": 280
+   },
+   {
+     "loss": 0.038416677713394166,
+     "grad_norm": 3.234375,
+     "learning_rate": 1.443137254901961e-05,
+     "entropy": 0.04306882936507463,
+     "num_tokens": 179436.0,
+     "mean_token_accuracy": 0.9847787022590637,
+     "epoch": 0.8382352941176471,
+     "step": 285
+   },
+   {
+     "loss": 0.03612026274204254,
+     "grad_norm": 4.28125,
+     "learning_rate": 1.4333333333333334e-05,
+     "entropy": 0.04190887995064259,
+     "num_tokens": 182619.0,
+     "mean_token_accuracy": 0.9853791892528534,
+     "epoch": 0.8529411764705882,
+     "step": 290
+   },
+   {
+     "loss": 0.03549243807792664,
+     "grad_norm": 1.5546875,
+     "learning_rate": 1.423529411764706e-05,
+     "entropy": 0.041007821820676325,
+     "num_tokens": 185835.0,
+     "mean_token_accuracy": 0.987481951713562,
+     "epoch": 0.8676470588235294,
+     "step": 295
+   },
+   {
+     "loss": 0.03658969700336456,
+     "grad_norm": 1.9921875,
+     "learning_rate": 1.4137254901960786e-05,
+     "entropy": 0.03911938704550266,
+     "num_tokens": 189059.0,
+     "mean_token_accuracy": 0.9859034955501557,
+     "epoch": 0.8823529411764706,
+     "step": 300
+   },
+   {
+     "loss": 0.03189299702644348,
+     "grad_norm": 1.3984375,
+     "learning_rate": 1.403921568627451e-05,
+     "entropy": 0.04015427939593792,
+     "num_tokens": 192245.0,
+     "mean_token_accuracy": 0.9858013272285462,
+     "epoch": 0.8970588235294118,
+     "step": 305
+   },
+   {
+     "loss": 0.04162760376930237,
+     "grad_norm": 4.6875,
+     "learning_rate": 1.3941176470588236e-05,
+     "entropy": 0.04337671361863613,
+     "num_tokens": 195334.0,
+     "mean_token_accuracy": 0.9834910809993744,
+     "epoch": 0.9117647058823529,
+     "step": 310
+   },
+   {
+     "loss": 0.03357888162136078,
+     "grad_norm": 1.515625,
+     "learning_rate": 1.384313725490196e-05,
+     "entropy": 0.043437547981739044,
+     "num_tokens": 198482.0,
+     "mean_token_accuracy": 0.9839794993400574,
+     "epoch": 0.9264705882352942,
+     "step": 315
+   },
+   {
+     "loss": 0.03252431154251099,
+     "grad_norm": 2.390625,
+     "learning_rate": 1.3745098039215687e-05,
+     "entropy": 0.041450836881995204,
+     "num_tokens": 201737.0,
+     "mean_token_accuracy": 0.9883051753044129,
+     "epoch": 0.9411764705882353,
+     "step": 320
+   },
+   {
+     "loss": 0.03779064118862152,
+     "grad_norm": 2.953125,
+     "learning_rate": 1.3647058823529413e-05,
+     "entropy": 0.03566624131053686,
+     "num_tokens": 204889.0,
+     "mean_token_accuracy": 0.9875539124011994,
+     "epoch": 0.9558823529411765,
+     "step": 325
+   },
+   {
+     "loss": 0.0329700767993927,
+     "grad_norm": 2.15625,
+     "learning_rate": 1.3549019607843139e-05,
+     "entropy": 0.03808465227484703,
+     "num_tokens": 208114.0,
+     "mean_token_accuracy": 0.986751276254654,
+     "epoch": 0.9705882352941176,
+     "step": 330
+   },
+   {
+     "loss": 0.031173259019851685,
+     "grad_norm": 1.546875,
+     "learning_rate": 1.3450980392156865e-05,
+     "entropy": 0.04065078347921371,
+     "num_tokens": 211217.0,
+     "mean_token_accuracy": 0.9860772728919983,
+     "epoch": 0.9852941176470589,
+     "step": 335
+   },
+   {
+     "loss": 0.03390420079231262,
+     "grad_norm": 1.515625,
+     "learning_rate": 1.3352941176470588e-05,
+     "entropy": 0.04108036197721958,
+     "num_tokens": 214368.0,
+     "mean_token_accuracy": 0.9871271908283233,
+     "epoch": 1.0,
+     "step": 340
+   },
+   {
+     "loss": 0.03671025633811951,
+     "grad_norm": 1.5625,
+     "learning_rate": 1.3254901960784314e-05,
+     "entropy": 0.04091338850557804,
+     "num_tokens": 217480.0,
+     "mean_token_accuracy": 0.9861762046813964,
+     "epoch": 1.0147058823529411,
+     "step": 345
+   },
+   {
+     "loss": 0.030594143271446227,
+     "grad_norm": 1.5546875,
+     "learning_rate": 1.315686274509804e-05,
+     "entropy": 0.040245630964636805,
+     "num_tokens": 220615.0,
+     "mean_token_accuracy": 0.9881528139114379,
+     "epoch": 1.0294117647058822,
+     "step": 350
+   },
+   {
+     "loss": 0.027347692847251893,
+     "grad_norm": 1.7734375,
+     "learning_rate": 1.3058823529411766e-05,
+     "entropy": 0.03420254942029714,
+     "num_tokens": 223751.0,
+     "mean_token_accuracy": 0.989202469587326,
+     "epoch": 1.0441176470588236,
+     "step": 355
+   },
+   {
+     "loss": 0.03148679435253143,
+     "grad_norm": 1.9609375,
+     "learning_rate": 1.2960784313725492e-05,
+     "entropy": 0.03210772704333067,
+     "num_tokens": 226948.0,
+     "mean_token_accuracy": 0.9868246436119079,
+     "epoch": 1.0588235294117647,
+     "step": 360
+   },
+   {
+     "loss": 0.031260594725608826,
+     "grad_norm": 1.8046875,
+     "learning_rate": 1.2862745098039218e-05,
+     "entropy": 0.033671201393008235,
+     "num_tokens": 230088.0,
+     "mean_token_accuracy": 0.9856015264987945,
+     "epoch": 1.0735294117647058,
+     "step": 365
+   },
+   {
+     "loss": 0.028061491250991822,
+     "grad_norm": 1.2890625,
+     "learning_rate": 1.276470588235294e-05,
+     "entropy": 0.03639122284948826,
+     "num_tokens": 233247.0,
+     "mean_token_accuracy": 0.9885319888591766,
+     "epoch": 1.088235294117647,
+     "step": 370
+   },
+   {
+     "loss": 0.0304165780544281,
+     "grad_norm": 2.203125,
+     "learning_rate": 1.2666666666666667e-05,
+     "entropy": 0.03107942212373018,
+     "num_tokens": 236423.0,
+     "mean_token_accuracy": 0.9864429414272309,
+     "epoch": 1.1029411764705883,
+     "step": 375
+   },
+   {
+     "loss": 0.028667458891868593,
+     "grad_norm": 1.4453125,
+     "learning_rate": 1.2568627450980393e-05,
+     "entropy": 0.03269361965358257,
+     "num_tokens": 239698.0,
+     "mean_token_accuracy": 0.9882214546203614,
+     "epoch": 1.1176470588235294,
+     "step": 380
+   },
+   {
+     "loss": 0.03024893403053284,
+     "grad_norm": 1.4375,
+     "learning_rate": 1.2470588235294119e-05,
+     "entropy": 0.036648140475153926,
+     "num_tokens": 242904.0,
+     "mean_token_accuracy": 0.9854198694229126,
+     "epoch": 1.1323529411764706,
+     "step": 385
+   },
+   {
+     "loss": 0.03237654864788055,
+     "grad_norm": 1.140625,
+     "learning_rate": 1.2372549019607845e-05,
+     "entropy": 0.036488327011466024,
+     "num_tokens": 246044.0,
+     "mean_token_accuracy": 0.9868141651153565,
+     "epoch": 1.1470588235294117,
+     "step": 390
+   },
+   {
+     "loss": 0.026534423232078552,
+     "grad_norm": 1.2890625,
+     "learning_rate": 1.2274509803921571e-05,
+     "entropy": 0.03317699953913689,
+     "num_tokens": 249199.0,
+     "mean_token_accuracy": 0.9891056835651397,
+     "epoch": 1.161764705882353,
+     "step": 395
791
+ },
792
+ {
793
+ "loss": 0.02918187975883484,
794
+ "grad_norm": 1.546875,
795
+ "learning_rate": 1.2176470588235294e-05,
796
+ "entropy": 0.033053198270499705,
797
+ "num_tokens": 252416.0,
798
+ "mean_token_accuracy": 0.9872093260288238,
799
+ "epoch": 1.1764705882352942,
800
+ "step": 400
801
+ },
802
+ {
803
+ "loss": 0.027815410494804384,
804
+ "grad_norm": 1.5,
805
+ "learning_rate": 1.207843137254902e-05,
806
+ "entropy": 0.03630108144134283,
807
+ "num_tokens": 255505.0,
808
+ "mean_token_accuracy": 0.9886294066905975,
809
+ "epoch": 1.1911764705882353,
810
+ "step": 405
811
+ },
812
+ {
813
+ "loss": 0.029119834303855896,
814
+ "grad_norm": 1.640625,
815
+ "learning_rate": 1.1980392156862746e-05,
816
+ "entropy": 0.0321140518411994,
817
+ "num_tokens": 258679.0,
818
+ "mean_token_accuracy": 0.9888967990875244,
819
+ "epoch": 1.2058823529411764,
820
+ "step": 410
821
+ },
822
+ {
823
+ "loss": 0.025961104035377502,
824
+ "grad_norm": 1.8203125,
825
+ "learning_rate": 1.1882352941176472e-05,
826
+ "entropy": 0.02944366242736578,
827
+ "num_tokens": 261856.0,
828
+ "mean_token_accuracy": 0.9895209610462189,
829
+ "epoch": 1.2205882352941178,
830
+ "step": 415
831
+ },
832
+ {
833
+ "loss": 0.03058839440345764,
834
+ "grad_norm": 2.390625,
835
+ "learning_rate": 1.1784313725490198e-05,
836
+ "entropy": 0.03461700212210417,
837
+ "num_tokens": 264960.0,
838
+ "mean_token_accuracy": 0.9882765769958496,
839
+ "epoch": 1.2352941176470589,
840
+ "step": 420
841
+ },
842
+ {
843
+ "loss": 0.028424999117851256,
844
+ "grad_norm": 1.28125,
845
+ "learning_rate": 1.1686274509803922e-05,
846
+ "entropy": 0.02985447719693184,
847
+ "num_tokens": 268114.0,
848
+ "mean_token_accuracy": 0.9882177650928498,
849
+ "epoch": 1.25,
850
+ "step": 425
851
+ },
852
+ {
853
+ "loss": 0.03086719512939453,
854
+ "grad_norm": 2.265625,
855
+ "learning_rate": 1.1588235294117648e-05,
856
+ "entropy": 0.03250212036073208,
857
+ "num_tokens": 271274.0,
858
+ "mean_token_accuracy": 0.9888392806053161,
859
+ "epoch": 1.2647058823529411,
860
+ "step": 430
861
+ },
862
+ {
863
+ "loss": 0.027977922558784486,
864
+ "grad_norm": 1.3046875,
865
+ "learning_rate": 1.1490196078431373e-05,
866
+ "entropy": 0.034127247892320155,
867
+ "num_tokens": 274452.0,
868
+ "mean_token_accuracy": 0.9908244907855988,
869
+ "epoch": 1.2794117647058822,
870
+ "step": 435
871
+ },
872
+ {
873
+ "loss": 0.02676369547843933,
874
+ "grad_norm": 1.09375,
875
+ "learning_rate": 1.1392156862745099e-05,
876
+ "entropy": 0.03699512742459774,
877
+ "num_tokens": 277562.0,
878
+ "mean_token_accuracy": 0.9871235430240631,
879
+ "epoch": 1.2941176470588236,
880
+ "step": 440
881
+ },
882
+ {
883
+ "loss": 0.02789466977119446,
884
+ "grad_norm": 2.203125,
885
+ "learning_rate": 1.1294117647058825e-05,
886
+ "entropy": 0.03514884728938341,
887
+ "num_tokens": 280635.0,
888
+ "mean_token_accuracy": 0.990158212184906,
889
+ "epoch": 1.3088235294117647,
890
+ "step": 445
891
+ },
892
+ {
893
+ "loss": 0.03088509142398834,
894
+ "grad_norm": 1.8359375,
895
+ "learning_rate": 1.119607843137255e-05,
896
+ "entropy": 0.034746605530381204,
897
+ "num_tokens": 283725.0,
898
+ "mean_token_accuracy": 0.9876766622066497,
899
+ "epoch": 1.3235294117647058,
900
+ "step": 450
901
+ },
902
+ {
903
+ "loss": 0.03232976496219635,
904
+ "grad_norm": 1.734375,
905
+ "learning_rate": 1.1098039215686275e-05,
906
+ "entropy": 0.031742793321609494,
907
+ "num_tokens": 286888.0,
908
+ "mean_token_accuracy": 0.9871384859085083,
909
+ "epoch": 1.3382352941176472,
910
+ "step": 455
911
+ },
912
+ {
913
+ "loss": 0.02845146059989929,
914
+ "grad_norm": 2.0,
915
+ "learning_rate": 1.1000000000000001e-05,
916
+ "entropy": 0.03175645042210817,
917
+ "num_tokens": 290064.0,
918
+ "mean_token_accuracy": 0.9873914003372193,
919
+ "epoch": 1.3529411764705883,
920
+ "step": 460
921
+ },
922
+ {
923
+ "loss": 0.029486137628555297,
924
+ "grad_norm": 1.265625,
925
+ "learning_rate": 1.0901960784313726e-05,
926
+ "entropy": 0.03463620245456696,
927
+ "num_tokens": 293189.0,
928
+ "mean_token_accuracy": 0.9874814569950103,
929
+ "epoch": 1.3676470588235294,
930
+ "step": 465
931
+ },
932
+ {
933
+ "loss": 0.02618069648742676,
934
+ "grad_norm": 1.109375,
935
+ "learning_rate": 1.0803921568627452e-05,
936
+ "entropy": 0.033889508619904515,
937
+ "num_tokens": 296268.0,
938
+ "mean_token_accuracy": 0.9882802128791809,
939
+ "epoch": 1.3823529411764706,
940
+ "step": 470
941
+ },
942
+ {
943
+ "loss": 0.025544488430023195,
944
+ "grad_norm": 0.8984375,
945
+ "learning_rate": 1.0705882352941178e-05,
946
+ "entropy": 0.03317532502114773,
947
+ "num_tokens": 299418.0,
948
+ "mean_token_accuracy": 0.9891822457313537,
949
+ "epoch": 1.3970588235294117,
950
+ "step": 475
951
+ },
952
+ {
953
+ "loss": 0.02922942042350769,
954
+ "grad_norm": 1.5859375,
955
+ "learning_rate": 1.0607843137254902e-05,
956
+ "entropy": 0.03228537701070309,
957
+ "num_tokens": 302608.0,
958
+ "mean_token_accuracy": 0.9864252746105194,
959
+ "epoch": 1.4117647058823528,
960
+ "step": 480
961
+ },
962
+ {
963
+ "loss": 0.025081342458724974,
964
+ "grad_norm": 1.4140625,
965
+ "learning_rate": 1.0509803921568628e-05,
966
+ "entropy": 0.033559339493513106,
967
+ "num_tokens": 305748.0,
968
+ "mean_token_accuracy": 0.9891697466373444,
969
+ "epoch": 1.4264705882352942,
970
+ "step": 485
971
+ },
972
+ {
973
+ "loss": 0.028987354040145873,
974
+ "grad_norm": 1.2109375,
975
+ "learning_rate": 1.0411764705882354e-05,
976
+ "entropy": 0.029655468463897706,
977
+ "num_tokens": 308946.0,
978
+ "mean_token_accuracy": 0.9884015321731567,
979
+ "epoch": 1.4411764705882353,
980
+ "step": 490
981
+ },
982
+ {
983
+ "loss": 0.022376981377601624,
984
+ "grad_norm": 1.5859375,
985
+ "learning_rate": 1.031372549019608e-05,
986
+ "entropy": 0.030257853865623473,
987
+ "num_tokens": 312060.0,
988
+ "mean_token_accuracy": 0.990349942445755,
989
+ "epoch": 1.4558823529411764,
990
+ "step": 495
991
+ },
992
+ {
993
+ "loss": 0.027941384911537172,
994
+ "grad_norm": 1.2734375,
995
+ "learning_rate": 1.0215686274509805e-05,
996
+ "entropy": 0.029427625238895416,
997
+ "num_tokens": 315202.0,
998
+ "mean_token_accuracy": 0.9894903540611267,
999
+ "epoch": 1.4705882352941178,
1000
+ "step": 500
1001
+ },
1002
+ {
1003
+ "loss": 0.02513147294521332,
1004
+ "grad_norm": 1.8828125,
1005
+ "learning_rate": 1.011764705882353e-05,
1006
+ "entropy": 0.029220272414386274,
1007
+ "num_tokens": 318423.0,
1008
+ "mean_token_accuracy": 0.9887598037719727,
1009
+ "epoch": 1.4852941176470589,
1010
+ "step": 505
1011
+ },
1012
+ {
1013
+ "loss": 0.024520005285739898,
1014
+ "grad_norm": 1.3515625,
1015
+ "learning_rate": 1.0019607843137255e-05,
1016
+ "entropy": 0.027622674778103828,
1017
+ "num_tokens": 321643.0,
1018
+ "mean_token_accuracy": 0.9881017684936524,
1019
+ "epoch": 1.5,
1020
+ "step": 510
1021
+ },
1022
+ {
1023
+ "loss": 0.022774545848369597,
1024
+ "grad_norm": 0.96875,
1025
+ "learning_rate": 9.921568627450981e-06,
1026
+ "entropy": 0.027344943769276143,
1027
+ "num_tokens": 324896.0,
1028
+ "mean_token_accuracy": 0.9891824662685395,
1029
+ "epoch": 1.5147058823529411,
1030
+ "step": 515
1031
+ },
1032
+ {
1033
+ "loss": 0.026902440190315246,
1034
+ "grad_norm": 1.34375,
1035
+ "learning_rate": 9.823529411764706e-06,
1036
+ "entropy": 0.03210813459008932,
1037
+ "num_tokens": 327953.0,
1038
+ "mean_token_accuracy": 0.9872022986412048,
1039
+ "epoch": 1.5294117647058822,
1040
+ "step": 520
1041
+ },
1042
+ {
1043
+ "loss": 0.02404342144727707,
1044
+ "grad_norm": 1.34375,
1045
+ "learning_rate": 9.725490196078432e-06,
1046
+ "entropy": 0.03047515023499727,
1047
+ "num_tokens": 331110.0,
1048
+ "mean_token_accuracy": 0.9887873768806458,
1049
+ "epoch": 1.5441176470588234,
1050
+ "step": 525
1051
+ },
1052
+ {
1053
+ "loss": 0.022797247767448424,
1054
+ "grad_norm": 1.2265625,
1055
+ "learning_rate": 9.627450980392158e-06,
1056
+ "entropy": 0.03160413987934589,
1057
+ "num_tokens": 334226.0,
1058
+ "mean_token_accuracy": 0.9889481067657471,
1059
+ "epoch": 1.5588235294117647,
1060
+ "step": 530
1061
+ },
1062
+ {
1063
+ "loss": 0.023706996440887453,
1064
+ "grad_norm": 1.078125,
1065
+ "learning_rate": 9.529411764705882e-06,
1066
+ "entropy": 0.0283035334199667,
1067
+ "num_tokens": 337371.0,
1068
+ "mean_token_accuracy": 0.9890589594841004,
1069
+ "epoch": 1.5735294117647058,
1070
+ "step": 535
1071
+ },
1072
+ {
1073
+ "loss": 0.023340512812137604,
1074
+ "grad_norm": 2.5625,
1075
+ "learning_rate": 9.431372549019608e-06,
1076
+ "entropy": 0.029125319607555867,
1077
+ "num_tokens": 340563.0,
1078
+ "mean_token_accuracy": 0.9882973015308381,
1079
+ "epoch": 1.5882352941176472,
1080
+ "step": 540
1081
+ },
1082
+ {
1083
+ "loss": 0.025814762711524962,
1084
+ "grad_norm": 1.8046875,
1085
+ "learning_rate": 9.333333333333334e-06,
1086
+ "entropy": 0.029474343173205853,
1087
+ "num_tokens": 343715.0,
1088
+ "mean_token_accuracy": 0.9888520836830139,
1089
+ "epoch": 1.6029411764705883,
1090
+ "step": 545
1091
+ },
1092
+ {
1093
+ "loss": 0.024609880149364473,
1094
+ "grad_norm": 1.359375,
1095
+ "learning_rate": 9.23529411764706e-06,
1096
+ "entropy": 0.02793533504009247,
1097
+ "num_tokens": 346928.0,
1098
+ "mean_token_accuracy": 0.9896528542041778,
1099
+ "epoch": 1.6176470588235294,
1100
+ "step": 550
1101
+ },
1102
+ {
1103
+ "loss": 0.024091285467147828,
1104
+ "grad_norm": 1.171875,
1105
+ "learning_rate": 9.137254901960785e-06,
1106
+ "entropy": 0.03169798478484154,
1107
+ "num_tokens": 349942.0,
1108
+ "mean_token_accuracy": 0.9896469593048096,
1109
+ "epoch": 1.6323529411764706,
1110
+ "step": 555
1111
+ },
1112
+ {
1113
+ "loss": 0.022402273118495943,
1114
+ "grad_norm": 1.3203125,
1115
+ "learning_rate": 9.03921568627451e-06,
1116
+ "entropy": 0.02854564245790243,
1117
+ "num_tokens": 353063.0,
1118
+ "mean_token_accuracy": 0.9894876420497895,
1119
+ "epoch": 1.6470588235294117,
1120
+ "step": 560
1121
+ },
1122
+ {
1123
+ "loss": 0.023489847779273987,
1124
+ "grad_norm": 1.8359375,
1125
+ "learning_rate": 8.941176470588237e-06,
1126
+ "entropy": 0.028600608371198176,
1127
+ "num_tokens": 356180.0,
1128
+ "mean_token_accuracy": 0.9890201330184937,
1129
+ "epoch": 1.6617647058823528,
1130
+ "step": 565
1131
+ },
1132
+ {
1133
+ "loss": 0.02147035002708435,
1134
+ "grad_norm": 1.0859375,
1135
+ "learning_rate": 8.843137254901961e-06,
1136
+ "entropy": 0.026650307327508928,
1137
+ "num_tokens": 359351.0,
1138
+ "mean_token_accuracy": 0.9898578941822052,
1139
+ "epoch": 1.6764705882352942,
1140
+ "step": 570
1141
+ },
1142
+ {
1143
+ "loss": 0.022052311897277833,
1144
+ "grad_norm": 1.3515625,
1145
+ "learning_rate": 8.745098039215687e-06,
1146
+ "entropy": 0.027873093821108343,
1147
+ "num_tokens": 362470.0,
1148
+ "mean_token_accuracy": 0.989058256149292,
1149
+ "epoch": 1.6911764705882353,
1150
+ "step": 575
1151
+ },
1152
+ {
1153
+ "loss": 0.023864805698394775,
1154
+ "grad_norm": 1.5859375,
1155
+ "learning_rate": 8.647058823529413e-06,
1156
+ "entropy": 0.027629780396819115,
1157
+ "num_tokens": 365614.0,
1158
+ "mean_token_accuracy": 0.9894056558609009,
1159
+ "epoch": 1.7058823529411766,
1160
+ "step": 580
1161
+ },
1162
+ {
1163
+ "loss": 0.027744096517562867,
1164
+ "grad_norm": 1.6875,
1165
+ "learning_rate": 8.549019607843138e-06,
1166
+ "entropy": 0.028794774785637856,
1167
+ "num_tokens": 368805.0,
1168
+ "mean_token_accuracy": 0.9880473792552948,
1169
+ "epoch": 1.7205882352941178,
1170
+ "step": 585
1171
+ },
1172
+ {
1173
+ "loss": 0.021863000094890596,
1174
+ "grad_norm": 1.1796875,
1175
+ "learning_rate": 8.450980392156864e-06,
1176
+ "entropy": 0.028252063691616057,
1177
+ "num_tokens": 371947.0,
1178
+ "mean_token_accuracy": 0.9904429137706756,
1179
+ "epoch": 1.7352941176470589,
1180
+ "step": 590
1181
+ },
1182
+ {
1183
+ "loss": 0.021520544588565827,
1184
+ "grad_norm": 1.3203125,
1185
+ "learning_rate": 8.35294117647059e-06,
1186
+ "entropy": 0.028264945745468138,
1187
+ "num_tokens": 375103.0,
1188
+ "mean_token_accuracy": 0.9904776751995087,
1189
+ "epoch": 1.75,
1190
+ "step": 595
1191
+ },
1192
+ {
1193
+ "loss": 0.026353719830513,
1194
+ "grad_norm": 1.1953125,
1195
+ "learning_rate": 8.254901960784314e-06,
1196
+ "entropy": 0.027113928645849227,
1197
+ "num_tokens": 378317.0,
1198
+ "mean_token_accuracy": 0.9884898960590363,
1199
+ "epoch": 1.7647058823529411,
1200
+ "step": 600
1201
+ },
1202
+ {
1203
+ "loss": 0.026097461581230164,
1204
+ "grad_norm": 1.421875,
1205
+ "learning_rate": 8.15686274509804e-06,
1206
+ "entropy": 0.028313294425606726,
1207
+ "num_tokens": 381417.0,
1208
+ "mean_token_accuracy": 0.9879869103431702,
1209
+ "epoch": 1.7794117647058822,
1210
+ "step": 605
1211
+ },
1212
+ {
1213
+ "loss": 0.02049378156661987,
1214
+ "grad_norm": 1.0546875,
1215
+ "learning_rate": 8.058823529411766e-06,
1216
+ "entropy": 0.026570411399006844,
1217
+ "num_tokens": 384632.0,
1218
+ "mean_token_accuracy": 0.9887495577335358,
1219
+ "epoch": 1.7941176470588234,
1220
+ "step": 610
1221
+ },
1222
+ {
1223
+ "loss": 0.022221173346042632,
1224
+ "grad_norm": 1.1171875,
1225
+ "learning_rate": 7.96078431372549e-06,
1226
+ "entropy": 0.02754255346953869,
1227
+ "num_tokens": 387836.0,
1228
+ "mean_token_accuracy": 0.9899809181690216,
1229
+ "epoch": 1.8088235294117647,
1230
+ "step": 615
1231
+ },
1232
+ {
1233
+ "loss": 0.023856499791145326,
1234
+ "grad_norm": 1.3203125,
1235
+ "learning_rate": 7.862745098039217e-06,
1236
+ "entropy": 0.031241112016141416,
1237
+ "num_tokens": 390887.0,
1238
+ "mean_token_accuracy": 0.9897979915142059,
1239
+ "epoch": 1.8235294117647058,
1240
+ "step": 620
1241
+ },
1242
+ {
1243
+ "loss": 0.0225734680891037,
1244
+ "grad_norm": 1.40625,
1245
+ "learning_rate": 7.764705882352941e-06,
1246
+ "entropy": 0.02798519879579544,
1247
+ "num_tokens": 394027.0,
1248
+ "mean_token_accuracy": 0.9890839040279389,
1249
+ "epoch": 1.8382352941176472,
1250
+ "step": 625
1251
+ },
1252
+ {
1253
+ "loss": 0.022729092836380006,
1254
+ "grad_norm": 1.25,
1255
+ "learning_rate": 7.666666666666667e-06,
1256
+ "entropy": 0.02719390895217657,
1257
+ "num_tokens": 397202.0,
1258
+ "mean_token_accuracy": 0.9886514127254487,
1259
+ "epoch": 1.8529411764705883,
1260
+ "step": 630
1261
+ },
1262
+ {
1263
+ "loss": 0.021688875555992127,
1264
+ "grad_norm": 1.0859375,
1265
+ "learning_rate": 7.5686274509803925e-06,
1266
+ "entropy": 0.027222988195717335,
1267
+ "num_tokens": 400378.0,
1268
+ "mean_token_accuracy": 0.9908071339130402,
1269
+ "epoch": 1.8676470588235294,
1270
+ "step": 635
1271
+ },
1272
+ {
1273
+ "loss": 0.023884420096874238,
1274
+ "grad_norm": 1.4296875,
1275
+ "learning_rate": 7.4705882352941185e-06,
1276
+ "entropy": 0.028057356551289558,
1277
+ "num_tokens": 403503.0,
1278
+ "mean_token_accuracy": 0.9900456726551056,
1279
+ "epoch": 1.8823529411764706,
1280
+ "step": 640
1281
+ },
1282
+ {
1283
+ "loss": 0.020375268161296846,
1284
+ "grad_norm": 1.6953125,
1285
+ "learning_rate": 7.372549019607845e-06,
1286
+ "entropy": 0.02543655373156071,
1287
+ "num_tokens": 406768.0,
1288
+ "mean_token_accuracy": 0.9911065042018891,
1289
+ "epoch": 1.8970588235294117,
1290
+ "step": 645
1291
+ },
1292
+ {
1293
+ "loss": 0.020015493035316467,
1294
+ "grad_norm": 1.7421875,
1295
+ "learning_rate": 7.274509803921569e-06,
1296
+ "entropy": 0.027230485714972018,
1297
+ "num_tokens": 409875.0,
1298
+ "mean_token_accuracy": 0.9906234502792358,
1299
+ "epoch": 1.9117647058823528,
1300
+ "step": 650
1301
+ },
1302
+ {
1303
+ "loss": 0.022530680894851683,
1304
+ "grad_norm": 1.421875,
1305
+ "learning_rate": 7.176470588235295e-06,
1306
+ "entropy": 0.028223772905766963,
1307
+ "num_tokens": 412987.0,
1308
+ "mean_token_accuracy": 0.9903216242790223,
1309
+ "epoch": 1.9264705882352942,
1310
+ "step": 655
1311
+ },
1312
+ {
1313
+ "loss": 0.021129874885082243,
1314
+ "grad_norm": 1.109375,
1315
+ "learning_rate": 7.07843137254902e-06,
1316
+ "entropy": 0.02674291282892227,
1317
+ "num_tokens": 416181.0,
1318
+ "mean_token_accuracy": 0.9886639952659607,
1319
+ "epoch": 1.9411764705882353,
1320
+ "step": 660
1321
+ },
1322
+ {
1323
+ "loss": 0.021244224905967713,
1324
+ "grad_norm": 0.9453125,
1325
+ "learning_rate": 6.9803921568627454e-06,
1326
+ "entropy": 0.028005971759557723,
1327
+ "num_tokens": 419323.0,
1328
+ "mean_token_accuracy": 0.9905200719833374,
1329
+ "epoch": 1.9558823529411766,
1330
+ "step": 665
1331
+ },
1332
+ {
1333
+ "loss": 0.022309188544750214,
1334
+ "grad_norm": 1.375,
1335
+ "learning_rate": 6.8823529411764715e-06,
1336
+ "entropy": 0.027272411435842515,
1337
+ "num_tokens": 422484.0,
1338
+ "mean_token_accuracy": 0.9878733932971955,
1339
+ "epoch": 1.9705882352941178,
1340
+ "step": 670
1341
+ },
1342
+ {
1343
+ "loss": 0.022459632158279418,
1344
+ "grad_norm": 1.203125,
1345
+ "learning_rate": 6.784313725490197e-06,
1346
+ "entropy": 0.026817415095865726,
1347
+ "num_tokens": 425583.0,
1348
+ "mean_token_accuracy": 0.9908780753612518,
1349
+ "epoch": 1.9852941176470589,
1350
+ "step": 675
1351
+ },
1352
+ {
1353
+ "loss": 0.021811096370220183,
1354
+ "grad_norm": 1.265625,
1355
+ "learning_rate": 6.686274509803922e-06,
1356
+ "entropy": 0.026038615591824056,
1357
+ "num_tokens": 428736.0,
1358
+ "mean_token_accuracy": 0.9897907853126526,
1359
+ "epoch": 2.0,
1360
+ "step": 680
1361
+ },
1362
+ {
1363
+ "loss": 0.019171090424060823,
1364
+ "grad_norm": 1.078125,
1365
+ "learning_rate": 6.588235294117647e-06,
1366
+ "entropy": 0.02475190218538046,
1367
+ "num_tokens": 431976.0,
1368
+ "mean_token_accuracy": 0.989355844259262,
1369
+ "epoch": 2.014705882352941,
1370
+ "step": 685
1371
+ },
1372
+ {
1373
+ "loss": 0.023474155366420744,
1374
+ "grad_norm": 1.1640625,
1375
+ "learning_rate": 6.490196078431373e-06,
1376
+ "entropy": 0.026115396432578562,
1377
+ "num_tokens": 435142.0,
1378
+ "mean_token_accuracy": 0.9885824680328369,
1379
+ "epoch": 2.0294117647058822,
1380
+ "step": 690
1381
+ },
1382
+ {
1383
+ "loss": 0.020176805555820465,
1384
+ "grad_norm": 1.0,
1385
+ "learning_rate": 6.3921568627450984e-06,
1386
+ "entropy": 0.026907235756516455,
1387
+ "num_tokens": 438259.0,
1388
+ "mean_token_accuracy": 0.9919745445251464,
1389
+ "epoch": 2.0441176470588234,
1390
+ "step": 695
1391
+ },
1392
+ {
1393
+ "loss": 0.022543656826019286,
1394
+ "grad_norm": 1.34375,
1395
+ "learning_rate": 6.294117647058824e-06,
1396
+ "entropy": 0.02749718502163887,
1397
+ "num_tokens": 441366.0,
1398
+ "mean_token_accuracy": 0.9880188047885895,
1399
+ "epoch": 2.0588235294117645,
1400
+ "step": 700
1401
+ },
1402
+ {
1403
+ "loss": 0.019685085117816924,
1404
+ "grad_norm": 0.9453125,
1405
+ "learning_rate": 6.19607843137255e-06,
1406
+ "entropy": 0.024849089048802852,
1407
+ "num_tokens": 444474.0,
1408
+ "mean_token_accuracy": 0.9906105160713196,
1409
+ "epoch": 2.073529411764706,
1410
+ "step": 705
1411
+ },
1412
+ {
1413
+ "loss": 0.020225000381469727,
1414
+ "grad_norm": 1.234375,
1415
+ "learning_rate": 6.098039215686276e-06,
1416
+ "entropy": 0.023934758827090265,
1417
+ "num_tokens": 447652.0,
1418
+ "mean_token_accuracy": 0.9896179974079132,
1419
+ "epoch": 2.088235294117647,
1420
+ "step": 710
1421
+ },
1422
+ {
1423
+ "loss": 0.02128472626209259,
1424
+ "grad_norm": 1.078125,
1425
+ "learning_rate": 6e-06,
1426
+ "entropy": 0.02389440070837736,
1427
+ "num_tokens": 450833.0,
1428
+ "mean_token_accuracy": 0.9899099349975586,
1429
+ "epoch": 2.1029411764705883,
1430
+ "step": 715
1431
+ },
1432
+ {
1433
+ "loss": 0.021367147564888,
1434
+ "grad_norm": 1.6015625,
1435
+ "learning_rate": 5.901960784313726e-06,
1436
+ "entropy": 0.02620517127215862,
1437
+ "num_tokens": 453949.0,
1438
+ "mean_token_accuracy": 0.988726532459259,
1439
+ "epoch": 2.1176470588235294,
1440
+ "step": 720
1441
+ },
1442
+ {
1443
+ "loss": 0.01960753947496414,
1444
+ "grad_norm": 1.03125,
1445
+ "learning_rate": 5.803921568627452e-06,
1446
+ "entropy": 0.02435927651822567,
1447
+ "num_tokens": 457147.0,
1448
+ "mean_token_accuracy": 0.9908569097518921,
1449
+ "epoch": 2.1323529411764706,
1450
+ "step": 725
1451
+ },
1452
+ {
1453
+ "loss": 0.022167882323265074,
1454
+ "grad_norm": 1.234375,
1455
+ "learning_rate": 5.705882352941177e-06,
1456
+ "entropy": 0.02521121110767126,
1457
+ "num_tokens": 460308.0,
1458
+ "mean_token_accuracy": 0.9891940593719483,
1459
+ "epoch": 2.1470588235294117,
1460
+ "step": 730
1461
+ },
1462
+ {
1463
+ "loss": 0.0210279181599617,
1464
+ "grad_norm": 1.359375,
1465
+ "learning_rate": 5.607843137254903e-06,
1466
+ "entropy": 0.02500821612775326,
1467
+ "num_tokens": 463449.0,
1468
+ "mean_token_accuracy": 0.9884547054767608,
1469
+ "epoch": 2.161764705882353,
1470
+ "step": 735
1471
+ },
1472
+ {
1473
+ "loss": 0.01987575888633728,
1474
+ "grad_norm": 1.03125,
1475
+ "learning_rate": 5.509803921568628e-06,
1476
+ "entropy": 0.025977463461458683,
1477
+ "num_tokens": 466590.0,
1478
+ "mean_token_accuracy": 0.9888093769550323,
1479
+ "epoch": 2.176470588235294,
1480
+ "step": 740
1481
+ },
1482
+ {
1483
+ "loss": 0.019111356139183043,
1484
+ "grad_norm": 1.25,
1485
+ "learning_rate": 5.411764705882353e-06,
1486
+ "entropy": 0.02638601940125227,
1487
+ "num_tokens": 469726.0,
1488
+ "mean_token_accuracy": 0.9917258858680725,
1489
+ "epoch": 2.1911764705882355,
1490
+ "step": 745
1491
+ },
1492
+ {
1493
+ "loss": 0.020354922115802764,
1494
+ "grad_norm": 1.171875,
1495
+ "learning_rate": 5.313725490196079e-06,
1496
+ "entropy": 0.026662386767566205,
1497
+ "num_tokens": 472853.0,
1498
+ "mean_token_accuracy": 0.99064000248909,
1499
+ "epoch": 2.2058823529411766,
1500
+ "step": 750
1501
+ },
1502
+ {
1503
+ "loss": 0.01959734410047531,
1504
+ "grad_norm": 0.80859375,
1505
+ "learning_rate": 5.2156862745098044e-06,
1506
+ "entropy": 0.02579411044716835,
1507
+ "num_tokens": 476008.0,
1508
+ "mean_token_accuracy": 0.9904728531837463,
1509
+ "epoch": 2.2205882352941178,
1510
+ "step": 755
1511
+ },
1512
+ {
1513
+ "loss": 0.020466303825378417,
1514
+ "grad_norm": 1.3828125,
1515
+ "learning_rate": 5.11764705882353e-06,
1516
+ "entropy": 0.0256651122123003,
1517
+ "num_tokens": 479150.0,
1518
+ "mean_token_accuracy": 0.9903539717197418,
1519
+ "epoch": 2.235294117647059,
1520
+ "step": 760
1521
+ },
1522
+ {
1523
+ "loss": 0.01983775794506073,
1524
+ "grad_norm": 0.99609375,
1525
+ "learning_rate": 5.019607843137255e-06,
1526
+ "entropy": 0.02584236618131399,
1527
+ "num_tokens": 482321.0,
1528
+ "mean_token_accuracy": 0.9914842903614044,
1529
+ "epoch": 2.25,
1530
+ "step": 765
1531
+ },
1532
+ {
1533
+ "loss": 0.020100761950016022,
1534
+ "grad_norm": 1.046875,
1535
+ "learning_rate": 4.921568627450981e-06,
1536
+ "entropy": 0.02499296572059393,
1537
+ "num_tokens": 485510.0,
1538
+ "mean_token_accuracy": 0.991219836473465,
1539
+ "epoch": 2.264705882352941,
1540
+ "step": 770
1541
+ },
1542
+ {
1543
+ "loss": 0.02088477313518524,
1544
+ "grad_norm": 1.328125,
1545
+ "learning_rate": 4.823529411764706e-06,
1546
+ "entropy": 0.024959737621247768,
1547
+ "num_tokens": 488698.0,
1548
+ "mean_token_accuracy": 0.9898148238658905,
1549
+ "epoch": 2.2794117647058822,
1550
+ "step": 775
1551
+ },
1552
+ {
1553
+ "loss": 0.0195361465215683,
1554
+ "grad_norm": 1.2421875,
1555
+ "learning_rate": 4.725490196078431e-06,
1556
+ "entropy": 0.023672481067478657,
1557
+ "num_tokens": 491906.0,
1558
+ "mean_token_accuracy": 0.9900302290916443,
1559
+ "epoch": 2.2941176470588234,
1560
+ "step": 780
1561
+ },
1562
+ {
1563
+ "loss": 0.019702821969985962,
1564
+ "grad_norm": 1.265625,
1565
+ "learning_rate": 4.627450980392157e-06,
1566
+ "entropy": 0.025737580843269825,
1567
+ "num_tokens": 494997.0,
1568
+ "mean_token_accuracy": 0.9905776441097259,
1569
+ "epoch": 2.3088235294117645,
1570
+ "step": 785
1571
+ },
1572
+ {
1573
+ "loss": 0.018527360260486604,
1574
+ "grad_norm": 1.078125,
1575
+ "learning_rate": 4.529411764705883e-06,
1576
+ "entropy": 0.02454463895410299,
1577
+ "num_tokens": 498138.0,
1578
+ "mean_token_accuracy": 0.9910318195819855,
1579
+ "epoch": 2.323529411764706,
1580
+ "step": 790
1581
+ },
1582
+ {
1583
+ "loss": 0.018923106789588928,
1584
+ "grad_norm": 1.359375,
1585
+ "learning_rate": 4.431372549019608e-06,
1586
+ "entropy": 0.0245100449770689,
1587
+ "num_tokens": 501316.0,
1588
+ "mean_token_accuracy": 0.9911953806877136,
1589
+ "epoch": 2.338235294117647,
1590
+ "step": 795
1591
+ },
1592
+ {
1593
+ "loss": 0.01874026209115982,
1594
+ "grad_norm": 1.140625,
1595
+ "learning_rate": 4.333333333333334e-06,
1596
+ "entropy": 0.023334310948848726,
1597
+ "num_tokens": 504533.0,
1598
+ "mean_token_accuracy": 0.9910171329975128,
1599
+ "epoch": 2.3529411764705883,
1600
+ "step": 800
1601
+ },
1602
+ {
1603
+ "loss": 0.022160655260086058,
1604
+ "grad_norm": 1.2578125,
1605
+ "learning_rate": 4.235294117647059e-06,
1606
+ "entropy": 0.026187057420611382,
1607
+ "num_tokens": 507616.0,
1608
+ "mean_token_accuracy": 0.9876076638698578,
1609
+ "epoch": 2.3676470588235294,
1610
+ "step": 805
1611
+ },
1612
+ {
1613
+ "loss": 0.018640576303005217,
1614
+ "grad_norm": 1.03125,
1615
+ "learning_rate": 4.137254901960784e-06,
1616
+ "entropy": 0.02308085039258003,
1617
+ "num_tokens": 510793.0,
1618
+ "mean_token_accuracy": 0.9908162891864777,
1619
+ "epoch": 2.3823529411764706,
1620
+ "step": 810
1621
+ },
1622
+ {
1623
+ "loss": 0.019237047433853148,
1624
+ "grad_norm": 0.8984375,
1625
+ "learning_rate": 4.03921568627451e-06,
1626
+ "entropy": 0.024417817965149878,
1627
+ "num_tokens": 513995.0,
1628
+ "mean_token_accuracy": 0.9902299284934998,
1629
+ "epoch": 2.3970588235294117,
1630
+ "step": 815
1631
+ },
1632
+ {
1633
+ "loss": 0.020626239478588104,
1634
+ "grad_norm": 1.1640625,
1635
+ "learning_rate": 3.941176470588236e-06,
1636
+ "entropy": 0.025944224931299685,
1637
+ "num_tokens": 517128.0,
1638
+ "mean_token_accuracy": 0.9896773338317871,
1639
+ "epoch": 2.411764705882353,
1640
+ "step": 820
1641
+ },
1642
+ {
1643
+ "loss": 0.018906430900096895,
1644
+ "grad_norm": 1.0546875,
1645
+ "learning_rate": 3.843137254901962e-06,
1646
+ "entropy": 0.02529167104512453,
1647
+ "num_tokens": 520219.0,
1648
+ "mean_token_accuracy": 0.9905548214912414,
1649
+ "epoch": 2.426470588235294,
1650
+ "step": 825
1651
+ },
1652
+ {
1653
+ "loss": 0.01989607810974121,
1654
+ "grad_norm": 1.171875,
1655
+ "learning_rate": 3.7450980392156865e-06,
1656
+ "entropy": 0.025429282896220685,
1657
+ "num_tokens": 523368.0,
1658
+ "mean_token_accuracy": 0.9910161614418029,
1659
+ "epoch": 2.4411764705882355,
1660
+ "step": 830
1661
+ },
1662
+ {
1663
+ "loss": 0.019511505961418152,
1664
+ "grad_norm": 1.046875,
1665
+ "learning_rate": 3.6470588235294117e-06,
1666
+ "entropy": 0.026134114153683184,
1667
+ "num_tokens": 526516.0,
1668
+ "mean_token_accuracy": 0.9898114144802094,
1669
+ "epoch": 2.4558823529411766,
1670
+ "step": 835
1671
+ },
1672
+ {
1673
+ "loss": 0.018582092225551607,
1674
+ "grad_norm": 1.1328125,
1675
+ "learning_rate": 3.5490196078431378e-06,
1676
+ "entropy": 0.02343358173966408,
1677
+ "num_tokens": 529660.0,
1678
+ "mean_token_accuracy": 0.9904271245002747,
1679
+ "epoch": 2.4705882352941178,
1680
+ "step": 840
1681
+ },
1682
+ {
1683
+ "loss": 0.020261451601982117,
1684
+ "grad_norm": 1.453125,
1685
+ "learning_rate": 3.450980392156863e-06,
1686
+ "entropy": 0.024460323713719846,
1687
+ "num_tokens": 532778.0,
1688
+ "mean_token_accuracy": 0.9899402976036071,
1689
+ "epoch": 2.485294117647059,
1690
+ "step": 845
1691
+ },
1692
+ {
1693
+ "loss": 0.020383948087692262,
1694
+ "grad_norm": 1.1796875,
1695
+ "learning_rate": 3.352941176470588e-06,
1696
+ "entropy": 0.024987665377557276,
1697
+ "num_tokens": 535932.0,
1698
+ "mean_token_accuracy": 0.9898059248924256,
1699
+ "epoch": 2.5,
1700
+ "step": 850
1701
+ },
1702
+ {
1703
+ "loss": 0.019448164105415344,
1704
+ "grad_norm": 1.3515625,
1705
+ "learning_rate": 3.2549019607843143e-06,
1706
+ "entropy": 0.02465162370353937,
1707
+ "num_tokens": 539037.0,
1708
+ "mean_token_accuracy": 0.9913235783576966,
1709
+ "epoch": 2.514705882352941,
1710
+ "step": 855
1711
+ },
1712
+ {
1713
+ "loss": 0.018925553560256957,
1714
+ "grad_norm": 1.046875,
1715
+ "learning_rate": 3.1568627450980395e-06,
1716
+ "entropy": 0.025184641405940057,
1717
+ "num_tokens": 542197.0,
1718
+ "mean_token_accuracy": 0.991470605134964,
1719
+ "epoch": 2.5294117647058822,
1720
+ "step": 860
1721
+ },
1722
+ {
1723
+ "loss": 0.01913969814777374,
1724
+ "grad_norm": 1.0546875,
1725
+ "learning_rate": 3.058823529411765e-06,
1726
+ "entropy": 0.024113286659121512,
1727
+ "num_tokens": 545387.0,
1728
+ "mean_token_accuracy": 0.9914486467838287,
1729
+ "epoch": 2.5441176470588234,
1730
+ "step": 865
1731
+ },
1732
+ {
1733
+ "loss": 0.018765930831432343,
1734
+ "grad_norm": 1.0703125,
1735
+ "learning_rate": 2.9607843137254903e-06,
1736
+ "entropy": 0.02413007989525795,
1737
+ "num_tokens": 548534.0,
1738
+ "mean_token_accuracy": 0.9907777428627014,
1739
+ "epoch": 2.5588235294117645,
1740
+ "step": 870
1741
+ },
1742
+ {
1743
+ "loss": 0.019279350340366364,
1744
+ "grad_norm": 2.1875,
1745
+ "learning_rate": 2.8627450980392155e-06,
1746
+ "entropy": 0.024522659182548524,
1747
+ "num_tokens": 551721.0,
1748
+ "mean_token_accuracy": 0.9905555963516235,
1749
+ "epoch": 2.5735294117647056,
1750
+ "step": 875
1751
+ },
1752
+ {
1753
+ "loss": 0.019660860300064087,
1754
+ "grad_norm": 1.1015625,
1755
+ "learning_rate": 2.7647058823529416e-06,
1756
+ "entropy": 0.024852845631539822,
1757
+ "num_tokens": 554912.0,
1758
+ "mean_token_accuracy": 0.9898727238178253,
1759
+ "epoch": 2.588235294117647,
1760
+ "step": 880
1761
+ },
1762
+ {
1763
+ "loss": 0.018780362606048585,
1764
+ "grad_norm": 1.0703125,
1765
+ "learning_rate": 2.666666666666667e-06,
1766
+ "entropy": 0.02551023568958044,
1767
+ "num_tokens": 558028.0,
1768
+ "mean_token_accuracy": 0.99192915558815,
1769
+ "epoch": 2.6029411764705883,
1770
+ "step": 885
1771
+ },
1772
+ {
1773
+ "loss": 0.01949601024389267,
1774
+ "grad_norm": 1.1953125,
1775
+ "learning_rate": 2.568627450980392e-06,
1776
+ "entropy": 0.025155650451779366,
1777
+ "num_tokens": 561189.0,
1778
+ "mean_token_accuracy": 0.990712708234787,
1779
+ "epoch": 2.6176470588235294,
1780
+ "step": 890
1781
+ },
1782
+ {
1783
+ "loss": 0.019716159999370576,
1784
+ "grad_norm": 1.296875,
1785
+ "learning_rate": 2.470588235294118e-06,
1786
+ "entropy": 0.024883992783725262,
1787
+ "num_tokens": 564374.0,
1788
+ "mean_token_accuracy": 0.989579439163208,
1789
+ "epoch": 2.6323529411764706,
1790
+ "step": 895
1791
+ },
1792
+ {
1793
+ "loss": 0.017295162379741668,
1794
+ "grad_norm": 0.97265625,
1795
+ "learning_rate": 2.3725490196078433e-06,
1796
+ "entropy": 0.0241273645311594,
1797
+ "num_tokens": 567550.0,
1798
+ "mean_token_accuracy": 0.9934020042419434,
1799
+ "epoch": 2.6470588235294117,
1800
+ "step": 900
1801
+ },
1802
+ {
1803
+ "loss": 0.020695842802524567,
1804
+ "grad_norm": 1.109375,
1805
+ "learning_rate": 2.274509803921569e-06,
1806
+ "entropy": 0.02697849553078413,
1807
+ "num_tokens": 570611.0,
1808
+ "mean_token_accuracy": 0.9914706110954284,
1809
+ "epoch": 2.661764705882353,
1810
+ "step": 905
1811
+ },
1812
+ {
1813
+ "loss": 0.017908445000648497,
1814
+ "grad_norm": 1.2734375,
1815
+ "learning_rate": 2.176470588235294e-06,
1816
+ "entropy": 0.022997986152768136,
1817
+ "num_tokens": 573767.0,
1818
+ "mean_token_accuracy": 0.9898150980472564,
1819
+ "epoch": 2.6764705882352944,
1820
+ "step": 910
1821
+ },
1822
+ {
1823
+ "loss": 0.020641934871673585,
1824
+ "grad_norm": 1.4921875,
1825
+ "learning_rate": 2.07843137254902e-06,
1826
+ "entropy": 0.027346356958150863,
1827
+ "num_tokens": 576830.0,
1828
+ "mean_token_accuracy": 0.9897843182086945,
1829
+ "epoch": 2.6911764705882355,
1830
+ "step": 915
1831
+ },
1832
+ {
1833
+ "loss": 0.019691270589828492,
1834
+ "grad_norm": 1.2890625,
1835
+ "learning_rate": 1.980392156862745e-06,
1836
+ "entropy": 0.023718219250440598,
1837
+ "num_tokens": 580065.0,
1838
+ "mean_token_accuracy": 0.9901076138019562,
1839
+ "epoch": 2.7058823529411766,
1840
+ "step": 920
1841
+ },
1842
+ {
1843
+ "loss": 0.02009253352880478,
1844
+ "grad_norm": 1.2109375,
1845
+ "learning_rate": 1.8823529411764707e-06,
1846
+ "entropy": 0.024860053882002832,
1847
+ "num_tokens": 583200.0,
1848
+ "mean_token_accuracy": 0.9894306361675262,
1849
+ "epoch": 2.7205882352941178,
1850
+ "step": 925
1851
+ },
1852
+ {
1853
+ "loss": 0.019820311665534975,
1854
+ "grad_norm": 1.1796875,
1855
+ "learning_rate": 1.7843137254901963e-06,
1856
+ "entropy": 0.02641481179744005,
1857
+ "num_tokens": 586247.0,
1858
+ "mean_token_accuracy": 0.9888152658939362,
1859
+ "epoch": 2.735294117647059,
1860
+ "step": 930
1861
+ },
1862
+ {
1863
+ "loss": 0.020238989591598512,
1864
+ "grad_norm": 1.34375,
1865
+ "learning_rate": 1.6862745098039217e-06,
1866
+ "entropy": 0.025426279939711093,
1867
+ "num_tokens": 589348.0,
1868
+ "mean_token_accuracy": 0.9893324971199036,
1869
+ "epoch": 2.75,
1870
+ "step": 935
1871
+ },
1872
+ {
1873
+ "loss": 0.020529073476791383,
1874
+ "grad_norm": 1.1953125,
1875
+ "learning_rate": 1.5882352941176472e-06,
1876
+ "entropy": 0.025489212945103645,
1877
+ "num_tokens": 592483.0,
1878
+ "mean_token_accuracy": 0.9883848607540131,
1879
+ "epoch": 2.764705882352941,
1880
+ "step": 940
1881
+ },
1882
+ {
1883
+ "loss": 0.019503119587898254,
1884
+ "grad_norm": 1.875,
1885
+ "learning_rate": 1.4901960784313726e-06,
1886
+ "entropy": 0.025844238512218,
1887
+ "num_tokens": 595654.0,
1888
+ "mean_token_accuracy": 0.9898752987384796,
1889
+ "epoch": 2.7794117647058822,
1890
+ "step": 945
1891
+ },
1892
+ {
1893
+ "loss": 0.020725423097610475,
1894
+ "grad_norm": 1.3359375,
1895
+ "learning_rate": 1.3921568627450982e-06,
1896
+ "entropy": 0.025542815588414668,
1897
+ "num_tokens": 598757.0,
1898
+ "mean_token_accuracy": 0.9899684190750122,
1899
+ "epoch": 2.7941176470588234,
1900
+ "step": 950
1901
+ },
1902
+ {
1903
+ "loss": 0.020795242488384248,
1904
+ "grad_norm": 1.1640625,
1905
+ "learning_rate": 1.2941176470588237e-06,
1906
+ "entropy": 0.023506213910877705,
1907
+ "num_tokens": 602069.0,
1908
+ "mean_token_accuracy": 0.9894281327724457,
1909
+ "epoch": 2.8088235294117645,
1910
+ "step": 955
1911
+ },
1912
+ {
1913
+ "loss": 0.01915638893842697,
1914
+ "grad_norm": 1.21875,
1915
+ "learning_rate": 1.196078431372549e-06,
1916
+ "entropy": 0.024655142053961753,
1917
+ "num_tokens": 605286.0,
1918
+ "mean_token_accuracy": 0.9900248169898986,
1919
+ "epoch": 2.8235294117647056,
1920
+ "step": 960
1921
+ },
1922
+ {
1923
+ "loss": 0.01975841522216797,
1924
+ "grad_norm": 1.1484375,
1925
+ "learning_rate": 1.0980392156862745e-06,
1926
+ "entropy": 0.025551106408238412,
1927
+ "num_tokens": 608374.0,
1928
+ "mean_token_accuracy": 0.9892638444900512,
1929
+ "epoch": 2.838235294117647,
1930
+ "step": 965
1931
+ },
1932
+ {
1933
+ "loss": 0.020852866768836974,
1934
+ "grad_norm": 1.2421875,
1935
+ "learning_rate": 1.0000000000000002e-06,
1936
+ "entropy": 0.02480896282941103,
1937
+ "num_tokens": 611577.0,
1938
+ "mean_token_accuracy": 0.9892595648765564,
1939
+ "epoch": 2.8529411764705883,
1940
+ "step": 970
1941
+ },
1942
+ {
1943
+ "loss": 0.019326749444007873,
1944
+ "grad_norm": 0.875,
1945
+ "learning_rate": 9.019607843137256e-07,
1946
+ "entropy": 0.02385783474892378,
1947
+ "num_tokens": 614761.0,
1948
+ "mean_token_accuracy": 0.9904800593852997,
1949
+ "epoch": 2.8676470588235294,
1950
+ "step": 975
1951
+ },
1952
+ {
1953
+ "loss": 0.019405061006546022,
1954
+ "grad_norm": 1.1875,
1955
+ "learning_rate": 8.039215686274511e-07,
1956
+ "entropy": 0.026029090210795403,
1957
+ "num_tokens": 617870.0,
1958
+ "mean_token_accuracy": 0.9896216452121734,
1959
+ "epoch": 2.8823529411764706,
1960
+ "step": 980
1961
+ },
1962
+ {
1963
+ "loss": 0.019337351620197295,
1964
+ "grad_norm": 0.9921875,
1965
+ "learning_rate": 7.058823529411766e-07,
1966
+ "entropy": 0.026062553003430366,
1967
+ "num_tokens": 620943.0,
1968
+ "mean_token_accuracy": 0.9899002552032471,
1969
+ "epoch": 2.8970588235294117,
1970
+ "step": 985
1971
+ },
1972
+ {
1973
+ "loss": 0.01972263157367706,
1974
+ "grad_norm": 1.5625,
1975
+ "learning_rate": 6.07843137254902e-07,
1976
+ "entropy": 0.025324805453419686,
1977
+ "num_tokens": 624094.0,
1978
+ "mean_token_accuracy": 0.9898600101470947,
1979
+ "epoch": 2.911764705882353,
1980
+ "step": 990
1981
+ },
1982
+ {
1983
+ "loss": 0.017833781242370606,
1984
+ "grad_norm": 1.2265625,
1985
+ "learning_rate": 5.098039215686275e-07,
1986
+ "entropy": 0.023284821771085262,
1987
+ "num_tokens": 627253.0,
1988
+ "mean_token_accuracy": 0.9910983681678772,
1989
+ "epoch": 2.9264705882352944,
1990
+ "step": 995
1991
+ },
1992
+ {
1993
+ "loss": 0.020137375593185423,
1994
+ "grad_norm": 1.3984375,
1995
+ "learning_rate": 4.1176470588235295e-07,
1996
+ "entropy": 0.024203809909522533,
1997
+ "num_tokens": 630427.0,
1998
+ "mean_token_accuracy": 0.9907480180263519,
1999
+ "epoch": 2.9411764705882355,
2000
+ "step": 1000
2001
+ },
2002
+ {
2003
+ "loss": 0.019109995663166048,
2004
+ "grad_norm": 1.21875,
2005
+ "learning_rate": 3.1372549019607843e-07,
2006
+ "entropy": 0.02416255362331867,
2007
+ "num_tokens": 633632.0,
2008
+ "mean_token_accuracy": 0.9915190756320953,
2009
+ "epoch": 2.9558823529411766,
2010
+ "step": 1005
2011
+ },
2012
+ {
2013
+ "loss": 0.02000269144773483,
2014
+ "grad_norm": 1.859375,
2015
+ "learning_rate": 2.1568627450980394e-07,
2016
+ "entropy": 0.024217843264341354,
2017
+ "num_tokens": 636805.0,
2018
+ "mean_token_accuracy": 0.9894875824451447,
2019
+ "epoch": 2.9705882352941178,
2020
+ "step": 1010
2021
+ },
2022
+ {
2023
+ "loss": 0.020338763296604157,
2024
+ "grad_norm": 1.546875,
2025
+ "learning_rate": 1.1764705882352942e-07,
2026
+ "entropy": 0.024258859269320966,
2027
+ "num_tokens": 639984.0,
2028
+ "mean_token_accuracy": 0.9892021059989929,
2029
+ "epoch": 2.985294117647059,
2030
+ "step": 1015
2031
+ },
2032
+ {
2033
+ "loss": 0.020995336771011352,
2034
+ "grad_norm": 1.046875,
2035
+ "learning_rate": 1.9607843137254902e-08,
2036
+ "entropy": 0.025342148169875144,
2037
+ "num_tokens": 643104.0,
2038
+ "mean_token_accuracy": 0.9887544453144074,
2039
+ "epoch": 3.0,
2040
+ "step": 1020
2041
+ },
2042
+ {
2043
+ "train_runtime": 3944.5682,
2044
+ "train_samples_per_second": 0.517,
2045
+ "train_steps_per_second": 0.259,
2046
+ "total_flos": 5056111718203392.0,
2047
+ "train_loss": 0.07629515403041652,
2048
+ "epoch": 3.0,
2049
+ "step": 1020
2050
+ }
2051
+ ]
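The `learning_rate` values in this tail of the training log are consistent with a linear decay toward zero over the 1,020 optimizer steps. The sketch below checks that reading against a few logged entries. The peak rate of 2e-5 and the absence of warmup are inferred from the logged numbers, not stated in the commit, and `linear_lr` is a hypothetical helper rather than anything in the repo.

```python
import math

TOTAL_STEPS = 1020  # final "step" value in the log
PEAK_LR = 2e-5      # assumption: inferred from the logged values, not confirmed by the commit


def linear_lr(step: int, peak: float = PEAK_LR, total: int = TOTAL_STEPS) -> float:
    """Formula that reproduces the logged rates: linear decay evaluated at step - 1."""
    return peak * (total - (step - 1)) / total


# (step, learning_rate) pairs copied from the log entries above.
logged = {
    860: 3.1568627450980395e-06,
    935: 1.6862745098039217e-06,
    1000: 4.1176470588235295e-07,
    1020: 1.9607843137254902e-08,
}

for step, lr in logged.items():
    assert math.isclose(linear_lr(step), lr, rel_tol=1e-9), step

# The "epoch" field equals step / 340, i.e. 340 optimizer steps per epoch.
assert math.isclose(860 / 340, 2.5294117647058822)
```

If the real run used warmup or a different scheduler, the early part of the log (outside this excerpt) would not fit this formula even though the tail does.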
llm_policy.py CHANGED
@@ -76,13 +76,33 @@ class LLMPolicy:
76
  if self.tokenizer.pad_token is None:
77
  self.tokenizer.pad_token = self.tokenizer.eos_token
78
 
79
- self.model = AutoModelForCausalLM.from_pretrained(
80
- model_name_or_path,
81
- torch_dtype=torch_dtype,
82
- ).to(resolved_device)
83
  self.model.eval()
84
  self.device = resolved_device
85
 
86
  # ------------------------------------------------------------------
87
  # Public API
88
  # ------------------------------------------------------------------
 
76
  if self.tokenizer.pad_token is None:
77
  self.tokenizer.pad_token = self.tokenizer.eos_token
78
 
79
+ # transformers renamed torch_dtype -> dtype; try new kwarg first and
80
+ # fall back for older versions. Works silently on both.
81
+ try:
82
+ self.model = AutoModelForCausalLM.from_pretrained(
83
+ model_name_or_path,
84
+ dtype=torch_dtype,
85
+ ).to(resolved_device)
86
+ except TypeError:
87
+ self.model = AutoModelForCausalLM.from_pretrained(
88
+ model_name_or_path,
89
+ torch_dtype=torch_dtype,
90
+ ).to(resolved_device)
91
  self.model.eval()
92
  self.device = resolved_device
93
 
94
+ # Strip sampling-only fields from the shipped generation_config so
95
+ # transformers doesn't warn "these flags will be ignored" when we
96
+ # decode greedily (do_sample=False).
97
+ gen_config = getattr(self.model, "generation_config", None)
98
+ if gen_config is not None:
99
+ for attr in ("temperature", "top_p", "top_k"):
100
+ if hasattr(gen_config, attr):
101
+ try:
102
+ setattr(gen_config, attr, None)
103
+ except Exception:
104
+ pass
105
+
106
  # ------------------------------------------------------------------
107
  # Public API
108
  # ------------------------------------------------------------------
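The `try`/`except TypeError` fallback in the hunk above is one instance of a general pattern: call with the new keyword name, and retry with the old one if the callee rejects it. A minimal sketch of that pattern follows, using hypothetical stand-in loaders (`load_old`, `load_new`) instead of `transformers`. Whether a given `transformers` release actually raises `TypeError` for the unknown keyword is not verified here.

```python
from typing import Any, Callable


def call_with_renamed_kwarg(
    fn: Callable[..., Any],
    *args: Any,
    new_name: str,
    old_name: str,
    value: Any,
) -> Any:
    """Call fn with `new_name=value`, retrying with `old_name` on TypeError."""
    try:
        return fn(*args, **{new_name: value})
    except TypeError:
        # Caveat: a TypeError raised for an unrelated reason inside fn also
        # lands here and triggers the retry.
        return fn(*args, **{old_name: value})


# Hypothetical stand-ins for an older and a newer library version.
def load_old(path: str, torch_dtype: str = "float32") -> str:
    return f"{path}:{torch_dtype}"


def load_new(path: str, dtype: str = "float32") -> str:
    return f"{path}:{dtype}"


for loader in (load_old, load_new):
    result = call_with_renamed_kwarg(
        loader, "some/model", new_name="dtype", old_name="torch_dtype", value="bfloat16"
    )
    assert result == "some/model:bfloat16"
```

The trade-off is that the fallback cannot distinguish "unknown keyword" from any other `TypeError`; checking the installed library version explicitly would be stricter, at the cost of hard-coding a version boundary.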
scripts/before_after_demo.py ADDED
@@ -0,0 +1,197 @@
1
+ """Before/after demo: base model vs fine-tuned model on the SAME incident.
2
+
3
+ Runs both policies against the same task under the same seed, prints a clean
4
+ side-by-side trace, and writes ``artifacts/before_after_demo.md`` which you
5
+ can paste into the blog post or screen-record for the video.
6
+
7
+ Usage (after ``train_trl.py`` has saved ``artifacts/sft_model``)::
8
+
9
+ ENV_URL=http://127.0.0.1:8000 python scripts/before_after_demo.py
10
+
11
+ Environment variables:
12
+ ENV_URL — URL of a running Incident Command Center server
13
+ BASE_MODEL — HF hub id of the base model
14
+ SFT_MODEL_DIR — path to the fine-tuned checkpoint (default: artifacts/sft_model)
15
+ DEMO_TASK — task difficulty to demo (default: hard)
16
+ DEMO_MAX_STEPS — per-episode step cap (default: 120)
17
+ """
18
+
19
+ from __future__ import annotations
20
+
21
+ import json
22
+ import os
23
+ import random
24
+ import sys
25
+ from pathlib import Path
26
+ from typing import Dict, List, Optional
27
+
28
+ # Ensure the repo root is on sys.path when invoked from a subdirectory
29
+ _REPO_ROOT = Path(__file__).resolve().parents[1]
30
+ if str(_REPO_ROOT) not in sys.path:
31
+ sys.path.insert(0, str(_REPO_ROOT))
32
+
33
+ from client import IncidentCommandEnvClient # noqa: E402
34
+ from models import IncidentAction, IncidentObservation # noqa: E402
35
+
36
+
37
+ ENV_URL = os.getenv("ENV_URL", "http://127.0.0.1:8000")
38
+ BASE_MODEL = os.getenv("BASE_MODEL", "Qwen/Qwen2.5-0.5B-Instruct")
39
+ SFT_MODEL_DIR = os.getenv("SFT_MODEL_DIR", "artifacts/sft_model")
40
+ DEMO_TASK = os.getenv("DEMO_TASK", "hard")
41
+ DEMO_MAX_STEPS = int(os.getenv("DEMO_MAX_STEPS", "120"))
42
+ DEMO_SEED = int(os.getenv("DEMO_SEED", "2026"))
43
+
44
+
45
+ def _format_obs_summary(obs: IncidentObservation) -> str:
46
+ return (
47
+ f"{obs.incident_title} "
48
+ f"(tier={obs.customer_tier}, users={obs.affected_users_estimate}, "
49
+ f"$/min={obs.revenue_impact_usd_per_min})"
50
+ )
51
+
52
+
53
+ def _format_action(action: IncidentAction) -> str:
54
+ target = action.target or "-"
55
+ bits = [f"{action.actor}:{action.action_type}:{target}"]
56
+ if action.reason:
57
+ bits.append(f"reason={action.reason[:80]}")
58
+ return " | ".join(bits)
59
+
60
+
61
+ def _format_components(components: Optional[Dict[str, float]]) -> str:
62
+ if not components:
63
+ return "-"
64
+ return ", ".join(f"{k}={v:+.2f}" for k, v in components.items())
65
+
66
+
67
+ def _rollout_with_policy(policy_name: str, select_fn) -> Dict:
68
+ env = IncidentCommandEnvClient(base_url=ENV_URL).sync()
69
+ random.seed(DEMO_SEED)
70
+ steps_log: List[Dict] = []
71
+ total_reward = 0.0
72
+ components_sum: Dict[str, float] = {}
73
+ closed_incidents = 0
74
+ incident_seen: List[str] = []
75
+ try:
76
+ result = env.reset(task_name=DEMO_TASK)
77
+ step_idx = 0
78
+ while not result.done and step_idx < DEMO_MAX_STEPS:
79
+ step_idx += 1
80
+ obs = result.observation
81
+ if obs.incident_id not in incident_seen:
82
+ incident_seen.append(obs.incident_id)
83
+ action = select_fn(obs)
84
+ result = env.step(action)
85
+ reward = float(result.reward or 0.0)
86
+ total_reward += reward
87
+ new_obs = result.observation
88
+ step_components = getattr(new_obs, "reward_components", None) or {}
89
+ for k, v in step_components.items():
90
+ components_sum[k] = components_sum.get(k, 0.0) + float(v)
91
+ if action.action_type == "close_incident" and reward > 0:
92
+ closed_incidents += 1
93
+ steps_log.append(
94
+ {
95
+ "step": step_idx,
96
+ "incident": obs.incident_id,
97
+ "summary": _format_obs_summary(obs),
98
+ "action": _format_action(action),
99
+ "reward": round(reward, 3),
100
+ "components": _format_components(step_components),
101
+ }
102
+ )
103
+ finally:
104
+ try:
105
+ env.close()
106
+ except Exception:
107
+ pass
108
+ return {
109
+ "policy": policy_name,
110
+ "task": DEMO_TASK,
111
+ "steps": len(steps_log),
112
+ "total_reward": round(total_reward, 3),
113
+ "incidents_seen": incident_seen,
114
+ "incidents_closed": closed_incidents,
115
+ "components_sum": {k: round(v, 3) for k, v in components_sum.items()},
116
+ "trace": steps_log,
117
+ }
118
+
119
+
120
+ def _write_markdown(base_run: Dict, sft_run: Dict, out_path: Path) -> None:
121
+ lines: List[str] = []
122
+ lines.append(f"# Before vs After — {DEMO_TASK.title()} task demo\n")
123
+ lines.append(f"Both policies ran against the same seeded task (`{DEMO_TASK}`, seed {DEMO_SEED}) ")
124
+ lines.append("on an identical Incident Command Center server. Each sees the same incident ")
125
+ lines.append("queue in the same order.\n")
126
+ lines.append("## Headline\n")
127
+ lines.append("| Policy | Total reward | Steps | Incidents closed |")
128
+ lines.append("|---|---:|---:|---:|")
129
+ lines.append(
130
+ f"| Base `{BASE_MODEL}` | {base_run['total_reward']:+.2f} | "
131
+ f"{base_run['steps']} | {base_run['incidents_closed']} |"
132
+ )
133
+ lines.append(
134
+ f"| **Fine-tuned (SFT)** | **{sft_run['total_reward']:+.2f}** | "
135
+ f"{sft_run['steps']} | {sft_run['incidents_closed']} |"
136
+ )
137
+ delta = sft_run["total_reward"] - base_run["total_reward"]
138
+ lines.append(f"\n**Reward delta (fine-tuned minus base): {delta:+.2f}**\n")
139
+
140
+ lines.append("## Reward sources (summed across the episode)\n")
141
+ lines.append("| Component | Base | Fine-tuned |")
142
+ lines.append("|---|---:|---:|")
143
+ all_keys = sorted(set(base_run["components_sum"]) | set(sft_run["components_sum"]))
144
+ for k in all_keys:
145
+ lines.append(
146
+ f"| `{k}` | {base_run['components_sum'].get(k, 0.0):+.2f} | "
147
+ f"{sft_run['components_sum'].get(k, 0.0):+.2f} |"
148
+ )
149
+
150
+ def _trace_block(run: Dict, title: str) -> None:
151
+ lines.append(f"\n## Trace — {title}\n")
152
+ lines.append("```")
153
+ for row in run["trace"]:
154
+ lines.append(
155
+ f"step {row['step']:>3} | incident={row['incident']} | "
156
+ f"{row['action']} | reward={row['reward']:+.2f} | {row['components']}"
157
+ )
158
+ lines.append("```")
159
+
160
+ _trace_block(base_run, f"Base model ({BASE_MODEL})")
161
+ _trace_block(sft_run, "Fine-tuned (SFT) model")
162
+
163
+ out_path.write_text("\n".join(lines), encoding="utf-8")
164
+
165
+
166
+ def main() -> None:
167
+ from llm_policy import LLMPolicy
168
+
169
+ print(f"[demo] task={DEMO_TASK} seed={DEMO_SEED} env={ENV_URL}")
170
+
171
+ print(f"[demo] Loading base model: {BASE_MODEL}")
172
+ base_policy = LLMPolicy(BASE_MODEL, label="base_model")
173
+ base_run = _rollout_with_policy("base_model", base_policy.select_action)
174
+ base_policy.release()
175
+
176
+ print(f"[demo] Loading SFT model: {SFT_MODEL_DIR}")
177
+ sft_policy = LLMPolicy(SFT_MODEL_DIR, label="sft_model")
178
+ sft_run = _rollout_with_policy("sft_model", sft_policy.select_action)
179
+ sft_policy.release()
180
+
181
+ art_dir = Path("artifacts")
182
+ art_dir.mkdir(exist_ok=True)
183
+ md_path = art_dir / "before_after_demo.md"
184
+ json_path = art_dir / "before_after_demo.json"
185
+ _write_markdown(base_run, sft_run, md_path)
186
+ with json_path.open("w", encoding="utf-8") as f:
187
+ json.dump({"base": base_run, "sft": sft_run}, f, indent=2)
188
+
189
+ print(f"[demo] Base total={base_run['total_reward']:+.2f} "
190
+ f"steps={base_run['steps']} closed={base_run['incidents_closed']}")
191
+ print(f"[demo] SFT total={sft_run['total_reward']:+.2f} "
192
+ f"steps={sft_run['steps']} closed={sft_run['incidents_closed']}")
193
+ print(f"[demo] Wrote {md_path} and {json_path}")
194
+
195
+
196
+ if __name__ == "__main__":
197
+ main()
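The per-component bookkeeping in `_rollout_with_policy` amounts to summing a dictionary of reward components across steps and then rendering the totals with explicit signs. A minimal standalone sketch of that accumulation follows. The component names come from this commit, but the per-step values are made up for illustration, and `sum_components` / `format_components` are hypothetical helpers mirroring the script's logic.

```python
from typing import Dict, Iterable, Mapping


def sum_components(steps: Iterable[Mapping[str, float]]) -> Dict[str, float]:
    """Sum each reward component over all steps, rounded to 3 decimals."""
    totals: Dict[str, float] = {}
    for step_components in steps:
        for name, value in step_components.items():
            totals[name] = totals.get(name, 0.0) + float(value)
    return {name: round(total, 3) for name, total in totals.items()}


def format_components(components: Mapping[str, float]) -> str:
    """Render components as 'name=+0.00' pairs, or '-' when empty."""
    if not components:
        return "-"
    return ", ".join(f"{k}={v:+.2f}" for k, v in components.items())


# Made-up per-step values; only the component names appear in the commit.
episode = [
    {"step_cost": -0.1, "clue_bonus": 0.5},
    {"step_cost": -0.1},
    {"step_cost": -0.1, "closure_correct": 2.0},
]
totals = sum_components(episode)
assert totals == {"step_cost": -0.3, "clue_bonus": 0.5, "closure_correct": 2.0}
print(format_components(totals))
```

Rounding only at the end, as the script does, avoids compounding rounding error across long episodes.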
server/app.py CHANGED
@@ -17,10 +17,12 @@ from __future__ import annotations
17
 
18
  import json
19
  import logging
 
20
  from typing import Any, Dict
21
 
22
  import uvicorn
23
  from fastapi.responses import HTMLResponse, JSONResponse, PlainTextResponse
 
24
  from openenv.core.env_server import create_fastapi_app
25
 
26
  from models import IncidentAction, IncidentObservation
@@ -42,12 +44,41 @@ _LOG = logging.getLogger("icc.app")
42
  _CONFIG = EnvConfig.from_env()
43
  configure_logging(level=_CONFIG.log_level, structured=_CONFIG.structured_logging)
44
 
45
  app = create_fastapi_app(
46
  IncidentCommandCenterEnvironment,
47
  IncidentAction,
48
  IncidentObservation,
49
  )
50
 
51
 
52
  # ---------------------------------------------------------------------------
53
  # Introspection helpers
@@ -161,6 +192,153 @@ async def root() -> HTMLResponse:
161
 
162
  def _dashboard_html() -> str:
163
  metadata_json = json.dumps(_metadata_payload(), indent=2)
164
  return f"""
165
  <!DOCTYPE html>
166
  <html lang='en'>
@@ -180,24 +358,39 @@ def _dashboard_html() -> str:
180
  background: radial-gradient(1000px 600px at 10% -10%, #1e293b, var(--bg));
181
  color: var(--text); padding: 2rem; margin: 0; min-height: 100vh;
182
  }}
183
- header {{ display:flex; align-items:center; justify-content:space-between; max-width:1100px; margin:0 auto 1.5rem; }}
184
  .brand {{ display:flex; align-items:center; gap:0.75rem; }}
185
  .logo {{ width:44px; height:44px; border-radius:10px; background:linear-gradient(135deg,var(--primary),var(--accent)); }}
186
  h1 {{ font-size:1.6rem; margin:0; }}
187
- h2 {{ font-size:1.1rem; margin:1.4rem 0 0.6rem; color:#cbd5e1; }}
188
  .sub {{ color: var(--muted); }}
189
- .grid {{ display:grid; grid-template-columns: repeat(auto-fit,minmax(260px,1fr)); gap:1rem; max-width:1100px; margin:0 auto; }}
 
190
  .card {{ background: var(--card); border: 1px solid #1f2a44; padding: 1.25rem; border-radius: 14px; }}
191
  .card h3 {{ margin:0 0 0.5rem; font-size:1rem; color:#f1f5f9; }}
192
  .pill {{ display:inline-block; padding:2px 8px; margin:2px; border-radius:999px; background:#1e293b; border:1px solid #334155; color:#cbd5e1; font-size:0.78rem; }}
 
193
  .container {{ max-width: 1100px; margin: 0 auto; }}
194
  code {{ background:#0b1225; border:1px solid #1f2a44; padding:2px 6px; border-radius:6px; color:#67e8f9; font-family:'JetBrains Mono', monospace; }}
195
  pre {{ background:#0b1225; border:1px solid #1f2a44; padding: 1rem; border-radius: 10px; color:#cbd5e1; overflow-x:auto; font-size:0.85rem; }}
196
  a {{ color: var(--accent); text-decoration: none; }}
 
197
  .kpi {{ display:flex; flex-direction:column; gap:0.25rem; }}
198
  .kpi .num {{ font-size:1.6rem; font-weight:700; color:#f8fafc; }}
199
  .kpi .lbl {{ color: var(--muted); font-size:0.8rem; }}
 
200
  footer {{ max-width:1100px; margin:2rem auto 0; color:var(--muted); font-size:0.85rem; }}
201
  </style>
202
  </head>
203
  <body>
@@ -206,16 +399,53 @@ def _dashboard_html() -> str:
206
  <div class='logo'></div>
207
  <div>
208
  <h1>Incident Command Center</h1>
209
- <div class='sub'>OpenEnv · Multi-Agent · Long-Horizon · Enterprise Simulation</div>
210
  </div>
211
  </div>
212
- <div>
 
 
 
213
  <span class='pill'>v{_CONFIG.version}</span>
214
  <span class='pill'>task: easy / medium / hard</span>
215
  </div>
216
  </header>
217
 
218
  <div class='container'>
219
  <div class='grid'>
220
  <div class='card'>
221
  <div class='kpi'>
@@ -246,6 +476,33 @@ def _dashboard_html() -> str:
246
  </div>
247
  </div>
248
249
  <h2>Endpoints</h2>
250
  <div class='card'>
251
  <p class='sub'>Standard OpenEnv contract plus operational endpoints.</p>
@@ -258,22 +515,40 @@ def _dashboard_html() -> str:
258
  <li><code>GET /env-info</code> — action space, reward model, budgets.</li>
259
  <li><code>GET /metrics</code> — Prometheus-style counters.</li>
260
  <li><code>GET /docs</code> — interactive OpenAPI documentation.</li>
 
261
  </ul>
262
  </div>
263
 
264
  <h2>Action space</h2>
265
  <div class='card'>
266
  {"".join(f"<span class='pill'>{a}</span>" for a in ALL_ACTIONS)}
267
- <p class='sub'>Each action is gated by the acting role; wrong-actor calls are penalised.</p>
 
 
268
  </div>
269
 
270
- <h2>Reward model (summary)</h2>
271
  <div class='card'>
272
- <p>Composable rubric with anti-gaming safeguards. Every step returns a
273
- <code>reward_components</code> dictionary so training curves are
274
- interpretable. Closure rewards and SLA penalties are scaled by
275
- customer-tier multipliers:</p>
276
- {"".join(f"<span class='pill'>{tier}: x{mult}</span>" for tier, mult in TIER_MULTIPLIER.items())}
277
  </div>
278
 
279
  <h2>Metadata</h2>
@@ -284,7 +559,9 @@ def _dashboard_html() -> str:
284
 
285
  <footer>
286
  Incident Command Center v{_CONFIG.version} · Built on
287
- <a href='https://github.com/meta-pytorch/openenv'>OpenEnv</a>.
 
 
288
  </footer>
289
 
290
  <script>
 
17
 
18
  import json
19
  import logging
20
+ from pathlib import Path
21
  from typing import Any, Dict
22
 
23
  import uvicorn
24
  from fastapi.responses import HTMLResponse, JSONResponse, PlainTextResponse
25
+ from fastapi.staticfiles import StaticFiles
26
  from openenv.core.env_server import create_fastapi_app
27
 
28
  from models import IncidentAction, IncidentObservation
 
44
  _CONFIG = EnvConfig.from_env()
45
  configure_logging(level=_CONFIG.log_level, structured=_CONFIG.structured_logging)
46
 
47
+ # External URLs surfaced on the dashboard so judges can jump straight from
48
+ # the HF Space to the GitHub / Colab / training artifacts.
49
+ GITHUB_URL = "https://github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center"
50
+ SPACE_PAGE_URL = "https://huggingface.co/spaces/SwapnilPatil28/Multi-Agent-Incident-Command-Center"
51
+ COLAB_URL = "https://colab.research.google.com/drive/1vx9E5FrZZrHoRwXs2cvtom3DaI6kZ3LP?usp=sharing"
52
+
53
  app = create_fastapi_app(
54
  IncidentCommandCenterEnvironment,
55
  IncidentAction,
56
  IncidentObservation,
57
  )
58
 
59
+ # Serve the committed training-evidence artifacts (reward_curve.png,
60
+ # training_curve.png, reward_components.png, summary_metrics.json, ...)
61
+ # so the dashboard can embed them without depending on external hosts.
62
+ _ARTIFACTS_DIR = Path(__file__).resolve().parent.parent / "artifacts"
63
+ if _ARTIFACTS_DIR.exists():
64
+ app.mount(
65
+ "/artifacts",
66
+ StaticFiles(directory=str(_ARTIFACTS_DIR)),
67
+ name="artifacts",
68
+ )
69
+
70
+
71
+ def _load_summary_metrics() -> Dict[str, Any]:
72
+ """Best-effort load of the committed training results for the dashboard."""
73
+ path = _ARTIFACTS_DIR / "summary_metrics.json"
74
+ if not path.exists():
75
+ return {}
76
+ try:
77
+ with path.open("r", encoding="utf-8") as fh:
78
+ return json.load(fh)
79
+ except (OSError, json.JSONDecodeError):
80
+ return {}
81
+
82
 
83
  # ---------------------------------------------------------------------------
84
  # Introspection helpers
 
192
 
193
  def _dashboard_html() -> str:
194
  metadata_json = json.dumps(_metadata_payload(), indent=2)
195
+ metrics = _load_summary_metrics()
196
+ artifacts_available = _ARTIFACTS_DIR.exists() and (
197
+ _ARTIFACTS_DIR / "reward_curve.png"
198
+ ).exists()
199
+
200
+ # --- Headline training numbers (1.5B SFT vs base, hard task) -------------
201
+ base_rewards = metrics.get("base_model_rewards") or [0.0, 0.0, 0.0]
202
+ sft_rewards = metrics.get("sft_model_rewards") or [0.0, 0.0, 0.0]
203
+ improvement = metrics.get("improvement_sft_over_base") or [0.0, 0.0, 0.0]
204
+ headline_delta = improvement[2] if len(improvement) >= 3 else 0.0
205
+
206
+ def _fmt(val: Any) -> str:
207
+ try:
208
+ return f"{float(val):+.2f}"
209
+ except (TypeError, ValueError):
210
+ return "—"
211
+
212
+ training_rows = "".join(
213
+ f"<tr><td>{tier}</td><td>{_fmt(base_rewards[idx])}</td>"
214
+ f"<td>{_fmt(sft_rewards[idx])}</td>"
215
+ f"<td class='delta'>{_fmt(improvement[idx])}</td></tr>"
216
+ for idx, tier in enumerate(("easy", "medium", "hard"))
217
+ if idx < min(len(base_rewards), len(sft_rewards), len(improvement))
218
+ )
219
+
220
+ # --- Training-evidence block (plots + caption) ---------------------------
221
+ if artifacts_available:
222
+ plots_html = """
223
+ <h2>Training evidence</h2>
224
+ <p class='sub'>
225
+ Committed artifacts from the reference training run
226
+ (Qwen2.5-1.5B-Instruct, 8 episodes/task, 3 epochs).
227
+ </p>
228
+ <div class='plots'>
229
+ <figure>
230
+ <img src='/artifacts/reward_curve.png' alt='Reward curve by policy' loading='lazy' />
231
+ <figcaption>Mean episodic reward per task tier across Random / Heuristic /
232
+ Base-LLM / SFT-LLM. SFT matches the heuristic demonstrator across every tier
233
+ and outperforms the untuned base by <strong>+{hard}</strong> on hard incidents.</figcaption>
234
+ </figure>
235
+ <figure>
236
+ <img src='/artifacts/training_curve.png' alt='SFT training loss and token accuracy' loading='lazy' />
237
+ <figcaption>Supervised loss collapses from <code>~2.84 → ~0.02</code> and
238
+ next-token accuracy climbs from <code>~0.49 → ~0.99</code> in three epochs on 680 rollout examples.</figcaption>
239
+ </figure>
240
+ <figure>
241
+ <img src='/artifacts/reward_components.png' alt='Reward component decomposition' loading='lazy' />
242
+ <figcaption>Per-component reward decomposition. SFT reproduces the
243
+ heuristic's positive components (clue_bonus, mitigation_correct, closure_correct,
244
+ speed_bonus) while the base model stalls on step_cost and SLA penalties.</figcaption>
245
+ </figure>
246
+ </div>
247
+ <p class='sub' style='margin-top:0.75rem'>
248
+ Raw files:
249
+ <a href='/artifacts/summary_metrics.json'>summary_metrics.json</a>
250
+ ·
251
+ <a href='/artifacts/training_log.json'>training_log.json</a>
252
+ ·
253
+ <a href='/artifacts/reward_curve_qwen0p5b.png'>0.5B ablation plot</a>
254
+ ·
255
+ <a href='/artifacts/summary_metrics_qwen0p5b.json'>0.5B metrics</a>
256
+ </p>
257
+ """.format(hard=_fmt(headline_delta))
258
+ else:
259
+ plots_html = (
260
+ "<h2>Training evidence</h2>"
261
+ "<div class='card'><p class='sub'>Plots not bundled in this image. "
262
+ "See the <a href='" + GITHUB_URL + "/tree/main/artifacts'>GitHub artifacts folder</a>.</p></div>"
263
+ )
264
+
265
+ # --- 0.5B ablation summary ----------------------------------------------
266
+ ablation_html = """
267
+ <h2>Ablation: model scale matters for imitation learning</h2>
268
+ <div class='card'>
269
+ <p class='sub'>
270
+ Same pipeline, same data schema — only the base-model size differs. The 0.5B
271
+ model cannot absorb the expert policy; 1.5B matches it exactly.
272
+ </p>
273
+ <div class='table-wrap'>
274
+ <table>
275
+ <thead>
276
+ <tr>
277
+ <th>Model</th><th>Easy Δ</th><th>Medium Δ</th><th>Hard Δ</th>
278
+ <th>Heuristic match?</th>
279
+ </tr>
280
+ </thead>
281
+ <tbody>
282
+ <tr>
283
+ <td>Qwen2.5-0.5B-Instruct</td>
284
+ <td>+0.43</td><td>+0.14</td><td class='delta'>+0.00</td>
285
+ <td>No (stuck on step-cost)</td>
286
+ </tr>
287
+ <tr>
288
+ <td><strong>Qwen2.5-1.5B-Instruct</strong></td>
289
+ <td>-1.80</td><td>+3.13</td><td class='delta good'>+10.17</td>
290
+ <td><strong>Yes (exact match)</strong></td>
291
+ </tr>
292
+ </tbody>
293
+ </table>
294
+ </div>
295
+ </div>
296
+ """
297
+
298
+ # --- Theme-mapping block (Multi-Agent / Long-Horizon / Professional) -----
299
+ themes_html = """
300
+ <h2>Hackathon theme mapping</h2>
301
+ <div class='grid grid-3'>
302
+ <div class='card'>
303
+ <h3>Theme #1 — Multi-Agent Interactions</h3>
304
+ <p class='sub'>
305
+ Three gated specialist roles (triage, investigator, ops manager) exchange
306
+ structured handoffs. Acting out-of-role triggers a
307
+ <code>wrong_actor_penalty</code>, so collaboration is trained, not hard-coded.
308
+ </p>
309
+ </div>
310
+ <div class='card'>
311
+ <h3>Theme #2 — Long-Horizon Planning</h3>
312
+ <p class='sub'>
313
+ Episodes span up to 28 steps across stacked incidents with delayed,
314
+ sparse rewards (closure &amp; post-mortem) and per-tier budget / SLA
315
+ constraints — a proper credit-assignment stress test.
316
+ </p>
317
+ </div>
318
+ <div class='card'>
319
+ <h3>Theme #3 — World Modeling / Professional Tasks</h3>
320
+ <p class='sub'>
321
+ A realistic enterprise incident-response simulation with customer tiers,
322
+ rollbacks, escalation policies, post-mortems, and a transparent,
323
+ gaming-resistant reward rubric.
324
+ </p>
325
+ </div>
326
+ </div>
327
+ """
328
+
329
+ # --- Reward-rubric details ----------------------------------------------
330
+ reward_rubric_rows = "".join(
331
+ f"<tr><td><code>{name}</code></td><td>{value}</td></tr>"
332
+ for name, value in (
333
+ ("step_cost", f"{STEP_COST_INVESTIGATION} per investigation step"),
334
+ ("clue_reward", f"+{CLUE_REWARD} per new fact"),
335
+ ("handoff_correct", f"+{HANDOFF_CORRECT_REWARD}"),
336
+ ("mitigation_correct", f"+{MITIGATION_CORRECT_REWARD}"),
337
+ ("closure_correct_base", f"+{CLOSURE_CORRECT_BASE} × tier multiplier"),
338
+ ("closure_wrong", f"{CLOSURE_WRONG_PENALTY} × tier multiplier"),
339
+ )
340
+ )
341
+
342
  return f"""
343
  <!DOCTYPE html>
344
  <html lang='en'>
 
358
  background: radial-gradient(1000px 600px at 10% -10%, #1e293b, var(--bg));
359
  color: var(--text); padding: 2rem; margin: 0; min-height: 100vh;
360
  }}
361
+ header {{ display:flex; align-items:center; justify-content:space-between; max-width:1100px; margin:0 auto 1.5rem; flex-wrap:wrap; gap:1rem; }}
362
  .brand {{ display:flex; align-items:center; gap:0.75rem; }}
363
  .logo {{ width:44px; height:44px; border-radius:10px; background:linear-gradient(135deg,var(--primary),var(--accent)); }}
364
  h1 {{ font-size:1.6rem; margin:0; }}
365
+ h2 {{ font-size:1.2rem; margin:1.8rem 0 0.6rem; color:#cbd5e1; }}
366
  .sub {{ color: var(--muted); }}
367
+ .grid {{ display:grid; grid-template-columns: repeat(auto-fit,minmax(240px,1fr)); gap:1rem; max-width:1100px; margin:0 auto; }}
368
+ .grid-3 {{ grid-template-columns: repeat(auto-fit,minmax(280px,1fr)); }}
369
  .card {{ background: var(--card); border: 1px solid #1f2a44; padding: 1.25rem; border-radius: 14px; }}
370
  .card h3 {{ margin:0 0 0.5rem; font-size:1rem; color:#f1f5f9; }}
371
  .pill {{ display:inline-block; padding:2px 8px; margin:2px; border-radius:999px; background:#1e293b; border:1px solid #334155; color:#cbd5e1; font-size:0.78rem; }}
372
+ .pill.cta {{ background:linear-gradient(135deg,var(--primary),var(--accent)); color:#0b1225; border-color:transparent; font-weight:600; }}
373
  .container {{ max-width: 1100px; margin: 0 auto; }}
374
  code {{ background:#0b1225; border:1px solid #1f2a44; padding:2px 6px; border-radius:6px; color:#67e8f9; font-family:'JetBrains Mono', monospace; }}
375
  pre {{ background:#0b1225; border:1px solid #1f2a44; padding: 1rem; border-radius: 10px; color:#cbd5e1; overflow-x:auto; font-size:0.85rem; }}
376
  a {{ color: var(--accent); text-decoration: none; }}
377
+ a:hover {{ text-decoration: underline; }}
378
  .kpi {{ display:flex; flex-direction:column; gap:0.25rem; }}
379
  .kpi .num {{ font-size:1.6rem; font-weight:700; color:#f8fafc; }}
380
  .kpi .lbl {{ color: var(--muted); font-size:0.8rem; }}
381
+ .kpi .num.good {{ color: var(--good); }}
382
  footer {{ max-width:1100px; margin:2rem auto 0; color:var(--muted); font-size:0.85rem; }}
383
+ .plots {{ display:grid; grid-template-columns: repeat(auto-fit,minmax(300px,1fr)); gap:1rem; max-width:1100px; margin:0 auto; }}
384
+ .plots figure {{ background: var(--card); border:1px solid #1f2a44; border-radius: 14px; padding: 0.75rem; margin:0; }}
385
+ .plots img {{ width:100%; height:auto; border-radius:8px; background:#0b1225; }}
386
+ .plots figcaption {{ color: var(--muted); font-size:0.8rem; margin-top:0.5rem; line-height:1.4; }}
387
+ .table-wrap {{ overflow-x:auto; }}
388
+ table {{ width:100%; border-collapse: collapse; margin-top:0.5rem; font-size:0.9rem; }}
389
+ th, td {{ padding:0.5rem 0.75rem; text-align:left; border-bottom:1px solid #1f2a44; }}
390
+ th {{ color:#cbd5e1; font-weight:600; }}
391
+ td.delta {{ font-weight:600; color:#f8fafc; }}
392
+ td.delta.good {{ color: var(--good); }}
393
+ .links {{ display:flex; flex-wrap:wrap; gap:0.5rem; }}
394
  </style>
395
  </head>
396
  <body>
 
399
  <div class='logo'></div>
400
  <div>
401
  <h1>Incident Command Center</h1>
402
+ <div class='sub'>OpenEnv · Multi-Agent · Long-Horizon · Professional-Task Simulation</div>
403
  </div>
404
  </div>
405
+ <div class='links'>
406
+ <a class='pill cta' href='{GITHUB_URL}' target='_blank' rel='noopener'>GitHub</a>
407
+ <a class='pill cta' href='{COLAB_URL}' target='_blank' rel='noopener'>Open in Colab</a>
408
+ <a class='pill' href='{SPACE_PAGE_URL}' target='_blank' rel='noopener'>Space page</a>
409
  <span class='pill'>v{_CONFIG.version}</span>
410
  <span class='pill'>task: easy / medium / hard</span>
411
  </div>
412
  </header>
413
 
414
  <div class='container'>
415
+
416
+ <h2>Headline results</h2>
417
+ <div class='grid'>
418
+ <div class='card'>
419
+ <div class='kpi'>
420
+ <span class='lbl'>SFT reward lift on hard tasks</span>
421
+ <span class='num good'>{_fmt(headline_delta)}</span>
422
+ <span class='sub'>vs Qwen2.5-1.5B-Instruct base</span>
423
+ </div>
424
+ </div>
425
+ <div class='card'>
426
+ <div class='kpi'>
427
+ <span class='lbl'>Heuristic-policy match</span>
428
+ <span class='num'>Exact</span>
429
+ <span class='sub'>SFT clones the demonstrator across every tier</span>
430
+ </div>
431
+ </div>
432
+ <div class='card'>
433
+ <div class='kpi'>
434
+ <span class='lbl'>Scale ablation (hard Δ)</span>
435
+ <span class='num'>0.5B → 1.5B</span>
436
+ <span class='sub'>+0.00 → +10.17: capacity matters</span>
437
+ </div>
438
+ </div>
439
+ <div class='card'>
440
+ <div class='kpi'>
441
+ <span class='lbl'>Training data</span>
442
+ <span class='num'>680 rows</span>
443
+ <span class='sub'>24 heuristic rollouts · 3 epochs</span>
444
+ </div>
445
+ </div>
446
+ </div>
447
+
448
+ <h2>Environment at a glance</h2>
449
  <div class='grid'>
450
  <div class='card'>
451
  <div class='kpi'>
 
476
  </div>
477
  </div>
478
 
479
+ <h2>1.5B SFT vs base (reference run)</h2>
480
+ <div class='card'>
481
+ <div class='table-wrap'>
482
+ <table>
483
+ <thead>
484
+ <tr>
485
+ <th>Task tier</th><th>Base reward</th><th>SFT reward</th><th>Δ</th>
486
+ </tr>
487
+ </thead>
488
+ <tbody>
489
+ {training_rows}
490
+ </tbody>
491
+ </table>
492
+ </div>
493
+ <p class='sub' style='margin-top:0.75rem'>
494
+ Numbers loaded live from
495
+ <a href='/artifacts/summary_metrics.json'>summary_metrics.json</a>
496
+ committed alongside this Space.
497
+ </p>
498
+ </div>
499
+
500
+ {plots_html}
501
+
502
+ {ablation_html}
503
+
504
+ {themes_html}
505
+
506
  <h2>Endpoints</h2>
507
  <div class='card'>
508
  <p class='sub'>Standard OpenEnv contract plus operational endpoints.</p>
 
515
  <li><code>GET /env-info</code> — action space, reward model, budgets.</li>
516
  <li><code>GET /metrics</code> — Prometheus-style counters.</li>
517
  <li><code>GET /docs</code> — interactive OpenAPI documentation.</li>
518
+ <li><code>GET /artifacts/…</code> — committed training plots &amp; metrics.</li>
519
  </ul>
520
  </div>
521
 
522
  <h2>Action space</h2>
523
  <div class='card'>
524
  {"".join(f"<span class='pill'>{a}</span>" for a in ALL_ACTIONS)}
525
+ <p class='sub' style='margin-top:0.5rem'>
526
+ Each action is gated by the acting role; wrong-actor calls are penalised.
527
+ </p>
528
  </div>
529
 
530
+ <h2>Reward model</h2>
531
  <div class='card'>
532
+ <p>
533
+ Composable rubric with anti-gaming safeguards. Every step returns a
534
+ <code>reward_components</code> dictionary so training curves are
535
+ interpretable. Closure rewards and SLA penalties are scaled by
536
+ customer-tier multipliers:
537
+ </p>
538
+ <p>
539
+ {"".join(f"<span class='pill'>{tier}: ×{mult}</span>" for tier, mult in TIER_MULTIPLIER.items())}
540
+ </p>
541
+ <div class='table-wrap'>
542
+ <table>
543
+ <thead><tr><th>Component</th><th>Signal</th></tr></thead>
544
+ <tbody>{reward_rubric_rows}</tbody>
545
+ </table>
546
+ </div>
547
+ <p class='sub' style='margin-top:0.75rem'>
548
+ Full rubric (invalid-action, repeated-lookup, rollback-effective,
549
+ post-mortem-logged, etc.) is documented in the
550
+ <a href='{GITHUB_URL}#reward-model' target='_blank' rel='noopener'>README</a>.
551
+ </p>
552
  </div>
553
 
554
  <h2>Metadata</h2>
 
559
 
560
  <footer>
561
  Incident Command Center v{_CONFIG.version} · Built on
562
+ <a href='https://github.com/meta-pytorch/openenv' target='_blank' rel='noopener'>OpenEnv</a>
563
+ · <a href='{GITHUB_URL}' target='_blank' rel='noopener'>Source on GitHub</a>
564
+ · <a href='{COLAB_URL}' target='_blank' rel='noopener'>Reproduce training on Colab</a>
565
  </footer>
566
 
567
  <script>
train_trl.py CHANGED
@@ -59,6 +59,7 @@ class EpisodeStats:
59
  total_reward: float
60
  steps: int
61
  success: bool
 
62
 
63
 
64
  # ---------------------------------------------------------------------------
@@ -119,6 +120,7 @@ def rollout(
119
  coordinator = HeuristicCoordinator()
120
  records: List[Dict[str, str]] = []
121
  rewards: List[float] = []
 
122
  steps = 0
123
  step_cap = max_steps if max_steps is not None else MAX_ROLLOUT_STEPS
124
 
@@ -143,6 +145,9 @@ def rollout(
143
 
144
  result = env.step(action)
145
  rewards.append(float(result.reward or 0.0))
 
 
 
146
  finally:
147
  try:
148
  env.close()
@@ -151,11 +156,15 @@ def rollout(
151
 
152
  total_reward = sum(rewards)
153
  success = total_reward > 0.0
154
- return (
155
- EpisodeStats(policy_name, task_name, total_reward, steps, success),
156
- records,
157
- rewards,
 
 
 
158
  )
 
159
 
160
 
161
  def build_training_dataset(episodes_per_task: int = EPISODES_PER_TASK) -> Dataset:
@@ -259,7 +268,12 @@ def run_trl_sft(dataset: Dataset) -> Path:
259
  SFT_MODEL_DIR.mkdir(parents=True, exist_ok=True)
260
  trainer.save_model(str(SFT_MODEL_DIR))
261
  tokenizer.save_pretrained(str(SFT_MODEL_DIR))
 
 
 
 
262
  print(f"[train] Saved SFT checkpoint to {SFT_MODEL_DIR}")
 
263
 
264
  del trainer, model, tokenizer
265
  _free_gpu_memory()
@@ -306,6 +320,7 @@ def _evaluate_single_policy(
306
  policy_name: str,
307
  select_fn: Callable[[IncidentObservation], IncidentAction],
308
  max_steps: Optional[int] = None,
 
309
  ) -> List[float]:
310
  scores: List[float] = []
311
  for task in ["easy", "medium", "hard"]:
@@ -320,17 +335,20 @@ def _evaluate_single_policy(
320
  f"reward={stats.total_reward:+.2f} steps={stats.steps}"
321
  )
322
  scores.append(round(stats.total_reward, 4))
 
 
 
323
  return scores
324
 
325
 
326
  def evaluate_policies(
327
  seed: int = 7,
328
  evaluate_llms: Optional[bool] = None,
329
- ) -> Dict[str, List[float]]:
330
  """Run each policy once per task under the same seed.
331
 
332
- The random policy is seeded for reproducibility. The heuristic policy is
333
- deterministic already. LLM policies are evaluated with greedy decoding.
334
  """
335
  random.seed(seed)
336
 
@@ -340,30 +358,43 @@ def evaluate_policies(
340
  "base_model": [],
341
  "sft_model": [],
342
  }
 
 
 
 
 
 
343
 
344
  for task in ["easy", "medium", "hard"]:
345
  random_stats, _, _ = rollout("random", task)
346
  heuristic_stats, _, _ = rollout("heuristic", task)
347
  scores["random"].append(round(random_stats.total_reward, 4))
348
  scores["heuristic"].append(round(heuristic_stats.total_reward, 4))
 
 
 
 
349
 
350
  should_eval_llms = _should_evaluate_llms() if evaluate_llms is None else evaluate_llms
351
  if not should_eval_llms:
352
  print("[eval] Skipping LLM evaluation (no GPU or EVAL_LLM_MODELS=false).")
353
- return scores
354
 
355
  try:
356
  from llm_policy import LLMPolicy
357
  except Exception as exc: # pragma: no cover - import-time safety
358
  print(f"[eval] Could not import LLMPolicy ({exc}); skipping LLM eval.")
359
- return scores
360
 
361
  # Base model
362
  try:
363
  print(f"[eval] Loading BASE model: {BASE_MODEL}")
364
  base = LLMPolicy(BASE_MODEL, label="base_model")
365
  scores["base_model"] = _evaluate_single_policy(
366
- "base_model", base.select_action, max_steps=MAX_LLM_EVAL_STEPS
 
 
 
367
  )
368
  base.release()
369
  _free_gpu_memory()
@@ -376,7 +407,10 @@ def evaluate_policies(
376
  print(f"[eval] Loading SFT model: {SFT_MODEL_DIR}")
377
  sft = LLMPolicy(str(SFT_MODEL_DIR), label="sft_model")
378
  scores["sft_model"] = _evaluate_single_policy(
379
- "sft_model", sft.select_action, max_steps=MAX_LLM_EVAL_STEPS
 
 
 
380
  )
381
  sft.release()
382
  _free_gpu_memory()
@@ -385,7 +419,122 @@ def evaluate_policies(
385
  else:
386
  print(f"[eval] No SFT checkpoint found at {SFT_MODEL_DIR}; skipping SFT eval.")
387
 
388
- return scores
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
389
 
390
 
391
  def plot_rewards(score_map: Dict[str, List[float]]) -> None:
@@ -423,8 +572,13 @@ def main() -> None:
423
  dataset.save_to_disk(str(ARTIFACT_DIR / "trl_dataset"))
424
 
425
  run_trl_sft(dataset)
426
- scores = evaluate_policies()
 
 
 
427
  plot_rewards(scores)
 
 
428
 
429
  summary = {
430
  "base_model": BASE_MODEL,
@@ -442,6 +596,11 @@ def main() -> None:
442
  round(h - r, 4)
443
  for h, r in zip(scores.get("heuristic", []), scores.get("random", []))
444
  ],
 
 
 
 
 
445
  }
446
  with open(ARTIFACT_DIR / "summary_metrics.json", "w", encoding="utf-8") as f:
447
  json.dump(summary, f, indent=2)
 
59
  total_reward: float
60
  steps: int
61
  success: bool
62
+ components: Optional[Dict[str, float]] = None
63
 
64
 
65
  # ---------------------------------------------------------------------------
 
120
  coordinator = HeuristicCoordinator()
121
  records: List[Dict[str, str]] = []
122
  rewards: List[float] = []
123
+ components_sum: Dict[str, float] = {}
124
  steps = 0
125
  step_cap = max_steps if max_steps is not None else MAX_ROLLOUT_STEPS
126
 
 
145
 
146
  result = env.step(action)
147
  rewards.append(float(result.reward or 0.0))
148
+ step_components = getattr(result.observation, "reward_components", None) or {}
149
+ for key, value in step_components.items():
150
+ components_sum[key] = components_sum.get(key, 0.0) + float(value)
151
  finally:
152
  try:
153
  env.close()
 
156
 
157
  total_reward = sum(rewards)
158
  success = total_reward > 0.0
159
+ stats = EpisodeStats(
160
+ policy_name=policy_name,
161
+ task_name=task_name,
162
+ total_reward=total_reward,
163
+ steps=steps,
164
+ success=success,
165
+ components={k: round(v, 4) for k, v in components_sum.items()},
166
  )
167
+ return (stats, records, rewards)
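The per-step accumulation that `rollout` now performs can be illustrated in isolation. This is a minimal sketch, not the repository's code: `sum_components` is a hypothetical helper, and the component names and values in `episode` are made up for illustration. It assumes each step exposes either a mapping of component name to value or `None`.

```python
from typing import Dict, Iterable, Mapping, Optional


def sum_components(
    steps: Iterable[Optional[Mapping[str, float]]],
) -> Dict[str, float]:
    """Sum reward components across the steps of one episode."""
    totals: Dict[str, float] = {}
    for step_components in steps:
        # A step without a components mapping contributes nothing.
        for key, value in (step_components or {}).items():
            totals[key] = totals.get(key, 0.0) + float(value)
    # Round once at the end, mirroring the rounding applied to EpisodeStats.
    return {k: round(v, 4) for k, v in totals.items()}


# Hypothetical three-step episode; the middle step reports no components.
episode = [
    {"step_cost": -0.5, "clue_reward": 1.0},
    None,
    {"step_cost": -0.5, "closure_correct_base": 10.0},
]
totals = sum_components(episode)
```

Rounding after the sum rather than per step keeps small floating-point errors from compounding over long episodes.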
168
 
169
 
170
  def build_training_dataset(episodes_per_task: int = EPISODES_PER_TASK) -> Dataset:
 
268
  SFT_MODEL_DIR.mkdir(parents=True, exist_ok=True)
269
  trainer.save_model(str(SFT_MODEL_DIR))
270
  tokenizer.save_pretrained(str(SFT_MODEL_DIR))
271
+
272
+ log_path = ARTIFACT_DIR / "training_log.json"
273
+ with log_path.open("w", encoding="utf-8") as f:
274
+ json.dump(trainer.state.log_history, f, indent=2, default=str)
275
  print(f"[train] Saved SFT checkpoint to {SFT_MODEL_DIR}")
276
+ print(f"[train] Saved training log to {log_path}")
277
 
278
  del trainer, model, tokenizer
279
  _free_gpu_memory()
 
320
  policy_name: str,
321
  select_fn: Callable[[IncidentObservation], IncidentAction],
322
  max_steps: Optional[int] = None,
323
+ components_accumulator: Optional[Dict[str, float]] = None,
324
  ) -> List[float]:
325
  scores: List[float] = []
326
  for task in ["easy", "medium", "hard"]:
 
335
  f"reward={stats.total_reward:+.2f} steps={stats.steps}"
336
  )
337
  scores.append(round(stats.total_reward, 4))
338
+ if components_accumulator is not None and stats.components:
339
+ for k, v in stats.components.items():
340
+ components_accumulator[k] = components_accumulator.get(k, 0.0) + v
341
  return scores
342
 
343
 
344
  def evaluate_policies(
345
  seed: int = 7,
346
  evaluate_llms: Optional[bool] = None,
347
+ ) -> Dict[str, object]:
348
  """Run each policy once per task under the same seed.
349
 
350
+ Returns a dict with keys ``scores`` (mapping policy -> [easy, medium, hard])
351
+ and ``components`` (mapping policy -> {component_name: summed_value}).
352
  """
353
  random.seed(seed)
354
 
 
358
  "base_model": [],
359
  "sft_model": [],
360
  }
361
+ components: Dict[str, Dict[str, float]] = {
362
+ "random": {},
363
+ "heuristic": {},
364
+ "base_model": {},
365
+ "sft_model": {},
366
+ }
367
 
368
  for task in ["easy", "medium", "hard"]:
369
  random_stats, _, _ = rollout("random", task)
370
  heuristic_stats, _, _ = rollout("heuristic", task)
371
  scores["random"].append(round(random_stats.total_reward, 4))
372
  scores["heuristic"].append(round(heuristic_stats.total_reward, 4))
373
+ for k, v in (random_stats.components or {}).items():
374
+ components["random"][k] = components["random"].get(k, 0.0) + v
375
+ for k, v in (heuristic_stats.components or {}).items():
376
+ components["heuristic"][k] = components["heuristic"].get(k, 0.0) + v
377
 
378
  should_eval_llms = _should_evaluate_llms() if evaluate_llms is None else evaluate_llms
379
  if not should_eval_llms:
380
  print("[eval] Skipping LLM evaluation (no GPU or EVAL_LLM_MODELS=false).")
381
+ return {"scores": scores, "components": components}
382
 
383
  try:
384
  from llm_policy import LLMPolicy
385
  except Exception as exc: # pragma: no cover - import-time safety
386
  print(f"[eval] Could not import LLMPolicy ({exc}); skipping LLM eval.")
387
+ return {"scores": scores, "components": components}
388
 
389
  # Base model
390
  try:
391
  print(f"[eval] Loading BASE model: {BASE_MODEL}")
392
  base = LLMPolicy(BASE_MODEL, label="base_model")
393
  scores["base_model"] = _evaluate_single_policy(
394
+ "base_model",
395
+ base.select_action,
396
+ max_steps=MAX_LLM_EVAL_STEPS,
397
+ components_accumulator=components["base_model"],
398
  )
399
  base.release()
400
  _free_gpu_memory()
 
407
  print(f"[eval] Loading SFT model: {SFT_MODEL_DIR}")
408
  sft = LLMPolicy(str(SFT_MODEL_DIR), label="sft_model")
409
  scores["sft_model"] = _evaluate_single_policy(
410
+ "sft_model",
411
+ sft.select_action,
412
+ max_steps=MAX_LLM_EVAL_STEPS,
413
+ components_accumulator=components["sft_model"],
414
  )
415
  sft.release()
416
  _free_gpu_memory()
 
419
  else:
420
  print(f"[eval] No SFT checkpoint found at {SFT_MODEL_DIR}; skipping SFT eval.")
421
 
422
+ return {"scores": scores, "components": components}
423
+
424
+
425
+ def plot_training_curve(
426
+ log_path: Path = ARTIFACT_DIR / "training_log.json",
427
+ out_path: Path = ARTIFACT_DIR / "training_curve.png",
428
+ ) -> None:
429
+ """Plot loss (and token accuracy, if present) against training step from the TRL log.
430
+
431
+ Satisfies the hackathon minimum requirement of showing BOTH loss and reward plots.
432
+ """
433
+ if not log_path.exists():
434
+ return
435
+ try:
436
+ log = json.loads(log_path.read_text(encoding="utf-8"))
437
+ except Exception:
438
+ return
439
+
440
+ steps: List[int] = []
441
+ losses: List[float] = []
442
+ accs: List[Optional[float]] = []
443
+ for entry in log:
444
+ if "loss" not in entry or "step" not in entry:
445
+ continue
446
+ try:
447
+ steps.append(int(entry["step"]))
448
+ losses.append(float(entry["loss"]))
449
+ acc = entry.get("mean_token_accuracy")
450
+ accs.append(float(acc) if acc is not None else None)
451
+ except Exception:
452
+ continue
453
+
454
+ if not steps:
455
+ return
456
+
457
+ fig, ax1 = plt.subplots(figsize=(9, 5))
458
+ ax1.plot(steps, losses, marker="o", color="tab:blue", label="Training loss", linewidth=2)
459
+ ax1.set_xlabel("Training step")
460
+ ax1.set_ylabel("Loss", color="tab:blue")
461
+ ax1.tick_params(axis="y", labelcolor="tab:blue")
462
+ ax1.grid(alpha=0.3)
463
+
464
+ if all(a is not None for a in accs):
465
+ ax2 = ax1.twinx()
466
+ ax2.plot(
467
+ steps,
468
+ accs,
469
+ marker="^",
470
+ color="tab:orange",
471
+ label="Mean token accuracy",
472
+ linewidth=2,
473
+ )
474
+ ax2.set_ylabel("Mean token accuracy", color="tab:orange")
475
+ ax2.tick_params(axis="y", labelcolor="tab:orange")
476
+ ax2.set_ylim(0.0, 1.05)
477
+
478
+ ax1.set_title("TRL SFT training curve — loss & token accuracy")
479
+ fig.tight_layout()
480
+ fig.savefig(out_path, dpi=160)
481
+ plt.close(fig)
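The log-filtering step of `plot_training_curve` can be exercised without matplotlib. The sketch below is an assumption-laden illustration: `extract_curve` is a hypothetical helper, and the sample log entries are invented to resemble the shape of a TRL `log_history` list. It parses every field of an entry before appending anything, so a malformed entry cannot leave the three lists at different lengths.

```python
from typing import Any, Dict, List, Optional, Tuple


def extract_curve(
    log: List[Dict[str, Any]],
) -> Tuple[List[int], List[float], List[Optional[float]]]:
    """Pull (step, loss, accuracy) series out of a list of log entries."""
    steps: List[int] = []
    losses: List[float] = []
    accs: List[Optional[float]] = []
    for entry in log:
        # Summary rows (for example the final train_runtime entry) lack "loss".
        if "loss" not in entry or "step" not in entry:
            continue
        try:
            step = int(entry["step"])
            loss = float(entry["loss"])
            raw_acc = entry.get("mean_token_accuracy")
            acc = float(raw_acc) if raw_acc is not None else None
        except (TypeError, ValueError):
            continue  # Skip the whole entry; nothing has been appended yet.
        steps.append(step)
        losses.append(loss)
        accs.append(acc)
    return steps, losses, accs


# Invented sample log: two valid rows, one summary row, one malformed row.
sample_log = [
    {"step": 10, "loss": 1.25, "mean_token_accuracy": 0.71},
    {"step": 20, "loss": "not-a-number"},
    {"step": 30, "loss": 0.80},
    {"train_runtime": 123.4, "step": 30},
]
steps, losses, accs = extract_curve(sample_log)
```

Because the second valid row has no accuracy value, a caller following the `all(a is not None for a in accs)` check would draw only the loss curve for this sample.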
482
+
483
+
484
+ def plot_reward_components(
485
+ components_by_policy: Dict[str, Dict[str, float]],
486
+ out_path: Path = ARTIFACT_DIR / "reward_components.png",
487
+ ) -> None:
488
+ """Grouped bar chart of reward-component contributions per policy.
489
+
490
+ Visualizes the rubric-based reward signal: where each policy's reward
491
+ actually comes from (step cost, clue bonus, handoff, mitigation, closure,
492
+ etc.). Makes the reward design visible to judges at a glance.
493
+ """
494
+ if not components_by_policy:
495
+ return
496
+
497
+ all_keys: List[str] = []
498
+ for comps in components_by_policy.values():
499
+ for k in comps:
500
+ if k not in all_keys:
501
+ all_keys.append(k)
502
+ if not all_keys:
503
+ return
504
+
505
+ policies = list(components_by_policy.keys())
506
+ n_policies = len(policies)
507
+ n_keys = len(all_keys)
508
+
509
+ fig, ax = plt.subplots(figsize=(max(10, n_keys * 0.6), 6))
510
+ bar_width = 0.8 / max(n_policies, 1)
511
+ colors = {
512
+ "random": "tab:red",
513
+ "heuristic": "tab:blue",
514
+ "base_model": "tab:orange",
515
+ "sft_model": "tab:green",
516
+ }
517
+ for i, policy in enumerate(policies):
518
+ values = [components_by_policy[policy].get(k, 0.0) for k in all_keys]
519
+ offsets = [x + i * bar_width - 0.4 + bar_width / 2 for x in range(n_keys)]
520
+ ax.bar(
521
+ offsets,
522
+ values,
523
+ width=bar_width,
524
+ label=policy,
525
+ color=colors.get(policy, None),
526
+ )
527
+
528
+ ax.axhline(0, color="gray", linewidth=0.8)
529
+ ax.set_xticks(range(n_keys))
530
+ ax.set_xticklabels(all_keys, rotation=35, ha="right")
531
+ ax.set_ylabel("Summed reward contribution (all tasks)")
532
+ ax.set_title("Where each policy earns / loses reward — rubric components")
533
+ ax.legend()
534
+ ax.grid(axis="y", alpha=0.3)
535
+ plt.tight_layout()
536
+ plt.savefig(out_path, dpi=160)
537
+ plt.close()
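The bar-offset arithmetic in `plot_reward_components` places each policy's bar so that the whole group is centred on its tick and spans 0.8 of the tick spacing. A small sketch of that calculation, with `grouped_offsets` as a hypothetical helper name and the counts chosen only for illustration:

```python
from typing import List


def grouped_offsets(n_keys: int, n_policies: int, policy_index: int) -> List[float]:
    """X positions for one policy's bars in a grouped bar chart."""
    # Each group occupies 0.8 of the unit spacing between ticks.
    bar_width = 0.8 / max(n_policies, 1)
    # Shift left by half the group width, then step right one bar per policy,
    # adding half a bar so the position marks the bar's centre.
    return [
        x + policy_index * bar_width - 0.4 + bar_width / 2
        for x in range(n_keys)
    ]


n_keys, n_policies = 3, 4
per_policy = [grouped_offsets(n_keys, n_policies, i) for i in range(n_policies)]

# Averaging the bar centres within each group should recover the tick position.
centers = [
    sum(policy[x] for policy in per_policy) / n_policies
    for x in range(n_keys)
]
```

With four policies the bar width is 0.2, so the bars around tick 0 sit at roughly -0.3, -0.1, 0.1, and 0.3, leaving a 0.2 gap between neighbouring groups.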
538
 
539
 
540
  def plot_rewards(score_map: Dict[str, List[float]]) -> None:
 
572
  dataset.save_to_disk(str(ARTIFACT_DIR / "trl_dataset"))
573
 
574
  run_trl_sft(dataset)
575
+ eval_out = evaluate_policies()
576
+ scores: Dict[str, List[float]] = eval_out["scores"] # type: ignore[assignment]
577
+ components: Dict[str, Dict[str, float]] = eval_out["components"] # type: ignore[assignment]
578
+
579
  plot_rewards(scores)
580
+ plot_training_curve()
581
+ plot_reward_components(components)
582
 
583
  summary = {
584
  "base_model": BASE_MODEL,
 
596
  round(h - r, 4)
597
  for h, r in zip(scores.get("heuristic", []), scores.get("random", []))
598
  ],
599
+ "reward_components_by_policy": {
600
+ policy: {k: round(v, 4) for k, v in comps.items()}
601
+ for policy, comps in components.items()
602
+ if comps
603
+ },
604
  }
605
  with open(ARTIFACT_DIR / "summary_metrics.json", "w", encoding="utf-8") as f:
606
  json.dump(summary, f, indent=2)