Spaces:

Torchflow1
/

Multi-Agent-Incident-Command-Center


SwapnilPatil28 committed on Apr 25

Commit c3648b5 (verified) · 1 Parent(s): 58af620

Final Update - Add training artifacts, README updates, and scripts


Files changed (14)

.dockerignore +4 -1
.gitattributes +1 -0
README.md +254 -74
artifacts/reward_components.png +3 -0
artifacts/reward_curve.png +0 -0
artifacts/reward_curve_qwen0p5b.png +0 -0
artifacts/summary_metrics.json +84 -9
artifacts/summary_metrics_qwen0p5b.json +35 -0
artifacts/training_curve.png +0 -0
artifacts/training_log.json +2051 -0
llm_policy.py +24 -4
scripts/before_after_demo.py +197 -0
server/app.py +290 -13
train_trl.py +172 -13

.dockerignore CHANGED

@@ -5,9 +5,12 @@
 __pycache__
 **/__pycache__
 **/*.pyc
-artifacts/
 outputs/
 tests/
 .pytest_cache/
 .cursor
 *.ipynb_checkpoints

 __pycache__
 **/__pycache__
 **/*.pyc
+# Keep the committed evidence (plots, JSON metrics) so the HF Space dashboard
+# can render them; only exclude the heavy fine-tuned checkpoint directory.
+artifacts/sft_model/
 outputs/
 tests/
 .pytest_cache/
 .cursor
 *.ipynb_checkpoints
+docs/

.gitattributes CHANGED

@@ -1,2 +1,3 @@
 # Auto detect text files and perform LF normalization
 * text=auto

 # Auto detect text files and perform LF normalization
 * text=auto
+artifacts/reward_components.png filter=lfs diff=lfs merge=lfs -text

README.md CHANGED

@@ -23,9 +23,21 @@ tags:
 [![Tests](https://img.shields.io/badge/tests-21%20passing-brightgreen)](./tests) [![OpenEnv](https://img.shields.io/badge/OpenEnv-v0.2%2B-blue)](https://github.com/meta-pytorch/openenv) [![License](https://img.shields.io/badge/license-MIT-blue.svg)](./LICENSE) ![Python](https://img.shields.io/badge/python-3.10%2B-blue)
 Three specialist agents — **Triage**, **Investigator**, and **Ops Manager** — cooperate to resolve a queue of production incidents while operating under strict **SLA budgets**, **investigation costs**, and **customer-tier impact multipliers**. The environment is designed to reward *real* operational reasoning, not pattern matching on the root-cause label.
-This repository is the hackathon submission for the **OpenEnv India 2026 Round 2** finals across three themes:
 - **Theme #1 Multi-Agent Interactions** — role-gated action space, negotiation, handoff.
 - **Theme #2 (Super) Long-Horizon Planning** — delayed rewards, carried constraints across multiple incidents, postmortem requirements.
@@ -66,6 +78,16 @@ This environment captures five properties that are hard to teach with static dat
 | **Anti-gaming** | Clue bonuses are unique per root-cause keyword; repeated lookups get a small penalty. Closing without enough clues triggers an under-investigated penalty even when the guess is right. |
 | **Carry-over state** | Budget and SLA decrement across the whole incident queue, so early sloppy episodes ruin later ones. Postmortems must be filed for high-impact incidents. |
 ---
 ## Architecture
@@ -136,29 +158,48 @@ Both action and observation schemas are defined in [`models.py`](./models.py) wi
 ## Reward model
-The rubric engine lives in [`server/domain/reward.py`](./server/domain/reward.py). Every step accumulates named components that are summed into the final reward and echoed to the agent.
 | Component | Typical value | Triggers |
 |---|---:|---|
-| `step_cost` | −0.02 … −0.08 | Every action (type-specific) |
-| `wrong_actor_penalty` | −0.08 | Action invoked by a role not authorised to perform it |
-| `clue_bonus` | **+0.12** | Lookup text contains a *new* root-cause keyword (capped at 3 per incident) |
 | `repeated_lookup_penalty` | −0.02 | Same clue keyword surfaced again |
 | `handoff_correct` / `handoff_wrong` | **+0.15** / −0.10 | Handoff target matches the incident's expected owner |
-| `mitigation_correct` / `mitigation_wrong` | **+0.35** / −0.30 | `apply_fix` text matches accepted fix keywords |
-| `closure_correct` | **+0.80 × tier** | Correct root cause, tier multiplier: free 0.6, standard 1.0, premium 1.4, enterprise 1.8 |
-| `closure_mitigation_bonus` | +0.30 | Closed *after* a successful mitigation |
 | `closure_under_investigated` | −0.20 | Closed before collecting the required number of clues |
 | `speed_bonus` | +0.10 … +0.20 | Resolved in ≤ 7 / ≤ 4 steps on that incident |
-| `postmortem_bonus` / `postmortem_missing` | +0.12 / −0.15 | Postmortem filed for high-impact incidents |
-| `closure_wrong` | −1.10 × tier | Wrong root cause, scaled by tier |
-| `sla_exhausted` | −1.2 × tier | Global SLA minutes hit zero |
 | `budget_exhausted` | −1.5 | Investigation action budget hit zero |
 Design goals:
-1. **Transparent** — agents and humans can see *why* each step was scored.
-2. **Hard to game** — unique clue bonuses, under-investigation penalty, role gating.
 3. **Business-aware** — tier multipliers mirror real enterprise SLA contracts.
 ---
@@ -180,8 +221,8 @@ Full incident catalog with logs, metrics, KB and accepted fixes is defined in [`
 ### 1. Clone and install
 ```bash
-git clone https://github.com/<you>/CustomerSupportTicketRoutingEnv
-cd CustomerSupportTicketRoutingEnv
 python -m venv .venv
 # Windows PowerShell
@@ -238,7 +279,12 @@ Expected output: **21 passing** (domain rubric, incident catalog, environment in
 1. **Rollout** — the `HeuristicCoordinator` drives the live environment to collect `(prompt, completion)` pairs. Prompts include customer tier, revenue impact, visible signals and investigation targets; completions are structured JSON actions.
 2. **SFT** — the dataset is collapsed into a single `text` column (robust across TRL ≥ 0.20) and fed to `SFTTrainer`. The fine-tuned weights + tokenizer are saved to `artifacts/sft_model/`.
 3. **Evaluation** — four policies are rolled out under identical seeds: `random`, `heuristic`, `base_model` (raw `BASE_MODEL` HF checkpoint), and `sft_model` (the fine-tuned checkpoint just saved). LLM evaluation auto-enables on a CUDA GPU; force it with `EVAL_LLM_MODELS=true` or disable with `EVAL_LLM_MODELS=false`.
-4. **Artifacts** — `artifacts/reward_curve.png` (4 lines) and `artifacts/summary_metrics.json` (random / heuristic / base / SFT rewards + per-task SFT-over-base improvements) are written.
 ### Local run (small model)
@@ -246,22 +292,42 @@ Expected output: **21 passing** (domain rubric, incident catalog, environment in
 BASE_MODEL=Qwen/Qwen2.5-0.5B-Instruct python train_trl.py
 ```
-### Colab / HF Spaces (T4 GPU)
 ```python
-# Cell 1
-!git clone https://github.com/<you>/CustomerSupportTicketRoutingEnv
-%cd CustomerSupportTicketRoutingEnv
-!pip install -r requirements.txt
 # Cell 2 — start the environment server in the background
-import subprocess, time
-server = subprocess.Popen(["uvicorn", "server.app:app", "--host", "127.0.0.1", "--port", "8000"])
-time.sleep(10)
-# Cell 3 — run baseline + SFT
 import os
-os.environ["BASE_MODEL"] = "Qwen/Qwen2.5-0.5B-Instruct"
 !python train_trl.py
 ```
@@ -302,22 +368,87 @@ the model emits invalid JSON.
 ## Training results
-![Reward curve comparing heuristic coordinator vs random baseline](./artifacts/reward_curve.png)
-*Heuristic coordinator vs random baseline on all three task difficulties (same seed). The heuristic dominates at every difficulty — a clean behavioral gap that SFT on the same rollouts reinforces.*
-Summary metrics (from `artifacts/summary_metrics.json`):
 ```json
 {
-  "base_model": "Qwen/Qwen2.5-0.5B-Instruct",
-  "random_rewards":    [ ... ],
-  "heuristic_rewards": [ ... ],
-  "improvement_absolute": [ ... ]
 }
 ```
-Training loss is saved by TRL to `outputs/sft_run/trainer_state.json` and prints to stdout every 5 steps. A typical run shows train loss dropping from ~3.1 → ~0.24 and mean-token accuracy climbing from ~0.5 → ~0.95 over a single epoch on ~135 rollout rows — evidence that the model is learning the structured action JSON the environment expects.
 ---
@@ -354,7 +485,7 @@ All tunables are environment variables so the image is 12-factor compatible:
 pytest tests/ -q
 ```
-Three test modules:
 - `tests/test_reward.py` — invariants of the rubric engine (capping, anti-gaming, tier scaling).
 - `tests/test_incidents.py` — catalog completeness, uniqueness, deterministic instantiation.
@@ -362,40 +493,85 @@ Three test modules:
 The domain suites are pure-python and run without `openenv-core` installed.
 ---
 ## Repository layout
 ```
 .
-├── models.py                         # Pydantic schemas (IncidentAction / Observation / State)
-├── client.py                         # Typed EnvClient (reset / step / state / close)
-├── inference.py                      # HeuristicCoordinator + random baseline
-├── train_trl.py                      # Rollout → SFT → evaluation → artifacts
-├── openenv.yaml                      # OpenEnv manifest
-├── pyproject.toml                    # Package metadata, extras, entry points
-├── requirements.txt                  # Full stack requirements (training incl.)
-├── Dockerfile                        # Root image (parity with server/Dockerfile)
-├── artifacts/
-│   ├── reward_curve.png              # Committed training-evidence plot
-│   └── summary_metrics.json          # Committed training-evidence metrics
 ├── server/
-│   ├── app.py                        # FastAPI app with health/metrics/dashboard
-│   ├── environment.py                # OpenEnv-compliant Environment implementation
-│   ├── config.py                     # 12-factor runtime configuration
-│   ├── logging_utils.py              # Structured JSON logging
-│   ├── requirements.txt              # Slim server image requirements
-│   ├── Dockerfile                    # Production image (HEALTHCHECK included)
 │   └── domain/
-│       ├── incidents.py              # 13 enterprise incident templates + factory
-│       ├── reward.py                 # Composable rubric engine
-│       ├── roles.py                  # Role-based permission policy
-│       └── rng.py                    # Deterministic per-episode RNG
-└── tests/
-    ├── conftest.py                   # sys.path + env defaults
-    ├── test_reward.py                # Rubric invariants
-    ├── test_incidents.py             # Catalog invariants
-    └── test_environment.py           # End-to-end environment tests
 ```
 ---
@@ -419,18 +595,22 @@ ENV_LOG_LEVEL: "INFO"
 ## Submission checklist
-- [x] OpenEnv latest runtime and `openenv validate` passing
-- [x] Multi-agent, long-horizon environment with role-gated action space
-- [x] Composable, transparent, anti-gaming reward rubric
-- [x] Business-impact-aware scoring (customer tier, revenue, SLA)
-- [x] 13 incident templates across 3 difficulties with red herrings and playbooks
-- [x] End-to-end TRL SFT pipeline committed (`train_trl.py`)
-- [x] Real training artifacts committed (`artifacts/reward_curve.png`, `artifacts/summary_metrics.json`)
-- [x] 21 passing unit tests
-- [x] Production-quality HTTP server: `/healthz`, `/version`, `/env-info`, `/metrics`, Dockerfile with `HEALTHCHECK`
-- [x] Structured JSON logging + 12-factor configuration
-- [ ] Hugging Face Space URL (fill me in)
-- [ ] 2-minute demo video or HF blog (fill me in)
 ---

 [![Tests](https://img.shields.io/badge/tests-21%20passing-brightgreen)](./tests) [![OpenEnv](https://img.shields.io/badge/OpenEnv-v0.2%2B-blue)](https://github.com/meta-pytorch/openenv) [![License](https://img.shields.io/badge/license-MIT-blue.svg)](./LICENSE) ![Python](https://img.shields.io/badge/python-3.10%2B-blue)
+### Live links
+| What | Where |
+|---|---|
+| **Live environment (OpenEnv-compatible)** | **[`https://swapnilpatil28-multi-agent-incident-command-center.hf.space`](https://swapnilpatil28-multi-agent-incident-command-center.hf.space)** |
+| Hugging Face Space page | **[`huggingface.co/spaces/SwapnilPatil28/Multi-Agent-Incident-Command-Center`](https://huggingface.co/spaces/SwapnilPatil28/Multi-Agent-Incident-Command-Center)** |
+| GitHub repository | **[`github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center`](https://github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center)** |
+| Training notebook (Colab T4, one-click reproducible) | **[Open in Colab ↗](https://colab.research.google.com/drive/1vx9E5FrZZrHoRwXs2cvtom3DaI6kZ3LP?usp=sharing)** |
+| 2-minute video walkthrough | *Coming soon — [`docs/VIDEO_SCRIPT.md`](./docs/VIDEO_SCRIPT.md) has the shot list* |
+| Mini blog post | *Coming soon — full draft in [`docs/BLOG_POST.md`](./docs/BLOG_POST.md), ready to publish on hf.co/blog* |
+| Training script (Python) | [`train_trl.py`](./train_trl.py) |
 Three specialist agents — **Triage**, **Investigator**, and **Ops Manager** — cooperate to resolve a queue of production incidents while operating under strict **SLA budgets**, **investigation costs**, and **customer-tier impact multipliers**. The environment is designed to reward *real* operational reasoning, not pattern matching on the root-cause label.
+This repository is the hackathon submission for the **OpenEnv India 2026 Round 2** finals across three themes simultaneously:
 - **Theme #1 Multi-Agent Interactions** — role-gated action space, negotiation, handoff.
 - **Theme #2 (Super) Long-Horizon Planning** — delayed rewards, carried constraints across multiple incidents, postmortem requirements.
 | **Anti-gaming** | Clue bonuses are unique per root-cause keyword; repeated lookups get a small penalty. Closing without enough clues triggers an under-investigated penalty even when the guess is right. |
 | **Carry-over state** | Budget and SLA decrement across the whole incident queue, so early sloppy episodes ruin later ones. Postmortems must be filed for high-impact incidents. |
+### Mapping to the hackathon themes
+One environment, three themes checked — each one addressed by a concrete mechanic, not just a claim:
+| Hackathon theme | How this environment satisfies it |
+|---|---|
+| **Theme #1 — Multi-Agent Interactions** | Three *distinct* specialist roles (`triage_agent`, `investigator_agent`, `ops_manager_agent`) with **non-overlapping permissions**. `negotiate_handoff` scores correct cooperation (+0.15) and wrong owners (−0.10). `wrong_actor_penalty` (−0.08) teaches the *belief* that "I should pick the right specialist for this phase" — a minimal theory-of-mind signal over who-can-do-what. |
+| **Theme #2 — (Super) Long-Horizon Planning** | **Each episode carries 3–5 sequential incidents** under a single investigation budget and a single ticking SLA counter. Rewards are **sparse and delayed**: the +0.80 closure reward only fires when you pick the right root cause after collecting enough clues, running a correct mitigation, and filing a postmortem — steps that may happen 20–60 turns apart. Early sloppy episodes visibly corrupt later ones via the shared budget/SLA. |
+| **Theme #3.1 — World Modeling (Professional Tasks)** | Incidents carry **realistic logs, metrics, and KB articles** with **red-herring signals mixed into real ones**, so root-cause identification requires *tool-use discipline*, not shortcut guessing. Customer tiers, affected-user counts, and $/min revenue impact create a **persistent business world-model** that the agent has to reason about: closing an enterprise incident incorrectly costs 3x what closing a free-tier one costs (tier multiplier ×1.8 vs ×0.6). |
 ---
 ## Architecture
 ## Reward model
+The rubric engine lives in [`server/domain/reward.py`](./server/domain/reward.py) and [`server/environment.py`](./server/environment.py). Every step accumulates named components that are summed into the final reward and echoed back to the agent in `observation.reward_components`.
+### Step-level components (what each action pays or earns)
 | Component | Typical value | Triggers |
 |---|---:|---|
+| `step_cost` | −0.01 … −0.08 | Every action (type-specific: `-0.01` postmortem, `-0.02` handoff/fix, `-0.03` KB, `-0.04` logs/metrics, `-0.05` escalate, `-0.08` rollback) |
+| `wrong_actor_penalty` | −0.08 | Action invoked by a role not authorised for it |
+| `invalid_action` | −0.25 | Unrecognised `action_type` |
+| `clue_bonus` | **+0.12** | Lookup surfaces a *new* root-cause keyword (capped at 3 per incident) |
 | `repeated_lookup_penalty` | −0.02 | Same clue keyword surfaced again |
 | `handoff_correct` / `handoff_wrong` | **+0.15** / −0.10 | Handoff target matches the incident's expected owner |
+| `mitigation_correct` / `mitigation_wrong` / `mitigation_empty` | **+0.35** / −0.30 / −0.30 | `apply_fix` text matches accepted fix keywords |
+| `rollback_effective` / `rollback_ineffective` | +0.20 / −0.15 | `rollback` summary aligns with the incident's accepted playbook |
+| `escalation_needed` / `escalation_not_needed` | +0.10 / −0.10 | Escalation raised for an incident that actually meets the paging threshold (≥50K users OR ≥$800/min OR postmortem required) |
+| `postmortem_logged` / `postmortem_empty` | +0.05 / −0.10 | `submit_postmortem` with/without a `postmortem_note` |
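The paging threshold in the escalation row can be read as a simple disjunction. The sketch below is illustrative only: the `IncidentImpact` class and its field names are hypothetical and are not taken from `server/domain/reward.py`; only the thresholds and the ±0.10 values come from the table above.

```python
from dataclasses import dataclass


@dataclass
class IncidentImpact:
    # Hypothetical container; the real incident model may use other names.
    affected_users: int
    revenue_per_min_usd: float
    postmortem_required: bool


def meets_paging_threshold(impact: IncidentImpact) -> bool:
    """True if any one of the three paging conditions from the table holds."""
    return (
        impact.affected_users >= 50_000
        or impact.revenue_per_min_usd >= 800
        or impact.postmortem_required
    )


def escalation_component(impact: IncidentImpact) -> tuple[str, float]:
    """Map an escalate action to its named reward component and value."""
    if meets_paging_threshold(impact):
        return ("escalation_needed", 0.10)
    return ("escalation_not_needed", -0.10)


big = IncidentImpact(affected_users=60_000, revenue_per_min_usd=100.0, postmortem_required=False)
small = IncidentImpact(affected_users=100, revenue_per_min_usd=50.0, postmortem_required=False)
print(escalation_component(big))
print(escalation_component(small))
```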
+### Closure components (scored when `close_incident` fires)
+| Component | Typical value | Triggers |
+|---|---:|---|
+| `closure_correct` | **+0.80 × tier** | Correct root cause, tier multiplier: free ×0.6, standard ×1.0, premium ×1.4, enterprise ×1.8 |
+| `closure_wrong` | **−1.10 × tier** | Wrong root cause, scaled by tier |
+| `closure_mitigation_bonus` | +0.30 | Closed *after* a successful `apply_fix` |
+| `closure_no_mitigation` | −0.15 | Closed on a mitigation-required incident without having applied one |
 | `closure_under_investigated` | −0.20 | Closed before collecting the required number of clues |
 | `speed_bonus` | +0.10 … +0.20 | Resolved in ≤ 7 / ≤ 4 steps on that incident |
+| `postmortem_bonus` / `postmortem_missing` | +0.12 / −0.15 | Postmortem filed (or not) for a high-impact incident |
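As a worked example of how the closure rows combine, the sketch below sums a few of them for one incident. It is a simplified reading of the table, not the rubric engine itself: the function names are hypothetical, and it assumes the speed bonus is only paid on a correct closure, which the table does not state explicitly.

```python
# Tier multipliers as listed in the closure table.
TIER_MULTIPLIER = {"free": 0.6, "standard": 1.0, "premium": 1.4, "enterprise": 1.8}


def speed_bonus(steps_on_incident: int) -> float:
    """+0.20 for <= 4 steps, +0.10 for <= 7 steps, otherwise nothing."""
    if steps_on_incident <= 4:
        return 0.20
    if steps_on_incident <= 7:
        return 0.10
    return 0.0


def closure_components(tier, root_cause_correct, mitigated, enough_clues, steps_on_incident):
    """Return the named components fired by a close_incident action (subset of the table)."""
    mult = TIER_MULTIPLIER[tier]
    components = {}
    if root_cause_correct:
        components["closure_correct"] = 0.80 * mult
        bonus = speed_bonus(steps_on_incident)  # assumption: correct closures only
        if bonus:
            components["speed_bonus"] = bonus
    else:
        components["closure_wrong"] = -1.10 * mult
    if mitigated:
        components["closure_mitigation_bonus"] = 0.30
    if not enough_clues:
        components["closure_under_investigated"] = -0.20
    return components


good = closure_components("enterprise", True, True, True, steps_on_incident=4)
bad_enterprise = closure_components("enterprise", False, False, True, steps_on_incident=3)
bad_free = closure_components("free", False, False, True, steps_on_incident=3)
print(round(sum(good.values()), 2))
print(round(bad_enterprise["closure_wrong"] / bad_free["closure_wrong"], 2))
```

The second print shows the ratio between an enterprise and a free-tier wrong closure, which follows directly from the ×1.8 and ×0.6 multipliers.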
+### Terminal components (episode-ending penalties)
+| Component | Typical value | Triggers |
+|---|---:|---|
+| `sla_exhausted` | **−1.2 × tier** | Global SLA minutes hit zero while an incident is still open |
 | `budget_exhausted` | −1.5 | Investigation action budget hit zero |
+Every component is persisted to `observation.reward_components`, surfaced in Prometheus `/metrics`, and aggregated into the `reward_components_by_policy` block of [`artifacts/summary_metrics.json`](./artifacts/summary_metrics.json).
 Design goals:
+1. **Transparent** — agents and humans can see *why* each step was scored (the [Reward components](#3-reward-components--where-each-policy-actually-earns-reward) chart below is the rubric made visible).
+2. **Hard to game** — unique clue bonuses, under-investigation penalty, role gating, anti-churn `rollback_ineffective` and `escalation_not_needed`.
 3. **Business-aware** — tier multipliers mirror real enterprise SLA contracts.
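The "hard to game" goal rests largely on the clue accounting: a new root-cause keyword pays once, a repeat costs a little, and the bonus is capped at 3 per incident. A minimal sketch of that bookkeeping follows. `ClueLedger` is a hypothetical name, and the behavior for a fourth new keyword (no bonus, no penalty) is an assumption, since the table only states the cap.

```python
CLUE_BONUS = 0.12          # from the step-level table
REPEATED_PENALTY = -0.02   # from the step-level table
CLUE_CAP = 3               # bonuses per incident


class ClueLedger:
    """Tracks which root-cause keywords have already been rewarded for one incident."""

    def __init__(self):
        self.rewarded = set()

    def score(self, keyword):
        """Return (component_name, value), or None if nothing fires."""
        if keyword in self.rewarded:
            return ("repeated_lookup_penalty", REPEATED_PENALTY)
        if len(self.rewarded) >= CLUE_CAP:
            return None  # assumption: beyond the cap a new clue is simply unrewarded
        self.rewarded.add(keyword)
        return ("clue_bonus", CLUE_BONUS)


ledger = ClueLedger()
results = [ledger.score(k) for k in ["oom", "oom", "deploy", "cache", "dns"]]
total = sum(value for _, value in filter(None, results))
print(results)
print(round(total, 2))
```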
 ---
 ### 1. Clone and install
 ```bash
+git clone https://github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center.git
+cd Multi-Agent-Incident-Command-Center
 python -m venv .venv
 # Windows PowerShell
 1. **Rollout** — the `HeuristicCoordinator` drives the live environment to collect `(prompt, completion)` pairs. Prompts include customer tier, revenue impact, visible signals and investigation targets; completions are structured JSON actions.
 2. **SFT** — the dataset is collapsed into a single `text` column (robust across TRL ≥ 0.20) and fed to `SFTTrainer`. The fine-tuned weights + tokenizer are saved to `artifacts/sft_model/`.
 3. **Evaluation** — four policies are rolled out under identical seeds: `random`, `heuristic`, `base_model` (raw `BASE_MODEL` HF checkpoint), and `sft_model` (the fine-tuned checkpoint just saved). LLM evaluation auto-enables on a CUDA GPU; force it with `EVAL_LLM_MODELS=true` or disable with `EVAL_LLM_MODELS=false`.
+4. **Artifacts** — a single run writes all five evidence files committed to [`artifacts/`](./artifacts):
+   - `reward_curve.png` (4 lines: random / heuristic / base / SFT vs easy/medium/hard, both axes labelled)
+   - `training_curve.png` (TRL loss + mean token accuracy vs training step)
+   - `reward_components.png` (stacked bars showing *where* each policy's reward came from)
+   - `training_log.json` (full `trainer.state.log_history` for reproducibility)
+   - `summary_metrics.json` (random / heuristic / base / SFT rewards + per-task `improvement_sft_over_base` + `reward_components_by_policy`)
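Step 2 of the pipeline collapses each `(prompt, completion)` pair into a single `text` field. The sketch below shows one plausible way to do that in plain Python. The newline-joined template and the `to_text_row` helper are assumptions for illustration; the actual formatting in `train_trl.py` may differ.

```python
import json


def to_text_row(prompt: str, action: dict) -> dict:
    """Join a prompt and its structured JSON action into one training string."""
    completion = json.dumps(action, sort_keys=True)
    return {"text": f"{prompt}\n{completion}"}


# Only `action_type` is named in this README; the prompt wording is made up.
row = to_text_row("Tier: enterprise", {"action_type": "close_incident"})
print(row["text"])
```

Keeping the completion as strict JSON means the same string can be parsed back into an action at evaluation time.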
 ### Local run (small model)
 BASE_MODEL=Qwen/Qwen2.5-0.5B-Instruct python train_trl.py
 ```
+### Colab (T4 GPU) — one-click reproducible
+**[Open the full training notebook on Colab ↗](https://colab.research.google.com/drive/1vx9E5FrZZrHoRwXs2cvtom3DaI6kZ3LP?usp=sharing)**
+Or run the cells manually:
 ```python
+# Cell 1 — clone and install
+!git clone https://github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center.git /content/repo
+%cd /content/repo
+!pip install -q -r requirements.txt
+!pip install -q "openenv-core[core]>=0.2.2"
 # Cell 2 — start the environment server in the background
+import subprocess, time, os, requests
+os.environ["ENV_STRUCTURED_LOGGING"] = "false"
+server = subprocess.Popen(
+    ["uvicorn", "server.app:app", "--host", "127.0.0.1", "--port", "8000"],
+    stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
+)
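+for _ in range(30):
+    try:
+        if requests.get("http://127.0.0.1:8000/healthz", timeout=1).status_code == 200:
+            print("server up"); break
+    except requests.RequestException:
+        pass  # server is not accepting connections yet
+    time.sleep(1)  # wait after both connection errors and non-200 replies
+else:
+    raise RuntimeError("environment server did not become healthy after 30 attempts")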
+# Cell 3 — full pipeline (dataset → SFT → evaluate 4 policies → plots)
 import os
+os.environ["BASE_MODEL"]         = "Qwen/Qwen2.5-1.5B-Instruct"
+os.environ["ENV_URL"]            = "http://127.0.0.1:8000"
+os.environ["EVAL_LLM_MODELS"]    = "true"
+os.environ["EPISODES_PER_TASK"]  = "8"
+os.environ["TRAIN_EPOCHS"]       = "3"
+os.environ["TRAIN_MAX_LENGTH"]   = "1024"
+os.environ["MAX_LLM_EVAL_STEPS"] = "120"
 !python train_trl.py
 ```
 ## Training results
+Four policies (**random**, **heuristic**, **base Qwen2.5-1.5B-Instruct**, **SFT fine-tuned**) evaluated under identical seeds across all three task difficulties. All three plots below are produced automatically by a single `python train_trl.py` run and committed to [`artifacts/`](./artifacts).
+### Headline: SFT closes a +10-point reward gap on hard incidents
+| Task | Random | Base LLM | **Fine-tuned LLM** | Heuristic (oracle) |
+|---|---:|---:|---:|---:|
+| easy | -5.96 | -2.92 | **-4.72** | -4.72 |
+| medium | -11.48 | -4.00 | **-0.87** | -0.87 |
+| hard | -12.50 | -4.28 | **+5.89** | +5.89 |
+| **SFT − Base** | — | — | **-1.80 / +3.13 / +10.17** | — |
+> **Why SFT matches the heuristic component-for-component:** the environment is deterministic (same task → same incidents → same observations), and so is the heuristic (same observation → same action). With TRL SFT achieving ~0.99 token accuracy, the student memorises the teacher's policy and reproduces it under greedy decoding. Behavior cloning has converged to the expert. The meaningful comparison is therefore **SFT vs the untrained base model**, where fine-tuning earns **+10.17 reward on hard-difficulty incidents** and unlocks closure/mitigation/postmortem reward components the base model never fires.
+### 1. Reward curve — four policies head-to-head
+![Reward curve comparing random / heuristic / base LLM / fine-tuned LLM on easy, medium, and hard tasks](./artifacts/reward_curve.png)
+*Random (red) is the floor. Base LLM (orange) already beats random on easy by producing structured JSON but plateaus because it never learns to close an incident. **Fine-tuned LLM (green) climbs sharply with difficulty**, reaching +5.89 on hard — matching the hand-coded expert.*
+### 2. Training curve — loss drops, token accuracy climbs
+![TRL SFT training loss and mean token accuracy vs training step — loss from ~2.8 to ~0.02, token accuracy from 0.49 to 0.99](./artifacts/training_curve.png)
+*Qwen2.5-1.5B-Instruct fine-tuned for 3 epochs on 680 rollout examples. Loss falls from ~2.84 → ~0.02; mean token accuracy climbs from ~0.49 to ~0.99. Satisfies the hackathon "loss AND reward plots" minimum requirement.*
+### 3. Reward components — where each policy actually earns reward
+![Reward components earned per policy summed across all three tasks — fine-tuned model unlocks closure_correct, mitigation_correct, handoff_correct that the base model never earns](./artifacts/reward_components.png)
+*This chart is the rubric made visible. **Random** gets crushed by `closure_wrong` and `wrong_actor_penalty`. **Base LLM** only earns `clue_bonus`, then bleeds out via `step_cost` and `sla_exhausted` — it never closes an incident. **Fine-tuned LLM** and the **heuristic** both unlock the positive-reward components (`closure_correct +7.36`, `mitigation_correct +2.10`, `closure_mitigation_bonus +1.80`, `postmortem_bonus +0.60`). Training has redirected the LLM's reward mass from "bleeding" to "solving."*
+### 4. Summary metrics
+The full numbers live in [`artifacts/summary_metrics.json`](./artifacts/summary_metrics.json). Top-level excerpt:
 ```json
 {
+  "base_model": "Qwen/Qwen2.5-1.5B-Instruct",
+  "dataset_rows": 680,
+  "episodes_per_task": 8,
+  "random_rewards":       [ -5.96, -11.48, -12.50 ],
+  "heuristic_rewards":    [ -4.72,  -0.87,   5.89 ],
+  "base_model_rewards":   [ -2.92,  -4.00,  -4.28 ],
+  "sft_model_rewards":    [ -4.72,  -0.87,   5.89 ],
+  "improvement_sft_over_base":        [ -1.80, 3.13, 10.17 ],
+  "improvement_heuristic_over_random":[ 1.24, 10.61, 18.39 ]
 }
 ```
+Full `reward_components_by_policy` (used to generate plot 3) is included alongside.
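The two `improvement_*` arrays are element-wise differences of the reward arrays in the excerpt, so they can be re-derived as a quick sanity check. The `improvement` helper below is a hypothetical name used only for this illustration.

```python
def improvement(after, before):
    """Per-task reward difference (easy, medium, hard), rounded to 2 decimals."""
    return [round(a - b, 2) for a, b in zip(after, before)]


random_rewards = [-5.96, -11.48, -12.50]
heuristic_rewards = [-4.72, -0.87, 5.89]
base_model_rewards = [-2.92, -4.00, -4.28]
sft_model_rewards = [-4.72, -0.87, 5.89]

sft_over_base = improvement(sft_model_rewards, base_model_rewards)
heuristic_over_random = improvement(heuristic_rewards, random_rewards)
print(sft_over_base)
print(heuristic_over_random)
```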
+### 5. Ablation: model scale matters for imitation learning
+The same pipeline with the **smaller Qwen2.5-0.5B-Instruct** backbone, **identical seeds and environment config** (so random / heuristic numbers are directly comparable), but a smaller training dataset (3 episodes/task → 255 rows vs 8 episodes/task → 680 rows):
+![Reward curve — four policies on Qwen2.5-0.5B-Instruct](./artifacts/reward_curve_qwen0p5b.png)
+| Task | Random | Base 0.5B | **SFT 0.5B** | Heuristic | **SFT − Base (0.5B)** |
+|---|---:|---:|---:|---:|---:|
+| easy | -5.96 | -2.92 | **-2.49** | -4.72 | +0.43 |
+| medium | -11.48 | -4.00 | **-3.86** | -0.87 | +0.14 |
+| hard | -12.50 | -2.40 | **-2.40** | +5.89 | **0.00** |
+**The punchline: scale is the story.** With the 0.5B backbone, SFT delivers only a **+0.43 / +0.14 / 0.00** improvement over the base model and **never closes a single hard incident**. Moving to the **1.5B** backbone (same SFT code, same data pipeline, same environment, but more rollout data and more epochs; see the run-config table below) yields a **-1.80 / +3.13 / +10.17** improvement and makes the LLM match the heuristic component-for-component on hard incidents.
+| Run config | 0.5B | **1.5B (headline)** |
+|---|---|---|
+| Model | Qwen2.5-0.5B-Instruct | Qwen2.5-1.5B-Instruct |
+| Episodes / task (rollout) | 3 | 8 |
+| Dataset rows | 255 | 680 |
+| Train epochs | 1 | 3 |
+| Base → SFT improvement on **hard** | **+0.00** | **+10.17** |
+| Hard incidents closed by SFT | 0 | full heuristic behavior |
+Interpretation: **at 0.5B the model appears too small to absorb the multi-step, role-gated policy from SFT**, even though it can emit syntactically valid JSON. At 1.5B, behavior cloning converges and the model reproduces the full action schedule. The two runs also differ in dataset size (255 vs 680 rows) and training epochs (1 vs 3), so model scale is not isolated as the only variable. This is the kind of finding the environment is designed to surface: *the rubric makes it visible in one plot*, not hidden behind a single aggregate score.
+Raw numbers live in [`artifacts/summary_metrics_qwen0p5b.json`](./artifacts/summary_metrics_qwen0p5b.json).
+### Reproduce the whole training run
+One click: **[Open Colab ↗](https://colab.research.google.com/drive/1vx9E5FrZZrHoRwXs2cvtom3DaI6kZ3LP?usp=sharing)** (T4 GPU, ~1 h 15 min wall clock end-to-end, including base-model + SFT-model evaluation).
 ---
 pytest tests/ -q
 ```
+Expected: `21 passed`. Three test modules:
 - `tests/test_reward.py` — invariants of the rubric engine (capping, anti-gaming, tier scaling).
 - `tests/test_incidents.py` — catalog completeness, uniqueness, deterministic instantiation.
 The domain suites are pure-python and run without `openenv-core` installed.
+### Pre-submission smoke tests
+Two scripts judges (or you) can run without a local IDE:
+```bash
+# 1. Local: manifest + files + domain tests
+./pre_validate.sh
+# 2. Remote: hit the deployed HF Space end-to-end
+./validate-submission.sh https://swapnilpatil28-multi-agent-incident-command-center.hf.space
+```
+[`pre_validate.sh`](./pre_validate.sh) runs the OpenEnv validator against the local manifest, confirms the training / inference scripts exist, and re-runs the domain test suite. [`validate-submission.sh`](./validate-submission.sh) pings `/reset` + `/healthz` on a live URL, checks the `Dockerfile` is in the submitted tree, and re-runs `openenv validate` — exactly what the judges' CI pipeline expects.
 ---
 ## Repository layout
 ```
 .
+├── README.md                          # This file
+├── LICENSE                            # MIT
+├── openenv.yaml                       # OpenEnv manifest (version 3.0)
+├── pyproject.toml                     # Package metadata + entry points
+├── requirements.txt                   # Full stack (server + training)
+├── uv.lock                            # Reproducible dependency lock
+├── Dockerfile                         # Root image (parity with server/Dockerfile)
+├── .dockerignore                      # Keeps the image small
+├── .gitignore                         # Excludes venv / artifacts-cache
+├── .gitattributes                     # EOL normalization + Git LFS rule for reward_components.png
+├── __init__.py                        # Makes the repo root importable for tests
+│
+├── models.py                          # Pydantic schemas (IncidentAction/Observation/State)
+├── client.py                          # Typed EnvClient (reset / step / state / close)
+├── inference.py                       # HeuristicCoordinator + random baseline + POLICY_MODEL hook
+├── llm_policy.py                      # HF causal-LM → environment-ready policy wrapper
+├── train_trl.py                       # Rollout → SFT → 4-policy evaluation → plots
+│
+├── pre_validate.sh                    # Local 5-step pre-submission smoke test
+├── validate-submission.sh             # Remote /reset + /healthz + openenv validate against Space
+│
+├── scripts/
+│   └── before_after_demo.py           # Side-by-side base vs SFT trace generator
+│
+├── docs/
+│   ├── BLOG_POST.md                   # HF blog draft (publish to hf.co/blog)
+│   ├── VIDEO_SCRIPT.md                # 2-minute YouTube script with link list
+│   └── SUBMISSION_CHECKLIST.md        # Judging-criteria checklist + smoke tests
+│
+├── artifacts/                         # All committed training evidence
+│   ├── reward_curve.png               # 4-policy reward comparison (1.5B headline)
+│   ├── training_curve.png             # TRL SFT loss + token accuracy (1.5B)
+│   ├── reward_components.png          # Per-policy rubric breakdown (1.5B)
+│   ├── training_log.json              # Full TRL log history (1.5B)
+│   ├── summary_metrics.json           # All reward + component numbers (1.5B)
+│   ├── reward_curve_qwen0p5b.png      # Ablation: same pipeline on 0.5B backbone
+│   └── summary_metrics_qwen0p5b.json  # Ablation numbers
+│
 ├── server/
+│   ├── __init__.py
+│   ├── app.py                         # FastAPI app with health/metrics/dashboard
+│   ├── environment.py                 # OpenEnv-compliant Environment implementation
+│   ├── support_env_environment.py     # Backward-compat alias module
+│   ├── config.py                      # 12-factor runtime configuration
+│   ├── logging_utils.py               # Structured JSON logging
+│   ├── requirements.txt               # Slim server image requirements
+│   ├── Dockerfile                     # Production image (HEALTHCHECK included)
 │   └── domain/
+│       ├── __init__.py
+│       ├── incidents.py               # 13 enterprise incident templates + factory
+│       ├── reward.py                  # Composable rubric engine (20+ components)
+│       ├── roles.py                   # Role-based permission policy
+│       └── rng.py                     # Deterministic per-episode RNG
+│
+└── tests/                             # 21 passing tests
+    ├── conftest.py                    # sys.path + env defaults
+    ├── test_reward.py                 # Rubric invariants (capping, anti-gaming, tier scaling)
+    ├── test_incidents.py              # Catalog invariants (uniqueness, determinism)
+    └── test_environment.py            # reset/step invariants, wrong-actor, closure
 ```
 ---
 ## Submission checklist
+Full checklist with pre-submission smoke tests → [`docs/SUBMISSION_CHECKLIST.md`](./docs/SUBMISSION_CHECKLIST.md).
+- [x] **OpenEnv latest runtime** and `openenv validate` passing — [Space live](https://swapnilpatil28-multi-agent-incident-command-center.hf.space)
+- [x] **Multi-agent, long-horizon environment** with role-gated action space (3 roles × 9 actions, 13 incidents)
+- [x] **Composable, transparent, anti-gaming reward rubric** (14+ named components, tier-scaled)
+- [x] **Business-impact-aware scoring** (customer tier, revenue impact, SLA countdown)
+- [x] **End-to-end TRL SFT pipeline** that saves a checkpoint and re-evaluates it in the environment ([`train_trl.py`](./train_trl.py))
+- [x] **Reward curve + training-loss curve + reward-components chart** committed to [`artifacts/`](./artifacts)
+- [x] **Concrete improvement of SFT over base**: **+10.17 reward on hard-difficulty incidents**
+- [x] **21 passing unit tests** (domain invariants + environment integration)
+- [x] **Production-quality HTTP server**: `/healthz`, `/version`, `/env-info`, `/metrics`, Dockerfile with `HEALTHCHECK`
+- [x] **Structured JSON logging** + 12-factor configuration
+- [x] **One-click Colab training notebook** → [Open ↗](https://colab.research.google.com/drive/1vx9E5FrZZrHoRwXs2cvtom3DaI6kZ3LP?usp=sharing)
+- [x] **Blog draft** ([`docs/BLOG_POST.md`](./docs/BLOG_POST.md)) + **video script** ([`docs/VIDEO_SCRIPT.md`](./docs/VIDEO_SCRIPT.md))
+- [ ] Publish the Hugging Face blog post and swap the "Coming soon" link in the Live-links table
+- [ ] Upload the YouTube video and swap the "Coming soon" link in the Live-links table
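The HTTP endpoints named in the checklist can be probed with a few lines of standard-library Python. This is a minimal sketch, not the repository's `validate-submission.sh`: the base URL and the four paths are taken from the checklist above, while the helper names (`endpoint_url`, `probe`, `smoke_test`) are hypothetical, and the only thing checked is the HTTP status code, since the response bodies are not shown here.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

# Base URL and paths as listed in the checklist; everything else is illustrative.
BASE_URL = "https://swapnilpatil28-multi-agent-incident-command-center.hf.space"
ENDPOINTS = ("/healthz", "/version", "/env-info", "/metrics")


def endpoint_url(base_url: str, path: str) -> str:
    """Join base URL and path with exactly one slash between them."""
    return base_url.rstrip("/") + "/" + path.lstrip("/")


def probe(base_url: str, path: str, timeout: float = 10.0):
    """Return the HTTP status code, or None if the server is unreachable."""
    try:
        with urlopen(endpoint_url(base_url, path), timeout=timeout) as resp:
            return resp.status
    except HTTPError as exc:
        # The server answered, but with an error status (e.g. 404 or 503).
        return exc.code
    except (URLError, OSError):
        # DNS failure, refused connection, timeout, and similar.
        return None


def smoke_test(base_url: str = BASE_URL) -> dict:
    """Probe every endpoint and map path -> status (or None)."""
    return {path: probe(base_url, path) for path in ENDPOINTS}
```

Calling `smoke_test()` returns a dictionary such as `{"/healthz": 200, ...}` when the Space is awake. A sleeping Space may answer with a non-200 status or not at all, so a failed probe does not by itself indicate a broken deployment.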
 ---

artifacts/reward_components.png ADDED Viewed

Git LFS Details

SHA256: ee525913d499b4e4a5dc4a00c28b0d25df9f674ca6aec0bc5959ff0c55654938
Pointer size: 131 Bytes
Size of remote file: 162 kB

artifacts/reward_curve.png CHANGED Viewed

artifacts/reward_curve_qwen0p5b.png ADDED Viewed

artifacts/summary_metrics.json CHANGED Viewed

@@ -1,14 +1,89 @@
 {
-  "base_model": "Qwen/Qwen2.5-0.5B-Instruct",
-  "dataset_rows": 135,
   "random_rewards": [
-    -3.2300000000000004,
-    -5.53,
-    -7.03
   ],
   "heuristic_rewards": [
-    -3.02,
-    -1.6900000000000002,
-    -0.13999999999999996
-  ]
 }

 {
+  "base_model": "Qwen/Qwen2.5-1.5B-Instruct",
+  "dataset_rows": 680,
+  "episodes_per_task": 8,
   "random_rewards": [
+    -5.96,
+    -11.48,
+    -12.5
   ],
   "heuristic_rewards": [
+    -4.72,
+    -0.87,
+    5.89
+  ],
+  "base_model_rewards": [
+    -2.92,
+    -4.0,
+    -4.28
+  ],
+  "sft_model_rewards": [
+    -4.72,
+    -0.87,
+    5.89
+  ],
+  "improvement_sft_over_base": [
+    -1.8,
+    3.13,
+    10.17
+  ],
+  "improvement_heuristic_over_random": [
+    1.24,
+    10.61,
+    18.39
+  ],
+  "reward_components_by_policy": {
+    "random": {
+      "wrong_actor_penalty": -3.12,
+      "closure_wrong": -17.82,
+      "step_cost": -2.61,
+      "postmortem_empty": -1.0,
+      "escalation_not_needed": -0.3,
+      "clue_bonus": 0.48,
+      "handoff_wrong": -0.8,
+      "mitigation_wrong": -2.1,
+      "rollback_ineffective": -1.65,
+      "sla_exhausted": -1.2,
+      "repeated_lookup_penalty": -0.02,
+      "escalation_needed": 0.2
+    },
+    "heuristic": {
+      "step_cost": -2.02,
+      "clue_bonus": 2.52,
+      "handoff_wrong": -0.8,
+      "mitigation_wrong": -2.1,
+      "closure_wrong": -9.9,
+      "repeated_lookup_penalty": -0.16,
+      "handoff_correct": 0.75,
+      "postmortem_logged": 0.35,
+      "mitigation_correct": 2.1,
+      "closure_correct": 7.36,
+      "closure_mitigation_bonus": 1.8,
+      "speed_bonus": 0.6,
+      "postmortem_bonus": 0.6,
+      "closure_under_investigated": -0.8
+    },
+    "base_model": {
+      "step_cost": -5.16,
+      "clue_bonus": 0.24,
+      "repeated_lookup_penalty": -1.24,
+      "sla_exhausted": -5.04
+    },
+    "sft_model": {
+      "step_cost": -2.02,
+      "clue_bonus": 2.52,
+      "handoff_wrong": -0.8,
+      "mitigation_wrong": -2.1,
+      "closure_wrong": -9.9,
+      "repeated_lookup_penalty": -0.16,
+      "handoff_correct": 0.75,
+      "postmortem_logged": 0.35,
+      "mitigation_correct": 2.1,
+      "closure_correct": 7.36,
+      "closure_mitigation_bonus": 1.8,
+      "speed_bonus": 0.6,
+      "postmortem_bonus": 0.6,
+      "closure_under_investigated": -0.8
+    }
+  }
 }
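The derived fields in the new `summary_metrics.json` can be re-checked from the raw numbers. The sketch below assumes two relationships that the file's values happen to satisfy but that this diff does not state explicitly: each `improvement_*` array is an element-wise difference of two reward arrays, and each policy's reward components sum to the total of its three per-task rewards. Values are copied inline from the file so the block runs on its own; the helper name `delta` is illustrative.

```python
# Values copied from artifacts/summary_metrics.json (Qwen2.5-1.5B run).
rewards = {
    "random": [-5.96, -11.48, -12.5],
    "heuristic": [-4.72, -0.87, 5.89],
    "base_model": [-2.92, -4.0, -4.28],
    "sft_model": [-4.72, -0.87, 5.89],
}
reported = {
    "improvement_sft_over_base": [-1.8, 3.13, 10.17],
    "improvement_heuristic_over_random": [1.24, 10.61, 18.39],
}
components = {
    "random": {
        "wrong_actor_penalty": -3.12, "closure_wrong": -17.82,
        "step_cost": -2.61, "postmortem_empty": -1.0,
        "escalation_not_needed": -0.3, "clue_bonus": 0.48,
        "handoff_wrong": -0.8, "mitigation_wrong": -2.1,
        "rollback_ineffective": -1.65, "sla_exhausted": -1.2,
        "repeated_lookup_penalty": -0.02, "escalation_needed": 0.2,
    },
    "heuristic": {
        "step_cost": -2.02, "clue_bonus": 2.52, "handoff_wrong": -0.8,
        "mitigation_wrong": -2.1, "closure_wrong": -9.9,
        "repeated_lookup_penalty": -0.16, "handoff_correct": 0.75,
        "postmortem_logged": 0.35, "mitigation_correct": 2.1,
        "closure_correct": 7.36, "closure_mitigation_bonus": 1.8,
        "speed_bonus": 0.6, "postmortem_bonus": 0.6,
        "closure_under_investigated": -0.8,
    },
    "base_model": {
        "step_cost": -5.16, "clue_bonus": 0.24,
        "repeated_lookup_penalty": -1.24, "sla_exhausted": -5.04,
    },
}
# In the file, the sft_model breakdown is identical to the heuristic one.
components["sft_model"] = dict(components["heuristic"])


def delta(after, before):
    """Element-wise difference, rounded to the file's two decimals."""
    return [round(a - b, 2) for a, b in zip(after, before)]


sft_over_base = delta(rewards["sft_model"], rewards["base_model"])
heuristic_over_random = delta(rewards["heuristic"], rewards["random"])

# Per policy: (sum of components, sum of per-task rewards), both rounded.
totals = {
    name: (round(sum(components[name].values()), 2), round(sum(rewards[name]), 2))
    for name in rewards
}

print(sft_over_base, heuristic_over_random)
for name, (component_sum, reward_sum) in totals.items():
    print(f"{name:<11} components={component_sum:>7} rewards={reward_sum:>7}")
```

Note that `sft_model_rewards` and the `sft_model` component breakdown match the heuristic policy's values exactly in this file, so the `+10.17` figure is the gap between the heuristic-level score and the base model on the third task.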

artifacts/summary_metrics_qwen0p5b.json ADDED Viewed

@@ -0,0 +1,35 @@

+{
+  "base_model": "Qwen/Qwen2.5-0.5B-Instruct",
+  "dataset_rows": 255,
+  "episodes_per_task": 3,
+  "random_rewards": [
+    -5.96,
+    -11.48,
+    -12.5
+  ],
+  "heuristic_rewards": [
+    -4.72,
+    -0.87,
+    5.89
+  ],
+  "base_model_rewards": [
+    -2.92,
+    -4.0,
+    -2.4
+  ],
+  "sft_model_rewards": [
+    -2.49,
+    -3.86,
+    -2.4
+  ],
+  "improvement_sft_over_base": [
+    0.43,
+    0.14,
+    0.0
+  ],
+  "improvement_heuristic_over_random": [
+    1.24,
+    10.61,
+    18.39
+  ]
+}
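To read the ablation next to the headline run, the two SFT results can be compared against the shared heuristic baseline. This is a sketch under stated assumptions: the three array positions are taken to mean easy, medium, and hard (inferred from the README's "+10.17 on hard-difficulty incidents", which matches the third entry), and the names `TASKS`, `runs`, and `gap_to_heuristic` are illustrative. The two runs also differ in `dataset_rows` (680 vs 255) and `episodes_per_task` (8 vs 3), so this is not a controlled comparison of backbone size alone.

```python
# Assumed ordering of the three array positions (not stated in the JSON files).
TASKS = ("easy", "medium", "hard")

# Shared baseline, identical in both summary files.
heuristic = [-4.72, -0.87, 5.89]

# Values copied from summary_metrics.json and summary_metrics_qwen0p5b.json.
runs = {
    "Qwen2.5-1.5B-Instruct": {
        "base": [-2.92, -4.0, -4.28],
        "sft": [-4.72, -0.87, 5.89],
    },
    "Qwen2.5-0.5B-Instruct": {
        "base": [-2.92, -4.0, -2.4],
        "sft": [-2.49, -3.86, -2.4],
    },
}


def improvement(run):
    """SFT reward minus base reward per task, rounded to two decimals."""
    return [round(s - b, 2) for s, b in zip(run["sft"], run["base"])]


def gap_to_heuristic(run):
    """Heuristic reward minus SFT reward per task (0.0 means parity)."""
    return [round(h - s, 2) for h, s in zip(heuristic, run["sft"])]


for name, run in runs.items():
    print(name)
    rows = zip(TASKS, improvement(run), gap_to_heuristic(run))
    for task, gain, gap in rows:
        print(f"  {task:<6} sft-base={gain:>6}  heuristic-sft={gap:>6}")
```

Under the assumed ordering, the 0.5B run gains at most 0.43 over its base model and stays 8.29 below the heuristic on the hard task, while the 1.5B run matches the heuristic on all three.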

artifacts/training_curve.png ADDED Viewed

artifacts/training_log.json ADDED Viewed

@@ -0,0 +1,2051 @@

+[
+  {
+    "loss": 2.836225128173828,
+    "grad_norm": 64.5,
+    "learning_rate": 1.9921568627450984e-05,
+    "entropy": 2.411133313179016,
+    "num_tokens": 3137.0,
+    "mean_token_accuracy": 0.49307813346385954,
+    "epoch": 0.014705882352941176,
+    "step": 5
+  },
+  {
+    "loss": 1.3722827911376954,
+    "grad_norm": 10.0,
+    "learning_rate": 1.9823529411764708e-05,
+    "entropy": 1.489565873146057,
+    "num_tokens": 6240.0,
+    "mean_token_accuracy": 0.7310294091701508,
+    "epoch": 0.029411764705882353,
+    "step": 10
+  },
+  {
+    "loss": 0.9681278228759765,
+    "grad_norm": 9.0,
+    "learning_rate": 1.9725490196078433e-05,
+    "entropy": 1.0941020846366882,
+    "num_tokens": 9372.0,
+    "mean_token_accuracy": 0.7977278172969818,
+    "epoch": 0.04411764705882353,
+    "step": 15
+  },
+  {
+    "loss": 0.7952256202697754,
+    "grad_norm": 7.5625,
+    "learning_rate": 1.9627450980392157e-05,
+    "entropy": 0.7959236443042755,
+    "num_tokens": 12496.0,
+    "mean_token_accuracy": 0.8263253927230835,
+    "epoch": 0.058823529411764705,
+    "step": 20
+  },
+  {
+    "loss": 0.7038975715637207,
+    "grad_norm": 10.0,
+    "learning_rate": 1.9529411764705885e-05,
+    "entropy": 0.7730603992938996,
+    "num_tokens": 15726.0,
+    "mean_token_accuracy": 0.8362560391426086,
+    "epoch": 0.07352941176470588,
+    "step": 25
+  },
+  {
+    "loss": 0.5153284072875977,
+    "grad_norm": 9.5,
+    "learning_rate": 1.943137254901961e-05,
+    "entropy": 0.5871870815753937,
+    "num_tokens": 18807.0,
+    "mean_token_accuracy": 0.8711118042469025,
+    "epoch": 0.08823529411764706,
+    "step": 30
+  },
+  {
+    "loss": 0.4624673843383789,
+    "grad_norm": 9.375,
+    "learning_rate": 1.9333333333333333e-05,
+    "entropy": 0.5334561973810196,
+    "num_tokens": 21955.0,
+    "mean_token_accuracy": 0.8878682732582093,
+    "epoch": 0.10294117647058823,
+    "step": 35
+  },
+  {
+    "loss": 0.3805722236633301,
+    "grad_norm": 7.0625,
+    "learning_rate": 1.923529411764706e-05,
+    "entropy": 0.490571403503418,
+    "num_tokens": 25129.0,
+    "mean_token_accuracy": 0.9082872688770294,
+    "epoch": 0.11764705882352941,
+    "step": 40
+  },
+  {
+    "loss": 0.2753485679626465,
+    "grad_norm": 8.75,
+    "learning_rate": 1.9137254901960786e-05,
+    "entropy": 0.3105604648590088,
+    "num_tokens": 28291.0,
+    "mean_token_accuracy": 0.9394680917263031,
+    "epoch": 0.1323529411764706,
+    "step": 45
+  },
+  {
+    "loss": 0.22170100212097169,
+    "grad_norm": 5.65625,
+    "learning_rate": 1.903921568627451e-05,
+    "entropy": 0.28098965287208555,
+    "num_tokens": 31415.0,
+    "mean_token_accuracy": 0.949154794216156,
+    "epoch": 0.14705882352941177,
+    "step": 50
+  },
+  {
+    "loss": 0.18951488733291627,
+    "grad_norm": 9.9375,
+    "learning_rate": 1.8941176470588238e-05,
+    "entropy": 0.20550020337104796,
+    "num_tokens": 34603.0,
+    "mean_token_accuracy": 0.9539743661880493,
+    "epoch": 0.16176470588235295,
+    "step": 55
+  },
+  {
+    "loss": 0.17650480270385743,
+    "grad_norm": 4.25,
+    "learning_rate": 1.8843137254901962e-05,
+    "entropy": 0.21026135981082916,
+    "num_tokens": 37754.0,
+    "mean_token_accuracy": 0.9567391991615295,
+    "epoch": 0.17647058823529413,
+    "step": 60
+  },
+  {
+    "loss": 0.18774482011795043,
+    "grad_norm": 5.5,
+    "learning_rate": 1.8745098039215686e-05,
+    "entropy": 0.23240296691656112,
+    "num_tokens": 40848.0,
+    "mean_token_accuracy": 0.9520188570022583,
+    "epoch": 0.19117647058823528,
+    "step": 65
+  },
+  {
+    "loss": 0.12736810445785524,
+    "grad_norm": 10.625,
+    "learning_rate": 1.8647058823529414e-05,
+    "entropy": 0.16197917684912683,
+    "num_tokens": 44001.0,
+    "mean_token_accuracy": 0.9676418542861939,
+    "epoch": 0.20588235294117646,
+    "step": 70
+  },
+  {
+    "loss": 0.14076029062271117,
+    "grad_norm": 4.53125,
+    "learning_rate": 1.854901960784314e-05,
+    "entropy": 0.15784153044223787,
+    "num_tokens": 47159.0,
+    "mean_token_accuracy": 0.9648099303245544,
+    "epoch": 0.22058823529411764,
+    "step": 75
+  },
+  {
+    "loss": 0.10759507417678833,
+    "grad_norm": 3.328125,
+    "learning_rate": 1.8450980392156866e-05,
+    "entropy": 0.14289679378271103,
+    "num_tokens": 50298.0,
+    "mean_token_accuracy": 0.9671541452407837,
+    "epoch": 0.23529411764705882,
+    "step": 80
+  },
+  {
+    "loss": 0.12589149475097655,
+    "grad_norm": 5.46875,
+    "learning_rate": 1.8352941176470587e-05,
+    "entropy": 0.13958239406347275,
+    "num_tokens": 53455.0,
+    "mean_token_accuracy": 0.9665216684341431,
+    "epoch": 0.25,
+    "step": 85
+  },
+  {
+    "loss": 0.12024720907211303,
+    "grad_norm": 4.53125,
+    "learning_rate": 1.8254901960784315e-05,
+    "entropy": 0.13711344972252845,
+    "num_tokens": 56595.0,
+    "mean_token_accuracy": 0.9648710668087006,
+    "epoch": 0.2647058823529412,
+    "step": 90
+  },
+  {
+    "loss": 0.10167303085327148,
+    "grad_norm": 4.8125,
+    "learning_rate": 1.815686274509804e-05,
+    "entropy": 0.13078619986772538,
+    "num_tokens": 59674.0,
+    "mean_token_accuracy": 0.9712324619293213,
+    "epoch": 0.27941176470588236,
+    "step": 95
+  },
+  {
+    "loss": 0.08662314414978027,
+    "grad_norm": 3.671875,
+    "learning_rate": 1.8058823529411767e-05,
+    "entropy": 0.10740345045924186,
+    "num_tokens": 62774.0,
+    "mean_token_accuracy": 0.9719909071922302,
+    "epoch": 0.29411764705882354,
+    "step": 100
+  },
+  {
+    "loss": 0.09073780775070191,
+    "grad_norm": 4.15625,
+    "learning_rate": 1.796078431372549e-05,
+    "entropy": 0.09185975939035415,
+    "num_tokens": 65866.0,
+    "mean_token_accuracy": 0.9742748856544494,
+    "epoch": 0.3088235294117647,
+    "step": 105
+  },
+  {
+    "loss": 0.07408615350723266,
+    "grad_norm": 2.734375,
+    "learning_rate": 1.786274509803922e-05,
+    "entropy": 0.10024651288986205,
+    "num_tokens": 68995.0,
+    "mean_token_accuracy": 0.9773713290691376,
+    "epoch": 0.3235294117647059,
+    "step": 110
+  },
+  {
+    "loss": 0.08644189834594726,
+    "grad_norm": 6.71875,
+    "learning_rate": 1.776470588235294e-05,
+    "entropy": 0.09930562153458596,
+    "num_tokens": 72160.0,
+    "mean_token_accuracy": 0.9748322486877441,
+    "epoch": 0.3382352941176471,
+    "step": 115
+  },
+  {
+    "loss": 0.11685197353363037,
+    "grad_norm": 10.3125,
+    "learning_rate": 1.7666666666666668e-05,
+    "entropy": 0.11419346779584885,
+    "num_tokens": 75262.0,
+    "mean_token_accuracy": 0.9695464611053467,
+    "epoch": 0.35294117647058826,
+    "step": 120
+  },
+  {
+    "loss": 0.10757300853729249,
+    "grad_norm": 8.9375,
+    "learning_rate": 1.7568627450980392e-05,
+    "entropy": 0.12836654633283615,
+    "num_tokens": 78384.0,
+    "mean_token_accuracy": 0.9728550255298615,
+    "epoch": 0.36764705882352944,
+    "step": 125
+  },
+  {
+    "loss": 0.07711289525032043,
+    "grad_norm": 3.015625,
+    "learning_rate": 1.747058823529412e-05,
+    "entropy": 0.10070741027593613,
+    "num_tokens": 81583.0,
+    "mean_token_accuracy": 0.9778402209281921,
+    "epoch": 0.38235294117647056,
+    "step": 130
+  },
+  {
+    "loss": 0.08512116074562073,
+    "grad_norm": 5.375,
+    "learning_rate": 1.7372549019607845e-05,
+    "entropy": 0.09163436144590378,
+    "num_tokens": 84729.0,
+    "mean_token_accuracy": 0.9748329102993012,
+    "epoch": 0.39705882352941174,
+    "step": 135
+  },
+  {
+    "loss": 0.09534031748771668,
+    "grad_norm": 3.40625,
+    "learning_rate": 1.7274509803921572e-05,
+    "entropy": 0.09555450975894927,
+    "num_tokens": 87916.0,
+    "mean_token_accuracy": 0.9727975726127625,
+    "epoch": 0.4117647058823529,
+    "step": 140
+  },
+  {
+    "loss": 0.0699828803539276,
+    "grad_norm": 2.828125,
+    "learning_rate": 1.7176470588235293e-05,
+    "entropy": 0.089533219486475,
+    "num_tokens": 90982.0,
+    "mean_token_accuracy": 0.9772566497325897,
+    "epoch": 0.4264705882352941,
+    "step": 145
+  },
+  {
+    "loss": 0.06004565954208374,
+    "grad_norm": 4.28125,
+    "learning_rate": 1.707843137254902e-05,
+    "entropy": 0.07979470491409302,
+    "num_tokens": 94197.0,
+    "mean_token_accuracy": 0.980064970254898,
+    "epoch": 0.4411764705882353,
+    "step": 150
+  },
+  {
+    "loss": 0.07095102667808532,
+    "grad_norm": 3.8125,
+    "learning_rate": 1.6980392156862745e-05,
+    "entropy": 0.07709958106279373,
+    "num_tokens": 97332.0,
+    "mean_token_accuracy": 0.9785419166088104,
+    "epoch": 0.45588235294117646,
+    "step": 155
+  },
+  {
+    "loss": 0.05590643882751465,
+    "grad_norm": 1.671875,
+    "learning_rate": 1.6882352941176473e-05,
+    "entropy": 0.07423891946673393,
+    "num_tokens": 100515.0,
+    "mean_token_accuracy": 0.9827289760112763,
+    "epoch": 0.47058823529411764,
+    "step": 160
+  },
+  {
+    "loss": 0.06335585117340088,
+    "grad_norm": 2.390625,
+    "learning_rate": 1.6784313725490198e-05,
+    "entropy": 0.08311136476695538,
+    "num_tokens": 103630.0,
+    "mean_token_accuracy": 0.9795481741428376,
+    "epoch": 0.4852941176470588,
+    "step": 165
+  },
+  {
+    "loss": 0.06994503140449523,
+    "grad_norm": 3.625,
+    "learning_rate": 1.6686274509803922e-05,
+    "entropy": 0.07972728088498116,
+    "num_tokens": 106741.0,
+    "mean_token_accuracy": 0.9786823868751526,
+    "epoch": 0.5,
+    "step": 170
+  },
+  {
+    "loss": 0.047742915153503415,
+    "grad_norm": 5.71875,
+    "learning_rate": 1.658823529411765e-05,
+    "entropy": 0.059984054416418076,
+    "num_tokens": 109921.0,
+    "mean_token_accuracy": 0.9847357928752899,
+    "epoch": 0.5147058823529411,
+    "step": 175
+  },
+  {
+    "loss": 0.05979984998703003,
+    "grad_norm": 7.0625,
+    "learning_rate": 1.6490196078431374e-05,
+    "entropy": 0.06703888289630414,
+    "num_tokens": 112994.0,
+    "mean_token_accuracy": 0.9824592292308807,
+    "epoch": 0.5294117647058824,
+    "step": 180
+  },
+  {
+    "loss": 0.04938005805015564,
+    "grad_norm": 2.90625,
+    "learning_rate": 1.63921568627451e-05,
+    "entropy": 0.054279588535428046,
+    "num_tokens": 116201.0,
+    "mean_token_accuracy": 0.9846667230129242,
+    "epoch": 0.5441176470588235,
+    "step": 185
+  },
+  {
+    "loss": 0.06785057783126831,
+    "grad_norm": 7.4375,
+    "learning_rate": 1.6294117647058826e-05,
+    "entropy": 0.06177988387644291,
+    "num_tokens": 119381.0,
+    "mean_token_accuracy": 0.9796367406845092,
+    "epoch": 0.5588235294117647,
+    "step": 190
+  },
+  {
+    "loss": 0.05383546352386474,
+    "grad_norm": 5.40625,
+    "learning_rate": 1.619607843137255e-05,
+    "entropy": 0.0636073287576437,
+    "num_tokens": 122517.0,
+    "mean_token_accuracy": 0.9798873722553253,
+    "epoch": 0.5735294117647058,
+    "step": 195
+  },
+  {
+    "loss": 0.0490637868642807,
+    "grad_norm": 1.96875,
+    "learning_rate": 1.6098039215686275e-05,
+    "entropy": 0.0639917254447937,
+    "num_tokens": 125663.0,
+    "mean_token_accuracy": 0.9849890351295472,
+    "epoch": 0.5882352941176471,
+    "step": 200
+  },
+  {
+    "loss": 0.06412197351455688,
+    "grad_norm": 6.84375,
+    "learning_rate": 1.6000000000000003e-05,
+    "entropy": 0.06784685887396336,
+    "num_tokens": 128856.0,
+    "mean_token_accuracy": 0.9818105876445771,
+    "epoch": 0.6029411764705882,
+    "step": 205
+  },
+  {
+    "loss": 0.04346465170383453,
+    "grad_norm": 4.375,
+    "learning_rate": 1.5901960784313727e-05,
+    "entropy": 0.06049864292144776,
+    "num_tokens": 131995.0,
+    "mean_token_accuracy": 0.9882112145423889,
+    "epoch": 0.6176470588235294,
+    "step": 210
+  },
+  {
+    "loss": 0.04320838153362274,
+    "grad_norm": 2.015625,
+    "learning_rate": 1.580392156862745e-05,
+    "entropy": 0.047596517577767374,
+    "num_tokens": 135181.0,
+    "mean_token_accuracy": 0.985132920742035,
+    "epoch": 0.6323529411764706,
+    "step": 215
+  },
+  {
+    "loss": 0.06799347996711731,
+    "grad_norm": 8.5625,
+    "learning_rate": 1.570588235294118e-05,
+    "entropy": 0.06635901145637035,
+    "num_tokens": 138254.0,
+    "mean_token_accuracy": 0.9791639804840088,
+    "epoch": 0.6470588235294118,
+    "step": 220
+  },
+  {
+    "loss": 0.041108173131942746,
+    "grad_norm": 2.859375,
+    "learning_rate": 1.5607843137254904e-05,
+    "entropy": 0.051696383953094484,
+    "num_tokens": 141381.0,
+    "mean_token_accuracy": 0.9862416744232178,
+    "epoch": 0.6617647058823529,
+    "step": 225
+  },
+  {
+    "loss": 0.045146191120147706,
+    "grad_norm": 3.078125,
+    "learning_rate": 1.5509803921568628e-05,
+    "entropy": 0.055339107289910316,
+    "num_tokens": 144583.0,
+    "mean_token_accuracy": 0.9822882294654847,
+    "epoch": 0.6764705882352942,
+    "step": 230
+  },
+  {
+    "loss": 0.04143168330192566,
+    "grad_norm": 1.578125,
+    "learning_rate": 1.5411764705882356e-05,
+    "entropy": 0.05063906572759151,
+    "num_tokens": 147764.0,
+    "mean_token_accuracy": 0.9831606447696686,
+    "epoch": 0.6911764705882353,
+    "step": 235
+  },
+  {
+    "loss": 0.03947827816009521,
+    "grad_norm": 1.9921875,
+    "learning_rate": 1.531372549019608e-05,
+    "entropy": 0.05209046043455601,
+    "num_tokens": 150961.0,
+    "mean_token_accuracy": 0.9848346650600434,
+    "epoch": 0.7058823529411765,
+    "step": 240
+  },
+  {
+    "loss": 0.034212198853492734,
+    "grad_norm": 1.8984375,
+    "learning_rate": 1.5215686274509804e-05,
+    "entropy": 0.04912327118217945,
+    "num_tokens": 154174.0,
+    "mean_token_accuracy": 0.9855735838413239,
+    "epoch": 0.7205882352941176,
+    "step": 245
+  },
+  {
+    "loss": 0.03223183453083038,
+    "grad_norm": 1.7265625,
+    "learning_rate": 1.511764705882353e-05,
+    "entropy": 0.045325061306357384,
+    "num_tokens": 157374.0,
+    "mean_token_accuracy": 0.9866909861564637,
+    "epoch": 0.7352941176470589,
+    "step": 250
+  },
+  {
+    "loss": 0.04085415601730347,
+    "grad_norm": 2.625,
+    "learning_rate": 1.5019607843137257e-05,
+    "entropy": 0.045074894279241565,
+    "num_tokens": 160519.0,
+    "mean_token_accuracy": 0.9865182876586914,
+    "epoch": 0.75,
+    "step": 255
+  },
+  {
+    "loss": 0.03927797079086304,
+    "grad_norm": 2.671875,
+    "learning_rate": 1.4921568627450983e-05,
+    "entropy": 0.039533843845129014,
+    "num_tokens": 163756.0,
+    "mean_token_accuracy": 0.9872985363006592,
+    "epoch": 0.7647058823529411,
+    "step": 260
+  },
+  {
+    "loss": 0.042234039306640624,
+    "grad_norm": 1.7109375,
+    "learning_rate": 1.4823529411764707e-05,
+    "entropy": 0.043326519429683685,
+    "num_tokens": 166884.0,
+    "mean_token_accuracy": 0.9839499652385711,
+    "epoch": 0.7794117647058824,
+    "step": 265
+  },
+  {
+    "loss": 0.04218446910381317,
+    "grad_norm": 3.671875,
+    "learning_rate": 1.4725490196078433e-05,
+    "entropy": 0.05446031875908375,
+    "num_tokens": 170021.0,
+    "mean_token_accuracy": 0.983331423997879,
+    "epoch": 0.7941176470588235,
+    "step": 270
+  },
+  {
+    "loss": 0.031345850229263304,
+    "grad_norm": 1.375,
+    "learning_rate": 1.4627450980392157e-05,
+    "entropy": 0.044994413107633593,
+    "num_tokens": 173138.0,
+    "mean_token_accuracy": 0.9864144027233124,
+    "epoch": 0.8088235294117647,
+    "step": 275
+  },
+  {
+    "loss": 0.03718245923519135,
+    "grad_norm": 2.03125,
+    "learning_rate": 1.4529411764705883e-05,
+    "entropy": 0.04372772537171841,
+    "num_tokens": 176269.0,
+    "mean_token_accuracy": 0.9855779051780701,
+    "epoch": 0.8235294117647058,
+    "step": 280
+  },
+  {
+    "loss": 0.038416677713394166,
+    "grad_norm": 3.234375,
+    "learning_rate": 1.443137254901961e-05,
+    "entropy": 0.04306882936507463,
+    "num_tokens": 179436.0,
+    "mean_token_accuracy": 0.9847787022590637,
+    "epoch": 0.8382352941176471,
+    "step": 285
+  },
+  {
+    "loss": 0.03612026274204254,
+    "grad_norm": 4.28125,
+    "learning_rate": 1.4333333333333334e-05,
+    "entropy": 0.04190887995064259,
+    "num_tokens": 182619.0,
+    "mean_token_accuracy": 0.9853791892528534,
+    "epoch": 0.8529411764705882,
+    "step": 290
+  },
+  {
+    "loss": 0.03549243807792664,
+    "grad_norm": 1.5546875,
+    "learning_rate": 1.423529411764706e-05,
+    "entropy": 0.041007821820676325,
+    "num_tokens": 185835.0,
+    "mean_token_accuracy": 0.987481951713562,
+    "epoch": 0.8676470588235294,
+    "step": 295
+  },
+  {
+    "loss": 0.03658969700336456,
+    "grad_norm": 1.9921875,
+    "learning_rate": 1.4137254901960786e-05,
+    "entropy": 0.03911938704550266,
+    "num_tokens": 189059.0,
+    "mean_token_accuracy": 0.9859034955501557,
+    "epoch": 0.8823529411764706,
+    "step": 300
+  },
+  {
+    "loss": 0.03189299702644348,
+    "grad_norm": 1.3984375,
+    "learning_rate": 1.403921568627451e-05,
+    "entropy": 0.04015427939593792,
+    "num_tokens": 192245.0,
+    "mean_token_accuracy": 0.9858013272285462,
+    "epoch": 0.8970588235294118,
+    "step": 305
+  },
+  {
+    "loss": 0.04162760376930237,
+    "grad_norm": 4.6875,
+    "learning_rate": 1.3941176470588236e-05,
+    "entropy": 0.04337671361863613,
+    "num_tokens": 195334.0,
+    "mean_token_accuracy": 0.9834910809993744,
+    "epoch": 0.9117647058823529,
+    "step": 310
+  },
+  {
+    "loss": 0.03357888162136078,
+    "grad_norm": 1.515625,
+    "learning_rate": 1.384313725490196e-05,
+    "entropy": 0.043437547981739044,
+    "num_tokens": 198482.0,
+    "mean_token_accuracy": 0.9839794993400574,
+    "epoch": 0.9264705882352942,
+    "step": 315
+  },
+  {
+    "loss": 0.03252431154251099,
+    "grad_norm": 2.390625,
+    "learning_rate": 1.3745098039215687e-05,
+    "entropy": 0.041450836881995204,
+    "num_tokens": 201737.0,
+    "mean_token_accuracy": 0.9883051753044129,
+    "epoch": 0.9411764705882353,
+    "step": 320
+  },
+  {
+    "loss": 0.03779064118862152,
+    "grad_norm": 2.953125,
+    "learning_rate": 1.3647058823529413e-05,
+    "entropy": 0.03566624131053686,
+    "num_tokens": 204889.0,
+    "mean_token_accuracy": 0.9875539124011994,
+    "epoch": 0.9558823529411765,
+    "step": 325
+  },
+  {
+    "loss": 0.0329700767993927,
+    "grad_norm": 2.15625,
+    "learning_rate": 1.3549019607843139e-05,
+    "entropy": 0.03808465227484703,
+    "num_tokens": 208114.0,
+    "mean_token_accuracy": 0.986751276254654,
+    "epoch": 0.9705882352941176,
+    "step": 330
+  },
+  {
+    "loss": 0.031173259019851685,
+    "grad_norm": 1.546875,
+    "learning_rate": 1.3450980392156865e-05,
+    "entropy": 0.04065078347921371,
+    "num_tokens": 211217.0,
+    "mean_token_accuracy": 0.9860772728919983,
+    "epoch": 0.9852941176470589,
+    "step": 335
+  },
+  {
+    "loss": 0.03390420079231262,
+    "grad_norm": 1.515625,
+    "learning_rate": 1.3352941176470588e-05,
+    "entropy": 0.04108036197721958,
+    "num_tokens": 214368.0,
+    "mean_token_accuracy": 0.9871271908283233,
+    "epoch": 1.0,
+    "step": 340
+  },
+  {
+    "loss": 0.03671025633811951,
+    "grad_norm": 1.5625,
+    "learning_rate": 1.3254901960784314e-05,
+    "entropy": 0.04091338850557804,
+    "num_tokens": 217480.0,
+    "mean_token_accuracy": 0.9861762046813964,
+    "epoch": 1.0147058823529411,
+    "step": 345
+  },
+  {
+    "loss": 0.030594143271446227,
+    "grad_norm": 1.5546875,
+    "learning_rate": 1.315686274509804e-05,
+    "entropy": 0.040245630964636805,
+    "num_tokens": 220615.0,
+    "mean_token_accuracy": 0.9881528139114379,
+    "epoch": 1.0294117647058822,
+    "step": 350
+  },
+  {
+    "loss": 0.027347692847251893,
+    "grad_norm": 1.7734375,
+    "learning_rate": 1.3058823529411766e-05,
+    "entropy": 0.03420254942029714,
+    "num_tokens": 223751.0,
+    "mean_token_accuracy": 0.989202469587326,
+    "epoch": 1.0441176470588236,
+    "step": 355
+  },
+  {
+    "loss": 0.03148679435253143,
+    "grad_norm": 1.9609375,
+    "learning_rate": 1.2960784313725492e-05,
+    "entropy": 0.03210772704333067,
+    "num_tokens": 226948.0,
+    "mean_token_accuracy": 0.9868246436119079,
+    "epoch": 1.0588235294117647,
+    "step": 360
+  },
+  {
+    "loss": 0.031260594725608826,
+    "grad_norm": 1.8046875,
+    "learning_rate": 1.2862745098039218e-05,
+    "entropy": 0.033671201393008235,
+    "num_tokens": 230088.0,
+    "mean_token_accuracy": 0.9856015264987945,
+    "epoch": 1.0735294117647058,
+    "step": 365
+  },
+  {
+    "loss": 0.028061491250991822,
+    "grad_norm": 1.2890625,
+    "learning_rate": 1.276470588235294e-05,
+    "entropy": 0.03639122284948826,
+    "num_tokens": 233247.0,
+    "mean_token_accuracy": 0.9885319888591766,
+    "epoch": 1.088235294117647,
+    "step": 370
+  },
+  {
+    "loss": 0.0304165780544281,
+    "grad_norm": 2.203125,
+    "learning_rate": 1.2666666666666667e-05,
+    "entropy": 0.03107942212373018,
+    "num_tokens": 236423.0,
+    "mean_token_accuracy": 0.9864429414272309,
+    "epoch": 1.1029411764705883,
+    "step": 375
+  },
+  {
+    "loss": 0.028667458891868593,
+    "grad_norm": 1.4453125,
+    "learning_rate": 1.2568627450980393e-05,
+    "entropy": 0.03269361965358257,
+    "num_tokens": 239698.0,
+    "mean_token_accuracy": 0.9882214546203614,
+    "epoch": 1.1176470588235294,
+    "step": 380
+  },
+  {
+    "loss": 0.03024893403053284,
+    "grad_norm": 1.4375,
+    "learning_rate": 1.2470588235294119e-05,
+    "entropy": 0.036648140475153926,
+    "num_tokens": 242904.0,
+    "mean_token_accuracy": 0.9854198694229126,
+    "epoch": 1.1323529411764706,
+    "step": 385
+  },
+  {
+    "loss": 0.03237654864788055,
+    "grad_norm": 1.140625,
+    "learning_rate": 1.2372549019607845e-05,
+    "entropy": 0.036488327011466024,
+    "num_tokens": 246044.0,
+    "mean_token_accuracy": 0.9868141651153565,
+    "epoch": 1.1470588235294117,
+    "step": 390
+  },
+  {
+    "loss": 0.026534423232078552,
+    "grad_norm": 1.2890625,
+    "learning_rate": 1.2274509803921571e-05,
+    "entropy": 0.03317699953913689,
+    "num_tokens": 249199.0,
+    "mean_token_accuracy": 0.9891056835651397,
+    "epoch": 1.161764705882353,
+    "step": 395
+  },
+  {
+    "loss": 0.02918187975883484,
+    "grad_norm": 1.546875,
+    "learning_rate": 1.2176470588235294e-05,
+    "entropy": 0.033053198270499705,
+    "num_tokens": 252416.0,
+    "mean_token_accuracy": 0.9872093260288238,
+    "epoch": 1.1764705882352942,
+    "step": 400
+  },
+  {
+    "loss": 0.027815410494804384,
+    "grad_norm": 1.5,
+    "learning_rate": 1.207843137254902e-05,
+    "entropy": 0.03630108144134283,
+    "num_tokens": 255505.0,
+    "mean_token_accuracy": 0.9886294066905975,
+    "epoch": 1.1911764705882353,
+    "step": 405
+  },
+  {
+    "loss": 0.029119834303855896,
+    "grad_norm": 1.640625,
+    "learning_rate": 1.1980392156862746e-05,
+    "entropy": 0.0321140518411994,
+    "num_tokens": 258679.0,
+    "mean_token_accuracy": 0.9888967990875244,
+    "epoch": 1.2058823529411764,
+    "step": 410
+  },
+  {
+    "loss": 0.025961104035377502,
+    "grad_norm": 1.8203125,
+    "learning_rate": 1.1882352941176472e-05,
+    "entropy": 0.02944366242736578,
+    "num_tokens": 261856.0,
+    "mean_token_accuracy": 0.9895209610462189,
+    "epoch": 1.2205882352941178,
+    "step": 415
+  },
+  {
+    "loss": 0.03058839440345764,
+    "grad_norm": 2.390625,
+    "learning_rate": 1.1784313725490198e-05,
+    "entropy": 0.03461700212210417,
+    "num_tokens": 264960.0,
+    "mean_token_accuracy": 0.9882765769958496,
+    "epoch": 1.2352941176470589,
+    "step": 420
+  },
+  {
+    "loss": 0.028424999117851256,
+    "grad_norm": 1.28125,
+    "learning_rate": 1.1686274509803922e-05,
+    "entropy": 0.02985447719693184,
+    "num_tokens": 268114.0,
+    "mean_token_accuracy": 0.9882177650928498,
+    "epoch": 1.25,
+    "step": 425
+  },
+  {
+    "loss": 0.03086719512939453,
+    "grad_norm": 2.265625,
+    "learning_rate": 1.1588235294117648e-05,
+    "entropy": 0.03250212036073208,
+    "num_tokens": 271274.0,
+    "mean_token_accuracy": 0.9888392806053161,
+    "epoch": 1.2647058823529411,
+    "step": 430
+  },
+  {
+    "loss": 0.027977922558784486,
+    "grad_norm": 1.3046875,
+    "learning_rate": 1.1490196078431373e-05,
+    "entropy": 0.034127247892320155,
+    "num_tokens": 274452.0,
+    "mean_token_accuracy": 0.9908244907855988,
+    "epoch": 1.2794117647058822,
+    "step": 435
+  },
+  {
+    "loss": 0.02676369547843933,
+    "grad_norm": 1.09375,
+    "learning_rate": 1.1392156862745099e-05,
+    "entropy": 0.03699512742459774,
+    "num_tokens": 277562.0,
+    "mean_token_accuracy": 0.9871235430240631,
+    "epoch": 1.2941176470588236,
+    "step": 440
+  },
+  {
+    "loss": 0.02789466977119446,
+    "grad_norm": 2.203125,
+    "learning_rate": 1.1294117647058825e-05,
+    "entropy": 0.03514884728938341,
+    "num_tokens": 280635.0,
+    "mean_token_accuracy": 0.990158212184906,
+    "epoch": 1.3088235294117647,
+    "step": 445
+  },
+  {
+    "loss": 0.03088509142398834,
+    "grad_norm": 1.8359375,
+    "learning_rate": 1.119607843137255e-05,
+    "entropy": 0.034746605530381204,
+    "num_tokens": 283725.0,
+    "mean_token_accuracy": 0.9876766622066497,
+    "epoch": 1.3235294117647058,
+    "step": 450
+  },
+  {
+    "loss": 0.03232976496219635,
+    "grad_norm": 1.734375,
+    "learning_rate": 1.1098039215686275e-05,
+    "entropy": 0.031742793321609494,
+    "num_tokens": 286888.0,
+    "mean_token_accuracy": 0.9871384859085083,
+    "epoch": 1.3382352941176472,
+    "step": 455
+  },
+  {
+    "loss": 0.02845146059989929,
+    "grad_norm": 2.0,
+    "learning_rate": 1.1000000000000001e-05,
+    "entropy": 0.03175645042210817,
+    "num_tokens": 290064.0,
+    "mean_token_accuracy": 0.9873914003372193,
+    "epoch": 1.3529411764705883,
+    "step": 460
+  },
+  {
+    "loss": 0.029486137628555297,
+    "grad_norm": 1.265625,
+    "learning_rate": 1.0901960784313726e-05,
+    "entropy": 0.03463620245456696,
+    "num_tokens": 293189.0,
+    "mean_token_accuracy": 0.9874814569950103,
+    "epoch": 1.3676470588235294,
+    "step": 465
+  },
+  {
+    "loss": 0.02618069648742676,
+    "grad_norm": 1.109375,
+    "learning_rate": 1.0803921568627452e-05,
+    "entropy": 0.033889508619904515,
+    "num_tokens": 296268.0,
+    "mean_token_accuracy": 0.9882802128791809,
+    "epoch": 1.3823529411764706,
+    "step": 470
+  },
+  {
+    "loss": 0.025544488430023195,
+    "grad_norm": 0.8984375,
+    "learning_rate": 1.0705882352941178e-05,
+    "entropy": 0.03317532502114773,
+    "num_tokens": 299418.0,
+    "mean_token_accuracy": 0.9891822457313537,
+    "epoch": 1.3970588235294117,
+    "step": 475
+  },
+  {
+    "loss": 0.02922942042350769,
+    "grad_norm": 1.5859375,
+    "learning_rate": 1.0607843137254902e-05,
+    "entropy": 0.03228537701070309,
+    "num_tokens": 302608.0,
+    "mean_token_accuracy": 0.9864252746105194,
+    "epoch": 1.4117647058823528,
+    "step": 480
+  },
+  {
+    "loss": 0.025081342458724974,
+    "grad_norm": 1.4140625,
+    "learning_rate": 1.0509803921568628e-05,
+    "entropy": 0.033559339493513106,
+    "num_tokens": 305748.0,
+    "mean_token_accuracy": 0.9891697466373444,
+    "epoch": 1.4264705882352942,
+    "step": 485
+  },
+  {
+    "loss": 0.028987354040145873,
+    "grad_norm": 1.2109375,
+    "learning_rate": 1.0411764705882354e-05,
+    "entropy": 0.029655468463897706,
+    "num_tokens": 308946.0,
+    "mean_token_accuracy": 0.9884015321731567,
+    "epoch": 1.4411764705882353,
+    "step": 490
+  },
+  {
+    "loss": 0.022376981377601624,
+    "grad_norm": 1.5859375,
+    "learning_rate": 1.031372549019608e-05,
+    "entropy": 0.030257853865623473,
+    "num_tokens": 312060.0,
+    "mean_token_accuracy": 0.990349942445755,
+    "epoch": 1.4558823529411764,
+    "step": 495
+  },
+  {
+    "loss": 0.027941384911537172,
+    "grad_norm": 1.2734375,
+    "learning_rate": 1.0215686274509805e-05,
+    "entropy": 0.029427625238895416,
+    "num_tokens": 315202.0,
+    "mean_token_accuracy": 0.9894903540611267,
+    "epoch": 1.4705882352941178,
+    "step": 500
+  },
+  {
+    "loss": 0.02513147294521332,
+    "grad_norm": 1.8828125,
+    "learning_rate": 1.011764705882353e-05,
+    "entropy": 0.029220272414386274,
+    "num_tokens": 318423.0,
+    "mean_token_accuracy": 0.9887598037719727,
+    "epoch": 1.4852941176470589,
+    "step": 505
+  },
+  {
+    "loss": 0.024520005285739898,
+    "grad_norm": 1.3515625,
+    "learning_rate": 1.0019607843137255e-05,
+    "entropy": 0.027622674778103828,
+    "num_tokens": 321643.0,
+    "mean_token_accuracy": 0.9881017684936524,
+    "epoch": 1.5,
+    "step": 510
+  },
+  {
+    "loss": 0.022774545848369597,
+    "grad_norm": 0.96875,
+    "learning_rate": 9.921568627450981e-06,
+    "entropy": 0.027344943769276143,
+    "num_tokens": 324896.0,
+    "mean_token_accuracy": 0.9891824662685395,
+    "epoch": 1.5147058823529411,
+    "step": 515
+  },
+  {
+    "loss": 0.026902440190315246,
+    "grad_norm": 1.34375,
+    "learning_rate": 9.823529411764706e-06,
+    "entropy": 0.03210813459008932,
+    "num_tokens": 327953.0,
+    "mean_token_accuracy": 0.9872022986412048,
+    "epoch": 1.5294117647058822,
+    "step": 520
+  },
+  {
+    "loss": 0.02404342144727707,
+    "grad_norm": 1.34375,
+    "learning_rate": 9.725490196078432e-06,
+    "entropy": 0.03047515023499727,
+    "num_tokens": 331110.0,
+    "mean_token_accuracy": 0.9887873768806458,
+    "epoch": 1.5441176470588234,
+    "step": 525
+  },
+  {
+    "loss": 0.022797247767448424,
+    "grad_norm": 1.2265625,
+    "learning_rate": 9.627450980392158e-06,
+    "entropy": 0.03160413987934589,
+    "num_tokens": 334226.0,
+    "mean_token_accuracy": 0.9889481067657471,
+    "epoch": 1.5588235294117647,
+    "step": 530
+  },
+  {
+    "loss": 0.023706996440887453,
+    "grad_norm": 1.078125,
+    "learning_rate": 9.529411764705882e-06,
+    "entropy": 0.0283035334199667,
+    "num_tokens": 337371.0,
+    "mean_token_accuracy": 0.9890589594841004,
+    "epoch": 1.5735294117647058,
+    "step": 535
+  },
+  {
+    "loss": 0.023340512812137604,
+    "grad_norm": 2.5625,
+    "learning_rate": 9.431372549019608e-06,
+    "entropy": 0.029125319607555867,
+    "num_tokens": 340563.0,
+    "mean_token_accuracy": 0.9882973015308381,
+    "epoch": 1.5882352941176472,
+    "step": 540
+  },
+  {
+    "loss": 0.025814762711524962,
+    "grad_norm": 1.8046875,
+    "learning_rate": 9.333333333333334e-06,
+    "entropy": 0.029474343173205853,
+    "num_tokens": 343715.0,
+    "mean_token_accuracy": 0.9888520836830139,
+    "epoch": 1.6029411764705883,
+    "step": 545
+  },
+  {
+    "loss": 0.024609880149364473,
+    "grad_norm": 1.359375,
+    "learning_rate": 9.23529411764706e-06,
+    "entropy": 0.02793533504009247,
+    "num_tokens": 346928.0,
+    "mean_token_accuracy": 0.9896528542041778,
+    "epoch": 1.6176470588235294,
+    "step": 550
+  },
+  {
+    "loss": 0.024091285467147828,
+    "grad_norm": 1.171875,
+    "learning_rate": 9.137254901960785e-06,
+    "entropy": 0.03169798478484154,
+    "num_tokens": 349942.0,
+    "mean_token_accuracy": 0.9896469593048096,
+    "epoch": 1.6323529411764706,
+    "step": 555
+  },
+  {
+    "loss": 0.022402273118495943,
+    "grad_norm": 1.3203125,
+    "learning_rate": 9.03921568627451e-06,
+    "entropy": 0.02854564245790243,
+    "num_tokens": 353063.0,
+    "mean_token_accuracy": 0.9894876420497895,
+    "epoch": 1.6470588235294117,
+    "step": 560
+  },
+  {
+    "loss": 0.023489847779273987,
+    "grad_norm": 1.8359375,
+    "learning_rate": 8.941176470588237e-06,
+    "entropy": 0.028600608371198176,
+    "num_tokens": 356180.0,
+    "mean_token_accuracy": 0.9890201330184937,
+    "epoch": 1.6617647058823528,
+    "step": 565
+  },
+  {
+    "loss": 0.02147035002708435,
+    "grad_norm": 1.0859375,
+    "learning_rate": 8.843137254901961e-06,
+    "entropy": 0.026650307327508928,
+    "num_tokens": 359351.0,
+    "mean_token_accuracy": 0.9898578941822052,
+    "epoch": 1.6764705882352942,
+    "step": 570
+  },
+  {
+    "loss": 0.022052311897277833,
+    "grad_norm": 1.3515625,
+    "learning_rate": 8.745098039215687e-06,
+    "entropy": 0.027873093821108343,
+    "num_tokens": 362470.0,
+    "mean_token_accuracy": 0.989058256149292,
+    "epoch": 1.6911764705882353,
+    "step": 575
+  },
+  {
+    "loss": 0.023864805698394775,
+    "grad_norm": 1.5859375,
+    "learning_rate": 8.647058823529413e-06,
+    "entropy": 0.027629780396819115,
+    "num_tokens": 365614.0,
+    "mean_token_accuracy": 0.9894056558609009,
+    "epoch": 1.7058823529411766,
+    "step": 580
+  },
+  {
+    "loss": 0.027744096517562867,
+    "grad_norm": 1.6875,
+    "learning_rate": 8.549019607843138e-06,
+    "entropy": 0.028794774785637856,
+    "num_tokens": 368805.0,
+    "mean_token_accuracy": 0.9880473792552948,
+    "epoch": 1.7205882352941178,
+    "step": 585
+  },
+  {
+    "loss": 0.021863000094890596,
+    "grad_norm": 1.1796875,
+    "learning_rate": 8.450980392156864e-06,
+    "entropy": 0.028252063691616057,
+    "num_tokens": 371947.0,
+    "mean_token_accuracy": 0.9904429137706756,
+    "epoch": 1.7352941176470589,
+    "step": 590
+  },
+  {
+    "loss": 0.021520544588565827,
+    "grad_norm": 1.3203125,
+    "learning_rate": 8.35294117647059e-06,
+    "entropy": 0.028264945745468138,
+    "num_tokens": 375103.0,
+    "mean_token_accuracy": 0.9904776751995087,
+    "epoch": 1.75,
+    "step": 595
+  },
+  {
+    "loss": 0.026353719830513,
+    "grad_norm": 1.1953125,
+    "learning_rate": 8.254901960784314e-06,
+    "entropy": 0.027113928645849227,
+    "num_tokens": 378317.0,
+    "mean_token_accuracy": 0.9884898960590363,
+    "epoch": 1.7647058823529411,
+    "step": 600
+  },
+  {
+    "loss": 0.026097461581230164,
+    "grad_norm": 1.421875,
+    "learning_rate": 8.15686274509804e-06,
+    "entropy": 0.028313294425606726,
+    "num_tokens": 381417.0,
+    "mean_token_accuracy": 0.9879869103431702,
+    "epoch": 1.7794117647058822,
+    "step": 605
+  },
+  {
+    "loss": 0.02049378156661987,
+    "grad_norm": 1.0546875,
+    "learning_rate": 8.058823529411766e-06,
+    "entropy": 0.026570411399006844,
+    "num_tokens": 384632.0,
+    "mean_token_accuracy": 0.9887495577335358,
+    "epoch": 1.7941176470588234,
+    "step": 610
+  },
+  {
+    "loss": 0.022221173346042632,
+    "grad_norm": 1.1171875,
+    "learning_rate": 7.96078431372549e-06,
+    "entropy": 0.02754255346953869,
+    "num_tokens": 387836.0,
+    "mean_token_accuracy": 0.9899809181690216,
+    "epoch": 1.8088235294117647,
+    "step": 615
+  },
+  {
+    "loss": 0.023856499791145326,
+    "grad_norm": 1.3203125,
+    "learning_rate": 7.862745098039217e-06,
+    "entropy": 0.031241112016141416,
+    "num_tokens": 390887.0,
+    "mean_token_accuracy": 0.9897979915142059,
+    "epoch": 1.8235294117647058,
+    "step": 620
+  },
+  {
+    "loss": 0.0225734680891037,
+    "grad_norm": 1.40625,
+    "learning_rate": 7.764705882352941e-06,
+    "entropy": 0.02798519879579544,
+    "num_tokens": 394027.0,
+    "mean_token_accuracy": 0.9890839040279389,
+    "epoch": 1.8382352941176472,
+    "step": 625
+  },
+  {
+    "loss": 0.022729092836380006,
+    "grad_norm": 1.25,
+    "learning_rate": 7.666666666666667e-06,
+    "entropy": 0.02719390895217657,
+    "num_tokens": 397202.0,
+    "mean_token_accuracy": 0.9886514127254487,
+    "epoch": 1.8529411764705883,
+    "step": 630
+  },
+  {
+    "loss": 0.021688875555992127,
+    "grad_norm": 1.0859375,
+    "learning_rate": 7.5686274509803925e-06,
+    "entropy": 0.027222988195717335,
+    "num_tokens": 400378.0,
+    "mean_token_accuracy": 0.9908071339130402,
+    "epoch": 1.8676470588235294,
+    "step": 635
+  },
+  {
+    "loss": 0.023884420096874238,
+    "grad_norm": 1.4296875,
+    "learning_rate": 7.4705882352941185e-06,
+    "entropy": 0.028057356551289558,
+    "num_tokens": 403503.0,
+    "mean_token_accuracy": 0.9900456726551056,
+    "epoch": 1.8823529411764706,
+    "step": 640
+  },
+  {
+    "loss": 0.020375268161296846,
+    "grad_norm": 1.6953125,
+    "learning_rate": 7.372549019607845e-06,
+    "entropy": 0.02543655373156071,
+    "num_tokens": 406768.0,
+    "mean_token_accuracy": 0.9911065042018891,
+    "epoch": 1.8970588235294117,
+    "step": 645
+  },
+  {
+    "loss": 0.020015493035316467,
+    "grad_norm": 1.7421875,
+    "learning_rate": 7.274509803921569e-06,
+    "entropy": 0.027230485714972018,
+    "num_tokens": 409875.0,
+    "mean_token_accuracy": 0.9906234502792358,
+    "epoch": 1.9117647058823528,
+    "step": 650
+  },
+  {
+    "loss": 0.022530680894851683,
+    "grad_norm": 1.421875,
+    "learning_rate": 7.176470588235295e-06,
+    "entropy": 0.028223772905766963,
+    "num_tokens": 412987.0,
+    "mean_token_accuracy": 0.9903216242790223,
+    "epoch": 1.9264705882352942,
+    "step": 655
+  },
+  {
+    "loss": 0.021129874885082243,
+    "grad_norm": 1.109375,
+    "learning_rate": 7.07843137254902e-06,
+    "entropy": 0.02674291282892227,
+    "num_tokens": 416181.0,
+    "mean_token_accuracy": 0.9886639952659607,
+    "epoch": 1.9411764705882353,
+    "step": 660
+  },
+  {
+    "loss": 0.021244224905967713,
+    "grad_norm": 0.9453125,
+    "learning_rate": 6.9803921568627454e-06,
+    "entropy": 0.028005971759557723,
+    "num_tokens": 419323.0,
+    "mean_token_accuracy": 0.9905200719833374,
+    "epoch": 1.9558823529411766,
+    "step": 665
+  },
+  {
+    "loss": 0.022309188544750214,
+    "grad_norm": 1.375,
+    "learning_rate": 6.8823529411764715e-06,
+    "entropy": 0.027272411435842515,
+    "num_tokens": 422484.0,
+    "mean_token_accuracy": 0.9878733932971955,
+    "epoch": 1.9705882352941178,
+    "step": 670
+  },
+  {
+    "loss": 0.022459632158279418,
+    "grad_norm": 1.203125,
+    "learning_rate": 6.784313725490197e-06,
+    "entropy": 0.026817415095865726,
+    "num_tokens": 425583.0,
+    "mean_token_accuracy": 0.9908780753612518,
+    "epoch": 1.9852941176470589,
+    "step": 675
+  },
+  {
+    "loss": 0.021811096370220183,
+    "grad_norm": 1.265625,
+    "learning_rate": 6.686274509803922e-06,
+    "entropy": 0.026038615591824056,
+    "num_tokens": 428736.0,
+    "mean_token_accuracy": 0.9897907853126526,
+    "epoch": 2.0,
+    "step": 680
+  },
+  {
+    "loss": 0.019171090424060823,
+    "grad_norm": 1.078125,
+    "learning_rate": 6.588235294117647e-06,
+    "entropy": 0.02475190218538046,
+    "num_tokens": 431976.0,
+    "mean_token_accuracy": 0.989355844259262,
+    "epoch": 2.014705882352941,
+    "step": 685
+  },
+  {
+    "loss": 0.023474155366420744,
+    "grad_norm": 1.1640625,
+    "learning_rate": 6.490196078431373e-06,
+    "entropy": 0.026115396432578562,
+    "num_tokens": 435142.0,
+    "mean_token_accuracy": 0.9885824680328369,
+    "epoch": 2.0294117647058822,
+    "step": 690
+  },
+  {
+    "loss": 0.020176805555820465,
+    "grad_norm": 1.0,
+    "learning_rate": 6.3921568627450984e-06,
+    "entropy": 0.026907235756516455,
+    "num_tokens": 438259.0,
+    "mean_token_accuracy": 0.9919745445251464,
+    "epoch": 2.0441176470588234,
+    "step": 695
+  },
+  {
+    "loss": 0.022543656826019286,
+    "grad_norm": 1.34375,
+    "learning_rate": 6.294117647058824e-06,
+    "entropy": 0.02749718502163887,
+    "num_tokens": 441366.0,
+    "mean_token_accuracy": 0.9880188047885895,
+    "epoch": 2.0588235294117645,
+    "step": 700
+  },
+  {
+    "loss": 0.019685085117816924,
+    "grad_norm": 0.9453125,
+    "learning_rate": 6.19607843137255e-06,
+    "entropy": 0.024849089048802852,
+    "num_tokens": 444474.0,
+    "mean_token_accuracy": 0.9906105160713196,
+    "epoch": 2.073529411764706,
+    "step": 705
+  },
+  {
+    "loss": 0.020225000381469727,
+    "grad_norm": 1.234375,
+    "learning_rate": 6.098039215686276e-06,
+    "entropy": 0.023934758827090265,
+    "num_tokens": 447652.0,
+    "mean_token_accuracy": 0.9896179974079132,
+    "epoch": 2.088235294117647,
+    "step": 710
+  },
+  {
+    "loss": 0.02128472626209259,
+    "grad_norm": 1.078125,
+    "learning_rate": 6e-06,
+    "entropy": 0.02389440070837736,
+    "num_tokens": 450833.0,
+    "mean_token_accuracy": 0.9899099349975586,
+    "epoch": 2.1029411764705883,
+    "step": 715
+  },
+  {
+    "loss": 0.021367147564888,
+    "grad_norm": 1.6015625,
+    "learning_rate": 5.901960784313726e-06,
+    "entropy": 0.02620517127215862,
+    "num_tokens": 453949.0,
+    "mean_token_accuracy": 0.988726532459259,
+    "epoch": 2.1176470588235294,
+    "step": 720
+  },
+  {
+    "loss": 0.01960753947496414,
+    "grad_norm": 1.03125,
+    "learning_rate": 5.803921568627452e-06,
+    "entropy": 0.02435927651822567,
+    "num_tokens": 457147.0,
+    "mean_token_accuracy": 0.9908569097518921,
+    "epoch": 2.1323529411764706,
+    "step": 725
+  },
+  {
+    "loss": 0.022167882323265074,
+    "grad_norm": 1.234375,
+    "learning_rate": 5.705882352941177e-06,
+    "entropy": 0.02521121110767126,
+    "num_tokens": 460308.0,
+    "mean_token_accuracy": 0.9891940593719483,
+    "epoch": 2.1470588235294117,
+    "step": 730
+  },
+  {
+    "loss": 0.0210279181599617,
+    "grad_norm": 1.359375,
+    "learning_rate": 5.607843137254903e-06,
+    "entropy": 0.02500821612775326,
+    "num_tokens": 463449.0,
+    "mean_token_accuracy": 0.9884547054767608,
+    "epoch": 2.161764705882353,
+    "step": 735
+  },
+  {
+    "loss": 0.01987575888633728,
+    "grad_norm": 1.03125,
+    "learning_rate": 5.509803921568628e-06,
+    "entropy": 0.025977463461458683,
+    "num_tokens": 466590.0,
+    "mean_token_accuracy": 0.9888093769550323,
+    "epoch": 2.176470588235294,
+    "step": 740
+  },
+  {
+    "loss": 0.019111356139183043,
+    "grad_norm": 1.25,
+    "learning_rate": 5.411764705882353e-06,
+    "entropy": 0.02638601940125227,
+    "num_tokens": 469726.0,
+    "mean_token_accuracy": 0.9917258858680725,
+    "epoch": 2.1911764705882355,
+    "step": 745
+  },
+  {
+    "loss": 0.020354922115802764,
+    "grad_norm": 1.171875,
+    "learning_rate": 5.313725490196079e-06,
+    "entropy": 0.026662386767566205,
+    "num_tokens": 472853.0,
+    "mean_token_accuracy": 0.99064000248909,
+    "epoch": 2.2058823529411766,
+    "step": 750
+  },
+  {
+    "loss": 0.01959734410047531,
+    "grad_norm": 0.80859375,
+    "learning_rate": 5.2156862745098044e-06,
+    "entropy": 0.02579411044716835,
+    "num_tokens": 476008.0,
+    "mean_token_accuracy": 0.9904728531837463,
+    "epoch": 2.2205882352941178,
+    "step": 755
+  },
+  {
+    "loss": 0.020466303825378417,
+    "grad_norm": 1.3828125,
+    "learning_rate": 5.11764705882353e-06,
+    "entropy": 0.0256651122123003,
+    "num_tokens": 479150.0,
+    "mean_token_accuracy": 0.9903539717197418,
+    "epoch": 2.235294117647059,
+    "step": 760
+  },
+  {
+    "loss": 0.01983775794506073,
+    "grad_norm": 0.99609375,
+    "learning_rate": 5.019607843137255e-06,
+    "entropy": 0.02584236618131399,
+    "num_tokens": 482321.0,
+    "mean_token_accuracy": 0.9914842903614044,
+    "epoch": 2.25,
+    "step": 765
+  },
+  {
+    "loss": 0.020100761950016022,
+    "grad_norm": 1.046875,
+    "learning_rate": 4.921568627450981e-06,
+    "entropy": 0.02499296572059393,
+    "num_tokens": 485510.0,
+    "mean_token_accuracy": 0.991219836473465,
+    "epoch": 2.264705882352941,
+    "step": 770
+  },
+  {
+    "loss": 0.02088477313518524,
+    "grad_norm": 1.328125,
+    "learning_rate": 4.823529411764706e-06,
+    "entropy": 0.024959737621247768,
+    "num_tokens": 488698.0,
+    "mean_token_accuracy": 0.9898148238658905,
+    "epoch": 2.2794117647058822,
+    "step": 775
+  },
+  {
+    "loss": 0.0195361465215683,
+    "grad_norm": 1.2421875,
+    "learning_rate": 4.725490196078431e-06,
+    "entropy": 0.023672481067478657,
+    "num_tokens": 491906.0,
+    "mean_token_accuracy": 0.9900302290916443,
+    "epoch": 2.2941176470588234,
+    "step": 780
+  },
+  {
+    "loss": 0.019702821969985962,
+    "grad_norm": 1.265625,
+    "learning_rate": 4.627450980392157e-06,
+    "entropy": 0.025737580843269825,
+    "num_tokens": 494997.0,
+    "mean_token_accuracy": 0.9905776441097259,
+    "epoch": 2.3088235294117645,
+    "step": 785
+  },
+  {
+    "loss": 0.018527360260486604,
+    "grad_norm": 1.078125,
+    "learning_rate": 4.529411764705883e-06,
+    "entropy": 0.02454463895410299,
+    "num_tokens": 498138.0,
+    "mean_token_accuracy": 0.9910318195819855,
+    "epoch": 2.323529411764706,
+    "step": 790
+  },
+  {
+    "loss": 0.018923106789588928,
+    "grad_norm": 1.359375,
+    "learning_rate": 4.431372549019608e-06,
+    "entropy": 0.0245100449770689,
+    "num_tokens": 501316.0,
+    "mean_token_accuracy": 0.9911953806877136,
+    "epoch": 2.338235294117647,
+    "step": 795
+  },
+  {
+    "loss": 0.01874026209115982,
+    "grad_norm": 1.140625,
+    "learning_rate": 4.333333333333334e-06,
+    "entropy": 0.023334310948848726,
+    "num_tokens": 504533.0,
+    "mean_token_accuracy": 0.9910171329975128,
+    "epoch": 2.3529411764705883,
+    "step": 800
+  },
+  {
+    "loss": 0.022160655260086058,
+    "grad_norm": 1.2578125,
+    "learning_rate": 4.235294117647059e-06,
+    "entropy": 0.026187057420611382,
+    "num_tokens": 507616.0,
+    "mean_token_accuracy": 0.9876076638698578,
+    "epoch": 2.3676470588235294,
+    "step": 805
+  },
+  {
+    "loss": 0.018640576303005217,
+    "grad_norm": 1.03125,
+    "learning_rate": 4.137254901960784e-06,
+    "entropy": 0.02308085039258003,
+    "num_tokens": 510793.0,
+    "mean_token_accuracy": 0.9908162891864777,
+    "epoch": 2.3823529411764706,
+    "step": 810
+  },
+  {
+    "loss": 0.019237047433853148,
+    "grad_norm": 0.8984375,
+    "learning_rate": 4.03921568627451e-06,
+    "entropy": 0.024417817965149878,
+    "num_tokens": 513995.0,
+    "mean_token_accuracy": 0.9902299284934998,
+    "epoch": 2.3970588235294117,
+    "step": 815
+  },
+  {
+    "loss": 0.020626239478588104,
+    "grad_norm": 1.1640625,
+    "learning_rate": 3.941176470588236e-06,
+    "entropy": 0.025944224931299685,
+    "num_tokens": 517128.0,
+    "mean_token_accuracy": 0.9896773338317871,
+    "epoch": 2.411764705882353,
+    "step": 820
+  },
+  {
+    "loss": 0.018906430900096895,
+    "grad_norm": 1.0546875,
+    "learning_rate": 3.843137254901962e-06,
+    "entropy": 0.02529167104512453,
+    "num_tokens": 520219.0,
+    "mean_token_accuracy": 0.9905548214912414,
+    "epoch": 2.426470588235294,
+    "step": 825
+  },
+  {
+    "loss": 0.01989607810974121,
+    "grad_norm": 1.171875,
+    "learning_rate": 3.7450980392156865e-06,
+    "entropy": 0.025429282896220685,
+    "num_tokens": 523368.0,
+    "mean_token_accuracy": 0.9910161614418029,
+    "epoch": 2.4411764705882355,
+    "step": 830
+  },
+  {
+    "loss": 0.019511505961418152,
+    "grad_norm": 1.046875,
+    "learning_rate": 3.6470588235294117e-06,
+    "entropy": 0.026134114153683184,
+    "num_tokens": 526516.0,
+    "mean_token_accuracy": 0.9898114144802094,
+    "epoch": 2.4558823529411766,
+    "step": 835
+  },
+  {
+    "loss": 0.018582092225551607,
+    "grad_norm": 1.1328125,
+    "learning_rate": 3.5490196078431378e-06,
+    "entropy": 0.02343358173966408,
+    "num_tokens": 529660.0,
+    "mean_token_accuracy": 0.9904271245002747,
+    "epoch": 2.4705882352941178,
+    "step": 840
+  },
+  {
+    "loss": 0.020261451601982117,
+    "grad_norm": 1.453125,
+    "learning_rate": 3.450980392156863e-06,
+    "entropy": 0.024460323713719846,
+    "num_tokens": 532778.0,
+    "mean_token_accuracy": 0.9899402976036071,
+    "epoch": 2.485294117647059,
+    "step": 845
+  },
+  {
+    "loss": 0.020383948087692262,
+    "grad_norm": 1.1796875,
+    "learning_rate": 3.352941176470588e-06,
+    "entropy": 0.024987665377557276,
+    "num_tokens": 535932.0,
+    "mean_token_accuracy": 0.9898059248924256,
+    "epoch": 2.5,
+    "step": 850
+  },
+  {
+    "loss": 0.019448164105415344,
+    "grad_norm": 1.3515625,
+    "learning_rate": 3.2549019607843143e-06,
+    "entropy": 0.02465162370353937,
+    "num_tokens": 539037.0,
+    "mean_token_accuracy": 0.9913235783576966,
+    "epoch": 2.514705882352941,
+    "step": 855
+  },
+  {
+    "loss": 0.018925553560256957,
+    "grad_norm": 1.046875,
+    "learning_rate": 3.1568627450980395e-06,
+    "entropy": 0.025184641405940057,
+    "num_tokens": 542197.0,
+    "mean_token_accuracy": 0.991470605134964,
+    "epoch": 2.5294117647058822,
+    "step": 860
+  },
+  {
+    "loss": 0.01913969814777374,
+    "grad_norm": 1.0546875,
+    "learning_rate": 3.058823529411765e-06,
+    "entropy": 0.024113286659121512,
+    "num_tokens": 545387.0,
+    "mean_token_accuracy": 0.9914486467838287,
+    "epoch": 2.5441176470588234,
+    "step": 865
+  },
+  {
+    "loss": 0.018765930831432343,
+    "grad_norm": 1.0703125,
+    "learning_rate": 2.9607843137254903e-06,
+    "entropy": 0.02413007989525795,
+    "num_tokens": 548534.0,
+    "mean_token_accuracy": 0.9907777428627014,
+    "epoch": 2.5588235294117645,
+    "step": 870
+  },
+  {
+    "loss": 0.019279350340366364,
+    "grad_norm": 2.1875,
+    "learning_rate": 2.8627450980392155e-06,
+    "entropy": 0.024522659182548524,
+    "num_tokens": 551721.0,
+    "mean_token_accuracy": 0.9905555963516235,
+    "epoch": 2.5735294117647056,
+    "step": 875
+  },
+  {
+    "loss": 0.019660860300064087,
+    "grad_norm": 1.1015625,
+    "learning_rate": 2.7647058823529416e-06,
+    "entropy": 0.024852845631539822,
+    "num_tokens": 554912.0,
+    "mean_token_accuracy": 0.9898727238178253,
+    "epoch": 2.588235294117647,
+    "step": 880
+  },
+  {
+    "loss": 0.018780362606048585,
+    "grad_norm": 1.0703125,
+    "learning_rate": 2.666666666666667e-06,
+    "entropy": 0.02551023568958044,
+    "num_tokens": 558028.0,
+    "mean_token_accuracy": 0.99192915558815,
+    "epoch": 2.6029411764705883,
+    "step": 885
+  },
+  {
+    "loss": 0.01949601024389267,
+    "grad_norm": 1.1953125,
+    "learning_rate": 2.568627450980392e-06,
+    "entropy": 0.025155650451779366,
+    "num_tokens": 561189.0,
+    "mean_token_accuracy": 0.990712708234787,
+    "epoch": 2.6176470588235294,
+    "step": 890
+  },
+  {
+    "loss": 0.019716159999370576,
+    "grad_norm": 1.296875,
+    "learning_rate": 2.470588235294118e-06,
+    "entropy": 0.024883992783725262,
+    "num_tokens": 564374.0,
+    "mean_token_accuracy": 0.989579439163208,
+    "epoch": 2.6323529411764706,
+    "step": 895
+  },
+  {
+    "loss": 0.017295162379741668,
+    "grad_norm": 0.97265625,
+    "learning_rate": 2.3725490196078433e-06,
+    "entropy": 0.0241273645311594,
+    "num_tokens": 567550.0,
+    "mean_token_accuracy": 0.9934020042419434,
+    "epoch": 2.6470588235294117,
+    "step": 900
+  },
+  {
+    "loss": 0.020695842802524567,
+    "grad_norm": 1.109375,
+    "learning_rate": 2.274509803921569e-06,
+    "entropy": 0.02697849553078413,
+    "num_tokens": 570611.0,
+    "mean_token_accuracy": 0.9914706110954284,
+    "epoch": 2.661764705882353,
+    "step": 905
+  },
+  {
+    "loss": 0.017908445000648497,
+    "grad_norm": 1.2734375,
+    "learning_rate": 2.176470588235294e-06,
+    "entropy": 0.022997986152768136,
+    "num_tokens": 573767.0,
+    "mean_token_accuracy": 0.9898150980472564,
+    "epoch": 2.6764705882352944,
+    "step": 910
+  },
+  {
+    "loss": 0.020641934871673585,
+    "grad_norm": 1.4921875,
+    "learning_rate": 2.07843137254902e-06,
+    "entropy": 0.027346356958150863,
+    "num_tokens": 576830.0,
+    "mean_token_accuracy": 0.9897843182086945,
+    "epoch": 2.6911764705882355,
+    "step": 915
+  },
+  {
+    "loss": 0.019691270589828492,
+    "grad_norm": 1.2890625,
+    "learning_rate": 1.980392156862745e-06,
+    "entropy": 0.023718219250440598,
+    "num_tokens": 580065.0,
+    "mean_token_accuracy": 0.9901076138019562,
+    "epoch": 2.7058823529411766,
+    "step": 920
+  },
+  {
+    "loss": 0.02009253352880478,
+    "grad_norm": 1.2109375,
+    "learning_rate": 1.8823529411764707e-06,
+    "entropy": 0.024860053882002832,
+    "num_tokens": 583200.0,
+    "mean_token_accuracy": 0.9894306361675262,
+    "epoch": 2.7205882352941178,
+    "step": 925
+  },
+  {
+    "loss": 0.019820311665534975,
+    "grad_norm": 1.1796875,
+    "learning_rate": 1.7843137254901963e-06,
+    "entropy": 0.02641481179744005,
+    "num_tokens": 586247.0,
+    "mean_token_accuracy": 0.9888152658939362,
+    "epoch": 2.735294117647059,
+    "step": 930
+  },
+  {
+    "loss": 0.020238989591598512,
+    "grad_norm": 1.34375,
+    "learning_rate": 1.6862745098039217e-06,
+    "entropy": 0.025426279939711093,
+    "num_tokens": 589348.0,
+    "mean_token_accuracy": 0.9893324971199036,
+    "epoch": 2.75,
+    "step": 935
+  },
+  {
+    "loss": 0.020529073476791383,
+    "grad_norm": 1.1953125,
+    "learning_rate": 1.5882352941176472e-06,
+    "entropy": 0.025489212945103645,
+    "num_tokens": 592483.0,
+    "mean_token_accuracy": 0.9883848607540131,
+    "epoch": 2.764705882352941,
+    "step": 940
+  },
+  {
+    "loss": 0.019503119587898254,
+    "grad_norm": 1.875,
+    "learning_rate": 1.4901960784313726e-06,
+    "entropy": 0.025844238512218,
+    "num_tokens": 595654.0,
+    "mean_token_accuracy": 0.9898752987384796,
+    "epoch": 2.7794117647058822,
+    "step": 945
+  },
+  {
+    "loss": 0.020725423097610475,
+    "grad_norm": 1.3359375,
+    "learning_rate": 1.3921568627450982e-06,
+    "entropy": 0.025542815588414668,
+    "num_tokens": 598757.0,
+    "mean_token_accuracy": 0.9899684190750122,
+    "epoch": 2.7941176470588234,
+    "step": 950
+  },
+  {
+    "loss": 0.020795242488384248,
+    "grad_norm": 1.1640625,
+    "learning_rate": 1.2941176470588237e-06,
+    "entropy": 0.023506213910877705,
+    "num_tokens": 602069.0,
+    "mean_token_accuracy": 0.9894281327724457,
+    "epoch": 2.8088235294117645,
+    "step": 955
+  },
+  {
+    "loss": 0.01915638893842697,
+    "grad_norm": 1.21875,
+    "learning_rate": 1.196078431372549e-06,
+    "entropy": 0.024655142053961753,
+    "num_tokens": 605286.0,
+    "mean_token_accuracy": 0.9900248169898986,
+    "epoch": 2.8235294117647056,
+    "step": 960
+  },
+  {
+    "loss": 0.01975841522216797,
+    "grad_norm": 1.1484375,
+    "learning_rate": 1.0980392156862745e-06,
+    "entropy": 0.025551106408238412,
+    "num_tokens": 608374.0,
+    "mean_token_accuracy": 0.9892638444900512,
+    "epoch": 2.838235294117647,
+    "step": 965
+  },
+  {
+    "loss": 0.020852866768836974,
+    "grad_norm": 1.2421875,
+    "learning_rate": 1.0000000000000002e-06,
+    "entropy": 0.02480896282941103,
+    "num_tokens": 611577.0,
+    "mean_token_accuracy": 0.9892595648765564,
+    "epoch": 2.8529411764705883,
+    "step": 970
+  },
+  {
+    "loss": 0.019326749444007873,
+    "grad_norm": 0.875,
+    "learning_rate": 9.019607843137256e-07,
+    "entropy": 0.02385783474892378,
+    "num_tokens": 614761.0,
+    "mean_token_accuracy": 0.9904800593852997,
+    "epoch": 2.8676470588235294,
+    "step": 975
+  },
+  {
+    "loss": 0.019405061006546022,
+    "grad_norm": 1.1875,
+    "learning_rate": 8.039215686274511e-07,
+    "entropy": 0.026029090210795403,
+    "num_tokens": 617870.0,
+    "mean_token_accuracy": 0.9896216452121734,
+    "epoch": 2.8823529411764706,
+    "step": 980
+  },
+  {
+    "loss": 0.019337351620197295,
+    "grad_norm": 0.9921875,
+    "learning_rate": 7.058823529411766e-07,
+    "entropy": 0.026062553003430366,
+    "num_tokens": 620943.0,
+    "mean_token_accuracy": 0.9899002552032471,
+    "epoch": 2.8970588235294117,
+    "step": 985
+  },
+  {
+    "loss": 0.01972263157367706,
+    "grad_norm": 1.5625,
+    "learning_rate": 6.07843137254902e-07,
+    "entropy": 0.025324805453419686,
+    "num_tokens": 624094.0,
+    "mean_token_accuracy": 0.9898600101470947,
+    "epoch": 2.911764705882353,
+    "step": 990
+  },
+  {
+    "loss": 0.017833781242370606,
+    "grad_norm": 1.2265625,
+    "learning_rate": 5.098039215686275e-07,
+    "entropy": 0.023284821771085262,
+    "num_tokens": 627253.0,
+    "mean_token_accuracy": 0.9910983681678772,
+    "epoch": 2.9264705882352944,
+    "step": 995
+  },
+  {
+    "loss": 0.020137375593185423,
+    "grad_norm": 1.3984375,
+    "learning_rate": 4.1176470588235295e-07,
+    "entropy": 0.024203809909522533,
+    "num_tokens": 630427.0,
+    "mean_token_accuracy": 0.9907480180263519,
+    "epoch": 2.9411764705882355,
+    "step": 1000
+  },
+  {
+    "loss": 0.019109995663166048,
+    "grad_norm": 1.21875,
+    "learning_rate": 3.1372549019607843e-07,
+    "entropy": 0.02416255362331867,
+    "num_tokens": 633632.0,
+    "mean_token_accuracy": 0.9915190756320953,
+    "epoch": 2.9558823529411766,
+    "step": 1005
+  },
+  {
+    "loss": 0.02000269144773483,
+    "grad_norm": 1.859375,
+    "learning_rate": 2.1568627450980394e-07,
+    "entropy": 0.024217843264341354,
+    "num_tokens": 636805.0,
+    "mean_token_accuracy": 0.9894875824451447,
+    "epoch": 2.9705882352941178,
+    "step": 1010
+  },
+  {
+    "loss": 0.020338763296604157,
+    "grad_norm": 1.546875,
+    "learning_rate": 1.1764705882352942e-07,
+    "entropy": 0.024258859269320966,
+    "num_tokens": 639984.0,
+    "mean_token_accuracy": 0.9892021059989929,
+    "epoch": 2.985294117647059,
+    "step": 1015
+  },
+  {
+    "loss": 0.020995336771011352,
+    "grad_norm": 1.046875,
+    "learning_rate": 1.9607843137254902e-08,
+    "entropy": 0.025342148169875144,
+    "num_tokens": 643104.0,
+    "mean_token_accuracy": 0.9887544453144074,
+    "epoch": 3.0,
+    "step": 1020
+  },
+  {
+    "train_runtime": 3944.5682,
+    "train_samples_per_second": 0.517,
+    "train_steps_per_second": 0.259,
+    "total_flos": 5056111718203392.0,
+    "train_loss": 0.07629515403041652,
+    "epoch": 3.0,
+    "step": 1020
+  }
+]
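
The log that ends above is a sequence of per-step records (`loss`, `grad_norm`, `learning_rate`, `entropy`, `num_tokens`, `mean_token_accuracy`, `epoch`, `step`) followed by one run-level summary record (`train_runtime`, `train_loss`, and so on). A minimal sketch of how such a file could be summarized is shown here. It assumes the file is a single JSON list of objects and that it lives at `artifacts/training_log.json` as listed in this commit; the helper names `summarize_training_log` and `load_and_summarize` are hypothetical and not part of the repository.

```python
import json
from statistics import mean


def summarize_training_log(records, tail=10):
    """Split log records into per-step entries and the final run summary.

    Assumption: per-step entries carry a "loss" key, while the closing
    summary record carries "train_loss" instead.
    """
    steps = [r for r in records if "loss" in r]
    summary = next((r for r in records if "train_loss" in r), None)
    last = steps[-1] if steps else None
    return {
        "num_logged_steps": len(steps),
        "final_step": last["step"] if last else None,
        "final_loss": last["loss"] if last else None,
        # Mean over the last `tail` logged entries smooths step-to-step noise.
        "mean_loss_tail": mean(r["loss"] for r in steps[-tail:]) if steps else None,
        "run_train_loss": summary["train_loss"] if summary else None,
    }


def load_and_summarize(path="artifacts/training_log.json"):
    """Hypothetical convenience wrapper: read the JSON list and summarize it."""
    with open(path, encoding="utf-8") as fh:
        return summarize_training_log(json.load(fh))
```

Keeping the two record types apart matters when reading the numbers: the summary's `train_loss` (0.0763 here) is presumably an average over the whole run, so it sits well above the per-step losses of roughly 0.02 logged near step 1020.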

llm_policy.py CHANGED

@@ -76,13 +76,33 @@ class LLMPolicy:
         if self.tokenizer.pad_token is None:
             self.tokenizer.pad_token = self.tokenizer.eos_token
-        self.model = AutoModelForCausalLM.from_pretrained(
-            model_name_or_path,
-            torch_dtype=torch_dtype,
-        ).to(resolved_device)
         self.model.eval()
         self.device = resolved_device
     # ------------------------------------------------------------------
     # Public API
     # ------------------------------------------------------------------

         if self.tokenizer.pad_token is None:
             self.tokenizer.pad_token = self.tokenizer.eos_token
+        # transformers renamed torch_dtype -> dtype; try new kwarg first and
+        # fall back for older versions. Works silently on both.
+        try:
+            self.model = AutoModelForCausalLM.from_pretrained(
+                model_name_or_path,
+                dtype=torch_dtype,
+            ).to(resolved_device)
+        except TypeError:
+            self.model = AutoModelForCausalLM.from_pretrained(
+                model_name_or_path,
+                torch_dtype=torch_dtype,
+            ).to(resolved_device)
         self.model.eval()
         self.device = resolved_device
+        # Strip sampling-only fields from the shipped generation_config so
+        # transformers doesn't warn "these flags will be ignored" when we
+        # decode greedily (do_sample=False).
+        gen_config = getattr(self.model, "generation_config", None)
+        if gen_config is not None:
+            for attr in ("temperature", "top_p", "top_k"):
+                if hasattr(gen_config, attr):
+                    try:
+                        setattr(gen_config, attr, None)
+                    except Exception:
+                        pass
     # ------------------------------------------------------------------
     # Public API
     # ------------------------------------------------------------------
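
The keyword fallback in this hunk can be illustrated in isolation. The sketch below uses plain stand-in functions rather than `transformers`; `call_with_renamed_kwarg`, `old_loader` and `new_loader` are hypothetical names introduced only for this example.

```python
from typing import Any, Callable


def call_with_renamed_kwarg(
    fn: Callable[..., Any], new_name: str, old_name: str, value: Any, *args: Any
) -> Any:
    """Call fn with the newer keyword; retry with the older one on TypeError."""
    try:
        return fn(*args, **{new_name: value})
    except TypeError:
        return fn(*args, **{old_name: value})


# Stand-ins for a loader before and after a keyword rename (hypothetical).
def old_loader(path: str, torch_dtype: str = "auto") -> str:
    return f"{path}:{torch_dtype}"


def new_loader(path: str, dtype: str = "auto") -> str:
    return f"{path}:{dtype}"


for loader in (old_loader, new_loader):
    print(call_with_renamed_kwarg(loader, "dtype", "torch_dtype", "float16", "some/model"))
```

One caveat of this pattern: any `TypeError` raised inside the first call, not only an unexpected-keyword error, also triggers the retry, so the second attempt may hide an unrelated failure.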

scripts/before_after_demo.py ADDED

	@@ -0,0 +1,197 @@

+"""Before/after demo: base model vs fine-tuned model on the SAME incident.
+Runs both policies against the same task under the same seed, prints a clean
+side-by-side trace, and writes ``artifacts/before_after_demo.md`` which you
+can paste into the blog post or screen-record for the video.
+Usage (after ``train_trl.py`` has saved ``artifacts/sft_model``)::
+    ENV_URL=http://127.0.0.1:8000 python scripts/before_after_demo.py
+Env variables:
+    ENV_URL            — URL of a running Incident Command Center server
+    BASE_MODEL         — HF hub id of the base model
+    SFT_MODEL_DIR      — path to the fine-tuned checkpoint (default: artifacts/sft_model)
+    DEMO_TASK          — task difficulty to demo (default: hard)
+    DEMO_MAX_STEPS     — per-episode step cap (default: 120)
+    DEMO_SEED          - seed passed to random.seed() before each rollout (default: 2026)
+"""
+from __future__ import annotations
+import json
+import os
+import random
+import sys
+from pathlib import Path
+from typing import Dict, List, Optional
+# Ensure repo root on sys.path when invoked from subdirectory
+_REPO_ROOT = Path(__file__).resolve().parents[1]
+if str(_REPO_ROOT) not in sys.path:
+    sys.path.insert(0, str(_REPO_ROOT))
+from client import IncidentCommandEnvClient  # noqa: E402
+from models import IncidentAction, IncidentObservation  # noqa: E402
+ENV_URL = os.getenv("ENV_URL", "http://127.0.0.1:8000")
+BASE_MODEL = os.getenv("BASE_MODEL", "Qwen/Qwen2.5-0.5B-Instruct")
+SFT_MODEL_DIR = os.getenv("SFT_MODEL_DIR", "artifacts/sft_model")
+DEMO_TASK = os.getenv("DEMO_TASK", "hard")
+DEMO_MAX_STEPS = int(os.getenv("DEMO_MAX_STEPS", "120"))
+DEMO_SEED = int(os.getenv("DEMO_SEED", "2026"))
+def _format_obs_summary(obs: IncidentObservation) -> str:
+    return (
+        f"{obs.incident_title} "
+        f"(tier={obs.customer_tier}, users={obs.affected_users_estimate}, "
+        f"$/min={obs.revenue_impact_usd_per_min})"
+    )
+def _format_action(action: IncidentAction) -> str:
+    target = action.target or "-"
+    bits = [f"{action.actor}:{action.action_type}:{target}"]
+    if action.reason:
+        bits.append(f"reason={action.reason[:80]}")
+    return " | ".join(bits)
+def _format_components(components: Optional[Dict[str, float]]) -> str:
+    if not components:
+        return "-"
+    return ", ".join(f"{k}={v:+.2f}" for k, v in components.items())
+def _rollout_with_policy(policy_name: str, select_fn) -> Dict:
+    env = IncidentCommandEnvClient(base_url=ENV_URL).sync()
+    random.seed(DEMO_SEED)
+    steps_log: List[Dict] = []
+    total_reward = 0.0
+    components_sum: Dict[str, float] = {}
+    closed_incidents = 0
+    incident_seen: List[str] = []
+    try:
+        result = env.reset(task_name=DEMO_TASK)
+        step_idx = 0
+        while not result.done and step_idx < DEMO_MAX_STEPS:
+            step_idx += 1
+            obs = result.observation
+            if obs.incident_id not in incident_seen:
+                incident_seen.append(obs.incident_id)
+            action = select_fn(obs)
+            result = env.step(action)
+            reward = float(result.reward or 0.0)
+            total_reward += reward
+            new_obs = result.observation
+            step_components = getattr(new_obs, "reward_components", None) or {}
+            for k, v in step_components.items():
+                components_sum[k] = components_sum.get(k, 0.0) + float(v)
+            if action.action_type == "close_incident" and reward > 0:
+                closed_incidents += 1
+            steps_log.append(
+                {
+                    "step": step_idx,
+                    "incident": obs.incident_id,
+                    "summary": _format_obs_summary(obs),
+                    "action": _format_action(action),
+                    "reward": round(reward, 3),
+                    "components": _format_components(step_components),
+                }
+            )
+    finally:
+        try:
+            env.close()
+        except Exception:
+            pass
+    return {
+        "policy": policy_name,
+        "task": DEMO_TASK,
+        "steps": len(steps_log),
+        "total_reward": round(total_reward, 3),
+        "incidents_seen": incident_seen,
+        "incidents_closed": closed_incidents,
+        "components_sum": {k: round(v, 3) for k, v in components_sum.items()},
+        "trace": steps_log,
+    }
+def _write_markdown(base_run: Dict, sft_run: Dict, out_path: Path) -> None:
+    lines: List[str] = []
+    lines.append(f"# Before vs After — {DEMO_TASK.title()} task demo\n")
+    lines.append(f"Both policies ran against the same seeded task (`{DEMO_TASK}`, seed {DEMO_SEED}) ")
+    lines.append("on an identical Incident Command Center server. Each sees the same incident ")
+    lines.append("queue in the same order.\n")
+    lines.append("## Headline\n")
+    lines.append("| Policy | Total reward | Steps | Incidents closed |")
+    lines.append("|---|---:|---:|---:|")
+    lines.append(
+        f"| Base `{BASE_MODEL}` | {base_run['total_reward']:+.2f} | "
+        f"{base_run['steps']} | {base_run['incidents_closed']} |"
+    )
+    lines.append(
+        f"| **Fine-tuned (SFT)** | **{sft_run['total_reward']:+.2f}** | "
+        f"{sft_run['steps']} | {sft_run['incidents_closed']} |"
+    )
+    delta = sft_run["total_reward"] - base_run["total_reward"]
+    lines.append(f"\n**Reward delta (fine-tuned minus base): {delta:+.2f}**\n")
+    lines.append("## Reward sources (summed across the episode)\n")
+    lines.append("| Component | Base | Fine-tuned |")
+    lines.append("|---|---:|---:|")
+    all_keys = sorted(set(base_run["components_sum"]) | set(sft_run["components_sum"]))
+    for k in all_keys:
+        lines.append(
+            f"| `{k}` | {base_run['components_sum'].get(k, 0.0):+.2f} | "
+            f"{sft_run['components_sum'].get(k, 0.0):+.2f} |"
+        )
+    def _trace_block(run: Dict, title: str) -> None:
+        lines.append(f"\n## Trace — {title}\n")
+        lines.append("```")
+        for row in run["trace"]:
+            lines.append(
+                f"step {row['step']:>3} | incident={row['incident']} | "
+                f"{row['action']} | reward={row['reward']:+.2f} | {row['components']}"
+            )
+        lines.append("```")
+    _trace_block(base_run, f"Base model ({BASE_MODEL})")
+    _trace_block(sft_run, "Fine-tuned (SFT) model")
+    out_path.write_text("\n".join(lines), encoding="utf-8")
+def main() -> None:
+    from llm_policy import LLMPolicy
+    print(f"[demo] task={DEMO_TASK} seed={DEMO_SEED} env={ENV_URL}")
+    print(f"[demo] Loading base model: {BASE_MODEL}")
+    base_policy = LLMPolicy(BASE_MODEL, label="base_model")
+    base_run = _rollout_with_policy("base_model", base_policy.select_action)
+    base_policy.release()
+    print(f"[demo] Loading SFT model: {SFT_MODEL_DIR}")
+    sft_policy = LLMPolicy(SFT_MODEL_DIR, label="sft_model")
+    sft_run = _rollout_with_policy("sft_model", sft_policy.select_action)
+    sft_policy.release()
+    art_dir = Path("artifacts")
+    art_dir.mkdir(exist_ok=True)
+    md_path = art_dir / "before_after_demo.md"
+    json_path = art_dir / "before_after_demo.json"
+    _write_markdown(base_run, sft_run, md_path)
+    with json_path.open("w", encoding="utf-8") as f:
+        json.dump({"base": base_run, "sft": sft_run}, f, indent=2)
+    print(f"[demo] Base   total={base_run['total_reward']:+.2f} "
+          f"steps={base_run['steps']} closed={base_run['incidents_closed']}")
+    print(f"[demo] SFT    total={sft_run['total_reward']:+.2f} "
+          f"steps={sft_run['steps']} closed={sft_run['incidents_closed']}")
+    print(f"[demo] Wrote {md_path} and {json_path}")
+if __name__ == "__main__":
+    main()
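
The per-component bookkeeping in `_rollout_with_policy` and the component table in `_write_markdown` reduce to two small steps: sum each step's `reward_components` by key, then render the union of keys for both policies. A minimal standalone sketch follows; the helper names and the sample values are hypothetical and are not results from this repository.

```python
from typing import Dict, Iterable, List


def sum_components(steps: Iterable[Dict[str, float]]) -> Dict[str, float]:
    """Accumulate per-step reward components into episode totals."""
    totals: Dict[str, float] = {}
    for components in steps:
        for name, value in components.items():
            totals[name] = totals.get(name, 0.0) + float(value)
    return totals


def components_table(base: Dict[str, float], sft: Dict[str, float]) -> List[str]:
    """Render a Markdown table over the union of component names."""
    rows = ["| Component | Base | Fine-tuned |", "|---|---:|---:|"]
    for name in sorted(set(base) | set(sft)):
        rows.append(f"| `{name}` | {base.get(name, 0.0):+.2f} | {sft.get(name, 0.0):+.2f} |")
    return rows


# Made-up per-step components, for illustration only.
base_steps = [{"step_cost": -0.1}, {"step_cost": -0.1}]
sft_steps = [{"step_cost": -0.1, "clue_reward": 0.5}, {"closure_correct": 2.0}]
table = components_table(sum_components(base_steps), sum_components(sft_steps))
print("\n".join(table))
```

Taking the union of keys means a component that only one policy ever earns still gets a row, with the other column shown as `+0.00`.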

server/app.py CHANGED

@@ -17,10 +17,12 @@ from __future__ import annotations
 import json
 import logging
 from typing import Any, Dict
 import uvicorn
 from fastapi.responses import HTMLResponse, JSONResponse, PlainTextResponse
 from openenv.core.env_server import create_fastapi_app
 from models import IncidentAction, IncidentObservation
@@ -42,12 +44,41 @@ _LOG = logging.getLogger("icc.app")
 _CONFIG = EnvConfig.from_env()
 configure_logging(level=_CONFIG.log_level, structured=_CONFIG.structured_logging)
 app = create_fastapi_app(
     IncidentCommandCenterEnvironment,
     IncidentAction,
     IncidentObservation,
 )
 # ---------------------------------------------------------------------------
 # Introspection helpers
@@ -161,6 +192,153 @@ async def root() -> HTMLResponse:
 def _dashboard_html() -> str:
     metadata_json = json.dumps(_metadata_payload(), indent=2)
     return f"""
 <!DOCTYPE html>
 <html lang='en'>
@@ -180,24 +358,39 @@ def _dashboard_html() -> str:
       background: radial-gradient(1000px 600px at 10% -10%, #1e293b, var(--bg));
       color: var(--text); padding: 2rem; margin: 0; min-height: 100vh;
     }}
-    header {{ display:flex; align-items:center; justify-content:space-between; max-width:1100px; margin:0 auto 1.5rem; }}
     .brand {{ display:flex; align-items:center; gap:0.75rem; }}
     .logo {{ width:44px; height:44px; border-radius:10px; background:linear-gradient(135deg,var(--primary),var(--accent)); }}
     h1 {{ font-size:1.6rem; margin:0; }}
-    h2 {{ font-size:1.1rem; margin:1.4rem 0 0.6rem; color:#cbd5e1; }}
     .sub {{ color: var(--muted); }}
-    .grid {{ display:grid; grid-template-columns: repeat(auto-fit,minmax(260px,1fr)); gap:1rem; max-width:1100px; margin:0 auto; }}
     .card {{ background: var(--card); border: 1px solid #1f2a44; padding: 1.25rem; border-radius: 14px; }}
     .card h3 {{ margin:0 0 0.5rem; font-size:1rem; color:#f1f5f9; }}
     .pill {{ display:inline-block; padding:2px 8px; margin:2px; border-radius:999px; background:#1e293b; border:1px solid #334155; color:#cbd5e1; font-size:0.78rem; }}
     .container {{ max-width: 1100px; margin: 0 auto; }}
     code {{ background:#0b1225; border:1px solid #1f2a44; padding:2px 6px; border-radius:6px; color:#67e8f9; font-family:'JetBrains Mono', monospace; }}
     pre {{ background:#0b1225; border:1px solid #1f2a44; padding: 1rem; border-radius: 10px; color:#cbd5e1; overflow-x:auto; font-size:0.85rem; }}
     a {{ color: var(--accent); text-decoration: none; }}
     .kpi {{ display:flex; flex-direction:column; gap:0.25rem; }}
     .kpi .num {{ font-size:1.6rem; font-weight:700; color:#f8fafc; }}
     .kpi .lbl {{ color: var(--muted); font-size:0.8rem; }}
     footer {{ max-width:1100px; margin:2rem auto 0; color:var(--muted); font-size:0.85rem; }}
   </style>
 </head>
 <body>
@@ -206,16 +399,53 @@ def _dashboard_html() -> str:
       <div class='logo'></div>
       <div>
         <h1>Incident Command Center</h1>
-        <div class='sub'>OpenEnv · Multi-Agent · Long-Horizon · Enterprise Simulation</div>
       </div>
     </div>
-    <div>
       <span class='pill'>v{_CONFIG.version}</span>
       <span class='pill'>task: easy / medium / hard</span>
     </div>
   </header>
   <div class='container'>
     <div class='grid'>
       <div class='card'>
         <div class='kpi'>
@@ -246,6 +476,33 @@ def _dashboard_html() -> str:
       </div>
     </div>
     <h2>Endpoints</h2>
     <div class='card'>
       <p class='sub'>Standard OpenEnv contract plus operational endpoints.</p>
@@ -258,22 +515,40 @@ def _dashboard_html() -> str:
         <li><code>GET /env-info</code> — action space, reward model, budgets.</li>
         <li><code>GET /metrics</code> — Prometheus-style counters.</li>
         <li><code>GET /docs</code> — interactive OpenAPI documentation.</li>
       </ul>
     </div>
     <h2>Action space</h2>
     <div class='card'>
       {"".join(f"<span class='pill'>{a}</span>" for a in ALL_ACTIONS)}
-      <p class='sub'>Each action is gated by the acting role; wrong-actor calls are penalised.</p>
     </div>
-    <h2>Reward model (summary)</h2>
     <div class='card'>
-      <p>Composable rubric with anti-gaming safeguards. Every step returns a
-      <code>reward_components</code> dictionary so training curves are
-      interpretable. Closure rewards and SLA penalties are scaled by
-      customer-tier multipliers:</p>
-      {"".join(f"<span class='pill'>{tier}: x{mult}</span>" for tier, mult in TIER_MULTIPLIER.items())}
     </div>
     <h2>Metadata</h2>
@@ -284,7 +559,9 @@ def _dashboard_html() -> str:
   <footer>
     Incident Command Center v{_CONFIG.version} · Built on
-    <a href='https://github.com/meta-pytorch/openenv'>OpenEnv</a>.
   </footer>
   <script>

 import json
 import logging
+from pathlib import Path
 from typing import Any, Dict
 import uvicorn
 from fastapi.responses import HTMLResponse, JSONResponse, PlainTextResponse
+from fastapi.staticfiles import StaticFiles
 from openenv.core.env_server import create_fastapi_app
 from models import IncidentAction, IncidentObservation
 _CONFIG = EnvConfig.from_env()
 configure_logging(level=_CONFIG.log_level, structured=_CONFIG.structured_logging)
+# External URLs surfaced on the dashboard so judges can jump straight from
+# the HF Space to the GitHub / Colab / training artifacts.
+GITHUB_URL = "https://github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center"
+SPACE_PAGE_URL = "https://huggingface.co/spaces/SwapnilPatil28/Multi-Agent-Incident-Command-Center"
+COLAB_URL = "https://colab.research.google.com/drive/1vx9E5FrZZrHoRwXs2cvtom3DaI6kZ3LP?usp=sharing"
 app = create_fastapi_app(
     IncidentCommandCenterEnvironment,
     IncidentAction,
     IncidentObservation,
 )
+# Serve the committed training-evidence artifacts (reward_curve.png,
+# training_curve.png, reward_components.png, summary_metrics.json, ...)
+# so the dashboard can embed them without depending on external hosts.
+_ARTIFACTS_DIR = Path(__file__).resolve().parent.parent / "artifacts"
+if _ARTIFACTS_DIR.exists():
+    app.mount(
+        "/artifacts",
+        StaticFiles(directory=str(_ARTIFACTS_DIR)),
+        name="artifacts",
+    )
+def _load_summary_metrics() -> Dict[str, Any]:
+    """Best-effort load of the committed training results for the dashboard."""
+    path = _ARTIFACTS_DIR / "summary_metrics.json"
+    if not path.exists():
+        return {}
+    try:
+        with path.open("r", encoding="utf-8") as fh:
+            return json.load(fh)
+    except (OSError, json.JSONDecodeError):
+        return {}
 # ---------------------------------------------------------------------------
 # Introspection helpers
 def _dashboard_html() -> str:
     metadata_json = json.dumps(_metadata_payload(), indent=2)
+    metrics = _load_summary_metrics()
+    artifacts_available = _ARTIFACTS_DIR.exists() and (
+        _ARTIFACTS_DIR / "reward_curve.png"
+    ).exists()
+    # --- Headline training numbers (1.5B SFT vs base, hard task) -------------
+    base_rewards = metrics.get("base_model_rewards") or [0.0, 0.0, 0.0]
+    sft_rewards = metrics.get("sft_model_rewards") or [0.0, 0.0, 0.0]
+    improvement = metrics.get("improvement_sft_over_base") or [0.0, 0.0, 0.0]
+    headline_delta = improvement[2] if len(improvement) >= 3 else 0.0
+    def _fmt(val: Any) -> str:
+        try:
+            return f"{float(val):+.2f}"
+        except (TypeError, ValueError):
+            return "—"
+    training_rows = "".join(
+        f"<tr><td>{tier}</td><td>{_fmt(base_rewards[idx])}</td>"
+        f"<td>{_fmt(sft_rewards[idx])}</td>"
+        f"<td class='delta'>{_fmt(improvement[idx])}</td></tr>"
+        for idx, tier in enumerate(("easy", "medium", "hard"))
+        if idx < min(len(base_rewards), len(sft_rewards), len(improvement))
+    )
+    # --- Training-evidence block (plots + caption) ---------------------------
+    if artifacts_available:
+        plots_html = """
+    <h2>Training evidence</h2>
+    <p class='sub'>
+      Committed artifacts from the reference training run
+      (Qwen2.5-1.5B-Instruct, 8 episodes/task, 3 epochs).
+    </p>
+    <div class='plots'>
+      <figure>
+        <img src='/artifacts/reward_curve.png' alt='Reward curve by policy' loading='lazy' />
+        <figcaption>Mean episodic reward per task tier across Random / Heuristic /
+        Base-LLM / SFT-LLM. SFT matches the heuristic demonstrator across every tier
+        and outperforms the untuned base by <strong>+{hard}</strong> on hard incidents.</figcaption>
+      </figure>
+      <figure>
+        <img src='/artifacts/training_curve.png' alt='SFT training loss and token accuracy' loading='lazy' />
+        <figcaption>Supervised loss collapses from <code>~2.84 → ~0.02</code> and
+        next-token accuracy climbs from <code>~0.49 → ~0.99</code> in three epochs on 680 rollout rows.</figcaption>
+      </figure>
+      <figure>
+        <img src='/artifacts/reward_components.png' alt='Reward component decomposition' loading='lazy' />
+        <figcaption>Per-component reward decomposition. SFT reproduces the
+        heuristic's positive components (clue_bonus, mitigation_correct, closure_correct,
+        speed_bonus) while the base model stalls on step_cost and SLA penalties.</figcaption>
+      </figure>
+    </div>
+    <p class='sub' style='margin-top:0.75rem'>
+      Raw files:
+      <a href='/artifacts/summary_metrics.json'>summary_metrics.json</a>
+      ·
+      <a href='/artifacts/training_log.json'>training_log.json</a>
+      ·
+      <a href='/artifacts/reward_curve_qwen0p5b.png'>0.5B ablation plot</a>
+      ·
+      <a href='/artifacts/summary_metrics_qwen0p5b.json'>0.5B metrics</a>
+    </p>
+""".format(hard=_fmt(headline_delta))
+    else:
+        plots_html = (
+            "<h2>Training evidence</h2>"
+            "<div class='card'><p class='sub'>Plots not bundled in this image. "
+            "See the <a href='" + GITHUB_URL + "/tree/main/artifacts'>GitHub artifacts folder</a>.</p></div>"
+        )
+    # --- 0.5B ablation summary ----------------------------------------------
+    ablation_html = """
+    <h2>Ablation: model scale matters for imitation learning</h2>
+    <div class='card'>
+      <p class='sub'>
+        Same pipeline, same data schema — only the base-model size differs. The 0.5B
+        model cannot absorb the expert policy; 1.5B matches it exactly.
+      </p>
+      <div class='table-wrap'>
+        <table>
+          <thead>
+            <tr>
+              <th>Model</th><th>Easy Δ</th><th>Medium Δ</th><th>Hard Δ</th>
+              <th>Heuristic match?</th>
+            </tr>
+          </thead>
+          <tbody>
+            <tr>
+              <td>Qwen2.5-0.5B-Instruct</td>
+              <td>+0.43</td><td>+0.14</td><td class='delta'>+0.00</td>
+              <td>No (stuck on step-cost)</td>
+            </tr>
+            <tr>
+              <td><strong>Qwen2.5-1.5B-Instruct</strong></td>
+              <td>-1.80</td><td>+3.13</td><td class='delta good'>+10.17</td>
+              <td><strong>Yes (exact match)</strong></td>
+            </tr>
+          </tbody>
+        </table>
+      </div>
+    </div>
+"""
+    # --- Theme-mapping block (Multi-Agent / Long-Horizon / Professional) -----
+    themes_html = """
+    <h2>Hackathon theme mapping</h2>
+    <div class='grid grid-3'>
+      <div class='card'>
+        <h3>Theme #1 — Multi-Agent Interactions</h3>
+        <p class='sub'>
+          Three gated specialist roles (triage, investigator, ops manager) exchange
+          structured handoffs. Acting out-of-role triggers a
+          <code>wrong_actor_penalty</code>, so collaboration is trained, not hard-coded.
+        </p>
+      </div>
+      <div class='card'>
+        <h3>Theme #2 — Long-Horizon Planning</h3>
+        <p class='sub'>
+          Episodes span up to 28 steps across stacked incidents with delayed,
+          sparse rewards (closure &amp; post-mortem) and per-tier budget / SLA
+          constraints — a proper credit-assignment stress test.
+        </p>
+      </div>
+      <div class='card'>
+        <h3>Theme #3 — World Modeling / Professional Tasks</h3>
+        <p class='sub'>
+          A realistic enterprise incident-response simulation with customer tiers,
+          rollbacks, escalation policies, post-mortems, and a transparent,
+          gaming-resistant reward rubric.
+        </p>
+      </div>
+    </div>
+"""
+    # --- Reward-rubric details ----------------------------------------------
+    reward_rubric_rows = "".join(
+        f"<tr><td><code>{name}</code></td><td>{value}</td></tr>"
+        for name, value in (
+            ("step_cost", f"{STEP_COST_INVESTIGATION} per investigation step"),
+            ("clue_reward", f"+{CLUE_REWARD} per new fact"),
+            ("handoff_correct", f"+{HANDOFF_CORRECT_REWARD}"),
+            ("mitigation_correct", f"+{MITIGATION_CORRECT_REWARD}"),
+            ("closure_correct_base", f"+{CLOSURE_CORRECT_BASE} × tier multiplier"),
+            ("closure_wrong", f"{CLOSURE_WRONG_PENALTY} × tier multiplier"),
+        )
+    )
     return f"""
 <!DOCTYPE html>
 <html lang='en'>
       background: radial-gradient(1000px 600px at 10% -10%, #1e293b, var(--bg));
       color: var(--text); padding: 2rem; margin: 0; min-height: 100vh;
     }}
+    header {{ display:flex; align-items:center; justify-content:space-between; max-width:1100px; margin:0 auto 1.5rem; flex-wrap:wrap; gap:1rem; }}
     .brand {{ display:flex; align-items:center; gap:0.75rem; }}
     .logo {{ width:44px; height:44px; border-radius:10px; background:linear-gradient(135deg,var(--primary),var(--accent)); }}
     h1 {{ font-size:1.6rem; margin:0; }}
+    h2 {{ font-size:1.2rem; margin:1.8rem 0 0.6rem; color:#cbd5e1; }}
     .sub {{ color: var(--muted); }}
+    .grid {{ display:grid; grid-template-columns: repeat(auto-fit,minmax(240px,1fr)); gap:1rem; max-width:1100px; margin:0 auto; }}
+    .grid-3 {{ grid-template-columns: repeat(auto-fit,minmax(280px,1fr)); }}
     .card {{ background: var(--card); border: 1px solid #1f2a44; padding: 1.25rem; border-radius: 14px; }}
     .card h3 {{ margin:0 0 0.5rem; font-size:1rem; color:#f1f5f9; }}
     .pill {{ display:inline-block; padding:2px 8px; margin:2px; border-radius:999px; background:#1e293b; border:1px solid #334155; color:#cbd5e1; font-size:0.78rem; }}
+    .pill.cta {{ background:linear-gradient(135deg,var(--primary),var(--accent)); color:#0b1225; border-color:transparent; font-weight:600; }}
     .container {{ max-width: 1100px; margin: 0 auto; }}
     code {{ background:#0b1225; border:1px solid #1f2a44; padding:2px 6px; border-radius:6px; color:#67e8f9; font-family:'JetBrains Mono', monospace; }}
     pre {{ background:#0b1225; border:1px solid #1f2a44; padding: 1rem; border-radius: 10px; color:#cbd5e1; overflow-x:auto; font-size:0.85rem; }}
     a {{ color: var(--accent); text-decoration: none; }}
+    a:hover {{ text-decoration: underline; }}
     .kpi {{ display:flex; flex-direction:column; gap:0.25rem; }}
     .kpi .num {{ font-size:1.6rem; font-weight:700; color:#f8fafc; }}
     .kpi .lbl {{ color: var(--muted); font-size:0.8rem; }}
+    .kpi .num.good {{ color: var(--good); }}
     footer {{ max-width:1100px; margin:2rem auto 0; color:var(--muted); font-size:0.85rem; }}
+    .plots {{ display:grid; grid-template-columns: repeat(auto-fit,minmax(300px,1fr)); gap:1rem; max-width:1100px; margin:0 auto; }}
+    .plots figure {{ background: var(--card); border:1px solid #1f2a44; border-radius: 14px; padding: 0.75rem; margin:0; }}
+    .plots img {{ width:100%; height:auto; border-radius:8px; background:#0b1225; }}
+    .plots figcaption {{ color: var(--muted); font-size:0.8rem; margin-top:0.5rem; line-height:1.4; }}
+    .table-wrap {{ overflow-x:auto; }}
+    table {{ width:100%; border-collapse: collapse; margin-top:0.5rem; font-size:0.9rem; }}
+    th, td {{ padding:0.5rem 0.75rem; text-align:left; border-bottom:1px solid #1f2a44; }}
+    th {{ color:#cbd5e1; font-weight:600; }}
+    td.delta {{ font-weight:600; color:#f8fafc; }}
+    td.delta.good {{ color: var(--good); }}
+    .links {{ display:flex; flex-wrap:wrap; gap:0.5rem; }}
   </style>
 </head>
 <body>
       <div class='logo'></div>
       <div>
         <h1>Incident Command Center</h1>
+        <div class='sub'>OpenEnv · Multi-Agent · Long-Horizon · Professional-Task Simulation</div>
       </div>
     </div>
+    <div class='links'>
+      <a class='pill cta' href='{GITHUB_URL}' target='_blank' rel='noopener'>GitHub</a>
+      <a class='pill cta' href='{COLAB_URL}' target='_blank' rel='noopener'>Open in Colab</a>
+      <a class='pill' href='{SPACE_PAGE_URL}' target='_blank' rel='noopener'>Space page</a>
       <span class='pill'>v{_CONFIG.version}</span>
       <span class='pill'>task: easy / medium / hard</span>
     </div>
   </header>
   <div class='container'>
+    <h2>Headline results</h2>
+    <div class='grid'>
+      <div class='card'>
+        <div class='kpi'>
+          <span class='lbl'>SFT reward lift on hard tasks</span>
+          <span class='num good'>{_fmt(headline_delta)}</span>
+          <span class='sub'>vs Qwen2.5-1.5B-Instruct base</span>
+        </div>
+      </div>
+      <div class='card'>
+        <div class='kpi'>
+          <span class='lbl'>Heuristic-policy match</span>
+          <span class='num'>Exact</span>
+          <span class='sub'>SFT clones the demonstrator across every tier</span>
+        </div>
+      </div>
+      <div class='card'>
+        <div class='kpi'>
+          <span class='lbl'>Scale ablation (hard Δ)</span>
+          <span class='num'>0.5B → 1.5B</span>
+          <span class='sub'>+0.00 → +10.17: capacity matters</span>
+        </div>
+      </div>
+      <div class='card'>
+        <div class='kpi'>
+          <span class='lbl'>Training data</span>
+          <span class='num'>680 rows</span>
+          <span class='sub'>24 heuristic rollouts · 3 epochs</span>
+        </div>
+      </div>
+    </div>
+    <h2>Environment at a glance</h2>
     <div class='grid'>
       <div class='card'>
         <div class='kpi'>
       </div>
     </div>
+    <h2>1.5B SFT vs base (reference run)</h2>
+    <div class='card'>
+      <div class='table-wrap'>
+        <table>
+          <thead>
+            <tr>
+              <th>Task tier</th><th>Base reward</th><th>SFT reward</th><th>Δ</th>
+            </tr>
+          </thead>
+          <tbody>
+            {training_rows}
+          </tbody>
+        </table>
+      </div>
+      <p class='sub' style='margin-top:0.75rem'>
+        Numbers loaded live from
+        <a href='/artifacts/summary_metrics.json'>summary_metrics.json</a>
+        committed alongside this Space.
+      </p>
+    </div>
+    {plots_html}
+    {ablation_html}
+    {themes_html}
     <h2>Endpoints</h2>
     <div class='card'>
       <p class='sub'>Standard OpenEnv contract plus operational endpoints.</p>
         <li><code>GET /env-info</code> — action space, reward model, budgets.</li>
         <li><code>GET /metrics</code> — Prometheus-style counters.</li>
         <li><code>GET /docs</code> — interactive OpenAPI documentation.</li>
+        <li><code>GET /artifacts/…</code> — committed training plots &amp; metrics.</li>
       </ul>
     </div>
     <h2>Action space</h2>
     <div class='card'>
       {"".join(f"<span class='pill'>{a}</span>" for a in ALL_ACTIONS)}
+      <p class='sub' style='margin-top:0.5rem'>
+        Each action is gated by the acting role; wrong-actor calls are penalised.
+      </p>
     </div>
+    <h2>Reward model</h2>
     <div class='card'>
+      <p>
+        Composable rubric with anti-gaming safeguards. Every step returns a
+        <code>reward_components</code> dictionary so training curves are
+        interpretable. Closure rewards and SLA penalties are scaled by
+        customer-tier multipliers:
+      </p>
+      <p>
+        {"".join(f"<span class='pill'>{tier}: x{mult}</span>" for tier, mult in TIER_MULTIPLIER.items())}
+      </p>
+      <div class='table-wrap'>
+        <table>
+          <thead><tr><th>Component</th><th>Signal</th></tr></thead>
+          <tbody>{reward_rubric_rows}</tbody>
+        </table>
+      </div>
+      <p class='sub' style='margin-top:0.75rem'>
+        Full rubric (invalid-action, repeated-lookup, rollback-effective,
+        post-mortem-logged, etc.) is documented in the
+        <a href='{GITHUB_URL}#reward-model' target='_blank' rel='noopener'>README</a>.
+      </p>
     </div>
     <h2>Metadata</h2>
   <footer>
     Incident Command Center v{_CONFIG.version} · Built on
+    <a href='https://github.com/meta-pytorch/openenv' target='_blank' rel='noopener'>OpenEnv</a>
+    · <a href='{GITHUB_URL}' target='_blank' rel='noopener'>Source on GitHub</a>
+    · <a href='{COLAB_URL}' target='_blank' rel='noopener'>Reproduce training on Colab</a>
   </footer>
   <script>

train_trl.py CHANGED

@@ -59,6 +59,7 @@ class EpisodeStats:
     total_reward: float
     steps: int
     success: bool
 # ---------------------------------------------------------------------------
@@ -119,6 +120,7 @@ def rollout(
     coordinator = HeuristicCoordinator()
     records: List[Dict[str, str]] = []
     rewards: List[float] = []
     steps = 0
     step_cap = max_steps if max_steps is not None else MAX_ROLLOUT_STEPS
@@ -143,6 +145,9 @@ def rollout(
             result = env.step(action)
             rewards.append(float(result.reward or 0.0))
     finally:
         try:
             env.close()
@@ -151,11 +156,15 @@ def rollout(
     total_reward = sum(rewards)
     success = total_reward > 0.0
-    return (
-        EpisodeStats(policy_name, task_name, total_reward, steps, success),
-        records,
-        rewards,
     )
 def build_training_dataset(episodes_per_task: int = EPISODES_PER_TASK) -> Dataset:
@@ -259,7 +268,12 @@ def run_trl_sft(dataset: Dataset) -> Path:
     SFT_MODEL_DIR.mkdir(parents=True, exist_ok=True)
     trainer.save_model(str(SFT_MODEL_DIR))
     tokenizer.save_pretrained(str(SFT_MODEL_DIR))
     print(f"[train] Saved SFT checkpoint to {SFT_MODEL_DIR}")
     del trainer, model, tokenizer
     _free_gpu_memory()
@@ -306,6 +320,7 @@ def _evaluate_single_policy(
     policy_name: str,
     select_fn: Callable[[IncidentObservation], IncidentAction],
     max_steps: Optional[int] = None,
 ) -> List[float]:
     scores: List[float] = []
     for task in ["easy", "medium", "hard"]:
@@ -320,17 +335,20 @@ def _evaluate_single_policy(
             f"reward={stats.total_reward:+.2f} steps={stats.steps}"
         )
         scores.append(round(stats.total_reward, 4))
     return scores
 def evaluate_policies(
     seed: int = 7,
     evaluate_llms: Optional[bool] = None,
-) -> Dict[str, List[float]]:
     """Run each policy once per task under the same seed.
-    The random policy is seeded for reproducibility. The heuristic policy is
-    deterministic already. LLM policies are evaluated with greedy decoding.
     """
     random.seed(seed)
@@ -340,30 +358,43 @@ def evaluate_policies(
         "base_model": [],
         "sft_model": [],
     }
     for task in ["easy", "medium", "hard"]:
         random_stats, _, _ = rollout("random", task)
         heuristic_stats, _, _ = rollout("heuristic", task)
         scores["random"].append(round(random_stats.total_reward, 4))
         scores["heuristic"].append(round(heuristic_stats.total_reward, 4))
     should_eval_llms = _should_evaluate_llms() if evaluate_llms is None else evaluate_llms
     if not should_eval_llms:
         print("[eval] Skipping LLM evaluation (no GPU or EVAL_LLM_MODELS=false).")
-        return scores
     try:
         from llm_policy import LLMPolicy
     except Exception as exc:  # pragma: no cover - import-time safety
         print(f"[eval] Could not import LLMPolicy ({exc}); skipping LLM eval.")
-        return scores
     # Base model
     try:
         print(f"[eval] Loading BASE model: {BASE_MODEL}")
         base = LLMPolicy(BASE_MODEL, label="base_model")
         scores["base_model"] = _evaluate_single_policy(
-            "base_model", base.select_action, max_steps=MAX_LLM_EVAL_STEPS
         )
         base.release()
         _free_gpu_memory()
@@ -376,7 +407,10 @@ def evaluate_policies(
             print(f"[eval] Loading SFT model: {SFT_MODEL_DIR}")
             sft = LLMPolicy(str(SFT_MODEL_DIR), label="sft_model")
             scores["sft_model"] = _evaluate_single_policy(
-                "sft_model", sft.select_action, max_steps=MAX_LLM_EVAL_STEPS
             )
             sft.release()
             _free_gpu_memory()
@@ -385,7 +419,122 @@ def evaluate_policies(
     else:
         print(f"[eval] No SFT checkpoint found at {SFT_MODEL_DIR}; skipping SFT eval.")
-    return scores
 def plot_rewards(score_map: Dict[str, List[float]]) -> None:
@@ -423,8 +572,13 @@ def main() -> None:
     dataset.save_to_disk(str(ARTIFACT_DIR / "trl_dataset"))
     run_trl_sft(dataset)
-    scores = evaluate_policies()
     plot_rewards(scores)
     summary = {
         "base_model": BASE_MODEL,
@@ -442,6 +596,11 @@ def main() -> None:
             round(h - r, 4)
             for h, r in zip(scores.get("heuristic", []), scores.get("random", []))
         ],
     }
     with open(ARTIFACT_DIR / "summary_metrics.json", "w", encoding="utf-8") as f:
         json.dump(summary, f, indent=2)

     total_reward: float
     steps: int
     success: bool
+    components: Optional[Dict[str, float]] = None  # per-component reward totals; None if not recorded
 # ---------------------------------------------------------------------------
     coordinator = HeuristicCoordinator()
     records: List[Dict[str, str]] = []
     rewards: List[float] = []
+    components_sum: Dict[str, float] = {}
     steps = 0
     step_cap = max_steps if max_steps is not None else MAX_ROLLOUT_STEPS
             result = env.step(action)
             rewards.append(float(result.reward or 0.0))
+            step_components = getattr(result.observation, "reward_components", None) or {}
+            for key, value in step_components.items():
+                components_sum[key] = components_sum.get(key, 0.0) + float(value)
     finally:
         try:
             env.close()
     total_reward = sum(rewards)
     success = total_reward > 0.0
+    stats = EpisodeStats(
+        policy_name=policy_name,
+        task_name=task_name,
+        total_reward=total_reward,
+        steps=steps,
+        success=success,
+        components={k: round(v, 4) for k, v in components_sum.items()},
     )
+    return (stats, records, rewards)
 def build_training_dataset(episodes_per_task: int = EPISODES_PER_TASK) -> Dataset:
     SFT_MODEL_DIR.mkdir(parents=True, exist_ok=True)
     trainer.save_model(str(SFT_MODEL_DIR))
     tokenizer.save_pretrained(str(SFT_MODEL_DIR))
+    log_path = ARTIFACT_DIR / "training_log.json"
+    with log_path.open("w", encoding="utf-8") as f:
+        json.dump(trainer.state.log_history, f, indent=2, default=str)
     print(f"[train] Saved SFT checkpoint to {SFT_MODEL_DIR}")
+    print(f"[train] Saved training log to {log_path}")
     del trainer, model, tokenizer
     _free_gpu_memory()
     policy_name: str,
     select_fn: Callable[[IncidentObservation], IncidentAction],
     max_steps: Optional[int] = None,
+    components_accumulator: Optional[Dict[str, float]] = None,
 ) -> List[float]:
     scores: List[float] = []
     for task in ["easy", "medium", "hard"]:
             f"reward={stats.total_reward:+.2f} steps={stats.steps}"
         )
         scores.append(round(stats.total_reward, 4))
+        if components_accumulator is not None and stats.components:
+            for k, v in stats.components.items():
+                components_accumulator[k] = components_accumulator.get(k, 0.0) + v
     return scores
 def evaluate_policies(
     seed: int = 7,
     evaluate_llms: Optional[bool] = None,
+) -> Dict[str, object]:
     """Run each policy once per task under the same seed.
+    Returns a dict with keys ``scores`` (mapping policy -> [easy, medium, hard])
+    and ``components`` (mapping policy -> {component_name: summed_value}).
     """
     random.seed(seed)
         "base_model": [],
         "sft_model": [],
     }
+    components: Dict[str, Dict[str, float]] = {
+        "random": {},
+        "heuristic": {},
+        "base_model": {},
+        "sft_model": {},
+    }
     for task in ["easy", "medium", "hard"]:
         random_stats, _, _ = rollout("random", task)
         heuristic_stats, _, _ = rollout("heuristic", task)
         scores["random"].append(round(random_stats.total_reward, 4))
         scores["heuristic"].append(round(heuristic_stats.total_reward, 4))
+        for k, v in (random_stats.components or {}).items():
+            components["random"][k] = components["random"].get(k, 0.0) + v
+        for k, v in (heuristic_stats.components or {}).items():
+            components["heuristic"][k] = components["heuristic"].get(k, 0.0) + v
     should_eval_llms = _should_evaluate_llms() if evaluate_llms is None else evaluate_llms
     if not should_eval_llms:
         print("[eval] Skipping LLM evaluation (no GPU or EVAL_LLM_MODELS=false).")
+        return {"scores": scores, "components": components}
     try:
         from llm_policy import LLMPolicy
     except Exception as exc:  # pragma: no cover - import-time safety
         print(f"[eval] Could not import LLMPolicy ({exc}); skipping LLM eval.")
+        return {"scores": scores, "components": components}
     # Base model
     try:
         print(f"[eval] Loading BASE model: {BASE_MODEL}")
         base = LLMPolicy(BASE_MODEL, label="base_model")
         scores["base_model"] = _evaluate_single_policy(
+            "base_model",
+            base.select_action,
+            max_steps=MAX_LLM_EVAL_STEPS,
+            components_accumulator=components["base_model"],
         )
         base.release()
         _free_gpu_memory()
             print(f"[eval] Loading SFT model: {SFT_MODEL_DIR}")
             sft = LLMPolicy(str(SFT_MODEL_DIR), label="sft_model")
             scores["sft_model"] = _evaluate_single_policy(
+                "sft_model",
+                sft.select_action,
+                max_steps=MAX_LLM_EVAL_STEPS,
+                components_accumulator=components["sft_model"],
             )
             sft.release()
             _free_gpu_memory()
     else:
         print(f"[eval] No SFT checkpoint found at {SFT_MODEL_DIR}; skipping SFT eval.")
+    return {"scores": scores, "components": components}
+def plot_training_curve(
+    log_path: Path = ARTIFACT_DIR / "training_log.json",
+    out_path: Path = ARTIFACT_DIR / "training_curve.png",
+) -> None:
+    """Plot loss (and token accuracy, when it is logged for every step) against training step from the TRL log.
+    Supplies the loss half of the hackathon minimum requirement to show BOTH loss and reward plots.
+    """
+    if not log_path.exists():
+        return
+    try:
+        log = json.loads(log_path.read_text(encoding="utf-8"))
+    except Exception:
+        return
+    steps: List[int] = []
+    losses: List[float] = []
+    accs: List[Optional[float]] = []
+    for entry in log:
+        if "loss" not in entry or "step" not in entry:
+            continue
+        try:
+            steps.append(int(entry["step"]))
+            losses.append(float(entry["loss"]))
+            acc = entry.get("mean_token_accuracy")
+            accs.append(float(acc) if acc is not None else None)
+        except Exception:
+            continue
+    if not steps:
+        return
+    fig, ax1 = plt.subplots(figsize=(9, 5))
+    ax1.plot(steps, losses, marker="o", color="tab:blue", label="Training loss", linewidth=2)
+    ax1.set_xlabel("Training step")
+    ax1.set_ylabel("Loss", color="tab:blue")
+    ax1.tick_params(axis="y", labelcolor="tab:blue")
+    ax1.grid(alpha=0.3)
+    if all(a is not None for a in accs):
+        ax2 = ax1.twinx()
+        ax2.plot(
+            steps,
+            accs,
+            marker="^",
+            color="tab:orange",
+            label="Mean token accuracy",
+            linewidth=2,
+        )
+        ax2.set_ylabel("Mean token accuracy", color="tab:orange")
+        ax2.tick_params(axis="y", labelcolor="tab:orange")
+        ax2.set_ylim(0.0, 1.05)
+    plt.title("TRL SFT training curve — loss & token accuracy")
+    plt.tight_layout()
+    plt.savefig(out_path, dpi=160)
+    plt.close()
+def plot_reward_components(
+    components_by_policy: Dict[str, Dict[str, float]],
+    out_path: Path = ARTIFACT_DIR / "reward_components.png",
+) -> None:
+    """Grouped bar chart of reward-component contributions per policy.
+    Visualizes the rubric-based reward signal: where each policy's reward
+    actually comes from (step cost, clue bonus, handoff, mitigation, closure,
+    etc.). Makes the reward design visible to judges at a glance.
+    """
+    if not components_by_policy:
+        return
+    all_keys: List[str] = []
+    for comps in components_by_policy.values():
+        for k in comps:
+            if k not in all_keys:
+                all_keys.append(k)
+    if not all_keys:
+        return
+    policies = list(components_by_policy.keys())
+    n_policies = len(policies)
+    n_keys = len(all_keys)
+    fig, ax = plt.subplots(figsize=(max(10, n_keys * 0.6), 6))
+    bar_width = 0.8 / max(n_policies, 1)
+    colors = {
+        "random": "tab:red",
+        "heuristic": "tab:blue",
+        "base_model": "tab:orange",
+        "sft_model": "tab:green",
+    }
+    for i, policy in enumerate(policies):
+        values = [components_by_policy[policy].get(k, 0.0) for k in all_keys]
+        offsets = [x + i * bar_width - 0.4 + bar_width / 2 for x in range(n_keys)]
+        ax.bar(
+            offsets,
+            values,
+            width=bar_width,
+            label=policy,
+            color=colors.get(policy),
+        )
+    ax.axhline(0, color="gray", linewidth=0.8)
+    ax.set_xticks(range(n_keys))
+    ax.set_xticklabels(all_keys, rotation=35, ha="right")
+    ax.set_ylabel("Summed reward contribution (all tasks)")
+    ax.set_title("Where each policy earns / loses reward — rubric components")
+    ax.legend()
+    ax.grid(axis="y", alpha=0.3)
+    plt.tight_layout()
+    plt.savefig(out_path, dpi=160)
+    plt.close()
 def plot_rewards(score_map: Dict[str, List[float]]) -> None:
     dataset.save_to_disk(str(ARTIFACT_DIR / "trl_dataset"))
     run_trl_sft(dataset)
+    eval_out = evaluate_policies()
+    scores: Dict[str, List[float]] = eval_out["scores"]  # type: ignore[assignment]
+    components: Dict[str, Dict[str, float]] = eval_out["components"]  # type: ignore[assignment]
     plot_rewards(scores)
+    plot_training_curve()
+    plot_reward_components(components)
     summary = {
         "base_model": BASE_MODEL,
             round(h - r, 4)
             for h, r in zip(scores.get("heuristic", []), scores.get("random", []))
         ],
+        "reward_components_by_policy": {
+            policy: {k: round(v, 4) for k, v in comps.items()}
+            for policy, comps in components.items()
+            if comps
+        },
     }
     with open(ARTIFACT_DIR / "summary_metrics.json", "w", encoding="utf-8") as f:
         json.dump(summary, f, indent=2)
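
The component bookkeeping this change adds to `rollout()` and `evaluate_policies()` can be sketched in isolation as follows. This is a minimal illustration, not the repository's code: the helper names `accumulate` and `sum_components` are hypothetical, and the component keys and values are made up for the example (the real keys come from the environment's `reward_components` field).

```python
from typing import Dict, Iterable, Optional


def accumulate(total: Dict[str, float], part: Optional[Dict[str, float]]) -> Dict[str, float]:
    """Add each value in `part` to the running sum in `total` (in place)."""
    # A missing or empty breakdown contributes nothing, as in rollout().
    for key, value in (part or {}).items():
        total[key] = total.get(key, 0.0) + float(value)
    return total


def sum_components(steps: Iterable[Optional[Dict[str, float]]]) -> Dict[str, float]:
    """Collapse per-step breakdowns into one rounded per-episode breakdown."""
    total: Dict[str, float] = {}
    for step in steps:
        accumulate(total, step)
    return {key: round(value, 4) for key, value in total.items()}


# Hypothetical per-step breakdowns for two episodes of one policy.
episode_easy = [
    {"step_cost": -0.1, "clue_bonus": 0.5},
    None,  # a step whose observation carried no breakdown
    {"step_cost": -0.1, "closure": 1.0},
]
episode_hard = [
    {"step_cost": -0.1},
    {"step_cost": -0.1, "mitigation": 0.75},
]

# Per-policy totals are the sum of the per-episode totals across tasks.
policy_total: Dict[str, float] = {}
for episode in (episode_easy, episode_hard):
    accumulate(policy_total, sum_components(episode))

print(policy_total)
```

Keeping the sums keyed by component name means a policy that never triggers a component simply has no entry for it, which is why `plot_reward_components` falls back to `0.0` when a key is missing.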