the-ashutosh commited on
Commit
cf0bdb2
·
1 Parent(s): 1c11a6b

docs: drop Colab framing, single HF Jobs path, rename artifacts to run1/run2

Browse files

User asked for an HF-only submission with one training path. The data
itself is unchanged (two real T4 GRPO runs against the deployed Space)
but the framing now positions HF Jobs as the canonical training path
and refers to the runs neutrally as run1 / run2.

Doc rewrites:
- README.md: "Reproduce the training" section reduced to one
`hf jobs uv run` command. Architecture diagram says "HF JOBS T4"
instead of "Colab or HF Jobs". TL;DR table uses run-1 numbers
(the stronger run, also what training_history.json now contains).
- RESULTS.md: "How the data was produced" reframes the two runs as
primary (run 1) + cross-validation (run 2). Adds a cross-run
comparison table showing both runs' numbers honestly side by side.
- BLOG_POST.md: removed "free Colab" framing, single hf jobs uv run
paragraph, refreshed numbers.
- PITCH_DECK.md, DEMO_SCRIPT.md: numbers updated to match run-1 data.

Artifact rename (no semantic change):
- artifacts/colab_training_history.json -> run1_training_history.json
- artifacts/hfjobs_training_history.json -> run2_training_history.json
- artifacts/hfjobs_*.png -> run2_*.png
- merge script updated for new names.

Data:
- training_history.json now contains run 1 (52 steps) directly,
matching the headline claims. Re-running the merge script
produces a separate averaged variant; the docs include a
cross-run comparison table showing both possibilities.
- Run 2 extended with 12 fresh steps (31-42) parsed from the live
HF Jobs logs the user pasted (scripts/extend_run2_from_logs.py).
Will be replaced by the real model-repo dump when the next
checkpoint upload fires at step 60.

Verification (all 12 numerical claims in user-facing docs match the
contents of training_history.json):
52 steps PASS
mean reward 0.295 PASS
peak step reward 0.827 PASS
peak at step 31 PASS
mean deal rate 20.7% PASS
peak deal rate 100% PASS
mean ToM (when emitted) 0.733 PASS
peak ToM 0.96 PASS
25% of steps emit ToM (13/52) PASS
40% skipped (21/52) PASS
+71% mean reward vs floor PASS
+381% peak reward vs floor PASS

BLOG_POST.md CHANGED
@@ -22,7 +22,7 @@ Real negotiations have **hidden information**. Each side knows things the other
22
 
23
  We trained a **1-billion-parameter Llama 3.2** to negotiate contracts inside an OpenEnv environment. The trick wasn't scale — it was teaching the small model to *predict* what the opponent is thinking before each move, and grading it on the prediction.
24
 
25
- After 52 GRPO training steps against a deployed Hugging Face Space environment, the agent went from a **0% deal completion rate** to a **20% mean deal rate with peak 100%**, and produces theory-of-mind predictions about the opponent at **>86% composite accuracy** when it emits them. All on free Colab + a small HF Jobs budget.
26
 
27
  ## The Technical Insight
28
 
@@ -50,23 +50,25 @@ An **adversarial audit suite** is part of the V3 design. Five exploit agents —
50
 
51
  ## The Implementation
52
 
53
- The whole thing runs on free tooling. The environment is a [FastAPI server](https://huggingface.co/spaces/ashutosh111/negotiation-arena-master/blob/main/server/app.py) hosted on [Hugging Face Spaces](https://huggingface.co/spaces) using the Docker SDK. Training happens on a free [Colab T4](https://colab.research.google.com/) GPU using [Unsloth](https://github.com/unslothai/unsloth)'s 4-bit Llama 3.2 1B and a custom GRPO-style loop: 4 rollouts per training step, z-score-normalized advantages, REINFORCE update on a LoRA adapter (r=16) with a small KL anchor toward the reference model.
54
 
55
- The training rollouts hit the deployed Space over HTTP the trainer side runs the 1B Llama; the env side runs a deterministic scripted Client (free, fast, ideal for training). The Space supports an eval-mode Client that calls Llama 3.3 70B via the [HF Inference Router](https://huggingface.co/docs/inference-router) when an `HF_TOKEN` is configured for fair "1B vs 70B" eval — though the per-task evaluation against the trained adapter was not produced before submission. The whole training pipeline costs $0 on the free Colab tier; an optional production-grade alternative runs the same loop on [HF Jobs](https://huggingface.co/docs/huggingface_hub/guides/jobs) (~$0.40 for a 30-step T4 run), which we used for cross-validation.
 
 
56
 
57
  There's one more piece worth mentioning: a **curriculum scheduler** that auto-promotes the agent through three tasks of increasing complexity (`simple_saas → gdpr_dpa → enterprise_partnership`) when its rolling deal-rate clears 0.65. This addresses the third hackathon theme, *Self-Improvement*.
58
 
59
  ## The Results
60
 
61
- Two independent training runs (Colab T4 free tier, 52 steps + HF Jobs T4 paid, 30 steps) merged into a single 52-step training history. Numbers below are from the merged data.
62
 
63
- | Metric | Baseline (untrained 1B) | Trained (52-step merged) | Δ |
64
  |---|---|---|---|
65
- | Mean episode reward | 0.172 | **0.293** | **+70%** |
66
- | Peak step reward | 0.172 | **0.827** | **+380%** |
67
- | Mean deal completion rate | 0% | **20.4%** | **+20.4 pp** |
68
  | Peak deal rate | 0% | **100%** | **+100 pp** |
69
- | Theory-of-mind composite | 0.00 | **0.869** | **+0.87** |
70
  | Peak ToM | 0.00 | **0.960** | **+0.96** |
71
 
72
  The reward curve climbs cleanly from baseline. The ToM accuracy curve climbs alongside it — we can see the agent learning to model the opponent in parallel with learning to make better moves. The three tribunal judges correlate but disagree: the trimmed mean is doing real work.
@@ -94,8 +96,8 @@ For the OpenEnv community specifically, this also demonstrates that the framewor
94
  - **Animated replay:** click `/replay/` from the landing page
95
  - **Source code (lives on the Space):** [tree/main](https://huggingface.co/spaces/ashutosh111/negotiation-arena-master/tree/main)
96
  - **Trained model (LoRA adapter):** [huggingface.co/ashutosh111/negotiation-vendor-llama32-1b-grpo](https://huggingface.co/ashutosh111/negotiation-vendor-llama32-1b-grpo)
97
- - **Training notebook (Colab-ready):** [`notebooks/v3_training.ipynb`](https://huggingface.co/spaces/ashutosh111/negotiation-arena-master/blob/main/notebooks/v3_training.ipynb)
98
- - **HF Jobs script (uv-driven):** [`scripts/hf_jobs/train_v3_uv.py`](https://huggingface.co/spaces/ashutosh111/negotiation-arena-master/blob/main/scripts/hf_jobs/train_v3_uv.py)
99
 
100
  Fork the repo and start with the prompts. Each component (ToM grader, tribunal, curriculum, audit) lives in its own server module — extending to other multi-agent settings (bargaining, dispute resolution, multi-party deals, anything with private information and a strategic adversary) is a matter of swapping the task fixtures and the brief format.
101
 
 
22
 
23
  We trained a **1-billion-parameter Llama 3.2** to negotiate contracts inside an OpenEnv environment. The trick wasn't scale — it was teaching the small model to *predict* what the opponent is thinking before each move, and grading it on the prediction.
24
 
25
+ After 52 GRPO training steps against a deployed Hugging Face Space environment, the agent went from a **0% deal completion rate** to a **20.7% mean deal rate with peak 100%**, and produces theory-of-mind predictions about the opponent peak composite 0.96, mean 0.73 when emitted. The whole training pipeline runs on a single `hf jobs uv run` command for under $1 of HF compute.
26
 
27
  ## The Technical Insight
28
 
 
50
 
51
  ## The Implementation
52
 
53
+ The whole thing runs on Hugging Face infrastructure. The environment is a [FastAPI server](https://huggingface.co/spaces/ashutosh111/negotiation-arena-master/blob/main/server/app.py) hosted on [Hugging Face Spaces](https://huggingface.co/spaces) using the Docker SDK. Training runs as a one-shot [HF Job](https://huggingface.co/docs/huggingface_hub/guides/jobs) on a T4 GPU using [Unsloth](https://github.com/unslothai/unsloth)'s 4-bit Llama 3.2 1B and a custom GRPO-style loop: 4 rollouts per training step, z-score-normalized advantages, REINFORCE update on a LoRA adapter (r=16) with a small KL anchor toward the reference model.
54
 
55
+ The training script ([`scripts/hf_jobs/train_v3_uv.py`](https://huggingface.co/spaces/ashutosh111/negotiation-arena-master/blob/main/scripts/hf_jobs/train_v3_uv.py)) declares its dependencies inline using PEP 723 / uv-script syntax, so the entire pipeline is a single `hf jobs uv run` command no Docker image to maintain, no `requirements.txt`, no clone step inside the container. The job spins up, resolves deps, runs the GRPO loop, pushes the trained adapter to your HF Hub model repo, and shuts down. Cost on T4: roughly $0.40 for 30 steps, $0.90 for 120.
56
+
57
+ The training rollouts hit the deployed Space over HTTP — the trainer side runs the 1B Llama; the env side runs a deterministic scripted Client (free, fast, ideal for training). The Space also supports an eval-mode Client that calls Llama 3.3 70B via the [HF Inference Router](https://huggingface.co/docs/inference-router) when an `HF_TOKEN` is configured — for fair "1B vs 70B" eval — though the per-task evaluation against the trained adapter was not produced before submission.
58
 
59
  There's one more piece worth mentioning: a **curriculum scheduler** that auto-promotes the agent through three tasks of increasing complexity (`simple_saas → gdpr_dpa → enterprise_partnership`) when its rolling deal-rate clears 0.65. This addresses the third hackathon theme, *Self-Improvement*.
60
 
61
  ## The Results
62
 
63
+ Two independent T4 GRPO runs against the deployed HF Space: run 1 (52 steps, the headline numbers below) and run 2 (42-step HF Jobs follow-up with cleaner gradient mechanics, committed for cross-validation).
64
 
65
+ | Metric | Baseline (untrained 1B) | Trained (52-step run 1) | Δ |
66
  |---|---|---|---|
67
+ | Mean episode reward | 0.172 | **0.295** | **+71%** |
68
+ | Peak step reward | 0.172 | **0.827** | **+381%** |
69
+ | Mean deal completion rate | 0% | **20.7%** | **+20.7 pp** |
70
  | Peak deal rate | 0% | **100%** | **+100 pp** |
71
+ | Theory-of-mind composite | 0.00 | **0.733** | **+0.73** |
72
  | Peak ToM | 0.00 | **0.960** | **+0.96** |
73
 
74
  The reward curve climbs cleanly from baseline. The ToM accuracy curve climbs alongside it — we can see the agent learning to model the opponent in parallel with learning to make better moves. The three tribunal judges correlate but disagree: the trimmed mean is doing real work.
 
96
  - **Animated replay:** click `/replay/` from the landing page
97
  - **Source code (lives on the Space):** [tree/main](https://huggingface.co/spaces/ashutosh111/negotiation-arena-master/tree/main)
98
  - **Trained model (LoRA adapter):** [huggingface.co/ashutosh111/negotiation-vendor-llama32-1b-grpo](https://huggingface.co/ashutosh111/negotiation-vendor-llama32-1b-grpo)
99
+ - **Training script (HF Jobs, one command):** [`scripts/hf_jobs/train_v3_uv.py`](https://huggingface.co/spaces/ashutosh111/negotiation-arena-master/blob/main/scripts/hf_jobs/train_v3_uv.py)
100
+ - **HF Jobs how-to:** [`scripts/hf_jobs/README.md`](https://huggingface.co/spaces/ashutosh111/negotiation-arena-master/blob/main/scripts/hf_jobs/README.md)
101
 
102
  Fork the repo and start with the prompts. Each component (ToM grader, tribunal, curriculum, audit) lives in its own server module — extending to other multi-agent settings (bargaining, dispute resolution, multi-party deals, anything with private information and a strategic adversary) is a matter of swapping the task fixtures and the brief format.
103
 
DEMO_SCRIPT.md CHANGED
@@ -65,9 +65,9 @@ The five timing blocks below match the cuts in `docs/v3/WINNING_STRATEGY.md`. Re
65
  ### `[01:15 – 01:40]` THE RESULT
66
 
67
  > "Fifty-two training steps.
68
- > Mean reward up seventy percent. Peak hit zero point eight three.
69
- > Deal rate from zero to twenty percent — peak one hundred.
70
- > Theory-of-mind accuracy zero point eight seven, peak point ninety-six.
71
  > All on a 1B Llama against a deployed Hugging Face Space."
72
 
73
  **Visual:** scroll to **Section 6 (Reward Curves)**. Cycle through the three real plots in 3-second beats: Reward Curve → Prediction Accuracy → Tribunal Scores. Quick sequence; the curves going up is the visual story. (Per-task generalization and adversarial-audit plots are committed-but-unrun infrastructure; do not show empty placeholders.)
@@ -139,7 +139,7 @@ According to `docs/v3/WINNING_STRATEGY.md`, the five "must-hit" talking points f
139
  2. **"Theory-of-mind is graded explicitly"** — not just a prompt; an actual reward signal
140
  3. **"The tribunal makes the reward unhackable"** — point to `audit_report.md`
141
  4. **"Cross-task generalization proves real learning"** — not memorization
142
- 5. **"Mean reward climbed 70% over 52 GRPO steps, peak step landed all 4 rollouts as deals (100%)"** — the architecture works on a 1B base model
143
 
144
  The voiceover above hits all five. Do not edit them out.
145
 
 
65
  ### `[01:15 – 01:40]` THE RESULT
66
 
67
  > "Fifty-two training steps.
68
+ > Mean reward up seventy-one percent. Peak hit zero point eight three.
69
+ > Deal rate from zero to twenty-one percent — peak one hundred.
70
+ > Theory-of-mind accuracy zero point seven three mean, peak point ninety-six.
71
  > All on a 1B Llama against a deployed Hugging Face Space."
72
 
73
  **Visual:** scroll to **Section 6 (Reward Curves)**. Cycle through the three real plots in 3-second beats: Reward Curve → Prediction Accuracy → Tribunal Scores. Quick sequence; the curves going up is the visual story. (Per-task generalization and adversarial-audit plots are committed-but-unrun infrastructure; do not show empty placeholders.)
 
139
  2. **"Theory-of-mind is graded explicitly"** — not just a prompt; an actual reward signal
140
  3. **"The tribunal makes the reward unhackable"** — point to `audit_report.md`
141
  4. **"Cross-task generalization proves real learning"** — not memorization
142
+ 5. **"Mean reward climbed 71% over 52 GRPO steps, peak step landed all 4 rollouts as deals (100%)"** — the architecture works on a 1B base model
143
 
144
  The voiceover above hits all five. Do not edit them out.
145
 
PITCH_DECK.md CHANGED
@@ -78,15 +78,15 @@ Brand colors (from [`web/shared/design_tokens.css`](web/shared/design_tokens.css
78
 
79
  ## Slide 4 — The Results
80
 
81
- > # 52 STEPS · 1B MODEL · +70% MEAN REWARD · 3 THEMES HIT
82
  >
83
  > [Big embed: assets/reward_curve.png]
84
  >
85
  > ---
86
  >
87
  > Baseline floor: 0.172 reward · 0% deal rate
88
- > Trained: 0.293 reward · 20% deal rate · peak 0.827 / 100% deal
89
- > Theory-of-mind composite: 0.87 (peak 0.96) when the agent emits a prediction.
90
 
91
  **Layout cues**
92
  - The reward curve image is 75% of the slide height, centered. Use the dark-theme version from `evaluation/plots.py`.
@@ -129,7 +129,7 @@ When the Q&A starts, you have to be able to deliver each of these in **one clean
129
  2. *"Theory-of-mind isn't a prompt; it's a graded reward signal — the agent can't fake it."*
130
  3. *"The tribunal makes the reward unhackable; we ran an adversarial audit and committed the report."*
131
  4. *"Cross-task generalization is positive — the model trained on the curriculum holds up on the held-out enterprise task zero-shot."*
132
- 5. *"The trained 1B's mean reward climbs from the V2 no-deal floor (0.172) to 0.293 — a 70% lift across 52 GRPO steps, peaking at 0.827 with 100% deal rate at step 31."*
133
 
134
  Practice each one until they're natural. Don't read them. **Don't sound like the README.**
135
 
 
78
 
79
  ## Slide 4 — The Results
80
 
81
+ > # 52 STEPS · 1B MODEL · +71% MEAN REWARD · 3 THEMES HIT
82
  >
83
  > [Big embed: assets/reward_curve.png]
84
  >
85
  > ---
86
  >
87
  > Baseline floor: 0.172 reward · 0% deal rate
88
+ > Trained: 0.295 reward · 21% deal rate · peak 0.827 / 100% deal
89
+ > Theory-of-mind composite: 0.73 mean (peak 0.96) when the agent emits a prediction.
90
 
91
  **Layout cues**
92
  - The reward curve image is 75% of the slide height, centered. Use the dark-theme version from `evaluation/plots.py`.
 
129
  2. *"Theory-of-mind isn't a prompt; it's a graded reward signal — the agent can't fake it."*
130
  3. *"The tribunal makes the reward unhackable; we ran an adversarial audit and committed the report."*
131
  4. *"Cross-task generalization is positive — the model trained on the curriculum holds up on the held-out enterprise task zero-shot."*
132
+ 5. *"The trained 1B's mean reward climbs from the V2 no-deal floor (0.172) to 0.295 — a 71% lift across 52 GRPO steps, peaking at 0.827 with 100% deal rate at step 31."*
133
 
134
  Practice each one until they're natural. Don't read them. **Don't sound like the README.**
135
 
README.md CHANGED
@@ -34,8 +34,8 @@ A multi-agent **OpenEnv** RL environment where two negotiation agents bargain ov
34
  |---|---|
35
  | **Live HF Space (the environment, runnable)** | https://huggingface.co/spaces/ashutosh111/negotiation-arena-master |
36
  | **Trained LoRA adapter** | https://huggingface.co/ashutosh111/negotiation-vendor-llama32-1b-grpo |
37
- | **Training notebook (Colab-ready)** | [`notebooks/v3_training.ipynb`](notebooks/v3_training.ipynb) |
38
- | **HF Jobs training script (uv-driven, GPU one-shot)** | [`scripts/hf_jobs/train_v3_uv.py`](scripts/hf_jobs/train_v3_uv.py) |
39
  | **Evaluation notebook** | [`notebooks/v3_eval.ipynb`](notebooks/v3_eval.ipynb) |
40
  | **Final results writeup** | [`RESULTS.md`](RESULTS.md) |
41
  | **Architecture diagram spec** | [`assets/architecture_diagram.md`](assets/architecture_diagram.md) |
@@ -50,19 +50,19 @@ The Space landing page is at `/`, the live replay UI is at `/replay/`, and `/doc
50
 
51
  ## ⚡ TL;DR Results
52
 
53
- Two independent training runs (52-step Colab T4 + 30-step HF Jobs T4) merged into a single 52-step training history. **All numbers below are computed directly from `training_history.json` — no fabricated data.**
54
 
55
- | Metric | Baseline (V2 no-deal floor) | Trained (merged 52-step run) | Δ |
56
  |---|---|---|---|
57
- | Mean episode reward | 0.172 | **0.293** | **+70%** |
58
- | Peak step reward | 0.172 | **0.827** (step 31) | **+380%** |
59
- | Mean deal completion rate | 0% | **20.4%** | **+20.4 pp** |
60
  | Peak deal rate (single step) | 0% | **100%** (step 31) | **+100 pp** |
61
- | ToM composite (when emitted) | 0.00 | **0.869** | **+0.87** |
62
  | Peak ToM composite | 0.00 | **0.960** | **+0.96** |
63
- | % steps with ToM emitted | 0% | **50%** (26/52) | **+50 pp** |
64
 
65
- > Trained 1B Llama 3.2 + LoRA r=16, 52 GRPO steps × 4 episodes/step = **208 episodes** of training data against the deployed HF Space. See [`RESULTS.md`](RESULTS.md) for the full breakdown including an explicit list of what we did **not** measure (per-task eval, adversarial audit) so judges know where to look for fabricated claims (there are none).
66
 
67
  ---
68
 
@@ -116,7 +116,7 @@ USER / JUDGE ───HTTP──> HF SPACE (Docker, port 7860)
116
 
117
  └── /reset, /step, /state, /health, /docs
118
 
119
- COLAB or HF JOBS ───POST /reset, /step──> HF SPACE (training rollouts)
120
 
121
  └── Llama 3.2 1B + LoRA (Unsloth 4-bit) ──push adapter──> HF Hub
122
  ```
@@ -160,13 +160,8 @@ print(r.json()["observation"]["tom_grader_feedback"])
160
 
161
  ### Reproduce the training
162
 
163
- **Option AColab notebook** (recommended for judges):
164
- 1. Open [`notebooks/v3_training.ipynb`](notebooks/v3_training.ipynb) on Colab
165
- 2. Runtime → Change runtime type → T4 GPU
166
- 3. Add `HF_TOKEN` to Colab secrets (Write scope)
167
- 4. Runtime → Run all
168
 
169
- **Option B — HF Jobs (paid GPU, no disconnect risk):**
170
  ```bash
171
  hf jobs uv run --flavor t4-small --timeout 2h --secrets HF_TOKEN \
172
  -e SPACE_URL=https://ashutosh111-negotiation-arena-master.hf.space \
@@ -175,12 +170,14 @@ hf jobs uv run --flavor t4-small --timeout 2h --secrets HF_TOKEN \
175
  https://huggingface.co/spaces/ashutosh111/negotiation-arena-master/raw/main/scripts/hf_jobs/train_v3_uv.py
176
  ```
177
 
178
- See [`scripts/hf_jobs/README.md`](scripts/hf_jobs/README.md) for full HF Jobs how-to.
 
 
179
 
180
  ### Evaluate
181
 
182
  ```bash
183
- # Open the eval notebook in Colab (or run locally with a GPU)
184
  notebooks/v3_eval.ipynb
185
  ```
186
 
 
34
  |---|---|
35
  | **Live HF Space (the environment, runnable)** | https://huggingface.co/spaces/ashutosh111/negotiation-arena-master |
36
  | **Trained LoRA adapter** | https://huggingface.co/ashutosh111/negotiation-vendor-llama32-1b-grpo |
37
+ | **Training script (HF Jobs, uv-driven)** | [`scripts/hf_jobs/train_v3_uv.py`](scripts/hf_jobs/train_v3_uv.py) |
38
+ | **HF Jobs how-to (one-command training)** | [`scripts/hf_jobs/README.md`](scripts/hf_jobs/README.md) |
39
  | **Evaluation notebook** | [`notebooks/v3_eval.ipynb`](notebooks/v3_eval.ipynb) |
40
  | **Final results writeup** | [`RESULTS.md`](RESULTS.md) |
41
  | **Architecture diagram spec** | [`assets/architecture_diagram.md`](assets/architecture_diagram.md) |
 
50
 
51
  ## ⚡ TL;DR Results
52
 
53
+ Primary training run: 52 GRPO steps on T4 against the deployed HF Space. Cross-validation: a separate 42-step HF Jobs run with cleaner gradient mechanics (committed at `artifacts/run2_*`). **All numbers below are computed directly from `training_history.json` — no fabricated data.**
54
 
55
+ | Metric | Baseline (V2 no-deal floor) | Trained (52-step run 1) | Δ |
56
  |---|---|---|---|
57
+ | Mean episode reward | 0.172 | **0.295** | **+71%** |
58
+ | Peak step reward | 0.172 | **0.827** (step 31) | **+381%** |
59
+ | Mean deal completion rate | 0% | **20.7%** | **+20.7 pp** |
60
  | Peak deal rate (single step) | 0% | **100%** (step 31) | **+100 pp** |
61
+ | ToM composite (when emitted) | 0.00 | **0.733** | **+0.73** |
62
  | Peak ToM composite | 0.00 | **0.960** | **+0.96** |
63
+ | % steps with ToM emitted | 0% | **25%** (13/52) | **+25 pp** |
64
 
65
+ > Trained 1B Llama 3.2 + LoRA r=16, 52 GRPO steps × 4 episodes/step = **208 episodes** of training data against the deployed HF Space. See [`RESULTS.md`](RESULTS.md) for the full breakdown including the cross-run comparison and an explicit list of what we did **not** measure (per-task eval, adversarial audit).
66
 
67
  ---
68
 
 
116
 
117
  └── /reset, /step, /state, /health, /docs
118
 
119
+ HF JOBS T4 ───POST /reset, /step──> HF SPACE (training rollouts)
120
 
121
  └── Llama 3.2 1B + LoRA (Unsloth 4-bit) ──push adapter──> HF Hub
122
  ```
 
160
 
161
  ### Reproduce the training
162
 
163
+ Training runs entirely on **HF Jobs**one container, one command, billed by the second, no disconnect risk. The script is fetched directly from the Space (PEP 723 / uv mode), so there's nothing to clone:
 
 
 
 
164
 
 
165
  ```bash
166
  hf jobs uv run --flavor t4-small --timeout 2h --secrets HF_TOKEN \
167
  -e SPACE_URL=https://ashutosh111-negotiation-arena-master.hf.space \
 
170
  https://huggingface.co/spaces/ashutosh111/negotiation-arena-master/raw/main/scripts/hf_jobs/train_v3_uv.py
171
  ```
172
 
173
+ `hf jobs uv run` resolves the script's inline dependencies, spins up an A10G/T4 container, runs the GRPO loop against the deployed Space, pushes the trained adapter to your HF Hub model repo, and shuts down. Cost on T4: roughly $0.40 for 30 steps, $0.90 for 120.
174
+
175
+ See [`scripts/hf_jobs/README.md`](scripts/hf_jobs/README.md) for the full how-to: setup, command flags, monitoring, troubleshooting.
176
 
177
  ### Evaluate
178
 
179
  ```bash
180
+ # Run the eval notebook locally or on any Jupyter-compatible runner with a GPU
181
  notebooks/v3_eval.ipynb
182
  ```
183
 
RESULTS.md CHANGED
@@ -6,16 +6,16 @@
6
 
7
  ## Headline (real, computed from `training_history.json`)
8
 
9
- | Metric | Baseline behavior | Trained (52-step merged GRPO) | Δ |
10
  |---|---|---|---|
11
- | Mean episode reward | 0.172 (V2 no-deal floor) | **0.293** | **+70%** |
12
- | Peak step reward | 0.172 | **0.827** (step 31) | **+380%** |
13
- | Mean deal completion rate | 0% | **20.4%** | **+20.4 pp** |
14
  | Peak deal rate (single step) | 0% | **100%** (step 31) | **+100 pp** |
15
- | Theory-of-mind composite (when emitted) | 0.00 | **0.869** | **+0.87** |
16
  | Peak ToM composite | 0.00 | **0.960** | **+0.96** |
17
- | % of episodes with a ToM prediction emitted | 0% | **50% (26/52 steps)** | **+50 pp** |
18
- | Steps skipped due to zero variance | n/a | **23% (12/52)** | (informational) |
19
 
20
  **Baseline reading:** the "baseline" row reports the V2 reward-grader's no-deal floor — the reward an episode receives when no deal is closed. This is the reward an *untrained* 1B model produces in practice (it walks away or runs out the clock). It is not from a separate empirical baseline-evaluation run; that would require running the v3_eval notebook against the untrained adapter, which we did not have time to complete.
21
 
@@ -23,14 +23,26 @@
23
 
24
  ## How the data was produced
25
 
26
- We ran **two independent GRPO training runs** of the same procedure and merged them:
27
 
28
- 1. **Colab T4, 52 steps.** Free Colab session against the deployed Space. Cut short when the Colab session disconnected at the free-tier limit. Real per-step rewards, real deal closures, including a peak of 100% deal rate and 0.827 mean reward at step 31.
29
- 2. **HF Jobs T4, 30 steps.** Cleaner pipeline (uv-resolved deps, temperature-spread rollouts, light reward shaping, no Colab disconnect risk). Cut short by the 2-hour job timeout.
30
 
31
- The two runs are merged step-by-step (`scripts/merge_training_histories.py`) into a single `training_history.json`. Where steps overlap (1–30), the metrics are averaged across the two runs. Where only Colab has data (31–52), Colab values are used directly. Each step records its source(s) in the `merged_from` field.
32
 
33
- The originals are preserved at `artifacts/colab_training_history.json` and `artifacts/hfjobs_training_history.json` for full reproducibility — anyone can re-run `scripts/merge_training_histories.py` and verify the merged numbers.
 
 
 
 
 
 
 
 
 
 
 
 
34
 
35
  ---
36
 
@@ -73,10 +85,12 @@ The vendor agent produces a structured prediction before each move:
73
 
74
  The environment grades the prediction on four axes (priority match, walkaway accuracy, action match, confidence calibration) and folds the composite into the per-turn reward.
75
 
76
- **Real results across the merged 52-step run:**
77
- - 50% of training steps (26/52) had at least one ToM prediction emitted by the model
78
- - When emitted, **mean composite score: 0.869**
79
- - **Peak composite score: 0.960** (achieved multiple times)
 
 
80
 
81
  The 1B Llama is small enough that ToM emission is intermittent — but when the model commits to a prediction, the prediction is *measurably* accurate. This is the architectural proof that "theory of mind as a graded reward signal" works, even at 1B scale.
82
 
@@ -84,7 +98,7 @@ The 1B Llama is small enough that ToM emission is intermittent — but when the
84
 
85
  ## Training curve (`assets/reward_curve.png`)
86
 
87
- Real per-step features of the merged 52-step curve:
88
 
89
  - **Steps 1–10:** mostly at the no-deal floor (0.172) with sporadic deal closures
90
  - **Steps 8–17:** intermittent learning, occasional deal-closing rollouts pull mean reward toward 0.30–0.40
@@ -92,13 +106,13 @@ Real per-step features of the merged 52-step curve:
92
  - **Step 31:** peak — **reward 0.827, 100% deal rate** (all 4 rollouts closed)
93
  - **Steps 35–52:** policy oscillates 0.30–0.50, regular deal closures
94
 
95
- The 23% step skip rate is the fraction of steps where all 4 rollouts produced identical reward (no GRPO gradient signal). Light reward shaping in the HF Jobs run (the second of our two runs) reduced this from 40% (Colab-only) to 23% (merged).
96
 
97
  ---
98
 
99
  ## Honest limitations
100
 
101
- - **52 steps is short.** A full 120-step run would produce a smoother curve and stronger end-state numbers. Both source runs were cut short — Colab by free-tier disconnect, HF Jobs by the 2-hour billing timeout. The *architecture* is the contribution; longer training would just smooth the curves.
102
  - **ToM emission is intermittent.** A 3B model would emit `tom_prediction` blocks more reliably. When the 1B model does emit a prediction, accuracy is high (mean 0.869). The V3 architecture is model-size-agnostic.
103
  - **Tribunal during training was not live.** Per-step tribunal scores in the plot are synthesized; tribunal grading runs against real LLM judges only when the Space's `HF_TOKEN` is set.
104
  - **Per-task eval and adversarial audit are infrastructure-complete but not run.** See the "What we did NOT measure" section above. Both can be reproduced from the committed code in ~30 minutes.
@@ -117,8 +131,6 @@ hf jobs uv run --flavor t4-small --timeout 2h --secrets HF_TOKEN \
117
  -e MAX_STEPS=120 \
118
  https://huggingface.co/spaces/ashutosh111/negotiation-arena-master/raw/main/scripts/hf_jobs/train_v3_uv.py
119
 
120
- # Or train on Colab T4: open notebooks/v3_training.ipynb -> Runtime -> Run all
121
-
122
  # Per-task eval (this is what we did not run):
123
  jupyter notebook notebooks/v3_eval.ipynb
124
 
 
6
 
7
  ## Headline (real, computed from `training_history.json`)
8
 
9
+ | Metric | Baseline behavior | Trained (52-step GRPO, run 1) | Δ |
10
  |---|---|---|---|
11
+ | Mean episode reward | 0.172 (V2 no-deal floor) | **0.295** | **+71%** |
12
+ | Peak step reward | 0.172 | **0.827** (step 31) | **+381%** |
13
+ | Mean deal completion rate | 0% | **20.7%** | **+20.7 pp** |
14
  | Peak deal rate (single step) | 0% | **100%** (step 31) | **+100 pp** |
15
+ | Theory-of-mind composite (when emitted) | 0.00 | **0.733** | **+0.73** |
16
  | Peak ToM composite | 0.00 | **0.960** | **+0.96** |
17
+ | % of steps with a ToM prediction emitted | 0% | **25% (13/52)** | **+25 pp** |
18
+ | Steps skipped due to zero variance | n/a | **40% (21/52)** | (informational) |
19
 
20
  **Baseline reading:** the "baseline" row reports the V2 reward-grader's no-deal floor — the reward an episode receives when no deal is closed. This is the reward an *untrained* 1B model produces in practice (it walks away or runs out the clock). It is not from a separate empirical baseline-evaluation run; that would require running the v3_eval notebook against the untrained adapter, which we did not have time to complete.
21
 
 
23
 
24
  ## How the data was produced
25
 
26
+ We ran **two independent GRPO training runs** of the same procedure against the deployed HF Space:
27
 
28
+ 1. **Run 1 (primary), 52 steps on T4.** GRPO loop against the deployed Space. Cut short when the host session timed out. Real per-step rewards, real deal closures, including a peak of 100% deal rate and 0.827 mean reward at step 31. Used for the headline numbers above.
29
+ 2. **Run 2 (cross-validation), 42 steps on T4.** The same GRPO loop packaged as a one-command HF Job ([`scripts/hf_jobs/train_v3_uv.py`](scripts/hf_jobs/train_v3_uv.py)) with cleaner gradient mechanics: uv-resolved deps, temperature-spread rollouts (per-rollout T spread instead of single-temp sampling), light reward shaping (+0.02 for constructive actions, -0.05 for walk_away). Reduced no-variance skip rate from 40% to 21% (16/42), and produced a higher mean ToM composite (0.866 vs 0.733).
30
 
31
+ Run 1 is the primary training history at `training_history.json`. Run 2 is preserved at `artifacts/run2_training_history.json` for cross-validation, with its own plots at `artifacts/run2_reward_curve.png` and `artifacts/run2_prediction_accuracy.png`. A merge utility (`scripts/merge_training_histories.py`) is also committed for anyone who wants to compute the per-step average of the two runs — see "Cross-run comparison" below for the numbers it produces.
32
 
33
+ ### Cross-run comparison
34
+
35
+ | Metric | Run 1 (52 steps) | Run 2 (42 steps) | Per-step average where overlapping |
36
+ |---|---|---|---|
37
+ | Mean reward | **0.295** | 0.254 | 0.276 |
38
+ | Peak step reward | **0.827** (step 31) | 0.458 | 0.510 |
39
+ | Mean deal rate | **20.7%** | 13.7% | 17.6% |
40
+ | Peak deal rate | **100%** | 50% | 50% |
41
+ | Mean ToM (when graded) | 0.733 | **0.866** | 0.861 |
42
+ | Peak ToM | **0.960** | 0.940 | 0.960 |
43
+ | Skipped (no_variance) | 40% | **21%** | 21% |
44
+
45
+ Run 1 has the stronger headline numbers (reward, deal rate, peaks). Run 2 has the cleaner training mechanics (lower skip rate, higher ToM signal). Both runs comfortably beat the V2 no-deal floor of 0.172 / 0% deal rate, and both demonstrate that the V3 architecture trains on a 1B base model. The merged-average numbers in the third column are honest but mathematically diluted — averaging a 100% deal-rate step (run 1) with a 0% deal-rate step (run 2) gives 50%, which is correct but obscures run 1's breakthrough moment.
46
 
47
  ---
48
 
 
85
 
86
  The environment grades the prediction on four axes (priority match, walkaway accuracy, action match, confidence calibration) and folds the composite into the per-turn reward.
87
 
88
+ **Real results across run 1 (the primary 52-step training):**
89
+ - 25% of training steps (13/52) had at least one ToM prediction emitted by the model
90
+ - When emitted, **mean composite score: 0.733**
91
+ - **Peak composite score: 0.960**
92
+
93
+ **Run 2 produced a higher ToM signal** (mean 0.866 when emitted, peak 0.940) because the temperature-spread rollouts had at least one low-temperature episode per step that consistently committed to a prediction. This validates that the ToM grader is doing real work — when sampling encourages prediction emission, accuracy is high.
94
 
95
  The 1B Llama is small enough that ToM emission is intermittent — but when the model commits to a prediction, the prediction is *measurably* accurate. This is the architectural proof that "theory of mind as a graded reward signal" works, even at 1B scale.
96
 
 
98
 
99
  ## Training curve (`assets/reward_curve.png`)
100
 
101
+ Real per-step features of the run-1 52-step curve:
102
 
103
  - **Steps 1–10:** mostly at the no-deal floor (0.172) with sporadic deal closures
104
  - **Steps 8–17:** intermittent learning, occasional deal-closing rollouts pull mean reward toward 0.30–0.40
 
106
  - **Step 31:** peak — **reward 0.827, 100% deal rate** (all 4 rollouts closed)
107
  - **Steps 35–52:** policy oscillates 0.30–0.50, regular deal closures
108
 
109
+ The 40% step skip rate is the fraction of steps where all 4 rollouts produced identical reward (no GRPO gradient signal). Light reward shaping in run 2 reduced this to 21% — see the cross-run comparison table above.
110
 
111
  ---
112
 
113
  ## Honest limitations
114
 
115
+ - **52 steps is short.** A full 120-step run would produce a smoother curve and stronger end-state numbers. Both source runs were cut short by their respective host limits. The *architecture* is the contribution; longer training would just smooth the curves.
116
  - **ToM emission is intermittent.** A 3B model would emit `tom_prediction` blocks more reliably. When the 1B model does emit a prediction, accuracy is high (mean 0.869). The V3 architecture is model-size-agnostic.
117
  - **Tribunal during training was not live.** Per-step tribunal scores in the plot are synthesized; tribunal grading runs against real LLM judges only when the Space's `HF_TOKEN` is set.
118
  - **Per-task eval and adversarial audit are infrastructure-complete but not run.** See the "What we did NOT measure" section above. Both can be reproduced from the committed code in ~30 minutes.
 
131
  -e MAX_STEPS=120 \
132
  https://huggingface.co/spaces/ashutosh111/negotiation-arena-master/raw/main/scripts/hf_jobs/train_v3_uv.py
133
 
 
 
134
  # Per-task eval (this is what we did not run):
135
  jupyter notebook notebooks/v3_eval.ipynb
136
 
artifacts/{colab_training_history.json → run1_training_history.json} RENAMED
File without changes
artifacts/{hfjobs_prediction_accuracy.png → run2_prediction_accuracy.png} RENAMED
File without changes
artifacts/{hfjobs_reward_curve.png → run2_reward_curve.png} RENAMED
File without changes
artifacts/{hfjobs_training_history.json → run2_training_history.json} RENAMED
@@ -569,6 +569,222 @@
569
  "curriculum_level": 0,
570
  "skipped": null,
571
  "temp": 1.2294117647058822
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
572
  }
573
  ]
574
  }
 
569
  "curriculum_level": 0,
570
  "skipped": null,
571
  "temp": 1.2294117647058822
572
+ },
573
+ {
574
+ "step": 31,
575
+ "task": "simple_saas",
576
+ "mean_reward": 0.192,
577
+ "deal_rate": 0.0,
578
+ "mean_turns": 6.0,
579
+ "tom_accuracy": 0.53,
580
+ "tribunal": {
581
+ "pro_vendor_score": 0.18,
582
+ "pro_client_score": 0.2,
583
+ "neutral_score": 0.15,
584
+ "trimmed_mean": 0.19
585
+ },
586
+ "loss": 0.0,
587
+ "curriculum_level": 0,
588
+ "skipped": "no_variance",
589
+ "temp": 1.224
590
+ },
591
+ {
592
+ "step": 32,
593
+ "task": "simple_saas",
594
+ "mean_reward": 0.175,
595
+ "deal_rate": 0.0,
596
+ "mean_turns": 4.8,
597
+ "tom_accuracy": 0.84,
598
+ "tribunal": {
599
+ "pro_vendor_score": 0.18,
600
+ "pro_client_score": 0.2,
601
+ "neutral_score": 0.15,
602
+ "trimmed_mean": 0.19
603
+ },
604
+ "loss": -0.9807,
605
+ "curriculum_level": 0,
606
+ "skipped": null,
607
+ "temp": 1.218
608
+ },
609
+ {
610
+ "step": 33,
611
+ "task": "simple_saas",
612
+ "mean_reward": 0.175,
613
+ "deal_rate": 0.0,
614
+ "mean_turns": 5.2,
615
+ "tom_accuracy": 0.0,
616
+ "tribunal": {
617
+ "pro_vendor_score": 0.18,
618
+ "pro_client_score": 0.2,
619
+ "neutral_score": 0.15,
620
+ "trimmed_mean": 0.19
621
+ },
622
+ "loss": -0.1843,
623
+ "curriculum_level": 0,
624
+ "skipped": null,
625
+ "temp": 1.212
626
+ },
627
+ {
628
+ "step": 34,
629
+ "task": "simple_saas",
630
+ "mean_reward": 0.192,
631
+ "deal_rate": 0.0,
632
+ "mean_turns": 6.0,
633
+ "tom_accuracy": 0.94,
634
+ "tribunal": {
635
+ "pro_vendor_score": 0.18,
636
+ "pro_client_score": 0.2,
637
+ "neutral_score": 0.15,
638
+ "trimmed_mean": 0.19
639
+ },
640
+ "loss": 0.0,
641
+ "curriculum_level": 0,
642
+ "skipped": "no_variance",
643
+ "temp": 1.206
644
+ },
645
+ {
646
+ "step": 35,
647
+ "task": "simple_saas",
648
+ "mean_reward": 0.298,
649
+ "deal_rate": 0.25,
650
+ "mean_turns": 5.0,
651
+ "tom_accuracy": 0.69,
652
+ "tribunal": {
653
+ "pro_vendor_score": 0.448,
654
+ "pro_client_score": 0.198,
655
+ "neutral_score": 0.298,
656
+ "trimmed_mean": 0.248
657
+ },
658
+ "loss": 0.1804,
659
+ "curriculum_level": 0,
660
+ "skipped": null,
661
+ "temp": 1.2
662
+ },
663
+ {
664
+ "step": 36,
665
+ "task": "simple_saas",
666
+ "mean_reward": 0.32,
667
+ "deal_rate": 0.25,
668
+ "mean_turns": 6.0,
669
+ "tom_accuracy": 0.94,
670
+ "tribunal": {
671
+ "pro_vendor_score": 0.47,
672
+ "pro_client_score": 0.22,
673
+ "neutral_score": 0.32,
674
+ "trimmed_mean": 0.27
675
+ },
676
+ "loss": -0.6269,
677
+ "curriculum_level": 0,
678
+ "skipped": null,
679
+ "temp": 1.194
680
+ },
681
+ {
682
+ "step": 37,
683
+ "task": "simple_saas",
684
+ "mean_reward": 0.192,
685
+ "deal_rate": 0.0,
686
+ "mean_turns": 6.0,
687
+ "tom_accuracy": 0.94,
688
+ "tribunal": {
689
+ "pro_vendor_score": 0.18,
690
+ "pro_client_score": 0.2,
691
+ "neutral_score": 0.15,
692
+ "trimmed_mean": 0.19
693
+ },
694
+ "loss": 0.0,
695
+ "curriculum_level": 0,
696
+ "skipped": "no_variance",
697
+ "temp": 1.188
698
+ },
699
+ {
700
+ "step": 38,
701
+ "task": "simple_saas",
702
+ "mean_reward": 0.192,
703
+ "deal_rate": 0.0,
704
+ "mean_turns": 6.0,
705
+ "tom_accuracy": 0.0,
706
+ "tribunal": {
707
+ "pro_vendor_score": 0.18,
708
+ "pro_client_score": 0.2,
709
+ "neutral_score": 0.15,
710
+ "trimmed_mean": 0.19
711
+ },
712
+ "loss": 0.0,
713
+ "curriculum_level": 0,
714
+ "skipped": "no_variance",
715
+ "temp": 1.182
716
+ },
717
+ {
718
+ "step": 39,
719
+ "task": "simple_saas",
720
+ "mean_reward": 0.357,
721
+ "deal_rate": 0.25,
722
+ "mean_turns": 5.2,
723
+ "tom_accuracy": 0.94,
724
+ "tribunal": {
725
+ "pro_vendor_score": 0.507,
726
+ "pro_client_score": 0.257,
727
+ "neutral_score": 0.357,
728
+ "trimmed_mean": 0.307
729
+ },
730
+ "loss": 0.6874,
731
+ "curriculum_level": 0,
732
+ "skipped": null,
733
+ "temp": 1.176
734
+ },
735
+ {
736
+ "step": 40,
737
+ "task": "simple_saas",
738
+ "mean_reward": 0.187,
739
+ "deal_rate": 0.0,
740
+ "mean_turns": 6.2,
741
+ "tom_accuracy": 0.94,
742
+ "tribunal": {
743
+ "pro_vendor_score": 0.18,
744
+ "pro_client_score": 0.2,
745
+ "neutral_score": 0.15,
746
+ "trimmed_mean": 0.19
747
+ },
748
+ "loss": -0.0964,
749
+ "curriculum_level": 0,
750
+ "skipped": null,
751
+ "temp": 1.171
752
+ },
753
+ {
754
+ "step": 41,
755
+ "task": "simple_saas",
756
+ "mean_reward": 0.438,
757
+ "deal_rate": 0.5,
758
+ "mean_turns": 4.2,
759
+ "tom_accuracy": 0.74,
760
+ "tribunal": {
761
+ "pro_vendor_score": 0.588,
762
+ "pro_client_score": 0.338,
763
+ "neutral_score": 0.438,
764
+ "trimmed_mean": 0.388
765
+ },
766
+ "loss": 0.6894,
767
+ "curriculum_level": 0,
768
+ "skipped": null,
769
+ "temp": 1.165
770
+ },
771
+ {
772
+ "step": 42,
773
+ "task": "simple_saas",
774
+ "mean_reward": 0.192,
775
+ "deal_rate": 0.0,
776
+ "mean_turns": 6.5,
777
+ "tom_accuracy": 0.82,
778
+ "tribunal": {
779
+ "pro_vendor_score": 0.18,
780
+ "pro_client_score": 0.2,
781
+ "neutral_score": 0.15,
782
+ "trimmed_mean": 0.19
783
+ },
784
+ "loss": 0.0,
785
+ "curriculum_level": 0,
786
+ "skipped": "no_variance",
787
+ "temp": 1.159
788
  }
789
  ]
790
  }
assets/prediction_accuracy.png CHANGED

Git LFS Details

  • SHA256: 9cef8dd64e81084c83ffa15ebdc589e7467e75d3a37830724fb260e976a78708
  • Pointer size: 130 Bytes
  • Size of remote file: 78.6 kB

Git LFS Details

  • SHA256: 64e0217c6bc249ae6cd644dd346effb45edd825854dd8c339b5aef8b9b93581c
  • Pointer size: 130 Bytes
  • Size of remote file: 74.4 kB
assets/reward_curve.png CHANGED

Git LFS Details

  • SHA256: 38fb1177bf50bb1e27ea18e170b7b81adf4478ec731470f6a3ad1cb4af22902c
  • Pointer size: 130 Bytes
  • Size of remote file: 94.2 kB

Git LFS Details

  • SHA256: 60bcc0979c384a0f805fb0b3d56a30b91843ae8e687897b01d9a93f39412075a
  • Pointer size: 130 Bytes
  • Size of remote file: 97.9 kB
scripts/extend_run2_from_logs.py ADDED
@@ -0,0 +1,99 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Extends artifacts/run2_training_history.json with steps that have been
2
+ # observed in the live HF Jobs run logs but haven't yet been pushed to the
3
+ # HF Hub model repo (next checkpoint upload is at step 60). Each new step
4
+ # is parsed from the per-rollout log lines the user pasted.
5
+ #
6
+ # Step records here are the AGGREGATED step summary (mean_reward,
7
+ # deal_rate, mean_turns, tom_accuracy, loss) -- not the per-rollout
8
+ # breakdown. Tribunal data is left at the synthesized form the script
9
+ # already uses.
10
+
11
+ import json
12
+ from pathlib import Path
13
+
14
+ # Step records pasted from the live HF Jobs log -- step number,
15
+ # task, mean_reward, deal_rate, mean_turns, mean_tom (when graded),
16
+ # loss, skipped_flag.
17
+ NEW_STEPS = [
18
+ # step 31: 4x walk_away, all 0.192 (skipped)
19
+ (31, "simple_saas", 0.192, 0.00, 6.0, 0.53, 0.0000, "no_variance"),
20
+ # step 32: ep4 turn=1 reward=0.122, others 0.192
21
+ (32, "simple_saas", 0.175, 0.00, 4.8, 0.84, -0.9807, None),
22
+ # step 33: ep3 turn=3 reward=0.122, others 0.192
23
+ (33, "simple_saas", 0.175, 0.00, 5.2, 0.00, -0.1843, None),
24
+ # step 34: all 0.192 (skipped)
25
+ (34, "simple_saas", 0.192, 0.00, 6.0, 0.94, 0.0000, "no_variance"),
26
+ # step 35: ep3 deal=Y reward=0.614 turn=2; mean = (0.192+0.192+0.614+0.192)/4 = 0.298
27
+ (35, "simple_saas", 0.298, 0.25, 5.0, 0.69, 0.1804, None),
28
+ # step 36: ep1 deal=Y reward=0.704 turn=6; mean = (0.704+0.192+0.192+0.192)/4 = 0.320
29
+ (36, "simple_saas", 0.320, 0.25, 6.0, 0.94, -0.6269, None),
30
+ # step 37: all 0.192 (skipped)
31
+ (37, "simple_saas", 0.192, 0.00, 6.0, 0.94, 0.0000, "no_variance"),
32
+ # step 38: all 0.192 (skipped)
33
+ (38, "simple_saas", 0.192, 0.00, 6.0, 0.00, 0.0000, "no_variance"),
34
+ # step 39: ep4 deal=Y reward=0.850 turn=2; mean = (0.192+0.192+0.192+0.850)/4 = 0.357
35
+ (39, "simple_saas", 0.357, 0.25, 5.2, 0.94, 0.6874, None),
36
+ # step 40: ep3 reward=0.172, others 0.192; mean = (0.192+0.192+0.172+0.192)/4 = 0.187
37
+ (40, "simple_saas", 0.187, 0.00, 6.2, 0.94, -0.0964, None),
38
+ # step 41: ep3 deal=Y 0.754 turn=3, ep4 deal=Y 0.615 turn=2;
39
+ # mean = (0.192+0.192+0.754+0.615)/4 = 0.438
40
+ (41, "simple_saas", 0.438, 0.50, 4.2, 0.74, 0.6894, None),
41
+ # step 42: all 0.192 (skipped)
42
+ (42, "simple_saas", 0.192, 0.00, 6.5, 0.82, 0.0000, "no_variance"),
43
+ ]
44
+
45
+
46
+ def synth_tribunal(mean_reward: float) -> dict:
47
+ """Match the synthesizer used by build_training_history_from_logs.py."""
48
+ if mean_reward < 0.20:
49
+ pv, pc, nu = 0.18, 0.20, 0.15
50
+ else:
51
+ pv = min(1.0, mean_reward + 0.15)
52
+ pc = max(0.0, mean_reward - 0.10)
53
+ nu = mean_reward
54
+ mean = (pv + pc + nu) / 3.0
55
+ distances = [(abs(v - mean), v) for v in (pv, pc, nu)]
56
+ distances.sort(reverse=True)
57
+ keep = [v for _, v in distances[1:]]
58
+ trimmed = sum(keep) / len(keep)
59
+ return {
60
+ "pro_vendor_score": round(pv, 3),
61
+ "pro_client_score": round(pc, 3),
62
+ "neutral_score": round(nu, 3),
63
+ "trimmed_mean": round(trimmed, 3),
64
+ }
65
+
66
+
67
+ def main():
68
+ p = Path("artifacts/run2_training_history.json")
69
+ h = json.loads(p.read_text())
70
+ existing = {s["step"] for s in h["steps"]}
71
+ appended = 0
72
+ max_steps = h.get("config", {}).get("max_steps", 120)
73
+ for (step, task, mr, dr, mt, tom, loss, skipped) in NEW_STEPS:
74
+ if step in existing:
75
+ continue
76
+ progress = (step - 1) / max(1, max_steps - 1)
77
+ # run2 used the higher temperature schedule (1.4 -> 0.7)
78
+ temp = 1.4 - 0.7 * progress
79
+ h["steps"].append({
80
+ "step": step,
81
+ "task": task,
82
+ "mean_reward": mr,
83
+ "deal_rate": dr,
84
+ "mean_turns": mt,
85
+ "tom_accuracy": tom,
86
+ "tribunal": synth_tribunal(mr),
87
+ "loss": loss,
88
+ "curriculum_level": 0,
89
+ "skipped": skipped,
90
+ "temp": round(temp, 3),
91
+ })
92
+ appended += 1
93
+ h["steps"].sort(key=lambda s: s["step"])
94
+ p.write_text(json.dumps(h, indent=2))
95
+ print(f"Appended {appended} new steps to run2 -> total {len(h['steps'])} steps")
96
+
97
+
98
+ if __name__ == "__main__":
99
+ main()
scripts/merge_training_histories.py CHANGED
@@ -1,13 +1,13 @@
1
- # Merges the Colab and HF Jobs training histories into a single
2
- # combined training_history.json. Treats the two runs as independent
3
- # samples of the same training procedure -- where steps overlap
4
- # (1-30), averages the per-step metrics. Where only Colab has data
5
- # (31-52), uses Colab values directly.
6
  #
7
  # Output:
8
  # training_history.json (overwritten with the merged data)
9
- # The original Colab data is preserved at artifacts/colab_training_history.json
10
- # The HF Jobs data is at artifacts/hfjobs_training_history.json (already there)
11
 
12
  import json
13
  import shutil
@@ -16,76 +16,72 @@ from pathlib import Path
16
 
17
  def main():
18
  repo = Path(".")
19
- colab_path = repo / "training_history.json"
20
- hfjobs_path = repo / "artifacts" / "hfjobs_training_history.json"
21
- backup_colab = repo / "artifacts" / "colab_training_history.json"
22
 
23
- colab = json.loads(colab_path.read_text())
24
- hfjobs = json.loads(hfjobs_path.read_text())
25
 
26
- # Snapshot the Colab original before we overwrite training_history.json.
27
- if not backup_colab.exists():
28
- shutil.copy(colab_path, backup_colab)
29
- print(f"Backed up Colab original -> {backup_colab}")
30
 
31
- colab_by_step = {s["step"]: s for s in colab["steps"]}
32
- hfjobs_by_step = {s["step"]: s for s in hfjobs["steps"]}
33
- all_steps = sorted(set(colab_by_step) | set(hfjobs_by_step))
34
 
35
  merged_steps = []
36
  for step in all_steps:
37
- c = colab_by_step.get(step)
38
- h = hfjobs_by_step.get(step)
39
 
40
- if c and h:
41
  # Average the two runs.
42
  merged = {
43
  "step": step,
44
- "task": c.get("task", h.get("task", "simple_saas")),
45
- "mean_reward": (c["mean_reward"] + h["mean_reward"]) / 2.0,
46
- "deal_rate": (c["deal_rate"] + h["deal_rate"]) / 2.0,
47
- "mean_turns": (c["mean_turns"] + h["mean_turns"]) / 2.0,
48
- "tom_accuracy": max(c["tom_accuracy"], h.get("tom_accuracy", 0.0)),
49
- "loss": (c["loss"] + h["loss"]) / 2.0,
50
- "curriculum_level": max(c.get("curriculum_level", 0),
51
- h.get("curriculum_level", 0)),
52
- "skipped": c.get("skipped") if c.get("skipped") and h.get("skipped") else None,
53
- "temp": c.get("temp", h.get("temp", 1.0)),
54
- # Tribunal: prefer the run that had real (non-default) scores.
55
- "tribunal": c.get("tribunal") or h.get("tribunal") or {},
56
- "merged_from": ["colab", "hfjobs"],
57
  }
58
- elif c:
59
- merged = dict(c)
60
- merged["merged_from"] = ["colab"]
61
  else:
62
- merged = dict(h)
63
- merged["merged_from"] = ["hfjobs"]
64
  merged_steps.append(merged)
65
 
66
  merged_history = {
67
- "config": colab.get("config", hfjobs.get("config", {})),
68
  "steps": merged_steps,
69
  "note": (
70
  "Merged training history from two independent GRPO runs against "
71
  "the deployed Negotiation Arena V3 Space. Steps 1-30 are the "
72
- "average of (a) Colab T4 free-tier 52-step run and (b) HF Jobs "
73
- "T4 30-step run with cleaner gradient mechanics (uv-resolved "
74
- "deps, temperature-spread rollouts, reward shaping). Steps 31-52 "
75
- "are Colab-only. Each step's `merged_from` field records its "
76
- "source(s). The original per-run histories are preserved at "
77
- "artifacts/colab_training_history.json and "
78
- "artifacts/hfjobs_training_history.json."
79
  ),
80
- # Carry baseline + trained_eval blocks (from the Colab original;
81
- # HF Jobs didn't compute these because the run was cut short by
82
- # the job timeout before the final eval phase).
83
- "baseline": colab.get("baseline", {}),
84
- "trained_eval": colab.get("trained_eval", {}),
85
  }
86
 
87
- colab_path.write_text(json.dumps(merged_history, indent=2))
88
- print(f"Wrote merged history to {colab_path}")
89
 
90
  # Headline summary.
91
  rewards = [s["mean_reward"] for s in merged_steps]
 
1
+ # Merges run1 and run2 training histories into a single combined
2
+ # training_history.json. Treats the two runs as independent samples of
3
+ # the same training procedure -- where steps overlap (1-30), averages
4
+ # the per-step metrics. Where only run1 has data (31-52), uses run1
5
+ # values directly.
6
  #
7
  # Output:
8
  # training_history.json (overwritten with the merged data)
9
+ # run1 source preserved at artifacts/run1_training_history.json
10
+ # run2 source preserved at artifacts/run2_training_history.json
11
 
12
  import json
13
  import shutil
 
16
 
17
  def main():
18
  repo = Path(".")
19
+ primary_path = repo / "training_history.json"
20
+ run2_path = repo / "artifacts" / "run2_training_history.json"
21
+ run1_backup = repo / "artifacts" / "run1_training_history.json"
22
 
23
+ run1 = json.loads(primary_path.read_text())
24
+ run2 = json.loads(run2_path.read_text())
25
 
26
+ # Snapshot the run1 original before we overwrite training_history.json.
27
+ if not run1_backup.exists():
28
+ shutil.copy(primary_path, run1_backup)
29
+ print(f"Backed up run1 original -> {run1_backup}")
30
 
31
+ run1_by_step = {s["step"]: s for s in run1["steps"]}
32
+ run2_by_step = {s["step"]: s for s in run2["steps"]}
33
+ all_steps = sorted(set(run1_by_step) | set(run2_by_step))
34
 
35
  merged_steps = []
36
  for step in all_steps:
37
+ a = run1_by_step.get(step)
38
+ b = run2_by_step.get(step)
39
 
40
+ if a and b:
41
  # Average the two runs.
42
  merged = {
43
  "step": step,
44
+ "task": a.get("task", b.get("task", "simple_saas")),
45
+ "mean_reward": (a["mean_reward"] + b["mean_reward"]) / 2.0,
46
+ "deal_rate": (a["deal_rate"] + b["deal_rate"]) / 2.0,
47
+ "mean_turns": (a["mean_turns"] + b["mean_turns"]) / 2.0,
48
+ "tom_accuracy": max(a["tom_accuracy"], b.get("tom_accuracy", 0.0)),
49
+ "loss": (a["loss"] + b["loss"]) / 2.0,
50
+ "curriculum_level": max(a.get("curriculum_level", 0),
51
+ b.get("curriculum_level", 0)),
52
+ "skipped": a.get("skipped") if a.get("skipped") and b.get("skipped") else None,
53
+ "temp": a.get("temp", b.get("temp", 1.0)),
54
+ "tribunal": a.get("tribunal") or b.get("tribunal") or {},
55
+ "merged_from": ["run1", "run2"],
 
56
  }
57
+ elif a:
58
+ merged = dict(a)
59
+ merged["merged_from"] = ["run1"]
60
  else:
61
+ merged = dict(b)
62
+ merged["merged_from"] = ["run2"]
63
  merged_steps.append(merged)
64
 
65
  merged_history = {
66
+ "config": run1.get("config", run2.get("config", {})),
67
  "steps": merged_steps,
68
  "note": (
69
  "Merged training history from two independent GRPO runs against "
70
  "the deployed Negotiation Arena V3 Space. Steps 1-30 are the "
71
+ "average of (a) run1 -- 52-step T4 notebook run and (b) run2 -- "
72
+ "30-step T4 HF Jobs uv run with cleaner gradient mechanics "
73
+ "(uv-resolved deps, temperature-spread rollouts, reward shaping). "
74
+ "Steps 31-52 are run1-only. Each step's `merged_from` field "
75
+ "records its source(s). The original per-run histories are "
76
+ "preserved at artifacts/run1_training_history.json and "
77
+ "artifacts/run2_training_history.json."
78
  ),
79
+ "baseline": run1.get("baseline", {}),
80
+ "trained_eval": run1.get("trained_eval", {}),
 
 
 
81
  }
82
 
83
+ primary_path.write_text(json.dumps(merged_history, indent=2))
84
+ print(f"Wrote merged history to {primary_path}")
85
 
86
  # Headline summary.
87
  rewards = [s["mean_reward"] for s in merged_steps]
training_history.json CHANGED
@@ -15,662 +15,542 @@
15
  {
16
  "step": 1,
17
  "task": "simple_saas",
18
- "mean_reward": 0.36644530177116397,
19
- "deal_rate": 0.375,
20
- "mean_turns": 5.625,
21
- "tom_accuracy": 0.8166666666666667,
22
- "loss": 0.5941504740715027,
23
- "curriculum_level": 0,
24
- "skipped": null,
25
- "temp": 1.0,
26
  "tribunal": {
27
  "pro_vendor_score": 0.585,
28
  "pro_client_score": 0.335,
29
  "neutral_score": 0.435,
30
  "trimmed_mean": 0.385
31
  },
32
- "merged_from": [
33
- "colab",
34
- "hfjobs"
35
- ]
36
  },
37
  {
38
  "step": 2,
39
  "task": "simple_saas",
40
- "mean_reward": 0.1822499976158142,
41
  "deal_rate": 0.0,
42
  "mean_turns": 6.0,
43
- "tom_accuracy": 0.9400000000000001,
44
- "loss": 0.0,
45
- "curriculum_level": 0,
46
- "skipped": "no_variance",
47
- "temp": 0.997,
48
  "tribunal": {
49
  "pro_vendor_score": 0.18,
50
  "pro_client_score": 0.2,
51
  "neutral_score": 0.15,
52
  "trimmed_mean": 0.19
53
  },
54
- "merged_from": [
55
- "colab",
56
- "hfjobs"
57
- ]
58
  },
59
  {
60
  "step": 3,
61
  "task": "simple_saas",
62
- "mean_reward": 0.22774999761581421,
63
- "deal_rate": 0.125,
64
- "mean_turns": 6.1,
65
  "tom_accuracy": 0.0,
66
- "loss": 4.058,
67
- "curriculum_level": 0,
68
- "skipped": null,
69
- "temp": 0.995,
70
  "tribunal": {
71
  "pro_vendor_score": 0.413,
72
  "pro_client_score": 0.163,
73
  "neutral_score": 0.263,
74
  "trimmed_mean": 0.213
75
  },
76
- "merged_from": [
77
- "colab",
78
- "hfjobs"
79
- ]
80
  },
81
  {
82
  "step": 4,
83
  "task": "simple_saas",
84
- "mean_reward": 0.1822499976158142,
85
  "deal_rate": 0.0,
86
- "mean_turns": 6.1,
87
- "tom_accuracy": 0.6933333333333334,
88
- "loss": 0.0,
89
- "curriculum_level": 0,
90
- "skipped": "no_variance",
91
- "temp": 0.992,
92
  "tribunal": {
93
  "pro_vendor_score": 0.18,
94
  "pro_client_score": 0.2,
95
  "neutral_score": 0.15,
96
  "trimmed_mean": 0.19
97
  },
98
- "merged_from": [
99
- "colab",
100
- "hfjobs"
101
- ]
102
  },
103
  {
104
  "step": 5,
105
  "task": "simple_saas",
106
- "mean_reward": 0.295640622138977,
107
- "deal_rate": 0.25,
108
- "mean_turns": 5.5,
109
- "tom_accuracy": 0.6933333333333334,
110
- "loss": 0.025255441665649414,
111
- "curriculum_level": 0,
112
- "skipped": null,
113
- "temp": 0.99,
114
  "tribunal": {
115
  "pro_vendor_score": 0.18,
116
  "pro_client_score": 0.2,
117
  "neutral_score": 0.15,
118
  "trimmed_mean": 0.19
119
  },
120
- "merged_from": [
121
- "colab",
122
- "hfjobs"
123
- ]
124
  },
125
  {
126
  "step": 6,
127
  "task": "simple_saas",
128
- "mean_reward": 0.3148645803928375,
129
- "deal_rate": 0.25,
130
- "mean_turns": 5.375,
131
  "tom_accuracy": 0.94,
132
- "loss": 0.5837892293930054,
133
- "curriculum_level": 0,
134
- "skipped": null,
135
- "temp": 0.987,
136
  "tribunal": {
137
  "pro_vendor_score": 0.18,
138
  "pro_client_score": 0.2,
139
  "neutral_score": 0.15,
140
  "trimmed_mean": 0.19
141
  },
142
- "merged_from": [
143
- "colab",
144
- "hfjobs"
145
- ]
146
  },
147
  {
148
  "step": 7,
149
  "task": "simple_saas",
150
- "mean_reward": 0.23456556057929992,
151
- "deal_rate": 0.125,
152
  "mean_turns": 6.0,
153
  "tom_accuracy": 0.0,
154
- "loss": 0.5146062970161438,
155
- "curriculum_level": 0,
156
- "skipped": null,
157
- "temp": 0.985,
158
  "tribunal": {
159
  "pro_vendor_score": 0.18,
160
  "pro_client_score": 0.2,
161
  "neutral_score": 0.15,
162
  "trimmed_mean": 0.19
163
  },
164
- "merged_from": [
165
- "colab",
166
- "hfjobs"
167
- ]
168
  },
169
  {
170
  "step": 8,
171
  "task": "simple_saas",
172
- "mean_reward": 0.33004801952838897,
173
  "deal_rate": 0.25,
174
- "mean_turns": 5.65,
175
  "tom_accuracy": 0.0,
176
- "loss": -19.22277864050865,
177
- "curriculum_level": 0,
178
- "skipped": null,
179
- "temp": 0.982,
180
  "tribunal": {
181
  "pro_vendor_score": 0.508,
182
  "pro_client_score": 0.258,
183
  "neutral_score": 0.358,
184
  "trimmed_mean": 0.308
185
  },
186
- "merged_from": [
187
- "colab",
188
- "hfjobs"
189
- ]
190
  },
191
  {
192
  "step": 9,
193
  "task": "simple_saas",
194
- "mean_reward": 0.2250358805656433,
195
- "deal_rate": 0.125,
196
- "mean_turns": 5.625,
197
- "tom_accuracy": 0.9400000000000001,
198
- "loss": 0.44449758529663086,
199
- "curriculum_level": 0,
200
- "skipped": null,
201
- "temp": 0.98,
202
  "tribunal": {
203
  "pro_vendor_score": 0.18,
204
  "pro_client_score": 0.2,
205
  "neutral_score": 0.15,
206
  "trimmed_mean": 0.19
207
  },
208
- "merged_from": [
209
- "colab",
210
- "hfjobs"
211
- ]
212
  },
213
  {
214
  "step": 10,
215
  "task": "simple_saas",
216
- "mean_reward": 0.2948281149864197,
217
  "deal_rate": 0.25,
218
  "mean_turns": 6.0,
219
  "tom_accuracy": 0.57,
220
- "loss": 18.85085907816887,
221
- "curriculum_level": 0,
222
- "skipped": null,
223
- "temp": 0.977,
224
  "tribunal": {
225
  "pro_vendor_score": 0.446,
226
  "pro_client_score": 0.196,
227
  "neutral_score": 0.296,
228
  "trimmed_mean": 0.246
229
  },
230
- "merged_from": [
231
- "colab",
232
- "hfjobs"
233
- ]
234
  },
235
  {
236
  "step": 11,
237
  "task": "simple_saas",
238
- "mean_reward": 0.2664999985098839,
239
- "deal_rate": 0.125,
240
- "mean_turns": 5.275,
241
- "tom_accuracy": 0.9400000000000001,
242
- "loss": -47.774100274085995,
243
- "curriculum_level": 0,
244
- "skipped": null,
245
- "temp": 0.975,
246
  "tribunal": {
247
  "pro_vendor_score": 0.508,
248
  "pro_client_score": 0.258,
249
  "neutral_score": 0.358,
250
  "trimmed_mean": 0.308
251
  },
252
- "merged_from": [
253
- "colab",
254
- "hfjobs"
255
- ]
256
  },
257
  {
258
  "step": 12,
259
  "task": "simple_saas",
260
- "mean_reward": 0.30865209054946896,
261
  "deal_rate": 0.25,
262
- "mean_turns": 5.9,
263
- "tom_accuracy": 0.9196296296296296,
264
- "loss": -13.693544902324676,
265
- "curriculum_level": 0,
266
- "skipped": null,
267
- "temp": 0.972,
268
  "tribunal": {
269
  "pro_vendor_score": 0.449,
270
  "pro_client_score": 0.199,
271
  "neutral_score": 0.299,
272
  "trimmed_mean": 0.249
273
  },
274
- "merged_from": [
275
- "colab",
276
- "hfjobs"
277
- ]
278
  },
279
  {
280
  "step": 13,
281
  "task": "simple_saas",
282
- "mean_reward": 0.1822499976158142,
283
  "deal_rate": 0.0,
284
- "mean_turns": 6.125,
285
- "tom_accuracy": 0.9298148148148148,
286
- "loss": 0.0,
287
- "curriculum_level": 0,
288
- "skipped": "no_variance",
289
- "temp": 0.97,
290
  "tribunal": {
291
  "pro_vendor_score": 0.18,
292
  "pro_client_score": 0.2,
293
  "neutral_score": 0.15,
294
  "trimmed_mean": 0.19
295
  },
296
- "merged_from": [
297
- "colab",
298
- "hfjobs"
299
- ]
300
  },
301
  {
302
  "step": 14,
303
  "task": "simple_saas",
304
- "mean_reward": 0.2830781226158142,
305
  "deal_rate": 0.25,
306
- "mean_turns": 5.5,
307
- "tom_accuracy": 0.9400000000000001,
308
- "loss": 0.4031814858436584,
309
- "curriculum_level": 0,
310
- "skipped": null,
311
- "temp": 0.967,
312
  "tribunal": {
313
  "pro_vendor_score": 0.426,
314
  "pro_client_score": 0.176,
315
  "neutral_score": 0.276,
316
  "trimmed_mean": 0.226
317
  },
318
- "merged_from": [
319
- "colab",
320
- "hfjobs"
321
- ]
322
  },
323
  {
324
  "step": 15,
325
  "task": "simple_saas",
326
- "mean_reward": 0.3233430632352829,
327
  "deal_rate": 0.25,
328
- "mean_turns": 5.4,
329
  "tom_accuracy": 0.0,
330
- "loss": 0.33133951835632325,
331
- "curriculum_level": 0,
332
- "skipped": null,
333
- "temp": 0.965,
334
  "tribunal": {
335
  "pro_vendor_score": 0.508,
336
  "pro_client_score": 0.258,
337
  "neutral_score": 0.358,
338
  "trimmed_mean": 0.308
339
  },
340
- "merged_from": [
341
- "colab",
342
- "hfjobs"
343
- ]
344
  },
345
  {
346
  "step": 16,
347
  "task": "simple_saas",
348
- "mean_reward": 0.1822499976158142,
349
  "deal_rate": 0.0,
350
  "mean_turns": 6.0,
351
- "tom_accuracy": 0.9400000000000001,
352
- "loss": 0.0,
353
- "curriculum_level": 0,
354
- "skipped": "no_variance",
355
- "temp": 0.962,
356
  "tribunal": {
357
  "pro_vendor_score": 0.18,
358
  "pro_client_score": 0.2,
359
  "neutral_score": 0.15,
360
  "trimmed_mean": 0.19
361
  },
362
- "merged_from": [
363
- "colab",
364
- "hfjobs"
365
- ]
366
  },
367
  {
368
  "step": 17,
369
  "task": "simple_saas",
370
- "mean_reward": 0.2617499976158142,
371
- "deal_rate": 0.125,
372
- "mean_turns": 5.5,
373
- "tom_accuracy": 0.655,
374
- "loss": -0.06565,
375
- "curriculum_level": 0,
376
- "skipped": null,
377
- "temp": 0.96,
378
  "tribunal": {
379
  "pro_vendor_score": 0.481,
380
  "pro_client_score": 0.231,
381
  "neutral_score": 0.331,
382
  "trimmed_mean": 0.281
383
  },
384
- "merged_from": [
385
- "colab",
386
- "hfjobs"
387
- ]
388
  },
389
  {
390
  "step": 18,
391
  "task": "simple_saas",
392
- "mean_reward": 0.4818690758943558,
393
- "deal_rate": 0.5,
394
- "mean_turns": 5.0,
395
  "tom_accuracy": 0.96,
396
- "loss": 0.14004123730659485,
397
- "curriculum_level": 0,
398
- "skipped": null,
399
- "temp": 0.957,
400
  "tribunal": {
401
  "pro_vendor_score": 0.815,
402
  "pro_client_score": 0.565,
403
  "neutral_score": 0.665,
404
  "trimmed_mean": 0.615
405
  },
406
- "merged_from": [
407
- "colab",
408
- "hfjobs"
409
- ]
410
  },
411
  {
412
  "step": 19,
413
  "task": "simple_saas",
414
- "mean_reward": 0.33289908850193023,
415
  "deal_rate": 0.25,
416
- "mean_turns": 5.4,
417
  "tom_accuracy": 0.92,
418
- "loss": 0.04691737780570984,
419
- "curriculum_level": 0,
420
- "skipped": null,
421
- "temp": 0.955,
422
  "tribunal": {
423
  "pro_vendor_score": 0.508,
424
  "pro_client_score": 0.258,
425
  "neutral_score": 0.358,
426
  "trimmed_mean": 0.308
427
  },
428
- "merged_from": [
429
- "colab",
430
- "hfjobs"
431
- ]
432
  },
433
  {
434
  "step": 20,
435
  "task": "simple_saas",
436
- "mean_reward": 0.228188161611557,
437
- "deal_rate": 0.125,
438
- "mean_turns": 6.25,
439
  "tom_accuracy": 0.95,
440
- "loss": 0.5203802585601807,
441
- "curriculum_level": 0,
442
- "skipped": null,
443
- "temp": 0.952,
444
  "tribunal": {
445
  "pro_vendor_score": 0.18,
446
  "pro_client_score": 0.2,
447
  "neutral_score": 0.15,
448
  "trimmed_mean": 0.19
449
  },
450
- "merged_from": [
451
- "colab",
452
- "hfjobs"
453
- ]
454
  },
455
  {
456
  "step": 21,
457
  "task": "simple_saas",
458
- "mean_reward": 0.1822499976158142,
459
  "deal_rate": 0.0,
460
  "mean_turns": 6.0,
461
- "tom_accuracy": 0.9400000000000001,
462
- "loss": 0.0,
463
- "curriculum_level": 0,
464
- "skipped": "no_variance",
465
- "temp": 0.95,
466
  "tribunal": {
467
  "pro_vendor_score": 0.18,
468
  "pro_client_score": 0.2,
469
  "neutral_score": 0.15,
470
  "trimmed_mean": 0.19
471
  },
472
- "merged_from": [
473
- "colab",
474
- "hfjobs"
475
- ]
476
  },
477
  {
478
  "step": 22,
479
  "task": "simple_saas",
480
- "mean_reward": 0.2391781256198883,
481
- "deal_rate": 0.125,
482
  "mean_turns": 6.0,
483
- "tom_accuracy": 0.9400000000000001,
484
- "loss": 0.5321149826049805,
485
- "curriculum_level": 0,
486
- "skipped": null,
487
- "temp": 0.947,
488
  "tribunal": {
489
  "pro_vendor_score": 0.18,
490
  "pro_client_score": 0.2,
491
  "neutral_score": 0.15,
492
  "trimmed_mean": 0.19
493
  },
494
- "merged_from": [
495
- "colab",
496
- "hfjobs"
497
- ]
498
  },
499
  {
500
  "step": 23,
501
  "task": "simple_saas",
502
- "mean_reward": 0.2808939588069916,
503
  "deal_rate": 0.25,
504
  "mean_turns": 6.0,
505
- "tom_accuracy": 0.9400000000000001,
506
- "loss": 0.41960065031051635,
507
- "curriculum_level": 0,
508
- "skipped": null,
509
- "temp": 0.945,
510
  "tribunal": {
511
  "pro_vendor_score": 0.415,
512
  "pro_client_score": 0.165,
513
  "neutral_score": 0.265,
514
  "trimmed_mean": 0.215
515
  },
516
- "merged_from": [
517
- "colab",
518
- "hfjobs"
519
- ]
520
  },
521
  {
522
  "step": 24,
523
  "task": "simple_saas",
524
- "mean_reward": 0.2482499976158142,
525
- "deal_rate": 0.125,
526
  "mean_turns": 6.0,
527
- "tom_accuracy": 0.9400000000000001,
528
- "loss": 0.01035,
529
- "curriculum_level": 0,
530
- "skipped": null,
531
- "temp": 0.942,
532
  "tribunal": {
533
  "pro_vendor_score": 0.454,
534
  "pro_client_score": 0.204,
535
  "neutral_score": 0.304,
536
  "trimmed_mean": 0.254
537
  },
538
- "merged_from": [
539
- "colab",
540
- "hfjobs"
541
- ]
542
  },
543
  {
544
  "step": 25,
545
  "task": "simple_saas",
546
- "mean_reward": 0.2447499976158142,
547
- "deal_rate": 0.125,
548
- "mean_turns": 5.9,
549
- "tom_accuracy": 0.9400000000000001,
550
- "loss": -0.00345,
551
- "curriculum_level": 0,
552
- "skipped": null,
553
- "temp": 0.939,
554
  "tribunal": {
555
  "pro_vendor_score": 0.447,
556
  "pro_client_score": 0.197,
557
  "neutral_score": 0.297,
558
  "trimmed_mean": 0.247
559
  },
560
- "merged_from": [
561
- "colab",
562
- "hfjobs"
563
- ]
564
  },
565
  {
566
  "step": 26,
567
  "task": "simple_saas",
568
- "mean_reward": 0.17975,
569
  "deal_rate": 0.0,
570
- "mean_turns": 6.125,
571
- "tom_accuracy": 0.7301851851851852,
572
- "loss": -0.4980219602584839,
573
- "curriculum_level": 0,
574
- "skipped": null,
575
- "temp": 0.937,
576
  "tribunal": {
577
  "pro_vendor_score": 0.18,
578
  "pro_client_score": 0.2,
579
  "neutral_score": 0.15,
580
  "trimmed_mean": 0.19
581
  },
582
- "merged_from": [
583
- "colab",
584
- "hfjobs"
585
- ]
586
  },
587
  {
588
  "step": 27,
589
  "task": "simple_saas",
590
- "mean_reward": 0.2492499976158142,
591
- "deal_rate": 0.125,
592
- "mean_turns": 5.725,
593
  "tom_accuracy": 0.0,
594
- "loss": 0.0144,
595
- "curriculum_level": 0,
596
- "skipped": null,
597
- "temp": 0.934,
598
  "tribunal": {
599
  "pro_vendor_score": 0.456,
600
  "pro_client_score": 0.206,
601
  "neutral_score": 0.306,
602
  "trimmed_mean": 0.256
603
  },
604
- "merged_from": [
605
- "colab",
606
- "hfjobs"
607
- ]
608
  },
609
  {
610
  "step": 28,
611
  "task": "simple_saas",
612
- "mean_reward": 0.2587499976158142,
613
- "deal_rate": 0.125,
614
  "mean_turns": 6.0,
615
  "tom_accuracy": 0.0,
616
- "loss": -0.0302,
617
- "curriculum_level": 0,
618
- "skipped": null,
619
- "temp": 0.932,
620
  "tribunal": {
621
  "pro_vendor_score": 0.475,
622
  "pro_client_score": 0.225,
623
  "neutral_score": 0.325,
624
  "trimmed_mean": 0.275
625
  },
626
- "merged_from": [
627
- "colab",
628
- "hfjobs"
629
- ]
630
  },
631
  {
632
  "step": 29,
633
  "task": "simple_saas",
634
- "mean_reward": 0.2588125065565109,
635
- "deal_rate": 0.125,
636
- "mean_turns": 5.5,
637
- "tom_accuracy": 0.9400000000000001,
638
- "loss": 0.008682489395141602,
639
- "curriculum_level": 0,
640
- "skipped": null,
641
- "temp": 0.929,
642
  "tribunal": {
643
  "pro_vendor_score": 0.18,
644
  "pro_client_score": 0.2,
645
  "neutral_score": 0.15,
646
  "trimmed_mean": 0.19
647
  },
648
- "merged_from": [
649
- "colab",
650
- "hfjobs"
651
- ]
652
  },
653
  {
654
  "step": 30,
655
  "task": "simple_saas",
656
- "mean_reward": 0.17975,
657
  "deal_rate": 0.0,
658
- "mean_turns": 6.25,
659
- "tom_accuracy": 0.9400000000000001,
660
- "loss": -0.12609636783599854,
661
- "curriculum_level": 0,
662
- "skipped": null,
663
- "temp": 0.927,
664
  "tribunal": {
665
  "pro_vendor_score": 0.18,
666
  "pro_client_score": 0.2,
667
  "neutral_score": 0.15,
668
  "trimmed_mean": 0.19
669
  },
670
- "merged_from": [
671
- "colab",
672
- "hfjobs"
673
- ]
674
  },
675
  {
676
  "step": 31,
@@ -688,10 +568,7 @@
688
  "loss": -0.0424,
689
  "curriculum_level": 0,
690
  "skipped": null,
691
- "temp": 0.924,
692
- "merged_from": [
693
- "colab"
694
- ]
695
  },
696
  {
697
  "step": 32,
@@ -709,10 +586,7 @@
709
  "loss": 0.0376,
710
  "curriculum_level": 0,
711
  "skipped": null,
712
- "temp": 0.922,
713
- "merged_from": [
714
- "colab"
715
- ]
716
  },
717
  {
718
  "step": 33,
@@ -730,10 +604,7 @@
730
  "loss": -0.0292,
731
  "curriculum_level": 0,
732
  "skipped": null,
733
- "temp": 0.919,
734
- "merged_from": [
735
- "colab"
736
- ]
737
  },
738
  {
739
  "step": 34,
@@ -751,10 +622,7 @@
751
  "loss": 0.0,
752
  "curriculum_level": 0,
753
  "skipped": "no_variance",
754
- "temp": 0.917,
755
- "merged_from": [
756
- "colab"
757
- ]
758
  },
759
  {
760
  "step": 35,
@@ -772,10 +640,7 @@
772
  "loss": 0.0,
773
  "curriculum_level": 0,
774
  "skipped": "no_variance",
775
- "temp": 0.914,
776
- "merged_from": [
777
- "colab"
778
- ]
779
  },
780
  {
781
  "step": 36,
@@ -793,10 +658,7 @@
793
  "loss": -0.0143,
794
  "curriculum_level": 0,
795
  "skipped": null,
796
- "temp": 0.912,
797
- "merged_from": [
798
- "colab"
799
- ]
800
  },
801
  {
802
  "step": 37,
@@ -814,10 +676,7 @@
814
  "loss": 0.0405,
815
  "curriculum_level": 0,
816
  "skipped": null,
817
- "temp": 0.909,
818
- "merged_from": [
819
- "colab"
820
- ]
821
  },
822
  {
823
  "step": 38,
@@ -835,10 +694,7 @@
835
  "loss": 0.1279,
836
  "curriculum_level": 0,
837
  "skipped": null,
838
- "temp": 0.907,
839
- "merged_from": [
840
- "colab"
841
- ]
842
  },
843
  {
844
  "step": 39,
@@ -856,10 +712,7 @@
856
  "loss": -0.0642,
857
  "curriculum_level": 0,
858
  "skipped": null,
859
- "temp": 0.904,
860
- "merged_from": [
861
- "colab"
862
- ]
863
  },
864
  {
865
  "step": 40,
@@ -877,10 +730,7 @@
877
  "loss": -0.1138,
878
  "curriculum_level": 0,
879
  "skipped": null,
880
- "temp": 0.902,
881
- "merged_from": [
882
- "colab"
883
- ]
884
  },
885
  {
886
  "step": 41,
@@ -898,10 +748,7 @@
898
  "loss": -0.0107,
899
  "curriculum_level": 0,
900
  "skipped": null,
901
- "temp": 0.899,
902
- "merged_from": [
903
- "colab"
904
- ]
905
  },
906
  {
907
  "step": 42,
@@ -919,10 +766,7 @@
919
  "loss": 0.0876,
920
  "curriculum_level": 0,
921
  "skipped": null,
922
- "temp": 0.897,
923
- "merged_from": [
924
- "colab"
925
- ]
926
  },
927
  {
928
  "step": 43,
@@ -940,10 +784,7 @@
940
  "loss": 0.0,
941
  "curriculum_level": 0,
942
  "skipped": "no_variance",
943
- "temp": 0.894,
944
- "merged_from": [
945
- "colab"
946
- ]
947
  },
948
  {
949
  "step": 44,
@@ -961,10 +802,7 @@
961
  "loss": -0.0807,
962
  "curriculum_level": 0,
963
  "skipped": null,
964
- "temp": 0.892,
965
- "merged_from": [
966
- "colab"
967
- ]
968
  },
969
  {
970
  "step": 45,
@@ -982,10 +820,7 @@
982
  "loss": 0.0,
983
  "curriculum_level": 0,
984
  "skipped": "no_variance",
985
- "temp": 0.889,
986
- "merged_from": [
987
- "colab"
988
- ]
989
  },
990
  {
991
  "step": 46,
@@ -1003,10 +838,7 @@
1003
  "loss": 0.0,
1004
  "curriculum_level": 0,
1005
  "skipped": "no_variance",
1006
- "temp": 0.887,
1007
- "merged_from": [
1008
- "colab"
1009
- ]
1010
  },
1011
  {
1012
  "step": 47,
@@ -1024,10 +856,7 @@
1024
  "loss": -0.105,
1025
  "curriculum_level": 0,
1026
  "skipped": null,
1027
- "temp": 0.884,
1028
- "merged_from": [
1029
- "colab"
1030
- ]
1031
  },
1032
  {
1033
  "step": 48,
@@ -1045,10 +874,7 @@
1045
  "loss": 0.0,
1046
  "curriculum_level": 0,
1047
  "skipped": "no_variance",
1048
- "temp": 0.882,
1049
- "merged_from": [
1050
- "colab"
1051
- ]
1052
  },
1053
  {
1054
  "step": 49,
@@ -1066,10 +892,7 @@
1066
  "loss": 0.0,
1067
  "curriculum_level": 0,
1068
  "skipped": "no_variance",
1069
- "temp": 0.879,
1070
- "merged_from": [
1071
- "colab"
1072
- ]
1073
  },
1074
  {
1075
  "step": 50,
@@ -1087,10 +910,7 @@
1087
  "loss": 0.0993,
1088
  "curriculum_level": 0,
1089
  "skipped": null,
1090
- "temp": 0.876,
1091
- "merged_from": [
1092
- "colab"
1093
- ]
1094
  },
1095
  {
1096
  "step": 51,
@@ -1108,10 +928,7 @@
1108
  "loss": -0.1201,
1109
  "curriculum_level": 0,
1110
  "skipped": null,
1111
- "temp": 0.874,
1112
- "merged_from": [
1113
- "colab"
1114
- ]
1115
  },
1116
  {
1117
  "step": 52,
@@ -1129,13 +946,10 @@
1129
  "loss": 0.0405,
1130
  "curriculum_level": 0,
1131
  "skipped": null,
1132
- "temp": 0.871,
1133
- "merged_from": [
1134
- "colab"
1135
- ]
1136
  }
1137
  ],
1138
- "note": "Merged training history from two independent GRPO runs against the deployed Negotiation Arena V3 Space. Steps 1-30 are the average of (a) Colab T4 free-tier 52-step run and (b) HF Jobs T4 30-step run with cleaner gradient mechanics (uv-resolved deps, temperature-spread rollouts, reward shaping). Steps 31-52 are Colab-only. Each step's `merged_from` field records its source(s). The original per-run histories are preserved at artifacts/colab_training_history.json and artifacts/hfjobs_training_history.json.",
1139
  "baseline": {},
1140
  "trained_eval": {}
1141
  }
 
15
  {
16
  "step": 1,
17
  "task": "simple_saas",
18
+ "mean_reward": 0.435,
19
+ "deal_rate": 0.5,
20
+ "mean_turns": 6.0,
21
+ "tom_accuracy": 0.44,
 
 
 
 
22
  "tribunal": {
23
  "pro_vendor_score": 0.585,
24
  "pro_client_score": 0.335,
25
  "neutral_score": 0.435,
26
  "trimmed_mean": 0.385
27
  },
28
+ "loss": 0.0425,
29
+ "curriculum_level": 0,
30
+ "skipped": null,
31
+ "temp": 1.0
32
  },
33
  {
34
  "step": 2,
35
  "task": "simple_saas",
36
+ "mean_reward": 0.172,
37
  "deal_rate": 0.0,
38
  "mean_turns": 6.0,
39
+ "tom_accuracy": 0.0,
 
 
 
 
40
  "tribunal": {
41
  "pro_vendor_score": 0.18,
42
  "pro_client_score": 0.2,
43
  "neutral_score": 0.15,
44
  "trimmed_mean": 0.19
45
  },
46
+ "loss": 0.0,
47
+ "curriculum_level": 0,
48
+ "skipped": "no_variance",
49
+ "temp": 0.997
50
  },
51
  {
52
  "step": 3,
53
  "task": "simple_saas",
54
+ "mean_reward": 0.263,
55
+ "deal_rate": 0.25,
56
+ "mean_turns": 6.2,
57
  "tom_accuracy": 0.0,
 
 
 
 
58
  "tribunal": {
59
  "pro_vendor_score": 0.413,
60
  "pro_client_score": 0.163,
61
  "neutral_score": 0.263,
62
  "trimmed_mean": 0.213
63
  },
64
+ "loss": 8.116,
65
+ "curriculum_level": 0,
66
+ "skipped": null,
67
+ "temp": 0.995
68
  },
69
  {
70
  "step": 4,
71
  "task": "simple_saas",
72
+ "mean_reward": 0.172,
73
  "deal_rate": 0.0,
74
+ "mean_turns": 6.2,
75
+ "tom_accuracy": 0.67,
 
 
 
 
76
  "tribunal": {
77
  "pro_vendor_score": 0.18,
78
  "pro_client_score": 0.2,
79
  "neutral_score": 0.15,
80
  "trimmed_mean": 0.19
81
  },
82
+ "loss": 0.0,
83
+ "curriculum_level": 0,
84
+ "skipped": "no_variance",
85
+ "temp": 0.992
86
  },
87
  {
88
  "step": 5,
89
  "task": "simple_saas",
90
+ "mean_reward": 0.172,
91
+ "deal_rate": 0.0,
92
+ "mean_turns": 6.0,
93
+ "tom_accuracy": 0.0,
 
 
 
 
94
  "tribunal": {
95
  "pro_vendor_score": 0.18,
96
  "pro_client_score": 0.2,
97
  "neutral_score": 0.15,
98
  "trimmed_mean": 0.19
99
  },
100
+ "loss": 0.0,
101
+ "curriculum_level": 0,
102
+ "skipped": "no_variance",
103
+ "temp": 0.99
104
  },
105
  {
106
  "step": 6,
107
  "task": "simple_saas",
108
+ "mean_reward": 0.172,
109
+ "deal_rate": 0.0,
110
+ "mean_turns": 6.5,
111
  "tom_accuracy": 0.94,
 
 
 
 
112
  "tribunal": {
113
  "pro_vendor_score": 0.18,
114
  "pro_client_score": 0.2,
115
  "neutral_score": 0.15,
116
  "trimmed_mean": 0.19
117
  },
118
+ "loss": 0.0,
119
+ "curriculum_level": 0,
120
+ "skipped": "no_variance",
121
+ "temp": 0.987
122
  },
123
  {
124
  "step": 7,
125
  "task": "simple_saas",
126
+ "mean_reward": 0.172,
127
+ "deal_rate": 0.0,
128
  "mean_turns": 6.0,
129
  "tom_accuracy": 0.0,
 
 
 
 
130
  "tribunal": {
131
  "pro_vendor_score": 0.18,
132
  "pro_client_score": 0.2,
133
  "neutral_score": 0.15,
134
  "trimmed_mean": 0.19
135
  },
136
+ "loss": 0.0,
137
+ "curriculum_level": 0,
138
+ "skipped": "no_variance",
139
+ "temp": 0.985
140
  },
141
  {
142
  "step": 8,
143
  "task": "simple_saas",
144
+ "mean_reward": 0.358,
145
  "deal_rate": 0.25,
146
+ "mean_turns": 4.8,
147
  "tom_accuracy": 0.0,
 
 
 
 
148
  "tribunal": {
149
  "pro_vendor_score": 0.508,
150
  "pro_client_score": 0.258,
151
  "neutral_score": 0.358,
152
  "trimmed_mean": 0.308
153
  },
154
+ "loss": -38.809,
155
+ "curriculum_level": 0,
156
+ "skipped": null,
157
+ "temp": 0.982
158
  },
159
  {
160
  "step": 9,
161
  "task": "simple_saas",
162
+ "mean_reward": 0.172,
163
+ "deal_rate": 0.0,
164
+ "mean_turns": 6.0,
165
+ "tom_accuracy": 0.0,
 
 
 
 
166
  "tribunal": {
167
  "pro_vendor_score": 0.18,
168
  "pro_client_score": 0.2,
169
  "neutral_score": 0.15,
170
  "trimmed_mean": 0.19
171
  },
172
+ "loss": 0.0,
173
+ "curriculum_level": 0,
174
+ "skipped": "no_variance",
175
+ "temp": 0.98
176
  },
177
  {
178
  "step": 10,
179
  "task": "simple_saas",
180
+ "mean_reward": 0.296,
181
  "deal_rate": 0.25,
182
  "mean_turns": 6.0,
183
  "tom_accuracy": 0.57,
 
 
 
 
184
  "tribunal": {
185
  "pro_vendor_score": 0.446,
186
  "pro_client_score": 0.196,
187
  "neutral_score": 0.296,
188
  "trimmed_mean": 0.246
189
  },
190
+ "loss": 38.255,
191
+ "curriculum_level": 0,
192
+ "skipped": null,
193
+ "temp": 0.977
194
  },
195
  {
196
  "step": 11,
197
  "task": "simple_saas",
198
+ "mean_reward": 0.358,
199
+ "deal_rate": 0.25,
200
+ "mean_turns": 4.8,
201
+ "tom_accuracy": 0.0,
 
 
 
 
202
  "tribunal": {
203
  "pro_vendor_score": 0.508,
204
  "pro_client_score": 0.258,
205
  "neutral_score": 0.358,
206
  "trimmed_mean": 0.308
207
  },
208
+ "loss": -94.588,
209
+ "curriculum_level": 0,
210
+ "skipped": null,
211
+ "temp": 0.975
212
  },
213
  {
214
  "step": 12,
215
  "task": "simple_saas",
216
+ "mean_reward": 0.299,
217
  "deal_rate": 0.25,
218
+ "mean_turns": 5.8,
219
+ "tom_accuracy": 0.4,
 
 
 
 
220
  "tribunal": {
221
  "pro_vendor_score": 0.449,
222
  "pro_client_score": 0.199,
223
  "neutral_score": 0.299,
224
  "trimmed_mean": 0.249
225
  },
226
+ "loss": -28.529,
227
+ "curriculum_level": 0,
228
+ "skipped": null,
229
+ "temp": 0.972
230
  },
231
  {
232
  "step": 13,
233
  "task": "simple_saas",
234
+ "mean_reward": 0.172,
235
  "deal_rate": 0.0,
236
+ "mean_turns": 6.0,
237
+ "tom_accuracy": 0.81,
 
 
 
 
238
  "tribunal": {
239
  "pro_vendor_score": 0.18,
240
  "pro_client_score": 0.2,
241
  "neutral_score": 0.15,
242
  "trimmed_mean": 0.19
243
  },
244
+ "loss": 0.0,
245
+ "curriculum_level": 0,
246
+ "skipped": "no_variance",
247
+ "temp": 0.97
248
  },
249
  {
250
  "step": 14,
251
  "task": "simple_saas",
252
+ "mean_reward": 0.276,
253
  "deal_rate": 0.25,
254
+ "mean_turns": 5.0,
255
+ "tom_accuracy": 0.0,
 
 
 
 
256
  "tribunal": {
257
  "pro_vendor_score": 0.426,
258
  "pro_client_score": 0.176,
259
  "neutral_score": 0.276,
260
  "trimmed_mean": 0.226
261
  },
262
+ "loss": -0.1206,
263
+ "curriculum_level": 0,
264
+ "skipped": null,
265
+ "temp": 0.967
266
  },
267
  {
268
  "step": 15,
269
  "task": "simple_saas",
270
+ "mean_reward": 0.358,
271
  "deal_rate": 0.25,
272
+ "mean_turns": 4.8,
273
  "tom_accuracy": 0.0,
 
 
 
 
274
  "tribunal": {
275
  "pro_vendor_score": 0.508,
276
  "pro_client_score": 0.258,
277
  "neutral_score": 0.358,
278
  "trimmed_mean": 0.308
279
  },
280
+ "loss": -0.1614,
281
+ "curriculum_level": 0,
282
+ "skipped": null,
283
+ "temp": 0.965
284
  },
285
  {
286
  "step": 16,
287
  "task": "simple_saas",
288
+ "mean_reward": 0.172,
289
  "deal_rate": 0.0,
290
  "mean_turns": 6.0,
291
+ "tom_accuracy": 0.0,
 
 
 
 
292
  "tribunal": {
293
  "pro_vendor_score": 0.18,
294
  "pro_client_score": 0.2,
295
  "neutral_score": 0.15,
296
  "trimmed_mean": 0.19
297
  },
298
+ "loss": 0.0,
299
+ "curriculum_level": 0,
300
+ "skipped": "no_variance",
301
+ "temp": 0.962
302
  },
303
  {
304
  "step": 17,
305
  "task": "simple_saas",
306
+ "mean_reward": 0.331,
307
+ "deal_rate": 0.25,
308
+ "mean_turns": 5.0,
309
+ "tom_accuracy": 0.49,
 
 
 
 
310
  "tribunal": {
311
  "pro_vendor_score": 0.481,
312
  "pro_client_score": 0.231,
313
  "neutral_score": 0.331,
314
  "trimmed_mean": 0.281
315
  },
316
+ "loss": -0.1313,
317
+ "curriculum_level": 0,
318
+ "skipped": null,
319
+ "temp": 0.96
320
  },
321
  {
322
  "step": 18,
323
  "task": "simple_saas",
324
+ "mean_reward": 0.665,
325
+ "deal_rate": 0.75,
326
+ "mean_turns": 3.5,
327
  "tom_accuracy": 0.96,
 
 
 
 
328
  "tribunal": {
329
  "pro_vendor_score": 0.815,
330
  "pro_client_score": 0.565,
331
  "neutral_score": 0.665,
332
  "trimmed_mean": 0.615
333
  },
334
+ "loss": -0.0816,
335
+ "curriculum_level": 0,
336
+ "skipped": null,
337
+ "temp": 0.957
338
  },
339
  {
340
  "step": 19,
341
  "task": "simple_saas",
342
+ "mean_reward": 0.358,
343
  "deal_rate": 0.25,
344
+ "mean_turns": 4.8,
345
  "tom_accuracy": 0.92,
 
 
 
 
346
  "tribunal": {
347
  "pro_vendor_score": 0.508,
348
  "pro_client_score": 0.258,
349
  "neutral_score": 0.358,
350
  "trimmed_mean": 0.308
351
  },
352
+ "loss": -0.0923,
353
+ "curriculum_level": 0,
354
+ "skipped": null,
355
+ "temp": 0.955
356
  },
357
  {
358
  "step": 20,
359
  "task": "simple_saas",
360
+ "mean_reward": 0.172,
361
+ "deal_rate": 0.0,
362
+ "mean_turns": 6.0,
363
  "tom_accuracy": 0.95,
 
 
 
 
364
  "tribunal": {
365
  "pro_vendor_score": 0.18,
366
  "pro_client_score": 0.2,
367
  "neutral_score": 0.15,
368
  "trimmed_mean": 0.19
369
  },
370
+ "loss": 0.0,
371
+ "curriculum_level": 0,
372
+ "skipped": "no_variance",
373
+ "temp": 0.952
374
  },
375
  {
376
  "step": 21,
377
  "task": "simple_saas",
378
+ "mean_reward": 0.172,
379
  "deal_rate": 0.0,
380
  "mean_turns": 6.0,
381
+ "tom_accuracy": 0.84,
 
 
 
 
382
  "tribunal": {
383
  "pro_vendor_score": 0.18,
384
  "pro_client_score": 0.2,
385
  "neutral_score": 0.15,
386
  "trimmed_mean": 0.19
387
  },
388
+ "loss": 0.0,
389
+ "curriculum_level": 0,
390
+ "skipped": "no_variance",
391
+ "temp": 0.95
392
  },
393
  {
394
  "step": 22,
395
  "task": "simple_saas",
396
+ "mean_reward": 0.172,
397
+ "deal_rate": 0.0,
398
  "mean_turns": 6.0,
399
+ "tom_accuracy": 0.0,
 
 
 
 
400
  "tribunal": {
401
  "pro_vendor_score": 0.18,
402
  "pro_client_score": 0.2,
403
  "neutral_score": 0.15,
404
  "trimmed_mean": 0.19
405
  },
406
+ "loss": 0.0,
407
+ "curriculum_level": 0,
408
+ "skipped": "no_variance",
409
+ "temp": 0.947
410
  },
411
  {
412
  "step": 23,
413
  "task": "simple_saas",
414
+ "mean_reward": 0.265,
415
  "deal_rate": 0.25,
416
  "mean_turns": 6.0,
417
+ "tom_accuracy": 0.0,
 
 
 
 
418
  "tribunal": {
419
  "pro_vendor_score": 0.415,
420
  "pro_client_score": 0.165,
421
  "neutral_score": 0.265,
422
  "trimmed_mean": 0.215
423
  },
424
+ "loss": -0.0695,
425
+ "curriculum_level": 0,
426
+ "skipped": null,
427
+ "temp": 0.945
428
  },
429
  {
430
  "step": 24,
431
  "task": "simple_saas",
432
+ "mean_reward": 0.304,
433
+ "deal_rate": 0.25,
434
  "mean_turns": 6.0,
435
+ "tom_accuracy": 0.0,
 
 
 
 
436
  "tribunal": {
437
  "pro_vendor_score": 0.454,
438
  "pro_client_score": 0.204,
439
  "neutral_score": 0.304,
440
  "trimmed_mean": 0.254
441
  },
442
+ "loss": 0.0207,
443
+ "curriculum_level": 0,
444
+ "skipped": null,
445
+ "temp": 0.942
446
  },
447
  {
448
  "step": 25,
449
  "task": "simple_saas",
450
+ "mean_reward": 0.297,
451
+ "deal_rate": 0.25,
452
+ "mean_turns": 5.8,
453
+ "tom_accuracy": 0.0,
 
 
 
 
454
  "tribunal": {
455
  "pro_vendor_score": 0.447,
456
  "pro_client_score": 0.197,
457
  "neutral_score": 0.297,
458
  "trimmed_mean": 0.247
459
  },
460
+ "loss": -0.0069,
461
+ "curriculum_level": 0,
462
+ "skipped": null,
463
+ "temp": 0.939
464
  },
465
  {
466
  "step": 26,
467
  "task": "simple_saas",
468
+ "mean_reward": 0.172,
469
  "deal_rate": 0.0,
470
+ "mean_turns": 6.0,
471
+ "tom_accuracy": 0.0,
 
 
 
 
472
  "tribunal": {
473
  "pro_vendor_score": 0.18,
474
  "pro_client_score": 0.2,
475
  "neutral_score": 0.15,
476
  "trimmed_mean": 0.19
477
  },
478
+ "loss": 0.0,
479
+ "curriculum_level": 0,
480
+ "skipped": "no_variance",
481
+ "temp": 0.937
482
  },
483
  {
484
  "step": 27,
485
  "task": "simple_saas",
486
+ "mean_reward": 0.306,
487
+ "deal_rate": 0.25,
488
+ "mean_turns": 5.2,
489
  "tom_accuracy": 0.0,
 
 
 
 
490
  "tribunal": {
491
  "pro_vendor_score": 0.456,
492
  "pro_client_score": 0.206,
493
  "neutral_score": 0.306,
494
  "trimmed_mean": 0.256
495
  },
496
+ "loss": 0.0288,
497
+ "curriculum_level": 0,
498
+ "skipped": null,
499
+ "temp": 0.934
500
  },
501
  {
502
  "step": 28,
503
  "task": "simple_saas",
504
+ "mean_reward": 0.325,
505
+ "deal_rate": 0.25,
506
  "mean_turns": 6.0,
507
  "tom_accuracy": 0.0,
 
 
 
 
508
  "tribunal": {
509
  "pro_vendor_score": 0.475,
510
  "pro_client_score": 0.225,
511
  "neutral_score": 0.325,
512
  "trimmed_mean": 0.275
513
  },
514
+ "loss": -0.0604,
515
+ "curriculum_level": 0,
516
+ "skipped": null,
517
+ "temp": 0.932
518
  },
519
  {
520
  "step": 29,
521
  "task": "simple_saas",
522
+ "mean_reward": 0.172,
523
+ "deal_rate": 0.0,
524
+ "mean_turns": 6.0,
525
+ "tom_accuracy": 0.0,
 
 
 
 
526
  "tribunal": {
527
  "pro_vendor_score": 0.18,
528
  "pro_client_score": 0.2,
529
  "neutral_score": 0.15,
530
  "trimmed_mean": 0.19
531
  },
532
+ "loss": 0.0,
533
+ "curriculum_level": 0,
534
+ "skipped": "no_variance",
535
+ "temp": 0.929
536
  },
537
  {
538
  "step": 30,
539
  "task": "simple_saas",
540
+ "mean_reward": 0.172,
541
  "deal_rate": 0.0,
542
+ "mean_turns": 6.0,
543
+ "tom_accuracy": 0.0,
 
 
 
 
544
  "tribunal": {
545
  "pro_vendor_score": 0.18,
546
  "pro_client_score": 0.2,
547
  "neutral_score": 0.15,
548
  "trimmed_mean": 0.19
549
  },
550
+ "loss": 0.0,
551
+ "curriculum_level": 0,
552
+ "skipped": "no_variance",
553
+ "temp": 0.927
554
  },
555
  {
556
  "step": 31,
 
568
  "loss": -0.0424,
569
  "curriculum_level": 0,
570
  "skipped": null,
571
+ "temp": 0.924
 
 
 
572
  },
573
  {
574
  "step": 32,
 
586
  "loss": 0.0376,
587
  "curriculum_level": 0,
588
  "skipped": null,
589
+ "temp": 0.922
 
 
 
590
  },
591
  {
592
  "step": 33,
 
604
  "loss": -0.0292,
605
  "curriculum_level": 0,
606
  "skipped": null,
607
+ "temp": 0.919
 
 
 
608
  },
609
  {
610
  "step": 34,
 
622
  "loss": 0.0,
623
  "curriculum_level": 0,
624
  "skipped": "no_variance",
625
+ "temp": 0.917
 
 
 
626
  },
627
  {
628
  "step": 35,
 
640
  "loss": 0.0,
641
  "curriculum_level": 0,
642
  "skipped": "no_variance",
643
+ "temp": 0.914
 
 
 
644
  },
645
  {
646
  "step": 36,
 
658
  "loss": -0.0143,
659
  "curriculum_level": 0,
660
  "skipped": null,
661
+ "temp": 0.912
 
 
 
662
  },
663
  {
664
  "step": 37,
 
676
  "loss": 0.0405,
677
  "curriculum_level": 0,
678
  "skipped": null,
679
+ "temp": 0.909
 
 
 
680
  },
681
  {
682
  "step": 38,
 
694
  "loss": 0.1279,
695
  "curriculum_level": 0,
696
  "skipped": null,
697
+ "temp": 0.907
 
 
 
698
  },
699
  {
700
  "step": 39,
 
712
  "loss": -0.0642,
713
  "curriculum_level": 0,
714
  "skipped": null,
715
+ "temp": 0.904
 
 
 
716
  },
717
  {
718
  "step": 40,
 
730
  "loss": -0.1138,
731
  "curriculum_level": 0,
732
  "skipped": null,
733
+ "temp": 0.902
 
 
 
734
  },
735
  {
736
  "step": 41,
 
748
  "loss": -0.0107,
749
  "curriculum_level": 0,
750
  "skipped": null,
751
+ "temp": 0.899
 
 
 
752
  },
753
  {
754
  "step": 42,
 
766
  "loss": 0.0876,
767
  "curriculum_level": 0,
768
  "skipped": null,
769
+ "temp": 0.897
 
 
 
770
  },
771
  {
772
  "step": 43,
 
784
  "loss": 0.0,
785
  "curriculum_level": 0,
786
  "skipped": "no_variance",
787
+ "temp": 0.894
 
 
 
788
  },
789
  {
790
  "step": 44,
 
802
  "loss": -0.0807,
803
  "curriculum_level": 0,
804
  "skipped": null,
805
+ "temp": 0.892
 
 
 
806
  },
807
  {
808
  "step": 45,
 
820
  "loss": 0.0,
821
  "curriculum_level": 0,
822
  "skipped": "no_variance",
823
+ "temp": 0.889
 
 
 
824
  },
825
  {
826
  "step": 46,
 
838
  "loss": 0.0,
839
  "curriculum_level": 0,
840
  "skipped": "no_variance",
841
+ "temp": 0.887
 
 
 
842
  },
843
  {
844
  "step": 47,
 
856
  "loss": -0.105,
857
  "curriculum_level": 0,
858
  "skipped": null,
859
+ "temp": 0.884
 
 
 
860
  },
861
  {
862
  "step": 48,
 
874
  "loss": 0.0,
875
  "curriculum_level": 0,
876
  "skipped": "no_variance",
877
+ "temp": 0.882
 
 
 
878
  },
879
  {
880
  "step": 49,
 
892
  "loss": 0.0,
893
  "curriculum_level": 0,
894
  "skipped": "no_variance",
895
+ "temp": 0.879
 
 
 
896
  },
897
  {
898
  "step": 50,
 
910
  "loss": 0.0993,
911
  "curriculum_level": 0,
912
  "skipped": null,
913
+ "temp": 0.876
 
 
 
914
  },
915
  {
916
  "step": 51,
 
928
  "loss": -0.1201,
929
  "curriculum_level": 0,
930
  "skipped": null,
931
+ "temp": 0.874
 
 
 
932
  },
933
  {
934
  "step": 52,
 
946
  "loss": 0.0405,
947
  "curriculum_level": 0,
948
  "skipped": null,
949
+ "temp": 0.871
 
 
 
950
  }
951
  ],
952
+ "note": "53 GRPO training steps recorded on Colab T4 before the free-tier session disconnected. Tribunal scores were not active during this training run (Space HF_TOKEN was added later for the eval phase) so the tribunal field is synthesized from the observed mean_reward for plot purposes only -- the per-judge mean_reward / deal_rate / tom_accuracy values are real.",
953
  "baseline": {},
954
  "trained_eval": {}
955
  }