Jarrodbarnes committed on
Commit
89e0e29
·
verified ·
1 Parent(s): 5e3f92b

Update model card for v2 open-world RL step-100

Files changed (1)
  1. README.md +42 -74
README.md CHANGED
@@ -19,106 +19,76 @@ tags:
19
 
20
  # Dynamical-30B-A3B
21
 
22
- Dynamical-30B-A3B is a trained judgment layer for autonomous scientific workflows. It is a merged reinforcement-learning checkpoint built from `Qwen/Qwen3-30B-A3B-Instruct-2507` and trained in verified campaign environments that require sequential `select`, `validate`, and `revise` decisions under explicit budget constraints.
23
 
24
- The target capability is scientific judgment: selecting which candidate to investigate next, evaluating whether intermediate evidence is trustworthy or should be escalated to higher-fidelity verification, and revising beliefs about competing hypotheses as results accumulate. In this setting, the signal that determines whether a decision was correct lives in the process, not only in the final outcome.
25
 
26
- ## Summary
27
 
28
- Autonomous scientific campaigns are stateful. A decision at round `t` changes which evidence becomes available at round `t+1`, and a wrong belief early in the campaign compounds through later decisions. Dynamical-30B-A3B was trained for this coupled loop rather than for isolated single-step reasoning.
29
 
30
- The model was trained in verified campaign environments built from:
31
- - a candidate source
32
- - a thermodynamic stability oracle
33
- - a staged verifier ladder with explicit budget costs
34
 
35
- In the materials-discovery instantiation used here, candidate crystal structures are generated with MatterGen, evaluated against the MP2020 convex hull to produce `E_hull`, and converted into staged evidence with increasing cost and fidelity.
36
 
37
- ## Training Recipe

38
 
39
- The base model is `Qwen/Qwen3-30B-A3B-Instruct-2507`, a mixture-of-experts language model with 30B total parameters and 3B active parameters per token.
40
 
41
- Training proceeded in two stages:
42
- - SFT with LoRA rank 32 on 3,861 rollout rows from GPT-5.4 teacher demonstrations across 128 closed-world environments
43
- - Multi-turn RL with trajectory-level GRPO on 60 open-world environments
44
 
45
- The RL curriculum introduced harder budget tiers progressively:
46
- - steps 0-19: budget 9
47
- - steps 20-39: budgets 9 and 7
48
- - steps 40-100: budgets 5, 7, and 9
49
-
50
- The released model corresponds to the step-50 merged RL checkpoint.
51
-
52
- ## What This Model Does
53
-
54
- Dynamical-30B-A3B is designed to operate at the decision point in an automated scientific workflow. It does not generate candidates, run experiments, or compute stability. It decides:
55
- - which candidate to prioritize
56
- - whether to trust, flag, or reject evidence
57
- - when to commit to or revise a hypothesis
58
-
59
- The model is therefore best understood as a judgment policy, not as a simulator or oracle.
60
 
61
  ## Results
62
 
63
- The primary metric in the paper is hypothesis accuracy: the fraction of episodes where the model's highest-posterior hypothesis after all evidence rounds matches the oracle ground truth.
64
 
65
- On 15 held-out open-world environments with novel crystal structures:
66
- - Base: 46.7%
67
- - SFT: 53.3%
68
- - Dynamical-30B-A3B: 60.0%
69
 
70
- On 30 held-out closed-world environments:
71
- - Base: 32.2%
72
- - SFT: 40.0%
73
- - Dynamical-30B-A3B: 42.2%

74
 
75
- This closed-world result is reported as a retention check: open-world RL training does not produce regression on the training-adjacent domain.
76
 
77
- On MADE, an independent closed-loop materials-discovery benchmark:
78
- - formula recall: 0.156
79
- - structure recall: 0.085
80
- - stable efficiency: 0.233
81
- - AUDC: 0.279
82
 
83
- Relative to GPT-5.4 on MADE:
84
- - formula recall trails by 54%
85
- - structure recall exceeds by 67%
86
 
87
- The paper interprets this decomposition as evidence that judgment and knowledge are independently closable gaps: RL closes part of the judgment gap, while compositional exploration remains knowledge-limited.
88
 
89
- ## Mechanistic Findings
90
 
91
- The main behavioral gain is not improved evidence discrimination. In the paper's signal-detection analysis, RL decreases `d'` from 0.770 to 0.471 while shifting the decision criterion from 1.606 to 0.801. The model becomes better calibrated about when to reject or escalate evidence; it does not become better at perceptually distinguishing admissible from inadmissible evidence.
92
 
93
- The paper also reports an emergent fast revision pathway. After RL, 11.8% of belief updates are "silent" low-uncertainty revisions, and these flip the leading hypothesis at 3.8x the rate of deliberate updates. This pathway is absent in the base model.
 
 
94
 
95
  ## Intended Use
96
 
97
  This model is intended for research use in:
98
- - autonomous science agents
99
- - budget-constrained experimental planning
100
- - evidence escalation policies
101
- - multi-step scientific decision-making
102
- - studies of post-training for sequential judgment
103
 
104
- It is most appropriate when paired with an external system that already provides:
105
- - candidate generation
106
- - structured evidence
107
- - domain-specific verification oracles
108
 
109
  ## Limitations
110
 
111
- The limitations reported in the paper should be treated as first-order caveats.
112
-
113
- - Checkpoint selection: step 50 was selected from 4 periodic evaluations on the same 15 held-out open-world environments, introducing mild optimistic bias.
114
- - Discrimination: RL teaches calibration, not discrimination. The model is better calibrated about when to reject, but not better at distinguishing admissible from inadmissible evidence.
115
- - Statistical power: the aggregate open-world evaluation uses 15 environments, and per-budget analyses use 5 environments per tier.
116
- - Curriculum: budget-5 environments received less training exposure than budget-9 environments.
117
- - Staged verifier: cheap and medium evidence are transforms of `E_hull`, not independent physical measurements.
118
- - Formula recall: the remaining gap in compositional exploration is not addressed by environment training alone.
119
- - External transfer: the SFT checkpoint did not complete MADE reliably, limiting SFT-vs-RL comparison there.
120
-
121
- This model should not be used as a substitute for physical simulation, laboratory verification, or expert review in high-stakes scientific settings.
122
 
123
  ## Usage
124
 
@@ -154,8 +124,6 @@ print(tokenizer.decode(outputs[0][len(inputs.input_ids[0]):], skip_special_token
154
 
155
  ## Citation
156
 
157
- If you use this model, please cite:
158
-
159
  ```bibtex
160
  @misc{barnes2026trainingscientificjudgment,
161
  title={Training Scientific Judgment with Verified Environments for Autonomous Science},
@@ -167,4 +135,4 @@ If you use this model, please cite:
167
 
168
  ## Acknowledgments
169
 
170
- Dynamical-30B-A3B is derived from `Qwen/Qwen3-30B-A3B-Instruct-2507` and retains the upstream Apache 2.0 license.
 
19
 
20
  # Dynamical-30B-A3B
21
 
22
+ A trained judgment layer for autonomous scientific workflows. Starting from `Qwen/Qwen3-30B-A3B-Instruct-2507`, this model was trained with multi-turn reinforcement learning in verified campaign environments to improve sequential decision-making under uncertainty at the planner-verifier boundary.
23
 
24
+ The target capability is not general reasoning or autonomous science. It is the decision-making core that determines whether a larger scientific system behaves intelligently when search, evidence, cost, and belief updates are all coupled: selecting which candidate to investigate, evaluating whether evidence should be trusted or escalated, and revising hypotheses as conflicting results accumulate.
25
 
26
+ ## Training
27
 
28
+ **Base model:** Qwen3-30B-A3B-Instruct-2507 (30B total / 3B active MoE)
29
 
30
+ **Stage 1 -- Supervised fine-tuning:** LoRA rank 32 on 3,861 rollout rows from GPT-5.4 teacher demonstrations across 128 closed-world environments. SFT establishes the action contract and baseline scientific behavior.

31
 
32
+ **Stage 2 -- Trajectory GRPO on open-world environments:** 247 training environments (741 entries across 3 seeds), 100 gradient steps with group size G=4. The RL stage teaches experiential decision-making: when to search vs. exploit existing candidates, when to trust vs. escalate evidence, and when to revise beliefs.
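As a rough sketch, trajectory-level GRPO scores each of the G rollouts of an environment against its own group. The reward values and normalization constant below are illustrative assumptions, not the actual training code:

```python
import statistics

def group_advantages(rewards, eps=1e-6):
    """Group-relative advantage, GRPO-style: normalize each trajectory's
    scalar episode reward against the other rollouts of the same group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# One environment, G=4 rollouts, one scalar reward per full trajectory.
advantages = group_advantages([0.34, 0.41, 0.45, 0.48])
# Above-average trajectories get positive advantage and are reinforced;
# below-average ones are suppressed.
```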
33
 
34
+ The curriculum uses a staged bucket schedule over environment difficulty:
35
+ - Steps 0--9: anchor-heavy warmup (75% anchor, 20% challenge)
36
+ - Steps 10--24: broadening (40% anchor, 30% challenge, 25% stress-discriminative)
37
+ - Steps 25--100: hard transfer regime (25% anchor, 30% challenge, 35% stress-discriminative, 10% stress-search-fragile)
38
 
39
+ Budget tiers expand across stages from (5, 8) to the full range (5, 8, 11, 14, 18).
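The schedule above can be read as a step-indexed mixture over difficulty buckets. The helper below is a hypothetical sketch: bucket names and weights come from the schedule, while the sampling code itself is an assumption (note `random.choices` treats weights as relative, so the early mixtures need not sum to 1):

```python
import random

# (first_step, {bucket: weight}) taken from the staged schedule above.
SCHEDULE = [
    (0,  {"anchor": 0.75, "challenge": 0.20}),
    (10, {"anchor": 0.40, "challenge": 0.30, "stress-discriminative": 0.25}),
    (25, {"anchor": 0.25, "challenge": 0.30,
          "stress-discriminative": 0.35, "stress-search-fragile": 0.10}),
]

def sample_bucket(step, rng=random):
    """Difficulty bucket to draw a training environment from at this step."""
    weights = next(w for start, w in reversed(SCHEDULE) if step >= start)
    buckets = list(weights)
    return rng.choices(buckets, weights=[weights[b] for b in buckets], k=1)[0]
```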
40
 
41
+ **Reward:** Hybrid oracle + rubric. A physics-grounded oracle (thermodynamic stability from the MP2020 convex hull) provides manipulation-proof outcome reward. Rubric-based process supervision scores search quality, validation discipline, escalation efficiency, and revision quality via Gemini 3.1 Pro, gated on correctness.
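A minimal sketch of one plausible gating, assuming equal rubric weighting and illustrative coefficients: process reward is only paid out when the oracle-verified outcome is correct, which removes the incentive to game the rubric:

```python
def hybrid_reward(oracle_correct, rubric_scores,
                  w_outcome=1.0, w_process=0.5):
    """Oracle outcome reward plus rubric process reward, gated on correctness.

    oracle_correct: bool from the thermodynamic-stability oracle (E_hull
        against the MP2020 convex hull).
    rubric_scores: process scores in [0, 1] for search quality, validation
        discipline, escalation efficiency, and revision quality.
    """
    if not oracle_correct:
        # Gate: no process reward can be earned on a wrong final hypothesis.
        return 0.0
    process = sum(rubric_scores.values()) / len(rubric_scores)
    return w_outcome + w_process * process
```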
 
 
42
 
43
+ This release corresponds to the **step-100 merged checkpoint**.

44
 
45
  ## Results
46
 
47
+ Primary metric: **hypothesis accuracy** -- the fraction of episodes where the model's highest-posterior hypothesis matches the oracle ground truth after all evidence rounds.
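Per episode the metric reduces to an argmax match against the oracle; the record fields below are illustrative, not the evaluation harness's actual schema:

```python
def hypothesis_accuracy(episodes):
    """Fraction of episodes whose highest-posterior hypothesis after the
    final evidence round matches the oracle ground truth."""
    hits = sum(
        max(ep["posterior"], key=ep["posterior"].get) == ep["oracle_truth"]
        for ep in episodes
    )
    return hits / len(episodes)

episodes = [
    {"posterior": {"H1": 0.7, "H2": 0.3}, "oracle_truth": "H1"},  # correct
    {"posterior": {"H1": 0.4, "H2": 0.6}, "oracle_truth": "H1"},  # wrong
]
# hypothesis_accuracy(episodes) -> 0.5
```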
48
 
49
+ ### Held-out learning curve (29 open-world environments, pass@1)

50
 
51
+ | Checkpoint | Hypothesis Accuracy | Mean Reward | Parse Rate |
52
+ |---|---|---|---|
53
+ | Step 0 (SFT baseline) | 55.17% (16/29) | 0.3404 | 0.978 |
54
+ | Step 25 | 65.52% (19/29) | 0.4076 | 0.982 |
55
+ | Step 50 | 68.97% (20/29) | 0.4460 | 0.993 |
56
+ | Step 75 | **79.31% (23/29)** | 0.4801 | 0.981 |
57
+ | Step 100 | **79.31% (23/29)** | 0.5001 | 0.983 |
58
 
59
+ Total gain over the SFT baseline: **+24.14 percentage points** across 100 RL steps. Step 100 matches step 75 in accuracy but achieves a 4.2% higher mean reward and a 33% lower truncation rate, indicating more efficient trajectories at the same correctness level.
60
 
61
+ ### Per-round accuracy (survival curve)

62
 
63
+ RL fundamentally changes how accuracy relates to investigation depth. The SFT model's accuracy is flat-to-declining across campaign rounds (0.55 at round 0, dropping to 0.33 by round 10+). The RL model's accuracy increases monotonically with depth: 0.79 at round 0, rising to 1.0 by round 7. Episodes that investigate deeper are the ones that converge to the correct answer.
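The curve can be recomputed from per-episode logs by bucketing on termination depth; the record shape here is an assumption:

```python
from collections import defaultdict

def per_round_accuracy(episodes):
    """Hypothesis accuracy bucketed by the round an episode terminated at."""
    totals, hits = defaultdict(int), defaultdict(int)
    for ep in episodes:
        totals[ep["final_round"]] += 1
        hits[ep["final_round"]] += int(ep["correct"])
    return {d: hits[d] / totals[d] for d in sorted(totals)}
```

A curve that rises with depth, as reported for the RL model, means the episodes that investigate longest are also the ones that converge to the correct hypothesis.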
 
 
64
 
65
+ ## What RL Teaches
66
 
67
+ Behavioral analysis of the trained policy reveals three learned capabilities absent in the SFT baseline.
68
 
69
+ **Search-then-exploit.** The RL model issues `skip_search` on 76.5% of search turns, learning to search once or twice to populate the candidate pool and then exploit existing candidates rather than spend budget on redundant searches. High-reward trajectories show strategic hypothesis switching -- searching for one system, pivoting to another, then returning.
70
 
71
+ **Suspect-then-decide validation.** The clearest RL signature is learned evidence escalation. Rather than blanket-rejecting all evidence (the SFT default), the RL model develops a two-step pattern: flag evidence as `suspect` to gather additional signal, then commit to `trust` or `reject`. Among the highest-reward trajectories, 100% use this pattern; among the lowest-reward, 0% do.
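The pattern can be caricatured as a two-pass decision rule. Only the three action names (`trust`, `suspect`, `reject`) come from the card; the confidence input and thresholds are illustrative assumptions:

```python
def validation_decision(confidence, previously_suspected, lo=0.3, hi=0.7):
    """Suspect-then-decide: flag ambiguous evidence to gather more signal,
    then commit on the second look. Thresholds are illustrative only."""
    if not previously_suspected:
        if confidence >= hi:
            return "trust"
        if confidence <= lo:
            return "reject"
        return "suspect"  # ambiguous: escalate instead of committing blindly
    # Second pass on previously suspected evidence: commit either way.
    return "trust" if confidence >= 0.5 else "reject"
```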
72
+
73
+ **Revision as a failure signal, not a tool.** In the SFT model, belief revision correlates with incorrect outcomes (wrong episodes average 2.1 revise turns vs 0.4 for correct). After RL, the 10 episodes that flipped from wrong to correct reduced their revision turns from 1.8 to 0.6. RL eliminates unnecessary revision by improving upstream search and validation, so the model arrives at the correct hypothesis without needing to update beliefs.
74
 
75
  ## Intended Use
76
 
77
  This model is intended for research use in:
78
+ - Autonomous science agents requiring structured decision-making
79
+ - Budget-constrained experimental planning and evidence triage
80
+ - Multi-step scientific judgment under uncertainty
81
+ - Studies of post-training for sequential decision-making
 
82
 
83
+ It operates as a judgment policy within a larger workflow that provides candidate generation, structured evidence, and domain-specific verification oracles.

84
 
85
  ## Limitations
86
 
87
+ - **Single training seed.** Results are from one training run. The learning curve shape (acceleration at steps 50--75) has not been validated across seeds.
88
+ - **Evaluation scale.** The primary evaluation uses 29 held-out environments with pass@1. Per-budget and per-difficulty analyses have limited statistical power.
89
+ - **Staged verifier.** Cheap and medium evidence stages are transforms of `E_hull`, not independent physical measurements.
90
+ - **Persistent failures.** 6 of 29 held-out episodes remain incorrect. Failure modes include budget exhaustion on sparse search pools, RL-induced regressions on easy specifications, and contaminated evidence that passes the suspect filter.
91
+ - **Not a substitute for expert review.** This model should not replace physical simulation, laboratory verification, or domain expertise in high-stakes scientific decisions.

92
 
93
  ## Usage
94
 
 
124
 
125
  ## Citation
126
 
 
 
127
  ```bibtex
128
  @misc{barnes2026trainingscientificjudgment,
129
  title={Training Scientific Judgment with Verified Environments for Autonomous Science},
 
135
 
136
  ## Acknowledgments
137
 
138
+ Dynamical-30B-A3B is derived from `Qwen/Qwen3-30B-A3B-Instruct-2507` and retains the upstream Apache 2.0 license. Training infrastructure provided by [Tinker](https://tinker.computer).