Update model card for v2 open-world RL step-100
README.md
# Dynamical-30B-A3B

A trained judgment layer for autonomous scientific workflows. Starting from `Qwen/Qwen3-30B-A3B-Instruct-2507`, this model was trained with multi-turn reinforcement learning in verified campaign environments to improve sequential decision-making under uncertainty at the planner-verifier boundary.

The target capability is not general reasoning or autonomous science. It is the decision-making core that determines whether a larger scientific system behaves intelligently when search, evidence, cost, and belief updates are all coupled: selecting which candidate to investigate, evaluating whether evidence should be trusted or escalated, and revising hypotheses as conflicting results accumulate.

## Training

**Base model:** Qwen3-30B-A3B-Instruct-2507 (30B total / 3B active MoE)

**Stage 1 -- Supervised fine-tuning:** LoRA rank 32 on 3,861 rollout rows from GPT-5.4 teacher demonstrations across 128 closed-world environments. SFT establishes the action contract and baseline scientific behavior.

**Stage 2 -- Trajectory GRPO on open-world environments:** 247 training environments (741 entries across 3 seeds), 100 gradient steps with G=4. The RL stage teaches experiential decision-making: when to search vs. exploit existing candidates, when to trust vs. escalate evidence, and when to revise beliefs.
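
Trajectory-level GRPO assigns one scalar reward to each complete rollout and normalizes it within its G-sample group instead of learning a value function. A minimal sketch of the group-relative advantage, assuming the standard GRPO z-score normalization (the function name is illustrative):

```python
def grpo_advantages(group_rewards, eps=1e-8):
    """Group-relative advantages for one environment's G rollouts.

    Each trajectory receives one scalar reward; its advantage is the
    z-score within the group, so only relative quality carries signal.
    """
    g = len(group_rewards)  # G = 4 in the training run described above
    mean = sum(group_rewards) / g
    std = (sum((r - mean) ** 2 for r in group_rewards) / g) ** 0.5
    return [(r - mean) / (std + eps) for r in group_rewards]
```

Every token in a rollout then shares its trajectory's advantage, which is what makes credit assignment trajectory-level rather than per-turn.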

The curriculum uses a staged bucket schedule over environment difficulty:

- Steps 0--9: anchor-heavy warmup (75% anchor, 20% challenge)
- Steps 10--24: broadening (40% anchor, 30% challenge, 25% stress-discriminative)
- Steps 25--100: hard transfer regime (25% anchor, 30% challenge, 35% stress-discriminative, 10% stress-search-fragile)

Budget tiers expand across stages from (5, 8) to the full range (5, 8, 11, 14, 18).
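
The schedule above can be read as a per-step sampling distribution over difficulty buckets. The sketch below is an illustrative reconstruction: the bucket names and weights come from the card, but the card leaves roughly 5% of the mass unspecified in the first two stages, which is assigned here to a placeholder `"other"` bucket.

```python
import random

# Step ranges and named weights from the card; the "other" bucket
# absorbs the ~5% the card leaves unspecified in the first two stages.
SCHEDULE = [
    (range(0, 10), {"anchor": 0.75, "challenge": 0.20, "other": 0.05}),
    (range(10, 25), {"anchor": 0.40, "challenge": 0.30,
                     "stress-discriminative": 0.25, "other": 0.05}),
    (range(25, 101), {"anchor": 0.25, "challenge": 0.30,
                      "stress-discriminative": 0.35,
                      "stress-search-fragile": 0.10}),
]

def sample_bucket(step, rng=random):
    """Draw a difficulty bucket for one rollout at a given RL step."""
    for steps, weights in SCHEDULE:
        if step in steps:
            names = list(weights)
            return rng.choices(names, weights=[weights[n] for n in names])[0]
    raise ValueError(f"step {step} is outside the 0-100 schedule")
```

At step 5 roughly three of four sampled environments are anchors; by step 25 the majority of samples come from the stress and challenge buckets.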

**Reward:** Hybrid oracle + rubric. A physics-grounded oracle (thermodynamic stability from the MP2020 convex hull) provides manipulation-proof outcome reward. Rubric-based process supervision scores search quality, validation discipline, escalation efficiency, and revision quality via Gemini 3.1 Pro, gated on correctness.
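
A minimal sketch of that hybrid shaping, with the rubric term gated on oracle correctness so process polish can never compensate for a wrong outcome. The weight and the 0-1 score ranges are illustrative assumptions, not the released training values:

```python
def hybrid_reward(oracle_correct, rubric_scores, outcome_weight=0.7):
    """Combine manipulation-proof outcome reward with gated process reward.

    oracle_correct: bool from the thermodynamic stability oracle.
    rubric_scores: per-dimension scores in [0, 1] (search quality,
        validation discipline, escalation efficiency, revision quality).
    """
    outcome = 1.0 if oracle_correct else 0.0
    process = sum(rubric_scores) / len(rubric_scores)
    # Gate: the rubric term contributes only when the outcome is correct.
    gated = process if oracle_correct else 0.0
    return outcome_weight * outcome + (1.0 - outcome_weight) * gated
```

The gate is what makes the rubric hard to exploit: a trajectory with immaculate process but a wrong final hypothesis still scores zero.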

This release corresponds to the **step-100 merged checkpoint**.
## Results

Primary metric: **hypothesis accuracy** -- the fraction of episodes where the model's highest-posterior hypothesis matches the oracle ground truth after all evidence rounds.
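
The metric can be stated precisely in a few lines; the episode representation here (a posterior dict over hypotheses plus an oracle label) is assumed for illustration:

```python
def hypothesis_accuracy(episodes):
    """Fraction of episodes whose highest-posterior hypothesis matches
    the oracle ground truth after the final evidence round.

    episodes: iterable of (posterior, truth) pairs, where posterior
    maps hypothesis -> probability.
    """
    episodes = list(episodes)
    hits = sum(max(p, key=p.get) == truth for p, truth in episodes)
    return hits / len(episodes)
```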

### Held-out learning curve (29 open-world environments, pass@1)

| Checkpoint | Hypothesis Accuracy | Mean Reward | Parse Rate |
|---|---|---|---|
| Step 0 (SFT baseline) | 55.17% (16/29) | 0.3404 | 0.978 |
| Step 25 | 65.52% (19/29) | 0.4076 | 0.982 |
| Step 50 | 68.97% (20/29) | 0.4460 | 0.993 |
| Step 75 | **79.31% (23/29)** | 0.4801 | 0.981 |
| Step 100 | **79.31% (23/29)** | 0.5001 | 0.983 |

Total gain from SFT baseline: **+24.14 percentage points** across 100 RL steps. Step 100 matches step 75 in accuracy but achieves 4.2% higher reward and 33% lower truncation, indicating more efficient trajectories at the same correctness level.

### Per-round accuracy (survival curve)

RL fundamentally changes how accuracy relates to investigation depth. The SFT model's accuracy is flat-to-declining across campaign rounds (0.55 at round 0, dropping to 0.33 by round 10+). The RL model's accuracy increases monotonically with depth: 0.79 at round 0, rising to 1.0 by round 7. Episodes that investigate deeper are the ones that converge to the correct answer.
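
A sketch of how such a survival curve is tabulated, assuming each episode records the round at which it committed its final hypothesis and whether that hypothesis was correct:

```python
from collections import defaultdict

def per_round_accuracy(episodes):
    """Accuracy bucketed by final investigation round.

    episodes: iterable of (final_round, correct) pairs.
    Returns {round: accuracy}, keyed in ascending round order.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for rnd, correct in episodes:
        totals[rnd] += 1
        hits[rnd] += int(correct)
    return {r: hits[r] / totals[r] for r in sorted(totals)}
```

A flat-to-declining curve (the SFT profile) means deeper episodes are the confused ones; a rising curve (the RL profile) means depth is being spent where it pays off.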

## What RL Teaches

Behavioral analysis of the trained policy reveals three learned capabilities absent in the SFT baseline.

**Search-then-exploit.** The RL model uses `skip_search` in 76.5% of search turns, learning to search once or twice to populate the candidate pool, then exploit existing candidates rather than wasting budget on redundant search. High-reward trajectories show strategic hypothesis switching -- searching for one system, pivoting to another, then returning.

**Suspect-then-decide validation.** The clearest RL signature is learned evidence escalation. Rather than blanket-rejecting all evidence (the SFT default), the RL model develops a two-step pattern: flag evidence as `suspect` to gather additional signal, then commit to `trust` or `reject`. Among the highest-reward trajectories, 100% use this pattern; among the lowest-reward, 0% do.

**Revision as a failure signal, not a tool.** In the SFT model, belief revision correlates with incorrect outcomes (wrong episodes average 2.1 revise turns vs 0.4 for correct). After RL, the 10 episodes that flipped from wrong to correct reduced their revision turns from 1.8 to 0.6. RL eliminates unnecessary revision by improving upstream search and validation, so the model arrives at the correct hypothesis without needing to update beliefs.
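
The suspect-then-decide signature can be checked mechanically over a trajectory's validation actions. The action vocabulary (`suspect`, `trust`, `reject`) follows the card; the flat-list trajectory encoding is an assumption for illustration:

```python
def uses_suspect_then_decide(validation_actions):
    """True if some evidence is first flagged `suspect` and a later
    turn commits to `trust` or `reject` -- the two-step pattern that
    separates the highest- from the lowest-reward trajectories."""
    seen_suspect = False
    for action in validation_actions:
        if action == "suspect":
            seen_suspect = True
        elif seen_suspect and action in ("trust", "reject"):
            return True
    return False
```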

## Intended Use

This model is intended for research use in:

- Autonomous science agents requiring structured decision-making
- Budget-constrained experimental planning and evidence triage
- Multi-step scientific judgment under uncertainty
- Studies of post-training for sequential decision-making

It operates as a judgment policy within a larger workflow that provides candidate generation, structured evidence, and domain-specific verification oracles.
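
That division of labor can be made concrete as an interface: the surrounding workflow supplies candidates, evidence, and oracles, while the model implements only the decision surface. The method names and signatures below are illustrative, not a published API:

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class JudgmentPolicy(Protocol):
    """The decision surface this model occupies. Candidate generation,
    evidence production, and verification oracles live outside it."""

    def prioritize(self, candidates: list) -> str:
        """Choose which candidate to investigate next."""

    def assess(self, evidence: dict) -> str:
        """Return 'trust', 'suspect', or 'reject' for one evidence item."""

    def revise(self, posterior: dict) -> str:
        """Commit to or revise the working hypothesis."""
```

Anything satisfying this protocol (the trained model, a scripted baseline, a human in the loop) can be dropped into the same campaign harness.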

## Limitations

- **Single training seed.** Results are from one training run. The learning curve shape (acceleration at steps 50--75) has not been validated across seeds.
- **Evaluation scale.** The primary evaluation uses 29 held-out environments with pass@1. Per-budget and per-difficulty analyses have limited statistical power.
- **Staged verifier.** Cheap and medium evidence stages are transforms of `E_hull`, not independent physical measurements.
- **Persistent failures.** 6 of 29 held-out episodes remain incorrect. Failure modes include budget exhaustion on sparse search pools, RL-induced regressions on easy specifications, and contaminated evidence that passes the suspect filter.
- **Not a substitute for expert review.** This model should not replace physical simulation, laboratory verification, or domain expertise in high-stakes scientific decisions.

## Usage
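
The card's full usage snippet is collapsed in this view; the sketch below is a reconstruction with `transformers`, assuming a standard chat-template generate flow. The repo id placeholder and the helper name are not from the card; only the final decode call mirrors the visible context.

```python
def generate_decision(prompt, model_id="Dynamical-30B-A3B"):
    """Load the model and generate one decision turn.

    model_id is a placeholder -- replace it with the actual Hub repo id.
    """
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype="auto", device_map="auto"
    )
    inputs = tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        add_generation_prompt=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=512)
    # Decode only the newly generated tokens, as in the original snippet.
    return tokenizer.decode(
        outputs[0][len(inputs.input_ids[0]):], skip_special_tokens=True
    )
```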

## Citation

```bibtex
@misc{barnes2026trainingscientificjudgment,
  title={Training Scientific Judgment with Verified Environments for Autonomous Science},
  ...
}
```

## Acknowledgments

Dynamical-30B-A3B is derived from `Qwen/Qwen3-30B-A3B-Instruct-2507` and retains the upstream Apache 2.0 license. Training infrastructure provided by [Tinker](https://tinker.computer).