Jarrodbarnes committed on
Commit
89e0e29
·
verified ·
1 Parent(s): 5e3f92b

Update model card for v2 open-world RL step-100

Files changed (1)
  1. README.md +42 -74
README.md CHANGED
@@ -19,106 +19,76 @@ tags:
19
 
20
  # Dynamical-30B-A3B
21
 
22
- Dynamical-30B-A3B is a trained judgment layer for autonomous scientific workflows. It is a merged reinforcement-learning checkpoint built from `Qwen/Qwen3-30B-A3B-Instruct-2507` and trained in verified campaign environments that require sequential `select`, `validate`, and `revise` decisions under explicit budget constraints.
23
 
24
- The target capability is scientific judgment: selecting which candidate to investigate next, evaluating whether intermediate evidence is trustworthy or should be escalated to higher-fidelity verification, and revising beliefs about competing hypotheses as results accumulate. In this setting, the signal that determines whether a decision was correct lives in the process, not only in the final outcome.
25
 
26
- ## Summary
27
 
28
- Autonomous scientific campaigns are stateful. A decision at round `t` changes which evidence becomes available at round `t+1`, and a wrong belief early in the campaign compounds through later decisions. Dynamical-30B-A3B was trained for this coupled loop rather than for isolated single-step reasoning.
29
 
30
- The model was trained in verified campaign environments built from:
31
- - a candidate source
32
- - a thermodynamic stability oracle
33
- - a staged verifier ladder with explicit budget costs
34
 
35
- In the materials-discovery instantiation used here, candidate crystal structures are generated with MatterGen, evaluated against the MP2020 convex hull to produce `E_hull`, and converted into staged evidence with increasing cost and fidelity.
36
 
37
- ## Training Recipe

38
 
39
- The base model is `Qwen/Qwen3-30B-A3B-Instruct-2507`, a mixture-of-experts language model with 30B total parameters and 3B active parameters per token.
40
 
41
- Training proceeded in two stages:
42
- - SFT with LoRA rank 32 on 3,861 rollout rows from GPT-5.4 teacher demonstrations across 128 closed-world environments
43
- - Multi-turn RL with trajectory-level GRPO on 60 open-world environments
44
 
45
- The RL curriculum introduced harder budget tiers progressively:
46
- - steps 0-19: budget 9
47
- - steps 20-39: budgets 9 and 7
48
- - steps 40-100: budgets 5, 7, and 9
49
-
50
- The released model corresponds to the step-50 merged RL checkpoint.
51
-
52
- ## What This Model Does
53
-
54
- Dynamical-30B-A3B is designed to operate at the decision point in an automated scientific workflow. It does not generate candidates, run experiments, or compute stability. It decides:
55
- - which candidate to prioritize
56
- - whether to trust, flag, or reject evidence
57
- - when to commit to or revise a hypothesis
58
-
59
- The model is therefore best understood as a judgment policy, not as a simulator or oracle.
60
 
61
  ## Results
62
 
63
- The primary metric in the paper is hypothesis accuracy: the fraction of episodes where the model's highest-posterior hypothesis after all evidence rounds matches the oracle ground truth.
64
 
65
- On 15 held-out open-world environments with novel crystal structures:
66
- - Base: 46.7%
67
- - SFT: 53.3%
68
- - Dynamical-30B-A3B: 60.0%
69
 
70
- On 30 held-out closed-world environments:
71
- - Base: 32.2%
72
- - SFT: 40.0%
73
- - Dynamical-30B-A3B: 42.2%

74
 
75
- This closed-world result is reported as a retention check: open-world RL training does not produce regression on the training-adjacent domain.
76
 
77
- On MADE, an independent closed-loop materials-discovery benchmark:
78
- - formula recall: 0.156
79
- - structure recall: 0.085
80
- - stable efficiency: 0.233
81
- - AUDC: 0.279
82
 
83
- Relative to GPT-5.4 on MADE:
84
- - formula recall trails by 54%
85
- - structure recall exceeds by 67%
86
 
87
- The paper interprets this decomposition as evidence that judgment and knowledge are independently closable gaps: RL closes part of the judgment gap, while compositional exploration remains knowledge-limited.
88
 
89
- ## Mechanistic Findings
90
 
91
- The main behavioral gain is not improved evidence discrimination. In the paper's signal-detection analysis, RL decreases `d'` from 0.770 to 0.471 while shifting the decision criterion from 1.606 to 0.801. The model becomes better calibrated about when to reject or escalate evidence; it does not become better at perceptually distinguishing admissible from inadmissible evidence.
92
 
93
- The paper also reports an emergent fast revision pathway. After RL, 11.8% of belief updates are "silent" low-uncertainty revisions, and these flip the leading hypothesis at 3.8x the rate of deliberate updates. This pathway is absent in the base model.
 
 
94
 
95
  ## Intended Use
96
 
97
  This model is intended for research use in:
98
- - autonomous science agents
99
- - budget-constrained experimental planning
100
- - evidence escalation policies
101
- - multi-step scientific decision-making
102
- - studies of post-training for sequential judgment
103
 
104
- It is most appropriate when paired with an external system that already provides:
105
- - candidate generation
106
- - structured evidence
107
- - domain-specific verification oracles
108
 
109
  ## Limitations
110
 
111
- The limitations reported in the paper should be treated as first-order caveats.
112
-
113
- - Checkpoint selection: step 50 was selected from 4 periodic evaluations on the same 15 held-out open-world environments, introducing mild optimistic bias.
114
- - Discrimination: RL teaches calibration, not discrimination. The model is better calibrated about when to reject, but not better at distinguishing admissible from inadmissible evidence.
115
- - Statistical power: the aggregate open-world evaluation uses 15 environments, and per-budget analyses use 5 environments per tier.
116
- - Curriculum: budget-5 environments received less training exposure than budget-9 environments.
117
- - Staged verifier: cheap and medium evidence are transforms of `E_hull`, not independent physical measurements.
118
- - Formula recall: the remaining gap in compositional exploration is not addressed by environment training alone.
119
- - External transfer: the SFT checkpoint did not complete MADE reliably, limiting SFT-vs-RL comparison there.
120
-
121
- This model should not be used as a substitute for physical simulation, laboratory verification, or expert review in high-stakes scientific settings.
122
 
123
  ## Usage
124
 
@@ -154,8 +124,6 @@ print(tokenizer.decode(outputs[0][len(inputs.input_ids[0]):], skip_special_token
154
 
155
  ## Citation
156
 
157
- If you use this model, please cite:
158
-
159
  ```bibtex
160
  @misc{barnes2026trainingscientificjudgment,
161
  title={Training Scientific Judgment with Verified Environments for Autonomous Science},
@@ -167,4 +135,4 @@ If you use this model, please cite:
167
 
168
  ## Acknowledgments
169
 
170
- Dynamical-30B-A3B is derived from `Qwen/Qwen3-30B-A3B-Instruct-2507` and retains the upstream Apache 2.0 license.
 
19
 
20
  # Dynamical-30B-A3B
21
 
22
+ A trained judgment layer for autonomous scientific workflows. Starting from `Qwen/Qwen3-30B-A3B-Instruct-2507`, this model was trained with multi-turn reinforcement learning in verified campaign environments to improve sequential decision-making under uncertainty at the planner-verifier boundary.
23
 
24
+ The target capability is not general reasoning or autonomous science. It is the decision-making core that determines whether a larger scientific system behaves intelligently when search, evidence, cost, and belief updates are all coupled: selecting which candidate to investigate, evaluating whether evidence should be trusted or escalated, and revising hypotheses as conflicting results accumulate.
25
 
26
+ ## Training
27
 
28
+ **Base model:** Qwen3-30B-A3B-Instruct-2507 (30B total / 3B active MoE)
29
 
30
+ **Stage 1 -- Supervised fine-tuning:** LoRA rank 32 on 3,861 rollout rows from GPT-5.4 teacher demonstrations across 128 closed-world environments. SFT establishes the action contract and baseline scientific behavior.

31
 
32
+ **Stage 2 -- Trajectory GRPO on open-world environments:** 247 training environments (741 entries across 3 seeds), 100 gradient steps with group size G=4. The RL stage teaches experiential decision-making: when to search vs. exploit existing candidates, when to trust vs. escalate evidence, and when to revise beliefs.
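As a rough sketch, trajectory-level GRPO scores each of the G rollouts of an environment against its own group. The reward values and normalization constant below are illustrative assumptions, not the actual training code:

```python
import statistics

def group_advantages(rewards, eps=1e-6):
    """Group-relative advantage, GRPO-style: normalize each trajectory's
    scalar episode reward against the other rollouts of the same group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# One environment, G=4 rollouts, one scalar reward per full trajectory.
advantages = group_advantages([0.34, 0.41, 0.45, 0.48])
# Above-average trajectories get positive advantage and are reinforced;
# below-average ones are suppressed.
```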
33
 
34
+ The curriculum uses a staged bucket schedule over environment difficulty:
35
+ - Steps 0--9: anchor-heavy warmup (75% anchor, 20% challenge)
36
+ - Steps 10--24: broadening (40% anchor, 30% challenge, 25% stress-discriminative)
37
+ - Steps 25--100: hard transfer regime (25% anchor, 30% challenge, 35% stress-discriminative, 10% stress-search-fragile)
38
 
39
+ Budget tiers expand across stages from (5, 8) to the full range (5, 8, 11, 14, 18).
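The schedule above can be read as a step-indexed mixture over difficulty buckets. The helper below is a hypothetical sketch: bucket names and weights come from the schedule, while the sampling code itself is an assumption (note `random.choices` treats weights as relative, so the early mixtures need not sum to 1):

```python
import random

# (first_step, {bucket: weight}) taken from the staged schedule above.
SCHEDULE = [
    (0,  {"anchor": 0.75, "challenge": 0.20}),
    (10, {"anchor": 0.40, "challenge": 0.30, "stress-discriminative": 0.25}),
    (25, {"anchor": 0.25, "challenge": 0.30,
          "stress-discriminative": 0.35, "stress-search-fragile": 0.10}),
]

def sample_bucket(step, rng=random):
    """Difficulty bucket to draw a training environment from at this step."""
    weights = next(w for start, w in reversed(SCHEDULE) if step >= start)
    buckets = list(weights)
    return rng.choices(buckets, weights=[weights[b] for b in buckets], k=1)[0]
```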
40
 
41
+ **Reward:** Hybrid oracle + rubric. A physics-grounded oracle (thermodynamic stability from the MP2020 convex hull) provides manipulation-proof outcome reward. Rubric-based process supervision scores search quality, validation discipline, escalation efficiency, and revision quality via Gemini 3.1 Pro, gated on correctness.
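A minimal sketch of one plausible gating, assuming equal rubric weighting and illustrative coefficients: process reward is only paid out when the oracle-verified outcome is correct, which removes the incentive to game the rubric:

```python
def hybrid_reward(oracle_correct, rubric_scores,
                  w_outcome=1.0, w_process=0.5):
    """Oracle outcome reward plus rubric process reward, gated on correctness.

    oracle_correct: bool from the thermodynamic-stability oracle (E_hull
        against the MP2020 convex hull).
    rubric_scores: process scores in [0, 1] for search quality, validation
        discipline, escalation efficiency, and revision quality.
    """
    if not oracle_correct:
        # Gate: no process reward can be earned on a wrong final hypothesis.
        return 0.0
    process = sum(rubric_scores.values()) / len(rubric_scores)
    return w_outcome + w_process * process
```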
 
 
42
 
43
+ This release corresponds to the **step-100 merged checkpoint**.

44
 
45
  ## Results
46
 
47
+ Primary metric: **hypothesis accuracy** -- the fraction of episodes where the model's highest-posterior hypothesis matches the oracle ground truth after all evidence rounds.
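Per episode the metric reduces to an argmax match against the oracle; the record fields below are illustrative, not the evaluation harness's actual schema:

```python
def hypothesis_accuracy(episodes):
    """Fraction of episodes whose highest-posterior hypothesis after the
    final evidence round matches the oracle ground truth."""
    hits = sum(
        max(ep["posterior"], key=ep["posterior"].get) == ep["oracle_truth"]
        for ep in episodes
    )
    return hits / len(episodes)

episodes = [
    {"posterior": {"H1": 0.7, "H2": 0.3}, "oracle_truth": "H1"},  # correct
    {"posterior": {"H1": 0.4, "H2": 0.6}, "oracle_truth": "H1"},  # wrong
]
# hypothesis_accuracy(episodes) -> 0.5
```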
48
 
49
+ ### Held-out learning curve (29 open-world environments, pass@1)

50
 
51
+ | Checkpoint | Hypothesis Accuracy | Mean Reward | Parse Rate |
52
+ |---|---|---|---|
53
+ | Step 0 (SFT baseline) | 55.17% (16/29) | 0.3404 | 0.978 |
54
+ | Step 25 | 65.52% (19/29) | 0.4076 | 0.982 |
55
+ | Step 50 | 68.97% (20/29) | 0.4460 | 0.993 |
56
+ | Step 75 | **79.31% (23/29)** | 0.4801 | 0.981 |
57
+ | Step 100 | **79.31% (23/29)** | 0.5001 | 0.983 |
58
 
59
+ Total gain over the SFT baseline: **+24.14 percentage points** across 100 RL steps. Step 100 matches step 75 in accuracy but achieves a 4.2% higher mean reward and a 33% lower truncation rate, indicating more efficient trajectories at the same correctness level.
60
 
61
+ ### Per-round accuracy (survival curve)

62
 
63
+ RL fundamentally changes how accuracy relates to investigation depth. The SFT model's accuracy is flat-to-declining across campaign rounds (0.55 at round 0, dropping to 0.33 by round 10+). The RL model's accuracy increases monotonically with depth: 0.79 at round 0, rising to 1.0 by round 7. Episodes that investigate deeper are the ones that converge to the correct answer.
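The curve can be recomputed from per-episode logs by bucketing on termination depth; the record shape here is an assumption:

```python
from collections import defaultdict

def per_round_accuracy(episodes):
    """Hypothesis accuracy bucketed by the round an episode terminated at."""
    totals, hits = defaultdict(int), defaultdict(int)
    for ep in episodes:
        totals[ep["final_round"]] += 1
        hits[ep["final_round"]] += int(ep["correct"])
    return {d: hits[d] / totals[d] for d in sorted(totals)}
```

A curve that rises with depth, as reported for the RL model, means the episodes that investigate longest are also the ones that converge to the correct hypothesis.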
 
 
64
 
65
+ ## What RL Teaches
66
 
67
+ Behavioral analysis of the trained policy reveals three learned capabilities absent in the SFT baseline.
68
 
69
+ **Search-then-exploit.** The RL model issues `skip_search` on 76.5% of search turns, learning to search once or twice to populate the candidate pool and then exploit existing candidates rather than spend budget on redundant searches. High-reward trajectories show strategic hypothesis switching -- searching for one system, pivoting to another, then returning.
70
 
71
+ **Suspect-then-decide validation.** The clearest RL signature is learned evidence escalation. Rather than blanket-rejecting all evidence (the SFT default), the RL model develops a two-step pattern: flag evidence as `suspect` to gather additional signal, then commit to `trust` or `reject`. Among the highest-reward trajectories, 100% use this pattern; among the lowest-reward, 0% do.
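The pattern can be caricatured as a two-pass decision rule. Only the three action names (`trust`, `suspect`, `reject`) come from the card; the confidence input and thresholds are illustrative assumptions:

```python
def validation_decision(confidence, previously_suspected, lo=0.3, hi=0.7):
    """Suspect-then-decide: flag ambiguous evidence to gather more signal,
    then commit on the second look. Thresholds are illustrative only."""
    if not previously_suspected:
        if confidence >= hi:
            return "trust"
        if confidence <= lo:
            return "reject"
        return "suspect"  # ambiguous: escalate instead of committing blindly
    # Second pass on previously suspected evidence: commit either way.
    return "trust" if confidence >= 0.5 else "reject"
```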
72
+
73
+ **Revision as a failure signal, not a tool.** In the SFT model, belief revision correlates with incorrect outcomes (wrong episodes average 2.1 revise turns vs 0.4 for correct). After RL, the 10 episodes that flipped from wrong to correct reduced their revision turns from 1.8 to 0.6. RL eliminates unnecessary revision by improving upstream search and validation, so the model arrives at the correct hypothesis without needing to update beliefs.
74
 
75
  ## Intended Use
76
 
77
  This model is intended for research use in:
78
+ - Autonomous science agents requiring structured decision-making
79
+ - Budget-constrained experimental planning and evidence triage
80
+ - Multi-step scientific judgment under uncertainty
81
+ - Studies of post-training for sequential decision-making
 
82
 
83
+ It operates as a judgment policy within a larger workflow that provides candidate generation, structured evidence, and domain-specific verification oracles.

84
 
85
  ## Limitations
86
 
87
+ - **Single training seed.** Results are from one training run. The learning curve shape (acceleration at steps 50--75) has not been validated across seeds.
88
+ - **Evaluation scale.** The primary evaluation uses 29 held-out environments with pass@1. Per-budget and per-difficulty analyses have limited statistical power.
89
+ - **Staged verifier.** Cheap and medium evidence stages are transforms of `E_hull`, not independent physical measurements.
90
+ - **Persistent failures.** 6 of 29 held-out episodes remain incorrect. Failure modes include budget exhaustion on sparse search pools, RL-induced regressions on easy specifications, and contaminated evidence that passes the suspect filter.
91
+ - **Not a substitute for expert review.** This model should not replace physical simulation, laboratory verification, or domain expertise in high-stakes scientific decisions.

92
 
93
  ## Usage
94
 
 
124
 
125
  ## Citation
126
 
 
 
127
  ```bibtex
128
  @misc{barnes2026trainingscientificjudgment,
129
  title={Training Scientific Judgment with Verified Environments for Autonomous Science},
 
135
 
136
  ## Acknowledgments
137
 
138
+ Dynamical-30B-A3B is derived from `Qwen/Qwen3-30B-A3B-Instruct-2507` and retains the upstream Apache 2.0 license. Training infrastructure provided by [Tinker](https://tinker.computer).