---
language:
- en
license: apache-2.0
license_link: https://huggingface.co/Qwen/Qwen3-30B-A3B-Instruct-2507/blob/main/LICENSE
library_name: transformers
pipeline_tag: text-generation
base_model: Qwen/Qwen3-30B-A3B-Instruct-2507
tags:
- qwen3
- qwen3-moe
- reinforcement-learning
- scientific-judgment
- autonomous-science
- materials-discovery
- verified-environments
- multi-turn-reasoning
---
# Dynamical-30B-A3B
A trained judgment layer for autonomous scientific workflows. Starting from `Qwen/Qwen3-30B-A3B-Instruct-2507`, this model was trained with multi-turn reinforcement learning in verified campaign environments to improve sequential decision-making under uncertainty at the planner-verifier boundary.
The target capability is not general reasoning or autonomous science. It is the decision-making core that determines whether a larger scientific system behaves intelligently when search, evidence, cost, and belief updates are all coupled: selecting which candidate to investigate, evaluating whether evidence should be trusted or escalated, and revising hypotheses as conflicting results accumulate.
## Release Links
- **Paper PDF:** [Training Scientific Judgment with Verified Environments for Autonomous Science](https://github.com/Dynamical-Systems-Research/training-scientific-judgment/blob/main/paper/training-scientific-judgment.pdf)
- **Blog post:** [Training Scientific Judgment](https://dynamicalsystems.ai/blog/training-scientific-judgment)
- **Public repo:** [Dynamical-Systems-Research/training-scientific-judgment](https://github.com/Dynamical-Systems-Research/training-scientific-judgment)
- **Released evaluation bundle:** [repo `data/open_world/`](https://github.com/Dynamical-Systems-Research/training-scientific-judgment/tree/main/data/open_world)
- **Search assets:** [`Dynamical-Systems/crystalite-base`](https://huggingface.co/Dynamical-Systems/crystalite-base), [`Dynamical-Systems/crystalite-balanced`](https://huggingface.co/Dynamical-Systems/crystalite-balanced)
This model is the released **scientific-judgment policy** used in the final paper and blog post. The associated Crystalite checkpoints are released as supporting search-side assets for the open-world campaign provenance. The default public reproducibility path uses the frozen serialized campaign bundle from the public repo.
## Training
**Base model:** Qwen3-30B-A3B-Instruct-2507 (30B total / 3B active MoE)
**Stage 1 -- Supervised fine-tuning:** LoRA rank 32 on 3,861 rollout rows from GPT-5.4 teacher demonstrations across 128 closed-world environments. SFT establishes the action contract and baseline scientific behavior.
**Stage 2 -- Trajectory GRPO on open-world environments:** 247 training environments (741 entries across 3 seeds), 100 gradient steps with G=4. The RL stage teaches experiential decision-making: when to search vs. exploit existing candidates, when to trust vs. escalate evidence, and when to revise beliefs.
The curriculum uses a staged bucket schedule over environment difficulty:
- Steps 0--9: anchor-heavy warmup (75% anchor, 20% challenge)
- Steps 10--24: broadening (40% anchor, 30% challenge, 25% stress-discriminative)
- Steps 25--100: hard transfer regime (25% anchor, 30% challenge, 35% stress-discriminative, 10% stress-search-fragile)
Budget tiers expand across stages from (5, 8) to the full range (5, 8, 11, 14, 18).
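The staged bucket schedule can be sketched as a simple step-indexed sampler. Bucket names and weights come from the list above; note that the listed weights for the first two stages do not sum to 1, so this sketch simply renormalizes them (as `random.choices` does with relative weights). This is an illustrative assumption, not the released training code.

```python
import random

# Staged bucket schedule from the card: (exclusive step upper bound, weights).
# Stage weights are relative; random.choices normalizes them internally.
SCHEDULE = [
    (10,  {"anchor": 0.75, "challenge": 0.20}),
    (25,  {"anchor": 0.40, "challenge": 0.30, "stress_discriminative": 0.25}),
    (101, {"anchor": 0.25, "challenge": 0.30,
           "stress_discriminative": 0.35, "stress_search_fragile": 0.10}),
]

def sample_bucket(step: int, rng: random.Random) -> str:
    """Pick a difficulty bucket for a given training step under the staged schedule."""
    for end, weights in SCHEDULE:
        if step < end:
            names = list(weights)
            return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]
    raise ValueError(f"step {step} is outside the 100-step schedule")

rng = random.Random(0)
bucket = sample_bucket(0, rng)  # warmup stage: most draws are "anchor"
```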
**Reward:** Hybrid oracle + rubric. A physics-grounded oracle (thermodynamic stability from the MP2020 convex hull) provides manipulation-proof outcome reward. Rubric-based process supervision scores search quality, validation discipline, escalation efficiency, and revision quality via Gemini 3.1 Pro, gated on correctness.
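The reward structure above combines a binary oracle outcome with rubric-scored process quality, with the rubric term gated on correctness. A minimal sketch of that gating, assuming an even outcome/process split and illustrative rubric field names (the paper's exact weighting is not reproduced here):

```python
def hybrid_reward(oracle_correct: bool,
                  rubric_scores: dict,
                  outcome_weight: float = 0.5) -> float:
    """Combine oracle outcome with correctness-gated rubric process reward.

    Gating means process reward only counts when the outcome is correct,
    so the policy cannot farm rubric score on wrong answers.
    """
    outcome = 1.0 if oracle_correct else 0.0
    if oracle_correct and rubric_scores:
        process = sum(rubric_scores.values()) / len(rubric_scores)
    else:
        process = 0.0
    return outcome_weight * outcome + (1.0 - outcome_weight) * process

# Illustrative rubric dimensions from the card's description.
scores = {"search": 0.8, "validation": 1.0, "escalation": 0.6, "revision": 0.6}
r = hybrid_reward(True, scores)   # outcome plus gated process term
```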
This release corresponds to the **step-100 merged checkpoint**.
## Results
Primary metric: **hypothesis accuracy** -- the fraction of episodes where the model's highest-posterior hypothesis matches the oracle ground truth after all evidence rounds.
The final public release is paired with a frozen open-world bundle containing 300 serialized campaigns in total; the primary paper evaluation is reported on the pruned reachable held-out set of 29 campaigns.
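The metric definition above can be made concrete with a small sketch: an episode counts as correct when the highest-posterior hypothesis after the final evidence round matches the oracle label. The episode shape used here is an assumption for illustration, not the released evaluation format.

```python
def hypothesis_accuracy(episodes) -> float:
    """Fraction of episodes whose top-posterior hypothesis matches the oracle.

    episodes: iterable of (posterior, oracle_truth) pairs, where posterior is
    a {hypothesis_id: probability} dict taken after all evidence rounds.
    """
    correct = 0
    total = 0
    for posterior, oracle_truth in episodes:
        best = max(posterior, key=posterior.get)  # highest-posterior hypothesis
        correct += (best == oracle_truth)
        total += 1
    return correct / total

episodes = [
    ({"h1": 0.7, "h2": 0.3}, "h1"),  # correct: top hypothesis matches oracle
    ({"h1": 0.4, "h2": 0.6}, "h1"),  # incorrect: model favors h2
]
print(hypothesis_accuracy(episodes))  # 0.5
```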
### Held-out learning curve (29 open-world environments, pass@1)
| Checkpoint | Hypothesis Accuracy | Mean Reward | Parse Rate |
|---|---|---|---|
| Step 0 (SFT baseline) | 55.17% (16/29) | 0.3404 | 0.978 |
| Step 25 | 65.52% (19/29) | 0.4076 | 0.982 |
| Step 50 | 68.97% (20/29) | 0.4460 | 0.993 |
| Step 75 | **79.31% (23/29)** | 0.4801 | 0.981 |
| Step 100 | **79.31% (23/29)** | 0.5001 | 0.983 |
Total gain over the SFT baseline: **+24.14 percentage points** across 100 RL steps. Step 100 matches step 75 in accuracy but achieves a 4.2% higher mean reward and a 33% lower truncation rate, indicating more efficient trajectories at the same correctness level.
### Per-round accuracy (survival curve)
RL fundamentally changes how accuracy relates to investigation depth. The SFT model's accuracy is flat-to-declining across campaign rounds (0.55 at round 0, dropping to 0.33 by round 10+). The RL model's accuracy increases monotonically with depth: 0.79 at round 0, rising to 1.0 by round 7. Episodes that investigate deeper are the ones that converge to the correct answer.
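The per-round analysis above amounts to bucketing episodes by investigation depth and computing accuracy within each bucket. A minimal sketch, assuming episodes are reduced to `(rounds_used, was_correct)` pairs (an illustrative shape, not the released analysis code):

```python
from collections import defaultdict

def accuracy_by_depth(episodes) -> dict:
    """Accuracy per investigation depth.

    episodes: iterable of (rounds_used, was_correct) pairs.
    Returns {depth: accuracy} sorted by depth, for plotting a survival-style curve.
    """
    buckets = defaultdict(lambda: [0, 0])  # depth -> [n_correct, n_total]
    for rounds_used, was_correct in episodes:
        buckets[rounds_used][0] += int(was_correct)
        buckets[rounds_used][1] += 1
    return {d: c / n for d, (c, n) in sorted(buckets.items())}

curve = accuracy_by_depth([(0, True), (0, False), (7, True)])
print(curve)  # {0: 0.5, 7: 1.0}
```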
## What RL Teaches
Behavioral analysis of the trained policy reveals three learned capabilities absent in the SFT baseline.
**Search-then-exploit.** The RL model uses `skip_search` in 76.5% of search turns, learning to search once or twice to populate the candidate pool, then exploit existing candidates rather than wasting budget on redundant search. High-reward trajectories show strategic hypothesis switching -- searching for one system, pivoting to another, then returning.
**Suspect-then-decide validation.** The clearest RL signature is learned evidence escalation. Rather than blanket-rejecting all evidence (the SFT default), the RL model develops a two-step pattern: flag evidence as `suspect` to gather additional signal, then commit to `trust` or `reject`. Among the highest-reward trajectories, 100% use this pattern; among the lowest-reward, 0% do.
**Revision as a failure signal, not a tool.** In the SFT model, belief revision correlates with incorrect outcomes (wrong episodes average 2.1 revise turns vs 0.4 for correct). After RL, the 10 episodes that flipped from wrong to correct reduced their revision turns from 1.8 to 0.6. RL eliminates unnecessary revision by improving upstream search and validation, so the model arrives at the correct hypothesis without needing to update beliefs.
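The suspect-then-decide escalation pattern described above can be sketched as a small decision rule over evidence confidence: ambiguous evidence is flagged `suspect` once to gather more signal, and the second pass must commit. The thresholds and the scalar confidence signal are illustrative assumptions, not the model's actual internals.

```python
from enum import Enum

class Verdict(Enum):
    TRUST = "trust"
    SUSPECT = "suspect"
    REJECT = "reject"

def escalate(confidence: float, already_suspected: bool,
             trust_at: float = 0.8, reject_at: float = 0.3) -> Verdict:
    """Two-step evidence escalation: flag ambiguous evidence once, then commit."""
    if confidence >= trust_at:
        return Verdict.TRUST
    if confidence <= reject_at:
        return Verdict.REJECT
    # Ambiguous band: first pass defers judgment, second pass must decide.
    if not already_suspected:
        return Verdict.SUSPECT
    return Verdict.TRUST if confidence >= 0.5 else Verdict.REJECT

verdict = escalate(0.6, already_suspected=False)  # first pass: Verdict.SUSPECT
```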
## Intended Use
This model is intended for research use in:
- Autonomous science agents requiring structured decision-making
- Budget-constrained experimental planning and evidence triage
- Multi-step scientific judgment under uncertainty
- Studies of post-training for sequential decision-making
It operates as a judgment policy within a larger workflow that provides candidate generation, structured evidence, and domain-specific verification oracles.
## Limitations
- **Single training seed.** Results are from one training run. The learning curve shape (acceleration at steps 50--75) has not been validated across seeds.
- **Evaluation scale.** The primary evaluation uses 29 held-out environments with pass@1. Per-budget and per-difficulty analyses have limited statistical power.
- **Staged verifier.** Cheap and medium evidence stages are transforms of E_hull, not independent physical measurements.
- **Persistent failures.** 6 of 29 held-out episodes remain incorrect. Failure modes include budget exhaustion on sparse search pools, RL-induced regressions on easy specifications, and contaminated evidence that passes the suspect filter.
- **Not a substitute for expert review.** This model should not replace physical simulation, laboratory verification, or domain expertise in high-stakes scientific decisions.
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "Dynamical-Systems/Dynamical-30B-A3B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
model_id,
torch_dtype="auto",
device_map="auto",
)
messages = [
{
"role": "user",
"content": "Given these candidates and staged measurements, which one should we validate next, and should we trust or escalate the current evidence?"
}
]
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=1024)
print(tokenizer.decode(outputs[0][len(inputs.input_ids[0]):], skip_special_tokens=True))
```
## Citation
```bibtex
@misc{barnes2026trainingscientificjudgment,
title={Training Scientific Judgment with Verified Environments for Autonomous Science},
author={Jarrod Barnes},
year={2026},
note={Technical report preprint}
}
```
## Acknowledgments
Dynamical-30B-A3B is derived from `Qwen/Qwen3-30B-A3B-Instruct-2507` and retains the upstream Apache 2.0 license. Training infrastructure provided by [Tinker](https://tinker.computer).