Upload README.md with huggingface_hub
README.md CHANGED
@@ -23,6 +23,16 @@ A trained judgment layer for autonomous scientific workflows. Starting from `Qwe
 
 The target capability is not general reasoning or autonomous science. It is the decision-making core that determines whether a larger scientific system behaves intelligently when search, evidence, cost, and belief updates are all coupled: selecting which candidate to investigate, evaluating whether evidence should be trusted or escalated, and revising hypotheses as conflicting results accumulate.
 
+## Release Links
+
+- **Paper PDF:** [Training Scientific Judgment with Verified Environments for Autonomous Science](https://github.com/Dynamical-Systems-Research/training-scientific-judgment/blob/main/paper/training-scientific-judgment.pdf)
+- **Blog post:** [Training Scientific Judgment](https://dynamicalsystems.ai/blog/training-scientific-judgment)
+- **Public repo:** [Dynamical-Systems-Research/training-scientific-judgment](https://github.com/Dynamical-Systems-Research/training-scientific-judgment)
+- **Released evaluation bundle:** [repo `data/open_world/`](https://github.com/Dynamical-Systems-Research/training-scientific-judgment/tree/main/data/open_world)
+- **Search assets:** [`Dynamical-Systems/crystalite-base`](https://huggingface.co/Dynamical-Systems/crystalite-base), [`Dynamical-Systems/crystalite-balanced`](https://huggingface.co/Dynamical-Systems/crystalite-balanced)
+
+This model is the released **scientific-judgment policy** used in the final paper and blog post. The associated Crystalite checkpoints are released as supporting search-side assets that document the provenance of the open-world campaigns. The default public reproducibility path uses the frozen serialized campaign bundle from the public repo.
+
 ## Training
 
 **Base model:** Qwen3-30B-A3B-Instruct-2507 (30B total / 3B active MoE)
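For concreteness, the evaluation loop behind the released bundle can be sketched in a few lines: load serialized campaigns from a bundle directory and score hypothesis accuracy, the fraction of episodes whose highest-posterior hypothesis matches the oracle ground truth. This is a minimal illustration only; the record fields `posteriors` and `oracle_hypothesis` and the one-JSON-file-per-campaign layout are assumptions for the sketch, not the released bundle's actual schema.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical episode records: the field names ("posteriors",
# "oracle_hypothesis") are illustrative assumptions, not the
# released bundle's documented schema.
episodes = [
    {"posteriors": {"H1": 0.7, "H2": 0.3}, "oracle_hypothesis": "H1"},
    {"posteriors": {"H1": 0.2, "H2": 0.8}, "oracle_hypothesis": "H1"},
    {"posteriors": {"H1": 0.1, "H2": 0.9}, "oracle_hypothesis": "H2"},
]

# Serialize one campaign per JSON file, mimicking a frozen bundle directory.
bundle_dir = Path(tempfile.mkdtemp()) / "open_world"
bundle_dir.mkdir(parents=True)
for i, ep in enumerate(episodes):
    (bundle_dir / f"campaign_{i:03d}.json").write_text(json.dumps(ep))

def hypothesis_accuracy(records):
    """Fraction of episodes where the highest-posterior hypothesis
    matches the oracle ground truth after all evidence rounds."""
    hits = sum(
        max(r["posteriors"], key=r["posteriors"].get) == r["oracle_hypothesis"]
        for r in records
    )
    return hits / len(records)

# Reload the frozen bundle and score it, as an evaluation harness would.
loaded = [json.loads(p.read_text()) for p in sorted(bundle_dir.glob("*.json"))]
acc = hypothesis_accuracy(loaded)
print(f"hypothesis accuracy: {acc:.3f}")  # prints "hypothesis accuracy: 0.667"
```

The same scoring function applies unchanged whether the bundle holds 3 toy campaigns or the full frozen set; only the directory contents differ.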
@@ -46,6 +56,8 @@ This release corresponds to the **step-100 merged checkpoint**.
 
 Primary metric: **hypothesis accuracy** -- the fraction of episodes in which the model's highest-posterior hypothesis matches the oracle ground truth after all evidence rounds.
 
+The final public release is paired with a frozen open-world bundle containing `300` serialized campaigns in total; the primary paper evaluation is reported on the pruned, reachable held-out set of `29` campaigns.
+
 ### Held-out learning curve (29 open-world environments, pass@1)
 
 | Checkpoint | Hypothesis Accuracy | Mean Reward | Parse Rate |