Jarrodbarnes committed (verified) · Commit 67dd15f · 1 Parent(s): 89e0e29

Upload README.md with huggingface_hub

Files changed (1): README.md (+12 -0)
README.md CHANGED

```diff
@@ -23,6 +23,16 @@ A trained judgment layer for autonomous scientific workflows. Starting from `Qwe
 
 The target capability is not general reasoning or autonomous science. It is the decision-making core that determines whether a larger scientific system behaves intelligently when search, evidence, cost, and belief updates are all coupled: selecting which candidate to investigate, evaluating whether evidence should be trusted or escalated, and revising hypotheses as conflicting results accumulate.
 
+## Release Links
+
+- **Paper PDF:** [Training Scientific Judgment with Verified Environments for Autonomous Science](https://github.com/Dynamical-Systems-Research/training-scientific-judgment/blob/main/paper/training-scientific-judgment.pdf)
+- **Blog post:** [Training Scientific Judgment](https://dynamicalsystems.ai/blog/training-scientific-judgment)
+- **Public repo:** [Dynamical-Systems-Research/training-scientific-judgment](https://github.com/Dynamical-Systems-Research/training-scientific-judgment)
+- **Released evaluation bundle:** [repo `data/open_world/`](https://github.com/Dynamical-Systems-Research/training-scientific-judgment/tree/main/data/open_world)
+- **Search assets:** [`Dynamical-Systems/crystalite-base`](https://huggingface.co/Dynamical-Systems/crystalite-base), [`Dynamical-Systems/crystalite-balanced`](https://huggingface.co/Dynamical-Systems/crystalite-balanced)
+
+This model is the released **scientific-judgment policy** used in the final paper and blog post. The associated Crystalite checkpoints are released as supporting search-side assets for open-world campaign provenance. The default public reproducibility path uses the frozen serialized campaign bundle from the public repo.
+
 ## Training
 
 **Base model:** Qwen3-30B-A3B-Instruct-2507 (30B total / 3B active MoE)
@@ -46,6 +56,8 @@ This release corresponds to the **step-100 merged checkpoint**.
 
 Primary metric: **hypothesis accuracy** -- the fraction of episodes where the model's highest-posterior hypothesis matches the oracle ground truth after all evidence rounds.
 
+The final public release is paired with a frozen open-world bundle containing `300` serialized campaigns overall, with the primary paper evaluation reported on the pruned reachable held-out set of `29` campaigns.
+
 ### Held-out learning curve (29 open-world environments, pass@1)
 
 | Checkpoint | Hypothesis Accuracy | Mean Reward | Parse Rate |
```