Update README.md

README.md CHANGED

@@ -24,7 +24,6 @@ tags:

[](https://arxiv.org/abs/2601.22975)
[](https://creativecommons.org/licenses/by-nc/4.0/)
</div>

**GooseReason-4B-Instruct** is a state-of-the-art 4B reasoning model trained via Reinforcement Learning with Verifiable Rewards (RLVR) on [GooseReason-0.7M](https://huggingface.co/datasets/nvidia/Nemotron-Research-GooseReason-0.6M), a large-scale dataset synthesized by the **Golden Goose** pipeline. Starting from [Qwen3-4B-Instruct](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507) and applying the ProRLv2 RL recipe augmented with GooseReason-0.7M data, **GooseReason-4B-Instruct achieves new state-of-the-art results among 4B-Instruct models across 15 diverse benchmarks**, spanning mathematics, programming, STEM reasoning, instruction following, and logical puzzles.
This model is for research and development only.

@@ -36,7 +35,7 @@ Scaling up RLVR is bottlenecked by the scarcity of verifiable training data, whe

**The key idea:** given a source text *S*, we prompt an LLM to identify a contiguous span *t* of crucial reasoning steps and replace it with a `[MASK]` token, constructing a masked context *S*_mask. Treating *t* as the ground-truth answer, the LLM then generates a set of diverse, plausible distractors *D* = {*d*₁, …, *d*ₖ} that are similar in style and length to the removed span yet incorrect in context, forming a multiple-choice question *Q* = (*S*_mask, {*t*} ∪ *D*).
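The construction above can be sketched as follows. This is a minimal illustration only: in the actual Golden Goose pipeline an LLM selects the span *t* and generates the distractors *D*, whereas here both are taken as given inputs.

```python
import random

def build_mcq(source: str, span: str, distractors: list[str], seed: int = 0) -> dict:
    """Form a multiple-choice question Q = (S_mask, {t} ∪ D) from a source
    text S, a gold reasoning span t, and distractor spans D."""
    assert span in source, "the gold span must occur in the source text"
    s_mask = source.replace(span, "[MASK]", 1)  # masked context S_mask
    options = [span] + list(distractors)        # {t} ∪ D
    random.Random(seed).shuffle(options)        # hide the gold position
    answer = options.index(span)                # index of the ground-truth t
    return {"context": s_mask, "options": options, "answer": answer}

q = build_mcq(
    "x + 3 = 7, so x = 7 - 3 = 4, hence x^2 = 16.",
    "x = 7 - 3 = 4",
    ["x = 7 + 3 = 10", "x = 3 - 7 = -4"],
)
```

Shuffling the options and recording the gold index keeps the correct answer from always appearing in a fixed position.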
Verification during RL simply checks whether the model's prediction matches the ground-truth option—no external judge or test execution needed. This formulation unlocks reasoning-rich corpora that were previously unusable for RLVR: Olympiad-level theorem proving from AoPS-Instruct, free-form textbook QA from MegaScience, and coding problems without test cases from rStar-Coder.
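With that formulation, the verifier reduces to an exact match on the chosen option. A sketch, under the assumption that the model is prompted to end its response with a line like `Answer: B` (the exact answer format is not specified here):

```python
import re

def mcq_reward(response: str, answer_idx: int) -> float:
    """Binary verifiable reward: 1.0 if the model's chosen option letter
    matches the ground-truth option index, else 0.0. Assumes the response
    ends with a line such as 'Answer: B'."""
    m = re.search(r"Answer:\s*([A-Z])", response)
    if m is None:
        return 0.0  # unparseable responses earn no reward
    predicted = ord(m.group(1)) - ord("A")  # option letter -> option index
    return 1.0 if predicted == answer_idx else 0.0

score = mcq_reward("The gap must be option B.\nAnswer: B", answer_idx=1)  # 1.0
```

Because the reward is a pure string comparison, it needs no judge model or sandboxed code execution, which is what makes these corpora usable for RLVR.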
## GooseReason-0.7M Dataset