This model is for research and development only.
## Golden Goose

Scaling up RLVR is bottlenecked by the scarcity of verifiable training data: improvements increasingly saturate after prolonged training on existing datasets. **Golden Goose** is a simple, scalable pipeline that synthesizes *unlimited* RLVR tasks from reasoning-rich but unverifiable internet text—corpora such as science textbooks, Olympiad math forums, and cybersecurity web scrapes that were previously excluded from RLVR data construction due to the difficulty of automatic verification.

**The key idea:** given a source text *S*, we prompt an LLM to identify a contiguous span *t* of crucial reasoning steps and replace it with a `[MASK]` token, constructing a masked context *S*_mask. Treating *t* as the ground-truth answer, the LLM then generates a set of diverse, plausible distractors *D* = {*d*₁, …, *d*ₖ} that are similar in style and length to the removed span yet incorrect in context, forming a multiple-choice question *Q* = (*S*_mask, {*t*} ∪ *D*).
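Once the LLM has proposed the span boundaries and the distractors, assembling the verifiable task itself is mechanical. A minimal sketch of that assembly step (the function name, dict schema, and letter labels are illustrative assumptions, not the pipeline's actual format):

```python
import random

def build_mcq_task(source: str, span_start: int, span_end: int,
                   distractors: list[str], seed: int = 0) -> dict:
    """Turn a source text S and a chosen reasoning span t into a verifiable
    multiple-choice task: mask the span, then shuffle it among distractors."""
    t = source[span_start:span_end]                               # ground-truth span t
    s_mask = source[:span_start] + "[MASK]" + source[span_end:]   # masked context S_mask
    options = distractors + [t]                                   # {t} ∪ D
    random.Random(seed).shuffle(options)
    labels = "ABCDEFGH"[: len(options)]
    return {
        "context": s_mask,
        "options": dict(zip(labels, options)),
        "answer": labels[options.index(t)],                       # verifiable gold label
    }
```

In the actual pipeline both the span choice and the distractors come from LLM prompting; here they are passed in as arguments so the construction logic is isolated.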

Verification during RL simply checks whether the model's prediction matches the ground-truth option—no external judge or test execution needed. This formulation unlocks reasoning-rich corpora that were previously unusable for RLVR: Olympiad-level theorem proving from AoPS-Instruct, free-form textbook QA from MegaScience, and coding problems without test cases from rStar-Coder.
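That verifier reduces to a few lines of string matching. A sketch, assuming the model is prompted to end its response with an `Answer: <letter>` line (that output convention is our assumption, not stated in the README):

```python
def rlvr_reward(model_output: str, gold_label: str) -> float:
    """Binary verifiable reward: 1.0 iff the predicted option matches gold.
    Scans from the last line so chain-of-thought text above is ignored."""
    for line in reversed(model_output.strip().splitlines()):
        if line.strip().lower().startswith("answer:"):
            pred = line.split(":", 1)[1].strip().rstrip(".")
            return 1.0 if pred.upper() == gold_label.upper() else 0.0
    return 0.0  # unparseable output earns no reward
```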
## GooseReason-0.7M Dataset

Using the Golden Goose pipeline, we synthesize **GooseReason-0.7M**, a large-scale RLVR dataset with over **0.7 million tasks** spanning mathematics, programming, and general scientific domains. The dataset is constructed from the following source corpora:

| Domain | # Examples | Source | Description |
|--------|-----------:|--------|-------------|

GooseReason-4B-Instruct is evaluated on 15 diverse benchmarks following the [ProRL](https://arxiv.org/abs/2505.24864) evaluation protocol. Math performance is measured on AIME 2024/2025, AMC, MATH, Minerva, and Olympiad Bench. Code performance is measured on APPS, CodeContests, CodeForces, TACO, HumanEvalPlus, and LiveCodeBench. STEM and reasoning tasks are measured via GPQA Diamond, IFEval, and Reasoning Gym (logical puzzles). The Qwen3-30B-Instruct results (in *italics*) are provided as a reference.
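The README does not state how pass@1 is computed; a common choice is the unbiased pass@k estimator from the HumanEval evaluation methodology, sketched here for reference (whether ProRL uses exactly this estimator is an assumption):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples drawn
    (without replacement) from n generations, of which c are correct, passes.
    pass@1 is then simply c / n, averaged over problems."""
    if n - c < k:
        return 1.0  # fewer wrong samples than k: a correct one is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)
```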

**Table 1.** Performance (pass@1) comparison across math benchmarks. Adding GooseReason-0.7M revives the saturated model and enables further RL scaling, achieving a **+2.18% absolute gain** (vs. a 0.79% degradation when continuing on ProRL data alone).

| Model | RL Data | RL Steps | AIME24 | AIME25 | AMC | MATH | Minerva | Olympiad | Avg |
|-------|---------|:--------:|-------:|-------:|----:|-----:|--------:|---------:|----:|
| Qwen3-4B-Instruct | — | — | 64.79 | 48.75 | 85.17 | 94.66 | 50.09 | 65.83 | 68.21 |
| *Qwen3-30B-Instruct* | *—* | *—* | *76.66* | *63.74* | *91.64* | *97.10* | *51.99* | *70.05* | *75.20* |

**Table 2.** Performance (pass@1) comparison across coding benchmarks. GooseReason-4B-Instruct achieves a **+2.24% absolute gain** in coding average, outperforming Qwen3-30B-Instruct by a wide margin.

| Model | RL Data | RL Steps | APPS | CodeContests | CodeForces | TACO | HumanEvalPlus | LiveCodeBench | Avg |
|-------|---------|:--------:|-----:|-------------:|-----------:|-----:|--------------:|--------------:|----:|
| Qwen3-4B-Instruct | — | — | 47.01 | 42.08 | 33.69 | 23.69 | 77.56 | 31.74 | 42.63 |
| *Qwen3-30B-Instruct* | *—* | *—* | *55.37* | *49.70* | *47.76* | *29.05* | *80.56* | *43.20* | *50.94* |

**Table 3.** Performance (pass@1) on STEM reasoning (GPQA Diamond), instruction following (IFEval), and logic puzzles (Reasoning Gym). Tasks in Reasoning Gym are grouped into four categories: Math, Algorithmic, Cognition, and Logic.

| Model | RL Data | RL Steps | GPQA | IFEval | Math | Algorithmic | Cognition | Logic | Avg. Gym |
|-------|---------|:--------:|-----:|-------:|-----:|------------:|----------:|------:|---------:|
| Qwen3-4B-Instruct | — | — | 60.26 | 72.36 | 43.69 | 19.46 | 34.92 | 57.26 | 33.98 |