Commit 7448341 (verified) by Ximing · Parent: 5e52fef

Update README.md

Files changed (1): README.md +0 -5
README.md CHANGED
@@ -31,7 +31,6 @@ This model is for research and development only.
 
 
 ## Golden Goose
-
 Scaling up RLVR is bottlenecked by the scarcity of verifiable training data, where improvements increasingly saturate after prolonged training on existing datasets. **Golden Goose** is a simple, scalable pipeline that synthesizes *unlimited* RLVR tasks from reasoning-rich but unverifiable internet text—corpora such as science textbooks, Olympiad math forums, and cybersecurity web scrapes that were previously excluded from RLVR data construction due to the difficulty of automatic verification.
 
 **The key idea:** given a source text *S*, we prompt an LLM to identify a contiguous span *t* of crucial reasoning steps and replace it with a `[MASK]` token, constructing a masked context *S*_mask. Treating *t* as the ground-truth answer, the LLM then generates a set of diverse, plausible distractors *D* = {*d*₁, …, *d*ₖ} that are similar in style and length to the removed span yet incorrect in context, forming a multiple-choice question: *Q* = (*S*_mask, {*t*} ∪ *D*).
@@ -39,7 +38,6 @@ Scaling up RLVR is bottlenecked by the scarcity of verifiable training data, whe
 Verification during RL simply checks whether the model's prediction matches the ground-truth option—no external judge or test execution needed. This formulation unlocks reasoning-rich corpora that were previously unusable for RLVR: Olympiad-level theorem proving from AoPS-Instruct, free-form textbook QA from MegaScience, and coding problems without test cases from rStar-Coder.
 
 ## GooseReason-0.7M Dataset
-
 Using the Golden Goose pipeline, we synthesize **GooseReason-0.7M**, a large-scale RLVR dataset with over **0.7 million tasks** spanning mathematics, programming, and general scientific domains. The dataset is constructed from the following source corpora:
 
 | Domain | # Examples | Source | Description |
@@ -55,7 +53,6 @@ Using the Golden Goose pipeline, we synthesize **GooseReason-0.7M**, a large-sca
 GooseReason-4B-Instruct is evaluated on 15 diverse benchmarks following the [ProRL](https://arxiv.org/abs/2505.24864) evaluation protocol. Math performance is measured on AIME 2024/2025, AMC, MATH, Minerva, and Olympiad Bench. Code performance is measured on APPS, CodeContests, CodeForces, TACO, HumanEvalPlus, and LiveCodeBench. STEM and reasoning tasks are measured via GPQA Diamond, IFEval, and Reasoning Gym (logical puzzles). The Qwen3-30B-Instruct results (in *italics*) are provided as a reference.
 
 **Table 1.** Performance (pass@1) comparison across math benchmarks. Adding GooseReason-0.7M revives the saturated model and enables further RL scaling, achieving a **+2.18% absolute gain** (vs. a 0.79% degradation when continuing on ProRL data alone).
-
 | Model | RL Data | RL Steps | AIME24 | AIME25 | AMC | MATH | Minerva | Olympiad | Avg |
 |-------|---------|:--------:|-------:|-------:|----:|-----:|--------:|---------:|----:|
 | Qwen3-4B-Instruct | — | — | 64.79 | 48.75 | 85.17 | 94.66 | 50.09 | 65.83 | 68.21 |
@@ -65,7 +62,6 @@ GooseReason-4B-Instruct is evaluated on 15 diverse benchmarks following the [Pro
 | *Qwen3-30B-Instruct* | *—* | *—* | *76.66* | *63.74* | *91.64* | *97.10* | *51.99* | *70.05* | *75.20* |
 
 **Table 2.** Performance (pass@1) comparison across coding benchmarks. GooseReason-4B-Instruct achieves a **+2.24% absolute gain** in coding average, outperforming Qwen3-30B-Instruct by a wide margin.
-
 | Model | RL Data | RL Steps | APPS | CodeContests | CodeForces | TACO | HumanEvalPlus | LiveCodeBench | Avg |
 |-------|---------|:--------:|-----:|-------------:|-----------:|-----:|--------------:|--------------:|----:|
 | Qwen3-4B-Instruct | — | — | 47.01 | 42.08 | 33.69 | 23.69 | 77.56 | 31.74 | 42.63 |
@@ -75,7 +71,6 @@ GooseReason-4B-Instruct is evaluated on 15 diverse benchmarks following the [Pro
 | *Qwen3-30B-Instruct* | *—* | *—* | *55.37* | *49.70* | *47.76* | *29.05* | *80.56* | *43.20* | *50.94* |
 
 **Table 3.** Performance (pass@1) on STEM reasoning (GPQA Diamond), instruction following (IFEval), and logic puzzles (Reasoning Gym). Tasks in Reasoning Gym are grouped into four categories: Math, Algorithmic, Cognition, and Logic.
-
 | Model | RL Data | RL Steps | GPQA | IFEval | Math | Algorithmic | Cognition | Logic | Avg. Gym |
 |-------|---------|:--------:|-----:|-------:|-----:|------------:|----------:|------:|---------:|
 | Qwen3-4B-Instruct | — | — | 60.26 | 72.36 | 43.69 | 19.46 | 34.92 | 57.26 | 33.98 |
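
The mask-and-distractor construction described in the Golden Goose section (*Q* = (*S*_mask, {*t*} ∪ *D*)) can be sketched in a few lines of Python. Note this is a minimal illustration, not the released pipeline: the `RLVRTask` container and `build_task` helper are hypothetical names, and the span *t* and distractors *D* are passed in as plain strings, standing in for the LLM prompting steps that produce them.

```python
import random
from dataclasses import dataclass

@dataclass
class RLVRTask:
    masked_context: str   # S_mask: source text with the span replaced by [MASK]
    options: list[str]    # shuffled {t} ∪ D
    answer_index: int     # position of the ground-truth span t in `options`

def build_task(source: str, span: str, distractors: list[str], seed: int = 0) -> RLVRTask:
    """Construct a multiple-choice RLVR task Q = (S_mask, {t} ∪ D).

    `span` is a contiguous reasoning span t identified in `source`;
    `distractors` are plausible-but-incorrect alternatives D. In the
    real pipeline both are produced by LLM prompts; here they are inputs.
    """
    assert span in source, "span t must be a contiguous substring of S"
    masked = source.replace(span, "[MASK]", 1)
    options = [span] + list(distractors)
    random.Random(seed).shuffle(options)  # hide the ground truth's position
    return RLVRTask(masked, options, options.index(span))
```

Shuffling with a fixed seed keeps task construction reproducible while still randomizing the position of the correct option across tasks.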
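Verification during RL then reduces to a string match against the ground-truth option. A minimal sketch follows; the option-letter answer convention here is an assumption for illustration, not necessarily the exact answer format used in training.

```python
def verify(predicted_letter: str, answer_index: int, num_options: int = 4) -> float:
    """Binary RLVR reward: 1.0 iff the predicted option letter matches
    the ground-truth option. No external judge or test execution needed.
    (The letter-based answer format is an illustrative assumption.)
    """
    letters = "ABCDEFGHIJ"[:num_options]
    guess = predicted_letter.strip().upper()
    return 1.0 if guess in letters and letters.index(guess) == answer_index else 0.0
```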
 