Update README.md
README.md
CHANGED
|
@@ -75,5 +75,50 @@ This simple process led to the creation of a small, synthetically labeled traini
|
|
| 75 |
|
| 76 |
IMPORTANT: Do not validate this model on MBPP-derived benchmarks due to the data overlap.
|
| 77 |
|
| 78 |
### Training Procedure
|
| 79 |
|
|
|
|
| 78 |
+
#### Reinforcement Learning with Verifiable Rewards Data
|
| 79 |
+
|
| 80 |
+
In this phase, both the programming tasks and their test cases were generated automatically using a large language model (OpenAI o3-mini).
|
| 81 |
+
|
| 82 |
+
1) The model received detailed instructions regarding:
|
| 83 |
+
- the expected format of the task descriptions
|
| 84 |
+
- the difficulty level of the problems
|
| 85 |
+
- the structure and format of the test cases
|
| 86 |
+
2) To ensure a wide variety of tasks, 30 distinct themes were defined, including:
|
| 87 |
+
* string manipulation and formatting
|
| 88 |
+
* basic array processing (1D arrays)
|
| 89 |
+
* simple numeric sequences
|
| 90 |
+
* frequency counting in arrays
|
| 91 |
+
* finding prime numbers
|
| 92 |
+
* basic sorting algorithms on 1D arrays
|
| 93 |
+
* simple recursive functions
|
| 94 |
+
* pattern detection in strings
|
| 95 |
+
* calculating GCD and LCM
|
| 96 |
+
* basic statistics (mean, median)
|
| 97 |
+
* string encoding/decoding
|
| 98 |
+
* subarray sums
|
| 99 |
+
* basic combinatorial calculations
|
| 100 |
+
* bitwise operations
|
| 101 |
+
* date and time manipulation
|
| 102 |
+
* palindrome substring detection
|
| 103 |
+
* basic hashing techniques
|
| 104 |
+
* number base conversions
|
| 105 |
+
* array rotation (1D)
|
| 106 |
+
* counting unique elements
|
| 107 |
+
* string compression
|
| 108 |
+
* validating numeric strings
|
| 109 |
+
* string reversal with conditions
|
| 110 |
+
* generating the Fibonacci sequence
|
| 111 |
+
* checking balanced parentheses
|
| 112 |
+
* basic queue and stack problems (using 1D arrays)
|
| 113 |
+
* counting vowels and consonants
|
| 114 |
+
* integer factorization
|
| 115 |
+
* simple encryption/decryption
|
| 116 |
+
* basic logical puzzles
|
| 117 |
+
|
| 118 |
+
3) For each theme, the model was prompted once to generate 10 unique programming problems and their corresponding test cases.
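
The three steps above can be sketched as a simple loop over themes. This is a minimal sketch, not the author's actual script: the prompt wording, the JSON output schema (`task` and `tests` keys), and the helper names `build_prompt`, `build_dataset`, and `fake_generate` are all assumptions made for illustration. The LLM call is replaced by a stand-in function so the example runs offline.

```python
import json
from typing import Callable

# Three of the 30 themes listed above; the full list would go here.
THEMES = [
    "string manipulation and formatting",
    "basic array processing (1D arrays)",
    "finding prime numbers",
]

PROBLEMS_PER_THEME = 10


def build_prompt(theme: str, n: int = PROBLEMS_PER_THEME) -> str:
    """Hypothetical prompt wording; the original instructions are not shown in the README."""
    return (
        f"Generate {n} unique programming problems on the theme '{theme}'. "
        "Return a JSON list of objects with keys 'task' and 'tests'."
    )


def build_dataset(themes, generate: Callable[[str], str], n: int = PROBLEMS_PER_THEME):
    """Prompt the generator once per theme and flatten the results into one list."""
    dataset = []
    for theme in themes:
        raw = generate(build_prompt(theme, n))  # one call per theme
        for item in json.loads(raw):
            dataset.append({"theme": theme, "task": item["task"], "tests": item["tests"]})
    return dataset


def fake_generate(prompt: str) -> str:
    """Stand-in for the LLM call (assumption); returns placeholder problems as JSON."""
    items = [
        {"task": f"placeholder task {i}", "tests": ["assert True"]}
        for i in range(PROBLEMS_PER_THEME)
    ]
    return json.dumps(items)


dataset = build_dataset(THEMES, fake_generate)
print(len(dataset))  # 3 themes x 10 problems = 30
```

With all 30 themes in `THEMES`, the same loop would yield 30 × 10 = 300 records.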
|
| 119 |
+
|
| 120 |
+
Prompting by theme was key to generating varied, high-quality synthetic data: without a clearly defined theme, the model tends to repeat itself or default to similar types of tasks.
|
| 121 |
+
By guiding generation through specific topics, I built a synthetic dataset of 300 examples (30 themes × 10 problems), each consisting of a task and its corresponding test cases.
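
To illustrate why task and test pairs suit verifiable rewards, here is a hedged sketch of how one record might be checked. The record layout, the example task, and the binary pass/fail reward are assumptions for illustration; the README does not specify the actual schema or reward function.

```python
def passes_tests(solution_code: str, test_code: str) -> bool:
    """Run a candidate solution, then its assert-based tests, in one namespace.

    Note: exec() on untrusted model output is unsafe; a real pipeline
    would run this inside a sandbox with time and memory limits.
    """
    namespace = {}
    try:
        exec(solution_code, namespace)
        exec(test_code, namespace)
    except Exception:
        # Any syntax error, runtime error, or failed assertion counts as a failure.
        return False
    return True


# Hypothetical record in the style of the "counting vowels and consonants" theme.
example = {
    "task": "Write a function count_vowels(s) that returns the number of vowels in s.",
    "tests": "assert count_vowels('hello') == 2\nassert count_vowels('xyz') == 0",
}

good = "def count_vowels(s):\n    return sum(ch in 'aeiouAEIOU' for ch in s)"
bad = "def count_vowels(s):\n    return 0"

# Binary reward (assumption): 1.0 if every test passes, else 0.0.
reward_good = 1.0 if passes_tests(good, example["tests"]) else 0.0
reward_bad = 1.0 if passes_tests(bad, example["tests"]) else 0.0
print(reward_good, reward_bad)
```

Because the reward depends only on whether the tests pass, it can be computed automatically without a human or model judge.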
|
| 122 |
+
|
| 123 |
### Training Procedure
|
| 124 |
|