Define

$$
q_\phi^t(\mathbf{y}_{<t}, y_t) := \sum_{i=1}^{t} \beta \log \frac{\pi_\phi(y_{i}|\mathbf{y}_{<i})}{\pi_\text{ref}(y_{i}|\mathbf{y}_{<i})}.
$$

Then $q_\phi^t$ is the exponential average of $r_\phi$ at step $t$:

$$
q_\phi^t(\mathbf{y}_{<t}, y_t) = \beta \log \mathbb{E}_{\pi_\text{ref}(\mathbf{y}|\mathbf{y}_{\leq t})} \left[ e^{\frac{1}{\beta} r_\phi(\mathbf{y})} \right]
$$

Hence, **$q_\phi^t$** represents an exact expectation of outcome reward $r_\phi$ at step $t$, i.e., the Q value.
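
As a sanity check, at the final step $t = |\mathbf{y}|$ the sum telescopes by the chain rule of probability, so $q_\phi^t$ collapses back to the response-level outcome reward:

$$
q_\phi^{|\mathbf{y}|} = \sum_{i=1}^{|\mathbf{y}|} \beta \log \frac{\pi_\phi(y_{i}|\mathbf{y}_{<i})}{\pi_\text{ref}(y_{i}|\mathbf{y}_{<i})} = \beta \log \frac{\pi_\phi(\mathbf{y})}{\pi_\text{ref}(\mathbf{y})} = r_\phi(\mathbf{y}).
$$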

The proposition indicates that when modeling

$$
r_\phi(\mathbf{y}) := \beta \log \frac{\pi_\phi(\mathbf{y})}{\pi_\text{ref}(\mathbf{y})}
$$

to train an ORM with the standard pipeline, where $\beta$ is a hyperparameter, $\phi$ can implicitly learn a Q function. Hence, process reward $r_\phi^t$ can be obtained by:

$$
r_\phi^t := q_\phi^t - q_\phi^{t-1} = \beta \log \frac{\pi_\phi(y_{t}|\mathbf{y}_{<t})}{\pi_\text{ref}(y_{t}|\mathbf{y}_{<t})}.
$$

Therefore, we can indeed obtain PRMs simply by collecting response-level data and training an ORM, without any burden of annotating step labels.
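
In practice, scoring a response therefore only requires token log-probabilities from the trained model and the reference model. Below is a minimal sketch of such scoring with two Hugging Face causal LMs; the function name, the $\beta$ value, and the absence of prompt masking are illustrative assumptions, not the exact inference code of this repository:

```python
import torch

@torch.no_grad()
def implicit_process_rewards(pi_phi, pi_ref, input_ids, beta=0.001):
    """Per-token reward r_phi^t = beta * [log pi_phi(y_t|y_<t) - log pi_ref(y_t|y_<t)].

    pi_phi, pi_ref: Hugging Face causal LMs (e.g. AutoModelForCausalLM).
    input_ids: (batch, seq_len) prompt + response token ids.
    Returns a (batch, seq_len - 1) tensor; the cumulative sum over positions
    gives q_phi^t, and summing the entries inside one reasoning step gives
    that step's process reward.
    """
    def token_logprobs(model):
        logits = model(input_ids).logits[:, :-1, :]   # position t predicts token t+1
        logps = torch.log_softmax(logits.float(), dim=-1)
        return logps.gather(-1, input_ids[:, 1:, None]).squeeze(-1)

    return beta * (token_logprobs(pi_phi) - token_logprobs(pi_ref))
```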

For example, DPO already meets our assumption and serves as a strong variant, while in this work, we instantiate our implicit PRM with cross-entropy (CE) loss due to memory efficiency:

$$
\small \mathcal{L}_{CE} = l \cdot \log \sigma \left( \beta \log \frac{\pi_\phi(\mathbf{y})}{\pi_\text{ref}(\mathbf{y})} \right) + (1 - l) \cdot \log \left[ 1 - \sigma \left( \beta \log \frac{\pi_\phi(\mathbf{y})}{\pi_\text{ref}(\mathbf{y})} \right) \right]
$$
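
This objective is standard binary cross-entropy applied to the response-level implicit reward, so its negation maps onto a stock library call. A minimal sketch, assuming $l \in \{0, 1\}$ response-level correctness labels (the helper name and the sign convention of minimizing the negated objective are assumptions):

```python
import torch.nn.functional as F

def ce_loss(beta_log_ratio, labels):
    """beta_log_ratio: (batch,) values of beta * log[pi_phi(y) / pi_ref(y)],
    i.e. the per-token rewards above summed over the full response.
    labels: (batch,) floats, l = 1 for a correct response and l = 0 otherwise.
    """
    # binary_cross_entropy_with_logits computes the negation of L_CE above,
    # so minimizing it maximizes the log-likelihood objective.
    return F.binary_cross_entropy_with_logits(beta_log_ratio, labels)
```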

We conducted second-stage training on top of [EurusPRM-Stage1](https://huggingface.co/PRIME-RL/EurusPRM-Stage1) using fine-grained step-level labels. To obtain step-level labels, we employed Llama-3.1-70B-Inst and Qwen2.5-72B-Inst to insert nuanced errors into correct solutions. We also mixed response-level data in this stage. The model was continually trained with $\mathcal{L}_{CE}$ using a learning rate of 5e-7 and a batch size of 64.