Upload README.md

README.md CHANGED

@@ -38,7 +38,7 @@ $$
 q_\phi^t(\mathbf{y}_{<t}, y_t) = \beta \log \mathbb{E}_{\pi_\text{ref}(\mathbf{y}|\mathbf{y}_{\leq t})} \left[ e^{\frac{1}{\beta} r_\phi(\mathbf{y})} \right]
 $$

-Hence,
+Hence, \\(q_\phi^t\\) represents an exact expectation of the outcome reward \\(r_\phi\\) at step \\(t\\), i.e., the Q value.

 The proposition indicates that when modeling

@@ -46,7 +46,7 @@ $$
 r_\phi(\mathbf{y}) := \beta \log \frac{\pi_\phi(\mathbf{y})}{\pi_\text{ref}(\mathbf{y})}
 $$

-to train an ORM with the standard pipeline, where \\(\beta\\) is a hyperparameter, \\(\phi
+to train an ORM with the standard pipeline, where \\(\beta\\) is a hyperparameter, \\(\phi\\) can implicitly learn a Q function. Hence, the process reward \\(r_\phi^t\\) can be obtained by:

 $$
 r_\phi^t := q_\phi^t - q_\phi^{t-1} = \beta \log \frac{\pi_\phi(y_{t}|\mathbf{y}_{<t})}{\pi_\text{ref}(y_{t}|\mathbf{y}_{<t})}.
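Since the per-step reward is just a scaled log-likelihood ratio, it can be computed directly from the token log-probabilities of the two models, with no separate value model. A minimal sketch of this computation (the `beta` value and the log-probability lists are hypothetical placeholders, not taken from the repository):

```python
import math

# Hypothetical per-token log-probabilities that a trained policy pi_phi and a
# frozen reference pi_ref would assign to the same tokens y_t given y_{<t}.
beta = 0.05  # the hyperparameter beta from the formulas above (placeholder value)
logp_phi = [-0.9, -1.4, -0.3, -2.1]  # log pi_phi(y_t | y_{<t})
logp_ref = [-1.1, -1.2, -0.8, -2.0]  # log pi_ref(y_t | y_{<t})

def implicit_process_rewards(logp_phi, logp_ref, beta):
    """r_phi^t = beta * (log pi_phi(y_t|y_{<t}) - log pi_ref(y_t|y_{<t}))."""
    return [beta * (lp - lr) for lp, lr in zip(logp_phi, logp_ref)]

rewards = implicit_process_rewards(logp_phi, logp_ref, beta)

# Because each r_phi^t is a difference q_phi^t - q_phi^{t-1}, the sum over
# steps telescopes back to the outcome reward r_phi(y) = beta * log(pi_phi(y)/pi_ref(y)).
outcome = sum(rewards)
```

The telescoping sum is the reason the sequence-level ORM yields process rewards for free: summing the per-step rewards recovers exactly the outcome reward.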