Upload README.md

README.md CHANGED

@@ -38,7 +38,7 @@ $$
 q_\phi^t(\mathbf{y}_{<t}, y_t) = \beta \log \mathbb{E}_{\pi_\text{ref}(\mathbf{y}|\mathbf{y}_{\leq t})} \left[ e^{\frac{1}{\beta} r_\phi(\mathbf{y})} \right]
 $$

-Hence,
+Hence, \\(q_\phi^t\\) represents an exact expectation of the outcome reward \\(r_\phi\\) at step \\(t\\), i.e., the Q value.

 The proposition indicates that when modeling

@@ -46,7 +46,7 @@ $$
 r_\phi(\mathbf{y}) := \beta \log \frac{\pi_\phi(\mathbf{y})}{\pi_\text{ref}(\mathbf{y})}
 $$

-to train an ORM with the standard pipeline, where \\(\beta\\) is a hyperparameter, \\(\phi
+to train an ORM with the standard pipeline, where \\(\beta\\) is a hyperparameter, \\(\phi\\) can implicitly learn a Q function. Hence, the process reward \\(r_\phi^t\\) can be obtained by:

 $$
 r_\phi^t := q_\phi^t - q_\phi^{t-1} = \beta \log \frac{\pi_\phi(y_{t}|\mathbf{y}_{<t})}{\pi_\text{ref}(y_{t}|\mathbf{y}_{<t})}.
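Since the per-step reward is just a scaled log-likelihood ratio, it can be computed directly from the token log-probabilities of the two models, with no separate value model. A minimal sketch of this computation (the `beta` value and the log-probability lists are hypothetical placeholders, not taken from the repository):

```python
import math

# Hypothetical per-token log-probabilities that a trained policy pi_phi and a
# frozen reference pi_ref would assign to the same tokens y_t given y_{<t}.
beta = 0.05  # the hyperparameter beta from the formulas above (placeholder value)
logp_phi = [-0.9, -1.4, -0.3, -2.1]  # log pi_phi(y_t | y_{<t})
logp_ref = [-1.1, -1.2, -0.8, -2.0]  # log pi_ref(y_t | y_{<t})

def implicit_process_rewards(logp_phi, logp_ref, beta):
    """r_phi^t = beta * (log pi_phi(y_t|y_{<t}) - log pi_ref(y_t|y_{<t}))."""
    return [beta * (lp - lr) for lp, lr in zip(logp_phi, logp_ref)]

rewards = implicit_process_rewards(logp_phi, logp_ref, beta)

# Because each r_phi^t is a difference q_phi^t - q_phi^{t-1}, the sum over
# steps telescopes back to the outcome reward r_phi(y) = beta * log(pi_phi(y)/pi_ref(y)).
outcome = sum(rewards)
```

The telescoping sum is the reason the sequence-level ORM yields process rewards for free: summing the per-step rewards recovers exactly the outcome reward.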