Upload README.md
README.md
CHANGED
|
@@ -29,7 +29,7 @@ $$
|
|
| 29 |
Define
|
| 30 |
|
| 31 |
$$
|
| 32 |
-
q_\phi^t(\mathbf{y}_{<t}, y_t) := \
|
| 33 |
$$
|
| 34 |
|
| 35 |
is the exponential average of \\(r_\phi\\) at step \\(t\\).
|
|
@@ -38,7 +38,7 @@ $$
|
|
| 38 |
q_\phi^t(\mathbf{y}_{<t}, y_t) = \beta \log \mathbb{E}_{\pi_\text{ref}(\mathbf{y}|\mathbf{y}_{\leq t})} \left[ e^{\frac{1}{\beta} r_\phi(\mathbf{y})} \right]
|
| 39 |
$$
|
| 40 |
|
| 41 |
-
Hence, \\(
|
| 42 |
|
| 43 |
The proposition indicates that when modeling
|
| 44 |
|
|
|
|
| 29 |
Define
|
| 30 |
|
| 31 |
$$
|
| 32 |
+
q_\phi^t(\mathbf{y}_{<t}, y_t) := \sum_{i=1}^{t} \beta \log \frac{\pi_\phi(y_{i}|\mathbf{y}_{<i})}{\pi_\text{ref}(y_{i}|\mathbf{y}_{<i})}.
|
| 33 |
$$
|
| 34 |
|
| 35 |
is the exponential average of \\(r_\phi\\) at step \\(t\\).
|
|
|
|
| 38 |
q_\phi^t(\mathbf{y}_{<t}, y_t) = \beta \log \mathbb{E}_{\pi_\text{ref}(\mathbf{y}|\mathbf{y}_{\leq t})} \left[ e^{\frac{1}{\beta} r_\phi(\mathbf{y})} \right]
|
| 39 |
$$
|
| 40 |
|
| 41 |
+
Hence, \\(q_\phi^t\\) represents an exact expectation of the outcome reward \\(r_\phi\\) at step \\(t\\), i.e., the Q value.
|
| 42 |
|
| 43 |
The proposition indicates that when modeling
|
| 44 |
|
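The identity in the added lines can be checked numerically on a toy problem. The sketch below is not from the README. It assumes that the outcome reward is the full-sequence log-likelihood ratio, \\(r_\phi(\mathbf{y}) = \beta \log \frac{\pi_\phi(\mathbf{y})}{\pi_\text{ref}(\mathbf{y})}\\), and it uses hypothetical stand-ins throughout: a two-token vocabulary, sequences of length 3, randomly generated lookup-table "language models", and \\(\beta = 0.5\\). Under those assumptions it compares the partial sum of log-ratios with the exponential average computed by brute-force enumeration.

```python
import itertools
import math
import random

BETA = 0.5        # hypothetical value of beta, chosen only for illustration
VOCAB = (0, 1)    # toy two-token vocabulary (assumption)
LENGTH = 3        # toy sequence length (assumption)


def make_policy(seed):
    """Build a toy causal LM: a dict mapping each prefix to next-token probabilities."""
    rng = random.Random(seed)
    policy = {}
    for n in range(LENGTH):
        for prefix in itertools.product(VOCAB, repeat=n):
            weights = [rng.uniform(0.1, 1.0) for _ in VOCAB]
            total = sum(weights)
            policy[prefix] = [w / total for w in weights]
    return policy


def log_prob(policy, y, start=0):
    """Return log pi(y[start:] | y[:start]) under the toy policy."""
    return sum(math.log(policy[tuple(y[:i])][y[i]]) for i in range(start, len(y)))


def q_value(pi_phi, pi_ref, y, t):
    """Sum of beta * log(pi_phi / pi_ref) over the first t tokens of y."""
    return sum(
        BETA * (math.log(pi_phi[tuple(y[:i])][y[i]])
                - math.log(pi_ref[tuple(y[:i])][y[i]]))
        for i in range(t)
    )


def outcome_reward(pi_phi, pi_ref, y):
    """Assumed outcome reward: beta * log(pi_phi(y) / pi_ref(y)) for a full sequence."""
    return BETA * (log_prob(pi_phi, y) - log_prob(pi_ref, y))


def exp_average(pi_phi, pi_ref, y, t):
    """beta * log E_{pi_ref(y | y_<=t)}[exp(r(y) / beta)], by enumerating completions."""
    prefix = tuple(y[:t])
    total = 0.0
    for tail in itertools.product(VOCAB, repeat=LENGTH - t):
        full = prefix + tail
        weight = math.exp(log_prob(pi_ref, full, start=t))  # pi_ref(tail | prefix)
        total += weight * math.exp(outcome_reward(pi_phi, pi_ref, full) / BETA)
    return BETA * math.log(total)


pi_phi = make_policy(seed=0)
pi_ref = make_policy(seed=1)

# Largest gap between the two sides over every sequence and every step t.
max_gap = max(
    abs(q_value(pi_phi, pi_ref, y, t) - exp_average(pi_phi, pi_ref, y, t))
    for y in itertools.product(VOCAB, repeat=LENGTH)
    for t in range(1, LENGTH + 1)
)
print(f"max |q - exponential average| = {max_gap:.2e}")
```

In this toy setting the two quantities should agree up to floating-point error, because the \\(\pi_\text{ref}\\) factors of the completion cancel inside the expectation and the remaining \\(\pi_\phi\\) terms sum to one. Real models would need sampling rather than enumeration, so this only illustrates the algebra.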