- Online rejection-sampling metrics can flexibly direct how the reference flow is constituted for reward calculation.

- 📈 **Reward Behavior** Flow rewards allow arbitrary expert off-policy data to serve as the reference for constituting the reward signal. Additionally, flow rewards capture context dependence efficiently: context is natively compressed in the latent space rather than represented token by token for context comprehension.
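The flow-reward idea above can be sketched in a minimal toy form. This is an illustrative assumption, not the repository's actual implementation: we assume a closed-form rectified-flow velocity field for reference (expert) data concentrated at a single point `mu`, and score a candidate sample by the negative flow-matching discrepancy along its noise-to-sample path — samples closer to the reference flow receive higher reward. The names `reference_velocity` and `flow_reward` are hypothetical.

```python
import numpy as np

def reference_velocity(x_t, t, mu):
    # Closed-form rectified-flow velocity when all reference (expert)
    # data sits at the single point mu: v*(x_t, t) = (mu - x_t) / (1 - t)
    return (mu - x_t) / (1.0 - t)

def flow_reward(sample, mu, rng, n_steps=32):
    # Score `sample` by how well the reference flow's velocity explains
    # the straight-line interpolant noise -> sample; the reward is the
    # negative flow-matching discrepancy (0 is the best possible score).
    losses = []
    for t in np.linspace(0.0, 0.9, n_steps):
        x0 = rng.standard_normal(sample.shape)   # fresh noise draw
        x_t = (1.0 - t) * x0 + t * sample        # interpolant point
        v_ref = reference_velocity(x_t, t, mu)
        target = sample - x0                     # rectified-flow target velocity
        losses.append(np.mean((v_ref - target) ** 2))
    return -float(np.mean(losses))

rng = np.random.default_rng(0)
mu = np.zeros(3)                                 # toy "expert" reference
r_on = flow_reward(mu, mu, rng)                  # sample on the reference flow
r_off = flow_reward(np.full(3, 5.0), mu, rng)    # sample far from it
```

In this toy setting the on-reference sample scores near zero while the off-reference sample is penalized in proportion to its squared distance from `mu`, mirroring how a reference flow built from expert data can rank policy samples.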

### Model Description
- Trained from model: [Qwen2.5-Math-7B](https://huggingface.co/Qwen/Qwen2.5-Math-7B)