---
license: mit
datasets:
- PrimeIntellect/Reverse-Text-RL
language:
- en
base_model:
- Qwen/Qwen3-0.6B
- PrimeIntellect/Qwen3-0.6B-Reverse-Text-SFT
---
|
|
# Reverse Text Model Qwen3-0.6B |
|
|
|
|
|
A simple model that was RL fine-tuned for 20 steps/epochs on top of the SFT checkpoint to reverse text, using [prime-rl](https://github.com/PrimeIntellect-ai/prime-rl/) for RL training and [reverse-text](https://github.com/PrimeIntellect-ai/prime-environments/tree/main/environments/reverse_text) as the RL environment. The results below show the improvement.
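
For quick testing, here is a minimal inference sketch with Hugging Face `transformers` (the repo id below is a placeholder assumption for this model; the system prompt mirrors the example later in this card):

```python
# Minimal inference sketch using Hugging Face transformers.
# NOTE: the model id below is a placeholder assumption for this card's repo.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "PrimeIntellect/Qwen3-0.6B-Reverse-Text-RL"  # placeholder id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

messages = [
    {
        "role": "system",
        "content": "Reverse the text character-by-character. "
        "Put your answer in <reversed_text> tags.",
    },
    {"role": "user", "content": "The community in Bruck was merged into it"},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
output_ids = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```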
|
|
|
|
|
## Comparison with SFT (base) model |
|
|
|
|
|
The reward (correctness score) distribution improved for the RL fine-tuned (RLFT) model across all rollouts.
|
|
 |
|
|
|
|
|
At the instance level, comparing the best scores across rollouts shows a mean improvement of 3.73%, with a maximum improvement of roughly 30% and a maximum regression of roughly 3%.
|
|
 |
|
|
|
|
|
## Example Prompt & Reward |
|
|
|
|
|
**Task:** `reverse-text` |
|
|
|
|
|
**Prompt:**

- **System:** “Reverse the text character-by-character. Put your answer in `<reversed_text>` tags.”
- **User:** “The community in Bruck was merged into it”
|
|
|
|
|
**Expected Completion:** |
|
|
```text
<reversed_text>
.ti otni degrem saw kcuBr ni ytinummoc ehT
</reversed_text>
```
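
The answer is parsed out of the `<reversed_text>` tags before scoring. A minimal extraction sketch (the environment's actual parser may differ):

```python
import re

def extract_reversed_text(completion: str) -> str:
    # Pull the answer out of the <reversed_text> tags; empty string if absent.
    match = re.search(r"<reversed_text>(.*?)</reversed_text>", completion, re.DOTALL)
    return match.group(1).strip() if match else ""
```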
|
|
|
|
|
**Expected Reward:** 0.963855421686747
|
|
|
|
|
Note: The reward is based on the longest common subsequence (LCS) between the completion and the ground-truth reversal, which is why the near-perfect completion above scores just below 1.0.
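
For illustration, one plausible formulation normalizes the LCS length by the longer of the two strings, so an exact character-by-character reversal scores 1.0 (the environment's actual formula may differ):

```python
def lcs_length(a: str, b: str) -> int:
    # Classic O(len(a) * len(b)) dynamic-programming LCS.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a):
        for j, cb in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if ca == cb else max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]

def lcs_reward(answer: str, target: str) -> float:
    # Normalized so an exact reversal scores 1.0; empty inputs score 0.0.
    if not answer or not target:
        return 0.0
    return lcs_length(answer, target) / max(len(answer), len(target))
```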
|
|
|