---
license: mit
datasets:
- PrimeIntellect/Reverse-Text-RL
language:
- en
base_model:
- Qwen/Qwen3-0.6B
- PrimeIntellect/Qwen3-0.6B-Reverse-Text-SFT
---
# Reverse Text Model Qwen3-0.6B
A simple model that was RL fine-tuned for 20 steps/epochs on top of the SFT checkpoint to reverse text, using [prime-rl](https://github.com/PrimeIntellect-ai/prime-rl/) (RL training) and [reverse-text](https://github.com/PrimeIntellect-ai/prime-environments/tree/main/environments/reverse_text) (RL environment). The sections below show the improvement in results.
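For reference, here is a minimal inference sketch using the standard `transformers` chat API. The model id below is a placeholder; substitute this repository's actual id.
```python
# Minimal inference sketch (placeholder model id; the prompt format follows
# the example later in this card).
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "your-username/Qwen3-0.6B-Reverse-Text-RL"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

messages = [
    {
        "role": "system",
        "content": "Reverse the text character-by-character. "
                   "Put your answer in <reversed_text> tags.",
    },
    {"role": "user", "content": "The community in Bruck was merged into it"},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```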
## Comparison with SFT (base) model
The reward (correctness score) distribution has improved for the RLFT model across all rollouts.
![](comparison.png)
At the instance level, comparing the best scores across rollouts, we see a mean improvement of 3.73%, with a maximum gain of ~30% and a worst-case regression of ~3%.
![](instance-level.png)
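These instance-level statistics could be computed roughly as follows, assuming one array of best-of-rollouts rewards per model (the function and array names are illustrative, not from the training code):
```python
import numpy as np

def instance_level_summary(sft_best: np.ndarray, rlft_best: np.ndarray) -> dict:
    """Per-prompt change in best-of-rollouts reward, RLFT minus SFT baseline."""
    delta = rlft_best - sft_best
    return {
        "mean_improvement": float(delta.mean()),  # reported: +3.73%
        "max_improvement": float(delta.max()),    # reported: ~+30%
        "worst_regression": float(delta.min()),   # reported: ~-3%
    }
```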
## Example Prompt & Reward
**Task:** `reverse-text`
**Prompt:**
- **System:**
“Reverse the text character-by-character. Put your answer in `<reversed_text>` tags.”
- **User:**
“The community in Bruck was merged into it”
**Expected Completion:**
```text
<reversed_text>
.ti otni degrem saw kcuBr ni ytinummoc ehT
</reversed_text>
```
**Expected Reward:**
0.963855421686747
Note: The reward is based on the longest common subsequence (LCS) between the completion and the perfectly reversed text.
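The stated reward is consistent with a length-normalized LCS, `2 * LCS(a, b) / (len(a) + len(b))`: the completion above is not a perfect reversal (extra leading period, `kcuBr` vs. `kcurB`), so its LCS with the 41-character target is 40, and 2 × 40 / (41 + 42) = 80/83 ≈ 0.9639. Here is a sketch under that assumption (the environment's actual scoring code may differ):
```python
def lcs_length(a: str, b: str) -> int:
    # Classic dynamic-programming longest-common-subsequence length.
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b):
            curr.append(prev[j] + 1 if ch_a == ch_b else max(prev[j + 1], curr[j]))
        prev = curr
    return prev[-1]

def reward(target: str, completion: str) -> float:
    # Length-normalized LCS: 1.0 only for a perfect character-level match.
    return 2 * lcs_length(target, completion) / (len(target) + len(completion))

target = "The community in Bruck was merged into it"[::-1]
completion = ".ti otni degrem saw kcuBr ni ytinummoc ehT"
print(reward(target, completion))  # 0.963855421686747 (= 80/83)
```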