QWEN-0.5B-GRPO / README.md
license: mit

Qwen2.5-0.5B fine-tuned on GSM8K using Group Relative Policy Optimization (GRPO), the method proposed in DeepSeekMath.
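For readers unfamiliar with GRPO, its core idea is to score each sampled completion relative to the other completions drawn for the same prompt, instead of using a learned value function. The snippet below is a minimal sketch of that group-relative advantage under outcome rewards. It is not the training code used for this model: the function name, the `eps` term, and the choice of population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its group's mean and standard deviation.

    `rewards` holds one scalar reward per completion sampled for a single
    prompt. `eps` (an assumption, for numerical safety) avoids division by
    zero when every completion in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; a sample std is an equally plausible choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four completions: two correct (reward 1), two wrong (reward 0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Completions that beat the group average get a positive advantage and the rest get a negative one, so a group in which every completion earns the same reward contributes no learning signal.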

A random sample:

Question:
Jenna is adding black dots to a bunch of white blouses. Each blouse gets 20 dots, and each dot takes 10 ml of black dye. How many 400-ml bottles of dye does Jenna need to buy to dye 100 blouses?

Answer:
50

Response:
<reasoning>
Jenna needs to dye 100 blouses, and each blouse gets 20 dots, so she needs 100 * 20 = 2000 dots in total. Each dot takes 10 ml of black dye, so she needs 2000 * 10 = 20000 ml of black dye. Since each bottle holds 400 ml of dye, Jenna needs to buy 20000 / 400 = 50 bottles of dye.
</reasoning>
<answer>
50
</answer>
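
Because the response wraps its final result in `<answer>` tags, the prediction can be pulled out and compared with the reference answer mechanically. The following is a rough sketch of one way to do that. The helper names `extract_answer` and `is_correct` are hypothetical, and exact string matching is an assumption, not necessarily the scoring used during training.

```python
import re

# Capture whatever sits between the <answer> tags, ignoring surrounding whitespace.
ANSWER_RE = re.compile(r"<answer>\s*(.*?)\s*</answer>", re.DOTALL)


def extract_answer(response):
    """Return the text inside <answer>...</answer>, or None if the tags are missing."""
    match = ANSWER_RE.search(response)
    return match.group(1) if match else None


def is_correct(response, reference):
    """Exact-match comparison between the extracted answer and the reference."""
    predicted = extract_answer(response)
    return predicted is not None and predicted.strip() == reference.strip()


# The sample response above, abbreviated to its structure.
sample = "<reasoning>\n100 * 20 * 10 / 400 = 50\n</reasoning>\n<answer>\n50\n</answer>"
matches_reference = is_correct(sample, "50")

# Independent check of the arithmetic in the sample: blouses * dots * ml per dot / bottle size.
bottles = (100 * 20 * 10) // 400
```

Exact matching is strict: an answer such as `50.0` or `50 bottles` would count as wrong unless the comparison is relaxed, for example by parsing both sides as numbers.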