This repository contains the weights of an LLM agent trained to solve the logic puzzle Masyu (Necklace) with reinforcement learning, using the GRPO algorithm.
These are my results from training Qwen/Qwen2-1.5B-Instruct. Because of limited compute, performance improved significantly mainly on the first four difficulty levels. More extensive training (more steps, a larger base model, or a higher num_generations) would likely be required to improve results on more complex puzzles.
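To illustrate why num_generations matters, here is a minimal sketch of the group-relative advantage that GRPO is built around, assuming num_generations is the number of completions sampled per puzzle prompt. The function name `group_relative_advantages`, the `eps` value, and the example rewards are hypothetical and are not taken from this repository's training code; the population standard deviation is also an assumption.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group of completions for the same prompt.

    Hypothetical helper: each completion's advantage is its reward minus the
    group mean, divided by the group standard deviation (plus eps for stability).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up rewards for a group of 4 completions (num_generations = 4):
# 1.0 = puzzle solved, 0.0 = not solved.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# If every completion in the group gets the same reward, all advantages are
# zero and the group carries no learning signal.
flat = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

On hard puzzles a small group often contains no solved completion at all, which gives the `flat` case above; a larger num_generations raises the chance that at least one completion in the group earns a different reward.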
