This repository contains the weights of an LLM agent trained to solve the logic puzzle Masyu (Necklace) using reinforcement learning with the GRPO algorithm.

These are my results from fine-tuning Qwen/Qwen2.5-1.5B-Instruct. Because of limited compute, performance improved significantly mainly on the first four difficulty levels. Longer training (more steps), a larger base model, or a higher num_generations would likely be needed to improve results on more complex puzzles.
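
To illustrate why num_generations matters for harder puzzles, here is a minimal sketch of the group-relative advantage step at the core of GRPO. It is not the training code behind these weights: the function name, the `eps` value, the use of the population standard deviation, and the example rewards are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group of completions sampled for the same prompt.

    Each completion is scored relative to its siblings, so no separate
    value model is needed. `eps` (assumed value) avoids division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations may differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for num_generations = 4 attempts at one puzzle:
# two attempts solved it (1.0), two did not (0.0).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# If every attempt fails, all advantages are zero and the group
# contributes no learning signal.
no_signal = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

On a hard puzzle, a small group often contains no successful attempt at all, so every advantage is zero. Sampling more completions per prompt raises the chance that at least one attempt earns a higher reward than the rest, which is one plausible reason a higher num_generations would help on the more difficult levels.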

Format: Safetensors
Model size: 2B params
Tensor type: F16

Model tree for Bernoulli/MasyuLLMAgent

Base model: Qwen/Qwen2.5-1.5B
Finetuned: Qwen/Qwen2.5-1.5B-Instruct
Finetuned: this model
Quantizations: 1 model