This repository contains the weights of an LLM agent that learns to solve the logic puzzle Masyu (Necklace) using reinforcement learning with the GRPO algorithm.

*(Figure: training results.)*

These are my results from training Qwen/Qwen2-1.5B-Instruct. Due to constraints on available computational resources, significant performance improvements were achieved primarily on the first four difficulty levels. More extensive training, with more steps, a larger base model, or a higher `num_generations`, would likely be required to improve on more complex puzzles.
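For context on the training objective: GRPO samples a group of completions per prompt (the `num_generations` mentioned above) and scores each one relative to its own group, rather than using a learned value function. The sketch below illustrates only that group-relative advantage computation; it is a minimal illustration of the idea, not the actual training code used for this model.

```python
import statistics

def grpo_advantages(rewards):
    """Compute group-relative advantages as in GRPO: each sampled
    completion's reward is normalized by the mean and standard
    deviation of its sampling group (one group per prompt)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        # All completions scored equally: no learning signal for this group.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

# Example: solver rewards for one puzzle's group of 4 sampled solutions.
print(grpo_advantages([1.0, 0.0, 0.5, 0.5]))
```

Completions that solve the puzzle better than their group's average get positive advantages and are reinforced; worse-than-average ones are penalized.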

Model size: 2B params · Tensor type: F16 · Format: Safetensors

Model tree for Bernoulli/MasyuLLMAgent

- Base model: Qwen/Qwen2.5-1.5B (this model is a finetune)
- Quantizations: 1 model