ftajwar's Collections

MaxRL

updated 2 days ago

Qwen3-Base post-trained checkpoints for our paper, Maximum Likelihood Reinforcement Learning [https://zanette-labs.github.io/MaxRL/]

  • ftajwar/qwen3_4B_Base_MaxRL_Polaris_1000_steps

    Text Generation • 4B • Updated 2 days ago • 20

  • ftajwar/qwen3_4B_Base_GRPO_Polaris_1000_steps

    Text Generation • 4B • Updated 2 days ago • 21

  • ftajwar/qwen3_1.7B_Base_MaxRL_Polaris_1000_steps

    Text Generation • 2B • Updated 2 days ago • 14

  • ftajwar/qwen3_1.7B_Base_GRPO_Polaris_1000_steps

    Text Generation • 2B • Updated 2 days ago • 17