	---
	license: mit
	library_name: transformers
	pipeline_tag: image-text-to-text
	tags:
	- reinforcement-learning
	- agent
	- gui-agent
	- vl-model
	---

	# RLAnything: Forge Environment, Policy, and Reward Model in Completely Dynamic RL System

	[Paper](https://arxiv.org/abs/2602.02488) \| [Code](https://github.com/Gen-Verse/Open-AgentRL) \| [Blog](https://yinjjiew.github.io/projects/rlanything/)

	RLAnything is a reinforcement learning framework that dynamically forges the environment, policy, and reward model through closed-loop optimization, amplifying learning signals and strengthening the overall RL system for any LLM or agentic scenario.

	### Highlights

	* Integrated Feedback for Policy: The policy is trained with integrated outcome and step-wise signals from the reward model, outperforming traditional outcome-only signals.
	* Consistency Feedback for Reward Model: The reward model is jointly optimized via consistency feedback, which in turn further improves policy training.
	* Critic Feedback for Environment: Theory-motivated automatic environment adaptation improves training for both the reward and policy models by leveraging critic feedback from each, enabling learning from experience.
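
The first two highlights can be illustrated with a toy sketch. This is not the RLAnything implementation: the function names, the additive combination, the `step_weight` parameter, and the sign-agreement rule are all assumptions chosen only to show the idea of mixing an outcome signal with step-wise reward-model scores, and of scoring the reward model by its agreement with the outcome.

```python
from typing import List


def integrated_rewards(
    outcome: float, step_scores: List[float], step_weight: float = 0.5
) -> List[float]:
    """Hypothetical integrated signal for the policy.

    Each step receives the trajectory-level outcome plus a weighted
    step-wise score from the reward model. The additive form and the
    default weight are illustrative assumptions.
    """
    return [outcome + step_weight * score for score in step_scores]


def consistency_feedback(outcome: float, step_scores: List[float]) -> List[float]:
    """Hypothetical consistency signal for the reward model.

    Returns +1.0 where the sign of a step-wise judgment agrees with the
    sign of the final outcome and -1.0 otherwise. The agreement rule is
    an illustrative assumption.
    """
    return [
        1.0 if (score > 0) == (outcome > 0) else -1.0
        for score in step_scores
    ]


if __name__ == "__main__":
    # A successful trajectory (outcome = +1) with one step judged negatively.
    steps = [1.0, -1.0, 1.0]
    print(integrated_rewards(1.0, steps))    # [1.5, 0.5, 1.5]
    print(consistency_feedback(1.0, steps))  # [1.0, -1.0, 1.0]
```

In this toy version, an outcome-only baseline would assign the same reward to all three steps, whereas the integrated signal separates the step the reward model judged negatively from the others.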

	<p align="center">
	<img src="https://github.com/yinjjiew/Data/raw/main/rlanything/rlanythingoverview.png" width="100%"/>
	</p>

	### Performance

	RLAnything yields substantial gains across various representative LLM and agentic tasks, boosting Qwen3-VL-8B-Thinking by 9.1% on OSWorld and Qwen2.5-7B-Instruct by 18.7% and 11.9% on AlfWorld and LiveBench, respectively.

	<p align="center">
	<img src="https://github.com/yinjjiew/Data/raw/main/rlanything/rlanythingscaleosworld.png" width="70%"/>
	</p>


	<p align="center">
	<img src="https://github.com/yinjjiew/Data/raw/main/rlanything/rlanythingosworldbench.png" width="100%"/>
	</p>


	## Citation

	```bibtex
	@article{wang2026rlanything,
	title={RLAnything: Forge Environment, Policy, and Reward Model in Completely Dynamic RL System},
	author={Wang, Yinjie and Xie, Tianbao and Shen, Ke and Wang, Mengdi and Yang, Ling},
	journal={arXiv preprint arXiv:2602.02488},
	year={2026}
	}
	```