---
license: mit
library_name: transformers
pipeline_tag: image-text-to-text
tags:
- reinforcement-learning
- agent
- gui-agent
- vl-model
---

# RLAnything: Forge Environment, Policy, and Reward Model in Completely Dynamic RL System

[Paper](https://arxiv.org/abs/2602.02488) | [Code](https://github.com/Gen-Verse/Open-AgentRL) | [Blog](https://yinjjiew.github.io/projects/rlanything/)

**RLAnything** is a reinforcement learning framework that dynamically forges the environment, the policy, and the reward model through closed-loop optimization, amplifying learning signals and strengthening the overall RL system in any LLM or agentic scenario.

### Highlights

* **Integrated Feedback for Policy:** The policy is trained with integrated outcome and step-wise signals from the reward model, which outperforms training with traditional outcome-only signals.
* **Consistency Feedback for Reward Model:** The reward model is jointly optimized via consistency feedback, which in turn further improves policy training.
* **Critic Feedback for Environment:** Theory-motivated, automatic environment adaptation leverages critic feedback from both the reward model and the policy to improve training for each, enabling learning from experience.
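
To make the first two highlights more concrete, here is a minimal, hypothetical Python sketch of how an outcome reward and step-wise reward-model scores could be combined, and how a reward model could be scored for consistency. It is not the RLAnything implementation: the function names `integrated_step_rewards` and `consistency_reward`, the weighted-sum combination, the `weight` parameter, and the sign-agreement consistency score are all illustrative assumptions. It also assumes a binary outcome reward in {+1, -1} and reward-model step scores in [-1, 1].

```python
from typing import List


def integrated_step_rewards(
    outcome: float, step_scores: List[float], weight: float = 0.5
) -> List[float]:
    """Hypothetical integrated feedback for the policy.

    Each step receives the trajectory-level outcome reward plus a weighted
    step-wise score from the reward model. The weighted sum and `weight`
    are illustrative assumptions, not the RLAnything formula.
    """
    return [outcome + weight * score for score in step_scores]


def consistency_reward(outcome: float, step_scores: List[float]) -> float:
    """Hypothetical consistency feedback for the reward model.

    Returns the fraction of step-wise judgments whose sign (positive means
    "good step") agrees with the sign of the final outcome. Sign agreement
    is a crude, assumed proxy, not the paper's definition.
    """
    if not step_scores:
        return 0.0  # no step judgments to check
    success = outcome > 0
    agreeing = sum(1 for score in step_scores if (score > 0) == success)
    return agreeing / len(step_scores)


if __name__ == "__main__":
    outcome = 1.0                   # assumed: +1 for a successful trajectory
    step_scores = [1.0, -1.0, 1.0]  # assumed: reward-model score per step
    print(integrated_step_rewards(outcome, step_scores))  # [1.5, 0.5, 1.5]
    print(consistency_reward(outcome, step_scores))       # approx. 0.667
```

In this sketch, the step the reward model flags as bad still receives less credit (0.5 instead of 1.5) even though the trajectory succeeded, which is the kind of denser signal that outcome-only rewards cannot provide. The actual formulation is described in the paper and the Open-AgentRL code.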

<p align="center">
  <img src="https://github.com/yinjjiew/Data/raw/main/rlanything/rlanythingoverview.png" width="100%"/>
</p>

### Performance

RLAnything yields substantial gains across representative LLM and agentic tasks, boosting Qwen3-VL-8B-Thinking by 9.1% on OSWorld and Qwen2.5-7B-Instruct by 18.7% and 11.9% on AlfWorld and LiveBench, respectively.

<p align="center">
  <img src="https://github.com/yinjjiew/Data/raw/main/rlanything/rlanythingscaleosworld.png" width="70%"/>
</p>

<p align="center">
  <img src="https://github.com/yinjjiew/Data/raw/main/rlanything/rlanythingosworldbench.png" width="100%"/>
</p>

## Citation

```bibtex
@article{wang2026rlanything,
  title={RLAnything: Forge Environment, Policy, and Reward Model in Completely Dynamic RL System},
  author={Wang, Yinjie and Xie, Tianbao and Shen, Ke and Wang, Mengdi and Yang, Ling},
  journal={arXiv preprint arXiv:2602.02488},
  year={2026}
}
```