Add metadata and improve model card for RLAnything

#1
opened by nielsr (HF Staff)
Files changed (1)
  1. README.md (+20 −12)
README.md CHANGED
@@ -1,24 +1,34 @@
---
license: mit
---

-
- # Introduction to TraDo

[Paper](https://arxiv.org/abs/2602.02488) | [Code](https://github.com/Gen-Verse/Open-AgentRL) | [Blog](https://yinjjiew.github.io/projects/rlanything/)

- We introduce **RLAnything**, a reinforcement learning framework forges environment, policy and reward model in a completely dynamic system to enhance the training signals and improve the whole system.
-
- * **Integrated Feedback for Policy:** The policy is trained with integrated outcome and step-wise signals from reward model.
- * **Consistency Feedback for Reward Model:** The Reward model is jointly optimized by consistency feedback, further improves policy training.
- * **Critic Feedback for Environment:** Our theory-motivated automatic environment adaptation improves training for both the reward and policy models by leveraging critic feedback from each.

<p align="center">
<img src="https://github.com/yinjjiew/Data/raw/main/rlanything/rlanythingoverview.png" width="100%"/>
</p>

<p align="center">
<img src="https://github.com/yinjjiew/Data/raw/main/rlanything/rlanythingscaleosworld.png" width="70%"/>
@@ -30,15 +40,13 @@ We introduce **RLAnything**, a reinforcement learning framework forges environme
</p>

- # Citation

- ```
@article{wang2026rlanything,
title={RLAnything: Forge Environment, Policy, and Reward Model in Completely Dynamic RL System},
author={Wang, Yinjie and Xie, Tianbao and Shen, Ke and Wang, Mengdi and Yang, Ling},
journal={arXiv preprint arXiv:2602.02488},
year={2026}
}
- ```
-
-
---
license: mit
+ library_name: transformers
+ pipeline_tag: image-text-to-text
+ tags:
+ - reinforcement-learning
+ - agent
+ - gui-agent
+ - vl-model
---

+ # RLAnything: Forge Environment, Policy, and Reward Model in Completely Dynamic RL System

[Paper](https://arxiv.org/abs/2602.02488) | [Code](https://github.com/Gen-Verse/Open-AgentRL) | [Blog](https://yinjjiew.github.io/projects/rlanything/)

+ **RLAnything** is a reinforcement learning framework that dynamically forges the environment, policy, and reward model through closed-loop optimization, amplifying learning signals and strengthening the overall RL system for any LLM or agentic scenario.
+ ### Highlights

+ * **Integrated Feedback for Policy:** The policy is trained with integrated outcome and step-wise signals from the reward model, outperforming traditional outcome-only signals.
+ * **Consistency Feedback for Reward Model:** The reward model is jointly optimized via consistency feedback, which in turn further improves policy training.
+ * **Critic Feedback for Environment:** Theory-motivated automatic environment adaptation improves training for both the reward and policy models by leveraging critic feedback from each, enabling learning from experience.
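
The integrated-feedback idea can be illustrated with a toy sketch. This is not RLAnything's actual formulation; the linear mix, the `integrated_step_signals` helper, and the `alpha` weight are assumptions for illustration only. It shows one way a single trajectory-level outcome reward can be blended with step-wise reward-model scores to produce per-step training signals.

```python
# Illustrative sketch (assumed, not the paper's exact method): blend a
# trajectory-level outcome reward with step-wise reward-model scores
# into one training signal per step.

def integrated_step_signals(outcome_reward, step_scores, alpha=0.5):
    """Return one blended signal per step.

    outcome_reward: scalar success/failure signal for the whole trajectory.
    step_scores:    per-step scores from a step-wise reward model.
    alpha:          hypothetical weight on the outcome signal.
    """
    return [alpha * outcome_reward + (1.0 - alpha) * s for s in step_scores]


# A successful trajectory (outcome 1.0) with mixed step quality:
print(integrated_step_signals(1.0, [0.2, 0.8, 0.5]))  # [0.6, 0.9, 0.75]
```

In a full training loop such blended signals would feed a policy-gradient update; the sketch only shows how outcome and step-wise feedback can be combined into a denser signal than the outcome alone.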

<p align="center">
<img src="https://github.com/yinjjiew/Data/raw/main/rlanything/rlanythingoverview.png" width="100%"/>
</p>

+ ### Performance
+
+ RLAnything yields substantial gains across representative LLM and agentic tasks, boosting Qwen3-VL-8B-Thinking by 9.1% on OSWorld and Qwen2.5-7B-Instruct by 18.7% and 11.9% on AlfWorld and LiveBench, respectively.

<p align="center">
<img src="https://github.com/yinjjiew/Data/raw/main/rlanything/rlanythingscaleosworld.png" width="70%"/>
</p>

+ ## Citation

+ ```bibtex
@article{wang2026rlanything,
title={RLAnything: Forge Environment, Policy, and Reward Model in Completely Dynamic RL System},
author={Wang, Yinjie and Xie, Tianbao and Shen, Ke and Wang, Mengdi and Yang, Ling},
journal={arXiv preprint arXiv:2602.02488},
year={2026}
}
+ ```