Commit 7013d9f · Parent(s): 8f977aa · committed by xxwu with nielsr (HF Staff)

Add model card and paper link (#1)

Co-authored-by: Niels Rogge <nielsr@users.noreply.huggingface.co>

Files changed (1): README.md (+64, −3)
---
license: mit
library_name: transformers
pipeline_tag: text-generation
base_model: Qwen/Qwen2.5-7B-Instruct
tags:
- reinforcement-learning
- tool-use
- agent
- travel-planner
---

# Agent-STAR-RL-7B

Agent-STAR-RL-7B is a 7B-parameter model based on **Qwen2.5-7B-Instruct**, fine-tuned with reinforcement learning (RL) for long-horizon tool-use tasks.

This model is a key artifact of the paper [Demystifying Reinforcement Learning for Long-Horizon Tool-Using Agents: A Comprehensive Recipe](https://huggingface.co/papers/2603.21972).

## Model Description

The model was developed with the **STAR [Data Synthesis → SFT → RL]** pipeline, a unified post-training recipe for scaling RL in complex, multi-turn environments. It is specifically optimized for [TravelPlanner](https://github.com/OSU-NLP-Group/TravelPlanner/), a challenging testbed that requires tool orchestration to satisfy multifaceted commonsense and hard constraints.

Following the systematic study in the paper, the 7B variant uses **GRPO (Group Relative Policy Optimization)** with a dense **SUM reward** for stronger performance and faster convergence.

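The core of GRPO can be sketched in a few lines: sample a group of rollouts for the same prompt and normalize each rollout's reward by the group mean and standard deviation to obtain its advantage. The sketch below is a minimal, self-contained illustration of that idea, not the repository's training code; the reward aspects and their values are invented for the example.

```python
from statistics import mean, stdev

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: z-score each rollout's reward
    against the other rollouts sampled for the same prompt."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

# A dense SUM-style reward adds per-aspect scores (hypothetical
# aspects here) instead of a single sparse pass/fail signal.
def sum_reward(aspect_scores):
    return sum(aspect_scores.values())

group = [sum_reward(s) for s in (
    {"format": 1.0, "tools": 0.5, "constraints": 0.0},
    {"format": 1.0, "tools": 1.0, "constraints": 1.0},
    {"format": 0.0, "tools": 0.5, "constraints": 0.0},
)]
advs = grpo_advantages(group)
```

Because every advantage is centered on the group mean, the advantages sum to (approximately) zero, and the best rollout in the group gets the largest positive advantage.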
- **Paper:** [Demystifying Reinforcement Learning for Long-Horizon Tool-Using Agents: A Comprehensive Recipe](https://huggingface.co/papers/2603.21972)
- **Repository:** [https://github.com/WxxShirley/Agent-STAR](https://github.com/WxxShirley/Agent-STAR)
- **Dataset:** [Agent-STAR-TravelDataset](https://huggingface.co/datasets/xxwu/Agent-STAR-TravelDataset)

## Training Pipeline

1. **Data Synthesis:** Generate synthetic queries and successful trajectories.
2. **SFT:** Fine-tune the backbone on ~1K successful trajectories.
3. **RL:** Apply scale-aware reinforcement learning tuning.

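The SFT step consumes multi-turn tool-use trajectories. As an illustration only (the real schema lives in the Agent-STAR-TravelDataset and may differ), a ReAct-style trajectory can be flattened into chat messages like this:

```python
def trajectory_to_messages(query, steps, final_answer):
    """Flatten a trajectory of (thought, action, observation) steps
    into chat-template messages for supervised fine-tuning.
    Hypothetical format: the actual dataset schema may differ."""
    messages = [{"role": "user", "content": query}]
    for step in steps:
        messages.append({
            "role": "assistant",
            "content": f"Thought: {step['thought']}\nAction: {step['action']}",
        })
        messages.append({"role": "tool", "content": step["observation"]})
    messages.append({"role": "assistant", "content": final_answer})
    return messages

msgs = trajectory_to_messages(
    "Plan a 3-day trip to Chicago.",
    [{"thought": "Look up flights first.",
      "action": "FlightSearch[Chicago]",
      "observation": "F001: dep 09:00, $120"}],
    "Day 1: arrive on flight F001 ...",
)
```

Only the assistant turns would be used as supervision targets; user and tool turns provide context.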
## Usage

This model is designed to be used within a ReAct-style agentic framework. To reproduce the TravelPlanner results, use the inference code provided in the official repository.

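At a high level, a ReAct-style loop alternates model generation with tool execution until the model emits a final answer or a turn budget (such as the `--max_turns 60` used below) is exhausted. This sketch uses a stubbed model and a toy tool registry; it is an assumption about the general shape of such a framework, not the repository's actual inference loop.

```python
def react_loop(generate, tools, query, max_turns=60):
    """Minimal ReAct loop: call the model, parse an action, run the
    tool, feed the observation back. `generate` stands in for the LLM;
    `tools` maps tool names to callables. All names are illustrative."""
    context = f"Question: {query}\n"
    for _ in range(max_turns):
        step = generate(context)  # e.g. "Action: Search[x]" or "Final: ..."
        context += step + "\n"
        if step.startswith("Final:"):
            return step.removeprefix("Final:").strip()
        name, _, arg = step.removeprefix("Action:").strip().partition("[")
        observation = tools[name](arg.rstrip("]"))
        context += f"Observation: {observation}\n"
    return None  # turn budget exhausted

# Stubbed model: one tool call, then a final answer.
script = iter(["Action: Search[hotels in Paris]", "Final: Hotel A"])
answer = react_loop(
    lambda ctx: next(script),
    {"Search": lambda q: f"3 results for {q}"},
    "Find a hotel.",
)
```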
### Inference Example

From the [Agent-STAR](https://github.com/WxxShirley/Agent-STAR) repository root:

```bash
cd Inference
python3 -u main.py \
    --model xxwu/Agent-STAR-RL-7B \
    --save_suffix test_run \
    --max_workers 20 \
    --split validation \
    --max_context 32768 \
    --max_turns 60
```

## Citation

```bibtex
@misc{wu2026agentstar,
  title={Demystifying Reinforcement Learning for Long-Horizon Tool-Using Agents: A Comprehensive Recipe},
  author={Xixi Wu and Qianguo Sun and Ruiyang Zhang and Chao Song and Junlong Wu and Yiyan Qi and Hong Cheng},
  year={2026},
  eprint={2603.21972},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2603.21972},
}
```