metadata
license: apache-2.0
datasets:
  - PeterJinGo/nq_hotpotqa_train
language:
  - en
metrics:
  - exact_match
base_model:
  - Qwen/Qwen2.5-7B-Instruct

SAPO

Improving Search Agent with One Line of Code

🔥 News

  • The paper will be available on [arXiv]

Overview

SAPO is a policy optimization method that stabilizes post-training for autonomous multi-turn search agents tackling complex, real-world questions.

Highlights

  • One line of code: A conditional KL penalty enforces token-level distributional constraints on low-probability positive tokens.
  • Strong performance: Consistent gains across diverse search agents on seven challenging QA benchmarks.
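
To illustrate the "conditional KL penalty" highlight above, here is a minimal sketch in plain Python. It is not the released implementation. The function name `conditional_kl_penalty`, the parameters `prob_threshold` and `beta`, the use of the old-policy probability to decide what counts as "low-probability", and the choice of the non-negative `k3` KL estimator are all assumptions made for illustration; consult the paper for the exact condition and penalty form.

```python
import math


def conditional_kl_penalty(logp_new, logp_old, advantages,
                           prob_threshold=0.2, beta=0.1):
    """Hypothetical per-token penalty (illustrative, not the authors' code).

    A token is penalized only if it is "positive" (advantage > 0) and
    "low-probability" (old-policy probability below prob_threshold).
    All other tokens receive a penalty of 0.
    """
    penalties = []
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        is_low_prob_positive = adv > 0 and math.exp(lp_old) < prob_threshold
        if is_low_prob_positive:
            # Assumed estimator: k3 = r - 1 - log(r), with r = p_new / p_old.
            # It is non-negative and equals 0 when the two policies agree.
            log_ratio = lp_new - lp_old
            kl = math.exp(log_ratio) - 1.0 - log_ratio
            penalties.append(beta * kl)
        else:
            penalties.append(0.0)
    return penalties


# Three tokens: low-prob positive, high-prob positive, low-prob negative.
logp_old = [math.log(0.05), math.log(0.90), math.log(0.05)]
logp_new = [math.log(0.20), math.log(0.95), math.log(0.20)]
advantages = [1.0, 1.0, -1.0]
penalties = conditional_kl_penalty(logp_new, logp_old, advantages)
```

In this toy batch only the first token is penalized. In a training loop, a term like this would typically be added to the policy loss for the selected tokens, leaving the rest of the objective unchanged.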

πŸ™ Acknowledgements

Built upon VeRL, Search-R1, and AutoRefine. Thanks to the authors for their valuable work.

🎓 Citations

@misc{li2026improvingsearchagentline,
      title={Improving Search Agent with One Line of Code}, 
      author={Jian Li and Dongsheng Chen and Zhenhua Xu and Yizhang Jin and Jiafu Wu and Chengjie Wang and Xiaotong Yuan and Yabiao Wang},
      year={2026},
      eprint={2603.10069},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2603.10069}, 
}

@misc{li2026sesearch,
      title={SE-Search: Self-Evolving Search Agent via Memory and Dense Reward}, 
      author={Jian Li and Yizhang Jin and Dongqi Liu and Hang Ding and Jiafu Wu and Dongsheng Chen and Yunhang Shen and Yulei Qin and Ying Tai and Chengjie Wang and Xiaotong Yuan and Yabiao Wang},
      year={2026},
      eprint={2603.03293},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2603.03293}, 
}

@article{li2025survey,
  title={A Survey on AI Search with Large Language Models},
  author={Li, Jian and Li, Xiaoxi and Zheng, Yan and Jin, Yizhang and Wang, Shuo and Wu, Jiafu and Wang, Yabiao and Wang, Chengjie and Yuan, Xiaotong},
  year={2025}
}