
LoHoSearch: Benchmarking Long-Horizon
Search Agents Beyond the Human Difficulty Ceiling

📃 Paper • 🏆 Benchmark • 📦 Training Data

Abstract

Search agent benchmarks exemplified by BrowseComp have rapidly saturated over the past year, with the strongest models surpassing 90% accuracy. Since these benchmarks are predominantly human-authored, annotators lack a global perspective on entity statistics and cannot systematically maximize search space size and structural complexity. This creates a difficulty ceiling that is hard to break. To address this, we introduce LoHoSearch (Long-Horizon Search Agents), a challenging benchmark comprising 544 human-verified questions across 11 domains. LoHoSearch is constructed via an automated pipeline built upon a knowledge graph covering over 7 million Wikipedia entities, which selects relations with large search spaces and assembles them into structurally complex questions with KG-verified unique answers. Our evaluation demonstrates that even the strongest model achieves only 34.74% accuracy, and existing context management strategies (best +6.8%) yield far smaller gains than on prior benchmarks. LoHoSearch provides a more demanding standard for evaluating long-horizon reasoning and context management in search agents. For more details, see our paper.

Dataset

This repository contains two subsets:

| Config | File | Split | Records | Description | Language |
| --- | --- | --- | --- | --- | --- |
| benchmark | LoHoSearch.csv | test | 544 | Human-verified evaluation benchmark | English |
| train | train.csv | train | 2000 | Training set generated by the same automated pipeline, without human verification | English |

Domain Distribution

Main Results

Evaluation setup. Each model is equipped with two tools, search (keyword queries via a traditional search engine) and browse (fetch the content of given URLs), and uses the same system prompt as BrowseComp. We set temperature to 1.0, keep each model's default thinking settings, and use a 200K context window (184K input + 16K output). The reported score is the accuracy over the 544 questions, averaged across two LLM-judge gradings: the BrowseComp grading prompt with GPT-4.1 as judge, and the SimpleQA grading prompt with Qwen2.5-32B as judge. Averaging two complementary judges reduces the over-strictness or over-leniency of any single setup.

| Model | Reasoning | Source | LoHoSearch Score (%) |
| --- | --- | --- | --- |
| GPT-5.5 | N | Closed | 34.74 |
| DeepSeek-V4-Pro | Y | Open | 15.99 |
| Claude-Opus-4.6 | N | Closed | 15.62 |
| Kimi-K2.6 | Y | Open | 15.53 |
| Gemini-3.1-Pro | Y | Closed | 13.32 |
| GLM-5.1 | Y | Open | 12.77 |
| Claude-Opus-4.7 | N | Closed | 10.29 |
| DeepSeek-V4-Flash | Y | Open | 10.02 |
| LongCat-Flash-Thinking-2601 | Y | Open | 9.74 |
| MiniMax-M2.7 | Y | Open | 2.48 |
| MiniMax-M2.5 | Y | Open | 2.29 |
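
The scoring rule from the evaluation setup (accuracy averaged across two judges) can be sketched as follows. This is a minimal illustration, not the official evaluation script: the function names `judge_accuracy` and `lohosearch_score` are hypothetical, and the sketch assumes each judge's output has already been reduced to one boolean verdict per question.

```python
from typing import Sequence


def judge_accuracy(verdicts: Sequence[bool]) -> float:
    """Fraction of questions that a single judge marked as correct."""
    if not verdicts:
        raise ValueError("verdicts must not be empty")
    return sum(verdicts) / len(verdicts)


def lohosearch_score(judge_a: Sequence[bool], judge_b: Sequence[bool]) -> float:
    """Mean of the two judges' accuracies, as a percentage.

    Assumption: both judges graded the same questions in the same order.
    """
    if len(judge_a) != len(judge_b):
        raise ValueError("both judges must grade the same set of questions")
    return 100.0 * (judge_accuracy(judge_a) + judge_accuracy(judge_b)) / 2


# Toy verdicts for four questions (illustrative only, not real results).
strict_judge = [True, False, False, False]
lenient_judge = [True, True, False, False]
print(lohosearch_score(strict_judge, lenient_judge))
```

Because both judges grade the same questions, averaging the two accuracies gives the same number as averaging the two verdicts per question first and then taking the mean over questions.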

Construction Pipeline

Pipeline Overview

The benchmark is constructed through four stages:

  1. Knowledge Graph Construction: Built from the full English Wikipedia dump with Wikidata type annotations.
  2. Subgraph Sampling: Tree-structured and graph-structured subgraphs are sampled with constraints on search space size, structural complexity, and answer uniqueness.
  3. QA Generation and Verification: Relations are extracted and obfuscated, then assembled into natural-language questions with automated coverage and satisfaction checks.
  4. Post Filtering and Human Review: Multiple rounds of uniqueness verification, difficulty filtering, and professional human annotation.
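
To make the answer-uniqueness constraint concrete, here is a toy sketch of checking that a set of relation constraints pins down exactly one entity in a knowledge graph. It is not the authors' pipeline: the triple format, the helper names (`build_index`, `candidates`, `has_unique_answer`), and the placeholder entities are all assumptions made for illustration.

```python
from collections import defaultdict

Triple = tuple[str, str, str]        # (subject, relation, object)
Constraint = tuple[str, str]         # (relation, object)


def build_index(triples: list[Triple]) -> dict[Constraint, set[str]]:
    """Map each (relation, object) pair to the subjects that satisfy it."""
    index: dict[Constraint, set[str]] = defaultdict(set)
    for subject, relation, obj in triples:
        index[(relation, obj)].add(subject)
    return index


def candidates(index: dict[Constraint, set[str]],
               constraints: list[Constraint]) -> set[str]:
    """Entities that satisfy every constraint at once."""
    if not constraints:
        return set()
    per_constraint = [index.get(c, set()) for c in constraints]
    return set.intersection(*per_constraint)


def has_unique_answer(index: dict[Constraint, set[str]],
                      constraints: list[Constraint]) -> bool:
    """True when exactly one entity satisfies all constraints."""
    return len(candidates(index, constraints)) == 1


# Placeholder graph with made-up entities.
toy_triples = [
    ("entity_a", "born_in", "city_x"),
    ("entity_b", "born_in", "city_x"),
    ("entity_a", "occupation", "composer"),
    ("entity_b", "occupation", "painter"),
]
toy_index = build_index(toy_triples)

# One constraint alone leaves two candidates; adding a second makes the answer unique.
print(has_unique_answer(toy_index, [("born_in", "city_x")]))
print(has_unique_answer(toy_index, [("born_in", "city_x"), ("occupation", "composer")]))
```

In this toy setting, the size of each per-constraint candidate set plays the role of the search space: constraints that are individually loose but jointly unique are the ones that force a long search.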

Citation

@misc{zhao2026lohosearchbenchmarkinglonghorizonsearch,
      title={LoHoSearch: Benchmarking Long-Horizon Search Agents Beyond the Human Difficulty Ceiling}, 
      author={Jiarui Zhao and Rongzhi Zhang and Lingchuan Liu and Hao Yang and Xunliang Cai and Xi Su},
      year={2026},
      eprint={2606.12837},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2606.12837}, 
}

License

This dataset is released under the MIT License.
