Name	Size	Uploaded	Xet hash
assets		3 days ago	2 items
.gitattributes	2.5 kB xet	3 days ago	738f1125
LoHoSearch.csv	837 kB xet	3 days ago	263c3135
README.md	5.04 kB xet	3 days ago	7db90e37
decrypt.py	2.29 kB xet	3 days ago	d305d425
train.csv	3.5 MB xet	3 days ago	4a9322db

README.md

LoHoSearch: Benchmarking Long-Horizon Search Agents Beyond the Human Difficulty Ceiling

📃 Paper • 🏆 Benchmark • 📦 Training Data

Abstract

Search agent benchmarks exemplified by BrowseComp have rapidly saturated over the past year, with the strongest models surpassing 90% accuracy. Since these benchmarks are predominantly human-authored, annotators lack a global perspective on entity statistics and cannot systematically maximize search space size and structural complexity. This creates a difficulty ceiling that is hard to break. To address this, we introduce LoHoSearch (Long-Horizon Search Agents), a challenging benchmark comprising 544 human-verified questions across 11 domains. LoHoSearch is constructed via an automated pipeline built upon a knowledge graph covering over 7 million Wikipedia entities, which selects relations with large search spaces and assembles them into structurally complex questions with KG-verified unique answers. Our evaluation demonstrates that even the strongest model achieves only 34.74% accuracy, and existing context management strategies (best +6.8%) yield far smaller gains than on prior benchmarks. LoHoSearch provides a more demanding standard for evaluating long-horizon reasoning and context management in search agents. For more details, see our paper.

Dataset

This repository contains two subsets:

Config	File	Split	Records	Description	Language
`benchmark`	`LoHoSearch.csv`	test	544	Human-verified evaluation benchmark	English
`train`	`train.csv`	train	2000	Training set generated by the same automated pipeline, without human verification	English
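As a quick sanity check after downloading, the record counts in the table above can be compared against the files themselves. The sketch below is a minimal, non-authoritative example using only the Python standard library. It assumes both CSV files have a header row and sit in the current working directory, and it deliberately makes no assumption about column names, since the schema is not documented here.

```python
import csv
from pathlib import Path


def summarize_csv(path):
    """Return (column_names, record_count) for a CSV file with a header row.

    csv.reader is used instead of counting raw lines so that quoted fields
    containing newlines are counted as a single record.
    """
    with Path(path).open(newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader, [])
        # Skip completely empty rows, which csv.reader yields as [].
        n_records = sum(1 for row in reader if row)
    return header, n_records


if __name__ == "__main__":
    # File names are taken from the table above; their location on disk is an assumption.
    for name in ("LoHoSearch.csv", "train.csv"):
        if Path(name).exists():
            columns, n = summarize_csv(name)
            print(f"{name}: {n} records, columns = {columns}")
```

If the counts differ from 544 and 2000, the file may be incomplete or may not follow the header-row assumption made here.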

Domain Distribution

[Figure: distribution of questions across the 11 domains.]

Main Results

Evaluation setup. Each model is equipped with two tools: search (keyword queries to a traditional search engine) and browse (fetching the content of given URLs). All models use the same system prompt as BrowseComp. We set the temperature to 1.0, keep each model's default thinking settings, and use a 200K context window (184K input + 16K output). The reported score is the percentage of the 544 questions answered correctly, averaged over two LLM-judge gradings: the BrowseComp grading prompt with GPT-4.1 as judge, and the SimpleQA grading prompt with Qwen2.5-32B as judge. Averaging two complementary judges reduces the risk that a single setup is overly strict or overly lenient.
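The scoring rule described above can be sketched as follows. This is an illustrative sketch, not the official evaluation script: the function names and the toy verdict lists are hypothetical, and each judge's output is assumed to be reduced to one boolean (correct or incorrect) per question.

```python
# Context budget stated in the evaluation setup, in thousands of tokens.
INPUT_BUDGET_K = 184
OUTPUT_BUDGET_K = 16
CONTEXT_WINDOW_K = INPUT_BUDGET_K + OUTPUT_BUDGET_K  # 200


def accuracy(verdicts):
    """Percentage of questions a single judge marked as correct."""
    if not verdicts:
        raise ValueError("verdicts must not be empty")
    return 100.0 * sum(bool(v) for v in verdicts) / len(verdicts)


def averaged_score(judge_a, judge_b):
    """Mean of the two judges' accuracies over the same set of questions."""
    if len(judge_a) != len(judge_b):
        raise ValueError("both judges must grade the same questions")
    return (accuracy(judge_a) + accuracy(judge_b)) / 2


if __name__ == "__main__":
    # Toy verdicts for four questions (hypothetical data, not benchmark results).
    browsecomp_prompt_judge = [True, False, True, False]   # 50.0%
    simpleqa_prompt_judge = [True, False, False, False]    # 25.0%
    print(averaged_score(browsecomp_prompt_judge, simpleqa_prompt_judge))  # 37.5
```

Because both judges grade the same questions, averaging the two accuracies gives the same number as averaging the two verdicts per question and then taking the mean over questions.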

Model	Reasoning	Source	LoHoSearch Score (%)
GPT-5.5	N	Closed	34.74
DeepSeek-V4-Pro	Y	Open	15.99
Claude-Opus-4.6	N	Closed	15.62
Kimi-K2.6	Y	Open	15.53
Gemini-3.1-Pro	Y	Closed	13.32
GLM-5.1	Y	Open	12.77
Claude-Opus-4.7	N	Closed	10.29
DeepSeek-V4-Flash	Y	Open	10.02
LongCat-Flash-Thinking-2601	Y	Open	9.74
MiniMax-M2.7	Y	Open	2.48
MiniMax-M2.5	Y	Open	2.29

Construction Pipeline

[Figure: Pipeline Overview]

The benchmark is constructed through four stages:

1. Knowledge Graph Construction: Built from the full English Wikipedia dump with Wikidata type annotations.
2. Subgraph Sampling: Tree-structured and graph-structured subgraphs are sampled with constraints on search space size, structural complexity, and answer uniqueness.
3. QA Generation and Verification: Relations are extracted and obfuscated, then assembled into natural-language questions with automated coverage and satisfaction checks.
4. Post Filtering and Human Review: Multiple rounds of uniqueness verification, difficulty filtering, and professional human annotation.
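To make the idea of a KG-verified unique answer concrete, here is a toy sketch of an answer uniqueness check over a tiny set of triples. It is not the authors' pipeline: the triple format, the function names, and the entities are all hypothetical, and the real system operates on a knowledge graph of over 7 million Wikipedia entities. The sketch only illustrates that a question, viewed as a set of relation constraints, is acceptable when exactly one entity satisfies all of them.

```python
from collections import defaultdict


def candidates(triples, constraints):
    """Return the entities that satisfy every (relation, value) constraint.

    triples: iterable of (head, relation, tail) tuples (hypothetical KG format).
    constraints: iterable of (relation, tail) pairs describing the answer.
    """
    index = defaultdict(set)
    for head, relation, tail in triples:
        index[(relation, tail)].add(head)

    result = None
    for constraint in constraints:
        matches = index.get(constraint, set())
        # Intersect the running candidate set with this constraint's matches.
        result = set(matches) if result is None else result & matches
    return result or set()


def has_unique_answer(triples, constraints):
    """True when exactly one entity satisfies all constraints."""
    return len(candidates(triples, constraints)) == 1


if __name__ == "__main__":
    # Made-up entities and relations, for illustration only.
    kg = [
        ("E1", "born_in", "CityA"),
        ("E2", "born_in", "CityA"),
        ("E1", "occupation", "Chemist"),
        ("E3", "occupation", "Chemist"),
    ]
    # Each constraint alone matches two entities; together they match only E1.
    print(candidates(kg, [("born_in", "CityA")]))
    print(has_unique_answer(kg, [("born_in", "CityA"), ("occupation", "Chemist")]))
```

In this toy setting, the number of entities matching a single constraint plays the role of the search space size, while the intersection of all constraints is what must shrink to exactly one entity.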

Citation

@misc{zhao2026lohosearchbenchmarkinglonghorizonsearch,
      title={LoHoSearch: Benchmarking Long-Horizon Search Agents Beyond the Human Difficulty Ceiling}, 
      author={Jiarui Zhao and Rongzhi Zhang and Lingchuan Liu and Hao Yang and Xunliang Cai and Xi Su},
      year={2026},
      eprint={2606.12837},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2606.12837}, 
}

License

This dataset is released under the MIT License.

Last updated: Jun 15