arxiv:2605.14068

CurveBench: A Benchmark for Exact Topological Reasoning over Nested Jordan Curves

Published on May 13

AI-generated summary

CurveBench is a benchmark for hierarchical topological reasoning from visual input; its results show that exact topology-aware visual reasoning remains difficult even for advanced models.

Abstract

We introduce CurveBench, a benchmark for hierarchical topological reasoning from visual input. CurveBench consists of 756 images of pairwise non-intersecting Jordan curves across easy, polygonal, topographic-inspired, maze-like, and dense counting configurations. Each image is annotated with a rooted tree encoding the containment relations between planar regions. We formulate the task as structured prediction: given an image, a model must recover the full rooted containment tree induced by the curves. Despite the visual simplicity of the task, the strongest evaluated model, Gemini 3.1 Pro, achieves only 71.1% tree-generation accuracy on CurveBench-Easy and 19.1% on CurveBench-Hard. We further demonstrate benchmark utility through RLVR-style fine-tuning of open-weight vision-language models. Our trained Qwen3-VL-8B model improves over Qwen3-VL-8B-Thinking from 2.8% to 33.3% tree-generation accuracy on CurveBench-Easy, exceeding GPT-5.4 and Claude Opus 4.5 under our evaluation protocol. The remaining gap, especially on CurveBench-Hard, shows that exact topology-aware visual reasoning remains far from solved.
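To make the notion of a rooted containment tree concrete, here is a minimal sketch of how such a tree could be derived when the curves are available as polygons. It is an illustration only, not the authors' ground-truth pipeline: the polygon representation, the function names (`signed_area`, `contains_point`, `containment_parents`), and the convention that `None` stands for the unbounded outer region (the root) are all assumptions made for this example.

```python
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float]
Polygon = List[Point]  # assumed: a closed curve approximated by its vertices


def signed_area(poly: Polygon) -> float:
    """Shoelace formula; the sign depends on vertex orientation."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(poly, poly[1:] + poly[:1]):
        total += x1 * y2 - x2 * y1
    return total / 2.0


def contains_point(poly: Polygon, p: Point) -> bool:
    """Ray casting: count crossings of a rightward horizontal ray from p."""
    x, y = p
    inside = False
    for (x1, y1), (x2, y2) in zip(poly, poly[1:] + poly[:1]):
        if (y1 > y) != (y2 > y):  # edge straddles the ray's height
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def containment_parents(curves: List[Polygon]) -> Dict[int, Optional[int]]:
    """Map each curve to its tightest enclosing curve (None = outer region).

    Because the curves are pairwise non-intersecting, one vertex of a curve
    is enough to decide whether the whole curve lies inside another.
    """
    areas = [abs(signed_area(c)) for c in curves]
    parents: Dict[int, Optional[int]] = {}
    for i, curve in enumerate(curves):
        probe = curve[0]
        enclosing = [
            j for j, other in enumerate(curves)
            if j != i and areas[j] > areas[i] and contains_point(other, probe)
        ]
        # The direct parent is the smallest curve that still encloses curve i.
        parents[i] = min(enclosing, key=lambda j: areas[j]) if enclosing else None
    return parents


def square(cx: float, cy: float, half: float) -> Polygon:
    """Hypothetical helper: axis-aligned square used as a toy curve."""
    return [(cx - half, cy - half), (cx + half, cy - half),
            (cx + half, cy + half), (cx - half, cy + half)]


curves = [square(0, 0, 10), square(0, 0, 5), square(0, 0, 1),
          square(7.5, 7.5, 1), square(30, 0, 2)]
parents = containment_parents(curves)
print(parents)  # {0: None, 1: 0, 2: 1, 3: 0, 4: None}
```

In this toy scene, curves 0 and 4 sit directly in the outer region, curves 1 and 3 are siblings inside curve 0, and curve 2 is nested inside curve 1. A model solving the benchmark task would have to produce the equivalent tree from pixels alone, without access to the vertex lists.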

Community

AmirMohseni

Paper submitter about 23 hours ago

We introduce CurveBench, a benchmark for testing whether vision-language models can recover hierarchical region-containment trees from images of non-intersecting Jordan curves. The task targets visual topology and structured reasoning beyond simple object recognition, counting, or OCR.
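As an illustration of what exact recovery of a containment tree can mean, the sketch below compares a predicted tree with a reference tree by reducing both to a canonical string, so that sibling order does not matter. This is a hypothetical scoring scheme, not the benchmark's evaluation code: the nested-list tree format, the assumption that siblings are unordered, and the names `canonical`, `exact_match`, and `accuracy` are all introduced here for the example.

```python
from typing import List, Sequence

Tree = List["Tree"]  # assumed format: a node is the list of its child subtrees


def canonical(tree: Tree) -> str:
    """AHU-style encoding: sorting child encodings ignores sibling order."""
    return "(" + "".join(sorted(canonical(child) for child in tree)) + ")"


def exact_match(predicted: Tree, gold: Tree) -> bool:
    """True only if the two rooted trees are isomorphic."""
    return canonical(predicted) == canonical(gold)


def accuracy(predictions: Sequence[Tree], golds: Sequence[Tree]) -> float:
    """Fraction of examples whose predicted tree matches exactly."""
    hits = sum(exact_match(p, g) for p, g in zip(predictions, golds))
    return hits / len(golds)


# Outer region holding two curves, one of which encloses a third curve.
gold = [[[]], []]
reordered = [[], [[]]]   # same tree, siblings listed in a different order
chain = [[[[]]]]         # three curves nested in a single chain

print(exact_match(reordered, gold))               # True
print(exact_match(chain, gold))                   # False
print(accuracy([reordered, chain], [gold, gold])) # 0.5
```

An all-or-nothing check of this kind gives no partial credit: a single misplaced curve changes the canonical string and the whole prediction counts as wrong. A binary signal like this is also the sort of thing a verifiable reward could be built on, though the reward actually used for fine-tuning is not described in this text.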

The Hugging Face collection includes the paper, the CurveBench and CurveBench-Easy datasets, evaluation code, ground-truth generation resources, and fine-tuning artifacts. Our results show that even strong frontier VLMs struggle substantially on the harder settings, while fine-tuned open models improve but remain far from solving the task.
