PointLLM-R: Enhancing 3D Point Cloud Reasoning via Chain-of-Thought
Abstract
A data-centric framework constructs large-scale Chain-of-Thought supervision for 3D point cloud understanding through vision-language model refinement and human-in-the-loop prompt optimization, enabling robust 3D multimodal reasoning capabilities.
Understanding 3D point clouds through language remains a fundamental challenge in computer graphics and visual computing, due to the irregular structure of point cloud data and the lack of explicit reasoning in existing 3D multimodal models. While Chain-of-Thought (CoT) reasoning has shown strong effectiveness in LLMs and image-based MLLMs, its extension to 3D understanding remains largely underexplored. In this paper, we propose a data-centric framework for constructing large-scale CoT supervision tailored to 3D point cloud understanding. Our framework consists of a two-stage pipeline that first refines point-text instruction data via vision-language-model-based quality evaluation and reference-guided refinement, and then synthesizes high-quality reasoning paths through Human-in-the-Loop Prompt Optimization (HiLPO). Using this approach, we build PoCoTI, a CoT-enhanced point-text instruction-following dataset containing 55K samples with explicit reasoning paths. Fine-tuning PointLLM on PoCoTI yields PointLLM-R, a reasoning-capable 3D multimodal language model. Extensive experiments on generative 3D classification and captioning demonstrate that PointLLM-R achieves state-of-the-art performance and generalizes robustly to real-world scanned point clouds and multi-turn dialogue scenarios.
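The two-stage pipeline described above can be pictured with a minimal sketch. Everything in it is an illustrative assumption rather than the authors' implementation: the `Sample` fields, the `refine_stage` and `cot_stage` helpers, and the toy scoring, refinement, and generation functions are hypothetical stand-ins for the vision-language-model evaluator, the reference-guided refiner, and the reasoning-path generator. The prompt is treated as a fixed input, as if HiLPO had already produced it; the human-in-the-loop optimization itself is not modeled.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Sample:
    """Hypothetical point-text instruction sample (field names are assumptions)."""
    object_id: str
    instruction: str
    answer: str
    reasoning: Optional[str] = None  # filled in by stage 2


def refine_stage(
    samples: List[Sample],
    score: Callable[[Sample], float],
    refine: Callable[[Sample], str],
    threshold: float,
) -> List[Sample]:
    """Stage 1: keep answers that score at or above the threshold, rewrite the rest."""
    refined = []
    for s in samples:
        if score(s) >= threshold:
            refined.append(s)
        else:
            refined.append(replace(s, answer=refine(s)))
    return refined


def cot_stage(
    samples: List[Sample],
    prompt: str,
    generate: Callable[[str], str],
) -> List[Sample]:
    """Stage 2: attach a reasoning path generated from a fixed prompt template."""
    return [
        replace(
            s,
            reasoning=generate(
                prompt.format(instruction=s.instruction, answer=s.answer)
            ),
        )
        for s in samples
    ]


# Toy stand-ins so the sketch runs without any model (all hypothetical).
REFERENCES = {"obj-1": "A wooden chair with four legs and a curved backrest."}


def toy_score(s: Sample) -> float:
    # Placeholder quality score: number of words in the answer.
    return float(len(s.answer.split()))


def toy_refine(s: Sample) -> str:
    # Placeholder refinement: fall back to a reference description if one exists.
    return REFERENCES.get(s.object_id, s.answer)


def toy_generate(prompt: str) -> str:
    # Placeholder for a model call that would return a reasoning path.
    return "Reasoning for: " + prompt


PROMPT = "Q: {instruction} A: {answer} Explain step by step."

raw = [
    Sample("obj-1", "What is this?", "chair"),
    Sample("obj-2", "Describe the object.", "A red mug with a single handle."),
]
dataset = cot_stage(
    refine_stage(raw, toy_score, toy_refine, threshold=3.0),
    PROMPT,
    toy_generate,
)
```

Passing the scorer, refiner, and generator as callables keeps the two stages independent of any particular model, so each stand-in can be swapped for a real model call without changing the pipeline structure.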
Datasets citing this paper 2
QileXu/PoCoTI-55K
QileXu/OmniObject3D_brief_description_val_GT