arxiv:2606.25763

ShutterMuse: Capture-Time Photography Guidance with MLLMs

Published on Jun 24

Submitted by Yixiao Fang on Jun 25

#3 Paper of the day

Authors: Jiayu Li, Yixiao Fang, Wei Cheng

Abstract

Researchers developed a new benchmark and dataset for photography assistance, along with a unified multimodal model that provides both composition guidance and pose recommendations during image capture.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Real-world photography requires capture-time guidance for both camera framing and subject pose. Yet existing aesthetic cropping benchmarks mainly evaluate post-hoc crop prediction and overlook subject-side recommendations, leaving the capture-time guidance capabilities of multimodal large language models (MLLMs) underexplored. To address this gap, we introduce CaptureGuide-Bench, a benchmark with two complementary tasks: photographer-side composition decision and refinement, and subject-side scene-conditioned pose recommendation. Our evaluation reveals limitations: general-purpose MLLMs can make composition decisions but lack precise refinement localization, while specialized aesthetic cropping models localize crops effectively but are limited to refinement; neither provides actionable pose guidance. To support model development, we further construct CaptureGuide-Dataset, comprising 130K samples with textual rationales and structured visual annotations, and develop ShutterMuse, a unified MLLM trained with supervised and reinforcement fine-tuning. Experiments on CaptureGuide-Bench show that ShutterMuse achieves the best overall photographer-side performance among evaluated baselines and competitive subject-side pose recommendation with substantially lower inference cost, demonstrating the potential of MLLMs as interactive assistants for photography during image capture.

Community

fangyixiao (paper author, paper submitter), about 4 hours ago

ShutterMuse is a unified multimodal large language model for capture-time photography guidance. It supports:

Photographer-side guidance: keep, refine, or reject the current framing, with a composition box when refinement is needed.
Subject-side guidance: recommend scene-conditioned portrait poses with COCO-17 keypoints and visibility states.
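
To make the two output types above concrete, here is a minimal Python sketch of how a client might represent them. It is not ShutterMuse's actual output schema, which this page does not specify. The class names, the normalized (x1, y1, x2, y2) box layout, and the use of the standard COCO visibility flags (0 = not labeled, 1 = labeled but occluded, 2 = visible) are all assumptions made for illustration; only the keep/refine/reject decision and the COCO-17 keypoint set come from the description above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Tuple

# Standard COCO-17 keypoint order.
COCO17_KEYPOINTS = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]


class Decision(Enum):
    """Photographer-side decision on the current framing."""
    KEEP = "keep"
    REFINE = "refine"
    REJECT = "reject"


@dataclass
class CompositionGuidance:
    """Hypothetical container: a decision plus a box when refining."""
    decision: Decision
    # Assumed layout: (x1, y1, x2, y2), normalized to [0, 1].
    box: Optional[Tuple[float, float, float, float]] = None

    def __post_init__(self) -> None:
        if self.decision is Decision.REFINE:
            if self.box is None:
                raise ValueError("refine requires a composition box")
            x1, y1, x2, y2 = self.box
            if not (0.0 <= x1 < x2 <= 1.0 and 0.0 <= y1 < y2 <= 1.0):
                raise ValueError("box must be normalized with x1 < x2, y1 < y2")
        elif self.box is not None:
            raise ValueError("a box is only expected for refine")


@dataclass
class Keypoint:
    x: float
    y: float
    visibility: int  # assumed COCO convention: 0, 1, or 2


@dataclass
class PoseGuidance:
    """Hypothetical container: one recommended pose as 17 keypoints."""
    keypoints: List[Keypoint]

    def __post_init__(self) -> None:
        if len(self.keypoints) != len(COCO17_KEYPOINTS):
            raise ValueError("expected 17 keypoints")
        if any(kp.visibility not in (0, 1, 2) for kp in self.keypoints):
            raise ValueError("visibility must be 0, 1, or 2")

    def visible_names(self) -> List[str]:
        """Names of keypoints flagged as visible (visibility == 2)."""
        return [
            name
            for name, kp in zip(COCO17_KEYPOINTS, self.keypoints)
            if kp.visibility == 2
        ]
```

For example, `CompositionGuidance(Decision.REFINE, (0.1, 0.2, 0.9, 0.8))` would carry a suggested crop, while `CompositionGuidance(Decision.KEEP)` carries none. Validating in `__post_init__` keeps malformed model output from propagating into a camera UI.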