Abstract
A unified text-visual image editing framework is presented that combines semantic intent from textual instructions with spatial guidance from visual prompts to achieve more precise and faithful image manipulation.
Existing image editing methods can be generally categorized into textual instruction-based and visual prompt-based ones. Textual instructions are semantically expressive, but are limited by the coarse granularity of spatial control of the editing results. In contrast, visual prompts such as drag and point can provide precise spatial guidance, but are limited by the inherent ambiguity in semantic intent. To unify the strength of textual and visual prompts, we present Text-Vision Co-Instructed Image Editing, which jointly models textual instructions as semantic intent and sparse visual instructions as spatial guidance, aiming to achieve precise and intent-faithful image manipulation. To this end, we first construct a textual-visual instruction paired dataset with more than 23K samples derived from dynamic videos, enabling aligned supervision for cross-modal instruction. We then propose TV-Edit, a Textual-Visual instruction unified Editing framework to contextualize drag or point-based visual instructions with image-text semantics and lift them into semantic-aware control representations for pretrained editing backbones. By integrating semantic intent and spatial constraints, TV-Edit leads to more precise spatial control, less instruction ambiguity, and stronger structural consistency than text-only or drag-based alternatives. Finally, we establish TV-Edit-Bench, a deliberately designed benchmark to evaluate semantic faithfulness, spatial alignment, and visual consistency with ground-truth references and controlled textual-visual variations for reliable assessment. Our experiments across multiple editing backbones demonstrate that TV-Edit consistently yields more precise and intent-faithful edits, significantly outperforming state-of-the-art instruction-based and drag-based baselines.
Community
Existing image editing methods can be generally categorized into textual instruction-based and visual prompt-based ones. Textual instructions are semantically expressive, but are limited by the coarse granularity of spatial control of the editing results. In contrast, visual prompts such as drag and point can provide precise spatial guidance, but are limited by the inherent ambiguity in semantic intent. To unify the strength of textual and visual prompts, we present Text-Vision Co-Instructed Image Editing, which jointly models textual instructions as semantic intent and sparse visual instructions as spatial guidance, aiming to achieve precise and intent-faithful image manipulation. To this end, we first construct a textual-visual instruction paired dataset with more than 23K samples derived from dynamic videos, enabling aligned supervision for cross-modal instruction. We then propose TV-Edit, a Textual-Visual instruction unified Editing framework to contextualize drag or point-based visual instructions with image-text semantics and lift them into semantic-aware control representations for pretrained editing backbones. By integrating semantic intent and spatial constraints, TV-Edit leads to more precise spatial control, less instruction ambiguity, and stronger structural consistency than text-only or drag-based alternatives. Finally, we establish TV-Edit-Bench, a deliberately designed benchmark to evaluate semantic faithfulness, spatial alignment, and visual consistency with ground-truth references and controlled textual-visual variations for reliable assessment. Our experiments across multiple editing backbones demonstrate that TV-Edit consistently yields more precise and intent-faithful edits, significantly outperforming state-of-the-art instruction-based and drag-based baselines.
Neat paper. It makes a lot of sense to combine textual instructions with visual prompts like points or drags, since text is great for intent but often lacks the specific spatial control needed for precise editing. Seeing a framework that explicitly bridges that gap is pretty interesting.
I'm curious, how does the model handle cases where the text and the visual prompt might be slightly contradictory or ambiguous?
I made a podcast on it with ResearchPod, it makes it easy to get the key concepts on the go:
https://researchpod.app/episode/02cf4e30-a159-4ca7-85dc-5238baf81475
Thank you for your interest in our work — this is an interesting question.
In our setting, text mainly provides the semantic intent, while the visual prompt provides more explicit spatial guidance. For example, when asking a model to “turn a dog’s head to the left,” the meaning of “left” can be ambiguous: it may refer to the dog’s own left or the image’s left. Text-only foundation models often focus more on performing the semantic action, such as “turning,” but are less precise about the spatial direction.
Since TV-Edit is trained with visual trajectories as supervision, it tends to rely on the visual prompt to resolve such spatial ambiguity. Therefore, if the text says “turn left” but the drag trajectory indicates a rightward movement, the model will usually follow the visual prompt for the direction while still preserving the semantic action specified by the text. In this way, the two prompts play complementary roles: text specifies what to edit, and the visual prompt specifies how/where to move it spatially.
Get this paper in your agent:
hf papers read 2606.16767 Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash Models citing this paper 0
No model linking this paper
Datasets citing this paper 0
No dataset linking this paper
Spaces citing this paper 0
No Space linking this paper
Collections including this paper 0
No Collection including this paper