metadata
library_name: transformers
pipeline_tag: image-text-to-text
Osprey: Pixel Understanding with Visual Instruction Tuning
Osprey: Pixel Understanding with Visual Instruction Tuning
Osprey is a mask-text instruction tuning approach that extends MLLMs by incorporating pixel-wise mask regions into language instructions, enabling fine-grained visual understanding. Based on input mask region, Osprey generates semantic descriptions including short description and detailed description.
Our Osprey can seamlessly integrate with SAM in point-prompt, box-prompt and segmentation everything modes to generate the semantics associated with specific parts or objects.