language: en license: mit tags: - multimodal - vision-language - captioning
A model designed to generate textual descriptions from visual inputs.