---
license: apache-2.0
pipeline_tag: image-to-text
library_name: transformers
---
# 🌷 TULIP: Token-length Upgraded CLIP

TULIP (Token-length Upgraded CLIP) addresses the challenge of representing long captions in vision-language models. It upgrades CLIP-like models with relative position encodings, enabling effective processing of captions longer than CLIP's default 77-token limit.
"TULIP: Token-length Upgraded CLIP" (accepted to ICLR 2025)
Ivona Najdenkoska٭, Mohammad M. Derakshani٭, Yuki M. Asano, Nanne van Noord, Marcel Worring, Cees G. M. Snoek
٭ Equal core contributions
Code: https://github.com/ivonajdenkoska/tulip
## Highlights
- Improves performance on long caption understanding tasks.
- Uses relative positional encodings to handle long image captions.
- Works with CLIP-like models.
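To illustrate why relative position encodings generalize beyond a fixed caption length, the sketch below applies a rotary-style relative encoding to a toy sequence. This is an illustrative assumption, not TULIP's exact formulation (see the paper for that): with such encodings, attention scores depend only on the *offset* between two tokens, not their absolute positions, so nothing special happens at token 77.

```python
import numpy as np

def rotary_encode(x, base=10000.0):
    """Apply a rotary-style relative positional encoding.

    x: array of shape (seq_len, dim), with dim even.
    Each pair of channels is rotated by a position-dependent angle,
    so dot products between encoded vectors depend only on the
    relative offset between positions.
    """
    seq_len, dim = x.shape
    half = dim // 2
    # One rotation frequency per channel pair.
    freqs = base ** (-np.arange(half) / half)           # (half,)
    angles = np.outer(np.arange(seq_len), freqs)        # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # 2-D rotation of each (x1, x2) channel pair.
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

# Key property: the score between positions 0 and 5 equals the score
# between positions 10 and 15 (same offset), far past any fixed window.
q = rotary_encode(np.ones((100, 8)))
k = rotary_encode(np.ones((100, 8)))
print(np.allclose(q[0] @ k[5], q[10] @ k[15]))  # True
```

Because the encoding is a function of relative offset, the same weights can score token pairs at any distance, which is what lets a CLIP-like text encoder handle captions longer than its training-time position table would otherwise allow.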
## How to use

Please refer to the [original repository](https://github.com/ivonajdenkoska/tulip) for detailed instructions on using and training the model.