---
license: apache-2.0
pipeline_tag: image-to-text
library_name: transformers
---

# 🌷 TULIP: Token-length Upgraded CLIP

TULIP (Token-length Upgraded CLIP) addresses the challenge of representing long captions in vision-language models. It enhances CLIP-like models by incorporating relative position encodings, enabling effective processing of captions longer than the default 77 tokens.

**"TULIP: Token-length Upgraded CLIP"** (accepted at ICLR 2025)
Ivona Najdenkoska\*, Mohammad M. Derakhshani\*, Yuki M. Asano, Nanne van Noord, Marcel Worring, Cees G. M. Snoek
\* Equal core contributions

**Code:** https://github.com/ivonajdenkoska/tulip

## Highlights

  • Improves performance on long caption understanding tasks.
  • Uses relative positional encodings to handle long image captions.
  • Works with CLIP-like models.
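To illustrate why relative position encodings lift the fixed 77-token limit, here is a minimal, self-contained sketch of a clipped relative position bias (in the style of T5/Swin-type biases). This is an assumed toy parameterization for illustration only, not TULIP's actual encoding: the key property it shows is that one shared parameter table covers captions of any length, whereas learned absolute position embeddings are tied to a fixed maximum length.

```python
import numpy as np

def relative_position_bias(seq_len: int, max_distance: int = 16) -> np.ndarray:
    """Toy relative position bias: one parameter per clipped relative
    distance, reused for every sequence length.

    Illustrative sketch only -- not TULIP's actual parameterization.
    """
    # One bias parameter per possible (clipped) relative distance.
    rng = np.random.default_rng(0)
    bias_table = rng.normal(size=2 * max_distance + 1)

    pos = np.arange(seq_len)
    # rel[i, j] = j - i, clipped to [-max_distance, max_distance]
    rel = np.clip(pos[None, :] - pos[:, None], -max_distance, max_distance)
    # Look up the bias for each token pair -> (seq_len, seq_len) matrix,
    # which would be added to the attention logits.
    return bias_table[rel + max_distance]

# The same parameter table covers a 77-token and a 248-token caption alike:
short_bias = relative_position_bias(77)
long_bias = relative_position_bias(248)
```

Because the bias depends only on the offset `j - i`, extending the caption length requires no new position parameters, which is the property that makes processing captions beyond 77 tokens possible.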

## How to use

Please refer to the [original repository](https://github.com/ivonajdenkoska/tulip) for detailed instructions on using and training the model.