---
library_name: pytorch
---


LongCLIP extends the CLIP vision–language framework to support significantly longer text inputs (up to 248 tokens, versus the 77-token limit of the original CLIP text encoder), enabling richer contextual understanding while preserving strong image–text alignment.

Original paper: [Long-CLIP: Unlocking the Long-Text Capability of CLIP](https://arxiv.org/abs/2403.15378), Zhang et al., 2024

# LongCLIP-B16

This model uses the LongCLIP B/16 variant, which is based on a ViT-Base backbone with 16×16 image patches and enhanced long-text encoding capacity. It is well suited for vision–language applications such as image retrieval, zero-shot classification, and multimodal reasoning where long textual prompts or descriptions are important.
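
For orientation, a minimal zero-shot classification sketch is shown below. It assumes the API of the original Long-CLIP reference repository (`longclip.load`, `longclip.tokenize`) and a locally downloaded `longclip-B.pt` checkpoint; the file paths are illustrative.

```python
# Minimal zero-shot classification sketch. Assumes the original Long-CLIP
# repository is on PYTHONPATH and a longclip-B.pt checkpoint has been
# downloaded; paths below are illustrative.
import torch
from PIL import Image
from model import longclip  # module from the Long-CLIP reference repo

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = longclip.load("./checkpoints/longclip-B.pt", device=device)

# Long, detailed captions are the point of LongCLIP; prompts like these
# would be truncated or poorly modeled by the original CLIP text encoder.
captions = [
    "A man in a red jacket crosses a rainy street at dusk, carrying a "
    "black umbrella while a yellow taxi waits at the traffic light.",
    "An empty beach at sunrise with gentle waves and a single sailboat "
    "visible near the horizon.",
]
text = longclip.tokenize(captions).to(device)
image = preprocess(Image.open("demo.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print("Caption probabilities:", probs)
```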

**Model Configuration:**

- Reference implementation: LongCLIP-B16
- Original weights: LongCLIP-B16
- Input resolution: 3×224×224 (C×H×W; see the preprocessing sketch after this list)
- Supported Cooper versions:
  - Cooper SDK: 2.5.2
  - Cooper Foundry: 2.2
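
The sketch below produces the 3×224×224 input tensor from a raw image. The resize/crop pipeline and the normalization statistics are the standard OpenAI CLIP ones, which this card does not state explicitly; verify them against the reference implementation before deploying to a device.

```python
# Build a 3x224x224 input tensor matching the resolution listed above.
# ASSUMPTION: LongCLIP-B16 keeps CLIP's standard preprocessing (bicubic
# resize, center crop, and the OpenAI CLIP normalization statistics).
import torch
from PIL import Image
from torchvision import transforms

CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)

preprocess = transforms.Compose([
    transforms.Resize(224, interpolation=transforms.InterpolationMode.BICUBIC),
    transforms.CenterCrop(224),
    transforms.ToTensor(),                     # HWC uint8 -> CHW float in [0, 1]
    transforms.Normalize(CLIP_MEAN, CLIP_STD),
])

tensor = preprocess(Image.open("demo.jpg").convert("RGB"))  # (3, 224, 224)
batch = tensor.unsqueeze(0)                                 # (1, 3, 224, 224)
```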
| Model | Device | Model Link |
|-------|--------|------------|
| LongCLIP-B16 Image Encoder | N1-655 | Model_Link |
| LongCLIP-B16 Text Encoder  | N1-655 | Model_Link |
| LongCLIP-B16 Image Encoder | CV72   | Model_Link |
| LongCLIP-B16 Text Encoder  | CV72   | Model_Link |
| LongCLIP-B16 Image Encoder | CV75   | Model_Link |
| LongCLIP-B16 Text Encoder  | CV75   | Model_Link |
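
Because the image and text encoders ship as separate models, the CLIP-style similarity is computed on the host from their raw embedding outputs. The sketch below uses plain PyTorch (it is not a Cooper SDK API, which this card does not document): L2-normalize both embeddings and take a scaled dot product. The 512-dimensional embedding size and the 100.0 logit scale follow the standard CLIP B/16 convention and are assumptions here.

```python
# Host-side fusion of the two deployed encoders' outputs.
# ASSUMPTION: img_emb / txt_emb are the raw embeddings returned by the
# image and text encoder models; 100.0 mirrors CLIP's learned logit scale.
import torch
import torch.nn.functional as F

def rank_texts(img_emb: torch.Tensor, txt_emb: torch.Tensor) -> torch.Tensor:
    """img_emb: (1, D), txt_emb: (N, D) -> probabilities over the N texts."""
    img = F.normalize(img_emb, dim=-1)  # unit-length image embedding
    txt = F.normalize(txt_emb, dim=-1)  # unit-length text embeddings
    logits = 100.0 * img @ txt.T        # scaled cosine similarity
    return logits.softmax(dim=-1)

# Shape check with dummy embeddings (D = 512 for a B/16 projection head):
probs = rank_texts(torch.randn(1, 512), torch.randn(8, 512))
print(probs.shape)  # torch.Size([1, 8])
```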