metadata
library_name: pytorch
X-CLIP extends the CLIP framework from images to videos by incorporating temporal modeling, enabling aligned video–text representations for efficient video understanding and recognition.
Original paper: Expanding Language-Image Pretrained Models for General Video Recognition (X-CLIP)
XCLIP-B32F8
This model uses the X-CLIP Base-Patch32-8Frames variant, which combines a ViT-Base backbone with 32×32 image patches and processes 8 video frames to capture both appearance and motion information. It is well suited for applications such as video classification, video retrieval, video-text matching, and zero-shot video understanding where efficient spatiotemporal reasoning is required.
Model Configuration:
- Reference implementation: Official X-CLIP source code
- Original weights: XCLIP-B32F8
- Resolution: 8x3x224x224 (frames x channels x height x width)
- Supported Cooper versions:
  - Cooper SDK: [2.5.4]
  - Cooper Foundry: [2.3]
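
As a quick sanity check on the configuration above, the sketch below works out how many patches the video encoder sees per clip. It assumes the resolution is laid out as (frames, channels, height, width) and that the 32×32 patches tile each frame without overlap. The helper name `patch_grid` is hypothetical and not part of any X-CLIP or Cooper API.

```python
# Assumed layout of "Resolution: 8x3x224x224": (frames, channels, height, width).
FRAMES, CHANNELS, HEIGHT, WIDTH = 8, 3, 224, 224
PATCH = 32  # patch edge in pixels for the Base-Patch32 variant


def patch_grid(height: int, width: int, patch: int) -> tuple[int, int]:
    """Return (rows, cols) of non-overlapping patches covering one frame."""
    if height % patch or width % patch:
        raise ValueError("frame size must be a multiple of the patch size")
    return height // patch, width // patch


rows, cols = patch_grid(HEIGHT, WIDTH, PATCH)
patches_per_frame = rows * cols                         # 7 * 7 patches
patches_per_clip = FRAMES * patches_per_frame           # patches across all 8 frames
values_per_clip = FRAMES * CHANNELS * HEIGHT * WIDTH    # scalar inputs per clip

print(rows, cols, patches_per_frame, patches_per_clip, values_per_clip)
```

Class tokens and any cross-frame tokens added by the model are not counted here; the figures cover only the image patches.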
| Model | Device | Compression | Model Link |
|---|---|---|---|
| XCLIP-B32F8 Video encoder | N1-655 | Activation_fp16 | Model_Link |
| XCLIP-B32F8 Text encoder | N1-655 | Activation_fp16 | Model_Link |
| XCLIP-B32F8 Post Predictor | N1-655 | Activation_fp16 | Model_Link |
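
The table lists three compiled components but does not describe their interfaces. The sketch below illustrates only the generic CLIP-style scoring step that such a pipeline typically ends with: cosine similarity between one video embedding and several text embeddings, followed by a softmax. It is not the actual Post Predictor. The function names, the toy embeddings, and the `logit_scale` value of 100 are assumptions made for illustration.

```python
import math


def l2_normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def zero_shot_scores(video_emb, text_embs, logit_scale=100.0):
    """Return softmax probabilities over the candidate text prompts.

    logit_scale is an assumed temperature, not a value taken from this model.
    """
    v = l2_normalize(video_emb)
    logits = [
        logit_scale * sum(a * b for a, b in zip(v, l2_normalize(t)))
        for t in text_embs
    ]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy 3-dimensional embeddings standing in for real encoder outputs.
video = [0.9, 0.1, 0.0]
prompts = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
probs = zero_shot_scores(video, prompts)
best = probs.index(max(probs))
```

In this toy case the first prompt scores highest because its embedding points in nearly the same direction as the video embedding.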
