metadata
library_name: pytorch


X-CLIP extends the CLIP framework from images to videos by incorporating temporal modeling, enabling aligned video–text representations for efficient video understanding and recognition.
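
Because video and text are embedded in one aligned space, zero-shot recognition can be treated as a nearest-label lookup. The following is a minimal sketch of that idea, not this repository's inference code: the function names, the toy 3-dimensional embeddings, and the `scale` value of 100 are all assumptions made for illustration. Real embeddings would come from the video and text encoders.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def zero_shot_scores(video_emb, text_embs, scale=100.0):
    """Softmax over scaled cosine similarities between one video and N label prompts.

    `scale` is an assumed temperature, not a value taken from this model.
    """
    v = l2_normalize(video_emb)
    logits = [
        scale * sum(a * b for a, b in zip(v, l2_normalize(t)))
        for t in text_embs
    ]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy stand-ins for encoder outputs (hypothetical values).
labels = ["playing guitar", "riding a bike", "cooking"]
video = [0.9, 0.1, 0.0]
texts = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

probs = zero_shot_scores(video, texts)
best = labels[max(range(len(probs)), key=probs.__getitem__)]
print(best)
```

Adding a new class under this scheme only requires embedding one more text prompt; the video embedding is reused unchanged.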

Original paper: Expanding Language-Image Pretrained Models for General Video Recognition (X-CLIP)

XCLIP-B32F8

This model uses the X-CLIP Base-Patch32-8Frames variant, which combines a ViT-Base backbone with 32×32 image patches and processes 8 video frames to capture both appearance and motion information. It is well suited for applications such as video classification, video retrieval, video-text matching, and zero-shot video understanding where efficient spatiotemporal reasoning is required.
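
To make the "32×32 patches, 8 frames" description concrete, here is a rough sketch of the input bookkeeping. It assumes a 224×224 input resolution and evenly spaced frame sampling; neither is stated in this README, and the helper names are hypothetical.

```python
def sample_frame_indices(total_frames: int, num_frames: int = 8) -> list[int]:
    """Pick `num_frames` indices spread evenly over a clip (centre of each segment)."""
    if total_frames < 1 or num_frames < 1:
        raise ValueError("total_frames and num_frames must be positive")
    segment = total_frames / num_frames
    # Clips shorter than num_frames end up repeating some frames.
    return [min(total_frames - 1, int(segment * (i + 0.5))) for i in range(num_frames)]


def patch_tokens_per_frame(image_size: int = 224, patch_size: int = 32) -> int:
    """Number of non-overlapping square patches in one frame."""
    if image_size % patch_size:
        raise ValueError("image_size must be a multiple of patch_size")
    side = image_size // patch_size
    return side * side


indices = sample_frame_indices(80)   # an 80-frame clip
tokens = patch_tokens_per_frame()    # assumed 224x224 input
print(indices)                       # [5, 15, 25, 35, 45, 55, 65, 75]
print(tokens, tokens * len(indices)) # 49 392
```

Under these assumptions each frame contributes 49 patch tokens, so one 8-frame clip yields 392 patch tokens before any class or temporal tokens are added.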

Model Configuration:

| Model | Device | Compression | Model Link |
| --- | --- | --- | --- |
| XCLIP-B32F8 Video encoder | N1-655 | Activation_fp16 | Model_Link |
| XCLIP-B32F8 Text encoder | N1-655 | Activation_fp16 | Model_Link |
| XCLIP-B32F8 Post Predictor | N1-655 | Activation_fp16 | Model_Link |