---
library_name: pytorch
---

X-CLIP extends the CLIP framework from images to videos by incorporating temporal modeling, enabling aligned video–text representations for efficient video understanding and recognition.

Original paper: [Expanding Language-Image Pretrained Models for General Video Recognition (X-CLIP)](https://arxiv.org/abs/2208.02816)

# XCLIP-B32F8

This model uses the **X-CLIP Base-Patch32-8Frames** variant, which combines a ViT-Base backbone with 32×32 image patches and processes 8 video frames to capture both appearance and motion information. It is well suited for applications such as video classification, video retrieval, video-text matching, and zero-shot video understanding where efficient spatiotemporal reasoning is required.

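Video-text matching and zero-shot classification in CLIP-style models come down to comparing one video embedding against a set of text embeddings by cosine similarity. The sketch below illustrates only that scoring step, using NumPy and random stand-in embeddings. The function name `rank_labels`, the 512-dimensional embedding size, and the logit scale of 100 are assumptions for illustration and are not taken from this model card; X-CLIP's video-conditioned prompt refinement is not reproduced here.

```python
import numpy as np

def rank_labels(video_emb, text_embs, logit_scale=100.0):
    """Return softmax probabilities over candidate texts for one video embedding.

    video_emb: shape (D,), text_embs: shape (N, D). logit_scale is an assumed value.
    """
    # L2-normalize so the dot product equals cosine similarity.
    v = video_emb / np.linalg.norm(video_emb)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = logit_scale * (t @ v)
    # Subtract the max before exponentiating for numerical stability.
    logits = logits - logits.max()
    probs = np.exp(logits)
    return probs / probs.sum()

# Stand-in embeddings: three candidate captions and a video close to caption 1.
rng = np.random.default_rng(0)
text_embs = rng.normal(size=(3, 512))
video_emb = text_embs[1] + 0.1 * rng.normal(size=512)

probs = rank_labels(video_emb, text_embs)
best = int(np.argmax(probs))  # index of the best-matching caption
```

In a real pipeline the stand-in arrays would be replaced by the outputs of the video and text encoders.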
Model Configuration:
- Reference implementation: [Official X-CLIP source code](https://github.com/microsoft/VideoX/tree/master/X-CLIP)
- Original weights: [XCLIP-B32F8](https://huggingface.co/microsoft/xclip-base-patch32/blob/main/model.safetensors)
- Input resolution: 8x3x224x224 (frames x channels x height x width)
- Supported Cooper versions:
  - Cooper SDK: [2.5.4]
  - Cooper Foundry: [2.3]

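
The 8x3x224x224 resolution listed above means each clip is 8 RGB frames of 224x224 pixels. As a rough sketch of how such a clip could be assembled, the snippet below samples 8 evenly spaced frames from a longer video and rearranges them into that layout. Uniform sampling, the helper names, and scaling pixels to [0, 1] are assumptions; this card does not state the preprocessing expected by the compiled binaries, and resizing and mean/std normalization are left out.

```python
import numpy as np

NUM_FRAMES, CHANNELS, HEIGHT, WIDTH = 8, 3, 224, 224

def sample_frame_indices(total_frames, num_frames=NUM_FRAMES):
    """Pick num_frames indices spread evenly across the clip (assumed strategy)."""
    return np.linspace(0, total_frames - 1, num=num_frames).round().astype(int)

def to_model_layout(frames_thwc):
    """Convert (T, H, W, C) uint8 frames to (T, C, H, W) float32 in [0, 1]."""
    clip = frames_thwc.astype(np.float32) / 255.0
    return np.transpose(clip, (0, 3, 1, 2))

# Stand-in for a decoded video: 120 frames already resized to 224x224 RGB.
video = np.zeros((120, HEIGHT, WIDTH, CHANNELS), dtype=np.uint8)

idx = sample_frame_indices(len(video))
clip = to_model_layout(video[idx])  # shape (8, 3, 224, 224)
```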

| Model | Device | Compression | Model Link |
| :-----: | :-----: | :-----: | ------- |
| XCLIP-B32F8 Video encoder | N1-655 | Activation_fp16 | [Model_Link](https://huggingface.co/Ambarella/XCLIP/blob/main/n1-655_xclip_b32f8_video_encoder_act16.bin) |
| XCLIP-B32F8 Text encoder | N1-655 | Activation_fp16 | [Model_Link](https://huggingface.co/Ambarella/XCLIP/blob/main/n1-655_xclip_b32f8_text_encoder_act16.bin) |
| XCLIP-B32F8 Post Predictor | N1-655 | Activation_fp16 | [Model_Link](https://huggingface.co/Ambarella/XCLIP/blob/main/n1-655_xclip_b32f8_post_predictor_act16.bin) |
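
One possible way to fetch the three binaries is through the `huggingface_hub` package, sketched below. The repository ID and filenames are read off the links in the table; the helper name `download_binaries` is hypothetical, and `huggingface_hub` is assumed to be installed separately. The snippet only downloads the files; running them on the device is not covered here.

```python
REPO_ID = "Ambarella/XCLIP"

# Filenames as they appear in the model links above.
FILENAMES = {
    "video_encoder": "n1-655_xclip_b32f8_video_encoder_act16.bin",
    "text_encoder": "n1-655_xclip_b32f8_text_encoder_act16.bin",
    "post_predictor": "n1-655_xclip_b32f8_post_predictor_act16.bin",
}

def download_binaries(repo_id=REPO_ID, filenames=FILENAMES):
    """Download each binary and return a dict mapping component name to local path."""
    # Imported lazily so the module loads even without the package installed.
    from huggingface_hub import hf_hub_download  # pip install huggingface_hub

    return {
        name: hf_hub_download(repo_id=repo_id, filename=filename)
        for name, filename in filenames.items()
    }
```

Calling `download_binaries()` requires network access and returns the cached local paths of the three files.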