Ambarella / XCLIP


	---
	library_name: pytorch
	---

	![xclip_logo](resource/XCLIP_base_patch32_frames8.png)

	X-CLIP extends the CLIP framework from images to videos by incorporating temporal modeling, enabling aligned video–text representations for efficient video understanding and recognition.

	Original paper: [Expanding Language-Image Pretrained Models for General Video Recognition (X-CLIP)](https://arxiv.org/abs/2208.02816)

	# XCLIP-B32F8

	This model uses the X-CLIP Base-Patch32-8Frames variant, which combines a ViT-Base backbone with 32×32 image patches and processes 8 video frames to capture both appearance and motion information. It is well suited for applications such as video classification, video retrieval, video-text matching, and zero-shot video understanding where efficient spatiotemporal reasoning is required.
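
As a rough illustration of what the variant name implies, the sketch below works out the patch grid for a 224×224 frame split into 32×32 patches across 8 frames. The function name `patch_token_counts` is hypothetical, and the count covers image patches only; any extra tokens the actual architecture adds (for example a class token) are not included.

```python
def patch_token_counts(image_size: int = 224, patch_size: int = 32, num_frames: int = 8):
    """Return (patches per side, patches per frame, patches per clip)."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size      # 224 // 32 = 7
    per_frame = per_side * per_side          # 7 * 7 = 49
    return per_side, per_frame, per_frame * num_frames


per_side, per_frame, per_clip = patch_token_counts()
print(f"{per_side}x{per_side} grid, {per_frame} patches per frame, {per_clip} patches per clip")
# prints: 7x7 grid, 49 patches per frame, 392 patches per clip
```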

	Model Configuration:
	- Reference implementation: [Official X-CLIP source code](https://github.com/microsoft/VideoX/tree/master/X-CLIP)
	- Original weights: [XCLIP-B32F8](https://huggingface.co/microsoft/xclip-base-patch32/blob/main/model.safetensors)
	- Input resolution: 8×3×224×224 (frames × channels × height × width)
	- Supported Cooper versions:
	  - Cooper SDK: [2.5.4]
	  - Cooper Foundry: [2.3]
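
The video input above holds 8 frames, so a longer clip has to be reduced to 8 frames before inference. The sketch below shows one possible way to do that: take the centre frame of 8 equal segments. This sampling strategy and the helper name `sample_frame_indices` are assumptions for illustration; the preprocessing used by the reference implementation or the Cooper toolchain may differ.

```python
from typing import List


def sample_frame_indices(total_frames: int, num_frames: int = 8) -> List[int]:
    """Pick num_frames indices spread evenly over a clip (centre of each segment).

    Clips shorter than num_frames yield repeated indices.
    """
    if total_frames < 1 or num_frames < 1:
        raise ValueError("total_frames and num_frames must be positive")
    segment = total_frames / num_frames
    return [min(total_frames - 1, int(segment * (i + 0.5))) for i in range(num_frames)]


indices = sample_frame_indices(total_frames=64)
# Shape of the resulting clip tensor: frames x channels x height x width.
input_shape = (len(indices), 3, 224, 224)
print(indices)      # [4, 12, 20, 28, 36, 44, 52, 60]
print(input_shape)  # (8, 3, 224, 224)
```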


	| Model | Device | Compression | Model Link |
	| :-----: | :-----: | :-----: | ------- |
	| XCLIP-B32F8 Video encoder | N1-655 | Activation_fp16 | [Model_Link](https://huggingface.co/Ambarella/XCLIP/blob/main/n1-655_xclip_b32f8_video_encoder_act16.bin) |
	| XCLIP-B32F8 Text encoder | N1-655 | Activation_fp16 | [Model_Link](https://huggingface.co/Ambarella/XCLIP/blob/main/n1-655_xclip_b32f8_text_encoder_act16.bin) |
	| XCLIP-B32F8 Post Predictor | N1-655 | Activation_fp16 | [Model_Link](https://huggingface.co/Ambarella/XCLIP/blob/main/n1-655_xclip_b32f8_post_predictor_act16.bin) |
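
The links in the table open the Hugging Face file viewer (`blob` URLs). For scripted downloads, the Hub serves the raw file under the matching `resolve` path. The sketch below builds those URLs for the three binaries listed above; the dictionary keys and the helper name `download_url` are hypothetical labels, and the `main` revision is taken from the table links.

```python
REPO_ID = "Ambarella/XCLIP"

# File names copied from the table; the keys are illustrative labels only.
ARTIFACTS = {
    "video_encoder": "n1-655_xclip_b32f8_video_encoder_act16.bin",
    "text_encoder": "n1-655_xclip_b32f8_text_encoder_act16.bin",
    "post_predictor": "n1-655_xclip_b32f8_post_predictor_act16.bin",
}


def download_url(filename: str, repo_id: str = REPO_ID, revision: str = "main") -> str:
    """Build the direct-download (resolve) URL for a file in a Hub repository."""
    return f"https://huggingface.co/{repo_id}/resolve/{revision}/{filename}"


for name, filename in ARTIFACTS.items():
    print(name, download_url(filename))
```

The same files can also be fetched with `huggingface_hub.hf_hub_download(repo_id=..., filename=...)`, which adds local caching.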