---
license: mit
library_name: transformers
pipeline_tag: video-classification
tags:
- video-classification
- zero-shot
- vision
- acaua
datasets:
- kinetics-400
base_model: microsoft/xclip-base-patch32
---

# X-CLIP (base, patch 32) — acaua mirror

MIT-licensed mirror hosted under `CondadosAI/` for use with the [acaua](https://github.com/CondadosAI/acaua) computer vision library.

This is a **safetensors-only mirror** of the upstream Microsoft weights, pinned to the commit shown below. Upstream ships a legacy `pytorch_model.bin` (pickle format) alongside `model.safetensors`; the pickle file has been deliberately removed as a security precaution. Pickle loads can execute arbitrary code, and `transformers` prefers safetensors when both formats are present, so removing it has no functional impact on downstream users.
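
If you want a load to fail loudly rather than ever fall back to a pickle checkpoint (relevant when pointing the same code at other mirrors), `from_pretrained` accepts a `use_safetensors` flag:

```python
from transformers import XCLIPModel

# Raises instead of falling back to pytorch_model.bin when no safetensors file exists
model = XCLIPModel.from_pretrained("CondadosAI/xclip_base_patch32", use_safetensors=True)
```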

X-CLIP is a **zero-shot video classification** model: you provide a list of candidate text labels at inference time and the model ranks them by similarity to the video clip. It is not a closed-set softmax classifier, and it is not registered with `AutoModelForVideoClassification`.
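
The ranking amounts to a scaled cosine similarity between one video embedding and N text embeddings. A minimal sketch of that scoring step (function and tensor names are illustrative, not acaua or `transformers` API; the learned logit scale is omitted):

```python
import torch

def rank_labels(video_emb: torch.Tensor, text_embs: torch.Tensor) -> torch.Tensor:
    """video_emb: (1, D); text_embs: (N, D), one row per candidate label."""
    video_emb = video_emb / video_emb.norm(dim=-1, keepdim=True)
    text_embs = text_embs / text_embs.norm(dim=-1, keepdim=True)
    # Cosine similarity per label, softmaxed into a distribution over labels
    return (video_emb @ text_embs.T).softmax(dim=-1)
```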

## Provenance

| Field | Value |
|---|---|
| Upstream repo | [`microsoft/xclip-base-patch32`](https://huggingface.co/microsoft/xclip-base-patch32) |
| Upstream commit SHA | `a2e27a78a2b5d802e894b8a1ef14f3a8ce490963` |
| Upstream commit date | 2024-02-04 |
| Declared license | MIT |
| Paper | Ni et al., *"Expanding Language-Image Pretrained Models for General Video Recognition"*, ECCV 2022, arXiv:[2208.02816](https://arxiv.org/abs/2208.02816) |
| Official code | [`microsoft/VideoX`](https://github.com/microsoft/VideoX) (MIT) |
| Mirrored on | 2026-04-23 |
| Mirrored by | [CondadosAI/acaua](https://github.com/CondadosAI/acaua) |

## Usage via acaua

```python
import acaua

model = acaua.Model.from_pretrained(
    "CondadosAI/xclip_base_patch32",
    allow_non_apache=True,  # weights are MIT, not Apache-2.0
)
result = model.predict(
    "dance.mp4",
    labels=["dancing", "cooking", "running", "sleeping", "walking"],
    top_k=3,
)
for label, score in zip(result.labels, result.scores.tolist()):
    print(f"{label}: {score:.3f}")
```

## Usage via 🤗 Transformers

This mirror is drop-in compatible with the upstream repo.

```python
from transformers import XCLIPModel, XCLIPProcessor

processor = XCLIPProcessor.from_pretrained("CondadosAI/xclip_base_patch32")
model = XCLIPModel.from_pretrained("CondadosAI/xclip_base_patch32")
```
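
Continuing from the snippet above, a minimal end-to-end call. The random frames are a stand-in for 8 frames sampled from a real clip, and the label list is arbitrary:

```python
import numpy as np
import torch

# 8 dummy RGB frames of shape (224, 224, 3); replace with real sampled frames
frames = list(np.random.randint(0, 256, (8, 224, 224, 3), dtype=np.uint8))
labels = ["dancing", "cooking", "running"]

inputs = processor(text=labels, videos=frames, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_video has shape (1, num_labels)
probs = outputs.logits_per_video.softmax(dim=-1)[0]
for label, p in zip(labels, probs.tolist()):
    print(f"{label}: {p:.3f}")
```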

## Expected input

- **Frames:** 8 uniformly sampled frames per clip (`vision_config.num_frames=8`); see the sampling sketch below.
- **Resolution:** 224 × 224 after resize and center-crop.
- **Normalization:** ImageNet mean/std (handled by `XCLIPProcessor`).
- **Text prompts:** supplied at inference time; any natural-language strings work.

## License and attribution

Redistributed under MIT, consistent with the upstream declaration. See [`NOTICE`](./NOTICE) for required attribution.

## Citation

```bibtex
@inproceedings{ni2022expanding,
  title={Expanding language-image pretrained models for general video recognition},
  author={Ni, Bolin and Peng, Houwen and Chen, Minghao and Zhang, Songyang and Meng, Gaofeng and Fu, Jianlong and Xiang, Shiming and Ling, Haibin},
  booktitle={European Conference on Computer Vision (ECCV)},
  pages={1--18},
  year={2022},
  publisher={Springer}
}
```