	---
	license: mit
	library_name: acaua
	pipeline_tag: video-classification
	tags:
	- video-classification
	- vision
	- acaua
	- native-pytorch-port
	datasets:
	- kinetics-400
	---

	# UniFormer-S (Kinetics-400) — acaua mirror (pure-PyTorch port)

	Pure-PyTorch port of UniFormer-S (video classification, trained on
	Kinetics-400 with 16-frame clips at sampling stride 8) hosted under
	`CondadosAI/` for use with the [acaua](https://github.com/CondadosAI/acaua)
	computer vision library.
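
To make the "16-frame clips at sampling stride 8" notation concrete, here is a minimal sketch of which source-frame indices one such clip covers. The helper name `clip_frame_indices` and the choice of a centered clip are illustrative assumptions; acaua's actual decoder may select frames differently.

```python
def clip_frame_indices(total_frames: int, num_frames: int = 16, stride: int = 8) -> list[int]:
    """Return indices of `num_frames` frames spaced `stride` apart, centered in the video.

    Hypothetical helper for illustration only; not part of acaua.
    Indices past the end of a short video are clamped to the last frame.
    """
    if total_frames <= 0:
        raise ValueError("total_frames must be positive")
    span = (num_frames - 1) * stride + 1  # source frames covered by one clip
    start = max((total_frames - span) // 2, 0)  # center the clip when the video is long enough
    return [min(start + i * stride, total_frames - 1) for i in range(num_frames)]


indices = clip_frame_indices(total_frames=300)
print(indices[0], indices[-1], len(indices))
```

With the defaults, one clip spans (16 - 1) * 8 + 1 = 121 source frames, so videos shorter than that repeat their last frame in this sketch.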

	The architecture has been re-implemented in pure PyTorch under
	`acaua.adapters.uniformer.video` — no `mmcv`, no `mmengine`, no
	`mmaction2`, no `trust_remote_code`, no `timm` runtime dependency.
	The weights are converted from the upstream `.pth` checkpoint to
	safetensors with acaua's state-dict key naming (`backbone.*` +
	`head.fc.*`). They are **not** drop-in compatible with timm or
	Sense-X/UniFormer loaders; they are designed to load cleanly into
	acaua's `nn.Module` tree under `load_state_dict(strict=True)`.
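
One way to confirm the key naming before loading is to read only the JSON header of the safetensors file, which lists every tensor name. The sketch below uses just the standard library and relies on the published safetensors layout (an 8-byte little-endian header length followed by a JSON header). The function names and the local path `model.safetensors` are illustrative assumptions, not acaua APIs.

```python
import json
import struct
from pathlib import Path

EXPECTED_PREFIXES = ("backbone.", "head.fc.")  # key naming stated in this card


def safetensors_keys(path: str) -> list[str]:
    """List tensor names by parsing only the safetensors JSON header."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # u64 little-endian header size
        header = json.loads(f.read(header_len))
    # "__metadata__" is an optional free-form entry, not a tensor.
    return [k for k in header if k != "__metadata__"]


def unexpected_keys(keys: list[str]) -> list[str]:
    """Return keys that do not follow the `backbone.*` / `head.fc.*` naming."""
    return [k for k in keys if not k.startswith(EXPECTED_PREFIXES)]


if Path("model.safetensors").exists():  # assumed local download location
    print(unexpected_keys(safetensors_keys("model.safetensors")))
```

An empty list from `unexpected_keys` only shows that the names follow the expected prefixes; the `strict=True` load remains the real compatibility check.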

	## Provenance

	| Field | Value |
	|---|---|
	| Upstream code | [`Sense-X/UniFormer`](https://github.com/Sense-X/UniFormer) @ `main` (Apache-2.0) |
	| Upstream weights | [`Sense-X/uniformer_video`](https://huggingface.co/Sense-X/uniformer_video) at revision `f9448914e6161573b14ba47b72fcef170e03a1f9` (MIT) |
	| Upstream file | `uniformer_small_k400_16x8.pth` |
	| Upstream SHA256 | `d5fd7b0c49ee6a5422ef5d0c884d962c742003bfbd900747485eb99fa269d0db` |
	| Upstream factory | `uniformer_small()` in `video_classification/models/uniformer.py` |
	| Conversion script | [`scripts/convert_uniformer_video.py`](https://github.com/CondadosAI/acaua/blob/main/scripts/convert_uniformer_video.py) |
	| Paper | Li et al., [UniFormer: Unified Transformer for Efficient Spatiotemporal Representation Learning](https://arxiv.org/abs/2201.04676), 2022 |
	| Params | 22M |
	| Top-1 (Kinetics-400, 16 frames x 1 clip x 1 crop) | 78.4% |
	| FLOPs | 41.8G |
	| Training recipe | 16 input frames, sampling stride 8, 224x224 center-crop, ImageNet-mean/std normalization |
	| Mirrored on | 2026-04-24 |
	| Mirrored by | [CondadosAI/acaua](https://github.com/CondadosAI/acaua) |
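
The upstream SHA256 in the table can be checked against a locally downloaded copy of the checkpoint before conversion. This is a minimal sketch using only `hashlib`; the local filename is an assumption about where the file was saved.

```python
import hashlib
from pathlib import Path

EXPECTED_SHA256 = "d5fd7b0c49ee6a5422ef5d0c884d962c742003bfbd900747485eb99fa269d0db"


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large checkpoints are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


checkpoint = Path("uniformer_small_k400_16x8.pth")  # assumed local download path
if checkpoint.exists():
    actual = sha256_of_file(str(checkpoint))
    print("match" if actual == EXPECTED_SHA256 else f"mismatch: {actual}")
```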

	## Usage via acaua

	```python
	import acaua

	# MIT-declared weights require the explicit opt-in.
	model = acaua.Model.from_pretrained(
	    "CondadosAI/uniformer_s_k400", allow_non_apache=True
	)
	result = model.predict("video.mp4")
	print(result.labels)  # tuple of top-5 Kinetics-400 action labels
	print(result.scores)  # aligned float32 probabilities
	```

	Requires `pip install 'acaua[video]'` for the TorchCodec-backed video
	decoder and a system-level `ffmpeg` install.
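
Before calling `predict` on a video file, it can help to confirm that the two prerequisites above are present. The following is a rough preflight sketch, not part of acaua. It assumes the TorchCodec package is importable as `torchcodec` and that `ffmpeg` should be discoverable on `PATH`.

```python
import importlib.util
import shutil


def video_prerequisites() -> dict[str, bool]:
    """Report whether the video-decoding prerequisites appear to be installed."""
    return {
        # find_spec returns None when the module cannot be located.
        "torchcodec": importlib.util.find_spec("torchcodec") is not None,
        # shutil.which returns None when the executable is not on PATH.
        "ffmpeg": shutil.which("ffmpeg") is not None,
    }


for name, ok in video_prerequisites().items():
    print(f"{name}: {'found' if ok else 'missing'}")
```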

	## Files in this mirror

	- `model.safetensors` — acaua-format weights (key-remapped, verified
	round-trip under `load_state_dict(strict=True)` at conversion time).
	- `labels.json` — JSON array of 400 Kinetics-400 action labels in
	index order. Read by the adapter at load time.
	- `config.json` — minimal metadata: `acaua_task=video_classification`,
	`num_frames`, `num_classes`.
	- `NOTICE` — attribution chain (code AND weights).
	- `LICENSE` — Apache-2.0.
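
Since `labels.json` is a plain JSON array in index order, a class index maps to its label by position. Here is a small sketch, assuming the file has been downloaded to the working directory; `load_labels` is an illustrative helper, not the adapter's own loader.

```python
import json
from pathlib import Path


def load_labels(path: str, expected_count: int = 400) -> list[str]:
    """Load the label array and check it has the expected number of classes."""
    labels = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(labels, list) or len(labels) != expected_count:
        raise ValueError(f"expected a JSON array of {expected_count} labels")
    return labels


if Path("labels.json").exists():  # assumed local copy of the mirror file
    labels = load_labels("labels.json")
    print(labels[0])  # label for class index 0
```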

	## License and attribution

	The adapter code (this repository) is redistributed under Apache-2.0.
	The underlying weights carry upstream's MIT declaration, which is
	permissive, allows redistribution, and is compatible with Apache-2.0.
	The acaua UniFormer-video adapter is itself a derivative work of the
	upstream PyTorch implementation; see [`NOTICE`](./NOTICE) for the
	required attribution chain.

	## Citation

	```bibtex
	@misc{li2022uniformervideo,
	  title = {UniFormer: Unified Transformer for Efficient Spatiotemporal Representation Learning},
	  author = {Li, Kunchang and Wang, Yali and Gao, Peng and Song, Guanglu and Liu, Yu and Li, Hongsheng and Qiao, Yu},
	  year = {2022},
	  eprint = {2201.04676},
	  archivePrefix = {arXiv},
	  primaryClass = {cs.CV},
	}
	```