---
license: mit
library_name: acaua
pipeline_tag: video-classification
tags:
- video-classification
- vision
- acaua
- native-pytorch-port
datasets:
- kinetics-400
---
# UniFormer-S (Kinetics-400): acaua mirror (pure-PyTorch port)
Pure-PyTorch port of **UniFormer-S** (video classification, trained on
Kinetics-400 with 16-frame clips at sampling stride 8) hosted under
`CondadosAI/` for use with the [acaua](https://github.com/CondadosAI/acaua)
computer vision library.
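To make "16-frame clips at sampling stride 8" concrete, here is a minimal sketch of picking frame indices for one clip. The helper `clip_frame_indices` is hypothetical and not part of acaua; centering the clip and repeating the last frame for short videos are assumptions, since this card does not describe how the adapter samples frames.

```python
def clip_frame_indices(total_frames, num_frames=16, stride=8):
    """Return frame indices for one temporally centered clip.

    Hypothetical helper for illustration; acaua's own sampling may differ.
    """
    # Number of source frames covered from the first to the last sample.
    span = (num_frames - 1) * stride + 1
    # Assumption: center the clip in the video when it is long enough.
    start = max((total_frames - span) // 2, 0)
    # Assumption: clamp to the final frame so short videos repeat it.
    return [min(start + i * stride, total_frames - 1) for i in range(num_frames)]


indices = clip_frame_indices(total_frames=300)
print(indices[0], indices[-1], len(indices))
```

With these defaults one clip spans 121 source frames, so any video shorter than that falls back to the clamping branch.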
The architecture has been re-implemented in pure PyTorch under
`acaua.adapters.uniformer.video`: no `mmcv`, no `mmengine`, no
`mmaction2`, no `trust_remote_code`, no `timm` runtime dependency.
The weights are converted from the upstream `.pth` checkpoint to
safetensors with acaua's state-dict key naming (`backbone.*` +
`head.fc.*`). They are **not** drop-in compatible with timm or
Sense-X/UniFormer loaders; they are designed to load cleanly into
acaua's `nn.Module` tree under `load_state_dict(strict=True)`.
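As a quick sanity check on the key naming described above, the sketch below flags any state-dict key that falls outside the `backbone.*` and `head.fc.*` prefixes. The function `unexpected_keys` is a hypothetical helper, the sample key names are made up for illustration, and treating those two prefixes as exhaustive is an assumption drawn from the wording above.

```python
ALLOWED_PREFIXES = ("backbone.", "head.fc.")


def unexpected_keys(keys, allowed=ALLOWED_PREFIXES):
    """Return, sorted, the keys that start with none of the allowed prefixes."""
    # str.startswith accepts a tuple and matches any of its entries.
    return sorted(k for k in keys if not k.startswith(allowed))


# Made-up key names, for illustration only.
sample = ["backbone.blocks.0.weight", "head.fc.bias", "fc.weight"]
print(unexpected_keys(sample))
```

The real key list could be read without loading tensors, for example via `safetensors.safe_open("model.safetensors", framework="pt")` and its `keys()` method, assuming the `safetensors` package is installed.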
## Provenance
| | |
|---|---|
| Upstream code | [`Sense-X/UniFormer`](https://github.com/Sense-X/UniFormer) @ `main` (Apache-2.0) |
| Upstream weights | [`Sense-X/uniformer_video`](https://huggingface.co/Sense-X/uniformer_video) at revision `f9448914e6161573b14ba47b72fcef170e03a1f9` (MIT) |
| Upstream file | `uniformer_small_k400_16x8.pth` |
| Upstream SHA256 | `d5fd7b0c49ee6a5422ef5d0c884d962c742003bfbd900747485eb99fa269d0db` |
| Upstream factory | `uniformer_small()` in `video_classification/models/uniformer.py` |
| Conversion script | [`scripts/convert_uniformer_video.py`](https://github.com/CondadosAI/acaua/blob/main/scripts/convert_uniformer_video.py) |
| Paper | Li et al., [*UniFormer: Unified Transformer for Efficient Spatiotemporal Representation Learning*](https://arxiv.org/abs/2201.04676), 2022 |
| Params | 22M |
| Top-1 (Kinetics-400, 16 frames x 1 clip x 1 crop) | 78.4% |
| FLOPs | 41.8G |
| Training recipe | 16 input frames, sampling stride 8, 224x224 center-crop, ImageNet-mean/std normalization |
| Mirrored on | 2026-04-24 |
| Mirrored by | [CondadosAI/acaua](https://github.com/CondadosAI/acaua) |
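The upstream SHA256 in the table can be checked against a locally downloaded copy of `uniformer_small_k400_16x8.pth`. The sketch below uses only the standard library; the function names and the local path are assumptions for illustration, not part of acaua or its conversion script.

```python
import hashlib

# Digest copied from the "Upstream SHA256" row above.
EXPECTED_SHA256 = "d5fd7b0c49ee6a5422ef5d0c884d962c742003bfbd900747485eb99fa269d0db"


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large checkpoints need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_upstream(path, expected=EXPECTED_SHA256):
    """Return True when the file's digest equals the expected hex digest."""
    return sha256_of_file(path) == expected
```

For example, `matches_upstream("uniformer_small_k400_16x8.pth")` would return `True` only for a byte-identical copy of the upstream file, assuming it was saved under that name in the working directory.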
## Usage via acaua
```python
import acaua

# MIT-declared weights require the explicit opt-in.
model = acaua.Model.from_pretrained(
    "CondadosAI/uniformer_s_k400", allow_non_apache=True
)
result = model.predict("video.mp4")
print(result.labels)  # tuple of top-5 Kinetics-400 action labels
print(result.scores)  # aligned float32 probabilities
```
Requires `pip install 'acaua[video]'` for the TorchCodec-backed video
decoder and a system-level `ffmpeg` install.
## Files in this mirror
- `model.safetensors`: acaua-format weights (key-remapped, verified
  round-trip under `load_state_dict(strict=True)` at conversion time).
- `labels.json`: JSON array of 400 Kinetics-400 action labels in
  index order. Read by the adapter at load time.
- `config.json`: minimal metadata: `acaua_task=video_classification`,
  `num_frames`, `num_classes`.
- `NOTICE`: attribution chain (code AND weights).
- `LICENSE`: Apache-2.0.
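Because `labels.json` is a plain JSON array in index order, class indices map to label strings by position. The sketch below reads such a file and pairs the highest scores with their labels; `load_labels` and `top_k` are hypothetical helpers written for illustration and do not reflect how the acaua adapter does this internally.

```python
import json


def load_labels(path, expected_count=400):
    """Load a JSON array of labels and check its length."""
    with open(path, encoding="utf-8") as handle:
        labels = json.load(handle)
    if not isinstance(labels, list) or len(labels) != expected_count:
        raise ValueError(f"expected a JSON array of {expected_count} labels")
    return labels


def top_k(scores, labels, k=5):
    """Pair the k highest scores with their labels, best first."""
    ranked = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:k]
    return [(labels[i], scores[i]) for i in ranked]
```

Here `scores` is assumed to be a plain sequence of per-class probabilities with the same length and ordering as the label array.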
## License and attribution
The adapter code (this repository) is redistributed under Apache-2.0.
The underlying weights carry upstream's MIT declaration (compatible,
permissively redistributable). The acaua UniFormer-video adapter is
itself a derivative work of the upstream PyTorch implementation; see
[`NOTICE`](./NOTICE) for the required attribution chain.
## Citation
```bibtex
@misc{li2022uniformervideo,
  title = {UniFormer: Unified Transformer for Efficient Spatiotemporal Representation Learning},
  author = {Li, Kunchang and Wang, Yali and Gao, Peng and Song, Guanglu and Liu, Yu and Li, Hongsheng and Qiao, Yu},
  year = {2022},
  eprint = {2201.04676},
  archivePrefix = {arXiv},
  primaryClass = {cs.CV},
}
```