---
license: mit
library_name: acaua
pipeline_tag: video-classification
tags:
  - video-classification
  - vision
  - acaua
  - native-pytorch-port
datasets:
  - kinetics-400
---

# UniFormer-S (Kinetics-400) — acaua mirror (pure-PyTorch port)

Pure-PyTorch port of **UniFormer-S** (video classification, trained on
Kinetics-400 with 16-frame clips at sampling stride 8) hosted under
`CondadosAI/` for use with the [acaua](https://github.com/CondadosAI/acaua)
computer vision library.
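
To make the "16-frame clips at sampling stride 8" notation concrete, the sketch below computes which source-frame indices one such clip would cover. It assumes a single clip centered in time and clamps indices when the video is shorter than the clip span. `clip_frame_indices` is a hypothetical helper written for illustration, and the sampling acaua actually performs may differ.

```python
def clip_frame_indices(total_frames: int, num_frames: int = 16, stride: int = 8) -> list[int]:
    """Return source-frame indices for one temporally centered clip."""
    if total_frames <= 0:
        raise ValueError("total_frames must be positive")
    # 16 frames at stride 8 span (16 - 1) * 8 + 1 = 121 source frames.
    span = (num_frames - 1) * stride + 1
    start = max(0, (total_frames - span) // 2)
    # Clamp so that videos shorter than the span repeat their last frame.
    return [min(start + i * stride, total_frames - 1) for i in range(num_frames)]


indices = clip_frame_indices(300)
print(indices[0], indices[-1], len(indices))
```

For a 300-frame video the clip starts at frame 89 and ends at frame 209, so roughly four seconds of 30 fps footage feed a single prediction.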

The architecture has been re-implemented in pure PyTorch under
`acaua.adapters.uniformer.video` — no `mmcv`, no `mmengine`, no
`mmaction2`, no `trust_remote_code`, no `timm` runtime dependency.
The weights are converted from the upstream `.pth` checkpoint to
safetensors with acaua's state-dict key naming (`backbone.*` +
`head.fc.*`). They are **not** drop-in compatible with timm or
Sense-X/UniFormer loaders — they are designed to load cleanly into
acaua's `nn.Module` tree under `load_state_dict(strict=True)`.
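
As a rough illustration of the key naming described above, the following sketch flags any state-dict key that falls outside the `backbone.` and `head.fc.` prefixes. `unexpected_keys` is a hypothetical helper, not part of acaua, and the sample key names are made up for the demonstration rather than taken from the real checkpoint.

```python
from typing import Iterable

# Prefixes taken from the key-naming description above.
EXPECTED_PREFIXES = ("backbone.", "head.fc.")


def unexpected_keys(keys: Iterable[str]) -> list[str]:
    """Return the keys that do not start with one of the expected prefixes."""
    return sorted(k for k in keys if not k.startswith(EXPECTED_PREFIXES))


# Made-up key names for illustration only.
sample = ["backbone.example.weight", "head.fc.weight", "head.fc.bias", "fc.weight"]
print(unexpected_keys(sample))
```

With the real file, the key names could be obtained with `safetensors.torch.load_file("model.safetensors").keys()`. An empty result would be consistent with the naming above, although only `load_state_dict(strict=True)` confirms a full match against the module tree.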

## Provenance

| | |
|---|---|
| Upstream code | [`Sense-X/UniFormer`](https://github.com/Sense-X/UniFormer) @ `main` (Apache-2.0) |
| Upstream weights | [`Sense-X/uniformer_video`](https://huggingface.co/Sense-X/uniformer_video) at revision `f9448914e6161573b14ba47b72fcef170e03a1f9` (MIT) |
| Upstream file | `uniformer_small_k400_16x8.pth` |
| Upstream SHA256 | `d5fd7b0c49ee6a5422ef5d0c884d962c742003bfbd900747485eb99fa269d0db` |
| Upstream factory | `uniformer_small()` in `video_classification/models/uniformer.py` |
| Conversion script | [`scripts/convert_uniformer_video.py`](https://github.com/CondadosAI/acaua/blob/main/scripts/convert_uniformer_video.py) |
| Paper | Li et al., [*UniFormer: Unified Transformer for Efficient Spatiotemporal Representation Learning*](https://arxiv.org/abs/2201.04676), 2022 |
| Params | 22M |
| Top-1 (Kinetics-400, 16 frames × 1 clip × 1 crop) | 78.4% |
| FLOPs | 41.8G |
| Training recipe | 16 input frames, sampling stride 8, 224×224 center-crop, ImageNet-mean/std normalization |
| Mirrored on | 2026-04-24 |
| Mirrored by | [CondadosAI/acaua](https://github.com/CondadosAI/acaua) |
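
Anyone who downloads the upstream `.pth` file independently can compare it against the SHA256 in the table. The sketch below uses only the Python standard library; `sha256_of_file` and `matches_upstream` are hypothetical helper names, and the local file path is whatever you saved the checkpoint as.

```python
import hashlib

# SHA256 of uniformer_small_k400_16x8.pth, copied from the provenance table.
UPSTREAM_SHA256 = "d5fd7b0c49ee6a5422ef5d0c884d962c742003bfbd900747485eb99fa269d0db"


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so a large checkpoint need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_upstream(path: str) -> bool:
    """Return True if the file's SHA256 equals the value in the table."""
    return sha256_of_file(path) == UPSTREAM_SHA256
```

This check applies to the upstream `.pth` file only. The converted `model.safetensors` in this mirror is a different file and will hash differently.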

## Usage via acaua

```python
import acaua

# MIT-declared weights require the explicit opt-in.
model = acaua.Model.from_pretrained(
    "CondadosAI/uniformer_s_k400", allow_non_apache=True
)
result = model.predict("video.mp4")
print(result.labels)   # tuple of top-5 Kinetics-400 action labels
print(result.scores)   # aligned float32 probabilities
```

Video decoding requires the `video` extra (`pip install 'acaua[video]'`),
which provides the TorchCodec-backed decoder, plus a system-level `ffmpeg`
installation.
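
Before calling `predict`, it can help to confirm that both prerequisites are visible from the current environment. The sketch below is a minimal pre-flight check, assuming the `ffmpeg` executable on `PATH` is a reasonable proxy for a system-level install and that TorchCodec is importable as `torchcodec`. `video_prerequisites` is a hypothetical helper. It is not part of acaua, and acaua may perform its own checks.

```python
import importlib.util
import shutil


def video_prerequisites() -> dict[str, bool]:
    """Report whether the ffmpeg binary and the torchcodec package are visible."""
    return {
        # Looks for an `ffmpeg` executable on PATH (a proxy, not a library check).
        "ffmpeg": shutil.which("ffmpeg") is not None,
        # Checks that the package can be located, without importing it.
        "torchcodec": importlib.util.find_spec("torchcodec") is not None,
    }


print(video_prerequisites())
```

A `False` entry only indicates that the component was not found where this check looks; an FFmpeg installed in a non-standard location may still work.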

## Files in this mirror

- `model.safetensors` — acaua-format weights (key-remapped, verified
  round-trip under `load_state_dict(strict=True)` at conversion time).
- `labels.json` — JSON array of 400 Kinetics-400 action labels in
  index order. Read by the adapter at load time.
- `config.json` — minimal metadata: `acaua_task=video_classification`,
  `num_frames`, `num_classes`.
- `NOTICE` — attribution chain (code AND weights).
- `LICENSE` — Apache-2.0.
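
The metadata files listed above can also be inspected directly. The following sketch loads `labels.json` and checks its length against `num_classes`, assuming `config.json` is a flat JSON object with that key at the top level. `load_labels` is a hypothetical helper for illustration, not the adapter's own loading code.

```python
import json
from pathlib import Path


def load_labels(mirror_dir: str) -> list[str]:
    """Load labels.json and check its length against num_classes in config.json."""
    root = Path(mirror_dir)
    labels = json.loads((root / "labels.json").read_text(encoding="utf-8"))
    config = json.loads((root / "config.json").read_text(encoding="utf-8"))
    # Assumption: num_classes is a top-level key of config.json.
    if len(labels) != config["num_classes"]:
        raise ValueError(
            f"labels.json has {len(labels)} entries, "
            f"but config.json declares {config['num_classes']} classes"
        )
    return labels
```

Because the labels are stored in index order, `labels[i]` names the class at output index `i`.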

## License and attribution

The adapter code (this repository) is redistributed under Apache-2.0.
The underlying weights carry upstream's MIT declaration, which is
compatible with Apache-2.0 and permits redistribution. The acaua
UniFormer-video adapter is itself a derivative work of the upstream
PyTorch implementation; see [`NOTICE`](./NOTICE) for the required
attribution chain.

## Citation

```bibtex
@misc{li2022uniformervideo,
  title        = {UniFormer: Unified Transformer for Efficient Spatiotemporal Representation Learning},
  author       = {Li, Kunchang and Wang, Yali and Gao, Peng and Song, Guanglu and Liu, Yu and Li, Hongsheng and Qiao, Yu},
  year         = {2022},
  eprint       = {2201.04676},
  archivePrefix = {arXiv},
  primaryClass = {cs.CV},
}
```