PolyViT: Co-training Vision Transformers on Images, Videos and Audio
==
![PolyViT: Co-training Vision Transformers on Images, Videos and Audio](data/polyvit.png)
PolyViT is a transformer model that has been trained on multiple tasks and
modalities, including images, audio, and video. This approach allows PolyViT
to achieve improved accuracy on five video and audio classification datasets,
while using fewer parameters than other models. In particular, when trained on
9 datasets across three modalities, PolyViT uses 8.3 times fewer parameters than
a state-of-the-art single-task model, while outperforming it on two datasets
and achieving competitive performance on the others. A key advantage of PolyViT
is its simplicity: it requires minimal hyperparameter tuning, because the
per-task hyperparameters can be easily reused.
Details can be found in the [paper](https://arxiv.org/abs/2111.12993).
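
This README does not spell out how training steps from different tasks are
interleaved during co-training. As a rough illustration only, the sketch below
assumes that each training step is devoted to a single task and that the task is
sampled with probability proportional to a per-task step budget. The task names,
the budgets, and the helper `weighted_task_schedule` are hypothetical and are
not part of Scenic; refer to the paper for the schedules actually used.

```python
import random
from collections import Counter


def weighted_task_schedule(steps_per_task, total_steps, seed=0):
    """Return one task name per training step.

    Each step draws a task with probability proportional to its step budget.
    This is an illustrative assumption, not the Scenic implementation.
    """
    rng = random.Random(seed)  # local RNG so the schedule is reproducible
    tasks = list(steps_per_task)
    weights = [steps_per_task[task] for task in tasks]
    return rng.choices(tasks, weights=weights, k=total_steps)


# Hypothetical step budgets, chosen only for illustration.
steps_per_task = {"image_task": 6000, "video_task": 3000, "audio_task": 1000}

schedule = weighted_task_schedule(
    steps_per_task, total_steps=sum(steps_per_task.values())
)
counts = Counter(schedule)

for task in steps_per_task:
    print(task, counts[task])
```

In a real co-training loop, each entry of `schedule` would select which
dataset iterator and which task-specific head to use for that step, while the
shared transformer encoder is updated on every step.
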
## Getting Started
The following command installs the required packages for PolyViT:
```shell
$ pip install -r scenic/projects/polyvit/requirements.txt
```
PolyViT uses a pretrained ViT on images which can be downloaded or trained using
[Scenic](https://github.com/google-research/scenic/tree/main/scenic/projects/baselines)
or the [original implementation](https://github.com/google-research/vision_transformer).
PolyViT uses the approaches from [MBT](https://github.com/google-research/scenic/tree/main/scenic/projects/mbt)
and [ViViT](https://github.com/google-research/scenic/tree/main/scenic/projects/vivit)
for processing and training on audio and video, so please take a look at them for
more information on the data pipeline.
The following command trains a PolyViT-B/16:
```shell
$ python -m scenic.projects.polyvit.main \
--config=scenic/projects/polyvit/configs/polyvit_all.py \
--workdir=polyvit_all/
```
## Checkpoints
Checkpoints will be released soon.
## Reference
If you use PolyViT, please cite it using the following BibTeX entry.
```bibtex
@article{likhosherstov2022polyvit,
title={Polyvit: Co-training vision transformers on images, videos and audio},
author={Likhosherstov, Valerii and Arnab, Anurag and Choromanski,
Krzysztof and Lucic, Mario and Tay, Yi and Weller, Adrian
and Dehghani, Mostafa},
journal={Transactions on Machine Learning Research},
year={2022}
}
```