	PolyViT: Co-training Vision Transformers on Images, Videos and Audio
	==
	![PolyViT: Co-training Vision Transformers on Images, Videos and Audio](data/polyvit.png)

	PolyViT is a transformer model that has been trained on multiple tasks and
	modalities, including images, audio, and video. This approach allows PolyViT
	to achieve improved accuracy on five video and audio classification datasets,
	while using fewer parameters than other models. In particular, when trained on
	9 datasets across three modalities, PolyViT uses 8.3 times fewer parameters than
	a state-of-the-art single-task model, while outperforming it on two datasets
	and achieving competitive performance on the others. A key advantage of PolyViT
	is its simplicity: it requires minimal hyperparameter tuning, as the
	per-task hyperparameters can be easily reused.
	Details can be found in the [paper](https://arxiv.org/abs/2111.12993).
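
As a rough illustration of where parameter savings of this kind can come from, the sketch below compares training one full model per task against co-training a single model. It assumes that one encoder is shared across all tasks while each modality keeps its own input tokenizer and each task its own output head. The tokenizer and head sizes are hypothetical placeholders, and the encoder size is only the approximate size of a ViT-B encoder, so the resulting ratio is illustrative and is not meant to reproduce the figure reported in the paper.

```python
# All counts below are illustrative placeholders, not figures from the paper.
ENCODER_PARAMS = 86_000_000    # roughly the size of a ViT-B encoder
TOKENIZER_PARAMS = 1_000_000   # hypothetical per-modality input tokenizer
HEAD_PARAMS = 500_000          # hypothetical per-task classification head


def single_task_total(num_tasks):
    """Parameters when every task trains its own full model."""
    return num_tasks * (ENCODER_PARAMS + TOKENIZER_PARAMS + HEAD_PARAMS)


def co_trained_total(num_tasks, num_modalities):
    """Parameters when one encoder is shared across all tasks (assumed layout)."""
    return (
        ENCODER_PARAMS
        + num_modalities * TOKENIZER_PARAMS
        + num_tasks * HEAD_PARAMS
    )


separate = single_task_total(num_tasks=9)
shared = co_trained_total(num_tasks=9, num_modalities=3)
print(f"separate models:  {separate:,} parameters")
print(f"co-trained model: {shared:,} parameters")
print(f"ratio: {separate / shared:.1f}x")
```

Because the encoder dominates the parameter count, the ratio in this sketch stays close to the number of tasks as long as the tokenizers and heads are small.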

	## Getting Started
	The following command installs the packages required for PolyViT:
	```shell
	$ pip install -r scenic/projects/polyvit/requirements.txt
	```

	PolyViT uses a pretrained ViT on images which can be downloaded or trained using
	[Scenic](https://github.com/google-research/scenic/tree/main/scenic/projects/baselines)
	or the [original implementation](https://github.com/google-research/vision_transformer).

	PolyViT uses the approaches from [MBT](https://github.com/google-research/scenic/tree/main/scenic/projects/mbt)
	and [ViViT](https://github.com/google-research/scenic/tree/main/scenic/projects/vivit)
	for processing and training on audio and video, so please take a look at them
	for more information on the data pipeline.

	The following command trains a PolyViT-B/16:
	```shell
	$ python -m scenic.projects.polyvit.main \
	--config=scenic/projects/polyvit/configs/polyvit_all.py \
	--workdir=polyvit_all/
	```
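
This README does not describe how the different tasks are interleaved during co-training. As a minimal sketch of one possible schedule, the snippet below assumes that each training step draws a single task at random, with probability proportional to a per-task step budget. The function `make_task_schedule`, the task names, and the step budgets are all hypothetical and are not part of the Scenic codebase or the `polyvit_all.py` config.

```python
import random
from collections import Counter


def make_task_schedule(steps_per_task, seed=0):
    """Return a list of task names, one entry per training step.

    Each step samples one task with probability proportional to its step
    budget. This is an assumed scheduling rule, not taken from the README.
    """
    rng = random.Random(seed)
    tasks = list(steps_per_task)
    weights = [steps_per_task[task] for task in tasks]
    total_steps = sum(weights)
    return rng.choices(tasks, weights=weights, k=total_steps)


# Hypothetical per-task step budgets; placeholder values only.
steps_per_task = {"image_task": 300, "video_task": 600, "audio_task": 100}
schedule = make_task_schedule(steps_per_task)

# Show how often each task was selected over the whole run.
print(Counter(schedule))
```

In a real training loop, each entry of `schedule` would select which dataset iterator and output head to use for that step, while the shared encoder is updated on every step.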

	## Checkpoints

	Checkpoints will be released soon.

	## Reference

	If you use PolyViT, please cite it using the following BibTeX entry.

	```bibtex
	@article{likhosherstov2022polyvit,
	title={Polyvit: Co-training vision transformers on images, videos and audio},
	author={Likhosherstov, Valerii and Arnab, Anurag and Choromanski,
	Krzysztof and Lucic, Mario and Tay, Yi and Weller, Adrian
	and Dehghani, Mostafa},
	journal={Transactions on Machine Learning Research},
	year={2022}
	}
	```