Add files using upload-large-folder tool

02e364a verified 21 days ago

5.5 kB


	# BEATs

	[BEATs](https://arxiv.org/abs/2212.09058): Audio Pre-Training with Acoustic Tokenizers

	Official PyTorch implementation and pretrained models of BEATs

	## Pre-Trained and Fine-Tuned Tokenizers and Models
	Iterations \| Tokenizer \| Pre-Trained Model \| AudioSet Fine-Tuned Model 1 \| AudioSet Fine-Tuned Model 2
	\|---\|---\|---\|---\|---
	Iter1 \| Random Projection \| [BEATs_iter1](https://1drv.ms/u/s!AqeByhGUtINrgcpmY7IHhgc9q0pT7Q?e=uQuisJ) \| [Fine-tuned BEATs_iter1 (cpt1)](https://1drv.ms/u/s!AqeByhGUtINrgcpuRfRZmco2XulmFw?e=f2INHa) \| [Fine-tuned BEATs_iter1 (cpt2)](https://1drv.ms/u/s!AqeByhGUtINrgcpyMlTmnRh0Wp_Qgg?e=sgzv8H) \|
	Iter2 \| [Tokenizer_iter2](https://1drv.ms/u/s!AqeByhGUtINrgcpnFGsfd_buKng5Pw?e=avWBJw)\| [BEATs_iter2](https://1drv.ms/u/s!AqeByhGUtINrgcpwwEGgUyiI-jQyQw?e=1rP1RI) \| [Fine-tuned BEATs_iter2 (cpt1)](https://1drv.ms/u/s!AqeByhGUtINrgcp4l547zKa7xPqy8w?e=rsLdPr) \| [Fine-tuned BEATs_iter2 (cpt2)](https://1drv.ms/u/s!AqeByhGUtINrgcp5APbt_2bdIQvX0w?e=2cd2ry) \|
	Iter3 \| [Tokenizer_iter3](https://1drv.ms/u/s!AqeByhGUtINrgcp1DEzUBtzHapxcqw?e=JZI5Uf)\| [BEATs_iter3](https://1drv.ms/u/s!AqeByhGUtINrgcpxJUNDxg4eU0r-vA?e=qezPJ5) \| [Fine-tuned BEATs_iter3 (cpt1)](https://1drv.ms/u/s!AqeByhGUtINrgcplb48ll1zIt82eWQ?e=XyxrX7) \| [Fine-tuned BEATs_iter3 (cpt2)](https://1drv.ms/u/s!AqeByhGUtINrgcptb4S-CeJnlJGtZA?e=2FyDy3) \|
	Iter3+ \| [Tokenizer_iter3+ (AS20K)](https://1drv.ms/u/s!AqeByhGUtINrgcpz_SnXxs0SrwHEwA?e=14nugm)\| [BEATs_iter3+ (AS20K)](https://1drv.ms/u/s!AqeByhGUtINrgcpvdNz8-aYim60CIg?e=53V8pg) \| [Fine-tuned BEATs_iter3+ (AS20K) (cpt1)](https://1drv.ms/u/s!AqeByhGUtINrgcp2YHUCT1uZx2Kysw?e=nvu1Dw) \| [Fine-tuned BEATs_iter3+ (AS20K) (cpt2)](https://1drv.ms/u/s!AqeByhGUtINrgcp092af0h7P3kXKFA?e=kUkPhN) \|
	Iter3+ \| [Tokenizer_iter3+ (AS2M)](https://1drv.ms/u/s!AqeByhGUtINrgcppJUDx2TmXiIMFyQ?e=pJsOLl)\| [BEATs_iter3+ (AS2M)](https://1drv.ms/u/s!AqeByhGUtINrgcpke6_lRSZEKD5j2Q?e=A3FpOf) \| [Fine-tuned BEATs_iter3+ (AS2M) (cpt1)](https://1drv.ms/u/s!AqeByhGUtINrgcpoZecQbiXeaUjN8A?e=DasbeC) \| [Fine-tuned BEATs_iter3+ (AS2M) (cpt2)](https://1drv.ms/u/s!AqeByhGUtINrgcpj8ujXH1YUtxooEg?e=E9Ncea) \|


	### Load Tokenizers

	```python
	import torch
	from Tokenizers import TokenizersConfig, Tokenizers

	# load the pre-trained checkpoints
	checkpoint = torch.load('/path/to/tokenizer.pt')

	cfg = TokenizersConfig(checkpoint['cfg'])
	BEATs_tokenizer = Tokenizers(cfg)
	BEATs_tokenizer.load_state_dict(checkpoint['model'])
	BEATs_tokenizer.eval()

	# tokenize the audio and generate the labels
	audio_input_16khz = torch.randn(1, 10000)
	padding_mask = torch.zeros(1, 10000).bool()

	labels = BEATs_tokenizer.extract_labels(audio_input_16khz, padding_mask=padding_mask)
	```


	### Load Pre-Trained Models

	```python
	import torch
	from BEATs import BEATs, BEATsConfig

	# load the pre-trained checkpoints
	checkpoint = torch.load('/path/to/model.pt')

	cfg = BEATsConfig(checkpoint['cfg'])
	BEATs_model = BEATs(cfg)
	BEATs_model.load_state_dict(checkpoint['model'])
	BEATs_model.eval()

	# extract the the audio representation
	audio_input_16khz = torch.randn(1, 10000)
	padding_mask = torch.zeros(1, 10000).bool()

	representation = BEATs_model.extract_features(audio_input_16khz, padding_mask=padding_mask)[0]
	```


	### Load Fine-tuned Models

	```python
	import torch
	from BEATs import BEATs, BEATsConfig

	# load the fine-tuned checkpoints
	checkpoint = torch.load('/path/to/model.pt')

	cfg = BEATsConfig(checkpoint['cfg'])
	BEATs_model = BEATs(cfg)
	BEATs_model.load_state_dict(checkpoint['model'])
	BEATs_model.eval()

	# predict the classification probability of each class
	audio_input_16khz = torch.randn(3, 10000)
	padding_mask = torch.zeros(3, 10000).bool()

	probs = BEATs_model.extract_features(audio_input_16khz, padding_mask=padding_mask)[0]

	for i, (top5_label_prob, top5_label_idx) in enumerate(zip(*probs.topk(k=5))):
	top5_label = [checkpoint['label_dict'][label_idx.item()] for label_idx in top5_label_idx]
	print(f'Top 5 predicted labels of the {i}th audio are {top5_label} with probability of {top5_label_prob}')
	```

	## Evaluation Results

	### Comparing with the SOTA Single Models
	![alt text](Evaluation_Results/Comparing_with_the_SOTA_Single_Models.png)


	### Comparing with the SOTA Ensemble Models
	![alt text](Evaluation_Results/Comparing_with_the_SOTA_Ensemble_Models.png)


	### Comparing Different BEATS Tokenizers
	![alt text](Evaluation_Results/Comparing_Different_BEATS_Tokenizers.png)


	### Comparing Different Pre-Training Targets
	![alt text](Evaluation_Results/Comparing_Different_Pre-Training_Targets.png)


	## License
	This project is licensed under the license found in the LICENSE file in the root directory of this source tree.
	Portions of the source code are based on the [FAIRSEQ](https://github.com/pytorch/fairseq) and [VQGAN](https://github.com/CompVis/taming-transformers) project.

	[Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct)


	### Reference
	If you find our work is useful in your research, please cite the following paper:
	``` latex
	@article{Chen2022beats,
	title = {BEATs: Audio Pre-Training with Acoustic Tokenizers},
	author = {Sanyuan Chen and Yu Wu and Chengyi Wang and Shujie Liu and Daniel Tompkins and Zhuo Chen and Furu Wei},
	eprint={2212.09058},
	archivePrefix={arXiv},
	year={2022}
	}
	```
	### Contact Information

	For help or issues using BEATs models, please submit a GitHub issue.

	For other communications related to BEATs, please contact Yu Wu (`yuwu1@microsoft.com`).