---
license: apache-2.0
tags:
- Audio
---

# Dasheng: a large-scale general-purpose audio encoder

Dasheng (**D**eep **A**udio-**S**ignal **H**olistic **E**mbeddi**ng**s), or “大声” ("great sound"), is a general-purpose audio encoder trained with large-scale self-supervised learning. It is designed to capture rich audio information across domains including speech, music, and environmental sounds. Trained on 272,356 hours of diverse audio with 1.2 billion parameters, Dasheng achieves significant performance gains on the [HEAR benchmark](https://hearbenchmark.com/): it outperforms previous work on CREMA-D, LibriCount, Speech Commands, and VoxLingua, and is competitive on music and environmental sound classification tasks.

![dasheng](https://raw.githubusercontent.com/jimbozhang/hf_transformers_custom_model_dasheng/main/pic/hear_eval.png)

## Usage

### Install

```bash
pip install git+https://github.com/jimbozhang/hf_transformers_custom_model_dasheng.git
```

### Inference

```python
>>> model_name = "mispeech/dasheng-base"

>>> from dasheng_model.feature_extraction_dasheng import DashengFeatureExtractor
>>> from dasheng_model.modeling_dasheng import DashengModel

>>> feature_extractor = DashengFeatureExtractor.from_pretrained(model_name)
>>> model = DashengModel.from_pretrained(model_name, outputdim=None)  # no linear output layer if `outputdim` is `None`

>>> import torchaudio
>>> audio, sampling_rate = torchaudio.load("resources/JeD5V5aaaoI_931_932.wav")
>>> assert sampling_rate == 16000
>>> audio.shape
torch.Size([1, 16000])  # mono audio of 1 second

>>> inputs = feature_extractor(audio, sampling_rate=sampling_rate, return_tensors="pt")
>>> inputs.input_values.shape
torch.Size([1, 64, 101])  # 64 mel-filterbanks, 101 frames

>>> import torch
>>> with torch.no_grad():
...     outputs = model(**inputs)

>>> outputs.hidden_states.shape
torch.Size([1, 25, 768])  # 25 time-frequency patches (patch size 64x4, no overlap), before mean-pooling

>>> outputs.logits.shape
torch.Size([1, 768])  # mean-pooled embedding (would be logits from a linear layer if `outputdim` were set)
```
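The tensor shapes above follow from standard mel-spectrogram framing and the stated 64x4 patch size. A minimal sketch of the arithmetic, assuming a 10 ms hop (160 samples at 16 kHz) with centered framing — the hop size is not stated here, but it is the value implied by 101 frames per second of audio:

```python
# Shape arithmetic behind the inference example above.
# Assumption: 10 ms hop (160 samples at 16 kHz) with centered framing;
# this is not stated in the README but matches 101 frames per second.
sampling_rate = 16000
hop_length = 160   # assumed: 10 ms hop
patch_time = 4     # patch size 64x4: all 64 mel bins x 4 frames, no overlap

n_samples = sampling_rate * 1           # 1 second of mono audio
n_frames = n_samples // hop_length + 1  # centered framing -> 101 frames
n_patches = n_frames // patch_time      # non-overlapping 4-frame patches -> 25

print(n_frames, n_patches)  # 101 25
```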

### Fine-tuning

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/jimbozhang/hf_transformers_custom_model_dasheng/blob/main/example_finetune_esc50.ipynb)

[`example_finetune_esc50.ipynb`](https://github.com/jimbozhang/hf_transformers_custom_model_dasheng/blob/main/example_finetune_esc50.ipynb) demonstrates how to train a linear head on the ESC-50 dataset with the Dasheng encoder frozen.
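The frozen-encoder setup is the usual linear-probe pattern: disable gradients on the encoder and optimize only a small classification head on top of the pooled embedding. A minimal PyTorch sketch of that pattern, using a hypothetical stand-in encoder (the notebook itself uses `DashengModel`, whose 768-dimensional pooled output this mimics):

```python
import torch
import torch.nn as nn

EMBED_DIM = 768   # Dasheng's pooled embedding size (see the inference example)
NUM_CLASSES = 50  # ESC-50 has 50 classes

# Stand-in for the pretrained encoder: anything mapping audio -> (batch, 768).
# In the notebook this role is played by the actual Dasheng model.
encoder = nn.Linear(16000, EMBED_DIM)

# Freeze the encoder: no gradients, and eval() to fix dropout/batchnorm.
encoder.requires_grad_(False)
encoder.eval()

# Only the linear head is trained.
head = nn.Linear(EMBED_DIM, NUM_CLASSES)
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)

# One training step on a dummy batch of eight 1-second clips.
audio = torch.randn(8, 16000)
labels = torch.randint(0, NUM_CLASSES, (8,))

with torch.no_grad():          # the frozen encoder runs without autograd
    embeddings = encoder(audio)
logits = head(embeddings)
loss = nn.functional.cross_entropy(logits, labels)
loss.backward()                # gradients flow into the head only
optimizer.step()
```

Because the encoder is frozen, its embeddings can also be precomputed once for the whole dataset, which makes the probe very cheap to train.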

## Citation

If you find Dasheng useful in your research, please consider citing the following paper:

```bibtex
@inproceedings{dinkel2023scaling,
  title={Scaling up masked audio encoder learning for general audio classification},
  author={Dinkel, Heinrich and Yan, Zhiyong and Wang, Yongqing and Zhang, Junbo and Wang, Yujun and Wang, Bin},
  booktitle={Interspeech 2024},
  year={2024}
}
```