---
license: apache-2.0
tags:
- Audio
---

# Dasheng: a large-scale general-purpose audio encoder

Dasheng (**D**eep **A**udio-**S**ignal **H**olistic **E**mbeddi**ng**s), or “大声” ("great sound"), is a general-purpose audio encoder trained with large-scale self-supervised learning. It is designed to capture rich audio information across domains including speech, music, and environmental sounds. Trained on 272,356 hours of diverse audio with 1.2 billion parameters, Dasheng achieves significant performance gains on the [HEAR benchmark](https://hearbenchmark.com/): it outperforms previous work on CREMA-D, LibriCount, Speech Commands, and VoxLingua, and is competitive on music and environmental sound classification tasks.

![dasheng](https://raw.githubusercontent.com/jimbozhang/hf_transformers_custom_model_dasheng/main/pic/hear_eval.png)

## Usage

### Install

```bash
pip install git+https://github.com/jimbozhang/hf_transformers_custom_model_dasheng.git
```

### Inference

```python
>>> model_name = "mispeech/dasheng-base"

>>> from dasheng_model.feature_extraction_dasheng import DashengFeatureExtractor
>>> from dasheng_model.modeling_dasheng import DashengModel

>>> feature_extractor = DashengFeatureExtractor.from_pretrained(model_name)
>>> model = DashengModel.from_pretrained(model_name, outputdim=None)  # no linear output layer if `outputdim` is `None`

>>> import torchaudio
>>> audio, sampling_rate = torchaudio.load("resources/JeD5V5aaaoI_931_932.wav")
>>> assert sampling_rate == 16000
>>> audio.shape
torch.Size([1, 16000])  # mono audio of 1 second

>>> inputs = feature_extractor(audio, sampling_rate=sampling_rate, return_tensors="pt")
>>> inputs.input_values.shape
torch.Size([1, 64, 101])  # 64 mel-filterbanks, 101 frames

>>> import torch
>>> with torch.no_grad():
...     outputs = model(**inputs)

>>> outputs.hidden_states.shape
torch.Size([1, 25, 768])  # 25 time-frequency patches (patch size 64x4, no overlap), before mean-pooling

>>> outputs.logits.shape
torch.Size([1, 768])  # mean-pooled embedding (would be logits from a linear layer if `outputdim` were set)
```
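The tensor shapes above follow from standard mel-spectrogram framing and the stated 64x4 patch size. A minimal sketch of the arithmetic, assuming a 10 ms hop (160 samples at 16 kHz) with centered framing — the hop size is not stated here, but it is the value implied by 101 frames per second of audio:

```python
# Shape arithmetic behind the inference example above.
# Assumption: 10 ms hop (160 samples at 16 kHz) with centered framing;
# this is not stated in the README but matches 101 frames per second.
sampling_rate = 16000
hop_length = 160   # assumed: 10 ms hop
patch_time = 4     # patch size 64x4: all 64 mel bins x 4 frames, no overlap

n_samples = sampling_rate * 1           # 1 second of mono audio
n_frames = n_samples // hop_length + 1  # centered framing -> 101 frames
n_patches = n_frames // patch_time      # non-overlapping 4-frame patches -> 25

print(n_frames, n_patches)  # 101 25
```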

### Fine-tuning

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/jimbozhang/hf_transformers_custom_model_dasheng/blob/main/example_finetune_esc50.ipynb)

[`example_finetune_esc50.ipynb`](https://github.com/jimbozhang/hf_transformers_custom_model_dasheng/blob/main/example_finetune_esc50.ipynb) demonstrates how to train a linear head on the ESC-50 dataset with the Dasheng encoder frozen.
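The frozen-encoder setup is the usual linear-probe pattern: disable gradients on the encoder and optimize only a small classification head on top of the pooled embedding. A minimal PyTorch sketch of that pattern, using a hypothetical stand-in encoder (the notebook itself uses `DashengModel`, whose 768-dimensional pooled output this mimics):

```python
import torch
import torch.nn as nn

EMBED_DIM = 768   # Dasheng's pooled embedding size (see the inference example)
NUM_CLASSES = 50  # ESC-50 has 50 classes

# Stand-in for the pretrained encoder: anything mapping audio -> (batch, 768).
# In the notebook this role is played by the actual Dasheng model.
encoder = nn.Linear(16000, EMBED_DIM)

# Freeze the encoder: no gradients, and eval() to fix dropout/batchnorm.
encoder.requires_grad_(False)
encoder.eval()

# Only the linear head is trained.
head = nn.Linear(EMBED_DIM, NUM_CLASSES)
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)

# One training step on a dummy batch of eight 1-second clips.
audio = torch.randn(8, 16000)
labels = torch.randint(0, NUM_CLASSES, (8,))

with torch.no_grad():          # the frozen encoder runs without autograd
    embeddings = encoder(audio)
logits = head(embeddings)
loss = nn.functional.cross_entropy(logits, labels)
loss.backward()                # gradients flow into the head only
optimizer.step()
```

Because the encoder is frozen, its embeddings can also be precomputed once for the whole dataset, which makes the probe very cheap to train.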

## Citation

If you find Dasheng useful in your research, please consider citing the following paper:

```bibtex
@inproceedings{dinkel2023scaling,
  title={Scaling up masked audio encoder learning for general audio classification},
  author={Dinkel, Heinrich and Yan, Zhiyong and Wang, Yongqing and Zhang, Junbo and Wang, Yujun and Wang, Bin},
  booktitle={Interspeech 2024},
  year={2024}
}
```