oriyonay committed
Commit 3ac1a9e · verified · 1 Parent(s): 31ee182

Update README.md

Files changed (1):
  1. README.md +43 -5
README.md CHANGED
@@ -1,9 +1,47 @@
  ---
  tags:
- - pytorch_model_hub_mixin
- - model_hub_mixin
+ - audio
+ - music
+ - contrastive-learning
+ - self-supervised
+ - vision-transformer
+ library_name: nnAudio
+ license: mit
  ---

- This model has been pushed to the Hub using the [PytorchModelHubMixin](https://huggingface.co/docs/huggingface_hub/package_reference/mixins#huggingface_hub.PyTorchModelHubMixin) integration:
- - Library: [More Information Needed]
- - Docs: [More Information Needed]
+ # Myna: Masking-Based Contrastive Learning of Musical Representations
+
+ ## Model Overview
+ Myna is a self-supervised contrastive model designed for musical representation learning. It employs a Vision Transformer (ViT) backbone on mel-spectrograms and introduces token masking as its primary augmentation method. Unlike traditional contrastive learning frameworks that rely on augmentations such as pitch shifts, Myna retains pitch sensitivity, leading to improvements in key detection tasks.
+
+ ## Abstract
+ In this paper, we present Myna, a simple yet effective approach for self-supervised musical representation learning. Built on a contrastive learning framework, Myna introduces two key innovations:
+ 1. The use of a **Vision Transformer (ViT)** on mel-spectrograms as the backbone, replacing SampleCNN on raw audio.
+ 2. A novel **token masking** strategy that masks 90% of spectrogram tokens (e.g., 16x16 patches).
+
+ These innovations deliver both **effectiveness and efficiency**:
+ - **Token masking** enables a significant increase in per-GPU batch size, from 48 or 120 in traditional contrastive methods (e.g., CLMR, MULE) to 4096 (see the sketch after this list).
+ - **Avoiding traditional augmentations** (e.g., pitch shifts) retains pitch sensitivity, enhancing performance in tasks like key detection.
+ - The use of **vertical patches (128x2 instead of 16x16)** allows the model to better capture features critical for key detection.
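+
+ To make the masking concrete, here is a minimal sketch of MAE-style random token dropping (an illustration under our reading of the approach, not the released Myna code; the function name and shapes are ours):
+
+ ```python
+ import torch
+
+ def random_token_mask(tokens: torch.Tensor, keep_ratio: float = 0.1) -> torch.Tensor:
+     # tokens: (batch, num_tokens, dim) patch embeddings of a mel-spectrogram.
+     # Keep a random ~10% of tokens per example and drop the rest, so the
+     # ViT encoder only ever processes the kept tokens.
+     b, n, d = tokens.shape
+     n_keep = max(1, int(n * keep_ratio))
+     scores = torch.rand(b, n, device=tokens.device)  # one random score per token
+     keep_idx = scores.argsort(dim=1)[:, :n_keep]     # random subset of token indices
+     return tokens.gather(1, keep_idx.unsqueeze(-1).expand(-1, -1, d))
+
+ # A 128x96 spectrogram cut into 16x16 patches yields 8 * 6 = 48 tokens:
+ kept = random_token_mask(torch.randn(4, 48, 384))    # -> (4, 4, 384)
+ ```
+
+ Since roughly 90% of tokens never enter the encoder, each example costs a fraction of the usual compute and memory, which is what makes the 4096 per-GPU batch size feasible.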
+
+ Our hybrid model, **Myna-22M-Hybrid**, processes both 16x16 and 128x2 patches, achieving **state-of-the-art results**. Trained on a single GPU, it outperforms MULE (62M) and rivals MERT-95M, which were trained on 16 and 64 GPUs, respectively. Additionally, it surpasses MERT-95M-public, establishing itself as the best-performing model trained on publicly available data.
+
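+ To picture the two patch shapes, here is a small patchification sketch (illustrative only; this helper is not from the released code):
+
+ ```python
+ import torch
+
+ def patchify(spec: torch.Tensor, ph: int, pw: int) -> torch.Tensor:
+     # spec: (batch, 1, 128, 96) mel-spectrogram -> (batch, num_patches, ph*pw)
+     patches = spec.unfold(2, ph, ph).unfold(3, pw, pw)
+     return patches.reshape(spec.shape[0], -1, ph * pw)
+
+ spec = torch.randn(2, 1, 128, 96)
+ square = patchify(spec, 16, 16)    # (2, 48, 256): standard ViT-style patches
+ vertical = patchify(spec, 128, 2)  # (2, 48, 256): full-height, 2-frame-wide patches
+ ```
+
+ Both shapes cover 256 spectrogram bins per patch, so either can feed the same token embedding; the 128x2 patches span the entire frequency axis at once, which matches the intuition that full-pitch-range context helps key detection.
+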
+ ## Installation
+ To use Myna, install the necessary dependencies:
+
+ ```bash
+ pip3 install -q nnAudio transformers torch
+ ```
+
+ ## Usage
+ ```python
+ import torch
+ from transformers import AutoModel
+
+ model = AutoModel.from_pretrained('oriyonay/myna-base')
+
+ # Myna supports unbatched (2D) and batched (3D or 4D) inputs:
+ output = model(torch.randn(128, 96))        # shape (1, 384)
+ output = model(torch.randn(2, 128, 96))     # shape (2, 384)
+ output = model(torch.randn(2, 1, 128, 96))  # shape (2, 384)
+
+ # Additionally, you can load audio directly from a file:
+ output = model.from_file('your_file.wav')   # shape (n_chunks, 384)
+ ```
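+
+ The model consumes 128-band mel-spectrograms in 96-frame chunks. The sketch below shows one way to prepare such input with nnAudio; the sample rate, FFT size, and hop length are placeholder assumptions (this README does not specify them), so match them to the settings `from_file` uses before relying on the embeddings:
+
+ ```python
+ import torch
+ from nnAudio import features
+
+ # Assumed front-end settings -- not documented in this README; adjust as needed.
+ to_mel = features.MelSpectrogram(sr=16000, n_fft=1024, hop_length=160, n_mels=128)
+
+ waveform = torch.randn(1, 16000)  # one second of placeholder audio
+ spec = to_mel(waveform)           # (1, 128, num_frames)
+ chunk = spec[..., :96]            # take a single 96-frame chunk
+ embedding = model(chunk)          # (1, 384)
+ ```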