oriyonay committed
Commit 3ac1a9e · verified · 1 Parent(s): 31ee182

Update README.md

Files changed (1):
  1. README.md +43 -5
README.md CHANGED
@@ -1,9 +1,47 @@
  ---
  tags:
- - pytorch_model_hub_mixin
- - model_hub_mixin
+ - audio
+ - music
+ - contrastive-learning
+ - self-supervised
+ - vision-transformer
+ library_name: nnAudio
+ license: mit
  ---

- This model has been pushed to the Hub using the [PytorchModelHubMixin](https://huggingface.co/docs/huggingface_hub/package_reference/mixins#huggingface_hub.PyTorchModelHubMixin) integration:
- - Library: [More Information Needed]
- - Docs: [More Information Needed]
+ # Myna: Masking-Based Contrastive Learning of Musical Representations
+
+ ## Model Overview
+ Myna is a self-supervised contrastive model designed for musical representation learning. It employs a Vision Transformer (ViT) backbone on mel-spectrograms and introduces token masking as its primary augmentation method. Unlike traditional contrastive learning frameworks that rely on augmentations such as pitch shifts, Myna retains pitch sensitivity, leading to improvements in key detection tasks.
+
+ ## Abstract
+ In this paper, we present Myna, a simple yet effective approach for self-supervised musical representation learning. Built on a contrastive learning framework, Myna introduces two key innovations:
+ 1. The use of a **Vision Transformer (ViT)** on mel-spectrograms as the backbone, replacing SampleCNN on raw audio.
+ 2. A novel **token masking** strategy that masks 90% of spectrogram tokens (e.g., 16x16 patches).
+
+ These innovations deliver both **effectiveness and efficiency**:
+ - **Token masking** enables a significant increase in per-GPU batch size, from 48 or 120 in traditional contrastive methods (e.g., CLMR, MULE) to 4096 (see the sketch after this list).
+ - **Avoiding traditional augmentations** (e.g., pitch shifts) retains pitch sensitivity, enhancing performance in tasks like key detection.
+ - The use of **vertical patches (128x2 instead of 16x16)** allows the model to better capture features critical for key detection.
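+
+ To make the masking concrete, here is a minimal sketch of MAE-style random token dropping (an illustration under our reading of the approach, not the released Myna code; the function name and shapes are ours):
+
+ ```python
+ import torch
+
+ def random_token_mask(tokens: torch.Tensor, keep_ratio: float = 0.1) -> torch.Tensor:
+     # tokens: (batch, num_tokens, dim) patch embeddings of a mel-spectrogram.
+     # Keep a random ~10% of tokens per example and drop the rest, so the
+     # ViT encoder only ever processes the kept tokens.
+     b, n, d = tokens.shape
+     n_keep = max(1, int(n * keep_ratio))
+     scores = torch.rand(b, n, device=tokens.device)  # one random score per token
+     keep_idx = scores.argsort(dim=1)[:, :n_keep]     # random subset of token indices
+     return tokens.gather(1, keep_idx.unsqueeze(-1).expand(-1, -1, d))
+
+ # A 128x96 spectrogram cut into 16x16 patches yields 8 * 6 = 48 tokens:
+ kept = random_token_mask(torch.randn(4, 48, 384))    # -> (4, 4, 384)
+ ```
+
+ Since roughly 90% of tokens never enter the encoder, each example costs a fraction of the usual compute and memory, which is what makes the 4096 per-GPU batch size feasible.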
+
+ Our hybrid model, **Myna-22M-Hybrid**, processes both 16x16 and 128x2 patches, achieving **state-of-the-art results**. Trained on a single GPU, it outperforms MULE (62M) and rivals MERT-95M, which were trained on 16 and 64 GPUs, respectively. Additionally, it surpasses MERT-95M-public, establishing itself as the best-performing model trained on publicly available data.
+
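+ To picture the two patch shapes, here is a small patchification sketch (illustrative only; this helper is not from the released code):
+
+ ```python
+ import torch
+
+ def patchify(spec: torch.Tensor, ph: int, pw: int) -> torch.Tensor:
+     # spec: (batch, 1, 128, 96) mel-spectrogram -> (batch, num_patches, ph*pw)
+     patches = spec.unfold(2, ph, ph).unfold(3, pw, pw)
+     return patches.reshape(spec.shape[0], -1, ph * pw)
+
+ spec = torch.randn(2, 1, 128, 96)
+ square = patchify(spec, 16, 16)    # (2, 48, 256): standard ViT-style patches
+ vertical = patchify(spec, 128, 2)  # (2, 48, 256): full-height, 2-frame-wide patches
+ ```
+
+ Both shapes cover 256 spectrogram bins per patch, so either can feed the same token embedding; the 128x2 patches span the entire frequency axis at once, which matches the intuition that full-pitch-range context helps key detection.
+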
+ ## Installation
+ To use Myna, install the necessary dependencies:
+
+ ```bash
+ pip3 install -q nnAudio transformers torch
+ ```
+
+ ## Usage
+ ```python
+ import torch
+ from transformers import AutoModel
+
+ model = AutoModel.from_pretrained('oriyonay/myna-base')
+
+ # Myna supports unbatched (2D) and batched (3D or 4D) inputs:
+ output = model(torch.randn(128, 96))        # shape (1, 384)
+ output = model(torch.randn(2, 128, 96))     # shape (2, 384)
+ output = model(torch.randn(2, 1, 128, 96))  # shape (2, 384)
+
+ # Additionally, you can load audio directly from a file:
+ output = model.from_file('your_file.wav')   # shape (n_chunks, 384)
+ ```
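+
+ The model consumes 128-band mel-spectrograms in 96-frame chunks. The sketch below shows one way to prepare such input with nnAudio; the sample rate, FFT size, and hop length are placeholder assumptions (this README does not specify them), so match them to the settings `from_file` uses before relying on the embeddings:
+
+ ```python
+ import torch
+ from nnAudio import features
+
+ # Assumed front-end settings -- not documented in this README; adjust as needed.
+ to_mel = features.MelSpectrogram(sr=16000, n_fft=1024, hop_length=160, n_mels=128)
+
+ waveform = torch.randn(1, 16000)  # one second of placeholder audio
+ spec = to_mel(waveform)           # (1, 128, num_frames)
+ chunk = spec[..., :96]            # take a single 96-frame chunk
+ embedding = model(chunk)          # (1, 384)
+ ```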