Upload MusicBERT base model (GigaMIDI REMI+BPE, 130K steps)
- README.md +50 -149
- config.json +1 -1
- model.safetensors +2 -2
- special_tokens_map.json +5 -35
- tokenizer_config.json +1 -1
README.md
CHANGED
@@ -6,173 +6,74 @@ tags:
 - remi
 - midi
 - symbolic-music
-- symbolic-music
 - gigamidi
 library_name: transformers
 pipeline_tag: fill-mask
 license: mit
 datasets:
 - Metacreation/GigaMIDI
-metrics:
-- perplexity
 ---
 
-#
+# musicbert
 
 ## Model Description
+MusicBERT base is a 12-layer BERT-style masked language model trained on REMI+BPE
+symbolic music sequences extracted from the [GigaMIDI](https://huggingface.co/datasets/Metacreation/GigaMIDI)
+corpus. It is suited to symbolic music understanding and fill-mask infilling,
+and can serve as a backbone for downstream generative tasks.
 
-- **Model Type**: BERT for Masked Language Modeling
-- **Training Dataset**: GigaMIDI v1.1.0 (~1.7M MIDI files)
-- **Tokenization**: REMI → BPE (vocab size: 50000)
-- **Architecture**: base BERT
-- **Parameters**: ~800M parameters
-- **Training Steps**: 85000
-- **Final Loss**: unknown
+- **Checkpoint**: 130000 steps
+- **Hidden size**: 768
+- **Parameters**: ~120M
+- **Validation loss**: 1.5093
 
-## Training Details
-
-### Dataset Preprocessing
-1. **REMI Tokenization**: MIDI files converted to REMI tokens (vocab: 532)
-2. **BPE Encoding**: REMI tokens compressed using BPE with 50000 vocabulary
-3. **Sequence Length**: 1024 tokens
-4. **Max Events per MIDI**: 2048
+## Training Configuration
+- **Objective**: Masked language modeling with span-aware masking
+- **Dataset**: GigaMIDI (REMI tokens → BPE, vocab size 50000)
+- **Sequence length**: 1024
+- **Max events per MIDI**: 2048
 
-- **Training Framework**: HuggingFace Transformers + PyTorch
-
-## Usage
+## Inference Example
 
+### Using with MIDI files
 ```python
+import random
+
+import torch
+from transformers import BertForMaskedLM
+from miditok import MusicTokenizer
 
 # Load model and tokenizer
-remi_config = miditok.TokenizerConfig(
-    use_tempos=True,
-    use_time_signatures=True,
-    use_programs=True,
-    beat_res={(0, 4): 8, (4, 12): 4},
-    nb_tempos=32,
-    tempo_range=(40, 250)
-)
-remi_tokenizer = miditok.REMI(remi_config)
-
-# Example: Process a MIDI file
-midi_file = "path/to/your/music.mid"
-
-# Step 1: Load MIDI and convert to REMI tokens
-score = symusic.Score.from_file(midi_file)
-remi_tokens = remi_tokenizer.encode(score)
-remi_ids = remi_tokens[0].ids  # Extract token IDs
-print(f"REMI tokens: {remi_ids[:10]}")  # First 10 tokens
-
-# Step 2: Convert REMI tokens to BPE for model input
-remi_text = " ".join(map(str, remi_ids))
-inputs = tokenizer(remi_text, return_tensors="pt", truncation=True, max_length=1024)
-
-# Step 3: Use fill-mask pipeline for predictions
-fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
-masked_text = remi_text.replace(str(remi_ids[5]), "<MASK>")  # Mask the 6th token
-results = fill_mask(masked_text)
-
-print("Predicted tokens:")
-for result in results[:3]:
-    print(f"Score: {result['score']:.3f} - Token: {result['token_str']}")
-```
-
-### Simple Fill-Mask Usage
-
-```python
-from transformers import pipeline
-
-# Quick start with pipeline
-fill_mask = pipeline("fill-mask", model="manoskary/musicbert")
-
-# Example with REMI token sequence (musical events)
-prompt = "14 40 31 <MASK> 14 40 149"
-results = fill_mask(prompt)
-
-print("Top predictions:")
-for result in results[:3]:
-    print(f"{result['token_str']} (confidence: {result['score']:.1%})")
+model = BertForMaskedLM.from_pretrained("manoskary/musicbert")
+tokenizer = MusicTokenizer.from_pretrained("manoskary/miditok-REMI")
+
+# Convert MIDI to BPE tokens (MIDI → REMI → BPE pipeline)
+midi_path = "path/to/your/file.mid"
+tok_seq = tokenizer(midi_path)
+bpe_ids = tok_seq.ids
+
+# Mask some tokens for prediction
+mask_token_id = 3  # MASK_None token
+input_ids = bpe_ids.copy()
+mask_positions = random.sample(range(1, len(input_ids) - 1), k=5)
+for pos in mask_positions:
+    input_ids[pos] = mask_token_id
+
+# Run inference
+input_tensor = torch.tensor([input_ids])
+with torch.no_grad():
+    outputs = model(input_tensor)
+predictions = outputs.logits[0, mask_positions, :].argmax(dim=-1)
+
+print("Predicted token IDs:", predictions.tolist())
 ```
 
-## Tokenization Details
-
-The model uses a two-stage tokenization process:
-
-  - Program changes, tempo changes
-2. **BPE Encoding**: Compresses REMI sequences
-  - Learns common musical patterns
-  - Reduces sequence length
-  - Improves training efficiency
-
-### Special Tokens
-- `<PAD>`: Padding token
-- `<UNK>`: Unknown token
-- `<CLS>`: Classification token
-- `<SEP>`: Separator token
-- `<MASK>`: Mask token for MLM
-
-## Limitations
-
-- Trained primarily on Western music (GigaMIDI dataset bias)
-- Limited to symbolic (MIDI) music representation
-- Maximum sequence length of 1024 tokens
-- May not capture very long musical dependencies
-
-## Aknowledgments
-
-The original implementation of MusicBert is implemented [here](https://github.com/microsoft/muzic/tree/main/musicbert).
-
-If you use MusicBert in your work remember to cite the original work:
-```bibtex
-@inproceedings{zeng2021musicbert,
-  title={Musicbert: Symbolic music understanding with large-scale pre-training},
-  author={Zeng, Mingliang and Tan, Xu and Wang, Rui and Ju, Zeqian and Qin, Tao and Liu, Tie-Yan},
-  booktitle={Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021},
-  year={2021}
-}
-```
+## Limitations and Risks
+- Model is trained purely on symbolic data; it does not produce audio directly.
+- The GigaMIDI dataset is biased towards Western tonal music.
+- Long-form structure beyond 1024 tokens requires chunking or iterative decoding.
+- Generated continuations may need post-processing to ensure musical coherence.
 
+## Citation
+If you use this checkpoint, please cite the original MusicBERT paper and the
+GigaMIDI dataset.
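The updated README notes that material longer than the 1024-token limit needs chunking. A minimal sketch of one way to window a BPE id sequence; the `chunk_ids` helper and the overlap (stride) value are illustrative assumptions, not part of the model card:

```python
# Sketch: split a long BPE id sequence into windows of at most 1024 ids
# for a model with a 1024-token limit. Stride 768 (overlap 256) is an
# assumed value, not something the model card specifies.

def chunk_ids(ids, window=1024, stride=768):
    """Yield fixed-size windows over a token id list.

    Consecutive windows overlap by `window - stride` ids so boundary
    positions appear in two windows; the final window may be shorter.
    """
    if len(ids) <= window:
        return [ids]
    return [ids[start:start + window]
            for start in range(0, len(ids) - window + stride, stride)]

chunks = chunk_ids(list(range(2000)))
print([len(c) for c in chunks])  # → [1024, 1024, 464]
```

Each window can then be fed to the model independently, with predictions from overlapping regions reconciled downstream.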
config.json
CHANGED
|
@@ -20,5 +20,5 @@
|
|
| 20 |
"transformers_version": "4.52.4",
|
| 21 |
"type_vocab_size": 1,
|
| 22 |
"use_cache": false,
|
| 23 |
-
"vocab_size":
|
| 24 |
}
|
|
|
|
| 20 |
"transformers_version": "4.52.4",
|
| 21 |
"type_vocab_size": 1,
|
| 22 |
"use_cache": false,
|
| 23 |
+
"vocab_size": 40000
|
| 24 |
}
|
model.safetensors
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
-
size
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:307b740e36b0a7edb427c09a33c55376c4197133bd7bbbebbef0ab11f8185bf2
|
| 3 |
+
size 471950760
|
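The size recorded in the LFS pointer above allows a rough cross-check of the parameter count: a `.safetensors` file stores raw tensor bytes plus a small JSON header, so assuming fp32 weights (an assumption; the card does not state the checkpoint's dtype) the parameter count is approximately the byte size divided by 4.

```python
# Rough parameter-count estimate from the LFS pointer's byte size,
# assuming 4 bytes per parameter (fp32) and ignoring the small header.
size_bytes = 471_950_760
approx_params = size_bytes // 4
print(f"~{approx_params / 1e6:.0f}M parameters")  # → ~118M parameters
```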
special_tokens_map.json
CHANGED
@@ -1,37 +1,7 @@
 {
-  "cls_token": {
-    "content": "<CLS>",
-    "lstrip": false,
-    "normalized": false,
-    "rstrip": false,
-    "single_word": false
-  },
-  "mask_token": {
-    "content": "<MASK>",
-    "lstrip": false,
-    "normalized": false,
-    "rstrip": false,
-    "single_word": false
-  },
-  "pad_token": {
-    "content": "<PAD>",
-    "lstrip": false,
-    "normalized": false,
-    "rstrip": false,
-    "single_word": false
-  },
-  "sep_token": {
-    "content": "<SEP>",
-    "lstrip": false,
-    "normalized": false,
-    "rstrip": false,
-    "single_word": false
-  },
-  "unk_token": {
-    "content": "<UNK>",
-    "lstrip": false,
-    "normalized": false,
-    "rstrip": false,
-    "single_word": false
-  }
+  "cls_token": "<CLS>",
+  "mask_token": "<MASK>",
+  "pad_token": "<PAD>",
+  "sep_token": "<SEP>",
+  "unk_token": "<UNK>"
 }
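This change collapses each added-token dict to a bare string, which loses nothing here because every dict in the old file carried only the flag values a plain string implies. A small sketch of that equivalence; the `simplify` helper and `DEFAULT_FLAGS` are illustrative, not a transformers API:

```python
# Flags shown for every token in the old special_tokens_map.json.
DEFAULT_FLAGS = {"lstrip": False, "normalized": False,
                 "rstrip": False, "single_word": False}

def simplify(entry):
    """Collapse a token dict to its string form when only default flags are set."""
    if isinstance(entry, dict) and all(entry.get(k) == v
                                       for k, v in DEFAULT_FLAGS.items()):
        return entry["content"]
    return entry

old_map = {
    "mask_token": {"content": "<MASK>", **DEFAULT_FLAGS},
    "pad_token": {"content": "<PAD>", **DEFAULT_FLAGS},
}
new_map = {k: simplify(v) for k, v in old_map.items()}
print(new_map)  # → {'mask_token': '<MASK>', 'pad_token': '<PAD>'}
```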
tokenizer_config.json
CHANGED
@@ -45,7 +45,7 @@
   "cls_token": "<CLS>",
   "extra_special_tokens": {},
   "mask_token": "<MASK>",
-  "model_max_length":
+  "model_max_length": 1000000000000000019884624838656,
   "pad_token": "<PAD>",
   "sep_token": "<SEP>",
   "tokenizer_class": "PreTrainedTokenizer",
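The odd-looking `model_max_length` is not a real context size: it is the decimal form of transformers' `VERY_LARGE_INTEGER` sentinel, `int(1e30)`, which means no length limit was recorded in the tokenizer config. Callers should truncate to the model's actual 1024-token training length instead; `effective_max_length` below is a hypothetical helper sketching that, not a transformers API:

```python
# transformers writes int(1e30) into tokenizer_config.json when no
# model_max_length is known; treat it as "unset" and clamp to the
# model's real limit (1024 for this checkpoint, per the README).
SENTINEL = 1_000_000_000_000_000_019_884_624_838_656  # == int(1e30)

def effective_max_length(model_max_length, model_limit=1024):
    if model_max_length >= SENTINEL:
        return model_limit  # sentinel: no limit recorded, use the model's
    return min(model_max_length, model_limit)

print(effective_max_length(SENTINEL))  # → 1024
```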