manoskary committed
Commit 345aee2 · verified · 1 parent: e9d17ab

Upload MusicBERT base model (GigaMIDI REMI+BPE, 130K steps)

Files changed (5):
  1. README.md +50 -149
  2. config.json +1 -1
  3. model.safetensors +2 -2
  4. special_tokens_map.json +5 -35
  5. tokenizer_config.json +1 -1
README.md CHANGED
@@ -6,173 +6,74 @@ tags:
  - remi
  - midi
  - symbolic-music
- - symbolic-music
  - gigamidi
  library_name: transformers
  pipeline_tag: fill-mask
  license: mit
  datasets:
  - Metacreation/GigaMIDI
- metrics:
- - perplexity
  ---
 
- # MusicBERT
 
  ## Model Description
 
- This is a **MusicBERT** model trained on the **GigaMIDI** dataset for symbolic music representation. The model uses REMI (REpresentation for Musical Intelligence) tokenization with BPE encoding for efficient processing of MIDI data.
-
- ## Model Details
-
- - **Model Type**: BERT for masked language modeling
- - **Training Dataset**: GigaMIDI v1.1.0 (~1.7M MIDI files)
- - **Tokenization**: REMI → BPE (vocab size: 50000)
- - **Architecture**: base BERT
- - **Parameters**: ~800M
- - **Training Steps**: 85000
- - **Final Loss**: unknown
-
- ## Training Details
-
- ### Dataset Preprocessing
- 1. **REMI Tokenization**: MIDI files converted to REMI tokens (vocab: 532)
- 2. **BPE Encoding**: REMI tokens compressed using BPE with a 50000-token vocabulary
- 3. **Sequence Length**: 1024 tokens
- 4. **Max Events per MIDI**: 2048
 
- ### Training Configuration
- - **Batch Size**: 64 × 4 (effective)
- - **Learning Rate**: 0.0001
- - **Warmup Steps**: 0
- - **MLM Probability**: 15%, gradually increased to 22.5% towards the end of training
- - **Training Framework**: HuggingFace Transformers + PyTorch
 
- ## Usage
 
- ### Loading the Model
 
  ```python
- from transformers import AutoTokenizer, AutoModelForMaskedLM
 
  # Load model and tokenizer
- tokenizer = AutoTokenizer.from_pretrained("manoskary/musicbert")
- model = AutoModelForMaskedLM.from_pretrained("manoskary/musicbert")
-
- # Example: fill in masked tokens
- inputs = tokenizer("14 40 31 <MASK> 14 40 149", return_tensors="pt")
- outputs = model(**inputs)
- predictions = outputs.logits.argmax(dim=-1)
- predicted_tokens = tokenizer.convert_ids_to_tokens(predictions[0])
- print(predicted_tokens)
- ```
-
- ### Complete MIDI Processing Workflow
-
- ```python
- import symusic
- import miditok
- from transformers import pipeline
-
- # Set up the REMI tokenizer (requires miditok)
- remi_config = miditok.TokenizerConfig(
-     num_velocities=32,
-     use_chords=True,
-     use_rests=True,
-     use_tempos=True,
-     use_time_signatures=True,
-     use_programs=True,
-     beat_res={(0, 4): 8, (4, 12): 4},
-     nb_tempos=32,
-     tempo_range=(40, 250)
- )
- remi_tokenizer = miditok.REMI(remi_config)
-
- # Example: process a MIDI file
- midi_file = "path/to/your/music.mid"
-
- # Step 1: Load the MIDI file and convert it to REMI tokens
- score = symusic.Score.from_file(midi_file)
- remi_tokens = remi_tokenizer.encode(score)
- remi_ids = remi_tokens[0].ids  # Extract token IDs
- print(f"REMI tokens: {remi_ids[:10]}")  # First 10 tokens
-
- # Step 2: Convert REMI tokens to BPE for model input
- remi_text = " ".join(map(str, remi_ids))
- inputs = tokenizer(remi_text, return_tensors="pt", truncation=True, max_length=1024)
-
- # Step 3: Use the fill-mask pipeline for predictions
- fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
- masked_text = remi_text.replace(str(remi_ids[5]), "<MASK>")  # Mask the 6th token
- results = fill_mask(masked_text)
-
- print("Predicted tokens:")
- for result in results[:3]:
-     print(f"Score: {result['score']:.3f} - Token: {result['token_str']}")
- ```
-
- ### Simple Fill-Mask Usage
-
- ```python
- from transformers import pipeline
-
- # Quick start with the pipeline
- fill_mask = pipeline("fill-mask", model="manoskary/musicbert")
-
- # Example with a REMI token sequence (musical events)
- prompt = "14 40 31 <MASK> 14 40 149"
- results = fill_mask(prompt)
-
- print("Top predictions:")
- for result in results[:3]:
-     print(f"{result['token_str']} (confidence: {result['score']:.1%})")
  ```
 
- ## Model Performance
-
- - **Training Loss**: unknown
- - **Validation Loss**: 6.494499206542969
- - **Perplexity**: N/A
-
- ## Tokenization Details
-
- The model uses a two-stage tokenization process:
 
- 1. **REMI Tokenization**: Converts MIDI to symbolic tokens
-    - Bar markers, time signatures, positions
-    - Note events (pitch, velocity, duration)
-    - Program changes, tempo changes
-
- 2. **BPE Encoding**: Compresses REMI sequences
-    - Learns common musical patterns
-    - Reduces sequence length
-    - Improves training efficiency
-
- ### Special Tokens
- - `<PAD>`: Padding token
- - `<UNK>`: Unknown token
- - `<CLS>`: Classification token
- - `<SEP>`: Separator token
- - `<MASK>`: Mask token for MLM
-
- ## Limitations
-
- - Trained primarily on Western music (GigaMIDI dataset bias)
- - Limited to symbolic (MIDI) music representation
- - Maximum sequence length of 1024 tokens
- - May not capture very long musical dependencies
-
- ## Acknowledgments
-
- The original implementation of MusicBERT is available [here](https://github.com/microsoft/muzic/tree/main/musicbert).
-
- If you use MusicBERT in your work, remember to cite the original paper:
- ```bibtex
- @inproceedings{zeng2021musicbert,
-   title={MusicBERT: Symbolic Music Understanding with Large-Scale Pre-Training},
-   author={Zeng, Mingliang and Tan, Xu and Wang, Rui and Ju, Zeqian and Qin, Tao and Liu, Tie-Yan},
-   booktitle={Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021},
-   year={2021}
- }
- ```
  - remi
  - midi
  - symbolic-music
  - gigamidi
  library_name: transformers
  pipeline_tag: fill-mask
  license: mit
  datasets:
  - Metacreation/GigaMIDI
  ---
 
+ # MusicBERT
 
  ## Model Description
+ MusicBERT large is a 24-layer BERT-style masked language model trained on REMI+BPE
+ symbolic music sequences extracted from the [GigaMIDI](https://huggingface.co/datasets/Metacreation/GigaMIDI)
+ corpus. It is tailored for symbolic music understanding, fill-mask-style infilling,
+ and use as a backbone for downstream generative tasks.
 
+ - **Checkpoint**: 130000 steps
+ - **Hidden size**: 768
+ - **Parameters**: ~150M
+ - **Validation loss**: 1.509289264678955
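For context, the perplexity implied by the reported validation loss can be computed directly (a quick sketch; it assumes the loss is the mean masked-token cross-entropy in nats, which is the transformers default):

```python
import math

# Validation loss of the 130K-step checkpoint (from the list above)
val_loss = 1.509289264678955

# Perplexity is the exponential of the mean cross-entropy loss
perplexity = math.exp(val_loss)
print(f"Masked-LM perplexity: {perplexity:.2f}")  # ~4.52
```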
 
+ ## Training Configuration
+ - **Objective**: Masked language modeling with span-aware masking
+ - **Dataset**: GigaMIDI (REMI tokens → BPE, vocab size 50000)
+ - **Sequence length**: 1024
+ - **Max events per MIDI**: 2048
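The span-aware masking objective above can be sketched as follows. This is a minimal illustration on a toy ID sequence; the span-length distribution, masking rate schedule, and token-replacement strategy actually used in training are assumptions:

```python
import random

def span_mask(ids, mask_id, mask_prob=0.15, max_span=3, seed=0):
    """Mask contiguous spans until roughly mask_prob of positions are masked.

    Simplified sketch: real recipes typically also replace some masked
    positions with random or unchanged tokens (the BERT 80/10/10 rule).
    """
    rng = random.Random(seed)
    out = list(ids)
    budget = max(1, int(len(out) * mask_prob))
    masked = set()
    while len(masked) < budget:
        span_len = rng.randint(1, max_span)          # random span length
        start = rng.randrange(len(out))              # random span start
        for pos in range(start, min(start + span_len, len(out))):
            masked.add(pos)
    for pos in masked:
        out[pos] = mask_id
    return out, sorted(masked)

toy_ids = list(range(100, 164))  # a toy 64-token BPE ID sequence
masked_ids, positions = span_mask(toy_ids, mask_id=3)
print(f"Masked {len(positions)} of {len(toy_ids)} positions: {positions}")
```

Masking whole spans rather than isolated tokens forces the model to use musical context rather than trivially copying neighboring events.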
 
+ ## Inference Example
 
+ ### Using with MIDI files
  ```python
+ import random
+
+ import torch
+ from transformers import BertForMaskedLM
+ from miditok import MusicTokenizer
 
  # Load model and tokenizer
+ model = BertForMaskedLM.from_pretrained("manoskary/musicbert")
+ tokenizer = MusicTokenizer.from_pretrained("manoskary/miditok-REMI")
+
+ # Convert MIDI to BPE tokens (MIDI → REMI → BPE pipeline)
+ midi_path = "path/to/your/file.mid"
+ tok_seq = tokenizer(midi_path)
+ bpe_ids = tok_seq.ids
+
+ # Mask a few random tokens for prediction
+ mask_token_id = 3  # MASK_None token
+ input_ids = bpe_ids.copy()
+ mask_positions = random.sample(range(1, len(input_ids) - 1), k=5)
+ for pos in mask_positions:
+     input_ids[pos] = mask_token_id
+
+ # Run inference
+ input_tensor = torch.tensor([input_ids])
+ with torch.no_grad():
+     outputs = model(input_tensor)
+ predictions = outputs.logits[0, mask_positions, :].argmax(dim=-1)
+
+ print("Predicted token IDs:", predictions.tolist())
  ```
 
+ ## Limitations and Risks
+ - The model is trained purely on symbolic data; it does not produce audio directly.
+ - The GigaMIDI dataset is biased towards Western tonal music.
+ - Long-form structure beyond 1024 tokens requires chunking or iterative decoding.
+ - Generated continuations may need post-processing to ensure musical coherence.
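The chunking mentioned above can be as simple as a sliding window over the token IDs; a minimal sketch, where the window and stride values are illustrative (the overlap gives each chunk some left context):

```python
def chunk_ids(ids, window=1024, stride=896):
    """Split a long token-ID sequence into overlapping fixed-size windows."""
    if len(ids) <= window:
        return [ids]
    chunks = []
    # Step by `stride`; each window overlaps the previous by (window - stride)
    for start in range(0, len(ids) - (window - stride), stride):
        chunks.append(ids[start:start + window])
    return chunks

ids = list(range(3000))          # a toy 3000-token sequence
chunks = chunk_ids(ids)
print(len(chunks), [len(c) for c in chunks])
```

Each chunk can then be fed to the model independently, with predictions for overlapping regions taken from the chunk in which they have the most context.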
 
+ ## Citation
+ If you use this checkpoint, please cite the original MusicBERT paper and the
+ GigaMIDI dataset.
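The BibTeX entry for MusicBERT, carried over from the previous revision of this README:

```bibtex
@inproceedings{zeng2021musicbert,
  title={MusicBERT: Symbolic Music Understanding with Large-Scale Pre-Training},
  author={Zeng, Mingliang and Tan, Xu and Wang, Rui and Ju, Zeqian and Qin, Tao and Liu, Tie-Yan},
  booktitle={Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021},
  year={2021}
}
```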
config.json CHANGED
@@ -20,5 +20,5 @@
  "transformers_version": "4.52.4",
  "type_vocab_size": 1,
  "use_cache": false,
- "vocab_size": 540
  }
 
  "transformers_version": "4.52.4",
  "type_vocab_size": 1,
  "use_cache": false,
+ "vocab_size": 40000
  }
model.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:101c793ee973fbc259fba414fe6931f08f0845e059feaef5166c710329ef1cfc
- size 350571672
 
  version https://git-lfs.github.com/spec/v1
+ oid sha256:307b740e36b0a7edb427c09a33c55376c4197133bd7bbbebbef0ab11f8185bf2
+ size 471950760
special_tokens_map.json CHANGED
@@ -1,37 +1,7 @@
  {
- "cls_token": {
-     "content": "<CLS>",
-     "lstrip": false,
-     "normalized": false,
-     "rstrip": false,
-     "single_word": false
- },
- "mask_token": {
-     "content": "<MASK>",
-     "lstrip": false,
-     "normalized": false,
-     "rstrip": false,
-     "single_word": false
- },
- "pad_token": {
-     "content": "<PAD>",
-     "lstrip": false,
-     "normalized": false,
-     "rstrip": false,
-     "single_word": false
- },
- "sep_token": {
-     "content": "<SEP>",
-     "lstrip": false,
-     "normalized": false,
-     "rstrip": false,
-     "single_word": false
- },
- "unk_token": {
-     "content": "<UNK>",
-     "lstrip": false,
-     "normalized": false,
-     "rstrip": false,
-     "single_word": false
- }
  }
 
  {
+ "cls_token": "<CLS>",
+ "mask_token": "<MASK>",
+ "pad_token": "<PAD>",
+ "sep_token": "<SEP>",
+ "unk_token": "<UNK>"
  }
tokenizer_config.json CHANGED
@@ -45,7 +45,7 @@
  "cls_token": "<CLS>",
  "extra_special_tokens": {},
  "mask_token": "<MASK>",
- "model_max_length": 2048,
  "pad_token": "<PAD>",
  "sep_token": "<SEP>",
  "tokenizer_class": "PreTrainedTokenizer",
 
  "cls_token": "<CLS>",
  "extra_special_tokens": {},
  "mask_token": "<MASK>",
+ "model_max_length": 1000000000000000019884624838656,
  "pad_token": "<PAD>",
  "sep_token": "<SEP>",
  "tokenizer_class": "PreTrainedTokenizer",
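The new `model_max_length` value is not arbitrary: it is the sentinel `transformers` assigns when a tokenizer declares no practical length limit (`VERY_LARGE_INTEGER`, defined as `int(1e30)`), rendered exactly as the nearest representable float:

```python
# transformers sets model_max_length to VERY_LARGE_INTEGER = int(1e30)
# when no maximum length is configured; 1e30 is a float, so converting
# it to int yields the nearest representable double, not 10**30 exactly.
print(int(1e30))  # 1000000000000000019884624838656
```

In practice this means callers should rely on the model's own 1024-token limit rather than the tokenizer's `model_max_length`.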