Upload MusicBERT base model (GigaMIDI REMI+BPE, 130K steps)
- README.md +50 -149
- config.json +1 -1
- model.safetensors +2 -2
- special_tokens_map.json +5 -35
- tokenizer_config.json +1 -1
README.md
CHANGED
@@ -6,173 +6,74 @@ tags:
 - remi
 - midi
 - symbolic-music
-- symbolic-music
 - gigamidi
 library_name: transformers
 pipeline_tag: fill-mask
 license: mit
 datasets:
 - Metacreation/GigaMIDI
-metrics:
-- perplexity
 ---
 
-#
+# musicbert
 
 ## Model Description
+MusicBERT base is a 12-layer BERT-style masked language model trained on REMI+BPE
+symbolic music sequences extracted from the [GigaMIDI](https://huggingface.co/datasets/Metacreation/GigaMIDI)
+corpus. It is suited to symbolic music understanding and fill-mask infilling,
+and can serve as a backbone for downstream generative tasks.
 
-- **Model Type**: BERT for Masked Language Modeling
-- **Training Dataset**: GigaMIDI v1.1.0 (~1.7M MIDI files)
-- **Tokenization**: REMI → BPE (vocab size: 50000)
-- **Architecture**: base BERT
-- **Parameters**: ~800M parameters
-- **Training Steps**: 85000
-- **Final Loss**: unknown
+- **Checkpoint**: 130000 steps
+- **Hidden size**: 768
+- **Parameters**: ~120M
+- **Validation loss**: 1.5093
 
-## Training Details
-
-### Dataset Preprocessing
-1. **REMI Tokenization**: MIDI files converted to REMI tokens (vocab: 532)
-2. **BPE Encoding**: REMI tokens compressed using BPE with 50000 vocabulary
-3. **Sequence Length**: 1024 tokens
-4. **Max Events per MIDI**: 2048
+## Training Configuration
+- **Objective**: Masked language modeling with span-aware masking
+- **Dataset**: GigaMIDI (REMI tokens → BPE, vocab size 50000)
+- **Sequence length**: 1024
+- **Max events per MIDI**: 2048
 
-- **Training Framework**: HuggingFace Transformers + PyTorch
-
-## Usage
+## Inference Example
 
+### Using with MIDI files
 ```python
+import random
+
+import torch
+from transformers import BertForMaskedLM
+from miditok import MusicTokenizer
 
 # Load model and tokenizer
-remi_config = miditok.TokenizerConfig(
-    use_tempos=True,
-    use_time_signatures=True,
-    use_programs=True,
-    beat_res={(0, 4): 8, (4, 12): 4},
-    nb_tempos=32,
-    tempo_range=(40, 250)
-)
-remi_tokenizer = miditok.REMI(remi_config)
-
-# Example: Process a MIDI file
-midi_file = "path/to/your/music.mid"
-
-# Step 1: Load MIDI and convert to REMI tokens
-score = symusic.Score.from_file(midi_file)
-remi_tokens = remi_tokenizer.encode(score)
-remi_ids = remi_tokens[0].ids  # Extract token IDs
-print(f"REMI tokens: {remi_ids[:10]}")  # First 10 tokens
-
-# Step 2: Convert REMI tokens to BPE for model input
-remi_text = " ".join(map(str, remi_ids))
-inputs = tokenizer(remi_text, return_tensors="pt", truncation=True, max_length=1024)
-
-# Step 3: Use fill-mask pipeline for predictions
-fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
-masked_text = remi_text.replace(str(remi_ids[5]), "<MASK>")  # Mask the 6th token
-results = fill_mask(masked_text)
-
-print("Predicted tokens:")
-for result in results[:3]:
-    print(f"Score: {result['score']:.3f} - Token: {result['token_str']}")
-```
-
-### Simple Fill-Mask Usage
-
-```python
-from transformers import pipeline
-
-# Quick start with pipeline
-fill_mask = pipeline("fill-mask", model="manoskary/musicbert")
-
-# Example with REMI token sequence (musical events)
-prompt = "14 40 31 <MASK> 14 40 149"
-results = fill_mask(prompt)
-
-print("Top predictions:")
-for result in results[:3]:
-    print(f"{result['token_str']} (confidence: {result['score']:.1%})")
+model = BertForMaskedLM.from_pretrained("manoskary/musicbert")
+tokenizer = MusicTokenizer.from_pretrained("manoskary/miditok-REMI")
+
+# Convert MIDI to BPE tokens (MIDI → REMI → BPE pipeline)
+midi_path = "path/to/your/file.mid"
+tok_seq = tokenizer(midi_path)
+bpe_ids = tok_seq.ids
+
+# Mask some tokens for prediction
+mask_token_id = 3  # MASK_None token
+input_ids = bpe_ids.copy()
+mask_positions = random.sample(range(1, len(input_ids) - 1), k=5)
+for pos in mask_positions:
+    input_ids[pos] = mask_token_id
+
+# Run inference
+input_tensor = torch.tensor([input_ids])
+with torch.no_grad():
+    outputs = model(input_tensor)
+predictions = outputs.logits[0, mask_positions, :].argmax(dim=-1)
+
+print("Predicted token IDs:", predictions.tolist())
 ```
 
-## Tokenization Details
-
-The model uses a two-stage tokenization process:
-
-  - Program changes, tempo changes
-2. **BPE Encoding**: Compresses REMI sequences
-  - Learns common musical patterns
-  - Reduces sequence length
-  - Improves training efficiency
-
-### Special Tokens
-- `<PAD>`: Padding token
-- `<UNK>`: Unknown token
-- `<CLS>`: Classification token
-- `<SEP>`: Separator token
-- `<MASK>`: Mask token for MLM
-
-## Limitations
-
-- Trained primarily on Western music (GigaMIDI dataset bias)
-- Limited to symbolic (MIDI) music representation
-- Maximum sequence length of 1024 tokens
-- May not capture very long musical dependencies
-
-## Aknowledgments
-
-The original implementation of MusicBert is implemented [here](https://github.com/microsoft/muzic/tree/main/musicbert).
-
-If you use MusicBert in your work remember to cite the original work:
-```bibtex
-@inproceedings{zeng2021musicbert,
-  title={Musicbert: Symbolic music understanding with large-scale pre-training},
-  author={Zeng, Mingliang and Tan, Xu and Wang, Rui and Ju, Zeqian and Qin, Tao and Liu, Tie-Yan},
-  booktitle={Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021},
-  year={2021}
-}
-```
+## Limitations and Risks
+- Model is trained purely on symbolic data; it does not produce audio directly.
+- The GigaMIDI dataset is biased towards Western tonal music.
+- Long-form structure beyond 1024 tokens requires chunking or iterative decoding.
+- Generated continuations may need post-processing to ensure musical coherence.
 
+## Citation
+If you use this checkpoint, please cite the original MusicBERT paper and the
+GigaMIDI dataset.
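The updated README notes that material longer than the 1024-token limit needs chunking. A minimal sketch of one way to window a BPE id sequence; the `chunk_ids` helper and the overlap (stride) value are illustrative assumptions, not part of the model card:

```python
# Sketch: split a long BPE id sequence into windows of at most 1024 ids
# for a model with a 1024-token limit. Stride 768 (overlap 256) is an
# assumed value, not something the model card specifies.

def chunk_ids(ids, window=1024, stride=768):
    """Yield fixed-size windows over a token id list.

    Consecutive windows overlap by `window - stride` ids so boundary
    positions appear in two windows; the final window may be shorter.
    """
    if len(ids) <= window:
        return [ids]
    return [ids[start:start + window]
            for start in range(0, len(ids) - window + stride, stride)]

chunks = chunk_ids(list(range(2000)))
print([len(c) for c in chunks])  # → [1024, 1024, 464]
```

Each window can then be fed to the model independently, with predictions from overlapping regions reconciled downstream.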
config.json
CHANGED
|
@@ -20,5 +20,5 @@
|
|
| 20 |
"transformers_version": "4.52.4",
|
| 21 |
"type_vocab_size": 1,
|
| 22 |
"use_cache": false,
|
| 23 |
-
"vocab_size":
|
| 24 |
}
|
|
|
|
| 20 |
"transformers_version": "4.52.4",
|
| 21 |
"type_vocab_size": 1,
|
| 22 |
"use_cache": false,
|
| 23 |
+
"vocab_size": 40000
|
| 24 |
}
|
model.safetensors
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
-
size
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:307b740e36b0a7edb427c09a33c55376c4197133bd7bbbebbef0ab11f8185bf2
|
| 3 |
+
size 471950760
|
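The size recorded in the LFS pointer above allows a rough cross-check of the parameter count: a `.safetensors` file stores raw tensor bytes plus a small JSON header, so assuming fp32 weights (an assumption; the card does not state the checkpoint's dtype) the parameter count is approximately the byte size divided by 4.

```python
# Rough parameter-count estimate from the LFS pointer's byte size,
# assuming 4 bytes per parameter (fp32) and ignoring the small header.
size_bytes = 471_950_760
approx_params = size_bytes // 4
print(f"~{approx_params / 1e6:.0f}M parameters")  # → ~118M parameters
```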
special_tokens_map.json
CHANGED
@@ -1,37 +1,7 @@
 {
-  "cls_token": {
-    "content": "<CLS>",
-    "lstrip": false,
-    "normalized": false,
-    "rstrip": false,
-    "single_word": false
-  },
-  "mask_token": {
-    "content": "<MASK>",
-    "lstrip": false,
-    "normalized": false,
-    "rstrip": false,
-    "single_word": false
-  },
-  "pad_token": {
-    "content": "<PAD>",
-    "lstrip": false,
-    "normalized": false,
-    "rstrip": false,
-    "single_word": false
-  },
-  "sep_token": {
-    "content": "<SEP>",
-    "lstrip": false,
-    "normalized": false,
-    "rstrip": false,
-    "single_word": false
-  },
-  "unk_token": {
-    "content": "<UNK>",
-    "lstrip": false,
-    "normalized": false,
-    "rstrip": false,
-    "single_word": false
-  }
+  "cls_token": "<CLS>",
+  "mask_token": "<MASK>",
+  "pad_token": "<PAD>",
+  "sep_token": "<SEP>",
+  "unk_token": "<UNK>"
 }
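This change collapses each added-token dict to a bare string, which loses nothing here because every dict in the old file carried only the flag values a plain string implies. A small sketch of that equivalence; the `simplify` helper and `DEFAULT_FLAGS` are illustrative, not a transformers API:

```python
# Flags shown for every token in the old special_tokens_map.json.
DEFAULT_FLAGS = {"lstrip": False, "normalized": False,
                 "rstrip": False, "single_word": False}

def simplify(entry):
    """Collapse a token dict to its string form when only default flags are set."""
    if isinstance(entry, dict) and all(entry.get(k) == v
                                       for k, v in DEFAULT_FLAGS.items()):
        return entry["content"]
    return entry

old_map = {
    "mask_token": {"content": "<MASK>", **DEFAULT_FLAGS},
    "pad_token": {"content": "<PAD>", **DEFAULT_FLAGS},
}
new_map = {k: simplify(v) for k, v in old_map.items()}
print(new_map)  # → {'mask_token': '<MASK>', 'pad_token': '<PAD>'}
```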
tokenizer_config.json
CHANGED
@@ -45,7 +45,7 @@
   "cls_token": "<CLS>",
   "extra_special_tokens": {},
   "mask_token": "<MASK>",
-  "model_max_length":
+  "model_max_length": 1000000000000000019884624838656,
   "pad_token": "<PAD>",
   "sep_token": "<SEP>",
   "tokenizer_class": "PreTrainedTokenizer",
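The odd-looking `model_max_length` is not a real context size: it is the decimal form of transformers' `VERY_LARGE_INTEGER` sentinel, `int(1e30)`, which means no length limit was recorded in the tokenizer config. Callers should truncate to the model's actual 1024-token training length instead; `effective_max_length` below is a hypothetical helper sketching that, not a transformers API:

```python
# transformers writes int(1e30) into tokenizer_config.json when no
# model_max_length is known; treat it as "unset" and clamp to the
# model's real limit (1024 for this checkpoint, per the README).
SENTINEL = 1_000_000_000_000_000_019_884_624_838_656  # == int(1e30)

def effective_max_length(model_max_length, model_limit=1024):
    if model_max_length >= SENTINEL:
        return model_limit  # sentinel: no limit recorded, use the model's
    return min(model_max_length, model_limit)

print(effective_max_length(SENTINEL))  # → 1024
```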