---
language:
- en
tags:
- music
- midi
- tokenization
- remi
- bpe
- symbolic-music
license: mit
library_name: miditok
datasets:
- Metacreation/GigaMIDI
---

# MidiTok-REMI: BPE Tokenizer for Symbolic Music

A Byte-Pair Encoding (BPE) tokenizer trained on REMI (REvamped MIDI-derived events) tokens for efficient symbolic music representation. This tokenizer combines the expressiveness of REMI encoding with the compression benefits of BPE, making it well suited for training large language models on MIDI data.

## Model Details

- **Tokenizer Type**: BPE (Byte-Pair Encoding)
- **Base Representation**: REMI (REvamped MIDI-derived events)
- **Vocabulary Size**: 40,000 tokens
- **Training Data**: GigaMIDI dataset (see Training Details below)
- **Library**: [MidiTok](https://github.com/Natooz/MidiTok)
- **Compatible Models**: MusicBERT, Llama-based music models

## REMI Token Configuration

The tokenizer uses the following REMI configuration:

- **Velocity Bins**: 32 levels (0-127 quantized)
- **Beat Resolution**:
  - Beats 0-4: 8 positions per beat (fine-grained)
  - Beats 4-12: 4 positions per beat (standard)
- **Chord Recognition**: Enabled
- **Special Tokens**: `PAD_None`, `BOS_None`, `EOS_None`, `MASK_None`

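The settings above map onto MidiTok's `TokenizerConfig`. As a rough sketch, a tokenizer with an equivalent base configuration could be built from scratch like this (parameter names follow recent MidiTok releases; the values here are an approximation of this model's configuration, not a verbatim dump of it):

```python
from miditok import REMI, TokenizerConfig

# Approximate reconstruction of this tokenizer's base configuration
config = TokenizerConfig(
    num_velocities=32,                 # 32 velocity bins over 0-127
    beat_res={(0, 4): 8, (4, 12): 4},  # positions per beat, keyed by beat range
    use_chords=True,                   # emit Chord_* tokens
    use_tempos=True,                   # emit Tempo_* tokens
    use_time_signatures=True,          # emit TimeSig_* tokens
    use_programs=True,                 # emit Program_* tokens
    special_tokens=["PAD", "BOS", "EOS", "MASK"],
)
tokenizer = REMI(config)
```

The BPE vocabulary itself is then learned on top of this base tokenizer (see Training Details below).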
## Token Types

REMI represents MIDI files using the following event types:

| Token Type | Description | Example |
|------------|-------------|---------|
| `Bar_None` | Measure/bar boundary | `Bar_None` |
| `TimeSig_X/Y` | Time signature | `TimeSig_4/4` |
| `Position_N` | Position within measure (ticks) | `Position_16` |
| `Tempo_X` | Tempo in BPM | `Tempo_121.29` |
| `Program_N` | MIDI program/instrument | `Program_0` (Piano) |
| `Pitch_N` | MIDI note pitch (0-127) | `Pitch_69` (A4) |
| `Velocity_N` | Note velocity (dynamics) | `Velocity_63` |
| `Duration_X.Y.Z` | Note duration | `Duration_4.0.4` |
| `Chord_X` | Chord detection | `Chord_C:maj` |

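Token strings follow a uniform `Type_Value` convention, which makes them easy to inspect without the library. A minimal sketch in plain Python (the helper names are illustrative, not MidiTok APIs):

```python
def parse_token(token: str) -> tuple[str, str]:
    """Split a REMI token string into its (type, value) parts."""
    token_type, _, value = token.partition("_")
    return token_type, value

def pitch_name(midi_pitch: int) -> str:
    """Convert a MIDI pitch number (0-127) to a note name, with middle C = C4."""
    names = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
    return f"{names[midi_pitch % 12]}{midi_pitch // 12 - 1}"

token_type, value = parse_token("Pitch_69")
print(token_type, pitch_name(int(value)))  # Pitch A4
```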
## Usage

### Installation

```bash
pip install miditok transformers torch
```

### Basic Usage

```python
from pathlib import Path

from miditok import MusicTokenizer

# Load the tokenizer from the Hugging Face Hub
tokenizer = MusicTokenizer.from_pretrained("manoskary/miditok-REMI")

# Tokenize a MIDI file
midi_path = Path("your_music.mid")
tok_seq = tokenizer(midi_path)

# Access token IDs (for training models)
token_ids = tok_seq.ids
print(f"Sequence length: {len(token_ids)}")
print(f"Token IDs: {token_ids[:10]}...")  # First 10 tokens

# Access human-readable tokens
print(f"Token strings: {tok_seq.tokens[:10]}")
```

### Complete Pipeline Example

```python
from pathlib import Path

from miditok import MusicTokenizer

# Load tokenizer
tok = MusicTokenizer.from_pretrained("manoskary/miditok-REMI")

# 1. MIDI → Tokens
midi = Path("input.mid")
tok_seq = tok(midi)

print("Original MIDI tokenized:")
print(f" - Tokens: {tok_seq.tokens[:5]}...")
print(f" - IDs: {tok_seq.ids[:5]}...")
print(f" - Length: {len(tok_seq.ids)}")

# 2. Tokens → MIDI (reconstruction)
score = tok.decode(tok_seq.ids)
score.dump_midi("reconstructed.mid")

# 3. Verify reconstruction
tok_seq_reconstructed = tok(Path("reconstructed.mid"))
assert tok_seq.ids == tok_seq_reconstructed.ids, "Reconstruction failed!"
print("\n✓ Reconstruction verified!")
```

### Integration with Transformers

```python
import torch
from miditok import MusicTokenizer
from transformers import BertForMaskedLM

# Load tokenizer and model
tokenizer = MusicTokenizer.from_pretrained("manoskary/miditok-REMI")
model = BertForMaskedLM.from_pretrained("your-musicbert-model")
model.eval()

# Tokenize MIDI
midi_path = "song.mid"
tok_seq = tokenizer(midi_path)
input_ids = torch.tensor([tok_seq.ids[:512]])  # Truncate to the model's max length

# Forward pass
with torch.no_grad():
    outputs = model(input_ids=input_ids)
    logits = outputs.logits

# Generate predictions
predictions = logits.argmax(dim=-1)
```

### Batch Processing

```python
from pathlib import Path

import torch
from miditok import MusicTokenizer

tokenizer = MusicTokenizer.from_pretrained("manoskary/miditok-REMI")

# Process multiple MIDI files
midi_files = list(Path("midi_dataset/").glob("*.mid"))
all_sequences = []

for midi_file in midi_files[:100]:  # Process the first 100 files
    try:
        tok_seq = tokenizer(midi_file)
        all_sequences.append(tok_seq.ids)
    except Exception as e:
        print(f"Error processing {midi_file}: {e}")

# Pad or truncate sequences for batch processing
max_len = 2048
padded_sequences = []
for seq in all_sequences:
    if len(seq) > max_len:
        seq = seq[:max_len]  # Truncate
    else:
        seq = seq + [0] * (max_len - len(seq))  # Pad with the PAD token (id 0)
    padded_sequences.append(seq)

batch = torch.tensor(padded_sequences)
print(f"Batch shape: {batch.shape}")  # [batch_size, max_len]
```

## Example Output

For a simple MIDI file with a few notes:

```python
TokSequence(
    tokens=[
        'Bar_None',        # New measure
        'TimeSig_4/4',     # 4/4 time signature
        'Position_0',      # Start of measure
        'Tempo_121.29',    # Tempo ≈ 121 BPM
        'Program_0',       # Piano instrument
        'Pitch_69',        # A4 note
        'Velocity_63',     # Medium velocity
        'Duration_4.0.4',  # Quarter note duration
        'Position_16',     # 16 ticks later
        'Program_0',       # Piano
        'Pitch_72',        # C5 note
        'Velocity_63',     # Medium velocity
        'Duration_2.0.8',  # Eighth note duration
        'Program_0',       # Piano
        'Pitch_76',        # E5 note
        'Velocity_63',     # Medium velocity
        'Duration_2.0.8',  # Eighth note duration
    ],
    ids=[532, 4, 531, 190, 374, 580, 850, 2595, 33442, 686],
    # BPE compression: 17 REMI tokens → 10 BPE tokens (~41% fewer tokens)
)
```

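The compression shown above comes from BPE iteratively replacing the most frequent adjacent pair of tokens with a single new token. A toy illustration of one merge step in plain Python (the ids are made up, and this is not MidiTok's internal implementation):

```python
from collections import Counter

def most_frequent_pair(ids: list[int]) -> tuple[int, int]:
    """Find the most common adjacent token pair in a sequence."""
    return Counter(zip(ids, ids[1:])).most_common(1)[0][0]

def merge_pair(ids: list[int], pair: tuple[int, int], new_id: int) -> list[int]:
    """Replace every occurrence of `pair` with the single token `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

# A repeating (Pitch, Velocity) id pair gets fused into one new token id
seq = [532, 4, 850, 2595, 686, 850, 2595, 686]
pair = most_frequent_pair(seq)         # (850, 2595) occurs twice
merged = merge_pair(seq, pair, 40000)  # [532, 4, 40000, 686, 40000, 686]
print(len(seq), "->", len(merged))     # 8 -> 6
```

Repeating this merge step until the vocabulary reaches 40,000 tokens yields the compression ratios reported above.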
## Training Details

### Training Data

- **Dataset**: GigaMIDI v2.0.0 (Metacreation/GigaMIDI on the Hugging Face Hub)
- **Size**: ~200k MIDI files
- **Preprocessing**: MIDI → REMI tokens → BPE vocabulary learning

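This pipeline could be reproduced along the following lines with MidiTok. A hedged sketch only: the folder path is a placeholder, and the BPE-training method is named `train` in recent MidiTok releases (formerly `learn_bpe`), so check the version you have installed:

```python
from pathlib import Path

from miditok import REMI

# Start from a base REMI tokenizer (see the configuration section above)
tokenizer = REMI()

# Placeholder path to a local copy of the training MIDI files
midi_paths = list(Path("gigamidi_subset/").glob("**/*.mid"))

# Learn a 40k-token BPE vocabulary on top of the base REMI tokens
# (`train` in recent MidiTok versions; older versions call this `learn_bpe`)
tokenizer.train(vocab_size=40000, files_paths=midi_paths)

# Save locally, or push to the Hub with `tokenizer.push_to_hub(...)`
tokenizer.save_pretrained("miditok-REMI-bpe")
```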
## Acknowledgments

- **MidiTok Library**: [Nathan Fradet et al.](https://github.com/Natooz/MidiTok)
- **GigaMIDI Dataset**: [Metacreation Lab](https://huggingface.co/datasets/Metacreation/GigaMIDI)
|