---
language:
- en
tags:
- music
- midi
- tokenization
- remi
- bpe
- symbolic-music
license: mit
library_name: miditok
datasets:
- Metacreation/GigaMIDI
---

# MidiTok-REMI: BPE Tokenizer for Symbolic Music

A Byte-Pair Encoding (BPE) tokenizer trained on REMI (REvamped MIDI-derived events) tokens for efficient symbolic music representation. This tokenizer combines the expressiveness of REMI encoding with the compression benefits of BPE, making it well suited for training large language models on MIDI data.

## Model Details

- **Tokenizer Type**: BPE (Byte-Pair Encoding)
- **Base Representation**: REMI (REvamped MIDI-derived events)
- **Vocabulary Size**: 40,000 tokens
- **Training Data**: GigaMIDI dataset (see Training Details)
- **Library**: [MidiTok](https://github.com/Natooz/MidiTok)
- **Compatible Models**: MusicBERT, Llama-based music models

## REMI Token Configuration

The tokenizer uses the following REMI configuration:

- **Velocity Bins**: 32 levels (0-127 quantized)
- **Beat Resolution**:
  - Beats 0-4: 8 samples per beat (fine-grained)
  - Beats 4-12: 4 samples per beat (standard)
- **Chord Recognition**: Enabled
- **Special Tokens**: `PAD_None`, `BOS_None`, `EOS_None`, `MASK_None`

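In MidiTok, this configuration roughly corresponds to the following `TokenizerConfig` (a sketch only: the parameter names follow recent MidiTok versions and are not taken from the actual training script):

```python
from miditok import REMI, TokenizerConfig

# Sketch of the configuration listed above; assumed parameter names,
# not the exact settings used to train this tokenizer.
config = TokenizerConfig(
    num_velocities=32,                 # 32 velocity bins over 0-127
    beat_res={(0, 4): 8, (4, 12): 4},  # finer grid on the first 4 beats
    use_chords=True,                   # emit chord detection tokens
    special_tokens=["PAD", "BOS", "EOS", "MASK"],
)
tokenizer = REMI(config)
```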
## Token Types

REMI represents MIDI files using the following event types:

| Token Type | Description | Example |
|------------|-------------|---------|
| `Bar_None` | Measure/bar boundary | `Bar_None` |
| `TimeSig_X/Y` | Time signature | `TimeSig_4/4` |
| `Position_N` | Position within measure (ticks) | `Position_16` |
| `Tempo_X` | Tempo in BPM | `Tempo_121.29` |
| `Program_N` | MIDI program/instrument | `Program_0` (Piano) |
| `Pitch_N` | MIDI note pitch (0-127) | `Pitch_69` (A4) |
| `Velocity_N` | Note velocity (dynamics) | `Velocity_63` |
| `Duration_X.Y.Z` | Note duration | `Duration_4.0.4` |
| `Chord_X` | Chord detection | `Chord_C:maj` |

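Every token string above follows a `Type_Value` naming scheme, so the type can be recovered with a single split (a minimal sketch using the example tokens from the table):

```python
# Each REMI token string is "Type_Value"; split on the first underscore
# to recover the type. Token strings taken from the table above.
tokens = ["Bar_None", "TimeSig_4/4", "Position_16", "Tempo_121.29",
          "Pitch_69", "Velocity_63", "Duration_4.0.4"]

parsed = [tuple(t.split("_", 1)) for t in tokens]
for tok_type, value in parsed:
    print(f"{tok_type:>10}: {value}")
```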
## Usage

### Installation

```bash
pip install miditok transformers torch
```

### Basic Usage

```python
from miditok import MusicTokenizer
from pathlib import Path

# Load the tokenizer from the Hugging Face Hub
tokenizer = MusicTokenizer.from_pretrained("manoskary/miditok-REMI")

# Tokenize a MIDI file
midi_path = Path("your_music.mid")
tok_seq = tokenizer(midi_path)

# Access token IDs (for training models)
token_ids = tok_seq.ids
print(f"Sequence length: {len(token_ids)}")
print(f"Token IDs: {token_ids[:10]}...")  # First 10 tokens

# Access human-readable tokens
print(f"Token strings: {tok_seq.tokens[:10]}")
```

### Complete Pipeline Example

```python
from miditok import MusicTokenizer
from pathlib import Path

# Load tokenizer
tok = MusicTokenizer.from_pretrained("manoskary/miditok-REMI")

# 1. MIDI → Tokens
midi = Path("input.mid")
tok_seq = tok(midi)

print("Original MIDI tokenized:")
print(f"  - Tokens: {tok_seq.tokens[:5]}...")
print(f"  - IDs: {tok_seq.ids[:5]}...")
print(f"  - Length: {len(tok_seq.ids)}")

# 2. Tokens → MIDI (reconstruction)
score = tok.decode(tok_seq.ids)
score.dump_midi("reconstructed.mid")

# 3. Verify reconstruction by re-tokenizing the decoded file
tok_seq_reconstructed = tok(Path("reconstructed.mid"))
assert tok_seq.ids == tok_seq_reconstructed.ids, "Reconstruction failed!"
print("\n✓ Reconstruction verified!")
```

### Integration with Transformers

```python
import torch
from miditok import MusicTokenizer
from transformers import BertForMaskedLM

# Load tokenizer and model
tokenizer = MusicTokenizer.from_pretrained("manoskary/miditok-REMI")
model = BertForMaskedLM.from_pretrained("your-musicbert-model")

# Tokenize MIDI
midi_path = "song.mid"
tok_seq = tokenizer(midi_path)
input_ids = torch.tensor([tok_seq.ids[:512]])  # Truncate to max length

# Forward pass
with torch.no_grad():
    outputs = model(input_ids=input_ids)
    logits = outputs.logits

# Generate predictions
predictions = logits.argmax(dim=-1)
```

### Batch Processing

```python
from miditok import MusicTokenizer
from pathlib import Path
import torch

tokenizer = MusicTokenizer.from_pretrained("manoskary/miditok-REMI")

# Process multiple MIDI files
midi_files = list(Path("midi_dataset/").glob("*.mid"))
all_sequences = []

for midi_file in midi_files[:100]:  # Process first 100 files
    try:
        tok_seq = tokenizer(midi_file)
        all_sequences.append(tok_seq.ids)
    except Exception as e:
        print(f"Error processing {midi_file}: {e}")

# Pad/truncate sequences to a fixed length for batching
max_len = 2048
pad_id = tokenizer["PAD_None"]  # look up the PAD token id in the vocabulary
padded_sequences = []
for seq in all_sequences:
    if len(seq) > max_len:
        seq = seq[:max_len]  # Truncate
    else:
        seq = seq + [pad_id] * (max_len - len(seq))  # Pad
    padded_sequences.append(seq)

batch = torch.tensor(padded_sequences)
print(f"Batch shape: {batch.shape}")  # [batch_size, max_len]
```

## Example Output

For a simple MIDI file with a few notes:

```python
TokSequence(
    tokens=[
        'Bar_None',         # New measure
        'TimeSig_4/4',      # 4/4 time signature
        'Position_0',       # Start of measure
        'Tempo_121.29',     # Tempo ≈ 121 BPM
        'Program_0',        # Piano instrument
        'Pitch_69',         # A4 note
        'Velocity_63',      # Medium velocity
        'Duration_4.0.4',   # Quarter note duration
        'Position_16',      # 16 ticks later
        'Program_0',        # Piano
        'Pitch_72',         # C5 note
        'Velocity_63',      # Medium velocity
        'Duration_2.0.8',   # Eighth note duration
        'Program_0',        # Piano
        'Pitch_76',         # E5 note
        'Velocity_63',      # Medium velocity
        'Duration_2.0.8'    # Eighth note duration
    ],
    ids=[532, 4, 531, 190, 374, 580, 850, 2595, 33442, 686],
    # BPE compression: 17 REMI tokens → 10 BPE tokens (a ~41% reduction)
)
```

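The compression figure follows directly from the two sequence lengths:

```python
# Compression from BPE merging: 17 base REMI tokens become 10 BPE tokens.
base_len = 17  # REMI token count in the example above
bpe_len = 10   # BPE token count (length of `ids`)

reduction = 1 - bpe_len / base_len
print(f"Sequence length reduced by {reduction:.0%}")  # → 41%
```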

## Training Details

### Training Data

- **Dataset**: GigaMIDI v2.0.0 (Metacreation/GigaMIDI on HuggingFace)
- **Size**: ~200k MIDI files
- **Preprocessing**: MIDI → REMI tokens → BPE vocabulary learning

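The BPE vocabulary-learning step can be illustrated with a toy merge loop (a conceptual sketch with made-up token ids, not the MidiTok implementation, which trains on the full REMI corpus):

```python
from collections import Counter

def learn_bpe_merges(seq, num_merges, next_id):
    """Toy BPE: repeatedly merge the most frequent adjacent id pair
    into a new token id. Conceptual sketch only."""
    merges = {}
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break  # nothing worth merging
        merges[(a, b)] = next_id
        # Replace every occurrence of the pair with the new id
        merged, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == (a, b):
                merged.append(next_id)
                i += 2
            else:
                merged.append(seq[i])
                i += 1
        seq = merged
        next_id += 1
    return seq, merges

# Toy "base token" ids with a recurring (5, 60) pattern
base = [5, 60, 7, 2, 5, 60, 7, 3, 5, 60, 8]
compressed, merges = learn_bpe_merges(base, num_merges=2, next_id=1000)
print(compressed)  # → [1001, 2, 1001, 3, 1000, 8]
```

Two merges shrink the 11-token sequence to 6 tokens: first `(5, 60) → 1000`, then `(1000, 7) → 1001`, mirroring how frequent REMI event pairs become single BPE tokens.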

## Acknowledgments

- **MidiTok Library**: [Nathan Fradet et al.](https://github.com/Natooz/MidiTok)
- **GigaMIDI Dataset**: [Metacreation Lab](https://huggingface.co/datasets/Metacreation/GigaMIDI)