---
tags:
- music
- midi
- audio
- composer
- year
- classification
- era
- song
- classical_music
datasets:
- TiMauzi/imslp-midi-by-sa
metrics:
- accuracy
- f1
- confusion_matrix
model-index:
- name: EraClassifierBiLSTM-134M
  results: []
license: cc-by-sa-4.0
pipeline_tag: audio-classification
---

# EraClassifierBiLSTM-134M

This model is a bidirectional LSTM neural network designed for musical era classification from MIDI data. It achieves the following results on the evaluation set:
- Loss: 1.0162
- Accuracy: 0.6572
- F1: 0.5121

## Model description

The EraClassifierBiLSTM-134M is a bidirectional LSTM neural network specifically designed for classifying musical compositions into historical eras based on MIDI data analysis. This large model variant (~134M parameters) offers superior performance compared to the [compact 4.76M version](https://huggingface.co/TiMauzi/EraClassifierBiLSTM-4.76M), making it suitable for applications requiring higher accuracy.

### Architecture
- **Model Type**: Custom Bidirectional LSTM (BiLSTM)
- **Input**: Sequences of 8-dimensional feature vectors extracted from MIDI messages
- **Window Size**: 24 MIDI messages per sequence with stride=20 (overlapping windows)
- **Hidden Layers**: 2 bidirectional LSTM layers with 2048 hidden units each
- **Output**: 6-class classification (musical eras)
- **Activation**: LeakyReLU with dropout for regularization
- **Loss Function**: CrossEntropyLoss
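
The architecture above can be sketched in PyTorch. This is a minimal illustration, not the released code: the class name, the 0.2 dropout rate, and the use of the final time step's output for classification are assumptions, but the stated layer sizes reproduce the ~134M parameter count:

```python
import torch
import torch.nn as nn

class EraClassifierBiLSTM(nn.Module):
    """Sketch of the described architecture (names and details are illustrative)."""

    def __init__(self, input_dim=8, hidden_dim=2048, num_layers=2,
                 num_classes=6, dropout=0.2):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, num_layers=num_layers,
                            batch_first=True, bidirectional=True,
                            dropout=dropout)
        self.head = nn.Sequential(
            nn.LeakyReLU(),
            nn.Dropout(dropout),
            nn.Linear(2 * hidden_dim, num_classes),  # 2x for bidirectional
        )

    def forward(self, x):             # x: (batch, 24, 8)
        out, _ = self.lstm(x)         # (batch, 24, 2 * hidden_dim)
        return self.head(out[:, -1])  # logits from the final time step

model = EraClassifierBiLSTM()
n_params = sum(p.numel() for p in model.parameters())  # ~134M
logits = model(torch.randn(2, 24, 8))                  # shape (2, 6)
```

Nearly all parameters sit in the two bidirectional LSTM layers; the second layer alone contributes roughly 100M, since it takes the 4096-dimensional concatenated output of the first layer as input.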

### Feature Engineering
The model processes 8 key MIDI features per message, automatically selected as the most frequent features across the dataset:

**Numerical Features (7):**
- **channel**: MIDI channel number (μ=2.01, σ=2.74)
- **control**: Control change values (μ=11.90, σ=17.02)
- **note**: Note pitch/midi note number (μ=64.17, σ=12.00)
- **tempo**: Tempo in microseconds per beat (μ=738221.63, σ=460369.34)
- **time**: Timing information in ticks (μ=714.28, σ=1337451.38)
- **value**: Generic value field (μ=83.91, σ=26.72)
- **velocity**: Note velocity/intensity (μ=42.80, σ=44.24)

**Categorical Features (1):**
- **type**: MIDI message type (mapped to numerical IDs)

All numerical features are normalized using dataset statistics (mean and standard deviation), while categorical features are encoded using learned ID mappings.
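
As a sketch, standardization with the dataset statistics listed above looks like the following; only three of the seven (μ, σ) pairs are filled in here, and the dictionary and function names are illustrative:

```python
# Dataset statistics (mean, std) for a subset of the numerical features above.
FEATURE_STATS = {
    "channel":  (2.01, 2.74),
    "note":     (64.17, 12.00),
    "velocity": (42.80, 44.24),
    # ...the remaining numerical features use their listed (mu, sigma) pairs
}

def normalize(feature, value):
    """Standardize a raw feature value using its dataset mean and std."""
    mu, sigma = FEATURE_STATS[feature]
    return (value - mu) / sigma

z = normalize("note", 76.17)  # exactly one std above the mean pitch -> 1.0
```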

### Training Approach
The model uses a sliding window approach to capture temporal patterns in musical structure that are characteristic of different historical periods. Each MIDI file is processed into multiple overlapping sequences, allowing the model to learn both local and global musical patterns.
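
The windowing scheme (window=24, stride=20) can be sketched as a simple list slicing helper; the function name is illustrative:

```python
def sliding_windows(messages, window=24, stride=20):
    """Split a MIDI message sequence into overlapping fixed-size windows,
    mirroring the window=24 / stride=20 scheme described above."""
    return [messages[i:i + window]
            for i in range((len(messages) - window) + 1) if False] or \
           [messages[i:i + window]
            for i in range(0, len(messages) - window + 1, stride)]

windows = sliding_windows(list(range(100)))
# 4 windows; consecutive windows overlap by 24 - 20 = 4 messages
```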

### Performance Comparison
Compared to the 4.76M model, this larger variant shows significant improvements:
- **Accuracy**: +7.2 percentage points (65.7% vs 58.5%)
- **F1 Score**: +19.1% relative improvement (0.512 vs 0.430)
- **Loss**: 7.1% relative reduction (1.016 vs 1.094)

## Intended uses & limitations

### Intended Uses
- **Musicological Research**: Analyzing historical trends in musical composition
- **Educational Tools**: Teaching music history through automated era identification
- **Digital Music Libraries**: Automatic categorization and organization of MIDI collections
- **Music Analysis**: Understanding stylistic characteristics across different periods
- **Content Recommendation**: Suggesting music from similar historical periods

### Limitations
- **Performance Variability**: While improved, the model still shows performance differences across eras:
  - Excellent performance on Romantic (85.4%) and Baroque (71.4%) eras
  - Good performance on Renaissance (57.0%) and Modern (51.9%) eras
  - Moderate performance on Classical (21.4%) and Other (21.1%) categories
- **Era Confusion**: Adjacent historical periods are still confused, though less frequently:
  - Renaissance music occasionally misclassified as Baroque (30.0%)
  - Classical music still confused with Baroque (32.3%) and Romantic (31.0%)
  - Modern music sometimes misclassified as Romantic (21.4%)
- **Computational Requirements**: Higher memory and processing requirements compared to smaller models
- **Data Dependencies**: Performance depends on the quality and representativeness of the training data
- **MIDI-Only**: Limited to MIDI format; cannot process audio recordings or sheet music
- **Cultural Bias**: Training data may reflect Western classical music traditions

Below is the confusion matrix for the best-performing checkpoint, visually highlighting these misclassifications (click to enlarge):

[<img src="confusion_matrix_best.png" alt="Confusion Matrix" width="500"/>](confusion_matrix_best.png)

### Recommendations for Use
- Validate results with musicological expertise, especially for Classical period identification
- Use confidence thresholds to filter low-confidence predictions
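
A confidence threshold can be applied to the model's output logits via softmax; the threshold value of 0.6 below is illustrative, not tuned:

```python
import math

def predict_with_threshold(logits, threshold=0.6):
    """Return the predicted era index only when the softmax confidence
    clears the threshold; otherwise abstain (return None)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # shift by max for stability
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return best if probs[best] >= threshold else None

predict_with_threshold([5, 0, 0, 0, 0, 0])  # confident -> 0
predict_with_threshold([1, 1, 1, 1, 1, 1])  # uniform -> None (abstain)
```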

## Training and evaluation data

### Dataset
- **Source**: TiMauzi/imslp-midi-by-sa (International Music Score Library Project)
- **Format**: MIDI files with associated metadata including composition year and era
- **Preprocessing**: MIDI messages converted to 8-dimensional feature vectors
- **Window Strategy**: 24-message windows with 20-message stride for overlapping sequences

### Musical Eras Covered
0. **Renaissance** (1400-1600): Early polyphonic music, madrigals, motets
1. **Baroque** (1600-1750): Ornamented music, basso continuo, fugues
2. **Classical** (1750-1820): Clear forms, balanced phrases, sonata form
3. **Romantic** (1820-1900): Expressive, emotional, expanded forms
4. **Modern** (1900-present): Atonal, experimental, diverse styles
5. **Other**: Miscellaneous or unclear period classifications

The numbers 0 through 5 correspond to each era's index during inference.
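
Mapping an argmax output index back to an era name is then a simple lookup (variable and function names here are illustrative):

```python
# Index-to-era mapping as listed above.
ERA_LABELS = ["Renaissance", "Baroque", "Classical", "Romantic", "Modern", "Other"]

def era_from_logits(logits):
    """Map the model's argmax output index to its era name."""
    return ERA_LABELS[max(range(len(logits)), key=logits.__getitem__)]

era_from_logits([0.1, 0.2, 0.05, 0.5, 0.1, 0.05])  # -> "Romantic"
```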

### Data Distribution
The model was trained on 6,992 MIDI files from the IMSLP dataset with the following era distribution:
- **Romantic**: 2,722 samples (38.9%) - median year 1854
- **Baroque**: 1,874 samples (26.8%) - median year 1710
- **Renaissance**: 843 samples (12.1%) - median year 1611
- **Modern**: 763 samples (10.9%) - median year 2020
- **Classical**: 597 samples (8.5%) - median year 1779
- **Other**: 193 samples (2.8%) - median year 1909 (includes Early 20th century and Medieval)

Era thresholding was applied (minimum 150 samples per era), with rare eras like "Early 20th century" (125 samples) and "Medieval" (5 samples) mapped to the "Other" category to maintain classification stability.
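
The thresholding step can be sketched as a label-count filter; the function name and the shape of the label list are assumptions:

```python
from collections import Counter

def fold_rare_eras(labels, min_count=150, other="Other"):
    """Map eras with fewer than `min_count` samples into the "Other"
    bucket, following the thresholding described above."""
    counts = Counter(labels)
    return [l if counts[l] >= min_count else other for l in labels]

sample = ["Baroque"] * 150 + ["Medieval"] * 5
folded = fold_rare_eras(sample)
# all 5 "Medieval" labels become "Other"; "Baroque" is kept unchanged
```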

### Evaluation Strategy
- **Validation**: Performance measured on held-out validation set
- **Test Set**: Final evaluation on completely unseen test data
- **Metrics**: Accuracy, F1-score (macro-averaged), and confusion matrix analysis
- **Training Duration**: 3 epochs (~58,000 training steps), with the best checkpoint selected by F1 score (early stopping with fallback to the best result)

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 0.0006535848403050624
- train_batch_size: 64
- eval_batch_size: 64
- seed: 42
- optimizer: AdamW (torch implementation) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
- lr_scheduler_type: reduce_lr_on_plateau
- num_epochs: 3
- mixed_precision_training: Native AMP

### Training results

| Training Loss | Epoch  | Step  | Validation Loss | Accuracy | F1     |
|:-------------:|:------:|:-----:|:---------------:|:--------:|:------:|
| 1.1478        | 0.1031 | 2000  | 1.1945          | 0.5275   | 0.3487 |
| 0.9699        | 0.2063 | 4000  | 1.0621          | 0.6357   | 0.4551 |
| 0.9049        | 0.3094 | 6000  | 1.0657          | 0.5898   | 0.4074 |
| 0.8577        | 0.4126 | 8000  | 1.0708          | 0.6032   | 0.4562 |
| 0.8293        | 0.5157 | 10000 | 1.0425          | 0.6096   | 0.4274 |
| 0.8002        | 0.6188 | 12000 | 1.0197          | 0.6157   | 0.4464 |
| 0.7799        | 0.7220 | 14000 | 1.0540          | 0.6103   | 0.4576 |
| 0.7545        | 0.8251 | 16000 | 1.0288          | 0.6266   | 0.4682 |
| 0.7415        | 0.9283 | 18000 | 1.0332          | 0.6206   | 0.4614 |
| 0.7205        | 1.0314 | 20000 | 1.0262          | 0.6333   | 0.4734 |
| 0.7005        | 1.1345 | 22000 | 0.9989          | 0.6363   | 0.4840 |
| 0.6924        | 1.2377 | 24000 | 1.0136          | 0.6347   | 0.4647 |
| 0.6541        | 1.3408 | 26000 | 0.9917          | 0.6466   | 0.4951 |
| 0.6261        | 1.4440 | 28000 | 0.9876          | 0.6465   | 0.4924 |
| 0.6271        | 1.5471 | 30000 | 1.0057          | 0.6449   | 0.4976 |
| 0.6124        | 1.6503 | 32000 | 0.9994          | 0.6494   | 0.5007 |
| 0.6137        | 1.7534 | 34000 | 1.0015          | 0.6493   | 0.4976 |
| 0.6040        | 1.8565 | 36000 | 1.0058          | 0.6524   | 0.4997 |
| 0.6063        | 1.9597 | 38000 | 1.0046          | 0.6512   | 0.5032 |
| 0.5859        | 2.0628 | 40000 | 1.0162          | 0.6572   | 0.5121 |
| 0.5778        | 2.1660 | 42000 | 1.0052          | 0.6591   | 0.5089 |
| 0.5679        | 2.2691 | 44000 | 1.0288          | 0.6539   | 0.5044 |
| 0.5646        | 2.3722 | 46000 | 1.0247          | 0.6559   | 0.5085 |
| 0.5693        | 2.4754 | 48000 | 1.0250          | 0.6581   | 0.5096 |
| 0.5607        | 2.5785 | 50000 | 1.0296          | 0.6573   | 0.5069 |
| 0.5641        | 2.6817 | 52000 | 1.0266          | 0.6573   | 0.5080 |
| 0.5601        | 2.7848 | 54000 | 1.0268          | 0.6577   | 0.5098 |
| 0.5607        | 2.8879 | 56000 | 1.0263          | 0.6539   | 0.5060 |
| 0.5582        | 2.9911 | 58000 | 1.0269          | 0.6593   | 0.5103 |

### Training Analysis
Below is the full training metrics plot, showing loss, accuracy, and F1-score trends over the entire training process (click to enlarge):

[<img src="training_metrics.png" alt="Training Metrics" width="500"/>](training_metrics.png)

The training shows steady convergence, with the model reaching its best performance around step 40,000 (epoch 2.06). The larger model capacity allows for faster learning and a better final result than the [4.76M variant](https://huggingface.co/TiMauzi/EraClassifierBiLSTM-4.76M). After step 40,000 the training loss keeps decreasing while the validation loss plateaus slightly higher, a sign of mild overfitting that checkpoint selection mitigates. The model achieves its peak F1 score of 0.5121 at step 40,000, which was selected as the best checkpoint.

### Framework versions

- Transformers 4.49.0
- Pytorch 2.6.0+cu126
- Datasets 3.3.2
- Tokenizers 0.21.0