Felladrin committed
Commit 573cbe4 · verified · 1 Parent(s): 3e60440

Upload folder using huggingface_hub
README.md ADDED
@@ -0,0 +1,227 @@
---
language: en
license: mit
tags:
- audio
- audio-classification
- musical-instruments
- wav2vec2
- transformers
- pytorch
datasets:
- custom
metrics:
- accuracy
- roc_auc
model-index:
- name: epoch_musical_instruments_identification_2
  results:
  - task:
      type: audio-classification
      name: Musical Instrument Classification
    metrics:
    - type: accuracy
      value: 0.9333
      name: Accuracy
    - type: roc_auc
      value: 0.9859
      name: ROC AUC (Macro)
    - type: loss
      value: 1.0639
      name: Validation Loss
base_model:
- Bhaveen/Musical-Instrument-Classification
library_name: transformers.js
pipeline_tag: audio-classification
---

# Musical-Instrument-Classification (ONNX)

This is an ONNX version of [Bhaveen/Musical-Instrument-Classification](https://huggingface.co/Bhaveen/Musical-Instrument-Classification). It was automatically converted and uploaded using [this Hugging Face Space](https://huggingface.co/spaces/onnx-community/convert-to-onnx).

## Usage with Transformers.js

See the pipeline documentation for `audio-classification`: https://huggingface.co/docs/transformers.js/api/pipelines#module_pipelines.AudioClassificationPipeline

---

# Musical Instrument Classification Model

This model is a fine-tuned version of [facebook/wav2vec2-base-960h](https://huggingface.co/facebook/wav2vec2-base-960h) for musical instrument classification. It identifies 9 different musical instruments in audio recordings with high accuracy.

## Model Description

- **Model type:** Audio Classification
- **Base model:** facebook/wav2vec2-base-960h
- **Language:** Audio (no specific language)
- **License:** MIT
- **Fine-tuned on:** Custom musical instrument dataset (200 samples per class)

## Performance

The model reaches the following results on the evaluation set after 5 epochs of training:

- **Final Accuracy:** 93.33%
- **Final ROC AUC (Macro):** 98.59%
- **Final Validation Loss:** 1.064
- **Evaluation Runtime:** 14.18 seconds
- **Evaluation Speed:** 25.39 samples/second

### Training Progress

| Epoch | Training Loss | Validation Loss | ROC AUC | Accuracy |
|-------|---------------|-----------------|---------|----------|
| 1     | 1.9872        | 1.8875          | 0.9248  | 0.6639   |
| 2     | 1.8652        | 1.4793          | 0.9799  | 0.8000   |
| 3     | 1.3868        | 1.2311          | 0.9861  | 0.8194   |
| 4     | 1.3242        | 1.1121          | 0.9827  | 0.9250   |
| 5     | 1.1869        | 1.0639          | 0.9859  | 0.9333   |

## Supported Instruments

The model classifies the following 9 musical instruments:

1. **Acoustic Guitar**
2. **Bass Guitar**
3. **Drum Set**
4. **Electric Guitar**
5. **Flute**
6. **Hi-Hats**
7. **Keyboard**
8. **Trumpet**
9. **Violin**

## Usage

### Quick Start with Pipeline

```python
from transformers import pipeline
import torchaudio

# Load the classification pipeline
classifier = pipeline("audio-classification", model="Bhaveen/epoch_musical_instruments_identification_2")

# Load the audio, downmix to mono, resample to 16 kHz, and truncate to 3 seconds
audio, rate = torchaudio.load("your_audio_file.wav")
audio = torchaudio.transforms.Resample(rate, 16000)(audio)
audio = audio.mean(dim=0).numpy()[:48000]

# Classify the audio
result = classifier(audio)
print(result)
```

### Using Transformers Directly

```python
from transformers import AutoFeatureExtractor, AutoModelForAudioClassification
import torchaudio
import torch

# Load model and feature extractor
model_name = "Bhaveen/epoch_musical_instruments_identification_2"
feature_extractor = AutoFeatureExtractor.from_pretrained(model_name)
model = AutoModelForAudioClassification.from_pretrained(model_name)

# Load the audio, downmix to mono, resample to 16 kHz, and truncate to 3 seconds
audio, rate = torchaudio.load("your_audio_file.wav")
audio = torchaudio.transforms.Resample(rate, 16000)(audio)
audio = audio.mean(dim=0).numpy()[:48000]

# Extract features and make a prediction
inputs = feature_extractor(audio, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_class = torch.argmax(predictions, dim=-1)

print(f"Predicted instrument: {model.config.id2label[predicted_class.item()]}")
```

## Training Details

### Dataset and Preprocessing

- **Custom dataset** with audio recordings of 9 musical instruments
- **Train/Test Split:** 80/20 using file numbering (files numbered < 160 used for training)
- **Data Balancing:** Random oversampling applied to minority classes
- **Audio Preprocessing:**
  - Resampling to 16,000 Hz
  - Fixed length of 48,000 samples (3 seconds)
  - Truncation of longer audio files
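The fixed-length step above can be sketched in plain Python, with no audio libraries. `fix_length` and `TARGET_LEN` are names introduced here for illustration, and zero-padding of short clips is an assumption (the pipeline above only documents truncation):

```python
# Sketch of the fixed-length preprocessing described above.
# Assumes a mono waveform already resampled to 16,000 Hz.
TARGET_LEN = 48_000  # 3 seconds at 16 kHz

def fix_length(samples, target_len=TARGET_LEN):
    """Truncate long clips; zero-pad short ones (padding is an assumption here)."""
    if len(samples) >= target_len:
        return samples[:target_len]
    return samples + [0.0] * (target_len - len(samples))

long_clip = [0.1] * 80_000   # 5 s of dummy audio
short_clip = [0.1] * 16_000  # 1 s of dummy audio
print(len(fix_length(long_clip)), len(fix_length(short_clip)))  # → 48000 48000
```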

### Training Configuration

```python
# Training hyperparameters
batch_size = 1
gradient_accumulation_steps = 4
learning_rate = 5e-6
num_train_epochs = 5
warmup_steps = 50
weight_decay = 0.02
```
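With these settings, gradients from several size-1 batches are accumulated before each optimizer step, so the effective batch size is the product of the two values:

```python
# Effective batch size under gradient accumulation: the optimizer steps once
# per `gradient_accumulation_steps` forward/backward passes.
batch_size = 1
gradient_accumulation_steps = 4

effective_batch_size = batch_size * gradient_accumulation_steps
print(effective_batch_size)  # → 4
```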

### Model Architecture

- **Base Model:** facebook/wav2vec2-base-960h
- **Classification Head:** Added for 9-class classification
- **Parameters:** ~95M trainable parameters
- **Features:** Wav2Vec2 audio representations with a fine-tuned classification layer

## Technical Specifications

- **Audio Format:** WAV files
- **Sample Rate:** 16,000 Hz
- **Input Length:** 3 seconds (48,000 samples)
- **Model Framework:** PyTorch + Transformers
- **Inference Device:** GPU recommended (CUDA)

## Evaluation Metrics

The model uses the following evaluation metrics:

- **Accuracy:** Standard classification accuracy
- **ROC AUC:** Macro-averaged ROC AUC with a one-vs-rest approach
- **Multi-class Classification:** Softmax probabilities for all 9 instrument classes
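The macro one-vs-rest ROC AUC can be sketched in pure Python using the rank-statistic (Mann-Whitney) form of AUC. The helper names and the toy 3-class data below are illustrative only; in practice the same quantity comes from scikit-learn's `roc_auc_score(..., multi_class="ovr", average="macro")`:

```python
# Macro one-vs-rest ROC AUC, sketched with the Mann-Whitney form of AUC.
def binary_auc(scores, labels):
    """AUC for one class: P(score of a positive > score of a negative)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def macro_ovr_auc(probs, targets, n_classes):
    """Average the one-vs-rest AUC over all classes."""
    aucs = []
    for c in range(n_classes):
        scores = [p[c] for p in probs]
        labels = [1 if t == c else 0 for t in targets]
        aucs.append(binary_auc(scores, labels))
    return sum(aucs) / n_classes

# Toy 3-class example: softmax rows and true class indices (illustrative data)
probs = [[0.8, 0.1, 0.1], [0.2, 0.7, 0.1], [0.3, 0.3, 0.4], [0.6, 0.2, 0.2]]
targets = [0, 1, 0, 2]
print(round(macro_ovr_auc(probs, targets, 3), 4))  # → 0.8056
```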

## Limitations and Considerations

1. **Audio Duration:** The model expects 3-second audio clips; longer clips are truncated, and shorter clips may not classify well
2. **Single Instrument Focus:** Optimized for single-instrument classification; mixed instruments may produce uncertain results
3. **Audio Quality:** Performance depends on audio quality and recording conditions
4. **Sample Rate:** Input must be resampled to 16 kHz for optimal performance
5. **Domain Specificity:** Trained on specific instrument recordings; may not generalize to all variants or playing styles

## Training Environment

- **Platform:** Google Colab
- **GPU:** CUDA-enabled device
- **Libraries:**
  - transformers==4.28.1
  - torchaudio==0.12
  - datasets
  - evaluate
  - imblearn

## Model Files

The repository contains:

- Model weights and configuration
- Feature extractor configuration
- Training logs and metrics
- Label mappings (id2label, label2id)

---

*Model trained as part of a hackathon project*
config.json ADDED
@@ -0,0 +1,133 @@
{
  "_attn_implementation_autoset": true,
  "_name_or_path": "Bhaveen/Musical-Instrument-Classification",
  "activation_dropout": 0.1,
  "adapter_attn_dim": null,
  "adapter_kernel_size": 3,
  "adapter_stride": 2,
  "add_adapter": false,
  "apply_spec_augment": true,
  "architectures": [
    "Wav2Vec2ForSequenceClassification"
  ],
  "attention_dropout": 0.1,
  "bos_token_id": 1,
  "classifier_proj_size": 256,
  "codevector_dim": 256,
  "contrastive_logits_temperature": 0.1,
  "conv_bias": false,
  "conv_dim": [
    512,
    512,
    512,
    512,
    512,
    512,
    512
  ],
  "conv_kernel": [
    10,
    3,
    3,
    3,
    3,
    2,
    2
  ],
  "conv_stride": [
    5,
    2,
    2,
    2,
    2,
    2,
    2
  ],
  "ctc_loss_reduction": "sum",
  "ctc_zero_infinity": false,
  "diversity_loss_weight": 0.1,
  "do_stable_layer_norm": false,
  "eos_token_id": 2,
  "feat_extract_activation": "gelu",
  "feat_extract_dropout": 0.0,
  "feat_extract_norm": "group",
  "feat_proj_dropout": 0.1,
  "feat_quantizer_dropout": 0.0,
  "final_dropout": 0.1,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout": 0.1,
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "Acoustic_Guitar",
    "1": "Bass_Guitar",
    "2": "Drum_set",
    "3": "Electro_Guitar",
    "4": "flute",
    "5": "Hi_Hats",
    "6": "Keyboard",
    "7": "Trumpet",
    "8": "Violin"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2,
    "LABEL_3": 3,
    "LABEL_4": 4,
    "LABEL_5": 5,
    "LABEL_6": 6,
    "LABEL_7": 7,
    "LABEL_8": 8
  },
  "layer_norm_eps": 1e-05,
  "layerdrop": 0.1,
  "mask_feature_length": 10,
  "mask_feature_min_masks": 0,
  "mask_feature_prob": 0.0,
  "mask_time_length": 10,
  "mask_time_min_masks": 2,
  "mask_time_prob": 0.05,
  "model_type": "wav2vec2",
  "num_adapter_layers": 3,
  "num_attention_heads": 12,
  "num_codevector_groups": 2,
  "num_codevectors_per_group": 320,
  "num_conv_pos_embedding_groups": 16,
  "num_conv_pos_embeddings": 128,
  "num_feat_extract_layers": 7,
  "num_hidden_layers": 12,
  "num_negatives": 100,
  "output_hidden_size": 768,
  "pad_token_id": 0,
  "proj_codevector_dim": 256,
  "tdnn_dilation": [
    1,
    2,
    3,
    1,
    1
  ],
  "tdnn_dim": [
    512,
    512,
    512,
    512,
    1500
  ],
  "tdnn_kernel": [
    5,
    3,
    3,
    1,
    1
  ],
  "torch_dtype": "float32",
  "transformers_version": "4.49.0",
  "use_weighted_layer_sum": false,
  "vocab_size": 32,
  "xvector_output_dim": 512
}
onnx/model.onnx ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:3adf7396021cb5f8a0cb45fce04aedba32a78ae3bc3a19226f9385608f8e1832
size 378610509
onnx/model_bnb4.onnx ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:2aaa291bdfa7c367620e43e5f810258f017d17db3254b03b1239c48db47557e0
size 84631623
onnx/model_fp16.onnx ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:a5da1ced37d21aaf885edf4f7ee33fc5f28b8469211a251f316964bb9837ad6d
size 189468524
onnx/model_int8.onnx ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:a826b4e026bde44dd279e220cf4b3281b270eda1c7f16e2027c9bea7d7a325f1
size 95389281
onnx/model_q4.onnx ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:db6034aaf2f9d28177d0d42311dab016440609a93d7afaa563f2c5ae71cc8fc8
size 89976361
onnx/model_q4f16.onnx ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:345b9a2c85a0a59d51ad9058f24e1a322fadc4d18a642a5b2aadd63ee29d13ef
size 66538151
onnx/model_quantized.onnx ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:82415992d0636e8ba045584a26b26717ff68aaa22114f73805edd308a7b39e6e
size 95389322
onnx/model_uint8.onnx ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:82415992d0636e8ba045584a26b26717ff68aaa22114f73805edd308a7b39e6e
size 95389322
preprocessor_config.json ADDED
@@ -0,0 +1,9 @@
{
  "do_normalize": true,
  "feature_extractor_type": "Wav2Vec2FeatureExtractor",
  "feature_size": 1,
  "padding_side": "right",
  "padding_value": 0.0,
  "return_attention_mask": false,
  "sampling_rate": 16000
}
quantize_config.json ADDED
@@ -0,0 +1,18 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "modes": [
3
+ "fp16",
4
+ "q8",
5
+ "int8",
6
+ "uint8",
7
+ "q4",
8
+ "q4f16",
9
+ "bnb4"
10
+ ],
11
+ "per_channel": false,
12
+ "reduce_range": false,
13
+ "block_size": null,
14
+ "is_symmetric": true,
15
+ "accuracy_level": null,
16
+ "quant_type": 1,
17
+ "op_block_list": null
18
+ }