---
license: apache-2.0
language:
- en
base_model: MIT/ast-finetuned-audioset-10-10-0.4593
tags:
- audio-classification
- music
- speech
- ast
- audio-spectrogram-transformer
pipeline_tag: audio-classification
datasets:
- AIGenLab/speech-music-82k
metrics:
- accuracy
library_name: transformers
---

# 🎵 AST Music vs Speech Classifier (82K)

Fine-tuned Audio Spectrogram Transformer (AST) for music vs. speech classification.

## Model Details

- **Base Model:** MIT/ast-finetuned-audioset-10-10-0.4593
- **Task:** Binary Audio Classification (Music vs Speech)
- **Training Dataset:** AIGenLab/speech-music-82k (82,000 samples)
- **Overall Accuracy:** 86.7% (26/30 on the 30-clip evaluation set below)

---

## 📊 Performance Results

| Category | Accuracy | Correct | Total |
|----------|----------|---------|-------|
| Pure Music | 100.0% | 10 | 10 |
| Pure Speech | 60.0% | 6 | 10 |
| Speech + Music | 100.0% | 10 | 10 |

Clips containing both speech and music are labeled MUSIC, so MUSIC predictions count as correct in the Speech + Music category.

### Pure Music

| File | Music Score | Speech Score | Prediction | Result |
|------|-------------|--------------|------------|--------|
| music_1.wav | 1.000 | 0.000 | MUSIC | ✅ |
| music_10.wav | 1.000 | 0.000 | MUSIC | ✅ |
| music_2.wav | 1.000 | 0.000 | MUSIC | ✅ |
| music_3.wav | 1.000 | 0.000 | MUSIC | ✅ |
| music_4.wav | 1.000 | 0.000 | MUSIC | ✅ |
| music_5.wav | 1.000 | 0.000 | MUSIC | ✅ |
| music_6.wav | 1.000 | 0.000 | MUSIC | ✅ |
| music_7.wav | 1.000 | 0.000 | MUSIC | ✅ |
| music_8.wav | 1.000 | 0.000 | MUSIC | ✅ |
| music_9.wav | 1.000 | 0.000 | MUSIC | ✅ |

### Pure Speech

| File | Music Score | Speech Score | Prediction | Result |
|------|-------------|--------------|------------|--------|
| speech_1.wav | 0.051 | 0.949 | SPEECH | ✅ |
| speech_10.wav | 0.039 | 0.961 | SPEECH | ✅ |
| speech_2.wav | 0.000 | 1.000 | SPEECH | ✅ |
| speech_3.wav | 0.372 | 0.628 | SPEECH | ✅ |
| speech_4.wav | 1.000 | 0.000 | MUSIC | ❌ |
| speech_5.wav | 0.997 | 0.003 | MUSIC | ❌ |
| speech_6.wav | 1.000 | 0.000 | MUSIC | ❌ |
| speech_7.wav | 0.170 | 0.830 | SPEECH | ✅ |
| speech_8.wav | 0.870 | 0.130 | MUSIC | ❌ |
| speech_9.wav | 0.035 | 0.965 | SPEECH | ✅ |

### Speech + Music

| File | Music Score | Speech Score | Prediction | Result |
|------|-------------|--------------|------------|--------|
| speech_and_music_1.wav | 1.000 | 0.000 | MUSIC | ✅ |
| speech_and_music_10.wav | 0.987 | 0.013 | MUSIC | ✅ |
| speech_and_music_2.wav | 1.000 | 0.000 | MUSIC | ✅ |
| speech_and_music_3wav.wav | 1.000 | 0.000 | MUSIC | ✅ |
| speech_and_music_4.wav | 1.000 | 0.000 | MUSIC | ✅ |
| speech_and_music_5.wav | 1.000 | 0.000 | MUSIC | ✅ |
| speech_and_music_6.wav | 1.000 | 0.000 | MUSIC | ✅ |
| speech_and_music_7.wav | 1.000 | 0.000 | MUSIC | ✅ |
| speech_and_music_8.wav | 1.000 | 0.000 | MUSIC | ✅ |
| speech_and_music_9.wav | 1.000 | 0.000 | MUSIC | ✅ |

---

## 🚀 Quick Start

```python
from transformers import pipeline

# Load the model
classifier = pipeline(
    "audio-classification",
    model="AIGenLab/AST-speech-and-music-classifier-82K"
)

# Classify audio
result = classifier("your_audio.wav")
print(result)
```

---

## 🔧 Advanced Usage

```python
from transformers import AutoModelForAudioClassification, AutoFeatureExtractor
import torch
import torchaudio

# Load model and feature extractor
model = AutoModelForAudioClassification.from_pretrained(
    "AIGenLab/AST-speech-and-music-classifier-82K"
)
feature_extractor = AutoFeatureExtractor.from_pretrained(
    "AIGenLab/AST-speech-and-music-classifier-82K"
)

# Load audio and resample to the 16 kHz the model expects
audio, sr = torchaudio.load("audio.wav")
if sr != 16000:
    audio = torchaudio.functional.resample(audio, sr, 16000)
audio = audio.mean(dim=0)  # downmix multi-channel audio to mono

# Process
inputs = feature_extractor(
    audio.numpy(),
    sampling_rate=16000,
    return_tensors="pt"
)

# Predict (label order: index 0 = MUSIC, index 1 = SPEECH)
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)

music_score = predictions[0][0].item()
speech_score = predictions[0][1].item()

print(f"Music: {music_score:.3f}")
print(f"Speech: {speech_score:.3f}")
```

---

## 📊 Training Details

| Parameter | Value |
|-----------|-------|
| Base Model | MIT/ast-finetuned-audioset-10-10-0.4593 |
| Dataset | AIGenLab/speech-music-82k (82,000 samples) |
| Epochs | 1 |
| Batch Size | 64 |
| Learning Rate | 3e-5 |
| Loss Weight | Music: 2.5x, Speech: 1.0x |
| Optimizer | AdamW |
| Framework | Transformers + PyTorch |
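
The class-weighted loss in the table above can be sketched with plain PyTorch. This is a minimal illustration, not the training script: the 2.5x music weight and the label order (index 0 = MUSIC, index 1 = SPEECH) follow this card, while the logits and targets below are made-up example values.

```python
import torch
import torch.nn as nn

# Class-weighted cross-entropy as described in the training table:
# errors on MUSIC samples are penalized 2.5x relative to SPEECH errors.
# Assumed label order (per this card): index 0 = MUSIC, index 1 = SPEECH.
class_weights = torch.tensor([2.5, 1.0])
weighted_loss_fn = nn.CrossEntropyLoss(weight=class_weights)

# Illustrative batch: both samples get SPEECH-leaning logits,
# but the first one is actually MUSIC (a misclassification).
logits = torch.tensor([[0.2, 1.5],
                       [0.2, 1.5]])
targets = torch.tensor([0, 1])  # MUSIC, SPEECH

loss = weighted_loss_fn(logits, targets)
unweighted = nn.CrossEntropyLoss()(logits, targets)

# The weighted loss is larger because the misclassified MUSIC
# sample contributes with a 2.5x weight.
print(f"weighted:   {loss.item():.4f}")
print(f"unweighted: {unweighted.item():.4f}")
```

Passing this loss weight counters the model's bias toward the SPEECH class; the Pure Speech errors in the evaluation above suggest the trade-off leans toward high recall on music.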