---
license: apache-2.0
language:
- en
base_model: MIT/ast-finetuned-audioset-10-10-0.4593
tags:
- audio-classification
- music
- speech
- ast
- audio-spectrogram-transformer
pipeline_tag: audio-classification
datasets:
- AIGenLab/speech-music-1k
metrics:
- accuracy
library_name: transformers
---

# 🎵 AST Music vs Speech Classifier

Fine-tuned Audio Spectrogram Transformer (AST) for music vs. speech classification.

## Model Details

- **Base Model:** MIT/ast-finetuned-audioset-10-10-0.4593
- **Task:** Binary Audio Classification (Music vs Speech)
- **Training Dataset:** AIGenLab/speech-music-1k (1000 samples)
- **Overall Accuracy:** 90.0% (27/30 on the evaluation clips below)

---

## 📊 Performance Results

| Category | Accuracy | Correct | Total |
|----------|----------|---------|-------|
| Pure Music | 100.0% | 10 | 10 |
| Pure Speech | 70.0% | 7 | 10 |
| Speech + Music | 100.0% | 10 | 10 |

### Pure Music

| File | Music Score | Speech Score | Prediction | Result |
|------|-------------|--------------|------------|--------|
| music_1.wav | 1.000 | 0.000 | MUSIC | ✅ |
| music_10.wav | 1.000 | 0.000 | MUSIC | ✅ |
| music_2.wav | 1.000 | 0.000 | MUSIC | ✅ |
| music_3.wav | 1.000 | 0.000 | MUSIC | ✅ |
| music_4.wav | 1.000 | 0.000 | MUSIC | ✅ |
| music_5.wav | 1.000 | 0.000 | MUSIC | ✅ |
| music_6.wav | 1.000 | 0.000 | MUSIC | ✅ |
| music_7.wav | 1.000 | 0.000 | MUSIC | ✅ |
| music_8.wav | 1.000 | 0.000 | MUSIC | ✅ |
| music_9.wav | 1.000 | 0.000 | MUSIC | ✅ |

### Pure Speech

| File | Music Score | Speech Score | Prediction | Result |
|------|-------------|--------------|------------|--------|
| speech_1.wav | 0.000 | 1.000 | SPEECH | ✅ |
| speech_10.wav | 0.002 | 0.998 | SPEECH | ✅ |
| speech_2.wav | 0.000 | 1.000 | SPEECH | ✅ |
| speech_3.wav | 0.714 | 0.286 | MUSIC | ❌ |
| speech_4.wav | 0.906 | 0.094 | MUSIC | ❌ |
| speech_5.wav | 0.350 | 0.650 | SPEECH | ✅ |
| speech_6.wav | 0.895 | 0.105 | MUSIC | ❌ |
| speech_7.wav | 0.068 | 0.932 | SPEECH | ✅ |
| speech_8.wav | 0.097 | 0.903 | SPEECH | ✅ |
| speech_9.wav | 0.083 | 0.917 | SPEECH | ✅ |

### Speech + Music

| File | Music Score | Speech Score | Prediction | Result |
|------|-------------|--------------|------------|--------|
| speech_and_music_1.wav | 0.995 | 0.005 | MUSIC | ✅ |
| speech_and_music_10.wav | 0.987 | 0.013 | MUSIC | ✅ |
| speech_and_music_2.wav | 1.000 | 0.000 | MUSIC | ✅ |
| speech_and_music_3wav.wav | 1.000 | 0.000 | MUSIC | ✅ |
| speech_and_music_4.wav | 1.000 | 0.000 | MUSIC | ✅ |
| speech_and_music_5.wav | 0.998 | 0.002 | MUSIC | ✅ |
| speech_and_music_6.wav | 1.000 | 0.000 | MUSIC | ✅ |
| speech_and_music_7.wav | 1.000 | 0.000 | MUSIC | ✅ |
| speech_and_music_8.wav | 1.000 | 0.000 | MUSIC | ✅ |
| speech_and_music_9.wav | 1.000 | 0.000 | MUSIC | ✅ |

---

## 🚀 Quick Start

```python
from transformers import pipeline

# Load the fine-tuned model
classifier = pipeline(
    "audio-classification",
    model="AIGenLab/AST-speech-and-music-classifier-1K"
)

# Classify an audio file
result = classifier("your_audio.wav")
print(result)
```

---

## 🔧 Advanced Usage

```python
import torch
import torchaudio
from transformers import AutoModelForAudioClassification, AutoFeatureExtractor

# Load model and feature extractor
model = AutoModelForAudioClassification.from_pretrained(
    "AIGenLab/AST-speech-and-music-classifier-1K"
)
feature_extractor = AutoFeatureExtractor.from_pretrained(
    "AIGenLab/AST-speech-and-music-classifier-1K"
)

# Load audio and resample to the 16 kHz the model expects
audio, sr = torchaudio.load("audio.wav")
if sr != 16000:
    audio = torchaudio.functional.resample(audio, sr, 16000)

# Downmix multi-channel audio to mono (squeeze alone only handles mono files)
audio = audio.mean(dim=0)

# Extract spectrogram features
inputs = feature_extractor(
    audio.numpy(),
    sampling_rate=16000,
    return_tensors="pt"
)

# Predict
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)

music_score = predictions[0][0].item()
speech_score = predictions[0][1].item()

print(f"Music: {music_score:.3f}")
print(f"Speech: {speech_score:.3f}")
```

---

## 📊 Training Details

| Parameter | Value |
|-----------|-------|
| Base Model | MIT/ast-finetuned-audioset-10-10-0.4593 |
| Dataset | AIGenLab/speech-music-1k (1000 samples) |
| Epochs | 1 |
| Batch Size | 64 |
| Learning Rate | 3e-5 |
| Loss Weight | Music: 2.5x, Speech: 1.0x |
| Optimizer | AdamW |
| Framework | Transformers + PyTorch |