---
license: apache-2.0
base_model: MIT/ast-finetuned-audioset-10-10-0.4593
tags:
  - audio-classification
  - non-speech-sounds
  - ast
  - fine-tuned
datasets:
  - nonspeech7k
language:
  - en
pipeline_tag: audio-classification
---

# AST Fine-tuned for Non-Speech Sound Classification

This model is a fine-tuned version of [MIT/ast-finetuned-audioset-10-10-0.4593](https://huggingface.co/MIT/ast-finetuned-audioset-10-10-0.4593) on the Nonspeech7k dataset.

## Model Details

- **Base Model:** MIT/ast-finetuned-audioset-10-10-0.4593
- **Fine-tuned on:** Nonspeech7k dataset
- **Classes:** breath, cough, crying, laugh, screaming, sneeze, yawn
- **Sample rate:** 16 kHz
- **Input length:** 10 seconds (160,000 samples)
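
Clips shorter or longer than 10 seconds are zero-padded or truncated to the fixed 160,000-sample window. The feature extractor handles this internally; purely as a conceptual sketch (not the library's actual code), the idea is:

```python
TARGET_SAMPLES = 160_000  # 10 s of audio at 16 kHz

def fix_length(samples, target=TARGET_SAMPLES):
    """Truncate, or zero-pad on the right, to exactly `target` samples."""
    if len(samples) >= target:
        return samples[:target]
    return samples + [0.0] * (target - len(samples))
```
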

## Usage

```python
from transformers import ASTFeatureExtractor, ASTForAudioClassification
import torch
import torchaudio

# Load model and feature extractor
feature_extractor = ASTFeatureExtractor.from_pretrained("FizzyBrain/ast-nonspeech7k-finetuned")
model = ASTForAudioClassification.from_pretrained("FizzyBrain/ast-nonspeech7k-finetuned")
model.eval()

# Load and preprocess audio: the model expects a mono, 16 kHz, 1-D signal
waveform, sample_rate = torchaudio.load("audio.wav")
waveform = waveform.mean(dim=0)  # collapse channels to mono
if sample_rate != 16000:
    waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)
inputs = feature_extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

# Predict
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_class_id = predictions.argmax(dim=-1).item()
```
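
To turn the predicted index back into a class name, look it up in the model's `id2label` mapping. A sketch with a hypothetical index order (the authoritative mapping is whatever `model.config.id2label` contains):

```python
# Hypothetical index order -- the real mapping lives in model.config.id2label
id2label = {0: "breath", 1: "cough", 2: "crying", 3: "laugh",
            4: "screaming", 5: "sneeze", 6: "yawn"}

predicted_class_id = 3  # e.g. the argmax from the snippet above
print(id2label[predicted_class_id])  # -> laugh
```
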

## Classes

  1. breath
  2. cough
  3. crying
  4. laugh
  5. screaming
  6. sneeze
  7. yawn

## Training Details

- Fine-tuned with data augmentation
- Class-weighted loss to counter class imbalance
- Layer-wise learning-rate decay
- Early stopping with macro-F1 monitoring
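
The class-weighted loss and layer-wise learning-rate decay can be sketched as follows. These are illustrative helpers, not the actual training code; the inverse-frequency weighting scheme and the decay factor are assumptions:

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """One weight per class, inversely proportional to its frequency,
    normalized so the weights average to 1 across classes."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * counts[c]) for c in counts}

def layerwise_lrs(num_layers, base_lr, decay):
    """Per-layer learning rates: the top layer gets base_lr, and each
    layer below it gets the rate multiplied by `decay` once more."""
    return [base_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]
```

The class weights would typically be passed to something like `torch.nn.CrossEntropyLoss(weight=...)`, and the per-layer rates assigned via separate optimizer parameter groups.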