# Wav2Vec2 Audio Intelligence

A Wav2Vec2-based model for audio quality classification and ASR reliability prediction.
## Model Description
This model combines a pretrained Wav2Vec2-base encoder with two task-specific heads:
**Audio Quality Classifier**: Assigns the input to one of five audio quality classes (clean audio plus four degradation types):
- Class 1: Clean audio
- Class 2: Packet loss artifacts
- Class 3: Background noise
- Class 4: Room reverberation
- Class 5: Telephone echo
**WER Predictor**: Estimates the Word Error Rate (WER) that an ASR system would produce on the input audio, without needing actual transcription.
## Usage
```python
import torch
import librosa
from transformers import Wav2Vec2FeatureExtractor
from wav2vec_audio_intel import Wav2Vec2AudioIntelligence, AudioIntelligenceConfig

# Load model and processor
model = Wav2Vec2AudioIntelligence.from_pretrained("jasonlee-sf/wav2vec_audio_intel")
processor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")

# Load audio (16 kHz, mono)
audio, sr = librosa.load("your_audio.wav", sr=16000)

# Get predictions
inputs = processor(audio, sampling_rate=16000, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(inputs.input_values)

# Classification results
probs = outputs.classifier_probs[0]
predicted_class = probs.argmax().item() + 1  # Classes are 1-indexed
label_names = {1: "Clean", 2: "Packet Loss", 3: "Background Noise", 4: "Room Reverb", 5: "Telephone Echo"}
print(f"Predicted: {label_names[predicted_class]} ({probs[predicted_class - 1]:.1%})")

# WER prediction
predicted_wer = outputs.wer_prediction[0].item()
print(f"Predicted WER: {predicted_wer:.1%}")
```
## Model Architecture

- **Base**: facebook/wav2vec2-base (768-dim hidden states)
- **Classifier Head**: 768 → 512 → 256 → 5 (with ReLU activations)
- **Regressor Head**: 768 → 512 → 256 → 1 (with ReLU activations)
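The two heads above can be sketched in PyTorch as plain MLPs over the encoder output. This is an illustrative reconstruction, not the released implementation: the mean-pooling over time and the absence of dropout are assumptions; only the layer widths (768 → 512 → 256 → out) and ReLU activations come from the card.

```python
import torch
import torch.nn as nn

def make_head(out_dim: int) -> nn.Sequential:
    """Build a 768 -> 512 -> 256 -> out_dim MLP with ReLU activations."""
    return nn.Sequential(
        nn.Linear(768, 512), nn.ReLU(),
        nn.Linear(512, 256), nn.ReLU(),
        nn.Linear(256, out_dim),
    )

classifier_head = make_head(5)  # five audio quality classes
regressor_head = make_head(1)   # scalar WER estimate

# Wav2Vec2-base emits (batch, time, 768) hidden states; here we
# mean-pool over time (an assumed pooling strategy) before the heads.
hidden = torch.randn(2, 49, 768)
pooled = hidden.mean(dim=1)                         # (2, 768)
class_logits = classifier_head(pooled)              # (2, 5)
wer_estimate = regressor_head(pooled).squeeze(-1)   # (2,)
```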
## Training Data
The classifier and regressor heads were trained on audio samples with synthetic augmentations:
- Packet loss simulation (bursty network conditions)
- Background noise mixing (various SNR levels)
- Room impulse response convolution
- Telephone echo simulation
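As a concrete illustration of the first augmentation, the sketch below zeros out random fixed-size frames to mimic bursty packet drops. The frame size and loss rate are illustrative assumptions, not the training configuration used for this model.

```python
import numpy as np

def simulate_packet_loss(audio: np.ndarray, sr: int = 16000,
                         loss_rate: float = 0.05, frame_ms: int = 20,
                         seed: int = 0) -> np.ndarray:
    """Zero out random frames to mimic bursty packet drops."""
    rng = np.random.default_rng(seed)
    frame = int(sr * frame_ms / 1000)  # samples per simulated packet
    out = audio.copy()
    for start in range(0, len(out), frame):
        if rng.random() < loss_rate:
            out[start:start + frame] = 0.0  # drop this packet
    return out

audio = np.random.randn(16000).astype(np.float32)  # 1 s of noise as a stand-in
degraded = simulate_packet_loss(audio, loss_rate=0.5)
```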
## Intended Use

This model is designed for:
- **Call center quality monitoring**: Detect degraded audio in real time
- **ASR preprocessing**: Skip or flag unreliable audio before transcription
- **Audio pipeline debugging**: Identify sources of audio quality issues
## Limitations
- Trained primarily on English speech
- Expects 16kHz mono audio input
- WER predictions are estimates and may vary depending on the ASR system used
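Since the model expects 16 kHz mono input, audio in other formats needs to be downmixed and resampled first. The sketch below does this with a toy linear-interpolation resampler; in practice `librosa.load(..., sr=16000)` (as in the usage example) or a dedicated resampler would give better quality. The `(channels, samples)` layout is an assumption matching librosa's convention.

```python
import numpy as np

def prepare_audio(audio: np.ndarray, sr: int, target_sr: int = 16000) -> np.ndarray:
    """Downmix to mono and resample via linear interpolation (toy sketch)."""
    if audio.ndim == 2:                 # (channels, samples) -> mono
        audio = audio.mean(axis=0)
    if sr != target_sr:
        n_out = int(round(len(audio) * target_sr / sr))
        x_old = np.linspace(0.0, 1.0, num=len(audio), endpoint=False)
        x_new = np.linspace(0.0, 1.0, num=n_out, endpoint=False)
        audio = np.interp(x_new, x_old, audio)
    return audio.astype(np.float32)

stereo_44k = np.random.randn(2, 44100).astype(np.float32)  # 1 s stereo @ 44.1 kHz
mono_16k = prepare_audio(stereo_44k, sr=44100)
```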
## Citation

If you use this model, please cite:
```bibtex
@misc{wav2vec_audio_intel,
  title={Wav2Vec2 Audio Intelligence: Audio Quality Classification and WER Prediction},
  author={Salesforce Research},
  year={2025},
  url={https://huggingface.co/jasonlee-sf/wav2vec_audio_intel}
}
```