Wav2Vec2 Audio Intelligence

A Wav2Vec2-based model for audio quality classification and ASR reliability prediction.

Model Description

This model combines a pretrained Wav2Vec2-base encoder with two task-specific heads:

  1. Audio Quality Classifier: Assigns one of 5 classes, covering clean audio and four types of degradation

    • Class 1: Clean audio
    • Class 2: Packet loss artifacts
    • Class 3: Background noise
    • Class 4: Room reverberation
    • Class 5: Telephone echo
  2. WER Predictor: Estimates the Word Error Rate (WER) that an ASR system would produce on the input audio, without needing actual transcription.

Usage

import torch
from transformers import Wav2Vec2FeatureExtractor
from wav2vec_audio_intel import Wav2Vec2AudioIntelligence, AudioIntelligenceConfig

# Load model and processor
model = Wav2Vec2AudioIntelligence.from_pretrained("jasonlee-sf/wav2vec_audio_intel")
processor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")

# Process audio (16kHz, mono)
import librosa
audio, sr = librosa.load("your_audio.wav", sr=16000)

# Get predictions
inputs = processor(audio, sampling_rate=16000, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(inputs.input_values)

# Classification results
probs = outputs.classifier_probs[0]
predicted_class = probs.argmax().item() + 1  # Classes are 1-indexed
label_names = {1: "Clean", 2: "Packet Loss", 3: "Background Noise", 4: "Room Reverb", 5: "Telephone Echo"}
print(f"Predicted: {label_names[predicted_class]} ({probs[predicted_class-1]:.1%})")

# WER prediction
predicted_wer = outputs.wer_prediction[0].item()
print(f"Predicted WER: {predicted_wer:.1%}")

Model Architecture

  • Base: facebook/wav2vec2-base (768-dim hidden states)
  • Classifier Head: 768 → 512 → 256 → 5 (with ReLU activations)
  • Regressor Head: 768 → 512 → 256 → 1 (with ReLU activations)
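The two heads above share the same MLP shape and differ only in output width. A minimal PyTorch sketch of that layout (layer sizes taken from the architecture notes; the actual head definitions in the released weights may differ):

```python
import torch
import torch.nn as nn

def make_head(out_dim: int) -> nn.Sequential:
    # 768 -> 512 -> 256 -> out_dim, with ReLU between linear layers
    return nn.Sequential(
        nn.Linear(768, 512), nn.ReLU(),
        nn.Linear(512, 256), nn.ReLU(),
        nn.Linear(256, out_dim),
    )

classifier_head = make_head(5)   # 5 audio quality classes
regressor_head = make_head(1)    # scalar WER estimate

# Pooled 768-dim Wav2Vec2 encoder output for a batch of 2 utterances
pooled = torch.randn(2, 768)
logits = classifier_head(pooled)          # shape (2, 5)
wer = regressor_head(pooled).squeeze(-1)  # shape (2,)
```

Both heads consume the same pooled encoder representation, so the encoder forward pass is shared between the two tasks.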

Training Data

The classifier and regressor heads were trained on audio samples with synthetic augmentations:

  • Packet loss simulation (bursty network conditions)
  • Background noise mixing (various SNR levels)
  • Room impulse response convolution
  • Telephone echo simulation
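As an illustration of the augmentation style listed above, here is a hedged sketch of background noise mixing at a target SNR. The function name and recipe are illustrative, not the actual training code:

```python
import numpy as np

def mix_noise(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so the mixture reaches the requested SNR, then add it."""
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12  # avoid division by zero
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)  # 1 s of "speech" at 16 kHz
noise = rng.standard_normal(16000)
noisy = mix_noise(speech, noise, snr_db=10.0)
```

Sweeping `snr_db` over a range during training exposes the classifier and WER regressor to varying degrees of corruption.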

Intended Use

This model is designed for:

  • Call center quality monitoring: Detect degraded audio in real-time
  • ASR preprocessing: Skip or flag unreliable audio before transcription
  • Audio pipeline debugging: Identify sources of audio quality issues
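For the ASR preprocessing use case, a simple gate on the predicted WER can decide whether an utterance is worth transcribing. The threshold value below is an assumption for illustration, not a recommended setting:

```python
# Hypothetical gating threshold; tune against your ASR system and data.
WER_THRESHOLD = 0.30

def should_transcribe(predicted_wer: float, threshold: float = WER_THRESHOLD) -> bool:
    """Return True when the predicted WER is low enough to trust the ASR output."""
    return predicted_wer < threshold
```

Audio that fails the gate can be flagged for re-recording or routed to a human reviewer instead of the ASR system.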

Limitations

  • Trained primarily on English speech
  • Expects 16kHz mono audio input
  • WER predictions are estimates and may vary based on ASR system used

Citation

If you use this model, please cite:

@misc{wav2vec_audio_intel,
  title={Wav2Vec2 Audio Intelligence: Audio Quality Classification and WER Prediction},
  author={Salesforce Research},
  year={2025},
  url={https://huggingface.co/jasonlee-sf/wav2vec_audio_intel}
}