# WAVe-1B-Multimodal-NL: Word-Aligned Speech Quality Assessment
## Model Description

WAVe-1B-Multimodal-NL is a 1-billion-parameter multimodal embedding model that assesses the quality of synthetic speech by measuring speech-transcript alignment at the word level. It was trained on Dutch data to identify high-quality synthetic audio suitable for ASR training.
The model combines:
- Text Encoder: XLM-RoBERTa (278M params) - multilingual text understanding
- Audio Encoder: Wav2Vec2-BERT 2.0 (581M params) - robust speech representations
- Word-Level Alignment: Multi-head attention + GLU scoring (14M params) - the core innovation
- Projection Layers: Enhanced 2-layer MLPs (10M params) - shared embedding space
## Key Innovation
Unlike sentence-level approaches, WAVe uses attention-based word-level alignment to detect subtle quality issues:
- Unnatural prosody or timing
- Mispronunciations or synthesis errors
- Text-audio mismatches
- Poor audio quality
## Performance Highlights
- 34% reduction in training steps for downstream ASR models
- 50% improvement in cross-domain generalization
- 30% less synthetic data needed compared to baseline filtering
- Effective detection of localized errors missed by sentence-level methods
## How It Works

```python
from transformers import AutoModel, AutoProcessor
import torch

# Load model and processor
processor = AutoProcessor.from_pretrained("yuriyvnv/WAVe-1B-Multimodal-NL")
model = AutoModel.from_pretrained("yuriyvnv/WAVe-1B-Multimodal-NL")

# Prepare inputs
text = "Hallo, hoe gaat het met u?"
# audio = <your 16kHz mono audio array>
inputs = processor(text=text, audio=audio, sampling_rate=16000, return_tensors="pt")

# Get quality assessment
with torch.no_grad():
    outputs = model(**inputs)
quality_score = outputs.quality_score.item()

# Interpret results
if quality_score >= 0.8:
    print(f"✓ HIGH QUALITY ({quality_score:.3f}) - Safe for ASR training")
elif quality_score >= 0.5:
    print(f"⚠ MEDIUM QUALITY ({quality_score:.3f}) - Review manually")
else:
    print(f"✗ LOW QUALITY ({quality_score:.3f}) - Discard")
```
## Model Architecture

### Components
Text Encoder: XLM-RoBERTa (paraphrase-multilingual-mpnet-base-v2)
- Hidden size: 768
- Multilingual support
Audio Encoder: Wav2Vec2-BERT 2.0 (facebook/w2v-bert-2.0)
- Hidden size: 1024
- Robust speech representations
Word-Level Alignment Module:
- Multi-head attention (6 heads)
- Multi-head Gated Linear Units (4 heads)
- Learned temperature scaling
Projection Layers:
- 2-layer MLPs with expansion-compression
- Output dimension: 768
- LayerNorm + GELU activation
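As an illustration of the word-level alignment idea, here is a minimal single-head sketch in pure Python with toy values: each token embedding attends over the audio frame embeddings via temperature-scaled dot products, and its alignment score is the attention-weighted similarity. This is a simplification, not the model's actual module (which uses 6 attention heads, multi-head GLU scoring, and a learned temperature).

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def word_alignment_scores(token_vecs, frame_vecs, temperature=1.0):
    """Single-head sketch: for each token, attend over audio frames and
    return the attention-weighted token-to-frame similarity."""
    scores = []
    for t in token_vecs:
        sims = [sum(a * b for a, b in zip(t, f)) / temperature for f in frame_vecs]
        weights = softmax(sims)
        scores.append(sum(w * s for w, s in zip(weights, sims)))
    return scores

# Toy example: 2 tokens, 3 frames, 2-dimensional embeddings
tokens = [[1.0, 0.0], [0.0, 1.0]]
frames = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
scores = word_alignment_scores(tokens, frames)
```

A token whose best-matching frames are similar to it receives a high score; a mispronounced or missing word has no well-matching frames and scores low, which is what lets the model localize errors per word.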
## Training Details

- Dataset: Dutch CommonVoice 16.1 + synthetic data
- Corruption strategies: 5 types (see the code on GitHub)
- Training objective: Contrastive learning + word-level supervision
- Encoder fine-tuning: last 3 layers unfrozen
- Optimization: AdamW with cosine scheduling
- Training epochs: 30 (best checkpoint selected at epoch 28 by similarity gap)
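The cosine schedule mentioned above decays the learning rate along a half cosine. A generic sketch follows; the `max_lr`/`min_lr` values are illustrative assumptions, not published hyperparameters:

```python
import math

def cosine_lr(step, total_steps, max_lr=1e-4, min_lr=1e-6):
    """Cosine annealing from max_lr down to min_lr over total_steps."""
    progress = min(step / total_steps, 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

print(cosine_lr(0, 1000))     # starts at max_lr
print(cosine_lr(1000, 1000))  # ends at min_lr
```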
## Training Metrics
The best checkpoint was selected at epoch 28 based on the similarity gap metric. The model shows strong separation between clean and corrupted speech.
| Metric | Value |
|---|---|
| Loss | 0.4100 |
| Similarity Gap | 0.4109 |
| Clean Similarity | 0.7722 |
| Corrupt Similarity | 0.3613 |
| Alignment Gap | 0.1297 |
| Positive Alignment | 0.2279 |
| Negative Alignment | 0.0981 |
The model achieves a clean similarity of 0.77 while pushing corrupt similarity down to 0.36, producing a gap of 0.41. The alignment gap of 0.13 indicates that word-level scoring provides an additional discriminative signal beyond the sentence-level similarity.
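Both gap metrics are plain differences of the table values, which can be checked directly:

```python
clean_similarity = 0.7722
corrupt_similarity = 0.3613
similarity_gap = clean_similarity - corrupt_similarity  # 0.4109

positive_alignment = 0.2279
negative_alignment = 0.0981
alignment_gap = positive_alignment - negative_alignment  # 0.1298 (0.1297 in the table, rounding)
```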
## Outputs

The model returns comprehensive quality metrics:

```python
outputs.quality_score        # Overall quality [0, 1] - PRIMARY METRIC
outputs.cosine_similarity    # Normalized cosine similarity
outputs.mean_alignment_score # Average word-level alignment [0, 1]
outputs.text_embeds          # Text embeddings (batch_size, 768)
outputs.audio_embeds         # Audio embeddings (batch_size, 768)
outputs.alignment_scores     # Per-word quality (batch_size, num_tokens)
outputs.alignment_matrix     # Word-to-frame attention (batch_size, tokens, frames)
```
## Quality Score Formula

```python
# Step 1: Normalize cosine similarity to [0, 1]
cosine_normalized = (cosine_similarity + 1.0) / 2.0

# Step 2: Compute word-level alignment score
alignment_normalized = sigmoid(alignment_scores).mean()

# Step 3: Combine both signals
quality_score = (cosine_normalized + alignment_normalized) / 2.0
```
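The three steps combine into a self-contained helper (a pure-Python transcription of the formula, with the sigmoid applied elementwise to the raw per-word scores):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def quality_score(cosine_similarity, alignment_scores):
    """Combine sentence-level similarity and word-level alignment into [0, 1]."""
    cosine_normalized = (cosine_similarity + 1.0) / 2.0
    alignment_normalized = sum(sigmoid(s) for s in alignment_scores) / len(alignment_scores)
    return (cosine_normalized + alignment_normalized) / 2.0

# Strong similarity plus positive per-word scores yields a high score
q = quality_score(0.8, [2.0, 1.5, 3.0])
```

Because both inputs are squashed to [0, 1] before averaging, neither signal can dominate: a high sentence similarity cannot mask consistently negative word-level scores.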
## Recommended Thresholds
Based on our experiments:
| Quality Score | Label | Recommendation |
|---|---|---|
| 0.8 - 1.0 | High | Safe for ASR training |
| 0.5 - 0.8 | Medium | Review manually or use with caution |
| 0.0 - 0.5 | Low | Discard from training set |
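Expressed as a small helper mirroring the table:

```python
def quality_label(score):
    """Map a quality score in [0, 1] to the recommended action."""
    if score >= 0.8:
        return "high"    # safe for ASR training
    elif score >= 0.5:
        return "medium"  # review manually or use with caution
    else:
        return "low"     # discard from training set
```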
## Usage Examples

### Batch Processing

```python
# Process multiple samples efficiently
texts = ["Tekst 1", "Tekst 2", "Tekst 3"]
audios = [audio1, audio2, audio3]

inputs = processor(
    text=texts,
    audio=audios,
    sampling_rate=16000,
    padding=True,
    return_tensors="pt",
)

with torch.no_grad():
    outputs = model(**inputs)

quality_scores = outputs.quality_score  # Shape: (3,)

# Filter high-quality samples (1-D index tensor)
high_quality_indices = (quality_scores >= 0.8).nonzero(as_tuple=True)[0]
```
### Dataset Filtering

```python
from datasets import load_dataset

dataset = load_dataset("your-synthetic-dataset", split="train")

filtered = []
for sample in dataset:
    inputs = processor(
        text=sample["text"],
        audio=sample["audio"]["array"],
        sampling_rate=16000,
        return_tensors="pt",
    )
    with torch.no_grad():
        quality = model(**inputs).quality_score.item()
    if quality >= 0.8:
        filtered.append(sample)

print(f"Kept {len(filtered)}/{len(dataset)} samples ({100*len(filtered)/len(dataset):.1f}%)")
```
## Citation

Paper coming soon.
## Related Work
- Paper: [Coming soon]
- Code: https://github.com/yuriyvnv/WAVe
- Models: https://huggingface.co/yuriyvnv
## Other Models in the Series

- yuriyvnv/WAVe-1B-Multimodal-PT - Portuguese language version
- yuriyvnv/whisper-large-v3-high-mixed-nl - ASR trained on WAVe-filtered data
- yuriyvnv/whisper-small-high-mixed-nl - ASR baseline without filtering
## License
Apache 2.0
## Acknowledgements
- Text encoder: sentence-transformers/paraphrase-multilingual-mpnet-base-v2
- Audio encoder: facebook/w2v-bert-2.0
- Training data: Mozilla CommonVoice 16.1 Dutch