AI & ML interests

None defined yet.

Recent Activity

yuriyvnv updated a model about 2 months ago
remynd/whisper-large-v3-pt
yuriyvnv updated a model about 2 months ago
remynd/whisper-small-pt
yuriyvnv updated a model about 2 months ago
remynd/whisper-medium-pt

yuriyvnv posted an update 18 days ago
🎯 WAVe-1B-Multimodal-NL: Word-Level Speech Quality Assessment for Dutch

Following the release of the Portuguese model, we're releasing the Dutch variant of WAVe: a 1B multimodal embedding model that assesses synthetic speech quality at the word level, thereby improving the quality of synthetically augmented datasets for training ASR models.

Trained on CommonVoice 16.1 Dutch with 5 corruption strategies, this model catches mispronunciations, timing errors, and prosody issues in synthetic data that sentence-level embeddings miss entirely.
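The post mentions five corruption strategies but doesn't name them. As a rough illustration of what word-level corruptions for generating negative training pairs can look like, here is a minimal sketch with three hypothetical strategies (substitution, deletion, repetition); these are assumptions for illustration, not the actual WAVe training code.

```python
import random

# Hypothetical word-level corruption strategies for building negative
# (mismatched) speech-transcript training pairs. Illustrative only.

def corrupt_substitute(words, vocab, rng):
    """Replace one word with a random vocabulary word (mispronunciation proxy)."""
    i = rng.randrange(len(words))
    out = list(words)
    out[i] = rng.choice(vocab)
    return out

def corrupt_delete(words, rng):
    """Drop one word, simulating a skipped word in synthesis."""
    i = rng.randrange(len(words))
    return words[:i] + words[i + 1:]

def corrupt_repeat(words, rng):
    """Duplicate one word, simulating a stutter or timing error."""
    i = rng.randrange(len(words))
    return words[:i + 1] + words[i:]

rng = random.Random(0)
words = "de kat zit op de mat".split()
print(corrupt_delete(words, rng))
```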
Resources

- Dutch model: yuriyvnv/WAVe-1B-Multimodal-NL
- Portuguese model: yuriyvnv/WAVe-1B-Multimodal-PT
- Code: https://github.com/yuriyvnv/WAVe

This model builds on CommonVoice Dutch data; thanks to @mozilla and the CommonVoice community for making multilingual speech data accessible.

Would be great to hear from the Dutch NLP community (@BramVanroy @GroNLP), especially if you're working on Dutch ASR or TTS pipelines where quality filtering could help. Also tagging @hf-audio, as this sits at the intersection of speech processing and data curation.
yuriyvnv posted an update about 1 month ago
🎯 WAVe: 1B Multimodal Embedding Model for Word-Level Speech Quality

Multimodal embeddings for speech + transcript that verify quality at the word level, not just sentence level. Catches mispronunciations, timing errors, and prosody issues that sentence-level filters miss.

📊 Impact on Portuguese ASR:
• 34% reduction in training steps
• 50% better cross-domain generalization
• 30% less synthetic data needed
• Word-aligned attention finds errors other methods miss

πŸ—οΈ Architecture:
β€’ Text: XLM-RoBERTa (278M params)
β€’ Audio: Wav2Vec2-BERT 2.0 (581M params)
β€’ Word Alignment: Multi-head attention + GLU (14M params)
β€’ Total: 1B parameters
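For readers unfamiliar with the GLU in the word-alignment head, here is a minimal sketch of the split-channel gated linear unit (values gated elementwise by a sigmoid); the dimensions and inputs are illustrative assumptions, not the actual WAVe parameters.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def glu(x):
    """Split-channel GLU: first half of x = values, second half = gates.
    Output i = value_i * sigmoid(gate_i)."""
    h = len(x) // 2
    values, gates = x[:h], x[h:]
    return [v * sigmoid(g) for v, g in zip(values, gates)]

# Toy input: 2 value channels gated by 2 gate channels.
print(glu([1.0, 2.0, 0.0, 0.0]))  # sigmoid(0) = 0.5, so [0.5, 1.0]
```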

from transformers import AutoModel, AutoProcessor

processor = AutoProcessor.from_pretrained(
    "yuriyvnv/WAVe-1B-Multimodal-PT",
    trust_remote_code=True,
)
model = AutoModel.from_pretrained(
    "yuriyvnv/WAVe-1B-Multimodal-PT",
    trust_remote_code=True,
)

# Assess speech-transcript alignment
# (audio_array: a 16 kHz mono waveform loaded beforehand)
inputs = processor(
    text="Olá, como está?",
    audio=audio_array,
    sampling_rate=16000,
    return_tensors="pt",
)
quality = model(**inputs).quality_score.item()

Perfect for filtering synthetic speech datasets before ASR training.
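A minimal sketch of that filtering step, assuming quality scores have already been computed per utterance as in the snippet above; the 0.8 cutoff is an illustrative assumption, not a recommended value.

```python
# Keep only utterances whose WAVe quality score clears a threshold.
# Scores are assumed precomputed; the cutoff here is arbitrary.

def filter_by_quality(samples, threshold=0.8):
    """samples: list of (utterance_id, quality_score) pairs."""
    return [uid for uid, score in samples if score >= threshold]

samples = [("utt1", 0.95), ("utt2", 0.41), ("utt3", 0.83)]
print(filter_by_quality(samples))  # prints ['utt1', 'utt3']
```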

Model: yuriyvnv/WAVe-1B-Multimodal-PT
Code to create WAVe: https://github.com/yuriyvnv/WAVe
#multimodal #speech #embeddings #asr
#syntheticdata #qualityassessment
  • 2 replies
Β·