# Evoxtral – Research References & Sources
References for the technical report covering expressive tagged transcription, ASR evaluation, and post-training methods.
## Core Model
- Voxtral: An Audio-Language Model – Mistral AI, 2025. Voxtral Mini and Small are open-weights audio chat models trained with SFT + DPO + Online DPO. Online DPO delivers crisper grounding, fewer hallucinations, and more helpful responses.
## RL / Post-Training for ASR
- Group Relative Policy Optimization for Speech Recognition – Proposes GRPO with rule-based rewards for LLM-based ASR; a minimal reward sketch follows this list. Achieved an 18% relative WER reduction on AMI-IHM and 27.9% on AMI-SDM compared to SFT-adapted models.
- Advancing Speech Understanding in Speech-Aware Language Models with GRPO – Applies GRPO to large audio language models, investigating different rule-based reward functions and RL data construction strategies.
- Reinforcement Learning Outperforms Supervised Fine-Tuning: A Case Study on Audio Question Answering – Demonstrates that RL (GRPO) achieves state-of-the-art results on audio QA tasks, outperforming SFT baselines.
- Explore the Reinforcement Learning for LLM-based ASR and TTS System – Surveys RL applications to speech models, noting that while RL has enhanced text-based LLMs, its application to ASR/TTS remains underexplored.
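A minimal sketch of a rule-based reward and GRPO-style group normalization, in the spirit of the papers above (illustrative names, not their exact implementations):

```python
# Hypothetical rule-based ASR reward: 1 - WER, clipped to [0, 1].
import statistics
import jiwer

def asr_reward(reference: str, hypothesis: str) -> float:
    return max(0.0, 1.0 - jiwer.wer(reference, hypothesis))

def group_advantages(rewards: list[float]) -> list[float]:
    # GRPO normalizes each sampled hypothesis's reward against its group
    # of samples for the same prompt, rather than using a learned critic.
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + 1e-8) for r in rewards]
```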
## ASR Evaluation Metrics
- Speech Recognition Accuracy: Production Metrics & Optimization 2025 – Deepgram. Covers WER, CER, Keyword Recall Rate (KRR), Real-Time Factor (RTF), and end-to-end latency. Production systems need a blend of metrics tailored to the use case.
- Moving Beyond Word Error Rate to Evaluate ASR in Clinical Samples – Argues that WER alone is insufficient; error type, meaning, and context matter. A single substitution can drastically change intent while incurring the same WER penalty.
- Measuring the Accuracy of Automatic Speech Recognition Solutions – ACM survey on ASR evaluation methodology, the limitations of WER/CER, and alternative metrics.
- On the Robust Approximation of ASR Metrics – ACL 2025. A novel label-free approach for approximating ASR performance using multimodal embeddings.
- ProfASR-Bench: A Professional-talk ASR Dataset for High-Stakes Applications – Evaluation suite supporting conventional metrics plus entity-aware scores and slice-wise reporting by accent and gender.
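For reference, WER counts word-level substitutions (S), deletions (D), and insertions (I) against a reference transcript of N words:

$$\mathrm{WER} = \frac{S + D + I}{N}$$

CER is the same ratio computed over characters instead of words, which is why it catches finer-grained errors.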
## Expressive Speech & TTS
- ElevenLabs v3 Text-to-Speech – TTS model supporting inline expressive audio tags for emotions, non-verbal sounds, and delivery cues. Tag set used as target vocabulary for Evoxtral.
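An illustrative example of the inline tag style (assumed tag choices; see the ElevenLabs documentation for the full vocabulary):

```text
[whispers] I can't BELIEVE you came back. [laughs] Come in, sit down.
```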
## Fine-Tuning Frameworks
- ms-swift – ModelScope framework supporting SFT/DPO/GRPO for 600+ LLMs and 300+ MLLMs. AAAI 2025.
- OpenRLHF – Scalable agentic RL framework based on Ray, supporting PPO, DAPO, REINFORCE++, and more.
- PEFT (Parameter-Efficient Fine-Tuning) – HuggingFace library for LoRA and other adapter methods.
## Evaluation Methodology
- jiwer – Python library for WER, CER, and other ASR metrics based on edit distance.
  - GitHub: https://github.com/jitsi/jiwer
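Minimal jiwer usage, with tags stripped before scoring as in the benchmark (the tag-stripping regex is an assumption, not the exact one in the training script):

```python
import re
import jiwer

def strip_tags(text: str) -> str:
    # Remove bracketed expressive tags such as "[laughs]" before scoring.
    return re.sub(r"\[[a-z ]+\]\s*", "", text)

ref = "I NEVER saw it coming."
hyp = "[laughs] I never saw it coming."
print(jiwer.wer(strip_tags(ref), strip_tags(hyp)))  # Word Error Rate
print(jiwer.cer(strip_tags(ref), strip_tags(hyp)))  # Character Error Rate
```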
## Our Benchmark: Evoxtral-Bench
Metrics computed by scripts/train_modal.py::evaluate() (a sketch of the tag metrics follows the table):
| Metric | Description | Direction |
|---|---|---|
| WER | Word Error Rate on plain text (tags stripped) | lower = better |
| CER | Character Error Rate on plain text | lower = better |
| Tag F1 | F1 score for tag extraction (multiset intersection) | higher = better |
| Tag Precision | Fraction of predicted tags that match reference | higher = better |
| Tag Recall | Fraction of reference tags captured by prediction | higher = better |
| Tag Hallucination Rate | Fraction of predicted tags not in reference at all | lower = better |
| Emphasis F1 | F1 for CAPITALIZED emphasis words | higher = better |
| Per-Tag F1 | Breakdown by individual tag type (e.g. [laughs], [sighs]) | higher = better |
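A minimal sketch of the tag metrics, assuming tags look like "[laughs]"; function and variable names are illustrative, not the exact code in scripts/train_modal.py:

```python
import re
from collections import Counter

TAG_RE = re.compile(r"\[[a-z ]+\]")

def tag_metrics(reference: str, prediction: str) -> dict:
    ref = Counter(TAG_RE.findall(reference))
    pred = Counter(TAG_RE.findall(prediction))
    overlap = sum((ref & pred).values())  # multiset intersection
    n_ref, n_pred = sum(ref.values()), sum(pred.values())
    precision = overlap / n_pred if n_pred else 0.0
    recall = overlap / n_ref if n_ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    # Predicted tag types that never appear in the reference at all.
    hallucinated = sum(c for tag, c in pred.items() if tag not in ref)
    return {
        "tag_precision": precision,
        "tag_recall": recall,
        "tag_f1": f1,
        "tag_hallucination_rate": hallucinated / n_pred if n_pred else 0.0,
    }
```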
## Why These Metrics
- WER + CER: Standard ASR quality. CER is more granular and catches character-level errors WER misses.
- Tag Precision vs Recall: F1 alone hides whether the model hallucinates tags (low precision) or misses them (low recall). Judges care about this distinction.
- Tag Hallucination Rate: Critical for downstream TTS β hallucinated tags produce wrong prosody/effects.
- Per-Tag Breakdown: Shows which expressive cues the model handles well vs struggles with, informing future data collection.
- Emphasis F1: CAPS words (e.g., "BELIEVE", "NEVER") carry prosodic emphasis in ElevenLabs v3; measuring this separately tracks delivery accuracy.
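A sketch of Emphasis F1 under the assumption that emphasis words are runs of two or more capital letters (the exact heuristic in the evaluation script may differ):

```python
import re
from collections import Counter

CAPS_RE = re.compile(r"\b[A-Z]{2,}\b")

def emphasis_f1(reference: str, prediction: str) -> float:
    ref = Counter(CAPS_RE.findall(reference))
    pred = Counter(CAPS_RE.findall(prediction))
    overlap = sum((ref & pred).values())
    p = overlap / sum(pred.values()) if pred else 0.0
    r = overlap / sum(ref.values()) if ref else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0
```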