
# Evoxtral — Research References & Sources

References for the technical report covering expressive tagged transcription, ASR evaluation, and post-training methods.

## Core Model

## RL / Post-Training for ASR

- Group Relative Policy Optimization for Speech Recognition — Proposes GRPO with rule-based rewards for LLM-based ASR. Reports an 18% relative WER reduction on AMI-IHM and 27.9% on AMI-SDM compared to SFT-adapted models.

- Advancing Speech Understanding in Speech-Aware Language Models with GRPO — Applies GRPO to large audio language models, investigating different rule-based reward functions and RL data construction strategies.

- Reinforcement Learning Outperforms Supervised Fine-Tuning: A Case Study on Audio Question Answering — Demonstrates that RL (GRPO) achieves state-of-the-art results on audio QA tasks, outperforming SFT baselines.

- Explore the Reinforcement Learning for LLM-based ASR and TTS System — Surveys RL applications to speech models, noting that while RL has enhanced text-based LLMs, its application to ASR/TTS remains underexplored.

## ASR Evaluation Metrics

- Speech Recognition Accuracy: Production Metrics & Optimization 2025 — Deepgram. Covers WER, CER, Keyword Recall Rate (KRR), Real-Time Factor (RTF), and end-to-end latency; argues that production systems need blended metrics depending on the use case.

- Moving Beyond Word Error Rate to Evaluate ASR in Clinical Samples — Argues that WER alone is insufficient; error type, meaning, and context matter. A single substitution can drastically change intent while incurring the same WER penalty.

- Measuring the Accuracy of Automatic Speech Recognition Solutions — ACM survey on ASR evaluation methodology, the limitations of WER/CER, and alternative metrics.

- On the Robust Approximation of ASR Metrics — ACL 2025. A label-free approach for approximating ASR performance using multimodal embeddings.

- ProfASR-Bench: A Professional-talk ASR Dataset for High-Stakes Applications — Evaluation suite supporting conventional metrics plus entity-aware scores and slice-wise reporting by accent and gender.

## Expressive Speech & TTS

## Fine-Tuning Frameworks

## Evaluation Methodology

## Our Benchmark: Evoxtral-Bench

Metrics computed by `scripts/train_modal.py::evaluate()`:

| Metric | Description | Direction |
|---|---|---|
| WER | Word Error Rate on plain text (tags stripped) | lower = better |
| CER | Character Error Rate on plain text | lower = better |
| Tag F1 | F1 score for tag extraction (multiset intersection) | higher = better |
| Tag Precision | Fraction of predicted tags that match the reference | higher = better |
| Tag Recall | Fraction of reference tags captured by the prediction | higher = better |
| Tag Hallucination Rate | Fraction of predicted tags not in the reference at all | lower = better |
| Emphasis F1 | F1 for CAPITALIZED emphasis words | higher = better |
| Per-Tag F1 | Breakdown by individual tag type (e.g. `[laughs]`, `[sighs]`) | higher = better |
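The tag metrics above can be sketched as follows. This is a hypothetical re-implementation (the actual logic lives in `scripts/train_modal.py::evaluate()`), assuming tags are bracketed spans like `[laughs]`, matched case-insensitively, and that a hallucinated tag is a predicted tag whose type never appears in the reference.

```python
import re
from collections import Counter


def extract_tags(text: str) -> Counter:
    """Collect bracketed expressive tags (e.g. [laughs]) as a multiset."""
    return Counter(re.findall(r"\[[^\[\]]+\]", text.lower()))


def tag_metrics(reference: str, prediction: str) -> dict:
    ref, pred = extract_tags(reference), extract_tags(prediction)
    overlap = sum((ref & pred).values())  # multiset intersection
    n_pred, n_ref = sum(pred.values()), sum(ref.values())
    precision = overlap / max(n_pred, 1)
    recall = overlap / max(n_ref, 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-9)
    # Predicted tags whose type is absent from the reference entirely
    hallucinated = sum(c for tag, c in pred.items() if tag not in ref)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "hallucination_rate": hallucinated / max(n_pred, 1),
    }
```

Using `Counter` intersection means repeated tags are only credited as many times as they appear in the reference, so a model that spams `[laughs]` gains recall on at most one occurrence per reference instance.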

### Why These Metrics

- WER + CER: Standard ASR quality. CER is more granular and catches character-level errors that WER misses.
- Tag Precision vs. Recall: F1 alone hides whether the model hallucinates tags (low precision) or misses them (low recall). Judges care about this distinction.
- Tag Hallucination Rate: Critical for downstream TTS — hallucinated tags produce the wrong prosody/effects.
- Per-Tag Breakdown: Shows which expressive cues the model handles well versus which it struggles with, informing future data collection.
- Emphasis F1: CAPS words (e.g., "BELIEVE", "NEVER") carry prosodic emphasis in ElevenLabs v3; measuring this separately tracks delivery accuracy.
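Emphasis F1 can be sketched the same way as the tag metrics. This is a minimal illustration, assuming an emphasis word is a whole token of two or more capital letters; the exact extraction rule in `evaluate()` may differ.

```python
import re
from collections import Counter


def extract_emphasis(text: str) -> Counter:
    """Collect ALL-CAPS emphasis words (2+ letters, e.g. NEVER) as a multiset."""
    return Counter(re.findall(r"\b[A-Z]{2,}\b", text))


def emphasis_f1(reference: str, prediction: str) -> float:
    ref, pred = extract_emphasis(reference), extract_emphasis(prediction)
    if not ref and not pred:
        return 1.0  # neither side emphasizes anything: trivially correct
    overlap = sum((ref & pred).values())
    precision = overlap / max(sum(pred.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    return 2 * precision * recall / max(precision + recall, 1e-9)
```

Note that single capital letters such as "I" are deliberately excluded by the `{2,}` quantifier, so ordinary capitalization does not register as emphasis.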