# Training Data Composition & Balancing for Evoxtral Finetuning

A research summary of best practices for training data composition when finetuning LLMs, applied to the Evoxtral use case (LoRA finetuning Voxtral-Mini-3B to produce tagged transcriptions).

---

## 1. Data Mixing Strategies for SFT/LoRA Finetuning

### Optimal Ratios of Tagged vs Plain Data

The single most important finding across the literature: **always include plain/untagged examples in your training mix.** Training exclusively on tagged transcriptions will cause the model to hallucinate tags everywhere and degrade base transcription quality.

**Concrete ratios from research:**

| Mix Ratio (Task:Original) | Source | Result |
|---------------------------|--------|--------|
| 1:1 (50% new, 50% original) | [Mixed Training for Math Reasoning](https://arxiv.org/html/2512.13706) | Best balance -- full new-task performance with only 0.7pp original-task degradation |
| 3:1 (75% new, 25% original) | Same study | New-task performance maintained, original task drops ~1.4pp |
| 7:1 (87.5% new, 12.5% original) | Same study | Still effective, original task drops ~2.5pp |
| 15:1 (93.8% new, 6.2% original) | Same study | Minimum viable -- original task drops ~3.2pp but still far better than 0% |

**For Evoxtral specifically:** With a target of 500-1000 tagged training pairs, aim for:

- **60-70% tagged transcriptions** (emotion tags, non-verbal markers, delivery cues)
- **30-40% plain transcriptions** (standard ASR output, no tags at all)

This ratio prevents the model from learning "always add tags" and preserves base transcription quality.

### Preventing Tag Hallucination

Research on preventing hallucination during finetuning is directly applicable to preventing over-generation of audio tags.

**Key findings from [The Hallucination Tax of Reinforcement Finetuning](https://arxiv.org/html/2505.13988):**

- Standard finetuning can reduce refusal rates by >80%, meaning models become overconfident
- Tested mixing ratios of 0%, 1%, 10%, 30%, and 50% "unanswerable" (negative) examples
- **10% negative examples was the optimal ratio** -- it restored appropriate refusal behavior while maintaining task accuracy
- Higher ratios (30-50%) degraded performance on the primary task

**Applied to Evoxtral:** Include ~10-15% of training examples where the audio is emotionally neutral/flat and the ground truth has NO tags (just plain text). This teaches the model that not every utterance needs tags.

**Additional anti-hallucination strategies:**

- Train on "familiar, low-perplexity data" -- high-perplexity examples increase hallucination ([Unfamiliar Finetuning Examples](https://arxiv.org/html/2403.05612v1))
- Keep a balanced ratio of positive (tags present) and negative (no-tag) examples, following the balanced instruction design in [Robust Instruction Tuning](https://arxiv.org/abs/2306.14565)
- Ensure tag density varies naturally across training examples (some heavily tagged, some sparse)

### The "Cocktail Effect" in Data Mixing

Research on [Data Mixing Optimization for SFT](https://arxiv.org/html/2508.11953v1) found a "cocktail effect": diverse training data outperforms single-domain approaches. For domain-specific models, including general instruction data alongside specialized content improved results. A medical chatbot achieved best performance with **67.7% general data (Alpaca-GPT4) and 32.3% domain data (PubMedQA).**

**For Evoxtral:** Don't just train on tagged transcriptions. Consider including:

- General ASR examples (plain transcription)
- Diverse audio conditions (clean, noisy, different speakers)
- Various text styles and lengths

A minimal sketch of this mixing logic follows.
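Assuming the tagged, plain, and emotionally-neutral/no-tag examples have already been collected into separate pools (lists of example dicts), the mix could be assembled as below. The pool split, function name, and exact fractions are illustrative assumptions, not an existing Evoxtral API:

```python
import random

def compose_training_mix(tagged, plain, neutral, n_total=800,
                         tagged_frac=0.65, neutral_frac=0.12, seed=0):
    """Assemble a mix per the ratios above: ~65% tagged examples,
    ~12% neutral-audio/no-tag negatives (the anti-hallucination slice),
    and the remaining ~23% plain ASR transcription."""
    rng = random.Random(seed)
    n_tagged = round(n_total * tagged_frac)
    n_neutral = round(n_total * neutral_frac)
    n_plain = n_total - n_tagged - n_neutral
    # sample without replacement; raises ValueError if a pool is too small,
    # which is a useful early failure when generating data to these targets
    mix = (rng.sample(tagged, n_tagged)
           + rng.sample(neutral, n_neutral)
           + rng.sample(plain, n_plain))
    rng.shuffle(mix)
    return mix
```

The neutral pool is kept separate from the plain pool so the negative-example slice is explicit; together the two untagged slices land in the 30-40% plain range recommended above.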
---

## 2. Balanced Dataset Design for Structured Output Tasks

### Teaching When NOT to Apply Tags

This is a critical and under-researched area. The SSML annotation literature provides the closest parallels.

**From [SSML Prosody Control Research](https://arxiv.org/html/2508.17494v1):**

- Models consistently **under-generate** tags when too few tagged examples exist
- But **over-generate** when training is tag-heavy
- The solution: systematic variation in tag density across training examples

**Recommended tag density distribution for Evoxtral training data:**

| Tag Density | % of Dataset | Description |
|-------------|--------------|-------------|
| None (0 tags) | 25-35% | Plain transcription, emotionally neutral audio |
| Light (1-2 tags) | 25-30% | Subtle emotion, single non-verbal |
| Medium (3-5 tags) | 25-30% | Multiple emotions, mixed delivery |
| Heavy (6+ tags) | 10-15% | Highly expressive, dramatic audio |

### Structured Output Quality

From [Databricks End-to-End Structured Extraction](https://community.databricks.com/t5/technical-blog/end-to-end-structured-extraction-with-llm-part-2-fine-tuning/ba-p/99900):

- Training data should be "structured, token-balanced, and metadata-tagged"
- For tagged output tasks, ensure the tokenizer properly handles your tag vocabulary
- Label masking (computing loss only on output tokens) is essential -- Evoxtral already plans this

---

## 3. Synthetic Data Quality and Diversity

### Best Practices from Research

**Quality filtering ([Eugene Yan's comprehensive guide](https://eugeneyan.com/writing/synthetic/)):**

- Use a **ROUGE-L < 0.7** threshold against existing examples to ensure diversity (the Self-Instruct method); a sketch of this filter follows the list
- Remove impossible instructions (e.g., referencing images for text-only models)
- Apply validation scoring: chain-of-thought plus a 5-point scale, averaging 3 scores per response
- Even a dataset where only **54% of synthetic samples had completely valid fields still improved performance by 33%** -- moderate imperfection is workable
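The ROUGE-L filter is easy to sketch without a dependency. This is a minimal, self-contained version assuming whitespace tokenization and no stemming, so scores may differ slightly from library implementations such as `rouge_score`:

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists,
    computed with a rolling 1-D DP table."""
    dp = [0] * (len(b) + 1)
    for x in a:
        prev = 0  # dp value at (previous row, previous column)
        for j, y in enumerate(b, 1):
            cur = dp[j]
            dp[j] = prev + 1 if x == y else max(dp[j], dp[j - 1])
            prev = cur
    return dp[-1]

def rouge_l_f(text_a, text_b):
    """ROUGE-L F-measure over whitespace tokens (no stemming)."""
    ta, tb = text_a.split(), text_b.split()
    if not ta or not tb:
        return 0.0
    lcs = lcs_len(ta, tb)
    precision, recall = lcs / len(tb), lcs / len(ta)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def diversity_filter(candidates, seeds=(), threshold=0.7):
    """Greedily keep candidates whose ROUGE-L against every kept example
    stays below the threshold (Self-Instruct style). The O(n^2) pairwise
    pass is fine at the 500-1000 example scale discussed here."""
    kept = list(seeds)
    for cand in candidates:
        if all(rouge_l_f(cand, k) < threshold for k in kept):
            kept.append(cand)
    return kept
```

Seeding `kept` with existing human-written examples also forces synthetic additions to diverge from them, not just from each other.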
**Diversity strategies:**

- **Iterative sampling**: Start with 8 seed examples, progressively incorporate generated ones
- **Template expansion**: Create 2+ alternative formulations for each task
- **Attribute conditioning**: Vary all controllable attributes systematically
- **Style variation**: Generate multiple styles (e.g., the WRAP paper used easy/medium/hard/Q&A formats, achieving a 3x training speedup with a 1:1 real-to-synthetic ratio)

### Synthetic Data for Speech/Audio Tasks

**From [Optimized Synthetic Data for ASR](https://arxiv.org/html/2508.21631v1):**

- Cyclically iterate over speakers without replacement to maximize speaker diversity
- TTS and voice conversion systems are viable for ASR data augmentation
- Synthetic data lacks diversity in pitch, speed, and background noise compared to authentic audio

**From [Synthio Audio Classification](https://arxiv.org/html/2410.02056v1):**

- Enhancing consistency and diversity with a small-scale version of the target dataset significantly improves performance
- Data augmentations for acoustic diversity boost out-of-distribution generalization

### Stratified Sampling for Evoxtral

**Recommended stratification axes for the training dataset:**

| Axis | Categories | Rationale |
|------|------------|-----------|
| Emotion type | excited, sad, angry, nervous, calm, frustrated | Balanced representation of all target emotions |
| Non-verbal sounds | laughs, sighs, gasps, clears throat, crying | Each sound type needs adequate coverage |
| Speaker gender | male, female, neutral | Prevent gender bias in emotion detection |
| Audio length | short (<10s), medium (10-30s), long (30s+) | Varied context window utilization |
| Tag density | none, light, medium, heavy (see table above) | Critical for preventing over/under-generation |
| Emotional valence | positive, negative, neutral | Prevent bias toward detecting only negative emotions |

**Speaker diversity from [Latent Mixup for Speech Recognition](https://arxiv.org/html/2511.20534):**

- Constrain pairings to match gender and dataset partition
- Maintain distribution characteristics across splits

A sketch of stratified spec generation across these axes follows.
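This sketch assumes each synthetic example is produced by prompting an LLM script generator with a per-example spec. Axis values mirror the tables above, the density weights are midpoints of the section 2 distribution, and all names are illustrative:

```python
import random

# Axis values mirror the stratification table above
AXES = {
    "emotion": ["excited", "sad", "angry", "nervous", "calm", "frustrated"],
    "nonverbal": ["laughs", "sighs", "gasps", "clears throat", "crying", None],
    "gender": ["male", "female", "neutral"],
    "length": ["short", "medium", "long"],
    "valence": ["positive", "negative", "neutral"],
}
# Midpoints of the section 2 tag-density distribution
DENSITY_WEIGHTS = {"none": 0.30, "light": 0.28, "medium": 0.28, "heavy": 0.14}

def sample_spec(rng):
    """Draw one generation spec: each axis is sampled independently,
    with tag density following the target distribution."""
    spec = {axis: rng.choice(values) for axis, values in AXES.items()}
    spec["density"] = rng.choices(list(DENSITY_WEIGHTS),
                                  weights=list(DENSITY_WEIGHTS.values()))[0]
    if spec["density"] == "none":
        spec["nonverbal"] = None  # plain transcription: no tags at all
    return spec

rng = random.Random(42)
specs = [sample_spec(rng) for _ in range(800)]  # one spec per training example
```

Independent per-axis sampling keeps the marginals balanced without enumerating the full cross-product of categories, which would far exceed an 800-example budget.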
---

## 4. Catastrophic Forgetting Prevention

### How Much Original-Task Data to Mix In

This is the most critical question for Evoxtral: how much plain ASR data to include so that adding emotion tag capability doesn't degrade word-level transcription quality.

**Key finding from [Apple's Scaling Laws for Forgetting with Pretraining Data Injection](https://machinelearning.apple.com/research/scaling-laws):**

> Injecting as little as **1% of pretraining data** in the finetuning mixture prevents the model from forgetting the pretraining set.

**However, more nuanced findings from [Scaling Laws for Forgetting](https://arxiv.org/html/2401.05605v1):**

- Forgetting follows a **strong inverse linear relationship** with fine-tuning loss
- Forgetting increases as a **shifted power law** in both the number of parameters finetuned and the number of training steps
- Forgetting **cannot be avoided through early stopping** or by varying parameter counts
- LoRA still suffers from forgetting, though less than full finetuning

**Concrete replay buffer recommendations by task type:**

| Task Type | Minimum Replay Buffer | Recommended Buffer | Source |
|-----------|-----------------------|--------------------|--------|
| NLU tasks (classification, NLI) | 1-2% | 5% | Empirical study on catastrophic forgetting |
| Math/Code tasks | 5-10% | 15-20% | Same study |
| Structured output (like tags) | ~10% | 25-35% | Extrapolated from mixed training results |

**For Evoxtral specifically:**

- Since you're adding a **new structural capability** (tag generation) on top of an existing one (ASR), the risk is higher than for simple domain adaptation
- **Recommended: 25-35% of training data should be plain ASR transcription** (no tags)
- This is supported by the mixed training study, where a 1:1 ratio achieved equivalent new-task performance with only 0.7pp base-task degradation

### The "Tax" of Adding New Capabilities

From the math finetuning study:

- **Math-only training**: Math accuracy went 3.1% -> 12.0%, but NLI dropped 81.0% -> 16.5% (catastrophic)
- **1:1 mixed training**: Math accuracy 12.0% (same!), NLI 86.2% (only a 0.7pp drop from 86.9%)
- **Even 15:1 (93.8% new task)**: Original task maintained at 83.8% vs the 86.9% baseline

**Bottom line**: With proper data mixing, the "tax" of adding tagged transcription capability should be **less than 3% WER degradation** on plain transcription tasks, and likely under 1% with a 1:1 mix.

### LoRA-Specific Forgetting Mitigation

LoRA inherently reduces forgetting compared to full finetuning because:

- Fewer parameters are modified (lower rank = less forgetting)
- Base weights remain frozen
- The adapter can be merged or removed

However, the [scaling laws paper](https://arxiv.org/html/2401.05605v1) found that LoRA still exhibits forgetting following the same power law. The data mixing strategy remains essential even with LoRA.

---

## 5. Class Imbalance in Tag/Label Finetuning

### The Problem for Evoxtral

Some tags will naturally be rarer than others:

- `[excited]` and `[laughs]` likely appear frequently
- `[gasps]`, `[stammers]`, `[clears throat]` are much rarer
- `[pause]` and emphasis (CAPS) are potentially in every example

### Balancing Strategies

**Three main approaches from [Class-Balanced Loss (CVPR 2019)](https://openaccess.thecvf.com/content_CVPR_2019/papers/Cui_Class-Balanced_Loss_Based_on_Effective_Number_of_Samples_CVPR_2019_paper.pdf):**

1. **Oversampling rare classes**: Simple but risks overfitting to repeated examples
2. **Undersampling common classes**: Loses valuable training signal
3. **Weighted loss**: Reweight by the effective number of samples -- the best theoretical approach

**Class-Balanced Loss formula:**

```
weight_i = (1 - beta) / (1 - beta^n_i)
```

where `n_i` is the number of samples for class `i`, and `beta` is typically 0.9, 0.99, or 0.999.

**For generative models (Evoxtral's case), a [HuggingFace forum discussion](https://discuss.huggingface.co/t/handling-class-imbalance-when-finetuning-a-decoder-model-on-text-generation/173010) notes:**

- Weighted loss is harder to apply in token-level generation
- **Oversampling with variation is often more practical** for generative models
- Ensure rare tags appear in diverse contexts (different sentences, emotions, speakers)

Sketches of both the weighting formula and the oversampling route follow.
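The formula translates directly into code. A minimal sketch with illustrative tag counts; the mean-normalization step is an added convention (it keeps the overall loss scale unchanged) rather than part of the CVPR recipe:

```python
def class_balanced_weights(tag_counts, beta=0.999):
    """weight_i = (1 - beta) / (1 - beta^n_i), normalized to mean 1.0."""
    raw = {tag: (1 - beta) / (1 - beta ** n) for tag, n in tag_counts.items()}
    mean = sum(raw.values()) / len(raw)
    return {tag: w / mean for tag, w in raw.items()}

# Illustrative counts: rare tags like [gasps] receive proportionally
# larger weights than frequent tags like [excited]
weights = class_balanced_weights({"[excited]": 140, "[laughs]": 95, "[gasps]": 18})
```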
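And since token-level reweighting is awkward in practice, a sketch of the oversampling route; the factors anticipate the table in the next subsection, and the `text`-field example format is an assumption:

```python
import random

def oversample_rare(examples, factors, seed=0):
    """Repeat examples containing rare tags by their oversampling factor.
    Real variation (new sentences/contexts per repeat) should come from the
    synthetic generator; naive duplication risks overfitting."""
    rng = random.Random(seed)
    out = []
    for ex in examples:
        # take the largest factor among the rare tags present in this example
        factor = max((f for tag, f in factors.items() if tag in ex["text"]),
                     default=1)
        out.extend([ex] * factor)
    rng.shuffle(out)
    return out

RARE_FACTORS = {"[gasps]": 3, "[crying]": 3, "[whispers]": 3,
                "[shouts]": 3, "[stammers]": 4, "[clears throat]": 4}
```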
### Recommended Strategy for Evoxtral

**Hybrid approach:**

1. **Stratified generation**: When creating synthetic training data, ensure minimum representation:
   - Each tag type should appear in at least 5-10% of tagged examples
   - Use the LLM script generator to specifically request rare tag scenarios
2. **Contextual oversampling**: For rare tags, generate multiple variations:
   - `[gasps]` in surprise, fear, and excitement contexts
   - `[stammers]` in nervous, angry, and confused contexts
   - Aim for 3-5x oversampling of the rarest tags relative to their natural distribution
3. **Minimum tag frequency targets:**

| Tag Category | Minimum % of Tagged Examples | Natural Frequency | Oversampling Factor |
|--------------|------------------------------|-------------------|---------------------|
| [excited], [sad], [angry] | 15-20% each | High | 1x (none) |
| [calm], [nervous], [frustrated] | 10-15% each | Medium | 1.5-2x |
| [laughs], [sighs] | 10-15% each | Medium-High | 1x |
| [gasps], [crying] | 8-12% each | Low | 2-3x |
| [whispers], [shouts] | 8-12% each | Low | 2-3x |
| [stammers], [clears throat] | 5-10% each | Very Low | 3-5x |
| [pause], CAPS emphasis | Present in 40-60% | Very High | 0.5x (undersample) |

---

## 6. Concrete Recommendations for Evoxtral Training Data

### Final Dataset Composition (for 800 total examples)

| Category | Count | Percentage | Description |
|----------|-------|------------|-------------|
| Heavily tagged | 80-120 | 10-15% | 6+ tags, dramatic/expressive audio |
| Medium tagged | 200-240 | 25-30% | 3-5 tags, moderate emotion |
| Lightly tagged | 200-240 | 25-30% | 1-2 tags, subtle emotion |
| Plain transcription | 240-280 | 30-35% | 0 tags, neutral delivery |

### Quality Checklist for Training Data

- [ ] Tag density varies naturally (not every sentence has a tag)
- [ ] Each of the 15+ target tags appears in 40-80 examples
- [ ] Rare tags are oversampled 2-5x with diverse contexts
- [ ] 30-35% of examples are plain transcription (anti-hallucination)
- [ ] Speaker diversity: at least 6-8 distinct voices
- [ ] Audio length varies (short, medium, and long segments)
- [ ] Emotional valence is balanced (positive/negative/neutral)
- [ ] ROUGE-L between any two examples is < 0.7 (diversity check)
- [ ] Tag positions vary within sentences (beginning, middle, end)
- [ ] Some examples have closely spaced tags, others widely spaced

### Training Configuration Notes

- **Epochs**: 1-2 (more increases forgetting risk)
- **LoRA rank**: Treat as a hyperparameter; sweep [8, 16, 32, 64]
- **Learning rate**: Conservative (1e-5 to 5e-5 range)
- **Label masking**: Essential -- only compute loss on output tokens
- **Evaluation**: Track both WER (plain transcription quality) AND tag F1 simultaneously
- **Early stopping**: Monitor WER on a held-out plain transcription set; stop if it degrades by >2%

---

## Sources

- [Practical Tips for Finetuning LLMs Using LoRA](https://magazine.sebastianraschka.com/p/practical-tips-for-finetuning-llms) - Sebastian Raschka
- [The Hallucination Tax of Reinforcement Finetuning](https://arxiv.org/html/2505.13988) - Negative example ratios
- [Unfamiliar Finetuning Examples Control How LLMs Hallucinate](https://arxiv.org/html/2403.05612v1)
- [Mitigating Catastrophic Forgetting via Mixed Training](https://arxiv.org/html/2512.13706) - Data replay ratios
- [Scaling Laws for Forgetting When Fine-Tuning LLMs](https://arxiv.org/html/2401.05605v1) - Power law relationships
- [Scaling Laws for Forgetting with Pretraining Data Injection](https://machinelearning.apple.com/research/scaling-laws) - Apple, 1% replay finding
- [Data Mixing Optimization for SFT](https://arxiv.org/html/2508.11953v1) - Cocktail effect
- [How to Generate and Use Synthetic Data for Finetuning](https://eugeneyan.com/writing/synthetic/) - Eugene Yan
- [On the Diversity of Synthetic Data](https://arxiv.org/html/2410.15226v2)
- [Data Diversity Matters for Robust Instruction Tuning](https://aclanthology.org/2024.findings-emnlp.195.pdf)
- [Class-Balanced Loss Based on Effective Number of Samples](https://openaccess.thecvf.com/content_CVPR_2019/papers/Cui_Class-Balanced_Loss_Based_on_Effective_Number_of_Samples_CVPR_2019_paper.pdf) - CVPR 2019
- [Improving French Synthetic Speech Quality via SSML](https://arxiv.org/html/2508.17494v1)
- [Towards Improved Speech Recognition through Synthetic Data](https://arxiv.org/html/2508.21631v1)
- [Efficient Fine-Tuning with LoRA Guide](https://www.databricks.com/blog/efficient-fine-tuning-lora-guide-llms) - Databricks
- [How to Fine-Tune: Focus on Effective Datasets](https://ai.meta.com/blog/how-to-fine-tune-llms-peft-dataset-curation/) - Meta
- [Extrinsic Hallucinations in LLMs](https://lilianweng.github.io/posts/2024-07-07-hallucination/) - Lilian Weng