# Training Data Composition & Balancing for Evoxtral Finetuning

Research summary on best practices for training data composition when finetuning LLMs,
applied to the Evoxtral use case (LoRA finetuning Voxtral-Mini-3B to produce tagged transcriptions).

---

## 1. Data Mixing Strategies for SFT/LoRA Finetuning

### Optimal Ratios of Tagged vs Plain Data

The single most important finding across the literature: **always include plain/untagged examples
in your training mix.** Training exclusively on tagged transcriptions will cause the model to
hallucinate tags everywhere and degrade base transcription quality.

**Concrete ratios from research:**

| Mix Ratio (Task:Original) | Source | Result |
|---------------------------|--------|--------|
| 1:1 (50% new, 50% original) | [Mixed Training for Math Reasoning](https://arxiv.org/html/2512.13706) | Best balance -- full new-task performance with only 0.7pp original-task degradation |
| 3:1 (75% new, 25% original) | Same study | New-task performance maintained, original task drops ~1.4pp |
| 7:1 (87.5% new, 12.5% original) | Same study | Still effective, original task drops ~2.5pp |
| 15:1 (93.8% new, 6.2% original) | Same study | Minimum viable -- original task drops ~3.2pp but still far better than 0% |

**For Evoxtral specifically:** With a target of 500-1000 tagged training pairs, aim for:

- **60-70% tagged transcriptions** (emotion tags, non-verbal markers, delivery cues)
- **30-40% plain transcriptions** (standard ASR output, no tags at all)

This ratio prevents the model from learning "always add tags" and preserves base transcription quality.
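As a minimal sketch of composing such a mix (assuming `tagged` and `plain` are pre-split lists of examples; all names here are illustrative, not part of the Evoxtral codebase):

```python
import random

def build_mix(tagged, plain, tagged_frac=0.65, seed=0):
    """Compose a training set that is ~65% tagged and ~35% plain.

    The plain count is derived from the target tagged fraction, so the
    mix follows the 60-70% / 30-40% recommendation above.
    """
    rng = random.Random(seed)
    n_plain = round(len(tagged) * (1 - tagged_frac) / tagged_frac)
    if n_plain > len(plain):
        raise ValueError(f"need {n_plain} plain examples, only have {len(plain)}")
    mix = list(tagged) + rng.sample(plain, n_plain)
    rng.shuffle(mix)  # avoid ordering effects during training
    return mix
```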
### Preventing Tag Hallucination

Research on preventing hallucination during finetuning is directly applicable to preventing
over-generation of audio tags.

**Key findings from [The Hallucination Tax of Reinforcement Finetuning](https://arxiv.org/html/2505.13988):**

- Reinforcement finetuning can reduce refusal rates by >80%, meaning models become overconfident
- The study tested mixing ratios of 0%, 1%, 10%, 30%, 50% "unanswerable" (negative) examples
- **10% negative examples was the optimal ratio** -- it restored appropriate refusal behavior while maintaining task accuracy
- Higher ratios (30-50%) degraded performance on the primary task

**Applied to Evoxtral:** Include ~10-15% of training examples where the audio is emotionally
neutral/flat but the ground truth has NO tags (just plain text). This teaches the model that
not every utterance needs tags.
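A sketch of that negative-example injection (assuming each example is a dict with a `"text"` field and `neutral_pool` holds emotionally flat recordings with tag-free transcripts; all names are hypothetical):

```python
import random

def inject_negatives(dataset, neutral_pool, frac=0.12, seed=0):
    """Add 'negative' examples -- flat audio whose target transcript has
    no tags -- so negatives make up ~`frac` of the final training set."""
    rng = random.Random(seed)
    n_neg = round(len(dataset) * frac / (1 - frac))
    negatives = rng.sample(neutral_pool, min(n_neg, len(neutral_pool)))
    for ex in negatives:
        # Guard: the whole point is that these targets carry no tags
        assert "[" not in ex["text"], "negative targets must be tag-free"
    return dataset + negatives
```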
**Additional anti-hallucination strategies:**

- Train on "familiar, low-perplexity data" -- using high-perplexity examples increases hallucination ([Unfamiliar Finetuning Examples](https://arxiv.org/html/2403.05612v1))
- Include examples where the model must produce a balanced positive/negative ratio of tags ([Robust Instruction Tuning](https://arxiv.org/abs/2306.14565))
- Ensure tag density varies naturally across training examples (some heavily tagged, some sparse)

### The "Cocktail Effect" in Data Mixing

Research on [Data Mixing Optimization for SFT](https://arxiv.org/html/2508.11953v1) found a
"cocktail effect": diverse training data outperforms single-domain approaches. For domain-specific
models, including general instruction data alongside specialized content improved results. A medical
chatbot achieved best performance with **67.7% general data (Alpaca-GPT4) and 32.3% domain data
(PubMedQA).**

**For Evoxtral:** Don't just train on tagged transcriptions. Consider including:

- General ASR examples (plain transcription)
- Diverse audio conditions (clean, noisy, different speakers)
- Various text styles and lengths

---

## 2. Balanced Dataset Design for Structured Output Tasks

### Teaching When NOT to Apply Tags

This is a critical and under-researched area. The SSML annotation literature provides the closest parallels.

**From [SSML Prosody Control Research](https://arxiv.org/html/2508.17494v1):**

- Models consistently **under-generate** tags when not enough tagged examples exist
- But they **over-generate** when training is tag-heavy
- The solution: systematic variation in tag density across training examples

**Recommended tag density distribution for Evoxtral training data:**

| Tag Density | % of Dataset | Description |
|-------------|-------------|-------------|
| None (0 tags) | 25-35% | Plain transcription, emotionally neutral audio |
| Light (1-2 tags) | 25-30% | Subtle emotion, single non-verbal |
| Medium (3-5 tags) | 25-30% | Multiple emotions, mixed delivery |
| Heavy (6+ tags) | 10-15% | Highly expressive, dramatic audio |
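When planning synthetic script generation, this distribution can be enforced up front. A sketch, using midpoints of the ranges above as bucket weights (names are illustrative):

```python
import random

# Midpoints of the recommended tag-density ranges above
DENSITY_WEIGHTS = {"none": 0.30, "light": 0.275, "medium": 0.275, "heavy": 0.15}

def density_plan(n_examples, seed=0):
    """Assign each planned example a tag-density bucket so the finished
    dataset matches the target distribution in expectation."""
    rng = random.Random(seed)
    buckets, weights = zip(*DENSITY_WEIGHTS.items())
    return rng.choices(buckets, weights=weights, k=n_examples)
```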
### Structured Output Quality

From [Databricks End-to-End Structured Extraction](https://community.databricks.com/t5/technical-blog/end-to-end-structured-extraction-with-llm-part-2-fine-tuning/ba-p/99900):

- Training data should be "structured, token-balanced, and metadata-tagged"
- For tagged output tasks, ensure the tokenizer properly handles your tag vocabulary
- Label masking (computing loss only on output tokens) is essential -- Evoxtral already plans this; a sketch follows this list
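A minimal sketch of that label masking, using the `-100` ignore index that PyTorch cross-entropy and the Hugging Face Trainer convention expect (Evoxtral's actual collator may differ; `prompt_len` marks where the target transcription starts):

```python
IGNORE_INDEX = -100  # ignored by PyTorch's cross-entropy loss

def mask_prompt_labels(input_ids, prompt_len):
    """Build labels from input_ids, masking the prompt/audio portion so
    loss is computed only on the output (transcription) tokens."""
    labels = list(input_ids)
    labels[:prompt_len] = [IGNORE_INDEX] * prompt_len
    return labels
```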
---

## 3. Synthetic Data Quality and Diversity

### Best Practices from Research

**Quality filtering ([Eugene Yan's comprehensive guide](https://eugeneyan.com/writing/synthetic/)):**

- Use a **ROUGE-L < 0.7** threshold against existing examples to ensure diversity (Self-Instruct method)
- Remove impossible instructions (e.g., referencing images for text-only models)
- Apply validation scoring: chain-of-thought + 5-point scale, average 3 scores per response
- Moderate imperfection is workable: in one case, **even though only 54% of synthetic samples had completely valid fields, training on them still improved performance by 33%**
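A sketch of the ROUGE-L diversity filter, assuming the `rouge-score` package; the greedy quadratic pass mirrors Self-Instruct's approach:

```python
from rouge_score import rouge_scorer  # pip install rouge-score

def filter_diverse(candidates, threshold=0.7):
    """Greedily keep a candidate only if its ROUGE-L F1 against every
    already-kept example is below the Self-Instruct threshold of 0.7."""
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    kept = []
    for text in candidates:
        if all(scorer.score(prev, text)["rougeL"].fmeasure < threshold
               for prev in kept):
            kept.append(text)
    return kept
```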
**Diversity strategies:**

- **Iterative sampling**: Start with 8 seed examples, progressively incorporate generated ones
- **Template expansion**: Create 2+ alternative formulations for each task
- **Attribute conditioning**: Vary all controllable attributes systematically
- **Style variation**: Generate multiple styles (e.g., the WRAP paper used easy/medium/hard/Q&A formats, achieving a 3x training speedup with a 1:1 real-to-synthetic ratio)

### Synthetic Data for Speech/Audio Tasks

**From [Optimized Synthetic Data for ASR](https://arxiv.org/html/2508.21631v1):**

- Cyclically iterate over speakers without replacement to maximize speaker diversity
- TTS and voice conversion systems are viable for ASR data augmentation
- Synthetic data lacks diversity in pitch, speed, and background noise compared to authentic audio

**From [Synthio Audio Classification](https://arxiv.org/html/2410.02056v1):**

- Aligning synthetic audio for consistency with a small-scale version of the target dataset, while preserving diversity, significantly improves performance
- Data augmentations for acoustic diversity boost out-of-distribution generalization

### Stratified Sampling for Evoxtral

**Recommended stratification axes for the training dataset:**

| Axis | Categories | Rationale |
|------|-----------|-----------|
| Emotion type | excited, sad, angry, nervous, calm, frustrated | Balanced representation of all target emotions |
| Non-verbal sounds | laughs, sighs, gasps, clears throat, crying | Each sound type needs adequate coverage |
| Speaker gender | male, female, neutral | Prevent gender bias in emotion detection |
| Audio length | short (<10s), medium (10-30s), long (30s+) | Varied context window utilization |
| Tag density | none, light, medium, heavy (see table above) | Critical for preventing over/under-generation |
| Emotional valence | positive, negative, neutral | Prevent bias toward detecting only negative emotions |
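A sketch of a stratified generation plan over a few of these axes (categories copied from the table; treating the axes as independent is a simplification, since in practice valence follows the chosen emotion):

```python
import itertools
import random

AXES = {
    "emotion": ["excited", "sad", "angry", "nervous", "calm", "frustrated"],
    "gender": ["male", "female", "neutral"],
    "length": ["short", "medium", "long"],
}

def stratified_plan(n_examples, seed=0):
    """Cycle through every axis combination so each stratum is covered
    near-equally before any stratum repeats."""
    combos = list(itertools.product(*AXES.values()))
    random.Random(seed).shuffle(combos)
    return [dict(zip(AXES, combos[i % len(combos)])) for i in range(n_examples)]
```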
**Speaker diversity from [Latent Mixup for Speech Recognition](https://arxiv.org/html/2511.20534):**

- Constrain pairings to match gender and dataset partition
- Maintain distribution characteristics across splits

---

## 4. Catastrophic Forgetting Prevention

### How Much Original-Task Data to Mix In

This is the most critical question for Evoxtral: how much plain ASR data to include
so that adding emotion tag capability doesn't degrade word-level transcription quality.

**Key finding from [Apple's Scaling Laws for Forgetting with Pretraining Data Injection](https://machinelearning.apple.com/research/scaling-laws):**

> Injecting as little as **1% of pretraining data** in the finetuning mixture prevents the model
> from forgetting the pretraining set.

**However, more nuanced findings from [Scaling Laws for Forgetting](https://arxiv.org/html/2401.05605v1):**

- Forgetting follows a **strong inverse linear relationship** with fine-tuning loss
- Forgetting increases as a **shifted power law** in both the number of parameters finetuned and the number of training steps
- Forgetting **cannot be avoided through early stopping** or by varying the parameter count
- LoRA still suffers from forgetting, though less than full finetuning

**Concrete replay buffer recommendations by task type:**

| Task Type | Minimum Replay Buffer | Recommended Buffer | Source |
|-----------|----------------------|-------------------|--------|
| NLU tasks (classification, NLI) | 1-2% | 5% | Empirical study on catastrophic forgetting |
| Math/Code tasks | 5-10% | 15-20% | Same study |
| Structured output (like tags) | ~10% | 25-35% | Extrapolated from mixed training results |

**For Evoxtral specifically:**

- Since you're adding a **new structural capability** (tag generation) on top of an existing one (ASR), the risk is higher than for simple domain adaptation
- **Recommended: 25-35% of training data should be plain ASR transcription** (no tags)
- This is supported by the mixed training study, where the 1:1 ratio achieved full new-task performance with only 0.7pp base-task degradation

### The "Tax" of Adding New Capabilities

From the math finetuning study:

- **Math-only training**: Math accuracy went 3.1% -> 12.0%, but NLI dropped 81.0% -> 16.5% (catastrophic)
- **1:1 mixed training**: Math accuracy 12.0% (same!), NLI 86.2% (only 0.7pp drop from 86.9%)
- **Even 15:1 (93.8% new task)**: Original task maintained at 83.8% vs 86.9% baseline

**Bottom line**: With proper data mixing, the "tax" of adding tagged transcription capability
should be **less than 3% WER degradation** on plain transcription tasks, likely under 1% with
a 1:1 mix.

### LoRA-Specific Forgetting Mitigation

LoRA inherently reduces forgetting compared to full finetuning because:

- Fewer parameters are modified (lower rank = less forgetting)
- Base weights remain frozen
- The adapter can be merged or removed

However, the [scaling laws paper](https://arxiv.org/html/2401.05605v1) found that LoRA still
exhibits forgetting that follows the same power law. The data mixing strategy remains essential
even with LoRA.

---

## 5. Class Imbalance in Tag/Label Finetuning

### The Problem for Evoxtral

Some tags will naturally be rarer than others:

- `[excited]` and `[laughs]` likely appear frequently
- `[gasps]`, `[stammers]`, `[clears throat]` are much rarer
- `[pause]` and emphasis (CAPS) are potentially in every example

### Balancing Strategies

**Three main approaches from [Class-Balanced Loss (CVPR 2019)](https://openaccess.thecvf.com/content_CVPR_2019/papers/Cui_Class-Balanced_Loss_Based_on_Effective_Number_of_Samples_CVPR_2019_paper.pdf):**

1. **Oversampling rare classes**: Simple but risks overfitting to repeated examples
2. **Undersampling common classes**: Loses valuable training signal
3. **Weighted loss**: Reweight by the effective number of samples -- the best theoretical approach

**Class-Balanced Loss formula:**

```
weight_i = (1 - beta) / (1 - beta^n_i)
```

where `n_i` = number of samples for class `i`, and `beta` is typically 0.9, 0.99, or 0.999.
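A worked sketch of these weights in Python; normalizing so the weights sum to the number of classes follows the paper's convention:

```python
import numpy as np

def class_balanced_weights(counts, beta=0.999):
    """Cui et al. (CVPR 2019): w_i = (1 - beta) / (1 - beta**n_i),
    normalized so the weights sum to the number of classes."""
    counts = np.asarray(counts, dtype=float)
    weights = (1 - beta) / (1 - beta ** counts)
    return weights * len(counts) / weights.sum()

# Example: a frequent tag (500 samples) vs a rare one (5 samples)
print(class_balanced_weights([500, 5]))  # rare class gets a far larger weight
```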
**For generative models (Evoxtral's case), the [HuggingFace forum discussion](https://discuss.huggingface.co/t/handling-class-imbalance-when-finetuning-a-decoder-model-on-text-generation/173010) notes:**

- Weighted loss is harder to apply in token-level generation
- **Oversampling with variation is often more practical** for generative models
- Ensure rare tags appear in diverse contexts (different sentences, emotions, speakers)

### Recommended Strategy for Evoxtral

**Hybrid approach:**

1. **Stratified generation**: When creating synthetic training data, ensure minimum representation:
   - Each tag type should appear in at least 5-10% of tagged examples
   - Use the LLM script generator to specifically request rare tag scenarios
2. **Contextual oversampling**: For rare tags, generate multiple variations (see the sketch after the table below):
   - `[gasps]` in surprise context, fear context, excitement context
   - `[stammers]` in nervous context, angry context, confused context
   - Aim for 3-5x oversampling of the rarest tags relative to the natural distribution
3. **Minimum tag frequency targets:**

| Tag Category | Minimum % of Tagged Examples | Natural Frequency | Oversampling Factor |
|-------------|-----------------------------|--------------------|---------------------|
| [excited], [sad], [angry] | 15-20% each | High | 1x (none) |
| [calm], [nervous], [frustrated] | 10-15% each | Medium | 1.5-2x |
| [laughs], [sighs] | 10-15% each | Medium-High | 1x |
| [gasps], [crying] | 8-12% each | Low | 2-3x |
| [whispers], [shouts] | 8-12% each | Low | 2-3x |
| [stammers], [clears throat] | 5-10% each | Very Low | 3-5x |
| [pause], CAPS emphasis | Present in 40-60% | Very High | 0.5x (undersample) |
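A sketch of the contextual-oversampling step: rather than literally duplicating audio, each extra copy of a rare-tag script is re-requested from the script generator under a different context. The factors approximate the table above, and the `requested_context` field is a hypothetical hook for that generator:

```python
import random

# Illustrative oversampling factors for the rarest tags
OVERSAMPLE = {"[gasps]": 3, "[crying]": 3, "[whispers]": 2, "[shouts]": 2,
              "[stammers]": 4, "[clears throat]": 4}
CONTEXTS = ["surprise", "fear", "excitement", "nervousness", "anger", "confusion"]

def oversample_rare(scripts, seed=0):
    """For each script containing a rare tag, queue (factor - 1) extra
    generation requests, each pinned to a different emotional context."""
    rng = random.Random(seed)
    out = list(scripts)
    for script in scripts:
        factor = max((f for tag, f in OVERSAMPLE.items() if tag in script["text"]),
                     default=1)
        for _ in range(factor - 1):
            out.append({**script, "requested_context": rng.choice(CONTEXTS)})
    return out
```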
---

## 6. Concrete Recommendations for Evoxtral Training Data

### Final Dataset Composition (for 800 total examples)

| Category | Count | Percentage | Description |
|----------|-------|------------|-------------|
| Heavily tagged | 80-120 | 10-15% | 6+ tags, dramatic/expressive audio |
| Medium tagged | 200-240 | 25-30% | 3-5 tags, moderate emotion |
| Lightly tagged | 200-240 | 25-30% | 1-2 tags, subtle emotion |
| Plain transcription | 240-280 | 30-35% | 0 tags, neutral delivery |

### Quality Checklist for Training Data

- [ ] Tag density varies naturally (not every sentence has a tag)
- [ ] Each of the 15+ target tags appears in 40-80 examples
- [ ] Rare tags are oversampled 2-5x with diverse contexts
- [ ] 30-35% of examples are plain transcription (anti-hallucination)
- [ ] Speaker diversity: at least 6-8 distinct voices
- [ ] Audio length varies (short, medium, long segments)
- [ ] Emotional valence balanced (positive/negative/neutral)
- [ ] ROUGE-L between any two examples < 0.7 (diversity check)
- [ ] Tag positions vary within sentences (beginning, middle, end)
- [ ] Some examples have closely spaced tags, others widely spaced
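Parts of this checklist are mechanical and can be automated. A sketch auditing two of the items -- per-tag coverage and the plain-transcription share -- assuming each example is a dict with a `"text"` field:

```python
import collections
import re

def audit(dataset, min_per_tag=40, plain_range=(0.30, 0.35)):
    """Check per-tag coverage and the fraction of tag-free examples."""
    tag_counts = collections.Counter()
    n_plain = 0
    for ex in dataset:
        tags = re.findall(r"\[[a-z ]+\]", ex["text"])
        tag_counts.update(tags)
        n_plain += 0 if tags else 1
    problems = [f"{tag}: only {n} examples" for tag, n in tag_counts.items()
                if n < min_per_tag]
    plain_frac = n_plain / len(dataset)
    if not plain_range[0] <= plain_frac <= plain_range[1]:
        problems.append(f"plain fraction {plain_frac:.2f} outside {plain_range}")
    return problems
```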
### Training Configuration Notes

- **Epochs**: 1-2 (more increases forgetting risk)
- **LoRA rank**: Treat as a hyperparameter; sweep [8, 16, 32, 64]
- **Learning rate**: Conservative (1e-5 to 5e-5 range)
- **Label masking**: Essential -- only compute loss on output tokens
- **Evaluation**: Track both WER (plain transcription quality) AND tag F1 simultaneously
- **Early stopping**: Monitor WER on a held-out plain transcription set; stop if it degrades >2%
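A sketch of how these notes might translate into a PEFT/transformers setup. The `target_modules` names are an assumption and should be verified against Voxtral-Mini-3B's actual layers, and a Trainer-based flow would still need the label-masking collator from Section 2:

```python
from peft import LoraConfig
from transformers import TrainingArguments

lora_config = LoraConfig(
    r=16,                           # sweep [8, 16, 32, 64]
    lora_alpha=32,
    lora_dropout=0.05,
    # Assumed attention projection names; check the model architecture
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="evoxtral-lora",
    num_train_epochs=2,             # 1-2 epochs; more raises forgetting risk
    learning_rate=2e-5,             # conservative, within the 1e-5 to 5e-5 range
    per_device_train_batch_size=4,
)
```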
---

## Sources

- [Practical Tips for Finetuning LLMs Using LoRA](https://magazine.sebastianraschka.com/p/practical-tips-for-finetuning-llms) - Sebastian Raschka
- [The Hallucination Tax of Reinforcement Finetuning](https://arxiv.org/html/2505.13988) - Negative example ratios
- [Unfamiliar Finetuning Examples Control How LLMs Hallucinate](https://arxiv.org/html/2403.05612v1)
- [Mitigating Catastrophic Forgetting via Mixed Training](https://arxiv.org/html/2512.13706) - Data replay ratios
- [Scaling Laws for Forgetting When Fine-Tuning LLMs](https://arxiv.org/html/2401.05605v1) - Power law relationships
- [Scaling Laws for Forgetting with Pretraining Data Injection](https://machinelearning.apple.com/research/scaling-laws) - Apple, 1% replay finding
- [Data Mixing Optimization for SFT](https://arxiv.org/html/2508.11953v1) - Cocktail effect
- [How to Generate and Use Synthetic Data for Finetuning](https://eugeneyan.com/writing/synthetic/) - Eugene Yan
- [On the Diversity of Synthetic Data](https://arxiv.org/html/2410.15226v2)
- [Data Diversity Matters for Robust Instruction Tuning](https://aclanthology.org/2024.findings-emnlp.195.pdf)
- [Class-Balanced Loss Based on Effective Number of Samples](https://openaccess.thecvf.com/content_CVPR_2019/papers/Cui_Class-Balanced_Loss_Based_on_Effective_Number_of_Samples_CVPR_2019_paper.pdf) - CVPR 2019
- [Improving French Synthetic Speech Quality via SSML](https://arxiv.org/html/2508.17494v1)
- [Towards Improved Speech Recognition through Synthetic Data](https://arxiv.org/html/2508.21631v1)
- [Efficient Fine-Tuning with LoRA Guide](https://www.databricks.com/blog/efficient-fine-tuning-lora-guide-llms) - Databricks
- [How to Fine-Tune: Focus on Effective Datasets](https://ai.meta.com/blog/how-to-fine-tune-llms-peft-dataset-curation/) - Meta
- [Extrinsic Hallucinations in LLMs](https://lilianweng.github.io/posts/2024-07-07-hallucination/) - Lilian Weng