
Training Data Composition & Balancing for Evoxtral Finetuning

Research summary on best practices for training data composition when finetuning LLMs, applied to the Evoxtral use case (LoRA finetuning Voxtral-Mini-3B to produce tagged transcriptions).


1. Data Mixing Strategies for SFT/LoRA Finetuning

Optimal Ratios of Tagged vs Plain Data

The single most important finding across the literature: always include plain/untagged examples in your training mix. Training exclusively on tagged transcriptions will cause the model to hallucinate tags everywhere and degrade base transcription quality.

Concrete ratios from research:

| Mix Ratio (Task:Original) | Source | Result |
|---|---|---|
| 1:1 (50% new, 50% original) | Mixed Training for Math Reasoning | Best balance -- full new-task performance with only 0.7pp original-task degradation |
| 3:1 (75% new, 25% original) | Same study | New-task performance maintained; original task drops ~1.4pp |
| 7:1 (87.5% new, 12.5% original) | Same study | Still effective; original task drops ~2.5pp |
| 15:1 (93.8% new, 6.2% original) | Same study | Minimum viable -- original task drops ~3.2pp but still far better than 0% |

For Evoxtral specifically: with a target of 500-1000 total training pairs, aim for:

  • 60-70% tagged transcriptions (emotion tags, non-verbal markers, delivery cues)
  • 30-40% plain transcriptions (standard ASR output, no tags at all)

This ratio prevents the model from learning "always add tags" and preserves base transcription quality.
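
A minimal sketch of assembling this mix, assuming `tagged_pairs` and `plain_pairs` are lists of (audio, transcript) examples prepared elsewhere; the 0.65 tagged fraction is the midpoint of the recommended range:

```python
import random

def build_training_mix(tagged_pairs, plain_pairs,
                       tagged_fraction=0.65, total=800, seed=0):
    """Assemble a 60-70% tagged / 30-40% plain training mix."""
    rng = random.Random(seed)
    n_tagged = int(total * tagged_fraction)
    n_plain = total - n_tagged
    mix = (rng.sample(tagged_pairs, n_tagged)
           + rng.sample(plain_pairs, n_plain))
    rng.shuffle(mix)  # interleave so every batch sees both kinds
    return mix
```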

Preventing Tag Hallucination

Research on preventing hallucination during finetuning is directly applicable to preventing over-generation of audio tags.

Key findings from The Hallucination Tax of Reinforcement Finetuning:

  • Standard reinforcement finetuning can reduce refusal rates by >80%, meaning models become overconfident
  • Tested mixing ratios of 0%, 1%, 10%, 30%, 50% "unanswerable" (negative) examples
  • 10% negative examples was the optimal ratio -- restored appropriate refusal behavior while maintaining task accuracy
  • Higher ratios (30-50%) degraded performance on the primary task

Applied to Evoxtral: Include ~10-15% of training examples where the audio is emotionally neutral/flat but the ground truth has NO tags (just plain text). This teaches the model that not every utterance needs tags.

Additional anti-hallucination strategies:

  • Train on "familiar, low-perplexity data" -- using high-perplexity examples increases hallucination (Unfamiliar Finetuning Examples)
  • Include examples where the model must produce a balanced positive/negative ratio of tags (Robust Instruction Tuning)
  • Ensure tag density varies naturally across training examples (some heavily tagged, some sparse)
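
One way to construct the ~10-15% no-tag negatives described above is to strip all bracket tags from a transcript and pair the plain text with emotionally flat audio. A minimal sketch, assuming tags follow the `[lowercase words]` convention used throughout this document:

```python
import re

# Matches bracket tags like [laughs] or [clears throat]; the character
# class is an assumption based on the tag style used in this document.
TAG_PATTERN = re.compile(r"\[[a-z][a-z_ ]*\]\s*")

def make_negative_example(tagged_transcript: str) -> str:
    """Strip all tags so a neutral clip's ground truth is plain text."""
    plain = TAG_PATTERN.sub("", tagged_transcript)
    return re.sub(r"\s{2,}", " ", plain).strip()
```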

The "Cocktail Effect" in Data Mixing

Research on Data Mixing Optimization for SFT found a "cocktail effect": diverse training data outperforms single-domain approaches. For domain-specific models, including general instruction data alongside specialized content improved results. A medical chatbot achieved best performance with 67.7% general data (Alpaca-GPT4) and 32.3% domain data (PubMedQA).

For Evoxtral: Don't just train on tagged transcriptions. Consider including:

  • General ASR examples (plain transcription)
  • Diverse audio conditions (clean, noisy, different speakers)
  • Various text styles and lengths

2. Balanced Dataset Design for Structured Output Tasks

Teaching When NOT to Apply Tags

This is a critical and under-researched area. The SSML annotation literature provides the closest parallels.

From SSML Prosody Control Research:

  • Models consistently under-generate tags when tagged examples are scarce
  • They over-generate tags when the training mix is tag-heavy
  • The solution: systematic variation in tag density across training examples

Recommended tag density distribution for Evoxtral training data:

| Tag Density | % of Dataset | Description |
|---|---|---|
| None (0 tags) | 25-35% | Plain transcription, emotionally neutral audio |
| Light (1-2 tags) | 25-30% | Subtle emotion, single non-verbal |
| Medium (3-5 tags) | 25-30% | Multiple emotions, mixed delivery |
| Heavy (6+ tags) | 10-15% | Highly expressive, dramatic audio |
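
A minimal sketch of enforcing this distribution when sampling the final dataset, using the midpoints of the ranges above as targets (`count_tags` is an assumed helper returning the number of tags in an example):

```python
import random
from collections import defaultdict

# Midpoints of the density ranges in the table above.
DENSITY_TARGETS = {"none": 0.30, "light": 0.28, "medium": 0.28, "heavy": 0.14}

def density_bucket(n_tags: int) -> str:
    if n_tags == 0:
        return "none"
    if n_tags <= 2:
        return "light"
    if n_tags <= 5:
        return "medium"
    return "heavy"

def sample_by_density(examples, count_tags, total=800, seed=0):
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for ex in examples:
        buckets[density_bucket(count_tags(ex))].append(ex)
    out = []
    for bucket, frac in DENSITY_TARGETS.items():
        pool = buckets[bucket]
        out.extend(rng.sample(pool, min(len(pool), round(total * frac))))
    rng.shuffle(out)
    return out
```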

Structured Output Quality

From Databricks End-to-End Structured Extraction:

  • Training data should be "structured, token-balanced, and metadata-tagged"
  • For tagged output tasks, ensure the tokenizer properly handles your tag vocabulary
  • Label masking (computing loss only on output tokens) is essential -- Evoxtral already plans this
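
A minimal sketch of that label masking, assuming the standard HuggingFace convention that label -100 is ignored by the cross-entropy loss; `prompt_ids` covers the audio/instruction prefix and `target_ids` the (possibly tagged) transcription:

```python
def build_masked_labels(prompt_ids: list[int], target_ids: list[int]) -> dict:
    """Compute loss only on the output tokens, not the prompt."""
    input_ids = prompt_ids + target_ids
    labels = [-100] * len(prompt_ids) + list(target_ids)
    return {"input_ids": input_ids, "labels": labels}
```

It is also worth verifying that the tokenizer keeps each tag to a small, stable number of tokens (e.g., inspect `tokenizer.tokenize("[laughs]")`) before training.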

3. Synthetic Data Quality and Diversity

Best Practices from Research

Quality filtering (Eugene Yan's comprehensive guide):

  • Apply a ROUGE-L < 0.7 threshold against existing examples to ensure diversity (Self-Instruct method)
  • Remove impossible instructions (e.g., referencing images for text-only models)
  • Apply validation scoring: chain-of-thought + a 5-point scale, averaging 3 scores per response
  • Even when only 54% of synthetic samples had completely valid fields, training on them still improved performance by 33% -- moderate imperfection is workable
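
The ROUGE-L diversity filter above is straightforward to implement with the `rouge-score` package; a minimal sketch, rejecting any candidate whose ROUGE-L F1 against an already-accepted example reaches 0.7:

```python
from rouge_score import rouge_scorer  # pip install rouge-score

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def is_diverse(candidate: str, accepted: list[str], threshold: float = 0.7) -> bool:
    """True if the candidate is sufficiently different from all accepted texts."""
    return all(
        scorer.score(prev, candidate)["rougeL"].fmeasure < threshold
        for prev in accepted
    )
```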

Diversity strategies:

  • Iterative sampling: Start with 8 seed examples, progressively incorporate generated ones
  • Template expansion: Create 2+ alternative formulations for each task
  • Attribute conditioning: Vary all controllable attributes systematically
  • Style variation: Generate multiple styles (e.g., WRAP paper used easy/medium/hard/Q&A formats, achieving 3x training speedup with 1:1 real-to-synthetic ratio)

Synthetic Data for Speech/Audio Tasks

From Optimized Synthetic Data for ASR:

  • Cyclically iterate over speakers without replacement to maximize speaker diversity
  • TTS and voice conversion systems are viable for ASR data augmentation
  • Synthetic data lacks diversity in pitch, speed, and background noise compared to authentic audio

From Synthio Audio Classification:

  • Generating synthetic audio that stays consistent with a small-scale version of the target dataset, while remaining diverse, significantly improves performance
  • Data augmentations for acoustic diversity boost out-of-distribution generalization

Stratified Sampling for Evoxtral

Recommended stratification axes for the training dataset:

| Axis | Categories | Rationale |
|---|---|---|
| Emotion type | excited, sad, angry, nervous, calm, frustrated | Balanced representation of all target emotions |
| Non-verbal sounds | laughs, sighs, gasps, clears throat, crying | Each sound type needs adequate coverage |
| Speaker gender | male, female, neutral | Prevent gender bias in emotion detection |
| Audio length | short (<10s), medium (10-30s), long (30s+) | Varied context window utilization |
| Tag density | none, light, medium, heavy (see table above) | Critical for preventing over/under-generation |
| Emotional valence | positive, negative, neutral | Prevent bias toward detecting only negative emotions |
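
A minimal sketch of turning these axes into a generation plan by enumerating the full grid; cells that are internally inconsistent (e.g., an "angry" emotion with "positive" valence) would be pruned before requesting scripts:

```python
from itertools import product

EMOTIONS = ["excited", "sad", "angry", "nervous", "calm", "frustrated"]
GENDERS = ["male", "female", "neutral"]
LENGTHS = ["short", "medium", "long"]
VALENCES = ["positive", "negative", "neutral"]

strata = [
    {"emotion": e, "gender": g, "length": l, "valence": v}
    for e, g, l, v in product(EMOTIONS, GENDERS, LENGTHS, VALENCES)
    # prune incompatible emotion/valence pairs here
]
```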

Speaker diversity from Latent Mixup for Speech Recognition:

  • Constrain pairings to match gender and dataset partition
  • Maintain distribution characteristics across splits

4. Catastrophic Forgetting Prevention

How Much Original-Task Data to Mix In

This is the most critical question for Evoxtral: how much plain ASR data to include so that adding emotion tag capability doesn't degrade word-level transcription quality.

Key finding from Apple's Scaling Laws for Forgetting with Pretraining Data Injection:

Injecting as little as 1% of pretraining data in the finetuning mixture prevents the model from forgetting the pretraining set.

However, more nuanced findings from Scaling Laws for Forgetting:

  • Forgetting follows a strong inverse linear relationship with fine-tuning loss
  • Forgetting increases as a shifted power law in both parameters finetuned and training steps
  • Forgetting cannot be avoided through early stopping or varying parameter counts
  • LoRA still suffers from forgetting, though less than full finetuning

Concrete replay buffer recommendations by task type:

| Task Type | Minimum Replay Buffer | Recommended Buffer | Source |
|---|---|---|---|
| NLU tasks (classification, NLI) | 1-2% | 5% | Empirical study on catastrophic forgetting |
| Math/Code tasks | 5-10% | 15-20% | Same study |
| Structured output (like tags) | ~10% | 25-35% | Extrapolated from mixed training results |

For Evoxtral specifically:

  • Since you're adding a new structural capability (tag generation) on top of an existing one (ASR), the risk is higher than simple domain adaptation
  • Recommended: 25-35% of training data should be plain ASR transcription (no tags)
  • This is supported by the mixed training study showing 1:1 ratio achieving equivalent base-task performance with only 0.7pp degradation

The "Tax" of Adding New Capabilities

From the math finetuning study:

  • Math-only training: Math accuracy went 3.1% -> 12.0%, but NLI dropped 81.0% -> 16.5% (catastrophic)
  • 1:1 mixed training: Math accuracy 12.0% (same!), NLI 86.2% (only 0.7pp drop from 86.9%)
  • Even 15:1 (93.8% new task): Original task maintained at 83.8% vs 86.9% baseline

Bottom line: With proper data mixing, the "tax" of adding tagged transcription capability should be less than 3% WER degradation on plain transcription tasks, likely under 1% with a 1:1 mix.

LoRA-Specific Forgetting Mitigation

LoRA inherently reduces forgetting compared to full finetuning because:

  • Fewer parameters are modified (lower rank = less forgetting)
  • Base weights remain frozen
  • The adapter can be merged or removed

However, the scaling laws paper found LoRA still exhibits forgetting that follows the same power law. The data mixing strategy remains essential even with LoRA.


5. Class Imbalance in Tag/Label Finetuning

The Problem for Evoxtral

Some tags will naturally be rarer than others:

  • [excited] and [laughs] likely appear frequently
  • [gasps], [stammers], [clears throat] are much rarer
  • [pause] and emphasis (CAPS) are potentially in every example

Balancing Strategies

Three main approaches from Class-Balanced Loss (CVPR 2019):

  1. Oversampling rare classes: Simple but risks overfitting to repeated examples
  2. Undersampling common classes: Loses valuable training signal
  3. Weighted loss: Reweight by effective number of samples -- best theoretical approach

Class-Balanced Loss formula:

```
weight_i = (1 - beta) / (1 - beta^n_i)
```

where n_i = number of samples for class i, and beta is typically 0.9, 0.99, or 0.999.
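
A minimal sketch of computing these weights from per-tag counts, normalized so they average to 1 across classes (the example counts are hypothetical):

```python
def class_balanced_weights(counts: dict[str, int], beta: float = 0.99) -> dict[str, float]:
    """weight_i = (1 - beta) / (1 - beta ** n_i), normalized to mean 1."""
    raw = {tag: (1 - beta) / (1 - beta ** n) for tag, n in counts.items()}
    scale = len(raw) / sum(raw.values())
    return {tag: w * scale for tag, w in raw.items()}

# A rare tag receives roughly 4x the weight of a frequent one here:
print(class_balanced_weights({"[laughs]": 400, "[gasps]": 25}))
```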

For generative models (Evoxtral's case), the HuggingFace forum discussion notes:

  • Weighted loss is harder to apply in token-level generation
  • Oversampling with variation is often more practical for generative models
  • Ensure rare tags appear in diverse contexts (different sentences, emotions, speakers)

Recommended Strategy for Evoxtral

Hybrid approach:

  1. Stratified generation: When creating synthetic training data, ensure minimum representation:
    • Each tag type should appear in at least 5-10% of tagged examples
    • Use the LLM script generator to specifically request rare tag scenarios
  2. Contextual oversampling: For rare tags, generate multiple variations:
    • [gasps] in surprise context, fear context, excitement context
    • [stammers] in nervous context, angry context, confused context
    • Aim for 3-5x oversampling of the rarest tags relative to the natural distribution
  3. Minimum tag frequency targets (an oversampling sketch follows the table below):

| Tag Category | Minimum % of Tagged Examples | Natural Frequency | Oversampling Factor |
|---|---|---|---|
| [excited], [sad], [angry] | 15-20% each | High | 1x (none) |
| [calm], [nervous], [frustrated] | 10-15% each | Medium | 1.5-2x |
| [laughs], [sighs] | 10-15% each | Medium-High | 1x |
| [gasps], [crying] | 8-12% each | Low | 2-3x |
| [whispers], [shouts] | 8-12% each | Low | 2-3x |
| [stammers], [clears throat] | 5-10% each | Very Low | 3-5x |
| [pause], CAPS emphasis | Present in 40-60% | Very High | 0.5x (undersample) |
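
A minimal sketch of applying the oversampling factors above; plain duplication is shown as a placeholder, but in practice each extra copy should be regenerated in a new context per the contextual oversampling point (the specific factors are picked from within the table's ranges, and `get_tags` is an assumed helper):

```python
import random

OVERSAMPLE = {
    "[gasps]": 3, "[crying]": 3, "[whispers]": 2, "[shouts]": 2,
    "[stammers]": 4, "[clears throat]": 4,
}

def oversample_rare(examples, get_tags, seed=0):
    rng = random.Random(seed)
    out = list(examples)
    for ex in examples:
        factor = max((OVERSAMPLE.get(t, 1) for t in get_tags(ex)), default=1)
        out.extend([ex] * (factor - 1))  # placeholder: regenerate with variation
    rng.shuffle(out)
    return out
```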

6. Concrete Recommendations for Evoxtral Training Data

Final Dataset Composition (for 800 total examples)

| Category | Count | Percentage | Description |
|---|---|---|---|
| Heavily tagged | 80-120 | 10-15% | 6+ tags, dramatic/expressive audio |
| Medium tagged | 200-240 | 25-30% | 3-5 tags, moderate emotion |
| Lightly tagged | 200-240 | 25-30% | 1-2 tags, subtle emotion |
| Plain transcription | 240-280 | 30-35% | 0 tags, neutral delivery |

Quality Checklist for Training Data

  • Tag density varies naturally (not every sentence has a tag)
  • All 15+ target tags appear in at least 40-80 examples
  • Rare tags are oversampled 2-5x with diverse contexts
  • 30-35% of examples are plain transcription (anti-hallucination)
  • Speaker diversity: at least 6-8 distinct voices
  • Audio length varies (short, medium, long segments)
  • Emotional valence balanced (positive/negative/neutral)
  • ROUGE-L between any two examples < 0.7 (diversity check)
  • Tag positions vary within sentences (beginning, middle, end)
  • Some examples have closely spaced tags, others widely spaced
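
Several of these checks are easy to automate; a minimal sketch that audits the plain-transcription share and the per-tag example floor (the tag regex follows this document's bracket convention):

```python
import re
from collections import Counter

TAG_RE = re.compile(r"\[[a-z][a-z_ ]*\]")

def audit(transcripts: list[str]) -> None:
    plain = sum(1 for t in transcripts if not TAG_RE.search(t))
    print(f"plain examples: {plain / len(transcripts):.0%} (target 30-35%)")
    tag_counts = Counter(tag for t in transcripts for tag in TAG_RE.findall(t))
    for tag, n in tag_counts.most_common():
        flag = "  <-- below the 40-example floor" if n < 40 else ""
        print(f"{tag}: {n}{flag}")
```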

Training Configuration Notes

  • Epochs: 1-2 (more increases forgetting risk)
  • LoRA rank: Treat as hyperparameter; sweep [8, 16, 32, 64]
  • Learning rate: Conservative (1e-5 to 5e-5 range)
  • Label masking: Essential -- only compute loss on output tokens
  • Evaluation: Track both WER (plain transcription quality) AND tag F1 simultaneously
  • Early stopping: Monitor WER on a held-out plain transcription set; stop if it degrades >2%
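
A hypothetical starting configuration reflecting these notes, using the `peft` library; the `target_modules` list is a placeholder and must be checked against Voxtral-Mini-3B's actual module names:

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                     # sweep [8, 16, 32, 64]
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
    task_type="CAUSAL_LM",
)
# Pair with a learning rate in the 1e-5 to 5e-5 range, 1-2 epochs, label
# masking as above, and an eval loop tracking WER alongside tag F1.
```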

Sources