
Training Data Composition & Balancing for Evoxtral Finetuning

Research summary on best practices for training data composition when finetuning LLMs, applied to the Evoxtral use case (LoRA finetuning Voxtral-Mini-3B to produce tagged transcriptions).


1. Data Mixing Strategies for SFT/LoRA Finetuning

Optimal Ratios of Tagged vs Plain Data

The single most important finding across the literature: always include plain/untagged examples in your training mix. Training exclusively on tagged transcriptions will cause the model to hallucinate tags everywhere and degrade base transcription quality.

Concrete ratios from research:

| Mix Ratio (Task:Original) | Source | Result |
|---|---|---|
| 1:1 (50% new, 50% original) | Mixed Training for Math Reasoning | Best balance -- full new-task performance with only 0.7pp original-task degradation |
| 3:1 (75% new, 25% original) | Same study | New-task performance maintained; original task drops ~1.4pp |
| 7:1 (87.5% new, 12.5% original) | Same study | Still effective; original task drops ~2.5pp |
| 15:1 (93.8% new, 6.2% original) | Same study | Minimum viable -- original task drops ~3.2pp but still far better than 0% |

For Evoxtral specifically: with a target of 500-1000 total training pairs, aim for:

  • 60-70% tagged transcriptions (emotion tags, non-verbal markers, delivery cues)
  • 30-40% plain transcriptions (standard ASR output, no tags at all)

This ratio prevents the model from learning "always add tags" and preserves base transcription quality.
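
A minimal sketch of assembling this mix, assuming `tagged_pairs` and `plain_pairs` are lists of (audio, transcript) examples prepared elsewhere; the 0.65 tagged fraction is the midpoint of the recommended range:

```python
import random

def build_training_mix(tagged_pairs, plain_pairs,
                       tagged_fraction=0.65, total=800, seed=0):
    """Assemble a 60-70% tagged / 30-40% plain training mix."""
    rng = random.Random(seed)
    n_tagged = int(total * tagged_fraction)
    n_plain = total - n_tagged
    mix = (rng.sample(tagged_pairs, n_tagged)
           + rng.sample(plain_pairs, n_plain))
    rng.shuffle(mix)  # interleave so every batch sees both kinds
    return mix
```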

Preventing Tag Hallucination

Research on preventing hallucination during finetuning is directly applicable to preventing over-generation of audio tags.

Key findings from The Hallucination Tax of Reinforcement Finetuning:

  • Standard reinforcement finetuning can reduce refusal rates by >80%, meaning models become overconfident
  • Tested mixing ratios of 0%, 1%, 10%, 30%, 50% "unanswerable" (negative) examples
  • 10% negative examples was the optimal ratio -- restored appropriate refusal behavior while maintaining task accuracy
  • Higher ratios (30-50%) degraded performance on the primary task

Applied to Evoxtral: Include ~10-15% of training examples where the audio is emotionally neutral/flat but the ground truth has NO tags (just plain text). This teaches the model that not every utterance needs tags.

Additional anti-hallucination strategies:

  • Train on "familiar, low-perplexity data" -- using high-perplexity examples increases hallucination (Unfamiliar Finetuning Examples)
  • Include examples where the model must produce a balanced positive/negative ratio of tags (Robust Instruction Tuning)
  • Ensure tag density varies naturally across training examples (some heavily tagged, some sparse)
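
One way to construct the ~10-15% no-tag negatives described above is to strip all bracket tags from a transcript and pair the plain text with emotionally flat audio. A minimal sketch, assuming tags follow the `[lowercase words]` convention used throughout this document:

```python
import re

# Matches bracket tags like [laughs] or [clears throat]; the character
# class is an assumption based on the tag style used in this document.
TAG_PATTERN = re.compile(r"\[[a-z][a-z_ ]*\]\s*")

def make_negative_example(tagged_transcript: str) -> str:
    """Strip all tags so a neutral clip's ground truth is plain text."""
    plain = TAG_PATTERN.sub("", tagged_transcript)
    return re.sub(r"\s{2,}", " ", plain).strip()
```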

The "Cocktail Effect" in Data Mixing

Research on Data Mixing Optimization for SFT found a "cocktail effect": diverse training data outperforms single-domain approaches. For domain-specific models, including general instruction data alongside specialized content improved results. A medical chatbot achieved best performance with 67.7% general data (Alpaca-GPT4) and 32.3% domain data (PubMedQA).

For Evoxtral: Don't just train on tagged transcriptions. Consider including:

  • General ASR examples (plain transcription)
  • Diverse audio conditions (clean, noisy, different speakers)
  • Various text styles and lengths

2. Balanced Dataset Design for Structured Output Tasks

Teaching When NOT to Apply Tags

This is a critical and under-researched area. The SSML annotation literature provides the closest parallels.

From SSML Prosody Control Research:

  • Models consistently under-generate tags when tagged examples are scarce
  • They over-generate tags when the training mix is tag-heavy
  • The solution: systematic variation in tag density across training examples

Recommended tag density distribution for Evoxtral training data:

| Tag Density | % of Dataset | Description |
|---|---|---|
| None (0 tags) | 25-35% | Plain transcription, emotionally neutral audio |
| Light (1-2 tags) | 25-30% | Subtle emotion, single non-verbal |
| Medium (3-5 tags) | 25-30% | Multiple emotions, mixed delivery |
| Heavy (6+ tags) | 10-15% | Highly expressive, dramatic audio |
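
A minimal sketch of enforcing this distribution when sampling the final dataset, using the midpoints of the ranges above as targets (`count_tags` is an assumed helper returning the number of tags in an example):

```python
import random
from collections import defaultdict

# Midpoints of the density ranges in the table above.
DENSITY_TARGETS = {"none": 0.30, "light": 0.28, "medium": 0.28, "heavy": 0.14}

def density_bucket(n_tags: int) -> str:
    if n_tags == 0:
        return "none"
    if n_tags <= 2:
        return "light"
    if n_tags <= 5:
        return "medium"
    return "heavy"

def sample_by_density(examples, count_tags, total=800, seed=0):
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for ex in examples:
        buckets[density_bucket(count_tags(ex))].append(ex)
    out = []
    for bucket, frac in DENSITY_TARGETS.items():
        pool = buckets[bucket]
        out.extend(rng.sample(pool, min(len(pool), round(total * frac))))
    rng.shuffle(out)
    return out
```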

Structured Output Quality

From Databricks End-to-End Structured Extraction:

  • Training data should be "structured, token-balanced, and metadata-tagged"
  • For tagged output tasks, ensure the tokenizer properly handles your tag vocabulary
  • Label masking (computing loss only on output tokens) is essential -- Evoxtral already plans this
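
A minimal sketch of that label masking, assuming the standard HuggingFace convention that label -100 is ignored by the cross-entropy loss; `prompt_ids` covers the audio/instruction prefix and `target_ids` the (possibly tagged) transcription:

```python
def build_masked_labels(prompt_ids: list[int], target_ids: list[int]) -> dict:
    """Compute loss only on the output tokens, not the prompt."""
    input_ids = prompt_ids + target_ids
    labels = [-100] * len(prompt_ids) + list(target_ids)
    return {"input_ids": input_ids, "labels": labels}
```

It is also worth verifying that the tokenizer keeps each tag to a small, stable number of tokens (e.g., inspect `tokenizer.tokenize("[laughs]")`) before training.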

3. Synthetic Data Quality and Diversity

Best Practices from Research

Quality filtering (Eugene Yan's comprehensive guide):

  • Apply a ROUGE-L < 0.7 threshold against existing examples to ensure diversity (Self-Instruct method)
  • Remove impossible instructions (e.g., referencing images for text-only models)
  • Apply validation scoring: chain-of-thought + a 5-point scale, averaging 3 scores per response
  • Even when only 54% of synthetic samples had completely valid fields, training on them still improved performance by 33% -- moderate imperfection is workable
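
The ROUGE-L diversity filter above is straightforward to implement with the `rouge-score` package; a minimal sketch, rejecting any candidate whose ROUGE-L F1 against an already-accepted example reaches 0.7:

```python
from rouge_score import rouge_scorer  # pip install rouge-score

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def is_diverse(candidate: str, accepted: list[str], threshold: float = 0.7) -> bool:
    """True if the candidate is sufficiently different from all accepted texts."""
    return all(
        scorer.score(prev, candidate)["rougeL"].fmeasure < threshold
        for prev in accepted
    )
```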

Diversity strategies:

  • Iterative sampling: Start with 8 seed examples, progressively incorporate generated ones
  • Template expansion: Create 2+ alternative formulations for each task
  • Attribute conditioning: Vary all controllable attributes systematically
  • Style variation: Generate multiple styles (e.g., WRAP paper used easy/medium/hard/Q&A formats, achieving 3x training speedup with 1:1 real-to-synthetic ratio)

Synthetic Data for Speech/Audio Tasks

From Optimized Synthetic Data for ASR:

  • Cyclically iterate over speakers without replacement to maximize speaker diversity
  • TTS and voice conversion systems are viable for ASR data augmentation
  • Synthetic data lacks diversity in pitch, speed, and background noise compared to authentic audio

From Synthio Audio Classification:

  • Generating synthetic audio that stays consistent with a small-scale version of the target dataset, while remaining diverse, significantly improves performance
  • Data augmentations for acoustic diversity boost out-of-distribution generalization

Stratified Sampling for Evoxtral

Recommended stratification axes for the training dataset:

| Axis | Categories | Rationale |
|---|---|---|
| Emotion type | excited, sad, angry, nervous, calm, frustrated | Balanced representation of all target emotions |
| Non-verbal sounds | laughs, sighs, gasps, clears throat, crying | Each sound type needs adequate coverage |
| Speaker gender | male, female, neutral | Prevent gender bias in emotion detection |
| Audio length | short (<10s), medium (10-30s), long (30s+) | Varied context window utilization |
| Tag density | none, light, medium, heavy (see table above) | Critical for preventing over/under-generation |
| Emotional valence | positive, negative, neutral | Prevent bias toward detecting only negative emotions |
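
A minimal sketch of turning these axes into a generation plan by enumerating the full grid; cells that are internally inconsistent (e.g., an "angry" emotion with "positive" valence) would be pruned before requesting scripts:

```python
from itertools import product

EMOTIONS = ["excited", "sad", "angry", "nervous", "calm", "frustrated"]
GENDERS = ["male", "female", "neutral"]
LENGTHS = ["short", "medium", "long"]
VALENCES = ["positive", "negative", "neutral"]

strata = [
    {"emotion": e, "gender": g, "length": l, "valence": v}
    for e, g, l, v in product(EMOTIONS, GENDERS, LENGTHS, VALENCES)
    # prune incompatible emotion/valence pairs here
]
```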

Speaker diversity from Latent Mixup for Speech Recognition:

  • Constrain pairings to match gender and dataset partition
  • Maintain distribution characteristics across splits

4. Catastrophic Forgetting Prevention

How Much Original-Task Data to Mix In

This is the most critical question for Evoxtral: how much plain ASR data to include so that adding emotion tag capability doesn't degrade word-level transcription quality.

Key finding from Apple's Scaling Laws for Forgetting with Pretraining Data Injection:

Injecting as little as 1% of pretraining data in the finetuning mixture prevents the model from forgetting the pretraining set.

However, more nuanced findings from Scaling Laws for Forgetting:

  • Forgetting follows a strong inverse linear relationship with fine-tuning loss
  • Forgetting increases as a shifted power law in both parameters finetuned and training steps
  • Forgetting cannot be avoided through early stopping or varying parameter counts
  • LoRA still suffers from forgetting, though less than full finetuning

Concrete replay buffer recommendations by task type:

| Task Type | Minimum Replay Buffer | Recommended Buffer | Source |
|---|---|---|---|
| NLU tasks (classification, NLI) | 1-2% | 5% | Empirical study on catastrophic forgetting |
| Math/Code tasks | 5-10% | 15-20% | Same study |
| Structured output (like tags) | ~10% | 25-35% | Extrapolated from mixed training results |

For Evoxtral specifically:

  • Since you're adding a new structural capability (tag generation) on top of an existing one (ASR), the risk is higher than simple domain adaptation
  • Recommended: 25-35% of training data should be plain ASR transcription (no tags)
  • This is supported by the mixed training study showing 1:1 ratio achieving equivalent base-task performance with only 0.7pp degradation

The "Tax" of Adding New Capabilities

From the math finetuning study:

  • Math-only training: Math accuracy went 3.1% -> 12.0%, but NLI dropped 81.0% -> 16.5% (catastrophic)
  • 1:1 mixed training: Math accuracy 12.0% (same!), NLI 86.2% (only 0.7pp drop from 86.9%)
  • Even 15:1 (93.8% new task): Original task maintained at 83.8% vs 86.9% baseline

Bottom line: With proper data mixing, the "tax" of adding tagged transcription capability should be less than 3% WER degradation on plain transcription tasks, likely under 1% with a 1:1 mix.

LoRA-Specific Forgetting Mitigation

LoRA inherently reduces forgetting compared to full finetuning because:

  • Fewer parameters are modified (lower rank = less forgetting)
  • Base weights remain frozen
  • The adapter can be merged or removed

However, the scaling laws paper found LoRA still exhibits forgetting that follows the same power law. The data mixing strategy remains essential even with LoRA.


5. Class Imbalance in Tag/Label Finetuning

The Problem for Evoxtral

Some tags will naturally be rarer than others:

  • [excited] and [laughs] likely appear frequently
  • [gasps], [stammers], [clears throat] are much rarer
  • [pause] and emphasis (CAPS) are potentially in every example

Balancing Strategies

Three main approaches from Class-Balanced Loss (CVPR 2019):

  1. Oversampling rare classes: Simple but risks overfitting to repeated examples
  2. Undersampling common classes: Loses valuable training signal
  3. Weighted loss: Reweight by effective number of samples -- best theoretical approach

Class-Balanced Loss formula:

```
weight_i = (1 - beta) / (1 - beta^n_i)
```

where n_i = number of samples for class i, and beta is typically 0.9, 0.99, or 0.999.
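
A minimal sketch of computing these weights from per-tag counts, normalized so they average to 1 across classes (the example counts are hypothetical):

```python
def class_balanced_weights(counts: dict[str, int], beta: float = 0.99) -> dict[str, float]:
    """weight_i = (1 - beta) / (1 - beta ** n_i), normalized to mean 1."""
    raw = {tag: (1 - beta) / (1 - beta ** n) for tag, n in counts.items()}
    scale = len(raw) / sum(raw.values())
    return {tag: w * scale for tag, w in raw.items()}

# A rare tag receives roughly 4x the weight of a frequent one here:
print(class_balanced_weights({"[laughs]": 400, "[gasps]": 25}))
```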

For generative models (Evoxtral's case), the HuggingFace forum discussion notes:

  • Weighted loss is harder to apply in token-level generation
  • Oversampling with variation is often more practical for generative models
  • Ensure rare tags appear in diverse contexts (different sentences, emotions, speakers)

Recommended Strategy for Evoxtral

Hybrid approach:

  1. Stratified generation: When creating synthetic training data, ensure minimum representation:
    • Each tag type should appear in at least 5-10% of tagged examples
    • Use the LLM script generator to specifically request rare tag scenarios
  2. Contextual oversampling: For rare tags, generate multiple variations:
    • [gasps] in surprise context, fear context, excitement context
    • [stammers] in nervous context, angry context, confused context
    • Aim for 3-5x oversampling of the rarest tags relative to the natural distribution
  3. Minimum tag frequency targets (an oversampling sketch follows the table below):

| Tag Category | Minimum % of Tagged Examples | Natural Frequency | Oversampling Factor |
|---|---|---|---|
| [excited], [sad], [angry] | 15-20% each | High | 1x (none) |
| [calm], [nervous], [frustrated] | 10-15% each | Medium | 1.5-2x |
| [laughs], [sighs] | 10-15% each | Medium-High | 1x |
| [gasps], [crying] | 8-12% each | Low | 2-3x |
| [whispers], [shouts] | 8-12% each | Low | 2-3x |
| [stammers], [clears throat] | 5-10% each | Very Low | 3-5x |
| [pause], CAPS emphasis | Present in 40-60% | Very High | 0.5x (undersample) |
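
A minimal sketch of applying the oversampling factors above; plain duplication is shown as a placeholder, but in practice each extra copy should be regenerated in a new context per the contextual oversampling point (the specific factors are picked from within the table's ranges, and `get_tags` is an assumed helper):

```python
import random

OVERSAMPLE = {
    "[gasps]": 3, "[crying]": 3, "[whispers]": 2, "[shouts]": 2,
    "[stammers]": 4, "[clears throat]": 4,
}

def oversample_rare(examples, get_tags, seed=0):
    rng = random.Random(seed)
    out = list(examples)
    for ex in examples:
        factor = max((OVERSAMPLE.get(t, 1) for t in get_tags(ex)), default=1)
        out.extend([ex] * (factor - 1))  # placeholder: regenerate with variation
    rng.shuffle(out)
    return out
```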

6. Concrete Recommendations for Evoxtral Training Data

Final Dataset Composition (for 800 total examples)

| Category | Count | Percentage | Description |
|---|---|---|---|
| Heavily tagged | 80-120 | 10-15% | 6+ tags, dramatic/expressive audio |
| Medium tagged | 200-240 | 25-30% | 3-5 tags, moderate emotion |
| Lightly tagged | 200-240 | 25-30% | 1-2 tags, subtle emotion |
| Plain transcription | 240-280 | 30-35% | 0 tags, neutral delivery |

Quality Checklist for Training Data

  • Tag density varies naturally (not every sentence has a tag)
  • All 15+ target tags appear in at least 40-80 examples
  • Rare tags are oversampled 2-5x with diverse contexts
  • 30-35% of examples are plain transcription (anti-hallucination)
  • Speaker diversity: at least 6-8 distinct voices
  • Audio length varies (short, medium, long segments)
  • Emotional valence balanced (positive/negative/neutral)
  • ROUGE-L between any two examples < 0.7 (diversity check)
  • Tag positions vary within sentences (beginning, middle, end)
  • Some examples have closely spaced tags, others widely spaced
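
Several of these checks are easy to automate; a minimal sketch that audits the plain-transcription share and the per-tag example floor (the tag regex follows this document's bracket convention):

```python
import re
from collections import Counter

TAG_RE = re.compile(r"\[[a-z][a-z_ ]*\]")

def audit(transcripts: list[str]) -> None:
    plain = sum(1 for t in transcripts if not TAG_RE.search(t))
    print(f"plain examples: {plain / len(transcripts):.0%} (target 30-35%)")
    tag_counts = Counter(tag for t in transcripts for tag in TAG_RE.findall(t))
    for tag, n in tag_counts.most_common():
        flag = "  <-- below the 40-example floor" if n < 40 else ""
        print(f"{tag}: {n}{flag}")
```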

Training Configuration Notes

  • Epochs: 1-2 (more increases forgetting risk)
  • LoRA rank: Treat as hyperparameter; sweep [8, 16, 32, 64]
  • Learning rate: Conservative (1e-5 to 5e-5 range)
  • Label masking: Essential -- only compute loss on output tokens
  • Evaluation: Track both WER (plain transcription quality) AND tag F1 simultaneously
  • Early stopping: Monitor WER on a held-out plain transcription set; stop if it degrades >2%
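
A hypothetical starting configuration reflecting these notes, using the `peft` library; the `target_modules` list is a placeholder and must be checked against Voxtral-Mini-3B's actual module names:

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                     # sweep [8, 16, 32, 64]
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
    task_type="CAUSAL_LM",
)
# Pair with a learning rate in the 1e-5 to 5e-5 range, 1-2 epochs, label
# masking as above, and an eval loop tracking WER alongside tag F1.
```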

Sources