# Training Data Composition & Balancing for Evoxtral Finetuning

Research summary on best practices for training data composition when finetuning LLMs,
applied to the Evoxtral use case (LoRA finetuning Voxtral-Mini-3B to produce tagged transcriptions).

---

## 1. Data Mixing Strategies for SFT/LoRA Finetuning

### Optimal Ratios of Tagged vs Plain Data

The single most important finding across the literature: **always include plain/untagged examples
in your training mix.** Training exclusively on tagged transcriptions will cause the model to
hallucinate tags everywhere and degrade base transcription quality.

**Concrete ratios from research:**

| Mix Ratio (Task:Original) | Source | Result |
|---------------------------|--------|--------|
| 1:1 (50% new, 50% original) | [Mixed Training for Math Reasoning](https://arxiv.org/html/2512.13706) | Best balance -- full new-task performance with only 0.7pp original-task degradation |
| 3:1 (75% new, 25% original) | Same study | New-task performance maintained, original task drops ~1.4pp |
| 7:1 (87.5% new, 12.5% original) | Same study | Still effective, original task drops ~2.5pp |
| 15:1 (93.8% new, 6.2% original) | Same study | Minimum viable -- original task drops ~3.2pp but still far better than 0% |

**For Evoxtral specifically:** With a target of 500-1000 tagged training pairs, aim for:
- **60-70% tagged transcriptions** (emotion tags, non-verbal markers, delivery cues)
- **30-40% plain transcriptions** (standard ASR output, no tags at all)

This ratio prevents the model from learning "always add tags" and preserves base transcription quality.
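
As a minimal sketch of assembling such a mix, the snippet below combines tagged and plain examples at a target fraction; the 65% default, the JSONL layout, and the field names are illustrative assumptions rather than part of the Evoxtral pipeline:

```python
import json
import random

def build_mixture(tagged_path: str, plain_path: str,
                  tagged_frac: float = 0.65, seed: int = 0) -> list[dict]:
    """Combine tagged and plain transcription examples at a target ratio."""
    rng = random.Random(seed)
    with open(tagged_path) as f:
        tagged = [json.loads(line) for line in f]   # {"audio": ..., "text": "... [laughs] ..."}
    with open(plain_path) as f:
        plain = [json.loads(line) for line in f]    # {"audio": ..., "text": "..."} with no tags

    # Size the plain pool so tagged examples end up at ~tagged_frac of the mix.
    n_plain = int(len(tagged) * (1 - tagged_frac) / tagged_frac)
    mix = tagged + rng.sample(plain, min(n_plain, len(plain)))
    rng.shuffle(mix)
    return mix
```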

### Preventing Tag Hallucination

Research on preventing hallucination during finetuning is directly applicable to preventing
over-generation of audio tags.

**Key findings from [The Hallucination Tax of Reinforcement Finetuning](https://arxiv.org/html/2505.13988):**
- Standard finetuning can reduce refusal rates by >80%, meaning models become overconfident
- Tested mixing ratios of 0%, 1%, 10%, 30%, 50% "unanswerable" (negative) examples
- **10% negative examples was the optimal ratio** -- restored appropriate refusal behavior while maintaining task accuracy
- Higher ratios (30-50%) degraded performance on the primary task

**Applied to Evoxtral:** Include ~10-15% of training examples where the audio is emotionally
neutral/flat but the ground truth has NO tags (just plain text). This teaches the model that
not every utterance needs tags.

**Additional anti-hallucination strategies:**
- Train on "familiar, low-perplexity data" -- using high-perplexity examples increases hallucination ([Unfamiliar Finetuning Examples](https://arxiv.org/html/2403.05612v1))
- Balance positive examples (tags warranted) against negative examples (no tags warranted), as in [Robust Instruction Tuning](https://arxiv.org/abs/2306.14565)
- Ensure tag density varies naturally across training examples (some heavily tagged, some sparse)

### The "Cocktail Effect" in Data Mixing

Research on [Data Mixing Optimization for SFT](https://arxiv.org/html/2508.11953v1) found a
"cocktail effect": diverse training data outperforms single-domain approaches. For domain-specific
models, including general instruction data alongside specialized content improved results. A medical
chatbot achieved best performance with **67.7% general data (Alpaca-GPT4) and 32.3% domain data
(PubMedQA).**

**For Evoxtral:** Don't just train on tagged transcriptions. Consider including:
- General ASR examples (plain transcription)
- Diverse audio conditions (clean, noisy, different speakers)
- Various text styles and lengths

---

## 2. Balanced Dataset Design for Structured Output Tasks

### Teaching When NOT to Apply Tags

This is a critical and under-researched area. The SSML annotation literature provides the closest parallels.

**From [SSML Prosody Control Research](https://arxiv.org/html/2508.17494v1):**
- Models consistently **under-generate** tags when not enough tagged examples exist
- But **over-generate** when training is tag-heavy
- The solution: systematic variation in tag density across training examples

**Recommended tag density distribution for Evoxtral training data:**

| Tag Density | % of Dataset | Description |
|-------------|-------------|-------------|
| None (0 tags) | 25-35% | Plain transcription, emotionally neutral audio |
| Light (1-2 tags) | 25-30% | Subtle emotion, single non-verbal |
| Medium (3-5 tags) | 25-30% | Multiple emotions, mixed delivery |
| Heavy (6+ tags) | 10-15% | Highly expressive, dramatic audio |
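
To check an assembled dataset against this target distribution, examples can be binned by tag count; a small sketch (the square-bracket tag syntax and bin edges mirror the table above, everything else is an assumption):

```python
import re
from collections import Counter

TAG_RE = re.compile(r"\[[^\]]+\]")  # matches square-bracket tags such as [laughs]

def density_bucket(text: str) -> str:
    n = len(TAG_RE.findall(text))
    if n == 0:
        return "none"
    if n <= 2:
        return "light"
    if n <= 5:
        return "medium"
    return "heavy"

def density_report(texts: list[str]) -> dict[str, float]:
    """Fraction of examples per density bucket, for comparison with the targets above."""
    counts = Counter(density_bucket(t) for t in texts)
    total = max(len(texts), 1)
    return {bucket: counts[bucket] / total for bucket in ("none", "light", "medium", "heavy")}
```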

### Structured Output Quality

From [Databricks End-to-End Structured Extraction](https://community.databricks.com/t5/technical-blog/end-to-end-structured-extraction-with-llm-part-2-fine-tuning/ba-p/99900):
- Training data should be "structured, token-balanced, and metadata-tagged"
- For tagged output tasks, ensure the tokenizer properly handles your tag vocabulary
- Label masking (computing loss only on output tokens) is essential -- Evoxtral already plans this
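
A minimal label-masking sketch in the HuggingFace/PyTorch convention, where tokens labeled -100 are ignored by the cross-entropy loss; how the prompt/response boundary is located for Voxtral inputs is an assumption left to the actual collator:

```python
import torch

IGNORE_INDEX = -100  # standard ignore index for transformers/PyTorch cross-entropy

def mask_prompt_labels(input_ids: torch.Tensor, prompt_len: int) -> torch.Tensor:
    """Build labels so that only the response (tagged transcription) tokens
    contribute to the loss; the audio/prompt prefix is masked out."""
    labels = input_ids.clone()
    labels[:prompt_len] = IGNORE_INDEX
    return labels
```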

---

## 3. Synthetic Data Quality and Diversity

### Best Practices from Research

**Quality filtering ([Eugene Yan's comprehensive guide](https://eugeneyan.com/writing/synthetic/)):**
- Use a **ROUGE-L < 0.7** threshold against existing examples to ensure diversity (Self-Instruct method; see the sketch after this list)
- Remove impossible instructions (e.g., referencing images for text-only models)
- Apply validation scoring: chain-of-thought + 5-point scale, average 3 scores per response
- **Even when only 54% of synthetic samples had completely valid fields, training on them still improved performance by 33%** -- moderate imperfection is workable
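
A sketch of the ROUGE-L diversity filter flagged above, assuming the `rouge_score` package; the greedy pairwise check follows Self-Instruct's approach and is O(n^2), which is acceptable at the 500-1000 example scale:

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def diverse_subset(candidates: list[str], threshold: float = 0.7) -> list[str]:
    """Greedily keep candidates whose ROUGE-L F1 against every kept example
    stays below the threshold (Self-Instruct-style diversity filter)."""
    kept: list[str] = []
    for text in candidates:
        if all(scorer.score(prev, text)["rougeL"].fmeasure < threshold for prev in kept):
            kept.append(text)
    return kept
```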

**Diversity strategies:**
- **Iterative sampling**: Start with 8 seed examples, progressively incorporate generated ones
- **Template expansion**: Create 2+ alternative formulations for each task
- **Attribute conditioning**: Vary all controllable attributes systematically
- **Style variation**: Generate multiple styles (e.g., WRAP paper used easy/medium/hard/Q&A formats, achieving 3x training speedup with 1:1 real-to-synthetic ratio)

### Synthetic Data for Speech/Audio Tasks

**From [Optimized Synthetic Data for ASR](https://arxiv.org/html/2508.21631v1):**
- Cyclically iterate over speakers without replacement to maximize speaker diversity
- TTS and voice conversion systems are viable for ASR data augmentation
- Synthetic data lacks diversity in pitch, speed, and background noise compared to authentic audio

**From [Synthio Audio Classification](https://arxiv.org/html/2410.02056v1):**
- Generating synthetic audio that stays consistent with a small-scale version of the target dataset, while still diverse, significantly improves performance
- Data augmentations for acoustic diversity boost out-of-distribution generalization

### Stratified Sampling for Evoxtral

**Recommended stratification axes for the training dataset:**

| Axis | Categories | Rationale |
|------|-----------|-----------|
| Emotion type | excited, sad, angry, nervous, calm, frustrated | Balanced representation of all target emotions |
| Non-verbal sounds | laughs, sighs, gasps, clears throat, crying | Each sound type needs adequate coverage |
| Speaker gender | male, female, neutral | Prevent gender bias in emotion detection |
| Audio length | short (<10s), medium (10-30s), long (30s+) | Varied context window utilization |
| Tag density | none, light, medium, heavy (see table above) | Critical for preventing over/under-generation |
| Emotional valence | positive, negative, neutral | Prevent bias toward detecting only negative emotions |
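
One way to put these axes into practice is a stratified generation plan that cycles through the full cross-product of axis values before repeating any combination; the axis values below come from the table, while the round-robin scheme and field names are assumptions:

```python
import itertools
import random

# Axis values taken from the stratification table above.
EMOTIONS = ["excited", "sad", "angry", "nervous", "calm", "frustrated"]
GENDERS = ["male", "female", "neutral"]
LENGTHS = ["short", "medium", "long"]
DENSITIES = ["none", "light", "medium", "heavy"]

def generation_plan(n_examples: int, seed: int = 0) -> list[dict]:
    """Assign each synthetic example a cell of the stratification grid,
    covering every combination before any combination repeats."""
    cells = list(itertools.product(EMOTIONS, GENDERS, LENGTHS, DENSITIES))
    random.Random(seed).shuffle(cells)
    return [
        dict(zip(("emotion", "gender", "length", "density"), cells[i % len(cells)]))
        for i in range(n_examples)
    ]
```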

**Speaker diversity from [Latent Mixup for Speech Recognition](https://arxiv.org/html/2511.20534):**
- Constrain pairings to match gender and dataset partition
- Maintain distribution characteristics across splits

---

## 4. Catastrophic Forgetting Prevention

### How Much Original-Task Data to Mix In

This is the most critical question for Evoxtral: how much plain ASR data to include
so that adding emotion tag capability doesn't degrade word-level transcription quality.

**Key finding from [Apple's Scaling Laws for Forgetting with Pretraining Data Injection](https://machinelearning.apple.com/research/scaling-laws):**
> Injecting as little as **1% of pretraining data** in the finetuning mixture prevents the model
> from forgetting the pretraining set.

**However, more nuanced findings from [Scaling Laws for Forgetting](https://arxiv.org/html/2401.05605v1):**
- Forgetting follows a **strong inverse linear relationship** with fine-tuning loss
- Forgetting increases as a **shifted power law** in both parameters finetuned and training steps
- Forgetting **cannot be avoided through early stopping** or varying parameter counts
- LoRA still suffers from forgetting, though less than full finetuning

**Concrete replay buffer recommendations by task type:**

| Task Type | Minimum Replay Buffer | Recommended Buffer | Source |
|-----------|----------------------|-------------------|--------|
| NLU tasks (classification, NLI) | 1-2% | 5% | Empirical study on catastrophic forgetting |
| Math/Code tasks | 5-10% | 15-20% | Same study |
| Structured output (like tags) | ~10% | 25-35% | Extrapolated from mixed training results |

**For Evoxtral specifically:**
- Since you're adding a **new structural capability** (tag generation) on top of an existing one (ASR), the risk is higher than simple domain adaptation
- **Recommended: 25-35% of training data should be plain ASR transcription** (no tags)
- This is supported by the mixed training study, where a 1:1 ratio preserved base-task performance with only 0.7pp degradation

### The "Tax" of Adding New Capabilities

From the math finetuning study:
- **Math-only training**: Math accuracy went 3.1% -> 12.0%, but NLI dropped 81.0% -> 16.5% (catastrophic)
- **1:1 mixed training**: Math accuracy 12.0% (same!), NLI 86.2% (only 0.7pp drop from 86.9%)
- **Even 15:1 (93.8% new task)**: Original task maintained at 83.8% vs 86.9% baseline

**Bottom line**: With proper data mixing, the "tax" of adding tagged transcription capability
should be **less than 3% WER degradation** on plain transcription tasks, likely under 1% with
a 1:1 mix.

### LoRA-Specific Forgetting Mitigation

LoRA inherently reduces forgetting compared to full finetuning because:
- Fewer parameters are modified (lower rank = less forgetting)
- Base weights remain frozen
- The adapter can be merged or removed

However, the [scaling laws paper](https://arxiv.org/html/2401.05605v1) found LoRA still
exhibits forgetting that follows the same power law. The data mixing strategy remains essential
even with LoRA.
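
A minimal PEFT configuration sketch reflecting the lower-rank/less-forgetting trade-off; the target module names are assumptions and should be checked against Voxtral-Mini's actual layer naming:

```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,                                  # lower rank touches less capacity -> less forgetting
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # assumed attention projection names
    task_type="CAUSAL_LM",
)
# model = get_peft_model(base_model, lora_config)  # base_model loaded elsewhere
```

Because the base weights stay frozen, the adapter can also be dropped to recover the original model if tag quality and WER trade off badly.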

---

## 5. Class Imbalance in Tag/Label Finetuning

### The Problem for Evoxtral

Some tags will naturally be rarer than others:
- `[excited]` and `[laughs]` likely appear frequently
- `[gasps]`, `[stammers]`, `[clears throat]` are much rarer
- `[pause]` and emphasis (CAPS) are potentially in every example

### Balancing Strategies

**Three main approaches from [Class-Balanced Loss (CVPR 2019)](https://openaccess.thecvf.com/content_CVPR_2019/papers/Cui_Class-Balanced_Loss_Based_on_Effective_Number_of_Samples_CVPR_2019_paper.pdf):**

1. **Oversampling rare classes**: Simple but risks overfitting to repeated examples
2. **Undersampling common classes**: Loses valuable training signal
3. **Weighted loss**: Reweight by effective number of samples -- best theoretical approach

**Class-Balanced Loss formula:**
```
weight_i = (1 - beta) / (1 - beta^n_i)
```
where `n_i` = number of samples for class `i`, and `beta` is typically 0.9, 0.99, or 0.999.
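
A sketch of computing these weights per tag; the counts are illustrative, and rescaling the weights to a mean of 1 is a common convention so the overall loss scale stays comparable:

```python
def class_balanced_weights(counts: dict[str, int], beta: float = 0.999) -> dict[str, float]:
    """weight_i = (1 - beta) / (1 - beta ** n_i), rescaled so the weights
    average to 1 and the overall loss scale stays comparable."""
    raw = {tag: (1 - beta) / (1 - beta ** n) for tag, n in counts.items()}
    scale = len(raw) / sum(raw.values())
    return {tag: w * scale for tag, w in raw.items()}

# Illustrative counts -- real values come from the assembled dataset.
weights = class_balanced_weights({"[laughs]": 320, "[sighs]": 180, "[gasps]": 40})
```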

**For generative models (Evoxtral's case), the [HuggingFace forum discussion](https://discuss.huggingface.co/t/handling-class-imbalance-when-finetuning-a-decoder-model-on-text-generation/173010) notes:**
- Weighted loss is harder to apply in token-level generation
- **Oversampling with variation is often more practical** for generative models
- Ensure rare tags appear in diverse contexts (different sentences, emotions, speakers)

### Recommended Strategy for Evoxtral

**Hybrid approach:**

1. **Stratified generation**: When creating synthetic training data, ensure minimum representation:
   - Each tag type should appear in at least 5-10% of tagged examples
   - Use the LLM script generator to specifically request rare tag scenarios

2. **Contextual oversampling**: For rare tags, generate multiple variations:
   - `[gasps]` in surprise context, fear context, excitement context
   - `[stammers]` in nervous context, angry context, confused context
   - Aim for 3-5x oversampling of the rarest tags relative to natural distribution

3. **Minimum tag frequency targets:**

| Tag Category | Minimum % of Tagged Examples | Natural Frequency | Oversampling Factor |
|-------------|-----------------------------|--------------------|---------------------|
| [excited], [sad], [angry] | 15-20% each | High | 1x (none) |
| [calm], [nervous], [frustrated] | 10-15% each | Medium | 1.5-2x |
| [laughs], [sighs] | 10-15% each | Medium-High | 1x |
| [gasps], [crying] | 8-12% each | Low | 2-3x |
| [whispers], [shouts] | 8-12% each | Low | 2-3x |
| [stammers], [clears throat] | 5-10% each | Very Low | 3-5x |
| [pause], CAPS emphasis | Present in 40-60% | Very High | 0.5x (undersample) |
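
A sketch of applying these factors when assembling the tagged portion of the dataset; the factor values are midpoints of the ranges in the table, and verbatim duplication is only a stand-in for re-generating the rare-tag scenario with varied wording:

```python
import random
import re

# Midpoints of the oversampling ranges in the table above.
OVERSAMPLE = {
    "[gasps]": 2.5, "[crying]": 2.5, "[whispers]": 2.5, "[shouts]": 2.5,
    "[stammers]": 4.0, "[clears throat]": 4.0,
}
TAG_RE = re.compile(r"\[[^\]]+\]")

def oversample_rare(examples: list[str], seed: int = 0) -> list[str]:
    """Repeat examples containing rare tags according to their oversampling
    factor; ideally each repeat is re-generated in a new context instead."""
    rng = random.Random(seed)
    out = list(examples)
    for text in examples:
        factors = [OVERSAMPLE[t] for t in TAG_RE.findall(text) if t in OVERSAMPLE]
        if factors:
            out.extend([text] * (int(round(max(factors))) - 1))
    rng.shuffle(out)
    return out
```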

---

## 6. Concrete Recommendations for Evoxtral Training Data

### Final Dataset Composition (for 800 total examples)

| Category | Count | Percentage | Description |
|----------|-------|------------|-------------|
| Heavily tagged | 80-120 | 10-15% | 6+ tags, dramatic/expressive audio |
| Medium tagged | 200-240 | 25-30% | 3-5 tags, moderate emotion |
| Lightly tagged | 200-240 | 25-30% | 1-2 tags, subtle emotion |
| Plain transcription | 240-280 | 30-35% | 0 tags, neutral delivery |

### Quality Checklist for Training Data

- [ ] Tag density varies naturally (not every sentence has a tag)
- [ ] All 15+ target tags appear in at least 40-80 examples
- [ ] Rare tags are oversampled 2-5x with diverse contexts
- [ ] 30-35% of examples are plain transcription (anti-hallucination)
- [ ] Speaker diversity: at least 6-8 distinct voices
- [ ] Audio length varies (short, medium, long segments)
- [ ] Emotional valence balanced (positive/negative/neutral)
- [ ] ROUGE-L between any two examples < 0.7 (diversity check)
- [ ] Tag positions vary within sentences (beginning, middle, end)
- [ ] Some examples have closely spaced tags, others widely spaced

### Training Configuration Notes

- **Epochs**: 1-2 (more increases forgetting risk)
- **LoRA rank**: Treat as hyperparameter; sweep [8, 16, 32, 64]
- **Learning rate**: Conservative (1e-5 to 5e-5 range)
- **Label masking**: Essential -- only compute loss on output tokens
- **Evaluation**: Track both WER (plain transcription quality) AND tag F1 simultaneously (see the sketch after this list)
- **Early stopping**: Monitor WER on a held-out plain transcription set; stop if it degrades >2%
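
A sketch of the dual-metric evaluation mentioned above; it assumes the `jiwer` package for WER and defines tag F1 as a per-example multiset match over square-bracket tags, which is one reasonable definition rather than a fixed standard:

```python
import re
from collections import Counter

import jiwer

TAG_RE = re.compile(r"\[[^\]]+\]")

def strip_tags(text: str) -> str:
    """Remove tags so WER reflects only word-level transcription quality."""
    return " ".join(TAG_RE.sub(" ", text).split())

def tag_f1(reference: str, hypothesis: str) -> float:
    ref, hyp = Counter(TAG_RE.findall(reference)), Counter(TAG_RE.findall(hypothesis))
    if not ref and not hyp:
        return 1.0  # no tags expected, none produced
    true_pos = sum((ref & hyp).values())
    if true_pos == 0:
        return 0.0
    precision, recall = true_pos / sum(hyp.values()), true_pos / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def evaluate(refs: list[str], hyps: list[str]) -> dict[str, float]:
    """Word error rate on de-tagged text plus average per-example tag F1."""
    wer = jiwer.wer([strip_tags(r) for r in refs], [strip_tags(h) for h in hyps])
    f1 = sum(tag_f1(r, h) for r, h in zip(refs, hyps)) / max(len(refs), 1)
    return {"wer": wer, "tag_f1": f1}
```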

---

## Sources

- [Practical Tips for Finetuning LLMs Using LoRA](https://magazine.sebastianraschka.com/p/practical-tips-for-finetuning-llms) - Sebastian Raschka
- [The Hallucination Tax of Reinforcement Finetuning](https://arxiv.org/html/2505.13988) - Negative example ratios
- [Unfamiliar Finetuning Examples Control How LLMs Hallucinate](https://arxiv.org/html/2403.05612v1)
- [Mitigating Catastrophic Forgetting via Mixed Training](https://arxiv.org/html/2512.13706) - Data replay ratios
- [Scaling Laws for Forgetting When Fine-Tuning LLMs](https://arxiv.org/html/2401.05605v1) - Power law relationships
- [Scaling Laws for Forgetting with Pretraining Data Injection](https://machinelearning.apple.com/research/scaling-laws) - Apple, 1% replay finding
- [Data Mixing Optimization for SFT](https://arxiv.org/html/2508.11953v1) - Cocktail effect
- [How to Generate and Use Synthetic Data for Finetuning](https://eugeneyan.com/writing/synthetic/) - Eugene Yan
- [On the Diversity of Synthetic Data](https://arxiv.org/html/2410.15226v2)
- [Data Diversity Matters for Robust Instruction Tuning](https://aclanthology.org/2024.findings-emnlp.195.pdf)
- [Class-Balanced Loss Based on Effective Number of Samples](https://openaccess.thecvf.com/content_CVPR_2019/papers/Cui_Class-Balanced_Loss_Based_on_Effective_Number_of_Samples_CVPR_2019_paper.pdf) - CVPR 2019
- [Improving French Synthetic Speech Quality via SSML](https://arxiv.org/html/2508.17494v1)
- [Towards Improved Speech Recognition through Synthetic Data](https://arxiv.org/html/2508.21631v1)
- [Efficient Fine-Tuning with LoRA Guide](https://www.databricks.com/blog/efficient-fine-tuning-lora-guide-llms) - Databricks
- [How to Fine-Tune: Focus on Effective Datasets](https://ai.meta.com/blog/how-to-fine-tune-llms-peft-dataset-curation/) - Meta
- [Extrinsic Hallucinations in LLMs](https://lilianweng.github.io/posts/2024-07-07-hallucination/) - Lilian Weng