---
library_name: peft
base_model: mistralai/Voxtral-Mini-3B-2507
tags:
  - voxtral
  - lora
  - speech-recognition
  - expressive-transcription
  - audio
  - mistral
  - hackathon
  - rl
  - raft
datasets:
  - custom
language:
  - en
license: apache-2.0
pipeline_tag: automatic-speech-recognition
---

# Evoxtral LoRA — Expressive Tagged Transcription

A LoRA adapter for [Voxtral-Mini-3B-2507](https://huggingface.co/mistralai/Voxtral-Mini-3B-2507) that produces transcriptions enriched with inline expressive audio tags from the [ElevenLabs v3 tag set](https://elevenlabs.io/docs/api-reference/text-to-speech).

Built for the **Mistral AI Online Hackathon 2026** (W&B Fine-Tuning Track).

**Two model variants available:**
- **[Evoxtral SFT](https://huggingface.co/YongkangZOU/evoxtral-lora)** — Best overall transcription accuracy (lowest WER)
- **[Evoxtral RL](https://huggingface.co/YongkangZOU/evoxtral-rl)** — Best expressive tag accuracy (highest Tag F1)

## What It Does

Standard ASR:
> So I was thinking maybe we could try that new restaurant downtown. I mean if you're free this weekend.

Evoxtral:
> [nervous] So... [stammers] I was thinking maybe we could... [clears throat] try that new restaurant downtown? [laughs nervously] I mean, if you're free this weekend?

## Training Pipeline

```
Base Voxtral-Mini-3B → SFT (LoRA, 3 epochs) → RL (RAFT, 1 epoch)
```

1. **SFT**: LoRA finetuning on 808 synthetic audio samples with expressive tags (lr=2e-4, 3 epochs)
2. **RL (RAFT)**: Rejection sampling — generate 4 completions per sample, score each with a rule-based reward (WER accuracy + Tag F1 - hallucination penalty; sketched below), keep the best, then run SFT on the curated data (lr=5e-5, 1 epoch)

This follows the approach from [GRPO for Speech Recognition](https://arxiv.org/abs/2509.01939) and Voxtral's own SFT→DPO training recipe.
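
For concreteness, here is a minimal sketch of the rule-based reward from step 2, using the 0.4/0.4/0.2 weights listed in the RL Stage table below. The tag regex, the `jiwer` dependency, and the exact hallucination definition are illustrative assumptions, not the project's training code.

```python
import re
from collections import Counter

import jiwer  # pip install jiwer

TAG_RE = re.compile(r"\[[^\]]+\]")  # inline tags like [laughs], [clears throat]

def extract_tags(text: str) -> Counter:
    """Collect expressive tags as a multiset."""
    return Counter(t.lower() for t in TAG_RE.findall(text))

def strip_tags(text: str) -> str:
    """Drop tags so WER is scored on the words alone."""
    return " ".join(TAG_RE.sub(" ", text).split())

def reward(reference: str, hypothesis: str) -> float:
    # WER term, clipped so the reward stays in [0, 1]
    wer = min(jiwer.wer(strip_tags(reference), strip_tags(hypothesis)), 1.0)

    ref_tags, hyp_tags = extract_tags(reference), extract_tags(hypothesis)
    if not ref_tags and not hyp_tags:
        tag_f1, hallucination = 1.0, 0.0
    else:
        overlap = sum((ref_tags & hyp_tags).values())
        precision = overlap / max(sum(hyp_tags.values()), 1)
        recall = overlap / max(sum(ref_tags.values()), 1)
        tag_f1 = 2 * precision * recall / max(precision + recall, 1e-8)
        # share of predicted tags that never appear in the reference
        hallucination = 1.0 - precision if hyp_tags else 0.0

    return 0.4 * (1.0 - wer) + 0.4 * tag_f1 + 0.2 * (1.0 - hallucination)
```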

## Evaluation Results

Evaluated on 50 held-out test samples. Full benchmark (Evoxtral-Bench) with 7 metrics:

### Core Metrics — Base vs SFT vs RL

| Metric | Base Voxtral | Evoxtral SFT | Evoxtral RL | Best |
|--------|-------------|-------------|------------|------|
| **WER** | 6.64% | **4.47%** | 5.12% | SFT |
| **CER** | 2.72% | **1.23%** | 1.48% | SFT |
| **Tag F1** | 22.0% | 67.2% | **69.4%** | RL |
| **Tag Precision** | 22.0% | 67.4% | **68.5%** | RL |
| **Tag Recall** | 22.0% | 69.4% | **72.7%** | RL |
| **Emphasis F1** | 42.0% | 84.0% | **86.0%** | RL |
| **Tag Hallucination** | 0.0% | **19.3%** | 20.2% | SFT |

**SFT** excels at raw transcription accuracy (best WER/CER). **RL** further improves expressive tag generation (+2.2 points Tag F1, +3.3 points Tag Recall, +2 points Emphasis F1) at a small cost in WER.

### Per-Tag F1 Breakdown (SFT → RL)

| Tag | SFT F1 | RL F1 | Change | Support |
|-----|--------|-------|--------|---------|
| `[sighs]` | 1.000 | **1.000** | — | 9 |
| `[clears throat]` | 0.889 | **1.000** | +12.5% | 8 |
| `[gasps]` | 0.957 | **0.957** | — | 12 |
| `[pause]` | 0.885 | **0.902** | +1.9% | 25 |
| `[nervous]` | 0.800 | **0.846** | +5.8% | 13 |
| `[stammers]` | 0.889 | 0.842 | -5.3% | 8 |
| `[laughs]` | 0.800 | **0.815** | +1.9% | 12 |
| `[sad]` | 0.667 | **0.750** | +12.4% | 4 |
| `[whispers]` | 0.636 | **0.667** | +4.9% | 13 |
| `[crying]` | 0.750 | 0.571 | -23.9% | 5 |
| `[excited]` | 0.615 | 0.571 | -7.2% | 5 |
| `[shouts]` | 0.400 | **0.500** | +25.0% | 3 |
| `[calm]` | 0.200 | **0.400** | +100% | 6 |
| `[frustrated]` | 0.444 | 0.444 | — | 3 |
| `[angry]` | 0.667 | 0.667 | — | 2 |
| `[confused]` | 0.000 | 0.000 | — | 1 |
| `[scared]` | 0.000 | 0.000 | — | 1 |

RL improved 8 tags, kept 6 stable, and regressed 3. The biggest gains were on `[calm]` (+100%), `[shouts]` (+25%), `[clears throat]` (+12.5%), and `[sad]` (+12.4%).

## Training Details

### SFT Stage

| Parameter | Value |
|-----------|-------|
| Base model | `mistralai/Voxtral-Mini-3B-2507` |
| Method | LoRA (PEFT) |
| LoRA rank | 64 |
| LoRA alpha | 128 |
| LoRA dropout | 0.05 |
| Target modules | q/k/v/o_proj, gate/up/down_proj, multi_modal_projector |
| Learning rate | 2e-4 |
| Scheduler | Cosine |
| Epochs | 3 |
| Batch size | 2 (effective 16 with grad accum 8) |
| NEFTune noise alpha | 5.0 |
| Precision | bf16 |
| GPU | NVIDIA A10G (24GB) |
| Training time | ~25 minutes |
| Trainable params | 124.8M / 4.8B (2.6%) |
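
For reference, a PEFT configuration matching the hyperparameters in this table might look like the sketch below. This is not the exact training script; in particular, depending on your PEFT version, the projector entry in `target_modules` may need to name the projector's inner linear layers rather than the wrapper module.

```python
import torch
from transformers import VoxtralForConditionalGeneration
from peft import LoraConfig, get_peft_model

base_model = VoxtralForConditionalGeneration.from_pretrained(
    "mistralai/Voxtral-Mini-3B-2507", dtype=torch.bfloat16, device_map="auto"
)

lora_config = LoraConfig(
    r=64,
    lora_alpha=128,          # alpha / r = 2.0 scaling
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention
        "gate_proj", "up_proj", "down_proj",     # MLP
        "multi_modal_projector",                 # audio-to-text projector (see note above)
    ],
)

peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()  # should report roughly 2.6% trainable
```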

### RL Stage (RAFT)

| Parameter | Value |
|-----------|-------|
| Method | Rejection sampling + SFT (RAFT) |
| Samples per input | 4 (temperature=0.7, top_p=0.9) |
| Reward function | 0.4×(1-WER) + 0.4×Tag_F1 + 0.2×(1-hallucination) |
| Curated samples | 727 (bottom 10% filtered, reward > 0.954) |
| Avg reward | 0.980 |
| Learning rate | 5e-5 |
| Epochs | 1 |
| Final loss | 0.021 |
| Training time | ~7 minutes |
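
A hypothetical curation loop for the rejection-sampling step is sketched below, mirroring the settings in this table. It reuses the `reward` function sketched in the Training Pipeline section and assumes a model/processor loaded as in the Usage section; it is not the project's training code.

```python
import torch

def curate(model, processor, samples, k=4, threshold=0.954):
    """Keep the best-of-k sampled transcription per clip; drop low-reward clips."""
    curated = []
    for audio_path, reference in samples:  # samples: list of (wav path, gold transcript)
        inputs = processor.apply_transcription_request(
            language="en", audio=[audio_path], format=["WAV"],
            model_id="mistralai/Voxtral-Mini-3B-2507", return_tensors="pt",
        ).to(model.device, dtype=torch.bfloat16)

        outs = model.generate(
            **inputs, max_new_tokens=512,
            do_sample=True, temperature=0.7, top_p=0.9,
            num_return_sequences=k,
        )
        texts = processor.batch_decode(
            outs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
        )

        best_score, best_text = max((reward(reference, t), t) for t in texts)
        if best_score > threshold:  # filters roughly the bottom 10%
            curated.append((audio_path, best_text))
    return curated
```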

## Dataset

Custom synthetic dataset of 1,010 audio samples generated with ElevenLabs TTS v3:
- **808** train / **101** validation / **101** test
- Each sample has audio + tagged transcription with inline ElevenLabs v3 expressive tags
- Tags include: `[sighs]`, `[laughs]`, `[whispers]`, `[nervous]`, `[frustrated]`, `[clears throat]`, `[pause]`, `[excited]`, and more
- Audio encoder (Whisper-based) was frozen during training

## Usage

```python
import torch
from transformers import VoxtralForConditionalGeneration, AutoProcessor
from peft import PeftModel

repo_id = "mistralai/Voxtral-Mini-3B-2507"
# Use "YongkangZOU/evoxtral-lora" for SFT or "YongkangZOU/evoxtral-rl" for RL
adapter_id = "YongkangZOU/evoxtral-rl"

processor = AutoProcessor.from_pretrained(repo_id)
base_model = VoxtralForConditionalGeneration.from_pretrained(
    repo_id, dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base_model, adapter_id)

# Transcribe audio with expressive tags
inputs = processor.apply_transcription_request(
    language="en",
    audio=["path/to/audio.wav"],
    format=["WAV"],
    model_id=repo_id,
    return_tensors="pt",
)
inputs = inputs.to(model.device, dtype=torch.bfloat16)

outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
transcription = processor.batch_decode(
    outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(transcription)
# [nervous] So... I was thinking maybe we could [clears throat] try that new restaurant downtown?
```
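
For adapter-free inference (e.g. before export or serving), the LoRA weights can be folded into the base model with the standard PEFT call:

```python
# Merge the LoRA deltas into the base weights; returns a plain
# VoxtralForConditionalGeneration with no PEFT wrapper.
model = model.merge_and_unload()
```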

## API

A serverless API with Swagger UI is available on Modal:

```bash
curl -X POST https://yongkang-zou1999--evoxtral-api-evoxtralmodel-web.modal.run/transcribe \
    -F "file=@audio.wav"
```
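
The same request from Python (the JSON response shape is an assumption; check the Swagger UI below for the authoritative schema):

```python
import requests

url = "https://yongkang-zou1999--evoxtral-api-evoxtralmodel-web.modal.run/transcribe"
with open("audio.wav", "rb") as f:
    resp = requests.post(url, files={"file": ("audio.wav", f, "audio/wav")})
resp.raise_for_status()
print(resp.json())
```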

- [Swagger UI](https://yongkang-zou1999--evoxtral-api-evoxtralmodel-web.modal.run/docs)
- [Live Demo (HF Space)](https://huggingface.co/spaces/YongkangZOU/evoxtral)

## W&B Tracking

All training and evaluation runs are tracked on Weights & Biases:
- [SFT Training](https://wandb.ai/yongkang-zou-ai/evoxtral/runs/t8ak7a20)
- [RL Training (RAFT)](https://wandb.ai/yongkang-zou-ai/evoxtral)
- [Base model eval](https://wandb.ai/yongkang-zou-ai/evoxtral/runs/bvqa4ioo)
- [SFT model eval](https://wandb.ai/yongkang-zou-ai/evoxtral/runs/ayx4ldyd)
- [RL model eval](https://wandb.ai/yongkang-zou-ai/evoxtral)
- [Project dashboard](https://wandb.ai/yongkang-zou-ai/evoxtral)

## Supported Tags

The model can produce any tag from the ElevenLabs v3 expressive tag set, including:

`[laughs]` `[sighs]` `[gasps]` `[clears throat]` `[whispers]` `[sniffs]` `[pause]` `[nervous]` `[frustrated]` `[excited]` `[sad]` `[angry]` `[calm]` `[stammers]` `[yawns]` and more.

## Limitations

- Trained on synthetic (TTS-generated) audio, not natural speech recordings
- ~20% tag hallucination rate — model occasionally predicts tags not in the reference
- Rare/subtle tags (`[calm]`, `[confused]`, `[scared]`) have low accuracy due to limited training examples
- The RL variant trades ~0.65 points of WER for better tag accuracy
- English only
- Best results on conversational and emotionally expressive speech

## Citation

```bibtex
@misc{evoxtral2026,
  title={Evoxtral: Expressive Tagged Transcription with Voxtral},
  author={Yongkang Zou},
  year={2026},
  url={https://huggingface.co/YongkangZOU/evoxtral-lora}
}
```