---
library_name: peft
base_model: mistralai/Voxtral-Mini-3B-2507
tags:
- voxtral
- lora
- speech-recognition
- expressive-transcription
- audio
- mistral
- hackathon
datasets:
- custom
language:
- en
license: apache-2.0
pipeline_tag: automatic-speech-recognition
---
# Evoxtral LoRA — Expressive Tagged Transcription
A LoRA adapter for [Voxtral-Mini-3B-2507](https://huggingface.co/mistralai/Voxtral-Mini-3B-2507) that produces transcriptions enriched with inline expressive audio tags from the [ElevenLabs v3 tag set](https://elevenlabs.io/docs/api-reference/text-to-speech).
Built for the **Mistral AI Online Hackathon 2026** (W&B Fine-Tuning Track).
## What It Does
Standard ASR:
> So I was thinking maybe we could try that new restaurant downtown. I mean if you're free this weekend.
Evoxtral:
> [nervous] So... [stammers] I was thinking maybe we could... [clears throat] try that new restaurant downtown? [laughs nervously] I mean, if you're free this weekend?
## Evaluation Results
| Metric | Base Voxtral | Evoxtral (finetuned) | Improvement |
|--------|-------------|---------------------|-------------|
| **WER** (Word Error Rate, lower is better) | 6.64% | **4.47%** | 32.7% relative reduction |
| **Tag F1** (Expressive Tag Accuracy, higher is better) | 22.0% | **67.2%** | ~3x higher |
Evaluated on 50 held-out test samples. Fine-tuning raises Tag F1 from 22.0% to 67.2% and also lowers WER from 6.64% to 4.47%, so the gain in expressive tagging does not come at the cost of transcription accuracy.
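
The card does not spell out how the two metrics were computed, so the following is only a minimal sketch of one reasonable definition: WER as word-level edit distance over the tag-stripped text, and Tag F1 over the multiset of tags (ignoring position). The helper names `split_tags`, `wer`, and `tag_f1`, the text normalization, and the sample strings are assumptions for illustration, not the evaluation code behind the table.

```python
import re
from collections import Counter

TAG_RE = re.compile(r"\[([^\[\]]+)\]")

def split_tags(text):
    """Return (plain_words, tags) for a tagged transcription."""
    tags = [t.strip().lower() for t in TAG_RE.findall(text)]
    plain = TAG_RE.sub(" ", text).lower()
    # Assumed normalization: lowercase, keep letters, digits, apostrophes.
    words = re.findall(r"[a-z0-9']+", plain)
    return words, tags

def wer(ref_words, hyp_words):
    """Word error rate: word-level Levenshtein distance / reference length."""
    prev = list(range(len(hyp_words) + 1))
    for i, r in enumerate(ref_words, start=1):
        cur = [i]
        for j, h in enumerate(hyp_words, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1] / max(len(ref_words), 1)

def tag_f1(ref_tags, hyp_tags):
    """F1 over tag multisets; tag position is ignored in this sketch."""
    overlap = sum((Counter(ref_tags) & Counter(hyp_tags)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp_tags)
    recall = overlap / len(ref_tags)
    return 2 * precision * recall / (precision + recall)

# Hypothetical reference/hypothesis pair.
ref = "[nervous] So... [stammers] I was thinking maybe we could [clears throat] try that new restaurant downtown?"
hyp = "[nervous] So I was thinking maybe we could [laughs] try the new restaurant downtown"

ref_words, ref_tags = split_tags(ref)
hyp_words, hyp_tags = split_tags(hyp)
print(f"WER:    {wer(ref_words, hyp_words):.3f}")
print(f"Tag F1: {tag_f1(ref_tags, hyp_tags):.3f}")
```

A position-aware Tag F1 (for example, matching tags only when they fall between the same words) would be stricter than this multiset version and would likely give lower scores.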
## Training Details
| Parameter | Value |
|-----------|-------|
| Base model | `mistralai/Voxtral-Mini-3B-2507` |
| Method | LoRA (PEFT) |
| LoRA rank | 64 |
| LoRA alpha | 128 |
| LoRA dropout | 0.05 |
| Target modules | q/k/v/o_proj, gate/up/down_proj, multi_modal_projector |
| Learning rate | 2e-4 |
| Scheduler | Cosine |
| Epochs | 3 |
| Batch size | 2 (effective 16 with grad accum 8) |
| NEFTune noise alpha | 5.0 |
| Precision | bf16 |
| GPU | NVIDIA A10G (24GB) |
| Training time | ~25 minutes |
| Trainable params | 124.8M / 4.8B (2.6%) |
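
The derived figures in the table can be sanity-checked with a few lines of arithmetic. This is a sketch only: the variable names are illustrative, and the LoRA scale assumes the standard `alpha / r` scaling (no rank-stabilized variant), which the card does not state explicitly.

```python
# Values copied from the training table.
lora_rank = 64
lora_alpha = 128
per_device_batch_size = 2
grad_accum_steps = 8
trainable_params = 124.8e6
total_params = 4.8e9

# Standard LoRA multiplies the low-rank update by alpha / r
# (assumption: rank-stabilized scaling was not used).
lora_scale = lora_alpha / lora_rank

# Samples contributing to each optimizer step on a single GPU.
effective_batch_size = per_device_batch_size * grad_accum_steps

# Share of parameters that receive gradients.
trainable_pct = 100 * trainable_params / total_params

print(f"LoRA scale:           {lora_scale}")
print(f"Effective batch size: {effective_batch_size}")
print(f"Trainable share:      {trainable_pct:.1f}%")
```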
## Dataset
Custom synthetic dataset of 1,010 audio samples generated with ElevenLabs TTS v3:
- **808** train / **101** validation / **101** test
- Each sample has audio + tagged transcription with inline ElevenLabs v3 expressive tags
- Tags include: `[sighs]`, `[laughs]`, `[whispers]`, `[nervous]`, `[frustrated]`, `[clears throat]`, `[pause]`, `[excited]`, and more
- Audio encoder (Whisper-based) was frozen during training
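
The split sizes correspond to an 80/10/10 partition of the 1,010 samples. The card does not say how the split was drawn (seed, shuffling, stratification by tag), so the following is a hypothetical sketch of a seeded random split that reproduces the same sizes; `split_indices` and the seed are assumptions.

```python
import random

def split_indices(n_samples, seed=0):
    """Shuffle sample indices and cut them 80/10/10 (hypothetical helper)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_val = n_samples // 10
    n_test = n_samples // 10
    n_train = n_samples - n_val - n_test  # remainder goes to train
    return (
        indices[:n_train],
        indices[n_train:n_train + n_val],
        indices[n_train + n_val:],
    )

train_idx, val_idx, test_idx = split_indices(1010)
print(len(train_idx), len(val_idx), len(test_idx))
```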
## Usage
```python
import torch
from transformers import VoxtralForConditionalGeneration, AutoProcessor
from peft import PeftModel

repo_id = "mistralai/Voxtral-Mini-3B-2507"
adapter_id = "YongkangZOU/evoxtral-lora"

processor = AutoProcessor.from_pretrained(repo_id)
base_model = VoxtralForConditionalGeneration.from_pretrained(
    repo_id, dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base_model, adapter_id)

# Transcribe audio with expressive tags
inputs = processor.apply_transcription_request(
    language="en",
    audio=["path/to/audio.wav"],
    format=["WAV"],
    model_id=repo_id,
    return_tensors="pt",
)
inputs = inputs.to(model.device, dtype=torch.bfloat16)

outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
transcription = processor.batch_decode(
    outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]

print(transcription)
# [nervous] So... I was thinking maybe we could [clears throat] try that new restaurant downtown?
```
## W&B Tracking
All training and evaluation runs are tracked on Weights & Biases:
- [Training run](https://wandb.ai/yongkang-zou-ai/evoxtral/runs/t8ak7a20)
- [Base model eval](https://wandb.ai/yongkang-zou-ai/evoxtral/runs/f9l2zwvs)
- [Finetuned model eval](https://wandb.ai/yongkang-zou-ai/evoxtral/runs/b32c74im)
- [Project dashboard](https://wandb.ai/yongkang-zou-ai/evoxtral)
## Supported Tags
The model can produce any tag from the ElevenLabs v3 expressive tag set, including:
`[laughs]` `[sighs]` `[gasps]` `[clears throat]` `[whispers]` `[sniffs]` `[pause]` `[nervous]` `[frustrated]` `[excited]` `[sad]` `[angry]` `[calm]` `[stammers]` `[yawns]` and more.
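
Downstream code often needs the plain transcript and the tags separately. The sketch below is one way to post-process a tagged transcription, assuming tags always appear as non-nested `[...]` spans; `parse_tagged` and `KNOWN_TAGS` are illustrative names, and `KNOWN_TAGS` holds only the tags listed in this README, not the full ElevenLabs v3 set.

```python
import re

# Tags named in this README; the full ElevenLabs v3 set is larger.
KNOWN_TAGS = {
    "laughs", "sighs", "gasps", "clears throat", "whispers", "sniffs",
    "pause", "nervous", "frustrated", "excited", "sad", "angry", "calm",
    "stammers", "yawns",
}

TAG_RE = re.compile(r"\[([^\[\]]+)\]")

def parse_tagged(text):
    """Return (plain_text, tags, unknown_tags) for a tagged transcription."""
    tags = [t.strip().lower() for t in TAG_RE.findall(text)]
    # Drop the tags, then collapse the whitespace they leave behind.
    plain = re.sub(r"\s+", " ", TAG_RE.sub(" ", text)).strip()
    unknown = [t for t in tags if t not in KNOWN_TAGS]
    return plain, tags, unknown

sample = (
    "[nervous] So... [stammers] I was thinking maybe we could... "
    "[clears throat] try that new restaurant downtown? "
    "[laughs nervously] I mean, if you're free this weekend?"
)
plain, tags, unknown = parse_tagged(sample)
print(plain)
print(tags)
print(unknown)
```

Tags outside the allow-list (such as free-form variants of a listed tag) are reported rather than dropped, so the caller can decide whether to keep, map, or discard them.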
## Limitations
- Trained on synthetic (TTS-generated) audio, not natural speech recordings
- Tag F1 of 67.2% means a substantial share of tags (very roughly one in three) is still missed, spurious, or misplaced
- English only
- Best results on conversational and emotionally expressive speech
## Citation
```bibtex
@misc{evoxtral2026,
title={Evoxtral: Expressive Tagged Transcription with Voxtral},
author={Yongkang Zou},
year={2026},
url={https://huggingface.co/YongkangZOU/evoxtral-lora}
}
```