---
library_name: peft
base_model: mistralai/Voxtral-Mini-3B-2507
tags:
- voxtral
- lora
- speech-recognition
- expressive-transcription
- audio
- mistral
- hackathon
- rl
- raft
datasets:
- custom
language:
- en
license: apache-2.0
pipeline_tag: automatic-speech-recognition
---
# Evoxtral LoRA: Expressive Tagged Transcription

A LoRA adapter for [Voxtral-Mini-3B-2507](https://huggingface.co/mistralai/Voxtral-Mini-3B-2507) that produces transcriptions enriched with inline expressive audio tags from the [ElevenLabs v3 tag set](https://elevenlabs.io/docs/api-reference/text-to-speech).

Built for the **Mistral AI Online Hackathon 2026** (W&B Fine-Tuning Track).

**Two model variants are available:**

- **[Evoxtral SFT](https://huggingface.co/YongkangZOU/evoxtral-lora)**: best overall transcription accuracy (lowest WER)
- **[Evoxtral RL](https://huggingface.co/YongkangZOU/evoxtral-rl)**: best expressive tag accuracy (highest Tag F1)
## What It Does

Standard ASR:

> So I was thinking maybe we could try that new restaurant downtown. I mean if you're free this weekend.

Evoxtral:

> [nervous] So... [stammers] I was thinking maybe we could... [clears throat] try that new restaurant downtown? [laughs nervously] I mean, if you're free this weekend?
## Training Pipeline

```
Base Voxtral-Mini-3B → SFT (LoRA, 3 epochs) → RL (RAFT, 1 epoch)
```

1. **SFT**: LoRA fine-tuning on 808 synthetic audio samples with expressive tags (lr=2e-4, 3 epochs)
2. **RL (RAFT)**: generate 4 completions per sample via rejection sampling, score each with a rule-based reward (WER accuracy + Tag F1 - hallucination penalty; a sketch appears below), keep the best, then run SFT on the curated data (lr=5e-5, 1 epoch)

This follows the approach of [GRPO for Speech Recognition](https://arxiv.org/abs/2509.01939) and Voxtral's own SFT→DPO training recipe.
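For concreteness, here is a minimal sketch of the rule-based reward under the weights listed in the RL table below (0.4 / 0.4 / 0.2). The tag-matching convention, the helper names, and the `jiwer` dependency are assumptions for illustration, not the project's actual code:

```python
import re
from collections import Counter

import jiwer  # assumed dependency for computing WER

TAG_RE = re.compile(r"\[[a-z ]+\]")  # inline tags like [laughs], [clears throat]

def strip_tags(text: str) -> str:
    """Remove inline [tag] markers so WER is scored on words only."""
    return re.sub(r"\s+", " ", TAG_RE.sub("", text)).strip()

def tag_scores(ref: str, hyp: str) -> tuple[float, float]:
    """Return (tag F1, hallucination rate) from multiset tag overlap."""
    ref_tags, hyp_tags = Counter(TAG_RE.findall(ref)), Counter(TAG_RE.findall(hyp))
    overlap = sum((ref_tags & hyp_tags).values())
    precision = overlap / max(sum(hyp_tags.values()), 1)
    recall = overlap / max(sum(ref_tags.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    hallucination = 1.0 - precision if hyp_tags else 0.0
    return f1, hallucination

def raft_reward(ref: str, hyp: str) -> float:
    """0.4*(1-WER) + 0.4*Tag_F1 + 0.2*(1-hallucination), per the RL table."""
    wer = min(jiwer.wer(strip_tags(ref), strip_tags(hyp)), 1.0)
    f1, hallucination = tag_scores(ref, hyp)
    return 0.4 * (1 - wer) + 0.4 * f1 + 0.2 * (1 - hallucination)
```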
## Evaluation Results

Evaluated on 50 held-out test samples using the full Evoxtral-Bench suite of 7 metrics:
### Core Metrics: Base vs SFT vs RL

| Metric | Base Voxtral | Evoxtral SFT | Evoxtral RL | Best |
|--------|-------------|-------------|------------|------|
| **WER** | 6.64% | **4.47%** | 5.12% | SFT |
| **CER** | 2.72% | **1.23%** | 1.48% | SFT |
| **Tag F1** | 22.0% | 67.2% | **69.4%** | RL |
| **Tag Precision** | 22.0% | 67.4% | **68.5%** | RL |
| **Tag Recall** | 22.0% | 69.4% | **72.7%** | RL |
| **Emphasis F1** | 42.0% | 84.0% | **86.0%** | RL |
| **Tag Hallucination** | 0.0% | **19.3%** | 20.2% | SFT |
**SFT** excels at raw transcription accuracy (best WER/CER). **RL** further improves expressive tag generation (+2.2 pts Tag F1, +3.3 pts Tag Recall, +2.0 pts Emphasis F1) at a small cost in WER.
### Per-Tag F1 Breakdown (SFT → RL)

| Tag | SFT F1 | RL F1 | Change | Support |
|-----|--------|-------|--------|---------|
| `[sighs]` | 1.000 | **1.000** | 0.0% | 9 |
| `[clears throat]` | 0.889 | **1.000** | +12.5% | 8 |
| `[gasps]` | 0.957 | **0.957** | 0.0% | 12 |
| `[pause]` | 0.885 | **0.902** | +1.9% | 25 |
| `[nervous]` | 0.800 | **0.846** | +5.8% | 13 |
| `[stammers]` | 0.889 | 0.842 | -5.3% | 8 |
| `[laughs]` | 0.800 | **0.815** | +1.9% | 12 |
| `[sad]` | 0.667 | **0.750** | +12.4% | 4 |
| `[whispers]` | 0.636 | **0.667** | +4.9% | 13 |
| `[crying]` | 0.750 | 0.571 | -23.9% | 5 |
| `[excited]` | 0.615 | 0.571 | -7.2% | 5 |
| `[shouts]` | 0.400 | **0.500** | +25.0% | 3 |
| `[calm]` | 0.200 | **0.400** | +100% | 6 |
| `[frustrated]` | 0.444 | 0.444 | 0.0% | 3 |
| `[angry]` | 0.667 | 0.667 | 0.0% | 2 |
| `[confused]` | 0.000 | 0.000 | 0.0% | 1 |
| `[scared]` | 0.000 | 0.000 | 0.0% | 1 |
RL improved 8 tags, kept 6 stable, and regressed 3. The biggest gains were on `[calm]` (+100%), `[shouts]` (+25.0%), `[clears throat]` (+12.5%), and `[sad]` (+12.4%).
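The per-tag numbers above can be reproduced with a simple multiset-overlap convention. Here is a sketch; the exact matching rules used by Evoxtral-Bench are an assumption:

```python
import re
from collections import Counter

TAG_RE = re.compile(r"\[[a-z ]+\]")

def per_tag_f1(refs: list[str], hyps: list[str]) -> dict[str, tuple[float, int]]:
    """Map each tag to (F1, support), matching tag occurrences per sample."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for ref, hyp in zip(refs, hyps):
        r, h = Counter(TAG_RE.findall(ref)), Counter(TAG_RE.findall(hyp))
        for tag in set(r) | set(h):
            matched = min(r[tag], h[tag])
            tp[tag] += matched
            fp[tag] += h[tag] - matched   # predicted but absent from reference
            fn[tag] += r[tag] - matched   # in reference but missed
    results = {}
    for tag in set(tp) | set(fp) | set(fn):
        p = tp[tag] / max(tp[tag] + fp[tag], 1)
        rec = tp[tag] / max(tp[tag] + fn[tag], 1)
        f1 = 2 * p * rec / (p + rec) if p + rec else 0.0
        results[tag] = (f1, tp[tag] + fn[tag])  # support = reference count
    return results
```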
## Training Details

### SFT Stage

| Parameter | Value |
|-----------|-------|
| Base model | `mistralai/Voxtral-Mini-3B-2507` |
| Method | LoRA (PEFT) |
| LoRA rank | 64 |
| LoRA alpha | 128 |
| LoRA dropout | 0.05 |
| Target modules | q/k/v/o_proj, gate/up/down_proj, multi_modal_projector |
| Learning rate | 2e-4 |
| Scheduler | Cosine |
| Epochs | 3 |
| Batch size | 2 (effective 16 with grad accum 8) |
| NEFTune noise alpha | 5.0 |
| Precision | bf16 |
| GPU | NVIDIA A10G (24 GB) |
| Training time | ~25 minutes |
| Trainable params | 124.8M / 4.8B (2.6%) |
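The adapter configuration above maps directly onto `peft.LoraConfig`. A minimal sketch, with module names copied from the table (whether `multi_modal_projector` is targeted as a whole or via its inner linear layers depends on the actual training script):

```python
from peft import LoraConfig

# Illustrative reconstruction of the SFT adapter config from the table above.
lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention projections
        "gate_proj", "up_proj", "down_proj",     # MLP projections
        "multi_modal_projector",                 # audio-to-text projector
    ],
    task_type="CAUSAL_LM",
)
```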
### RL Stage (RAFT)

| Parameter | Value |
|-----------|-------|
| Method | Rejection sampling + SFT (RAFT) |
| Samples per input | 4 (temperature=0.7, top_p=0.9) |
| Reward function | 0.4×(1-WER) + 0.4×Tag_F1 + 0.2×(1-hallucination) |
| Curated samples | 727 (bottom 10% filtered, reward > 0.954) |
| Avg reward | 0.980 |
| Learning rate | 5e-5 |
| Epochs | 1 |
| Final loss | 0.021 |
| Training time | ~7 minutes |
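Putting the table together, the RAFT stage reduces to a best-of-N curation loop followed by ordinary SFT. A sketch (function names are illustrative; `raft_reward` is the reward sketch from the Training Pipeline section):

```python
def curate(dataset, generate, n_samples=4, keep_fraction=0.9):
    """dataset: iterable of (audio, reference) pairs.
    generate: draws one sampled transcription (temperature=0.7, top_p=0.9)."""
    scored = []
    for audio, ref in dataset:
        candidates = [generate(audio) for _ in range(n_samples)]
        best = max(candidates, key=lambda hyp: raft_reward(ref, hyp))
        scored.append((audio, best, raft_reward(ref, best)))
    # Drop the lowest-reward tail: 808 train inputs minus the bottom 10%
    # leaves ~727 curated samples, consistent with the table above.
    scored.sort(key=lambda item: item[2], reverse=True)
    kept = scored[: int(len(scored) * keep_fraction)]
    return [(audio, hyp) for audio, hyp, _ in kept]  # then run SFT on this
```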
## Dataset

Custom synthetic dataset of 1,010 audio samples generated with ElevenLabs TTS v3:

- **808** train / **101** validation / **101** test
- Each sample has audio + a tagged transcription with inline ElevenLabs v3 expressive tags
- Tags include `[sighs]`, `[laughs]`, `[whispers]`, `[nervous]`, `[frustrated]`, `[clears throat]`, `[pause]`, `[excited]`, and more
- The audio encoder (Whisper-based) was frozen during training
## Usage

```python
import torch
from transformers import VoxtralForConditionalGeneration, AutoProcessor
from peft import PeftModel

repo_id = "mistralai/Voxtral-Mini-3B-2507"
# Use "YongkangZOU/evoxtral-lora" for SFT or "YongkangZOU/evoxtral-rl" for RL
adapter_id = "YongkangZOU/evoxtral-rl"

processor = AutoProcessor.from_pretrained(repo_id)
base_model = VoxtralForConditionalGeneration.from_pretrained(
    repo_id, dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base_model, adapter_id)

# Transcribe audio with expressive tags
inputs = processor.apply_transcription_request(
    language="en",
    audio=["path/to/audio.wav"],
    format=["WAV"],
    model_id=repo_id,
    return_tensors="pt",
)
inputs = inputs.to(model.device, dtype=torch.bfloat16)

outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
transcription = processor.batch_decode(
    outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(transcription)
# [nervous] So... I was thinking maybe we could [clears throat] try that new restaurant downtown?
```
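If you prefer a standalone checkpoint without the PEFT wrapper, the adapter can be merged into the base weights using the standard `peft` API (the output path here is arbitrary):

```python
# Optional: fold the LoRA weights into the base model for standalone use.
merged = model.merge_and_unload()
merged.save_pretrained("evoxtral-merged")
processor.save_pretrained("evoxtral-merged")
```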
## API

A serverless API with Swagger UI is available on Modal:

```bash
curl -X POST https://yongkang-zou1999--evoxtral-api-evoxtralmodel-web.modal.run/transcribe \
  -F "file=@audio.wav"
```

- [Swagger UI](https://yongkang-zou1999--evoxtral-api-evoxtralmodel-web.modal.run/docs)
- [Live Demo (HF Space)](https://huggingface.co/spaces/YongkangZOU/evoxtral)
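The same endpoint can be called from Python; this mirrors the `curl` call above (the response schema is not documented here, so the example just prints the raw JSON):

```python
import requests

URL = "https://yongkang-zou1999--evoxtral-api-evoxtralmodel-web.modal.run/transcribe"

# Upload a WAV file as multipart form data, matching the curl example.
with open("audio.wav", "rb") as f:
    resp = requests.post(URL, files={"file": f})
resp.raise_for_status()
print(resp.json())  # see the Swagger UI above for the exact schema
```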
## W&B Tracking

All training and evaluation runs are tracked on Weights & Biases:

- [SFT Training](https://wandb.ai/yongkang-zou-ai/evoxtral/runs/t8ak7a20)
- [RL Training (RAFT)](https://wandb.ai/yongkang-zou-ai/evoxtral)
- [Base model eval](https://wandb.ai/yongkang-zou-ai/evoxtral/runs/bvqa4ioo)
- [SFT model eval](https://wandb.ai/yongkang-zou-ai/evoxtral/runs/ayx4ldyd)
- [RL model eval](https://wandb.ai/yongkang-zou-ai/evoxtral)
- [Project dashboard](https://wandb.ai/yongkang-zou-ai/evoxtral)
## Supported Tags

The model can produce any tag from the ElevenLabs v3 expressive tag set, including:

`[laughs]` `[sighs]` `[gasps]` `[clears throat]` `[whispers]` `[sniffs]` `[pause]` `[nervous]` `[frustrated]` `[excited]` `[sad]` `[angry]` `[calm]` `[stammers]` `[yawns]` and more.
## Limitations

- Trained on synthetic (TTS-generated) audio, not natural speech recordings
- ~20% tag hallucination rate: the model occasionally predicts tags not present in the reference
- Rare or subtle tags (`[calm]`, `[confused]`, `[scared]`) have low accuracy due to limited training examples
- The RL variant trades ~0.65 pts of WER for better tag accuracy
- English only
- Best results on conversational, emotionally expressive speech
## Citation

```bibtex
@misc{evoxtral2026,
  title={Evoxtral: Expressive Tagged Transcription with Voxtral},
  author={Yongkang Zou},
  year={2026},
  url={https://huggingface.co/YongkangZOU/evoxtral-lora}
}
```