---
library_name: peft
base_model: mistralai/Voxtral-Mini-3B-2507
tags:
- voxtral
- lora
- speech-recognition
- expressive-transcription
- audio
- mistral
- hackathon
- rl
- raft
datasets:
- custom
language:
- en
license: apache-2.0
pipeline_tag: automatic-speech-recognition
---

# Evoxtral LoRA — Expressive Tagged Transcription

A LoRA adapter for [Voxtral-Mini-3B-2507](https://huggingface.co/mistralai/Voxtral-Mini-3B-2507) that produces transcriptions enriched with inline expressive audio tags from the [ElevenLabs v3 tag set](https://elevenlabs.io/docs/api-reference/text-to-speech). Built for the **Mistral AI Online Hackathon 2026** (W&B Fine-Tuning Track).

**Two model variants available:**

- **[Evoxtral SFT](https://huggingface.co/YongkangZOU/evoxtral-lora)** — best overall transcription accuracy (lowest WER)
- **[Evoxtral RL](https://huggingface.co/YongkangZOU/evoxtral-rl)** — best expressive tag accuracy (highest Tag F1)

## What It Does

Standard ASR:

> So I was thinking maybe we could try that new restaurant downtown. I mean if you're free this weekend.

Evoxtral:

> [nervous] So... [stammers] I was thinking maybe we could... [clears throat] try that new restaurant downtown? [laughs nervously] I mean, if you're free this weekend?

## Training Pipeline

```
Base Voxtral-Mini-3B → SFT (LoRA, 3 epochs) → RL (RAFT, 1 epoch)
```

1. **SFT**: LoRA fine-tuning on 808 synthetic audio samples with expressive tags (lr=2e-4, 3 epochs)
2. **RL (RAFT)**: Rejection sampling — generate 4 completions per sample, score each with a rule-based reward (WER accuracy + Tag F1 − hallucination penalty), keep the best, then run SFT on the curated data (lr=5e-5, 1 epoch)

This follows the approach from [GRPO for Speech Recognition](https://arxiv.org/abs/2509.01939) and Voxtral's own SFT→DPO training recipe.

## Evaluation Results

Evaluated on 50 held-out test samples. Full benchmark (Evoxtral-Bench) with 7 metrics:

### Core Metrics — Base vs SFT vs RL

| Metric | Base Voxtral | Evoxtral SFT | Evoxtral RL | Best |
|--------|-------------|-------------|------------|------|
| **WER** | 6.64% | **4.47%** | 5.12% | SFT |
| **CER** | 2.72% | **1.23%** | 1.48% | SFT |
| **Tag F1** | 22.0% | 67.2% | **69.4%** | RL |
| **Tag Precision** | 22.0% | 67.4% | **68.5%** | RL |
| **Tag Recall** | 22.0% | 69.4% | **72.7%** | RL |
| **Emphasis F1** | 42.0% | 84.0% | **86.0%** | RL |
| **Tag Hallucination** | 0.0% | **19.3%** | 20.2% | SFT |

**SFT** excels at raw transcription accuracy (best WER/CER). **RL** further improves expressive tag generation (+2.2 points Tag F1, +3.3 points Tag Recall, +2 points Emphasis F1) at a small cost to WER.

### Per-Tag F1 Breakdown (SFT → RL)

| Tag | SFT F1 | RL F1 | Change | Support |
|-----|--------|-------|--------|---------|
| `[sighs]` | 1.000 | **1.000** | — | 9 |
| `[clears throat]` | 0.889 | **1.000** | +12.5% | 8 |
| `[gasps]` | 0.957 | **0.957** | — | 12 |
| `[pause]` | 0.885 | **0.902** | +1.9% | 25 |
| `[nervous]` | 0.800 | **0.846** | +5.8% | 13 |
| `[stammers]` | 0.889 | 0.842 | -5.3% | 8 |
| `[laughs]` | 0.800 | **0.815** | +1.9% | 12 |
| `[sad]` | 0.667 | **0.750** | +12.4% | 4 |
| `[whispers]` | 0.636 | **0.667** | +4.9% | 13 |
| `[crying]` | 0.750 | 0.571 | -23.9% | 5 |
| `[excited]` | 0.615 | 0.571 | -7.2% | 5 |
| `[shouts]` | 0.400 | **0.500** | +25.0% | 3 |
| `[calm]` | 0.200 | **0.400** | +100% | 6 |
| `[frustrated]` | 0.444 | 0.444 | — | 3 |
| `[angry]` | 0.667 | 0.667 | — | 2 |
| `[confused]` | 0.000 | 0.000 | — | 1 |
| `[scared]` | 0.000 | 0.000 | — | 1 |

RL improved 8 tags, left 6 unchanged, and regressed 3. Biggest gains on `[clears throat]` (+12.5%), `[calm]` (+100%), `[sad]` (+12.4%), and `[shouts]` (+25%).
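The exact Evoxtral-Bench scoring code is not reproduced in this card. The sketch below shows one plausible way to compute the tag metrics reported above and the composite reward used to rank RAFT candidates (weights from the RL stage table below); the regex, function names, and multiset matching are illustrative assumptions, not the official implementation.

```python
import re
from collections import Counter

# Inline tags look like [sighs] or [clears throat].
TAG_PATTERN = re.compile(r"\[([a-z ]+)\]")


def extract_tags(text: str) -> Counter:
    """Collect the expressive tags in a transcript as a multiset."""
    return Counter(TAG_PATTERN.findall(text.lower()))


def tag_scores(reference: str, prediction: str) -> dict:
    """Tag precision / recall / F1, plus the share of predicted tags
    that are absent from the reference (hallucination rate)."""
    ref, pred = extract_tags(reference), extract_tags(prediction)
    matched = sum((ref & pred).values())  # multiset intersection
    n_ref, n_pred = sum(ref.values()), sum(pred.values())
    precision = matched / n_pred if n_pred else 0.0
    recall = matched / n_ref if n_ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    hallucination = 1.0 - precision if n_pred else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "hallucination": hallucination}


def raft_reward(wer: float, tag_f1: float, hallucination: float) -> float:
    """Composite score used to rank RAFT candidates (weights as stated in the RL stage table)."""
    return 0.4 * (1 - wer) + 0.4 * tag_f1 + 0.2 * (1 - hallucination)


print(tag_scores(
    "[nervous] So... [clears throat] try that new restaurant downtown?",
    "[nervous] So... [laughs] try that new restaurant downtown?",
))
# {'precision': 0.5, 'recall': 0.5, 'f1': 0.5, 'hallucination': 0.5}
```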
## Training Details

### SFT Stage

| Parameter | Value |
|-----------|-------|
| Base model | `mistralai/Voxtral-Mini-3B-2507` |
| Method | LoRA (PEFT) |
| LoRA rank | 64 |
| LoRA alpha | 128 |
| LoRA dropout | 0.05 |
| Target modules | q/k/v/o_proj, gate/up/down_proj, multi_modal_projector |
| Learning rate | 2e-4 |
| Scheduler | Cosine |
| Epochs | 3 |
| Batch size | 2 (effective 16 with grad accum 8) |
| NEFTune noise alpha | 5.0 |
| Precision | bf16 |
| GPU | NVIDIA A10G (24GB) |
| Training time | ~25 minutes |
| Trainable params | 124.8M / 4.8B (2.6%) |

### RL Stage (RAFT)

| Parameter | Value |
|-----------|-------|
| Method | Rejection sampling + SFT (RAFT) |
| Samples per input | 4 (temperature=0.7, top_p=0.9) |
| Reward function | 0.4×(1-WER) + 0.4×Tag_F1 + 0.2×(1-hallucination) |
| Curated samples | 727 (bottom 10% filtered, reward > 0.954) |
| Avg reward | 0.980 |
| Learning rate | 5e-5 |
| Epochs | 1 |
| Final loss | 0.021 |
| Training time | ~7 minutes |

## Dataset

Custom synthetic dataset of 1,010 audio samples generated with ElevenLabs TTS v3:

- **808** train / **101** validation / **101** test
- Each sample has audio + tagged transcription with inline ElevenLabs v3 expressive tags
- Tags include: `[sighs]`, `[laughs]`, `[whispers]`, `[nervous]`, `[frustrated]`, `[clears throat]`, `[pause]`, `[excited]`, and more
- The audio encoder (Whisper-based) was frozen during training

## Usage

```python
import torch
from transformers import VoxtralForConditionalGeneration, AutoProcessor
from peft import PeftModel

repo_id = "mistralai/Voxtral-Mini-3B-2507"
# Use "YongkangZOU/evoxtral-lora" for SFT or "YongkangZOU/evoxtral-rl" for RL
adapter_id = "YongkangZOU/evoxtral-rl"

processor = AutoProcessor.from_pretrained(repo_id)
base_model = VoxtralForConditionalGeneration.from_pretrained(
    repo_id, dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base_model, adapter_id)

# Transcribe audio with expressive tags
inputs = processor.apply_transcription_request(
    language="en",
    audio=["path/to/audio.wav"],
    format=["WAV"],
    model_id=repo_id,
    return_tensors="pt",
)
inputs = inputs.to(model.device, dtype=torch.bfloat16)

outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
transcription = processor.batch_decode(
    outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]

print(transcription)
# [nervous] So... I was thinking maybe we could [clears throat] try that new restaurant downtown?
```
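If you prefer plain `transformers` inference without the PEFT wrapper, the adapter weights can be folded into the base model with PEFT's standard `merge_and_unload()`. A minimal sketch continuing the snippet above (the output directory name is chosen for illustration):

```python
# Fold the LoRA weights into the base model, then save a standalone copy.
merged_model = model.merge_and_unload()
merged_model.save_pretrained("evoxtral-merged")  # directory name is arbitrary
processor.save_pretrained("evoxtral-merged")
```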
## API

A serverless API with Swagger UI is available on Modal:

```bash
curl -X POST https://yongkang-zou1999--evoxtral-api-evoxtralmodel-web.modal.run/transcribe \
  -F "file=@audio.wav"
```

- [Swagger UI](https://yongkang-zou1999--evoxtral-api-evoxtralmodel-web.modal.run/docs)
- [Live Demo (HF Space)](https://huggingface.co/spaces/YongkangZOU/evoxtral)

## W&B Tracking

All training and evaluation runs are tracked on Weights & Biases:

- [SFT Training](https://wandb.ai/yongkang-zou-ai/evoxtral/runs/t8ak7a20)
- [RL Training (RAFT)](https://wandb.ai/yongkang-zou-ai/evoxtral)
- [Base model eval](https://wandb.ai/yongkang-zou-ai/evoxtral/runs/bvqa4ioo)
- [SFT model eval](https://wandb.ai/yongkang-zou-ai/evoxtral/runs/ayx4ldyd)
- [RL model eval](https://wandb.ai/yongkang-zou-ai/evoxtral)
- [Project dashboard](https://wandb.ai/yongkang-zou-ai/evoxtral)

## Supported Tags

The model can produce any tag from the ElevenLabs v3 expressive tag set, including:

`[laughs]` `[sighs]` `[gasps]` `[clears throat]` `[whispers]` `[sniffs]` `[pause]` `[nervous]` `[frustrated]` `[excited]` `[sad]` `[angry]` `[calm]` `[stammers]` `[yawns]` and more.

## Limitations

- Trained on synthetic (TTS-generated) audio, not natural speech recordings
- ~20% tag hallucination rate — the model occasionally predicts tags not present in the reference
- Rare/subtle tags (`[calm]`, `[confused]`, `[scared]`) have low accuracy due to limited training examples
- The RL variant trades ~0.65 points of WER (4.47% → 5.12%) for better tag accuracy
- English only
- Best results on conversational and emotionally expressive speech

## Citation

```bibtex
@misc{evoxtral2026,
  title={Evoxtral: Expressive Tagged Transcription with Voxtral},
  author={Yongkang Zou},
  year={2026},
  url={https://huggingface.co/YongkangZOU/evoxtral-lora}
}
```