---
library_name: peft
base_model: mistralai/Voxtral-Mini-3B-2507
tags:
- voxtral
- lora
- speech-recognition
- expressive-transcription
- audio
- mistral
- hackathon
datasets:
- custom
language:
- en
license: apache-2.0
pipeline_tag: automatic-speech-recognition
---

# Evoxtral LoRA — Expressive Tagged Transcription

A LoRA adapter for [Voxtral-Mini-3B-2507](https://huggingface.co/mistralai/Voxtral-Mini-3B-2507) that produces transcriptions enriched with inline expressive audio tags from the [ElevenLabs v3 tag set](https://elevenlabs.io/docs/api-reference/text-to-speech). Built for the **Mistral AI Online Hackathon 2026** (W&B Fine-Tuning Track).

## What It Does

Standard ASR:

> So I was thinking maybe we could try that new restaurant downtown. I mean if you're free this weekend.

Evoxtral:

> [nervous] So... [stammers] I was thinking maybe we could... [clears throat] try that new restaurant downtown? [laughs nervously] I mean, if you're free this weekend?

## Evaluation Results

| Metric | Base Voxtral | Evoxtral (finetuned) | Improvement |
|--------|--------------|----------------------|-------------|
| **WER** (Word Error Rate) | 6.64% | **4.47%** | 32.7% relative reduction |
| **Tag F1** (expressive tag accuracy) | 22.0% | **67.2%** | 3.1× higher |

Evaluated on 50 held-out test samples. The finetuned model roughly triples expressive-tag F1 while also reducing the raw word error rate.

## Training Details

| Parameter | Value |
|-----------|-------|
| Base model | `mistralai/Voxtral-Mini-3B-2507` |
| Method | LoRA (PEFT) |
| LoRA rank | 64 |
| LoRA alpha | 128 |
| LoRA dropout | 0.05 |
| Target modules | `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`, `multi_modal_projector` |
| Learning rate | 2e-4 |
| Scheduler | Cosine |
| Epochs | 3 |
| Batch size | 2 per device (effective 16 with gradient accumulation 8) |
| NEFTune noise alpha | 5.0 |
| Precision | bf16 |
| GPU | NVIDIA A10G (24 GB) |
| Training time | ~25 minutes |
| Trainable params | 124.8M of 4.8B (2.6%) |

A sketch of how these hyperparameters map onto a PEFT/`Trainer` configuration follows the Usage section below.

## Dataset

Custom synthetic dataset of 1,010 audio samples generated with ElevenLabs TTS v3:

- **808** train / **101** validation / **101** test
- Each sample pairs audio with a transcription carrying inline ElevenLabs v3 expressive tags
- Tags include `[sighs]`, `[laughs]`, `[whispers]`, `[nervous]`, `[frustrated]`, `[clears throat]`, `[pause]`, `[excited]`, and more
- The Whisper-based audio encoder was frozen during training

## Usage

```python
import torch
from transformers import VoxtralForConditionalGeneration, AutoProcessor
from peft import PeftModel

repo_id = "mistralai/Voxtral-Mini-3B-2507"
adapter_id = "YongkangZOU/evoxtral-lora"

# Load the base model in bf16 and attach the LoRA adapter.
processor = AutoProcessor.from_pretrained(repo_id)
base_model = VoxtralForConditionalGeneration.from_pretrained(
    repo_id, dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base_model, adapter_id)

# Build a transcription request for the audio file.
inputs = processor.apply_transcription_request(
    language="en",
    audio=["path/to/audio.wav"],
    format=["WAV"],
    model_id=repo_id,
    return_tensors="pt",
)
inputs = inputs.to(model.device, dtype=torch.bfloat16)

# Greedy decoding; strip the prompt tokens before decoding the answer.
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
transcription = processor.batch_decode(
    outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]

print(transcription)
# [nervous] So... I was thinking maybe we could [clears throat] try that new restaurant downtown?
```
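Because the tags are emitted inline, downstream code that wants the plain transcript (or just the tag sequence) has to separate the two. A minimal post-processing sketch, assuming tags always take the lowercase `[word]` or `[two words]` form shown above:

```python
import re

# Matches inline expressive tags such as [laughs] or [clears throat].
TAG_PATTERN = re.compile(r"\[([a-z][a-z ]*)\]")

def split_tags(tagged_text):
    """Return (plain_text, tags) from a tagged transcription."""
    tags = TAG_PATTERN.findall(tagged_text)
    plain = TAG_PATTERN.sub("", tagged_text)
    plain = re.sub(r"\s{2,}", " ", plain).strip()  # collapse whitespace left by removed tags
    return plain, tags

plain, tags = split_tags(
    "[nervous] So... [stammers] I was thinking maybe we could... try that place?"
)
print(plain)  # So... I was thinking maybe we could... try that place?
print(tags)   # ['nervous', 'stammers']
```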
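To serve the model without a `peft` dependency, the adapter can be folded into the base weights with PEFT's `merge_and_unload`. A short sketch continuing from the Usage snippet above; the output directory name is arbitrary:

```python
# Fold the LoRA deltas into the base weights and save a standalone checkpoint.
merged = model.merge_and_unload()
merged.save_pretrained("evoxtral-mini-3b-merged")
processor.save_pretrained("evoxtral-mini-3b-merged")
```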
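For reproduction, the hyperparameters in the Training Details table map onto PEFT and `transformers.Trainer` roughly as follows. This is a hedged sketch rather than the exact training script: it assumes a freshly loaded `base_model` (as in the Usage snippet, before attaching the adapter), and targeting `multi_modal_projector` by name is an assumption about how the projector layers were wrapped.

```python
from peft import LoraConfig, get_peft_model
from transformers import TrainingArguments

# LoRA setup matching the Training Details table (sketch, not the exact script).
lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention projections
        "gate_proj", "up_proj", "down_proj",     # MLP projections
        "multi_modal_projector",                 # audio-to-text projector (assumed naming)
    ],
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # ~124.8M of 4.8B (2.6%)

training_args = TrainingArguments(
    output_dir="evoxtral-lora",
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,  # effective batch size 16
    neftune_noise_alpha=5.0,        # NEFTune embedding noise
    bf16=True,
)
```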
## W&B Tracking

All training and evaluation runs are tracked on Weights & Biases:

- [Training run](https://wandb.ai/yongkang-zou-ai/evoxtral/runs/t8ak7a20)
- [Base model eval](https://wandb.ai/yongkang-zou-ai/evoxtral/runs/f9l2zwvs)
- [Finetuned model eval](https://wandb.ai/yongkang-zou-ai/evoxtral/runs/b32c74im)
- [Project dashboard](https://wandb.ai/yongkang-zou-ai/evoxtral)

## Supported Tags

The model can produce any tag from the ElevenLabs v3 expressive tag set, including:

`[laughs]` `[sighs]` `[gasps]` `[clears throat]` `[whispers]` `[sniffs]` `[pause]` `[nervous]` `[frustrated]` `[excited]` `[sad]` `[angry]` `[calm]` `[stammers]` `[yawns]` and more.

## Limitations

- Trained on synthetic (TTS-generated) audio, not natural speech recordings
- A Tag F1 of 67.2% means roughly a third of tags may still be missed or misplaced
- English only
- Best results on conversational and emotionally expressive speech

## Citation

```bibtex
@misc{evoxtral2026,
  title={Evoxtral: Expressive Tagged Transcription with Voxtral},
  author={Yongkang Zou},
  year={2026},
  url={https://huggingface.co/YongkangZOU/evoxtral-lora}
}
```