Evoxtral-Realtime SFT (Recipe I: backchannel)
LoRA adapter on top of mistralai/Voxtral-Mini-4B-Realtime-2602 that emits ElevenLabs-style expressive tags ([whispers], [sighs], [laughs], [pause], etc.) from audio. Designed for half-duplex AI-therapist voice agents where the planner LLM benefits from a parallel affect channel alongside ASR text.
For production, use the RAFT-polished version instead: same architecture, ~5pp lower hallucination. This SFT checkpoint is the unpolished baseline and the input to the RAFT stage.
Architecture: Moshi-style backchannel
This adapter is tag-only: it does NOT produce ASR text. Pair it with the frozen base model for ASR text and merge the outputs at inference:
audio ─┬─ base Voxtral-Mini-4B-Realtime-2602 ─→ ASR text (clean WER ~10%)
       └─ this adapter (LoRA on attention)   ─→ tag stream
merged: "[whispers] [pause] Listen, I know you're in a meeting"
The dual-channel pattern is inspired by Moshi's parallel-stream design, adapted to Voxtral Realtime's element-wise audio-text fusion architecture. See the project repo's serve_modal.py for a Modal-deployed Mode B reference implementation (two model instances on a single A100-40GB, parallel forward, top-K tag filter, merged JSON output).
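As a rough illustration of the merge step, the sketch below keeps the top-K tags and prefixes them to the ASR text. It assumes the tag channel yields (tag, score) pairs in emission order; the function name `merge_channels`, the score source, and the JSON field names are hypothetical, and the actual logic in serve_modal.py may differ.

```python
import json


def merge_channels(asr_text, scored_tags, top_k=2):
    """Keep the top_k highest-scoring tags and prefix them to the ASR text.

    scored_tags: list of (tag, score) pairs in emission order (assumed format).
    """
    text = asr_text.strip()
    # Rank tag indices by score (highest first), keep the best top_k,
    # then sort the kept indices so tags stay in emission order.
    ranked = sorted(range(len(scored_tags)),
                    key=lambda i: scored_tags[i][1], reverse=True)
    keep = sorted(ranked[:top_k])
    tags = [scored_tags[i][0] for i in keep]
    parts = tags + ([text] if text else [])
    return {"text": text, "tags": tags, "merged": " ".join(parts)}


result = merge_channels(
    "Listen, I know you're in a meeting",
    [("[whispers]", 0.91), ("[calm]", 0.20), ("[pause]", 0.75)],
)
print(json.dumps(result))
```

Restoring emission order after the filter keeps the merged string stable when scores change slightly between runs.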
Performance (50-sample test set, greedy)
| Metric | Base alone | This adapter (Recipe I) |
|---|---|---|
| Tag F1 | 22% | 28% |
| Tag Recall | 22% | 51% |
| Tag Precision | 100% (rarely emits) | 34% (over-emits) |
| Tag Hallucination | 0% | 61% |
| WER (text) | 10% | n/a (adapter doesn't emit text) |
Tag Recall more than doubled versus the base model (22% to 51%). An audio-shuffle diagnostic confirmed that predictions follow the audio: Tag F1 against the audio-source reference exactly matched the in-position F1, so predictions are audio-grounded rather than text-pattern shortcuts.
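This card does not spell out how the tag metrics are computed. As a hedged sketch, assuming per-utterance multiset matching of predicted tags against reference tags, precision, recall, and F1 could be computed as below. The function name `tag_prf` and the matching rule are assumptions, not the project's evaluation harness, and the hallucination metric is left out because its definition is not given here.

```python
import re
from collections import Counter

TAG_RE = re.compile(r"\[[^\]]+\]")  # same bracket-tag pattern used for the training targets


def tag_prf(predicted, reference):
    """Return (precision, recall, f1) over bracket tags for one utterance."""
    pred = Counter(TAG_RE.findall(predicted))
    ref = Counter(TAG_RE.findall(reference))
    # Multiset intersection: each reference tag can be matched at most once.
    overlap = sum((pred & ref).values())
    n_pred, n_ref = sum(pred.values()), sum(ref.values())
    precision = overlap / n_pred if n_pred else 0.0  # assumed convention for empty output
    recall = overlap / n_ref if n_ref else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


p, r, f1 = tag_prf("[calm] [pause] [sighs]", "[sighs] [pause] I'm tired")
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

In this toy case the extra `[calm]` lowers precision while recall stays perfect, which is the over-emission pattern the table describes.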
Quick start
import torch
from transformers import VoxtralRealtimeForConditionalGeneration, AutoProcessor
from peft import PeftModel

processor = AutoProcessor.from_pretrained("mistralai/Voxtral-Mini-4B-Realtime-2602")
base = VoxtralRealtimeForConditionalGeneration.from_pretrained(
    "mistralai/Voxtral-Mini-4B-Realtime-2602",
    dtype=torch.bfloat16,
    device_map="auto",
)
tag_model = PeftModel.from_pretrained(base, "YongkangZOU/evoxtral-realtime-sft")
tag_model.eval()
# `tag_model` produces the tag stream. PEFT injects the LoRA layers into `base` in place,
# so for adapter-free ASR text either load a second base instance or run generation inside
# `with tag_model.disable_adapter():`. See serve_modal.py for the full hybrid.
For end-to-end use (POST audio file, get merged [tag1] [tag2] text output), the project repo ships a Modal-deployed FastAPI server with parallel forward + top-K filter built in.
Training details
- Base: `mistralai/Voxtral-Mini-4B-Realtime-2602`
- Schema: v1-style packed targets. Target tokens are consecutive at `[p_len, p_len+N)`, EOS at `p_len+N`, post-EOS labels=-100. No silence-position training (unlike the v3 distributed schema, which over-dilutes sparse tag-only training at a 60:1 ratio against content).
- Target: tag-only (`re.findall(r'\[[^\]]+\]', tagged_text)`: text content stripped, only bracket-form tags remain).
- LoRA: r=16, α=64, RS-LoRA, dropout=0.05, attention-only (`q_proj`, `k_proj`, `v_proj`, `o_proj`).
- Frozen: `audio_tower`, `multi_modal_projector`, `time_embedding` (frozen after `get_peft_model()` to avoid PEFT re-enabling norm grads).
- Optimizer: AdamW, lr=2e-5, cosine schedule, warmup=50 steps, weight_decay=0.01, max_grad_norm=1.0.
- Training: bf16, 3 epochs (153 steps), batch=2, grad_accum=8, effective batch=16, gradient checkpointing, NEFTune α=5.0.
- Hardware: Modal A100-40GB, ~7 min runtime.
- Trainable params: 12 M of 4.5 B (0.27%).
- Dataset: ~810 audio clips (TTS-synthesized via ElevenLabs v3) reused from the predecessor Evoxtral 3B project.
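The LoRA hyperparameters listed above map onto PEFT's `LoraConfig` roughly as follows. This is a sketch, not the project's training script: the name `LORA_KWARGS` is hypothetical, and the guarded import exists only so the snippet also runs where PEFT is not installed. With RS-LoRA the adapter update is scaled by α/√r instead of α/r.

```python
import math

# Hyperparameters copied from the list above; the dict name is illustrative.
LORA_KWARGS = dict(
    r=16,
    lora_alpha=64,
    lora_dropout=0.05,
    use_rslora=True,  # rank-stabilized scaling: alpha / sqrt(r)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention-only
)

# Effective scaling factor applied to the LoRA update.
rslora_scale = LORA_KWARGS["lora_alpha"] / math.sqrt(LORA_KWARGS["r"])
standard_scale = LORA_KWARGS["lora_alpha"] / LORA_KWARGS["r"]
print(f"RS-LoRA scale: {rslora_scale}, standard LoRA scale: {standard_scale}")

try:
    from peft import LoraConfig

    lora_config = LoraConfig(**LORA_KWARGS)
except ImportError:
    lora_config = None  # PEFT not available; the kwargs above still document the setup
```

The resulting config would be passed to `get_peft_model()`, after which the frozen modules listed above are frozen again.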
Why this schema worked when others didn't
We iterated through five recipes. Three failed in different ways before this one worked. See prior_work.md for the full Phase 1-3 matrix. Highlights:
| Phase | Schema | Outcome |
|---|---|---|
| 1 | matching-shape (v1), full text | bimodal cliff: under-fit (no tags learned) or over-fit (WER 122%, hallucinated content) |
| 2 | distributed targets (v2/v3), full text | greedy hits premature EOS, sampling helps but Tag F1 caps at 27% |
| 3a | distributed (v3) + tags-only | model only emits [ then long stream_pad runs (60:1 stream_pad-vs-content signal dilution) |
| 3b (this) | packed (v1-style) + tags-only | clean tag emission, audio-grounded, +29pp Recall |
The architectural insight: removing the ASR-routing burden (tags-only target) lets the LoRA's full 12M-parameter capacity go to audio-to-tag mapping, and the packed layout avoids the stream_pad dilution that kills sparse-target training under v3.
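To make the packed, tags-only layout concrete, here is a minimal sketch of how such a target and its label mask could be built. It assumes labels before `p_len` are masked with -100 and uses made-up token ids; `build_packed_labels`, `EOS_ID`, and `TOY_VOCAB` are hypothetical names, and the project's real collator may differ.

```python
import re

IGNORE_INDEX = -100
EOS_ID = 2  # hypothetical EOS token id
TOY_VOCAB = {"[whispers]": 10, "[pause]": 11}  # hypothetical one-token-per-tag vocab


def extract_tags(tagged_text):
    """Strip text content and keep only bracket-form tags."""
    return re.findall(r"\[[^\]]+\]", tagged_text)


def build_packed_labels(p_len, target_ids, seq_len, eos_id=EOS_ID):
    """Place N target tokens at [p_len, p_len+N), EOS at p_len+N, -100 elsewhere."""
    n = len(target_ids)
    if p_len + n + 1 > seq_len:
        raise ValueError("sequence too short for packed target plus EOS")
    labels = [IGNORE_INDEX] * seq_len
    labels[p_len:p_len + n] = target_ids
    labels[p_len + n] = eos_id
    return labels


tags = extract_tags("[whispers] [pause] Listen, I know you're in a meeting")
target_ids = [TOY_VOCAB[t] for t in tags]
labels = build_packed_labels(p_len=3, target_ids=target_ids, seq_len=8)
print(tags, labels)
```

Only N+1 positions carry a training signal, and they are contiguous, so no supervised position is spent on padding between sparse tags.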
Limitations
- Over-emission. Raw output emits ~5-7 tags per utterance even when 1-2 are correct. Mitigated by an inference-time `top_k=2` filter, giving Tag F1 29% and Precision 47%.
- Default-emit fallback. On uncertain audio, the model emits `[calm] [pause] [clears throat]` as a fallback set. This is a data-side limitation: TTS-synthesized affect signal is too weak to differentiate ambiguous inputs. RAFT does not eliminate it; the RL version only trims it slightly.
- TTS dataset. Trained on ElevenLabs-synthesized audio. Real clinical recordings (long pauses, distressed affect, room noise) are out of distribution.
- Tag taxonomy fixed. 15 base tags from `tag_taxonomy.py` (6 emotion + 5 nonverbal + 3 delivery + 1 pause). Out-of-taxonomy concepts won't be tagged.
- English only. Training data is English; multilingual affect transfer is untested.
See also
- `YongkangZOU/evoxtral-realtime-rl`: RAFT-polished version of this adapter. Use this in production; the SFT version is the input to the RAFT stage.
- Project repository: full pipeline, evaluation harness, Mode B hybrid serve, design docs.
- Voxtral-Mini-4B-Realtime-2602: required base model.
License
Apache-2.0, matching the base Voxtral Realtime license.
Citation
@software{evoxtral_realtime_2026,
title = {Evoxtral-Realtime: Backchannel-style affect-tag adapter for Voxtral-Mini-4B-Realtime},
author = {Yongkang Zou},
year = {2026},
url = {https://github.com/Tame-Your-Monkey/evoxtral-realtime}
}
@misc{voxtral_mini_realtime,
author = {Mistral AI},
title = {Voxtral-Mini-4B-Realtime-2602},
year = {2026},
url = {https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-2602}
}