Evoxtral-Realtime SFT (Recipe I — backchannel)

LoRA adapter on top of mistralai/Voxtral-Mini-4B-Realtime-2602 that emits ElevenLabs-style expressive tags ([whispers], [sighs], [laughs], [pause], etc.) from audio. Designed for half-duplex AI-therapist voice agents where the planner LLM benefits from a parallel affect channel alongside ASR text.

For production, use the RAFT-polished version instead: same architecture, with a tag hallucination rate about 5 percentage points lower. This SFT checkpoint is the unpolished baseline and the input to the RAFT stage.

Architecture: Moshi-style backchannel

This adapter is tag-only — it does NOT produce ASR text. Pair it with the frozen base model for ASR text and merge outputs at inference:

audio ─┬─ base Voxtral-Mini-4B-Realtime-2602 ─→ ASR text (clean WER ~10%)
       └─ this adapter (LoRA on attention)   ─→ tag stream

merged: "[whispers] [pause] Listen, I know you're in a meeting"

The dual-channel pattern is inspired by Moshi's parallel-stream design, adapted to Voxtral Realtime's element-wise audio-text fusion architecture. See the project repo's serve_modal.py for a Modal-deployed Mode B reference implementation (two model instances on a single A100-40GB, parallel forward passes, top-K tag filter, merged JSON output).
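
A minimal sketch of the merge step, assuming both channels have already been decoded to plain strings. The helper name merge_channels and the choice to prepend every tag to the transcript are illustrative assumptions drawn from the merged example above; the actual logic lives in serve_modal.py and may differ.

```python
import re

# Bracket-form tags such as [whispers] or [pause].
TAG_PATTERN = re.compile(r"\[[^\]]+\]")


def merge_channels(asr_text: str, tag_stream: str) -> str:
    """Prepend the adapter's tags to the base model's ASR text.

    Hypothetical helper (not part of the project repo): anything in the
    tag stream that is not a bracket-form tag is discarded.
    """
    tags = TAG_PATTERN.findall(tag_stream)
    return " ".join(tags + [asr_text.strip()]).strip()


merged = merge_channels("Listen, I know you're in a meeting", "[whispers] [pause]")
print(merged)
```

Keeping the merge as a pure string operation means either channel can be swapped out or disabled without touching the other.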

Performance (50-sample test set, greedy)

Metric	Base alone	This adapter (Recipe I)
Tag F1	22%	28%
Tag Recall	22%	51%
Tag Precision	100% (rarely emits)	34% (over-emits)
Tag Hallucination	0%	61%
WER (text)	10%	n/a — adapter doesn't emit text

Tag Recall more than doubled versus the base model (22% → 51%), at the cost of precision. An audio-shuffle diagnostic confirmed that predictions follow the audio: Tag F1 against the audio-source reference exactly matched the in-position F1, so predictions are audio-grounded rather than text-pattern shortcuts.
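
The card does not spell out how the tag metrics are computed. The sketch below shows one common per-utterance definition (multiset overlap of bracket-form tags), which is an assumption and not necessarily what the evaluation script does. The conventions for empty predictions or references are also assumptions, and the hallucination metric is left out because its definition is not given here.

```python
import re
from collections import Counter

TAG_PATTERN = re.compile(r"\[[^\]]+\]")


def tag_prf(predicted: str, reference: str) -> tuple[float, float, float]:
    """Multiset precision, recall, and F1 over bracket-form tags for one utterance."""
    pred = Counter(TAG_PATTERN.findall(predicted))
    ref = Counter(TAG_PATTERN.findall(reference))
    overlap = sum((pred & ref).values())  # tags present in both, counted with multiplicity
    n_pred, n_ref = sum(pred.values()), sum(ref.values())
    # Assumed convention: an empty prediction or reference scores 0.0.
    precision = overlap / n_pred if n_pred else 0.0
    recall = overlap / n_ref if n_ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


# Over-emission pattern: every reference tag is found, plus one extra tag.
p, r, f1 = tag_prf("[sighs] [pause] [laughs]", "[sighs] hello [pause]")
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Under this definition an over-emitting model trades precision for recall, which is the pattern in the table above.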

Quick start

import torch
from transformers import VoxtralRealtimeForConditionalGeneration, AutoProcessor
from peft import PeftModel

processor = AutoProcessor.from_pretrained("mistralai/Voxtral-Mini-4B-Realtime-2602")
base = VoxtralRealtimeForConditionalGeneration.from_pretrained(
    "mistralai/Voxtral-Mini-4B-Realtime-2602",
    dtype=torch.bfloat16,
    device_map="auto",
)
tag_model = PeftModel.from_pretrained(base, "YongkangZOU/evoxtral-realtime-sft")
tag_model.eval()
# Use `base` for ASR text, `tag_model` for tag stream — see serve_modal.py for the full hybrid.

For end-to-end use (POST audio file, get merged [tag1] [tag2] text output), the project repo ships a Modal-deployed FastAPI server with parallel forward + top-K filter built in.

Training details

Base: mistralai/Voxtral-Mini-4B-Realtime-2602
Schema: v1-style packed targets. Target tokens sit consecutively at [p_len, p_len+N), EOS at p_len+N, and post-EOS labels are -100. No silence-position training (unlike the v3 distributed schema, which dilutes sparse tag-only targets at a 60:1 stream_pad-to-content ratio).
Target: tag-only (re.findall(r'\[[^\]]+\]', tagged_text) — text content stripped, only bracket-form tags remain).
LoRA: r=16, α=64, RS-LoRA, dropout=0.05, attention-only (q_proj, k_proj, v_proj, o_proj).
Frozen: audio_tower, multi_modal_projector, time_embedding (frozen after get_peft_model() to avoid PEFT re-enabling norm grads).
Optimizer: AdamW, lr=2e-5, cosine schedule, warmup=50 steps, weight_decay=0.01, max_grad_norm=1.0.
Training: bf16, 3 epochs (153 steps), batch=2, grad_accum=8, effective batch=16, gradient checkpointing, NEFTune α=5.0.
Hardware: Modal A100-40GB, ~7 min runtime.
Trainable params: 12 M of 4.5 B (0.27%).
Dataset: ~810 audio clips (TTS-synthesized via ElevenLabs v3) reused from the predecessor Evoxtral 3B project.
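
The Target and Schema entries above can be sketched as follows. This is an illustration under stated assumptions, not the training script: the function names, the space-joined tag string, the toy token ids, and the masking of prompt positions before p_len are all assumptions. Only the regex, the [p_len, p_len+N) layout, the EOS position, and the -100 post-EOS labels come from the card.

```python
import re

IGNORE_INDEX = -100  # label value ignored by the cross-entropy loss


def extract_tags(tagged_text: str) -> str:
    """Strip text content and keep only bracket-form tags (space-joined by assumption)."""
    return " ".join(re.findall(r"\[[^\]]+\]", tagged_text))


def pack_labels(target_ids: list[int], p_len: int, seq_len: int, eos_id: int) -> list[int]:
    """Place N target tokens at [p_len, p_len+N), EOS at p_len+N, and mask every other position."""
    n = len(target_ids)
    if p_len + n + 1 > seq_len:
        raise ValueError("sequence too short for prompt + targets + EOS")
    labels = [IGNORE_INDEX] * seq_len  # prompt positions masked too (assumption)
    labels[p_len:p_len + n] = target_ids
    labels[p_len + n] = eos_id
    return labels


print(extract_tags("[whispers] [pause] Listen, I know you're in a meeting"))
# Toy ids: three target tokens after a two-position prompt, EOS id 2 is hypothetical.
print(pack_labels([11, 12, 13], p_len=2, seq_len=8, eos_id=2))
```

Because every supervised position is a tag token or EOS, none of the loss is spent on padding positions, which is the property the packed layout is chosen for.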

Why this schema worked when others didn't

We iterated through five recipes. Three failed in different ways before this one worked. See prior_work.md for the full Phase 1-3 matrix. Highlights:

Phase	Schema	Outcome
1	matching-shape (v1), full text	bimodal cliff: under-fit (no tags learned) or over-fit (WER 122%, hallucinated content)
2	distributed targets (v2/v3), full text	greedy hits premature EOS, sampling helps but Tag F1 caps at 27%
3a	distributed (v3) + tags-only	model only emits `[` then long stream_pad runs (60:1 stream_pad-vs-content signal dilution)
3b (this)	packed (v1-style) + tags-only	clean tag emission, audio-grounded, +29pp Recall

The architectural insight: removing the ASR-routing burden (tags-only target) lets the LoRA's full 12M-param capacity go to the audio→tag mapping, and the packed layout avoids the stream_pad dilution that kills sparse-target training under v3.

Limitations

Over-emission. Raw output contains ~5-7 tags per utterance even when only 1-2 are correct. Mitigated by an inference-time top_k=2 filter, which raises Tag F1 from 28% to 29% and Precision from 34% to 47%.
Default-emit fallback. On uncertain audio, the model emits [calm] [pause] [clears throat] as a fallback set. This is a data-side limitation: TTS-synthesized affect signal is too weak to differentiate ambiguous inputs. RAFT does not eliminate it; the RL version only trims it slightly.
TTS dataset. Trained on ElevenLabs-synthesized audio. Real clinical recordings (long pauses, distressed affect, room noise) are out of distribution.
Tag taxonomy fixed. 15 base tags from tag_taxonomy.py (6 emotion + 5 nonverbal + 3 delivery + 1 pause). Out-of-taxonomy concepts won't be tagged.
English only. Training data is English; multilingual affect transfer is untested.
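
A minimal sketch of a top_k=2 filter of the kind mentioned under Over-emission. The card does not say how tags are ranked, so the per-tag score (for example, a mean token log-probability) and the helper name top_k_tags are hypothetical; the filter in serve_modal.py may work differently.

```python
def top_k_tags(scored_tags: list[tuple[str, float]], k: int = 2) -> list[str]:
    """Keep the k highest-scoring tags, returned in their original emission order.

    scored_tags holds (tag, score) pairs; how the score is obtained is an
    assumption and is left to the caller.
    """
    # Rank positions by score, highest first (ties keep emission order).
    ranked = sorted(range(len(scored_tags)), key=lambda i: scored_tags[i][1], reverse=True)
    keep = sorted(ranked[:k])  # restore emission order for the survivors
    return [scored_tags[i][0] for i in keep]


raw = [("[calm]", -1.2), ("[whispers]", -0.1), ("[pause]", -0.4), ("[clears throat]", -2.0)]
print(top_k_tags(raw, k=2))
```

Preserving emission order keeps the filtered tags aligned with where they occurred in the utterance.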

License

Apache-2.0, matching the base Voxtral Realtime license.

Citation

@software{evoxtral_realtime_2026,
  title  = {Evoxtral-Realtime: Backchannel-style affect-tag adapter for Voxtral-Mini-4B-Realtime},
  author = {Yongkang Zou},
  year   = {2026},
  url    = {https://github.com/Tame-Your-Monkey/evoxtral-realtime}
}

@misc{voxtral_mini_realtime,
  author = {Mistral AI},
  title  = {Voxtral-Mini-4B-Realtime-2602},
  year   = {2026},
  url    = {https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-2602}
}
