Evoxtral-Realtime RL (Recipe I + RAFT — production default)

LoRA adapter on top of mistralai/Voxtral-Mini-4B-Realtime-2602 that emits ElevenLabs-style expressive tags ([whispers], [sighs], [laughs], [pause], etc.) from audio. This is the production default for the half-duplex AI-therapist Mode B hybrid pipeline. RAFT-polished version of evoxtral-realtime-sft.

What changed vs SFT

This adapter starts from the SFT checkpoint and runs Stage 2 RAFT (Reward rAnked FineTuning, Dong et al. 2023):

Generate — sample N=4 completions per training input from the SFT model at temperature=0.7 (3232 total samples).
Score — rule-based reward 0.4 × wer_accuracy + 0.4 × tag_f1 + 0.2 × (1 − hallucination_rate).
Curate — keep the highest-reward completion per sample, drop the bottom 10%. ~727 curated samples remain.
SFT-on-curated — 1 epoch (46 steps) at lr=5e-5 from the SFT checkpoint.
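The score and curate steps above can be sketched in a few lines. This is a minimal illustration, not the code in rl_modal.py: the `Completion` container, the set-based tag F1, and the definition of hallucination rate as "share of predicted tags absent from the reference" are all assumptions made for the example. Only the 0.4 / 0.4 / 0.2 weights, best-of-N selection, and the bottom-10% cut come from this card.

```python
from dataclasses import dataclass


@dataclass
class Completion:
    """Hypothetical container for one sampled completion."""
    sample_id: int              # index of the training input
    pred_tags: list[str]        # tags emitted by the sampled completion
    ref_tags: list[str]         # reference tags for that input
    wer_accuracy: float = 0.0   # 1 - WER; tag-only output carries no text


def tag_f1(pred: list[str], ref: list[str]) -> float:
    """Set-based F1 between predicted and reference tags (assumed definition)."""
    p, r = set(pred), set(ref)
    tp = len(p & r)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(p), tp / len(r)
    return 2 * precision * recall / (precision + recall)


def hallucination_rate(pred: list[str], ref: list[str]) -> float:
    """Share of predicted tags that are not in the reference (assumed definition)."""
    p = set(pred)
    return len(p - set(ref)) / len(p) if p else 0.0


def reward(c: Completion) -> float:
    """Weighted rule-based reward with the weights stated in the card."""
    return (0.4 * c.wer_accuracy
            + 0.4 * tag_f1(c.pred_tags, c.ref_tags)
            + 0.2 * (1.0 - hallucination_rate(c.pred_tags, c.ref_tags)))


def curate(completions: list[Completion], drop_fraction: float = 0.10) -> list[Completion]:
    """Keep the best completion per input, then drop the lowest-reward fraction."""
    best: dict[int, Completion] = {}
    for c in completions:
        if c.sample_id not in best or reward(c) > reward(best[c.sample_id]):
            best[c.sample_id] = c
    ranked = sorted(best.values(), key=reward, reverse=True)
    keep = int(len(ranked) * (1.0 - drop_fraction))
    return ranked[:keep]
```

With 808 inputs this keeps 727 completions, which matches the curated-set size reported here. Every term in the reward is a ratio, so a completion that emits many tags is not penalised for count alone.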

Effect vs SFT alone: hallucination rate drops from 61% to 57% raw (−4pp), and to 53% with the top_k=2 filter (−8pp); slightly fewer tags are emitted on average, and Tag F1 / Recall stay roughly flat. RAFT is marginal here because the rule-based reward lacks an absolute anti-overemit term: it ranks by the rate of wrong tags, not their total count, so over-emitting fallback patterns survive curation. See the project's prior_work.md Phase 4 for the full diagnosis.

Architecture: Moshi-style backchannel

This adapter is tag-only — it does NOT produce ASR text. Pair with frozen base for ASR; merge outputs at inference:

audio ─┬─ base Voxtral-Mini-4B-Realtime-2602 ─→ ASR text (clean WER ~10%)
       └─ this adapter (LoRA + RAFT)        ─→ tag stream → top_k=2 filter

merged: "[whispers] [pause] Listen, I know you're in a meeting"

The dual-channel pattern is inspired by Moshi's parallel-stream design, adapted to Voxtral Realtime's element-wise audio-text fusion architecture. Reference Mode B implementation: serve_modal.py (Modal-deployed FastAPI, two model instances on a single A100-40GB, parallel forward via asyncio.gather, top-K filter, JSON merged output).
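The filter-and-merge step at the end of the diagram might look like the sketch below. How serve_modal.py actually ranks tags is not described in this card, so the per-tag confidence scores, the function names, and the "keep the k highest-scoring tags in emission order" rule are assumptions for illustration. The output keys mirror the text / tags_filtered / merged fields mentioned in the Quick start section.

```python
def top_k_filter(scored_tags: list[tuple[str, float]], k: int = 2) -> list[str]:
    """Keep the k highest-scoring tags, preserving their emission order.

    `scored_tags` is a hypothetical list of (tag, confidence) pairs.
    """
    by_score = sorted(range(len(scored_tags)),
                      key=lambda i: scored_tags[i][1], reverse=True)
    keep = sorted(by_score[:k])  # restore emission order
    return [scored_tags[i][0] for i in keep]


def merge(text: str, tags: list[str]) -> str:
    """Prefix the filtered tags to the ASR text from the frozen base model."""
    return " ".join([*tags, text]).strip()


def hybrid_response(text: str, scored_tags: list[tuple[str, float]], k: int = 2) -> dict:
    """Assemble a JSON-style payload from the two channels."""
    tags = top_k_filter(scored_tags, k)
    return {"text": text, "tags_filtered": tags, "merged": merge(text, tags)}


# Illustrative values only; the scores are made up for the example.
scored = [("[whispers]", 0.9), ("[calm]", 0.2), ("[pause]", 0.7), ("[clears throat]", 0.1)]
response = hybrid_response("Listen, I know you're in a meeting", scored, k=2)
# response["merged"] -> "[whispers] [pause] Listen, I know you're in a meeting"
```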

Performance (50-sample test set, greedy)

Metric	Base	SFT only	RL (this) raw	RL + top_k=2 filter (production)
Tag F1	22%	28%	28%	29% ⭐
Tag Recall	22%	51%	50%	42%
Tag Precision	100%	34%	37%	47%
Tag Hallucination	0%	61%	57%	53%
WER (text from base)	10%	n/a	n/a	10% (unchanged)

Production config = this adapter + base for ASR + top_k=2 inference filter, i.e. the rightmost column of the table above.
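To make the precision/recall trade-off in the table concrete, here is a small sketch of how tag metrics of this kind could be computed. The averaging scheme used for the numbers above is not stated in this card; the version below micro-averages set-based counts over the whole test set, which is an assumption and may not reproduce the reported figures exactly.

```python
def tag_metrics(predictions: list[list[str]], references: list[list[str]]) -> dict:
    """Micro-averaged tag metrics over paired per-utterance tag lists."""
    tp = fp = fn = 0
    for pred, ref in zip(predictions, references):
        p, r = set(pred), set(ref)
        tp += len(p & r)   # correct tags
        fp += len(p - r)   # emitted but not in the reference
        fn += len(r - p)   # in the reference but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    hallucination = fp / (tp + fp) if tp + fp else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "hallucination": hallucination}


# Toy data: trimming the output raises precision and lowers recall.
refs = [["[sighs]", "[pause]"], ["[laughs]"]]
raw = [["[sighs]", "[pause]", "[calm]", "[clears throat]"],
       ["[calm]", "[laughs]", "[pause]"]]
filtered = [["[sighs]", "[pause]"], ["[calm]", "[pause]"]]
raw_metrics = tag_metrics(raw, refs)
filtered_metrics = tag_metrics(filtered, refs)
```

Under this definition hallucination is the complement of precision, which is consistent with the filtered column (47% and 53%) and approximately with the others.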

Quick start

import torch
from transformers import VoxtralRealtimeForConditionalGeneration, AutoProcessor
from peft import PeftModel

processor = AutoProcessor.from_pretrained("mistralai/Voxtral-Mini-4B-Realtime-2602")
base = VoxtralRealtimeForConditionalGeneration.from_pretrained(
    "mistralai/Voxtral-Mini-4B-Realtime-2602",
    dtype=torch.bfloat16,
    device_map="auto",
)
tag_model = PeftModel.from_pretrained(base, "YongkangZOU/evoxtral-realtime-rl")
tag_model.eval()
# Use `base` for ASR text, `tag_model` for tag stream — see serve_modal.py for the full hybrid.

For end-to-end use (POST audio file → JSON with text, tags_filtered, merged), the project repo ships a Modal-deployed FastAPI server with parallel forward + top-K filter built in.

Training details

Stage 1 inheritance — see the SFT card for: v1-style packed schema, tags-only target, LoRA r=16/α=64 attention-only, frozen audio path.

Stage 2 RAFT additions:

Method: RAFT (rejection sampling + plain SFT). No critic, no KL clipping, no learned reward model.
Generation: N=4 × 808 train samples = 3232 completions, temperature=0.7, top_p=0.9, max_new_tokens=64. ~33 min on A100-40.
Reward function: 0.4 × (1 − WER) + 0.4 × tag_f1 + 0.2 × (1 − hallucination_rate). Rule-based. For this backchannel adapter the (1 − WER) term is a constant 0 because predictions contain no text, so the reward effectively scores tag quality only and tops out at 0.6.
Curated set: 727 samples after bottom-10% reward filter.
SFT-on-curated: 1 epoch (46 steps), lr=5e-5, cosine schedule, warmup=20, gradient_checkpointing=False (PeftModel.from_pretrained + checkpointing crashes on the in-place audio add — see project cheat-sheet).
Trainable: 16.2 M of 4.5 B (0.36%). Slightly higher than SFT due to PeftModel.from_pretrained loading.
Hardware: Modal A100-40GB, bf16, ~3 min runtime.
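For a sense of what the learning rate does over such a short run, the sketch below evaluates a linear-warmup plus cosine-decay schedule with the values listed above (peak 5e-5, 46 steps, warmup 20). It follows the shape of Hugging Face's `get_cosine_schedule_with_warmup` with its default half-cycle; that the project uses exactly this scheduler, and that warmup=20 is counted in steps, are assumptions.

```python
import math


def lr_at(step: int, peak_lr: float = 5e-5,
          warmup_steps: int = 20, total_steps: int = 46) -> float:
    """Learning rate at `step` for linear warmup followed by cosine decay."""
    if step < warmup_steps:
        # Linear ramp from 0 to peak_lr.
        return peak_lr * step / max(1, warmup_steps)
    # Cosine decay from peak_lr to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


schedule = [lr_at(s) for s in range(47)]
```

Under these assumptions the peak is reached at step 20, so roughly 43% of the 46 steps are spent warming up and only 26 steps run on the decaying part of the curve.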

RAFT pitfalls discovered along the way

The RAFT pipeline (rl_modal.py in the project repo) needed five fixes vs the original Stage 2 design before it ran clean. Documented here for future RAFT-on-Voxtral-Realtime users:

Audio pre-pad missing — generation must pre-pad raw audio to AUDIO_MAX_SAMPLES=240_480 to match the train/eval audio path.
Mel mod-8 padding missing — encoder reshape requires T_mel % 8 == 0.
max_new_tokens=512 excessive for backchannel — tag-only outputs are ~5-10 tokens; reduced to 64.
num_delay_tokens scalar tensor breaks num_return_sequences > 1 in HF generate's _expand_inputs_for_generation. Drop the key before calling generate.
PeftModel.from_pretrained + gradient_checkpointing=True crashes on the in-place audio add at modeling_voxtral_realtime.py:1078. PeftModel.from_pretrained doesn't auto-freeze base params (unlike get_peft_model), and the checkpointing hook combined with frozen embeddings makes inputs_embeds a leaf-with-grad. Disable gradient_checkpointing for RAFT.
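The first four fixes are all input-preparation steps, and can be sketched without any model code. The helper names below are hypothetical, plain lists and dicts stand in for arrays and processor outputs, and both the padding side (right, with zeros) and the truncation of over-long clips are assumptions; only AUDIO_MAX_SAMPLES, the mod-8 constraint, the dropped key, and the sampling values come from this card.

```python
AUDIO_MAX_SAMPLES = 240_480  # fixed raw-audio length used by the train/eval path


def pad_audio(samples: list[float], target: int = AUDIO_MAX_SAMPLES) -> list[float]:
    """Zero-pad (right side, assumed) or truncate (assumed) to exactly `target` samples."""
    clipped = list(samples[:target])
    return clipped + [0.0] * (target - len(clipped))


def mel_pad_frames(t_mel: int, multiple: int = 8) -> int:
    """Extra mel frames needed so that (t_mel + pad) % multiple == 0."""
    return (-t_mel) % multiple


def prepare_generate_kwargs(inputs: dict, n: int = 4) -> dict:
    """Copy processor outputs and add sampling settings for RAFT generation."""
    kwargs = dict(inputs)
    # The scalar tensor under this key breaks num_return_sequences > 1.
    kwargs.pop("num_delay_tokens", None)
    kwargs.update(do_sample=True, temperature=0.7, top_p=0.9,
                  max_new_tokens=64, num_return_sequences=n)
    return kwargs
```

In a real pipeline the same logic would operate on NumPy arrays or tensors, and the resulting dict would be unpacked into `model.generate(**kwargs)`.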

See the hard-won facts cheat-sheet for the full set of Voxtral Realtime training pitfalls.

Limitations

Default-emit fallback persists. On uncertain audio, the model still emits [calm] [pause] [clears throat] as a default set. RAFT trims this slightly but doesn't eliminate it. This is a data-side limitation: the TTS-synthesized affect signal is too weak to differentiate ambiguous inputs.
Best with top_k=2 filter. Raw output over-emits, at ~4-6 tags per utterance. The inference-time top-K filter is the production config.
TTS dataset. Trained on ElevenLabs-synthesized audio. Real clinical recordings are out of distribution.
Tag taxonomy fixed. 15 base tags. Out-of-taxonomy concepts won't be tagged.
English only.

License

Apache-2.0, matching the base Voxtral Realtime license.

Citation

@software{evoxtral_realtime_2026,
  title  = {Evoxtral-Realtime: RAFT-polished backchannel adapter for Voxtral-Mini-4B-Realtime},
  author = {Yongkang Zou},
  year   = {2026},
  url    = {https://github.com/Tame-Your-Monkey/evoxtral-realtime}
}

@misc{voxtral_mini_realtime,
  author = {Mistral AI},
  title  = {Voxtral-Mini-4B-Realtime-2602},
  year   = {2026},
  url    = {https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-2602}
}

@misc{dong2023raft,
  title  = {RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment},
  author = {Dong, Hanze and Xiong, Wei and Goyal, Deepanshu and Pan, Rui and Diao, Shizhe and Zhang, Jipeng and Shum, Kashun and Zhang, Tong},
  year   = {2023},
  eprint = {2304.06767},
  url    = {https://arxiv.org/abs/2304.06767}
}
