Evoxtral-Realtime RL (Recipe I + RAFT, production default)

LoRA adapter on top of mistralai/Voxtral-Mini-4B-Realtime-2602 that emits ElevenLabs-style expressive tags ([whispers], [sighs], [laughs], [pause], etc.) from audio. This is the production default for the half-duplex AI-therapist Mode B hybrid pipeline. It is the RAFT-polished version of evoxtral-realtime-sft.

What changed vs SFT

This adapter starts from the SFT checkpoint and runs Stage 2 RAFT (Reward rAnked FineTuning, Dong et al. 2023):

  1. Generate: sample N=4 completions per training input from the SFT model at temperature=0.7 (3232 completions in total).
  2. Score: rule-based reward 0.4 × wer_accuracy + 0.4 × tag_f1 + 0.2 × (1 − hallucination_rate).
  3. Curate: keep the highest-reward completion per training input, then drop the bottom 10% by reward. ~727 curated samples remain.
  4. SFT-on-curated: 1 epoch (46 steps) at lr=5e-5 from the SFT checkpoint.
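The Curate step above can be sketched as follows. This is a minimal illustration, not the code in rl_modal.py: the `Completion` record, the `curate` helper, and the choice to round the 10% cut up are all assumptions made for the example, and rewards are taken as already computed.

```python
import math
from typing import NamedTuple


class Completion(NamedTuple):
    sample_id: int  # index of the training input this completion was sampled for
    text: str       # generated tag string
    reward: float   # rule-based reward, assumed to be precomputed


def curate(completions: list[Completion], drop_frac: float = 0.10) -> list[Completion]:
    """Best-of-N per training input, then drop the lowest-reward fraction of the winners."""
    best: dict[int, Completion] = {}
    for c in completions:
        # Keep only the highest-reward completion for each training input.
        if c.sample_id not in best or c.reward > best[c.sample_id].reward:
            best[c.sample_id] = c
    ranked = sorted(best.values(), key=lambda c: c.reward, reverse=True)
    n_drop = math.ceil(len(ranked) * drop_frac)  # assumption: round the cut up
    return ranked[: len(ranked) - n_drop]
```

With 808 training inputs and N=4, the best-of-N pass leaves 808 candidates and the cut removes 81 of them, which matches the ~727 figure quoted above.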

Effect vs SFT alone: hallucination rate drops from 61% to 57% on raw output (−4pp) and to 53% with the top_k=2 filter (−8pp); slightly fewer tags are emitted on average; Tag F1 and Recall stay roughly flat on raw output. RAFT is marginal here because the rule-based reward lacks an absolute anti-overemit term: it ranks by the rate of wrong tags, not their total count, so over-emitting fallback patterns survive curation. See the project's prior_work.md Phase 4 for the full diagnosis.
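A small worked example makes the blind spot concrete. The metric definitions below are assumptions chosen for illustration (multiset overlap for hits, hallucination rate as wrong tags divided by emitted tags); the project's actual scoring code may differ in detail. The WER term is set to 0, as it is for this tag-only adapter.

```python
from collections import Counter


def tag_scores(pred: list[str], ref: list[str]) -> tuple[float, float]:
    """Return (tag_f1, hallucination_rate) for one utterance (assumed definitions)."""
    hits = sum((Counter(pred) & Counter(ref)).values())  # predicted tags also in the reference
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    hall_rate = (len(pred) - hits) / len(pred) if pred else 0.0
    return f1, hall_rate


def reward(pred: list[str], ref: list[str], wer_accuracy: float = 0.0) -> float:
    """Rule-based reward with the weights from the Score step."""
    f1, hall_rate = tag_scores(pred, ref)
    return 0.4 * wer_accuracy + 0.4 * f1 + 0.2 * (1.0 - hall_rate)


ref = ["whispers", "pause"]
short = ["whispers", "calm"]                           # 1 wrong tag out of 2
long = ["whispers", "pause", "calm", "clears throat"]  # 2 wrong tags out of 4

print(tag_scores(short, ref), reward(short, ref))
print(tag_scores(long, ref), reward(long, ref))
```

Both candidates have a hallucination rate of 0.5, so the penalty term is identical even though the longer one emits twice as many wrong tags. The longer one also gains recall, so under these definitions it receives the higher reward and is the one that survives best-of-N selection.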

Architecture: Moshi-style backchannel

This adapter is tag-only: it does NOT produce ASR text. Pair it with the frozen base model for ASR and merge the two outputs at inference:

audio ─┬─ base Voxtral-Mini-4B-Realtime-2602 ─→ ASR text (clean WER ~10%)
       └─ this adapter (LoRA + RAFT)        ─→ tag stream → top_k=2 filter

merged: "[whispers] [pause] Listen, I know you're in a meeting"

The dual-channel pattern is inspired by Moshi's parallel-stream design, adapted to Voxtral Realtime's element-wise audio-text fusion architecture. Reference Mode B implementation: serve_modal.py (Modal-deployed FastAPI, two model instances on a single A100-40, parallel forward via asyncio.gather, top-K filter, JSON merged output).
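The control flow of the dual-channel pattern can be sketched with stubs. Nothing below is taken from serve_modal.py: `run_asr`, `run_tagger`, `top_k_filter`, and `transcribe` are hypothetical names, the two model calls are replaced by canned outputs, and the filter rule (keep the first k distinct tags in emission order) is an assumption, since the card does not say how the top-K tags are chosen.

```python
import asyncio


async def run_asr(audio: bytes) -> str:
    """Stub for the frozen base model's ASR pass (hypothetical)."""
    await asyncio.sleep(0)  # stands in for the real, awaitable inference call
    return "Listen, I know you're in a meeting"


async def run_tagger(audio: bytes) -> list[str]:
    """Stub for the adapter's tag stream (hypothetical), in emission order."""
    await asyncio.sleep(0)
    return ["[whispers]", "[pause]", "[calm]", "[pause]"]


def top_k_filter(tags: list[str], k: int = 2) -> list[str]:
    """Assumed rule: keep the first k distinct tags in emission order."""
    kept: list[str] = []
    for tag in tags:
        if len(kept) >= k:
            break
        if tag not in kept:
            kept.append(tag)
    return kept


async def transcribe(audio: bytes, k: int = 2) -> dict:
    # Both channels receive the same audio and are awaited together.
    text, tags = await asyncio.gather(run_asr(audio), run_tagger(audio))
    tags_filtered = top_k_filter(tags, k)
    merged = " ".join(tags_filtered + [text])
    return {"text": text, "tags_filtered": tags_filtered, "merged": merged}


result = asyncio.run(transcribe(b"\x00" * 16))
print(result["merged"])  # [whispers] [pause] Listen, I know you're in a meeting
```

Note that `asyncio.gather` only overlaps work that actually yields to the event loop. A blocking GPU forward pass would need to be moved off the loop (for example with `asyncio.to_thread`) to run in parallel; how the reference server handles this is not covered here.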

Performance (50-sample test set, greedy)

Metric                 Base   SFT only   RL (this) raw   RL + top_k=2 filter (production)
Tag F1                 22%    28%        28%             29% ⭐
Tag Recall             22%    51%        50%             42%
Tag Precision          100%   34%        37%             47%
Tag Hallucination      0%     61%        57%             53%
WER (text from base)   10%    n/a        n/a             10% (unchanged)

Production config = this adapter + base for ASR + top_k=2 inference filter, i.e. the rightmost column of the table above.

Quick start

import torch
from transformers import VoxtralRealtimeForConditionalGeneration, AutoProcessor
from peft import PeftModel

processor = AutoProcessor.from_pretrained("mistralai/Voxtral-Mini-4B-Realtime-2602")
base = VoxtralRealtimeForConditionalGeneration.from_pretrained(
    "mistralai/Voxtral-Mini-4B-Realtime-2602",
    dtype=torch.bfloat16,
    device_map="auto",
)
tag_model = PeftModel.from_pretrained(base, "YongkangZOU/evoxtral-realtime-rl")
tag_model.eval()
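# PEFT injects the LoRA layers into `base` in place, so `base` is no longer a clean ASR model here.
# For ASR text, generate inside `with tag_model.disable_adapter():` or load a second, untouched base instance.
# Use `tag_model` for the tag stream; see serve_modal.py for the full hybrid.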

For end-to-end use (POST audio file → JSON with text, tags_filtered, merged), the project repo ships a Modal-deployed FastAPI server with parallel forward + top-K filter built in.

Training details

Stage 1 inheritance: see the SFT card for the v1-style packed schema, tags-only target, LoRA r=16/α=64 attention-only, and frozen audio path.

Stage 2 RAFT additions:

  • Method: RAFT (rejection sampling + plain SFT). No critic, no KL clipping, no learned reward model.
  • Generation: N=4 × 808 train samples = 3232 completions, temperature=0.7, top_p=0.9, max_new_tokens=64. ~33 min on A100-40.
  • Reward function: 0.4 × (1 − WER) + 0.4 × tag_f1 + 0.2 × (1 − hall_rate) (rule-based; for the backchannel adapter the WER term is a constant 0 since the prediction has no text content, so the reward effectively scores tag quality).
  • Curated set: 727 samples after bottom-10% reward filter.
  • SFT-on-curated: 1 epoch (46 steps), lr=5e-5, cosine schedule, warmup=20, gradient_checkpointing=False (PeftModel.from_pretrained + checkpointing crashes on the in-place audio add; see project cheat-sheet).
  • Trainable: 16.2 M of 4.5 B (0.36%). Slightly higher than SFT due to PeftModel.from_pretrained loading.
  • Hardware: Modal A100-40GB, bf16, ~3 min runtime.
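The figures in this list can be cross-checked with a few lines of arithmetic, and the learning-rate settings can be turned into a per-step curve. The schedule below is a sketch that assumes linear warmup followed by a half-cosine decay to zero, the common formulation of "cosine with warmup"; the trainer's exact schedule is not stated on this card. The effective batch size of 16 is likewise an inference from 46 steps over 727 samples, not a documented setting.

```python
import math

N_TRAIN, N_PER_INPUT = 808, 4
n_completions = N_TRAIN * N_PER_INPUT                # 3232
n_curated = N_TRAIN - math.ceil(N_TRAIN * 0.10)      # 727 (assumes the cut is rounded up)
trainable_pct = round(16.2e6 / 4.5e9 * 100, 2)       # 0.36
steps_at_bs16 = math.ceil(n_curated / 16)            # 46, if the effective batch size is 16


def lr_at(step: int, peak_lr: float = 5e-5, warmup: int = 20, total: int = 46) -> float:
    """Linear warmup to peak_lr, then half-cosine decay to zero (assumed shape)."""
    if step < warmup:
        return peak_lr * step / max(1, warmup)
    progress = (step - warmup) / max(1, total - warmup)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


print(n_completions, n_curated, trainable_pct, steps_at_bs16)
for step in (0, 10, 20, 33, 46):
    print(step, f"{lr_at(step):.2e}")
```

Under these assumptions, 20 of the 46 steps are warmup, so the run spends only 26 steps at or below the peak rate on the decaying part of the curve.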

RAFT pitfalls discovered along the way

The RAFT pipeline (rl_modal.py in the project repo) needed five fixes vs the original Stage 2 design before it ran clean. Documented here for future RAFT-on-Voxtral-Realtime users:

  1. Audio pre-pad missing: generation must pre-pad raw audio to AUDIO_MAX_SAMPLES=240_480 to match the train/eval audio path.
  2. Mel mod-8 padding missing: the encoder reshape requires T_mel % 8 == 0.
  3. max_new_tokens=512 excessive for backchannel: tag-only outputs are ~5-10 tokens; reduced to 64.
  4. The num_delay_tokens scalar tensor breaks num_return_sequences > 1 in HF generate's _expand_inputs_for_generation. Drop the key before calling generate.
  5. PeftModel.from_pretrained + gradient_checkpointing=True crashes on the in-place audio add at modeling_voxtral_realtime.py:1078. PeftModel.from_pretrained doesn't auto-freeze base params (unlike get_peft_model), and the checkpointing hook combined with frozen embeddings makes inputs_embeds a leaf-with-grad. Disable gradient_checkpointing for RAFT.
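Fixes 1, 2, and 4 come down to small shape and dictionary manipulations, sketched here on plain Python values so the logic is visible without a model. The helper names are hypothetical, zero-padding at the end of the signal is an assumption (the card does not say which side is padded), and real code would operate on tensors rather than lists.

```python
AUDIO_MAX_SAMPLES = 240_480  # value quoted in fix 1
MEL_MULTIPLE = 8             # encoder reshape requirement from fix 2


def pad_audio(samples: list[float], target: int = AUDIO_MAX_SAMPLES) -> list[float]:
    """Zero-pad raw audio up to `target` samples; longer inputs are returned unchanged."""
    return samples + [0.0] * max(0, target - len(samples))


def padded_mel_length(t_mel: int, multiple: int = MEL_MULTIPLE) -> int:
    """Smallest frame count >= t_mel that is divisible by `multiple`."""
    return -(-t_mel // multiple) * multiple  # ceiling division, then scale back up


def strip_for_sampling(inputs: dict, num_return_sequences: int) -> dict:
    """Return a copy of the generate() kwargs without num_delay_tokens when sampling N > 1."""
    if num_return_sequences > 1:
        return {k: v for k, v in inputs.items() if k != "num_delay_tokens"}
    return dict(inputs)


print(len(pad_audio([0.1] * 1000)))
print(padded_mel_length(1503))
print(sorted(strip_for_sampling({"input_features": "...", "num_delay_tokens": 6}, 4)))
```

The placeholder values in the last call are illustrative only; the point is that the scalar entry is removed before the inputs reach `generate`, so the per-sequence expansion never sees it.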

See the hard-won facts cheat-sheet for the full set of Voxtral Realtime training pitfalls.

Limitations

  • Default-emit fallback persists. On uncertain audio, the model still emits [calm] [pause] [clears throat] as a default set. RAFT trims this slightly but doesn't eliminate it. This is a data-side limitation: the TTS-synthesized affect signal is too weak to differentiate ambiguous inputs.
  • Best with top_k=2 filter. Raw output over-emits ~4-6 tags per utterance. The inference-time top-K filter is the production config.
  • TTS dataset. Trained on ElevenLabs-synthesized audio. Real clinical recordings are out of distribution.
  • Tag taxonomy fixed. 15 base tags. Out-of-taxonomy concepts won't be tagged.
  • English only.

License

Apache-2.0, matching the base Voxtral Realtime license.

Citation

@software{evoxtral_realtime_2026,
  title  = {Evoxtral-Realtime: RAFT-polished backchannel adapter for Voxtral-Mini-4B-Realtime},
  author = {Yongkang Zou},
  year   = {2026},
  url    = {https://github.com/Tame-Your-Monkey/evoxtral-realtime}
}

@misc{voxtral_mini_realtime,
  author = {Mistral AI},
  title  = {Voxtral-Mini-4B-Realtime-2602},
  year   = {2026},
  url    = {https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-2602}
}

@misc{dong2023raft,
  title  = {RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment},
  author = {Dong, Hanze and Xiong, Wei and Goyal, Deepanshu and Pan, Rui and Diao, Shizhe and Zhang, Jipeng and Shum, Kashun and Zhang, Tong},
  year   = {2023},
  eprint = {2304.06767},
  url    = {https://arxiv.org/abs/2304.06767}
}