Evoxtral-Realtime RL (Recipe I + RAFT, production default)
LoRA adapter on top of mistralai/Voxtral-Mini-4B-Realtime-2602 that emits ElevenLabs-style expressive tags ([whispers], [sighs], [laughs], [pause], etc.) from audio. It is the RAFT-polished version of evoxtral-realtime-sft and the production default for the half-duplex AI-therapist Mode B hybrid pipeline.
What changed vs SFT
This adapter starts from the SFT checkpoint and runs Stage 2 RAFT (Reward rAnked FineTuning, Dong et al. 2023):
- Generate: sample N=4 completions per training input from the SFT model at temperature=0.7 (3232 total samples).
- Score: rule-based reward 0.4 × wer_accuracy + 0.4 × tag_f1 + 0.2 × (1 − hallucination_rate).
- Curate: keep the highest-reward completion per sample, then drop the bottom 10%. ~727 curated samples remain.
- SFT-on-curated: 1 epoch (46 steps) at lr=5e-5 from the SFT checkpoint.
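The Curate step above can be sketched in a few lines. This is a minimal illustration, not the code in the project's rl_modal.py: the `Completion` record, the `curate` function, and the floor rounding of the 10% cut are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Completion:
    sample_id: int   # index of the training input this completion belongs to
    text: str        # generated tag string
    reward: float    # rule-based reward, assumed to be computed already


def curate(completions: list[Completion], drop_frac: float = 0.10) -> list[Completion]:
    """Best-of-N per training input, then drop the lowest-reward fraction."""
    best: dict[int, Completion] = {}
    for c in completions:
        # Keep only the highest-reward completion for each training input.
        if c.sample_id not in best or c.reward > best[c.sample_id].reward:
            best[c.sample_id] = c
    ranked = sorted(best.values(), key=lambda c: c.reward, reverse=True)
    keep = int(len(ranked) * (1.0 - drop_frac))  # assumed rounding: floor
    return ranked[:keep]
```

Under these assumptions, 808 inputs with 4 completions each reduce to 808 best-of-N candidates and then to 727 curated samples, which matches the count reported above.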
Effect vs SFT alone: −5pp hallucination rate (61% → 53% with top_k=2 filter), slightly fewer tags emitted on average, Tag F1 / Recall ≈ flat. RAFT is marginal here because the rule-based reward lacks an absolute anti-overemit term: it ranks by the rate of wrong tags, not the total count, so over-emitting fallback patterns survive curation. See the project's prior_work.md Phase 4 for the full diagnosis.
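To illustrate the diagnosis above, here is a sketch of a rate-based reward. Only the 0.4 / 0.4 / 0.2 weights come from this card. The helper names (`tag_scores`, `reward`), the set-based tag comparison, and the definition of hallucination rate as wrong tags divided by emitted tags are assumptions.

```python
def tag_scores(pred: list[str], ref: list[str]) -> tuple[float, float]:
    """Return (tag_f1, hallucination_rate) for one utterance.

    Assumed definitions: tags are compared as sets, and the hallucination
    rate is the fraction of emitted tags that are not in the reference.
    """
    pred_set, ref_set = set(pred), set(ref)
    hits = len(pred_set & ref_set)
    precision = hits / len(pred_set) if pred_set else 0.0
    recall = hits / len(ref_set) if ref_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    hall_rate = 1.0 - precision if pred_set else 0.0
    return f1, hall_rate


def reward(pred: list[str], ref: list[str], wer_accuracy: float = 0.0) -> float:
    # Weights from the card; wer_accuracy defaults to 0 because a tag-only
    # completion carries no transcript text.
    f1, hall_rate = tag_scores(pred, ref)
    return 0.4 * wer_accuracy + 0.4 * f1 + 0.2 * (1.0 - hall_rate)


ref = ["whispers", "pause"]
short_pred = ["whispers", "calm"]                            # 1 of 2 tags wrong
long_pred = ["whispers", "pause", "calm", "clears throat"]   # 2 of 4 tags wrong

print(reward(short_pred, ref), reward(long_pred, ref))
```

Both completions have the same hallucination rate, yet the four-tag one scores higher because it gains recall. A best-of-N selection over such a reward therefore tends to keep the over-emitting completion.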
Architecture: Moshi-style backchannel
This adapter is tag-only: it does NOT produce ASR text. Pair it with the frozen base model for ASR and merge the outputs at inference:
audio ─┬─ base Voxtral-Mini-4B-Realtime-2602 ─→ ASR text (clean WER ~10%)
       └─ this adapter (LoRA + RAFT) ─→ tag stream → top_k=2 filter
merged: "[whispers] [pause] Listen, I know you're in a meeting"
The dual-channel pattern is inspired by Moshi's parallel-stream design, adapted to Voxtral Realtime's element-wise audio-text fusion architecture. Reference Mode B implementation: serve_modal.py (Modal-deployed FastAPI, two model instances on a single A100-40, parallel forward via asyncio.gather, top-K filter, JSON merged output).
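The Mode B flow can be sketched as follows. This is not the code in serve_modal.py: the two coroutines are stubs standing in for the real model calls, and the filter rule (keep the first `k` distinct tags in emission order) is an assumption, since this card does not spell out how the top-K filter ranks tags.

```python
import asyncio


async def run_asr(audio: bytes) -> str:
    # Stub for the frozen base model; returns a fixed transcript.
    return "Listen, I know you're in a meeting"


async def run_tagger(audio: bytes) -> list[str]:
    # Stub for the LoRA adapter; returns a raw, over-emitting tag stream.
    return ["whispers", "pause", "calm", "pause", "clears throat"]


def top_k_filter(tags: list[str], k: int = 2) -> list[str]:
    # Assumed rule: keep the first k distinct tags in emission order.
    seen: list[str] = []
    for tag in tags:
        if tag not in seen:
            seen.append(tag)
    return seen[:k]


async def transcribe(audio: bytes, k: int = 2) -> dict:
    # Run both channels concurrently, then put the kept tags before the text.
    text, raw_tags = await asyncio.gather(run_asr(audio), run_tagger(audio))
    tags = top_k_filter(raw_tags, k)
    merged = " ".join([f"[{t}]" for t in tags] + [text])
    return {"text": text, "tags_filtered": tags, "merged": merged}


result = asyncio.run(transcribe(b""))
print(result["merged"])
```

With real models, blocking forward passes would typically need to be moved off the event loop (for example with `asyncio.to_thread`) before `asyncio.gather` can overlap them.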
Performance (50-sample test set, greedy)
| Metric | Base | SFT only | RL (this) raw | RL + top_k=2 filter (production) |
|---|---|---|---|---|
| Tag F1 | 22% | 28% | 28% | 29% |
| Tag Recall | 22% | 51% | 50% | 42% |
| Tag Precision | 100% | 34% | 37% | 47% |
| Tag Hallucination | 0% | 61% | 57% | 53% |
| WER (text from base) | 10% | n/a | n/a | 10% (unchanged) |
Production config = this adapter + base for ASR + top_k=2 inference filter = the rightmost column of the table above.
Quick start
import torch
from transformers import VoxtralRealtimeForConditionalGeneration, AutoProcessor
from peft import PeftModel

processor = AutoProcessor.from_pretrained("mistralai/Voxtral-Mini-4B-Realtime-2602")
base = VoxtralRealtimeForConditionalGeneration.from_pretrained(
    "mistralai/Voxtral-Mini-4B-Realtime-2602",
    dtype=torch.bfloat16,
    device_map="auto",
)
# Attach the LoRA adapter for the tag stream and switch to inference mode.
tag_model = PeftModel.from_pretrained(base, "YongkangZOU/evoxtral-realtime-rl")
tag_model.eval()
# `tag_model` produces the tag stream. PEFT injects the LoRA layers into `base`
# in place, so for clean ASR text either load a second base instance (as
# serve_modal.py does) or run ASR inside `with tag_model.disable_adapter():`.
For end-to-end use (POST audio file → JSON with text, tags_filtered, merged), the project repo ships a Modal-deployed FastAPI server with parallel forward + top-K filter built in.
Training details
Stage 1 inheritance: see the SFT card for the v1-style packed schema, tags-only target, LoRA r=16/α=64 attention-only, and frozen audio path.
Stage 2 RAFT additions:
- Method: RAFT (rejection sampling + plain SFT). No critic, no KL clipping, no learned reward model.
- Generation: N=4 × 808 train samples = 3232 completions, temperature=0.7, top_p=0.9, max_new_tokens=64. ~33 min on A100-40.
- Reward function: 0.4 × (1 − WER) + 0.4 × tag_f1 + 0.2 × (1 − hall_rate) (rule-based; for the backchannel adapter the WER term is a constant 0 since the prediction has no text content, so the reward effectively scores tag quality).
- Curated set: 727 samples after bottom-10% reward filter.
- SFT-on-curated: 1 epoch (46 steps), lr=5e-5, cosine schedule, warmup=20, gradient_checkpointing=False (PeftModel.from_pretrained + checkpointing crashes on the in-place audio add; see project cheat-sheet).
- Trainable: 16.2 M of 4.5 B (0.36%). Slightly higher than SFT due to PeftModel.from_pretrained loading.
- Hardware: Modal A100-40GB, bf16, ~3 min runtime.
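For intuition on the schedule listed above, the sketch below evaluates a linear-warmup plus cosine-decay curve at the stated settings (46 steps, 20 warmup steps, peak lr 5e-5). It follows the common warmup-then-cosine shape. Whether the project used exactly this scheduler is an assumption, and `lr_at` is a hypothetical helper.

```python
import math

PEAK_LR = 5e-5
WARMUP_STEPS = 20
TOTAL_STEPS = 46


def lr_at(step: int) -> float:
    """Learning rate after `step` optimizer updates."""
    if step < WARMUP_STEPS:
        # Linear ramp from 0 up to the peak.
        return PEAK_LR * step / WARMUP_STEPS
    # Cosine decay from the peak down to 0 over the remaining steps.
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 10, 20, 33, 46):
    print(step, f"{lr_at(step):.2e}")
```

Under this shape, warmup takes up 20 of the 46 steps, so the adapter sits at the peak rate only briefly before decay begins.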
RAFT pitfalls discovered along the way
The RAFT pipeline (rl_modal.py in the project repo) needed five fixes vs the original Stage 2 design before it ran clean. Documented here for future RAFT-on-Voxtral-Realtime users:
- Audio pre-pad missing: generation must pre-pad raw audio to AUDIO_MAX_SAMPLES=240_480 to match the train/eval audio path.
- Mel mod-8 padding missing: encoder reshape requires T_mel % 8 == 0.
- max_new_tokens=512 excessive for backchannel: tag-only outputs are ~5-10 tokens; reduced to 64.
- num_delay_tokens scalar tensor breaks num_return_sequences > 1 in HF generate's _expand_inputs_for_generation. Drop the key before calling generate.
- PeftModel.from_pretrained + gradient_checkpointing=True crashes on the in-place audio add at modeling_voxtral_realtime.py:1078. PeftModel.from_pretrained doesn't auto-freeze base params (unlike get_peft_model), and the checkpointing hook combined with frozen embeddings makes inputs_embeds a leaf-with-grad. Disable gradient_checkpointing for RAFT.
See the hard-won facts cheat-sheet for the full set of Voxtral Realtime training pitfalls.
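The audio pre-pad, mel mod-8, and num_delay_tokens fixes from the list above can be sketched with plain Python containers standing in for tensors. The function names are hypothetical, and right-padding with zeros and truncating longer clips are assumptions; the card only states the target length, the mod-8 requirement, and that the key must be dropped.

```python
AUDIO_MAX_SAMPLES = 240_480  # value from the pitfalls list


def pad_audio(samples: list[float], target: int = AUDIO_MAX_SAMPLES) -> list[float]:
    # Assumed behaviour: right-pad with zeros, truncate anything longer.
    if len(samples) >= target:
        return samples[:target]
    return samples + [0.0] * (target - len(samples))


def padded_mel_length(t_mel: int, multiple: int = 8) -> int:
    # Smallest length >= t_mel that satisfies T_mel % 8 == 0.
    return t_mel + (-t_mel) % multiple


def strip_for_sampling(inputs: dict) -> dict:
    # Drop num_delay_tokens so generate() can expand the remaining
    # inputs when num_return_sequences > 1.
    return {k: v for k, v in inputs.items() if k != "num_delay_tokens"}
```

In a real pipeline the same steps would operate on the processor's tensors before the call to generate.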
Limitations
- Default-emit fallback persists. On uncertain audio, the model still emits [calm] [pause] [clears throat] as a default set. RAFT trims this slightly but doesn't eliminate it. This is a data-side limitation: the TTS-synthesized affect signal is too weak to differentiate ambiguous inputs.
- Best with top_k=2 filter. Raw output over-emits ~4-6 tags per utterance. The inference-time top-K filter is the production config.
- TTS dataset. Trained on ElevenLabs-synthesized audio. Real clinical recordings are out of distribution.
- Tag taxonomy fixed. 15 base tags. Out-of-taxonomy concepts won't be tagged.
- English only.
See also
- YongkangZOU/evoxtral-realtime-sft: the SFT-only baseline that this adapter was bootstrapped from.
- Project repository: full pipeline, evaluation harness, Mode B hybrid serve (serve_modal.py), design docs.
- Voxtral-Mini-4B-Realtime-2602: required base model.
- RAFT paper (Dong et al. 2023): the Reward rAnked FineTuning method this adapter uses for Stage 2.
License
Apache-2.0, matching the base Voxtral Realtime license.
Citation
@software{evoxtral_realtime_2026,
title = {Evoxtral-Realtime: RAFT-polished backchannel adapter for Voxtral-Mini-4B-Realtime},
author = {Yongkang Zou},
year = {2026},
url = {https://github.com/Tame-Your-Monkey/evoxtral-realtime}
}
@misc{voxtral_mini_realtime,
author = {Mistral AI},
title = {Voxtral-Mini-4B-Realtime-2602},
year = {2026},
url = {https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-2602}
}
@misc{dong2023raft,
title = {RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment},
author = {Dong, Hanze and Xiong, Wei and Goyal, Deepanshu and Pan, Rui and Diao, Shizhe and Zhang, Jipeng and Shum, Kashun and Zhang, Tong},
year = {2023},
eprint = {2304.06767},
url = {https://arxiv.org/abs/2304.06767}
}