# Phi-4-Multimodal-DPO

Fine-tuned version of microsoft/Phi-4-multimodal-instruct using Direct Preference Optimization (DPO) for improved code-switching speech recognition.
## Evaluation Results

Mixed Error Rate (MER); lower is better.

| Benchmark | Baseline | This Model | Improvement |
|---|---|---|---|
| SEAME-SGE | 0.6997 | 0.6109 | -12.7% |
| SEAME-MAN | 0.5197 | 0.4663 | -10.3% |
| EMILIA | 0.7098 | 0.0738 | -89.6% |
| CS-Dialogue | 0.4961 | 0.1065 | -78.5% |
## Benchmark Descriptions

- **SEAME-SGE**: SEAME dev set (Singapore English focused), 3,222 samples (`AudioLLMs/seame_dev_sge`)
- **SEAME-MAN**: SEAME dev set (Mandarin focused), 2,610 samples (`AudioLLMs/seame_dev_man`)
- **EMILIA**: Synthetic code-switching evaluation set, 1,000 samples
- **CS-Dialogue**: Code-switching dialogue evaluation set, 359 samples
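MER treats each Chinese character and each English word as one token, then computes token-level edit distance against the reference. The sketch below illustrates that idea; it is a minimal reimplementation for intuition, not the exact evaluation script used to produce the numbers above (tokenization details such as punctuation handling may differ).

```python
import re
from typing import List

def tokenize_mixed(text: str) -> List[str]:
    # Each CJK character is its own token; contiguous Latin/digit
    # runs (optionally with apostrophes) form single word tokens.
    return re.findall(r"[\u4e00-\u9fff]|[A-Za-z0-9']+", text.lower())

def mixed_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = tokenize_mixed(reference), tokenize_mixed(hypothesis)
    # Token-level Levenshtein distance (single-row DP).
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            # d[j]: deletion, d[j-1]: insertion, prev: substitution/match
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[len(hyp)] / max(len(ref), 1)
```

With this tokenization, `"还有 caller ID 对。 Then。"` scores an MER of 0.0 against the reference `"还有 caller id 对 then"`, since casing and punctuation are normalized away.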
## Examples

The examples below contrast the baseline and the DPO-trained model:
### Example 1: Hallucination Fixed (Valentine's Day)

| | Transcription |
|---|---|
| Ground Truth | (呃) 我们 二月 多 有 valentine's day |
| Baseline | ah moment ah month ah month ah month ah month... (repeated 250+ times) |
| This Model | So a moment 二月多有 Valentine's Day。 |
| MER | 56.89 → 0.33 |
### Example 2: English Terms Preserved (Christmas Party)

| | Transcription |
|---|---|
| Ground Truth | (嗯) 那个 christmas party christmas party 是 几点 |
| Baseline | 嗯那个克拉斯马斯巴迪克拉斯马斯巴迪是几点 (transliterated to Chinese) |
| This Model | 嗯,那个 Christmas buddy Christmas buddy 是几点? |
| MER | 1.40 → 0.20 |
### Example 3: Code-Switching Preserved (Caller ID)

| | Transcription |
|---|---|
| Ground Truth | 还有 caller id 对 then |
| Baseline | Uh, yeah. (lost Chinese content) |
| This Model | 还有 caller ID 对。 Then。 |
| MER | 1.00 → 0.00 |
### Example 4: Mixed Language Preserved (Heels)

| | Transcription |
|---|---|
| Ground Truth | for 女人 (er) heels 就是 最 损害 你的 脚 损伤 你的 脚 |
| Baseline | for 女人,呃。 he just is the most harm to your feet, harm to your feet. (Chinese translated to English) |
| This Model | For 女人,呃, heel 就是最损害你的脚,损伤你的脚。 |
| MER | 0.83 → 0.11 |
### Example 5: Complex Code-Switching Preserved (Disclaimer)

| | Transcription |
|---|---|
| Ground Truth | [哦] 我要 disclaimer 一下 [诶] 我 没有 弄坏 [乎] [诶] 我 只是 拿 来看 看 而已 我 真的 没有 弄坏 okay 如果 坏 了 不 怪 我 的 事 disclaimer |
| Baseline | 我 ask him a moment. I didn't break it. I just took it to look at it... (Chinese translated to English) |
| This Model | 我 discount 我一下,哎,我没有弄坏哦,我只是拿来看看而已,我真的没有弄坏。 Okay,如果坏了不关我的事。 |
| MER | 0.98 → 0.24 |
## Training Configuration

| Parameter | Value |
|---|---|
| Base Model | microsoft/Phi-4-multimodal-instruct |
| Training Method | DPO (Direct Preference Optimization) |
| Learning Rate | 5e-6 |
| DPO Beta | 0.05 |
| Epochs | 1 |
| Batch Size (per GPU) | 1 |
| Gradient Accumulation | 4 |
| Effective Batch Size | 256 |
| Optimizer | AdamW |
| LR Scheduler | Cosine |
| Warmup Ratio | 0.1 |
| Max Length | 2048 |
| DeepSpeed | ZeRO-2 |
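DPO trains the policy to prefer the chosen (correct) transcription over the rejected (erroneous) one, regularized toward the frozen reference model by β. The following is a schematic of the standard DPO objective with the β = 0.05 from the table above; the tensor names are illustrative, and actual training used a full DPO pipeline rather than this bare loss.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.05) -> torch.Tensor:
    # Log-ratios of policy vs. frozen reference model per sequence.
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    # DPO objective: -log sigmoid(beta * (chosen_ratio - rejected_ratio)).
    logits = beta * (chosen_ratio - rejected_ratio)
    return -F.logsigmoid(logits).mean()
```

When the policy assigns no extra probability to the chosen transcription, the loss sits at log 2 ≈ 0.693; it decreases as the policy's margin for the chosen sequence over the rejected one grows. A small β (0.05 here) keeps the policy close to the base model, which helps preserve its general multimodal abilities.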
## Usage

```python
from transformers import AutoModelForCausalLM, AutoProcessor
import soundfile as sf

# Load the fine-tuned model and its processor
model = AutoModelForCausalLM.from_pretrained(
    "myaccountfor/Phi-4-multimodal-DPO",
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(
    "myaccountfor/Phi-4-multimodal-DPO",
    trust_remote_code=True,
)

# Read audio and build the Phi-4 chat prompt with an audio placeholder
audio, sr = sf.read("your_audio.wav")
prompt = "<|user|><|audio_1|>Please transcribe this speech.<|end|><|assistant|>"
inputs = processor(text=prompt, audios=[(audio, sr)], return_tensors="pt").to(model.device)

# Generate, then decode only the newly generated tokens
outputs = model.generate(**inputs, max_new_tokens=512)
transcription = processor.batch_decode(
    outputs[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(transcription)
```
## Files

```
├── README.md
├── config.json
├── model-*.safetensors            # Model weights (~11.5 GB total)
├── tokenizer files
└── eval_results/
    ├── baseline_seame_sge.json    # Baseline results on SEAME-SGE
    ├── baseline_seame_man.json    # Baseline results on SEAME-MAN
    ├── baseline_emilia.json       # Baseline results on EMILIA
    ├── baseline_cs_dialogue.json  # Baseline results on CS-Dialogue
    ├── trained_seame_sge.json     # This model's results on SEAME-SGE
    ├── trained_seame_man.json     # This model's results on SEAME-MAN
    ├── trained_emilia.json        # This model's results on EMILIA
    └── trained_cs_dialogue.json   # This model's results on CS-Dialogue
```
## License

This model inherits the MIT license from the base microsoft/Phi-4-multimodal-instruct model.