# Phi-4-Multimodal-DPO

A fine-tuned version of `microsoft/Phi-4-multimodal-instruct`, trained with Direct Preference Optimization (DPO) to improve code-switching speech recognition.

## Evaluation Results

All numbers are MER (Mixed Error Rate; lower is better).

| Benchmark   | Baseline | This Model | Improvement |
|-------------|---------:|-----------:|------------:|
| SEAME-SGE   | 0.6997   | 0.6109     | -12.7%      |
| SEAME-MAN   | 0.5197   | 0.4663     | -10.3%      |
| EMILIA      | 0.7098   | 0.0738     | -89.6%      |
| CS-Dialogue | 0.4961   | 0.1065     | -78.5%      |

### Benchmark Descriptions

- **SEAME-SGE**: SEAME dev set (Singapore English focused), 3,222 samples (`AudioLLMs/seame_dev_sge`)
- **SEAME-MAN**: SEAME dev set (Mandarin focused), 2,610 samples (`AudioLLMs/seame_dev_man`)
- **EMILIA**: synthetic code-switching evaluation set, 1,000 samples
- **CS-Dialogue**: code-switching dialogue evaluation set, 359 samples
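
The exact scorer used to produce the numbers above is not included in this repo. The sketch below is one common formulation of MER for Mandarin-English code-switching: English words and individual Chinese characters each count as one token, with lowercasing and punctuation stripping as the (assumed) text normalization.

```python
# Minimal MER sketch (assumed tokenization and normalization; the scorer
# actually used for the numbers above may differ).
import re

def mixed_tokenize(text: str) -> list[str]:
    # English words/numbers stay whole; each CJK character is its own token.
    return re.findall(r"[a-zA-Z0-9']+|[\u4e00-\u9fff]", text.lower())

def mer(reference: str, hypothesis: str) -> float:
    ref, hyp = mixed_tokenize(reference), mixed_tokenize(hypothesis)
    # Standard Levenshtein DP: substitutions + deletions + insertions.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Example 3 below: the token sequences match exactly, giving MER 0.0,
# consistent with the reported 0.00.
print(mer("还有 caller id 对 then", "还有 caller ID 对。 Then。"))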

## Examples

The examples below contrast the baseline with the DPO-trained model:

### Example 1: Hallucination Fixed (Valentine's Day)

|              | Transcription |
|--------------|---------------|
| Ground Truth | (呃) 我们 二月 多 有 valentine's day |
| Baseline     | ah moment ah month ah month ah month ah month... (repeated 250+ times) |
| This Model   | So a moment 二月多有 Valentine's Day。 |
| MER          | 56.89 → 0.33 |

### Example 2: English Terms Preserved (Christmas Party)

|              | Transcription |
|--------------|---------------|
| Ground Truth | (嗯) 那个 christmas party christmas party 是 几点 |
| Baseline     | 嗯那个克拉斯马斯巴迪克拉斯马斯巴迪是几点 (transliterated to Chinese) |
| This Model   | 嗯,那个 Christmas buddy Christmas buddy 是几点? |
| MER          | 1.40 → 0.20 |

### Example 3: Code-Switching Preserved (Caller ID)

|              | Transcription |
|--------------|---------------|
| Ground Truth | 还有 caller id 对 then |
| Baseline     | Uh, yeah. (lost Chinese content) |
| This Model   | 还有 caller ID 对。 Then。 |
| MER          | 1.00 → 0.00 |

### Example 4: Mixed Language Preserved (Heels)

|              | Transcription |
|--------------|---------------|
| Ground Truth | for 女人 (er) heels 就是 最 损害 你的 脚 损伤 你的 脚 |
| Baseline     | for 女人,呃。 he just is the most harm to your feet, harm to your feet. (Chinese translated to English) |
| This Model   | For 女人,呃, heel 就是最损害你的脚,损伤你的脚。 |
| MER          | 0.83 → 0.11 |

### Example 5: Complex Code-Switching Preserved (Disclaimer)

|              | Transcription |
|--------------|---------------|
| Ground Truth | [哦] 我要 disclaimer 一下 [诶] 我 没有 弄坏 [乎] [诶] 我 只是 拿 来看 看 而已 我 真的 没有 弄坏 okay 如果 坏 了 不 怪 我 的 事 disclaimer |
| Baseline     | 我ask him a moment. I didn't break it. I just took it to look at it... (Chinese translated to English) |
| This Model   | 我 discount 我一下,哎,我没有弄坏哦,我只是拿来看看而已,我真的没有弄坏。 Okay,如果坏了不关我的事。 |
| MER          | 0.98 → 0.24 |

## Training Configuration

| Parameter             | Value |
|-----------------------|-------|
| Base Model            | microsoft/Phi-4-multimodal-instruct |
| Training Method       | DPO (Direct Preference Optimization) |
| Learning Rate         | 5e-6 |
| DPO Beta              | 0.05 |
| Epochs                | 1 |
| Batch Size (per GPU)  | 1 |
| Gradient Accumulation | 4 |
| Effective Batch Size  | 256 |
| Optimizer             | AdamW |
| LR Scheduler          | Cosine |
| Warmup Ratio          | 0.1 |
| Max Length            | 2048 |
| DeepSpeed             | ZeRO-2 |
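
For reference, the table above maps roughly onto `trl`'s `DPOTrainer` as sketched below. This is illustrative only: the actual run trained on audio preference pairs, which needs plumbing beyond this text-only sketch, and the dataset file and DeepSpeed config path are placeholders. Note that an effective batch size of 256 with per-GPU batch 1 and 4 accumulation steps implies roughly 64 data-parallel GPUs (1 × 4 × 64).

```python
# Minimal sketch: the DPO hyperparameters above, expressed with trl.
# Placeholders: preference_pairs.jsonl and ds_zero2.json are not shipped here.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoProcessor
from trl import DPOConfig, DPOTrainer

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-4-multimodal-instruct", trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(
    "microsoft/Phi-4-multimodal-instruct", trust_remote_code=True
)

# Assumed format: one {"prompt", "chosen", "rejected"} record per line.
train_dataset = load_dataset("json", data_files="preference_pairs.jsonl")["train"]

args = DPOConfig(
    output_dir="phi4mm-dpo",
    beta=0.05,                       # DPO Beta
    learning_rate=5e-6,              # Learning Rate
    num_train_epochs=1,              # Epochs
    per_device_train_batch_size=1,   # Batch Size (per GPU)
    gradient_accumulation_steps=4,   # Gradient Accumulation
    lr_scheduler_type="cosine",      # LR Scheduler
    warmup_ratio=0.1,                # Warmup Ratio
    max_length=2048,                 # Max Length
    bf16=True,
    deepspeed="ds_zero2.json",       # placeholder ZeRO-2 config path
)

trainer = DPOTrainer(                # AdamW is the Trainer default optimizer
    model=model,
    args=args,
    train_dataset=train_dataset,
    processing_class=processor.tokenizer,
)
trainer.train()
```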

## Usage

```python
from transformers import AutoModelForCausalLM, AutoProcessor
import soundfile as sf

# Load model
model = AutoModelForCausalLM.from_pretrained(
    "myaccountfor/Phi-4-multimodal-DPO",
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(
    "myaccountfor/Phi-4-multimodal-DPO",
    trust_remote_code=True,
)

# Load audio
audio, sr = sf.read("your_audio.wav")

# Build prompt
prompt = "<|user|><|audio_1|>Please transcribe this speech.<|end|><|assistant|>"

# Process and generate
inputs = processor(text=prompt, audios=[(audio, sr)], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)

# Keep only the newly generated tokens (drop the prompt)
transcription = processor.batch_decode(
    outputs[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(transcription)
```
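
To transcribe several files, the snippet above can be wrapped in a small helper. This is illustrative only; it reuses `model`, `processor`, and `prompt` from the block above, and the file paths are placeholders.

```python
def transcribe(path: str) -> str:
    """Transcribe one audio file with the model and prompt loaded above."""
    audio, sr = sf.read(path)
    inputs = processor(
        text=prompt, audios=[(audio, sr)], return_tensors="pt"
    ).to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=512)
    # Drop the prompt tokens, keep only the generated transcription.
    return processor.batch_decode(
        outputs[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )[0]

for path in ["clip_01.wav", "clip_02.wav"]:  # placeholder paths
    print(path, "->", transcribe(path))
```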

## Files

```
├── README.md
├── config.json
├── model-*.safetensors       # Model weights (~11.5 GB total)
├── tokenizer files
└── eval_results/
    ├── baseline_seame_sge.json    # Baseline results on SEAME-SGE
    ├── baseline_seame_man.json    # Baseline results on SEAME-MAN
    ├── baseline_emilia.json       # Baseline results on EMILIA
    ├── baseline_cs_dialogue.json  # Baseline results on CS-Dialogue
    ├── trained_seame_sge.json     # This model's results on SEAME-SGE
    ├── trained_seame_man.json     # This model's results on SEAME-MAN
    ├── trained_emilia.json        # This model's results on EMILIA
    └── trained_cs_dialogue.json   # This model's results on CS-Dialogue
```
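
The schema of the `eval_results/` JSON files is not documented in this card. Assuming each file holds a list of per-sample records with a `mer` field, a comparison like the table above could be reproduced with:

```python
# Sketch only: adjust to the actual JSON schema in eval_results/.
import json

def mean_mer(path: str) -> float:
    with open(path) as f:
        records = json.load(f)  # assumed: list of {"mer": ...} records
    return sum(r["mer"] for r in records) / len(records)

for bench in ["seame_sge", "seame_man", "emilia", "cs_dialogue"]:
    base = mean_mer(f"eval_results/baseline_{bench}.json")
    ours = mean_mer(f"eval_results/trained_{bench}.json")
    print(f"{bench}: {base:.4f} -> {ours:.4f} ({(ours - base) / base:+.1%})")
```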

## License

This model inherits the MIT license from the base model, `microsoft/Phi-4-multimodal-instruct`.
