# Phi-4-Multimodal-DPO

Fine-tuned version of microsoft/Phi-4-multimodal-instruct using Direct Preference Optimization (DPO) for improved code-switching speech recognition.
## Evaluation Results

Mixed Error Rate (MER); lower is better.

| Benchmark | Baseline | This Model | Improvement |
|---|---|---|---|
| SEAME-SGE | 0.6997 | 0.6109 | -12.7% |
| SEAME-MAN | 0.5197 | 0.4663 | -10.3% |
| EMILIA | 0.7098 | 0.0738 | -89.6% |
| CS-Dialogue | 0.4961 | 0.1065 | -78.5% |
## Benchmark Descriptions

- **SEAME-SGE**: SEAME dev set (Singapore English focused), 3,222 samples (`AudioLLMs/seame_dev_sge`)
- **SEAME-MAN**: SEAME dev set (Mandarin focused), 2,610 samples (`AudioLLMs/seame_dev_man`)
- **EMILIA**: Synthetic code-switching evaluation set, 1,000 samples
- **CS-Dialogue**: Code-switching dialogue evaluation set, 359 samples
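MER treats each Chinese character and each English word as one token, then computes token-level edit distance against the reference. The sketch below illustrates that idea; it is a minimal reimplementation for intuition, not the exact evaluation script used to produce the numbers above (tokenization details such as punctuation handling may differ).

```python
import re
from typing import List

def tokenize_mixed(text: str) -> List[str]:
    # Each CJK character is its own token; contiguous Latin/digit
    # runs (optionally with apostrophes) form single word tokens.
    return re.findall(r"[\u4e00-\u9fff]|[A-Za-z0-9']+", text.lower())

def mixed_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = tokenize_mixed(reference), tokenize_mixed(hypothesis)
    # Token-level Levenshtein distance (single-row DP).
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            # d[j]: deletion, d[j-1]: insertion, prev: substitution/match
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[len(hyp)] / max(len(ref), 1)
```

With this tokenization, `"还有 caller ID 对。 Then。"` scores an MER of 0.0 against the reference `"还有 caller id 对 then"`, since casing and punctuation are normalized away.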
## Examples

The examples below contrast the baseline and the DPO-trained model:
### Example 1: Hallucination Fixed (Valentine's Day)

| | Transcription |
|---|---|
| Ground Truth | (呃) 我们 二月 多 有 valentine's day |
| Baseline | ah moment ah month ah month ah month ah month... (repeated 250+ times) |
| This Model | So a moment 二月多有 Valentine's Day。 |
| MER | 56.89 → 0.33 |
### Example 2: English Terms Preserved (Christmas Party)

| | Transcription |
|---|---|
| Ground Truth | (嗯) 那个 christmas party christmas party 是 几点 |
| Baseline | 嗯那个克拉斯马斯巴迪克拉斯马斯巴迪是几点 (transliterated to Chinese) |
| This Model | 嗯,那个 Christmas buddy Christmas buddy 是几点? |
| MER | 1.40 → 0.20 |
### Example 3: Code-Switching Preserved (Caller ID)

| | Transcription |
|---|---|
| Ground Truth | 还有 caller id 对 then |
| Baseline | Uh, yeah. (lost Chinese content) |
| This Model | 还有 caller ID 对。 Then。 |
| MER | 1.00 → 0.00 |
### Example 4: Mixed Language Preserved (Heels)

| | Transcription |
|---|---|
| Ground Truth | for 女人 (er) heels 就是 最 损害 你的 脚 损伤 你的 脚 |
| Baseline | for 女人,呃。 he just is the most harm to your feet, harm to your feet. (Chinese translated to English) |
| This Model | For 女人,呃, heel 就是最损害你的脚,损伤你的脚。 |
| MER | 0.83 → 0.11 |
### Example 5: Complex Code-Switching Preserved (Disclaimer)

| | Transcription |
|---|---|
| Ground Truth | [哦] 我要 disclaimer 一下 [诶] 我 没有 弄坏 [乎] [诶] 我 只是 拿 来看 看 而已 我 真的 没有 弄坏 okay 如果 坏 了 不 怪 我 的 事 disclaimer |
| Baseline | 我 ask him a moment. I didn't break it. I just took it to look at it... (Chinese translated to English) |
| This Model | 我 discount 我一下,哎,我没有弄坏哦,我只是拿来看看而已,我真的没有弄坏。 Okay,如果坏了不关我的事。 |
| MER | 0.98 → 0.24 |
## Training Configuration

| Parameter | Value |
|---|---|
| Base Model | microsoft/Phi-4-multimodal-instruct |
| Training Method | DPO (Direct Preference Optimization) |
| Learning Rate | 5e-6 |
| DPO Beta | 0.05 |
| Epochs | 1 |
| Batch Size (per GPU) | 1 |
| Gradient Accumulation | 4 |
| Effective Batch Size | 256 |
| Optimizer | AdamW |
| LR Scheduler | Cosine |
| Warmup Ratio | 0.1 |
| Max Length | 2048 |
| DeepSpeed | ZeRO-2 |
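DPO trains the policy to prefer the chosen (correct) transcription over the rejected (erroneous) one, regularized toward the frozen reference model by β. The following is a schematic of the standard DPO objective with the β = 0.05 from the table above; the tensor names are illustrative, and actual training used a full DPO pipeline rather than this bare loss.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.05) -> torch.Tensor:
    # Log-ratios of policy vs. frozen reference model per sequence.
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    # DPO objective: -log sigmoid(beta * (chosen_ratio - rejected_ratio)).
    logits = beta * (chosen_ratio - rejected_ratio)
    return -F.logsigmoid(logits).mean()
```

When the policy assigns no extra probability to the chosen transcription, the loss sits at log 2 ≈ 0.693; it decreases as the policy's margin for the chosen sequence over the rejected one grows. A small β (0.05 here) keeps the policy close to the base model, which helps preserve its general multimodal abilities.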
## Usage

```python
from transformers import AutoModelForCausalLM, AutoProcessor
import soundfile as sf

# Load the fine-tuned model and its processor
model = AutoModelForCausalLM.from_pretrained(
    "myaccountfor/Phi-4-multimodal-DPO",
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(
    "myaccountfor/Phi-4-multimodal-DPO",
    trust_remote_code=True,
)

# Read audio and build the Phi-4 chat prompt with an audio placeholder
audio, sr = sf.read("your_audio.wav")
prompt = "<|user|><|audio_1|>Please transcribe this speech.<|end|><|assistant|>"
inputs = processor(text=prompt, audios=[(audio, sr)], return_tensors="pt").to(model.device)

# Generate, then decode only the newly generated tokens
outputs = model.generate(**inputs, max_new_tokens=512)
transcription = processor.batch_decode(
    outputs[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(transcription)
```
## Files

```
├── README.md
├── config.json
├── model-*.safetensors            # Model weights (~11.5 GB total)
├── tokenizer files
└── eval_results/
    ├── baseline_seame_sge.json    # Baseline results on SEAME-SGE
    ├── baseline_seame_man.json    # Baseline results on SEAME-MAN
    ├── baseline_emilia.json       # Baseline results on EMILIA
    ├── baseline_cs_dialogue.json  # Baseline results on CS-Dialogue
    ├── trained_seame_sge.json     # This model's results on SEAME-SGE
    ├── trained_seame_man.json     # This model's results on SEAME-MAN
    ├── trained_emilia.json        # This model's results on EMILIA
    └── trained_cs_dialogue.json   # This model's results on CS-Dialogue
```
## License

This model inherits the MIT license from the base microsoft/Phi-4-multimodal-instruct model.