# Qwen2-Audio-7B-DPO-CodeSwitch

A LoRA adapter for Qwen/Qwen2-Audio-7B-Instruct, fine-tuned with DPO (Direct Preference Optimization) on code-switching speech transcription data.
## Evaluation Results

All scores are MER (Mixed Error Rate); lower is better. The Improvement column is the relative change in MER versus the baseline.

| Benchmark | Baseline | This Model | Improvement |
|---|---|---|---|
| SEAME | 0.5677 | 0.5301 | -6.6% |
| EMILIA | 0.4470 | 0.4208 | -5.9% |
| CS-Dialogue | 0.3891 | 0.3140 | -19.3% |
### Benchmark Descriptions

- **SEAME**: English-Mandarin code-switching conversational speech from Singapore/Malaysia (9,764 samples)
- **EMILIA**: Synthetic code-switching evaluation set (1,000 samples)
- **CS-Dialogue**: Code-switching dialogue evaluation set (359 samples)
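MER extends WER to mixed-language text: Mandarin is scored per character, English per word, and edit errors are counted over the combined token sequence. A minimal sketch (the per-character CJK / per-word English tokenization below is a common convention and may differ in detail from the evaluation scripts used for these numbers):

```python
import re

def tokenize_mixed(text: str) -> list[str]:
    # One token per CJK character, one per Latin word; punctuation and
    # annotation brackets are dropped. This is one common MER convention.
    return re.findall(r"[\u4e00-\u9fff]|[a-zA-Z']+", text.lower())

def mer(reference: str, hypothesis: str) -> float:
    """Mixed Error Rate: token-level edit distance / reference length."""
    ref, hyp = tokenize_mixed(reference), tokenize_mixed(hypothesis)
    # Single-row Levenshtein edit distance over the token sequences.
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[len(hyp)] / max(len(ref), 1)
```

Applied to Example 1 in the Examples section, this sketch reproduces the reported 1.00 → 0.00 scores.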
## Examples

Below are examples showing improvements from the baseline to the DPO-trained model:
### Example 1: Code-Switching Preserved (Lifestyle)

| | Transcription |
|---|---|
| Ground Truth | 我们 都 应该 pursue a healthy lifestyle |
| Baseline | 我们都应该追求健康的生活方式 (fully translated to Chinese) |
| This Model | 我们都应该 pursue a healthy lifestyle |
| MER | 1.00 → 0.00 |
### Example 2: Mixed Language Preserved (Christmas)

| | Transcription |
|---|---|
| Ground Truth | every christmas 我 就 应该 是 没有 人 跟我 庆祝 了 [啦] |
| Baseline | every christmas i would - should be no one to tell me (Chinese translated to English) |
| This Model | every christmas 我就应该是没有人跟我庆祝了啦 |
| MER | 0.88 → 0.00 |
### Example 3: Technical Terms Preserved

| | Transcription |
|---|---|
| Ground Truth | (呃) 每个 lecture different lecturer 那个 notes 不 不怎么 好的 [啦] |
| Baseline | 呃那个老师不同风格的老师 (lost technical terms) |
| This Model | 呃 每个 lecture different lecturer 那个 notes 不不怎么好的啦 |
| MER | 0.75 → 0.00 |
### Example 4: Complex Code-Switching Preserved

| | Transcription |
|---|---|
| Ground Truth | [哦] 还有 什么 好吃 的 吗 还是 你 只是 去 那些 very expensive places like dempsey to eat |
| Baseline | Oh, what else? Oh, yeah, there's always that expensive place like... to eat (lost Chinese content) |
| This Model | 哦 还有什么好吃的吗 还是你只是去那些 very expensive places like dancy to eat |
| MER | 0.83 → 0.04 |
### Example 5: Professional Terms Preserved

| | Transcription |
|---|---|
| Ground Truth | [哦] 因为 是个 professional degree |
| Baseline | 哦因为他有个专业的学位 (translated to Chinese) |
| This Model | 哦 因为 是个 professional degree |
| MER | 1.00 → 0.00 |
## Training Configuration

### Model Architecture

| Parameter | Value |
|---|---|
| Base Model | Qwen/Qwen2-Audio-7B-Instruct |
| Adapter Type | LoRA (Low-Rank Adaptation) |
| LoRA Rank (r) | 256 |
| LoRA Alpha | 128 |
| LoRA Dropout | 0.05 |
| LoRA Target Modules | All attention (q_proj, k_proj, v_proj, o_proj) + MLP (up_proj, down_proj, gate_proj) |
| Trainable Parameters | ~1.28B (adapter only) |
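The table above corresponds to a PEFT `LoraConfig` along these lines. This is a sketch for orientation only; the authoritative values ship in `adapter_config.json`, and fields not listed in the table (e.g. `bias`, `task_type`) are assumptions here:

```python
from peft import LoraConfig

# Mirrors the Model Architecture table; see adapter_config.json for the
# exact configuration used in training.
lora_config = LoraConfig(
    r=256,           # LoRA rank
    lora_alpha=128,  # effective scaling: alpha / r = 0.5
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention projections
        "up_proj", "down_proj", "gate_proj",     # MLP projections
    ],
    bias="none",            # assumed; not stated in the table
    task_type="CAUSAL_LM",  # assumed; not stated in the table
)
```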
### Training Hyperparameters

| Parameter | Value |
|---|---|
| Training Method | DPO (Direct Preference Optimization) |
| DPO Beta (β) | 0.3 |
| DPO Loss | Sigmoid |
| Learning Rate | 3e-5 |
| LR Scheduler | Cosine |
| Warmup Ratio | 0.1 |
| Batch Size (per device) | 1 |
| Gradient Accumulation Steps | 4 |
| Global Batch Size | 32 (8 GPUs × 1 × 4) |
| Precision | BF16 |
| Max Sequence Length | 8192 |
| Weight Decay | 0.01 |
| Max Gradient Norm | 1.0 |
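With the sigmoid loss and β = 0.3, each preference pair contributes the standard DPO objective: -log σ(β·[(log π_θ(y_w|x) − log π_ref(y_w|x)) − (log π_θ(y_l|x) − log π_ref(y_l|x))]), where y_w is the preferred (e.g. code-switch-preserving) transcription and y_l the rejected one. A minimal pure-Python sketch with placeholder log-probabilities, not actual model outputs:

```python
import math

BETA = 0.3  # DPO beta from the table above

def dpo_sigmoid_loss(policy_chosen_logp: float, policy_rejected_logp: float,
                     ref_chosen_logp: float, ref_rejected_logp: float,
                     beta: float = BETA) -> float:
    """Per-pair DPO loss: -log sigmoid(beta * (chosen margin - rejected margin))."""
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    logits = chosen_reward - rejected_reward
    # -log(sigmoid(x)) == log(1 + exp(-x))
    return math.log1p(math.exp(-logits))
```

When the policy favors the chosen transcription more strongly than the reference model does, the loss drops below log 2 ≈ 0.693; β scales how hard the policy is pushed away from the reference.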
## Usage

```python
from transformers import Qwen2AudioForConditionalGeneration, AutoProcessor
from peft import PeftModel
import torch
import librosa

# Load the base model and processor
base_model = Qwen2AudioForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-Audio-7B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2-Audio-7B-Instruct",
    trust_remote_code=True,
)

# Attach the LoRA adapter
model = PeftModel.from_pretrained(base_model, "myaccountfor/Qwen2-Audio-7B-DPO-CodeSwitch")
model.eval()

# Build the chat prompt with an audio turn
conversation = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "path/to/audio.wav"},
        {"type": "text", "text": "Please transcribe this speech."},
    ]}
]
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios = [librosa.load("path/to/audio.wav", sr=processor.feature_extractor.sampling_rate)[0]]
inputs = processor(text=text, audios=audios, return_tensors="pt", padding=True)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

# Generate and decode only the newly generated tokens (strip the prompt)
with torch.no_grad():
    generated_ids = model.generate(**inputs, max_new_tokens=256)
generated_ids = generated_ids[:, inputs["input_ids"].shape[1]:]
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(transcription)
```
## Files

```text
├── README.md                      # This file
├── adapter_config.json            # LoRA configuration
├── adapter_model.safetensors      # LoRA adapter weights (~1.28 GB)
├── tokenizer files                # Tokenizer assets
└── eval_results/
    ├── baseline_seame.json        # Baseline model results on SEAME
    ├── baseline_emilia.json       # Baseline model results on EMILIA
    ├── baseline_cs_dialogue.json  # Baseline model results on CS-Dialogue
    ├── trained_seame.json         # This model's results on SEAME
    ├── trained_emilia.json        # This model's results on EMILIA
    └── trained_cs_dialogue.json   # This model's results on CS-Dialogue
```
## License

This adapter inherits the license of the base Qwen2-Audio-7B-Instruct model.