# Qwen2-Audio-7B-DPO-CodeSwitch

A LoRA adapter for Qwen/Qwen2-Audio-7B-Instruct, fine-tuned with Direct Preference Optimization (DPO) on code-switching speech transcription data.

## Evaluation Results

Mixed Error Rate (MER, lower is better):

| Benchmark | Baseline | This Model | Improvement |
| --- | --- | --- | --- |
| SEAME | 0.5677 | 0.5301 | -6.6% |
| EMILIA | 0.4470 | 0.4208 | -5.9% |
| CS-Dialogue | 0.3891 | 0.3140 | -19.3% |

## Benchmark Descriptions

- **SEAME**: English-Mandarin code-switching conversational speech from Singapore and Malaysia (9,764 samples)
- **EMILIA**: synthetic code-switching evaluation set (1,000 samples)
- **CS-Dialogue**: code-switching dialogue evaluation set (359 samples)
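MER scores code-switched hypotheses by treating each Chinese character and each English word as one token, then computing a WER-style edit distance over the mixed token sequence. Below is a minimal illustrative sketch of that idea; the exact tokenization and text normalization used for these benchmarks may differ.

```python
# Illustrative Mixed Error Rate (MER) sketch: CJK characters count as
# single tokens, Latin runs count as word tokens (an assumption about
# the benchmark's normalization, not the official scoring script).
import re

def mixed_tokens(text):
    return re.findall(r"[\u4e00-\u9fff]|[A-Za-z']+", text.lower())

def edit_distance(a, b):
    # Standard Levenshtein distance with a single rolling row.
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (x != y))
    return dp[-1]

def mer(ref, hyp):
    r, h = mixed_tokens(ref), mixed_tokens(hyp)
    return edit_distance(r, h) / max(len(r), 1)
```

For instance, under this tokenization a fully translated hypothesis such as the baseline output in Example 1 below scores 1.00 against the mixed-language reference, while the code-switch-preserving output scores 0.00.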

## Examples

The examples below contrast the baseline with the DPO-trained model:

### Example 1: Code-Switching Preserved (Lifestyle)

| | Transcription |
| --- | --- |
| Ground Truth | 我们 都 应该 pursue a healthy lifestyle |
| Baseline | 我们都应该追求健康的生活方式 *(fully translated to Chinese)* |
| This Model | 我们都应该 pursue a healthy lifestyle |
| MER | 1.00 → 0.00 |

### Example 2: Mixed Language Preserved (Christmas)

| | Transcription |
| --- | --- |
| Ground Truth | every christmas 我 就 应该 是 没有 人 跟我 庆祝 了 [啦] |
| Baseline | every christmas i would - should be no one to tell me *(Chinese translated to English)* |
| This Model | every christmas 我就应该是没有人跟我庆祝了啦 |
| MER | 0.88 → 0.00 |

### Example 3: Technical Terms Preserved

| | Transcription |
| --- | --- |
| Ground Truth | (呃) 每个 lecture different lecturer 那个 notes 不 不怎么 好的 [啦] |
| Baseline | 呃那个老师不同风格的老师 *(lost technical terms)* |
| This Model | 呃 每个 lecture different lecturer 那个 notes 不不怎么好的啦 |
| MER | 0.75 → 0.00 |

### Example 4: Complex Code-Switching Preserved

| | Transcription |
| --- | --- |
| Ground Truth | [哦] 还有 什么 好吃 的 吗 还是 你 只是 去 那些 very expensive places like dempsey to eat |
| Baseline | Oh, what else? Oh, yeah, there's always that expensive place like... to eat *(lost Chinese content)* |
| This Model | 哦 还有什么好吃的吗 还是你只是去那些 very expensive places like dancy to eat |
| MER | 0.83 → 0.04 |

### Example 5: Professional Terms Preserved

| | Transcription |
| --- | --- |
| Ground Truth | [哦] 因为 是个 professional degree |
| Baseline | 哦因为他有个专业的学位 *(translated to Chinese)* |
| This Model | 哦 因为 是个 professional degree |
| MER | 1.00 → 0.00 |

## Training Configuration

### Model Architecture

| Parameter | Value |
| --- | --- |
| Base Model | Qwen/Qwen2-Audio-7B-Instruct |
| Adapter Type | LoRA (Low-Rank Adaptation) |
| LoRA Rank (r) | 256 |
| LoRA Alpha | 128 |
| LoRA Dropout | 0.05 |
| LoRA Target Modules | All attention projections (q_proj, k_proj, v_proj, o_proj) + MLP (up_proj, down_proj, gate_proj) |
| Trainable Parameters | ~1.28B (adapter only) |

### Training Hyperparameters

| Parameter | Value |
| --- | --- |
| Training Method | DPO (Direct Preference Optimization) |
| DPO Beta (β) | 0.3 |
| DPO Loss | Sigmoid |
| Learning Rate | 3e-5 |
| LR Scheduler | Cosine |
| Warmup Ratio | 0.1 |
| Batch Size (per device) | 1 |
| Gradient Accumulation Steps | 4 |
| Global Batch Size | 32 (8 GPUs × 1 × 4) |
| Precision | BF16 |
| Max Sequence Length | 8192 |
| Weight Decay | 0.01 |
| Max Gradient Norm | 1.0 |
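For reference, the sigmoid DPO objective with β = 0.3 can be sketched as below. This is a minimal illustration of the loss form, not the training code; it assumes per-sequence log-probabilities (summed over tokens) of the preferred and rejected transcriptions under both the policy and the frozen reference model.

```python
# Sketch of the sigmoid DPO loss (beta = 0.3 per the table above).
import torch
import torch.nn.functional as F

def dpo_sigmoid_loss(policy_chosen_logps, policy_rejected_logps,
                     ref_chosen_logps, ref_rejected_logps, beta=0.3):
    # Implicit reward of each response: beta * (policy logp - reference logp)
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # -log sigmoid(reward margin), averaged over the batch
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

When the policy matches the reference, the margin is zero and the loss sits at log 2; pushing probability mass from the rejected (e.g. fully translated) transcription toward the chosen (code-switch-preserving) one drives the loss down.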

## Usage

```python
from transformers import Qwen2AudioForConditionalGeneration, AutoProcessor
from peft import PeftModel
import torch
import librosa

# Load base model and processor
base_model = Qwen2AudioForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-Audio-7B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2-Audio-7B-Instruct",
    trust_remote_code=True,
)

# Load LoRA adapter
model = PeftModel.from_pretrained(base_model, "myaccountfor/Qwen2-Audio-7B-DPO-CodeSwitch")
model.eval()

# Inference example
conversation = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "path/to/audio.wav"},
        {"type": "text", "text": "Please transcribe this speech."},
    ]}
]

text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios = [librosa.load("path/to/audio.wav", sr=processor.feature_extractor.sampling_rate)[0]]

inputs = processor(text=text, audios=audios, return_tensors="pt", padding=True)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.no_grad():
    generated_ids = model.generate(**inputs, max_new_tokens=256)

# Strip the prompt tokens so only the generated transcription is decoded
generated_ids = generated_ids[:, inputs["input_ids"].shape[1]:]
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(transcription)
```

## Files

```
├── README.md                      # This file
├── adapter_config.json            # LoRA configuration
├── adapter_model.safetensors      # LoRA adapter weights (~1.28 GB)
├── tokenizer files                # Tokenizer assets
└── eval_results/
    ├── baseline_seame.json        # Baseline model results on SEAME
    ├── baseline_emilia.json       # Baseline model results on EMILIA
    ├── baseline_cs_dialogue.json  # Baseline model results on CS-Dialogue
    ├── trained_seame.json         # This model's results on SEAME
    ├── trained_emilia.json        # This model's results on EMILIA
    └── trained_cs_dialogue.json   # This model's results on CS-Dialogue
```

## License

This adapter inherits the license of the base Qwen2-Audio-7B-Instruct model.
